Algorithm
1) Feature extraction and finding similar image pairs
This step computes global descriptors (feature representations such as image embeddings) for the images using a deep learning model and then
generates pairs of images based on the similarity of these descriptors. We use a pre-trained deep learning model (tf_efficientnet_b7) to extract
features from the input images. The descriptors are L2-normalized so that each descriptor vector has unit norm; distances between them then depend
only on the direction of the vectors, not their magnitude, and Euclidean distance ranks pairs in the same order as cosine similarity. Pairwise
distances are computed on these descriptors, and the image pairs with the smallest distances are selected as the most similar pairs.
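The pair selection can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline's actual code: the function names, the `max_distance` threshold, and the `min_neighbors` fallback are hypothetical, and the descriptors are assumed to be already extracted into an (N, D) array.

```python
import numpy as np

def l2_normalize(descriptors, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    return descriptors / np.maximum(norms, eps)

def similar_pairs(descriptors, max_distance=0.5, min_neighbors=1):
    """Return sorted (i, j) index pairs, i < j, of similar images.

    max_distance and min_neighbors are illustrative parameters: every image
    is paired with all images closer than max_distance, or with its
    min_neighbors nearest images if none are that close.
    """
    desc = l2_normalize(np.asarray(descriptors, dtype=np.float64))
    # For unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
    dist = np.sqrt(np.clip(2.0 - 2.0 * desc @ desc.T, 0.0, None))
    np.fill_diagonal(dist, np.inf)  # never pair an image with itself

    pairs = set()
    for i in range(len(desc)):
        order = [int(j) for j in np.argsort(dist[i]) if j != i]
        keep = [j for j in order if dist[i, j] <= max_distance]
        if len(keep) < min_neighbors:
            keep = order[:min_neighbors]
        for j in keep:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)
```

The fallback to the nearest neighbours is one possible design choice: it keeps every image connected to at least one other image even when no descriptor is within the threshold.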
2) Keypoint detection
Keypoints are spatial locations in an image that mark distinctive regions which can be robustly matched across different images under
transformations such as changes in viewpoint, scale, rotation, and illumination. We identify these informative points because they serve as
landmarks or reference points for establishing correspondences between images. We detect keypoints and compute their descriptors with ALIKED, and
match them between pairs of images using the LightGlue algorithm.
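LightGlue is a learned matcher, so its internals are not reproduced here. To illustrate what "matching descriptors between two images" produces, the sketch below uses a much simpler stand-in technique, mutual nearest-neighbour matching. The function name is hypothetical and the descriptors are assumed to be plain (N, D) arrays.

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Match two descriptor sets by mutual nearest neighbour.

    Returns an (M, 2) integer array; each row [i, j] means keypoint i in
    image A matches keypoint j in image B. This is a simple stand-in for a
    learned matcher, not the LightGlue algorithm.
    """
    a = np.asarray(desc_a, dtype=np.float64)
    b = np.asarray(desc_b, dtype=np.float64)
    # Squared Euclidean distance between every descriptor pair, shape (Na, Nb)
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nn_ab = d.argmin(axis=1)  # best B index for each A descriptor
    nn_ba = d.argmin(axis=0)  # best A index for each B descriptor
    idx_a = np.arange(len(a))
    mutual = nn_ba[nn_ab] == idx_a  # keep only pairs that agree both ways
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)
```

The output format, an array of index pairs into the two keypoint lists, is the form assumed by the later sketches in this section.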
3) Keypoint merger
When multiple keypoint detection and matching algorithms are used, each algorithm may detect a different set of keypoints due to variations in
detection method, sensitivity to different image features, and robustness to different image conditions. The keypoints and matches from all
algorithms therefore need to be combined into a unified set. Merging them gives a more complete representation of the scene or object being
analyzed. Since our algorithm uses both DoGHardNet and ALIKED, we include a keypoint merger.
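A minimal sketch of such a merger for one image pair is shown below. It assumes, hypothetically, that each detector returns its keypoints for both images plus an (M, 2) array of match indices into those keypoints; the function name and data layout are illustrative and not taken from the actual pipeline.

```python
import numpy as np

def merge_pair(results):
    """Merge per-detector keypoints and matches for one image pair.

    results: list of (kpts1, kpts2, matches) tuples, one per detector, where
    kpts1 is (N1, 2), kpts2 is (N2, 2) and matches is (M, 2) with indices
    into kpts1 / kpts2 of that same detector.
    """
    all_k1, all_k2, all_m = [], [], []
    off1 = off2 = 0
    for k1, k2, m in results:
        k1 = np.asarray(k1, dtype=np.float64).reshape(-1, 2)
        k2 = np.asarray(k2, dtype=np.float64).reshape(-1, 2)
        m = np.asarray(m, dtype=np.int64).reshape(-1, 2)
        all_k1.append(k1)
        all_k2.append(k2)
        # Shift indices so they point into the concatenated keypoint arrays
        all_m.append(m + np.array([off1, off2]))
        off1 += len(k1)
        off2 += len(k2)
    return np.concatenate(all_k1), np.concatenate(all_k2), np.concatenate(all_m)
```

The index offset is the important detail: after concatenation, the matches of the second detector must refer to positions after the first detector's keypoints.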
4) Calculate the fundamental matrix
Matches between keypoints can contain outliers due to factors like noise, occlusion, or mismatches. We employ RANSAC to remove these outliers
while estimating the fundamental matrix, which relates corresponding points in the two images of a pair through the epipolar constraint. RANSAC
randomly selects a minimal subset of matches and estimates a candidate matrix from it. All matches are then evaluated against the candidate and
classified as inliers or outliers. After several iterations, RANSAC selects the candidate with the highest number of inliers. This matrix is then
refined using all inlier matches to obtain a more accurate estimate. The refined matrix is the final fundamental matrix, and only its inlier
matches are kept.
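The loop described above can be sketched with NumPy alone. This is an illustrative implementation under stated assumptions, not the code used in the pipeline: it samples 8 matches per iteration and fits them with the normalized 8-point algorithm (a true minimal solver needs only 7), scores matches with the Sampson distance, and uses hypothetical default values for `threshold` and `iterations`.

```python
import numpy as np

def _normalize(pts):
    """Hartley normalization: zero mean, mean distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / (np.linalg.norm(pts - mean, axis=1).mean() + 1e-12)
    T = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return homog @ T.T, T

def eight_point(p1, p2):
    """Estimate F such that x2^T F x1 = 0 from 8 or more matches."""
    x1, T1 = _normalize(p1)
    x2, T2 = _normalize(p2)
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1)),
    ])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)  # null vector of A
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt    # enforce rank 2
    F = T2.T @ F @ T1                          # undo the normalization
    return F / np.linalg.norm(F)

def sampson_distance(F, p1, p2):
    """First-order approximation of the squared epipolar error (pixels^2)."""
    x1 = np.column_stack([p1, np.ones(len(p1))])
    x2 = np.column_stack([p2, np.ones(len(p2))])
    Fx1 = x1 @ F.T   # rows are F @ x1
    Ftx2 = x2 @ F    # rows are F.T @ x2
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / (den + 1e-12)

def ransac_fundamental(p1, p2, threshold=1.0, iterations=500, seed=0):
    """Return (F, inlier_mask); threshold is in pixels."""
    p1 = np.asarray(p1, dtype=np.float64)
    p2 = np.asarray(p2, dtype=np.float64)
    rng = np.random.default_rng(seed)
    best = np.zeros(len(p1), dtype=bool)
    for _ in range(iterations):
        sample = rng.choice(len(p1), size=8, replace=False)
        F = eight_point(p1[sample], p2[sample])
        inliers = sampson_distance(F, p1, p2) < threshold ** 2
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 8:
        raise ValueError("not enough inliers to estimate F")
    # Final refinement: refit on all inliers of the best candidate
    return eight_point(p1[best], p2[best]), best
```

Given matched keypoint coordinates `p1` and `p2` of shape (N, 2), `F, mask = ransac_fundamental(p1, p2)` yields the estimate and a boolean mask; `p1[mask]` and `p2[mask]` are the matches kept for reconstruction.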
5) Bundle adjustment
After removing the outlier matches, we use pycolmap, which offers an incremental reconstruction algorithm that starts from an initial pair of
images and progressively adds more images to the scene, resulting in a reconstructed scene with camera poses (rotation matrix and translation
vector).
We further perform bundle adjustment, which jointly optimizes the camera parameters and 3D point positions against the observed image data to
refine the reconstructed 3D model. Its primary goal is to minimize the reprojection error, which is the distance between the observed 2D
keypoints in the images and the projections of their corresponding estimated 3D points.
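The quantity that bundle adjustment minimizes can be made concrete with a short sketch. It assumes a simple pinhole camera without lens distortion, with intrinsics `K`, rotation `R`, and translation `t` mapping world coordinates to camera coordinates. The function names are hypothetical; pycolmap's own camera models and optimizer are not reproduced here.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project (N, 3) world points to (N, 2) pixel coordinates."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    pix = cam @ K.T                    # apply the intrinsics
    return pix[:, :2] / pix[:, 2:3]    # perspective division

def reprojection_errors(points_3d, observed_2d, K, R, t):
    """Per-point distance in pixels between observed and projected keypoints."""
    return np.linalg.norm(project(points_3d, K, R, t) - observed_2d, axis=1)
```

Bundle adjustment typically adjusts `R`, `t`, the intrinsics, and `points_3d` over all cameras at once so that the sum of these squared errors becomes as small as possible.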