Image Matching: Structure from Motion

We devise/improve upon an algorithm that estimates a three-dimensional structure from two-dimensional image sequences. Multiple photographs of an object taken from different angles and at varying distances are analyzed and matched to estimate the 3D structure of the object. Matching many images across different viewpoints to reconstruct a 3D model of an environment from a collection of images is a computer vision problem known as Structure from Motion (SfM).

Abstract

The typical steps that are involved in estimating structure from motion are: i) extraction of image embeddings, ii) key point detection and matching, iii) rotation correction (if needed), iv) triangulation and v) bundle adjustment. We explore model ensembling techniques for image embedding extraction and for keypoint detection and matching in order to utilize the capabilities of the models to capture the various details in an image while improving the overall accuracy of reconstruction. We experiment with different models and hyperparameters and orientation correction to obtain the best combination for SfM. Since this algorithm was devised as part of the Kaggle Image Matching Challenge 2024, the code executes within 9 hours on a GPU and was optimized to obtain the highest accuracy for the task among the many variations tested.

MY ALT TEXT

Dataset

The training data contains different classes with their respective images and the 3-D reconstructions for the images, which can be opened with colmap. The training labels are: i) dataset, ii) scene, iii) image_path, iv) rotation_matrix and v) the translation_vector. The test data contains only the images for each class. The inferencing of the model is done on the test dataset.
The images displayed in the above section are examples taken from one of the classes in the test data. The multiple images taken from various viewpoints are used to reconstruct the 3D model of the actual object and to accurately pose (obtain the rotation matrix and translation vector of the camera's position) the images. There are two resolutions of images present in the dataset: 1024 x 768 pixels (landscape) and 768 x 1024 pixels (portrait).

Algorithm

1) Feature Extraction and finding similar image pairs

The purpose of the task is to compute global descriptors (feature representations such as image embeddings) for images using a deep learning model, then generate pairs of images based on the similarity of these descriptors. We use a pre-trained deep learning model (tf_efficientnet_b7) to extract features from the input images. The descriptors are normalized using L2 normalization to ensure that each descriptor vector has a unit norm, which can improve the robustness and effectiveness of subsequent similarity calculations. The pairwise distances are computed on these descriptors to extract pairs of images that are most similar to each other.

2) Keypoint detection

Keypoints are spatial locations in an image that represent regions of an image that are unique and can be robustly matched across different images under various transformations such as changes in viewpoint, scale, rotation, and illumination. We want to identify these informative points within an image as they serve as landmarks or reference points to establish correspondences between different images for image matching. We match keypoints and descriptors between pairs of images using the LightGlue algorithm (ALIKED).

3) Keypoint merger

Combining keypoints and matches from multiple images or sources into a unified dataset is necessary when using multiple keypoint detection and matching algorithms as each algorithm may detect different sets of keypoints due to variations in their detection methods, sensitivity to different image features, and robustness to different image conditions. By merging keypoints and matches from multiple algorithms or datasets, we obtain a more complete representation of the scene or object being analyzed. Since our algorithm uses DoGHardNet and ALIKED, we include a keypoint merger.

4) Calculate the fundamental matrix

Matches between keypoints can contain outliers due to factors like noise, occlusion, or mismatches. We employ RANSAC to remove the outliers between the keypoint matches. RANSAC randomly selects a minimal subset of matches to estimate the initial matrix parameters. All the matches are then evaluated against the estimated matrix and are classified as inliers and outliers. After several iterations, RANSAC selects the matrix with the highest number of inliers. This matrix is then refined using all inlier matches to obtain a more accurate estimate of the transformation parameters. The final matrix is known as the fundamental matrix.

5) Bundle adjustment

Upon obtaining the keypoint matches without outliers, we use pycolmap, which offers an incremental reconstruction algorithm that starts from two pairs of images and continually adds more and more images to the scene, resulting in a reconstructed scene with camera information (translation vector and rotation matrix). We further perform bundle adjustment to optimize camera parameters and 3D point positions based on observed image data to refine the 3D reconstructed models. Its primary goal is to minimize the reprojection error, which is the difference between the observed 2D keypoints in images and their corresponding projections from the estimated 3D structure.

Experiments

We performed multiple experiments on the algorithm such as model ensembling, orientation detection and hyperparameter tuning. We detail the change in performance of the overall algorithm on the held out testset for every experiment in the table below with the corresponding changes to the model/algorithm.
Method
Score obtained
Baseline (DinoV2. Image size 1024. ALIKED. 4196 features) 0.119
Ensemble of Dinov2, Swinv2 tiny, ViT. Image size 1024 → 1280. ALIKED. 4196 features 0.129
Ensemble of Dinov2, Swinv2 tiny, ViT. Image size 1280. ALIKED. Features 4196 → 8192 0.131
tf_efficientnet_b7. Image size 1280. ALIKED. 8192 features 0.139
tf_efficientnet_b7. Image size 1280 → 1600. ALIKED + DoGHardNet ensemble. 8192 features 0.147
tf_efficientnet_b7. Image size 1600 → 1920. ALIKED + DoGHardNet ensemble. 8192 features. min_matches 15 → 10 0.161

Baseline: The baseline algorithm uses a single model to extract image embeddings (DinoV2) and a single model for feature extraction (LightGlue ALIKED).

An increase in the image size: Although a higher image resolution improves the score of the algorithm, it affects the runtime and the complexity of our model (more multiplications to run the same model for a higher image resolution).

More features considered in keypoint matching: Increasing the number of features that are to be detected in the images naturally increases the runtime of the algorithm. However, it does make the algorithm robust since we are increasing the number of unique points that are being considered in each image, which makes it easier for the network to identify the said object/location in the image.

Ensembling: We experiment with/utilize model ensembling to extract image embeddings and for keypoint detection. While one model might not be able to effectively capture the nuances in the images, using multiple models ensures that we leverage the independent capabilities of the models and combine them to capture the details from the images, necessary for the task.

Conclusion

To summarize, we work on improving a solution to the structure-from-motion problem to reconstruct 3D models from multiple 2D images that are taken from various viewpoints and in different settings.