This is my webpage submission for 16822 assignment 1.
Q1: 8-point and 7-point algorithm
Q1a: F matrix using 8-point algorithm
I solved the fundamental matrix equation $x'^\top F x = 0$ with 8 point correspondences between images using the following steps:
Normalize the image pixels in each image so that they are mean-centered with root-mean-squared distance to the origin equal to $\sqrt{2}$. This normalizing transformation uses a scale factor $s$ and mean $x$ and $y$ values $c_x$ and $c_y$. It has the form
$$T = \begin{pmatrix} s & 0 & -s c_x \\ 0 & s & -s c_y \\ 0 & 0 & 1 \end{pmatrix}$$
Use constraints reminiscent of DLT from prior assignments. That is, each row of the $A$ matrix has the form
$$\begin{pmatrix} x'_1 x_1 & x'_1 y_1 & x'_1 & y'_1 x_1 & y'_1 y_1 & y'_1 & x_1 & y_1 & 1 \end{pmatrix}$$
With 8 such correspondences, I obtain an $8 \times 9$ matrix $A$, from which I can solve for the $9 \times 1$ vector $f$ via SVD: $Af = 0$.
Reforming the resulting 9 values into a $3 \times 3$ matrix, I enforce the additional constraint $\det(F) = 0$ by setting the smallest singular value (the last element of the diagonal matrix) to 0. I reconstruct $F$ using this new diagonal matrix and the same $U$ and $V^\top$ matrices that the SVD yielded.
I denormalize the $F$ matrix according to $F = T'^\top \hat{F} T$, where $\hat{F}$ is the estimate from normalized coordinates and $T$, $T'$ are the normalizing transformations for the first and second image, and divide the resulting matrix by its last element.
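Below is a minimal NumPy sketch of these steps, assuming correspondences arrive as N×2 pixel arrays; the function names are my own, and this is an illustration of the procedure rather than my exact submission code.

```python
import numpy as np

def normalize_points(pts):
    """Mean-center points and scale so RMS distance to the origin is sqrt(2)."""
    cx, cy = pts.mean(axis=0)
    rms = np.sqrt(np.mean(np.sum((pts - [cx, cy]) ** 2, axis=1)))
    s = np.sqrt(2) / rms
    T = np.array([[s, 0, -s * cx],
                  [0, s, -s * cy],
                  [0, 0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(pts1, pts2):
    """Estimate F from 8 (or more) correspondences given as Nx2 arrays (x <-> x')."""
    x, T = normalize_points(pts1)
    xp, Tp = normalize_points(pts2)
    # One row of A per correspondence, from the constraint x'^T F x = 0.
    A = np.column_stack([
        xp[:, 0] * x[:, 0], xp[:, 0] * x[:, 1], xp[:, 0],
        xp[:, 1] * x[:, 0], xp[:, 1] * x[:, 1], xp[:, 1],
        x[:, 0], x[:, 1], np.ones(len(x))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)        # right singular vector of smallest singular value
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0                        # enforce det(F) = 0 (rank 2)
    F = U @ np.diag(S) @ Vt
    F = Tp.T @ F @ T                # denormalize: F = T'^T F_hat T
    return F / F[2, 2]              # scale so the last element is 1
```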
Estimated E for remote:
$$\begin{pmatrix} 2.5298064 & -6.62032749 & -14.50071092 \\ -0.67056178 & -3.02768466 & 26.41824781 \\ 7.33094837 & -28.29861665 & 1 \end{pmatrix}$$
Q1b: 7-point algorithm
I solved the fundamental matrix equation $x'^\top F x = 0$ with 7 point correspondences between images using the following steps:
Normalize each point using a transformation identical to that described in part a above.
Create an A matrix using the same constraints as described in part a above, except using 7 instead of 8 point correspondences
Solving with SVD now provides two matrices, $F_1$ and $F_2$, due to the larger null space that results from having 7 instead of 8 constraints.
Represent $F = \alpha F_1 + (1-\alpha) F_2$ and solve the cubic polynomial in $\alpha$ that arises from setting its determinant equal to 0. Specifically, I choose the arbitrary $\alpha$ values 0, 0.31, 0.68, and 1; the powers of these values form the $4 \times 4$ Vandermonde matrix $A$ of a small linear system, and the vector $b$ I solve against holds the determinant of $F$ at each $\alpha$ value. Solving $Ac = b$ yields the coefficients $c$ of the cubic.
Finding the roots of this cubic gives candidate $\alpha$ values from which I recover $F$. I discard any root with a nonzero imaginary component, which leaves either 1 or 3 possible $F$ matrices.
To decide on the correct $F$ matrix, I inspect the epipolar lines that each produces. In the case of the ball object, there is only one $F$ matrix. In the case of the hydrant object, there are three $F$ matrices, and I conclude through observation that the third one is correct.
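A minimal sketch of this root-finding step, assuming $F_1$ and $F_2$ were already recovered from the two null-space singular vectors (the helper name `seven_point_candidates` is mine):

```python
import numpy as np

def seven_point_candidates(F1, F2, alphas=(0.0, 0.31, 0.68, 1.0)):
    """Recover candidate F matrices from the two null-space matrices F1 and F2."""
    # det(a*F1 + (1-a)*F2) is a cubic in a. Evaluate it at four alpha values
    # and solve the 4x4 Vandermonde system A c = b for the cubic coefficients c.
    A = np.vander(np.asarray(alphas), 4)           # rows: [a^3, a^2, a, 1]
    b = np.array([np.linalg.det(a * F1 + (1 - a) * F2) for a in alphas])
    coeffs = np.linalg.solve(A, b)
    candidates = []
    for root in np.roots(coeffs):
        if abs(root.imag) > 1e-8:                  # discard complex roots
            continue
        F = root.real * F1 + (1 - root.real) * F2
        candidates.append(F / F[2, 2])
    return candidates                              # 1 or 3 matrices
```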
Q2: RANSAC
To implement RANSAC with the 8pt algorithm, I took the following steps:
Sample eight of the many provided point correspondences
Compute the F matrix using these sampled points
For each correspondence, use this calculated F to determine the Sampson error
If the error is less than a predetermined threshold (I used 0.5), then classify those points as inliers
Track the best F as the one that produces the most inliers
Repeat a predetermined number of times (I ran 10,000 iterations)
Implementing RANSAC with the 7pt algorithm was largely the same as with the 8pt algorithm. A key difference was that I computed the error for every candidate F matrix when the 7pt algorithm produced more than one. Also, each sample contained only 7 instead of 8 point correspondences.
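The loop can be sketched as follows, assuming Nx2 point arrays and a `solver` callback that returns a list of candidate F matrices from a k-point sample (e.g. `lambda p1, p2: [eight_point(p1, p2)]` with k=8, or the 7pt pipeline with k=7); the Sampson error here is the standard first-order approximation to geometric error:

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Per-correspondence Sampson error; x1 and x2 are Nx3 homogeneous points."""
    Fx1 = (F @ x1.T).T                   # epipolar lines in image 2
    Ftx2 = (F.T @ x2.T).T                # epipolar lines in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den

def ransac_f(pts1, pts2, solver, k=8, n_iters=10000, thresh=0.5):
    """RANSAC over F candidates; pts1 and pts2 are Nx2 pixel arrays."""
    x1 = np.column_stack([pts1, np.ones(len(pts1))])
    x2 = np.column_stack([pts2, np.ones(len(pts2))])
    best_F, best_count = None, -1
    rng = np.random.default_rng()
    for _ in range(n_iters):
        idx = rng.choice(len(pts1), size=k, replace=False)
        for F in solver(pts1[idx], pts2[idx]):   # the 7pt solver may yield several
            count = int(np.sum(sampson_error(F, x1, x2) < thresh))
            if count > best_count:
                best_F, best_count = F, count
    return best_F
```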
Note that for the following results, many image pairs produced more than one F matrix using the 7pt algorithm. For the sake of readability, I show only the best result in each case, hand-picked based on the appearance of its epipolar lines.
8pt Bench results:
Best F:
$$\begin{pmatrix} -1.43391108 \times 10^{-7} & -9.27002442 \times 10^{-8} & -1.14221489 \times 10^{-3} \\ -2.90583671 \times 10^{-6} & 2.54945574 \times 10^{-7} & -2.32675030 \times 10^{-2} \\ 3.02620285 \times 10^{-4} & 2.51935268 \times 10^{-2} & 1.00000000 \end{pmatrix}$$
Q3: Triangulation
Triangulating allows us to reconstruct a 3D scene from 2D point correspondences. To accomplish this, I took the following steps:
For each point correspondence, construct an $A$ matrix of the form
$$A = \begin{pmatrix} x\, p_3^\top - p_1^\top \\ y\, p_3^\top - p_2^\top \\ x'\, p_3'^\top - p_1'^\top \\ y'\, p_3'^\top - p_2'^\top \end{pmatrix}$$
where $p_i^\top$ and $p_i'^\top$ are the rows of the two camera projection matrices $P$ and $P'$.
Solving for the world coordinate $X$ via SVD and dividing by the fourth element produces normalized homogeneous world coordinates ready for plotting. For this 3D point's color, I average the colors at the corresponding pixels in the two images.
Repeating this for all point correspondences gives an array of world coordinates that can be plotted to illustrate a 3D point cloud.
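A minimal sketch of this triangulation loop (color averaging omitted), assuming 3×4 projection matrices whose rows are the $p_i^\top$ above:

```python
import numpy as np

def triangulate(P1, P2, pts1, pts2):
    """Linear triangulation of Nx2 correspondences given 3x4 projections P1, P2."""
    points = []
    for (x, y), (xp, yp) in zip(pts1, pts2):
        A = np.stack([x * P1[2] - P1[0],     # x  * p3^T  - p1^T
                      y * P1[2] - P1[1],     # y  * p3^T  - p2^T
                      xp * P2[2] - P2[0],    # x' * p'3^T - p'1^T
                      yp * P2[2] - P2[1]])   # y' * p'3^T - p'2^T
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        points.append(X[:3] / X[3])          # divide by the fourth element
    return np.array(points)
```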
Q4: Reconstruct your own scene
Multi-view sloth images:
[Multi-view sloth images]
Multi-view mug images:
[Multi-view mug images]
Results using COLMAP GUI:
[Sloth GIF using GUI] [Mug GIF using GUI]
Results using the pycolmap package. The green spheres represent the cameras used to capture each image.
[Sloth GIF using pycolmap] [Mug GIF using pycolmap]
Q5: Fundamental matrix estimation on your own images
Q6: Stress test the hyperparameters of COLMAP
Chosen questions for stress tests:
What happens if we reduce the number of input images?
What happens if we play with some tolerance parameters? I chose the number of SIFT features and the peak threshold.
Cutting the number of images in half (choosing every other image) results in a far poorer 3D reconstruction. It stands to reason from this result that more images of the scene lead to a more faithful reconstruction.
[Sloth GIF using half of images] [Mug GIF using half of images]
The number of SIFT features defaults to a maximum of 8192. Limiting this to 1000 leads to a sparser reconstruction, because there are fewer features to match across the provided images. Fewer matches mean fewer correspondences and therefore fewer 3D points triangulated. By similar logic, doubling the number of SIFT features to 16384 yields a denser reconstruction.
The peak_threshold parameter in COLMAP defaults to 0.006667 and is the cutoff for a pixel to be marked as a keypoint when using SIFT and the difference-of-Gaussians metric to compare pixel neighborhoods at varying scales. A lower threshold means that regions with lower contrast will be identified as keypoints, thereby typically increasing the number of keypoints. Conversely, a higher threshold means that the region around a given pixel must have higher contrast to be considered a keypoint, thereby typically decreasing the number of keypoints. I observe this impact in my reconstructions, where the effect of halving this threshold is a denser reconstruction with more background points included. Conversely, doubling this threshold yields a somewhat sparser reconstruction.
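For reference, here is roughly how these stress tests can be scripted with pycolmap; the paths are placeholders, and the option and function names (`SiftExtractionOptions`, `extract_features`, `match_exhaustive`, `incremental_mapping`) reflect recent pycolmap versions and may differ in yours:

```python
import pycolmap

database_path = "database.db"   # placeholder paths; adjust to your layout
image_dir = "images/"
output_dir = "sparse/"

# Stress-test SIFT extraction: cap the feature count and halve the DoG peak threshold.
sift_options = pycolmap.SiftExtractionOptions()
sift_options.max_num_features = 1000        # default is 8192
sift_options.peak_threshold = 0.006667 / 2  # default is ~0.006667

pycolmap.extract_features(database_path, image_dir, sift_options=sift_options)
pycolmap.match_exhaustive(database_path)
reconstructions = pycolmap.incremental_mapping(database_path, image_dir, output_dir)
```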