
Visual Odometry and Visual Algorithms [Part 9] - Structure from Motion

Chapter 9 - Structure from Motion

Lecture 8

Slides 1 - 43

Structure from Motion (SFM) refers to the setting where we want to recover the 3D positions of points, but the input data is not a sequence of images (frames) from a video; instead we have temporally independent images taken from different viewing angles of the object. In SFM it is even possible that the images were taken by different cameras.

The overall problem formulation is the following: given a certain number $n$ of point correspondences between two images, we want to estimate the 3D positions of the points, the camera pose and motion, as well as the camera intrinsics.

If we work with calibrated cameras, we obviously do not have to find the intrinsics $K$, as they are already known.

Scale Ambiguity

The first challenge we encounter in SFM is that, since we have no information about the real size and dimensions of the scene, we cannot recover the correct scale. Rescaling the entire scene by a constant factor does not change the images and therefore stays undetected.

Scale Ambiguity Figure 1: Scale Ambiguity. source

As a result, in monocular vision it is impossible to recover the absolute scale. In stereo vision, however, it is possible under certain circumstances. This is because in stereo vision we have only 5 degrees of freedom: 3 parameters for the rotation and 2 for the translation. For the translation we only know the direction but not the distance, due to the scale ambiguity. To show that it is possible to solve the problem, we now count how many knowns and unknowns we have. We have 4 knowns for each correspondence, because we have $u$ and $v$ for both images, so $4n$ knowns in total. As explained before, we have 5 unknowns for the motion (3 for rotation, 2 for translation). Additionally, we have 3 unknowns for each correspondence, due to the unknown 3D position of its point. To be able to solve such a system, we must have at least as many knowns as unknowns. Therefore the following relation must hold: $4n \geq 5+3n$. This holds for $n \geq 5$.

Why 2 DOF for translation Figure 1b: Why 2 DOF for translation. source

Epipolar Geometry

In this section we want to show that, and how, it is possible to estimate the relative motion independently of the 3D structure.

In Chapter 8 we introduced the notion of epipolar lines and planes. As a reminder, the epipolar plane is the plane spanned by the two camera centers and the 3D point. The line in which this plane intersects an image plane is the epipolar line. The only positions on the image plane where the image of the 3D point can lie are along that line.

From the definition of the epipolar plane, we know that the vectors from the camera centers to the images of a point on the image planes are coplanar with the vector connecting both camera centers. This leads to the conclusion that the dot product of a camera-image vector and the normal vector of the epipolar plane is 0. Since we can express the normal vector as the cross product of the line connecting both camera centers ($T$) and the image vector of the opposing camera, we can derive the following formula:

\[\begin{align*} p_1 \cdot n = 0 \implies p_1 \cdot (T \times p_2') = 0 \implies p_1 \cdot (T \times Rp_2) = 0 \implies {p_1}^T {[T]}_\times R\, p_2 = 0 \end{align*}\]

where $p_2'$ is the vector $p_2$ expressed in the coordinates of camera 1 ($C_1$), and vice versa. We can express ${[T]}_\times R$ as $E$, which we call the essential matrix:

\[\begin{align*} \implies {p_1}^T E\, p_2 = 0 \end{align*}\]

This equation is called the epipolar constraint or Longuet-Higgins equation.

Decomposing $E$ into ${[T]}_\times$ and $R$ yields four distinct solutions.

Epipolar Constraint Figure 2: Epipolar Constraint. source
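To make the constraint concrete, here is a minimal numpy sketch (with made-up pose and point values) that builds $E = {[T]}_\times R$ from a known relative pose and verifies that ${p_1}^T E\, p_2 = 0$ for a synthetic 3D point:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0., -t[2], t[1]],
                     [t[2], 0., -t[0]],
                     [-t[1], t[0], 0.]])

# assumed ground-truth relative pose: rotation about the y-axis, translation T
theta = np.deg2rad(10)
R = np.array([[np.cos(theta), 0., np.sin(theta)],
              [0., 1., 0.],
              [-np.sin(theta), 0., np.cos(theta)]])
T = np.array([1.0, 0.2, 0.0])

E = skew(T) @ R                     # essential matrix

P = np.array([0.5, -0.3, 4.0])      # 3D point in camera-1 coordinates
p1 = P / P[2]                       # normalized image point in camera 1
P2 = R.T @ (P - T)                  # the same point in camera-2 coordinates
p2 = P2 / P2[2]                     # normalized image point in camera 2

print(p1 @ E @ p2)                  # ~1e-16: the epipolar constraint holds
```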

As a short illustration, we now look at the case of two rectified stereo images. As we have seen in Chapter 4, there is no rotation between rectified images, so we can set $R$ to the 3x3 identity matrix. As a result, the essential matrix only depends on the translation.

Rectified Case Figure 3: Rectified Case. source
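For a purely horizontal baseline $T = (b, 0, 0)^T$ and $R = I$, the constraint reduces to ${p_1}^T E\, p_2 = b(v_2 - v_1) = 0$, i.e. $v_1 = v_2$: corresponding points lie on the same image row, which is exactly the rectified-stereo property. A small sketch with an assumed baseline value:

```python
import numpy as np

b = 0.12                              # assumed baseline in meters
E = np.array([[0., 0., 0.],           # E = [T]_x with T = (b, 0, 0)
              [0., 0., -b],
              [0., b, 0.]])

p1 = np.array([0.3, 0.7, 1.0])        # (u1, v1, 1)
p2 = np.array([0.1, 0.7, 1.0])        # (u2, v2, 1), same row: v2 == v1

# p1^T E p2 = b * (v2 - v1) -> 0 exactly when the points share a row
print(p1 @ E @ p2)                    # 0.0
```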

So how many point correspondences do we need to be able to calculate the essential matrix? This is a question many researchers have looked into. Kruppa showed in 1913 that at least 5 points are needed and that in this case there can be up to 11 solutions. In 1988 Demazure showed that there are actually at most 10 distinct solutions. In 1996 Philip proposed an iterative algorithm for solving this problem, and in 2004 the first efficient non-iterative algorithm was developed by Nistér, based on the Gröbner basis decomposition. However, the first popular solution method uses 8 points to get a unique solution and owes its popularity to its relatively easy implementation. This algorithm is called the 8-point algorithm or Longuet-Higgins algorithm.

The 8-point Algorithm for calibrated cameras

As seen before, the matrix $E$ is a 3x3 matrix. Applying the epipolar constraint to a pair of corresponding points gives the following equation, based on the elements of the matrix $E$:

\[\begin{align*} u_2u_1e_{11} + u_2v_1e_{12} + u_2e_{13} + v_2u_1e_{21} + v_2v_1e_{22} + v_2e_{23} + u_1e_{31} + v_1e_{32} + e_{33} = 0 \end{align*}\]

Essential Matrix Figure 4: Essential Matrix. source

Repeating this for many points, we get the following structure:

Q and E Figure 5: Q and E. source

The matrix $Q$ has dimensions $n \times 9$ and should therefore have rank 8 to have a unique, non-trivial solution. Each point correspondence provides 1 independent equation, so we need 8 point correspondences. When we have more than 8 points, a solution is to minimize ${\lVert Q E \rVert}^2$ subject to the constraint ${\lVert E \rVert}^2 = 1$. We can take the eigenvector corresponding to the smallest eigenvalue of the matrix $Q^TQ$ as the solution, because it is the unit vector $x$ that minimizes ${\lVert Qx \rVert}^2 = x^TQ^TQx$. In practice we can solve this easily using the Singular Value Decomposition (SVD). There are some configurations in which the 8-point algorithm is degenerate, namely when the 3D points are coplanar. In this case, however, the 5-point algorithm still works.
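The following is a minimal numpy sketch of this procedure (not a robust implementation). It uses the convention ${p_2}^T E\, p_1 = 0$, matching the row layout of Figure 5, and assumes the correspondences are given as two $n \times 2$ arrays of normalized image coordinates:

```python
import numpy as np

def eight_point(p1, p2):
    """Estimate E from n >= 8 correspondences (p1, p2: n x 2 arrays)."""
    n = p1.shape[0]
    # each row of Q is (u2u1, u2v1, u2, v2u1, v2v1, v2, u1, v1, 1)
    Q = np.column_stack([
        p2[:, 0]*p1[:, 0], p2[:, 0]*p1[:, 1], p2[:, 0],
        p2[:, 1]*p1[:, 0], p2[:, 1]*p1[:, 1], p2[:, 1],
        p1[:, 0],          p1[:, 1],          np.ones(n)])
    # unit vector minimizing ||Qx||^2: the right singular vector of the
    # smallest singular value (equivalently, the eigenvector of Q^T Q)
    _, _, Vt = np.linalg.svd(Q)
    E = Vt[-1].reshape(3, 3)
    # project onto the rank-2 matrices to get a valid essential matrix
    U, S, Vt = np.linalg.svd(E)
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt
```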

Interpretation

Now we give more insight into how the 8-point algorithm works, together with some interpretation. As mentioned before, the goal is to find a solution that fulfills the condition ${p_2}^TE\,p_1 = 0$. With multiple points, we can also express this formulation as minimizing the error (deviation) from that solution. Thus:

\[\begin{align*} \sum_{i=1}^{N} \left({p_2^i}^T E\, p_1^i\right)^2 \end{align*}\]

This expression can be rewritten using the definition of the dot product:

\[\begin{align*} {p_2}^T E p_1 = \lVert p_2 \rVert \lVert E p_1 \rVert \cos(\theta) \end{align*}\]

We see that this product depends on the angle between the image vector $p_2$ and the normal $E p_1$ of the epipolar plane. It is non-zero when $p_1$, $p_2$ and $T$ are not coplanar, and zero when they are coplanar, in line with the definition of the epipolar plane. See Figure 2.

(The following won't be asked…)

Once we have the essential matrix $E$, we can decompose it into the rotation $R$ and translation $t$. For this we can use the SVD, as explained before.

\[\begin{align*} E = U S V^T \end{align*}\]

$S$ is a 3x3 diagonal matrix with the singular values as diagonal elements. Since $E$ must have rank 2 to be a valid essential matrix, we can set the smallest singular value to 0.

\[\begin{align*} S = \begin{bmatrix} \sigma_1 & 0 & 0\\ 0 & \sigma_2 & 0\\ 0 & 0 & 0 \end{bmatrix} \end{align*}\]

\[\begin{align*} \hat{T} = U \begin{bmatrix} 0 & \mp 1 & 0\\ \pm 1 & 0 & 0\\ 0 & 0 & 0 \end{bmatrix} S\, U^T \end{align*}\]

\[\begin{align*} \hat{R} = U \begin{bmatrix} 0 & \mp 1 & 0\\ \pm 1 & 0 & 0\\ 0 & 0 & 1 \end{bmatrix} V^T \end{align*}\]

Take note of the plus-minus and minus-plus order. We can turn the skew-symmetric translation matrix back into a translation vector.

\[\begin{align*} \hat{T} = \begin{bmatrix} 0 & -t_z & t_y\\ t_z & 0 & -t_x\\ -t_y & t_x & 0 \end{bmatrix} \implies \hat{t} = \begin{bmatrix} t_x & t_y & t_z \end{bmatrix}^T \end{align*}\]
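As a quick helper, reading the vector back out of the skew-symmetric matrix (a sketch, using the sign convention of the matrix above):

```python
import numpy as np

def unskew(T_hat):
    """Extract t = (t_x, t_y, t_z) from its skew-symmetric matrix [t]_x."""
    return np.array([T_hat[2, 1], T_hat[0, 2], T_hat[1, 0]])
```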

In the end we have to apply the camera intrinsics to get exact results.

\[\begin{align*} t &= K_2\hat{t}\\ R &= K_2\hat{R}{K_1}^{-1} \end{align*}\]

We see from the mathematical notation that there are 4 possible solutions, but only one of them has the points in front of both cameras.

Four Solutions Figure 6: Four Solutions source
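Here is a minimal numpy sketch of this decomposition, enumerating the four candidate $(R, t)$ pairs of Figure 6. In practice one would then triangulate a point with each candidate and keep the pair that places it in front of both cameras:

```python
import numpy as np

def decompose_essential(E):
    """Return the four (R, t) candidates encoded in an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0., -1., 0.],      # Rz(+90 deg); W.T is Rz(-90 deg)
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    t = U[:, 2]                       # translation direction: left null vector of E
    Rs = [U @ W @ Vt, U @ W.T @ Vt]
    Rs = [R if np.linalg.det(R) > 0 else -R for R in Rs]  # proper rotations only
    # two rotations x two translation signs = four candidate poses
    return [(R, s * t) for R in Rs for s in (+1.0, -1.0)]
```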

8-point Algorithm for uncalibrated cameras

When the cameras are uncalibrated, we just have to include the intrinsics in the notion of the normalized image coordinates. Therefore we first have to extract the camera intrinsics $K$ from the image coordinates $\bar{u}$ and $\bar{v}$.

Extract K Figure 7: Extract K source

When we do that, we introduce the camera intrinsics $K$ for both cameras. We can combine these inverses of $K$ together with the essential matrix to get the fundamental matrix $F = {K_2}^{-T} E\, {K_1}^{-1}$.

Fundamental Matrix Figure 8: Fundamental Matrix source

As a result, we can apply the exact same 8-point algorithm to calculate the fundamental matrix instead of the essential matrix. This is advantageous since this way we work directly on the image coordinates. However, there is an issue: the matrix $Q$ can be poorly conditioned, since its entries are products of the $u$ and $v$ coordinates of both images. If $u$ and $v$ differ a lot within and between the images, then some of the products can be very large and others very small (e.g. $u_1 = 10$, $v_1 = 100$, $u_2 = 1$, $v_2 = 100$ $\implies$ $q_1 = 10, q_2 = 100, q_3 = 1, q_4 = 1000, q_5 = 10000, q_6 = 100, q_7 = 10, q_8 = 100, q_9 = 1$, see Figure 5). Given such poor numerical conditioning, the algorithm becomes very sensitive to noise.

This issue is easily solved by applying the normalized 8-point algorithm by Hartley (1997). The key idea of this normalized approach is that instead of having the origin of the image coordinates in the top-left corner, it is moved towards the image center. This way we get smaller coordinate values, but now also negative ones. Additionally, the coordinate values are normalized to lie between -1 and 1. From these normalized, better-conditioned coordinates we can calculate the fundamental matrix and then unnormalize it to get the fundamental matrix of the original, unnormalized coordinates.

Normalizing Figure 9: Normalizing source

Other rescaling ratios can be used as well. For example, the original scaling of Hartley (1997) sets the centroid of the set of points to 0 and the mean standard deviation to $\sqrt{2}$. As a formula this looks as follows:

\[\begin{align*} \hat{p} = \dfrac{\sqrt{2}}{\sigma}(p-\mu) \end{align*}\]

where $\mu$ is the centroid (geometric center) of all the points and $\sigma$ is the mean standard deviation.

\[\begin{align*} \mu &= (\mu_x, \mu_y) = \dfrac{1}{N} \sum_{i=1}^{N} p^i \\ \sigma &= \dfrac{1}{N} \sum_{i=1}^{N} \lVert p^i - \mu \rVert^2 \end{align*}\]

This transformation can also be expressed as a matrix.

Hartley Scaling Figure 10: Hartley Scaling source

The backscaling can then be done using the transpose of the transformation matrix: $F = T^T\hat{F}T$.
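A minimal numpy sketch of this normalization, using the formulas above. Note that here each image gets its own transformation, so the backscaling becomes $F = T_2^T \hat{F} T_1$; variants of the scaling also exist, e.g. using the mean distance instead of its square:

```python
import numpy as np

def hartley_normalization(p):
    """Shift the centroid of n x 2 points to 0 and rescale by sqrt(2)/sigma.

    Returns the normalized points and the 3x3 transformation matrix T
    acting on homogeneous coordinates.
    """
    mu = p.mean(axis=0)
    sigma = np.mean(np.sum((p - mu)**2, axis=1))   # as defined above
    s = np.sqrt(2) / sigma
    T = np.array([[s, 0., -s*mu[0]],
                  [0., s, -s*mu[1]],
                  [0., 0., 1.]])
    return (p - mu) * s, T

# usage sketch, with the eight_point function from above: normalize both
# images, estimate F on the normalized coordinates, then backscale
# p1_hat, T1 = hartley_normalization(p1)
# p2_hat, T2 = hartley_normalization(p2)
# F_hat = eight_point(p1_hat, p2_hat)
# F = T2.T @ F_hat @ T1
```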

It should be clear by now that obtaining the fundamental matrix is not an issue. Now you might ask whether we can recover the rotation $R$, translation $T$ and the intrinsics $K_1$ and $K_2$ from the fundamental matrix $F$. First of all, this is not always possible: in general there are infinitely many solutions. We are only interested in the case where we can get a meaningful result, i.e. when there is a unique solution. This occurs if the principal points of the cameras are known and both cameras have the same (unknown) focal length $f$. In that case we can recover $R$, $T$ and $f$. However, we won't go into how to recover them in this chapter.

Error Measures

We can measure the quality of the fundamental or essential matrix using different metrics. First of all, we can use the algebraic error, which is defined directly by the epipolar constraint:

\[\begin{align*} err = \sum_{i=1}^{N} \left({p_2^i}^T E\, p_1^i\right)^2 \end{align*}\]

This error will be exactly 0 if we used 8 points to calculate $E$. If more points are used, the error can be larger than 0 because of noise or outliers.

There is also the directional error, which is the sum of the squared cosines of the angle between the normal of the epipolar plane ($n = Ep_1$) and the image vector of the second image:

\[\begin{align*} err &= \sum_{i=1}^{N} \cos(\theta_i)^2 \\ \cos(\theta) &= \dfrac{{p_2}^T E p_1}{\lVert p_2 \rVert \lVert Ep_1 \rVert} \end{align*}\]

Directional Error Figure 11: Directional Error source
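A direct translation of this formula into numpy (a sketch; p1s and p2s are assumed to be arrays of homogeneous image vectors):

```python
import numpy as np

def directional_error(E, p1s, p2s):
    """Sum of squared cosines between the plane normals E @ p1 and the bearings p2."""
    err = 0.0
    for p1, p2 in zip(p1s, p2s):
        n = E @ p1                                    # normal of the epipolar plane
        c = (p2 @ n) / (np.linalg.norm(p2) * np.linalg.norm(n))
        err += c**2
    return err
```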

Another error measure is the epipolar line distance (squared epipolar-line-to-point distance), where we measure the squared distance of each image point to its epipolar line.

\[\begin{align*} err &= \sum_{i=1}^{N} d^2({p_1}^i,{l_1}^i) + d^2({p_2}^i, {l_2}^i) \end{align*}\]

Epipolar Line Distance Figure 12: Epipolar Line Distance source
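A sketch of this measure for a single correspondence, using the convention ${p_2}^T F p_1 = 0$, so that $l_2 = F p_1$ is the epipolar line in image 2 and $l_1 = F^T p_2$ the one in image 1. The distance from a point $p$ to a line $l = (a, b, c)$ is $|l \cdot p| / \sqrt{a^2 + b^2}$:

```python
import numpy as np

def epipolar_line_dist_sq(F, p1, p2):
    """Symmetric squared point-to-epipolar-line distance (p1, p2 homogeneous)."""
    l2 = F @ p1                                   # epipolar line of p1 in image 2
    l1 = F.T @ p2                                 # epipolar line of p2 in image 1
    d2 = (p2 @ l2)**2 / (l2[0]**2 + l2[1]**2)     # squared distance in image 2
    d1 = (p1 @ l1)**2 / (l1[0]**2 + l1[1]**2)     # squared distance in image 1
    return d1 + d2
```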

The most popular and most accurate measure is the reprojection error, even though it is computationally expensive. For this measure we observe the distance between the observed image point and the reprojection of our estimated 3D point onto the image plane.

\[\begin{align*} err &= \sum_{i=1}^{N} \lVert {p_1}^i - \pi_1(P^i)\rVert ^2 + \lVert {p_2}^i - \pi_2(P^i, R, T) \rVert^2 \end{align*}\]

where $\pi$ denotes the reprojection of the point onto the image plane.

Reprojection Error Figure 13: Reprojection Error source
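A sketch of this measure, assuming a pinhole model and the same pose convention as the earlier sketches ($P$ given in camera-1 coordinates, camera 2 at pose $(R, T)$; the intrinsics parameters are assumptions, mirroring $\pi_1$ and $\pi_2$ above):

```python
import numpy as np

def pi1(P, K):
    """Project a 3D point given in camera-1 coordinates."""
    p = K @ P
    return p[:2] / p[2]

def pi2(P, R, T, K):
    """Project the same point into camera 2 (pose R, T relative to camera 1)."""
    p = K @ (R.T @ (P - T))
    return p[:2] / p[2]

def reprojection_error(P, p1_obs, p2_obs, R, T, K1, K2):
    """Squared reprojection error of one 3D point against both observations."""
    return (np.sum((p1_obs - pi1(P, K1))**2)
            + np.sum((p2_obs - pi2(P, R, T, K2))**2))
```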

Summary

  • In SFM with calibrated cameras we need at least 5 correspondences. This is because for each correspondence we have 2 knowns per image, so in total $4n$ knowns for $n$ correspondences. For each correspondence we have 3 unknowns, one for each dimension of the 3D point. Additionally we have 5 unknowns for the motion (3 DoF for rotation, 2 DoF for translation). For a solvable system the number of knowns has to be at least as large as the number of unknowns: $5+3n \leq 4n \implies n \geq 5$.
  • We know that the image vector of camera 1 has to be orthogonal to the normal vector of the epipolar plane, which we can express as the cross product of the line connecting both cameras and the image vector of camera 2 in camera-1 coordinates. Due to this orthogonality we can set the dot product to 0. We derive the image vector of camera 2 in camera-1 coordinates by applying the rotation $R$. As a result we get ${p_1}^T {[T]}_\times R\, p_2 = 0$ with ${[T]}_\times R = E$.
  • The essential matrix is the matrix that fulfills the epipolar constraint
    1. Construct the $n \times 9$ matrix $Q$
    2. Find the SVD of $Q$: $Q = UDV^T$
    3. The entries of $F$ are the components of the column of $V$ corresponding to the smallest singular value
    4. Find the SVD of $F$: $F = UDV^T$
    5. Set the smallest singular value of $F$ to 0 to create $\hat{D}$
    6. Recompute $F$: $F = U \hat{D} V^T$
  • An essential matrix can be decomposed into 4 combinations of $T$ and $R$ (two signs for $T$, two choices for $R$). Only one of these combinations has the scene points in front of both camera planes.
  • The epipolar constraint describes mathematically that the image points, the 3D point, as well as the camera centers have to lie on one plane.
  • The essential matrix results from the 8-point algorithm with calibrated cameras, i.e. known $K$. The fundamental matrix has the $K$ matrices folded into it, since it is based on the situation where the $K$ are not known.
  • Normalizing the 8-point algorithm is important, since otherwise the $Q$ matrix can be poorly conditioned (entries of many different orders of magnitude), making it not robust to noise and outliers.
  • To normalize, the origin of the image coordinates is moved to the image center. Then the values can be normalized to lie in $[-1, 1]$. Another method is to normalize such that the centroid over all points is at 0 and the mean standard deviation is scaled to $\sqrt{2}$.
  • In the normalized 8-point algorithm we first normalize the image coordinates, then apply the normal 8-point algorithm, and finally convert the essential matrix (fundamental matrix) back to the original image coordinates.
  • There are multiple error measures, such as the algebraic error, directional error, epipolar line error and the reprojection error.
This post is licensed under CC BY 4.0 by the author.