Chapter 8 - Direct Approaches to Visual SLAM
SLAM = Simultaneous Localization And Mapping
Indirect vs. Direct methods
Classical Multi-View Reconstruction Pipeline
The classical approach tackles structure + mostion estimation (i.e. visual SLAM) like this:
- Extract feature points (e.g. corners)
- Determine correspondence of feature points across the images
- either via local tracking (optical flow approach)
- or random sampling of possible partners based on feature descriptor
- Estimate camera motion
- eight-point algorithm and/or bundle adjustment
- Compute dense reconstruction from camera motion
- photometric stereo approaches
Disadvantages of classical “indirect” approaches
- disregards any information not included in the selected feature points: all remaining information in the color
- lack robustness to errors in the point correspondence
Toward Direct Approaches
Skip feature point extraction: reconstruct dense/semi-dense scene directly from input images.
- more robust to noise: exploit all available information
- provide semi-dense reconstruction: better than sparse point cloud by eight-point algorithm/bundle adjustment!
- even faster: no feature point extraction
Idea: “find geometry of the scene in a way such that the colors in the images are consistent”.
Some Concrete Direct SLAM Methods
Stühmer, Gumhold, Cremers, DAGM 2010: Real-Time Dense Geometry from Handheld Camera
Suppose we have $n$ images, $g_i \in SE(3)$ is the i-th image rigid body motion, $I_i: \Omega \to \mathbb{R}$ the i-th image. Goal: compute depth map $h: \Omega \to \mathbb{R}$ that maps each pixel to its depth.
Solve optimization problem: \(\min_h \sum_{i=2}^{n} \int_\Omega |I_1(x) - I_i(\pi g_i(h(x) x)) | \,dx + \lambda \int_\Omega |\nabla h(x)| \,dx\)
- For each image $i$ and each point $x$ in image 1, the intensity of $x$ in image 1 should be close to the intensity of the corresponding point in image i
- The total variation of depths should be small (depths should not wildly fluctuate) - achieved via the regularization term
Actual minimization achieved with similar strategies like in optical flow estimation (above integral looks similar) - not detailed here. Roughly:
- linearize above terms in h (via Taylor expansions) - only holds for small h
- coarse-to-fine linearization
Comment: the total variation regularization adds in a “soap film” effect, where unknown/unseen parts are filled in/”made up” for the reconstruction.
Steinbrücker, Sturm & Cremers (2011): Camera motion of RGB-D Camera
Use a similar cost function as [[12d - Examples of Direct SLAM papers from 2011 - 2014#Stühmer Gumhold Cremers DAGM 2010 Real-Time Dense Geometry from Handheld Camera|before]] (without the regularization and with L2 instead of L1), but assume the depth is known and optimize over the rigid body motion $\xi \in se(3)$.
\[E(\xi) = \int_\Omega \big(I_1(x) - I_2(\pi g_{\xi}(h(x) x))\big)^2 \,dx\]Linearize this by using the Taylor expansion around some initial $\xi_0$:
\[E(\xi) \approx \int_\Omega \big(I_1(x) - I_2(\pi g_{\xi_0}(h(x) x)) - \nabla I_2^\top (\tfrac{d\pi}{d g_\xi})(\tfrac{dg_\xi}{d\xi}) \xi \big)^2 \,dx\]=> Convex quadratic cost function, linear optimality condition $\tfrac{dE(\xi)}{d\xi} = A\xi + b = 0$ (Linear system with 6 equations).
The linearization is identical to Gauss-Newton (approximation of a Hessian by a PSD matrix).
Comment: The Taylor approximation only works well for small camera movements (which you typically have in practice). For too large movements, the baseline generalized iterated closest points (GICP) works better.
Kerl, Sturm, Cremers (IROS 2013): Extension
Additionally to color consistency, also ask for depth value consistency. Assume the vector $r_i = (r_{ci}, r_{zi})$ of color and geometric discrepancy for the i-th pixel follows a bivariate t-distribution, the max-likelihood pose estimate depending on the parameter $\nu$ from the t-distribution is
\[\min_xi \sum_i w_i r_i^\top \Sigma^{-1} r_i, ~ w_i = \frac{\nu + 1}{\nu + r_i^\top \Sigma^{-1} r_i}\]=> non-linear weighted least squares problem
(t-distribution in a sense interpolates between a uniform and a Gaussian distribution)
Loop Closure/Global Consistency
What to do about errors that accumulate when walking around?
- either bundle adjustment which produces globally consistent solution (expensive though, not online)
Loop closure: estimate a lot of camera motions $\hat{\xi}{ij}$ for image pairs $(i, j)$. Then estimate a globally consistent trajectory ${\xi_i}{i}$ that minimizes: \(\min_\xi \sum_{i,j} (\hat{\xi}_{ij} - \xi_i \circ \xi_j^{-1})^\top \Sigma_{ij}^{-1} (\hat{\xi}_{ij} - \xi_i \circ \xi_j^{-1})\) Here $\Sigma_{ij}$ denotes the uncertainty of measurement $(i, j)$. This minimization can be solved by [[12c - Optimization Algorithms, Direct Visual SLAM#Levenberg-Marquardt Algorithm Levenberg-Marquardt]]
In the loop closure approach, we don’t store all camera positions for efficiency, but a lot of keyframes.
Newcombe, Lovegrove & Davison (ICCV 2011): Dense Tracking and Mapping - DTAM
Combines the previous tasks of (a) reconstructing dense geometry, and (b) reconstructing camera motion (and actually one of the first papers to do this).
Chicken-egg-problem! (Since here, RGB instead of RGB-D camera)
Depth estimation
Depth estimation very similar to [[#Stühmer Gumhold Cremers DAGM 2010 Real-Time Dense Geometry from Handheld Camera|Stühmer et al.]], but regularization over total variation of inverse depth $u=1/h$ instead of depth $h$. This alleviates a bias such that farther-away structures which have less pixels don’t have a difference surface smoothing strength.
The regularization terms is also made adaptive by weighting the integrand by $\rho(x) = \exp(- | \nabla I_\sigma(x) | ^\alpha)$: This is small if a big gradient exists at $x$, which means that big changes are penalized less at points in the image where the color changes by a lot. The final regularization term is |
This can, but needn’t be a good idea (e.g. reduced smoothing on a zebra).
Camera tracking
Similar to [[#Steinbrücker Sturm Cremers 2011 Camera motion of RGB-D Camera|Steinbrücker et al.]] (not detailed here).
Then: find good initialization (initial estimate for camera motion), alternate
Engel, Sturm, Cremers (ICCV 2013): LSD-SLAM (Large-Scale Direct Monocular SLAM)
- not dense, but semi-dense: reconstruct everywhere where there is a usable gradient (~50% of points in practice)
- additional uncertainty propagation, similar to Kalman filter
- additional scale parameter in camera motions: Use Lie group of 3D similarity transformations \(\textit{Sim}(3) = \bigg\{ \begin{pmatrix} sR & T \\ 0 & 1 \end{pmatrix} \mid R \in SO(3), T \in \mathbb{R}^3, s \geq 0 \bigg\}\)
Minimize non-linear least squares problem \(\min_{\xi \in \textit{sim}(3)} \sum_i w_i r_i^2(\xi)\) Here $r_i$ is a color residuum, $w_i$ a weighting, similar to [[#Kerl Sturm Cremers IROS 2013 Extension|Kerl et al.]]; this is minimized by a weighted Gauss-Newton algorithm on $\textit{Sim}(3)$