Chapter 11 - Visual Odometry / Visual SLAM
Slides 8 - 61
Slides 1 - 29
In this chapter we go over Structure from Motion in more detail. First we look at the case where we want to reconstruct a 3D scene from many images that are not sequentially ordered. Then we move towards the case where we have sequentially ordered images. Towards the end of this chapter we focus on the problem of loop closure and, for that purpose, introduce the SLAM algorithm.
Multi-view SFM
In Structure from Motion we want to recover the 3D structure of a scene from many different images of a moving camera. However, the images do not always have to be in sequential order. Sometimes we also want to reconstruct a scene from a set of images of the same scene taken from different angles, with different cameras, and even on different days. In this case we cannot rely on a sequential order, and therefore we do not know the spatial and temporal relation between the different images. In such a setting we have to apply Hierarchical SFM.
Hierarchical SFM
The name of this method comes from the hierarchical structure in which the images are ordered to introduce some kind of relation between them. In the following we will see how we can create such a structure.
First we extract features from the images and then search the other images for matching features. If two images share matching features we call them nearby frames. We then split the image set into clusters of three nearby/matching frames. Within a cluster we use two of the images to apply two-view SFM; as a result we get the translation and rotation between the two cameras. From this information we can triangulate the points in 3D space and create a point cloud. We can then merge the third image into the point cloud by using 3-point RANSAC between the point cloud and the third image; this again gives us the rotation and translation of the third view with respect to a reference view of the point cloud (refer to the section about 3D-to-2D later in this chapter). Like this we can add new features of the third view to the point cloud. Finally we merge clusters pairwise using features that match in both clusters; as before, this works via the rotation and translation between both reference views and can again be done using RANSAC. In the end we can use Bundle Adjustment to refine the structure as well as the camera locations. A minimal sketch of the two-view step follows Figure 1.
Figure 1: Hierarchical SFM. source
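Two steps of this pipeline, two-view SFM on an image pair followed by triangulation into a point cloud, could look roughly as follows. This is a minimal sketch using OpenCV; the function name `two_view_sfm`, the ORB settings and the RANSAC threshold are our own assumptions, and a real pipeline would add outlier filtering and bundle adjustment.

```python
import cv2
import numpy as np

def two_view_sfm(img1, img2, K):
    # Detect and match ORB features (binary descriptors, Hamming distance).
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Estimate the essential matrix with RANSAC and recover R, t between the views.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate the correspondences into a point cloud.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at the origin
    P2 = K @ np.hstack([R, t])                         # second camera from R, t
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    cloud = (pts4d[:3] / pts4d[3]).T                   # homogeneous -> Euclidean
    return R, t, cloud
```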
Sequential SFM aka Visual Odometry
In sequential SFM, also called Visual Odometry (VO), the images arrive in sequential order, and the path of the camera is computed sequentially, pose after pose. This allows for real-time applications. The general flow of a VO algorithm can be split into two parts, the front-end and the back-end. The front-end is responsible for feature detection and matching and for the pose estimation. In the back-end the pose estimates are then locally optimized.
Figure 2: VO Flow Chart. source
Most of the front-end tasks we have already seen in previous chapters, so we will not repeat feature detection and matching here. However, we will mention some things about the motion estimation, which has also already been discussed in chapter XY. As can be seen in Figure 3, there are different motion estimation settings, which we now briefly explain.
2D-to-2D / Motion from Image Feature Correspondences
In this setting we have two feature points $f_{k-1}$ and $f_{k}$, where $k$ denotes the frame number. Both feature points are specified in 2D, i.e. on the image plane. Using the 5-point algorithm we can get a solution, so we need at least 5 point correspondences. In general, the more the better, so we can also use the 8-point algorithm. The solution is then found by minimizing the reprojection error. The two-view sketch after Figure 1 shows exactly this step (essential matrix plus pose recovery).
Figure 3: 2D-to-2D. source
3D-to-2D / Motion from 3D Structure and Image Correspondences
In such a case we have the feature $f_{k-1}$ in 3D and $f_k$ in 2D. In other words, we have a point cloud and an image. We know this problem as camera resectioning, or more familiarly as the PnP problem. From chapter 4 we know that we need at least 3+1 correspondences to get a unique solution. We can use the P3P algorithm to find it; a hedged sketch follows Figure 4.
Figure 4: 3D-to-2D. source
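In code this step could look as follows: a hedged sketch using OpenCV's `solvePnPRansac` with the P3P minimal solver. Here `pts3d` (Nx3 points from the existing point cloud), `pts2d` (their matched Nx2 pixel observations) and the reprojection-error threshold are assumptions.

```python
import cv2

def pose_from_pnp(pts3d, pts2d, K):
    # RANSAC loop around the minimal P3P solver; rejects outlier matches
    # whose reprojection error exceeds 2 pixels (illustrative threshold).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, distCoeffs=None,
        flags=cv2.SOLVEPNP_P3P, reprojectionError=2.0)
    assert ok, "PnP failed: not enough consistent correspondences"
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers
```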
3D-to-3D / Motion from 3D Structure Correspondences
In this last setting we have 3D points for both features $f_{k-1}$ and $f_k$. To achieve this we need to have triangulated the points first, for example using a stereo camera. In this case we need at least 3 non-collinear correspondences. The solution is then found by minimizing the 3D-3D Euclidean distance. The popular algorithm in this case is Arun87 (Least-Squares Fitting of Two 3-D Point Sets) for global registration, combined with bundle adjustment; a minimal sketch of it follows Figure 5.
Figure 5: 3D-to-3D. source
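A minimal sketch of Arun's SVD-based alignment is given below. `A` and `B` are assumed to be Nx3 arrays of already-matched 3D points; a real system would wrap this in RANSAC to handle outliers and refine with bundle adjustment afterwards.

```python
import numpy as np

def align_point_sets(A, B):
    """Find R, t minimizing sum_i || B_i - (R @ A_i + t) ||^2 (Arun87)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)  # centroids of both sets
    H = (A - ca).T @ (B - cb)                # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t
```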
We can sort the motion estimation settings above into monocular and stereo vision as follows:
Type of correspondences | Monocular | Stereo |
---|---|---|
2D-2D | X | - |
3D-2D | X | X |
3D-3D | - | X |
Monocular VO Example
To start off we have to initialize the algorithm with a point cloud, so we begin with a 2D-to-2D case. For the initialization we need two frames that overlap in feature points; however, they should not overlap too much either, as otherwise the point cloud would not be big enough. As seen before, we can use the 5- or 8-point algorithm for this initialization. After that we can use bundle adjustment to refine the point cloud and the poses of the two frames. The two frames we use for the initialization are also called keyframes. The whole initialization process is called bootstrapping.
Figure 6: Bootstrapping. source
So one of the key questions is how far apart the images have to be to achieve a good initialization. If they are too close together, i.e. when the baseline is too small, then there is a large depth uncertainty. If they are far apart, the depth is more certain. However, the images should also not be too far apart, because otherwise there are not enough overlapping features anymore. As a result we skip some of the frames so that the used frames are not too close to each other; we skip until the uncertainty falls below a threshold. The resulting selected frames are called keyframes. As a rule of thumb we add a keyframe if the ratio of the keyframe distance divided by the average depth is larger than 10-20% (see the sketch after Figure 7).
Figure 7: skipping frames. source
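The rule of thumb could be checked like this. This is a sketch under our own assumptions: the threshold is set to 15%, and the average depth is taken as the mean z-coordinate of the point cloud expressed in the camera frame.

```python
import numpy as np

def should_add_keyframe(t_keyframe, t_current, cloud, threshold=0.15):
    baseline = np.linalg.norm(t_current - t_keyframe)  # distance between camera centres
    avg_depth = np.mean(cloud[:, 2])                   # mean depth of the point cloud
    return baseline / avg_depth > threshold
```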
From the point cloud produced by the first two keyframes we can calculate the pose of every subsequent frame using the P3P or DLT algorithm, since we are now in a PnP, i.e. 3D-to-2D, setting (see the PnP sketch after Figure 4). We can get the pose as long as there are still enough point correspondences between the existing point cloud and the new frames. Depending on the algorithm we need at least 6 (DLT) or 3+1 (P3P) points.
Figure 8: Localisation. source
When there are not enough correspondences anymore, we have to set a new keyframe. By adding a new keyframe we can add new points to the point cloud using triangulation. Now that we have new points in the point cloud, we can get the pose for the next few frames without adding new 3D points. However, since the point cloud is already initialized, we would also be able to add every frame as a keyframe from now on. »Can we find more pros/cons here…«
Figure 9: Mapping. source
Back-end
Now that we have estimates for the pose/motion and the structure, we can optimize over the last few poses to refine the trajectory. One option is to use a sliding window and apply bundle adjustment to it. However, it is also possible to compute the transformations between non-adjacent frames: if features from previous keyframes are still visible, they can be used as additional constraints.
Figure 10: Adjacent Frames. source
Figure 11: Pose-Graph optimization. source
In the case where we are only interested in the poses of the camera and not in the 3D points of the scene, we can use Pose-Graph optimization, for which we minimize the following:
\[\begin{align*} C_k = \arg\min_{C_k} \sum_{i} \sum_{j} \lVert C_i - C_j T_{ij} \rVert^2 \end{align*}\]

Usually only the last $m$ keyframes are used for efficiency.
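As a minimal sketch, this cost could be evaluated as follows, assuming each pose $C_k$ and each relative measurement $T_{ij}$ is stored as a 4x4 homogeneous matrix. Real systems minimize this with solvers such as g2o, GTSAM or Ceres and usually formulate the residual on the manifold rather than as a plain matrix difference.

```python
import numpy as np

def pose_graph_error(poses, edges):
    """poses: dict frame_id -> 4x4 pose C_k;
    edges: list of (i, j, T_ij) relative-pose measurements."""
    total = 0.0
    for i, j, T_ij in edges:
        predicted = poses[j] @ T_ij       # where C_i should be according to the edge
        residual = poses[i] - predicted   # C_i - C_j T_ij, as in the cost above
        total += np.sum(residual ** 2)    # squared Frobenius norm
    return total
```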
If we additionally want to optimize the 3D points, then we use Bundle Adjustment. As a reminder, for Bundle Adjustment we use the following formulation, where $X^i$ is the 3D position of point $i$ and $C_k$ is the pose of the camera in the $k$-th frame.
\[\begin{align*} X^i, C_k = \arg\min_{X^i, C_k} \sum_{i} \sum_{k} \rho\left( p_k^i - \pi(X^i, C_k) \right) \end{align*}\]

where $\pi$ is the reprojection function and $\rho$ is a robust cost function like Huber or Tukey; a sketch follows Figure 12.
Figure 12: Bundle Adjustment. source
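A sketch of evaluating this cost with a Huber robust function follows. The data layout is an assumption: poses as 4x4 world-to-camera matrices, `observations` as (point id, frame id, pixel) triples.

```python
import numpy as np

def huber(r, delta=1.0):
    # Quadratic near zero, linear in the tails: large residuals are down-weighted.
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def ba_cost(points3d, poses, observations, K):
    total = 0.0
    for i, k, uv in observations:  # point i seen in frame k at pixel uv (length-2 array)
        X_cam = poses[k][:3, :3] @ points3d[i] + poses[k][:3, 3]  # world -> camera
        proj = K @ X_cam
        residual = uv - proj[:2] / proj[2]   # p_k^i - pi(X^i, C_k)
        total += huber(residual).sum()
    return total
```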
In general bundle adjustment is more precise than pose-graph optimization since it adds additional landmark constraints. However, BA is also more costly: $O((qM+lN)^3)$, where $M$ and $N$ are the number of points and camera poses and $q$ and $l$ are the number of parameters per point and per camera pose. We can speed up bundle adjustment by applying it to a sliding window, since this reduces the number of point and pose parameters. We can also reduce the complexity by optimizing only over the camera parameters while keeping the 3D points fixed.
Another way of optimizing the pose estimates is loop-closure detection. This is also known as place recognition, since we try to re-identify locations where we have been previously. Knowing that we are at the same location as back then, we can adjust our trajectory to close the loop, so that the pose from which we observed the scene back then agrees with the current one. Loop closing can thus be split into two parts: first the detection of a previously mapped area, and then the loop correction, which adjusts the trajectory. When we encounter previously mapped areas, it is also important not to produce map duplication; otherwise we would have a point cloud containing the same 3D features multiple times at different locations. This would lead to a problem when detecting the same features a third time: we would not know where to close the loop. A simple detection sketch follows Figure 13.
When we use Visual Odometry together with loop closing and graph optimization, we call it SLAM.
Figure 13: Loop Closing. source
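In its simplest form, loop-closure detection could be sketched as below: compare the current frame's binary descriptors against every stored keyframe and report the best match if it is strong enough. The names and the threshold are assumptions; real systems use bag-of-words retrieval (e.g. DBoW2) followed by a geometric consistency check instead of this brute-force loop.

```python
import cv2

def detect_loop(des_current, keyframe_descriptors, min_matches=50):
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    best_id, best_count = None, 0
    for kf_id, des_kf in keyframe_descriptors.items():
        count = len(matcher.match(des_current, des_kf))  # cross-checked matches
        if count > best_count:
            best_id, best_count = kf_id, count
    return best_id if best_count >= min_matches else None
```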
Open Source Monocular VO and SLAM algorithms
There is a plethora of open source Visual Odometry and SLAM algorithms. In the following we go over some of them and highlight some of their advantages and disadvantages. But first we give some insights into the different categories of such methods.
The two main categories of VO and SLAM algorithms are the feature-based and the direct methods.
Feature based Methods
Feature-based methods extract features from the images and try to match them. Additionally, RANSAC can be used to filter outliers. The motion is then found by minimizing the reprojection error.
The resulting transformation is derived as follows: \(\begin{align*} T_{k,k-1} = \arg\min_{T} \sum_{i} \lVert u'_i - \pi(p_i) \rVert^2 \end{align*}\) where $u'_i = \pi\left(T_{k,k-1}\left(\pi^{-1}(u_i)\, d\right)\right)$
Figure 14: Feature based Methods. source
Advantages:
- ability to work on large frame-to-frame motions
- is very accurate due to efficient optimization of structure and motion
Disadvantages:
- slow due to costly feature extraction and matching
- the matching produces outliers, which have to be filtered (e.g. with RANSAC)
Direct methods (photometric methods)
In contrast to the feature-based methods, the direct methods do not work with any features. As the name suggests, this type of method directly compares the photometric values of the image pixels. These methods therefore try to minimize the photometric error between the pixel values of the two images.
The resulting transformation is derived as follows: \(\begin{align*} T_{k,k-1} = \arg\min_{T} \sum_{i} \lVert I_k(u'_i) - I_{k-1}(u_i) \rVert^2 \end{align*}\) where $u'_i = \pi\left(T_{k,k-1}\left(\pi^{-1}(u_i)\, d\right)\right)$; a sketch follows Figure 15.
Figure 15: Direct Methods. source
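A sketch of evaluating this photometric error over a set of pixels, assuming the warped coordinates $u'_i$ have already been computed from the current transform estimate and the depth (the warping step $\pi(T(\pi^{-1}(u)\,d))$ is omitted here, and integer pixel coordinates are assumed instead of the usual sub-pixel interpolation):

```python
import numpy as np

def photometric_error(I_k, I_km1, pixels, warped_pixels):
    """I_k, I_km1: grayscale images; pixels, warped_pixels: Nx2 integer (x, y) coords."""
    err = 0.0
    for (x, y), (xw, yw) in zip(pixels, warped_pixels):
        diff = float(I_k[yw, xw]) - float(I_km1[y, x])  # intensity residual
        err += diff ** 2
    return err
```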
Advantages:
- all information in the image can be used, not only features (more precision and robustness)
- increasing the framerate, i.e. reducing the inter-frame motion, reduces the computational cost per frame
Disadvantages:
- limited frame-to-frame motion
- joint optimization of structure and motion is too expensive
The direct methods can be further split into categories depending on the number of pixels the photometric error is applied to. Dense methods consider all pixels in the image. Semi-dense methods consider only the pixels along edges, which is a greatly reduced number. Sparse methods apply the photometric error only to feature-like pixel regions.
Figure 16: Sparse to Dense. source
There are some differences between those methods. Dense and semi-dense behave quite similarly; as a result we only use the dense method when there is motion blur, because otherwise the semi-dense approach is more efficient. Sparse methods are only an option if the overlap between frames is more than 30%.
Figure 17: Comparison. source
ORB-SLAM
This is a feature-based method which uses FAST corners together with the ORB feature descriptor. The ORB descriptor is a binary descriptor which is fast to compute and allows for very fast matching using the Hamming distance. For the optimization it minimizes the reprojection error. This SLAM method also includes loop closing, relocalization and a final optimization. It works in real time (30 Hz).
LSD-SLAM
This is a semi-dense direct method and therefore minimizes the photometric error. In this approach the 3D geometry is represented as a semi-dense depth map. It optimizes the poses as well as the structure. LSD-SLAM also closes loops, uses relocalization, and applies a final optimization. It also works in real time (30 Hz).
DSO
This is a sparse direct method and therefore minimizes the photometric error. The 3D geometry is represented as a sparse set of points with large image gradients. It optimizes the poses and the structure at the same time using a sliding window. The optimization includes a photometric correction to compensate for exposure time changes. It also works in real time (30 Hz).
SVO
This is a direct method minimizing the photometric error, using edgelets and corners to estimate the frame-to-frame motion, but also a feature-based method minimizing the reprojection error for frame-to-keyframe pose refinement. Mapping uses probabilistic depth estimation. There are also variants of this algorithm that support fisheye and omnidirectional cameras as well as multi-camera systems. This algorithm is meant for high speed, i.e. 100-400 fps.
Figure 18: Comparison. source
Evaluating VO/SLAM algorithms / Benchmarking
NOT COVERED IN EXAM
As seen before, there are plenty of different algorithms available. Therefore it is important to be able to compare them and assess which one best fits the problem one wants to solve. This process of measuring the performance is called benchmarking. First of all we have to decide which metric should be used; depending on this metric, the evaluation can then be made.
Figure 19: Metrics. source
Accuracy
For the accuracy we want to measure the deviation between the ground truth position and the final estimation. However, how exactly is this deviation defined, and how can one handle different reference frames, scales or time stamps? We cannot just align the first pose and then measure the end-pose error, because it is not clear how many poses to use for the alignment and for the error measurement. Also, this is very sensitive to trajectory shapes, e.g. when the path makes a curve, and it cannot capture error statistics.
Absolute Trajectory Error (ATE)
For this error the trajectory is first aligned, and in a second step the root mean squared error between the aligned path and the ground truth is calculated. Unfortunately this needs many parameters which have to be specified, but as an advantage it results in a single-number metric (a sketch follows Figure 20). \(\begin{align*} \sqrt{\dfrac{\sum_{i=1}^{N} \lVert \hat{t}_i - s R t_i - T \rVert^2}{N}} \end{align*}\) where $\hat{t}_i$ is the ground truth and $R$, $T$, $s$ are the alignment parameters.
Figure 20: ATE. source
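As a sketch, the ATE could be computed as follows, assuming the alignment parameters $s$, $R$, $T$ have already been estimated (e.g. with Umeyama's method) and that `gt` and `est` are Nx3 arrays of time-associated positions.

```python
import numpy as np

def ate_rmse(gt, est, s, R, T):
    aligned = s * (R @ est.T).T + T  # apply the similarity transform to the estimate
    return np.sqrt(np.mean(np.sum((gt - aligned) ** 2, axis=1)))
```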