
Visual Odometry and Visual Algorithms [Part 13] - Dense 3D Reconstruction

Chapter 13 - Dense 3D Reconstruction

Lecture 12a

Slides 1 - 47

In all previous chapters, we focused on sparse features: small but easily recognizable, characteristic points in our scene. We tracked and matched such sparse features for pose estimation and structure from motion, to name just two applications. In this chapter, we will instead focus on dense reconstruction: we no longer track only special feature points but every single pixel. If we could successfully match every pixel between two images of the same object, we could generate a nearly perfect 3D representation, since we would know the depth of every pixel. For a successful dense 3D reconstruction, we need calibrated cameras (K is given) and their relative poses (R, t).

3D Reconstructions using multiple images of the same Object Figure 1: 3D Reconstructions using multiple images of the same Object. source

The degree to which the viewpoint changes highly influences our depth estimate. With a very large baseline, the depth of a point is well constrained, but it becomes difficult to find the corresponding point in the other image. In contrast, a small baseline gives us a good chance of finding the correspondence, but the depth error can be quite large, since a wide range of possible depths results in nearly the same view. The solution is to use relatively small baselines (we prefer reliable correspondences) and to select a new reference frame whenever the angle between new frames and the current reference image becomes too large. This of course starts a new 3D reconstruction, which we then have to merge with the previous ones.

Depth uncertainty with different viewpoint changes Figure 2: Depth uncertainty with different viewpoint changes. source

For a 3D reconstruction, we need a depth estimate for every single pixel. We therefore take one image as the reference image; every other image then acts as a matching partner to obtain new depth estimates for every pixel in the reference image. Note that matching pixels reveals several problems that we did not have when dealing with sparse features. Sparse features were specifically chosen to be matchable and recognizable; they are mostly blobs or corners. Now that we match every pixel, we also have to match edges and flat regions, which introduces a high degree of ambiguity that we have to overcome.

To overcome this challenge, we will perform two sequential steps. First, we will use local methods to estimate the depth of each pixel independently. Of course, the quality of this estimate heavily depends on the quality of the underlying patch. In a second step, we will fix bad estimates using global methods: we refine the depth map as a whole by making sure it is smooth and does not have abrupt changes. This is called regularization.

Local Method: Aggregated Photometric Error

First, let’s focus on the local methods. Given two images and their relative pose [R, t], we can construct an epipolar line for each pixel that tells us where to search for the corresponding pixel in the second image. The position along the epipolar line corresponds to the depth. Let’s assume we have a reference image and a subsequent frame taken to its right. As seen from the subsequent frame, the reference camera lies at the left end of the epipolar line. The further we travel along the epipolar line to the right, the farther away the hypothesized 3D point is from the reference camera. The position on the epipolar line at which we find the match therefore defines the depth.

Given one pixel in the reference image, we can therefore loop over all subsequent frames and compute, for each position on the epipolar line, a correspondence error. The sum of these errors over all subsequent frames is called the Aggregated Photometric Error. To simplify the problem a little, we only search for correspondences at fixed depth levels, i.e. at fixed positions along the epipolar line. We therefore get a correspondence error from every subsequent frame at the same depth hypotheses, making it easier to find the optimal depth value later on. It also drastically reduces the required computation, since we can define the minimum and maximum search depth we want to explore, as well as how many depth values we want to try between them. If we have good a-priori knowledge about the scene, we can speed up the computation considerably.

Aggregated Photometric Error Figure 3: Aggregated Photometric Error. source

Let’s say we want to explore 100 depth values between 0m and 1m for a set of 240x180 pixel images. For each pixel, we get 100 x n error values (where n is the number of images minus 1). If we aggregate them over all frames, we are left with 100 values per pixel, one for each of the 100 equally spaced depths between 0m and 1m. We can represent these values in a tensor of dimension 240x180x100. This tensor is often called the Disparity Space Image, or DSI for short. Low values in the DSI correspond to good estimates (low error), while large values in the DSI imply bad estimates (high error).
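
As a rough sketch of how such a DSI could be filled (not the lecture’s reference implementation), the loop below aggregates a photometric error over all frames and depth hypotheses. The helpers `photometric_error` and `reproject` are hypothetical placeholders; concrete versions are sketched further below.

```python
import numpy as np

def build_dsi(ref_img, other_imgs, depth_levels, photometric_error, reproject):
    """Aggregate the photometric error over all frames and depth hypotheses.

    ref_img:           reference image, shape (H, W), grayscale
    other_imgs:        list of images I_k with known pose relative to the reference
    depth_levels:      1D array of depth hypotheses, e.g. np.linspace(0.0, 1.0, 100)
    photometric_error: function (ref_img, img_k, u, v, u2, v2) -> scalar error
    reproject:         function (u, v, d, k) -> (u2, v2), the location on the
                       epipolar line in image k that corresponds to (u, v) at depth d
    """
    H, W = ref_img.shape
    D = len(depth_levels)
    dsi = np.zeros((H, W, D))                       # C(u, v, d)

    # Purely illustrative quadruple loop; in practice this is vectorized / run on a GPU.
    for v in range(H):
        for u in range(W):
            for i, d in enumerate(depth_levels):
                for k, img_k in enumerate(other_imgs):
                    u2, v2 = reproject(u, v, d, k)
                    dsi[v, u, i] += photometric_error(ref_img, img_k, u, v, u2, v2)
    return dsi
```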

The Disparity Space Image has three dimensions (u, v, d), where d is of course the depth hypothesis. For a given image point (u, v) with depth hypothesis d, the Aggregated Photometric Error with respect to the reference frame can now be accessed as C(u, v, d); C is just another notation for the DSI. The values in the DSI are computed as the photometric error (such as SSD, the L1 norm, or the Huber norm) between the reference image IRef(u, v) and all other images Ik(u', v'), where (u', v') is the location on the epipolar line that, assuming depth d, corresponds to (u, v) in the reference image.

Remember how we get (u', v') given the epipolar line and a depth d? We first need to back-project the point (u, v) from the reference image into 3D space. Since the depth d is given, the back-projection $\pi_{Ref}^{-1}$(u, v) scaled by d yields a fixed point in the reference camera's coordinate system. We then transform the point into the coordinate system of the k'th frame and project it with the projection function $\pi_k$. Voilà, we get the coordinates (u', v') in the image Ik for distance d.
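
A minimal sketch of this reprojection, assuming a pinhole model with intrinsics K shared by both cameras (an assumption for simplicity; in general each frame could have its own K):

```python
import numpy as np

def reproject_to_frame_k(u, v, d, K, R, t):
    """Map pixel (u, v) of the reference image at depth hypothesis d
    into pixel coordinates (u2, v2) of frame k.

    K:    3x3 camera intrinsics (assumed identical for both frames)
    R, t: rotation and translation of frame k w.r.t. the reference frame
    """
    # Back-project (u, v) to a 3D point in the reference camera frame:
    # P = d * K^-1 * [u, v, 1]^T
    p_ref = d * np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Transform the point into the coordinate frame of camera k
    p_k = R @ p_ref + t

    # Project into image k with the pinhole model and dehomogenize
    uv_h = K @ p_k
    return uv_h[0] / uv_h[2], uv_h[1] / uv_h[2]
```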

Calculate Disparity Space Image Figure 4: Calculate Disparity Space Image. source

To find the best depth guess among all values in the DSI for each pixel, we only need a function that minimizes the aggregated photometric error at each pixel location (u, v). The simplest solution is to take the depth with the smallest value, since this depth produced the smallest error.
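
With a `dsi` array of shape (H, W, D) like the one sketched above, extracting the depth map reduces to a per-pixel argmin over the depth dimension:

```python
import numpy as np

def depth_from_dsi(dsi, depth_levels):
    """Pick, for every pixel, the depth hypothesis with the smallest
    aggregated photometric error: d(u, v) = argmin_d C(u, v, d)."""
    best_idx = np.argmin(dsi, axis=2)        # (H, W) indices into depth_levels
    return depth_levels[best_idx]            # (H, W) depth map in metres
```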

Calculate final depth using DSI Figure 5: Calculate final depth using DSI. source

Note that we do NOT compare single pixels here. When we speak of I(u, v), we do not mean the single pixel at I(u, v) but rather the patch around (u, v). The patch can have variable size, and this changes our results to a great extent: the smaller we choose the patch, the more detailed the results, but we also get many outliers and a lot of noise. If we choose a large patch, we get less noise and smoother disparity maps, but the map will lack detail.
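
As an illustration, a simple SSD-based photometric error over a square patch might look like the sketch below (boundary handling and sub-pixel interpolation are deliberately ignored):

```python
import numpy as np

def ssd_patch_error(ref_img, img_k, u, v, u2, v2, radius=2):
    """Sum of squared differences between the patch around (u, v) in the
    reference image and the patch around (u2, v2) in image k.
    A radius of 2 corresponds to a 5x5 patch."""
    u2, v2 = int(round(u2)), int(round(v2))
    patch_ref = ref_img[v - radius:v + radius + 1, u - radius:u + radius + 1]
    patch_k   = img_k[v2 - radius:v2 + radius + 1, u2 - radius:u2 + radius + 1]
    return np.sum((patch_ref.astype(float) - patch_k.astype(float)) ** 2)
```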

Influence of Patch Size on Depth Map Figure 6: Influence of Patch Size on Depth Map. source

We talked about how the quality of our depth estimate depends on the feature quality. While corners can be matched with high confidence, edges are already harder to match, not to mention the ambiguity introduced by flat regions. We can see this when we plot the individual photometric error values over the depth d: for flat regions, the response is essentially random and no clear minimum can be found in the aggregated photometric error. For distinct corners, however, we find a clear and sharp minimum in the aggregated photometric error.

(Aggregated) Photometric Error for different features Figure 7: (Aggregated) Photometric Error for different features. source

Global Method: Regularization

Now that we have a first depth map, we have to remove the noise produced by bad features such as flat regions. A convenient way to do this is to penalize non-smooth changes in the depth map by adding a regularization term to the optimization. So instead of just taking the minimum value among all depth hypotheses at a given pixel, we perform piecewise smoothing first: we look at the global scope, not just the local one, and can therefore detect overly abrupt changes and outliers.

The regularization term is added to the function that chooses which depth values to take from the DSI. Previously, we simply took the argument of the minimum. Now, we add a regularization term, parametrized by the regularization factor lambda λ. With increasing λ, we increase the influence of the regularization, which leads to a smoother result.
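
Conceptually, we now minimize an energy of the form E(d) = Σ_{u,v} C(u, v, d(u, v)) + λ · E_reg(d), where E_reg penalizes large depth gradients. The sketch below only shows how such an energy could be evaluated for a candidate depth map; the actual minimization would require an iterative solver, which is omitted here, and the exact form of the data and smoothness terms is an assumption.

```python
import numpy as np

def regularized_energy(depth_map, dsi, depth_levels, lam):
    """Data term + lambda * smoothness term for a candidate depth map."""
    # Data term: look up C(u, v, d) in the DSI for the depth level closest
    # to the candidate depth of each pixel.
    idx = np.abs(depth_levels[None, None, :] - depth_map[..., None]).argmin(axis=2)
    data_term = np.take_along_axis(dsi, idx[..., None], axis=2).sum()

    # Smoothness term: penalize abrupt depth changes between neighbouring pixels.
    dv, du = np.gradient(depth_map)
    smoothness = np.sum(np.sqrt(du ** 2 + dv ** 2))

    return data_term + lam * smoothness
```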

Regularization term Figure 8: Regularization term. source

Depth map change with increasing lambda Figure 9: Depth map change with increasing lambda. source

We choose the regularization function in a way that ensures we do not destroy real discontinuities and sharp edges in our scene, exploiting the fact that depth discontinuities often coincide with intensity discontinuities. Where the image intensity is smooth, a sharp depth change is most likely noise rather than a real change in 3D shape, so we regularize these regions heavily. However, if we observe a sharp depth change that coincides with a sudden change in intensity, we assume this is a real change of the object surface and apply only a small regularization.

We can therefore make the regularization term dependent on the image gradient: the higher the gradient, the smaller the regularization. We can do so by weighting the regularization term with a monotonically decreasing function of the squared image gradient (a large gradient maps to a small weight, a small gradient to a large one).
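
One common choice (used here as an illustrative assumption, not necessarily the exact function from the slides) is an exponentially decreasing weight of the squared image gradient:

```python
import numpy as np

def smoothness_weight(ref_img, sigma=10.0):
    """Per-pixel weight for the regularization term: close to 1 in flat
    image regions (strong smoothing), close to 0 at strong intensity
    edges (weak smoothing): rho(grad I) = exp(-|grad I|^2 / sigma^2)."""
    gv, gu = np.gradient(ref_img.astype(float))
    grad_sq = gu ** 2 + gv ** 2
    return np.exp(-grad_sq / sigma ** 2)
```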

Regularization relative to gradient Figure 10: Regularization relative to gradient. source

GPU for 3D reconstructions

Compared to CPUs, GPUs are slow at sequential tasks but can perform thousands of tasks in parallel, since they have a large number of cores. This makes GPUs very efficient for parallelizable tasks such as image manipulation, matrix operations, etc.

In dense 3D reconstruction, GPUs can take over many of the important tasks:

  • Image processing (Filtering, Feature extraction, Warping)
  • Multiple-view geometry (correspondence search, Aggregated Photometric Error)
  • Global optimization (Regularization)