
Visual Odometry and Visual Algorithms [Part 17] - Application questions

Application questions

1. Summarize the building blocks of a visual odometry (VO) or visual SLAM (VSLAM) algorithm

  • Image capturing (sequential)
  • Feature detection & description (e.g. SIFT)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Motion estimation: 2D-2D (8-point + RANSAC) for bootstrapping, then 3D-2D (P3P + 1 point), or 3D-3D (minimizing the Euclidean distance between point clouds)
  • Local optimization (bundle adjustment & pose-graph optimization)
  • Global optimization (SLAM): loop closing
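
As a rough illustration of how the first four blocks chain together, here is a minimal two-frame sketch using OpenCV. The function names are OpenCV's; the surrounding structure, the input images `img0`, `img1` and the calibration matrix `K` are assumptions, and the local/global optimization blocks are omitted.

```python
import cv2
import numpy as np

def relative_pose(img0, img1, K):
    """Estimate the relative camera motion between two grayscale frames (2D-2D case)."""
    sift = cv2.SIFT_create()                        # feature detection & description
    kp0, des0 = sift.detectAndCompute(img0, None)
    kp1, des1 = sift.detectAndCompute(img1, None)

    matcher = cv2.BFMatcher()                       # feature matching with ratio test
    matches = matcher.knnMatch(des0, des1, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
            good.append(pair[0])

    pts0 = np.float32([kp0[m.queryIdx].pt for m in good])
    pts1 = np.float32([kp1[m.trainIdx].pt for m in good])

    # motion estimation: essential matrix with RANSAC, then decomposition
    E, mask = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K)
    return R, t   # translation is only known up to scale
```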

2. Augmented reality (AR) is a view of a physical scene augmented by computer-generated sensory inputs, such as data or graphics. Suppose you want to design an augmented reality system that superimposes text labels on the image of real physical objects. Summarize the building blocks of an AR algorithm.

  • VO for estimating the camera pose
  • 3D point cloud
  • Surface estimation (dense reconstruction if accuracy is needed)
  • (Recognition module for identifying the object and retrieving its name)
  • Add the label in 3D (anchor it to a point in the existing cloud)
  • Project the label as a mesh onto the image plane
  • Display

3. Suppose that your task is to reconstruct an object from different views. How do you proceed?

  • Hierarchical SFM
  • Feature detection & description (SIFT)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Cluster the images into groups of 3 images that share the same features
  • Take two of these images to construct a point cloud (8-point algorithm)
  • Merge the 3rd view (3-point RANSAC) & BA
  • Merge the clusters pairwise (3D-3D) & BA

4. Building a panorama stitching application. Summarize the building blocks.

  • Feature detection & description (SIFT, rotation invariant)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Warp the 2nd image to minimize the Euclidean distance with respect to the feature locations in the first image
  • For good results, rotate the camera around one single point (its optical center); then the purely rotational warp is exact

5. How would you design a mobile tourist app? The user points the phone in the direction of a landmark and the app displays a tag with its name. How would you implement it?

  • google API
  • landmark recognition
  • DB of images
  • Bag Of Words

6. Assume that we have several images downloaded from Flickr showing the two towers of Grossmünster. Since such images were uploaded by different persons they will have different camera parameters (intrinsic and extrinsic), different lighting, different resolutions and so on. If you were supposed to create a 3D model of Grossmünster, what kind of approach would you use? Can you get a dense 3D model, or will it be a sparse one? Please explain the pipeline that you propose for this scenario.

  • Hierarchical SFM
  • 8-point algorithm, but with the fundamental matrix instead of the essential matrix (intrinsics unknown)
  • Only really solvable if the focal length is the same (only cluster images with the same f, using the info on Flickr)
  • For the cluster merging we can merge across different f
  • Zero-mean the patches (to handle the different lighting)
  • Use SIFT for the features -> a sparse model
  • Once we have normalized for f, we can also run a dense reconstruction

7. Assume that you move around a statue with a camera and take pictures in a way that the statue is not far from the camera and always completely visible in the image. If you were supposed to find out where the pictures were taken, what would you do with the images? What kind of approach would you use? Since the camera motion is around the statue, the images contain different parts of the statue. How do you deal with this problem?

  • small movements between images

8. Suppose that you have two robots exploring an environment, explain how the robots should localize themselves and each other with respect to the environment? What are the alternative solutions?

  • same bootstrap results in same world frame

Theoretical questions

01 – Introduction

1. Provide a definition of Visual Odometry.

  • VO is the process of incrementally estimating the pose of the vehicle by examining the changes that motion induces on the images of its onboard cameras

2. Explain the most important differences between VO, VSLAM and SFM.

  • VO focuses on estimating the 6DoF motion of the camera sequentially (as a new frame arrives) and in real time.
  • SFM is more general than VO and tackles the problem of 3D reconstruction and 6DOF pose estimation from unordered image sets
  • VSLAM is VO with loop detection and closure

3. Describe the needed assumptions for VO.

  • Sufficient illumination in the environment
  • Dominance of static scene over moving objects
  • Enough texture to allow apparent motion to be extracted
  • Sufficient scene overlap between consecutive frames

4. Illustrate its building blocks

  • Image capturing (sequential)
  • Feature detection & description (e.g. SIFT)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Motion estimation: 2D-2D (8-point + RANSAC) for bootstrapping, then 3D-2D (P3P + 1 point), or 3D-3D (minimizing the Euclidean distance between point clouds)
  • Local optimization (bundle adjustment & pose-graph optimization)
  • Global optimization (SLAM): loop closing

02-03 – Image Formation

1. Explain what a blur circle is

  • “Circle of Confusion” or “Blur Circle”
  • For a fixed film distance from the lens, there is a specific distance between the object and the lens, at which the object appears “in focus” in the image, other points project to a “blur circle” in the image.
  • R = L·d/(2e), where L is the aperture (lens) diameter, e the distance at which the object is in focus, and d the distance between the in-focus plane and the actual image plane

2. Derive the thin lens equation and perform the pinhole approximation

  • B/A = e/z (similar triangles through the lens center)
  • B/A = (e-f)/f = e/f - 1 (similar triangles through the focal point)
  • => 1/f = 1/z + 1/e (thin lens equation)

  • Pinhole approximation: for z >> f we have e ≈ f
  • => B/A ≈ f/z
  • => B = (f/z)·A

3. Define vanishing points and lines

  • Parallel lines in the world intersect in the image at a “vanishing point”
  • Parallel planes in the world intersect in the image at a “vanishing line”

4. Prove that parallel lines intersect at vanishing points

  • x = (f/Z)X
  • y = (f/Z)Y
  • L1 = P1 + s·[l, m, n]^T
  • L2 = P2 + s·[l, m, n]^T (parallel lines share the direction [l, m, n]^T)
  • x_VP = lim_{s->inf} f·(X+sl)/(Z+sn) = f·(l/n)
  • y_VP = lim_{s->inf} f·(Y+sm)/(Z+sn) = f·(m/n)
  • The limit does not depend on P1 or P2, so all lines with this direction meet at the same image point

5. Explain how to build an Ames room

  • The walls are not parallel but trapezoidal
  • The brain assumes the walls are parallel (perpendicular), which creates the illusion

6. Derive a relation between the field of view and the focal length

  • Tangent = opposite leg over adjacent leg
  • tan(FoV/2) = (W/2)/f
  • W is the sensor width (or height), f the focal length
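
Plugging the relation in as a quick numeric check (the sensor width and focal length values are made up):

```python
import math

def fov_deg(sensor_width_mm, focal_length_mm):
    """Horizontal field of view from tan(FoV/2) = (W/2)/f."""
    return 2 * math.degrees(math.atan((sensor_width_mm / 2) / focal_length_mm))

print(fov_deg(36.0, 50.0))   # full-frame sensor with a 50 mm lens -> ~39.6 degrees
```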

7. Explain and write the equations of the perspective projection, including lens distortion and world to camera projection. (See the distortion equation, p. 62.)

8. Given an image and the associated camera pose, how would you superimpose a virtual object on the image (for example, a virtual cube). Describe the steps involved.

  • The virtual object mesh is defined by vertices -> bring them into the camera frame using R and t
  • Project into the image frame using the intrinsics K
  • Normalize by the z-coordinate (lambda) -> (u, v)
  • Apply the lens distortion (to the virtual object)
  • Draw the object
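
A minimal sketch of these steps (lens distortion is skipped; the cube vertices, K, R and t below are made-up example values):

```python
import numpy as np

def project_points(P_world, K, R, t):
    """Project Nx3 world points into pixel coordinates (u, v)."""
    P_cam = R @ P_world.T + t.reshape(3, 1)        # world -> camera frame
    uv_hom = K @ P_cam                             # camera frame -> image frame
    uv = uv_hom[:2] / uv_hom[2]                    # normalize by lambda (= z)
    return uv.T

# unit cube anchored at some 3D point (e.g. on the existing point cloud)
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
K = np.array([[420.0, 0, 320.0], [0, 420.0, 240.0], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 4.0])        # example camera pose
pixels = project_points(cube, K, R, t)             # draw the cube edges between these points
```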

9. Normalized image coordinates and geometric explanation.

  • Normalized image coordinates are obtained by applying the inverse of K, which normalizes the focal length to 1
  • Geometric interpretation: similar triangles with the image plane moved to distance 1 from the camera center

10. Describe the general PnP problem and derive the behavior of its solutions. What’s the minimum number of points and what are the degenerate configurations?

  • In PnP the 3D points and their images are given and we want to find the pose of the camera
  • With 1 point we have infinitely many solutions
  • With 2 points a bounded infinity of solutions (spindle torus)
  • With 3 points we get 3 independent equations with 3 unknowns in total -> product of the degrees of the equations = 2·2·2 = 8 solutions \(\begin{align*} {s_i}^2={A_i}^2+{B_i}^2 - 2A_iB_i\cos(\theta_i) \end{align*}\)
  • 4 positive and 4 negative solutions; only the positive ones are geometrically valid
  • => use 3+1 points to make the system overdetermined and disambiguate the 4 solutions
  • Degenerate configurations: points collinear with the camera center (on the same projection ray), and the 3 points must not be collinear

11. Explain the working principle of the P3P algorithm. What are the algebraic trigonometric equations that it attempts to solve?

  • See the law-of-cosines equation in question 10: P3P solves the system of three such equations (one per pair of points) for the unknown distances

12. Explain and derive the DLT for both 3D objects or planar grids. What is the minimum number of point correspondences it requires (for 3D objects and for planar grids)? (Figures: the DLT matrix equation and the Q matrix.)

  • Solve with SVD: the vectorized M is the singular vector corresponding to the smallest singular value; then decompose M into intrinsics and extrinsics (QR decomposition)
  • For planar grids, simply ignore the third world coordinate (Z = 0)
  • 3D: 6 points (non-coplanar, not on a twisted cubic, and no 2 points collinear with the center of projection)
  • Planar: a minimum of 4 non-collinear points is required
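
A sketch of the DLT for the 3D case, following the steps above: build the Q matrix from the correspondences, take the right singular vector of the smallest singular value and reshape it to M. The QR step that splits M into K[R|t] is omitted, and the row layout is one common convention (not necessarily the exact one on the slides).

```python
import numpy as np

def dlt(P_world, p_img):
    """P_world: Nx3 world points, p_img: Nx2 pixel coords, N >= 6. Returns the 3x4 matrix M (up to scale)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(P_world, p_img):
        Pw = [X, Y, Z, 1.0]
        rows.append([0, 0, 0, 0] + [-c for c in Pw] + [v * c for c in Pw])
        rows.append(Pw + [0, 0, 0, 0] + [-u * c for c in Pw])
    Q = np.asarray(rows)                      # 2N x 12 system Q m = 0
    _, _, Vt = np.linalg.svd(Q)
    M = Vt[-1].reshape(3, 4)                  # singular vector of the smallest singular value
    return M                                  # defined only up to scale
```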
13. Define central and non-central omnidirectional cameras.

  • Non-central omnidirectional cameras have mirrors where the reflected rays do not intersect in a single point (behind the mirror); in central ones they do.

14. What kind of mirrors ensure central projection?

  • The mirror must be a surface of revolution of a conic:
  • Hyperbola + perspective camera
  • Parabola + orthographic lens

15. What do we mean by normalized image coordinates on the unit sphere?

04 – Filtering and edge detection

1. Explain the differences between convolution and correlation

  • Convolution flips the filter first and is associative and commutative; correlation does not flip the filter

2. Explain the differences between a box filter and a Gaussian filter

  • A box filter weights all pixels in the window equally
  • A Gaussian filter weights the central pixels more

3. Explain why one should increase the size of the kernel of a Gaussian filter if 2𝜎 is close to the size of the kernel

  • At the border the filter values should be close to zero; otherwise the truncated kernel causes aliasing artifacts

4. Explain when we would need a median & bilateral filter

  • Median filter: denoising while preserving strong edges
  • Bilateral filter: preserves all edges up to a threshold (only averages over pixels with similar intensity)

5. Explain how to handle boundary issues

  • zero padding (black)
  • wrap around
  • copy edge
  • reflect across edge

6. Explain the working principle of edge detection with a 1𝐷 signal

  • smoothing -> gaussian
  • detection -> derivative
  • combined ==> derivative of gaussian ==> maximum/minimum is edge
  • laplacian of gaussian ==> zero crossing is edge
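
A small sketch of the 1D case using SciPy's Gaussian derivative filter (the step signal, noise level and sigma are made up):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

signal = np.r_[np.zeros(50), np.ones(50)] + 0.05 * np.random.randn(100)  # noisy step

# smoothing + differentiation in one pass: derivative of Gaussian
dog = gaussian_filter1d(signal, sigma=3, order=1)
edge_at = np.argmax(np.abs(dog))          # extremum of the response -> edge location

# Laplacian of Gaussian: second derivative, the edge sits at the zero crossing
log = gaussian_filter1d(signal, sigma=3, order=2)
```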

7. Explain how noise does affect this procedure

  • With noise the derivative is high everywhere, so true edges can no longer be distinguished

8. Explain the differential property of convolution

  • Instead of using two convolutions (one with the Gaussian, one with the derivative), convolve the image once with the derivative of the Gaussian filter: d/dx (f * g) = f * (dg/dx)

9. Show how to compute the first derivative of an image intensity function along 𝑥 and 𝑦

  • With a [-1, 1] filter, or Sobel / Prewitt kernels, applied along x and y

10. Explain why the Laplacian of Gaussian operator is useful

  • laplacian of gaussian ==> zero crossing is edge
  • ==> more accurate/simple localization of edge

11. List the properties of smoothing and derivative filters

Smoothing filter:
  • has all positive values
  • sums to 1 -> preserves the brightness of constant regions
  • removes “high-frequency” components: “low-pass” filter

Derivative filter:
  • has opposite signs, used to get a high response in regions of high contrast
  • sums to 0 -> no response in constant regions
  • highlights “high-frequency” components: “high-pass” filter

12. Illustrate the Canny edge detection algorithm

  • Convolve with the derivative of the Gaussian (along x and y)
  • Threshold the gradient magnitude to remove low values
  • Keep only the local maxima along the gradient direction (non-maxima suppression)

13. Explain what non-maxima suppression is and how it is implemented

  • Thinning: non-maxima suppression (local-maxima detection) along gradient direction
  • At each edge pixel we have the x and y derivatives; combining them gives the gradient direction

05-06 – Point feature detection, descriptor, and matching

1. Explain what is template matching and how it is implemented

  • Move the template over the image and compute the correlation at every position; high correlation means a good match
  • Different types of correlation (similarity measures) can be used
  • The output is a correlation map

2. Explain what are the limitations of template matching. Can you use it to recognize cars.

  • It can only find near pixel-accurate copies of the template
  • Cars are not pixel-accurate: their appearance varies too much (viewpoint, scale, model, illumination)

3. Illustrate the similarity metrics SSD, SAD, NCC, and Census transform

  • NCC: normalized cross-correlation, the cosine of the angle between the two normalized patch vectors
  • SSD: sum of squared differences
  • SAD: sum of absolute differences
  • Census transform: for each of the 8 pixels around the central pixel write 1 if it is larger than the center, else 0; starting top-left, append the bits (to the left) into a bit string; compare patches with the Hamming distance
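
Minimal NumPy versions of the first three metrics (the patches are assumed to be same-sized float arrays; the census transform is omitted):

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences: larger error -> larger value, best value is 0."""
    return float(np.sum((a - b) ** 2))

def sad(a, b):
    """Sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))

def ncc(a, b):
    """Cosine of the angle between the patch vectors (subtract the means first for zero-mean NCC)."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```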

4. What is the intuitive explanation behind SSD and NCC?

NCC:
  • perpendicular patch vectors => 0
  • opposite => -1
  • identical => 1

SSD:
  • the larger the error, the larger the value
  • the best value is 0

5. Explain what are good features to track. In particular, can you explain what are corners and blobs together with their pros and cons. How is their localization accuracy?

  • Corners are good, edges are medium, flat regions are bad; blobs are also good: any image pattern that is not a corner but differs significantly from its neighbors
  • Corners are regions with high contrast in at least 2 directions
  • Corners have high localization accuracy but are less distinctive
  • Blobs have lower localization accuracy but are more distinctive
  • Corners are better for VO
  • Blobs are better for place recognition

6. Explain the Harris corner detector. In particular: - a. Use the Moravec definition of corner, edge and flat region.

  • Slide a window in different directions and compute the intensity change
  • No intensity change in any direction => flat; change in only one direction => edge; change in at least 2 directions => corner

- b. Show how to get the second moment matrix from the definition of SSD and first order approximation (show that this is a quadratic expression) and what is the intrinsic interpretation of the second moment matrix using an ellipse?

  • $SSD(\Delta x, \Delta y) = \sum (I(x,y)-I(x+\Delta x, y+\Delta y))^2 \approx \sum (I_x(x,y)\Delta x + I_y(x,y)\Delta y)^2$ (first-order Taylor approximation)
  • This is quadratic in $(\Delta x, \Delta y)$ and can be written as $[\Delta x, \Delta y]\,M\,[\Delta x, \Delta y]^T$ with $M = [[\sum I_x^2, \sum I_xI_y][\sum I_xI_y, \sum I_y^2]]$
  • The ellipse $[\Delta x, \Delta y]\,M\,[\Delta x, \Delta y]^T = const$ visualizes M: the shorter axis of the ellipse points in the direction of fastest intensity change
  • The longer axis is therefore along the edge
  • If both axes are large => flat region

- c. What is the M matrix like for an edge, for a flat region, for an axis-aligned 90-degree corner and for a non-axis—aligned 90-degree corner?

  • Flat: both eigenvalues small
  • Edge: one eigenvalue large, the other small
  • Corner: both eigenvalues large
  • Axis alignment does not matter, since the eigenvectors give the directions

- d. What do the eigenvalues of M reveal?

  • The magnitude of the intensity change in the direction of the corresponding eigenvector

- e. Can you compare Harris detection with Shi-Tomasi detection?

  • Harris: R = det(M) - k·trace(M)^2
  • Shi-Tomasi: R = min(lambda_1, lambda_2)
  • Harris is a bit cheaper since the eigenvalues do not need to be computed explicitly, but it needs a magic number k
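
A compact sketch that computes both cornerness measures from the same second-moment matrix M (the image is assumed to be a float grayscale array; the k and sigma values are conventional choices, and the SciPy filters are only one way to build the gradients and the window sum):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def cornerness(img, k=0.04, sigma=1.5):
    Ix, Iy = sobel(img, axis=1), sobel(img, axis=0)        # image gradients
    # entries of the second moment matrix M, summed with a Gaussian window
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)

    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    harris = det_M - k * trace_M ** 2                      # no explicit eigenvalues needed
    # Shi-Tomasi: smaller eigenvalue of the 2x2 symmetric matrix M (closed form)
    shi_tomasi = trace_M / 2 - np.sqrt(np.maximum((trace_M / 2) ** 2 - det_M, 0.0))
    return harris, shi_tomasi
```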

- f. Can you explain whether the Harris detector is invariant to illumination or scale changes? Is it invariant to view point changes?

  • Invariant to constant (affine) illumination changes
  • Not invariant to scale changes -> corners become edges when upscaling
  • Invariant to viewpoint changes as long as a corner stays a corner

- g. What is the repeatability of the Harris detector after rescaling by a factor of 2?

  • Low, since it cannot handle scale changes -> corners become edges when upscaling

1. How does automatic scale selection work?

  • Compute a cornerness/blob response over different patch sizes and, for each image independently, pick the size at which the response is maximal

  • Sharp, local intensity changes are good regions to monitor in order to identify the scale
  • The ideal function for determining the scale is one that highlights sharp discontinuities
  • Solution: convolve image with a kernel that highlights edges (LoG)
  • Correct scale is found as local maxima or minima across consecutive smoothed images

2. What are the good and the bad properties that a function for automatic scale selection should have or not have?

  • A “good” function for scale detection should have a single & sharp peak
  • The ideal function for determining the scale is one that highlights sharp discontinuities
  • Multiple peaks are bad, because the scale (and therefore the matching) becomes ambiguous

3. How can we implement scale invariant detection efficiently? (show that we can do this by resampling the image vs rescaling the kernel).

  • SIFT
  • Instead of using ever larger filters, downsample the image: this improves the runtime since larger filters need more resources, and a smaller image also reduces the number of positions (pixels) to evaluate

4. What is a feature descriptor? (patch of intensity value vs histogram of oriented gradients). How do we match descriptors?

  • A unique, recognizable representation of a feature
  • The simplest descriptor is the intensity patch itself
  • Matching: compute the difference (e.g. SSD) between descriptors and match the ones with the lowest error
  • More robust: only accept a match when the best match is much better than the second best (ratio 0.8)

5. How is the keypoint detection done in SIFT and how does this differ from Harris?

  • Build an octave of Gaussian-filtered images with different sigmas
  • Take the differences of these Gaussians (DoG)
  • Take a 3x3x3 box across the layered DoGs; if the central pixel is the extremum and above a threshold, a keypoint is found
  • Compared to multi-scale Harris, SIFT is faster since it downsamples and can therefore quickly cover a large range of sigmas

6. How does SIFT achieve orientation invariance?

  • The SIFT descriptor is orientation invariant because it is aligned with the dominant gradient orientation (the HoGs of the cells are circularly shifted so that the overall largest value comes first)

7. How is the SIFT descriptor built?

  • Take a 16x16 patch around the keypoint (at its detected scale)
  • Apply a Gaussian weighting over the whole patch
  • Divide it into 4x4 cells
  • Compute a HoG for each cell
  • Concatenate all HoGs (128-dimensional descriptor)
  • Shift according to the dominant orientation

8. What is the repeatability of the SIFT detector after a rescaling of 2? And for a 50 degrees’ viewpoint change?

  • A 50-degree viewpoint change is the upper limit for which it works reasonably
  • At a 50-degree viewpoint change: ~50% correct matches
  • Rescaling by 2: scale invariant (if enough octaves are used)

9. Illustrate the 1st to 2nd closest ratio of SIFT detection: what’s the intuitive reasoning behind it? Where does the 0.8 factor come from?

  • Accept a match only if d1/d2 < 0.8, where d1 and d2 are the distances to the closest and second-closest descriptor; ambiguous (non-distinctive) features have d1 ≈ d2 and are rejected. The 0.8 threshold comes from Lowe's empirical study of correct vs. false matches.
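
A minimal brute-force sketch of the ratio test on raw descriptor matrices (the arrays `desc1`, `desc2` are assumed, with at least two descriptors in `desc2`):

```python
import numpy as np

def ratio_test_match(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 (N1xD) to desc2 (N2xD); keep only unambiguous matches."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]            # closest and second-closest
        if dists[j1] < ratio * dists[j2]:         # d1/d2 < 0.8 -> distinctive match
            matches.append((i, j1))
    return matches
```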

10. How does the FAST detector work? What are its pros and cons compared with Harris?

  • Consider a ring of (usually 16) pixels around the center pixel and check how many consecutive pixels are brighter than the center
  • If 9 or more consecutive pixels are all brighter or all darker (by a threshold), a feature is found
  • Compared with Harris: much faster, but more sensitive to noise

07 – Stereo Vision

1. Can you relate Structure from Motion to 3D reconstruction? What’s their difference?

  • SFM creates sparse scene point cloud, and gives the camera positions
  • dense 3D reconstruction needs camera poses given and then creates a dense map
  • SFM is non-sequential; dense reconstruction (in the case shown) is sequential
  • SFM based on features
  • dense reconstruction based on pixels

  • 3D reconstruction: K, R, t are known; goal: the 3D structure
  • SFM: nothing is known; goal: the 3D structure and K, R, t

2. Can you define disparity in both the simplified and the general case?

  • Simplified (rectified) case: the horizontal pixel displacement between the two images along the main axis
  • General case: the displacement along the epipolar line

3. Can you provide a mathematical expression of depth as a function of the baseline, the disparity and the focal length?

  • depth = (baseline * focal length)/ disparity (Only in simplified case)

4. Can you apply error propagation to derive an expression for depth uncertainty and express it as a function of depth? How can we improve the uncertainty?

  • Error propagation with respect to the disparity: Delta-depth = |d(depth)/d(disparity)| · Delta-disparity
  • ==> Delta-depth = (b·f / disparity^2) · Delta-disparity
  • ==> Delta-depth = (depth^2 / (b·f)) · Delta-disparity
  • Improve with a larger baseline, a better resolution (smaller Delta-disparity), or a larger focal length
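
Plugging in the two formulas above (the baseline, focal length and disparity values are made up):

```python
def stereo_depth(baseline_m, focal_px, disparity_px):
    return baseline_m * focal_px / disparity_px

def depth_uncertainty(baseline_m, focal_px, depth_m, disparity_error_px):
    # Delta-depth = depth^2 / (b * f) * Delta-disparity
    return depth_m ** 2 / (baseline_m * focal_px) * disparity_error_px

z = stereo_depth(0.12, 700.0, 20.0)                 # 4.2 m
print(z, depth_uncertainty(0.12, 700.0, z, 0.5))    # ~0.105 m for a half-pixel disparity error
```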

5. Can you analyze the effects of a large/small baseline? Can you illustrate it with a sketch?

  • See the formula above: a large baseline gives a low depth uncertainty but small overlap between the views; a small baseline gives a large depth uncertainty. (Figure: large and small baseline.)

6. What is the closest depth that a stereo camera can measure?

  • Closest depth = b·f / max-disparity
  • Max disparity = width-left-image/2 - (-width-right-image/2) = (width-left-image + width-right-image)/2
  • For identical cameras: max disparity = the image width, so the closest depth is b·f / width

7. Are you able to show mathematically how to compute the intersection of two lines (linearly and non-linearly)?

  • Linear: lambda · p_1 = M_1 · P; this equality tells us that the two vectors are parallel
  • ==> parallel vectors have a cross product of 0
  • ==> p_1 x (M_1 · P) = 0, and the same for the other camera
  • ==> p_2 x (M_2 · P) = 0
  • Stack the equations and solve the homogeneous system for P using SVD

  • Non-linear: initialize with the linear solution
  • Then refine by minimizing the sum of squared reprojection errors (SSRE)
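
A sketch of the linear (DLT) triangulation via the cross-product constraint above (M1, M2 are the 3x4 projection matrices, p1, p2 the homogeneous pixel coordinates; all are assumed given):

```python
import numpy as np

def skew(p):
    """Cross-product matrix [p]_x such that [p]_x q = p x q."""
    return np.array([[0, -p[2], p[1]],
                     [p[2], 0, -p[0]],
                     [-p[1], p[0], 0]])

def triangulate_linear(p1, p2, M1, M2):
    """Solve p1 x (M1 P) = 0 and p2 x (M2 P) = 0 for the homogeneous world point P."""
    A = np.vstack([skew(p1) @ M1, skew(p2) @ M2])   # 6x4 homogeneous system
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1]
    return P[:3] / P[3]                             # de-homogenize
```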

8. What is the geometric interpretation of the linear and non-linear approaches and what error do they minimize?

  • non-linear minimizes SSRE
  • linear: midpoint of the segment connecting the two backprojected rays (minimized distance from both rays)

9. Are you able to provide a definition of epipole, epipolar line and epipolar plane?

  • Epipoles: the intersections of the line connecting the two camera centers (the baseline) with the two image planes
  • Epipolar line: the intersection of the epipolar plane with an image plane (the backprojection of the other camera's ray onto this image plane)
  • Epipolar plane: the plane spanned by the two camera centers and the world point

10. Are you able to draw the epipolar lines for two converging cameras, for a forward motion situation, and for a side-moving camera? (Figure: examples.)

11. Are you able to define stereo rectification and to derive mathematically the rectifying homographies?

  • Warp both images to how they would look if the two cameras were perfectly aligned
  • Apply the inverse of each camera's rotation and intrinsics to get the normalized representation of both views (not directly displayable)
  • Apply a common (“mean”) rotation and intrinsic matrix to both views
  • K-hat = average of the two K matrices
  • R-hat: r_hat_1 = the normalized vector between the camera centers, (C_2 - C_1)/||C_2 - C_1||
  • r_hat_2 = r_L_3 x r_hat_1 (with r_L_3 the optical axis of the left camera)
  • r_hat_3 = r_hat_1 x r_hat_2

12. How is the disparity map computed?

  • Rectify the images
  • Match points along the epipolar line (using a similarity measure such as cross-correlation)
  • Measure the disparity of each match
  • The result is the disparity map

13. How can we establish stereo correspondences with subpixel accuracy?

  • Interpolate the similarity (e.g. cross-correlation) function around the best integer disparity, e.g. with a parabola, and take its extremum
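
A sketch of the usual parabola fit through the best cost and its two neighbours (a cost curve indexed by integer disparity is assumed; for a correlation score, take the argmax instead of the argmin):

```python
import numpy as np

def subpixel_disparity(costs):
    """costs[d] = matching cost at integer disparity d (lower is better)."""
    d = int(np.argmin(costs))
    if d == 0 or d == len(costs) - 1:
        return float(d)                       # no neighbours to interpolate with
    c_l, c_0, c_r = costs[d - 1], costs[d], costs[d + 1]
    # vertex of the parabola through the three points (-1, c_l), (0, c_0), (1, c_r)
    offset = 0.5 * (c_l - c_r) / (c_l - 2 * c_0 + c_r)
    return d + offset
```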

14. Describe one or more simple ways to reject outliers in stereo correspondences.

  • Take the direction of the motion into account: in the stereo (left/right) case, going from the left to the right image the correspondence should move to the left
  • Use a bigger patch

15. Is stereo vision the only way of estimating depth information? If not, are you able to list alternative options? (make link to other lectures of course)

  • no
  • n-view-SFM, 2-view-SFM, dense-3d-reconstruction

08-09– Multiple view geometry 2 and 3

1. What’s the minimum number of correspondences required for calibrated SFM and why?

  • 5, because we have 4n knowns for n correspondences (u, v in both views) and 5 unknowns for the motion (3 for the rotation, only 2 for the translation because of the scale ambiguity), plus 3 unknowns for each correspondence (its world point)
  • To solve the system we need at least as many knowns as unknowns
  • 4n >= 5 + 3n ==> n >= 5

2. Are you able to derive the epipolar constraint?

  • p_1 · n must be 0, because p_1 lies in the epipolar plane and n is, by definition, the normal of that plane
  • We express n as T x (R · p_2), where T is the translation and R the rotation between the camera frames
  • ==> [T]_x (the translation written as a cross-product matrix) · R = E (essential matrix)
  • Epipolar constraint: p_2^T · E · p_1 = 0

3. Are you able to define the essential matrix?

  • Yes: E = [T]_x · R; it relates corresponding normalized image coordinates of two calibrated views through the epipolar constraint p_2^T · E · p_1 = 0

4. Are you able to derive the 8-point algorithm?

  • Each p_i is expressed as (u_i, v_i, 1)
  • Write the epipolar constraint as a system of equations in the unknown elements of E
  • Write the system as a matrix multiplication in which E is expressed as a vector: Q · Ē = 0
  • Q is the vertical stack of one such equation per correspondence
  • Solve this homogeneous system using SVD
  • Ē is the singular vector corresponding to the smallest singular value
  • Then decompose E using SVD
  • Extract R and T according to the slides (won't be asked)

  • (Plus coordinate normalization, see questions 8-10)

5. How many rotation-translation combinations can the essential matrix be decomposed into?

  • Into 4, but only one of them places the points in front of both cameras

6. Are you able to provide a geometrical interpretation of the epipolar constraint? (Figure: epipolar constraint.)

7. Are you able to describe the relation between the essential and the fundamental matrix?

  • The fundamental matrix is used when the intrinsics are unknown, so it absorbs them
  • F = K_2^(-T) · E · K_1^(-1)

8. Why is it important to normalize the point coordinates in the 8-point algorithm?

  • Because otherwise Q can be poorly conditioned
  • Pixel coordinates range from 0 up to the image size, so when multiplying them the values can get very large (compared to the entries equal to 1)
  • This is purely a numerical problem when solving the system: it amplifies the noise

9. Describe one or more possible ways to achieve this normalization.

  • Move the origin to the image center and rescale the coordinates to [-1, 1]
  • Or: shift the mean to (0, 0) and normalize the standard deviation to sqrt(2) (Hartley scaling)
  • The normalization can be expressed as a matrix B_i: p_norm = B_i · p_i

10. Are you able to describe the normalized 8-point algorithm?

  • Normalize the point correspondences as described above (p_norm = B_i · p_i)
  • Estimate the normalized F_norm from the normalized coordinates
  • Un-normalize: F = B_2^T · F_norm · B_1
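
A sketch of the normalized 8-point algorithm for the fundamental matrix, following the three steps above (Hartley normalization, SVD solve, un-normalization). The rank-2 enforcement on F is included as the usual extra step; the exact normalization constants are one common convention.

```python
import numpy as np

def normalize(pts):
    """Hartley normalization: zero mean, average distance sqrt(2). Returns Nx3 points and B."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    B = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (B @ pts_h.T).T, B

def eight_point(pts1, pts2):
    """pts1, pts2: Nx2 corresponding pixel coordinates, N >= 8. Returns F with p2^T F p1 = 0."""
    x1, B1 = normalize(pts1)
    x2, B2 = normalize(pts2)
    # one row of Q per correspondence (epipolar constraint, F stacked as a 9-vector)
    Q = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(Q)
    F_norm = Vt[-1].reshape(3, 3)             # singular vector of the smallest singular value
    U, S, Vt2 = np.linalg.svd(F_norm)         # enforce rank 2
    F_norm = U @ np.diag([S[0], S[1], 0]) @ Vt2
    return B2.T @ F_norm @ B1                 # un-normalize: F = B2^T F_norm B1
```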

11. Are you able to provide quality metrics for the essential matrix estimation?

  • Algebraic error: the deviation of p_2^T · E · p_1 from 0
  • Directional error: the normalized version of the algebraic error
  • Epipolar line distance: the distance of an image point from its epipolar line
  • Reprojection error

12. Why do we need RANSAC?

  • to remove outliers

13. What is the theoretical maximum number of combinations to explore?

  • for pairs: n(n-1)/2
  • for s points in sample: n_choose_s = n!/(s!(n-s)!)

14. After how many iterations can RANSAC be stopped to guarantee a given success probability?

  • k = log(1-p)/log(1-w^s), where p is the desired probability of success, w the fraction of inliers, and s the number of points per sample
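
Plugging the formula in (the success probability and inlier fraction are made-up example values):

```python
import math

def ransac_iterations(p_success, inlier_fraction, sample_size):
    """k = log(1-p) / log(1 - w^s)."""
    return math.ceil(math.log(1 - p_success) / math.log(1 - inlier_fraction ** sample_size))

print(ransac_iterations(0.99, 0.5, 8))   # 8-point sample: ~1177 iterations
print(ransac_iterations(0.99, 0.5, 2))   # 2-point sample: 17 iterations
```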

15. What is the trend of RANSAC vs. iterations, vs. the fraction of outliers, vs. the number of points to estimate the model?

  • The required number of iterations grows exponentially with the fraction of outliers; the 8-point curve rises first (steepest), then the 5-point, then the 2-point curve

16. How do we apply RANSAC to the 8-point algorithm, DLT, P3P?

  • 8-point: sample 8 correspondences, estimate the model with the 8-point algorithm, count the inliers, and repeat k times keeping the best model
  • DLT: sample 6 correspondences, estimate the model with the DLT, count the inliers, repeat k times
  • P3P: sample 3+1 correspondences, estimate the model with P3P, count the inliers, repeat k times

17. How can we reduce the number of RANSAC iterations for the SFM problem? (1- and 2-point RANSAC)

  • Exploit constraints on the motion
  • Planar motion -> 2-point RANSAC
  • Planar and circular motion -> 1-point RANSAC

18. Bundle Adjustment and Pose Graph Optimization. Mathematical expressions and illustrations. Pros and cons.

  • BA: minimizes the reprojection errors and adjusts both the camera poses and the 3D points \(\begin{align*} (P^i,C_1,C_2) = argmin_{P^i,C_1,C_2} \sum_{i=1}^{N} \lVert {p_1}^i - \pi_1(P^i,C_1)\rVert^2 + \lVert {p_2}^i - \pi_2(P^i,C_2)\rVert^2 \end{align*}\)
  • PGO: minimizes the pose errors (roto-translations) over all pairs of poses that share correspondences; it only adjusts the poses \(\begin{align*} C_k = argmin_{C_k} \sum_{i} \sum_{j} \lVert C_i - C_jT_{ij} \rVert^2 \end{align*}\)
  • Pro/con: BA is more accurate since it uses the image measurements directly, but much more expensive; PGO is cheaper since it ignores the 3D points

19. Are you able to describe hierarchical and sequential SFM for monocular VO?

Hierarchical:

  • Build clusters of 3 views with the same features
  • Construct a point cloud from two of these views (2D-2D)
  • Merge the third view (3D-2D)
  • Merge the clusters pairwise (3D-3D)

Sequential:

  • Bootstrap (2D-2D for the first two keyframes) -> initial point cloud
  • Until the next keyframe, only estimate the pose (P3P)
  • At every new keyframe, add more points to the point cloud (P3P for the pose, then triangulation for new 3D points)
  • Select a new keyframe when keyframe-distance / average-scene-depth reaches 0.1 to 0.2

  • Pro/con: too many keyframes (too small baselines) -> too large depth uncertainty and unnecessary cost; too few keyframes -> too few correspondences

20. What are keyframes? Why do we need them and how can we select them?

  • see above

21. Are you able to define loop closure detection? Why do we need loops?

  • Loop detection: recognizing that we are at a place we have visited before (place recognition)
  • Then we can close the loop: we know the two poses should coincide, but our estimate says otherwise (pose-graph optimization)
  • ==> VSLAM
  • We need loops to know how much we have drifted
  • We need loops for loop closure, which improves the optimization (removes the drift)

22. Are you able to provide a list of the most popular open source VO and VSLAM algorithms?

  • ORB-SLAM
  • LSD-SLAM
  • DSO
  • SVO

23. Are you able to describe the differences between feature-based methods and direct methods?

  • Feature-based methods work on extracted features and minimize the reprojection error
  • Direct methods work directly on the pixel intensities and minimize the photometric error

24. Sparse vs semi-dense vs dense. What are their pros and cons?

  • Sparse methods only look at a few selected pixels (the features)
  • Semi-dense: look at more pixels (e.g. all pixels with a strong gradient)
  • Dense: look at all pixels

  • Dense methods are more costly, because we have to compute over all pixels, but they tend to be more accurate
  • Sparse methods are faster, less accurate
  • Dense methods are useful with motion blur and defocus
  • Sparse methods are as good as dense ones when the overlap is > 30%

25. General definition of VO and comparison with respect VSLAM and SFM. Definition of loop closure detection (why do we need loops?). How can we detect loop closures? (make link to other lectures).

  • SFM is the most general, VSLAM adds loop closing to VO: SFM > VSLAM > VO
  • Loops: see question 21; they can be detected with place recognition (e.g. Bag of Words, see the place-recognition lecture)

10 – Multiple view geometry 4 (benchmarking visual SLAM) not covered in class, so it won’t be asked.

11 – Tracking

1. Are you able to illustrate tracking with block matching?

  • shifting window to find patch again in 2nd image
  • shift indicates direction of movement
  • move inside searchregion and define comparison method

2. Are you able to explain the underlying assumptions behind differential methods, derive their mathematical expression and the meaning of the M matrix?

  • Assumptions: 1. Brightness constancy, 2. Temporal consistency, 3. Spatial coherency
  • Temporal consistency: the motion between frames is small (1-2 pixels)
  • Spatial coherency: neighboring pixels belong to the same surface and therefore undergo similar motion

  • Differential methods minimize the difference between the first image and the shifted second image
  • Take the same patch in both images, but approximate the shift in the second image with a first-order Taylor expansion
  • $SSD = \sum(\Delta I - I_x u - I_y v)^2$
  • Take the partial derivatives of the SSD with respect to (u, v) and set them to 0
  • Express the result as a matrix equation: $(u,v)^T = M^{-1}(\sum I_x \Delta I, \sum I_y \Delta I)^T$ (this is the shift)
  • The interpretation of M is the same as for Harris (both eigenvalues large -> good to track)
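
A sketch of this translational case for a single patch, following the equations above (the grayscale images I1, I2, the patch center and the patch size are assumptions; the sign convention of Delta-I follows the formula above):

```python
import numpy as np

def patch_flow(I1, I2, x, y, half=7):
    """Estimate the shift (u, v) of the patch centred at (x, y) between I1 and I2."""
    ys, xs = slice(y - half, y + half + 1), slice(x - half, x + half + 1)
    Iy, Ix = np.gradient(I1[ys, xs])          # spatial gradients of the patch
    dI = (I1 - I2)[ys, xs]                    # temporal difference Delta-I

    M = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = np.array([np.sum(Ix * dI), np.sum(Iy * dI)])
    return np.linalg.solve(M, b)              # (u, v) = M^-1 b, M must be invertible
```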

3. When is this matrix invertible and when not?

  • For M to be invertible, its determinant has to be non zero

4. What is the aperture problem and how can we overcome it?

  • If we only see an edge (through a small aperture), we cannot determine the motion along the edge
  • We can overcome it by choosing a larger patch

5. What is optical flow?

  • Optical flow (or optic flow) is the pattern of apparent motion of objects in a visual scene caused by the relative motion between the observer (an eye or a camera) and the scene

6. Can you list pros and cons of block-based vs. differential methods for tracking?

  • Block-based methods are more robust to large motions but costly
  • Differential methods are only applicable to small motions, but are faster

7. Are you able to describe the working principle of KLT?

  • Start with an initial guess of the warp from image 1 to the template
  • Iteratively refine the warp parameters to minimize the SSD between the warped image 1 and the template
  • Use the final warp of image 1 as the initial guess for image 2
  • (The template is used to find the object in the image and its initial placement (warp), so that we can express the first warp from image 1 to the template)

8. What functional does KLT minimize? (proof won’t be asked, only the first two slides titled “derivation of the Lucas-Kanade algorithm”)

  • It minimizes E, the SSD between the warped image and the template (see the slides “Derivation of the Lucas-Kanade algorithm”)

9. What is the Hessian matrix and for which warping function does it coincide to that used for point tracking?

  • For the KLT affine warp (6 parameters, 2x6 Jacobian) H is a 6x6 matrix
  • For point tracking (differential methods) the coinciding warp is a pure translation (only 2 parameters) -> H is a 2x2 matrix
  • So it coincides with the translational warp, where H equals the M matrix
  • The Jacobian is a (number of functions) x (number of parameters) matrix
  • The Hessian is a (number of parameters) x (number of parameters) matrix
  • The Jacobian contains the first partial derivatives
  • The Hessian contains the second partial derivatives
  • H = J^T · J
  • The number of functions is in our case always 2 (the x and y coordinates)
  • The number of parameters depends on the warping function
  • In the differential methods we take a first derivative to build the SSD and then another derivative to minimize it -> second partial derivatives => Hessian

10. Can you list Lukas-Kanade failure cases and how to overcome them?

  • The brightness-constancy, temporal-consistency or spatial-coherency assumptions do not hold
  • Occlusions
  • Errors in the template
  • If the initial estimate is too far off, the linear approximation no longer holds -> solution: pyramidal (coarse-to-fine) implementations (see next slide)

11. How do we get the initial guess?

  • For consecutive frames, take the solution of the previous step
  • For the first frame: assume the camera starts at rest, or detect the object in the image and track it from there

12. Illustrate alternative tracking using point features.

  • Keypoint tracking:
    1. Keypoint detection and matching
    2. Geometric verification (RANSAC)

12a – Dense 3D Reconstruction

1. Are you able to describe the multi-view stereo working principle? (aggregated photometric error)

  • Sample depth hypotheses along the ray of each pixel in the reference view
  • Aggregate the photometric error over multiple views
  • Stacking the errors over the depth hypotheses as a 3rd dimension gives the disparity space image (DSI)

2. What are the differences in the behavior of the aggregated photometric error for corners, flat regions, and edges?

  • The aggregated photometric error for flat regions (a) and for edges parallel to the epipolar line (c) shows flat valleys (plus noise)
  • For distinctive features (corners as in (b), or blobs), the aggregated photometric error has one clear minimum
  • Non-distinctive features (e.g. from repetitive texture) show multiple minima

3. What is the disparity space image (DSI) and how is it built in practice?

  • See question 1: the DSI stores, for every pixel (u, v) of the reference view and every depth hypothesis d, the aggregated photometric error C(u, v, d)

4. How do we extract the depth from the DSI?

  • argmin of DSI

5. How do we enforce smoothness (regularization) and how do we incorporate depth discontinuities (mathematical expressions)?

  • Do not just take the per-pixel argmin, but look for an optimal depth function d(u, v)
  • Adjust this function to minimize the data-term error plus the regularization-term error
  • The regularization term is based on the partial derivatives of d(u, v); it can be weighted down across strong intensity gradients to allow depth discontinuities there
  • The regularization term is weighted by lambda

6. What happens if we increase lambda (the regularization term)? What if lambda is 0? And if lambda is too big?

  • If lambda = 0: no regularization (noisy depth map)
  • For increasing lambda the map gets smoother, but the data-term error also increases (too large -> over-smoothed, discontinuities are washed out)

7. What is the optimal baseline for multi-view stereo?

  • When baseline becomes large (e.g., >10% of the avg scene depth), then create new reference frame (keyframe) and start a new depth map computation

8. What are the advantages of GPUs?

  • A GPU can run many more threads in parallel, and in image processing we often have many simple, independent tasks (one per pixel/feature)
  • A CPU has more cache (and control logic), which these simple per-pixel tasks do not need

12b – Place Recognition

1. What is an inverted file index?

  • Like a book index: a list of e.g. the pages on which each word appears
  • For images: instead of storing which features/visual words each image contains, we store, for each visual word, the list of images containing it
  • A word-to-image mapping

2. What is a visual word?

  • the centroid of a cluster in the descriptor space
  • “average feature”

3. How does K-means clustering work?

  • Randomly pick k points (e.g. 10) in descriptor space as the initial centroids
  • Assign every feature to its nearest centroid
  • Move each centroid to the mean position of its assigned features
  • Repeat until convergence
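
A bare-bones NumPy version of this loop (the values of k, the number of iterations and the seed are arbitrary choices):

```python
import numpy as np

def kmeans(descriptors, k=10, iters=20, seed=0):
    """descriptors: NxD array. Returns the k centroids (the 'visual words')."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every descriptor to its nearest centroid
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # move each centroid to the mean of its assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = descriptors[labels == j].mean(axis=0)
    return centroids
```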

4. Why do we need hierarchical clustering?

  • For faster access in the inverted file index: with a hierarchical vocabulary (tree), the nearest visual word is found in logarithmic instead of linear time

5. Explain and illustrate image retrieval using Bag of Words. (Figure: hierarchical clustering.)

6. Discussion on place recognition: what are the open challenges and what solutions have been proposed?

  • Geometric verification of the candidates
  • How to find the best match among many good ones
  • How to handle multiple occurrences of the same word
  • The optimal descriptor
  • Optimal magic numbers: descriptor length, number of clusters, branches, depth
  • How to pre-group the images
  • How big the database should be

  • FAB-MAP: captures the (spatial) dependencies between visual words

12c – Deep Learning – Won’t be asked at the exam

13 – Visual inertial fusion

1. Are you able to answer the following questions?

2. Why is it recommended to use an IMU for Visual Odometry?

  • They are complementary systems: VO is bad for fast motions (blur, ...) where the IMU is good; VO gives rich information, while the IMU drifts heavily

Pro Camera

  • Precise in slow motion
  • Rich information for other tasks

Con Camera

  • Limited output rate (~100 Hz)
  • Scale ambiguity in monocular setup
  • Lack of robustness to HDR scenes and high-speed motion

Pro IMU

  • Robust
  • High output rate (~1,000 Hz)
  • Accurate at high acceleration
  • Can predict next feature position

Con IMU

  • Large relative uncertainty when at low acceleration/angular velocity
  • Ambiguity in gravity / acceleration

3. Why not just an IMU, without a camera?

  • IMU and vision are complementary
  • An IMU alone drifts very quickly (the position error grows quadratically with time)

4. How does a MEMS IMU work?

  • Accelerometer: a spring-like structure connects the housing to a seismic mass; a capacitive divider converts the displacement of the seismic mass into an electric signal. Damping is provided by the gas sealed in the device.

  • Gyroscopes measure the Coriolis forces acting on MEMS vibrating structures (tuning forks, vibrating wheels, or resonant solids)

5. What is the drift of an industrial IMU?

  • Integration of the angular velocity gives the orientation: a bias in the angular velocity causes an error proportional to t
  • Double integration of the acceleration gives the position: a bias in the acceleration causes a position error proportional to t^2

6. What is the IMU measurement model?

Measurement of Acceleration

The formula states that the acceleration of frame B(ody) with respect to frame W(orld), measured in frame B, equals the true acceleration of B with respect to W expressed in frame W, minus the gravity vector expressed in frame W, all rotated from frame W into frame B. On top of this we add a bias and noise.

Measurement of Angular Velocity

The formula states that the measured angular velocity of frame B with respect to frame W, expressed in frame B, equals the true angular velocity of B with respect to W expressed in frame B, plus a bias and noise.
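
Written out, the standard model that the two paragraphs above describe looks roughly as follows (a reconstruction, so the notation may differ from the slides: B and W denote the body and world frames, b the biases and n the noise terms):

\(\begin{align*} \tilde{a}_B(t) &= R_{BW}(t)\big(a_W(t) - g_W\big) + b_a(t) + n_a(t) \\ \tilde{\omega}_B(t) &= \omega_B(t) + b_g(t) + n_g(t) \end{align*}\)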

7. What causes the bias in an IMU?

  • Some facts about IMU biases:
  • They can change due to temperature changes and mechanical pressure
  • They can change every time the IMU is started
  • Good news: they can be estimated

8. How do we model the bias?

  • the derivative of the bias is white Gaussian noise

9. How do we integrate the acceleration to get the position (formula)? (Figure: IMU model integration.)

10. What is the definition of loosely coupled and tightly coupled visual inertial fusion?

  • Loosely coupled approach: the vision algorithm first computes its own output (e.g. poses / up-to-scale motion), which is then fused with the IMU measurements
  • Tightly coupled approach: the raw image measurements (features) and the IMU measurements are fused together in a single estimator

11. How can we use non-linear optimization-based approaches to solve for visual inertial fusion?

  • Filtering (weighting) and smoothing
  • Filtering: Kalman filter, which only estimates the latest state; smoothing: non-linear least-squares optimization over several past states

12. Can you write down the cost function of smoothing methods and illustrate its meaning?

  • not exactly but for sure not like in slides

14 – Event-based Vision

1. Are you able to answer the following questions?

2. What is a DVS and how does it work?

  • Dynamic Vision Sensor: it reacts to changes of the light intensity
  • A DVS outputs asynchronous events at microsecond resolution; an event is generated each time a single pixel detects a change of intensity

3. What are its pros and cons vs. standard cameras?

  • much faster “framerate”
  • low latency
  • high dynamic range
  • low power

BUT

  • asynchronous pixels
  • no absolute intensity information

4. Can we apply standard camera calibration techniques?

  • Yes, using a flashing (blinking) calibration grid

5. How can we compute optical flow with a DVS?

  • From Δt and the brightness constancy assumption we can calculate the motion (flow):

$L(x, y, t) = L(x+u, y+v, t+ \Delta t)$

We can then use a first-order Taylor expansion to approximate the right-hand term:

$L(x, y, t) \approx L(x, y, t+\Delta t) + \frac{\partial L}{\partial x} u + \frac{\partial L}{\partial y} v$

This can be rewritten as:

$L(x, y, t+\Delta t) - L(x, y, t) = -\frac{\partial L}{\partial x} u - \frac{\partial L}{\partial y} v$

$-\nabla L \cdot (u, v)^T = C$

6. Could you intuitively explain why we can reconstruct the intensity?

  • We first do a probabilistic estimation of the gradient and the rotation from $-\nabla L \cdot (u, v)^T = C$. Then we obtain the intensity values from a Poisson reconstruction, from which we can reconstruct the image live using a GPU. The only thing we need is a base image relative to which we interpret the intensity changes.

7. What is the generative model of a DVS (formula)? Can you derive its 1st order approximation?

  • $\pm C = \log I(\mathbf{x}, t) - \log I(\mathbf{x}, t - \Delta t)$
  • Its 1st-order approximation is the brightness-constancy equation, which requires knowledge of the contrast sensitivity C

8. What is a DAVIS sensor?

  • combination of event camera and normal camera

9. What is the focus maximization framework and how does it work? What is its advantage compared with the generative model?

  • Search for the warp (motion) that maximizes the sharpness of the image of events accumulated (“flattened”) over time
  • It makes object segmentation possible and gives a direct solution for the motion

10. How can we get color events?

  • one event stream for each color -> process -> combine