
Visual Odometry and Visual Algorithms [Part 17] - Application questions

Application questions

1. Summarize the building blocks of a visual odometry (VO) or visual SLAM (VSLAM) algorithm

  • Image capturing (sequential)
  • Feature detection & description (e.g. SIFT)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Motion estimation: 2D-2D (8-point + RANSAC) for bootstrapping, then 3D-2D (P3P + 1 point), or 3D-3D (minimizing the Euclidean distance between point clouds)
  • Local optimization (bundle adjustment & pose-graph optimization)
  • Global optimization (SLAM): loop closing
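
As a rough illustration of how the first four blocks chain together, here is a minimal two-frame sketch using OpenCV. The function names are OpenCV's; the surrounding structure, the input images `img0`, `img1` and the calibration matrix `K` are assumptions, and the local/global optimization blocks are omitted.

```python
import cv2
import numpy as np

def relative_pose(img0, img1, K):
    """Estimate the relative camera motion between two grayscale frames (2D-2D case)."""
    sift = cv2.SIFT_create()                        # feature detection & description
    kp0, des0 = sift.detectAndCompute(img0, None)
    kp1, des1 = sift.detectAndCompute(img1, None)

    matcher = cv2.BFMatcher()                       # feature matching with ratio test
    matches = matcher.knnMatch(des0, des1, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
            good.append(pair[0])

    pts0 = np.float32([kp0[m.queryIdx].pt for m in good])
    pts1 = np.float32([kp1[m.trainIdx].pt for m in good])

    # motion estimation: essential matrix with RANSAC, then decomposition
    E, mask = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K)
    return R, t   # translation is only known up to scale
```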

2. Augmented reality (AR) is a view of a physical scene augmented by computer-generated sensory inputs, such as data or graphics. Suppose you want to design an augmented reality system that superimposes text labels on the image of real physical objects. Summarize the building blocks of an AR algorithm.

  • VO for estimating the camera pose
  • 3D point cloud
  • Surface estimation (dense reconstruction if accuracy is needed)
  • (Recognition module for identifying the object and retrieving its name)
  • Add the label in 3D (anchor it to a point in the existing cloud)
  • Project the label as a mesh onto the image plane
  • Display

3. Suppose that your task is to reconstruct an object from different views. How do you proceed?

  • Hierarchical SFM
  • Feature detection & description (SIFT)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Cluster the images into groups of 3 images that share the same features
  • Take two of these images to construct a point cloud (8-point algorithm)
  • Merge the 3rd view (3-point RANSAC) & BA
  • Merge the clusters pairwise (3D-3D) & BA

4. Building a panorama stitching application. Summarize the building blocks.

  • Feature detection & description (SIFT, rotation invariant)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Warp the 2nd image to minimize the Euclidean distance with respect to the feature locations in the first image
  • For good results, rotate the camera around one single point (its optical center); then the purely rotational warp is exact

5. How would you design a mobile tourist app? The user points the phone in the direction of a landmark and the app displays a tag with its name. How would you implement it?

  • google API
  • landmark recognition
  • DB of images
  • Bag Of Words

6. Assume that we have several images downloaded from Flickr showing the two towers of Grossmünster. Since such images were uploaded by different persons they will have different camera parameters (intrinsic and extrinsic), different lighting, different resolutions and so on. If you were supposed to create a 3D model of Grossmünster, what kind of approach would you use? Can you get a dense 3D model, or will it be a sparse one? Please explain the pipeline that you propose for this scenario.

  • Hierarchical SFM
  • 8-point algorithm, but with the fundamental matrix instead of the essential matrix (intrinsics unknown)
  • Only really solvable if the focal length is the same (only cluster images with the same f, using the info on Flickr)
  • For the cluster merging we can merge across different f
  • Zero-mean the patches (to handle the different lighting)
  • Use SIFT for the features -> a sparse model
  • Once we have normalized for f, we can also run a dense reconstruction

7. Assume that you move around a statue with a camera and take pictures in a way that the statue is not far from the camera and always completely visible in the image. If you were supposed to find out where the pictures were taken, what would you do with the images? What kind of approach would you use? Since the camera motion is around the statue, the images contain different parts of the statue. How do you deal with this problem?

  • small movements between images

8. Suppose that you have two robots exploring an environment, explain how the robots should localize themselves and each other with respect to the environment? What are the alternative solutions?

  • same bootstrap results in same world frame

Theoretical questions

01 – Introduction

1. Provide a definition of Visual Odometry.

  • VO is the process of incrementally estimating the pose of the vehicle by examining the changes that motion induces on the images of its onboard cameras

2. Explain the most important differences between VO, VSLAM and SFM.

  • VO focuses on estimating the 6DoF motion of the camera sequentially (as a new frame arrives) and in real time.
  • SFM is more general than VO and tackles the problem of 3D reconstruction and 6DOF pose estimation from unordered image sets
  • VSLAM is VO with loop detection and closure

3. Describe the needed assumptions for VO.

  • Sufficient illumination in the environment
  • Dominance of static scene over moving objects
  • Enough texture to allow apparent motion to be extracted
  • Sufficient scene overlap between consecutive frames

4. Illustrate its building blocks

  • Image capturing (sequential)
  • Feature detection & description (e.g. SIFT)
  • Feature matching (descriptor matching, e.g. SSD with robust error minimization)
  • Motion estimation: 2D-2D (8-point + RANSAC) for bootstrapping, then 3D-2D (P3P + 1 point), or 3D-3D (minimizing the Euclidean distance between point clouds)
  • Local optimization (bundle adjustment & pose-graph optimization)
  • Global optimization (SLAM): loop closing

02-03 – Image Formation

1. Explain what a blur circle is

  • “Circle of Confusion” or “Blur Circle”
  • For a fixed film distance from the lens, there is a specific distance between the object and the lens, at which the object appears “in focus” in the image, other points project to a “blur circle” in the image.
  • R = L·d/(2e), where L is the aperture (lens) diameter, e the distance at which the object is in focus, and d the distance between the in-focus plane and the actual image plane

2. Derive the thin lens equation and perform the pinhole approximation

  • B/A = e/z (similar triangles through the lens center)
  • B/A = (e-f)/f = e/f - 1 (similar triangles through the focal point)
  • => 1/f = 1/z + 1/e (thin lens equation)

  • Pinhole approximation: for z >> f we have e ≈ f
  • => B/A ≈ f/z
  • => B = (f/z)·A

3. Define vanishing points and lines

  • Parallel lines in the world intersect in the image at a “vanishing point”
  • Parallel planes in the world intersect in the image at a “vanishing line”

4. Prove that parallel lines intersect at vanishing points

  • x = (f/Z)X
  • y = (f/Z)Y
  • L1 = P1 + s·[l, m, n]^T
  • L2 = P2 + s·[l, m, n]^T (parallel lines share the direction [l, m, n]^T)
  • x_VP = lim_{s->inf} f·(X+sl)/(Z+sn) = f·(l/n)
  • y_VP = lim_{s->inf} f·(Y+sm)/(Z+sn) = f·(m/n)
  • The limit does not depend on P1 or P2, so all lines with this direction meet at the same image point

5. Explain how to build an Ames room

  • The walls are not parallel but trapezoidal
  • The brain assumes the walls are parallel (perpendicular), which creates the illusion

6. Derive a relation between the field of view and the focal length

  • Tangent = opposite leg over adjacent leg
  • tan(FoV/2) = (W/2)/f
  • W is the sensor width (or height), f the focal length
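
Plugging the relation in as a quick numeric check (the sensor width and focal length values are made up):

```python
import math

def fov_deg(sensor_width_mm, focal_length_mm):
    """Horizontal field of view from tan(FoV/2) = (W/2)/f."""
    return 2 * math.degrees(math.atan((sensor_width_mm / 2) / focal_length_mm))

print(fov_deg(36.0, 50.0))   # full-frame sensor with a 50 mm lens -> ~39.6 degrees
```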

7. Explain and write the equations of the perspective projection, including lens distortion and world to camera projection. (See the distortion equation, p. 62.)

8. Given an image and the associated camera pose, how would you superimpose a virtual object on the image (for example, a virtual cube). Describe the steps involved.

  • The virtual object mesh is defined by vertices -> bring them into the camera frame using R and t
  • Project into the image frame using the intrinsics K
  • Normalize by the z-coordinate (lambda) -> (u, v)
  • Apply the lens distortion (to the virtual object)
  • Draw the object
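
A minimal sketch of these steps (lens distortion is skipped; the cube vertices, K, R and t below are made-up example values):

```python
import numpy as np

def project_points(P_world, K, R, t):
    """Project Nx3 world points into pixel coordinates (u, v)."""
    P_cam = R @ P_world.T + t.reshape(3, 1)        # world -> camera frame
    uv_hom = K @ P_cam                             # camera frame -> image frame
    uv = uv_hom[:2] / uv_hom[2]                    # normalize by lambda (= z)
    return uv.T

# unit cube anchored at some 3D point (e.g. on the existing point cloud)
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
K = np.array([[420.0, 0, 320.0], [0, 420.0, 240.0], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 4.0])        # example camera pose
pixels = project_points(cube, K, R, t)             # draw the cube edges between these points
```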

9. Normalized image coordinates and geometric explanation.

  • Normalized image coordinates are obtained by applying the inverse of K, which normalizes the focal length to 1
  • Geometric interpretation: similar triangles with the image plane moved to distance 1 from the camera center

10. Describe the general PnP problem and derive the behavior of its solutions. What’s the minimum number of points and what are the degenerate configurations?

  • In PnP the 3D points and their images are given and we want to find the pose of the camera
  • With 1 point we have infinitely many solutions
  • With 2 points a bounded infinity of solutions (spindle torus)
  • With 3 points we get 3 independent equations with 3 unknowns in total -> product of the degrees of the equations = 2·2·2 = 8 solutions \(\begin{align*} {s_i}^2={A_i}^2+{B_i}^2 - 2A_iB_i\cos(\theta_i) \end{align*}\)
  • 4 positive and 4 negative solutions; only the positive ones are geometrically valid
  • => use 3+1 points to make the system overdetermined and disambiguate the 4 solutions
  • Degenerate configurations: points collinear with the camera center (on the same projection ray), and the 3 points must not be collinear

11. Explain the working principle of the P3P algorithm. What are the algebraic trigonometric equations that it attempts to solve?

  • See the law-of-cosines equation in question 10: P3P solves the system of three such equations (one per pair of points) for the unknown distances

12. Explain and derive the DLT for both 3D objects or planar grids. What is the minimum number of point correspondences it requires (for 3D objects and for planar grids)? (Figures: the DLT matrix equation and the Q matrix.)

  • Solve with SVD: the vectorized M is the singular vector corresponding to the smallest singular value; then decompose M into intrinsics and extrinsics (QR decomposition)
  • For planar grids, simply ignore the third world coordinate (Z = 0)
  • 3D: 6 points (non-coplanar, not on a twisted cubic, and no 2 points collinear with the center of projection)
  • Planar: a minimum of 4 non-collinear points is required
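
A sketch of the DLT for the 3D case, following the steps above: build the Q matrix from the correspondences, take the right singular vector of the smallest singular value and reshape it to M. The QR step that splits M into K[R|t] is omitted, and the row layout is one common convention (not necessarily the exact one on the slides).

```python
import numpy as np

def dlt(P_world, p_img):
    """P_world: Nx3 world points, p_img: Nx2 pixel coords, N >= 6. Returns the 3x4 matrix M (up to scale)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(P_world, p_img):
        Pw = [X, Y, Z, 1.0]
        rows.append([0, 0, 0, 0] + [-c for c in Pw] + [v * c for c in Pw])
        rows.append(Pw + [0, 0, 0, 0] + [-u * c for c in Pw])
    Q = np.asarray(rows)                      # 2N x 12 system Q m = 0
    _, _, Vt = np.linalg.svd(Q)
    M = Vt[-1].reshape(3, 4)                  # singular vector of the smallest singular value
    return M                                  # defined only up to scale
```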
13. Define central and non-central omnidirectional cameras.

  • Non-central omnidirectional cameras have mirrors where the reflected rays do not intersect in a single point (behind the mirror); in central ones they do.

14. What kind of mirrors ensure central projection?

  • The mirror must be a surface of revolution of a conic:
  • Hyperbola + perspective camera
  • Parabola + orthographic lens

15. What do we mean by normalized image coordinates on the unit sphere?

04 – Filtering and edge detection

1. Explain the differences between convolution and correlation

  • Convolution flips the filter first and is associative and commutative; correlation does not flip the filter

2. Explain the differences between a box filter and a Gaussian filter

  • A box filter weights all pixels in the window equally
  • A Gaussian filter weights the central pixels more

3. Explain why one should increase the size of the kernel of a Gaussian filter if 2𝜎 is close to the size of the kernel

  • At the border the filter values should be close to zero; otherwise the truncated kernel causes aliasing artifacts

4. Explain when we would need a median & bilateral filter

  • Median filter: denoising while preserving strong edges
  • Bilateral filter: preserves all edges up to a threshold (only averages over pixels with similar intensity)

5. Explain how to handle boundary issues

  • zero padding (black)
  • wrap around
  • copy edge
  • reflect across edge

6. Explain the working principle of edge detection with a 1𝐷 signal

  • smoothing -> gaussian
  • detection -> derivative
  • combined ==> derivative of gaussian ==> maximum/minimum is edge
  • laplacian of gaussian ==> zero crossing is edge
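
A small sketch of the 1D case using SciPy's Gaussian derivative filter (the step signal, noise level and sigma are made up):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

signal = np.r_[np.zeros(50), np.ones(50)] + 0.05 * np.random.randn(100)  # noisy step

# smoothing + differentiation in one pass: derivative of Gaussian
dog = gaussian_filter1d(signal, sigma=3, order=1)
edge_at = np.argmax(np.abs(dog))          # extremum of the response -> edge location

# Laplacian of Gaussian: second derivative, the edge sits at the zero crossing
log = gaussian_filter1d(signal, sigma=3, order=2)
```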

7. Explain how noise does affect this procedure

  • With noise the derivative is high everywhere, so true edges can no longer be distinguished

8. Explain the differential property of convolution

  • Instead of using two convolutions (one with the Gaussian, one with the derivative), convolve the image once with the derivative of the Gaussian filter: d/dx (f * g) = f * (dg/dx)

9. Show how to compute the first derivative of an image intensity function along 𝑥 and 𝑦

  • With a [-1, 1] filter, or Sobel / Prewitt kernels, applied along x and y

10. Explain why the Laplacian of Gaussian operator is useful

  • laplacian of gaussian ==> zero crossing is edge
  • ==> more accurate/simple localization of edge

11. List the properties of smoothing and derivative filters

Smoothing filter:
  • has all positive values
  • sums to 1 -> preserves the brightness of constant regions
  • removes “high-frequency” components: “low-pass” filter

Derivative filter:
  • has opposite signs, used to get a high response in regions of high contrast
  • sums to 0 -> no response in constant regions
  • highlights “high-frequency” components: “high-pass” filter

12. Illustrate the Canny edge detection algorithm

  • Convolve with the derivative of the Gaussian (along x and y)
  • Threshold the gradient magnitude to remove low values
  • Keep only the local maxima along the gradient direction (non-maxima suppression)

13. Explain what non-maxima suppression is and how it is implemented

  • Thinning: non-maxima suppression (local-maxima detection) along gradient direction
  • At each edge pixel we have the x and y derivatives; combining them gives the gradient direction

05-06 – Point feature detection, descriptor, and matching

1. Explain what is template matching and how it is implemented

  • Move the template over the image and compute the correlation at every position; high correlation means a good match
  • Different types of correlation (similarity measures) can be used
  • The output is a correlation map

2. Explain what are the limitations of template matching. Can you use it to recognize cars.

  • It can only find near pixel-accurate copies of the template
  • Cars are not pixel-accurate: their appearance varies too much (viewpoint, scale, model, illumination)

3. Illustrate the similarity metrics SSD, SAD, NCC, and Census transform

  • NCC: normalized cross-correlation, the cosine of the angle between the two normalized patch vectors
  • SSD: sum of squared differences
  • SAD: sum of absolute differences
  • Census transform: for each of the 8 pixels around the central pixel write 1 if it is larger than the center, else 0; starting top-left, append the bits (to the left) into a bit string; compare patches with the Hamming distance
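
Minimal NumPy versions of the first three metrics (the patches are assumed to be same-sized float arrays; the census transform is omitted):

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences: larger error -> larger value, best value is 0."""
    return float(np.sum((a - b) ** 2))

def sad(a, b):
    """Sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))

def ncc(a, b):
    """Cosine of the angle between the patch vectors (subtract the means first for zero-mean NCC)."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```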

4. What is the intuitive explanation behind SSD and NCC?

NCC:
  • perpendicular patch vectors => 0
  • opposite => -1
  • identical => 1

SSD:
  • the larger the error, the larger the value
  • the best value is 0

5. Explain what are good features to track. In particular, can you explain what are corners and blobs together with their pros and cons. How is their localization accuracy?

  • Corners are good, edges are medium, flat regions are bad; blobs are also good: any image pattern that is not a corner but differs significantly from its neighbors
  • Corners are regions with high contrast in at least 2 directions
  • Corners have high localization accuracy but are less distinctive
  • Blobs have lower localization accuracy but are more distinctive
  • Corners are better for VO
  • Blobs are better for place recognition

6. Explain the Harris corner detector. In particular: - a. Use the Moravec definition of corner, edge and flat region.

  • Slide a window in different directions and compute the intensity change
  • No intensity change in any direction => flat; change in only one direction => edge; change in at least 2 directions => corner

- b. Show how to get the second moment matrix from the definition of SSD and first order approximation (show that this is a quadratic expression) and what is the intrinsic interpretation of the second moment matrix using an ellipse?

  • $SSD(\Delta x, \Delta y) = \sum (I(x,y)-I(x+\Delta x, y+\Delta y))^2 \approx \sum (I_x(x,y)\Delta x + I_y(x,y)\Delta y)^2$ (first-order Taylor approximation)
  • This is quadratic in $(\Delta x, \Delta y)$ and can be written as $[\Delta x, \Delta y]\,M\,[\Delta x, \Delta y]^T$ with $M = [[\sum I_x^2, \sum I_xI_y][\sum I_xI_y, \sum I_y^2]]$
  • The ellipse $[\Delta x, \Delta y]\,M\,[\Delta x, \Delta y]^T = const$ visualizes M: the shorter axis of the ellipse points in the direction of fastest intensity change
  • The longer axis is therefore along the edge
  • If both axes are large => flat region

- c. What is the M matrix like for an edge, for a flat region, for an axis-aligned 90-degree corner and for a non-axis—aligned 90-degree corner?

  • Flat: both eigenvalues small
  • Edge: one eigenvalue large, the other small
  • Corner: both eigenvalues large
  • Axis alignment does not matter, since the eigenvectors give the directions

- d. What do the eigenvalues of M reveal?

  • The magnitude of the intensity change in the direction of the corresponding eigenvector

- e. Can you compare Harris detection with Shi-Tomasi detection?

  • Harris: R = det(M) - k·trace(M)^2
  • Shi-Tomasi: R = min(lambda_1, lambda_2)
  • Harris is a bit cheaper since the eigenvalues do not need to be computed explicitly, but it needs a magic number k
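
A compact sketch that computes both cornerness measures from the same second-moment matrix M (the image is assumed to be a float grayscale array; the k and sigma values are conventional choices, and the SciPy filters are only one way to build the gradients and the window sum):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def cornerness(img, k=0.04, sigma=1.5):
    Ix, Iy = sobel(img, axis=1), sobel(img, axis=0)        # image gradients
    # entries of the second moment matrix M, summed with a Gaussian window
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)

    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    harris = det_M - k * trace_M ** 2                      # no explicit eigenvalues needed
    # Shi-Tomasi: smaller eigenvalue of the 2x2 symmetric matrix M (closed form)
    shi_tomasi = trace_M / 2 - np.sqrt(np.maximum((trace_M / 2) ** 2 - det_M, 0.0))
    return harris, shi_tomasi
```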

- f. Can you explain whether the Harris detector is invariant to illumination or scale changes? Is it invariant to view point changes?

  • Invariant to constant (affine) illumination changes
  • Not invariant to scale changes -> corners become edges when upscaling
  • Invariant to viewpoint changes as long as a corner stays a corner

- g. What is the repeatability of the Harris detector after rescaling by a factor of 2?

  • Low, since it cannot handle scale changes -> corners become edges when upscaling

1. How does automatic scale selection work?

  • Compute a cornerness/blob response over different patch sizes and, for each image independently, pick the size at which the response is maximal

  • Sharp, local intensity changes are good regions to monitor in order to identify the scale
  • The ideal function for determining the scale is one that highlights sharp discontinuities
  • Solution: convolve image with a kernel that highlights edges (LoG)
  • Correct scale is found as local maxima or minima across consecutive smoothed images

2. What are the good and the bad properties that a function for automatic scale selection should have or not have?

  • A “good” function for scale detection should have a single & sharp peak
  • The ideal function for determining the scale is one that highlights sharp discontinuities
  • Multiple peaks are bad, because the scale (and therefore the matching) becomes ambiguous

3. How can we implement scale invariant detection efficiently? (show that we can do this by resampling the image vs rescaling the kernel).

  • SIFT
  • Instead of using ever larger filters, downsample the image: this improves the runtime since larger filters need more resources, and a smaller image also reduces the number of positions (pixels) to evaluate

4. What is a feature descriptor? (patch of intensity value vs histogram of oriented gradients). How do we match descriptors?

  • A unique, recognizable representation of a feature
  • The simplest descriptor is the intensity patch itself
  • Matching: compute the difference (e.g. SSD) between descriptors and match the ones with the lowest error
  • More robust: only accept a match when the best match is much better than the second best (ratio 0.8)

5. How is the keypoint detection done in SIFT and how does this differ from Harris?

  • Build an octave of Gaussian-filtered images with different sigmas
  • Take the differences of these Gaussians (DoG)
  • Take a 3x3x3 box across the layered DoGs; if the central pixel is the extremum and above a threshold, a keypoint is found
  • Compared to multi-scale Harris, SIFT is faster since it downsamples and can therefore quickly cover a large range of sigmas

6. How does SIFT achieve orientation invariance?

  • The SIFT descriptor is orientation invariant because it is aligned with the dominant gradient orientation (the HoGs of the cells are circularly shifted so that the overall largest value comes first)

7. How is the SIFT descriptor built?

  • Take a 16x16 patch around the keypoint (at its detected scale)
  • Apply a Gaussian weighting over the whole patch
  • Divide it into 4x4 cells
  • Compute a HoG for each cell
  • Concatenate all HoGs (128-dimensional descriptor)
  • Shift according to the dominant orientation

8. What is the repeatability of the SIFT detector after a rescaling of 2? And for a 50 degrees’ viewpoint change?

  • A 50-degree viewpoint change is the upper limit for which it works reasonably
  • At a 50-degree viewpoint change: ~50% correct matches
  • Rescaling by 2: scale invariant (if enough octaves are used)

9. Illustrate the 1st to 2nd closest ratio of SIFT detection: what’s the intuitive reasoning behind it? Where does the 0.8 factor come from?

  • Accept a match only if d1/d2 < 0.8, where d1 and d2 are the distances to the closest and second-closest descriptor; ambiguous (non-distinctive) features have d1 ≈ d2 and are rejected. The 0.8 threshold comes from Lowe's empirical study of correct vs. false matches.
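
A minimal brute-force sketch of the ratio test on raw descriptor matrices (the arrays `desc1`, `desc2` are assumed, with at least two descriptors in `desc2`):

```python
import numpy as np

def ratio_test_match(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 (N1xD) to desc2 (N2xD); keep only unambiguous matches."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]            # closest and second-closest
        if dists[j1] < ratio * dists[j2]:         # d1/d2 < 0.8 -> distinctive match
            matches.append((i, j1))
    return matches
```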

10. How does the FAST detector work? What are its pros and cons compared with Harris?

  • Consider a ring of (usually 16) pixels around the center pixel and check how many consecutive pixels are brighter than the center
  • If 9 or more consecutive pixels are all brighter or all darker (by a threshold), a feature is found
  • Compared with Harris: much faster, but more sensitive to noise

07 – Stereo Vision

1. Can you relate Structure from Motion to 3D reconstruction? What’s their difference?

  • SFM creates sparse scene point cloud, and gives the camera positions
  • dense 3D reconstruction needs camera poses given and then creates a dense map
  • SFM is non-sequential; dense reconstruction (in the case shown) is sequential
  • SFM based on features
  • dense reconstruction based on pixels

  • 3D reconstruction: K, R, t are known; goal: the 3D structure
  • SFM: nothing is known; goal: the 3D structure and K, R, t

2. Can you define disparity in both the simplified and the general case?

  • Simplified (rectified) case: the horizontal pixel displacement between the two images along the main axis
  • General case: the displacement along the epipolar line

3. Can you provide a mathematical expression of depth as a function of the baseline, the disparity and the focal length?

  • depth = (baseline * focal length)/ disparity (Only in simplified case)

4. Can you apply error propagation to derive an expression for depth uncertainty and express it as a function of depth? How can we improve the uncertainty?

  • Error propagation with respect to the disparity: Delta-depth = |d(depth)/d(disparity)| · Delta-disparity
  • ==> Delta-depth = (b·f / disparity^2) · Delta-disparity
  • ==> Delta-depth = (depth^2 / (b·f)) · Delta-disparity
  • Improve with a larger baseline, a better resolution (smaller Delta-disparity), or a larger focal length
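
Plugging in the two formulas above (the baseline, focal length and disparity values are made up):

```python
def stereo_depth(baseline_m, focal_px, disparity_px):
    return baseline_m * focal_px / disparity_px

def depth_uncertainty(baseline_m, focal_px, depth_m, disparity_error_px):
    # Delta-depth = depth^2 / (b * f) * Delta-disparity
    return depth_m ** 2 / (baseline_m * focal_px) * disparity_error_px

z = stereo_depth(0.12, 700.0, 20.0)                 # 4.2 m
print(z, depth_uncertainty(0.12, 700.0, z, 0.5))    # ~0.105 m for a half-pixel disparity error
```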

5. Can you analyze the effects of a large/small baseline? Can you illustrate it with a sketch?

  • See the formula above: a large baseline gives a low depth uncertainty but small overlap between the views; a small baseline gives a large depth uncertainty. (Figure: large and small baseline.)

6. What is the closest depth that a stereo camera can measure?

  • Closest depth = b·f / max-disparity
  • Max disparity = width-left-image/2 - (-width-right-image/2) = (width-left-image + width-right-image)/2
  • For identical cameras: max disparity = the image width, so the closest depth is b·f / width

7. Are you able to show mathematically how to compute the intersection of two lines (linearly and non-linearly)?

  • Linear: lambda · p_1 = M_1 · P; this equality tells us that the two vectors are parallel
  • ==> parallel vectors have a cross product of 0
  • ==> p_1 x (M_1 · P) = 0, and the same for the other camera
  • ==> p_2 x (M_2 · P) = 0
  • Stack the equations and solve the homogeneous system for P using SVD

  • Non-linear: initialize with the linear solution
  • Then refine by minimizing the sum of squared reprojection errors (SSRE)
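
A sketch of the linear (DLT) triangulation via the cross-product constraint above (M1, M2 are the 3x4 projection matrices, p1, p2 the homogeneous pixel coordinates; all are assumed given):

```python
import numpy as np

def skew(p):
    """Cross-product matrix [p]_x such that [p]_x q = p x q."""
    return np.array([[0, -p[2], p[1]],
                     [p[2], 0, -p[0]],
                     [-p[1], p[0], 0]])

def triangulate_linear(p1, p2, M1, M2):
    """Solve p1 x (M1 P) = 0 and p2 x (M2 P) = 0 for the homogeneous world point P."""
    A = np.vstack([skew(p1) @ M1, skew(p2) @ M2])   # 6x4 homogeneous system
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1]
    return P[:3] / P[3]                             # de-homogenize
```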

8. What is the geometric interpretation of the linear and non-linear approaches and what error do they minimize?

  • non-linear minimizes SSRE
  • linear: midpoint of the segment connecting the two backprojected rays (minimized distance from both rays)

9. Are you able to provide a definition of epipole, epipolar line and epipolar plane?

  • Epipoles: the intersections of the line connecting the two camera centers (the baseline) with the two image planes
  • Epipolar line: the intersection of the epipolar plane with an image plane (the backprojection of the other camera's ray onto this image plane)
  • Epipolar plane: the plane spanned by the two camera centers and the world point

10. Are you able to draw the epipolar lines for two converging cameras, for a forward motion situation, and for a side-moving camera? (Figure: examples.)

11. Are you able to define stereo rectification and to derive mathematically the rectifying homographies?

  • Warp both images to how they would look if the two cameras were perfectly aligned
  • Apply the inverse of each camera's rotation and intrinsics to get the normalized representation of both views (not directly displayable)
  • Apply a common (“mean”) rotation and intrinsic matrix to both views
  • K-hat = average of the two K matrices
  • R-hat: r_hat_1 = the normalized vector between the camera centers, (C_2 - C_1)/||C_2 - C_1||
  • r_hat_2 = r_L_3 x r_hat_1 (with r_L_3 the optical axis of the left camera)
  • r_hat_3 = r_hat_1 x r_hat_2

12. How is the disparity map computed?

  • Rectify the images
  • Match points along the epipolar line (using a similarity measure such as cross-correlation)
  • Measure the disparity of each match
  • The result is the disparity map

13. How can we establish stereo correspondences with subpixel accuracy?

  • Interpolate the similarity (e.g. cross-correlation) function around the best integer disparity, e.g. with a parabola, and take its extremum
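
A sketch of the usual parabola fit through the best cost and its two neighbours (a cost curve indexed by integer disparity is assumed; for a correlation score, take the argmax instead of the argmin):

```python
import numpy as np

def subpixel_disparity(costs):
    """costs[d] = matching cost at integer disparity d (lower is better)."""
    d = int(np.argmin(costs))
    if d == 0 or d == len(costs) - 1:
        return float(d)                       # no neighbours to interpolate with
    c_l, c_0, c_r = costs[d - 1], costs[d], costs[d + 1]
    # vertex of the parabola through the three points (-1, c_l), (0, c_0), (1, c_r)
    offset = 0.5 * (c_l - c_r) / (c_l - 2 * c_0 + c_r)
    return d + offset
```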

14. Describe one or more simple ways to reject outliers in stereo correspondences.

  • Take the direction of the motion into account: in the stereo (left/right) case, going from the left to the right image the correspondence should move to the left
  • Use a bigger patch

15. Is stereo vision the only way of estimating depth information? If not, are you able to list alternative options? (make link to other lectures of course)

  • no
  • n-view-SFM, 2-view-SFM, dense-3d-reconstruction

08-09– Multiple view geometry 2 and 3

1. What’s the minimum number of correspondences required for calibrated SFM and why?

  • 5, because we have 4n knowns for n correspondences (u, v in both views) and 5 unknowns for the motion (3 for the rotation, only 2 for the translation because of the scale ambiguity), plus 3 unknowns for each correspondence (its world point)
  • To solve the system we need at least as many knowns as unknowns
  • 4n >= 5 + 3n ==> n >= 5

2. Are you able to derive the epipolar constraint?

  • p_1 · n must be 0, because p_1 lies in the epipolar plane and n is, by definition, the normal of that plane
  • We express n as T x (R · p_2), where T is the translation and R the rotation between the camera frames
  • ==> [T]_x (the translation written as a cross-product matrix) · R = E (essential matrix)
  • Epipolar constraint: p_2^T · E · p_1 = 0

3. Are you able to define the essential matrix?

  • Yes: E = [T]_x · R; it relates corresponding normalized image coordinates of two calibrated views through the epipolar constraint p_2^T · E · p_1 = 0

4. Are you able to derive the 8-point algorithm?

  • Each p_i is expressed as (u_i, v_i, 1)
  • Write the epipolar constraint as a system of equations in the unknown elements of E
  • Write the system as a matrix multiplication in which E is expressed as a vector: Q · Ē = 0
  • Q is the vertical stack of one such equation per correspondence
  • Solve this homogeneous system using SVD
  • Ē is the singular vector corresponding to the smallest singular value
  • Then decompose E using SVD
  • Extract R and T according to the slides (won't be asked)

  • (Plus coordinate normalization, see questions 8-10)

5. How many rotation-translation combinations can the essential matrix be decomposed into?

  • Into 4, but only one of them places the points in front of both cameras

6. Are you able to provide a geometrical interpretation of the epipolar constraint? (Figure: epipolar constraint.)

7. Are you able to describe the relation between the essential and the fundamental matrix?

  • The fundamental matrix is used when the intrinsics are unknown, so it absorbs them
  • F = K_2^(-T) · E · K_1^(-1)

8. Why is it important to normalize the point coordinates in the 8-point algorithm?

  • Because otherwise Q can be poorly conditioned
  • Pixel coordinates range from 0 up to the image size, so when multiplying them the values can get very large (compared to the entries equal to 1)
  • This is purely a numerical problem when solving the system: it amplifies the noise

9. Describe one or more possible ways to achieve this normalization.

  • Move the origin to the image center and rescale the coordinates to [-1, 1]
  • Or: shift the mean to (0, 0) and normalize the standard deviation to sqrt(2) (Hartley scaling)
  • The normalization can be expressed as a matrix B_i: p_norm = B_i · p_i

10. Are you able to describe the normalized 8-point algorithm?

  • Normalize the point correspondences as described above (p_norm = B_i · p_i)
  • Estimate the normalized F_norm from the normalized coordinates
  • Un-normalize: F = B_2^T · F_norm · B_1
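
A sketch of the normalized 8-point algorithm for the fundamental matrix, following the three steps above (Hartley normalization, SVD solve, un-normalization). The rank-2 enforcement on F is included as the usual extra step; the exact normalization constants are one common convention.

```python
import numpy as np

def normalize(pts):
    """Hartley normalization: zero mean, average distance sqrt(2). Returns Nx3 points and B."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    B = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (B @ pts_h.T).T, B

def eight_point(pts1, pts2):
    """pts1, pts2: Nx2 corresponding pixel coordinates, N >= 8. Returns F with p2^T F p1 = 0."""
    x1, B1 = normalize(pts1)
    x2, B2 = normalize(pts2)
    # one row of Q per correspondence (epipolar constraint, F stacked as a 9-vector)
    Q = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(Q)
    F_norm = Vt[-1].reshape(3, 3)             # singular vector of the smallest singular value
    U, S, Vt2 = np.linalg.svd(F_norm)         # enforce rank 2
    F_norm = U @ np.diag([S[0], S[1], 0]) @ Vt2
    return B2.T @ F_norm @ B1                 # un-normalize: F = B2^T F_norm B1
```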

11. Are you able to provide quality metrics for the essential matrix estimation?

  • Algebraic error: the deviation of p_2^T · E · p_1 from 0
  • Directional error: the normalized version of the algebraic error
  • Epipolar line distance: the distance of an image point from its epipolar line
  • Reprojection error

12. Why do we need RANSAC?

  • to remove outliers

13. What is the theoretical maximum number of combinations to explore?

  • for pairs: n(n-1)/2
  • for s points in sample: n_choose_s = n!/(s!(n-s)!)

14. After how many iterations can RANSAC be stopped to guarantee a given success probability?

  • k = log(1-p)/log(1-w^s), where p is the desired probability of success, w the fraction of inliers, and s the number of points per sample
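
Plugging the formula in (the success probability and inlier fraction are made-up example values):

```python
import math

def ransac_iterations(p_success, inlier_fraction, sample_size):
    """k = log(1-p) / log(1 - w^s)."""
    return math.ceil(math.log(1 - p_success) / math.log(1 - inlier_fraction ** sample_size))

print(ransac_iterations(0.99, 0.5, 8))   # 8-point sample: ~1177 iterations
print(ransac_iterations(0.99, 0.5, 2))   # 2-point sample: 17 iterations
```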

15. What is the trend of RANSAC vs. iterations, vs. the fraction of outliers, vs. the number of points to estimate the model?

  • The required number of iterations grows exponentially with the fraction of outliers; the 8-point curve rises first (steepest), then the 5-point, then the 2-point curve

16. How do we apply RANSAC to the 8-point algorithm, DLT, P3P?

  • 8-point: sample 8 correspondences, estimate the model with the 8-point algorithm, count the inliers, and repeat k times keeping the best model
  • DLT: sample 6 correspondences, estimate the model with the DLT, count the inliers, repeat k times
  • P3P: sample 3+1 correspondences, estimate the model with P3P, count the inliers, repeat k times

17. How can we reduce the number of RANSAC iterations for the SFM problem? (1- and 2-point RANSAC)

  • Exploit constraints on the motion
  • Planar motion -> 2-point RANSAC
  • Planar and circular motion -> 1-point RANSAC

18. Bundle Adjustment and Pose Graph Optimization. Mathematical expressions and illustrations. Pros and cons.

  • BA: minimizes the reprojection errors and adjusts both the camera poses and the 3D points \(\begin{align*} (P^i,C_1,C_2) = argmin_{P^i,C_1,C_2} \sum_{i=1}^{N} \lVert {p_1}^i - \pi_1(P^i,C_1)\rVert^2 + \lVert {p_2}^i - \pi_2(P^i,C_2)\rVert^2 \end{align*}\)
  • PGO: minimizes the pose errors (roto-translations) over all pairs of poses that share correspondences; it only adjusts the poses \(\begin{align*} C_k = argmin_{C_k} \sum_{i} \sum_{j} \lVert C_i - C_jT_{ij} \rVert^2 \end{align*}\)
  • Pro/con: BA is more accurate since it uses the image measurements directly, but much more expensive; PGO is cheaper since it ignores the 3D points

19. Are you able to describe hierarchical and sequential SFM for monocular VO?

Hierarchical:

  • Build clusters of 3 views with the same features
  • Construct a point cloud from two of these views (2D-2D)
  • Merge the third view (3D-2D)
  • Merge the clusters pairwise (3D-3D)

Sequential:

  • Bootstrap (2D-2D for the first two keyframes) -> initial point cloud
  • Until the next keyframe, only estimate the pose (P3P)
  • At every new keyframe, add more points to the point cloud (P3P for the pose, then triangulation for new 3D points)
  • Select a new keyframe when keyframe-distance / average-scene-depth reaches 0.1 to 0.2

  • Pro/con: too many keyframes (too small baselines) -> too large depth uncertainty and unnecessary cost; too few keyframes -> too few correspondences

20. What are keyframes? Why do we need them and how can we select them?

  • see above

21. Are you able to define loop closure detection? Why do we need loops?

  • Loop detection: recognizing that we are at a place we have visited before (place recognition)
  • Then we can close the loop: we know the two poses should coincide, but our estimate says otherwise (pose-graph optimization)
  • ==> VSLAM
  • We need loops to know how much we have drifted
  • We need loops for loop closure, which improves the optimization (removes the drift)

22. Are you able to provide a list of the most popular open source VO and VSLAM algorithms?

  • ORB-SLAM
  • LSD-SLAM
  • DSO
  • SVO

23. Are you able to describe the differences between feature-based methods and direct methods?

  • Feature-based methods work on extracted features and minimize the reprojection error
  • Direct methods work directly on the pixel intensities and minimize the photometric error

24. Sparse vs semi-dense vs dense. What are their pros and cons?

  • Sparse methods only look at a few selected pixels (the features)
  • Semi-dense: look at more pixels (e.g. all pixels with a strong gradient)
  • Dense: look at all pixels

  • Dense methods are more costly, because we have to compute over all pixels, but they tend to be more accurate
  • Sparse methods are faster, less accurate
  • Dense methods are useful with motion blur and defocus
  • Sparse methods are as good as dense ones when the overlap is > 30%

25. General definition of VO and comparison with respect VSLAM and SFM. Definition of loop closure detection (why do we need loops?). How can we detect loop closures? (make link to other lectures).

  • SFM is the most general, VSLAM adds loop closing to VO: SFM > VSLAM > VO
  • Loops: see question 21; they can be detected with place recognition (e.g. Bag of Words, see the place-recognition lecture)

10 – Multiple view geometry 4 (benchmarking visual SLAM) not covered in class, so it won’t be asked.

11 – Tracking

1. Are you able to illustrate tracking with block matching?

  • shifting window to find patch again in 2nd image
  • shift indicates direction of movement
  • move inside searchregion and define comparison method

2. Are you able to explain the underlying assumptions behind differential methods, derive their mathematical expression and the meaning of the M matrix?

  • Assumptions: 1. Brightness constancy, 2. Temporal consistency, 3. Spatial coherency
  • Temporal consistency: the motion between frames is small (1-2 pixels)
  • Spatial coherency: neighboring pixels belong to the same surface and therefore undergo similar motion

  • Differential methods minimize the difference between the first image and the shifted second image
  • Take the same patch in both images, but approximate the shift in the second image with a first-order Taylor expansion
  • $SSD = \sum(\Delta I - I_x u - I_y v)^2$
  • Take the partial derivatives of the SSD with respect to (u, v) and set them to 0
  • Express the result as a matrix equation: $(u,v)^T = M^{-1}(\sum I_x \Delta I, \sum I_y \Delta I)^T$ (this is the shift)
  • The interpretation of M is the same as for Harris (both eigenvalues large -> good to track)
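
A sketch of this translational case for a single patch, following the equations above (the grayscale images I1, I2, the patch center and the patch size are assumptions; the sign convention of Delta-I follows the formula above):

```python
import numpy as np

def patch_flow(I1, I2, x, y, half=7):
    """Estimate the shift (u, v) of the patch centred at (x, y) between I1 and I2."""
    ys, xs = slice(y - half, y + half + 1), slice(x - half, x + half + 1)
    Iy, Ix = np.gradient(I1[ys, xs])          # spatial gradients of the patch
    dI = (I1 - I2)[ys, xs]                    # temporal difference Delta-I

    M = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = np.array([np.sum(Ix * dI), np.sum(Iy * dI)])
    return np.linalg.solve(M, b)              # (u, v) = M^-1 b, M must be invertible
```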

3. When is this matrix invertible and when not?

  • For M to be invertible, its determinant has to be non zero

4. What is the aperture problem and how can we overcome it?

  • If we only see an edge (through a small aperture), we cannot determine the motion along the edge
  • We can overcome it by choosing a larger patch

5. What is optical flow?

  • Optical flow (or optic flow) is the pattern of apparent motion of objects in a visual scene caused by the relative motion between the observer (an eye or a camera) and the scene

6. Can you list pros and cons of block-based vs. differential methods for tracking?

  • Block-based methods are more robust to large motions but costly
  • Differential methods are only applicable to small motions, but are faster

7. Are you able to describe the working principle of KLT?

  • Start with an initial guess of the warp from image 1 to the template
  • Iteratively refine the warp parameters to minimize the SSD between the warped image 1 and the template
  • Use the final warp of image 1 as the initial guess for image 2
  • (The template is used to find the object in the image and its initial placement (warp), so that we can express the first warp from image 1 to the template)

8. What functional does KLT minimize? (proof won’t be asked, only the first two slides titled “derivation of the Lucas-Kanade algorithm”)

  • It minimizes E, the SSD between the warped image and the template (see the slides “Derivation of the Lucas-Kanade algorithm”)

9. What is the Hessian matrix and for which warping function does it coincide to that used for point tracking?

  • For the KLT affine warp (6 parameters, 2x6 Jacobian) H is a 6x6 matrix
  • For point tracking (differential methods) the coinciding warp is a pure translation (only 2 parameters) -> H is a 2x2 matrix
  • So it coincides with the translational warp, where H equals the M matrix
  • The Jacobian is a (number of functions) x (number of parameters) matrix
  • The Hessian is a (number of parameters) x (number of parameters) matrix
  • The Jacobian contains the first partial derivatives
  • The Hessian contains the second partial derivatives
  • H = J^T · J
  • The number of functions is in our case always 2 (the x and y coordinates)
  • The number of parameters depends on the warping function
  • In the differential methods we take a first derivative to build the SSD and then another derivative to minimize it -> second partial derivatives => Hessian

10. Can you list Lukas-Kanade failure cases and how to overcome them?

  • The brightness-constancy, temporal-consistency or spatial-coherency assumptions do not hold
  • Occlusions
  • Errors in the template
  • If the initial estimate is too far off, the linear approximation no longer holds -> solution: pyramidal (coarse-to-fine) implementations (see next slide)

11. How do we get the initial guess?

  • For consecutive frames, take the solution of the previous step
  • For the first frame: assume the camera starts at rest, or detect the object in the image and track it from there

12. Illustrate alternative tracking using point features.

  • Keypoint tracking:
    1. Keypoint detection and matching
    2. Geometric verification (RANSAC)

12a – Dense 3D Reconstruction

1. Are you able to describe the multi-view stereo working principle? (aggregated photometric error)

  • Sample depth hypotheses along the ray of each pixel in the reference view
  • Aggregate the photometric error over multiple views
  • Stacking the errors over the depth hypotheses as a 3rd dimension gives the disparity space image (DSI)

2. What are the differences in the behavior of the aggregated photometric error for corners, flat regions, and edges?

  • The aggregated photometric error for flat regions (a) and for edges parallel to the epipolar line (c) shows flat valleys (plus noise)
  • For distinctive features (corners as in (b), or blobs), the aggregated photometric error has one clear minimum
  • Non-distinctive features (e.g. from repetitive texture) show multiple minima

3. What is the disparity space image (DSI) and how is it built in practice?

  • See question 1: the DSI stores, for every pixel (u, v) of the reference view and every depth hypothesis d, the aggregated photometric error C(u, v, d)

4. How do we extract the depth from the DSI?

  • argmin of DSI

5. How do we enforce smoothness (regularization) and how do we incorporate depth discontinuities (mathematical expressions)?

  • Do not just take the per-pixel argmin, but look for an optimal depth function d(u, v)
  • Adjust this function to minimize the data-term error plus the regularization-term error
  • The regularization term is based on the partial derivatives of d(u, v); it can be weighted down across strong intensity gradients to allow depth discontinuities there
  • The regularization term is weighted by lambda

6. What happens if we increase lambda (the regularization term)? What if lambda is 0? And if lambda is too big?

  • If lambda = 0: no regularization (noisy depth map)
  • For increasing lambda the map gets smoother, but the data-term error also increases (too large -> over-smoothed, discontinuities are washed out)

7. What is the optimal baseline for multi-view stereo?

  • When baseline becomes large (e.g., >10% of the avg scene depth), then create new reference frame (keyframe) and start a new depth map computation

8. What are the advantages of GPUs?

  • A GPU can run many more threads in parallel, and in image processing we often have many simple, independent tasks (one per pixel/feature)
  • A CPU has more cache (and control logic), which these simple per-pixel tasks do not need

12b – Place Recognition

1. What is an inverted file index?

  • Like a book index: a list of e.g. the pages on which each word appears
  • For images: instead of storing which features/visual words each image contains, we store, for each visual word, the list of images containing it
  • A word-to-image mapping

2. What is a visual word?

  • the centroid of a cluster in the descriptor space
  • “average feature”

3. How does K-means clustering work?

  • Randomly pick k points (e.g. 10) in descriptor space as the initial centroids
  • Assign every feature to its nearest centroid
  • Move each centroid to the mean position of its assigned features
  • Repeat until convergence
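
A bare-bones NumPy version of this loop (the values of k, the number of iterations and the seed are arbitrary choices):

```python
import numpy as np

def kmeans(descriptors, k=10, iters=20, seed=0):
    """descriptors: NxD array. Returns the k centroids (the 'visual words')."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every descriptor to its nearest centroid
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # move each centroid to the mean of its assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = descriptors[labels == j].mean(axis=0)
    return centroids
```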

4. Why do we need hierarchical clustering?

  • For faster access in the inverted file index: with a hierarchical vocabulary (tree), the nearest visual word is found in logarithmic instead of linear time

5. Explain and illustrate image retrieval using Bag of Words. (Figure: hierarchical clustering.)

6. Discussion on place recognition: what are the open challenges and what solutions have been proposed?

  • Geometric verification of the candidates
  • How to find the best match among many good ones
  • How to handle multiple occurrences of the same word
  • The optimal descriptor
  • Optimal magic numbers: descriptor length, number of clusters, branches, depth
  • How to pre-group the images
  • How big the database should be

  • FAB-MAP: captures the (spatial) dependencies between visual words

12c – Deep Learning – Won’t be asked at the exam

13 – Visual inertial fusion

1. Are you able to answer the following questions?

2. Why is it recommended to use an IMU for Visual Odometry?

  • They are complementary systems: VO is bad for fast motions (blur, ...) where the IMU is good; VO gives rich information, while the IMU drifts heavily

Pro Camera

  • Precise in slow motion
  • Rich information for other tasks

Con Camera

  • Limited output rate (~100 Hz)
  • Scale ambiguity in monocular setup
  • Lack of robustness to HDR scenes and high-speed motion

Pro IMU

  • Robust
  • High output rate (~1,000 Hz)
  • Accurate at high acceleration
  • Can predict next feature position

Con IMU

  • Large relative uncertainty when at low acceleration/angular velocity
  • Ambiguity in gravity / acceleration

3. Why not just an IMU, without a camera?

  • IMU and vision are complementary
  • An IMU alone drifts very quickly (the position error grows quadratically with time)

4. How does a MEMS IMU work?

  • Accelerometer: a spring-like structure connects the housing to a seismic mass; a capacitive divider converts the displacement of the seismic mass into an electric signal. Damping is provided by the gas sealed in the device.

  • Gyroscopes measure the Coriolis forces acting on MEMS vibrating structures (tuning forks, vibrating wheels, or resonant solids)

5. What is the drift of an industrial IMU?

  • Integration of the angular velocity gives the orientation: a bias in the angular velocity causes an error proportional to t
  • Double integration of the acceleration gives the position: a bias in the acceleration causes a position error proportional to t^2

6. What is the IMU measurement model?

Measurement of Acceleration

The formula states that the acceleration of frame B(ody) with respect to frame W(orld), measured in frame B, equals the true acceleration of B with respect to W expressed in frame W, minus the gravity vector expressed in frame W, all rotated from frame W into frame B. On top of this we add a bias and noise.

Measurement of Angular Velocity

The formula states that the measured angular velocity of frame B with respect to frame W, expressed in frame B, equals the true angular velocity of B with respect to W expressed in frame B, plus a bias and noise.
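
Written out, the standard model that the two paragraphs above describe looks roughly as follows (a reconstruction, so the notation may differ from the slides: B and W denote the body and world frames, b the biases and n the noise terms):

\(\begin{align*} \tilde{a}_B(t) &= R_{BW}(t)\big(a_W(t) - g_W\big) + b_a(t) + n_a(t) \\ \tilde{\omega}_B(t) &= \omega_B(t) + b_g(t) + n_g(t) \end{align*}\)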

7. What causes the bias in an IMU?

  • Some facts about IMU biases:
  • They can change due to temperature changes and mechanical pressure
  • They can change every time the IMU is started
  • Good news: they can be estimated

8. How do we model the bias?

  • the derivative of the bias is white Gaussian noise

9. How do we integrate the acceleration to get the position (formula)? (Figure: IMU model integration.)

10. What is the definition of loosely coupled and tightly coupled visual inertial fusion?

  • Loosely coupled approach: the vision algorithm first computes its own output (e.g. poses / up-to-scale motion), which is then fused with the IMU measurements
  • Tightly coupled approach: the raw image measurements (features) and the IMU measurements are fused together in a single estimator

11. How can we use non-linear optimization-based approaches to solve for visual inertial fusion?

  • Filtering (weighting) and smoothing
  • Filtering: Kalman filter, which only estimates the latest state; smoothing: non-linear least-squares optimization over several past states

12. Can you write down the cost function of smoothing methods and illustrate its meaning?

  • not exactly but for sure not like in slides

14 – Event-based Vision

1. Are you able to answer the following questions?

2. What is a DVS and how does it work?

  • Dynamic Vision Sensor: it reacts to changes of the light intensity
  • A DVS outputs asynchronous events at microsecond resolution; an event is generated each time a single pixel detects a change of intensity

3. What are its pros and cons vs. standard cameras?

  • much faster “framerate”
  • low latency
  • high dynamic range
  • low power

BUT

  • asynchronous pixels
  • no absolute intensity information

4. Can we apply standard camera calibration techniques?

  • Yes, using a flashing (blinking) calibration grid

5. How can we compute optical flow with a DVS?

  • From Δt and the brightness constancy assumption we can calculate the motion (flow):

$L(x, y, t) = L(x+u, y+v, t+ \Delta t)$

We can then use a first-order Taylor expansion to approximate the right-hand term:

$L(x, y, t) \approx L(x, y, t+\Delta t) + \frac{\partial L}{\partial x} u + \frac{\partial L}{\partial y} v$

This can be rewritten as:

$L(x, y, t+\Delta t) - L(x, y, t) = -\frac{\partial L}{\partial x} u - \frac{\partial L}{\partial y} v$

$-\nabla L \cdot (u, v)^T = C$

6. Could you intuitively explain why we can reconstruct the intensity?

  • We first do a probabilistic estimation of the gradient and the rotation from $-\nabla L \cdot (u, v)^T = C$. Then we obtain the intensity values from a Poisson reconstruction, from which we can reconstruct the image live using a GPU. The only thing we need is a base image relative to which we interpret the intensity changes.

7. What is the generative model of a DVS (formula)? Can you derive its 1st order approximation?

  • $\pm C = \log I(\mathbf{x}, t) - \log I(\mathbf{x}, t - \Delta t)$
  • Its 1st-order approximation is the brightness-constancy equation, which requires knowledge of the contrast sensitivity C

8. What is a DAVIS sensor?

  • combination of event camera and normal camera

9. What is the focus maximization framework and how does it work? What is its advantage compared with the generative model?

  • Search for the warp (motion) that maximizes the sharpness of the image of events accumulated (“flattened”) over time
  • It makes object segmentation possible and gives a direct solution for the motion

10. How can we get color events?

  • one event stream for each color -> process -> combine