
Visual Odometry and Visual Algorithms [Part 14] - Place Recognition

Chapter 14 - Place Recognition

Lecture 12b

Slides 1 - 51

In this chapter we will dive into the field of place recognition. In the context of visual odometry, place recognition is important for recognizing when we arrive at an already visited location, so that loops can be closed. In order to recognize locations efficiently, we need an efficient representation of the features of an image which can then be clustered and searched for similar ones. Place recognition can of course also be used for other applications, for example image retrieval, where we want to find images similar to a given one.

We will now show how important an efficient representation of an image is for image retrieval. Let's say we have a database with 100 million images. How can we search through all of those images to find a matching one in about 6 seconds?

Image Querying Figure 1: Image Querying. source

So our general goal is to query a database with $N$ images. If we extract $M$ features from each image, then comparing our query image against all the others means iterating over each image ($O(N)$) and comparing each feature of the query image with each feature of the database image ($O(M^2)$). The total complexity is therefore $O(N \cdot M^2)$. To get an impression of this, let's assume we have 100 million images in the database and extract 1’000 features from each image. Then we would need to run 100’000’000’000’000 feature comparisons. As a reference, if each comparison takes 0.1 ms, the image query would take 317 years!
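Written out, with $N = 10^8$ images, $M = 1’000$ features per image and 0.1 ms per comparison, these numbers follow directly:

\[\begin{align*} N \cdot M^2 = 10^8 \cdot (10^3)^2 = 10^{14} \text{ comparisons}, \qquad 10^{14} \cdot 10^{-4}\,\text{s} = 10^{10}\,\text{s} \approx 317 \text{ years} \end{align*}\]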

So how can we reduce this complexity? The answer is the inverted file index: we need a representation where we only iterate over the features and not over the images. This reduces the complexity to $O(M)$.

Indexing local features

Indexing of visual features works similarly to the index of a text document, where most often at the end of the book there is a list showing on which page each word appears. In the visual context we want to find the images that contain a certain feature. However, compared to the text example, image features come in infinitely many variations, whereas text at least has a bounded number of words. To reduce this number we can define visual words, as well as a vocabulary of such visual words. This approach is called Bag of Words (BoW).

How can we extract such visual words from the feature descriptors? To get a reasonable vocabulary we need plenty of data and therefore a large enough dataset. For each image, the features and descriptors are extracted and mapped into the descriptor space, whose dimensionality depends on the representation of the descriptor. The descriptor space is then clustered into $K$ clusters, where $K$ is the number of visual words we want to have. For each cluster the center point (centroid) is derived and represents the visual word. Usually the centroid is computed by taking the arithmetic average of all descriptors in the cluster.
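As an illustration, here is a minimal vocabulary-building sketch in Python. It assumes OpenCV with SIFT for descriptor extraction and scikit-learn for the clustering; the image paths and the choice of $K$ are placeholders, not part of the lecture material.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, k=1000):
    """Cluster all SIFT descriptors of a dataset into k visual words."""
    sift = cv2.SIFT_create()
    all_descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = sift.detectAndCompute(img, None)
        if des is not None:
            all_descriptors.append(des)
    # Stack all descriptors into one (n_descriptors, 128) array
    data = np.vstack(all_descriptors).astype(np.float32)
    # Each cluster centroid in descriptor space is one visual word
    kmeans = KMeans(n_clusters=k, n_init=3).fit(data)
    return kmeans.cluster_centers_  # the vocabulary: (k, 128)
```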

Descriptor Space Figure 2: Descriptor Space. source

Clustering

How exactly do we cluster the descriptor space? The usual algorithm for this is k-means clustering. This algorithm partitions $n$ data points into $k$ (a user-defined number) clusters, where each cluster has a center (centroid) $m_i$. It minimizes the squared Euclidean distance between the descriptor points and the nearest cluster center $m_i$:

\[\begin{align*} D(X,M) = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x-m_i \rVert^2 \end{align*}\]

where $S_i$ is cluster $i$.

The k-means algorithm starts out with $k$ randomly chosen cluster centers and then iterates. In each iteration, it first assigns all points to their nearest cluster center. It then updates each cluster center based on the assigned points, i.e. it moves the center to the mean position of all points assigned to it.
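To make the two alternating steps concrete, here is a minimal NumPy sketch of k-means. It uses random initialization and a fixed number of iterations; a real implementation would also check for convergence and handle empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: X is (n, d); returns (k, d) centroids and labels."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return centroids, labels
```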

Clustering Figure 3: Clustering. source

Applying Visual Words to Image Retrieval

From the clustering we get the vocabulary of the domain, meaning all possible visual words. The inverted file index now lists all visual words from that vocabulary, and for each word we keep a list of the images that contain it. By feeding new images into this process we fill more images into the index.

When we search using a query image, we look up which visual words appear in that query image. For each of the appearing words we check which other images contain the same word, and for each of those images we increment a counter in the voting array. The voting array is a list with one counter entry per image. In the end we select the image with the highest counter value in the voting array as the match for the query image.

Using this inverted file index, querying an image is independent of the number of images in the database. However, every feature in the query image still needs to be compared against all visual words in the vocabulary. Given that we still have 1’000 features per image and a vocabulary of 1’000’000 visual words, this results in 1’000’000’000 comparisons. This process would run for 28 hours if each comparison takes 0.1 ms. Therefore the goal of achieving a query time under 6 seconds is still not reached.
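A minimal sketch of the inverted file index and the voting array in Python; the word ids are assumed to come from whatever feature-to-word quantization is used (e.g. nearest centroid):

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # visual word id -> set of image ids containing that word
        self.word_to_images = defaultdict(set)

    def add_image(self, image_id, word_ids):
        """Index an image by the visual words it contains."""
        for w in word_ids:
            self.word_to_images[w].add(image_id)

    def query(self, word_ids):
        """Vote for database images sharing words with the query."""
        votes = defaultdict(int)  # the voting array
        for w in word_ids:
            for image_id in self.word_to_images[w]:
                votes[image_id] += 1
        # The image with the highest vote count is the match
        return max(votes, key=votes.get) if votes else None
```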

Hierarchical clustering

The solution to making the process even faster is hierarchical clustering. Instead of clustering into $k$ clusters directly, one can cluster into $b$ clusters, then cluster each of those clusters again into $b$ clusters, and so forth. This way we get a tree-like structure which makes the feature-to-visual-word comparison faster: for each feature we walk down this cluster tree to a leaf node, which then describes the visual word. This means that we no longer have to compare each feature with each visual word. The process of walking down the tree is the same for building the index as for retrieving a match.

Now let's calculate how long a query takes in a database with 100’000’000 images, 1’000 features per image and 1’000’000 visual words. Let's say we structure our hierarchical clustering to have 10 clusters on each level: we first cluster into 10 clusters, then cluster each cluster again into 10 clusters, and repeat this 6 times in total. We then have a branching factor of $b = 10$ and $L = 6$ depth levels, which results in $b^L = 1’000’000$ visual words. We have to compare each feature with $b = 10$ cluster centers on each level, so the number of comparisons is $M \cdot b \cdot L = 1’000 \cdot 10 \cdot 6 = 60’000$. If we again assume 0.1 ms per comparison, a query takes 6 seconds.
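As a sketch, such a tree can be built by applying the kmeans() helper from above recursively, and a feature is quantized by descending to the nearest child at every level (for a complete tree the derived leaf index is a unique word id):

```python
import numpy as np

def build_tree(X, b=10, depth=6):
    """Recursively cluster descriptors with branching factor b."""
    if depth == 0 or len(X) < b:
        return {"leaf": True}
    centroids, labels = kmeans(X, b)  # kmeans() from the sketch above
    children = [build_tree(X[labels == i], b, depth - 1) for i in range(b)]
    return {"leaf": False, "centroids": centroids, "children": children}

def quantize(node, x, b=10, word_id=0):
    """Walk down the tree; the leaf reached is the visual word."""
    if node["leaf"]:
        return word_id
    # Only b comparisons per level instead of b^L in total
    i = int(np.linalg.norm(node["centroids"] - x, axis=1).argmin())
    return quantize(node["children"][i], x, b, word_id * b + i)
```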

Hierarchical clustering Figure 4: Hierarchical clustering. source

Robustness

One downside of this approach is that it loses the spatial information in the images. The matching is purely feature based, which for the most part does not carry any geometric information. This is especially unfortunate if there are multiple reasonable candidate matches but one of them is always picked because it scores slightly better. In such a case it is beneficial to have a valid method to differentiate the possible solutions and find the best one. We can overcome this by including the geometric information as well. This we can achieve by using the 5- or 8-point algorithm to get the geometric orientation of the image; we can then apply this to the most similar images and compare the reprojection error.
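A sketch of such a geometric verification, assuming calibrated cameras and OpenCV: estimate the essential matrix between the query and each candidate with RANSAC (OpenCV's findEssentialMat uses the 5-point algorithm internally) and rank the candidates by their inlier count. The keypoint lists, matches and camera matrix K are placeholders here.

```python
import cv2
import numpy as np

def geometric_score(kp_query, kp_cand, matches, K):
    """Count essential-matrix RANSAC inliers as a geometric score."""
    pts1 = np.float32([kp_query[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp_cand[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K,
                                   method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Candidates that are only visually similar get few inliers
    return 0 if mask is None else int(mask.sum())
```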

So far we have chosen the number of words in our vocabulary more or less arbitrarily. But how much does it matter how many words we have? The more words we have, the better we can differentiate between images. But we would also need more branches or levels, so the query time grows. In general, the larger the vocabulary already is, the smaller the improvement we get from adding even more words.

Number of Words Figure 5: Number of Words. source

We can also vary the branching factor: for a given number of words we then just have to adjust the depth accordingly.

Branch Factor Figure 6: Branch Factor. source

Summary

summary Figure 7: summary. source

This post is licensed under CC BY 4.0 by the author.