Chapter 16 - Event based vision
Slides 1 - 72
Feature-based Visual Odometry has been a field of research for a few decades now. Great improvements in efficiency and robustness were made between 1980 and 2007, while accuracy stayed roughly the same for most of that time. It was the combination of VO and IMU sensors that significantly increased accuracy while also improving robustness. The last big step in accuracy and efficiency came recently, in 2014, with the use of event-based cameras, which is why we take a deeper look at event cameras in this article.
As we have seen in the last article, IMUs are very helpful for predicting the camera's position when the camera's transformation happens so quickly that images get blurry and untrackable. However, the IMU is an intrinsic sensor: it does not look at the world around it but measures data internally. We therefore still have the downsides of the camera, which we can hardly overcome with intrinsic measurements. The IMU can definitely help with high-speed motion up to a certain degree, and probably with dynamic environments as well. But it cannot help with high dynamic range, the long latency of the camera sensor, or low-texture scenes. These challenges can all be overcome by using event cameras.
Dynamic Vision Sensor
Event-Based Cameras (EBC) are inspired by the human eye. While traditional cameras capture whole frames at fixed intervals, the human eye has about 130 million photoreceptors that do not forward signals to their axons periodically, but only whenever they register a change in their receptive field. This leads to a very low information flow, since the perceived scene is mostly static and only a fraction of it changes at any time. Dynamic Vision Sensors (DVS) work the same way. They are
- low latency (~1 microsecond)
- high dynamic range (HDR) (140 dB instead of 60 dB)
- high refresh rate (1 MHz)
- low power (0.01 W compared to 1 W)
image sensors. They are special because they do not register frames every few milliseconds, but output a high-frequency signal only for those pixels that register an intensity change above a certain threshold. So if you film a static scene with an event camera, it will NOT produce any output signal, since no pixel registers a change. Only when you move the camera do all pixels whose value changes significantly produce an output, for the duration of the movement.
In the following image, you can see the output over time of a standard camera and an event-based camera, both filming a rotating plate with a dark marker on it. We can note a few things here. First, the output of the standard camera is a sequence of whole frames (images) over time, while the output of the event-based camera consists of individual pixels over time, but with a much higher temporal density. Second, the event-based camera does not register the central point of the spinning plate, since it does not change. Third, the DVS output pixels are either blue or red, where red symbolizes an increase in pixel intensity and blue a decrease. The output of the DVS is therefore a set of asynchronous events, emitted whenever a pixel registers a change in intensity. An event can therefore be characterized by (time, (u,v), sign).
Figure 1: Rotating plate, filmed by std and event-based camera. source
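To fix ideas, here is a minimal sketch of such an event stream in Python; the array layout and the example values are hypothetical, not the format of any specific sensor:

```python
import numpy as np

# A minimal representation of a DVS event stream: each event is
# (timestamp, pixel coordinates (u, v), polarity/sign), as described above.
event_dtype = np.dtype([("t", np.float64),   # timestamp in seconds (microsecond resolution)
                        ("u", np.uint16),    # pixel column
                        ("v", np.uint16),    # pixel row
                        ("p", np.int8)])     # +1 = intensity increase, -1 = decrease

# A few hypothetical events, e.g. from a moving edge.
events = np.array([(0.000001, 120, 64, +1),
                   (0.000003, 121, 64, +1),
                   (0.000004, 119, 65, -1)], dtype=event_dtype)
print(events["t"], events["p"])
```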
Event based Cameras
Event-based cameras are simply cameras that use a DVS sensor. To understand the camera's functionality in detail, we will examine a single pixel over time and see how it creates events. This is a valid simplification, since the pixels are asynchronous and independent of each other anyway; by looking at a single pixel, we can explain the camera's functionality as a whole.
First, we plot the events generated by a single pixel of an event camera over time, together with a curve showing the light intensity it receives. The intensity I is sampled on a logarithmic scale. An event is generated whenever the intensity increases or decreases by a constant threshold C. Positive changes result in a blue event (ON event), negative changes in a red one (OFF event). For constant intensity, no events are triggered. An event camera therefore samples the intensity, while a standard camera samples the time.
Figure 2: Sampling (Event Camera, Std. Camera). source
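To make this sampling behaviour concrete, here is a toy single-pixel event generator; the threshold value and the sinusoidal input are made up for illustration and this is not a real sensor model:

```python
import numpy as np

def pixel_events(intensity, times, C=0.15):
    """Emit (time, sign) events for ONE pixel: an event fires whenever the
    log-intensity moves by more than the contrast threshold C away from the
    level recorded at the last event (illustrative sketch)."""
    logI = np.log(intensity)
    ref = logI[0]                       # level at the last event (anchor point)
    events = []
    for t, L in zip(times, logI):
        while L - ref >= C:             # brightness increased -> ON event
            ref += C
            events.append((t, +1))
        while ref - L >= C:             # brightness decreased -> OFF event
            ref -= C
            events.append((t, -1))
    return events

# Example: a sinusoidal brightness signal hitting the pixel.
t = np.linspace(0, 1, 1000)
I = 1.0 + 0.5 * np.sin(2 * np.pi * 3 * t)
print(len(pixel_events(I, t)))          # number of ON/OFF events generated
```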
This explains why event cameras have such a high update rate: the sensor does not have to process all pixels every few milliseconds, but only the ones triggering events, at a much higher rate. The question remains why event-based cameras are superior to normal ones with regard to dynamic range. Remember that a high dynamic range means that a camera sees regions of very bright and very dark pixels at the same time. Since event-based pixels are asynchronous, we do not need a global shutter time that defines the exposure. Each pixel only keeps track of the current intensity relative to the intensity at its last threshold crossing. When a new event is generated by a sufficiently large intensity change, everything is reset and the differentiation starts from this new anchor point. The range to examine is therefore much smaller and handled on a per-pixel level, which makes it easy to deal with a high dynamic range in the frame.
Using this knowledge, we can now examine a picture taken by an event-based camera that is rotating to the right. The result is an image with blue pixels wherever bright pixels became darker, and red pixels wherever dark pixels became brighter. An event-based camera is therefore automatically an edge detector.
Figure 3: Picture taken by an event based camera. source
Note that in order to create this picture, we had to accumulate all events over a specific period of time (in this case 40 ms); otherwise the output would be too sparse and barely perceivable for us.
Another method to visualize the output of an event-based camera is to aggregate (sum up) all positive (+1) and negative (-1) events in a given time interval and display the result as a greyscale image.
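A minimal sketch of this aggregation, assuming the events are stored in a structured array with fields u, v and p as in the hypothetical layout above:

```python
import numpy as np

def accumulate_events(events, H, W):
    """Sum event polarities per pixel over a time window and map the result
    to a greyscale image (sketch; `events` has fields "u", "v", "p")."""
    img = np.zeros((H, W), dtype=np.float32)
    np.add.at(img, (events["v"], events["u"]), events["p"].astype(np.float32))
    # Normalize to [0, 1]: grey = no events, bright = ON-dominated, dark = OFF-dominated.
    m = np.abs(img).max()
    return 0.5 + 0.5 * img / m if m > 0 else np.full((H, W), 0.5, np.float32)
```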
Event-based cameras are great for IoT applications since they are low-power and produce little data. They are also used in automotive applications, where high dynamic range and low memory and data requirements are key, in AR/VR for latency and power reasons, and in various industries involving fast-moving parts.
In many regards, event-based cameras are even superior to professional high-speed cameras.
Figure 4: Comparing High-speed, Standard and Event-Based Cameras. source
Traditional VO Algorithms on EB-Cameras
Event-based cameras are still pinhole cameras after all, so most algorithms should work for them as well. Indeed, we can calibrate an event-based camera using a grid with corner detection; the only difference is that the grid must be blinking in order to constantly generate ON-OFF-ON events. Optical-flow algorithms also work the same way: all event pixels follow the direction of movement. An edge moving constantly in one direction produces a line of events. Plotted over time, we would see a plane of events whose slope corresponds to the moving speed v = dx / dt, with dx being the traveled distance and dt the elapsed time.
Figure 5: Moving Edge over Time. source
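As a toy illustration of recovering v = dx/dt from such a plane of events, here is a 1-D example with made-up numbers (the true speed and the noise level are arbitrary):

```python
import numpy as np

# An edge moving at constant speed v triggers events whose (x, t) coordinates
# lie on a line x = x0 + v * t; fitting that line recovers the flow speed.
v_true, x0 = 40.0, 5.0                          # pixels per second, start position
t = np.sort(np.random.default_rng(0).uniform(0, 0.5, 200))   # event timestamps
x = x0 + v_true * t + np.random.default_rng(1).normal(0, 0.3, t.size)  # noisy positions

v_est, _ = np.polyfit(t, x, 1)                  # slope = dx / dt
print(v_est)                                    # ~40 px/s
```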
Brightness Constancy Assumption
The most fundamental theorem of event-based cameras is that the negative gradient of the logarithmic intensity, $-\nabla L(x, y)$, multiplied (as a dot product) by the optical flow / motion $\mathbf{u} = (u, v)$, equals the contrast threshold C at which we sample the intensity values.
To put it in short: $-\nabla L \cdot \mathbf{u} = C$
Note that the gradient $\nabla L$ is nonzero at edges and points across them, perpendicular to the edge direction. When the movement $\mathbf{u}$ is perpendicular to the gradient, i.e. along the edge, no intensity change occurs and no event is triggered. If the movement follows the gradient, even a small motion leads to an intensity change that exceeds C.
Figure 6: Gradient $-\nabla L$ and direction of movement u. source
We can easily prove this theorem using the brightness constancy assumption, which states that the intensity of a point before and after the motion must be unchanged. Let the time before the motion be $t$ and after the motion $t + \Delta t$, with the motion vector $\mathbf{u} = (u, v)$: \(\begin{align*} L(x, y, t) = L(x+u, y+v, t+\Delta t) \end{align*}\) We can then use a first-order Taylor expansion to approximate the right-hand side: \(L(x, y, t) \approx L(x, y, t+ \Delta t) + \frac{\partial L}{\partial x} u + \frac{\partial L}{\partial y} v\) This can be rewritten as: \(L(x, y, t+ \Delta t) - L(x, y, t) \approx - \frac{\partial L}{\partial x} u - \frac{\partial L}{\partial y} v\)
And we are already at the end. The left-hand side is the intensity difference, so either $\Delta L$ or simply $C$, since an event is triggered exactly when the difference reaches $C$. On the right, we have the gradient $\nabla L = (\frac{\partial L}{\partial x}, \frac{\partial L}{\partial y})$ and the motion vector $\mathbf{u} = (u, v)$. Separating the two, we get, as expected: $-\nabla L \cdot \mathbf{u} = C$
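A quick numerical sanity check of this linearization on a synthetic log-intensity pattern; the pattern, the motion values and the grid size are all made up for illustration:

```python
import numpy as np

# Synthetic smooth log-intensity pattern on a grid (hypothetical example).
H, W = 64, 64
yy, xx = np.mgrid[0:H, 0:W].astype(float)
L0 = np.sin(0.2 * xx) + 0.5 * np.cos(0.15 * yy)

# Small motion (u, v) in pixels during dt: the pattern shifts by (u, v).
u, v = 0.3, -0.2
L1 = np.sin(0.2 * (xx - u)) + 0.5 * np.cos(0.15 * (yy - v))

# Left-hand side: measured log-intensity change Delta L.
dL_measured = L1 - L0

# Right-hand side: -grad(L) . (u, v), with finite-difference gradients.
gy, gx = np.gradient(L0)
dL_predicted = -(gx * u + gy * v)

print(np.abs(dL_measured - dL_predicted).max())  # small -> the linearization holds
```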
How is this useful? Well, we can reconstruct a greyscale image from our events that has super-resolution and high dynamic range (HDR). We first perform a probability-based estimation of the gradient and the rotation from $-\nabla L \cdot \mathbf{u} = C$. Then we obtain the intensity values from a Poisson reconstruction, from which we can reconstruct the image live on a GPU. The only thing we need is a base image relative to which we can interpret the intensity changes.
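For the Poisson step, here is a rough sketch of a generic FFT-based solver, assuming a gradient map (gx, gy) of the log intensity has already been estimated; boundary handling is simplified (periodic assumption) and this is not necessarily the exact method used in the referenced work:

```python
import numpy as np

def poisson_reconstruct(gx, gy):
    """Recover a (log-)intensity image from its gradient field by solving the
    Poisson equation  laplacian(L) = div(g)  with an FFT-based solver."""
    H, W = gx.shape
    # Divergence of the gradient field (backward differences, interior only).
    div = np.zeros((H, W))
    div[:, 1:] += gx[:, 1:] - gx[:, :-1]
    div[1:, :] += gy[1:, :] - gy[:-1, :]
    # Eigenvalues of the discrete Laplacian under the 2D DFT (periodic boundaries).
    fx = np.fft.fftfreq(W)[None, :]
    fy = np.fft.fftfreq(H)[:, None]
    denom = (2 * np.cos(2 * np.pi * fx) - 2) + (2 * np.cos(2 * np.pi * fy) - 2)
    denom[0, 0] = 1.0                      # avoid division by zero at the DC term
    L_hat = np.fft.fft2(div) / denom
    L_hat[0, 0] = 0.0                      # absolute intensity is unknown -> zero mean
    return np.real(np.fft.ifft2(L_hat))
```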
Combining Event-Based and Std. Cameras
The two camera types are quite complementary. While event cameras have a high update rate, a good dynamic range and do not suffer from motion blur, standard cameras can capture static scenes, measure absolute intensity and are a much more mature technology. There are indeed commercially available cameras that combine events, images and an IMU on one sensor with perfectly overlapping pixels.
Such a combination can be used to deblur blurry videos. A blurry image can be seen as the integral of a sequence of latent (sharp) images during the exposure. Normally this cannot be undone, since we have no information about what this sequence looks like. However, since we have events in between our frames, we can estimate the change between the latent images and recover a sharp image by subtracting the double integral of the events from the blurry image. Why the double integral? The event image acts as an edge detector, i.e. it captures sudden changes in contrast, which correspond to the second derivative of the intensity. The double integral therefore gives back an intensity image, which we can subtract from the blurry one to obtain a deblurred image.
Figure 7: Sharpening an image using Events. source
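The same idea can be written multiplicatively in intensity space (in the spirit of the Event-based Double Integral formulation, where the subtraction above happens in the log domain). A minimal sketch under that assumption follows; the event-array layout, the contrast threshold `c` and the step count are all assumptions, not values from the source:

```python
import numpy as np

def deblur_with_events(blurry, events, t_start, t_end, c=0.2, n_steps=50):
    """Sketch: the blurry frame is the temporal average of latent images, each
    differing from the first by the exponentiated running sum of events.
    `events` is a structured array with fields t, u, v, p over the exposure."""
    H, W = blurry.shape
    ev = np.sort(events, order="t")
    acc = np.zeros((H, W))        # running event integral E(t) per pixel
    mean_exp = np.zeros((H, W))   # time-average of exp(c * E(t))
    k = 0
    for t in np.linspace(t_start, t_end, n_steps):
        while k < len(ev) and ev["t"][k] <= t:
            acc[ev["v"][k], ev["u"][k]] += ev["p"][k]
            k += 1
        mean_exp += np.exp(c * acc) / n_steps
    # Latent sharp image at the start of the exposure.
    return blurry / np.maximum(mean_exp, 1e-6)
```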
The intermediate data generated by event-based cameras can also help track Lucas-Kanade features, giving a good estimate of where the features will be when the next frame arrives.
Focus Maximization Framework
The brightness constancy assumption has one big problem: the illumination change C is scene dependent and can differ (in value) for each pixel in the image.
Focus maximization takes a different approach. Remember that the output of an event-based camera over a certain time window is a 3D space-time volume. As we have seen, we can aggregate such a window by stacking all events, neglecting their timestamps (effectively setting the time of each event to 0), thereby reducing the dimensionality and obtaining a 2D image that is bright at every pixel where positive events dominated and dark where negative events dominated. However, this simple aggregation has a flaw: for longer time windows it again produces blurry results.
The idea of focus maximization is to warp the spatio-temporal volume in a way that maximizes the focus / sharpness of the resulting aggregation. So we first warp the 3D event points and aggregate afterwards, hoping to obtain a sharp image.
The pipeline for the maximization problem is quite simple. We first warp each event based on a warping function W, independent of its time. We then build a greyscale image from the result and evaluate its sharpness using a standard-deviation measure: the higher the variance, the sharper the image.
Figure 8: Focus Maximization Framework. source
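A minimal sketch of this pipeline, assuming the events are stored in a structured array with fields t, u, v, p and that the warp W is a constant image-plane velocity (a common special case, chosen here only for illustration); the brute-force search over candidate velocities is likewise a simplification:

```python
import numpy as np

def contrast(events, vx, vy, t0, H, W):
    """Warp events to time t0 with a candidate flow (vx, vy), accumulate them
    into an image, and score its sharpness by the variance."""
    x = np.round(events["u"] - vx * (events["t"] - t0)).astype(int)
    y = np.round(events["v"] - vy * (events["t"] - t0)).astype(int)
    ok = (x >= 0) & (x < W) & (y >= 0) & (y < H)       # keep in-bounds events
    img = np.zeros((H, W), dtype=np.float32)
    np.add.at(img, (y[ok], x[ok]), events["p"][ok].astype(np.float32))
    return img.var()

def best_flow(events, H, W, candidates):
    """Brute-force focus maximization over a grid of candidate velocities."""
    t0 = events["t"].min()
    return max(candidates, key=lambda v: contrast(events, v[0], v[1], t0, H, W))
```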
Deep Learning
This focus maximization can also be done using unsupervised deep-learning techniques, where the focus is used as the loss: a neural network then maximizes the sharpness of the aggregated event image. The question arises: how do we feed event data into neural networks, which need an input of constant size to learn? We sample the (x, y, t) space-time volume into 3D voxels, where each voxel is defined as the sum of the positive and negative events falling into it. Regions in which no events happened result in voxels of value 0. These voxels are then fed into the neural network.
Figure 9: Voxel representation of the space-time volume. source
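A minimal sketch of such a voxelization, under the same hypothetical event-array layout as before (here the bins are taken along time only and summed over polarity):

```python
import numpy as np

def events_to_voxels(events, H, W, n_bins):
    """Bin events into an (n_bins, H, W) voxel grid by summing polarities,
    yielding the fixed-size tensor a neural network expects (sketch)."""
    vox = np.zeros((n_bins, H, W), dtype=np.float32)
    t0, t1 = events["t"].min(), events["t"].max()
    # Assign each event to a temporal bin.
    b = np.clip(((events["t"] - t0) / max(t1 - t0, 1e-9) * n_bins).astype(int),
                0, n_bins - 1)
    np.add.at(vox, (b, events["v"], events["u"]), events["p"].astype(np.float32))
    return vox        # voxels with no events stay 0
```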
Using unsupervised learning, we can also use a recurrent neural network to convert events into greyscale videos. The input is always the last reconstructed frame plus the sequence of voxels. The resulting videos have a higher dynamic range than traditional camera footage.
UltimateSLAM
Focus maximization can be useful for stabilizing an image by estimating its rotation, for image segmentation using video data only, for obstacle detection and even for visual SLAM, as in UltimateSLAM.
By combining events, images and IMU data, UltimateSLAM is a robust visual SLAM algorithm for high-speed scenarios. In the front-end, features are tracked from both the frames and the events. In the back-end, a sliding-window visual-inertial fusion is used to calculate the relative poses. It has the advantage over standard SLAM that it also works in nearly complete darkness, thanks to the high-dynamic-range properties of the event camera, and it can track features very accurately even during high-speed motion. In HDR and high-speed scenes, over 85% accuracy is gained.
Color Event Cameras
In the same way that we construct normal RGB cameras (using R, G and B sensors), we can construct RGB event cameras whose pixels react only to changes in their respective color channel. This information can be used to reconstruct RGB images from RGB events.