NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (Review)
I. Introduction
View synthesis is the task of taking multiple photos of an object as input and predicting what the object looks like from a new, unseen viewpoint. Existing methods did not perform this task very well, and the authors propose a new method.
New method
The authors designed a function that takes 5-dimensional data $(x, y, z, θ, ϕ)$ as input and outputs the radiance and density of the object at that point, as seen from that viewpoint.
Here, radiance is the light emitted from $(x, y, z)$ in the $(θ, ϕ)$ direction, while density acts like a differential opacity controlling how much light is attenuated as a (virtual) ray passes through $(x, y, z)$. Density was the harder of the two for me to understand completely.
How did the authors build a function that produces radiance and density from $(x, y, z, θ, ϕ)$? This is where deep learning comes in: they designed an MLP that takes $(x, y, z, θ, ϕ)$ as input and outputs a single volume density and a view-dependent RGB color. The deep learning model is not the whole method, but one component used along the way. Then, by using classical volume rendering techniques to accumulate the RGB and density values obtained from the MLP into a 2D image, you can obtain the scene as seen from a new viewpoint. The authors named this method the neural radiance field (NeRF). The name makes me think of a nerf in a game, which makes it feel oddly endearing.
The authors organized the NeRF procedure, i.e., the process described above, into the following three steps (a rough code sketch follows the list).
1. March camera rays through the scene to create a sampled set of 3D points (i.e., specify the locations to query).
2. Feed the sampled points and their corresponding 2D viewing directions into the MLP to obtain the color and density at each location.
3. Accumulate the acquired colors and densities into a 2D image using classical volume rendering techniques.
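As a rough mental model, here is a minimal, self-contained sketch of these three steps for a single pixel, written in PyTorch. The function `toy_nerf` is a hypothetical stand-in for the trained MLP, and the near/far bounds and sample count are arbitrary; this is not the authors' code.

```python
import torch

def toy_nerf(points, directions):
    """Hypothetical stand-in for the trained MLP: random RGB and density per point."""
    n = points.shape[0]
    return torch.rand(n, 3), torch.rand(n)

def render_pixel(origin, direction, field=toy_nerf, t_near=2.0, t_far=6.0, n=64):
    # 1) Create a sampled set of 3D points along the camera ray r(t) = o + t*d.
    t_vals = torch.linspace(t_near, t_far, n)
    points = origin + t_vals[:, None] * direction
    # 2) Query the field with the points and the viewing direction.
    rgb, sigma = field(points, direction.expand(n, 3))
    # 3) Accumulate color and density into one pixel (classical volume rendering).
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * deltas)
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha])[:-1], dim=0)
    return (T[:, None] * alpha[:, None] * rgb).sum(dim=0)

pixel_color = render_pixel(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```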
NeRF reminded me of the human brain. People can also look at parts of an object from a few viewpoints and then imagine what it looks like from a viewpoint they have not seen yet. It felt like the network the authors propose works in a similar way.
The paper represents this process graphically in an overview figure.
The authors state that the NeRF rendering process is differentiable. In other words, the MLP can be optimized with gradient descent by computing the error between the image actually observed from a viewpoint and the 'estimated' image produced by the MLP. If we can reduce this error across images observed from many viewpoints, the images synthesized from new viewpoints should become more accurate as well.
Finding and fixing shortcomings
The authors found that a basic implementation of this optimization performs poorly on complex scenes and requires a large number of samples. So they transformed the $(x, y, z, θ, ϕ)$ input with a positional encoding, allowing the MLP used in NeRF to represent higher-frequency content.
Being able to express higher frequencies here means a greater ability to represent boundaries. The low-frequency band of an image corresponds to smoothly varying regions such as the background, while the high-frequency band corresponds to areas where the RGB values fluctuate rapidly, such as edges.
They also introduced a hierarchical sampling procedure to reduce the number of queries required while still selecting enough samples to represent high-frequency scenes.
Benefits of NeRF
NeRF can represent real-world objects of complex shape and perform gradient-based optimization using the projected images we actually observe. According to the paper, NeRF inherits the advantages of volumetric representations.
It also has the advantage that the storage cost of modeling a complex scene at high resolution is far lower than the cost of a discretized voxel grid: a 3D scene can be represented with very little capacity. Considering how large 3D files usually are, I think this is a significant advantage.
II. Neural Radiance Field Scene Representation
As mentioned earlier, the authors built an MLP that takes $(x, y, z, θ, ϕ)$ as input, i.e., a 3D location and a 2D viewing direction, and outputs the emitted color $(r, g, b)$ and the volume density $σ$. In practice, the direction $(θ, ϕ)$ from which the location $(x, y, z)$ is viewed is expressed as a 3D Cartesian unit vector $\mathbf{d}$.
In other words, the authors represent a scene described by 5-dimensional data $(x, y, z, θ, ϕ)$ as an MLP that maps $(\mathbf{x}, \mathbf{d}) → (\mathbf{c}, σ)$, and they learn its parameters $Θ$ so that the rendered scene is as close to reality as possible. This is a more concrete restatement of what I described earlier.
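As a small aside, here is one common convention for converting $(θ, ϕ)$ into the Cartesian unit vector $\mathbf{d}$. This is just a sketch with an assumed spherical convention; in practice the datasets provide camera poses from which $\mathbf{d}$ is obtained directly.

```python
import math

def angles_to_direction(theta, phi):
    """Polar angle theta and azimuth phi to a unit vector (one common convention)."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))
```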
Keeping predictions consistent no matter which viewpoint you look from
Usually, when designing and training a network, there are cases where performance is good in some situations but poor in others, because the learning process became biased toward specific situations. For the view synthesis performed by NeRF, this would amount to 'predicting well only the scenes observed from particular viewpoints.'
So the author tried to solve this problem in the following way.
- Predict the volume density $σ$ using only the 3D location $\mathbf{x}$.
- Predict the color $\mathbf{c}$ using both the location $\mathbf{x}$ and the viewing direction $\mathbf{d}$.
The fully-connected layers in this MLP use ReLU as the activation function.
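To make the split concrete, here is a deliberately shrunk sketch of such a two-branch MLP in PyTorch. It is not the authors' architecture (the paper's network is deeper, with 8 fully-connected layers of 256 channels), and the input sizes assume positional encodings with $L = 10$ for the location ($3 \times 2 \times 10 = 60$ values) and $L = 4$ for the direction ($3 \times 2 \times 4 = 24$ values).

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Sketch: density depends only on the encoded location, RGB also sees the direction."""

    def __init__(self, pos_dim=60, dir_dim=24, hidden=128):
        super().__init__()
        # Trunk that processes the (positionally encoded) 3D location only.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)   # volume density: location only
        self.feature = nn.Linear(hidden, hidden)
        # Color head additionally receives the (encoded) viewing direction.
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))   # keep density non-negative
        rgb = torch.sigmoid(self.rgb_head(torch.cat([self.feature(h), d_enc], dim=-1)))
        return rgb, sigma
```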
The paper includes a figure showing an example of this design together with its effect.
III. Volume Rendering with Radiance Fields
So far, we have looked at how the MLP produces the volume density and RGB of an object as seen from a given location. Earlier I mentioned that NeRF performs rendering using 'classical volume rendering techniques'; this chapter explains that in detail.
Find the color to be rendered
First, let's look at the formula for calculating the color to be rendered. The paper says the volume density produced by the MLP can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at location $\mathbf{x}$. It is interesting that the chance of the virtual ray we cast from our viewpoint being stopped can be read directly off the volume density.
Anyway, using a virtual ray with these properties, namely the camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ together with the near and far bounds $[t_n, t_f]$ that delimit the segment of the ray, you can determine the color of the part of the object the camera ray hits using the following equation.
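$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$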
Let me explain the formula in detail. Here, $T(t)$ is the transmittance accumulated as the ray travels from $t_n$ to $t$: the virtual ray we cast passes through many points, and the transmittance at each of those points is accumulated. In other words, $T(t)$ can be interpreted as 'the probability that the camera ray travels from $t_n$ to $t$ without hitting anything.' I think this is an apt way to put it: if nothing is in the way we can see things far away, but if an opaque object is right in front of us we can only see what is nearby.
The authors say that rendering a scene with NeRF requires estimating $C(\mathbf{r})$ for the ray traced through each pixel of the desired virtual camera.
Stratified sampling approach
So how do we actually compute $C(\mathbf{r})$? One option is deterministic quadrature, i.e., approximating the integral by evaluating the integrand on a fixed grid of points and summing the pieces; this is the technique typically used when rendering voxel grids. However, using it in NeRF limits performance, because the MLP would only ever be queried at the same fixed, discrete set of locations. In other words, 'since integration is always performed with the same $t_i$ along every camera ray, the effective resolution of the representation is capped, which limits performance.'
So the authors used a different method, called the stratified sampling approach. It divides $[t_n, t_f]$, the integration interval of $C(\mathbf{r})$, into $N$ evenly spaced bins and draws one sample uniformly at random from each bin; those samples are then used for the discrete integration.
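Concretely, the $i$-th sample is drawn as

$$t_i \sim \mathcal{U}\!\left[\,t_n + \frac{i-1}{N}(t_f - t_n),\; t_n + \frac{i}{N}(t_f - t_n)\,\right]$$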
(a method of selecting $t_i$ using a stratified sampling approach)
Selecting the integration samples this way means the MLP ends up being evaluated at continuously varying positions over the course of optimization, so even though each estimate uses a discrete set of samples, the representation itself stays continuous; integrating with the sampled $t_i$ therefore gives higher quality. For reference, the integral is still estimated with a quadrature rule over these discrete samples, and the equation is as follows.
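$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$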
($δ_i = t_{i+1} − t_i$ is the distance between adjacent samples.)
The $\hat{C}(\mathbf{r})$ obtained from this formula is trivially differentiable, and it reduces to traditional alpha compositing with alpha values $α_i = 1 − \exp(−σ_i δ_i)$ (see the sketch below).
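Here is a minimal sketch of the two pieces, stratified sampling and the alpha-compositing quadrature, for a single ray in PyTorch. It is written for clarity rather than speed and is not the authors' implementation; the `1e10` padding of the last interval and the `1e-10` stabilizer are common implementation conventions, not part of the paper's equations.

```python
import torch

def stratified_samples(t_near, t_far, n_samples):
    """Split [t_n, t_f] into n_samples even bins and draw one uniform t_i from each bin."""
    edges = torch.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * torch.rand(n_samples)

def composite(sigmas, rgbs, t_vals):
    """Quadrature rule: C_hat(r) = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])   # delta_i
    alphas = 1.0 - torch.exp(-sigmas * deltas)                             # alpha_i
    # T_i = exp(-sum_{j<i} sigma_j * delta_j) = prod_{j<i} (1 - alpha_j)
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = T * alphas
    return (weights[:, None] * rgbs).sum(dim=0), weights    # pixel color, per-sample weights
```

The `weights` returned here are exactly the $w_i = T_i(1 − \exp(−σ_i δ_i))$ that reappear later in the hierarchical sampling section.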
IV. Optimizing a Neural Radiance Field
This chapter explains the positional encoding and the hierarchical sampling procedure mentioned earlier in more detail. These two components are what allow NeRF to represent complex, high-frequency scenes. Briefly, their roles are as follows.
- Positional encoding: helps the MLP represent high-frequency detail; it is applied to the input data.
- Hierarchical sampling procedure: allows NeRF to sample high-frequency scenes efficiently.
Now, let us explain them one by one in detail.
Positional encoding
The authors found that when the network $F_Θ$ processed the $(x, y, z, θ, ϕ)$ coordinates directly, rendering quality in high-frequency regions dropped noticeably. Other researchers have observed the same phenomenon and attributed it to deep networks being biased toward learning low-frequency functions.
They also found that mapping the input data into a higher-dimensional space with high-frequency functions before feeding it to the network lets the network learn to handle the high-frequency regions much better.
The authors applied this to NeRF and found that performance improved significantly when the network is recomposed as $F_Θ = F'_Θ \circ γ$. Here, $F'_Θ$ is simply a regular MLP, and $γ$ is a function that maps from $\mathbb{R}$ into the higher-dimensional space $\mathbb{R}^{2L}$. The equation is written in detail as follows.
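$$γ(p) = \left(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right)$$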
$γ(·)$ is applied separately to each of the three coordinate values of $\mathbf{x}$ (which are normalized to lie in $[-1, 1]$) and to the three components of the viewing-direction unit vector $\mathbf{d}$; the paper uses $L = 10$ for $γ(\mathbf{x})$ and $L = 4$ for $γ(\mathbf{d})$.
Positional encoding will be a familiar term to anyone who has read the Transformer paper. However, it is used a little differently here; comparing the two:
- Transformer positional encoding: gives a notion of order to tokens fed into a network that otherwise has no concept of the order of its inputs.
- NeRF's positional encoding: maps the continuous input coordinates into a higher-dimensional space so that the MLP can more easily approximate high-frequency components of the scene, such as edges.
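Here is a minimal sketch of such an encoding in PyTorch (the exact ordering of the sin/cos terms, and whether the raw input is also concatenated, varies between implementations; this is not the authors' code).

```python
import math
import torch

def positional_encoding(p, num_freqs):
    """Map each scalar of p to (sin(2^k * pi * p), cos(2^k * pi * p)) for k = 0..L-1.
    p: tensor of shape (..., D); returns a tensor of shape (..., D * 2 * num_freqs)."""
    freqs = (2.0 ** torch.arange(num_freqs).float()) * math.pi       # 2^0*pi, ..., 2^(L-1)*pi
    angles = p[..., None] * freqs                                    # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., D, 2L)
    return enc.reshape(*p.shape[:-1], -1)

# Example: encode a batch of 3D locations in [-1, 1] with L = 10 -> 60 values per point.
encoded = positional_encoding(torch.rand(4, 3) * 2 - 1, num_freqs=10)   # shape (4, 60)
```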
Hierarchical volume sampling
The authors originally considered simply sampling $N$ points densely along each camera ray and using all of them for rendering, but realized this is inefficient: the space a camera ray passes through contains not only the object but also free space and occluded regions. Samples that contribute little to the final image still get evaluated, which wastes computation and makes it harder to get good results.
So what they came up with is hierarchical volume sampling, a method that allocates samples in proportion to their expected effect on the final rendering. They say this makes picking the coordinates used for rendering far more efficient.
Use two networks
The author set up two networks to be used in NeRF instead of one, naming them the “coarse” network and the “fine” network, respectively. First, let’s look at how to use a “coarse” network.
Using stratified sampling along the camera ray, one value is selected from each of $N_c$ intervals, and $\hat{C}_c(\mathbf{r})$ is computed from those values (this is where the "coarse" network is used).
Based on the output obtained from those samples, additional samples are then concentrated around coordinates where the contribution (the compositing weight) is relatively high.
To do this, the alpha-composited color of the coarse network, $\hat{C}_c(\mathbf{r})$, is rewritten as a weighted sum of the sampled colors $c_i$ as follows.
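$$\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i, \qquad w_i = T_i\left(1 - \exp(-\sigma_i \delta_i)\right)$$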
Normalizing the weights $w_i$ from this equation as $\hat{w}_i = w_i / \sum_{j=1}^{N_c} w_j$ produces a piecewise-constant PDF along the ray.
Next, let’s look at the case of using the “fine” network.
Using inverse transform sampling of this PDF, $N_f$ additional locations along the ray are selected.
When computing $\hat{C}_f(\mathbf{r})$ with the "fine" network, the $N_c$ samples that were fed to the "coarse" network are used as well; in other words, $N_c + N_f$ samples are input to the "fine" network.
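A minimal sketch of that inverse transform sampling step in PyTorch (assuming `bins` are bin edges along the ray and `weights` come from the coarse pass; the official implementation differs in details such as using midpoints of the coarse samples).

```python
import torch

def sample_from_weights(bins, weights, n_fine):
    """Draw n_fine extra depths from the piecewise-constant PDF defined by the coarse weights.
    bins: (N+1,) bin edges along the ray; weights: (N,) coarse compositing weights."""
    pdf = weights / (weights.sum() + 1e-10)
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])    # (N+1,), monotone in [0, 1]
    u = torch.rand(n_fine)                                         # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(bins) - 1)
    # Linearly interpolate a depth inside the selected bin.
    cdf_lo, cdf_hi = cdf[idx - 1], cdf[idx]
    bin_lo, bin_hi = bins[idx - 1], bins[idx]
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-10)
    return bin_lo + frac * (bin_hi - bin_lo)
```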
This allocates more samples to regions expected to contain visible content, i.e., where the actual object is, so the share of samples spent on the object rather than on empty space goes up, which directly improves quality. The paper adds that the selected values are used as a non-uniform discretization of the whole integration domain (that is, they are still plugged into the quadrature formula above) rather than being treated as independent probabilistic estimates of the integral. Either way, you can understand that sampling this way and using the result for rendering improves performance.
Implementation details
This chapter describes the settings the authors used to implement NeRF. In other papers this kind of material usually appears alongside the experimental setup, but here it comes before the experiments; to be precise, it explains the training process in detail. Let me go through it step by step.
1. Obtain a dataset consisting of the captured RGB images of the scene, the corresponding camera poses, the camera intrinsic parameters (properties of the camera itself, such as the focal length and image-sensor geometry), and the scene bounds.
2. At each training iteration, randomly sample a batch of camera rays (pixels) from the dataset, perform hierarchical sampling, feed $N_c$ samples per ray into the "coarse" network, and feed $N_c + N_f$ samples into the "fine" network.
3. Run the volume rendering procedure to render the color seen along each camera ray.
4. Compute the loss from the colors rendered by the "coarse" network, the colors rendered by the "fine" network, and the ground-truth pixel colors. The loss used here is the total squared error.
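In the paper's notation:

$$\mathcal{L} = \sum_{\mathbf{r} \in R} \left[\, \left\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 + \left\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 \,\right]$$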
Here, $R$ is the set of camera rays in each batch, and $C(\mathbf{r})$, $\hat{C}_c(\mathbf{r})$, and $\hat{C}_f(\mathbf{r})$ are the ground-truth color, the color predicted by the "coarse" network, and the color predicted by the "fine" network, respectively.
Both networks are trained.
We just saw that the loss uses both the "fine" and the "coarse" renderings, and indeed both networks are optimized simultaneously. The final image comes from the "fine" network, but the coarse loss term is kept so that the "coarse" network stays accurate, since its weight distribution is what decides where the samples for the "fine" network are allocated.
The paper also lists the GPUs and hyperparameters used in the experiments, which I will omit here.
V. Results
Datasets
DeepVoxels (Diffuse Synthetic 360°): four objects with simple geometry, whose surfaces have Lambertian reflectance (the idealized property of reflecting light uniformly in all directions regardless of where it comes from). Each object is rendered at 512 × 512 resolution from viewpoints sampled on the upper hemisphere.
Their own dataset (Realistic Synthetic 360°): additional data created by the authors, consisting of 8 objects that all have non-Lambertian reflectance (more realistic reflection properties). Six objects are rendered from viewpoints sampled on the upper hemisphere and two from viewpoints sampled on a full sphere, at a resolution of 800 × 800.
Comparisons
Quantitative results
First, let's compare the numbers. The authors organized the experimental results into a table.
Here, PSNR is the peak signal-to-noise ratio, a value used to evaluate how much quality is lost in a reconstructed image; higher is better. SSIM is the structural similarity index measure, another metric commonly used to assess image quality; again, higher is better.
Lastly, LPIPS stands for Learned Perceptual Image Patch Similarity. It measures a learned 'distance between image patches': the smaller the distance, the more similar the two images, so lower is better.
Qualitative results
Discussion
Here, the limitations of previously used methods are discussed and the advantages of NeRF are emphasized once more. What stands out is that NeRF needs very little storage: the MLP weights used by NeRF occupy only about 5 MB, which is tiny compared to LLFF, which can require over 15 GB. It is said to be even smaller than the input images of the authors' own dataset.
Ablation studies
Here, as the title suggests, important components of NeRF are removed one at a time, and the resulting performance drops show how effective each of the authors' design choices is. First, let's check the results table.
(Here, PE is ‘Positional Encoding’, VD is ‘View Dependence’, and H is ‘Hierarchical Sampling’.)
The rows to pay attention to are rows 5 and 6, which show how performance changes with the number of input images. Even with the smallest number of images, 25, NeRF still outperforms NV, SRN, and LLFF, the strongest existing methods. Rows 7 and 8 show how performance changes with the maximum positional-encoding frequency $L$ applied to the input position vector. One might expect that ever-higher frequencies would keep sharpening the details, but the experiments show that increasing the maximum frequency beyond $L = 10$ actually hurts slightly, whereas increasing it from 5 to 10 improves performance, as I expected.
Looking at these results, the authors say they believe the benefit of increasing $L$ is limited once $2^L$ exceeds the maximum frequency present in the sampled input images, which they estimate at roughly 1024 for their data. That makes it reasonable that performance peaks around $L = 10$, since $2^{10} = 1024$.
VI. Conclusion
The authors address the performance limits of existing view synthesis methods by representing an object's scene as a continuous function, namely an MLP. They show that NeRF outperforms the previously dominant approach of training CNNs to produce discretized voxel representations. They also note that there is still room to make NeRF more efficient to train and render, and close the paper by saying that better results are possible with more effective ways of optimizing, rendering, and interpreting the values produced by such deep networks.