3D Deep Learning with Pytorch3D [Part 2 - 3D Deep Learning using PyTorch3D]
Setup environment
conda create -n dl3d python=3.7
conda activate dl3d
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
conda install pytorch3d -c pytorch3d
conda install -c open3d-admin open3d
pip install -U scikit-learn scipy matplotlib
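To verify that the environment is set up correctly, a quick import check such as the following can be run (the exact version numbers will depend on what conda resolves):

import torch
import pytorch3d
import open3d

print(torch.__version__, torch.cuda.is_available())
print(pytorch3d.__version__)
print(open3d.__version__)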
1. Fitting Deformable Mesh Models to Raw Point Clouds
1.1. Fitting meshes to point clouds – the problem
Real-world depth cameras, such as LiDAR, time-of-flight cameras, and stereo vision cameras, usually output either depth images or point clouds. For example, in the case of time-of-flight cameras, a modulated light ray is projected from the camera into the world, and the depth at each pixel is measured from the phase of the reflected light rays received at the pixel. Thus, at each pixel, we can usually get one depth measurement and one reflected light amplitude measurement. However, other than the sampled depth information, we usually do not have direct measurements of the surfaces. For example, we cannot measure the smoothness or the normals of the surface directly.
Similarly, in the case of stereo vision cameras, at each time slot, the camera can take two RGB images from the camera pair at roughly the same time. The camera then estimates the depth by finding the pixel correspondences between the two images. The output is thus a depth estimation at each pixel. Again, the camera cannot give us any direct measurements of surfaces.
However, in many real-world applications, surface information is sought. For example, in robotic picking tasks, usually, we need to find regions on an object such that the robotic hands can grasp firmly. In such a scenario, it is usually desirable that the regions are large in size and reasonably flat.
There are many other scenarios in which we want to fit a (deformable) mesh model to a point cloud. For example, there are some machine vision applications where we have the mesh model for an industrial part and the point cloud measurement from the depth camera has an unknown orientation and pose. In this case, finding a fitting of the mesh model to the point cloud would recover the unknown object pose.
For another example, in human face tracking, sometimes, we want to fit a deformable face mesh model to point cloud measurements, such that we can recover the identity of the human being and/or facial expressions.
Loss functions are central concepts in almost all optimizations. Essentially, to fit a point cloud, we need to design a loss function such that, when the loss function is minimized, the mesh (as the optimization variable) fits the point cloud.
Actually, selecting the right loss function is usually a critical design decision in many real-world projects. Different choices of loss function usually result in significantly different system performance. The requirements for a loss function usually include at least the following properties:
• The loss function needs to have desirable numerical properties, such as smoothness, convexity, and the absence of vanishing gradients
• The loss function (and its gradients) can be easily computed; for example, they can be efficiently computed on GPUs
• The loss function is a good measurement of model fitting; that is, minimizing the loss function results in a satisfactory mesh model fitting for the input point clouds
Besides the primary loss function in such model-fitting optimization problems, we usually also need other loss functions for regularizing the model fitting. For example, if we have prior knowledge that the surfaces should be smooth, then we usually need to introduce an additional regularization loss function, such that non-smooth meshes are penalized more.
1.2. Formulating a deformable mesh fitting problem into an optimization problem
In this section, we are going to talk about how to formulate the mesh fitting problem as an optimization problem. One key observation here is that the surfaces of many objects of interest, such as pedestrians, can be continuously deformed into a sphere. Thus, the approach we are going to take will start from the surface of a sphere and deform the surface to minimize a cost function.
The cost function should be chosen such that it is a good measurement of how similar the point cloud is to the mesh. Here, we choose the major cost function to be the Chamfer set distance. The Chamfer distance is defined between two sets of points as follows:
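Written out, with $X$ and $Y$ denoting the two point sets and using squared Euclidean distances (the convention followed by PyTorch3D), one standard form is:

$d_{Chamfer}(X, Y) = \sum_{x \in X} \min_{y \in Y} \lVert x - y \rVert^2 + \sum_{y \in Y} \min_{x \in X} \lVert x - y \rVert^2$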
The Chamfer distance is symmetric and is a sum of two terms. In the first term, for each point x in the first point cloud, the closest point y in the second point cloud is found. For each such pair x and y, their distance is computed, and the distances over all pairs are summed up. Similarly, in the second term, for each point y in the second point cloud, the closest point x in the first point cloud is found, and the distances over all such x and y pairs are summed up.
The Chamfer distance is a good measurement of how similar two point clouds are. If the two point clouds are the same, then the Chamfer distance is zero. If the two point clouds are different, then the Chamfer distance is positive.
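As a quick illustration, PyTorch3D exposes this distance as pytorch3d.loss.chamfer_distance; a minimal usage sketch with two random placeholder point clouds looks like this:

import torch
from pytorch3d.loss import chamfer_distance

# Two batches of point clouds, shaped (batch_size, num_points, 3); random placeholders here.
point_cloud_a = torch.rand(1, 1000, 3)
point_cloud_b = torch.rand(1, 800, 3)

# chamfer_distance returns a tuple (distance, normals_distance); the second
# element is None because we do not pass per-point normals.
loss, _ = chamfer_distance(point_cloud_a, point_cloud_b)
print(loss)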
1.3. Loss functions for regularization
In the previous section, we successfully formulated the deformable mesh fitting problem as an optimization problem. However, the approach of directly optimizing this primary loss function can be problematic. The issue is that there may exist multiple mesh models that are good fits to the same point cloud, and some of these good fits may be far from being smooth meshes.
On the other hand, we usually have prior knowledge about pedestrians. For example, the surfaces of pedestrians are usually smooth, and the surface normals are also smooth. Thus, even if a non-smooth mesh is close to the input point cloud in terms of Chamfer distance, we know with a certain level of confidence that it is far away from the ground truth.
The machine learning literature has provided a solution for excluding such undesirable non-smooth solutions for several decades: regularization. Essentially, the loss we want to optimize is chosen to be a sum of multiple loss functions. The first term of the sum is the primary Chamfer distance; the other terms penalize surface non-smoothness and normal non-smoothness.
1.3.1. Mesh Laplacian smoothing loss
The mesh Laplacian is a discrete version of the well-known Laplace-Beltrami operator. One version (usually called uniform Laplacian) is as follows:
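With $N(i)$ denoting the set of neighbors of vertex $v_i$, the uniform Laplacian can be written as follows, where the $1/|N(i)|$ factor gives each neighbor an equal weight:

$L(v_i) = \frac{1}{|N(i)|} \sum_{j \in N(i)} (v_j - v_i)$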
In the preceding definition, the Laplacian at the i-th vertex is just a sum of differences, where each difference is between the coordinates of the current vertex and those of a neighboring vertex.
The Laplacian is a measurement of smoothness. If the i-th vertex and its neighbors all lie within one plane, then the Laplacian should be zero. Here, we are using a uniform version of the Laplacian, where the contribution to the sum from each neighbor is equally weighted. There are more complicated versions of the Laplacian, where the preceding contributions are weighted according to various schemes.
1.3.2. Mesh normal consistency loss
The mesh normal consistency loss is a loss function for penalizing the distances between adjacent normal vectors on the mesh.
1.3.3. Mesh edge loss
Mesh edge loss is for penalizing long edges in meshes. For example, in the mesh model fitting problem we consider in this chapter, we want to eventually obtain a solution, such that the obtained mesh model fits the input point cloud uniformly. In other words, each local region of the point cloud is covered by small triangles of the mesh. Otherwise, the mesh model cannot capture the fine details of slowly varying surfaces, meaning the model may not be that accurate or trustworthy.
The aforementioned problem can be easily avoided by including the mesh edge loss in the objective function. The mesh edge loss is essentially a sum of all the edge lengths in the mesh.
1.4. Implementing the mesh fitting with PyTorch3D
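A minimal sketch of the fitting loop is shown below. It assumes the input point cloud has already been loaded as a (1, N, 3) tensor named target_points (a random placeholder is used here), and the loss weights are illustrative hyperparameters rather than tuned values.

import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import (
    chamfer_distance,
    mesh_edge_loss,
    mesh_laplacian_smoothing,
    mesh_normal_consistency,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder for the input point cloud, shaped (1, N, 3).
target_points = torch.rand(1, 5000, 3, device=device)

# Start from a sphere and optimize a per-vertex offset.
src_mesh = ico_sphere(4, device)
deform_verts = torch.zeros(src_mesh.verts_packed().shape, device=device, requires_grad=True)
optimizer = torch.optim.SGD([deform_verts], lr=1.0, momentum=0.9)

# Weights for the loss terms; these values are illustrative only.
w_chamfer, w_edge, w_normal, w_laplacian = 1.0, 1.0, 0.01, 0.1

for i in range(2000):
    optimizer.zero_grad()
    new_mesh = src_mesh.offset_verts(deform_verts)

    # Sample points from the deformed mesh and compare them with the target cloud.
    sample_src = sample_points_from_meshes(new_mesh, 5000)
    loss_chamfer, _ = chamfer_distance(sample_src, target_points)

    # Primary Chamfer term plus the regularization terms discussed above.
    loss = (
        w_chamfer * loss_chamfer
        + w_edge * mesh_edge_loss(new_mesh)
        + w_normal * mesh_normal_consistency(new_mesh)
        + w_laplacian * mesh_laplacian_smoothing(new_mesh, method="uniform")
    )
    loss.backward()
    optimizer.step()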
2. Object Pose Detection and Tracking by Differentiable Rendering
2.1. Why we want to have differentiable rendering
The physical process of image formation is a mapping from 3D models to 2D images. As shown in the example in the figure below, depending on the positions of the red and blue spheres in 3D (two possible configurations are shown on the left-hand side), we may get different 2D images (the images corresponding to the two configurations are shown on the right-hand side).
Many 3D computer vision problems are a reversal of image formation. In these problems, we are usually given 2D images and need to estimate the 3D models from them. For example, in the figure below, we are given the 2D image shown on the right-hand side, and the question is, which 3D model is the one that corresponds to the observed image?
According to some ideas that were first discussed in the computer vision community decades ago, we can formulate the problem as an optimization problem. In this case, the optimization variables are the positions of the two 3D spheres. We want to optimize the two centers, such that the rendered images are similar to the observed 2D image. To measure similarity precisely, we need to use a cost function – for example, we can use the pixel-wise mean-square error. We then need to compute a gradient from the cost function to the two sphere centers, so that we can minimize the cost function iteratively by moving in the gradient descent direction.
However, we can calculate a gradient from the cost function to the optimization variables only under the condition that the mapping from the optimization variables to the cost function is differentiable, which implies that the rendering process must also be differentiable.
2.2. How to make rendering differentiable
Rendering is an imitation of the physical process of image formation. This physical process of image formation itself is differentiable in many cases. Suppose that the surface normals and the material properties of the object are all smooth. Then, the pixel color in the example is a differentiable function of the positions of the spheres.
However, there are cases where the pixel color is not a smooth function of the position. This can happen at occlusion boundaries, for example. This is shown in the figure below, where the blue sphere is at a location such that it would occlude the red sphere from that view if it moved up a little bit. The pixel color at that view is thus not a differentiable function of the sphere center locations.
When we use conventional rendering algorithms, information about local gradients is lost due to discretization. As we discussed in the previous section, rasterization is a step of rendering where for each pixel on the imaging plane, we find the most relevant mesh face (or decide that no relevant mesh face can be found).
In conventional rasterization, for each pixel, we generate a ray from the camera center going through the pixel on the imaging plane. We find all the mesh faces that intersect with this ray. In the conventional approach, the rasterizer will only return the mesh face that is nearest to the camera. The returned mesh face will then be passed to the shader, which is the next step of the rendering pipeline. The shader will then apply one of the shading algorithms (such as the Lambertian model or the Phong model) to determine the pixel color. This step of choosing which mesh face to render is a non-differentiable process, since it is mathematically modeled as a step function.
There has been a large body of literature in the computer vision community on how to make rendering differentiable. The differentiable rendering implemented in the PyTorch3D library mainly follows the approach of Soft Rasterizer by Liu, Li, Chen, and Li (arXiv:1904.01786).
The main idea of differentiable rendering is illustrated in the figure below. In the rasterization step, instead of returning only one relevant mesh face, we find all the mesh faces whose distance to the ray is within a certain threshold.
2.2.1. What problems can be solved by using differentiable rendering?
Differentiable rendering is a technique that lets us formulate estimation problems in 3D computer vision as optimization problems. It can be applied to a wide range of problems. More interestingly, one exciting recent trend is to combine differentiable rendering with deep learning. Usually, differentiable rendering is used as the generator part of the deep learning model. The whole pipeline can thus be trained end to end.
2.3. The object pose estimation problem
In this section, we are going to show a concrete example of using differentiable rendering for 3D computer vision problems. The problem is object pose estimation from one single observed image. In addition, we assume that we have the 3D mesh model of the object.
For example, we assume we have the 3D mesh model of a toy cow, as shown in Figure 2.3.1. Now, suppose we have taken one image of the toy cow (Figure 2.3.2). The problem is then to estimate the orientation and location of the toy cow at the moment when this image was taken.
Because it is cumbersome to rotate and move the meshes, we choose instead to fix the orientations and locations of the meshes and optimize the orientations and locations of the cameras. By assuming that the camera orientations are always pointing toward the meshes, we can further simplify the problem, such that all we need to optimize is the camera locations.
Thus, we formulate our optimization problem such that the optimization variables are the camera locations. By using differentiable rendering, we can render RGB images and silhouette images of the mesh. The rendered images are compared with the observed images and, thus, loss functions between the rendered images and observed images can be calculated. Here, we use mean-square errors as the loss function. Because everything is differentiable, we can then compute gradients from the loss functions to the optimization variables. Gradient descent algorithms can then be used to find the best camera positions, such that the rendered images match the observed images.
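A condensed sketch of this optimization is given below. It assumes the mesh is stored in a file named cow.obj and that the observed silhouette is available as a 256 x 256 tensor named observed_silhouette; both names, the initial camera position, and the loop settings are placeholders rather than the chapter's exact implementation.

import numpy as np
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras,
    MeshRasterizer,
    MeshRenderer,
    RasterizationSettings,
    SoftSilhouetteShader,
    look_at_view_transform,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholders: the mesh file and the silhouette observed by the real camera.
mesh = load_objs_as_meshes(["cow.obj"], device=device)
observed_silhouette = torch.zeros(256, 256, device=device)

# A soft silhouette renderer; the blur settings keep the rasterization differentiable.
sigma = 1e-4
raster_settings = RasterizationSettings(
    image_size=256,
    blur_radius=float(np.log(1.0 / 1e-4 - 1.0)) * sigma,
    faces_per_pixel=50,
)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(raster_settings=raster_settings),
    shader=SoftSilhouetteShader(),
)

# The optimization variable: the camera position (the camera always looks at the origin).
camera_position = torch.tensor([3.0, 6.0, 2.5], device=device, requires_grad=True)
optimizer = torch.optim.Adam([camera_position], lr=0.05)

for i in range(200):
    optimizer.zero_grad()
    R, T = look_at_view_transform(eye=camera_position[None], device=device)
    cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
    rendered = renderer(mesh, cameras=cameras)  # (1, 256, 256, 4); channel 3 is the silhouette
    loss = ((rendered[0, ..., 3] - observed_silhouette) ** 2).mean()
    loss.backward()
    optimizer.step()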
3. Differentiable Volumetric Rendering
3.1. Overview of volumetric rendering
Volumetric rendering is a collection of techniques used to generate a 2D view of discrete 3D data. This discrete 3D data could be a collection of images, a voxel representation, or any other discrete representation of data. The main goal of volumetric rendering is to render a 2D projection of 3D data, since that is what our eyes can perceive on a flat screen. This method generates such projections without any explicit conversion to a geometric representation (such as meshes). Volumetric rendering is typically used when generating surfaces is difficult or can lead to errors. It can also be used when the content (and not just the geometry and surface) of the volume is important. It is typically used for data visualization. For example, in brain scans, a visualization of the content of the interior of the brain is typically very important.
We will get a high-level overview of volumetric rendering:
First, we represent the 3D space and objects in it by using a volume, which is a 3D grid of regularly spaced nodes. Each node has two properties: density and color features. The density typically ranges from 0 to 1. Density can also be understood as the probability of occupancy. That is, how sure we think that the node is occupied by a certain object. In some cases, the probability can also be opacity.
We need to define one or multiple cameras. The rendering is the process that determines what the cameras can observe from their views.
To determine the RGB value at each pixel of the preceding cameras, a ray is generated from the projection center going through each image pixel of the cameras. We need to check the occupancy probabilities (or opacities) and colors along this ray to determine the RGB value of the pixel. Note that there are infinitely many points on each such ray. Thus, we need a sampling scheme to select a certain number of points along this ray. This sampling operation is called ray sampling.
Note that we have densities and colors defined on the nodes of the volume but not on the points on the rays. Thus, we need to have a way to convert densities and colors of volumes to points on rays. This operation is called volume sampling.
Finally, from the densities and colors along the rays, we need to determine the RGB value of each pixel. In this process, we need to compute how much incident light arrives at each point along the ray and how much light is reflected toward the image pixel. We call this process ray marching.
3.1.1. Ray sampling
Ray sampling is the process of emitting rays from the camera that go through the image pixels and sampling points along these rays. The ray sampling scheme depends on the use case. For example, sometimes we might want to randomly sample rays that go through some random subset of image pixels. Typically, we need to use such a sampler during training, since we only need a representative sample from the full data.
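As a sketch of what this looks like in PyTorch3D (assuming a reasonably recent version of the library; the camera, ray counts, and depth range below are placeholder values), a Monte Carlo ray sampler can be used as follows:

import torch
from pytorch3d.renderer import FoVPerspectiveCameras, MonteCarloRaysampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cameras = FoVPerspectiveCameras(device=device)

# Randomly sample 750 rays per image, with 128 points per ray between the
# given near and far depths; the x/y bounds are in normalized device coordinates.
raysampler = MonteCarloRaysampler(
    min_x=-1.0, max_x=1.0,
    min_y=-1.0, max_y=1.0,
    n_rays_per_image=750,
    n_pts_per_ray=128,
    min_depth=0.1,
    max_depth=3.0,
)
ray_bundle = raysampler(cameras=cameras)

# The ray bundle holds the origins, directions, depths, and pixel positions of the rays.
print(ray_bundle.origins.shape, ray_bundle.lengths.shape)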
3.1.2. Volume sampling
Volume sampling is the process of getting color and occupancy information at the points provided by the ray sampler. The volume representation we are working with is discrete. Therefore, the points defined in the ray sampling step might not fall exactly on the nodes of the volume grid; the grid nodes and the points on the rays typically have different spatial locations. We need to use an interpolation scheme to interpolate the densities and colors at the points on the rays from the densities and colors defined at the volume nodes.
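The interpolation itself can be illustrated with plain PyTorch's trilinear grid sampling (this is only meant to convey the idea rather than PyTorch3D's exact implementation, and the shapes below are placeholders):

import torch
import torch.nn.functional as F

# A dense volume: 1 batch, 4 channels (RGB + density), on a 64 x 64 x 64 grid.
volume = torch.rand(1, 4, 64, 64, 64)

# 1024 rays x 128 sample points, with coordinates normalized to [-1, 1].
# grid_sample expects the last dimension ordered as (x, y, z).
points = torch.rand(1, 1024, 128, 1, 3) * 2.0 - 1.0

# Trilinear interpolation of the volume at the ray points.
sampled = F.grid_sample(volume, points, align_corners=True)
print(sampled.shape)  # (1, 4, 1024, 128, 1)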
3.1.3. Ray marching
Now that we have the color and density values for all the points sampled with the ray sampler, we need to figure out how to use it to finally render the pixel value on the projected image. In this section, we are going to discuss the process of converting the densities and colors on points of rays to RGB values on images. This process models the physical process of image formation.
In this section, we discuss a very simple model, where the RGB value of each image pixel is a weighted sum of the colors on the points of the corresponding ray. If we consider the densities as probabilities of occupancy or opacity, then the incident light intensity at the i-th point of the ray is $a_i = \prod_{j<i} (1-p_{j})$, where the $p_{j}$ are the densities at the preceding points. Given that the probability that this point is occupied by a certain object is $p_{i}$, the expected light intensity reflected from this point is $w_{i} = a_i p_{i}$. We use $w_{i}$ as the weights for the weighted sum of colors. Usually, we normalize the weights (for example, by applying a softmax operation), such that the weights all sum up to one.
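As a concrete illustration of this weighting scheme in plain PyTorch (with random placeholder values for a single ray of 128 sample points):

import torch

# Occupancy probabilities p_i and RGB colors for 128 points along one ray.
densities = torch.rand(128)
colors = torch.rand(128, 3)

# Light reaching point i is the product of (1 - p_j) over all preceding points j < i.
transmission = torch.cumprod(1.0 - densities, dim=0)
transmission = torch.cat([torch.ones(1), transmission[:-1]])  # nothing is absorbed before the first point

# Expected reflected intensity at each point; used as weights for the colors.
weights = transmission * densities

# Weighted sum of the colors gives the pixel color (the weights may optionally be
# normalized so that they sum to one, as described above).
pixel_color = (weights[:, None] * colors).sum(dim=0)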
3.2. Differentiable volumetric rendering
While standard volumetric rendering is used to render 2D projections of 3D data, differentiable volumetric rendering is used to do the opposite: construct 3D data from 2D images. This is how it works: we represent the shape and texture of the object as a parametric function. This function can be used to generate 2D projections. Given 2D projections (typically multiple views of the 3D scene), we can optimize the parameters of these implicit shape and texture functions so that their projections match the multi-view 2D images. This optimization is possible because the rendering process is completely differentiable and the implicit functions used are also differentiable.
3.2.1. Reconstructing 3D models from multi-view images
In this section, we are going to show an example of using differentiable volumetric rendering for reconstructing 3D models from multi-view images. Reconstructing 3D models is a frequently encountered problem. Usually, the direct ways of measuring the 3D world are difficult and costly; for example, LiDAR and radar sensors are typically expensive. On the other hand, 2D cameras have much lower costs, which makes reconstructing the 3D world from 2D images incredibly attractive. Of course, to reconstruct the 3D world, we need multiple images taken from multiple views.
4. Neural Radiance Fields (NeRF)
4.1. Overview of NeRF
View synthesis is a long-standing problem in 3D computer vision. The challenge is to synthesize new views of a 3D scene using a small number of available 2D snapshots of the scene. It is particularly challenging because the view of a complex scene can depend on a lot of factors such as object artifacts, light sources, reflections, opacity, object surface texture, and occlusions. Any good representation should capture this information either implicitly or explicitly. Additionally, many objects have complex structures that are not completely visible from a certain viewpoint. The challenge is to construct complete information about the world given incomplete and noisy information.
As the name suggests, NeRF uses neural networks to model the world. As we will learn later in the chapter, NeRF uses neural networks in a very unconventional manner. It was a concept first developed by a team of researchers from UC Berkeley, Google Research, and UC San Diego. Because of their unconventional use of neural networks and the quality of the learned models, it has spawned multiple new inventions in the fields of view synthesis, depth sensing, and 3D reconstruction.
4.1.1. What is a radiance field?
Before we get to NeRF, let us understand what radiance fields are first. You see an object when the light from that object is processed by your body’s sensory system. The light from the object can either be generated by the object itself or reflected off it.
Radiance is the standard metric for measuring the amount of light that passes through or is emitted from an area inside a particular solid angle. For our purposes, we can treat the radiance to be the intensity of a point in space when viewed in a particular direction. When capturing this information in RGB, the radiance will have three components corresponding to the colors Red, Green, and Blue. The radiance of a point in space can depend on many factors, including the following:
- Light sources illuminating that point
- The existence of a surface (or volume density) that can reflect light at that point
- The texture properties of the surface
The following figure depicts the radiance value at a certain point in the 3D scene when viewed at a certain angle. The radiance field is just a collection of these radiance values at all points and viewing angles in the 3D scene:
If we know the radiance of all the points in a scene in all directions, we have all the visual information we need about the scene. This field of radiance values constitutes a radiance field. We can store the radiance field information as a volume in a 3D voxel grid data structure. We saw this in the previous chapter when discussing volume rendering.
4.1.2. Representing radiance fields with neural networks
In this section, we will explore a new way of using neural networks. In typical computer vision tasks, we use neural networks to map an input in the pixel space to an output. In the case of a discriminative model, the output is a class label. In the case of a generative model, the output is also in the pixel space. A NeRF model is neither of these.
NeRF uses a neural network to represent a volumetric scene function. This neural network takes a 5-dimensional input. These are the three spatial locations $(x, y, z)$ and two viewing angles $(θ, φ)$. Its output is the volume density $σ$ at $(x, y, z)$ and the emitted color $(r, g, b)$ of the point $(x, y, z)$ when viewed from the viewing angle $(θ, φ)$. The emitted color is a proxy used to estimate the radiance at that point. In practice, instead of directly using $(θ, φ)$ to represent the viewing angle, NeRF uses the unit direction vector $d$ in the 3D Cartesian coordinate system. These are equivalent representations of the viewing angle.
The model therefore maps any point in the 3D scene and a viewing angle to the volume density and radiance at that point. You can then use this model to synthesize views by querying the 5D coordinates along camera rays and using the volume rendering technique you learned about in the previous chapter to project the output colors and volume densities into an image.
With the following figure, let us find out how a neural network can be used to predict the density and radiance at a certain point $(x, y, z)$ when viewed along a certain direction $(θ, φ)$:
Note that this is a fully connected network – typically, this is referred to as a Multi-Layer Perceptron (MLP). More importantly, this is not a convolutional neural network. We refer to this model as the NeRF model. A single NeRF model is optimized on a set of images from a single scene. Therefore, each model only knows the scene on which it is optimized. This is not the standard way to use a neural network, where we typically need the model to generalize to unseen images. In the case of NeRF, we need the network to generalize well to unseen viewpoints.
4.2. NeRF architecture
So far, we have used the NeRF model class without fully knowing what it looks like. In this section, we will first visualize what the neural network looks like and then go through the code in detail and understand how it is implemented.
The neural network takes the harmonic embedding of the spatial location $(x, y, z)$ and the harmonic embedding of $(θ, φ)$ as its input and outputs the predicted density $σ$ and the predicted color $(r, g, b)$. The following figure illustrates the network architecture that we are going to implement in this section:
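A reduced, self-contained sketch of such a network is shown below. It is simplified relative to the architecture discussed in this section: the harmonic embedding is implemented inline, the layer widths and the class name SimpleNeRF are illustrative, and skip connections are omitted.

import torch
import torch.nn as nn

def harmonic_embedding(x, n_harmonic_functions=6):
    # Map each coordinate to [sin(2^k x), cos(2^k x)] for k = 0..n-1.
    frequencies = 2.0 ** torch.arange(n_harmonic_functions, device=x.device)
    embed = x[..., None] * frequencies  # (..., dim, n_harmonic_functions)
    return torch.cat([embed.sin(), embed.cos()], dim=-1).flatten(start_dim=-2)

class SimpleNeRF(nn.Module):
    # A reduced NeRF-style MLP: position -> density, position + direction -> color.
    def __init__(self, n_harmonic_xyz=6, n_harmonic_dir=4, hidden=128):
        super().__init__()
        xyz_dim = 3 * 2 * n_harmonic_xyz
        dir_dim = 3 * 2 * n_harmonic_dir
        self.n_harmonic_xyz = n_harmonic_xyz
        self.n_harmonic_dir = n_harmonic_dir
        self.trunk = nn.Sequential(
            nn.Linear(xyz_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, directions):
        features = self.trunk(harmonic_embedding(xyz, self.n_harmonic_xyz))
        density = self.density_head(features)
        color = self.color_head(
            torch.cat([features, harmonic_embedding(directions, self.n_harmonic_dir)], dim=-1)
        )
        return density, color

# Query the model at 1,000 random points with 1,000 unit view directions.
model = SimpleNeRF()
xyz = torch.rand(1000, 3)
directions = torch.nn.functional.normalize(torch.rand(1000, 3), dim=-1)
density, color = model(xyz, directions)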
4.3. Volume rendering with radiance fields
Volume rendering allows you to create a 2D projection of a 3D image or scene. In this section, we will learn about rendering a 3D scene from different viewpoints. For the purposes of this section, assume that the NeRF model is fully trained and that it accurately maps the input coordinates $(x, y, z, d_{x}, d_{y}, d_{z})$ to an output $(r, g, b, σ)$. Here are the definitions of these input and output coordinates:
- $(x, y, z)$: The spatial location of the point in the 3D scene
- $(d_{x} , d_{y} , d_{z} )$: The direction vector of the viewing angle
- $(r, g, b)$: The color of the point in the 3D scene
- $σ$: The volume density of the point in the 3D scene
In the previous section, you came to understand the concepts underlying volumetric rendering. You used the technique of ray sampling to get volume densities and colors from the volume; we called this volume sampling. In this chapter, we are going to use ray sampling on the radiance field to get the volume densities and colors. We can then perform ray marching to obtain the color intensity at each pixel. The ray marching technique used in the previous chapter and the one used in this chapter are conceptually similar. The difference is that 3D voxels are a discrete representation of 3D space, whereas radiance fields are a continuous representation of it (because we use a neural network to encode this representation). This slightly changes the way we accumulate color intensities along a ray.
4.3.1. Projecting rays into the scene
Imagine placing a camera at a viewpoint and pointing it towards the 3D scene of interest. This is the scene on which the NeRF model is trained. To synthesize a 2D projection of the scene, we first send out rays into the 3D scene, originating from the viewpoint.
The ray can be parameterized as follows:
$r(t) = o + td$
Here, $r$ is the ray starting from the origin $o$ and traveling along the direction $d$. It is parametrized by $t$, which can be varied in order to move to different points on the ray. Note that $r$ is a 3D vector representing a point in space.
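As a small illustration, sampling points along one such ray in plain PyTorch (the origin, direction, and depth range below are placeholder values) looks like this:

import torch

# Ray origin o (the camera center) and a unit viewing direction d; placeholder values.
origin = torch.tensor([0.0, 0.0, 0.0])
direction = torch.tensor([0.0, 0.0, 1.0])

# 64 values of t between a near and a far depth give 64 sample points r(t) = o + t d.
t = torch.linspace(0.1, 4.0, 64)
points = origin + t[:, None] * direction  # shape (64, 3)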
4.3.2. Accumulating the color of a ray
We can use some well-known classical color rendering techniques to render the color of the ray. Before we do that, let us get a feeling for some standard definitions:
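The quantities referenced below are commonly defined as follows (this is the standard emission-absorption volume rendering model): $σ(r(t))$ is the volume density at the point $r(t)$, $c(r(t), d)$ is the color emitted at that point when viewed along direction $d$, and $T(t)$ is the accumulated transmittance, that is, the probability that the ray travels from the near bound $t_n$ to $t$ without being blocked. The color accumulated along the ray is then the integral

$C(r) = \int_{t_n}^{t_f} T(t)\, \sigma(r(t))\, c(r(t), d)\, dt, \qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\, ds\right)$

where $t_n$ and $t_f$ are the near and far bounds of the ray.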
Our NeRF model is a continuous function representing the radiance field of the scene. We can use it to obtain $c(r(t), d)$ and $σ(r(t))$ at various points along the ray. There are many techniques for numerically estimating the integral $C(r)$. While training and visualizing the outputs of the NeRF model, we used the standard EmissionAbsorptionRaymarcher method to accumulate the radiance along a ray.
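For reference, a minimal sketch of how PyTorch3D's EmissionAbsorptionRaymarcher can be called on its own is shown below; the densities and features here are random placeholders with the shapes the ray marcher expects (batch, image height, image width, points per ray, channels):

import torch
from pytorch3d.renderer import EmissionAbsorptionRaymarcher

raymarcher = EmissionAbsorptionRaymarcher()

# One image of 32 x 32 rays with 64 sample points per ray:
# densities in [0, 1] with a trailing channel dimension, plus RGB features.
rays_densities = torch.rand(1, 32, 32, 64, 1)
rays_features = torch.rand(1, 32, 32, 64, 3)

# Returns the rendered features with an appended opacity channel: (1, 32, 32, 4).
images = raymarcher(rays_densities=rays_densities, rays_features=rays_features)
print(images.shape)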