3D Deep Learning with Pytorch3D [Part 3 - State-of-the-art 3D Deep Learning Using PyTorch3D]
Setup environment
conda create -n dl3d python=3.7
conda activate dl3d
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
conda install pytorch3d -c pytorch3d
conda install -c open3d-admin open3d
pip install -U scikit-learn scipy matplotlib
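After installation, a quick sanity check (a minimal sketch; it simply imports the packages installed above and prints their versions):

import torch
import pytorch3d
import open3d as o3d
import sklearn

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("PyTorch3D:", pytorch3d.__version__)
print("Open3D:", o3d.__version__)
print("scikit-learn:", sklearn.__version__)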
1. Controllable Neural Feature Fields
1.1. GAN-based image synthesis
Deep generative models have been shown to produce photorealistic 2D images when trained on a distribution from a particular domain. Generative Adversarial Networks (GANs) are one of the most widely used frameworks for this purpose. They can synthesize high-quality photorealistic images at resolutions of 1,024 x 1,024 and beyond. For example, they have been used to generate realistic faces:
GANs can be trained to generate similar-looking images from any data distribution. The same StyleGAN2 model, when trained on a car dataset, can generate high-resolution images of cars:
GANs are based on a game-theoretic scenario in which a generator neural network synthesizes an image. To be successful, it must fool the discriminator into classifying that image as real. This tug of war between the two neural networks (that is, the generator and the discriminator) can lead to a generator that produces photorealistic images. The generator does this by learning a mapping from a multi-dimensional latent space to the image space, such that points sampled from the latent distribution map to realistic images from the domain of the training images. In order to generate a novel image, we just need to sample a point from the latent space and let the generator create an image from it:
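The following is a minimal, generic sketch of that sampling step in PyTorch (a toy generator used only for illustration, not StyleGAN2; the network and latent dimension are placeholders):

import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    # A toy generator: maps a latent vector to a small RGB image.
    def __init__(self, latent_dim=128, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * img_size * img_size), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, self.img_size, self.img_size)

generator = TinyGenerator()
z = torch.randn(1, 128)      # sample a point in the latent space
fake_image = generator(z)    # the generator maps it to an image
print(fake_image.shape)      # torch.Size([1, 3, 64, 64])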
Synthesizing high-resolution photorealistic images is great, but it is not the only desirable property of a generative model. More real-life applications open if the generation process is disentangled and controllable in a simple and predictable manner. More importantly, we need attributes such as object shape, size, and pose to be as disentangled as possible so that we can vary them without changing other attributes in the image.
Existing GAN-based image generation approaches generate 2D images without truly understanding the underlying 3D nature of the image. Therefore, there are no built-in explicit controls for varying attributes such as object position, shape, size, and pose. This results in GANs that have entangled attributes. For simplicity, think about an example of a GAN model that generates realistic faces, where changing the head pose also changes the perceived gender of the generated face. This can happen if the gender and head pose attributes become entangled. This is undesirable for most practical use cases. We need to be able to vary one attribute without affecting any of the others.
1.2. Compositional 3D-aware image synthesis
Our goal is controllable image synthesis. We need control over the number of objects in the image, their position, shape, size, and pose. The GIRAFFE model is one of the first to achieve all these desirable properties while also generating high-resolution photorealistic images. In order to have control over these attributes, the model must have some awareness of the 3D nature of the scene.
Now, let us look at how the GIRAFFE model builds on top of other established ideas to achieve this. It makes use of the following high-level concepts:
Learning 3D representation: A NeRF-like model for learning implicit 3D representation and feature fields. Unlike the standard NeRF model, this model outputs a feature field instead of the color intensity. This NeRF-like model is used to enforce a 3D consistency in the images generated.
Compositional operator: A parameter-free compositional operator to compose feature fields of multiple objects into a single feature field. This will help in creating images with the desired number of objects in them.
Neural rendering model: This uses the composed feature field to create an image. It is a 2D Convolutional Neural Network (CNN) that upsamples the feature field to create a higher-resolution output image.
GAN: The GIRAFFE model uses the GAN model architecture to generate new scenes. The preceding three components form the generator. The model also consists of a discriminator neural network that distinguishes between fake images and real images. Due to the presence of a NeRF model along with a composition operator, this model will make the image generation process both compositional and 3D aware.
Generating an image is a two-step process:
Volume-render a feature field given the camera viewing angle, along with some information about the objects you want to render. This object information consists of abstract latent vectors that you will learn about in the following sections.
Use a neural rendering model to map the feature field to a high-resolution image.
This two-step approach was found to be better at generating high-resolution images as compared to directly generating the RGB values from the NeRF model output. From the previous chapter, we know that a NeRF model is trained on images from the same scene. A trained model can only generate an image from the same scene. This was one of the big limitations of the NeRF model.
In contrast, the GIRAFFE model is trained on unposed images from different scenes. A trained model can generate images from the same distribution as the one it was trained on. Typically, this model is trained on the same kind of data; that is, the training data distribution comes from a single domain. For example, if we train a model on the Cars dataset, we can expect the images generated by this model to be some version of a car. It cannot generate images from a completely unseen distribution such as faces. While this is still a limitation of what the model can do, it is much less limiting than the standard NeRF model.
The fundamental concepts implemented in the GIRAFFE model that we have discussed so far are summarized in the following diagram:
The generator model uses the chosen camera pose and N, the number of objects (including the background), together with the corresponding shape and appearance codes and affine transformations, to first synthesize feature fields. The individual feature fields corresponding to individual objects are then composed together to form an aggregate feature field. The model then volume-renders this feature field along each ray using the standard principles of volume rendering. Following this, a neural rendering network transforms the rendered feature field into pixel values in the image space.
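The following pseudocode-style sketch summarizes this pipeline; all function and argument names are placeholders standing in for the components described above, not the actual GIRAFFE implementation:

def generate_image(camera_pose, shape_codes, appearance_codes, affine_transforms,
                   feature_field_net, compose, volume_render, neural_render):
    # 1. One NeRF-like feature field per object (including the background).
    object_fields = [
        feature_field_net(shape_codes[i], appearance_codes[i],
                          affine_transforms[i], camera_pose)
        for i in range(len(shape_codes))
    ]
    # 2. Compose the individual fields into a single scene-level feature field.
    scene_field = compose(object_fields)
    # 3. Volume-render the composed field into a low-resolution feature image.
    feature_image = volume_render(scene_field, camera_pose)
    # 4. Map the feature image to a high-resolution RGB image with a 2D CNN.
    return neural_render(feature_image)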
1.2.1. Generating feature fields
The first step of the scene generation process is generating a feature field. This is analogous to generating an RGB image in the NeRF model. In the NeRF model, the output of the model is a feature field that happens to be an image made up of RGB values. However, a feature field can be any abstract notion of the image. It is a generalization of an image matrix. The difference here is that instead of generating a three-channel RGB image, the GIRAFFE model generates a more abstract image that we refer to as the feature field, with dimensions $H_{v}$, $W_{v}$, and $M_{f}$, where $H_{v}$ is the height of the feature field, $W_{v}$ is its width, and $M_{f}$ is the number of channels in the feature field.
For this section, let us assume that we have a trained GIRAFFE model. It has been trained on some predefined dataset that we are not going to think about now. To generate a new image, we need to do the following three things:
Specify the camera pose: This defines the viewing angle of the camera. As a preprocessing step, we use this camera pose to cast a ray into the scene and generate a direction vector ($d_{j}$) along with sampled points ($x_{ij}$). We will project many such rays into the scene.
Sample 2N latent codes: We sample two latent codes corresponding to each object we wish to see in the rendered output image. One latent code corresponds to the shape of the object and the other latent code corresponds to its appearance. These codes are sampled from a standard normal distribution.
Specify N affine transformations: This corresponds to the pose of the object in the scene.
The generator part of the model does the following:
For each expected object in the scene, use the shape code, the appearance code, the object’s pose information (that is, the affine transformation), the viewing direction vector, and a point in the scene ($x_{ij}$) to generate a feature field (a vector) and a volume density for that point. This is the NeRF model in action.
Use the compositional operator to compose these feature fields and densities into a single feature field and density value for that point. Here, the compositional operator does the following:
The volume densities of the individual objects at a point are simply summed up. The feature fields are averaged, with each object's feature weighted in proportion to its volume density at that point. One important benefit of such a simple operator is that it is differentiable. Therefore, it can be placed inside a neural network, since gradients can flow through this operator during the model training phase.
We use volume rendering to render a feature field for each ray generated for the input camera pose by aggregating feature field values along the ray. We do this for multiple rays to create a full feature field of dimension $H_{v} \times W_{v}$. Here, $H_{v}$ and $W_{v}$ are generally small values, so we are creating a low-resolution feature field.
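The following is a minimal sketch of these two steps for a single ray (my own simplified implementation, not the GIRAFFE code; the tensor shapes are assumptions):

import torch

def compose_fields(sigmas, features):
    # sigmas:   (num_objects, num_points)        per-object volume densities
    # features: (num_objects, num_points, M_f)   per-object feature fields
    sigma = sigmas.sum(dim=0)                                # densities are summed
    weights = sigmas / sigma.clamp(min=1e-8)                 # density-proportional weights
    feature = (weights.unsqueeze(-1) * features).sum(dim=0)  # weighted average of features
    return sigma, feature

def volume_render_ray(sigma, feature, deltas):
    # Standard volume rendering weights, applied to features instead of colors.
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # (num_points,)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * transmittance
    return (weights.unsqueeze(-1) * feature).sum(dim=0)      # aggregated feature of size M_f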
Note
Feature fields
A feature field is an abstract notion of an image. It does not contain RGB values and is typically low in spatial dimensions (such as 16 x 16 or 64 x 64) but high in channel dimensions. However, we ultimately need an image that is spatially high-dimensional (for example, 512 x 512) with just three channels (RGB). Let us look at a way to do that with a neural network.
1.2.2. Mapping feature fields to images
After we generate a feature field of dimensions $H_{v} \times W_{v} \times M_{f}$, we need to map this to an image of dimension $H \times W \times 3$. Typically, $H_{v} < H$, $W_{v} < W$, and $M_{f} > 3$. The GIRAFFE model uses this two-stage approach since an ablation analysis showed it to be better than using a single-stage approach to generate the image directly.
The mapping operation is a parametric function that can be learned with data, and using a 2D CNN is best suited for this task since it is a function in the image domain. You can think of this function as an upsampling neural network like a decoder in an auto-encoder. The output of this neural network is the rendered image that we can see, understand, and evaluate. Mathematically, this can be defined as follows:
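A reconstruction of this mapping (the notation in the original figure may differ): a neural rendering operator $π_{θ}$ with learnable parameters $θ$ maps

$π_{θ}: \mathbb{R}^{H_{v} \times W_{v} \times M_{f}} \rightarrow \mathbb{R}^{H \times W \times 3}$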
This neural network consists of a series of upsampling layers implemented as n blocks of nearest-neighbor upsampling, each followed by a 3 x 3 convolution and a leaky ReLU. This creates a series of n different spatial resolutions of the feature field. In addition, at each spatial resolution, the feature field is mapped to a three-channel image of the same spatial resolution via a 3 x 3 convolution. At the same time, the image from the previous spatial resolution is upsampled using a non-parametric bilinear upsampling operator and added to the image at the new spatial resolution. This is repeated until we reach the desired spatial resolution of H x W.
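A simplified sketch of one such upsampling block together with the RGB skip path (my own approximation of the described architecture, not the released GIRAFFE code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RenderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.to_rgb = nn.Conv2d(out_ch, 3, kernel_size=3, padding=1)

    def forward(self, feat, rgb_prev=None):
        # Nearest-neighbor upsampling, 3 x 3 convolution, leaky ReLU.
        feat = F.interpolate(feat, scale_factor=2, mode="nearest")
        feat = F.leaky_relu(self.conv(feat), 0.2)
        # Map the current feature map to an RGB image at this resolution.
        rgb = self.to_rgb(feat)
        # Add the bilinearly upsampled image from the previous resolution.
        if rgb_prev is not None:
            rgb = rgb + F.interpolate(rgb_prev, scale_factor=2,
                                      mode="bilinear", align_corners=False)
        return feat, rgb

Stacking n such blocks takes the $H_{v} \times W_{v} \times M_{f}$ feature field to an $H \times W \times 3$ image.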
The skip connections from the feature field to a similar dimensional image help with a strong gradient flow to the feature fields in each spatial resolution. Intuitively, this ensures that the neural rendering model has a strong understanding of the image in each spatial resolution. Additionally, the skip connections ensure that the final image that is generated is a combination of the image understanding at various resolutions.
This concept becomes very clear with the following diagram of the neural rendering model:
The neural rendering model takes the feature field output from the previous stage and generates a high-resolution RGB image. Since the feature field is generated using a NeRF-based generator, it should understand the 3D nature of the scene, the objects in them, and their position, pose, shape, and appearance. And since we use a compositional operator, the feature field also encodes the number of objects in the scene.
2. Modeling the Human Body in 3D
2.1. Formulating the 3D modeling problem
“All models are wrong, but some are useful” is a popular aphorism in statistics. It suggests that it is often hard to mathematically model all the tiny details of a problem. A model will always be an approximation of reality, but some models are more accurate and, therefore, more useful than others.
In the field of machine learning, modeling a problem generally involves the following two components:
- Mathematically formulating the problem
- Building algorithms to solve that problem under the constraints and boundaries of that formulation
Good algorithms used on badly formulated problems often result in sub-optimal models. However, less powerful algorithms applied to a well-formulated model can sometimes result in great solutions. This insight is especially true for building 3D human body models.
The goal of this modeling problem is to create realistic animated human bodies. More importantly, this should represent realistic body shapes and must deform naturally according to changes in body pose and capture soft tissue motions. Modeling the human body in 3D is a hard challenge. The human body has a mass of bones, organs, skin, muscles, and water and they interact with each other in complex ways.
To exactly model the human body, we need to model the behavior of all these individual components and their interactions with each other. This is a hard challenge, and for some practical applications, this level of exactness is unnecessary. In this chapter, we will model the human body’s surface and shape in 3D as a proxy for modeling the entire human body. We do not need the model to be exact; we just need it to have a realistic external appearance. This makes the problem more approachable.
2.1.1. Defining a good representation
The goal is to represent the human body accurately with a low-dimensional representation. Joint models are low-dimensional representations (typically 17 to 25 points in 3D space) but do not carry a lot of information about the shape and texture of the person. At the other end, we can consider the voxel grid representation. This can model the 3D body shape and texture, but it is extremely high-dimensional and does not naturally lend itself to modeling body dynamics due to pose changes.
Therefore, we need a representation that can jointly represent body joints and surfaces, which contains information about body volume. There are several candidate representations for surfaces; one such representation is a mesh of vertices. The Skinned Multi-Person Linear (SMPL) model uses this representation. Once specified, this mesh of vertices will describe the 3D shape of a human body.
Because there is a lot of history to this problem, we will find that many artists in the character animation industry have worked on building good body meshes. The SMPL model uses such expert insights to build a good initial template of a body mesh. This is an important first step because certain parts of the body have high-frequency variations (such as the face and hands). Such high-frequency variations need more densely packed points to describe them, but body parts with lower frequency variations (such as thighs) need less dense points to accurately describe them. Such a hand-crafted initial mesh will help bring down the dimensionality of the problem while keeping the necessary information to accurately model it. This mesh in the SMPL model is gender-neutral, but you can build variations for men and women separately.
More concretely, the initial template mesh consists of 6,890 points in 3D space to represent the human body surface. When this is vectorized, this template mesh has a vector length of 6,890 x 3 = 20,670. Any human body can be obtained by distorting this template mesh vector to fit the body surface.
It sounds like a remarkably simple concept on paper, but the number of configurations of a 20,670-dimensional vector is extremely high. The set of configurations that represents a real human body is an extremely tiny subset of all the possibilities. The problem then becomes defining a method to obtain a plausible configuration that represents a real human body.
Before we understand how the SMPL model is designed, we need to learn about skinning models. In the next section, we will look at one of the simplest skinning techniques: the Linear Blend Skinning technique. This is important because the SMPL model is built on top of this technique.
2.1.2. Linear Blend Skinning
Once we have a good representation of the 3D human body, we want to model how it looks in different poses. This is particularly important for character animation. Skinning is a term used in the animation industry; it involves enveloping an underlying skeleton with a skin (a surface) that conveys the appearance of the object being animated. Typically, this representation takes the form of vertices, which are then used to define connected polygons that form the surface.
The Linear Blend Skinning model is used to take a skin in the resting pose and transform it into a skin in any arbitrary pose using a simple linear model. This is so efficient to render that many game engines support this technique, and it is also used to render game characters in real time.
Let us now understand what this technique involves. This technique is a model that uses the following parameters:
- A template mesh, T, with N vertices. In this case, N = 6,890.
- We have the K joint locations represented by the vector J. These joints correspond to joints in the human body (such as shoulders, wrists, and ankles). There are 23 of these joints (K = 23).
- Blend weights, W. This is typically a matrix of size N x K that captures the relationship between the N surface vertices and the K joints of the body.
- The pose parameters, $θ$. These are the rotation parameters for each of the K joints plus the overall body rotation. There are 3K + 3 of these parameters; in this case, we have 72 of them, where 69 correspond to the 23 joints and 3 correspond to the overall body rotation.
The skinning function takes the resting pose mesh, the joint locations, the blend weights, and the pose parameters as input and produces the output vertices:
In Linear Blend Skinning, the function takes the form of a simple linear function of the transformed template vertices as described in the following equation:
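The standard Linear Blend Skinning equation can be written as follows (reconstructed here from the term definitions below; the original figure may use slightly different notation), where $t'_{i}$ is the transformed vertex:

$t'_{i} = \sum_{k=1}^{K} w_{k,i} \, G_{k}(θ, J) \, t_{i}$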
The meaning of these terms is the following:
- $t_{i}$ represents vertex $i$ of the original mesh in the resting pose.
- $G_{k}(θ, J)$ is the matrix that transforms joint $k$ from the resting pose to the target pose.
- $w_{k,i}$ are elements of the blend weights, W. They represent the amount of influence that joint $k$ has on vertex $i$.
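As a concrete illustration, here is a minimal sketch of this computation using homogeneous coordinates (my own implementation for illustration; the tensor shapes are assumptions):

import torch

def linear_blend_skinning(T, W, G):
    # T: (N, 3)     template vertices in the resting pose
    # W: (N, K)     blend weights w_{k,i}
    # G: (K, 4, 4)  per-joint transformation matrices G_k(theta, J)
    N = T.shape[0]
    T_h = torch.cat([T, torch.ones(N, 1)], dim=1)        # homogeneous coordinates, (N, 4)
    per_joint = torch.einsum("kij,nj->nki", G, T_h)      # apply every joint transform to every vertex
    blended = (W.unsqueeze(-1) * per_joint).sum(dim=1)   # blend-weighted sum over joints
    return blended[:, :3]                                # transformed vertices, (N, 3)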
While this is an easy-to-use linear model and is very well used in the animation industry, it does not explicitly preserve volume. Therefore, transformations can look unnatural.
In order to fix this problem, artists tweak the template mesh so that when the skinning model is applied, the outcome looks natural and realistic. Such linear deformations applied to the template mesh to obtain realistic-looking transformed mesh are called blend shapes. These blend shapes are artist-designed for all of the different poses the animated character can have. This is a very time-consuming process.
2.2. SMPL model
As the acronym of SMPL suggests, this is a learned linear model trained on data from thousands of people. This model is built upon concepts from the Linear Blend Skinning model. It is an unsupervised and generative model that generates a 20,670-dimensional vector using the provided input parameters that we can control. This model calculates the blend shapes required to produce the correct deformations for varying input parameters. We need these input parameters to have the following important properties:
It should correspond to a real tangible attribute of the human body.
The features must be low-dimensional in nature. This will enable us to easily control the generative process.
The features must be disentangled and controllable in a predictable manner. That is, varying one parameter should not change the output characteristics attributed to other parameters.
Keeping these requirements in mind, the creators of the SMPL model came up with the two most important input attributes: some notion of body identity and body pose. The SMPL model decomposes the final 3D body mesh into an identity-based shape and pose-based shape (identity-based shape is also referred to as shape-based shape because the body shape is tied to a person’s identity). This gives the model the desired property of feature disentanglement. There are some other important factors such as breathing and soft tissue dynamics (when the body is in motion) that we do not explain in this chapter but are part of the SMPL model.
Most importantly, the SMPL model is an additive model of deformations. That is, the desired output body shape vector is obtained by adding deformations to the original template body vector. This additive property makes this model very intuitive to understand and optimize.
2.2.1. Defining the SMPL model
The SMPL model builds on top of the standard skinning models. It makes the following changes to it:
• Rather than using the standard resting pose template, it uses a template mesh that is a function of the body shape and pose
• Joint locations are a function of the body shape
The function specified by the SMPL model takes the following form:
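Based on the published SMPL formulation, the function takes roughly the following form (reconstructed here; the notation in the original figure may differ). Writing the Linear Blend Skinning function as LBS, we have:

$M(β, θ) = LBS(T_{P}(β, θ), J(β), θ, W)$

$T_{P}(β, θ) = \bar{T} + B_{S}(β) + B_{P}(θ)$

Here, $\bar{T}$ is the template mesh, $B_{S}(β)$ and $B_{P}(θ)$ are the shape-dependent and pose-dependent blend shapes added to it, and $J(β)$ gives the shape-dependent joint locations.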
The following is the meaning of the terms in the preceding definitions:
$β$ is a vector representing the identity (also called the shape) of the body. We will later learn more about what it represents.
$θ$ is the pose parameter corresponding to the target pose.
$W$ is the blend weight from the Linear Blend Skinning model.
This function looks very similar to the Linear Blend Skinning model. In this function, the template mesh is a function of shape and pose parameters, and the joint’s location is a function of shape parameters. This is not the case in the Linear Blend Skinning model.
Shape and pose-dependent template mesh
Shape-dependent joints
2.2.2. Using the SMPL model
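The book's own code for this section is not reproduced here. As an illustration, the following sketch generates a random body with the third-party smplx package, assuming the SMPL model files have already been downloaded to a local models/ directory:

import torch
import smplx

# Assumes the SMPL .pkl model files live under "models/" (downloaded separately).
model = smplx.create("models", model_type="smpl", gender="neutral")

betas = torch.randn(1, 10) * 0.5        # random identity (shape) parameters
body_pose = torch.randn(1, 69) * 0.2    # random rotations for the 23 joints
global_orient = torch.zeros(1, 3)       # overall body rotation

output = model(betas=betas, body_pose=body_pose,
               global_orient=global_orient, return_verts=True)
vertices = output.vertices.detach()[0]  # (6890, 3) mesh vertices
joints = output.joints.detach()[0]      # 3D joint locations
print(vertices.shape, joints.shape)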
2.3. Estimating 3D human pose and shape using SMPLify
In the previous section, you explored the SMPL model and used it to generate a 3D human body with a random shape and pose. It is natural to wonder whether it is possible to use the SMPL model to fit a 3D human body onto a person in a 2D image. This has multiple practical applications, such as understanding human actions or creating animations from 2D videos. This is indeed possible, and in this section, we are going to explore this idea in more detail.
Imagine that you are given a single RGB image of a person without any information about body pose, camera parameters, or shape parameters. Our goal is to deduce the 3D shape and pose from just this single image. Estimating the 3D shape from a 2D image is not always error-free. It is a challenging problem because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D (because multiple 3D poses can have the same 2D pose when projected). We also need an automatic way of estimating this without much manual intervention. It also needs to work on complex poses in natural images with a variety of backgrounds, lighting conditions, and camera parameters.
One of the best methods of doing this was invented by researchers from the Max Planck Institute of Intelligent Systems (where the SMPL model was invented), Microsoft, the University of Maryland, and the University of Tübingen. This approach is called SMPLify. Let us explore this approach in more detail.
The SMPLify approach consists of the following two stages:
Automatically detect 2D joints using established pose detection models such as OpenPose or DeepCut. Any 2D joint detector can be used in their place as long as it predicts the same joints.
Use the SMPL model to generate the 3D shape. Directly optimize the parameters of the SMPL model so that the model joints of the SMPL model project to the 2D joints predicted in the previous stage.
With the SMPL model, we can capture information about body shape from the joints alone. In the SMPL model, the body shape parameters, $β$, are the coefficients of the principal components of its PCA shape model, and the pose is parametrized by the relative rotations, $θ$, of the 23 joints in the kinematic tree. We need to fit these parameters, $β$ and $θ$, so that an objective function is minimized.
2.3.1. Defining the optimization objective function
In summary, the objective function consists of five components. Together, they ensure that the solution is a set of pose and shape parameters ($θ$ and $β$) that minimizes the 2D joint projection distances while simultaneously avoiding large joint angles and unnatural self-penetrations, and keeping the pose and shape parameters close to a prior distribution learned from a large dataset of natural body poses and shapes.
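Based on the published SMPLify formulation, the objective has roughly the following form (reconstructed here; the weights and symbols may differ from the original figure):

$E(β, θ) = E_{J}(β, θ; K, J_{est}) + λ_{θ} E_{θ}(θ) + λ_{a} E_{a}(θ) + λ_{sp} E_{sp}(θ; β) + λ_{β} E_{β}(β)$

Here, $E_{J}$ is the 2D joint reprojection term, $E_{a}$ penalizes unnatural bending of elbows and knees, $E_{sp}$ penalizes self-penetration, and $E_{θ}$ and $E_{β}$ are the pose and shape priors, with the $λ$ terms weighting each component.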
2.3.2. Exploring SMPLify
3. Performing End-to-End View Synthesis with SynSin
3.1. Overview of view synthesis
One of the most popular research directions in 3D computer vision is view synthesis. Given data captured from one viewpoint, the goal of this research direction is to generate a new image that renders the scene from another viewpoint.
View synthesis comes with two challenges: the model should understand both the 3D structure and the semantic information of the image. By 3D structure, we mean that when changing the viewpoint, we get closer to some objects and farther away from others. A good model should handle this by rendering images in which some objects become bigger and others smaller according to the view change. By semantic information, we mean that the model should differentiate the objects and understand which objects are present in the image. This is important because some objects may be only partially visible in the image; therefore, during reconstruction, the model should understand the semantics of the object to know how to reconstruct its unseen parts. For example, given an image of a car from one side where we only see two wheels, we know that there are two more wheels on the other side of the car. The model must capture these semantics during reconstruction:
Many challenges need to be addressed; in particular, it is hard for a model to understand the 3D scene from a single image. There are several approaches to view synthesis:
View synthesis from multiple images: Deep neural networks can be used to learn the depth of multiple images, and then reconstruct new images from another view. However, as mentioned earlier, this implies that we have multiple images from slightly different views, and sometimes, it’s hard to obtain such data.
View synthesis using ground-truth depth: This involves a group of techniques where, alongside the image, a ground-truth map representing the depth and semantics of the image is used. Although these types of models can achieve good results in some cases, it is hard, expensive, and time-consuming to gather and annotate such data on a large scale, especially for outdoor scenes.
View synthesis from a single image: This is a more realistic setting where we have only one image and we aim to reconstruct an image from a new view. It is harder to get accurate results using only one image. SynSin belongs to this group of methods and achieves state-of-the-art view synthesis.
3.2. SynSin network architecture
The idea of SynSin is to solve the view synthesis problem with an end-to-end model using only one image at test time. This is a model that doesn't need 3D data annotations and achieves very good accuracy compared to its baselines:
The model is trained end-to-end, and it consists of three different modules:
• Spatial feature and depth networks
• Neural point cloud renderer
• Refinement module and discriminator
Let’s dive deeper into each one to better understand the architecture.
3.2.1. Spatial feature and depth networks
These are the spatial feature network (f) and the depth network (d):
Given a reference image and the desired change in pose (T), we wish to generate an image as if that change in pose were applied to the reference image. For the first part, we only use the reference image and feed it to two networks. The spatial feature network aims to learn feature maps, which are higher-dimensional feature representations of the image. This part of the model is responsible for learning semantic information about the image. The network consists of eight ResNet blocks and outputs a 64-dimensional feature vector for each pixel of the image. The output has the same resolution as the original image.
Next, the depth network aims to learn the 3D structure of the image. It won't be an accurate 3D structure, as we don't use exact 3D annotations; however, the model will improve it further during training. A UNet with eight downsampling and upsampling layers, followed by a sigmoid layer, is used for this network. Again, the output has the same resolution as the original image.
As you might have noticed, both networks keep the full spatial resolution of the input in their outputs. This will further help to reconstruct more accurate and higher-quality images.
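The following schematic sketch shows only the interfaces of these two networks (placeholder layers standing in for the ResNet blocks and the UNet, not the actual SynSin implementation):

import torch
import torch.nn as nn

class SpatialFeatureNet(nn.Module):
    # Stand-in for the 8-block ResNet: 3-channel image -> 64-channel feature map,
    # keeping the spatial resolution of the input.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),
        )

    def forward(self, img):
        return self.net(img)            # (B, 64, H, W)

class DepthNet(nn.Module):
    # Stand-in for the 8-level UNet: 3-channel image -> 1-channel depth map in (0, 1).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, img):
        return self.net(img)            # (B, 1, H, W)

img = torch.randn(1, 3, 256, 256)
features, depth = SpatialFeatureNet()(img), DepthNet()(img)
print(features.shape, depth.shape)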
3.2.2. Neural point cloud renderer
The next step is to create a 3D point cloud that can then be used, together with the view transform, to render a new image from the new viewpoint. For that, we use the combined output of the spatial feature and depth networks.
The next step should be rendering the image from the new viewpoint. In most scenarios, a naïve renderer would be used. This projects 3D points to one pixel or a small region in the new view. A naïve renderer uses a z-buffer, which keeps all the distances from the points to the camera. The problem with the naïve renderer is that it's not differentiable, which means we can't use gradients to update our depth and spatial feature networks. Moreover, we want to render features instead of RGB images. This means the naïve renderer won't work for this technique:
Why not just differentiate naïve renderers? Here, we face two problems:
Small neighborhoods: As mentioned earlier, each point only appears on one or a few pixels of the rendered image. Therefore, there are only a few gradients for each point. This is a drawback of local gradients, which degrades the performance of the network relying on gradient updates.
The hard z-buffer: The z-buffer only keeps the nearest point for rendering the image. If new points appear closer, suddenly the output will change drastically.
To overcome the issues presented here, the model softens these hard decisions. This technique is called a neural point cloud renderer. Instead of assigning a point to a single pixel, the renderer splats it over a region with varying influence, which solves the small-neighborhood problem. For the hard z-buffer issue, we then accumulate the effect of the nearest points, not just the nearest point:
A 3D point is projected and splatted with radius r (see the preceding figure). Then, the influence of the 3D point on a pixel is measured by the Euclidean distance between the center of the splatted region and that pixel:
As you can see in the preceding figure, each point is splatted, which helps us not to lose too much information and helps in solving the problems described previously.
The advantage of this approach is that it allows you to gather more gradients for one 3D point, which improves the network learning process for both spatial features and depth networks:
Lastly, we need to gather and accumulate points in the z-buffer. First, we sort points according to their distance from the new camera, and then K-nearest neighbors with alpha compositing are used to accumulate points:
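PyTorch3D ships a differentiable point cloud renderer built from exactly these ingredients (a splat radius, K points per pixel, and alpha compositing). The following sketch renders a feature point cloud with it; the point cloud, feature size, and parameter values are illustrative, not SynSin's settings:

import torch
from pytorch3d.structures import Pointclouds
from pytorch3d.renderer import (
    FoVPerspectiveCameras, PointsRasterizationSettings,
    PointsRasterizer, PointsRenderer, AlphaCompositor,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
points = torch.rand(1, 2048, 3, device=device) - 0.5    # dummy 3D points
features = torch.rand(1, 2048, 64, device=device)       # per-point feature vectors

point_cloud = Pointclouds(points=points, features=features)
cameras = FoVPerspectiveCameras(device=device)
raster_settings = PointsRasterizationSettings(
    image_size=64,         # resolution of the rendered feature image
    radius=0.05,           # splat radius r
    points_per_pixel=8,    # accumulate the K nearest points per pixel
)
renderer = PointsRenderer(
    rasterizer=PointsRasterizer(cameras=cameras, raster_settings=raster_settings),
    compositor=AlphaCompositor(),
)
feature_image = renderer(point_cloud)    # (1, 64, 64, 64): H x W x feature channels
print(feature_image.shape)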
3.2.3. Refinement module and discriminator
Last but not least, the model consists of a refinement module. This module has two missions: first, to improve the accuracy of the projected features and, second, to fill in the parts of the image that are not visible from the new view. It should output semantically meaningful and geometrically accurate images. For example, if only one part of a table is visible in the image and the new view should contain a larger part of it, this module should understand semantically that this is a table and, during reconstruction, keep the lines of the new part geometrically correct (for instance, straight lines should remain straight). The model learns these properties from a dataset of real-world images:
The refinement module (g) takes the output of the neural point cloud renderer as input and outputs the final reconstructed image, which is then used in the loss objectives that drive training.
This task is solved with generative models. A ResNet with eight blocks is used and, to preserve the resolution of the image, downsampling and upsampling blocks are used as well. A GAN setup with two multi-layer discriminators and a feature-matching loss on the discriminator is employed. The final loss of the model consists of the L1 loss, the content loss, and the discriminator loss between the generated and target images:
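A schematic of how these generator losses might be combined (the weights, content network, and discriminator here are placeholders, not SynSin's actual components or values):

import torch
import torch.nn.functional as F

def generator_loss(generated, target, content_net, discriminator,
                   w_l1=1.0, w_content=1.0, w_gan=1.0):
    l1 = F.l1_loss(generated, target)                                  # pixel-wise L1 loss
    content = F.l1_loss(content_net(generated), content_net(target))   # content (perceptual) loss
    gan = -discriminator(generated).mean()                             # fool the discriminator
    return w_l1 * l1 + w_content * content + w_gan * gan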
The loss function is then used for model optimization as usual.
4. Mesh R-CNN
4.1. Overview of meshes and voxels
As mentioned earlier in this book, meshes and voxels are two different 3D data representations. Mesh R-CNN uses both representations to get better quality 3D structure predictions.
A mesh is the surface of a 3D model represented as polygons, where each polygon can be represented as a triangle. Meshes consist of vertices connected by edges, and the connected vertices and edges form faces, which are commonly triangular. This representation is good for faster transformations and rendering.
Voxels are the 3D analogs of 2D pixels. Just as each image consists of 2D pixels, it is logical to use the same idea to represent 3D data. Each voxel is a cube, and each object is a group of cubes, some of which form the outer visible parts while others lie inside the object. It's easier to visualize 3D objects with voxels, but that's not the only use case. In deep learning problems, voxels can be used as input for 3D convolutional neural networks.
Mesh R-CNN uses both types of 3D data representations. Experiments have shown that predicting voxels first, converting them into a mesh, and then refining that mesh helps the network learn better.
4.2. Mesh R-CNN architecture
3D shape detection has captured the interest of many researchers. Many models have been developed that achieve good accuracy, but they mostly focus on synthetic benchmarks and isolated objects:
At the same time, 2D object detection and image segmentation problems have seen rapid advances as well. Many models and architectures solve these problems with high accuracy and speed. There are solutions for localizing objects and detecting bounding boxes and masks. One of them is Mask R-CNN, a model for object detection and instance segmentation. This model is state-of-the-art and has a lot of real-life applications.
However, we see the world in 3D. The authors of the Mesh R-CNN paper decided to combine these two approaches into a single solution: a model that detects objects in realistic images and outputs a 3D mesh instead of a mask. The new model builds on a state-of-the-art object detection model: it takes an RGB image as input and outputs the class label, segmentation mask, and 3D mesh of each object. The authors added a new branch to Mask R-CNN that is responsible for predicting high-resolution triangle meshes:
The authors aimed to create one model that is end-to-end trainable. That is why they took the state-of-the-art Mask R-CNN model and added a new branch for mesh prediction. Before diving deeper into the mesh prediction part, let's quickly recap Mask R-CNN:
Mask R-CNN takes an RGB image as input and outputs bounding boxes, category labels, and instance segmentation masks. First, the image passes through the backbone network, which is typically based on ResNet – for example, ResNet-50-FPN. The backbone network outputs the feature map, which is the input of the next network: the region proposal network (RPN). This network outputs proposals. The object classification and mask prediction branches then process the proposals and output classes and masks, respectively.
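As a reminder of that interface, torchvision provides a pretrained Mask R-CNN that can be run as follows (standard torchvision usage, shown for illustration rather than the Mesh R-CNN code):

import torch
import torchvision

# Pretrained Mask R-CNN with a ResNet-50-FPN backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)        # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])[0]

print(predictions["boxes"].shape)      # bounding boxes
print(predictions["labels"].shape)     # category labels
print(predictions["masks"].shape)      # instance segmentation masks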
This structure of Mask R-CNN is the same for Mesh R-CNN as well. However, in the end, a mesh predictor was added. A mesh predictor is a new module that consists of two branches: the voxel branch and the mesh refinement branch.
The voxel branch takes proposed and aligned features as input and outputs the coarse voxel predictions. These are then given as input to the mesh refinement branch, which outputs the final mesh. The losses of the voxel branch and mesh refinement branch are added to the box and mask losses and the model is trained end to end:
4.2.1. Graph convolutions
Before we look at the structure of the mesh predictor, let’s understand what a graph convolution is and how it works.
Early variants of neural networks were designed for structured Euclidean data. However, in the real world, most data is non-Euclidean and has a graph structure. Recently, many variants of neural networks have been adapted to graph data as well, one of them being convolutional networks, which are called graph convolutional networks (GCNs).
Meshes have this graph structure, which is why GCNs are applicable in 3D structure prediction problems. The basic operation of a CNN is convolution, which is done using filters. We use the sliding window technique for convolution, and the filters include weights that the model should learn. GCNs use a similar technique for convolution, though the main difference is that the number of nodes can vary, and the nodes are unordered:
The figure below shows an example of a graph convolutional layer. The input of the network is the graph and its adjacency matrix, which represents the edges between the nodes and is used in forward propagation. The convolutional layer encapsulates information for each node by aggregating information from its neighborhood. After that, a nonlinear transformation is applied. Later, the output of this network can be used for different tasks, such as classification:
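A minimal sketch of this operation (a generic graph convolution for illustration, not the exact variant used in Mesh R-CNN):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_features, adjacency):
        # adjacency: (N, N) with self-loops; node_features: (N, in_dim).
        # Aggregate each node's neighborhood (mean over neighbors), then transform.
        degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        aggregated = adjacency @ node_features / degree
        return F.relu(self.linear(aggregated))

# Toy graph with 4 nodes connected in a chain.
adj = torch.tensor([[1., 1., 0., 0.],
                    [1., 1., 1., 0.],
                    [0., 1., 1., 1.],
                    [0., 0., 1., 1.]])
x = torch.randn(4, 16)
out = SimpleGraphConv(16, 32)(x, adj)
print(out.shape)    # torch.Size([4, 32])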
4.2.2. Mesh predictor
The mesh predictor module aims to detect the 3D structure of an object. It is the logical continuation of the RoIAlign module, and it is responsible for predicting and outputting the final mesh.
As we get 3D meshes from real-life images, we can’t use fixed mesh templates with fixed mesh topologies. That is why the mesh predictor consists of two branches. The combination of the voxel branch and mesh refinement branch helps reduce the issue with fixed topologies.
The voxel branch is analogous to the mask branch from Mask R-CNN. It takes aligned features from RoIAlign and outputs a G x G x G grid of voxel occupancy probabilities. Next, the Cubify operation is applied: it binarizes the voxel occupancies at a threshold and replaces each occupied voxel with a cuboid triangle mesh with 8 vertices, 18 edges, and 12 faces.
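PyTorch3D implements this operation as pytorch3d.ops.cubify; here is a minimal usage sketch (the grid size and threshold are illustrative):

import torch
from pytorch3d.ops import cubify

# Dummy voxel occupancy probabilities on a G x G x G grid (here G = 24).
voxel_probs = torch.rand(1, 24, 24, 24)

# Binarize at a threshold and replace each occupied voxel with a small cuboid mesh.
meshes = cubify(voxel_probs, thresh=0.5)
print(meshes.verts_packed().shape, meshes.faces_packed().shape)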
The voxel loss is a binary cross-entropy loss that compares the predicted voxel occupancy probabilities with the ground-truth occupancies.
The mesh refinement branch is a sequence of three different operations: vertex alignment, graph convolution, and vertex refinement. Vertex alignment is similar to RoIAlign; for each mesh vertex, it yields an image-aligned feature.
Graph convolution takes image-aligned features and propagates information along mesh edges. Vertex refinement updates vertex positions. It aims to update vertex geometry by keeping the topology fixed:
As shown in Figure above, we can have multiple stages of refinement. Each stage consists of vertex alignment, graph convolution, and vertex refinement operations. In the end, we get a more accurate 3D mesh.
The final important part of the model is the mesh loss function. For this branch, chamfer and normal losses are used. However, computing these losses requires sampled points from the predicted and ground-truth meshes.
The following mesh sampling method is used: given vertices and faces, the points are uniformly sampled from a probability distribution of the surface of the mesh. The probability of each face is proportional to its area.
Using these sampling techniques, a point cloud from the ground truth, $Q$, and a point cloud from the prediction, $P$, are sampled. Next, we calculate $Λ_{PQ}$ , which is the set of pairs $(p,q)$ where $q$ is the nearest neighbor of $p$ in $Q$.
Chamfer distance is calculated between $P$ and $Q$:
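Reconstructed from the Mesh R-CNN formulation (the original figure may use slightly different notation):

$L_{cham}(P, Q) = |P|^{-1} \sum_{(p,q) \in Λ_{PQ}} \|p - q\|^{2} + |Q|^{-1} \sum_{(q,p) \in Λ_{QP}} \|q - p\|^{2}$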
Next, the absolute normal distance is calculated:
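Again reconstructed from the same formulation:

$L_{norm}(P, Q) = -|P|^{-1} \sum_{(p,q) \in Λ_{PQ}} |u_{p} \cdot u_{q}| - |Q|^{-1} \sum_{(q,p) \in Λ_{QP}} |u_{q} \cdot u_{p}|$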
Here, $u_{p}$ and $u_{q}$ are the unit normals at points $p$ and $q$, respectively.
However, using only these two losses results in degenerate meshes. This is why, for high-quality mesh generation, a shape regularizer called the edge loss was added:
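A reconstruction of the edge loss, where $E$ is the set of mesh edges and $v$, $v'$ are the vertices of an edge:

$L_{edge}(V, E) = \frac{1}{|E|} \sum_{(v, v') \in E} \|v - v'\|^{2}$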
The final mesh loss is the weighted average of three presented losses: chamfer loss, normal loss, and edge loss.
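PyTorch3D provides building blocks for all three terms. The following sketch combines them; the meshes, sample counts, and loss weights are illustrative, and the normal term uses chamfer_distance's normal output rather than Mesh R-CNN's exact formulation:

import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_edge_loss

pred_mesh = ico_sphere(level=2)    # stand-in for a predicted mesh
gt_mesh = ico_sphere(level=3)      # stand-in for a ground-truth mesh

# Sample point clouds (with normals) from both meshes, proportional to face area.
P, P_normals = sample_points_from_meshes(pred_mesh, num_samples=5000, return_normals=True)
Q, Q_normals = sample_points_from_meshes(gt_mesh, num_samples=5000, return_normals=True)

cham, norm = chamfer_distance(P, Q, x_normals=P_normals, y_normals=Q_normals)
edge = mesh_edge_loss(pred_mesh)

mesh_loss = 1.0 * cham + 0.1 * norm + 0.5 * edge    # weighted combination (illustrative weights)
print(mesh_loss)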
In terms of training, two types of experiments were conducted. The first one was to evaluate the mesh predictor branch. Here, the ShapeNet dataset was used, which includes 55 common object categories. It is widely used for benchmarking 3D shape prediction; however, it consists of CAD models rendered in isolation, without realistic backgrounds. On this benchmark, the mesh predictor reached state-of-the-art results. Moreover, it handles objects with holes, which previous models couldn't reconstruct well:
The third row represents the output of the mesh predictor. We can see that it predicts the 3D shape and that it handles the topology and geometry of objects very well: