
Paper note 8 - Virtual Sparse Convolution for Multimodal 3D Object Detection

Virtual Sparse Convolution for Multimodal 3D Object Detection

1. Problems

  • Early methods extended the features of LiDAR points with image features, such as semantic masks and 2D CNN features. They did not increase the number of points, so distant points remain sparse.

  • Methods based on virtual/pseudo points (both denoted as virtual points in the following) enrich the sparse point cloud by creating additional points around the LiDAR points. => Virtual points complete the geometry of distant objects, showing great potential for high-performance 3D detection.

  • However, virtual points generated from an image are generally very dense. => This brings a huge computational burden and causes a severe efficiency issue (Fig. 2 (f)).

image

  • Possible solutions: using a larger voxel size or randomly downsampling. => Both inevitably sacrifice useful shape cues from faraway points and decrease detection accuracy.

  • Another issue is that the depth completion can be inaccurate, which introduces a large amount of noise into the virtual points (Fig. 2 (c)).

  • Since it is very difficult to distinguish this noise from the background in 3D space, the localization precision of 3D detection is greatly degraded. In addition, the noisy points are non-Gaussian distributed and cannot be filtered by conventional denoising algorithms. Although recent semantic segmentation networks show promising results, they generally require extra annotations.

=> The paper proposes VirConvNet, a pipeline based on a new Virtual Sparse Convolution (VirConv) operator.

Main observations:

(1) First, geometries of nearby objects are often relatively complete in LiDAR scans. Hence, most virtual points of nearby objects bring only a marginal performance gain (Fig. 2 (e)(f)), but increase the computational cost significantly.

(2) Second, noisy points introduced by inaccurate depth completions are mostly distributed on the instance boundaries (Fig. 2 (d)). They can be recognized in 2D images after being projected onto the image plane.

2. Contributions

  • Design a StVD (Stochastic Voxel Discard) scheme to retain the most important virtual points via bin-based sampling, i.e., discarding a large number of nearby voxels while retaining faraway ones. This greatly speeds up the network computation.

  • Design an NRConv (Noise-Resistant Submanifold Convolution) layer to encode geometry features of voxels in both 3D space and 2D image space. The extended receptive field in 2D space allows NRConv to distinguish the noise pattern on instance boundaries in 2D image space; consequently, the negative impact of noise can be suppressed.

  • Built upon VirConv, we present three new multimodal detectors: VirConv-L, VirConv-T, and a semi-supervised VirConv-S, for efficient, high-precision, and semi-supervised 3D detection, respectively.

  • Competitive with SOTA methods (Fig. 1).

image

3. VirConv for Multimodal 3D Detection

image

3.1. Stochastic Voxel Discard

To alleviate the computation problem and improve the density robustness for the virtual-point-based detector, we develop the StVD. It consists of two parts:

(1) input StVD, which speeds up the network by discarding input voxels of virtual points during both the training and inference processes;

(2) layer StVD, which improves the density robustness by discarding voxels of virtual points at every VirConv block, only during the training process.

image

  • Input StVD: Introduce a bin-based sampling strategy to perform efficient and balanced sampling (see Fig. 4 (c)). Specifically, we first divide the input voxels into $N^b$ bins (we adopt $N^b = 10$ in this paper) according to distance. For the nearby bins (≤ 30 m, based on the statistics in Fig. 2 (e)), we randomly keep a fixed number (∼1K) of voxels. For distant bins, we keep all voxels inside. After the bin-based sampling, we discard about 90% of the redundant voxels (which achieves the best precision-efficiency trade-off), speeding up the network by about 2×. A sketch of this sampling is given after this list.

  • Layer StVD: To improve the robustness of detection from sparse points, we also develop a layer StVD that is applied only during training. Specifically, we discard voxels at each VirConv block to simulate sparser training samples, with a discarding rate of 15% in this paper. Layer StVD serves as a data augmentation strategy that helps enhance the 3D detector's training (see the sketch after this list).
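A minimal NumPy sketch of the two StVD variants described above. The function names, array layouts, and exact bin edges are illustrative assumptions rather than the authors' implementation, but the numbers (10 bins, ~1K voxels per nearby bin, 30 m threshold, 15% layer discard rate) follow the text.

```python
import numpy as np

def input_stvd(voxel_centers, num_bins=10, nearby_thresh=30.0, keep_per_bin=1000):
    """Bin-based sampling of input virtual-point voxels (illustrative sketch).

    voxel_centers: (N, 3) voxel center coordinates in meters.
    Nearby bins (up to nearby_thresh) keep at most keep_per_bin random voxels;
    distant bins keep all of their voxels. Returns indices of kept voxels.
    """
    dist = np.linalg.norm(voxel_centers[:, :2], axis=1)         # BEV distance to sensor
    edges = np.linspace(0.0, dist.max() + 1e-6, num_bins + 1)   # distance bin edges
    kept = []
    for b in range(num_bins):
        idx = np.where((dist >= edges[b]) & (dist < edges[b + 1]))[0]
        if edges[b + 1] <= nearby_thresh and len(idx) > keep_per_bin:
            idx = np.random.choice(idx, keep_per_bin, replace=False)
        kept.append(idx)
    return np.concatenate(kept)

def layer_stvd(voxel_features, drop_rate=0.15, training=True):
    """Training-only voxel dropout applied at each VirConv block (sketch)."""
    if not training:
        return voxel_features
    keep = np.random.rand(len(voxel_features)) >= drop_rate
    return voxel_features[keep]
```

Input StVD would run in both training and inference, while layer StVD acts purely as a training-time augmentation, as described above.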

3.2. Noise-Resistant Submanifold Convolution

image

The noise introduced by inaccurate depth completion can hardly be recognized in 3D space but can be easily recognized in 2D images. We develop an NRConv (see Fig. 3 (b)) from the widely used submanifold sparse convolution to address the noise problem. Specifically, given $N$ input voxels formulated by 3D indices $H ∈ R^{N×3}$ and features $X ∈ R^{N×C^{in}}$, we encode the noise-resistant geometry features $Y ∈ R^{N×C^{out}}$ in both 3D and 2D image space, where $C^{in}$ and $C^{out}$ denote the number of input and output feature channels, respectively.

  • Encoding geometry features in 3D space: For each voxel feature $X_i$ in $X$, we first encode the geometry features with the 3D submanifold convolution kernel $K^{3D}(·)$. Specifically, the geometry features $\hat X_{i} ∈ R^{C^{out}/2}$ are calculated from the non-empty voxels within the 3 × 3 × 3 neighborhood, based on the corresponding 3D indices, as (an illustrative form is written out after this list):

image

  • Encoding noise-aware features in 2D image space: The noise brought by inaccurate depth completion significantly degrades the detection performance. Since the noise is mostly distributed on the 2D instance boundaries, we extend the convolution receptive field to the 2D image space and encode the noise-aware features using the 2D neighbor voxels. Specifically, we first convert the 3D indices to a set of grid points based on the voxelization parameters (the conversion is denoted as $G(·)$). Since state-of-the-art detectors also adopt transformation augmentations such as rotation and scaling (denoted as $T(·)$), the grid points are generally misaligned with the corresponding image. Therefore, we transform the grid points backward into the original coordinate system based on the data augmentation parameters. Then we project the grid points onto the 2D image plane based on the LiDAR-camera calibration parameters (the projection is denoted as $P(·)$). The overall projection can be summarized by the equation below (also written out after this list):

image
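The two equation images above are not reproduced here. Written out from the definitions in the text, they plausibly take the following form; the neighborhood notation $\mathcal{N}^{3D}$ and the symbol $\tilde H$ for the projected 2D indices are assumptions, not the paper's exact notation:

$$\hat X_i = K^{3D}\big(\{\, X_j \mid H_j \in \mathcal{N}^{3D}(H_i) \,\}\big), \qquad \hat X_i ∈ R^{C^{out}/2},$$

where $\mathcal{N}^{3D}(H_i)$ is the set of non-empty voxels within the 3 × 3 × 3 neighborhood of $H_i$, and

$$\tilde H = P\big(T^{-1}(G(H))\big), \qquad \tilde H ∈ R^{N×2},$$

i.e., the 3D indices are converted to grid points, transformed back to the original (un-augmented) coordinate system, and projected onto the image plane. The noise-aware features $\tilde X_i ∈ R^{C^{out}/2}$ used below are then presumably encoded from the 2D neighbors of $\tilde H_i$, analogously to the 3D branch.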

After the 3D and 2D feature encoding, we adopt a simple concatenation to implicitly learn a noise-resistant feature. Specifically, we concatenate $\hat X_i$ and $\tilde X_i$ to obtain the noise-resistant feature vector $Y ∈ R^{N×C^{out}}$ as

image
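Given that each branch outputs $C^{out}/2$ channels, the equation above presumably expresses a channel-wise concatenation (the $\oplus$ notation is an assumption):

$$Y_i = \hat X_i \oplus \tilde X_i, \qquad Y_i ∈ R^{C^{out}},$$

where $\oplus$ denotes concatenation along the feature dimension.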

Different from related noise segmentation and removal methods, our NRConv implicitly distinguishes the noise pattern by extending the receptive field to 2D image space. Consequently, the impact of noise is suppressed without loss of shape cues.

3.3. Detection Frameworks with VirConv

image

  • VirConv-L: We first construct the lightweight VirConv-L (Fig. 3 (c)) for fast multimodal 3D detection. VirConv-L adopts an early fusion scheme and replaces the backbone of Voxel-RCNN with our VirConvNet. Specifically, we denote the LiDAR points as $P = \{p\}$, $p = [x, y, z, α]$, where $x, y, z$ are the coordinates and $α$ is the intensity. We denote the virtual points as $V = \{v\}$, $v = [x, y, z]$. We fuse them into a single point cloud $P^* = \{p^*\}$, $p^* = [x, y, z, α, β]$, where $β$ is an indicator denoting which modality the point came from. The intensity of virtual points is padded with zero. The fused points are encoded into feature volumes by our VirConvNet for 3D detection (a sketch of this fusion follows this list).

  • VirConv-T: We then construct a high-precision VirConv-T based on a Transformed Refinement Scheme (TRS) and a late fusion scheme (see Fig. 5). CasA and TED achieve high detection performance based on three-stage refinement and a multiple-transformation design, respectively; however, both require heavy computation. We fuse these two computation-heavy designs into a single efficient pipeline. Specifically, we first transform $P$ and $V$ with different rotations and reflections. Then we adopt VoxelNet and VirConvNet to encode the features of $P$ and $V$, respectively. Similar to TED, the convolutional weights are shared between different transformations. After that, RoIs are generated by a Region Proposal Network (RPN) and refined using the backbone features (the RoI features of $P$ and $V$ are fused by simple concatenation) under the first transformation. The refined RoIs are further refined by the backbone features under the other transformations. Next, the refined RoIs from the different refinement stages are fused by boxes voting, as done in CasA (a simplified sketch follows this list). We finally perform non-maximum suppression (NMS) on the fused RoIs to obtain the detection results.

  • VirConv-S: We also design a semi-supervised pipeline, VirConv-S, using the widely adopted pseudo-label method [33, 41]. Specifically, a model is first pre-trained on the labeled training data. Then, pseudo labels are generated on a larger-scale unannotated dataset using this pre-trained model, and a high score threshold (empirically, 0.9) is adopted to filter out low-quality labels. Finally, the VirConv-T model is trained using both the real and pseudo labels (a sketch of the filtering step follows this list).
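For VirConv-L, the early fusion described above is simple enough to sketch directly. The array layout and function name below are illustrative assumptions; following the text, β = 0 marks LiDAR points, β = 1 marks virtual points, and the virtual intensity is padded with zero.

```python
import numpy as np

def fuse_points(lidar_points, virtual_points):
    """Early fusion of LiDAR and virtual points into one cloud (sketch).

    lidar_points:   (N, 4) array [x, y, z, intensity]
    virtual_points: (M, 3) array [x, y, z]
    Returns an (N + M, 5) array [x, y, z, intensity, beta].
    """
    n, m = len(lidar_points), len(virtual_points)
    lidar = np.hstack([lidar_points, np.zeros((n, 1))])          # beta = 0 for LiDAR
    virtual = np.hstack([virtual_points,
                         np.zeros((m, 1)),                       # padded intensity = 0
                         np.ones((m, 1))])                       # beta = 1 for virtual
    return np.vstack([lidar, virtual])
```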
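For VirConv-T, the final fusion of refined RoIs can be pictured as a score-weighted box average. This is a simplified stand-in for the CasA-style boxes voting referenced above; matching boxes across stages and proper angle handling are omitted.

```python
import numpy as np

def vote_boxes(stage_boxes, stage_scores):
    """Fuse the refined versions of one RoI from several refinement stages
    by score-weighted averaging (simplified sketch of boxes voting).

    stage_boxes:  (S, 7) boxes [x, y, z, l, w, h, yaw], one per stage
    stage_scores: (S,) confidence scores
    """
    boxes = np.asarray(stage_boxes, dtype=float)
    scores = np.asarray(stage_scores, dtype=float)
    weights = scores / scores.sum()                 # normalize scores to weights
    fused_box = (weights[:, None] * boxes).sum(axis=0)
    return fused_box, float(scores.max())
```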
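For VirConv-S, the pseudo-label filtering step amounts to a score threshold; a minimal sketch, assuming predictions arrive as parallel lists of boxes and scores (names are illustrative):

```python
def filter_pseudo_labels(boxes, scores, score_thresh=0.9):
    """Keep only high-confidence predictions as pseudo labels (sketch)."""
    return [b for b, s in zip(boxes, scores) if s >= score_thresh]
```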

4. Experiments

image
