Multi-View 3D Object Detection Network for Autonomous Driving
1. Overview
In this paper, the authors propose a Multi-View 3D object detection network (MV3D) that takes multimodal data as input and predicts the full 3D extent of objects in 3D space. The main idea for utilizing multimodal information is to perform region-based feature fusion. We first propose a multi-view encoding scheme to obtain a compact and effective representation for the sparse 3D point cloud. As illustrated in Fig. 1, the multi-view 3D detection network consists of two parts: a 3D Proposal Network and a Region-based Fusion Network. The 3D proposal network utilizes a bird’s eye view representation of the point cloud to generate highly accurate 3D candidate boxes. The benefit of 3D object proposals is that they can be projected to any view in 3D space. The multi-view fusion network extracts region-wise features by projecting the 3D proposals to the feature maps from multiple views. We design a deep fusion approach to enable interactions of intermediate layers from different views. Combined with drop-path training and auxiliary losses, our approach shows superior performance over the early/late fusion schemes. Given the multi-view feature representation, the network performs oriented 3D box regression, which predicts the accurate 3D location, size and orientation of objects in 3D space.
- 3D Object Detection in Point Cloud: encode the 3D point cloud with multi-view feature maps, enabling a region-based representation for multimodal fusion.
- 3D Object Detection in Images: incorporate the LIDAR point cloud to improve 3D localization.
- Multimodal Fusion: use the same base network for each column and add auxiliary paths and losses for regularization.
- 3D Object Proposals: introduce a bird’s eye view representation of the point cloud and employ 2D convolutions to generate accurate 3D proposals.
2. MV3D Network
The MV3D network takes a multi-view representation of 3D point cloud and an image as input. It first generates 3D object proposals from the bird’s eye view map and deeply fuses multi-view features via region-based representation. The fused features are used for category classification and oriented 3D box regression.
2.1. 3D Point Cloud Representation
The 3D point cloud is projected to both the bird’s eye view and the front view.
Bird’s Eye View Representation: The bird’s eye view representation is encoded by height, intensity and density. We discretize the projected point cloud into a 2D grid with a resolution of 0.1 m. For each cell, the height feature is computed as the maximum height of the points in the cell.
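A minimal NumPy sketch of this encoding (the crop range, the single height channel, the reflectance-of-highest-point choice, and the density normalization are assumptions for illustration; the paper additionally slices the cloud into several height maps):

```python
import numpy as np

def bev_encode(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    """Encode a LIDAR point cloud (N x 4 array of x, y, z, reflectance) into
    bird's-eye-view height / intensity / density maps.
    Sketch only: a single height channel is used (the paper slices the cloud
    into several height maps), and the crop range and density normalization
    are assumptions.
    """
    x, y, z, r = points.T
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z, r = x[keep], y[keep], z[keep], r[keep]

    # discretize into a 2D grid with 0.1 m resolution
    rows = ((x - x_range[0]) / res).astype(np.int32)
    cols = ((y - y_range[0]) / res).astype(np.int32)
    H = int(round((x_range[1] - x_range[0]) / res))
    W = int(round((y_range[1] - y_range[0]) / res))

    height = np.full((H, W), -np.inf, dtype=np.float32)
    intensity = np.zeros((H, W), dtype=np.float32)
    density = np.zeros((H, W), dtype=np.float32)
    for i in range(len(rows)):
        u, v = rows[i], cols[i]
        density[u, v] += 1.0
        if z[i] > height[u, v]:          # height = maximum point height in the cell
            height[u, v] = z[i]
            intensity[u, v] = r[i]       # reflectance of the highest point (assumed)
    height[np.isinf(height)] = 0.0
    density = np.minimum(1.0, np.log(density + 1.0) / np.log(64.0))  # assumed normalization
    return np.stack([height, intensity, density], axis=-1)
```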
Front View Representation: The front view representation provides complementary information to the bird’s eye view representation. As the LIDAR point cloud is very sparse, projecting it into the image plane results in a sparse 2D point map. Instead, we project it to a cylinder plane to generate a dense front view map. The authors encode the front view map with three-channel features: height, distance and intensity.
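A sketch of such a cylindrical projection (the angular resolutions v_res and h_res are assumed values chosen for illustration; the text only specifies the resulting 64×512 map size for a 64-beam Velodyne):

```python
import numpy as np

def front_view_encode(points, v_res=np.radians(0.4), h_res=np.radians(0.35)):
    """Project LIDAR points onto a cylinder plane to build a dense front-view map
    with height / distance / intensity channels.
    v_res and h_res are assumed angular resolutions for illustration.
    """
    x, y, z, r = points.T
    dist = np.sqrt(x ** 2 + y ** 2)                                  # range in the ground plane
    cols = np.floor(np.arctan2(y, x) / h_res).astype(np.int32)       # azimuth bin
    rows = np.floor(np.arctan2(z, dist) / v_res).astype(np.int32)    # elevation bin
    cols -= cols.min()
    rows -= rows.min()

    fv = np.zeros((rows.max() + 1, cols.max() + 1, 3), dtype=np.float32)
    # if several points fall into one cell, the last written point wins (sketch-level choice)
    fv[rows, cols, 0] = z        # height
    fv[rows, cols, 1] = dist     # distance
    fv[rows, cols, 2] = r        # intensity
    return fv
```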
2.2. 3D Proposal Network
We use the bird’s eye view map as input. In 3D object detection, the bird’s eye view map has several advantages over the front view/image plane:
• Objects preserve their physical sizes when projected to the bird’s eye view, so they have small size variance, which is not the case in the front view/image plane.
• Objects in the bird’s eye view occupy different space, thus avoiding the occlusion problem.
• In road scenes, since objects typically lie on the ground plane and have small variance in vertical location, the bird’s eye view location is more crucial for obtaining accurate 3D bounding boxes.
=> Using an explicit bird’s eye view map as input makes 3D location prediction more feasible.
Given a bird’s eye view map, the network generates 3D box proposals from a set of 3D prior boxes.
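The text does not detail the proposal parameterization; as a hedged illustration only, the sketch below assumes a standard anchor-offset decoding in which the network regresses normalized center offsets and log size ratios relative to each 3D prior box:

```python
import numpy as np

def decode_proposals(anchors, deltas):
    """Decode regression outputs into 3D box proposals from 3D prior boxes.

    anchors: (N, 6) prior boxes (x, y, z, l, w, h) placed on the bird's-eye-view grid.
    deltas:  (N, 6) network outputs. The parameterization below (center offsets
    normalized by the anchor size, log size ratios) is an assumption made for
    illustration; the text only says proposals are generated from 3D prior boxes.
    """
    x, y, z, l, w, h = anchors.T
    dx, dy, dz, dl, dw, dh = deltas.T
    centers = np.stack([x + dx * l, y + dy * w, z + dz * h], axis=1)
    sizes = np.stack([l * np.exp(dl), w * np.exp(dw), h * np.exp(dh)], axis=1)
    return np.concatenate([centers, sizes], axis=1)   # (N, 6) decoded boxes
```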
2.3. Region-based Fusion Network
We design a region-based fusion network to effectively combine features from multiple views and to jointly classify object proposals and perform oriented 3D box regression.
Multi-View ROI Pooling: Since features from different views/modalities usually have different resolutions, we employ ROI pooling for each view to obtain feature vectors of the same length. Given the generated 3D proposals, we can project them to any views in the 3D space. In our case, we project them to three views, i.e., bird’s eye view (BV), front view (FV), and the image plane (RGB).
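A simplified NumPy sketch of this step (the projection functions, the 7×7 output size, and max pooling are assumptions for illustration; they stand in for the actual ROI pooling layer and the calibration-based projections):

```python
import numpy as np

def roi_pool(feature_map, box2d, out_size=7):
    """Max-pool an axis-aligned 2D ROI of a (H, W, C) feature map down to a fixed
    out_size x out_size grid, so every view yields an equal-length feature vector.
    A simplified stand-in for the ROI pooling layer; assumes the box overlaps the map.
    """
    H, W, C = feature_map.shape
    x1, y1, x2, y2 = [int(round(v)) for v in box2d]
    x1, x2 = max(0, x1), min(W - 1, x2)
    y1, y2 = max(0, y1), min(H - 1, y2)
    crop = feature_map[y1:y2 + 1, x1:x2 + 1]
    ys = np.linspace(0, crop.shape[0], out_size + 1).astype(int)
    xs = np.linspace(0, crop.shape[1], out_size + 1).astype(int)
    pooled = np.zeros((out_size, out_size, C), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = crop[ys[i]:max(ys[i + 1], ys[i] + 1),
                        xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.reshape(-1, C).max(axis=0)
    return pooled

def multi_view_roi_features(proposal3d, views, out_size=7):
    """views: {name: (feature_map, project_fn)} where project_fn is a hypothetical
    calibration-based function mapping the 8 corners of a 3D box to 2D pixel
    coordinates in that view (BV, FV or RGB)."""
    feats = {}
    for name, (fmap, project) in views.items():
        corners2d = project(proposal3d)            # (8, 2) projected corners
        x1, y1 = corners2d.min(axis=0)
        x2, y2 = corners2d.max(axis=0)             # tight 2D box around the projection
        feats[name] = roi_pool(fmap, (x1, y1, x2, y2), out_size).reshape(-1)
    return feats
```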
Deep Fusion: To combine information from different features, prior work usually uses early fusion or late fusion. We instead employ a deep fusion approach, which fuses multi-view features hierarchically. A comparison of the architectures of our deep fusion network and the early/late fusion networks is shown in the figure below.
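The difference between the three schemes can be sketched as follows (element-wise mean is assumed as the join operation; each `layer`/`tower` entry is a placeholder callable standing in for a network layer):

```python
import numpy as np

def early_fusion(view_feats, tower):
    """Join the per-view ROI features once at the input, then run a single tower."""
    x = np.mean(view_feats, axis=0)
    for layer in tower:
        x = layer(x)
    return x

def late_fusion(view_feats, towers):
    """Run one separate tower per view and join only the final outputs."""
    outs = []
    for feat, tower in zip(view_feats, towers):
        for layer in tower:
            feat = layer(feat)
        outs.append(feat)
    return np.mean(outs, axis=0)

def deep_fusion(view_feats, towers):
    """Hierarchical fusion: the per-view outputs are joined after every layer,
    so intermediate features from different views interact.
    towers[v][l] is the l-th layer of view v's branch (a placeholder callable)."""
    x = np.mean(view_feats, axis=0)                          # initial join of ROI features
    for l in range(len(towers[0])):
        x = np.mean([towers[v][l](x) for v in range(len(towers))], axis=0)   # join node
    return x
```

The repeated join nodes of the deep variant are also what the drop-path regularization described below operates on.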
Oriented 3D Box Regression: Given the fusion features of the multi-view network, we regress oriented 3D boxes from the 3D proposals. In particular, the regression targets are the 8 corners of the 3D boxes. They are encoded as the corner offsets normalized by the diagonal length of the proposal box. Although such a 24-D vector representation is redundant for representing an oriented 3D box, we found that this encoding approach works better than the centers-and-sizes encoding approach. In our model, the object orientations can be computed from the predicted 3D box corners. We use a multi-task loss to jointly predict object categories and oriented 3D boxes. As in the proposal network, the category loss uses cross-entropy and the 3D box loss uses smooth $l_1$. During training, the positive/negative ROIs are determined based on the IoU overlap of the bird’s eye view boxes. A 3D proposal is considered positive if its bird’s eye view IoU overlap is above 0.5, and negative otherwise. During inference, we apply NMS on the 3D boxes after 3D bounding box regression. We project the 3D boxes to the bird’s eye view to compute their IoU overlap and use an IoU threshold of 0.05 to remove redundant boxes, which ensures that objects cannot occupy the same space in the bird’s eye view.
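A sketch of the corner-offset encoding and its inverse (the diagonal is computed here from the axis-aligned extent of the proposal corners, an assumption that is exact for axis-aligned proposals):

```python
import numpy as np

def encode_corner_offsets(proposal_corners, gt_corners):
    """Regression targets: offsets of the 8 box corners (a 24-D vector),
    normalized by the diagonal length of the proposal box.
    proposal_corners, gt_corners: (8, 3) arrays of 3D corner coordinates."""
    diag = np.linalg.norm(proposal_corners.max(axis=0) - proposal_corners.min(axis=0))
    return ((gt_corners - proposal_corners) / diag).reshape(-1)      # (24,)

def decode_corner_offsets(proposal_corners, offsets):
    """Invert the encoding to recover the predicted oriented 3D box corners,
    from which the object orientation can then be computed."""
    diag = np.linalg.norm(proposal_corners.max(axis=0) - proposal_corners.min(axis=0))
    return proposal_corners + offsets.reshape(8, 3) * diag
```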
Network Regularization: We employ two approaches to regularize the region-based fusion network: drop-path training and auxiliary losses. For each iteration, we randomly choose to do global drop-path or local drop-path with a probability of 50%. If global drop-path is chosen, we select a single view from the three views with equal probability. If local drop-path is chosen, the paths input to each join node are randomly dropped with 50% probability, while ensuring that at least one input path is kept for each join node. To further strengthen the representation capability of each view, we add auxiliary paths and losses to the network. As shown below, the auxiliary paths have the same number of layers as the main network. Each layer in the auxiliary paths shares weights with the corresponding layer in the main network. We use the same multi-task loss, i.e. classification loss plus 3D box regression loss, to back-propagate through each auxiliary path. We weight all the losses, including the auxiliary losses, equally. The auxiliary paths are removed during inference.
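A sketch of how the per-iteration drop-path decision can be sampled (the number of join nodes is an assumed illustration parameter; the probabilities follow the 50/50 scheme described above):

```python
import random

def sample_drop_paths(n_views=3, n_join_nodes=4, p_global=0.5, p_drop=0.5):
    """Sample which input paths are kept at each join node for one training iteration.
    Returns keep[join][view] booleans. n_join_nodes is an assumed parameter.
    """
    if random.random() < p_global:
        # global drop-path: a single view, chosen with equal probability, feeds every join node
        view = random.randrange(n_views)
        return [[v == view for v in range(n_views)] for _ in range(n_join_nodes)]
    keep = []
    for _ in range(n_join_nodes):
        # local drop-path: each input path is dropped with 50% probability ...
        mask = [random.random() >= p_drop for _ in range(n_views)]
        if not any(mask):
            # ... but at least one input path per join node is always kept
            mask[random.randrange(n_views)] = True
        keep.append(mask)
    return keep
```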
2.4. Implementation Details
- Network Architecture: In our multi-view network, each view has the same architecture. The base network is built on the 16-layer VGG net with the following modifications:
• Channels are reduced to half of the original network.
• To handle extra-small objects, we use feature approximation to obtain a high-resolution feature map. In particular, we insert a 2x bilinear upsampling layer before feeding the last convolution feature map to the 3D Proposal Network. Similarly, we insert a 4x/4x/2x upsampling layer before the ROI pooling layer for the BV/FV/RGB branch.
• We remove the 4th pooling operation in the original VGG network, so the convolutional part of our network performs 8x downsampling.
• In the multi-view fusion network, we add an extra fully connected layer fc8 in addition to the original fc6 and fc7 layers.
We initialize the parameters by sampling weights from the VGG-16 network pretrained on ImageNet. Although our network has three branches, the number of parameters is about 75% of that of the VGG-16 network. The inference time of the network for one image is around 0.36 s on a GeForce Titan X GPU.
Input Representation: In the case of KITTI, which provides annotations only for objects in the front view (around a 90° field of view), we use the point cloud in the range of [0, 70.4] × [-40, 40] meters. We also remove points that are out of the image boundaries when projected to the image plane. For the bird’s eye view, the discretization resolution is set to 0.1 m, so the bird’s eye view input has a size of 704×800. Since KITTI uses a 64-beam Velodyne laser scanner, we can obtain a 64×512 map for the front view points. The RGB image is up-scaled so that its shortest side is 500 pixels.
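The bird’s-eye-view input size follows directly from the crop range and the 0.1 m resolution; a quick check:

```python
# Quick check of the input size stated above.
x_range, y_range, res = (0.0, 70.4), (-40.0, 40.0), 0.1   # meters
bev_length = int(round((x_range[1] - x_range[0]) / res))   # 704
bev_width = int(round((y_range[1] - y_range[0]) / res))    # 800
print(bev_length, bev_width)                               # -> 704 800
```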
Training: The network is trained in an end-to-end fashion. For each mini-batch we use 1 image and sample 128 ROIs, roughly keeping 25% of the ROIs as positive. We train the network using SGD with a learning rate of 0.001 for 100K iterations. Then we reduce the learning rate to 0.0001 and train another 20K iterations.
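A sketch of the mini-batch ROI sampling described above (using the 0.5 bird’s-eye-view IoU threshold from Section 2.3; the permutation-based sampling is an implementation choice for illustration):

```python
import numpy as np

def sample_rois(bev_ious, batch_size=128, pos_fraction=0.25, pos_iou=0.5):
    """Sample ROI indices for one mini-batch: roughly 25% positives, where a
    proposal counts as positive if its bird's-eye-view IoU with the ground truth
    is above 0.5, and negative otherwise.

    bev_ious: (N,) max bird's-eye-view IoU of each proposal with the ground truth.
    """
    pos = np.flatnonzero(bev_ious > pos_iou)
    neg = np.flatnonzero(bev_ious <= pos_iou)
    n_pos = min(int(batch_size * pos_fraction), len(pos))
    n_neg = min(batch_size - n_pos, len(neg))
    # permutation-based sampling avoids edge cases when either set is small or empty
    pos = pos[np.random.permutation(len(pos))[:n_pos]]
    neg = neg[np.random.permutation(len(neg))[:n_neg]]
    return np.concatenate([pos, neg])
```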
3. Experiments
- 3D Proposal Recall:
- 3D Localization:
- 3D Object Detection:
- Ablation Study:
- 2D Object Detection:
4. Code and Test
System:
- Ubuntu 22.04
- i7-12800H
- RTX 3060 6GB
- 40GB RAM
4.1. Dataset preparation
Use the raw data provided by KITTI (download it from the KITTI website).
- Create an account and download the raw data
We use [synced + rectified data] + [calibration] + [tracklets]
Download the data and unzip it
4.2. Install
- Create conda environment
```bash
conda env create -f environment.yml
```
- Check the GPU and TensorFlow versions
- Build
```bash
source activate mv3d
sudo chmod 755 ./make.sh
./make.sh
```
- Note
If you encounter an error like this:
```
error: token ""__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead."" is not valid in preprocessor expressions
74 | #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead."
```
Replace the deprecated __CUDACC_VER__ macro with __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ in the offending header, as the message suggests, then rebuild.
4.3. Data Preprocessing
- After running src/data.py, we get the required inputs for the MV3D net. They are saved under kitti.