SECOND - Sparsely Embedded Convolutional Detection
In this paper, we present a novel approach called SECOND (Sparsely Embedded CONvolutional Detection), which addresses these challenges in 3D convolution-based detection by maximizing the use of the rich 3D information present in point cloud data. This method incorporates several improvements to the existing convolutional network architecture. Spatially sparse convolutional networks are introduced for LiDAR-based detection and are used to extract information from the z-axis before the 3D data are downsampled to something akin to 2D image data.
Another advantage of using point cloud data is that it is easy to scale, rotate and translate objects by applying direct transformations to the points belonging to those objects. SECOND incorporates a novel form of data augmentation based on this capability. A ground-truth database is generated that contains the attributes of objects and the associated point cloud data. Objects sampled from this database are then introduced into the point clouds during training. This approach can greatly increase the convergence speed and the final performance of our network.
In addition, we introduce a novel angle loss regression approach to address the large loss generated when the orientation difference between the ground truth and the prediction equals $π$, even though such a prediction yields a bounding box identical to the true bounding box. The performance of this angle regression approach surpasses that of current methods known to us, including the orientation vector regression function available in AVOD. We also introduce an auxiliary direction classifier to recognize the directions of objects.
1. SECOND Detector
1.1. Network Architecture
The proposed SECOND detector, depicted in Figure 1, consists of three components:
(1) A voxelwise feature extractor;
(2) A sparse convolutional middle layer;
(3) An RPN.
1.1.1. Point Cloud Grouping
Following VoxelNet, we obtain a voxel representation of the point cloud data. We first preallocate buffers based on the specified limit on the number of voxels; then, we iterate over the point cloud, assign the points to their associated voxels, and save the voxel coordinates and the number of points per voxel. We check the existence of each voxel via a hash table during this iterative process. If the voxel related to a point does not yet exist, we set the corresponding value in the hash table; otherwise, we increment the number of points in that voxel by one. The iterative process stops once the number of voxels reaches the specified limit. Finally, we obtain all voxels, their coordinates and the number of points per voxel for the actual number of voxels.
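As an illustration of this grouping procedure, the following is a minimal NumPy sketch; the voxel limit, points-per-voxel cap and range handling are placeholder values rather than the paper's exact settings:

```python
import numpy as np

def group_points_into_voxels(points, voxel_size, pc_range, max_voxels=20000, max_points=35):
    """Group a point cloud into voxels using preallocated buffers and a hash
    table (a Python dict here). Range checks are omitted for brevity."""
    voxel_size = np.asarray(voxel_size, dtype=points.dtype)
    grid_min = np.asarray(pc_range[:3], dtype=points.dtype)
    # Preallocate output buffers based on the specified voxel limit.
    voxels = np.zeros((max_voxels, max_points, points.shape[1]), dtype=points.dtype)
    coords = np.zeros((max_voxels, 3), dtype=np.int32)
    num_points = np.zeros(max_voxels, dtype=np.int32)
    coord_to_slot = {}  # hash table: voxel coordinate -> buffer slot

    for p in points:
        c = tuple(((p[:3] - grid_min) / voxel_size).astype(np.int32))
        slot = coord_to_slot.get(c)
        if slot is None:
            if len(coord_to_slot) >= max_voxels:
                continue  # voxel limit reached; skip points needing new voxels
            slot = len(coord_to_slot)
            coord_to_slot[c] = slot
            coords[slot] = c
        n = num_points[slot]
        if n < max_points:  # cap the number of points stored per voxel
            voxels[slot, n] = p
            num_points[slot] += 1

    k = len(coord_to_slot)  # actual number of voxels
    return voxels[:k], coords[:k], num_points[:k]
```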
1.1.2. Voxelwise Feature Extractor
We use a voxel feature encoding (VFE) layer to extract voxelwise features. A VFE layer takes all points in the same voxel as input and uses a fully connected network (FCN) consisting of a linear layer, a batch normalization (BatchNorm) layer and a rectified linear unit (ReLU) layer to extract pointwise features. Then, it uses elementwise max pooling to obtain the locally aggregated features for each voxel. Finally, it tiles the obtained features and concatenates these tiled features and the pointwise features together. We use $VFE(c_{out})$ to denote a VFE layer that transforms the input features into $c_{out}$-dimensional output features. Similarly, $FCN(c_{out})$ denotes a Linear-BatchNorm-ReLU layer that transforms the input features into $c_{out}$-dimensional output features. As a whole, the voxelwise feature extractor consists of several VFE layers and an FCN layer.
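A minimal PyTorch sketch of a single VFE layer follows; the masking of zero-padded point slots in partially filled voxels is omitted for brevity:

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One VFE layer: pointwise FCN (Linear-BatchNorm-ReLU), elementwise max
    pooling per voxel, then tiling and concatenation."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.linear = nn.Linear(c_in, c_out // 2)  # FCN(c_out/2), applied pointwise
        self.norm = nn.BatchNorm1d(c_out // 2)

    def forward(self, x):
        # x: (num_voxels, max_points, c_in)
        v, p, _ = x.shape
        pointwise = self.linear(x)
        pointwise = self.norm(pointwise.view(v * p, -1)).view(v, p, -1)
        pointwise = torch.relu(pointwise)
        aggregated, _ = pointwise.max(dim=1, keepdim=True)  # elementwise max pool
        tiled = aggregated.expand(-1, p, -1)                # tile over the points
        return torch.cat([pointwise, tiled], dim=2)         # (v, p, c_out)
```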
1.1.3. Sparse Convolutional Middle Extractor
Sparse Convolution Algorithm
Rule Generation Algorithm
Sparse Convolutional Middle Extractor
Our middle extractor is used to learn information about the z-axis and convert the sparse 3D data into a 2D BEV image. Figure 3 shows the structure of the middle extractor. It consists of two phases of sparse convolution. Each phase contains several submanifold convolutional layers and one normal sparse convolution to perform downsampling in the z-axis. After the z-dimensionality has been downsampled to one or two, the sparse data are converted into dense feature maps. Then, the data are simply reshaped into image-like 2D data.
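The structure described above can be sketched as follows, assuming the spconv library (the v2.x import path is used here); channel counts, layer counts and strides are illustrative rather than the paper's exact configuration:

```python
import torch.nn as nn
import spconv.pytorch as spconv  # spconv 2.x import path (v1.x uses `import spconv`)

class SparseMiddleExtractor(nn.Module):
    """Two phases of sparse convolution: submanifold convs that keep the
    sparsity pattern fixed, each phase ending in a normal sparse convolution
    with stride 2 along the z-axis."""

    def __init__(self, c_in=64):
        super().__init__()
        self.net = spconv.SparseSequential(
            # Phase 1: submanifold convs leave the set of active sites unchanged.
            spconv.SubMConv3d(c_in, 64, 3, padding=1, indice_key="subm1"),
            nn.BatchNorm1d(64), nn.ReLU(),
            spconv.SubMConv3d(64, 64, 3, padding=1, indice_key="subm1"),
            nn.BatchNorm1d(64), nn.ReLU(),
            # Normal sparse conv, stride 2 along z (the first spatial dim).
            spconv.SparseConv3d(64, 64, 3, stride=(2, 1, 1), padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            # Phase 2: same pattern at the reduced z-resolution.
            spconv.SubMConv3d(64, 64, 3, padding=1, indice_key="subm2"),
            nn.BatchNorm1d(64), nn.ReLU(),
            spconv.SparseConv3d(64, 64, 3, stride=(2, 1, 1), padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
        )

    def forward(self, voxel_features, coords, batch_size, spatial_shape):
        # coords: (N, 4) int32 tensor of (batch, z, y, x) voxel indices.
        x = spconv.SparseConvTensor(voxel_features, coords, spatial_shape, batch_size)
        x = self.net(x)
        dense = x.dense()                  # (N, C, D, H, W); D is the reduced z-dim
        n, c, d, h, w = dense.shape
        return dense.view(n, c * d, h, w)  # reshape into an image-like 2D BEV map
```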
1.1.4. Region Proposal Network
RPNs have recently begun to be used in many detection frameworks. In this work, we use a single shot multibox detector (SSD)-like architecture to construct an RPN architecture. The input to the RPN consists of the feature maps from the sparse convolutional middle extractor. The RPN architecture is composed of three stages. Each stage starts with a convolutional layer that downsamples the feature map, which is followed by several convolutional layers. After each convolutional layer, BatchNorm and ReLU layers are applied. We then upsample the output of each stage to a feature map of the same size and concatenate these feature maps into one feature map. Finally, three 1 × 1 convolutions are applied for the prediction of class, regression offsets and direction.
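A PyTorch sketch of such an SSD-like RPN is given below; channel counts, the number of layers per stage, and the anchor and class numbers are placeholder values:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

class RPN(nn.Module):
    """Three downsampling stages; each stage's output is upsampled to a common
    resolution, concatenated, and fed to three 1x1 prediction heads."""

    def __init__(self, c_in=128, num_anchors=2, num_classes=1, box_dim=7):
        super().__init__()
        self.stages = nn.ModuleList()
        self.ups = nn.ModuleList()
        c = c_in
        for i in range(3):
            layers = [conv_block(c, 128, stride=2)]             # downsampling layer
            layers += [conv_block(128, 128) for _ in range(3)]  # several conv layers
            self.stages.append(nn.Sequential(*layers))
            # Upsample stage i (at stride 2**(i+1)) back to the stride-2 resolution.
            self.ups.append(nn.Sequential(
                nn.ConvTranspose2d(128, 128, 2 ** i, stride=2 ** i, bias=False),
                nn.BatchNorm2d(128), nn.ReLU()))
            c = 128
        self.cls_head = nn.Conv2d(3 * 128, num_anchors * num_classes, 1)
        self.box_head = nn.Conv2d(3 * 128, num_anchors * box_dim, 1)
        self.dir_head = nn.Conv2d(3 * 128, num_anchors * 2, 1)

    def forward(self, x):
        feats = []
        for stage, up in zip(self.stages, self.ups):
            x = stage(x)
            feats.append(up(x))
        fused = torch.cat(feats, dim=1)  # concatenated multi-scale feature map
        return self.cls_head(fused), self.box_head(fused), self.dir_head(fused)
```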
1.2. Training and Inference
1.2.1. Loss Function
Sine-Error Loss for Angle Regression
To solve this problem, we introduce a new angle loss regression:
$L_{\theta} = \mathrm{SmoothL1}(\sin(\theta_p - \theta_t))$
where the subscript $p$ indicates the predicted value and the subscript $t$ indicates the ground-truth value. This approach to angle loss has two advantages: (1) it solves the adversarial example problem between orientations of 0 and $π$, and (2) it naturally models the IoU against the angle offset function. To address the issue that this loss treats boxes with opposite directions as being the same, we add a simple direction classifier to the output of the RPN. This direction classifier uses a softmax loss function. We generate the direction classifier target as follows: if the yaw rotation around the z-axis of the ground truth is greater than zero, the result is positive; otherwise, it is negative.
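A minimal PyTorch sketch of the sine-error loss and the direction-classifier target described above; the SmoothL1 form and its threshold beta follow the standard definition:

```python
import torch

def sine_error_loss(theta_p, theta_t, beta=1.0):
    """SmoothL1 applied to sin(theta_p - theta_t)."""
    x = torch.sin(theta_p - theta_t)
    abs_x = x.abs()
    return torch.where(abs_x < beta, 0.5 * x ** 2 / beta, abs_x - 0.5 * beta)

def direction_target(theta_t):
    """Positive (1) if the ground-truth yaw around the z-axis exceeds zero."""
    return (theta_t > 0).long()
```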
Focal Loss for Classification
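The classification branch uses focal loss to handle the extreme foreground/background imbalance among anchors. A minimal sketch of the standard binary-form focal loss, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$; the values $\alpha = 0.25$ and $\gamma = 2$ are the common defaults:

```python
import torch

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss; `target` holds 0/1 anchor labels as floats."""
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).mean()
```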
Total Training Loss
By combining the losses discussed above, we obtain the final form of the multitask loss as follows:
$L_{total} = \beta_1 L_{cls} + \beta_2 (L_{reg\text{-}\theta} + L_{reg\text{-}other}) + \beta_3 L_{dir}$
where $L_{cls}$ is the classification loss, $L_{reg\text{-}other}$ is the regression loss for the location and dimensions, $L_{reg\text{-}\theta}$ is the angle loss defined above, and $L_{dir}$ is the direction classification loss, with $\beta_1 = 1.0$, $\beta_2 = 2.0$ and $\beta_3 = 0.2$.
1.2.2. Data Augmentation
Sample Ground Truths from the Database
The major problem we encountered during training was the existence of too few ground truths, which significantly limited the convergence speed and final performance of the network. To solve this problem, we introduced a data augmentation approach. First, we generated a database containing the labels of all ground truths and their associated point cloud data (points inside the 3D bounding boxes of the ground truths) from the training dataset. Then, during training, we randomly selected several ground truths from this database and introduced them into the current training point cloud via concatenation. Using this approach, we could greatly increase the number of ground truths per point cloud and simulate objects existing in different environments. To avoid physically impossible outcomes, we performed a collision test after sampling the ground truths and removed any sampled objects that collided with other objects.
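A simplified sketch of the sampling-plus-collision-test procedure follows; the helper bev_overlap is hypothetical and tests axis-aligned bird's-eye-view boxes, ignoring yaw, unlike a full collision test:

```python
import numpy as np

def bev_overlap(a, b):
    """Axis-aligned BEV overlap test for (x, y, z, w, l, h, yaw) boxes."""
    return (abs(a[0] - b[0]) * 2 < a[3] + b[3]) and (abs(a[1] - b[1]) * 2 < a[4] + b[4])

def sample_ground_truths(points, gt_boxes, db_boxes, db_points, num_samples=15):
    """Randomly draw objects from the ground-truth database, reject those that
    collide with existing boxes, and concatenate the survivors' points into
    the current training point cloud."""
    choice = np.random.choice(len(db_boxes), num_samples, replace=False)
    for i in choice:
        box = db_boxes[i]
        if any(bev_overlap(box, b) for b in gt_boxes):
            continue  # collision: drop this sampled object
        gt_boxes = np.concatenate([gt_boxes, box[None]], axis=0)
        points = np.concatenate([points, db_points[i]], axis=0)
    return points, gt_boxes
```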
Object Noise
To consider noise, we followed the same approach used in VoxelNet, in which each ground truth and its point cloud are independently and randomly transformed, instead of transforming all point clouds with the same parameters. Specifically, we used random rotations sampled from a uniform distribution $∆θ ∈ [−π/2, π/2]$ and random linear transformations sampled from a Gaussian distribution with a mean of zero and a standard deviation of 1.0.
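A minimal sketch of this per-object noise, assuming boxes stored as (x, y, z, w, l, h, yaw); the collision handling that would normally accompany it is omitted:

```python
import numpy as np

def object_noise(points_in_box, box, rot_range=np.pi / 2, loc_std=1.0):
    """Independently perturb one ground truth and its points: a rotation drawn
    from U[-pi/2, pi/2] about the box center and a Gaussian translation with
    mean 0 and standard deviation 1.0."""
    angle = np.random.uniform(-rot_range, rot_range)
    shift = np.random.normal(0.0, loc_std, size=3)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    center = box[:3].copy()
    # Rotate the object's points about its own center in the x-y plane.
    points_in_box[:, :2] = (points_in_box[:, :2] - center[:2]) @ rot.T + center[:2]
    points_in_box[:, :3] += shift
    box[:3] += shift
    box[6] += angle  # update the yaw angle
    return points_in_box, box
```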
Global Rotation and Scaling
We applied global scaling and rotation to the whole point cloud and to all ground-truth boxes. The scaling noise was drawn from the uniform distribution [0.95, 1.05], and $[−π/4, π/4]$ was used for the global rotation noise.
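The corresponding global augmentation can be sketched as follows, under the same (x, y, z, w, l, h, yaw) box convention:

```python
import numpy as np

def global_rotation_scaling(points, boxes):
    """Global scaling drawn from U[0.95, 1.05] and global rotation drawn from
    U[-pi/4, pi/4], applied to the whole point cloud and all boxes."""
    scale = np.random.uniform(0.95, 1.05)
    angle = np.random.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :3] *= scale
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :3] *= scale        # scale box centers
    boxes[:, 3:6] *= scale       # scale box dimensions
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle         # rotate box yaw angles
    return points, boxes
```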
1.2.3. Network Details
Note
• We apply sparse convolution in LiDAR-based object detection, thereby greatly increasing the speeds of training and inference.
• We propose an improved method of sparse convolution that allows it to run faster.
• We propose a novel angle loss regression approach that demonstrates better orientation regression performance than that of other methods.
• We introduce a novel data augmentation method for LiDAR-only learning problems that greatly increases the convergence speed and performance.