VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
1. PointNet problem
High computational and memory requirements => hard to scale 3D feature learning networks to orders of magnitude more points and to 3D detection tasks.
=> VoxelNet: a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network.
We design a novel voxel feature encoding (VFE) layer, which enables inter-point interaction within a voxel, by combining point-wise features with a locally aggregated feature.
Stacking multiple VFE layers allows learning complex features for characterizing local 3D shape information.
Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, encodes each voxel via stacked VFE layers, and then 3D convolution further aggregates local voxel features, transforming the point cloud into a high-dimensional volumetric representation.
Finally, an RPN consumes the volumetric representation and yields the detection result. This efficient algorithm benefits from both the sparse point structure and efficient parallel processing on the voxel grid.
2. VoxelNet
2.1. VoxelNet Architecture
2.1.1. Feature Learning Network
Voxel Partition: Given a point cloud, we subdivide the 3D space into equally spaced voxels as shown in Figure 2. Suppose the point cloud encompasses a 3D space with range $D$, $H$, $W$ along the $Z$, $Y$, $X$ axes respectively, and each voxel has size $v_D$, $v_H$, $v_W$ accordingly. The resulting 3D voxel grid is of size $D' = D/v_D$, $H' = H/v_H$, $W' = W/v_W$. Here, for simplicity, we assume $D$, $H$, $W$ are multiples of $v_D$, $v_H$, $v_W$.
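For example, with the Car setting used in the code below ($D = 4$ m, $H = 80$ m, $W = 70.4$ m and voxel size $v_D = 0.4$ m, $v_H = v_W = 0.2$ m), the grid is $D' = 4/0.4 = 10$, $H' = 80/0.2 = 400$, $W' = 70.4/0.2 = 352$.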
Grouping: We group the points according to the voxel they reside in. Due to factors such as distance, occlusion, object’s relative pose, and non-uniform sampling, the LiDAR point cloud is sparse and has highly variable point density throughout the space. Therefore, after grouping, a voxel will contain a variable number of points. An illustration is shown in Figure 2, where Voxel-1 has significantly more points than Voxel-2 and Voxel-4, while Voxel-3 contains no point.
Random Sampling: Typically a high-definition LiDAR point cloud is composed of ∼100k points. Directly processing all the points not only imposes a heavy memory and compute burden on the computing platform, but the highly variable point density throughout the space can also bias the detection. To this end, we randomly sample a fixed number, $T$, of points from those voxels containing more than $T$ points. This sampling strategy has two purposes: (1) computational savings; and (2) it decreases the imbalance of points between the voxels, which reduces the sampling bias and adds more variation to training. A minimal NumPy sketch of this per-voxel sampling is given below.
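The following sketch is illustrative only (the preprocessing code later on achieves the same effect by shuffling the whole cloud once and truncating each voxel buffer at $T$ points); the function name is hypothetical:

```python
import numpy as np

def sample_voxel_points(points, T=35):
    """Keep at most T randomly chosen points from one voxel (illustrative sketch)."""
    if len(points) <= T:
        return points
    keep = np.random.choice(len(points), size=T, replace=False)
    return points[keep]
```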
Stacked Voxel Feature Encoding: The key innovation is the chain of VFE layers. For simplicity, Figure 2 illustrates the hierarchical feature encoding process for one voxel. Without loss of generality, we use VFE Layer-1 to describe the details in the following paragraph. Figure 3 shows the architecture for VFE Layer-1.
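Concretely, each point $p_i = [x_i, y_i, z_i, r_i]$ in a voxel (with $r_i$ the received reflectance) is first augmented with its offset from the voxel centroid $(v_x, v_y, v_z)$, giving the 7-dimensional input $\hat{p}_i = [x_i, y_i, z_i, r_i, x_i - v_x, y_i - v_y, z_i - v_z]$; this is the "7" that appears throughout the implementation below. Each $\hat{p}_i$ is passed through a shared fully connected network (linear layer, BN, ReLU), the resulting point-wise features are max-pooled element-wise into a locally aggregated feature for the voxel, and this aggregated feature is concatenated back onto every point-wise feature to form the VFE output.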
Sparse Tensor Representation: By processing only the non-empty voxels, we obtain a list of voxel features, each uniquely associated with the spatial coordinates of a particular non-empty voxel. The obtained list of voxel-wise features can be represented as a sparse 4D tensor of size $C × D' × H' × W'$ as shown in Figure 2. Although the point cloud contains ∼100k points, more than 90% of voxels typically are empty. Representing non-empty voxel features as a sparse tensor greatly reduces the memory usage and computation cost during backpropagation, and it is a critical step in our efficient implementation.
2.1.2. Convolutional Middle Layers
We use $ConvMD(c_{in}, c_{out}, k, s, p)$ to represent an M-dimensional convolution operator, where $c_{in}$ and $c_{out}$ are the number of input and output channels, and $k$, $s$, and $p$ are the M-dimensional vectors corresponding to kernel size, stride size and padding size respectively. When the size across the M dimensions is the same, we use a scalar to represent it, e.g. $k$ for $\textbf{k} = (k, k, k)$.
Each convolutional middle layer applies 3D convolution, BN layer, and ReLU layer sequentially. The convolutional middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description.
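In the Car configuration used in the code below, the middle layers are Conv3D(128, 64, 3, (2, 1, 1), (1, 1, 1)), Conv3D(64, 64, 3, (1, 1, 1), (0, 1, 1)) and Conv3D(64, 64, 3, (2, 1, 1), (1, 1, 1)); they shrink the depth dimension from 10 to 2 while keeping the 400 × 352 bird's-eye-view resolution, and the remaining 64 × 2 = 128 channels are then reshaped into a 2D feature map for the RPN.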
2.1.3. Region Proposal Network
The input to our RPN is the feature map provided by the convolutional middle layers. The architecture of this network is illustrated in Figure 4. The network has three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with a stride of 2, followed by a sequence of convolutions of stride 1 (×q means q applications of the filter). After each convolution layer, BN and ReLU operations are applied. We then upsample the output of every block to a fixed size and concatenate the results to construct the high-resolution feature map. Finally, this feature map is mapped to the desired learning targets:
(1) a probability score map and
(2) a regression map.
2.2. Loss Function
Let $\{a^{pos}_i\}_{i=1 \ldots N_{pos}}$ be the set of $N_{pos}$ positive anchors and $\{a^{neg}_j\}_{j=1 \ldots N_{neg}}$ be the set of $N_{neg}$ negative anchors. We parameterize a 3D ground truth box as $(x^g_c, y^g_c, z^g_c, l^g, w^g, h^g, \theta^g)$, where $x^g_c$, $y^g_c$, $z^g_c$ represent the center location, $l^g$, $w^g$, $h^g$ are the length, width, height of the box, and $\theta^g$ is the yaw rotation around the Z-axis. To retrieve the ground truth box from a matching positive anchor parameterized as $(x^a_c, y^a_c, z^a_c, l^a, w^a, h^a, \theta^a)$, we define the residual vector $u^* \in \mathbb{R}^7$ containing the 7 regression targets corresponding to center location $\Delta x$, $\Delta y$, $\Delta z$, three dimensions $\Delta l$, $\Delta w$, $\Delta h$, and the rotation $\Delta\theta$, which are computed as:

$$\Delta x = \frac{x^g_c - x^a_c}{d^a},\quad \Delta y = \frac{y^g_c - y^a_c}{d^a},\quad \Delta z = \frac{z^g_c - z^a_c}{h^a},$$
$$\Delta l = \log\frac{l^g}{l^a},\quad \Delta w = \log\frac{w^g}{w^a},\quad \Delta h = \log\frac{h^g}{h^a},\quad \Delta\theta = \theta^g - \theta^a,$$
where $d^a = \sqrt{(l^a)^2 + (w^a)^2}$ is the diagonal of the base of the anchor box. Here, we aim to directly estimate the oriented 3D box and normalize $\Delta x$ and $\Delta y$ homogeneously with the diagonal $d^a$. We define the loss function as follows:

$$L = \alpha \frac{1}{N_{pos}} \sum_i L_{cls}(p^{pos}_i, 1) + \beta \frac{1}{N_{neg}} \sum_j L_{cls}(p^{neg}_j, 0) + \frac{1}{N_{pos}} \sum_i L_{reg}(u_i, u^*_i),$$
where $p^{pos}_i$ and $p^{neg}_j$ represent the softmax output for positive anchor $a^{pos}_i$ and negative anchor $a^{neg}_j$ respectively, while $u_i \in \mathbb{R}^7$ and $u^*_i \in \mathbb{R}^7$ are the regression output and ground truth for positive anchor $a^{pos}_i$. The first two terms are the normalized classification losses for $\{a^{pos}_i\}_{i=1 \ldots N_{pos}}$ and $\{a^{neg}_j\}_{j=1 \ldots N_{neg}}$, where $L_{cls}$ stands for the binary cross entropy loss and $\alpha$, $\beta$ are positive constants balancing the relative importance. The last term $L_{reg}$ is the regression loss, for which we use the SmoothL1 function.
2.3. Efficient Implementation
GPUs are optimized for processing dense tensor structures. The problem with working directly with the point cloud is that the points are sparsely distributed across space and each voxel has a variable number of points. We devised a method that converts the point cloud into a dense tensor structure where stacked VFE operations can be processed in parallel across points and voxels.
The method is summarized in Figure 5. We initialize a $K × T × 7$ dimensional tensor structure to store the voxel input feature buffer, where $K$ is the maximum number of non-empty voxels, $T$ is the maximum number of points per voxel, and 7 is the input encoding dimension for each point. The points are randomized before processing. For each point in the point cloud, we check if the corresponding voxel already exists. This lookup operation is done efficiently in $O(1)$ using a hash table where the voxel coordinate is used as the hash key. If the voxel is already initialized, we insert the point at the voxel location if it holds fewer than $T$ points; otherwise the point is ignored. If the voxel is not initialized, we initialize a new voxel, store its coordinate in the voxel coordinate buffer, and insert the point at this voxel location. The voxel input feature and coordinate buffers can be constructed via a single pass over the point list, so the complexity is $O(n)$. To further improve the memory/compute efficiency it is possible to only store a limited number of voxels $(K)$ and ignore points coming from voxels with few points.
After the voxel input buffer is constructed, the stacked VFE only involves point level and voxel level dense operations which can be computed on a GPU in parallel. Note that, after concatenation operations in VFE, we reset the features corresponding to empty points to zero such that they do not affect the computed voxel features. Finally, using the stored coordinate buffer we reorganize the computed sparse voxel-wise structures to the dense voxel grid. The following convolutional middle layers and RPN operations work on a dense voxel grid which can be efficiently implemented on a GPU.
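The process_pointcloud function in the next section implements exactly this buffer construction, using a Python dict keyed by the voxel coordinate as the hash table.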
3. Experiments
Feature Learning Network
Point cloud preprocessing
import numpy as np

# cfg is the project-wide configuration module (detection class, grid sizes, etc.).

def process_pointcloud(point_cloud, cls=cfg.DETECT_OBJ):
    # Input:
    #   (N, 4) array of LiDAR points (x, y, z, reflectance)
    # Output:
    #   voxel_dict with the feature, coordinate and point-count buffers
    if cls == 'Car':
        scene_size = np.array([4, 80, 70.4], dtype=np.float32)
        voxel_size = np.array([0.4, 0.2, 0.2], dtype=np.float32)
        grid_size = np.array([10, 400, 352], dtype=np.int64)
        lidar_coord = np.array([0, 40, 3], dtype=np.float32)
        max_point_number = 35
    else:
        scene_size = np.array([4, 40, 48], dtype=np.float32)
        voxel_size = np.array([0.4, 0.2, 0.2], dtype=np.float32)
        grid_size = np.array([10, 200, 240], dtype=np.int64)
        lidar_coord = np.array([0, 20, 3], dtype=np.float32)
        max_point_number = 45

    np.random.shuffle(point_cloud)

    # shift so that all coordinates inside the cropped scene are non-negative
    shifted_coord = point_cloud[:, :3] + lidar_coord
    # reverse the point cloud coordinate (X, Y, Z) -> (Z, Y, X)
    voxel_index = np.floor(
        shifted_coord[:, ::-1] / voxel_size).astype(np.int64)

    # keep only the points that fall inside the voxel grid
    bound_x = np.logical_and(
        voxel_index[:, 2] >= 0, voxel_index[:, 2] < grid_size[2])
    bound_y = np.logical_and(
        voxel_index[:, 1] >= 0, voxel_index[:, 1] < grid_size[1])
    bound_z = np.logical_and(
        voxel_index[:, 0] >= 0, voxel_index[:, 0] < grid_size[0])
    bound_box = np.logical_and(np.logical_and(bound_x, bound_y), bound_z)
    point_cloud = point_cloud[bound_box]
    voxel_index = voxel_index[bound_box]

    # [K, 3] coordinate buffer as described in the paper
    coordinate_buffer = np.unique(voxel_index, axis=0)

    K = len(coordinate_buffer)
    T = max_point_number

    # [K, 1] store number of points in each voxel grid
    number_buffer = np.zeros(shape=(K), dtype=np.int64)
    # [K, T, 7] feature buffer as described in the paper
    feature_buffer = np.zeros(shape=(K, T, 7), dtype=np.float32)

    # build a reverse index for coordinate buffer
    index_buffer = {}
    for i in range(K):
        index_buffer[tuple(coordinate_buffer[i])] = i

    for voxel, point in zip(voxel_index, point_cloud):
        index = index_buffer[tuple(voxel)]
        number = number_buffer[index]
        if number < T:
            feature_buffer[index, number, :4] = point
            number_buffer[index] += 1

    # last 3 channels: offset of each point from its voxel centroid
    feature_buffer[:, :, -3:] = feature_buffer[:, :, :3] - \
        feature_buffer[:, :, :3].sum(axis=1, keepdims=True) / number_buffer.reshape(K, 1, 1)

    voxel_dict = {'feature_buffer': feature_buffer,
                  'coordinate_buffer': coordinate_buffer,
                  'number_buffer': number_buffer}
    return voxel_dict
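A minimal usage sketch, assuming a KITTI-style velodyne scan stored as an N × 4 float32 binary file (the path is illustrative):

```python
import numpy as np

# Illustrative path to a KITTI velodyne scan; adapt to your dataset layout.
points = np.fromfile('data/training/velodyne/000010.bin', dtype=np.float32).reshape(-1, 4)

voxel_dict = process_pointcloud(points, cls='Car')
print(voxel_dict['feature_buffer'].shape)     # (K, 35, 7)
print(voxel_dict['coordinate_buffer'].shape)  # (K, 3), voxel indices in (d, h, w) order
print(voxel_dict['number_buffer'].shape)      # (K,)
```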
VFE layer
import torch
import torch.nn as nn

# cfg.VOXEL_POINT_COUNT is the per-voxel point cap T (35 for Car, 45 for Pedestrian/Cyclist).

class VFELayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(VFELayer, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.units = int(out_channels / 2)
        # point-wise fully connected layer (linear + ReLU), shared across points
        self.dense = nn.Sequential(nn.Linear(self.in_channels, self.units), nn.ReLU())
        self.batch_norm = nn.BatchNorm1d(self.units)

    def forward(self, inputs, mask):
        # [ΣK, T, in_ch] -> [ΣK, T, units] -> [ΣK, units, T]
        tmp = self.dense(inputs).transpose(1, 2)
        # [ΣK, units, T] -> [ΣK, T, units]
        pointwise = self.batch_norm(tmp).transpose(1, 2)
        # locally aggregated feature: element-wise max over the points of each voxel
        # [ΣK, 1, units]
        aggregated, _ = torch.max(pointwise, dim=1, keepdim=True)
        # [ΣK, T, units]
        repeated = aggregated.expand(-1, cfg.VOXEL_POINT_COUNT, -1)
        # concatenate point-wise and aggregated features
        # [ΣK, T, 2 * units]
        concatenated = torch.cat([pointwise, repeated], dim=2)
        # [ΣK, T, 1] -> [ΣK, T, 2 * units]
        mask = mask.expand(-1, -1, 2 * self.units)
        # zero out the features of the padded (empty) point slots
        concatenated = concatenated * mask.float()
        return concatenated
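A quick shape check for a single VFE layer (assuming cfg.VOXEL_POINT_COUNT is 35, i.e. the Car setting):

```python
import torch

vfe = VFELayer(7, 32)
feature = torch.rand(100, 35, 7)                          # [K, T, 7] voxel input buffer
mask = (torch.max(feature, dim=2, keepdim=True)[0] != 0)  # [K, T, 1] marks non-empty point slots
out = vfe(feature, mask)
print(out.shape)  # torch.Size([100, 35, 32]): 16 point-wise + 16 aggregated channels
```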
Feature learning network
class FeatureNet(nn.Module):
    def __init__(self):
        super(FeatureNet, self).__init__()
        self.vfe1 = VFELayer(7, 32)
        self.vfe2 = VFELayer(32, 128)

    def forward(self, feature, number, coordinate):
        batch_size = len(feature)

        feature = torch.cat(feature, dim=0)        # [ΣK, cfg.VOXEL_POINT_COUNT, 7]; cfg.VOXEL_POINT_COUNT = 35/45
        coordinate = torch.cat(coordinate, dim=0)  # [ΣK, 4]; each row stores (batch, d, h, w)

        # a point slot is real only if at least one of its 7 features is non-zero
        vmax, _ = torch.max(feature, dim=2, keepdim=True)
        mask = (vmax != 0)  # [ΣK, T, 1]

        x = self.vfe1(feature, mask)
        x = self.vfe2(x, mask)

        # voxel-wise feature: element-wise max over the points of each voxel
        # [ΣK, 128]
        voxelwise, _ = torch.max(x, dim=1)

        # scatter the voxel-wise features back onto the dense grid
        # Car: [B, 10, 400, 352, 128]; Pedestrian/Cyclist: [B, 10, 200, 240, 128]
        outputs = torch.sparse.FloatTensor(coordinate.t(), voxelwise,
                                           torch.Size([batch_size, cfg.INPUT_DEPTH, cfg.INPUT_HEIGHT, cfg.INPUT_WIDTH, 128]))
        outputs = outputs.to_dense()
        return outputs
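Note that the scatter back to the dense grid relies on coordinate carrying a batch index in addition to the (d, h, w) voxel index, so one sparse tensor can hold the voxels of the entire mini-batch; torch.sparse.FloatTensor(...).to_dense() then produces the dense $B × D' × H' × W' × 128$ volume that the convolutional middle layers expect.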
Convolutional Middle Layers
import torch.nn.functional as F

class ConvMD(nn.Module):
    def __init__(self, M, cin, cout, k, s, p, bn=True, activation=True):
        super(ConvMD, self).__init__()
        self.M = M  # dimension of the input (2 -> Conv2d, 3 -> Conv3d)
        self.cin = cin
        self.cout = cout
        self.k = k
        self.s = s
        self.p = p
        self.bn = bn
        self.activation = activation
        if self.M == 2:    # 2D input
            self.conv = nn.Conv2d(self.cin, self.cout, self.k, self.s, self.p)
            if self.bn:
                self.batch_norm = nn.BatchNorm2d(self.cout)
        elif self.M == 3:  # 3D input
            self.conv = nn.Conv3d(self.cin, self.cout, self.k, self.s, self.p)
            if self.bn:
                self.batch_norm = nn.BatchNorm3d(self.cout)
        else:
            raise ValueError('ConvMD only supports M = 2 or M = 3')

    def forward(self, inputs):
        out = self.conv(inputs)
        if self.bn:
            out = self.batch_norm(out)
        if self.activation:
            return F.relu(out)
        else:
            return out
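A small shape check for the Conv3D(128, 64, 3, (2, 1, 1), (1, 1, 1)) operator used in the middle layers (with a shrunken dummy grid so the example stays light):

```python
import torch

conv3d = ConvMD(3, 128, 64, 3, (2, 1, 1), (1, 1, 1))
x = torch.rand(1, 128, 10, 40, 36)   # [B, C, D', H', W'] with reduced H', W'
print(conv3d(x).shape)               # torch.Size([1, 64, 5, 40, 36]): depth is halved
```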
Region Proposal Network
Deconv2D
class Deconv2D(nn.Module):
    def __init__(self, cin, cout, k, s, p, bn=True):
        super(Deconv2D, self).__init__()
        self.cin = cin
        self.cout = cout
        self.k = k
        self.s = s
        self.p = p
        self.bn = bn
        self.deconv = nn.ConvTranspose2d(self.cin, self.cout, self.k, self.s, self.p)
        if self.bn:
            self.batch_norm = nn.BatchNorm2d(self.cout)

    def forward(self, inputs):
        out = self.deconv(inputs)
        if self.bn:
            out = self.batch_norm(out)
        return F.relu(out)
Middle and RPN
class MiddleAndRPN(nn.Module):
    def __init__(self, alpha=1.5, beta=1, sigma=3, training=True, name=''):
        super(MiddleAndRPN, self).__init__()

        self.middle_layer = nn.Sequential(ConvMD(3, 128, 64, 3, (2, 1, 1), (1, 1, 1)),
                                          ConvMD(3, 64, 64, 3, (1, 1, 1), (0, 1, 1)),
                                          ConvMD(3, 64, 64, 3, (2, 1, 1), (1, 1, 1)))

        if cfg.DETECT_OBJ == 'Car':
            self.block1 = nn.Sequential(ConvMD(2, 128, 128, 3, (2, 2), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)))
        else:  # Pedestrian/Cyclist
            self.block1 = nn.Sequential(ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                        ConvMD(2, 128, 128, 3, (1, 1), (1, 1)))

        self.deconv1 = Deconv2D(128, 256, 3, (1, 1), (1, 1))

        self.block2 = nn.Sequential(ConvMD(2, 128, 128, 3, (2, 2), (1, 1)),
                                    ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 128, 128, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 128, 128, 3, (1, 1), (1, 1)))

        self.deconv2 = Deconv2D(128, 256, 2, (2, 2), (0, 0))

        self.block3 = nn.Sequential(ConvMD(2, 128, 256, 3, (2, 2), (1, 1)),
                                    ConvMD(2, 256, 256, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 256, 256, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 256, 256, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 256, 256, 3, (1, 1), (1, 1)),
                                    ConvMD(2, 256, 256, 3, (1, 1), (1, 1)))

        self.deconv3 = Deconv2D(256, 256, 4, (4, 4), (0, 0))

        # 1x1 convolutions mapping the fused feature map to the learning targets
        self.prob_conv = ConvMD(2, 768, 2, 1, (1, 1), (0, 0), bn=False, activation=False)
        self.reg_conv = ConvMD(2, 768, 14, 1, (1, 1), (0, 0), bn=False, activation=False)
        self.output_shape = [cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH]

    def forward(self, inputs):
        batch_size, DEPTH, HEIGHT, WIDTH, C = inputs.shape  # [batch_size, 10, 400/200, 352/240, 128]
        inputs = inputs.permute(0, 4, 1, 2, 3)  # (B, D, H, W, C) -> (B, C, D, H, W)

        temp_conv = self.middle_layer(inputs)  # [batch, 64, 2, 400, 352]
        temp_conv = temp_conv.view(batch_size, -1, HEIGHT, WIDTH)  # [batch, 128, 400, 352]

        temp_conv = self.block1(temp_conv)  # [batch, 128, 200, 176]
        temp_deconv1 = self.deconv1(temp_conv)

        temp_conv = self.block2(temp_conv)  # [batch, 128, 100, 88]
        temp_deconv2 = self.deconv2(temp_conv)

        temp_conv = self.block3(temp_conv)  # [batch, 256, 50, 44]
        temp_deconv3 = self.deconv3(temp_conv)

        # concatenate the upsampled outputs of the three blocks
        temp_conv = torch.cat([temp_deconv3, temp_deconv2, temp_deconv1], dim=1)

        # probability score map, [batch, 2, cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH]
        p_map = self.prob_conv(temp_conv)
        # regression map, [batch, 14, cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH]
        r_map = self.reg_conv(temp_conv)
        return torch.sigmoid(p_map), r_map
Full network
import os

# Helpers such as cal_anchors, cal_rpn_target, delta_to_boxes3d, nms, load_calib, the
# drawing functions and the constant small_addon_for_BCE (a small epsilon keeping log()
# finite) are assumed to be provided by the project's utility and config modules.

class RPN3D(nn.Module):
    def __init__(self, cls='Car', alpha=1.5, beta=1, sigma=3):
        super(RPN3D, self).__init__()
        self.cls = cls
        self.alpha = alpha
        self.beta = beta
        self.sigma = sigma

        self.feature = FeatureNet()
        self.rpn = MiddleAndRPN()

        # Generate anchors
        self.anchors = cal_anchors()  # [cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH, 2, 7]; 2 means two rotations; 7 means (cx, cy, cz, h, w, l, r)
        self.rpn_output_shape = self.rpn.output_shape

    def forward(self, data):
        tag = data[0]
        label = data[1]
        vox_feature = data[2]
        vox_number = data[3]
        vox_coordinate = data[4]

        features = self.feature(vox_feature, vox_number, vox_coordinate)
        prob_output, delta_output = self.rpn(features)

        # Calculate ground-truth
        pos_equal_one, neg_equal_one, targets = cal_rpn_target(
            label, self.rpn_output_shape, self.anchors, cls=cfg.DETECT_OBJ, coordinate='lidar')
        pos_equal_one_for_reg = np.concatenate(
            [np.tile(pos_equal_one[..., [0]], 7), np.tile(pos_equal_one[..., [1]], 7)], axis=-1)
        pos_equal_one_sum = np.clip(np.sum(pos_equal_one, axis=(1, 2, 3)).reshape(-1, 1, 1, 1), a_min=1, a_max=None)
        neg_equal_one_sum = np.clip(np.sum(neg_equal_one, axis=(1, 2, 3)).reshape(-1, 1, 1, 1), a_min=1, a_max=None)

        # Move to GPU
        device = features.device
        pos_equal_one = torch.from_numpy(pos_equal_one).to(device).float()
        neg_equal_one = torch.from_numpy(neg_equal_one).to(device).float()
        targets = torch.from_numpy(targets).to(device).float()
        pos_equal_one_for_reg = torch.from_numpy(pos_equal_one_for_reg).to(device).float()
        pos_equal_one_sum = torch.from_numpy(pos_equal_one_sum).to(device).float()
        neg_equal_one_sum = torch.from_numpy(neg_equal_one_sum).to(device).float()

        # [batch, cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH, 2/14] -> [batch, 2/14, cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH]
        pos_equal_one = pos_equal_one.permute(0, 3, 1, 2)
        neg_equal_one = neg_equal_one.permute(0, 3, 1, 2)
        targets = targets.permute(0, 3, 1, 2)
        pos_equal_one_for_reg = pos_equal_one_for_reg.permute(0, 3, 1, 2)

        # Calculate loss
        cls_pos_loss = (-pos_equal_one * torch.log(prob_output + small_addon_for_BCE)) / pos_equal_one_sum
        cls_neg_loss = (-neg_equal_one * torch.log(1 - prob_output + small_addon_for_BCE)) / neg_equal_one_sum
        cls_loss = torch.sum(self.alpha * cls_pos_loss + self.beta * cls_neg_loss)
        cls_pos_loss_rec = torch.sum(cls_pos_loss)
        cls_neg_loss_rec = torch.sum(cls_neg_loss)

        reg_loss = smooth_l1(delta_output * pos_equal_one_for_reg, targets * pos_equal_one_for_reg, self.sigma) / pos_equal_one_sum
        reg_loss = torch.sum(reg_loss)

        loss = cls_loss + reg_loss
        return prob_output, delta_output, loss, cls_loss, reg_loss, cls_pos_loss_rec, cls_neg_loss_rec

    def predict(self, data, probs, deltas, summary=False, vis=False):
        '''
        probs: (batch, 2, cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH)
        deltas: (batch, 14, cfg.FEATURE_HEIGHT, cfg.FEATURE_WIDTH)
        '''
        tag = data[0]
        label = data[1]
        vox_feature = data[2]
        vox_number = data[3]
        vox_coordinate = data[4]
        img = data[5]
        lidar = data[6]

        batch_size, _, _, _ = probs.shape
        device = probs.device

        batch_gt_boxes3d = None
        if summary or vis:
            batch_gt_boxes3d = label_to_gt_box3d(label, cls=self.cls, coordinate='lidar')

        # Move to CPU and convert to numpy arrays
        probs = probs.cpu().detach().numpy()
        deltas = deltas.cpu().detach().numpy()

        # BOTTLENECK
        batch_boxes3d = delta_to_boxes3d(deltas, self.anchors, coordinate='lidar')
        batch_boxes2d = batch_boxes3d[:, :, [0, 1, 4, 5, 6]]
        batch_probs = probs.reshape((batch_size, -1))

        # NMS
        ret_box3d = []
        ret_score = []
        for batch_id in range(batch_size):
            # Remove boxes with low score
            ind = np.where(batch_probs[batch_id, :] >= cfg.RPN_SCORE_THRESH)[0]
            tmp_boxes3d = batch_boxes3d[batch_id, ind, ...]
            tmp_boxes2d = batch_boxes2d[batch_id, ind, ...]
            tmp_scores = batch_probs[batch_id, ind]

            # TODO: if possible, use rotated NMS
            boxes2d = corner_to_standup_box2d(center_to_corner_box2d(tmp_boxes2d, coordinate='lidar'))
            # 2D box indices kept after NMS
            ind, cnt = nms(torch.from_numpy(boxes2d).to(device), torch.from_numpy(tmp_scores).to(device),
                           cfg.RPN_NMS_THRESH, cfg.RPN_NMS_POST_TOPK)
            ind = ind[:cnt].cpu().detach().numpy()

            tmp_boxes3d = tmp_boxes3d[ind, ...]
            tmp_scores = tmp_scores[ind]
            ret_box3d.append(tmp_boxes3d)
            ret_score.append(tmp_scores)

        ret_box3d_score = []
        for boxes3d, scores in zip(ret_box3d, ret_score):
            ret_box3d_score.append(np.concatenate([np.tile(self.cls, len(boxes3d))[:, np.newaxis],
                                                   boxes3d, scores[:, np.newaxis]], axis=-1))

        if summary:
            # Only summarize the first sample in a batch
            cur_tag = tag[0]
            P, Tr, R = load_calib(os.path.join(cfg.CALIB_DIR, cur_tag + '.txt'))
            front_image = draw_lidar_box3d_on_image(img[0], ret_box3d[0], ret_score[0],
                                                    batch_gt_boxes3d[0], P2=P, T_VELO_2_CAM=Tr, R_RECT_0=R)
            bird_view = lidar_to_bird_view_img(lidar[0], factor=cfg.BV_LOG_FACTOR)
            bird_view = draw_lidar_box3d_on_birdview(bird_view, ret_box3d[0], ret_score[0], batch_gt_boxes3d[0],
                                                     factor=cfg.BV_LOG_FACTOR, P2=P, T_VELO_2_CAM=Tr, R_RECT_0=R)
            heatmap = colorize(probs[0, ...], cfg.BV_LOG_FACTOR)
            ret_summary = [['predict/front_view_rgb', front_image[np.newaxis, ...]],  # [None, cfg.IMAGE_HEIGHT, cfg.IMAGE_WIDTH, 3]
                           # [None, cfg.BV_LOG_FACTOR * cfg.INPUT_HEIGHT, cfg.BV_LOG_FACTOR * cfg.INPUT_WIDTH, 3]
                           ['predict/bird_view_lidar', bird_view[np.newaxis, ...]],
                           # [None, cfg.BV_LOG_FACTOR * cfg.FEATURE_HEIGHT, cfg.BV_LOG_FACTOR * cfg.FEATURE_WIDTH, 3]
                           ['predict/bird_view_heatmap', heatmap[np.newaxis, ...]]]
            return tag, ret_box3d_score, ret_summary

        if vis:
            front_images, bird_views, heatmaps = [], [], []
            for i in range(len(img)):
                cur_tag = tag[i]
                P, Tr, R = load_calib(os.path.join(cfg.CALIB_DIR, cur_tag + '.txt'))
                front_image = draw_lidar_box3d_on_image(img[i], ret_box3d[i], ret_score[i],
                                                        batch_gt_boxes3d[i], P2=P, T_VELO_2_CAM=Tr, R_RECT_0=R)
                bird_view = lidar_to_bird_view_img(lidar[i], factor=cfg.BV_LOG_FACTOR)
                bird_view = draw_lidar_box3d_on_birdview(bird_view, ret_box3d[i], ret_score[i], batch_gt_boxes3d[i],
                                                         factor=cfg.BV_LOG_FACTOR, P2=P, T_VELO_2_CAM=Tr, R_RECT_0=R)
                heatmap = colorize(probs[i, ...], cfg.BV_LOG_FACTOR)
                front_images.append(front_image)
                bird_views.append(bird_view)
                heatmaps.append(heatmap)
            return tag, ret_box3d_score, front_images, bird_views, heatmaps

        return tag, ret_box3d_score


def smooth_l1(deltas, targets, sigma=3.0):
    # Reference: https://mohitjainweb.files.wordpress.com/2018/03/smoothl1loss.pdf
    # SmoothL1(x) = 0.5 * (sigma * x)^2 if |x| < 1 / sigma^2, else |x| - 0.5 / sigma^2
    sigma2 = sigma * sigma
    diffs = deltas - targets
    smooth_l1_signs = torch.lt(torch.abs(diffs), 1.0 / sigma2).float()
    smooth_l1_option1 = torch.mul(diffs, diffs) * 0.5 * sigma2
    smooth_l1_option2 = torch.abs(diffs) - 0.5 / sigma2
    smooth_l1_add = torch.mul(smooth_l1_option1, smooth_l1_signs) + torch.mul(smooth_l1_option2, 1 - smooth_l1_signs)
    return smooth_l1_add
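With the default sigma = 3, the quadratic region of this loss is $|x| < 1/9$. In PyTorch 1.7 and later, the same per-element loss should match torch.nn.SmoothL1Loss(reduction='none', beta=1.0 / sigma ** 2), which may be preferable to the hand-rolled version.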