YOLOStereo3D: A Step Back to 2D

Author: starryCaptain | Published 2023-06-07 15:01


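The paper YOLOStereo3D: A Step Back to 2D for Efficient Stereo 3D Detection was published at ICRA 2021 and addresses 3D object detection with a purely vision-based stereo camera. Unlike LiDAR, a camera cannot measure depth directly, so depth, which is critical for 3D object detection, has to be estimated algorithmically; the advantage is that cameras are far cheaper. Many state-of-the-art vision-only 3D detectors build on the pseudo-LiDAR idea and combine it with a conventional stereo matching algorithm. However, high-performance disparity estimation networks usually take a long time to process an image, which limits their use in real deployments.
The authors argue that, rather than treating stereo 3D detection as detection on a low-accuracy, estimated point cloud, it is better to take a step back and treat it as a monocular 3D detection task with enhanced stereo features. This is the main motivation of the paper. The proposed YOLOStereo3D model reaches a detection speed of about 10 fps on the KITTI dataset.
The paper makes three main contributions.

Inference architecture: the inference pipeline of monocular 3D object detection is adopted and optimized for stereo 3D object detection.
Network design: a point-wise correlation module is introduced, and a hierarchical, densely connected structure is proposed to exploit multi-scale stereo features.
Experimental results: without using point clouds, YOLOStereo3D shows competitive results on the KITTI 3D benchmark, with an inference time below 0.1 s per frame.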
Fig. 1: Network inference structure of YOLOStereo3D

The network's multi-scale stereo feature extraction consists mainly of the following components:

Light-weight Cost Volume

The network uses the correlation form of cost volume. Its basic computation is as follows:

cost_volume = left_feature.new_zeros(b, max_disp, h, w)  # [B, D, H, W]
for i in range(self.max_disp):
    if i > 0:
        cost_volume[:, i, :, i:] = (left_feature[:, :, :, i:] * right_feature[:, :, :, :-i]).mean(dim=1)
    else:
        cost_volume[:, i, :, :] = (left_feature * right_feature).mean(dim=1)
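To make the shift-and-multiply idea concrete, here is a minimal pure-Python sketch of the same correlation on a single image row. The function name `correlation_cost_volume` and the toy feature rows are assumptions made for illustration; they are not taken from the project.

```python
def correlation_cost_volume(left, right, max_disp):
    """Toy correlation cost volume for one image row.

    left, right: feature rows as nested lists of shape [C][W].
    Returns a nested list of shape [D][W], with D = max_disp.
    """
    channels, width = len(left), len(left[0])
    volume = [[0.0] * width for _ in range(max_disp)]
    for d in range(max_disp):
        # Pixel x in the left row is compared with pixel x - d in the right row.
        for x in range(d, width):
            volume[d][x] = sum(
                left[c][x] * right[c][x - d] for c in range(channels)
            ) / channels
    return volume


# Two channels, six pixels; the right row is the left row shifted by 2 pixels.
left_row = [[0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1]]
right_row = [[0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
volume = correlation_cost_volume(left_row, right_row, max_disp=3)

# The disparity with the highest correlation at pixel 3.
best_disparity = max(range(3), key=lambda d: volume[d][3])
print(best_disparity)  # 2
```

In this toy case the correlation peaks at the disparity that matches the shift between the two rows, which is the signal the cost volume hands to the later layers.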

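For left and right view features of shape [B, C, H, W], the resulting cost volume shrinks from [B, 2C, D, H, W] to [B, D, H, W] compared with the concatenation form of cost volume (shown below). This reduction in dimensionality cuts the processing time of stereo matching from 200 ms to 7 ms.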

cost_volume = left_feature.new_zeros(b, 2 * c, max_disp, h, w)  # [B, 2C, D, H, W]
for i in range(self.max_disp):
    if i > 0:
        cost_volume[:, :, i, :, i:] = torch.cat((left_feature[:, :, :, i:], right_feature[:, :, :, :-i]), dim=1)
    else:
        cost_volume[:, :, i, :, :] = torch.cat((left_feature, right_feature), dim=1)
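The size difference between the two forms can be checked with simple arithmetic. The sketch below only counts elements; the helper names and the example sizes are assumptions chosen for illustration, not measurements from the paper.

```python
from math import prod


def correlation_volume_shape(b, c, h, w, max_disp):
    # Correlation averages over channels, so C disappears: [B, D, H, W].
    return (b, max_disp, h, w)


def concatenation_volume_shape(b, c, h, w, max_disp):
    # Concatenation keeps both feature maps per disparity: [B, 2C, D, H, W].
    return (b, 2 * c, max_disp, h, w)


# Hypothetical sizes, used only to illustrate the ratio.
b, c, h, w, max_disp = 1, 64, 72, 320, 24
corr_shape = correlation_volume_shape(b, c, h, w, max_disp)
concat_shape = concatenation_volume_shape(b, c, h, w, max_disp)
ratio = prod(concat_shape) // prod(corr_shape)
print(corr_shape, concat_shape, ratio)  # ratio equals 2 * c
```

The concatenation volume is always 2C times larger, independent of the image size and the number of disparity levels.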

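For more on cost volumes, see my other article: Understanding Several Kinds of Cost Volume in Stereo Vision (对于几种立体视觉中Cost Volume的理解).
However, this may bias the network numerically toward monocular features in the fusion stage, and downsampling the stereo matching results may cause further information loss. The authors therefore propose the Densely Connected Ghost Module and the Hierarchical Multi-scale Fusion Structure to mitigate these two problems.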

Densely Connected Ghost Module

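GhostNet: More Features From Cheap Operations, published at CVPR 2020, proposes generating many ghost feature maps from a set of intrinsic feature maps through a series of cheap linear transformations, which greatly reduces the computation the network needs.
In YOLOStereo3D, the original input features are densely concatenated with the output of the Ghost module, which triples the number of channels. This corresponds to the purple part of the network inference structure in Fig. 1.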

Ghost module.png

Hierarchical Multi-scale Fusion Structure

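At the 1/4 and 1/8 downsampling levels, light-weight cost volumes with maximum disparities of 96 and 192, respectively, are constructed. Each is fed into a densely connected Ghost module, downsampled, and concatenated with the features at the next smaller scale. At the 1/16 level, the number of channels is first reduced with a 1×1 convolution, and then a small cost volume is built (also flattened into a 2D feature map) to retain more semantic information from the right image. In short, the features at each scale pass through a Ghost module and downsampling before being concatenated with the features of the next scale, which finally yields cost volume features that contain multi-scale information. This corresponds to part (b) of the network inference structure in Fig. 1.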
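The relation between downsampling level, feature map size, and number of disparity levels can be sketched as follows. This assumes that the maximum disparity is given in full-resolution pixels and is divided by the downsampling scale, which agrees with the 24 depth channels noted in the code comments later in this article; the 288×1280 input size and the helper name `scale_summary` are assumptions as well.

```python
def scale_summary(image_hw, downsample_scale, max_disp):
    """Feature map size and disparity levels at one pyramid level."""
    height, width = image_hw
    return {
        "feature_hw": (height // downsample_scale, width // downsample_scale),
        # Assumption: disparities are searched at feature resolution.
        "disparity_levels": max_disp // downsample_scale,
    }


image_hw = (288, 1280)  # assumed network input size
summaries = {
    scale: scale_summary(image_hw, scale, max_disp)
    for scale, max_disp in [(4, 96), (8, 192)]
}
for scale, summary in summaries.items():
    print(f"1/{scale}: {summary}")
```

Under these assumptions both levels end up with the same number of disparity channels, while the spatial size halves from one level to the next.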

Code Walkthrough

Preprocessing

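Training starts by calling scripts/imdb_precompute_3d.py to preprocess the dataset. The script takes a config argument, the path to the user's top-level project configuration file (several examples are provided in the config folder; pick one that fits the task, copy it to config/config.py, and pass it in).
It first reads train.txt and val.txt from the cfg.path.visualDet3D_path/visualDet3D/data/kitti/test_split folder and returns two lists of file names, train_names and val_names, of the form [000001, 000002, ...], for the training and validation sets. (The rule used to split training and validation data can be changed and regenerated through test_split/new_config.py.)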
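In read_one_split(cfg, index_names, data_root_dir, output_dict, data_split, time_display_inter), the script loops over the given file names and builds a KittiData object named data_frame for each one. Its attributes, such as label, calib, image2_path, and calib_path, are filled from the corresponding data in the calib, image_2, image_3, and other folders of the local dataset. Some details are handled here: for the training split, for example, the labels of data_frame are filtered so that only objects whose type is listed in the config and whose occluded and z values satisfy the configured conditions are kept. The resulting list of frames, aligned with the file name list, is saved to a file such as workdirs/Stereo3D/output/train/imdb.pkl for later stages to read.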
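The label filtering step can be pictured with a small sketch. Everything here is hypothetical: the `LabeledObject` class stands in for one KITTI label entry, and the thresholds are made-up values, since the actual conditions come from the project configuration.

```python
from dataclasses import dataclass


@dataclass
class LabeledObject:
    """Hypothetical stand-in for one KITTI label entry."""
    type: str
    occluded: int
    z: float  # depth in the camera frame, in meters


def filter_labels(objects, allowed_types, max_occlusion, min_z, max_z):
    """Keep objects of a configured type within occlusion and depth limits."""
    return [
        obj for obj in objects
        if obj.type in allowed_types
        and obj.occluded <= max_occlusion
        and min_z < obj.z < max_z
    ]


labels = [
    LabeledObject("Car", occluded=0, z=12.0),
    LabeledObject("Car", occluded=3, z=20.0),   # too occluded
    LabeledObject("Tram", occluded=0, z=15.0),  # type not configured
    LabeledObject("Pedestrian", occluded=1, z=80.0),  # too far away
]
# Thresholds below are illustrative assumptions, not the project's values.
kept = filter_labels(labels, {"Car", "Pedestrian"}, max_occlusion=2, min_z=3.0, max_z=70.0)
print([obj.type for obj in kept])  # ['Car']
```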
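Preprocessing also includes some data augmentation and computes anchor statistics, such as the number of anchors and their mean height, width, and length, and stores them in npy files for use during training. These statistics can be used to adjust anchor sizes and positions, which improves detection accuracy.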
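As a rough illustration of such statistics, the sketch below groups object sizes by class and averages them. It is a simplified assumption of what the preprocessing computes: the project gathers its statistics per anchor and saves them as npy files, whereas this example works per class on plain tuples and keeps the result in memory.

```python
from collections import defaultdict
from statistics import mean


def size_statistics(objects):
    """objects: iterable of (type, height, width, length) tuples.

    Returns {type: {"count": int, "mean_hwl": (h, w, l)}}.
    """
    grouped = defaultdict(list)
    for obj_type, height, width, length in objects:
        grouped[obj_type].append((height, width, length))
    return {
        obj_type: {
            "count": len(sizes),
            # zip(*sizes) regroups the tuples into per-dimension columns.
            "mean_hwl": tuple(mean(column) for column in zip(*sizes)),
        }
        for obj_type, sizes in grouped.items()
    }


# Made-up object sizes in meters.
stats = size_statistics([
    ("Car", 1.5, 1.5, 4.0),
    ("Car", 1.75, 1.75, 4.5),
    ("Pedestrian", 1.75, 0.5, 1.0),
])
print(stats["Car"])  # {'count': 2, 'mean_hwl': (1.625, 1.625, 4.25)}
```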

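Training

First, note that the project is modular in its datasets, networks, and run pipelines. visualDet3D/networks/utils/registry.py defines a Registry class and instantiates the main module registries used by the project: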

DATASET_DICT = Registry("datasets")
BACKBONE_DICT = Registry("backbones")
DETECTOR_DICT = Registry("detectors")
PIPELINE_DICT = Registry("pipelines")
AUGMENTATION_DICT = Registry("augmentation")
SAMPLER_DICT = Registry("sampler")

Taking datasets as an example, a concrete dataset class is registered in DATASET_DICT with a decorator, as follows:

@DATASET_DICT.register_module
class KittiStereoDataset(torch.utils.data.Dataset):

Where the dataset is needed, the concrete class is loaded by the class name given in the configuration file:

dataset_train = DATASET_DICT[cfg.data.train_dataset](cfg,"training")

This achieves good decoupling.
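The registry pattern itself fits in a few lines. The following is a simplified, hypothetical `MiniRegistry` written for illustration; it is not the project's Registry class, which may do more (for example, duplicate checks).

```python
class MiniRegistry:
    """Hypothetical, simplified registry; not the project's Registry class."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self, cls):
        # Used as a bare decorator: store the class under its own name
        # and return it unchanged so the definition still works as usual.
        self._modules[cls.__name__] = cls
        return cls

    def __getitem__(self, key):
        return self._modules[key]


TOY_DATASET_DICT = MiniRegistry("datasets")


@TOY_DATASET_DICT.register_module
class ToyStereoDataset:
    def __init__(self, cfg, split):
        self.cfg = cfg
        self.split = split


# The class is looked up by a name that could come from a config file.
dataset = TOY_DATASET_DICT["ToyStereoDataset"]({"batch_size": 4}, "training")
print(type(dataset).__name__, dataset.split)  # ToyStereoDataset training
```

The calling code only needs the registry and a string, so swapping the dataset means changing one configuration value rather than an import.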

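Next, let us go through the training flow in scripts/train.py in detail.
As described above, the training and validation datasets are loaded first, and then the core object detection network, the detector, is loaded in a similar way: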

detector = DETECTOR_DICT[cfg.detector.name](cfg.detector)

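For YoloStereo3D, this loads the Stereo3D(nn.Module) class in visualDet3D/networks/detectors/yolostereo3d_detector.py. The rest is the usual flow: define the optimizer and scheduler (both are also selected according to the configuration file) and iterate for cfg.trainer.max_epochs epochs. Each iteration fetches one batch of data from dataloader_train. The format and types of data depend on the implementation of KittiStereoDataset, analyzed below.
The def __getitem__(self, index) method relies on the following code: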

imdb_file_path = os.path.join(preprocessed_path, split, 'imdb.pkl')
self.imdb = pickle.load(open(imdb_file_path, 'rb'))  # list of kittiData
kitti_data = self.imdb[index]

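This reads the imdb.pkl file generated during preprocessing and deserializes it into KittiData objects. After some data augmentation and computation, the data in kitti_data is assembled into the dictionary below and returned.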

output_dict = {'calib': [P2, P3],
               'image': [transformed_left_image, transformed_right_image],
               'label': [obj.type for obj in transformed_label],
               'bbox2d': bbox2d,  # [N, 4] [x1, y1, x2, y2]
               'bbox3d': bbox3d_state,
               'original_shape': calib.image_shape,
               'disparity': disparity,
               'original_P': calib.P2.copy()}

In addition, KittiStereoDataset defines a collate_fn(batch) method:

@staticmethod
def collate_fn(batch):
    left_images = np.array([item["image"][0] for item in batch])  # [batch, H, W, 3]
    left_images = left_images.transpose([0, 3, 1, 2])
    right_images = np.array([item["image"][1] for item in batch])  # [batch, H, W, 3]
    right_images = right_images.transpose([0, 3, 1, 2])
    P2 = [item['calib'][0] for item in batch]
    P3 = [item['calib'][1] for item in batch]
    label = [item['label'] for item in batch]
    bbox2ds = [item['bbox2d'] for item in batch]
    bbox3ds = [item['bbox3d'] for item in batch]
    disparities = [item['disparity'] for item in batch]
    if disparities[0] is None:
        return torch.from_numpy(left_images).float(), torch.from_numpy(right_images).float(), torch.tensor(
            P2).float(), torch.tensor(P3).float(), label, bbox2ds, bbox3ds
    else:
        return torch.from_numpy(left_images).float(), torch.from_numpy(right_images).float(), torch.tensor(
            P2).float(), torch.tensor(P3).float(), label, bbox2ds, bbox3ds, torch.tensor(disparities).float()

This method combines the images, labels, and other data of one batch of kitti_data. It is passed as an argument to build_dataloader:

dataloader_train = build_dataloader(dataset_train , ... , collate_fn=dataset_train.collate_fn)

This determines the format of every batch of data obtained from dataloader_train.

The training function is loaded from the pipeline registry:

training_detection = PIPELINE_DICT[cfg.trainer.training_func]

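The loaded function serves as the training function. For YoloStereo3D, this is train_stereo_detection() in visualDet3D/networks/pipelines/trainers.py, which receives data, detector, optimizer, cfg, and other arguments. Internally it first calls compound_annotation() to build the ground-truth box annotations of the images in the format the network expects.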

import numpy as np

def compound_annotation(labels, max_length, bbox_2d, bbox_3d, obj_types):
    """
    Args:
        labels: List[List[str]]
        max_length: int, max_num_objects, can be dynamic for each iteration
        bbox_2d: List[np.ndarray], [left, top, right, bottom].
        bbox_3d: List[np.ndarray], [cam_x, cam_y, z, w, h, l, alpha].
        obj_types: List[str]
    Return:
        np.ndarray, [batch_size, max_length, 12]
            [x1, y1, x2, y2, cls_index, cx, cy, z, w, h, l, alpha]
            cls_index = -1 if empty
    """
    annotations = np.ones([len(labels), max_length, bbox_3d[0].shape[-1] + 5]) * -1
    for i in range(len(labels)):
        label = labels[i]
        for j in range(len(label)):
            annotations[i, j] = np.concatenate([
                bbox_2d[i][j], [obj_types.index(label[j])], bbox_3d[i][j]
            ])
    return annotations

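The method first creates a 3-D NumPy array annotations with every element initialized to -1. The first dimension is the number of label lists in the batch (that is, the batch size); the second is max_length, the length of the longest label list in the batch (images may contain different numbers of labeled objects); and the third has length bbox_3d[0].shape[-1] + 5, because it concatenates the 4 values of bbox_2d, the class index of the object, and the 7 values of bbox_3d.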
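The padding behaviour is easy to see on a tiny batch. The sketch below re-creates the same layout with plain Python lists instead of NumPy; the function name `pad_annotations` and all box values are made up for illustration.

```python
def pad_annotations(labels, bbox_2d, bbox_3d, obj_types):
    """Pure-Python sketch of the padded [batch, max_length, 12] layout."""
    max_length = max((len(label) for label in labels), default=0)
    row_width = 4 + 1 + 7  # 2D box, class index, 3D box state
    annotations = [
        [[-1.0] * row_width for _ in range(max_length)] for _ in labels
    ]
    for i, label in enumerate(labels):
        for j, name in enumerate(label):
            annotations[i][j] = [
                *bbox_2d[i][j], float(obj_types.index(name)), *bbox_3d[i][j]
            ]
    return annotations


obj_types = ["Car", "Pedestrian"]
labels = [["Car"], ["Pedestrian", "Car"]]  # one and two objects
box2d = [10.0, 20.0, 50.0, 60.0]                 # made-up [x1, y1, x2, y2]
box3d = [1.0, 1.5, 20.0, 1.6, 1.5, 4.0, 0.1]     # made-up 7-value state
annotations = pad_annotations(
    labels, [[box2d], [box2d, box2d]], [[box3d], [box3d, box3d]], obj_types
)
print(annotations[0][1])     # padding row: twelve values of -1.0
print(annotations[1][0][4])  # class index of "Pedestrian": 1.0
```

The first image has only one object, so its second row stays at -1, which is how downstream code can recognise empty slots.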
The ground-truth annotations are fed to the network for training together with the images, camera intrinsics, and other data. First, in class YoloStereo3DCore(nn.Module):

def forward(self, images):
    batch_size = images.shape[0]
    left_images = images[:, 0:3, :, :]
    right_images = images[:, 3:, :, :]
    images = torch.cat([left_images, right_images], dim=0)
    features = self.backbone(images)
    left_features = [feature[0:batch_size] for feature in features]
    right_features = [feature[batch_size:] for feature in features]
    features, depth_output = self.neck(left_features, right_features)
    output_dict = dict(features=features, depth_output=depth_output)
    return output_dict

the left and right images are concatenated into images of shape (batch_size * 2, C, H, W) and passed to the forward method of the backbone implemented in visualDet3D/networks/backbones/resnet.py:

def forward(self, img_batch):
    outs = []
    x = self.conv1(img_batch)
    x = self.bn1(x)
    x = self.relu(x)
    if -1 in self.out_indices:
        outs.append(x)
    x = self.maxpool(x)
    for i in range(self.num_stages):
        layer = getattr(self, f"layer{i + 1}")
        x = layer(x)
        if i in self.out_indices:
            outs.append(x)
    return outs

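The backbone collects the features of the stages selected by out_indices, out of num_stages stages, and returns them as a list, so the list has at most num_stages entries (plus one if -1 is in out_indices). With the default settings it returns a list of length 3 containing tensors of shape [8, 64, 72, 320], [8, 128, 36, 160], and [8, 256, 18, 80], where the dimensions are (N, C, H, W) and N is the batch dimension of the stacked left and right images.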
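Which outputs end up in the list can be traced without running any network. The sketch below mirrors only the control flow of the forward method shown above and records stage names instead of tensors; the function name `collected_stages` and the label "stem" are assumptions for illustration.

```python
def collected_stages(num_stages, out_indices):
    """Names of the outputs the forward loop would append, in order."""
    outs = []
    if -1 in out_indices:
        outs.append("stem")  # output after conv1, bn1, relu
    for i in range(num_stages):
        if i in out_indices:
            outs.append(f"layer{i + 1}")
    return outs


print(collected_stages(3, (0, 1, 2)))      # ['layer1', 'layer2', 'layer3']
print(collected_stages(3, (-1, 0, 1, 2)))  # ['stem', 'layer1', 'layer2', 'layer3']
print(collected_stages(4, (1, 2)))         # ['layer2', 'layer3']
```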
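After the multi-level features of the left and right images are separated, they are passed to the forward method of the neck network, class StereoMerging(nn.Module):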

def __init__(self, base_features):
    super(StereoMerging, self).__init__()
    self.cost_volume_0 = PSMCosineModule(downsample_scale=4, max_disp=96, input_features=base_features)
    PSV_depth_0 = self.cost_volume_0.depth_channel
    self.cost_volume_1 = PSMCosineModule(downsample_scale=8, max_disp=192, input_features=base_features * 2)
    PSV_depth_1 = self.cost_volume_1.depth_channel
    self.cost_volume_2 = CostVolume(downsample_scale=16, max_disp=192, input_features=base_features * 4, PSM_features=8)
    PSV_depth_2 = self.cost_volume_2.output_channel
    self.depth_reasoning = CostVolumePyramid(PSV_depth_0, PSV_depth_1, PSV_depth_2)
    self.final_channel = self.depth_reasoning.output_channel_num + base_features * 4

def forward(self, left_x, right_x):
    PSVolume_0 = self.cost_volume_0(left_x[0], right_x[0])
    PSVolume_1 = self.cost_volume_1(left_x[1], right_x[1])
    PSVolume_2 = self.cost_volume_2(left_x[2], right_x[2])
    PSV_features, depth_output = self.depth_reasoning(PSVolume_0, PSVolume_1, PSVolume_2)  # c = 1152
    features = torch.cat([left_x[2], PSV_features], dim=1)  # c = 1152 + 256 = 1408
    return features, depth_output

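The operations in forward correspond to the Light-weight Cost Volume and Hierarchical Multi-scale Fusion Structure parts of the paper. For the features at the 1/4 and 1/8 downsampling levels, a correlation cost volume is computed; for the features at the 1/16 level, a concatenation cost volume is computed.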

The three resulting feature levels are passed to depth_reasoning, that is, the forward method of class CostVolumePyramid(nn.Module), for feature fusion.

self.depth_reasoning = CostVolumePyramid(PSV_depth_0, PSV_depth_1, PSV_depth_2)

class CostVolumePyramid(nn.Module):

    def __init__(self, depth_channel_4, depth_channel_8, depth_channel_16):
        super(CostVolumePyramid, self).__init__()
        self.depth_channel_4 = depth_channel_4  # 24
        self.depth_channel_8 = depth_channel_8  # 24
        self.depth_channel_16 = depth_channel_16  # 96

        input_features = depth_channel_4  # 24
        self.four_to_eight = nn.Sequential(
            ResGhostModule(input_features, 3 * input_features, 3, ratio=3),
            nn.AvgPool2d(2),
            BasicBlock(3 * input_features, 3 * input_features),
        )
        input_features = 3 * input_features + depth_channel_8  # 3 * 24 + 24 = 96
        self.eight_to_sixteen = nn.Sequential(
            ResGhostModule(input_features, 3 * input_features, 3, ratio=3),
            nn.AvgPool2d(2),
            BasicBlock(3 * input_features, 3 * input_features),
        )
        input_features = 3 * input_features + depth_channel_16  # 3 * 96 + 96 = 384
        self.depth_reason = nn.Sequential(
            ResGhostModule(input_features, 3 * input_features, kernel_size=3, ratio=3),
            BasicBlock(3 * input_features, 3 * input_features),
        )
        self.output_channel_num = 3 * input_features  # 1152

        ...

    def forward(self, psv_volume_4, psv_volume_8, psv_volume_16):
        psv_4_8 = self.four_to_eight(psv_volume_4)
        psv_volume_8 = torch.cat([psv_4_8, psv_volume_8], dim=1)
        psv_8_16 = self.eight_to_sixteen(psv_volume_8)
        psv_volume_16 = torch.cat([psv_8_16, psv_volume_16], dim=1)
        psv_16 = self.depth_reason(psv_volume_16)
        if self.training:
            return psv_16, self.depth_output(psv_16)
        return psv_16, torch.zeros([psv_volume_4.shape[0], 1, psv_volume_4.shape[2], psv_volume_4.shape[3]])

Here four_to_eight, eight_to_sixteen, and depth_reason correspond to the Densely Connected Ghost Module part of the paper.
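The channel counts in the code comments can be reproduced with plain arithmetic. The sketch below assumes, as the constructor arguments suggest, that each Ghost stage triples its input channels and that each concatenation adds the channels of the incoming volume; the function name `pyramid_channels` is hypothetical.

```python
def pyramid_channels(depth_channel_4, depth_channel_8, depth_channel_16,
                     left_feature_channels):
    """Trace channel counts through the three fusion stages."""
    after_four_to_eight = 3 * depth_channel_4
    eight_input = after_four_to_eight + depth_channel_8
    after_eight_to_sixteen = 3 * eight_input
    sixteen_input = after_eight_to_sixteen + depth_channel_16
    pyramid_output = 3 * sixteen_input
    return {
        "four_to_eight": after_four_to_eight,
        "eight_input": eight_input,
        "eight_to_sixteen": after_eight_to_sixteen,
        "sixteen_input": sixteen_input,
        "pyramid_output": pyramid_output,
        # The neck concatenates the 1/16 left features onto the result.
        "neck_output": pyramid_output + left_feature_channels,
    }


channels = pyramid_channels(24, 24, 96, left_feature_channels=256)
print(channels)
```

With the values from the comments (24, 24, 96, and 256 left-feature channels) this gives 1152 channels out of the pyramid and 1408 after the neck, matching the annotations in the excerpts.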
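The final features produced by the neck are passed to bbox_head to compute the classification and regression losses. Finally, the total loss is backpropagated to update the network.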

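Validation

At the end of each epoch, the model is evaluated on the validation set. As in training, the evaluation function is loaded from the pipeline registry: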

evaluate_detection = PIPELINE_DICT[cfg.trainer.evaluate_func]

For YoloStereo3D, this loads the evaluate_kitti_obj() function in visualDet3D/networks/pipelines/evaluators.py. Inside it, another function is loaded:

test_func = PIPELINE_DICT[cfg.trainer.test_func]

This is the test_stereo_detection() function in visualDet3D/networks/pipelines/testers.py. In the end, the predicted scores, bbox, and obj_names are mapped back to and drawn on the original image.
