Object Detection: A Detailed Walkthrough of the SSD Model's TensorFlow Code (Unfinished)

Author: TiTiWung | Published: 2019-04-30 15:52

The SSD model

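There are already plenty of posts about the SSD model, so I will not repeat how the model works here. One write-up that I think is very well done is:
https://zhuanlan.zhihu.com/p/24954433?refer=xiaoleimlnote
This post walks through the implementation details of SSD at the code level. The code used is balancap's TensorFlow implementation, see https://github.com/balancap/SSD-Tensorflow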

Generating the anchors

y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]] 
# Usually feat_shape[0] == feat_shape[1]. This places one sample point per cell of the
# feature map; y and x have the same shape, feat_shape[0] x feat_shape[1].
'''
For example, with feat_shape = (8, 8):
y = [[0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2, 2, 2, 2],
    [3, 3, 3, 3, 3, 3, 3, 3],
    [4, 4, 4, 4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5, 5, 5, 5],
    [6, 6, 6, 6, 6, 6, 6, 6],
    [7, 7, 7, 7, 7, 7, 7, 7]]
x = [[0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 2, 3, 4, 5, 6, 7]]
'''
y = (y.astype(dtype) + offset) * step / img_shape[0]  # normalize the cell centers by the image height
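
As a quick check of the normalization above, here is a minimal sketch that computes the cell-center coordinates for a small grid. The helper name grid_centers and the values feat_shape = (4, 4), step = 75 and img_shape = (300, 300) are illustrative assumptions, not taken from the repository; the x line simply mirrors the y line shown above.

```python
import numpy as np

def grid_centers(img_shape, feat_shape, step, offset=0.5, dtype=np.float32):
    """Return the normalized (y, x) center of every feature-map cell."""
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # Shift to the cell center, scale to pixels with `step`,
    # then divide by the image size to get a fraction of the image.
    y = (y.astype(dtype) + offset) * step / img_shape[0]
    x = (x.astype(dtype) + offset) * step / img_shape[1]
    return y, x

# Illustrative values only (assumptions for this sketch).
y, x = grid_centers(img_shape=(300, 300), feat_shape=(4, 4), step=75)
print(y[:, 0])  # centers along the vertical axis
print(x[0, :])  # centers along the horizontal axis
```

With these values the centers fall at 0.125, 0.375, 0.625 and 0.875 along each axis, i.e. the middle of each of the four cells.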

h[0] = sizes[0] / img_shape[0]   # first square box
w[0] = sizes[0] / img_shape[1]
di = 1
if len(sizes) > 1:
    # second, slightly larger square box
    h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]   
    w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
    di += 1
for i, r in enumerate(ratios):
    # two or four rectangular boxes, in pairs whose height and width are swapped
    h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)  
    w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
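
To see what these formulas produce, the sketch below evaluates them in plain Python for one layer. The function name anchor_sizes and the inputs sizes = (21, 45) and ratios = [2, 0.5] on a 300x300 image are assumptions chosen for illustration; the list length len(sizes) + len(ratios) follows from the indexing used above.

```python
import math

def anchor_sizes(img_shape, sizes, ratios):
    """Return normalized (h, w) lists for the anchors of one feature-map layer."""
    num_anchors = len(sizes) + len(ratios)
    h = [0.0] * num_anchors
    w = [0.0] * num_anchors
    # First square box.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    if len(sizes) > 1:
        # Second square box, scaled by the geometric mean of the two sizes.
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    for i, r in enumerate(ratios):
        # Same area as the first box, stretched by the aspect ratio r.
        h[i + di] = sizes[0] / img_shape[0] / math.sqrt(r)
        w[i + di] = sizes[0] / img_shape[1] * math.sqrt(r)
    return h, w

# Illustrative inputs (assumptions for this sketch).
h, w = anchor_sizes((300, 300), sizes=(21.0, 45.0), ratios=[2, 0.5])
for hi, wi in zip(h, w):
    print(round(hi, 4), round(wi, 4))
```

The boxes for r = 2 and r = 0.5 come out as a pair with height and width swapped, and each keeps the area of the first square box.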

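Feature maps from six layers are used. The anchors of each layer are packed as one (y, x, h, w) group, and the six groups are collected into a list of length 6.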
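
To get a feel for the size of that list, the following sketch counts how many anchors each of the six layers contributes. The configuration values below are an assumed SSD300-style setup written from memory for illustration, and anchors_per_layer is a hypothetical helper; check the repository for the values it actually uses.

```python
# Assumed SSD300-style configuration (illustrative, not copied from the repository).
feat_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
anchor_sizes = [(21., 45.), (45., 99.), (99., 153.),
                (153., 207.), (207., 261.), (261., 315.)]
anchor_ratios = [[2, .5], [2, .5, 3, 1. / 3], [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3], [2, .5], [2, .5]]

def anchors_per_layer(feat_shapes, anchor_sizes, anchor_ratios):
    """Count the anchors each layer generates: cells * boxes per cell."""
    counts = []
    for shape, sizes, ratios in zip(feat_shapes, anchor_sizes, anchor_ratios):
        boxes_per_cell = len(sizes) + len(ratios)
        counts.append(shape[0] * shape[1] * boxes_per_cell)
    return counts

counts = anchors_per_layer(feat_shapes, anchor_sizes, anchor_ratios)
print(counts, sum(counts))
```

Under these assumed values the layers use 4 or 6 boxes per cell, and the six groups together hold 8732 anchors.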

Generating the ground-truth labels and localizations

# Check whether the number of anchor/ground-truth IOU passes has reached the number
# of objects in the image: one IOU pass is made per object.
def condition(i, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax):
    r = tf.less(i, tf.shape(labels))
    return r[0]

yref, xref, href, wref = anchors_layer
# Top-left and bottom-right corner coordinates of each anchor box; with 6 anchors per cell, every grid point has 6 ymin values
ymin = yref - href / 2.  
xmin = xref - wref / 2.
ymax = yref + href / 2.
xmax = xref + wref / 2.
vol_anchors = (xmax - xmin) * (ymax - ymin) 
# area of every prior anchor on this feature map, collected in one array

def jaccard_with_anchors(bbox):
    # top-left and bottom-right corners of the intersection of one ground truth with every anchor
    int_ymin = tf.maximum(ymin, bbox[0])  
    int_xmin = tf.maximum(xmin, bbox[1])
    int_ymax = tf.minimum(ymax, bbox[2])
    int_xmax = tf.minimum(xmax, bbox[3])
    h = tf.maximum(int_ymax - int_ymin, 0.)
    w = tf.maximum(int_xmax - int_xmin, 0.)
    # Volumes.
    inter_vol = h * w   # area of every intersection
    # union area = anchor area + ground-truth area - intersection area
    union_vol = vol_anchors - inter_vol \
        + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    jaccard = tf.div(inter_vol, union_vol)
    return jaccard
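
The same intersection-over-union computation can be checked by hand on a single pair of boxes. This is a minimal plain-Python sketch, not the repository's tensor version; the function name jaccard and the box coordinates are made up for illustration.

```python
def jaccard(anchor, bbox):
    """IOU of two boxes, each given as (ymin, xmin, ymax, xmax)."""
    int_ymin = max(anchor[0], bbox[0])
    int_xmin = max(anchor[1], bbox[1])
    int_ymax = min(anchor[2], bbox[2])
    int_xmax = min(anchor[3], bbox[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    h = max(int_ymax - int_ymin, 0.0)
    w = max(int_xmax - int_xmin, 0.0)
    inter = h * w
    area_anchor = (anchor[2] - anchor[0]) * (anchor[3] - anchor[1])
    area_bbox = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    union = area_anchor + area_bbox - inter
    return inter / union

# Two half-overlapping boxes: intersection 0.0625, union 0.4375.
print(jaccard((0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75)))
# Two disjoint boxes.
print(jaccard((0.0, 0.0, 0.2, 0.2), (0.5, 0.5, 0.9, 0.9)))
```

The first pair gives 0.0625 / 0.4375 = 1/7, and the disjoint pair gives 0, which is why the clamp on h and w matters.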

label = labels[i]
bbox = bboxes[i]
# Compute the IOU of every anchor with the current ground truth (the first object in the first iteration)
jaccard = jaccard_with_anchors(bbox)   
mask = tf.greater(jaccard, feat_scores)  
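# In the first iteration feat_scores is all zeros, so this first drops every anchor that has
# no overlap with the ground truth. In the second and later iterations, an anchor whose IOU
# with the current ground truth is larger than its IOU with the earlier ones gets its
# label and score set from the current ground truth.
mask = tf.logical_and(mask, feat_scores > -0.5)  # has no effect in the first iteration
mask = tf.logical_and(mask, label < num_classes)
# Open question: why compare a label that comes from the ground truth against num_classes
# at all? Presumably this is meant to exclude the background, but how is label prepared when
# the data is fed in: is it also laid out in the feat_shape shape, with the corresponding
# label written wherever an anchor contains ground truth?
# (Since label = labels[i] is indexed per object, labels holds one class id per object
# rather than a feat_shape-shaped map.)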
imask = tf.cast(mask, tf.int64)
fmask = tf.cast(mask, dtype)
# Update values using mask.
feat_labels = imask * label + (1 - imask) * feat_labels 
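# In the first iteration this leaves anchors whose IOU with the ground truth is 0 as background.
# In later iterations, an anchor takes the current ground truth's label when its IOU with it
# is larger; otherwise it keeps the label of the earlier ground truth.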
feat_scores = tf.where(mask, jaccard, feat_scores)  
# Where mask is True take the new IOU; where it is False keep feat_scores, which is 0 in the
# first iteration. feat_scores never seems to drop below 0.

feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin  
# In the first iteration, anchors that overlap the ground truth get feat_ymin = bbox[0] and all others stay at 0.
# In later iterations, anchors with a larger IOU against the current ground truth get feat_ymin set to its bbox[0].
feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
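
The masked updates above can be replayed in NumPy to watch the assignments change from one ground truth to the next. This is a simplified sketch under stated assumptions: anchors are a flat (A, 4) array of corners instead of the broadcast feat_shape tensors, a Python for loop stands in for the TensorFlow loop, and match_anchors plus all box values are hypothetical.

```python
import numpy as np

def match_anchors(anchors, labels, bboxes, num_classes):
    """anchors: (A, 4) corners (ymin, xmin, ymax, xmax); labels: (N,); bboxes: (N, 4)."""
    ymin, xmin, ymax, xmax = anchors.T
    vol_anchors = (xmax - xmin) * (ymax - ymin)
    feat_labels = np.zeros(len(anchors), dtype=np.int64)    # 0 = background
    feat_scores = np.zeros(len(anchors), dtype=np.float32)  # best IOU so far
    feat_boxes = np.zeros((len(anchors), 4), dtype=np.float32)
    for label, bbox in zip(labels, bboxes):  # one pass per ground-truth object
        h = np.maximum(np.minimum(ymax, bbox[2]) - np.maximum(ymin, bbox[0]), 0.)
        w = np.maximum(np.minimum(xmax, bbox[3]) - np.maximum(xmin, bbox[1]), 0.)
        inter = h * w
        union = vol_anchors - inter + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = inter / union
        # Update only anchors that overlap this object better than any earlier one.
        mask = (jaccard > feat_scores) & (label < num_classes)
        feat_labels = np.where(mask, label, feat_labels)
        feat_scores = np.where(mask, jaccard, feat_scores)
        feat_boxes = np.where(mask[:, None], bbox, feat_boxes)
    return feat_labels, feat_scores, feat_boxes

# Three anchors and two objects (illustrative values).
anchors = np.array([[0.0, 0.0, 0.5, 0.5],
                    [0.25, 0.25, 0.75, 0.75],
                    [0.6, 0.6, 1.0, 1.0]], dtype=np.float32)
labels = np.array([3, 7], dtype=np.int64)
bboxes = np.array([[0.0, 0.0, 0.5, 0.5],
                   [0.3, 0.3, 0.8, 0.8]], dtype=np.float32)
feat_labels, feat_scores, feat_boxes = match_anchors(anchors, labels, bboxes, num_classes=21)
print(feat_labels, feat_scores)
```

In this example the second anchor is first assigned to object 3 and then reassigned to object 7, because its IOU with the second box is larger, while the first anchor keeps object 3.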

feat_cy = (feat_cy - yref) / href / prior_scaling[0]  
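# The center is encoded as (ground-truth center - anchor center) / anchor height (anchor width for cx) / scaling factor.
# It can be seen as the regression target the network learns to approach: typically a value of a few tenths divided by 0.1 (i.e. multiplied by 10).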
feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
feat_h = tf.log(feat_h / href) / prior_scaling[2]  
# Height (width) is encoded as log(ground-truth height / anchor height), typically within plus or minus a few tenths, then divided by 0.2 (i.e. multiplied by 5)
feat_w = tf.log(feat_w / wref) / prior_scaling[3]
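
A small round trip makes the encoding concrete. The sketch below encodes one box against one anchor and decodes it back, using the scaling factors 0.1 and 0.2 mentioned in the comments above. The decode helper, the way the center and size are derived from the corners, and the example coordinates are assumptions added for illustration.

```python
import math

PRIOR_SCALING = (0.1, 0.1, 0.2, 0.2)

def encode(box, anchor, prior_scaling=PRIOR_SCALING):
    """box: (ymin, xmin, ymax, xmax); anchor: (yref, xref, href, wref)."""
    ymin, xmin, ymax, xmax = box
    yref, xref, href, wref = anchor
    # Convert corners to center and size before encoding.
    cy = (ymax + ymin) / 2.0
    cx = (xmax + xmin) / 2.0
    h = ymax - ymin
    w = xmax - xmin
    return ((cy - yref) / href / prior_scaling[0],
            (cx - xref) / wref / prior_scaling[1],
            math.log(h / href) / prior_scaling[2],
            math.log(w / wref) / prior_scaling[3])

def decode(code, anchor, prior_scaling=PRIOR_SCALING):
    """Inverse of encode (hypothetical helper for this sketch)."""
    yref, xref, href, wref = anchor
    cy = code[0] * href * prior_scaling[0] + yref
    cx = code[1] * wref * prior_scaling[1] + xref
    h = href * math.exp(code[2] * prior_scaling[2])
    w = wref * math.exp(code[3] * prior_scaling[3])
    return (cy - h / 2.0, cx - w / 2.0, cy + h / 2.0, cx + w / 2.0)

anchor = (0.5, 0.5, 0.2, 0.2)    # yref, xref, href, wref
box = (0.42, 0.38, 0.66, 0.58)   # ymin, xmin, ymax, xmax
code = encode(box, anchor)
print(code)
print(decode(code, anchor))
```

Here the box center is offset from the anchor center by 0.04 and -0.02, which encodes to roughly 2.0 and -1.0, and the height ratio 1.2 encodes to about 0.91. Dividing by the scaling factors brings these small offsets to a magnitude around 1.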