GHM

Author: 小松qxs | Published 2019-10-29 17:01

title	Gradient Harmonized Single-stage Detector
url	https://arxiv.org/pdf/1811.05181.pdf
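motivation	Single-stage detectors are more elegant than two-stage ones, but they suffer from the imbalance between the numbers of positive and negative examples and from the conflict between easy and hard examples.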
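content	GHM addresses both the positive/negative count imbalance and the easy/hard example conflict from the perspective of gradients.
Traditional methods:
1. OHEM: directly discards a large number of examples, so training efficiency is low.
2. Focal loss: has two hyperparameters that must be set by hand and cannot adapt dynamically as the training data changes.
Left figure:
(1) Easy examples are numerous and may overwhelm the contribution of the few hard examples, so training is inefficient.
(2) Examples with a very large gradient norm (very hard ones) are slightly denser than medium examples. They are regarded as outliers. Outliers persist even after the model converges and may hurt the stability of the model.
Right figure:
(1) GHM is proposed, inspired by the gradient norm distribution.
(2) With GHM training, the cumulative gradient contributed by easy examples and by outliers is down-weighted in both cases.
(3) The contributions of the examples are balanced, so training is effective and stable.
Contributions:
1. Identifies the principle behind example imbalance in single-stage detectors, namely the gradient norm distribution, and proposes GHM.
2. Defines the classification and regression losses GHM-C and GHM-R (adjusted dynamically according to the distribution), which rectify the gradient contribution of examples with different attributes and are robust to hyperparameters.
3. Adding GHM yields state-of-the-art results.
Gradient Harmonizing Mechanism:
Problem Description: easy examples are numerous and overwhelm the hard ones. Extremely hard examples are also fairly numerous; forcing the model to learn these outliers tends to make it less accurate.
Gradient Density: examples lying in a region of high gradient density are down-weighted.
GHM-C Loss: the curve has a trend similar to that of focal loss, but the outlier part is also down-weighted, and the weights change dynamically.
Unit Region Approximation:
Complexity Analysis:
1. The naive algorithm computes the gradient density of all examples in O(N²). Even with parallel computation, each computing unit still has O(N) work.
2. The best algorithm first sorts the examples by gradient norm in O(N log N), then scans them with a queue to obtain the densities in O(N). This kind of sorting gains little from parallel computation.
3. In single-stage detectors N is large, so direct computation is time-consuming, and an approximation is used instead. Smooth L1 uses its turning point to separate outliers from inliers.
Unit Region: the density is computed from a histogram (with a chosen number of bins), which has lower complexity and can be parallelized. GD ≈ (number of examples falling in the bin) × (number of bins M). Time complexity: O(NM).
EMA: a momentum-style exponential moving average smooths the per-bin counts and avoids extreme values within a mini-batch.
GHM-R Loss: with Smooth L1, when d is large the gradient norm is always 1, so a loss weighting based on the gradient norm cannot tell such examples apart. If the weighting relied on |d| instead, the unit region approach could not be applied, because |d| is unbounded. A new loss is therefore defined. When d is small it approximates a quadratic function (L2 loss); when d is large it approximates a linear function (L1 loss). Its derivatives exist and are continuous everywhere (for Smooth L1 the second derivative does not exist at the turning point).
In regression all examples are positive, and outliers make up a large proportion (unlike classification). In classification easy examples are not very important, but regression only involves positive examples and regresses the target location, so easy examples matter as well. The final test metric, mAP, is averaged over IoU thresholds from 0.5 to 0.95, which means easy examples count toward the metric too. GHM-R therefore works by up-weighting the important part of easy examples and down-weighting the outliers.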
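The unit region approximation, the EMA smoothing and the regression loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `GHMCWeights`, `ghm_c_loss`, `asl1` and `asl1_grad_norm` are hypothetical, the weight formula β = N / (bin count × M) and the form sqrt(d² + μ²) − μ of the regression loss are written as recalled from the paper, and the default μ = 0.02 is an assumption. Inputs are assumed to be 1-D float arrays.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GHMCWeights:
    """Per-example weights from a gradient-norm histogram (unit regions).

    Hypothetical helper: `bins` plays the role of M, `momentum` of the EMA alpha.
    """

    def __init__(self, bins=30, momentum=0.0):
        self.bins = bins
        self.momentum = momentum
        self.ema_counts = None  # smoothed per-bin counts, created on first call

    def __call__(self, logits, targets):
        # Gradient norm of sigmoid cross entropy w.r.t. the logit: g = |p - p*|.
        g = np.abs(sigmoid(logits) - targets)
        # Index of the unit region each example falls into (g == 1 goes to the last bin).
        idx = np.minimum((g * self.bins).astype(int), self.bins - 1)
        counts = np.bincount(idx, minlength=self.bins).astype(float)
        if self.momentum > 0:
            # EMA over mini-batches to damp extreme per-batch counts.
            if self.ema_counts is None:
                self.ema_counts = counts
            else:
                self.ema_counts = (self.momentum * self.ema_counts
                                   + (1.0 - self.momentum) * counts)
            counts = self.ema_counts
        # Approximate gradient density: count in the bin times the number of bins.
        density = counts[idx] * self.bins
        # Harmonizing weight beta_i = N / GD(g_i).
        return g.size / density


def ghm_c_loss(logits, targets, weigher):
    """Cross entropy re-weighted by the GHM weights, averaged over examples."""
    p = sigmoid(logits)
    eps = 1e-12  # numerical guard for log(0)
    ce = -(targets * np.log(p + eps) + (1.0 - targets) * np.log(1.0 - p + eps))
    return float(np.mean(weigher(logits, targets) * ce))


def asl1(d, mu=0.02):
    """Smooth regression loss: ~quadratic for small |d|, ~linear for large |d|."""
    return np.sqrt(d * d + mu * mu) - mu


def asl1_grad_norm(d, mu=0.02):
    """Gradient norm of asl1; bounded in [0, 1), so unit regions still apply."""
    return np.abs(d) / np.sqrt(d * d + mu * mu)


if __name__ == "__main__":
    # 8 easy negatives and 2 harder positives (made-up toy values).
    logits = np.array([-4.0] * 8 + [0.4] * 2)
    targets = np.array([0.0] * 8 + [1.0] * 2)
    print(GHMCWeights(bins=10)(logits, targets))
    print(ghm_c_loss(logits, targets, GHMCWeights(bins=10)))
```

In the toy batch, the eight easy negatives share one bin and each receives weight 10 / (8 × 10) = 0.125, while the two harder positives share another bin and each receive 10 / (2 × 10) = 0.5, so the crowded bin is down-weighted. The cost is one pass to bin the examples plus one histogram, which is why this form parallelizes more easily than the sort-based algorithm.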
experiments	Implementation Details:
RetinaNet: ResNet backbone with FPN.
Anchors: 3 scales, 3 aspect ratios.
Training: SGD on 8 GPUs (2 images on each GPU) for 14 epochs; initial learning rate 0.01, multiplied by 0.1 at the 9th and 12th epochs; weight decay 0.0001; momentum 0.9; EMA α = 0.75.
GHM-C Loss: all experiments adopt the smooth L1 loss function with δ = 1/9 for the box regression branch.
Baseline: Average Precision (AP) of 28.6.
Number of Unit Region: none of these experiments use EMA. When M is too small, the density cannot vary well across gradient norms, and performance is poor.
Speed: inference speed is unchanged.
Comparison with Other Methods:
GHM-R Loss:
Comparison with Other Losses:
Two-Stage Detector: Faster R-CNN with Res50-FPN.
Main Results:
thoughts