5. Results
We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow.
Metrics We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let $n_{ij}$ be the number of pixels of class $i$ predicted to belong to class $j$, where there are $n_{cl}$ different classes, and let $t_i = \sum_j n_{ij}$ be the total number of pixels of class $i$. We compute:
- pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
- mean accuracy: $(1/n_{cl}) \sum_i n_{ii} / t_i$
- mean IU: $(1/n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
- frequency weighted IU: $(\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
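For concreteness, here is a minimal NumPy sketch of these four metrics computed from a precomputed confusion matrix; the function name and the assumption that a histogram is already accumulated are our own illustration, not part of the paper's released code.

```python
import numpy as np

def segmentation_metrics(hist):
    """Compute the four metrics from an n_cl x n_cl confusion matrix,
    where hist[i, j] = n_ij = pixels of true class i predicted as class j."""
    hist = hist.astype(np.float64)
    n_ii = np.diag(hist)               # correctly predicted pixels per class
    t_i = hist.sum(axis=1)             # total pixels of each true class
    pred_i = hist.sum(axis=0)          # total pixels predicted as each class

    pixel_acc = n_ii.sum() / t_i.sum()
    mean_acc = np.nanmean(n_ii / t_i)  # classes absent from ground truth are skipped
    iu = n_ii / (t_i + pred_i - n_ii)  # per-class intersection over union
    mean_iu = np.nanmean(iu)
    freq = t_i / t_i.sum()
    fw_iu = (freq * iu)[freq > 0].sum()  # frequency weighted IU
    return pixel_acc, mean_acc, mean_iu, fw_iu
```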
PASCAL VOC Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [17], and the well-known R-CNN [12]. We achieve the best results on mean IU by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).
NYUDv2 [33] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [14]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to the difficulty of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. [15], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a “late fusion” of RGB and HHA where the predictions from both nets are summed at the final layer, and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.
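The late-fusion idea can be sketched as two fully convolutional streams whose final score maps are summed, so that a single loss trains both streams end-to-end. The sketch below is illustrative only: `make_fcn` is a hypothetical factory for any fully convolutional scorer (e.g., an FCN-32s/16s trunk), not an interface from the paper's code.

```python
import torch.nn as nn

class LateFusionFCN(nn.Module):
    """Two-stream late fusion: one FCN over RGB, one over the HHA encoding
    of depth; per-pixel class scores are summed before the loss."""
    def __init__(self, make_fcn, num_classes=40):
        super().__init__()
        # Both streams take 3-channel input: RGB, and the 3-channel HHA encoding.
        self.rgb_stream = make_fcn(in_channels=3, num_classes=num_classes)
        self.hha_stream = make_fcn(in_channels=3, num_classes=num_classes)

    def forward(self, rgb, hha):
        # Each stream outputs an (N, num_classes, H, W) score map; summing them
        # fuses the modalities at the final layer while keeping end-to-end training.
        return self.rgb_stream(rgb) + self.hha_stream(hha)
```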
SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”), as well as three geometric categories (“horizontal”, “vertical”, and “sky”). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training and 200 test images, show state-of-the-art performance on both tasks.
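A minimal sketch of such a two-headed net: a shared trunk with two 1×1 scoring heads and a summed per-pixel loss. Here `trunk`, `feat_channels`, and the helper names are placeholders standing in for an FCN-16s-style feature extractor, not the paper's actual implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedFCN(nn.Module):
    """Joint semantic + geometric prediction from a shared trunk."""
    def __init__(self, trunk, feat_channels, n_semantic=33, n_geometric=3):
        super().__init__()
        self.trunk = trunk
        self.sem_head = nn.Conv2d(feat_channels, n_semantic, kernel_size=1)
        self.geo_head = nn.Conv2d(feat_channels, n_geometric, kernel_size=1)

    def forward(self, x):
        feats = self.trunk(x)  # shared features, (N, C, H, W)
        return self.sem_head(feats), self.geo_head(feats)

def joint_loss(sem_scores, geo_scores, sem_target, geo_target):
    # Sum of the two per-pixel cross-entropy losses; the shared trunk and both
    # heads receive gradients from this single objective, so training one model
    # costs essentially the same as training either task alone.
    return (F.cross_entropy(sem_scores, sem_target)
            + F.cross_entropy(geo_scores, geo_target))
```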
6. Conclusion
Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.
Acknowledgements This work was supported in part by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.
A. Upper Bounds on IU
In this paper, we have achieved good performance on the mean IU segmentation metric even with coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various scales. We do this by downsampling ground truth images and then upsampling them again to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.
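A simple sketch of this bound, under the assumption that nearest-neighbor down/upsampling of the label image is used (the paper does not publish this exact routine): score the restored label maps against the original ground truth with the mean IU of Section 5.

```python
import numpy as np

def upper_bound_labels(gt, factor):
    """Simulate the best prediction obtainable at a given downsampling factor
    by subsampling the ground-truth label image and repeating each coarse
    label back up to full size."""
    h, w = gt.shape
    coarse = gt[::factor, ::factor]                            # downsample labels
    restored = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    return restored[:h, :w]                                    # crop to original size
```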
Pixel-perfect prediction is clearly not necessary to achieve mean IU well above state-of-the-art, and, conversely, mean IU is not a good measure of fine-scale accuracy.
B. More Results
We further evaluate our FCN for semantic segmentation. PASCAL-Context [29] provides whole scene annotations of PASCAL VOC 2010. While there are over 400 distinct classes, we follow the 59 class task defined by [29] that picks the most frequent classes. We train and evaluate on the training and val sets respectively. In Table 6, we compare to the joint object + stuff variation of Convolutional Feature Masking [4] which is the previous state-of-the-art on this task. FCN-8s scores 37.8 mean IU for a 20% relative improvement.
Changelog
The arXiv version of this paper is kept up-to-date with corrections and additional relevant material. The following gives a brief history of changes.
v2 Add Appendix A giving upper bounds on mean IU and Appendix B with PASCAL-Context results. Correct PASCAL validation numbers (previously, some val images were included in train), SIFT Flow mean IU (which used an inappropriately strict metric), and an error in the frequency weighted mean IU formula. Add link to models and update timing numbers to reflect improved implementation (which is publicly available).
References
[1] C. M. Bishop. Pattern recognition and machine learning, page 229. Springer-Verlag New York, 2006. 6
[2] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012. 9
[3] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2852–2860, 2012. 1, 2, 4, 7
[4] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv preprint arXiv:1412.1283, 2014. 9
[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014. 1, 2
[6] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 633–640. IEEE, 2013. 2
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283, 2014. 2
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results.
[9] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013. 1, 2, 4, 7, 8
[10] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. CoRR, abs/1405.5769, 2014. 1
[11] Y. Ganin and V. Lempitsky. N4-fields: Neural network nearest neighbor fields for image transforms. In ACCV, 2014. 1, 2, 7
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. 1, 2, 7
[13] A. Giusti, D. C. Cireşan, J. Masci, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013. 3, 4
[14] S. Gupta, P. Arbeláez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013. 8
[15] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV. Springer, 2014. 1, 2, 8
[16] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011. 7
[17] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision (ECCV), 2014. 1, 2, 4, 5, 7, 8
[18] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Computer Vision and Pattern Recognition, 2015. 2
[19] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 1, 2
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 7
[21] J. J. Koenderink and A. J. van Doorn. Representation of local geometry in the visual system. Biological cybernetics, 55(6):367–375, 1987. 6
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 1, 2, 3, 5
[23] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to hand-written zip code recognition. In Neural Computation, 1989. 2, 3
[24] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 1998. 7
[25] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: Dense correspondence across scenes and its applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):978–994, 2011. 8
[26] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014. 1
[27] S. Mallat. A wavelet tour of signal processing. Academic press, 2nd edition, 1999. 4
[28] O. Matan, C. J. Burges, Y. LeCun, and J. S. Denker. Multi-digit recognition using a space displacement neural network. In NIPS, pages 488–495. Citeseer, 1991. 2
[29] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 891–898. IEEE, 2014. 9
[30] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. Image Processing, IEEE Transactions on, 14(9):1360–1371, 2005. 1, 2, 4, 7
[31] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014. 1, 2, 4, 7, 8
[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014. 1, 2, 4
[33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012. 8
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 1, 2, 3, 5
[35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 1, 2, 3, 5
[36] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365. Springer, 2010. 8
[37] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, 2013. 8
[38] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. CoRR, abs/1406.2984, 2014. 2
[39] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013. 4
[40] R. Wolf and J. C. Platt. Postal address block location using a convolutional locator network. Advances in Neural Information Processing Systems, pages 745–745, 1994. 2
[41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014. 2
[42] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In Computer Vision–ECCV 2014, pages 834–849. Springer, 2014. 1