Abstract: Fine-grained image categorization is a challenging task in computer vision that aims to distinguish subordinate categories, such as different species of birds. Because inter-class differences are subtle while intra-class variations are large, traditional categorization algorithms have to rely on a large amount of manual annotation. Recently, with the advances of deep learning, deep convolutional neural networks have provided new opportunities for fine-grained image categorization, and the many algorithms built on deep convolutional features have advanced the field rapidly. In this paper, starting from the definition of the problem and its research significance, we introduce the current state of fine-grained image categorization algorithms. We then analyze the differences between algorithms from the strongly supervised and weakly supervised perspectives, and compare their performance on commonly used datasets. Finally, we summarize these methods and discuss potential future research directions and the major challenges facing the field.
1) Associate editor responsible for this paper: WANG Liang
Table 1 Performance of different algorithms on CUB200-2011[1] (BBox denotes bounding box annotations; Parts denotes part annotations)
| Algorithm | Annotation marks (columns: BBox train, Parts train, BBox test, Parts test) | Brief description | Accuracy (%) |
|---|---|---|---|
| CUB[1] | √ √ | SIFT + BoW + SVM | 10.3 |
| CUB[1] | √ √ √ √ | SIFT + BoW + SVM | 17.3 |
| POOF[26] | √ √ √ | POOF + SVM | 56.8 |
| POOF[26] | √ √ √ √ | POOF + SVM | 73.3 |
| Alignment[31] | √ √ | Fisher + SVM | 62.7 |
| Symbiotic[30] | √ √ | Fisher + SVM | 61 |
| DeCAF[25] | √ √ | Alex-Net + Logistic Regression | 61 |
| Part R-CNN[43] | √ √ | Alex-Net + Fine-Tune + SVM | 73.9 |
| Pose Normalized CNN[48] | √ √ | Alex-Net + Fine-Tune + SVM | 75.7 |
| Pose Normalized CNN[48] | √ √ √ √ | Alex-Net + Fine-Tune + SVM | 85.4 |
| Two-level Attention[56] | | Alex-Net | 69.7 |
| Two-level Attention[56] | | VGG16-Net | 77.9 |
| Zhang et al.[12] | | VGG16-Net + Fine-Tune + SVM | 79.3 |
| Constellations[58] | | VGG19-Net + Fine-Tune + Flip + SVM | 81 |
| Bilinear CNN[13] | | VGG19-Net/VGG-M + Flip | 84.1 |
| Spatial Transformer Net[55] | | Inception[62] + Flip | 84.1 |
[1] Wah C, Branson S, Welinder P, Perona P, Belongie S. The Caltech-UCSD Birds-200-2011 Dataset, Technical Report CNS-TR-2011-001, California Institute of Technology, Pasadena, CA, USA, 2011
[2] Bosch A, Zisserman A, Muñoz X. Scene classification using a hybrid generative/discriminative approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(4): 712-727 doi: 10.1109/TPAMI.2007.70716
[3] Wu J X, Rehg J M. CENTRIST: a visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(8): 1489-1501 doi: 10.1109/TPAMI.2010.224
[4] Gehler P, Nowozin S. On feature combination for multiclass object classification. In: Proceedings of the 12th IEEE International Conference on Computer Vision. Kyoto, Japan: IEEE, 2009. 221-228
[5] Jarrett K, Kavukcuoglu K, Ranzato M, LeCun Y. What is the best multi-stage architecture for object recognition? In: Proceedings of the 12th IEEE International Conference on Computer Vision. Kyoto, Japan: IEEE, 2009. 2146-2153
[6] Wright J, Yang A Y, Ganesh A, Sastry S S, Ma Y. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(2): 210-227 doi: 10.1109/TPAMI.2008.79
[7] Li Xiao-Li, Da Fei-Peng. A rapid method for 3D face recognition based on rejection algorithm. Acta Automatica Sinica, 2010, 36(1): 153-158 http://www.aas.net.cn/CN/abstract/abstract13642.shtml
[8] Khosla A, Jayadevaprakash N, Yao B P, Li F F. Novel dataset for fine-grained image categorization. In: Proceedings of the 1st Workshop on Fine-Grained Visual Categorization (FGVC), IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Colorado Springs, USA: IEEE, 2011.
[9] Nilsback M E, Zisserman A. Automated flower classification over a large number of classes. In: Proceedings of the 6th Indian Conference on Computer Vision, Graphics & Image Processing. Bhubaneswar, India: IEEE, 2008. 722-729
[10] Krause J, Stark M, Deng J, Li F F. 3D object representations for fine-grained categorization. In: Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops (ICCVW). Sydney, Australia: IEEE, 2013. 554-561
[11] Maji S, Rahtu E, Kannala J, Blaschko M, Vedaldi A. Fine-grained visual classification of aircraft [Online], available: https://arxiv.org/abs/1306.5151, June 21, 2013
[12] Zhang Y, Wei X S, Wu J X, Cai J F, Lu J B, Nguyen V A, Do M N. Weakly supervised fine-grained categorization with part-based image representation. IEEE Transactions on Image Processing, 2016, 25(4): 1713-1725 doi: 10.1109/TIP.2016.2531289
[13] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition. In: Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 1449-1457
[14] Zhang Lin-Bo, Wang Chun-Heng, Xiao Bai-Hua, Shao Yun-Xue. Image representation using bag-of-phrases. Acta Automatica Sinica, 2012, 38(1): 46-54 http://www.aas.net.cn/CN/abstract/abstract17634.shtml
[15] Yu Wang-Sheng, Tian Xiao-Hua, Hou Zhi-Qiang. A new image feature descriptor based on region edge statistical. Chinese Journal of Computers, 2014, 37(6): 1398-1410 http://www.cnki.com.cn/Article/CJFDTOTAL-JSJX201406018.htm
[16] Yan Xue-Jun, Zhao Chun-Xia, Yuan Xia. 2DPCA-SIFT: an efficient local feature descriptor. Acta Automatica Sinica, 2014, 40(4): 675-682 http://www.aas.net.cn/CN/abstract/abstract18333.shtml
[17] Lowe D G. Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE, 1999. 1150-1157
[18] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005. 886-893
[19] Jégou H, Douze M, Schmid C, Pérez P. Aggregating local descriptors into a compact image representation. In: Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE, 2010. 3304-3311
[20] Perronnin F, Dance C. Fisher kernels on visual vocabularies for image categorization. In: Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, USA: IEEE, 2007. 1-8
[21] Sánchez J, Perronnin F, Mensink T, Verbeek J. Image classification with the Fisher vector: theory and practice. International Journal of Computer Vision, 2013, 105(3): 222-245 doi: 10.1007/s11263-013-0636-x
[22] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA: MIT Press, 2012. 1097-1105
[23] Gao Ying-Ying, Zhu Wei-Bin. Deep neural networks with visible intermediate layers. Acta Automatica Sinica, 2015, 41(9): 1627-1637 http://www.aas.net.cn/CN/abstract/abstract18736.shtml
[24] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444 doi: 10.1038/nature14539
[25] Donahue J, Jia Y Q, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T. DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st International Conference on Machine Learning. Beijing, China: ACM, 2014. 647-655
[26] Berg T, Belhumeur P N. POOF: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Portland, USA: IEEE, 2013. 955-962
[27] Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification. In: Proceedings of the 11th European Conference on Computer Vision. Berlin Heidelberg, Germany: Springer, 2010. 143-156
[28] Bo L, Ren X, Fox D. Kernel descriptors for visual recognition. In: Proceedings of the 24th Annual Conference on Neural Information Processing Systems. Vancouver, Canada: MIT Press, 2010. 244-252
[29] Branson S, Van Horn G, Wah C, Perona P, Belongie S. The ignorant led by the blind: a hybrid human-machine vision system for fine-grained categorization. International Journal of Computer Vision, 2014, 108(1-2): 3-29 doi: 10.1007/s11263-014-0698-4
[30] Chai Y N, Lempitsky V, Zisserman A. Symbiotic segmentation and part localization for fine-grained categorization. In: Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV). Sydney, Australia: IEEE, 2013. 321-328
[31] Gavves E, Fernando B, Snoek C G M, Smeulders A W M, Tuytelaars T. Fine-grained categorization by alignments. In: Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV). Sydney, Australia: IEEE, 2013. 1713-1720
[32] Yao B P, Bradski G, Li F F. A codebook-free and annotation-free approach for fine-grained image categorization. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Providence, USA: IEEE, 2012. 3466-3473
[33] Yang S L, Bo L F, Wang J, Shapiro L. Unsupervised template learning for fine-grained object recognition. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: MIT Press, 2012. 3122-3130
[34] Branson S, Wah C, Schroff F, Babenko B, Welinder P, Perona P, Belongie S. Visual recognition with humans in the loop. In: Proceedings of the 11th European Conference on Computer Vision. Berlin Heidelberg, Germany: Springer, 2010. 438-451
[35] Wah C, Branson S, Perona P, Belongie S. Multiclass recognition and part localization with humans in the loop. In: Proceedings of the 13th IEEE International Conference on Computer Vision (ICCV). Barcelona, Spain: IEEE, 2011. 2524-2531
[36] LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, Jackel L D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541-551 doi: 10.1162/neco.1989.1.4.541
[37] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278-2324 doi: 10.1109/5.726791
[38] Zeiler M D, Fergus R. Visualizing and understanding convolutional networks. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 818-833
[39] Gong Y C, Wang L W, Guo R Q, Lazebnik S. Multi-scale orderless pooling of deep convolutional activation features. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 392-407
[40] Cimpoi M, Maji S, Vedaldi A. Deep filter banks for texture recognition and segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 3828-3836
[41] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition [Online], available: https://arxiv.org/abs/1409.1556, April 10, 2015
[42] Welinder P, Branson S, Mita T, Wah C, Schroff F, Belongie S, Perona P. Caltech-UCSD Birds 200, Technical Report CNS-TR-2010-001, California Institute of Technology, Pasadena, CA, USA, 2010
[43] Zhang N, Donahue J, Girshick R, Darrell T. Part-based R-CNNs for fine-grained category detection. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 834-849
[44] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, USA: IEEE, 2014. 580-587
[45] Viola P, Jones M J. Robust real-time face detection. International Journal of Computer Vision, 2004, 57(2): 137-154 doi: 10.1023/B:VISI.0000013087.49260.fb
[46] Wu J X, Liu N N, Geyer C, Rehg J M. C^4: a real-time object detection framework. IEEE Transactions on Image Processing, 2013, 22(10): 4096-4107 doi: 10.1109/TIP.2013.2270111
[47] Uijlings J R R, van de Sande K E A, Gevers T, Smeulders A W M. Selective search for object recognition. International Journal of Computer Vision, 2013, 104(2): 154-171 doi: 10.1007/s11263-013-0620-5
[48] Branson S, Van Horn G, Belongie S, Perona P. Bird species categorization using pose normalized deep convolutional nets [Online], available: https://arxiv.org/abs/1406.2952, June 11, 2014
[49] Branson S, Beijbom O, Belongie S. Efficient large-scale structured learning. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Portland, USA: IEEE, 2013. 1806-1813
[50] Krause J, Jin H L, Yang J C, Li F F. Fine-grained recognition without part annotations. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 5546-5555
[51] Guillaumin M, Küttel D, Ferrari V. Imagenet auto-annotation with segmentation propagation. International Journal of Computer Vision, 2014, 110(3): 328-348 doi: 10.1007/s11263-014-0713-9
[52] Kuettel D, Guillaumin M, Ferrari V. Segmentation propagation in imagenet. In: Proceedings of the 12th European Conference on Computer Vision. Berlin Heidelberg, Germany: Springer, 2012. 459-473
[53] Lin D, Shen X Y, Lu C W, Jia J Y. Deep LAC: deep localization, alignment and classification for fine-grained recognition. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 1666-1674
[54] Xu Z, Huang S L, Zhang Y, Tao D C. Augmenting strong supervision using web data for fine-grained categorization. In: Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 2524-2532
[55] Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial transformer networks. In: Proceedings of the 29th Annual Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2015. 2017-2025
[56] Xiao T J, Xu Y C, Yang K Y, Zhang J X, Peng Y X, Zhang Z. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 842-850
[57] Zhang Y, Wu J X, Cai J F. Compact representation for image classification: to choose or to compress. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, USA: IEEE, 2014. 907-914
[58] Simon M, Rodner E. Neural activation constellations: unsupervised part model discovery with convolutional networks. In: Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 1143-1151
[59] Simon M, Rodner E, Denzler J. Part detector discovery in deep convolutional neural networks. In: Proceedings of the 12th Asian Conference on Computer Vision. Singapore: Springer, 2014. 162-177
[60] Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps [Online], available: https://arxiv.org/abs/1312.6034, April 19, 2014
[61] Wang D Q, Shen Z Q, Shao J, Zhang W, Xue X Y, Zhang Z. Multiple granularity descriptors for fine-grained categorization. In: Proceedings of the 15th IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 2399-2406
[62] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 1-9
[63] Hall D, Perona P. Fine-grained classification of pedestrians in video: benchmark and state of the art. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 5482-5491
[64] Liu Y, Zhang D S, Lu G J, Ma W Y. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 2007, 40(1): 262-282 doi: 10.1016/j.patcog.2006.04.045
[65] Datta R, Joshi D, Li J, Wang J Z. Image retrieval: ideas, influences, and trends of the new age. ACM Computing Surveys, 2008, 40(2): Article No. 5 http://dl.acm.org/citation.cfm?id=1348248
[66] Felzenszwalb P F, Girshick R B, McAllester D, Ramanan D. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(9): 1627-1645 doi: 10.1109/TPAMI.2009.167
[67] Wei X S, Luo J H, Wu J X. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing, 2017, 26(6): 2868-2881 doi: 10.1109/TIP.2017.2688133
[68] Xie L X, Wang J D, Zhang B, Tian Q. Fine-grained image search. IEEE Transactions on Multimedia, 2015, 17(5): 636-647 doi: 10.1109/TMM.2015.2408566