
Weakly Supervised Real-time Object Detection Based on Saliency Map

Li Yang, Wang Pu, Liu Yang, Liu Guo-Jun, Wang Chun-Yu, Liu Xiao-Yan, Guo Mao-Zu

Citation: Li Yang, Wang Pu, Liu Yang, Liu Guo-Jun, Wang Chun-Yu, Liu Xiao-Yan, Guo Mao-Zu. Weakly supervised real-time object detection based on saliency map. Acta Automatica Sinica, 2020, 46(2): 242−255. doi: 10.16383/j.aas.c180789


doi: 10.16383/j.aas.c180789
Details
    Author biographies:

    Li Yang: Ph.D. candidate at the School of Computer Science and Technology, Harbin Institute of Technology. Received the master degree from the School of Computer Science and Technology, Harbin Institute of Technology in 2013. Main research interests: computer vision and machine learning. E-mail: liyang13@hit.edu.cn

    Wang Pu: Received the master degree from the School of Computer Science and Technology, Harbin Institute of Technology in 2018. Main research interests: computer vision and machine learning. E-mail: wangpu@hit.edu.cn

    Liu Yang: Ph.D., associate professor at the School of Computer Science and Technology, Harbin Institute of Technology. Main research interests: machine learning, image processing, and computer vision. Corresponding author of this paper. E-mail: yliu76@hit.edu.cn

    Liu Guo-Jun: Ph.D., associate professor at the School of Computer Science and Technology, Harbin Institute of Technology. Main research interests: computer vision, image processing, and pattern recognition. E-mail: hitliu@hie.edu.cn

    Wang Chun-Yu: Ph.D., associate professor at the School of Computer Science and Technology, Harbin Institute of Technology. Main research interests: machine learning and computational biology. E-mail: chunyu@hit.edu.cn

    Liu Xiao-Yan: Ph.D., associate professor at the School of Computer Science and Technology, Harbin Institute of Technology. Main research interests: machine learning and computational biology. E-mail: liuxiaoyan@hit.edu.cn

    Guo Mao-Zu: Ph.D., professor at the School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture. Main research interests: machine learning, data mining, bioinformatics, and computer vision. E-mail: guomaozu@bucea.edu.cn

Weakly Supervised Real-time Object Detection Based on Saliency Map

Funds: Supported by National Key Research and Development Program of China (2016YFC0901902) and National Natural Science Foundation of China (61671188, 61571164, 61976071, 61871020)
  • Abstract: Deep convolutional neural networks (DCNN) for object detection are trained with full object annotations, which has greatly improved detection accuracy. However, obtaining bounding-box annotations is time-consuming and expensive. Moreover, real-time performance is another key issue that limits the practicality of object detection. To overcome these two problems, this paper proposes a weakly supervised real-time object detection method that relies only on image-level annotations. The method consists of three sub-modules: 1) a classification network and a backward pass are first applied to generate class-specific saliency maps, which indicate the locations of objects in the image; 2) pseudo-bounding-boxes are generated from the class-specific saliency maps; 3) the pseudo-bounding-boxes are then treated as ground-truth annotations to optimize the parameters of a real-time detection network. Unlike other weakly supervised object detection methods, this method requires no object proposal extraction, and detection results for a test image are obtained with a single forward pass of the network, which greatly accelerates detection (real time). In addition, the method is simple to use: to detect objects of new categories, only the classification network and the detection network for those categories need to be trained. The framework therefore generalizes well and offers a new approach to weakly supervised real-time detection. Experiments on the PASCAL VOC 2007 dataset show that: 1) the proposed method achieves a notable improvement in detection accuracy; 2) it achieves real-time detection under weak supervision.
  • Fig. 1  Pipeline for weakly supervised real-time object detection

    Fig. 2  Class-specific saliency maps

    Fig. 3  Binarized class-specific saliency maps and the corresponding pseudo-bounding-boxes

    Fig. 4  Feature map cells for $4\times 4$ and $8\times 8$ and their corresponding object default bounding-boxes

    Fig. 5  Object detection precision for 20 categories under different thresholds

    Fig. 6  Successful detection examples on PASCAL VOC 2007 test set

    Fig. 7  Unsuccessful detection examples on PASCAL VOC 2007 test set

    Table 1  Thresholds of the binarized class-specific saliency maps (20 categories of the PASCAL VOC dataset)

    Category  Threshold    Category      Threshold    Category  Threshold
    Plane     0.7          Cat           0.4          Person    0.5
    Bike      0.7          Chair         0.6          Plant     0.7
    Bird      0.5          Cow           0.5          Sheep     0.5
    Boat      0.8          Diningtable   0.3          Sofa      0.5
    Bottle    0.8          Dog           0.3          Train     0.7
    Bus       0.6          Horse         0.3          TV        0.8
    Car       0.6          Motorbike     0.5
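The thresholds above binarize each class-specific saliency map, and a pseudo-bounding-box is then taken around the surviving foreground pixels (sub-module 2). A simplified numpy sketch of that step, assuming one box per map; the paper's exact procedure (e.g. any per-instance grouping) may differ in detail:

```python
import numpy as np

def pseudo_bounding_box(saliency, threshold):
    """Binarize a class-specific saliency map with a class-dependent
    threshold (cf. Table 1) and return the tightest box
    (x_min, y_min, x_max, y_max) enclosing the foreground pixels."""
    mask = saliency >= threshold
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # no confident region for this class in the image
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# A saliency map with a bright rectangle gives that rectangle back as the box.
sal = np.zeros((10, 10))
sal[3:7, 2:9] = 0.9
box = pseudo_bounding_box(sal, 0.5)  # -> (2, 3, 8, 6)
```

The resulting boxes are then treated as ground truth when training the real-time detection network (sub-module 3).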

    Table 2  Object detection precision (%) on PASCAL VOC 2007 test set

    Method Bilen[11] Cinbis[7] Wang[32] Teh[17] WSDDN[10] WSDDN_Ens[10] WCCN[9] MELM[12, 14] PCL[13] Ours (07) Ours (07+12)
    mAP 27.7 30.2 31.6 34.5 34.8 39.3 42.8 47.3 48.8 33.9 39.3
    Plane 46.2 39.3 48.9 48.8 39.4 46.4 49.5 55.6 63.2 54.5 54.2
    Bike 46.9 43.0 42.3 45.9 50.1 58.3 60.6 66.9 69.9 52.9 60.0
    Bird 24.1 28.8 26.1 37.4 31.5 35.5 38.6 34.2 47.9 30.0 41.9
    Boat 16.4 20.4 11.3 26.9 16.3 25.9 29.2 29.1 22.6 15.2 20.7
    Bottle 12.2 8.0 11.9 9.2 12.6 14.0 16.2 16.4 27.3 7.8 11.4
    Bus 42.2 45.5 41.3 50.7 64.5 66.7 70.8 68.8 71.0 47.3 55.9
    Car 47.1 47.9 40.9 43.4 42.8 53.0 56.9 68.1 69.1 44.5 49.2
    Cat 35.2 22.1 34.7 43.6 42.6 39.2 42.5 43.0 49.6 62.5 71.3
    Chair 7.8 8.4 10.8 10.6 10.1 8.9 10.9 25.0 12.0 9.4 10.5
    Cow 28.3 33.5 34.7 35.9 35.7 41.8 44.1 65.6 60.1 17.6 23.2
    Table 12.7 23.6 18.8 27.0 24.9 26.6 29.9 45.3 51.5 39.3 48.5
    Dog 21.5 29.2 34.4 38.6 38.2 38.6 42.2 53.2 37.3 48.9 54.5
    Horse 30.1 38.5 35.4 48.5 34.4 44.7 47.9 49.6 63.3 45.7 52.1
    Motorbike 42.4 47.9 52.7 43.8 55.6 59.0 64.1 68.6 63.9 49.9 56.4
    Person 7.8 20.3 19.1 24.7 9.4 10.8 13.8 2.0 15.8 17.2 15.0
    Plant 20.0 20.0 17.4 12.1 14.7 17.3 23.5 25.4 23.6 15.7 17.6
    Sheep 26.6 35.8 35.9 29.0 30.2 40.7 45.9 52.5 48.8 18.4 23.3
    Sofa 20.6 30.8 33.3 23.2 40.7 49.6 54.1 56.8 55.3 29.3 42.5
    Train 35.9 41.0 34.8 48.8 54.7 56.9 60.8 62.1 61.2 43.7 47.6
    TV 29.6 20.1 46.5 41.9 46.9 50.8 54.5 57.1 62.1 28.2 30.4
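The precision figures above follow the PASCAL VOC protocol, in which a detection counts as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of the IoU computation for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes.
    PASCAL VOC counts a detection as correct when IoU >= 0.5."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

r = iou((0, 0, 2, 2), (1, 1, 3, 3))  # -> 1/7 ≈ 0.1429
```

Per-class average precision is then accumulated over ranked detections, and mAP is the mean of the 20 per-class APs.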

    Table 3  Object detection speed (FPS) and detection mean average precision on PASCAL VOC 2007 test set

    Method FPS mAP Dataset
    30 Hz DPM[24] 30 26.1 07
    Fast R-CNN[1] 0.5 70.0 07+12
    Faster R-CNN[4] 7 73.2 07+12
    YOLO_VGG[39] 21 66.4 07+12
    Fast_YOLO[39] 155 52.7 07+12
    SSD[3] 46 68.0 07
    WSDDN_Ens[10] 0.5 39.3 07
    WCCN[9] – 42.8 07
    MELM[12, 14] – 47.3 07
    PCL[13] 1.4 48.8 07
    Ours 45 39.3 07+12

    Table 4  The influence of different binarization thresholds on the detection precision (%)

    Threshold 0.3 0.4 0.5 0.6 0.7 Table 1
    mAP 26.6 28.7 30.3 29.4 26.7 33.9
    Plane 30 39.8 45.5 47.7 51.1 54.5
    Bike 45.8 37.9 51.6 53 54.2 52.9
    Bird 22.8 30.8 32 29.2 26.9 30.0
    Boat 6.3 5.2 8.2 10.5 15.1 15.2
    Bottle 2.5 2.8 4.1 5.3 5.9 7.8
    Bus 42.1 44.3 48.6 50 49.7 47.3
    Car 32.9 36.9 39.5 44.6 44.7 44.5
    Cat 59 63 58.5 47.6 23.8 62.5
    Chair 1 1.8 1 9.5 1.2 9.4
    Cow 18.6 16.1 22.5 16.5 15.6 17.6
    Table 40.4 36.5 38.1 29.9 24 39.3
    Dog 51.8 47.8 36.7 25.1 7.6 48.9
    Horse 41.6 44.5 44.2 40 31.3 45.7
    Motorbike 47.9 51 55.1 51.3 49 49.9
    Person 9.6 12.4 15.3 12.5 9.2 17.2
    Plant 11.1 11.6 11.3 10.7 17.1 15.7
    Sheep 12.7 14.4 20.7 17.7 15.6 18.4
    Sofa 25.2 29 27.9 31.8 26.6 29.3
    Train 26 28.8 34.6 39.6 42.9 43.7
    TV 5.2 10 10.7 15 22.2 28.2

    Table 5  Detection results (mAP) on the test set under different binarization thresholds: pseudo-bounding-boxes generated by the method of [23] versus our method

    Threshold [23] (%) Ours (%) Threshold [23] (%) Ours (%)
    0.3 16.7 26.6 0.6 16.1 29.4
    0.4 17.5 28.7 0.7 13.7 26.7
    0.5 17.5 30.3 Table 1 18.5 33.9
  • [1] Girshick R. Fast R-CNN. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 1440−1448
    [2] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE, 2014. 580−587
    [3] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y, Berg A C. SSD: single shot multibox detector. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 21−37
    [4] Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of the 2015 Advances in Neural Information Processing Systems. Montréal, Canada: MIT Press, 2015. 91−99
    [5] Song H O, Girshick R, Jegelka S, Mairal J, Harchaoui Z, Darrell T. On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024, 2014.
    [6] Li Yong, Lin Xiao-Zhu, Jiang Meng-Ying. Facial expression recognition with cross-connect LeNet-5 network. Acta Automatica Sinica, 2018, 44(1): 176−182 (in Chinese)
    [7] Cinbis R G, Verbeek J, Schmid C. Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1): 189−203 doi: 10.1109/TPAMI.2016.2535231
    [8] Shi M J, Ferrari V. Weakly supervised object localization using size estimates. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer, 2016. 105−121
    [9] Diba A, Sharma V, Pazandeh A, Pirsiavash H, Gool L V. Weakly supervised cascaded convolutional networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 914−922
    [10] Bilen H, Vedaldi A. Weakly supervised deep detection networks. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 2846−2854
    [11] Bilen H, Pedersoli M, Tuytelaars T. Weakly supervised object detection with convex clustering. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 1081−1089
    [12] Wan F, Wei P X, Jiao J B, Han Z J, Ye Q X. Min-entropy latent model for weakly supervised object detection. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 1297−1306
    [13] Tang P, Wang X G, Bai S, Shen W, Bai X, Liu W Y, Yuille A L. PCL: proposal cluster learning for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. DOI: 10.1109/TPAMI.2018.2876304
    [14] Wan F, Wei P X, Jiao J B, Han Z J, Ye Q X. Min-entropy latent model for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. DOI: 10.1109/CVPR.2018.00141
    [15] Xi Xue-Feng, Zhou Guo-Dong. A survey on deep learning for natural language processing. Acta Automatica Sinica, 2016, 42(10): 1445−1465 (in Chinese)
    [16] Chang Liang, Deng Xiao-Ming, Zhou Ming-Quan, Wu Zhong-Ke, Yuan Ye, Yang Shuo, Wang Hong-An. Convolutional neural networks in image understanding. Acta Automatica Sinica, 2016, 42(9): 1300−1312 (in Chinese)
    [17] Teh E W, Rochan M, Wang Y. Attention networks for weakly supervised object localization. In: Proceedings of the 2016 British Machine Vision Conference. York, UK: British Machine Vision Association, 2016.
    [18] Kantorov V, Oquab M, Cho M, Laptev I. Contextlocnet: context-aware deep network models for weakly supervised localization. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer, 2016. 350−365
    [19] Zhou B L, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 2921−2929
    [20] Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
    [21] Wei Y C, Feng J S, Liang X D, Cheng M M, Zhao Y, Yan S C. Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 1568−1576
    [22] Kolesnikov A, Lampert C H. Seed, expand and constrain: three principles for weakly-supervised image segmentation. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer, 2016. 695−711
    [23] Shimoda W, Yanai K. Distinct class-specific saliency maps for weakly supervised semantic segmentation. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer, 2016. 218−234
    [24] Sadeghi M A, Forsyth D. 30 Hz object detection with DPM V5. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 65−79
    [25] Dean T, Ruzon M A, Segal M, Shlens J, Vijayanarasimhan S, Yagnik J. Fast, accurate detection of 100, 000 object classes on a single machine. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE, 2013. 1814−1821
    [26] Van de Sande K E A, Uijlings J R R, Gevers T, Smeulders A W M. Segmentation as selective search for object recognition. In: Proceedings of the 2011 IEEE International Conference on Computer Vision. Colorado Springs, USA: IEEE, 2011. 1879−1886
    [27] Zitnick C L, Dollár P. Edge boxes: locating object proposals from edges. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 391−405
    [28] Dietterich T G, Lathrop R H, Lozano-Pérez T. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 1997, 89(1−2): 31−71 doi: 10.1016/S0004-3702(96)00034-3
    [29] Zhang D, Liu Y, Si L, Zhang J, Lawrence R D. Multiple instance learning on structured data. In: Proceedings of the 2011 Advances in Neural Information Processing Systems. Granada, Spain: MIT Press, 2011. 145−153
    [30] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3431−3440
    [31] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z H, Karpathy A, Khosla A, Bernstein M, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211−252 doi: 10.1007/s11263-015-0816-y
    [32] Wang C, Ren W Q, Huang K Q, Tan T N. Weakly supervised object localization with latent category learning. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 431−445
    [33] Papandreou G, Kokkinos I, Savalle P A. Untangling local and global deformations in deep convolutional networks for image classification and sliding window detection. arXiv preprint arXiv:1412.0296, 2014.
    [34] Tang P, Wang X G, Bai X, Liu W Y. Multiple instance detection network with online instance classifier refinement. arXiv preprint arXiv:1701.00138, 2017.
    [35] Wu J J, Yu Y N, Huang C, Yu K. Deep multiple instance learning for image classification and auto-annotation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3460−3469
    [36] Oquab M, Bottou L, Laptev I, Sivic J. Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE, 2014. 1717−1724
    [37] Zhu W J, Liang S, Wei Y C, Sun J. Saliency optimization from robust background detection. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE, 2014. 2814−2821
    [38] Zhu L, Chen Y H, Yuille A, Freeman W. Latent hierarchical structural learning for object detection. In: Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE, 2010. 1062−1069
    [39] Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 779−788
    [40] Springenberg J T, Dosovitskiy A, Brox T, Riedmiller M. Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
    [41] Cheng M M, Liu Y, Lin W Y, Zhang Z M, Posin P L, Torr P H S. BING: binarized normed gradients for objectness estimation at 300 fps. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE, 2014. 3286−3293
    [42] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
    [43] Yan J J, Lei Z, Wen L Y, Li S Z. The fastest deformable part model for object detection. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE, 2014. 2497−2504
Publication history
  • Received: 2018-11-27
  • Accepted: 2019-06-24
  • Available online: 2020-01-16
  • Issue published: 2020-03-06
