A Review of Human Action Recognition Based on Deep Learning

ZHU Yu, ZHAO Jiang-Kun, WANG Yi-Ning, ZHENG Bing-Bing

Citation: ZHU Yu, ZHAO Jiang-Kun, WANG Yi-Ning, ZHENG Bing-Bing. A Review of Human Action Recognition Based on Deep Learning. ACTA AUTOMATICA SINICA, 2016, 42(6): 848-857. doi: 10.16383/j.aas.2016.c150710

doi: 10.16383/j.aas.2016.c150710
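Funds: 

National Natural Science Foundation of China 61370174, 61271349

Fundamental Research Funds for the Central Universities WH1214015

More Information
    Author Bio:

    ZHAO Jiang-Kun  Master student at the School of Information Science and Engineering, East China University of Science and Technology. His research interests cover intelligent video analysis and pattern recognition. E-mail: zhaojk90@gmail.com

    WANG Yi-Ning  Master student at the School of Information Science and Engineering, East China University of Science and Technology. His research interests cover intelligent video analysis and pattern recognition. E-mail: wyn885@126.com

    ZHENG Bing-Bing  Master student at the School of Information Science and Engineering, East China University of Science and Technology. His research interests cover intelligent video analysis and pattern recognition. E-mail: 13162233697@163.com

    Corresponding author:

    ZHU Yu  Professor at the School of Information Science and Engineering, East China University of Science and Technology. She received her Ph.D. degree from Nanjing University of Science and Technology in 1999. Her research interests cover intelligent video analysis and understanding, pattern recognition methods, and digital image processing methods and applications. Corresponding author of this paper. E-mail: zhuyu@ecust.edu.cn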

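  • Abstract: Human action recognition and deep learning theory are active research topics in intelligent video analysis. Both have drawn wide attention from academia and industry in recent years, and they underpin fields such as intelligent video analysis and understanding, video surveillance, and human-computer interaction. Deep learning algorithms have been applied successfully to speech recognition, image recognition, and other areas. Deep learning has achieved notable results in feature extraction from static images and is gradually being extended to action recognition in video, which adds a temporal dimension. After reviewing traditional action recognition methods, such as those based on spatio-temporal interest points, this paper introduces and summarizes recent progress in human action recognition built on different deep learning frameworks, including the convolutional neural network (CNN), independent subspace analysis (ISA), the restricted Boltzmann machine (RBM), and the recurrent neural network (RNN), together with how each is modeled for action recognition. It also analyzes and summarizes model performance, research progress, and the strengths and weaknesses of each class of methods.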
  • Fig. 1  Flowchart of action recognition
    Fig. 2  Sample actions from the Weizmann database
    Fig. 3  Sample actions from the KTH database
    Fig. 4  Sample actions from the UCF Sports database
    Fig. 5  Sample actions from the Hollywood database
    Fig. 6  Motion information representation based on the optical flow method
    Fig. 7  Computation of the HOG3D descriptor (3D histogram of oriented gradients)
    Fig. 8  The structure of 3DCNN
    Fig. 9  The structure of the multiresolution convolutional neural network
    Fig. 10  The structure of AutoEncoder
    Fig. 11  The structure of ISA-3D
    Fig. 12  The structure of RNN
    Fig. 13  The LSTM unit

    Table 1  Recognition results of methods based on geometric shape or motion information (%)

                  Fujiyoshi et al. [1]    Chaudhry et al. [2]
    Weizmann      -                       100
    KTH           92.73                   95.77

    Table 2  Results of feature extraction methods based on spatio-temporal interest points on the KTH, UCF Sports and Hollywood databases (%; each cell gives KTH/UCF Sports/Hollywood)

                     HOG3D       HOG/HOF     HOG         HOF         Cuboids     ESURF
    Harris 3D [5]    89/80/44    92/78/45    81/71/33    92/75/43    -           -
    Cuboids [6]      90/83/46    89/78/46    82/73/39    88/77/43    89/77/45    -
    Hessian [8]      85/79/41    89/79/46    78/66/36    89/75/43    -           81/77/38
    Dense [11]       85/86/45    86/82/47    79/77/39    88/83/46    -           -

    Table 3  Results of CNN-based action recognition algorithms (%)

                            KTH     UCF101
    Ji et al. [29]          90.2    -
    Simonyan et al. [33]    -       88

    Table 4  Results of ISA on three databases (%)

                      KTH     UCF Sports    Hollywood 2
    Le et al. [36]    93.9    86.5          53.3
  • [1] Fujiyoshi H, Lipton A J, Kanade T. Real-time human motion analysis by image skeletonization. IEICE Transactions on Information and Systems, 2004, 87-D(1): 113-120
    [2] Chaudhry R, Ravichandran A, Hager G, Vidal R. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL: IEEE, 2009. 1932-1939
    [3] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition. San Diego, CA, USA: IEEE, 2005. 886-893
    [4] Lowe D G. Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra: IEEE, 1999. 1150-1157
    [5] Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition. Cambridge: IEEE, 2004. 32-36
    [6] Dollar P, Rabaud V, Cottrell G, Belongie S. Behavior recognition via sparse spatio-temporal features. In: Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. Beijing, China: IEEE, 2005. 65-72
    [7] Rapantzikos K, Avrithis Y, Kollias S. Dense saliency-based spatiotemporal feature points for action recognition. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL: IEEE, 2009. 1454-1461
    [8] Knopp J, Prasad M, Willems G, Timofte R, Van Gool L. Hough transform and 3D SURF for robust three dimensional classification. In: Proceedings of the 11th European Conference on Computer Vision (ECCV 2010). Berlin, Heidelberg: Springer, 2010. 589-602
    [9] Kläser A, Marszałek M, Schmid C. A spatio-temporal descriptor based on 3D-gradients. In: Proceedings of the 19th British Machine Vision Conference. Leeds: BMVA Press, 2008. 99.1-99.10
    [10] Wang H, Ullah M M, Kläser A, Laptev I, Schmid C. Evaluation of local spatio-temporal features for action recognition. In: Proceedings of the 2009 British Machine Vision Conference. London, UK: BMVA Press, 2009. 124.1-124.11
    [11] Wang H, Kläser A, Schmid C, Liu C L. Action recognition by dense trajectories. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Providence, RI: IEEE, 2011. 3169-3176
    [12] Hinton G E. Learning multiple layers of representation. Trends in Cognitive Sciences, 2007, 11(10): 428-434
    [13] Deng L, Yu D. Deep learning: methods and applications. Foundations and Trends® in Signal Processing, 2014, 7(3-4): 197-387
    [14] Schmidhuber J. Deep learning in neural networks: an overview. Neural Networks, 2015, 61: 85-117
    [15] Gorelick L, Blank M, Shechtman E, Irani M, Basri R. Actions as space-time shapes. In: Proceedings of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE, 2005. 1395-1402
    [16] Soomro K, Zamir A R. Action recognition in realistic sports videos. Computer Vision in Sports. Switzerland: Springer, 2014. 181-208
    [17] Rodriguez M D, Ahmed J, Shah M. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK: IEEE, 2008. 1-8
    [18] Marszalek M, Laptev I, Schmid C. Actions in context. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL: IEEE, 2009. 2929-2936
    [19] Yang X D, Tian Y L. Effective 3D action recognition using EigenJoints. Journal of Visual Communication and Image Representation, 2014, 25(1): 2-11
    [20] Bobick A, Davis J. An appearance-based representation of action. In: Proceedings of the 13th International Conference on Pattern Recognition. Vienna: IEEE, 1996. 307-312
    [21] Weinland D, Ronfard R, Boyer E. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 2006, 104(2-3): 249-257
    [22] Bobick A F, Davis J W. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3): 257-267
    [23] Sarikaya R, Hinton G E, Deoras A. Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(4): 778-784
    [24] Ren Y F, Wu Y. Convolutional deep belief networks for feature extraction of EEG signal. In: Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN). Beijing, China: IEEE, 2014. 2850-2853
    [25] Bengio Y. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2009, 2(1): 1-127
    [26] LeCun Y, Ranzato M. Deep learning tutorial. In: Tutorials in International Conference on Machine Learning (ICML13). Atlanta, USA: Citeseer, 2013.
    [27] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems. Lake Tahoe, Nevada, United States, 2012. 1097-1105
    [28] Bouvrie J. Notes on Convolutional Neural Networks. MIT CBCL Technical Report, 2006, 38-44
    [29] Ji S W, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231
    [30] Chéron G, Laptev I, Schmid C. P-CNN: pose-based CNN features for action recognition. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015. 3218-3226
    [31] Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition. arXiv: 1604.04494, 2016.
    [32] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Li F F. Large-scale video classification with convolutional neural networks. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, OH: IEEE, 2014. 1725-1732
    [33] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates, Inc., 2014. 568-576
    [34] Poultney C, Chopra S, Cun Y L. Efficient learning of sparse representations with an energy-based model. In: Proceedings of Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2006. 1137-1144
    [35] Bengio Y, Lamblin P, Popovici D, Larochelle H. Greedy layer-wise training of deep networks. In: Proceedings of Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2006.
    [36] Le Q V, Zou W Y, Yeung S Y, Ng A Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Providence, RI: IEEE, 2011. 3361-3368
    [37] Hyvärinen A, Hurri J, Hoyer P O. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. London: Springer-Verlag, 2009.
    [38] Hinton G. A practical guide to training restricted Boltzmann machines. Momentum, 2010, 9(1): 926
    [39] Fischer A, Igel C. An introduction to restricted Boltzmann machines. In: Proceedings of the 17th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Buenos Aires, Argentina: Springer, 2012. 14-36
    [40] Larochelle H, Bengio Y. Classification using discriminative restricted Boltzmann machines. In: Proceedings of the 25th International Conference on Machine Learning. New York: ACM, 2008. 536-543
    [41] Chen H, Murray A F. Continuous restricted Boltzmann machine with an implementable training algorithm. IEE Proceedings-Vision, Image and Signal Processing, 2003, 150(3): 153-158
    [42] Taylor G W, Hinton G E. Factored conditional restricted Boltzmann machines for modeling motion style. In: Proceedings of the 26th Annual International Conference on Machine Learning. New York: ACM, 2009. 1025-1032
    [43] Chen B, Ting J A, Marlin B, de Freitas N. Deep learning of invariant spatio-temporal features from video. In: Proceedings of Conference on Neural Information Processing Systems (NIPS) Workshop on Deep Learning and Unsupervised Feature Learning. Whistler, BC, Canada, 2010.
    [44] Pineda F J. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 1987, 59(19): 2229-2232
    [45] Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv: 1412.3555, 2014.
    [46] Omlin C W, Giles C L. Training second-order recurrent neural networks using hints. In: Proceedings of the 9th International Workshop Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992. 361-366
    [47] Sak H, Senior A, Beaufays F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv: 1402.1128, 2014.
    [48] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780
    [49] Sak H, Senior A, Beaufays F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Proceedings of the 2014 Annual Conference of International Speech Communication Association (INTERSPEECH). Singapore: ISCA, 2014. 338-342
    [50] Ng J Y H, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G. Beyond short snippets: deep networks for video classification. arXiv: 1503.08909, 2015.
    [51] Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T. Long-term recurrent convolutional networks for visual recognition and description. arXiv: 1411.4389, 2014.
Publication history
  • Received:  2015-10-31
  • Accepted:  2016-04-18
  • Published:  2016-06-20
