
Multi-order Information Fusion Method for Human Action Recognition

ZHANG Bing-Bing, GE Shu-Yu, WANG Qi-Long, LI Pei-Hua

Citation: ZHANG Bing-Bing, GE Shu-Yu, WANG Qi-Long, LI Pei-Hua. Multi-order Information Fusion Method for Human Action Recognition. Acta Automatica Sinica, 2021, 47(3): 609-619 doi: 10.16383/j.aas.c180265


doi: 10.16383/j.aas.c180265

Funds: 

National Natural Science Foundation of China 61971086

National Natural Science Foundation of China 61806140

National Natural Science Foundation of China 61471082

More Information
    Author Bio:

    ZHANG Bing-Bing  Ph.D. candidate at the School of Information and Communication Engineering, Dalian University of Technology. She received her master degree from Changchun University of Technology in 2016. Her research interest covers human action recognition, image classification, and deep learning

    GE Shu-Yu  Master student at the School of Information and Communication Engineering, Dalian University of Technology. He received his bachelor degree from Dalian University of Technology in 2016. His research interest covers image classification, human action recognition, and deep learning

    WANG Qi-Long  Ph.D., associate professor at the College of Intelligence and Computing, Tianjin University. His research interest covers image modeling, visual classification, and deep learning

    Corresponding author: LI Pei-Hua  Ph.D., professor at the School of Information and Communication Engineering, Dalian University of Technology. His research interest covers image classification and search using theoretical and computational methods of information geometry. Corresponding author of this paper
  • Abstract: Two-stream convolutional neural networks capture only first-order statistics of the local spatial and temporal features of a video, and at test time they average the classifier scores of multiple local video features to obtain the final prediction. However, first-order statistics cannot fully model the distributions of the spatial and temporal features, and the higher-order statistics among multiple local video features are not exploited at test time. To address these two problems, this paper proposes a multi-order information fusion method for videos based on second-order aggregation. First, a second-order two-stream model is built to extract second-order statistics of the local video features, which together with the first-order statistics form the multi-order information. Second, the local video features carrying each order of information are separately aggregated by second-order pooling to form high-order global video representations. Finally, two strategies are adopted to fuse these representations. Experiments show that the proposed method effectively improves action recognition accuracy, outperforming the two-stream convolutional network by 8% on HMDB51 and by 2.1% on UCF101; after fusing improved dense trajectory (IDT) features, performance improves further.
    Recommended by Associate Editor LAI Jian-Huang
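The second-order statistics referred to in the abstract are covariance descriptors of local convolutional features, normalized by an approximate matrix square root in the spirit of iSQRT-COV [19] (the iSQRT-COV-ResNet-50-2K models of Table 1). Below is a minimal NumPy sketch of this style of second-order pooling, with illustrative shapes, function names, and iteration count; it is not the authors' implementation, which is built on MatConvNet [24].

```python
import numpy as np

def covariance_pooling(X):
    """Second-order pooling: covariance of local features.

    X: (n, d) array of n local features of dimension d
    (e.g., the spatial positions of the last convolutional feature map).
    """
    Xc = X - X.mean(axis=0, keepdims=True)        # center the features
    return Xc.T @ Xc / (X.shape[0] - 1)           # (d, d) covariance matrix

def newton_schulz_sqrt(A, num_iters=5):
    """Approximate matrix square root of a symmetric PSD matrix by the
    coupled Newton-Schulz iteration used for normalization in [19]."""
    d = A.shape[0]
    norm = np.trace(A)                            # pre-normalize so the iteration converges
    Y, Z, I = A / norm, np.eye(d), np.eye(d)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * np.sqrt(norm)                      # undo the pre-normalization

# Illustrative use: 49 spatial positions of 256-d convolutional features
X = np.random.randn(49, 256)
S = newton_schulz_sqrt(covariance_pooling(X))
descriptor = S.reshape(-1)                        # symmetric, so the upper triangle suffices
```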
  • Fig. 1  The flow chart of multi-order information fusion based on second-order aggregation

    Fig. 2  The flow chart of compact bilinear pooling of one local video feature z
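The compact bilinear pooling of Fig. 2 [20] avoids forming the full outer product $zz^{T}$: the local feature is hashed into two independent Count Sketches [21], and their circular convolution, computed in the frequency domain, approximates the bilinear feature in a much lower dimension. A minimal sketch under assumed sizes; the 512-d input and 8192-d projection are illustrative, not the paper's settings.

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Count Sketch [21]: scatter the signed entries of v into d bins."""
    sketch = np.zeros(d)
    np.add.at(sketch, h, s * v)                   # accumulate hash collisions
    return sketch

def compact_bilinear(z, h1, s1, h2, s2, d):
    """Tensor Sketch approximation of the outer product of z with itself [20]:
    circular convolution of two Count Sketches, via the convolution theorem."""
    p1 = np.fft.rfft(count_sketch(z, h1, s1, d))
    p2 = np.fft.rfft(count_sketch(z, h2, s2, d))
    return np.fft.irfft(p1 * p2, n=d)

# Illustrative sizes: a 512-d local feature sketched into d = 8192 dimensions
rng = np.random.default_rng(0)
d_in, d = 512, 8192
h1, h2 = rng.integers(0, d, size=d_in), rng.integers(0, d, size=d_in)   # random bins
s1, s2 = rng.choice([-1.0, 1.0], d_in), rng.choice([-1.0, 1.0], d_in)   # random signs
y = compact_bilinear(rng.standard_normal(d_in), h1, s1, h2, s2, d)
```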

    Fig. 3  Fusion at the representation level

    Fig. 4  Fusion at the classifier score level
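In outline, the two strategies of Figs. 3 and 4 differ only in where the streams are combined: representation-level fusion concatenates the global video representations before a single classifier, while classifier score-level fusion classifies each representation separately and combines the class scores. A schematic sketch; the helper names and the weighted average are our assumptions, not the paper's notation.

```python
import numpy as np

def fuse_representation_level(reps, classifier):
    """Fig. 3: concatenate the per-stream global video representations
    into one vector, then apply a single classifier."""
    return classifier(np.concatenate(reps))

def fuse_score_level(reps, classifiers, weights=None):
    """Fig. 4: classify each stream's representation separately and
    combine the resulting class scores (here by weighted average)."""
    scores = np.stack([clf(r) for clf, r in zip(classifiers, reps)])
    return np.average(scores, axis=0, weights=weights)
```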

    Fig. 5  Evaluation of the number $N$ of frames uniformly sampled from the video on HMDB51
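Fig. 5 varies the number $N$ of frames drawn uniformly from each video. A minimal sketch of such uniform sampling, assuming the common segment-center convention; the paper may place samples differently.

```python
import numpy as np

def uniform_sample_indices(num_frames, N):
    """Indices of N frames evenly spread over a video: split the video
    into N equal temporal segments and take the center of each."""
    edges = np.linspace(0, num_frames, N + 1)
    return ((edges[:-1] + edges[1:]) / 2).astype(int)
```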

    Fig. 6  Evaluation of the representation-level and classifier score-level fusion strategies under different dimensions $d$ of the video representation on HMDB51 and UCF101

    Table 1  Comparison of the first-order and second-order spatial and temporal networks on UCF101 and HMDB51

    Model | Network type | UCF101 (%) | HMDB51 (%)
    ResNet-50 [9] | First-order spatial | 82.30 | 48.90
    iSQRT-COV-ResNet-50-2K | Second-order spatial | 85.29 | 49.65
    ResNet-50 [9] | First-order temporal | 87.00 | 55.80
    iSQRT-COV-ResNet-50-2K | Second-order temporal | 88.07 | 57.64

    Table 2  Evaluation of the effectiveness of multi-order information fusion on UCF101 and HMDB51

    First-order spatial stream | First-order temporal stream | Second-order spatial stream | Second-order temporal stream | UCF101 (%) | HMDB51 (%)
    91.70 61.20
    92.90 65.17
    91.34 61.63
    92.67 63.50
    92.50 65.18
    92.96 66.14
    91.78 60.60
    91.12 58.71
    92.75 64.74

    Table 3  Evaluation of fusing different multi-order information of the video based on second-order aggregation on UCF101 and HMDB51

    First-order spatial info | First-order temporal info | Second-order spatial info | Second-order temporal info | UCF101 (%) | HMDB51 (%)
    89.28 64.24
    87.57 59.56
    92.58 65.93
    92.07 64.10
    92.68 68.02
    92.60 67.45
    88.64 61.44
    92.55 64.88
    92.98 68.15

    Table 4  Test speed comparison of different fusion methods

    Method | Test scheme | Time (s/video)
    First-order two-stream fusion (baseline) [9] | 10-crop | 9.670
    Second-order two-stream fusion | 10-crop | 10.459
    First-order + second-order two-stream fusion | 10-crop | 20.129
    Second-order aggregation of multi-order information | 1-crop | 6.412
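The "10-crop" scheme in Table 4 is the standard test-time protocol of five spatial crops (four corners plus center) and their horizontal flips, with scores averaged over the ten views; "1-crop" uses a single view, which helps explain the lower test time in the last row. A minimal sketch, assuming an (H, W, C) frame layout:

```python
import numpy as np

def ten_crop(frame, size):
    """Four corner crops + center crop of side `size`, each with its
    horizontal flip: ten views per frame, stacked along a new axis."""
    H, W, _ = frame.shape
    offsets = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size),
               ((H - size) // 2, (W - size) // 2)]
    crops = [frame[y:y + size, x:x + size] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]          # horizontal flips
    return np.stack(crops)                        # (10, size, size, C)
```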

    Table 5  Comparison of different human action recognition methods based on two-stream convolutional networks

    Method | Network architecture | UCF101 (%) | HMDB51 (%)
    Two-stream [6] | VGG-M | 88.0 | 59.4
    Two-stream, 3D conv + 3D pooling [7] | VGG-16 | 92.5 | 66.4
    Two-stream [9] | ResNet-50 | 91.7 | 61.2
    ST-ResNet* [8] | ResNet-50 | 93.4 | 66.4
    ST-multiplier network [9] | ResNet-50 (spatial), ResNet-152 (temporal) | 94.2 | 68.9
    Two-stream fusion + IDT [7] | VGG-16 | 93.5 | 69.2
    ST-ResNet + IDT [8] | ResNet-50 | 94.6 | 70.3
    ST-multiplier + IDT [9] | ResNet-50 (spatial), ResNet-152 (temporal) | 94.9 | 72.2
    Ours | ResNet-50 | 93.8 | 69.2
    Ours + joint training [8] | ResNet-50 | 94.1 | 70.7
    Ours + IDT | ResNet-50 | 94.6 | 74.4
  • [1] Zhu Yu, Zhao Jiang-Kun, Wang Yi-Ning, Zheng Bing-Bing. A review of human action recognition based on deep learning. Acta Automatica Sinica, 2016, 42(6): 848-857 doi: 10.16383/j.aas.2016.c150710
    [2] Su Ben-Yue, Jiang Jing, Tang Qing-Feng, Sheng Min. Human dynamic action recognition based on functional data analysis. Acta Automatica Sinica, 2017, 43(6): 866-876 doi: 10.16383/j.aas.2017.c160120
    [3] Zhou Feng-Yu, Yin Jian-Qin, Yang Yang, Zhang Hai-Ting, Yuan Xian-Feng. Online recognition of human actions based on temporal deep belief neural network. Acta Automatica Sinica, 2016, 42(7): 1030-1039 doi: 10.16383/j.aas.2016.c150629
    [4] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA: NIPS Foundation, Inc., 2012. 1097-1105
    [5] Wang H, Schmid C. Action recognition with improved trajectories. In: Proceedings of the 14th International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 3551-3558
    [6] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS Foundation, Inc., 2014. 568-576
    [7] Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 1933-1941
    [8] Feichtenhofer C, Pinz A, Wildes R P. Spatiotemporal residual networks for video action recognition. In: Proceedings of the 29th International Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS Foundation, Inc., 2016. 3468-3476
    [9] Feichtenhofer C, Pinz A, Wildes R P. Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA: IEEE, 2017. 7445-7454
    [10] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 770-778
    [11] Wang Y B, Long M S, Wang J M, Yu P S. Spatiotemporal pyramid network for video action recognition. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA: IEEE, 2017. 2097-2106
    [12] Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O, et al. Temporal segment networks: Towards good practices for deep action recognition. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer, 2016. 20-36
    [13] Hu J, Zheng W, Lai J, Zhang J G. Jointly learning heterogeneous features for RGB-D activity recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(11): 2186-2200 doi: 10.1109/TPAMI.2016.2640292
    [14] Shahroudy A, Ng T, Gong Y H, Wang G. Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(5): 1045-1058 doi: 10.1109/TPAMI.2017.2691321
    [15] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252 doi: 10.1007/s11263-015-0816-y
    [16] Lin T Y, Roychowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition. In: Proceedings of the 15th International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 1449-1457
    [17] Lin T Y, Roychowdhury A, Maji S. Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1309-1322 doi: 10.1109/TPAMI.2017.2723400
    [18] Li P H, Xie J T, Wang Q L, Zuo W M. Is second-order information helpful for large-scale visual recognition? In: Proceedings of the 16th International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 2089-2097
    [19] Li P H, Xie J T, Wang Q L, Gao Z L. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 947-955
    [20] Gao Y, Beijbom O, Zhang N, Darrell T. Compact bilinear pooling. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 317-326
    [21] Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. In: Proceedings of the 2002 International Colloquium on Automata, Languages, and Programming. Malaga, Spain: Springer, 2002. 693-703
    [22] Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012. 1-7
    [23] Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T. HMDB: A large video database for human motion recognition. In: Proceedings of the 2011 IEEE International Conference on Computer Vision. Barcelona, Spain: IEEE, 2011. 2556-2563
    [24] MatConvNet: CNNs for MATLAB: Source Code [Online], available: http://www.vlfeat.org/matconvnet, November 7, 2018
    [25] Peng X J, Wang L M, Wang X X, Qiao Y. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding, 2016, 150: 109-125 doi: 10.1016/j.cviu.2016.03.013
Publication History
  • Received:  2018-04-28
  • Accepted:  2018-11-05
  • Published:  2021-04-02
