
Self-attention Cross-modality Fusion Network for Cross-modality Person Re-identification

Du Peng, Song Yong-Hong, Zhang Xin-Yao

Citation: Du Peng, Song Yong-Hong, Zhang Xin-Yao. Self-attention cross-modality fusion network for cross-modality person re-identification. Acta Automatica Sinica, 2022, 48(6): 1457−1468. doi: 10.16383/j.aas.c190340

doi: 10.16383/j.aas.c190340


Funds: Supported by National Key Research and Development Program of China (2017YFB1301101) and Natural Science Basic Research Program of Shaanxi Province (2018JM6104)
    Author Bio:

    DU Peng  Master student at the School of Software Engineering, Xi'an Jiaotong University. His main research interest is person re-identification. E-mail: xjydupeng@163.com

    SONG Yong-Hong  Researcher at the College of Artificial Intelligence, Xi'an Jiaotong University. Her research interests cover image and video content understanding and intelligent software development. Corresponding author of this paper. E-mail: songyh@xjtu.edu.cn

    ZHANG Xin-Yao  Master student at the School of Software Engineering, Xi'an Jiaotong University. Her main research interest is person re-identification. E-mail: xyzhangxy@stu.xjtu.edu.cn

Abstract: Person re-identification is the core technology behind multi-target cross-camera tracking, with wide applications in security, intelligent video surveillance, and criminal investigation. General person re-identification must cope with low camera resolution, pedestrian pose variation, illumination changes, pedestrian detection errors, and occlusion; cross-modality person re-identification additionally faces modality variation of the same pedestrian. To address this modality variation, this paper proposes a self-attention modality fusion network. First, CycleGAN is used to generate cross-modality images. A cross-modality learning network then learns features of both modalities simultaneously, supervised by a SoftMax loss on images from the original dataset and by a label smooth regularization (LSR) loss on the generated cross-modality images. Next, a self-attention module distinguishes original images from CycleGAN-generated ones and automatically filters the features of the cross-modality learning network at the channel level. Finally, a modality fusion module fuses the two filtered feature streams. Experiments on the cross-modality dataset SYSU-MM01 show that the proposed method improves performance over other cross-modality person re-identification methods.
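The training objective described above pairs an ordinary SoftMax (cross-entropy) loss on original images with an LSR loss on the CycleGAN-generated images, following the label smoothing formulation of Szegedy et al. [20]. Below is a minimal PyTorch sketch of that pairing; the smoothing factor epsilon = 0.1 and the batch setup are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def lsr_loss(logits, targets, epsilon=0.1):
    """Label smooth regularization (LSR) loss [20]: soften each one-hot
    target to (1 - epsilon) on the true class and epsilon/K elsewhere,
    so the noisier generated images cannot force the classifier into
    overconfident predictions."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_probs, epsilon / num_classes)
    smooth.scatter_(1, targets.unsqueeze(1),
                    1.0 - epsilon + epsilon / num_classes)
    return (-smooth * log_probs).sum(dim=1).mean()

num_ids = 395  # SYSU-MM01 provides 395 training identities
logits_real = torch.randn(8, num_ids)  # classifier output, original images
logits_gen = torch.randn(8, num_ids)   # classifier output, generated images
ids = torch.randint(0, num_ids, (8,))
loss = F.cross_entropy(logits_real, ids) + lsr_loss(logits_gen, ids)
```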
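The abstract also describes a self-attention module that filters features channel-wise before a modality fusion step. Since the paper's figures are not reproduced here, the following is only a plausible sketch assuming an SE-style squeeze-and-excitation gate [27] on ResNet-50 feature maps [26], with element-wise summation standing in for the fusion module; the class names and the fusion rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """SE-style channel attention [27]: squeeze by global average
    pooling, then excite with a bottleneck MLP and a sigmoid to
    produce per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (N, C, H, W) feature map from one branch
        w = self.fc(x.mean(dim=(2, 3)))            # (N, C) channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)   # channel-wise filtering

class AttentionFusion(nn.Module):
    """Gate the original-image and generated-image branches separately,
    then fuse; summation is an assumption, not the paper's exact rule."""
    def __init__(self, channels=2048):
        super().__init__()
        self.gate_orig = ChannelGate(channels)
        self.gate_gen = ChannelGate(channels)

    def forward(self, feat_orig, feat_gen):
        return self.gate_orig(feat_orig) + self.gate_gen(feat_gen)

fusion = AttentionFusion(2048)  # 2048 = ResNet-50 final channel count [26]
f = fusion(torch.randn(2, 2048, 8, 4), torch.randn(2, 2048, 8, 4))
```

Gating each branch with its own weights lets the network suppress channels that are unreliable for generated images while keeping them for originals, which matches the channel-level filtering the abstract describes.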
Fig. 1  The relationship between person re-identification and multi-target cross-camera tracking

Fig. 2  Data of cross-modality person re-identification

Fig. 3  Self-attention cross-modality fusion network

Fig. 4  Structure of the CycleGAN network

Fig. 5  Cross-modality images generated using CycleGAN

Fig. 6  Generated cross-modality images containing more noise

Fig. 7  Structure of the self-attention module

Table 1  Experimental results of each module on the SYSU-MM01 dataset in All-search mode (%)

    Method                            |          Single-shot            |           Multi-shot
                                      | Rank-1  Rank-10  Rank-20  mAP   | Rank-1  Rank-10  Rank-20  mAP
    Baseline                          | 27.36   71.95    84.58    28.53 | 32.48   78.34    88.93    23.17
    Cross-modality learning           | 30.83   72.35    84.07    31.45 | 37.25   80.58    90.22    25.48
    Cross-modality + self-attention   | 31.30   73.34    84.78    31.72 | 37.98   81.76    91.05    25.39
    Cross-modality + modality fusion  | 31.85   74.38    85.66    32.49 | 38.65   81.74    91.25    26.46
    Self-attention modality fusion    | 33.31   74.51    85.79    33.18 | 39.71   82.00    91.14    26.89

Table 2  Experimental results of each module on the SYSU-MM01 dataset in Indoor-search mode (%)

    Method                            |          Single-shot            |           Multi-shot
                                      | Rank-1  Rank-10  Rank-20  mAP   | Rank-1  Rank-10  Rank-20  mAP
    Baseline                          | 32.17   81.30    92.26    42.76 | 38.95   85.29    93.62    33.73
    Cross-modality learning           | 37.21   80.81    90.29    47.06 | 43.98   86.01    93.37    37.09
    Cross-modality + self-attention   | 36.55   80.32    90.41    46.42 | 44.89   85.31    94.18    36.43
    Cross-modality + modality fusion  | 37.63   81.75    91.48    47.73 | 44.82   87.26    94.97    38.07
    Self-attention modality fusion    | 38.09   81.68    90.61    47.86 | 45.80   86.72    93.86    37.95

Table 3  GFLOPs and parameter counts after adding each module

    Method                            | GFLOPs       | ΔGFLOPs vs. Baseline | Parameters  | ΔParameters vs. Baseline
    Baseline                          | 2.702772224  | —                    | 25 557 032  | —
    Cross-modality learning           | 2.702772224  | 0                    | 25 557 032  | 0
    Cross-modality + self-attention   | 2.703820800  | +0.001048576         | 26 609 960  | +1 052 928 (+4.12%)
    Cross-modality + modality fusion  | 5.405544448  | +2.702772224         | 25 557 032  | 0
    Self-attention modality fusion    | 5.409639424  | +2.706867200         | 27 136 424  | +1 579 392 (+6.18%)

Table 4  Comparison between our method and other cross-modality person re-identification methods on SYSU-MM01 in All-search mode (%)

    Method                                 |          Single-shot            |           Multi-shot
                                           | Rank-1  Rank-10  Rank-20  mAP   | Rank-1  Rank-10  Rank-20  mAP
    HOG + Euclidean [32]                   |  2.76   18.25    31.91     4.24 |  3.82   22.77    37.63     2.16
    Zero-padding [16]                      | 14.80   54.12    71.33    15.95 | 19.13   61.40    78.41    10.89
    BDTR [17]                              | 17.01   55.43    71.96    19.66 |  —       —        —        —
    cmGAN [18]                             | 26.97   67.51    80.56    27.80 | 31.49   72.74    85.01    22.27
    Baseline (ours)                        | 27.36   71.95    84.58    28.53 | 32.48   78.34    88.93    23.17
    Cross-modality learning (ours)         | 30.83   72.35    84.07    31.45 | 37.25   80.58    90.22    25.48
    Self-attention modality fusion (ours)  | 33.31   74.51    85.79    33.18 | 39.71   82.00    91.14    26.89

Table 5  Comparison between our method and other cross-modality person re-identification methods on SYSU-MM01 in Indoor-search mode (%)

    Method                                 |          Single-shot            |           Multi-shot
                                           | Rank-1  Rank-10  Rank-20  mAP   | Rank-1  Rank-10  Rank-20  mAP
    HOG + Euclidean [32]                   |  3.22   24.68    44.52     7.25 |  4.75   29.06    49.38     3.51
    Zero-padding [16]                      | 20.58   68.38    85.79    26.92 | 24.43   75.86    91.32    18.64
    cmGAN [18]                             | 31.63   77.23    89.18    42.19 | 37.00   80.94    92.11    32.76
    Baseline (ours)                        | 32.17   81.30    92.26    42.76 | 38.95   85.29    93.62    33.73
    Cross-modality learning (ours)         | 37.21   80.81    90.29    47.06 | 43.98   86.01    93.37    37.09
    Self-attention modality fusion (ours)  | 38.09   81.68    90.61    47.86 | 45.80   86.72    93.86    37.95
References

    [1] Li You-Jiao, Zhuo Li, Zhang Jing, Li Jia-Feng, Zhang Hui. A survey of person re-identification. Acta Automatica Sinica, 2018, 44(9): 1554−1568
    [2] Wu Yan-Cheng, Chen Hong-Chang, Li Shao-Mei, Gao Chao. Person re-identification using attribute priori distribution. Acta Automatica Sinica, 2019, 45(5): 953−964
    [3] Zhang L, Ma B, Li G, Huang Q, Tian Q. Generalized semi-supervised and structured subspace learning for cross-modal retrieval. IEEE Transactions on Multimedia, 2018, 20: 128−141 doi: 10.1109/TMM.2017.2723841
    [4] Zhang L, Ma B P, Li G R, Huang Q M, Tian Q. PL-ranking: A novel ranking method for cross-modal retrieval. In: Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, the Netherlands: ACM, 2016. 1355−1364
    [5] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84−90
    [6] Gray D, Tao H. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In: Proceedings of the 10th European Conference on Computer Vision. Marseille, France: Springer, 2008. 262−275
    [7] Yang Y, Yang J M, Yan J J, Liao S C, Yi D, Li S Z. Salient color names for person re-identification. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. Part I: 536−551
    [8] Köstinger M, Hirzer M, Wohlhart P, Roth P M, Bischof H. Large scale metric learning from equivalence constraints. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE, 2012. 2288−2295
    [9] Zheng Z, Zheng L, Yang Y. A discriminatively learned CNN embedding for person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(1): Article No. 13
    [10] Bromley J, Bentz J W, Bottou L, Guyon I, LeCun Y, Moore C, et al. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 1993, 7(4): 669−688 doi: 10.1142/S0218001493000339
    [11] Zhang X, Luo H, Fan X, Xiang W L, Sun Y X, Xiao Q Q, Jiang W, Zhang C, Sun J. AlignedReID: Surpassing human-level performance in person re-identification. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017
    [12] Zhao H Y, Tian M Q, Sun S Y, Shao J, Yan J J, Yi S, Wang X G, Tang X O. Spindle Net: Person re-identification with human body region guided feature decomposition and fusion. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 907−915
    [13] Dai Z Z, Chen M Q, Zhu S Y, Tan P. Batch feature erasing for person re-identification and beyond. Computing Research Repository (CoRR), 2018
    [14] Zhong Z, Zheng L, Zheng Z D, Li S Z, Yang Y. Camera style adaptation for person re-identification. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 5157−5166
    [15] Zhu J Y, Park T, Isola P, Efros A A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 2242−2251
    [16] Wu A C, Zheng W S, Yu H X, Gong S G, Lai J H. RGB-infrared cross-modality person re-identification. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 5390−5399
    [17] Ye M, Wang Z, Lan X Y, Yuen P C. Visible thermal person re-identification via dual-constrained top-ranking. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: AAAI, 2018. 1092−1099
    [18] Dai P Y, Ji R R, Wang H B, Wu Q, Huang Y Y. Cross-modality person re-identification with generative adversarial training. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: AAAI, 2018. 677−683
    [19] Lin J W, Li H. HPILN: A feature learning framework for cross-modality person re-identification. [Online], available: https://arxiv.org/abs/1906.03142, August 14, 2019
    [20] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the Inception architecture for computer vision. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 2818−2826
    [21] Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in Neural Information Processing Systems 27. Montreal, Canada: Curran Associates, 2014. 2672−2680
    [22] Lin Yi-Lun, Dai Xing-Yuan, Li Li, Wang Xiao, Wang Fei-Yue. The new frontier of AI research: Generative adversarial networks. Acta Automatica Sinica, 2018, 44(5): 775−792
    [23] Mirza M, Osindero S. Conditional generative adversarial nets. [Online], available: https://arxiv.org/abs/1411.1784, November 6, 2014
    [24] Isola P, Zhu J Y, Zhou T H, Efros A A. Image-to-image translation with conditional adversarial networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 5967−5976
    [25] Yi Z L, Zhang H, Tan P, Gong M L. DualGAN: Unsupervised dual learning for image-to-image translation. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 2868−2876
    [26] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 770−778
    [27] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 7132−7141
    [28] Nair V, Hinton G E. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning. Haifa, Israel: Omnipress, 2010. 807−814
    [29] Yin X, Goudriaan J, Lantinga E A, Vos J, Spiertz H J. A flexible sigmoid function of determinate growth. Annals of Botany, 2003, 91(3): 361−371
    [30] Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z M, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. In: Proceedings of the 31st Conference on Neural Information Processing Systems, Autodiff Workshop. Long Beach, USA: NIPS, 2017
    [31] Reddi S J, Kale S, Kumar S. On the convergence of Adam and beyond. In: Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR, 2018
    [32] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005. 886−893
    [33] Liao S C, Hu Y, Zhu X Y, Li S Z. Person re-identification by local maximal occurrence representation and metric learning. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 2197−2206
Publication history
  • Received: 2019-05-07
  • Accepted: 2019-10-11
  • Available online: 2022-04-22
  • Issue date: 2022-06-02
