Salient Feature Extraction Mechanism for Image Captioning

WANG Xin; SONG Yong-Hong; ZHANG Yuan-Lin

doi:10.16383/j.aas.c190279


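1. School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049
2. College of Artificial Intelligence, Xi'an Jiaotong University, Xi'an 710049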

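Funding: Supported by the Natural Science Basic Research Program of Shaanxi (2018JM6104) and the National Key Research and Development Program of China (2017YFB1301101)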


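About the authors:
WANG Xin: Master's student at the School of Software Engineering, Xi'an Jiaotong University. Main research interest: image captioning. E-mail: 18991371026@163.com

SONG Yong-Hong: Researcher at the College of Artificial Intelligence, Xi'an Jiaotong University. Main research interests: image and video content understanding and intelligent software development. Corresponding author of this paper. E-mail: songyh@xjtu.edu.cn

ZHANG Yuan-Lin: Associate professor at the College of Artificial Intelligence, Xi'an Jiaotong University. Main research interests: computer vision and machine learning. E-mail: ylzhangxian@xjtu.edu.cn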

Publication history
- Received: 2019-04-01
- Accepted: 2019-09-12
- Published online: 2022-01-12
- Issue date: 2022-03-25

Salient Feature Extraction Mechanism for Image Captioning

1. School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049
2. College of Artificial Intelligence, Xi'an Jiaotong University, Xi'an 710049

Funds: Supported by Natural Science Basic Research Program of Shaanxi (2018JM6104) and National Key Research and Development Program of China (2017YFB1301101)


Author Bio:
WANG Xin: Master's student at the School of Software Engineering, Xi'an Jiaotong University. His main research interest is image captioning.

SONG Yong-Hong: Researcher at the College of Artificial Intelligence, Xi'an Jiaotong University. Her research interests cover image and video content understanding and intelligent software development. Corresponding author of this paper.

ZHANG Yuan-Lin: Associate professor at the College of Artificial Intelligence, Xi'an Jiaotong University. His research interests cover computer vision and machine learning.

Abstract

Image captioning is a research direction that combines computer vision and natural language processing. This paper designs a novel salient feature extraction mechanism (SFEM) for image captioning. Before the language model predicts each word, SFEM quickly supplies it with the most valuable visual features to guide the prediction, which addresses two problems of existing methods: inaccurate selection of visual features and unsatisfactory time performance. SFEM consists of two parts, a global salient feature extractor and an instant salient feature extractor. The global salient feature extractor extracts salient visual features from multiple local visual vectors and integrates them into a global salient visual vector. The instant salient feature extractor then extracts from the global salient visual vector, according to the needs of the language model, the salient visual features required to predict each word. We evaluated SFEM on the MS COCO (Microsoft common objects in context) dataset. Experiments show that SFEM significantly improves the accuracy of the captions generated by the baseline model (CIDEr rises from 95.5 to 105.9 for the Encoder-Decoder baseline, Table 2), and that it outperforms the widely used spatial attention model in both caption accuracy and time performance (81.9 versus 69.8 frames/s on GPU, Table 4).
- Image captioning /
- salient feature extraction /
- language model /
- encoder /
- decoder

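The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: this page does not give the layers inside either extractor, so the softmax-weighted pooling in `global_extractor`, the sigmoid gate in `instant_extractor`, and all names, sizes, and random weights are assumptions chosen only to show that the global vector is computed once per image and then reused for every word.

```python
import math
import random


def global_extractor(local_vectors, score_weights):
    """Pool N local visual vectors (each of length D) into one global vector.

    Hypothetical pooling: score each location with a dot product, apply a
    softmax over locations, and take the weighted sum. Runs once per image.
    """
    scores = [sum(w * x for w, x in zip(score_weights, v)) for v in local_vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(local_vectors[0])
    return [sum(wt * v[d] for wt, v in zip(weights, local_vectors)) for d in range(dim)]


def instant_extractor(global_vector, hidden_state, gate_matrix):
    """Select the features of the global vector needed for the current word.

    Hypothetical selection: one sigmoid gate per feature, driven by the
    decoder hidden state. Runs once per predicted word.
    """
    gated = []
    for d, feature in enumerate(global_vector):
        pre_activation = sum(w * h for w, h in zip(gate_matrix[d], hidden_state))
        gate = 1.0 / (1.0 + math.exp(-pre_activation))
        gated.append(gate * feature)
    return gated


random.seed(0)
N, D, H = 49, 8, 4  # toy sizes (assumed): 49 locations, 8 features, hidden size 4
local_vectors = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
score_weights = [random.gauss(0, 1) for _ in range(D)]
gate_matrix = [[random.gauss(0, 1) for _ in range(H)] for _ in range(D)]

global_vector = global_extractor(local_vectors, score_weights)  # once per image
for step in range(3):  # once per predicted word
    hidden_state = [random.gauss(0, 1) for _ in range(H)]  # stand-in for the decoder state
    instant_vector = instant_extractor(global_vector, hidden_state, gate_matrix)
```

In this sketch only `instant_extractor` sits inside the per-word loop; the pass over all local vectors happens once, outside it.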

Fig. 1 Structure of our network

Fig. 2 Correspondence between local visual vectors and the image

Fig. 3 Structure of SFEM

Fig. 4 Spatial distribution of salient features

Fig. 5 Change of instant salient features with the predicted words

Fig. 6 Image captions generated by our model

Table 1 The 20 words with the highest $\overline{D}_{t}$ values

Word	$\overline{D}_{t}$	Word	$\overline{D}_{t}$	Word	$\overline{D}_{t}$
hood	0.0592	ducks	0.0565	doughnut	0.0546
cats	0.0589	pug	0.0564	baby	0.0546
teddy	0.0576	rug	0.0561	bird	0.0545
little	0.0573	hummingbird	0.0556	pen	0.0543
duck	0.0571	pasta	0.0549	motorcycle	0.0543
bananas	0.0569	horse	0.0547	colorful	0.0542
seagull	0.0565	panda	0.0546	—	—


Table 2 Performance of Encoder-Decoder + SFEM on the MS COCO dataset (%)

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
Encoder-Decoder [7, 19]	72.2	55.4	41.7	31.3	24.6	53.0	95.5	17.2
Encoder-Decoder + Spatial Attention [7, 19]	73.4	57.0	43.2	32.6	25.3	54.0	100.1	18.5
Encoder-Decoder + SFEM	75.1	58.8	44.9	34.0	26.3	55.2	105.9	19.5

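To put the Table 2 scores in perspective, the sketch below computes the absolute and relative gains of Encoder-Decoder + SFEM over the two reference rows. The numbers are copied from Table 2; the helper name `gains` and the rounding to one decimal place are choices made here for illustration only.

```python
# Scores copied from Table 2 (%); the order follows the table header.
metrics = ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4", "METEOR", "ROUGE-L", "CIDEr", "SPICE"]
baseline = [72.2, 55.4, 41.7, 31.3, 24.6, 53.0, 95.5, 17.2]
spatial_attention = [73.4, 57.0, 43.2, 32.6, 25.3, 54.0, 100.1, 18.5]
sfem = [75.1, 58.8, 44.9, 34.0, 26.3, 55.2, 105.9, 19.5]


def gains(reference, candidate):
    """Return (absolute gain in points, relative gain in %) for each metric."""
    return [
        (round(c - r, 1), round(100.0 * (c - r) / r, 1))
        for r, c in zip(reference, candidate)
    ]


gain_over_baseline = dict(zip(metrics, gains(baseline, sfem)))
gain_over_attention = dict(zip(metrics, gains(spatial_attention, sfem)))

for name in metrics:
    print(name, gain_over_baseline[name], gain_over_attention[name])
```

With these inputs, CIDEr improves by 10.4 points (about 10.9%) over the plain Encoder-Decoder and by 5.8 points (about 5.8%) over the spatial attention variant, and every metric in the table moves in the same direction.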

Table 3 Performance of Up-Down + SFEM on the MS COCO dataset (%)

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
Encoder-Decoder$^{\star}$ + SFEM	74.3	55.8	42.1	33.2	25.7	54.5	105.2	19.4
Up-Down-Spatial Attention [20-21]	74.2	55.7	42.3	33.2	25.9	54.1	105.2	19.2
Up-Down-SFEM	74.6	56.0	42.4	33.1	26.0	54.2	106.1	19.7


Table 4 Time performance of our model and the spatial attention model (frames/s)

Model	Frame rate (GPU)	Frame rate (CPU)
Encoder-Decoder + Spatial Attention [7, 19]	69.8	36.3
Encoder-Decoder + SFEM	81.9	52.2

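The frame rates in Table 4 can be turned into a speedup ratio and a per-image latency with a few lines of arithmetic. The values are copied from Table 4; the dictionary layout and the function names `speedup` and `ms_per_image` are illustrative and not taken from the paper.

```python
# Frame rates copied from Table 4 (frames/s).
frame_rate = {
    "spatial_attention": {"GPU": 69.8, "CPU": 36.3},
    "sfem": {"GPU": 81.9, "CPU": 52.2},
}


def speedup(device):
    """Ratio of SFEM throughput to spatial attention throughput."""
    return frame_rate["sfem"][device] / frame_rate["spatial_attention"][device]


def ms_per_image(model, device):
    """Average latency per image in milliseconds, derived from the frame rate."""
    return 1000.0 / frame_rate[model][device]


for device in ("GPU", "CPU"):
    print(
        device,
        round(speedup(device), 2),
        round(ms_per_image("spatial_attention", device), 1),
        round(ms_per_image("sfem", device), 1),
    )
```

By this calculation SFEM is roughly 1.17 times as fast as spatial attention on GPU and 1.44 times as fast on CPU, which corresponds to about 12.2 ms versus 14.3 ms per image on GPU.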

Table 5 Average time of a single execution of each module (s)

Module	Time per execution (GPU)	Time per execution (CPU)
Spatial Attention [7, 19]	0.00035	0.0019
GE	0.00034	0.0020
IE	0.000073	0.000087

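Table 5 reports single-execution times, so the cost per caption depends on how often each module runs. The sketch below makes that explicit under three assumptions that are not stated on this page: GE and IE are read as the global and instant extractors from the abstract, GE runs once per image while IE and spatial attention run once per predicted word, and a caption has 10 words (`CAPTION_LENGTH` is a hypothetical value).

```python
# Single-execution times copied from Table 5 (seconds).
time_per_call = {
    "spatial_attention": {"GPU": 0.00035, "CPU": 0.0019},
    "GE": {"GPU": 0.00034, "CPU": 0.0020},
    "IE": {"GPU": 0.000073, "CPU": 0.000087},
}


def attention_cost(num_words, device):
    """Assumed: spatial attention runs once for every predicted word."""
    return num_words * time_per_call["spatial_attention"][device]


def sfem_cost(num_words, device):
    """Assumed: GE runs once per image, IE once for every predicted word."""
    return time_per_call["GE"][device] + num_words * time_per_call["IE"][device]


def break_even_words(device):
    """Smallest caption length at which SFEM is cheaper than spatial attention."""
    n = 1
    while sfem_cost(n, device) >= attention_cost(n, device):
        n += 1
    return n


CAPTION_LENGTH = 10  # hypothetical caption length, not a figure from the paper
for device in ("GPU", "CPU"):
    print(
        device,
        attention_cost(CAPTION_LENGTH, device),
        sfem_cost(CAPTION_LENGTH, device),
        break_even_words(device),
    )
```

Under these assumptions a 10-word caption spends about 3.5 ms in spatial attention on GPU but about 1.07 ms in GE plus IE, and SFEM becomes the cheaper option from the second word onward on both devices.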

Table 6 Performance of our model on the MS COCO dataset (%)

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
Soft-Attention [16]	70.7	49.2	34.4	24.3	23.9	-	-	-
Hard-Attention [16]	71.8	50.4	35.7	25.0	23.0	-	-	-
Semantic Attention [9]	70.9	53.7	40.2	30.4	24.3	-	-	-
SCA-CNN [18]	71.9	54.8	41.1	31.1	25.0	-	-	-
Up-Down [20]	74.2	55.7	42.3	33.2	25.9	54.1	105.2	19.2
Ours: SFEM	75.1	58.8	44.9	34.0	26.3	55.2	105.9	19.5


Table 7 Performance of the combined models on the MS COCO dataset (%)

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
Spatial Attention [7, 19]	73.4	57.0	43.2	32.6	25.3	54.0	100.1	18.5
GE+Spatial Attention	74.5	57.9	44.0	33.1	25.9	54.4	103.6	19.0
IE+Spatial Attention	74.3	57.8	44.0	33.3	25.9	54.7	102.7	18.9


References (31)

[1]	Kulkarni G, Premraj V, Ordonez V, Dhar S, Li S M, Choi Y, et al. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903 doi: 10.1109/TPAMI.2012.162
[2]	Mao J H, Xu W, Yang Y, Wang J, Yuille A L. Deep captioning with multimodal recurrent neural networks (m-RNN). In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: ICLR, 2015.
[3]	Tang Peng-Jie, Wang Han-Li, Xu Kai-Sheng. Multi-objective layer-wise optimization and multi-level probability fusion for image description generation using LSTM. Acta Automatica Sinica, 2018, 44(7): 1237-1249
[4]	Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [Online], available: https://arxiv.org/pdf/1406.1078v3.pdf, September 3, 2014
[5]	Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: ICLR, 2015.
[6]	Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS, 2014.
[7]	Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 3156−3164
[8]	Zhang Xue-Song, Zhuang Yan, Yan Fei, Wang Wei. Status and development of transfer learning based category-level object recognition and detection. Acta Automatica Sinica, 2019, 45(7): 1224-1243
[9]	You Q Z, Jin H L, Wang Z W, Fang C, Luo J B. Image captioning with semantic attention. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 4651−4659
[10]	Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780 doi: 10.1162/neco.1997.9.8.1735
[11]	Jia X, Gavves E, Fernando B, Tuytelaars T. Guiding the long-short term memory model for image caption generation. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 2407−2415
[12]	Wu Q, Shen C H, Liu L Q, Dick A, Van Den Hengel A. What value do explicit high level concepts have in vision to language problems? In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 203−212
[13]	Yang Z L, Yuan Y, Wu Y X, Cohen W W, Salakhutdinov R R. Review networks for caption generation. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS, 2016.
[14]	Xi Xue-Feng, Zhou Guo-Dong. A survey on deep learning for natural language processing. Acta Automatica Sinica, 2016, 42(10): 1445-1465
[15]	Hou Li-Wei, Hu Po, Cao Wen-Lin. Automatic Chinese abstractive summarization with topical keywords fusion. Acta Automatica Sinica, 2019, 45(3): 530-539
[16]	Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, et al. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR.org, 2015. 2048−2057
[17]	Lu J S, Xiong C M, Parikh D, Socher R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017. 3242−3250
[18]	Chen L, Zhang H W, Xiao J, Nie L Q, Shao J, Liu W, et al. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017. 6298−6306
[19]	Chen X P, Ma L, Jiang W H, Yao J, Liu W. Regularizing RNNs for caption generation by reconstructing the past with the present. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 7995−8003
[20]	Anderson P, He X D, Buehler C, Teney D, Johnson M, Gould S, et al. Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 6077−6086
[21]	Lu J S, Yang J W, Batra D, Parikh D. Neural baby talk. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 7219−7228
[22]	Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 740−755
[23]	Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 3128−3137
[24]	Papineni K, Roukos S, Ward T, Zhu W J. BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, USA: ACL, 2002. 311−318
[25]	Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, USA: ACL, 2005. 65−72
[26]	Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based image description evaluation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 4566−4575
[27]	Lin C Y. ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004. Barcelona, Spain: Association for Computational Linguistics, 2004.
[28]	Anderson P, Fernando B, Johnson M, Gould S. SPICE: Semantic propositional image caption evaluation. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer, 2016.
[29]	Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS, 2015. 91−99
[30]	Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017, 123(1): 32-73 doi: 10.1007/s11263-016-0981-7
[31]	Rennie S J, Marcheret E, Mroueh Y, Ross J, Goel V. Self-critical sequence training for image captioning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017. 1179−1195