LSTM逐层多目标优化及多层概率融合的图像描述

汤鹏杰 王瀚漓 许恺晟

引用本文: 汤鹏杰, 王瀚漓, 许恺晟. LSTM逐层多目标优化及多层概率融合的图像描述. 自动化学报, 2018, 44(7): 1237-1249. doi: 10.16383/j.aas.2017.c160733
Citation: TANG Peng-Jie, WANG Han-Li, XU Kai-Sheng. Multi-objective Layer-wise Optimization and Multi-level Probability Fusion for Image Description Generation Using LSTM. ACTA AUTOMATICA SINICA, 2018, 44(7): 1237-1249. doi: 10.16383/j.aas.2017.c160733

LSTM逐层多目标优化及多层概率融合的图像描述

doi: 10.16383/j.aas.2017.c160733
基金项目: 

江西省教育厅科学技术研究项目 GJJ170643

上海高校特聘教授(东方学者)跟踪计划 GZ2015005

国家自然科学基金 61622115

国家自然科学基金 61472281

详细信息
    作者简介:

    汤鹏杰  同济大学计算机科学与技术系博士研究生.主要研究方向为计算机视觉和深度学习.E-mail:tangpengjie@tongji.edu.cn

    许恺晟  同济大学计算机科学与技术系硕士研究生.主要研究方向为图像理解和深度学习.E-mail:iaalm@tongji.edu.cn

    通讯作者:

    王瀚漓 同济大学计算机科学与技术系教授.主要研究方向为视频编码, 计算机视觉和机器学习.本文通信作者.E-mail:hanliwang@tongji.edu.cn

Multi-objective Layer-wise Optimization and Multi-level Probability Fusion for Image Description Generation Using LSTM

Funds: 

Scientific Research Foundation of the Education Bureau of Jiangxi Province GJJ170643

Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning GZ2015005

National Natural Science Foundation of China 61622115

National Natural Science Foundation of China 61472281

More Information
    Author Bio:

    TANG Peng-Jie  Ph.D. candidate in the Department of Computer Science and Technology, Tongji University. His research interest covers computer vision and deep learning

    XU Kai-Sheng  Master student in the Department of Computer Science and Technology, Tongji University. His research interest covers image understanding and deep learning

    Corresponding author: WANG Han-Li Professor in the Department of Computer Science and Technology, Tongji University. His research interest covers video coding, computer vision, and machine learning. Corresponding author of this paper
  • Abstract: Automatically describing an image with a computational model is a high-level visual understanding task. The model must not only describe the objects and the scene in an image, but also express the relationships among objects and between objects and the scene, while generating natural-language sentences with proper grammar and structure. Methods based on deep convolutional neural networks (CNN) and long short-term memory (LSTM) networks have become the mainstream solution to this problem. Although great progress has been made, the LSTM part is typically shallow and difficult to optimize, which limits model performance and the quality of the generated sentences. To address this problem, inspired by the idea of deep learning, this paper designs a multi-objective layer-wise optimization / multi-layer probability fusion LSTM (MLO/MLPF-LSTM) model. A shallow LSTM is trained first; after convergence, the classification layer and objective function of the original model are retained, a new LSTM layer with its own objective function is added, and the model is retrained while the original parameters are fine-tuned. At test time, each classification layer is transformed with a Softmax function to obtain that layer's word prediction probabilities, and the probabilities of all layers are fused with weights to obtain the final word prediction probability. Experimental results on the MSCOCO and Flickr30K datasets show that the model performs remarkably well and outperforms other methods of the same kind on multiple evaluation metrics.
    1)  Associate Editor in charge of this paper: WANG Li-Wei
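To make the method summary concrete, the following is a minimal PyTorch sketch (not the authors' original implementation) of the two ideas in the abstract: during layer-wise training every LSTM level keeps its own classification layer and cross-entropy objective, and at test time the Softmax probabilities of all levels are fused with weights. All dimensions, fusion weights, and identifiers below are illustrative assumptions.

```python
# Minimal sketch of multi-objective layer-wise training and multi-level
# probability fusion; hyper-parameters and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelLSTMCaptioner(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=512, vocab_size=10000, num_levels=3):
        super().__init__()
        # One LSTM layer per training stage; stage k is stacked on top of stage k-1.
        self.lstms = nn.ModuleList(
            [nn.LSTM(input_dim if k == 0 else hidden_dim, hidden_dim, batch_first=True)
             for k in range(num_levels)]
        )
        # Each level keeps its own classification layer (word predictor).
        self.classifiers = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(num_levels)]
        )

    def forward(self, x):
        logits_per_level = []
        h = x
        for lstm, cls in zip(self.lstms, self.classifiers):
            h, _ = lstm(h)
            logits_per_level.append(cls(h))       # (batch, time, vocab) at each level
        return logits_per_level


def multi_objective_loss(logits_per_level, targets):
    """Sum of the per-level cross-entropy losses (equal weights assumed here)."""
    loss = 0.0
    for logits in logits_per_level:
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
    return loss


def fuse_word_probabilities(logits_per_level, fusion_weights):
    """Weighted fusion of the per-level Softmax probabilities at test time."""
    probs = [w * F.softmax(logits, dim=-1)
             for logits, w in zip(logits_per_level, fusion_weights)]
    return torch.stack(probs, dim=0).sum(dim=0)   # fused word distribution


# Toy usage: 2 sequences of length 5 over 512-dimensional input features.
model = MultiLevelLSTMCaptioner()
x = torch.randn(2, 5, 512)
targets = torch.randint(0, 10000, (2, 5))
levels = model(x)
loss = multi_objective_loss(levels, targets)               # layer-wise training objective
fused = fuse_word_probabilities(levels, [0.3, 0.3, 0.4])   # weights are assumed values
next_words = fused[:, -1, :].argmax(dim=-1)                # greedy pick; beam search also applies
```

In the actual procedure described in the abstract, each stage is trained to convergence before the next LSTM layer and its objective are added, and the earlier parameters are fine-tuned rather than re-initialized.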
  • 图  1  LSTM单元

    Fig.  1  LSTM unit

    图  2  训练第1阶段(基准模型)

    Fig.  2  The 1st stage of the training process (benchmark model)

    图  3  训练第$K$阶段

    Fig.  3  The $K$th stage of the training process

    图  4  MLPF-LSTM图像描述生成流程

    Fig.  4  The pipeline of image description generation in MLPF-LSTM

    图  5  MSCOCO数据集中部分训练样本

    Fig.  5  Training examples from the MSCOCO dataset

    图  6  MLO/MLPF-LSTM (3-stage)模型生成的部分图像描述示例

    Fig.  6  Examples of image descriptions with MLO/MLPF-LSTM (3-stage)

    图  7  在MSCOCO数据集上使用不同策略加深模型深度时的性能表现

    Fig.  7  Performance under different strategies at each stage on MSCOCO

    表  1  MSCOCO数据集上不同层次及多层融合之后的性能对比(非联调方式) ($\%$)

    Table  1  Performance comparison under different fusion conditions on MSCOCO (non-jointly optimizing) ($\%$)

    Models B-1 B-2 B-3 B-4 C
    Baseline 67.7 49.4 35.2 25.0 78.2
    2-stage P1 67.8 49.7 35.3 25.0 78.5
    P2 67.5 49.6 35.3 25.0 79.6
    Fusion 68.0 50.0 35.5 25.1 79.1
    3-stage P1 67.9 49.8 35.5 25.2 79.0
    P2 67.5 49.6 35.3 25.0 79.6
    P3 67.3 49.4 35.1 24.8 78.9
    Fusion 68.0 50.0 35.8 25.4 80.2
    4-stage P1 67.6 49.5 35.3 25.1 78.7
    P2 67.0 49.1 34.9 24.8 79.7
    P3 66.8 49.0 34.8 24.7 79.5
    P4 66.9 49.0 34.8 24.6 78.9
    Fusion 67.7 49.8 35.6 25.3 80.4
    C denotes CIDEr

    表  2  MSCOCO数据集上不同层次及多层融合之后的性能对比(联调方式) ($\%$)

    Table  2  Performance comparison under different fusion conditions on MSCOCO (jointly optimizing) ($\%$)

    Models B-1 B-2 B-3 B-4 C
    Baseline$^+$ 70.2 52.7 38.3 27.6 86.2
    2-stage P1 70.2 52.7 38.4 27.8 88.4
    P2 69.9 52.6 38.3 27.7 87.5
    Fusion 70.2 52.8 38.4 27.8 88.5
    3-stage P1 70.5 52.8 38.4 27.8 89.3
    P2 70.1 52.5 38.2 27.8 88.9
    P3 70.1 52.8 38.5 27.9 88.2
    Fusion 70.6 53.2 38.8 28.2 90.0
    C denotes CIDEr

    表  3  Flickr30K数据集上不同层次及多层融合之后的性能对比(联调方式) ($\%$)

    Table  3  Performance comparison under different fusion conditions on Flickr30K (jointly optimizing) ($\%$)

    Models B-1 B-2 B-3 B-4 M
    Baseline$^+$ 60.2 41.8 28.5 19.2 19.2
    2-stage P1 61.5 42.9 29.2 19.7 19.4
    P2 60.7 42.2 29.0 19.8 19.2
    Fusion 61.4 42.8 29.2 19.8 19.6
    M denotes METEOR

    表  4  MSCOCO数据集上不同层次及多层融合之后的性能对比(使用联调方式和集束搜索算法) ($\%$)

    Table  4  Performance comparison under different fusion conditions on MSCOCO (joint optimization and Beam search employed) ($\%$)

    Models B-1 B-2 B-3 B-4 C
    Baseline$^+$ 71.3 54.4 40.8 30.5 92.0
    2-stage P1 71.4 54.3 40.7 30.6 93.8
    P2 71.6 54.8 41.1 31.0 93.7
    Fusion 71.5 54.5 41.0 31.0 94.2
    C denotes CIDEr

    表  5  Flickr30K数据集上不同层次及多层融合之后的性能对比(使用联调方式和集束搜索算法) ($\%$)

    Table  5  Performance comparison under different fusion conditions on Flickr30K (joint optimization and Beam search employed) ($\%$)

    Models B-1 B-2 B-3 B-4 M
    Baseline$^+$ 63.4 44.5 30.9 21.1 19.0
    2-stage P1 65.1 45.8 31.8 21.9 19.2
    P2 65.0 46.0 32.0 21.9 19.3
    Fusion 66.2 47.2 33.1 23.0 19.6
    M denotes METEOR

    表  6  不同方法在MSCOCO数据集上的性能对比($\%$)

    Table  6  Performance comparison with other state-of-the-art methods on MSCOCO ($\%$)

    Methods B-1 B-2 B-3 B-4 C
    multimodal RNN[14] 62.5 45.0 32.1 23.0 66.0
    Google NIC[2] 66.6 46.1 32.9 24.6 --
    LRCN-AlexNet[13] 62.8 44.2 30.4 21.0 --
    m-RNN[1] 67.0 49.0 35.0 25.0 --
    Soft-attention[15] 70.7 49.2 34.4 24.3 --
    Hard-attention[15] 71.8 50.4 35.7 25.0 --
    emb-gLSTM, Gaussian[28] 67.0 49.1 35.8 26.4 81.3
    MLO/MLPF-LSTM 67.7 49.8 35.6 25.3 80.4
    MLO/MLPF-LSTM$^+$ 70.6 53.2 38.8 28.2 90.0
    MLO/MLPF-LSTM$^+$(BS) 71.5 54.5 41.0 31.0 94.2
    BS denotes Beam search, C denotes CIDEr

    表  7  不同方法在Flickr30K数据集上的性能对比($\%$)

    Table  7  Performance comparison with other state-of-the-art methods on Flickr30K ($\%$)

    Methods B-1 B-2 B-3 B-4 M
    multimodal RNN[14] 57.3 36.9 24.0 15.7 15.3
    Google NIC[2] 66.3 42.3 27.7 18.3 --
    LRCN-AlexNet[13] 58.7 39.1 25.1 16.5 --
    m-RNN[1] 60.0 41.0 28.0 19.0 --
    Soft-attention[15] 66.7 43.4 28.8 19.1 18.5
    Hard-attention[15] 66.9 43.9 29.6 19.9 18.5
    emb-gLSTM, Gaussian[28] 64.6 44.6 30.5 20.6 17.9
    MLO/MLPF-LSTM$^+$ 61.4 42.8 29.2 19.8 19.6
    MLO/MLPF-LSTM$^+$(BS) 66.2 47.2 33.1 23.0 19.6
    M denotes METEOR, BS denotes Beam search
  • [1] Mao J H, Xu W, Yang Y, Wang J, Huang Z H, Yuille A. Deep captioning with multimodal recurrent neural networks (m-RNN). In: Proceedings of the 2015 International Conference on Learning Representations. San Diego, USA, 2015.
    [2] Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: a neural image caption generator. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3156-3164
    [3] Kulkarni G, Premraj V, Ordonez V, Dhar S, Li S M, Choi Y, Berg A C, Berg T L. BabyTalk:understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12):2891-2903 doi: 10.1109/TPAMI.2012.162
    [4] Mitchell M, Han X F, Dodge J, Mensch A, Goyal A, Berg A, Yamaguchi K, Berg T, Stratos K, Daumé H Ⅲ. Midge: generating image descriptions from computer vision detections. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: ACL, 2012. 747-756 http://tamaraberg.com/papers/EACL12.pdf
    [5] Elliott D, Keller F. Image description using visual dependency representations. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, USA: ACL, 2013. 1292-1302 http://www.aclweb.org/anthology/D/D13/D13-1128.pdf
    [6] Farhadi A, Hejrati M, Sadeghi M A, Young P, Rashtchian C, Hockenmaieret J, Forsyth D. Every picture tells a story: generating sentences from images. In: Proceedings of the 2010 European Conference on Computer Vision (ECCV). Berlin, Heidelberg, Germany: Springer, 2010. 15-29 https://www.cs.cmu.edu/~afarhadi/papers/sentence.pdf
    [7] 张红斌, 姬东鸿, 尹兰, 任亚峰.基于梯度核特征及N-gram模型的商品图像句子标注.计算机科学, 2016, 43(5):269-273, 287 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsja201605053&dbname=CJFD&dbcode=CJFQ

    Zhang Hong-Bin, Ji Dong-Hong, Yin Lan, Ren Ya-Feng. Product image sentence annotation based on gradient kernel feature and N-gram model. Computer Science, 2016, 43(5):269-273, 287 http://kns.cnki.net/KCMS/detail/detail.aspx?filename=jsja201605053&dbname=CJFD&dbcode=CJFQ
    [8] Socher R, Karpathy A, Le Q V, Manning C D, Ng A Y. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2014, 2:207-218 https://nlp.stanford.edu/~socherr/SocherKarpathyLeManningNg_TACL2013.pdf
    [9] Kuznetsova P, Ordonez V, Berg T L, Choi Y. TreeTalk:composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2014, 2:351-362
    [10] Kuznetsova P, Ordonez V, Berg A, Berg T, Choi Y. Generalizing image captions for image-text parallel corpus. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria: ACL, 2013. 790-796 http://www.cs.unc.edu/~vicente/files/acl13_generalization.pdf
    [11] Mason R, Charniak E. Nonparametric method for data-driven image captioning. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, USA: ACL, 2014. 592-598 http://aclweb.org/anthology/P/P14/P14-2097.pdf
    [12] 蒋树强, 闵巍庆, 王树徽.面向智能交互的图像识别技术综述与展望.计算机研究与发展, 2016, 53(1):113-122 doi: 10.7544/issn1000-1239.2016.20150689

    Jiang Shu-Qiang, Min Wei-Qing, Wang Shu-Hui. Survey and prospect of intelligent interaction-oriented image recognition techniques. Journal of Computer Research and Development, 2016, 53(1):113-122 doi: 10.7544/issn1000-1239.2016.20150689
    [13] Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T, Saenko K. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 2625-2634 http://openaccess.thecvf.com/content_cvpr_2015/papers/Donahue_Long-Term_Recurrent_Convolutional_2015_CVPR_paper.pdf
    [14] Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3128-3137
    [15] Xu K, Ba J L, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R S, Bengio Y. Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning. Lille, France, 2015. 2048-2057
    [16] Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786):504-507 doi: 10.1126/science.1127647
    [17] Hermans M, Schrauwen B. Training and analyzing deep recurrent neural networks. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc., 2013. 190-198 https://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf
    [18] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 1-9 https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
    [19] Lee C Y, Xie S N, Gallagher P W, Zhang Z Y, Tu Z W. Deeply-supervised nets. In: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics. San Diego, USA, 2015. 562-570
    [20] Gerber R, Nagel H H. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In: Proceedings of the 1996 International Conference on Image Processing. Lausanne, Switzerland: IEEE, 1996. 805-808 https://ieeexplore.ieee.org/document/561027/
    [21] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 2012 Advances in Neural Information Processing Systems. Lake Tahoe, USA: MIT Press, 2012. 1097-1105 https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
    [22] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 2015 International Conference on Learning Representations. San Diego, USA, 2015.
    [23] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 770-778
    [24] 石俊飞, 刘芳, 林耀海, 刘璐.基于深度学习和层次语义模型的极化SAR分类.自动化学报, 2017, 43(2):215-226 http://www.aas.net.cn/CN/abstract/abstract19010.shtml

    Shi Jun-Fei, Liu Fang, Lin Yao-Hai, Liu Lu. Polarimetric SAR image classification based on deep learning and hierarchical semantic model. Acta Automatica Sinica, 2017, 43(2):215-226 http://www.aas.net.cn/CN/abstract/abstract19010.shtml
    [25] 王伟凝, 王励, 赵明权, 蔡成加, 师婷婷, 徐向民.基于并行深度卷积神经网络的图像美感分类.自动化学报, 2016, 42(6):905-914 http://www.aas.net.cn/CN/abstract/abstract18881.shtml

    Wang Wei-Ning, Wang Li, Zhao Ming-Quan, Cai Cheng-Jia, Shi Ting-Ting, Xu Xiang-Min. Image aesthetic classification using parallel deep convolutional neural networks. Acta Automatica Sinica, 2016, 42(6):905-914 http://www.aas.net.cn/CN/abstract/abstract18881.shtml
    [26] 段艳杰, 吕宜生, 张杰, 赵学亮, 王飞跃.深度学习在控制领域的研究现状与展望.自动化学报, 2016, 42(5):643-654 http://www.aas.net.cn/CN/abstract/abstract18852.shtml

    Duan Yan-Jie, Lv Yi-Sheng, Zhang Jie, Zhao Xue-Liang, Wang Fei-Yue. Deep learning for control:the state of the art and prospects. Acta Automatica Sinica, 2016, 42(5):643-654 http://www.aas.net.cn/CN/abstract/abstract18852.shtml
    [27] 郭潇逍, 李程, 梅俏竹.深度学习在游戏中的应用.自动化学报, 2016, 42(5):676-684 http://www.aas.net.cn/CN/abstract/abstract18857.shtml

    Guo Xiao-Xiao, Li Cheng, Mei Qiao-Zhu. Deep learning applied to games. Acta Automatica Sinica, 2016, 42(5):676-684 http://www.aas.net.cn/CN/abstract/abstract18857.shtml
    [28] Jia X, Gavves E, Fernando B, Tuytelaars T. Guiding the long-short term memory model for image caption generation. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 2407-2415 https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Jia_Guiding_the_Long-Short_ICCV_2015_paper.pdf
    [29] 奚雪峰, 周国栋.面向自然语言处理的深度学习研究.自动化学报, 2016, 42(10):1445-1465 http://www.aas.net.cn/CN/abstract/abstract18934.shtml

    Xi Xue-Feng, Zhou Guo-Dong. A survey on deep learning for natural language processing. Acta Automatica Sinica, 2016, 42(10):1445-1465 http://www.aas.net.cn/CN/abstract/abstract18934.shtml
    [30] Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11):2278-2324 doi: 10.1109/5.726791
    [31] Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: delving deep into convolutional nets. In: Proceedings of the 2014 British Machine Vision Conference. Nottingham, England: British Machine Vision Association, 2014.
    [32] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C, Li F F. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3):211-252 doi: 10.1007/s11263-015-0816-y
    [33] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8):1735-1780 doi: 10.1162/neco.1997.9.8.1735
    [34] Graves A. Generating sequences with recurrent neural networks[Online], available: https://arxiv.org/pdf/1308.0850v5.pdf, June 5, 2014
    [35] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994, 5(2):157-166 doi: 10.1109/72.279181
    [36] Gers F A, Schmidhuber J. Recurrent nets that time and count. In: Proceedings of the 2000 IEEE-INNS-ENNS International Joint Conference on Neural Networks. Como, Italy: IEEE, 2000. 189-194
    [37] Cho K, Van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation[Online], available: https://arxiv.org/pdf/1406.1078v3.pdf, September 3, 2014
    [38] Greff K, Srivastava R K, Koutník J, Steunebrink B R, Schmidhuber J. LSTM:a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(10):2222-2232 doi: 10.1109/TNNLS.2016.2582924
    [39] Zhou B L, Lapedriza A, Xiao J X, Torralaba A, Oliva A. Learning deep features for scene recognition using places database. In: Proceedings of the 2015 Advances in Neural Information Processing Systems. Montréal, Canada: MIT Press, 2015. 487-495 http://places.csail.mit.edu/places_NIPS14.pdf
    [40] Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: common objects in context. In: Proceedings of the 2014 European Conference on Computer Vision (ECCV). Zurich, Switzerland: Springer, 2014. 740-755
    [41] Young P, Lai A, Hodosh M, Hockenmaier J. From image descriptions to visual denotations:new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014, 2:67-78 http://nlp.cs.illinois.edu/HockenmaierGroup/Papers/TACL2014/TACLDenotationGraph.pdf
    [42] Papineni K, Roukos S, Ward T, Zhu W J. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, USA: ACL, 2002. 311-318 http://www.aclweb.org/anthology/P02-1040.pdf
    [43] Banerjee S, Lavie A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization. Ann Arbor, USA: ACL, 2005. 65-72 http://www.cs.cmu.edu/~alavie/METEOR/pdf/Banerjee-Lavie-2005-METEOR.pdf
    [44] Vedantam R, Zitnick C L, Parikh D. CIDEr: consensus-based image description evaluation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 4566-4575
    [45] Jia Y Q, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, Florida, USA: ACM, 2014. 675-678
    [46] Krishna R, Zhu Y K, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S, Li F F. Visual Genome: connecting language and vision using crowd sourced dense image annotations[Online], available: https://arxiv.org/pdf/1602.07332.pdf, February 23, 2016
Publication history
  • Received:  2016-10-25
  • Accepted:  2017-03-02
  • Published:  2018-07-20
