从视频到语言: 视频标题生成与描述研究综述

汤鹏杰 王瀚漓

引用本文: 汤鹏杰, 王瀚漓. 从视频到语言: 视频标题生成与描述研究综述. 自动化学报, 2021, 47(x): 1−23 doi: 10.16383/j.aas.c200662
Citation: Tang Peng-Jie, Wang Han-Li. From video to language: survey of video captioning and description. Acta Automatica Sinica, 2021, 47(x): 1−23 doi: 10.16383/j.aas.c200662


doi: 10.16383/j.aas.c200662
基金项目: 国家自然科学基金(62062041, 61976159, 61962003), 上海市科技创新行动计划项目(20511100700), 江西省自然科学基金(20202BAB202017, 20202BABL202007), 井冈山大学博士启动基金(JZB1923)资助
    作者简介 (About the authors):

    汤鹏杰 (Tang Peng-Jie): Lecturer and Ph.D. at the School of Electronics and Information Engineering, Jinggangshan University. His research interests include machine learning, computer vision, and multimedia intelligent computing. E-mail: 5tangpengjie@tongji.edu.cn

    王瀚漓 (Wang Han-Li): Professor and Ph.D. supervisor at the Department of Computer Science and Technology, Tongji University. His research interests include machine learning, video coding, computer vision, and multimedia intelligent computing. He is the corresponding author of this paper. E-mail: hanliwang@tongji.edu.cn

From Video to Language: Survey of Video Captioning and Description

Funds: Supported by the National Natural Science Foundation of P. R. China (62062041, 61976159, 61962003), the Shanghai Innovation Action Project of Science and Technology (20511100700), the Natural Science Foundation of Jiangxi Province (20202BAB202017, 20202BABL202007), and the Ph.D. Research Startup Project of Jinggangshan University (JZB1923)
  • 摘要 (Abstract): Video captioning and description aims to summarize and re-express a video in natural language. Because video and language are heterogeneous modalities, the underlying data processing is relatively complex. This survey focuses on models built on the "encoder-decoder" architecture and, according to how visual features are encoded and used, groups them into methods based on mean/max pooling of visual features, methods based on sequential memory modeling of the video, methods based on 3D convolutional features, and hybrid methods, summarizing the representative models in each category and comparing them on mainstream benchmarks such as MSVD, MSR-VTT2016 and ActivityNet Captions (see the tables below). Finally, open problems and likely trends are discussed: generated descriptions need to evolve into structured paragraphs that integrate emotion, logic and related information, and deeper research is required on model optimization, dataset construction, and evaluation metrics.
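
To make the "encoder-decoder" pipeline summarized in the abstract concrete, below is a minimal, hypothetical sketch (assuming PyTorch, toy feature dimensions, and a made-up vocabulary) of the simplest family surveyed here: per-frame 2D-CNN features are mean-pooled into a single video vector, which initializes an LSTM decoder that emits the caption word by word. It illustrates the general scheme only and is not the implementation of any specific model cited in this survey.

```python
# Minimal sketch (assumptions: PyTorch, toy dimensions, made-up vocabulary) of the
# mean-pooling "encode-decode" scheme: average per-frame CNN features into one
# video vector, then decode a sentence with an LSTM language model.
import torch
import torch.nn as nn

class MeanPoolCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, hidden_dim)   # encode the pooled video feature
        self.embed = nn.Embedding(vocab_size, embed_dim)     # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)  # predict the next word

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim), per-frame features from a pretrained 2D CNN
        # captions:    (batch, seq_len), word indices of the ground-truth sentence
        video_vec = frame_feats.mean(dim=1)                          # mean pooling over frames
        h0 = torch.tanh(self.visual_proj(video_vec)).unsqueeze(0)    # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        words = self.embed(captions)                                 # (batch, seq_len, embed_dim)
        out, _ = self.lstm(words, (h0, c0))
        return self.classifier(out)                                  # (batch, seq_len, vocab_size)

# Toy usage: 8 videos, 26 sampled frames each, captions of length 12.
model = MeanPoolCaptioner()
feats = torch.randn(8, 26, 2048)
caps = torch.randint(0, 10000, (8, 12))
logits = model(feats, caps)
print(logits.shape)  # torch.Size([8, 12, 10000])
```

A real system would replace the random tensors with features from a pretrained CNN, train with cross-entropy against the time-shifted ground-truth captions, and decode at test time with greedy or beam search.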
  • 图  1  视频标题生成与描述任务示例

    Fig.  1  Example of video captioning and description

    图  2  基于模板/规则的视频描述框架

    Fig.  2  The template/rule based framework for video captioning and description

    图  3  基于视觉均值/最大值特征的视频描述框架

    Fig.  3  The mean/max pooling visual feature based framework for video captioning and description

    图  4  基于RNN序列建模的视频描述框架

    Fig.  4  The RNN based framework for video captioning and description

    图  5  Res-F2F视频描述生成流程

    Fig.  5  The framework of Res-F2F for video captioning and description

    图  6  视频密集描述任务示例

    Fig.  6  Example of dense video captioning and description

    图  7  基于强化学习的层次化视频描述框架

    Fig.  7  The reinforcement learning based framework for video captioning and description

    图  8  基于3D卷积特征的视频描述基本框架

    Fig.  8  The 3D CNN based framework for video captioning and description

    图  9  含有情感与动态时序信息的复杂视频示例

    Fig.  9  Video with rich emotion and motion features

    图  10  MSVD数据集部分示例(训练集)

    Fig.  10  Examples from MSVD (training set)

    图  11  MSR-VTT2016数据集部分示例(训练集)

    Fig.  11  Examples from MSR-VTT2016 (training set)

    图  12  SAAT模型生成描述句子示例(“RF”表示参考句子, “SAAT”表示模型所生成的句子)

    Fig.  12  Candidate sentence examples with SAAT model (“RF” stands for references, and “SAAT” denotes the generated sentences with SAAT)

    图  13  SDVC模型生成的部分描述示例(“RF-e”表示参考语句, “SDVC-e”表示SDVC模型生成的句子)

    Fig.  13  Description examples with SDVC model (“RF-e” stands for the references, and “SDVC-e” denotes the generated sentences with SDVC)

    表  1  部分基于视觉序列特征均值/最大值的模型在MSVD数据集上的性能表现(%)

    Table  1  Performance (%) of a few popular models based on visual sequential feature with mean/max pooling on MSVD

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    LSTM-YT[23] | - | - | - | 33.3 | 29.1 | -
    DFS-CM(Mean)[27] | 80.0 | 67.4 | 56.8 | 46.5 | 33.6 | -
    DFS-CM(Max)[27] | 79.8 | 67.3 | 57.1 | 47.1 | 34.1 | -
    LSTM-E[25] | 78.8 | 66.0 | 55.4 | 45.3 | 31.0 | -
    LSTM-TSA_IV[26] | 82.8 | 72.0 | 62.8 | 52.8 | 33.5 | 74.0
    MS-RNN(R)[112] | 82.9 | 72.6 | 63.5 | 53.3 | 33.8 | 74.8
    RecNet_local(SA-LSTM)[47] | - | - | - | 52.3 | 34.1 | 80.3

    表  4  其他部分主流模型在MSVD上的性能表现(%)

    Table  4  Performance (%) of a few other popular models on MSVD

    Methods (方法) | B-4 | METEOR | CIDEr
    FGM[115] | 13.7 | 23.9 | -
    TDConvED I[79] | 53.3 | 33.8 | 76.4
    SibNet[80] | 54.2 | 34.8 | 88.2
    GRU-EVE_hft+sem(CI)[81] | 47.9 | 35.0 | 78.1

    表  2  部分基于序列RNN视觉特征建模的模型在MSVD数据集上的性能表现(%)

    Table  2  Performance (%) of a few popular models based on visual sequential feature with RNN on MSVD

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    S2VT[32] | - | - | - | - | 29.8 | -
    Res-F2F (G-R101-152)[34] | 82.8 | 71.7 | 62.4 | 52.4 | 35.7 | 84.3
    Joint-BiLSTM reinforced[35] | - | - | - | - | 30.3 | -
    HRNE with attention[38] | 79.2 | 66.3 | 55.1 | 43.8 | 33.1 | -
    Boundary-aware encoder[39] | - | - | - | 42.5 | 32.4 | 63.5
    hLSTMat[41] | 82.9 | 72.2 | 63.0 | 53.0 | 33.6 | -
    Li et al.[42] | - | - | - | 48.0 | 31.6 | 68.8
    MGSA(I+C)[43] | - | - | - | 53.4 | 35.0 | 86.7
    LSTM-GAN[113] | - | - | - | 42.9 | 30.4 | -
    PickNet (V+L+C)[114] | - | - | - | 52.3 | 33.3 | 76.5

    表  3  部分基于3D卷积特征的模型在MSVD数据集上的性能表现(%)

    Table  3  Performance (%) of a few popular models based on 3D visual feature on MSVD

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    ETS(Local+Global)[48] | - | - | - | 41.9 | 29.6 | 51.7
    M3-Inv3[62] | 81.6 | 71.4 | 62.3 | 52.0 | 32.2 | -
    SAAT[77] | - | - | - | 46.5 | 33.5 | 81.0
    Topic-guided[68] | - | - | - | 49.3 | 33.9 | 83.0
    ORG-TRL[76] | - | - | - | 54.3 | 36.4 | 95.2

    表  5  部分基于视觉序列均值/最大值的模型在MSR-VTT2016数据集上的性能表现(%)

    Table  5  Performance (%) of visual sequential feature based models with mean/max pooling on MSR-VTT2016

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    LSTM-YT[23] | 75.9 | 60.6 | 46.5 | 35.4 | 26.3 | -
    MS-RNN[112] | - | - | - | 39.8 | 26.1 | 40.9
    RecNet_local(SA-LSTM)[47] | - | - | - | 39.1 | 26.6 | 42.7
    ruc-uva[116] | - | - | - | 38.7 | 26.9 | 45.9
    Aalto[60] | - | - | - | 41.1 | 27.7 | 46.4

    表  8  其他主流模型在MSR-VTT2016上的性能(%)

    Table  8  Performance (%) of other popular models on MSR-VTT2016

    Methods (方法) | B-4 | METEOR | CIDEr
    TDConvED (R)[79] | 39.5 | 27.5 | 42.8
    SibNet[80] | 41.2 | 27.8 | 48.6
    GRU-EVE_hft+sem(CI)[81] | 38.3 | 28.4 | 48.1
    v2t navigator[119] | 43.7 | 29.0 | 45.7

    表  6  部分基于RNN视觉序列特征建模的模型在MSR-VTT2016数据集上的性能表现(%)

    Table  6  Performance (%) of a few popular models based on visual sequential feature with RNN on MSR-VTT2016

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    Res-F2F (G-R101-152)[34] | 81.1 | 67.2 | 53.7 | 41.4 | 29.0 | 48.9
    hLSTMat[41] | - | - | - | 38.3 | 26.3 | -
    Li et al.[42] | 76.1 | 62.1 | 49.1 | 37.5 | 26.4 | -
    MGSA(I+A+C)[43] | - | - | - | 45.4 | 28.6 | 50.1
    LSTM-GAN[113] | - | - | - | 36.0 | 26.1 | -
    aLSTM[117] | - | - | - | 38.0 | 26.1 | -
    VideoLAB[118] | - | - | - | 39.5 | 27.7 | 44.2
    PickNet (V+L+C)[114] | - | - | - | 41.3 | 27.7 | 44.1
    DenseVidCap[49] | - | - | - | 44.2 | 29.4 | 50.5
    ETS(Local+Global)[48] | 77.8 | 62.2 | 48.1 | 37.1 | 28.4 | -

    表  7  部分基于3D卷积特征的模型在MSR-VTT2016数据集上的性能表现(%)

    Table  7  Performance (%) of a few popular models based on 3D visual sequential feature on MSR-VTT2016

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    ETS(C3D+VGG-19)[111] | 81.5 | 65.0 | 52.5 | 40.5 | 29.9 | -
    M3-Inv3[62] | - | - | - | 38.1 | 26.6 | -
    Topic-guided[68] | - | - | - | 44.1 | 29.3 | 49.8
    ORG-TRL[76] | - | - | - | 43.6 | 28.8 | 50.9
    SAAT(RL)[77] | 79.6 | 65.9 | 52.1 | 39.9 | 27.7 | 51.0

    表  9  部分基于RNN视觉序列特征建模的模型在ActivityNet Captions数据集(验证集)上的性能表现 (%)

    Table  9  Performance (%) of a few popular models based on visual sequential feature with RNN on ActivityNet Captions dataset (validation set)

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    Masked Transformer[52] | 9.96 | 4.81 | 2.42 | 1.15 | 4.98 | 9.25
    TDA-CG[51] | 10.75 | 5.06 | 2.55 | 1.31 | 5.86 | 7.99
    MFT[57] | 13.31 | 6.13 | 2.82 | 1.24 | 7.08 | 21.00
    SDVC[55] | 17.92 | 7.99 | 2.94 | 0.93 | 8.82 | 30.68

    表  10  部分基于3D卷积特征的模型在ActivityNet Captions数据集(验证集)上的性能表现 (%)

    Table  10  Performance (%) of a few popular models based on 3D visual sequential feature on ActivityNet Captions dataset (validation set)

    Methods (方法) | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr
    DCE[86] | 10.81 | 4.57 | 1.90 | 0.71 | 5.69 | 12.43
    DVC[87] | 12.22 | 5.72 | 2.27 | 0.73 | 6.93 | 12.61
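
The scores in the tables above are BLEU-n ("B-n"), METEOR, and CIDEr, all reported as percentages. As a rough illustration of how such a number is obtained, the hypothetical snippet below computes a corpus-level BLEU-4 score with NLTK over two made-up candidate/reference pairs; the surveyed papers themselves generally rely on the standard COCO caption evaluation toolkit rather than this code.

```python
# Illustration only (assumption: NLTK installed); not the evaluation code used by the
# surveyed papers. Computes corpus-level BLEU-4 for candidate captions against
# multiple tokenized references; the caption texts below are made up.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each candidate sentence is paired with a list of reference sentences (tokenized).
references = [
    [["a", "man", "is", "playing", "a", "guitar"],
     ["a", "person", "plays", "the", "guitar"]],
    [["a", "woman", "is", "slicing", "a", "tomato"],
     ["someone", "is", "cutting", "a", "tomato"]],
]
candidates = [
    ["a", "man", "plays", "a", "guitar"],
    ["a", "woman", "is", "cutting", "a", "tomato"],
]

# BLEU-4 uses uniform weights over 1- to 4-grams; smoothing avoids zero scores
# when short sentences share no 4-gram with any reference.
bleu4 = corpus_bleu(references, candidates,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {100 * bleu4:.1f}%")  # reported as a percentage, as in the tables above
```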
  • [1] Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91−110 doi: 10.1023/B:VISI.0000029664.99615.94
    [2] Dalal N, and Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2005. 886−893
    [3] Nagel H H. A vision of “vision and language” comprises action: An example from road traffic. Artificial Intelligence Review, 1994, 8: 189−214 doi: 10.1007/BF00849074
    [4] Kojima A, Tamura T, and Fukunaga K. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 2002, 50(2): 171−184 doi: 10.1023/A:1020346032608
    [5] Gupta A, Srinivasan P, Shi J, and Davis L S. Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2009. 2012−2019
    [6] Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, et al. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2013. 2712−2719
    [7] Rohrbach M, Qiu W, Titov I, Thater S, Pinkal M, and Schiele B. Translating video content to natural language descriptions. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2013. 433−440
    [8] Krizhevsky A, Sutskever I, and Hinton G E. Imagenet classification with deep convolutional neural networks. In: Proceedings of the Conference on Neural Information Processing Systems, Cambridge, USA: MIT Press, 2012. 1097−1105
    [9] Simonyan K, and Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representation, Banff AB Canada, 2014
    [10] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 1−9
    [11] He K, Zhang X, Ren S, and Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 770−778
    [12] 胡建芳, 王熊辉, 郑伟诗, 赖剑煌. RGB-D行为识别研究进展及展望. 自动化学报, 2019, 45(5): 829−840

    Hu Jian-Fang, Wang Xiong-Hui, Zheng Wei-Shi, and Lai Jian-Huang. RGB-D action recognition: Recent advances and future perspectives. Acta Automatica Sinica, 2019, 45(5): 829−840
    [13] 周波, 李俊峰. 结合目标检测的人体行为识别. 自动化学报, 2020, 46(9): 1961−1970

    Zhou Bo, Li Jun-Feng. Human action recognition combined with object detection. Acta Automatica Sinica, 2020, 46(9): 1961−1970
    [14] Wu J, Wang L, Wang L, Guo J, and Wu G. Learning actor relation graphs for group activity recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 9956−9966
    [15] Ji S, Xu W, Yang M, and Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221−231 doi: 10.1109/TPAMI.2012.59
    [16] Tran D, Bourdev L, Fergus R, Torresani L, and Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 4489−4497
    [17] Cho K, Merrienboer B, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, USA: ACL Press, 2014. 1724−1734
    [18] Xu K, Ba J L, Kiros R, Cho K, Courville A, et al. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, New York, USA: ACM Press, 2015. 2048−2057
    [19] Yao T, Pan Y, Li Y, Qiu Z, and Mei T. Boosting image captioning with attributes. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2017. 4904−4912
    [20] Aafaq N, Mian A, Liu W, Gilani S Z, and Shah M. Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys, 2019, 52(6): 115(1−37)
    [21] Li S, Tao Z, Li K, and Fu Y. Visual to text: Survey of image and video captioning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2019, 3(4): 297−312 doi: 10.1109/TETCI.2019.2892755
    [22] Xu R, Xiong C, Chen W, and Corso J J. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: Proceedings of the Association for the Advance of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2015. 2346−2352
    [23] Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, and Saenko K. Translating videos to natural language using deep recurrent neural networks. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics-Human Language Technologies, Stroudsburg, USA: ACL Press, 2015. 1494−1504
    [24] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115: 211−252 doi: 10.1007/s11263-015-0816-y
    [25] Pan Y, Mei T, Yao T, Li H, and Rui Y. Jointly modeling embedding and translation to bridge video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 4594−4602
    [26] Pan Y, Yao T, Li H, and Mei T. Video captioning with transferred semantic attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 984−992
    [27] 汤鹏杰, 谭云兰, 李金忠等. 密集帧率采样的视频标题生成. 计算机科学与探索, 2018, 12(6): 981−993 doi: 10.3778/j.issn.1673-9418.1705058

    Tang Peng-Jie, Tan Yun-Lan, Li Jin-Zhong, and Tan Bin. Dense frame rate sampling based model for video caption generation. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 981−993 doi: 10.3778/j.issn.1673-9418.1705058
    [28] Dalal N, Triggs B, and Schmid C. Human detection using oriented histograms of flow and appearance. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2006. 428−441
    [29] Wang H, Kläser A, Schmid C, and Liu C-L. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 2013, 103: 60−79 doi: 10.1007/s11263-012-0594-8
    [30] Wang H, and Schmid C. Action recognition with improved trajectories. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2013. 3551−3558
    [31] Simonyan K, and Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the Conference on Neural Information Processing Systems, Cambridge, USA: MIT Press, 2014. 568−576
    [32] Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, and Saenko K. Sequence to sequence-video to text. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 4534−4542
    [33] Venugopalan S, Hendricks LA, Mooney R, and Saenko K. Improving lstm-based video description with linguistic knowledge mined from text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, USA: ACL Press, 2016. 1961−1966
    [34] Tang P, Wang H, and Li Q. Rich visual and language representation with complementary semantics for video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15: 1−23
    [35] Bin Y, Yang Y, Shen F, Xu X, and Shen H T. Bidirectional long-short term memory for video description. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2016. 436−440
    [36] Pasunuru R, and Bansal M. Multi-task video captioning with video and entailment generation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, Stroudsburg, USA: ACL Press, 2017. 1273−1283
    [37] Li L, and Gong B. End-to-end video captioning with multitask reinforcement learning. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Piscataway, USA: IEEE Press, 2019. 339−348
    [38] Pan P, Xu Z, Yang Y, Wu F, and Zhuang Y. Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 1029−1038
    [39] Baraldi L, Grana C, and Cucchiara R. Hierarchical boundary-aware neural encoder for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 3185−3194
    [40] Xu J, Yao T, Zhang Y, and Mei T. Learning multimodal attention LSTM networks for video captioning. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2017. 537−545
    [41] Song J, Guo Z, Gao L, Liu W, Zhang D, and Shen H T. Hierarchical LSTM with adjusted temporal attention for video captioning. In: Proceedings of the International Joint Conference on Artificial Intelligence, San Francisco, USA: Morgan Kaufmann Press, 2017. 2737−2743
    [42] Li W, Guo D, and Fang X. Multimodal architecture for video captioning with memory networks and an attention mechanism. Pattern Recognition Letters, 2018, 105: 23−29 doi: 10.1016/j.patrec.2017.10.012
    [43] Chen S, and Jiang Y-G. Motion guided spatial attention for video captioning. In: Proceedings of the Association for the Advance of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2019. 8191−8198
    [44] Zhang J, and Peng Y. Object-aware aggregation with bidirectional temporal graph for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 8327−8336
    [45] Zhang J, and Peng Y. Video captioning with object-aware spatio-temporal correlation and aggregation. IEEE Transactions on Image Processing, 2020, 29: 6209−6222 doi: 10.1109/TIP.2020.2988435
    [46] Wang B, Ma L, Zhang W, and Liu W. Reconstruction network for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7622−7631
    [47] Zhang W, Wang B, Ma L, and Liu W. Reconstruct and represent video contents for captioning via reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 42(12): 3088−3101
    [48] Yao L, Torabi A, Cho K, Ballas N, Pal C, et al. Describing videos by exploiting temporal structure. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 4507−4515
    [49] Shen Z, Li J, Su Z, Li M, Chen Y, Jiang Y-G, and Xue X. Weakly supervised dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 5159−5167
    [50] Johnson J, Karpathy A, and Li F-F. DenseCap: Fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 4565−4574
    [51] Wang J, Jiang W, Ma L, Liu W, and Xu Y. Bidirectional attentive fusion with context gating for dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7190−7198
    [52] Zhou L, Zhou Y, Corso J J, Socher R, and Xiong C. End-to-end dense video captioning with masked transformer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 8739−8748
    [53] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. Attention is all you need. In: Proceedings of the Conference on Neural Information Processing Systems, Cambridge, USA: MIT Press, 2017. 5998−6008
    [54] Zhou L, Kalantidis Y, Chen X, Corso J J, and Rohrbach M. Grounded video description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6578−6587
    [55] Mun J, Yang L, Zhou Z, Xu N, and Han B. Streamlined dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6588−6597
    [56] Wang X, Chen W, Wu J, Wang Y-F, and Wang W-Y. Video captioning via hierarchical reinforcement learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 4213−4222
    [57] Xiong Y, Dai B, and Lin D. Move forward and tell: A progressive generator of video descriptions. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 468−483
    [58] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R and Li F-F. Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2014. 1725−1732
    [59] Heilbron F C, Escorcia V, Ghanem B, and Niebles J C. ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 961−970
    [60] Shetty R, and Laaksonen J. Frame- and segment-level features and candidate pool evaluation for video caption generation. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2016. 1073−1076
    [61] Yu Y, Choi J, Kim Y, Yoo K, Lee S-H, and Kim G. Supervising neural attention models for video captioning by human gaze data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 490−498
    [62] Wang J, Wang W, Huang Y, Wang L, and Tan T. M3: Multimodal memory modeling for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7512−7520
    [63] Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, et al. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 2625−2634
    [64] Tang P, Wang H, and Kwong S. Deep sequential fusion LSTM network for image description. Neurocomputing, 2018, 312: 154−164 doi: 10.1016/j.neucom.2018.05.086
    [65] Pei W, Zhang J, Wang X, Ke L, Shen X, and Tai Y-W. Memory-attended recurrent network for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 8347−8356
    [66] Li X, Zhao B, and Lu X. Mam-RNN: Multi-level attention model based RNN for video captioning. In: Proceedings of the International Joint Conference on Artificial Intelligence, San Francisco, USA: Morgan Kaufmann Press, 2017. 2208−2214
    [67] Zhao B, Li X, and Lu X. Cam-RNN: Co-attention model based RNN for video captioning. IEEE Transactions on Image Processing, 2019, 28(11): 5552−5565 doi: 10.1109/TIP.2019.2916757
    [68] Chen S, Jin Q, Chen J, and Hauptmann A G. Generating video descriptions with latent topic guidance. IEEE Transactions on Multimedia, 2019, 21(9): 2407−2418 doi: 10.1109/TMM.2019.2896515
    [69] Gan C, Gan Z, He X, Gao J, and Deng L. StyleNet: Generating attractive visual captions with styles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 955−964
    [70] Pan B, Cai H, Huang D, Lee K-H, Gaidon A, Adeli E, and Niebles J C. Spatio-temporal graph for video captioning with knowledge distillation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2020. 10867−10876
    [71] Hemalatha M and Chandra Sekhar C. Domain-specific semantics guided approach to video captioning. In: Proceedings of the 2020 Winter Conference on Applications of Computer Vision, Piscataway, USA: IEEE Press, 2020. 1587−1596
    [72] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 6299−6308
    [73] Cherian A, Wang J, Hori C, Marks T M. Spatio-temporal ranked-attention networks for video captioning. In: Proceedings of the 2020 Winter Conference on Applications of Computer Vision, Piscataway, USA: IEEE Press, 2020. 1617−1626
    [74] Wang L, Shang C, Qiu H, Zhao T, Qiu B, and Li H. Multi-stage tag guidance network in video caption. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2020. 4610−4614
    [75] Hou J, Wu X, Zhao W, Luo J, and Jia Y. Joint syntax representation learning and visual cue translation for video captioning. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2020. 8918−8927
    [76] Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, and Zha Z. Object relational graph with teacher-recommended learning for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2020. 13278−13288
    [77] Zheng Q, Wang C, and Tao D. Syntax-aware action targeting for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2020. 13096−13105
    [78] Hou J, Wu X, Zhang X, Qi Y, Jia Y, and Luo J. Joint commonsense and relation reasoning for image and video captioning. In: Proceedings of the Association for the Advance of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2020. 10973−10980
    [79] Chen J, Pan Y, Li Y, Yao T, Chao H, and Mei T. Temporal deformable convolutional encoder-decoder networks for video captioning. In: Proceedings of the Association for the Advance of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2019. 8167−8174
    [80] Liu S, Ren Z, and Yuan J. SibNet: Sibling convolutional encoder for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, to be published
    [81] Aafaq N, Akhtar N, Liu W, Gilani S Z, and Mian A. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 12487−12496
    [82] Yu H, Wang J, Huang Z, Yang Y, and Xu W. Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 4584−4593
    [83] Iashin V, Rahtu E. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. In: Proceedings of the British Machine Vision Conference, Berlin, Germany: Springer Press, 2020
    [84] Park J, Darrell T, and Rohrbach A. Identity-aware multi-sentence video description. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2020. 360−378
    [85] Escorcia V, Heilbron F C, Niebles J C, and Ghanem B. DAPs: Deep action proposals for action understanding. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2016. 768−784
    [86] Krishna R, Hata K, Ren F, Li F-F, and Niebles J C. Dense-captioning events in videos. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2017. 706−715
    [87] Li Y, Yao T, Pan Y, Chao H, and Mei T. Jointly localizing and describing events for dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7492−7500
    [88] Wang T, Zheng H, Yu M, Tian Q, and Hu H. Event-centric hierarchical representation for dense video captioning. IEEE Transactions on Circuits and Systems for Video Technology, 2020, to be published
    [89] Park J S, Rohrbach M, Darrell T, and Rohrbach A. Adversarial inference for multi-sentence video description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6598−6608
    [90] Devlin J, Chang M-W, Lee K, and Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics-Human Language Technologies, Stroudsburg, USA: ACL Press, 2019. 4171−4186
    [91] Sun C, Myers A, Vondrick C, Murphy K, and Schmid C. VideoBERT: A joint model for video and language representation learning. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2019. 7464−7473
    [92] Xie S, Sun C, Huang J, Tu Z, and Murphy K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 318−35
    [93] Sun C, Baradel F, Murphy K, and Schmid C. Learning video representation using contrastive bidirectional transformer. [Online], available: https://arxiv.org/pdf/1906.05743.pdf, Dec. 18, 2020.
    [94] Luo H, Ji L, Huang H, Duan N, Li T, Li J, et al. UniVL: A unified video and language pre-training model for multimodal understanding and generation. [Online], available: https://arxiv.org/pdf/2002.06353.pdf, Dec. 18, 2020.
    [95] Mathews A P, Xie L, and He X. SentiCap: Generating image descriptions with sentiments. In: Proceedings of the Association for the Advance of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2016. 3574−3580
    [96] Guo L, Liu J, Yao P, Li J, and Lu H. MSCap: Multi-style image captioning with unpaired stylized text. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 4204−4213
    [97] Park C C, Kim B, and Kim G. Attend to you: Personalized image captioning with context sequence memory networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 895−903
    [98] Shuster K, Humeau S, Hu H, Bordes A, and Weston J. Engaging image captioning via personality. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 12516−12526
    [99] Chen T, Zhang Z, You Q, Fang C, Wang Z, Jin H, and Luo J. “Factual” or “Emotional”: Stylized image captioning with adaptive learning and attention. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 527−543
    [100] Zhao W, Wu X, and Zhang X. MemCap: Memorizing style knowledge for image captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence, New York, USA: AAAI Press, 2020. 12984−12992
    [101] Girshick R, Donahue J, Darrell T, and Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2014. 580−587
    [102] Girshick R. Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 1440−1448
    [103] Ren S, He K, Girshick R, and Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137−1149 doi: 10.1109/TPAMI.2016.2577031
    [104] Babenko A, and Lempitsky V. Aggregating deep convolutional features for image retrieval. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 1269−1277
    [105] Kalantidis Y, Mellina C, and Osindero S. Cross-dimensional weighting for aggregated deep convolutional features. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2016. 685−701
    [106] Papineni K, Roukos S, Ward T, and Zhu W-J. BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the Association for Computational Linguistics, Stroudsburg, USA: ACL Press, 2002. 311−318
    [107] Banerjee S, and Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the Association for Computational Linguistics Workshop, Stroudsburg, USA: ACL Press, 2005. 228−231
    [108] Lin C-Y, and Och F J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, Stroudsburg, USA: ACL Press, 2004. 21−26
    [109] Vedantam R, Zitnick C L, and Parikh D. CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 4566−4575
    [110] Anderson P, Fernando B, Johnson M, and Gould S. SPICE: Semantic propositional image caption evaluation. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2016. 382−398
    [111] Xu J, Mei T, Yao T, and Rui Y. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 5288−5296
    [112] Song J, Guo Y, Gao L, Li X, Hanjalic A, and Shen H-T. From deterministic to generative: Multimodal stochastic RNNs for video captioning. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(10): 3047−3058 doi: 10.1109/TNNLS.2018.2851077
    [113] Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, and Shen H-T. Video captioning by adversarial LSTM. IEEE Transactions on Image Processing, 2018, 27(11): 5600−5611 doi: 10.1109/TIP.2018.2855422
    [114] Chen Y, Wang S, Zhang W, and Huang Q. Less is more: Picking informative frames for video captioning. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 367−384
    [115] Thomason J, Venugopalan S, Guadarrama S, Saenko K, and Mooney R. Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of the International Conference on Computational Linguistics, Stroudsburg, USA: ACL Press, 2014. 1218−1227
    [116] Dong J, Li X, Lan W, Huo Y, and Snoek C G M. Early embedding and late reranking for video captioning. In: Proceedings of the ACM Conference on Multimedia, New York, USA: ACM Press, 2016. 1082−1086
    [117] Gao L, Guo Z, Zhang H, Xu X, and Shen H-T. Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 2017, 19(9): 2045−2055 doi: 10.1109/TMM.2017.2729019
    [118] Ramanishka V, Das A, Park D H, Venugopalan S, Hendricks L A, Rohrbach M, and Saenko K. Multimodal video description. In: Proceedings of the ACM Conference on Multimedia, New York, USA: ACM Press, 2016. 1092−1096
    [119] Jin Q, Chen J, Chen S, Xiong Y, and Hauptmann A. Describing videos using multi-modal fusion. In: Proceedings of the ACM Conference on Multimedia, New York, USA: ACM Press, 2016, 1087−1091
    [120] Zhou L, Xu C, and Corso J J. Towards automatic learning of procedures from web instructional videos. In: Proceedings of the AAAI Conference on Artificial Intelligence. Menlo Park, USA: AAAI Press, 2018. 7590−7598
    [121] Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 3156−3164
    [122] Zhang M, Yang Y, Zhang H, Ji Y, Shen H T, and Chua T-S. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Transactions on Image Processing, 2019, 28(1): 32−44 doi: 10.1109/TIP.2018.2855415
    [123] Yang L, Wang H, Tang P and Li Q. CaptionNet: A tailor-made recurrent neural network for generating image descriptions. IEEE Transactions on Multimedia, 2020, to be published
    [124] 汤鹏杰, 王瀚漓, 徐恺晟. LSTM逐层多目标优化及多层概率融合的图像描述. 自动化学报, 2018, 44(7): 1237−1249

    Tang Peng-Jie, Wang Han-Li, and Xu Kai-Sheng. Multi-objective layer-wise optimization and multi-level probability fusion for image description generation using LSTM. Acta Automatica Sinica, 2018, 44(7): 1237−1249
    [125] Li X, Jiang S, and Han J. Learning object context for dense captioning. In: Proceedings of the Association for the Advance of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2019. 8650−8657
    [126] Yin G, Sheng L, Liu B, Yu N, Wang X, and Shao J. Context and attribute grounded dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6241−6250
    [127] Kim D-J, Choi J, Oh T-H, and Kweon I S. Dense relational captioning: Triple-stream networks for relationship-based captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6271−6280
    [128] Chatterjee M, and Schwing A G. Diverse and coherent paragraph generation from images. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 729−744
    [129] Wang J, Pan Y, Yao T, Tang J, and Mei T. Convolutional auto-encoding of sentence topics for image paragraph generation. In: Proceedings of the International Joint Conference on Artificial Intelligence, San Francisco, USA: Morgan Kaufmann Press, 2019. 940−946
出版历程 (Publication history)
  • 收稿日期 (Received): 2020-08-17
  • 录用日期 (Accepted): 2020-12-14
  • 网络出版日期 (Available online): 2021-01-20
