From Video to Language: Survey of Video Captioning and Description
-
Abstract: Video captioning and description aims to summarize and re-express the visual content of a video in natural language. The task is challenging because it requires transforming information across modalities, and visual data and language are heterogeneous. This survey focuses on models built on the "encoder-decoder" pipeline. According to how visual features are encoded and used, current models are classified into four types: models based on mean/max pooling of visual features, models based on sequential memory modeling of the video, models based on 3D CNN features, and models based on hybrid features. Representative works of each type are described and analyzed. Finally, the existing problems and promising research directions are summarized: prior knowledge such as emotion and logical semantics in complex videos should be further mined and embedded so that coherent, logically structured paragraph descriptions can be generated, and model optimization, dataset construction, and evaluation metrics for video captioning and description still deserve deeper investigation.
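To make the first category of the taxonomy concrete, the simplest encoder-decoder design pools per-frame CNN features over time (mean or max) and conditions a recurrent language decoder on the pooled vector. The following is only a minimal illustrative sketch, not the implementation of any surveyed model; the feature dimension (2048, e.g. ResNet features), vocabulary size, and all layer sizes are assumptions.

# Minimal sketch of a mean-pooling encoder-decoder video captioner (illustrative only).
# Frame features are assumed to have been extracted beforehand with a 2D CNN.
import torch
import torch.nn as nn

class MeanPoolCaptioner(nn.Module):
    def __init__(self, feature_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.visual_proj = nn.Linear(feature_dim, hidden_dim)            # encode pooled visual feature
        self.embed = nn.Embedding(vocab_size, embed_dim)                 # word embedding
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # language decoder
        self.classifier = nn.Linear(hidden_dim, vocab_size)              # next-word prediction

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feature_dim); captions: (batch, seq_len) word ids
        video_vec = frame_feats.mean(dim=1)                              # temporal mean pooling
        h0 = torch.tanh(self.visual_proj(video_vec)).unsqueeze(0)        # init decoder state with video
        c0 = torch.zeros_like(h0)
        words = self.embed(captions)                                     # (batch, seq_len, embed_dim)
        out, _ = self.decoder(words, (h0, c0))
        return self.classifier(out)                                      # (batch, seq_len, vocab_size)

# Usage sketch with assumed shapes: 8 videos, 20 sampled frames, captions of length 12.
model = MeanPoolCaptioner()
logits = model(torch.randn(8, 20, 2048), torch.randint(0, 10000, (8, 12)))
print(logits.shape)  # torch.Size([8, 12, 10000])

The other three categories of the taxonomy replace the mean/max pooling step with an RNN over the frame sequence, a 3D convolutional encoder, or a combination of such features.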
-
Table 1 Performance (%) of a few popular models based on visual sequential feature with mean/max pooling on MSVD
Table 4 Performance (%) of a few other popular models on MSVD
Table 2 Performance (%) of a few popular models based on visual sequential feature with RNN on MSVD
Methods                        B-1    B-2    B-3    B-4    METEOR    CIDEr
S2VT[32]                       —      —      —      —      29.8      —
Res-F2F (G-R101-152)[34]       82.8   71.7   62.4   52.4   35.7      84.3
Joint-BiLSTM reinforced[35]    —      —      —      —      30.3      —
HRNE with attention[38]        79.2   66.3   55.1   43.8   33.1      —
Boundary-aware encoder[39]     —      —      —      42.5   32.4      63.5
hLSTMat[41]                    82.9   72.2   63.0   53.0   33.6      —
Li et al.[42]                  —      —      —      48.0   31.6      68.8
MGSA(I+C)[43]                  —      —      —      53.4   35.0      86.7
LSTM-GAN[113]                  —      —      —      42.9   30.4      —
PickNet (V+L+C)[114]           —      —      —      52.3   33.3      76.5

Table 3 Performance (%) of a few popular models based on 3D visual feature on MSVD
Table 5 Performance (%) of visual sequential feature based models with mean/max pooling on MSR-VTT2016
Table 8 Performance (%) of other popular models on MSR-VTT2016
Table 6 Performance (%) of a few popular models based on visual sequential feature with RNN on MSR-VTT2016
Methods                        B-1    B-2    B-3    B-4    METEOR    CIDEr
Res-F2F (G-R101-152)[34]       81.1   67.2   53.7   41.4   29.0      48.9
hLSTMat[41]                    —      —      —      38.3   26.3      —
Li et al.[42]                  76.1   62.1   49.1   37.5   26.4      —
MGSA(I+A+C)[43]                —      —      —      45.4   28.6      50.1
LSTM-GAN[113]                  —      —      —      36.0   26.1      —
aLSTM[117]                     —      —      —      38.0   26.1      —
VideoLAB[118]                  —      —      —      39.5   27.7      44.2
PickNet (V+L+C)[114]           —      —      —      41.3   27.7      44.1
DenseVidCap[49]                —      —      —      44.2   29.4      50.5
ETS(Local+Global)[48]          77.8   62.2   48.1   37.1   28.4      —

Table 7 Performance (%) of a few popular models based on 3D visual sequential feature on MSR-VTT2016
Table 9 Performance (%) of a few popular models based on visual sequential feature with RNN on ActivityNet Captions dataset (validation set)
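In the tables above, B-1 to B-4 denote BLEU-1 to BLEU-4 [106], METEOR is the metric of [107], and CIDEr is the consensus-based metric of [109]; all values are reported as percentages. As a rough illustration of how a BLEU score relates a generated caption to its reference captions, a minimal sentence-level example using NLTK is given below; the captions are invented for demonstration only, and published results are computed at corpus level over the whole test set with standard evaluation toolkits.

# Illustrative sentence-level BLEU computation with NLTK (example captions are invented).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is playing a guitar on stage".split(),
    "someone plays the guitar".split(),
]
candidate = "a man is playing the guitar".split()

smooth = SmoothingFunction().method1
# weights=(1, 0, 0, 0) corresponds to BLEU-1 (B-1); (0.25, 0.25, 0.25, 0.25) to BLEU-4 (B-4).
bleu1 = sentence_bleu(references, candidate, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1 = {100 * bleu1:.1f}%, BLEU-4 = {100 * bleu4:.1f}%")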
-
[1] Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91−110 doi: 10.1023/B:VISI.0000029664.99615.94
[2] Dalal N, and Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2005. 886−893
[3] Nagel H H. A vision of "vision and language" comprises action: An example from road traffic. Artificial Intelligence Review, 1994, 8: 189−214 doi: 10.1007/BF00849074
[4] Kojima A, Tamura T, and Fukunaga K. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 2002, 50(2): 171−184 doi: 10.1023/A:1020346032608
[5] Gupta A, Srinivasan P, Shi J, and Davis L S. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2009. 2012−2019
[6] Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, et al. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2013. 2712−2719
[7] Rohrbach M, Qiu W, Titov I, Thater S, Pinkal M, and Schiele B. Translating video content to natural language descriptions. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2013. 433−440
[8] Krizhevsky A, Sutskever I, and Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the Conference on Neural Information Processing Systems, Cambridge, USA: MIT Press, 2012. 1097−1105
[9] Simonyan K, and Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 2014
[10] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 1−9
[11] He K, Zhang X, Ren S, and Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 770−778
[12] Hu Jian-Fang, Wang Xiong-Hui, Zheng Wei-Shi, and Lai Jian-Huang. RGB-D action recognition: Recent advances and future perspectives. Acta Automatica Sinica, 2019, 45(5): 829−840 (in Chinese)
[13] Zhou Bo, and Li Jun-Feng. Human action recognition combined with object detection. Acta Automatica Sinica, 2020, 46(9): 1961−1970 (in Chinese)
[14] Wu J, Wang L, Wang L, Guo J, and Wu G. Learning actor relation graphs for group activity recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 9956−9966
[15] Ji S, Xu W, Yang M, and Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221−231 doi: 10.1109/TPAMI.2012.59
[16] Tran D, Bourdev L, Fergus R, Torresani L, and Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 4489−4497
[17] Cho K, Merrienboer B, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, USA: ACL Press, 2014. 1724−1734
[18] Xu K, Ba J L, Kiros R, Cho K, Courville A, et al. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, New York, USA: ACM Press, 2015. 2048−2057
[19] Yao T, Pan Y, Li Y, Qiu Z, and Mei T. Boosting image captioning with attributes. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2017. 4904−4912
[20] Aafaq N, Mian A, Liu W, Gilani S Z, and Shah M. Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys, 2019, 52(6): 115(1−37)
[21] Li S, Tao Z, Li K, and Fu Y. Visual to text: Survey of image and video captioning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2019, 3(4): 297−312 doi: 10.1109/TETCI.2019.2892755
[22] Xu R, Xiong C, Chen W, and Corso J J. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: Proceedings of the Association for the Advancement of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2015. 2346−2352
[23] Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, and Saenko K. Translating videos to natural language using deep recurrent neural networks. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics-Human Language Technologies, Stroudsburg, USA: ACL Press, 2015. 1494−1504
[24] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115: 211−252 doi: 10.1007/s11263-015-0816-y
[25] Pan Y, Mei T, Yao T, Li H, and Rui Y. Jointly modeling embedding and translation to bridge video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 4594−4602
[26] Pan Y, Yao T, Li H, and Mei T. Video captioning with transferred semantic attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 984−992
[27] Tang Peng-Jie, Tan Yun-Lan, Li Jin-Zhong, and Tan Bin. Dense frame rate sampling based model for video caption generation. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 981−993 (in Chinese) doi: 10.3778/j.issn.1673-9418.1705058
[28] Dalal N, Triggs B, and Schmid C. Human detection using oriented histograms of flow and appearance. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2006. 428−441
[29] Wang H, Kläser A, Schmid C, and Liu C-L. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 2013, 103: 60−79 doi: 10.1007/s11263-012-0594-8
[30] Wang H, and Schmid C. Action recognition with improved trajectories. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2013. 3551−3558
[31] Simonyan K, and Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the Conference on Neural Information Processing Systems, Cambridge, USA: MIT Press, 2014. 568−576
[32] Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, and Saenko K. Sequence to sequence - video to text. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 4534−4542
[33] Venugopalan S, Hendricks L A, Mooney R, and Saenko K. Improving LSTM-based video description with linguistic knowledge mined from text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, USA: ACL Press, 2016. 1961−1966
[34] Tang P, Wang H, and Li Q. Rich visual and language representation with complementary semantics for video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15: 1−23
[35] Bin Y, Yang Y, Shen F, Xu X, and Shen H T. Bidirectional long-short term memory for video description. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2016. 436−440
[36] Pasunuru R, and Bansal M. Multi-task video captioning with video and entailment generation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, Stroudsburg, USA: ACL Press, 2017. 1273−1283
[37] Li L, and Gong B. End-to-end video captioning with multitask reinforcement learning. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Piscataway, USA: IEEE Press, 2019. 339−348
[38] Pan P, Xu Z, Yang Y, Wu F, and Zhuang Y. Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 1029−1038
[39] Baraldi L, Grana C, and Cucchiara R. Hierarchical boundary-aware neural encoder for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 3185−3194
[40] Xu J, Yao T, Zhang Y, and Mei T. Learning multimodal attention LSTM networks for video captioning. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2017. 537−545
[41] Song J, Guo Z, Gao L, Liu W, Zhang D, and Shen H T. Hierarchical LSTM with adjusted temporal attention for video captioning. In: Proceedings of the International Joint Conference on Artificial Intelligence, San Francisco, USA: Morgan Kaufmann Press, 2017. 2737−2743
[42] Li W, Guo D, and Fang X. Multimodal architecture for video captioning with memory networks and an attention mechanism. Pattern Recognition Letters, 2018, 105: 23−29 doi: 10.1016/j.patrec.2017.10.012
[43] Chen S, and Jiang Y-G. Motion guided spatial attention for video captioning. In: Proceedings of the Association for the Advancement of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2019. 8191−8198
[44] Zhang J, and Peng Y. Object-aware aggregation with bidirectional temporal graph for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 8327−8336
[45] Zhang J, and Peng Y. Video captioning with object-aware spatio-temporal correlation and aggregation. IEEE Transactions on Image Processing, 2020, 29: 6209−6222 doi: 10.1109/TIP.2020.2988435
[46] Wang B, Ma L, Zhang W, and Liu W. Reconstruction network for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7622−7631
[47] Zhang W, Wang B, Ma L, and Liu W. Reconstruct and represent video contents for captioning via reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 42(12): 3088−3101
[48] Yao L, Torabi A, Cho K, Ballas N, Pal C, et al. Describing videos by exploiting temporal structure. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 4507−4515
[49] Shen Z, Li J, Su Z, Li M, Chen Y, Jiang Y-G, and Xue X. Weakly supervised dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 5159−5167
[50] Johnson J, Karpathy A, and Li F-F. DenseCap: Fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 4565−4574
[51] Wang J, Jiang W, Ma L, Liu W, and Xu Y. Bidirectional attentive fusion with context gating for dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7190−7198
[52] Zhou L, Zhou Y, Corso J J, Socher R, and Xiong C. End-to-end dense video captioning with masked transformer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 8739−8748
[53] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. Attention is all you need. In: Proceedings of the Conference on Neural Information Processing Systems, Cambridge, USA: MIT Press, 2017. 5998−6008
[54] Zhou L, Kalantidis Y, Chen X, Corso J J, and Rohrbach M. Grounded video description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6578−6587
[55] Mun J, Yang L, Zhou Z, Xu N, and Han B. Streamlined dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6588−6597
[56] Wang X, Chen W, Wu J, Wang Y-F, and Wang W-Y. Video captioning via hierarchical reinforcement learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 4213−4222
[57] Xiong Y, Dai B, and Lin D. Move forward and tell: A progressive generator of video descriptions. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 468−483
[58] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, and Li F-F. Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2014. 1725−1732
[59] Heilbron F C, Escorcia V, Ghanem B, and Niebles J C. ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 961−970
[60] Shetty R, and Laaksonen J. Frame- and segment-level features and candidate pool evaluation for video caption generation. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2016. 1073−1076
[61] Yu Y, Choi J, Kim Y, Yoo K, Lee S-H, and Kim G. Supervising neural attention models for video captioning by human gaze data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 490−498
[62] Wang J, Wang W, Huang Y, Wang L, and Tan T. M3: Multimodal memory modeling for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7512−7520
[63] Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, et al. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 2625−2634
[64] Tang P, Wang H, and Kwong S. Deep sequential fusion LSTM network for image description. Neurocomputing, 2018, 312: 154−164 doi: 10.1016/j.neucom.2018.05.086
[65] Pei W, Zhang J, Wang X, Ke L, Shen X, and Tai Y-W. Memory-attended recurrent network for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 8347−8356
[66] Li X, Zhao B, and Lu X. MAM-RNN: Multi-level attention model based RNN for video captioning. In: Proceedings of the International Joint Conference on Artificial Intelligence, San Francisco, USA: Morgan Kaufmann Press, 2017. 2208−2214
[67] Zhao B, Li X, and Lu X. CAM-RNN: Co-attention model based RNN for video captioning. IEEE Transactions on Image Processing, 2019, 28(11): 5552−5565 doi: 10.1109/TIP.2019.2916757
[68] Chen S, Jin Q, Chen J, and Hauptmann A G. Generating video descriptions with latent topic guidance. IEEE Transactions on Multimedia, 2019, 21(9): 2407−2418 doi: 10.1109/TMM.2019.2896515
[69] Gan C, Gan Z, He X, Gao J, and Deng L. StyleNet: Generating attractive visual captions with styles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 955−964
[70] Pan B, Cai H, Huang D, Lee K-H, Gaidon A, Adeli E, and Niebles J C. Spatio-temporal graph for video captioning with knowledge distillation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2020. 10867−10876
[71] Hemalatha M, and Chandra Sekhar C. Domain-specific semantics guided approach to video captioning. In: Proceedings of the 2020 Winter Conference on Applications of Computer Vision, Piscataway, USA: IEEE Press, 2020. 1587−1596
[72] Carreira J, and Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 6299−6308
[73] Cherian A, Wang J, Hori C, and Marks T M. Spatio-temporal ranked-attention networks for video captioning. In: Proceedings of the 2020 Winter Conference on Applications of Computer Vision, Piscataway, USA: IEEE Press, 2020. 1617−1626
[74] Wang L, Shang C, Qiu H, Zhao T, Qiu B, and Li H. Multi-stage tag guidance network in video caption. In: Proceedings of the ACM International Conference on Multimedia, New York, USA: ACM Press, 2020. 4610−4614
[75] Hou J, Wu X, Zhao W, Luo J, and Jia Y. Joint syntax representation learning and visual cue translation for video captioning. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2020. 8918−8927
[76] Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, and Zha Z. Object relational graph with teacher-recommended learning for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2020. 13278−13288
[77] Zheng Q, Wang C, and Tao D. Syntax-aware action targeting for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2020. 13096−13105
[78] Hou J, Wu X, Zhang X, Qi Y, Jia Y, and Luo J. Joint commonsense and relation reasoning for image and video captioning. In: Proceedings of the Association for the Advancement of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2020. 10973−10980
[79] Chen J, Pan Y, Li Y, Yao T, Chao H, and Mei T. Temporal deformable convolutional encoder-decoder networks for video captioning. In: Proceedings of the Association for the Advancement of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2019. 8167−8174
[80] Liu S, Ren Z, and Yuan J. SibNet: Sibling convolutional encoder for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, to be published
[81] Aafaq N, Akhtar N, Liu W, Gilani S Z, and Mian A. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 12487−12496
[82] Yu H, Wang J, Huang Z, Yang Y, and Xu W. Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 4584−4593
[83] Iashin V, and Rahtu E. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. In: Proceedings of the British Machine Vision Conference, 2020
[84] Park J, Darrell T, and Rohrbach A. Identity-aware multi-sentence video description. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2020. 360−378
[85] Escorcia V, Heilbron F C, Niebles J C, and Ghanem B. DAPs: Deep action proposals for action understanding. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2016. 768−784
[86] Krishna R, Hata K, Ren F, Li F-F, and Niebles J C. Dense-captioning events in videos. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2017. 706−715
[87] Li Y, Yao T, Pan Y, Chao H, and Mei T. Jointly localizing and describing events for dense video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2018. 7492−7500
[88] Wang T, Zheng H, Yu M, Tian Q, and Hu H. Event-centric hierarchical representation for dense video captioning. IEEE Transactions on Circuits and Systems for Video Technology, 2020, to be published
[89] Park J S, Rohrbach M, Darrell T, and Rohrbach A. Adversarial inference for multi-sentence video description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6598−6608
[90] Devlin J, Chang M-W, Lee K, and Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics-Human Language Technologies, Stroudsburg, USA: ACL Press, 2019. 4171−4186
[91] Sun C, Myers A, Vondrick C, Murphy K, and Schmid C. VideoBERT: A joint model for video and language representation learning. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2019. 7464−7473
[92] Xie S, Sun C, Huang J, Tu Z, and Murphy K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 318−35
[93] Sun C, Baradel F, Murphy K, and Schmid C. Learning video representation using contrastive bidirectional transformer. [Online], available: https://arxiv.org/pdf/1906.05743.pdf, Dec. 18, 2020
[94] Luo H, Ji L, Huang H, Duan N, Li T, Li J, et al. UniVL: A unified video and language pre-training model for multimodal understanding and generation. [Online], available: https://arxiv.org/pdf/2002.06353.pdf, Dec. 18, 2020
[95] Mathews A P, Xie L, and He X. SentiCap: Generating image descriptions with sentiments. In: Proceedings of the Association for the Advancement of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2016. 3574−3580
[96] Guo L, Liu J, Yao P, Li J, and Lu H. MSCap: Multi-style image captioning with unpaired stylized text. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 4204−4213
[97] Park C C, Kim B, and Kim G. Attend to you: Personalized image captioning with context sequence memory networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2017. 895−903
[98] Shuster K, Humeau S, Hu H, Bordes A, and Weston J. Engaging image captioning via personality. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 12516−12526
[99] Chen T, Zhang Z, You Q, Fang C, Wang Z, Jin H, and Luo J. "Factual" or "Emotional": Stylized image captioning with adaptive learning and attention. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 527−543
[100] Zhao W, Wu X, and Zhang X. MemCap: Memorizing style knowledge for image captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence, New York, USA, 2020. 12984−12992
[101] Girshick R, Donahue J, Darrell T, and Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2014. 580−587
[102] Girshick R. Fast R-CNN. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 1440−1448
[103] Ren S, He K, Girshick R, and Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137−1149 doi: 10.1109/TPAMI.2016.2577031
[104] Babenko A, and Lempitsky V. Aggregating deep convolutional features for image retrieval. In: Proceedings of the International Conference on Computer Vision, Piscataway, USA: IEEE Press, 2015. 1269−1277
[105] Kalantidis Y, Mellina C, and Osindero S. Cross-dimensional weighting for aggregated deep convolutional features. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2016. 685−701
[106] Papineni K, Roukos S, Ward T, and Zhu W-J. BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, Stroudsburg, USA: ACL Press, 2002. 311−318
[107] Banerjee S, and Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the Association for Computational Linguistics Workshop, Stroudsburg, USA: ACL Press, 2005. 228−231
[108] Lin C-Y, and Och F J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, Stroudsburg, USA: ACL Press, 2004. 21−26
[109] Vedantam R, Zitnick C L, and Parikh D. CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 4566−4575
[110] Anderson P, Fernando B, Johnson M, and Gould S. SPICE: Semantic propositional image caption evaluation. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2016. 382−398
[111] Xu J, Mei T, Yao T, and Rui Y. MSR-VTT: A large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2016. 5288−5296
[112] Song J, Guo Y, Gao L, Li X, Hanjalic A, and Shen H-T. From deterministic to generative: Multimodal stochastic RNNs for video captioning. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(10): 3047−3058 doi: 10.1109/TNNLS.2018.2851077
[113] Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, and Shen H-T. Video captioning by adversarial LSTM. IEEE Transactions on Image Processing, 2018, 27(11): 5600−5611 doi: 10.1109/TIP.2018.2855422
[114] Chen Y, Wang S, Zhang W, and Huang Q. Less is more: Picking informative frames for video captioning. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 367−384
[115] Thomason J, Venugopalan S, Guadarrama S, Saenko K, and Mooney R. Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of the International Conference on Computational Linguistics, Stroudsburg, USA: ACL Press, 2014. 1218−1227
[116] Dong J, Li X, Lan W, Huo Y, and Snoek C G M. Early embedding and late reranking for video captioning. In: Proceedings of the ACM Conference on Multimedia, New York, USA: ACM Press, 2016. 1082−1086
[117] Gao L, Guo Z, Zhang H, Xu X, and Shen H-T. Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 2017, 19(9): 2045−2055 doi: 10.1109/TMM.2017.2729019
[118] Ramanishka V, Das A, Park D H, Venugopalan S, Hendricks L A, Rohrbach M, and Saenko K. Multimodal video description. In: Proceedings of the ACM Conference on Multimedia, New York, USA: ACM Press, 2016. 1092−1096
[119] Jin Q, Chen J, Chen S, Xiong Y, and Hauptmann A. Describing videos using multi-modal fusion. In: Proceedings of the ACM Conference on Multimedia, New York, USA: ACM Press, 2016. 1087−1091
[120] Zhou L, Xu C, and Corso J J. Towards automatic learning of procedures from web instructional videos. In: Proceedings of the AAAI Conference on Artificial Intelligence, Menlo Park, USA: AAAI Press, 2018. 7590−7598
[121] Vinyals O, Toshev A, Bengio S, and Erhan D. Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2015. 3156−3164
[122] Zhang M, Yang Y, Zhang H, Ji Y, Shen H T, and Chua T-S. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Transactions on Image Processing, 2019, 28(1): 32−44 doi: 10.1109/TIP.2018.2855415
[123] Yang L, Wang H, Tang P, and Li Q. CaptionNet: A tailor-made recurrent neural network for generating image descriptions. IEEE Transactions on Multimedia, 2020, to be published
[124] Tang Peng-Jie, Wang Han-Li, and Xu Kai-Sheng. Multi-objective layer-wise optimization and multi-level probability fusion for image description generation using LSTM. Acta Automatica Sinica, 2018, 44(7): 1237−1249 (in Chinese)
[125] Li X, Jiang S, and Han J. Learning object context for dense captioning. In: Proceedings of the Association for the Advancement of Artificial Intelligence, Menlo Park, USA: AAAI Press, 2019. 8650−8657
[126] Yin G, Sheng L, Liu B, Yu N, Wang X, and Shao J. Context and attribute grounded dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6241−6250
[127] Kim D-J, Choi J, Oh T-H, and Kweon I S. Dense relational captioning: Triple-stream networks for relationship-based captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA: IEEE Press, 2019. 6271−6280
[128] Chatterjee M, and Schwing A G. Diverse and coherent paragraph generation from images. In: Proceedings of the European Conference on Computer Vision, Berlin, Germany: Springer Press, 2018. 729−744
[129] Wang J, Pan Y, Yao T, Tang J, and Mei T. Convolutional auto-encoding of sentence topics for image paragraph generation. In: Proceedings of the International Joint Conference on Artificial Intelligence, San Francisco, USA: Morgan Kaufmann Press, 2019. 940−946
-
