-
Abstract: The task of video captioning and description is to summarize and re-express the visual content of a video with natural language. It is challenging because it involves transforming information across modalities, and there is heterogeneity between visual data and language. This work elaborates on models based on the "encoder-decoder" pipeline. According to how visual features are encoded and used, current models are classified into four types: models based on mean/max-pooled features, models based on video sequential memory, models based on 3D CNN features, and models based on hybrid features. A number of popular works of each type are described and analyzed. Finally, the existing problems and possible trends worth studying are summarized. It is pointed out that prior knowledge, including emotion and logical semantics in complex videos, should be further mined and embedded to generate logically structured paragraph descriptions. Moreover, further investigation is still needed into model optimization, dataset construction, and evaluation metrics for video captioning and description.
-
As a key component of autonomous driving systems, environment perception plays a central role in how a vehicle understands and interacts with its surroundings. In real-world driving scenarios, however, perception still faces key technical challenges such as insufficient accuracy and poor real-time performance in complex scenes. Driving-environment perception mainly comprises object detection and semantic segmentation [1]. Semantic segmentation interprets the captured scene at the pixel level; compared with object detection, it yields richer perceptual information, and the segmentation results can further be used to recognize and detect visual elements in the scene and support the decisions of the perception system. At present, most public image segmentation datasets and semantic segmentation networks are built on visible-light images. Visible-light images record rich color and texture cues, but their quality degrades sharply under insufficient or abnormal illumination (e.g., oncoming headlights in the dark), so networks can no longer segment objects correctly, which in turn degrades the accuracy of the perception system in such environments. Unlike visible-light cameras, thermal infrared cameras acquire infrared radiation information by sensing the heat emitted by objects and are therefore more robust to changes in lighting and weather; their drawback is that thermal images carry less information and appear blurry. Relying on a single sensor is thus insufficient for accurately segmenting scenes under varying conditions. This paper studies complex-scene segmentation in driving environments based on visible-light and thermal infrared images, and explores deep learning techniques that exploit the complementary information between the two sensors to improve segmentation performance, so that a vehicle can fully perceive its surroundings.
Scene segmentation, as a basic technical requirement of driving-environment perception, has long attracted researchers' attention. Most existing work focuses on visible-light images, and segmentation methods have evolved from early hand-crafted approaches based on thresholds, regions, and edges to deep-learning-based semantic segmentation networks. According to the difficulties of visible-light image segmentation, research has improved network performance mainly along three directions: increasing segmentation granularity, strengthening generalization across scales, and learning the spatial correlations of objects. For example, [2] uses dilated convolution modules to preserve detail in feature maps and predict more accurate results; [3] trains images of different scales with a shared-parameter convolutional neural network to obtain multi-scale features; and [4] exploits the suitability of recurrent neural networks for encoding sequential data to capture spatial relations between objects. Although these studies improve segmentation accuracy and solve specific technical problems, most of them focus on accuracy alone and neglect model size and segmentation speed, which makes them hard to deploy in a driving-environment perception system. Moreover, no matter how visible-light-based methods are improved, their input source means they cannot avoid segmentation errors caused by insufficient lighting or by objects sharing the color and texture of the background.
Because thermal infrared cameras work effectively around the clock and in all weather, they are increasingly used in driving applications [5-6]. For example, recognizing pedestrians in thermal images can provide important cues such as danger zones and safe distances, helping the driving system plan paths better and improving its reliability and robustness. In general, segmentation algorithms for thermal images describe the difference between foreground and background with hand-crafted features, e.g., methods based on thresholds, fuzzy sets, or shortest paths, but these are usually sensitive to scene changes and noise and cannot adapt to the complex environments a vehicle encounters.
In recent years, researchers have turned to perception methods based on multiple sensors [7], attempting to fully mine information by fusing multimodal data and thus improve the performance of driving perception systems [8]. Ha et al. [9] made the first attempt to combine visible-light and thermal images for scene segmentation, proposing the CNN-based MFNet segmentation model and building a scene segmentation dataset of paired visible-light and thermal images. RTFNet [10] builds on MFNet by introducing residual structures [11] to further strengthen information fusion and improve segmentation accuracy; however, its network is very large and its parameter count grows substantially, which conflicts with the lightweight, real-time models required by driving-environment perception and leaves room for improvement. Earlier multi-sensor perception research focused on fusing point clouds with visible-light images for object detection [12-13], combining visible-light and depth images for segmentation [14], and detecting objects in multispectral images [15-16].
This paper proposes DMSNet (dual modal segmentation network), a complex-scene segmentation model based on visible-light and thermal infrared images. The model builds a lightweight dual-path feature space adaptation (DPFSA) module that transforms thermal features and visible-light features into the same space before fusing them, then learns from the fused multimodal features and extracts their low-level details and high-level semantics to segment complex scenes. Experimental results show that the model reduces the fusion error caused by the discrepancy between the feature spaces of different modalities, remains robust when lighting changes, and clearly improves segmentation results compared with other methods.
1. Method
The proposed model takes the visible-light and thermal infrared images of a complex scene as input and outputs the segmentation of the different object classes in that scene; we therefore name it the dual modal segmentation network (DMSNet). Its overall structure is shown in Figure 1.
1.1 Scene segmentation model
The network consists of an encoder and a decoder. The encoder uses two paths to extract visible-light and thermal features separately. Apart from their inputs (a color image and a grayscale image, respectively), the two paths share the same structure, and each contains five groups of operations. Each group contains one to three $3 \times 3$ convolutional layers, each followed by a batch normalization layer [17] to keep the feature distribution inside the network relatively stable, and then an activation layer. Between groups, a max pooling layer with stride 2 shrinks the spatial size of the feature maps while the number of convolution kernels increases, so that richer semantic information is learned progressively from the shallow to the deep layers of the encoder. Because DMSNet is a lightweight network for driving-environment perception, the number of feature channels never exceeds 96 even in the deepest encoder layer, so leaky-ReLU [18] is used as the activation function throughout the network, which avoids the large-scale neuron deactivation caused by the commonly used ReLU [19].

The decoder fuses the features learned by the two encoding paths and enlarges the spatial size of the feature maps step by step through five groups of operations, finally producing a segmentation result of the same size as the input image. The operations inside each decoder group are similar to those of the encoder: convolution, batch normalization, and activation. Between groups, nearest-neighbor interpolation with a scale factor of 2 performs fast upsampling to gradually restore the spatial size of the feature maps. Before each upsampling step, the feature maps of the same size from the visible-light encoder and the thermal encoder must be fused. To narrow the gap between the feature spaces of the two modalities, this paper proposes the dual-path feature space adaptation (DPFSA) module, which automatically transforms the two modal features into the same space and fuses them. The detailed design of this module is described in Section 1.2.
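To make the structure above concrete, the following is a minimal PyTorch sketch of one encoder path. The number of convolutions per group (two), the leaky-ReLU slope (0.1), and the channel widths growing toward 96 are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs=2):
    """One encoder/decoder group: 3x3 conv -> BatchNorm -> leaky-ReLU, repeated."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        ]
    return nn.Sequential(*layers)

class Encoder(nn.Module):
    """Five groups separated by stride-2 max pooling; channel widths are assumed."""
    def __init__(self, in_ch, widths=(16, 32, 48, 64, 96)):
        super().__init__()
        self.groups = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.groups.append(conv_block(prev, w))
            prev = w
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        feats = []                           # per-scale features kept for later fusion
        for i, g in enumerate(self.groups):
            x = g(x)
            feats.append(x)
            if i < len(self.groups) - 1:
                x = self.pool(x)             # halve the spatial size between groups
        return feats

# The decoder mirrors this structure but upsamples between groups, e.g. with
# nn.Upsample(scale_factor=2, mode='nearest') as described in the text.
```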
1.2 Feature fusion method
As pointed out in [13], current methods that fuse LiDAR data with visible-light images for road detection do not clearly outperform algorithms based on visible-light images alone. This is mainly because the two kinds of information differ in both data space and feature space, which hampers their fusion. The data-space discrepancy refers to the fact that LiDAR data live in three-dimensional real space while visible-light images are defined on a two-dimensional plane. The feature-space discrepancy arises because the two data modalities differ, so the features extracted by the network also lie in different spaces; both discrepancies harm feature fusion. Inspired by that work, this paper improves the feature space transformation (FST) module of [13] and applies it in DMSNet.
The FST module fuses all LiDAR features into the visible-light features by element-wise addition, which mixes transformed and untransformed features, adds a certain amount of noise to the LiDAR information, and may even harm the visible-light features. To address this shortcoming, this paper designs the DPFSA module to perform the feature-space transformation. Its structure is shown in Figure 2. Compared with the FST module, the biggest improvements are that the feature vectors of the two modalities are both preserved, and that a pre-adaptation step and a reverse-transformation layer are added. The pre-adaptation step increases the nonlinear capacity of the model; the reverse-transformation layer, inspired by the idea in [17], applies a further convolution to the transformed data to prevent a severe change in the data distribution while making the model more flexible. These improvements greatly raise the performance of the final segmentation model while adding almost no network parameters.
The module has two functions: transforming the feature space, and fusing features that carry different information. For the feature-space transformation, a $1 \times 1$ convolutional layer followed by a leaky-ReLU activation first pre-adapts the thermal features; the pre-adapted thermal features and the visible-light features are then fed into the transformation network (TransNet) to learn the transformation parameters; finally, the reverse-transformation layer completes the transformation of the thermal feature space:

$$ {\boldsymbol{f}}_{adapt\_ther}={G}_{rev}\left({\boldsymbol{\alpha}}\,{\boldsymbol{f}}_{pre\_ther}+{\boldsymbol{\beta}}\right) \tag{1} $$

where ${\boldsymbol{f}}_{adapt\_ther}$ is the thermal feature after the space transformation; $G_{rev}$ denotes the operation of the reverse-transformation layer, which has the same structure as the pre-adaptation step, i.e., a single $1 \times 1$ convolutional layer and an activation layer, used to change the number of feature channels while adding nonlinearity; ${\boldsymbol{f}}_{pre\_ther}$ is the pre-adapted thermal feature; and ${\boldsymbol{\alpha}}$ and ${\boldsymbol{\beta}}$ are the transformation parameters output by TransNet, computed by its two transformation sub-networks:

$$ {\boldsymbol{\alpha}}={H}_{\alpha}\left({\boldsymbol{f}}_{pre\_ther},{\boldsymbol{f}}_{vis};{\boldsymbol{W}}_{\alpha}\right) \tag{2} $$

$$ {\boldsymbol{\beta}}={H}_{\beta}\left({\boldsymbol{f}}_{pre\_ther},{\boldsymbol{f}}_{vis};{\boldsymbol{W}}_{\beta}\right) \tag{3} $$

where $H_{\alpha}$ and $H_{\beta}$ denote the fully convolutional operations of the two sub-networks that compute ${\boldsymbol{\alpha}}$ and ${\boldsymbol{\beta}}$, ${\boldsymbol{W}}_{\alpha}$ and ${\boldsymbol{W}}_{\beta}$ are the corresponding parameters, and ${\boldsymbol{f}}_{vis}$ is the visible-light feature.

After the feature-space transformation, the features are fused. The transformed thermal feature is first concatenated with the visible-light feature and then added element-wise to the fused result of the previous group, yielding the dual-path feature. The processing of the DPFSA module can be written as

$$ DPFSA\left({\boldsymbol{V}},{\boldsymbol{T}};{\boldsymbol{W}}\right)={M}_{fuse}^{n}\left({{\boldsymbol{f}}}_{fuse}^{n+1},{{\boldsymbol{f}}}_{vis}^{n},{{\boldsymbol{f}}}_{adapt\_ther}^{n}\right) \tag{4} $$

where $n$ indexes the $n$-th group of the segmentation model; ${\boldsymbol{V}}$ and ${\boldsymbol{T}}$ are the visible-light and thermal images; ${\boldsymbol{f}}_{fuse}$ is the result of passing the dual-path feature output by a DPFSA module through one group of decoder convolutions; ${\boldsymbol{W}}$ collectively denotes all parameters of the module; and $M_{fuse}$ is the element-wise-addition fusion. Note that when $n=5$ the DPFSA module receives only two inputs and the processing becomes ${M}_{fuse}^{5}({{\boldsymbol{f}}}_{vis}^{5},{{\boldsymbol{f}}}_{adapt\_ther}^{5})$. As Figure 2 and Eq. (4) show, the DPFSA module not only preserves both modalities to form the dual-path feature, but this feature, after further processing, also serves as an input to the next DPFSA module. This minimizes information mixing and loss and increases the utilization of the thermal images.
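A minimal PyTorch sketch of the DPFSA module following Eqs. (1)-(4) is given below. The layer widths, the TransNet depth, and the way the two inputs of $H_{\alpha}$ and $H_{\beta}$ are combined (channel concatenation before a single convolution) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def conv1x1(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1),
                         nn.LeakyReLU(0.1, inplace=True))

class DPFSA(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.pre_adapt = conv1x1(ch, ch)              # pre-adaptation of the thermal feature
        # TransNet: two fully convolutional sub-networks producing alpha and beta
        self.h_alpha = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        self.h_beta = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        self.reverse = conv1x1(ch, ch)                # reverse-transformation layer

    def forward(self, f_vis, f_ther, f_fuse_prev=None):
        f_pre = self.pre_adapt(f_ther)                # pre-adapted thermal feature
        pair = torch.cat([f_pre, f_vis], dim=1)       # joint input of TransNet
        alpha = self.h_alpha(pair)                    # Eq. (2)
        beta = self.h_beta(pair)                      # Eq. (3)
        f_adapt = self.reverse(alpha * f_pre + beta)  # Eq. (1): transformed thermal feature
        dual = torch.cat([f_vis, f_adapt], dim=1)     # keep both modalities (dual-path feature)
        if f_fuse_prev is not None:                   # Eq. (4): element-wise addition with the
            dual = dual + f_fuse_prev                 # fused result of the previous group
        return dual                                   # (f_fuse_prev must match 2*ch channels)
```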
1.3 Loss function

Considering that the cross-entropy loss is easier to optimize during back-propagation while the Dice loss [20] is good at handling class imbalance in the dataset, this paper constructs a new loss function $L_{mix}$:

$$ {L}_{mix}={L}_{CE}+{L}_{Dice} \tag{5} $$

$$ {L}_{CE}=-\frac{1}{N}\sum\limits_{k=1}^{K}\sum\limits_{i=1}^{N}{\pi\left(G\right)}_{i}^{k}\log {p}_{i}^{k} \tag{6} $$

$$ {L}_{Dice}=1-\frac{2}{K}\sum\limits_{k=1}^{K}\frac{\sum\limits_{i=1}^{N}{p}_{i}^{k}{\pi\left(G\right)}_{i}^{k}}{\sum\limits_{i=1}^{N}{p}_{i}^{k}+\sum\limits_{i=1}^{N}{\pi\left(G\right)}_{i}^{k}} \tag{7} $$

where $L_{CE}$ is the cross-entropy loss, $L_{Dice}$ is the Dice loss, $K$ is the total number of segmentation classes, $G$ is the segmentation label of the image, and $N$ is the total number of pixels. ${\pi\left(G\right)}_{i}^{k}$ maps the segmentation label of pixel $i$ for class $k$ in image $I$ to its one-hot encoding, and ${p}_{i}^{k}$ maps the network output to the range $[0,1]$ via the softmax function:

$$ {p}_{i}^{k}=\frac{\exp\left({a}_{i}^{k}\right)}{\sum\limits_{k=1}^{K}\exp\left({a}_{i}^{k}\right)} \tag{8} $$

where ${a}_{i}^{k}$ is the score predicted by the network for pixel $i$ and class $k$.
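The mixed loss of Eqs. (5)-(8) can be sketched in PyTorch as follows; the small epsilon guard against division by zero is an added assumption, not part of the formulas above.

```python
import torch
import torch.nn.functional as F

def mixed_loss(logits, target, eps=1e-6):
    """logits: (B, K, H, W) raw scores; target: (B, H, W) integer class labels."""
    num_classes = logits.shape[1]
    l_ce = F.cross_entropy(logits, target)                  # Eq. (6), softmax + log inside
    prob = F.softmax(logits, dim=1)                         # Eq. (8)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (prob * one_hot).sum(dim=(0, 2, 3))             # per-class intersection term
    denom = prob.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    l_dice = 1.0 - (2.0 / num_classes) * (inter / (denom + eps)).sum()  # Eq. (7)
    return l_ce + l_dice                                    # Eq. (5)
```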
2. Experimental results and analysis

All experiments were implemented with the PyTorch 1.2.0 framework on CUDA 10.0 and cuDNN 7.6.0, and trained on a Windows 10 machine equipped with an Intel Xeon Bronze 3104 CPU (1.70 GHz) and an NVIDIA GeForce RTX 2080 Ti GPU (11 GB). The initial learning rate was set to 0.01 and reduced by 1% after each epoch. The model was optimized with stochastic gradient descent (SGD) using a momentum of 0.9 and a weight decay of 0.0005 to avoid overfitting. This section first introduces the datasets and evaluation metrics, then verifies the effectiveness of the DPFSA module and the mixed loss function in DMSNet through ablation experiments and analyzes their effects and possible causes, and finally compares DMSNet with other segmentation models.
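A minimal sketch of the training schedule described above (SGD with momentum 0.9, weight decay 0.0005, initial learning rate 0.01 decayed by 1% per epoch); the stand-in module and the epoch count are placeholders, since DMSNet itself and the number of epochs are not reproduced here.

```python
import torch
import torch.nn as nn

net = nn.Conv2d(4, 9, kernel_size=3, padding=1)   # stand-in module, not DMSNet itself
optimizer = torch.optim.SGD(net.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
# multiply the learning rate by 0.99 after every epoch (a 1% reduction)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for epoch in range(120):                          # epoch count is an assumed placeholder
    # ... one pass over the training set: forward pass, mixed loss,
    #     loss.backward() and optimizer.step() per batch ...
    scheduler.step()
```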
2.1 Datasets and evaluation metrics
1) Datasets
This paper mainly uses the dataset released in [9] (hereafter "dataset A"), which contains 1569 urban driving-scene images, 820 captured in the daytime and 749 at night. The dataset was collected with an InfRec R500 thermal camera, which acquires visible-light and thermal images simultaneously. Eight classes are annotated: Car, Person, Bike, Curve, Car stop, Guardrail, Color cone, and Bump; objects not belonging to these classes are treated as Unlabeled. Because only a few classes are labeled, unlabeled pixels account for more than 93% of all pixels, and among the labeled pixels the proportions of different classes differ by a factor of more than 43, so the dataset suffers from severe class imbalance. For training, this paper adopts the same data split as [9]: 50% of the images for training, 25% for validation, and the rest for testing, with all images resized to a fixed size of $480 \times 640$.

Because public multimodal visible-light and thermal datasets for driving environments are scarce, this paper additionally uses the PST900 dataset [21] (hereafter "dataset B"). This dataset targets autonomous environment perception for robots and contains 894 pairs of $720 \times 1280$ visible-light and thermal images with five classes: Background, Fire-extinguisher, Backpack, Hand-drill, and Survivor. The data split is the same as for dataset A.

2) Evaluation metrics
Two metrics are used to evaluate segmentation performance: accuracy (Acc) and intersection over union (IoU). Their averages over all classes are denoted mAcc and mIoU, computed as follows:
$$ mAcc=\frac{1}{K}\sum\limits_{i=1}^{K}\frac{{P}_{ii}}{\sum\limits_{j=1}^{K}{P}_{ij}} \tag{9} $$

$$ mIoU=\frac{1}{K-1}\sum\limits_{i=2}^{K}\frac{{P}_{ii}}{\sum\limits_{j=2}^{K}\left({P}_{ij}+{P}_{ji}\right)-{P}_{ii}} \tag{10} $$

Here $K$ is taken as 9 on dataset A and 5 on dataset B, i.e., the unlabeled/background class is included, and $P_{ij}$ is the number of pixels of class $i$ predicted as class $j$. Because unlabeled pixels make up the vast majority and the IoU values for that class computed by different models are nearly identical, the unlabeled class is excluded from the mIoU computation.
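A minimal sketch of Eqs. (9)-(10) computed from a confusion matrix, with the unlabeled class assumed to be index 0; the division guards are added assumptions to keep the sketch numerically safe.

```python
import numpy as np

def mean_acc_and_iou(p):
    """p: (K, K) confusion matrix where p[i, j] counts pixels of class i predicted as class j."""
    acc = np.diag(p) / np.maximum(p.sum(axis=1), 1)        # per-class accuracy
    m_acc = acc.mean()                                     # Eq. (9): averaged over all K classes
    q = p[1:, 1:]                                          # drop the unlabeled class (index 0)
    inter = np.diag(q)
    union = q.sum(axis=1) + q.sum(axis=0) - inter
    m_iou = (inter / np.maximum(union, 1)).mean()          # Eq. (10): averaged over K-1 classes
    return m_acc, m_iou
```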
2.2 Ablation study

1) Analysis of the DPFSA module
To verify the effectiveness of the DPFSA module, two further modules are obtained by adjusting its internal structure, and they are compared with MFNet and FuseNet [14]. The two adjusted modules are shown in Figure 3. Figure 3(a) removes both the reverse-transformation layer and the pre-adaptation step from DPFSA and is named DPFSA-1 for convenience; it is introduced to show that transforming the feature space is a feasible idea. Figure 3(b) removes only the reverse-transformation layer from DPFSA (equivalently, it adds the pre-adaptation step to DPFSA-1) and is named DPFSA-2; it is introduced to show that simply adding parameters or layers does not necessarily improve segmentation accuracy. A sketch of the variants is given below.
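A minimal sketch of the ablated variant DPFSA-1, assuming the DPFSA sketch of Section 1.2; DPFSA-2 would additionally apply a $1 \times 1$ convolution pre-adaptation to the thermal feature before the transformation. Both are illustrations of the description above, not the authors' code.

```python
import torch
import torch.nn as nn

class DPFSA1(nn.Module):
    """DPFSA without pre-adaptation and without the reverse-transformation layer."""
    def __init__(self, ch):
        super().__init__()
        self.h_alpha = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        self.h_beta = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, f_vis, f_ther):
        pair = torch.cat([f_ther, f_vis], dim=1)
        f_adapt = self.h_alpha(pair) * f_ther + self.h_beta(pair)  # transform the raw thermal feature
        return torch.cat([f_vis, f_adapt], dim=1)                  # dual-path feature
```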
To ensure a fair comparison and exclude the influence of the loss function, the cross-entropy loss used by MFNet is adopted for DMSNet and its variants in Table 1, and all images from both daytime and nighttime are used for testing. Table 1 shows that the segmentation results with the DPFSA-1 module are better than those of MFNet and FuseNet, indicating that the discrepancy between the feature spaces of different modalities can be narrowed in this way and that transforming the feature space is feasible. The DPFSA-2 module improves mAcc but lowers mIoU, showing that simply increasing parameters or layers does not guarantee higher accuracy: deeper models usually need more training data and are harder to make converge. The main difference between DPFSA and DPFSA-2 is that the transformed features are further passed through the reverse-transformation layer; Table 1 shows that this clearly improves performance while adding only 0.18 MB of parameters compared with the unimproved DPFSA-1, and the whole model has only 12.1% of the parameters of FuseNet. This further confirms that the performance gain does not come from a large increase in trainable parameters but from the DPFSA module itself.
Table 1 Comparison of mAcc and mIoU values and parameter sizes of different modules on dataset A

| Models | mAcc | mIoU | Parameters |
| --- | --- | --- | --- |
| MFNet | 63.5 | 64.9 | 2.81 MB |
| FuseNet | 61.9 | 63.8 | 46.4 MB |
| DMSNet (DPFSA-1) | 65.6 | 68.1 | 5.45 MB |
| DMSNet (DPFSA-2) | 68.9 | 65.1 | 5.54 MB |
| DMSNet (DPFSA) | 69.7 | 69.6 | 5.63 MB |

Note: Parameters refers to the parameter size of the whole segmentation model, not of the module alone.

2) Analysis of the loss function
This paper builds its loss function from cross-entropy (CE) and Dice. To demonstrate its advantage, DMSNet is trained with four different loss functions. Table 2 lists the per-class Acc results and the mAcc and mIoU values of the different losses on dataset A. The Focal loss [22] was proposed precisely to counter the accuracy drop caused by sample imbalance: it reduces the weight of easy samples through a modulating factor so that training concentrates on hard samples. However, Table 2 shows that the Focal loss performs poorly on the dataset used here, largely because the pixel proportions of different classes differ by several orders of magnitude; learning directly from hard samples therefore makes the loss highly sensitive to even small amounts of noise, which hampers convergence.
Table 2 Acc results and mAcc and mIoU values of different loss functions on dataset A

| Losses | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | mAcc | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CE | 97.6 | 86.5 | 84.9 | 77.8 | 69.5 | 53.3 | 0.0 | 79.8 | 77.4 | 69.7 | 69.6 |
| Focal | 97.3 | 78.7 | 80.5 | 67.8 | 55.1 | 41.6 | 0.0 | 63.5 | 50.8 | 59.5 | 65.6 |
| Dice | 96.8 | 77.7 | 83.8 | 0.0 | 0.0 | 0.0 | 0.0 | 36.6 | 0.0 | 32.8 | 25.3 |
| CE+Dice | 97.6 | 87.6 | 83.5 | 79.5 | 73.2 | 47.5 | 0.0 | 74.7 | 92.1 | 70.7 | 70.3 |

Note: columns 1-9 give the per-class Acc for the segmentation classes 1: Unlabeled, 2: Car, 3: Pedestrian, 4: Bike, 5: Curve, 6: Car stop, 7: Guardrail, 8: Color cone, 9: Bump.

Using the Dice loss alone also performs poorly, mainly because its gradient takes a form similar to $2\pi {(G)}^{2}/\left(p+\pi (G)\right)^{2}$ (see the derivation below); when both $p$ and $\pi(G)$ are very small, this gradient becomes extremely large and makes training unstable. Although the cross-entropy loss does not address class imbalance, its gradient is simpler and smoother and it learns the distribution of the dominant classes, so it suits this dataset better than the Dice and Focal losses.

Based on the above analysis, and so that the model can learn the features of frequent classes while still attending to rare ones, this paper uses the mixed loss that combines cross-entropy and Dice. Table 2 shows that the proposed mixed loss achieves the best mAcc and mIoU and effectively improves model performance. This is largely because the cross-entropy loss dominates the early stage of training, while the Dice loss, as an auxiliary term, further improves segmentation accuracy on the rare classes.
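For reference, the gradient form quoted above can be obtained for a single pixel-class term of Eq. (7), treating $p$ as the variable:

$$ \frac{\partial}{\partial p}\left(1-\frac{2\,p\,\pi(G)}{p+\pi(G)}\right)=-\frac{2\,\pi(G)^{2}}{\left(p+\pi(G)\right)^{2}} $$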
2.3 Comparative experiments
This section compares the segmentation performance of DMSNet with SegNet, ENet, MFNet, and FuseNet in terms of accuracy and robustness. SegNet and ENet are segmentation networks for visible-light images; they are chosen because their parameter sizes are moderate, and ENet in particular is a high-speed network designed for embedded devices. Most other networks achieve higher accuracy but have huge structures and parameter counts (e.g., RTFNet with 980.88 MB of parameters) and demand hardware and computation that a driving-environment perception system may not be able to afford. To keep the comparison fair, SegNet and ENet are each trained and tested with two kinds of input. The first is the visible-light image as a three-channel input, denoted 3ch; the second combines visible-light and thermal information, but since SegNet and ENet have no structure for multimodal data, the visible-light image and the thermal grayscale image are simply concatenated along the channel dimension as a four-channel input, denoted 4ch (see the sketch below).
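A minimal sketch of how the 4ch baseline input is formed (channel concatenation of the RGB image and the thermal image as one gray channel); the tensor sizes follow the $480 \times 640$ resolution of dataset A and the random tensors are placeholders for real images.

```python
import torch

rgb = torch.rand(1, 3, 480, 640)          # visible-light image, 3 channels
thermal = torch.rand(1, 1, 480, 640)      # thermal image as a single grayscale channel
x_4ch = torch.cat([rgb, thermal], dim=1)  # (1, 4, 480, 640) fed to SegNet/ENet (4ch)
```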
Table 3 reports the per-class Acc and IoU results of the different models on dataset A together with their averages. Except for a few classes on which other models have an edge, DMSNet performs best on the vast majority of classes, and its mAcc and mIoU are 7.2 and 5.4 percentage points higher than MFNet and 8.8 and 6.5 percentage points higher than FuseNet, respectively.
Table 3 Comparison of Acc and IoU results of different models on dataset A

| Models | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | mAcc | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SegNet (3ch) | 82.6 / 94.1 | 67.7 / 75.6 | 73.7 / 80.8 | 55.9 / 97.1 | 39.1 / 43.5 | 0.0 / 0.0 | 0.0 / 0.0 | 48.9 / 86.8 | 51.7 | 59.7 |
| SegNet (4ch) | 84.4 / 93.1 | 85.5 / 84.7 | 76.0 / 74.7 | 58.2 / 96.5 | 44.2 / 43.6 | 0.0 / 0.0 | 0.0 / 0.0 | 74.4 / 95.6 | 57.8 | 60.9 |
| ENet (3ch) | 85.3 / 92.3 | 53.8 / 68.4 | 67.7 / 71.7 | 52.2 / 95.7 | 16.9 / 24.2 | 0.0 / 0.0 | 0.0 / 0.0 | 0.0 / 0.0 | 41.5 | 43.8 |
| ENet (4ch) | 75.5 / 89.6 | 68.1 / 71.7 | 66.8 / 67.6 | 63.2 / 88.5 | 41.5 / 34.1 | 0.0 / 0.0 | 0.0 / 0.0 | 93.2 / 78.1 | 56.2 | 53.6 |
| FuseNet | 76.8 / 91.2 | 69.3 / 80.5 | 71.2 / 78.6 | 60.1 / 95.8 | 30.8 / 28.1 | 0.0 / 0.0 | 68.4 / 37.9 | 83.1 / 98.5 | 61.9 | 63.8 |
| MFNet | 78.9 / 92.9 | 82.7 / 84.8 | 68.1 / 75.7 | 64.4 / 97.2 | 31.6 / 29.7 | 0.0 / 0.0 | 71.8 / 40.6 | 77.1 / 98.4 | 63.5 | 64.9 |
| DMSNet | 87.6 / 95.8 | 83.5 / 88.7 | 79.5 / 82.5 | 73.2 / 97.9 | 47.5 / 35.7 | 0.0 / 0.0 | 74.7 / 62.0 | 92.1 / 99.8 | 70.7 | 70.3 |

Note: columns 2-9 are the segmentation class indices, with the same notation as in Table 2; each cell gives Acc / IoU.

In addition, to verify the applicability and robustness of the proposed model on a different dataset, Table 4 reports the test results of all models on dataset B. The proposed method again shows strong segmentation performance.
Table 4 Comparison of Acc and IoU results of different models on dataset B

| Models | 2 | 3 | 4 | 5 | mAcc | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| SegNet (3ch) | 0.0 / 0.0 | 71.2 / 79.3 | 0.0 / 0.0 | 21.6 / 47.1 | 38.4 | 31.6 |
| SegNet (4ch) | 0.0 / 0.0 | 62.9 / 70.1 | 0.0 / 0.0 | 30.5 / 46.8 | 38.5 | 29.2 |
| ENet (3ch) | 0.0 / 0.0 | 77.6 / 85.5 | 0.0 / 0.0 | 73.4 / 90.9 | 49.9 | 44.1 |
| ENet (4ch) | 0.0 / 0.0 | 72.9 / 74.9 | 0.0 / 0.0 | 74.8 / 89.6 | 49.1 | 41.1 |
| FuseNet | 72.7 / 43.1 | 91.4 / 92.3 | 74.4 / 78.9 | 99.9 / 99.8 | 87.4 | 78.5 |
| MFNet | 66.7 / 47.0 | 88.7 / 91.0 | 95.2 / 90.1 | 96.3 / 99.8 | 89.1 | 81.9 |
| DMSNet | 67.8 / 43.5 | 89.1 / 90.4 | 96.3 / 97.5 | 99.3 / 99.9 | 90.2 | 82.8 |

Note: columns 2-5 are the segmentation classes 2: Fire-Extinguisher, 3: Backpack, 4: Hand-Drill, 5: Survivor; each cell gives Acc / IoU.

To further examine whether the model makes proper use of both modalities, the robustness of the different models to illumination changes is also compared by time of day. Table 5 compares the segmentation results on dataset A under daytime and nighttime conditions. Directly concatenating the visible-light and thermal images without any processing affects the learning of the visible-light data to some extent; for SegNet in particular, the four-channel input clearly lowers mAcc and mIoU on the daytime data compared with the three-channel input. In contrast, the proposed DMSNet improves segmentation clearly in every period, which further shows that it exploits the complementary information of the two modalities efficiently and is robust to illumination changes.
Table 5 Comparison of mAcc and mIoU results of different models on dataset A in daytime and nighttime

| Models | Daytime mAcc | Daytime mIoU | Nighttime mAcc | Nighttime mIoU |
| --- | --- | --- | --- | --- |
| SegNet (3ch) | 47.8 | 55.5 | 52.6 | 61.3 |
| SegNet (4ch) | 45.4 | 49.3 | 58.2 | 62.9 |
| ENet (3ch) | 42.1 | 40.8 | 38.6 | 39.1 |
| ENet (4ch) | 44.1 | 45.9 | 57.1 | 54.3 |
| FuseNet | 50.6 | 61.2 | 63.4 | 64.7 |
| MFNet | 49.0 | 63.3 | 65.8 | 65.1 |
| DMSNet | 57.7 | 69.1 | 71.8 | 71.3 |

Figure 4 shows the segmentation results of DMSNet, FuseNet, and MFNet on five groups of test images from dataset A. The first row shows the visible-light images, the second the thermal images, and the third the segmentation labels; the first three columns were captured in the daytime and the last two at night. The fourth, fifth, and sixth rows are the results of FuseNet, MFNet, and DMSNet, respectively. Compared with MFNet and FuseNet, the proposed DMSNet classifies objects more accurately (e.g., the color cone in the first column and the bike in the fourth column), handles boundary details better (e.g., the pedestrians), and produces less noise (e.g., the cars in the third and fifth columns).
3. Conclusion
To address the facts that most existing scene segmentation models are based on visible-light images and cannot adapt to complex environmental changes, and that their large parameter counts make them hard to deploy in driving-environment perception systems, this paper builds DMSNet, a dual modal segmentation network based on visible-light and thermal infrared images. Starting from the discrepancy between the feature spaces of the two modalities, the DPFSA module is proposed; it transforms the thermal features with very lightweight operations and narrows the distance between the two feature spaces, thereby improving performance effectively while adding almost no parameters. The proposed mixed loss function also improves segmentation accuracy. A limitation is that the datasets used here are extremely class-imbalanced and even contain mislabeled samples and inconsistent class definitions, so objects that appear rarely in the scene cannot be segmented accurately. Future work will therefore tackle the difficulty of segmenting rare classes through data augmentation and model optimization.
-
Table 1 Performance (%) of a few popular models based on visual sequential features with mean/max pooling on MSVD
Table 4 Performance (%) of a few other popular models on MSVD
Table 2 Performance (%) of a few popular models based on visual sequential features with RNN on MSVD
| Methods | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr |
| --- | --- | --- | --- | --- | --- | --- |
| S2VT [32] | — | — | — | — | 29.8 | — |
| Res-F2F (G-R101-152) [34] | 82.8 | 71.7 | 62.4 | 52.4 | 35.7 | 84.3 |
| Joint-BiLSTM reinforced [35] | — | — | — | — | 30.3 | — |
| HRNE with attention [38] | 79.2 | 66.3 | 55.1 | 43.8 | 33.1 | — |
| Boundary-aware encoder [39] | — | — | — | 42.5 | 32.4 | 63.5 |
| hLSTMat [41] | 82.9 | 72.2 | 63.0 | 53.0 | 33.6 | — |
| Li et al. [42] | — | — | — | 48.0 | 31.6 | 68.8 |
| MGSA (I+C) [43] | — | — | — | 53.4 | 35.0 | 86.7 |
| LSTM-GAN [113] | — | — | — | 42.9 | 30.4 | — |
| PickNet (V+L+C) [114] | — | — | — | 52.3 | 33.3 | 76.5 |

Table 3 Performance (%) of a few popular models based on 3D convolutional features on MSVD
Table 5 Performance (%) of visual sequential feature based models with mean/max pooling on MSR-VTT2016
Table 8 Performance (%) of other popular models on MSR-VTT2016
Table 6 Performance (%) of a few popular models based on visual sequential features with RNN on MSR-VTT2016
| Methods | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr |
| --- | --- | --- | --- | --- | --- | --- |
| Res-F2F (G-R101-152) [34] | 81.1 | 67.2 | 53.7 | 41.4 | 29.0 | 48.9 |
| hLSTMat [41] | — | — | — | 38.3 | 26.3 | — |
| Li et al. [42] | 76.1 | 62.1 | 49.1 | 37.5 | 26.4 | — |
| MGSA (I+A+C) [43] | — | — | — | 45.4 | 28.6 | 50.1 |
| LSTM-GAN [113] | — | — | — | 36.0 | 26.1 | — |
| aLSTM [117] | — | — | — | 38.0 | 26.1 | — |
| VideoLAB [118] | — | — | — | 39.5 | 27.7 | 44.2 |
| PickNet (V+L+C) [114] | — | — | — | 41.3 | 27.7 | 44.1 |
| DenseVidCap [49] | — | — | — | 44.2 | 29.4 | 50.5 |
| ETS (Local+Global) [48] | 77.8 | 62.2 | 48.1 | 37.1 | 28.4 | — |

Table 7 Performance (%) of a few popular models based on 3D convolutional features on MSR-VTT2016
Table 9 Performance (%) of a few popular models based on visual sequential features with RNN on the ActivityNet Captions dataset (validation set)
-
[1] Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91-110 doi: 10.1023/B:VISI.0000029664.99615.94 [2] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). San Diego, CA, USA: IEEE, 2005. 886−893 [3] Nagel H H. A vision of “vision and language” comprises action: An example from road traffic. Artificial Intelligence Review, 1994, 8(2): 189-214 [4] Kojima A, Tamura T, Fukunaga K. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 2002, 50(2): 171-184 doi: 10.1023/A:1020346032608 [5] Gupta A, Srinivasan P, Shi J B, Davis L S. Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009. 2012−2019 [6] Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, et al. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot. In: Proceedings of the 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013. 2712−2719 [7] Rohrbach M, Qiu W, Titov I, Thater S, Pinkal M, Schiele B. Translating video content to natural language descriptions. In: Proceedings of the 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013. 433−440 [8] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 2012. 1097−1105 [9] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, CA, USA, 2015. [10] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 1−9 [11] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 770−778 [12] 胡建芳, 王熊辉, 郑伟诗, 赖剑煌. RGB-D行为识别研究进展及展望. 自动化学报, 2019, 45(5): 829-840Hu Jian-Fang, Wang Xiong-Hui, Zheng Wei-Shi, Lai Jian-Huang. RGB-D action recognition: Recent advances and future perspectives. Acta Automatica Sinica, 2019, 45(5): 829-840 [13] 周波, 李俊峰. 结合目标检测的人体行为识别. 自动化学报, 2020, 46(9): 1961-1970Zhou Bo, Li Jun-Feng. Human action recognition combined with object detection. Acta Automatica Sinica, 2020, 46(9): 1961-1970 [14] Wu J C, Wang L M, Wang L, Guo J, Wu G S. Learning actor relation graphs for group activity recognition. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 9956−9966 [15] Ji S W, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231 doi: 10.1109/TPAMI.2012.59 [16] Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. 
Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 4489−4497 [17] Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: ACL, 2014. 1724−1734 [18] Xu K, Ba J L, Kiros R, Cho K, Courville A, Salakhutdinov R, et al. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR.org, 2015. 2048−2057 [19] Yao T, Pan Y W, Li Y H, Qiu Z F, Mei T. Boosting image captioning with attributes. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017. 4904−4912 [20] Aafaq N, Mian A, Liu W, Gilani S Z, Shah M. Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys, 2020, 52(6): Article No. 115 [21] Li S, Tao Z Q, Li K, Fu Y. Visual to text: Survey of image and video captioning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2019, 3(4): 297-312 doi: 10.1109/TETCI.2019.2892755 [22] Xu R, Xiong C M, Chen W, Corso J J. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: Proceedings of the 29th AAAI Conference on Artificial Intelligence. Austin, Texas: AAAI Press, 2015. 2346−2352 [23] Venugopalan S, Xu H J, Donahue J, Rohrbach M, Mooney R, Saenko K. Translating videos to natural language using deep recurrent neural networks. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado: ACL, 2015. 1494−1504 [24] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252 doi: 10.1007/s11263-015-0816-y [25] Pan Y W, Mei T, Yao T, Li H Q, Rui Y. Jointly modeling embedding and translation to bridge video and language. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 4594−4602 [26] Pan Y W, Yao T, Li H Q, Mei T. Video captioning with transferred semantic attributes. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017. 984−992 [27] 汤鹏杰, 谭云兰, 李金忠, 谭彬. 密集帧率采样的视频标题生成. 计算机科学与探索, 2018, 12(6): 981-993 doi: 10.3778/j.issn.1673-9418.1705058Tang Peng-Jie, Tan Yun-Lan, Li Jin-Zhong, Tan Bin. Dense frame rate sampling based model for video caption generation. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 981-993 doi: 10.3778/j.issn.1673-9418.1705058 [28] Dalal N, Triggs B, Schmid C. Human detection using oriented histograms of flow and appearance. In: Proceedings of the 9th European Conference on Computer Vision. Graz, Austria: Springer, 2006. 428−441 [29] Wang H, Kläser A, Schmid C, Liu C L. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 2013, 103(1): 60-79 doi: 10.1007/s11263-012-0594-8 [30] Wang H, Schmid C. Action recognition with improved trajectories. 
In: Proceedings of the 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013. 3551−3558 [31] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2014. 568−576 [32] Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K. Sequence to sequence-video to text. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 4534−4542 [33] Venugopalan S, Hendricks L A, Mooney R, Saenko K. Improving lstm-based video description with linguistic knowledge mined from text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: ACL, 2016. 1961−1966 [34] Tang P J, Wang H L, Li Q Y. Rich visual and language representation with complementary semantics for video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(2): Article No. 31 [35] Bin Y, Yang Y, Shen F M, Xu X, Shen H T. Bidirectional long-short term memory for video description. In: Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM, 2016. 436−440 [36] Pasunuru R, Bansal M. Multi-task video captioning with video and entailment generation. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada: ACL, 2017. 1273−1283 [37] Li L J, Gong B Q. End-to-end video captioning with multitask reinforcement learning. In: Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). Waikoloa, HI, USA: IEEE, 2019. 339−348 [38] Pan P B, Xu Z W, Yang Y, Wu F, Zhuang Y T. Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 1029−1038 [39] Baraldi L, Grana C, Cucchiara R. Hierarchical boundary-aware neural encoder for video captioning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017. 3185−3194 [40] Xu J, Yao T, Zhang Y D, Mei T. Learning multimodal attention LSTM networks for video captioning. In: Proceedings of the 25th ACM International Conference on Multimedia. Mountain View, California, USA: ACM, 2017. 537−545 [41] Song J K, Gao L L, Guo Z, Liu W, Zhang D X, Shen H T. Hierarchical LSTM with adjusted temporal attention for video captioning. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: AAAI Press, 2017. 2737−2743 [42] Li W, Guo D S, Fang X Z. Multimodal architecture for video captioning with memory networks and an attention mechanism. Pattern Recognition Letters, 2018, 105: 23-29 doi: 10.1016/j.patrec.2017.10.012 [43] Chen S X, Jiang Y G. Motion guided spatial attention for video captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8191-8198 [44] Zhang J C, Peng Y X. Object-aware aggregation with bidirectional temporal graph for video captioning. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 8319−8328 [45] Zhang J C, Peng Y X. Video captioning with object-aware spatio-temporal correlation and aggregation. 
IEEE Transactions on Image Processing, 2020, 29: 6209-6222 doi: 10.1109/TIP.2020.2988435 [46] Wang B R, Ma L, Zhang W, Liu W. Reconstruction network for video captioning. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018. 7622−7631 [47] Zhang W, Wang B R, Ma L, Liu W. Reconstruct and represent video contents for captioning via reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(12): 3088-3101 doi: 10.1109/TPAMI.2019.2920899 [48] Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, et al. Describing videos by exploiting temporal structure. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 4507−4515 [49] Shen Z Q, Li J G, Su Z, Li M J, Chen Y R, Jiang Y G, et al. Weakly supervised dense video captioning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017. 5159−5167 [50] Johnson J, Karpathy A, Fei-Fei L. DenseCap: Fully convolutional localization networks for dense captioning. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 4565−4574 [51] Wang J W, Jiang W H, Ma L, Liu W, Xu Y. Bidirectional attentive fusion with context gating for dense video captioning. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018. 7190−7198 [52] Zhou L W, Zhou Y B, Corso J J, Socher R, Xiong C M. End-to-end dense video captioning with masked transformer. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018. 8739−8748 [53] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, California, USA: Curran Associates Inc., 2017. 6000−6010 [54] Zhou L W, Kalantidis Y, Chen X L, Corso J J, Rohrbach M. Grounded video description. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 6571−6580 [55] Mun J, Yang L J, Zhou Z, Xu N, Han B. Streamlined dense video captioning. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 6581−6590 [56] Wang X, Chen W H, Wu J W, Wang Y F, Wang W Y. Video captioning via hierarchical reinforcement learning. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018. 4213−4222 [57] Xiong Y L, Dai B, Lin D H. Move forward and tell: A progressive generator of video descriptions. In: Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018. 489−505 [58] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L. Large-scale video classification with convolutional neural networks. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014. 1725−1732 [59] Heilbron F C, Escorcia V, Ghanem B, Niebles J C. ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 961−970 [60] Shetty R, Laaksonen J. 
Frame- and segment-level features and candidate pool evaluation for video caption generation. In: Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM, 2016. 1073−1076 [61] Yu Y J, Choi J, Kim Y, Yoo K, Lee S H, Kim G. Supervising neural attention models for video captioning by human gaze data. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017. 6119−6127 [62] Wang J B, Wang W, Huang Y, Wang L, Tan T N. M3: Multimodal memory modelling for video captioning. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018. 7512−7520 [63] Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T, et al. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 2625−2634 [64] Tang P J, Wang H L, Kwong S. Deep sequential fusion LSTM network for image description. Neurocomputing, 2018, 312: 154-164 doi: 10.1016/j.neucom.2018.05.086 [65] Pei W J, Zhang J Y, Wang X R, Ke L, Shen X Y, Tai Y W. Memory-attended recurrent network for video captioning. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 8339−8348 [66] Li X L, Zhao B, Lu X Q. Mam-RNN: Multi-level attention model based RNN for video captioning. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: AAAI Press, 2017. 2208−2214 [67] Zhao B, Li X L, Lu X Q. Cam-RNN: Co-attention model based RNN for video captioning. IEEE Transactions on Image Processing, 2019, 28(11): 5552-5565 doi: 10.1109/TIP.2019.2916757 [68] Chen S Z, Jin Q, Chen J, Hauptmann A G. Generating video descriptions with latent topic guidance. IEEE Transactions on Multimedia, 2019, 21(9): 2407-2418 doi: 10.1109/TMM.2019.2896515 [69] Gan C, Gan Z, He X D, Gao J F, Deng L. StyleNet: Generating attractive visual captions with styles. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017. 955−964 [70] Pan B X, Cai H Y, Huang D A, Lee K H, Gaidon A, Adeli E, Niebles J C. Spatio-temporal graph for video captioning with knowledge distillation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020. 10867−10876 [71] Hemalatha M, Sekhar C C. Domain-specific semantics guided approach to video captioning. In: Proceedings of the 2020 Winter Conference on Applications of Computer Vision. Snowmass, CO, USA: IEEE, 2020. 1576−1585 [72] Cherian A, Wang J, Hori C, Marks T M. Spatio-temporal ranked-attention networks for video captioning. In: Proceedings of the 2020 Winter Conference on Applications of Computer Vision (WACV). Snowmass, CO, USA: IEEE, 2020. 1606−1615 [73] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017. 4724−4733 [74] Wang L X, Shang C, Qiu H Q, Zhao T J, Qiu B L, Li H L. Multi-stage tag guidance network in video caption. In: Proceedings of the 28th ACM International Conference on Multimedia. Seattle, WA, USA: ACM, 2020. 4610−4614 [75] Hou J Y, Wu X X, Zhao W T, Luo J B, Jia Y D. 
Joint syntax representation learning and visual cue translation for video captioning. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 8917−8926 [76] Zhang Z Q, Shi Y Y, Yuan C F, Li B, Wang P J, Hu W M, et al. Object relational graph with teacher-recommended learning for video captioning. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020. 13275−13285 [77] Zheng Q, Wang C Y, Tao D C. Syntax-aware action targeting for video captioning. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020. 13093−13102 [78] Hou J Y, Wu X X, Zhang X X, Qi Y Y, Jia Y D, Luo J B. Joint commonsense and relation reasoning for image and video captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 10973-10980 doi: 10.1609/aaai.v34i07.6731 [79] Chen J W, Pan Y W, Li Y H, Yao T, Chao H Y, Mei T. Temporal deformable convolutional encoder-decoder networks for video captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8167-8174 [80] Liu S, Ren Z, Yuan J S. SibNet: Sibling convolutional encoder for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(9): 3259-3272 doi: 10.1109/TPAMI.2019.2940007 [81] Aafaq N, Akhtar N, Liu W, Gilani S Z, Mian A. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 12479−12488 [82] Yu H N, Wang J, Huang Z H, Yang Y, Xu W. Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 4584−4593 [83] Iashin V, Rahtu E. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. In: Proceedings of the British Machine Vision Conference (BMVC). Online (Virtual): Springer, 2020. 1−13 [84] Park J S, Darrell T, Rohrbach A. Identity-aware multi-sentence video description. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 360−378 [85] Krishna R, Hata K, Ren F, Li F F, Niebles J C. Dense-captioning events in videos. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017. 706−715 [86] Escorcia V, Heilbron F C, Niebles J C, Ghanem B. DAPs: Deep action proposals for action understanding. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 768−784 [87] Li Y H, Yao T, Pan Y W, Chao H Y, Mei T. Jointly localizing and describing events for dense video captioning. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018. 7492−7500 [88] Wang T, Zheng H C, Yu M J, Tian Q, Hu H F. Event-centric hierarchical representation for dense video captioning. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1890-1900 doi: 10.1109/TCSVT.2020.3014606 [89] Park J S, Rohrbach M, Darrell T, Rohrbach A. Adversarial inference for multi-sentence video description. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 
6591−6601 [90] Sun C, Myers A, Vondrick C, Murphy K, Schmid C. VideoBERT: A joint model for video and language representation learning. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 7463−7472 [91] Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: ACL, 2019. 4171−4186 [92] Xie S N, Sun C, Huang J, Tu Z W, Murphy K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018. 318−335 [93] Sun C, Baradel F, Murphy K, Schmid C. Learning video representations using contrastive bidirectional transformer. arXiv: 1906.05743, 2019 [94] Luo H S, Ji L, Shi B T, Huang H Y, Duan N, Li T R, et al. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv: 2002.06353, 2020 [95] Mathews A P, Xie L X, He X M. SentiCap: Generating image descriptions with sentiments. In: Proceedings of the 13th AAAI Conference on Artificial Intelligence. Phoenix, Arizona: AAAI Press, 2016. 3574−3580 [96] Guo L T, Liu J, Yao P, Li J W, Lu H Q. MSCap: Multi-style image captioning with unpaired stylized text. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 4199−4208 [97] Park C C, Kim B, Kim G. Attend to you: Personalized image captioning with context sequence memory networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017. 6432−6440 [98] Shuster K, Humeau S, Hu H X, Bordes A, Weston J. Engaging image captioning via personality. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 12508−12518 [99] Chen T L, Zhang Z P, You Q Z, Fang C, Wang Z W, Jin H L, et al. “Factual” or “Emotional”: Stylized image captioning with adaptive learning and attention. In: Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018. 527−543 [100] Zhao W T, Wu X X, Zhang X X. MemCap: Memorizing style knowledge for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 12984-12992 doi: 10.1609/aaai.v34i07.6998 [101] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014. 580−587 [102] Girshick R. Fast R-CNN. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 1440−1448 [103] Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149 doi: 10.1109/TPAMI.2016.2577031 [104] Yandex A B, Lempitsky V. Aggregating local deep features for image retrieval. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 1269−1277 [105] Kalantidis K, Mellina C, Osindero S. 
Cross-dimensional weighting for aggregated deep convolutional features. In: Proceedings of the European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 685−701 [106] Papineni K, Roukos S, Ward T, Zhu W J. BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, Pennsylvania: ACL, 2002. 311−318 [107] Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: ACL, 2005. 65−72 [108] Lin C Y, Och F J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). Barcelona, Spain: ACL, 2004. 605−612 [109] Vedantam R, Zitnick C L, Parikh D. CIDEr: Consensus-based image description evaluation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 4566−4575 [110] Anderson P, Fernando B, Johnson M, Gould S. SPICE: Semantic propositional image caption evaluation. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 382−398 [111] Xu J, Mei T, Yao T, Rui Y. MSR-VTT: A large video description dataset for bridging video and language. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 5288−5296 [112] Song J, Guo Y Y, Gao L L, Li X L, Hanjalic A, Shen H T. From deterministic to generative: Multimodal stochastic RNNs for video captioning. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(10): 3047-3058 doi: 10.1109/TNNLS.2018.2851077 [113] Yang Y, Zhou J, Ai J B, Bin Y, Hanjalic A, Shen H T, et al. Video captioning by adversarial LSTM. IEEE Transactions on Image Processing, 2018, 27(11): 5600-5611 doi: 10.1109/TIP.2018.2855422 [114] Chen Y Y, Wang S H, Zhang W G, Huang Q M. Less is more: Picking informative frames for video captioning. In: Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018. 367−384 [115] Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R. Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of the 25th International Conference on Computational Linguistics. Dublin, Ireland: Dublin City University and Association for Computational Linguistics, 2014. 1218−1227 [116] Dong J F, Li X R, Lan W Y, Huo Y J, Snoek C G M. Early embedding and late reranking for video captioning. In: Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM, 2016. 1082−1086 [117] Gao L L, Guo Z, Zhang H W, Xu X, Shen H T. Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 2017, 19(9): 2045-2055 doi: 10.1109/TMM.2017.2729019 [118] Ramanishka V, Das A, Park D H, Venugopalan S, Hendricks L A, Rohrbach M, et al. Multimodal video description. In: Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM, 2016. 1092−1096 [119] Jin Q, Chen J, Chen S Z, Xiong Y F, Hauptmann A. Describing videos using multi-modal fusion. 
In: Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, The Netherlands: ACM, 2016. 1087−1091 [120] Zhou L W, Xu C L, Corso J J. Towards automatic learning of procedures from web instructional videos. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, Louisiana, USA: AAAI Press, 2018. 7590−7598 [121] Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 3156−3164 [122] Zhang M X, Yang Y, Zhang H W, Ji Y L, Shen H T, Chua T S. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Transactions on Image Processing, 2019, 28(1): 32-44 doi: 10.1109/TIP.2018.2855415 [123] Yang L Y, Wang H L, Tang P J, Li Q Y. CaptionNet: A tailor-made recurrent neural network for generating image descriptions. IEEE Transactions on Multimedia, 2020, 23: 835-845 [124] 汤鹏杰, 王瀚漓, 许恺晟. LSTM逐层多目标优化及多层概率融合的图像描述. 自动化学报, 2018, 44(7): 1237-1249 Tang Peng-Jie, Wang Han-Li, Xu Kai-Sheng. Multi-objective layer-wise optimization and multi-level probability fusion for image description generation using LSTM. Acta Automatica Sinica, 2018, 44(7): 1237-1249 [125] Li X Y, Jiang S Q, Han J G. Learning object context for dense captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8650-8657 [126] Yin G J, Sheng L, Liu B, Yu N H, Wang X G, Shao J. Context and attribute grounded dense captioning. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 6234−6243 [127] Kim D J, Choi J, Oh T H, Kweon I S. Dense relational captioning: Triple-stream networks for relationship-based captioning. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019. 6264−6273 [128] Chatterjee M, Schwing A G. Diverse and coherent paragraph generation from images. In: Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018. 747−763 [129] Wang J, Pan Y W, Yao T, Tang J H, Mei T. Convolutional auto-encoding of sentence topics for image paragraph generation. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence. Macao, China: AAAI Press, 2019. 940−946
-