The State of the Art and Prospects of Lip Reading

Chen Xiaoding; Sheng Changchong; Kuang Gangyao; Liu Li

doi:10.16383/j.aas.c190531


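Details

About the authors:
Chen Xiaoding: Master's student at the College of Systems Engineering, National University of Defense Technology. Research interests: computer vision and pattern recognition. E-mail: chenxiaoding14@nudt.edu.cn

Sheng Changchong: Ph.D. candidate at the College of Electronic Science, National University of Defense Technology. Research interests: computer vision and pattern recognition. E-mail: sheng_cc@nudt.edu.cn

Kuang Gangyao: Professor at the College of Electronic Science, National University of Defense Technology. Research interests: remote sensing image processing and target recognition. E-mail: kuanggangyao@nudt.edu.cn

Liu Li: Associate professor at the College of Systems Engineering, National University of Defense Technology. Research interests: image understanding, computer vision, and pattern recognition. Corresponding author of this paper. E-mail: liuli_nudt@nudt.edu.cn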

Publication history
- Received: 2019-07-16
- Accepted: 2019-11-16
- Published online: 2019-12-19
- Issue date: 2020-11-24

1.
College of Electronic Science, National University of Defense Technology, Changsha 410073
2.
College of Systems Engineering, National University of Defense Technology, Changsha 410073

Funds: Supported by National Natural Science Foundation of China (61872379)

Abstract: Lip reading, also known as visual speech recognition, aims to infer the content of speech from the motion of the speaker's mouth. It is an important problem in computer vision and pattern recognition, with wide applications in public security, healthcare, national defense, and film and entertainment. In recent years, deep learning has greatly advanced lip reading research. This paper first describes the scope and significance of lip reading research and analyzes its difficulties and challenges. It then presents the state of the art, organizing, categorizing, and reviewing mainstream lip reading methods, including traditional methods and recent deep learning based methods. Finally, it discusses open problems and possible research directions, with the aim of drawing attention to lip reading and promoting progress on related problems.
Key words: lip reading, visual speech recognition, spatiotemporal feature extraction, computer vision, deep learning

Full text (HTML)

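Pancreatic cancer is highly invasive, metastasizes early, is highly malignant, progresses quickly, and has a poor prognosis. According to the American Cancer Society, its 5-year survival rate is below 10%, and its mortality is very high^[1]. Pancreatic cancer has become a major threat to human health and a great challenge for clinical medicine. Accurate pancreas segmentation is crucial for tasks such as pancreatic cancer detection and recognition. The pancreas lies in the posterior abdomen, where it is often occluded in images and hard to identify; its shape and spatial position vary considerably, and it occupies only a small fraction of an abdominal CT image, so accurate segmentation remains an open problem.

In recent years, with the development of deep neural networks and the emergence of the fully convolutional network (FCN)^[2], medical image segmentation accuracy has improved considerably. Because pancreas morphology differs greatly between patients, single-stage deep learning segmentation algorithms are easily affected by the large background region, which lowers accuracy. The common remedy is coarse-to-fine segmentation^[3-6]: the mask output by the coarse stage is used for localization, and only the pancreas and its immediate surroundings are kept as input to the fine-stage network, which reduces the influence of the background on the target region and improves accuracy. Although coarse-to-fine algorithms reduce background interference, enhancing the foreground is equally important for a small organ such as the pancreas, whose shape and position vary. Moreover, the coarse stage passes on only the location of the bounding box and discards the prior feature information in its output mask, so the fine stage lacks the context of the coarse stage and sometimes produces a worse result than the coarse stage, as shown in Fig. 1. In addition, because the pancreas and neighboring organs have similar densities in CT and the boundaries of overlapping tissue are hard to distinguish, failing to use the context of predicted masks from adjacent slices often causes false segmentation, as shown in Fig. 2. Comparing adjacent predicted masks makes it clear that the middle slice contains a falsely segmented region (shown in red); properly using the slice context of the predicted masks can calibrate such regions.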

Fig. 1 A failure case of the coarse-to-fine pancreas segmentation approach

Fig. 2 An example of false segmentation

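To address the lack of coarse-stage context in the fine stage, [3] proposed a fixed-point segmentation method. In training, a coarse network is trained on annotated pancreas data; its predictions are then used to localize and crop the original CT image so that only the pancreas and its surroundings are fed to the fine network, whose output is optimized by backpropagation. In testing, the fine network's parameters are fixed; the mask it predicts yields a bounding box used to crop the CT image, which is fed to the fine network again, and this process is iterated to obtain a refined mask, alleviating the lack of stage context. In essence, however, this method only recycles the location of the fine-stage bounding box; it does not recycle the segmentation mask itself and has no joint training, so the improvement is limited.

To use slice context against the false segmentation caused by similar densities and indistinct tissue boundaries, researchers have proposed methods based on convolutional long short-term memory (CLSTM)^[7] and on 3D segmentation networks^[8-10]. [8] feeds adjacent CT slices into convolutional gated recurrent units (CGRU)^[11], so that the current hidden-layer output is fused into the hidden layer of the next time step; through forward propagation, the current hidden layer obtains an output representation that incorporates the context of preceding slices. [9] uses a bidirectional convolutional recurrent neural network to exploit the context of slices both before and after the current one. However, most segmentation methods based on convolutional recurrent networks can use slice context only in input order, in reverse order, or in a combination of both. They depend heavily on the order of the input sequence, and the farther apart two slices are, the less context they can share during forward propagation. In contrast, [10] feeds neighboring slices into a 3D convolutional neural network, which makes effective use of slice context and improves results. But 3D methods are limited by scarce 3D training data and high GPU memory consumption, so most of them segment local 3D patches. Although slice context within a patch is well used, global 3D information lacks continuity, which leaves excessive noise in the masks. Whereas 3D methods must contend with scarce 3D data and the memory cost of many parameters, the problems of 2D methods based on convolutional recurrent networks can be solved through algorithm design.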

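Based on the above analysis, this paper designs a recurrent saliency calibration network to address the problems of existing 2D coarse-to-fine pancreas segmentation methods; it incorporates more stage context and slice context. A convolutional self-attention calibration module uses slice context across slice order to calibrate the pancreas mask at every stage, and the mask from the previous stage is recurrently used to localize the target region and enhance the input of the current stage, so that the segmentation task is jointly optimized. Experiments on public datasets show that the method effectively addresses the problems above. The main contributions are as follows:

1) A recurrent saliency calibration network is proposed. It recurrently uses the saliency of the previous stage's pancreas mask to enhance pancreas-region features in the current stage, and obtains more stage context through joint training.

2) A convolutional self-attention module is designed so that the predicted masks of all input slices supervise one another in parallel and across slice order, which calibrates the predicted masks.

3) Extensive experiments on the NIH (National Institutes of Health) and MSD (Medical Segmentation Decathlon) pancreas datasets verify the effectiveness and competitiveness of the proposed method.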

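1. Related Work

Coarse-to-fine two-stage segmentation. Coarse-to-fine two-stage methods fall into two groups: traditional and deep learning based. The former obtain a coarse result with traditional algorithms such as superpixels and atlases, then refine it with methods such as random forests and graph cut^[12-13]. The latter^[14] are data driven, learn model parameters automatically, and classify at the pixel level; owing to their high accuracy and stability, they are gradually replacing traditional coarse-to-fine methods.

In coarse-stage training, a deep learning based coarse-to-fine method feeds the CT slice $ {M^C} $ through the coarse segmentation network $ {f(M^C,{\theta}^C)} $; the prediction, denoted $ {N^C} $, is compared with the ground truth $ {Y} $ to compute a loss, and the coarse result is optimized by backpropagation. In fine-stage training, the minimum bounding rectangle algorithm is applied to the coarse prediction to obtain the pancreas coordinates $ (p_x, p_y, w, h) $, the CT input $ {M^C} $ is cropped accordingly, and the resulting region of interest $ {M^F} $ is the input of the fine network $ {f(M^F,{\theta}^F)} $. The fine network's prediction $ {N^F} $ is restored to the original image size, denoted $ {Y^P} $, and compared with the ground truth $ {Y} $ to compute a loss; the fine result is optimized by backpropagation. Here the superscripts $ {C} $ and $ {F} $ denote the coarse and fine stages; $ {(p_x, p_y)} $, $ {w} $, and $ {h} $ denote the top-left corner, width, and height of the bounding rectangle; and $ {{\theta}^C} $, $ {{\theta}^F} $ denote the parameters of the coarse and fine networks.

In testing, a CT slice is fed to the trained coarse and fine networks to obtain the result. The networks used in coarse-to-fine methods are mainly UNet^[15], FCN^[2], and improved variants of these two. Building on the coarse-to-fine approach and the anatomical properties of the pancreas, this paper proposes a pancreas segmentation method based on a recurrent saliency calibration network, using UNet^[15] as the backbone in all stages.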

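Pancreas segmentation. Common traditional medical image segmentation methods include level sets^[16], mixture probabilistic graphical models^[17], and active contour models^[18]. With the development of deep learning, methods based on convolutional neural networks (CNNs) are gradually replacing traditional ones thanks to their higher accuracy and better generalization. Most deep learning pancreas segmentation methods derive from FCN^[2], which replaces the final fully connected layers of a CNN with convolutional layers and fuses shallow features with upsampled deep features to supply location information for the target, improving accuracy. Another common approach uses an encoder-decoder structure^[15]: the encoder extracts progressively higher-level semantic features layer by layer, the decoder restores the resolution layer by layer to the original size by transposed convolution or upsampling, and encoder and decoder layers at the same level are linked by skip connections.

Because the shape, size, and position of the pancreas vary, single-stage networks based on FCN or an encoder-decoder structure struggle to segment it accurately. [3] first proposed a CNN-based coarse-to-fine two-stage algorithm that uses the location of the coarse mask to crop the input of the fine network, reducing the influence of the background on the pancreas region. Going further, [19] also adopts the two-stage scheme and reduces the parameters of both stages with a lightweight module, while [20] crops the image directly around a center point as input to the fine network. Unlike these methods, which localize the pancreas directly by coarse segmentation, [21] localizes it using the positions of the liver, spleen, and kidneys. [22] proposes a bottom-up method that first performs coarse segmentation by superpixel partitioning and then aggregates results over superpixels. [23] fuses information from the three axes of the CT volume by max pooling to obtain candidate regions, then aggregates the segmentation from the boundary to the interior within them. [24] proposes an atlas-based coarse-to-fine method that improves results. More recently, [25] proposed a two-stage algorithm based on reinforcement learning: a DQN (deep Q network) first regresses the pancreas coordinates so that the image can be cropped to the pancreas and its surroundings, and the fine stage then uses a deformable convolutional network to produce the segmentation.

All of these methods achieve fairly accurate results, but they train the coarse and fine stages separately, and the lack of coarse-stage context in the fine stage remains hard to handle effectively. In testing, [3] uses a fixed-point algorithm: with the fine model's parameters fixed, the mask predicted at the current stage yields a bounding box that serves as the prior for the next stage's input, which approximates the use of context from earlier stages. In essence, this method only iterates on the bounding-box location of the fine stage and makes no effective use of the coarse-stage output mask.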

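Using slice context properly to resolve false segmentation is equally important. [8, 26] first extract features with a CNN and then use a convolutional LSTM^[7] to extract slice context for segmentation, but the slice context cannot be shared in parallel across slice order, and information is lost during forward propagation. [27] uses adversarial learning, constraining the main segmentation network with two discriminators that capture spatial semantic information and slice context respectively, but the instability of adversarial networks makes training and test results fluctuate considerably. [28] takes local patches of adjacent slices as input; the encoder fuses slice context progressively with 3D convolutions, and the decoder outputs the mask of the middle slice with 2D transposed convolutions. Because slice thickness and spacing differ between CT scans and the patch input is not interpolated, the captured 3D slice context is inconsistent and local. [10, 29-30] obtain slice context with 3D segmentation methods, but limited GPU memory and 3D data leave global 3D information without continuity.

Existing pancreas segmentation methods lack stage context, and recurrent convolutional networks depend on slice order and lose global information as the distance between slices grows. To address these problems, this paper proposes a 2D pancreas segmentation network that recurrently uses stage context and slice context. First, adjacent CT slices are fed to the coarse network to obtain a coarse mask. Second, the minimum bounding rectangle algorithm localizes the pancreas region from the coarse mask. Third, the coarse mask is used as a weight to enhance pancreas-region features in the fine-stage input, producing a fine mask, which in turn enhances the input of the next stage in the same way. Finally, this process is iterated until a stopping condition is met. The method effectively lowers the mean segmentation error and improves stability.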

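2. Proposed Method

To address the problems of current coarse-to-fine two-stage pancreas segmentation algorithms in using stage context and slice context, this paper proposes a recurrent saliency calibration network. It uses UNet and a convolutional self-attention calibration module as the backbone and takes adjacent axial pancreas CT slices as input. At each stage, the calibration module uses slice context to calibrate the UNet output mask, and the saliency of its own output mask enhances the UNet input of the next stage. UNet and the calibration module are applied recurrently, combining stage context and slice context to improve segmentation performance. The overall architecture is shown in Fig. 3.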

Fig. 3 Recurrent saliency calibration network

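2.1 Recurrent saliency calibration network

This paper focuses on combining context across stages with context across slices of the pancreas image sequence to improve accuracy. The recurrent saliency calibration network is proposed to use prior information from the current stage's mask, such as position and shape, to enhance the input of the next stage, and to correct obvious false segmentation by using the masks of adjacent slices directly, in parallel and across slice order. Its iterative process is shown in Fig. 4. Recurrent saliency means that at each stage the calibrated mask $ P $ passes through the enhancement module $ {g(P,\varphi )} $, whose feature extraction yields a pixel matrix related to the pancreas foreground. This matrix is multiplied pixel by pixel with the next stage's input image $ M $, which enhances the pancreas region and suppresses the background.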

Fig. 4 Iteration process

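UNet is chosen as the base segmentation model $ {f(\cdot ,\theta )} $ and serves as the backbone. Its input is a set of adjacent pancreas CT slices, denoted $ X $, from which it infers the output mask $ N $. Because the pancreas and neighboring organs have similar densities and the boundaries of overlapping tissue are hard to distinguish, the base network is prone to false segmentation. This paper therefore designs a convolutional self-attention calibration module $ {a(\cdot,\eta )} $ based on slice context to calibrate the mask $ N $; its output is denoted $ P $. To localize the pancreas more accurately, $ P $ is binarized with a fixed pixel threshold of 0.5, as in Eq. (1).

$$ \begin{equation} Z_{i j} = \left\{\begin{aligned}&1, & P_{i j} \geq 0.5 \\&0, & P_{i j}<0.5 \end{aligned}\right. \tag{1} \end{equation} $$

where $ {i} $ and $ {j} $ are the pixel coordinates in the mask.

Applying the minimum bounding rectangle algorithm to the output $ {Z} $ of Eq. (1) yields the coordinates $ { (p_x, p_y, w, h)} $ of the box enclosing the pancreas mask; the procedure is shown in Fig. 5, where blue boxes localize individual connected regions and the green box merges the segments into a single localization.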
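The thresholding of Eq. (1) and the localization step can be sketched as below. This is a minimal pure-Python illustration, not the authors' implementation: the helper names `binarize` and `bounding_box` are hypothetical, masks are plain nested lists, and the box is taken over all foreground pixels at once, which corresponds to the merged (green) box rather than the per-region (blue) boxes.

```python
from typing import List, Optional, Tuple

def binarize(prob: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Eq. (1): Z_ij = 1 if P_ij >= threshold, else 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob]

def bounding_box(z: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Smallest axis-aligned box (p_x, p_y, w, h) enclosing all foreground pixels.

    (p_x, p_y) is the top-left corner. Returns None for an empty mask
    (a convention assumed here; the text does not cover that case).
    """
    rows = [i for i, row in enumerate(z) if any(row)]
    cols = [j for row in z for j, v in enumerate(row) if v]
    if not rows:
        return None
    p_y, p_x = min(rows), min(cols)
    return p_x, p_y, max(cols) - p_x + 1, max(rows) - p_y + 1

# Toy calibrated mask P with two disconnected foreground fragments.
prob = [
    [0.1, 0.2, 0.1, 0.0],
    [0.0, 0.7, 0.9, 0.1],
    [0.0, 0.6, 0.4, 0.8],
    [0.0, 0.0, 0.1, 0.0],
]
box = bounding_box(binarize(prob))
print(box)
```

Taking a single box over every foreground pixel keeps fragmented predictions inside one region of interest, at the cost of including more background when a fragment is spurious.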

Fig. 5 The process of localization based on minimum rectangle algorithm

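To remedy the lack of current-stage context in the next stage, the calibrated mask $ {P} $ is fed as a latent variable to the saliency enhancement module $ {g(P,\varphi )} $, which extracts feature probabilities that serve as a prior spatial weight on the next stage's network input $ X $. Combined with the coordinates $ { (p_x, p_y, w, h)} $, this enhances and shrinks the next stage's input and greatly reduces the influence of the background. For the pancreas, which occupies a small part of the abdominal image and varies in shape and position, this step is essential: it enhances the pancreas region and weakens irrelevant regions, as in Eq. (2).

$$ \begin{equation} M = {\rm{Crop}}(X \otimes g(P, \varphi)) \tag{2} \end{equation} $$

where $ M $ is the enhanced and cropped input of the next stage; $ \otimes $ denotes element-wise multiplication; Crop crops the input of each stage using the coordinates $(p_x, p_y, w, h)$; and $ \theta $, $ \eta $, $ \varphi $ are the shared parameters of the respective modules.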
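Eq. (2) can be illustrated with a small sketch. It assumes the output of $g(P,\varphi)$ is already available as a weight map of the same size as $X$; the learned convolution inside $g$ is not modeled, and the function name `enhance_and_crop` is hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]

def enhance_and_crop(x: Image, weight: Image, box: Tuple[int, int, int, int]) -> Image:
    """M = Crop(X (*) weight): element-wise product restricted to the box.

    box is (p_x, p_y, w, h) with (p_x, p_y) the top-left corner, as in the text.
    """
    p_x, p_y, w, h = box
    return [
        [x[i][j] * weight[i][j] for j in range(p_x, p_x + w)]
        for i in range(p_y, p_y + h)
    ]

# Toy 3x3 input and a weight map standing in for g(P, phi).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
weight = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.5, 1.0]]
m = enhance_and_crop(x, weight, box=(1, 1, 2, 2))
print(m)
```

Pixels with a low weight are attenuated before cropping, so the next stage sees both a smaller field of view and a foreground-emphasized input.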

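The right panel of Fig. 4 unrolls the left panel. $ {M_0} $, the initial pancreas input image, equals $ {X} $ and is much larger than the inputs $ {M_t \;(t>0)} $ of the other stages, so the parameters $ \theta $ of the first (coarse) stage and of the remaining stages are distinguished as $ {\theta}^C $ and $ {\theta}^F $. During iteration, the input $ {X} $ of every stage is unchanged and must be multiplied pixel by pixel with the output of the saliency enhancement module $ {g(P,\varphi )} $; to keep the output of $ {g(P,\varphi )} $ the same size as its input, its convolution uses a $ 3\times 3 $ kernel with stride 1 and padding 1.

The whole iterative segmentation process is given by Eq. (3).

$$ \begin{equation} P_{t} = a\left(f\left({\rm{Crop}}\left(X \otimes g\left(P_{t-1}, \varphi\right)\right), \theta\right), \eta\right) \tag{3} \end{equation} $$

As the analysis above shows, the whole computation is differentiable, so the losses of all stages are combined for joint training. This paper uses a loss based on the DSC (Dice-Sørensen coefficient), as in Eq. (4).

$$ \begin{equation} {\cal{L}}(Y, P) = 1-\frac{2 \sum Y P}{\sum Y+\sum P} \tag{4} \end{equation} $$

where $ Y $ is the ground truth and $ P $ is the predicted mask at each stage.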
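A direct transcription of Eq. (4) for masks stored as nested lists is sketched below. The function name `dice_loss` is hypothetical, and returning 0 when both masks are empty is an assumed convention, since Eq. (4) is undefined in that case.

```python
from typing import List

def dice_loss(y: List[List[float]], p: List[List[float]]) -> float:
    """Eq. (4): 1 - 2 * sum(Y * P) / (sum(Y) + sum(P))."""
    inter = sum(a * b for row_y, row_p in zip(y, p) for a, b in zip(row_y, row_p))
    total = sum(map(sum, y)) + sum(map(sum, p))
    if total == 0:
        return 0.0  # assumed convention: two empty masks agree perfectly
    return 1.0 - 2.0 * inter / total

y = [[1, 1], [0, 0]]   # ground truth with two foreground pixels
p = [[1, 0], [0, 0]]   # prediction that recovers one of them
loss = dice_loss(y, p)  # 1 - 2*1 / (2 + 1) = 1/3
print(round(loss, 4))
```

Because the formula uses products and sums only, it also accepts soft probabilities in `p`, which is what makes it usable as a differentiable training loss.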

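Combining the segmentation network and the convolutional self-attention calibration module at every stage, the joint DSC loss is given by Eq. (5).

$$ \begin{equation} {\cal{L}} = \sum\limits_{i = 0}^{T} {{{\lambda}}}_{i}\left[{\cal{L}}\left(Y, N_{i}\right)+{\cal{L}}\left(Y, P_{i}\right)\right] \tag{5} \end{equation} $$

where $ T $ is the stopping threshold on the number of recurrent segmentation steps. Because the input size of the coarse stage differs from that of the other stages, and the coarse stage mainly provides initial localization and a coarse mask, it is given a smaller weight, with $3\lambda_{0} = \lambda_{1} = \lambda_{2} = \cdots = \lambda_{T} = 3/(3T+1)$, so that the weights sum to 1.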
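The weighting rule and Eq. (5) can be checked numerically with the sketch below. The names `stage_weights` and `joint_loss` are hypothetical, and the per-stage losses are passed in as plain numbers rather than computed from masks.

```python
from fractions import Fraction
from typing import List, Sequence

def stage_weights(T: int) -> List[Fraction]:
    """Weights with 3*lambda_0 = lambda_1 = ... = lambda_T = 3 / (3T + 1)."""
    lam = Fraction(3, 3 * T + 1)
    return [lam / 3] + [lam] * T

def joint_loss(loss_n: Sequence[float], loss_p: Sequence[float]) -> float:
    """Eq. (5): sum_i lambda_i * (L(Y, N_i) + L(Y, P_i)) for stages i = 0..T."""
    T = len(loss_n) - 1
    return sum(
        float(lam) * (ln + lp)
        for lam, ln, lp in zip(stage_weights(T), loss_n, loss_p)
    )

weights = stage_weights(4)   # T = 4, the training setting reported in the paper
print(weights, sum(weights))  # the coarse stage gets 1/13, every other stage 3/13
```

With `T = 4` the coarse stage contributes 1/13 of the total weight and each later stage 3/13, which makes the normalization explicit.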

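2.2 Convolutional self-attention calibration module

To address false segmentation caused by the similar densities of the pancreas and neighboring organs and the indistinct boundaries of overlapping tissue, this paper embeds a convolutional self-attention calibration module in every stage of the recurrent saliency calibration network; it uses slice context to calibrate falsely segmented regions in adjacent pancreas slices.

The module is based on the self-attention mechanism^[31]. When processing a sequence, self-attention lets each input interact directly with the inputs at all other time steps, in parallel and regardless of order. However, the linear transformations used in self-attention ignore the spatial relationships between image pixels, so the proposed module replaces them with convolutions. The module is shown in the left panel of Fig. 6 (for a batch size of 3); the right panel shows the computation of a single calibrated mask, and the other two are computed in the same way. In the figure, $ P $ is shown as a heat map of the predicted mask, with a color bar in the middle ranging over $ [0,1.0] $. For ease of description, all quantities in the module are written in vector or matrix form as follows.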

Fig. 6 Network of convolution self-attention calibration module

$$ \begin{split} &N_{t} = \left(\begin{array}{l} N_{t}^{1} \\N_{t}^{2} \\N_{t}^{3} \end{array}\right), Q = \left(\begin{array}{l} q^{1} \\q^{2} \\q^{3} \end{array}\right), \\ &K = \left(\begin{array}{l} k^{1} \\k^{2} \\k^{3} \end{array}\right), V = \left(\begin{array}{l} v^{1} \\v^{2} \\v^{3} \end{array}\right), \end{split}\qquad\qquad $$

$$ \begin{split} &B = \left(\begin{array}{lll} b_{11} & b_{12} & b_{13} \\b_{21} & b_{22} & b_{23} \\b_{31} & b_{32} & b_{33} \end{array}\right),\\ &C = \left(\begin{array}{lll} c_{11} & c_{12} & c_{13} \\c_{21} & c_{22} & c_{23} \\c_{31} & c_{32} & c_{33} \end{array}\right), P_{t} = \left(\begin{array}{l} P_{t}^{1} \\P_{t}^{2} \\P_{t}^{3} \end{array}\right) \end{split} $$

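where $ N_t $ denotes the masks of adjacent slices output by the segmentation network $ f( M_t,\theta ) $; $ Q $ is the query feature vector; $ K $ and $ V $ are the key and value feature vectors; $ B $ is the similarity matrix; $ C $ is the pixel weight matrix; and $ P_t $ is the mask output by the calibration module.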

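To extract diverse feature representations, each element of $ N_t $ first passes through its own $ 3\times3 $ convolution, giving $ \alpha_{1}, \alpha_{2}, \alpha_{3} $. The feature $ \alpha_{1} $ then passes through three different $ 1\times1 $ convolutions to give the query $q^1$, key $k^1$, and value $v^1$ of $ N_t^1 $; $ \alpha_{2} $ and $ \alpha_{3} $ yield $ (q^2,k^2,v^2 ) $ and $ (q^3,k^3,v^3 ) $ in the same way. All $ 3\times3 $ and $ 1\times1 $ convolutions preserve the input size. The output mask $ P_t $ is then obtained from $ Q $, $ K $, $ V $ as in Eqs. (6) to (8).

$$ \begin{equation} B = Q K^{{\rm{T}}} \tag{6} \end{equation} $$

$$ \begin{equation} C = {\rm{softmax}} \left(\frac{B}{ \sqrt{d_{k}}}, dim = -2\right) \tag{7} \end{equation} $$

$$ \begin{equation} P_{t} = CV \tag{8} \end{equation} $$

$ B $ is obtained by measuring the similarity between the query of each element of $ N_t $ and the key of every element, and expresses how strongly each element influences the current one. The similarity here is a pixel-by-pixel product, and $ d_k $ is the dimension of the key vector. $ C $ normalizes the similarity matrix with the $ {\rm{softmax}} $ function, where $ dim = -2 $ means normalization along the second-to-last dimension. The resulting pixel weights are multiplied pixel by pixel with the value $ v $ of each element of $ N_t $ and then summed to give the final output $ P_t $. Eqs. (9) to (15) give the detailed computation; every convolution has its own parameters.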

$$ \begin{equation} \alpha_{1}, \alpha_{2}, \alpha_{3} = {\rm{conv}}_{3 \times 3}\left(N_{t}^{1}\right), {\rm{conv}}_{3 \times 3}\left(N_{t}^{2}\right), {\rm{conv}}_{3 \times 3}\left(N_{t}^{3}\right) \tag{9} \end{equation} $$

$$ \begin{equation} \left(q^{1}, k^{1}, v^{1}\right) = {\rm{conv}}_{1 \times 1}\left(\alpha_{1}\right), {\rm{conv}}_{1 \times 1}\left(\alpha_{1}\right), {\rm{conv}}_{1 \times 1}\left(\alpha_{1}\right) \tag{10} \end{equation} $$

$$ \begin{equation} \left(q^{2}, k^{2}, v^{2}\right) = {\rm{conv}}_{1 \times 1}\left(\alpha_{2}\right), {\rm{conv}}_{1 \times 1}\left(\alpha_{2}\right), {\rm{conv}}_{1 \times 1}\left(\alpha_{2}\right) \tag{11} \end{equation} $$

$$ \begin{equation} \left(q^{3}, k^{3}, v^{3}\right) = {\rm{conv}}_{1 \times 1}\left(\alpha_{3}\right), {\rm{conv}}_{1 \times 1}\left(\alpha_{3}\right), {\rm{conv}}_{1 \times 1}\left(\alpha_{3}\right) \tag{12} \end{equation} $$

$$ \begin{equation} b_{i, 1}, b_{i, 2}, b_{i, 3} = q^{i} k^{1}, q^{i} k^{2}, q^{i} k^{3}\;(i = 1,2,3) \tag{13} \end{equation} $$

$$ \begin{equation} \left(c_{j, 1}, c_{j, 2}, c_{j, 3}\right) = {\rm{softmax}} \left(\frac{b_{j, 1}}{ \sqrt{d_{k}}}, \frac{b_{j, 2}}{ \sqrt{d_{k}}}, \frac{b_{j, 3} }{ \sqrt{d_{k}}}\right)\;\;(j = 1,2,3) \tag{14} \end{equation} $$

$$ \begin{equation} P_{t}^{1}, P_{t}^{2}, P_{t}^{3} = \sum\limits_{k = 1}^{3} c_{1, k} v^{k}, \sum\limits_{k = 1}^{3} c_{2, k} v^{k}, \sum\limits_{k = 1}^{3} c_{3, k} v^{k} \tag{15} \end{equation} $$
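The attention step of Eqs. (13) to (15) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' module: the $3\times3$ and $1\times1$ convolutions of Eqs. (9) to (12) are skipped and the single-channel maps `q`, `k`, `v` are taken as given, `d_k` defaults to 1, and the function name `calibrate` is hypothetical.

```python
import math
from typing import List

Maps = List[List[List[float]]]  # indexed as [slice][row][col]

def calibrate(q: Maps, k: Maps, v: Maps, d_k: float = 1.0) -> Maps:
    """Pixel-wise self-attention across slices, following Eqs. (13)-(15).

    For every slice i and pixel, b_ij = q_i * k_j is scaled by sqrt(d_k),
    normalized over j with a softmax, and used to mix the values v_j.
    """
    n, h, w = len(q), len(q[0]), len(q[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(n)]
    for i in range(n):
        for r in range(h):
            for c in range(w):
                scores = [q[i][r][c] * k[j][r][c] / math.sqrt(d_k) for j in range(n)]
                top = max(scores)  # subtract the max for numerical stability
                exps = [math.exp(s - top) for s in scores]
                z = sum(exps)
                out[i][r][c] = sum(e / z * v[j][r][c] for j, e in enumerate(exps))
    return out

# Three adjacent slices of size 1x2. With all-zero queries the weights are
# uniform, so every output pixel is the mean of the values across slices.
q = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]]
k = [[[1.0, 2.0]], [[0.5, 0.1]], [[0.3, 0.7]]]
v = [[[0.0, 0.9]], [[0.6, 0.9]], [[0.9, 0.9]]]
p_t = calibrate(q, k, v)
```

Every slice attends to every other slice in one step, independent of slice order, which is the property the text contrasts with recurrent calibration modules.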

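3. Experimental Results and Analysis

3.1 Datasets and preprocessing

The experiments use the NIH^[22] and MSD^[32] pancreas segmentation datasets. The NIH dataset contains CT scans of 82 subjects, each with between 181 and 466 slices of $ 512\times 512 $ pixels and a slice thickness between 0.5 mm and 1.0 mm. The MSD dataset contains CT scans of 281 subjects, each with between 37 and 751 slices of $ 512\times512 $ pixels.

In all experiments, the HU (Hounsfield unit) values of the CT slices are clipped to [−120, 340] based on dataset statistics, the slices and their labels are normalized to [0, 1], and random rotations in $ [-15^{\circ}, 15^{\circ}] $ are applied.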
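The intensity preprocessing can be sketched as below. It assumes that normalization to [0, 1] means min-max scaling over the clipping window, which the text does not state explicitly; the function name `normalize_hu` is hypothetical and the random rotation is omitted.

```python
from typing import List

def normalize_hu(slice_hu: List[List[float]],
                 lo: float = -120.0, hi: float = 340.0) -> List[List[float]]:
    """Clip HU values to [lo, hi], then rescale linearly to [0, 1]."""
    return [
        [(min(max(value, lo), hi) - lo) / (hi - lo) for value in row]
        for row in slice_hu
    ]

# Air (-1000 HU) saturates at 0, dense bone (1500 HU) at 1,
# and 110 HU lands exactly in the middle of the window.
normalized = normalize_hu([[-1000.0, 110.0, 340.0, 1500.0]])
print(normalized)
```

Clipping to a soft-tissue window discards intensity ranges that carry no information about the pancreas, so the remaining contrast fills the full [0, 1] range.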

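3.2 Implementation details and evaluation metric

Experiments use PyTorch 1.2.0 on Ubuntu 16.04 with two RTX 2080 Ti GPUs. Training uses a batch size of 3, ReLU^[33] activations, the Adam^[34] optimizer, and a learning rate of $ lr = 1.0\times10^{-4} $. Limited by GPU memory, the maximum number of iterations $ T $ in training is set to 4. Four-fold cross-validation is used to ensure robust results: the dataset is divided evenly into 4 parts, 3 of which form the training set and the remaining one the validation set in each run. The mean DSC over the 4 runs is reported as the final result.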
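The cross-validation protocol can be sketched as follows. How subjects are assigned to folds is not described in the text, so the interleaved assignment without shuffling used here is an assumption, and the function name `k_fold_splits` is hypothetical.

```python
from typing import Iterator, List, Tuple

def k_fold_splits(n_cases: int, k: int = 4) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for k-fold cross-validation.

    Cases are dealt to folds in interleaved order, so fold sizes differ by
    at most one when n_cases is not divisible by k.
    """
    folds = [list(range(i, n_cases, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

# The NIH dataset has 82 subjects, so the four validation folds cannot be equal.
splits = list(k_fold_splits(82, k=4))
print([len(val) for _, val in splits])
```

Splitting by subject rather than by slice keeps all slices of one scan on the same side of the split, which avoids leaking near-identical adjacent slices into validation.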

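Training. Training minimizes the loss in Eq. (5) by backpropagation. Note that early in training the randomly initialized networks produce wrong masks at every stage, so during the initial phase the ground-truth labels are used as the context prior to enhance and crop the input of the next stage.

Testing. Unlike training, testing has no ground-truth labels, so the mask of each stage is used as the prior to enhance and shrink the input of the next stage. Testing also requires no parameter optimization and intermediate results can be discarded, so the iteration threshold is no longer limited by GPU memory and is in principle unbounded. The stopping threshold $ T $ is set to 6 in testing, because the experiments show that accuracy gains are limited at larger iteration counts.

DSC is the evaluation metric, as in Eq. (16): twice the intersection of the ground truth and the predicted mask divided by the sum of their sizes, where $ Y $ is the ground truth and $ P $ is the predicted mask.

$$ \begin{equation} {\rm{DSC}}(Y, P) = \frac{2 \sum YP}{\sum Y+\sum P} \tag{16} \end{equation} $$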

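3.3 Comparative experiments

This section sets up control experiments on the public NIH and MSD datasets to validate the pancreas segmentation method based on the recurrent saliency calibration network. It has five parts: 1) effectiveness of stage context; 2) effectiveness of slice context; 3) effectiveness of the recurrent saliency calibration network, which combines both; 4) effect of the number of input slices on the results; 5) parameter count and time consumption.

3.3.1 Effectiveness of stage context

The comparisons in this section show how stage context affects segmentation performance, in two parts.

1) Separate coarse-to-fine training, joint coarse-to-fine training, and joint training of the recurrent saliency network are compared. The latter two use the saliency enhancement module to exploit stage context, and the convolutional self-attention calibration module is removed from every stage. Results are in Table 1. Joint coarse-to-fine training gives a higher mean accuracy and a lower standard deviation than separate training on both datasets, mainly because the saliency enhancement module enhances the pancreas region and the coarse and fine stages are optimized jointly with shared stage context. The further gain of the recurrent saliency network over joint coarse-to-fine training comes from training jointly with more stage context. More stage context therefore contributes substantially to pancreas segmentation accuracy.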

Table 1 Segmentation results of coarse-to-fine separate training, joint training, and recurrent saliency joint training

Method	Mean DSC ± Std (%), NIH	Mean DSC ± Std (%), MSD	Max DSC (%), NIH	Max DSC (%), MSD	Min DSC (%), NIH	Min DSC (%), MSD
Coarse-to-fine, separate training	$81.96 \pm 5.79$	$78.92 \pm 9.61$	89.58	89.91	48.39	51.23
Coarse-to-fine, joint training	$83.08 \pm 5.47$	$80.80 \pm 8.79$	90.58	91.13	49.94	52.79
Recurrent saliency network, joint training	$85.56 \pm 4.79$	$83.24 \pm 5.93$	91.14	92.80	62.82	64.47

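2) The test phase of the recurrent saliency network is analyzed in Table 2. After iteration 1, the mean DSC on the NIH dataset rises from 76.81% to 84.89% and the standard deviation falls from 9.68% to 5.14%; on the MSD dataset the mean DSC rises from 73.46% to 81.67% and the standard deviation falls from 11.73% to 8.05%. Introducing the context of the coarse-stage mask improves both mean DSC and stability considerably. In later iterations, the mask prior matters less for results that are already fairly accurate, so on both datasets the mean DSC and the standard deviation change only slightly. The minimum DSC, however, improves markedly: by iteration 6 it rises from 40.12% to 62.82% on NIH (peaking at 62.97% at iteration 4) and from 47.76% to 64.47% on MSD, which raises the accuracy on hard cases.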

Table 2 Test results of recurrent saliency network segmentation

Iteration	Mean DSC ± Std (%), NIH	Mean DSC ± Std (%), MSD	Max DSC (%), NIH	Max DSC (%), MSD	Min DSC (%), NIH	Min DSC (%), MSD
Iteration 0 (coarse segmentation)	$76.81 \pm 9.68$	$73.46 \pm 11.73$	87.94	88.67	40.12	47.76
Iteration 1	$84.89 \pm 5.14$	$81.67 \pm 8.05$	91.02	91.89	50.36	52.90
Iteration 2	$83.34\pm 5.07$	$82.23 \pm 7.57$	90.96	91.94	53.73	56.81
Iteration 3	$85.63 \pm 4.96$	$82.78 \pm 6.83$	91.08	92.32	57.96	58.04
Iteration 4	$85.79 \pm 4.83$	$82.94 \pm 6.46$	91.15	92.56	62.97	63.73
Iteration 5	$85.82 \pm 4.82$	$83.15 \pm 6.04$	91.20	92.77	62.85	63.99
Iteration 6	$85.86 \pm 4.79$	$83.24 \pm 5.93$	91.14	92.80	62.82	64.47

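3.3.2 Effectiveness of slice context

The comparisons in this section show how slice context affects pancreas segmentation performance, in two parts.

1) Joint coarse-to-fine training and joint training of the recurrent saliency network are each evaluated with and without the convolutional self-attention calibration module, which exploits slice context, as in Table 3. Adding the module to joint coarse-to-fine training raises the mean DSC by 1.64 percentage points and lowers the standard deviation by 0.40 percentage points on NIH; on MSD the mean DSC rises by 1.29 percentage points and the standard deviation falls by 0.88 percentage points. The minimum DSC also rises on both datasets. Likewise, the recurrent saliency network with the calibration module (the proposed method) is clearly more accurate and more stable than the same network without it. The convolutional self-attention calibration module can therefore use slice context to improve pancreas segmentation.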

Table 3 Comparison of results with and without the calibration module

Method	Mean DSC ± Std (%), NIH	Mean DSC ± Std (%), MSD	Max DSC (%), NIH	Max DSC (%), MSD	Min DSC (%), NIH	Min DSC (%), MSD
Coarse-to-fine joint training, without calibration module	$83.08 \pm 5.47$	$80.80 \pm 8.79$	90.58	91.13	49.94	52.79
Coarse-to-fine joint training, with calibration module	$84.72 \pm 5.07$	$82.09 \pm 7.91$	90.98	92.90	50.27	53.35
Recurrent saliency network, without calibration module	$85.86 \pm 4.79$	$83.24 \pm 5.93$	91.14	92.80	62.82	64.47
Recurrent saliency network, with calibration module	$87.11 \pm 4.02$	$85.13 \pm 5.17$	92.57	94.48	67.30	68.24

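2) Within the proposed framework, a calibration module based on convolutional self-attention is compared with calibration modules based on convolutional recurrent networks, as in Table 4. The convolutional self-attention calibration module is replaced in turn by a single-layer convolutional long short-term memory network (CLSTM)^[7], a single-layer convolutional gated recurrent unit (ConvGRU)^[11], and a single-layer trajectory gated recurrent unit (TrajGRU)^[35]. On both datasets, the self-attention module outperforms these recurrent modules^{[7, 11, 35]} in mean DSC, standard deviation, and maximum and minimum accuracy.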

Table 4 Comparison of calibration modules based on convolutional recurrent networks and on self-attention in pancreas segmentation

Method	Mean DSC ± Std (%), NIH	Mean DSC ± Std (%), MSD	Max DSC (%), NIH	Max DSC (%), MSD	Min DSC (%), NIH	Min DSC (%), MSD
CLSTM calibration module	$86.13 \pm 4.54$	$84.21 \pm 5.80$	91.20	93.47	63.18	64.76
ConvGRU calibration module	$86.34 \pm 4.21$	$84.41\pm 5.62$	92.31	94.05	65.73	66.02
TrajGRU calibration module	$86.96 \pm 4.14$	$84.87 \pm 5.22$	92.49	94.32	67.20	67.93
Convolutional self-attention calibration module	$ 87.11 \pm 4.02$	$85.13 \pm 5.17$	92.57	94.48	67.30	68.24

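3.3.3 Effectiveness of the recurrent saliency calibration network

To further show the advantages of the proposed method, it is compared with representative existing pancreas segmentation methods.

Results on the NIH pancreas dataset are given in Table 5. Compared with other 2D pancreas segmentation methods^{[3, 19-23, 26, 36]}, the proposed method improves on two fronts: 1) joint training that uses more stage context; 2) a convolutional self-attention calibration module that calibrates the pancreas mask at every stage. The mean DSC rises from the previous best of 85.40% to 87.11%, and the maximum DSC from the previous best of 91.46% to 92.57%. Compared with 3D pancreas segmentation methods^{[5, 10, 29-30, 37]}, the calibration module makes full use of slice context and greatly reduces the parameter count (and GPU memory consumption) while matching the accuracy of 3D segmentation, which improves computational efficiency; the mean DSC rises from the previous best of 86.19% to 87.11%, and the maximum DSC from 91.90% to 92.57%.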

Table 5 Comparison of different segmentation methods on the NIH dataset ("—" indicates that the value is not reported in the cited paper)

Method	Dimension	Mean DSC (%) ± Std (%)	Max DSC (%)	Min DSC (%)
Ref. [22]	2D	$71.80 \pm 10.70$	86.90	25.00
Ref. [23]	2D	$81.27 \pm 6.27$	88.96	50.69
Ref. [36]	2D	$82.40 \pm 6.70$	90.10	60.00
Ref. [3]	2D	$82.37 \pm 5.68$	90.85	62.43
Ref. [37]	3D	$84.59 \pm 4.86$	91.45	69.62
Ref. [10]	3D	$85.99 \pm 4.51$	91.20	57.20
Ref. [5]	3D	$85.93 \pm 3.42$	91.48	75.01
Ref. [29]	3D	$82.47 \pm 5.50$	91.17	62.36
Ref. [20]	2D	$82.87 \pm 1.00$	87.67	81.18
Ref. [19]	2D	$84.90 \pm -$	91.46	61.82
Ref. [26]	2D	$85.35 \pm 4.13$	91.05	71.36
Ref. [21]	2D	$85.40 \pm 1.60$	—	—
Ref. [30]	3D	$86.19 \pm -$	91.90	69.17
Proposed method	2D	87.11 ± 4.02	92.57	67.30

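Results on the MSD pancreas dataset are given in Table 6, again in comparison with representative methods. Compared with the 2D method^[28], the mean DSC rises from 84.71% to 85.13% and the standard deviation falls from 7.13% to 5.17%, a clear gain in stability; the minimum DSC rises from 58.62% to 68.24%, which improves accuracy on hard cases. Compared with 3D pancreas segmentation methods^[38-40], the mean DSC rises from the previous best of 84.22% to 85.13%, and both the maximum and minimum DSC improve.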

Table 6 Comparison of different segmentation methods on the MSD dataset

Method	Dimension	Mean DSC (%) ± Std (%)	Max DSC (%)	Min DSC (%)
Ref. [39]	3D	$79.98\pm7.71$	93.73	61.64
Ref. [38]	3D	$82.37\pm5.68$	90.85	62.43
Ref. [28]	2D	$84.71\pm7.13$	95.54	58.62
Ref. [40]	3D	$84.22\pm5.91$	92.75	66.58
Proposed method	2D	85.13 ± 5.17	94.48	68.24

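Box plots of the proposed method on the NIH and MSD pancreas datasets are shown in Fig. 7. Some results are shown in Figs. 8 and 9: five subjects were selected, and each row shows segmentation results on different slices of the same subject. Blue contours are predictions and red contours are the ground truth. The figures show that the predictions are very close to the ground truth.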

Fig. 7 Box plot of the method in this paper on NIH dataset and MSD dataset

Fig. 8 Comparison of segmentation results on NIH dataset

Fig. 9 Comparison of segmentation results on MSD dataset

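3.3.4 Effect of the number of input slices

To examine how the number of input slices affects the proposed method, it is set to 3, 5, and 7, as in Tables 7 and 8. As the number of input slices increases, the mean and maximum DSC rise slightly and the minimum DSC rises more clearly, so more slices help considerably with hard cases. For hard cases with blurred pancreas boundaries, with fat around the pancreas whose gray-level distribution is close to that of the duodenum, or with small targets in the slice, using more slices clearly improves segmentation accuracy.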

Table 7 Segmentation results for different numbers of network input slices on the NIH dataset

Number of input slices	Dimension	Mean DSC (%) ± Std (%)	Max DSC (%)	Min DSC (%)
3	2D	$87.11\pm4.02$	92.57	67.30
5	2D	$87.53\pm3.74$	92.69	69.32
7	2D	$87.96\pm3.25$	92.94	71.91

Table 8 Segmentation results for different numbers of network input slices on the MSD dataset

Number of input slices	Dimension	Mean DSC (%) ± Std (%)	Max DSC (%)	Min DSC (%)
3	2D	$85.13\pm5.17$	94.48	68.24
5	2D	$85.86\pm5.01$	94.75	70.31
7	2D	$86.29\pm4.80$	95.01	73.07

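3.3.5 Parameter count and time consumption

This paper uses UNet as the backbone, and all stages after the coarse stage share parameters, which reduces the parameter count. The proposed model has fewer parameters than the classical FCN^[2], as shown in Table 9. It has more parameters than single-stage networks such as UNet^[15], 3D UNet^[41], AttentionUNet^[42], and UNet++^[43], but those single-stage methods are less accurate. Compared with Fix-point^[3], which uses FCN as the backbone and trains a separate model on each of the three orthogonal views, the proposed model has far fewer parameters. Compared with GGPFN^[28], it has more parameters but higher accuracy.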

Table 9 Comparison of the number of parameters of different segmentation methods

Method	Dimension	Parameters
FCN^[2]	2D	134.26 M
UNet^[15]	2D	28.34 M
3D UNet^[41]	3D	16.31 M
AttentionUNet^[42]	2D	35.06 M
UNet++ ^[43]	2D	36.74 M
Fix-point^[3]	2D	807.93 M
GGPFN^[28]	2D + 3D	42.00 M
Proposed method	2D	59.47 M

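[44] trains on three orthogonal views of the pancreas and uses two additional models to fuse visual features in the segmentation stage, which increases time consumption, as shown in Table 10. [23] localizes and segments on each of the three views and trains for 9 to 12 hours on a Titan X (12 GB) GPU. [22] uses a bottom-up method, first partitioning into superpixels and then aggregating results over them, with each stage trained separately. The proposed method is trained end to end, which lowers the mean test time per case. [3] uses the fixed-point method with three FCNs, one per view, and recycles the location information of the fine mask to refine it, which greatly increases time consumption. Although the proposed method adds the recurrent saliency module and the calibration module, both are simple and based on matrix operations, so the added computation time is small; moreover, the method takes only axial slices as input and uses UNet^[15] rather than FCN^[2] as the backbone, which greatly reduces the parameter count and the forward-pass time. Compared with [36], which uses a recurrent convolutional network with multi-slice input, training and test time increase, but accuracy improves clearly.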

Table 10 Comparison of time consumption of different segmentation methods ("—" indicates that the value is not reported in the cited paper)

Method	Dimension	Mean test time per case (min)	Training time (h)	Device
Ref. [44]	2D	$2\sim 3$	—	—
Ref. [22]	2D	$1\sim 3$	$\sim 55$	GTX Titan Z (12 GB)
Ref. [23]	2D	$2\sim 3$	$9\sim12$	Titan X (12 GB)
Ref. [3]	2D	$\sim 3$	—	—
Ref. [36]	2D	—	$\sim 3$	GTX Titan X (12 GB)
Proposed method	2D	1.1	$\sim 8$	RTX 2080 Ti (11 GB)

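4. Conclusion and Outlook

To address the problems of pancreas segmentation, this paper proposes a segmentation method based on a recurrent saliency calibration network. Its main contributions are: 1) joint training with more stage context, which remedies the lack of stage context in traditional coarse-to-fine methods, where only the bounding-box coordinates of the coarse-stage mask serve as the prior for the fine network's input; 2) a convolutional self-attention calibration module that uses the context of adjacent slices in parallel and across slice order while automatically calibrating the output mask of every stage, which resolves false segmentation caused by the similar densities of the pancreas and neighboring organs and the indistinct boundaries of overlapping tissue. Compared with other pancreas segmentation methods, the proposed method clearly raises the mean DSC and improves results on hard cases. It can be used to assist medical diagnosis. Future work will consider how to exploit even more stage context and slice context while making the model lighter through model distillation.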

Fig. 1 Illustration of the lip reading task

Fig. 2 Challenging examples of lip reading. (a) The upper row shows the word "place" and the lower row the word "please"; the lip movements are hard to tell apart (images from the GRID dataset). (b) The upper and lower rows show the word "wind" pronounced /wind/ and /waind/ in different contexts; the lip movements differ considerably. (c) The upper and lower rows show two speakers saying the same word "after"; the lip movements differ (images from the LRS3-TED dataset). (d) The speaker's head pose changes continuously while speaking. All comparison examples use the same video duration and sampling interval.

Fig. 3 The general process of lip reading

Fig. 4 Representative methods in lip reading research. Traditional feature extraction methods: active shape model (ASM)^[51], active appearance model (AAM)^[39], HiLDA^[38], LBP-TOP^[52], local discriminant graph (LDG)^[40], graph embedding^[53], random forest manifold alignment (RFMA)^[41], hidden variable method^[54]. Deep learning based methods: DBN/CNN+HMM hybrid model^[42-48], SyncNet^[55], LipNet^[49], WLAS^[10], Transformer^[50], LCANet^[56], V2P^[15].

Fig. 5 The workflow of linear transformation feature extraction methods

Fig. 6 Continuous frame curve mapping

Fig. 7 ${\rm LBP}_{8,1}$ operator

Fig. 8 Block LBP-TOP feature extraction

Fig. 9 Articulatory features of speech production

Fig. 10 ASM of the lip contour

Fig. 11 A typical CNN structure

Fig. 12 The structures of RNN, LSTM, and GRU

Fig. 13 The basic CNN-RNN framework

Fig. 14 The network architecture of LipNet

Fig. 15 The network architecture of WAS

Fig. 16 Three lip reading network models

Fig. 17 The trends of different types of datasets

Fig. 18 Examples from different datasets

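Table 1 A summary of advantages and disadvantages of traditional spatiotemporal feature extraction methods

Feature type	Representative methods	Advantages	Disadvantages
Appearance-based	Global linear image transforms^{[38,57,60-63]}, graph embedding and manifolds^{[40-41, 53-54,65]}, LBP-TOP^[52, 66], HOG^[67], optical flow^{[29, 68]}, ...	1) Fast feature extraction; 2) No complex manual modeling needed.	1) Requires precise extraction of the lip region; 2) Sensitive to environment changes, pose changes, and noise; 3) Poor generalization across speakers.
Shape-based	Contour descriptors^[69-72], AFs^[73], shape models^[74-75], ...	1) Good interpretability; 2) Good generalization across speakers; 3) Effectively removes redundant information.	1) Loses some useful information; 2) Requires extensive manual annotation; 3) Very sensitive to pose changes.
Combined shape and appearance	Concatenated shape and appearance features^[76-77], shape-appearance models^[39], ...	1) Strong representational power; 2) Good generalization across speakers.	1) Complex model with high computational cost; 2) Requires extensive manual annotation.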

Table 3 Word, phrase, and sentence lip reading datasets; (s) denotes the number of distinct sentences. Download links: MIRACL-VC^[171], LRW^[172], LRW-1000^[173], GRID^[174], OuluVS^[175], VIDTIMIT^[176], LILiR^[177], MOBIO^[178], TCD-TIMIT^[179], LRS^[180], VLRF^[181]

Dataset	Language	Task	Vocabulary size	Utterances	Speakers	Pose (°)	Resolution	Google Scholar citations	Year
IBMViaVoice	English	Sentences	10 500	24 325	290	0	704 × 480, 30 fps	299	2000
VIDTIMIT	English	Sentences	346 (s)	430	43	0	512 × 384, 25 fps	51	2002
AVICAR	English	Sentences	1 317	10 000	100	−15$\sim$15	720 × 480, 30 fps	170	2004
AV-TIMIT	English	Sentences	450 (s)	4 660	233	0	720 × 480, 30 fps	127	2004
GRID	English	Phrases	51	34 000	34	0	720 × 576, 25 fps	700	2006
IV2	French	Sentences	15 (s)	4 500	300	0, 90	780 × 576, 25 fps	19	2008
UWB-07-ICAV	Czech	Sentences	7 550 (s)	10 000	50	0	720 × 576, 50 fps	16	2008
OuluVS	English	Phrases	10 (s)	1 000	20	0	720 × 576, 25 fps	211	2009
WAPUSK20	English	Phrases	52	2 000	20	0	640 × 480, 32 fps	16	2010
LILiR	English	Sentences	1 000	2 400	12	0, 30, 45, 60, 90	720 × 576, 25 fps	67	2010
BL	French	Sentences	238 (s)	4 046	17	0, 90	720 × 576, 25 fps	12	2011
UNMC-VIER	English	Sentences	11 (s)	4 551	123	0, 90	708 × 640, 25 fps	8	2011
MOBIO	English	Sentences		30 186	152	0	640 × 480, 16 fps	175	2012
MIRACL-VC	English	Words	10	1 500	15	0	640 × 480, 15 fps	22	2014
MIRACL-VC	English	Phrases	10 (s)	1 500	15	0	640 × 480, 15 fps	22	2014
Austalk	English	Words	966	966 000	1 000	0	640 × 480	11	2014
Austalk	English	Sentences	59 (s)	59 000	1 000	0	640 × 480	11	2014
MODALITY	English	Words	182 (s)	231	35	0	1920 × 1080, 100 fps	23	2015
RM-3000	English	Sentences	1 000	3 000	1	0	360 × 640, 60 fps	7	2015
IBM AV-ASR	English	Sentences	10 400		262	0	704 × 480, 30 fps	103	2015
TCD-TIMIT	English	Sentences	5 954 (s)	6 913	62	0, 30	1920 × 1080, 30 fps	59	2015
OuluVS2	English	Phrases	10	1 590	53	0, 30, 45, 60, 90	1920 × 1080, 30 fps	46	2015
OuluVS2	English	Sentences	530 (s)	530	53	0, 30, 45, 60, 90	1920 × 1080, 30 fps	46	2015
LRW	English	Words	500	550 000	1 000+	0$\sim$30	256 × 256, 25 fps	115	2016
HAVRUS	Russian	Sentences	1 530 (s)	4 000	20	0	640 × 480, 200 fps	13	2016
LRS2-BBC	English	Sentences	62 769	144 482	1 000+	0$\sim$30	160 × 160, 25 fps	172	2017
VLRF	Spanish	Sentences	1 374	10 200a	24	0	1280 × 720, 50 fps	6	2017
LRS3-TED	English	Sentences	70 000	151 819	1 000+	−90$\sim$90	224 × 224, 25 fps	2	2018
LRW-1000	Chinese	Words	1 000	745 187	2 000+	−90$\sim$90	1920 × 1080, 25 fps	0	2018
LSVSR	English	Sentences	127 055	2 934 899	1 000+	−30$\sim$30	128 × 128, 23 ~ 30 fps	16	2018

Table 2 Alphabet and digit lip reading datasets. Download links: AVLetters^[152], AVICAR^[153], XM2VTS^[154], BANCA^[155], CUAVE^[156], VALID^[157], CENSREC-1-AV^[158], Austalk^[159], OuluVS2^[160]

Dataset	Language	Task	Classes	Utterances	Speakers	Pose (°)	Resolution	Google Scholar citations	Year
AVLetters	English	Letters	26	780	10	0	376 × 288, 25 fps	507	1998
XM2VTS	English	Digits	10	885	295	0	720 × 576, 25 fps	1 617	1999
BANCA	Multilingual	Digits	10	29 952	208	0	720 × 576, 25 fps	530	2003
AVICAR	English	Letters	26	26 000	100	−15$\sim$15	720 × 480, 30 fps	170	2004
AVICAR	English	Digits	13	23 000	100	−15$\sim$15	720 × 480, 30 fps	170	2004
CUAVE	English	Digits	10	7 000+	36	−90, 0, 90	720 × 480, 30 fps	292	2002
VALID	English	Digits	10	530	106	0	720 × 576, 25 fps	38	2005
AVLetters2	English	Letters	26	910	5	0	1920 × 1080, 50 fps	62	2008
IBMSR	English	Digits	10	1 661	38	−90, 0, 90	368 × 240, 30 fps	17	2008
CENSREC-1-AV	Japanese	Digits	10	5 197	93	0	720 × 480, 30 fps	25	2010
QuLips	English	Digits	10	3 600	2	−90$\sim$90	720 × 576, 25 fps	21	2010
Austalk	English	Digits	10	24 000	1 000	0	640 × 480	11	2014
OuluVS2	English	Digits	10	159	53	0$\sim$90	1920 × 1080, 30 fps	46	2015

表 4 不同数据集下代表性方法比较

Table 4 Comparison of representative methods under different datasets

Dataset	Task	Reference	Model: front-end feature extraction	Model: back-end classifier	Condition: audio signal	Condition: speaker-dependent	Condition: external language model	Condition: minimum recognition unit	Recognition rate
AVLetters	Letters	[41]	RFMA		×	√	×	Letter	69.60%
		[48]	RTMRBM	SVM	√	√	×	Letter	66.00%
		[42]	ST-PCA	Autoencoder	×	×	×	Letter	64.40%
		[52]	LBP-TOP	SVM	×	√	×	Letter	62.80%
		[52]	LBP-TOP	SVM	×	×	×	Letter	43.50%
		[193]	DBNF+DCT	LSTM	×	√	×	Letter	58.10%
CUAVE	Digits	[102]	AAM	HMM	√	×	×	Digit	83.00%
		[91]	HOG+MBH	SVM	×	×	×	Digit	70.10%
		[91]	HOG+MBH	SVM	×	√	×	Digit	90.00%
		[194]	DBNF	DNN-HMM	×	×	×	Phoneme	64.90%
		[60]	DCT	HMM	√	×	×	Digit	60.40%
LRW	Words	[128]	3D-CNN+ResNet	BiLSTM	×	×	×	Word	83.00%
		[131]	3D-CNN+ResNet	BiGRU	×	×	×	Word	82.00%
		[131]	3D-CNN+ResNet	BiGRU	√	×	×	Word	98.00%
		[10]	CNN	LSTM+Attention	×	×	×	Word	76.20%
		[9]	CNN		×	×	×	Word	61.10%
GRID	Phrases	[56]	3D-CNN+highway	BiGRU+Attention	×	√	×	Character	97.10%
		[10]	CNN	LSTM+Attention	×	√	×	Word	97.00%
		[134]	Feed-forward	LSTM	×	√	×	Word	84.70%
		[134]	Feed-forward	LSTM	√	√	×	Word	95.90%
		[49]	3D-CNN	BiGRU	×	×	×	Character	93.40%
		[126]	HOG	SVM	×	√	×	Word	71.20%
LRS3-TED	Sentences	[151]	3D-CNN+ResNet	Transformer+seq2seq	×	×	√	Character	41.10%
		[151]	3D-CNN+ResNet	Transformer+CTC	×	×	√	Character	33.70%
		[15]	3D-CNN	BiLSTM+CTC	×	×	√	Phoneme	44.90%


References (205)

[1]	McGurk H, MacDonald J. Hearing lips and seeing voices. Nature, 1976, 264(5588): 746−748 doi: 10.1038/264746a0
[2]	Potamianos G, Neti C, Gravier G, Garg A, Senior A W. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 2003, 91(9): 1306−1326 doi: 10.1109/JPROC.2003.817150
[3]	Calvert G A, Bullmore E T, Brammer M J, Campbell R, Williams S C R, McGuire P K, et al. Activation of auditory cortex during silent lipreading. Science, 1997, 276(5312): 593−596 doi: 10.1126/science.276.5312.593
[4]	Deafness and hearing loss [Online], available: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, July 1, 2019
[5]	Tye-Murray N, Sommers M S, Spehar B. Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing. Ear and Hearing, 2007, 28(5): 656−668 doi: 10.1097/AUD.0b013e31812f7185
[6]	Akhtar Z, Micheloni C, Foresti G L. Biometric liveness detection: Challenges and research opportunities. IEEE Security and Privacy, 2015, 13(5): 63−72 doi: 10.1109/MSP.2015.116
[7]	Rekik A, Ben-Hamadou A, Mahdi W. Human machine interaction via visual speech spotting. In: Proceedings of the 2015 International Conference on Advanced Concepts for Intelligent Vision Systems. Catania, Italy: Springer, 2015. 566−574
[8]	Suwajanakorn S, Seitz S M, Kemelmacher-Shlizerman I. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics, 2017, 36(4): Article No.95
[9]	Chung J S, Zisserman A. Lip reading in the wild. In: Proceedings of the 2016 Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 87−103
[10]	Chung J S, Senior A, Vinyals O, Zisserman A. Lip reading sentences in the wild. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 3444−3453
[11]	Chen L, Li Z, Maddox R K, Duan Z, Xu C. Lip movements generation at a glance. In: Proceedings of the 2018 European Conference on Computer Vision. Munich, Germany: Springer, 2018. 538−553
[12]	Gabbay A, Shamir A, Peleg S. Visual speech enhancement. arXiv preprint arXiv: 1711.08789, 2017
[13]	Huang Ya-Ting, Shi Jing, Xu Jia-Ming, Xu Bo. Research advances and perspectives on the cocktail party problem and related auditory models. Acta Automatica Sinica, 2019, 45(2): 234−251 (in Chinese)
[14]	Akbari H, Arora H, Cao L L, Mesgarani N. Lip2AudSpec: Speech reconstruction from silent lip movements video. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 2516−2520
[15]	Shillingford B, Assael Y, Hoffman M W, Paine T, Hughes C, Prabhu U, et al. Large-scale visual speech recognition. arXiv preprint arXiv: 1807.05162, 2018
[16]	Mandarin Audio-Visual Speech Recognition Challenge [Online], available: http://vipl.ict.ac.cn/homepage/mavsr/index.html, July 1, 2019
[17]	Potamianos G, Neti C, Luettin J, Matthews I. Audio-visual automatic speech recognition: An overview. Issues in Visual and Audio-Visual Speech Processing. Cambridge: MIT Press, 2004. 1−30
[18]	Zhou Z H, Zhao G Y, Hong X P, Pietikainen M. A review of recent advances in visual speech decoding. Image and Vision Computing, 2014, 32(9): 590−605 doi: 10.1016/j.imavis.2014.06.004
[19]	Fernandez-Lopez A, Sukno F M. Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 2018, 78: 53−72 doi: 10.1016/j.imavis.2018.07.002
[20]	Yao Hong-Xun, Gao Wen, Wang Rui, Lang Xian-Bo. A survey of lipreading-one of visual languages. Acta Electronica Sinica, 2001, 29(2): 239−246 (in Chinese) doi: 10.3321/j.issn:0372-2112.2001.02.025
[21]	Cox S J, Harvey R W, Lan Y, et al. The challenge of multispeaker lip-reading. In: Proceedings of AVSP. 2008. 179−184
[22]	Messer K, Matas J, Kittler J, et al. XM2VTSDB: The extended M2VTS database. In: Proceedings of the Second International Conference on Audio and Video-based Biometric Person Authentication. 1999, 964: 965−966
[23]	Bailly-Bailliére E, Bengio S, Bimbot F, Hamouz M, Kittler J, Mariéthoz J, et al. The BANCA database and evaluation protocol. In: Proceedings of the 2003 International Conference on Audio- and Video-based Biometric Person Authentication. Guildford, United Kingdom: Springer, 2003. 625−638
[24]	Ortega A, Sukno F, Lleida E, Frangi A F, Miguel A, Buera L, et al. AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition. In: Proceedings of the 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal: European Language Resources Association, 2004. 763−766
[25]	Lee B, Hasegawa-Johnson M, Goudeseune C, Kamdar S, Borys S, Liu M, et al. AVICAR: Audio-visual speech corpus in a car environment. In: Proceedings of the 8th International Conference on Spoken Language Processing. Jeju Island, South Korea: International Speech Communication Association, 2004. 2489−2492
[26]	Twaddell W F. On defining the phoneme. Language, 1935, 11(1): 5−62
[27]	Woodward M F, Barber C G. Phoneme perception in lipreading. Journal of Speech and Hearing Research, 1960, 3(3): 212−222 doi: 10.1044/jshr.0303.212
[28]	Fisher C G. Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 1968, 11(4): 796−804 doi: 10.1044/jshr.1104.796
[29]	Cappelletta L, Harte N. Viseme definitions comparison for visual-only speech recognition. In: Proceedings of the 19th European Signal Processing Conference. Barcelona, Spain: IEEE, 2011. 2109−2113
[30]	Wu Y, Ji Q. Facial landmark detection: A literature survey. International Journal of Computer Vision, 2019, 127(2): 115−142 doi: 10.1007/s11263-018-1097-z
[31]	Chrysos G G, Antonakos E, Snape P, Asthana A, Zafeiriou S. A comprehensive performance evaluation of deformable face tracking "in-the-wild". International Journal of Computer Vision, 2018, 126(2-4): 198−232 doi: 10.1007/s11263-017-0999-5
[32]	Koumparoulis A, Potamianos G, Mroueh Y, et al. Exploring ROI size in deep learning based lipreading. In: Proceedings of AVSP. 2017. 64−69
[33]	Deller J R Jr, Hansen J H L, Proakis J G. Discrete-Time Processing of Speech Signals. New York: Macmillan Pub. Co, 1993.
[34]	Rabiner L R, Juang B H. Fundamentals of Speech Recognition. Englewood Cliffs: Prentice Hall, 1993.
[35]	Young S, Evermann G, Gales M J F, Hain T, Kershaw D, Liu X Y, et al. The HTK Book. Cambridge: Cambridge University Engineering Department, 2002.
[36]	Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, et al. The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE, 2011.
[37]	Matthews I, Cootes T F, Bangham J A, Cox S, Harvey R. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(2): 198−213 doi: 10.1109/34.982900
[38]	Potamianos G, Graf H P, Cosatto E. An image transform approach for HMM based automatic lipreading. In: Proceedings of 1998 International Conference on Image Processing. Chicago, USA: IEEE, 1998. 173−177
[39]	Cootes T F, Edwards G J, Taylor C J. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(6): 681−685 doi: 10.1109/34.927467
[40]	Fu Y, Zhou X, Liu M, Hasegawa-Johnson M, Huang T S. Lipreading by locality discriminant graph. In: Proceedings of 2007 IEEE International Conference on Image Processing. San Antonio, USA: IEEE, 2007. III−325−III−328
[41]	Pei Y R, Kim T K, Zha H B. Unsupervised random forest manifold alignment for lipreading. In: Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 129−136
[42]	Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng A Y. Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning. Washington, USA: ACM, 2011. 689−696
[43]	Salakhutdinov R, Mnih A, Hinton G. Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM, 2007. 791−798
[44]	Huang J, Kingsbury B. Audio-visual deep learning for noise robust speech recognition. In: Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 7596−7599
[45]	Ninomiya H, Kitaoka N, Tamura S, et al. Integration of deep bottleneck features for audio-visual speech recognition. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association. 2015.
[46]	Sui C, Bennamoun M, Togneri R. Listening with your eyes: Towards a practical visual speech recognition system using deep Boltzmann machines. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 154−162
[47]	Noda K, Yamaguchi Y, Nakadai K, Okuno H G, Ogata T. Audio-visual speech recognition using deep learning. Applied Intelligence, 2015, 42(4): 722−737 doi: 10.1007/s10489-014-0629-7
[48]	Hu D, Li X L, Lu X Q. Temporal multimodal learning in audiovisual speech recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 3574−3582
[49]	Assael Y M, Shillingford B, Whiteson S, De Freitas N. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016
[50]	Afouras T, Chung J S, Zisserman A. Deep lip reading: A comparison of models and an online application. arXiv preprint arXiv:1806.06053, 2018
[51]	Luettin J, Thacker N A. Speechreading using probabilistic models. Computer Vision and Image Understanding, 1997, 65(2): 163−178 doi: 10.1006/cviu.1996.0570
[52]	Zhao G Y, Barnard M, Pietikäinen M. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 2009, 11(7): 1254−1265 doi: 10.1109/TMM.2009.2030637
[53]	Zhou Z H, Zhao G Y, Pietikäinen M. Towards a practical lipreading system. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2011. 137−144
[54]	Zhou Z H, Hong X P, Zhao G Y, Pietikäinen M. A compact representation of visual speech data using latent variables. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(1): 1
[55]	Chung J S, Zisserman A. Out of time: Automated lip sync in the wild. In: Proceedings of Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 251−263
[56]	Xu K, Li D W, Cassimatis N, Wang X L. LCANet: End-to-end lipreading with cascaded attention-CTC. In: Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition. Xi'an, China: IEEE, 2018. 548−555
[57]	Lucey P J, Potamianos G, Sridharan S. A unified approach to multi-pose audio-visual ASR. In: Proceedings of the 8th Annual Conference of the International Speech Communication Association. Antwerp, Belgium: Causal Productions Pty Ltd., 2007. 650−653
[58]	Almajai I, Cox S, Harvey R, Lan Y X. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In: Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2722−2726
[59]	Seymour R, Stewart D, Ming J. Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. EURASIP Journal on Image and Video Processing, 2007, 2008(1): Article No.810362
[60]	Estellers V, Gurban M, Thiran J P. On dynamic stream weighting for audio-visual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(4): 1145−1157 doi: 10.1109/TASL.2011.2172427
[61]	Potamianos G, Neti C, Iyengar G, Senior A W, Verma A. A cascade visual front end for speaker independent automatic speechreading. International Journal of Speech Technology, 2001, 4(3−4): 193−208
[62]	Lucey P J, Sridharan S, Dean D B. Continuous pose-invariant lipreading. In: Proceedings of the 9th Annual Conference of the International Speech Communication Association (Interspeech 2008) incorporating the 12th Australasian International Conference on Speech Science and Technology (SST 2008). Brisbane, Australia: International Speech Communication Association, 2008. 2679−2682
[63]	Lucey P J, Potamianos G, Sridharan S. Patch-based analysis of visual speech from multiple views. In: Proceedings of the International Conference on Auditory-Visual Speech Processing 2008. Moreton Island, Australia: AVISA, 2008. 69−74
[64]	Sheerman-Chase T, Ong E J, Bowden R. Cultural factors in the regression of non-verbal communication perception. In: Proceedings of the Workshop on Human Interaction in Computer Vision. Barcelona, Spain, 2011
[65]	Zhou Z H, Zhao G Y, Pietikäinen M. Lipreading: A graph embedding approach. In: Proceedings of the 20th International Conference on Pattern Recognition. Istanbul, Turkey: IEEE, 2010. 523−526
[66]	Zhao G Y, Pietikäinen M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 915−928 doi: 10.1109/TPAMI.2007.1110
[67]	Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005. 886−893
[68]	Mase K, Pentland A. Automatic lipreading by optical-flow analysis. Systems and Computers in Japan, 1991, 22(6): 67−76 doi: 10.1002/scj.4690220607
[69]	Aleksic P S, Williams J J, Wu Z L, Katsaggelos A K. Audio-visual speech recognition using MPEG-4 compliant visual features. EURASIP Journal on Advances in Signal Processing, 2002, 2002(1): Article No. 150948
[70]	Brooke N M. Using the visual component in automatic speech recognition. In: Proceedings of the 4th International Conference on Spoken Language Processing. Philadelphia, USA: IEEE, 1996. 1656−1659
[71]	Cetingul H E, Yemez Y, Erzin E, Tekalp A M. Discriminative analysis of lip motion features for speaker identification and speech-reading. IEEE Transactions on Image Processing, 2006, 15(10): 2879−2891 doi: 10.1109/TIP.2006.877528
[72]	Nefian A V, Liang L H, Pi X B, Liu X X, Murphy K. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002, 2002(11): Article No.783042 doi: 10.1155/S1110865702206083
[73]	Kirchhoff K. Robust speech recognition using articulatory information [Electronic resource]. 1999.
[74]	Cootes T F, Taylor C J, Cooper D H, Graham J. Active shape models-their training and application. Computer Vision and Image Understanding, 1995, 61(1): 38−59 doi: 10.1006/cviu.1995.1004
[75]	Luettin J, Thacker N A, Beet S W. Speechreading using shape and intensity information. In: Proceedings of the 4th International Conference on Spoken Language Processing. Philadelphia, USA: IEEE, 1996. 58−61
[76]	Dupont S, Luettin J. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2000, 2(3): 141−151 doi: 10.1109/6046.865479
[77]	Chan M T. HMM-based audio-visual speech recognition integrating geometric- and appearance-based visual features. In: Proceedings of the 4th Workshop on Multimedia Signal Processing. Cannes, France: IEEE, 2001. 9−14
[78]	Roweis S T, Saul L K. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500): 2323−2326 doi: 10.1126/science.290.5500.2323
[79]	Tenenbaum J B, de Silva V, Langford J C. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290(5500): 2319−2323 doi: 10.1126/science.290.5500.2319
[80]	Yan S C, Xu D, Zhang B Y, Zhang H J, Yang Q, Lin S. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(1): 40−51 doi: 10.1109/TPAMI.2007.250598
[81]	Fu Y, Yan S C, Huang T S. Classification and feature extraction by simplexization. IEEE Transactions on Information Forensics and Security, 2008, 3(1): 91−100 doi: 10.1109/TIFS.2007.916280
[82]	Ojala T, Pietikäinen M, Harwood D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 1996, 29(1): 51−59 doi: 10.1016/0031-3203(95)00067-4
[83]	Ojala T, Pietikäinen M, Mäenpää T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7): 971−987 doi: 10.1109/TPAMI.2002.1017623
[84]	Liu Li, Zhao Ling-Jun, Guo Cheng-Yu, Wang Liang, Tang Jun. Texture classification: State-of-the-art methods and prospects. Acta Automatica Sinica, 2018, 44(4): 584−607 (in Chinese)
[85]	Pietikäinen M, Hadid A, Zhao G, Ahonen T. Computer Vision Using Local Binary Patterns. London: Springer, 2011.
[86]	Liu L, Chen J, Fieguth P, Zhao G Y, Chellappa R, Pietikäinen M. From BoW to CNN: Two decades of texture representation for texture classification. International Journal of Computer Vision, 2019, 127(1): 74−109 doi: 10.1007/s11263-018-1125-z
[87]	Liu Li, Xie Yu-Xiang, Wei Ying-Mei, Lao Song-Yang. Survey of Local Binary Pattern method. Journal of Image and Graphics, 2014, 19(12): 1696−1720 (in Chinese) doi: 10.11834/jig.20141202
[88]	Horn B K P, Schunck B G. Determining optical flow. Artificial Intelligence, 1981, 17(1-3): 185−203 doi: 10.1016/0004-3702(81)90024-2
[89]	Bouguet J Y. Pyramidal implementation of the affine Lucas Kanade feature tracker description of the algorithm. Intel Corporation, 2001, 5: 1−9
[90]	Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence. San Francisco, CA, United States: Morgan Kaufmann Publishers Inc., 1981. 674−679
[91]	Rekik A, Ben-Hamadou A, Mahdi W. An adaptive approach for lip-reading using image and depth data. Multimedia Tools and Applications, 2016, 75(14): 8609−8636 doi: 10.1007/s11042-015-2774-3
[92]	Shaikh A A, Kumar D K, Yau W C, Azemin M Z C, Gubbi J. Lip reading using optical flow and support vector machines. In: Proceedings of the 3rd International Congress on Image and Signal Processing. Yantai, China: IEEE, 2010. 327−330
[93]	Goldschen A J, Garcia O N, Petajan E. Continuous optical automatic speech recognition by lipreading. In: Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA, USA: IEEE, 1994. 572−577
[94]	King S, Frankel J, Livescu K, McDermott E, Richmond K, Wester M. Speech production knowledge in automatic speech recognition. The Journal of the Acoustical Society of America, 2007, 121(2): 723−742 doi: 10.1121/1.2404622
[95]	Kirchhoff K, Fink G A, Sagerer G. Combining acoustic and articulatory feature information for robust speech recognition. Speech Communication, 2002, 37(3−4): 303−319 doi: 10.1016/S0167-6393(01)00020-6
[96]	Livescu K, Cetin O, Hasegawa-Johnson M, King S, Bartels C, Borges N, et al. Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU Summer Workshop. In: Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing. Honolulu, USA: IEEE. 2007. IV−621−IV−624
[97]	Saenko K, Livescu K, Glass J, Darrell T. Production domain modeling of pronunciation for visual speech recognition. In: Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia, USA: IEEE. 2005. v/473−v/476
[98]	Saenko K, Livescu K, Glass J, Darrell T. Multistream articulatory feature-based models for visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(9): 1700−1707 doi: 10.1109/TPAMI.2008.303
[99]	Saenko K, Livescu K, Siracusa M, Wilson K, Glass J, Darrell T. Visual speech recognition with loosely synchronized feature streams. In: Proceedings of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE. 2005. 1424−1431
[100]	Papcun G, Hochberg J, Thomas T R, Laroche F, Zacks J, Levy S. Inferring articulation and recognizing gestures from acoustics with a neural network trained on x-ray microbeam data. The Journal of the Acoustical Society of America, 1992, 92(2): 688−700 doi: 10.1121/1.403994
[101]	Matthews I, Potamianos G, Neti C, Luettin J. A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of the 2001 IEEE International Conference on Multimedia and Expo. Tokyo, Japan: IEEE, 2001. 825−828
[102]	Papandreou G, Katsamanis A, Pitsikalis V, Maragos P. Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(3): 423−435 doi: 10.1109/TASL.2008.2011515
[103]	Hilder S, Harvey R W, Theobald B J. Comparison of human and machine-based lip-reading. In: Proceedings of the 2009 AVSP. 2009. 86−89
[104]	Lan Y X, Theobald B J, Harvey R. View independent computer lip-reading. In: Proceedings of the 2012 IEEE International Conference on Multimedia and Expo. Melbourne, Australia: IEEE, 2012. 432−437
[105]	Lan Y X, Harvey R, Theobald B J. Insights into machine lip reading. In: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing. Kyoto, Japan: IEEE, 2012. 4825−4828
[106]	Bear H L, Harvey R. Decoding visemes: Improving machine lip-reading. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2009−2013
[107]	LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436−444 doi: 10.1038/nature14539
[108]	Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504−507 doi: 10.1126/science.1127647
[109]	Hong X P, Yao H X, Wan Y Q, Chen R. A PCA based visual DCT feature extraction method for lip-reading. In: Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia. Pasadena, USA: IEEE, 2006. 321−326
[110]	Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Red Hook, NY, United States: Curran Associates Inc., 2012. 1097−1105
[111]	Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv: 1409.1556, 2014
[112]	Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 1−9
[113]	He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 770−778
[114]	Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 2261−2269
[115]	Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: IEEE, 2018. 7132−7141
[116]	Liu L, Ouyang W L, Wang X G, Fieguth P, Chen J, Liu X W, et al. Deep learning for generic object detection: A survey. arXiv preprint arXiv: 1809.02165, 2018
[117]	Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3431−3440
[118]	Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. In: Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 6645−6649
[119]	Noda K, Yamaguchi Y, Nakadai K, Okuno H G, Ogata T. Lipreading using convolutional neural network. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association. Singapore: ISCA, 2014. 1149−1153
[120]	Ji S W, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221−231 doi: 10.1109/TPAMI.2012.59
[121]	Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey. Image and Vision Computing, 2017, 60: 4−21 doi: 10.1016/j.imavis.2017.01.010
[122]	Mroueh Y, Marcheret E, Goel V. Deep multimodal learning for audio-visual speech recognition. In: Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Queensland, Australia: IEEE, 2015. 2130−2134
[123]	Thangthai K, Harvey R W, Cox S J, et al. Improving lip-reading performance for robust audiovisual speech recognition using DNNs. In: Proceedings of the 2015 AVSP. 2015. 127−131
[124]	Gers F A, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000, 12(10): 2451−2471 doi: 10.1162/089976600300015015
[125]	Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv: 1412.3555, 2014
[126]	Wand M, Koutník J, Schmidhuber J. Lipreading with long short-term memory. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 6115−6119
[127]	Garg A, Noyola J, Bagadia S. Lip reading using CNN and LSTM, Technical Report, CS231n Project Report, Stanford University, USA, 2016.
[128]	Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv: 1703.04105, 2017
[129]	Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006. 369−376
[130]	Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In: Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding. Arizona, USA: IEEE, 2015. 167−174
[131]	Petridis S, Stafylakis T, Ma P, Cai F P, Tzimiropoulos G, Pantic M. End-to-end audiovisual speech recognition. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 6548−6552
[132]	Fung I, Mak B. End-to-end low-resource lip-reading with Maxout CNN and LSTM. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 2511−2515
[133]	Wand M, Schmidhuber J. Improving speaker-independent lipreading with domain-adversarial training. arXiv preprint arXiv: 1708.01565, 2017
[134]	Wand M, Schmidhuber J, Vu N T. Investigations on end-to-end audiovisual fusion. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 3041−3045
[135]	Srivastava R K, Greff K, Schmidhuber J. Training very deep networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2015. 2377−2385
[136]	Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014. 3104−3112
[137]	Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv: 1409.0473, 2014
[138]	Chaudhari S, Polatkan G, Ramanath R, Mithal V. An attentive survey of attention models. arXiv preprint arXiv: 1904.02874, 2019
[139]	Wang F, Tax D M J. Survey on the attention based RNN model and its applications in computer vision. arXiv preprint arXiv: 1601.06823, 2016
[140]	Chung J S, Zisserman A. Lip reading in profile. In: Proceedings of the British Machine Vision Conference. Guildford: BMVA Press, 2017. 155.1−155.11
[141]	Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211−252 doi: 10.1007/s11263-015-0816-y
[142]	Saitoh T, Zhou Z H, Zhao G Y, Pietikäinen M. Concatenated frame image based CNN for visual speech recognition. In: Proceedings of the 2016 Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 277−289
[143]	Lin M, Chen Q, Yan S C. Network in network. arXiv preprint arXiv: 1312.4400, 2013
[144]	Petridis S, Li Z W, Pantic M. End-to-end visual speech recognition with LSTMs. In: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans, USA: IEEE, 2017. 2592−2596
[145]	Petridis S, Wang Y J, Li Z W, Pantic M. End-to-end audiovisual fusion with LSTMs. arXiv preprint arXiv: 1709.04343, 2017
[146]	Petridis S, Wang Y J, Li Z W, Pantic M. End-to-end multi-view lipreading. arXiv preprint arXiv: 1709.00443, 2017
[147]	Petridis S, Shen J, Cetin D, Pantic M. Visual-only recognition of normal, whispered and silent speech. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 6219−6223
[148]	Moon S, Kim S, Wang H H. Multimodal transfer deep learning with applications in audio-visual recognition. arXiv preprint arXiv: 1412.3121, 2014
[149]	Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA: IEEE, 2017. 1800−1807
[150]	Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, United States: Curran Associates Inc., 2017. 6000−6010
[151]	Afouras T, Chung J S, Senior A, et al. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. doi: 10.1109/TPAMI.2018.2889052
[152]	AV Letters Database [Online], available: http://www2.cmp.uea.ac.uk/~bjt/avletters/, October 27, 2020
[153]	AVICAR Project: Audio-Visual Speech Recognition in a Car [Online], available: http://www.isle.illinois.edu/sst/AVICAR/#information, October 27, 2020
[154]	The Extended M2VTS Database [Online], available: http://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/, October 27, 2020
[155]	The BANCA Database [Online], available: http://www.ee.surrey.ac.uk/CVSSP/banca/, October 27, 2020
[156]	CUAVE Group Set [Online], available: http://people.csail.mit.edu/siracusa/avdata/, October 27, 2020
[157]	VALID: Visual quality Assessment for Light field Images Dataset [Online], available: https://www.epfl.ch/labs/mmspg/downloads/valid/, October 27, 2020
[158]	Speech Resources Consortium [Online], available: http://research.nii.ac.jp/src/en/data.html, October 27, 2020
[159]	AusTalk [Online], available: https://austalk.edu.au/about/corpus/, October 27, 2020
[160]	OULUVS2: A MULTI-VIEW AUDIOVISUAL DATABASE [Online], available: http://www.ee.oulu.fi/research/imag/OuluVS2/, October 27, 2020
[161]	Patterson E K, Gurbuz S, Tufekci Z, Gowdy J N. CUAVE: A new audio-visual database for multimodal human-computer interface research. In: Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando, Florida, USA: IEEE, 2002. II−2017−II−2020
[162]	Fox N A, O'Mullane B A, Reilly R B. VALID: A new practical audio-visual database, and comparative results. In: Proceedings of the 2005 International Conference on Audio-and Video-Based Biometric Person Authentication. Berlin, Germany: Springer, 2005. 777−786
[163]	Anina I, Zhou Z H, Zhao G Y, Pietikäinen M. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In: Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Ljubljana, Slovenia: IEEE, 2015. 1−5
[164]	Estival D, Cassidy S, Cox F, et al. AusTalk: An audio-visual corpus of Australian English. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland: ELRA, 2014
[165]	Tamura S, Miyajima C, Kitaoka N, et al. CENSREC-1-AV: An audio-visual corpus for noisy bimodal speech recognition. In: Proceedings of the 2010 International Conference on Auditory-Visual Speech Processing. Hakone, Japan, 2010
[166]	Pass A, Zhang J G, Stewart D. An investigation into features for multi-view lipreading. In: Proceedings of the 2010 IEEE International Conference on Image Processing. Hong Kong, China: IEEE, 2010. 2417−2420
[167]	Neti C, Potamianos G, Luettin J, et al. Audio visual speech recognition. IDIAP, 2000.
[168]	Sanderson C. The VidTIMIT database. IDIAP, 2002.
[169]	Jankowski C, Kalyanswamy A, Basson S, Spitz J. NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database. In: Proceedings of the 1990 International Conference on Acoustics, Speech, and Signal Processing. Albuquerque, New Mexico, USA: IEEE, 1990. 109−112
[170]	Hazen T J, Saenko K, La C H, Glass J R. A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments. In: Proceedings of the 6th International Conference on Multimodal Interfaces. State College, PA, USA: ACM, 2004. 235−242
[171]	MIRACL-VC1 [Online], available: https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1, October 27, 2020
[172]	The Oxford-BBC Lip Reading in the Wild (LRW) Dataset [Online], available: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html, October 27, 2020
[173]	LRW-1000: Lip Reading database [Online], available: http://vipl.ict.ac.cn/view_database.php?id=14, October 27, 2020
[174]	The GRID audiovisual sentence corpus [Online], available: http://spandh.dcs.shef.ac.uk/gridcorpus/, October 27, 2020
[175]	OuluVS database [Online], available: https://www.oulu.fi/cmvs/node/41315, October 27, 2020
[176]	VidTIMIT Audio-Video Dataset [Online], available: http://conradsanderson.id.au/vidtimit/#downloads, October 27, 2020
[177]	LiLiR [Online], available: http://www.ee.surrey.ac.uk/Projects/LILiR/datasets.html, October 27, 2020
[178]	MOBIO [Online], available: https://www.idiap.ch/dataset/mobio, October 27, 2020
[179]	TCD-TIMIT [Online], available: https://sigmedia.tcd.ie/TCDTIMIT/, October 27, 2020
[180]	Lip Reading Datasets [Online], available: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/, October 27, 2020
[181]	Visual Lip Reading Feasibility (VLRF) [Online], available: https://datasets.bifrost.ai/info/845, October 27, 2020
[182]	Rekik A, Ben-Hamadou A, Mahdi W. A new visual speech recognition approach for RGB-D cameras. In: Proceedings of the 2014 International Conference Image Analysis and Recognition. Vilamoura, Portugal: Springer, 2014. 21−28
[183]	McCool C, Marcel S, Hadid A, Pietikäinen M, Matejka P, Černocký J, et al. Bi-modal person recognition on a mobile phone: Using mobile phone data. In: Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops. Melbourne, Australia: IEEE, 2012. 635−640
[184]	Howell D. Confusion Modelling for Lip-Reading [Ph. D. dissertation], University of East Anglia, Norwich, 2015
[185]	Harte N, Gillen E. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 2015, 17(5): 603−615 doi: 10.1109/TMM.2015.2407694
[186]	Verkhodanova V, Ronzhin A, Kipyatkova I, Ivanko D, Karpov A, Zelezny M. HAVRUS corpus: High-speed recordings of audio-visual Russian speech. In: Proceedings of the 2016 International Conference on Speech and Computer. Budapest, Hungary: Springer, 2016. 338−345
[187]	Fernandez-Lopez A, Martinez O, Sukno F M. Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database. In: Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition. Washington, USA: IEEE, 2017. 208−215
[188]	Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 2006, 120(5): 2421−2424 doi: 10.1121/1.2229005
[189]	Vorwerk A, Wang X, Kolossa D, et al. WAPUSK20: A database for robust audiovisual speech recognition. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010). Valletta, Malta: ELRA, 2010
[190]	Czyzewski A, Kostek B, Bratoszewski P, Kotus J, Szykulski M. An audio-visual corpus for multimodal automatic speech recognition. Journal of Intelligent Information Systems, 2017, 49(2): 167−192 doi: 10.1007/s10844-016-0438-z
[191]	Afouras T, Chung J S, Zisserman A. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv preprint arXiv: 1809.00496, 2018
[192]	Yang S, Zhang Y H, Feng D L, Yang M M, Wang C H, Xiao J Y, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In: Proceedings of the 14th IEEE International Conference on Automatic Face and Gesture Recognition. Lille, France: IEEE, 2019. 1−8
[193]	Petridis S, Pantic M. Deep complementary bottleneck features for visual speech recognition. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2304−2308
[194]	Rahmani M H, Almasganj F. Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features. In: Proceedings of the 3rd International Conference on Pattern Recognition and Image Analysis. Shahrekord, Iran: IEEE, 2017. 195−199
[195]	Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, et al. FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 2758−2766
[196]	Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 1647−1655
[197]	Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2014. 568−576
[198]	Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 1933−1941
[199]	Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial transformer networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2015. 2017−2025
[200]	Bhagavatula C, Zhu C C, Luu K, Savvides M. Faster than real-time facial alignment: A 3D spatial transformer network approach in unconstrained poses. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 4000−4009
[201]	Baltrušaitis T, Ahuja C, Morency L P. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423−443 doi: 10.1109/TPAMI.2018.2798607
[202]	Loizou P C. Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC Press, 2013.
[203]	Hou J C, Wang S S, Lai Y H, Tsao Y, Chang H W, Wang H M. Audio-visual speech enhancement based on multimodal deep convolutional neural network. arXiv preprint arXiv: 1703.10893, 2017
[204]	Ephrat A, Halperin T, Peleg S. Improved speech reconstruction from silent video. In: Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy: IEEE, 2017. 455−462
[205]	Gabbay A, Shamir A, Peleg S. Visual speech enhancement. arXiv preprint arXiv: 1711.08789, 2017
