唇读研究进展与展望

陈小鼎 盛常冲 匡纲要 刘丽

引用本文: 陈小鼎, 盛常冲, 匡纲要, 刘丽. 唇读研究进展与展望. 自动化学报, 2020, 46(11): 2275−2301 doi: 10.16383/j.aas.c190531
Citation: Chen Xiao-Ding, Sheng Chang-Chong, Kuang Gang-Yao, Liu Li. The state of the art and prospects of lip reading. Acta Automatica Sinica, 2020, 46(11): 2275−2301 doi: 10.16383/j.aas.c190531

doi: 10.16383/j.aas.c190531
基金项目: 国家自然科学基金(61872379)资助
    作者简介:

    陈小鼎:国防科技大学系统工程学院硕士研究生. 主要研究方向为计算机视觉与模式识别. E-mail: chenxiaoding14@nudt.edu.cn

    盛常冲:国防科技大学电子科学学院博士研究生. 主要研究方向为计算机视觉, 模式识别. E-mail: sheng_cc@nudt.edu.cn

    匡纲要:国防科技大学电子科学学院教授. 主要研究方向为遥感图像处理, 目标识别. E-mail: kuanggangyao@nudt.edu.cn

    刘丽:国防科技大学系统工程学院副教授. 主要研究方向为图像理解, 计算机视觉, 模式识别. 本文通信作者. E-mail: liuli_nudt@nudt.edu.cn

The State of the Art and Prospects of Lip Reading

Funds: Supported by National Natural Science Foundation of China (61872379)
  • 摘要: 唇读, 也称视觉语言识别, 旨在通过说话者嘴唇运动的视觉信息, 解码出其所说文本内容. 唇读是计算机视觉和模式识别领域的一个重要问题, 在公共安防、医疗、国防军事和影视娱乐等领域有着广泛的应用价值. 近年来, 深度学习技术极大地推动了唇读研究进展. 本文首先阐述了唇读研究的内容和意义, 并深入剖析了唇读研究面临的难点与挑战; 然后介绍了目前唇读研究的现状与发展水平, 对近期主流唇读方法进行了梳理、归类和评述, 包括传统方法和近期的基于深度学习的方法; 最后, 探讨唇读研究潜在的问题和可能的研究方向. 以期引起大家对唇读问题的关注与兴趣, 并推动与此相关问题的研究进展.
  • 图  1  唇读示意图

    Fig.  1  Illustration of the lip reading task

    图  2  唇读难点示例. (a)第一行为单词place的实例, 第二行为单词please的实例, 唇形变化难以区分, 图片来自GRID数据集; (b)上下两行分别为单词wind在不同上下文环境下的不同读法/wind/与/waind/实例, 唇形变化差异较大; (c)上下两行分别为两位说话人说同一个单词after的实例, 唇形变化存在差异, 图片来自LRS3-TED数据集; (d)说话人在说话过程中头部姿态实时变化实例. 上述对比实例均采用相同的视频时长和采样间隔.

    Fig.  2  Challenging examples of lip reading. (a) The upper row shows an instance of the word place and the lower row an instance of the word please, whose lip motions are difficult to distinguish; (b) The upper and lower rows show the different pronunciations /wind/ and /waind/ of the word wind in different contexts, with clearly different lip motions; (c) The upper and lower rows show two different speakers saying the same word after, with noticeable differences in lip motion; (d) An example of the speaker's head pose changing in real time during speech. All the above examples use the same video duration and sampling interval.

    图  3  唇读方法一般流程

    Fig.  3  The general process of lip reading

    图  4  唇读研究过程中代表性方法. 传统特征提取方法: 主动形状模型ASM[51], 主动表观模型AAM[39], HiLDA[38], LBP-TOP[52], 局部判别图模型[40], 图嵌入方法[53], 随机森林流形对齐RFMA[41], 隐变量方法[54]; 深度学习方法: DBN/CNN+HMM混合模型[42-48], SyncNet[55], LipNet[49], WLAS[10], Transformer[50], LCANet[56], V2P[15].

    Fig.  4  Representative methods in the process of lip reading research. Traditional feature extraction methods: ASM[51], AAM[39], HiLDA[38], LBP-TOP[52], LDG[40], Graph Embedding[53], RFMA[41], Hidden variable method[54]; Deep learning based methods: DBN/CNN+HMM hybrid model[42-48], SyncNet[55], LipNet[49], WLAS[10], Transformer[50], LCANet[56], V2P[15].

    图  5  线性变换特征提取方法一般流程

    Fig.  5  The workflow of linear transformation feature extraction method
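
    图 5 所示的线性变换特征提取流程, 通常先把唇部 ROI 归一化为固定大小的灰度图, 再做全局线性变换(如 DCT、PCA), 保留低频或主成分系数作为帧级特征, 后续可再接 LDA(如 HiLDA[38])等进一步降维. 下面给出一个基于 NumPy/SciPy 的最小示意代码, 其中 ROI 尺寸、保留系数个数等均为示例性假设, 仅用于说明流程, 并非文中某一具体方法的实现:

```python
import numpy as np
from scipy.fftpack import dct

def dct_frame_feature(roi, keep=6):
    """对单帧灰度唇部ROI做二维DCT, 保留左上角 keep x keep 的低频系数作为特征."""
    coeff = dct(dct(roi.astype(np.float64), axis=0, norm='ortho'),
                axis=1, norm='ortho')        # 先沿行、再沿列做一维DCT, 等价于二维DCT
    return coeff[:keep, :keep].reshape(-1)   # 低频系数展开为 keep*keep 维向量

def sequence_features(frames, keep=6):
    """frames: (T, H, W) 的唇部ROI灰度序列, 返回 (T, keep*keep) 的帧级特征矩阵,
    可在其后接 PCA/LDA 等线性降维以及帧间差分等后处理."""
    return np.stack([dct_frame_feature(f, keep) for f in frames])

# 演示: 75帧、64x64的随机"ROI"序列(仅作形状演示, 非真实数据)
feats = sequence_features(np.random.rand(75, 64, 64))
print(feats.shape)   # (75, 36)
```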

    图  6  连续帧曲线映射

    Fig.  6  Continuous frame curve mapping

    图  7  ${\rm LBP}_{8,1}$算子

    Fig.  7  ${\rm LBP}_{8,1}$ operator

    图  8  分块LBP-TOP特征提取

    Fig.  8  Block LBP-TOP feature extraction
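
    图 7 的 ${\rm LBP}_{8,1}$ 算子将每个像素半径为 1 的 8 邻域灰度与中心像素比较, 按 $s(g_p-g_c)2^p$ 累加得到 0 ~ 255 的编码; 图 8 的分块 LBP-TOP 则在 XY、XT、YT 三个正交平面上分别统计 LBP 直方图并串联. 下面给出一个简化的 NumPy 示意实现(不做邻域双线性插值、不分块, 每个平面只取中间一个切片), 仅用于说明计算原理:

```python
import numpy as np

def lbp_8_1(img):
    """基本 LBP_{8,1}: 比较 8 邻域(半径1, 不插值)与中心像素, 返回 0~255 的编码图."""
    c = img[1:-1, 1:-1]
    # 8 个邻居, 从左上角开始顺时针排列
    neighbors = [img[0:-2, 0:-2], img[0:-2, 1:-1], img[0:-2, 2:],
                 img[1:-1, 2:],   img[2:, 2:],     img[2:, 1:-1],
                 img[2:, 0:-2],   img[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=np.int32)
    for p, n in enumerate(neighbors):
        code += (n >= c).astype(np.int32) << p   # s(g_p - g_c) * 2^p
    return code

def lbp_top_hist(volume):
    """volume: (T, H, W) 的唇部ROI灰度序列.
    取 XY(空间)、XT、YT(时空)三个正交平面的中间切片计算 LBP 直方图并串联;
    完整的 LBP-TOP 会对所有切片累积直方图并按空间分块统计, 此处为最简示意."""
    T, H, W = volume.shape
    planes = [volume[T // 2],          # XY 平面
              volume[:, H // 2, :],    # XT 平面
              volume[:, :, W // 2]]    # YT 平面
    hists = [np.bincount(lbp_8_1(p).ravel(), minlength=256) for p in planes]
    return np.concatenate(hists)       # 3*256 维特征向量

feat = lbp_top_hist(np.random.randint(0, 256, (75, 64, 64)))
print(feat.shape)   # (768,)
```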

    图  9  语音产生的发音特征

    Fig.  9  Articulatory features

    图  10  唇部轮廓ASM模型

    Fig.  10  ASM model of lip profile

    图  11  典型CNN结构示例图

    Fig.  11  A typical CNN structure example

    图  12  RNN及LSTM、GRU结构示例图

    Fig.  12  The structure of RNN, LSTM and GRU
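
    图 12 中 LSTM 与 GRU 的门控机制可以用如下标准公式概括(这里采用文献[124-125]中常见的记号, $\sigma$ 为 sigmoid 函数, $\odot$ 表示逐元素相乘, 不同实现中门的排列顺序与偏置合并方式可能略有差异):

    LSTM: $i_t=\sigma(W_i x_t+U_i h_{t-1}+b_i)$, $f_t=\sigma(W_f x_t+U_f h_{t-1}+b_f)$, $o_t=\sigma(W_o x_t+U_o h_{t-1}+b_o)$, $\tilde{c}_t=\tanh(W_c x_t+U_c h_{t-1}+b_c)$, $c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t$, $h_t=o_t\odot\tanh(c_t)$

    GRU: $z_t=\sigma(W_z x_t+U_z h_{t-1}+b_z)$, $r_t=\sigma(W_r x_t+U_r h_{t-1}+b_r)$, $\tilde{h}_t=\tanh(W_h x_t+U_h(r_t\odot h_{t-1})+b_h)$, $h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$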

    图  13  CNN-RNN基本框架

    Fig.  13  The network structure of CNN-RNN
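
    图 11 ~ 图 13 概括的 CNN-RNN 基本框架, 一般由(3D)卷积前端提取时空视觉特征, 双向 RNN(LSTM/GRU)建模时序依赖, 再接逐帧分类层配合 CTC 损失(或接注意力解码器)输出字符/单词序列. 下面用 PyTorch 给出一个极简的 3D-CNN + BiGRU + CTC 示意模型, 其中层数、通道数、类别数等超参数均为示例性假设, 并非 LipNet 等文中具体方法的复现:

```python
import torch
import torch.nn as nn

class TinyLipReader(nn.Module):
    """极简示意: 3D卷积前端 + 双向GRU + 逐帧分类, 配合CTC损失做句子级唇读."""
    def __init__(self, num_classes=28, hidden=128):   # 例如 26 个字母 + 空格 + CTC blank
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),        # 保留时间维, 空间压缩到 4x4
        )
        self.rnn = nn.GRU(64 * 4 * 4, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden * 2, num_classes)

    def forward(self, x):                 # x: (B, 1, T, H, W) 的唇部ROI序列
        f = self.frontend(x)              # (B, C, T, 4, 4)
        B, C, T, _, _ = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(B, T, -1)
        out, _ = self.rnn(f)              # (B, T, 2*hidden)
        return self.fc(out)               # (B, T, num_classes), 逐帧类别得分

model = TinyLipReader()
video = torch.randn(2, 1, 75, 64, 128)                  # 2 段 75 帧的灰度唇部ROI
logits = model(video)                                    # (2, 75, 28)
log_probs = logits.log_softmax(-1).permute(1, 0, 2)      # CTC 要求 (T, B, C)
targets = torch.randint(1, 28, (2, 20))                  # 假设的字符标签序列(0 为 blank)
input_lengths = torch.full((2,), 75, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(logits.shape, float(loss))
```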

    图  14  LipNet构架

    Fig.  14  The network architecture of LipNet

    图  15  WAS构架

    Fig.  15  The network architecture of WAS

    图  16  三种唇读网络模型

    Fig.  16  Three lip reading network models

    图  17  不同类型数据集变化趋势

    Fig.  17  The trends of different types of datasets

    图  18  各类数据集示例

    Fig.  18  Some examples of different datasets

    表  1  传统时空特征提取算法优缺点总结

    Table  1  A summary of advantages and disadvantages of traditional spatiotemporal feature extraction methods

    时空特征提取方法 | 代表性方法 | 优势 | 不足
    基于表观的 | 全局图像线性变换[38, 57, 60-63], 图嵌入与流形[40-41, 53-54, 65], LBP-TOP[52, 66], HOG[67], 光流[29, 68]··· | 1) 特征提取速度快; 2) 无需复杂的人工建模. | 1) 对唇部区域提取精度要求高; 2) 对环境变化、姿态变化、噪声敏感; 3) 不同讲话者之间泛化性能较差.
    基于形状的 | 轮廓描述[69-72], AFs[73], 形状模型[74-75]··· | 1) 具有良好的可解释性; 2) 不同讲话者之间泛化性能较好; 3) 能有效去除冗余信息. | 1) 会造成部分有用信息丢失; 2) 需要大量的人工标注; 3) 对于姿态变化非常敏感.
    形状表观融合的 | 形状+表观特征串联[76-77], 形状表观模型[39]··· | 1) 特征表达能力较强; 2) 不同讲话者之间泛化性能较好. | 1) 模型复杂, 运算量大; 2) 需要大量的人工标注.

    表  3  单词、短语和语句识别数据集, 其中(s)代表不同语句的数量. 下载地址为: MIRACL-VC[171], LRW[172], LRW-1000[173], GRID[174], OuluVS[175], VIDTIMIT[176], LILiR[177], MOBIO[178], TCD-TIMIT[179], LRS[180], VLRF[181]

    Table  3  Word, phrase and sentence lip reading datasets and their download link: MIRACL-VC[171], LRW[172], LRW-1000[173], GRID[174], OuluVS[175], VIDTIMIT[176], LILiR[177], MOBIO[178], TCD-TIMIT[179], LRS[180], VLRF[181]

    数据集 | 语种 | 识别任务 | 词汇量 | 话语数目 | 说话人数目 | 姿态 | 分辨率 | 谷歌引用 | 发布年份
    IBM ViaVoice | 英语 | 语句 | 10 500 | 24 325 | 290 | 0 | 704 × 480, 30 fps | 299 | 2000
    VIDTIMIT | 英语 | 语句 | 346 (s) | 430 | 43 | 0 | 512 × 384, 25 fps | 51 | 2002
    AVICAR | 英语 | 语句 | 1 317 | 10 000 | 100 | −15$\sim$15 | 720 × 480, 30 fps | 170 | 2004
    AV-TIMIT | 英语 | 语句 | 450 (s) | 4 660 | 233 | 0 | 720 × 480, 30 fps | 127 | 2004
    GRID | 英语 | 短语 | 51 | 34 000 | 34 | 0 | 720 × 576, 25 fps | 700 | 2006
    IV2 | 法语 | 语句 | 15 (s) | 4 500 | 300 | 0, 90 | 780 × 576, 25 fps | 19 | 2008
    UWB-07-ICAV | 捷克语 | 语句 | 7 550 (s) | 10 000 | 50 | 0 | 720 × 576, 50 fps | 16 | 2008
    OuluVS | 英语 | 短语 | 10 (s) | 1 000 | 20 | 0 | 720 × 576, 25 fps | 211 | 2009
    WAPUSK20 | 英语 | 短语 | 52 | 2 000 | 20 | 0 | 640 × 480, 32 fps | 16 | 2010
    LILiR | 英语 | 语句 | 1 000 | 2 400 | 12 | 0, 30, 45, 60, 90 | 720 × 576, 25 fps | 67 | 2010
    BL | 法语 | 语句 | 238 (s) | 4 046 | 17 | 0, 90 | 720 × 576, 25 fps | 12 | 2011
    UNMC-VIER | 英语 | 语句 | 11 (s) | 4 551 | 123 | 0, 90 | 708 × 640, 25 fps | 8 | 2011
    MOBIO | 英语 | 语句 | — | 30 186 | 152 | 0 | 640 × 480, 16 fps | 175 | 2012
    MIRACL-VC | 英语 | 单词 | 10 | 1 500 | 15 | 0 | 640 × 480, 15 fps | 22 | 2014
    MIRACL-VC | 英语 | 短语 | 10 (s) | 1 500 | 15 | 0 | 640 × 480, 15 fps | 22 | 2014
    AusTalk | 英语 | 单词 | 966 | 966 000 | 1 000 | 0 | 640 × 480 | 11 | 2014
    AusTalk | 英语 | 语句 | 59 (s) | 59 000 | 1 000 | 0 | 640 × 480 | 11 | 2014
    MODALITY | 英语 | 单词 | 182 (s) | 231 | 35 | 0 | 1 920 × 1 080, 100 fps | 23 | 2015
    RM-3000 | 英语 | 语句 | 1 000 | 3 000 | 1 | 0 | 360 × 640, 60 fps | 7 | 2015
    IBM AV-ASR | 英语 | 语句 | 10 400 | — | 262 | 0 | 704 × 480, 30 fps | 103 | 2015
    TCD-TIMIT | 英语 | 语句 | 5 954 (s) | 6 913 | 62 | 0, 30 | 1920 × 1080, 30 fps | 59 | 2015
    OuluVS2 | 英语 | 短语 | 10 | 1 590 | 53 | 0, 30, 45, 60, 90 | 1920 × 1080, 30 fps | 46 | 2015
    OuluVS2 | 英语 | 语句 | 530 (s) | 530 | 53 | 0, 30, 45, 60, 90 | 1920 × 1080, 30 fps | 46 | 2015
    LRW | 英语 | 单词 | 500 | 550 000 | 1 000+ | 0$\sim$30 | 256 × 256, 25 fps | 115 | 2016
    HAVRUS | 俄语 | 语句 | 1 530 (s) | 4 000 | 20 | 0 | 640 × 480, 200 fps | 13 | 2016
    LRS2-BBC | 英语 | 语句 | 62 769 | 144 482 | 1 000+ | 0$\sim$30 | 160 × 160, 25 fps | 172 | 2017
    VLRF | 西班牙语 | 语句 | 1 374 | 10 200 | 24 | 0 | 1 280 × 720, 50 fps | 6 | 2017
    LRS3-TED | 英语 | 语句 | 70 000 | 151 819 | 1 000+ | −90$\sim$90 | 224 × 224, 25 fps | 2 | 2018
    LRW-1000 | 中文 | 单词 | 1 000 | 745 187 | 2 000+ | −90$\sim$90 | 1 920 × 1 080, 25 fps | 0 | 2018
    LSVSR | 英语 | 语句 | 127 055 | 2 934 899 | 1 000+ | −30$\sim$30 | 128 × 128, 23 ~ 30 fps | 16 | 2018

    表  2  字母、数字识别数据集. 下载地址为: AVLetters[152], AVICAR[153], XM2VTS[154], BANCA[155], CUAVE[156], VALID[157], CENSREC-1-AV[158], Austalk[159], OuluVS2[160]

    Table  2  Alphabet and digit lip reading datasets and their download link: AVLetters[152], AVICAR[153], XM2VTS[154], BANCA[155], CUAVE[156], VALID[157], CENSREC-1-AV[158], Austalk[159], OuluVS2[160]

    数据集 | 语种 | 识别任务 | 类别数目 | 话语数目 | 说话人数目 | 姿态 | 分辨率 | 谷歌引用 | 发布年份
    AVLetters | 英语 | 字母 | 26 | 780 | 10 | 0 | 376 × 288, 25 fps | 507 | 1998
    XM2VTS | 英语 | 数字 | 10 | 885 | 295 | 0 | 720 × 576, 25 fps | 1 617 | 1999
    BANCA | 多语种 | 数字 | 10 | 29 952 | 208 | 0 | 720 × 576, 25 fps | 530 | 2003
    AVICAR | 英语 | 字母 | 26 | 26 000 | 100 | −15$\sim$15 | 720 × 480, 30 fps | 170 | 2004
    AVICAR | 英语 | 数字 | 13 | 23 000 | 100 | −15$\sim$15 | 720 × 480, 30 fps | 170 | 2004
    CUAVE | 英语 | 数字 | 10 | 7 000+ | 36 | −90, 0, 90 | 720 × 480, 30 fps | 292 | 2002
    VALID | 英语 | 数字 | 10 | 530 | 106 | 0 | 720 × 576, 25 fps | 38 | 2005
    AVLetters2 | 英语 | 字母 | 26 | 910 | 5 | 0 | 1 920 × 1 080, 50 fps | 62 | 2008
    IBMSR | 英语 | 数字 | 10 | 1 661 | 38 | −90, 0, 90 | 368 × 240, 30 fps | 17 | 2008
    CENSREC-1-AV | 日语 | 数字 | 10 | 5 197 | 93 | 0 | 720 × 480, 30 fps | 25 | 2010
    QuLips | 英语 | 数字 | 10 | 3 600 | 2 | −90$\sim$90 | 720 × 576, 25 fps | 21 | 2010
    AusTalk | 英语 | 数字 | 10 | 24 000 | 1 000 | 0 | 640 × 480 | 11 | 2014
    OuluVS2 | 英语 | 数字 | 10 | 159 | 53 | 0$\sim$90 | 1 920 × 1 080, 30 fps | 46 | 2015

    表  4  不同数据集下代表性方法比较

    Table  4  Comparison of representative methods under different datasets

    数据集 | 识别任务 | 参考文献 | 前端特征提取 | 后端分类器 | 主要实验条件 (音频信号/讲话者依赖/外部语言模型) | 最小识别单元 | 识别率
    AVLetters | 字母 | [41] | RFMA | — | ×× | 字母 | 69.60 %
    AVLetters | 字母 | [48] | RTMRBM | SVM | × | 字母 | 66.00 %
    AVLetters | 字母 | [42] | ST-PCA | Autoencoder | ××× | 字母 | 64.40 %
    AVLetters | 字母 | [52] | LBP-TOP | SVM | ×× | 字母 | 62.80 %
    AVLetters | 字母 | [52] | LBP-TOP | SVM | ×× | 字母 | 43.50 %
    AVLetters | 字母 | [193] | DBNF+DCT | LSTM | ×× | 字母 | 58.10 %
    CUAVE | 数字 | [102] | AAM | HMM | ×× | 数字 | 83.00 %
    CUAVE | 数字 | [91] | HOG+MBH | SVM | ××× | 数字 | 70.10 %
    CUAVE | 数字 | [91] | HOG+MBH | SVM | × | 数字 | 90.00 %
    CUAVE | 数字 | [194] | DBNF | DNN-HMM | ××× | 音素 | 64.90 %
    CUAVE | 数字 | [60] | DCT | HMM | ×× | 数字 | 60.40 %
    LRW | 单词 | [128] | 3D-CNN+ResNet | BiLSTM | ××× | 单词 | 83.00 %
    LRW | 单词 | [131] | 3D-CNN+ResNet | BiGRU | ××× | 单词 | 82.00 %
    LRW | 单词 | [131] | 3D-CNN+ResNet | BiGRU | × | 单词 | 98.00 %
    LRW | 单词 | [10] | CNN | LSTM+Attention | ××× | 单词 | 76.20 %
    LRW | 单词 | [9] | CNN | — | ××× | 单词 | 61.10 %
    GRID | 短语 | [56] | 3D-CNN+highway | BiGRU+Attention | ×× | 字符 | 97.10 %
    GRID | 短语 | [10] | CNN | LSTM+Attention | ×× | 单词 | 97.00 %
    GRID | 短语 | [134] | Feed-forward | LSTM | ×× | 单词 | 84.70 %
    GRID | 短语 | [134] | Feed-forward | LSTM | — | 单词 | 95.90 %
    GRID | 短语 | [49] | 3D-CNN | BiGRU | ××× | 字符 | 93.40 %
    GRID | 短语 | [126] | HOG | SVM | ×× | 单词 | 71.20 %
    LRS3-TED | 语句 | [151] | 3D-CNN+ResNet | Transformer+seq2seq | ×× | 字符 | 41.10 %
    LRS3-TED | 语句 | [151] | 3D-CNN+ResNet | Transformer+CTC | — | 字符 | 33.70 %
    LRS3-TED | 语句 | [15] | 3D-CNN | BiLSTM+CTC | ×× | 音素 | 44.90 %
  • [1] McGurk H, MacDonald J. Hearing lips and seeing voices. Nature, 1976, 264(5588): 746−748 doi: 10.1038/264746a0
    [2] Potamianos G, Neti C, Gravier G, Garg A, Senior A W. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 2003, 91(9): 1306−1326 doi: 10.1109/JPROC.2003.817150
    [3] Calvert G A, Bullmore E T, Brammer M J, Campbell R, Williams S C R, McGuire P K, et al. Activation of auditory cortex during silent lipreading. Science, 1997, 276(5312): 593−596 doi: 10.1126/science.276.5312.593
    [4] Deafness and hearing loss [Online], available: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, July 1, 2019
    [5] Tye-Murray N, Sommers M S, Spehar B. Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing. Ear and Hearing, 2007, 28(5): 656−668 doi: 10.1097/AUD.0b013e31812f7185
    [6] Akhtar Z, Micheloni C, Foresti G L. Biometric liveness detection: Challenges and research opportunities. IEEE Security and Privacy, 2015, 13(5): 63−72 doi: 10.1109/MSP.2015.116
    [7] Rekik A, Ben-Hamadou A, Mahdi W. Human machine interaction via visual speech spotting. In: Proceedings of the 2015 International Conference on Advanced Concepts for Intelligent Vision Systems. Catania, Italy: Springer, 2015. 566−574
    [8] Suwajanakorn S, Seitz S M, Kemelmacher-Shlizerman I. Synthesizing obama: Learning lip sync from audio. ACM Transactions on Graphics, 2017, 36(4): Article No.95
    [9] Chung J S, Zisserman A. Lip reading in the wild. In: Proceedings of the 2016 Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 87−103
    [10] Chung J S, Senior A, Vinyals O, Zisserman A. Lip reading sentences in the wild. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 3444−3453
    [11] Chen L, Li Z, K Maddox R K, Duan Z, Xu C. Lip movements generation at a glance. In: Proceedings of the 2018 European Conference on Computer Vision. Munich, Germany: Springer, 2018. 538−553
    [12] Gabbay A, Shamir A, Peleg S. Visual speech enhancement. arXiv preprint arXiv: 1711.08789, 2017
    [13] 黄雅婷, 石晶, 许家铭, 徐波. 鸡尾酒会问题与相关听觉模型的研究现状与展望. 自动化学报, 2019, 45(2): 234−251

    Huang Ya-Ting, Shi Jing, Xu Jia-Ming, Xu Bo. Research advances and perspectives on the cocktail party problem and related auditory models. Acta Automatica Sinica, 2019, 45(2): 234−251
    [14] Akbari H, Arora H, Cao L L, Mesgarani N. Lip2AudSpec: Speech reconstruction from silent lip movements video. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 2516−2520
    [15] Shillingford B, Assael Y, Hoffman M W, Paine T, Hughes C, Prabhu U, et al. Large-scale visual speech recognition. arXiv preprint arXiv: 1807.05162, 2018
    [16] Mandarin Audio-Visual Speech Recognition Challenge [Online], available: http://vipl.ict.ac.cn/homepage/mavsr/index.html, July 1, 2019
    [17] Potamianos G, Neti C, Luettin J, Matthews I. Audio-visual automatic speech recognition: An overview. Issues in Visual and Audio-Visual Speech Processing. Cambridge: MIT Press, 2004. 1−30
    [18] Zhou Z H, Zhao G Y, Hong X P, Pietikainen M. A review of recent advances in visual speech decoding. Image and Vision Computing, 2014, 32(9): 590−605 doi: 10.1016/j.imavis.2014.06.004
    [19] Fernandez-Lopez A, Sukno F M. Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 2018, 78: 53−72 doi: 10.1016/j.imavis.2018.07.002
    [20] 姚鸿勋, 高文, 王瑞, 郎咸波. 视觉语言-唇读综述. 电子学报, 2001, 29(2): 239−246 doi: 10.3321/j.issn:0372-2112.2001.02.025

    Yao Hong-Xun, Gao Wen, Wang Rui, Lang Xian-Bo. A survey of lipreading-one of visual languages. Acta Electronica Sinica, 2001, 29(2): 239−246 doi: 10.3321/j.issn:0372-2112.2001.02.025
    [21] Cox S J, Harvey R W, Lan Y, et al. The challenge of multispeaker lip-reading. In: Proceedings of AVSP. 2008: 179−184
    [22] Messer K, Matas J, Kittler J, et al. XM2VTSDB: The extended M2VTS database. In: Proceedings of the Second International Conference on Audio and Video-based Biometric Person Authentication. 1999, 964: 965−966
    [23] Bailly-Bailliére E, Bengio S, Bimbot F, Hamouz M, Kittler J, Mariéthoz J, et al. The BANCA database and evaluation protocol. In: Proceedings of the 2003 International Conference on Audio- and Video-based Biometric Person Authentication. Guildford, United Kingdom: Springer, 2003. 625−638
    [24] Ortega A, Sukno F, Lleida E, Frangi A F, Miguel A, Buera L, et al. AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition. In: Proceedings of the 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal: European Language Resources Association, 2004. 763−766
    [25] Lee B, Hasegawa-Johnson M, Goudeseune C, Kamdar S, Borys S, Liu M, et al. AVICAR: Audio-visual speech corpus in a car environment. In: Proceedings of the 8th International Conference on Spoken Language Processing. Jeju Island, South Korea: International Speech Communication Association, 2004. 2489−2492
    [26] Twaddell W F. On defining the phoneme. Language, 1935, 11(1): 5−62
    [27] Woodward M F, Barber C G. Phoneme perception in lipreading. Journal of Speech and Hearing Research, 1960, 3(3): 212−222 doi: 10.1044/jshr.0303.212
    [28] Fisher C G. Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 1968, 11(4): 796−804 doi: 10.1044/jshr.1104.796
    [29] Cappelletta L, Harte N. Viseme definitions comparison for visual-only speech recognition. In: Proceedings of the 19th European Signal Processing Conference. Barcelona, Spain: IEEE, 2011. 2109−2113
    [30] Wu Y, Ji Q. Facial landmark detection: A literature survey. International Journal of Computer Vision, 2019, 127(2): 115−142 doi: 10.1007/s11263-018-1097-z
    [31] Chrysos G G, Antonakos E, Snape P, Asthana A, Zafeiriou S. A comprehensive performance evaluation of deformable face tracking "in-the-wild". International Journal of Computer Vision, 2018, 126(2-4): 198−232 doi: 10.1007/s11263-017-0999-5
    [32] Koumparoulis A, Potamianos G, Mroueh Y, et al. Exploring ROI size in deep learning based lipreading. In: Proceedings of AVSP. 2017: 64−69
    [33] Deller J R Jr, Hansen J H L, Proakis J G. Discrete-Time Processing of Speech Signals. New York: Macmillan Pub. Co, 1993.
    [34] Rabiner L R, Juang B H. Fundamentals of Speech Recognition. Englewood Cliffs: Prentice Hall, 1993.
    [35] Young S, Evermann G, Gales M J F, Hain T, Kershaw D, Liu X Y, et al. The HTK Book. Cambridge: Cambridge University Engineering Department, 2002.
    [36] Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, et al. The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE, 2011.
    [37] Matthews I, Cootes T F, Bangham J A, Cox S, Harvey R. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(2): 198−213 doi: 10.1109/34.982900
    [38] Potamianos G, Graf H P, Cosatto E. An image transform approach for HMM based automatic lipreading. In: Proceedings of 1998 International Conference on Image Processing. Chicago, USA: IEEE, 1998. 173−177
    [39] Cootes T F, Edwards G J, Taylor C J. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(6): 681−685 doi: 10.1109/34.927467
    [40] Fu Y, Zhou X, Liu M, Hasegawa-Johnson M, Huang T S. Lipreading by locality discriminant graph. In: Proceedings of 2007 IEEE International Conference on Image Processing. San Antonio, USA: IEEE, 2007. III−325−III−328
    [41] Pei Y R, Kim T K, Zha H B. Unsupervised random forest manifold alignment for lipreading. In: Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 129−136
    [42] Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng A Y. Multimodal deep learning. In: Proceeding of the 28th International Conference on Machine Learning. Washington, USA: ACM, 2011. 689−696
    [43] Salakhutdinov R, Mnih A, Hinton G. Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM, 2007. 791−798
    [44] Huang J, Kingsbury B. Audio-visual deep learning for noise robust speech recognition. In: Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 7596−7599
    [45] Ninomiya H, Kitaoka N, Tamura S, et al. Integration of deep bottleneck features for audio-visual speech recognition. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association. 2015.
    [46] Sui C, Bennamoun M, Togneri R. Listening with your eyes: Towards a practical visual speech recognition system using deep Boltzmann machines. In: Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 154−162
    [47] Noda K, Yamaguchi Y, Nakadai K, Okuno H G, Ogata T. Audio-visual speech recognition using deep learning. Applied Intelligence, 2015, 42(4): 722−737 doi: 10.1007/s10489-014-0629-7
    [48] Hu D, Li X L, Lu X Q. Temporal multimodal learning in audiovisual speech recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 3574−3582
    [49] Assael Y M, Shillingford B, Whiteson S, De Freitas N. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016
    [50] Afouras T, Chung J S, Zisserman A. Deep lip reading: A comparison of models and an online application. arXiv preprint arXiv:1806.06053, 2018
    [51] Luettin J, Thacker N A. Speechreading using probabilistic models. Computer Vision and Image Understanding, 1997, 65(2): 163−178 doi: 10.1006/cviu.1996.0570
    [52] Zhao G Y, Barnard M, Pietikäinen M. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 2009, 11(7): 1254−1265 doi: 10.1109/TMM.2009.2030637
    [53] Zhou Z H, Zhao G Y, Pietikäinen M. Towards a practical lipreading system. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2011. 137−144
    [54] Zhou Z H, Hong X P, Zhao G Y, Pietikäinen M. A compact representation of visual speech data using latent variables. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(1): 1
    [55] Chung J S, Zisserman A. Out of time: Automated lip sync in the wild. In: Proceedings of Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 251−263
    [56] Xu K, Li D W, Cassimatis N, Wang X L. LCANet: End-to-end lipreading with cascaded attention-CTC. In: Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition. Xi'an, China: IEEE, 2018. 548−555
    [57] Lucey P J, Potamianos G, Sridharan S. A unified approach to multi-pose audio-visual ASR. In: Proceedings of the 8th Annual Conference of the International Speech Communication Association. Antwerp, Belgium: Causal Productions Pty Ltd., 2007. 650−653
    [58] Almajai I, Cox S, Harvey R, Lan Y X. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In: Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2722−2726
    [59] Seymour R, Stewart D, Ming J. Comparison of image transform-based features for visual speech recognition in clean and corrupted videos. EURASIP Journal on Image and Video Processing, 2007, 2008(1): Article No.810362
    [60] Estellers V, Gurban M, Thiran J P. On dynamic stream weighting for audio-visual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(4): 1145−1157 doi: 10.1109/TASL.2011.2172427
    [61] Potamianos G, Neti C, Iyengar G, Senior A W, Verma A. A cascade visual front end for speaker independent automatic speechreading. International Journal of Speech Technology, 2001, 4(3−4): 193−208
    [62] Lucey P J, Sridharan S, Dean D B. Continuous pose-invariant lipreading. In: Proceedings of the 9th Annual Conference of the International Speech Communication Association (Interspeech 2008) incorporating the 12th Australasian International Conference on Speech Science and Technology (SST 2008). Brisbane Australia: International Speech Communication Association, 2008. 2679−2682
    [63] Lucey P J, Potamianos G, Sridharan S. Patch-based analysis of visual speech from multiple views. In: Proceedings of the International Conference on Auditory-Visual Speech Processing 2008. Moreton Island, Australia: AVISA, 2008. 69−74
    [64] Sheerman-Chase T, Ong E J, Bowden R. Cultural factors in the regression of non-verbal communication perception. In: Proceedings of the Workshop on Human Interaction in Computer Vision. Barcelona, Spain, 2011
    [65] Zhou Z H, Zhao G Y, Pietikäinen M. Lipreading: A graph embedding approach. In: Proceedings of the 20th International Conference on Pattern Recognition. Istanbul, Turkey: IEEE, 2010. 523−526
    [66] Zhao G Y, Pietikäinen M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 915−928 doi: 10.1109/TPAMI.2007.1110
    [67] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005. 886−893
    [68] Mase K, Pentland A. Automatic lipreading by optical-flow analysis. Systems and Computers in Japan, 1991, 22(6): 67−76 doi: 10.1002/scj.4690220607
    [69] Aleksic P S, Williams J J, Wu Z L, Katsaggelos A K. Audio-visual speech recognition using MPEG-4 compliant visual features. EURASIP Journal on Advances in Signal Processing, 2002, 2002(1): Article No. 150948
    [70] Brooke N M. Using the visual component in automatic speech recognition. In: Proceedings of the 4th International Conference on Spoken Language Processing. Philadelphia, USA: IEEE, 1996. 1656−1659
    [71] Cetingul H E, Yemez Y, Erzin E, Tekalp A M. Discriminative analysis of lip motion features for speaker identification and speech-reading. IEEE Transactions on Image Processing, 2006, 15(10): 2879−2891 doi: 10.1109/TIP.2006.877528
    [72] Nefian A V, Liang L H, Pi X B, Liu X X, Murphy K. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002, 2002(11): Article No.783042 doi: 10.1155/S1110865702206083
    [73] Kirchhoff K. Robust speech recognition using articulatory information [Ph. D. dissertation], 1999
    [74] Cootes T F, Taylor C J, Cooper D H, Graham J. Active shape models-their training and application. Computer Vision and Image Understanding, 1995, 61(1): 38−59 doi: 10.1006/cviu.1995.1004
    [75] Luettin J, Thacker N A, Beet S W. Speechreading using shape and intensity information. In: Proceedings of the 4th International Conference on Spoken Language Processing. Philadelphia, USA: IEEE, 1996. 58−61
    [76] Dupont S, Luettin J. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2000, 2(3): 141−151 doi: 10.1109/6046.865479
    [77] Chan M T. HMM-based audio-visual speech recognition integrating geometric- and appearance-based visual features. In: Proceedings of the 4th Workshop on Multimedia Signal Processing. Cannes, France: IEEE, 2001. 9−14
    [78] Roweis S T, Sau L K. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500): 2323−2326 doi: 10.1126/science.290.5500.2323
    [79] Tenenbaum J B, de Silva V, Langford J C. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290(5500): 2319−2323 doi: 10.1126/science.290.5500.2319
    [80] Yan S C, Xu D, Zhang B Y, Zhang H J, Yang Q, Lin S. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(1): 40−51 doi: 10.1109/TPAMI.2007.250598
    [81] Fu Y, Yan S C, Huang T S. Classification and feature extraction by simplexization. IEEE Transactions on Information Forensics and Security, 2008, 3(1): 91−100 doi: 10.1109/TIFS.2007.916280
    [82] Ojala T, Pietikäinen M, Harwood D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 1996, 29(1): 51−59 doi: 10.1016/0031-3203(95)00067-4
    [83] Ojala T, Pietikäinen M, Mäenpää T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7): 971−987 doi: 10.1109/TPAMI.2002.1017623
    [84] 刘丽, 赵凌君, 郭承玉, 王亮, 汤俊. 图像纹理分类方法研究进展和展望. 自动化学报, 2018, 44(4): 584−607

    Liu Li, Zhao Ling-Jun, Guo Cheng-Yu, Wang Liang, Tang Jun. Texture classification: State-of-the-art methods and prospects. Acta Automatica Sinica, 2018, 44(4): 584−607
    [85] Pietikäinen M, Hadid A, Zhao G, Ahonen T. Computer Vision Using Local Binary Patterns. London: Springer, 2011.
    [86] Liu L, Chen J, Fieguth P, Zhao G Y, Chellappa R, Pietikäinen M. From BoW to CNN: Two decades of texture representation for texture classification. International Journal of Computer Vision, 2019, 127(1): 74−109 doi: 10.1007/s11263-018-1125-z
    [87] 刘丽, 谢毓湘, 魏迎梅, 老松杨. 局部二进制模式方法综述. 中国图象图形学报, 2014, 19(12): 1696−1720 doi: 10.11834/jig.20141202

    Liu Li, Xie Yu-Xiang, Wei Ying-Mei, Lao Song-Yang. Survey of Local Binary Pattern method. Journal of Image and Graphics, 2014, 19(12): 1696−1720 doi: 10.11834/jig.20141202
    [88] Horn B K P, Schunck B G. Determining optical flow. Artificial Intelligence, 1981, 17(1-3): 185−203 doi: 10.1016/0004-3702(81)90024-2
    [89] Bouguet J Y. Pyramidal implementation of the affine Lucas Kanade feature tracker description of the algorithm. Intel Corporation, 2001, 5: 1−9
    [90] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence. San Francisco, CA, United States: Morgan Kaufmann Publishers Inc., 1981. 674−679
    [91] Rekik A, Ben-Hamadou A, Mahdi W. An adaptive approach for lip-reading using image and depth data. Multimedia Tools and Applications, 2016, 75(14): 8609−8636 doi: 10.1007/s11042-015-2774-3
    [92] Shaikh A A, Kumar D K, Yau W C, Azemin M Z C, Gubbi J. Lip reading using optical flow and support vector machines. In: Proceedings of the 3rd International Congress on Image and Signal Processing. Yantai, China: IEEE, 2010. 327−330
    [93] Goldschen A J, Garcia O N, Petajan E. Continuous optical automatic speech recognition by lipreading. In: Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA, USA: IEEE, 1994. 572−577
    [94] King S, Frankel J, Livescu K, McDermott E, Richmond K, Wester M. Speech production knowledge in automatic speech recognition. The Journal of the Acoustical Society of America, 2007, 121(2): 723−742 doi: 10.1121/1.2404622
    [95] Kirchhoff K, Fink G A, Sagerer G. Combining acoustic and articulatory feature information for robust speech recognition. Speech Communication, 2002, 37(3−4): 303−319 doi: 10.1016/S0167-6393(01)00020-6
    [96] Livescu K, Cetin O, Hasegawa-Johnson M, King S, Bartels C, Borges N, et al. Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU Summer Workshop. In: Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing. Honolulu, USA: IEEE. 2007. IV−621−IV−624
    [97] Saenko K, Livescu K, Glass J, Darrell T. Production domain modeling of pronunciation for visual speech recognition. In: Proceeding of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia, USA: IEEE. 2005. v/473−v/476
    [98] Saenko K, Livescu K, Glass J, Darrell T. Multistream articulatory feature-based models for visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(9): 1700−1707 doi: 10.1109/TPAMI.2008.303
    [99] Saenko K, Livescu K, Siracusa M, Wilson K, Glass J, Darrell T. Visual speech recognition with loosely synchronized feature streams. In: Proceeding of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE. 2005. 1424−1431
    [100] Papcun G, Hochberg J, Thomas T R, Laroche F, Zacks J, Levy S. Inferring articulation and recognizing gestures from acoustics with a neural network trained on x-ray microbeam data. The Journal of the Acoustical Society of America, 1992, 92(2): 688−700 doi: 10.1121/1.403994
    [101] Matthews I, Potamianos G, Neti C, Luettin J. A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of the 2001 IEEE International Conference on Multimedia and Expo. Tokyo, Japan: IEEE, 2001. 825−828
    [102] Papandreou G, Katsamanis A, Pitsikalis V, Maragos P. Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(3): 423−435 doi: 10.1109/TASL.2008.2011515
    [103] Hilder S, Harvey R W, Theobald B J. Comparison of human and machine-based lip-reading. In: Proceedings of the 2009 AVSP. 2009: 86−89
    [104] Lan Y X, Theobald B J, Harvey R. View independent computer lip-reading. In: Proceedings of the 2012 IEEE International Conference on Multimedia and Expo. Melbourne, Australia: IEEE, 2012. 432−437
    [105] Lan Y X, Harvey R, Theobald B J. Insights into machine lip reading. In: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing. Kyoto, Japan: IEEE, 2012. 4825−4828
    [106] Bear H L, Harvey R. Decoding visemes: Improving machine lip-reading. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2009−2013
    [107] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436−444 doi: 10.1038/nature14539
    [108] Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504−507 doi: 10.1126/science.1127647
    [109] Hong X P, Yao H X, Wan Y Q, Chen R. A PCA based visual DCT feature extraction method for lip-reading. In: Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia. Pasadena, USA: IEEE, 2006. 321−326
    [110] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Red Hook, NY, United States: Curran Associates Inc., 2012. 1097−1105
    [111] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv: 1409.1556, 2014
    [112] Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 1−9
    [113] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 770−778
    [114] Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 2261−2269
    [115] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: IEEE, 2018. 7132−7141
    [116] Liu L, Ouyang W L, Wang X G, Fieguth P, Chen J, Liu X W, et al. Deep learning for generic object detection: A survey. arXiv preprint arXiv: 1809.02165, 2018
    [117] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3431−3440
    [118] Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. In: Proceedings of the 2013 IEEE international Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 6645−6649
    [119] Noda K, Yamaguchi Y, Nakadai K, Okuno H G, Ogata T. Lipreading using convolutional neural network. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association. Singapore: ISCA, 2014. 1149−1153
    [120] Ji S W, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221−231 doi: 10.1109/TPAMI.2012.59
    [121] Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey. Image and Vision Computing, 2017, 60: 4−21 doi: 10.1016/j.imavis.2017.01.010
    [122] Mroueh Y, Marcheret E, Goel V. Deep multimodal learning for audio-visual speech recognition. In: Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Queensland, Australia: IEEE, 2015. 2130−2134
    [123] Thangthai K, Harvey R W, Cox S J, et al. Improving lip-reading performance for robust audiovisual speech recognition using DNNs. In: Proceedings of the 2015 AVSP. 2015: 127−131.
    [124] Gers F A, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000, 12(10): 2451−2471 doi: 10.1162/089976600300015015
    [125] Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv: 1412.3555, 2014
    [126] Wand M, Koutník J, Schmidhuber J. Lipreading with long short-term memory. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 6115−6119
    [127] Garg A, Noyola J, Bagadia S. Lip reading using CNN and LSTM, Technical Report, CS231n Project Report, Stanford University, USA, 2016.
    [128] Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv: 1703.04105, 2017
    [129] Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006. 369−376
    [130] Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In: Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding. Arizona, USA: IEEE, 2015. 167−174
    [131] Petridis S, Stafylakis T, Ma P, Cai F P, Tzimiropoulos G, Pantic M. End-to-end audiovisual speech recognition. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 6548−6552
    [132] Fung I, Mak B. End-to-end low-resource lip-reading with Maxout Cnn and Lstm. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 2511−2515
    [133] Wand M, Schmidhuber J. Improving speaker-independent lipreading with domain-adversarial training. arXiv preprint arXiv: 1708.01565, 2017
    [134] Wand M, Schmidhuber J, Vu N T. Investigations on end-to-end audiovisual fusion. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 3041−3045
    [135] Srivastava R K, Greff K, Schmidhuber J. Training very deep networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA United States: MIT Press, 2015. 2377−2385
    [136] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014. 3104−3112
    [137] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv: 1409.0473, 2014
    [138] Chaudhari S, Polatkan G, Ramanath R, Mithal V. An attentive survey of attention models. arXiv preprint arXiv: 1904.02874, 2019
    [139] Wang F, Tax D M J. Survey on the attention based RNN model and its applications in computer vision. arXiv preprint arXiv: 1601.06823, 2016
    [140] Chung J S, Zisserman A. Lip reading in profile. In: Proceedings of the British Machine Vision Conference. Guildford: BMVA Press, 2017. 155.1−155.11
    [141] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211−252 doi: 10.1007/s11263-015-0816-y
    [142] Saitoh T, Zhou Z H, Zhao G Y, Pietikäinen M. Concatenated frame image based cnn for visual speech recognition. In: Proceedings of the 2016 Asian Conference on Computer Vision. Taiwan, China: Springer, 2016. 277−289
    [143] Lin M, Chen Q, Yan S C. Network in network. arXiv preprint arXiv: 1312.4400, 2013
    [144] Petridis S, Li Z W, Pantic M. End-to-end visual speech recognition with LSTMs. In: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans, USA: IEEE, 2017. 2592−2596
    [145] Petridis S, Wang Y J, Li Z W, Pantic M. End-to-end audiovisual fusion with LSTMS. arXiv preprint arXiv: 1709.04343, 2017
    [146] Petridis S, Wang Y J, Li Z W, Pantic M. End-to-end multi-view lipreading. arXiv preprint arXiv: 1709.00443, 2017
    [147] Petridis S, Shen J, Cetin D, Pantic M. Visual-only recognition of normal, whispered and silent speech. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018. 6219−6223
    [148] Moon S, Kim S, Wang H H. Multimodal transfer deep learning with applications in audio-visual recognition. arXiv preprint arXiv: 1412.3121, 2014
    [149] Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA: IEEE, 2017. 1800−1807
    [150] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, United States: Curran Associates Inc., 2017. 6000−6010
    [151] Afouras T, Chung J S, Senior A, et al. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. DOI: 10.1109/TPAMI.2018.2889052
    [152] AV Letters Database [Online], available: http://www2.cmp.uea.ac.uk/~bjt/avletters/, October 27, 2020
    [153] AVICAR Project: Audio-Visual Speech Recognition in a Car [Online], available: http://www.isle.illinois.edu/sst/AVICAR/#information, October 27, 2020
    [154] The Extended M2VTS Database [Online], available: http://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/, October 27, 2020
    [155] The BANCA Database [Online], available: http://www.ee.surrey.ac.uk/CVSSP/banca/, October 27, 2020
    [156] CUAVE Group Set [Online], available: http://people.csail.mit.edu/siracusa/avdata/, October 27, 2020
    [157] VALID: Visual quality Assessment for Light field Images Dataset [Online], available: https://www.epfl.ch/labs/mmspg/downloads/valid/, October 27, 2020
    [158] Speech Resources Consortium [Online], available: http://research.nii.ac.jp/src/en/data.html, October 27, 2020
    [159] AusTalk [Online], available: https://austalk.edu.au/about/corpus/, October 27, 2020
    [160] OULUVS2: A MULTI-VIEW AUDIOVISUAL DATABASE [Online], available: http://www.ee.oulu.fi/research/imag/OuluVS2/, October 27, 2020
    [161] Patterson E K, Gurbuz S, Tufekci Z, Gowdy J N. CUAVE: A new audio-visual database for multimodal human-computer interface research. In: Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando, Florida, USA: IEEE, 2002. II−2017−II−2020
    [162] Fox N A, O'Mullane B A, Reilly R B. VALID: A new practical audio-visual database, and comparative results. In: Proceedings of the 2005 International Conference on Audio-and Video-Based Biometric Person Authentication. Berlin, Germany: Springer, 2005. 777−786
    [163] Anina I, Zhou Z H, Zhao G Y, Pietikäinen M. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In: Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Ljubljana, Slovenia: IEEE, 2015. 1−5
    [164] Estival D, Cassidy S, Cox F, et al. AusTalk: An audio-visual corpus of Australian English. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). 2014
    [165] Tamura S, Miyajima C, Kitaoka N, et al. CENSREC-1-AV: An audio-visual corpus for noisy bimodal speech recognition. In: Proceedings of Auditory-Visual Speech Processing (AVSP). 2010
    [166] Pass A, Zhang J G, Stewart D. An investigation into features for multi-view lipreading. In: Proceedings of the 2010 IEEE International Conference on Image Processing. Hong Kong, China: IEEE, 2010. 2417−2420
    [167] Neti C, Potamianos G, Luettin J, et al. Audio-visual speech recognition. IDIAP, 2000.
    [168] Sanderson C. The VidTIMIT database. IDIAP, 2002.
    [169] Jankowski C, Kalyanswamy A, Basson S, Spitz J. NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database. In: Proceedings of the 1990 International Conference on Acoustics, Speech, and Signal Processing. Albuquerque, New Mexico, USA: IEEE, 1990. 109−112
    [170] Hazen T J, Saenko K, La C H, Glass J R. A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments. In: Proceedings of the 6th International Conference on Multimodal Interfaces. State College, PA, USA: ACM, 2004. 235−242
    [171] MIRACL-VC1 [Online], available: https://sites.google.com/site/achrafbenhamadou/-datasets/miracl-vc1, October 27, 2020
    [172] The Oxford-BBC Lip Reading in the Wild (LRW) Dataset [Online], available: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html, October 27, 2020
    [173] LRW-1000: Lip Reading database [Online], available: http://vipl.ict.ac.cn/view_database.php?id=14, October 27, 2020
    [174] The GRID audiovisual sentence corpus [Online], available: http://spandh.dcs.shef.ac.uk/gridcorpus/, October 27, 2020
    [175] OuluVS database [Online], available: https://www.oulu.fi/cmvs/node/41315, October 27, 2020
    [176] VidTIMIT Audio-Video Dataset [Online], available: http://conradsanderson.id.au/vidtimit/#downloads, October 27, 2020
    [177] LiLiR [Online], available: http://www.ee.surrey.ac.uk/Projects/LILiR/datasets.html, October 27, 2020
    [178] MOBIO [Online], available: https://www.idiap.ch/dataset/mobio, October 27, 2020
    [179] TCD-TIMIT [Online], available: https://sigmedia.tcd.ie/TCDTIMIT/, October 27, 2020
    [180] Lip Reading Datasets [Online], available: http://www.robots.ox.ac.uk/~vgg/data/lip_reading/, October 27, 2020
    [181] Visual Lip Reading Feasibility (VLRF) [Online], available: https://datasets.bifrost.ai/info/845, October 27, 2020
    [182] Rekik A, Ben-Hamadou A, Mahdi W. A new visual speech recognition approach for RGB-D cameras. In: Proceedings of the 2014 International Conference Image Analysis and Recognition. Vilamoura, Portugal: Springer, 2014. 21−28
    [183] McCool C, Marcel S, Hadid A, Pietikäinen M, Matejka P, Cernockỳ J, et al. Bi-modal person recognition on a mobile phone: Using mobile phone data. In: Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops. Melbourne, Australia: IEEE, 2012. 635−640
    [184] Howell D. Confusion Modelling for Lip-Reading [Ph. D. dissertation], University of East Anglia, Norwich, 2015
    [185] Harte N, Gillen E. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 2015, 17(5): 603−615 doi: 10.1109/TMM.2015.2407694
    [186] Verkhodanova V, Ronzhin A, Kipyatkova I, Ivanko D, Karpov A, Zelezny M. HAVRUS corpus: High-speed recordings of audio-visual Russian speech. In: Proceedings of the 2016 International Conference on Speech and Computer. Budapest, Hungary: Springer, 2016. 338−345
    [187] Fernandez-Lopez A, Martinez O, Sukno F M. Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database. In: Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition. Washington, USA: IEEE, 2017. 208−215
    [188] Cooke M, Barker J, Cunningham S, Shao X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 2006, 120(5): 2421−2424 doi: 10.1121/1.2229005
    [189] Vorwerk A, Wang X, Kolossa D, et al. WAPUSK20 - A database for robust audiovisual speech recognition. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010). 2010
    [190] Czyzewski A, Kostek B, Bratoszewski P, Kotus J, Szykulski M. An audio-visual corpus for multimodal automatic speech recognition. Journal of Intelligent Information Systems, 2017, 49(2): 167−192 doi: 10.1007/s10844-016-0438-z
    [191] Afouras T, Chung J S, Zisserman A. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv preprint arXiv: 1809.00496, 2018
    [192] Yang S, Zhang Y H, Feng D L, Yang M M, Wang C H, Xiao J Y, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In: Proceedings of the 14th IEEE International Conference on Automatic Face and Gesture Recognition. Lille, France: IEEE, 2019. 1−8
    [193] Petridis S, Pantic M. Deep complementary bottleneck features for visual speech recognition. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016. 2304−2308
    [194] Rahmani M H, Almasganj F. Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features. In: Proceedings of the 3rd International Conference on Pattern Recognition and Image Analysis. Shahrekord, Iran: IEEE, 2017. 195−199
    [195] Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, et al. FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015. 2758−2766
    [196] Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 1647−1655
    [197] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2014. 568−576
    [198] Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 1933−1941
    [199] Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial transformer networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, United States: MIT Press, 2015. 2017−2025
    [200] Bhagavatula C, Zhu C C, Luu K, Savvides M. Faster than real-time facial alignment: A 3D spatial transformer network approach in unconstrained poses. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 4000−4009
    [201] Baltrušaitis T, Ahuja C, Morency L P. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423−443 doi: 10.1109/TPAMI.2018.2798607
    [202] Loizou P C. Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC Press, 2013.
    [203] Hou J C, Wang S S, Lai Y H, Tsao Y, Chang H W, Wang H M. Audio-visual speech enhancement based on multimodal deep convolutional neural network. arXiv preprint arXiv: 1703.10893, 2017
    [204] Ephrat A, Halperin T, Peleg S. Improved speech reconstruction from silent video. In: Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy: IEEE, 2017. 455−462
    [205] Gabbay A, Shamir A, Peleg S. Visual speech enhancement. arXiv preprint arXiv: 1711.08789, 2017.
出版历程
  • 收稿日期:  2019-07-16
  • 录用日期:  2019-11-16
  • 网络出版日期:  2019-12-19
  • 刊出日期:  2020-11-24
