Deep Neural Network Based Feature Extraction for Low-resource Speech Recognition

QIN Chu-Xiong, ZHANG Lian-Hai

Citation: QIN Chu-Xiong, ZHANG Lian-Hai. Deep Neural Network Based Feature Extraction for Low-resource Speech Recognition. ACTA AUTOMATICA SINICA, 2017, 43(7): 1208-1219. doi: 10.16383/j.aas.2017.c150654

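doi: 10.16383/j.aas.2017.c150654
Funds:

Supported by National Natural Science Foundation of China 61673395

Supported by National Natural Science Foundation of China 61403415

Supported by National Natural Science Foundation of China 61302107

More Information
    Author Bio:

    ZHANG Lian-Hai  Associate professor in the Department of Information and System Engineering, Information Engineering University. His research interest covers speech signal processing and intelligent information processing. E-mail: lianhaiz@sina.com

    Corresponding author:

    QIN Chu-Xiong  Ph.D. candidate in the Department of Information and System Engineering, Information Engineering University. His main research interest is intelligent information processing. Corresponding author of this paper. E-mail: chuxiongq313@gmail.com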

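  • Abstract: To address the sharp degradation of acoustic modeling with deep neural network (DNN) features under low-resource training conditions, two DNN-based feature extraction methods suited to low-resource speech recognition are proposed. First, building on a network structure with shared hidden layers (SHL), a corpus with relatively rich resources is used to assist the training of a deep bottleneck neural network. Because the bottleneck (BN) layer lies within the shared layers, dropout, maxout, and rectified linear units (ReLU) are introduced to mitigate the overfitting caused by the irregular distribution of multi-stream training samples, while also reducing the number of network parameters and the training time. Second, to improve DNN feature extraction, a low-dimensional high-level feature extraction technique based on convex non-negative matrix factorization (CNMF) is proposed: a weight matrix of the network is factorized, the resulting basis matrix is used as the weight matrix of a feature layer, and a new type of low-dimensional feature is extracted from that layer. Experiments on the 1-hour low-resource Czech training corpus of Vystadial 2013 show that, with 26.7 hours of English data used for auxiliary training, recognition accuracy improves by 7.0% relative to the baseline system when dropout and ReLU are used. With dropout and maxout, it improves by 12.6% relative to the baseline, the number of network parameters is 62.7% lower than in the other systems, and training time is reduced by 25%. The matrix-factorization-based low-dimensional features achieve better recognition accuracy than bottleneck features (BNF) under both monolingual training and auxiliary training, and under auxiliary training they also outperform the DNN hidden Markov model (DNN-HMM) system, with improvements ranging from 0.8% to 3.4%.
    1)  Recommended by Associate Editor JIA Jia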
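    The CNMF step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it uses the standard Convex-NMF multiplicative updates, which factorize a mixed-sign matrix X as (XW)Gᵀ with W and G non-negative. The function name `convex_nmf`, the matrix sizes, the random initialisation, the iteration count, and the choice to apply the basis matrix directly to the previous layer's outputs are all hypothetical.

```python
import numpy as np


def convex_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate X (p x n, mixed sign) as (X @ W) @ G.T with W, G >= 0.

    Returns the basis matrix F = X @ W (p x k), the coefficients G (n x k), and W.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k)) + eps  # non-negative random initialisation (assumption)
    G = rng.random((n, k)) + eps
    Y = X.T @ X
    Y_pos = (np.abs(Y) + Y) / 2.0  # positive part of X^T X
    Y_neg = (np.abs(Y) - Y) / 2.0  # negative part of X^T X
    for _ in range(n_iter):
        # Multiplicative update for G with W held fixed.
        GWt = G @ W.T
        num = Y_pos @ W + GWt @ (Y_neg @ W)
        den = Y_neg @ W + GWt @ (Y_pos @ W)
        G *= np.sqrt(num / (den + eps))
        # Multiplicative update for W with the new G held fixed.
        GtG = G.T @ G
        num = Y_pos @ G + Y_neg @ W @ GtG
        den = Y_neg @ G + Y_pos @ W @ GtG
        W *= np.sqrt(num / (den + eps))
    return X @ W, G, W


rng = np.random.default_rng(1)
weights = 0.1 * rng.standard_normal((64, 48))  # hypothetical layer weights (d_in x d_out)
basis, G, W = convex_nmf(weights, k=8)          # basis: 64 x 8
hidden = rng.random((5, 64))                    # 5 frames of previous-layer outputs
features = hidden @ basis                       # 5 x 8 low-dimensional features
print(features.shape)
```

    In this reading, the basis matrix plays the role of the feature-layer weights, so each frame is mapped to a k-dimensional vector (40 or 50 dimensions in Table 4) in place of the bottleneck output.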
  • Fig. 1  SHL-based network structures

    Fig. 2  Diagram of the SHL-BN-MDNN training scheme

    Fig. 3  CNMF-based low-dimensional feature extraction approach

    Fig. 4  WER of CNMF-based low-dimensional features under different factorization parameters

    Table 1  WER of BNF based on different training methods (%)

    Training scheme | WER (%) | Number of DNN parameters (MB)
    Monolingual BNF | 67.42 | 3.57
    SHL + BNF | 63.25 | 8.34
    SHL + Dropout + Maxout + BNF | 58.95 | 3.11
    SHL + Dropout + ReLU + BNF | 62.74 | 8.34
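    The relative figures quoted in the abstract can be checked against Table 1 with a few lines of arithmetic. The sketch below assumes that the monolingual BNF row is the baseline for WER and that the SHL + BNF row is the reference for parameter count; the page does not state either choice explicitly, and the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Values read from Table 1.
wer_mono_bnf = 67.42   # monolingual BNF (assumed baseline)
wer_maxout = 58.95     # SHL + Dropout + Maxout + BNF
wer_relu = 62.74       # SHL + Dropout + ReLU + BNF
params_shl = 8.34      # SHL + BNF parameter count (assumed reference)
params_maxout = 3.11   # SHL + Dropout + Maxout + BNF parameter count

wer_gain_maxout = relative_reduction(wer_mono_bnf, wer_maxout)
wer_gain_relu = relative_reduction(wer_mono_bnf, wer_relu)
param_cut = relative_reduction(params_shl, params_maxout)

print(f"{wer_gain_maxout:.1f} {wer_gain_relu:.1f} {param_cut:.1f}")  # 12.6 6.9 62.7
```

    Under these assumptions the maxout figures reproduce the 12.6% and 62.7% in the abstract. The ReLU case comes out at about 6.9%, slightly below the quoted 7.0%, which may reflect rounding of the tabulated values or a different baseline.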

    Table 2  WER under different dropout and maxout parameters (%)

    Dropout-maxout parameters | HDF = 0.1 | HDF = 0.1 | HDF = 0.2 | HDF = 0.2 | HDF = 0.3 | HDF = 0.3
                              | BN-DF = 0 | BN-DF = 0.1 | BN-DF = 0 | BN-DF = 0.2 | BN-DF = 0 | BN-DF = 0.3
    Pooling size: 512×2 (40×2) | 62.11 | 60.77 | 61.89
    Pooling size: 342×3 (40×3) | 59.72 | 61.14 | 58.95 | 60.32 | 60.13 | 61.5
    Pooling size: 256×4 (40×4) | 61.23 | 60.36 | 61.84
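    To make the notation in Table 2 concrete, the following NumPy sketch shows a maxout layer followed by dropout. It assumes that a pooling size such as 342×3 means 342 maxout units that each take the maximum over 3 linear pieces, and that HDF and BN-DF are the dropout factors of the hidden layers and the BN layer; neither expansion is stated on this page. The function names, the input dimension, and the use of inverted dropout scaling are illustrative choices, not the authors' configuration.

```python
import numpy as np


def maxout_layer(x, W, b, pieces):
    """Maxout: compute units * pieces linear outputs, then take the max of each group.

    x: (batch, d_in), W: (d_in, units * pieces), b: (units * pieces,)
    """
    z = x @ W + b
    batch = z.shape[0]
    units = z.shape[1] // pieces
    # Consecutive columns are grouped into one unit of `pieces` candidates.
    return z.reshape(batch, units, pieces).max(axis=2)


def dropout(h, drop_factor, rng, train=True):
    """Inverted dropout: zero each activation with probability drop_factor."""
    if not train or drop_factor == 0.0:
        return h
    mask = rng.random(h.shape) >= drop_factor
    return h * mask / (1.0 - drop_factor)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 20))                 # 4 frames, 20-dim input (hypothetical)
W = 0.1 * rng.standard_normal((20, 342 * 3))     # 342 units with 3 pieces each
b = np.zeros(342 * 3)
h = dropout(maxout_layer(x, W, b, pieces=3), drop_factor=0.2, rng=rng)
print(h.shape)  # (4, 342)
```

    Only the 342 pooled outputs feed the next layer, so the following weight matrix has 342 rows where a 1 024-unit sigmoid or ReLU layer would give it 1 024. This may be one reason the maxout configurations in Table 1 have fewer parameters.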

    Table 3  Recognition performance (WER) of each feature type under monolingual training (%)

    Recognition task | BNF | CNMF low-dimensional features | SVD low-dimensional features
    Low-resource Vystadial_en | 21.6 | 20.6 | 21.51
    Low-resource Vystadial_cz | 64.8 | 63.76 | 64.43

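    Table 4  WER of CNMF-based low-dimensional features under SHL multilingual training (%)

    CNMF feature extraction scheme | Layer 3 | Layer 4 | Layer 5
    Sigmoid + 40-dimensional factorization | 64.27 | 64.94 | 64.71
    Sigmoid + 50-dimensional factorization | 63.86 | 63.81 | 64.99
    Dropout + Maxout + 40-dimensional factorization | 60.33 | 60.13 | 59.59
    Dropout + Maxout + 50-dimensional factorization | 59.59 | 59.12 | 59.95
    Dropout + ReLU + 40-dimensional factorization | 63.71 | 61.59 | 61.28
    Dropout + ReLU + 50-dimensional factorization | 62.15 | 60.26 | 61.84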

    Table 5  WER of BNF and CNMF-based low-dimensional features on the GMM tandem system (%)

    Experimental configuration | BNF | CNMF low-dimensional features
    Vystadial_en (monolingual fMLLR) + Sigmoid-DNN | 21.6 | 20.6
    Vystadial_cz (monolingual fMLLR) + Sigmoid-DNN | 64.8 | 63.76
    Vystadial_cz (monolingual fbanks) + Sigmoid-DNN | 63.25 | 63.81
    Vystadial_cz (monolingual fbanks) + Dropout-maxout-DNN | 58.95 | 59.12
    Vystadial_cz (monolingual fbanks) + Dropout-ReLU-DNN | 62.74 | 60.26

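    Table 6  WER of SGMM tandem systems and DNN-HMM hybrid systems based on SHL multilingual training (%)

    Activation | DNN hidden-layer structure | BNF | CNMF low-dimensional features | DNN-HMM
    Sigmoid | 5 layers, 1 024 (BN: 40) | 63.15 | 61.79 | 63.94
    Sigmoid | 3 layers, 1 024 (BN: 40) | 63.09 | 61.85 | 63.99
    Sigmoid | 3 layers, 512 (BN: 40) | 63.5 | 61.84 | 63.96
    Dropout + Maxout | 5 layers, 342 (*3, BN: 40) | 58.03 | 57.8 | 58.24
    Dropout + Maxout | 3 layers, 342 (*3, BN: 40) | 60.61 | 60.4 | 63.99
    Dropout + Maxout | 3 layers, 171 (*3, BN: 40) | 62.61 | 64.72 | 68.77
    Dropout + ReLU | 5 layers, 1 024 (BN: 40) | 60.72 | 58.82 | 59.57
    Dropout + ReLU | 3 layers, 1 024 (BN: 40) | 64.35 | 59.16 | 59.92
    Dropout + ReLU | 3 layers, 512 (BN: 40) | 63.43 | 61.68 | 62.2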
Publication history
  • Received:  2015-10-16
  • Accepted:  2016-10-20
  • Published:  2017-07-20
