QIN Chu-Xiong, ZHANG Lian-Hai

QIN Chu-Xiong, ZHANG Lian-Hai. Deep Neural Network Based Feature Extraction for Low-resource Speech Recognition. Acta Automatica Sinica, 2017, 43(7): 1208-1219. doi: 10.16383/j.aas.2017.c150654
Supported by National Natural Science Foundation of China (61673395, 61403415, 61302107)


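    ZHANG Lian-Hai  Associate professor in the Department of Information and System Engineering, Information Engineering University. His research interest covers speech signal processing and intelligent information processing. E-mail: lianhaiz@sina.com

    QIN Chu-Xiong  Ph.D. candidate in the Department of Information and System Engineering, Information Engineering University. His main research interest is intelligent information processing. Corresponding author of this paper. E-mail: chuxiongq313@gmail.com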

Deep Neural Network Based Feature Extraction for Low-resource Speech Recognition



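  • Abstract: To address the sharp degradation of deep neural network (DNN) feature-based acoustic modeling when training data are scarce, two DNN feature extraction methods suited to low-resource speech recognition are proposed. First, building on a network structure with shared-hidden-layer (SHL) training, a corpus with richer resources is used to assist the training of a deep bottleneck neural network. Because the bottleneck (BN) layer lies within the shared layers, dropout, maxout, and rectified linear units (ReLU) are introduced to alleviate the overfitting caused by the irregular distribution of multi-stream training samples, while also reducing the number of network parameters and the training time. Second, to improve DNN feature extraction, a low-dimensional high-level feature extraction technique based on convex non-negative matrix factorization (CNMF) is proposed: a weight matrix of the network is factorized, the resulting basis matrix is used as the weight matrix of the feature layer, and a new low-dimensional feature is extracted from that layer. Experiments on the 1-hour low-resource Czech training corpus of Vystadial 2013 show that, with 26.7 hours of English data used for auxiliary training, dropout with ReLU improves the recognition rate by 7.0% relative to the baseline system, while dropout with maxout improves it by 12.6% relative to the baseline, with 62.7% fewer network parameters than the other systems and 25% less training time. The matrix-factorization-based low-dimensional features achieve better recognition rates than bottleneck features (BNF) under both monolingual training and auxiliary training, and under auxiliary training they also outperform the DNN hidden Markov model (DNN-HMM) recognition system, with gains ranging from 0.8% to 3.4%.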
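The CNMF step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the commonly used multiplicative update rules for convex NMF, in which a matrix X (p × n) is approximated as (XW)Gᵀ with non-negative W and G, and the basis matrix F = XW has `rank` columns. The function names, the random initialization, and the iteration count are hypothetical choices.

```python
import numpy as np


def convex_nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate X (p x n) as (X @ W) @ G.T with W >= 0 and G >= 0.

    Returns the basis F = X @ W (p x rank) together with W and G (n x rank).
    Initialization and stopping rule are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, rank)) / n
    G = rng.random((n, rank))
    Y = X.T @ X                    # n x n Gram matrix, may contain negatives
    Yp = (np.abs(Y) + Y) / 2.0     # element-wise positive part
    Yn = (np.abs(Y) - Y) / 2.0     # element-wise negative part
    for _ in range(n_iter):
        # Multiplicative update of G with W held fixed.
        num = Yp @ W + G @ (W.T @ Yn @ W)
        den = Yn @ W + G @ (W.T @ Yp @ W) + eps
        G *= np.sqrt(num / den)
        # Multiplicative update of W with G held fixed.
        GtG = G.T @ G
        num = Yp @ G + Yn @ W @ GtG
        den = Yn @ G + Yp @ W @ GtG + eps
        W *= np.sqrt(num / den)
    return X @ W, W, G


def reconstruction_error(X, F, G):
    """Frobenius norm of the residual X - F @ G.T."""
    return np.linalg.norm(X - F @ G.T)
```

Under the assumption that X is a p × n weight matrix leaving a p-dimensional hidden layer, F (p × rank) could serve as the weight matrix of a rank-dimensional feature layer, for example `features = hidden @ F` for hidden activations of shape (frames, p). Ranks of 40 and 50 would correspond to the factorization dimensions reported in Table 4.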
  • Fig. 1  SHL-based network structures

    Fig. 2  Diagram of the SHL-BN-MDNN training scheme

    Fig. 3  CNMF-based low-dimensional feature extraction approach

    Fig. 4  WER of CNMF-based low-dimensional features under different factorization parameters
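As a rough illustration of the shared-hidden-layer (SHL) structure named in Fig. 1 and Fig. 2, the sketch below shares a stack of hidden layers, ending in a bottleneck (BN) layer, across languages and gives each language its own softmax output layer. The placement of the BN layer as the last shared layer, the sigmoid activations, the linear BN output, and every dimension and name in the code are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def init_shl_network(in_dim, hidden_dim, bn_dim, n_hidden,
                     targets_per_language, seed=0):
    """Create shared layers plus one output layer per language (hypothetical)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden_dim] * n_hidden + [bn_dim]
    shared = [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
              for a, b in zip(dims[:-1], dims[1:])]
    heads = {lang: (rng.standard_normal((bn_dim, n_out)) * 0.01, np.zeros(n_out))
             for lang, n_out in targets_per_language.items()}
    return shared, heads


def bottleneck_features(x, shared):
    """Run the shared layers; the last one is the BN layer (assumed linear)."""
    h = x
    for W, b in shared[:-1]:
        h = sigmoid(h @ W + b)
    W, b = shared[-1]
    return h @ W + b


def posteriors(x, shared, heads, lang):
    """State posteriors from the language-specific softmax layer."""
    W, b = heads[lang]
    return softmax(bottleneck_features(x, shared) @ W + b)
```

In such a setup, frames from both languages would update the shared layers during training, while each frame updates only the output layer of its own language; the bottleneck activations are then used as features for the low-resource language.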

    Table 1  WER of BNF based on different training methods (%)

    Training scheme                  WER      Number of DNN parameters (MB)
    SHL + BNF                        63.25    8.34
    SHL + Dropout + Maxout + BNF     58.95    3.11
    SHL + Dropout + ReLU + BNF       62.74    8.34
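The relative figures can be checked directly from the Table 1 values. The short sketch below (the helper name is hypothetical) computes the reduction in parameter count and in WER of the Dropout + Maxout system relative to the SHL + BNF row. The parameter figure matches the 62.7% quoted in the abstract; the WER figure here is relative to the SHL + BNF row only, which is not necessarily the baseline system the abstract refers to.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Values taken from Table 1: SHL + BNF vs. SHL + Dropout + Maxout + BNF.
param_cut = relative_reduction(8.34, 3.11)
wer_cut = relative_reduction(63.25, 58.95)

print(f"parameter reduction: {param_cut:.1f}%")     # 62.7%
print(f"relative WER reduction: {wer_cut:.1f}%")    # 6.8%
```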
    Table 2  WER under different dropout and maxout parameters (%)

    Dropout-maxout parameters      HDF = 0.1   HDF = 0.1     HDF = 0.2   HDF = 0.2     HDF = 0.3   HDF = 0.3
                                   BN-DF = 0   BN-DF = 0.1   BN-DF = 0   BN-DF = 0.2   BN-DF = 0   BN-DF = 0.3
    Pooling size: 512×2 (40×2)     62.11                     60.77                     61.89
    Pooling size: 342×3 (40×3)     59.72       61.14         58.95       60.32         60.13       61.5
    Pooling size: 256×4 (40×4)     61.23                     60.36                     61.84
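The pooling sizes in Table 2 pair a number of maxout units with a pool size (for example 342×3 for the hidden layers and 40×3 for the BN layer), and HDF and BN-DF appear to denote the dropout factors of the hidden layers and of the BN layer. A minimal NumPy sketch of the two operations follows; the inverted-dropout scaling, this reading of HDF and BN-DF, and all function names are assumptions rather than details confirmed by the source.

```python
import numpy as np


def maxout(x, weights, bias, pool_size):
    """Maxout layer: linear projection, then max over groups of `pool_size`.

    x: (batch, in_dim); weights: (in_dim, units * pool_size).
    Consecutive output columns are assumed to form one pool.
    """
    z = x @ weights + bias
    batch = z.shape[0]
    units = z.shape[1] // pool_size
    return z.reshape(batch, units, pool_size).max(axis=2)


def dropout(h, drop_factor, rng, training=True):
    """Inverted dropout: zero each activation with probability `drop_factor`
    and rescale the survivors so the expected activation is unchanged."""
    if not training or drop_factor == 0.0:
        return h
    keep = 1.0 - drop_factor
    mask = rng.random(h.shape) < keep
    return h * mask / keep
```

The three settings 512×2, 342×3, and 256×4 give 1024, 1026, and 1024 linear outputs before pooling, so each compares a similar pre-pooling width while the number of units passed to the next layer shrinks as the pool grows.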
    Table 3  Recognition performance (WER) of each feature type under monolingual training (%)

    Table 4  WER of CNMF-based low-dimensional features under SHL multilingual training (%)

    Sigmoid + 40-dimensional factorization              64.27    64.94    64.71
    Sigmoid + 50-dimensional factorization              63.86    63.81    64.99
    Dropout + Maxout + 40-dimensional factorization     60.33    60.13    59.59
    Dropout + Maxout + 50-dimensional factorization     59.59    59.12    59.95
    Dropout + ReLU + 40-dimensional factorization       63.71    61.59    61.28
    Dropout + ReLU + 50-dimensional factorization       62.15    60.26    61.84
    Table 5  WER of BNF and CNMF-based low-dimensional features on the GMM tandem system (%)

                                                               BNF      CNMF low-dimensional feature
    Vystadial_en (monolingual fMLLR) + Sigmoid-DNN             21.6     20.6
    Vystadial_cz (monolingual fMLLR) + Sigmoid-DNN             64.8     63.76
    Vystadial_cz (monolingual fbanks) + Sigmoid-DNN            63.25    63.81
    Vystadial_cz (monolingual fbanks) + Dropout-maxout-DNN     58.95    59.12
    Vystadial_cz (monolingual fbanks) + Dropout-ReLU-DNN       62.74    60.26
    Table 6  WER of SGMM tandem systems and DNN-HMM hybrid systems based on SHL multilingual training (%)

    Sigmoid             5 layers × 1024 (BN: 40)        63.15    61.79    63.94
                        3 layers × 1024 (BN: 40)        63.09    61.85    63.99
                        3 layers × 512 (BN: 40)         63.5     61.84    63.96
    Dropout + Maxout    5 layers × 342 (×3, BN: 40)     58.03    57.8     58.24
                        3 layers × 342 (×3, BN: 40)     60.61    60.4     63.99
                        3 layers × 171 (×3, BN: 40)     62.61    64.72    68.77
    Dropout + ReLU      5 layers × 1024 (BN: 40)        60.72    58.82    59.57
                        3 layers × 1024 (BN: 40)        64.35    59.16    59.92
                        3 layers × 512 (BN: 40)         63.43    61.68    62.2
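Every table above reports word error rate (WER). For reference, the conventional definition, (substitutions + deletions + insertions) divided by the number of reference words, can be computed with a word-level edit distance. The sketch below is a generic illustration with a hypothetical function name; the scoring tool and text normalization used in the experiments are not stated here.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent between two whitespace-tokenized transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (r != h))    # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.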
