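A Survey and Comparison of the Performance Improvements of Different Lattice-based Discriminative Training Methods in Chinese Monolingual and Chinese-English Bilingual Speech Recognition Systems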
doi: 10.3724/SP.J.1004.2012.01162
Improvement Comparison of Different Lattice-based Discriminative Training Methods in Chinese-monolingual and Chinese-English-bilingual Speech Recognition
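Summary: In recent years, discriminative training methods such as MPE, fMPE, and BMMI have brought considerable performance improvements to speech recognition, yet much work on discriminative training remains to be done. This paper investigates three lattice-based discriminative training methods in detail and reports the performance of each. It then analyzes and compares different I-smoothing methods, obtaining more robust models for Chinese monolingual speech recognition. The complementary properties of the different discriminative training methods are studied, and system combination is carried out with the ROVER algorithm. Although discriminative training is usually applied to monolingual speech recognition systems, this paper also systematically studies its application to bilingual speech recognition, including MPE, fMPE, and BMMI. A new method is used to generate better lattices for training the bilingual model, and complementary discriminative training methods are explored in the bilingual setting to obtain the best ROVER combination performance. Experimental results show that the different forms of discriminative training reduce the word error rate in both monolingual and bilingual systems, and that combining complementary discriminative training methods substantially improves system performance.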
Keywords:
- Discriminative training
- Bilingual speech recognition
- Recognizer output voting error reduction (ROVER)
- I-smoothing
Abstract: Discriminative training approaches such as minimum phone error (MPE), feature minimum phone error (fMPE) and boosted maximum mutual information (BMMI) have brought remarkable improvements to speech recognition in recent years; however, much work remains to be done. This paper investigates the performance of three lattice-based discriminative training methods in detail, and compares different I-smoothing methods to obtain more robust models in the Chinese-monolingual situation. The complementary properties of the different discriminative training methods are explored to perform system combination by recognizer output voting error reduction (ROVER). Although discriminative training is normally used in monolingual systems, this paper systematically investigates its use for bilingual speech recognition, including MPE, fMPE, and BMMI. A new method is proposed to generate significantly better lattices for training the bilingual model, and complementary discriminative training models are also explored to get the best ROVER performance in the bilingual situation. Experimental results show that all forms of discriminative training can reduce the word error rate in both monolingual and bilingual systems, and that combining complementary discriminative training methods can improve the performance significantly.
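The abstract reports its results as word error rate (WER). As a reminder of what that metric measures, the following is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; this is not the scoring tool used in the paper, and the paper's tokenization of Chinese text is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: both transcripts are already split into tokens by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted against the reference length, the value can exceed 1.0 when the hypothesis is much longer than the reference.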