Vocal Tract Spectrum Conversion Using a Two-factor Gaussian Process Dynamic Model
-
Abstract: In previous work we developed a two-factor Gaussian process latent variable model (TF-GPLVM) that performs spectral conversion by replacing speaker characteristics. Although it outperforms traditional mapping-based methods, the model has two drawbacks: 1) it does not capture the dynamic characteristics of speech, and 2) it requires a large number of parameters to be estimated during training. To overcome these drawbacks, this paper combines TF-GPLVM with a hidden Markov model (HMM) to build an enhanced two-factor Gaussian process dynamic model (TF-GPDM). In this model, speech dynamics are modeled by the HMM state transition probabilities, and the HMM hidden states are used to softly (probabilistically) classify each speech frame into a limited number of phonetic content classes, which gives more accurate factor separation at a lower computational load. Based on this model, we design a new vocal tract spectrum conversion scheme built on speaker characteristics replacement. Both subjective and objective evaluations show that, compared with traditional mapping-based methods such as the Gaussian mixture model (GMM), with frequency warping (FW) methods, and with the TF-GPLVM based method, the proposed TF-GPDM improves both speech quality and identity similarity, and also reaches a better compromise between the two.
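The soft classification of frames by HMM state mentioned in the abstract can be illustrated with the standard forward-backward algorithm. The sketch below is a generic illustration only, not the authors' TF-GPDM implementation: the function name `state_posteriors`, the toy one-dimensional Gaussian emissions, and the transition matrix are all assumptions made for the example.

```python
import numpy as np

def state_posteriors(log_lik, trans, init):
    """Per-frame HMM state posteriors via scaled forward-backward.

    log_lik: (T, K) log-likelihood of each frame under each state
    trans:   (K, K) row-stochastic state transition matrix
    init:    (K,)   initial state distribution
    """
    T, K = log_lik.shape
    # Subtract the per-frame maximum before exponentiating for stability;
    # this per-frame constant cancels in the normalised posteriors.
    lik = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))

    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)

    # Forward pass with per-frame rescaling.
    alpha[0] = init * lik[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, reusing the forward scaling factors.
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (lik[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy data (hypothetical): 5 scalar "frames", 3 states with unit-variance
# Gaussian emissions centred at -5, 0 and 5.
means = np.array([-5.0, 0.0, 5.0])
frames = np.array([-5.1, -4.9, 0.2, 0.1, 5.3])
log_lik = -0.5 * (frames[:, None] - means[None, :]) ** 2
trans = np.array([[0.80, 0.15, 0.05],
                  [0.05, 0.80, 0.15],
                  [0.05, 0.15, 0.80]])
init = np.full(3, 1.0 / 3.0)

gamma = state_posteriors(log_lik, trans, init)
hard_labels = gamma.argmax(axis=1)
```

Each row of `gamma` is a probability distribution over states for one frame, so a frame contributes to every content class in proportion to its posterior instead of being assigned to a single class.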