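Vocal Tract Spectrum Conversion Using a Two-factor Gaussian Process Dynamic Model

Sun Xinjian; Zhang Xiongwei; Yang Jibin; Cao Tieyong; Zhong Xinyi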

doi:10.3724/SP.J.1004.2014.01198

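1. College of Communication Engineering, PLA University of Science and Technology, Nanjing 210007;
2. College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007

Funds:

Supported by the National Natural Science Foundation of China (61072042), the Natural Science Foundation of Jiangsu Province (BK2012510), and the Pre-research Foundation of PLA University of Science and Technology (20110205, 20110211)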

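Author biography:
Zhang Xiongwei, professor at the College of Command Information Systems, PLA University of Science and Technology. Main research interests: multimedia information processing, intelligent computing, and compressed sensing. E-mail: xwzhang@public1.ptt.js.cn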

Publication history
- Received: 2012-12-12
- Revised: 2013-05-21
- Published: 2014-06-20

Abstract

In previous work we proposed a two-factor Gaussian process latent variable model (TF-GPLVM) that performs spectral conversion by replacing speaker characteristics. Although it outperforms traditional mapping-based methods, the model has two drawbacks: 1) it does not capture the dynamic characteristics of speech, and 2) training requires estimating a large number of parameters. To address both, this paper combines TF-GPLVM with a hidden Markov model (HMM) to build a two-factor Gaussian process dynamic model (TF-GPDM) that separates the two factors more accurately at a lower computational load. In the model, speech dynamics are captured by the HMM state transition probabilities, and the HMM hidden states provide a probabilistic soft classification of each speech frame into a limited number of phonetic content classes. Based on this model, a new vocal tract spectrum conversion scheme based on speaker characteristic replacement is designed. Subjective and objective evaluations show that, compared with traditional statistical mapping methods such as the Gaussian mixture model (GMM), with frequency warping (FW) methods, and with the TF-GPLVM-based method, the proposed method improves both speech quality and conversion similarity and achieves a better balance between the two.
- Vocal tract spectrum conversion /
- Gaussian process latent variable model (GPLVM) /
- two-factor model /
- hidden Markov model (HMM) /
- speech dynamical characteristics
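
The soft classification of frames by HMM state mentioned in the abstract can be illustrated with a small forward-backward sketch. This is not the authors' TF-GPDM: it only shows how state posteriors give each frame a probabilistic class membership while the transition matrix carries the dynamics. The one-dimensional toy frames, the unit-variance Gaussian emissions, and the names `state_posteriors` and `gaussian_log_lik` are assumptions made for illustration.

```python
import numpy as np


def gaussian_log_lik(x, means, var=1.0):
    """Log-likelihood of each 1-D frame under each state's Gaussian (assumed emissions)."""
    diff = x[:, None] - means[None, :]
    return -0.5 * (diff ** 2 / var + np.log(2 * np.pi * var))


def state_posteriors(log_b, A, pi):
    """Scaled forward-backward pass.

    log_b : (T, K) log emission likelihoods per frame and state
    A     : (K, K) state transition matrix, rows sum to 1
    pi    : (K,)   initial state distribution
    Returns gamma with gamma[t, k] = P(state_t = k | all frames).
    """
    T, K = log_b.shape
    # Per-frame rescaling for numerical stability; the constant cancels in gamma.
    b = np.exp(log_b - log_b.max(axis=1, keepdims=True))

    alpha = np.zeros((T, K))
    alpha[0] = pi * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        alpha[t] /= alpha[t].sum()

    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)


# Toy data: three frames near 0 followed by three frames near 5.
frames = np.array([0.1, -0.2, 0.3, 4.8, 5.1, 5.3])
means = np.array([0.0, 5.0])             # one Gaussian per hidden state
A = np.array([[0.9, 0.1], [0.1, 0.9]])   # "sticky" transitions model the dynamics
pi = np.array([0.5, 0.5])

gamma = state_posteriors(gaussian_log_lik(frames, means), A, pi)
```

Each row of `gamma` is a soft assignment of one frame to the hidden states. In a real system the frames would be multi-dimensional spectral feature vectors and the HMM parameters would be trained rather than fixed by hand.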
