
Voice Conversion Based on i-vector With Variational Autoencoding Relativistic Standard Generative Adversarial Network

Li Yan-Ping, Cao Pan, Zuo Yu-Tao, Zhang Yan, Qian Bo

Citation: Li Yan-Ping, Cao Pan, Zuo Yu-Tao, Zhang Yan, Qian Bo. Voice conversion based on i-vector with variational autoencoding relativistic standard generative adversarial network. Acta Automatica Sinica, 2022, 48(7): 1824−1833. doi: 10.16383/j.aas.c190733

doi: 10.16383/j.aas.c190733

Funds: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (61401227), the National Natural Science Foundation of China (61872199, 61872424), and the Special Project of Intelligent Human-Computer Interaction Technology Innovation Team Building of Jinling Institute of Technology (218/010119200113)
More Information
    Author Bio:

LI Yan-Ping Associate professor at the School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications. She received her Ph.D. degree from Nanjing University of Science and Technology in 2009. Her research interests cover voice conversion and speaker recognition. Corresponding author of this paper. E-mail: liyp@njupt.edu.cn

CAO Pan Master student at the School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications. She received her bachelor's degree from Huaiyin Normal University in 2017. Her research interests cover voice conversion and deep learning. E-mail: abreastpc@163.com

ZUO Yu-Tao Master student at the School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications. His main research interest is voice conversion. E-mail: zuoyt@chinatelecom.cn

ZHANG Yan Professor at the School of Software Engineering, Jinling Institute of Technology. She received her Ph.D. degree from Nanjing University of Science and Technology in 2017. Her research interests cover pattern recognition and domain software engineering. E-mail: zy@jit.edu.cn

QIAN Bo Senior engineer at Nanjing Institute of Electronic Technology. He received his Ph.D. degree from Nanjing University of Science and Technology in 2007. His research interests cover pattern recognition and artificial intelligence. E-mail: sandson6@163.com

  • Abstract: This paper proposes a voice conversion method based on the i-vector and a variational autoencoding relativistic standard generative adversarial network (VARSGAN), achieving high-quality many-to-many voice conversion under non-parallel text conditions. A well-performing voice conversion system must both preserve the naturalness of the reconstructed speech and accurately capture the target speaker's individual characteristics in the converted speech. First, to improve the naturalness of the synthesized speech, the relativistic GAN, which offers better generative performance, replaces the Wasserstein GAN in the variational autoencoding Wasserstein GAN model: by constructing a relativistic discriminator, the discriminator's output depends on the relative values between real and generated samples, overcoming the unstable performance and slow convergence of the Wasserstein GAN. Second, to improve the speaker similarity of the converted speech, the i-vector, which carries rich speaker-specific information, is introduced at the decoding stage so that the model fully learns the speaker's individual characteristics. Objective and subjective experiments show that, compared with the baseline model, the average Mel-cepstral distortion (MCD) of the converted speech is reduced by 4.80%, the mean opinion score (MOS) is improved by 5.12%, and the ABX score is improved by 8.60%, verifying that the proposed method significantly improves both speech naturalness and speaker similarity and achieves high-quality voice conversion.
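
    To make the relativistic objective concrete, the following is a minimal sketch of the standard relativistic GAN losses (the generic RSGAN formulation consistent with the description above; the paper's exact losses and weighting may differ). With $C(\cdot)$ the discriminator's raw (pre-sigmoid) output, $x_r$ a real sample, $x_f$ a generated sample, and $\sigma$ the sigmoid function:

        $\mathcal{L}_D = -\,\mathbb{E}_{(x_r, x_f)}\left[\log \sigma\left(C(x_r) - C(x_f)\right)\right]$
        $\mathcal{L}_G = -\,\mathbb{E}_{(x_r, x_f)}\left[\log \sigma\left(C(x_f) - C(x_r)\right)\right]$

    Rather than scoring each sample in isolation, the discriminator estimates the probability that a real sample is more realistic than a generated one, which is what stabilizes training compared with the Wasserstein GAN baseline.

    The decoding-stage i-vector conditioning can likewise be sketched in code. The snippet below is a hypothetical illustration only (layer sizes, dimensions, and names are assumptions, not the paper's actual network, whose structure is given in Fig. 3): the target speaker's i-vector is concatenated with the latent content code before decoding.

        import torch
        import torch.nn as nn

        class IVectorDecoder(nn.Module):
            """Sketch of a decoder conditioned on a speaker i-vector.

            Dimensions are illustrative: z_dim latent content code,
            ivec_dim speaker i-vector, feat_dim spectral features per frame.
            """

            def __init__(self, z_dim: int = 64, ivec_dim: int = 100, feat_dim: int = 36):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(z_dim + ivec_dim, 256),
                    nn.LeakyReLU(0.2),
                    nn.Linear(256, feat_dim),
                )

            def forward(self, z: torch.Tensor, ivector: torch.Tensor) -> torch.Tensor:
                # Concatenate speaker identity (i-vector) with the content code z,
                # so the decoder renders the content in the target speaker's voice.
                return self.net(torch.cat([z, ivector], dim=-1))

    Swapping in a different speaker's i-vector at this point is what turns reconstruction into many-to-many conversion: the encoder output stays the same while the decoder's speaker condition changes.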
  • Fig. 1 Framework of voice conversion based on the VARSGAN + i-vector network

    Fig. 2 Schematic diagram of the VARSGAN + i-vector network

    Fig. 3 Structure of the VARSGAN + i-vector network

    Fig. 4 Average MCD of five models for the 16 conversion cases

    Fig. 5 Comparison of MCD of different models for the four conversion categories

    Fig. 6 Comparison of MOS of five models for different conversion categories

    Fig. 7 ABX test results of five models for intra-gender conversion

    Fig. 8 ABX test results of five models for inter-gender conversion
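
    For reference, the MCD compared in Figs. 4 and 5 is conventionally computed from Mel-cepstral coefficients of time-aligned converted and target frames; assuming the standard definition (the exact variant used in the paper is not reproduced on this page):

        $\mathrm{MCD}\,[\mathrm{dB}] = \dfrac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left( mc_d^{(t)} - mc_d^{(c)} \right)^2}$

    where $mc_d^{(t)}$ and $mc_d^{(c)}$ are the $d$-th Mel-cepstral coefficients of the target and converted speech, averaged over all frames; lower MCD means the converted spectrum is closer to the target speaker's. In the ABX tests of Figs. 7 and 8, listeners judge whether a converted utterance X sounds closer to source speaker A or target speaker B, so a higher preference for the target indicates better speaker similarity.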

Publication History
  • Received: 2019-10-23
  • Accepted: 2020-07-27
  • Published online: 2022-03-08
  • Issue date: 2022-07-01
