
Voice Conversion Based on i-vector with Variational Autoencoding Relativistic Standard Generative Adversarial Network

Li Yan-Ping, Cao Pan, Zuo Yu-Tao, Zhang Yan, Qian Bo

Citation: Li Yan-Ping, Cao Pan, Zuo Yu-Tao, Zhang Yan, Qian Bo. Voice conversion based on i-vector with variational autoencoding relativistic standard generative adversarial network. Acta Automatica Sinica, 2020, 46(x): 1−10. doi: 10.16383/j.aas.c190733


doi: 10.16383/j.aas.c190733
Funds: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (61401227), the General Program of the National Natural Science Foundation of China (61872199, 61872424), and the Special Project for Building the Intelligent Human-Computer Interaction Technology Innovation Team of Jinling Institute of Technology (218/010119200113)
Detailed information
    Author biographies:

    Li Yan-Ping Associate professor at the College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications. Received the Ph.D. degree from Nanjing University of Science and Technology in 2009. Main research interests include voice conversion and speaker recognition. E-mail: liyp@njupt.edu.cn

    Cao Pan Graduate student at the College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications. Received the M.S. degree from Nanjing University of Posts and Telecommunications in 2020. Main research interests include voice conversion and deep learning. E-mail: abreastpc@163.com

    Zuo Yu-Tao Graduate student at the College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications. Received the M.S. degree from Nanjing University of Posts and Telecommunications in 2019. Main research interest is voice conversion. E-mail: zuoyt@chinatelecom.cn

    Zhang Yan Professor at the School of Software Engineering, Jinling Institute of Technology. Received the Ph.D. degree from Nanjing University of Science and Technology in 2017. Main research interests include pattern recognition and domain software engineering. E-mail: zy@jit.edu.cn

    Qian Bo Senior engineer at Nanjing Research Institute of Electronics Technology. Received the Ph.D. degree from Nanjing University of Science and Technology in 2007. Main research interests include pattern recognition and artificial intelligence. E-mail: sandson6@163.com

Abstract: This paper proposes a voice conversion method based on the i-vector and a variational autoencoding relativistic standard generative adversarial network, achieving high-quality many-to-many voice conversion under non-parallel text conditions. A well-performing voice conversion system must both preserve the naturalness of the reconstructed speech and accurately reproduce the individual characteristics of the target speaker. First, to improve the naturalness of the synthesized speech, the Wasserstein GAN in the variational autoencoding Wasserstein GAN model is replaced with a relativistic GAN, which has better generative performance: by constructing a relativistic discriminator, the discriminator output depends on the relative value between real and generated samples, overcoming the unstable performance and slow convergence of the Wasserstein GAN. Further, to improve the speaker similarity of the converted speech, the i-vector, which carries rich speaker-specific information, is introduced at the decoding stage so that the speaker's individual characteristics are fully learned. Objective and subjective experiments show that, compared with the baseline model, the average MCD of the converted speech is reduced by 4.80%, the MOS is improved by 5.12%, and the ABX score is improved by 8.60%, verifying that the proposed method significantly improves both speech naturalness and speaker similarity and achieves high-quality voice conversion.
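The two key ingredients described in the abstract can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the function names (rsgan_d_loss, rsgan_g_loss, decode_with_ivector), the decoder interface, and the point at which the i-vector is concatenated are assumptions made for exposition, not the paper's actual network code.

    import torch
    import torch.nn.functional as F

    def rsgan_d_loss(d_real, d_fake):
        # Relativistic standard GAN discriminator loss: the discriminator
        # estimates how much more realistic a real sample is than a
        # generated one, rather than scoring each sample in isolation.
        return F.binary_cross_entropy_with_logits(
            d_real - d_fake, torch.ones_like(d_real))

    def rsgan_g_loss(d_real, d_fake):
        # Generator loss: push generated samples to appear more realistic
        # than real ones under the relativistic discriminator.
        return F.binary_cross_entropy_with_logits(
            d_fake - d_real, torch.ones_like(d_fake))

    def decode_with_ivector(decoder, z, speaker_ivector):
        # Condition the decoder on the target speaker's i-vector by
        # concatenating it with the latent content code z
        # (hypothetical decoder interface).
        return decoder(torch.cat([z, speaker_ivector], dim=-1))

In a training loop, d_real and d_fake would be the discriminator's raw (pre-sigmoid) outputs on real and reconstructed spectral features; because each loss compares the two outputs directly, the discriminator is relativistic in the sense described in the abstract.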
  • Fig. 1 Framework of voice conversion based on the VARSGAN+i-vector model

    Fig. 2 Schematic diagram of the VARSGAN+i-vector model

    Fig. 3 Network structure of the VARSGAN+i-vector model

    Fig. 4 Average MCD of the converted speech of the five models for the 16 conversion cases

    Fig. 5 Comparison of MCD of the different models for the four conversion categories

    Fig. 6 Comparison of MOS of the five models for the different conversion categories

    Fig. 7 ABX test results of the five models for intra-gender conversion

    Fig. 8 ABX test results of the five models for inter-gender conversion

Publication history
  • Received: 23 October 2019
  • Accepted: 27 July 2020
