Improving Speech Enhancement in Unseen Noise Using Deep Convolutional Neural Network
Abstract: To further improve the performance of deep-learning-based speech enhancement under unseen noise, this paper focuses on the architecture of the neural network. Since the local characteristics of speech and noise signals are strongly correlated in both the time and frequency dimensions, a deep convolutional neural network (DCNN) is used to model the complex nonlinear relationship between noisy speech and clean speech. By designing effective training features and training targets and establishing a reasonable network architecture, a DCNN-based speech enhancement method is proposed. Experimental results show that, under unseen noise conditions, the proposed method clearly outperforms deep neural network (DNN)-based methods in terms of both speech quality and intelligibility.
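The abstract does not reproduce the network details, but the overall approach is a convolutional regression model that exploits the local time-frequency correlations of speech and noise. The sketch below only illustrates this general idea under stated assumptions: the input is a small context window of noisy log-power-spectrum frames treated as a one-channel image, and the training target is the clean log-power spectrum of the center frame. The class name `DCNNEnhancer`, the layer sizes, the context length, and the feature dimension are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch (not the authors' exact model): a 2-D convolutional
# regression network mapping a context window of noisy log-power-spectrum
# frames to the clean log-power spectrum of the center frame.
import torch
import torch.nn as nn

class DCNNEnhancer(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_freq=257, context=11):
        super().__init__()
        # Treat the (context x n_freq) feature patch as a 1-channel image,
        # so the convolutions see local correlations along both time and frequency.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.regressor = nn.Linear(32 * context * n_freq, n_freq)

    def forward(self, x):  # x: (batch, 1, context, n_freq)
        h = self.features(x)
        return self.regressor(h.flatten(1))  # estimated clean log-power spectrum

# Dummy training step on random feature patches, just to show the data flow.
model = DCNNEnhancer()
noisy_patch = torch.randn(8, 1, 11, 257)
clean_target = torch.randn(8, 257)
loss = nn.MSELoss()(model(noisy_patch), clean_target)
loss.backward()
```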
Table 1  Average PESQ scores for the three methods

| Noise type | SNR (dB) | Noisy speech | DNN_11F | DNN_15F | DCNN |
|---|---|---|---|---|---|
| Factory2 | -5 | 1.73 | 2.25 | 2.27 | ${\bf 2.33}$ |
| Factory2 | 0 | 2.07 | 2.57 | 2.58 | ${\bf 2.65}$ |
| Factory2 | 5 | 2.40 | 2.83 | 2.82 | ${\bf 2.89}$ |
| Buccaneer1 | -5 | 1.36 | 1.88 | 1.92 | ${\bf 1.93}$ |
| Buccaneer1 | 0 | 1.63 | 2.24 | 2.26 | ${\bf 2.27}$ |
| Buccaneer1 | 5 | 1.95 | 2.54 | 2.54 | ${\bf 2.56}$ |
| Destroyer engine | -5 | 1.59 | 2.01 | 1.99 | ${\bf 2.15}$ |
| Destroyer engine | 0 | 1.81 | 2.27 | 2.26 | ${\bf 2.46}$ |
| Destroyer engine | 5 | 2.10 | 2.53 | 2.55 | ${\bf 2.76}$ |
| HF channel | -5 | 1.36 | 1.70 | 1.71 | ${\bf 2.03}$ |
| HF channel | 0 | 1.58 | 2.04 | 2.06 | ${\bf 2.37}$ |
| HF channel | 5 | 1.85 | 2.38 | 2.39 | ${\bf 2.65}$ |

Table 2  Average STOI scores for the three methods

| Noise type | SNR (dB) | Noisy speech | DNN_11F | DNN_15F | DCNN |
|---|---|---|---|---|---|
| Factory2 | -5 | 0.65 | 0.76 | 0.76 | ${\bf 0.78}$ |
| Factory2 | 0 | 0.76 | 0.85 | 0.84 | ${\bf 0.86}$ |
| Factory2 | 5 | 0.85 | 0.89 | 0.89 | ${\bf 0.91}$ |
| Buccaneer1 | -5 | 0.51 | 0.66 | 0.66 | ${\bf 0.68}$ |
| Buccaneer1 | 0 | 0.63 | 0.77 | 0.77 | ${\bf 0.78}$ |
| Buccaneer1 | 5 | 0.75 | 0.85 | 0.85 | ${\bf 0.86}$ |
| Destroyer engine | -5 | 0.57 | 0.62 | 0.63 | ${\bf 0.70}$ |
| Destroyer engine | 0 | 0.69 | 0.75 | 0.75 | ${\bf 0.82}$ |
| Destroyer engine | 5 | 0.81 | 0.85 | 0.85 | ${\bf 0.90}$ |
| HF channel | -5 | 0.57 | 0.69 | 0.69 | ${\bf 0.73}$ |
| HF channel | 0 | 0.69 | 0.78 | 0.79 | ${\bf 0.82}$ |
| HF channel | 5 | 0.80 | 0.86 | 0.86 | ${\bf 0.88}$ |

Table 3  Average SegSNR for the three methods (all values in dB)

| Noise type | SNR (dB) | Noisy speech | DNN_11F | DNN_15F | DCNN |
|---|---|---|---|---|---|
| Factory2 | -5 | -6.90 | -0.69 | -0.59 | -0.05 |
| Factory2 | 0 | -4.50 | 0.34 | 0.42 | 0.95 |
| Factory2 | 5 | -1.57 | 1.24 | 1.29 | 1.80 |
| Buccaneer1 | -5 | -7.21 | -1.52 | -1.40 | -0.96 |
| Buccaneer1 | 0 | -4.90 | -0.50 | -0.39 | 0.11 |
| Buccaneer1 | 5 | -2.03 | 0.46 | 0.53 | 1.03 |
| Destroyer engine | -5 | -7.15 | -2.86 | -2.81 | -2.16 |
| Destroyer engine | 0 | -4.90 | -1.37 | -1.24 | -0.54 |
| Destroyer engine | 5 | -1.91 | 0.04 | 0.21 | 0.89 |
| HF channel | -5 | -7.24 | -1.13 | -1.21 | 0.35 |
| HF channel | 0 | -4.91 | 0.05 | -0.02 | 1.34 |
| HF channel | 5 | -2.09 | 1.04 | 1.02 | 2.03 |
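Table 3's metric, segmental SNR (SegSNR), averages the frame-level SNR between the clean reference and the processed waveform, usually with each frame's SNR clamped to a fixed range so that silent frames do not dominate. A minimal NumPy sketch of this standard definition follows; the frame length, frame shift, and the [-10, 35] dB clamp are conventional choices assumed for illustration, not parameters reported in the paper. PESQ and STOI, by contrast, are normally computed with their reference implementations (for example, the third-party `pesq` and `pystoi` Python packages) rather than reimplemented.

```python
# Minimal sketch of segmental SNR (SegSNR): frame-wise SNR between the clean
# reference and the enhanced signal, clamped to a fixed range and averaged
# over frames. Frame size/shift and the [-10, 35] dB clamp are conventional
# assumptions, not values taken from the paper.
import numpy as np

def seg_snr(clean, enhanced, frame_len=512, frame_shift=256,
            snr_min=-10.0, snr_max=35.0, eps=1e-10):
    n = min(len(clean), len(enhanced))
    snrs = []
    for start in range(0, n - frame_len + 1, frame_shift):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        noise_energy = np.sum((c - e) ** 2) + eps
        snr = 10.0 * np.log10(np.sum(c ** 2) / noise_energy + eps)
        snrs.append(np.clip(snr, snr_min, snr_max))
    return float(np.mean(snrs))

# Usage with dummy signals: the enhanced signal is the clean one plus a small
# residual error, so the SegSNR is high.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
enhanced = clean + 0.1 * rng.standard_normal(16000)
print(round(seg_snr(clean, enhanced), 2))
```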