

Utterance-Level Feature Extraction in Text-Independent Speaker Recognition: A Review

Chen Chen, Han Ji-Qing, Chen De-Yun, He Yong-Jun

陈晨, 韩纪庆, 陈德运, 何勇军. 文本无关说话人识别中句级特征提取方法研究综述. 自动化学报, 2020, 46(x): 1−25 doi: 10.16383/j.aas.c200521
Chen Chen, Han Ji-Qing, Chen De-Yun, He Yong-Jun. Utterance-Level feature extraction in text-independent speaker recognition: a review. Acta Automatica Sinica, 2020, 46(x): 1−25 doi: 10.16383/j.aas.c200521


doi: 10.16383/j.aas.c200521
基金项目: 国家自然科学基金 (U1736210, 61673142), 黑龙江省自然科学基金(JJ2019JQ0013, F2017013), 哈尔滨市杰出青年人才基金 (2017RAYXJ013)
Author biographies:

    Chen Chen — Lecturer and postdoctoral researcher at Harbin University of Science and Technology. Research interests: speech signal processing, audio information analysis, and speaker recognition. E-mail: chenc@hrbust.edu.cn

    Han Ji-Qing — Professor and doctoral supervisor at Harbin Institute of Technology. Research interests: speech signal processing and audio information analysis. E-mail: jqhan@hit.edu.cn

    Chen De-Yun — Professor and doctoral supervisor at Harbin University of Science and Technology. Research interests: pattern recognition and machine learning. E-mail: chendeyun@hrbust.edu.cn

    He Yong-Jun — Professor and doctoral supervisor at Harbin University of Science and Technology. Research interests: speech signal processing and image processing. E-mail: holywit@163.com

Utterance-Level Feature Extraction in Text-Independent Speaker Recognition: A Review

Funds: Supported by the National Natural Science Foundation of China (U1736210, 61673142), the Natural Science Foundation of Heilongjiang Province of China (JJ2019JQ0013, F2017013), and the Outstanding Youth Talent Foundation of Harbin of China (2017RAYXJ013)
  • Abstract: Utterance-level feature extraction is one of the important research directions in text-independent speaker recognition. Compared with frame-level features, which can only characterize short-term speech properties, utterance-level features contain richer speaker-specific information; moreover, utterance-level features of utterances of different durations all have a fixed dimensionality, which makes them easier to combine with most commonly used pattern recognition methods. In recent years, research on utterance-level feature extraction has made great progress. Given its importance in speaker recognition, this paper surveys recent representative utterance-level feature extraction methods and techniques from four aspects: front-end processing, feature extraction methods based on task-staged strategies, feature extraction methods based on task-driven strategies, and back-end processing. Finally, future research trends are discussed and analyzed.
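As the abstract notes, utterance-level features give every utterance a fixed-dimensional representation regardless of its duration. A minimal sketch of one such mapping — mean-plus-standard-deviation statistics pooling over frame-level features, in the style of x-vector systems — is shown below (illustrative only; the array shapes and function name are assumptions, not from the paper):

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Pool a variable-length sequence of frame-level features
    (num_frames x feat_dim) into one fixed-dimensional
    utterance-level vector by concatenating the per-dimension
    mean and standard deviation."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Two utterances of different durations (200 and 57 frames of
# hypothetical 24-dimensional frame-level features) both map to
# a 48-dimensional utterance-level vector.
rng = np.random.default_rng(0)
u1 = statistics_pooling(rng.normal(size=(200, 24)))
u2 = statistics_pooling(rng.normal(size=(57, 24)))
assert u1.shape == u2.shape == (48,)
```

The fixed output dimension is what lets such features feed directly into standard classifiers, as the abstract points out.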
  • 图  1  语音活动检测的功能示意图

    Fig.  1  Schematic diagram of voice activity detection

    图  2  MFCC特征提取过程示意图

    Fig.  2  Schematic diagram of MFCC extraction

    图  3  帧级特征序列经特征规整后的直方图对比

    Fig.  3  Histogram comparison of frame-level feature sequences after feature normalization
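The feature normalization compared in Fig. 3 reshapes the per-dimension distribution of a frame-level feature sequence. As one concrete example (not necessarily the exact method shown in the figure), cepstral mean-variance normalization (CMVN) can be sketched as:

```python
import numpy as np

def cmvn(frames: np.ndarray) -> np.ndarray:
    """Cepstral mean-variance normalization: shift each feature
    dimension to zero mean and scale it to unit variance over the
    utterance, reducing stationary channel effects."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0) + 1e-10  # avoid division by zero
    return (frames - mean) / std

rng = np.random.default_rng(1)
x = 3.0 * rng.normal(size=(100, 13)) + 5.0   # biased, scaled toy "MFCCs"
y = cmvn(x)
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-8)
assert np.allclose(y.std(axis=0), 1.0, atol=1e-6)
```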

    图  4  GMM均值超矢量提取过程示意图

    Fig.  4  Schematic diagram of GMM mean supervector extraction
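The GMM mean supervector of Fig. 4 is obtained by MAP-adapting the UBM component means toward an utterance's statistics and stacking the adapted means. A simplified sketch follows; spherical unit covariances and the relevance factor `r` are assumptions made for brevity, not the survey's exact formulation:

```python
import numpy as np

def map_adapted_supervector(frames, ubm_means, ubm_weights, r=16.0):
    """Relevance-MAP adaptation of UBM means (spherical unit
    covariances assumed), followed by stacking the adapted means
    into a single C*F-dimensional GMM mean supervector."""
    C, F = ubm_means.shape
    # Frame posteriors under the spherical-covariance GMM.
    d2 = ((frames[:, None, :] - ubm_means[None, :, :]) ** 2).sum(-1)
    logp = np.log(ubm_weights)[None, :] - 0.5 * d2
    post = np.exp(logp - logp.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)
    N = post.sum(0)                       # zeroth-order statistics
    Fst = post.T @ frames                 # first-order statistics
    alpha = (N / (N + r))[:, None]        # data/prior interpolation
    adapted = alpha * (Fst / np.maximum(N, 1e-10)[:, None]) \
        + (1 - alpha) * ubm_means
    return adapted.reshape(C * F)

rng = np.random.default_rng(2)
ubm_means = rng.normal(size=(8, 13))
sv = map_adapted_supervector(rng.normal(size=(300, 13)),
                             ubm_means, np.full(8, 1 / 8))
assert sv.shape == (8 * 13,)
```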

    图  5  两种网络结构对比

    Fig.  5  Comparison of two different network structures

    图  6  两种目标函数对应网络的结构示意图对比

    Fig.  6  Comparison of the structure of the networks corresponding to the two different objective functions

    图  7  TDMF方法示意图

    Fig.  7  Schematic diagram of TDMF method

    表  1  不同特征空间学习方法汇总信息

    Table  1  Information of different feature space learning methods

    Classical MAP method [29]
    Model: $ {{M}}_{s,h}={{m}}+{{D}}{{z}}_{s,h} $, where $ {{D}} $ is a diagonal matrix and $ {{z}}_{s,h} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $
    Characteristics: MAP adaptation method; cannot perform channel compensation

    Eigenvoice model [36, 37]
    Model: $ {{M}}_{s,h}={{m}}+{{V}}{{y}}_{s,h} $, where $ {{V}} $ is a low-rank matrix and $ {{y}}_{s,h} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $
    Characteristics: yields a low-dimensional utterance-level representation; cannot perform channel compensation

    Eigenchannel model [37]
    Model: $ {{M}}_{s,h}={{m}}+{{D}}{{z}}_{s}+{{U}}{{x}}_{h} $, where $ {{D}} $ is a diagonal matrix, $ {{z}}_{s} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $, $ {{U}} $ is a low-rank matrix, and $ {{x}}_{h} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $
    Characteristics: performs channel compensation; requires multi-channel speech data from the same speaker; the speaker subspace contains residual information

    Joint factor analysis model [38]
    Model: ${{M} }_{s,h}={{m} }+{{V}}{{y} }_{s}+{{U} }{{x} }_{h}+{{D} }{{z} }_{s,h}$, where $ {{V}} $ is a low-rank matrix, $ {{y}}_{s} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $, $ {{U}} $ is a low-rank matrix, $ {{x}}_{h} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $, $ {{D}} $ is a diagonal matrix, and $ {{z}}_{s,h} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $
    Characteristics: learns speaker information and channel information independently; requires multi-channel speech data from the same speaker; high computational complexity

    Total variability space model [39, 40]
    Model: $ {{M}}_{s,h}={{m}}+{{T}}{{w}}_{s,h}+{{\varepsilon}}_{s,h} $, where $ {{T}} $ is a low-rank matrix, $ {{w}}_{s,h} \sim {\mathbb{N}}\left({\bf{0}},{{I}}\right) $, and $ {{\varepsilon}}_{s,h} $ is a residual vector whose form differs among methods
    Characteristics: learns all the variability information in the mean supervector; session compensation is applied after the i-vector feature is extracted

    表  2  基于不同残差假设的无监督总变化空间模型

    Table  2  Unsupervised TVS model based on different residual assumptions

    FEFA [40]
    Model: $ {{M}}_{s,h}={{m}}+{{T}}{{w}}_{s,h} $ (the input is sufficient statistics; no residual assumption)
    E-step: $\begin{align}&{{L} }={\left({{I} }+\displaystyle\sum\limits_{c=1}^{C}{N}_{s,h}^{c}{ {{T} } }_{c}^{\rm{T} }{{\varSigma } }_{c}^{-1}{ {{T} } }_{c}\right)}^{-1}\\ &{{E} }={{L} }\displaystyle\sum\limits_{c=1}^{C}{ {{T} } }_{c}^{\rm{T} }{{\varSigma } }_{c}^{-1}\left({{F} }_{s,h}^{c}-{N}_{s,h}^{c}{{\mu } }_{c}\right)\\ &\Upsilon ={{L}}+{{E}}{{{E}}}^{\rm{T}}\end{align} $
    M-step: ${ {{T} } }_{c}=\left[\displaystyle\sum\limits_{s,h}\left({{F} }_{s,h}^{c}-{N}_{s,h}^{c}{{\mu } }_{c}\right){{E}}^{\rm{T}}\right]{\left(\displaystyle\sum\limits_{s,h}{N}_{s,h}^{c}\Upsilon \right)}^{-1}$
    Computational complexity: $ {\rm{O}}\left(CFR+C{R}^{2}+{R}^{3}\right) $

    PPCA [43, 44]
    Model: $ {{M}}_{s,h}={{m}}+{{T}}{{w}}_{s,h}+{{\varepsilon}}_{s,h} $ (residual covariance matrix is isotropic)
    E-step: $\begin{align}&{{L} }={\left({{I} }+\dfrac{1}{ {\sigma }^{2} }{ {{T} } }^{\rm{T} }{{T} }\right)}^{-1}\\ &{{E} }=\dfrac{1}{ {\sigma }^{2} }{{L} }{ {{T} } }^{\rm{T} }\left({{M} }_{s,h}-{{m} }\right)\\ &\Upsilon ={{L}}+{{E}}{{{E}}}^{\rm{T}} \end{align}$
    M-step: $\begin{align}&{{T} }=\left[\displaystyle\sum\limits_{s,h}\left({{M} }_{s,h}-{{m} }\right){{E}}^{\rm{T}}\right]{\left(\displaystyle\sum\limits_{s,h}\Upsilon \right)}^{-1}\\ &{\sigma }^{2}=\dfrac{1}{CF\displaystyle\sum\limits_{s,h}1}\displaystyle\sum\limits_{s,h}\left\{ {\left({{M} }_{s,h}-{{m} }\right)}^{\rm{T} }\left({{M} }_{s,h}-{{m} }\right)-{\rm{Tr}}\left(\Upsilon { {{T} } }^{\rm{T} }{{T} }\right)\right\} \end{align}$
    Computational complexity: $ {\rm{O}}\left(CFR\right) $

    FA [44, 45]
    Model: $ {{M}}_{s,h}={{m}}+{{T}}{{w}}_{s,h}+{{\varepsilon}}_{s,h} $ (residual covariance matrix is anisotropic)
    E-step: $ \begin{align} &{{L}}={\left({{I}}+{{{T}}}^{\rm{T}}{{\varPhi }}^{-1}{{T}}\right)}^{-1}\\ &{{E}}={{L}}{{{T}}}^{\rm{T}}{{\varPhi }}^{-1}\left({{M}}_{s,h}-{{m}}\right) \\ &\Upsilon ={{L}}+{{E}}{{{E}}}^{\rm{T}}\end{align} $
    M-step: $\begin{align}&{{T} }=\left[\displaystyle\sum\limits_{s,h}\left({{M} }_{s,h}-{{m} }\right){{E}}^{\rm{T}}\right]{\left(\displaystyle\sum\limits_{s,h}\Upsilon \right)}^{-1}\\ &{{\varPhi }}=\dfrac{1}{\displaystyle\sum\limits_{s,h}1}\displaystyle\sum\limits_{s,h}\left\{\left({{M} }_{s,h}-{{m} }\right){\left({{M} }_{s,h}-{{m} }\right)}^{\rm{T} }-{{T}}\Upsilon { {{T} } }^{\rm{T} }\right\}\odot {{I} } \end{align}$
    Computational complexity: $ {\rm{O}}\left(CFR\right) $
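The FEFA E-step in Table 2 — the posterior of the latent i-vector given the zeroth- and first-order Baum-Welch statistics — can be sketched in NumPy as follows. This is an illustrative implementation assuming diagonal covariances; the variable names are my own, not the survey's:

```python
import numpy as np

def ivector_posterior(N, F, T, Sigma_inv, mu):
    """E-step of the FEFA total variability model: returns the
    posterior covariance L and posterior mean E of the latent
    i-vector w, given per-component zeroth-order stats N (C,),
    first-order stats F (C, Fd), factor loadings T (C, Fd, R),
    diagonal inverse covariances Sigma_inv (C, Fd), and UBM
    means mu (C, Fd)."""
    C, Fd, R = T.shape
    precision = np.eye(R)                  # I + sum_c N_c T_c^T S_c^-1 T_c
    proj = np.zeros(R)
    for c in range(C):
        TS = T[c].T * Sigma_inv[c]         # T_c^T Sigma_c^{-1} (diagonal)
        precision += N[c] * TS @ T[c]
        proj += TS @ (F[c] - N[c] * mu[c])
    L = np.linalg.inv(precision)
    E = L @ proj
    return L, E

rng = np.random.default_rng(3)
C, Fd, R = 4, 6, 3
L, E = ivector_posterior(rng.uniform(1, 5, C),
                         rng.normal(size=(C, Fd)),
                         rng.normal(size=(C, Fd, R)),
                         np.ones((C, Fd)),
                         rng.normal(size=(C, Fd)))
assert E.shape == (R,) and np.allclose(L, L.T)
```

The posterior mean `E` is exactly the extracted i-vector; the M-step of Table 2 accumulates these posteriors over all utterances to re-estimate each $T_c$.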

    表  3  基于不同映射关系假设的无监督总变化空间模型

    Table  3  Unsupervised TVS model based on different mapping relations

    Improving the mapping relation
        Local variability model [47]: exploits the local variability between each Gaussian component of the GMM mean supervector and the i-vector feature
        Sparse coding [48]: uses dictionary learning to compress the total variability matrix
        Generalized variability model [49]: extends the Gaussian assumption in the mapping relation to a Gaussian mixture

    Handling non-ideal databases
        Prior compensation [50]: models the prior information of different databases and learns a mapping relation that can compensate for it
        Uncertainty propagation [51]: models the impact of uncertainty in the mapping relation, reducing the effect of environmental distortion

    Speeding up learning
        Generalized i-vector estimation [52]: exploits orthogonality to increase computation speed
        Randomized singular value decomposition [53]: increases computation speed via approximate estimation
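Among the speed-up techniques in Table 3, randomized SVD [53] replaces an exact decomposition with a projection onto a randomly estimated range. A generic sketch of the idea (not the cited paper's exact algorithm; the oversampling parameter is an assumption):

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, seed=0):
    """Approximate the top `rank` singular triplets of A by first
    compressing A onto an orthonormal basis of a random range
    estimate, then taking an exact SVD of the small matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, rank + n_oversample))  # random probes
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal range basis
    B = Q.T @ A                           # small (rank+p) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], S[:rank], Vt[:rank]

rng = np.random.default_rng(6)
A = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 100))  # rank 20
U, S, Vt = randomized_svd(A, rank=20)
# For an exactly rank-20 matrix the reconstruction is exact
# up to numerical error.
assert np.allclose(U @ np.diag(S) @ Vt, A, atol=1e-6)
```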

    表  4  不同有监督总变化空间模型汇总信息

    Table  4  Information of different supervised TVS models

    PLS [54]: learns a common subspace of GMM mean supervectors and their class labels, uses it as the total variability space, and takes the projection of the GMM mean supervector onto this common subspace as the i-vector feature
    PPLS [55]: learns a common latent variable of the GMM mean supervector and its class label, and uses it as the i-vector feature
    SPPCA [56]: learns a common latent variable of the GMM mean supervector and its corresponding long-duration GMM mean supervector, and uses it as the i-vector feature
    Minimax strategy [57]: trains an estimator that minimizes the maximum risk

    表  5  不同会话补偿方法汇总信息

    Table  5  Information of different session compensation methods

    Subspace projection
        LDA [60]: minimizes within-class scatter while maximizing between-class scatter
        WCCN [61]: reduces the expected error rate
        NAP [62]: removes nuisance directions
        NDA [63]: learns locally discriminative between-class information and common within-class information
        LWLDA [64, 65]: computes within-class scatter in a pairwise manner

    Feature reconstruction
        SC [66]: sparsely reconstructs the original features directly
        BSBL [67]: sparsely reconstructs the original features using intra-block correlation
        FDDL [68]: introduces a Fisher regularization term to make the dictionary more discriminative across classes
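Of the subspace-projection methods in Table 5, WCCN [61] has a particularly compact form: the projection is the Cholesky factor of the inverse within-class covariance, so projected features have identity within-class covariance. A sketch under that usual formulation (function names are my own):

```python
import numpy as np

def within_class_cov(X, labels):
    """Average of per-class (biased) covariance estimates."""
    classes = np.unique(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        W += np.cov(X[labels == c], rowvar=False, bias=True)
    return W / len(classes)

def wccn_projection(X, labels):
    """WCCN: B satisfies B @ B.T = W^{-1}, so projecting features
    as X @ B whitens the within-class covariance."""
    W = within_class_cov(X, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 1.0, 1.0, 1.0, 0.5])
labels = rng.integers(0, 10, size=200)
B = wccn_projection(X, labels)
# Within-class covariance of the projected data is the identity.
Wp = within_class_cov(X @ B, labels)
assert np.allclose(Wp, np.eye(5), atol=1e-8)
```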

    表  6  不同目标函数汇总信息

    Table  6  Information of different objective functions

    Multi-class classification
        Cross entropy: ${L_{cro}} = - [y\log \hat y + (1 - y)\log (1 - \hat y)]$
        Softmax: ${L_s} = - \dfrac{1}{N}\displaystyle \sum\limits_{n = 1}^N {\log } \frac{ { { {\rm{e} } ^{ {{\theta } }_{ {y_n} }^{\rm{T} }f({ {{x} }_n})} } } }{ {\displaystyle \sum\limits_{k = 1}^K { { {\rm{e} } ^{ {{\theta } }_k^{\rm{T} }f({ {{x} }_n})} } } } }$
        Center [98]: ${L}_{c}=\dfrac{1}{2N}{\displaystyle \sum\limits_{n=1}^{N}\Vert f(}{{x} }_{n})-{{c} }_{ {y}_{n} }{\Vert }^{2}$
        L-softmax [99]: ${L}_{l-s}=-\dfrac{1}{N}{\displaystyle \sum\limits_{n=1}^{N}{\rm{log} } }\frac{ {\rm{e} }^{\Vert { {{\theta } } }_{ {y}_{n} }\Vert \Vert f({{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})} }{ {\rm{e} }^{\Vert { {{\theta } } }_{ {y}_{n} }\Vert \Vert f({{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})}+{\displaystyle \sum\limits_{k\ne {y}_{n} }{\rm{e} }^{\Vert { {{\theta } } }_{k}\Vert \Vert f({{x} }_{n})\Vert {\rm{cos} }({\alpha }_{k,n})} } }$
        A-softmax [100]: ${L}_{a-s}=-\dfrac{1}{N}{\displaystyle \sum\limits_{n=1}^{N}{\rm{log} } }\frac{ {\rm{e} }^{\Vert f({{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})} }{ {\rm{e} }^{\Vert f({{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})}+{\displaystyle \sum\limits_{k\ne {y}_{n} }{\rm{e} }^{\Vert { {{\theta } } }_{k}\Vert \Vert f({{x} }_{n})\Vert {\rm{cos} }({\alpha }_{k,n})} } }$
        AM-softmax [101]: ${L_{am{\rm{ - } }s} } = - \dfrac{1}{N}\displaystyle \sum\limits_{n = 1}^N {\log } \frac{ { { {\rm{e} } ^{s \cdot [\cos ({\alpha _{ {y_n},n} }) - m]} } } }{ { { {\rm{e} } ^{s \cdot [\cos ({\alpha _{ {y_n},n} }) - m]} } + \displaystyle \sum\limits_{k \ne {y_n} } { { {\rm{e} } ^{\cos ({\alpha _{k,n} })} } } } }$

    Metric learning
        Contrastive [102]: ${L_{con}} = yd\left[ {f({{{{x}}}_1}),f({{{{x}}}_2})} \right] + (1 - y)\max \{ 0,m - d\left[ {f({{{{x}}}_1}),f({{{{x}}}_2})} \right]\} $
        Triplet [103]: ${L_{trip}} = \max \{ 0,d\left[ {f({{{{x}}}_p}),f({{{{x}}}_a})} \right] - d\left[ {f({{{{x}}}_n}),f({{{{x}}}_a})} \right] + m\} $
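The AM-softmax objective from Table 6 can be computed directly from embeddings and class weight vectors. The sketch below is a forward pass only (no training loop), with hypothetical scale and margin values:

```python
import numpy as np

def am_softmax_loss(x, y, W, s=30.0, m=0.35):
    """Additive-margin softmax: cosine similarities between
    L2-normalized embeddings x (N, D) and class weights W (K, D);
    the margin m is subtracted from the target-class cosine
    before scaling by s and applying cross-entropy."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = xn @ Wn.T                                  # (N, K) cosines
    idx = np.arange(len(y))
    logits = s * cos
    logits[idx, y] = s * (cos[idx, y] - m)           # apply margin
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[idx, y].mean()

rng = np.random.default_rng(5)
x = rng.normal(size=(32, 16))
y = rng.integers(0, 10, 32)
W = rng.normal(size=(10, 16))
loss_with_margin = am_softmax_loss(x, y, W)
loss_no_margin = am_softmax_loss(x, y, W, m=0.0)
assert loss_with_margin > loss_no_margin  # the margin makes the task harder
```

Setting `m=0` recovers a plain scaled cosine softmax, which is why the margin strictly increases the loss on any batch.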

    表  7  联合优化方法汇总信息

    Table  7  Information of different joint optimization methods

    Session compensation + classifier
        DNN-PLDA [104]: uses PLDA to guide DNN learning
        Bilevel [105]: uses sparse coding for session compensation, with SVM and softmax classifiers respectively guiding the sparse dictionary learning

    Total variability space + classifier
        TDVM [106]: uses PLDA to guide TVS learning

    All stages
        F2S2I [107]: uses PLDA to guide a DNN that mimics each stage of the i-vector pipeline
        TDMF [108]: uses PLDA to guide UBM and TVS learning

    表  8  常用数据库信息

    Table  8  Information of common databases

    Database | Year | Acoustic environment | Speakers | Utterances / total duration | Open source
    CN-CELEB [126] | 2019 | Multimedia | 1000 | 300 h |
    VoxCeleb1 [73, 89] | 2017 | Multimedia | 1251 | 153,516 |
    VoxCeleb2 [75, 89] | 2018 | Multimedia | 6112 | 1,128,246 |
    SITW [127] | 2016 | Multimedia | 299 | 2800 |
    Forensic Comparison [128] | 2015 | Telephone | 552 | 1264 |
    NIST SRE12 [129] | 2012 | Telephone/microphone | 2000+ | |
    ELSDSR [130] | 2005 | Clean speech | 22 | 198 |
    SWITCHBOARD [131] | 1992 | Telephone | 3114 | 33,039 | −
    TIMIT [132] | 1990 | Clean speech | 630 | 6300 |
  • [1] Reynolds D A. An overview of automatic speaker recognition technology. In: Proceeding of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando, USA: IEEE, 2002.4072−4075
    [2] Aghajan H, Delgado R L-C, Augusto J C. Human-Centric Interfaces for Ambient Intelligence. Oxford: Academic Press, 2010
    [3] Poddar A, Sahidullah M, Saha G. Speaker verification with short utterances: A review of challenges, trends and opportunities. IET Biometrics, 2018, 7(2): 91−101 doi: 10.1049/iet-bmt.2017.0065
    [4] 韩纪庆, 张磊, 郑铁然. 语音信号处理. 第3版. 北京: 清华大学出版社, 2019

    Han Ji-Qing, Zhang Lei, Zheng Tie-Ran. Speech Signal Processing. 3rd. Beijing: Tsinghua University Press, 2019
    [5] Nematollahi M A, Al-Haddad S A R. Distant speaker recognition: An overview. International Journal of Humanoid Robotics, 2016, 13(2): 1−45
    [6] Hansen J H L, Hasan T. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 2015, 32(6): 74−99 doi: 10.1109/MSP.2015.2462851
    [7] Kinnunen T, Li H. An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 2010, 52(1): 12−40 doi: 10.1016/j.specom.2009.08.009
    [8] Markel J, Oshika B, Gray A. Long-term feature averaging for speaker recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1977, 25(4): 330−337 doi: 10.1109/TASSP.1977.1162961
    [9] Li K, Wrench E. An approach to text-independent speaker recognition with short utterances. In: Proceeding of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Boston, USA: IEEE, 1983.555−558
    [10] Chen S H, Wu H T, Chang Y, Truong T K. Robust voice activity detection using perceptual wavelet-packet transform and Teager energy operator. Pattern Recognition Letters, 2007, 28(11): 1327−1332 doi: 10.1016/j.patrec.2006.11.023
    [11] Fujimoto M, Ishizuka K, Nakatani T. A voice activity detection based on the adaptive integration of multiple speech features and a signal decision scheme. In: Proceeding of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Las Vegas, USA: IEEE, 2008.4441−4444
    [12] Li K, Swamy M N S, Ahmad M O. An improved voice activity detection using higher order statistics. IEEE Transactions on Speech and Audio Processing, 2005, 13(5): 965−974 doi: 10.1109/TSA.2005.851955
    [13] Soleimani S A, Ahadi S M. Voice activity detection based on combination of multiple features using linear/kernel discriminant analyses. In: Proceeding of the International Conference on Information and Communication Technologies: From Theory to Applications. Damascus, Syria: IEEE, 2008.1−5
    [14] Sohn J, Kim N S, Sung W A. A statistical model-based voice activity detection. IEEE Signal Processing Letters, 1999, 6(1): 1−3 doi: 10.1109/97.736233
    [15] Chang J H, Kim N S. Voice activity detection based on complex Laplacian model. Electronics Letter, 2003, 39(7): 632−634 doi: 10.1049/el:20030392
    [16] Ramirez J, Segura J C, Benitez C, Garcia L, Rubio A. Statistical voice activity detection using a multiple observation likelihood ratio test. IEEE Signal Processing Letters, 2005, 12(10): 689−692 doi: 10.1109/LSP.2005.855551
    [17] Tong S, Gu H, Yu K. A comparative study of robustness of deep learning approaches for VAD. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016.5695−5699
    [18] Atal B S. Automatic recognition of speakers from their voices. Proceeding of the IEEE, 1976, 64(4): 460−475 doi: 10.1109/PROC.1976.10155
    [19] Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 1980, 28(4): 357−366 doi: 10.1109/TASSP.1980.1163420
    [20] Hermansky H. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 1990, 87(4): 1738−1752 doi: 10.1121/1.399423
    [21] Koenig W, Dunn H, Lacy L. The sound spectrograph. Journal of the Acoustical Society of America, 1946, 18(1): 19−49 doi: 10.1121/1.1916342
    [22] Lecun Y, Boser B, Denker J, Henderson D, Howard R, Hubbard W, et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541−551 doi: 10.1162/neco.1989.1.4.541
    [23] 林景栋, 吴欣怡, 柴毅, 尹宏鹏. 卷积神经网络结构优化综述. 自动化学报, 2020, 46(1): 24−37

    Lin Jing-Dong, Wu Xin-Yi, Chai Yi, Yin Hong-Peng. Structure optimization of convolutional neural networks: A survey. Acta Automatica Sinica, 2020, 46(1): 24−37
    [24] Furui S. Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981, 29(2): 254−272 doi: 10.1109/TASSP.1981.1163530
    [25] Pelecanos J, Sridharan S. Feature warping for robust speaker verification. In: Proceeding of the Odyssey: The Speaker and Language Recognition Workshop, Crete, Greece: ISCA, 2001.1−5
    [26] Sadjadi S O, Slaney M, Heck A L. MSR identity toolbox v1.0: A MATLAB toolbox for speaker recognition research. Microsoft Research Technical Report, 2013
    [27] Campbell W M, Sturim D E, Reynolds D A. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 2006, 13(5): 308−311 doi: 10.1109/LSP.2006.870086
    [28] Reynolds D A. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 1995, 17: 91−108 doi: 10.1016/0167-6393(95)00009-D
    [29] Reynolds D A, Quatieri T F, Dunn R B. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 2000, 10: 19−41 doi: 10.1006/dspr.1999.0361
    [30] Wang W, Han J, Zheng T, Zheng G, Liu H. A robust sparse auditory feature for speaker verification. Journal of Computational Information Systems, 2013, 9(22): 8987−8993
    [31] Wang W, Han J, Zheng T, Zheng G. Robust speaker verification based on max pooling of sparse representation. Journal of Computers, 2014, 24(4): 56−65
    [32] He Y, Chen C, Han J. Noise-robust speaker recognition based on morphological component analysis. In: Proceeding of the Annual Conference of the International Speech Communication Association. Dresden, Germany: ISCA, 2015.3001−3005
    [33] Wang W, Han J, Zheng T, Zheng G, Zhou X. Speaker verification via modeling kurtosis using sparse coding. International Journal of Pattern Recognition and Artificial Intelligence, 2016, 30(3): 1−20
    [34] Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1977, 39(1): 1−38
    [35] Gauvain J, Lee C. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 1994, 2(2): 291−298 doi: 10.1109/89.279278
    [36] Kuhn R, Junqua J, Nguyen P, Niedzielski N. Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 2000, 8(6): 695−707 doi: 10.1109/89.876308
    [37] Kenny P, Mihoubi M, Dumouchel P. New MAP estimators for speaker recognition. In: Proceeding of the European Conference on Speech Communication and Technology. Geneva, Switzerland: ISCA, 2003.2961−2964
    [38] Kenny P, Boulianne G, Ouellet P, Dumouchel P. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio Speech and Language Processing, 2007, 15(4): 1435−1447 doi: 10.1109/TASL.2006.881693
    [39] Dehak N, Dehak R, Kenny P, Brümmer N, Dumouchel, P. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Brighton, UK: ISCA, 2009.1559−1562
    [40] Dehak N, Kenny P J, Dehak R, Dumouchel P, Ouellet P. Front-end factor analysis for speaker verification. IEEE Transactions on Audio Speech and Language Processing, 2011, 19(4): 788−798 doi: 10.1109/TASL.2010.2064307
    [41] Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987, 2(1-3): 37−52 doi: 10.1016/0169-7439(87)80084-9
    [42] Lei Z, Yang Y. Maximum likelihood i-vector space using PCA for speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association, Florence, Italy: ISCA, 2011.2725−2728
    [43] Tipping M E, Bishop C M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999, 61(3): 611−622 doi: 10.1111/1467-9868.00196
    [44] Vestman V, Kinnunen T. Supervector compression strategies to speed up i-vector system development. In: Proceeding of the Odyssey: The Speaker and Language Recognition Workshop. Les Sables d'Olonne, France: ISCA, 2018.357−364
    [45] Gorsuch R L. Factor Analysis. 2nd. Hillsdale: Lawrence Earlbaum Associates, 1983
    [46] Roweis S T. EM algorithms for PCA and SPCA. In: Proceeding of the Advances in Neural Information Processing Systems, Denver, USA: Curran Associates, Inc., 1997.626−632
    [47] Chen L, Lee K A, Ma B, Guo W, Dai L. Local variability vector for text-independent speaker verification. In: Proceeding of the International Symposium on Chinese Spoken Language Processing, Singapore: IEEE, 2014.54−58
    [48] Xu L, Lee K A, Li H, Yang Z. Sparse coding of total variability. In: Proceeding of the Annual Conference of the International Speech Communication Association, Dresden, Germany: ISCA, 2015.102−1026
    [49] Ma J, Sethu V, Ambikairajah E, Lee K A. Generalized variability model for speaker verification. IEEE Signal Processing Letters, 2018, 25(12): 1775−1779 doi: 10.1109/LSP.2018.2874814
    [50] Shepstone S E, Lee K A, Li H, Tan Z, Soren H J. Total variability modeling using sourcespecific priors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(3): 504−517 doi: 10.1109/TASLP.2016.2515506
    [51] Ribas D, Vincent E. An improved uncertainty propagation method for robust i-vector based speaker recognition. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK: IEEE, 2019.6331−6335
    [52] Xu L, Lee K A, Li H, Yang Z. Generalizing i-vector estimation for rapid speaker recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 2018, 26(4): 749−759 doi: 10.1109/TASLP.2018.2793670
    [53] Travadi R, Narayanan S. Efficient estimation and model generalization for the total variability model. Computer Speech and Language, 2019, 53: 43−64 doi: 10.1016/j.csl.2018.07.003
    [54] Chen C, Han J. Partial least squares based total variability space modeling for i-vector speaker verification. Chinese Journal of Electronics. 2018, 27 (6): 1229−1233
[55] Chen C, Han J, Pan Y. Speaker verification via estimating total variability space using probabilistic partial least squares. In: Proceeding of the Annual Conference of the International Speech Communication Association, Stockholm, Sweden: ISCA, 2017.1537−1541
    [56] Lei Y, Hansen J. Speaker recognition using supervised probabilistic principal component analysis. In: Proceeding of the Annual Conference of the International Speech Communication Association, Florence, Italy: ISCA, 2010.382−385
[57] Huber P J. A robust version of the probability ratio test. Annals of Mathematical Statistics, 1965, 36(6): 1753−1758 doi: 10.1214/aoms/1177699803
    [58] Hautamaki V, Cheng Y, Rajan P, Lee C H. Minimax i-vector extractor for short duration speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association, Lyon, France: ISCA, 2013.3708−3712
    [59] Vogt R J, Baker B J, Sridharan S. Modelling session variability in text independent speaker verification. In: Proceeding of the European Conference on Speech Communication and Technology, Lisbon, Portugal: ISCA, 2005.3117−3120
    [60] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7(2): 179−188 doi: 10.1111/j.1469-1809.1936.tb02137.x
    [61] Hatch A O, Kajarekar S, Stolcke A. Within-class covariance normalization for SVM-based speaker recognition. In: Proceeding of the Annual Conference of the International Speech Communication Association, Pittsburgh, USA: ISCA, 2006.1471−1474
    [62] Campbell W M, Sturim D E, Reynolds D A, Solomonoff A. SVM based speaker verification using a GMM supervector kernel and NAP variability compensation. In: Proceeding of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France: IEEE, 2006
[63] Sadjadi S O, Pelecanos J, Zhu W. Nearest neighbor discriminant analysis for robust speaker recognition. In: Proceeding of the Annual Conference of the International Speech Communication Association. Singapore: ISCA, 2014.1860−1864
[64] Misra A, Ranjan S, Hansen J H. Locally weighted linear discriminant analysis for robust speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017.2864−2868
    [65] Misra A, Hansen J H. Modelling and compensation for language mismatch in speaker verification. Speech Communication, 2018, 96: 58−66 doi: 10.1016/j.specom.2017.09.004
    [66] Li M, Zhang X, Yan Y, Narayanan S S. Speaker verification using sparse representations on total variability i-vectors. In: Proceeding of the Annual Conference of the International Speech Communication Association. Florence, Italy: ISCA, 2011.2729−2732
    [67] Wang W, Han J, Zheng T, Zheng G, Shao M. Speaker recognition via block sparse Bayesian learning. International Journal of Multimedia and Ubiquitous Engineering, 2015, 10(7): 247−254 doi: 10.14257/ijmue.2015.10.7.26
    [68] 王伟, 韩纪庆, 郑铁然, 郑贵滨, 陶耀. 基于Fisher判别字典学习的说话人识别. 电子与信息学报, 2016, 38(2): 367−372

    Wang Wei, Han Ji-Qing, Zheng Tie-Ran, Zheng Gui-Bin, Tao Yao. Speaker recognition based on Fisher discrimination dictionary learning. Journal of Electronics & Information Technology, 2016, 38(2): 367−372
    [69] Variani E, Lei X, Mcdermott E, Moreno I L, Gonzalez-Dominguez J. Deep neural networks for small footprint text-dependent speaker verification. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy: IEEE, 2014.4080−4084
[70] Snyder D, Garcia-Romero D, Povey D, Khudanpur S. Deep neural network embeddings for text-independent speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017.999−1003
    [71] Snyder D, Garcia-Romero D, Sell G, Povey D, Khudanpur S. X-vectors: robust DNN embeddings for speaker recognition. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul, South Korea: IEEE, 2018.5329−5333
    [72] Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: Delving deep into convolutional nets. In: Proceeding of the British Machine Vision Conference. Nottingham, UK: Springer, 2014
[73] Nagrani A, Chung J S, Zisserman A. VoxCeleb: A large-scale speaker identification dataset. In: Proceeding of the Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017.2616−2620
    [74] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016.770−778
    [75] Chung J S, Nagrani A, Zisserman A. Voxceleb2: Deep speaker recognition. In: Proceeding of the Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018: 1086−1090
    [76] Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Bengio Y. Generative adversarial nets. In: Proceeding of the Advances in Neural Information Processing Systems, Montreal, Canada: Curran Associates, Inc., 2014.2672−2680
    [77] Zhang Z, Wang L, Kai A, Yamada T, Li W, Iwahashi M. Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification. Eurasip Journal on Audio Speech and Music Processing, 2015, 1: 1−13
    [78] Richardson F, Reynolds D, Dehak N. Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, 2015, 22(10): 1671−1675 doi: 10.1109/LSP.2015.2420092
    [79] Chen Y, Lopez-Moreno I, Sainath T N, Visontai M, Alvarez R, Parada C. Locally connected and convolutional neural networks for small footprint speaker recognition. In: Proceeding of the Annual Conference of the International Speech Communication Association. Dresden, Germany: ISCA, 2015.1136−1140
[80] Li L, Chen Y, Shi Y, Tang Z, Wang D. Deep speaker feature learning for text-independent speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017.1542−1546
    [81] Prince S J D, Elder J H. Probabilistic linear discriminant analysis for inferences about identity. In: Proceeding of the IEEE International Conference on Computer Vision. Rio de Janeiro, Brazil: IEEE, 2007
    [82] Peddinti V, Povey D, Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts. In: Proceeding of the Annual Conference of the International Speech Communication Association. Dresden, Germany: ISCA, 2015.3214−3218
    [83] Villalba J, Chen N, Snyder D, Garcia-Romero D, Dehak N. State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18. In: Proceeding of the Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019.1488−1492
    [84] Povey D, Cheng G, Wang Y, Li K, Khudanpur S. Semi-orthogonal low-rank matrix factorization for deep neural networks. In: Proceeding of the Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018.3743−3747
[85] Snyder D, Garcia-Romero D, Sell G, McCree A, Povey D. Speaker recognition for multi-speaker conversations using x-vectors. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019.5796−5800
    [86] Kanagasundaram A, Sridharan S, Sriram G, Prachi S, Fookes C. A study of x-vector based speaker recognition on short utterances. In: Proceeding of the Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019
    [87] Garcia-Romero D, Snyder D, Sell G, McCree A, Khudanpur S. X-vector DNN refinement with full-length recordings for speaker recognition. In: Proceeding of the Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019.1493−1496
    [88] Hong Q, Wu C, Wang H, Huang C. Statistics pooling time delay neural network based on x-vector for speaker verification. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020.6849−6853
    [89] Nagrani A, Chung J S, Xie W, Zisserman A. Voxceleb: Large-scale speaker verification in the wild. Computer Science and Language, 2020, 60: 1−15
    [90] Hajibabaei M, Dai D. Unified hypersphere embedding for speaker recognition. arXiv preprint arXiv: 1807.08312, 2018
    [91] Xie W, Nagrani A, Chung J S, Zisserman A. Utterance-level aggregation for speaker recognition in the wild. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019.5791−5795
[92] Yu Y, Fan L, Li W. End-to-end text-independent speaker verification with triplet loss on short utterances. In: Proceeding of the Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017.1487−1491
[93] Cai W, Chen J, Li M. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. In: Proceeding of the Odyssey: The Speaker and Language Recognition Workshop. Les Sables d'Olonne, France: ISCA, 2018
    [94] Yu Y, Fan L, Li W. Ensemble additive margin softmax for speaker verification. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019.6046−6050
    [95] Ding W, He L. MTGAN: Speaker verification through multitasking triplet generative adversarial networks. In: Proceeding of the Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018.3633−3637
    [96] Zhou J, Jiang T, Li L, Hong Q, Wang Z, Xia B. Training multi-task adversarial network for extracting noise-robust speaker embedding. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019.6196−6200
    [97] Yang Y, Wang S, Sun M, Qian Y, Yu K. Generative adversarial networks based x-vector augmentation for robust probabilistic linear discriminant analysis in speaker verification. In: Proceeding of the International Symposium on Chinese Spoken Language Processing. Taipei, China: IEEE, 2018.205−209
    [98] Li N, Tuo D, Su D, Li Z, Yu D. Deep discriminative embeddings for duration robust speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018.2262−2266
    [99] Liu Y, He L, Liu J. Large margin softmax loss for speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019.2873−2877
    [100] Huang Z, Wang S, Yu K. Angular softmax for short-duration text-independent speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018.3623−3627
    [101] Yu Y Q, Fan L, Li W J. Ensemble additive margin softmax for speaker verification. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019.6046−6050
    [102] Bhattacharya G, Alam M J, Gupta V, Kenny P. Deeply fused speaker embeddings for text-independent speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018.3588−3592
    [103] Zhang C, Koishida K, Hansen J H. Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2018, 26(9): 1633−1644 doi: 10.1109/TASLP.2018.2831456
    [104] Zheng T, Han J, Zheng G. Deep neural network based discriminative training for i-vector/PLDA speaker verification. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018.5354−5358
    [105] Chen C, Wang W, He Y, Han J. A bilevel framework for joint optimization of session compensation and classification for speaker identification. Digital Signal Processing, 2019, 89: 104−115 doi: 10.1016/j.dsp.2019.03.008
    [106] Chen C, Han J. Task-driven variability model for speaker verification. Circuits, Systems, and Signal Processing. 2020, 39: 3125−3144
    [107] Rohdin J, Silnova A, Diez M, Plchot O, Matejka P, Burget L. End-to-end DNN based speaker recognition inspired by i-vector and PLDA. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018.4874−4878
    [108] Chen C, Han J. TDMF: Task-driven multilevel framework for end-to-end speaker verification. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020.6809−6813
    [109] Migdalas A, Pardalos P M, Varbrand P. Multilevel Optimization: Algorithms and Applications. Germany: Springer Science & Business Media, 2013
    [110] Kenny P. Bayesian speaker verification with heavy-tailed priors. In: Proceeding of the Odyssey: The Speaker and Language Recognition Workshop. Brno, Czech Republic: ISCA, 2010.1−4
    [111] Garcia-Romero D, Espy-Wilson C Y. Analysis of i-vector length normalization in speaker recognition systems. In: Proceeding of the Annual Conference of the International Speech Communication Association. Florence, Italy: ISCA, 2011.249−252
    [112] Pan Y, Zheng T, Chen C. I-vector Kullback-Leibler divisive normalization for PLDA speaker verification. In: Proceeding of the IEEE Global Conference on Signal and Information Processing. Montreal, Canada: IEEE, 2017.56−60
    [113] Burget L, Plchot O, Cumani S, Glembek O, Brümmer N. Discriminatively trained probabilistic linear discriminant analysis for speaker verification. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Prague, Czech Republic: IEEE, 2011.4832−4835
    [114] Cumani S, Laface P. Joint estimation of PLDA and nonlinear transformations of speaker vectors. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2017, 25(10): 1890−1900 doi: 10.1109/TASLP.2017.2724198
    [115] Cumani S, Laface P. Scoring heterogeneous speaker vectors using nonlinear transformations and tied PLDA models. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2018, 26(5): 995−1009 doi: 10.1109/TASLP.2018.2806305
    [116] Kenny P, Stafylakis T, Ouellet P, Alam M J, Dumouchel P. PLDA for speaker verification with utterances of arbitrary duration. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013.7649−7653
    [117] Ma J, Sethu V, Ambikairajah E, Lee K A. Twin model G-PLDA for duration mismatch compensation in text-independent speaker verification. In: Proceeding of the Annual Conference of the International Speech Communication Association. San Francisco, USA: ISCA, 2016.1853−1857
    [118] Ma J, Sethu V, Ambikairajah E, Lee K A. Duration compensation of i-vectors for short duration speaker verification. Electronics Letters, 2017, 53(6): 405−407 doi: 10.1049/el.2016.4629
    [119] Villalba J, Lleida E. Handling i-vectors from different recording conditions using multi-channel simplified PLDA in speaker recognition. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013.6763−6767
    [120] Garcia-Romero D, McCree A. Supervised domain adaptation for i-vector based speaker recognition. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy: IEEE, 2014.4047−4051
    [121] Richardson F S, Reynolds D A, Nemsick B. Channel compensation for speaker recognition using MAP adapted PLDA and denoising DNNs. In: Proceeding of the Odyssey: The Speaker and Language Recognition Workshop. Bilbao, Spain: ISCA, 2016.225−230
    [122] Hong Q, Li L, Zhang J, Wan L, Guo H. Transfer learning for PLDA-based speaker verification. Speech Communication, 2017, 92: 90−99 doi: 10.1016/j.specom.2017.05.004
    [123] Li N, Mak M W. SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015, 23(10): 1648−1659 doi: 10.1109/TASLP.2015.2442757
    [124] Mak M W, Pang X, Chien J T. Mixture of PLDA for noise robust i-vector speaker verification. IEEE/ACM Transactions on Audio Speech and Language Processing, 2016, 24(1): 130−142 doi: 10.1109/TASLP.2015.2499038
    [125] Villalba J, Miguel A, Ortega A, Lleida E. Bayesian networks to model the variability of speaker verification scores in adverse environments. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(12): 2327−2340 doi: 10.1109/TASLP.2016.2607343
    [126] Fan Y, Kang J, Li L, Li K, Wang D. CN-CELEB: A challenging Chinese speaker recognition dataset. arXiv preprint arXiv:1911.01799, 2019
    [127] McLaren M, Ferrer L, Castan D, Lawson A. The speakers in the wild (SITW) speaker recognition database. In: Proceeding of the Annual Conference of the International Speech Communication Association. San Francisco, USA: ISCA, 2016.818−822
    [128] Morrison G, Zhang C, Enzinger E, Ochoa F, Bleach D, Johnson M, et al. Forensic database of voice recordings of 500+ Australian English speakers[Online], available: http://databases.forensic-voice-comparison.net, November 10, 2020.
    [129] Greenberg C S. The NIST year 2012 speaker recognition evaluation plan. Technical Report, 2012
    [130] Feng L, Hansen L K. A new database for speaker recognition. IMM-Technical Report, 2005
    [131] Godfrey J J, Holliman E C, McDaniel J. SWITCHBOARD: Telephone speech corpus for research and development. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. San Francisco, USA: IEEE, 1992.517−520
    [132] Jankowski C, Kalyanswamy A, Basson S, Spitz J. NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database. In: Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing. Albuquerque, USA: IEEE, 1990.109−112
    [133] Wang Jin-Jia, Ji Shao-Nan, Cui Lin, Xia Jing, Yang Qian. Domestic activity recognition based on attention capsule network. Acta Automatica Sinica, 2019, 45(11): 2199−2204 (in Chinese)
    [134] Wang H, Dinkel H, Wang S, Qian Y, Yu K. Dual-adversarial domain adaptation for generalized replay attack detection. In: Proceeding of the Annual Conference of the International Speech Communication Association, Shanghai, China: ISCA, 2020.1086−1090
    [135] Huang Ya-Ting, Shi Jing, Xu Jia-Ming, Xu Bo. Research advances and perspectives on the cocktail party problem and related auditory models. Acta Automatica Sinica, 2019, 45(2): 3−20 (in Chinese)
    [136] Lin Q, Hou Y, Li M. Self-attentive similarity measurement strategies in speaker diarization. In: Proceeding of the Annual Conference of the International Speech Communication Association, Shanghai, China: ISCA, 2020.284−288
Publication History
  • Received: 2020-07-09
  • Revised: 2020-09-03
  • Published online: 2020-12-10