Utterance-level Feature Extraction in Text-independent Speaker Recognition: A Review

Chen Chen, Han Ji-Qing, Chen De-Yun, He Yong-Jun

Zhu Yu, Fang Guan-Shou, Zheng Bing-Bing, Han Fei. Research on detection method of refined rotated boxes in remote sensing. Acta Automatica Sinica, 2023, 49(2): 415−424 doi: 10.16383/j.aas.c200261
Citation: Chen Chen, Han Ji-Qing, Chen De-Yun, He Yong-Jun. Utterance-level feature extraction in text-independent speaker recognition: A review. Acta Automatica Sinica, 2022, 48(3): 664−688 doi: 10.16383/j.aas.c200521


doi: 10.16383/j.aas.c200521

Funds: Supported by National Natural Science Foundation of China (62101163), Natural Science Foundation of Heilongjiang Province (LH2021F029), China Postdoctoral Science Foundation (2021M701020), Heilongjiang Postdoctoral Fund (LBH-Z20020), and Fundamental Research Foundation for Universities of Heilongjiang Province (2020-KYYWF-0341)
More Information
    Author Biographies:

    CHEN Chen  Lecturer and postdoctoral researcher at Harbin University of Science and Technology. Her research interests cover speech signal processing, audio information analysis, and speaker recognition. E-mail: chenc@hrbust.edu.cn

    HAN Ji-Qing  Professor at Harbin Institute of Technology. His research interests cover speech signal processing and audio information analysis. Corresponding author of this paper. E-mail: jqhan@hit.edu.cn

    CHEN De-Yun  Professor at Harbin University of Science and Technology. His research interests cover pattern recognition and machine learning. E-mail: chendeyun@hrbust.edu.cn

    HE Yong-Jun  Professor at Harbin University of Science and Technology. His research interests cover speech signal processing and image processing. E-mail: holywit@163.com

  • Abstract: Utterance-level feature extraction is one of the important research directions in text-independent speaker recognition. Compared with frame-level features, which can only characterize short-term speech properties, utterance-level features carry richer speaker-specific information; moreover, utterance-level features of utterances of different durations all have a fixed dimensionality, which makes them easier to combine with most commonly used pattern recognition methods. In recent years, research on utterance-level feature extraction has made great progress. Given its central role in speaker recognition, this paper surveys recent representative utterance-level feature extraction methods and techniques, covering front-end processing, feature extraction based on task-segmented and task-driven strategies, and back-end processing, and concludes with a discussion and analysis of future research trends.
  • In recent years, with the development of remote sensing technology, high-quality remote sensing images have become increasingly abundant, laying the foundation for applications in the remote sensing field. Remote sensing images are widely used in disaster monitoring, resource surveying, land-use evaluation, agricultural output estimation, urban planning, and other areas [1], and are of great significance for social and economic development. Object detection, as one application of remote sensing image processing, obtains the category and location of specific objects in an image. It usually focuses on objects such as aircraft, airports, ships, bridges, and vehicles, and therefore has very important uses in both civilian and military domains [2]. In the civilian domain, locating ships facilitates maritime rescue operations, and locating vehicles facilitates vehicle counting and road-congestion analysis. In the military domain, detecting such objects helps to lock onto target positions quickly and precisely, analyze battlefield situations, and plan operations. Accurate detection of objects in remote sensing images is therefore crucial.

    Object detection is an important and challenging research topic in computer vision. With the rapid development of deep learning, the performance of object detectors has improved markedly, and they are widely used across industries. Current detectors fall roughly into two categories: two-stage and one-stage detectors [3]. Two-stage detectors are based on the region-based convolutional neural network (R-CNN) framework and detect in two stages: the first stage generates a set of candidate regions from the image, and the second stage extracts features from those regions and makes predictions with a classifier and a regressor. Faster R-CNN [4], a classic two-stage detector, introduced the region proposal network (RPN) to generate proposals, enabling fast and accurate end-to-end detection. Later two-stage detectors such as the region-based fully convolutional network (R-FCN) [5] and Cascade R-CNN [6] further improved detection accuracy. One-stage detectors reduce detection to a regression problem: classification and regression are carried out directly by a stack of convolutional layers, with no proposal generation or separate feature-extraction stage, so these methods are usually faster. For example, Redmon et al. [7] proposed the YOLO detector, which divides the image into a grid and regresses bounding boxes directly from each cell; Liu et al. [8] proposed the SSD detector, which classifies and regresses directly on feature maps of several scales; and Lin et al. [9] proposed the focal loss, which addresses the class imbalance of one-stage detectors and further improves accuracy. These advanced techniques are mostly designed to produce horizontal bounding boxes. In remote sensing images, however, most objects appear in arbitrary orientations, and for objects with large aspect ratios or dense arrangements, horizontal boxes contain too much redundant information and degrade detection. Rotation therefore becomes a factor that cannot be ignored.

    Early rotated-box detection algorithms applied to remote sensing mainly came from text detection, e.g., R2CNN [10] and RRPN [11]. However, because remote sensing images have complex backgrounds and widely varying spatial resolutions, they are considerably harder than binary text detection, so these strong text detectors do not transfer well to remote sensing. In recent years, with the development of object detection and deeper study of remote sensing imagery, many well-performing rotated-box detectors have emerged. Ding et al. [12] proposed the RoI Transformer, which converts horizontal RoIs into rotated RoIs and performs box regression within the learner; Zhang et al. [13] enhanced features by capturing the correlation between the global scene and local features; Azimi et al. [14] proposed an image cascade method based on multi-scale convolution kernels; Yang et al. [15] proposed a pixel attention mechanism that suppresses image noise and highlights object features, and introduced an IoU constant factor into the smooth L1 loss [4] to resolve the boundary problem of rotated boxes and make their prediction more accurate. Yang et al. [16] designed a refinement module whose feature-adjustment step aligns features through interpolation. Xu et al. [17] proposed regressing four length ratios that represent the relative offsets of the corresponding sides, and introduced an obliquity factor, the area ratio between a ground-truth box and its horizontal bounding box, to choose between horizontal and rotated detection for each object. Wei et al. [18] proposed detecting rotated objects by predicting their internal centerlines. Li et al. [19] proposed deriving rotated boxes from predicted masks. Wang et al. [20] proposed a feature pyramid network (FPN) enhancement based on initial lateral connections, together with a semantic attention network that supplies semantic features for extracting objects from complex backgrounds.

    Current rotated-box detection methods for remote sensing thus fall roughly into two kinds. In the first, the overall architecture remains a horizontal-box detector, and only extra variables such as an angle term are added to the regression branch; the pixels used for prediction then include considerable background, which easily causes the angle drift and frequent missed detections illustrated in Fig. 1. The second kind presets anchors that carry angles and predicts from the pixels inside rotated proposals; because objects can take many rotation angles, this approach needs a large number of preset anchors to guarantee recall, which greatly increases the computational cost.

    Fig. 1  Visualization of object detection problems in remote sensing images

    To address these shortcomings, this paper combines the advantages of the two approaches and, building on Faster R-CNN [21], proposes R2-FRCNN (refined rotated Faster R-CNN), a network for rotated-box detection. The network applies the two approaches in sequence: the first, which turns horizontal boxes into rotated boxes, is treated as a coarse adjustment, and the rotated boxes it produces serve as the preset boxes of the second approach, which adjusts them again in a fine adjustment. The two-stage adjustment yields more precise predicted boxes. In addition, because remote sensing images contain many small objects, this paper proposes a pixel-recombination feature pyramid network (PFPN); compared with the conventional pyramid, it combines local and global feature information, strengthening the feature response of small objects against complex backgrounds. To better extract features that characterize objects for the subsequent prediction stages, an integral region-of-interest pooling method (IRoIPool) is designed for the coarse stage and a rotated region-of-interest pooling method (RRoIPool) for the fine stage, improving the detection accuracy of small objects in complex backgrounds. Finally, both stages use a prediction branch that combines fully connected and convolutional layers together with the SmoothLn regression loss, further improving performance.

    The remainder of this paper is organized as follows: Section 1 describes the proposed rotated-box detection network R2-FRCNN in detail; Section 2 evaluates the method by comparing it with the official baseline and existing methods and through ablation experiments on each module; Section 3 concludes.

    This section describes the structure of the proposed R2-FRCNN and its modules. We first present the overall architecture, then detail each module (the pixel-recombination pyramid, RoI feature extraction, and the prediction branch), and finally introduce the loss functions used.

    Fig. 2 shows the overall structure of R2-FRCNN, which consists of five parts: the backbone network, the pixel-recombination pyramid, the region proposal network (RPN), the coarse adjustment stage, and the fine adjustment stage.

    Fig. 2  The structure of R2-FRCNN

    ResNet [22] is adopted as the backbone, and the C3, C4, C5, and C6 feature maps are used to build the feature pyramid, strengthening the network's ability to detect small objects. On each of the five pyramid levels P3-P7, three anchors are preset at every pixel, with aspect ratios {1:1, 1:2, 2:1} and a base scale of 8; the RPN [4] adjusts the anchor positions to generate a set of candidate boxes. The 2000 highest-confidence candidates are passed to the coarse adjustment stage, whose regression turns horizontal boxes into rotated boxes. These candidates then enter the fine adjustment stage, where the rotated boxes are adjusted again for better detection. For the boxes produced by the two-stage adjustment, the maximum classification score of the latter stage is taken as the confidence, and rotated non-maximum suppression is applied to keep the high-confidence boxes within each neighborhood while suppressing low-confidence ones; the surviving high-confidence candidates are the network's output predictions.
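    The rotated non-maximum suppression step at the end of this pipeline can be made concrete with a short sketch. The following is a minimal Python illustration, assuming boxes are parameterized as $(c_x, c_y, w, h, \theta)$ and polygon overlap is computed with the third-party shapely package; function names and the IoU threshold are illustrative, not taken from the paper.

```python
import numpy as np
from shapely.geometry import Polygon

def rbox_to_polygon(box):
    """Turn a (cx, cy, w, h, theta) rotated box into a shapely polygon."""
    cx, cy, w, h, theta = box
    dx, dy = w / 2.0, h / 2.0
    corners = np.array([[-dx, -dy], [dx, -dy], [dx, dy], [-dx, dy]])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return Polygon(corners @ rot.T + np.array([cx, cy]))

def rotated_nms(boxes, scores, iou_thr=0.1):
    """Keep high-confidence boxes, suppress overlapping lower-scored ones."""
    order = np.argsort(scores)[::-1]          # highest confidence first
    polys = [rbox_to_polygon(b) for b in boxes]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        survivors = []
        for j in order[1:]:
            inter = polys[i].intersection(polys[j]).area
            union = polys[i].area + polys[j].area - inter
            if inter / union <= iou_thr:      # low overlap: keep for later rounds
                survivors.append(j)
        order = np.array(survivors, dtype=int)
    return keep
```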

    Feature pyramid structures [23] are widely used in advanced detectors: shallow layers localize accurately while deep layers are semantically rich, and fusing deep and shallow feature maps improves small-object detection. As Table 1 shows, RoI Transformer (RT) [12], CADNet [13], SCRDet [15], R3Det [16], and GV R-CNN (GV) [17] all fuse deep and shallow features and perform well, whereas R2CNN [10], which uses no feature fusion, falls far behind the other methods. Fig. 3 shows the pixel-recombination pyramid designed in this paper. It has two stages: the first, from $C_i$ to $M_i$, uses a scale transformation that exploits local feature information while fusing adjacent levels to build the pyramid; the second, from $M_i$ to $P_i$, uses a non-local attention [24] module that exploits global information to highlight object regions.

    Table 1  Comparison of detection accuracy (%) of different methods on DOTA

    Category | R2CNN[10] | RT[12] | CADNet[13] | SCRDet[15] | R3Det[16] | GV[17] | Ours
    Plane | 80.94 | 88.64 | 87.80 | 89.98 | 89.24 | 89.64 | 89.10
    Baseball diamond | 65.67 | 78.52 | 82.40 | 80.65 | 80.81 | 85.00 | 81.22
    Bridge | 35.34 | 43.44 | 49.40 | 52.09 | 51.11 | 52.26 | 54.47
    Ground track field | 67.44 | 75.92 | 73.50 | 68.36 | 65.62 | 77.34 | 72.97
    Small vehicle | 59.92 | 68.81 | 71.10 | 68.36 | 70.67 | 73.01 | 79.99
    Large vehicle | 50.91 | 73.68 | 64.50 | 60.32 | 76.03 | 73.14 | 82.28
    Ship | 55.81 | 83.59 | 76.60 | 72.41 | 78.32 | 86.82 | 87.64
    Tennis court | 90.67 | 90.74 | 90.90 | 90.85 | 90.83 | 90.74 | 90.54
    Basketball court | 66.92 | 77.27 | 79.20 | 87.94 | 84.89 | 79.02 | 87.31
    Storage tank | 72.39 | 81.46 | 73.30 | 86.86 | 84.42 | 86.81 | 86.33
    Soccer-ball field | 55.06 | 58.39 | 48.40 | 65.02 | 65.10 | 59.55 | 54.20
    Roundabout | 52.23 | 53.54 | 60.90 | 66.68 | 57.18 | 70.91 | 68.18
    Harbor | 55.14 | 62.83 | 62.00 | 66.25 | 68.10 | 72.94 | 76.12
    Swimming pool | 53.35 | 58.93 | 67.00 | 68.24 | 68.98 | 70.86 | 70.83
    Helicopter | 48.22 | 47.67 | 62.20 | 65.21 | 60.88 | 57.32 | 59.19
    mAP | 60.67 | 69.56 | 69.90 | 72.61 | 72.81 | 75.02 | 76.02
    Fig. 3  The structure of the pixel-recombination pyramid

    In the first stage, feature upsampling is a key operation for the pyramid. The most common upsampling methods are interpolation and transposed convolution [25]. Interpolation considers only neighboring pixels and cannot capture the rich semantic information needed for dense prediction. Transposed convolution, the inverse of convolution, has two drawbacks as an upsampler [26]: 1) the same kernel is applied over the whole feature map regardless of its content, limiting the response to local variation; and 2) a large kernel adds many parameters. This paper instead adopts a scale transformation for upsampling. The fusion of deep and shallow features proceeds as shown in Fig. 4: the method first uses the "channel transformation" of [27] to compress the channel dimension (compression factor $r=0.5$ here) and enlarge the spatial size, i.e.:

    $${I_{H,W,C}} = {I_{\left\lfloor {H/r} \right\rfloor ,\left\lfloor {W/r} \right\rfloor ,C \cdot {r^2} + r \cdot {\rm{mod}}\left( {W,r} \right) + {\rm{mod}}\left( {H,r} \right)}}$$ (1)

    Fig. 4  The structure of feature fusion

    Then a $1 \times 1$ convolution adjusts the number of channels, a Softmax [28] is applied to each channel of the feature map, and finally the weighted sum of Eq. (2) is computed, so that the fusion makes better use of local information.

    $$ \left\{\begin{aligned} &{y}_{m,n,c}=\displaystyle\sum\limits_{i=-2}^{2}\displaystyle\sum\limits_{j=-2}^{2}{x}_{m+i,n+j,c}\cdot {w}_{m,n,k} \\ &k=\left(i+2\right)\times 5+j+2 \end{aligned}\right. $$ (2)

    where $m$ and $n$ denote the horizontal and vertical positions of a pixel, $c$ denotes the current channel of the $C$ feature map, and $k$ denotes the current channel of the $M$ feature map.
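    To make the two equations concrete, the following is a minimal PyTorch sketch of this fusion step; the class name ScaleTransformUp and the hyperparameters are illustrative, not the paper's code. pixel_shuffle plays the role of the channel-to-space transformation of Eq. (1), and unfold plus a softmax-normalized $1 \times 1$ convolution realizes the $5 \times 5$ weighted sum of Eq. (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleTransformUp(nn.Module):
    """Channel-to-space upsampling (Eq. (1)) followed by the softmax-weighted
    5 x 5 neighborhood sum of Eq. (2)."""
    def __init__(self, in_ch, up=2, k=5):
        super().__init__()
        self.up, self.k = up, k
        # 1x1 conv producing one weight per neighbor position, per pixel
        self.reduce = nn.Conv2d(in_ch // (up * up), k * k, kernel_size=1)

    def forward(self, x):
        x_up = F.pixel_shuffle(x, self.up)            # Eq. (1): (B, C/up^2, uH, uW)
        B, C, H, W = x_up.shape
        w = torch.softmax(self.reduce(x_up), dim=1)   # (B, k*k, H, W) weights
        # Gather each pixel's k x k neighborhood, then take the weighted sum.
        patches = F.unfold(x_up, self.k, padding=self.k // 2)
        patches = patches.view(B, C, self.k * self.k, H, W)
        return (patches * w.unsqueeze(1)).sum(dim=2)  # Eq. (2): (B, C, H, W)
```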

    The second stage uses a non-local attention module, exploiting the relation between objects and global features in the feature map to strengthen the response of object regions.

    Following the definition of the non-local attention module, let $C$ be the number of channels, $s$ the spatial size, $G = s \times s$ the product of the feature-map dimensions, and $x$ the input feature map. $q(x)$, $k(x)$, and $v(x)$ are defined as the results of different linear transformations:

    $$ q\left({x}^{s}\right)={{W}_{q}^{s}}^{\rm T}{x}^{s} $$ (3)
    $$ k\left({x}^{s}\right)={{W}_{k}^{s}}^{\rm T}{x}^{s} $$ (4)
    $$ v\left({x}^{s}\right)={{W}_{v}^{s}}^{\rm T}{x}^{s} $$ (5)

    where the coefficient matrices $ {\boldsymbol{W}}_{q}^{s},\;{\boldsymbol{W}}_{k}^{s}\in {\bf{R}}^{C\times C/8} $ and $ {\boldsymbol{W}}_{v}^{s}\in {\bf{R}}^{C\times C} $.

    Multiplying $q(x^{s})$ by $k(x^{s})$ gives the two-dimensional matrix $ {o}^{s}\in {\bf{R}}^{G\times G} $; a Softmax then turns each row of this matrix into probabilities, and the result is multiplied by $v(x^{s})$ and added to the input, giving the output $ {x^{s}}' $:

    $$ {{x}^{s}}'={x}^{s}+{\left({o}^{s}v^{\rm{T}}{\left({x}^{s}\right)}\right)}^{\rm T} $$ (6)

    In the feature pyramid of this paper, $ {M}_{3} $ and $ {M}_{4} $ output by the first stage are large, so applying the non-local attention module to them directly is expensive. To preserve their semantic information while fusing across levels once more, the structure pools $ {M}_{3} $ and $ {M}_{4} $ to the size of $ {M}_{5} $, feeds the mean of these three levels into the non-local attention module, and then interpolates the output back to the corresponding sizes. The non-local attention module is applied directly to the $ {M}_{6} $ and $ {M}_{7} $ feature maps to obtain the $ {P}_{6} $ and $ {P}_{7} $ levels.
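    A minimal PyTorch sketch of the non-local attention module of Eqs. (3)-(6) is given below, with $1 \times 1$ convolutions standing in for the linear maps $W_q$, $W_k$, and $W_v$; the class name is illustrative.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local attention of Eqs. (3)-(6): q/k project to C/8 channels,
    attention is softmax(q^T k) over all G = H*W positions, output is residual."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)   # W_q: C -> C/8
        self.k = nn.Conv2d(channels, channels // 8, 1)   # W_k: C -> C/8
        self.v = nn.Conv2d(channels, channels, 1)        # W_v: C -> C

    def forward(self, x):
        B, C, H, W = x.shape
        G = H * W
        q = self.q(x).view(B, -1, G)                     # (B, C/8, G)
        k = self.k(x).view(B, -1, G)                     # (B, C/8, G)
        v = self.v(x).view(B, C, G)                      # (B, C, G)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # o^s: (B, G, G)
        out = v @ attn.transpose(1, 2)                   # (B, C, G)
        return x + out.view(B, C, H, W)                  # residual add, Eq. (6)
```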

    The RoI feature-extraction module mainly fixes the output size and extracts features that characterize the region inside a box, for use in the subsequent prediction. In this paper it has two parts: horizontal-box RoI feature extraction in the coarse stage and rotated-box RoI feature extraction in the fine stage.

    Objects in natural-scene images usually appear in a fixed orientation, so two-stage detectors use horizontal-box RoI feature extraction; the most widely used operators are RoIPooling [4] and RoIAlign [29]. Fig. 5(a) illustrates RoI pooling: the maximum pixel value in each quantized bin is taken as the pooled result, but the quantization biases the pixels extracted for small objects and hurts detection. Fig. 5(b) illustrates RoI alignment, which removes the quantization and uses bilinear interpolation to compute the pixel values at N floating-point coordinates in each bin, averaging them as the bin's result. This operation still has two drawbacks: the number of sampling points must be preset, and candidate boxes of different sizes get the same number of sampling points.

    Fig. 5  Schematic diagram of common RoI feature extraction

    This paper therefore adopts the feature-extraction operation of precise RoI (Pr-RoI) pooling [30], shown in Fig. 6: interpolation treats the features inside a bin as a continuous function, the integral of which gives the pixel sum over the whole bin, and its mean is taken as the bin's result, i.e.:

    $$ {\rm{IRoIPool}}\left(bin,{\cal{F}}\right)=\dfrac{{\int }_{{y}_{1}}^{{y}_{2}}{\int }_{{x}_{1}}^{{x}_{2}}f\left(x,y\right){\rm d}x{\rm d}y}{\left({x}_{2}-{x}_{1}\right)\times \left({y}_{2}-{y}_{1}\right)} $$ (7)

    where $ f(x,y) $ is the pixel value obtained by area interpolation [15].

    Fig. 6  The diagram of IRoIPool feature extraction
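    The integral mean of Eq. (7) can also be approximated numerically; the sketch below, assuming PyTorch, makes the bin content continuous via bilinear interpolation (grid_sample) and averages a dense sample grid. This only illustrates the idea; Pr-RoI pooling [30] evaluates the integral analytically.

```python
import torch
import torch.nn.functional as F

def iroipool_bin(feat, x1, y1, x2, y2, n=16):
    """feat: (1, C, H, W) feature map; bin corners in pixel coordinates.
    Returns the (approximate) integral mean of Eq. (7) over the bin."""
    H, W = feat.shape[-2:]
    xs = torch.linspace(x1, x2, n)
    ys = torch.linspace(y1, y2, n)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([2 * gx / (W - 1) - 1, 2 * gy / (H - 1) - 1], dim=-1)
    samples = F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)
    return samples.mean(dim=(-2, -1))   # mean over the n x n sample grid
```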

    For rotated-box RoI feature extraction, applying the integral directly is complicated, so this paper treats the integral as the sum over a certain number of pixels inside the bin and takes their mean as the bin value, i.e.:

    $$ {\rm{RRoIPool}}\left(bin,{\cal{F}}\right)=\frac{\displaystyle\sum\limits _{y={y}_{1}}^{{y}_{2}}\displaystyle\sum\limits _{x={x}_{1}}^{{x}_{2}}f\left(x,y\right)}{{N}_{x}\times {N}_{y}} $$ (8)
    $$ {N_x} = \left\lfloor {\dfrac{{{x_2} - {x_1}}}{{{l_x}}}} \right\rfloor + 1,{N_y} = \left\lfloor {\dfrac{{{y_2} - {y_1}}}{{{l_y}}}} \right\rfloor + 1 $$ (9)

    where $ ({x}_{1},{y}_{1}) $ and $ ({x}_{2},{y}_{2}) $ are the top-left and bottom-right corners of the rotated box at its horizontal position, and $ {l}_{x} $ and $ {l}_{y} $ are the sampling intervals in the horizontal and vertical directions, as shown in Fig. 7.

    The number of sampling points is thus determined by the size of the candidate box. Because too small a sampling interval greatly increases computation, the intervals $ {l}_{x} $ and $ {l}_{y} $ are set to 0.4 to balance detection efficiency and accuracy.

    Let $ ({x}_{h},{y}_{h}) $ be a sampling point of the rotated box at its horizontal position, $ \theta $ the angle between the side corresponding to the box width $ w $ and the positive horizontal axis, and $ ({c}_{x},{c}_{y}) $ the box center. Eq. (10) converts the point to the coordinates $ (x,y) $ in the rotated box, and area interpolation then gives the pixel value at that position.

    $$ \left[ \begin{array}{c}x\\ y\end{array} \right] = \left[ \begin{array}{ccc}{\rm cos}\theta & - {\rm sin}\theta & \left(1 - {\rm cos}\theta \right) \cdot {c}_{x} + {\rm sin}\theta \cdot {c}_{y}\\ {\rm sin}\theta & {\rm cos}\theta & - {\rm sin}\theta \cdot {c}_{x} + \left(1 - {\rm cos}\theta \right) \cdot {c}_{y} \end{array} \right]\left[ \begin{array}{c}{x}_{h}\\ {y}_{h}\\ 1\end{array} \right] $$ (10)
    Fig. 7  The diagram of rotated RoI feature extraction

    The proposed method resembles R3Det in refining the localization of rotated boxes. However, each refinement in R3Det predicts directly with convolutional layers, and since convolution slides horizontally, using it for rotated-box regression includes background pixels that disturb the prediction; our method instead extracts the feature information inside the rotated RoI for prediction, which benefits detection performance.

    Object detection comprises two tasks, localization and classification. In general, two-stage detectors use fully connected prediction branches while one-stage detectors use convolutional ones. Wu et al. [31] found that the two tasks suit different branch structures: fully connected layers suit classification, and convolutional layers suit regression. This paper therefore adopts the prediction branch shown in Fig. 8.

    Fig. 8  The diagram of the prediction branch

    In the adopted prediction branch, the classification part is unchanged and remains fully connected, while the regression branch uses a series of ResBlock structures from the ResNet family (two in this paper).

    The loss of the proposed network comprises the RPN stage $ {L}_{{\rm{RPN}}} $, the coarse adjustment stage $ {L}_{ro} $, and the fine adjustment stage $ {L}_{re} $:

    $$ L={L}_{{\rm{RPN}}}+{L}_{ro}+{L}_{re} $$ (11)

    Each stage's loss contains a classification loss and a regression loss. Classification uses the cross-entropy loss [4]. Regression uses the SmoothLn loss [32] of Eq. (12); compared with the smooth L1 loss [4], its first derivative exists and is continuous everywhere, giving good smoothness.

    $$ S{L}_{n}\left(x\right)=\left(\left|x\right|+1\right){\rm ln}\left(\left|x\right|+1\right)-\left|x\right| $$ (12)
    $$ \dfrac{\partial S{L}_{n}\left(x\right)}{\partial x}={\rm sign}\left(x\right)\cdot \ln\left(\left|x\right|+1\right) $$ (13)
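    As a sketch, the SmoothLn loss of Eq. (12) is a one-liner in PyTorch:

```python
import torch

def smooth_ln(x):
    """SmoothLn of Eq. (12); its derivative is sign(x) * ln(|x| + 1), Eq. (13)."""
    a = x.abs()
    return (a + 1) * torch.log(a + 1) - a
```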

    In addition, the RPN stage in Eq. (11) regresses horizontal boxes, so a horizontal box is represented by the four values $x, y, w, h$; the coarse and fine adjustment stages regress rotated boxes, represented by the five values $x, y, w, h, \theta$. The regression targets of a rotated box are therefore defined as:

    $$ \left[\begin{array}{c}{t}_{x}\\ {t}_{y}\end{array}\right] = \left[\begin{array}{cc}{\rm cos}\theta & {\rm sin}\theta \\ -{\rm sin}\theta & {\rm cos}\theta \end{array}\right]\left[\begin{array}{c}{x}_{t}-{x}_{a}\\ {y}_{t}-{y}_{a}\end{array}\right]\left[\begin{array}{cc}\dfrac{1}{{w}_{a}}& 0\\ 0& \dfrac{1}{{h}_{a}}\end{array}\right] $$ (14)
    $$ {t}_{w}=\log_2\left(\frac{{w}_{t}}{{w}_{a}}\right),\;\;\;{t}_{h}=\log_2\left(\frac{{h}_{t}}{{h}_{a}}\right) $$ (15)
    $$ {t}_{\theta }=\left({\theta }_{t}-{\theta }_{a}\right){\rm{mod}}\;2\pi $$ (16)

    where $x, y, w, h, \theta$ are the horizontal and vertical coordinates of the box center, the box width and height, and the rotation angle, and the subscripts $t$ and $a$ (as in $ {x}_{t} $ and $ {x}_{a} $) denote the ground-truth box and the candidate box, respectively.
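    A minimal NumPy sketch of these regression targets follows, assuming the candidate-box angle $\theta_a$ defines the rotation in Eq. (14); function and variable names are illustrative.

```python
import numpy as np

def encode_rbox(t, a):
    """t: ground-truth (x, y, w, h, theta); a: candidate box in the same format."""
    xt, yt, wt, ht, tt = t
    xa, ya, wa, ha, ta = a
    cos_t, sin_t = np.cos(ta), np.sin(ta)
    dx, dy = xt - xa, yt - ya
    tx = (cos_t * dx + sin_t * dy) / wa     # Eq. (14): offsets in the box frame
    ty = (-sin_t * dx + cos_t * dy) / ha
    tw = np.log2(wt / wa)                   # Eq. (15), base-2 log as in the text
    th = np.log2(ht / ha)
    t_theta = (tt - ta) % (2 * np.pi)       # Eq. (16): angle offset mod 2*pi
    return np.array([tx, ty, tw, th, t_theta])
```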

    Experiments use a server with an Intel E5-2683 CPU, NVIDIA GTX 1080Ti GPUs, and 64 GB of memory, running Ubuntu 16.04.4 with CUDA 9.0, cuDNN 7.4.2, PyTorch 1.1.0, and Python 3.7.

    Training uses three GPUs with a batch size of 3 (limited by GPU memory), and input images are uniformly 1024$\times$1024. Training runs for 15 epochs using stochastic gradient descent with weight decay 0.0001 and momentum 0.9; the initial learning rate is 0.01 and is divided by 10 at epochs 8, 11, and 14. Fig. 9 shows the training loss curve on DOTA (one epoch is 4500 iterations); a clear drop appears at epoch 8 (36000 iterations).
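    This schedule maps directly onto a standard PyTorch optimizer configuration; the sketch below is illustrative rather than the authors' training code.

```python
import torch

def build_optimizer(model):
    """SGD with momentum 0.9 and weight decay 1e-4; lr 0.01 divided by 10
    at epochs 8, 11, and 14, as in the schedule described above."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[8, 11, 14],
                                                 gamma=0.1)
    return opt, sched
```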

    Fig. 9  Training loss on DOTA

    DOTA [21] is used to evaluate the algorithm. DOTA is a large public dataset annotated with rotated boxes, mainly for object detection in remote sensing images. It contains 2806 images collected by various sensors and platforms, ranging from 800 × 800 to 4000 × 4000 pixels and covering various scales, orientations, and shapes. Experts annotated 15 common categories, 188282 object instances in total: plane, baseball diamond, bridge, ground track field, small vehicle, large vehicle, ship, tennis court, basketball court, storage tank, soccer-ball field, roundabout, harbor, swimming pool, and helicopter. Half of the images form the training set, 1/6 the validation set, and 1/3 the test set, whose annotations are not public. To reduce the effect on small objects of compressing high-resolution images, all images are cropped into 1024 × 1024 sub-images with an overlap of 200 pixels.
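    The cropping step can be sketched as follows, assuming NumPy images; with a 1024-pixel tile and a 200-pixel overlap the stride is 824, so objects near tile borders appear intact in at least one tile. Function and variable names are illustrative.

```python
import numpy as np

def crop_tiles(img, tile=1024, overlap=200):
    """Cut a large image into tile x tile crops with the given overlap;
    border crops are zero-padded so every tile has the same size."""
    stride = tile - overlap
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            patch = img[y:y + tile, x:x + tile]
            ph, pw = tile - patch.shape[0], tile - patch.shape[1]
            if ph > 0 or pw > 0:
                pad = ((0, ph), (0, pw)) + ((0, 0),) * (img.ndim - 2)
                patch = np.pad(patch, pad)
            tiles.append(((x, y), patch))   # keep the offset for mapping boxes back
    return tiles
```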

    ResNet50 combined with deformable convolution [33] serves as the backbone in this section. To evaluate performance, all experiments use the officially provided training and test sets, and results are obtained by submitting to the DOTA evaluation server. The proposed method achieves a mean average precision of 0.7602, exceeding the official baseline [21].

    Besides the official baseline, this section also compares against R2CNN [10], RoI Transformer [12], CADNet [13], SCRDet [15], R3Det [16], and GV R-CNN [17]; the detection results of each method are shown in Table 1.

    As Table 1 shows, the proposed method outperforms the other methods, reaching 76.02% mAP and achieving the highest accuracy on bridge, small vehicle, large vehicle, ship, and harbor. As Fig. 10 shows, objects of these categories are small in remote sensing data and often densely arranged, indicating that the method is particularly advantageous in such scenes. For larger categories such as plane, tennis court, basketball court, storage tank, and swimming pool, the method still comes close to the best results among the other methods. These results show that the method is effective for detecting objects in remote sensing images.

    Fig. 10  Visualization of detection results for each category

    1) Effect of each module on detection accuracy

    To verify the effectiveness of each module, this section runs a series of comparison experiments. Table 2 shows the detection results on DOTA under different module settings, where each row adds one setting and ConvFc denotes the prediction branch designed in Section 1.4. The analysis is as follows:

    Table 2  Ablation results of the R2-FRCNN modules on DOTA (settings added cumulatively)

    Setting | mAP (%)
    Baseline | 69.52
    + Fine adjustment | 73.62
    + IRoIPool | 73.99
    + RRoIPool | 74.31
    + PFPN | 74.97
    + SmoothLn | 75.13
    + ConvFc | 75.96

    a) Baseline. The extended Faster R-CNN OBB [21] is used for the rotated-box detection task: the backbone is ResNet50 [22] with a feature pyramid [23], RoI features are extracted with RoIAlign [29], and the regression branch uses the smooth L1 loss [4]. For fairness and accuracy, the parameter settings of all subsequent experiments are kept strictly identical.

    b) Fine adjustment. In this stage the initial proposal features are extracted with rotated RoIAlign (RRoIAlign), the application of RoIAlign [29] to rotated boxes. Table 2 shows that adding the fine adjustment stage improves detection substantially, raising mAP by 4.10%. This confirms that further adjustment using the pixels inside rotated proposals is necessary: the stage avoids the excessive background pixels of horizontal-box feature extraction and thus improves detection of objects with large aspect ratios. Experiments also showed that repeating the refinement brings little further gain: going from one refinement to two raises mAP only 0.06%, to 73.68%, so to reduce the number of parameters, the fine adjustment stage uses a single refinement in all subsequent experiments.

    c) RoI feature extraction. The IRoIPool and RRoIPool proposed in Section 1.3 replace RoIAlign and RRoIAlign in the initial two adjustment stages. Table 2 shows that, relative to the initial RoI extraction, IRoIPool raises mAP by 0.37% and RRoIPool by a further 0.32%, indicating that the RoI extraction designed here is more effective. The structure of these two extraction methods is studied further below.

    d) PFPN. To verify the role of PFPN, two sets of experiments were designed. In the first, the pyramid performs no scale transformation or non-local attention between deep and shallow levels, using only a $1\times1$ convolution to reduce each feature level to 256 channels; with all other structures and training hyperparameters unchanged, mAP is only 64.55%. Since DOTA contains many small objects, this shows that the PFPN pyramid markedly benefits small-object detection. The second set, shown in Table 2, gives PFPN a 0.66% mAP gain over FPN, showing that the proposed PFPN is more effective for remote sensing objects.

    e) Prediction branch. Two aspects are examined: the regression loss function and the branch structure. Table 2 shows that replacing smooth L1 with SmoothLn as the regression loss raises mAP by 0.16%. Furthermore, the branch of Section 1.4, with fully connected layers for classification and convolutional layers for regression (adding only two ResBlock modules), raises mAP by 0.83%. The SmoothLn loss and convolutional regression are therefore better suited to rotated-box object detection.

    2) Study of the RoI feature-extraction module

    This section studies how different RoI extraction structures affect detection accuracy, in two parts: feature extraction for horizontal proposals and for rotated proposals. The results are given in Tables 3 and 4, respectively.

    Table 3  Experimental results of different horizontal-box feature extraction methods (baseline + fine adjustment)

    Method | RoIPooling | RoIAlign | IRoIPool
    mAP (%) | 71.21 | 73.62 | 73.99
    Table 4  Experimental results of different rotated-box feature extraction methods (baseline + fine adjustment + IRoIPool)

    Method | RRoI A-Pooling | RRoIAlign | RRoIPool
    mAP (%) | 73.38 | 73.99 | 74.31

    Table 3 shows that RoIPooling gives relatively low accuracy, its quantization hurting small-object detection. RoIAlign removes the quantization and uses interpolation, raising mAP by 2.41%, which shows that extracting continuous features benefits detection. Our method adds integration on top of area interpolation, raising mAP a further 0.37%: compared with picking a fixed number of points, the integral effectively samples many more points and extracts richer features, aiding detection.

    Table 4 gives the results of different rotated-box feature extraction methods. The first, rotated region-of-interest average pooling (RRoI A-Pooling), samples pixels inside the rotated box and averages them as the extracted feature. The second selects floating-point coordinates inside the rotated box in the manner of RoIAlign and obtains the corresponding values by bilinear interpolation, raising mAP by 0.61%. Our RRoIPool chooses the number of sample points according to the rotated box size and improves on the second method by 0.32%, showing that this rotated-box extraction better suits the fine adjustment module.

    Deep-learning object detection has made great progress on natural-scene images, but remote sensing images pose difficulties such as complex backgrounds, many small objects, and arbitrary orientations, which common detectors do not meet. This paper therefore proposes R2-FRCNN, a rotated-box detection network combining coarse and fine adjustment stages for remote sensing detection tasks. A pixel-recombination pyramid is designed to improve small-object detection in complex backgrounds; a horizontal-box feature extraction method, IRoIPool, is designed for the coarse stage and a rotated-box method, RRoIPool, for the fine stage; and the SmoothLn regression loss together with a prediction branch combining fully connected and convolutional layers further improves accuracy. Experiments show good detection results on the large public dataset DOTA. The method still suffers from slow detection and high GPU consumption, so future work will study lightweight versions of the network.

  • Fig. 1  Schematic diagram of voice activity detection

    Fig. 2  Schematic diagram of MFCC extraction

    Fig. 3  Histogram comparison of frame-level feature sequences after feature normalization

    Fig. 4  Schematic diagram of GMM mean supervector extraction

    Fig. 5  Comparison of two different network structures

    Fig. 6  Comparison of the structure of the networks corresponding to the two different objective functions

    Fig. 7  Schematic diagram of TDMF method

    Table 1  Information of different feature space learning methods

    Method | Description | Characteristics
    Classical MAP [29] | $ {\boldsymbol{M}}_{s,h}={\boldsymbol{m}}+{\boldsymbol{D}}{\boldsymbol{z}}_{s,h} $; $ {\boldsymbol{D}} $ is diagonal, $ {\boldsymbol{z}}_{s,h} \sim {\rm{N}}\left({\bf{0}},{\boldsymbol{I}}\right) $ | MAP adaptation; no channel compensation
    Eigenvoice model [36-37] | $ {\boldsymbol{M}}_{s,h}={\boldsymbol{m}}+{\boldsymbol{V}}{\boldsymbol{y}}_{s,h} $; $ {\boldsymbol{V}} $ is low-rank, $ {\boldsymbol{y}}_{s,h} \sim {\rm{N}}\left({\bf{0}},{\boldsymbol{I}}\right) $ | Yields a low-dimensional utterance-level representation; no channel compensation
    Eigenchannel model [37] | $ {\boldsymbol{M}}_{s,h}={\boldsymbol{m}}+{\boldsymbol{D}}{\boldsymbol{z}}_{s}+{\boldsymbol{U}}{\boldsymbol{x}}_{h} $; $ {\boldsymbol{D}} $ is diagonal, $ {\boldsymbol{z}}_{s} \sim {\rm{N}}\left({\bf{0}},{\boldsymbol{I}}\right) $; $ {\boldsymbol{U}} $ is low-rank, $ {\boldsymbol{x}}_{h} \sim {\rm{N}}\left({\bf{0}},{\boldsymbol{I}}\right) $ | Performs channel compensation; needs multi-channel speech of the same speaker; the speaker subspace contains residual information
    Joint factor analysis [38] | ${\boldsymbol{M} }_{s,h}={\boldsymbol{m} }+{\boldsymbol{V}}{\boldsymbol{y} }_{s}+{\boldsymbol{U} }{\boldsymbol{x} }_{h}+{\boldsymbol{D} }{\boldsymbol{z} }_{s,h}$; $ {\boldsymbol{V}} $ and $ {\boldsymbol{U}} $ are low-rank, $ {\boldsymbol{D}} $ is diagonal; $ {\boldsymbol{y}}_{s},{\boldsymbol{x}}_{h},{\boldsymbol{z}}_{s,h} \sim {\rm{N}}\left({\bf{0}},{\boldsymbol{I}}\right) $ | Learns speaker and channel information independently; needs multi-channel speech of the same speaker; high computational complexity
    Total variability space model [39-40] | $ {\boldsymbol{M}}_{s,h}={\boldsymbol{m}}+{\boldsymbol{T}}{\boldsymbol{w}}_{s,h}+{\boldsymbol{\varepsilon}}_{s,h} $; $ {\boldsymbol{T}} $ is low-rank, $ {\boldsymbol{w}}_{s,h} \sim {\rm{N}}\left({\bf{0}},{\boldsymbol{I}}\right) $; $ {\boldsymbol{\varepsilon}}_{s,h} $ is a residual vector | Learns all variability in the mean supervector; session compensation is applied after extracting the I-vector feature; the form of $ {\boldsymbol{\varepsilon}}_{s,h} $ differs across methods

    Table 2  Unsupervised TVS models based on different residual assumptions

    FEFA [40]: $ {\boldsymbol{M}}_{s,h}={\boldsymbol{m}}+{\boldsymbol{T}}{\boldsymbol{w}}_{s,h} $ (the inputs are statistics; no residual assumption)
    E-step: $$ {\boldsymbol{L}}={\left({\boldsymbol{I}}+\sum\limits_{c=1}^{C}{N}_{s,h}^{c}{{\boldsymbol{T}}}_{c}^{\rm{T}}{\boldsymbol{\Sigma}}_{c}^{-1}{{\boldsymbol{T}}}_{c}\right)}^{-1},\;\;{\boldsymbol{E}}={\boldsymbol{L}}\sum\limits_{c=1}^{C}{{\boldsymbol{T}}}_{c}^{\rm{T}}{\boldsymbol{\Sigma}}_{c}^{-1}\left({\boldsymbol{F}}_{s,h}^{c}-{N}_{s,h}^{c}{\boldsymbol{\mu}}_{c}\right),\;\;\Upsilon={\boldsymbol{L}}+{\boldsymbol{E}}{{\boldsymbol{E}}}^{\rm{T}} $$
    M-step: $$ {{\boldsymbol{T}}}_{c}=\left[\sum\limits_{s,h}\left({\boldsymbol{F}}_{s,h}^{c}-{N}_{s,h}^{c}{\boldsymbol{\mu}}_{c}\right){\boldsymbol{E}}\right]{\left(\sum\limits_{s,h}{N}_{s,h}^{c}\Upsilon\right)}^{-1} $$
    Computational complexity: $ {\rm{O}}\left(CFR+C{R}^{2}+{R}^{3}\right) $

    PPCA [43-44]: $ {\boldsymbol{M}}_{s,h}={\boldsymbol{m}}+{\boldsymbol{T}}{\boldsymbol{w}}_{s,h}+{\boldsymbol{\varepsilon}}_{s,h} $ (isotropic residual covariance)
    E-step: $$ {\boldsymbol{L}}={\left({\boldsymbol{I}}+\dfrac{1}{{\sigma}^{2}}{{\boldsymbol{T}}}^{\rm{T}}{\boldsymbol{T}}\right)}^{-1},\;\;{\boldsymbol{E}}=\dfrac{1}{{\sigma}^{2}}{\boldsymbol{L}}{{\boldsymbol{T}}}^{\rm{T}}\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right),\;\;\Upsilon={\boldsymbol{L}}+{\boldsymbol{E}}{{\boldsymbol{E}}}^{\rm{T}} $$
    M-step: $$ {\boldsymbol{T}}=\left[\sum\limits_{s,h}\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right){\boldsymbol{E}}\right]{\left(\sum\limits_{s,h}\Upsilon\right)}^{-1},\;\;{\sigma}^{2}=\dfrac{1}{CF\sum\limits_{s,h}1}\left\{{\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right)}^{\rm{T}}\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right)-{\rm{tr}}\left(\Upsilon{{\boldsymbol{T}}}^{\rm{T}}{\boldsymbol{T}}\right)\right\} $$
    Computational complexity: $ {\rm{O}}\left(CFR\right) $

    FA [44-45]: $ {\boldsymbol{M}}_{s,h}={\boldsymbol{m}}+{\boldsymbol{T}}{\boldsymbol{w}}_{s,h}+{\boldsymbol{\varepsilon}}_{s,h} $ (anisotropic residual covariance)
    E-step: $$ {\boldsymbol{L}}={\left({\boldsymbol{I}}+{{\boldsymbol{T}}}^{\rm{T}}{\boldsymbol{\varPhi}}^{-1}{\boldsymbol{T}}\right)}^{-1},\;\;{\boldsymbol{E}}={\boldsymbol{L}}{{\boldsymbol{T}}}^{\rm{T}}{\boldsymbol{\varPhi}}^{-1}\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right),\;\;\Upsilon={\boldsymbol{L}}+{\boldsymbol{E}}{{\boldsymbol{E}}}^{\rm{T}} $$
    M-step: $$ {\boldsymbol{T}}=\left[\sum\limits_{s,h}\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right){\boldsymbol{E}}\right]{\left(\sum\limits_{s,h}\Upsilon\right)}^{-1},\;\;{\boldsymbol{\varPhi}}=\dfrac{1}{\sum\limits_{s,h}1}\left\{\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right){\left({\boldsymbol{M}}_{s,h}-{\boldsymbol{m}}\right)}^{\rm{T}}-{\boldsymbol{T}}\Upsilon{{\boldsymbol{T}}}^{\rm{T}}\right\}\odot{\boldsymbol{I}} $$
    Computational complexity: $ {\rm{O}}\left(CFR\right) $

    Table 3  Unsupervised TVS models based on different mapping relations

    Goal | Method | Characteristics
    Improved mapping relation | Local variability model [47] | Exploits the local variability between each Gaussian component of the GMM mean supervector and the I-vector feature
    | Sparse coding [48] | Uses dictionary learning to compress the total variability space matrix
    | Generalized variability model [49] | Extends the Gaussian assumption in the mapping to a Gaussian mixture
    Handling imperfect databases | Prior compensation [50] | Models the prior information of different databases and learns a mapping that compensates for it
    | Uncertainty propagation [51] | Models the influence of uncertainty in the mapping, reducing the effect of environmental distortion
    Faster learning | Generalized I-vector estimation [52] | Uses orthogonality to speed up computation
    | Randomized singular value decomposition [53] | Speeds up computation through approximate estimation

    Table 4  Information of different supervised TVS models

    Method | Characteristics
    PLS [54] | Learns a common subspace of GMM mean supervectors and their class labels as the total variability space, then uses the projection of the supervectors onto this subspace as I-vector features
    PPLS [55] | Learns common latent variables of GMM mean supervectors and their class labels and uses them as I-vector features
    SPPCA [56] | Learns common latent variables of GMM mean supervectors and their corresponding long-term GMM mean supervectors and uses them as I-vector features
    Minimax strategy [57] | Trains an estimator that minimizes the maximum risk

    Table 5  Information of different session compensation methods

    Goal | Method | Characteristics
    Subspace projection | LDA [60] | Minimizes within-class scatter and maximizes between-class scatter
    | WCCN [61] | Reduces the expected error rate
    | NAP [62] | Removes nuisance directions
    | NDA [63] | Learns locally discriminative between-class information and common within-class information
    | LWLDA [64-65] | Computes within-class scatter in a pairwise manner
    Feature reconstruction | SC [66] | Sparsely reconstructs the original features directly
    | BSBL [67] | Sparsely reconstructs the original features using intra-block correlation
    | FDDL [68] | Adds a Fisher regularization term to make the dictionary discriminative across classes

    Table 6  Information of different objective functions

    Goal | Method | Objective function
    Multi-class classification | Cross entropy | ${L_{{\rm{cro}}} } = - [y\log \hat y + (1 - y)\log (1 - \hat y)]$
    | Softmax | ${L_s} = - \dfrac{1}{N}\displaystyle \sum\limits_{n = 1}^N {\log } \frac{ { { {\rm{e} } ^{ {\boldsymbol{\theta } }_{ {y_n} }^{\rm{T} }f({ {\boldsymbol{x} }_n})} } } }{ {\displaystyle \sum\limits_{k = 1}^K { { {\rm{e} } ^{ {\boldsymbol{\theta } }_k^{\rm{T} }f({ {\boldsymbol{x} }_n})} } } } }$
    | Center [98] | ${L}_{c}=\dfrac{1}{2N}\displaystyle \sum\limits_{n=1}^{N}\Vert f({\boldsymbol{x} }_{n})-{\boldsymbol{c} }_{ {y}_{n} }{\Vert }^{2}$
    | L-softmax [99] | ${L}_{{\rm{l}}\text{-}{\rm{s}}}=-\dfrac{1}{N}{\displaystyle \sum\limits_{n=1}^{N}{\rm{log} } }\displaystyle\frac{ {\rm{e} }^{\Vert { {\boldsymbol{\theta } } }_{ {y}_{n} }\Vert \Vert f({\boldsymbol{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})} }{ {\rm{e} }^{\Vert { {\boldsymbol{\theta } } }_{ {y}_{n} }\Vert \Vert f({\boldsymbol{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})}+{\displaystyle \sum\limits_{k\ne {y}_{n} }{\rm{e} }^{\Vert { {\boldsymbol{\theta } } }_{k}\Vert \Vert f({\boldsymbol{x} }_{n})\Vert {\rm{cos} }({\alpha }_{k,n})} } }$
    | A-softmax [100] | ${L}_{{\rm{a}}\text{-}{\rm{s}}}=-\dfrac{1}{N}{\displaystyle \sum\limits_{n=1}^{N}{\rm{log} } }\displaystyle\frac{ {\rm{e} }^{\Vert f({\boldsymbol{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})} }{ {\rm{e} }^{\Vert f({\boldsymbol{x} }_{n})\Vert {\rm{cos} }(m{\alpha }_{ {y}_{n},n})}+{\displaystyle \sum\limits_{k\ne {y}_{n} }{\rm{e} }^{\Vert { {\boldsymbol{\theta } } }_{k}\Vert \Vert f({\boldsymbol{x} }_{n})\Vert {\rm{cos} }({\alpha }_{k,n})} } }$
    | AM-softmax [101] | ${L_{{\rm{am}}\text{-}{\rm{s}}} } = - \dfrac{1}{N}\displaystyle \sum\limits_{n = 1}^N {\log } \frac{ { { {\rm{e} } ^{s \cdot [\cos ({\alpha _{ {y_n},n} }) - m]} } } }{ { { {\rm{e} } ^{s \cdot [\cos ({\alpha _{ {y_n},n} }) - m]} } + \displaystyle \sum\limits_{k \ne {y_n} } { { {\rm{e} } ^{\cos ({\alpha _{k,n} })} } } } }$
    Metric learning | Contrastive [102] | ${L_{{\rm{con}}} } = yd\left[ {f({ {\boldsymbol{\boldsymbol{x} } }_1}),f({ {\boldsymbol{\boldsymbol{x} } }_2})} \right] + (1 - y)\max \{ 0,m - d\left[ {f({ {\boldsymbol{\boldsymbol{x} } }_1}),f({ {\boldsymbol{\boldsymbol{x} } }_2})} \right]\}$
    | Triplet [103] | ${L_{{\rm{trip}}} } = \max \{ 0,d\left[ {f({ {\boldsymbol{\boldsymbol{x} } }_p}),f({ {\boldsymbol{\boldsymbol{x} } }_a})} \right] - d\left[ {f({ {\boldsymbol{\boldsymbol{x} } }_n}),f({ {\boldsymbol{\boldsymbol{x} } }_a})} \right] + m\}$
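    As an illustration of the margin-based objectives above, the following is a minimal AM-softmax sketch (cf. AM-softmax [101]), assuming PyTorch: weights and embeddings are L2-normalized so the logits are cosines, the margin $m$ is subtracted from the target-class cosine, and everything is scaled by $s$ before cross-entropy. Names and hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(emb, weight, labels, s=30.0, m=0.2):
    """emb: (N, D) embeddings; weight: (K, D) class weights; labels: (N,) long."""
    cos = F.normalize(emb) @ F.normalize(weight).t()    # (N, K) cosine logits
    onehot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
    logits = s * (cos - m * onehot)                     # margin on target class only
    return F.cross_entropy(logits, labels)
```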

    Table 7  Information of different joint optimization methods

    Stages | Method | Description
    Session compensation + classifier | DNN-PLDA [104] | Uses PLDA to guide DNN learning
    | Bilevel [105] | Uses sparse coding for session compensation, with SVM and softmax classifiers guiding the sparse dictionary learning
    Total variability space + classifier | TDVM [106] | Uses PLDA to guide TVS learning
    All stages | F2S2I [107] | Uses PLDA to guide a DNN that mimics each stage of the I-vector pipeline
    | TDMF [108] | Uses PLDA to guide UBM and TVS learning

    Table 8  Information of common databases

    Database | Year | Acoustic environment | Speakers | Utterances / total duration
    CN-CELEB [126] | 2019 | Multimedia | 1000 | 300 h
    VoxCeleb [89]: VoxCeleb1 [73] | 2017 | Multimedia | 1251 | 153516
    VoxCeleb2 [75] | 2018 | Multimedia | 6112 | 1128246
    SITW [127] | 2016 | Multimedia | 299 | 2800
    Forensic Comparison [128] | 2015 | Telephone | 552 | 1264
    NIST SRE12 [129] | 2012 | Telephone/microphone | 2000+ | —
    ELSDSR [130] | 2005 | Clean speech | 22 | 198
    SWITCHBOARD [131] | 1992 | Telephone | 3114 | 33039
    TIMIT [132] | 1990 | Clean speech | 630 | 6300
  • [1] Reynolds D A. An overview of automatic speaker recognition technology. In: Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando, USA: IEEE, 2002. IV-4072−IV-4075
    [2] Aghajan H, Delgado R L C, Augusto J C. Human-Centric Interfaces for Ambient Intelligence. Burlington: Academic Press, 2010.
    [3] Poddar A, Sahidullah M, Saha G. Speaker verification with short utterances: A review of challenges, trends and opportunities. IET Biometrics, 2018, 7(2): 91-101 doi: 10.1049/iet-bmt.2017.0065
    [4] Han Ji-Qing, Zhang Lei, Zheng Tie-Ran. Speech Signal Processing (3rd edition). Beijing: Tsinghua University Press, 2019 (in Chinese)
    [5] Nematollahi M A, Al-Haddad S A R. Distant speaker recognition: An overview. International Journal of Humanoid Robotics, 2016, 13(2): Article No. 1550032 doi: 10.1142/S0219843615500322
    [6] Hansen J H L, Hasan T. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 2015, 32(6): 74-99 doi: 10.1109/MSP.2015.2462851
    [7] Kinnunen T, Li H Z. An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 2010, 52(1): 12-40 doi: 10.1016/j.specom.2009.08.009
    [8] Markel J, Oshika B, Gray A. Long-term feature averaging for speaker recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1977, 25(4): 330-337 doi: 10.1109/TASSP.1977.1162961
    [9] Li K, Wrench E. An approach to text-independent speaker recognition with short utterances. In: Proceedings of the 1983 IEEE International Conference on Acoustics, Speech, and Signal Processing. Boston, USA: IEEE, 1983. 555−558
    [10] Chen S H, Wu H T, Chang Y, Truong T K. Robust voice activity detection using perceptual wavelet-packet transform and Teager energy operator. Pattern Recognition Letters, 2007, 28(11): 1327-1332 doi: 10.1016/j.patrec.2006.11.023
    [11] Fujimoto M, Ishizuka K, Nakatani T. A voice activity detection based on the adaptive integration of multiple speech features and a signal decision scheme. In: Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. Las Vegas, USA: IEEE, 2008. 4441−4444
    [12] Li K, Swamy M N S, Ahmad M O. An improved voice activity detection using higher order statistics. IEEE Transactions on Speech and Audio Processing, 2005, 13(5): 965-974 doi: 10.1109/TSA.2005.851955
    [13] Soleimani S A, Ahadi S M. Voice activity detection based on combination of multiple features using linear/kernel discriminant analyses. In: Proceedings of the 3rd International Conference on Information and Communication Technologies: From Theory to Applications. Damascus, Syria: IEEE, 2008. 1−5
    [14] Sohn J, Kim N S, Sung W. A statistical model-based voice activity detection. IEEE Signal Processing Letters, 1999, 6(1): 1-3 doi: 10.1109/97.736233
    [15] Chang J H, Kim N S. Voice activity detection based on complex Laplacian model. Electronics Letters, 2003, 39(7): 632-634 doi: 10.1049/el:20030392
    [16] Ramirez J, Segura J C, Benitez C, Garcia L, Rubio A. Statistical voice activity detection using a multiple observation likelihood ratio test. IEEE Signal Processing Letters, 2005, 12(10): 689-692 doi: 10.1109/LSP.2005.855551
    [17] Tong S B, Gu H, Yu K. A comparative study of robustness of deep learning approaches for VAD. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China: IEEE, 2016. 5695−5699
    [18] Atal B S. Automatic recognition of speakers from their voices. Proceedings of the IEEE, 1976, 64(4): 460-475 doi: 10.1109/PROC.1976.10155
    [19] Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980, 28(4): 357-366 doi: 10.1109/TASSP.1980.1163420
    [20] Hermansky H. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 1990, 87(4): 1738-1752 doi: 10.1121/1.399423
    [21] Koenig W, Dunn H K, Lacy L Y. The sound spectrograph. The Journal of the Acoustical Society of America, 1946, 18(1): 19-49 doi: 10.1121/1.1916342
    [22] LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541−551
    [23] Lin Jing-Dong, Wu Xin-Yi, Chai Yi, Yin Hong-Peng. Structure optimization of convolutional neural networks: A survey. Acta Automatica Sinica, 2020, 46(1): 24-37 (in Chinese)
    [24] Furui S. Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981, 29(2): 254-272 doi: 10.1109/TASSP.1981.1163530
    [25] Pelecanos J W, Sridharan S. Feature warping for robust speaker verification. In: Proceedings of the 2001 A Speaker Odyssey: The Speaker Recognition Workshop. Crete, Greece: ISCA, 2001. 1−5
    [26] Sadjadi S O, Slaney M, Heck A L. MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker Recognition Research, Microsoft Research Technical Report MSR-TR-2013-133, 2013.
    [27] Campbell W M, Sturim D E, Reynolds D A. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 2006, 13(5): 308-311 doi: 10.1109/LSP.2006.870086
    [28] Reynolds D A. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 1995, 17(1−2): 91-108 doi: 10.1016/0167-6393(95)00009-D
    [29] Reynolds D A, Quatieri T F, Dunn R B. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 2000, 10(1−3): 19-41 doi: 10.1006/dspr.1999.0361
    [30] Wang W, Han J, Zheng T, Zheng G, Liu H. A robust sparse auditory feature for speaker verification. Journal of Computational Information Systems, 2013, 9(22): 8987-8993
    [31] Wang W, Han J Q, Zheng T R, Zheng G B. Robust speaker verification based on max pooling of sparse representation. Journal of Computers, 2014, 24(4): 56-65
    [32] He Y J, Chen C, Han J Q. Noise-robust speaker recognition based on morphological component analysis. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association. Dresden, Germany: ISCA, 2015. 3001−3005
    [33] Wang W, Han J Q, Zheng T R, Zheng G B, Zhou X Y. Speaker verification via modeling kurtosis using sparse coding. International Journal of Pattern Recognition and Artificial Intelligence, 2016, 30(3): Article No. 1659008 doi: 10.1142/S0218001416590084
    [34] Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1977, 39(1): 1-22
    [35] Gauvain J L, Lee C H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 1994, 2(2): 291-298 doi: 10.1109/89.279278
    [36] Kuhn R, Junqua J C, Nguyen P, Niedzielski N. Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 2000, 8(6): 695-707 doi: 10.1109/89.876308
    [37] Kenny P, Mihoubi M, Dumouchel P. New MAP estimators for speaker recognition. In: Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH). Geneva, Switzerland: ISCA, 2003. 2961−2964
    [38] Kenny P, Boulianne G, Ouellet P, Dumouchel P. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(4): 1435-1447 doi: 10.1109/TASL.2006.881693
    [39] Dehak N, Dehak R, Kenny P, Brümmer N, Dumouchel P. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In: Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH). Brighton, UK: ISCA, 2009. 1559−1562
    [40] Dehak N, Kenny P J, Dehak R, Dumouchel P, Ouellet P. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(4): 788-798 doi: 10.1109/TASL.2010.2064307
    [41] Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987, 2(1−3): 37-52 doi: 10.1016/0169-7439(87)80084-9
    [42] Lei Z C, Yang Y C. Maximum likelihood I-vector space using PCA for speaker verification. In: Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH). Florence, Italy: ISCA, 2011. 2725−2728
    [43] Tipping M E, Bishop C M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B Statistical Methodology), 1999, 61(3): 611-622 doi: 10.1111/1467-9868.00196
    [44] Vestman V, Kinnunen T. Supervector compression strategies to speed up I-vector system development. In: Proceedings of the 2018 Odyssey: The Speaker and Language Recognition Workshop. Les Sables d' Olonne, France: ISCA, 2018. 357−364
    [45] Gorsuch R L. Factor Analysis (2nd edition). Hillsdale: Lawrence Erlbaum Associates, 1983.
    [46] Roweis S T. EM algorithms for PCA and SPCA. In: Proceedings of the 10th International Conference on Neural Information Processing Systems. Denver, USA: MIT Press, 1997. 626−632
    [47] Chen L P, Lee K A, Ma B, Guo W, Li H Z, Dai L R. Local variability vector for text-independent speaker verification. In: Proceedings of the 9th International Symposium on Chinese Spoken Language Processing. Singapore, Singapore: IEEE, 2014. 54−58
    [48] Xu L T, Lee K A, Li H Z, Yang Z. Sparse coding of total variability matrix. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH). Dresden, Germany: ISCA, 2015. 1022−1026
    [49] Ma J B, Sethu V, Ambikairajah E, Lee K A. Generalized variability model for speaker verification. IEEE Signal Processing Letters, 2018, 25(12): 1775-1779 doi: 10.1109/LSP.2018.2874814
    [50] Shepstone S E, Lee K A, Li H Z, Tan Z H, Jensen S H. Total variability modeling using source-specific priors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(3): 504-517 doi: 10.1109/TASLP.2016.2515506
    [51] Ribas D, Vincent E. An improved uncertainty propagation method for robust I-vector based speaker recognition. In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE, 2019. 6331−6335
    [52] Xu L T, Lee K A, Li H Z, Yang Z. Generalizing I-vector estimation for rapid speaker recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(4): 749-759 doi: 10.1109/TASLP.2018.2793670
    [53] Travadi R, Narayanan S. Efficient estimation and model generalization for the total variability model. Computer Speech and Language, 2019, 53: 43-64
    [54] Chen C, Han J Q. Partial least squares based total variability space modeling for I-vector speaker verification. Chinese Journal of Electronics. 2018, 27(6): 1229-1233 doi: 10.1049/cje.2018.06.001
    [55] Chen C, Han J Q, Pan Y L. Speaker verification via estimating total variability space using probabilistic partial least squares. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017. 1537−1541
    [56] Lei Y, Hansen J H L. Speaker recognition using supervised probabilistic principal component analysis. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH). Makuhari, Japan: ISCA, 2010. 382−385
    [57] Huber J. A robust version of the probability ratio test. Annals of Mathematical Statistics, 1965, 36(6): 1753-1758 doi: 10.1214/aoms/1177699803
    [58] Hautamäki V, Cheng Y C, Rajan P, Lee C H. Minimax i-vector extractor for short duration speaker verification. In: Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH). Lyon, France: ISCA, 2013. 3708−3712
    [59] Vogt R, Baker B, Sridharan S. Modelling session variability in text-independent speaker verification. In: Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH). Lisbon, Portugal: ISCA, 2005. 3117−3120
    [60] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7(2): 179-188 doi: 10.1111/j.1469-1809.1936.tb02137.x
    [61] Hatch A O, Kajarekar S S, Stolcke A. Within-class covariance normalization for SVM-based speaker recognition. In: Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH). Pittsburgh, USA: ISCA, 2006. 1471−1474
    [62] Campbell W M, Sturim D E, Reynolds D A, Solomonoff A. SVM based speaker verification using a GMM supervector kernel and NAP variability compensation. In: Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing. Toulouse, France: IEEE, 2006.
    [63] Sadjadi S O, Pelecanos J W, Zhu W Z. Nearest neighbor discriminant analysis for robust speaker recognition. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH). Singapore, Singapore: ISCA, 2014. 1860−1864
    [64] Misra A, Ranjan S, Hansen J H L. Locally weighted linear discriminant analysis for robust speaker verification. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017. 2864−2868
    [65] Misra A, Hansen J H L. Modelling and compensation for language mismatch in speaker verification. Speech Communication, 2018, 96: 58-66 doi: 10.1016/j.specom.2017.09.004
    [66] Li M, Zhang X, Yan Y H, Narayanan S S. Speaker verification using sparse representations on total variability I-vectors. In: Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH). Florence, Italy: ISCA, 2011. 2729−2732
    [67] Wang W, Han J Q, Zheng T R, Zheng G B, Shao M G. Speaker recognition via block sparse Bayesian learning. International Journal of Multimedia and Ubiquitous Engineering, 2015, 10(7): 247-254 doi: 10.14257/ijmue.2015.10.7.26
    [68] Wang Wei, Han Ji-Qing, Zheng Tie-Ran, Zheng Gui-Bin, Tao Yao. Speaker recognition based on Fisher discrimination dictionary learning. Journal of Electronics and Information Technology, 2016, 38(2): 367-372 (in Chinese)
    [69] Variani E, Lei X, McDermott E, Moreno I L, Gonzalez-Dominguez J. Deep neural networks for small footprint text-dependent speaker verification. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE, 2014. 4052−4056
    [70] Snyder D, Garcia-Romero D, Povey D, Khudanpur S. Deep neural network embeddings for text-independent speaker verification. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017. 999−1003
    [71] Snyder D, Garcia-Romero D, Sell G, Povey D, Khudanpur S. X-Vectors: Robust DNN embeddings for speaker recognition. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE, 2018. 5329−5333
    [72] Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: Delving deep into convolutional nets. In: Proceedings of the 2014 British Machine Vision Conference (BMVC). Nottingham, UK: BMVA Press, 2014: 1−5
    [73] Nagrani A, Chung J S, Zisserman A. VoxCeleb: A large-scale speaker identification dataset. In: Proceedings of the 18the Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017. 2616−2620
    [74] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 770−778
    [75] Chung J S, Nagrani A, Zisserman A. VoxCeleb2: Deep speaker recognition. In: Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018. 1086−1090
    [76] Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2014. 2672−2680
    [77] Zhang Z F, Wang L B, Kai A, Yamada T, Li W F, Iwahashi M. Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification. Eurasip Journal on Audio, Speech, and Music Processing, 2015, 2015(1): Article No. 12 doi: 10.1186/s13636-015-0056-7
    [78] Richardson F, Reynolds D, Dehak N. Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, 2015, 22(10): 1671-1675 doi: 10.1109/LSP.2015.2420092
    [79] Chen Y H, Lopez-Moreno I, Sainath T N, Visontai M, Alvarez R, Parada C. Locally-connected and convolutional neural networks for small footprint speaker recognition. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH). Dresden, Germany: ISCA, 2015. 1136−1140
    [80] Li L T, Chen Y X, Shi Y, Tang Z Y, Wang D. Deep speaker feature learning for text-independent speaker verification. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017. 1542−1546
    [81] Prince S J D, Elder J H. Probabilistic linear discriminant analysis for inferences about identity. In: Proceedings of the 11th IEEE International Conference on Computer Vision. Rio de Janeiro, Brazil: IEEE, 2007. 1−8
    [82] Peddinti V, Povey D, Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH). Dresden, Germany: ISCA, 2015. 3214−3218
    [83] Villalba J, Chen N X, Snyder D, Garcia-Romero D, McCree A, Sell G, et al. State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18. In: Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019. 1488−1492
    [84] Povey D, Cheng G F, Wang Y M, Li K, Xu H N, Yarmohammadi M, et al. Semi-orthogonal low-rank matrix factorization for deep neural networks. In: Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018. 3743−3747
    [85] Snyder D, Garcia-Romero D, Sell G, McCree A, Povey D, Khudanpur S. Speaker recognition for multi-speaker conversations using X-vectors. In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE, 2019. 5796−5800
    [86] Kanagasundaram A, Sridharan S, Ganapathy S, Singh P, Fookes C. A study of X-vector based speaker recognition on short utterances. In: Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019. 2943−2947
    [87] Garcia-Romero D, Snyder D, Sell G, McCree A, Povey D, Khudanpur S. X-vector DNN refinement with full-length recordings for speaker recognition. In: Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019. 1493−1496
    [88] Hong Q B, Wu C H, Wang H M, Huang C L. Statistics pooling time delay neural network based on X-vector for speaker verification. In: Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, 2020. 6849−6853
    [89] Nagrani A, Chung J S, Xie W D, Zisserman A. Voxceleb: Large-scale speaker verification in the wild. Computer Science and Language, 2020, 60: Article No. 101027
    [90] Hajibabaei M, Dai D X. Unified hypersphere embedding for speaker recognition. arXiv preprint arXiv: 1807.08312, 2018.
    [91] Xie W D, Nagrani A, Chung J S, Zisserman A. Utterance-level aggregation for speaker recognition in the wild. In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE, 2019. 5791−5795
    [92] Zhang C L, Koishida K. End-to-end text-independent speaker verification with triplet loss on short utterances. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017. 1487−1491
    [93] Cai W C, Chen J K, Li M. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. In: Proceedings of the 2018 Odyssey: The Speaker and Language Recognition Workshop. Les Sables d'Olonne, France: ISCA, 2018. 74−81
    [94] Li C, Ma X K, Jiang B, Li X G, Zhang X W, Liu X, Cao Y, Kannan A, Zhu Z Y. Deep speaker: An end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304, 2017.
    [95] Ding W H, He L. MTGAN: Speaker verification through multitasking triplet generative adversarial networks. In: Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018. 3633−3637
    [96] Zhou J F, Jiang T, Li L, Hong Q Y, Wang Z, Xia B Y. Training multi-task adversarial network for extracting noise-robust speaker embeddings. In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE, 2019. 6196−6200
    [97] Yang Y X, Wang S, Sun M, Qian Y M, Yu K. Generative adversarial networks based X-vector augmentation for robust probabilistic linear discriminant analysis in speaker verification. In: Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP). Taipei, China: IEEE, 2018. 205−209
    [98] Li N, Tuo D Y, Su D, Li Z F, Yu D. Deep discriminative embeddings for duration robust speaker verification. In: Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018. 2262−2266
    [99] Liu Y, He L, Liu J. Large margin softmax loss for speaker verification. In: Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019. 2873−2877
    [100] Huang Z L, Wang S, Yu K. Angular softmax for short-duration text-independent speaker verification. In: Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018. 3623−3627
    [101] Yu Y Q, Fan L, Li W J. Ensemble additive margin softmax for speaker verification. In: Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE, 2019. 6046−6050
    [102] Bhattacharya G, Alam J, Gupta V, Kenny P. Deeply fused speaker embeddings for text-independent speaker verification. In: Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018. 3588−3592
    [103] Zhang C L, Koishida K, Hansen J H L. Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(9): 1633-1644 doi: 10.1109/TASLP.2018.2831456
    [104] Zheng T R, Han J Q, Zheng G B. Deep neural network based discriminative training for I-vector/PLDA speaker verification. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE, 2018. 5354−5358
    [105] Chen C, Wang W, He Y J, Han J Q. A bilevel framework for joint optimization of session compensation and classification for speaker identification. Digital Signal Processing, 2019, 89: 104-115 doi: 10.1016/j.dsp.2019.03.008
    [106] Chen C, Han J Q. Task-driven variability model for speaker verification. Circuits, Systems, and Signal Processing, 2020, 39(6): 3125-3144 doi: 10.1007/s00034-019-01315-7
    [107] Rohdin J, Silnova A, Diez M, Plchot O, Matějka P, Burget L. End-to-end DNN based speaker recognition inspired by I-vector and PLDA. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE, 2018. 4874−4878
    [108] Chen C, Han J Q. TDMF: Task-driven multilevel framework for end-to-end speaker verification. In: Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, 2020. 6809−6813
    [109] Migdalas A, Pardalos P M, Varbränd P. Multilevel Optimization: Algorithms and Applications. Boston: Springer Science and Business Media, 2013.
    [110] Kenny P. Bayesian speaker verification with heavy-tailed priors. In: Proceedings of the 2010 Odyssey: The Speaker and Language Recognition Workshop. Brno, Czech Republic: ISCA, 2010. 1−4
    [111] Garcia-Romero D, Espy-Wilson C Y. Analysis of I-vector length normalization in speaker recognition systems. In: Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH). Florence, Italy: ISCA, 2011. 249−252
    [112] Pan Y L, Zheng T R, Chen C. I-vector Kullback-Leibler divisive normalization for PLDA speaker verification. In: Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP). Montreal, Canada: IEEE, 2017. 56−60
    [113] Burget L, Plchot O, Cumani S, Glembek O, Matějka P, Brümmer N. Discriminatively trained probabilistic linear discriminant analysis for speaker verification. In: Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Prague, Czech Republic: IEEE, 2011. 4832−4835
    [114] Cumani S, Laface P. Joint estimation of PLDA and nonlinear transformations of speaker vectors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(10): 1890-1900 doi: 10.1109/TASLP.2017.2724198
    [115] Cumani S, Laface P. Scoring heterogeneous speaker vectors using nonlinear transformations and tied PLDA models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(5): 995-1009 doi: 10.1109/TASLP.2018.2806305
    [116] Kenny P, Stafylakis T, Ouellet P, Alam J, Dumouchel P. PLDA for speaker verification with utterances of arbitrary duration. In: Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 7649−7653
    [117] Ma J B, Sethu V, Ambikairajah E, Lee K A. Twin model G-PLDA for duration mismatch compensation in text-independent speaker verification. In: Proceedings of the 17th Annual Conference of the International Speech Communication Association. San Francisco, USA: ISCA, 2016. 1853−1857
    [118] Ma J B, Sethu V, Ambikairajah E, Lee K A. Duration compensation of I-vectors for short duration speaker verification. Electronics Letters, 2017, 53(6): 405-407 doi: 10.1049/el.2016.4629
    [119] Villalba J, Lleida E. Handling I-vectors from different recording conditions using multi-channel simplified PLDA in speaker recognition. In: Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013. 6763−6767
    [120] Garcia-Romero D, McCree A. Supervised domain adaptation for I-vector based speaker recognition. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE, 2014. 4047−4051
    [121] Richardson F, Nemsick B, Reynolds D. Channel compensation for speaker recognition using MAP adapted PLDA and denoising DNNs. In: Proceedings of the 2016 Odyssey: The Speaker and Language Recognition Workshop. Bilbao, Spain: ISCA, 2016. 225−230
    [122] Hong Q Y, Li L, Zhang J, Wan L H, Guo H Y. Transfer learning for PLDA-based speaker verification. Speech Communication, 2017, 92: 90-99 doi: 10.1016/j.specom.2017.05.004
    [123] Li N, Mak M W. SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(10): 1648-1659 doi: 10.1109/TASLP.2015.2442757
    [124] Mak M W, Pang X M, Chien J T. Mixture of PLDA for noise robust I-vector speaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(1): 130-142 doi: 10.1109/TASLP.2015.2499038
    [125] Villalba J, Miguel A, Ortega A, Lleida E. Bayesian networks to model the variability of speaker verification scores in adverse environments. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(12): 2327-2340 doi: 10.1109/TASLP.2016.2607343
    [126] Fan Y, Kang J W, Li L T, Li K C, Chen H L, Cheng S T, et al. CN-Celeb: A challenging Chinese speaker recognition dataset. In: Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, 2020. 7604−7608
    [127] McLaren M, Ferrer L, Castán D, Lawson A. The speakers in the wild (SITW) speaker recognition database. In: Proceedings of the 17th Annual Conference of the International Speech Communication Association. San Francisco, USA: ISCA, 2016. 818−822
    [128] Morrison G S, Zhang C, Enzinger E, Ochoa F, Bleach D, Johnson M, et al. Forensic database of voice recordings of 500+ Australian English speakers [Online], available: http://databases.forensic-voice-comparison.net/, November 10, 2020
    [129] Greenberg C S. The NIST Year 2012 Speaker Recognition Evaluation plan, Technical Report NIST_SRE12_evalplan.v17, 2012.
    [130] Feng L, Hansen L K. A New Database for Speaker Recognition, IMM-Technical Report, 2005.
    [131] Godfrey J J, Holliman E C, McDaniel J. SWITCHBOARD: Telephone speech corpus for research and development. In: Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing. San Francisco, USA: IEEE, 1992. 517−520
    [132] Jankowski C, Kalyanswamy A, Basson S, Spitz J. TIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database. In: Proceedings of the 1990 IEEE International Conference on Acoustics, Speech, and Signal Processing. Albuquerque, USA: IEEE, 1990. 109−122
    [133] Wang Jin-Jia, Ji Shao-Nan, Cui Lin, Xia Jing, Yang Qian. Domestic activity recognition based on attention capsule network. Acta Automatica Sinica, 2019, 45(11): 2199-2204 (in Chinese)
    [134] Wang H J, Dinkel H, Wang S, Qian Y M, Yu K. Dual-adversarial domain adaptation for generalized replay attack detection. In: Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020. 1086−1090
    [135] Huang Ya-Ting, Shi Jing, Xu Jia-Ming, Xu Bo. Research advances and perspectives on the cocktail party problem and related auditory models. Acta Automatica Sinica, 2019, 45(2): 234-251 (in Chinese)
    [136] Lin Q J, Hou Y, Li M. Self-attentive similarity measurement strategies in speaker diarization. In: Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020. 284−288

Publication history
  • Received: 2020-07-09
  • Revised: 2020-09-03
  • Published online: 2020-12-10
  • Issue date: 2022-03-25
