In recent years, advances in remote sensing technology have produced a growing volume of high-quality remote sensing imagery, laying the foundation for applications in this field. Remote sensing images are widely used in disaster monitoring, resource surveying, land-use assessment, agricultural output estimation, and urban planning [1], and are of great significance to social and economic development. Object detection, one of the applications of remote sensing image processing, obtains the category and location of specific targets in an image, typically aircraft, airports, ships, bridges, and vehicles, and therefore plays an important role in both civilian and military domains [2]. In civilian applications, locating ships aids maritime rescue operations, and locating vehicles enables vehicle counting and road congestion analysis. In military applications, detecting such targets helps to lock onto attack positions quickly and precisely, analyze the battlefield situation, and plan operations. Accurate detection of targets in remote sensing images is therefore essential.
Object detection is an important and challenging research topic in computer vision. With the rapid development of deep learning, the performance of object detectors has improved markedly, and they are now widely deployed across industries. Current detectors fall roughly into two categories: two-stage and one-stage [3]. Two-stage detectors follow the regions-with-convolutional-neural-network (R-CNN) framework and split detection into two phases: the first generates a set of candidate regions from the image, and the second extracts features from those regions and applies a classifier and a regressor. Faster R-CNN [4], the classic two-stage method, introduced the region proposal network (RPN) to generate proposals, enabling fast and accurate end-to-end detection. Later two-stage detectors such as the region-based fully convolutional network (R-FCN) [5] and Cascade R-CNN [6] further improved accuracy. One-stage detectors cast detection as a regression problem solved directly by a stack of convolutional layers, with no proposal generation or per-region feature extraction, and are therefore usually faster. For example, Redmon et al. [7] proposed the YOLO detector, which divides the image into a grid and regresses bounding boxes directly from each cell. Liu et al. [8] proposed the SSD detector, which classifies and regresses directly on feature maps of multiple scales. Lin et al. [9] proposed the focal loss to address the class imbalance of one-stage detectors and further improve accuracy. These advanced techniques generally produce horizontal bounding boxes, yet most targets in remote sensing images appear in arbitrary orientations; for targets with large aspect ratios or dense arrangements, horizontal boxes contain too much redundant background and degrade detection. Orientation thus becomes a factor that cannot be ignored.
Early rotated-box detection algorithms applied to remote sensing mostly originated in text detection, e.g., R2CNN [10] and RRPN [11]. However, remote sensing images have complex backgrounds and widely varying spatial resolutions, making the task harder than binary-class text detection, so these strong text detectors do not transfer well to remote sensing. In recent years, with the development of object detection and deeper study of remote sensing imagery, many well-performing rotated-box detectors have emerged. For example, Ding et al. [12] proposed the RoI Transformer, which converts horizontal boxes into rotated ones and performs box regression inside the learned module; Zhang et al. [13] enhanced features by capturing correlations between the global scene and local features; Azimi et al. [14] proposed an image cascade method based on multi-scale convolution kernels; Yang et al. [15] proposed a pixel attention mechanism that suppresses image noise and highlights target features, and introduced an IoU constant factor into the smooth L1 loss [4] to resolve the boundary problem of rotated boxes and make their prediction more precise. Yang et al. [16] designed a refinement module that aligns features through interpolation. Xu et al. [17] proposed regressing four length ratios representing the relative offsets of the corresponding sides, and introduced the area ratio between a ground-truth box and its horizontal bounding box as an obliquity factor to choose horizontal or rotated detection for each target. Wei et al. [18] detected rotated targets by predicting their inner centerlines. Li et al. [19] obtained rotated boxes from predicted masks. Wang et al. [20] proposed a feature pyramid network (FPN) enhancement algorithm based on initial lateral connections, together with a semantic attention network that supplies semantic features to extract targets from complex backgrounds.
Current rotated-box detection methods for remote sensing images thus fall roughly into two kinds. In the first, the overall pipeline remains a horizontal-box detector and only a few extra variables, such as an angle term, are added to the regression branch. The pixels the network predicts from then include considerable background, which easily causes the angle drift and frequent missed detections illustrated in Fig. 1. The second presets anchors that carry angles and predicts from the pixels inside rotated proposals. Because targets take many orientations, this approach must preset a large number of anchors to maintain recall, which greatly increases computation.
To address these shortcomings, this paper combines the strengths of the two approaches and proposes R2-FRCNN (refined rotated Faster R-CNN), a rotated-box detection network built on Faster R-CNN [21]. The network applies the two approaches in sequence: the first produces rotated boxes in what we treat as a coarse adjustment stage, and these boxes serve as the preset boxes of the second, which adjusts them again in a fine adjustment stage. The two-stage adjustment yields more accurate predicted boxes. In addition, since remote sensing images contain many small targets, we propose a pixel-recombination feature pyramid network (PFPN); compared with the conventional pyramid, it combines local and global feature information to strengthen the response of small targets against complex backgrounds. To better extract features that characterize targets for the subsequent prediction stages, we design an integral region-of-interest pooling method (IRoIPool) for the coarse stage and a rotated region-of-interest pooling method (RRoIPool) for the fine stage, improving small-target accuracy in complex backgrounds. Finally, both stages use a prediction branch that combines fully connected and convolutional layers, together with the SmoothLn regression loss, to further improve performance.
The remainder of this paper is organized as follows: Section 1 details the proposed rotated-box detection network R2-FRCNN; Section 2 evaluates the method by comparing it with the official baselines and existing methods and by ablating each of its modules; Section 3 concludes.
1. Rotated-box object detection method
This section describes the structure of the proposed R2-FRCNN and its modules. We first present the overall architecture, then detail each module (the pixel-recombination pyramid, RoI feature extraction, and the prediction branch structure), and finally the loss functions used.
1.1 Network structure design
Fig. 2 shows the overall structure of R2-FRCNN, which comprises five parts: the backbone, the pixel-recombination pyramid, the region proposal network (RPN), the coarse adjustment stage, and the fine adjustment stage.
We adopt ResNet [22] as the backbone and build the feature pyramid from the C3, C4, C5, and C6 layers to strengthen small-target detection. On the five pyramid levels P3, P4, P5, P6, and P7, three anchors are preset at every pixel with aspect ratios {1:1, 1:2, 2:1} and a base scale of 8; the RPN [4] adjusts the anchor positions to generate a set of candidate boxes. The 2000 proposals with the highest confidence then enter the coarse adjustment stage, whose regression converts horizontal boxes into rotated ones. Finally, these proposals enter the fine adjustment stage, where the rotated boxes are adjusted again for better detection. For the boxes after two-stage adjustment, the maximum classification score of the latter stage is taken as the confidence, and rotated non-maximum suppression keeps the high-confidence box within each neighborhood while suppressing the low-confidence ones; the surviving high-confidence proposals are the network's output predictions.
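The rotated NMS step can be sketched as follows. This is a minimal illustration, assuming boxes in (cx, cy, w, h, θ) format and using shapely for the polygon IoU of rotated boxes; the paper does not specify its own implementation.

```python
# Minimal sketch of rotated NMS: polygon IoU via shapely (an assumption),
# keeping the highest-scoring box in each neighborhood.
import numpy as np
from shapely.geometry import Polygon

def rbox_to_polygon(box):
    """Convert (cx, cy, w, h, theta) to a Polygon of the box's 4 corners."""
    cx, cy, w, h, theta = box
    c, s = np.cos(theta), np.sin(theta)
    corners = np.array([[-w/2, -h/2], [w/2, -h/2], [w/2, h/2], [-w/2, h/2]])
    rot = corners @ np.array([[c, s], [-s, c]])  # rotate into image coords
    return Polygon(rot + [cx, cy])

def rotated_nms(boxes, scores, iou_thr=0.1):
    """Suppress rotated boxes whose IoU with a higher-scoring box exceeds iou_thr."""
    order = np.argsort(scores)[::-1]
    polys = [rbox_to_polygon(b) for b in boxes]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        survivors = []
        for j in order[1:]:
            inter = polys[i].intersection(polys[j]).area
            union = polys[i].area + polys[j].area - inter
            if inter / max(union, 1e-6) <= iou_thr:
                survivors.append(j)
        order = np.array(survivors, dtype=int)
    return keep
```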
1.2 Pixel-recombination pyramid structure
The feature pyramid structure [23] is widely used in advanced detectors: shallow layers carry accurate localization information while deep layers carry rich semantics, and fusing them improves small-target detection. As Table 1 shows, RoI Transformer (RT) [12], CADNet [13], SCRDet [15], R3Det [16], and GV R-CNN (GV) [17] all fuse deep and shallow features and achieve excellent performance, whereas R2CNN [10] uses no feature fusion and scores far below the others. Fig. 3 shows the proposed pixel-recombination pyramid. The structure has two stages: stage 1, $C_i \to M_i$, applies a scale transformation that exploits local feature information while fusing upper and lower levels to build the pyramid; stage 2, $M_i \to P_i$, applies a non-local attention module [24] that uses global information to highlight target regions.

Table 1 Comparison of detection accuracy (%) of different methods on DOTA

| Category | R2CNN [10] | RT [12] | CADNet [13] | SCRDet [15] | R3Det [16] | GV [17] | Ours |
|---|---|---|---|---|---|---|---|
| Plane | 80.94 | 88.64 | 87.80 | 89.98 | 89.24 | 89.64 | 89.10 |
| Baseball diamond | 65.67 | 78.52 | 82.40 | 80.65 | 80.81 | 85.00 | 81.22 |
| Bridge | 35.34 | 43.44 | 49.40 | 52.09 | 51.11 | 52.26 | 54.47 |
| Ground track field | 67.44 | 75.92 | 73.50 | 68.36 | 65.62 | 77.34 | 72.97 |
| Small vehicle | 59.92 | 68.81 | 71.10 | 68.36 | 70.67 | 73.01 | 79.99 |
| Large vehicle | 50.91 | 73.68 | 64.50 | 60.32 | 76.03 | 73.14 | 82.28 |
| Ship | 55.81 | 83.59 | 76.60 | 72.41 | 78.32 | 86.82 | 87.64 |
| Tennis court | 90.67 | 90.74 | 90.90 | 90.85 | 90.83 | 90.74 | 90.54 |
| Basketball court | 66.92 | 77.27 | 79.20 | 87.94 | 84.89 | 79.02 | 87.31 |
| Storage tank | 72.39 | 81.46 | 73.30 | 86.86 | 84.42 | 86.81 | 86.33 |
| Soccer-ball field | 55.06 | 58.39 | 48.40 | 65.02 | 65.10 | 59.55 | 54.20 |
| Roundabout | 52.23 | 53.54 | 60.90 | 66.68 | 57.18 | 70.91 | 68.18 |
| Harbor | 55.14 | 62.83 | 62.00 | 66.25 | 68.10 | 72.94 | 76.12 |
| Swimming pool | 53.35 | 58.93 | 67.00 | 68.24 | 68.98 | 70.86 | 70.83 |
| Helicopter | 48.22 | 47.67 | 62.20 | 65.21 | 60.88 | 57.32 | 59.19 |
| mAP | 60.67 | 69.56 | 69.90 | 72.61 | 72.81 | 75.02 | 76.02 |

In stage 1, feature upsampling is a key operation for the pyramid. The most common upsampling methods are interpolation and transposed convolution [25]. Interpolation considers only neighboring pixels and cannot capture the rich semantic information that dense prediction tasks require. Transposed convolution, as the inverse of convolution, has two shortcomings as an upsampler [26]: 1) it applies the same kernel to the whole feature map regardless of the targets in it, limiting the response of the upsampling to local variation; 2) a large kernel adds a large number of parameters. This paper instead introduces the scale transformation as the upsampling method. Fig. 4 shows the deep-shallow fusion procedure. The method first applies the channel transformation of [27] to compress the channel count (compression ratio $r = 0.5$ in this paper) and enlarge the feature map:
$$ I_{H,W,C} = I_{\left\lfloor H/r \right\rfloor,\, \left\lfloor W/r \right\rfloor,\, C \cdot r^2 + r \cdot {\rm mod}(W,r) + {\rm mod}(H,r)} $$ (1)
A $1 \times 1$ convolutional layer then adjusts the channel count, and a softmax [28] is applied to each channel of the feature map. Finally, the weighted sum of Eq. (2) fuses the features so that local information is used more effectively:
$$ \left\{ \begin{aligned} & y_{m,n,c} = \sum_{i=-2}^{2}\sum_{j=-2}^{2} x_{m+i,n+j,c} \cdot w_{m,n,k} \\ & k = (i+2)\times 5 + j + 2 \end{aligned} \right. $$ (2)
where $m$ and $n$ are the horizontal and vertical positions of a pixel, $c$ is the current channel of the $C$ layer, and $k$ is the current channel of the $M$ layer.
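The stage-1 fusion can be sketched in PyTorch as below: pixel shuffle realizes the channel-to-space transformation of Eq. (1), and a 5 × 5 softmax-weighted reassembly realizes Eq. (2). The layer names, channel counts, and exact wiring between the deep and shallow maps are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the scale-transformation upsampling (Eq. (1)) followed by the
# 5x5 content-weighted sum (Eq. (2)); details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleTransformUp(nn.Module):
    def __init__(self, channels=256, k=5):
        super().__init__()
        self.k = k
        # Channel transformation: move channels into space, doubling H and W
        # (compression ratio r = 0.5 in the paper's notation).
        self.shuffle = nn.PixelShuffle(2)
        self.restore = nn.Conv2d(channels // 4, channels, 1)  # 1x1 conv
        # Predict k*k per-pixel reassembly weights, softmax-normalized.
        self.weight = nn.Conv2d(channels, k * k, 1)

    def forward(self, deep, shallow):
        up = self.restore(self.shuffle(deep))       # upsampled deep feature
        w = F.softmax(self.weight(up), dim=1)       # (N, k*k, H, W)
        # Gather k*k shifted copies of the shallow map and take the
        # weighted sum at every position, as in Eq. (2).
        patches = F.unfold(shallow, self.k, padding=self.k // 2)
        n, c, h, w_ = shallow.shape
        patches = patches.view(n, c, self.k * self.k, h, w_)
        fused = (patches * w.unsqueeze(1)).sum(dim=2)
        return up + fused
```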
Stage 2 applies the non-local attention module, using the relationship between targets and global features within the feature map to strengthen the response of target regions. Following the module's definition, let $C$ be the number of channels, $s$ the spatial size, $G = s \times s$ the product of the spatial dimensions, and $x$ the input feature map; $q(x)$, $k(x)$, and $v(x)$ are defined as the results of different linear transformations:
$$ q(x^s) = (W_q^s)^{\rm T} x^s $$ (3)
$$ k(x^s) = (W_k^s)^{\rm T} x^s $$ (4)
$$ v(x^s) = (W_v^s)^{\rm T} x^s $$ (5)
where the coefficient matrices $W_q^s, W_k^s \in {\bf R}^{C \times C/8}$ and $W_v^s \in {\bf R}^{C \times C}$. Multiplying $q(x^s)$ by $k(x^s)$ gives the matrix $o^s \in {\bf R}^{G \times G}$; a softmax then converts each row of the matrix into probabilities, and multiplying by $v(x^s)$ and adding the input gives the output $x^{s\prime}$:
$$ x^{s\prime} = x^s + \left( o^s v^{\rm T}(x^s) \right)^{\rm T} $$ (6)
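A minimal PyTorch sketch of this module, assuming 1 × 1 convolutions realize the linear transformations and following the C/8 channel reduction stated above:

```python
# Sketch of the non-local attention module of Eqs. (3)-(6).
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)  # W_q
        self.k = nn.Conv2d(channels, channels // 8, 1)  # W_k
        self.v = nn.Conv2d(channels, channels, 1)       # W_v

    def forward(self, x):
        n, c, h, w = x.shape
        g = h * w  # G = s x s in the paper's notation
        q = self.q(x).view(n, -1, g)                    # (N, C/8, G)
        k = self.k(x).view(n, -1, g)                    # (N, C/8, G)
        v = self.v(x).view(n, c, g)                     # (N, C,   G)
        # o = q^T k with each row softmax-normalized into probabilities.
        attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)
        out = torch.bmm(v, attn.transpose(1, 2)).view(n, c, h, w)
        return x + out  # residual connection of Eq. (6)
```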
In the proposed feature pyramid, the $M_3$ and $M_4$ output by stage 1 have large spatial sizes, so feeding them directly to the non-local attention module is computationally expensive. To preserve the semantics of these two levels while fusing different levels once more, the structure pools $M_3$ and $M_4$ to the size of $M_5$, feeds the mean of the three levels into the non-local module, and interpolates the output back to feature maps of the corresponding sizes. The $M_6$ and $M_7$ feature maps are passed through the non-local module directly to obtain the $P_6$ and $P_7$ levels.
1.3 RoI feature extraction module
The region-of-interest (RoI) feature extraction module fixes the output size and extracts features that characterize the region inside a box, for use in subsequent network prediction. Our RoI extraction has two parts: horizontal-box extraction in the coarse adjustment stage and rotated-box extraction in the fine adjustment stage.
Targets in natural scene images usually appear in a fixed orientation, so two-stage detectors use horizontal-box RoI feature extraction. The most widely used RoI extractors are RoIPooling [4] and RoI Align [29]. Fig. 5(a) illustrates RoI pooling, which takes the maximum pixel value in each quantized bin as the pooled result; the quantization, however, biases the pixels extracted for small targets and hurts detection. Fig. 5(b) illustrates RoI Align, which removes the quantization and uses bilinear interpolation to compute the values of N floating-point coordinates per bin, averaged as the bin's result. This operation has two drawbacks: the number of sample points must be preset, and proposals of different sizes receive the same number of samples.
We therefore adopt the feature extraction operation of precise RoI (Pr-RoI) pooling [30], shown in Fig. 6: interpolation treats the features inside a bin as continuous, the integral over the bin gives the pixel sum, and its mean is taken as the bin's result:
$$ {\rm IRoIPool}(bin, {\cal F}) = \frac{\displaystyle\int_{y_1}^{y_2}\int_{x_1}^{x_2} f(x,y)\,{\rm d}x\,{\rm d}y}{(x_2 - x_1)\times(y_2 - y_1)} $$ (7)
where $f(x, y)$ is the pixel value obtained by area interpolation [15].
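Eq. (7) can be illustrated numerically as follows. This sketch approximates the integral with a dense bilinear sampling grid; Pr-RoI pooling [30] evaluates it in closed form, so the grid density here is an assumption made only for illustration.

```python
# Numeric sketch of Eq. (7): mean of the interpolated feature over one bin.
import torch
import torch.nn.functional as F

def iroi_pool_bin(feat, x1, y1, x2, y2, n=32):
    """feat: (1, C, H, W); bin corners in feature-map coordinates."""
    xs = torch.linspace(x1, x2, n)
    ys = torch.linspace(y1, y2, n)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    # grid_sample expects coordinates normalized to [-1, 1].
    h, w = feat.shape[-2:]
    grid = torch.stack([2 * gx / (w - 1) - 1, 2 * gy / (h - 1) - 1], dim=-1)
    samples = F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)
    # The mean of dense samples approximates integral / bin area.
    return samples.mean(dim=(-1, -2))
```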
For rotated-box RoI feature extraction, direct integration is complicated, so we treat the integral as the sum over a certain number of pixels inside the bin and take their mean as the bin's result:
$$ {\rm RRoIPool}(bin, {\cal F}) = \frac{\displaystyle\sum_{y=y_1}^{y_2}\sum_{x=x_1}^{x_2} f(x,y)}{N_x \times N_y} $$ (8)
$$ N_x = \left\lfloor \frac{x_2 - x_1}{l_x} \right\rfloor + 1, \quad N_y = \left\lfloor \frac{y_2 - y_1}{l_y} \right\rfloor + 1 $$ (9)
where $(x_1, y_1)$ and $(x_2, y_2)$ are the top-left and bottom-right corners of the rotated box laid in its horizontal position, and $l_x$ and $l_y$ are the sampling distances in the horizontal and vertical directions, as shown in Fig. 7. The number of sample points thus depends on the proposal size. However, too small a sampling distance greatly increases computation, so to balance efficiency and accuracy we set $l_x$ and $l_y$ to 0.4.
Let $(x_h, y_h)$ be a sample point of the rotated box in its horizontal position, $\theta$ the angle between the side corresponding to the width $w$ and the positive horizontal axis, and $(c_x, c_y)$ the box center. Eq. (10) maps the point to the coordinates $(x, y)$ in the rotated box, and area interpolation then gives the pixel value at that position:
$$ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & (1-\cos\theta)\,c_x + \sin\theta\, c_y \\ \sin\theta & \cos\theta & -\sin\theta\, c_x + (1-\cos\theta)\,c_y \end{bmatrix} \begin{bmatrix} x_h \\ y_h \\ 1 \end{bmatrix} $$ (10)
Our method resembles R3Det in refining the localization of rotated boxes. However, each refinement in R3Det predicts directly with convolutional layers, and since convolution slides horizontally, using it for rotated-box regression includes background pixels that disturb the prediction; our method instead extracts the feature information inside the rotated RoI for prediction, which is more conducive to improving detection performance.
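A sketch of Eqs. (8)-(10) in PyTorch, assuming θ is the box's rotation angle as defined above and letting bilinear sampling stand in for the area interpolation:

```python
# Sketch of rotated RoI pooling: sample an axis-aligned grid over the bin at
# spacing l_x = l_y = 0.4, rotate each point about the box center (Eq. (10)),
# and average the interpolated values (Eqs. (8)-(9)).
import math
import torch
import torch.nn.functional as F

def rroi_pool_bin(feat, x1, y1, x2, y2, cx, cy, theta, step=0.4):
    nx = int((x2 - x1) // step) + 1           # N_x of Eq. (9)
    ny = int((y2 - y1) // step) + 1           # N_y of Eq. (9)
    xs = x1 + step * torch.arange(nx, dtype=torch.float32)
    ys = y1 + step * torch.arange(ny, dtype=torch.float32)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    # Rotate the horizontal sample points (x_h, y_h) about (cx, cy) by theta.
    c, s = math.cos(theta), math.sin(theta)
    rx = c * (gx - cx) - s * (gy - cy) + cx   # Eq. (10)
    ry = s * (gx - cx) + c * (gy - cy) + cy
    h, w = feat.shape[-2:]
    grid = torch.stack([2 * rx / (w - 1) - 1, 2 * ry / (h - 1) - 1], dim=-1)
    vals = F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)
    return vals.mean(dim=(-1, -2))            # mean over N_x * N_y points
```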
1.4 Prediction branch structure
Object detection consists of two tasks: localization and classification. In general, two-stage detectors use fully connected prediction branches, while one-stage detectors use convolutional ones. Wu et al. [31] found that the two tasks suit different branch structures: fully connected layers fit classification better, and convolutional layers fit regression better. We therefore adopt the prediction branch structure shown in Fig. 8.
In our prediction branch, the classification structure is unchanged and remains fully connected, while the regression branch uses a series of ResBlock structures from ResNet (two in this paper).
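A PyTorch sketch of this head; the hidden sizes, RoI resolution, class count, and the final pooling of the regression branch are illustrative assumptions:

```python
# Sketch of the decoupled head of Fig. 8: FC branch for classification,
# two ResBlocks plus a linear layer for rotated-box regression.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection

class DecoupledHead(nn.Module):
    def __init__(self, c=256, roi=7, num_classes=16):  # 15 classes + background (assumed)
        super().__init__()
        self.cls = nn.Sequential(            # fully connected classification branch
            nn.Flatten(), nn.Linear(c * roi * roi, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes))
        self.reg = nn.Sequential(            # convolutional regression branch
            ResBlock(c), ResBlock(c),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, 5))

    def forward(self, x):                    # x: (N, C, roi, roi) RoI features
        return self.cls(x), self.reg(x)
```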
1.5 Training loss function
The loss function of the proposed network comprises the RPN stage $L_{\rm RPN}$, the coarse adjustment stage $L_{ro}$, and the fine adjustment stage $L_{re}$:
$$ L = L_{\rm RPN} + L_{ro} + L_{re} $$ (11)
The loss of each stage contains a classification term and a regression term. Classification uses the cross-entropy loss [4]. Regression uses the SmoothLn loss [32] of Eq. (12); compared with the smooth L1 loss [4], its first derivative exists and is continuous everywhere, giving good smoothness:
$$ SL_n(x) = (|x|+1)\ln(|x|+1) - |x| $$ (12)
$$ \frac{\partial SL_n(x)}{\partial x} = {\rm sign}(x)\cdot\ln\left({\rm sign}(x)\cdot x + 1\right) $$ (13)
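The SmoothLn loss of Eq. (12) is straightforward to express; a minimal sketch:

```python
# SmoothLn regression loss of Eq. (12); its gradient sign(x)*ln(|x|+1)
# is continuous everywhere (Eq. (13)).
import torch

def smooth_ln_loss(pred, target):
    x = (pred - target).abs()
    return ((x + 1) * torch.log(x + 1) - x).mean()
```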
In addition, the RPN stage of Eq. (11) regresses horizontal boxes, represented by the four values $x, y, w, h$. The coarse and fine stages regress rotated boxes, represented by the five values $x, y, w, h, \theta$, so the rotated-box regression targets are defined as:
$$ \begin{bmatrix} t_x \\ t_y \end{bmatrix} = \begin{bmatrix} \dfrac{1}{w_a} & 0 \\ 0 & \dfrac{1}{h_a} \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_t - x_a \\ y_t - y_a \end{bmatrix} $$ (14)
$$ t_w = \log_2\left(\frac{w_t}{w_a}\right), \quad t_h = \log_2\left(\frac{h_t}{h_a}\right) $$ (15)
$$ t_\theta = (\theta_t - \theta_a)\;{\rm mod}\;2\pi $$ (16)
where $x, y, w, h, \theta$ are the horizontal and vertical coordinates of a rotated box's center, the box width and height, and the rotation angle, and the subscripts $t$ and $a$ (as in $x_t, x_a$) denote the ground-truth box and the proposal, respectively.
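A sketch of the target encoding of Eqs. (14)-(16), assuming the rotation in Eq. (14) uses the anchor angle θ_a:

```python
# Rotated-box regression target encoding, Eqs. (14)-(16).
import math

def encode_rbox(gt, anchor):
    xt, yt, wt, ht, tt = gt          # ground-truth (x, y, w, h, theta)
    xa, ya, wa, ha, ta = anchor      # anchor/proposal (x, y, w, h, theta)
    c, s = math.cos(ta), math.sin(ta)  # anchor angle assumed for Eq. (14)
    dx, dy = xt - xa, yt - ya
    tx = (c * dx + s * dy) / wa      # center offsets in the anchor frame
    ty = (-s * dx + c * dy) / ha
    tw = math.log2(wt / wa)          # Eq. (15)
    th = math.log2(ht / ha)
    ttheta = (tt - ta) % (2 * math.pi)  # Eq. (16)
    return tx, ty, tw, th, ttheta
```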
2. Experimental results and analysis
Experiments are run on a server with an Intel E5-2683 CPU, NVIDIA GTX 1080Ti GPUs, and 64 GB of memory, under Ubuntu 16.04.4 with CUDA 9.0, cuDNN 7.4.2, PyTorch 1.1.0, and Python 3.7.
Training uses 3 GPUs with a batch size of 3 (limited by GPU memory), and input images are uniformly sized to 1024 × 1024. The network is trained for 15 epochs using stochastic gradient descent with weight decay 0.0001 and momentum 0.9; the initial learning rate is 0.01 and is reduced by a factor of 10 at epochs 8, 11, and 14. Fig. 9 plots the training loss curve on the DOTA dataset (one epoch is 4500 iterations); a clear loss drop appears at epoch 8 (36000 iterations).
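This schedule maps directly onto PyTorch's optimizer and scheduler APIs. In the sketch below, `model` and `dataloader` are placeholders for the R2-FRCNN network and the DOTA loader, which are assumptions here (the paper's code is not shown):

```python
# Sketch of the training schedule: SGD with momentum 0.9, weight decay 1e-4,
# lr 0.01 divided by 10 at epochs 8, 11, and 14, for 15 epochs.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[8, 11, 14], gamma=0.1)

for epoch in range(15):
    for images, targets in dataloader:
        loss = model(images, targets)   # L = L_RPN + L_ro + L_re, Eq. (11)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                    # step the lr decay once per epoch
```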
2.1 Experimental dataset
We use DOTA [21] to evaluate the algorithm. DOTA is a large public dataset annotated with rotated boxes, used mainly for object detection in remote sensing images. It contains 2806 images collected from various sensors and platforms, ranging from 800 × 800 to 4000 × 4000 pixels and covering diverse scales, orientations, and shapes. Experts annotated these images with 15 common categories, 188282 object instances in total: plane, baseball diamond, bridge, ground track field, small vehicle, large vehicle, ship, tennis court, basketball court, storage tank, soccer-ball field, roundabout, harbor, swimming pool, and helicopter. Half of the images form the training set, 1/6 the validation set, and 1/3 the test set, whose annotations are not public. To reduce the impact of compressing high-resolution images on small targets, we crop all images into 1024 × 1024 sub-images with an overlap of 200 pixels.
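The cropping can be sketched as a sliding-window computation (images smaller than the window would need padding, which is omitted in this sketch):

```python
# Split a (width x height) image into 1024x1024 windows with 200-px overlap.
def crop_windows(width, height, size=1024, overlap=200):
    stride = size - overlap  # 824
    xs = list(range(0, max(width - size, 0) + 1, stride)) or [0]
    ys = list(range(0, max(height - size, 0) + 1, stride)) or [0]
    # Add an extra window so the right/bottom edges are fully covered.
    if xs[-1] + size < width:
        xs.append(width - size)
    if ys[-1] + size < height:
        ys.append(height - size)
    return [(x, y, x + size, y + size) for y in ys for x in xs]
```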
2.2 Comparison of detection results
This section's experiments use ResNet50 combined with deformable convolution [33] as the backbone. To evaluate performance, all experiments use the officially provided training and test sets, and the results are obtained by submitting to the DOTA evaluation server; our method achieves a mean average precision of 0.7602, surpassing the official baselines [21].
Besides the official baselines, this section also compares against R2CNN [10], RoI Transformer [12], CADNet [13], SCRDet [15], R3Det [16], and GV R-CNN [17]; Table 1 lists the detection results of each method.
Table 1 shows that our method outperforms the others, reaching 76.02% mAP, with the highest accuracy on the bridge, small vehicle, large vehicle, ship, and harbor categories. As Fig. 10 shows, these targets are small and often densely arranged in remote sensing data, indicating that our method is more advantageous in such scenes. For larger targets such as planes, tennis courts, basketball courts, storage tanks, and swimming pools, our results remain close to the best achieved by the other methods. These results show that the method detects targets in remote sensing images effectively.
2.3 Ablation experiments
1) Effect of each module on detection accuracy
To verify the effectiveness of each module, this section runs a series of comparison experiments. Table 2 reports the network's results on DOTA under different module settings, where "√" marks an enabled setting and ConvFc denotes the prediction branch structure designed in Section 1.4. The analysis is as follows:
Table 2 Ablation results of R2-FRCNN modules

| Module | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Baseline | √ | √ | √ | √ | √ | √ | √ |
| Fine adjustment | | √ | √ | √ | √ | √ | √ |
| IRoIPool | | | √ | √ | √ | √ | √ |
| RRoIPool | | | | √ | √ | √ | √ |
| PFPN | | | | | √ | √ | √ |
| SmoothLn | | | | | | √ | √ |
| ConvFc | | | | | | | √ |
| mAP (%) | 69.52 | 73.62 | 73.99 | 74.31 | 74.97 | 75.13 | 75.96 |

a) Baseline. The extended Faster R-CNN OBB [21] serves as the rotated-box detection baseline: a ResNet50 [22] backbone with a feature pyramid [23], RoI Align [29] for RoI extraction, and the smooth L1 loss [4] for the regression branch. To ensure fairness and accuracy, all subsequent experiments use strictly identical parameter settings.
b) Fine adjustment. In the fine adjustment stage, the initial proposal feature extraction uses rotated RoI Align (RRoI Align), i.e., RoI Align [29] applied to rotated boxes. Table 2 shows that adding the fine stage improves detection substantially, raising mAP by 4.10%. This confirms that further adjustment using the pixels inside rotated proposals is worthwhile: the stage avoids the excess background pixels that horizontal-box extraction includes, improving detection of targets with large aspect ratios. We also found that repeating the refinement helps little: going from one to two adjustments yields 73.68% mAP, a gain of only 0.06%, so to limit the parameter count the remaining experiments use a single refinement pass.
c) RoI feature extraction. Here the IRoIPool and RRoIPool of Section 1.3 replace the initial RoI Align and RRoI Align of the two adjustment stages. Table 2 shows that, relative to the initial extractors, IRoIPool raises mAP by 0.37% and RRoIPool by a further 0.32%, indicating that our RoI extraction designs are more effective. The structures of these two extractors are studied further below.
d) PFPN structure. To better verify the role of PFPN, we design two groups of experiments. In the first, the pyramid applies neither the scale transformation nor the non-local module between deep and shallow layers; only $1 \times 1$ convolutions convert the feature channels to 256, with all other structures and training hyperparameters unchanged. The mAP is then only 64.55%; since DOTA contains many small targets, this shows PFPN's marked effect on small-target detection. The second group is reported in Table 2: PFPN raises mAP by 0.66% over FPN, indicating that the proposed structure is more effective for remote sensing targets.
e) Prediction branch. This part covers two experiments: the regression loss function and the branch structure. Table 2 shows that replacing smooth L1 with SmoothLn raises mAP by 0.16%. Further, with the branch of Section 1.4, using fully connected layers for classification and convolutional layers for regression while adding only two ResBlocks, mAP rises by 0.83%. SmoothLn and convolutional regression are thus better suited to rotated-box detection.
2) Study of the RoI feature extraction module
This section studies how different RoI feature extraction structures affect accuracy, in two parts: horizontal-proposal extraction and rotated-proposal extraction. Tables 3 and 4 report the results.
Table 3 Results of different horizontal-box feature extraction methods (all settings: baseline + fine adjustment)

| Method | RoIPooling | RoI Align | IRoIPool |
|---|---|---|---|
| mAP (%) | 71.21 | 73.62 | 73.99 |

Table 4 Results of different rotated-box feature extraction methods (all settings: baseline + fine adjustment + IRoIPool)

| Method | RRoI A-Pooling | RRoI Align | RRoIPool |
|---|---|---|---|
| mAP (%) | 73.38 | 73.99 | 74.31 |

Table 3 shows that RoIPooling yields relatively low accuracy, its quantization hurting small targets. RoI Align removes the quantization, and its interpolation raises mAP by 2.41%, showing that extracting continuous features benefits detection. Our method adds the integral operation on top of area interpolation, raising mAP by a further 0.37%: whereas the former samples a fixed number of points, the integral effectively samples many more, extracting richer features and improving detection.
Table 4 reports the rotated-box extractors. The first method, rotated RoI average pooling (RRoI A-Pooling), samples the pixels inside the rotated box and averages them as the extracted feature. The second selects floating-point coordinates inside the rotated box in the manner of RoI Align and obtains values by bilinear interpolation, raising mAP by 0.61%. Our RRoIPool chooses the number of sample points according to box size and improves on the second by 0.32%, showing that our rotated-box feature extraction is better suited to the fine adjustment module.
3. Conclusion
Deep-learning-based object detection has made great progress on natural scene images, but remote sensing images pose difficulties, complex backgrounds, many small targets, and arbitrary orientations, that common detectors do not satisfy. This paper therefore proposes R2-FRCNN, a rotated-box detection network combining coarse and fine adjustment stages for remote sensing image detection. We design the pixel-recombination pyramid structure to improve small-target detection in complex backgrounds, the horizontal-box extractor IRoIPool for the coarse stage, and the rotated-box extractor RRoIPool for the fine stage. In addition, we adopt the SmoothLn regression loss and a prediction branch combining fully connected and convolutional layers to further improve accuracy. Experiments show that the method achieves good detection results on the large public DOTA dataset. Its drawbacks are relatively slow detection and heavy GPU consumption, so future work will study lightweight versions of the network.