A Review of Disentangled Representation Learning

Wen Zai-Dao, Wang Jia-Rui, Wang Xiao-Xu, Pan Quan

Mou Yong-Qiang, Fan Bao-Jie, Sun Chao, Yan Rui, Guo Yi-Shi. Towards accurate price tag recognition algorithm with multi-task RNN. Acta Automatica Sinica, 2022, 48(2): 608−614. doi: 10.16383/j.aas.c190633
Citation: Wen Zai-Dao, Wang Jia-Rui, Wang Xiao-Xu, Pan Quan. A review of disentangled representation learning. Acta Automatica Sinica, 2022, 48(2): 351−374. doi: 10.16383/j.aas.c210096

doi: 10.16383/j.aas.c210096

Funds: Supported by National Natural Science Foundation of China (61806165, 61790552, 61801020) and the Natural Science Basic Research Plan in Shaanxi Province of China (2020JQ-196)

Author Bio:

WEN Zai-Dao Associate professor at the School of Automation, Northwestern Polytechnical University. His research interest covers compressed sensing and sparse models, cognitive machine learning, synthetic aperture radar image interpretation, and multi-source automatic target recognition. E-mail: wenzaidao@nwpu.edu.cn

WANG Jia-Rui Ph.D. candidate at the School of Automation, Northwestern Polytechnical University. Her research interest covers disentangled representation learning, SAR image processing, and causal reasoning. E-mail: wangjiarui_wyy163@163.com

WANG Xiao-Xu Professor at the School of Automation, Northwestern Polytechnical University. His research interest covers inertial devices and inertial navigation, synthetic aperture radar image interpretation, and cooperative sensing. Corresponding author of this paper. E-mail: woyaofly1982@163.com

PAN Quan Professor at the School of Automation, Northwestern Polytechnical University. His research interest covers information fusion theory and applications, target tracking and recognition technology, and spectral imaging and image processing. E-mail: quanpan@nwpu.edu.cn

  • Abstract: In the era of big data, deep learning, known for its efficient and autonomous implicit feature extraction, has sparked a new wave of artificial intelligence. However, the black-box, uninterpretable "shortcut learning" behavior behind it has become a key bottleneck restricting its further development. By exploring the complexity of the physical mechanisms and logical relations embedded in big data, disentangled representation learning disentangles the multi-level, multi-scale latent generative factors of data from the perspective of data generation, enabling deep network models to perceive data autonomously and intelligently, as humans do. It has gradually become an important research direction in the new generation of complexity-based interpretable deep learning, with significant theoretical and application value. This paper systematically reviews the research progress of disentangled representation learning, classifies and describes its key techniques and typical methods, analyzes and summarizes the applicable scenarios of existing algorithms together with visualized experimental results, and finally points out future development trends and directions worth studying.
Channel auditing is an essential step in both traditional retail and the fast-moving-consumer-goods "new retail" that has emerged in recent years. Traditional practice relies on on-site inspection by sales representatives or on third-party outsourced audits, both of which suffer from large manual error, long audit cycles, high cost, and erroneous records that cannot be traced. With the rapid development of deep learning, AI (artificial intelligence) has become a byword for advanced technology, and AI applications keep emerging across industries. Deep-learning-based image recognition, with its high accuracy and strong generalization, is well suited to channel auditing and provides powerful support for it. Channel auditing involves two main recognition tasks, SKU (stock keeping unit) recognition and price tag recognition; this work targets the latter. Price, as the cornerstone of sales data, is highly sensitive to recognition accuracy, yet current deep-learning-based price tag recognition is easily affected by appearance style and imaging quality, such as blur, tilt, and uneven illumination. How to overcome the complex conditions encountered in practice and accurately recognize the information on price tags is therefore an important goal in OCR (optical character recognition).

    Most widely applied price tag recognition algorithms build on text recognition. Recognition based on the convolutional recurrent neural network (CRNN) [1] brought a breakthrough for sequence recognition tasks and opened a door for the text recognition field. Numerous algorithms based on CRNN variants and various attention mechanisms [2-3] followed; compared with the plain CRNN, the added attention mechanisms mainly model the relevance within the input, which significantly improves the recognition accuracy of generic text.

    Current text recognition research at home and abroad generally focuses on character sequences without symbols. For symbol-bearing sequences such as price tags, algorithms that excel on generic text datasets [4-5] perform unsatisfactorily. We therefore propose a multi-task convolutional recurrent network that effectively improves price tag recognition accuracy.

    In price tag recognition, accurately recognizing the decimal point, which occupies very few pixels or is omitted outright, is a very difficult task and the key point that distinguishes price tags from other text recognition problems. Most existing algorithms recognize the whole price tag indiscriminately; because price tags come in many styles and are subject to various objective imaging factors, the decimal point's features in the image are weak, and even context-based sequence recognition algorithms struggle to localize it accurately. We therefore propose to separate the integer part from the decimal part and recognize the whole collaboratively, achieving accurate decimal point localization. An end-to-end multi-task training strategy is used for learning, reducing training difficulty. Experiments show that the proposed method not only achieves superior overall recognition accuracy but also surpasses previous deep learning methods on decimal point recognition.

    Since no open dataset covers the specific price tag scenario, we release the price tag dataset used in our experiments for research use. It was collected from real shelf images and covers different styles, shooting angles, and illumination changes, comprising 10 000 training images, 1 000 test images, and 1 000 hard test images (prices that are handwritten, blurred, or affected by other factors). The digit regions in the training and test sets are relatively clear and legible, whereas those in the hard test set mostly contain distractors (such as reflections, ghosting, and double price labels) and are less legible. In addition, to further verify the generalization ability of the proposed method, we also ran experiments on a similar license plate dataset; the results confirm the effectiveness of the proposed scheme.

    OCR (optical character recognition, which now broadly covers all image text detection and recognition techniques) has long been one of the important research directions in image recognition. With the leap in deep learning research, scene text recognition algorithms have proliferated, setting off round after round of benchmark competition.

    CRNN, a common OCR model mainly used for image sequence recognition, consists of convolutional, recurrent, and transcription layers, as shown in Fig. 1. It can be divided into the following stages: first, the preprocessed input image passes through a deep convolutional neural network, yielding a high-level feature map; then each column (or group of columns) of the feature map is fed, as one time step, into a recurrent layer built from bidirectional LSTM (long short-term memory) networks, which outputs a sequence of labels (the label distribution, i.e., a list of probabilities over the true results, for every feature vector in the sequence). The transcription layer uses CTC (connectionist temporal classification) [6] or another efficient sequence classification method [7] to process the label sequence output by the recurrent layer, integrating all possible "character localization" results into the final recognition result. A minimal code sketch of this pipeline is given after Fig. 1.

    Fig. 1  The structure of the convolutional recurrent neural network
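As a concrete illustration of the pipeline just described, here is a minimal sketch assuming PyTorch; the channel counts and layer depths are illustrative, not those of the original CRNN, and the per-step logits would feed a CTC decoder such as `nn.CTCLoss` over their log-probabilities.

```python
# Minimal CRNN sketch: CNN feature extractor -> BiLSTM -> per-time-step
# class scores, to be decoded with CTC. Sizes are illustrative only.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Convolutional layers: collapse height, keep width as the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # squeeze height to 1
        )
        # Recurrent layers: two stacked bidirectional LSTMs over width slices.
        self.rnn = nn.LSTM(256, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)  # per-step label scores

    def forward(self, x):                      # x: (B, 1, H, W)
        f = self.cnn(x)                        # (B, C, 1, W')
        f = f.squeeze(2).permute(0, 2, 1)      # (B, W', C): width as time
        h, _ = self.rnn(f)                     # (B, W', 512)
        return self.fc(h)                      # logits for CTC decoding

logits = CRNN(num_classes=11)(torch.randn(2, 1, 32, 100))
print(logits.shape)                            # torch.Size([2, 25, 11])
```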

    Although the CRNN structure can in principle predict arbitrary sequence correspondences, in practice encoding and decoding accuracy depends heavily on the semantic vector. Information is lost when the encoder compresses the input into the semantic vector, and deviations in that vector severely affect decoder accuracy. Moreover, the decoder uses the same content vector at every time step, which also degrades decoding accuracy to some extent. To address these problems, an attention mechanism was added to the CRNN model [8].

    Different attention mechanisms treat sequences differently. In the widely used formulation [8], the encoder first encodes the input into a sequence of vectors; at each decoding time step, the attention model selects a subset of that vector sequence for the output prediction, with the selection based on the relevance between the decoder hidden state and the input sequence. This mechanism guarantees that each output can locate the information in the input sequence that currently deserves focus, which also means every output is conditioned on a different semantic vector.
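The selection step described above can be sketched as additive attention. This is a hypothetical minimal form assuming PyTorch; the exact scoring function used in [8] may differ.

```python
# One additive-attention step: score encoder outputs against the decoder
# state, softmax into weights, and mix a per-step context vector.
import torch
import torch.nn.functional as F

def attention_step(dec_h, enc_out, W_q, W_k, v):
    # dec_h: (B, H) decoder hidden state; enc_out: (B, T, H) encoder outputs
    q = W_q(dec_h).unsqueeze(1)                  # (B, 1, A)
    k = W_k(enc_out)                             # (B, T, A)
    scores = v(torch.tanh(q + k)).squeeze(-1)    # (B, T) relevance scores
    alpha = F.softmax(scores, dim=1)             # attention weights
    context = (alpha.unsqueeze(-1) * enc_out).sum(dim=1)  # (B, H)
    return context, alpha

B, T, H, A = 2, 25, 512, 128
W_q, W_k, v = torch.nn.Linear(H, A), torch.nn.Linear(H, A), torch.nn.Linear(A, 1)
ctx, alpha = attention_step(torch.randn(B, H), torch.randn(B, T, H), W_q, W_k, v)
```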

    Single-task learning models in deep learning usually focus on optimizing one specific metric, such as classification accuracy, recognition accuracy, or a regression score. Starting from a trained baseline model, we keep fine-tuning until the results can no longer be improved. While this yields results above the baseline, it selectively ignores other information that might improve the target metric.

    Unlike single-task models that concentrate on one metric, multi-task learning shares representations across related tasks, allowing the model to learn the original task better. To some extent, multi-task learning can be regarded as an extension of human learning: prior knowledge from human learning links the representation information of multiple tasks. From an information-theoretic perspective, multi-task learning can be viewed as a form of inductive transfer.

    Analyzing the price tag data, the greatest difficulty in recognition is localizing the decimal point. As shown in Fig. 2, the decimal point is often blurred or omitted, and even single-task end-to-end networks designed for complex text [9] struggle to localize it. We therefore propose to split the price tag into its integer and decimal parts and learn the decimal point's feature information jointly through a multi-task strategy; the split is illustrated in Fig. 3. This strategy requires prior knowledge of the price tag data structure: the branch results are later concatenated with the decimal point to obtain the complete price.

    Fig. 2  Images of some price tag samples

    In computer vision, the most common form of multi-task learning shares the parameters of the convolutional layers [10] while learning the remaining task-specific layers independently.

    CRNN and its variants have achieved excellent results on generic text datasets such as COCO [11] and ICDAR2015 [12], proving the effectiveness of the approach. Reference [13] summarized representative text recognition architectures of recent years and, through experimental analysis, identified the best-performing CRNN configuration on natural text datasets.

    Following the CRNN approach, we extract text features with a convolutional network, slice them along the width as input features to the recurrent layer to obtain the label distribution of the feature sequence, and then convert the feature sequence into the final recognition result with an LSTM-based encoder-decoder; the architecture is shown in Fig. 4.

    Fig. 4  The structure of our basic single-task recognition network

    Unlike general joint learning [14], the multi-task model designed here is built on the knowledge that price tag data has a decomposable structure. The overall architecture is shown in Fig. 5, where IB (integer branch) denotes the integer branch, DB (decimal branch) the decimal branch, and NDPB (no decimal point branch) the digit branch with the decimal point removed, cf. Fig. 3. The branch structures are completely identical; after the shared feature extraction stage, they learn information from different receptive fields of the sequence. The no-decimal-point branch serves as an auxiliary loss that suppresses overfitting of the integer and decimal branches, and all branches jointly optimize the parameters of the shared convolutional block. The three branches share the same network structure and loss function and are optimized against different labels, which greatly simplifies the training procedure. We choose a three-branch model because of the particularity of the application scenario; in the experiments we also report and analyze the results of different branch combinations. A code sketch of this layout is given after Fig. 3.

    Fig. 5  The structure of the multi-task RNN
    Fig. 3  Baseline method compared with the multi-branch method
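A minimal sketch of the three-branch layout just described, assuming PyTorch; the head internals below are abbreviated stand-ins for the paper's BiLSTM-plus-attention decoders, and `make_head` is a hypothetical factory argument.

```python
# Shared convolutional trunk with three structurally identical branches.
import torch
import torch.nn as nn

class MultiTaskPriceNet(nn.Module):
    def __init__(self, backbone: nn.Module, make_head, num_classes: int):
        super().__init__()
        self.backbone = backbone               # shared convolutional block
        self.ib = make_head(num_classes)       # IB: integer branch
        self.db = make_head(num_classes)       # DB: decimal branch
        self.ndpb = make_head(num_classes)     # NDPB: no-decimal-point branch

    def forward(self, x):
        f = self.backbone(x)                   # shared features
        return self.ib(f), self.db(f), self.ndpb(f)

# Toy usage with stand-in modules (real heads are BiLSTM + attention decoders).
net = MultiTaskPriceNet(nn.Conv2d(1, 8, 3, padding=1),
                        lambda c: nn.Conv2d(8, c, 1), num_classes=11)
outs = net(torch.randn(2, 1, 96, 200))
```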

    Compared with single-task methods, the proposed multi-task mechanism is also easier to analyze: for price tag recognition, the multi-task structure lets us quantify the model's recognition accuracy on the integer and decimal parts separately and thus analyze misjudgments. With a rough estimate of each branch's recognition difficulty, corresponding training strategies can be designed, such as the trainable hyperparameter weight fused into the no-decimal-point branch, a strategy that brings a considerable improvement in model accuracy.

    The decoding stage uses a unidirectional LSTM as the decoder network, augmented with a recurrent attention mechanism, as shown in Fig. 6. All branches of the proposed multi-task model decode in the same way. The per-branch loss is the cross-entropy of Eq. (1), where M is the number of sequences per batch and N the number of time steps of the decoder LSTM. The network loss is set to the sum of the integer loss and the decimal loss, with the no-decimal-point branch loss multiplied by a hyperparameter η as a regularization term; the overall loss is given in Eq. (2). The motivation for this design is that in real scenarios the decimal part is very often all zeros, so the network risks overfitting. Training the somewhat harder no-decimal-point branch acts as a regularizer, and the added hyperparameter is trainable, adapting according to feedback from the validation set; in our experiments we suggest a value of 0.5.

    Fig. 6  Flowchart of the attention-based decoder network

    Improved loss functions [15] and other refinement strategies could raise model accuracy further; we plan to incorporate them into our work in the future.

    $$ L = -\frac{1}{MN}\sum_{i = 1}^{M}\sum_{j = 1}^{N} y_{i,j}\ln\left( s_{i,j} \right) $$ (1)
    $$ L = L_{\text{integer}} + L_{\text{decimal}} + \eta L_{\text{NDPB}} $$ (2)
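Equations (1) and (2) can be sketched as follows, assuming PyTorch; `branch_ce` and `total_loss` are hypothetical helper names, and η must be added to the optimizer's parameter list for it to be trained as described.

```python
# Combined objective of Eqs. (1)-(2): per-branch cross-entropy averaged over
# M sequences and N decoder steps, plus a trainable weight eta (initialised
# at the suggested 0.5) on the no-decimal-point branch.
import torch
import torch.nn.functional as F

eta = torch.nn.Parameter(torch.tensor(0.5))    # learned via validation feedback

def branch_ce(logits, targets):
    # logits: (M, N, C); targets: (M, N) class indices -> Eq. (1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

def total_loss(ib_out, db_out, ndpb_out, ib_y, db_y, ndpb_y):
    return (branch_ce(ib_out, ib_y) + branch_ce(db_out, db_y)
            + eta * branch_ce(ndpb_out, ndpb_y))          # Eq. (2)
```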

    To improve model performance, the training data are preprocessed before training the network. Our dataset comes from real shelf images and is rich and diverse, covering different design styles as well as changes of angle and illumination. Images are normalized to a common size (96×200 in this work), and the labels are processed accordingly: for example, an original price label of 79.99 yields the integer label 79, the decimal label 99, and the no-decimal-point label 7999.
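The label processing just described might look like the following hypothetical helper, assuming prices with at most two decimal digits (padding with "00" when the decimal part is absent is our assumption, not stated in the text).

```python
# "79.99" -> integer "79", decimal "99", no-decimal-point "7999".
def split_price_label(price: str):
    integer, _, decimal = price.partition('.')
    decimal = (decimal + '00')[:2]      # pad/truncate to two digits (assumed)
    return integer, decimal, integer + decimal

print(split_price_label('79.99'))       # ('79', '99', '7999')
print(split_price_label('7'))           # ('7', '00', '700')
```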

    The preprocessed image is fed into the convolutional block, producing a 12×25×512 high-level feature. Slicing along the width, it is reshaped into a 25×6144 sequence and fed into the recurrent layer, which, as described above, is a stack of bidirectional LSTMs. Decoding yields an output at each time step, and cross-entropy against the labels drives training. For our two-branch networks, the final output is the merger of the two branch results. Taking the no-decimal-point branch and the integer branch as an example, truncating the no-decimal-point result along the integer result gives the decimal part, and joining the parts with a decimal point outputs the complete price.
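The width-wise slicing and the branch-merging rule can be sketched as below; the shapes follow the text, and `merge` is a hypothetical helper.

```python
# 12x25x512 feature map -> 25-step sequence of 6144-dim vectors, plus the
# NDPB/IB merging rule that rebuilds the full price string.
import torch

f = torch.randn(1, 512, 12, 25)                  # (B, C, H, W) conv output
seq = f.permute(0, 3, 1, 2).reshape(1, 25, -1)   # (B, 25, 6144): width as time

def merge(ndpb: str, integer: str) -> str:
    decimal = ndpb[len(integer):]                # truncate along the integer part
    return f"{integer}.{decimal}"

print(merge("7999", "79"))                       # "79.99"
```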

    For model training, we provide several strategies that raise accuracy. Considering real-world imaging conditions, data augmentation with random saturation adjustment and random rotation markedly improves the model's generalization ability. Because the overall network is deep, a large initial learning rate is needed to speed up convergence; experiments show that an initial learning rate of 0.3 with stochastic gradient descent works best.
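A sketch of this training configuration, assuming PyTorch and torchvision; the jitter and rotation magnitudes are illustrative assumptions, and only the 0.3 initial learning rate comes from the text.

```python
# Augmentation: random saturation jitter and small random rotations; then SGD
# with the reported initial learning rate of 0.3.
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(saturation=0.5),     # magnitude is an assumption
    T.RandomRotation(degrees=10),      # magnitude is an assumption
    T.ToTensor(),
])

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.3)
```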

    3.2.1   Multi-task structure analysis

    The purpose of our experiments is to demonstrate the contribution of the multi-task mechanism to specially structured text, so for the baseline we only analyze the backbone structures of state-of-the-art scene text recognition algorithms [13] experimentally, leaving their training tricks aside for now. As Table 1 shows, the structure combining a ResNet convolutional block, a BiLSTM recurrent layer, and attention-based decoding achieves the highest accuracy.

    Table 1  Study of modules (%)

    Model                General-data    Hard-data
    VGG-BiLSTM-CTC       50.20           20.20
    VGG-BiLSTM-Attn      61.20           38.60
    ResNet-BiLSTM-CTC    55.60           28.80
    ResNet-BiLSTM-Attn   68.10           41.40

    We adopt the best-performing model from [13] as the baseline and compare it experimentally with our multi-task branches. Following the structural split of the price tag, the recognition task divides into no-decimal-point branch recognition (NDPB), integer branch recognition (IB), and decimal branch recognition (DB); we tested several branch combinations, with accuracy results in Table 2. In contrast to its outstanding results on text recognition, the baseline model struggles to achieve satisfactory results on the price tag dataset, whereas the proposed multi-task model is well suited to this specific scenario. To highlight the advantage of multiple branches, we visualize the baseline scheme against the output of each branch: Fig. 7 shows how three-branch recognition sidesteps the difficult decimal point and infers the final recognition result from the branch outputs. The experimental results show that every two-branch combination outperforms the baseline model, validating our original idea of splitting the information: recognizing each branch independently via multi-task learning is effective and benefits the final result. The IB-and-DB model and the NDPB-and-DB model achieve the best scores on the general and hard test sets respectively, reflecting the combination of an optimal split of the data structure with the corresponding multi-task model. Fusing the no-decimal-point branch, as regularization, on top of the integer and decimal branches improves our multi-task model a step further, reaching the best results of 93.20% on the general test set and 75.20% on the hard test set.

    Table 2  Results of the multi-task model (%)

    Model             General-data    Hard-data
    Baseline [13]     68.10           41.40
    NDPB & IB         90.10           72.90
    NDPB & DB         91.70           74.30
    IB & DB           92.20           73.20
    NDPB & IB & DB    93.20           75.20
    Fig. 7  Compared with the single-branch method

    The results show that the multi-task mechanism fully and effectively solves the price tag recognition problem: without any other optimization, the multi-task mechanism alone achieves excellent results. End-to-end models are now the mainstream of deep learning, yet some special tasks, such as the decimal point in price tags, are hard to solve with a single-task end-to-end model. We therefore suggest that analyzing the data structure and performing separated, jointly trained multi-task recognition is a feasible solution.

    3.2.2   Model analysis

    The proposed price tag recognition network effectively improves recognition accuracy, and the method can also be applied to other OCR scenarios. To verify its transferability, we chose the license plate scenario [16], which has a similarly decomposable data structure, and compared the proposed method with the strong TE2E [17] and CCPD [16] networks on CCPD, currently the largest license plate dataset. In the tests, we split the plate into province, city, and plate-number parts and recognize them with a three-branch structure. The CCPD test set includes various complex conditions, such as uneven illumination, tilted angles, and rain or snow; the results are shown in Table 3. Our method outperforms the compared methods throughout, with especially clear accuracy gains on the complex-scenario test sets.

    Table 3  Experimental results on the license plate dataset (%)

    Method        DB      FN      Rotate   Tilt    Weather   Challenge
    TE2E [17]     96.90   94.30   90.80    92.50   87.90     85.10
    CCPD [16]     96.90   94.30   90.80    92.50   87.90     85.10
    Our method    98.24   98.81   98.12    98.79   98.19     91.92

    In the license plate application, the Chinese character can serve as the split point, dividing the multi-branch structure into a Chinese-character branch, a digits-and-letters branch, and a full-plate branch (see the sketch below). As before, the full-plate branch is fused into the network as a regularization term to prevent the other branches from overfitting. The results in Table 3 show that for this specific scenario our multi-task model clearly outperforms the original works, with accuracy improved on every test set. Compared with the commercially applied TE2E and the academically studied CCPD, gains on individual test sets reach as much as 10%, providing strong evidence for the strategy of jointly learning receptive fields with a multi-task mechanism. The proposed method targets image text whose information can be split into independent parts, such as the integer and decimal parts of price tags, or the Chinese character and alphanumeric parts of license plates. The experiments show that the proposed data-structure-driven multi-task learning method performs well and generalizes better to complex scenes, achieving good results on both the hard test set of the price tag dataset and the various complex-scene plate subsets of the license plate dataset.
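The plate split described here might be implemented as the following hypothetical helper; the example plate string is illustrative, and splitting at the first character assumes standard mainland plates whose province abbreviation is the single leading Chinese character.

```python
# Chinese-character branch, digits-and-letters branch, and full-plate branch;
# the full plate plays the regularising role of the NDPB branch.
def split_plate(plate: str):
    han = plate[0]          # province abbreviation (Chinese character)
    rest = plate[1:]        # city letter + serial (letters and digits)
    return han, rest, plate

print(split_plate("皖A12345"))   # ('皖', 'A12345', '皖A12345')
```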

    For the price tag recognition application in new retail, this paper proposes a multi-task price tag recognition network. Exploiting the data structure of this scenario-specific image text, it processes the whole separately: the integer and decimal branches are recognized first in place of the complete price, and the decimal point is appended last, solving the problem that the decimal point is hard to recognize. Our network adopts the convolutional recurrent structure, decodes the sequence with a recurrent attention mechanism, and combines the multi-task learning mechanism with specific domain knowledge to jointly learn feature information that is hard to localize. On our open-sourced price tag dataset, the proposed method clearly improves accuracy over current mainstream text recognition algorithms, and it also performs very well on the structurally similar license plate dataset. Our work currently only targets images with a specific text structure and generalizes poorly to generic text; future work will study the feasibility of the multi-task mechanism on generic text.

  • Fig. 1  Humans' hierarchical intelligent perception of a traffic scene
    Fig. 2  Examples of "shortcut learning" in DNNs [21]
    Fig. 3  Taxonomy of decision rules [21]
    Fig. 4  Illustration of the retinal transformation [60]
    Fig. 5  The AIR framework [64]
    Fig. 6  Deep ladder network models
    Fig. 7  Structure of a simple latent tree variational auto-encoder [73]
    Fig. 8  Structure of the RCN [74]
    Fig. 9  Samples from remote sensing ship group images
    Fig. 10  The structure of the GSL model [78] when applied to the remote sensing ship image group data set
    Fig. 11  An example of human imagination generalization ability [87]
    Fig. 12  Architecture of stacked capsule autoencoders (SCAE) [92]
    Fig. 13  The framework of de-occlusion completion for multi-object scenes [87]
    Fig. 14  The disentanglement performance of Factor-VAE [51] on the 3D chairs [103] and 3D faces [104] data sets. Each row shows the change in the reconstructed images when only the latent representation marked on the left is varied
    Fig. 15  The disentanglement performance of AAE [48] on the MNIST [99] and SVHN [100] data sets. Each row shows the reconstructed images obtained by varying the category latent while the style latent stays fixed; each column shows the reconstructed images obtained by varying the style latent while the category latent stays fixed
    Fig. 16  The video target detection and tracking results of SQAIR [66], where bounding boxes of different colors mark the different objects detected and tracked during the network's recurrence
    Fig. 17  Scene-text parsing results with RCN [74]. The yellow outlines in the left image show character segmentations; on the right, the first column is the occluded input digit and the second column the predicted occlusion mask
    Fig. 18  The clustering results of the algorithm proposed in reference [73]
    Fig. 19  The image attribute-transfer synthesis results achieved by GSL [78]
    Fig. 20  The human action recognition and part-appearance swapping results of the algorithm proposed in reference [83]
    Fig. 21  The generation results of the algorithm proposed in reference [87] after reorganizing target positions and occlusion order in a natural scene according to human preference
    Fig. 22  The VQA results on the CLEVR [128] data set using the method proposed in reference [98]

  • Table 1  Comparison of unstructured-representation-prior inductive bias methods

    β-VAE [46]
    Regularizer: $-\beta D_{\mathrm{KL}}\left( q_\phi(\boldsymbol{z}|\boldsymbol{x}) \,\|\, p(\boldsymbol{z}) \right)$
    Advantage: A large $\beta$ pushes the learned posterior toward the independent statistics of the prior, improving disentanglement.
    Disadvantage: A large $\beta$ improves disentanglement but limits the network's representation capacity, directly visible as degraded reconstruction; the two are hard to balance.

    Understanding disentangling in β-VAE [47]
    Regularizer: $-\gamma \left| D_{\mathrm{KL}}\left( q(\boldsymbol{z}|\boldsymbol{x}) \,\|\, p(\boldsymbol{z}) \right) - C \right|$
    Advantage: Analyzes β-VAE from the information-bottleneck perspective; progressively increasing the latent information capacity $C$ during training improves the representation-disentanglement trade-off to some extent.
    Disadvantage: The latents still lack explicit physical semantics, and the added capacity $C$ is a hyperparameter whose growth schedule must be designed by hand.

    Joint-VAE [53]
    Regularizer: $-\gamma \left| D_{\mathrm{KL}}\left( q_\phi(\boldsymbol{z}|\boldsymbol{x}) \,\|\, p(\boldsymbol{z}) \right) - C_z \right| - \gamma \left| D_{\mathrm{KL}}\left( q_\phi(\boldsymbol{c}|\boldsymbol{x}) \,\|\, p(\boldsymbol{c}) \right) - C_c \right|$
    Advantage: Uses the Concrete distribution [54] to solve the disentanglement of discrete latent variables.
    Disadvantage: The latents lack explicit physical semantics.

    AAE [48]
    Regularizer: $D_{\mathrm{JS}}\left[ \mathrm{E}_\phi(\boldsymbol{z}) \,\|\, p(\boldsymbol{z}) \right]$
    Advantage: An adversarial network measures the similarity between the aggregated posterior and the prior, giving the latents a larger expression space and stronger expressive power.
    Disadvantage: Subject to adversarial-training problems such as saddle points [50].

    DIP-VAE [49]
    Regularizer: $-\lambda_{od} \sum_{i \neq j} \left[ \mathrm{Cov}_{q_\phi(\boldsymbol{z})}[\boldsymbol{z}] \right]_{ij}^2 - \lambda_d \sum_i \left( \left[ \mathrm{Cov}_{q_\phi(\boldsymbol{z})}[\boldsymbol{z}] \right]_{ii} - 1 \right)^2$
    Advantage: Replaces the adversarial network of AAE [48] with a simpler moment-matching term; the computation is more concise and efficient.
    Disadvantage: Applies only when the latents are Gaussian and constrains neither the mean nor higher-order moments, so its scope is limited.

    Factor-VAE [51]
    Regularizer: $D_{\mathrm{JS}}\left( q(\boldsymbol{z}) \,\|\, \prod_{i=1}^{d} q(z_i) \right)$
    Advantage: An adversarial network directly encourages the aggregated posterior $q(\boldsymbol{z})$ to factorize, further improving the trade-off between strong representation and strong disentanglement.
    Disadvantage: Subject to adversarial-training problems such as saddle points [50].

    RF-VAE [56]
    Regularizer: $D_{\mathrm{JS}}\left( q(\boldsymbol{r} \circ \boldsymbol{z}) \,\|\, \prod_{i=1}^{d} q(r_i \circ z_i) \right)$
    Advantage: Introduces relevance indicators $\boldsymbol{r}$ so that no disentanglement constraint is imposed between irrelevant latents.
    Disadvantage: $\boldsymbol{r}$ must itself be learned by the network, adding to training complexity.

    β-TCVAE [52]
    Regularizer: $-\alpha I_q(\boldsymbol{x};\boldsymbol{z}) - \beta D_{\mathrm{KL}}\left( q(\boldsymbol{z}) \,\|\, \prod_{i=1}^{d} q(z_i) \right) - \gamma \sum_j D_{\mathrm{KL}}\left( q(z_j) \,\|\, p(z_j) \right)$
    Advantage: Demonstrates the importance of the total-correlation term $D_{\mathrm{KL}}(q(\boldsymbol{z}) \,\|\, \prod_{i=1}^{d} q(z_i))$ and assigns different weights to the regularizers, forming a new objective with stronger representation ability.
    Disadvantage: Introduces more hyperparameters that must be tuned by hand.
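Taking the first row of Table 1 as a concrete anchor, below is a minimal sketch of the β-VAE objective, assuming PyTorch and a Gaussian posterior; the β value and the sum reduction are illustrative choices, not prescribed by the table.

```python
# beta-VAE objective: reconstruction term plus beta-weighted KL between the
# approximate posterior q(z|x) = N(mu, sigma^2) and the prior p(z) = N(0, I).
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_recon, x, reduction='sum')
    # Closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl        # beta > 1 encourages disentanglement
```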

    Table 2  Comparison of methods based on different inductive biases

    Unstructured representation priors
    Models: β-VAE [46]; InfoGAN [55]; Ref. [47]; Joint-VAE [53]; AAE [48]; DIP-VAE [49]; Factor-VAE [51]; RF-VAE [56]; β-TCVAE [52]
    Description: Imposing the different prior regularizers of Table 1 during network optimization gives the learned latent representations a degree of disentanglement. However, these methods involve no sufficient explicit physical-semantic constraints, so the network does not necessarily disentangle in a way humans would understand; they are therefore generally used on simple, highly regular data sets.
    Applicable scope: Simple data sets whose disentangled representations have clearly separable attributes, such as face and digit data sets.
    Data sets: MNIST [99]; SVHN [100]; CelebA [101]; 2D Shapes [102]; 3D Chairs [103]; dSprites [102]; 3D Faces [104]

    Structured model priors: sequential deep recurrent networks
    Models: DRAW [62]; AIR [64]; SQAIR [66]
    Description: Sequential deep recurrent architectures can repeatedly incorporate historical state features when making decisions, enabling tasks such as detection and tracking in simple scenes.
    Applicable scope: Multi-decision tasks that require associative memory.
    Data sets: 3D scenes [64]; Multi-MNIST [64]; dSprites [102]; Moving-MNIST [66]; Omniglot [105]; Pedestrian CCTV data [106]

    Structured model priors: hierarchical deep ladder networks
    Models: VLAE [70]; Ref. [71]; HFVAE [72]
    Description: Hierarchical ladder networks mimic humans' shallow-to-deep hierarchical cognition, driving the latents of each layer to carry different meanings; usable for tasks such as clustering.
    Applicable scope: Shallow-to-deep attribute mining on simple data sets.
    Data sets: MNIST [99]; CelebA [101]; SVHN [100]; dSprites [102]

    Structured model priors: tree networks
    Models: RCN [74]; LTVAE [73]
    Description: Tree networks mimic the lateral interaction among humans' high-level neurons, disentangling low-level features while allowing semantic interaction among high-level features; usable for clustering, natural scene text recognition, and similar tasks.
    Applicable scope: Scene tasks where low-level features are disentangled and shared while high-level features are coupled and interactive.
    Data sets: CAPTCHA [107]; ICDAR-13 Robust Reading [107]; MNIST [99]; HHAR [73]; Reuters [108]; STL-10 [73]

    Physical knowledge priors: correlations within grouped data
    Models: MLVAE [75]; Ref. [77]; GSL [78]; Refs. [81, 82, 83, 85, 86]
    Description: Through swapping and sharing latent representations, constraining mutual-information correlation, cyclic regression, and similar mechanisms, these methods disentangle the correlated factors of grouped data; the resulting effective factor representations can then be used on their own for classification, segmentation, attribute-transfer data generation, and other tasks.
    Applicable scope: Mining the correlated effective attributes of grouped data.
    Data sets: MNIST [99]; RaFD [109]; Fonts [78]; CelebA [101]; Colored-MNIST [81]; dSprites [102]; MS-Celeb-1M [110]; CUB birds [111]; ShapeNet [112]; iLab-20M [113]; 3D Shapes [81]; IAM [114]; PKU vehicle id [115]; Sentinel-2 [116]; Norb [117]; BBC Pose dataset [118]; NTU [119]; KTH [120]; Deep fashion [121]; Cat head [122]; Human3.6M [123]; Penn action [124]; 3D cars [125]

    Physical knowledge priors: object-based physical-spatial composition
    Models: MixNMatch [89]
    Description: Combines component-wise and hierarchical generation to disentangle background, pose, texture, and shape in single-object scenes.
    Applicable scope: Attribute-transfer data generation for single-object scenes.
    Data sets: CUB birds [111]; Stanford dogs [126]; Stanford cars [125]

    Models: Ref. [83]
    Description: Considers the compositional relations among the multiple parts of a single object.
    Applicable scope: Data generation such as transferring specific human body parts or facial expressions.
    Data sets: Cat head [122]; Human 3.6M [123]; Penn action [124]

    Models: SCAE [92]
    Description: Proposes the new idea of capsule networks, modelling the compositional relations among multiple objects and their parts.
    Applicable scope: Object and part mining on simple data sets.
    Data sets: MNIST [99]; SVHN [100]; CIFAR10

    Models: TAGGER [88]; IODINE [95]; MONET [96]
    Description: Disentangle multi-object scenes one object at a time.
    Applicable scope: Autonomous object interpretation in simple multi-object scenes.
    Data sets: Shapes [127]; Textured MNIST [88]; CLEVR [128]; dSprites [102]; Tetris [95]; Objects room [96]

    Models: Ref. [87]
    Description: Introduces an object-space logical tree to disentangle the occlusion relations of complex multi-object scenes; usable for de-occlusion and similar tasks.
    Applicable scope: De-occlusion of a small number of objects in complex natural scenes.
    Data sets: KINS [129]; COCOA [112]

    Models: Ref. [98]
    Description: Mines an object's 3D intrinsic invariant attributes to handle large viewpoint and scale differences, promising for high-level scene understanding tasks such as detection, recognition, and visual question answering.
    Applicable scope: High-level scene understanding on simple data sets.
    Data sets: CLEVR [128]
  • [1] Duan Yan-Jie, Lv Yi-Sheng, Zhang Jie, Zhao Xue-Liang, Wang Fei-Yue. Deep learning for control: The state of the art and prospects. Acta Automatica Sinica, 2016, 42(5): 643−654 (in Chinese)
    [2] Wang Xiao-Feng, Yang Ya-Dong. Research on structure model of general intelligent system based on ecological evolution. Acta Automatica Sinica, 2020, 46(5): 1017−1030 (in Chinese)
    [3] Amizadeh S, Palangi H, Polozov O, Huang Y C, Koishida K. Neuro-Symbolic visual reasoning: Disentangling “visual” from “reasoning”. In: Proceedings of the 37th International Conference on Machine Learning. Vienna, Austria: PMLR, 2020. 279−290
    [4] Adel T, Zhao H, Turner R E. Continual learning with adaptive weights (CLAW). In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: ICLR, 2020.
    [5] Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504−507 doi: 10.1126/science.1127647
    [6] Lee G, Li H Z. Modeling code-switch languages using bilingual parallel corpus. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: ACL, 2020. 860−870
    [7] Chen X H. Simulation of English speech emotion recognition based on transfer learning and CNN neural network. Journal of Intelligent & Fuzzy Systems, 2021, 40(2): 2349−2360
    [8] Lü Y, Lin H, Wu P P, Chen Y T. Feature compensation based on independent noise estimation for robust speech recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2021, 2021(1): Article No. 22 doi: 10.1186/s13636-021-00213-8
    [9] Torfi A, Shirvani R A, Keneshloo Y, Tavaf N, Fox E A. Natural language processing advancements by deep learning: A survey. [Online], available: https://arxiv.org/abs/2003.01200, February 27, 2020
    [10] Stoll S, Camgoz N C, Hadfield S, Bowden R. Text2Sign: Towards sign language production using neural machine translation and generative adversarial networks. International Journal of Computer Vision, 2020, 128(4): 891−908 doi: 10.1007/s11263-019-01281-2
    [11] He P C, Liu X D, Gao J F, Chen W Z. DeBERTa: Decoding-enhanced Bert with disentangled attention. In: Proceedings of the 9th International Conference on Learning Representations. Austria: ICLR, 2021.
    [12] Shi Y C, Yu X, Sohn K, Chandraker M, Jain A K. Towards universal representation learning for deep face recognition. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 6816−6825
    [13] Ni T G, Gu X Q, Zhang C, Wang W B, Fan Y Q. Multi-Task deep metric learning with boundary discriminative information for cross-age face verification. Journal of Grid Computing, 2020, 18(2): 197−210 doi: 10.1007/s10723-019-09495-x
    [14] Shi X, Yang C X, Xia X, Chai X J. Deep cross-species feature learning for animal face recognition via residual interspecies equivariant network. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 667−682
    [15] Chen J T, Lei B W, Song Q Y, Ying H C, Chen D Z, Wu J. A hierarchical graph network for 3D object detection on point clouds. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 389−398
    [16] Jiang Hong-Yi, Wang Yong-Juan, Kang Jin-Yu. A survey of object detection models and its optimization methods. Acta Automatica Sinica, 2021, 47(6): 1232−1255 (in Chinese)
    [17] Xu Z J, Hrustic E, Vivet D. CenterNet heatmap propagation for real-time video object detection. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 220−234
    [18] Zhang D W, Tian H B, Han J G. Few-cost salient object detection with adversarial-paced learning. [Online], available: https://arxiv.org/abs/2104.01928, April 5, 2021
    [19] Zhang Hui, Wang Kun-Feng, Wang Fei-Yue. Advances and perspectives on applications of deep learning in visual object detection. Acta Automatica Sinica, 2017, 43(8): 1289−1305 (in Chinese)
    [20] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436−444 doi: 10.1038/nature14539
    [21] Geirhos R, Jacobsen J H, Michaelis C, Zemel R, Brendel W, Bethge M, et al. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020, 2(11): 665−673 doi: 10.1038/s42256-020-00257-z
    [22] Minderer M, Bachem O, Houlsby N, Tschannen M. Automatic shortcut removal for self-supervised representation learning. In: Proceedings of the 37th International Conference on Machine Learning. San Diego, USA: JMLR, 2020. 6927−6937
    [23] Ran X M, Xu M K, Mei L R, Xu Q, Liu Q Y. Detecting out-of-distribution samples via variational auto-encoder with reliable uncertainty estimation. [Online], available: https://arxiv.org/abs/2007.08128v3, November 1, 2020
    [24] Charakorn R, Thawornwattana Y, Itthipuripat S, Pawlowski N, Manoonpong P, Dilokthanakul N. An explicit local and global representation disentanglement framework with applications in deep clustering and unsupervised object detection. [Online], available: https://arxiv.org/abs/2001.08957, February 24, 2020
    [25] Zhang Bo, Zhu Jun, Su Hang. Toward the third generation of artificial intelligence. Scientia Sinica Informationis, 2020, 50(9): 1281−1302 doi: 10.1360/SSI-2020-0204 (in Chinese)
    [26] Lake B M, Ullman T D, Tenenbaum J B, Gershman S J. Building machines that learn and think like people. Behavioral and Brain Sciences, 2017, 40: Article No. e253 doi: 10.1017/S0140525X16001837
    [27] Geirhos R, Meding K, Wichmann F A. Beyond accuracy: Quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency. [Online], available: https://arxiv.org/abs/2006.16736v3, December 18, 2020
    [28] Regazzoni C S, Marcenaro L, Campo D, Rinner B. Multisensorial generative and descriptive self-awareness models for autonomous systems. Proceedings of the IEEE, 2020, 108(7): 987−1010 doi: 10.1109/JPROC.2020.2986602
    [29] Wang T, Huang J Q, Zhang H W, Sun Q R. Visual commonsense R-CNN. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 10757−10767
    [30] Wang T, Huang J Q, Zhang H W, Sun Q R. Visual commonsense representation learning via causal inference. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, USA: IEEE, 2020. 1547−1550
    [31] Schölkopf B, Locatello F, Bauer S, Ke N R, Kalchbrenner N, Goyal A, et al. Toward causal representation learning. Proceedings of the IEEE, 2021, 109(5): 612−634 doi: 10.1109/JPROC.2021.3058954
    [32] Locatello F, Tschannen M, Bauer S, Rätsch G, Schölkopf B, Bachem O. Disentangling factors of variations using few labels. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: ICLR, 2020.
    [33] Dittadi A, Träuble F, Locatello F, Wüthrich M, Agrawal V, Winther O, et al. On the transfer of disentangled representations in realistic settings. In: Proceedings of the 9th International Conference on Learning Representations. Austria: ICLR, 2021.
    [34] Tschannen M, Bachem O, Lucic M. Recent advances in autoencoder-based representation learning. [Online], available: https://arxiv.org/abs/1812.05069, December 12, 2018
    [35] Shu R, Chen Y N, Kumar A, Ermon S, Poole B. Weakly supervised disentanglement with guarantees. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: ICLR, 2020.
    [36] Kim H, Shin S, Jang J, Song K, Joo W, Kang W, et al. Counterfactual fairness with disentangled causal effect variational autoencoder. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI, 2021. 8128−8136
    [37] Locatello F, Bauer S, Lucic M, Rätsch G, Gelly S, Schölkopf B, et al. Challenging common assumptions in the unsupervised learning of disentangled representations. In: Proceedings of the 36th International Conference on Machine Learning. JMLR, 2019. 4114−4124
    [38] Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798−1828 doi: 10.1109/TPAMI.2013.50
    [39] Sikka H. A Deeper Look at the unsupervised learning of disentangled representations in Beta-VAE from the perspective of core object recognition. [Online], available: https://arxiv.org/abs/2005.07114, April 25, 2020.
    [40] Locatello F, Poole B, Rätsch G, Schölkopf B, Bachem O, Tschannen M. Weakly-supervised disentanglement without compromises. In: Proceedings of the 37th International Conference on Machine Learning. San Diego, USA: JMLR, 2020. 6348−6359
    [41] Zhai Zheng-Li, Liang Zhen-Ming, Zhou Wei, Sun Xia. Research overview of variational auto-encoders models. Computer Engineering and Applications, 2019, 55(3): 1−9 doi: 10.3778/j.issn.1002-8331.1810-0284 (in Chinese)
    [42] Schmidhuber J. Learning factorial codes by predictability minimization. Neural Computation, 1992, 4(6): 863−879 doi: 10.1162/neco.1992.4.6.863
    [43] Kingma D P, Welling M. Auto-encoding variational Bayes. [Online], available: https://arxiv.org/abs/1312.6114, May 1, 2014
    [44] Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS, 2014. 2672−2680
    [45] Lin Yi-Lun, Dai Xing-Yuan, Li Li, Wang Xiao, Wang Fei-Yue. The new frontier of AI research: Generative adversarial networks. Acta Automatica Sinica, 2018, 44(5): 775−792 (in Chinese)
    [46] Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, et al. Beta-vae: Learning basic visual concepts with a constrained variational framework. In: Proceedings of the 5th International Conference on Learning Representations. Toulon, France: ICLR, 2017.
    [47] Burgess C P, Higgins I, Pal A, Matthey L, Watters N, Desjardins G, et al. Understanding disentangling in Beta-VAE. [Online], available: https://arxiv.org/abs/1804.03599, April 10, 2018
    [48] Makhzani A, Shlens J, Jaitly N, Goodfellow I, Frey B. Adversarial autoencoders. [Online], available: https://arxiv.org/abs/1511.05644, May 25, 2016.
    [49] Kumar A, Sattigeri P, Balakrishnan A. Variational inference of disentangled latent concepts from unlabeled observations. In: Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR, 2018.
    [50] Arjovsky M, Bottou L. Towards principled methods for training generative adversarial networks. [Online], available: https://arxiv.org/abs/1701.04862, January 17, 2017
    [51] Kim H, Mnih A. Disentangling by factorising. In: Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: JMLR, 2018. 2649−2658
    [52] Chen T Q, Li X C, Grosse R B, Duvenaud D. Isolating sources of disentanglement in variational autoencoders. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: NIPS, 2018. 2615−2625
    [53] Dupont E. Learning disentangled joint continuous and discrete representations. [Online], available: https://arxiv.org/abs/1804.00104v3, October 22, 2018.
    [54] Maddison C J, Mnih A, Teh Y W. The concrete distribution: A continuous relaxation of discrete random variables. [Online], available: https://arxiv.org/abs/1611.00712, March 5, 2017.
    [55] Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS, 2016. 2180−2188
    [56] Kim M, Wang Y T, Sahu P, Pavlovic V. Relevance factor VAE: Learning and identifying disentangled factors. [Online], available: https://arxiv.org/abs/1902.01568, February 5, 2019.
    [57] Grathwohl W, Wilson A. Disentangling space and time in video with hierarchical variational auto-encoders. [Online], available: https://arxiv.org/abs/1612.04440, December 19, 2016.
    [58] Kim M, Wang Y T, Sahu P, Pavlovic V. Bayes-factor-VAE: Hierarchical Bayesian deep auto-encoder models for factor disentanglement. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 2979−2987
    [59] Montero M L, Ludwig C J H, Costa R P, Malhotra G, Bowers J S. The role of disentanglement in generalisation. In: Proceedings of the 9th International Conference on Learning Representations. Austria: ICLR, 2021.
    [60] Larochelle H, Hinton G E. Learning to combine foveal glimpses with a third-order boltzmann machine. Advances in Neural Information Processing Systems, 2010, 23: 1243−1251
    [61] Mnih V, Heess N, Graves A, Kavukcuoglu K. Recurrent models of visual attention. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada: NIPS, 2014. 2204−2212
    [62] Gregor K, Danihelka I, Graves A, Rezende D J, Wierstra D. DRAW: A recurrent neural network for image generation. In: Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR, 2015. 1462−1471
    [63] Henderson J M, Hollingworth A. High-level scene perception. Annual Review of Psychology, 1999, 50(1): 243−271 doi: 10.1146/annurev.psych.50.1.243
    [64] Eslami S M A, Heess N, Weber T, Tassa Y, Szepesvari D, Kavukcuoglu K, et al. Attend, infer, repeat: Fast scene understanding with generative models. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS, 2016. 3233−3241
    [65] Crawford E, Pineau J. Spatially invariant unsupervised object detection with convolutional neural networks. In: Proceedings of the 33rd Conference on Artificial Intelligence. California, USA: AAAI, 2019. 3412−3420
    [66] Kosiorek A R, Kim H, Posner I, Teh Y W. Sequential attend, infer, repeat: Generative modelling of moving objects. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: NIPS, 2018. 8615−8625
    [67] Santoro A, Raposo D, Barrett D G T, Malinowski M, Pascanu R, Battaglia P W, et al. A simple neural network module for relational reasoning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: NIPS, 2017. 4967−4976
    [68] Massague A C, Zhang C, Feric Z, Camps O I, Yu R. Learning disentangled representations of video with missing data. In: Proceedings of the 34th Conference on Neural Information Processing Systems. Vancouver, Canada: NeurIPS, 2020. 3625−3635
    [69] Sønderby C K, Raiko T, Maaløe L, Sønderby S K, Winther O. Ladder variational autoencoders. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS, 2016. 3745−3753
    [70] Zhao S J, Song J M, Ermon S. Learning hierarchical features from deep generative models. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: JMLR, 2017. 4091−4099
    [71] Willetts M, Roberts S, Holmes C. Disentangling to cluster: Gaussian mixture variational Ladder autoencoders. [Online], available: https://arxiv.org/abs/1909.11501, December 4, 2019.
    [72] Esmaeili B, Wu H, Jain S, Bozkurt A, Siddharth N, Paige B, et al. Structured disentangled representations. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. Okinawa, Japan: AISTATS, 2019. 2525−2534
    [73] Li X P, Chen Z R, Poon L K M, Zhang N L. Learning latent superstructures in variational autoencoders for deep multidimensional clustering. In: Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: ICLR, 2019.
    [74] George D, Lehrach W, Kansky K, Lázaro-Gredilla M, Laan C, Marthi B, et al. A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs. Science, 2017, 358(6368): eaag2612 doi: 10.1126/science.aag2612
    [75] Bouchacourt D, Tomioka R, Nowozin S. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI, 2018. 2095−2102
    [76] Hwang H J, Kim G H, Hong S, Kim K E. Variational interaction information maximization for cross-domain disentanglement. In: Proceedings of the 34th Conference on Neural Information Processing Systems. Vancouver, Canada: NeurIPS, 2020. 22479−22491
    [77] Szabó A, Hu Q Y, Portenier T, Zwicker M, Favaro P. Understanding degeneracies and ambiguities in attribute transfer. In: Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer, 2018. 721−736
    [78] Ge Y H, Abu-El-Haija S, Xin G, Itti L. Zero-shot synthesis with group-supervised learning. In: Proceedings of the 9th International Conference on Learning Representations. Austria: ICLR, 2021.
    [79] Lee S, Cho S, Im S. DRANet: Disentangling representation and adaptation networks for unsupervised cross-domain adaptation. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 15247−15256
    [80] Zhu J Y, Park T, Isola P, Efros A A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017: 2242−2251
    [81] Sanchez E H, Serrurier M, Ortner M. Learning disentangled representations via mutual information estimation. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 205−221
    [82] Esser P, Haux J, Ommer B. Unsupervised robust disentangling of latent characteristics for image synthesis. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 2699−2709
    [83] Lorenz D, Bereska L, Milbich T, Ommer B. Unsupervised part-based disentangling of object shape and appearance. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE, 2019. 10947−10956
    [84] Liu S L, Zhang L, Yang X, Su H, Zhu J. Unsupervised part segmentation through disentangling appearance and shape. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 8351−8360
    [85] Dundar A, Shih K, Garg A, Pottorff R, Tao A, Catanzaro B. Unsupervised disentanglement of pose, appearance and background from images and videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, DOI: 10.1109/TPAMI.2021.3055560
    [86] Vowels M J, Camgoz N C, Bowden R. Gated variational autoencoders: Incorporating weak supervision to encourage disentanglement. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). Buenos Aires, Argentina: IEEE, 2020. 125−132
    [87] Zhan X H, Pan X G, Dai B, Liu Z W, Lin D H, Loy C C. Self-supervised scene de-occlusion. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 3783−3791
    [88] Greff K, Rasmus A, Berglund M, Hao T H, Schmidhuber J, Valpola H. Tagger: Deep unsupervised perceptual grouping. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: NIPS, 2016. 4491−4499
    [89] Li Y H, Singh K K, Ojha U, Lee Y J. MixNMatch: Multifactor disentanglement and encoding for conditional image generation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 8036−8045
    [90] Singh K K, Ojha U, Lee Y J. FineGAN: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE, 2019. 6483−6492
    [91] Ojha U, Singh K K, Lee Y J. Generating furry cars: Disentangling object shape and appearance across multiple domains. In: Proceedings of the 9th International Conference on Learning Representations. Austria: ICLR, 2021.
    [92] Kosiorek A R, Sabour S, Teh Y W, Hinton G E. Stacked capsule autoencoders. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: NIPS, 2019. 15512−15522
    [93] Lee J, Lee Y, Kim J, Kosiorek A R, Choi S, Teh Y W. Set transformer: A framework for attention-based permutation-invariant neural networks. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: JMLR, 2019. 3744−3753
    [94] Yang M Y, Liu F R, Chen Z T, Shen X W, Hao J Y, Wang J. CausalVAE: Disentangled representation learning via neural structural causal models. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 9588−9597
    [95] Greff K, Kaufman R L, Kabra R, Watters N, Burgess C, Zoran D, et al. Multi-object representation learning with iterative variational inference. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: JMLR, 2019. 2424−2433
    [96] Burgess C P, Matthey L, Watters N, Kabra R, Higgins I, Botvinick M, et al. MONet: Unsupervised scene decomposition and representation. [Online], available: https://arxiv.org/abs/1901.11390, January 22, 2019
    [97] Marino J, Yue Y, Mandt S. Iterative amortized inference. In: Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: JMLR, 2018. 3400−3409
    [98] Prabhudesai M, Lal S, Patil D, Tung H Y, Harley A W, Fragkiadaki K. Disentangling 3D prototypical networks for few-shot concept learning. [Online], available: https://arxiv.org/abs/2011.03367, July 20, 2021
    [99] Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278−2324 doi: 10.1109/5.726791
    [100] Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng A Y. Reading digits in natural images with unsupervised feature learning. In: Proceedings of Advances in Neural Information Processing Systems. Workshop on Deep Learning and Unsupervised Feature Learning. Granada, Spain: NIPS, 2011. 1−9
    [101] Liu Z W, Luo P, Wang X G, Tang X O. Deep learning face attributes in the wild. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 3730−3738
    [102] Matthey L, Higgins I, Hassabis D, Lerchner A. dSprites: Disentanglement testing sprites dataset [Online], available: https://github.com/deepmind/dsprites-dataset, Jun 2, 2017
    [103] Aubry M, Maturana D, Efros A A, Russell B C, Sivic J. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE, 2014. 3762−3769
    [104] Paysan P, Knothe R, Amberg B, Romdhani S, Vetter T. A 3D face model for pose and illumination invariant face recognition. In: Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance. Genova, Italy: IEEE, 2009. 296−301
    [105] Lake B M, Salakhutdinov R, Tenenbaum J B. Human-level concept learning through probabilistic program induction. Science, 2015, 350(6266): 1332−1338 doi: 10.1126/science.aab3050
    [106] Ristani E, Solera F, Zou R S, Cucchiara R, Tomasi C. Performance measures and a data set for multi-target, multi-camera tracking. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 17−35
    [107] Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda L G I, Mestre S R, et al. ICDAR 2013 robust reading competition. In: Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, USA: IEEE, 2013. 1484−1493
    [108] Xie J Y, Girshick R B, Farhadi A. Unsupervised deep embedding for clustering analysis. In: Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR, 2016. 478−487
    [109] Langner O, Dotsch R, Bijlstra G, Wigboldus D H J, Hawk S T, Van Knippenberg A. Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 2010, 24(8): 1377−1388 doi: 10.1080/02699930903485076
    [110] Guo Y D, Zhang L, Hu Y X, He X D, Gao J F. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 87−102
    [111] Wah C, Branson S, Welinder P, Perona P, Belongie S. The Caltech-UCSD birds-200-2011 dataset [Online], available: http://www.vision.caltech.edu/visipedia/CUB-200-2011.html, November 6, 2011
    [112] Zhu Y, Tian Y D, Metaxas D, Dollár P. Semantic amodal segmentation. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017. 3001−3009
    [113] Borji A, Izadi S, Itti L. iLab-20M: A large-scale controlled object dataset to investigate deep learning. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 2221−2230
    [114] Marti U V, Bunke H. The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 2002, 5(1): 39−46 doi: 10.1007/s100320200071
    [115] Liu H Y, Tian Y H, Wang Y W, Pang L, Huang T J. Deep relative distance learning: Tell the difference between similar vehicles. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 2167−2175
    [116] Drusch M, Del Bello U, Carlier S, Colin O, Fernandez V, Gascon F, et al. Sentinel-2: ESA’s optical high-resolution mission for GMES operational services. Remote Sensing of Environment, 2012, 120: 25−36 doi: 10.1016/j.rse.2011.11.026
    [117] LeCun Y, Huang F J, Bottou L. Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. Washington, USA: IEEE, 2004. II−104
    [118] Charles J, Pfister T, Everingham M, Zisserman A. Automatic and efficient human pose estimation for sign language videos. International Journal of Computer Vision, 2014, 110(1): 70−90 doi: 10.1007/s11263-013-0672-6
    [119] Shahroudy A, Liu J, Ng T T, Wang G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 1010−1019
    [120] Schuldt C, Laptev I, Caputo B. Recognizing human actions: A local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004. Cambridge, UK: IEEE, 2004. 32−36
    [121] Liu Z W, Luo P, Qiu S, Wang X G, Tang X O. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 1096−1104
    [122] Zhang W W, Sun J, Tang X O. Cat head detection - how to effectively exploit shape and texture features. In: Proceedings of the 10th European Conference on Computer Vision. Marseille, France: Springer, 2008. 802−816
    [123] Ionescu C, Papava D, Olaru V, Sminchisescu C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36(7): 1325−1339
    [124] Zhang W Y, Zhu M L, Derpanis K G. From actemes to action: A strongly-supervised representation for detailed action understanding. In: Proceedings of the 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 2248−2255
    [125] Krause J, Stark M, Deng J, Li F F. 3D object representations for fine-grained categorization. In: Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops. Sydney, Australia: IEEE, 2013. 554−561
    [126] Khosla A, Jayadevaprakash N, Yao B, Li F F. Novel dataset for fine-grained image categorization: Stanford dogs. In: Proceedings of the 1st Workshop on Fine-Grained Visual Categorization. Colorado Springs, USA: IEEE, 2011. 1−2
    [127] Reichert D P, Seriès P, Storkey A J. A hierarchical generative model of recurrent object-based attention in the visual cortex. In: Proceedings of the 21st International Conference on Artificial Neural Networks. Espoo, Finland: ICANN, 2011. 18−25
    [128] Johnson J, Hariharan B, Van Der Maaten L, Li F F, Zitnick C L, Girshick R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017. 1988−1997
    [129] Qi L, Jiang L, Liu S, Shen X Y, Jia J Y. Amodal instance segmentation with KINS dataset. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE, 2019. 3009−3018
    [130] Eastwood C, Williams C K I. A framework for the quantitative evaluation of disentangled representations. In: Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR, 2018.
    [131] Wu Z Z, Lischinski D, Shechtman E. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 12858−12867

Publication history
  • Received: 2021-01-28
  • Accepted: 2021-06-18
  • Available online: 2021-07-26
  • Published: 2022-02-18
