DONG Yin-Peng, SU Hang, ZHU Jun

doi: 10.16383/j.aas.c200317
Interpretability Analysis of Deep Neural Networks With Adversarial Examples

Funds: Supported by the National Natural Science Foundation of China (61620106010, U19B2034, U1811461) and the Tsinghua Institute for Guo Qiang
    Author Bio:

    DONG Yin-Peng  Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His research interests cover machine learning and the interpretability and robustness of deep learning. E-mail: dyp17@mails.tsinghua.edu.cn

    SU Hang  Associate researcher in the Department of Computer Science and Technology, Tsinghua University. His research interests cover the fundamental theory of robust and interpretable artificial intelligence and its applications in vision. E-mail: suhangss@mail.tsinghua.edu.cn

    ZHU Jun  Professor in the Department of Computer Science and Technology, Tsinghua University. His main research interest is machine learning. Corresponding author of this paper. E-mail: dcszj@mail.tsinghua.edu.cn

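  • Abstract: Although deep neural networks (DNNs) have achieved remarkable results on many tasks, they are usually treated as "black-box" models because of their poor interpretability. Focusing on image classification, this paper uses adversarial examples to examine the internal feature representations of DNNs from the perspective of model failure. The analysis reveals an inconsistency between the feature representations learned by DNNs and the semantic concepts understood by humans, which makes the features inside DNNs very difficult to understand and interpret. To obtain interpretable DNNs whose neurons carry more explicit semantic meaning, this paper proposes an adversarial training scheme that adds a feature representation consistency loss. Experimental results show that this training scheme makes the internal feature representations of DNNs more consistent with the semantic concepts understood by humans.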
  • Fig.  1  Demonstration of the inconsistency between a semantic concept and the learned features of a neuron

    Fig.  2  Visualization of neuron features (from the conv5_3 layer) in VGG-16

    Fig.  3  Illustration of quantifying the level and consistency of features based on WordNet[32]

    Fig.  4  Visualization of neuron features (from the conv5 layer) in AlexNet

    Fig.  5  Visualization of neuron features (from the conv5b layer) in ResNet-18

    Fig.  6  Visualization of neuron features (from the conv5 layer) in AlexNet-Adv

    Fig.  7  Visualization of neuron features (from the conv5_3 layer) in VGG-16-Adv

    Fig.  8  Visualization of neuron features (from the conv5b layer) in ResNet-18-Adv

    Fig.  9  Visualization of neuron features (from the last layer) in Adv-Inc-v3

    Fig.  10  The curves of CS1 and CS2 along with LC

    Table  1  The ratio (%) of neurons aligned with semantic concepts in each model when shown real and adversarial images, respectively

    Table  2  Accuracy (%) of each model on the ImageNet validation set and on adversarial examples generated by FGSM with perturbation size $\epsilon = 4$

    Model      Real images          Adversarial images
               Top-1     Top-5      Top-1     Top-5
    AlexNet    54.53     78.17      9.04      32.77
  • Received:  2020-05-15
  • Accepted:  2020-08-27
  • Published online:  2021-12-01
  • Issue date:  2022-01-25


