Topology-guided Adversarial Deep Mutual Learning for Knowledge Distillation
Abstract: Existing mutual-learning based knowledge distillation methods have two limitations: knowledge transfer is supervised only by the discrepancy between the class distributions of the teacher network and the student network, without considering other constraints, and the supervision is purely result-driven, lacking process-driven supervision. To address these issues, this paper proposes Topology-guided Adversarial Deep Mutual Learning (TADML) for knowledge distillation. TADML trains the teacher and student networks simultaneously so that the networks guide each other's learning: in addition to the discrepancy between the class distributions output by the networks, it introduces a topological discrepancy measure over their intermediate features. Training is carried out adversarially, with a discriminator that adaptively measures the differences between pairs of sub-networks and optimizes the features without changing the model structure, further improving the discriminability of both the teacher and the student. Experimental results on the classification datasets CIFAR10, CIFAR100 and Tiny-ImageNet and on the person re-identification dataset Market1501 demonstrate the effectiveness of TADML, which achieves the best results among comparable model compression methods.
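To make the loss composition concrete (it is ablated in Table 1 below, where LS, JS, Ladv and LT denote the supervised, Jensen-Shannon, adversarial and topology terms), here is a minimal PyTorch-style sketch of the per-network objective. The weighting coefficients, the function names and the absence of a temperature are our assumptions, not values taken from the paper.

```python
import torch.nn.functional as F

def js_divergence(logits_a, logits_b):
    """Symmetric Jensen-Shannon divergence between the class
    distributions predicted by two peer networks (the JS term)."""
    p = F.softmax(logits_a, dim=1)
    q = F.softmax(logits_b, dim=1)
    m = 0.5 * (p + q)
    # F.kl_div(log_input, target) computes KL(target || input)
    return 0.5 * (F.kl_div(m.log(), p, reduction='batchmean')
                  + F.kl_div(m.log(), q, reduction='batchmean'))

def mutual_loss(logits_a, logits_b, labels, l_adv, l_topo,
                w_js=1.0, w_adv=1.0, w_topo=1.0):
    """Objective for one network: supervised cross-entropy (LS) plus the
    mutual JS term, the adversarial term (Ladv) and the topology term (LT).
    logits_b would typically be detached when updating the first network."""
    l_s = F.cross_entropy(logits_a, labels)
    return (l_s + w_js * js_divergence(logits_a, logits_b)
            + w_adv * l_adv + w_topo * l_topo)
```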
Table 1 Comparison of classification performance with different loss functions (%)
Loss composition   CIFAR10  CIFAR100
LS                 92.90    70.47
LS+JS              93.18    71.70
LS+JS+Ladv         93.52    72.75
LS+L1+Ladv         93.04    71.97
LS+L2+Ladv         93.26    72.02
LS+L1+JS+Ladv      92.87    71.63
LS+L2+JS+Ladv      92.38    70.90
LS+JS+Ladv+LT      93.05    71.81
Table 2 Comparison of classification performance with different discriminator structures (%)
Structure                 CIFAR100
256fc-256fc               71.57
500fc-500fc               72.09
100fc-100fc-100fc         72.33
128fc-256fc-128fc         72.51
64fc-128fc-256fc-128fc    72.28
128fc-256fc-256fc-128fc   72.23
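As a concrete reading of the best-performing row in Table 2 (128fc-256fc-128fc), the sketch below shows what such a fully connected discriminator could look like in PyTorch. The table only lists layer widths; the ReLU activations, the single-logit output deciding which sub-network produced the input feature, and the class name are our assumptions.

```python
import torch.nn as nn

class FCDiscriminator(nn.Module):
    """Fully connected discriminator matching the '128fc-256fc-128fc'
    row of Table 2 (hidden widths 128, 256, 128)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1),  # single logit: which sub-network produced the feature
        )

    def forward(self, x):
        return self.net(x)
```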
Table 3 Comparison of classification performance with different discriminator inputs (%)
Input constraint   CIFAR100
conv4              72.33
fc                 72.51
conv4+fc           72.07
fc+DAE             71.97
fc+label           72.35
fc+avgfc           71.20
Table 4 Comparison of classification performance with different sampling strategies (%)
Network     Vanilla  Random  K=2    K=4    K=8    K=16   K=32   K=64
ResNet32    71.14    72.12   31.07  60.69  72.43  72.84  72.50  71.99
ResNet110   74.31    74.59   22.64  52.33  74.59  75.18  75.01  74.59
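Table 4 indicates that a moderate number of sampled instances (around K = 16) works best. Purely as an illustration, the sketch below assumes the topology term compares pairwise-distance matrices built from K intermediate features sampled from each batch; this reading of K and the L2 matching of the two matrices are our assumptions, not details confirmed by this excerpt.

```python
import torch
import torch.nn.functional as F

def topology_loss(feat_a, feat_b, k=16):
    """Sample k instances from the batch, build each network's pairwise
    distance matrix over those instances, and penalise the difference."""
    n = feat_a.size(0)
    idx = torch.randperm(n, device=feat_a.device)[:min(k, n)]
    a = feat_a[idx].flatten(1)          # k x d features of network 1
    b = feat_b[idx].flatten(1)          # k x d features of network 2
    dist_a = torch.cdist(a, a, p=2)     # k x k "topology" of network 1
    dist_b = torch.cdist(b, b, p=2)     # k x k "topology" of network 2
    return F.mse_loss(dist_a, dist_b)
```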
Table 5 Comparison of classification performance with different network structures (%)
Network structure        Original network   DML[13]          ADML             TADML
Net 1       Net 2        Net 1    Net 2     Net 1   Net 2    Net 1   Net 2    Net 1   Net 2
ResNet32    ResNet32     70.47    70.47     71.86   71.89    72.85   72.89    73.07   73.13
ResNet32    ResNet110    70.47    73.12     71.62   74.08    72.66   74.18    73.14   74.86
ResNet110   ResNet110    73.12    73.12     74.59   74.55    75.08   75.10    75.52   75.71
WRN-10-4    WRN-10-4     72.65    72.65     73.06   73.01    73.77   73.75    73.97   74.08
WRN-10-4    WRN-28-10    72.65    80.77     73.58   81.11    74.61   81.43    75.11   82.13
Table 6 Comparison of person re-identification mAP with different network structures (%)
Network structure            Original network   DML[13]          ADML             TADML
Net 1        Net 2           Net 1    Net 2     Net 1   Net 2    Net 1   Net 2    Net 1   Net 2
InceptionV1  MobileNetV1     65.26    46.07     65.34   52.87    65.60   53.22    66.03   53.91
MobileNetV1  MobileNetV1     46.07    46.07     52.95   51.26    53.42   53.27    53.84   53.65
Table 7 Experimental results of the proposed algorithm and other compression algorithms (%)
Method              Params   CIFAR10  CIFAR100  Tiny-ImageNet
ResNet20            0.27M    91.42    66.63     54.45
ResNet164           2.6M     93.43    72.24     61.55
Yim[10]             0.27M    88.70    63.33     ---
L2-Ba[23]           0.27M    90.93    67.21     ---
KD[8]               0.27M    91.12    66.66     57.65
FitNet[9]           0.27M    91.41    64.96     55.59
Quantization[21]    0.27M    91.13    ---       ---
Binary Connect[22]  15.20M   91.73    ---       ---
ANC[24]             0.27M    91.92    67.55     58.17
TSANC[25]           0.27M    92.17    67.43     58.20
KSANC[25]           0.27M    92.68    68.58     59.77
DML[13]             0.27M    91.82    69.47     57.91
ADML                0.27M    92.23    69.60     59.00
TADML               0.27M    93.05    70.81     60.11
[1] He K M, Zhang X Y, Ren S Q and Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, USA: IEEE, 2016. 770−778.
[2] Zhang X Y, Zhou X Y, Lin M X and Sun J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA: IEEE, 2018. 6848−6856.
[3] Guo Y W, Yao A B, Zhao H and Chen Y R. Network Sketching: Exploiting Binary Structure in Deep CNNs. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA: IEEE, 2017. 4040−4048.
[4] Tai C, Xiao T, Wang X G and E W N. Convolutional neural networks with low-rank regularization. In: Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
[5] Chen W, Wilson J T, Tyree S, Weinberger K Q and Chen Y X. Compressing Neural Networks with the Hashing Trick. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France: ACM, 2015. 37: 2285−2294.
[6] Denton E L, Zaremba W, Bruna J, LeCun Y and Fergus R. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada: MIT Press, 2014. 1269−1277.
[7] Li Z and Hoiem D. Learning without Forgetting. In: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands: Springer Verlag, 2016. 614−629.
[8] Hinton G E, Vinyals O and Dean J. Distilling the knowledge in a neural network. arXiv preprint, arXiv: 1503.02531, 2015.
[9] Romero A, Ballas N, Kahou S E, Chassang A, Gatta C and Bengio Y. FitNets: Hints for thin deep nets. In: Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 2015.
[10] Yim J, Joo D, Bae J H and Kim J. A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA: IEEE, 2017. 7130−7138.
[11] Peng B Y, Jin X, Li D S, Zhou S F, Wu Y C, Liu J H, Zhang Z N and Liu Y. Correlation Congruence for Knowledge Distillation. In: Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Korea (South): IEEE, 2019. 5006−5015.
[12] Park W, Kim D, Lu Y and Cho M. Relational Knowledge Distillation. In: Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA: IEEE, 2019. 3967−3976.
[13] Zhang Y, Xiang T, Hospedales T M and Lu H C. Deep Mutual Learning. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA: IEEE, 2018. 4320−4328.
[14] Batra T and Parikh D. Cooperative Learning with Visual Attributes. arXiv preprint, arXiv: 1705.05512, 2017.
[15] Zhang H, Goodfellow I J, Metaxas D N and Odena A. Self-Attention Generative Adversarial Networks. In: Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, USA: ACM, 2019. 97: 7354−7363.
[16] He K M, Zhang X Y, Ren S Q and Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, USA: IEEE, 2016. 770−778.
[17] Zagoruyko S and Komodakis N. Wide residual networks. In: Proceedings of the British Machine Vision Conference, York, UK: Springer, 2016.
[18] Krizhevsky A and Hinton G. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009.
[19] Mirza M and Osindero S. Conditional Generative Adversarial Nets. arXiv preprint, arXiv: 1411.1784, 2014.
[20] Shu C Y, Li P, Xie Y, Qu Y Y and Kong H. Knowledge Squeezed Adversarial Network Compression. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA: AAAI, 2020. 11370−11377.
[21] Zhu C Z, Han S, Mao H Z and Dally W J. Trained ternary quantization. In: Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.
[22] Courbariaux M, Bengio Y and David J P. BinaryConnect: Training deep neural networks with binary weights during propagations. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada: MIT Press, 2015. 3123−3131.
[23] Ba J and Caruana R. Do deep nets really need to be deep? In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada: MIT Press, 2014. 2654−2662.
[24] Belagiannis V, Farshad A and Galasso F. Adversarial network compression. In: Proceedings of the 2018 European Conference on Computer Vision, Munich, Germany: Springer, 2018. 11132: 431−449.
[25] Xu Z, Hsu Y C and Huang J W. Training student networks for acceleration with conditional adversarial networks. In: Proceedings of the British Machine Vision Conference, Newcastle, UK: Springer, 2018. 61.
