基于重组性高斯自注意力的视觉Transformer

赵亮 周继开

引用本文: 赵亮, 周继开. 基于重组性高斯自注意力的视觉Transformer. 自动化学报, 2023, 49(9): 1976−1988 doi: 10.16383/j.aas.c220715
Citation: Zhao Liang, Zhou Ji-Kai. Vision Transformer based on reconfigurable Gaussian self-attention. Acta Automatica Sinica, 2023, 49(9): 1976−1988 doi: 10.16383/j.aas.c220715

基于重组性高斯自注意力的视觉Transformer

doi: 10.16383/j.aas.c220715
基金项目: 国家自然科学基金(51209167, 12002251), 陕西省自然科学基金(2019JM-474), 陕西省岩土与地下空间工程重点实验室开放基金(YT202004), 陕西省教育厅服务地方专项计划(22JC043)资助
详细信息
    作者简介:

    赵亮:西安建筑科技大学信息与控制工程学院教授. 主要研究方向为智能建筑检测, 计算机视觉和模式识别. 本文通信作者. E-mail: zhaoliang@xauat.edu.cn

    周继开:西安建筑科技大学信息与控制工程学院硕士研究生. 主要研究方向为图像处理和目标检测. E-mail: m18706793699@163.com

Vision Transformer Based on Reconfigurable Gaussian Self-attention

Funds: Supported by National Natural Science Foundation of China (51209167, 12002251), Natural Science Foundation of Shaanxi Province (2019JM-474), Open Fund Project of Key Laboratory of Geotechnical and Underground Space Engineering in Shaanxi Province (YT202004), and Shaanxi Provincial Department of Education Service Local Special Plan Project (22JC043)
More Information
    Author Bio:

    ZHAO Liang Professor at College of Information and Control Engineering, Xi'an University of Architecture and Technology. His research interest covers intelligent building detection, computer vision and pattern recognition. Corresponding author of this paper

    ZHOU Ji-Kai Master student at College of Information and Control Engineering, Xi'an University of Architecture and Technology. His research interest covers image processing and object detection

  • Abstract: In the local self-attention of current vision Transformers, existing strategies cannot establish information flow among all windows, which leads to insufficient context modeling capability. To address this problem, a new local self-attention mechanism, SGW-MSA (shuffled and Gaussian window multi-head self-attention), is proposed based on a mixed Gaussian weight recombination (GWR) strategy. It fuses three different kinds of local self-attention and reconstructs the feature map with the GWR strategy; image features are then extracted from the reconstructed feature map, establishing interaction among all windows to capture richer contextual information. The overall SGWin Transformer architecture is designed on the basis of SGW-MSA. Experimental results show that the proposed algorithm outperforms Swin Transformer in accuracy by 5.1% on the mini-imagenet image classification dataset and by 5.2% on CIFAR10, and its mAP on the MS COCO dataset exceeds Swin Transformer by 5.5% and 5.1% with the Mask R-CNN and Cascade R-CNN detection frameworks, respectively. Compared with other models based on local self-attention, it is strongly competitive with a similar number of parameters.
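As a rough illustration of the mechanism described in the abstract, the sketch below implements one plausible reading of it in PyTorch: the feature map is spatially shuffled so that tokens from different windows mix, reweighted with a Gaussian profile as a stand-in for GWR, and then processed by multi-head self-attention inside non-overlapping windows. The shuffle scheme, the `gaussian_reweight` function, and all module and parameter names are illustrative assumptions, not the authors' released implementation of SGW-MSA.

```python
import torch
import torch.nn as nn


def window_partition(x, win):
    """Split a (B, H, W, C) feature map into non-overlapping win x win windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)


def window_reverse(tokens, win, B, H, W, C):
    """Inverse of window_partition."""
    x = tokens.view(B, H // win, W // win, win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


def spatial_shuffle(x, groups):
    """Channel-shuffle-style permutation of rows and columns, so that tokens
    from different windows land in the same window (assumed scheme)."""
    B, H, W, C = x.shape
    x = x.view(B, groups, H // groups, groups, W // groups, C)
    return x.permute(0, 2, 1, 4, 3, 5).reshape(B, H, W, C)


def gaussian_reweight(x, sigma=1.0):
    """Illustrative stand-in for Gaussian weight recombination (GWR):
    reweight spatial positions with a 2-D Gaussian profile."""
    B, H, W, C = x.shape
    ys = torch.linspace(-1.0, 1.0, H, device=x.device)
    xs = torch.linspace(-1.0, 1.0, W, device=x.device)
    g = torch.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2.0 * sigma ** 2))
    return x * g[None, :, :, None]


class ShuffledGaussianWindowAttention(nn.Module):
    """Local window attention over a shuffled, Gaussian-reweighted feature map."""

    def __init__(self, dim, heads, win):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                    # x: (B, H, W, C)
        B, H, W, C = x.shape
        x = spatial_shuffle(x, groups=self.win)               # cross-window information flow
        x = gaussian_reweight(x)                              # GWR-style reweighting (assumed)
        tokens = window_partition(x, self.win)                # (B*num_windows, win*win, C)
        out, _ = self.attn(tokens, tokens, tokens)            # self-attention inside each window
        return window_reverse(out, self.win, B, H, W, C)


# Example: a Stage-1-sized input (C = 96, 3 heads, 7x7 windows, see Table 1 below).
feat = torch.randn(1, 56, 56, 96)
print(ShuffledGaussianWindowAttention(96, 3, 7)(feat).shape)  # torch.Size([1, 56, 56, 96])
```

The window size, head count and channel width in the example follow the Stage-1 settings listed in Table 1.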
  • 图  1  现有局部自注意力方法

    Fig.  1  Existing local self-attention methods

    图  2  局部自注意力组合

    Fig.  2  Local self-attention combination

    图  3  SGWin Transformer整体架构

    Fig.  3  Overall architecture of SGWin Transformer

    图  4  SGW-MSA局部自注意力示意图

    Fig.  4  SGW-MSA local self-attention diagram

    图  5  GW-MSA局部自注意力示意图

    Fig.  5  GW-MSA local self-attention diagram

    图  6  纵横向基础元素块示意图

    Fig.  6  Schematic diagram of vertical and horizontal basic element block

    图  7  SGWin Transformer block结构示意图

    Fig.  7  Structure diagram of SGWin Transformer block

    图  8  本文算法与Swin Transformer的热力图对比

    Fig.  8  Comparison of heat maps between the proposed algorithm and Swin Transformer

    图  9  融合效果示意图

    Fig.  9  Schematic diagram of fusion effect

    图  10  MS COCO检测结果可视化

    Fig.  10  Visualization of detection results on MS COCO

    表  1  SGWin Transformer的超参数配置表

    Table  1  Hyperparameter configuration of SGWin Transformer

    Stage  Stride  Layer              Parameter
    1      4       Patch embed        $P_1 = 4$, $C_1 = 96$
                   Transformer block  $[S_1 = 7,\ H_1 = 3,\ R_1 = 4] \times 2$
    2      8       Patch merging      $P_2 = 2$, $C_2 = 192$
                   Transformer block  $[S_2 = 7,\ H_2 = 6,\ R_2 = 4] \times 2$
    3      16      Patch merging      $P_3 = 2$, $C_3 = 384$
                   Transformer block  $[S_3 = 7,\ H_3 = 12,\ R_3 = 4] \times 2$
    4      32      Patch merging      $P_4 = 2$, $C_4 = 768$
                   Transformer block  $[S_4 = 7,\ H_4 = 24,\ R_4 = 4] \times 2$
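Table 1 can also be read as a plain stage-wise configuration. The sketch below writes it out as Python data; the field names are illustrative labels for P, C, S, H, R and the block count, and reading R as the MLP expansion ratio (as in Swin Transformer) is an assumption.

```python
# Stage-wise hyperparameters of SGWin Transformer as listed in Table 1.
# patch = patch-embedding / patch-merging size (P), channels = width (C),
# window = window size (S), heads = attention heads (H),
# mlp_ratio = expansion ratio (R, assumed), depth = Transformer blocks per stage.
SGWIN_CONFIG = [
    dict(stage=1, stride=4,  patch=4, channels=96,  window=7, heads=3,  mlp_ratio=4, depth=2),
    dict(stage=2, stride=8,  patch=2, channels=192, window=7, heads=6,  mlp_ratio=4, depth=2),
    dict(stage=3, stride=16, patch=2, channels=384, window=7, heads=12, mlp_ratio=4, depth=2),
    dict(stage=4, stride=32, patch=2, channels=768, window=7, heads=24, mlp_ratio=4, depth=2),
]
```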

    表  2  基础元素块宽度消融实验对比

    Table  2  Ablation results for the width of the basic element block

    $W_b$   $AP^b\;(\%)$   $AP^m\;(\%)$
    1       34.2           31.9
    2       34.9           32.5
    3       35.8           33.2
    4       36.3           33.7
    5       35.5           32.4
    6       34.7           32.0

    表  3  SGW-MSA消融实验结果

    Table  3  SGW-MSA ablation experimental results

    序号   方法                 $AP^b\;(\%)$   $AP^m\;(\%)$
    A      SW-MSA (baseline)    30.8           29.5
    B      Shuffled W-MSA       33.6 (+2.8)    31.6 (+2.1)
    C      B+VGW-MSA            34.9 (+1.3)    32.7 (+1.1)
    D      C+HGW-MSA            36.3 (+1.4)    33.7 (+1.0)

    表  4  CIFAR10数据集上的Top1精度对比

    Table  4  Top1 accuracy comparison on CIFAR10 dataset

    算法                   Top1准确率 (%)   Parameter (MB)
    Swin Transformer       85.44            7.1
    CSWin Transformer      90.20            7.0
    CrossFormer            88.64            7.0
    GG Transformer         87.75            7.1
    Shuffle Transformer    89.32            7.1
    Pale Transformer       90.23            7.0
    SGWin Transformer      90.64            7.1

    表  5  mini-imagenet数据集上的Top1精度对比

    Table  5  Top1 accuracy comparison on mini-imagenet dataset

    算法                   Top1准确率 (%)   Parameter (MB)
    Swin Transformer       67.51            28
    CSWin Transformer      71.68            23
    CrossFormer            70.43            28
    GG Transformer         69.85            28
    Shuffle Transformer    71.26            28
    Pale Transformer       71.96            23
    SGWin Transformer      72.63            28

    表  6  以Mask R-CNN为目标检测框架在MS COCO数据集上的实验结果

    Table  6  Experimental results on MS COCO dataset based on Mask R-CNN

    Backbone   Params (M)   FLOPs (G)   $AP^b\;(\%)$   $AP^b_{50}\;(\%)$   $AP^b_{75}\;(\%)$   $AP^m\;(\%)$   $AP^m_{50}\;(\%)$   $AP^m_{75}\;(\%)$
    Swin       48           264         39.6           61.3                43.2                36.6           58.2                39.3
    CSWin      42           279         42.6           63.3                46.9                39.0           60.5                42.0
    Cross      50           301         41.3           62.7                45.3                38.2           59.7                41.2
    GG         48           265         40.0           61.4                43.9                36.7           58.2                39.0
    Shuffle    48           268         42.7           63.6                47.1                39.1           60.9                42.2
    Focal      49           291         40.7           62.4                44.8                37.8           59.6                40.8
    Pale       41           306         43.3           64.1                47.9                39.5           61.2                42.8
    SGWin      48           265         45.1           66.0                49.9                40.8           63.5                44.2

    表  7  以Cascade R-CNN为目标检测框架在MS COCO数据集上的实验结果

    Table  7  Experimental results on MS COCO dataset based on Cascade R-CNN

    Backbone   Params (M)   FLOPs (G)   $AP^b\;(\%)$   $AP^b_{50}\;(\%)$   $AP^b_{75}\;(\%)$   $AP^m\;(\%)$   $AP^m_{50}\;(\%)$   $AP^m_{75}\;(\%)$
    Swin       86           754         47.8           55.5                40.9                33.4           52.8                35.8
    CSWin      80           757         40.7           57.1                44.5                35.5           55.0                38.3
    Cross      88           770         39.5           56.9                43.0                34.7           53.7                37.2
    GG         86           756         38.1           55.4                41.5                33.2           51.9                35.1
    Shuffle    86           758         40.7           57.0                44.4                35.8           55.1                38.0
    Focal      87           770         38.6           55.6                42.2                34.5           53.7                39.0
    Pale       79           770         41.5           57.8                45.3                36.1           55.2                39.0
    SGWin      86           756         42.9           60.9                46.3                37.8           57.2                40.5

    表  8  KITTI和PASCAL VOC数据集上的实验结果

    Table  8  Experimental results on KITTI and PASCAL VOC datasets

    Backbone   KITTI mAP@0.5:0.95   VOC mAP@0.5   Params (M)   FPS
    Swin       57.3                 59.6          14.4         50
    CSWin      58.7                 64.1          14.2         48
    Cross      58.1                 62.8          13.8         20
    Shuffle    58.7                 64.6          14.4         53
    GG         57.8                 62.4          14.4         46
    Pale       58.9                 64.5          14.2         48
    SGWin      59.2                 65.1          14.4         56
Publication history
  • Received:  2022-09-10
  • Accepted:  2023-01-13
  • Published online:  2023-08-24
  • Issue date:  2023-09-26
