视觉Transformer研究的关键问题: 现状及展望

田永林 王雨桐 王建功 王晓 王飞跃

引用本文: 田永林, 王雨桐, 王建功, 王晓, 王飞跃. 视觉Transformer研究的关键问题: 现状及展望. 自动化学报, 2022, 48(4): 957−979 doi: 10.16383/j.aas.c220027
Citation: Tian Yong-Lin, Wang Yu-Tong, Wang Jian-Gong, Wang Xiao, Wang Fei-Yue. Key problems and progress of vision Transformers: The state of the art and prospects. Acta Automatica Sinica, 2022, 48(4): 957−979 doi: 10.16383/j.aas.c220027

视觉Transformer研究的关键问题: 现状及展望

doi: 10.16383/j.aas.c220027
基金项目: 广东省重点领域研发计划(2020B090921003), 广州市智能网联汽车重大科技专项(202007050002), 国家自然科学基金(U1811463), 英特尔智能网联汽车大学合作研究中心(ICRI-IACV)资助
    作者简介:

    田永林:中国科学技术大学与中科院自动化研究所联合培养博士研究生. 2017年获得中国科学技术大学自动化系学士学位. 主要研究方向为计算机视觉, 智能交通. E-mail: tyldyx@mail.ustc.edu.cn

    王雨桐:中国科学院自动化研究所复杂系统管理与控制国家重点实验室助理研究员. 2021年获得中国科学院大学控制理论与控制工程专业博士学位. 主要研究方向为深度学习中的对抗攻击与防御. E-mail: yutong.wang@ia.ac.cn

    王建功:中国科学院自动化研究所博士研究生. 2018年获得同济大学学士学位. 主要研究方向为计算机视觉, 交通场景理解, 医学图像处理. E-mail: wangjiangong2018@ia.ac.cn

    王晓:中国科学院自动化研究所复杂系统管理与控制国家重点实验室副研究员, 青岛智能产业技术研究院院长. 2016年获得中国科学院大学社会计算博士学位. 主要研究方向为社会交通, 动态网群组织, 平行智能和社交网络分析. E-mail: x.wang@ia.ac.cn

    王飞跃:中国科学院自动化研究所复杂系统管理与控制国家重点实验室研究员. 主要研究方向为智能系统和复杂系统的建模、分析与控制. 本文通信作者. E-mail: feiyue.wang@ia.ac.cn

Key Problems and Progress of Vision Transformers: The State of the Art and Prospects

Funds: Supported by Key-Area Research and Development Program of Guangdong Province (2020B090921003), Key Research and Development Program of Guangzhou (202007050002), National Natural Science Foundation of China (U1811463) and Intel Collaborative Research Institute for Intelligent and Automated Connected Vehicles (ICRI-IACV)
    Author Bio:

    TIAN Yong-Lin Ph.D. candidate in the Department of Automation, University of Science and Technology of China and Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree from the University of Science and Technology of China in 2017. His research interest covers computer vision and intelligent transportation systems

    WANG Yu-Tong Assistant professor at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences. She received her Ph.D. degree in control theory and control engineering from the University of Chinese Academy of Sciences in 2021. Her research interest covers computer vision and adversarial attacks

    WANG Jian-Gong Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree from Tongji University in 2018. His research interest covers computer vision, traffic scene understanding and medical image processing

    WANG Xiao Associate professor at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, and the president of the Qingdao Academy of Intelligent Industries. She received her Ph.D. degree in social computing from the University of Chinese Academy of Sciences in 2016. Her research interest covers social transportation, cybermovement organizations, parallel intelligence and social network analysis

    WANG Fei-Yue Professor at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences. His research interest covers modeling, analysis, and control of intelligent systems and complex systems. Corresponding author of this paper

  • 摘要: Transformer所具备的长距离建模能力和并行计算能力使其在自然语言处理领域取得了巨大成功并逐步拓展至计算机视觉等领域. 本文以分类任务为切入点, 介绍了典型视觉Transformer的基本原理和结构, 并分析了Transformer与卷积神经网络在连接范围、权重动态性和位置表示能力三方面的区别与联系; 同时围绕计算代价、性能提升、训练优化以及结构设计四个方面总结了视觉Transformer研究中的关键问题以及研究进展; 并提出了视觉Transformer的一般性框架; 然后针对检测和分割两个领域, 介绍了视觉Transformer在特征学习、结果产生和真值分配等方面给上层视觉模型设计带来的启发和改变; 并对视觉Transformer未来发展方向进行了展望.
  • 图  1  ViT模型结构[15]

    Fig.  1  The framework of ViT[15]
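
为直观理解图1中ViT将图像转化为token序列的方式, 下面给出一个极简的示意性草图(基于NumPy, 以随机矩阵代替可学习参数, 仅为帮助理解的假设性实现, 并非文献[15]的官方代码): 图像被划分为固定大小的图块, 经线性投影后拼接类别标记并加入位置编码.

```python
import numpy as np

def vit_patch_embed(image, patch_size=16, embed_dim=768, seed=0):
    """将图像划分为图块并投影为token序列 (示意性实现, 非文献[15]官方代码)."""
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    num_patches = (H // patch_size) * (W // patch_size)

    # 1) 划分图块并展平为 (num_patches, patch_size*patch_size*C)
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, -1)

    # 2) 线性投影到embed_dim维 (此处用随机矩阵代替可学习的投影层)
    W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    tokens = patches @ W_proj                        # (num_patches, embed_dim)

    # 3) 拼接类别标记(class token)并加上位置编码
    cls_token = rng.standard_normal((1, embed_dim)) * 0.02
    tokens = np.concatenate([cls_token, tokens], axis=0)
    pos_embed = rng.standard_normal((num_patches + 1, embed_dim)) * 0.02
    return tokens + pos_embed                        # (num_patches + 1, embed_dim)

tokens = vit_patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)   # (197, 768): 14*14个图块 + 1个类别标记
```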

    图  2  自注意力[15]与多头自注意力[15]

    Fig.  2  Self-attention[15] and multi-head self-attention[15]
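
图2所示的自注意力与多头自注意力可用如下示意性草图概括(基于NumPy的最简实现, 以随机矩阵代替可学习投影, 仅用于说明计算流程, 其中的维度取值为假设示例):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """缩放点积自注意力: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # (N, N)注意力矩阵, 即表1中O(N^2)项的来源
    return A @ V

def multi_head_self_attention(X, num_heads=8, seed=0):
    """多头自注意力: 各头独立计算注意力后拼接, 再经一次线性变换."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_head)) * 0.02 for _ in range(3))
        heads.append(self_attention(X, Wq, Wk, Wv))
    W_o = rng.standard_normal((d, d)) * 0.02
    return np.concatenate(heads, axis=-1) @ W_o      # (N, d)

X = np.random.default_rng(1).standard_normal((197, 768))   # 197个token, 特征维度768
print(multi_head_self_attention(X).shape)                  # (197, 768)
```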

    图  3  Transformer与CNN有效感受野对比[25]

    Fig.  3  The comparison[25] of effective receptive field between Transformer and CNN

    图  4  单尺度与多尺度结构对比

    Fig.  4  The comparison of single-scale framework and multi-scale framework

    图  5  类Transformer方法的交互模块结构对比(Transformer[8], MLP-Mixer[85], ResMLP[86], gMLP[87], CycleMLP[88], ConvMixer[89], PoolFormer[90], MetaFormer[90], CMT[34], CVX[91])

    Fig.  5  The comparison of mixing blocks of Transformer-like methods (Transformer[8], MLP-Mixer[85], ResMLP[86], gMLP[87], CycleMLP[88], ConvMixer[89], PoolFormer[90], MetaFormer[90], CMT[34], CVX[91])

    图  6  视觉Transformer的一般性框架

    Fig.  6  Vision Transformers architecture

    图  7  DETR的结构图[16]

    Fig.  7  The framework of DETR[16]
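
图7所示的DETR[16]将目标检测建模为集合预测, 训练时通过二分图匹配在预测集合与真值集合之间完成一一分配. 下面给出一个基于scipy匈牙利算法的示意性匹配草图(其中的代价函数为简化的假设形式, 仅组合类别概率与边界框L1距离, 并非文献[16]的完整匹配代价):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_cls=1.0, w_box=5.0):
    """为每个真值目标分配一个预测query (示意性代价: 负类别概率 + 框L1距离)."""
    prob = np.exp(pred_logits) / np.exp(pred_logits).sum(-1, keepdims=True)
    cost_cls = -prob[:, gt_labels]                        # (num_queries, num_gt)
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = w_cls * cost_cls + w_box * cost_box
    query_idx, gt_idx = linear_sum_assignment(cost)       # 最小化总代价的一一匹配
    return list(zip(query_idx, gt_idx))

rng = np.random.default_rng(0)
matches = hungarian_match(rng.standard_normal((100, 92)),  # 100个query, 91类+背景
                          rng.random((100, 4)),            # 归一化的预测框
                          np.array([3, 17]),               # 2个真值目标的类别
                          rng.random((2, 4)))              # 2个真值框
print(matches)   # 每个真值目标对应一个被匹配的query下标
```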

    图  8  基于Transformer和CNN的目标检测器的收敛速度对比(DETR-DC5[16], TSP-FCOS[115], TSP-RCNN[115], SMCA[110], Deformable DETR[24], Conditional DETR-DC5-R50[111], RetinaNet[104], Faster R-CNN[95], Sparse R-CNN[108])

    Fig.  8  The comparison of converge speed among object detectors based on Transformer and CNN (DETR-DC5[16], TSP-FCOS[115], TSP-RCNN[115], SMCA[110], Deformable DETR[24], Conditional DETR-DC5-R50[111], RetinaNet[104], Faster R-CNN[95], Sparse R-CNN[108])

    图  9  DETR交叉注意力稀疏性变化

    Fig.  9  The change of sparsity of cross-attention map in DETR

    图  10  SETR的结构图[121]

    Fig.  10  The framework of SETR[121]

    表  1  不同Transformer自注意力机制以及卷积的时间和空间复杂度 ($N$, $d$, $s$分别表示序列长度、特征维度和局部窗口尺寸, 其中$s<N$)

    Table  1  The time and space complexity of different Transformer frameworks ($N$, $d$, $s$ denote the sequence length, feature dimension and local window size respectively, where $s<N$)

    名称 时间复杂度 空间复杂度
    Convolution $ {\rm{O}}(Nd^2s) $ $ {\rm{O}}(Ns^2d^2+Nd) $
    Transformer[8] $ {\rm{O}}(N^2d) $ $ {\rm{O}}(N^2+Nd) $
    Sparse Transformers[49] $ {\rm{O}}(N\sqrt{N}d) $
    Reformer[50] $ {\rm{O}}(N\log Nd) $ $ {\rm{O}}(N\log N+Ld) $
    Linear Transformer[33] $ {\rm{O}}(Nd^2) $ $ {\rm{O}}(Nd+d^2) $
    Performer[54] $ {\rm{O}}(Nd^2\log d) $ $ {\rm{O}}(Nd\log d+d^2\log d) $
    AFT-simple[56] $ {\rm{O}}(Nd) $ $ {\rm{O}}(Nd) $
    AFT-full[56] $ {\rm{O}}(N^2d) $ $ {\rm{O}}(Nd) $
    AFT-local (1D)[56] $ {\rm{O}}(Nsd) $ $ {\rm{O}}(Nd) $
    Swin Transformer (2D)[23] ${\rm{O}}(Ns^2d)$
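
表1中各方法的复杂度量级可以通过简单的数值估算来直观比较. 下面的草图按表中公式粗略估计几种注意力机制的时间复杂度(仅为量级估算, 忽略常数因子与实现细节, 其中N、d、s的取值为假设示例):

```python
# 按表1中的公式估算几种注意力机制的时间复杂度量级 (忽略常数因子)
def complexity_estimates(N, d, s):
    return {
        "Transformer (全局自注意力)":  N ** 2 * d,      # O(N^2 d)
        "Swin Transformer (局部窗口)": N * s ** 2 * d,  # O(N s^2 d)
        "AFT-local (1D)":              N * s * d,       # O(N s d)
        "Linear Transformer":          N * d ** 2,      # O(N d^2)
    }

# 例: 224x224图像按16x16划分得到N = 196个token, 特征维度d = 768, 窗口尺寸s = 7
for name, flops in complexity_estimates(N=196, d=768, s=7).items():
    print(f"{name}: ~{flops:.3e}")
```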

    表  2  视觉Transformer算法在ImageNet-1k上的Top-1准确率比较

    Table  2  The comparison of Top-1 accuracy of different vision Transformers on ImageNet-1k dataset

    方法名称 迭代轮次 批处理大小 参数量 (×10⁶) 计算量 (GFLOPs) 图像尺寸 (训练) 图像尺寸 (测试) Top-1 准确率
    ViT-B/16[15] 300 4 096 86 743 224 384 77.9
    ViT-L/16[15] 307 5172 224 384 76.5
    DeiT-Ti[58] 300 1 024 5 1.3 224 224 72.2
    DeiT-S[58] 22 4.6 224 224 79.8
    DeiT-B[58] 86 17.6 224 224 81.8
    DeiT-B$ \uparrow $[58] 86 52.8 224 384 83.1
    ConViT-Ti[60] 300 512 6 1 224 224 73.1
    ConViT-S[60] 27 5.4 224 224 81.3
    ConViT-B[60] 86 17 224 224 82.4
    LocalViT-T[61] 300 1 024 5.9 1.3 224 224 74.8
    LocalViT-S[61] 22.4 4.6 224 224 80.8
    CeiT-T[73] 300 1 024 6.4 1.2 224 224 76.4
    CeiT-S[73] 24.2 4.5 224 224 82.0
    CeiT-S$ \uparrow $[73] 24.2 12.9 224 384 83.3
    ResT-Small[53] 300 2 048 13.66 1.9 224 224 79.6
    ResT-Base[53] 30.28 4.3 224 224 81.6
    ResT-Large[53] 51.63 7.9 224 224 83.6
    Swin-T[23] 300 1 024 29 4.5 224 224 81.3
    Swin-S[23] 50 8.7 224 224 83.0
    Swin-B[23] 88 15.4 224 224 83.3
    Swin-B$ \uparrow $[23] 88 47.0 224 384 84.2
    VOLO-D1[68] 300 1 024 27 6.8 224 224 84.2
    VOLO-D2[68] 59 14.1 224 224 85.2
    VOLO-D3[68] 86 20.6 224 224 85.4
    VOLO-D4[68] 193 43.8 224 224 85.7
    VOLO-D5[68] 296 69.0 224 224 86.1
    VOLO-D5$ \uparrow $[68] 296 304 224 448 87.0
    PVT-Tiny[22] 300 128 13.2 1.9 224 224 75.1
    PVT-Small[22] 24.5 3.8 224 224 79.8
    PVT-Medium[22] 44.2 6.7 224 224 81.2
    PVT-Large[22] 61.4 9.8 224 224 81.7
    DeepViT-S[66] 300 256 27 6.2 224 224 82.3
    DeepViT-L[66] 55 12.5 224 224 83.1
    Refined-ViT-S[59] 300 256 25 7.2 224 224 83.6
    Refined-ViT-M[59] 55 13.5 224 224 84.6
    Refined-ViT-L[59] 81 19.1 224 224 84.9
    Refined-ViT-L$ \uparrow $[59] 512 81 69.1 224 384 85.7
    CrossViT-9[63] 300 4 096 8.6 1.8 224 224 73.9
    CrossViT-15[63] 27.4 5.8 224 224 81.5
    CrossViT-18[63] 43.3 9.0 224 224 82.5

    表  3  基于Transformer和基于CNN的目标检测算法在COCO 2017 val数据集上的检测精度比较. 其中C. 表示基于CNN的算法, T. 表示基于Transformer的算法

    Table  3  The comparison of detection performance of Transformer-based and CNN-based detectors on COCO 2017 val set. C. denotes CNN-based methods, T. denotes Transformer-based methods

    类型 方法名称 迭代轮次 计算量 (GFLOPs) 参数量 (×10⁶) 帧数 (FPS) 多尺度输入 $AP$ $AP_{50}$ $AP_{75}$ $AP_{S}$ $AP_{M}$ $AP_{L}$
    C. FCOS[116] 36 177 17 41.0 59.8 44.1 26.2 44.6 52.2
    Faster R-CNN[95] 36 180 42 26 40.2 61.0 43.8 24.2 43.5 52.0
    Faster R-CNN+[95] 108 180 42 26 42.0 62.1 45.5 26.6 45.4 53.4
    Mask R-CNN[99] 36 260 44 41.0 61.7 44.9
    Cascade Mask R-CNN[105] 36 739 82 18 46.3 64.3 50.5
    T. ViT-B/16-FRCNN$ \ddagger $[117] 21 36.6 56.3 39.3 17.4 40.0 55.5
    ViT-B/16-FRCNN*[117] 21 37.8 57.4 40.1 17.8 41.4 57.3
    DETR-R50[16] 500 86 41 28 42.0 62.4 44.2 20.5 45.8 61.1
    DETR-DC5-R50[16] 500 187 41 12 43.3 63.1 45.9 22.5 47.3 61.1
    ACT-MTKD (L=16)[113] 156 14 40.6 18.5 44.3 59.7
    ACT-MTKD (L=32)[113] 169 16 43.1 22.2 47.1 61.4
    Deformable DETR[24] 50 78 34 27 39.7 60.1 42.4 21.2 44.3 56.0
    Deformable DETR-DC5[24] 50 128 34 22 41.5 61.8 44.9 24.1 45.3 56.0
    Deformable DETR[24] 50 173 40 19 43.8 62.6 47.7 26.4 47.1 58.0
    Two-Stage Deformable DETR[24] 50 173 40 19 46.2 65.2 50.0 28.8 49.2 61.7
    SMCA[110] 50 152 40 22 41.0 21.9 44.3 59.1
    SMCA+[110] 108 152 40 22 42.7 22.8 46.1 60.0
    SMCA[110] 50 152 40 10 43.7 63.6 47.2 24.2 47.0 60.4
    SMCA+[110] 108 152 40 10 45.6 65.5 49.1 25.9 49.3 62.6
    Efficient DETR[109] 36 159 32 44.2 62.2 48.0 28.4 47.5 56.6
    Efficient DETR*[109] 36 210 35 45.1 63.1 49.1 28.3 48.4 59.0
    Conditional DETR[111] 108 90 44 43.0 64.0 45.7 22.7 46.7 61.5
    Conditional DETR-DC5[111] 108 195 44 45.1 65.4 48.5 25.3 49.0 62.2
    UP-DETR[112] 150 86 41 28 40.5 60.8 42.6 19.0 44.4 60.0
    UP-DETR+[112] 300 86 41 28 42.8 63.0 45.3 20.8 47.1 61.7
    TSP-FCOS[115] 36 189 51.5 15 43.1 62.3 47.0 26.6 46.8 55.9
    TSP-RCNN[115] 36 188 64 11 43.8 63.3 48.3 28.6 46.9 55.7
    TSP-RCNN+[115] 96 188 64 11 45.0 64.5 49.6 29.7 47.7 58.0
    YOLOS-S[114] 150 200 30.7 7 36.1 56.4 37.1 15.3 38.5 56.1
    YOLOS-S[114] 150 179 27.9 5 37.6 57.6 39.2 15.9 40.2 57.3
    YOLOS-B[114] 150 537 127 42.0 62.2 44.5 19.5 45.3 62.1

    表  4  基于Transformer的语义分割算法在ADE20K val数据集上的语义分割精度比较. 其中, 1k表示ImageNet-1k, 22k表示ImageNet-1k和ImageNet-21k的结合

    Table  4  The comparison of semantic segmentation performance of Transformer-based methods on ADE20K val set. 1k denotes ImageNet-1k dataset, 22k denotes the combination of ImageNet-1k and ImageNet-21k

    方法名称 骨干网络 预训练数据集 图像尺寸 参数量 (×10⁶) 计算量 (GFLOPs) 帧数 (FPS) 多尺度输入 mIoU
    UperNet[122] R-50 1k 512 23.4 42.8
    R-101 1k 512 86 1 029 20.3 44.9
    Swin-T 1k 512 60 236 18.5 46.1
    Swin-S 1k 512 81 259 15.2 49.3
    Swin-B 22k 640 121 471 8.7 51.6
    Swin-L 22k 640 234 647 6.2 53.5
    Segformer[25] MiT-B3 1k 512 47.3 79 50.0
    MiT-B4 1k 512 64.1 95.7 15.4 51.1
    MiT-B5 1k 512 84.7 183.3 9.8 51.8
    Segmenter[124] ViT-S/16 1k 512 37.0 34.8 46.9
    ViT-B/16 1k 512 106 24.1 50.0
    ViT-L/16 22k 640 334 53.6
    MaskFormer[125] R-50 1k 512 41 53 24.5 46.7
    R-101 1k 512 60 73 19.5 47.2
    Swin-T 1k 512 42 55 22.1 48.8
    Swin-S 1k 512 63 79 19.6 51.0
    Swin-B 22k 640 102 195 12.6 53.9
    Swin-L 22k 640 212 375 7.9 55.6
    Mask2Former[26] R-50 1k 512 49.2
    R-101 1k 512 50.1
    Swin-S 1k 512 52.4
    Swin-B 22k 640 55.1
    Swin-L 22k 640 57.3

    表  5  基于Transformer的实例分割方法和基于CNN算法在COCO test-dev数据集上的实例分割精度比较

    Table  5  The comparison of instance segmentation performance of Transformer-based and typical CNN-based methods on COCO test-dev dataset

    方法名称 骨干网络 迭代轮次 帧数 (FPS) $AP^{m}$ $AP^{m}_{S}$ $AP^{m}_{M}$ $AP^{m}_{L}$ $AP^{b}$
    Mask R-CNN[99] R-50-FPN 36 15.3 37.5 21.1 39.6 48.3 41.3
    R-101-FPN 36 11.8 38.8 21.8 41.4 50.5 43.1
    Blend Mask[96] R-50-FPN 36 15.0 37.8 18.8 40.9 53.6 43.0
    R-101-FPN 36 11.5 39.6 22.4 42.2 51.4 44.7
    SOLO v2[97] R-50-FPN 36 10.5 38.2 16.0 41.2 55.4 40.7
    R-101-FPN 36 9.0 39.7 17.3 42.9 57.4 42.6
    ISTR[127] R-50-FPN 36 13.8 38.6 22.1 40.4 50.6 46.8
    R-101-FPN 36 11.0 39.9 22.8 41.9 52.3 48.1
    SOLQ[98] R-50 50 39.7 21.5 42.5 53.1 47.8
    R-101 50 40.9 22.5 43.8 54.6 48.7
    Swin-L 50 45.9 27.8 49.3 60.5 55.4
    QueryInst[126] R-50-FPN 36 7.0 40.6 23.4 42.5 52.8 45.6
    R-101-FPN 36 6.1 41.7 24.2 43.9 53.9 47.0
    Swin-L 50 3.3 49.1 31.5 51.8 63.2 56.1
    Mask2Former[26] R-50 50 43.7 30.6
    R-101 50 44.2 31.1
    Swin-T 50 45.0 31.8
    Swin-L 50 50.1 36.2

    表  6  基于Transformer的全景分割算法在COCO panoptic minval数据集上的全景分割精度比较

    Table  6  The comparison of panoptic segmentation performance of Transformer-based methods on COCO panoptic minival dataset

    方法名称 骨干网络 迭代轮次 参数量 (×10⁶) 计算量 (GFLOPs) $PQ$ $PQ^{Th}$ $PQ^{St}$
    DETR[16] R-50 150+25 42.8 137 43.4 48.2 36.3
    R-101 61.8 157 45.1 50.5 37
    MaxDeepLab[123] Max-S 54 61.9 162 48.4 53.0 41.5
    Max-L 451 1 846 51.1 57.0 42.2
    MaskFormer[125] R-50 300 45 181 46.5 51.0 39.8
    R-101 64 248 47.6 52.5 40.3
    Swin-T 42 179 47.7 51.7 41.7
    Swin-S 63 259 49.7 54.4 42.6
    Swin-B 102 411 51.1 56.3 43.2
    Swin-L 212 792 52.7 58.5 44.0
    Panoptic SegFormer[128] R-50 12 51.0 214 48.0 52.3 41.5
    R-50 24 51.0 214 49.6 54.4 42.4
    R-101 69.9 286 50.6 55.5 43.2
    Swin-L 221.4 816 55.8 61.7 46.9
  • [1] 张慧, 王坤峰, 王飞跃. 深度学习在目标视觉检测中的应用进展与展望. 自动化学报, 2017, 43(8): 1289-1305

    Zhang Hui, Wang Kun-Feng, Wang Fei-Yue. Advances and perspectives on applications of deep learning in visual object detection. Acta Automatica Sinica, 2017, 43(8): 1289-1305
    [2] 陈伟宏, 安吉尧, 李仁发, 李万里. 深度学习认知计算综述. 自动化学报, 2017, 43(11): 1886-1897

    Chen Wei-Hong, An Ji-Yao, Li Ren-Fa, Li Wan-Li. Review on deep-learning-based cognitive computing. Acta Automatica Sinica, 2017, 43(11): 1886-1897
    [3] LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541-551 doi: 10.1162/neco.1989.1.4.541
    [4] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc., 2012. 1097−1105
    [5] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 770−778
    [6] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780 doi: 10.1162/neco.1997.9.8.1735
    [7] Chung J, Gulcehre C, Cho K H, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv: 1412.3555, 2014.
    [8] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc., 2017. 6000−6010
    [9] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA, 2015.
    [10] Gehring J, Auli M, Grangier D, Yarats D, Dauphin Y N. Convolutional sequence to sequence learning. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: JMLR.org, 2017. 1243−1252
    [11] Jozefowicz R, Vinyals O, Schuster M, Shazeer N, Wu Y H. Exploring the limits of language modeling. arXiv preprint arXiv: 1602.02410, 2016.
    [12] Luong T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: The Association for Computational Linguistics, 2015. 1412−1421
    [13] Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota, USA: Association for Computational Linguistics, 2018. 4171−4186
    [14] Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. In: Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems. 2020.
    [15] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, 2020.
    [16] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 213−229
    [17] Han K, Wang Y H, Chen H T, Chen X H, Guo J Y, Liu Z H, et al. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2022.3152247
    [18] Liu Y, Zhang Y, Wang Y X, Hou F, Yuan J, Tian J, et al. A survey of visual transformers. arXiv preprint arXiv: 2111.06091, 2021.
    [19] Khan S, Naseer M, Hayat M, Zamir S W, Khan F S, Shah M. Transformers in vision: A survey. arXiv preprint arXiv: 2101.01169, 2021.
    [20] Selva J, Johansen A S, Escalera S, Nasrollahi K, Moeslund T B, Clapés A. Video transformers: A survey. arXiv preprint arXiv: 2201.05991, 2022.
    [21] Shamshad F, Khan S, Zamir S W, Khan M H, Hayat M, Khan F S, et al. Transformers in medical imaging: A survey. arXiv preprint arXiv: 2201.09873, 2022.
    [22] Wang W H, Xie E Z, Li X, Fan D P, Song K T, Liang D, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 548−558
    [23] Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 9992−10002
    [24] Zhu X Z, Su W J, Lu L W, Li B, Wang X G, Dai J F. Deformable DETR: Deformable transformers for end-to-end object detection. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, 2021.
    [25] Xie E Z, Wang W H, Yu Z D, Anandkumar A, Alvarez J M, Luo P. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv: 2105.15203, 2021.
    [26] Cheng B W, Misra I, Schwing A G, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. arXiv preprint arXiv: 2112.01527, 2021.
    [27] Zhou L W, Zhou Y B, Corso J J, Socher R, Xiong C M. End-to-end dense video captioning with masked transformer. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 8739−8748
    [28] Zeng Y H, Fu J L, Chao H Y. Learning joint spatial-temporal transformations for video inpainting. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 528−543
    [29] Jiang Y F, Chang S Y, Wang Z Y. TransGAN: Two transformers can make one strong gan. arXiv preprint arXiv: 2102.07074, 2021.
    [30] Zhao H, Jiang L, Jia J, Torr P H, Koltun V. Point transformer. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021. 16259−16268
    [31] Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R, Hu S M. PCT: Point cloud transformer. Computational Visual Media, 2021, 7(2): 187-199 doi: 10.1007/s41095-021-0229-5
    [32] Shen Z R, Zhang M Y, Zhao H Y, Yi S, Li H S. Efficient attention: Attention with linear complexities. In: Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, USA: IEEE, 2021. 3530−3538
    [33] Katharopoulos A, Vyas A, Pappas N, Fleuret F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In: Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020. 5156−5165
    [34] Guo J Y, Han K, Wu H, Xu C, Tang Y H, Xu C J, et al. CMT: Convolutional neural networks meet vision transformers. arXiv preprint arXiv: 2107.06263, 2021.
    [35] Xiao T, Singh M, Mintun E, Darrell T, Dollár P, Girshick R. Early convolutions help transformers see better. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
    [36] Kolesnikov A, Beyer L, Zhai X H, Puigcerver J, Yung J, Gelly S, et al. Big transfer (BiT): General visual representation learning. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 491−507
    [37] Mahajan D, Girshick R, Ramanathan V, He K M, Paluri M, Li Y X, et al. Exploring the limits of weakly supervised pretraining. In: Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018. 185−201
    [38] Touvron H, Vedaldi A, Douze M, Jégou H. Fixing the train-test resolution discrepancy. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). Vancouver, Canada, 2019. 8250−8260
    [39] Xie Q Z, Luong M T, Hovy E, Le Q V. Self-training with noisy student improves ImageNet classification. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 10684−10695
    [40] Ba J L, Kiros J R, Hinton G E. Layer normalization. arXiv preprint arXiv: 1607.06450, 2016.
    [41] Kim Y, Denton C, Hoang L, Rush A M. Structured attention networks. In: Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview.net, 2017.
    [42] Buades A, Coll B, Morel J M. A non-local algorithm for image denoising. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). San Diego, USA: IEEE, 2005. 60−65
    [43] Wang X L, Girshick R, Gupta A, He K M. Non-local neural networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 7794−7803
    [44] Han Q, Fan Z J, Dai Q, Sun L, Cheng M M, Liu J Y, et al. Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv preprint arXiv: 2106.04263, 2021.
    [45] Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848 doi: 10.1109/TPAMI.2017.2699184
    [46] De Brabandere B, Jia X, Tuytelaars T, Van Gool L. Dynamic filter networks. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc., 2016. 667−675
    [47] Islam A, Jia S, Bruce N D B. How much position information do convolutional neural networks encode? In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
    [48] Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: A survey. arXiv preprint arXiv: 2009.06732, 2020.
    [49] Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. arXiv preprint arXiv: 1904.10509, 2019.
    [50] Kitaev N, Kaiser L, Levskaya A. Reformer: The efficient transformer. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
    [51] Rao Y M, Zhao W L, Liu B L, Lu J W, Zhou J, Hsieh C J. DynamicViT: Efficient vision transformers with dynamic token sparsification. arXiv preprint arXiv: 2106.02034, 2021.
    [52] Wang W X, Yao L, Chen L, Lin B B, Cai D, He X F, et al. CrossFormer: A versatile vision transformer hinging on cross-scale attention. arXiv preprint arXiv: 2108.00154, 2021.
    [53] Zhang Q L, Yang B B. ResT: An efficient transformer for visual recognition. arXiv preprint arXiv: 2105.13677, 2021.
    [54] Choromanski K M, Likhosherstov V, Dohan D, Song X Y, Gane A, Sarlós T, et al. Rethinking attention with performers. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, 2021.
    [55] Tsai Y H H, Bai S J, Yamada M, Morency L P, Salakhutdinov R. Transformer dissection: An unified understanding for transformer's attention via the lens of kernel. arXiv preprint arXiv: 1908.11775, 2019.
    [56] Zhai S F, Talbott W, Srivastava N, Huang C, Goh H, Zhang R X, et al. An attention free transformer. arXiv preprint arXiv: 2105.14103, 2021.
    [57] Lu J C, Yao J H, Zhang J G, Zhu X T, Xu H, Gao W G, et al. SOFT: Softmax-free transformer with linear complexity. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
    [58] Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 10347−10357
    [59] Zhou D Q, Shi Y J, Kang B Y, Yu W H, Jiang Z H, Li Y, et al. Refiner: Refining self-attention for vision transformers. arXiv preprint arXiv: 2106.03714, 2021.
    [60] d'Ascoli S, Touvron H, Leavitt M L, Morcos A S, Biroli G, Sagun L. ConViT: Improving vision transformers with soft convolutional inductive biases. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 2286−2296
    [61] Li Y W, Zhang K, Cao J Z, Timofte R, Van Gool L. LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv: 2104.05707, 2021.
    [62] Lin T Y, Dollár P, Girshick R, He K M, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017. 936−944
    [63] Chen C F, Fan Q F, Panda R. CrossViT: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021. 347−356
    [64] Gong C Y, Wang D L, Li M, Chandra V, Liu Q. Improve vision transformers training by suppressing over-smoothing. arXiv preprint arXiv: 2104.12753, 2021.
    [65] Yun S, Han D, Chun S, Oh S J, Yoo Y, Choe J. CutMix: Regularization strategy to train strong classifiers with localizable features. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 6022−6031
    [66] Zhou D Q, Kang B Y, Jin X J, Yang L J, Lian X C, Hou Q B, et al. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv: 2103.11886, 2021.
    [67] Tay Y, Bahri D, Metzler D, Juan D C, Zhao Z, Zheng C. Synthesizer: Rethinking self-attention for transformer models. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 10183−10192
    [68] Yuan L, Hou Q B, Jiang Z H, Feng J S, Yan S C. VOLO: Vision outlooker for visual recognition. arXiv preprint arXiv: 2106.13112, 2021.
    [69] Mihcak M K, Kozintsev I, Ramchandran K, Moulin P. Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters, 1999, 6(12): 300-303 doi: 10.1109/97.803428
    [70] He K M, Sun J, Tang X O. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(6): 1397-1409 doi: 10.1109/TPAMI.2012.213
    [71] Criminisi A, Pérez P, Toyama K. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 2004, 13(9): 1200-1212 doi: 10.1109/TIP.2004.833105
    [72] Raghu M, Unterthiner T, Kornblith S, Zhang C Y, Dosovitskiy A. Do vision transformers see like convolutional neural networks? In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
    [73] Yuan K, Guo S P, Liu Z W, Zhou A J, Yu F W, Wu W. Incorporating convolution designs into visual transformers. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 559−568
    [74] Chen Y P, Dai X Y, Chen D D, Liu M C, Dong X Y, Yuan L, et al. Mobile-former: Bridging MobileNet and transformer. arXiv preprint arXiv: 2108.05895, 2021.
    [75] Mehta S, Rastegari M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv: 2110.02178, 2021.
    [76] Peng Z L, Huang W, Gu S Z, Xie L X, Wang Y W, Jiao J B, et al. Conformer: Local features coupling global representations for visual recognition. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 357−366
    [77] Yan H, Deng B C, Li X N, Qiu X P. TENER: Adapting transformer encoder for named entity recognition. arXiv preprint arXiv: 1911.04474, 2019.
    [78] Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, USA: Association for Computational Linguistics, 2018. 464−468
    [79] Dai Z H, Yang Z L, Yang Y M, Carbonell J G, Le Q V, Salakhutdinov R. Transformer-XL: Attentive language models beyond a fixed-length context. In: Proceedings of the 57th Conference of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019. 2978−2988
    [80] Huang Z H, Liang D, Xu P, Xiang B. Improve transformer models with better relative position embeddings. In: Proceedings of the Findings of the Association for Computational Linguistics: EMNLP. Association for Computational Linguistics, 2020. 3327−3335
    [81] Parmar N, Ramachandran P, Vaswani A, Bello I, Levskaya A, Shlens J. Stand-alone self-attention in vision models. In: Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems. Vancouver, Canada, 2019. 68−80
    [82] Wu K, Peng H W, Chen M H, Fu J L, Chao H Y. Rethinking and improving relative position encoding for vision transformer. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 10013−10021
    [83] Deng J, Dong W, Socher R, Li L J, Li K, Li F F. ImageNet: A large-scale hierarchical image database. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE, 2009. 248−255
    [84] Zhao Y C, Wang G T, Tang C X, Luo C, Zeng W J, Zha Z J. A battle of network structures: An empirical study of CNN, transformer, and MLP. arXiv preprint arXiv: 2108.13002, 2021.
    [85] Tolstikhin I, Houlsby N, Kolesnikov A, Beyer L, Zhai X H, Unterthiner T, et al. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv: 2105.01601, 2021.
    [86] Touvron H, Bojanowski P, Caron M, Cord M, El-Nouby A, Grave E, et al. ResMLP: Feedforward networks for image classification with data-efficient training. arXiv preprint arXiv: 2105.03404, 2021.
    [87] Liu H X, Dai Z H, So D R, Le Q V. Pay attention to MLPs. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
    [88] Chen S F, Xie E Z, Ge C J, Chen R J, Liang D, Luo P. CycleMLP: A MLP-like architecture for dense prediction. arXiv preprint arXiv: 2107.10224, 2021.
    [89] Ng D, Chen Y Q, Tian B, Fu Q, Chng E S. ConvMixer: Feature interactive convolution with curriculum learning for small footprint and noisy far-field keyword spotting. arXiv preprint arXiv: 2201.05863, 2022.
    [90] Yu W H, Luo M, Zhou P, Si C Y, Zhou Y C, Wang X C, et al. MetaFormer is actually what you need for vision. arXiv preprint arXiv: 2111.11418, 2021.
    [91] Jeevan P, Sethi A. Convolutional xformers for vision. arXiv preprint arXiv: 2201.10271, 2022.
    [92] Liu Z, Mao H Z, Wu C Y, Feichtenhofer C, Darrell T, Xie S N. A ConvNet for the 2020s. arXiv preprint arXiv: 2201.03545, 2022.
    [93] Ding X H, Zhang X Y, Zhou Y Z, Han J G, Ding G G, Sun J. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. arXiv preprint arXiv: 2203.06717, 2022.
    [94] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA, 2014.
    [95] Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. Montreal, Canada, 2015. 91−99
    [96] Chen H, Sun K Y, Tian Z, Shen C H, Huang Y M, Yan Y L. BlendMask: Top-down meets bottom-up for instance segmentation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 8570−8578
    [97] Wang X L, Zhang R F, Kong T, Li L, Shen C H. SOLOv2: Dynamic and fast instance segmentation. In: Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). Vancouver, Canada, 2020.
    [98] Dong B, Zeng F G, Wang T C, Zhang X Y, Wei Y C. SOLQ: Segmenting objects by learning queries. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
    [99] He K M, Gkioxari G, Dollár P, Girshick R B. Mask R-CNN. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017. 2980−2988
    [100] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y, et al. SSD: Single shot MultiBox detector. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 21−37
    [101] Law H, Deng J. CornerNet: Detecting objects as paired keypoints. In: Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018. 765−781
    [102] Zhou X Y, Wang D Q, Krähenbühl P. Objects as points. arXiv preprint arXiv: 1904.07850, 2019.
    [103] Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 740−755
    [104] Lin T Y, Goyal P, Girshick R, He K M, Dollár P. Focal loss for dense object detection. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017. 2999−3007
    [105] Cai Z W, Vasconcelos N. Cascade R-CNN: Delving into high quality object detection. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 6154−6162
    [106] Zhang S F, Chi C, Yao Y Q, Lei Z, Li S Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 9756−9765
    [107] Chen Y H, Zhang Z, Cao Y, Wang L W, Lin S, Hu H. RepPoints v2: Verification meets regression for object detection. In: Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020. 2020.
    [108] Sun P Z, Zhang R F, Jiang Y, Kong T, Xu C F, Zhan W, et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 14449−14458
    [109] Yao Z Y, Ai J B, Li B X, Zhang C. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv preprint arXiv: 2104.01318, 2021.
    [110] Gao P, Zheng M H, Wang X G, Dai J F, Li H S. Fast convergence of DETR with spatially modulated co-attention. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 3601−3610
    [111] Meng D P, Chen X K, Fan Z J, Zeng G, Li H Q, Yuan Y H, et al. Conditional DETR for fast training convergence. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 3631−3640
    [112] Dai Z G, Cai B L, Lin Y G, Chen J Y. UP-DETR: Unsupervised pre-training for object detection with transformers. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 1601−1610
    [113] Zheng M H, Gao P, Zhang R R, Li K C, Wang X G, Li H S, et al. End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv: 2011.09315, 2020.
    [114] Fang Y X, Liao B C, Wang X G, Fang J M, Qi J Y, Wu R, et al. You only look at one sequence: Rethinking transformer in vision through object detection. arXiv preprint arXiv: 2106.00666, 2021.
    [115] Sun Z Q, Cao S C, Yang Y M, Kitani K. Rethinking transformer-based set prediction for object detection. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 3591−3600
    [116] Tian Z, Shen C H, Chen H, He T. FCOS: Fully convolutional one-stage object detection. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 9626−9635
    [117] Beal J, Kim E, Tzeng E, Park D H, Zhai A, Kislyuk D. Toward transformer-based object detection. arXiv preprint arXiv: 2012.09958, 2020.
    [118] Girshick R. Fast R-CNN. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 1440−1448
    [119] Minaee S, Boykov Y Y, Porikli F, Plaza A J, Kehtarnavaz N, Terzopoulos D. Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2021.3059968
    [120] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 3431−3440
    [121] Zheng S X, Lu J C, Zhao H S, Zhu X T, Luo Z K, Wang Y B, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 6877−6886
    [122] Xiao T T, Liu Y C, Zhou B L, Jiang Y N, Sun J. Unified perceptual parsing for scene understanding. In: Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018. 432−448
    [123] Wang H Y, Zhu Y K, Adam H, Yuille A, Chen L C. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 5459−5470
    [124] Strudel R, Garcia R, Laptev I, Schmid C. Segmenter: Transformer for semantic segmentation. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 7242−7252
    [125] Cheng B W, Schwing A, Kirillov A. Per-pixel classification is not all you need for semantic segmentation. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
    [126] Fang Y X, Yang S S, Wang X G, Li Y, Fang C, Shan Y, et al. QueryInst: Parallelly supervised mask query for instance segmentation. arXiv preprint arXiv: 2105.01928, 2021.
    [127] Hu J, Cao L J, Yan L, Zhang S C, Wang Y, Li K, et al. ISTR: End-to-end instance segmentation with transformers. arXiv preprint arXiv: 2105.00637, 2021.
    [128] Li Z Q, Wang W H, Xie E Z, Yu Z D, Anandkumar A, Alvarez J M, et al. Panoptic SegFormer: Delving deeper into panoptic segmentation with transformers. arXiv preprint arXiv: 2109.03814, 2022.
出版历程
  • 收稿日期:  2022-01-11
  • 网络出版日期:  2022-03-28
  • 刊出日期:  2022-04-13
