视觉<b>Transformer</b>研究的关键问题: 现状及展望

田永林; 王雨桐; 王建功; 王晓; 王飞跃

doi:10.16383/j.aas.c220027

视觉Transformer研究的关键问题: 现状及展望

doi: 10.16383/j.aas.c220027

田永林^{1, 2,},
王雨桐^2,,
王建功^2,,
王晓^{2, 3,},
王飞跃^{2, 3,}

1.
中国科学技术大学自动化系合肥 230027
2.
中国科学院自动化研究所复杂系统管理与控制国家重点实验室北京 100190
3.
青岛智能产业技术研究院青岛 266000

基金项目: 广东省重点领域研发计划(2020B090921003), 广州市智能网联汽车重大科技专项(202007050002), 国家自然科学基金(U1811463), 英特尔智能网联汽车大学合作研究中心(ICRI-IACV)资助

详细信息

作者简介:
田永林：中国科学技术大学与中科院自动化研究所联合培养博士研究生. 2017年获得中国科学技术大学自动化系学士学位. 主要研究方向为计算机视觉, 智能交通. E-mail: tyldyx@mail.ustc.edu.cn

王雨桐：中国科学院自动化研究所复杂系统管理与控制国家重点实验室助理研究员. 2021年获得中国科学院大学控制理论与控制工程专业博士学位. 主要研究方向为深度学习中的对抗攻击与防御. E-mail: yutong.wang@ia.ac.cn

王建功：中国科学院自动化研究所博士研究生. 2018年获得同济大学学士学位. 主要研究方向为计算机视觉, 交通场景理解, 医学图像处理. E-mail: wangjiangong2018@ia.ac.cn

王晓：中国科学院自动化研究所复杂系统管理与控制国家重点实验室副研究员, 青岛智能产业技术研究院院长. 2016年获得中国科学院大学社会计算博士学位. 主要研究方向为社会交通, 动态网群组织, 平行智能和社交网络分析. E-mail: x.wang@ia.ac.cn

王飞跃：中国科学院自动化研究所复杂系统管理与控制国家重点实验室研究员.主要研究方向为智能系统和复杂系统的建模、分析与控制. 本文通信作者. E-mail: feiyue.wang@ia.ac.cn

计量
- 文章访问数: 8363
- HTML全文浏览量: 6229
- PDF下载量: 6543
- 被引次数: 0
出版历程
- 收稿日期: 2022-01-11
- 网络出版日期: 2022-03-28
- 刊出日期: 2022-04-13

Key Problems and Progress of Vision Transformers: The State of the Art and Prospects

TIAN Yong-Lin^{1, 2
,},
WANG Yu-Tong^2
,,
WANG Jian-Gong^2
,,
WANG Xiao^{2, 3
,},
WANG Fei-Yue^{2, 3
,}

1.
Department of Automation, University of Science and Technology of China, Hefei 230027
2.
The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
3.
Qingdao Academy of Intelligent Industries, Qingdao 266000

Funds: Supported by Key-Area Research and Development Program of Guangdong Province (2020B090921003), Key Research and Development Program of Guangzhou (202007050002), National Natural Science Foundation of China (U1811463) and Intel Collaborative Research Institute for Intelligent and Automated Connected Vehicles (ICRI-IACV)

More Information

Author Bio:
TIAN Yong-Lin　Ph. D. candidate in the Department of Automation, University of Science and Technology of China and Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from University of Science and Technology of China in 2017. His research interest covers computer vision and intelligent transportation system

WANG Yu-Tong　Assistant professor at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences. She received her Ph. D. degree in control theory and control engineering from University of Chinese Academy of Sciences in 2021. Her research interest covers computer vision and adversarial attack

WANG Jian-Gong　Ph. D. candidate at Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree from Tongji University in 2018. His research interest covers computer vision, traffic scene understanding and medical image processing

WANG Xiao　Associate professor at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences and the president of the Qingdao Academy of Intelligent Industries. She received her Ph. D. degree in social computing from University of Chinese Academy of Sciences in 2016. Her research interest covers social transportation, cybermovement organizations, parallel intelligence and social network analysis

WANG Fei-Yue　Professor at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences. His research interest covers modeling, analysis, and control of intelligent systems and complex systems. Corresponding author of this paper

摘要

摘要: Transformer所具备的长距离建模能力和并行计算能力使其在自然语言处理领域取得了巨大成功并逐步拓展至计算机视觉等领域. 本文以分类任务为切入, 介绍了典型视觉Transformer的基本原理和结构, 并分析了Transformer与卷积神经网络在连接范围、权重动态性和位置表示能力三方面的区别与联系; 同时围绕计算代价、性能提升、训练优化以及结构设计四个方面总结了视觉Transformer研究中的关键问题以及研究进展; 并提出了视觉Transformer的一般性框架; 然后针对检测和分割两个领域, 介绍了视觉Transformer在特征学习、结果产生和真值分配等方面给上层视觉模型设计带来的启发和改变; 并对视觉Transformer未来发展方向进行了展望.
- 视觉Transformer /
- 图像分类 /
- 目标检测 /
- 图像分割 /
- 计算机视觉
Abstract: Due to its long-range sequence modeling and parallel computing capability, Transformers have achieved significant success in natural language processing and are gradually expanding to computer vision area. Starting from image classification, we introduce the architecture of classic vision Transformer and compare it with convolutional neural networks in connection range, dynamic weights and position representation ability. Then, we summarize existing problems and corresponding solutions in vision Transformers including computational efficiency, performance improvement, optimization and architecture design. Besides, we propose a general architecture of Vision Transformers. For object detection and image segmentation, we discuss Transformer-based models and their roles on feature extraction, result generation and ground-truth assignment. Finally, we point out the development trends of vision Transformers.
- Vision Transformers /
- image classification /
- object detection /
- image segmentation /
- computer vision

HTML全文

图 1 ViT模型结构^[15]

Fig. 1 The framework of ViT^[15]

下载: 全尺寸图片幻灯片

图 2 自注意力^[15]与多头自注意力^[15]

Fig. 2 Self-attention^[15]and multi-head self-attention^[15]

下载: 全尺寸图片幻灯片

图 3 Transformer与CNN有效感受野对比^[25]

Fig. 3 The comparison^[25] of effective receptive field between Transformer and CNN

下载: 全尺寸图片幻灯片

图 4 单尺度与多尺度结构对比

Fig. 4 The comparison of single-scale framework and multi-scale framework

下载: 全尺寸图片幻灯片

图 5 类Transformer方法的交互模块结构对比(Transformer^[8], MLP-Mixer^[85], ResMLP^[86], gMLP^[87], CycleMLP^[88], ConvMixer^[89], PoolFormer^[90], MetaFormer^[90], CMT^[34], CVX^[91])

Fig. 5 The comparison of mixing blocks of Transformer-like methods (Transformer^[8], MLP-Mixer^[85], ResMLP^[86], gMLP^[87], CycleMLP^[88], ConvMixer^[89], PoolFormer^[90], MetaFormer^[90], CMT^[34], CVX^[91])

下载: 全尺寸图片幻灯片

图 6 视觉Transformer的一般性框架

Fig. 6 Vision Transformers architecture

下载: 全尺寸图片幻灯片

图 7 DETR的结构图^[16]

Fig. 7 The framework of DETR^[16]

下载: 全尺寸图片幻灯片

图 8 基于Transformer和CNN的目标检测器的收敛速度对比(DETR-DC5^[16], TSP-FCOS^[115], TSP-RCNN^[115], SMCA^[110], Deformable DETR^[24], Conditional DETR-DC5-R50^[111], RetinaNet^[104], Faster R-CNN^[95], Sparse R-CNN^[108])

Fig. 8 The comparison of converge speed among object detectors based on Transformer and CNN (DETR-DC5^[16], TSP-FCOS^[115], TSP-RCNN^[115], SMCA^[110], Deformable DETR^[24], Conditional DETR-DC5-R50^[111], RetinaNet^[104], Faster R-CNN^[95], Sparse R-CNN^[108])

下载: 全尺寸图片幻灯片

图 9 DETR交叉注意力稀疏性变化

Fig. 9 The change of sparsity of cross-attention map in DETR

下载: 全尺寸图片幻灯片

图 10 SETR的结构图^[121]

Fig. 10 The framework of SETR^[121]

下载: 全尺寸图片幻灯片

表 1 不同Transformer自注意力机制以及卷积的时间和空间复杂度($ N $, $ d $, $ s $分别表示序列长度、特征维度和局部窗口尺寸, 其中$s<N$)

Table 1 The time and space complexity of different Transformer frameworks ($N$, $ d $, $ s $denote the length, dimension and local window size respectively)

名称	时间复杂度	空间复杂度
Convolution	$ {\rm{O}}(Nd^2s) $	$ {\rm{O}}(Ns^2d^2+Nd) $
Transformer^[8]	$ {\rm{O}}(N^2d) $	$ {\rm{O}}(N^2+Nd) $
Sparse Transformers^[49]	$ {\rm{O}}(N\sqrt{N}d) $	—
Reformer^[50]	$ {\rm{O}}(N\log Nd) $	$ {\rm{O}}(N\log N+Ld) $
Linear Transformer^[33]	$ {\rm{O}}(Nd^2) $	$ {\rm{O}}(Nd+d^2) $
Performer^[54]	$ {\rm{O}}(Nd^2\log d) $	$ {\rm{O}}(Nd\log d+d^2\log d) $
AFT-simple^[56]	$ {\rm{O}}(Nd) $	$ {\rm{O}}(Nd) $
AFT-full^[56]	$ {\rm{O}}(N^2d) $	$ {\rm{O}}(Nd) $
AFT-local (1D)^[56]	$ {\rm{O}}(Nsd) $	$ {\rm{O}}(Nd) $
Swin Transformer (2D)^[23]	${\rm{O}}(Ns^2d)$	—

下载: 导出CSV

表 2 视觉Transformer算法在ImageNet-1k上的Top-1准确率比较

Table 2 The comparison of Top-1 accuracy of different vision Transformers on ImageNet-1k dataset

方法名称	迭代轮次	批处理大小	参数量 (×10⁶)	计算量 (GFLOPs)	图像尺寸		Top-1 准确率
方法名称	迭代轮次	批处理大小	参数量 (×10⁶)	计算量 (GFLOPs)	训练	测试	Top-1 准确率
ViT-B/16^[15]	300	4 096	86	743	224	384	77.9
ViT-L/16^[15]	300	4 096	307	5172	224	384	76.5
DeiT-Ti^[58]	300	1 024	5	1.3	224	224	72.2
DeiT-S^[58]			22	4.6	224	224	79.8
DeiT-B^[58]			86	17.6	224	224	81.8
DeiT-B$ \uparrow $^[58]			86	52.8	224	384	83.1
ConViT-Ti^[60]	300	512	6	1	224	224	73.1
ConViT-S^[60]			27	5.4	224	224	81.3
ConViT-B^[60]			86	17	224	224	82.4
LocalViT-T^[61]	300	1 024	5.9	1.3	224	224	74.8
LocalViT-S^[61]	300	1 024	22.4	4.6	224	224	80.8
CeiT-T^[73]	300	1 024	6.4	1.2	224	224	76.4
CeiT-S^[73]			24.2	4.5	224	224	82.0
CeiT-S$ \uparrow $^[73]			24.2	12.9	224	384	83.3
ResT-Small^[53]	300	2 048	13.66	1.9	224	224	79.6
ResT-Base^[53]			30.28	4.3	224	224	81.6
ResT-Large^[53]			51.63	7.9	224	224	83.6
Swin-T^[23]	300	1 024	29	4.5	224	224	81.3
Swin-S^[23]			50	8.7	224	224	83.0
Swin-B^[23]			88	15.4	224	224	83.3
Swin-B$ \uparrow $^[23]			88	47.0	224	384	84.2
VOLO-D1^[68]	300	1 024	27	6.8	224	224	84.2
VOLO-D2^[68]			59	14.1	224	224	85.2
VOLO-D3^[68]			86	20.6	224	224	85.4
VOLO-D4^[68]			193	43.8	224	224	85.7
VOLO-D5^[68]			296	69.0	224	224	86.1
VOLO-D5$ \uparrow $^[68]			296	304	224	448	87.0
PVT-Tiny^[22]	300	128	13.2	1.9	224	224	75.1
PVT-Small^[22]			24.5	3.8	224	224	79.8
PVT-Medium^[22]			44.2	6.7	224	224	81.2
PVT-Large^[22]			61.4	9.8	224	224	81.7
DeepViT-S^[66]	300	256	27	6.2	224	224	82.3
DeepViT-L^[66]	300	256	55	12.5	224	224	83.1
Refined-ViT-S^[59]	300	256	25	7.2	224	224	83.6
Refined-ViT-M^[59]			55	13.5	224	224	84.6
Refined-ViT-L^[59]			81	19.1	224	224	84.9
Refined-ViT-L$ \uparrow $^[59]		512	81	69.1	224	384	85.7
CrossViT-9^[63]	300	4 096	8.6	1.8	224	224	73.9
CrossViT-15^[63]			27.4	5.8	224	224	81.5
CrossViT-18^[63]			43.3	9.0	224	224	82.5

下载: 导出CSV

表 3 基于Transformer和基于CNN的目标检测算法在COCO 2017 val数据集上的检测精度比较. 其中C. 表示基于CNN的算法, T. 表示基于Transformer的算法

Table 3 The comparison of detection performance of Transformer-based and CNN-based detectors on COCO 2017 val set. C. denotes CNN-based methods, T. denotes Transformer-based methods

类型	方法名称	迭代轮次	计算量 (GFLOPs)	参数量 (×10⁶)	帧数 (FPS)	多尺度输入	$ AP $	$ AP_{50} $	$ AP_{75} $	$ AP_{S} $	$ AP_{M} $	$ AP_{L} $
C.	FCOS^[116]	36	177	—	17	√	41.0	59.8	44.1	26.2	44.6	52.2
	Faster R-CNN^[95]	36	180	42	26	√	40.2	61.0	43.8	24.2	43.5	52.0
	Faster R-CNN+^[95]	108	180	42	26	√	42.0	62.1	45.5	26.6	45.4	53.4
	Mask R-CNN^[99]	36	260	44	—	√	41.0	61.7	44.9	—	—	—
	Cascade Mask R-CNN^[105]	36	739	82	18	√	46.3	64.3	50.5	—	—	—
T.	ViT-B/16-FRCNN$ \ddagger $^[117]	21	—	—	—	—	36.6	56.3	39.3	17.4	40.0	55.5
	ViT-B/16-FRCNN*^[117]	21	—	—	—	—	37.8	57.4	40.1	17.8	41.4	57.3
	DETR-R50^[16]	500	86	41	28	—	42.0	62.4	44.2	20.5	45.8	61.1
	DETR-DC5-R50^[16]	500	187	41	12	—	43.3	63.1	45.9	22.5	47.3	61.1
	ACT-MTKD (L=16)^[113]	—	156	—	14	—	40.6	—	—	18.5	44.3	59.7
	ACT-MTKD (L=32)^[113]	—	169	—	16	—	43.1	—	—	22.2	47.1	61.4
	Deformable DETR^[24]	50	78	34	27	—	39.7	60.1	42.4	21.2	44.3	56.0
	Deformable DETR-DC5^[24]	50	128	34	22	—	41.5	61.8	44.9	24.1	45.3	56.0
	Deformable DETR^[24]	50	173	40	19	√	43.8	62.6	47.7	26.4	47.1	58.0
	Two-Stage Deformable DETR^[24]	50	173	40	19	√	46.2	65.2	50.0	28.8	49.2	61.7
	SMCA^[110]	50	152	40	22	—	41.0	—	—	21.9	44.3	59.1
	SMCA+^[110]	108	152	40	22	—	42.7	—	—	22.8	46.1	60.0
	SMCA^[110]	50	152	40	10	√	43.7	63.6	47.2	24.2	47.0	60.4
	SMCA+^[110]	108	152	40	10	√	45.6	65.5	49.1	25.9	49.3	62.6
	Efficient DETR^[109]	36	159	32	—	√	44.2	62.2	48.0	28.4	47.5	56.6
	Efficient DETR*^[109]	36	210	35	—	√	45.1	63.1	49.1	28.3	48.4	59.0
	Conditional DETR^[111]	108	90	44	—	—	43.0	64.0	45.7	22.7	46.7	61.5
	Conditional DETR-DC5^[111]	108	195	44	—	—	45.1	65.4	48.5	25.3	49.0	62.2
	UP-DETR^[112]	150	86	41	28	—	40.5	60.8	42.6	19.0	44.4	60.0
	UP-DETR+^[112]	300	86	41	28	—	42.8	63.0	45.3	20.8	47.1	61.7
	TSP-FCOS^[115]	36	189	51.5	15	√	43.1	62.3	47.0	26.6	46.8	55.9
	TSP-RCNN^[115]	36	188	64	11	√	43.8	63.3	48.3	28.6	46.9	55.7
	TSP-RCNN+^[115]	96	188	64	11	√	45.0	64.5	49.6	29.7	47.7	58.0
	YOLOS-S^[114]	150	200	30.7	7	—	36.1	56.4	37.1	15.3	38.5	56.1
	YOLOS-S^[114]	150	179	27.9	5	√	37.6	57.6	39.2	15.9	40.2	57.3
	YOLOS-B^[114]	150	537	127	—	—	42.0	62.2	44.5	19.5	45.3	62.1

下载: 导出CSV

表 4 基于Transformer的语义分割算法在ADE20K val数据集上的语义分割精度比较. 其中, 1k表示ImageNet-1k, 22k表示ImageNet-1k和ImageNet-21k的结合

Table 4 The comparison of semantic segmentation performance of Transformer-based methods on ADE20K val set. 1k denotes ImageNet-1k dataset, 22k denotes the combination of ImageNet-1k and ImageNet-21k

方法名称	骨干网络	预训练数据集	图像尺寸	参数量 (×10⁶)	计算量 (GFLOPs)	帧数 (FPS)	多尺度输入	mIoU
UperNet^[122]	R-50	1k	512	—	—	23.4	√	42.8
	R-101	1k	512	86	1 029	20.3	√	44.9
	Swin-T	1k	512	60	236	18.5	√	46.1
	Swin-S	1k	512	81	259	15.2	√	49.3
	Swin-B	22k	640	121	471	8.7	√	51.6
	Swin-L	22k	640	234	647	6.2	√	53.5
Segformer^[25]	MiT-B3	1k	512	47.3	79	—	√	50.0
	MiT-B4	1k	512	64.1	95.7	15.4	√	51.1
	MiT-B5	1k	512	84.7	183.3	9.8	√	51.8
Segmenter^[124]	ViT-S/16	1k	512	37.0	—	34.8	√	46.9
	ViT-B/16	1k	512	106	—	24.1	√	50.0
	ViT-L/16	22k	640	334	—	—	√	53.6
MaskFormer^[125]	R-50	1k	512	41	53	24.5	√	46.7
	R-101	1k	512	60	73	19.5	√	47.2
	Swin-T	1k	512	42	55	22.1	√	48.8
	Swin-S	1k	512	63	79	19.6	√	51.0
	Swin-B	22k	640	102	195	12.6	√	53.9
	Swin-L	22k	640	212	375	7.9	√	55.6
Mask2Former^[26]	R-50	1k	512	—	—	—	√	49.2
	R-101	1k	512	—	—	—	√	50.1
	Swin-S	1k	512	—	—	—	√	52.4
	Swin-B	22k	640			—	√	55.1
	Swin-L	22k	640	—	—	—	√	57.3

下载: 导出CSV

表 5 基于Transformer的实例分割方法和基于CNN算法在COCO test-dev数据集上的实例分割精度比较

Table 5 The comparison of instance segmentation performance of Transformer-based and typical CNN-based methods on COCO test-dev dataset

方法名称	骨干网络	迭代轮次	帧数 (FPS)	$ Ap^m $	$ Ap_S^m $	$ Ap_M^m $	$ Ap_L^m $	$ Ap^b $
Mask R-CNN^[99]	R-50-FPN	36	15.3	37.5	21.1	39.6	48.3	41.3
Mask R-CNN^[99]	R-101-FPN	36	11.8	38.8	21.8	41.4	50.5	43.1
Blend Mask^[96]	R-50-FPN	36	15.0	37.8	18.8	40.9	53.6	43.0
Blend Mask^[96]	R-101-FPN	36	11.5	39.6	22.4	42.2	51.4	44.7
SOLO v2^[97]	R-50-FPN	36	10.5	38.2	16.0	41.2	55.4	40.7
SOLO v2^[97]	R-101-FPN	36	9.0	39.7	17.3	42.9	57.4	42.6
ISTR^[127]	R-50-FPN	36	13.8	38.6	22.1	40.4	50.6	46.8
ISTR^[127]	R-101-FPN	36	11.0	39.9	22.8	41.9	52.3	48.1
SOLQ^[98]	R-50	50	—	39.7	21.5	42.5	53.1	47.8
	R-101	50	—	40.9	22.5	43.8	54.6	48.7
	Swin-L	50	—	45.9	27.8	49.3	60.5	55.4
QueryInst^[126]	R-50-FPN	36	7.0	40.6	23.4	42.5	52.8	45.6
	R-101-FPN	36	6.1	41.7	24.2	43.9	53.9	47.0
	Swin-L	50	3.3	49.1	31.5	51.8	63.2	56.1
Mask2Former^[26]	R-50	50	—	43.7	—	—	—	30.6
	R-101	50	—	44.2	—	—	—	31.1
	Swin-T	50	—	45.0	—	—	—	31.8
	Swin-L	50	—	50.1	—	—	—	36.2

下载: 导出CSV

表 6 基于Transformer的全景分割算法在COCO panoptic minval数据集上的全景分割精度比较

Table 6 The comparison of panoptic segmentation performance of Transformer-based methods on COCO panoptic minival dataset

方法名称	骨干网络	迭代轮次	参数量 (×10⁶)	计算量 (GFLOPs)	$ PQ $	$ PQ^{Th} $	$ PQ^{St} $
DETR^[16]	R-50	150+25	42.8	137	43.4	48.2	36.3
DETR^[16]	R-101	150+25	61.8	157	45.1	50.5	37
MaxDeepLab^[123]	Max-S	54	61.9	162	48.4	53.0	41.5
MaxDeepLab^[123]	Max-L	54	451	1 846	51.1	57.0	42.2
MaskFormer^[125]	R-50	300	45	181	46.5	51.0	39.8
	R-101		64	248	47.6	52.5	40.3
	Swin-T		42	179	47.7	51.7	41.7
	Swin-S		63	259	49.7	54.4	42.6
	Swin-B		102	411	51.1	56.3	43.2
	Swin-L		212	792	52.7	58.5	44.0
Panoptic SegFormer^[128]	R-50	12	51.0	214	48.0	52.3	41.5
	R-50	24	51.0	214	49.6	54.4	42.4
	R-101		69.9	286	50.6	55.5	43.2
	Swin-L		221.4	816	55.8	61.7	46.9

下载: 导出CSV

参考文献(128)

[1]	张慧, 王坤峰, 王飞跃. 深度学习在目标视觉检测中的应用进展与展望. 自动化学报, 2017, 43(8): 1289-1305 Zhang Hui, Wang Kun-Feng, Wang Fei-Yue. Advances and perspectives on applications of deep learning in visual object detection. Acta Automatica Sinica, 2017, 43(8): 1289-1305
[2]	陈伟宏, 安吉尧, 李仁发, 李万里. 深度学习认知计算综述. 自动化学报, 2017, 43(11): 1886-1897 Chen Wei-Hong, An Ji-Yao, Li Ren-Fa, Li Wan-Li. Review on deep-learning-based cognitive computing. Acta Automatica Sinica, 2017, 43(11): 1886-1897
[3]	LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541-551 doi: 10.1162/neco.1989.1.4.541
[4]	Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc., 2012. 1097−1105
[5]	He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE, 2016. 770−778
[6]	Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780 doi: 10.1162/neco.1997.9.8.1735
[7]	Chung J, Gulcehre C, Cho K H, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv: 1412.3555, 2014.
[8]	Vaswani A, Shazeer N, Parmar N, Uszkoreit U, Jones L, Gomez A N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc., 2017. 6000−6010
[9]	Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA, 2015.
[10]	Gehring J, Auli M, Grangier D, Yarats D, Dauphin Y N. Convolutional sequence to sequence learning. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: JMLR.org, 2017. 1243−1252
[11]	Jozefowicz R, Vinyals O, Schuster M, Shazeer N, Wu Y H. Exploring the limits of language modeling. arXiv preprint arXiv: 1602.02410, 2016.
[12]	Luong T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: The Association for Computational Linguistics, 2015. 1412−1421
[13]	Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota, USA: Association for Computational Linguistics, 2018. 4171−4186
[14]	Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. In: Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems. 2020.
[15]	Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, 2020.
[16]	Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 213−229
[17]	Han K, Wang Y H, Chen H T, Chen X H, Guo J Y, Liu Z H, et al. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2022.3152247
[18]	Liu Y, Zhang Y, Wang Y X, Hou F, Yuan J, Tian J, et al. A survey of visual transformers. arXiv preprint arXiv: 2111.06091, 2021.
[19]	Khan S, Naseer M, Hayat M, Zamir S W, Khan, F S, Shah M. Transformers in vision: A survey. arXiv preprint arXiv: 2101.01169, 2021.
[20]	Selva J, Johansen A S, Escalera S, Nasrollahi K, Moeslund T B, Clapés A. Video transformers: A survey. arXiv preprint arXiv: 2201.05991, 2022.
[21]	Shamshad F, Khan S, Zamir S W, Khan M H, Hayat M, Khan F S, et al. Transformers in medical imaging: A survey. arXiv preprint arXiv: 2201.09873, 2022.
[22]	Wang W H, Xie E Z, Li X, Fan D P, Song K T, Liang D, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 548−558
[23]	Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 9992−10002
[24]	Zhu X Z, Su W J, Lu L W, Li B, Wang X G, Dai J F. Deformable DETR: Deformable transformers for end-to-end object detection. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, 2021.
[25]	Xie E Z, Wang W H, Yu Z D, Anandkumar A, Alvarez J M, Luo P. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv: 2105.15203, 2021.
[26]	Cheng B W, Misra I, Schwing A G, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. arXiv preprint arXiv: 2112.01527, 2021.
[27]	Zhou L W, Zhou Y B, Corso J J, Socher R, Xiong C M. End-to-end dense video captioning with masked transformer. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 8739−8748
[28]	Zeng Y H, Fu J L, Chao H Y. Learning joint spatial-temporal transformations for video inpainting. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 528−543
[29]	Jiang Y F, Chang S Y, Wang Z Y. TransGAN: Two transformers can make one strong gan. arXiv preprint arXiv: 2102.07074, 2021.
[30]	Zhao H, Jiang L, Jia J, Torr P H, Koltun V. Point transformer. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021. 16259−16268
[31]	Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R, Hu S M. PCT: Point cloud transformer. Computational Visual Media, 2021, 7(2): 187-199 doi: 10.1007/s41095-021-0229-5
[32]	Shen Z R, Zhang M Y, Zhao H Y, Yi S, Li H S. Efficient attention: Attention with linear complexities. In: Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, USA: IEEE, 2021. 3530−3538
[33]	Katharopoulos A, Vyas A, Pappas N, François F. Transformers are rNNS: Fast autoregressive transformers with linear attention. In: Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020. 5156−5165
[34]	Guo J Y, Han K, Wu H, Xu C, Tang Y H, Xu C J, et al. CMT: Convolutional neural networks meet vision transformers. arXiv preprint arXiv: 2107.06263, 2021.
[35]	Xiao T, Singh M, Mintun E, Darrell T, Dollár P, Girshick R. Early convolutions help transformers see better. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
[36]	Kolesnikov A, Beyer L, Zhai X H, Puigcerver J, Yung J, Gelly S, et al. Big transfer (BiT): General visual representation learning. In: Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020. 491−507
[37]	Mahajan D, Girshick R, Ramanathan V, He K M, Paluri M, Li Y X, et al. Exploring the limits of weakly supervised pretraining. In: Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018. 185−201
[38]	Touvron H, Vedaldi A, Douze M, Jégou H. Fixing the train-test resolution discrepancy. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). Vancouver, Canada, 2019. 8250−8260
[39]	Xie Q Z, Luong M T, Hovy E, Le Q V. Self-training with noisy student improves ImageNet classification. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 10684−10695
[40]	Ba J L, Kiros J R, Hinton G E. Layer normalization. arXiv preprint arXiv: 1607.06450, 2016.
[41]	Kim Y, Denton C, Hoang L, Rush A M. Structured attention networks. In: Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview.net, 2017.
[42]	Buades A, Coll B, Morel J M. A non-local algorithm for image denoising. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR′′05). San Diego, USA: IEEE, 2005. 60−65
[43]	Wang X L, Girshick R, Gupta A, He K M. Non-local neural networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 7794−7803
[44]	Han Q, Fan Z J, Dai Q, Sun L Cheng M M, Liu J Y, et al. Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv preprint arXiv: 2106.04263, 2021.
[45]	Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848 doi: 10.1109/TPAMI.2017.2699184
[46]	De Brabandere B, Jia X, Tuytelaars T, Van Gool L. Dynamic filter networks. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc., 2016. 667−675
[47]	Islam A, Jia S, Bruce N D B. How much position information do convolutional neural networks encode? In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
[48]	Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: A survey. arXiv preprint arXiv: 2009.06732, 2020.
[49]	Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. arXiv preprint arXiv: 1904.10509, 2019.
[50]	Kitaev N, Kaiser L, Levskaya A. Reformer: The efficient transformer. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
[51]	Rao Y M, Zhao W L, Liu B L, Lu J W, Zhou J, Hsieh C J. DynamicViT: Efficient vision transformers with dynamic token sparsification. arXiv preprint arXiv: 2106.02034, 2021.
[52]	Wang W X, Yao L, Chen L, Lin B B, Cai D, He X F, et al. CrossFormer: A versatile vision transformer hinging on cross-scale attention. arXiv preprint arXiv: 2108.00154, 2021.
[53]	Zhang Q L, Yang B B. ResT: An efficient transformer for visual recognition. arXiv preprint arXiv: 2105.13677, 2021.
[54]	Choromanski K M, Likhosherstov V, Dohan D, Song X Y, Gane A, Sarlás T, et al. Rethinking attention with performers. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event, Austria: OpenReview.net, 2021.
[55]	Tsai Y H H, Bai S J, Yamada M, Morency L P, Salakhutdinov R. Transformer dissection: An unified understanding for transformer$'$s attention via the lens of kernel. arXiv preprint arXiv: 1908.11775, 2019.
[56]	Zhai S F, Talbott W, Srivastava N, Huang C, Goh H, Zhang R X, et al. An attention free transformer. arXiv preprint arXiv: 2015.14103, 2021.
[57]	Lu J C, Yao J H, Zhang J G, Zhu X T, Xu H, Gao W G, et al. SOFT: Softmax-free transformer with linear complexity. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
[58]	Touvron H, Cord M, Douze M, Francisco M, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 10347−10357
[59]	Zhou D Q, Shi Y J, Kang B Y, Yu W H, Jiang Z H, Li Y, et al. Refiner: Refining self-attention for vision transformers. arXiv preprint arXiv: 2106.03714, 2021.
[60]	d′′Ascoli S, Touvron H, Leavitt M L, Morcos A S, Biroli G, Sagun L. ConViT: Improving vision transformers with soft convolutional inductive biases. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 2286−2296
[61]	Li Y W, Zhang K, Cao J Z, Timofte R, Van Gool L. LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv: 2104.05707, 2021.
[62]	Lin T Y, Dollár P, Girshick R, He K M, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE, 2017. 936-−944
[63]	Chen C F, Fan Q F, Panda R. CrossViT: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021. 347−356
[64]	Gong C Y, Wang D L, Li M, Chandra V, Liu Q. Improve vision transformers training by suppressing over-smoothing. arXiv preprint arXiv: 2104.12753, 2021.
[65]	Yun S, Han D, Chun S, Oh S J, Yoo Y, Choe J. CutMix: Regularization strategy to train strong classifiers with localizable features. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 6022−6031
[66]	Zhou D Q, Kang B Y, Jin X J, Yang L J, Lian X C, Hou Q B, et al. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv: 2103.11886, 2021.
[67]	Tay Y, Bahri D, Metzler D, Juan D C, Zhao Z, Zheng C. Synthesizer: Rethinking self-attention for transformer models. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 10183−10192
[68]	Yuan L, Hou Q B, Jiang Z H, Feng J S, Yan S C. VOLO: Vision outlooker for visual recognition. arXiv preprint arXiv: 2106.13112, 2021.
[69]	Mihcak M K, Kozintsev I, Ramchandran K, Moulin P. Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters, 1999, 6(12): 300-303 doi: 10.1109/97.803428
[70]	He K M, Sun J, Tang X O. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(6): 1397-1409 doi: 10.1109/TPAMI.2012.213
[71]	Criminisi A, Pérez P, Toyama K. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 2004, 13(9): 1200-1212 doi: 10.1109/TIP.2004.833105
[72]	Raghu M, Unterthiner T, Kornblith S, Zhang C Y, Dosovitskiy A. Do vision transformers see like convolutional neural networks? In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
[73]	Yuan K, Guo S P, Liu Z W, Zhou A J, Yu F W, Wu W. Incorporating convolution designs into visual transformers. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 559−568
[74]	Chen Y P, Dai X Y, Chen D D, Liu M C, Dong X Y, Yuan L, et al. Mobile-former: Bridging MobileNet and transformer. arXiv preprint arXiv: 2108.05895, 2021.
[75]	Mehta S, Rastegari M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv: 2110.02178, 2021.
[76]	Peng Z L, Huang W, Gu S Z, Xie L X, Wang Y W, Jiao J B, et al. Conformer: Local features coupling global representations for visual recognition. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 357−366
[77]	Yan H, Deng B C, Li X N, Qiu X P. TENER: Adapting transformer encoder for named entity recognition. arXiv preprint arXiv: 1911.04474, 2019.
[78]	Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, USA: Association for Computational Linguistics, 2018. 464−468
[79]	Dai Z H, Yang Z L, Yang Y M, Carbonell J G, Le Q V, Salakhutdinov R. Transformer-XL: Attentive language models beyond a fixed-length context. In: Proceedings of the 57th Conference of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019. 2978−2988
[80]	Huang Z H, Liang D, Xu P, Xiang B. Improve transformer models with better relative position embeddings. In: Proceedings of the Findings of the Association for Computational Linguistics: EMNLP. Association for Computational Linguistics, 2020. 3327−3335
[81]	Parmar N, Ramachandran P, Vaswani A, Bello I, Levskaya A, Shlens J. Stand-alone self-attention in vision models. In: Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems. Vancouver, Canada, 2019. 68−80
[82]	Wu K, Peng H W, Chen M H, Fu J L, Chao H Y. Rethinking and improving relative position encoding for vision transformer. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 10013−10021
[83]	Deng J, Dong W, Socher R, Li L J, Li K, Li F F. ImageNet: A large-scale hierarchical image database. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE, 2009. 248−255
[84]	Zhao Y C, Wang G T, Tang C X, Luo C, Zeng W J, Zha Z J. A battle of network structures: An empirical study of CNN, transformer, and MLP. arXiv preprint arXiv: 2108.13002, 2021.
[85]	Tolstikhin I, Houlsby N, Kolesnikov A, Beyer L, Zhai X H, Unterthiner T, et al. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv: 2105.01601, 2021.
[86]	Touvron H, Bojanowski P, Caron M, Cord M, El-Nouby A, Grave E, et al. ResMLP: Feedforward networks for image classification with data-efficient training. arXiv preprint arXiv: 2105.03404, 2021.
[87]	Liu H X, Dai Z H, So D R, Le Q V. Pay attention to MLPs. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
[88]	Chen S F, Xie E Z, Ge C J, Chen R J, Liang D, Luo P. CycleMLP: A MLP-like architecture for dense prediction. arXiv preprint arXiv: 2107.10224, 2021.
[89]	Ng D, Chen Y Q, Tian B, Fu Q, Chng E S. ConvMixer: Feature interactive convolution with curriculum learning for small footprint and noisy far-field keyword spotting. arXiv preprint arXiv: 2201.05863, 2022.
[90]	Yu W H, Luo M, Zhou P, Si C Y, Zhou Y C, Wang X C, et al. MetaFormer is actually what you need for vision. arXiv preprint arXiv: 2111.11418, 2021.
[91]	Jeevan P, Sethi A. Convolutional xformers for vision. arXiv preprint arXiv: 2201.10271, 2022.
[92]	Liu Z, Mao H Z, Wu C Y, Feichtenhofer C, Darrell T, Xie S N. A ConvNet for the 2020s. arXiv preprint arXiv: 2201.03545, 2022.
[93]	Ding X H, Zhang X Y, Zhou Y Z, Han J G, Ding G G, Sun J. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. arXiv preprint arXiv: 2203.06717, 2022.
[94]	Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA, 2014.
[95]	Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. Montreal, Canada, 2015. 91−99
[96]	Chen H, Sun K Y, Tian Z, Shen C H, Huang Y M, Yan Y L. BlendMask: Top-down meets bottom-up for instance segmentation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 8570−8578
[97]	Wang X L, Zhang R F, Kong T, Li L, Shen C H. SOLOv2: Dynamic and fast instance segmentation. In: Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). Vancouver, Canada, 2020.
[98]	Dong B, Zeng F G, Wang T C, Zhang X Y, Wei Y C. SOLQ: Segmenting objects by learning queries. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
[99]	He K M, Gkioxari G, Dollár P, Girshick R B. Mask R-CNN. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017. 2980−2988
[100]	Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y, et al. SSD: Single shot MultiBox detector. In: Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016. 21−37
[101]	Law H, Deng J. CornerNet: Detecting objects as paired keypoints. In: Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018. 765−781
[102]	Zhou X Y, Wang D Q, Krähenbühl P. Objects as points. arXiv preprint arXiv: 1904.07850, 2019.
[103]	Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014. 740−755
[104]	Lin T Y, Goyal P, Girshick R, He K M, Dollár P. Focal loss for dense object detection. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE, 2017. 2999−3007
[105]	Cai Z W, Vasconcelos N. Cascade R-CNN: Delving into high quality object detection. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 6154−6162
[106]	Zhang S F, Chi C, Yao Y Q, Lei Z, Li S Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA: IEEE, 2020. 9756−9765
[107]	Chen Y H, Zhang Z, Cao Y, Wang L W, Lin S, Hu H. RepPoints v2: Verification meets regression for object detection. In: Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020. 2020.
[108]	Sun P Z, Zhang R F, Jiang Y, Kong T, Xu C F, Zhan W, et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 14449−14458
[109]	Yao Z Y, Ai J B, Li B X, Zhang C. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv preprint arXiv: 2104.01318, 2021.
[110]	Gao P, Zheng M H, Wang X G, Dai J F, Li H S. Fast convergence of DETR with spatially modulated co-attention. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 3601−3610
[111]	Meng D P, Chen X K, Fan Z J, Zeng G, Li H Q, Yuan Y H, et al. Conditional DETR for fast training convergence. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 3631−3640
[112]	Dai Z G, Cai B L, Lin Y G, Chen J Y. UP-DETR: Unsupervised pre-training for object detection with transformers. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 1601−1610
[113]	Zheng M H, Gao P, Zhang R R, Li K C, Wang X G, Li H S, et al. End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv: 2011.09315, 2020.
[114]	Fang Y X, Liao B C, Wang X G, Fang J M, Qi J Y, Wu R, et al. You only look at one sequence: Rethinking transformer in vision through object detection. arXiv preprint arXiv: 2106.00666, 2021.
[115]	Sun Z Q, Cao S C, Yang Y M, Kitani K. Rethinking transformer-based set prediction for object detection. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 3591−3600
[116]	Tian Z, Shen C H, Chen H, He T. FCOS: Fully convolutional one-stage object detection. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019. 9626−9635
[117]	Beal J, Kim E, Tzeng E, Park D H, Zhai A, Kislyuk D. Toward transformer-based object detection. arXiv preprint arXiv: 2012.09958, 2020.
[118]	Girshick R. Fast R-CNN. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015. 1440−1448
[119]	Minaee S, Boykov Y Y, Porikli F, Plaza A J, Kehtarnavaz N, Terzopoulos D. Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2021.3059968
[120]	Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE, 2015. 3431−3440
[121]	Zheng S X, Lu J C, Zhao H S, Zhu X T, Luo Z K, Wang Y B, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 6877−6886
[122]	Xiao T T, Liu Y C, Zhou B L, Jiang Y N, Sun J. Unified perceptual parsing for scene understanding. In: Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018. 432−448
[123]	Wang H Y, Zhu Y K, Adam H, Yuille A, Chen L C. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE, 2021. 5459−5470
[124]	Strudel R, Garcia R, Laptev I, Schmid C. Segmenter: Transformer for semantic segmentation. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE, 2021. 7242−7252
[125]	Cheng B W, Schwing A, Kirillov A. Per-pixel classification is not all you need for semantic segmentation. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). 2021.
[126]	Fang Y X, Yang S S, Wang X G, Li Y, Fang C, Shan Y, et al. QueryInst: Parallelly supervised mask query for instance segmentation. arXiv preprint arXiv: 2105.01928, 2021.
[127]	Hu J, Cao L J, Yan L, Zhang S C, Wang Y, Li K, et al. ISTR: End-to-end instance segmentation with transformers. arXiv preprint arXiv: 2105.00637, 2021.
[128]	Li Z Q, Wang W H, Xie E Z, Yu Z D, Anandkumar A, Alvarez J M, et al. Panoptic SegFormer: Delving deeper into panoptic segmentation with transformers. arXiv preprint arXiv: 2109.03814, 2022.