提示学习在计算机视觉中的分类、应用及展望

刘袁缘 刘树阳 刘云娇 袁雨晨 唐厂 罗威

引用本文: 刘袁缘, 刘树阳, 刘云娇, 袁雨晨, 唐厂, 罗威. 提示学习在计算机视觉中的分类、应用及展望. 自动化学报, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240177
Citation: Liu Yuan-Yuan, Liu Shu-Yang, Liu Yun-Jiao, Yuan Yu-Chen, Tang Chang, Luo Wei. The classification, applications, and prospects of prompt learning in computer vision. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240177

提示学习在计算机视觉中的分类、应用及展望

doi: 10.16383/j.aas.c240177 cstr: 32138.14.j.aas.c240177
基金项目: 国家自然科学基金(62076227), 湖北省自然科学基金(2023AFB572), 湖北省智能地理信息处理重点实验室(KLIGIP-2022-B10), 国家自然科学基金(U2341228)资助
详细信息
    作者简介:

    刘袁缘:中国地质大学(武汉)计算机学院副教授. 主要研究方向为计算机视觉. E-mail: liuyy@cug.edu.cn

    刘树阳:中国地质大学(武汉)计算机学院硕士研究生. 主要研究方向为人脸情感识别. E-mail: 20171003670@cug.edu.cn

    刘云娇:中国地质大学(武汉)计算机学院硕士研究生. 主要研究方向为遥感图像分割. E-mail: luyunjiao@cug.edu.cn

    袁雨晨:中国地质大学(武汉)计算机学院硕士研究生. 主要研究方向为聚类分析. E-mail: 1202321648@cug.edu.cn

    唐厂:中国地质大学(武汉)计算机学院教授. 主要研究方向为多视图学习. E-mail: tangchang@cug.edu.cn

    罗威:中国舰船研究设计中心高级工程师. 主要研究方向为舰船人工智能. 本文通信作者. E-mail: csddc_weiluo@163.com

The Classification, Applications, and Prospects of Prompt Learning in Computer Vision

Funds: Supported by National Natural Science Foundation of China (62076227), Natural Science Foundation of Hubei Province (2023AFB572), Hubei Key Laboratory of Intelligent Geo-information Processing (KLIGIP-2022-B10), and National Natural Science Foundation of China (U2341228)
More Information
    Author Bio:

    LIU Yuan-Yuan Associate professor at the School of Computer Science, China University of Geosciences (Wuhan). Her main research interest is computer vision

    LIU Shu-Yang Master student at the School of Computer Science, China University of Geosciences (Wuhan). His main research interest is facial emotion recognition

    LIU Yun-Jiao Master student at the School of Computer Science, China University of Geosciences (Wuhan). Her main research interest is remote sensing image segmentation

    YUAN Yu-Chen Master student at the School of Computer Science, China University of Geosciences (Wuhan). His main research interest is cluster analysis

    TANG Chang Professor at the School of Computer Science, China University of Geosciences (Wuhan). His main research interest is multi-view learning

    LUO Wei Senior engineer at China Ship Development and Design Center. His main research interest is ship artificial intelligence. Corresponding author of this paper

  • 摘要: 随着计算机视觉(Computer vision, CV)的快速发展, 人们对于提高视觉任务的性能和泛化能力的需求不断增长, 导致模型的复杂度与对各种资源的需求进一步提高. 提示学习(Prompt learning, PL)作为一种能有效地提升模型性能和泛化能力、重用预训练模型和降低计算量的方法, 在一系列下游视觉任务中受到了广泛的关注与研究. 然而, 现有的PL综述缺乏对PL方法全面的分类和讨论, 也缺乏对现有实验结果进行深入的研究以评估现有方法的优缺点. 因此, 本文对PL在CV领域的分类、应用和性能进行全面的概述. 首先, 介绍PL的研究背景和定义, 并简要回顾CV领域中PL研究的最新进展. 其次, 对目前CV领域中的PL方法进行分类, 包括文本提示、视觉提示和视觉—语言联合提示, 对每类PL方法进行详细阐述并探讨其优缺点. 接着, 综述PL在十个常见下游视觉任务中的最新进展. 此外, 提供三个CV应用的实验结果并进行总结和分析, 全面讨论不同PL方法在CV领域的表现. 最后, 基于上述讨论对PL在CV领域面临的挑战和机遇进行分析, 为进一步推动PL在CV领域的发展提供前瞻性的思考.
  • 图  1  基于PL的CV应用概述

    Fig.  1  Overview of CV applications based on PL

    图  2  NLP中的提示流程

    Fig.  2  The prompting process in NLP
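
    图2所示的NLP提示流程可概括为: 先用模板把下游输入改写成完形填空形式, 再由冻结的预训练语言模型预测被掩盖的词, 最后通过verbalizer把预测词映射回任务标签. 下面给出该流程的一个最小示意代码(假设使用HuggingFace transformers库和bert-base-uncased模型, 模板与词到标签的映射均为举例假设, 并非文中任何方法的官方实现):

```python
# 示意性代码: 图2所示NLP提示流程的最小草图(模板与verbalizer均为示例假设)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# 1) 用提示模板把原始输入包装成完形填空
x = "I watched this movie three times."
template = f"{x} Overall, it was a [MASK] film."

# 2) 由冻结的掩码语言模型预测[MASK]处的词
predictions = fill_mask(template)

# 3) 通过verbalizer把预测词映射回情感标签(词到标签的映射仅为举例)
verbalizer = {"great": "positive", "good": "positive", "bad": "negative", "terrible": "negative"}
for p in predictions:
    word = p["token_str"].strip()
    if word in verbalizer:
        print(word, "->", verbalizer[word], f"(score={p['score']:.3f})")
```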

    图  3  文本提示((a)基于手工设计的文本提示; (b)连续提示; (c)基于梯度引导的文本提示; (d)基于视觉映射到语言空间的提示; (e)基于图像引导的文本提示; (f)基于伪标签的文本提示; (g)基于多任务的文本提示)

    Fig.  3  Text prompts ((a) Hand-crafted text prompt; (b) Continuous prompt; (c) Gradient-guided text prompt; (d) Prompt based on mapping from the visual space to the language space; (e) Image-guided text prompt; (f) Pseudo-label-based text prompt; (g) Multi-task-based text prompt)
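
    作为补充, 下面给出图3(b)中连续提示(以CoOp为代表的思路)的最小示意草图: 冻结预训练编码器, 只学习拼接在类名嵌入前的一组连续上下文向量. 代码中的文本编码器与图像特征均用随机张量或线性层占位, 仅用于说明“只训练提示参数”的方式, 并非CLIP或文中任何方法的官方实现:

```python
# 示意性代码: 连续文本提示(CoOp风格)的最小草图, 编码器均为占位实现
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousTextPrompt(nn.Module):
    """可学习的上下文向量[V]_1...[V]_M, 拼接在冻结的类名词嵌入之前."""
    def __init__(self, class_embeddings, n_ctx=16, embed_dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)  # 唯一需要训练的参数
        self.register_buffer("cls_emb", class_embeddings)              # 冻结的类名嵌入(n_cls, n_tok, D)

    def forward(self):
        n_cls = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)                   # (n_cls, n_ctx + n_tok, D)

n_cls, dim = 10, 512
prompt_learner = ContinuousTextPrompt(torch.randn(n_cls, 4, dim))
text_encoder = nn.Linear(dim, dim).requires_grad_(False)               # 占位: 实际应为冻结的文本编码器
optimizer = torch.optim.AdamW(prompt_learner.parameters(), lr=2e-3)    # 只优化提示向量

img_feat = F.normalize(torch.randn(8, dim), dim=-1)                    # 占位: 冻结图像编码器输出的特征
labels = torch.randint(0, n_cls, (8,))

# 真实实现中应把整个提示序列送入Transformer文本编码器; 此处简单池化后过占位编码器
txt_feat = F.normalize(text_encoder(prompt_learner().mean(dim=1)), dim=-1)
loss = F.cross_entropy(img_feat @ txt_feat.t() / 0.07, labels)         # 图文相似度作为分类logits
loss.backward()
optimizer.step()
```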

    图  4  视觉提示((a)基于像素扰动的视觉提示; (b)基于提示tokens的视觉提示; (c)基于提示模块的视觉提示; (d)基于上下文样例模板的视觉提示; (e)基于网络结构搜索的视觉提示)

    Fig.  4  Visual prompts ((a) Pixel perturbation-based visual prompt; (b) Prompt tokens-based visual prompt; (c) Prompt module-based visual prompt; (d) Contextual example template-based visual prompt; (e) Network architecture search-based visual prompt)
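
    下面给出图4(b)中“基于提示tokens的视觉提示”(VPT浅层形式)的最小示意草图: 在冻结的Transformer主干输入序列前拼接若干可学习的提示token, 只训练提示与分类头. 其中主干用随机初始化的nn.TransformerEncoder占位, 并非任何预训练ViT的官方实现:

```python
# 示意性代码: 基于提示tokens的视觉提示(VPT-shallow风格)的最小草图
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, embed_dim=192, n_prompts=10, n_classes=100, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=3, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        # 冻结patch嵌入与主干(相当于预训练权重)
        for p in list(self.patch_embed.parameters()) + list(self.backbone.parameters()):
            p.requires_grad = False
        # 可学习的提示tokens与分类头
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, N, D)
        prompts = self.prompts.expand(x.shape[0], -1, -1)           # (B, P, D)
        feats = self.backbone(torch.cat([prompts, tokens], dim=1))  # 提示token与图像token一起做自注意力
        return self.head(feats[:, :prompts.shape[1]].mean(dim=1))   # 用提示token的输出做分类

model = PromptedViT()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("可训练参数量:", trainable)                                    # 远小于冻结主干的参数量
logits = model(torch.randn(2, 3, 224, 224))                          # (2, 100)
```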

    图  5  在视觉—语言模型上引入视觉—语言联合提示的四种方法对比((a)独立训练两种模态的提示; (b)共享地训练两种模态的提示; (c)使用两个MLP层来生成提示; (d)使用一个轻量级的自注意力网络来生成提示)

    Fig.  5  Comparison of four methods for introducing vision-language joint prompts in vision-language models ((a) Train the prompts of the two modalities independently; (b) Train the prompts of the two modalities in a shared manner; (c) Use two MLP layers to generate prompts; (d) Use a lightweight self-attention network to generate prompts)
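
    针对图5(c)、(d)中“由同一组参数生成两种模态提示”的思路(以MaPLe的耦合函数为代表), 下面给出一个最小示意草图: 文本侧提示为可学习参数, 视觉侧提示由一个线性耦合函数从文本提示映射得到, 两者随同一损失联合更新. 编码器均为占位实现, 仅用于说明参数共享与梯度回传方式, 并非文中方法的官方实现:

```python
# 示意性代码: 视觉-语言联合提示的耦合生成方式(MaPLe风格)的最小草图
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointPromptLearner(nn.Module):
    def __init__(self, n_ctx=4, txt_dim=512, vis_dim=768):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_ctx, txt_dim) * 0.02)  # 文本侧可学习提示
        self.coupler = nn.Linear(txt_dim, vis_dim)                            # 耦合函数: 文本提示 -> 视觉提示

    def forward(self):
        return self.text_prompts, self.coupler(self.text_prompts)

learner = JointPromptLearner()
text_encoder = nn.Linear(512, 512).requires_grad_(False)    # 占位: 冻结的文本编码器
image_encoder = nn.Linear(768, 512).requires_grad_(False)   # 占位: 冻结的图像编码器

txt_p, vis_p = learner()
# 真实实现中, txt_p与vis_p会被分别插入文本与图像Transformer各层的token序列;
# 此处仅把提示池化后送入占位编码器, 以演示两种模态共享同一组可训练参数
txt_feat = F.normalize(text_encoder(txt_p.mean(0, keepdim=True)), dim=-1)
img_feat = F.normalize(image_encoder(vis_p.mean(0, keepdim=True)), dim=-1)
similarity = (txt_feat @ img_feat.t()).squeeze()
# 实际训练应使用对比或交叉熵损失; 此处反向传播仅说明梯度能同时到达text_prompts与coupler
similarity.backward()
```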

    图  6  图像识别中的视觉提示方法((a)基于像素扰动提示的DAM-VP; (b)基于提示tokens的VQT)

    Fig.  6  Visual prompt methods in image recognition ((a) DAM-VP based on pixel perturbation prompts; (b) VQT based on prompt tokens)

    图  7  基于视觉—语言联合提示的MaPLe图像分类框架

    Fig.  7  Vision-language joint prompts-based MaPLe image classification framework

    图  8  SAM方法流程图

    Fig.  8  Flowchart of the SAM method
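
    图8中的SAM以点、框或粗略掩码等提示驱动轻量级掩码解码器, 图像编码只需前向一次. 下面给出使用官方segment-anything包进行点提示分割的简短用法草图(权重文件名与图像路径均为示例假设):

```python
# 示意性代码: 用点提示调用SAM分割单张图像(权重与图像路径为示例假设)
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # 预先下载的ViT-B权重
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                       # 图像编码器只前向一次, 提示可反复更换

masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),         # 一个前景点提示, 也可改用box=或mask_input=
    point_labels=np.array([1]),                  # 1为前景点, 0为背景点
    multimask_output=True,                       # 返回多个候选掩码以应对提示歧义
)
best_mask = masks[np.argmax(scores)]             # 取置信度最高的掩码
```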

    图  9  基于CLIP的OVD框架((a)在CLIP的文本编码器端引入文本提示; (b)在CLIP的图像编码器端引入提示tokens)

    Fig.  9  CLIP-based OVD framework ((a) Introducing text prompts at the text encoder side of CLIP; (b) Introducing prompt tokens at the image encoder side of CLIP)

    图  10  CLIPCap图像描述任务框架

    Fig.  10  Image caption task framework of CLIPCap
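
    图10中CLIPCap的核心思想是把CLIP图像特征经映射网络转换为语言模型的前缀(prefix), 再由冻结的GPT-2在该前缀条件下生成描述. 下面给出这一训练思路的最小示意草图(图像特征用随机向量占位, 映射网络结构与前缀长度均为示例设置, 并非官方实现):

```python
# 示意性代码: 图像特征 -> 语言模型前缀(CLIPCap风格)的最小草图
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2.requires_grad_(False)                                  # 冻结语言模型, 只训练映射网络

prefix_len, clip_dim, gpt_dim = 10, 512, gpt2.config.n_embd
mapper = nn.Linear(clip_dim, gpt_dim * prefix_len)          # 把图像嵌入映射为prefix_len个前缀token

clip_feat = torch.randn(1, clip_dim)                        # 占位: 实际应为CLIP图像编码器的输出
prefix = mapper(clip_feat).view(1, prefix_len, gpt_dim)

caption = tokenizer("a dog playing on the grass", return_tensors="pt").input_ids
caption_emb = gpt2.transformer.wte(caption)                 # caption的词嵌入

inputs_embeds = torch.cat([prefix, caption_emb], dim=1)
labels = torch.cat([torch.full((1, prefix_len), -100), caption], dim=1)  # 前缀位置不计算语言建模损失
loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()                                             # 梯度只更新mapper
```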

    图  11  ViPT方法流程图

    Fig.  11  Flowchart of the ViPT method

    图  12  基于手工设计的文本提示的FEWVLM模型结构

    Fig.  12  FEWVLM model structure based on hand-crafted text prompts

    表  1  CV领域视觉与多模态基础大模型及其参数量

    Table  1  Vision and multimodal foundational large models in CV with their parameter size

    类别 模型 年份 参数量
    视觉 DETR 2020 40M
    视觉 Vision Transformer 2021 86M~632M
    视觉 DINOv2 2023 1.1B
    视觉 LVM 2023 300M~3B
    多模态 CLIP 2021 400M~1.6B
    多模态 SAM 2023 1B
    多模态 MiniGPT-4 2023 13B
    多模态 LLaVA 2023 7B~13B
    多模态 Yi-VL 2024 6B~34B

    表  2  图像分类任务中提示方法和非提示方法的性能对比(加粗表示性能最优, 下划线表示性能次优)

    Table  2  Comparison of the performance between prompted and unprompted methods in the image classification task (Bold indicates the best performance and underline indicates the second-best performance)

    预训练模型: 前5列为ViT-B-22K, 后5列为Swin-B-22K; 每组从左到右依次为非PL方法(全面微调、线性探测)与PL方法(VP、VPT、DAM-VP), 数值均为准确率(%)
    数据集 全面微调 线性探测 VP VPT DAM-VP 全面微调 线性探测 VP VPT DAM-VP
    CIFAR10 97.4 96.3 94.2 96.83 97.3 98.3 96.3 94.8 96.9 97.3
    CIFAR100 68.9 63.4 78.7 78.8 88.1 73.3 61.6 80.6 80.5 88.1
    Food-101 84.9 84.4 80.5 83.3 86.9 91.7 88.2 83.4 90.1 90.5
    DTD 64.3 63.2 59.5 65.8 73.1 72.4 73.6 75.1 78.5 80.0
    SVHN 87.4 36.6 87.6 78.1 87.9 91.2 43.5 80.3 87.8 81.7
    CUB-200 87.3 85.3 84.6 88.5 87.5 89.7 88.6 86.5 90.0 90.4
    Stanford Dogs 89.4 86.2 84.5 90.2 92.3 86.2 85.9 81.3 84.8 88.5
    Flowers102 98.8 97.9 97.7 99.0 99.2 98.3 99.4 98.6 99.3 99.6

    表  3  从基类到新类的泛化设置下CLIP、CoOp、CoCoOp和MaPLe的对比(HM代表对基类和新类的准确率取调和平均值, 加粗表示性能最优)

    Table  3  Comparison of CLIP, CoOp, CoCoOp and MaPLe under the generalization setting from base class to new class (HM denotes the harmonic mean of the accuracies on both base and new classes, bold indicates the best performance)

    数据集 CLIP CoOp CoCoOp MaPLe
    Base (%) New (%) HM (%) Base (%) New (%) HM (%) Base (%) New (%) HM (%) Base (%) New (%) HM (%)
    ImageNet 72.43 68.14 70.22 76.47 67.88 71.92 75.98 70.43 73.10 76.66 70.54 73.47
    Caltech101 96.84 94.00 95.40 98.00 89.81 93.73 97.96 93.81 95.84 97.74 94.36 96.02
    OxfordPets 91.17 97.26 94.12 93.67 95.29 94.47 95.20 97.69 96.43 95.43 97.76 96.58
    StanfordCars 63.37 74.89 68.65 78.12 60.40 68.13 70.49 73.59 72.01 72.94 74.00 73.47
    Flowers102 72.08 77.80 74.83 97.60 59.67 74.06 94.87 71.75 81.71 95.92 72.46 82.56
    Food-101 90.10 91.22 90.66 88.33 82.26 85.19 90.70 91.29 90.99 90.71 92.05 91.38
    FGVCAircraft 27.19 36.29 31.09 40.44 22.30 28.75 33.41 23.71 27.74 37.44 35.61 36.50
    SUN397 69.36 75.35 72.23 80.60 65.89 72.51 79.74 76.86 78.27 80.82 78.70 79.75
    DTD 53.24 59.90 56.37 79.44 41.18 54.24 77.01 56.00 64.85 80.36 59.18 68.16
    EuroSAT 56.48 64.05 60.03 92.19 54.74 68.69 87.49 60.04 71.21 94.07 73.23 82.35
    UCF101 70.53 77.50 73.85 84.69 56.05 67.46 82.33 73.45 77.64 83.00 78.66 80.77
    平均值 69.34 74.22 71.10 82.69 63.22 71.66 80.47 71.69 75.83 82.28 75.14 78.55

    表  4  ADE20K数据集上提示方法和非提示方法的语义分割性能对比(加粗表示性能最优, 下划线表示性能次优)

    Table  4  Comparison of semantic segmentation performance on the ADE20K dataset between prompted and unprompted methods (Bold indicates the best performance and underline indicates the second-best performance)

    方法 参数量(M) mIoU(%)
    PL方法:
        SPM 14.9 45.05
        VPT 13.39 42.11
        AdaptFormer 16.31 44.00
        SAM - 53.0
        EfficientSAM - 51.8
    非PL方法:
        fully tuning 317.29 47.53
        head tuning 13.14 37.77

    表  5  COCO数据集上提示方法和非提示方法的实例分割性能对比(加粗表示性能最优, 下划线表示性能次优)

    Table  5  Comparison of instance segmentation performance on the COCO dataset between prompted and unprompted methods (Bold indicates the best performance and underline indicates the second-best performance)

    方法 mAP(%)
    PL方法:
        SAM 46.8
        EfficientSAM 44.4
        HQ-SAM 49.5
        PA-SAM 49.9
    非PL方法:
        Mask2Former 43.7
        OneFormer 45.6

    表  6  多模态跟踪任务中提示方法和非提示方法的性能对比(加粗表示性能最优, 下划线表示性能次优)

    Table  6  Performance comparison between prompted and unprompted methods in multimodal tracking tasks (Bold indicates the best performance and underline indicates the second-best performance)

    数据集 RGBT234 LasHeR
    评价指标 precision(%) success(%) precision(%) success(%)
    PL方法:
        TaTrack 87.2 64.4 85.3 61.8
        MPLT 88.4 65.7 72.0 57.1
        ViPT 83.5 61.7 65.1 52.5
        ProTrack 79.5 59.9 53.8 42.0
    非PL方法:
        OsTrack 72.9 54.9 51.5 41.2
        FANet 78.7 55.3 44.1 30.9
        SGT 72.0 47.2 36.5 25.1
  • [1] Xu M, Yin W, Cai D, Yi R, Xu D, Wang Q, et al. A survey of resource-efficient llm and multimodal foundation models. arXiv preprint arXiv: 2401.08092, 2024.
    [2] Zhou J, Chen Y, Hong Z, Chen W, Yu Y, Zhang T, et al. Training and Serving System of Foundation Models: A Comprehensive Survey. IEEE Open Journal of the Computer Society, DOI: 10.1109/OJCS.2024.3380828
    [3] Liu Z, Yu X, Fang Y, Zhang X. Graphprompt: Unifying pre-training and downstream tasks for graph neural networks. In: Proceedings of the ACM Web Conference. Austin, USA: 2023. 417-428
    [4] Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 2023, 55(9): 1−35
    [5] Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, Fernandez P, Haziza D, Massa F, El-Nouby A. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv: 2304.07193, 2023.
    [6] Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning. Virtual Event: PMLR, 2021. 8748-8763
    [7] Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment Anything. In: Proceeding of 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE, 2023. 3992-4003
    [8] 廖宁, 曹敏, 严骏驰. 视觉提示学习综述. 计算机学报, 2024, 47(04): 790−820

    Liao Ning, Cao Min, Yan Jun-Chi. Visual prompt learning: a survey. Chinese Journal of Computers, 2024, 47(04): 790−820
    [9] Zang Y, Li W, Zhou K, Huang C, Loy C C. Unified vision and language prompt learning. arXiv: 2210.07225, 2022
    [10] Khattak M U, Rasheed H, Maaz M, Khan S, Khan F S. Maple: Multi-modal prompt learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 19113-19122
    [11] Chen S, Ge C, Tong Z, Wang J, Song Y, Wang J, et al. Adaptformer: Adapting vision transformers for scalable visual recognition. arXiv: 2205.13535, 2022
    [12] Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009. 248-255
    [13] Zhou K, Yang J, Loy C C, Liu Z. Learning to prompt for vision-language models. International Journal of Computer Vision, 2022, 130(9): 2337−2348 doi: 10.1007/s11263-022-01653-1
    [14] Zhou K, Yang J, Loy C C, Liu Z. Conditional prompt learning for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA: IEEE, 2022. 16816-16825
    [15] Derakhshani M M, Sanchez E, Bulat A, da Costa V G, Snoek C G, Tzimiropoulos G, et al. Bayesian prompt learning for image-language model generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Vancouver, BC, Canada: IEEE, 2023. 15237-15246
    [16] Yao H, Zhang R, Xu C. Visual-language prompt tuning with knowledge-guided context optimization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Vancouver, BC, Canada: IEEE, 2023. 6757-6767
    [17] Bulat A, Tzimiropoulos G. Lasp: Text-to-text optimization for language-aware soft prompting of vision & language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 23232-23241
    [18] Zhu B, Niu Y, Han Y, Wu Y, Zhang H. Prompt-aligned gradient for prompt tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE, 2023. 15659-15669
    [19] Huang T, Chu J, Wei F. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv: 2204.03649, 2022.
    [20] Shen S, Yang S, Zhang T, Zhai B, Gonzalez J E, Keutzer K, Darrell T. Multitask vision-language prompt tuning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, HI, USA: IEEE, 2024. 5656-5667
    [21] Bahng H, Jahanian A, Sankaranarayanan S, Isola P. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv: 2203.17274, 2022.
    [22] Chen A, Yao Y, Chen P Y, Zhang Y, Liu S. Understanding and improving visual prompting: A label-mapping perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 19133-19143
    [23] Oh C, Hwang H, Lee H Y, Lim Y, Jung G, Jung J, Choi H, Song K. Blackvip: Black-box visual prompting for robust transfer learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 24224-24235
    [24] Huang Q, Dong X, Chen D, Zhang W, Wang F, Hua G, Yu N. Diversity-aware meta visual prompting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 10878-10887
    [25] Jia M, Tang L, Chen B C, Cardie C, Belongie S, Hariharan B, et al. Visual prompt tuning. In: Proceedings of the European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022. 709-727
    [26] Tu C H, Mai Z, Chao W L. Visual query tuning: Towards effective usage of intermediate representations for parameter and memory efficient transfer learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 7725-7735
    [27] Das R, Dukler Y, Ravichandran A, Swaminathan A. Learning expressive prompting with residuals for vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 3366-3377
    [28] Dong B, Zhou P, Yan S, Zuo W. LPT: long-tailed prompt tuning for image classification. In: Proceedings of The Eleventh International Conference on Learning Representations. Kigali, Rwanda: ICLR, 2023. 1-20
    [29] Zhang Y, Zhou K, Liu Z. Neural prompt search. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.48550/arXiv.2206.04673
    [30] Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv: 2106.09685, 2021.
    [31] Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, et al. Parameter-efficient transfer learning for NLP. In: Proceedings of the International Conference on Machine Learning. Long Beach, CA, USA: PMLR, 2019. 2790-2799
    [32] Nilsback M E, Zisserman A. Automated flower classification over a large number of classes. In: Proceedings of the Sixth Indian Conference on Computer Vision, Graphics & Image Processing. Bhubaneswar, India: IEEE, 2008. 722-729
    [33] Helber P, Bischke B, Dengel A, Borth D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(7): 2217−26 doi: 10.1109/JSTARS.2019.2918242
    [34] Fahes M, Vu T H, Bursuc A, Pérez P, De Charette R. Poda: Prompt-driven zero-shot domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Vancouver, BC, Canada: IEEE, 2023. 18623-18633
    [35] Liu L, Chang J, Yu BX, Lin L, Tian Q, Chen C W. Prompt-matched semantic segmentation. arXiv preprint arXiv: 2208.10159, 2022.
    [36] Liu W, Shen X, Pun C M, Cun X. Explicit visual prompting for low-level structure segmentations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 19434-19445
    [37] Bar A, Gandelsman Y, Darrell T, Globerson A, Efros A. Visual prompting via image inpainting. arXiv: 2209.00647, 2022.
    [38] Ma X, Wang Y, Liu H, Guo T, Wang Y. When visual prompt tuning meets source-free domain adaptive semantic segmentation. Advances in Neural Information Processing Systems, 2023, 36: 6690−6702
    [39] Zhao X, Ding W, An Y, Du Y, Yu T, Li M, et al. Fast segment anything. arXiv preprint arXiv: 2306.12156, 2023.
    [40] Zhang C, Han D, Qiao Y, Kim J U, Bae S H, Lee S, et al. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv: 2306.14289, 2023.
    [41] Xiong Y, Varadarajan B, Wu L, Xiang X, Xiao F, Zhu C, et al. Efficientsam: Leveraged masked image pretraining for efficient segment anything. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA: IEEE, 2024. 16111-16121
    [42] Ke L, Ye M, Danelljan M, Tai Y W, Tang C K, Yu F. Segment anything in high quality. Advances in Neural Information Processing Systems, arXiv: 2306.01567, 2024.
    [43] Xie Z, Guan B, Jiang W, Yi M, Ding Y, Lu H, et al. PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation. arXiv preprint arXiv: 2401.13051, 2024.
    [44] Wang X, Zhang X, Cao Y, Wang W, Shen C, Huang T. Seggpt: Segmenting everything in context. arXiv preprint arXiv: 2304.03284, 2023.
    [45] Ren T, Liu S, Zeng A, Lin J, Li K, Cao H, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv: 2401.14159, 2024.
    [46] Zou X, Yang J, Zhang H, Li F, Li L, Wang J, et al. Segment everything everywhere all at once. In: Proceedings of the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA: NeurIPS. 2023. 19769-19782.
    [47] Gu X, Lin T Y, Kuo W, Cui Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv: 2104.13921, 2021.
    [48] Du Y, Wei F, Zhang Z, Shi M, Gao Y, Li G. Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA: IEEE, 2022. 14084-14093
    [49] Wu X, Zhu F, Zhao R, Li H. Cora: Adapting CLIP for open-vocabulary detection with region prompting and anchor pre-matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 7031-7040
    [50] Ju C, Han T, Zheng K, Zhang Y, Xie W. Prompting visual-language models for efficient video understanding. European Conference on Computer Vision. Tel Aviv, Israel: Springer Nature Switzerland, 2022. 105-124
    [51] Wang M, Xing J, Liu Y. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv: 2109.08472, 2021.
    [52] Mokady R, Hertz A, Bermano A H. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv: 2111.09734, 2021.
    [53] Tewel Y, Shalev Y, Schwartz I, Wolf L. Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE, 2022. 17918-17928
    [54] Su Y, Lan T, Liu Y, Liu F, Yogatama D, Wang Y, et al. Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv: 2205.02655, 2022.
    [55] Wang N, Xie J, Wu J, Jia M, Li L. Controllable image captioning via prompting. In: Proceedings of the AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI Press, 2023. 2617-2625
    [56] Yang J, Li Z, Zheng F, Leonardis A, Song J. Prompting for multi-modal tracking. In: Proceedings of the 30th ACM International Conference on Multimedia. Lisbon, Portugal: Association for Computing Machinery, 2022. 3492-3500
    [57] Zhu J, Lai S, Chen X, Wang D, Lu H. Visual prompt multi-modal tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 9516-9526
    [58] He K, Zhang C, Xie S, Li Z, Wang Z. Target-aware tracking with long-term context attention. In: Proceedings of the AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI Press, 2023. 773-780
    [59] Luo Y, Guo X, Feng H, Ao L. RGB-T Tracking via Multi-Modal Mutual Prompt Learning. arXiv preprint arXiv: 2308.16386, 2023.
    [60] Tsimpoukelli M, Menick J L, Cabi S, Eslami S M, Vinyals O, Hill F. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 2021, 34: 200−12
    [61] Yang Z, Gan Z, Wang J, Hu X, Lu Y, Liu Z, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence. Virtual Event: AAAI Press, 2022. 3081-3089
    [62] Jin W, Cheng Y, Shen Y, Chen W, Ren X. A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. arXiv preprint arXiv: 2110.08484, 2021.
    [63] Wang A J, Zhou P, Shou M Z, Yan S. Enhancing visual grounding in vision-language pre-training with position-guided text prompts. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2023.3343736
    [64] Wu W, Liu T, Wang Y, Xu K, Yin Q, Hu Y. Dynamic multi-modal prompting for efficient visual grounding. In: Proceedings of the 6th Chinese Conference on Pattern Recognition and Computer Vision. Xiamen, China: Springer-Verlag, 2023. 359-371
    [65] Hegde D, Valanarasu J M, Patel V. CLIP goes 3D: leveraging prompt tuning for language grounded 3D recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE, 2023. 2028-2038
    [66] Zhu X, Zhang R, He B, Guo Z, Zeng Z, Qin Z, et al. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE, 2023. 2639-2650
    [67] Bar-Tal O, Ofri-Amar D, Fridman R, Kasten Y, Dekel T. Text2live: Text-driven layered image and video editing. In: Proceedings of the European Conference on Computer Vision. Cham, Switzerland: Springer Nature, 2022. 707-723
    [68] Krizhevsky A. Learning Multiple Layers of Features from Tiny Images [Master's thesis], University of Toronto, Canada, 2009
    [69] Bossard L, Guillaumin M, Van G L. Food-101: Mining discriminative components with random forests. In: Proceedings of the European Conference on Computer Vision. Zurich, Switzerland: Springer International Publishing, 2014. 446-461
    [70] Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A. Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA: IEEE, 2014. 3606-3613
    [71] Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng A Y. Reading digits in natural images with unsupervised feature learning. In: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning. Granada, Spain: NIPS, 2011. 4
    [72] Wah C, Branson S, Welinder P, Perona P, Belongie S. The caltech-ucsd birds-200-2011 dataset, Technical Report CNS-TR-2011-001, California Institute of Technology, USA, 2011.
    [73] Khosla A, Jayadevaprakash N, Yao B, Fei-Fei L. Novel dataset for fine-grained image categorization. In: Proceedings of the First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA: IEEE, 2011
    [74] Fei-Fei L, Fergus R, Perona P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In: Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop. Washington, DC, USA: IEEE, 2004. 178-178
    [75] Parkhi O M, Vedaldi A, Zisserman A, Jawahar C V. Cats and dogs. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI, USA: IEEE, 2012. 3498-3505
    [76] Krause J, Stark M, Deng J, Fei-Fei L. 3d object representations for fine-grained categorization. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. Sydney, Australia: IEEE, 2013. 554-561
    [77] Maji S, Rahtu E, Kannala J, Blaschko M, Vedaldi A. Fine-grained visual classification of aircraft. arXiv preprint arXiv: 1306.5151, 2013.
    [78] Xiao J, Hays J, Ehinger K A, Oliva A, Torralba A. SUN database: Large-scale scene recognition from abbey to zoo. In: Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010. 3485-3492
    [79] Soomro K, Zamir A, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv: 1212.0402, 2012.
    [80] Cheng B, Misra I, Schwing A G, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA: IEEE, 2022. 1290-1299
    [81] Jain J, Li J, Chiu MT, Hassani A, Orlov N, Shi H. OneFormer: One Transformer to Rule Universal Image Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada: IEEE, 2023. 2989-2998
    [82] Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A. Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017. 633-641
    [83] Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer International Publishing, 2014. 740-755
    [84] Xiao Y, Yang M, Li C, Liu L, Tang J. Attribute-based progressive fusion network for RGBT tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. Virtual Event: AAAI Press, 2022. 2831-2838
    [85] Li C, Xue W, Jia Y, Qu Z, Luo B, Tang J, et al. LasHeR: A large-scale high-diversity benchmark for RGBT tracking. arXiv: 2104.13202, 2021.
出版历程
  • 收稿日期:  2024-04-04
  • 录用日期:  2024-08-27
  • 网络出版日期:  2025-03-20
