

Semantic Conceptual Association-Based Method for Referring Multi-Object Tracking

Lin Jia-Cheng, Chen Jia-Jun, Li Zhi-Yong, Wang Yao-Nan

Citation: Lin Jia-Cheng, Chen Jia-Jun, Li Zhi-Yong, Wang Yao-Nan. Semantic conceptual association-based method for referring multi-object tracking. Acta Automatica Sinica, 2025, 51(10): 1000−1014. doi: 10.16383/j.aas.c250118

doi: 10.16383/j.aas.c250118 cstr: 32138.14.j.aas.c250118

Funds: Supported by National Natural Science Foundation of China (U21A20518, U23A20341)
More Information
    Author Bio:

    Lin Jia-Cheng Ph.D. student at Hunan University. He received his Ph.D. degree from Hunan University in 2025. His research interest covers computer vision, scene understanding, and privacy protection. E-mail: jcheng_lin@hnu.edu.cn

    Chen Jia-Jun Master's student at Hunan University. He received his bachelor's degree from Guangdong University of Technology in 2022. His research interest covers computer vision and multi-object tracking. E-mail: chenjiajun@hnu.edu.cn

    Li Zhi-Yong Professor at Hunan University. His research interest covers intelligent perception and autonomous unmanned systems, skill learning and human-machine fusion systems, and machine learning and intelligent decision-making systems. Corresponding author of this paper. E-mail: zhiyong.li@hnu.edu.cn

    Wang Yao-Nan Academician of the Chinese Academy of Engineering and professor at the School of Artificial Intelligence and Robotics, Hunan University. He received his Ph.D. degree from Hunan University in 1995. His research interest covers robotics, intelligent control, and image processing. E-mail: yaonan@hnu.edu.cn

  • Abstract: Referring multi-object tracking (RMOT) is a task that localizes and tracks targets using both linguistic and visual modalities, aiming to accurately identify and continuously track the targets specified by a language prompt across video frames. Although existing RMOT methods have made progress in this area, their modeling of the conceptual granularity of language expressions remains limited, so the semantic parsing of complex language descriptions is often insufficient. To address this, a semantic conceptual association-based referring multi-object tracking method (SCATrack) is proposed, which introduces a sharing semantic concept (SSC) module and a semantic concept generation (SCG) module to deepen the model's understanding of language expressions and thereby improve the continuity and robustness of tracking. Specifically, the SSC module partitions language expressions into semantic concepts, enabling the model to distinguish both different expressions of the same semantics and similar expressions of different semantics, which improves target discrimination under multi-granularity inputs. The SCG module adopts a feature masking and generation mechanism that guides the model to learn representations of multi-granularity language concepts, strengthening its robustness and discriminative ability on complex language descriptions. Experimental results on two widely used benchmark datasets show that the proposed SCATrack significantly improves tracking performance on the RMOT task, validating the method's effectiveness and superiority.
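The abstract describes SCG's feature masking and generation mechanism only at a high level. The sketch below is a minimal, assumption-laden PyTorch illustration of what a mask-and-reconstruct objective over language-concept embeddings can look like; the class name, layer sizes, and loss form are hypothetical and not taken from the paper (compare the masking variants ablated in Table 5 and the loss weight $\gamma_{gen}$ in Table 7).

```python
import torch
import torch.nn as nn


class MaskedConceptGenerator(nn.Module):
    """Minimal sketch (not the authors' code) of a masking-and-generation
    objective: mask part of the concept token embeddings, then train a
    small Transformer to regenerate the masked embeddings."""

    def __init__(self, dim: int = 256, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learnable placeholder that replaces masked concept embeddings,
        # loosely analogous to the fixed "#" masking compared in Table 5.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.generator = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, concepts: torch.Tensor):
        # concepts: (batch, num_concepts, dim) language-concept features.
        b, n, d = concepts.shape
        mask = torch.rand(b, n, device=concepts.device) < self.mask_ratio
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(b, n, d), concepts)
        recon = self.generator(masked)
        # Reconstruction loss on masked positions only (assumes at least
        # one position is masked); at training time it would be scaled by
        # gamma_gen and added to the tracking losses.
        loss_gen = ((recon - concepts) ** 2)[mask].mean()
        return recon, loss_gen
```

A branch of this kind would be used only during training, which is consistent with Table 4 below: SCATrack's training-time parameter count (116.98 M) exceeds the baseline's, while its inference-time parameters, FLOPs, and FPS fall back to the baseline's values.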
  • Fig. 1  Illustration of existing RMOT methods and the proposed SCATrack framework

    Fig. 2  Framework of the proposed semantic concept association-based referring multi-object tracking method, SCATrack

    Fig. 3  Qualitative comparison of the proposed SCATrack with existing RMOT methods on Refer-KITTI

    Fig. 4  Qualitative comparison of the proposed SCATrack with existing RMOT methods on Refer-BDD

    Fig. 5  More qualitative results of the proposed SCATrack on Refer-KITTI

    Fig. 6  More qualitative results of the proposed SCATrack on Refer-BDD

    Fig. 7  Comparison of the last-layer encoder heat maps of SCATrack and existing RMOT methods on Refer-KITTI

    Table 1  Quantitative comparison of the proposed SCATrack with existing RMOT methods on Refer-KITTI [2]

    Method Backbone Detector HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$
    DeepSORT[26] ${}_{\rm{ICIP17}}$ - FairMOT 25.59 19.76 34.31 - - -
    FairMOT[48] ${}_{\rm{IJCV21}}$ DLA-34 CenterNet 23.46 14.84 40.15 0.80 26.18 3376
    ByteTrack[49] ${}_{\rm{ECCV22}}$ - FairMOT 24.95 15.50 43.11 - - -
    CSTrack[50] ${}_{\rm{TIP22}}$ DarkNet-53 YOLOv5 27.91 20.65 39.00 - - -
    TransTrack[27] ${}_{\rm{arXiv20}}$ ResNet-50 Deformable-DETR 32.77 23.31 45.71 - - -
    TrackFormer[14] ${}_{\rm{CVPR22}}$ ResNet-50 Deformable-DETR 33.26 25.44 45.87 - - -
    DeepRMOT[3] ${}_{\rm{ICASSP24}}$ ResNet-50 Deformable-DETR 39.55 30.12 53.23 - - -
    EchoTrack[6] ${}_{\rm{TITS24}}$ ResNet-50 Deformable-DETR 39.47 31.19 51.56 - - -
    TransRMOT[2] ${}_{\rm{CVPR23}}$ ResNet-50 Deformable-DETR 46.56 37.97 57.33 24.68 53.85 3144
    iKUN[5] ${}_{\rm{CVPR24}}$ ResNet-50 Deformable-DETR 48.84 35.74 66.80 12.26 54.05 -
    MLS-Track[29] ${}_{\rm{arXiv24}}$ ResNet-50 Deformable-DETR 49.05 40.03 60.25 - - -
    MGLT${}_{\rm{MOTRv2}}$[31] ${}_{\rm{TIM25}}$ ResNet-50 YOLOX+DAB-D-DETR 47.75 35.11 65.08 8.36 53.39 2948
    MGLT${}_{{\rm{CO}}-{\rm{MOT}}}$[31] ${}_{\rm{TIM25}}$ ResNet-50 Deformable-DETR 49.25 37.09 65.50 21.13 55.91 2442
    SCATrack${}_{\rm{MOTRv2}}$ (ours) ResNet-50 YOLOX+DAB-D-DETR $\underline{49.98}_{+2.23}$ $37.57_{+2.46}$ $\underline{66.68}_{+1.60}$ $13.08_{+4.72}$ $\underline{56.66}_{+3.27}$ $2985_{+37}$
    SCATrack${}_{{\rm{CO}}-{\rm{MOT}}}$ (ours) ResNet-50 Deformable-DETR ${\bf{50.33}}_{+1.08}$ $\underline{38.53}_{+1.46}$ $65.84_{+0.34}$ $\underline{23.86}_{+2.73}$ ${\bf{57.10}}_{+1.19}$ $\underline{2700}_{+258}$

    Table 2  Quantitative comparison of the proposed SCATrack with existing RMOT methods on Refer-BDD [6]

    Method Backbone Detector HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$
    TransRMOT[2] ${}_{\rm{CVPR23}}$ ResNet-50 Deformable-DETR 34.79 26.22 47.56 - - -
    EchoTrack[6] ${}_{\rm{TITS24}}$ ResNet-50 Deformable-DETR 38.00 28.57 51.24 - - -
    MOTRv2[10] ${}_{\rm{CVPR23}}$ ResNet-50 YOLOX+DAB-D-DETR 36.49 23.64 56.88 $-1.05$ 37.38 17670
    CO-MOT[17] ${}_{\rm{arXiv23}}$ ResNet-50 Deformable-DETR 37.32 25.53 55.09 10.57 40.56 14432
    MGLT${}_{\rm{MOTR}}$[31] ${}_{\rm{TIM25}}$ ResNet-50 Deformable-DETR 38.69 27.06 55.76 $\underline{13.97}$ 41.85 13846
    MGLT${}_{\rm{MOTRv2}}$[31] ${}_{\rm{TIM25}}$ ResNet-50 YOLOX+DAB-D-DETR 38.40 26.48 56.23 0.69 41.01 14804
    MGLT${}_{{\rm{CO}}-{\rm{MOT}}}$[31] ${}_{\rm{TIM25}}$ ResNet-50 Deformable-DETR 40.26 28.44 57.58 11.68 44.41 $\underline{12935}$
    SCATrack${}_{\rm{MOTRv2}}$ (ours) ResNet-50 YOLOX+DAB-D-DETR $\underline{40.49}_{+1.89}$ $\underline{28.68}_{+2.20}$ $\underline{57.73}_{+1.50}$ $4.15_{+3.46}$ $\underline{44.65}_{+3.64}$ $13613_{-1191}$
    SCATrack${}_{{\rm{CO}}-{\rm{MOT}}}$ (ours) ResNet-50 Deformable-DETR ${\bf{41.27}}_{+1.01}$ ${\bf{29.11}}_{+0.67}$ ${\bf{59.21}}_{+1.63}$ ${\bf{14.24}}_{+2.56}$ ${\bf{45.46}}_{+1.05}$ ${\bf{12458}}_{-477}$

    Table 3  Performance comparison of different component combinations

    Setting HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$ Acc$\uparrow$
    Baseline 49.25 37.09 65.50 21.13 55.91 2442 -
    w. SSC 49.42 37.43 65.35 20.79 56.19 2574 -
    w. SCG 49.81 38.01 65.37 18.81 56.42 2862 53.78
    SCATrack (ours) ${\bf{50.33}}$ ${\bf{38.53}}$ ${\bf{65.84}}$ ${\bf{23.86}}$ ${\bf{57.10}}$ 2700 54.60

    Table 4  Model efficiency analysis

    Method Stage Params/M FLOPs/G FPS Time
    Baseline Training 82.84 338.24 - 33 h 29 min
    Baseline Inference 82.84 338.17 10.56 2 h 24 min
    SCATrack (ours) Training 116.98 340.05 - 38 h 17 min
    SCATrack (ours) Inference 82.84 338.17 10.56 2 h 24 min

    Table 5  Performance comparison of different masking strategies in the proposed SCG

    Masking HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$ Acc$\uparrow$
    Fixed "#" ${\bf{50.33}}$ ${\bf{38.53}}$ 65.84 23.86 ${\bf{57.10}}$ ${\bf{2700}}$ 54.60
    Random characters 49.47 36.76 ${\bf{66.74}}$ 19.46 55.94 2956 53.78
    Zero filling 49.39 38.38 63.66 ${\bf{25.04}}$ 55.78 2992 54.05

    Table 6  Performance comparison of different learnable word embedding settings in the proposed SCG

    Setting HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$ Acc$\uparrow$
    None 49.35 37.85 64.42 23.31 55.83 2706 53.65
    1 ${\bf{50.33}}$ ${\bf{38.53}}$ 65.84 ${\bf{23.86}}$ ${\bf{57.10}}$ ${\bf{2700}}$ 54.60
    2 49.79 37.64 ${\bf{65.91}}$ 18.67 55.68 2786 54.28

    Table 7  Performance comparison of models with different $\gamma_{gen}$ values

    $\gamma_{gen}$ HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$
    0.02 49.70 37.15 ${\bf{66.66}}$ 20.18 56.28 2959
    0.1 ${\bf{50.33}}$ ${\bf{38.53}}$ 65.84 23.86 ${\bf{57.10}}$ 2700
    0.5 47.49 35.79 63.11 24.54 55.13 2190
    1 46.96 35.12 62.86 ${\bf{24.55}}$ 54.66 ${\bf{2160}}$
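    The weight $\gamma_{gen}$ swept in Table 7 is, on a natural reading (an assumed form, not an equation quoted from the paper), the coefficient of the SCG generation loss in the overall training objective:

    $$ {\cal{L}}_{total} = {\cal{L}}_{track} + \gamma_{gen}\,{\cal{L}}_{gen} $$

    where ${\cal{L}}_{track}$ collects the detection and association losses and ${\cal{L}}_{gen}$ is the reconstruction term of SCG. Table 7 then shows the expected trade-off: too small a weight (0.02) under-exploits the generation signal, while large weights (0.5, 1) reduce ID switches at the cost of HOTA; 0.1 is the adopted setting.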

    Table 8  Comparison of model performance for different ${\cal{J}}$ values

    ${\cal{J}}$ HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$
    1 49.67 37.13 66.60 25.49 56.26 3018
    2 ${\bf{50.33}}$ 38.53 65.84 23.86 ${\bf{57.10}}$ ${\bf{2700}}$
    3 49.81 ${\bf{39.11}}$ 63.54 ${\bf{27.28}}$ 56.52 2922
    4 49.86 37.50 ${\bf{66.92}}$ 19.86 56.65 3022

    Table 9  Comparison of model performance with different ${\cal{N}}$ settings

    ${\cal{N}}$ HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$
    1 47.53 35.55 63.63 ${\bf{24.90}}$ 55.11 ${\bf{2406}}$
    2 48.26 37.51 62.21 23.24 54.25 2840
    3 48.95 37.21 64.50 19.91 55.10 2681
    4 49.16 37.03 65.34 19.86 55.49 2633
    5 ${\bf{50.33}}$ ${\bf{38.53}}$ ${\bf{65.84}}$ 23.86 ${\bf{57.10}}$ 2700
    6 49.96 38.11 65.58 22.32 55.91 2530

    Table 10  Comparison of model performance with different $\delta$ settings

    $\delta$ HOTA$\uparrow$ DetA$\uparrow$ AssA$\uparrow$ MOTA$\uparrow$ IDF1$\uparrow$ IDS$\downarrow$
    0.2 48.54 35.23 ${\bf{66.95}}$ 1.12 53.57 3916
    0.3 49.66 37.03 66.71 12.36 55.54 3412
    0.4 50.27 38.07 66.48 19.29 56.65 3043
    0.5 ${\bf{50.33}}$ ${\bf{38.53}}$ 65.84 23.86 ${\bf{57.10}}$ 2700
    0.6 49.64 38.09 64.78 26.34 56.57 2361
    0.7 47.81 36.26 63.10 ${\bf{26.60}}$ 54.54 ${\bf{2035}}$
  • [1] Zhenyang Li, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. Tracking by natural language specification. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 6495–6503, 2017.
    [2] Dongming Wu, Wencheng Han, Tiancai Wang, Xingping Dong, Xiangyu Zhang, and Jianbing Shen. Referring multi-object tracking. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 14633–14642, 2023.
    [3] Wenyan He, Yajun Jian, Yang Lu, and Hanzi Wang. Visual-linguistic representation learning with deep cross-modality fusion for referring multi-object tracking. In the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 6310–6314. IEEE, 2024.
    [4] Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4984, 2022.
    [5] Yunhao Du, Cheng Lei, Zhicheng Zhao, and Fei Su. iKUN: Speak to trackers without retraining. In the IEEE Conference on Computer Vision and Pattern Recognition, 2024.
    [6] Jiacheng Lin, Jiajun Chen, Kunyu Peng, Xuan He, Zhiyong Li, Rainer Stiefelhagen, and Kailun Yang. EchoTrack: Auditory referring multi-object tracking for autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 2024, 25(11): 18964−18977 doi: 10.1109/TITS.2024.3437645
    [7] Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. MOTR: End-to-end multiple-object tracking with transformer. In the European Conference on Computer Vision, pages 659–675, 2022.
    [8] Zhang Hong-Yun, Chen Hui, Zhang Wen-Xu. Sensor management method based on deep reinforcement learning in extended target tracking. Acta Automatica Sinica, 2024, 50(7): 1417−1431 (in Chinese)
    [9] Ruopeng Gao and Limin Wang. MeMOTR: Long-term memory-augmented transformer for multi-object tracking. In the IEEE International Conference on Computer Vision, pages 9901–9910, 2023.
    [10] Yuang Zhang, Tiancai Wang, and Xiangyu Zhang. MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 22056–22065, 2023.
    [11] An Zhi-Yong, Liang Shun-Kai, Li Bo, Zhao Feng, Dou Quan-Sheng, Xiang Zhong-Liang. Robust visual tracking with a novel segmented fine-grained regularization. Acta Automatica Sinica, 2023, 49(5): 1116−1130 (in Chinese)
    [12] Zhang Peng, Lei Wei-Min, Zhao Xin-Lei, Dong Li-Jia, Lin Zhao-Nan, Jing Qing-Yang. A survey on multi-target multi-camera tracking methods. Chinese Journal of Computers, 2024, 47(2): 287−309 (in Chinese)
    [13] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. TransTrack: Multiple object tracking with transformer. arXiv preprint arXiv: 2012.15460, 2020.
    [14] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. TrackFormer: Multi-object tracking with transformers. In the IEEE Conference on Computer Vision and Pattern Recognition, 2022.
    [15] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv: 2107.08430, 2021.
    [16] Ruopeng Gao, Yijun Zhang, and Limin Wang. Multiple object tracking as ID prediction. In the IEEE Conference on Computer Vision and Pattern Recognition, 2025.
    [17] Feng Yan, Weixin Luo, Yujie Zhong, Yiyang Gan, and Lin Ma. Bridging the gap between end-to-end and non-end-to-end multi-object tracking. arXiv preprint arXiv: 2305.12724, 2023.
    [18] Y. Zhang, X. Wang, X. Ye, W. Zhang, J. Lu, X. Tan, E. Ding, P. Sun, and J. Wang. ByteTrackV2: 2D and 3D multi-object tracking by associating every detection box. arXiv preprint arXiv: 2303.15334, 2023.
    [19] Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirodkar, and Kris Kitani. Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 9686–9696, 2023.
    [20] Mingzhan Yang, Guangxin Han, Bin Yan, Wenhua Zhang, Jinqing Qi, Huchuan Lu, and Dong Wang. Hybrid-SORT: Weak cues matter for online multi-object tracking. In the AAAI Conference on Artificial Intelligence, volume 38, pages 6504–6512, 2024.
    [21] Lu Jin, Ma Ling-Kun, Lv Chun-Ling, Zhang Wei-Chuan, Sun Chang-Ming. A multi-target track-before-detect algorithm based on cost-reference particle filter bank. Acta Automatica Sinica, 2024, 50(4): 851−861 (in Chinese)
    [22] Qi Feng, Vitaly Ablavsky, Qinxun Bai, and Stan Sclaroff. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 5851–5860, 2021.
    [23] Yihao Li, Jun Yu, Zhongpeng Cai, and Yuwen Pan. Cross-modal target retrieval for tracking by natural language. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 4931–4940, 2022.
    [24] Mingzhe Guo, Zhipeng Zhang, Heng Fan, and Liping Jing. Divert more attention to vision-language tracking. Advances in Neural Information Processing Systems, 2022, 35: 4446–4460.
    [25] Ding Ma and Xiangqian Wu. Tracking by natural language specification with long short-term context decoupling. In the IEEE International Conference on Computer Vision, pages 14012–14021, 2023.
    [26] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In the IEEE International Conference on Image Processing, 2017.
    [27] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. TransTrack: Multiple-object tracking with transformer. arXiv preprint arXiv: 2012.15460, 2020.
    [28] Pha Nguyen, Kha Gia Quach, Kris Kitani, and Khoa Luu. Type-to-track: Retrieve any object via prompt-based tracking. Advances in Neural Information Processing Systems, 2023, 36: 3205–3219.
    [29] Zeliang Ma, Song Yang, Zhe Cui, Zhicheng Zhao, Fei Su, Delong Liu, and Jingyu Wang. MLS-Track: Multi-level semantic interaction in RMOT. arXiv preprint arXiv: 2404.12031, 2024.
    [30] Sijia Chen, En Yu, and Wenbing Tao. Cross-view referring multi-object tracking. In the AAAI Conference on Artificial Intelligence, 2025.
    [31] Jiajun Chen, Jiacheng Lin, Guojin Zhong, You Yao, and Zhiyong Li. Multigranularity localization transformer with collaborative understanding for referring multiobject tracking. IEEE Transactions on Instrumentation and Measurement, 2025, 74: 1−13
    [32] Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, and Xiaojuan Qi. CoDet: Co-occurrence guided region-word alignment for open-vocabulary object detection. Advances in Neural Information Processing Systems, 2023, 36: 71078−71094
    [33] Matthew Honnibal and Mark Johnson. An improved non-monotonic transition system for dependency parsing. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1373–1378. Association for Computational Linguistics (ACL), 2015.
    [34] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In the European Conference on Computer Vision, volume 12346, pages 213–229, 2020.
    [35] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, volume 1, pages 4171–4186. Association for Computational Linguistics, 2019.
    [36] Chang Liu, Henghui Ding, Yulun Zhang, and Xudong Jiang. Multi-modal mutual attention and iterative interaction for referring image segmentation. IEEE Transactions on Image Processing, 2023, 32: 3054−3065 doi: 10.1109/TIP.2023.3277791
    [37] Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, and Gao Huang. Mask grounding for referring image segmentation. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 26573–26583, 2024.
    [38] Alec Radford, Jong Wook Kim, et al. Learning transferable visual models from natural language supervision. In the International Conference on Machine Learning, pages 8748–8763, 2021.
    [39] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
    [40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
    [41] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In the European Conference on Computer Vision, pages 740–755, 2014.
    [42] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In the International Conference on Learning Representations, 2021.
    [43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In the International Conference on Learning Representations, 2019.
    [44] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
    [45] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 2021.
    [46] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008: 1–10.
    [47] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In the European Conference on Computer Vision, pages 17–35, 2016.
    [48] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 2021.
    [49] Yifu Zhang et al. ByteTrack: Multi-object tracking by associating every detection box. In the European Conference on Computer Vision, 2022.
    [50] Chao Liang, Zhipeng Zhang, Xue Zhou, Bing Li, Shuyuan Zhu, and Weiming Hu. Rethinking the competition between detection and ReID in multiobject tracking. IEEE Transactions on Image Processing, 2022, 31: 3182−3196 doi: 10.1109/TIP.2022.3165376
Publication History
  • Available online: 2025-09-22
