具身智能自主无人系统技术

孙长银 袁心 王远大 柳文章

引用本文: 孙长银, 袁心, 王远大, 柳文章. 具身智能自主无人系统技术. 自动化学报, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240456
Citation: Sun Chang-Yin, Yuan Xin, Wang Yuan-Da, Liu Wen-Zhang. Embodied intelligence autonomous unmanned systems technology. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240456

具身智能自主无人系统技术

doi: 10.16383/j.aas.c240456 cstr: 32138.14.j.aas.c240456
基金项目: 国家自然科学基金创新研究群体(61921004), 国家自然科学基金重点项目(62236002), 国家自然科学基金(62203113)资助
    作者简介:

    孙长银:安徽大学人工智能学院教授. 1996年获得四川大学应用数学专业学士学位. 分别于2001年, 2004年获得东南大学电子工程专业硕士和博士学位. 主要研究方向为智能控制, 飞行器控制, 模式识别和优化理论. 本文通信作者. E-mail: cysun@seu.edu.cn

    袁心:东南大学自动化学院博士后, 2021年获得东南大学控制科学与工程专业博士学位. 主要研究方向是深度强化学习和最优控制. E-mail: xinyuan@seu.edu.cn

    王远大:东南大学自动化学院博士后, 2020年获得东南大学控制科学与工程专业博士学位. 主要研究方向是深度强化学习和机器人系统控制. E-mail: wangyd@seu.edu.cn

    柳文章:安徽大学人工智能学院讲师. 2022年获得东南大学控制科学与工程博士学位. 主要研究方向包括深度强化学习, 多智能体强化学习, 迁移强化学习, 机器人等. E-mail: wzliu@ahu.edu.cn

Embodied Intelligence Autonomous Unmanned Systems Technology

Funds: Supported by Foundation for Innovative Research Groups of National Natural Science Foundation of China (61921004), Key Projects of National Natural Science Foundation of China (62236002), and National Natural Science Foundation of China (62203113)
    Author Bio:

    SUN Chang-Yin Professor at the School of Artificial Intelligence, Anhui University. He received his bachelor's degree in applied mathematics from Sichuan University in 1996, and his master's and Ph.D. degrees in electrical engineering from Southeast University in 2001 and 2004, respectively. His research interest covers intelligent control, flight control, pattern recognition, and optimization theory. Corresponding author of this paper

    YUAN Xin Postdoctoral researcher at the School of Automation, Southeast University. He received his Ph.D. degree in control science and engineering from Southeast University in 2021. His research interest covers deep reinforcement learning and optimal control

    WANG Yuan-Da Postdoctoral researcher at the School of Automation, Southeast University. He received his Ph.D. degree in control science and engineering from Southeast University in 2020. His research interest covers deep reinforcement learning and robotic system control

    LIU Wen-Zhang Lecturer at the School of Artificial Intelligence, Anhui University. He received his Ph.D. degree in control science and engineering from Southeast University in 2022. His research interest covers deep reinforcement learning, multi-agent reinforcement learning, transfer reinforcement learning, and robotics

  • 摘要: 自主无人系统是一类具有自主感知和决策能力的智能系统, 在国防安全、航空航天、高性能机器人等方面有着广泛的应用. 近年来, 基于Transformer架构的各类大模型快速革新, 极大地推动了自主无人系统的发展. 目前, 自主无人系统正迎来一场以“具身智能”为核心的新一代技术革命. 大模型需要借助无人系统的物理实体来实现“具身化”, 无人系统可以利用大模型技术来实现“智能化”. 本文阐述了具身智能自主无人系统的发展现状, 详细探讨了包含大模型驱动的多模态感知、面向具身任务的推理与决策、基于动态交互的机器人学习与控制、三维场景具身模拟器等具身智能领域的关键技术. 最后, 指出了目前具身智能无人系统所面临的挑战, 并展望了未来的研究方向.

    Abstract: Autonomous unmanned systems are intelligent systems with autonomous perception and decision-making capabilities, and are widely applied in national defense and security, aerospace, and high-performance robotics. In recent years, the rapid evolution of large models based on the Transformer architecture has greatly advanced autonomous unmanned systems, which are now undergoing a new round of technological revolution centered on "embodied intelligence": large models rely on the physical bodies of unmanned systems to become "embodied", while unmanned systems exploit large-model techniques to become "intelligent". This paper reviews the development status of embodied intelligent autonomous unmanned systems and discusses in detail the key technologies of embodied intelligence, including large-model-driven multimodal perception, reasoning and decision-making for embodied tasks, robot learning and control based on dynamic interaction, and embodied simulators for three-dimensional scenes. Finally, the challenges faced by embodied intelligent unmanned systems are pointed out and future research directions are discussed.
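
为直观说明摘要中“大模型借助无人系统实现‘具身化’、无人系统借助大模型实现‘智能化’”的耦合方式, 下面给出一个极简的感知−推理−执行闭环示意代码(仅为说明性假设: capture_observations、llm_plan、execute_skill 等均为虚构的占位接口, 不对应文中任何模型或真实系统的实现).

```python
# 极简的具身智能体闭环示意(假设性占位接口, 非真实系统实现)
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    image_caption: str           # 视觉模态经多模态大模型转写后的文本描述
    proprioception: List[float]  # 本体状态(关节角、位姿等)


def capture_observations() -> Observation:
    """占位: 从无人系统的传感器读取多模态观测."""
    return Observation(image_caption="桌上有一个苹果和一个杯子", proprioception=[0.0, 0.0])


def llm_plan(instruction: str, obs: Observation, history: List[str]) -> str:
    """占位: 调用大模型, 基于指令、当前观测与历史推理出下一步技能."""
    # 真实系统中此处会构造提示词并调用多模态大模型
    return "pick(苹果)" if "苹果" in obs.image_caption else "explore()"


def execute_skill(skill: str) -> bool:
    """占位: 由底层控制器执行技能, 返回是否执行成功."""
    print(f"执行技能: {skill}")
    return True


def embodied_loop(instruction: str, max_steps: int = 10) -> None:
    """感知-推理-执行闭环: 大模型提供“智能化”, 物理实体提供“具身化”."""
    history: List[str] = []
    for _ in range(max_steps):
        obs = capture_observations()                 # 大模型驱动的多模态感知
        skill = llm_plan(instruction, obs, history)  # 面向具身任务的推理与决策
        done = execute_skill(skill)                  # 机器人学习与控制接口
        history.append(skill)
        if done and skill.startswith("pick"):
            break


if __name__ == "__main__":
    embodied_loop("把苹果递给我")
```

实际系统中, llm_plan 对应摘要所述“大模型驱动的多模态感知”与“面向具身任务的推理与决策”模块, execute_skill 则对应“基于动态交互的机器人学习与控制”模块.
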
  • 图  1  自主无人系统体系架构发展趋势

    Fig.  1  Architecture development trend of autonomous unmanned systems

    图  2  PaLM-E完成长程任务

    Fig.  2  PaLM-E completing long-horizon tasks

    图  3  具身智能无人系统关键技术结构示意图

    Fig.  3  Schematic diagram of the key technologies of embodied intelligent unmanned systems

    图  4  各类人形机器人

    Fig.  4  Various humanoid robots

    图  5  具身智能自主无人系统框架示意图及典型应用

    Fig.  5  Framework diagram and typical application of embodied intelligent autonomous unmanned system

    图  6  具身智能未来研究方向

    Fig.  6  Future research direction of embodied intelligence

    表  1  具身智能模型架构

    Table  1  Embodied intelligence model architecture

    名称 | 模型参数 | 响应频率 | 模型架构说明
    SayCan[75] | — | — | SayCan利用价值函数表示各个技能的可行性, 并由语言模型进行技能评分, 能够兼顾任务需求和机器人技能的可行性(技能选择方式的示意代码见表后)
    RT-1[31] | 350万 | 3 Hz | RT-1采用13万条机器人演示数据的数据集完成模仿学习训练, 能以97%的成功率执行超过700个语言指令任务
    RoboCat[108] | 12亿 | 10 ~ 20 Hz | RoboCat构建了基于目标图像的可迁移机器人操纵框架, 能够实现多个操纵任务的零样本迁移
    PaLM-E[32] | 5620亿 | 5 ~ 6 Hz | PaLM-E构建了当时最大的具身多模态大模型, 将机器人传感器模态融入语言模型, 建立了端到端的训练框架
    RT-2[33] | 550亿 | 1 ~ 3 Hz | RT-2首次构建了视觉-语言-动作模型, 在多个具身任务上实现了多阶段的语义推理
    VoxPoser[52] | — | — | VoxPoser利用语言模型生成关于当前环境的价值地图, 并基于价值地图进行动作轨迹规划, 实现了高自由度的环境交互
    RT-2-X[105] | 550亿 | 1 ~ 3 Hz | RT-2-X构建了提供标准化数据格式、交互环境和模型的数据集, 包含527种技能和16万个任务
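
为说明表1中SayCan一行所述“由语言模型进行技能评分、由价值函数表示技能可行性”的技能选择方式, 下面给出一个极简示意(基于对该思路的一般性理解所作的假设性简化, llm_affinity、value_feasibility 及其中的打分数值均为虚构占位, 并非SayCan原文实现).

```python
# SayCan式技能选择的极简示意: 综合得分 = 语言相关性 × 执行可行性(虚构占位实现)
from typing import Dict, List


def llm_affinity(instruction: str, skill: str) -> float:
    """占位: 语言模型对“执行该技能有助于完成指令”的评分(对应SayCan中的“Say”)."""
    scores = {"找到苹果": 0.5, "拿起苹果": 0.9, "拿起海绵": 0.1}
    return scores.get(skill, 0.05)


def value_feasibility(skill: str, state: Dict) -> float:
    """占位: 价值函数估计当前状态下该技能成功执行的概率(对应SayCan中的“Can”)."""
    feasibility = {"找到苹果": 0.9, "拿起苹果": 0.4, "拿起海绵": 0.8}
    return feasibility.get(skill, 0.0)


def select_skill(instruction: str, skills: List[str], state: Dict) -> str:
    """综合得分取最大者, 兼顾任务需求(语言相关性)与机器人技能可行性."""
    combined = {s: llm_affinity(instruction, s) * value_feasibility(s, state) for s in skills}
    return max(combined, key=combined.get)


if __name__ == "__main__":
    candidate_skills = ["找到苹果", "拿起苹果", "拿起海绵"]
    best = select_skill("把苹果递给我", candidate_skills, state={})
    print("选择的技能:", best)
```

示例中“拿起苹果”在语言层面与指令最相关, 但其当前可行性较低, 综合得分被“找到苹果”超过, 这正体现了表中“兼顾任务需求和机器人技能的可行性”的含义.
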
  • [1] Gupta A, Savarese S, Ganguli S, Li F F. Embodied intelligence via learning and evolution. Nature communications, 2021, 12(1): 5721 doi: 10.1038/s41467-021-25874-z
    [2] 孙长银, 穆朝絮, 柳文章, 等. 自主无人系统的具身认知智能框架. 科技导报, 2024, 42(12): 157−166
    [3] Wiener N. Cybernetics or Control and Communication in the Animal and the Machine. Cambridge: MIT Press, 1961
    [4] Turing A M. Computing machinery and intelligence. Mind, 1950, 59(236): 433−460
    [5] 王耀南, 安果维, 王传成, 莫洋, 缪志强, 曾凯. 智能无人系统技术应用与发展趋势. 中国舰船研究, 2022, 17(5): 9−26
    [6] Kaufmann E, Bauersfeld L, Loquercio A, Müller M, Koltun V, Scaramuzza D. Champion-level drone racing using deep reinforcement learning. Nature, 2023, 620: 982−987 doi: 10.1038/s41586-023-06419-4
    [7] Feng S, Sun H, Yan X, Zhu H, Zou Z, Shen S, Liu H X. Dense reinforcement learning for safety validation of autonomous vehicles. Nature, 2023, 615: 620−627 doi: 10.1038/s41586-023-05732-2
    [8] 张鹏飞, 程文铮, 米江勇, 和烨龙, 李亚文, 王力金. 反无人机蜂群关键技术研究现状及展望. 火炮发射与控制学报, 2024: 1−7
    [9] 张琳. 美军反无人机系统技术新解. 坦克装甲车辆, 2024(11): 22−29
    [10] 董昭荣, 赵民, 姜利, 王智. 异构无人系统集群自主协同关键技术综述. 遥测遥控, 2024, 45(4): 1−11
    [11] 江碧涛, 温广辉, 周佳玲, 郑德智. 智能无人集群系统跨域协同技术研究现状与展望. 中国工程科学, 2024, 26(1): 117−126
    [12] Firoozi R, Tucker J, Tian S, Majumdar A, Sun J K, Liu W Y, et al. Foundation models in robotics: Applications, challenges, and the future. arXiv: 2312.07843, 2023
    [13] 兰沣卜, 赵文博, 朱凯, 张涛. 基于具身智能的移动操作机器人系统发展研究. 中国工程科学, 2024, 26(1): 139−148
    [14] 刘华平, 郭迪, 孙富春, 张新钰. 基于形态的具身智能研究: 历史回顾与前沿进展. 自动化学报, 2023, 49(6): 1131−1154
    [15] 张钹, 朱军, 苏航. 迈向第三代人工智能. 中国科学: 信息科学, 2020, 50(9): 1281−1302
    [16] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018
    [17] Devlin J, Chang M W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional Transformers for language understanding. arXiv: 1810.04805, 2018
    [18] Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI blog, 2019, 1(8): 9
    [19] Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, et al. Language models are few-shot learners. arXiv: 2005.14165, 2020
    [20] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. Advances in neural information processing systems, 2017, 30
    [21] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv: 2010.11929, 2020
    [22] He K, Chen X, Xie S, Li Y, Dollar P, Girshick R, et al. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022
    [23] Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin Transformer: Hierarchical vision Transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision, 2021
    [24] Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M A, Lacroix T, et al. Llama: Open and efficient foundation language models. arXiv: 2302.13971, 2023
    [25] Kim W, Son B, Kim I. Vilt: Vision-and-language Transformer without convolution or region supervision. International conference on machine learning, 2021
    [26] Li J, Li D, Xiong C, Hoi S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. International conference on machine learning, 2022
    [27] Yu J, Wang Z, Vasudevan V, Yeung L, Seyedhosseini M, Wu Y. Coca: Contrastive captioners are image-text foundation models. arXiv: 2205.01917, 2022
    [28] Bao H, Wang W, Dong L, Wei F. Vl-beit: Generative vision-language pretraining. arXiv: 2206.01127, 2022
    [29] Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. International conference on machine learning, 2021: 8748−8763
    [30] Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 2022, 35: 27730−27744
    [31] Brohan A, Brown N, Carbajal J, Chebotar Y, Dabis J, Finn C, et al. Rt-1: Robotics Transformer for real-world control at scale. arXiv: 2212.06817, 2022
    [32] Driess D, Xia F, Sajjadi M S, Lynch C, Chowdhery A, Ichter B, et al. Palm-e: An embodied multimodal language model. arXiv: 2303.03378, 2023
    [33] Brohan A, Brown N, Carbajal J, Chebotar Y, Chen X, Choromanski K, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv: 2307.15818, 2023
    [34] Zeng F, Gan W, Wang Y, Liu N, Yu P S. Large language models for robotics: A survey. arXiv: 2311.07226, 2023
    [35] Bommasani R, Hudson D A, Adeli E, Altman E, Arora S, Arx S, et al. On the opportunities and risks of foundation models. arXiv: 2108.07258, 2021
    [36] Wang W, Bao H, Dong L, Bjorck J, Peng Z, Liu Q. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv: 2208.10442, 2022
    [37] Bao H, Wang W, Dong L, Liu Q, Mohammed O K, Aggarwal K, et al. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 2022, 35: 32897−32912
    [38] Chen F L, Zhang D Z, Han M L, Chen X Y, Shi J, Xu S, et al. Vlp: A survey on vision-language pre-training. Machine Intelligence Research, 2023, 20(1): 38−56 doi: 10.1007/s11633-022-1369-5
    [39] Peng F, Yang X, Xiao L, Wang Y, Xu C. Sgva-clip: Semantic-guided visual adapting of vision-language models for few-shot image classification. IEEE Transactions on Multimedia, 2024, 26: 3469−3480 doi: 10.1109/TMM.2023.3311646
    [40] Li L H, Zhang P, Zhang H, Yang J, Li C, Zhong Y, et al. Grounded language-image pre-training. Conference on Computer Vision and Pattern Recognition, 2022: 10965−10975
    [41] Liu S, Zeng Z, Ren T, Li F, Zhang H, Yang J, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv: 2303.05499, 2023
    [42] Minderer M, Gritsenko A, Stone A, et al. Simple open-vocabulary object detection. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 728−755
    [43] Xu J, De S, Liu S, Byeon W, Breuel T, Kautz J, Wang X. Groupvit: Semantic segmentation emerges from text supervision. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 18134−18144
    [44] Li B, Weinberger K Q, Belongie S, Koltun V, Ranftl R. Language-driven semantic segmentation. arXiv: 2201.03546, 2022
    [45] Ghiasi G, Gu X, Cui Y, Lin T Y. Scaling open-vocabulary image segmentation with image-level labels. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022
    [46] Zhou C, Loy C C, Dai B. Extract free dense labels from clip. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022
    [47] Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
    [48] Shah D, Osinski B, Levine S. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. Conference on robot learning, 2023
    [49] Gadre S Y, Wortsman M, Ilharco G, Schmidt L, Song S. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
    [50] Majumdar A, Aggarwal G, Devnani B, Hoffman J, Batra D. Zson: Zero-shot object-goal navigation using multimodal goal embeddings. Advances in Neural Information Processing Systems, 2022, 35: 32340−32352
    [51] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with Transformers. European conference on computer vision. Cham: Springer International Publishing, 2020
    [52] Huang W L, Wang C, Zhang R, Li Y, Wu K, Li F F. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv: 2307.05973, 2023
    [53] Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 2015
    [54] Jiang P, Ergu D, Liu F, Cai Y, Ma B. A Review of Yolo algorithm developments. Procedia computer science, 2022, 199: 1066−1073 doi: 10.1016/j.procs.2022.01.135
    [55] Cheng H K, Alexander G S. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022
    [56] Li J, Selvaraju R, Gotmare A, Joty S, Xiong C, Hoi S C H. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 2021, 34: 9694−9705
    [57] Zhu X, Zhang R, He B, Guo Z, Zeng Z, Qin Z, et al. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
    [58] Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, Xiao J. 3d shapenets: A deep representation for volumetric shapes. Proceedings of the IEEE conference on computer vision and pattern recognition, 2015
    [59] Muzahid A A M, Wan W, Sohel F, Wu L, Hou L. CurveNet: Curvature-based multitask learning deep networks for 3D object recognition. IEEE/CAA Journal of Automatica Sinica, 2020, 8(6): 1177−1187
    [60] Xue L, Gao M, Xing C, Martín-Martín R, Wu J, Xiong C, et al. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
    [61] Qi C R, Yi L, Su H, Guibas L J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 2017
    [62] Ma X, Qin C, You H, Ran H, Fu Y. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv: 2202.07123, 2022
    [63] Mildenhall B, Srinivasan P P, Tancik M, Barron J T, Ramamoorthi R, Ng R. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021, 65(1): 99−106
    [64] Kerr J, Kim C M, Goldberg K, Kanazawa A, Tancik M. Lerf: Language embedded radiance fields. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
    [65] Shen W, Yang G, Yu A, Wong J, Kaelbling L P, Isola P. Distilled feature fields enable few-shot language-guided manipulation. arXiv: 2308.07931, 2023
    [66] Gadre S Y, Ehsani K, Song S, Mottaghi R. Continuous scene representations for embodied ai. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
    [67] Shafiullah N N M, Paxton C, Pinto L, Chintala S, Szlam A. CLIP-Fields: Weakly supervised semantic fields for robotic memory. arXiv: 2210.05663, 2022
    [68] Huang C, Mees O, Zeng A, Burgard W. Visual language maps for robot navigation. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023: 10608-10615
    [69] Ha H, Song S. Semantic abstraction: Open-world 3d scene understanding from 2d vision-language models. arXiv: 2207.11514, 2022
    [70] Gan Z, Li L, Li C, Wang L, Liu Z, Gao J. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends in Computer Graphics and Vision, 2022, 14(3): 163−352
    [71] Zeng A, Attarian M, Ichter B, Choromanski K, Wong A, Welker S, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv: 2204.00598, 2022
    [72] Li B Z, Nye M, Andreas J. Implicit representations of meaning in neural language models. arXiv: 2106.00737, 2021
    [73] Huang W L, Abbeel P, Pathak D, Mordatch I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. International Conference on Machine Learning. PMLR, 2022
    [74] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. Roberta: A robustly optimized bert pretraining approach. arXiv: 1907.11692, 2019
    [75] Brohan A, Chebotar Y, Finn C, Hausman K, Herzog A, Ho D, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv: 2204.01691, 2022
    [76] Vemprala S H, Bonatti R, Bucker A, Kapoor A. Chatgpt for robotics: Design principles and model abilities. arXiv: 2306.17582, 2023
    [77] Liang J, Huang W, Xia F, Xu P, Hausman K, Ichter B, et al. Code as policies: Language model programs for embodied control. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023
    [78] Huang W, Xia F, Xiao T, Chan H, Liang J, Florence P, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv: 2207.05608, 2022
    [79] Du Y, Yang M, Florence P, Xia F, Wahid A, Ichter B, et al. Video language planning. arXiv: 2310.10625, 2023
    [80] Hao S, Gu Y, Ma H, Hong J J, Wang Z, Wang D Z, Hu Z. Reasoning with language model is planning with world model. arXiv: 2305.14992, 2023
    [81] Sun Y, Zhang K, Sun C. Model-based transfer reinforcement learning based on graphical model representations. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(2): 1035−1048 doi: 10.1109/TNNLS.2021.3107375
    [82] Zha L, Cui Y, Lin L H, Kwon M, Arenas M G, Zeng A, et al. Distilling and retrieving generalizable knowledge for robot manipulation via language corrections. arXiv: 2311.10678, 2023
    [83] Liang J, Xia F, Yu W, Zeng A, Arenas M G, Attarian M, et al. Learning to Learn Faster from Human Feedback with Language Model Predictive Control. arXiv: 2402.11450, 2024
    [84] Lynch C, Sermanet P. Language conditioned imitation learning over unstructured data. arXiv: 2005.07648, 2020
    [85] Hassanin M, Khan S, Tahtali M. Visual affordance and function understanding: A survey. ACM Computing Surveys, 2021, 54(3): 1−35
    [86] Luo H, Zhai W, Zhang J, Cao J, Tao D. Learning visual affordance grounding from demonstration videos. IEEE Transactions on Neural Networks and Learning Systems, 2023
    [87] Mo K, Guibas L J, Mukadam M, Gupta A, Tulsiani S. Where2act: From pixels to actions for articulated 3d objects. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 6813−6823
    [88] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention–MICCAI, 2015: 234−241
    [89] Mo K, Qin Y, Xiang F, Su H, Guibas L. O2O-Afford: Annotation-free large-scale object-object affordance learning. Conference on robot learning, 2022: 1666−1677
    [90] Geng Y, An B, Geng H, Chen Y, Yang Y, Dong H. Rlafford: End-to-end affordance learning for robotic manipulation. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023
    [91] Makoviychuk V, Wawrzyniak L, Guo Y, Lu M, Storey K, Macklin M, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv: 2108.10470, 2021
    [92] Savva M, Kadian A, Maksymets O, Zhao Y, Wijmans E, Jain B, et al. Habitat: A platform for embodied ai research. Proceedings of the IEEE/CVF international conference on computer vision, 2019: 9339−9347
    [93] Kolve E, Mottaghi R, Han W, VanderBilt E, Weihs L, Herrasti A, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv: 1712.05474, 2017
    [94] Gan C, Schwartz J, Alter S, Mrowca D, Schrimpf M, Traer J, et al. Threedworld: A platform for interactive multi-modal physical simulation. arXiv: 2007.04954, 2020
    [95] Xia F, Shen W B, Li C, Kasimbeg P, Tchapmi M E, Toshev A, et al. Interactive gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters, 2020, 5(2): 713−720 doi: 10.1109/LRA.2020.2965078
    [96] Deitke M, VanderBilt E, Herrasti A, Weihs L, Ehsani K, Salvador J, et al. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. Advances in Neural Information Processing Systems, 2022, 35: 5982−5994
    [97] Anderson P, Wu Q, Teney D, Bruce J, Johnson M, Sünderhauf N, et al. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018: 3674−3683
    [98] Wu Q, Wu C J, Zhu Y, Joo J. Communicative learning with natural gestures for embodied navigation agents with human-in-the-scene. 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2021: 4095−4102
    [99] Duan J, Yu S, Tan H L, Zhu H, Tan C. A survey of embodied ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2022, 6(2): 230−244 doi: 10.1109/TETCI.2022.3141105
    [100] Anderson P, Chang A, Chaplot D S, Dosovitskiy A, Gupta S, Koltun V, et al. On evaluation of embodied navigation agents. arXiv: 1807.06757, 2018
    [101] Paul S, Roy-Chowdhury A K, Cherian A. Avlen: Audio-visual-language embodied navigation in 3d environments. Advances in Neural Information Processing Systems, 2022: 6236−6249
    [102] Tan S, Xiang W, Liu H, Guo D, Sun F. Multi-agent embodied question answering in interactive environments. European Conference on Computer Vision, 2020
    [103] Huang C G, Mees O, Zeng A, Burgard W. Visual language maps for robot navigation. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023
    [104] Zhou G, Hong Y C, Wu Q. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(7): 7641−7649 doi: 10.1609/aaai.v38i7.28597
    [105] Padalkar A, Pooley A, Jain A, Bewley A, Herzog A, Irpan A, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv: 2310.08864, 2023
    [106] Shah D, Eysenbach B, Kahn G, Rhinehart N, Levine S. Ving: Learning open-world navigation with visual goals. 2021 IEEE International Conference on Robotics and Automation. IEEE, 2021
    [107] Wen G, Zheng W X, Wan Y. Distributed robust optimization for networked agent systems with unknown nonlinearities. IEEE Transactions on Automatic Control, 2022, 68(9): 5230−5244
    [108] Bousmalis K, Vezzani G, Rao D, Devin C, Lee A, Bauza M, et al. Robocat: A self-improving foundation agent for robotic manipulation. arXiv: 2306.11706, 2023
出版历程
  • 录用日期:  2024-09-27
  • 网络出版日期:  2024-10-23
