Abstract: Autonomous unmanned systems are intelligent systems with autonomous perception and decision-making capabilities, widely applied in areas such as defense and security, aerospace, and high-performance robotics. In recent years, the rapid advancement of large models based on the Transformer architecture has greatly accelerated the development of autonomous unmanned systems, which are now undergoing a new technological revolution centered on "embodied intelligence": large models require the physical embodiment provided by unmanned systems to become "embodied", while unmanned systems can leverage large-model technologies to become "intelligent". This paper reviews the current state of embodied intelligent autonomous unmanned systems and discusses in detail the key technologies of embodied intelligence, including large-model-driven multimodal perception, reasoning and decision-making for embodied tasks, robot learning and control based on dynamic interaction, and 3D-scene embodied simulators. Finally, the paper identifies the challenges facing embodied intelligent unmanned systems and outlines future research directions.
Table 1 Embodied intelligence model architectures
Name | Parameters | Response frequency | Architecture description
SayCan[75] | — | — | Represents the feasibility of each skill with a value function and scores skills with a language model, balancing task requirements against the feasibility of the robot's skills
RT-1[31] | 3.5 million | 3 Hz | Trained by imitation learning on a dataset of 130,000 robot demonstrations; executes more than 700 instruction tasks with a 97% success rate
RoboCat[108] | 1.2 billion | 10 ~ 20 Hz | Builds a transferable robot manipulation framework conditioned on goal images, enabling zero-shot transfer across multiple manipulation tasks
PaLM-E[32] | 562 billion | 5 ~ 6 Hz | The largest embodied multimodal large model at the time; fuses robot sensor modalities into a language model and establishes an end-to-end training framework
RT-2[33] | 55 billion | 1 ~ 3 Hz | The first vision-language-action model; achieves multi-stage semantic reasoning on a range of embodied tasks
VoxPoser[52] | — | — | Uses a language model to generate value maps of the current environment and plans motion trajectories over them, enabling high-degree-of-freedom interaction with the environment
RT-2-X[105] | 55 billion | 1 ~ 3 Hz | Constructs a dataset that provides standardized data formats, interaction environments, and models, covering 527 skills and 160,000 tasks
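As a concrete illustration of the SayCan-style scoring described in Table 1, the short Python sketch below combines a language-model "usefulness" score with a value-function feasibility estimate and selects the highest-scoring skill. The names llm_score and affordance are hypothetical stand-ins (simple stubs), not SayCan's actual models; this is a minimal sketch of the scoring scheme only, under those assumptions.

# Minimal sketch: SayCan-style skill selection (illustrative stubs only).
from typing import Callable, List

def select_skill(
    instruction: str,
    skills: List[str],
    llm_score: Callable[[str, str], float],  # language-model estimate that the skill helps the instruction
    affordance: Callable[[str], float],      # value-function estimate that the skill can succeed in the current state
) -> str:
    # Combined score = semantic usefulness x physical feasibility; pick the best skill.
    scores = {skill: llm_score(instruction, skill) * affordance(skill) for skill in skills}
    return max(scores, key=scores.get)

# Hypothetical stand-in scorers so the sketch runs end to end.
def llm_score(instruction: str, skill: str) -> float:
    return 0.9 if "sponge" in instruction and "sponge" in skill else 0.1

def affordance(skill: str) -> float:
    return 0.8 if skill.startswith("pick up") else 0.3

if __name__ == "__main__":
    skills = ["pick up the sponge", "go to the drawer", "put down the can"]
    print(select_skill("bring me the sponge", skills, llm_score, affordance))  # -> "pick up the sponge"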
[1] Gupta A, Savarese S, Ganguli S, Li F F. Embodied intelligence via learning and evolution. Nature Communications, 2021, 12(1): 5721 doi: 10.1038/s41467-021-25874-z
[2] Sun C Y, Mu C X, Liu W Z, et al. Embodied cognitive intelligence framework for autonomous unmanned systems. Science & Technology Review, 2024, 42(12): 157−166 (in Chinese)
[3] Wiener N. Cybernetics or Control and Communication in the Animal and the Machine. Cambridge: MIT Press, 1961
[4] Turing A M. Computing machinery and intelligence. Mind, 1950, 59(236): 433−460
[5] Wang Y N, An G W, Wang C C, Mo Y, Miao Z Q, Zeng K. Application and development trends of intelligent unmanned system technologies. Chinese Journal of Ship Research, 2022, 17(5): 9−26 (in Chinese)
[6] Kaufmann E, Bauersfeld L, Loquercio A, Müller M, Koltun V, Scaramuzza D. Champion-level drone racing using deep reinforcement learning. Nature, 2023, 620: 982−987 doi: 10.1038/s41586-023-06419-4
[7] Feng S, Sun H, Yan X, Zhu H, Zou Z, Shen S, Liu H X. Dense reinforcement learning for safety validation of autonomous vehicles. Nature, 2023, 615: 620−627 doi: 10.1038/s41586-023-05732-2
[8] Zhang P F, Cheng W Z, Mi J Y, He Y L, Li Y W, Wang L J. Research status and prospects of key technologies for countering UAV swarms. Journal of Gun Launch & Control, 2024: 1−7 (in Chinese)
[9] Zhang L. A new look at US military counter-UAV system technologies. Tank & Armoured Vehicle, 2024(11): 22−29 (in Chinese)
[10] Dong Z R, Zhao M, Jiang L, Wang Z. Survey of key technologies for autonomous cooperation of heterogeneous unmanned system swarms. Journal of Telemetry, Tracking and Command, 2024, 45(4): 1−11 (in Chinese)
[11] Jiang B T, Wen G H, Zhou J L, Zheng D Z. Cross-domain cooperation technologies for intelligent unmanned swarm systems: Research status and prospects. Strategic Study of CAE, 2024, 26(1): 117−126 (in Chinese)
[12] Firoozi R, Tucker J, Tian S, Majumdar A, Sun J K, Liu W Y, et al. Foundation models in robotics: Applications, challenges, and the future. arXiv: 2312.07843, 2023
[13] Lan F B, Zhao W B, Zhu K, Zhang T. Development of mobile manipulation robot systems based on embodied intelligence. Strategic Study of CAE, 2024, 26(1): 139−148 (in Chinese)
[14] Liu H P, Guo D, Sun F C, Zhang X Y. Research on morphology-based embodied intelligence: A historical retrospect and frontier progress. Acta Automatica Sinica, 2023, 49(6): 1131−1154 (in Chinese)
[15] Zhang B, Zhu J, Su H. Toward the third generation of artificial intelligence. Scientia Sinica Informationis, 2020, 50(9): 1281−1302 (in Chinese)
[16] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018
[17] Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv: 1810.04805, 2018
[18] Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog, 2019, 1(8): 9
[19] Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. arXiv: 2005.14165, 2020
[20] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30
[21] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv: 2010.11929, 2020
[22] He K, Chen X, Xie S, Li Y, Dollar P, Girshick R, et al. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
[23] Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z. Swin Transformer: Hierarchical vision Transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021
[24] Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M A, Lacroix T, et al. LLaMA: Open and efficient foundation language models. arXiv: 2302.13971, 2023
[25] Kim W, Son B, Kim I. ViLT: Vision-and-language Transformer without convolution or region supervision. International Conference on Machine Learning, 2021
[26] Li J, Li D, Xiong C, Hoi S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. International Conference on Machine Learning, 2022
[27] Yu J, Wang Z, Vasudevan V, Yeung L, Seyedhosseini M, Wu Y. CoCa: Contrastive captioners are image-text foundation models. arXiv: 2205.01917, 2022
[28] Bao H, Wang W, Dong L, Wei F. VL-BEiT: Generative vision-language pretraining. arXiv: 2206.01127, 2022
[29] Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021: 8748−8763
[30] Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022, 35: 27730−27744
[31] Brohan A, Brown N, Carbajal J, Chebotar Y, Dabis J, Finn C, et al. RT-1: Robotics Transformer for real-world control at scale. arXiv: 2212.06817, 2022
[32] Driess D, Xia F, Sajjadi M S, Lynch C, Chowdhery A, Ichter B, et al. PaLM-E: An embodied multimodal language model. arXiv: 2303.03378, 2023
[33] Brohan A, Brown N, Carbajal J, Chebotar Y, Chen X, Choromanski K, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv: 2307.15818, 2023
[34] Zeng F, Gan W, Wang Y, Liu N, Yu P S. Large language models for robotics: A survey. arXiv: 2311.07226, 2023
[35] Bommasani R, Hudson D A, Adeli E, Altman E, Arora S, Arx S, et al. On the opportunities and risks of foundation models. arXiv: 2108.07258, 2021
[36] Wang W, Bao H, Dong L, Bjorck J, Peng Z, Liu Q. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv: 2208.10442, 2022
[37] Bao H, Wang W, Dong L, Liu Q, Mohammed O K, Aggarwal K, et al. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 2022, 35: 32897−32912
[38] Chen F L, Zhang D Z, Han M L, Chen X Y, Shi J, Xu S, et al. VLP: A survey on vision-language pre-training. Machine Intelligence Research, 2023, 20(1): 38−56 doi: 10.1007/s11633-022-1369-5
[39] Peng F, Yang X, Xiao L, Wang Y, Xu C. SgVA-CLIP: Semantic-guided visual adapting of vision-language models for few-shot image classification. IEEE Transactions on Multimedia, 2024, 26: 3469−3480 doi: 10.1109/TMM.2023.3311646
[40] Li L H, Zhang P, Zhang H, Yang J, Li C, Zhong Y, et al. Grounded language-image pre-training. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 10965−10975
[41] Liu S, Zeng Z, Ren T, Li F, Zhang H, Yang J, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv: 2303.05499, 2023
[42] Minderer M, Gritsenko A, Stone A, et al. Simple open-vocabulary object detection. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 728−755
[43] Xu J, De S, Liu S, Byeon W, Breuel T, Kautz J, Wang X. GroupViT: Semantic segmentation emerges from text supervision. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 18134−18144
[44] Li B, Weinberger K Q, Belongie S, Koltun V, Ranftl R. Language-driven semantic segmentation. arXiv: 2201.03546, 2022
[45] Ghiasi G, Gu X, Cui Y, Lin T Y. Scaling open-vocabulary image segmentation with image-level labels. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022
[46] Zhou C, Loy C C, Dai B. Extract free dense labels from CLIP. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022
[47] Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
[48] Shah D, Osinski B, Levine S. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. Conference on Robot Learning, 2023
[49] Gadre S Y, Wortsman M, Ilharco G, Schmidt L, Song S. CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
[50] Majumdar A, Aggarwal G, Devnani B, Hoffman J, Batra D. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. Advances in Neural Information Processing Systems, 2022, 35: 32340−32352
[51] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with Transformers. European Conference on Computer Vision. Cham: Springer International Publishing, 2020
[52] Huang W L, Wang C, Zhang R, Li Y, Wu K, Li F F. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv: 2307.05973, 2023
[53] Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 2015
[54] Jiang P, Ergu D, Liu F, Cai Y, Ma B. A review of YOLO algorithm developments. Procedia Computer Science, 2022, 199: 1066−1073 doi: 10.1016/j.procs.2022.01.135
[55] Cheng H K, Schwing A G. XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022
[56] Li J, Selvaraju R, Gotmare A, Joty S, Xiong C, Hoi S C H. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 2021, 34: 9694−9705
[57] Zhu X, Zhang R, He B, Guo Z, Zeng Z, Qin Z, et al. PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
[58] Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, Xiao J. 3D ShapeNets: A deep representation for volumetric shapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015
[59] Muzahid A A M, Wan W, Sohel F, Wu L, Hou L. CurveNet: Curvature-based multitask learning deep networks for 3D object recognition. IEEE/CAA Journal of Automatica Sinica, 2020, 8(6): 1177−1187
[60] Xue L, Gao M, Xing C, Martín-Martín R, Wu J, Xiong C, et al. ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
[61] Qi C R, Yi L, Su H, Guibas L J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 2017
[62] Ma X, Qin C, You H, Ran H, Fu Y. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv: 2202.07123, 2022
[63] Mildenhall B, Srinivasan P P, Tancik M, Barron J T, Ramamoorthi R, Ng R. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021, 65(1): 99−106
[64] Kerr J, Kim C M, Goldberg K, Kanazawa A, Tancik M. LERF: Language embedded radiance fields. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
[65] Shen W, Yang G, Yu A, Wong J, Kaelbling L P, Isola P. Distilled feature fields enable few-shot language-guided manipulation. arXiv: 2308.07931, 2023
[66] Gadre S Y, Ehsani K, Song S, Mottaghi R. Continuous scene representations for embodied AI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
[67] Shafiullah N N M, Paxton C, Pinto L, Chintala S, Szlam A. CLIP-Fields: Weakly supervised semantic fields for robotic memory. arXiv: 2210.05663, 2022
[68] Huang C, Mees O, Zeng A, Burgard W. Visual language maps for robot navigation. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023: 10608−10615
[69] Ha H, Song S. Semantic abstraction: Open-world 3D scene understanding from 2D vision-language models. arXiv: 2207.11514, 2022
[70] Gan Z, Li L, Li C, Wang L, Liu Z, Gao J. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends in Computer Graphics and Vision, 2022, 14(3): 163−352
[71] Zeng A, Attarian M, Ichter B, Choromanski K, Wong A, Welker S, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv: 2204.00598, 2022
[72] Li B Z, Nye M, Andreas J. Implicit representations of meaning in neural language models. arXiv: 2106.00737, 2021
[73] Huang W L, Abbeel P, Pathak D, Mordatch I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. International Conference on Machine Learning. PMLR, 2022
[74] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv: 1907.11692, 2019
[75] Brohan A, Chebotar Y, Finn C, Hausman K, Herzog A, Ho D, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv: 2204.01691, 2022
[76] Vemprala S H, Bonatti R, Bucker A, Kapoor A. ChatGPT for robotics: Design principles and model abilities. arXiv: 2306.17582, 2023
[77] Liang J, Huang W, Xia F, Xu P, Hausman K, Ichter B, et al. Code as policies: Language model programs for embodied control. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023
[78] Huang W, Xia F, Xiao T, Chan H, Liang J, Florence P, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv: 2207.05608, 2022
[79] Du Y, Yang M, Florence P, Xia F, Wahid A, Ichter B, et al. Video language planning. arXiv: 2310.10625, 2023
[80] Hao S, Gu Y, Ma H, Hong J J, Wang Z, Wang D Z, Hu Z. Reasoning with language model is planning with world model. arXiv: 2305.14992, 2023
[81] Sun Y, Zhang K, Sun C. Model-based transfer reinforcement learning based on graphical model representations. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(2): 1035−1048 doi: 10.1109/TNNLS.2021.3107375
[82] Zha L, Cui Y, Lin L H, Kwon M, Arenas M G, Zeng A, et al. Distilling and retrieving generalizable knowledge for robot manipulation via language corrections. arXiv: 2311.10678, 2023
[83] Liang J, Xia F, Yu W, Zeng A, Arenas M G, Attarian M, et al. Learning to learn faster from human feedback with language model predictive control. arXiv: 2402.11450, 2024
[84] Lynch C, Sermanet P. Language conditioned imitation learning over unstructured data. arXiv: 2005.07648, 2020
[85] Hassanin M, Khan S, Tahtali M. Visual affordance and function understanding: A survey. ACM Computing Surveys, 2021, 54(3): 1−35
[86] Luo H, Zhai W, Zhang J, Cao J, Tao D. Learning visual affordance grounding from demonstration videos. IEEE Transactions on Neural Networks and Learning Systems, 2023
[87] Mo K, Guibas L J, Mukadam M, Gupta A, Tulsiani S. Where2Act: From pixels to actions for articulated 3D objects. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 6813−6823
[88] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI, 2015: 234−241
[89] Mo K, Qin Y, Xiang F, Su H, Guibas L. O2O-Afford: Annotation-free large-scale object-object affordance learning. Conference on Robot Learning, 2022: 1666−1677
[90] Geng Y, An B, Geng H, Chen Y, Yang Y, Dong H. RLAfford: End-to-end affordance learning for robotic manipulation. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023
[91] Makoviychuk V, Wawrzyniak L, Guo Y, Lu M, Storey K, Macklin M, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv: 2108.10470, 2021
[92] Savva M, Kadian A, Maksymets O, Zhao Y, Wijmans E, Jain B, et al. Habitat: A platform for embodied AI research. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 9339−9347
[93] Kolve E, Mottaghi R, Han W, VanderBilt E, Weihs L, Herrasti A, et al. AI2-THOR: An interactive 3D environment for visual AI. arXiv: 1712.05474, 2017
[94] Gan C, Schwartz J, Alter S, Mrowca D, Schrimpf M, Traer J, et al. ThreeDWorld: A platform for interactive multi-modal physical simulation. arXiv: 2007.04954, 2020
[95] Xia F, Shen W B, Li C, Kasimbeg P, Tchapmi M E, Toshev A, et al. Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters, 2020, 5(2): 713−720 doi: 10.1109/LRA.2020.2965078
[96] Deitke M, VanderBilt E, Herrasti A, Weihs L, Ehsani K, Salvador J, et al. ProcTHOR: Large-scale embodied AI using procedural generation. Advances in Neural Information Processing Systems, 2022, 35: 5982−5994
[97] Anderson P, Wu Q, Teney D, Bruce J, Johnson M, Sünderhauf N, et al. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 3674−3683
[98] Wu Q, Wu C J, Zhu Y, Joo J. Communicative learning with natural gestures for embodied navigation agents with human-in-the-scene. 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2021: 4095−4102
[99] Duan J, Yu S, Tan H L, Zhu H, Tan C. A survey of embodied AI: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2022, 6(2): 230−244 doi: 10.1109/TETCI.2022.3141105
[100] Anderson P, Chang A, Chaplot D S, Dosovitskiy A, Gupta S, Koltun V, et al. On evaluation of embodied navigation agents. arXiv: 1807.06757, 2018
[101] Paul S, Roy A, Cherian A. AVLEN: Audio-visual-language embodied navigation in 3D environments. Advances in Neural Information Processing Systems, 2022: 6236−6249
[102] Tan S, Xiang W, Liu H, Guo D, Sun F. Multi-agent embodied question answering in interactive environments. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020
[103] Huang C G, Mees O, Zeng A, Burgard W. Visual language maps for robot navigation. 2023 IEEE International Conference on Robotics and Automation. IEEE, 2023
[104] Zhou G, Hong Y C, Wu Q. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(7): 7641−7649 doi: 10.1609/aaai.v38i7.28597
[105] Padalkar A, Pooley A, Jain A, Bewley A, Herzog A, Irpan A, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv: 2310.08864, 2023
[106] Shah D, Eysenbach B, Kahn G, Rhinehart N, Levine S. ViNG: Learning open-world navigation with visual goals. 2021 IEEE International Conference on Robotics and Automation. IEEE, 2021
[107] Wen G, Zheng W X, Wan Y. Distributed robust optimization for networked agent systems with unknown nonlinearities. IEEE Transactions on Automatic Control, 2022, 68(9): 5230−5244
[108] Bousmalis K, Vezzani G, Rao D, Devin C, Lee A, Bauza M, et al. RoboCat: A self-improving foundation agent for robotic manipulation. arXiv: 2306.11706, 2023