

基于大模型的具身智能系统综述

王文晟 谭宁 黄凯 张雨浓 郑伟诗 孙富春

王文晟, 谭宁, 黄凯, 张雨浓, 郑伟诗, 孙富春. 基于大模型的具身智能系统综述. 自动化学报, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240542
Wang Wen-Sheng, Tan Ning, Huang Kai, Zhang Yu-Nong, Zheng Wei-Shi, Sun Fu-Chun. Embodied intelligence systems based on large models: a survey. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240542

基于大模型的具身智能系统综述

doi: 10.16383/j.aas.c240542
基金项目: 国家自然科学基金面上项目(62173352), 广东省基础与应用基础研究基金杰出青年基金(2024B1515020104)资助
详细信息
    作者简介:

    王文晟:中山大学计算机技术专业硕士研究生. 2023年获得北京科技大学自动化学院测控技术与仪器学士学位. 主要研究方向为基于大模型的具身智能. E-mail: wangwsh23@mail2.sysu.edu.cn

    谭宁:中山大学计算机学院副教授. 2013年获法国弗朗什-孔泰大学博士学位. 主要研究方向为各类机器人系统的建模、设计、仿真、优化、规划与控制, 内容涵盖基础研究和应用开发. 本文通信作者. E-mail: tann5@mail.sysu.edu.cn

    黄凯:中山大学计算机学院教授. 2010年获瑞士苏黎世联邦理工学院计算机科学博士学位. 主要研究方向为汽车和机器人领域嵌入式系统的分析、设计和优化技术. E-mail: huangk36@mail.sysu.edu.cn

    张雨浓:中山大学计算机学院教授. 2002年获香港中文大学博士学位. 广东省珠江学者特聘教授, Elsevier中国高被引学者. 主要研究方向为冗余机器人, 递归神经网络, 高斯过程, 科学计算和软硬件开发. E-mail: zhynong@mail.sysu.edu.cn

    郑伟诗:教育部“长江学者奖励计划” 特聘教授, 英国皇家学会牛顿高级学者, IAPR Fellow. 主要研究方向为协同与交互分析理论与方法, 解决人体建模和机器人行为的视觉计算问题. E-mail: zhwshi@mail.sysu.edu.cn

    孙富春:清华大学计算机科学与技术系教授. 1997 年获得清华大学博士学位. 国家杰出青年科学基金获得者. 中国人工智能学会副理事长, IEEE Fellow. 主要研究方向为智能控制、智能机器人与具身智能. E-mail: fcsun@tsinghua.edu.cn

Embodied Intelligence Systems Based on Large Models: A Survey

Funds: Supported by National Natural Science Foundation of China (62173352), Guangdong Basic and Applied Basic Research Foundation (2024B1515020104)
More Information
    Author Bio:

    WANG Wen-Sheng Master's student in Computer Technology at Sun Yat-sen University. He received his bachelor's degree in Measurement and Control Technology and Instruments from the School of Automation at University of Science and Technology Beijing in 2023. His main research focus is on embodied AI based on large models

    TAN Ning Associate professor at the School of Computer Science and Engineering, Sun Yat-sen University. He received his Ph.D. degree from the University of Franche-Comté, France, in 2013. His research interest covers the modeling, design, simulation, optimization, planning, and control of various robotic systems, spanning both fundamental research and application development. Corresponding author of this paper

    HUANG Kai Professor at the School of Computer Science and Engineering, Sun Yat-sen University. He received his Ph.D. degree in computer science from ETH Zürich, Zürich, Switzerland, in 2010. His research interests include techniques for the analysis, design, and optimization of embedded systems, particularly in the automotive and robotic domains

    ZHANG Yu-Nong Professor at the School of Computer Science and Engineering, Sun Yat-sen University. He received his Ph.D. degree from the Chinese University of Hong Kong in 2002. He is a Distinguished Scholar of the Pearl River Scholars Program in Guangdong Province and an Elsevier Highly Cited Researcher in China. His research interests include redundant robots, recurrent neural networks, Gaussian processes, scientific computing, and software-hardware development

    ZHENG Wei-Shi Cheung Kong Scholar Distinguished Professor, recipient of the Excellent Young Scientists Fund of the National Natural Science Foundation of China, and recipient of the Royal Society-Newton Advanced Fellowship of the United Kingdom. His research interest covers theories and methods of collaborative and interactive analysis, addressing visual computing issues in human behavior modeling and artificial intelligence (AI) robotic learning

    SUN Fu-Chun Professor in the Department of Computer Science and Technology, Tsinghua University. He received his Ph.D. degree from Tsinghua University in 1997. He is a recipient of the National Science Fund for Distinguished Young Scholars, vice president of the Chinese Association for Artificial Intelligence, and an IEEE Fellow. His main research interests are intelligent control, intelligent robotics, and embodied intelligence

  • Abstract: Benefiting from the recent rapid development of large-scale pre-trained models equipped with world knowledge, embodied intelligence based on large models has achieved strong performance across a wide range of tasks, demonstrating powerful generalization capabilities and broad application prospects in many fields. This article surveys work on embodied intelligence based on large models. It first introduces the perception and understanding roles that large models play in embodied intelligence systems. It then provides a fairly comprehensive summary of the four levels of control in which large models participate: the demand, task, planning, and action levels. Next, it describes different embodied intelligence system architectures and summarizes the data sources of current embodied intelligence models, including simulators, imitation learning, and learning from videos. Finally, it discusses the challenges facing embodied intelligence systems based on large models and their future development directions.
  • 图  1  基于大模型的具身智能工作概览

    Fig.  1  Overview of embodied intelligence work based on large models

    图  2  基于NeRF的语义特征场景表示[108]

    Fig.  2  Semantic feature scene representation based on NeRF[108]

    图  3  具身智能系统的控制层级

    Fig.  3  Control hierarchy of embodied intelligence systems

    图  4  VoxPoser根据价值图规划轨迹[52]

    Fig.  4  VoxPoser plans a motion trajectory based on value maps[52]

    图  5  具身智能的不同架构举例

    Fig.  5  Examples of different architectures in embodied AI

    图  6  RT-X收集到的多样化数据[55]

    Fig.  6  Diverse data collected by RT-X[55]

  • [1] A. M. Turing. Computing Machinery and Intelligence. Mind, 1950, LIX(236): 433−460 doi: 10.1093/mind/LIX.236.433
    [2] N. Roy, I. Posner, T. Barfoot, P. Beaudoin, Y. Bengio, J. Bohg et al., “From Machine Learning to Robotics: Challenges and Opportunities for Embodied Intelligence,” Oct. 2021.
    [3] R. A. Brooks, “Intelligence without representation.”
    [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal et al., “Language Models are Few-Shot Learners,” Jul. 2020.
    [5] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya et al., “GPT-4 Technical Report,” Mar. 2024.
    [6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix et al., “LLaMA: Open and Efficient Foundation Language Models,” Feb. 2023.
    [7] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” Jul. 2023.
    [8] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut et al., “Gemini: A Family of Highly Capable Multimodal Models,” Apr. 2024.
    [9] G. Team, M. Reid, N. Savinov, D. Teplyashin, Dmitry, Lepikhin et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” Apr. 2024.
    [10] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation,” Feb. 2022.
    [11] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” Jun. 2023.
    [12] N. Wake, A. Kanehira, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration,” Nov. 2023.
    [13] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven Semantic Segmentation,” Apr. 2022.
    [14] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation,” May 2022.
    [15] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Batra et al., “VQA: Visual Question Answering,” Oct. 2016.
    [16] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski et al., “Emerging Properties in Self-Supervised Vision Transformers,” May 2021.
    [17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal et al., “Learning Transferable Visual Models From Natural Language Supervision,” Feb. 2021.
    [18] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson et al., “Segment Anything,” Apr. 2023.
    [19] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx et al., “On the opportunities and risks of foundation models,” 2022.
    [20] 杨雨彤. AI 大模型与具身智能终将相遇. 机器人产业, 2024(2): 71−74

    Y. Yang. AI large models and embodied intelligence will eventually meet. Robot Industry, 2024(2): 71−74
    [21] A. Gupta, S. Savarese, S. Ganguli, and L. Fei-Fei. Embodied intelligence via learning and evolution. Nature Communications, 2021, 12(1): 5721 doi: 10.1038/s41467-021-25874-z
    [22] 刘华平, 郭迪, 孙富春, 张新钰. 基于形态的具身智能研究: 历史回顾与前沿进展. 自动化学报, 2023, 49(6): 1131−1154

    H. Liu, D. Guo, F. Sun, and X. Zhang. Morphology-based Embodied Intelligence: Historical Retrospect and Research Progress. Acta Automatica Sinica, 2023, 49(6): 1131−1154
    [23] 兰沣卜, 赵文博, 朱凯, 张涛. 基于具身智能的移动操作机器人系统发展研究. 中国工程科学, 2024, 26(1): 139−148 doi: 10.15302/J-SSCAE-2024.01.010

    F. Lan, W. Zhao, K. Zhu, and T. Zhang. Development of Mobile Manipulator Robot System with Embodied Intelligence. Strategic Study of CAE, 2024, 26(1): 139−148 doi: 10.15302/J-SSCAE-2024.01.010
    [24] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu et al., “Foundation Models in Robotics: Applications, Challenges, and the Future,” Dec. 2023.
    [25] J. Wang, Z. Wu, Y. Li, H. Jiang, P. Shu, E. Shi et al., “Large Language Models for Robotics: Opportunities, Challenges, and Perspectives,” Jan. 2024.
    [26] Y. Kim, D. Kim, J. Choi, J. Park, N. Oh, and D. Park, “A Survey on Integration of Large Language Models with Intelligent Robots,” Apr. 2024.
    [27] Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. Keetha et al., “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis,” Dec. 2023.
    [28] Y. Liu, W. Chen, Y. Bai, J. Luo, X. Song, K. Jiang et al., “Aligning cyber space with physical world: A comprehensive survey on embodied ai,” 2024.
    [29] Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu et al., “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision),” Oct. 2023.
    [30] Y. Hu, F. Lin, T. Zhang, L. Yi, and Y. Gao, “Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning,” Dec. 2023.
    [31] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu et al., “MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge,” Nov. 2022.
    [32] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from Human Videos as a Versatile Representation for Robotics,” Apr. 2023.
    [33] B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet et al., “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos,” Jun. 2022.
    [34] S. A. Sontakke, J. Zhang, S. M. R. Arnold, K. Pertsch, E. Bıyık, D. Sadigh et al., “RoboCLIP: One Demonstration is Enough to Learn Robot Policies,” Oct. 2023.
    [35] Y. Seo, K. Lee, S. James, and P. Abbeel, “Reinforcement Learning with Action-Free Pre-Training from Videos,” Jun. 2022.
    [36] L. Han, Q. Zhu, J. Sheng, C. Zhang, T. Li, Y. Zhang, et al. Lifelike agility and play in quadrupedal robots using reinforcement learning and generative pre-trained models. Nature Machine Intelligence, 2024
    [37] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” Apr. 2023.
    [38] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng et al., “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots,” Mar. 2024.
    [39] Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, “Humanplus: Humanoid shadowing and imitation from humans,” 2024.
    [40] P. Wu, Y. Shentu, Z. Yi, X. Lin, and P. Abbeel, “GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators,” Sep. 2023.
    [41] H. Kim, Y. Ohmura, and Y. Kuniyoshi. Goalconditioned dual-action imitation learning for dexterous dual-arm robot manipulation. IEEE Transactions on Robotics, 2024, 40: 2287−2305 doi: 10.1109/TRO.2024.3372778
    [42] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, Z. Erickson et al., “RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation,” Nov. 2023.
    [43] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan et al., “Mimicgen: A data generation system for scalable robot learning using human demonstrations,” 2023.
    [44] H. Ha, P. Florence, and S. Song, “Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition,” Sep. 2023.
    [45] Y. J. Ma, W. Liang, H. Wang, S. Wang, Y. Zhu, L. Fan et al., “Dreureka: Language model guided sim-to-real transfer,” 2024.
    [46] Z. Luo, J. Cao, S. Christen, A. Winkler, K. Kitani, and W. Xu, “Grasping diverse objects with simulated humanoids,” 2024.
    [47] T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively et al., “Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning,” Jun. 2021.
    [48] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín et al., “BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation,” Mar. 2024.
    [49] J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song et al., “TidyBot: Personalized Robot Assistance with Large Language Models.”
    [50] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen et al., “VIMA: General Robot Manipulation with Multimodal Prompts,” May 2023.
    [51] S. Huang, Z. Jiang, H. Dong, Y. Qiao, P. Gao, and H. Li, “Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model,” May 2023.
    [52] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models,” Nov. 2023.
    [53] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn et al., “RT-1: Robotics Transformer for Real-World Control at Scale,” in Robotics: Science and Systems XIX. Robotics: Science and Systems Foundation, Jul. 2023.
    [54] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski et al., “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” Jul. 2023.
    [55] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan et al., “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.”
    [56] Z. Durante, B. Sarkar, R. Gong, R. Taori, Y. Noda, P. Tang et al., “An Interactive Agent Foundation Model.”
    [57] W. Wang, Y. Lei, S. Jin, G. D. Hager, and L. Zhang, “Vihe: Virtual in-hand eye transformer for 3d robotic manipulation,” 2024.
    [58] A. Team, J. Aldaco, T. Armstrong, R. Baruch, J. Bingham, S. Chan et al., “ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation,” Feb. 2024.
    [59] Y.-J. Wang, B. Zhang, J. Chen, and K. Sreenath, “Prompt a Robot to Walk with Large Language Models,” Nov. 2023.
    [60] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron et al., “A Generalist Agent,” Nov. 2022.
    [61] X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu et al., “Vision-Language Foundation Models as Effective Robot Imitators,” Feb. 2024.
    [62] X. Li, M. Zhang, Y. Geng, H. Geng, Y. Long, Y. Shen et al., “ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,” Dec. 2023.
    [63] H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du et al., “3D-VLA: A 3D Vision-Language-Action Generative World Model,” Mar. 2024.
    [64] J. Wu, S. Yin, N. Feng, X. He, D. Li, J. Hao et al., “ivideogpt: Interactive videogpts are scalable world models,” 2024.
    [65] J. Zhang, K. Wang, R. Xu, G. Zhou, Y. Hong, X. Fang et al., “Navid: Video-based vlm plans the next step for vision-and-language navigation,” 2024.
    [66] Z. Mandi, S. Jain, and S. Song, “RoCo: Dialectic Multi-Robot Collaboration with Large Language Models,” Jul. 2023.
    [67] A. Jiao, T. P. Patel, S. Khurana, A.-M. Korol, L. Brunke, V. K. Adajania et al., “Swarm-GPT: Combining Large Language Models with Safe Motion Planning for Robot Choreography Design,” Dec. 2023.
    [68] W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation.”
    [69] P. Liu, Y. Orru, C. Paxton, N. M. M. Shafiullah, and L. Pinto, “OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics.”
    [70] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter et al., “Code as Policies: Language Model Programs for Embodied Control,” May 2023.
    [71] Y. Ding, X. Zhang, C. Paxton, and S. Zhang, “Task and Motion Planning with Large Language Models for Object Rearrangement,” Oct. 2023.
    [72] K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2Motion: From Natural Language Instructions to Feasible Plans. Autonomous Robots, 2023, 47(8): 1345−1365 doi: 10.1007/s10514-023-10131-7
    [73] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter et al., “PaLM-E: An Embodied Multimodal Language Model,” Mar. 2023.
    [74] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David et al., “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.”
    [75] Y. Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin et al., “EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought,” Sep. 2023.
    [76] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel et al., “Guiding Pretraining in Reinforcement Learning with Large Language Models,” Sep. 2023.
    [77] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models,” Oct. 2023.
    [78] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su, “LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models,” Mar. 2023.
    [79] A. Z. Ren, A. Dixit, A. Bodrova, S. Singh, S. Tu, N. Brown et al., “Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners,” Sep. 2023.
    [80] H. Liu, A. Chen, Y. Zhu, A. Swaminathan, A. Kolobov, and C.-A. Cheng, “Interactive Robot Learning from Verbal Correction,” Oct. 2023.
    [81] L. X. Shi, Z. Hu, T. Z. Zhao, A. Sharma, K. Pertsch, J. Luo et al., “Yell At Your Robot: Improving On-the-Fly from Language Corrections,” Mar. 2024.
    [82] A. Zeng, M. Attarian, B. Ichter, K. Choromanski, A. Wong, S. Welker et al., “Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language,” May 2022.
    [83] R. Shah, R. Martín-Martín, and Y. Zhu, “MUTEX: Learning Unified Policies from Multimodal Task Specifications,” Sep. 2023.
    [84] Y. Dai, R. Peng, S. Li, and J. Chai, “Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation,” May 2024.
    [85] F. Liu, K. Fang, P. Abbeel, and S. Levine, “MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting,” Mar. 2024.
    [86] S. James, K. Wada, T. Laidlow, and A. J. Davison, “Coarse-to-Fine Q-attention: Effcient Learning for Visual Robotic Manipulation via Discretisation,” Mar. 2022.
    [87] M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation.”
    [88] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “LangSplat: 3D Language Gaussian Splatting,” Dec. 2023.
    [89] O. Shorinwa, J. Tucker, A. Smith, A. Swann, T. Chen, R. Firoozi et al., “Splat-MOVER: Multi-Stage, OpenVocabulary Robotic Manipulation via Editable Gaussian Splatting,” May 2024.
    [90] J. Yang, X. Chen, S. Qian, N. Madaan, M. Iyengar, D. F. Fouhey et al., “LLM-Grounder: OpenVocabulary 3D Visual Grounding with Large Language Model as an Agent,” Sep. 2023.
    [91] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Audio Visual Language Maps for Robot Navigation,” Mar. 2023.
    [92] T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki, “Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation,” Oct. 2023.
    [93] K. Zhang, B. Li, K. Hauser, and Y. Li, “Adaptigraph: Material-adaptive graph-based neural dynamics for robotic manipulation,” 2024.
    [94] S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, and L. E. Li, “AffordanceLLM: Grounding Affordance from Vision Language Models,” Apr. 2024.
    [95] Y. Ye, X. Li, A. Gupta, S. De Mello, S. Birchfield, J. Song et al., “Affordance Diffusion: Synthesizing Hand-Object Interactions,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, Jun. 2023, pp. 22479−22489.
    [96] H. Huang, F. Lin, Y. Hu, S. Wang, and Y. Gao, “CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models.”
    [97] Z. Qin, K. Fang, Y. Zhu, L. Fei-Fei, and S. Savarese, “KETO: Learning Keypoint Representations for Tool Manipulation,” Oct. 2019.
    [98] Y. Ju, K. Hu, G. Zhang, G. Zhang, M. Jiang, and H. Xu, “Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation,” 2024.
    [99] P. Sundaresan, S. Belkhale, D. Sadigh, and J. Bohg, “KITE: Keypoint-Conditioned Policies for Semantic Manipulation,” Oct. 2023.
    [100] Y. Hong, Z. Zheng, P. Chen, Y. Wang, J. Li, and C. Gan, “MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World,” Jan. 2024.
    [101] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” Dec. 2023.
    [102] S. Yenamandra, A. Ramachandran, K. Yadav, A. Wang, M. Khanna, T. Gervet et al., “HomeRobot: Open-Vocabulary Mobile Manipulation,” Jan. 2024.
    [103] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam, “CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory,” May 2023.
    [104] M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation.”
    [105] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Dec. 2015.
    [106] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” Apr. 2017.
    [107] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” Aug. 2020.
    [108] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation,” Dec. 2023.
    [109] B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 2023, 42(4): 1−14
    [110] B. Fei, J. Xu, R. Zhang, Q. Zhou, W. Yang, and Y. He, “3D Gaussian as a New Vision Era: A Survey,” Feb. 2024.
    [111] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “LERF: Language Embedded Radiance Fields,” Mar. 2023.
    [112] H. Matsuki, R. Murai, P. H. J. Kelly, and A. J. Davison, “Gaussian Splatting SLAM,” Dec. 2023.
    [113] S. Zhu, R. Qin, G. Wang, J. Liu, and H. Wang, “SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM,” Mar. 2024.
    [114] M. Hassanin, S. Khan, and M. Tahtali, “Visual Affordance and Function Understanding: A Survey,” Jul. 2018.
    [115] Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran, “Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?” Apr. 2022.
    [116] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar, “CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning,” Feb. 2023.
    [117] T. Yu, T. Xiao, A. Stone, J. Tompson, A. Brohan, S. Wang et al., “Scaling Robot Learning with Semantically Imagined Experience,” Feb. 2023.
    [118] B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo, Robotics: Modelling, Planning and Control, ser. Advanced Textbooks in Control and Signal Processing, M. J. Grimble and M. A. Johnson, Eds. London: Springer, 2009.
    [119] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia et al., “Chain-of-thought prompting elicits reasoning in large language models,” 2023.
    [120] M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy et al., “Simple Open-Vocabulary Object Detection with Vision Transformers,” Jul. 2022.
    [121] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, et al. Array programming with NumPy. Nature, 2020, 585(7825): 357−362 doi: 10.1038/s41586-020-2649-2
    [122] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu et al., “LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention,” Jun. 2023.
    [123] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
    [124] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John et al., “Universal Sentence Encoder,” Apr. 2018.
    [125] M. Tan and Q. V. Le, “EffcientNet: Rethinking Model Scaling for Convolutional Neural Networks,” Sep. 2020.
    [126] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual Reasoning with a General Conditioning Layer,” Dec. 2017.
    [127] X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu et al., “PaLI-X: On Scaling up a Multilingual Vision and Language Model,” May 2023.
    [128] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” Dec. 2022.
    [129] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” May 2019.
    [130] Z. Fu, T. Z. Zhao, and C. Finn, “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation,” Jan. 2024.
    [131] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “RLBench: The Robot Learning Benchmark & Learning Environment,” Sep. 2019.
    [132] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. Tchapmi, A. Toshev, et al. Interactive Gibson Benchmark (iGibson 0.5): A Benchmark for Interactive Navigation in Cluttered Environments. IEEE Robotics and Automation Letters, 2020, 5(2): 713−720 doi: 10.1109/LRA.2020.2965078
    [133] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi et al., “ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks,” Mar. 2020.
    [134] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler et al., “VirtualHome: Simulating Household Activities via Programs,” Jun. 2018.
    [135] C. Gan, S. Zhou, J. Schwartz, S. Alter, A. Bhandwaldar, D. Gutfreund et al., “The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI,” Mar. 2021.
    [136] L. Weihs, M. Deitke, A. Kembhavi, and R. Mottaghi, “Visual Room Rearrangement,” Mar. 2021.
    [137] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin et al., “Isaac gym: High performance gpu-based physics simulation for robot learning,” 2021.
    [138] L. Wang, Y. Ling, Z. Yuan, M. Shridhar, C. Bao, Y. Qin et al., “GenSim: Generating Robotic Simulation Tasks via Large Language Models,” Jan. 2024.
    [139] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,” Mar. 2024.
    [140] Z. Zhou, X. Ning, K. Hong, T. Fu, J. Xu, S. Li et al., “A Survey on Efficient Inference for Large Language Models,” Apr. 2024.
    [141] M. Ahn, D. Dwibedi, C. Finn, M. G. Arenas, K. Gopalakrishnan, K. Hausman et al., “AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents,” Jan. 2024.
Publication history
  • Received: 2024-08-01
  • Accepted: 2024-09-09
  • Published online: 2024-10-13
