[1] A. M. Turing. Computing Machinery and Intelligence. Mind, 1950, LIX(236): 433−460 doi: 10.1093/mind/LIX.236.433
[2] N. Roy, I. Posner, T. Barfoot, P. Beaudoin, Y. Bengio, J. Bohg et al., “From Machine Learning to Robotics: Challenges and Opportunities for Embodied Intelligence,” Oct. 2021.
[3] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 1991, 47(1−3): 139−159
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal et al., “Language Models are Few-Shot Learners,” Jul. 2020.
[5] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya et al., “GPT-4 Technical Report,” Mar. 2024.
[6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix et al., “LLaMA: Open and Efficient Foundation Language Models,” Feb. 2023.
[7] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” Jul. 2023.
[8] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut et al., “Gemini: A Family of Highly Capable Multimodal Models,” Apr. 2024.
[9] Gemini Team, M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” Apr. 2024.
[10] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation,” Feb. 2022.
[11] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” Jun. 2023.
[12] N. Wake, A. Kanehira, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration,” Nov. 2023.
[13] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven Semantic Segmentation,” Apr. 2022.
[14] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary Object Detection via Vision and Language Knowledge Distillation,” May 2022.
[15] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Batra et al., “VQA: Visual Question Answering,” Oct. 2016.
[16] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski et al., “Emerging Properties in Self-Supervised Vision Transformers,” May 2021.
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal et al., “Learning Transferable Visual Models From Natural Language Supervision,” Feb. 2021.
[18] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson et al., “Segment Anything,” Apr. 2023.
[19] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx et al., “On the Opportunities and Risks of Foundation Models,” 2022.
[20] Y. Yang. AI large models and embodied intelligence will eventually meet. Robot Industry, 2024(2): 71−74 (in Chinese)
[21] A. Gupta, S. Savarese, S. Ganguli, and L. Fei-Fei. Embodied intelligence via learning and evolution. Nature Communications, 2021, 12(1): 5721 doi: 10.1038/s41467-021-25874-z
[22] H. Liu, D. Guo, F. Sun, and X. Zhang. Morphology-based Embodied Intelligence: Historical Retrospect and Research Progress. Acta Automatica Sinica, 2023, 49(6): 1131−1154 (in Chinese)
[23] F. Lan, W. Zhao, K. Zhu, and T. Zhang. Development of Mobile Manipulator Robot System with Embodied Intelligence. Strategic Study of CAE, 2024, 26(1): 139−148 doi: 10.15302/J-SSCAE-2024.01.010 (in Chinese)
[24] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu et al., “Foundation Models in Robotics: Applications, Challenges, and the Future,” Dec. 2023.
[25] J. Wang, Z. Wu, Y. Li, H. Jiang, P. Shu, E. Shi et al., “Large Language Models for Robotics: Opportunities, Challenges, and Perspectives,” Jan. 2024.
[26] Y. Kim, D. Kim, J. Choi, J. Park, N. Oh, and D. Park, “A Survey on Integration of Large Language Models with Intelligent Robots,” Apr. 2024.
[27] Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. Keetha et al., “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis,” Dec. 2023.
[28] Y. Liu, W. Chen, Y. Bai, J. Luo, X. Song, K. Jiang et al., “Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI,” 2024.
[29] Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu et al., “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision),” Oct. 2023.
[30] Y. Hu, F. Lin, T. Zhang, L. Yi, and Y. Gao, “Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning,” Dec. 2023.
[31] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu et al., “MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge,” Nov. 2022.
[32] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from Human Videos as a Versatile Representation for Robotics,” Apr. 2023.
[33] B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet et al., “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos,” Jun. 2022.
[34] S. A. Sontakke, J. Zhang, S. M. R. Arnold, K. Pertsch, E. Bıyık, D. Sadigh et al., “RoboCLIP: One Demonstration is Enough to Learn Robot Policies,” Oct. 2023.
[35] Y. Seo, K. Lee, S. James, and P. Abbeel, “Reinforcement Learning with Action-Free Pre-Training from Videos,” Jun. 2022.
[36] L. Han, Q. Zhu, J. Sheng, C. Zhang, T. Li, Y. Zhang et al. Lifelike agility and play in quadrupedal robots using reinforcement learning and generative pre-trained models. Nature Machine Intelligence, 2024
[37] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” Apr. 2023.
[38] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng et al., “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots,” Mar. 2024.
[39] Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, “HumanPlus: Humanoid Shadowing and Imitation from Humans,” 2024.
[40] P. Wu, Y. Shentu, Z. Yi, X. Lin, and P. Abbeel, “GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators,” Sep. 2023.
[41] H. Kim, Y. Ohmura, and Y. Kuniyoshi. Goal-conditioned dual-action imitation learning for dexterous dual-arm robot manipulation. IEEE Transactions on Robotics, 2024, 40: 2287−2305 doi: 10.1109/TRO.2024.3372778
[42] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, Z. Erickson et al., “RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation,” Nov. 2023.
[43] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan et al., “MimicGen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations,” 2023.
[44] H. Ha, P. Florence, and S. Song, “Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition,” Sep. 2023.
[45] Y. J. Ma, W. Liang, H. Wang, S. Wang, Y. Zhu, L. Fan et al., “DrEureka: Language Model Guided Sim-to-Real Transfer,” 2024.
[46] Z. Luo, J. Cao, S. Christen, A. Winkler, K. Kitani, and W. Xu, “Grasping Diverse Objects with Simulated Humanoids,” 2024.
[47] T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively et al., “Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning,” Jun. 2021.
[48] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín et al., “BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation,” Mar. 2024.
[49] J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song et al., “TidyBot: Personalized Robot Assistance with Large Language Models.”
[50] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen et al., “VIMA: General Robot Manipulation with Multimodal Prompts,” May 2023.
[51] S. Huang, Z. Jiang, H. Dong, Y. Qiao, P. Gao, and H. Li, “Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model,” May 2023.
[52] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models,” Nov. 2023.
[53] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn et al., “RT-1: Robotics Transformer for Real-World Control at Scale,” in Robotics: Science and Systems XIX. Robotics: Science and Systems Foundation, Jul. 2023.
[54] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski et al., “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” Jul. 2023.
[55] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan et al., “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.”
[56] Z. Durante, B. Sarkar, R. Gong, R. Taori, Y. Noda, P. Tang et al., “An Interactive Agent Foundation Model.”
[57] W. Wang, Y. Lei, S. Jin, G. D. Hager, and L. Zhang, “VIHE: Virtual In-Hand Eye Transformer for 3D Robotic Manipulation,” 2024.
[58] ALOHA 2 Team, J. Aldaco, T. Armstrong, R. Baruch, J. Bingham, S. Chan et al., “ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation,” Feb. 2024.
[59] Y.-J. Wang, B. Zhang, J. Chen, and K. Sreenath, “Prompt a Robot to Walk with Large Language Models,” Nov. 2023.
[60] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron et al., “A Generalist Agent,” Nov. 2022.
[61] X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu et al., “Vision-Language Foundation Models as Effective Robot Imitators,” Feb. 2024.
[62] X. Li, M. Zhang, Y. Geng, H. Geng, Y. Long, Y. Shen et al., “ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,” Dec. 2023.
[63] H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du et al., “3D-VLA: A 3D Vision-Language-Action Generative World Model,” Mar. 2024.
[64] J. Wu, S. Yin, N. Feng, X. He, D. Li, J. Hao et al., “iVideoGPT: Interactive VideoGPTs are Scalable World Models,” 2024.
[65] J. Zhang, K. Wang, R. Xu, G. Zhou, Y. Hong, X. Fang et al., “NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation,” 2024.
[66] Z. Mandi, S. Jain, and S. Song, “RoCo: Dialectic Multi-Robot Collaboration with Large Language Models,” Jul. 2023.
[67] A. Jiao, T. P. Patel, S. Khurana, A.-M. Korol, L. Brunke, V. K. Adajania et al., “Swarm-GPT: Combining Large Language Models with Safe Motion Planning for Robot Choreography Design,” Dec. 2023.
[68] W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation.”
[69] P. Liu, Y. Orru, C. Paxton, N. M. M. Shafiullah, and L. Pinto, “OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics.”
[70] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter et al., “Code as Policies: Language Model Programs for Embodied Control,” May 2023.
[71] Y. Ding, X. Zhang, C. Paxton, and S. Zhang, “Task and Motion Planning with Large Language Models for Object Rearrangement,” Oct. 2023.
[72] K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2Motion: From Natural Language Instructions to Feasible Plans. Autonomous Robots, 2023, 47(8): 1345−1365 doi: 10.1007/s10514-023-10131-7
[73] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter et al., “PaLM-E: An Embodied Multimodal Language Model,” Mar. 2023.
[74] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David et al., “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.”
[75] Y. Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin et al., “EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought,” Sep. 2023.
[76] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel et al., “Guiding Pretraining in Reinforcement Learning with Large Language Models,” Sep. 2023.
[77] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models,” Oct. 2023.
[78] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su, “LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models,” Mar. 2023.
[79] A. Z. Ren, A. Dixit, A. Bodrova, S. Singh, S. Tu, N. Brown et al., “Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners,” Sep. 2023.
[80] H. Liu, A. Chen, Y. Zhu, A. Swaminathan, A. Kolobov, and C.-A. Cheng, “Interactive Robot Learning from Verbal Correction,” Oct. 2023.
[81] L. X. Shi, Z. Hu, T. Z. Zhao, A. Sharma, K. Pertsch, J. Luo et al., “Yell At Your Robot: Improving On-the-Fly from Language Corrections,” Mar. 2024.
[82] A. Zeng, M. Attarian, B. Ichter, K. Choromanski, A. Wong, S. Welker et al., “Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language,” May 2022.
[83] R. Shah, R. Martín-Martín, and Y. Zhu, “MUTEX: Learning Unified Policies from Multimodal Task Specifications,” Sep. 2023.
[84] Y. Dai, R. Peng, S. Li, and J. Chai, “Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation,” May 2024.
[85] F. Liu, K. Fang, P. Abbeel, and S. Levine, “MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting,” Mar. 2024.
[86] S. James, K. Wada, T. Laidlow, and A. J. Davison, “Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation,” Mar. 2022.
[87] M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation.”
[88] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “LangSplat: 3D Language Gaussian Splatting,” Dec. 2023.
[89] O. Shorinwa, J. Tucker, A. Smith, A. Swann, T. Chen, R. Firoozi et al., “Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting,” May 2024.
[90] J. Yang, X. Chen, S. Qian, N. Madaan, M. Iyengar, D. F. Fouhey et al., “LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent,” Sep. 2023.
[91] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Audio Visual Language Maps for Robot Navigation,” Mar. 2023.
[92] T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki, “Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation,” Oct. 2023.
[93] K. Zhang, B. Li, K. Hauser, and Y. Li, “AdaptiGraph: Material-Adaptive Graph-Based Neural Dynamics for Robotic Manipulation,” 2024.
[94] S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, and L. E. Li, “AffordanceLLM: Grounding Affordance from Vision Language Models,” Apr. 2024.
[95] Y. Ye, X. Li, A. Gupta, S. De Mello, S. Birchfield, J. Song et al., “Affordance Diffusion: Synthesizing Hand-Object Interactions,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, Jun. 2023, pp. 22479–22489.
[96] H. Huang, F. Lin, Y. Hu, S. Wang, and Y. Gao, “CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models.”
[97] Z. Qin, K. Fang, Y. Zhu, L. Fei-Fei, and S. Savarese, “KETO: Learning Keypoint Representations for Tool Manipulation,” Oct. 2019.
[98] Y. Ju, K. Hu, G. Zhang, G. Zhang, M. Jiang, and H. Xu, “Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation,” 2024.
[99] P. Sundaresan, S. Belkhale, D. Sadigh, and J. Bohg, “KITE: Keypoint-Conditioned Policies for Semantic Manipulation,” Oct. 2023.
[100] Y. Hong, Z. Zheng, P. Chen, Y. Wang, J. Li, and C. Gan, “MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World,” Jan. 2024.
[101] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” Dec. 2023.
[102] S. Yenamandra, A. Ramachandran, K. Yadav, A. Wang, M. Khanna, T. Gervet et al., “HomeRobot: Open-Vocabulary Mobile Manipulation,” Jan. 2024.
[103] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam, “CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory,” May 2023.
[104] M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation.”
[105] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Dec. 2015.
[106] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” Apr. 2017.
[107] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” Aug. 2020.
[108] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation,” Dec. 2023.
[109] B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 2023, 42(4): 1−14
[110] B. Fei, J. Xu, R. Zhang, Q. Zhou, W. Yang, and Y. He, “3D Gaussian as a New Vision Era: A Survey,” Feb. 2024.
[111] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “LERF: Language Embedded Radiance Fields,” Mar. 2023.
[112] H. Matsuki, R. Murai, P. H. J. Kelly, and A. J. Davison, “Gaussian Splatting SLAM,” Dec. 2023.
[113] S. Zhu, R. Qin, G. Wang, J. Liu, and H. Wang, “SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM,” Mar. 2024.
[114] M. Hassanin, S. Khan, and M. Tahtali, “Visual Affordance and Function Understanding: A Survey,” Jul. 2018.
[115] Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran, “Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?” Apr. 2022.
[116] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar, “CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning,” Feb. 2023.
[117] T. Yu, T. Xiao, A. Stone, J. Tompson, A. Brohan, S. Wang et al., “Scaling Robot Learning with Semantically Imagined Experience,” Feb. 2023.
[118] B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo, Robotics: Modelling, Planning and Control, ser. Advanced Textbooks in Control and Signal Processing, M. J. Grimble and M. A. Johnson, Eds. London: Springer, 2009.
[119] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” 2023.
[120] M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy et al., “Simple Open-Vocabulary Object Detection with Vision Transformers,” Jul. 2022.
[121] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau et al. Array programming with NumPy. Nature, 2020, 585(7825): 357−362 doi: 10.1038/s41586-020-2649-2
[122] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu et al., “LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention,” Jun. 2023.
[123] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
[124] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John et al., “Universal Sentence Encoder,” Apr. 2018.
[125] M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” Sep. 2020.
[126] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual Reasoning with a General Conditioning Layer,” Dec. 2017.
[127] X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu et al., “PaLI-X: On Scaling up a Multilingual Vision and Language Model,” May 2023.
[128] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” Dec. 2022.
[129] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” May 2019.
[130] Z. Fu, T. Z. Zhao, and C. Finn, “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation,” Jan. 2024.
[131] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “RLBench: The Robot Learning Benchmark & Learning Environment,” Sep. 2019.
[132] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. Tchapmi, A. Toshev et al. Interactive Gibson Benchmark (iGibson 0.5): A Benchmark for Interactive Navigation in Cluttered Environments. IEEE Robotics and Automation Letters, 2020, 5(2): 713−720 doi: 10.1109/LRA.2020.2965078
[133] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi et al., “ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks,” Mar. 2020.
[134] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler et al., “VirtualHome: Simulating Household Activities via Programs,” Jun. 2018.
[135] C. Gan, S. Zhou, J. Schwartz, S. Alter, A. Bhandwaldar, D. Gutfreund et al., “The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI,” Mar. 2021.
[136] L. Weihs, M. Deitke, A. Kembhavi, and R. Mottaghi, “Visual Room Rearrangement,” Mar. 2021.
[137] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin et al., “Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning,” 2021.
[138] L. Wang, Y. Ling, Z. Yuan, M. Shridhar, C. Bao, Y. Qin et al., “GenSim: Generating Robotic Simulation Tasks via Large Language Models,” Jan. 2024.
[139] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,” Mar. 2024.
[140] Z. Zhou, X. Ning, K. Hong, T. Fu, J. Xu, S. Li et al., “A Survey on Efficient Inference for Large Language Models,” Apr. 2024.
[141] M. Ahn, D. Dwibedi, C. Finn, M. G. Arenas, K. Gopalakrishnan, K. Hausman et al., “AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents,” Jan. 2024.