Agent AI:
Surveying the Horizons of Multimodal Interaction
Abstract
Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

Figure 1: Overview of an Agent AI system that can perceive and act across different domains and applications. Agent AI is emerging as a promising route toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multimodal understanding in the physical world. It provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent- and action-related tasks can be applied to both physical and virtual worlds when trained on cross-reality data. We present a general overview of an Agent AI system that can perceive and act in many different domains and applications, potentially serving as a route toward AGI via the agent paradigm.
1 Introduction
1.1 Motivation
Historically, AI systems were defined at the 1956 Dartmouth Conference as artificial life forms that could collect information from the environment and interact with it in useful ways. Motivated by this definition, Minsky’s MIT group built a robotics system in 1970, called the “Copy Demo,” that observed “blocks world” scenes and successfully reconstructed the observed polyhedral block structures. The system, which comprised observation, planning, and manipulation modules, revealed that each of these subproblems was highly challenging and that further research was necessary. The AI field subsequently fragmented into specialized subfields that have largely independently made great progress in tackling these and other problems, but over-reductionism has blurred the overarching goals of AI research.
To advance beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian Holism. Fortunately, the recent revolution in Large Language Models (LLMs) and Visual Language Models (VLMs) has made it possible to create novel AI agents consistent with the holistic ideal. Seizing upon this opportunity, this article explores models that integrate language proficiency, visual cognition, context memory, intuitive reasoning, and adaptability, and examines how LLMs and VLMs might complete this holistic synthesis. In our exploration, we also revisit system design based on Aristotle’s Final Cause, the teleological “why the system exists,” which may have been overlooked in previous rounds of AI development.
With the advent of powerful pretrained LLMs and VLMs, a renaissance in natural language processing and computer vision has been catalyzed. LLMs now demonstrate an impressive ability to decipher the nuances of real-world linguistic data, often achieving abilities that parallel or even surpass human expertise OpenAI (2023). Recently, researchers have shown that LLMs may be extended to act as agents within various environments, performing intricate actions and tasks when paired with domain-specific knowledge and modules Xi et al. (2023). These scenarios, characterized by complex reasoning, understanding of the agent’s role and its environment, along with multi-step planning, test the agent’s ability to make highly nuanced and intricate decisions within its environmental constraints Wu et al. (2023); Meta Fundamental AI Research (FAIR) Diplomacy Team et al. (2022).
Building upon these initial efforts, the AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments.
In this context, this article investigates the immense potential of using LLMs and VLMs as agents, emphasizing models that have a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability.
Leveraging LLMs and VLMs as agents, especially within domains like gaming, robotics, and healthcare, promises not just a rigorous evaluation platform for state-of-the-art AI systems, but also foreshadows the transformative impacts that Agent-centric AI will have across society and industries. When fully harnessed, agentic models can redefine human experiences and elevate operational standards. The potential for sweeping automation ushered in by these models portends monumental shifts in industries and socio-economic dynamics. Such advancements will be intertwined with multifaceted challenges, not only technical but also ethical, as we will elaborate upon in Section 11. We delve into the overlapping areas of these sub-fields of Agent AI and illustrate their interconnectedness in Fig. 1.
1.2 Background
We will now introduce relevant research papers that support the concepts, theoretical background, and modern implementations of Agent AI.
Large Foundation Models:
LLMs and VLMs have been driving the effort to develop general intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained using large text corpora, their superior problem-solving capacity is not limited to canonical language processing domains. LLMs can potentially tackle complex tasks that were previously presumed to be exclusive to human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering questions of professional law (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown the possibility of using LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a, b; Yao et al., 2023a; Huang et al., 2023a), marking an important milestone for LLMs as general-purpose intelligent agents.
Embodied AI:
A number of works leverage LLMs to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), specifically exploiting the LLMs’ WWW-scale domain knowledge and emergent zero-shot embodied abilities for complex task planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instructions into a sequence of subtasks, either in natural language form or in Python code, then using a low-level controller to execute these subtasks. Additionally, they incorporate environmental feedback to improve task performance (Huang et al., 2022b), (Liang et al., 2022), (Wang et al., 2023a), and (Ikeuchi et al., 2023).
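As a concrete sketch of this decomposition pattern, the following toy example turns a natural-language instruction into subtasks with an LLM and dispatches them to a low-level controller. It is only an illustration of the general pipeline, not any of the cited systems; `call_llm` and `LowLevelController` are hypothetical placeholders.

```python
# Minimal sketch of LLM-based task decomposition followed by low-level execution.
# `call_llm` and `LowLevelController` are hypothetical stand-ins, not a real API.
from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder for a foundation-model call; returns one subtask per line."""
    return "1. locate the cup\n2. grasp the cup\n3. place the cup on the shelf"


class LowLevelController:
    """Placeholder skill library that maps a subtask string to motor commands."""

    def execute(self, subtask: str) -> bool:
        print(f"executing: {subtask}")
        return True  # report success so the loop can continue


def plan_and_execute(instruction: str, env_description: str) -> List[str]:
    # Ask the LLM to decompose the instruction, conditioned on the environment.
    prompt = (
        f"Environment: {env_description}\n"
        f"Instruction: {instruction}\n"
        "Decompose the instruction into an ordered list of executable subtasks."
    )
    subtasks = [line.split(". ", 1)[-1] for line in call_llm(prompt).splitlines()]

    controller = LowLevelController()
    completed = []
    for subtask in subtasks:
        if controller.execute(subtask):   # environmental feedback could be
            completed.append(subtask)      # injected here to replan on failure
    return completed


if __name__ == "__main__":
    plan_and_execute("put the cup on the shelf", "a table with a cup; a shelf nearby")
```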
Interactive Learning:
AI agents designed for interactive learning operate using a combination of machine learning techniques and user interactions. Initially, the AI agent is trained on a large dataset. This dataset includes various types of information, depending on the intended function of the agent. For instance, an AI designed for language tasks would be trained on a massive corpus of text data. The training involves using machine learning algorithms, which could include deep learning models like neural networks. These trained models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. The AI agent can also learn from real-time interactions with users. This interactive learning can occur in various ways: 1) Feedback-based learning: The AI adapts its responses based on direct user feedback Li et al. (2023b); Yu et al. (2023a); Parakh et al. (2023); Zha et al. (2023); Wake et al. (2023a, b, c). For example, if a user corrects the AI’s response, the AI can use this information to improve future responses Zha et al. (2023); Liu et al. (2023a). 2) Observational Learning: The AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in a particular way, the AI might adjust its responses to better suit these patterns. This allows the AI agent to understand and process human language, operate in multimodal settings, interpret cross-reality contexts, and generate responses for human users. Over time, with more user interactions and feedback, the AI agent’s performance generally improves continuously. This process is often supervised by human operators or developers who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
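The feedback-based variant described above can be sketched as a simple loop in which explicit user corrections are stored and preferred for similar future queries. This is a toy illustration assuming exact-match memory; real systems would rely on learned reward models or fine-tuning.

```python
# Toy sketch of feedback-based interactive learning: user corrections are stored
# and reused for similar future queries. Assumes exact-match lookup for brevity.
from typing import Dict


class FeedbackAgent:
    def __init__(self) -> None:
        self.corrections: Dict[str, str] = {}  # query -> user-preferred answer

    def respond(self, query: str) -> str:
        # Prefer an answer the user has explicitly corrected before.
        if query in self.corrections:
            return self.corrections[query]
        return f"(model answer for: {query})"  # placeholder for a model call

    def receive_feedback(self, query: str, corrected_answer: str) -> None:
        # Direct user feedback overrides the model's default behaviour.
        self.corrections[query] = corrected_answer


agent = FeedbackAgent()
print(agent.respond("capital of Australia?"))          # default model answer
agent.receive_feedback("capital of Australia?", "Canberra")
print(agent.respond("capital of Australia?"))          # corrected answer reused
```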
1.3 Overview
Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on the understanding of multimodal sensory input. With the advent of Large Language Models (LLMs) and Vision-Language Models (VLMs), numerous MAA systems have been proposed in fields ranging from basic research to applications. While these research areas are growing rapidly by integrating with the traditional technologies of each domain (e.g., visual question answering and vision-language navigation), they share common interests such as data collection, benchmarking, and ethical perspectives. In this paper, we focus on some representative research areas of MAA, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge on the common concerns discussed in these fields. As a result, we expect readers to learn the fundamentals of MAA and gain insights to further advance its research. Specific learning outcomes include:
• MAA Overview: A deep dive into its principles and roles in contemporary applications, providing researchers with a thorough grasp of its importance and uses.
• Methodologies: Detailed examples of how LLMs and VLMs enhance MAAs, illustrated through case studies in gaming, robotics, and healthcare.
• Performance Evaluation: Guidance on the assessment of MAAs with relevant datasets, focusing on their effectiveness and generalization.
• Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices.
• Emerging Trends and Future Challenges: A categorization of the latest developments in each domain and a discussion of future directions.
Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to become truly valuable to its users, it must be natural to interact with and able to generalize to a broad range of contexts and modalities. We aim to cultivate a vibrant research ecosystem and create a shared sense of identity and purpose among the Agent AI community. MAA has the potential to be widely applicable across various contexts and modalities, including input from humans. Therefore, we believe the Agent AI area can engage a diverse range of researchers, fostering a dynamic Agent AI community and shared goals. Led by esteemed experts from academia and industry, we expect that this paper will be an interactive and enriching experience, complete with agent instruction, case studies, task sessions, and experiment discussions, ensuring a comprehensive and engaging learning experience for all researchers.
This paper aims to provide general and comprehensive knowledge about the current research in the field of Agent AI. To this end, the rest of the paper is organized as follows.
Section 2 outlines how Agent AI benefits from integrating with related emerging technologies, particularly large foundation models.
Section 3 describes a new paradigm and framework that we propose for training Agent AI.
Section 4 provides an overview of the methodologies that are widely used in the training of Agent AI.
Section 5 categorizes and discusses various types of agents.
Section 6 introduces Agent AI applications in gaming, robotics, and healthcare.
Section 7 explores the research community’s efforts to develop a versatile Agent AI, capable of being applied across various modalities and domains and of bridging the sim-to-real gap.
Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models, but also continuously learns and self-improves by leveraging interactions with the environment and users.
Section 9 introduces our new datasets that are designed for the training of multimodal Agent AI.
Section 11 discusses the ethical considerations of AI agents, the limitations, and the societal impact of our paper.
2 Agent AI Integration
Foundation models based on LLMs and VLMs, as proposed in previous research, still exhibit limited performance in the area of embodied AI, particularly in terms of understanding, generating, editing, and interacting within unseen environments or scenarios Huang et al. (2023a); Zeng et al. (2023). Consequently, these limitations lead to sub-optimal outputs from AI agents.
Current agent-centric AI modeling approaches focus on directly accessible and clearly defined data (e.g. text or string representations of the world state) and generally use domain and environment-independent patterns learned from their large-scale pretraining to predict action outputs for each environment Xi et al. (2023); Wang et al. (2023c); Gong et al. (2023a); Wu et al. (2023).
In (Huang et al., 2023a), we investigate the task of knowledge-guided collaborative and interactive scene generation by combining large foundation models, and show promising results indicating that knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, alongside other human-agent interactions Huang et al. (2023a). By integrating an Agent AI framework, large foundation models are able to more deeply understand user input and form a complex and adaptive HCI system. The emergent abilities of LLMs and VLMs play an often invisible role across generative AI, embodied AI, knowledge augmentation for multimodal learning, mixed-reality generation, text-to-vision editing, and human interaction for 2D/3D simulation in gaming or robotics tasks. Recent progress in foundation models for Agent AI presents an imminent catalyst for unlocking general intelligence in embodied agents. Large action models, or agent-vision-language models, open new possibilities for general-purpose embodied systems, such as planning, problem-solving, and learning in complex environments. Agent AI is a further testing step toward the metaverse, and a route to an early version of AGI.

Figure 2: A multimodal Agent AI for 2D/3D embodied generation and editing interaction in cross-reality.
2.1 Infinite AI agent
AI agents have the capacity to interpret, predict, and respond based on their training and input data. While these capabilities are advanced and continually improving, it’s important to recognize their limitations and the influence of the underlying data they are trained on. AI agent systems generally possess the following abilities:
1) Predictive Modeling: AI agents can predict likely outcomes or suggest next steps based on historical data and trends. For instance, they might predict the continuation of a text, the answer to a question, the next action for a robot, or the resolution of a scenario. 2) Decision Making: In some applications, AI agents can make decisions based on their inferences. Generally, an agent will base its decision on what is most likely to achieve a specified goal. For AI applications like recommendation systems, an agent can decide what products or content to recommend based on its inferences about user preferences. 3) Handling Ambiguity: AI agents can often handle ambiguous input by inferring the most likely interpretation based on context and training. However, their ability to do so is limited by the scope of their training data and algorithms. 4) Continuous Improvement: While some AI agents have the ability to learn from new data and interactions, many large language models do not continuously update their knowledge base or internal representation after training. Their inferences are usually based solely on the data that was available up to the point of their last training update.
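As a minimal illustration of the predictive-modeling and decision-making abilities listed above, the sketch below scores candidate actions against inferred user preferences and selects the most promising one. The scoring function is a hypothetical stand-in for a learned model.

```python
# Toy sketch of agent decision making: score candidate actions against inferred
# user preferences and pick the most promising one. The scorer is a stand-in
# for a learned predictive model.
from typing import Dict, List


def score_action(action: str, preferences: Dict[str, float]) -> float:
    # Hypothetical scorer: sum the preference weights of keywords in the action.
    return sum(w for key, w in preferences.items() if key in action)


def decide(candidates: List[str], preferences: Dict[str, float]) -> str:
    return max(candidates, key=lambda a: score_action(a, preferences))


prefs = {"documentary": 0.9, "comedy": 0.4, "horror": -0.5}
options = ["recommend a documentary", "recommend a horror film", "recommend a comedy"]
print(decide(options, prefs))  # -> "recommend a documentary"
```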
We show augmented interactive agents for multi-modality and cross reality-agnostic integration with an emergence mechanism in Fig. 2. An AI agent requires collecting extensive training data for every new task, which can be costly or impossible for many domains. In this study, we develop an infinite agent that learns to transfer memory information from general foundation models (e.g., GPT-X, DALL-E) to novel domains or scenarios for scene understanding, generation, and interactive editing in physical or virtual worlds.
An application of such an infinite agent in robotics is RoboGen Wang et al. (2023d). In this study, the authors propose a pipeline that autonomously runs cycles of task proposition, environment generation, and skill learning. RoboGen is an effort to transfer the knowledge embedded in large models to robotics.
2.2 Agent AI with Large Foundation Models
Recent studies have indicated that large foundation models play a crucial role in creating data that act as benchmarks for determining the actions of agents within environment-imposed constraints. For example, foundation models have been used for robotic manipulation Black et al. (2023); Ko et al. (2023) and navigation Shah et al. (2023a); Zhou et al. (2023a). To illustrate, Black et al. employed an image-editing model as a high-level planner to generate images of future sub-goals, thereby guiding low-level policies Black et al. (2023). For robot navigation, Shah et al. proposed a system that employs an LLM to identify landmarks from text and a VLM to associate these landmarks with visual inputs, enhancing navigation through natural language instructions Shah et al. (2023a).
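A minimal sketch of this LLM-plus-VLM pattern is given below: an LLM extracts landmark phrases from an instruction, and a VLM-style similarity score grounds each landmark in observed images. The functions here are hypothetical placeholders, not the interfaces of the cited systems.

```python
# Sketch of the LLM + VLM navigation pattern: an LLM extracts landmark phrases
# from an instruction, and a VLM-style scorer matches each landmark to observed
# images. All components here are hypothetical placeholders.
from typing import Dict, List


def call_llm(prompt: str) -> List[str]:
    """Placeholder LLM call that returns landmark phrases from an instruction."""
    return ["the red mailbox", "the glass door"]


def vlm_similarity(image_id: str, text: str) -> float:
    """Placeholder for a VLM image-text similarity score (e.g. CLIP-style)."""
    return float(len(set(image_id.split("_")) & set(text.split())))


def ground_landmarks(instruction: str, observed_images: List[str]) -> Dict[str, str]:
    landmarks = call_llm(f"List the landmarks mentioned in: {instruction}")
    # Associate each landmark with the observation that matches it best.
    return {
        lm: max(observed_images, key=lambda img: vlm_similarity(img, lm))
        for lm in landmarks
    }


images = ["frame_red_mailbox", "frame_glass_door", "frame_tree"]
print(ground_landmarks("go past the red mailbox to the glass door", images))
```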
There is also growing interest in the generation of conditioned human motions in response to language and environmental factors. Several AI systems have been proposed to generate motions and actions that are tailored to specific linguistic instructions Kim et al. (2023); Zhang et al. (2022); Tevet et al. (2022) and to adapt to various 3D scenes Wang et al. (2022a). This body of research emphasizes the growing capabilities of generative models in enhancing the adaptability and responsiveness of AI agents across diverse scenarios.
2.2.1 Hallucinations
Agents that generate text are often prone to hallucinations, which are instances where the generated text is nonsensical or unfaithful to the provided source content Raunak et al. (2021); Maynez et al. (2020). Hallucinations can be split into two categories, intrinsic and extrinsic Ji et al. (2023). Intrinsic hallucinations are hallucinations that are contradictory to the source material, whereas extrinsic hallucinations are when the generated text contains additional information that was not originally included in the source material.
Some promising routes for reducing the rate of hallucination in language generation involve using retrieval-augmented generation Lewis et al. (2020); Shuster et al. (2021) or other methods for grounding natural language outputs via external knowledge retrieval Dziri et al. (2021); Peng et al. (2023). Generally, these methods seek to augment language generation by retrieving additional source material and by providing mechanisms to check for contradictions between the generated response and the source material.
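The sketch below illustrates this retrieval-augmented pattern with a toy lexical retriever, a placeholder generator, and a crude faithfulness check; it is only a schematic of the idea, with all components assumed rather than drawn from the cited works.

```python
# Minimal sketch of retrieval-augmented generation with a simple grounding check.
# The retriever, generator, and contradiction check are hypothetical stand-ins.
from typing import List

KNOWLEDGE_BASE = [
    "The Copy Demo was built at MIT in 1970.",
    "Agent AI systems perceive visual and language inputs and produce actions.",
]


def retrieve(query: str, k: int = 2) -> List[str]:
    # Toy lexical retrieval: rank documents by word overlap with the query.
    q = set(query.lower().split())
    return sorted(KNOWLEDGE_BASE, key=lambda d: -len(q & set(d.lower().split())))[:k]


def generate(query: str, evidence: List[str]) -> str:
    # Placeholder for an LLM call conditioned on the retrieved evidence.
    return f"Answer to '{query}' grounded in: {evidence[0]}"


def grounded_answer(query: str) -> str:
    evidence = retrieve(query)
    answer = generate(query, evidence)
    # Crude faithfulness check: require some overlap between answer and evidence.
    if not any(set(e.lower().split()) & set(answer.lower().split()) for e in evidence):
        return "I am not confident enough to answer."  # fall back rather than hallucinate
    return answer


print(grounded_answer("When was the Copy Demo built?"))
```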
Within the context of multi-modal agent systems, VLMs have been shown to hallucinate as well Zhou et al. (2023b). One common cause of hallucination for vision-based language-generation is due to the over-reliance on co-occurrence of objects and visual cues in the training data Rohrbach et al. (2018). AI agents that exclusively rely upon pretrained LLMs or VLMs and use limited environment-specific finetuning can be particularly vulnerable to hallucinations since they rely upon the internal knowledge-base of the pretrained models for generating actions and may not accurately understand the dynamics of the world state in which they are deployed.
2.2.2 Biases and Inclusivity
AI agents based on LLMs or LMMs (large multimodal models) have biases due to several factors inherent in their design and training process. When designing these AI agents, we must be mindful of being inclusive and aware of the needs of all end users and stakeholders. In the context of AI agents, inclusivity refers to the measures and principles employed to ensure that the agent’s responses and interactions are inclusive, respectful, and sensitive to a wide range of users from diverse backgrounds. We list key aspects of agent biases and inclusivity below.
• Training Data: Foundation models are trained on vast amounts of text data collected from the internet, including books, articles, websites, and other text sources. This data often reflects the biases present in human society, and the model can inadvertently learn and reproduce these biases. This includes stereotypes, prejudices, and slanted viewpoints related to race, gender, ethnicity, religion, and other personal attributes. In particular, by training on internet data and often only English text, models implicitly learn the cultural norms of Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies Henrich et al. (2010) who have a disproportionately large internet presence. However, it is essential to recognize that datasets created by humans cannot be entirely devoid of bias, since they frequently mirror the societal biases and the predispositions of the individuals who generated and/or compiled the data initially.
• Historical and Cultural Biases: AI models are trained on large datasets sourced from diverse content. Thus, the training data often includes historical texts or materials from various cultures. In particular, training data from historical sources may contain offensive or derogatory language representing a particular society’s cultural norms, attitudes, and prejudices. This can lead to the model perpetuating outdated stereotypes or not fully understanding contemporary cultural shifts and nuances.
• Language and Context Limitations: Language models might struggle with understanding and accurately representing nuances in language, such as sarcasm, humor, or cultural references. This can lead to misinterpretations or biased responses in certain contexts. Furthermore, there are many aspects of spoken language that are not captured by pure text data, leading to a potential disconnect between human understanding of language and how models understand language.
• Policies and Guidelines: AI agents operate under strict policies and guidelines to ensure fairness and inclusivity. For instance, in generating images, there are rules to diversify depictions of people, avoiding stereotypes related to race, gender, and other attributes.
• Overgeneralization: These models tend to generate responses based on patterns seen in the training data. This can lead to overgeneralizations, where the model might produce responses that seem to stereotype or make broad assumptions about certain groups.
• Constant Monitoring and Updating: AI systems are continuously monitored and updated to address any emerging biases or inclusivity issues. Feedback from users and ongoing research in AI ethics play a crucial role in this process.
• Amplification of Dominant Views: Since the training data often includes more content from dominant cultures or groups, the model may be more biased towards these perspectives, potentially underrepresenting or misrepresenting minority viewpoints.
• Ethical and Inclusive Design: AI tools should be designed with ethical considerations and inclusivity as core principles. This includes respecting cultural differences, promoting diversity, and ensuring that the AI does not perpetuate harmful stereotypes.
• User Guidelines: Users are also guided on how to interact with AI in a manner that promotes inclusivity and respect. This includes refraining from requests that could lead to biased or inappropriate outputs. Furthermore, it can help mitigate models learning harmful material from user interactions.
Despite these measures, AI agents still exhibit biases. Ongoing efforts in agent AI research and development are focused on further reducing these biases and enhancing the inclusivity and fairness of agent AI systems. Efforts to Mitigate Biases:
• Diverse and Inclusive Training Data: Efforts are made to include a more diverse and inclusive range of sources in the training data.
• Bias Detection and Correction: Ongoing research focuses on detecting and correcting biases in model responses.
• Ethical Guidelines and Policies: Models are often governed by ethical guidelines and policies designed to mitigate biases and ensure respectful and inclusive interactions.
• Diverse Representation: Ensuring that the content generated or the responses provided by the AI agent represent a wide range of human experiences, cultures, ethnicities, and identities. This is particularly relevant in scenarios like image generation or narrative construction.
• Bias Mitigation: Actively working to reduce biases in the AI’s responses. This includes biases related to race, gender, age, disability, sexual orientation, and other personal characteristics. The goal is to provide fair and balanced responses that do not perpetuate stereotypes or prejudices.
• Cultural Sensitivity: The AI is designed to be culturally sensitive, acknowledging and respecting the diversity of cultural norms, practices, and values. This includes understanding and appropriately responding to cultural references and nuances.
• Accessibility: Ensuring that the AI agent is accessible to users with different abilities, including those with disabilities. This can involve incorporating features that make interactions easier for people with visual, auditory, motor, or cognitive impairments.
• Language-based Inclusivity: Providing support for multiple languages and dialects to cater to a global user base, and being sensitive to the nuances and variations within a language Liu et al. (2023b).
• Ethical and Respectful Interactions: The Agent is programmed to interact ethically and respectfully with all users, avoiding responses that could be deemed offensive, harmful, or disrespectful.
• User Feedback and Adaptation: Incorporating user feedback to continually improve the inclusivity and effectiveness of the AI agent. This includes learning from interactions to better understand and serve a diverse user base.
• Compliance with Inclusivity Guidelines: Adhering to established guidelines and standards for inclusivity in AI agents, which are often set by industry groups, ethical boards, or regulatory bodies.
Despite these efforts, it’s important to be aware of the potential for biases in responses and to interpret them with critical thinking. Continuous improvements in AI agent technology and ethical practices aim to reduce these biases over time. One of the overarching goals for inclusivity in agent AI is to create an agent that is respectful and accessible to all users, regardless of their background or identity.
2.2.3 Data Privacy and Usage
One key ethical consideration of AI agents involves comprehending how these systems handle, store, and potentially retrieve user data. We discuss key aspects below:
Data Collection, Usage and Purpose.
When using user data to improve model performance, model developers access the data the AI agent has collected while in production and interacting with users. Some systems allow users to view their data through user accounts or by making a request to the service provider. It is important to recognize what data the AI agent collects during these interactions. This could include text inputs, user usage patterns, personal preferences, and sometimes more sensitive personal information. Users should also understand how the data collected from their interactions is used. If, for some reason, the AI holds incorrect information about a particular person or group, there should be a mechanism for users to help correct this once identified. This is important for both accuracy and to be respectful of all users and groups. Common uses for retrieving and analyzing user data include improving user interaction, personalizing responses, and system optimization. It is extremely important for developers to ensure the data is not used for purposes that users have not consented to, such as unsolicited marketing.
Storage and Security.
Developers should know where the user interaction data is stored and what security measures are in place to protect it from unauthorized access or breaches. This includes encryption, secure servers, and data protection protocols. It is extremely important to determine if agent data is shared with third parties and under what conditions. This should be transparent and typically requires user consent.
Data Deletion and Retention.
It is also important for users to understand how long user data is stored and how users can request its deletion. Many data protection laws give users the right to be forgotten, meaning they can request their data be erased. AI agents must adhere to data protection laws like GDPR in the EU or CCPA in California. These laws govern data handling practices and user rights regarding their personal data.
Data Portability and Privacy Policy.
Furthermore, developers must create the AI agent’s privacy policy to document and explain to users how their data is handled. This should detail data collection, usage, storage, and user rights. Developers should ensure that they obtain user consent for data collection, especially for sensitive information. Users typically have the option to opt-out or limit the data they provide. In some jurisdictions, users may even have the right to request a copy of their data in a format that can be transferred to another service provider.
Anonymization.
For data used in broader analysis or AI training, it should ideally be anonymized to protect individual identities. Developers must understand how their AI agent retrieves and uses historical user data during interactions. This could be for personalization or improving response relevance.
In summary, understanding data privacy for AI agents involves being aware of how user data is collected, used, stored, and protected, and ensuring that users understand their rights regarding accessing, correcting, and deleting their data. Awareness of the mechanisms for data retrieval, both by users and the AI agent, is also crucial for a comprehensive understanding of data privacy.
2.2.4 Interpretability and Explainability
Imitation Learning Decoupling.
Agents are typically trained using a continuous feedback loop in Reinforcement Learning (RL) or Imitation Learning (IL), starting with a randomly initialized policy. However, this approach faces challenges in obtaining initial rewards in unfamiliar environments, particularly when rewards are sparse or only available at the end of a long-step interaction.
Thus, a superior solution is to use an infinite-memory agent trained through IL, which can learn policies from expert data, improving exploration and utilization of unseen environmental space with emergent infrastructure, as shown in Fig. 3. The expert characteristics help the agent explore better and utilize the unseen environmental space, and Agent AI can learn policies and new paradigm flows directly from expert data.
Traditional IL has an agent mimicking an expert demonstrator’s behavior to learn a policy. However, learning the expert policy directly may not always be the best approach, as the agent may not generalize well to unseen situations. To tackle this, we propose learning an agent with an in-context prompt or an implicit reward function that captures key aspects of the expert’s behavior, as shown in Fig. 3. This equips the infinite-memory agent with physical-world behavior data for task execution, learned from expert demonstrations. It helps overcome existing imitation learning drawbacks, such as the need for extensive expert data and potential errors in complex tasks.
The key idea behind Agent AI has two parts: 1) the infinite agent that collects physical-world expert demonstrations as state-action pairs and 2) the virtual environment that imitates the agent generator. The imitating agent produces actions that mimic the expert’s behavior, while the agent learns a policy mapping from states to actions by reducing a loss function of the disparity between the expert’s actions and the actions generated by the learned policy.
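Written as code, this imitation objective is essentially behavior cloning: the policy is trained to minimize the disparity between its actions and the expert’s on recorded state-action pairs. The sketch below uses plain PyTorch and synthetic data as a generic illustration, not the specific training setup of the proposed system.

```python
# Behavior cloning sketch: learn a policy that maps states to actions by
# minimizing the disparity with expert actions. Synthetic data, plain PyTorch.
import torch
import torch.nn as nn

state_dim, action_dim, n_demos = 8, 4, 256

# Expert demonstrations as state-action pairs (here: random synthetic data).
states = torch.randn(n_demos, state_dim)
expert_actions = torch.randn(n_demos, action_dim)

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # disparity between expert actions and policy actions

for step in range(200):
    optimizer.zero_grad()
    predicted_actions = policy(states)
    loss = loss_fn(predicted_actions, expert_actions)
    loss.backward()
    optimizer.step()

print(f"final imitation loss: {loss.item():.4f}")
```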
Decoupling Generalization.
Rather than relying on a task-specific reward function, the agent learns from expert demonstrations, which provide a diverse set of state-action pairs covering various task aspects. The agent then learns a policy that maps states to actions by imitating the expert’s behavior. Decoupling in imitation learning refers to separating the learning process from the task-specific reward function, allowing the policy to generalize across different tasks without explicit reliance on the task-specific reward function. By decoupling, the agent can learn from expert demonstrations and learn a policy that is adaptable to a variety of situations. Decoupling enables transfer learning, where a policy learned in one domain can adapt to others with minimal fine-tuning. By learning a general policy that is not tied to a specific reward function, the agent can leverage the knowledge it acquired in one task to perform well in other related tasks. Since the agent does not rely on a specific reward function, it can adapt to changes in the reward function or environment without the need for significant retraining. This makes the learned policy more robust and generalizable across different environments. Decoupling in this context refers to the separation of two tasks in the learning process: learning the reward function and learning the optimal policy.

Figure 3: An example of the emergent interaction mechanism, which uses an agent to identify image-relevant text from candidates. The task involves integrating external world information using a multimodal AI agent with knowledge from the web and human-annotated knowledge interaction samples.
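To make the decoupling concrete, the sketch below separates reward learning from policy learning behind two independent interfaces, so the policy learner can be reused when the reward function or domain changes. Both components are illustrative toys, not a particular algorithm from the text.

```python
# Sketch of decoupling: a reward model is learned separately from the policy,
# so the policy-learning step can be reused when the reward or domain changes.
# Interfaces are illustrative only.
from typing import Callable, List, Tuple

State, Action = str, str
RewardFn = Callable[[State, Action], float]


def learn_reward(demos: List[Tuple[State, Action]]) -> RewardFn:
    # Toy "learned" reward: state-action pairs seen in expert demos score 1.0.
    expert_pairs = set(demos)
    return lambda s, a: 1.0 if (s, a) in expert_pairs else 0.0


def learn_policy(reward: RewardFn, states: List[State], actions: List[Action]):
    # Generic policy learner: greedily pick the highest-reward action per state.
    return {s: max(actions, key=lambda a: reward(s, a)) for s in states}


demos = [("door closed", "open door"), ("cup on table", "pick up cup")]
reward_fn = learn_reward(demos)                       # step 1: reward learning
policy = learn_policy(reward_fn, ["door closed", "cup on table"],
                      ["open door", "pick up cup", "wait"])
print(policy)
```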
Generalization Emergent Behavior.
Generalization explains how emergent properties or behaviors can arise from simpler components or rules. The key idea lies in identifying the basic elements or rules that govern the behavior of the system, such as individual neurons or basic algorithms, and then observing how these simple components or rules interact with one another. The interactions of these components often lead to the emergence of complex behaviors, which are not predictable by examining individual components alone. Generalization across different levels of complexity allows a system to learn general principles applicable across these levels, leading to emergent properties. This enables the system to adapt to new situations, demonstrating the emergence of more complex behaviors from simpler rules. Furthermore, the ability to generalize across different complexity levels facilitates knowledge transfer from one domain to another, which contributes to the emergence of complex behaviors in new contexts as the system adapts.
2.2.5 Inference Augmentation
The inference ability of an AI agent lies in its capacity to interpret, predict, and respond based on its training and input data. While these capabilities are advanced and continually improving, it’s important to recognize their limitations and the influence of the underlying data they are trained on. Particularly, in the context of large language models, it refers to its capacity to draw conclusions, make predictions, and generate responses based on the data it has been trained on and the input it receives. Inference augmentation in AI agents refers to enhancing the AI’s natural inference abilities with additional tools, techniques, or data to improve its performance, accuracy, and utility. This can be particularly important in complex decision-making scenarios or when dealing with nuanced or specialized content. We denote particularly important sources for inference augmentation below:
Data Enrichment.
Incorporating additional, often external, data sources to provide more context or background can help the AI agent make more informed inferences, especially in areas where its training data may be limited. For example, AI agents can infer meaning from the context of a conversation or text. They analyze the given information and use it to understand the intent and relevant details of user queries. These models are proficient at recognizing patterns in data. They use this ability to make inferences about language, user behavior, or other relevant phenomena based on the patterns they’ve learned during training.
Algorithm Enhancement.
Improving the AI’s underlying algorithms to make better inferences. This could involve using more advanced machine learning models, integrating different types of AI (like combining NLP with image recognition), or updating algorithms to better handle complex tasks. Inference in language models involves understanding and generating human language. This includes grasping nuances like tone, intent, and the subtleties of different linguistic constructions.
Human-in-the-Loop (HITL).
Involving human input to augment the AI’s inferences can be particularly useful in areas where human judgment is crucial, such as ethical considerations, creative tasks, or ambiguous scenarios. Humans can provide guidance, correct errors, or offer insights that the agent would not be able to infer on its own.
Real-Time Feedback Integration.
Using real-time feedback from users or the environment to enhance inferences is another promising method for improving performance during inference. For example, an AI might adjust its recommendations based on live user responses or changing conditions in a dynamic system. Or, if the agent is taking actions in a simulated environment that break certain rules, the agent can be dynamically given feedback to help correct itself.
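A minimal sketch of this kind of real-time correction loop is shown below: a proposed action is checked against environment rules, and the rule-violation feedback is fed back into the next proposal. The rule set and proposal function are illustrative placeholders.

```python
# Sketch of real-time feedback integration: a proposed action is checked against
# environment rules, and rule-violation feedback triggers a corrected proposal.
# The rule set and proposal function are illustrative placeholders.
from typing import Optional

FORBIDDEN = {"enter restricted zone"}  # toy environment rule


def propose_action(goal: str, feedback: Optional[str] = None) -> str:
    # Placeholder for a model call; feedback would normally be added to the prompt.
    if feedback:
        return "take detour around restricted zone"
    return "enter restricted zone"


def act_with_feedback(goal: str, max_retries: int = 3) -> str:
    feedback = None
    for _ in range(max_retries):
        action = propose_action(goal, feedback)
        if action not in FORBIDDEN:
            return action                  # accepted by the environment
        feedback = f"action '{action}' violates an environment rule"
    raise RuntimeError("no admissible action found")


print(act_with_feedback("reach the charging station"))
```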
Cross-Domain Knowledge Transfer.
Leveraging knowledge or models from one domain to improve inferences in another can be particularly helpful when producing outputs within a specialized discipline. For instance, techniques developed for language translation might be applied to code generation, or insights from medical diagnostics could enhance predictive maintenance in machinery.
Customization for Specific Use Cases.
Tailoring the AI’s inference capabilities for particular applications or industries can involve training the AI on specialized datasets or fine-tuning its models to better suit specific tasks, such as legal analysis, medical diagnosis, or financial forecasting. Since the particular language or information within one domain can greatly contrast with the language from other domains, it can be beneficial to finetune the agent on domain-specific information.
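One simple way to realize such customization is to freeze a pretrained backbone and fine-tune only a small head on domain-specific examples. The sketch below uses a toy frozen backbone and synthetic data in PyTorch; it illustrates the pattern only, not a production fine-tuning recipe.

```python
# Sketch of domain customization: freeze a pretrained backbone and fine-tune a
# small task head on domain-specific data. Backbone and data are synthetic toys.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # stands in for a pretrained model
for p in backbone.parameters():
    p.requires_grad = False                               # keep general knowledge frozen

domain_head = nn.Linear(32, 3)                            # e.g. 3 domain-specific labels
optimizer = torch.optim.AdamW(domain_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic domain-specific dataset.
x = torch.randn(128, 16)
y = torch.randint(0, 3, (128,))

for epoch in range(50):
    optimizer.zero_grad()
    with torch.no_grad():
        features = backbone(x)              # frozen general-purpose features
    loss = loss_fn(domain_head(features), y)
    loss.backward()
    optimizer.step()

print(f"domain fine-tuning loss: {loss.item():.4f}")
```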
Ethical and Bias Considerations.
It is important to ensure that the augmentation process does not introduce new biases or ethical issues. This involves careful consideration of the sources of additional data or the impact of the new inference augmentation algorithms on fairness and transparency. When making inferences, especially about sensitive topics, AI agents must sometimes navigate ethical considerations. This involves avoiding harmful stereotypes, respecting privacy, and ensuring fairness.
Continuous Learning and Adaptation.
Regularly updating and refining the AI’s capabilities to keep up with new developments, changing data landscapes, and evolving user needs.
In summary, inference augmentation in AI agents involves methods by which their natural inference abilities can be enhanced through additional data, improved algorithms, human input, and other techniques. Depending on the use case, this augmentation is often essential for dealing with complex tasks and ensuring accuracy in the agent’s outputs.
2.2.6 Regulation
Recently, Agent AI has made significant advancements, and its integration into embodied systems has opened new possibilities for interacting with agents via more immersive, dynamic, and engaging experiences. To expedite the process and ease the cumbersome work of agent AI development, we propose developing a next-generation AI-empowered pipeline for agent interaction: a human-machine collaboration system in which humans and machines can communicate and interact meaningfully. The system can leverage the LLM’s or VLM’s dialogue capabilities and vast action space to talk with human players and identify human needs, and it will then perform the proper actions to help human players upon request.
When employing LLM/VLMs for a human-machine collaboration system, it is essential to note that these models operate as black boxes, generating unpredictable output. This uncertainty can become crucial in a physical setup, such as operating actual robots. An approach to address this challenge is constraining the focus of the LLM/VLM through prompt engineering. For instance, in robotic task planning from instructions, providing environmental information within the prompt has been reported to yield more stable outputs than relying solely on text Gramopadhye and Szafir (2022). This report is supported by Minsky’s frame theory of AI Minsky (1975), which suggests that the problem space to be solved by LLM/VLMs is defined by the given prompts. Another approach is designing prompts that make LLM/VLMs include explanatory text, allowing users to understand what the model has focused on or recognized. Additionally, implementing a higher layer that allows for pre-execution verification and modification under human guidance can facilitate the operation of systems working under such guidance (Fig. 4).

Figure 4: The robot teaching system developed by Wake et al. (2023c). (Left) The system workflow. The process includes three steps: task planning, in which ChatGPT plans robotic tasks from instructions and environmental information; and demonstration, in which the user visually demonstrates the action sequences. All steps are reviewed by the user, and if any step fails or shows deficiencies, the earlier steps can be revisited as needed. (Right) A web application that supports uploading demonstration data and the interaction between the user and ChatGPT.
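The two mitigations discussed above, grounding the prompt in environmental information and adding a pre-execution verification step under human guidance, can be sketched as follows. The prompt format and `call_llm` are hypothetical; the point is only the ordering of plan, explanation, confirmation, and execution.

```python
# Sketch of prompt engineering plus pre-execution verification for a robot task.
# `call_llm` is a hypothetical placeholder; the prompt embeds environment state
# so the plan stays within the observed scene, and a human confirms before execution.
from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder LLM call; returns a plan plus a short explanation."""
    return "PLAN: 1) grasp cup 2) place cup on tray\nWHY: the cup is the only graspable object"


def plan_with_verification(instruction: str, environment: List[str]) -> str:
    prompt = (
        "You control a robot. Only use objects listed in the environment.\n"
        f"Environment objects: {', '.join(environment)}\n"
        f"Instruction: {instruction}\n"
        "Return a numbered plan and a brief explanation of what you focused on."
    )
    plan = call_llm(prompt)
    print(plan)                                  # explanatory text shown to the user
    approved = input("Execute this plan? [y/n] ").strip().lower() == "y"
    return plan if approved else "plan rejected; please modify the instruction"


plan_with_verification("put the cup on the tray", ["cup", "tray", "table"])
```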
2.3 Agent AI for Emergent Abilities
Despite the growing adoption of interactive agent AI systems, the majority of proposed methods still face a challenge in terms of their generalization performance in unseen environments or scenarios.
Current modeling practices require developers to prepare large datasets for each domain to finetune/pretrain models; however, this process is costly and even impossible if the domain is new. To address this issue, we build interactive agents that leverage the knowledge-memory of general-purpose foundation models (ChatGPT, Dall-E, GPT-4, etc.) for a novel scenario, specifically for generating a collaboration space between humans and agents. We discover an emergent mechanism, which we name Mixed Reality with Knowledge Inference Interaction, that facilitates collaboration with humans to solve challenging tasks in complex real-world environments and enables the exploration of unseen environments for adaptation to virtual reality. For this mechanism, the agent learns i) micro-reactions in cross-modality: collecting relevant individual knowledge for each interaction task (e.g., understanding unseen scenes) from explicit web sources and by implicitly inferring from the output of pretrained models; and ii) macro-behavior in a reality-agnostic manner: improving interactive dimensions and patterns in language and multi-modality domains, and making changes based on characterized roles, certain target variables, and the influenced diversification of collaborative information in mixed reality and LLMs. We investigate the task of knowledge-guided interactive synergistic effects for collaborative scene generation by combining various OpenAI models, and show promising results on how the interactive agent system can further boost large foundation models in our setting. This integrates and improves the depth of generalization, consciousness, and interpretability of a complex adaptive AI system.
尽管越来越多地采用交互式代理人工智能系统,但大多数提出的方法在看不见的环境或场景中的泛化性能方面仍然面临挑战。当前的建模实践要求开发人员为每个领域准备大型数据集来微调/预训练模型;然而,如果领域是新的,这个过程成本高昂,甚至是不可能的。为了解决这个问题,我们建立了交互式代理,利用通用基础模型(ChatGPT,Dall-E,GPT-4等)的知识记忆。用于一种新颖的场景,特别是用于生成人与代理之间的协作空间。我们发现了一种新兴机制-我们将其命名为混合现实与知识推理交互-促进与人类的合作,以解决复杂现实环境中具有挑战性的任务,并能够探索未知的环境以适应虚拟现实。 对于该机制,代理学习i)交叉模态中的微反应:为每个交互任务收集相关的个体知识(例如,理解看不见的场景)从显式的网络源和通过从预训练模型的输出隐式推断; ii)现实不可知的宏观行为:改善语言和多模态域中的交互维度和模式,并根据特征化角色,某些目标变量,混合现实中协作信息的受影响多样化和LLMs进行更改。我们研究了知识引导的交互协同效应的任务,以协同场景生成与各种OpenAI模型相结合,并展示了交互式代理系统如何进一步提升我们设置中的大型基础模型的有希望的结果。它集成并提高了复杂自适应人工智能系统的泛化、意识和可解释性的深度。
3 Agent AI Paradigm
In this section, we discuss a new paradigm and framework for training Agent AI. We seek to accomplish several goals with our proposed framework:
• Make use of existing pre-trained models and pre-training strategies to effectively bootstrap our agents with an understanding of important modalities, such as text or visual inputs.
• Support sufficient long-term task-planning capabilities.
• Incorporate a framework for memory that allows learned knowledge to be encoded and retrieved later.
• Allow environmental feedback to be used to effectively train the agent to learn which actions to take.
We show a high-level new agent diagram outlining the important submodules of such a system in Fig. 5.

Figure 5: The new agent paradigm for our proposed multimodal generalist agent. As shown, there are five main modules: 1) Environment and Perception, with task planning and skill observation; 2) Agent Learning; 3) Memory; 4) Agent Action; and 5) Cognition.
3.1 LLMs and VLMs
We can use an LLM or VLM to bootstrap the components of the agent, as shown in Fig. 5. In particular, LLMs have been shown to perform well for task planning Gong et al. (2023a), contain significant world knowledge Yu et al. (2023b), and display impressive logical reasoning capabilities Creswell et al. (2022). Additionally, VLMs such as CLIP Radford et al. (2021) provide a general, language-aligned visual encoder, as well as zero-shot visual recognition capabilities. For example, state-of-the-art open-source multi-modal models such as LLaVA Liu et al. (2023c) and InstructBLIP Dai et al. (2023) rely upon frozen CLIP models as visual encoders.
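As an illustration of the zero-shot visual recognition that a frozen CLIP encoder can contribute to an agent, the following sketch uses the Hugging Face transformers CLIP wrappers; the checkpoint name, image path, and candidate labels are placeholder choices.

# Minimal sketch of the kind of zero-shot visual recognition a frozen CLIP
# encoder provides to an agent. Uses the Hugging Face `transformers` CLIP
# wrappers; the checkpoint name, image path, and label set are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")                       # a frame from the agent's camera (assumed)
labels = ["a kitchen", "an office", "a living room"]  # candidate scene descriptions

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a distribution over the candidate labels (zero-shot classification).
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")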

Figure 6: The current paradigm for creating multimodal AI agents by combining a large language model (LLM) with a large vision model (LVM). Typically, these models take visual or language inputs and use pre-trained, frozen vision and language models, learning smaller sub-networks that connect and bridge the modalities. Examples include Flamingo Alayrac et al. (2022), BLIP-2 Li et al. (2023c), InstructBLIP Dai et al. (2023), and LLaVA Liu et al. (2023c).
3.2 Agent Transformer Definition
Instead of using frozen LLMs and VLMs for the AI agent, it is also possible to use a single agent transformer model that takes visual tokens and language tokens as input, similar to Gato Reed et al. (2022). In addition to vision and language, we add a third general type of input, which we denote as agent tokens. Conceptually, agent tokens are used to reserve a specific subspace of the input and output space of the model for agentic behaviors. For robotics or game playing, this may be represented as the input action space of the controller. Agent tokens can also be used when training agents to use specific tools, such as image-generation or image-editing models, or to make other API calls. As shown in Fig. 7, we can combine the agent tokens with visual and language tokens to generate a unified interface for training multi-modal agent AI. Compared to using large, proprietary LLMs as agents, there are several advantages to using an agent transformer. Firstly, the model can be easily customized to very specific agentic tasks that may be difficult to represent in natural language (e.g., controller inputs or other specific actions); thus, the agent can learn from environmental interactions and domain-specific data to improve performance. Secondly, it can be easier to understand why the model does or does not take specific actions by inspecting the probabilities of the agent tokens. Thirdly, certain domains such as healthcare and law have strict data privacy requirements, which a smaller model trained and hosted within the domain can more easily satisfy. Finally, a relatively small agent transformer can potentially be significantly cheaper than a larger proprietary language model.
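The following sketch illustrates one plausible way to reserve agent tokens in a model's vocabulary. GPT-2 is used only as a convenient stand-in backbone, and the <ACT_...> token names and action set are hypothetical, not the tokens used in any specific system.

# Illustrative sketch of "agent tokens": reserving part of a model's input/
# output vocabulary for discrete agentic actions. GPT-2 is a stand-in
# backbone; the <ACT_...> token names and the action set are hypothetical.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One discrete token per controller action (e.g., a game pad or robot macro).
actions = ["move_forward", "turn_left", "turn_right", "grasp", "release"]
agent_tokens = [f"<ACT_{name}>" for name in actions]
tokenizer.add_special_tokens({"additional_special_tokens": agent_tokens})
model.resize_token_embeddings(len(tokenizer))

# A training example is then just one sequence mixing language (and, in a
# full system, visual) tokens with agent tokens in the reserved subspace.
example = "instruction: pick up the cube <ACT_move_forward> <ACT_grasp>"
ids = tokenizer(example, return_tensors="pt").input_ids
print(ids)
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))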

Figure 7: The unified agent multimodal transformer model. Instead of connecting frozen submodules and using existing foundation models as building blocks, we propose a unified, end-to-end training paradigm for agent systems. We can still initialize the submodules with LLMs and LVMs, as in Fig. 6, but we also use agent tokens: specialized tokens for training the model to perform agentic behaviors in a specific domain (e.g., robotics). See Section 3.2 for more details on agent tokens.
3.3 Agent Transformer Creation
As shown above in Fig. 5, we can use the new agent paradigm with LLM- and VLM-bootstrapped agents, as well as leverage data generated from large foundation models, to train the agent transformer model to execute specific goals. Within this process, the agent model is trained to be specialized and tailored for specific tasks and domains. This approach leverages the learned features and knowledge of a pre-existing foundation model. We show a simplified overview of the process in two steps below:
Define Objectives within the Domain.
In order to train the agent transformer, the objectives and the action space of the agent within the context of each specific environment need to be clearly defined. This includes determining which specific tasks or actions the agent needs to perform and assigning unique agent tokens for each. Furthermore, any automatic rules or procedures that can be used to identify the successful completion of tasks can significantly increase the amount of data available for training; otherwise, foundation-model-generated or human-annotated data will be required to train the model. After the data is collected and it is possible to evaluate the performance of the agent, the process of continuous improvement can begin.
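A minimal sketch of this step, assuming a simple dictionary-style environment state: each task is paired with an agent token and, where possible, an automatic success rule so finished rollouts can be labeled without human annotation. The task names, state keys, and checking rules are illustrative assumptions.

# Sketch of "defining objectives within the domain": each task gets an agent
# token and, where possible, an automatic success check so that rollouts can
# be labeled without human annotation. Task names, the state dictionary, and
# the checking rules are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskSpec:
    name: str
    agent_token: str                         # token reserved for this task/action
    success_check: Callable[[dict], bool]    # automatic rule over environment state


TASKS = [
    TaskSpec("open_door", "<ACT_open_door>", lambda s: s.get("door_open", False)),
    TaskSpec("pick_cube", "<ACT_pick_cube>", lambda s: s.get("held_object") == "cube"),
]


def label_rollout(final_state: dict) -> dict:
    """Label a finished episode with per-task success, to grow training data."""
    return {t.name: t.success_check(final_state) for t in TASKS}


print(label_rollout({"door_open": True, "held_object": None}))
# -> {'open_door': True, 'pick_cube': False}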
Continuous Improvement.
Continuous monitoring of the model's performance and collection of feedback are essential steps in the process. Feedback should be used for further fine-tuning and updates. It is also crucial to ensure that the model does not perpetuate biases or unethical outcomes. This necessitates a careful examination of the training data, regular checks for biases in outputs, and, if needed, training the model to recognize and avoid biases. Once the model achieves satisfactory performance, it can be deployed for the intended application. Continuous monitoring remains vital to ensure that the model performs as expected and to facilitate necessary adjustments. More details on this process, sources of training data, and continuous learning for agent AI can be found in Section 8.
4 Agent AI Learning
4.1 Strategy and Mechanism
Our strategy for interactive AI across different domains extends the paradigm of calling large foundation models with a trained agent that actively seeks to collect user feedback, action information, and useful knowledge for generation and interaction. In some cases, the LLM/VLM models do not need to be retrained, and we improve their performance by providing improved contextual prompts to the agent at test time.
In other cases, the strategy involves interactive modeling of knowledge, reasoning, commonsense, and inference through a combination of three systems: one performing knowledge retrieval from multimodal queries, a second performing interactive generation with the relevant agent, and a third performing new, informative self-supervised training or pre-training with reinforcement learning or imitation learning in an improved way.
4.1.1 Reinforcement Learning (RL)
There is a rich history of leveraging reinforcement learning (RL) to train interactive agents that exhibit intelligent behaviors. RL is a methodology for learning the optimal relationship between states and actions based on the rewards (or penalties) received as a result of the agent's actions. RL is a highly scalable framework that has been applied to numerous applications, including robotics; however, it generally faces several challenges, and LLM/VLMs have shown their potential to mitigate or overcome some of those difficulties:
• Reward designing: The efficiency of policy learning greatly depends on the design of the reward function. Designing the reward function requires not only knowledge of RL algorithms but also a deep understanding of the nature of the task, and thus often necessitates crafting the function based on expert experience. Several studies have explored the use of LLM/VLMs for designing reward functions Yu et al. (2023a); Katara et al. (2023); Ma et al. (2023).
• Data collection and efficiency: Given its exploratory nature, RL-based policy learning requires a significant amount of data Padalkar et al. (2023). The necessity for extensive data becomes particularly evident when the policy involves managing long sequences or integrating complex actions, because these scenarios demand more nuanced decision-making and learning from a wider range of situations. In recent studies, efforts have been directed towards enhancing data generation to support policy learning Kumar et al. (2023); Du et al. (2023). Additionally, in some studies, these models have been integrated into the reward function to improve policy learning Sontakke et al. (2023). Parallel to these developments, another strand of research has focused on achieving parameter efficiency in learning processes using VLMs Tang et al. (2023); Li et al. (2023d) and LLMs Shi et al. (2023).
• Long-horizon steps: In relation to the issue of data efficiency, RL becomes more challenging as the length of action sequences increases. This is due to the ambiguity in the relationship between actions and rewards, known as the credit assignment problem, and the increase in the number of states to be explored, both of which necessitate a significant amount of time and data. One typical approach for long and complex tasks is to break them down into a sequence of subgoals and apply pretrained policies to solve each subgoal (e.g., Takamatsu et al. (2022)). This idea falls within the framework of task and motion planning (TAMP) Garrett et al. (2021). TAMP is composed of two primary components: task planning, which entails identifying sequences of high-level actions, and motion planning, which involves finding physically consistent, collision-free trajectories to achieve the objectives of the task plan.

LLMs are well-suited to TAMP, and recent research has often adopted an approach where LLMs are used to execute high-level task planning, while low-level controls are addressed with RL-based policies Xu et al. (2023); Sun et al. (2023a); Li et al. (2023b); Parakh et al. (2023). The advanced capabilities of LLMs enable them to effectively decompose even abstract instructions into subgoals Wake et al. (2023c), contributing to the enhancement of language understanding abilities in robotic systems; a minimal sketch of this decomposition pattern is shown below.
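A minimal sketch of this high-level/low-level split follows; the call_llm() stub, skill names, and canned decomposition are placeholders rather than any published planner.

# Minimal sketch of the pattern above: an LLM decomposes an abstract
# instruction into subgoals, and each subgoal is dispatched to a (here
# stubbed) pretrained low-level policy.

import json


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned decomposition."""
    return json.dumps(["navigate_to(fridge)", "open(fridge)", "pick(soda)", "close(fridge)"])


LOW_LEVEL_POLICIES = {
    # In practice these would be RL- or IL-trained controllers.
    "navigate_to": lambda arg: print(f"[policy] navigating to {arg}"),
    "open":        lambda arg: print(f"[policy] opening {arg}"),
    "pick":        lambda arg: print(f"[policy] picking {arg}"),
    "close":       lambda arg: print(f"[policy] closing {arg}"),
}


def run(instruction: str) -> None:
    prompt = (
        "Decompose the instruction into subgoals of the form skill(argument), "
        f"using only skills {sorted(LOW_LEVEL_POLICIES)}.\nInstruction: {instruction}\n"
        "Reply with a JSON list."
    )
    for subgoal in json.loads(call_llm(prompt)):
        skill, arg = subgoal.rstrip(")").split("(")
        LOW_LEVEL_POLICIES[skill](arg)   # low-level control handles each subgoal


run("Get me a soda from the fridge")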
4.1.2 Imitation Learning (IL)
While RL aims to train a policy based on exploratory behavior and maximizing rewards through interactions with the environment, imitation learning (IL) seeks to leverage expert data to mimic the actions of experienced agents or experts. For example, in robotics, one of the major frameworks based on IL is Behavioral Cloning (BC). BC is an approach where a robot is trained to mimic the actions of an expert by directly copying them. In this approach, the expert’s actions in performing specific tasks are recorded, and the robot is trained to replicate these actions in similar situations. Recent BC-based methods often incorporate technologies from LLM/VLMs, enabling more advanced end-to-end models. For example, Brohan et al. proposed RT-1 Brohan et al. (2022) and RT-2 Brohan et al. (2023), transformer-based models that output an action sequence for the base and arm, taking a series of images and language as input. These models are reported to show high generalization performance as the result of training on a large amount of training data.
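The following is a minimal behavioral-cloning sketch in PyTorch: a small policy network is regressed onto (state, action) pairs. The expert data here is synthetic so that the example is self-contained; dimensions and hyperparameters are arbitrary.

# Minimal behavioral-cloning sketch: supervised regression from states to the
# expert's actions. The "expert" here is a random linear map, purely to make
# the example self-contained.

import torch
from torch import nn

state_dim, action_dim, n_demos = 16, 4, 2048

# Stand-in for recorded expert demonstrations: (state, action) pairs.
torch.manual_seed(0)
expert = torch.randn(state_dim, action_dim)
states = torch.randn(n_demos, state_dim)
actions = states @ expert + 0.01 * torch.randn(n_demos, action_dim)

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(200):
    pred = policy(states)
    loss = nn.functional.mse_loss(pred, actions)   # imitate the expert's actions
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final imitation loss: {loss.item():.4f}")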
4.1.3 Traditional RGB
Learning intelligent agent behavior leveraging image inputs has been of interest for many years Mnih et al. (2015). The inherent challenge of using RGB input is the curse of dimensionality. To solve this problem, researchers either use more data Jang et al. (2022); Ha et al. (2023) or introduce inductive biases into the model design to improve sample efficiency. In particular, authors incorporate 3D structures into the model architecture for manipulations Zeng et al. (2021); Shridhar et al. (2023); Goyal et al. (2023); James and Davison (2022). For robot navigation, authors Chaplot et al. (2020a, b) leverage maps as a representation. Maps can either be learned from a neural network aggregating all previous RGB inputs or through 3D reconstruction methods such as Neural Radiance Fields Rosinol et al. (2022).
To obtain more data, researchers synthesize data using graphics simulators Mu et al. (2021); Gong et al. (2023b), and try to close the sim2real gap Tobin et al. (2017); Sadeghi and Levine (2016); Peng et al. (2018). Recently, there has been some collective effort to curate large-scale datasets that aim to resolve the data scarcity problem Padalkar et al. (2023); Brohan et al. (2023). On the other hand, to improve sample efficiency, data augmentation techniques have been extensively studied as well Zeng et al. (2021); Rao et al. (2020); Haarnoja et al. (2023); Lifshitz et al. (2023).
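As a small illustration of the augmentation side, the sketch below applies standard torchvision transforms (random resized crops and color jitter) to an RGB observation; the transform parameters and image path are illustrative and not taken from any cited method.

# Sketch of the image-side data augmentation commonly used to improve sample
# efficiency for RGB-based policies (random crops, color jitter).

from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(84, scale=(0.8, 1.0)),   # random crop + resize
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

frame = Image.open("camera_frame.png")        # an observation from the agent's camera (assumed)
batch = [augment(frame) for _ in range(8)]    # several augmented views of the same frame
print(batch[0].shape)                         # torch.Size([3, 84, 84])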
4.1.4 In-context Learning
In-context learning was shown to be an effective method for solving tasks in NLP with the advent of large language models like GPT-3 Brown et al. (2020); Min et al. (2022). Few-shot prompts were seen to be an effective way to contextualize model outputs across a variety of NLP tasks by providing examples of the task within the context of the LLM prompt. Factors like the diversity and quality of the examples shown in the in-context demonstrations may improve the quality of model outputs An et al. (2023); Dong et al. (2022). Within the context of multi-modal foundation models, models like Flamingo and BLIP-2 Alayrac et al. (2022); Li et al. (2023c) have been shown to be effective at a variety of visual understanding tasks when given only a small number of examples. In-context learning can be further improved for agents within environments by incorporating environment-specific feedback when certain actions are taken Gong et al. (2023a).
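A minimal sketch of few-shot prompting for an embodied agent follows: demonstration pairs and optional environment feedback are packed into the prompt. The demonstrations and the call_llm() stub are placeholders.

# Sketch of few-shot in-context learning for an embodied agent: a handful of
# (instruction -> action plan) demonstrations are placed in the prompt,
# optionally followed by feedback from the last attempt.

from typing import Optional


def call_llm(prompt: str) -> str:
    return "go_to(shelf); pick(book); go_to(desk); place(book)"   # canned output


FEW_SHOT = [
    ("bring the cup to the sink", "go_to(table); pick(cup); go_to(sink); place(cup)"),
    ("throw away the wrapper",    "go_to(floor); pick(wrapper); go_to(bin); place(wrapper)"),
]


def build_prompt(instruction: str, feedback: Optional[str] = None) -> str:
    lines = ["Translate instructions into action plans."]
    for example_in, example_out in FEW_SHOT:          # in-context demonstrations
        lines.append(f"Instruction: {example_in}\nPlan: {example_out}")
    if feedback:                                      # environment-specific feedback
        lines.append(f"Note: the previous attempt failed because {feedback}.")
    lines.append(f"Instruction: {instruction}\nPlan:")
    return "\n\n".join(lines)


print(call_llm(build_prompt("put the book on the desk", feedback="the gripper was full")))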
4.1.5 Optimization in the Agent System
The optimization of agent systems can be divided into spatial and temporal aspects. Spatial optimization considers how agents operate within a physical space to execute tasks. This includes inter-robot coordination, resource allocation, and keeping an organized space.
In order to effectively optimize agent AI systems, especially systems with large numbers of agents acting in parallel, previous works have focused on using large batch reinforcement learning Shacklett et al. (2023). Since datasets of multi-agent interactions for specific tasks are rare, self-play reinforcement learning enables a team of agents to improve over time. However, this may also lead to very brittle agents that can only work under self-play and not with humans or other independent agents since they over-fit to the self-play training paradigm. To address this issue, we can instead discover a diverse set of conventions Cui et al. (2023); Sarkar et al. (2023), and train an agent that is aware of a wide range of conventions. Foundation models can further help to establish conventions with humans or other independent agents, enabling smooth coordination with new agents.
Temporal optimization, on the other hand, focuses on how agents execute tasks over time. This encompasses task scheduling, sequencing, and timeline efficiency. For instance, optimizing the trajectory of a robot’s arm is an example of efficiently optimizing movement between consecutive tasks Zhou et al. (2023c). At the level of task scheduling, methods like LLM-DP Dagan et al. (2023) and ReAct Yao et al. (2023a) have been proposed to solve efficient task planning by incorporating environmental factors interactively.
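The sketch below shows the general shape of a ReAct-style loop (thought, action, observation); the scripted model replies and the single lookup tool are stand-ins so the loop runs without any external service, and they are not the actual LLM-DP or ReAct implementations.

# Minimal ReAct-style sketch: the model interleaves reasoning ("Thought"),
# tool calls ("Action"), and environment feedback ("Observation") until it
# emits a final answer.

SCRIPTED = [
    "Thought: I should check where the mug is.\nAction: lookup[mug]",
    "Thought: The mug is on the desk, so that is the answer.\nFinal: The mug is on the desk.",
]
WORLD = {"mug": "on the desk", "keys": "in the drawer"}


def scripted_llm(history: str, step: int) -> str:
    return SCRIPTED[step]                     # placeholder for a real LLM call


def lookup(obj: str) -> str:
    return WORLD.get(obj, "unknown")          # toy "tool" the agent can call


def react(question: str, max_steps: int = 4) -> str:
    history = f"Question: {question}"
    for step in range(max_steps):
        reply = scripted_llm(history, step)
        history += "\n" + reply
        if "Final:" in reply:
            return reply.split("Final:", 1)[1].strip()
        if "Action: lookup[" in reply:
            obj = reply.split("Action: lookup[", 1)[1].split("]", 1)[0]
            history += f"\nObservation: the {obj} is {lookup(obj)}."
    return "no answer within budget"


print(react("Where is the mug?"))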
4.2 Agent Systems (zero-shot and few-shot level)
4.2.1 Agent Modules
Our foray into the agent paradigm involves the development of Agent AI "Modules" for interactive multi-modal agents using LLMs or VLMs. Our initial Agent Modules facilitate training or in-context learning and adopt a minimalist design for the purpose of demonstrating the agent's ability to schedule and coordinate effectively. We also explored initial prompt-based memory techniques that facilitate better planning and inform future action approaches within the domain. To illustrate, our "MindAgent" infrastructure comprises 5 main modules: 1) environment perception with task planning, 2) agent learning, 3) memory, 4) general agent action prediction, and 5) cognition, as shown in Figure 5.
4.2.2 Agent Infrastructure
Agent-based AI is a large and fast-growing community within the domains of entertainment, research, and industry. The development of large foundation models has significantly improved the performance of agent AI systems. However, creating agents in this vein is limited by the increasing effort and overall cost necessary to create high-quality datasets. At Microsoft, building high-quality agent infrastructure has significantly impacted multi-modal agent copilots by using advanced hardware, diverse data sources, and powerful software libraries. As Microsoft continues to push the boundaries of agent technology, AI agent platforms are poised to remain a dominant force in the world of multimodal intelligence for years to come. Nevertheless, agent AI interaction is currently still a complex process that requires a combination of multiple skills. The recent advancements in the space of large generative AI models have the potential to greatly reduce the high cost and time currently required for interactive content, both for large studios and for smaller independent content creators, empowering the latter to design high-quality experiences beyond what they are currently capable of. The current human-machine interaction systems inside multi-modal agents are primarily rule-based. They do have intelligent behaviors in response to human/user actions and possess web knowledge to some extent. However, these interactions are often limited by the software development costs required to enable specific behaviors in the system. In addition, current models are not designed to help humans achieve a goal when users are unable to accomplish specific tasks. Therefore, there is a need for an agent AI system infrastructure that can analyze users' behaviors and provide proper support when needed.
4.3 Agentic Foundation Models (pretraining and finetune level)
The use of pre-trained foundation models offers a significant advantage in their wide applicability across diverse use cases. The integration of these models enables the development of customized solutions for various applications, circumventing the need for extensive labeled datasets for each specific task.
A notable example in the field of navigation is the LM-Nav system Shah et al. (2023a), which incorporates GPT-3 and CLIP in a novel approach. It effectively uses textual landmarks generated by the language model, anchoring them in images acquired by robots for navigation. This method demonstrates a seamless fusion of textual and visual data, significantly enhancing the capabilities of robotic navigation, while maintaining wide applicability.
In robot manipulation, several studies have proposed the use of off-the-shelf LLMs (e.g., ChatGPT) while using open vocabulary object detectors. The combination of LLM and advanced object detectors (e.g., Detic Zhou et al. (2022)) facilitates the understanding of human instruction while grounding the textual information in scenery information Parakh et al. (2023). Furthermore, the latest advancements showcase the potential of using prompt engineering with advanced multi-modal models such as GPT-4V(ision) Wake et al. (2023b). This technique opens avenues for multi-modal task planning, underscoring the versatility and adaptability of pre-trained models in a variety of contexts.
5 Agent AI Categorization
5.1 Generalist Agent Areas
Computer-based action and generalist agents (GAs) are useful for many tasks. Recent progress in the field of large foundation models and interactive AI has enabled new functionalities for GAs. However, for a GA to become truly valuable to its users, it must be natural to interact with, and generalize to a broad range of contexts and modalities.
We present extended chapters on Agent foundation AI in Sec. 6, especially in areas relevant to the general themes of these topics:
Multimodal Agent AI (MAA) is an upcoming forum (current URL: https://multimodalagentai.github.io/) for our research and industry communities to engage with each other and with the broader research and technology communities in Agent AI. Recent progress in the field of large foundation models and interactive AI has enabled new functionalities for generalist agents (GAs), such as predicting user actions and task planning in constrained settings (e.g., MindAgent Gong et al. (2023a), fine-grained multimodal video understanding Luo et al. (2022), robotics Ahn et al. (2022b); Brohan et al. (2023)), or providing a chat companion for users that incorporates knowledge feedback (e.g., website customer support for healthcare systems Peng et al. (2023)). More details about the representative works, including the most recent ones, are given below.
We hope to discuss our vision for the future of MAA and inspire future researchers to work in this space.
This article and our forum cover the following main topics, but are not limited exclusively to these:
• Primary Subject Topics: Multimodal Agent AI, General Agent AI
• Secondary Subject Topics: Embodied Agents, Action Agents, Language-based Agents, Vision & Language Agents, Knowledge and Inference Agents, Agents for Gaming, Robotics, Healthcare, etc.
• Extended Subject Topics: Visual Navigation, Simulation Environments, Rearrangement, Agentic Foundation Models, VR/AR/MR, Embodied Vision & Language.
Next, we present a specific list of representative agent categories:
5.2 Embodied Agents
Our biological minds live in bodies, and our bodies move through a changing world. The goal of embodied artificial intelligence is to create agents, such as robots, which learn to creatively solve challenging tasks requiring interaction with the environment. While this is a significant challenge, important advances in deep learning and the increasing availability of large datasets like ImageNet have enabled superhuman performance on a variety of AI tasks previously thought intractable. Computer vision, speech recognition, and natural language processing have experienced transformative revolutions on passive input-output tasks like language translation and image classification, and reinforcement learning has similarly achieved world-class performance at interactive tasks like game playing. These advances have supercharged embodied AI, enabling a growing number of users to make rapid progress towards intelligent agents that can interact with machines.
5.2.1 Action Agents
Action agents refer to the agents that need to execute physical actions in the simulated physical environment or real world. In particular, they need to be actively engaging in activities with the environment. We broadly classify action agents into two different categories based on their application domains: gaming AI and robotics.
In gaming AI, agents interact with the game environment and other independent entities. In these settings, natural language can enable smooth communication between agents and humans. Depending on the game, there may be a specific task to accomplish, providing a true reward signal. For instance, in the competitive Diplomacy game, training a language model on human conversation data along with an RL-based action policy enables human-level play Meta Fundamental AI Research (FAIR) Diplomacy Team et al. (2022).
There are also settings where agents act as normal residents in a town Park et al. (2023a), without trying to optimize a specific goal. Foundation models are useful in these settings because they can model interactions that appear more natural by mimicking human behavior. When augmented with external memory, they produce convincing agents that can hold conversations, keep daily schedules, form relationships, and live a virtual life.
5.2.2 Interactive Agents
Interactive agents simply refer to agents that can interact with the world, a broader class of agents than action agents. Their forms of interaction do not necessarily require physical actions, but may involve communicating information to users or modifying the environment. For instance, an embodied interactive agent may answer a user’s questions about a topic through dialogue or help users parse through existing information similar to a chatbot. By extending an agent’s capabilities to include information sharing, the core designs and algorithms of Agent AI can be effectively adapted for a range of applications, such as diagnostic Lee et al. (2023) and knowledge-retrieval Peng et al. (2023) agents.
5.3 Simulation and Environments Agents
An effective approach for AI agents to learn how to act in an environment is to go through trial-and-error experiences via interactions with the environment. A representative method is RL, which requires extensive experience of failures to train an agent. Although there exist approaches that use physical agents Kalashnikov et al. (2018), using physical agents is time-consuming and costly. Furthermore, training in the physical environment is often infeasible when failure in the actual environment can be dangerous (e.g., autonomous driving, underwater vehicles). Hence, using simulators to learn policies is a common approach.
Many simulation platforms have been proposed for research in embodied AI, ranging from navigation Tsoi et al. (2022); Deitke et al. (2020); Kolve et al. (2017) to object manipulation Wang et al. (2023d); Mees et al. (2022); Yang et al. (2023a); Ehsani et al. (2021). One example is Habitat Savva et al. (2019); Szot et al. (2021), which provides a 3D indoor environment where human- and robotic-agents can perform various tasks such as navigation, instruction following, and question answering. Another representative simulation platform is VirtualHome Puig et al. (2018), supporting human avatars for object manipulation in 3D indoor environments. In the field of gaming, Carroll et al. have introduced "Overcooked-AI," a benchmark environment designed to study cooperative tasks between humans and AI Carroll et al. (2019). Along similar lines, several works aim to incorporate real human intervention beyond the focus of interaction between agents and the environment Puig et al. (2023); Li et al. (2021a); Srivastava et al. (2022). These simulators contribute to the learning of policies in practical settings involving agent and robot interactions, and IL-based policy learning utilizing human demonstrative actions.
In certain scenarios, the process of learning a policy may necessitate the integration of specialized features within simulators. For example, in the case of learning image-based policies, realistic rendering is often required to facilitate adaptability to real environments Mittal et al. (2023); Zhong et al. (2023). Utilizing a realistic rendering engine is effective for generating images that reflect various conditions, such as lighting environments. Moreover, simulators employing physics engines are required to simulate physical interactions with objects Liu and Negrut (2021). The integration of physics engines in simulation has been shown to facilitate the acquisition of skills that are applicable in real-world scenarios Saito et al. (2023).
5.4 Generative Agents
The recent advancements in the space of large generative AI models have the potential to greatly reduce the current high cost and time required for interactive content, both for large gaming studios, as well as empower smaller independent studios to create high quality experiences beyond what they are currently capable of. Additionally, embedding large AI models within a sandbox environment will allow users to author their own experiences and express their creativity in ways that are currently out of reach.
The goals of this agent go beyond simply adding interactive 3D content to scenes; they also include:
• Adding arbitrary behavior and rules of interaction to the objects, allowing the user to create their own VR rules with minimal prompting.
• Generating whole level geometry from a sketch on a piece of paper, by using the multimodal GPT-4V model, as well as other chains of models involving vision AI models.
• Retexturing content in scenes using diffusion models.
• Creating custom shaders and visual special effects from simple user prompts.
One potential application in the short term is a VR storyboarding/prototyping tool allowing a single user to create a rough (but functional) sketch of an experience/game an order of magnitude faster than currently feasible. Such a prototype could then be expanded and polished using these same tools.
5.4.1 AR/VR/mixed-reality Agents
AR/VR/mixed-reality (jointly referred to as XR) settings currently require skilled artists and animators to create characters, environments, and objects to be used to model interactions in virtual worlds. This is a costly process that involves concept art, 3D modeling, texturing, rigging, and animation. XR agents can assist in this process by facilitating interactions between creators and building tools to help build the final virtual environment.
Our early experiments have already demonstrated that GPT models can be used in the few-shot regime inside the Unity engine (without any additional fine-tuning) to call engine-specific methods, use API calls to download 3D models from the internet and place them into the scene, and assign state trees of behaviors and animations to them Huang et al. (2023a). This behavior likely emerges due to the presence of similar code in open-source game repositories that use Unity. Therefore, GPT models are capable of building rich visual scenes by loading many objects into the scene from a simple user prompt.
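The sketch below illustrates the general pattern, not the actual Unity integration: the model is prompted to emit structured placement commands, and a thin wrapper applies them to the engine. The SceneAPI class, asset names, and call_llm() stub are hypothetical.

# Illustrative sketch of the scene-building pattern: the model emits
# structured placement commands, which a thin wrapper then applies to the
# engine. Everything here is a stand-in, not real engine bindings.

import json


def call_llm(prompt: str) -> str:
    return json.dumps([
        {"asset": "pine_tree", "position": [2, 0, 5]},
        {"asset": "wooden_bench", "position": [0, 0, 3]},
    ])


class SceneAPI:
    """Stand-in for engine bindings that download and place 3D assets."""
    def place(self, asset: str, position: list) -> None:
        print(f"[engine] placing {asset} at {position}")


PROMPT = (
    "Emit a JSON list of objects to add to the scene, each with 'asset' and "
    "'position' [x, y, z].\nUser request: a small park with a tree and a bench."
)

scene = SceneAPI()
for cmd in json.loads(call_llm(PROMPT)):
    scene.place(cmd["asset"], cmd["position"])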
The aim of this category of agents is to build a platform and a set of tools that provide an efficient interface between large AI models (both GPT-family ones as well as diffusion image models) and a rendering engine. We explore two primary avenues here:
• Integration of large models into the various editor tools in the agent infrastructure, allowing for significant speedups in development.
• Controlling the rendering engine from within a user experience, by generating code that follows user instruction and then compiling it at runtime, allowing users to potentially edit the VR/simulation they are interacting with in arbitrary ways, even by introducing new agent mechanics.
Introducing an AI copilot focused on XR settings would be useful for XR creators, who can use the copilot to complete tedious tasks, like providing simple assets or writing code boilerplate, freeing creators to focus on their creative vision and quickly iterate on ideas.
Furthermore, agents can help users interactively modify the environment by adding new assets, changing the dynamics of the environment, or building new settings. This form of dynamic generation during runtime can also be specified by a creator, enabling the user’s experience to feel fresh and continue evolving over time.
5.5 Knowledge and Logical Inference Agents
The capacity to infer and apply knowledge is a defining feature of human cognition, particularly evident in complex tasks such as logical deduction and understanding theory of mind (see https://plato.stanford.edu/entries/cognitive-science). Making inferences over knowledge ensures that the AI's responses and actions are consistent with known facts and logical principles. This coherence is a crucial mechanism for maintaining trust and reliability in AI systems, especially in critical applications like medical diagnosis or legal analysis. Here, we introduce agents that incorporate the interplay between knowledge and inference and that address specific facets of intelligence and reasoning.
5.5.1 Knowledge Agent
Knowledge Agents reason over their acquired knowledge systems in two directions: implicit and explicit. Implicit knowledge is typically what large-scale language models like the GPT series Brown et al. (2020); OpenAI (2023) encapsulate after being trained on vast amounts of text data. These models can generate responses that give the impression of understanding, as they draw on patterns and information implicitly learned during training. Explicit knowledge, conversely, is structured and can be directly queried, such as the information found in knowledge bases or databases, which was traditionally used to enhance AI reasoning capabilities by referencing verifiable external resources.
Despite the advancements in language models, their implicit knowledge is static and becomes outdated as the world evolves Lewis et al. (2020); Peng et al. (2023). This limitation necessitates the integration of explicit knowledge sources that are updated continuously, ensuring that AI systems can provide accurate and current responses. The fusion of implicit and explicit knowledge equips AI agents with a more nuanced understanding and the ability to apply knowledge contextually, akin to human intelligence Gao et al. (2022). Such integration is crucial for crafting knowledge-centric AI agents that not only possess information but can also understand, explain, and employ it, thereby narrowing the chasm between extensive learning and profound knowledge (Marcus and Davis, 2019; Gao et al., 2020). These agents are designed to reason with flexibility and dynamic information about the world, enhancing their robustness and adaptability (Marcus, 2020).
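A minimal sketch of this fusion, assuming a toy fact store and simple word-overlap retrieval: relevant explicit facts are retrieved and prepended to the prompt so that answers can reflect information newer than the model's implicit knowledge. The facts, scoring rule, and call_llm() stub are illustrative placeholders.

# Sketch of pairing implicit model knowledge with an explicit, updatable
# knowledge source: facts are retrieved by word overlap and prepended to the
# prompt before the model answers.

def call_llm(prompt: str) -> str:
    return "Answer grounded in the retrieved facts above."   # placeholder


FACTS = [
    "The clinic's new opening hours are 08:00-18:00 on weekdays.",
    "Appointments can be rescheduled online up to 24 hours in advance.",
    "The parking garage on Elm Street is closed for renovation.",
]


def retrieve(query: str, k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(FACTS, key=lambda f: len(q & set(f.lower().split())), reverse=True)
    return scored[:k]                         # top-k facts by word overlap


def answer(query: str) -> str:
    facts = "\n".join(f"- {f}" for f in retrieve(query))
    prompt = f"Use only these facts:\n{facts}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)


print(answer("What are the clinic opening hours?"))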
5.5.2 Logic Agents
Generally, a logic agent is a component of a system designed to apply logical reasoning to process data or to solve tasks that require logical inference. Logic agents within the context of large foundation models like GPT-4 refer to specialized components or submodules designed to handle logical reasoning tasks. These tasks often involve understanding and manipulating abstract concepts, deducing conclusions from given premises, or solving problems that require a structured, logical approach. Broadly, foundation models like GPT-4 are trained on a vast corpus of text data and learn to perform a wide range of tasks, including those that require some form of logical reasoning. Thus, their capability for logical reasoning is integrated into the overall architecture, and they generally do not possess a distinct, isolated "logic agent". While GPT-4 and similar models can perform tasks that involve logic, their approach is fundamentally different from how humans or traditional logic-based systems operate. They do not follow formal logical rules or have an explicit understanding of logic; rather, they generate responses based on patterns learned from the training data. As a result, their performance on logical tasks can be impressive, but it can also be inconsistent or limited by the nature of the training data and the inherent limitations of the model's design. One example of embedding a separate logical submodule into the architecture is (Wang et al., 2023e), which modifies the token embedding process used by LLMs during pre-training by parsing text into logical segments and explicitly modeling logical hierarchies in the token embeddings.
5.5.3 Agents for Emotional Reasoning
Emotional understanding and empathy are important skills for agents in many human-machine interactions. To illustrate, one important goal for creating engaging dialogue agents is to have the agents act with increased emotion and empathy while minimizing socially inappropriate or offensive outputs. To advance towards this goal for dialogue agents, we released the Neural Image Commenting with Empathy (NICE) dataset Chen et al. (2021), consisting of almost two million images with corresponding human-generated comments and a set of human emotion annotations. We also provided a novel pre-training model, Modeling Affect Generation for Image Comments (MAGIC) Chen et al. (2021), which aims to generate comments for images, conditioned on linguistic representations that capture style and affect, to help produce more empathetic, emotional, engaging, and socially appropriate comments. Our experiments show that the approach is effective in training a more human-like and engaging image comment agent. Developing empathy-aware agents is a promising direction for interactive agents, and it is important to create agents with emotional understanding capabilities across a wide range of groups and populations, especially considering that many current language models exhibit bias in their emotional understanding and empathetic reasoning capabilities Mao et al. (2022); Wake et al. (2023d).
5.5.4 Neuro-Symbolic Agents
Neuro-Symbolic agents operate on a hybrid system of neurons and symbols d'Avila Garcez and Lamb (2020). Solving problems stated in natural language is a challenging task because it requires explicitly capturing the discrete symbolic structural information implicit in the input. However, most general neural sequence models do not explicitly capture such structural information, limiting their performance on these tasks. Chen et al. (2020) propose TP-N2F, an encoder-decoder model based on a structured neural representation. The encoder of TP-N2F employs TPR 'binding' to encode natural-language symbolic structure in vector space, and the decoder uses TPR 'unbinding' to generate, in symbolic space, a sequential program represented by relational tuples, each consisting of a relation (or operation) and a number of arguments.
Instruction-following vision-language (VL) models like GPT-4 offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. In (Park et al., 2023b), we build a Localized Visual Commonsense model that allows users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt the LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression.
5.6 LLMs and VLMs Agent
A number of works leverage LLMs as agents to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), drawing on the LLMs' large internet-scale domain knowledge and zero-shot planning abilities to perform agentic tasks like planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instructions into a sequence of subtasks, either in natural language form or in Python code, and then using a low-level controller to execute these subtasks. Additionally, Huang et al. (2022b), Liang et al. (2022), and Wang et al. (2023a) also incorporate environmental feedback to improve task performance. There have also been a number of works that demonstrate the ability of general-purpose, visually-aligned large language models trained on large-scale text, image, and video data to serve as a foundation for creating multi-modal agents that are embodied and can act in various environments Baker et al. (2022); Driess et al. (2023); Brohan et al. (2023).
6 Agent AI Application Tasks

Figure 8: An embodied agent for user-interactive game action prediction and interactive editing, with Minecraft Dungeons game-feel simulation and generation via GPT-4V.
6.1 Agents for Gaming
Games provide a unique sandbox to test the agentic behavior of LLMs and VLMs, pushing the boundaries of their collaborative and decision-making abilities. We describe three areas in particular that highlight agent’s abilities to interact with human players and other agents, as well as their ability to take meaningful actions within an environment.
6.1.1 NPC Behavior
In modern gaming systems, the behavior of Non-Player Characters (NPCs) is predominantly dictated by predefined scripts crafted by developers. These scripts encompass a range of reactions and interactions based on various triggers or player actions within the gaming environment. However, this scripted nature often results in predictable or repetitive NPC behavior which fails to evolve in response to player’s actions or the dynamic environment of the game. This rigidity hampers the immersive experience intended in a dynamic gaming environment. Therefore, there is a burgeoning interest in leveraging LLMs to induce autonomy and adaptability in NPC behavior, making interactions more nuanced and engaging. AI-driven NPCs can learn from player behavior, adapt to varying strategies, and provide a more challenging and less predictable gameplay experience. Large Language Models (LLMs) can significantly contribute to evolving NPC behavior in games. By processing vast amounts of text, LLMs can learn patterns and generate responses that are more varied and human-like. They can be utilized to create dynamic dialogue systems, making interactions with NPCs more engaging and less predictable. Furthermore, LLMs can be trained on player feedback and in-game data to continually refine NPC behaviors, making them more attuned to player expectations and game dynamics.
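A minimal sketch of an LLM-driven NPC along these lines: the character keeps a rolling memory of recent player actions and dialogue, and that memory is folded into the prompt so that replies adapt to the player rather than following a fixed script. The persona, events, and call_llm() stub are illustrative placeholders.

# Sketch of an LLM-driven NPC with a rolling memory of player interactions.
# call_llm() is a placeholder returning a canned reply.

from collections import deque


def call_llm(prompt: str) -> str:
    return "Back already? If you found the lost amulet, the reward still stands."


class NPC:
    def __init__(self, persona: str, memory_size: int = 6):
        self.persona = persona
        self.memory = deque(maxlen=memory_size)   # rolling window of recent events

    def observe(self, event: str) -> None:
        self.memory.append(event)                 # e.g., "player sold 3 wolf pelts"

    def reply(self, player_line: str) -> str:
        prompt = (
            f"You are {self.persona}.\n"
            f"Recent events: {'; '.join(self.memory) or 'none'}\n"
            f"Player says: {player_line}\nNPC reply:"
        )
        self.memory.append(f"player said: {player_line}")
        return call_llm(prompt)


npc = NPC("a gruff blacksmith who offers quests")
npc.observe("player accepted the amulet quest")
print(npc.reply("Hello again!"))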
6.1.2 Human-NPC Interaction
The interaction between human players and NPCs is a crucial aspect of the gaming experience. The conventional interaction paradigm is primarily one-dimensional, with NPCs reacting in a preset manner to player inputs. This limitation stifles the potential for a more organic and enriching interaction, akin to human-human interaction within the virtual realm. The advent of LLM and VLM technologies holds the promise of transforming this paradigm. By employing these technologies, gaming systems can analyze and learn from human behavior to provide more human-like interactions. This not only enhances the realism and engagement of the game but also provides a platform for exploring and understanding human-machine interaction in a controlled yet complex setting.

Figure 9: GPT-4V can effectively predict high-level next actions when given prompts containing the "action history" and the "game goal." Moreover, GPT-4V accurately recognizes that the player is holding wood and can incorporate this perceived information into its plan for future actions. Although GPT-4V appears able to predict some low-level actions (e.g., pressing "E" to open the inventory), the model's output is not directly suitable for raw low-level action prediction (including mouse movements), and a supplementary module may be needed for low-level action control.
6.1.3 Agent-based Analysis of Gaming
Gaming is an integral part of daily life, estimated to engage half of the world’s population (https://www.dfcint.com/global-video-game-audience-reaches-3-7-billion/). Additionally, it exhibits a positive impact on mental health (https://news.microsoft.com/source/features/work-life/mind-games-how-gaming-can-play-a-positive-role-in-mental-health/). However, contemporary game systems exhibit a deficiency in interactions with human players since their behaviors are primarily hand-crafted by game developers. These pre-programmed behaviors frequently fail to adapt to players’ needs. Consequently, there exists a need for new AI systems in games that can analyze player behaviors and furnish appropriate support when necessary. Intelligent interactive systems bear the potential to revolutionize how gamers interact with gaming systems in general. NPCs’ interactions with gamers are no longer confined by the restricted rule sets designed by game developers. They have the potential to adapt seamlessly to gamers’ experiences, providing timely feedback to enrich the gaming experience and elevate the synergy of human-machine interaction.
LLMs can serve as a robust tool for analyzing in-game text data, including chat logs, player feedback, and narrative content. They can help in identifying patterns of player behavior, preferences, and interactions which can be invaluable for game developers to improve game mechanics and narratives. Additionally, VLMs can parse through large quantities of image and video data from gaming sessions to help analyze user intent and actions within the game world. Moreover, LLMs and VLMs can facilitate the development of intelligent agents within games that can communicate with players and other agents in a sophisticated and human-like manner, enhancing the overall gaming experience. Beyond LLMs and VLMs, user input data provides a promising avenue for creating game-playing agents that model perception, game playing, and game understanding by imitating human players. By incorporating a combination of player interactions and feedback, pixel inputs, and natural language planning and understanding, agent models can assist in the continuous improvement of game dynamics, driving a more player-centric evolution of the gaming environment.
6.1.4 Scene Synthesis for Gaming

Figure 10: Masked video prediction on unseen Minecraft videos. From left to right: the original frames, the masked frames, the reconstructed frames, and the reconstructed frames with patches.
Scene synthesis is a vital component in the creation and enhancement of immersive gaming environments. It entails the automatic or semi-automatic generation of three-dimensional (3D) scenes and environments within a game. This process includes the generation of terrain, placement of objects, creation of realistic lighting, and sometimes even dynamic weather systems.
Modern games often feature vast, open-world environments. Manually designing these landscapes can be incredibly time-consuming and resource-intensive. Automated terrain generation, often leveraging procedural or AI-driven techniques, can produce complex, realistic landscapes with less manual effort. LLMs and VLMs can utilize internet-scale knowledge to formulate rules to design non-repeating landscapes that are visually impressive and unique. Additionally, LLMs and VLMs can be used to ensure the semantic consistency and variability of generated assets. Placing objects such as buildings, vegetation, and other elements within a scene in a realistic and aesthetically pleasing manner is crucial for immersion.

Figure 11: Low-level next-action prediction in a Minecraft game scene using our small agent pretraining model.
VLMs and LLMs can assist in object placement by adhering to predefined or learned rules and aesthetics, thus speeding up the level design process. VLMs and LLMs can be further trained to understand the principles of design and aesthetics, aiding in the procedural generation of content. They can help formulate rules or guidelines that procedural algorithms can follow to generate objects and scenes that are both visually appealing and contextually appropriate.
Realistic lighting and atmospheric effects are fundamental for creating a believable and engaging gaming environment. Advanced algorithms can simulate natural lighting conditions and dynamic weather effects, enhancing the realism and mood of the scene. LLMs can help develop systems to achieve more realistic lighting and atmospheric effects in several innovative ways. VLMs can analyze vast datasets from real-world lighting and atmospheric conditions to help develop more realistic algorithms for simulating these effects in games. By understanding the patterns and intricacies of natural lighting and weather, these models can contribute to the development of algorithms that mimic reality closely. LLMs and VLMs could also be used to develop systems that adjust lighting and atmospheric effects in real-time based on player actions, game states, or external inputs. They can process natural language commands from players to modify the game environment, providing a more interactive and immersive experience.
6.1.5 Experiments and Results
Zero-shot/Few-shot Learning with LLM or LVM.
As shown in Fig. 8 and Fig. 9, we used GPT-4V for high-level description and action prediction. Fig. 8 shows some qualitative examples of action description generation and editing with GPT-4V. Agent-enhanced text opens up a novel method of generating 3D scenes with game action priors to help improve the naturalness of the scene. Consequently, GPT-4V generates relevant high-level descriptions that are appropriate for the gaming videos.
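To make this zero-shot setup concrete, the snippet below sketches one plausible way to query GPT-4V with a game frame, an action history, and a game target to obtain a high-level next-action prediction. The model name, prompt wording, and the predict_next_action helper are illustrative assumptions rather than the exact prompts behind Fig. 8 and Fig. 9.

```python
import base64
from openai import OpenAI  # assumes the openai>=1.0 client and an OPENAI_API_KEY in the environment

client = OpenAI()

def predict_next_action(frame_path: str, action_history: list[str], game_target: str) -> str:
    """Ask a vision-language model for a high-level next action, zero-shot."""
    with open(frame_path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You are a gaming agent. Given the current frame, the action history, "
        "and the game target, predict the most reasonable high-level next action.\n"
        f"Action history: {', '.join(action_history)}\n"
        f"Game target: {game_target}\n"
        "Answer with a short imperative sentence."
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
        max_tokens=100,
    )
    return response.choices[0].message.content

# Example: predict_next_action("frame.jpg", ["chop tree", "open inventory"], "craft a wooden pickaxe")
```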
Small Agent Pretraining Model.
To showcase our agent vision-language architecture, we first study its application in a widely used domain for gaming agents by pretraining on Minecraft data. As shown in Fig. 7, given an input action agent, a key frame of the video, and the corresponding text, a standard encoder-decoder can be employed to convert the agent action and image into action text tokens and image patch tokens, and the agent-vision-language decoder can then convert them into an action prediction sentence. The overall architecture is depicted in Fig. 7. We evaluate our approach with several Minecraft demonstrations. The Minecraft video data consists of 5-minute clips; the set we use for pretraining contains 78K videos, of which we used 5K videos (6% of the pretraining data) for the first round of pretraining. We trained a 250M-parameter model on 16 NVIDIA V100 GPUs for one day and visualize our model outputs in Fig. 10 and Fig. 11. Fig. 10 shows that our relatively small agent architecture can produce reasonable outputs for Minecraft scenes unseen during training. Fig. 11 shows the model’s predictions compared to the ground-truth human player actions, indicating potential low-level understanding for our small agent model.
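A minimal PyTorch sketch of the encoder-decoder pattern described above follows: image patch tokens and action text tokens are jointly encoded, and a decoder autoregressively emits the action-prediction sentence. The dimensions, tokenizer, and module names are illustrative assumptions and do not reproduce the actual 250M-parameter configuration.

```python
import torch
import torch.nn as nn

class SmallAgentVLM(nn.Module):
    """Minimal encoder-decoder sketch: image patches + action text -> action prediction tokens."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=6,
                 patch_dim=3 * 16 * 16, max_len=256):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)       # image patch tokens
        self.text_embed = nn.Embedding(vocab_size, d_model)   # action/instruction text tokens
        self.pos_embed = nn.Embedding(max_len, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, action_tokens, target_tokens):
        # patches: (B, P, patch_dim); action_tokens/target_tokens: (B, T) integer ids
        img = self.patch_proj(patches)
        txt = self.text_embed(action_tokens)
        memory = self.encoder(torch.cat([img, txt], dim=1))   # joint agent-vision-language context
        tgt = self.text_embed(target_tokens)
        tgt = tgt + self.pos_embed(torch.arange(target_tokens.size(1), device=tgt.device))
        mask = nn.Transformer.generate_square_subsequent_mask(target_tokens.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                              # next-token logits for the action sentence

model = SmallAgentVLM()
logits = model(torch.randn(2, 64, 3 * 16 * 16),
               torch.randint(0, 32000, (2, 16)),
               torch.randint(0, 32000, (2, 20)))
print(logits.shape)  # (2, 20, 32000)
```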

Figure 12: MindAgent, an in-context learning infrastructure for gaming. Planning skills and tool use: the game environment requires diverse planning skills and tool use to complete tasks; it generates the relevant game information and converts the game data into a structured text format that LLMs can process. LLM: the main workhorse of our infrastructure that makes decisions, thus serving as the dispatcher for the multi-agent system. Memory history: a storage utility for relevant information. Action module: extracts actions from text inputs, converts them into a domain-specific language (DSL), and validates the DSL so that they do not cause errors during execution.
Multi-Agent Infrastructure.
As shown in the agent paradigm in Fig. 5, we designed a novel infrastructure for a new gaming scenario called “CuisineWorld” Gong et al. (2023a). We detail our approach in Fig. 12. Our infrastructure allows for multi-agent collaboration by leveraging GPT-4 as a central planner and works across multiple gaming domains. We investigated our system’s multi-agent planning capabilities, and we deployed the infrastructure into real-world video games to demonstrate its multi-agent and human-AI collaboration effectiveness. Additionally, we presented “CuisineWorld”, a text-based multi-agent collaboration benchmark that provides a new auto-metric Collaboration Score (CoS) to quantify collaboration efficiency.
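The schematic loop below illustrates how such an infrastructure can be wired together: structured game data is converted to text, a central LLM planner proposes per-agent commands, and an action module extracts and validates DSL commands before execution, with a memory of past steps. The query_llm stub, the toy DSL, and the game-state fields are assumptions for illustration; the actual system uses GPT-4 and a game-specific DSL as described in Gong et al. (2023a).

```python
# A schematic dispatch loop for a MindAgent-style multi-agent infrastructure (see Fig. 12).
import re

VALID_ACTION = re.compile(r"^(goto|pickup|cook|serve)\(agent\d+,\s*\w+\)$")  # toy DSL validator

def game_state_to_text(state: dict) -> str:
    """Convert structured game data into text the LLM planner can consume."""
    return "\n".join(f"{k}: {v}" for k, v in state.items())

def extract_actions(llm_output: str) -> list[str]:
    """Pull DSL commands out of the planner's free-form text and keep only valid ones."""
    candidates = [line.strip() for line in llm_output.splitlines() if "(" in line]
    return [c for c in candidates if VALID_ACTION.match(c)]

def dispatch_step(state: dict, memory: list[str], query_llm) -> list[str]:
    prompt = (
        "You are the central planner for multiple kitchen agents.\n"
        f"Memory of previous steps:\n{chr(10).join(memory[-5:])}\n"
        f"Current game state:\n{game_state_to_text(state)}\n"
        "Output one DSL command per line, e.g. pickup(agent1, tomato)."
    )
    actions = extract_actions(query_llm(prompt))
    memory.append(f"state={state} -> actions={actions}")
    return actions

# Example with a stubbed planner in place of GPT-4:
fake_llm = lambda prompt: "pickup(agent1, tomato)\ncook(agent2, soup)\nsome chit-chat"
print(dispatch_step({"orders": ["tomato soup"], "agents": 2}, [], fake_llm))
```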
Please refer to the Appendix for more examples and details for gaming description, high-level action prediction, and GPT-4V prompting. We show examples for Bleeding Edge in Fig. 32 and Appendix B, Microsoft Flight Simulator in Fig. 33 and Appendix C, Assassin’s Creed Odyssey in Fig. 34 and Appendix D, Gears of War 4 in Fig. 35 and Appendix E, and Starfield in Fig. 36 and Appendix F. We also provide a detailed screenshot of the prompting process for GPT-4V used to generate Minecraft examples in Fig. 31 in Appendix A.
6.2 Robotics
Robots are representative agents that necessitate effective interaction with their environment. In this section, we will introduce key elements essential for efficient robotic operation, review research topics where the latest LLM/VLM technologies have been applied, and share findings from our most recent studies.
Visual Motor Control.
Visual Motor Control refers to the integration of visual perception and motor action to execute tasks effectively in a robotic system. This integration is paramount as it enables robots to interpret the visual data from their environment and accordingly adjust their motor actions to interact with the environment accurately. For instance, in an assembly line, a robot equipped with visual motor control can perceive the position and orientation of objects and accurately align its manipulator to interact with these objects. This capability is essential for ensuring the precision and effectiveness of robotic operations across a myriad of applications, ranging from industrial automation to assisting the elderly in their daily chores. Moreover, visual motor control facilitates robots in adapting to dynamic environments where the state of the environment may change rapidly, requiring real-time adjustments to motor actions based on visual feedback.
Additionally, within the context of safe operation, visual information is crucial for detecting execution errors and confirming the pre- and post-conditions of each robot action. In uncontrolled environments, such as unknown domestic settings, robots are more likely to face unexpected outcomes due to unpredictable factors like changing furniture shapes, varied lighting, and slippage. Executing a pre-planned action plan solely in a feedforward manner can pose significant risks in these settings. Therefore, utilizing visual feedback to continually verify outcomes at each step is key to ensuring robust and reliable operation of robotic systems.
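As a sketch of this closed-loop idea, the code below executes a skill sequence and verifies each skill's post-condition from perception before moving on, retrying or aborting otherwise. The Skill structure and the check_postcondition callback are hypothetical placeholders for a real perception module (e.g., a VLM query or an object detector).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str
    execute: Callable[[], None]
    postcondition: str  # e.g., "gripper holds cup"

def run_plan(plan: list[Skill], check_postcondition: Callable[[str], bool], max_retries: int = 2):
    for skill in plan:
        for attempt in range(max_retries + 1):
            skill.execute()
            if check_postcondition(skill.postcondition):
                break  # post-condition verified visually; move on to the next skill
            print(f"[warn] '{skill.name}' failed check '{skill.postcondition}' (attempt {attempt + 1})")
        else:
            raise RuntimeError(f"Aborting: could not satisfy '{skill.postcondition}'")

# Example with stubbed robot and perception:
plan = [Skill("approach_cup", lambda: print("moving..."), "gripper near cup"),
        Skill("grasp_cup", lambda: print("closing gripper..."), "gripper holds cup")]
run_plan(plan, check_postcondition=lambda cond: True)
```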
Language Conditioned Manipulation.
Language Conditioned Manipulation entails the ability of a robotic system to interpret and execute tasks based on language instructions. This aspect is particularly crucial for creating intuitive and user-friendly interfaces for human-robot interaction. Through natural language commands, users can specify goals and tasks to robots in a manner similar to human-human communication, thereby lowering the barrier to operating robotic systems. In a practical scenario, for instance, a user could instruct a service robot to “pick up the red apple from the table,” and the robot would parse this instruction, identify the referred object and execute the task of picking it up Wake et al. (2023c). The core challenge lies in developing robust natural language processing and understanding algorithms that can accurately interpret a wide array of instructions, ranging from direct commands to more abstract directives, and enable the robot to convert these instructions into actionable tasks. Furthermore, ensuring that robots can generalize these instructions across diverse tasks and environments is critical for enhancing their versatility and utility in real-world applications.
The use of language input to guide a robot’s task planning has gained attention in the context of a robot framework called Task and Motion Planning Garrett et al. (2021).
Skill Optimization.
Recent studies highlight the effectiveness of LLMs in robotic task planning. However the optimal execution of tasks, especially those involving physical interactions like grasping, requires a deeper understanding of the environment that goes beyond simply interpreting human instructions. For example, robot grasping necessitates precise contact points Wake et al. (2023e) and arm posture Sasabuchi et al. (2021) to efficiently execute subsequent actions. While these elements—precise contact points and arm posture—are intuitive for humans, articulating them through language is challenging. Despite advances in internet-scale VLMs, capturing these nuanced indirect cues from scenes and translating them effectively into robotic skills remains a significant challenge. In response, the robotics community is increasingly focusing on collecting enhanced datasets(e.g., Wang et al. (2023d); Padalkar et al. (2023)) or developing methodologies for direct skill acquisition from human demonstrations Wake et al. (2021a). Frameworks including Learning-from-Demonstration and Imitation Learning are leading these developments, playing a crucial role in the optimization of physical skills.
6.2.1 LLM/VLM Agent for Robotics.
Recent research has demonstrated the potential of LLM/VLMs for robotic agents that involve interactions with humans in an environment. Research topics that aim to leverage latest LLM/VLM technologies include:
Multimodal Systems: Recent research has been actively focusing on developing end-to-end systems that incorporate the latest LLM and VLM technologies as encoders for input information. Particularly, there is a significant trend towards modifying these foundation models to process multimodal information (Jiang et al., 2022; Brohan et al., 2022, 2023; Li et al., 2023d; Ahn et al., 2022b; Shah et al., 2023b; Li et al., 2023e). This adaptation aims to guide robotic actions based on both linguistic instructions and visual cues, thus achieving an effective embodiment.
Task Planning and Skill Training:
In contrast to end-to-end systems, Task And Motion Planning (TAMP) based systems first compute a high-level task plan and then achieve it with low-level robot control, known as skills.
The advanced language processing abilities of LLMs have demonstrated the capability to interpret instructions and decompose them into robot action steps, greatly advancing task planning technologies Ni et al. (2023); Li et al. (2023b); Parakh et al. (2023); Wake et al. (2023c). For skill training, several studies have explored the use of LLMs/VLMs for designing reward functions Yu et al. (2023a); Katara et al. (2023); Ma et al. (2023), generating data to facilitate policy learning Kumar et al. (2023); Du et al. (2023), or serving as part of a reward function Sontakke et al. (2023). Together with training frameworks such as RL and IL, these efforts will contribute to the development of efficient robot controllers.
On-site Optimization: Executing long task steps in robotics can be difficult due to unexpected and unpredictable environmental conditions. Therefore, a significant challenge in the field of robotics involves dynamically adapting and refining robotic skills by integrating task plans with real-time environmental data. For instance, (Ahn et al., 2022b) proposed an approach that calculates the feasibility of actions (i.e., affordance) from visual information and compares it with planned tasks. Additionally, there are approaches that focus on enabling LLMs to output the pre-conditions and post-conditions (e.g., states of objects and their interrelationships) of task steps to optimize their execution Zhou et al. (2023c) and detect pre-condition errors for necessary revisions to the task plan Raman et al. (2023). These strategies seek to achieve environment-grounded robot execution by integrating environmental information and adjusting the robot’s actions at the task plan or controller level.
Conversation Agents: In creating conversational robots, LLMs can contribute to natural, context-sensitive interactions with humans Ye et al. (2023a); Wake et al. (2023f). These models process and generate responses that mimic human conversation, allowing robots to participate in meaningful dialogues. Additionally, LLMs play a significant role in the estimation of conceptual Hensel et al. (2023); Teshima et al. (2022) and emotional attributes Zhao et al. (2023); Yang et al. (2023b); Wake et al. (2023d) of utterances. Those attributes facilitate the understanding of human intent and meaningful gesture generation, thus contributing to the naturalness and efficacy of human-robot communication.
Navigation Agents:
Robot navigation has a long history of research, focusing on core aspects such as map-based path planning and Simultaneous Localization and Mapping (SLAM) for creating environmental maps. These functionalities have become standard in widely used robot middleware like the Robot Operating System (ROS) Guimarães et al. (2016).
While classic navigation techniques remain prevalent in many robotics applications, they typically rely on static or pre-created maps. Recently, there has been an increased interest in advanced technologies that enable robots to navigate in more challenging environments, leveraging breakthroughs in fields like computer vision and natural language processing. One representative task is object navigation Chaplot et al. (2020a); Batra et al. (2020); Gervet et al. (2023); Ramakrishnan et al. (2022); Zhang et al. (2021), where robots use object names for navigation instead of map coordinates, requiring the visual grounding of object names in the environment. Furthermore, recent attention has been given to technologies that navigate robots in entirely unfamiliar new environments on a zero-shot basis, on top of foundation models, so-called zero-shot object navigation Gadre et al. (2023); Dorbala et al. (2023); Cai et al. (2023). Additionally, Vision-Language Navigation (VLN) Anderson et al. (2018a) is a representative task, where the task involves navigating an agent by natural language instructions in previously unseen, real-world environments Shah et al. (2023a); Zhou et al. (2023a); Dorbala et al. (2022); Liang et al. (2023); Huang et al. (2023b). VLN interprets sentences rather than object names, such as “go to the bathroom on your left,” and thus it requires a higher functionality to parse input text Wang et al. (2019).
The advent of foundation models contributes to the development of such adaptive, on-the-fly navigation technologies by enhancing the understanding of human language instructions and the visual interpretation of environmental information. More detailed explanations of representative VLN research are provided in Section 6.2.2.
6.2.2 Experiments and Results.

Figure 13: Overview of the robot teaching system that integrates a ChatGPT-empowered task planner. The process involves two steps: task planning, where the user creates an action sequence with the task planner and adjusts the results through feedback as needed, and demonstration, where the user visually demonstrates the action sequence to provide the information needed for robot operation. The vision system collects the visual parameters that will be used for robot execution.
An accumulating body of evidence suggests that recent VLMs and LLMs have promising capabilities for symbolic task planning (e.g., what-to-do). However, each task requires a low-level control policy (e.g., how-to-do) to achieve successful interaction with the environment. While reinforcement learning and imitation learning are promising approaches to learn policies in a data-driven manner, another promising approach is to obtain the strategy directly from humans through on-site demonstration, an approach called Learning-from-Observation Wake et al. (2021a); Ikeuchi et al. In this section, we introduce a study where we employ ChatGPT for task planning and enrich the plan by parameterizing it with affordance information to facilitate effective and precise execution (Fig. 13).
The pipeline was composed of two modules: task planning and parameterization. In task planning, the system is fed with language instructions and the description of the working environment. These instructions, along with a predefined set of robot actions and output specifications, are compiled into a comprehensive prompt provided to ChatGPT, which then generates a sequence of decomposed tasks with their textual descriptions (Fig. 13; left pane). Notably, we employ a few-shot approach, meaning ChatGPT is not trained on this task, offering an advantage in applicability as it eliminates the need for hardware-dependent data collection and model training. Additionally, the textual descriptions in the output enable the user to check and adjust the results as necessary, which is a crucial feature for a safe and robust operation. Fig. 14 shows the qualitative results conducted for an agentic simulation on top of VirtualHome Puig et al. (2018). The results demonstrate a reasonable task plan and its flexibility in adjusting outputs, indicating the broad applicability of our approach.
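A simplified sketch of this prompting pipeline is shown below: the language instruction, the environment description, and a predefined set of robot actions are compiled into a single prompt, and the planner's JSON output is parsed into a checkable task sequence. The prompt wording, action set, and JSON schema are assumptions for illustration and are not the exact prompts of the system in Fig. 13.

```python
import json
from openai import OpenAI  # assumes openai>=1.0 and an API key in the environment

client = OpenAI()

ROBOT_ACTIONS = ["move_to(object)", "grasp(object)", "release(object)", "open(object)", "close(object)"]

def plan_tasks(instruction: str, environment: str) -> list[dict]:
    """Compile instruction + environment + allowed actions into a prompt and parse the task sequence."""
    prompt = (
        "You are a robot task planner. Decompose the instruction into a sequence of steps, "
        "using only these actions: " + ", ".join(ROBOT_ACTIONS) + ".\n"
        'Return JSON: {"steps": [{"action": "...", "description": "..."}]}.\n'
        f"Environment: {environment}\n"
        f"Instruction: {instruction}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative choice of model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)["steps"]

# Example usage (textual step descriptions allow the user to check and adjust the plan):
# steps = plan_tasks("Take the pie on the table and warm it using the stove.",
#                    "objects: pie (on counter), table, stove (closed)")
# for s in steps:
#     print(s["action"], "-", s["description"])
```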

Figure 14: An example of adjusting an output sequence through auto-generated feedback. We use the open-source simulator VirtualHome for the experiment. Given the instruction “Take the pie on the table and warm it using the stove,” the task planner plans a sequence of the functions provided in VirtualHome. If an error is detected during execution, the task planner corrects its output based on the auto-generated error message.
While the task planner guarantees coherency between the task sequences, successful operation in reality requires detailed parameters. For example, grasp type is crucial for carrying a container without spilling its contents; such a parameter is often ignored in simulators (see Fig. 14 for grasping a pie). In our robot system, therefore, users are asked to demonstrate each action visually (Fig. 13; right pane). The tasks had predefined parameters necessary for execution, which our vision system extracts from the videos Wake et al. (2021b). Notably, our robotic system is not designed for exact replication of human motions (i.e., teleoperation) but rather to handle varying real-world conditions, such as changes in object locations. Hence, the parameters extracted from human demonstrations encompass not precise motion paths but affordance information that dictates effective environmental movement (e.g., waypoints for collision avoidance Wake et al. (2023a), grasp types Wake et al. (2023e), and upper-limb postures Sasabuchi et al. (2021); Wake et al. (2021a)). The posture of the upper limbs is critical in robots with high degrees of freedom and is designed to assume predictable postures for humans coexisting with the operational robot. The task sequence endowed with affordances is transformed into a sequence of reusable robot skills acquired through reinforcement learning and executed by the robot Takamatsu et al. (2022).


Figure 16: Example output of the video analyzer. Five frames are extracted at fixed intervals and fed into GPT-4V. The whole pipeline is described in Section 6.2.2.
LLM-empowered task planning can be extended to a more versatile robotic system by integrating it with VLMs. Here, we show an example where we use GPT-4V(ision) to broaden the aforementioned task planner in a multimodal input context (Fig. 15), where a human performs actions that are intended to be replicated by the robot. In this paper, only part of the prompt is shown. The whole prompt is available at microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts.
This pipeline takes demonstration videos and text, then outputs a sequence of robot actions. A vision analyzer aims to understand the actions performed by humans in the video. We used GPT-4V and provided a prompt to generate text instructions in a style typical of human-to-human communication. Fig. 16 demonstrates how the usage of text input allows the user to give feedback on GPT-4V’s recognition results for correction purposes. Such a feature, aiming at improving the accuracy of the recognition results, also enables more robust operation.

Figure 17: Example outputs of the scene analyzer using GPT-4V. Our whole pipeline is described in Section 6.2.2.
Next, the scene analyzer compiles the expected work environment into the text information based on the instructions and the first frame of the video data (or an image of the environment). This environmental information includes a list of object names recognized by GPT-4V, the graspable properties of objects, and the spatial relationships between objects. Although these computational processes are a black box within GPT-4V, the information is output based on the knowledge of GPT-4V and the image/text input. Fig. 17 shows the example outputs of our scene analyzer. As shown in the figure, GPT-4V successfully selects the objects that are related to the manipulation. For example, a table is included in the output when the human is relocating a spam container on the table, while the table is ignored for the fridge opening task. These results suggest that the scene analyzer encodes the scene information with respect to the human’s actions. We prompted GPT-4V to explain the results of the object selection process and the reasons behind those choices. In practice, we found this approach resulted in reasonable outputs. Finally, based on the given text instructions and environmental information, the task planner outputs a sequence of tasks Wake et al. (2023c).
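The two GPT-4V calls described above can be sketched as follows: a video analyzer turns sampled frames into a human-style instruction, and a scene analyzer turns the first frame plus that instruction into a textual environment description (object names, graspability, spatial relations). The prompts and helper names here are simplified assumptions; the full prompts are available at the URL given earlier.

```python
import base64
from openai import OpenAI  # assumes openai>=1.0 and an API key in the environment

client = OpenAI()

def _image_part(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def _ask_gpt4v(text: str, image_paths: list[str]) -> str:
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{"role": "user",
                   "content": [{"type": "text", "text": text}] + [_image_part(p) for p in image_paths]}],
        max_tokens=400,
    )
    return response.choices[0].message.content

def video_analyzer(frame_paths: list[str]) -> str:
    """Describe the demonstrated manipulation as a human-to-human style instruction."""
    return _ask_gpt4v("These frames are sampled from a demonstration video in order. "
                      "Describe what the person does as a single instruction.", frame_paths)

def scene_analyzer(first_frame: str, instruction: str) -> str:
    """Compile the expected work environment into text: objects, graspability, spatial relations."""
    return _ask_gpt4v("Given this scene and the instruction below, list the task-relevant object names, "
                      "whether each is graspable, and the spatial relations between them.\n"
                      f"Instruction: {instruction}", [first_frame])

# Example usage:
# frames = ["f0.jpg", "f1.jpg", "f2.jpg", "f3.jpg", "f4.jpg"]
# instruction = video_analyzer(frames)
# environment = scene_analyzer(frames[0], instruction)
```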
Embodied Agents for Robotics Navigation.
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. Navigation in 3D environments Zhu et al. (2017a); Mirowski et al. (2016); Mousavian et al. (2018); Hemachandra et al. (2015) is an essential capability of a mobile intelligent system that functions in the physical world. In the past few years, a plethora of tasks and evaluation protocols Savva et al. (2017); Kolve et al. (2017); Song et al. (2017); Xia et al. (2018); Anderson et al. (2018a) have been proposed as summarized in Anderson et al. (2018b). VLN Anderson et al. (2018a) focuses on language-grounded navigation in the real 3D environment.
In order to solve the VLN task, (Anderson et al., 2018a) set up an attention-based sequence-to-sequence baseline model. Then (Wang et al., 2018) introduced a hybrid approach that combines model-free and model-based reinforcement learning (RL) to improve the model’s generalizability. Lastly, (Fried et al., 2018) proposed a speaker-follower model that adopts data augmentation, a panoramic action space and modified beam search for VLN, establishing the current state-of-the-art performance on the Room-to-Room dataset. Extending prior work, we propose a Reinforced Cross-Modal Matching (RCM) for VLN in (Wang et al., 2019). The RCM model is built upon Fried et al. (2018) but differs in many significant aspects: (1) RCM combines a novel multi-reward RL with imitation learning for VLN while Speaker-Follower models Fried et al. (2018) only uses supervised learning as in (Anderson et al., 2018a). (2) The RCM reasoning navigator performs cross-modal grounding rather than the temporal attention mechanism on single-modality input.
(3) The RCM matching critic is similar to the Speaker in terms of the architecture design, but the former is used to provide the cycle-reconstruction intrinsic reward for both RL and SIL training while the latter is used to augment training data for supervised learning.
In (Wang et al., 2019), we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problem. As shown in Fig. 18, we propose a novel Reinforced Cross-Modal Matching approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves a new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
Moreover, in Wang et al. (2019) we introduce a self-supervised imitation learning method for exploration in order to explicitly address the generalization issue, which is a problem not well-studied in prior work.
Concurrent to this work, Thomason et al. (2018); Ke et al. (2019); Ma et al. (2019a, b) study the VLN tasks from various aspects, and Nguyen et al. (2018) introduces a variant of the VLN task to find objects by requesting language assistance when needed. Note that we are the first to propose to explore unseen environments for the VLN task.
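As a worked illustration of this reward design, the sketch below combines the extrinsic navigation reward with the matching critic's intrinsic cycle-reconstruction score, weighted by a hyperparameter. The weight value and the probability placeholder are assumptions; the exact formulation is given in Wang et al. (2019).

```python
# Schematic RCM-style reward shaping: the extrinsic reward comes from the environment
# (e.g., progress toward the goal), and the matching critic supplies an intrinsic
# cycle-reconstruction reward measuring how well the executed trajectory explains the instruction.
def total_reward(extrinsic: float, p_instruction_given_trajectory: float, delta: float = 0.5) -> float:
    intrinsic = p_instruction_given_trajectory  # matching critic score in [0, 1]
    return extrinsic + delta * intrinsic        # delta is a placeholder weight, not the paper's value

# Example: a step with small goal progress but a trajectory that matches the instruction well.
print(total_reward(extrinsic=0.2, p_instruction_given_trajectory=0.9))  # 0.65
```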

Figure 18: Demonstration of an embodied agent for the VLN task Wang et al. (2019). The instruction, the local visual scene, and the global trajectories in a top-down view are shown. The agent does not have access to the top-down view. Path A is the demonstration path following the instruction. Paths B and C are two different paths executed by the agent.
6.3 Healthcare
In healthcare, LLMs and VLMs can act as diagnostic agents, patient care assistants, or even therapy aids, but they come with unique challenges and responsibilities. With the tremendous potential for AI agents to improve patient care and save lives comes an equally dangerous possibility that their misuse or hasty deployment could endanger thousands or millions of people worldwide. We discuss some of the promising routes for AI agents within the context of healthcare and also discuss some of the key challenges faced.
Diagnostic Agents.
Using LLMs as medical chatbots for patient diagnosis has recently attracted great attention due to the high demand for medical experts and the potential for LLMs to help triage and diagnose patients Lee et al. (2023). Dialogue agents, especially those that can effectively communicate important medical information to a broad range of people from diverse patient populations, have the potential to provide equitable healthcare access to historically disadvantaged or marginalized groups. Furthermore, doctors and healthcare systems across the world are largely over-burdened and under-resourced, resulting in insufficient access to medical care for hundreds of millions of people worldwide World Health Organization and World Bank (2015). Diagnostic agents provide a particularly advantageous pathway to improve healthcare for millions since they can be built with the capability to understand a variety of languages, cultures, and health conditions. Initial results have shown that healthcare-knowledgeable LMMs can be trained by utilizing large-scale web data Li et al. (2023f). Although an exciting direction, the promise of diagnostic agents does not come without risks. We highlight the risks of hallucination within medical contexts, as well as potential pathways for solutions, in the following section.
Knowledge Retrieval Agents.
Within the medical context, model hallucinations are particularly dangerous and may even result in serious patient harm or death, depending on the severity of the error. For instance, if a patient mistakenly receives a diagnosis suggesting they are free of a condition they actually have, it can lead to catastrophic outcomes. These include postponed or inappropriate treatments, or in some cases, a total lack of necessary medical intervention. The gravity of undiagnosed or misdiagnosed conditions can lead to escalated healthcare expenses, extended therapies causing further physical strain, and in extreme scenarios, severe harm or even death. Thus, approaches that can use agents to more reliably retrieve knowledge Peng et al. (2023) or generate text in a retrieval-based manner Guu et al. (2020) are promising directions. Pairing a diagnostic agent with a medical knowledge retrieval agent has the potential to significantly reduce hallucinations while simultaneously improving the quality and preciseness of the responses of the diagnostic dialogue agent.
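The pairing described above follows a retrieval-augmented pattern, sketched below with a tiny in-memory corpus: relevant passages are retrieved first, and the diagnostic agent's answer is conditioned on that evidence rather than on parametric memory alone. The corpus, the term-overlap retriever, and the generate_answer stub are illustrative assumptions, not a clinical-grade system.

```python
CORPUS = {
    "doc1": "Chest pain with radiation to the left arm can indicate myocardial infarction.",
    "doc2": "Seasonal allergies commonly cause sneezing, itchy eyes, and a runny nose.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Score passages by naive term overlap; a real system would use a dense retriever."""
    terms = set(query.lower().split())
    scored = sorted(CORPUS.values(), key=lambda d: len(terms & set(d.lower().split())), reverse=True)
    return scored[:k]

def generate_answer(question: str, evidence: list[str]) -> str:
    """Stub for the diagnostic agent; in practice an LLM is prompted with the retrieved evidence."""
    return (f"Based on: {' '.join(evidence)}\n"
            "(Not medical advice) Consider discussing these findings with a clinician.")

question = "Patient reports chest pain radiating to the left arm."
print(generate_answer(question, retrieve(question)))
```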
Telemedicine and Remote Monitoring.
Agent-based AI also has great potential within the world of Telemedicine and Remote Monitoring by improving the access to healthcare, improving communications between healthcare providers and patients, as well as improving the efficiency and reducing the costs of frequent doctor-patient interactions Amjad et al. (2023). Primary care clinicians spend significant amounts of time sifting through patient messages, reports, and emails that are often irrelevant or unnecessary for them to view. There is significant potential to allow for support agents to help triage messages from doctors, patients, and other healthcare providers and to help highlight important messages for all parties. By enabling agentic AI systems to coordinate with patients, clinicians, and other AI agents, there is a massive potential to revolutionize the remote healthcare and digital health industry.
6.3.1 Current Healthcare Capabilities

Figure 19: Example prompts and responses when using GPT-4V within the domain of medical image understanding. From left to right: (1) an image of a nurse and doctor performing a CT scan, (2) a synthetic image of an irregular EKG scan, and (3) an image from the ISIC Codella et al. (2018) skin lesion dataset. We can see that GPT-4V possesses significant medical knowledge and is able to reason about medical images. However, it declines to diagnose certain medical images due to its safety training.

Figure 20: Example prompts and responses when using GPT-4V within the domain of healthcare video understanding. We input example videos as a 2x2 grid with overlaid text indicating the frame order. In the first two examples, we prompt GPT-4V to examine the frames in the video to detect clinical bedside activities performed on volunteer patients. For the last example, we attempt to prompt GPT-4V to assess an echocardiogram video, but it does not provide a detailed response due to its safety training. For clarity, we bold the text describing the activity of interest and omit unnecessary model responses. Individuals’ faces are grayed out to protect their privacy.
Image understanding.
We demonstrate the current capabilities and limitations of modern multimodal agents such as GPT-4V within the context of healthcare in Fig. 19. We can see that although GPT-4V possesses significant internal knowledge of the equipment and procedures involved in hospital care, it does not always respond to more prescriptive or diagnostic queries by the user.
Video understanding.
We investigate the performance of VLM agents for medical video understanding in two contexts. First, we investigate the ability of VLM agents to identify important patient care activities in clinical spaces. Secondly, we explore the usage of VLMs for more technical videos such as ultrasounds. Specifically, in Figure 20, we demonstrate some of the current capabilities and limitations of GPT-4V for hospital care and medical video analysis.
6.4 Multimodal Agents

Figure 21: The interactive multimodal agent comprises four main pillars: interaction, speech, vision, and language. A copilot agent consists of different services. 1) Interaction services help provide a unified platform for automated operation, cognition, and decision-making. 2) Audio services integrate audio and speech processing into applications and services. 3) Vision services identify and analyze content within images, videos, and digital ink. 4) Language services extract meaning from structured and unstructured text.
The integration of visual and linguistic understanding is crucial for developing sophisticated multimodal AI agents. This includes tasks such as image captioning, visual question answering, video language generation, and video understanding, amongst others. We aim to delve into these visual-language tasks, exploring the challenges and opportunities they present in the context of AI agents.
6.4.1 Image-Language Understanding and Generation
Image-language understanding is a task that involves the interpretation of visual content in a given image with language and the generation of associated linguistic descriptions. This task is critical to the development of AI agents that can interact with the world in a more human-like manner. Some of the most popular tasks are image captioning Lin et al. (2014); Sharma et al. (2018); Young et al. (2014); Krishna et al. (2016), referring expression Yu et al. (2016); Karpathy et al. (2014), and visual question answering Antol et al. (2015); Ren et al. (2015); Singh et al. (2019).
More recently, knowledge-intensive Visual Question Answering tasks such as OKVQA Marino et al. (2019), KB-VQA Wang et al. (2015), FVQA Wang et al. (2017), and WebQA Chang et al. (2021) have been introduced. Multimodal agents should be capable of identifying objects in an image, comprehending their spatial relationships, generating accurate descriptive sentences about the scene, and utilizing reasoning skills to handle knowledge-intensive visual reasoning. This requires not just object recognition capabilities, but also a deep understanding of spatial relationships, visual semantics, and the ability to map these visual elements to linguistic constructs with integration of world knowledge.
6.4.2 Video and Language Understanding and Generation
Video-language generation.
Video captioning or video storytelling is the task of generating a sequence of coherent sentences for a stream of video frames. Inspired by the successful use of recurrent large foundation models employed in video and language tasks, variants of agent-driven enhanced models have shown promising results on the task of video-language generation. The fundamental challenge is that the strong performance of neural encoder-decoder models does not generalize well to visual storytelling, because the task requires a full understanding of the content of each image as well as the relation among different frames. One important goal for the field is to create an agent-aware text-synthesis model that can efficiently encode the sequence of frames and generate a topically coherent multi-sentence paragraph.
Video Understanding.
Video understanding extends the scope of image understanding to dynamic visual content. This involves interpretation and reasoning about the sequence of frames in a video, often in conjunction with accompanying audio or textual information. An agent should be able to interact with the visual, text, and audio modalities to demonstrate advanced comprehension of video content. Tasks in this domain include video captioning, video question answering, and activity recognition, amongst others. The challenges in video understanding are manifold. They include the temporal alignment of visual and linguistic content, the handling of long sequences of frames, and the interpretation of complex activities that unfold over time. Regarding audio, the agent could process spoken words, background noises, music, and tone of voice to comprehend the mood, setting, and subtleties of the video content.
Previous works have focused on employing existing video-language training data available online for establishing video foundational models Li et al. (2020, 2021b); Fu et al. (2022); Bain et al. (2021); Zellers et al. (2021, 2022); Fu et al. (2023). Supporting such training pipelines and functionalities is, however, difficult due to the limited and often inconsistent nature of these datasets. Video foundational models are designed with masked and contrastive pretraining objectives and later tuned on their respective tasks. Despite showing remarkable results in multimodal benchmarks, these models encounter difficulties in video-only tasks such as action recognition due to their dependency on limited video-text data built from noisy audio transcriptions. This limitation also leads to the lack of robustness and fine-grained reasoning skills that large language models generally possess.
Other methods, similar to those used in image-language understanding, have drawn on the strong reasoning skills and broad knowledge of large language models to improve different facets of video interpretation. The task of video understanding is simplified by language only models like ChatGPT and GPT4 or image-language models like GPT4-V, which treat the audio, video, and language modalities as individual interpretable input data types and position the agents as strong open-source models. For example, Huang et al. (2023c); Li et al. (2023g) transformed video understanding into a natural language processing (NLP) question-answering formulation by textualizing video content with open-source vision classification/detection/caption models. Lin et al. (2023) integrated GPT4-V with specialized tools in vision, audio, and speech, to facilitate complex video understanding tasks, such as scripting character movements and actions in long-form videos.
Parallel research explores generating scaled datasets from large models, then applying visual instruction tuning Liu et al. (2023c); Li et al. (2023c); Zhu et al. (2023) on the generated data. Considerable audio, speech, and visual expert perception models are subsequently used to verbalize videos. Speech is transcribed with automatic speech recognition tools, and video descriptions and related data are produced with various tagging, grounding, and captioning models Li et al. (2023g); Maaz et al. (2023); Chen et al. (2023); Wang et al. (2023f). These techniques demonstrate how instruction tuning video-language models on generated datasets may lead to enhanced video-reasoning and communication abilities.
6.4.3 Experiments and Results
Knowledge-Intensive Models: As introduced in INK Park et al. (2022) and KAT Gui et al. (2022a), an intensive neural knowledge task incorporates required knowledge annotated by humans to support knowledge-intensive retrieval tasks.

Figure 22: An example of the Intensive Neural Knowledge (INK) Park et al. (2022) task, which uses knowledge to identify the text related to an image from a set of text candidates. Our task involves leveraging visual and textual knowledge retrieved from the web together with human-annotated knowledge.
Knowledge-Intensive Agent.
As shown in Fig. 22 and Fig. 23, knowledge-based visual question answering and vision-language retrieval tasks are challenging tasks in multi-modal machine learning that require outside knowledge beyond image contents. Recent studies on large-scale transformers have primarily focused on maximizing the efficiency of the model’s parameters to store information. This line of research explores a different aspect: whether multimodal transformers can use explicit knowledge in their decision-making process. Pretraining methods based on transformers have shown remarkable success in implicitly learning knowledge representations across multiple modalities. However, traditional methods, mainly unimodal, have investigated knowledge retrieval and subsequent answer prediction, raising questions about the quality and relevance of the knowledge retrieved and the integration of reasoning processes using both implicit and explicit knowledge. To tackle these issues, we introduce the Knowledge Augmented Transformer (KAT), which outperforms others by 6% on the 2022 OK-VQA open-domain multimodal task. KAT combines implicit knowledge from GPT-3 with explicit knowledge from websites using an encoder-decoder structure, and allows for concurrent reasoning with both knowledge types during answer generation. Furthermore, incorporating explicit knowledge enhances the interpretability of the model’s predictions. The code and pre-trained models are available at https://github.com/guilk/KAT.
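A shape-level PyTorch sketch of this concurrent-reasoning idea follows: explicit (retrieved) and implicit (GPT-3 generated) knowledge are encoded in separate streams, and the answer decoder attends over both while generating tokens. Layer sizes and module names are assumptions, and the sketch omits KAT's retrieval and visual components.

```python
import torch
import torch.nn as nn

class KATStyleFusion(nn.Module):
    """Shape-level sketch: encode explicit and implicit knowledge separately, then let a decoder
    attend over both streams jointly while generating the answer."""

    def __init__(self, vocab=32000, d=512, heads=8, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        enc = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d, heads, batch_first=True)
        self.explicit_enc = nn.TransformerEncoder(enc, layers)   # e.g., retrieved KB/web entries
        self.implicit_enc = nn.TransformerEncoder(enc, layers)   # e.g., GPT-3 generated evidence
        self.decoder = nn.TransformerDecoder(dec, layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, explicit_ids, implicit_ids, answer_ids):
        memory = torch.cat([self.explicit_enc(self.embed(explicit_ids)),
                            self.implicit_enc(self.embed(implicit_ids))], dim=1)
        out = self.decoder(self.embed(answer_ids), memory)
        return self.head(out)

m = KATStyleFusion()
logits = m(torch.randint(0, 32000, (2, 32)),
           torch.randint(0, 32000, (2, 32)),
           torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # (2, 8, 32000)
```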

Figure 23: The KAT model Gui et al. (2022a) uses a contrastive-learning-based module to retrieve knowledge entries from an explicit knowledge base and uses GPT-3 to retrieve implicit knowledge with supporting evidence. The integration of the knowledge is processed by the respective encoder transformers and, jointly with the reasoning module and the decoder transformer, trained end-to-end to generate an answer.

Figure 24: The overall architecture of the VLC model Gui et al. (2022b). Our model consists of three modules: (1) modality-specific projection, where we use a simple linear projection to embed patched images and a word embedding layer to embed tokenized text; (2) a multimodal encoder, where we use a 12-layer ViT Dosovitskiy et al. (2021) initialized from MAE He et al. (2022) (ImageNet-1K without labels) as our backbone; (3) task-specific decoders, where we learn our multimodal representation via masked image/language modeling and image-text matching, used only during pretraining. We use a 2-layer MLP to fine-tune our multimodal encoder for downstream tasks. Importantly, we find that the masked image modeling objective is important throughout the second stage of pretraining, not just for the initialization of the visual transformer.
Vision-language Transformer Agent.
Next, we introduce the "Training Vision-Language Transformers from Captions" (VLC) model Gui et al. (2022b), a transformer that has been pretrained exclusively with image-caption pairs. Despite using just a simple linear projection layer for image embeddings, VLC attains competitive results across various vision-language tasks, in contrast to other methods that depend on object detectors or supervised CNN/ViT networks. Through extensive analysis, we explore the potential of VLC as a vision-language transformer agent. For instance, we show that VLC’s visual representations are highly effective for ImageNet-1K classification, and our visualizations confirm that VLC can accurately match image patches to corresponding text tokens. The scalability of performance with more training data highlights the promising potential for developing large-scale, weakly-supervised, open-domain vision-language models.
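The sketch below mirrors the VLC design at a high level: a linear projection embeds image patches, a word embedding layer embeds text tokens, and a shared transformer encoder (standing in for the MAE-initialized ViT) processes the concatenated sequence, with lightweight heads for the pretraining objectives. Sizes and head choices are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class VLCSketch(nn.Module):
    """Simplified sketch of the VLC design: a linear patch projection, a word embedding layer,
    and a shared transformer encoder over the concatenated image and text tokens."""

    def __init__(self, vocab=30522, d=768, heads=12, layers=12, patch_dim=3 * 16 * 16):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d)            # modality-specific projection (image)
        self.word_embed = nn.Embedding(vocab, d)             # modality-specific projection (text)
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)  # stands in for the MAE-initialized ViT
        self.itm_head = nn.Linear(d, 2)                      # image-text matching head (pretraining)
        self.mlm_head = nn.Linear(d, vocab)                  # masked language modeling head (pretraining)

    def forward(self, patches, token_ids):
        x = torch.cat([self.patch_proj(patches), self.word_embed(token_ids)], dim=1)
        h = self.encoder(x)
        return self.itm_head(h[:, 0]), self.mlm_head(h[:, patches.size(1):])

m = VLCSketch()
itm_logits, mlm_logits = m(torch.randn(2, 196, 3 * 16 * 16), torch.randint(0, 30522, (2, 20)))
print(itm_logits.shape, mlm_logits.shape)  # (2, 2) (2, 20, 30522)
```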
6.5 Video-language Experiments
To understand the practicality of converting pre-trained image-LLMs for video understanding, we temporally expand and fine-tune InstructBLIP Dai et al. (2023) for video captioning. Specifically, we expand the visual encoder of InstructBLIP (EVA-CLIP-G Sun et al. (2023b)) using the same divided space-time attention scheme as Frozen in Time Bain et al. (2021) and keep the Q-former and LLM (Flan-T5-XL Chung et al. (2022)) frozen during training. We freeze all spatial layers of the visual encoder, while keeping the temporal layers unfrozen during captioning training. This allows our model to take images and videos as input (matching the image-level performance of InstructBLIP). We train on a 5 million video-caption subset of WebVid10M Bain et al. (2021). We visualize two example outputs in Figure 25. However, existing agents fail to fully comprehend precise, fine-grained visual details in the video content. A similar limitation is seen in visual instruction tuning methods, which lack the general, human-level perception abilities that remain to be solved by multimodal models and agents.
为了理解转换预训练图像LLMs用于视频理解的实用性,我们暂时扩展并微调了InstructBLIPDai等人(2023)的视频字幕。具体来说,我们使用与Frozen in TimeBain et al.(2021)相同的划分时空注意力方案扩展了InstructBLIP(EVA-CLIP-GSun et al.(2023 b))的视觉编码器,并在训练期间保持Q-former和LLM(Flan-T5-XLChung et al.(2022))冻结。我们冻结视觉编码器的所有空间层,同时在字幕训练期间保持时间层不冻结。这使得我们的模型可以将图像和视频作为输入(与InstructBLIP的图像级性能相匹配)。我们在WebVid 10 MBain et al.(2021)的500万个视频字幕子集上进行训练。 我们在图25中看到了两个输出示例。然而,现有的代理不能完全理解视频内容中的精确的、细粒度的视觉细节。视觉指令调整方法也存在类似的局限性,它们缺乏通用的、人类水平的感知能力,这些能力仍有待于多模态模型和智能体来解决。
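The freezing scheme described above can be expressed as a simple pass over the model's named parameters. The sketch below is an assumption-laden illustration: it presumes that the temporal-attention parameters added by the divided space-time scheme contain "temporal" in their names and live under a "visual_encoder" module, which will differ in an actual InstructBLIP codebase.

```python
# Minimal sketch of the freezing scheme described above: keep the Q-Former, the LLM, and
# all spatial layers of the visual encoder frozen, and train only the newly added temporal
# (divided space-time) attention layers. The name patterns are assumptions; real parameter
# names depend on the specific InstructBLIP implementation.
import torch.nn as nn

def set_trainable_for_video_finetuning(model: nn.Module) -> list[str]:
    trainable = []
    for name, param in model.named_parameters():
        # only temporal-attention parameters inside the visual encoder stay trainable
        is_temporal = ("visual_encoder" in name) and ("temporal" in name)
        param.requires_grad = is_temporal
        if is_temporal:
            trainable.append(name)
    return trainable

# Example with a toy stand-in model (for illustration only):
toy = nn.ModuleDict({
    "visual_encoder": nn.ModuleDict({
        "spatial_attn": nn.Linear(8, 8),
        "temporal_attn": nn.Linear(8, 8),
    }),
    "qformer": nn.Linear(8, 8),
    "llm": nn.Linear(8, 8),
})
print(set_trainable_for_video_finetuning(toy))  # only visual_encoder.temporal_attn.* remain trainable
```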

Figure 25: Example prompts and responses when using the video fine-tuned variant of InstructBLIP (methodology described in Section 6.5). The model is able to produce long textual responses describing the scene and can answer questions related to the temporality of events in the video.
As shown in Fig. 25, the instruction-tuned models show promise in accurately summarizing visible actions within videos and effectively identifying actions like "person sitting on a bench." However, they sometimes add incorrect details, such as "person smiling to the camera," revealing a shortfall in capturing conversation topics or the video’s ambiance, elements that are readily apparent to human observers. This shortfall underscores another key limitation: the omission of audio and speech modalities that would enrich the video understanding with context, aiding in more accurate interpretation and preventing such misrepresentations. Bridging this gap requires a holistic integration of the available modalities, allowing multimodal agents to reach a level of comprehension akin to human perception and ensuring a fully multimodal approach to video interpretation.
Audio-Video-Language Agents with GPT-4V.

Figure 26: The audio multimodal agent described in Section 6.5. Hallucinated content is highlighted in red. We use GPT-4V to generate 1) a video-chat summary from video frames; 2) a video summary from frame captions; and 3) a video summary from frame captions and audio information.

Figure 27: An interactive multimodal agent that incorporates the visual, audio, and text modalities for video understanding. Our pipeline mines hard-negative hallucinations to produce difficult queries for the VideoAnalytica challenge. Further details about the interactive audio-video-language agent dataset are described in Section 9.2.
We then evaluate the capabilities of GPT-4V as a multimodal agent that integrates vision, audio, and speech for a nuanced and precise understanding of videos, following the methodology outlined in Lin et al. (2023). Results depicted in Fig. 26 compare the performance of various video agents on the task of video summarization. The video-instruction tuned model (Li et al., 2023g) provides accurate content but falls short on comprehensiveness and detail, missing specific actions like the methodical use of a broomstick to measure a tree’s height.
To enhance the accuracy of video descriptions, we employ GPT-4V to caption frames, while audio and its transcriptions are sourced from the OpenAI Whisper model. We then prompt GPT-4V to create video summaries using only frame captions and then using both frame captions and audio transcriptions. Initially, we observe that frame captions alone can lead to fabricated events, such as a person biting down on a stick in the third segment. These inaccuracies persist in the video summary, with descriptions like "in a playful twist, he bites down on it while holding it horizontally." Without audio input, the agent cannot correct these captioning errors, resulting in descriptions that are semantically correct but visually misleading.
However, when we provide the audio transcriptions to the agent, it manages to accurately depict the content, even capturing detailed physical actions like "holding the broomstick perpendicular to the body and rotating it downwards." This level of detail is significantly more informative and gives viewers a clearer understanding of the video’s purpose and key details. These findings highlight the importance of integrating audio, video, and language interactions to develop high-quality multimodal agents. GPT-4V emerges as a promising foundation for such advanced multimodal understanding and interaction.
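The pipeline above reduces to assembling frame captions, and optionally Whisper transcript segments, into a single summarization prompt. The sketch below illustrates this assembly step only; the segment format and the downstream GPT-4V call are assumptions, and the example captions and transcript are made up.

```python
# Minimal sketch of assembling a video-summarization prompt from per-frame captions and
# audio transcript segments (the segment format and downstream model call are assumptions).
def build_summary_prompt(frame_captions, transcript_segments=None):
    """frame_captions: list of (timestamp_sec, caption); transcript_segments: list of (start, end, text)."""
    lines = ["Summarize the video using the information below."]
    lines.append("Frame captions:")
    for t, caption in frame_captions:
        lines.append(f"[{t:.0f}s] {caption}")
    if transcript_segments:  # with audio, the agent can correct caption-only hallucinations
        lines.append("Audio transcription:")
        for start, end, text in transcript_segments:
            lines.append(f"[{start:.0f}-{end:.0f}s] \"{text}\"")
    return "\n".join(lines)

captions = [(0, "A person stands next to a tree holding a broomstick."),
            (12, "The person holds the broomstick horizontally.")]
audio = [(0, 10, "I will show you how to estimate the height of a tree."),
         (10, 20, "Hold the stick perpendicular to your body and rotate it downwards.")]
print(build_summary_prompt(captions, audio))  # this string would then be passed to GPT-4V
```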
Embodied Multi-modal Agents with GPT-4V.
As shown in Fig. 27, we mainly used StackOverflow to get the initial question, and then we used the "Bing search" API to retrieve a related video and audio corresponding to the question. Next, we mainly use GPT-4V to obtain the relevant text information and a high-level video description. In addition, we convert the key-frame audio into low-level segment descriptions of the key frames via ASR. Finally, we use GPT-4V to generate convincing "hallucinations" that serve as hard-negative queries for video question answering tasks. We support interactions and question answering on the current frame of the video, as well as summarization for the overall high-level video description. During inference, we also incorporate external knowledge via web search to improve answering capabilities.
The main prompt for GPT-4V is described below. The entire prompt is reproduced for clarity; it is over one page long.
GPT-4V is an assistant that provides descriptive, informative, and fully comprehensive details of the video for visually impaired users who can hear the video but cannot see it. The job is to create high-quality, dense descriptions of the video by synthesizing the given annotations and to output them as JSON. Specifically, GPT-4V will be given the original query used to search for the video, the video title, the description, the audio transcription, and potentially noisy descriptions for specific times in the video. Different segments of the same video are annotated as "[time start - time end (in seconds)] 'text'". Utilize the transcriptions and descriptions together to reason about the exact details and visual demonstrations that might be happening in the video. GPT-4V will combine or segment the timestamps as necessary to provide the best segmentation of the video.

Expectations for GPT-4V Output:

1. Action-Oriented Descriptions: Prioritize plausible actions, motions, and physical demonstrations that the audio implies, enriching your narrative with dynamic visual cues.

2. Complete Video Coverage: Provide a continuous and consistent audio-descriptive experience that covers every moment of the video’s duration, ensuring no content is left undescribed.

3. Concise Segmentation: Construct your descriptions in focused, succinct segments of 1-2 sentences each to effectively communicate visual actions without overwhelming detail.

4. Contextual Audio-Visual Synthesis: Seamlessly blend the spoken audio content with inferred visual elements to form a narrative that reflects potential onscreen activities.

5. Imaginative and Plausible Speculation: Infuse your descriptions with creative yet believable visual details that correspond with the audio, enhancing scene comprehension.

6. Accurate Timecode Correspondence: Align your descriptive segments with the corresponding timecodes, ensuring that speculative visual details synchronize with the audio narrative’s timeline.

7. Confident Narrative Delivery: Present the descriptions with assurance, as though the speculated visuals are occurring, to instill confidence in the listener.

8. Omit Implausible Details: Exclude descriptions of objects or events that do not reasonably fit within the context established by the audio and visual information provided.

The final output should be structured in a JSON format containing a list of dictionaries, each detailing a segment of the video.

[ 'start': <start-time-in-seconds>, 'end': <end-time-in-seconds>, 'text': "<Your detailed single-sentence, audio-visual description here>" ]

For MC Creation: our task is to create multiple-choice questions for video-to-text retrieval tasks that are not trivially solved by looking at the title and reading through the audio transcriptions. To do so, we will be given the original query used to get the video, the description, the audio transcription, and potentially noisy descriptions for specific times in the video.

- Format of audio transcription: [start-end time in seconds] "transcription"

- Format of noisy description: [time in seconds] "description"

We kindly ask GPT-4V to generate four queries, where the primary query is aligned with the video content and the other three negatives are subtly different from the primary one. Selecting the primary query should not simply involve listening to the audio transcriptions (e.g., cases where the original text query is contained in the audio transcriptions). The negatives should be closely related but not fully aligned with the video content, requiring visual understanding of the video to differentiate. For example, modify the semantics in a nuanced way so that one needs to watch the video, rather than just listen, to select the original query. Compile the four queries as caption-like statements, with the first one being the rephrased original.

Think step by step about how you can come up with negative statements using the information from the video. Justify why the negative queries are incorrect yet still compelling choices that demand a nuanced understanding of the video, and explain why humans would not accidentally choose the negatives over the original query.

Finally, we present the work in the following format of analysis and four queries. There is no need to describe how you rephrased the original query.

- Video Analysis: xxx

- Queries: [query1, query2, query3, query4]

- Justification: xxx
6.6 Agent for NLP
6.6.1 LLM agent
Recognizing task directives and taking action has been a fundamental challenge in interactive AI and natural language processing for decades.
With the recent advances in deep learning, there is a growing interest in studying these areas jointly to improve human-agent collaboration.
We identify three specific directions, among others, to improve language-grounded agents:
- Tool use and querying from knowledge bases. This direction emphasizes the importance of integrating external knowledge bases, web search, or other helpful tools into the reasoning processes of AI agents. By leveraging structured and unstructured data from various sources, agents can enhance their understanding and provide more accurate and context-aware responses. Furthermore, it fosters the agent’s ability to proactively seek out information when faced with unfamiliar scenarios or queries, ensuring more comprehensive and informed responses. Examples include Toolformer Schick et al. (2023) and Retrieve What You Need Wang et al. (2023g).
- Improved agent reasoning and planning. Enhancing the agent’s ability to reason and plan is pivotal for effective human-agent collaboration. This involves the development of models that can understand complex instructions, infer user intentions, and predict potential future scenarios. This can be accomplished by asking the agent to reflect on past actions and failures, as in ReAct Yao et al. (2023a), or by structuring the agent’s thought process as a form of search Yao et al. (2023b) (a minimal sketch of such a reasoning-and-acting loop is shown after this list). By simulating different outcomes and assessing the ramifications of various actions, agents can make more informed, context-aware decisions.
•
Incorporating system and human feedback. AI agents can frequently operate in two primary contexts: environments that provide explicit signals about the effectiveness of their actions (system feedback), and settings where they collaborate with humans who can offer verbal critiques (human feedback). This direction underscores the need for adaptive learning mechanisms that allow agents to refine their strategies and rectify mistakes, such as in AutoGen Wu et al. (2023). The ability to continuously learn and adapt from diverse feedback sources ensures that agents remain helpful and aligned for user needs.
系统和人的反馈。人工智能代理通常可以在两种主要环境中工作:一种是提供关于其行为有效性的明确信号的环境(系统反馈),另一种是与能够提供口头批评的人类合作的环境(人类反馈)。这一方向强调了自适应学习机制的必要性,这些机制允许智能体改进其策略并纠正错误,例如AutoGenWu et al.(2023)。不断学习和适应不同反馈源的能力确保了代理保持对用户需求的帮助和一致性。
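To make the reasoning-and-planning direction concrete, the sketch below shows a minimal reasoning-and-acting loop in the spirit of ReAct: the agent alternates Thought, Action, and Observation steps, and a small tool registry stands in for web search or a knowledge base. This is not the original ReAct implementation; the scripted fake_llm is a placeholder so the loop runs without a real model.

```python
# Minimal sketch of a reasoning-and-acting loop in the spirit of ReAct (not the original
# implementation): the agent alternates Thought/Action/Observation steps, and a lookup
# table stands in for a real tool such as web search. `fake_llm` is a scripted placeholder.
def search_tool(query: str) -> str:
    kb = {"capital of france": "Paris is the capital of France."}
    return kb.get(query.lower(), "No result found.")

TOOLS = {"search": search_tool}

def fake_llm(trace: str) -> str:
    # A real agent would call an LLM here with the full trace as the prompt.
    if "Observation:" not in trace:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: I now know the answer.\nFinish[Paris]"

def react_loop(question: str, max_steps: int = 5) -> str:
    trace = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(trace)
        trace += "\n" + step
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].rstrip("]")
        if "Action:" in step:
            action = step.split("Action:", 1)[1].strip()   # e.g. "search[capital of France]"
            tool_name, arg = action.split("[", 1)
            observation = TOOLS[tool_name.strip()](arg.rstrip("]"))
            trace += f"\nObservation: {observation}"
    return "No answer found."

print(react_loop("What is the capital of France?"))  # -> Paris
```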
6.6.2 General LLM agent
Recognizing and understanding agent content and natural language has been a fundamental challenge in interactive AI and natural language processing for decades. With the recent advances in deep learning, there is a growing interest in studying these two areas jointly, for a deep understanding of both agent planning and human feedback for knowledge inference and natural language generation. These are the key components of many human-machine-interaction agents, such as “AutoGen” Wu et al. (2023) and “Retrieve What You Need” Wang et al. (2023g).
6.6.3 Instruction-following LLM agents
Furthermore, the creation of LLM agents that can be trained to effectively follow human instructions has become an important area of research. Initial models used human feedback to train a proxy reward model to simulate human preferences, through a process known as Reinforcement Learning from Human Feedback (RLHF) Ouyang et al. (2022). This process produced models such as InstructGPT and ChatGPT. In order to train instruction-following LLM agents more efficiently without needing human labels, researchers developed a more efficient method for instruction-tuning that trains the LLM agent directly on instruction/response pairs, either generated by humans, as in Dolly 2.0 (see the Dolly 2.0 blog post), or automatically from LLMs, as in Alpaca Taori et al. (2023). We show the overall Alpaca training pipeline in Figure 28.
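A single training example for this kind of instruction-tuning is simply an instruction/response pair rendered into a fixed prompt template. The sketch below approximates the commonly used Alpaca-style template; the exact wording and the seed example are assumptions rather than the official recipe.

```python
# Sketch of formatting an instruction/response pair for supervised instruction-tuning.
# The template approximates the commonly used Alpaca-style format; the exact wording
# should be treated as an assumption rather than the official recipe.
def format_example(instruction: str, response: str, context: str = "") -> str:
    header = "Below is an instruction that describes a task."
    if context:
        return (f"{header}\n\n### Instruction:\n{instruction}\n\n"
                f"### Input:\n{context}\n\n### Response:\n{response}")
    return f"{header}\n\n### Instruction:\n{instruction}\n\n### Response:\n{response}"

seed_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Agent AI systems perceive their environment and act within it.",
    "output": "Agent AI systems sense and act in their environments.",
}
text = format_example(seed_example["instruction"], seed_example["output"], seed_example["input"])
print(text)  # during fine-tuning, the loss is typically computed only on the response tokens
```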

Figure 28: The training recipe used to train the Alpaca model Taori et al. (2023). At a high level, existing LLMs are used to generate a large pool of instruction-following examples from a smaller collection of seed tasks. The generated instruction-following examples are then used to instruction-tune an LLM whose underlying model weights are available.

Figure 29: The logical transformer agent model of Wang et al. (2023e). A logical reasoning module is introduced into the transformer-based abstractive summarization model, endowing the logical agent with the ability to reason over text and dialogue logic so as to generate higher-quality summaries with fewer factual errors.
6.6.4 Experiments and Results
Despite the growing adoption of conversational and self-feedback systems, these forms of AI still do not perform well at generating factually correct responses from their own implicit knowledge, and therefore often use external tools like web search and knowledge retrieval mechanisms at inference time to augment their responses. Addressing this would help create more engaging experiences for users in many real-life applications. In social conversations (such as those on social media platforms like Instagram and Facebook), or on Q&A websites (such as Ask or Quora), people usually engage with others through a series of comments and by web-searching for information and knowledge relevant to the discussion. Thus, the task of generating conversational turns in this context is not simply to bootstrap upon traditional NLP models and tasks, but to use agents to generate dialogue through intelligent behaviors that reflect knowledge search and acquisition Peng et al. (2023). In this way, intelligent agents for NLP tasks extend the task description and improve the interpretability of the response by adding an explicit knowledge search and retrieval step during dialogue. Incorporating these web search and retrieval agents as feedback during dialogue will help to further deepen the social interactions between humans and agents Wang et al. (2023e). As shown in Fig. 29, we introduce a new modeling paradigm for transformer language models that detects and extracts important logical structures and information from input texts and then integrates them into the input embeddings through carefully designed multi-layer hierarchical logical projections, infusing logical structures into pre-trained language models as one kind of NLP agent. Wang et al. (2023e) propose a novel approach to construct logic-aware input embeddings for transformer language models through a combination of logic detection, logic mapping, and hierarchical logical projections, and then develop a corresponding new modeling paradigm that can upgrade all existing transformer language models into logical transformers to consistently boost their performance. The proposed logical transformer agent consistently achieves superior performance over its baseline transformer models through a deeper understanding of the logical structures of texts. To human users, it is often these aspects that are most important for delivering a meaningful and interesting conversation via an agent-based coordination between dialogue and information retrieval.
Delving deeper into natural language processing, this topic covers the advancements and challenges in making LLMs more agentic and better suited for various language-centered tasks.
An open-domain question answering (QA) system usually follows a retrieve-then-read paradigm, in which a retriever is used to retrieve relevant passages from a large corpus, and a reader then generates answers based on the retrieved passages and the original question. In Wang et al. (2023g), we propose a simple and novel mutual learning framework to improve the performance of retrieve-then-read-style models via an intermediate module named the knowledge selector agent, which we train with reinforcement learning. We incorporate this fine-grained knowledge selector into the retrieve-then-read paradigm; its goal is to construct a small subset of passages that retains question-relevant information. As shown in Figure 30, the knowledge selector agent is trained as a component of our novel mutual learning framework, which iteratively trains the knowledge selector and the reader.
We adopt a simple and novel approach employing policy gradients to optimize the knowledge selector agent, using feedback from the reader to train it to select a small and informative set of passages. This approach avoids brute-force search or manually designed heuristics, and does not require any annotated query-document pairs for supervision.
We show that iteratively training the reader and the knowledge selector agent leads to better predictive performance on some public open-domain question answering benchmarks.
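The selector-training idea can be sketched with a REINFORCE-style update: the selector scores the retrieved passages, samples a small subset, and is nudged toward subsets that yield higher reader scores. The sketch below is a simplification, not the authors' implementation; the passage features, the Bernoulli selection scheme, and the reader_reward placeholder are all assumptions.

```python
# Minimal REINFORCE-style sketch of the knowledge selector idea (simplified; not the
# authors' implementation): the selector scores retrieved passages, samples a small
# subset, and is updated with the reader's answer quality as a scalar reward.
import torch
import torch.nn as nn

class KnowledgeSelector(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, passage_feats: torch.Tensor) -> torch.Tensor:
        # passage_feats: (num_passages, dim) -> per-passage selection probabilities
        return torch.sigmoid(self.scorer(passage_feats)).squeeze(-1)

def reader_reward(selected_mask: torch.Tensor) -> torch.Tensor:
    # Placeholder for the frozen reader's answer score (e.g., exact match / F1).
    return selected_mask.float().mean().detach()

selector = KnowledgeSelector()
optimizer = torch.optim.Adam(selector.parameters(), lr=1e-3)

for step in range(100):  # Stage 1 of the mutual learning loop: reader frozen, selector updated
    passage_feats = torch.randn(20, 64)              # features of 20 retrieved passages
    probs = selector(passage_feats)
    dist = torch.distributions.Bernoulli(probs)      # independently keep/drop each passage
    selected = dist.sample()
    reward = reader_reward(selected)                 # feedback from the (frozen) reader
    loss = -(dist.log_prob(selected).sum() * reward)  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```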

Figure 30: The architecture of the proposed NLP agent in the mutual learning framework of Wang et al. (2023g). In each epoch, Stage 1 and Stage 2 are executed alternately. During Stage 1, the parameters of the reader model are kept fixed and only the weights of the knowledge selector are updated. Conversely, during Stage 2, the parameters of the reader model are tuned while the weights of the knowledge selector remain frozen.
7 Agent AI Across Modalities, Domains, and Realities
7.1 Agents for Cross-modal Understanding
Multi-modal understanding is a significant challenge for creating generalist AI agents due to the lack of large-scale datasets that contain vision, language, and agent behavior. More generally, training data for AI agents is often modality-specific. As a result, most modern multi-modal systems use a combination of frozen submodules. Some notable examples are Flamingo Alayrac et al. (2022), BLIP-2 Li et al. (2023c), and LLaVA Liu et al. (2023c), all of which utilize a frozen LLM and a frozen visual encoder. These submodules are trained individually on separate datasets, and adaptation layers are then trained to map the visual encoder’s outputs into the LLM embedding space. In order to make further progress on cross-modal understanding for AI agents, the strategy of using frozen LLMs and visual encoders will likely need to change. Indeed, RT-2, a recent visual-language model capable of taking actions within the domain of robotics, showed significantly improved performance when jointly tuning the visual encoder and LLM for robotics and visual-language tasks Brohan et al. (2023).
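The frozen-submodule recipe described above usually comes down to a small trainable adapter between two frozen backbones. The sketch below illustrates the pattern with toy stand-in modules; the feature dimensions and the single linear adapter are assumptions (real systems such as BLIP-2 use more elaborate adapters like the Q-Former).

```python
# Minimal sketch of the adaptation-layer pattern described above: a frozen visual encoder,
# a frozen LLM, and a small trainable projection that maps visual features into the LLM's
# embedding space. The encoder and dimensions below are toy stand-ins, not real pretrained models.
import torch
import torch.nn as nn

visual_encoder = nn.Linear(1024, 1024)   # stand-in for a frozen ViT/CLIP image encoder
llm_embedding_dim = 4096                 # stand-in for the LLM's token embedding size

for p in visual_encoder.parameters():    # freeze the pretrained visual backbone
    p.requires_grad = False

adapter = nn.Linear(1024, llm_embedding_dim)   # the only trainable component

image_features = visual_encoder(torch.randn(2, 257, 1024))  # (batch, visual tokens, dim)
visual_prefix = adapter(image_features)                      # (batch, visual tokens, 4096)
# `visual_prefix` would be prepended to the (frozen) LLM's text token embeddings.
print(visual_prefix.shape)
```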
7.2 Agents for Cross-domain Understanding
A key challenge for creating generalist agents is the distinctive visual appearance and disparate action spaces across different domains. Humans possess the capability to interpret images and videos from various sources, including the real world, video games, and specialized domains such as robotics and healthcare, once they become familiar with the specific details of these areas. However, existing LLMs and VLMs often demonstrate significant differences between the data they were trained on and the varied domains in which they are applied. Notably, training agent models to predict specific actions presents a considerable challenge when trying to develop a single policy that can effectively learn multiple control systems across domains. Generally, the approach most modern works take when applying systems within specific domains is to start from a pretrained foundation model and then finetune a separate model for each specific domain. This fails to capture any commonalities between domains and results in a smaller total set of training data than would be available by leveraging each domain’s data.
7.3 Interactive agent for cross-modality and cross-reality
Developing AI agents that can successfully understand and perform tasks across different realities is an ongoing challenge that has seen some recent success for image and scene generation Huang et al. (2023a). In particular, it is challenging for agents to simultaneously understand real-world and virtual-reality environments due to their visual dissimilarities and separate environment physics. Within the context of cross-reality, sim-to-real transfer is a particularly important problem when applying simulation-trained policies to the real world, which we discuss in the next section.
7.4 Sim to Real Transfer
Sim-to-real transfer refers to techniques that enable models trained in simulation to be deployed in the real world.
Embodied agents, especially those based on RL policies, are typically trained in simulated environments. These simulations do not fully replicate the characteristics of the real world (e.g., disturbances, light, gravity, and other physical properties). Due to this discrepancy between simulation and reality, models trained in simulation often struggle to perform well when applied in the real world. This issue is known as the “sim-to-real” problem. To solve this problem, several approaches can be taken:
- Domain randomization: domain randomization is a technique that trains a model while randomly varying parameters within a simulation environment (e.g., object appearance, sensor noise, and optical properties) in anticipation of the uncertainties and variations of the real world Tobin et al. (2017) (a minimal sketch is given after this list). For instance, in the context of training RL-based grasping skills, introducing randomness in the shapes of objects can lead to a policy capable of adapting to objects with somewhat different shapes Saito et al. (2022).
- Domain adaptation: Domain adaptation, or domain transfer, is a technique that bridges the gap between simulated and real-world domains by training models with a large number of simulated images and a smaller set of real-world images. In practical settings, unpaired image-to-image translation methods such as CycleGAN Zhu et al. (2017b) are employed due to the difficulty of preparing paired images across domains. Several enhanced versions exist for reinforcement learning, including RL-CycleGAN Rao et al. (2020), and for imitation learning, such as RetinaGAN Ho et al. (2021).
- Improvement of simulation: Realistic simulation is key for sim-to-real transfer. Part of this effort is achieved by system identification techniques Zhu et al. (2017c); Allevato et al. (2020), which aim to identify simulation parameters that mimic the real-world environment. Additionally, the use of photorealistic simulators is effective for image-based reinforcement learning Martinez-Gonzalez et al. (2020); Müller et al. (2018); Shah et al. (2018); Sasabuchi et al. (2023).
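As referenced in the domain randomization item above, the technique amounts to sampling a fresh set of simulator parameters for every training episode. The sketch below is illustrative only; the parameter names, their ranges, and the run_episode placeholder are assumptions rather than any particular simulator's API.

```python
# Minimal sketch of domain randomization: sample new visual and physical parameters for
# each training episode so the learned policy must cope with real-world variation.
# The parameter ranges and the `run_episode` placeholder are illustrative assumptions.
import random

def sample_sim_params() -> dict:
    return {
        "object_scale": random.uniform(0.8, 1.2),       # vary object shapes/sizes
        "object_rgb": [random.random() for _ in range(3)],
        "friction": random.uniform(0.5, 1.5),
        "light_intensity": random.uniform(0.3, 1.0),
        "camera_noise_std": random.uniform(0.0, 0.02),  # sensor noise
    }

def run_episode(params: dict) -> float:
    # Placeholder: reset the simulator with `params`, roll out the policy, return the reward.
    return random.random()

for episode in range(5):
    params = sample_sim_params()
    reward = run_episode(params)
    print(f"episode {episode}: scale={params['object_scale']:.2f} reward={reward:.2f}")
```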
Sim-to-real transfer remains a central challenge in the study of embodied agents, and approaches continue to evolve. Both theoretical and empirical research are essential to advance these technologies further.
8 Continuous and Self-improvement for Agent AI
Currently, foundation-model-based AI agents have the capacity to learn from multiple different data sources, which allows for more flexible sources of training data. Two key consequences of this are that (1) user and human-based interaction data can be used to further refine and improve the agent, and (2) existing foundation models and model artifacts can be used to generate training data. We discuss each of these in more detail in the following sections, but we note that since current AI agents are largely tied to existing pretrained foundation models, they generally do not learn from continuous interaction with their environments. We think this is an exciting future direction, and initial work by Bousmalis et al. has shown that self-improving agents for robotic control are able to continuously learn and improve through environmental interactions without supervision Bousmalis et al. (2023).
8.1 Human-based Interaction Data
The core idea behind using human-based interaction data is to leverage a large number of agent-human interactions to train and improve future iterations of the agent. There are several strategies used to improve agents from human-agent interactions.
- Additional training data. Perhaps the simplest usage of human-agent interactions is to use the interaction examples themselves as training data for a future iteration of the agent. This generally requires filtering strategies to differentiate successful agent examples from unsuccessful interaction examples (a minimal filtering sketch is given after this list). Filtering can be rules-based (e.g., reaching some desired end goal state), model-based (e.g., classifying successful vs. unsuccessful interactions), or manually selected after a post-hoc inspection and/or modification of the interaction examples.
- Human preference learning. During interaction with the user, the agent system can prompt the user with several different model outputs and allow the user to select the best output. This is commonly used by LLMs like ChatGPT and GPT-4, whereby users can select one output (out of several) that aligns best with their preferences.
- Safety training (red-teaming). Red-teaming within the context of Agent AI refers to having a dedicated team of adversaries (either human or computer) that seek to exploit and expose weaknesses and vulnerabilities within the Agent AI system. Although adversarial in nature, red-teaming is commonly used as a means for understanding how to improve AI safety measures and reduce the occurrence of harmful outputs. The core principle is to discover consistent methods for inducing unwanted agent outputs so that the model can be trained on data that explicitly corrects this behavior.
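As referenced in the first item above, the simplest rules-based filter keeps only interactions that reached a desired end state and converts their turns into training pairs. The sketch below assumes a hypothetical episode schema (fields such as reached_goal and turns) purely for illustration.

```python
# Minimal sketch of rules-based filtering of human-agent interaction logs into training
# data: keep only episodes that reached a desired end state. The episode schema (fields
# such as "reached_goal" and "turns") is an assumption for illustration.
def filter_successful_episodes(episodes: list[dict]) -> list[dict]:
    training_examples = []
    for ep in episodes:
        if not ep.get("reached_goal", False):   # rules-based success criterion
            continue
        for turn in ep["turns"]:
            training_examples.append({"input": turn["user"], "target": turn["agent"]})
    return training_examples

logs = [
    {"reached_goal": True,
     "turns": [{"user": "Book a table for two.", "agent": "Done: table for two at 7pm."}]},
    {"reached_goal": False,
     "turns": [{"user": "Book a flight.", "agent": "Sorry, I could not find flights."}]},
]
print(filter_successful_episodes(logs))  # only the successful episode is kept
```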
8.2 Foundation Model Generated Data
With the advent of powerful foundation model artifacts produced by academia and industry, there have been a variety of methods developed to extract and generate meaningful training data from these artifacts using a variety of prompting and data-pairing techniques.
- LLM Instruction-tuning. Methods for generating instruction-following training data from LLMs have allowed for the finetuning of smaller, open-source models based on the outputs of larger proprietary LLMs Wang et al. (2022b). For example, Alpaca Taori et al. (2023) and Vicuna Zheng et al. (2023) are LLMs based on the open-source LLaMA family Touvron et al. (2023) that have been tuned on various outputs from ChatGPT and human participants. This method of instruction tuning can be viewed as a form of knowledge distillation, where the larger LLM serves as a teacher model to a smaller student model. Importantly, although LLM instruction-tuning has been shown to transfer the writing style and some instruction-following capabilities of the teacher model to the student model, significant gaps still exist between the factuality and capabilities of the teacher and student models Gudibande et al. (2023).
- Vision-language pairs. A number of recent works have sought to increase the amount and diversity of pretraining data available to visual-language models by automatically generating captions and other text for visual content. For example, LLaVA Liu et al. (2023c) uses 150,000 examples of instruction-following behavior from textual and visual inputs that are mainly LLM-generated. Other work has shown that using VLMs to re-caption images can improve the training data and the subsequent quality of image generation models Segalis et al. (2023). Within the realm of video understanding, using VLMs and LLMs to recaption videos has been shown to improve the performance and quality of subsequent VLMs trained on the recaptioned videos Wang et al. (2023f); Zhao et al. (2022).
9 Agent Dataset and Leaderboard
To accelerate research in this domain, we propose two benchmarks, for multi-agent gaming and agentic vision-language tasks respectively. We will release two new datasets - “CuisineWorld” and “VideoAnalytica” - and a set of baseline models, encouraging participants to explore new models and systems and to submit their results on the test set of our leaderboard.
9.1 “CuisineWorld” Dataset for Multi-agent Gaming
CuisineWorld is a text-based game reminiscent of Overcooked! It offers a platform for AI-powered agents to cooperate and play in tandem. This dataset will test the collaboration efficiency of multi-agent systems, offering insights into how well LLMs and other systems can work together in dynamic scenarios. In particular, the dataset will focus on how well the agents understand goals, and how well the agents can coordinate among themselves. Two types of modes are supported in this dataset: a centralized dispatcher mode and a decentralized mode. Participants can choose a play mode and make a submission to our leaderboard.
9.1.1 Benchmark
For our competition, we will release the CuisineWorld benchmark, which includes a text interface with extendable task definition files, as well as interfaces for multi-agent interaction and human-machine interaction. We introduce the gaming interaction task, in which the goal is to generate relevant, appropriate multi-agent collaboration strategies that maximize collaboration efficiency. We evaluate collaboration efficiency with the proposed evaluation metric, CoS.
The “CuisineWorld” dataset was collected by Microsoft, UCLA, and Stanford University. The goal of the competition is to explore how different existing and novel grounded-LLM and interactive techniques perform on this benchmark, and to establish strong baselines for the task of multi-agent gaming infrastructure.
The dataset of CuisineWorld includes:
- A selection of well-defined multi-agent collaboration tasks.
- An API system to facilitate agent interactions.
- An automatic evaluation system.
(The link for downloading the dataset will soon be made available and this article will be updated to include it here.)
9.1.2 Task
- We provide a dataset and a related benchmark, called Microsoft MindAgent, and correspondingly release the “CuisineWorld” dataset to the research community.
- We will provide benchmarks to evaluate and rank the submitted “MindAgent" algorithms. We will also provide baseline results generated using popular infrastructures.
9.1.3 Metrics and Judging
The quality of multi-agent collaboration efficiency is determined by the new CoS auto-metric (from MindAgent Gong et al. (2023a)).
The final rating of our metric is calculated as the average of the evaluated collaboration efficiency metrics of the multi-agent system over all tasks.
Human evaluators will be asked to rate individual responses as well as provide subjective judgements of the engagement, breadth, and overall quality of the users’ interactions with the agents.
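Under the stated scheme, the final rating is simply the mean of the per-task CoS values. The sketch below illustrates that computation; the task names and scores are made-up placeholders, not results from the benchmark.

```python
# Sketch of the final rating computation described above: the overall score is the mean of
# the per-task collaboration efficiency (CoS) values. The task scores below are made up.
def overall_rating(task_cos_scores: dict[str, float]) -> float:
    return sum(task_cos_scores.values()) / len(task_cos_scores)

scores = {"task_01": 0.72, "task_02": 0.65, "task_03": 0.81}
print(f"final rating: {overall_rating(scores):.3f}")  # -> 0.727
```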
9.1.4 Evaluation
- Automated Evaluation. We plan to release a leaderboard; starting on the release date (TBA), registered participants will be asked to submit their results on the task associated with the dataset “CuisineWorld" (our publicly released dataset for the leaderboard). Submission of results will close on the end date (TBA). Each team will be required to submit their generated results on the testing set for automated evaluation of the CoS metric.
- Human Evaluation on our leaderboard. The leaderboard participants will need to provide a submission file generated locally by the evaluation scripts. We will use the EvalAI system to check the submission file and optionally rerun the code of the top challenge contenders. Therefore, teams must also submit their code with a README file on how to run it. Human evaluation will be performed by the organizing team.
- Winner Announcement. We will announce the winners and post the final ratings of the submissions on our leaderboard.
9.2 Audio-Video-Language Pre-training Dataset.
We introduce VideoAnalytica: a new benchmark for analytical video demonstration comprehension. VideoAnalytica focuses on leveraging video demonstrations as aids to better understand complex, high-level reasoning embedded within long-form instructional videos. The objective is to evaluate the cognitive reasoning abilities of video-language models, pushing them beyond mere recognition tasks and basic comprehension, towards a more sophisticated and nuanced understanding of videos. Crucially, VideoAnalytica emphasizes the integration of multiple modalities, such as audio, video, and language, as well as the ability of models to apply domain-specific knowledge to contextualize and interpret the information presented in the videos.
Specifically, VideoAnalytica involves two primary tasks:
1. Video Text Retrieval: This task involves accurately retrieving relevant text from the instructional videos. The challenge lies in distinguishing between relevant and irrelevant information, which requires a deep understanding of the video content and an analysis of the demonstration to retrieve the correct query. To further increase the complexity of these tasks, we introduce hard negatives, generated by large language models, into our dataset. We run human validation on the generated negatives and remove instances that would make the task invalid or unfair (e.g., negatives that are in fact valid).
2. Video Assisted Informative Question Answering: This task requires the model to answer questions based on the information extracted from the videos. The focus is on complex questions that require analytical reasoning and a thorough comprehension of the video demonstration.
To facilitate the development of an audio-video-language agent for analytical video understanding, we introduce a benchmark leaderboard for the two tasks from VideoAnalytica.
- The leaderboard participants will need to submit their solutions for evaluation. The evaluation will be based on the model’s performance on the two tasks, and the results will be displayed on the leaderboard. Participants are required to submit their code, along with a detailed explanation of their approach and methodology.
- Ethical considerations: The leaderboard focuses on understanding and interpreting video content, which could potentially be used in surveillance or other privacy-invasive applications. Therefore, it’s crucial to consider the ethical implications and potential misuse of the technology. We encourage participants to consider these aspects in their submissions and promote the ethical use of AI.
10 Broader Impact Statement
This article and our associated forum (https://multimodalagentai.github.io) aim to be a catalyst for innovative research, fostering collaborations that will drive the next wave of AI applications. By focusing on multimodal agents, we emphasize the future direction of human-AI interactions, challenges, and solutions. We detail three ways in which we make significant contributions to the broader community.
Firstly, we hope our forum encourages AI researchers to develop solutions motivated by real-world problems in gaming, robotics, healthcare, and long-video understanding. Specifically, the development of multimodal agents in gaming could lead to more immersive and personalized gaming experiences, thereby transforming the gaming industry. In robotics, the development of adaptive robotic systems could revolutionize industries ranging from manufacturing to agriculture, potentially addressing labor shortages and improving efficiency. In healthcare, the use of LLMs and VLMs as diagnostic agents or patient care assistants could lead to more accurate diagnoses, improved patient care, and increased accessibility to medical services, particularly in underserved areas. Furthermore, the ability of these models to interpret long-form videos could have far-reaching applications, from enhancing online learning to improving technical support services. In general, the topics covered in our forum will have significant downstream effects on a wide range of industries and people across the world.
Secondly, we hope our forum stands as a valuable resource for AI practitioners and researchers alike, serving as a platform to explore and deeply comprehend the diverse and complex challenges that come with implementing AI agents across a wide variety of environments and situations. This exploration includes, for instance, understanding the specific limitations and potential hazards linked to agentic AI systems when they are developed for specialized sectors such as healthcare diagnostics. In this domain, issues like dangerous hallucinations in AI behavior can pose significant risks, highlighting the critical need for meticulous design and testing. However, these specific challenges may not be equally relevant or noticeable when considering AI agents crafted for the gaming industry. In such recreational fields, developers might instead prioritize tackling different hurdles, such as the need for AI to perform more open-ended generation and exhibit creativity, adapting dynamically to unpredictable gameplay scenarios and player interactions. By attending the forum, participants will gain insights into how these varied environments dictate the focus and direction of AI development, and how best to tailor AI solutions to meet these distinct needs and overcome the pertinent challenges.
Thirdly, the various elements of our event, including the expert presentations, informative posters, and notably the winners of our two leaderboards, are set to offer a substantive yet succinct overview of the latest and most significant trends, research directions, and innovative concepts in the realm of multimodal agents. These presentations will encapsulate pivotal findings and developments, shining a light on new systems, ideas, and technologies in the field of multimodal agent AI. This assortment of knowledge is not only beneficial for the attendees of our forum, who are looking to deepen their understanding and expertise in this domain, but it also serves as a dynamic and rich resource. Those visiting our forum’s website can tap into this reservoir of information to discover and understand the cutting-edge advancements and creative ideas steering the future of multimodal agent AI. We strive to serve as a useful knowledge base for both newcomers and veterans in the field. By engaging with these resources, we hope participants and online visitors alike can remain informed of the transformative changes and novel approaches that are shaping the exciting landscape surrounding multimodal agent AI.
11 Ethical Considerations
Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation for bots and AI agents, and assist in productivity applications, helping to replay, paraphrase, predict actions, or synthesize 3D or 2D scenarios. Fundamental advances in agent AI help contribute towards these goals, and many would benefit from a greater understanding of how to model embodied and empathetic agents in a simulated reality or the real world. Arguably, many of these applications could have positive benefits.
However, this technology could also be used by bad actors. Agent AI systems that generate content can be used to manipulate or deceive people. Therefore, it is very important that this technology is developed in accordance with responsible AI guidelines, for example by explicitly communicating to users that content is generated by an AI system and by providing the user with controls to customize such a system. It is possible that Agent AI could be used to develop new methods to detect manipulative content - partly because it is rich with the hallucination behavior of large foundation models - and thus help address another real-world problem.
For example: 1) In the health domain, the ethical deployment of LLM and VLM agents, especially in sensitive areas like healthcare, is paramount. AI agents trained on biased data could potentially worsen health disparities by providing inaccurate diagnoses for underrepresented groups. Moreover, the handling of sensitive patient data by AI agents raises significant privacy and confidentiality concerns.
2) In the gaming industry, AI agents could transform the role of developers, shifting their focus from scripting non-player characters to refining agent learning processes. Similarly, adaptive robotic systems could redefine manufacturing roles, necessitating new skill sets rather than replacing human workers. Navigating these transitions responsibly is vital to minimize potential socio-economic disruptions.
Furthermore, Agent AI focuses on learning collaboration policies in simulation, and there is some risk in directly applying these policies to the real world due to distribution shift. Robust testing and continual safety monitoring mechanisms should be put in place to minimize the risks of unpredictable behaviors in real-world scenarios.
Our “VideoAnalytica" dataset is collected from the Internet, which is not a fully representative source, so we have already gone through the ethical review and legal process at both Microsoft and the University of Washington. Be that as it may, we also need to understand biases that might exist in this corpus. Data distributions can be characterized in many ways. In this workshop, we have captured how the agent-level distribution in our dataset differs from other existing datasets. However, there is much more than can be included in a single dataset or workshop. We would argue that there is a need for more approaches and discussion linked to real tasks or topics, and that this is enabled by making these data and systems available.
We will dedicate a segment of our project to discussing these ethical issues, exploring potential mitigation strategies, and deploying a responsible multi-modal AI agent. We hope to help more researchers answer these questions together via this paper.
12 Diversity Statement
By examining the adaptability of AI agent models in various domains, we inherently embrace a diversity of challenges, perspectives, and solutions. In this vein, our project aims to build a diverse community by exploring the wide array of subjects in multimodal and agentic AI.
With these principles in mind, this project focuses on advanced multimodal systems that interact effectively within both physical and virtual environments and facilitate effective interaction with humans. As such, we intend to engage a broad range of experts and practitioners across a wide range of technical specialties, cultures, countries, and scholarly fields to discuss important topics, including but not limited to:
- Application of foundation models: the development of agents with integrated modalities (audio, image, text, sensor inputs), aiming to enhance their recognition and response capabilities for a wide variety of applications.
- General-purpose end-to-end systems: the development of end-to-end models that are trained with large-scale data, seeking to create versatile and adaptable AI solutions.
- Methodologies for grounding modalities: integrating information across various modalities, enhancing the coherence and efficacy of data processing.
- Intuitive human interface: the development of effective and meaningful interaction between humans and agents.
- Taming LLM/VLMs: exploring new approaches to address common issues in large-scale models, such as hallucinations and biases in their outputs.
We aspire to broaden our collective understanding of the potential and limitations of agentic AI by leveraging our unique and diverse perspectives. We strongly believe that this approach will not only enrich individual perspectives, but will also enhance the community’s collective knowledge and promote a holistic view that is more inclusive of the wide-ranging challenges faced by multimodal AI agents.
Appendix for
Agent AI
Appendix A GPT-4V Agent Prompt Details
Unless otherwise specified, we use the default system prompt for GPT-4V. We show a detailed description of the process used to prompt GPT-4V for Minecraft and how we generate GPT-4V’s responses in Fig. 31.

Figure 31: We provide code snippets showing the full prompting process for calling GPT-4V on Minecraft videos. First, we encode the frames of the video and then call the GPT-4V endpoint. The model’s response is shown at the bottom of the screen.
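For readers without access to Fig. 31, the following is a minimal sketch of the pipeline the caption describes (encode frames, then call the GPT-4V endpoint), assuming the OpenAI Python SDK and a directory of pre-extracted JPEG frames; the directory name, prompt text, and model identifier are illustrative placeholders rather than the exact values used in the figure.

```python
import base64
from pathlib import Path

from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) and OPENAI_API_KEY set

def to_data_url(image_path: Path) -> str:
    """Base64-encode one frame as a data URL that the GPT-4V endpoint accepts."""
    b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

# Hypothetical directory of pre-extracted Minecraft frames.
frame_paths = sorted(Path("minecraft_frames").glob("*.jpg"))

# One user message: a text instruction followed by the encoded frames in order.
content = [{"type": "text",
            "text": "These are sequential frames from a Minecraft video. "
                    "Describe the player's high-level actions."}]
content += [{"type": "image_url", "image_url": {"url": to_data_url(p)}}
            for p in frame_paths]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # GPT-4V model name at the time of writing
    messages=[{"role": "user", "content": content}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

Passing the frames as an ordered list of image items within a single user message mirrors the frame-by-frame prompting used for Minecraft; the grid alternative is discussed in Appendix B.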
Appendix B GPT-4V for Bleeding Edge
Bleeding Edge is a third-person, team-based combat game in which players attempt to capture objective points or collect more resources than the enemy team. We show an example input and output when prompting GPT-4V for the game Bleeding Edge in Fig. 32. Compared to Minecraft, we qualitatively find that GPT-4V has a less thorough understanding of the visual content and game rules. This is likely due to (1) the amount of Minecraft data present in GPT-4V’s training data and (2) the visual complexity of Bleeding Edge compared to Minecraft.

Figure 32: GPT-4V is able to generate meaningful content for games with a third-person perspective and visually complex scenes, such as Bleeding Edge. To input a large number of frames (48) to GPT-4V, we input the frames as a grid, with the frame number overlaid on each frame (shown above).
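As a sketch of the grid construction described in Fig. 32, the snippet below tiles sampled frames into a single composite image and overlays each frame's index with OpenCV before the composite is sent to GPT-4V; the 8x6 layout, tile size, and font settings are illustrative assumptions.

import cv2
import numpy as np

def make_frame_grid(frames, cols=8, rows=6, tile=(320, 180)):
    # Tile up to cols*rows BGR frames into one image, overlaying each frame's index.
    tiles = []
    for i, frame in enumerate(frames[: cols * rows]):
        t = cv2.resize(frame, tile)
        cv2.putText(t, str(i), (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, (255, 255, 255), 2)  # frame number in the top-left corner
        tiles.append(t)
    blank = np.zeros((tile[1], tile[0], 3), dtype=np.uint8)  # pad so the grid is complete
    tiles += [blank] * (cols * rows - len(tiles))
    grid_rows = [np.hstack(tiles[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(grid_rows)

# Usage: grid = make_frame_grid(list_of_bgr_frames); cv2.imwrite("grid.jpg", grid)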
Appendix C GPT-4V for Microsoft Flight Simulator
附录C GPT-4V for Microsoft Flight Simulator

Figure 33: GPT-4V is able to generate meaningful content descriptions for Microsoft Flight Simulator. The GPT-4V agent can provide a high-level action description of the player flying an aircraft in the flight simulator, shown through the cockpit perspective and external views of the plane, managing various flight controls and instruments to maintain appropriate airspeed and altitude while navigating the virtual airspace. To input a long video, we select a few key frames (6) to send to GPT-4V. We input each key frame individually rather than as a grid (shown above).
As shown in Fig. 33, a GPT-4V based agent can provide the high-level action description for the player in Microsoft Flight Simulator. It describes how the player is in the process of flying an aircraft, shown by the cockpit perspective and external views of the plane, managing various flight controls and instruments to maintain appropriate airspeed and altitude while navigating through the virtual airspace.
如图33所示,基于GPT-4V的代理可以为微软飞行模拟器中的玩家提供高级动作描述。它描述了玩家如何在驾驶飞机的过程中,通过驾驶舱的视角和飞机的外部视图显示,管理各种飞行控制和仪器,以保持适当的空速和高度,同时在虚拟空域中导航。
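The key-frame strategy used for the longer console-game videos (e.g., the 6 frames in Fig. 33) can be sketched as uniform sampling over the clip, with each selected frame then encoded and appended to the request individually rather than composed into a grid; the frame count and sampling scheme below are illustrative assumptions.

import cv2
import numpy as np

def sample_key_frames(video_path, num_frames=6):
    # Uniformly sample num_frames frames across the whole video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Each sampled frame can then be base64-encoded and added to the message content
# list as a separate image, as in the Minecraft example in Appendix A.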
Appendix D GPT-4V for Assassin’s Creed Odyssey
附录DGPT-4V刺客信条奥德赛

Figure 34: GPT-4V is able to generate meaningful content descriptions for the game Assassin's Creed Odyssey. The GPT-4V-based agent can also provide a high-level action description: "The character in the images is engaged in nighttime combat during a historical battle, wielding a spear and utilizing special abilities, such as the Bull Rush, to fight multiple enemies marked with level indicators, while managing their health and abilities in a third-person action role-playing setting." To input a long video, we select a few key frames (8) to send to GPT-4V. We input each key frame individually rather than as a grid (shown above).
As shown in Fig. 34, the GPT-4V agent provides a high-level action description: the character in the images is engaged in nighttime combat during a historical battle, wielding a spear and utilizing special abilities, such as the Bull Rush, to fight multiple enemies marked with level indicators, while managing their health and ability meters in a third-person action role-playing setting. GPT-4V also describes the player as engaged in a large-scale melee combat scenario, controlling a character with red accents on their armor, and being prompted to use a "Bull Rush Ability" as indicated by the on-screen instructions. The environment is dark and fiery, suggesting a battlefield, and the player's character faces multiple enemies marked with level indicators and health bars overhead. The action is part of a tutorial or combat sequence in which the player learns and executes combat abilities.
如图34所示,GPT-4V智能体提供了图像中的角色在历史战斗期间参与夜间战斗的高级动作描述,挥舞长矛并利用特殊能力(例如奔牛)来对抗标记有等级指示符的多个敌人,同时在第三人称动作角色扮演设置中管理他们的健康和能力计量器。GPT-4V还描述了玩家正在进行一场大规模的近战战斗场景,控制一个盔甲上有红色口音的角色,并被提示使用屏幕上指示的“奔牛能力”。环境是黑暗和火热的,暗示着一个战场,玩家的角色面临着多个标有等级指示器和生命值条的敌人。该动作是教程或战斗序列的一部分,玩家在其中学习并执行战斗能力。
Appendix E GPT-4V for GEARS of WAR 4
附录EWAR 4齿轮的GPT-4V

Figure 35: GPT-4V is able to generate meaningful content descriptions for GEARS of WAR 4. The GPT-4V agent provides a high-level action description of the player engaging in combat scenarios within a visually rich, futuristic environment reminiscent of an industrial or military complex. The armored character, equipped with firearms, navigates the area, using cover and aiming systems to engage enemies. The captured actions include the player character aiming from behind cover, firing at enemies, taking damage as indicated by on-screen visual cues, and engaging in close-quarters combat. The environment features nighttime lighting, with artificial light sources highlighting areas of interest or action. There are also moments where the player character appears as a translucent, outlined figure, which may indicate a special ability or a temporary state such as invincibility or stealth. To input a long video, we select a few key frames (10) to send to GPT-4V. We input each key frame individually rather than as a grid (shown above).
As shown in Fig. 35, based on the sequence of 10 video frames provided, the GPT-4V agent can provide a general description of the player's actions in the video game:
如图35所示,基于所提供的10帧视频的序列,GPT-4V代理可以提供玩家在视频游戏中的动作的一般描述:
“The player is navigating a character through an industrial environment at night, taking strategic cover behind various objects like crates and walls. The character is armed and engages in combat with enemy entities. The player uses firearms and possibly melee attacks, as indicated by close-quarters combat. They appear to manage their ammunition and health, indicated by visual cues such as a flashing red indicator on the screen, which often signifies damage to the character.
“玩家在夜间通过工业环境导航角色,在各种物体(如板条箱和墙壁)后面进行战略掩护。该角色是武装的,并与敌人实体进行战斗。玩家使用火器,可能也会使用近战攻击,如近距离战斗所示。他们似乎管理他们的弹药和健康,通过视觉提示,如屏幕上闪烁的红色指示灯,这通常意味着对角色的伤害。
Additionally, the player seems to be moving tactically, shifting between different cover positions to advance or defend against the enemy. There is evidence of a protective shield being activated at some point, which suggests the player is responding to incoming fire by using defensive abilities. The character also seems to be reloading their weapon, suggesting an element of resource management during combat. Blood and impact effects indicate successful hits on the enemies.
此外,玩家似乎在战术上移动,在不同的掩护位置之间移动,以推进或防御敌人。有证据表明保护盾在某个时候被激活,这表明玩家正在通过使用防御能力来应对来袭的火力。这个角色似乎也在重新装填武器,暗示了战斗中的资源管理元素。血和冲击效果表示成功击中敌人。
The player’s character collaborates with allies, as other characters are seen in the vicinity, engaging in similar combat actions. The environment has interactive elements, such as exploding barrels, which the player might use to their advantage.
玩家的角色与盟友合作,因为附近可以看到其他角色,参与类似的战斗行动。环境中有互动元素,如爆炸桶,玩家可能会利用它们的优势。
The scenes depict fast-paced action with a focus on strategic movement, shooting, and situational awareness to overcome adversaries and possibly achieve a specific objective within the game."
这些场景描绘了快节奏的动作,重点是战略移动,射击和态势感知,以克服对手,并可能在游戏中实现特定目标。"
Appendix F GPT-4V for Starfield
附录FStarfield的GPT-4V

Figure 36: GPT-4V is able to generate meaningful content descriptions for the X-Box console game Starfield. The GPT-4V agent provides a high-level action description for the space-themed exploration game. To input a long video, we select a few key frames (10) to send to GPT-4V. We input each key frame individually rather than as a grid (shown above).
As shown in Fig. 36, based on the sequence of 10 video frames provided for Starfield, the GPT-4V agent provides a general description of the player's actions in the video game. According to GPT-4V:
"The images depict a series of events within the game Starfield, showcasing the player’s interactions within a mining environment. Here’s a description of the sequence:
如图36所示,基于为Starfield提供的10帧视频的序列,GPT-4V代理提供视频游戏中玩家动作的一般描述。根据GPT-4V:“这些图像描绘了游戏Starfield中的一系列事件,展示了玩家在采矿环境中的互动。以下是序列的描述:
1) The player stands in a dimly lit underground cavern with stalactites, accompanied by another character in similar space gear. They face a highlighted waypoint, indicating an objective or location of interest.
1)玩家站在一个有钟乳石的昏暗地下洞穴中,旁边还有另一个穿着类似太空装备的角色。他们面对一个突出显示的航路点,表明一个目标或感兴趣的位置。
2) Moving through an industrial interior, likely a mining station, the player approaches another character. The HUD (Heads-Up Display) shows a dialogue line from "Supervisor Lin," which suggests NPC (non-player character) interaction and possibly a mission briefing or update.
2)玩家穿过一个工业内部,可能是一个采矿站,接近另一个角色。HUD(平视显示器)显示来自“林主管”的对话行,这表明NPC(非玩家角色)的互动和可能的使命简报或更新。
3) The player seems to be following the NPC, Supervisor Lin, through the cavern, hinting at a guided segment of gameplay where the NPC leads the player to a specific location or task.
3)玩家似乎跟随NPC林主管穿过洞穴,暗示游戏中有一个引导部分,NPC将玩家带到一个特定的位置或任务。
4) A dialogue from Supervisor Lin suggests a decision point or mission update, possibly directing the player to a new objective related to a "big vein" of resources.
4)来自林主管的对话建议一个决定点或使命更新,可能会引导玩家到一个与资源“大矿脉”相关的新目标。
5) The player encounters Supervisor Lin again, who provides instruction or information about metal deposits, implying the start of a mining task or exploration of a new area.
5)玩家再次遇到林主管,他提供有关金属矿床的指示或信息,意味着开始采矿任务或探索新的区域。
6) In this frame, the player appears to be observing or interacting with a piece of machinery, with another character, Isabelle, indicating an action or dialogue related to equipment operation or technical work.
6)在这一帧中,玩家似乎正在观察或与一台机器互动,另一个角色伊莎贝尔表示与设备操作或技术工作有关的动作或对话。
7) The player looks down at a rough terrain with two glowing objects, which could represent resources to be collected or points of interest for the mission.
7)玩家向下看有两个发光物体的粗糙地形,这可能代表要收集的资源或使命的兴趣点。
8) The player examines a specific resource, indicated by a HUD marker as "Beryllium Deposit," suggesting a resource gathering or mining activity.
8)玩家检查一个特定的资源,由HUD标记指示为“Berkshire存款”,表明资源收集或采矿活动。
9) The player traverses through the caverns, potentially searching for additional resources or navigating to the next objective, as indicated by another character in space gear ahead.
9)玩家穿越洞穴,可能会寻找额外的资源或导航到下一个目标,如前面太空装备中的另一个角色所示。
10) The player is positioned in a darker area of the cavern, with two glowing markers indicating a path or objective points to be reached.
10)玩家被放置在洞穴的一个较暗的区域,有两个发光的标记指示要到达的路径或目标点。
Throughout these images, the player seems to be engaged in a mission involving exploration, NPC interaction, resource collection, and possibly machinery operation within a mining environment. The HUD elements like waypoints, dialogue text, and mission updates provide context to the player’s actions and the game’s objectives."
在这些图像中,玩家似乎在从事一项使命,涉及勘探,NPC互动,资源收集,并可能在采矿环境中操作机械。HUD元素,如航路点、对话文本和使命更新,为玩家的行动和游戏目标提供了背景。"
Author Biographies 作者简介
Zane Durante 赞内·杜兰特
PhD Student in Computer Science at Stanford University
斯坦福大学计算机科学博士生
Email: durante@stanford.edu
电子邮件地址:durante@stanford.edu
Web: https://zanedurante.github.io/
网站:https://zanedurante.github.io/
Zane Durante is a PhD student at the Stanford Vision Lab advised by Dr. Fei-Fei Li. His research interests include video understanding, connecting computer vision with natural language, and AI applications in hospital care. He is leading an ongoing collaboration between Stanford Medical School and the Stanford Vision Lab to develop a dataset of natural language descriptions of video clips of hospital scenes. His PhD work is supported by NSF's Graduate Research Fellowship.
赞内·杜兰特是斯坦福大学视觉实验室的博士生,由李飞飞博士担任顾问。H他的研究兴趣包括视频理解,将计算机视觉与自然语言连接起来,以及AI在医院护理中的应用。他正在领导斯坦福大学医学院和斯坦福大学视觉实验室之间的持续合作,以开发医院场景视频剪辑的自然语言描述数据集。他的博士工作得到了NSF研究生研究奖学金的支持。
Qiuyuan Huang 黄秋媛
Principal Researcher at Microsoft Research, Redmond, WA, USA.
美国华盛顿州雷德蒙微软研究院首席研究员。
Email: qihua@microsoft.com
电子邮件地址:qihua@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/qihua/
网站:https://www.microsoft.com/en-us/research/people/qihua/
Qiuyuan Huang is a principal researcher in the deep learning group at Microsoft Research (MSR), Redmond, WA. Her current research interests are mainly in the deep learning, multi-modality, and natural language processing areas, specifically on Agent AI for Gaming, Robotics and Healthcare; Knowledge-reasoning Intelligence for Interactive AI; Neuro-symbolic Computation for Inference Reasoning; and Large Foundation models for NLP and Multi-modality.
Qiuyuan Huang是位于华盛顿州雷德蒙的微软研究院(MSR)深度学习小组的首席研究员。她目前的研究兴趣主要是深度学习,多模态和自然语言处理领域,特别是游戏,机器人和医疗保健的Agent AI;交互式AI的知识推理智能;推理的神经符号计算;以及NLP和多模态的大型基础模型。
Naoki Wake 卫直树
Researcher at Microsoft, Redmond, WA, USA.
美国华盛顿州雷德蒙微软公司研究员。
Email: naoki.wake@microsoft.com
电子邮件:naoki.wake@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/nawake/
网址:https://www.microsoft.com/en-us/research/people/nawake/
Naoki Wake is a researcher in the Applied Robotics Research group at Microsoft. His current research involves the development of multimodal perception systems for robots and co-speech gesturing systems. His past research has spanned auditory neuroscience, neuro-rehabilitation, and speech processing. Naoki received his B.S. degree in Engineering in 2014, and his Ph.D. in Information Science and Technology in 2019, both from the University of Tokyo.
Naoki Wake是微软应用机器人研究小组的研究员。他目前的研究涉及机器人多模态感知系统和协同语音手势系统的开发。他过去的研究涵盖听觉神经科学,神经康复和语音处理。直树获得了学士学位。2014年获得工程学学位,在信息科学和技术在2019年,都从东京大学。
Ran Gong 冉公
PhD Student in Computer Science at University of California, Los Angeles.
洛杉矶加州大学计算机科学博士生。
Email: nikepupu@ucla.edu
电子邮件:nikepupu@ucla.edu
Web: https://nikepupu.github.io
网站:https://nikepupu.github.io
Ran Gong is a PhD student at the UCLA VCLA Lab. His research lies at the intersection of Robotics, Computer Vision, Computer Graphics, and Machine Learning. His research focuses on embodied simulation and interaction, with the goal of creating intelligent behaviors that can solve diverse tasks in diverse environments as well as the capability of collaborating with humans. He received his B.S. degree in Computer Science and Engineering from the University of California, Los Angeles.
龚冉是加州大学洛杉矶分校VCLA实验室的博士生。他的研究领域是机器人学、计算机视觉、计算机图形学和机器学习的交叉领域。他的研究重点是具体的模拟和交互,目标是创造智能行为,可以在不同的环境中解决不同的任务,以及与人类合作的能力。他获得了学士学位。在洛杉矶的加州大学获得计算机科学与工程学位。
Jae Sung Park 朴在成
PhD Student at University of Washington
华盛顿大学博士生
Email: jspark96@cs.washington.edu
电子邮件地址:jspark96@cs.washington.edu
Web: https://homes.cs.washington.edu/~jspark96/
网址:https://homes.cs.washington.edu/~jspark96/
Jae Sung is a PhD student advised by Yejin Choi and Ali Farhadi. His research focuses on developing models with multimodal commonsense reasoning. He is interested in grounding linguistic concepts in visual modalities and in having models understand multimedia content in the way that humans process visual information. Jae Sung received his B.S. degree in Computer Science from the University of California, Berkeley.
Jae Sung是Yejin Choi和Ali Farhadi的博士生。他的研究重点是开发具有多模态常识推理的模型。他感兴趣的是为模型提供视觉形式的语言概念,并让它们以人类处理视觉信息的方式理解多媒体内容。Jae Sung获得了学士学位。加州大学伯克利分校计算机科学学位。
Bidipta Sarkar 比迪普塔·萨卡尔
Undergraduate Student at Stanford University
斯坦福大学本科生
Email: bidiptas@stanford.edu
电子邮件地址:bidiptas@stanford.edu
Web: https://bsarkar321.github.io/
网站:https://bsarkar321.github.io/
Bidipta Sarkar is a senior undergraduate student at Stanford University and a member of Stanford’s ILIAD lab. His research focuses on creating AI agents that can interact with their environment and safely work alongside humans and other autonomous agents.
Bidipta Sarkar是斯坦福大学的一名大四学生,也是斯坦福大学ILIAD实验室的成员。他的研究重点是创建能够与环境交互并与人类和其他自主代理安全工作的人工智能代理。
Rohan Taori 罗汉陶里
PhD Student in Computer Science at Stanford University
斯坦福大学计算机科学博士生
Email: rtaori@cs.stanford.edu
电子邮件:rtaori@cs.stanford.edu
Web: https://www.rohantaori.com/
网站:https://www.rohantaori.com/
Rohan Taori is a PhD student at the Stanford AI Lab. His research focuses on studying the foundations of machine learning in the context of real-world systems. Most recently, Rohan has pushed forward the frontier of open-source large language models, finetuning them to be helpful, general-purpose assistants. Rohan is also very interested in augmenting language models with multi-modality capability to allow them to reason over images and videos. Rohan received his Bachelor’s degree in Computer Science at UC Berkeley.
罗汉陶里是斯坦福大学人工智能实验室的博士生。他的研究重点是在真实世界系统的背景下研究机器学习的基础。最近,罗汉推动了开源大型语言模型的前沿,将它们微调为有用的通用助手。罗汉也对增强具有多模态能力的语言模型非常感兴趣,以允许它们对图像和视频进行推理。罗汉在加州大学伯克利分校获得计算机科学学士学位。
Yusuke Noda 野田雄介
Principal Software Engineer. Microsoft Gaming, Redmond, WA.
首席软件工程师。Microsoft Gaming,华盛顿州雷德蒙。
Email: yusuke.noda@microsoft.com
电子邮件:yusuke.noda@microsoft.com
Web: https://www.linkedin.com/in/yusuke-noda-908797/
网站:https://www.linkedin.com/in/yusuke-noda-908797/
Yusuke Noda is a principal software engineer at Microsoft Gaming Platform. He has led the development of gaming infrastructure for Xbox One and Xbox Cloud Gaming and has over 16 years of experience developing efficient infrastructure for cloud and gaming technologies.
Yusuke Noda是Microsoft Gaming Platform的首席软件工程师。他领导了Xbox One和Xbox Cloud Gaming的游戏基础设施开发,并拥有超过16年的云和游戏技术高效基础设施开发经验。
Demetri Terzopoulos 德米特里·泰尔佐普洛斯
UCLA Distinguished Professor, Academy Award winning computer scientist and entrepreneur.
加州大学洛杉矶分校杰出教授,奥斯卡奖获奖计算机科学家和企业家。
Email: dt@cs.ucla.edu
电子邮件地址:dt@cs.ucla.edu
Web: https://web.cs.ucla.edu/~dt/
网站:https://web.cs.ucla.edu/~dt/
Dr. Demetri Terzopoulos is a Distinguished Professor and Chancellor’s Professor of Computer Science at the University of California, Los Angeles, where he directs the UCLA Computer Graphics & Vision Laboratory. He is also Co-Founder and Chief Scientist of VoxelCloud, Inc., a multinational healthcare AI company. He is or was a Guggenheim Fellow, a Fellow of the Royal Society (FRS) of London and the Royal Society of Canada (FRSC), a Fellow of the Association for Computing Machinery (ACM), a Life Fellow of the Institute of Electrical and Electronics Engineers (IEEE), a Fellow of the Asia-Pacific Artificial Intelligence Association (AAIA), a Distinguished Fellow of the International Engineering and Technology Institute (IETI), a Member of the European Academy of Sciences (EAS) and the New York Academy of Sciences (NYAS), and a Life Member of Sigma Xi.
Demetri Terzopoulos博士是洛杉矶的加州大学计算机科学的杰出教授和校长教授,他在那里领导加州大学洛杉矶分校计算机图形与视觉实验室。他还是VoxelCloud,Inc.的联合创始人兼首席科学家,一家跨国医疗AI公司。他是或曾经是古根海姆研究员、伦敦皇家学会(FRS)和加拿大皇家学会(FRSC)研究员、计算机械协会(ACM)研究员、电气和电子工程师协会(IEEE)终身研究员、亚太人工智能协会(AAIA)研究员、国际工程技术研究所(IETI)杰出研究员、欧洲科学院(EAS)和纽约科学院(NYAS)的成员,以及Sigma Xi的终身成员。
Yejin Choi 崔艺珍
Wissner-Slivka Chair and Brett Helsel Professor at the University of Washington, Senior Research Manager at the Allen Institute for Artificial Intelligence, and MacArthur Fellow.
华盛顿大学Wisnner-Slivaka主席和Brett Helsel教授,艾伦人工智能研究所高级研究经理,麦克阿瑟研究员。
Email: yejin@cs.washington.edu
电子邮件地址:yejin@cs.washington.edu
Web: https://homes.cs.washington.edu/~yejin/
网址:https://homes.cs.washington.edu/~yejin/
Yejin is the Wissner-Slivka Chair and Brett Helsel Professor at the University of Washington and a Senior Research Manager at the Allen Institute for Artificial Intelligence. She won the Anita Borg Early Career Award in 2018 and received a MacArthur Fellowship in 2020. She has received outstanding paper awards at AAAI 2020, NeurIPS 2021, ICML 2022, and ACL 2023, and best paper awards at NAACL 2022 and ACL 2023. She is one of the main organizers of COLM 2024, an academic venue focused on the study of language modeling. Her main research interests are commonsense reasoning in the fields of Natural Language Processing, Machine Learning, and Artificial Intelligence, with broader interests in Computer Vision and Digital Humanities.
Yejin是华盛顿大学的Wisnner-Slivaka主席和Brett Helsel教授,也是艾伦军事情报研究所的高级研究经理。她在2018年获得了安妮塔博格早期职业生涯奖。她是2020年麦克阿瑟研究员基金会奖学金的获得者。她曾获得AAAI 2020,Neurips 2021,ICML 2022和ACL 2023的优秀论文奖,以及NAACL 2022和ACL 2023的最佳论文奖。她是COLM 2024的主要组织者之一,这是一个专注于语言建模研究的学术场所。她的主要研究兴趣是自然语言处理,机器学习,人工智能领域的常识推理,对计算机视觉和数字人文有更广泛的兴趣。
Fei-Fei Li 李飞飞
Professor of Computer Science at Stanford University
斯坦福大学计算机科学教授
Email: feifeili@stanford.edu
电子邮件:feifeili@stanford.edu
Web: https://profiles.stanford.edu/fei-fei-li/
网站:https://profiles.stanford.edu/fei-fei-li/
Fei-Fei is the inaugural Sequoia Professor in the Computer Science Department at Stanford University and Co-Director of Stanford's Human-Centered AI Institute. She served as the Director of Stanford's AI Lab from 2013 to 2018. During her sabbatical from Stanford, from January 2017 to September 2018, Dr. Li was a Vice President at Google and served as Chief Scientist of AI/ML at Google Cloud. Since then, she has served as a board member or advisor for various public and private companies.
Fei-Fei是斯坦福大学计算机科学系的首届红杉教授,也是斯坦福大学以人为本的人工智能研究所的联合主任。2013年至2018年,她担任斯坦福大学人工智能实验室主任。在2017年1月至2018年9月从斯坦福大学休假期间,李博士担任谷歌副总裁,并担任谷歌云AI/ML首席科学家。从那时起,她一直担任董事会成员或顾问在各种公共或私人公司。
Katsushi Ikeuchi 池内克史
Senior Principal Research Manager at Microsoft Research, Redmond, WA, and an IEEE Life Fellow.
微软高级首席研究经理和IEEE终身研究员。微软研究院,华盛顿州雷德蒙。IEEE Life Fellow
Email: katsuike@microsoft.com
电子邮件地址:katsuike@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/katsuike/
网址:https://www.microsoft.com/en-us/research/people/katsuike/
Dr. Katsushi Ikeuchi received his BE degree in Mechanical Engineering from Kyoto University in 1973 and his PhD degree in Information Engineering from the University of Tokyo in 1978. After working at the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology for three years, the Electrotechnical Laboratory of the Japanese Ministry of International Trade and Industry for five years, the Robotics Institute at Carnegie Mellon University for 10 years, and the Institute of Industrial Science at the University of Tokyo for 19 years, he joined Microsoft as a Principal Researcher in 2015. During his tenure at CMU and UTokyo, he supervised more than 50 PhD students.
池内克史博士于1973年获得京都大学机械工程学士学位,并于1978年获得东京大学信息工程博士学位。在马萨诸塞州人工智能实验室工作了三年,在日本政府国际贸易和工业部电工实验室工作了五年,在卡内基梅隆大学机器人研究所工作了10年,在东京大学工业科学研究所工作了19年后,他于2015年加入微软担任首席研究员。在CMU和东京大学任职期间,他指导了50多名博士生。
His research interests span computer vision, robotics, and computer graphics. In these research fields, he has received several best paper awards, including the David Marr Prize in computational vision and the IEEE Robotics and Automation Society K.-S. Fu Memorial Best Transactions Paper Award.
他的研究兴趣涵盖计算机视觉,机器人和计算机图形学。在这些研究领域,他获得了多项最佳论文奖,包括计算视觉领域的大卫马尔奖和IEEE机器人与自动化学会K。S.傅纪念最好的交易文件。
His community service includes: general chair of IROS95, ITSC99, IV01, ICCV05, ACCV07, and ICCV17; program chair of CVPR96, ICCV03, ICRA09, ICPR12, and ICCV15; Editor-in-Chief of IJCV (2000-2017) and of IJ ITS (2012-2014); associate editor of IEEE Trans. RA and IEEE Trans. PAMI; and distinguished lecturer of the IEEE Signal Processing Society (2000-2002), the IEEE Robotics and Automation Society (2004-2006), and the IEEE Computer Society (2008-2010).
他的社区服务包括:IROS 95、ITSC 99、IV 01、ICCV 05、ACCV 07、ICCV 17总主席; CVPR 96、ICCV 03、ICRA 09、ICPR 12、ICCV 15项目主席; IJCV EIC(2000-2017),IJ ITS的EIC(2012-2014),IEEE Trans. RA、IEEE Trans. PAMI副主编;以及2000-2002年IEEE信号处理学会、2004-2006年IEEE机器人与自动化学会和2008-2010年IEEE计算机学会的杰出讲师。
Through this research and community service, he has been elected a (Life) Fellow of IEEE, IEICE, IPSJ, and RSJ. He received the Distinguished Researcher Award from IEEE-PAMI, the Medal of Honor with Purple Ribbon from the Japanese Emperor, and the Okawa Prize from the Okawa Foundation, as well as several research achievement awards from Japanese professional societies.
通过这些研究和社会服务,他被授予IEEE,IEICE,IPSJ和RSJ的(终身)研究员。他曾获得IEEE-PAMI杰出研究员奖、日本天皇紫丝带荣誉勋章、大川财团大川奖以及日本专业学会的多项研究成果奖。
Hoi Vo 会武
Technical Fellow. Microsoft Gaming and X-Box Emerging Technologies, Redmond, WA.
技术研究员。Microsoft Gaming and X-Box Emerging Technologies,华盛顿州雷德蒙。
Email: hoiv@microsoft.com
电子邮件地址:hoiv@microsoft.com
Web: https://www.linkedin.com/in/hoi-vo-193420/
网站:https://www.linkedin.com/in/hoi-vo-193420/
Hoi Vo is a Technical Fellow in the Microsoft Gaming Division focusing on bridging AGI with new gaming IP. He played a pivotal role in establishing the Xbox Cloud Gaming service and has led cloud and gaming efforts across teams at Microsoft, including Microsoft's Windows Azure platform. His focus is on optimizing AI models to run efficiently at the edge while leveraging the cloud for scenarios that exceed local hardware capabilities, including fine-tuning models to be more engaging across various playing experiences.
Hoi Vo是MS游戏部门的技术研究员,专注于将AGI与新的游戏IP连接起来。他在建立Xbox云游戏服务方面发挥了关键作用,领导了微软团队的云和游戏工作,包括微软的Windows Azure平台。他的重点是优化人工智能模型,使其在边缘高效运行,同时利用云计算来实现超出硬件能力的场景,包括微调模型,使其更好地参与各种游戏体验。
Jianfeng Gao 高剑锋
Distinguished Scientist and Vice President at Microsoft Research, Redmond, WA, and an IEEE Fellow.
杰出科学家,微软副总裁,IEEE研究员。微软研究院,华盛顿州雷德蒙。IEEE Fellow
Email: jfgao@microsoft.com
电子邮件地址:jfgao@microsoft.com
Web: https://www.microsoft.com/en-us/research/people/jfgao/
网站:https://www.microsoft.com/en-us/research/people/jfgao/
Dr. Jianfeng Gao is a Distinguished Scientist and Vice President at Microsoft Research, an IEEE Fellow, an ACM Distinguished Member, and the current head of the Deep Learning Group at Microsoft Research. From 2014 to 2017, he was Partner Research Manager in Business AI at Microsoft AI Research and at the Deep Learning Technology Center (DLTC) at Microsoft Research, Redmond. He led the development of AI solutions for Predictive Sales and Marketing. He also works on deep learning for text and image processing (see the ACL/SIGIR 2018 Tutorial, the Deep Learning 2017 Tutorial, and the IJCAI 2016 Tutorial, or the MS internal site) and led the development of AI systems for dialogue, machine reading comprehension (MRC), and question answering (QA). Since 2022, he has led research on self-improving AI, in which LLMs (e.g., ChatGPT/GPT-4) are augmented and adapted for the development of commercial AI systems.
Jianfeng Gao博士是微软研究院的杰出科学家兼副总裁,IEEE Fellow,ACM杰出成员,现任微软研究院深度学习小组负责人。从2014年到2017年,他在微软人工智能研究院和微软研究院深度学习技术中心(DLTC)担任商业人工智能合作伙伴研究经理,雷德蒙。他领导预测销售和营销人工智能解决方案的开发。他还致力于文本和图像处理的深度学习(参见ACL/SIGIR 2018年版,深度学习2017年版和IJCAI 2016年版或MS内部网站),并领导对话,机器阅读理解(MRC)和问答(QA)的AI系统的开发。从2022年起,他领导了自我改进人工智能的研究,其中LLMs(例如,ChatGPT/GPT 4)被增强并适应于商业AI系统的开发。
Acknowledgements 致谢
We are especially grateful to Peter Lee, Doug Burger, Desney Tan, Johannes Gehrke, Ryen White, Ece Kamar, Subhojit Som, and Kareem Choudhry for their advice, enormous support, and encouragement. We thank Haiyan Zhang, Spencer Perreault, Dave Bignell, Katja Hofmann, Sam Devlin, Shanzheng Tan, Raluca Georgescu, Bill Dolan, Nebojsa Jojic, Sudha Rao, Adrian Brown, Andrzej Banburski-Fahey, and Jianwei Yang for their early insightful discussions and help with Gaming. We appreciate Kiran Muthabatulla, Antonio Criminisi, Tom Cashman, Nguyen Bach, Jennifer Marsman, and Jaron Lanier from the Mesh team, the OCTO of Microsoft, and the Microsoft Office team for their mixed-reality work, dataset work, and their generous help and feedback on the project. Special thanks go to Paul Bennett, Corby Rosset, Michel Galley, Chenglong Wang, Baolin Peng, Hao Chen, Silviu Cucerzan, Ahmed Awadallah, and Saleema Amershi for their suggestions and comments on the NLP part. The authors gratefully acknowledge Paul Smolensky, Yonatan Bisk, Kezhen Chen, Borui Wang, Liangke Gui, Dingmin Wang, Xin (Eric) Wang, Zhe Gan, Xiaojian Ma, Zilong Zheng, Song-chun Zhu, Dragomir R. Radev, Daniel McDuff, and Harry Shum for related previous works, comments, suggestions, painstaking multiple reviews of this paper, and their pointers to the literature. Finally, we would like to thank the Microsoft HoloLens team, the Microsoft X-box team, and the Meta Quest team for their generous provision of equipment; the MSR Central Engineering (CE) team and the Microsoft 343 team for data collection and sharing; and the Microsoft AOAI and GCR teams for their Azure-OpenAI endpoint support.
我们特别感谢Peter Lee、Doug Burger、Desney Tan、Johannes Gehrke、Ryen白色、Ece Koughs、Subhojit Som和Kareem Choudhry的建议、巨大的支持和鼓励。我们感谢Haiyan Zhang、Spencer Perreault、Dave Bignell、Katja Hofmann、Sam Devlin、Shanzheng Tan、Raluca Georgescu、Bill Dolan、Nebojsa Jojic、Sudha Rao、Adrian Brown、Andrzej Banburski-Fahey和Jianwei Yang对Gaming的早期深入讨论和帮助。我们感谢来自Mesh团队的Kiran Muthabatulla,Antonio Criminisi,Tom Cashman,Nguyen Bach,Jennifer Marsman,Jaron Lanier,Microsoft的OCTO和Microsoft Office团队的混合现实工作,数据集工作以及他们对项目的慷慨帮助和反馈。我们特别感谢Paul班尼特、Corby Rosset、Michel Galley、Chenglong Wang、Baolin Peng、Hao Chen、Silviu Cucerzan、Ahmed Awadallah、Saleema Amershi对NLP部分的建议和评论。 作者衷心感谢Paul Smolensky、Yonatan Bisk、Kezhen Chen、Borui Wang、Liangke Gui、Dingmin Wang、Xin(Eric)Wang、Zhe Gan、Xiaojian Ma、Zilong Zheng、Song-chun Zhu、Dragomir R.拉德夫、丹尼尔·麦克达夫、哈里·沈为相关的前人著作、评论、建议,不辞辛苦地对本文进行了多重评述,并对他们的文献进行了指点。最后,我们非常感谢Microsoft Holulens团队、Microsoft X-box团队和Meta Quest团队慷慨提供的设备;感谢MSR Central Engineering(CE)团队、Microsoft 343团队提供的数据收集和共享;感谢Microsoft AOAI和GCR团队提供的Azure-OpenAI端点支持。
We would like to thank our colleagues from Stanford’s Partnership in AI-assisted Care, who helped inform the medical applications explored in this work. In particular, we would like to thank Ehsan Adeli, Paul Tang, Amit Kaushal, Roger Bohn, Kevin Schulman, and Arnold Milstein for their clinical expertise and guidance.
我们要感谢来自斯坦福大学人工智能辅助护理合作伙伴关系的同事,他们帮助我们了解了这项工作中探索的医疗应用。我们要特别感谢Ehsan Adeli、Paul Tang、Amit Kaushal、Roger Bohn、Kevin Schulman和Arnold Milstein的临床专业知识和指导。
This research was supported by the Microsoft Research project Fair 2023, Microsoft HackBox 2023, and the OCTO team.
这项研究得到了微软研究项目Fair 2023,Microsoft HackBox 2023,OCTO团队的支持。