Building agents with an LLM (large language model) as the core controller is a cool concept. Several proof-of-concept demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potential of LLMs extends beyond generating well-written copy, stories, essays and programs; they can be framed as powerful general problem solvers.

Agent System Overview

In an LLM-powered autonomous agent system, the LLM functions as the agent’s brain, complemented by several key components:

  • Planning
    • Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
    • Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.
  • Memory
    • Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.
    • Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.
  • Tool use
    • The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.
Fig. 1. Overview of an LLM-powered autonomous agent system.

Component One: Planning

A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.

Task Decomposition

Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and sheds light on the model’s thinking process.

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
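A hedged sketch of the breadth-first variant follows; `propose` and `score` are hypothetical LLM-backed helpers standing in for the thought generator and the prompt-based state evaluator mentioned above, not the paper's code.

```python
# Hedged sketch of Tree-of-Thoughts-style BFS: propose several candidate
# thoughts per state, score each state via a prompt, and keep the best few.
def tot_bfs(problem: str, propose, score, steps: int = 3, breadth: int = 5, keep: int = 2) -> str:
    frontier = [""]                                           # partial chains of thought
    for _ in range(steps):
        candidates = [
            chain + "\n" + thought
            for chain in frontier
            for thought in propose(problem, chain, n=breadth)  # n thoughts per state
        ]
        # Evaluate each candidate state and keep the top `keep` (the BFS beam).
        frontier = sorted(candidates, key=lambda c: score(problem, c), reverse=True)[:keep]
    return frontier[0]
```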

Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.
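A minimal sketch of approach (1), simple prompting, assuming the 2023-era `openai` SDK (`pip install openai`) with an `OPENAI_API_KEY` in the environment; the prompt string follows the "Steps for XYZ." pattern above:

```python
import re
import openai  # assumes the 2023-era ChatCompletion interface

def decompose(task: str) -> list[str]:
    """Ask the model to list subgoals for `task` via simple prompting."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Steps for {task}.\n1."}],
        temperature=0,
    )
    text = "1. " + resp.choices[0].message.content
    # Split the numbered list back into individual subgoal strings.
    return [s.strip() for s in re.split(r"^\d+\.\s*", text, flags=re.M) if s.strip()]

print(decompose("writing a blog post about LLM agents"))
```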

Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner, which is common in certain robotic setups but not in many other domains.

Self-Reflection

Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.

ReAct (Yao et al. 2023) integrates reasoning and acting within LLM by extending the action space to be a combination of task-specific discrete actions and the language space. The former enables the LLM to interact with the environment (e.g. use the Wikipedia search API), while the latter prompts the LLM to generate reasoning traces in natural language.

The ReAct prompt template incorporates explicit steps for LLM to think, roughly formatted as:

Thought: ...
Action: ...
Observation: ...
... (Repeated many times)
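A hedged sketch of the control loop around this template (not the paper's code): `llm` is any prompt-to-text callable, and `tools` maps action names such as "Search" to functions.

```python
# Hedged sketch of the ReAct loop: the model alternates Thought/Action lines,
# we run the action with a tool and append the Observation to the transcript.
import re

def parse_action(step: str):
    """Extract `Action: Name[arg]`, the bracketed action format used in ReAct prompts."""
    m = re.search(r"Action:\s*(\w+)\[(.*?)\]", step)
    return (m.group(1), m.group(2)) if m else (None, None)

def react(question: str, llm, tools: dict, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")   # model emits a Thought and an Action
        transcript += "Thought:" + step + "\n"
        name, arg = parse_action(step)
        if name == "Finish":                  # terminal action carries the answer
            return arg
        if name in tools:
            transcript += f"Observation: {tools[name](arg)}\n"
    return "(no answer within the step budget)"
```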
Fig. 2. Examples of reasoning trajectories for knowledge-intensive tasks (e.g. HotpotQA, FEVER) and decision-making tasks (e.g. AlfWorld Env, WebShop). (Image source: Yao et al. 2023).

In both experiments on knowledge-intensive tasks and decision-making tasks, ReAct works better than the Act-only baseline, in which the Thought: … step is removed.

Reflexion (Shinn & Labash 2023) is a framework that equips agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion has a standard RL setup, in which the reward model provides a simple binary reward and the action space follows the setup in ReAct, where the task-specific action space is augmented with language to enable complex reasoning steps. After each action $a_t$, the agent computes a heuristic $h_t$ and optionally may decide to reset the environment to start a new trial, depending on the self-reflection results.

Fig. 3. Illustration of the Reflexion framework. (Image source: Shinn & Labash, 2023)

The heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped. Inefficient planning refers to trajectories that take too long without success. Hallucination is defined as encountering a sequence of consecutive identical actions that lead to the same observation in the environment.

Self-reflection is created by showing two-shot examples to the LLM, where each example is a pair of (failed trajectory, ideal reflection for guiding future changes in the plan). Then reflections are added into the agent’s working memory, up to three, to be used as context for querying the LLM.
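A hedged sketch of that working-memory mechanic, with `llm` and the two-shot example string as illustrative stand-ins:

```python
# Hedged sketch of Reflexion-style working memory: generate a reflection from
# a failed trajectory using two-shot examples, keep at most three reflections,
# and prepend them as context for the next trial.
from collections import deque

TWO_SHOT = "Trajectory: ...\nReflection: ...\n\nTrajectory: ...\nReflection: ...\n"
reflections: deque[str] = deque(maxlen=3)   # working memory capped at three entries

def retry_with_reflection(llm, failed_trajectory: str, task: str) -> str:
    prompt = f"{TWO_SHOT}\nTrajectory: {failed_trajectory}\nReflection:"
    reflections.append(llm(prompt).strip())              # newest reflection evicts the oldest
    context = "\n".join(reflections)
    return llm(f"Reflections from past trials:\n{context}\n\nTask: {task}")
```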

Fig. 4. Experiments on AlfWorld Env and HotpotQA. Hallucination is a more common failure than inefficient planning in AlfWorld. (Image source: Shinn & Labash, 2023)

Chain of Hindsight (CoH; Liu et al. 2023) encourages the model to improve on its own outputs by explicitly presenting it with a sequence of past outputs, each annotated with feedback. Human feedback data is a collection of $D_h = \{(x, y_i, r_i, z_i)\}_{i=1}^n$, where $x$ is the prompt, each $y_i$ is a model completion, $r_i$ is the human rating of $y_i$, and $z_i$ is the corresponding human-provided hindsight feedback. Assume the feedback tuples are ranked by reward, $r_n \geq r_{n-1} \geq \dots \geq r_1$. The process is supervised fine-tuning where the data is a sequence in the form of $\tau_h = (x, z_i, y_i, z_j, y_j, \dots, z_n, y_n)$, where $i \leq j \leq n$. The model is fine-tuned to predict only $y_n$, conditioned on the sequence prefix, such that the model can self-reflect to produce better output based on the feedback sequence. The model can optionally receive multiple rounds of instructions from human annotators at test time.
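A hedged sketch of assembling one such training example; the segment ordering follows the definition of $\tau_h$ above, while the boolean loss-mask representation is an assumption of this sketch:

```python
# Hedged sketch of building a Chain-of-Hindsight training sequence: rank the
# feedback tuples by reward and concatenate (x, z_1, y_1, ..., z_n, y_n), with
# the loss mask selecting only the final, highest-rated completion y_n.
def build_coh_example(x: str, tuples: list[tuple[str, float, str]]):
    """tuples: (completion y_i, rating r_i, hindsight feedback z_i)."""
    tuples = sorted(tuples, key=lambda t: t[1])     # ascending reward: r_1 <= ... <= r_n
    segments, train_on = [x], [False]
    for y, _r, z in tuples:
        segments += [z, y]
        train_on += [False, False]
    train_on[-1] = True                             # fine-tune to predict y_n only
    return segments, train_on
```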

To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens during training.

The training dataset in their experiments is a combination of WebGPT comparisons, summarization from human feedback and human preference dataset.

Fig. 5. After fine-tuning with CoH, the model can follow instructions to produce outputs with incremental improvement in a sequence. (Image source: Liu et al. 2023)

The idea of CoH is to present a history of sequentially improved outputs in context and train the model to take on the trend to produce better outputs. Algorithm Distillation (AD; Laskin et al. 2023) applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy. Considering that an agent interacts with the environment many times and in each episode the agent gets a little better, AD concatenates this learning history and feeds that into the model. Hence we should expect the next predicted action to lead to better performance than previous trials. The goal is to learn the process of RL instead of training a task-specific policy itself.

Fig. 6. Illustration of how Algorithm Distillation (AD) works.

(Image source: Laskin et al. 2023).

The paper hypothesizes that any algorithm that generates a set of learning histories can be distilled into a neural network by performing behavioral cloning over actions. The history data is generated by a set of source policies, each trained for a specific task. At the training stage, during each RL run, a random task is sampled and a subsequence of multi-episode history is used for training, such that the learned policy is task-agnostic.

In reality, the model has limited context window length, so episodes should be short enough to construct multi-episode history. Multi-episodic contexts of 2-4 episodes are necessary to learn a near-optimal in-context RL algorithm. The emergence of in-context RL requires long enough context.

In comparison with three baselines, including ED (expert distillation, behavior cloning with expert trajectories instead of learning history), source policy (used for generating trajectories for distillation by UCB), and RL^2 (Duan et al. 2017; used as an upper bound since it needs online RL), AD demonstrates in-context RL with performance getting close to RL^2 despite only using offline RL, and it learns much faster than the other baselines. When conditioned on partial training history of the source policy, AD also improves much faster than the ED baseline.

Fig. 7. Comparison of AD, ED, source policy and RL^2 on environments that require memory and exploration. Only binary reward is assigned. The source policies are trained with A3C for "dark" environments and DQN for watermaze.

(Image source: Laskin et al. 2023)

Component Two: Memory

(Big thank you to ChatGPT for helping me draft this section. I’ve learned a lot about the human brain and data structure for fast MIPS in my conversations with ChatGPT.)

Types of Memory

Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in human brains.

  1. Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information (visual, auditory, etc.) after the original stimuli have ended. Sensory memory typically only lasts for up to a few seconds. Subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch).

  2. Short-Term Memory (STM) or Working Memory: It stores information that we are currently aware of and that is needed to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have a capacity of about 7 items (Miller 1956) and lasts for 20-30 seconds.

  3. Long-Term Memory (LTM): Long-term memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. There are two subtypes of LTM:

    • Explicit / declarative memory: This is memory of facts and events, and refers to those memories that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
    • Implicit / procedural memory: This type of memory is unconscious and involves skills and routines that are performed automatically, like riding a bike or typing on a keyboard.
Fig. 8. Categorization of human memory.

We can roughly consider the following mappings:

  • Sensory memory as learning embedding representations for raw inputs, including text, image or other modalities;
  • Short-term memory as in-context learning. It is short and finite, as it is restricted by the finite context window length of the Transformer.
  • Long-term memory as the external vector store that the agent can attend to at query time, accessible via fast retrieval.

Maximum Inner Product Search (MIPS)

The external memory can alleviate the restriction of finite attention span. A standard practice is to save the embedding representation of information into a vector store database that can support fast maximum inner-product search (MIPS). To optimize the retrieval speed, the common choice is an approximate nearest neighbors (ANN) algorithm that returns the approximate top-k nearest neighbors, trading a little accuracy loss for a huge speedup.

A couple of common choices of ANN algorithms for fast MIPS (a FAISS-based usage sketch follows the list):

  • LSH (Locality-Sensitive Hashing): It introduces a hashing function such that similar input items are mapped to the same buckets with high probability, where the number of buckets is much smaller than the number of inputs.
  • ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure is random projection trees, a set of binary trees where each non-leaf node represents a hyperplane splitting the input space in half and each leaf stores one data point. Trees are built independently and at random, so to some extent, it mimics a hashing function. ANNOY search happens in all the trees to iteratively search through the half that is closest to the query and then aggregates the results. The idea is quite related to KD trees but a lot more scalable.
  • HNSW (Hierarchical Navigable Small World): It is inspired by the idea of small-world networks, where most nodes can be reached from any other node within a small number of steps; e.g. the “six degrees of separation” feature of social networks. HNSW builds hierarchical layers of these small-world graphs, where the bottom layer contains the actual data points. The layers in the middle create shortcuts to speed up search. When performing a search, HNSW starts from a random node in the top layer and navigates towards the target. When it can’t get any closer, it moves down to the next layer, until it reaches the bottom layer. Each move in the upper layers can potentially cover a large distance in the data space, and each move in the lower layers refines the search quality.
  • FAISS (Facebook AI Similarity Search): It operates on the assumption that in high-dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clustering of data points. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within clusters. Search first looks for cluster candidates with coarse quantization and then further looks into each cluster with finer quantization.
  • ScaNN (Scalable Nearest Neighbors): The main innovation in ScaNN is anisotropic vector quantization. It quantizes a data point $x_i$ to $\tilde{x}_i$ such that the inner product $\langle q, x_i \rangle$ is as similar to the original distance $\langle q, \tilde{x}_i \rangle$ as possible, instead of picking the closest quantization centroid point.
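As a concrete illustration of the external-memory pattern, here is a minimal usage sketch of approximate MIPS with FAISS; the random vectors stand in for real memory embeddings and the parameter values are illustrative.

```python
# Minimal sketch of approximate MIPS over an embedding store with FAISS.
# Assumes `pip install faiss-cpu numpy`.
import numpy as np
import faiss

d, n_memories, nlist = 64, 10_000, 100            # embedding dim, corpus size, # coarse clusters
memories = np.random.rand(n_memories, d).astype("float32")

quantizer = faiss.IndexFlatIP(d)                  # exact inner-product index for the coarse step
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(memories)                             # learn the coarse clustering
index.add(memories)

index.nprobe = 8                                  # how many clusters to visit per query
query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 5)              # approximate top-5 by inner product
print(ids[0], scores[0])
```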
Fig. 9. Comparison of MIPS algorithms, measured in recall@10. (Image source: Google Blog, 2020)

Check more MIPS algorithms and performance comparison in ann-benchmarks.com.

Component Three: Tool Use

Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify and utilize external objects to do things that go beyond our physical and cognitive limits. Equipping LLMs with external tools can significantly extend the model capabilities.

Fig. 10. A picture of a sea otter using a rock to crack open a seashell while floating in the water. While some other animals can use tools, the complexity is not comparable with humans. (Image source: Animals using tools)

MRKL (Karpas et al. 2022), short for “Modular Reasoning, Knowledge and Language”, is a neuro-symbolic architecture for autonomous agents. A MRKL system is proposed to contain a collection of “expert” modules and the general-purpose LLM works as a router to route inquiries to the best suitable expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. math calculator, currency converter, weather API).

They did an experiment on fine-tuning the LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems, because LLMs (the 7B Jurassic1-large model) failed to extract the right arguments for basic arithmetic reliably. The results highlight that even when external symbolic tools can work reliably, knowing when and how to use the tools is crucial, and that is determined by the LLM’s capability.

Both TALM (Tool Augmented Language Models; Parisi et al. 2022) and Toolformer (Schick et al. 2023) fine-tune an LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation can improve the quality of model outputs. See more details in the “External APIs” section of Prompt Engineering.

ChatGPT Plugins and OpenAI API function calling are good examples of LLMs augmented with tool use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls).

HuggingGPT (Shen et al. 2023) is a framework that uses ChatGPT as the task planner to select models available on the HuggingFace platform according to the model descriptions and summarize the response based on the execution results.

Fig. 11. Illustration of how HuggingGPT works. (Image source: Shen et al. 2023)

The system comprises 4 stages:

(1) Task planning: LLM works as the brain and parses the user requests into multiple tasks. There are four attributes associated with each task: task type, ID, dependencies, and arguments. They use few-shot examples to guide LLM to do task parsing and planning.

Instruction:

The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning.

(2) Model selection: LLM distributes the tasks to expert models, where the request is framed as a multiple-choice question. LLM is presented with a list of models to choose from. Due to the limited context length, task type based filtration is needed.

Instruction:

Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: "id": "id", "reason": "your detail reason for the choice". We have a list of models for you to choose from {{ Candidate Models }}. Please select one model from the list.

(3) Task execution: Expert models execute on the specific tasks and log results.

Instruction:

With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path.

(4) Response generation: LLM receives the execution results and provides summarized results to users.
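A hedged sketch of consuming the stage (1) plan: parse the JSON format quoted above and order tasks so each runs after its dependencies. Treating an empty "dep" list as "no dependency" is an assumption of this sketch, not necessarily HuggingGPT's exact convention, and the example plan is illustrative.

```python
import json

def order_tasks(llm_output: str) -> list[dict]:
    """Parse the task-plan JSON and topologically order tasks by their "dep" ids."""
    tasks = json.loads(llm_output)
    done, ordered = set(), []
    while len(ordered) < len(tasks):
        progressed = False
        for t in tasks:
            if t["id"] not in done and all(d in done for d in t["dep"]):
                ordered.append(t)
                done.add(t["id"])
                progressed = True
        if not progressed:
            raise ValueError("cyclic or unresolved dependencies in the plan")
    return ordered

plan = ('[{"task": "image-to-text", "id": 0, "dep": [], "args": {"image": "example.jpg"}},'
        ' {"task": "text-to-speech", "id": 1, "dep": [0], "args": {"text": "<output of task 0>"}}]')
print([t["task"] for t in order_tasks(plan)])   # ['image-to-text', 'text-to-speech']
```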

To put HuggingGPT into real-world usage, a few challenges need to be solved: (1) efficiency improvement is needed, as both the LLM inference rounds and interactions with other models slow down the process; (2) it relies on a long context window to communicate complicated task content; (3) the stability of LLM outputs and of external model services needs to improve.

API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, a calculator, calendar queries, smart home control, schedule management, health data management, account authentication workflows and more. Because there are a large number of APIs, the LLM first has access to an API search engine to find the right API to call and then uses the corresponding documentation to make the call.

Fig. 12. Pseudo code of how LLM makes an API call in API-Bank. (Image source: Li et al. 2023)

In the API-Bank workflow, LLMs need to make a couple of decisions, and at each step we can evaluate how accurate that decision is (a minimal sketch of this decision loop follows the list). Decisions include:

  1. Whether an API call is needed.
  2. Identify the right API to call: if not good enough, LLMs need to iteratively modify the API inputs (e.g. deciding search keywords for the Search Engine API).
  3. Respond based on the API results: the model can choose to refine and call again if the results are not satisfactory.
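Here is a hedged sketch of that loop; `llm`, `api_search`, and `call_api` are hypothetical stand-ins for the benchmark's actual components, not API-Bank's code.

```python
# Hedged sketch of the API-Bank-style decision sequence described above.
# `llm` maps a prompt string to text; `api_search` and `call_api` are stand-ins.
def answer_with_tools(llm, user_query: str, api_search, call_api) -> str:
    # Decision 1: is an API call needed at all?
    if "yes" not in llm(f"Does answering '{user_query}' require an API call? Answer yes/no.").lower():
        return llm(user_query)
    # Decision 2: find the right API, deciding the search keywords with the model.
    api_doc = api_search(llm(f"Search keywords for an API that can handle: {user_query}"))
    # Decision 3: call it and respond; a caller may refine inputs and re-enter this loop.
    result = call_api(llm(f"Given this API doc:\n{api_doc}\nWrite the call for: {user_query}"))
    return llm(f"User asked: {user_query}\nAPI returned: {result}\nRespond to the user.")
```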

This benchmark evaluates the agent’s tool use capabilities at three levels:

  • Level-1 evaluates the ability to call the API. Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns.
  • Level-2 examines the ability to retrieve the API. The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation.
  • Level-3 assesses the ability to plan API use beyond retrieving and calling. Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it.

Case Studies

Scientific Discovery Agent

ChemCrow (Bran et al. 2023) is a domain-specific example in which the LLM is augmented with 13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain, reflects what was previously described in ReAct and MRKL and combines CoT reasoning with tools relevant to the tasks:

  • The LLM is provided with a list of tool names, descriptions of their utility, and details about the expected input/output.
  • It is then instructed to answer a user-given prompt using the tools provided when necessary. The instruction suggests the model follow the ReAct format - Thought, Action, Action Input, Observation.

One interesting observation is that while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations by experts, oriented towards the completeness and chemical correctness of the solutions, showed that ChemCrow outperforms GPT-4 by a large margin. This indicates a potential problem with using an LLM to evaluate its own performance on domains that require deep expertise. The lack of expertise may cause LLMs to be unaware of their flaws and thus unable to judge the correctness of task results well.

Boiko et al. (2023) also looked into LLM-empowered agents for scientific discovery, to handle autonomous design, planning, and performance of complex scientific experiments. This agent can use tools to browse the Internet, read documentation, execute code, call robotics experimentation APIs and leverage other LLMs.

For example, when requested to "develop a novel anticancer drug", the model came up with the following reasoning steps:

  1. inquired about current trends in anticancer drug discovery;
  2. selected a target;
  3. requested a scaffold targeting these compounds;
  4. Once the compound was identified, the model attempted its synthesis.

They also discussed the risks, especially with illicit drugs and bioweapons. They developed a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted to obtain a synthesis solution and the agent attempted to consult documentation to execute the procedure. 7 out of 11 were rejected and among these 7 rejected cases, 5 happened after a Web search while 2 were rejected based on prompt only.

Generative Agents Simulation

Generative Agents (Park et al. 2023) is a super fun experiment in which 25 virtual characters, each controlled by an LLM-powered agent, are living and interacting in a sandbox environment, inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications.

The design of generative agents combines LLM with memory, planning and reflection mechanisms to enable agents to behave conditioned on past experience, as well as to interact with other agents.

  • Memory stream: a long-term memory module (external database) that records a comprehensive list of agents’ experience in natural language.
    • Each element is an observation, an event directly provided by the agent.
    • Inter-agent communication can trigger new natural language statements.
  • Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance (a scoring sketch follows this list).
    • Recency: recent events have higher scores.
    • Importance: distinguishes mundane from core memories. Ask the LM directly.
    • Relevance: based on how related it is to the current situation / query.
  • Reflection mechanism: synthesizes memories into higher-level inferences over time and guides the agent’s future behavior. Reflections are higher-level summaries of past events (note that this is a bit different from the self-reflection above).
    • Prompt the LM with the 100 most recent observations to generate the 3 most salient high-level questions given that set of observations/statements, then ask the LM to answer those questions.
  • Planning & Reacting: translate the reflections and the environment information into actions.
    • Planning essentially optimizes for believability at the moment vs. over time.
    • Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
    • Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting.
    • Environment information is presented in a tree structure.
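A hedged sketch of that retrieval scoring, combining recency (exponential decay), importance (an LLM-assigned 1-10 score, normalized), and relevance (embedding similarity); the decay factor, equal weighting, and memory dict layout are illustrative choices of this sketch, not the paper's exact constants.

```python
# Hedged sketch of generative-agents-style memory retrieval scoring.
import time
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieval_score(memory: dict, query_emb: np.ndarray, now: float) -> float:
    hours = (now - memory["last_accessed"]) / 3600
    recency = 0.99 ** hours                       # decays toward 0 as the memory goes stale
    importance = memory["importance"] / 10        # LLM-rated 1-10, normalized to [0, 1]
    relevance = cosine(memory["embedding"], query_emb)
    return recency + importance + relevance       # equal weights, for simplicity

def retrieve(memories: list[dict], query_emb: np.ndarray, k: int = 5) -> list[dict]:
    now = time.time()
    return sorted(memories, key=lambda m: -retrieval_score(m, query_emb, now))[:k]
```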
Fig. 13. The generative agent architecture. (Image source: Park et al. 2023)

This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing the conversation topic) and coordination of social events (e.g. host a party and invite many others).

Proof-of-Concept Examples

AutoGPT has drawn a lot of attention to the possibility of setting up autonomous agents with an LLM as the main controller. It has quite a lot of reliability issues given the natural-language interface, but it is nevertheless a cool proof-of-concept demo. A lot of code in AutoGPT is about format parsing.

Here is the system message used by AutoGPT, where {{...}} are user inputs:

You are {{ai-name}}, {{user-provided AI bot description}}.
Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. {{user-provided goal 1}}
2. {{user-provided goal 2}}
3. ...
4. ...
5. ...

Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"
5. Use subprocesses for commands that will not terminate within a few minutes

Commands:
1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
5. List GPT Agents: "list_agents", args:
6. Delete GPT Agent: "delete_agent", args: "key": "<key>"
7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>"
8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
9. Read file: "read_file", args: "file": "<file>"
10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
11. Delete file: "delete_file", args: "file": "<file>"
12. Search Files: "search_files", args: "directory": "<directory>"
13. Analyze Code: "analyze_code", args: "code": "<full_code_string>"
14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
16. Execute Python File: "execute_python_file", args: "file": "<file>"
17. Generate Image: "generate_image", args: "prompt": "<prompt>"
18. Send Tweet: "send_tweet", args: "text": "<text>"
19. Do Nothing: "do_nothing", args:
20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below
Response Format:
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
}
Ensure the response can be parsed by Python json.loads
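Since much of AutoGPT's code is format parsing, here is a hedged sketch of the kind of parsing the JSON response format above requires; the fence-stripping is a defensive assumption of this sketch, not AutoGPT's actual code.

```python
import json

def parse_response(raw: str) -> tuple[str, dict]:
    """Extract the command name and args from the agent's JSON reply (format above)."""
    # Models sometimes wrap JSON in markdown fences; strip them defensively.
    raw = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
    reply = json.loads(raw)                 # must satisfy "parsed by Python json.loads"
    command = reply["command"]
    return command["name"], command.get("args", {})

name, args = parse_response(
    '{"thoughts": {"text": "..."}, '
    '"command": {"name": "read_file", "args": {"file": "notes.txt"}}}'
)
print(name, args)   # read_file {'file': 'notes.txt'}
```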

GPT-Engineer is another project to create a whole repository of code given a task specified in natural language. GPT-Engineer is instructed to think over a list of smaller components to build and to ask for user input to clarify questions as needed.

Here is a sample conversation for task clarification, sent to the OpenAI ChatCompletion endpoint used by GPT-Engineer. The user inputs are wrapped in {{user input text}}.

[
  {
    "role": "system",
    "content": "You will read instructions and not carry them out, only seek to clarify them.\nSpecifically you will first summarise a list of super short bullets of areas that need clarification.\nThen you will pick one clarifying question, and wait for an answer from the user.\n"
  },
  {
    "role": "user",
    "content": "We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}}\n"
  },
  {
    "role": "assistant",
    "content": "Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?"
  },
  {
    "role": "user",
    "content": "{{There are 10 levels in total. The main character is a plumber named Mario, who can walk and jump. It is a classical platform game just like Super Mario. The main character moves from left to right, trying to get to the destination, where there are many obstacles and attacks from enemies in the process.}}\n\nIs anything else unclear? If yes, only answer in the form:\n{remaining unclear areas} remaining questions.\n{Next question}\nIf everything is sufficiently clear, only answer \"Nothing more to clarify.\"."
  },
  {
    "role": "assistant",
    "content": "Remaining unclear areas: 2 remaining questions.\nCan you provide more information about how the MVC components are split into separate files?"
  },
  {
    "role": "user",
    "content": "{{Make your own assumptions and state them explicitly before starting}}"
  }
]

Then, after these clarifications, the agent moved into the code-writing mode with a different system message. System message:

You will get instructions for code to write. You will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code. Make sure that every detail of the architecture is, in the end, implemented as code.

Think step by step and reason yourself to the right decisions to make sure we get it right. You will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.

Then you will output the content of each file including ALL code. Each file must strictly follow a markdown code block format, where the following tokens must be replaced such that FILENAME is the lowercase file name including the file extension, LANG is the markup code block language for the code’s language, and CODE is the code:

FILENAME

CODE

You will start with the “entrypoint” file, then go to the ones that are imported by that file, and so on. Please note that the code should be fully functional. No placeholders.

Follow a language and framework appropriate best practice file naming convention. Make sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other. Ensure to implement all code, if you are unsure, write a plausible implementation. Include module dependency or package manager dependency definition file. Before you finish, double check that all parts of the architecture is present in the files.

Useful to know: You almost always put different classes in different files. For Python, you always create an appropriate requirements.txt file. For NodeJS, you always create an appropriate package.json file. You always add a comment briefly describing the purpose of the function definition. You try to add comments explaining very complex bits of logic. You always follow the best practices for the requested languages in terms of describing the code written as a defined package/project.

Python toolbelt preferences:

  • pytest
  • dataclasses

Conversation samples:

[
  {
    "role": "system",
    "content": "You will get instructions for code to write.\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\nMake sure that every detail of the architecture is, in the end, implemented as code.\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nPlease note that the code should be fully functional. No placeholders.\n\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.\nEnsure to implement all code, if you are unsure, write a plausible implementation.\nInclude module dependency or package manager dependency definition file.\nBefore you finish, double check that all parts of the architecture is present in the files.\n\nUseful to know:\nYou almost always put different classes in different files.\nFor Python, you always create an appropriate requirements.txt file.\nFor NodeJS, you always create an appropriate package.json file.\nYou always add a comment briefly describing the purpose of the function definition.\nYou try to add comments explaining very complex bits of logic.\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\npackage/project.\n\n\nPython toolbelt preferences:\n- pytest\n- dataclasses\n"
  },
 #  … same conversation as earlier, ended with "Make your own assumptions and state them explicitly before starting".
  {
    "role": "assistant",
    "content": "Assumptions:\n1. Model: The model will contain the game's data, such as level information, character states, and enemy positions.\n2. View: The view will handle the game's visuals, including rendering the game objects, backgrounds, and updating the display.\n3. Controller: The controller will manage user input, such as keyboard controls, and update the model accordingly.\n\nWith these assumptions in mind, is there any specific aspect of the keyboard control implementation you would like me to clarify?"
  },
  {
    "role": "user",
    "content": "Please now remember the steps:\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nFirst lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nPlease note that the code should be fully functional. No placeholders.\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.\nBefore you finish, double check that all parts of the architecture is present in the files.\n"
  }
]

Challenges

After going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations:

  • Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention.

  • Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.

  • Reliability of natural language interface: Current agent systems rely on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refuse to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output.

Citation

Cited as:

Weng, Lilian. (Jun 2023). “LLM-powered Autonomous Agents”. Lil’Log. https://lilianweng.github.io/posts/2023-06-23-agent/.

Or

@article{weng2023agent,
  title   = "LLM-powered Autonomous Agents",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2023",
  month   = "Jun",
  url     = "https://lilianweng.github.io/posts/2023-06-23-agent/"
}

References

[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022

[2] Yao et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv preprint arXiv:2305.10601 (2023).

[3] Liu et al. “Chain of Hindsight Aligns Language Models with Feedback.” arXiv preprint arXiv:2302.02676 (2023).

[4] Liu et al. “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency” arXiv preprint arXiv:2304.11477 (2023).

[5] Yao et al. “ReAct: Synergizing reasoning and acting in language models.” ICLR 2023.

[6] Google Blog. “Announcing ScaNN: Efficient Vector Similarity Search” July 28, 2020.

[7] https://chat.openai.com/share/46ff149e-a4c7-4dd7-a800-fc4a642ea389

[8] Shinn & Labash. “Reflexion: an autonomous agent with dynamic memory and self-reflection” arXiv preprint arXiv:2303.11366 (2023).

[9] Laskin et al. “In-context Reinforcement Learning with Algorithm Distillation” ICLR 2023.

[10] Karpas et al. “MRKL Systems A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.” arXiv preprint arXiv:2205.00445 (2022).

[11] Weaviate Blog. “Why is Vector Search so fast?” Sep 13, 2022.

[12] Li et al. “API-Bank: A Benchmark for Tool-Augmented LLMs” arXiv preprint arXiv:2304.08244 (2023).

[13] Shen et al. “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace” arXiv preprint arXiv:2303.17580 (2023).

[14] Bran et al. “ChemCrow: Augmenting large-language models with chemistry tools.” arXiv preprint arXiv:2304.05376 (2023).

[15] Boiko et al. “Emergent autonomous scientific research capabilities of large language models.” arXiv preprint arXiv:2304.05332 (2023).

[16] Joon Sung Park, et al. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv preprint arXiv:2304.03442 (2023).

[17] AutoGPT. https://github.com/Significant-Gravitas/Auto-GPT

[18] GPT-Engineer. https://github.com/AntonOsika/gpt-engineer