Intelligent agents are considered by many to be the ultimate goal of AI. The classic book by Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, 1995), defines the field of AI research as “the study and design of rational agents.”

The unprecedented capabilities of foundation models have opened the door to agentic applications that were previously unimaginable. These new capabilities make it finally possible to develop autonomous, intelligent agents to act as our assistants, coworkers, and coaches. They can help us create a website, gather data, plan a trip, do market research, manage a customer account, automate data entry, prepare us for interviews, interview our candidates, negotiate a deal, etc. The possibilities seem endless, and the potential economic value of these agents is enormous.

This section will start with an overview of agents and then continue with two aspects that determine the capabilities of an agent: tools and planning. Agents, with their new modes of operations, have new modes of failure. This section will end with a discussion on how to evaluate agents to catch these failures.

This post is adapted from the Agents section of AI Engineering (2025) with minor edits to make it a standalone post.

Notes:

  1. AI-powered agents are an emerging field with no established theoretical frameworks for defining, developing, and evaluating them. This section is a best-effort attempt to build a framework from the existing literature, but it will evolve as the field does. Compared to the rest of the book, this section is more experimental. I received helpful feedback from early reviewers, and I hope to get feedback from readers of this blog post, too.
  2. Just before this book came out, Anthropic published a blog post on Building effective agents (Dec 2024). I’m glad to see that Anthropic’s blog post and my agent section are conceptually aligned, though with slightly different terminologies. However, Anthropic’s post focuses on isolated patterns, whereas my post covers why and how things work. I also focus more on planning, tool selection, and failure modes.
  3. The post contains a lot of background information. Feel free to skip ahead if it feels a little too in the weeds!

Agent Overview

The term agent has been used in many different engineering contexts, including but not limited to a software agent, intelligent agent, user agent, conversational agent, and reinforcement learning agent. So, what exactly is an agent?

An agent is anything that can perceive its environment and act upon that environment. Artificial Intelligence: A Modern Approach (1995) defines an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.

This means that an agent is characterized by the environment it operates in and the set of actions it can perform.

The environment an agent can operate in is defined by its use case. If an agent is developed to play a game (e.g., Minecraft, Go, Dota), that game is its environment. If you want an agent to scrape documents from the internet, the environment is the internet. A self-driving car agent’s environment is the road system and its adjacent areas.

The set of actions an AI agent can perform is augmented by the tools it has access to. Many generative AI-powered applications you interact with daily are agents with access to tools, albeit simple ones. ChatGPT is an agent. It can search the web, execute Python code, and generate images. RAG systems are agents—text retrievers, image retrievers, and SQL executors are their tools.

There’s a strong dependency between an agent’s environment and its set of tools. The environment determines what tools an agent can potentially use. For example, if the environment is a chess game, the only possible actions for an agent are the valid chess moves. However, an agent’s tool inventory restricts the environment it can operate in. For example, if a robot’s only action is swimming, it’ll be confined to a water environment.

Figure 6-8 shows a visualization of SWE-agent (Yang et al., 2024), an agent built on top of GPT-4. Its environment is the computer with the terminal and the file system. Its set of actions includes navigate repo, search files, view files, and edit lines.

A coding agent
Figure 6-8. SWE-agent is a coding agent whose environment is the computer and whose actions include navigation, search, view files, and editing


An AI agent is meant to accomplish tasks typically provided by the users. In an AI agent, AI is the brain that processes the task, plans a sequence of actions to achieve this task, and determines whether the task has been accomplished.

Let’s return to the RAG system with tabular data in the Kitty Vogue example above. This is a simple agent with three actions:

  • response generation,
  • SQL query generation, and
  • SQL query execution.

Given the query "Project the sales revenue for Fruity Fedora over the next three months", the agent might perform the following sequence of actions:

  1. Reason about how to accomplish this task. It might decide that to predict future sales, it first needs the sales numbers from the last five years. An agent’s reasoning can be shown as intermediate responses.
  2. Invoke SQL query generation to generate the query to get sales numbers from the last five years.
  3. Invoke SQL query execution to execute this query.
  4. Reason about the tool outputs (outputs from the SQL query execution) and how they help with sales prediction. It might decide that these numbers are insufficient to make a reliable projection, perhaps because of missing values. It then decides that it also needs information about past marketing campaigns.
  5. Invoke SQL query generation to generate the queries for past marketing campaigns.
  6. Invoke SQL query execution.
  7. Reason that this new information is sufficient to help predict future sales. It then generates a projection.
  8. Reason that the task has been successfully completed.
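
In code, this reason-act loop might look something like the sketch below. Everything here is hypothetical for illustration: llm wraps a chat model call, execute_sql runs a query against the sales database, and the SQL:/ANSWER: convention is just one possible way to let the model choose its next action.

# A minimal sketch of the reason-act loop described above.
def run_agent(task, llm, execute_sql, max_steps=10):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Reason: ask the model to either request more data or answer.
        decision = llm(
            "Decide the next action. Reply with either\n"
            "SQL: <query> to fetch more data, or\n"
            "ANSWER: <response> if you have enough information.\n\n"
            + "\n".join(history)
        )
        if decision.startswith("SQL:"):
            query = decision[len("SQL:"):].strip()   # act: SQL query generation
            result = execute_sql(query)              # act: SQL query execution
            history.append(f"Query: {query}\nResult: {result}")
        else:
            return decision.removeprefix("ANSWER:").strip()  # act: response generation
    return "Stopped: step budget exhausted without a final answer."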

Compared to non-agent use cases, agents typically require more powerful models for two reasons:

  • Compound mistakes: an agent often needs to perform multiple steps to accomplish a task, and the overall accuracy decreases as the number of steps increases. If the model’s accuracy is 95% per step, over 10 steps, the accuracy will drop to 60%, and over 100 steps, the accuracy will be only 0.6%.
  • Higher stakes: with access to tools, agents are capable of performing more impactful tasks, but any failure could have more severe consequences.
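
The compounding arithmetic is easy to verify: if every step must succeed, the end-to-end accuracy is the per-step accuracy raised to the number of steps.

for num_steps in (1, 10, 100):
    print(f"{num_steps:>3} steps: {0.95 ** num_steps:.1%}")
# Prints 95.0%, 59.9%, and 0.6% respectively.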

A task that requires many steps can take time and money to run. A common complaint is that agents are only good for burning through your API credits. However, if agents can be autonomous, they can save human time, making their costs worthwhile.

Given an environment, the success of an agent depends on the tools it has access to and the strength of its AI planner. Let’s start by looking into different kinds of tools a model can use. We’ll analyze AI’s capability for planning next.

Tools

A system doesn’t need access to external tools to be an agent. However, without external tools, the agent’s capabilities would be limited. By itself, a model can typically perform one action—an LLM can generate text and an image generator can generate images. External tools make an agent vastly more capable.

Tools help an agent to both perceive the environment and act upon it. Actions that allow an agent to perceive the environment are read-only actions, whereas actions that allow an agent to act upon the environment are write actions.

The set of tools an agent has access to is its tool inventory. Since an agent’s tool inventory determines what an agent can do, it’s important to think through what and how many tools to give an agent. More tools give an agent more capabilities. However, the more tools there are, the more challenging it is to understand and utilize them well. Experimentation is necessary to find the right set of tools, as discussed later in the “Tool selection” section.

Depending on the agent’s environment, there are many possible tools. Here are three categories of tools that you might want to consider: knowledge augmentation (i.e., context construction), capability extension, and tools that let your agent act upon its environment.

Knowledge augmentation

I hope that this book, so far, has convinced you of the importance of having the relevant context for a model’s response quality. An important category of tools includes those that help augment the knowledge of your agent. Some of them have already been discussed: text retriever, image retriever, and SQL executor. Other potential tools include internal people search, an inventory API that returns the status of different products, Slack retrieval, an email reader, etc.

Many such tools augment a model with your organization’s private processes and information. However, tools can also give models access to public information, especially from the internet.

Web browsing was among the earliest and most anticipated capabilities to be incorporated into ChatGPT. Web browsing prevents a model from going stale. A model goes stale when the data it was trained on becomes outdated. If the model’s training data was cut off last week, it won’t be able to answer questions that require information from this week unless this information is provided in the context. Without web browsing, a model won’t be able to tell you about the weather, news, upcoming events, stock prices, flight status, etc.

I use web browsing as an umbrella term to cover all tools that access the internet, including web browsers and APIs such as search APIs, news APIs, GitHub APIs, or social media APIs.

While web browsing allows your agent to reference up-to-date information to generate better responses and reduce hallucinations, it can also open up your agent to the cesspools of the internet. Select your Internet APIs with care.

Capability extension

You might also consider tools that address the inherent limitations of AI models. They are easy ways to give your model a performance boost. For example, AI models are notorious for being bad at math. If you ask a model what is 199,999 divided by 292, the model will likely fail. However, this calculation would be trivial if the model had access to a calculator. Instead of trying to train the model to be good at arithmetic, it’s a lot more resource-efficient to just give the model access to a tool.

Other simple tools that can significantly boost a model’s capability include a calendar, timezone converter, unit converter (e.g., from lbs to kg), and translator that can translate to and from the languages that the model isn’t good at.

More complex but powerful tools are code interpreters. Instead of training a model to understand code, you can give it access to a code interpreter to execute a piece of code, return the results, or analyze the code’s failures. This capability lets your agents act as coding assistants, data analysts, and even research assistants that can write code to run experiments and report results. However, automated code execution comes with the risk of code injection attacks, as discussed in Chapter 5 in the section “Defensive Prompt Engineering”. Proper security measures are crucial to keep you and your users safe.

Tools can turn a text-only or image-only model into a multimodal model. For example, a model that can generate only texts can leverage a text-to-image model as a tool, allowing it to generate both texts and images. Given a text request, the agent’s AI planner decides whether to invoke text generation, image generation, or both. This is how ChatGPT can generate both text and images—it uses DALL-E as its image generator.

Agents can also use a code interpreter to generate charts and graphs, a LaTeX compiler to render math equations, or a browser to render web pages from HTML code.

Similarly, a model that can process only text inputs can use an image captioning tool to process images and a transcription tool to process audio. It can use an OCR (optical character recognition) tool to read PDFs.

Tool use can significantly boost a model’s performance compared to just prompting or even finetuning. Chameleon (Lu et al., 2023) shows that a GPT-4-powered agent, augmented with a set of 13 tools, can outperform GPT-4 alone on several benchmarks. Examples of tools this agent used are knowledge retrieval, a query generator, an image captioner, a text detector, and Bing search.

On ScienceQA, a science question answering benchmark, Chameleon improves the best published few-shot result by 11.37%. On TabMWP (Tabular Math Word Problems) (Lu et al., 2022), a benchmark involving tabular math questions, Chameleon improves the accuracy by 17%.

Write actions

So far, we’ve discussed read-only actions that allow a model to read from its data sources. But tools can also perform write actions, making changes to the data sources. An SQL executor can retrieve a data table (read) and change or delete the table (write). An email API can read an email but can also respond to it. A banking API can retrieve your current balance, but can also initiate a bank transfer.

Write actions enable a system to do more. They can enable you to automate the whole customer outreach workflow: researching potential customers, finding their contacts, drafting emails, sending first emails, reading responses, following up, extracting orders, updating your databases with new orders, etc.

However, the prospect of giving AI the ability to automatically alter our lives is frightening. Just as you shouldn’t give an intern the authority to delete your production database, you shouldn’t allow an unreliable AI to initiate bank transfers. Trust in the system’s capabilities and its security measures is crucial. You need to ensure that the system is protected from bad actors who might try to manipulate it into performing harmful actions.


Sidebar: Agents and security

Whenever I talk about autonomous AI agents to a group of people, there is often someone who brings up self-driving cars. “What if someone hacks into the car to kidnap you?” While the self-driving car example seems visceral because of its physicality, an AI system can cause harm without a presence in the physical world. It can manipulate the stock market, steal copyrights, violate privacy, reinforce biases, spread misinformation and propaganda, and more, as discussed in the section “Defensive Prompt Engineering” in Chapter 5.

These are all valid concerns, and any organization that wants to leverage AI needs to take safety and security seriously. However, this doesn’t mean that AI systems should never be given the ability to act in the real world. If we can trust a machine to take us into space, I hope that one day, security measures will be sufficient for us to trust autonomous AI systems. Besides, humans can fail, too. Personally, I would trust a self-driving car more than the average stranger to give me a lift.

Just as the right tools can help humans be vastly more productive—can you imagine doing business without Excel or building a skyscraper without cranes?—tools enable models to accomplish many more tasks. Many model providers already support tool use with their models, a feature often called function calling. Going forward, I would expect function calling with a wide set of tools to be common with most models.


Planning

At the heart of a foundation model agent is the model responsible for solving user-provided tasks. A task is defined by its goal and constraints. For example, one task is to schedule a two-week trip from San Francisco to India with a budget of $5,000. The goal is the two-week trip. The constraint is the budget.

Complex tasks require planning. The output of the planning process is a plan, which is a roadmap outlining the steps needed to accomplish a task. Effective planning typically requires the model to understand the task, consider different options to achieve this task, and choose the most promising one.

If you’ve ever been in any planning meeting, you know that planning is hard. As an important computational problem, planning is well studied and would require several volumes to cover. I’ll only be able to cover the surface here.

Planning overview

Given a task, there are many possible ways to solve it, but not all of them will lead to a successful outcome. Among the correct solutions, some are more efficient than others. Consider the query, "How many companies without revenue have raised at least $1 billion?", and the two example solutions:

  1. Find all companies without revenue, then filter them by the amount raised.
  2. Find all companies that have raised at least $1 billion, then filter them by revenue.

The second option is more efficient. There are vastly more companies without revenue than companies that have raised $1 billion. Given only these two options, an intelligent agent should choose option 2.

You can couple planning with execution in the same prompt. For example, you give the model a prompt, ask it to think step by step (such as with a chain-of-thought prompt), and then execute those steps all in one prompt. But what if the model comes up with a 1,000-step plan that doesn’t even accomplish the goal? Without oversight, an agent can run those steps for hours, wasting time and money on API calls, before you realize that it’s not going anywhere.

To avoid fruitless execution, planning should be decoupled from execution. You ask the agent to first generate a plan, and only after this plan is validated is it executed. The plan can be validated using heuristics. For example, one simple heuristic is to eliminate plans with invalid actions. If the generated plan requires a Google search and the agent doesn’t have access to Google Search, this plan is invalid. Another simple heuristic might be eliminating all plans with more than X steps.
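
As a sketch, those two heuristics might look like the following, where a plan is represented as a list of action names (the format used in the plan generation examples later in this post) and tool_inventory is the set of actions the agent actually has:

# A minimal sketch of heuristic plan validation.
def validate_plan(plan, tool_inventory, max_steps=20):
    if len(plan) > max_steps:
        return False  # heuristic: eliminate overly long plans
    # heuristic: eliminate plans containing invalid actions
    return all(action in tool_inventory for action in plan)

validate_plan(["google_search", "generate_response"],
              {"fetch_top_products", "generate_response"})  # False: no Google Search tool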

A plan can also be validated using AI judges. You can ask a model to evaluate whether the plan seems reasonable or how to improve it.

If the generated plan is evaluated to be bad, you can ask the planner to generate another plan. If the generated plan is good, execute it.

If the plan consists of external tools, function calling will be invoked. Outputs from executing this plan will then again need to be evaluated. Note that the generated plan doesn’t have to be an end-to-end plan for the whole task. It can be a small plan for a subtask. The whole process looks like Figure 6-9.

Agent pattern
Figure 6-9. Decoupling planning and execution so that only validated plans are executed


Your system now has three components: one to generate plans, one to validate plans, and another to execute plans. If you consider each component an agent, this can be considered a multi-agent system. Because most agentic workflows are sufficiently complex to involve multiple components, most agents are multi-agent.

To speed up the process, instead of generating plans sequentially, you can generate several plans in parallel and ask the evaluator to pick the most promising one. This is another latency–cost tradeoff, as generating multiple plans simultaneously will incur extra costs.

Planning requires understanding the intention behind a task: what’s the user trying to do with this query? An intent classifier is often used to help agents plan. As shown in Chapter 5 in the section “Break complex tasks into simpler subtasks”, intent classification can be done using another prompt or a classification model trained for this task. The intent classification mechanism can be considered another agent in your multi-agent system.

Knowing the intent can help the agent pick the right tools. For example, for customer support, if the query is about billing, the agent might need access to a tool to retrieve a user’s recent payments. But if the query is about how to reset a password, the agent might need to access documentation retrieval.

Tip:

Some queries might be out of the scope of the agent. The intent classifier should be able to classify requests as IRRELEVANT so that the agent can politely reject those instead of wasting FLOPs coming up with impossible solutions.
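
A minimal sketch of such a classifier as a routing step, reusing the customer support intents from above; the prompt, labels, and llm wrapper are hypothetical:

INTENT_PROMPT = """Classify the user query into one of:
BILLING, PASSWORD_RESET, IRRELEVANT.
Reply with the label only.

Query: {query}"""

def route(query, llm):
    intent = llm(INTENT_PROMPT.format(query=query)).strip()
    if intent == "IRRELEVANT":
        return "Sorry, this request is outside what I can help with."
    return intent  # downstream: pick the tools appropriate for this intent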

So far, we’ve assumed that the agent automates all three stages: generating plans, validating plans, and executing plans. In reality, humans can be involved at any stage to aid with the process and mitigate risks.

  • A human expert can provide a plan, validate a plan, or execute parts of a plan. For example, for complex tasks for which an agent has trouble generating the whole plan, a human expert can provide a high-level plan that the agent can expand upon.
  • If a plan involves risky operations, such as updating a database or merging a code change, the system can ask for explicit human approval before executing or defer to humans to execute these operations. To make this possible, you need to clearly define the level of automation an agent can have for each action.

To summarize, solving a task typically involves the following processes. Note that reflection isn’t mandatory for an agent, but it’ll significantly boost the agent’s performance.

  1. Plan generation: come up with a plan for accomplishing this task. A plan is a sequence of manageable actions, so this process is also called task decomposition.
  2. Reflection and error correction: evaluate the generated plan. If it’s a bad plan, generate a new one.
  3. Execution: take actions outlined in the generated plan. This often involves calling specific functions.
  4. Reflection and error correction: upon receiving the action outcomes, evaluate these outcomes and determine whether the goal has been accomplished. Identify and correct mistakes. If the goal is not completed, generate a new plan.

You’ve already seen some techniques for plan generation and reflection in this book. When you ask a model to “think step by step”, you’re asking it to decompose a task. When you ask a model to “verify if your answer is correct”, you’re asking it to reflect.

Foundation models as planners

An open question is how well foundation models can plan. Many researchers believe that foundation models, at least those built on top of autoregressive language models, cannot. Meta’s Chief AI Scientist Yann LeCun states unequivocally that autoregressive LLMs can’t plan (2023).

While there is a lot of anecdotal evidence that LLMs are poor planners, it’s unclear whether it’s because we don’t know how to use LLMs the right way or because LLMs, fundamentally, can’t plan.

Planning, at its core, is a search problem. You search among different paths towards the goal, predict the outcome (reward) of each path, and pick the path with the most promising outcome. Often, you might determine that no path exists that can take you to the goal.

Search often requires backtracking. For example, imagine you’re at a step where there are two possible actions: A and B. After taking action A, you enter a state that’s not promising, so you need to backtrack to the previous state to take action B.

Some people argue that an autoregressive model can only generate forward actions. It can’t backtrack to generate alternate actions. Because of this, they conclude that autoregressive models can’t plan. However, this isn’t necessarily true. After executing a path with action A, if the model determines that this path doesn’t make sense, it can revise the path using action B instead, effectively backtracking. The model can also always start over and choose another path.

It’s also possible that LLMs are poor planners because they aren’t given the toolings needed to plan. To plan, it’s necessary to know not only the available actions but also the potential outcome of each action. As a simple example, let’s say you want to walk up a mountain. Your potential actions are turn right, turn left, turn around, or go straight ahead. However, if turning right will cause you to fall off the cliff, you might not consider this action. In technical terms, an action takes you from one state to another, and it’s necessary to know the outcome state to determine whether to take an action.

This means that prompting a model to generate only a sequence of actions like what the popular chain-of-thought prompting technique does isn’t sufficient. The paper “Reasoning with Language Model is Planning with World Model” (Hao et al., 2023) argues that an LLM, by containing so much information about the world, is capable of predicting the outcome of each action. This LLM can incorporate this outcome prediction to generate coherent plans.

Even if AI can’t plan, it can still be a part of a planner. It might be possible to augment an LLM with a search tool and state tracking system to help it plan.


Sidebar: Foundation model (FM) versus reinforcement learning (RL) planners

The agent is a core concept in RL, which is defined in Wikipedia as a field “concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward.”

RL agents and FM agents are similar in many ways. They are both characterized by their environments and possible actions. The main difference is in how their planners work. In an RL agent, the planner is trained by an RL algorithm. Training this RL planner can require a lot of time and resources. In an FM agent, the model is the planner. This model can be prompted or finetuned to improve its planning capabilities, and generally requires less time and fewer resources.

However, there’s nothing to prevent an FM agent from incorporating RL algorithms to improve its performance. I suspect that in the long run, FM agents and RL agents will merge.


Plan generation

The simplest way to turn a model into a plan generator is with prompt engineering. Imagine that you want to create an agent to help customers learn about products at Kitty Vogue. You give this agent access to three external tools: retrieve products by price, retrieve top products, and retrieve product information. Here’s an example of a prompt for plan generation. This prompt is for illustration purposes only. Production prompts are likely more complex.

SYSTEM PROMPT
Propose a plan to solve the task. You have access to 5 actions:

* get_time()
* fetch_top_products(start_date, end_date, num_products)
* fetch_product_info(product_name)
* generate_query(task_history, tool_output)
* generate_response(query)

The plan must be a sequence of valid actions.

Examples
Task: "Tell me about Fruity Fedora"
Plan: [fetch_product_info, generate_query, generate_response]

Task: "What was the best selling product last week?"
Plan: [fetch_top_products, generate_query, generate_response]

Task: {USER INPUT}
Plan:

There are two things to note about this example:

  • The plan format used here—a list of functions whose parameters are inferred by the agent—is just one of many ways to structure the agent control flow.
  • The generate_query function takes in the task’s current history and the most recent tool outputs to generate a query to be fed into the response generator. The tool output at each step is added to the task’s history.

Given the user input “What’s the price of the best-selling product last week”, a generated plan might look like this:

  1. get_time()
  2. fetch_top_products()
  3. fetch_product_info()
  4. generate_query()
  5. generate_response()

You might wonder, “What about the parameters needed for each function?” The exact parameters are hard to predict in advance since they are often extracted from the previous tool outputs. If the first step, get_time(), outputs “2030-09-13”, the agent can reason that the next function should be called with the following parameters:

fetch_top_products(
    start_date="2030-09-07",
    end_date="2030-09-13",
    num_products=1
)

Often, there’s insufficient information to determine the exact parameter values for a function. For example, if a user asks “What’s the average price of best-selling products?”, the answers to the following questions are unclear:

  • How many best-selling products does the user want to look at?
  • Does the user want the best-selling products from last week, last month, or of all time?

This means that models frequently have to make guesses, and guesses can be wrong.

Because both the action sequence and the associated parameters are generated by AI models, they can be hallucinated. Hallucinations can cause the model to call an invalid function or call a valid function but with wrong parameters. Techniques for improving a model’s performance in general can be used to improve a model’s planning capabilities.

Tip: Here are a few tips to make an agent better at planning.

  • Write a better system prompt with more examples.
  • Give better descriptions of the tools and their parameters so that the model understands them better.
  • Rewrite the functions themselves to make them simpler, such as refactoring a complex function into two simpler functions.
  • Use a stronger model. In general, stronger models are better at planning.
  • Finetune a model for plan generation.

Function calling

Many model providers offer tool use for their models, effectively turning their models into agents. A tool is a function. Invoking a tool is, therefore, often called function calling. Different model APIs work differently, but in general, function calling works as follows:

  1. Create a tool inventory. Declare all the tools that you might want a model to use. Each tool is described by its execution entry point (e.g., its function name), its parameters, and its documentation (e.g., what the function does and what parameters it needs).
  2. Specify what tools the agent can use for a query.

    Because different queries might need different tools, many APIs let you specify a list of declared tools to be used per query. Some let you control tool use further by the following settings:
    • required: the model must use at least one tool.
    • none: the model shouldn’t use any tool.
    • auto: the model decides which tools to use.

Function calling is illustrated in Figure 6-10. This is written in pseudocode to make it representative of multiple APIs. To use a specific API, please refer to its documentation.

An example of function calling
Figure 6-10. An example of a model using two simple tools


Given a query, an agent defined as in Figure 6-10 will automatically generate what tools to use and their parameters. Some function calling APIs will make sure that only valid functions are generated, though they won’t be able to guarantee the correct parameter values.

For example, given the user query “How many kilograms are 40 pounds?”, the agent might decide that it needs the tool lbs_to_kg with one parameter value of 40. The agent’s response might look like this.

response = ModelResponse(
    finish_reason='tool_calls',
    message=chat.Message(
        content=None,
        role='assistant',
        tool_calls=[
            ToolCall(
                function=Function(
                    arguments='{"lbs":40}',
                    name='lbs_to_kg'),
                type='function')
        ])
)

From this response, you can invoke the function lbs_to_kg(lbs=40) and use its output to generate a response to the user.
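
A sketch of that dispatch step, assuming the response object mirrors the pseudocode above (the JSON-encoded arguments convention varies by API):

import json

def lbs_to_kg(lbs):
    return lbs * 0.45359237  # local implementation of the declared tool

TOOLS = {"lbs_to_kg": lbs_to_kg}

# Dispatch each tool call in the model's response to the matching function.
for call in response.message.tool_calls:
    func = TOOLS[call.function.name]              # look up the local function
    kwargs = json.loads(call.function.arguments)  # e.g., '{"lbs":40}'
    output = func(**kwargs)                       # 40 lbs -> ~18.14 kg
    # Feed `output` back to the model to generate the user-facing response.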

Tip: When working with agents, always ask the system to report what parameter values it uses for each function call. Inspect these values to make sure they are correct.

Planning granularity

A plan is a roadmap outlining the steps needed to accomplish a task. A roadmap can be of different levels of granularity. To plan for a year, a quarter-by-quarter plan is higher-level than a month-by-month plan, which is, in turn, higher-level than a week-to-week plan.

There’s a planning/execution tradeoff. A detailed plan is harder to generate, but easier to execute. A higher-level plan is easier to generate, but harder to execute. An approach to circumvent this tradeoff is to plan hierarchically. First, use a planner to generate a high-level plan, such as a quarter-by-quarter plan. Then, for each quarter, use the same or a different planner to generate a month-by-month plan.

So far, all examples of generated plans use the exact function names, which is very granular. A problem with this approach is that an agent’s tool inventory can change over time. For example, the function to get the current date, get_time(), can be renamed to get_current_time(). When a tool changes, you’ll need to update your prompt and all your examples. Using the exact function names also makes it harder to reuse a planner across different use cases with different tool APIs.

If you’ve previously finetuned a model to generate plans based on the old tool inventory, you’ll need to finetune the model again on the new tool inventory.

To avoid this problem, plans can also be generated using a more natural language, which is higher-level than domain-specific function names. For example, given the query “What’s the price of the best-selling product last week”, an agent can be instructed to output a plan that looks like this:

  1. get current date
  2. retrieve the best-selling product last week
  3. retrieve product information
  4. generate query
  5. generate response

Using more natural language helps your plan generator become robust to changes in tool APIs. If your model was trained mostly on natural language, it’ll likely be better at understanding and generating plans in natural language and less likely to hallucinate.

The downside of this approach is that you need a translator to translate each natural language action into executable commands. Chameleon (Lu et al., 2023) calls this translator a program generator. However, translating is a much simpler task than planning and can be done by weaker models with a lower risk of hallucination.
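
As a sketch, such a translator can be a cheap second model call that maps each natural-language step onto the tool inventory; the prompt format and llm wrapper here are hypothetical, and the output is validated before execution:

VALID_ACTIONS = {"get_time", "fetch_top_products",
                 "fetch_product_info", "generate_query", "generate_response"}

TRANSLATOR_PROMPT = """Translate each step into exactly one of these actions:
{actions}
Output one action name per line, in order.

Steps:
{steps}"""

def translate_plan(steps, llm):
    output = llm(TRANSLATOR_PROMPT.format(
        actions=", ".join(sorted(VALID_ACTIONS)), steps="\n".join(steps)))
    actions = [line.strip() for line in output.splitlines() if line.strip()]
    # Reject hallucinated actions before anything gets executed.
    if not all(action in VALID_ACTIONS for action in actions):
        raise ValueError(f"Translator produced invalid actions: {actions}")
    return actions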

Complex plans

The plan examples so far have been sequential: the next action in the plan is always executed after the previous action is done. The order in which actions can be executed is called a control flow. The sequential form is just one type of control flow. Other types of control flows include the parallel, if statement, and for loop. The list below provides an overview of each control flow, including sequential for comparison:

  • Sequential

    Executing task B after task A is complete, possibly because task B depends on task A. For example, the SQL query can only be executed after it’s been translated from the natural language input.

  • Parallel

    Executing tasks A and B at the same time. For example, given the query “Find me best-selling products under $100”, an agent might first retrieve the top 100 best-selling products and, for each of these products, retrieve its price.

  • If statement

    Executing task B or task C depending on the output from the previous step. For example, the agent first checks NVIDIA’s earnings report. Based on this report, it can then decide to sell or buy NVIDIA stocks. Anthropic’s post calls this pattern “routing”.

  • For loop

    Repeat executing task A until a specific condition is met. For example, keep on generating random numbers until a prime number is generated.

These different control flows are visualized in Figure 6-11.

Agent control flow
Figure 6-11. Examples of different orders in which a plan can be executed


In traditional software engineering, conditions for control flows are exact. With AI-powered agents, AI models determine control flows. Plans with non-sequential control flows are more difficult to both generate and translate into executable commands.

When evaluating an agent framework, check what control flows it supports. For example, if the system needs to browse ten websites, can it do so simultaneously? Parallel execution can significantly reduce the latency perceived by users.
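
For instance, with a framework that exposes asynchronous tools, the ten fetches can be fanned out concurrently; this sketch uses plain asyncio with a hypothetical fetch_page tool:

import asyncio

async def fetch_page(url):
    await asyncio.sleep(0.1)  # stand-in for real network I/O
    return f"<contents of {url}>"

async def browse_all(urls):
    # Parallel control flow: issue all fetches at once instead of one by one.
    return await asyncio.gather(*(fetch_page(url) for url in urls))

pages = asyncio.run(browse_all([f"https://example.com/{i}" for i in range(10)]))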

Reflection and error correction

Even the best plans need to be constantly evaluated and adjusted to maximize their chance of success. While reflection isn’t strictly necessary for an agent to operate, it’s necessary for an agent to succeed.

There are many places during a task process where reflection can be useful:

  • After receiving a user query, to evaluate if the request is feasible.
  • After the initial plan generation, to evaluate whether the plan makes sense.
  • After each execution step, to evaluate if it’s on the right track.
  • After the whole plan has been executed, to determine if the task has been accomplished.

Reflection and error correction are two different mechanisms that go hand in hand. Reflection generates insights that help uncover errors to be corrected.

Reflection can be done with the same agent with self-critique prompts. It can also be done with a separate component, such as a specialized scorer: a model that outputs a concrete score for each outcome.

First proposed by ReAct (Yao et al., 2022), interleaving reasoning and action has become a common pattern for agents. Yao et al. used the term “reasoning” to encompass both planning and reflection. At each step, the agent is asked to explain its thinking (planning), take actions, then analyze observations (reflection), until the task is considered finished by the agent. The agent is typically prompted, using examples, to generate outputs in the following format:

Thought 1: …
Act 1: …
Observation 1: …

… [continue until reflection determines that the task is finished] …

Thought N: … 
Act N: Finish [Response to query]
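
A bare-bones version of this loop might look like the sketch below. The llm wrapper (assumed to return the next thought and act strings) and the act-parsing convention are hypothetical; real implementations need to handle malformed outputs much more carefully:

def parse_action(act):
    # Expect acts like "search[Colorado orogeny]".
    name, _, rest = act.partition("[")
    return name.strip(), rest.rstrip("] ")

def react_loop(task, llm, tools, max_steps=10):
    transcript = f"Task: {task}\n"
    for step in range(1, max_steps + 1):
        thought, act = llm(transcript)  # model proposes the next Thought and Act
        transcript += f"Thought {step}: {thought}\nAct {step}: {act}\n"
        if act.startswith("Finish"):
            return act.removeprefix("Finish").strip("[] ")  # final response
        name, arg = parse_action(act)
        observation = tools[name](arg)  # e.g., tools["search"]("Colorado orogeny")
        transcript += f"Observation {step}: {observation}\n"
    return "Stopped: step budget exhausted."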

Figure 6-12 shows an example of an agent following the ReAct framework responding to a question from HotpotQA (Yang et al., 2018), a benchmark for multi-hop question answering.

ReAct example
Figure 6-12: A ReAct agent in action.


You can implement reflection in a multi-agent setting: one agent plans and takes actions and another agent evaluates the outcome after each step or after a number of steps.

If the agent’s response failed to accomplish the task, you can prompt the agent to reflect on why it failed and how to improve. Based on this suggestion, the agent generates a new plan. This allows agents to learn from their mistakes.

For example, given a code generation task, an evaluator might determine that the generated code fails ⅓ of the test cases. The agent then reflects that it failed because it didn’t take into account arrays where all numbers are negative. The actor then generates new code, taking into account all-negative arrays.

This is the approach that Reflexion (Shinn et al., 2023) took. In this framework, reflection is separated into two modules: an evaluator that evaluates the outcome and a self-reflection module that analyzes what went wrong. Figure 6-13 shows examples of Reflexion agents in action. The authors used the term “trajectory” to refer to a plan. At each step, after evaluation and self-reflection, the agent proposes a new trajectory.

Reflexion example
Figure 6-13. Examples of how Reflexion agents work.


Compared to plan generation, reflection is relatively easy to implement and can bring surprisingly good performance improvement. The downside of this approach is latency and cost. Thoughts, observations, and sometimes actions can take a lot of tokens to generate, which increases cost and user-perceived latency, especially for tasks with many intermediate steps. To nudge their agents to follow the format, both ReAct and Reflexion authors used plenty of examples in their prompts. This increases the cost of computing input tokens and reduces the context space available for other information.

Tool selection

Because tools often play a crucial role in a task’s success, tool selection requires careful consideration. The tools to give your agent depend on the environment and the task, but also on the AI model that powers the agent.

There’s no foolproof guide on how to select the best set of tools. Agent literature consists of a wide range of tool inventories. For example, Toolformer (Schick et al., 2023) finetuned GPT-J to learn five tools. Chameleon (Lu et al., 2023) uses 13 tools. On the other hand, Gorilla (Patil et al., 2023) attempted to prompt agents to select the right API call among 1,645 APIs.

More tools give the agent more capabilities. However, the more tools there are, the harder it is to efficiently use them. It’s similar to how it’s harder for humans to master a large set of tools. Adding tools also means increasing tool descriptions, which might not fit into a model’s context.

Like many other decisions while building AI applications, tool selection requires experimentation and analysis. Here are a few things you can do to help you decide:

  • Compare how an agent performs with different sets of tools.
  • Do an ablation study to see how much the agent’s performance drops if a tool is removed from its inventory. If a tool can be removed without a performance drop, remove it.
  • Look for tools that the agent frequently makes mistakes on. If a tool proves too hard for the agent to use—for example, extensive prompting and even finetuning can’t get the model to learn to use it—change the tool.
  • Plot the distribution of tool calls to see what tools are most used and what tools are least used. Figure 6-14 shows the differences in tool use patterns of GPT-4 and ChatGPT in Chameleon (Lu et al., 2023).
Different models have different tool preferences
Figure 6-14. Different models and tasks express different tool use patterns. Image from Chameleon (Lu et al., 2023).
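
The last check is cheap to implement if you log every tool call; here is a sketch, with the log contents invented for illustration:

from collections import Counter

# Hypothetical log: one record per tool call made by the agent.
call_log = ["knowledge_retrieval", "bing_search", "knowledge_retrieval",
            "image_captioner", "knowledge_retrieval", "query_generator"]

for tool, count in Counter(call_log).most_common():
    print(f"{tool:>20}: {count / len(call_log):.0%}")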


Experiments by Chameleon (Lu et al., 2023) also demonstrate two points:

  1. Different tasks require different tools. ScienceQA, the science question answering task, relies much more on knowledge retrieval tools than TabMWP, a tabular math problem-solving task.
  2. Different models have different tool preferences. For example, GPT-4 seems to select a wider set of tools than ChatGPT. ChatGPT seems to favor image captioning, while GPT-4 seems to favor knowledge retrieval.

Tip:

When evaluating an agent framework, evaluate what planners and tools it supports. Different frameworks might focus on different categories of tools. For example, AutoGPT focuses on social media APIs (Reddit, X, and Wikipedia), whereas Composio focuses on enterprise APIs (Google Apps, GitHub, and Slack).

As your needs will likely change over time, evaluate how easy it is to extend your agent to incorporate new tools.

As humans, we become more productive not just by using the tools we’re given, but also by creating progressively more powerful tools from simpler ones. Can AI create new tools from its initial tools?

Chameleon (Lu et al., 2023) proposes the study of tool transition: after tool X, how likely is the agent to call tool Y? Figure 6-15 shows an example of tool transition. If two tools are frequently used together, they can be combined into a bigger tool. If an agent is aware of this information, the agent itself can combine initial tools to continually build more complex tools.

Tool transition
Figure 6-15. A tool transition tree by Chameleon (Lu et al., 2023).


Voyager (Wang et al., 2023) proposes a skill manager to keep track of new skills (tools) that an agent acquires for later reuse. Each skill is a code program. When the skill manager determines that a newly created skill is useful (e.g., because it’s successfully helped an agent accomplish a task), it adds this skill to the skill library (conceptually similar to the tool inventory). This skill can be retrieved later to use for other tasks.
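As a rough illustration of the idea (not Voyager’s actual implementation), a skill library can be as simple as a store that keeps skills that have proven useful and retrieves them by description:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    description: str  # used to retrieve the skill later
    program: str      # the code that implements the skill

@dataclass
class SkillLibrary:
    skills: dict[str, Skill] = field(default_factory=dict)

    def add_if_useful(self, skill: Skill, succeeded: bool) -> None:
        # Keep only skills that have proven useful, e.g., skills that
        # successfully helped the agent accomplish a task.
        if succeeded:
            self.skills[skill.name] = skill

    def retrieve(self, query: str) -> list[Skill]:
        # Naive keyword match; a real system would likely use
        # embedding-based retrieval over the skill descriptions.
        return [
            s for s in self.skills.values()
            if query.lower() in s.description.lower()
        ]
```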

Earlier in this section, we mentioned that the success of an agent in an environment depends on its tool inventory and its planning capabilities. Failures in either aspect can cause the agent to fail. The next section will discuss different failure modes of an agent and how to evaluate them.

Agent Failure Modes and Evaluation

Evaluation is about detecting failures. The more complex a task an agent performs, the more possible failure points there are. Other than the failure modes common to all AI applications discussed in Chapters 3 and 4, agents also have unique failures caused by planning, tool execution, and efficiency. Some of the failures are easier to catch than others.

To evaluate an agent, identify its failure modes and measure how often each of these failure modes happens.

Planning failures

Planning is hard and can fail in many ways. The most common mode of planning failure is tool use failure. The agent might generate a plan with one or more of these errors (a validation sketch in code follows the list):

  • Invalid tool

    For example, it generates a plan that contains bing_search, which isn’t in the tool inventory.

  • Valid tool, invalid parameters

    For example, it calls lbs_to_kg with two parameters, but this function requires only one parameter, lbs.

  • Valid tool, incorrect parameter values

    For example, it calls lbs_to_kg with one parameter, lbs, but uses the value 100 for lbs when it should be 120.
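The first two error types can be checked mechanically against the tool inventory. Here is a minimal sketch using Python’s inspect module; the inventory and tools are illustrative assumptions:

```python
import inspect

def lbs_to_kg(lbs: float) -> float:
    """Example tool: convert pounds to kilograms."""
    return lbs * 0.45359237

# Hypothetical tool inventory: tool name -> callable.
TOOL_INVENTORY = {"lbs_to_kg": lbs_to_kg}

def validate_call(tool_name: str, args: dict) -> list[str]:
    """Return validation errors for one planned tool call."""
    errors = []
    if tool_name not in TOOL_INVENTORY:
        errors.append(f"invalid tool: {tool_name}")
        return errors
    sig = inspect.signature(TOOL_INVENTORY[tool_name])
    try:
        sig.bind(**args)  # raises TypeError on wrong or missing parameters
    except TypeError as e:
        errors.append(f"invalid parameters for {tool_name}: {e}")
    return errors

print(validate_call("bing_search", {"query": "agents"}))       # invalid tool
print(validate_call("lbs_to_kg", {"lbs": 100, "unit": "lb"}))  # invalid parameters
```

Incorrect parameter values (the third error type) can’t be caught this way; detecting them requires evaluating the plan against the task itself.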

Another mode of planning failure is goal failure: the agent fails to achieve the goal. This can be because the plan doesn’t solve a task, or it solves the task without following the constraints. To illustrate this, imagine you ask the model to plan a two-week trip from San Francisco to India with a budget of $5,000. The agent might plan a trip from San Francisco to Vietnam, or plan you a two-week trip from San Francisco to India that will cost you way over the budget.

A common constraint that is often overlooked by agent evaluation is time. In some cases, the time an agent takes matters less, because you can assign it a task and check in only when it’s done. In many other cases, however, the agent becomes less useful with time. For example, if you ask an agent to prepare a grant proposal and the agent finishes it after the grant deadline, the agent isn’t very helpful.

An interesting mode of planning failure is caused by errors in reflection. The agent is convinced that it’s accomplished a task when it hasn’t. For example, you ask the agent to assign 50 people to 30 hotel rooms. The agent might assign only 40 people and insist that the task has been accomplished.

To evaluate an agent for planning failures, one option is to create a planning dataset where each example is a tuple (task, tool inventory). For each task, use the agent to generate K plans. Compute the following metrics (a sketch of computing them in code follows the list):

  1. Out of all generated plans, how many are valid?
  2. For a given task, how many plans does the agent have to generate, on average, to get a valid plan?
  3. Out of all tool calls, how many are valid?
  4. How often are invalid tools called?
  5. How often are valid tools called with invalid parameters?
  6. How often are valid tools called with incorrect parameter values?
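The following sketch computes a few of these metrics, assuming each plan is a list of (tool name, arguments) calls and that you supply your own validity checks (such as validate_call above):

```python
def evaluate_plans(plans_per_task, plan_is_valid, call_is_valid):
    """plans_per_task: dict mapping each task to its K generated plans."""
    total_plans = valid_plans = 0
    total_calls = valid_calls = 0
    for plans in plans_per_task.values():
        for plan in plans:
            total_plans += 1
            valid_plans += plan_is_valid(plan)
            for tool_name, args in plan:
                total_calls += 1
                valid_calls += call_is_valid(tool_name, args)
    return {
        "valid_plan_rate": valid_plans / total_plans,
        "valid_tool_call_rate": valid_calls / total_calls,
        # Rough estimate of how many plans must be generated per valid plan.
        "avg_plans_per_valid_plan": total_plans / max(valid_plans, 1),
    }
```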

Analyze the agent’s outputs for patterns. What types of tasks does the agent fail more on? Do you have a hypothesis why? What tools does the model frequently make mistakes with? Some tools might be harder for an agent to use. You can improve an agent’s ability to use a challenging tool by better prompting, more examples, or finetuning. If all fail, you might consider swapping out this tool for something easier to use.

Tool failures

Tool failures happen when the correct tool is used, but the tool output is wrong. One failure mode is when a tool just gives the wrong outputs. For example, an image captioner returns a wrong description, or an SQL query generator returns a wrong SQL query.

If the agent generates only high-level plans and a translation module is involved in translating from each planned action to executable commands, failures can happen because of translation errors.

Tool failures are tool-dependent. Each tool needs to be tested independently. Always print out each tool call and its output so that you can inspect and evaluate them. If you have a translator, create benchmarks to evaluate it.
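A lightweight way to get this visibility is to wrap every tool in a logging decorator, as in this sketch (the tool and log format are illustrative):

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)

def logged_tool(fn):
    """Wrap a tool so that every call and its output are logged."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        logging.info("tool=%s args=%s kwargs=%s output=%s",
                     fn.__name__, args, kwargs, result)
        return result
    return wrapper

@logged_tool
def lbs_to_kg(lbs: float) -> float:
    return lbs * 0.45359237

lbs_to_kg(lbs=120)  # logs: tool=lbs_to_kg args=() kwargs={'lbs': 120} output=54.43...
```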

Detecting missing tool failures requires an understanding of what tools should be used. If your agent frequently fails on a specific domain, this might be because it lacks tools for this domain. Work with human domain experts and observe what tools they would use.

Efficiency

An agent might generate a valid plan using the right tools to accomplish a task, but it might be inefficient. Here are a few things you might want to track to evaluate an agent’s efficiency (a tracking sketch follows the list):

  • How many steps does the agent need, on average, to complete a task?
  • How much does the agent cost, on average, to complete a task?
  • How long does each action typically take? Are there any actions that are especially time-consuming or expensive?
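One way to collect these numbers is to have the agent runtime record every action, as in this sketch (the field names and cost figures are assumptions):

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunTracker:
    """Track steps, cost, and per-action latency for one agent run."""
    steps: int = 0
    total_cost: float = 0.0
    action_times: dict[str, list[float]] = field(default_factory=dict)

    def record(self, action: str, seconds: float, cost: float) -> None:
        self.steps += 1
        self.total_cost += cost
        self.action_times.setdefault(action, []).append(seconds)

    def slowest_actions(self, top_k: int = 3):
        # Average latency per action, slowest first.
        avg = {a: sum(t) / len(t) for a, t in self.action_times.items()}
        return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

tracker = RunTracker()
start = time.perf_counter()
# ... execute one agent action here ...
tracker.record("web_search", time.perf_counter() - start, cost=0.002)
```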

You can compare these metrics with your baseline, which can be another agent or a human operator. When comparing AI agents to human agents, keep in mind that humans and AI have very different modes of operation, so what’s considered efficient for humans might be inefficient for AI and vice versa. For example, visiting 100 web pages might be inefficient for a human agent who can only visit one page at a time but trivial for an AI agent that can visit all the web pages at once.

Conclusion

At its core, the concept of an agent is fairly simple. An agent is defined by the environment it operates in and the set of tools it has access to. In an AI-powered agent, the AI model is the brain that leverages its tools and feedback from the environment to plan how best to accomplish a task. Access to tools makes a model vastly more capable, so the agentic pattern is inevitable.

While the concept of “agents” sounds novel, they are built upon many concepts that have been used since the early days of LLMs, including self-critique, chain-of-thought, and structured outputs.

This post covered conceptually how agents work and different components of an agent. In a future post, I’ll discuss how to evaluate agent frameworks.

The agentic pattern often deals with information that exceeds a model’s context limit. A memory system that supplements the model’s context in handling information can significantly enhance an agent’s capabilities. Since this post is already long, I’ll explore how a memory system works in a future blog post.