Agents
Intelligent agents are considered by many to be the ultimate goal of AI. The classic book by Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, 1995), defines the field of AI research as “the study and design of rational agents.”
The unprecedented capabilities of foundation models have opened the door to agentic applications that were previously unimaginable. These new capabilities make it finally possible to develop autonomous, intelligent agents to act as our assistants, coworkers, and coaches. They can help us create a website, gather data, plan a trip, do market research, manage a customer account, automate data entry, prepare us for interviews, interview our candidates, negotiate a deal, etc. The possibilities seem endless, and the potential economic value of these agents is enormous.
This section will start with an overview of agents and then continue with two aspects that determine the capabilities of an agent: tools and planning. Agents, with their new modes of operations, have new modes of failure. This section will end with a discussion on how to evaluate agents to catch these failures.
This post is adapted from the Agents section of AI Engineering (2025) with minor edits to make it a standalone post.
Notes:
- AI-powered agents are an emerging field with no established theoretical frameworks for defining, developing, and evaluating them. This section is a best-effort attempt to build a framework from the existing literature, but it will evolve as the field does. Compared to the rest of the book, this section is more experimental. I received helpful feedback from early reviewers, and I hope to get feedback from readers of this blog post, too.
- Just before this book came out, Anthropic published a blog post on Building effective agents (Dec 2024). I’m glad to see that Anthropic’s blog post and my agent section are conceptually aligned, though with slightly different terminologies. However, Anthropic’s post focuses on isolated patterns, whereas my post covers why and how things work. I also focus more on planning, tool selection, and failure modes.
- The post contains a lot of background information. Feel free to skip ahead if it feels a little too in the weeds!
Agent Overview
The term agent has been used in many different engineering contexts, including but not limited to a software agent, intelligent agent, user agent, conversational agent, and reinforcement learning agent. So, what exactly is an agent?
An agent is anything that can perceive its environment and act upon that environment. Artificial Intelligence: A Modern Approach (1995) defines an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
This means that an agent is characterized by the environment it operates in and the set of actions it can perform.
The environment an agent can operate in is defined by its use case. If an agent is developed to play a game (e.g., Minecraft, Go, Dota), that game is its environment. If you want an agent to scrape documents from the internet, the environment is the internet. A self-driving car agent’s environment is the road system and its adjacent areas.
The set of actions an AI agent can perform is augmented by the tools it has access to. Many generative AI-powered applications you interact with daily are agents with access to tools, albeit simple ones. ChatGPT is an agent. It can search the web, execute Python code, and generate images. RAG systems are agents—text retrievers, image retrievers, and SQL executors are their tools.
There’s a strong dependency between an agent’s environment and its set of tools. The environment determines what tools an agent can potentially use. For example, if the environment is a chess game, the only possible actions for an agent are the valid chess moves. However, an agent’s tool inventory restricts the environment it can operate in. For example, if a robot’s only action is swimming, it’ll be confined to a water environment.
Figure 6-8 shows a visualization of SWE-agent (Yang et al., 2024), an agent built on top of GPT-4. Its environment is the computer with the terminal and the file system. Its set of actions includes navigate repo, search files, view files, and edit lines.

Figure 6-8. SWE-agent is a coding agent whose environment is the computer and whose actions include navigation, search, viewing files, and editing.
An AI agent is meant to accomplish tasks typically provided by the users. In an AI agent, AI is the brain that processes the task, plans a sequence of actions to achieve this task, and determines whether the task has been accomplished.
Let’s return to the RAG system with tabular data in the Kitty Vogue example above. This is a simple agent with three actions:
- response generation,
- SQL query generation, and
- SQL query execution.
Given the query "Project the sales revenue for Fruity Fedora over the next three months", the agent might perform the following sequence of actions:
- Reason about how to accomplish this task. It might decide that to predict future sales, it first needs the sales numbers from the last five years. An agent’s reasoning can be shown as intermediate responses.
- Invoke SQL query generation to generate the query to get sales numbers from the last five years.
- Invoke SQL query execution to execute this query.
- Reason about the tool outputs (outputs from the SQL query execution) and how they help with sales prediction. It might decide that these numbers are insufficient to make a reliable projection, perhaps because of missing values. It then decides that it also needs information about past marketing campaigns.
- Invoke SQL query generation to generate the queries for past marketing campaigns.
- Invoke SQL query execution.
- Reason that this new information is sufficient to help predict future sales. It then generates a projection.
- Reason that the task has been successfully completed.
Compared to non-agent use cases, agents typically require more powerful models for two reasons:
- Compound mistakes: an agent often needs to perform multiple steps to accomplish a task, and the overall accuracy decreases as the number of steps increases. If the model’s accuracy is 95% per step, over 10 steps, the accuracy will drop to 60%, and over 100 steps, the accuracy will be only 0.6% (see the quick sketch after this list).
- Higher stakes: with access to tools, agents are capable of performing more impactful tasks, but any failure could have more severe consequences.
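To make the compound-error arithmetic concrete, here is a quick sketch in plain Python (no agent framework assumed) showing how per-step accuracy compounds over a multi-step plan:

```python
# Overall success rate of a plan whose steps each succeed independently
# with probability `per_step_accuracy`.
def plan_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    return per_step_accuracy ** num_steps

print(plan_success_rate(0.95, 10))   # ~0.599, i.e., roughly 60%
print(plan_success_rate(0.95, 100))  # ~0.0059, i.e., roughly 0.6%
```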
A task that requires many steps can take time and money to run. A common complaint is that agents are only good for burning through your API credits. However, if agents can be autonomous, they can save human time, making their costs worthwhile.
Given an environment, the success of an agent depends on the tools it has access to and the strength of its AI planner. Let’s start by looking into different kinds of tools a model can use. We’ll analyze AI’s capability for planning next.
Tools
A system doesn’t need access to external tools to be an agent. However, without external tools, the agent’s capabilities would be limited. By itself, a model can typically perform one action—an LLM can generate text and an image generator can generate images. External tools make an agent vastly more capable.
Tools help an agent to both perceive the environment and act upon it. Actions that allow an agent to perceive the environment are read-only actions, whereas actions that allow an agent to act upon the environment are write actions.
The set of tools an agent has access to is its tool inventory. Since an agent’s tool inventory determines what an agent can do, it’s important to think through what and how many tools to give an agent. More tools give an agent more capabilities. However, the more tools there are, the more challenging it is to understand and utilize them well. Experimentation is necessary to find the right set of tools, as discussed later in the “Tool selection” section.
Depending on the agent’s environment, there are many possible tools. Here are three categories of tools that you might want to consider: knowledge augmentation (i.e., context construction), capability extension, and tools that let your agent act upon its environment.
Knowledge augmentation
I hope that this book, so far, has convinced you of the importance of having the relevant context for a model’s response quality. An important category of tools includes those that help augment the knowledge of your agent. Some of them have already been discussed: text retriever, image retriever, and SQL executor. Other potential tools include internal people search, an inventory API that returns the status of different products, Slack retrieval, an email reader, etc.
Many such tools augment a model with your organization’s private processes and information. However, tools can also give models access to public information, especially from the internet.
Web browsing was among the earliest and most anticipated capabilities to be incorporated into ChatGPT. Web browsing prevents a model from going stale. A model goes stale when the data it was trained on becomes outdated. If the model’s training data was cut off last week, it won’t be able to answer questions that require information from this week unless this information is provided in the context. Without web browsing, a model won’t be able to tell you about the weather, news, upcoming events, stock prices, flight status, etc.
I use web browsing as an umbrella term to cover all tools that access the internet, including web browsers and APIs such as search APIs, news APIs, GitHub APIs, or social media APIs.
While web browsing allows your agent to reference up-to-date information to generate better responses and reduce hallucinations, it can also open up your agent to the cesspools of the internet. Select your Internet APIs with care.
Capability extension
You might also consider tools that address the inherent limitations of AI models. They are easy ways to give your model a performance boost. For example, AI models are notorious for being bad at math. If you ask a model what is 199,999 divided by 292, the model will likely fail. However, this calculation would be trivial if the model had access to a calculator. Instead of trying to train the model to be good at arithmetic, it’s a lot more resource-efficient to just give the model access to a tool.
Other simple tools that can significantly boost a model’s capability include a calendar, timezone converter, unit converter (e.g., from lbs to kg), and translator that can translate to and from the languages that the model isn’t good at.
More complex but powerful tools are code interpreters. Instead of training a model to understand code, you can give it access to a code interpreter to execute a piece of code, return the results, or analyze the code’s failures. This capability lets your agents act as coding assistants, data analysts, and even research assistants that can write code to run experiments and report results. However, automated code execution comes with the risk of code injection attacks, as discussed in Chapter 5 in the section “Defensive Prompt Engineering“. Proper security measurements are crucial to keep you and your users safe.
Tools can turn a text-only or image-only model into a multimodal model. For example, a model that can generate only texts can leverage a text-to-image model as a tool, allowing it to generate both texts and images. Given a text request, the agent’s AI planner decides whether to invoke text generation, image generation, or both. This is how ChatGPT can generate both text and images—it uses DALL-E as its image generator.
Agents can also use a code interpreter to generate charts and graphs, a LaTex compiler to render math equations, or a browser to render web pages from HTML code.
Similarly, a model that can process only text inputs can use an image captioning tool to process images and a transcription tool to process audio. It can use an OCR (optical character recognition) tool to read PDFs.
Tool use can significantly boost a model’s performance compared to just prompting or even finetuning. Chameleon (Lu et al., 2023) shows that a GPT-4-powered agent, augmented with a set of 13 tools, can outperform GPT-4 alone on several benchmarks. Examples of tools this agent used are knowledge retrieval, a query generator, an image captioner, a text detector, and Bing search.
On ScienceQA, a science question answering benchmark, Chameleon improves the best published few-shot result by 11.37%. On TabMWP (Tabular Math Word Problems) (Lu et al., 2022), a benchmark involving tabular math questions, Chameleon improves the accuracy by 17%.
Write actions
So far, we’ve discussed read-only actions that allow a model to read from its data sources. But tools can also perform write actions, making changes to the data sources. An SQL executor can retrieve a data table (read) and change or delete the table (write). An email API can read an email but can also respond to it. A banking API can retrieve your current balance, but can also initiate a bank transfer.
Write actions enable a system to do more. They can enable you to automate the whole customer outreach workflow: researching potential customers, finding their contacts, drafting emails, sending first emails, reading responses, following up, extracting orders, updating your databases with new orders, etc.
However, the prospect of giving AI the ability to automatically alter our lives is frightening. Just as you shouldn’t give an intern the authority to delete your production database, you shouldn’t allow an unreliable AI to initiate bank transfers. Trust in the system’s capabilities and its security measures is crucial. You need to ensure that the system is protected from bad actors who might try to manipulate it into performing harmful actions.
Sidebar: Agents and security
Whenever I talk about autonomous AI agents to a group of people, there is often someone who brings up self-driving cars. “What if someone hacks into the car to kidnap you?” While the self-driving car example seems visceral because of its physicality, an AI system can cause harm without a presence in the physical world. It can manipulate the stock market, steal copyrights, violate privacy, reinforce biases, spread misinformation and propaganda, and more, as discussed in the section “Defensive Prompt Engineering” in Chapter 5.
These are all valid concerns, and any organization that wants to leverage AI needs to take safety and security seriously. However, this doesn’t mean that AI systems should never be given the ability to act in the real world. If we can trust a machine to take us into space, I hope that one day, security measures will be sufficient for us to trust autonomous AI systems. Besides, humans can fail, too. Personally, I would trust a self-driving car more than the average stranger to give me a lift.
Just as the right tools can help humans be vastly more productive—can you imagine doing business without Excel or building a skyscraper without cranes?—tools enable models to accomplish many more tasks. Many model providers already support tool use with their models, a feature often called function calling. Going forward, I would expect function calling with a wide set of tools to be common with most models.
Planning
At the heart of a foundation model agent is the model responsible for solving user-provided tasks. A task is defined by its goal and constraints. For example, one task is to schedule a two-week trip from San Francisco to India with a budget of $5,000. The goal is the two-week trip. The constraint is the budget.
Complex tasks require planning. The output of the planning process is a plan, which is a roadmap outlining the steps needed to accomplish a task. Effective planning typically requires the model to understand the task, consider different options to achieve this task, and choose the most promising one.
If you’ve ever been in any planning meeting, you know that planning is hard. As an important computational problem, planning is well studied and would require several volumes to cover. I’ll only be able to cover the surface here.
Planning overview
Given a task, there are many possible ways to solve it, but not all of them will lead to a successful outcome. Among the correct solutions, some are more efficient than others. Consider the query, "How many companies without revenue have raised at least $1 billion?", and the two example solutions:
- Find all companies without revenue, then filter them by the amount raised.
- Find all companies that have raised at least $1 billion, then filter them by revenue.
The second option is more efficient. There are vastly more companies without revenue than companies that have raised $1 billion. Given only these two options, an intelligent agent should choose option 2.
You can couple planning with execution in the same prompt. For example, you give the model a prompt, ask it to think step by step (such as with a chain-of-thought prompt), and then execute those steps all in one prompt. But what if the model comes up with a 1,000-step plan that doesn’t even accomplish the goal? Without oversight, an agent can run those steps for hours, wasting time and money on API calls, before you realize that it’s not going anywhere.
To avoid fruitless execution, planning should be decoupled from execution. You ask the agent to first generate a plan, and only after this plan is validated is it executed. The plan can be validated using heuristics. For example, one simple heuristic is to eliminate plans with invalid actions. If the generated plan requires a Google search and the agent doesn’t have access to Google Search, this plan is invalid. Another simple heuristic might be eliminating all plans with more than X steps.
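As a concrete illustration, such a heuristic validator can be a few lines of code; the sketch below assumes a hypothetical tool inventory and step limit, and simply rejects plans that reference unknown actions or exceed a maximum number of steps:

```python
# Hypothetical tool inventory and step limit, for illustration only.
TOOL_INVENTORY = {"fetch_top_products", "fetch_product_info",
                  "generate_query", "generate_response"}
MAX_STEPS = 10

def validate_plan(plan: list[str]) -> tuple[bool, str]:
    """Heuristic validation: every action must be in the tool inventory,
    and the plan must not be unreasonably long."""
    for action in plan:
        if action not in TOOL_INVENTORY:
            return False, f"invalid action: {action}"
    if len(plan) > MAX_STEPS:
        return False, f"plan has {len(plan)} steps; the limit is {MAX_STEPS}"
    return True, "ok"

print(validate_plan(["bing_search", "generate_response"]))
# (False, 'invalid action: bing_search')  # rejected: bing_search isn't available
```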
A plan can also be validated using AI judges. You can ask a model to evaluate whether the plan seems reasonable or how to improve it.
If the generated plan is evaluated to be bad, you can ask the planner to generate another plan. If the generated plan is good, execute it.
If the plan consists of external tools, function calling will be invoked. Outputs from executing this plan will then again need to be evaluated. Note that the generated plan doesn’t have to be an end-to-end plan for the whole task. It can be a small plan for a subtask. The whole process looks like Figure 6-9.

Figure 6-9. Decoupling planning and execution so that only validated plans are executed.
Your system now has three components: one to generate plans, one to validate plans, and another to execute plans. If you consider each component an agent, this can be considered a multi-agent system. Because most agentic workflows are sufficiently complex to involve multiple components, most agents are multi-agent.
To speed up the process, instead of generating plans sequentially, you can generate several plans in parallel and ask the evaluator to pick the most promising one. This is another latency–cost tradeoff, as generating multiple plans simultaneously will incur extra costs.
Planning requires understanding the intention behind a task: what’s the user trying to do with this query? An intent classifier is often used to help agents plan. As shown in Chapter 5 in the section “Break complex tasks into simpler subtasks“, intent classification can be done using another prompt or a classification model trained for this task. The intent classification mechanism can be considered another agent in your multi-agent system.
Knowing the intent can help the agent pick the right tools. For example, for customer support, if the query is about billing, the agent might need access to a tool to retrieve a user’s recent payments. But if the query is about how to reset a password, the agent might need to access documentation retrieval.
Tip: Some queries might be out of the scope of the agent. The intent classifier should be able to classify requests as IRRELEVANT so that the agent can politely reject those instead of wasting FLOPs coming up with impossible solutions.
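One lightweight way to implement such an intent classifier is to prompt a model with a fixed label set and fall back to IRRELEVANT when the output doesn't match; the labels and prompt below are illustrative rather than taken from the book:

```python
INTENT_LABELS = {"BILLING", "PASSWORD_RESET", "PRODUCT_QUESTION", "IRRELEVANT"}

INTENT_PROMPT = """Classify the user query into one of these intents:
BILLING, PASSWORD_RESET, PRODUCT_QUESTION, IRRELEVANT.
Respond with the label only.

Query: {query}
Intent:"""

def classify_intent(query: str, llm) -> str:
    # `llm` is any callable that takes a prompt string and returns the model's text output.
    label = llm(INTENT_PROMPT.format(query=query)).strip().upper()
    return label if label in INTENT_LABELS else "IRRELEVANT"
```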
So far, we’ve assumed that the agent automates all three stages: generating plans, validating plans, and executing plans. In reality, humans can be involved at any stage to aid with the process and mitigate risks.
- A human expert can provide a plan, validate a plan, or execute parts of a plan. For example, for complex tasks for which an agent has trouble generating the whole plan, a human expert can provide a high-level plan that the agent can expand upon.
- If a plan involves risky operations, such as updating a database or merging a code change, the system can ask for explicit human approval before executing or defer to humans to execute these operations. To make this possible, you need to clearly define the level of automation an agent can have for each action.
To summarize, solving a task typically involves the following processes. Note that reflection isn’t mandatory for an agent, but it’ll significantly boost the agent’s performance.
- Plan generation: come up with a plan for accomplishing this task. A plan is a sequence of manageable actions, so this process is also called task decomposition.
- Reflection and error correction: evaluate the generated plan. If it’s a bad plan, generate a new one.
- Execution: take actions outlined in the generated plan. This often involves calling specific functions.
- Reflection and error correction: upon receiving the action outcomes, evaluate these outcomes and determine whether the goal has been accomplished. Identify and correct mistakes. If the goal is not completed, generate a new plan.
You’ve already seen some techniques for plan generation and reflection in this book. When you ask a model to “think step by step”, you’re asking it to decompose a task. When you ask a model to “verify if your answer is correct”, you’re asking it to reflect.
Foundation models as planners
An open question is how well foundation models can plan. Many researchers believe that foundation models, at least those built on top of autoregressive language models, cannot. Meta’s Chief AI Scientist Yann LeCun states unequivocally that autoregressive LLMs can’t plan (2023).
Auto-Regressive LLMs can't plan
(and can't really reason).
"While our own limited experiments didn't show any significant improvement [in planning abilities] through fine tuning, it is possible that with even more fine tuning data and effort, the empirical performance may well… https://t.co/rHA5QHi90C
— Yann LeCun (@ylecun) September 13, 2023
While there is a lot of anecdotal evidence that LLMs are poor planners, it’s unclear whether it’s because we don’t know how to use LLMs the right way or because LLMs, fundamentally, can’t plan.
Planning, at its core, is a search problem. You search among different paths towards the goal, predict the outcome (reward) of each path, and pick the path with the most promising outcome. Often, you might determine that no path exists that can take you to the goal.
Search often requires backtracking. For example, imagine you’re at a step where there are two possible actions: A and B. After taking action A, you enter a state that’s not promising, so you need to backtrack to the previous state to take action B.
Some people argue that an autoregressive model can only generate forward actions. It can’t backtrack to generate alternate actions. Because of this, they conclude that autoregressive models can’t plan. However, this isn’t necessarily true. After executing a path with action A, if the model determines that this path doesn’t make sense, it can revise the path using action B instead, effectively backtracking. The model can also always start over and choose another path.
It’s also possible that LLMs are poor planners because they aren’t given the tooling needed to plan. To plan, it’s necessary to know not only the available actions but also the potential outcome of each action. As a simple example, let’s say you want to walk up a mountain. Your potential actions are turn right, turn left, turn around, or go straight ahead. However, if turning right will cause you to fall off the cliff, you might not consider this action. In technical terms, an action takes you from one state to another, and it’s necessary to know the outcome state to determine whether to take an action.
This means that prompting a model to generate only a sequence of actions like what the popular chain-of-thought prompting technique does isn’t sufficient. The paper “Reasoning with Language Model is Planning with World Model” (Hao et al., 2023) argues that an LLM, by containing so much information about the world, is capable of predicting the outcome of each action. This LLM can incorporate this outcome prediction to generate coherent plans.
Even if AI can’t plan, it can still be a part of a planner. It might be possible to augment an LLM with a search tool and state tracking system to help it plan.
Sidebar: Foundation model (FM) versus reinforcement learning (RL) planners
The agent is a core concept in RL, which is defined in Wikipedia as a field “concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward.”
RL agents and FM agents are similar in many ways. They are both characterized by their environments and possible actions. The main difference is in how their planners work.
- In an RL agent, the planner is trained by an RL algorithm. Training this RL planner can require a lot of time and resources.
- In an FM agent, the model is the planner. This model can be prompted or finetuned to improve its planning capabilities, and generally requires less time and fewer resources.
However, there’s nothing to prevent an FM agent from incorporating RL algorithms to improve its performance. I suspect that in the long run, FM agents and RL agents will merge.
Plan generation
The simplest way to turn a model into a plan generator is with prompt engineering. Imagine that you want to create an agent to help customers learn about products at Kitty Vogue. You give this agent access to three external tools: retrieve products by price, retrieve top products, and retrieve product information. Here’s an example of a prompt for plan generation. This prompt is for illustration purposes only. Production prompts are likely more complex.
SYSTEM PROMPT:
Propose a plan to solve the task. You have access to 5 actions:
* get_time()
* fetch_top_products(start_date, end_date, num_products)
* fetch_product_info(product_name)
* generate_query(task_history, tool_output)
* generate_response(query)
The plan must be a sequence of valid actions.
Examples
Task: "Tell me about Fruity Fedora"
Plan: [fetch_product_info, generate_query, generate_response]
Task: "What was the best selling product last week?"
Plan: [fetch_top_products, generate_query, generate_response]
Task: {USER INPUT}
Plan:
There are two things to note about this example:
- The plan format used here—a list of functions whose parameters are inferred by the agent—is just one of many ways to structure the agent control flow.
- The generate_query function takes in the task’s current history and the most recent tool outputs to generate a query to be fed into the response generator. The tool output at each step is added to the task’s history.
Given the user input “What’s the price of the best-selling product last week”, a generated plan might look like this:
get_time()
fetch_top_products()
fetch_product_info()
generate_query()
generate_response()
You might wonder, “What about the parameters needed for each function?” The exact parameters are hard to predict in advance since they are often extracted from the previous tool outputs. If the first step, get_time(), outputs “2030-09-13”, the agent can reason that the next step should be called with the following parameters:
fetch_top_products(
start_date="2030-09-07",
end_date="2030-09-13",
num_products=1
)
Often, there’s insufficient information to determine the exact parameter values for a function. For example, if a user asks “What’s the average price of best-selling products?”, the answers to the following questions are unclear:
- How many best-selling products does the user want to look at?
- Does the user want the best-selling products last week, last month, or of all time?
This means that models frequently have to make guesses, and guesses can be wrong.
Because both the action sequence and the associated parameters are generated by AI models, they can be hallucinated. Hallucinations can cause the model to call an invalid function or call a valid function but with wrong parameters. Techniques for improving a model’s performance in general can be used to improve a model’s planning capabilities.
Tips for making an agent better at planning.
- Write a better system prompt with more examples.
- Give better descriptions of the tools and their parameters so that the model understands them better.
- Rewrite the functions themselves to make them simpler, such as refactoring a complex function into two simpler functions.
- Use a stronger model. In general, stronger models are better at planning.
- Finetune a model for plan generation.
Function calling
Many model providers offer tool use for their models, effectively turning their models into agents. A tool is a function. Invoking a tool is, therefore, often called function calling. Different model APIs work differently, but in general, function calling works as follows:
- Create a tool inventory. Declare all the tools that you might want a model to use. Each tool is described by its execution entry point (e.g., its function name), its parameters, and its documentation (e.g., what the function does and what parameters it needs).
- Specify what tools the agent can use for a query. Because different queries might need different tools, many APIs let you specify a list of declared tools to be used per query. Some let you control tool use further by the following settings:
  - required: the model must use at least one tool.
  - none: the model shouldn’t use any tool.
  - auto: the model decides which tools to use.
Function calling is illustrated in Figure 6-10. This is written in pseudocode to make it representative of multiple APIs. To use a specific API, please refer to its documentation.

Figure 6-10. An example of a model using two simple tools.
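In the same pseudocode spirit as Figure 6-10, here is a sketch of what declaring a tool inventory and specifying tool use per query might look like; the schema fields and the model.generate call are representative placeholders rather than a specific provider's API, so check your API's documentation for the exact names:

```python
# Tool inventory: each tool has an entry point, parameters, and documentation.
tools = [
    {
        "name": "lbs_to_kg",
        "description": "Convert a weight from pounds to kilograms.",
        "parameters": {
            "type": "object",
            "properties": {"lbs": {"type": "number", "description": "weight in pounds"}},
            "required": ["lbs"],
        },
    },
    {
        "name": "get_time",
        "description": "Return the current date as YYYY-MM-DD.",
        "parameters": {"type": "object", "properties": {}},
    },
]

# Per-query tool policy: "auto" lets the model decide, "required" forces at least
# one tool call, "none" disables tools. (Setting names differ across APIs.)
response = model.generate(
    messages=[{"role": "user", "content": "How many kilograms are 40 pounds?"}],
    tools=tools,
    tool_choice="auto",
)
```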
Given a query, an agent defined as in Figure 6-10 will automatically generate what tools to use and their parameters. Some function calling APIs will make sure that only valid functions are generated, though they won’t be able to guarantee the correct parameter values.
For example, given the user query “How many kilograms are 40 pounds?”, the agent might decide that it needs the tool lbs_to_kg_tool with one parameter value of 40. The agent’s response might look like this:
response = ModelResponse(
    finish_reason='tool_calls',
    message=chat.Message(
        content=None,
        role='assistant',
        tool_calls=[
            ToolCall(
                function=Function(
                    arguments='{"lbs":40}',
                    name='lbs_to_kg'),
                type='function')
        ])
)
From this response, you can invoke the function lbs_to_kg(lbs=40) and use its output to generate a response to the users.
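A minimal sketch of that dispatch step, assuming the tool-call object shape shown above, might look like this (the registry and conversion factor are illustrative):

```python
import json

# Hypothetical registry mapping tool names to local Python functions.
TOOL_REGISTRY = {"lbs_to_kg": lambda lbs: lbs * 0.45359237}

def execute_tool_call(tool_call):
    """Parse the model-generated arguments and invoke the matching function."""
    args = json.loads(tool_call.function.arguments)   # e.g. {"lbs": 40}
    func = TOOL_REGISTRY[tool_call.function.name]
    return func(**args)

# result = execute_tool_call(response.message.tool_calls[0])  # ~18.14 kg
```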
Tip: When working with agents, always ask the system to report what parameter values it uses for each function call. Inspect these values to make sure they are correct.
Planning granularity
A plan is a roadmap outlining the steps needed to accomplish a task. A roadmap can be of different levels of granularity. To plan for a year, a quarter-by-quarter plan is higher-level than a month-by-month plan, which is, in turn, higher-level than a week-to-week plan.
There’s a planning/execution tradeoff. A detailed plan is harder to generate, but easier to execute. A higher-level plan is easier to generate, but harder to execute. An approach to circumvent this tradeoff is to plan hierarchically. First, use a planner to generate a high-level plan, such as a quarter-to-quarter plan. Then, for each quarter, use the same or a different planner to generate a month-to-month plan.
So far, all examples of generated plans use the exact function names, which is very granular. A problem with this approach is that an agent’s tool inventory can change over time. For example, the function to get the current date, get_time(), can be renamed to get_current_time(). When a tool changes, you’ll need to update your prompt and all your examples. Using the exact function names also makes it harder to reuse a planner across different use cases with different tool APIs.
If you’ve previously finetuned a model to generate plans based on the old tool inventory, you’ll need to finetune the model again on the new tool inventory.
To avoid this problem, plans can also be generated using a more natural language, which is higher-level than domain-specific function names. For example, given the query “What’s the price of the best-selling product last week”, an agent can be instructed to output a plan that looks like this:
get current date
retrieve the best-selling product last week
retrieve product information
generate query
generate response
Using more natural language helps your plan generator become robust to changes in tool APIs. If your model was trained mostly on natural language, it’ll likely be better at understanding and generating plans in natural language and less likely to hallucinate.
The downside of this approach is that you need a translator to translate each natural language action into executable commands. Chameleon (Lu et al., 2023) calls this translator a program generator. However, translating is a much simpler task than planning and can be done by weaker models with a lower risk of hallucination.
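One way to picture this translator is as a mapping from natural-language actions to whatever the current tool API happens to be; the mapping below is invented for illustration, and in practice the translation is usually performed by a (weaker) model rather than a hardcoded dictionary:

```python
# Hypothetical mapping from natural-language plan steps to current tool names.
ACTION_TO_TOOL = {
    "get current date": "get_time",
    "retrieve the best-selling product last week": "fetch_top_products",
    "retrieve product information": "fetch_product_info",
    "generate query": "generate_query",
    "generate response": "generate_response",
}

def translate_plan(natural_plan: list[str]) -> list[str]:
    """Translate a natural-language plan into executable tool names.
    If a tool is renamed, only this mapping needs updating, not the planner."""
    return [ACTION_TO_TOOL[step] for step in natural_plan]

print(translate_plan(["get current date", "generate response"]))
# ['get_time', 'generate_response']
```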
Complex plans
The plan examples so far have been sequential: the next action in the plan is always executed after the previous action is done. The order in which actions can be executed is called a control flow. The sequential form is just one type of control flow. Other types of control flows include the parallel, if statement, and for loop. The list below provides an overview of each control flow, including sequential for comparison:
- Sequential: Executing task B after task A is complete, possibly because task B depends on task A. For example, the SQL query can only be executed after it’s been translated from the natural language input.
- Parallel: Executing tasks A and B at the same time. For example, given the query “Find me best-selling products under $100”, an agent might first retrieve the top 100 best-selling products and, for each of these products, retrieve its price.
- If statement: Executing task B or task C depending on the output from the previous step. For example, the agent first checks NVIDIA’s earnings report. Based on this report, it can then decide to sell or buy NVIDIA stocks. Anthropic’s post calls this pattern “routing”.
- For loop: Repeat executing task A until a specific condition is met. For example, keep on generating random numbers until a prime number.
These different control flows are visualized in Figure 6-11.

Figure 6-11. Examples of different orders in which a plan can be executed.
In traditional software engineering, conditions for control flows are exact. With AI-powered agents, AI models determine control flows. Plans with non-sequential control flows are more difficult to both generate and translate into executable commands.
Tip: When evaluating an agent framework, check what control flows it supports. For example, if the system needs to browse ten websites, can it do so simultaneously? Parallel execution can significantly reduce the latency perceived by users.
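For instance, the ten-website case maps to a parallel control flow; the asyncio sketch below uses a stand-in fetch_page coroutine in place of whatever browsing tool the agent actually has:

```python
import asyncio

async def fetch_page(url: str) -> str:
    """Stand-in for the agent's browsing tool."""
    await asyncio.sleep(1)  # simulate ~1 second of network latency
    return f"<contents of {url}>"

async def browse_all(urls: list[str]) -> list[str]:
    # All fetches run concurrently: total latency stays near 1 second
    # instead of ~10 seconds for a sequential control flow.
    return await asyncio.gather(*(fetch_page(u) for u in urls))

pages = asyncio.run(browse_all([f"https://example.com/page{i}" for i in range(10)]))
```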
Reflection and error correction
Even the best plans need to be constantly evaluated and adjusted to maximize their chance of success. While reflection isn’t strictly necessary for an agent to operate, it’s necessary for an agent to succeed.
There are many places during a task process where reflection can be useful:
- After receiving a user query to evaluate if the request is feasible.
- After the initial plan generation to evaluate whether the plan makes sense.
- After each execution step to evaluate if it’s on the right track.
- After the whole plan has been executed to determine if the task has been accomplished.
Reflection and error correction are two different mechanisms that go hand in hand. Reflection generates insights that help uncover errors to be corrected.
Reflection can be done with the same agent with self-critique prompts. It can also be done with a separate component, such as a specialized scorer: a model that outputs a concrete score for each outcome.
First proposed by ReAct (Yao et al., 2022), interleaving reasoning and action has become a common pattern for agents. Yao et al. used the term “reasoning” to encompass both planning and reflection. At each step, the agent is asked to explain its thinking (planning), take actions, then analyze observations (reflection), until the task is considered finished by the agent. The agent is typically prompted, using examples, to generate outputs in the following format:
Thought 1: …
Act 1: …
Observation 1: …
… [continue until reflection determines that the task is finished] …
Thought N: …
Act N: Finish [Response to query]
Figure 6-12 shows an example of an agent following the ReAct framework responding to a question from HotpotQA (Yang et al., 2018), a benchmark for multi-hop question answering.

Figure 6-12. A ReAct agent in action.
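A bare-bones sketch of this interleaved loop is shown below; the action syntax (Act N: Tool[argument]), the llm callable, and the tools dictionary are illustrative placeholders rather than the ReAct authors' implementation:

```python
import re

def parse_action(model_output: str) -> tuple[str, str]:
    """Extract 'Act N: ToolName[argument]' from the model's output (illustrative format)."""
    match = re.search(r"Act \d+:\s*(\w+)\[(.*?)\]", model_output)
    return (match.group(1), match.group(2)) if match else ("Finish", model_output.strip())

def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Minimal ReAct-style loop: think, act, observe, repeat until Finish."""
    transcript = f"Task: {task}\n"
    for step in range(1, max_steps + 1):
        output = llm(transcript + f"Thought {step}:")     # model writes its thought and action
        transcript += f"Thought {step}:{output}\n"
        action, argument = parse_action(output)           # e.g. ("Search", "Apple Remote")
        if action == "Finish":
            return argument                                # final response to the query
        observation = tools[action](argument)              # execute the chosen tool
        transcript += f"Observation {step}: {observation}\n"
    return "Task not finished within the step budget."
```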
You can implement reflection in a multi-agent setting: one agent plans and takes actions and another agent evaluates the outcome after each step or after a number of steps.
If the agent’s response failed to accomplish the task, you can prompt the agent to reflect on why it failed and how to improve. Based on this suggestion, the agent generates a new plan. This allows agents to learn from their mistakes.
For example, given a code generation task, an evaluator might evaluate that the generated code fails ⅓ of the test cases. The agent then reflects that it failed because it didn’t take into account arrays where all numbers are negative. The actor then generates new code, taking into account all-negative arrays.
This is the approach that Reflexion (Shinn et al., 2023) took. In this framework, reflection is separated into two modules: an evaluator that evaluates the outcome and a self-reflection module that analyzes what went wrong. Figure 6-13 shows examples of Reflexion agents in action. The authors used the term “trajectory” to refer to a plan. At each step, after evaluation and self-reflection, the agent proposes a new trajectory.

Figure 6-13. Examples of Reflexion agents in action.
Compared to plan generation, reflection is relatively easy to implement and can bring surprisingly good performance improvement. The downside of this approach is latency and cost. Thoughts, observations, and sometimes actions can take a lot of tokens to generate, which increases cost and user-perceived latency, especially for tasks with many intermediate steps. To nudge their agents to follow the format, both ReAct and Reflexion authors used plenty of examples in their prompts. This increases the cost of computing input tokens and reduces the context space available for other information.
Tool selection
Because tools often play a crucial role in a task’s success, tool selection requires careful consideration. The tools to give your agent depend on the environment and the task, but also on the AI model that powers the agent.
There’s no foolproof guide on how to select the best set of tools. Agent literature consists of a wide range of tool inventories. For example:
- Toolformer (Schick et al., 2023) finetuned GPT-J to learn 5 tools.
- Chameleon (Lu et al., 2023) uses 13 tools.
- Gorilla (Patil et al., 2023) attempted to prompt agents to select the right API call among 1,645 APIs.
More tools give the agent more capabilities. However, the more tools there are, the harder it is to efficiently use them. It’s similar to how it’s harder for humans to master a large set of tools. Adding tools also means increasing tool descriptions, which might not fit into a model’s context.
Like many other decisions while building AI applications, tool selection requires experimentation and analysis. Here are a few things you can do to help you decide:
- Compare how an agent performs with different sets of tools.
- Do an ablation study to see how much the agent’s performance drops if a tool is removed from its inventory. If a tool can be removed without a performance drop, remove it.
- Look for tools that the agent frequently makes mistakes on. If a tool proves too hard for the agent to use—for example, extensive prompting and even finetuning can’t get the model to learn to use it—change the tool.
- Plot the distribution of tool calls to see what tools are most used and what tools are least used (a quick sketch for computing this distribution follows Figure 6-14). Figure 6-14 shows the differences in tool use patterns of GPT-4 and ChatGPT in Chameleon (Lu et al., 2023).

Figure 6-14. Different models and tasks show different tool use patterns.
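Computing such a distribution from agent logs takes only a few lines; the log format below is invented for illustration, and the counts can be plotted with any charting library:

```python
from collections import Counter

# Hypothetical log: one entry per tool call the agent made across evaluation tasks.
tool_call_log = ["knowledge_retrieval", "image_captioner", "knowledge_retrieval",
                 "bing_search", "knowledge_retrieval", "query_generator"]

distribution = Counter(tool_call_log)
for tool, count in distribution.most_common():
    print(f"{tool}: {count / len(tool_call_log):.0%}")  # e.g. knowledge_retrieval: 50%
```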
Experiments by Chameleon (Lu et al., 2023) also demonstrate two points:
- Different tasks require different tools. ScienceQA, the science question answering task, relies much more on knowledge retrieval tools than TabMWP, a tabular math problem-solving task.
- Different models have different tool preferences. For example, GPT-4 seems to select a wider set of tools than ChatGPT. ChatGPT seems to favor image captioning, while GPT-4 seems to favor knowledge retrieval.
Tip: When evaluating an agent framework, evaluate what planners and tools it supports. Different frameworks might focus on different categories of tools. For example, AutoGPT focuses on social media APIs (Reddit, X, and Wikipedia), whereas Composio focuses on enterprise APIs (Google Apps, GitHub, and Slack). As your needs will likely change over time, evaluate how easy it is to extend your agent to incorporate new tools.
As humans, we become more productive not just by using the tools we’re given, but also by creating progressively more powerful tools from simpler ones. Can AI create new tools from its initial tools?
Chameleon (Lu et al., 2023) proposes the study of tool transition: after tool X, how likely is the agent to call tool Y? Figure 6-15 shows an example of tool transition. If two tools are frequently used together, they can be combined into a bigger tool. If an agent is aware of this information, the agent itself can combine initial tools to continually build more complex tools.

Figure 6-15. A tool transition tree by Chameleon (Lu et al., 2023).
Voyager (Wang et al., 2023) proposes a skill manager to keep track of new skills (tools) that an agent acquires for later reuse. Each skill is a coding program. When the skill manager determines that a newly created skill is useful (e.g., because it’s successfully helped an agent accomplish a task), it adds this skill to the skill library (conceptually similar to the tool inventory). This skill can be retrieved later to use for other tasks.
Earlier in this section, we mentioned that the success of an agent in an environment depends on its tool inventory and its planning capabilities. Failures in either aspect can cause the agent to fail. The next section will discuss different failure modes of an agent and how to evaluate them.
Agent Failure Modes and Evaluation
Evaluation is about detecting failures. The more complex a task an agent performs, the more possible failure points there are. Other than the failure modes common to all AI applications discussed in Chapters 3 and 4, agents also have unique failures caused by planning, tool execution, and efficiency. Some of the failures are easier to catch than others.
评估是关于检测失败。一个智能体执行的任务越复杂,可能的故障点就越多。除了第 3 章和第 4 章中讨论的所有 AI 应用共有的故障模式外,智能体还有由规划、工具执行和效率引起的独特故障。一些故障比其他故障更容易被发现。
To evaluate an agent, identify its failure modes and measure how often each of these failure modes happens.
为了评估一个智能体,需要确定其故障模式并测量这些故障模式发生的频率。
Planning failures 规划失败
Planning is hard and can fail in many ways. The most common mode of planning failure is tool use failure. The agent might generate a plan with one or more of these errors.
规划困难,可能会以多种方式失败。规划失败最常见的形式是工具使用失败。代理可能会生成包含一个或多个这些错误的计划。
- Invalid tool 无效工具
For example, it generates a plan that contains bing_search, which isn't in the tool inventory.
例如,它生成的计划中包含 bing_search,而该工具不在工具库存中。
- Valid tool, invalid parameters 有效的工具,无效的参数
For example, it calls lbs_to_kg with two parameters, but this function requires only one parameter, lbs.
例如,它调用 lbs_to_kg 时传入了两个参数,但该函数只需要一个参数 lbs。
- Valid tool, incorrect parameter values 有效的工具,错误的参数值
For example, it calls lbs_to_kg with one parameter, lbs, but uses the value 100 for lbs when it should be 120.
例如,它调用 lbs_to_kg 时传入一个参数 lbs,但 lbs 的值应为 120,却使用了 100。
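The first two error types can be caught automatically before execution by validating each planned call against the tool inventory. Here is a minimal sketch, assuming a plan is a list of (tool name, arguments) pairs and the inventory maps tool names to Python callables; incorrect parameter values generally still require task-specific checks or human review.

```python
import inspect

def lbs_to_kg(lbs: float) -> float:
    return lbs * 0.453592

TOOL_INVENTORY = {"lbs_to_kg": lbs_to_kg}

def validate_plan(plan, inventory):
    """Return a list of (step, error) for invalid tools or invalid parameters."""
    errors = []
    for step, (tool_name, kwargs) in enumerate(plan):
        if tool_name not in inventory:
            errors.append((step, f"invalid tool: {tool_name}"))
            continue
        sig = inspect.signature(inventory[tool_name])
        try:
            sig.bind(**kwargs)  # raises TypeError on missing/unexpected parameters
        except TypeError as e:
            errors.append((step, f"invalid parameters for {tool_name}: {e}"))
    return errors

plan = [("bing_search", {"query": "..."}), ("lbs_to_kg", {"lbs": 100, "unit": "lbs"})]
print(validate_plan(plan, TOOL_INVENTORY))
# [(0, 'invalid tool: bing_search'), (1, 'invalid parameters for lbs_to_kg: ...')]
```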
Another mode of planning failure is goal failure: the agent fails to achieve the goal. This can be because the plan doesn’t solve a task, or it solves the task without following the constraints. To illustrate this, imagine you ask the model to plan a two-week trip from San Francisco to India with a budget of $5,000. The agent might plan a trip from San Francisco to Vietnam, or plan you a two-week trip from San Francisco to India that will cost you way over the budget.
另一种规划失败的模式是目标失败:代理无法实现目标。这可能是因为计划没有解决任务,或者没有遵循约束条件来解决任务。为了说明这一点,想象你要求模型规划一个从旧金山到印度的两周旅行,预算为 5000 美元。代理可能会规划一个从旧金山到越南的旅行,或者为你规划一个从旧金山到印度的两周旅行,费用将远远超出预算。
A common constraint that is often overlooked by agent evaluation is time. In many cases, the time an agent takes matters less because you can assign a task to an agent and only need to check in when it’s done. However, in many cases, the agent becomes less useful with time. For example, if you ask an agent to prepare a grant proposal and the agent finishes it after the grant deadline, the agent isn’t very helpful.
一个常被代理评估忽视的常见约束是时间。在许多情况下,代理所花费的时间并不重要,因为你可以分配一个任务给代理,只需在任务完成后进行检查即可。然而,在许多情况下,随着时间的推移,代理变得不那么有用。例如,如果你要求代理准备一份拨款提案,而代理在拨款截止日期之后才完成,那么这个代理就不是很有帮助了。
An interesting mode of planning failure is caused by errors in reflection. The agent is convinced that it’s accomplished a task when it hasn’t. For example, you ask the agent to assign 50 people to 30 hotel rooms. The agent might assign only 40 people and insist that the task has been accomplished.
一种有趣的规划失败模式是由反思错误引起的。代理确信它已经完成了任务,而实际上并没有。例如,你要求代理将 50 人分配到 30 个酒店房间。代理可能只分配了 40 人,并坚持认为任务已经完成。
To evaluate an agent for planning failures, one option is to create a planning dataset where each example is a tuple (task, tool inventory). For each task, use the agent to generate K plans. Compute the following metrics (a sketch of this computation follows the list below):
为了评估代理的规划失败情况,一个选择是创建一个规划数据集,其中每个示例都是一个元组 (task, tool inventory)。对于每个任务,使用代理生成 K 个计划,并计算以下指标(该计算的示意代码见列表之后):
- Out of all generated plans, how many are valid?
在所有生成的计划中,有多少是有效的? - For a given task, how many plans does the agent have to generate to get a valid plan?
对于给定的任务,代理需要生成多少个计划才能得到一个有效的计划? - Out of all tool calls, how many are valid?
在所有工具调用中,有多少是有效的? - How often are invalid tools called?
无效工具被调用有多频繁? - How often are valid tools called with invalid parameters?
有效工具被传入无效参数的频率如何? - How often are valid tools called with incorrect parameter values?
有效工具被传入错误参数值的频率如何?
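Here is a minimal sketch of computing several of these metrics, assuming each generated plan has already been labeled per tool call with a verdict such as "ok", "invalid_tool", "invalid_params", or "bad_values" (the label names and data layout are hypothetical, e.g., produced by a validator like the one sketched earlier plus manual review).

```python
def plan_metrics(plans_per_task):
    """plans_per_task: {task_id: [K plans, each a list of per-call verdicts]}."""
    total_plans = valid_plans = 0
    call_counts = {"ok": 0, "invalid_tool": 0, "invalid_params": 0, "bad_values": 0}
    plans_until_valid = []  # for each task, index (1-based) of first valid plan

    for plans in plans_per_task.values():
        total_plans += len(plans)
        first_valid = None
        for i, plan in enumerate(plans):
            for verdict in plan:
                call_counts[verdict] += 1
            if all(v == "ok" for v in plan):
                valid_plans += 1
                if first_valid is None:
                    first_valid = i + 1
        if first_valid is not None:
            plans_until_valid.append(first_valid)

    total_calls = sum(call_counts.values())
    return {
        "valid_plan_rate": valid_plans / total_plans,
        "avg_plans_until_valid": (sum(plans_until_valid) / len(plans_until_valid))
                                 if plans_until_valid else None,
        "valid_call_rate": call_counts["ok"] / total_calls,
        "invalid_tool_rate": call_counts["invalid_tool"] / total_calls,
    }

print(plan_metrics({
    "task_1": [["invalid_tool", "ok"], ["ok", "ok"]],
    "task_2": [["ok"]],
}))
```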
Analyze the agent’s outputs for patterns. What types of tasks does the agent fail more on? Do you have a hypothesis why? What tools does the model frequently make mistakes with? Some tools might be harder for an agent to use. You can improve an agent’s ability to use a challenging tool by better prompting, more examples, or finetuning. If all fail, you might consider swapping out this tool for something easier to use.
分析代理的输出以寻找模式。代理在哪些类型的任务上失败得更多?你有什么假设吗?模型经常在哪些工具上犯错误?有些工具可能对代理来说更难使用。你可以通过更好的提示、更多示例或微调来提高代理使用具有挑战性的工具的能力。如果所有方法都失败了,你可能需要考虑用更容易使用的工具替换这个工具。
Tool failures 工具故障
Tool failures happen when the correct tool is used, but the tool output is wrong. One failure mode is when a tool just gives the wrong outputs. For example, an image captioner returns a wrong description, or an SQL query generator returns a wrong SQL query.
工具故障是指使用了正确的工具,但工具输出错误。一种故障模式是工具直接给出错误的输出。例如,图像标题生成器返回错误的描述,或者 SQL 查询生成器返回错误的 SQL 查询。
If the agent generates only high-level plans and a translation module is involved in translating from each planned action to executable commands, failures can happen because of translation errors.
如果代理仅生成高级计划且涉及将每个计划动作翻译成可执行命令的翻译模块,可能会因为翻译错误而发生故障。
Tool failures are tool-dependent. Each tool needs to be tested independently. Always print out each tool call and its output so that you can inspect and evaluate them. If you have a translator, create benchmarks to evaluate it.
工具故障具有工具依赖性。每种工具都需要独立测试。始终打印出每个工具调用及其输出,以便您可以检查和评估它们。如果您有翻译器,创建基准来评估它。
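One lightweight way to make every tool call inspectable is to wrap each tool in a logging decorator so that the call and its output are recorded automatically. A minimal sketch under that assumption; the example tool and its placeholder output are hypothetical.

```python
import functools
import json
import logging

logging.basicConfig(level=logging.INFO)

def logged_tool(func):
    """Wrap a tool so every call and its output are recorded for inspection."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logging.info(json.dumps({
            "tool": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "output": repr(result),
        }))
        return result
    return wrapper

@logged_tool
def sql_query_generator(question: str) -> str:
    # Placeholder implementation; a real tool would call a model or database.
    return f"SELECT * FROM orders  -- placeholder query for: {question}"

sql_query_generator("How many orders were placed last week?")
```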
Detecting missing tool failures requires an understanding of what tools should be used. If your agent frequently fails on a specific domain, this might be because it lacks tools for this domain. Work with human domain experts and observe what tools they would use.
检测缺失的工具故障需要了解应该使用哪些工具。如果你的代理在特定领域频繁失败,这可能是因为它缺少该领域的工具。与人类领域专家合作,观察他们会使用哪些工具。
Efficiency 效率
An agent might generate a valid plan using the right tools to accomplish a task, but it might be inefficient. Here are a few things you might want to track to evaluate an agent’s efficiency:
一个智能体可能使用正确的工具生成一个有效的计划来完成一项任务,但它可能效率不高。以下是一些你可能想要跟踪以评估智能体效率的事项:
- How many steps does the agent need, on average, to complete a task?
平均来说,代理需要多少步才能完成任务? - How much does the agent cost, on average, to complete a task?
平均来说,代理完成一个任务的成本是多少? - How long does each action typically take? Are there any actions that are especially time-consuming or expensive?
每个动作通常需要多长时间?有没有特别耗时或昂贵的动作?
You can compare these metrics with your baseline, which can be another agent or a human operator. When comparing AI agents to human agents, keep in mind that humans and AI have very different modes of operation, so what’s considered efficient for humans might be inefficient for AI and vice versa. For example, visiting 100 web pages might be inefficient for a human agent who can only visit one page at a time but trivial for an AI agent that can visit all the web pages at once.
您可以将这些指标与您的基线进行比较,基线可以是另一个代理或人类操作员。当比较人工智能代理与人类代理时,请记住人类和人工智能具有非常不同的操作模式,因此对人类来说认为是高效的,可能对人工智能来说是不高效的,反之亦然。例如,访问 100 个网页可能对只能一次访问一个网页的人类代理来说是不高效的,但对于可以一次性访问所有网页的人工智能代理来说则是微不足道的。
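To track these efficiency numbers in practice, you can record a per-task trace of steps, cost, and per-action latency and aggregate over tasks. A minimal sketch, assuming a hypothetical trace format and an arbitrary threshold for flagging slow actions:

```python
from statistics import mean

# Hypothetical traces: one record per completed task.
traces = [
    {"task": "book_flight", "steps": 6, "cost_usd": 0.08,
     "latency_s": [1.2, 0.4, 3.1, 0.9, 0.7, 2.0]},
    {"task": "summarize_report", "steps": 3, "cost_usd": 0.02,
     "latency_s": [0.8, 5.6, 0.5]},
]

print("avg steps per task:", mean(t["steps"] for t in traces))
print("avg cost per task ($):", mean(t["cost_usd"] for t in traces))

# Flag unusually slow actions (e.g., > 3 seconds) for closer inspection.
slow_actions = [(t["task"], i, s)
                for t in traces
                for i, s in enumerate(t["latency_s"]) if s > 3]
print("slow actions (task, step, seconds):", slow_actions)
```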
Conclusion 结论
At its core, the concept of an agent is fairly simple. An agent is defined by the environment it operates in and the set of tools it has access to. In an AI-powered agent, the AI model is the brain that leverages its tools and feedback from the environment to plan how best to accomplish a task. Access to tools makes a model vastly more capable, so the agentic pattern is inevitable.
在本质上,代理的概念相当简单。代理由其运作的环境和可访问的工具集定义。在人工智能驱动的代理中,AI 模型是大脑,它利用其工具和环境反馈来规划如何最好地完成任务。工具的访问使模型的能力大大增强,因此代理模式是不可避免的。
While the concept of “agents” sounds novel, they are built upon many concepts that have been used since the early days of LLMs, including self-critique, chain-of-thought, and structured outputs.
虽然“代理”这一概念听起来新颖,但它们建立在许多自 LLM 早期就已被使用的概念之上,包括自我批评、思维链和结构化输出。
This post covered conceptually how agents work and different components of an agent. In a future post, I’ll discuss how to evaluate agent frameworks.
此篇帖子从概念上介绍了智能体的工作原理和智能体的不同组成部分。在未来的帖子中,我将讨论如何评估智能体框架。
The agentic pattern often deals with information that exceeds a model’s context limit. A memory system that supplements the model’s context in handling information can significantly enhance an agent’s capabilities. Since this post is already long, I’ll explore how a memory system works in a future blog post.
代理模式通常处理超出模型上下文限制的信息。一个补充模型上下文以处理信息的记忆系统可以显著增强代理的能力。由于这篇文章已经很长,我将在未来的博客文章中探讨记忆系统的工作原理。