Building effective agents
Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.
In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.
What are agents?
"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:
- Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
- Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.
Below, we will explore both types of agentic systems in detail. In Appendix 1 (“Agents in Practice”), we describe two domains where customers have found particular value in using these kinds of systems.
When (and when not) to use agents
When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.
When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.
When and how to use frameworks
There are many frameworks that make agentic systems easier to implement, including:
- LangGraph from LangChain;
- Amazon Bedrock's AI Agent framework;
- Rivet, a drag and drop GUI LLM workflow builder; and
- Vellum, another GUI tool for building and testing complex workflows.
These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.
We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error.
See our cookbook for some sample implementations.
Building blocks, workflows, and agents
In this section, we’ll explore the common patterns for agentic systems we’ve seen in production. We'll start with our foundational building block—the augmented LLM—and progressively increase complexity, from simple compositional workflows to autonomous agents.
Building block: The augmented LLM
The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities—generating their own search queries, selecting appropriate tools, and determining what information to retain.
We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM. While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.
For the remainder of this post, we'll assume each LLM call has access to these augmented capabilities.
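The augmented-LLM idea can be sketched in a few lines. Everything below is an illustrative stand-in: `fake_llm` replaces a real model call, and the keyword-match `retrieve` replaces a real retrieval system.

```python
def retrieve(query: str, corpus: dict) -> str:
    """Toy retrieval: return the first document whose key appears in the query."""
    for key, doc in corpus.items():
        if key in query.lower():
            return doc
    return ""

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; simply echoes the retrieved context."""
    context_line = next(l for l in prompt.splitlines() if l.startswith("Context:"))
    return "Answer derived from " + context_line

def augmented_call(query: str, corpus: dict, memory: list) -> str:
    context = retrieve(query, corpus)   # retrieval augmentation
    memory.append(query)                # memory augmentation
    prompt = f"Previous turns: {len(memory) - 1}\nContext: {context}\nQuestion: {query}"
    return fake_llm(prompt)

memory: list = []
corpus = {"refunds": "Refunds are processed within 5 days."}
answer = augmented_call("How do refunds work?", corpus, memory)
```

A production version would swap both stubs for real model and tool calls, for example via the Model Context Protocol mentioned above.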
Workflow: Prompt chaining
Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (see "gate" in the diagram below) on any intermediate steps to ensure that the process is still on track.
When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.
Examples where prompt chaining is useful:
- Generating marketing copy, then translating it into a different language.
- Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
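The outline-then-document example can be sketched as a two-step chain with a programmatic gate between the calls. The `llm` function here is a canned stub standing in for a real model call, and the gate criterion is illustrative.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    if prompt.startswith("Write an outline"):
        return "1. Intro\n2. Body\n3. Conclusion"
    return "Full document based on: " + prompt

def gate(outline: str) -> bool:
    """Programmatic check: the outline must have at least three sections."""
    return len(outline.splitlines()) >= 3

def chained(topic: str) -> str:
    outline = llm(f"Write an outline for: {topic}")
    if not gate(outline):            # stop the chain if it goes off track
        raise ValueError("Outline failed the gate check")
    return llm(f"Write the document for this outline:\n{outline}")

doc = chained("building agents")
```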
Workflow: Routing
Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.
When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.
Examples where routing is useful:
- Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
- Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed.
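The customer-service example above can be sketched as follows. The keyword classifier is a toy stand-in for an LLM or traditional classifier, and the handler prompts are hypothetical.

```python
def classify(query: str) -> str:
    """Toy classifier standing in for an LLM or traditional model."""
    q = query.lower()
    if "refund" in q:
        return "refund"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"

# Each category gets its own specialized downstream handler/prompt.
HANDLERS = {
    "refund": lambda q: f"[refund flow] {q}",
    "technical": lambda q: f"[tech support flow] {q}",
    "general": lambda q: f"[general FAQ flow] {q}",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)
```

Because each handler owns its own prompt, optimizing the refund flow cannot degrade the technical-support flow.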
Workflow: Parallelization
LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:
- Sectioning: Breaking a task into independent subtasks run in parallel.
- Voting: Running the same task multiple times to get diverse outputs.
When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.
Examples where parallelization is useful:
- Sectioning:
- Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
- Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
- Voting:
- Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
- Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
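The code-review voting example can be sketched with a thread pool running the "reviewers" concurrently and a programmatic vote-threshold aggregation. The pattern-matching reviewers are stubs standing in for separate LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor

def flags_vulnerability(prompt_name: str, code: str) -> bool:
    """Stub reviewer: each 'prompt' looks for a different risky pattern."""
    patterns = {"injection": "eval(", "secrets": "password =", "io": "os.system"}
    return patterns[prompt_name] in code

def vote(code: str, threshold: int = 2) -> bool:
    # Run all reviewers in parallel, then aggregate programmatically.
    with ThreadPoolExecutor() as pool:
        votes = list(pool.map(lambda name: flags_vulnerability(name, code),
                              ["injection", "secrets", "io"]))
    return sum(votes) >= threshold   # flag only if enough reviewers agree

risky = "password = 'hunter2'\neval(user_input)"
flagged = vote(risky)   # two of three reviewers flag it
```

Tuning `threshold` is how this variation trades false positives against false negatives.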
Workflow: Orchestrator-workers
In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.
When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). While it is topographically similar to parallelization, the key difference is its flexibility—subtasks aren't pre-defined, but determined by the orchestrator based on the specific input.
Example where orchestrator-workers is useful:
- Coding products that make complex changes to multiple files each time.
- Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
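The coding example can be sketched as below. The orchestrator stub derives subtasks from the input at runtime rather than from a fixed list; in a real system the orchestrator, workers, and synthesizer would all be LLM calls.

```python
def orchestrate(task: str) -> list[str]:
    """Stub planner: derive file-level subtasks from the task description."""
    files = [w for w in task.split() if w.endswith(".py")]
    return [f"edit {f}" for f in files]

def worker(subtask: str) -> str:
    """Stub worker: in practice, a focused LLM call per subtask."""
    return f"done: {subtask}"

def synthesize(results: list[str]) -> str:
    return "; ".join(results)

task = "Rename the helper in utils.py and update callers in main.py"
subtasks = orchestrate(task)      # not predefined — determined by the input
summary = synthesize([worker(s) for s in subtasks])
```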
Workflow: Evaluator-optimizer
工作流程:評估者-優化者
In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.
When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.
Examples where evaluator-optimizer is useful:
- Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
- Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
Agents
Agents are emerging in production as LLMs mature in key capabilities—understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it's crucial for the agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.
Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop. It is therefore crucial to design toolsets and their documentation clearly and thoughtfully. We expand on best practices for tool development in Appendix 2 ("Prompt Engineering your Tools").
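That loop can be sketched in a few lines: a (stubbed) LLM picks a tool, the environment's result is fed back as ground truth, and a maximum-iteration stopping condition maintains control. The test-fixing scenario here is illustrative.

```python
def llm_choose_action(history: list[str]) -> str:
    """Stub policy: keep running tests until they pass, then finish."""
    if history and history[-1] == "tests passed":
        return "finish"
    return "run_tests"

def run_tool(action: str, env: dict) -> str:
    """Stub environment: tests pass on the second attempt."""
    if action == "run_tests":
        env["attempts"] += 1
        return "tests passed" if env["attempts"] >= 2 else "tests failed"
    return "done"

def agent_loop(max_iters: int = 5) -> list[str]:
    env = {"attempts": 0}
    history: list[str] = []
    for _ in range(max_iters):                 # stopping condition
        action = llm_choose_action(history)
        if action == "finish":
            break
        history.append(run_tool(action, env))  # ground truth from environment
    return history

trace = agent_loop()   # → ['tests failed', 'tests passed']
```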
When to use agents: Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.
Examples where agents are useful:
The following examples are from our own implementations:
- A coding Agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
- Our “computer use” reference implementation, where Claude uses a computer to accomplish tasks.
Combining and customizing these patterns
These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.
Summary
Success in the LLM space isn't about building the most sophisticated system. It's about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.
When implementing agents, we try to follow three core principles:
- Maintain simplicity in your agent's design.
- Prioritize transparency by explicitly showing the agent’s planning steps.
- Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.
Frameworks can help you get started quickly, but don't hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.
Acknowledgements
Written by Erik Schluntz and Barry Zhang. This work draws upon our experiences building agents at Anthropic and the valuable insights shared by our customers, for which we're deeply grateful.
Appendix 1: Agents in practice
Our work with customers has revealed two particularly promising applications for AI agents that demonstrate the practical value of the patterns discussed above. Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.
A. Customer support
Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration. This is a natural fit for more open-ended agents because:
- Support interactions naturally follow a conversation flow while requiring access to external information and actions;
- Tools can be integrated to pull customer data, order history, and knowledge base articles;
- Actions such as issuing refunds or updating tickets can be handled programmatically; and
- Success can be clearly measured through user-defined resolutions.
Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents' effectiveness.
B. Coding agents
The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:
- Code solutions are verifiable through automated tests;
- Agents can iterate on solutions using test results as feedback;
- The problem space is well-defined and structured; and
- Output quality can be measured objectively.
In our own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.
Appendix 2: Prompt engineering your tools
No matter which agentic system you're building, tools will likely be an important part of your agent. Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition in our API. When Claude responds, it will include a tool use block in the API response if it plans to invoke a tool. Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts. In this brief appendix, we describe how to prompt engineer your tools.
There are often several ways to specify the same action. For instance, you can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, you can return code inside markdown or inside JSON. In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other. However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.
Our suggestions for deciding on tool formats are the following:
- Give the model enough tokens to "think" before it writes itself into a corner.
- Keep the format close to what the model has seen naturally occurring in text on the internet.
- Make sure there's no formatting "overhead" such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.
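The escaping-overhead point is easy to see concretely: the same snippet costs nothing extra inside a markdown fence, but must be newline- and quote-escaped to live inside a JSON string.

```python
import json

code = 'print("hello")\nprint("world")'

# Markdown: the code appears verbatim, no escaping needed.
markdown_form = f"```python\n{code}\n```"

# JSON: every newline and quote must be escaped by the model as it writes.
json_form = json.dumps({"code": code})
```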
One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI). Here are some thoughts on how to do so:
- Put yourself in the model's shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it? If so, then it’s probably also true for the model. A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools.
- How can you change parameter names or descriptions to make things more obvious? Think of this as writing a great docstring for a junior developer on your team. This is especially important when using many similar tools.
- Test how the model uses your tools: Run many example inputs in our workbench to see what mistakes the model makes, and iterate.
- Poka-yoke your tools. Change the arguments so that it is harder to make mistakes.
While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths—and we found that the model used this method flawlessly.
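The poka-yoke fix described above can be sketched as argument validation inside the tool itself: relative paths are rejected outright, making the mistake impossible rather than merely discouraged. The `edit_file` tool here is a simplified illustration, not our actual implementation.

```python
import os

def edit_file(path: str, new_text: str, files: dict) -> str:
    """Toy file-edit tool that refuses relative paths."""
    if not os.path.isabs(path):
        # Surface an actionable error the model can recover from.
        raise ValueError(f"path must be absolute, got {path!r}")
    files[path] = new_text
    return f"wrote {path}"

files: dict = {}
result = edit_file("/repo/src/app.py", "print('hi')", files)
```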