Agents are only as effective as the tools we give them. We share how to write high-quality tools and evaluations, and how you can boost performance by using Claude to optimize its tools for itself.

The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools to solve real-world tasks. But how do we make those tools maximally effective?

In this post, we describe our most effective techniques for improving performance in a variety of agentic AI systems.¹

We begin by covering how you can:

  • Build and test prototypes of your tools
  • Create and run comprehensive evaluations of your tools with agents
  • Collaborate with agents like Claude Code to automatically increase the performance of your tools

We conclude with key principles for writing high-quality tools we’ve identified along the way:

  • Choosing the right tools to implement (and not to implement)
  • Namespacing tools to define clear boundaries in functionality
  • Returning meaningful context from tools back to agents
  • Optimizing tool responses for token efficiency
  • Prompt-engineering tool descriptions and specs

Building an evaluation allows you to systematically measure the performance of your tools. You can use Claude Code to automatically optimize your tools against this evaluation.

What is a tool?

In computing, deterministic systems produce the same output every time given identical inputs, while non-deterministic systems—like agents—can generate varied responses even with the same starting conditions.

When we traditionally write software, we’re establishing a contract between deterministic systems. For instance, a function call like getWeather("NYC") will always fetch the weather in New York City in the exact same manner every time it is called.

Tools are a new kind of software which reflects a contract between deterministic systems and non-deterministic agents. When a user asks “Should I bring an umbrella today?” an agent might call the weather tool, answer from general knowledge, or even ask a clarifying question about location first. Occasionally, an agent might hallucinate or even fail to grasp how to use a tool.

This means fundamentally rethinking our approach when writing software for agents: instead of writing tools and MCP servers the way we’d write functions and APIs for other developers or systems, we need to design them for agents.

Our goal is to increase the surface area over which agents can be effective in solving a wide range of tasks by using tools to pursue a variety of successful strategies. Fortunately, in our experience, the tools that are most “ergonomic” for agents also end up being surprisingly intuitive to grasp as humans.

How to write tools

In this section, we describe how you can collaborate with agents both to write and to improve the tools you give them. Start by standing up a quick prototype of your tools and testing them locally. Next, run a comprehensive evaluation to measure subsequent changes. Working alongside agents, you can repeat the process of evaluating and improving your tools until your agents achieve strong performance on real-world tasks.

Building a prototype

It can be difficult to anticipate which tools agents will find ergonomic and which tools they won’t without getting hands-on yourself. Start by standing up a quick prototype of your tools. If you’re using Claude Code to write your tools (potentially in one-shot), it helps to give Claude documentation for any software libraries, APIs, or SDKs (including potentially the MCP SDK) your tools will rely on. LLM-friendly documentation can commonly be found in flat llms.txt files on official documentation sites (here’s our API’s).

Wrapping your tools in a local MCP server or Desktop extension (DXT) will allow you to connect and test your tools in Claude Code or the Claude Desktop app.

To connect your local MCP server to Claude Code, run claude mcp add <name> <command> [args...].

To connect your local MCP server or DXT to the Claude Desktop app, navigate to Settings > Developer or Settings > Extensions, respectively.

Tools can also be passed directly into Anthropic API calls for programmatic testing.

Test the tools yourself to identify any rough edges. Collect feedback from your users to build an intuition around the use-cases and prompts you expect your tools to enable.

Running an evaluation

Next, you need to measure how well Claude uses your tools by running an evaluation. Start by generating lots of evaluation tasks, grounded in real world uses. We recommend collaborating with an agent to help analyze your results and determine how to improve your tools. See this process end-to-end in our tool evaluation cookbook.

Held-out test set performance of our internal Slack tools: accuracy of human-written vs. Claude-optimized Slack MCP servers.

Generating evaluation tasks

With your early prototype, Claude Code can quickly explore your tools and create dozens of prompt and response pairs. Prompts should be inspired by real-world uses and be based on realistic data sources and services (for example, internal knowledge bases and microservices). We recommend you avoid overly simplistic or superficial “sandbox” environments that don’t stress-test your tools with sufficient complexity. Strong evaluation tasks might require multiple tool calls—potentially dozens.

Here are some examples of strong tasks:

  • Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room.
  • Customer ID 9182 reported that they were charged three times for a single purchase attempt. Find all relevant log entries and determine if any other customers were affected by the same issue.
  • Customer Sarah Chen just submitted a cancellation request. Prepare a retention offer. Determine: (1) why they're leaving, (2) what retention offer would be most compelling, and (3) any risk factors we should be aware of before making an offer.

And here are some weaker tasks:

  • Schedule a meeting with jane@acme.corp next week.
  • Search the payment logs for purchase_complete and customer_id=9182.
  • Find the cancellation request by Customer ID 45892.

Each evaluation prompt should be paired with a verifiable response or outcome. Your verifier can be as simple as an exact string comparison between ground truth and sampled responses, or as advanced as enlisting Claude to judge the response. Avoid overly strict verifiers that reject correct responses due to spurious differences like formatting, punctuation, or valid alternative phrasings.
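
Here is a minimal sketch of such a verifier, assuming the Anthropic TypeScript SDK (the model name and the PASS/FAIL rubric are illustrative): it tries a cheap normalized string comparison first and falls back to Claude as a judge.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Minimal verifier sketch: exact match on normalized strings first, then fall
// back to Claude as a judge with an intentionally simple rubric.
async function verifyResponse(groundTruth: string, sampled: string): Promise<boolean> {
  const normalize = (s: string) => s.trim().toLowerCase();
  if (normalize(groundTruth) === normalize(sampled)) return true;

  const judgment = await client.messages.create({
    model: "claude-sonnet-4-5", // substitute your preferred Claude model
    max_tokens: 10,
    messages: [{
      role: "user",
      content:
        `Ground truth:\n${groundTruth}\n\nCandidate response:\n${sampled}\n\n` +
        `Does the candidate convey the same answer as the ground truth? Reply with only PASS or FAIL.`,
    }],
  });

  for (const block of judgment.content) {
    if (block.type === "text") return block.text.trim().startsWith("PASS");
  }
  return false;
}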

For each prompt-response pair, you can optionally also specify the tools you expect an agent to call in solving the task, to measure whether or not agents are successful in grasping each tool’s purpose during evaluation. However, because there might be multiple valid paths to solving tasks correctly, try to avoid overspecifying or overfitting to strategies.

Running the evaluation

We recommend running your evaluation programmatically with direct LLM API calls. Use simple agentic loops (while-loops wrapping alternating LLM API and tool calls): one loop for each evaluation task. Each evaluation agent should be given a single task prompt and your tools.
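
As an illustration, here is a minimal sketch of one such loop using the Anthropic TypeScript SDK. The model name is illustrative, and executeTool stands in for your own dispatcher that runs the requested tool and returns its output as a string.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Hypothetical dispatcher that runs a named tool with the given input and
// returns its output as a string; implement this against your own tools.
declare function executeTool(name: string, input: unknown): Promise<string>;

// One simple agentic loop per evaluation task: alternate LLM API calls and
// tool calls until Claude stops asking to use tools.
async function runEvalTask(taskPrompt: string, tools: Anthropic.Tool[]) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: taskPrompt }];

  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5", // substitute your preferred Claude model
      max_tokens: 4096,
      tools,
      messages,
    });
    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason !== "tool_use") {
      return messages; // final transcript, including the agent's last response
    }

    // Execute every tool call Claude requested and feed the results back.
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: await executeTool(block.name, block.input),
        });
      }
    }
    messages.push({ role: "user", content: toolResults });
  }
}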

In your evaluation agents’ system prompts, we recommend instructing agents to output not just structured response blocks (for verification), but also reasoning and feedback blocks. Instructing agents to output these before tool call and response blocks may increase LLMs’ effective intelligence by triggering chain-of-thought (CoT) behaviors.
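
A hypothetical system-prompt fragment along these lines (the tag names are illustrative, not a required format):

// Hypothetical evaluation-agent system prompt fragment; tag names are illustrative.
const EVAL_SYSTEM_PROMPT = `
Before each tool call, explain your reasoning inside <reasoning> tags.
When you have finished the task, output:
<feedback>any friction or ambiguity you encountered while using the tools</feedback>
<response>your final answer, and nothing else</response>
`;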

If you’re running your evaluation with Claude, you can turn on interleaved thinking for similar functionality “off-the-shelf”. This will help you probe why agents do or don’t call certain tools and highlight specific areas of improvement in tool descriptions and specs.

As well as top-level accuracy, we recommend collecting other metrics like the total runtime of individual tool calls and tasks, the total number of tool calls, total token consumption, and tool errors. Tracking tool calls can help reveal common workflows that agents pursue and surface opportunities to consolidate tools.

Held-out test set performance of our internal Asana tools: accuracy of human-written vs. Claude-optimized Asana MCP servers.

Analyzing results

Agents are your helpful partners in spotting issues and providing feedback on everything from contradictory tool descriptions to inefficient tool implementations and confusing tool schemas. However, keep in mind that what agents omit in their feedback and responses can often be more important than what they include. LLMs don’t always say what they mean.

Observe where your agents get stumped or confused. Read through your evaluation agents’ reasoning and feedback (or CoT) to identify rough edges. Review the raw transcripts (including tool calls and tool responses) to catch any behavior not explicitly described in the agent’s CoT. Read between the lines; remember that your evaluation agents don’t necessarily know the correct answers and strategies.

Analyze your tool calling metrics. Lots of redundant tool calls might suggest some rightsizing of pagination or token limit parameters is warranted; lots of tool errors for invalid parameters might suggest tools could use clearer descriptions or better examples. When we launched Claude’s web search tool, we identified that Claude was needlessly appending 2025 to the tool’s query parameter, biasing search results and degrading performance (we steered Claude in the right direction by improving the tool description).

Collaborating with agents

You can even let agents analyze your results and improve your tools for you. Simply concatenate the transcripts from your evaluation agents and paste them into Claude Code. Claude is an expert at analyzing transcripts and refactoring lots of tools all at once—for example, to ensure tool implementations and descriptions remain self-consistent when new changes are made.

In fact, most of the advice in this post came from repeatedly optimizing our internal tool implementations with Claude Code. Our evaluations were created on top of our internal workspace, mirroring the complexity of our internal workflows, including real projects, documents, and messages.

We relied on held-out test sets to ensure we did not overfit to our “training” evaluations. These test sets revealed that we could extract additional performance improvements even beyond what we achieved with "expert" tool implementations—whether those tools were manually written by our researchers or generated by Claude itself.

In the next section, we’ll share some of what we learned from this process.

Principles for writing effective tools

In this section, we distill our learnings into a few guiding principles for writing effective tools.

Choosing the right tools for agents

More tools don’t always lead to better outcomes. A common error we’ve observed is tools that merely wrap existing software functionality or API endpoints—whether or not the tools are appropriate for agents. This is because agents have distinct “affordances” from traditional software—that is, they have different ways of perceiving the potential actions they can take with those tools.

LLM agents have limited "context" (that is, there are limits to how much information they can process at once), whereas computer memory is cheap and abundant. Consider the task of searching for a contact in an address book. Traditional software programs can efficiently store and process a list of contacts one at a time, checking each one before moving on.

However, if an LLM agent uses a tool that returns ALL contacts and then has to read through each one token-by-token, it's wasting its limited context space on irrelevant information (imagine searching for a contact in your address book by reading each page from top-to-bottom—that is, via brute-force search). The better and more natural approach (for agents and humans alike) is to skip to the relevant page first (perhaps finding it alphabetically).

We recommend building a few thoughtful tools targeting specific high-impact workflows that match your evaluation tasks, and scaling up from there. In the address book case, you might choose to implement a search_contacts or message_contact tool instead of a list_contacts tool.
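
For instance, a hypothetical search_contacts tool spec in the Anthropic API's tool format might look like the following (the description and parameters are invented for this example):

import Anthropic from "@anthropic-ai/sdk";

// Hypothetical spec for a targeted search_contacts tool, rather than a
// list_contacts tool that would dump the entire address book into context.
const searchContactsTool: Anthropic.Tool = {
  name: "search_contacts",
  description:
    "Search the address book by name or email and return up to `limit` matching contacts.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Name or email fragment to search for" },
      limit: { type: "number", description: "Maximum number of contacts to return (default: 5)" },
    },
    required: ["query"],
  },
};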

Tools can consolidate functionality, handling potentially multiple discrete operations (or API calls) under the hood. For example, tools can enrich tool responses with related metadata or handle frequently chained, multi-step tasks in a single tool call.

Here are some examples:

  • Instead of implementing list_users, list_events, and create_event tools, consider implementing a schedule_event tool which finds availability and schedules an event (see the sketch after this list).
  • Instead of implementing a read_logs tool, consider implementing a search_logs tool which only returns relevant log lines and some surrounding context.
  • Instead of implementing get_customer_by_id, list_transactions, and list_notes tools, implement a get_customer_context tool which compiles all of a customer’s recent & relevant information all at once.
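
As a sketch of the first example above, a consolidated schedule_event tool might handle several underlying API calls in a single tool call. Here, calendarApi and its methods are hypothetical stand-ins for whatever services your tool wraps.

// Hypothetical calendar client standing in for the underlying APIs your tool wraps.
declare const calendarApi: {
  findCommonAvailability(
    attendees: string[],
    durationMinutes: number
  ): Promise<{ start: string; end: string }[]>;
  createEvent(event: {
    attendees: string[];
    topic: string;
    start: string;
    end: string;
  }): Promise<{ id: string }>;
};

// One tool call handles what would otherwise be list_users + list_events + create_event.
async function scheduleEvent(input: { attendees: string[]; durationMinutes: number; topic: string }) {
  const slots = await calendarApi.findCommonAvailability(input.attendees, input.durationMinutes);
  if (slots.length === 0) {
    return "No common availability found; consider shortening the meeting or dropping optional attendees.";
  }
  const slot = slots[0]; // naive choice: earliest common slot
  const event = await calendarApi.createEvent({
    attendees: input.attendees,
    topic: input.topic,
    start: slot.start,
    end: slot.end,
  });
  return `Scheduled "${input.topic}" from ${slot.start} to ${slot.end} (event ${event.id}).`;
}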

Make sure each tool you build has a clear, distinct purpose. Tools should enable agents to subdivide and solve tasks in much the same way that a human would, given access to the same underlying resources, and simultaneously reduce the context that would have otherwise been consumed by intermediate outputs.

Too many tools or overlapping tools can also distract agents from pursuing efficient strategies. Careful, selective planning of the tools you build (or don’t build) can really pay off.

Namespacing your tools

Your AI agents will potentially gain access to dozens of MCP servers and hundreds of different tools–including those by other developers. When tools overlap in function or have a vague purpose, agents can get confused about which ones to use.

Namespacing (grouping related tools under common prefixes) can help delineate boundaries between lots of tools; MCP clients sometimes do this by default. For example, namespacing tools by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search, asana_users_search), can help agents select the right tools at the right time.

We have found selecting between prefix- and suffix-based namespacing to have non-trivial effects on our tool-use evaluations. Effects vary by LLM and we encourage you to choose a naming scheme according to your own evaluations.

Agents might call the wrong tools, call the right tools with the wrong parameters, call too few tools, or process tool responses incorrectly. By selectively implementing tools whose names reflect natural subdivisions of tasks, you simultaneously reduce the number of tools and tool descriptions loaded into the agent’s context and offload agentic computation from the agent’s context back into the tool calls themselves. This reduces an agent’s overall risk of making mistakes.

Returning meaningful context from your tools

In the same vein, tool implementations should take care to return only high signal information back to agents. They should prioritize contextual relevance over flexibility, and eschew low-level technical identifiers (for example: uuid, 256px_image_url, mime_type). Fields like name, image_url, and file_type are much more likely to directly inform agents’ downstream actions and responses.

Agents also tend to grapple with natural language names, terms, or identifiers significantly more successfully than they do with cryptic identifiers. We’ve found that merely resolving arbitrary alphanumeric UUIDs to more semantically meaningful and interpretable language (or even a 0-indexed ID scheme) significantly improves Claude’s precision in retrieval tasks by reducing hallucinations.

In some instances, agents may require the flexibility to interact with both natural language and technical identifier outputs, if only to trigger downstream tool calls (for example, search_user(name='jane') → send_message(id=12345)). You can enable both by exposing a simple response_format enum parameter in your tool, allowing your agent to control whether tools return “concise” or “detailed” responses (images below).

You can add more formats for even greater flexibility, similar to GraphQL where you can choose exactly which pieces of information you want to receive. Here is an example ResponseFormat enum to control tool response verbosity:

enum ResponseFormat {
   DETAILED = "detailed",
   CONCISE = "concise"
}

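Here is a sketch of how a Slack-style tool might honor this parameter; slackApi and its fields are hypothetical stand-ins for the underlying client.

// Hypothetical underlying Slack client used by the tool implementation.
declare const slackApi: {
  fetchThread(threadTs: string): Promise<{
    messages: { authorName: string; userId: string; ts: string; text: string }[];
  }>;
};

// Uses the ResponseFormat enum defined above. The tool returns only thread
// content by default, and includes identifiers (needed to drive further tool
// calls) only when "detailed" is requested.
async function getThread(threadTs: string, format: ResponseFormat = ResponseFormat.CONCISE) {
  const thread = await slackApi.fetchThread(threadTs);
  if (format === ResponseFormat.CONCISE) {
    return thread.messages.map((m) => `${m.authorName}: ${m.text}`).join("\n");
  }
  return thread.messages
    .map((m) => `${m.authorName} (user_id=${m.userId}, thread_ts=${m.ts}): ${m.text}`)
    .join("\n");
}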

Here’s an example of a detailed tool response (206 tokens):

This code snippet depicts an example of a detailed tool response.

Here’s an example of a concise tool response (72 tokens):

This code snippet depicts a concise tool response.
Slack threads and thread replies are identified by a unique thread_ts, which is required to fetch thread replies. thread_ts and other IDs (channel_id, user_id) can be retrieved from a “detailed” tool response to enable further tool calls that require them. “concise” tool responses return only thread content and exclude IDs. In this example, “concise” tool responses use roughly one third of the tokens.

Even your tool response structure—for example XML, JSON, or Markdown—can have an impact on evaluation performance: there is no one-size-fits-all solution. This is because LLMs are trained on next-token prediction and tend to perform better with formats that match their training data. The optimal response structure will vary widely by task and agent. We encourage you to select the best response structure based on your own evaluation.

Optimizing tool responses for token efficiency

Optimizing the quality of context is important. But so is optimizing the quantity of context returned back to agents in tool responses.

We suggest implementing some combination of pagination, range selection, filtering, and/or truncation with sensible default parameter values for any tool responses that could use up lots of context. For Claude Code, we restrict tool responses to 25,000 tokens by default. We expect the effective context length of agents to grow over time, but the need for context-efficient tools to remain.
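
As an illustration, here is a hedged sketch of a truncation helper with a sensible default limit (the character-based token estimate and the 25,000-token cap are illustrative; use your tokenizer of choice):

// Illustrative default cap on tool-response size; tune against your own evaluation.
const DEFAULT_MAX_TOKENS = 25_000;

// Rough token estimate (~4 characters per token); swap in a real tokenizer if you have one.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Truncate oversized tool responses and steer the agent toward a more
// token-efficient follow-up instead of failing silently.
function truncateResponse(text: string, maxTokens: number = DEFAULT_MAX_TOKENS): string {
  if (estimateTokens(text) <= maxTokens) return text;
  const truncated = text.slice(0, maxTokens * 4);
  return (
    truncated +
    "\n\n[Response truncated. Try a more specific query, a smaller page size, or a filter to see the remaining results.]"
  );
}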

If you choose to truncate responses, be sure to steer agents with helpful instructions. You can directly encourage agents to pursue more token-efficient strategies, like making many small and targeted searches instead of a single, broad search for a knowledge retrieval task. Similarly, if a tool call raises an error (for example, during input validation), you can prompt-engineer your error responses to clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks.
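
For instance, a hypothetical input-validation error written this way names the offending parameter, the expected format, and a valid example:

// Unhelpful: an opaque code forces the agent to guess what went wrong.
//   return "Error 400: invalid request";

// Helpful: name the parameter, the expected format, and a valid example.
function validateDateParam(dateInput: string): { ok: true } | { ok: false; error: string } {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(dateInput)) {
    return {
      ok: false,
      error: `Invalid \`date\` parameter: "${dateInput}". Expected ISO 8601 format (YYYY-MM-DD), e.g. "2025-03-14".`,
    };
  }
  return { ok: true };
}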

Here’s an example of a truncated tool response:

This image depicts an example of a truncated tool response.

Here’s an example of an unhelpful error response:

This image depicts an example of an unhelpful tool response.

Here’s an example of a helpful error response:

This image depicts an example of a helpful error response.
Tool truncation and error responses can steer agents towards more token-efficient tool-use behaviors (using filters or pagination) or give examples of correctly formatted tool inputs.

Prompt-engineering your tool descriptions

We now come to one of the most effective methods for improving tools: prompt-engineering your tool descriptions and specs. Because these are loaded into your agents’ context, they can collectively steer agents toward effective tool-calling behaviors.

When writing tool descriptions and specs, think of how you would describe your tool to a new hire on your team. Consider the context that you might implicitly bring—specialized query formats, definitions of niche terminology, relationships between underlying resources—and make it explicit. Avoid ambiguity by clearly describing (and enforcing with strict data models) expected inputs and outputs. In particular, input parameters should be unambiguously named: instead of a parameter named user, try a parameter named user_id.
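
As a hypothetical illustration of making that implicit context explicit, the description below spells out the query syntax, defines a niche term, and uses an unambiguous user_id parameter (all names and values are invented for this example):

import Anthropic from "@anthropic-ai/sdk";

// Hypothetical description that states query syntax, niche terminology, and an
// unambiguous `user_id` parameter, rather than assuming the agent knows them.
const searchTicketsTool: Anthropic.Tool = {
  name: "jira_tickets_search",
  description:
    "Search Jira tickets with a JQL query (e.g. `project = ACME AND status = \"In Progress\"`). " +
    "Returns ticket keys, titles, and assignees. 'Tickets' here include bugs, tasks, and stories.",
  input_schema: {
    type: "object",
    properties: {
      jql: { type: "string", description: "JQL query string" },
      user_id: {
        type: "string",
        description: "Jira account ID of the requesting user; used to scope results to projects they can access",
      },
    },
    required: ["jql"],
  },
};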

With your evaluation you can measure the impact of your prompt engineering with greater confidence. Even small refinements to tool descriptions can yield dramatic improvements. Claude Sonnet 3.5 achieved state-of-the-art performance on the SWE-bench Verified evaluation after we made precise refinements to tool descriptions, dramatically reducing error rates and improving task completion.

You can find other best practices for tool definitions in our Developer Guide. If you’re building tools for Claude, we also recommend reading about how tools are dynamically loaded into Claude’s system prompt. Lastly, if you’re writing tools for an MCP server, tool annotations help disclose which tools require open-world access or make destructive changes.

Looking ahead

To build effective tools for agents, we need to re-orient our software development practices from predictable, deterministic patterns to non-deterministic ones.

Through the iterative, evaluation-driven process we’ve described in this post, we've identified consistent patterns in what makes tools successful: Effective tools are intentionally and clearly defined, use agent context judiciously, can be combined together in diverse workflows, and enable agents to intuitively solve real-world tasks.

In the future, we expect the specific mechanisms through which agents interact with the world to evolve—from updates to the MCP protocol to upgrades to the underlying LLMs themselves. With a systematic, evaluation-driven approach to improving tools for agents, we can ensure that as agents become more capable, the tools they use will evolve alongside them.

Acknowledgements

Written by Ken Aizawa with valuable contributions from colleagues across Research (Barry Zhang, Zachary Witten, Daniel Jiang, Sami Al-Sheikh, Matt Bell, Maggie Vo), MCP (Theodora Chu, John Welsh, David Soria Parra, Adam Jones), Product Engineering (Santiago Seira), Marketing (Molly Vorwerck), Design (Drew Roper), and Applied AI (Christian Ryan, Alexander Bricken).

¹ Beyond training the underlying LLMs themselves.