Overview
The Tool Response Handling evaluator determines whether an AI agent correctly processed a tool’s result to produce an appropriate output. It focuses on what happens after the tool call, validating that the agent used the tool result accurately, rather than on whether the right tool was selected or invoked correctly.
When to Use
Use the Tool Response Handling evaluator when you need to:
- Detect hallucinated data: identify when the agent invents information not present in the tool result
- Validate data extraction: ensure dates, numbers, and structured fields are correctly parsed and transformed
- Check error handling: verify the agent retries transient errors and corrects argument errors appropriately
- Audit for information disclosure: check that credentials, internal URLs, or PII from tool results are not leaked to users
- Evaluate multi-tool handling: validate that the agent correctly incorporates results from multiple tool calls
This evaluator validates how the agent handled the tool result, not whether the right tool was chosen or invoked correctly. Use the Tool Selection evaluator to evaluate tool choice, and the Tool Invocation evaluator to validate argument correctness. Together, all three evaluators provide complete coverage of the tool-calling pipeline.
Supported Levels
The level of an evaluator determines the scope of the evaluation in OpenTelemetry terms. Some evaluations apply to individual spans, some to full traces or sessions, and some at multiple levels.

| Level | Supported | Notes |
|---|---|---|
| Span | Yes | For LLM spans that include a tool result and the agent’s subsequent output. |
Input Requirements
The Tool Response Handling evaluator requires four inputs:

| Field | Type | Description |
|---|---|---|
| input | string | The user query or conversation context |
| tool_call | string | The tool invocation(s) made by the agent, including arguments |
| tool_result | string | The tool’s response (data, errors, or partial results) |
| output | string | The agent’s handling after receiving the tool result (may include retries, follow-ups, or final response) |
In TypeScript, the fields use camelCase: toolCall and toolResult.
Formatting Tips
While you can pass full JSON representations for each field, human-readable formats typically produce more accurate evaluations. Each field can be written as plain text:
- input: the user query or conversation context
- tool_call: the tool invocation with arguments
- tool_result: the tool’s response
- output: the agent’s response after receiving the tool result
When preparing these fields:
- Include the full output sequence — If the agent retried or made follow-up calls after an error, include the entire handling sequence, not just the final message
- Multi-tool calls are supported — If the agent called multiple tools, include all tool calls and results; the evaluator checks that the agent handled all results correctly
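The formatting tips above can be sketched as a small helper that turns structured records into the four human-readable strings. This is a minimal illustration, not part of the evaluator's API: `format_tool_call`, `build_eval_input`, the `get_weather` tool, and the sample values are all hypothetical.

```python
import json
from typing import Any


def format_tool_call(name: str, arguments: dict[str, Any]) -> str:
    """Render one tool call as name(key=value, ...). Hypothetical helper."""
    args = ", ".join(f"{key}={json.dumps(value)}" for key, value in arguments.items())
    return f"{name}({args})"


def build_eval_input(
    user_query: str,
    tool_calls: list[tuple[str, dict[str, Any]]],
    tool_results: list[str],
    agent_messages: list[str],
) -> dict[str, str]:
    """Assemble the four string fields listed under Input Requirements."""
    return {
        "input": f"User: {user_query}",
        # Include every call and every result when the agent used several tools.
        "tool_call": "\n".join(format_tool_call(n, a) for n, a in tool_calls),
        "tool_result": "\n".join(tool_results),
        # Keep the whole handling sequence (retries, follow-ups), not only the last message.
        "output": "\n".join(agent_messages),
    }


# Hypothetical sample data for illustration only.
example = build_eval_input(
    user_query="What's the weather in Seattle?",
    tool_calls=[("get_weather", {"location": "Seattle", "units": "fahrenheit"})],
    tool_results=['{"temperature": 58, "condition": "partly cloudy"}'],
    agent_messages=["It's currently 58°F and partly cloudy in Seattle."],
)
```

Joining multiple calls, results, and agent messages with newlines keeps the full sequence visible to the evaluator while staying readable.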
Output Interpretation
The evaluator returns a Score object with the following properties:

| Property | Value | Description |
|---|---|---|
| label | "correct" or "incorrect" | Classification result |
| score | 1.0 or 0.0 | Numeric score (1.0 = correct, 0.0 = incorrect) |
| explanation | string | LLM-generated reasoning for the classification |
| direction | "maximize" | Higher scores are better |
| metadata | object | Additional information such as the model name. When tracing is enabled, includes the trace_id for the evaluation. |
Correct handling (label "correct", score 1.0):
- Data is extracted accurately from the tool result with no hallucinated details
- Dates, numbers, and structured fields are properly transformed and formatted
- Transient errors (rate limits, timeouts) are retried; invalid argument errors are corrected
- No sensitive information (credentials, internal URLs, PII) is disclosed
- The agent’s response actually uses the tool result rather than ignoring it
Incorrect handling (label "incorrect", score 0.0):
- The output includes information not present in the tool result (hallucination)
- The meaning of the tool result is misrepresented or reversed
- Dates, numbers, or structured data are incorrectly converted
- The agent failed to retry retryable errors or correct fixable argument errors
- The agent made repeated identical calls that continued to fail
- Sensitive information from the tool result was leaked to the user
- The agent’s response ignored the tool result entirely
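Once a batch of scores has been collected, the label, score, and explanation properties can be aggregated for review. The sketch below uses a hypothetical `ScoreLike` dataclass that only mirrors the documented properties; it is a stand-in rather than the library's Score class, and the sample explanations are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScoreLike:
    """Hypothetical stand-in carrying the documented Score properties."""
    label: str
    score: float
    explanation: str
    direction: str = "maximize"
    metadata: dict[str, Any] = field(default_factory=dict)


def summarize(scores: list[ScoreLike]) -> dict[str, Any]:
    """Return the mean score and the explanations of every incorrect result."""
    if not scores:
        return {"accuracy": None, "failures": []}
    failures = [s.explanation for s in scores if s.label == "incorrect"]
    # Scores are 1.0 or 0.0, so the mean is the fraction labeled correct.
    accuracy = sum(s.score for s in scores) / len(scores)
    return {"accuracy": accuracy, "failures": failures}


# Invented sample results for illustration only.
results = [
    ScoreLike("correct", 1.0, "Temperature and condition match the tool result."),
    ScoreLike("incorrect", 0.0, "The response cites a humidity value the tool never returned."),
    ScoreLike("correct", 1.0, "The agent retried after a timeout and used the second result."),
    ScoreLike("incorrect", 0.0, "The agent repeated the same failing call three times."),
]
summary = summarize(results)
```

Reading the explanations of the incorrect results is usually the quickest way to see which failure mode (hallucination, bad conversion, missed retry, disclosure) dominates.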
Usage Examples
Using Input Mapping
When your data has different field names, use input mapping.
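Conceptually, an input mapping tells the evaluator where each required field lives in your data. The stand-alone sketch below illustrates the idea only; `apply_input_mapping`, the `Mapping` alias, and the record keys are hypothetical, and the library's own mapping mechanism may differ in shape and behavior.

```python
import json
from typing import Any, Callable, Union

# A source is either a key in the record or a function that derives the value.
Mapping = dict[str, Union[str, Callable[[dict[str, Any]], Any]]]


def apply_input_mapping(record: dict[str, Any], mapping: Mapping) -> dict[str, str]:
    """Build evaluator inputs from a record whose keys differ from the expected names."""
    mapped: dict[str, str] = {}
    for field_name, source in mapping.items():
        # A string source is a key lookup; a callable computes the value from the record.
        value = source(record) if callable(source) else record[source]
        mapped[field_name] = str(value)
    return mapped


# Hypothetical record with non-standard field names.
record = {
    "question": "When is my flight?",
    "call": 'lookup_booking(confirmation="ABC123")',
    "response": {"departure": "2025-03-14T09:30:00Z"},
    "answer": "Your flight departs on March 14, 2025 at 9:30 AM UTC.",
}
mapping: Mapping = {
    "input": "question",
    "tool_call": "call",
    "tool_result": lambda r: json.dumps(r["response"]),
    "output": "answer",
}
eval_input = apply_input_mapping(record, mapping)
```

Allowing a callable as a source covers the common case where the tool result is stored as a nested object and has to be serialized to a string first.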
Configuration
For LLM client configuration options, see Configuring the LLM.
Viewing and Modifying the Prompt
You can view the latest versions of our prompt templates on GitHub. The evaluators are designed to work well in a variety of contexts, but we highly recommend modifying the prompt to be more specific to your use case. Feel free to adapt them.
Using with Phoenix
Evaluating Traces
Run evaluations on traces collected in Phoenix and log results as annotations.
Running Experiments
Use the Tool Response Handling evaluator in Phoenix experiments.
API Reference
- Python: ToolResponseHandlingEvaluator
- TypeScript: createToolResponseHandlingEvaluator
Related
- Tool Selection Evaluator - For evaluating whether the right tool was chosen
- Tool Invocation Evaluator - For evaluating whether tool arguments are correct
- Correctness Evaluator - For evaluating factual accuracy of LLM responses

