Evaluation

Evaluations measure the quality of your AI application’s outputs — whether responses are accurate, grounded, safe, or relevant to the user’s intent. Unlike traditional software, LLM outputs can’t be tested with simple assertions. Evaluations give you a systematic way to catch regressions, compare model or prompt changes, and build confidence before shipping. Phoenix supports both deterministic code-based evaluators (exact match, regex, custom heuristics) and LLM-as-a-judge evaluators, where a second model scores the output against a rubric. You can run evaluations on traces from production, on experiment results, or on any dataset. Phoenix provides two approaches:

Two Ways to Evaluate

Client-Side Evals (SDK)

Python and TypeScript SDKs for running evaluations against Phoenix traces, datasets, or any data source. Full control over evaluation logic, judge models, and pipelines.

Server-Side Evals (UI)

Configure evaluators in the Phoenix UI and attach them to your datasets. Phoenix scores experiment results automatically — no code required.

Features

Model Agnostic via adapters (for OpenAI, LiteLLM, LangChain, AI SDK, and more) — so you can easily switch judge models, or stick to your preferred provider.
Powerful input mapping system for working with complex data structures — easily map nested data and complex inputs to evaluator requirements.
Pre-built metrics for common evaluation tasks and use cases like RAG and tool-calling agents.
Evaluators are natively instrumented via OpenTelemetry tracing for observability and dataset curation.
Blazing fast performance — achieve up to 20x speedup with built-in concurrency and batching.
Built-in Explanations — all Phoenix LLM evaluations return explanations by default for better results and richer signals.

Structured Output via Tool Calling

LLM evaluators use function calling (tool use) to extract structured judgments rather than parsing freeform text. Phoenix generates a tool from the evaluator’s output config — for example:

{
  "name": "correctness",
  "parameters": {
    "type": "object",
    "properties": {
      "label": {
        "type": "string",
        "enum": ["correct", "incorrect"]
      },
      "explanation": {
        "type": "string"
      }
    },
    "required": ["label", "explanation"]
  }
}

The LLM is required to call this tool rather than respond with freeform text. Phoenix parses the tool call and maps the returned label to its numeric score. This applies to both the client-side SDK and server-side evaluators.

Executors

Under the hood, the Phoenix uses executors to run evaluations faster and more reliably. Phoenix automatically handles the infrastructure complexity:

Rate limit handling: automatically retries when LLM providers throttle requests
Error management: distinguishes between temporary failures and permanent errors so retries don’t waste API budget
Dynamic concurrency: adjusts parallelism based on provider performance to maximize throughput without triggering rate limits

This means you can run thousands of evaluations without writing any retry or concurrency logic yourself.

Evaluator Tracing

All evaluator runs are automatically traced via OpenTelemetry and sent to a dedicated Phoenix project. This gives you complete transparency into how your evaluators make decisions — essential for validating prompt engineering and achieving human alignment. Every evaluation execution captures the input data, the exact prompts sent to the judge LLM, the model’s full reasoning, the final scores, and execution timing. Use Phoenix’s trace viewer to explore evaluation traces, identify systematic biases, and continuously improve evaluator performance.

For continuous monitoring of application performance — evals on production traffic with alerting and threshold-based triggers — see Arize AX Online Evals.

Get Started

Tracing

Datasets & Experiments

Prompts

Settings

Concepts

Resources

Evaluation

Two Ways to Evaluate

Client-Side Evals (SDK)

Server-Side Evals (UI)

Features

Structured Output via Tool Calling

Executors

Evaluator Tracing

Get Started

Tracing

Evaluation

Datasets & Experiments

Prompts

Settings

Concepts

Resources

Documentation Index

​Two Ways to Evaluate

Client-Side Evals (SDK)

Server-Side Evals (UI)

​Features

​Structured Output via Tool Calling

​Executors

​Evaluator Tracing

Two Ways to Evaluate

Features

Structured Output via Tool Calling

Executors

Evaluator Tracing