Learn how to define and run experiments to systematically evaluate your AI application using LLM as a Judge evaluators for subjective quality assessments.
Follow along with the Complete Python Notebook or the Complete TypeScript Tutorial.
In this section, you’ll run a repeatable experiment that uses an LLM-as-a-Judge to score agent outputs against specific, subjective criteria. These evaluations are well suited to cases where ground truth is unavailable or where quality expectations can be clearly defined in a prompt.
LLM as a Judge evaluators use an LLM to assess output quality. They are particularly useful when correctness is hard to encode with rules, such as when evaluating relevance, helpfulness, reasoning quality, or actionability. These evaluators apply criteria you define, making them suitable for datasets with or without reference outputs.
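To make the distinction concrete, a code-based check works well for objective properties but cannot judge subjective quality. The snippet below is purely illustrative and not part of this tutorial's code:

```python
# Illustrative only: a rule-based check can verify an objective property.
def mentions_refund_policy(response: str) -> bool:
    # Easy to express as a rule: does the response point to the refund policy?
    return "refund policy" in response.lower()

# "Is this response actually helpful and actionable?" has no simple rule,
# which is what the LLM as a Judge evaluator defined later in this section handles.
```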
LLM as a Judge Evaluator for Overall Agent Performance
This experiment evaluates the overall performance of the support agent using an LLM as a Judge evaluator. This allows us to assess subjective qualities like actionability and helpfulness that are difficult to measure with code-based evaluators.
The task function is what Phoenix calls for each example in your dataset. It receives the input from the dataset (in our case, the query field) and returns an output that will be evaluated. In this example, the task function extracts the query from the dataset input, runs the full support agent (which includes tool calls and reasoning), and returns the agent’s response:
```python
def my_support_agent_task(input):
    """
    Task function that will be run on each row of the dataset.
    """
    query = input.get("query")

    # Call the agent with the query
    response = support_agent.run(query)
    return response.content
```
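Note that support_agent is the agent assembled earlier in this tutorial and is not defined in the snippet above. If you are running this section in isolation, a minimal stand-in with the same interface (a run method returning an object with a content attribute) could look like the following; the class and canned reply are purely illustrative:

```python
# Hypothetical stand-in for the tutorial's support agent, just so the task
# function above is runnable on its own. The real agent performs reasoning
# and tool calls; this stub simply echoes a canned reply.
from dataclasses import dataclass

@dataclass
class AgentResponse:
    content: str

class StubSupportAgent:
    def run(self, query: str) -> AgentResponse:
        return AgentResponse(content=f"Here is how to resolve your issue: {query}")

support_agent = StubSupportAgent()
```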
We create an LLM as a Judge evaluator that assesses whether the agent’s response is actionable and helpful. The evaluator uses a prompt template that defines the criteria for a good response:
```python
# Define LLM Judge Evaluator checking for Actionable Responses
from phoenix.evals import ClassificationEvaluator, LLM
from phoenix.client.resources.experiments.types import EvaluationResult

# Define Prompt Template
support_response_actionability_judge = """
You are evaluating a customer support agent's response.

Determine whether the response is ACTIONABLE and helps resolve the user's issue.

Mark the response as CORRECT if it:
- Directly addresses the user's specific question
- Provides concrete steps, guidance, or information
- Clearly routes the user toward a solution

Mark the response as INCORRECT if it:
- Is generic, vague, or non-specific
- Avoids answering the question
- Provides no clear next steps
- Deflects with phrases like "contact support" without guidance

User Query:
{input.query}

Agent Response:
{output}

Return only one label: "correct" or "incorrect".
"""

# Create Evaluator
actionability_judge = ClassificationEvaluator(
    name="actionability-judge",
    prompt_template=support_response_actionability_judge,
    llm=LLM(model="gpt-5", provider="openai"),
    choices={"correct": 1.0, "incorrect": 0.0},
)

def call_actionability_judge(input, output):
    """
    Wrapper function for the actionability judge evaluator.
    This is needed because run_experiment expects a function, not an evaluator object.
    """
    results = actionability_judge.evaluate({
        "input": input,
        "output": output
    })
    result = results[0]
    return EvaluationResult(
        score=result.score,
        label=result.label,
        explanation=result.explanation
    )
```
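The dataset passed to run_experiment below is a Phoenix dataset of support queries created earlier in the tutorial. If you are starting from scratch, a minimal sketch of uploading one might look like the following; the dataset name and example queries are made up, and the exact dataset API can vary between Phoenix versions, so check the dataset documentation for your release:

```python
# Minimal sketch (assumptions noted above): upload a small dataset of support
# queries so each row's "query" column becomes the task function's input.
import pandas as pd
import phoenix as px

queries_df = pd.DataFrame(
    {
        "query": [
            "How do I reset my password?",              # hypothetical examples
            "I was charged twice for my subscription.",
        ]
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="support-queries",  # hypothetical name
    dataframe=queries_df,
    input_keys=["query"],
)
```

With the dataset, task function, and evaluator in place, run the experiment: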
```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_support_agent_task,
    evaluators=[call_actionability_judge],
    experiment_name="support agent",
    experiment_description="Initial support agent evaluation using actionability judge to measure how actionable and helpful the agent's responses are",
)
```
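You can also spot-check the task and evaluator wiring on a single example without running a full experiment; the sample query below is made up:

```python
# Hypothetical single-example smoke test of the task and judge wiring.
# This calls the real agent and the judge LLM, so it needs the same model
# access as the experiment itself.
sample_input = {"query": "I was charged twice for my subscription this month."}
sample_output = my_support_agent_task(sample_input)

print(sample_output)
print(call_actionability_judge(input=sample_input, output=sample_output))
```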
In the Phoenix UI, you can click into the experiment to inspect the results:
- Complete agent traces let you drill into any run to see the exact inputs, agent reasoning, tool calls, and response. This is useful for understanding agent behavior and debugging when an example scores poorly.
- Scores and labels per example show which inputs the LLM Judge rated highly or poorly, so you can spot patterns and prioritize where to improve.
- Evaluator explanations tell you why the judge gave each score so you can fix specific failure modes.
- Aggregate metrics across the run let you compare experiments over time and track whether quality is improving.