Learn how to define and run experiments to systematically evaluate your AI application using LLM as a Judge evaluators for subjective quality assessments.
Follow along with the Complete Python Notebook or the Complete TypeScript Tutorial.
In this section, you’ll run a repeatable experiment that uses an LLM-as-a-Judge to score agent outputs against specific, subjective criteria. These evaluations are well suited to cases where ground truth is unavailable or where quality expectations can be clearly defined in a prompt.
LLM as a Judge evaluators use an LLM to assess output quality. They are particularly useful when correctness is hard to encode with rules, such as when evaluating relevance, helpfulness, reasoning quality, or actionability. These evaluators apply criteria you define, making them suitable for datasets with or without reference outputs.
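To make the distinction concrete, a code-based check works well for objective properties but cannot judge subjective quality. The snippet below is purely illustrative and not part of this tutorial's code:

```python
# Illustrative only: a rule-based check can verify an objective property.
def mentions_refund_policy(response: str) -> bool:
    # Easy to express as a rule: does the response point to the refund policy?
    return "refund policy" in response.lower()

# "Is this response actually helpful and actionable?" has no simple rule,
# which is what the LLM as a Judge evaluator defined later in this section handles.
```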
LLM as a Judge Evaluator for Overall Agent Performance
This experiment evaluates the overall performance of the support agent using an LLM as a Judge evaluator. This allows us to assess subjective qualities like actionability and helpfulness that are difficult to measure with code-based evaluators.
The task function is what Phoenix calls for each example in your dataset. It receives the input from the dataset (in our case, the query field) and returns an output that will be evaluated. In this example, the task function extracts the query from the dataset input, runs the full support agent (which includes tool calls and reasoning), and returns the agent’s response:
```python
def my_support_agent_task(input):
    """
    Task function that will be run on each row of the dataset.
    """
    query = input.get("query")

    # Call the agent with the query
    response = support_agent.run(query)
    return response.content
```
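Note that support_agent is the agent assembled earlier in this tutorial and is not defined in the snippet above. If you are running this section in isolation, a minimal stand-in with the same interface (a run method returning an object with a content attribute) could look like the following; the class and canned reply are purely illustrative:

```python
# Hypothetical stand-in for the tutorial's support agent, just so the task
# function above is runnable on its own. The real agent performs reasoning
# and tool calls; this stub simply echoes a canned reply.
from dataclasses import dataclass

@dataclass
class AgentResponse:
    content: str

class StubSupportAgent:
    def run(self, query: str) -> AgentResponse:
        return AgentResponse(content=f"Here is how to resolve your issue: {query}")

support_agent = StubSupportAgent()
```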
We create an LLM as a Judge evaluator that assesses whether the agent’s response is actionable and helpful. The evaluator uses a prompt template that defines the criteria for a good response:
```python
# Define LLM Judge Evaluator checking for Actionable Responses
from phoenix.evals import ClassificationEvaluator, LLM
from phoenix.client.resources.experiments.types import EvaluationResult

# Define Prompt Template
support_response_actionability_judge = """
You are evaluating a customer support agent's response.

Determine whether the response is ACTIONABLE and helps resolve the user's issue.

Mark the response as CORRECT if it:
- Directly addresses the user's specific question
- Provides concrete steps, guidance, or information
- Clearly routes the user toward a solution

Mark the response as INCORRECT if it:
- Is generic, vague, or non-specific
- Avoids answering the question
- Provides no clear next steps
- Deflects with phrases like "contact support" without guidance

User Query:
{input.query}

Agent Response:
{output}

Return only one label: "correct" or "incorrect".
"""

# Create Evaluator
actionability_judge = ClassificationEvaluator(
    name="actionability-judge",
    prompt_template=support_response_actionability_judge,
    llm=LLM(model="gpt-5", provider="openai"),
    choices={"correct": 1.0, "incorrect": 0.0},
)

def call_actionability_judge(input, output):
    """
    Wrapper function for the actionability judge evaluator.
    This is needed because run_experiment expects a function, not an evaluator object.
    """
    results = actionability_judge.evaluate({
        "input": input,
        "output": output
    })
    result = results[0]
    return EvaluationResult(
        score=result.score,
        label=result.label,
        explanation=result.explanation
    )
```
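The dataset passed to run_experiment below is a Phoenix dataset of support queries created earlier in the tutorial. If you are starting from scratch, a minimal sketch of uploading one might look like the following; the dataset name and example queries are made up, and the exact dataset API can vary between Phoenix versions, so check the dataset documentation for your release:

```python
# Minimal sketch (assumptions noted above): upload a small dataset of support
# queries so each row's "query" column becomes the task function's input.
import pandas as pd
import phoenix as px

queries_df = pd.DataFrame(
    {
        "query": [
            "How do I reset my password?",              # hypothetical examples
            "I was charged twice for my subscription.",
        ]
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="support-queries",  # hypothetical name
    dataframe=queries_df,
    input_keys=["query"],
)
```

With the dataset, task function, and evaluator in place, run the experiment: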
```python
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_support_agent_task,
    evaluators=[call_actionability_judge],
    experiment_name="support agent",
    experiment_description="Initial support agent evaluation using actionability judge to measure how actionable and helpful the agent's responses are",
)
```
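You can also spot-check the task and evaluator wiring on a single example without running a full experiment; the sample query below is made up:

```python
# Hypothetical single-example smoke test of the task and judge wiring.
# This calls the real agent and the judge LLM, so it needs the same model
# access as the experiment itself.
sample_input = {"query": "I was charged twice for my subscription this month."}
sample_output = my_support_agent_task(sample_input)

print(sample_output)
print(call_actionability_judge(input=sample_input, output=sample_output))
```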
In the Phoenix UI, you can click into the experiment to inspect the results:
- Complete agent traces let you drill into any run to see the exact inputs, agent reasoning, tool calls, and response. This is useful for understanding agent behavior and debugging when an example scores poorly.
- Scores and labels per example show which inputs the LLM Judge rated highly or poorly, so you can spot patterns and prioritize where to improve.
- Evaluator explanations tell you why the judge gave each score so you can fix specific failure modes.
- Aggregate metrics across the run let you compare experiments over time and track whether quality is improving.