Our support agent is running, and traces are flowing into Phoenix. We can see every LLM call, tool execution, and retrieval. Yet users are still complaining: some responses are helpful, others are completely wrong. We need a way to measure quality, not just observe activity.

In this chapter, you’ll learn to:
Annotate traces with human feedback. This lets you label your traces and pinpoint where the agent needs to improve.
Capture user reactions from your application. When users complain, attach that feedback to your traces and use it to improve.
Run automated LLM-as-Judge evaluations to find patterns in what’s failing. Scale your analysis to thousands of traces using an LLM, so you can make confident, data-driven decisions about what to improve.
Follow along with code

Throughout the tutorial we include key code snippets, but to see the full implementation, check out the companion projects below.
TypeScript Tutorial
Companion TypeScript project with runnable examples
Before automating anything, we need to know what “good” actually looks like. Is a one-sentence answer better than a detailed paragraph? Should the agent apologize when it can’t help? These depend on our users, our brand, and our use case.

Human annotation is how we build that understanding. By manually reviewing traces and marking them as good, bad, or somewhere in between, we create ground truth - the gold standard that everything else gets measured against. We’ll also start noticing patterns: maybe the agent struggles with multi-part questions, or gets confused when users reference previous messages.
Navigate to Settings → Annotations in Phoenix to create annotation types. We’ll create a simple config for labeling our support agent’s helpfulness.

Here’s a breakdown of the different annotation configurations.
Open a trace → click Annotate → fill out the form.

Once we’ve annotated traces, we can filter by annotation values, export to datasets, and compare across annotators. Even 50 well-annotated traces teach you more about failure modes than weeks of guessing.
Manual annotation gives you ground truth, but it doesn’t scale. We can review maybe 50 traces a day, but your agent is handling thousands of conversations.

Sometimes, our users are already telling us what’s working. Every thumbs up, thumbs down, “this wasn’t helpful” click, or escalation to a human agent is feedback. Let’s store that feedback in Phoenix, so that we can attach it to our traces!

Let’s simulate a thumbs up/thumbs down feature, and then store the results as annotations on our traces in Phoenix. This will give us metrics on how satisfied our users are.
To attach feedback to a trace, you need the span ID. Here’s how to capture it:
TypeScript
Python
```typescript
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("support-agent");

async function handleSupportQuery(userQuery: string) {
  return tracer.startActiveSpan("support-agent", async (span) => {
    // Capture the span ID for later feedback
    const spanId = span.spanContext().spanId;

    // ... process query ...

    return {
      response: "Your order has shipped!",
      spanId, // Return this to your frontend
    };
  });
}
```
```python
from opentelemetry import trace
from opentelemetry.trace import format_span_id

def handle_support_query(user_query: str):
    tracer = trace.get_tracer("support-agent")
    with tracer.start_as_current_span("support-agent") as agent_span:
        # Capture the span ID for later feedback
        span_id = format_span_id(agent_span.get_span_context().span_id)

        # ... process query ...

        return {
            "query": user_query,
            "response": response,
            "spanId": span_id,
        }
```
In a web application, you’d return the spanId to your frontend along with the response, then send it back when the user clicks thumbs up/down.
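To make that round trip concrete, here is a minimal, framework-free sketch. Everything in it — the in-memory store and the `serve_response` / `record_feedback` names — is hypothetical glue code, not part of the Phoenix client:

```python
# Hypothetical sketch of the feedback round trip: the backend returns the
# spanId alongside each response, and the frontend echoes it back when the
# user clicks thumbs up/down. In a real app the store would be a database.

responses_by_id: dict[str, str] = {}  # response ID -> span ID

def serve_response(response_id: str, answer: str, span_id: str) -> dict:
    """What the backend would send to the frontend."""
    responses_by_id[response_id] = span_id
    return {"responseId": response_id, "response": answer, "spanId": span_id}

def record_feedback(response_id: str, thumbs_up: bool) -> dict:
    """What a feedback endpoint would build before calling Phoenix."""
    span_id = responses_by_id[response_id]
    return {
        "span_id": span_id,
        "name": "user_feedback",
        "label": "thumbs-up" if thumbs_up else "thumbs-down",
        "score": 1.0 if thumbs_up else 0.0,
    }

payload = serve_response("resp-1", "Your order has shipped!", "abc123")
feedback = record_feedback("resp-1", thumbs_up=True)
# feedback["span_id"] is "abc123", ready to pass to log_span_annotations
```

The only contract that matters is that the span ID captured at trace time survives the trip to the frontend and back.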
```typescript
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

// When user clicks thumbs up
await logSpanAnnotations({
  spanAnnotations: [
    {
      spanId: "abc123...", // The span ID from your response
      name: "user_feedback",
      label: "thumbs-up",
      score: 1,
      annotatorKind: "HUMAN",
      metadata: {
        source: "web_app",
        userId: "user_456",
      },
    },
  ],
  sync: true,
});
```
```python
from phoenix.client import Client
from phoenix.client.resources.spans import SpanAnnotationData

phoenix_client = Client()

# When user clicks thumbs up, store the annotation
annotations = []
if answer in ["y", "1", "yes"]:
    annotations.append(
        SpanAnnotationData(
            name="user_feedback",
            span_id=resp["spanId"],
            annotator_kind="HUMAN",
            result={"label": "thumbs-up", "score": 1.0},
            metadata={"category": resp["category"], "source": "interactive_tutorial"},
        )
    )

phoenix_client.spans.log_span_annotations(
    span_annotations=annotations,
    sync=False,
)
```
Run the support agent, which lets you give feedback on the traces and pushes annotations to Phoenix. After the agent generates responses, you’ll be prompted to rate each one:
Enter y for thumbs-up (good response)
Enter n for thumbs-down (bad response)
Enter s to skip
Your feedback is sent to Phoenix as annotations. Check the Annotations tab on each trace to see your ratings.
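Once feedback annotations accumulate, a basic satisfaction metric is just the mean score over `user_feedback` annotations. A minimal sketch — the plain dicts below mimic the shape we logged, not the Phoenix export schema:

```python
# Compute a satisfaction rate from collected feedback annotations.
# Annotation dicts here are illustrative stand-ins for exported data.

def satisfaction_rate(annotations: list[dict]) -> float:
    scores = [a["score"] for a in annotations if a["name"] == "user_feedback"]
    return sum(scores) / len(scores) if scores else 0.0

sample = [
    {"name": "user_feedback", "score": 1.0},  # thumbs-up
    {"name": "user_feedback", "score": 0.0},  # thumbs-down
    {"name": "user_feedback", "score": 1.0},  # thumbs-up
    {"name": "other", "score": 0.5},          # unrelated annotation, ignored
]
print(satisfaction_rate(sample))  # ≈ 0.67 (2 of 3 thumbs were up)
```

Tracking this number over time tells you whether changes to the agent are actually moving user satisfaction.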
We’ve collected user feedback and identified which responses were unhelpful. Now we need to understand why they failed. Was the tool call returning errors? Was the retrieval pulling irrelevant context?

Instead of manually clicking through each unhelpful trace, you can automate this analysis. We’ll create two evaluators: one for our lookupOrderStatus tool, and another for FAQ retrieval relevance. These evaluators annotate the child spans, so when you click into an unhelpful trace, you can immediately see what went wrong.
Was the retrieved context actually relevant to the question?
TypeScript
Python
```typescript
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

// Filter for the LLM calls that use retrieved context
const llmSpans = spans.filter(
  (span) =>
    span.name === "ai.generateText" &&
    String(span.attributes["gen_ai.system"] || "").includes(
      "Answer the user's question using ONLY the information provided in the context below. Be friendly and concise."
    )
);

// Create an LLM-as-Judge evaluator that determines if retrieved context was relevant
const retrievalRelevanceEvaluator = createClassificationEvaluator({
  name: "retrieval_relevance",
  model: openai("gpt-4o-mini"),
  choices: {
    relevant: 1,
    irrelevant: 0,
  },
  promptTemplate: `You are evaluating whether the retrieved context is relevant to answering the user's prompt.

Classify the retrieval as:
- RELEVANT: The context contains information that directly helps answer the question
- IRRELEVANT: The context does NOT contain useful information for the question

You are comparing the "Context" object and the "prompt" object.

[Context and Prompt]: {{input}}`,
});

// Evaluate each RAG span
for (const span of llmSpans) {
  const spanId = span.context.span_id;

  // Extract the system prompt (which contains the retrieved context)
  const input = (span.attributes["input.value"] as string) || "";

  const result = await retrievalRelevanceEvaluator.evaluate({ input });

  const status = result.label === "relevant" ? "✅ RELEVANT" : "❌ IRRELEVANT";
  console.log(`  RAG span ${spanId.substring(0, 8)}... ${status}`);

  // Add annotation to be logged to Phoenix
  annotations.push({
    spanId,
    name: "retrieval_relevance",
    label: result.label,
    score: result.score,
    explanation: result.explanation,
    annotatorKind: "LLM",
    metadata: {
      model: "gpt-4o-mini",
      evaluator: "retrieval_relevance",
    },
  });
}
```
```python
import json

from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.client.resources.spans import SpanAnnotationData

# Filter for retrieval spans (RETRIEVER kind) - FAQ retrieval
retrieval_spans = [
    span
    for span in spans
    if span.get("span_kind") == "RETRIEVER" or span.get("name") == "faq-retrieval"
]

# Create an LLM-as-Judge evaluator that determines if retrieved context was relevant
llm = LLM(provider="openai", model="gpt-4o-mini")
retrieval_relevance_evaluator = ClassificationEvaluator(
    name="retrieval_relevance",
    prompt_template="""You are evaluating whether the retrieved context is relevant to answering the user's prompt.

Classify the retrieval as:
- RELEVANT: The context contains information that directly helps answer the question
- IRRELEVANT: The context does NOT contain useful information for the question

You are comparing the "Context" object and the "prompt" object.

[Context and Prompt]: {input}""",
    llm=llm,
    choices={"relevant": 1, "irrelevant": 0},
)

# Evaluate each retrieval span
rag_annotations = []
for span in retrieval_spans:
    # Access span_id from context
    context = span.get("context", {})
    span_id = context.get("span_id", "") if isinstance(context, dict) else ""

    # Access attributes (may be a dict or JSON string)
    attributes = span.get("attributes", {})
    if isinstance(attributes, str):
        attributes = json.loads(attributes)

    # Logic to extract the query and retrieved documents here

    # Build input for evaluator: query + retrieved context
    context_text = "\n\n".join(documents)
    evaluation_input = f"Query: {query}\n\nRetrieved Context:\n{context_text}"

    result = retrieval_relevance_evaluator.evaluate({"input": evaluation_input})
    score_result = result[0] if isinstance(result, list) else result

    rag_annotations.append(
        SpanAnnotationData(
            name="retrieval_relevance",
            span_id=span_id,
            annotator_kind="LLM",
            result={
                "label": score_result.label,
                "score": score_result.score
                if hasattr(score_result, "score")
                else (1.0 if score_result.label == "relevant" else 0.0),
            },
            metadata={
                "model": "gpt-4o-mini",
                "evaluator": "retrieval_relevance",
            },
        )
    )
```
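Before logging `rag_annotations` to Phoenix (via `phoenix_client.spans.log_span_annotations`, the same call used for the human feedback earlier), it can help to print a quick tally of the judge’s verdicts. A small sketch, with plain dicts standing in for the `SpanAnnotationData` objects built above:

```python
from collections import Counter

# Tally evaluator labels before logging; the dicts below are illustrative
# stand-ins for the annotation objects, not a Phoenix data structure.

def tally_labels(annotations: list[dict]) -> Counter:
    return Counter(a["result"]["label"] for a in annotations)

mock_annotations = [
    {"result": {"label": "relevant"}},
    {"result": {"label": "irrelevant"}},
    {"result": {"label": "relevant"}},
]
print(tally_labels(mock_annotations))  # Counter({'relevant': 2, 'irrelevant': 1})
```

A high irrelevant count is an early signal that the knowledge base, not the LLM, is the weak link.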
Run the agent (pnpm start) and provide feedback (thumbs up/down)
Run evaluations (pnpm evaluate) to annotate child spans
Click into unhelpful traces in Phoenix
Check the child span annotations:
tool_result = error → The order wasn’t found
retrieval_relevance = irrelevant → The FAQ wasn’t in the knowledge base
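That diagnosis step can itself be sketched in code: join a parent span’s user feedback with its children’s evaluator labels. The trace structure and field names below are illustrative, not Phoenix’s export format:

```python
# Diagnose why a thumbs-down trace failed by inspecting child-span annotations.
# The nested-dict trace shape here is a hypothetical simplification.

def diagnose(trace: dict) -> list[str]:
    """Return likely failure reasons for a thumbs-down trace."""
    reasons = []
    if trace["user_feedback"] == "thumbs-down":
        for child in trace["children"]:
            if child.get("tool_result") == "error":
                reasons.append("tool error: order lookup failed")
            if child.get("retrieval_relevance") == "irrelevant":
                reasons.append("irrelevant FAQ retrieval")
    return reasons

trace = {
    "user_feedback": "thumbs-down",
    "children": [
        {"tool_result": "error"},
        {"retrieval_relevance": "relevant"},
    ],
}
print(diagnose(trace))  # → ['tool error: order lookup failed']
```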
This tells you exactly why a trace failed, not just that it failed.

For this example, we see that the agent gives an unhelpful answer to the user regarding their order number. We can quickly check the tool span to see that the order number ORD-99999 simply isn’t in the order database! Automated evals make it fast to pinpoint the root cause behind our annotations, because they can dig through trace and span data much faster than humans can.
Your traces are now annotated with both human feedback and automated evaluations. You can identify which responses failed and diagnose why.

But there’s still a missing piece: real customer support isn’t just single queries, but full conversations between SupportBot and the customer. “What’s my order status?” followed by “When will it arrive?” followed by “Can I change the address?”

In the next chapter, you’ll learn to track multi-turn conversations as sessions, giving you visibility into the full customer journey, not just isolated queries.

Continue to Chapter 3: Sessions →