Our support agent is running, and traces are flowing into Phoenix. We can see every LLM call, tool execution, and retrieval. Yet users are still complaining: some responses are helpful, others are completely wrong. We need a way to measure quality, not just observe activity.

In this chapter, you’ll learn to:
Annotate traces with human feedback. This lets you label your traces and pinpoint where the agent needs to improve.
Capture user reactions from your application. When users complain, attach that feedback to your traces and use it to improve.
Run automated LLM-as-Judge evaluations to find patterns in what’s failing. Scale your analysis to thousands of traces using an LLM, so you can make confident, data-driven decisions about what to improve.
Follow along with code

Throughout the tutorial we include key code snippets, but to see the full implementation, check out the companion projects below.
TypeScript Tutorial
Companion TypeScript project with runnable examples
Before automating anything, we need to know what “good” actually looks like. Is a one-sentence answer better than a detailed paragraph? Should the agent apologize when it can’t help? These depend on our users, our brand, and our use case.

Human annotation is how we build that understanding. By manually reviewing traces and marking them as good, bad, or somewhere in between, we create ground truth - the gold standard that everything else gets measured against. We’ll also start noticing patterns: maybe the agent struggles with multi-part questions, or gets confused when users reference previous messages.
Navigate to Settings → Annotations in Phoenix to create annotation types. We’ll create a simple config for labeling our support agent’s helpfulness.

Here’s a breakdown of the different annotation configurations.
Open a trace → click Annotate → fill out the form.

Once we’ve annotated traces, we can filter by annotation values, export to datasets, and compare across annotators. Even 50 well-annotated traces teach you more about failure modes than weeks of guessing.
Manual annotation gives you ground truth, but it doesn’t scale. We can review maybe 50 traces a day, but your agent is handling thousands of conversations.

Sometimes, our users are already telling us what’s working. Every thumbs up, thumbs down, “this wasn’t helpful” click, or escalation to a human agent is feedback. Let’s store that feedback in Phoenix, so that we can attach it to our traces!

Let’s simulate a thumbs up/thumbs down feature, and then store the results as annotations on our traces in Phoenix. This will give us metrics on how satisfied our users are.
To attach feedback to a trace, you need the span ID. Here’s how to capture it:
TypeScript
Python
```typescript
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("support-agent");

async function handleSupportQuery(userQuery: string) {
  return tracer.startActiveSpan("support-agent", async (span) => {
    // Capture the span ID for later feedback
    const spanId = span.spanContext().spanId;

    // ... process query ...

    return {
      response: "Your order has shipped!",
      spanId, // Return this to your frontend
    };
  });
}
```
```python
from opentelemetry import trace
from opentelemetry.trace import format_span_id

def handle_support_query(user_query: str):
    tracer = trace.get_tracer("support-agent")
    with tracer.start_as_current_span("support-agent") as agent_span:
        # Capture the span ID for later feedback
        span_id = format_span_id(agent_span.get_span_context().span_id)

        # ... process query ...

        return {
            "query": user_query,
            "response": response,
            "spanId": span_id,
        }
```
In a web application, you’d return the spanId to your frontend along with the response, then send it back when the user clicks thumbs up/down.
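To make that round trip concrete, here is a minimal, framework-free sketch. Everything in it — the in-memory store and the `serve_response` / `record_feedback` names — is hypothetical glue code, not part of the Phoenix client:

```python
# Hypothetical sketch of the feedback round trip: the backend returns the
# spanId alongside each response, and the frontend echoes it back when the
# user clicks thumbs up/down. In a real app the store would be a database.

responses_by_id: dict[str, str] = {}  # response ID -> span ID

def serve_response(response_id: str, answer: str, span_id: str) -> dict:
    """What the backend would send to the frontend."""
    responses_by_id[response_id] = span_id
    return {"responseId": response_id, "response": answer, "spanId": span_id}

def record_feedback(response_id: str, thumbs_up: bool) -> dict:
    """What a feedback endpoint would build before calling Phoenix."""
    span_id = responses_by_id[response_id]
    return {
        "span_id": span_id,
        "name": "user_feedback",
        "label": "thumbs-up" if thumbs_up else "thumbs-down",
        "score": 1.0 if thumbs_up else 0.0,
    }

payload = serve_response("resp-1", "Your order has shipped!", "abc123")
feedback = record_feedback("resp-1", thumbs_up=True)
# feedback["span_id"] is "abc123", ready to pass to log_span_annotations
```

The only contract that matters is that the span ID captured at trace time survives the trip to the frontend and back.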
```typescript
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

// When user clicks thumbs up
await logSpanAnnotations({
  spanAnnotations: [
    {
      spanId: "abc123...", // The span ID from your response
      name: "user_feedback",
      label: "thumbs-up",
      score: 1,
      annotatorKind: "HUMAN",
      metadata: {
        source: "web_app",
        userId: "user_456",
      },
    },
  ],
  sync: true,
});
```
```python
from phoenix.client import Client
from phoenix.client.resources.spans import SpanAnnotationData

phoenix_client = Client()

# When user clicks thumbs up, store the annotation
annotations = []
if answer in ["y", "1", "yes"]:
    annotations.append(
        SpanAnnotationData(
            name="user_feedback",
            span_id=resp["spanId"],
            annotator_kind="HUMAN",
            result={"label": "thumbs-up", "score": 1.0},
            metadata={"category": resp["category"], "source": "interactive_tutorial"},
        )
    )

phoenix_client.spans.log_span_annotations(
    span_annotations=annotations,
    sync=False,
)
```
Run the support agent, which lets you give feedback on the traces and pushes annotations to Phoenix. After the agent generates responses, you’ll be prompted to rate each one:
Enter y for thumbs-up (good response)
Enter n for thumbs-down (bad response)
Enter s to skip
Your feedback is sent to Phoenix as annotations. Check the Annotations tab on each trace to see your ratings.
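Once feedback annotations accumulate, a basic satisfaction metric is just the mean score over `user_feedback` annotations. A minimal sketch — the plain dicts below mimic the shape we logged, not the Phoenix export schema:

```python
# Compute a satisfaction rate from collected feedback annotations.
# Annotation dicts here are illustrative stand-ins for exported data.

def satisfaction_rate(annotations: list[dict]) -> float:
    scores = [a["score"] for a in annotations if a["name"] == "user_feedback"]
    return sum(scores) / len(scores) if scores else 0.0

sample = [
    {"name": "user_feedback", "score": 1.0},  # thumbs-up
    {"name": "user_feedback", "score": 0.0},  # thumbs-down
    {"name": "user_feedback", "score": 1.0},  # thumbs-up
    {"name": "other", "score": 0.5},          # unrelated annotation, ignored
]
print(satisfaction_rate(sample))  # ≈ 0.67 (2 of 3 thumbs were up)
```

Tracking this number over time tells you whether changes to the agent are actually moving user satisfaction.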
We’ve collected user feedback and identified which responses were unhelpful. Now we need to understand why they failed. Was the tool call returning errors? Was the retrieval pulling irrelevant context?

Instead of manually clicking through each unhelpful trace, you can automate this analysis. We’ll create two evaluators: one for our lookupOrderStatus tool, and another for FAQ retrieval relevance. These evaluators annotate the child spans, so when you click into an unhelpful trace, you can immediately see what went wrong.
Was the retrieved context actually relevant to the question?
TypeScript
Python
```typescript
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

// Filter for the LLM calls that use retrieved context
const llmSpans = spans.filter(
  (span) =>
    span.name === "ai.generateText" &&
    String(span.attributes["gen_ai.system"] || "").includes(
      "Answer the user's question using ONLY the information provided in the context below. Be friendly and concise."
    )
);

// Create an LLM-as-Judge evaluator that determines if retrieved context was relevant
const retrievalRelevanceEvaluator = createClassificationEvaluator({
  name: "retrieval_relevance",
  model: openai("gpt-4o-mini"),
  choices: {
    relevant: 1,
    irrelevant: 0,
  },
  promptTemplate: `You are evaluating whether the retrieved context is relevant to answering the user's prompt.

Classify the retrieval as:
- RELEVANT: The context contains information that directly helps answer the question
- IRRELEVANT: The context does NOT contain useful information for the question

You are comparing the "Context" object and the "prompt" object.

[Context and Prompt]: {{input}}`,
});

// Evaluate each RAG span
for (const span of llmSpans) {
  const spanId = span.context.span_id;

  // Extract the system prompt (which contains the retrieved context)
  const input = (span.attributes["input.value"] as string) || "";

  const result = await retrievalRelevanceEvaluator.evaluate({ input });

  const status = result.label === "relevant" ? "✅ RELEVANT" : "❌ IRRELEVANT";
  console.log(`  RAG span ${spanId.substring(0, 8)}... ${status}`);

  // Add annotation to be logged to Phoenix
  annotations.push({
    spanId,
    name: "retrieval_relevance",
    label: result.label,
    score: result.score,
    explanation: result.explanation,
    annotatorKind: "LLM",
    metadata: {
      model: "gpt-4o-mini",
      evaluator: "retrieval_relevance",
    },
  });
}
```
```python
import json

from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.client.resources.spans import SpanAnnotationData

# Filter for retrieval spans (RETRIEVER kind) - FAQ retrieval
retrieval_spans = [
    span
    for span in spans
    if span.get("span_kind") == "RETRIEVER" or span.get("name") == "faq-retrieval"
]

# Create an LLM-as-Judge evaluator that determines if retrieved context was relevant
llm = LLM(provider="openai", model="gpt-4o-mini")
retrieval_relevance_evaluator = ClassificationEvaluator(
    name="retrieval_relevance",
    prompt_template="""You are evaluating whether the retrieved context is relevant to answering the user's prompt.

Classify the retrieval as:
- RELEVANT: The context contains information that directly helps answer the question
- IRRELEVANT: The context does NOT contain useful information for the question

You are comparing the "Context" object and the "prompt" object.

[Context and Prompt]: {input}""",
    llm=llm,
    choices={"relevant": 1, "irrelevant": 0},
)

# Evaluate each retrieval span
rag_annotations = []
for span in retrieval_spans:
    # Access span_id from context
    context = span.get("context", {})
    span_id = context.get("span_id", "") if isinstance(context, dict) else ""

    # Access attributes (may be a dict or JSON string)
    attributes = span.get("attributes", {})
    if isinstance(attributes, str):
        attributes = json.loads(attributes)

    # Logic to extract the query and retrieved documents here

    # Build input for evaluator: query + retrieved context
    context_text = "\n\n".join(documents)
    evaluation_input = f"Query: {query}\n\nRetrieved Context:\n{context_text}"

    result = retrieval_relevance_evaluator.evaluate({"input": evaluation_input})
    score_result = result[0] if isinstance(result, list) else result

    rag_annotations.append(
        SpanAnnotationData(
            name="retrieval_relevance",
            span_id=span_id,
            annotator_kind="LLM",
            result={
                "label": score_result.label,
                "score": score_result.score
                if hasattr(score_result, "score")
                else (1.0 if score_result.label == "relevant" else 0.0),
            },
            metadata={
                "model": "gpt-4o-mini",
                "evaluator": "retrieval_relevance",
            },
        )
    )
```
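Before logging `rag_annotations` to Phoenix (via `phoenix_client.spans.log_span_annotations`, the same call used for the human feedback earlier), it can help to print a quick tally of the judge’s verdicts. A small sketch, with plain dicts standing in for the `SpanAnnotationData` objects built above:

```python
from collections import Counter

# Tally evaluator labels before logging; the dicts below are illustrative
# stand-ins for the annotation objects, not a Phoenix data structure.

def tally_labels(annotations: list[dict]) -> Counter:
    return Counter(a["result"]["label"] for a in annotations)

mock_annotations = [
    {"result": {"label": "relevant"}},
    {"result": {"label": "irrelevant"}},
    {"result": {"label": "relevant"}},
]
print(tally_labels(mock_annotations))  # Counter({'relevant': 2, 'irrelevant': 1})
```

A high irrelevant count is an early signal that the knowledge base, not the LLM, is the weak link.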
Run the agent (pnpm start) and provide feedback (thumbs up/down)
Run evaluations (pnpm evaluate) to annotate child spans
Click into unhelpful traces in Phoenix
Check the child span annotations:
tool_result = error → The order wasn’t found
retrieval_relevance = irrelevant → The FAQ wasn’t in the knowledge base
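That diagnosis step can itself be sketched in code: join a parent span’s user feedback with its children’s evaluator labels. The trace structure and field names below are illustrative, not Phoenix’s export format:

```python
# Diagnose why a thumbs-down trace failed by inspecting child-span annotations.
# The nested-dict trace shape here is a hypothetical simplification.

def diagnose(trace: dict) -> list[str]:
    """Return likely failure reasons for a thumbs-down trace."""
    reasons = []
    if trace["user_feedback"] == "thumbs-down":
        for child in trace["children"]:
            if child.get("tool_result") == "error":
                reasons.append("tool error: order lookup failed")
            if child.get("retrieval_relevance") == "irrelevant":
                reasons.append("irrelevant FAQ retrieval")
    return reasons

trace = {
    "user_feedback": "thumbs-down",
    "children": [
        {"tool_result": "error"},
        {"retrieval_relevance": "relevant"},
    ],
}
print(diagnose(trace))  # → ['tool error: order lookup failed']
```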
This tells you exactly why a trace failed, not just that it failed.

For this example, we see that the agent gives an unhelpful answer to the user regarding their order number. We can quickly check the tool span to see that the order number ORD-99999 simply isn’t in the order database! Automated evals make it fast to pinpoint the root cause behind our annotations, because they can dig through trace and span data much faster than humans can.
Your traces are now annotated with both human feedback and automated evaluations. You can identify which responses failed and diagnose why.

But there’s still a missing piece: real customer support isn’t just single queries, but full conversations between SupportBot and the customer. “What’s my order status?” followed by “When will it arrive?” followed by “Can I change the address?”

In the next chapter, you’ll learn to track multi-turn conversations as sessions, giving you visibility into the full customer journey, not just isolated queries.

Continue to Chapter 3: Sessions →