Your support agent handles single queries well. Classification works. Tool calls execute. RAG retrieves relevant documents. But real customer support isn’t just single queries, it’s full conversations:

“What’s my order status?” → “When will it arrive?” → “Can I change the address?”

Each of these is a separate trace. Without sessions, they’re disconnected points in your data. You can’t see that the customer asked about the same order three times, or that the agent forgot the order ID between turns and asked for it again.

Sessions change that. By grouping traces with a shared session ID, you transform isolated data points into conversation threads. In Phoenix, you can see the full back-and-forth, track metrics across the conversation (total tokens, turns to resolution), and debug issues like “the bot forgot what I said.”

In this chapter, you’ll add session tracking to your support agent, run multi-turn conversations, and evaluate conversations as complete units - not just individual turns.

Follow along with code

Throughout the tutorial we will include key code snippets, but to see the full implementation, check out the companion projects below.
TypeScript Tutorial
Companion TypeScript project with runnable examples
Adding session tracking to your agent is surprisingly simple. You need two things:
A session ID: A unique identifier for each conversation (usually a UUID)
Context propagation: Making sure child spans inherit the session ID
The key insight is that session IDs are just span attributes. Set them on your parent span, and Phoenix automatically groups all related traces together.
You’ll need Phoenix OTEL to register tracing and set session context:
TypeScript
Python
npm install @arizeai/phoenix-otel
pip install "arize-phoenix-otel>=0.16.0"
arize-phoenix-otel 0.16.0+ is required to import using_session and SpanAttributes from phoenix.otel. On older versions, install openinference-instrumentation and openinference-semantic-conventions and import from those packages instead.
Now let’s see sessions in action. Here’s a conversation scenario that tests the agent’s ability to maintain context:
TypeScript
Python
const sessionId = crypto.randomUUID();
const conversationHistory: Message[] = [];
const sessionContext: SessionContext = { turnCount: 0 };

// Turn 1: Ask about an order
const turn1 = await handleSupportQuery(
  "What's the status of order ORD-12345?",
  sessionId,
  conversationHistory,
  sessionContext
);

// Update history
conversationHistory.push(
  { role: "user", content: "What's the status of order ORD-12345?" },
  { role: "assistant", content: turn1.response }
);
sessionContext.lastMentionedOrderId = "ORD-12345";
sessionContext.turnCount++;

// Turn 2: Follow-up question (no order ID)
const turn2 = await handleSupportQuery(
  "When will it arrive?",
  sessionId,
  conversationHistory,
  sessionContext
);
// The agent should remember ORD-12345 from the previous turn
import uuid

session_id = str(uuid.uuid4())
conversation_history: List[Message] = []
session_context: SessionContext = {"lastMentionedOrderId": None, "turnCount": 0}

# Turn 1: Ask about an order
turn1 = handle_support_query(
    "What's the status of order ORD-12345?",
    session_id,
    conversation_history,
    session_context,
)

# Update history
conversation_history.append({"role": "user", "content": "What's the status of order ORD-12345?"})
conversation_history.append({"role": "assistant", "content": turn1["response"]})
session_context["lastMentionedOrderId"] = "ORD-12345"
session_context["turnCount"] += 1

# Turn 2: Follow-up question (no order ID)
turn2 = handle_support_query(
    "When will it arrive?",
    session_id,
    conversation_history,
    session_context,
)
# The agent should remember ORD-12345 from the previous turn
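How the agent resolves “it” in turn 2 is up to your implementation. One common pattern is a small helper that prefers an order ID mentioned in the current query and otherwise falls back to the one stored in the session context. The `resolve_order_id` helper below is a hypothetical sketch, not part of the companion project:

```python
import re
from typing import Optional


def resolve_order_id(query: str, session_context: dict) -> Optional[str]:
    """Hypothetical helper: use an order ID from the query, else fall back to the session."""
    match = re.search(r"ORD-\d+", query)
    if match:
        # Remember the explicit mention for later turns
        session_context["lastMentionedOrderId"] = match.group(0)
        return match.group(0)
    return session_context.get("lastMentionedOrderId")


ctx = {"lastMentionedOrderId": None, "turnCount": 0}
resolve_order_id("What's the status of order ORD-12345?", ctx)
print(resolve_order_id("When will it arrive?", ctx))  # prints ORD-12345
```

Without the session-scoped fallback, the agent has no choice but to ask the customer to repeat the order ID - exactly the failure mode sessions make visible.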
TypeScript Tutorial
Run the sessions demo: pnpm sessions
Python Tutorial
This runs three conversation scenarios:
Order Inquiry: Customer asks about order, then asks follow-up questions
FAQ Conversation: Multiple FAQ questions in one session
Mixed Conversation: Switching between order and FAQ topics
You can now see full conversations in Phoenix, but manually reviewing every session doesn’t scale. With hundreds of conversations happening daily, you need automated insights.

This is where LLM-as-Judge evaluation shines. Instead of clicking through sessions one by one, you can automatically evaluate entire conversations and answer questions like:
Is memory being preserved? Does the agent remember order IDs, customer preferences, and context from earlier in the conversation?
Are issues getting resolved? Do conversations end with the customer’s problem solved, or do they trail off unresolved?
Where do conversations break down? Which sessions show signs of confusion, repetition, or context loss?
By running evaluators across all your sessions, you get aggregate metrics (“85% of conversations maintain coherence”) and can quickly filter to the problematic ones. The evaluator also generates explanations, so you understand why a session was marked as incoherent or unresolved.
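Computing the aggregate is straightforward once each session has a score. A sketch, assuming evaluation results shaped as dicts with a 0/1 score - the exact result type depends on your evals library:

```python
# Hypothetical per-session results from a coherence evaluator
results = [
    {"session_id": "s1", "label": "coherent", "score": 1},
    {"session_id": "s2", "label": "incoherent", "score": 0},
    {"session_id": "s3", "label": "coherent", "score": 1},
    {"session_id": "s4", "label": "coherent", "score": 1},
]

# Aggregate metric across all sessions
coherence_rate = sum(r["score"] for r in results) / len(results)
print(f"{coherence_rate:.0%} of conversations maintain coherence")  # 75% ...

# Filter straight to the sessions that need review
problem_sessions = [r["session_id"] for r in results if r["score"] == 0]
print(problem_sessions)  # ['s2']
```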
This evaluator checks if the agent maintained context throughout the conversation:
TypeScript
Python
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";

const conversationCoherenceEvaluator = createClassificationEvaluator({
  name: "conversation_coherence",
  model: openai("gpt-5"),
  choices: {
    coherent: 1,
    incoherent: 0,
  },
  // Explanations are automatically generated by the evaluator
  promptTemplate: `You are evaluating whether a customer support agent maintained context throughout a multi-turn conversation.

A conversation is COHERENT if:
- The agent remembers information from earlier turns
- The agent doesn't ask for information already provided
- Responses build on previous context appropriately

A conversation is INCOHERENT if:
- The agent "forgets" things the customer said earlier
- The agent asks for the same information multiple times
- Responses seem disconnected from previous turns

[Full Conversation]:
{{input}}

Did the agent maintain context throughout this conversation?`,
});
from phoenix.evals import LLM, ClassificationEvaluator

llm = LLM(provider="openai", model="gpt-5")

conversation_coherence_evaluator = ClassificationEvaluator(
    name="conversation_coherence",
    prompt_template="""You are evaluating whether a customer support agent maintained context throughout a multi-turn conversation.

A conversation is COHERENT if:
- The agent remembers information from earlier turns
- The agent doesn't ask for information already provided
- Responses build on previous context appropriately
- The conversation flows naturally

A conversation is INCOHERENT if:
- The agent "forgets" things the customer said earlier
- The agent asks for the same information multiple times
- Responses seem disconnected from previous turns
- The customer has to repeat themselves

[Full Conversation]:
{input}

Did the agent maintain context throughout this conversation?""",
    llm=llm,
    choices={"coherent": 1, "incoherent": 0},
)
This evaluator determines if the customer’s issue was actually resolved:
TypeScript
Python
const resolutionEvaluator = createClassificationEvaluator({
  name: "resolution_status",
  model: openai("gpt-5"),
  choices: {
    resolved: 1,
    unresolved: 0,
  },
  // Explanations are automatically generated by the evaluator
  promptTemplate: `You are evaluating whether a customer's issue was resolved in a support conversation.

The issue is RESOLVED if:
- The customer got the information they needed
- Their question was answered
- The conversation ended with the customer's needs met

The issue is UNRESOLVED if:
- The customer didn't get what they needed
- Questions went unanswered
- The agent couldn't help with the request

[Full Conversation]:
{{input}}

Was the customer's issue resolved?`,
});
from phoenix.evals import ClassificationEvaluator

resolution_evaluator = ClassificationEvaluator(
    name="resolution_status",
    prompt_template="""You are evaluating whether a customer's issue was resolved in a support conversation.

The issue is RESOLVED if:
- The customer got the information they needed
- Their question was answered
- The conversation ended with the customer's needs met

The issue is UNRESOLVED if:
- The customer didn't get what they needed
- Questions went unanswered
- The agent couldn't help with the request

[Full Conversation]:
{input}

Was the customer's issue resolved?""",
    llm=llm,
    choices={"resolved": 1, "unresolved": 0},
)
Here’s the full evaluation flow. First, fetch spans from Phoenix and group them by session ID:
TypeScript
Python
import { getSpans } from "@arizeai/phoenix-client/spans";
import { logSessionAnnotations } from "@arizeai/phoenix-client/sessions";

// Fetch all agent spans
const { spans } = await getSpans({
  project: { projectName: "support-bot" },
  limit: 200,
});

// Filter to agent spans and group by session ID
const agentSpans = spans.filter((span) => span.name === "support-agent");
const sessionGroups = new Map<string, typeof agentSpans>();

for (const span of agentSpans) {
  const sessionId = span.attributes["session.id"] as string;
  if (sessionId) {
    if (!sessionGroups.has(sessionId)) {
      sessionGroups.set(sessionId, []);
    }
    sessionGroups.get(sessionId)!.push(span);
  }
}

console.log(`Found ${sessionGroups.size} sessions`);
import json
from typing import Any, Dict, List

from phoenix.client.resources.spans import SpanAnnotationData
from phoenix.otel import SpanAttributes

# Fetch all agent spans
spans = phoenix_client.spans.get_spans(
    project_identifier="support-bot",
    limit=200,
)

# Filter to agent spans and group by session ID
agent_spans = [span for span in spans if span.get("name") == "support-agent"]
session_groups: Dict[str, List[Any]] = {}

for span in agent_spans:
    # Access attributes (may be a dict or JSON string)
    attributes = span.get("attributes", {})
    if isinstance(attributes, str):
        attributes = json.loads(attributes)
    session_id = attributes.get("session.id") or attributes.get(SpanAttributes.SESSION_ID)
    if session_id:
        if session_id not in session_groups:
            session_groups[session_id] = []
        session_groups[session_id].append(span)

print(f"Found {len(session_groups)} sessions")
For each session, build a transcript and run the evaluators:
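A transcript builder might look like the sketch below. It assumes each agent span carries OpenInference `input.value` and `output.value` attributes (the user query and agent response); adjust the keys to match your instrumentation:

```python
from typing import Any, Dict, List


def build_transcript(session_spans: List[Dict[str, Any]]) -> str:
    """Format a session's spans as a User/Agent transcript, oldest turn first."""
    lines = []
    for span in sorted(session_spans, key=lambda s: s.get("start_time", "")):
        attrs = span.get("attributes", {})
        lines.append(f"User: {attrs.get('input.value', '')}")
        lines.append(f"Agent: {attrs.get('output.value', '')}")
    return "\n".join(lines)


# Each transcript can then be passed to the evaluators as the {input} variable
example = [
    {
        "start_time": "t1",
        "attributes": {"input.value": "Status of ORD-12345?", "output.value": "It ships Dec 15."},
    },
]
print(build_transcript(example))
```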
Now that we’ve run our session-level evaluators, let’s see how our support bot performs across user sessions.

Turn 1: The user asks about order ORD-67890. The agent correctly looks up the order and reports it’s processing with a December 15 ETA.

Turn 2: The user switches topics entirely - “How do I cancel my subscription?” This is a FAQ question, not an order question. The agent handles it via RAG, providing the correct cancellation instructions.

Turn 3: Here’s the real test. The user says “Back to my order - what’s the carrier?” They don’t repeat the order ID. They just say “my order.”

Did the agent remember? Yes. It correctly referenced ORD-67890 and provided the carrier status (pending) without asking the user to repeat themselves.

The session-level annotations confirm what we see:
conversation_coherence: coherent (score: 1.0) - The explanation notes that “the agent correctly referenced the order ID and consistent details across turns… and also handled the separate subscription question without losing track.”
resolution_status: resolved (score: 1.0) - The explanation confirms “the agent answered the user’s questions: provided order status and ETA, explained cancellation steps, and clarified that the carrier is currently pending.”
This is exactly what session evaluation gives you. Instead of manually reviewing each turn, you can scan the coherence and resolution scores across all sessions. When you find one marked “incoherent” or “unresolved,” click in to see the explanation and understand what went wrong.
You’ve used sessions to transform your tracing data from isolated queries into conversation threads. Here are the benefits you’ve realized by using sessions:
| Without Sessions | With Sessions |
| --- | --- |
| Individual traces, disconnected | Full conversation history |
| Can’t see context loss | “Bot forgot what I said” is visible |
| Per-turn metrics only | Total tokens, turns to resolution |
| Evaluate single responses | Evaluate entire conversations |
The workflow:
Add session IDs to your agent (one-time setup)
Track conversation history between turns
View sessions in the Phoenix Sessions tab
Evaluate conversations with coherence and resolution evaluators
Debug patterns by clicking into problematic sessions
The patterns you’ve learned - tracing, annotation, evaluation, and sessions - apply to any LLM application. The specific evaluators and metrics will change, but the approach stays the same: observe everything, measure what matters, and use the data to improve.