When building agents and LLM applications, it’s hard to see what’s actually going on under the hood. Even if you set up a comprehensive agent architecture with multiple prompts, descriptive tools, and data retrievals, you’re left answering questions like:
Why did the agent choose that tool instead of this one?
What context was actually passed to the LLM when it generated that response?
Where is all the latency coming from - is it the model, the retrieval, or something else?
The user got a wrong answer, but which step in the pipeline failed?
In this tutorial, with just a few additional lines of code, you'll be able to monitor every LLM call, tool execution, and retrieval operation that powers your agents. You'll learn how to debug, monitor, and analyze your agents more effectively and efficiently, transforming them from personal projects into production-ready applications.

Follow along with code

Throughout the tutorial we will include key code snippets, but to see the full implementation, check out the companion projects below.
TypeScript Tutorial
Companion TypeScript project with runnable examples
Classifies incoming queries (order status vs. FAQ)
Routes to the appropriate handler:
Order Status: Use a tool to look up order information, then summarize for the customer
FAQ: Search a knowledge base with embeddings, then generate an answer using RAG
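The classify-then-route structure above can be sketched in a few lines. This is a minimal illustration only — the function names are stand-ins (the real handlers, which call an LLM, a tool, and a retriever, live in the companion projects):

```python
def classify_query(user_query: str) -> str:
    """Stand-in classifier: the real version calls an LLM."""
    return "order_status" if "order" in user_query.lower() else "faq"

def handle_order_status(user_query: str) -> str:
    """Stand-in for the tool-lookup + summarize path."""
    return "order status response"

def handle_faq(user_query: str) -> str:
    """Stand-in for the embed + retrieve + RAG path."""
    return "faq response"

def handle_query(user_query: str) -> str:
    # Classify, then dispatch to the matching handler
    category = classify_query(user_query)
    if category == "order_status":
        return handle_order_status(user_query)
    return handle_faq(user_query)
```

Every interesting failure in this tutorial happens inside one of these handlers — which is exactly what tracing will let us see.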
The issue is that our users are complaining. Responses are slow, answers are wrong, and we have no idea why. Our agent is a black box - we can see what the user asked and how the agent replied, but we don't have visibility into the individual components of our agent that actually ran.

Let's set up tracing to gain visibility.
arize-phoenix-otel 0.16.0+ is required to import SpanAttributes and the OpenInference context managers directly from phoenix.otel. On older versions, import them from openinference.instrumentation and openinference.semconv.trace instead.
In order to send traces to Phoenix, you must sign up for a free space and account. Follow these instructions to configure Phoenix Cloud, if you haven't already.

Once you have Phoenix Cloud configured, set your keys:
TypeScript
Python
Create a .env file in your project root:
```
PHOENIX_API_KEY=<ENTER YOUR PHOENIX API KEY>
PHOENIX_COLLECTOR_ENDPOINT=<ENTER YOUR PHOENIX ENDPOINT>
OPENAI_API_KEY=<ENTER YOUR OPENAI API KEY>
```
```python
import os

os.environ["PHOENIX_API_KEY"] = "<ENTER YOUR PHOENIX API KEY>"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "<ENTER YOUR PHOENIX ENDPOINT>"
os.environ["OPENAI_API_KEY"] = "<ENTER YOUR OPENAI API KEY>"
```
```typescript
import { register } from "@arizeai/phoenix-otel";

// Register with Phoenix - this handles all the OpenTelemetry boilerplate
export const provider = register({
  projectName: "support-bot",
});
```
Import this file at the top of your application to enable tracing.
Setting auto_instrument=True in the register function looks at the installed OpenInference packages and automatically instruments them, so you don’t need to do manual configuration for each library.
```python
from phoenix.otel import register

tracer_provider = register(
    project_name="support-bot",
    auto_instrument=True,
)
```
Every LLM call is a decision point. What prompt did the model receive? What did it output? How long did it take, and how many tokens did it use?

Without tracing, you're forced to build your own logging and debugging, and you miss out on key data needed for full observability. With tracing, you get a complete record of every LLM interaction, including:
input messages (system, user, assistant prompt)
LLM output
model name, model provider
invocation parameters
token counts
latency
For SupportBot, tracing LLM calls gives us observability into the classification stage: we'll see exactly how each support query was classified, and what led to that classification. It also gives us observability into the final generation stage: how was the final output delivered to the user, and what context went into the LLM call that generated it?
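SupportBot's classifier replies with a small JSON object containing category, confidence, and reasoning fields. A defensive parser for that reply — a sketch only; the fallback defaults here are our own assumption, not part of the tutorial code — might look like:

```python
import json

def parse_classification(raw: str) -> dict:
    """Parse the classifier's JSON reply, falling back to a safe
    default instead of crashing the agent on malformed output."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: treat unparseable replies as low-confidence FAQ
        return {"category": "faq", "confidence": "low", "reasoning": "unparseable reply"}
    return {
        "category": parsed.get("category", "faq"),
        "confidence": parsed.get("confidence", "low"),
        "reasoning": parsed.get("reasoning", ""),
    }
```

With tracing enabled, both the raw reply and the parsed result are visible on the classification span, so parsing failures are easy to spot.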
TypeScript
Python
The key to tracing AI SDK calls is one parameter: experimental_telemetry: { isEnabled: true }. Add this to any generateText or embed call:
```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const result = await generateText({
  model: openai.chat("gpt-4o-mini"),
  system: "Classify the query as 'order_status' or 'faq'",
  prompt: userQuery,
  experimental_telemetry: { isEnabled: true }, // This enables tracing
});
```
Calls will be automatically traced when instrumentation is enabled:
```python
user_query = "Where is my order?"

result = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the query as 'order_status' or 'faq'"},
        {"role": "user", "content": user_query},
    ],
)
```
Tools allow your agent to interact with databases, APIs, and external systems. To gain insight into how your tools are performing, you need to answer questions like:
Did the LLM decide to call the right tool?
Did it extract the parameters correctly?
Did the tool return what you expected?
When your LLM calls tools, those executions are automatically traced as child spans. With Phoenix, you see the complete chain, including the LLM's decision, the exact parameters passed, and the tool's response, without having to guess which step broke.
TypeScript
Python
With the AI SDK, simply define your tools using the AI SDK configuration; tracing happens automatically when experimental_telemetry is enabled:
```typescript
const result = await generateText({
  model: openai.chat("gpt-4o-mini"),
  prompt: userQuery,
  tools: {
    lookupOrderStatus: tool({
      description: "Look up order status by order ID",
      inputSchema: z.object({
        orderId: z.string(),
      }),
      execute: async ({ orderId }) => {
        // Your tool logic here
        return orderDatabase[orderId];
      },
    }),
  },
  maxSteps: 2,
  experimental_telemetry: { isEnabled: true }, // Tools are traced automatically
});
```
For the OpenAI Python SDK, you need a helper function to execute tool calls and manually create spans to trace tool execution.
```python
tools = [
    # tools defined here
]

# Helper function to execute tools
def execute_tool_call(tool_call, database):
    """Execute a tool call and return the result."""
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    with tracer.start_as_current_span(
        function_name,
        attributes={
            SpanAttributes.OPENINFERENCE_SPAN_KIND: "TOOL",
            SpanAttributes.TOOL_NAME: function_name,
            SpanAttributes.TOOL_PARAMETERS: json.dumps(function_args),
            SpanAttributes.INPUT_VALUE: json.dumps(function_args),
        },
    ) as tool_span:
        # Helper function logic here
        ...

messages = [
    {
        "role": "system",
        "content": "You are a helpful customer support agent....",
    },
    {"role": "user", "content": user_query},
]

result = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
message = result.choices[0].message
messages.append(message)

# Execute tool if called, then get final response
if message.tool_calls:
    for tool_call in message.tool_calls:
        tool_result = execute_tool_call(tool_call, ORDER_DATABASE)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(tool_result),
        })
    # Final LLM call with tool result
    final_result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    final_response = final_result.choices[0].message.content
else:
    final_response = message.content
```
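The tools list itself is elided in the snippet above. For reference, a definition of the order-lookup tool in the OpenAI function-calling format could look like the following — a plausible sketch; the tutorial's actual schema may differ:

```python
# A plausible lookupOrderStatus definition in the OpenAI
# function-calling format (the tutorial's actual schema may differ)
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookupOrderStatus",
            "description": "Look up order status by order ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "orderId": {
                        "type": "string",
                        "description": "The order ID, e.g. ORD-12345",
                    }
                },
                "required": ["orderId"],
            },
        },
    }
]
```

The tool name and parameter name here match what the traces in this tutorial show (lookupOrderStatus with an orderId argument).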
In Phoenix, you’ll see:
LLM Span: Model decides to call lookupOrderStatus
Tool Span: Shows the tool name, input (orderId), and output
RAG pipelines can fail in many places. The embedding might not capture the query's intent, the retrieval might return irrelevant documents, or the LLM might misuse good context. When a user gets a bad answer, which step failed? With tracing, you can see the full pipeline, including which documents were retrieved, what context was injected into the prompt, and how the LLM used it. You can pinpoint exactly where things went wrong.

For RAG, trace both the embedding calls and the generation call. Each embed call becomes its own span:
TypeScript
Python
```typescript
// Embed the user's query - traced automatically
const { embedding } = await embed({
  model: openai.embedding("text-embedding-ada-002"),
  value: userQuery,
  experimental_telemetry: { isEnabled: true },
});

// ... semantic search logic ...

// Generate with retrieved context - traced automatically
const { text } = await generateText({
  model: openai.chat("gpt-4o-mini"),
  system: `Answer using ONLY this context:\n\n${retrievedContext}`,
  prompt: userQuery,
  experimental_telemetry: { isEnabled: true },
});
```
```python
# Embed the query - traced automatically
embedding_response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_query,
)
query_embedding = embedding_response.data[0].embedding

# Generate with retrieved context - traced automatically
rag_result = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": f"You are a helpful customer support agent. Answer the user's question using ONLY the information provided in the context below. Be friendly and concise.\n\nContext:\n{rag_context}",
        },
        {"role": "user", "content": user_query},
    ],
)
```
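The semantic search step between the embedding call and the generation call is elided above. A minimal cosine-similarity retrieval over pre-computed document embeddings — a pure-Python sketch with no extra dependencies, not the tutorial's actual implementation — could look like:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_embedding, doc_embeddings, documents, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    scored = sorted(
        zip(documents, doc_embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]
```

In production you would typically use a vector database for this step, but the retrieved documents end up in the generation span's system prompt either way — which is what makes retrieval quality visible in the trace.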
In Phoenix, the generation span shows the retrieved context in the system prompt, so you can immediately see if retrieval found the right documents.
A single user request might trigger multiple LLM calls, tool executions, and retrievals. Let's nest all of these under one parent span, so all operations for one request are grouped together. Click on the parent span to see the entire execution tree: classification, tool calls, retrieval, and generation, all in one view, with timing relationships visible at a glance.
To see all operations for a single request as one trace, wrap them in a parent span using the OpenTelemetry API:
TypeScript
Python
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("support-agent");

async function handleSupportQuery(userQuery: string) {
  return tracer.startActiveSpan(
    "support-agent",
    { attributes: { "openinference.span.kind": "AGENT", "input.value": userQuery } },
    async (span) => {
      try {
        // All LLM calls, tool executions, and embeddings inside here
        // will appear as children of this span
        const result = await processQuery(userQuery);
        span.setAttribute("output.value", result);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    }
  );
}
```
```python
from opentelemetry import trace
from phoenix.otel import SpanAttributes

def handle_support_query(user_query: str) -> AgentResponse:
    tracer = trace.get_tracer("support-agent")
    with tracer.start_as_current_span(
        "support-agent",
        attributes={
            SpanAttributes.OPENINFERENCE_SPAN_KIND: "AGENT",
            SpanAttributes.INPUT_VALUE: user_query,
        },
    ) as agent_span:
        try:
            # All LLM calls, tool executions, and embeddings inside here
            # will appear as children of this span
            result = process_query(user_query)
            agent_span.set_attribute(SpanAttributes.OUTPUT_VALUE, result)
            agent_span.set_status(trace.Status(trace.StatusCode.OK))
            return result
        except Exception as error:
            agent_span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise
```
The final SupportBot agent combines the classifier, the order status tool, and the FAQ retrieval into a single agent.

The tutorial code runs 7 test queries against the agent:
```typescript
const queries = [
  "What's the status of order ORD-12345?", // Order found
  "How can I get a refund?", // FAQ in knowledge base
  "Where is my order ORD-67890?", // Order found
  "I forgot my password", // FAQ in knowledge base
  "What's the status of order ORD-99999?", // Order NOT found
  "How do I upgrade to premium?", // FAQ NOT in knowledge base
  "Can you help me with something?", // Vague request
];
```
Complete TypeScript Tutorial
Complete Python Tutorial
The code may prompt for feedback on each response - you can skip this for now (press s) and focus on the traces.
Open your Phoenix Cloud space. You'll see 7 support-agent traces - one for each query.

Click into any trace to see the full execution tree. Let's look at two interesting cases:
Our support query classifier gave the following classification:
```json
{
  "category": "faq",
  "confidence": "low",
  "reasoning": "The query is vague and doesn't specify a relevant topic, but it suggests a need for assistance, placing it within a general support context."
}
```
Confidence: low is a huge red flag! This tells us that our support query classifier was unable to confidently classify the user's support query, indicating the query may be out of scope for our agent.

The last span shows the most relevant context retrieved:
```
Context:
Q: How do I reset my password?
A: Go to Settings > Security > Reset Password. You'll receive an email with a reset link that expires in 24 hours.

Q: How do I update my profile information?
A: Go to Account Settings > Profile. You can update your name, email, phone number, and address there.
```
This context is not relevant to the user's question at all. Our traces have therefore given us proper insight into why the final answer was:
I’d be happy to help, but I can only assist with questions related to account settings, passwords, and profile information. Let me know if you need help with those!
Our support query classifier gave us the following classification:
```json
{
  "category": "order_status",
  "confidence": "high",
  "reasoning": "The query directly asks about the status of a specific order, indicating it is related to order tracking."
}
```
Hmm. It seems our classifier thinks this question falls within the scope of our agent. Let's keep going.

Our support agent LLM span chose the following tool call:
lookupOrderStatus("{\"orderId\":\"ORD-99999\"}")
Seems good… The lookupOrderStatus tool call gave us:
{"error":"Order ORD-99999 not found in our system"}
Aha! It seems ORD-99999 is an invalid order number! That's why the final output was:
Hi there! I checked on your order with the ID ORD-99999, but it seems that I couldn't find any details at the moment. If you could provide me with more information or check back later, I'd be happy to assist you further!
You can now see inside your application - every LLM call, tool execution, and retrieval is visible. We spent some time manually analyzing traces, but how can we automate this analysis over thousands of traces? How can we store this analysis in Phoenix so that we can build metrics that measure our application?

In the next chapter, you'll learn to:
Annotate traces to mark quality issues
Capture user feedback (thumbs up/down) and attach it to traces
Run automated LLM-as-Judge evaluations to find patterns in what’s failing