When building agents and LLM applications, it’s hard to see what’s actually going on under the hood. Even if you set up a comprehensive agent architecture with multiple prompts, descriptive tools, and data retrievals, you’re left answering questions like:
Why did the agent choose that tool instead of this one?
What context was actually passed to the LLM when it generated that response?
Where is all the latency coming from - is it the model, the retrieval, or something else?
The user got a wrong answer, but which step in the pipeline failed?
In this tutorial, with just a few additional lines of code, you'll be able to monitor every LLM call, tool execution, and retrieval operation that powers your agents. You'll learn how to debug, monitor, and analyze your agents more effectively and efficiently, transforming them from personal projects into production-ready applications.

Follow along with code

Throughout the tutorial we will include key code snippets, but to see the full implementation, check out the companion projects below.
TypeScript Tutorial
Companion TypeScript project with runnable examples
Classifies incoming queries (order status vs. FAQ)
Routes to the appropriate handler:
Order Status: Use a tool to look up order information, then summarize for the customer
FAQ: Search a knowledge base with embeddings, then generate an answer using RAG
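The classify-then-route structure above can be sketched in a few lines. This is a minimal illustration only — the function names are stand-ins (the real handlers, which call an LLM, a tool, and a retriever, live in the companion projects):

```python
def classify_query(user_query: str) -> str:
    """Stand-in classifier: the real version calls an LLM."""
    return "order_status" if "order" in user_query.lower() else "faq"

def handle_order_status(user_query: str) -> str:
    """Stand-in for the tool-lookup + summarize path."""
    return "order status response"

def handle_faq(user_query: str) -> str:
    """Stand-in for the embed + retrieve + RAG path."""
    return "faq response"

def handle_query(user_query: str) -> str:
    # Classify, then dispatch to the matching handler
    category = classify_query(user_query)
    if category == "order_status":
        return handle_order_status(user_query)
    return handle_faq(user_query)
```

Every interesting failure in this tutorial happens inside one of these handlers — which is exactly what tracing will let us see.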
The issue is that our users are complaining. Responses are slow, answers are wrong, and we have no idea why. Our agent is a black box - we can see what the user asked and how the agent replied, but we don't have visibility into the individual components of our agent that actually ran.

Let's set up tracing to gain visibility.
arize-phoenix-otel 0.16.0+ is required to import SpanAttributes and the OpenInference context managers directly from phoenix.otel. On older versions, import them from openinference.instrumentation and openinference.semconv.trace instead.
In order to send traces to Phoenix, you must sign up for a free space and account. Follow these instructions to configure Phoenix Cloud, if you haven't already.

Once you have Phoenix Cloud configured, set your keys:
TypeScript
Python
Create a .env file in your project root:
```
PHOENIX_API_KEY=<ENTER YOUR PHOENIX API KEY>
PHOENIX_COLLECTOR_ENDPOINT=<ENTER YOUR PHOENIX ENDPOINT>
OPENAI_API_KEY=<ENTER YOUR OPENAI API KEY>
```
```python
import os

os.environ["PHOENIX_API_KEY"] = "<ENTER YOUR PHOENIX API KEY>"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "<ENTER YOUR PHOENIX ENDPOINT>"
os.environ["OPENAI_API_KEY"] = "<ENTER YOUR OPENAI API KEY>"
```
```typescript
import { register } from "@arizeai/phoenix-otel";

// Register with Phoenix - this handles all the OpenTelemetry boilerplate
export const provider = register({
  projectName: "support-bot",
});
```
Import this file at the top of your application to enable tracing.
Setting auto_instrument=True in the register function looks at the installed OpenInference packages and automatically instruments them, so you don’t need to do manual configuration for each library.
```python
from phoenix.otel import register

tracer_provider = register(
    project_name="support-bot",
    auto_instrument=True,
)
```
Every LLM call is a decision point. What prompt did the model receive? What did it output? How long did it take, and how many tokens did it use?

Without tracing, you're forced to build your own logging and debugging, and you miss out on key data needed for full observability. With tracing, you get a complete record of every LLM interaction, including:
input messages (system, user, assistant prompt)
LLM output
model name, model provider
invocation parameters
token counts
latency
For SupportBot, tracing LLM calls gives us observability into the classification stage: we'll see exactly how each support query was classified, and what led to that classification. It also gives us observability into the final generation stage: how was the final output delivered to the user, and what context went into the LLM call that generated it?
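SupportBot's classifier replies with a small JSON object containing category, confidence, and reasoning fields. A defensive parser for that reply — a sketch only; the fallback defaults here are our own assumption, not part of the tutorial code — might look like:

```python
import json

def parse_classification(raw: str) -> dict:
    """Parse the classifier's JSON reply, falling back to a safe
    default instead of crashing the agent on malformed output."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: treat unparseable replies as low-confidence FAQ
        return {"category": "faq", "confidence": "low", "reasoning": "unparseable reply"}
    return {
        "category": parsed.get("category", "faq"),
        "confidence": parsed.get("confidence", "low"),
        "reasoning": parsed.get("reasoning", ""),
    }
```

With tracing enabled, both the raw reply and the parsed result are visible on the classification span, so parsing failures are easy to spot.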
TypeScript
Python
The key to tracing AI SDK calls is one parameter: experimental_telemetry: { isEnabled: true }. Add this to any generateText or embed call:
```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const result = await generateText({
  model: openai.chat("gpt-4o-mini"),
  system: "Classify the query as 'order_status' or 'faq'",
  prompt: userQuery,
  experimental_telemetry: { isEnabled: true }, // This enables tracing
});
```
Calls will be automatically traced when instrumentation is enabled:
```python
user_query = "Where is my order?"

result = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the query as 'order_status' or 'faq'"},
        {"role": "user", "content": user_query},
    ],
)
```
Tools allow your agent to interact with databases, APIs, and external systems. To gain insight into how your tools are performing, you need to answer questions like:
Did the LLM decide to call the right tool?
Did it extract the parameters correctly?
Did the tool return what you expected?
When your LLM calls tools, those executions are automatically traced as child spans. With Phoenix, you see the complete chain, including the LLM's decision, the exact parameters passed, and the tool's response, without having to guess which step broke.
TypeScript
Python
With the AI SDK, simply define your tools using the AI SDK configuration; tracing happens automatically when experimental_telemetry is enabled:
```typescript
const result = await generateText({
  model: openai.chat("gpt-4o-mini"),
  prompt: userQuery,
  tools: {
    lookupOrderStatus: tool({
      description: "Look up order status by order ID",
      inputSchema: z.object({
        orderId: z.string(),
      }),
      execute: async ({ orderId }) => {
        // Your tool logic here
        return orderDatabase[orderId];
      },
    }),
  },
  maxSteps: 2,
  experimental_telemetry: { isEnabled: true }, // Tools are traced automatically
});
```
For the OpenAI Python SDK, you need a helper function to execute tool calls and manually create spans to trace tool execution.
```python
tools = [
    # tools defined here
]

# Helper function to execute tools
def execute_tool_call(tool_call, database):
    """Execute a tool call and return the result."""
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    with tracer.start_as_current_span(
        function_name,
        attributes={
            SpanAttributes.OPENINFERENCE_SPAN_KIND: "TOOL",
            SpanAttributes.TOOL_NAME: function_name,
            SpanAttributes.TOOL_PARAMETERS: json.dumps(function_args),
            SpanAttributes.INPUT_VALUE: json.dumps(function_args),
        },
    ) as tool_span:
        # Helper function logic here
        ...

messages = [
    {
        "role": "system",
        "content": "You are a helpful customer support agent....",
    },
    {"role": "user", "content": user_query},
]

result = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
message = result.choices[0].message
messages.append(message)

# Execute tool if called, then get final response
if message.tool_calls:
    for tool_call in message.tool_calls:
        tool_result = execute_tool_call(tool_call, ORDER_DATABASE)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(tool_result),
        })
    # Final LLM call with tool result
    final_result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    final_response = final_result.choices[0].message.content
else:
    final_response = message.content
```
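The tools list itself is elided in the snippet above. For reference, a definition of the order-lookup tool in the OpenAI function-calling format could look like the following — a plausible sketch; the tutorial's actual schema may differ:

```python
# A plausible lookupOrderStatus definition in the OpenAI
# function-calling format (the tutorial's actual schema may differ)
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookupOrderStatus",
            "description": "Look up order status by order ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "orderId": {
                        "type": "string",
                        "description": "The order ID, e.g. ORD-12345",
                    }
                },
                "required": ["orderId"],
            },
        },
    }
]
```

The tool name and parameter name here match what the traces in this tutorial show (lookupOrderStatus with an orderId argument).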
In Phoenix, you’ll see:
LLM Span: Model decides to call lookupOrderStatus
Tool Span: Shows the tool name, input (orderId), and output
RAG pipelines can fail in many places. The embedding might not capture the query's intent, the retrieval might return irrelevant documents, or the LLM might misuse good context. When a user gets a bad answer, which step failed? With tracing, you can see the full pipeline, including which documents were retrieved, what context was injected into the prompt, and how the LLM used it. You can pinpoint exactly where things went wrong.

For RAG, trace both the embedding calls and the generation call. Each embed call becomes its own span:
TypeScript
Python
```typescript
// Embed the user's query - traced automatically
const { embedding } = await embed({
  model: openai.embedding("text-embedding-ada-002"),
  value: userQuery,
  experimental_telemetry: { isEnabled: true },
});

// ... semantic search logic ...

// Generate with retrieved context - traced automatically
const { text } = await generateText({
  model: openai.chat("gpt-4o-mini"),
  system: `Answer using ONLY this context:\n\n${retrievedContext}`,
  prompt: userQuery,
  experimental_telemetry: { isEnabled: true },
});
```
```python
# Embed the query - traced automatically
embedding_response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_query,
)
query_embedding = embedding_response.data[0].embedding

# Generate with retrieved context - traced automatically
rag_result = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": f"You are a helpful customer support agent. Answer the user's question using ONLY the information provided in the context below. Be friendly and concise.\n\nContext:\n{rag_context}",
        },
        {"role": "user", "content": user_query},
    ],
)
```
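The semantic search step between the embedding call and the generation call is elided above. A minimal cosine-similarity retrieval over pre-computed document embeddings — a pure-Python sketch with no extra dependencies, not the tutorial's actual implementation — could look like:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_embedding, doc_embeddings, documents, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    scored = sorted(
        zip(documents, doc_embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]
```

In production you would typically use a vector database for this step, but the retrieved documents end up in the generation span's system prompt either way — which is what makes retrieval quality visible in the trace.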
In Phoenix, the generation span shows the retrieved context in the system prompt, so you can immediately see if retrieval found the right documents.
A single user request might trigger multiple LLM calls, tool executions, and retrievals. Let's nest all of these under one parent span, so all operations for one request are grouped together. Click on the parent span to see the entire execution tree: classification, tool calls, retrieval, and generation, all in one view, with timing relationships visible at a glance.
To see all operations for a single request as one trace, wrap them in a parent span using the OpenTelemetry API:
TypeScript
Python
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("support-agent");

async function handleSupportQuery(userQuery: string) {
  return tracer.startActiveSpan(
    "support-agent",
    { attributes: { "openinference.span.kind": "AGENT", "input.value": userQuery } },
    async (span) => {
      try {
        // All LLM calls, tool executions, and embeddings inside here
        // will appear as children of this span
        const result = await processQuery(userQuery);
        span.setAttribute("output.value", result);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    }
  );
}
```
```python
from opentelemetry import trace
from phoenix.otel import SpanAttributes

def handle_support_query(user_query: str) -> AgentResponse:
    tracer = trace.get_tracer("support-agent")
    with tracer.start_as_current_span(
        "support-agent",
        attributes={
            SpanAttributes.OPENINFERENCE_SPAN_KIND: "AGENT",
            SpanAttributes.INPUT_VALUE: user_query,
        },
    ) as agent_span:
        try:
            # All LLM calls, tool executions, and embeddings inside here
            # will appear as children of this span
            result = process_query(user_query)
            agent_span.set_attribute(SpanAttributes.OUTPUT_VALUE, result)
            agent_span.set_status(trace.Status(trace.StatusCode.OK))
            return result
        except Exception as error:
            agent_span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise
```
The final SupportBot agent combines the classifier, the order status tool, and the FAQ retrieval into a single agent.

The tutorial code runs 7 test queries against the agent:
```typescript
const queries = [
  "What's the status of order ORD-12345?", // Order found
  "How can I get a refund?", // FAQ in knowledge base
  "Where is my order ORD-67890?", // Order found
  "I forgot my password", // FAQ in knowledge base
  "What's the status of order ORD-99999?", // Order NOT found
  "How do I upgrade to premium?", // FAQ NOT in knowledge base
  "Can you help me with something?", // Vague request
];
```
Complete TypeScript Tutorial
Complete Python Tutorial
The code may prompt for feedback on each response - you can skip this for now (press s) and focus on the traces.
Open your Phoenix Cloud space. You'll see 7 support-agent traces - one for each query.

Click into any trace to see the full execution tree. Let's look at two interesting cases:
Our support query classifier gave the following classification:
```json
{
  "category": "faq",
  "confidence": "low",
  "reasoning": "The query is vague and doesn't specify a relevant topic, but it suggests a need for assistance, placing it within a general support context."
}
```
Confidence: low is a huge red flag! This tells us that our support query classifier was unable to confidently classify the user's support query, indicating the query may be out of scope for our agent.

The last span shows the most relevant context retrieved:
```
Context:
Q: How do I reset my password?
A: Go to Settings > Security > Reset Password. You'll receive an email with a reset link that expires in 24 hours.

Q: How do I update my profile information?
A: Go to Account Settings > Profile. You can update your name, email, phone number, and address there.
```
This context is not relevant to the user's question at all. Our traces have therefore given us proper insight into why the final answer was:
I’d be happy to help, but I can only assist with questions related to account settings, passwords, and profile information. Let me know if you need help with those!
Our support query classifier gave us the following classification:
```json
{
  "category": "order_status",
  "confidence": "high",
  "reasoning": "The query directly asks about the status of a specific order, indicating it is related to order tracking."
}
```
Hmm. It seems our classifier thinks this question falls within the scope of our agent. Let's keep going.

Our support agent LLM span chose the following tool call:
lookupOrderStatus("{\"orderId\":\"ORD-99999\"}")
Seems good… The lookupOrderStatus tool call gave us:
{"error":"Order ORD-99999 not found in our system"}
Aha! It seems ORD-99999 is an invalid order number! That's why the final output was:
Hi there! I checked on your order with the ID ORD-99999, but it seems that I couldn't find any details at the moment. If you could provide me with more information or check back later, I'd be happy to assist you further!
You can now see inside your application - every LLM call, tool execution, and retrieval is visible. We spent some time manually analyzing traces, but how can we automate this analysis over thousands of traces? How can we store this analysis in Phoenix so that we can build metrics that measure our application?

In the next chapter, you'll learn to:
Annotate traces to mark quality issues
Capture user feedback (thumbs up/down) and attach it to traces
Run automated LLM-as-Judge evaluations to find patterns in what’s failing