AI applications are fundamentally different from traditional software. A REST API returns the same response for the same input. An LLM-powered agent reasons, retrieves, calls tools, and generates - with each step influenced by probabilities, context, and the interactions between components. When something goes wrong, the failure could be anywhere in that chain.

Observability is the practice of instrumenting your application so you can understand its internal state from its external outputs. For AI applications, this means capturing every LLM call, tool execution, retrieval operation, and generation - along with their inputs, outputs, latency, and token usage. With proper observability, you don’t guess why something failed. You look at the data and see exactly what happened.

Phoenix provides the infrastructure for AI observability: tracing to capture execution flow, annotations to measure quality, and sessions to track conversations. In this tutorial, you’ll learn to use all three by building a real application.
What You’ll Build
A TypeScript customer support agent that handles two types of queries:
- Order status questions → Calls a tool to look up order information
- FAQ questions → Searches a knowledge base using RAG
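The routing step above can be sketched as a small classifier. This is a hypothetical keyword-based router for illustration only - the tutorial's real agent lets the LLM decide which tool to call:

```typescript
// Hypothetical router: keyword matching stands in for the LLM's tool choice,
// just to show the two paths (tool call vs. RAG) that the tutorial traces.
type Route = "order_status" | "faq";

function routeQuery(query: string): Route {
  const q = query.toLowerCase();
  // Order-related keywords trigger the order-lookup tool path.
  if (/\border\b|\btracking\b|\bshipp?ed\b/.test(q)) {
    return "order_status";
  }
  // Everything else falls through to RAG over the FAQ knowledge base.
  return "faq";
}

console.log(routeQuery("Where is my order #123?")); // "order_status"
console.log(routeQuery("What is your return policy?")); // "faq"
```

Each branch produces a different span tree once instrumented, which is what makes the traces in Chapter 1 interesting to compare.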
Follow along with the complete code walkthroughs:
- TypeScript Tutorial - companion TypeScript project with runnable examples
- Python Tutorial - companion Python project with runnable examples
Chapter 1: Your First Traces
The problem: Your agent is a black box. When something goes wrong, you add console.log statements, re-run, and hope you logged the right thing.
What you’ll learn:
- Instrument your agent with OpenTelemetry in 5 minutes
- Trace LLM calls, tool executions, and RAG retrievals automatically
- Group related operations under parent spans for complete request context
- Navigate the Phoenix UI to explore traces
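The "parent span" idea can be modeled without any tracing library. The toy sketch below is not the OpenTelemetry API (the tutorial uses the real one); it only illustrates how child operations nest under one request-level span:

```typescript
// Toy span model: shows how LLM and tool operations nest under a single
// parent "handle_request" span. Real code would use an OpenTelemetry tracer.
interface Span {
  name: string;
  children: Span[];
  durationMs: number;
}

function startSpan(name: string): Span {
  return { name, children: [], durationMs: 0 };
}

// Run a child operation under a parent, recording its latency.
function withChildSpan<T>(parent: Span, name: string, op: () => T): T {
  const child = startSpan(name);
  const t0 = Date.now();
  try {
    return op();
  } finally {
    child.durationMs = Date.now() - t0;
    parent.children.push(child);
  }
}

// One user request produces a parent span with LLM and tool children.
const root = startSpan("handle_request");
withChildSpan(root, "llm.chat", () => "Which tool should I call?");
withChildSpan(root, "tool.lookup_order", () => ({ status: "shipped" }));
console.log(root.children.map((s) => s.name)); // ["llm.chat", "tool.lookup_order"]
```

Because every child carries a reference back to its parent, the Phoenix UI can reconstruct the whole request as one tree instead of disconnected log lines.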
Chapter 2: Annotations and Evaluation
The problem: You can see what’s happening, but you can’t tell if responses are actually good. A trace showing “200 OK” doesn’t mean the answer was right.
What you’ll learn:
- Annotate traces with human feedback directly in the Phoenix UI
- Capture user reactions (thumbs up/down) from your application and attach them to traces
- Build LLM-as-Judge evaluators that automatically assess quality
- Find patterns in what’s failing across hundreds of traces
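An LLM-as-Judge evaluator boils down to a grading prompt plus defensive parsing of the judge's reply. A minimal sketch - the prompt wording and label set are illustrative, and the actual LLM call (via your client of choice) is omitted:

```typescript
// Sketch of an LLM-as-Judge evaluator. The labels and prompt are
// illustrative; a real evaluator sends judgePrompt() to an LLM and
// attaches the parsed verdict to the trace as an annotation.
const LABELS = ["correct", "incorrect"] as const;
type Verdict = (typeof LABELS)[number];

function judgePrompt(question: string, answer: string): string {
  return [
    "You are grading a support agent's answer.",
    `Question: ${question}`,
    `Answer: ${answer}`,
    `Reply with exactly one word: ${LABELS.join(" or ")}.`,
  ].join("\n");
}

// Judge output is free text, so parse defensively before recording it.
function parseVerdict(raw: string): Verdict {
  const cleaned = raw.trim().toLowerCase();
  return LABELS.find((l) => cleaned.startsWith(l)) ?? "incorrect";
}

console.log(parseVerdict("Correct - the order ID matches.")); // "correct"
```

Running an evaluator like this over hundreds of traces is what turns "it feels off" into a measurable failure rate you can slice by query type.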
Chapter 3: Sessions
The problem: Your agent handles single queries fine, but real users have conversations. “What’s my order status?” → “When will it arrive?” → “Can I change the address?” Without sessions, each query is isolated - you can’t see if the agent remembered the order ID from the first turn.
What you’ll learn:
- Add session tracking to group conversation turns together
- View conversations as chat-like threads in Phoenix
- Evaluate entire conversations for coherence and resolution
- Debug “the bot forgot what I said” issues by seeing exactly where context was lost
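Session tracking amounts to stamping every traced turn with a shared conversation ID; grouping then falls out naturally. A sketch using a hypothetical minimal data model (Phoenix does this grouping for you once the ID is on the trace):

```typescript
// Each traced turn carries its conversation's session ID; traces sharing an
// ID form one thread. Hypothetical data model for illustration only.
interface TracedTurn {
  sessionId: string;
  input: string;
  output: string;
}

function groupBySession(turns: TracedTurn[]): Map<string, TracedTurn[]> {
  const sessions = new Map<string, TracedTurn[]>();
  for (const t of turns) {
    const thread = sessions.get(t.sessionId) ?? [];
    thread.push(t);
    sessions.set(t.sessionId, thread);
  }
  return sessions;
}

const turns: TracedTurn[] = [
  { sessionId: "s-1", input: "What's my order status?", output: "Order 42 has shipped." },
  { sessionId: "s-1", input: "When will it arrive?", output: "Friday." },
  { sessionId: "s-2", input: "Return policy?", output: "30 days." },
];
console.log(groupBySession(turns).get("s-1")?.length); // 2
```

Viewed as a thread, the second turn's missing order ID is immediately visible - exactly the "bot forgot what I said" debugging this chapter covers.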
Prerequisites
- Access to Phoenix Cloud or Phoenix running locally (pip install arize-phoenix && phoenix serve)
- OpenAI API key for LLM calls

