Phoenix evaluators support multiple prompt formats, all of which work with every supported model and provider.
Supported Formats
1. String Prompts
Simple string templates with variable placeholders.
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=llm,
    prompt_template="Classify the sentiment: {text}",
    choices={"positive": 1.0, "negative": 0.0, "neutral": 0.5}
)
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const evaluator = createClassificationEvaluator({
  name: "sentiment",
  model,
  promptTemplate: "Classify the sentiment: {{text}}",
  choices: { positive: 1, negative: 0, neutral: 0.5 },
});
2. Message Lists
Arrays of message objects with role and content fields.
evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "Evaluate the answer helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices={"helpful": 1.0, "somewhat_helpful": 0.5, "not_helpful": 0.0}
)
Supported roles:
"system" - Instructions for the model.
"user" - User messages and input context.
"assistant" - Assistant/model responses (for multi-turn conversations or few-shot examples).
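For example, assistant messages let you embed few-shot examples directly in a message-list template. The worked example below is illustrative (the question and label are made up); the list itself is just data in the same format shown above:

```python
# A message-list template that includes one few-shot example turn.
# The assistant message shows the model a worked example before the
# real {question}/{answer} pair is substituted in.
few_shot_template = [
    {"role": "system", "content": "Evaluate the answer helpfulness."},
    {"role": "user", "content": "Question: What is 2+2?\nAnswer: 4"},
    {"role": "assistant", "content": "helpful"},
    {"role": "user", "content": "Question: {question}\nAnswer: {answer}"},
]

# This list can be passed as prompt_template to ClassificationEvaluator.
print([m["role"] for m in few_shot_template])
```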
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const evaluator = createClassificationEvaluator({
  name: "helpfulness",
  model,
  promptTemplate: [
    { role: "system", content: "Evaluate the answer helpfulness." },
    { role: "user", content: "Question: {{question}}\nAnswer: {{answer}}" },
  ],
  choices: { helpful: 1, somewhat_helpful: 0.5, not_helpful: 0 },
});
3. Structured Content Parts (Python only)
Messages with multiple content parts, useful for separating different pieces of context.
Only text content is supported at this time.

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Question: {question}"},
                {"type": "text", "text": "Answer: {answer}"}
            ]
        }
    ],
    choices={"relevant": 1.0, "not_relevant": 0.0}
)
Structured content parts are not currently supported in the TypeScript library. Use message lists or string templates instead.
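When you need the same separation without structured parts, you can join the text parts into a single string yourself. A minimal sketch of that flattening (the `flatten_content_parts` helper is hypothetical, not library code):

```python
def flatten_content_parts(message):
    """Join a message's text parts into one string, e.g. to rewrite a
    structured-parts template as a plain message list.
    Illustrative helper only, not part of the library."""
    content = message["content"]
    if isinstance(content, str):
        return content
    # Only {"type": "text"} parts are supported; join them with newlines.
    return "\n".join(part["text"] for part in content if part["type"] == "text")

msg = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Question: {question}"},
        {"type": "text", "text": "Answer: {answer}"},
    ],
}
print(flatten_content_parts(msg))
```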
Template Variables
All formats support variable substitution. Python supports both f-string ({variable}) and mustache ({{variable}}) syntax, while TypeScript supports mustache syntax only.
# Variables are provided when calling .evaluate()
result = evaluator.evaluate({
    "question": "What is Python?",
    "answer": "A programming language"
})
// Variables are provided when calling .evaluate()
const result = await evaluator.evaluate({
  question: "What is Python?",
  answer: "A programming language",
});
console.log(result.label); // e.g., "relevant"
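Conceptually, substitution just replaces each placeholder with the matching value you pass to .evaluate(). A minimal stdlib sketch of mustache-style substitution (assumed behavior for illustration, not the library's implementation):

```python
import re

def render_mustache(template, variables):
    """Replace each {{name}} placeholder with variables["name"].
    A sketch of mustache-style substitution, not the library's code."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables[m.group(1)]),
        template,
    )

template = "Question: {{question}}\nAnswer: {{answer}}"
print(render_mustache(template, {
    "question": "What is Python?",
    "answer": "A programming language",
}))
```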
Using Phoenix Prompt Versions as Eval Templates (Python)
If your prompt is already stored in Phoenix Prompt Management, you can convert it directly into an evals PromptTemplate with phoenix_prompt_to_prompt_template.
from phoenix.client import Client
from phoenix.evals import (
    ClassificationEvaluator,
    LLM,
    phoenix_prompt_to_prompt_template,
)

client = Client(base_url="http://localhost:6006")
prompt_version = client.prompts.get(prompt_identifier="test-prompt")
prompt_template = phoenix_prompt_to_prompt_template(prompt_version)

evaluator = ClassificationEvaluator(
    name="recipe_quality",
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    prompt_template=prompt_template,
    choices={"good": 1.0, "bad": 0.0},
)
Notes:
- This utility accepts either a Phoenix PromptVersion object or a PromptVersionData-like dictionary.
- Role normalization supports Phoenix role aliases (ai/model -> assistant, developer -> system), including mixed-case role names.
- For structured content parts, only text parts are currently supported ({"type": "text", "text": ...}).
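The role aliasing described above can be pictured as a small case-insensitive lookup. This is a sketch of the mapping, not the library's actual normalization code:

```python
def normalize_role(role):
    """Illustrates the role aliasing described above:
    ai/model -> assistant, developer -> system, case-insensitive.
    A sketch, not the library's implementation."""
    aliases = {"ai": "assistant", "model": "assistant", "developer": "system"}
    r = role.lower()
    return aliases.get(r, r)

print(normalize_role("AI"))         # assistant
print(normalize_role("developer"))  # system
print(normalize_role("user"))       # user
```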
Client-Specific Behavior
All clients accept the same message format as input. Adapters handle client-specific transformations internally as needed:

OpenAI
- The system role is converted to the developer role for reasoning models.
- Otherwise, messages are passed as-is.
Anthropic
- System messages are extracted and passed via the system parameter.
- User/assistant messages are sent in the messages array.

Google GenAI
- System messages are extracted and passed via system_instruction in config.
- The assistant role is converted to the model role.
- Messages are sent in the contents array.

LiteLLM
- Messages are passed directly to LiteLLM in OpenAI format.
- LiteLLM handles provider-specific conversions internally.

LangChain
- OpenAI-format messages are converted to LangChain message objects (HumanMessage, AIMessage, SystemMessage).
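The transformations above can be sketched in plain Python. These helpers are illustrative approximations of what the adapters do (the function names and output shapes are assumptions, not the adapters' actual code):

```python
def to_anthropic(messages):
    """Sketch of the Anthropic-style split: system messages are lifted
    into a separate system string; the rest stay in the messages array."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return {"system": system, "messages": rest}

def to_google(messages):
    """Sketch of the Google GenAI mapping: system messages become
    system_instruction, and assistant turns are renamed to the model role."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    contents = [
        {"role": "model" if m["role"] == "assistant" else m["role"],
         "content": m["content"]}
        for m in messages if m["role"] != "system"
    ]
    return {"system_instruction": system, "contents": contents}

msgs = [
    {"role": "system", "content": "Evaluate helpfulness."},
    {"role": "user", "content": "Question: ..."},
    {"role": "assistant", "content": "helpful"},
]
print(to_anthropic(msgs)["system"])            # Evaluate helpfulness.
print(to_google(msgs)["contents"][1]["role"])  # model
```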
The TypeScript library uses the AI SDK, which handles provider-specific message formatting automatically. The AI SDK normalizes the interface across providers, so you can use the same prompt templates regardless of which model provider you choose. For provider-specific details, refer to the AI SDK documentation.
Full Example
A complete example showing evaluator setup and usage:
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "You evaluate response helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices={"helpful": 1.0, "somewhat_helpful": 0.5, "not_helpful": 0.0}
)

result = evaluator.evaluate({
    "question": "How do I learn Python?",
    "answer": "Start with online tutorials and practice daily."
})
print(result[0].label) # e.g., "helpful"
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const evaluator = createClassificationEvaluator({
  name: "helpfulness",
  model,
  promptTemplate: [
    { role: "system", content: "You evaluate response helpfulness." },
    { role: "user", content: "Question: {{question}}\nAnswer: {{answer}}" },
  ],
  choices: { helpful: 1, somewhat_helpful: 0.5, not_helpful: 0 },
});

const result = await evaluator.evaluate({
  question: "How do I learn Python?",
  answer: "Start with online tutorials and practice daily.",
});
console.log(result.label); // e.g., "helpful"