This tutorial shows how to classify documents as relevant or irrelevant to queries using benchmark datasets with ground-truth labels.
Key Points:
- Download and prepare benchmark datasets for relevance classification
- Compare different LLMs (GPT-4, GPT-3.5, GPT-4 Turbo) on classification accuracy
- Analyze results with confusion matrices and detailed reports
- Get explanations for LLM classifications to understand decision-making
- Measure retrieval quality using ranking metrics like precision@k (a generic sketch appears under Evaluate Results below)
Notebook Walkthrough
This page walks through the key code snippets. To follow the tutorial end to end, open the notebook in Google Colab:
Google Colab: colab.research.google.com
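Before running the snippets, install the packages they rely on and provide an OpenAI API key. The following setup is a minimal sketch, assuming the Phoenix evals package plus the metrics and plotting libraries used below; exact package versions are not specified in this tutorial.

pip install arize-phoenix-evals openai pandas scikit-learn pycm matplotlib

import os
from getpass import getpass

# The OpenAI-backed evaluator below reads the key from the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")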
Download Benchmark Dataset
# Download a benchmark dataset of (query, document) pairs with ground-truth relevance labels.
# Note: this import path is an assumption and may differ across Phoenix versions.
from phoenix.evals import download_benchmark_dataset

df = download_benchmark_dataset(
    task="binary-relevance-classification",
    dataset_name="wiki_qa-train",
)

# Evaluate a random sample to keep the run fast and inexpensive.
N_EVAL_SAMPLE_SIZE = 100
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)

# Rename columns to the names the evaluator expects.
df_sample = df_sample.rename(columns={
    "query_text": "input",
    "document_text": "reference",
})
Run Relevance Classification
from phoenix.evals import LLM, async_evaluate_dataframe
from phoenix.evals.metrics import DocumentRelevanceEvaluator

# Configure the judge model and the built-in document-relevance evaluator.
llm = LLM(provider="openai", model="gpt-4")
relevance_evaluator = DocumentRelevanceEvaluator(llm=llm)

# Run the evaluator over the sample; concurrency controls how many LLM calls run in parallel.
evals_df = await async_evaluate_dataframe(
    dataframe=df_sample, evaluators=[relevance_evaluator], concurrency=10
)

# Extract the predicted labels and the evaluator's label set.
relevance_classifications = evals_df["document_relevance_score"].str["label"].tolist()
choices = relevance_evaluator.CHOICES
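Before scoring against the ground truth, it can help to glance at the predicted label distribution. A small inspection sketch using pandas:

import pandas as pd

# choices holds the evaluator's label set; value_counts shows how often each label was predicted.
print(choices)
print(pd.Series(relevance_classifications).value_counts(dropna=False))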
Evaluate Results
from sklearn.metrics import classification_report
from pycm import ConfusionMatrix
import matplotlib.pyplot as plt

# Map the boolean ground-truth column onto the evaluator's label names.
true_labels = df_sample["relevant"].map({True: "relevant", False: "unrelated"}).tolist()

# Precision, recall, and F1 per class.
print(classification_report(true_labels, relevance_classifications, labels=choices))

# Confusion matrix of ground-truth vs. predicted labels.
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=choices
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
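The Key Points also mention ranking metrics such as precision@k. The notebook's own implementation is not reproduced on this page; the following is a generic, self-contained sketch that computes precision@k from a list of ground-truth relevance flags ordered by retrieval rank (the ranked_relevance input is illustrative and is not a column in this dataset):

def precision_at_k(ranked_relevance: list[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_relevance[:k]
    return sum(top_k) / k if k > 0 else 0.0

# Example: the first and third retrieved documents are relevant.
print(precision_at_k([True, False, True, False], k=2))  # 0.5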
Get Explanations
# Re-run the evaluator on a handful of rows and unpack the label and explanation
# returned for each classification.
relevance_classifications_df = await async_evaluate_dataframe(
    dataframe=df_sample.sample(n=5),
    evaluators=[relevance_evaluator],
    concurrency=10,
)
relevance_classifications_df["label"] = relevance_classifications_df[
    "document_relevance_score"
].str["label"]
relevance_classifications_df["explanation"] = relevance_classifications_df[
    "document_relevance_score"
].str["explanation"]
Compare Models
Run the same evaluation with different models:
# GPT-3.5
llm_gpt35 = LLM(provider="openai", model="gpt-3.5-turbo")
# GPT-4 Turbo
llm_gpt4turbo = LLM(provider="openai", model="gpt-4-turbo-preview")
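To compare the models, re-run the same evaluation with each LLM and score the predictions against the ground-truth labels. A sketch of the comparison loop, reusing df_sample, true_labels, and choices from above; the accuracy computation uses scikit-learn:

from sklearn.metrics import accuracy_score

models = {
    "gpt-4": llm,
    "gpt-3.5-turbo": llm_gpt35,
    "gpt-4-turbo-preview": llm_gpt4turbo,
}

for name, model_llm in models.items():
    evaluator = DocumentRelevanceEvaluator(llm=model_llm)
    results_df = await async_evaluate_dataframe(
        dataframe=df_sample, evaluators=[evaluator], concurrency=10
    )
    predictions = results_df["document_relevance_score"].str["label"].tolist()
    print(name, accuracy_score(true_labels, predictions))

The same classification_report and confusion-matrix steps from the Evaluate Results section can be applied to each model's predictions for a more detailed comparison.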