This tutorial shows how to classify documents as relevant or irrelevant to queries using benchmark datasets with ground-truth labels.
Key Points:
- Download and prepare benchmark datasets for relevance classification
- Compare different LLMs (GPT-4, GPT-3.5, GPT-4 Turbo) on classification accuracy
- Analyze results with confusion matrices and detailed reports
- Get explanations for LLM classifications to understand decision-making
- Measure retrieval quality using ranking metrics like precision@k (a generic sketch appears under Evaluate Results below)
Notebook Walkthrough
This page walks through the key code snippets. To follow the tutorial end to end, open the notebook in Google Colab:
Google Colab: colab.research.google.com
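Before running the snippets, install the packages they rely on and provide an OpenAI API key. The following setup is a minimal sketch, assuming the Phoenix evals package plus the metrics and plotting libraries used below; exact package versions are not specified in this tutorial.

pip install arize-phoenix-evals openai pandas scikit-learn pycm matplotlib

import os
from getpass import getpass

# The OpenAI-backed evaluator below reads the key from the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")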
Download Benchmark Dataset
# Download a benchmark dataset of (query, document) pairs with ground-truth relevance labels.
# Note: this import path is an assumption and may differ across Phoenix versions.
from phoenix.evals import download_benchmark_dataset

df = download_benchmark_dataset(
    task="binary-relevance-classification",
    dataset_name="wiki_qa-train",
)

# Evaluate a random sample to keep the run fast and inexpensive.
N_EVAL_SAMPLE_SIZE = 100
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)

# Rename columns to the names the evaluator expects.
df_sample = df_sample.rename(columns={
    "query_text": "input",
    "document_text": "reference",
})
Run Relevance Classification
from phoenix.evals import LLM, async_evaluate_dataframe
from phoenix.evals.metrics import DocumentRelevanceEvaluator

# Configure the judge model and the built-in document-relevance evaluator.
llm = LLM(provider="openai", model="gpt-4")
relevance_evaluator = DocumentRelevanceEvaluator(llm=llm)

# Run the evaluator over the sample; concurrency controls how many LLM calls run in parallel.
evals_df = await async_evaluate_dataframe(
    dataframe=df_sample, evaluators=[relevance_evaluator], concurrency=10
)

# Extract the predicted labels and the evaluator's label set.
relevance_classifications = evals_df["document_relevance_score"].str["label"].tolist()
choices = relevance_evaluator.CHOICES
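Before scoring against the ground truth, it can help to glance at the predicted label distribution. A small inspection sketch using pandas:

import pandas as pd

# choices holds the evaluator's label set; value_counts shows how often each label was predicted.
print(choices)
print(pd.Series(relevance_classifications).value_counts(dropna=False))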
Evaluate Results
from sklearn.metrics import classification_report
from pycm import ConfusionMatrix
import matplotlib.pyplot as plt

# Map the boolean ground-truth column onto the evaluator's label names.
true_labels = df_sample["relevant"].map({True: "relevant", False: "unrelated"}).tolist()

# Precision, recall, and F1 per class.
print(classification_report(true_labels, relevance_classifications, labels=choices))

# Confusion matrix of ground-truth vs. predicted labels.
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=choices
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
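The Key Points also mention ranking metrics such as precision@k. The notebook's own implementation is not reproduced on this page; the following is a generic, self-contained sketch that computes precision@k from a list of ground-truth relevance flags ordered by retrieval rank (the ranked_relevance input is illustrative and is not a column in this dataset):

def precision_at_k(ranked_relevance: list[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_relevance[:k]
    return sum(top_k) / k if k > 0 else 0.0

# Example: the first and third retrieved documents are relevant.
print(precision_at_k([True, False, True, False], k=2))  # 0.5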
Get Explanations
# Re-run the evaluator on a handful of rows and unpack the label and explanation
# returned for each classification.
relevance_classifications_df = await async_evaluate_dataframe(
    dataframe=df_sample.sample(n=5),
    evaluators=[relevance_evaluator],
    concurrency=10,
)
relevance_classifications_df["label"] = relevance_classifications_df[
    "document_relevance_score"
].str["label"]
relevance_classifications_df["explanation"] = relevance_classifications_df[
    "document_relevance_score"
].str["explanation"]
Compare Models
Run the same evaluation with different models:
# GPT-3.5
llm_gpt35 = LLM(provider="openai", model="gpt-3.5-turbo")
# GPT-4 Turbo
llm_gpt4turbo = LLM(provider="openai", model="gpt-4-turbo-preview")
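To compare the models, re-run the same evaluation with each LLM and score the predictions against the ground-truth labels. A sketch of the comparison loop, reusing df_sample, true_labels, and choices from above; the accuracy computation uses scikit-learn:

from sklearn.metrics import accuracy_score

models = {
    "gpt-4": llm,
    "gpt-3.5-turbo": llm_gpt35,
    "gpt-4-turbo-preview": llm_gpt4turbo,
}

for name, model_llm in models.items():
    evaluator = DocumentRelevanceEvaluator(llm=model_llm)
    results_df = await async_evaluate_dataframe(
        dataframe=df_sample, evaluators=[evaluator], concurrency=10
    )
    predictions = results_df["document_relevance_score"].str["label"].tolist()
    print(name, accuracy_score(true_labels, predictions))

The same classification_report and confusion-matrix steps from the Evaluate Results section can be applied to each model's predictions for a more detailed comparison.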