## Overview
The PrecisionRecallFScore evaluator computes precision, recall, and F-beta scores for comparing predicted labels against expected labels. It supports both binary and multi-class classification with various averaging strategies.

## When to Use
Use the PrecisionRecallFScore evaluator when you need to:

- Evaluate classification performance - Measure how well your model predicts correct labels
- Compare label sequences - Assess predicted vs expected labels for multi-item outputs
- Binary classification metrics - Compute metrics for spam/ham, positive/negative, etc.
- Multi-class evaluation - Evaluate across multiple categories with different averaging strategies
This is a code-based evaluator that computes standard classification metrics. Both expected and output should be sequences of labels (strings or integers).

## Supported Levels
This evaluator is not tied to specific tracing levels. It operates on lists of predicted and expected labels, making it useful for:

- Comparing model predictions against ground truth labels
- Evaluating classification outputs at any level where you have paired label sequences
- Batch evaluation of classification tasks in experiments
## Input Requirements
The PrecisionRecallFScore evaluator requires two inputs:

| Field | Type | Description |
|---|---|---|
| expected | List[str \| int] | List of expected/true labels |
| output | List[str \| int] | List of predicted labels |
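For example, a valid input pair is two label lists compared position by position (the equal-length check below is implied by the pairwise comparison, not a documented API guarantee):

```python
# Paired label sequences: output[i] is compared against expected[i]
expected = ["spam", "ham", "spam", "ham"]   # ground-truth labels
output   = ["spam", "spam", "spam", "ham"]  # model predictions
assert len(expected) == len(output)  # pairwise comparison assumes equal length
```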
For numeric labels {0, 1}, the evaluator automatically treats 1 as the positive class.

## Constructor Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| beta | float | 1.0 | Weight of recall relative to precision (F1 by default) |
| average | str | "macro" | Averaging strategy: "macro", "micro", or "weighted" |
| positive_label | str \| int | None | For binary classification, specify the positive class |
| zero_division | float | 0.0 | Value to use when a metric is undefined (0/0) |
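The roles of beta and zero_division can be seen in the standard F-beta formula itself. A minimal pure-Python sketch of that formula (an illustration, not the evaluator's internal code):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0,
           zero_division: float = 0.0) -> float:
    """F-beta: weighted harmonic mean of precision and recall.
    beta > 1 favors recall; beta < 1 favors precision."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return zero_division  # undefined (0/0) case
    return (1 + beta**2) * precision * recall / denom

print(f_beta(0.8, 0.5, beta=1.0))  # harmonic mean, ~0.615
print(f_beta(0.8, 0.5, beta=2.0))  # recall weighted higher, ~0.541
```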
## Output Interpretation

The evaluator returns three Score objects:
| Score Name | Description |
|---|---|
| precision | Ratio of true positives to predicted positives |
| recall | Ratio of true positives to actual positives |
| f1 (or f{beta}) | Harmonic mean of precision and recall |
All scores are returned with:

- direction="maximize" (higher is better)
- kind="code" (code-based evaluator)
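These definitions can be checked with a small pure-Python sketch for {0, 1} labels, treating 1 as the positive class (an illustration of the arithmetic, not the evaluator's source code):

```python
def binary_scores(expected, output, positive_label=1):
    pairs = list(zip(expected, output))
    tp = sum(e == positive_label and o == positive_label for e, o in pairs)
    fp = sum(e != positive_label and o == positive_label for e, o in pairs)
    fn = sum(e == positive_label and o != positive_label for e, o in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # TP / predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # TP / actual positives
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(binary_scores([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # each score = 2/3 here
```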
Averaging Strategies
For multi-class classification.| Strategy | Description |
|---|---|
macro | Calculate metrics for each class, then average (treats all classes equally) |
micro | Calculate metrics globally by counting total TP, FP, FN |
weighted | Average weighted by class support (number of true instances) |
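The table can be made concrete with a pure-Python sketch of the three strategies, using precision as the example metric (illustrative only, not the evaluator's implementation):

```python
from collections import Counter

def per_class_precision(expected, output):
    tp, fp = Counter(), Counter()
    for e, o in zip(expected, output):
        (tp if e == o else fp)[o] += 1  # a correct prediction is a TP for its class
    classes = sorted(set(expected) | set(output))
    per_class = {c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                 for c in classes}
    return per_class, tp, fp

def averaged_precision(expected, output, average="macro"):
    per_class, tp, fp = per_class_precision(expected, output)
    if average == "macro":     # every class counts equally
        return sum(per_class.values()) / len(per_class)
    if average == "micro":     # pool TP/FP counts across classes
        return sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    if average == "weighted":  # weight each class by its true-instance count
        support, n = Counter(expected), len(expected)
        return sum(support[c] / n * p for c, p in per_class.items())
    raise ValueError(average)

expected = ["a", "a", "a", "a", "b"]
output   = ["a", "a", "a", "b", "b"]
print(averaged_precision(expected, output, "macro"))     # (1.0 + 0.5) / 2 = 0.75
print(averaged_precision(expected, output, "micro"))     # 4 / 5 = 0.8
print(averaged_precision(expected, output, "weighted"))  # 0.8*1.0 + 0.2*0.5 = 0.9
```

Note how the rare class "b" drags macro precision down more than micro precision, which is exactly the "treats all classes equally" distinction in the table.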
## Usage Examples
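The evaluator's exact call signature is best taken from the API reference linked below. As a language-agnostic illustration of the contract (multi-class label lists in, macro-averaged precision/recall/F-beta out), here is a pure-Python sketch; the function and score names are illustrative, not the library's API, and whether F-beta is averaged per class or computed from the averaged precision/recall varies between implementations (this sketch uses the latter):

```python
def macro_prf(expected, output, beta=1.0, zero_division=0.0):
    """Macro-averaged precision, recall, and F-beta over paired label lists.
    Illustrative sketch only; names do not reflect the library's API."""
    classes = sorted(set(expected) | set(output))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(e == c and o == c for e, o in zip(expected, output))
        fp = sum(e != c and o == c for e, o in zip(expected, output))
        fn = sum(e == c and o != c for e, o in zip(expected, output))
        precisions.append(tp / (tp + fp) if tp + fp else zero_division)
        recalls.append(tp / (tp + fn) if tp + fn else zero_division)
    p = sum(precisions) / len(classes)
    r = sum(recalls) / len(classes)
    denom = beta**2 * p + r
    f = (1 + beta**2) * p * r / denom if denom else zero_division
    # Mirrors the evaluator's three scores: precision, recall, f{beta}
    return {"precision": p, "recall": r, f"f{beta:g}": f}

print(macro_prf(["cat", "dog", "cat", "bird"], ["cat", "dog", "dog", "bird"]))
```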
### Binary Classification

For binary classification, specify the positive label:

## Using with Phoenix
### Evaluating Traces

Run evaluations on traces collected in Phoenix and log results as annotations:

### Running Experiments
Use the PrecisionRecallFScore evaluator in Phoenix experiments:

## API Reference
- Python: PrecisionRecallFScore
## Related
- Exact Match Evaluator - For exact string comparison
- Correctness Evaluator - For semantic correctness evaluation

