Once experiments are defined, they can be integrated into your development workflow as a systematic way to validate changes to your application. In practice, this means updating the underlying code that your experiment task calls, such as prompt changes, model swaps, retrieval logic, or system configuration, and then rerunning the experiment to observe how those changes affect evaluation metrics.

Because experiments in Phoenix are tied to a fixed dataset and evaluation setup, you can clearly see how metrics evolve as your system changes. This allows you to compare results across runs and identify whether a change led to an improvement, a regression, or a tradeoff across different quality dimensions.

Over time, this creates a measurable history of how your application has evolved and helps teams make decisions based on data rather than intuition.
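The loop above can be sketched in a few lines. The task names and responses below are hypothetical, but they show the key property: the dataset and evaluators stay fixed while only the task implementation changes between runs, so metric differences are attributable to the code change.

```python
# Hypothetical sketch of the validate-by-experiment loop: the dataset and
# evaluators stay fixed while the task implementation changes between runs.
def task_v1(example):
    # Baseline behavior: vague, non-actionable answer
    return "Please contact support for help."

def task_v2(example):
    # Candidate change: specific, actionable answer
    return "Open Settings > Billing and select 'Update payment method'."

# Each version would be passed to run_experiment with the same dataset and
# evaluators; comparing the resulting scores shows whether the change helped.
for version, task in [("v1", task_v1), ("v2", task_v2)]:
    print(version, task({"input": "How do I update my card?"}))
```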
Let’s demonstrate this workflow by creating an improved version of our support agent with instructions that emphasize actionability, then running a second experiment to compare it against the initial one.
We’ll create a new version of the agent with enhanced instructions that emphasize specific, actionable responses. The key change is in the instructions parameter in the agent’s prompt. For the complete implementation, including the task function, see the reference notebook.
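As a rough sketch of what that change looks like, the instructions text and task skeleton below are hypothetical stand-ins; the real versions live in the reference notebook:

```python
# Hypothetical enhanced instructions emphasizing specific, actionable answers.
IMPROVED_INSTRUCTIONS = (
    "You are a customer support agent. Always give specific, actionable "
    "answers: name the exact setting, menu, or command involved, and end "
    "with a clear next step the user can take."
)

def improved_support_agent_task(example):
    # Placeholder: invoke your agent framework with IMPROVED_INSTRUCTIONS
    # and return its response text for the given dataset example.
    raise NotImplementedError("see the reference notebook for the full task")
```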
Run an experiment with the improved agent using the same dataset and evaluator to compare performance:
```python
# Run experiment with improved agent to compare actionability scores
from phoenix.client.experiments import run_experiment

improved_experiment = run_experiment(
    dataset=dataset,
    task=improved_support_agent_task,
    evaluators=[call_actionability_judge],
    experiment_name="improved support agent",
    experiment_description=(
        "Agent with enhanced instructions to improve actionability - "
        "emphasizes specific, concrete responses with clear next steps"
    ),
)
```
With the improved prompt, the evaluator scores should be higher compared to the initial experiment, indicating better actionability and helpfulness.