Phoenix

Monitor, trace, and evaluate LLM application behavior using the Arize Phoenix observability platform

Phoenix is a community skill for monitoring and debugging LLM applications using the Arize Phoenix observability platform, covering trace collection, span analysis, evaluation metrics, prompt tracking, and experiment management for AI application observability.

What Is This?

Overview

Phoenix provides tools for observing and evaluating LLM application behavior in development and production environments. Trace collection captures detailed request and response data from LLM calls, including token usage and latency measurements. Span analysis breaks down complex LLM pipelines into individual steps for performance inspection. Evaluation metrics score LLM outputs for relevance, faithfulness, and toxicity using automated evaluators. Prompt tracking monitors template versions and their impact on output quality across deployments. Experiment management compares different model configurations and prompt strategies systematically. The skill enables teams to understand and improve LLM application behavior through structured observability.

Who Should Use This

This skill serves ML engineers monitoring LLM applications in production, development teams debugging retrieval-augmented generation pipelines, and organizations evaluating LLM output quality across different model configurations and prompt versions.

Why Use It?

Problems It Solves

LLM applications produce non-deterministic outputs that are difficult to debug without detailed trace data showing each processing step. Retrieval pipelines fail silently when retrieved context is irrelevant or incomplete, leaving no visibility into retrieval quality. Prompt changes can degrade output quality in unexpected ways without systematic tools for comparing their effects. Production LLM costs accumulate without clear attribution to specific features or request patterns.

Core Highlights

Trace collector captures LLM request and response data with full pipeline visibility. Span inspector breaks down multi-step AI workflows into analyzable components. Quality evaluator scores outputs for relevance and faithfulness automatically. Experiment tracker compares model and prompt configurations with structured results.

How to Use It?

Basic Usage

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Launch the local Phoenix app that collects and displays traces.
session = px.launch_app()

# Register an OpenTelemetry tracer provider pointed at a named Phoenix project.
tracer_provider = register(project_name='my-llm-app')

# Instrument the OpenAI client so every call is traced automatically.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model='gpt-4',
    messages=[{'role': 'user', 'content': 'Explain quantum computing'}],
)

print(f'View traces: {session.url}')

Real-World Examples

import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)


class RAGEvaluator:
    """Scores RAG outputs for hallucination, QA correctness, and relevance."""

    def __init__(self):
        self.model = OpenAIModel(model='gpt-4')
        self.evals = {
            'hallucination': HallucinationEvaluator(self.model),
            'qa': QAEvaluator(self.model),
            'relevance': RelevanceEvaluator(self.model),
        }

    def evaluate(
        self,
        queries: list[str],
        contexts: list[str],
        responses: list[str],
    ) -> pd.DataFrame:
        # The evaluators expect 'input', 'reference', and 'output' columns.
        df = pd.DataFrame({
            'input': queries,
            'reference': contexts,
            'output': responses,
        })
        results = {}
        for name, evaluator in self.evals.items():
            # run_evals returns one result dataframe per evaluator passed in.
            results[name] = run_evals(dataframe=df, evaluators=[evaluator])[0]
        return pd.concat(results, axis=1)
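A quick sketch of calling the class above on a few placeholder rows; the evaluators issue real OpenAI API calls, so an OPENAI_API_KEY must be available:

evaluator = RAGEvaluator()

scores = evaluator.evaluate(
    queries=['What is Phoenix?'],
    contexts=['Phoenix is an open-source observability platform for LLM applications.'],
    responses=['Phoenix helps you trace and evaluate LLM applications.'],
)
# Columns are grouped by evaluator name; inspect before aggregating.
print(scores.head())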

Advanced Tips

Use custom span attributes to tag traces with metadata like user segments and feature flags for filtered analysis across different application contexts. Set up evaluation pipelines that run automatically on sampled production traces to detect quality degradation before users report issues. Export trace data to dataframes for custom analysis beyond the built-in Phoenix dashboard visualizations.
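A minimal sketch of the first and third tips, assuming a locally launched Phoenix instance; the attribute names are illustrative, and the Client().get_spans_dataframe() export method may differ across Phoenix versions:

import phoenix as px
from phoenix.otel import register

session = px.launch_app()
tracer_provider = register(project_name='my-llm-app')
tracer = tracer_provider.get_tracer(__name__)

# Wrap a pipeline step in a custom span tagged with metadata so traces can be
# filtered by user segment or feature flag later.
with tracer.start_as_current_span('rag-pipeline') as span:
    span.set_attribute('user_segment', 'enterprise')
    span.set_attribute('feature_flag.reranker', 'enabled')
    # ... run retrieval and generation inside this span ...

# Export collected spans to a dataframe for analysis beyond the dashboard.
spans_df = px.Client().get_spans_dataframe()
print(spans_df.head())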

When to Use It?

Use Cases

Monitor a production RAG application by tracing retrieval and generation steps to identify latency bottlenecks and quality issues. Evaluate LLM outputs for hallucination and relevance using automated evaluators on collected trace data. Compare prompt template versions by running experiments that measure quality metrics across configurations.
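As a hedged sketch of the prompt-comparison use case, this reuses the RAGEvaluator class from the earlier example rather than a dedicated experiments API; the queries, contexts, and per-version responses are placeholder data:

queries = ['How do I reset my password?']
contexts = ['Passwords can be reset from the account settings page.']
responses_v1 = ['Go to account settings and choose reset password.']  # prompt template v1
responses_v2 = ['Contact support to reset your password.']            # prompt template v2

evaluator = RAGEvaluator()
scores_v1 = evaluator.evaluate(queries, contexts, responses_v1)
scores_v2 = evaluator.evaluate(queries, contexts, responses_v2)

# Compare aggregate quality metrics across the two prompt versions.
print(scores_v1.mean(numeric_only=True))
print(scores_v2.mean(numeric_only=True))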

Related Topics

LLM observability, Arize Phoenix, tracing, evaluation metrics, prompt engineering, RAG debugging, and AI application monitoring.

Important Notes

Requirements

Phoenix Python package with OpenTelemetry dependencies for trace collection. OpenInference instrumentation packages for the specific LLM providers being monitored. OpenAI API key or equivalent credentials when using LLM-based evaluation functions.

Usage Recommendations

Do: instrument all LLM calls and retrieval steps to get complete pipeline visibility from query to response. Use automated evaluators on sampled traces to continuously monitor output quality in production. Organize traces into projects to separate different applications and environments clearly.
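One simple way to keep environments in separate projects, assuming an APP_ENV environment variable set by the deployment (the variable name is illustrative):

import os
from phoenix.otel import register

# Route staging and production traces into their own Phoenix projects.
environment = os.getenv('APP_ENV', 'staging')
tracer_provider = register(project_name=f'my-llm-app-{environment}')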

Don't: log full request and response payloads in high-traffic production systems without sampling since storage costs grow rapidly with volume. Run expensive LLM-based evaluations on every single trace in production since this doubles API costs. Ignore trace latency data when debugging quality issues since slow responses often correlate with retrieval problems.
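A sketch of sampling before evaluation to bound cost; the exported span column names depend on the Phoenix and instrumentation versions in use, so treat the rename mapping as illustrative:

import phoenix as px

# Pull recent spans and score only a small fraction of them.
spans_df = px.Client().get_spans_dataframe()
sampled = spans_df.sample(frac=0.05, random_state=0)

# Map span columns onto the evaluators' expected 'input'/'output' columns
# (add retrieved context as 'reference'), then score with run_evals as in the
# earlier example. Exact column names vary by instrumentation.
eval_input = sampled.rename(columns={
    'attributes.input.value': 'input',
    'attributes.output.value': 'output',
})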

Limitations

LLM-based evaluators require additional API calls that add cost and latency to the evaluation pipeline. Trace storage grows with application traffic volume requiring retention policies for long-running production deployments. Custom evaluator development requires labeled datasets for calibration against human judgment on domain-specific quality criteria.