Phoenix Evals Migration Guide

⚠️ DEPRECATED INTERFACES

The following interfaces are DEPRECATED and should no longer be used:

phoenix.evals.models module (all model classes)
phoenix.evals.llm_classify function
phoenix.evals.llm_generate function
phoenix.evals.run_evals function
phoenix.evals.templates.PromptTemplate class
All legacy evaluator classes in phoenix.evals root module

Legacy documentation: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/legacy.html

Migration Overview

The new Phoenix Evals API (v2.0+) provides:

Unified LLM interface via phoenix.evals.llm.LLM
Composable evaluators with create_classifier and create_evaluator
Efficient batch processing with evaluate_dataframe
Better error handling and async support
Structured outputs with automatic scoring

Complete Migration Mapping

1. Model Interfaces

DEPRECATED	NEW INTERFACE
`from phoenix.evals.models import OpenAIModel`	`from phoenix.evals.llm import LLM`
`from phoenix.evals.models import AnthropicModel`	`from phoenix.evals.llm import LLM`
`from phoenix.evals.models import GeminiModel`	`from phoenix.evals.llm import LLM`
`from phoenix.evals.models import VertexAIModel`	`from phoenix.evals.llm import LLM`
`from phoenix.evals.models import BedrockModel`	`from phoenix.evals.llm import LLM`
`from phoenix.evals.models import LiteLLMModel`	`from phoenix.evals.llm import LLM`

2. Core Functions

DEPRECATED	NEW INTERFACE
`phoenix.evals.llm_classify`	`phoenix.evals.create_classifier` + `phoenix.evals.evaluate_dataframe`
`phoenix.evals.llm_generate`	`phoenix.evals.llm.LLM.generate_text` or custom evaluator
`phoenix.evals.run_evals`	`phoenix.evals.evaluate_dataframe`

3. Templates

DEPRECATED	NEW INTERFACE
`phoenix.evals.templates.PromptTemplate`	Raw strings or `phoenix.evals.templating.Template`
`phoenix.evals.templates.ClassificationTemplate`	`phoenix.evals.create_classifier` with `choices` parameter
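
When a template is a raw string, it can help to check which placeholders it expects before running it against a dataframe. The sketch below uses only the standard library and assumes the `{name}` placeholder style shown in this guide's examples. `template_variables` and `missing_columns` are hypothetical helper names, not part of Phoenix.

```python
from string import Formatter


def template_variables(template: str) -> list[str]:
    """Return the placeholder names used in a str.format-style template, in order."""
    names = []
    for _literal, field_name, _spec, _conversion in Formatter().parse(template):
        # field_name is None for plain literal text and "" for a bare "{}".
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def missing_columns(template: str, columns: list[str]) -> list[str]:
    """Return placeholders that have no matching column name."""
    return [name for name in template_variables(template) if name not in columns]


template = "Is the response helpful?\n\nQuery: {input}\nResponse: {output}"
print(template_variables(template))
print(missing_columns(template, ["input", "context"]))
```

In practice the second argument would be something like `list(df.columns)`, so a missing column is caught before any model call is made.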

4. Evaluators

DEPRECATED	NEW INTERFACE
`phoenix.evals.LLMEvaluator`	`phoenix.evals.LLMEvaluator` (new implementation)
`phoenix.evals.HallucinationEvaluator`	`phoenix.evals.metrics.HallucinationEvaluator`
`phoenix.evals.RelevanceEvaluator`	Create with `phoenix.evals.create_classifier`
`phoenix.evals.ToxicityEvaluator`	Create with `phoenix.evals.create_classifier`
`phoenix.evals.QAEvaluator`	Create with `phoenix.evals.create_classifier`
`phoenix.evals.SummarizationEvaluator`	Create with `phoenix.evals.create_classifier`
`phoenix.evals.SQLEvaluator`	Create with `phoenix.evals.create_classifier`

Step-by-Step Migration Examples

Example 1: Basic Classification (llm_classify → create_classifier)

DEPRECATED:

from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

# Old way
model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(
    template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}. Respond either as 'helpful' or 'not_helpful'"
)

evals_df = llm_classify(
    data=spans_df,
    model=model,
    rails=["helpful", "not_helpful"],
    template=template,
    exit_on_error=False,
    provide_explanation=True,
)

# Manual score assignment
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "helpful" else 0)

NEW:

from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM

# New way
llm = LLM(provider="openai", model="gpt-4o")

helpfulness_evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},  # Automatic scoring
)

results_df = evaluate_dataframe(
    dataframe=spans_df,
    evaluators=[helpfulness_evaluator],
)
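
The `choices` dictionary takes over the manual score assignment from the deprecated example. The label-to-score lookup it implies can be sketched in plain Python. This only illustrates the mapping and is not Phoenix's internal implementation: `score_for` is a hypothetical helper, and returning `None` for an unexpected label is an assumption made for this sketch.

```python
from typing import Optional

# Same mapping as the `choices` argument in the example above.
choices = {"helpful": 1.0, "not_helpful": 0.0}


def score_for(label: str) -> Optional[float]:
    """Look up the score for a label; None if the label is not a known choice."""
    return choices.get(label)


labels = ["helpful", "not_helpful", "unparseable"]
scores = [score_for(label) for label in labels]
print(scores)  # [1.0, 0.0, None]
```

Note that the deprecated lambda scores every label other than "helpful" as 0, including rows where classification did not produce a valid label. Keeping unknown labels separate, as in this sketch, makes such rows easier to spot when comparing old and new results.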

Example 2: Multiple Evaluators

DEPRECATED:

from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel

model = OpenAIModel(model="gpt-4o")

# Multiple separate calls
relevance_df = llm_classify(data=df, model=model, rails=["relevant", "irrelevant"], ...)
helpfulness_df = llm_classify(data=df, model=model, rails=["helpful", "not_helpful"], ...)
toxicity_df = llm_classify(data=df, model=model, rails=["toxic", "non_toxic"], ...)

# Manual merging required

NEW:

from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o")

# Create multiple evaluators
relevance_evaluator = create_classifier(
    name="relevance",
    prompt_template="Is the response relevant?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

helpfulness_evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},
)

toxicity_evaluator = create_classifier(
    name="toxicity",
    prompt_template="Is the response toxic?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"toxic": 0.0, "non_toxic": 1.0},
)

# Single call evaluates all metrics
results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance_evaluator, helpfulness_evaluator, toxicity_evaluator],
)

Example 3: Text Generation (llm_generate → LLM.generate_text)

DEPRECATED:

from phoenix.evals import llm_generate
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(template="Generate a response to: {query}")

generated_df = llm_generate(
    dataframe=df,
    template=template,
    model=model,
)

NEW:

from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o")

# For single generations
response = llm.generate_text(prompt="Generate a response to: How do I reset my password?")

# For batch processing with dataframes
def generate_responses(row):
    prompt = f"Generate a response to: {row['query']}"
    return llm.generate_text(prompt=prompt)

df['generated_response'] = df.apply(generate_responses, axis=1)
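
Because `df.apply` calls the model once per row, an exception raised for one row would abort the whole batch. One way to keep going is to wrap the call and record the error instead. The sketch below is provider-agnostic: `generate` stands in for any callable such as `llm.generate_text`, and `safe_generate` and `fake_generate` are hypothetical names used only for illustration.

```python
from typing import Callable, Optional, Tuple


def safe_generate(
    generate: Callable[[str], str],
    template: str,
    row: dict,
) -> Tuple[Optional[str], Optional[str]]:
    """Format the prompt for one row and call `generate`.

    Returns (text, None) on success or (None, error message) on failure,
    so one bad row does not stop the rest of the batch.
    """
    try:
        prompt = template.format(**row)
        return generate(prompt), None
    except Exception as exc:  # broad on purpose: keep the batch running
        return None, f"{type(exc).__name__}: {exc}"


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; fails on an empty query."""
    if prompt.endswith(": "):
        raise ValueError("empty query")
    return f"echo[{prompt}]"


rows = [{"query": "How do I reset my password?"}, {"query": ""}]
template = "Generate a response to: {query}"
results = [safe_generate(fake_generate, template, row) for row in rows]
for text, error in results:
    print(text, error)
```

With a dataframe, the same wrapper could be applied per row by passing `row.to_dict()` and a small lambda around `llm.generate_text(prompt=...)` as the `generate` argument.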

Example 4: Custom Evaluators

DEPRECATED:

from phoenix.evals import LLMEvaluator
from phoenix.evals.models import OpenAIModel

class CustomEvaluator(LLMEvaluator):
    def evaluate(self, input_text, output_text):
        # Custom logic
        pass

evaluator = CustomEvaluator(model=OpenAIModel(model="gpt-4o"))

NEW:

from phoenix.evals import create_evaluator, LLMEvaluator
from phoenix.evals.llm import LLM

# Option 1: Function-based evaluator
@create_evaluator(name="custom_metric", direction="maximize")
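def custom_evaluator(input: str, output: str) -> float:
    # Custom heuristic logic: ratio of output length to input length
    if not input:
        return 0.0  # Avoid dividing by zero on an empty input
    return len(output) / len(input)  # Example metric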

# Option 2: LLM-based evaluator
llm = LLM(provider="openai", model="gpt-4o")

class CustomLLMEvaluator(LLMEvaluator):
    def __init__(self):
        super().__init__(
            name="custom_llm_eval",
            llm=llm,
            prompt_template="Evaluate this response: {input} -> {output}",
        )

    def _evaluate(self, eval_input):
        # Custom LLM evaluation logic
        pass

Example 5: Different LLM Providers

DEPRECATED:

from phoenix.evals.models import OpenAIModel, AnthropicModel

openai_model = OpenAIModel(model="gpt-4o")
anthropic_model = AnthropicModel(model="claude-3-sonnet-20240229")

NEW:

from phoenix.evals.llm import LLM

# All providers use the same interface
openai_llm = LLM(provider="openai", model="gpt-4o")
litellm_llm = LLM(provider="litellm", model="claude-3-sonnet-20240229")

Migration Checklist

When migrating your code:

✅ Update imports
- Replace phoenix.evals.models.* with phoenix.evals.llm.LLM
- Replace phoenix.evals.llm_classify with phoenix.evals.create_classifier and phoenix.evals.evaluate_dataframe
- Replace phoenix.evals.llm_generate with direct LLM calls
✅ Update model instantiation
- Use unified LLM(provider="...", model="...") interface
- Remove provider-specific model classes
✅ Replace function calls
- Convert llm_classify to create_classifier + evaluate_dataframe
- Convert llm_generate to LLM.generate_text
- Convert run_evals to evaluate_dataframe
✅ Update templates
- Use raw strings instead of PromptTemplate objects
- Replace rails parameter with choices dictionary
✅ Update evaluators
- Use create_classifier for classification tasks
- Use create_evaluator decorator for custom metrics
- Import built-in evaluators from phoenix.evals.metrics
✅ Test the migration
- Verify outputs match expected format
- Check that scores are properly assigned
- Ensure error handling works as expected
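
To find code that still needs the import and function-call updates listed above, a small scanner over your source files can help. This is a rough sketch using only the standard library. It does plain-text matching on the deprecated names from this guide, so it can report false positives (for example, names that appear in comments), and `find_deprecated_usages` is a hypothetical helper name.

```python
import re

# Deprecated names taken from the list at the top of this guide.
DEPRECATED_PATTERNS = {
    "phoenix.evals.models": re.compile(r"\bphoenix\.evals\.models\b"),
    "llm_classify": re.compile(r"\bllm_classify\b"),
    "llm_generate": re.compile(r"\bllm_generate\b"),
    "run_evals": re.compile(r"\brun_evals\b"),
    "PromptTemplate": re.compile(r"\bPromptTemplate\b"),
}


def find_deprecated_usages(source: str) -> list[tuple[int, str]]:
    """Return (line number, deprecated name) pairs found in the source text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in DEPRECATED_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample = (
    "from phoenix.evals import llm_classify\n"
    "from phoenix.evals.models import OpenAIModel\n"
    "from phoenix.evals.llm import LLM\n"
)
for lineno, name in find_deprecated_usages(sample):
    print(f"line {lineno}: {name}")
```

To scan a project, the function could be run over each file returned by `pathlib.Path(...).rglob("*.py")`, passing `path.read_text()` as the source.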

Getting Help

New API Documentation: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/evals.html
Legacy API Reference: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/legacy.html
