Evals Migration.mdc

How to migrate to the new Phoenix Evals interfaces.

Published Feb 7, 2026

Phoenix Evals Migration Guide

⚠️ DEPRECATED INTERFACES

The following interfaces are DEPRECATED and should no longer be used:

  • phoenix.evals.models module (all model classes)
  • phoenix.evals.llm_classify function
  • phoenix.evals.llm_generate function
  • phoenix.evals.run_evals function
  • phoenix.evals.templates.PromptTemplate class
  • All legacy evaluator classes in phoenix.evals root module

Legacy documentation: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/legacy.html

Migration Overview

The new Phoenix Evals API (v2.0+) provides:

  • Unified LLM interface via phoenix.evals.llm.LLM
  • Composable evaluators with create_classifier and create_evaluator
  • Efficient batch processing with evaluate_dataframe
  • Better error handling and async support
  • Structured outputs with automatic scoring

Complete Migration Mapping

1. Model Interfaces

| DEPRECATED | NEW INTERFACE |
| --- | --- |
| from phoenix.evals.models import OpenAIModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import AnthropicModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import GeminiModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import VertexAIModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import BedrockModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import LiteLLMModel | from phoenix.evals.llm import LLM |

2. Core Functions

| DEPRECATED | NEW INTERFACE |
| --- | --- |
| phoenix.evals.llm_classify | phoenix.evals.create_classifier + phoenix.evals.evaluate_dataframe |
| phoenix.evals.llm_generate | phoenix.evals.llm.LLM.generate_text or custom evaluator |
| phoenix.evals.run_evals | phoenix.evals.evaluate_dataframe |

3. Templates

| DEPRECATED | NEW INTERFACE |
| --- | --- |
| phoenix.evals.templates.PromptTemplate | Raw strings or phoenix.evals.templating.Template |
| phoenix.evals.templates.ClassificationTemplate | phoenix.evals.create_classifier with choices parameter |

4. Evaluators

| DEPRECATED | NEW INTERFACE |
| --- | --- |
| phoenix.evals.LLMEvaluator | phoenix.evals.LLMEvaluator (new implementation) |
| phoenix.evals.HallucinationEvaluator | phoenix.evals.metrics.HallucinationEvaluator |
| phoenix.evals.RelevanceEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.ToxicityEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.QAEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.SummarizationEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.SQLEvaluator | Create with phoenix.evals.create_classifier |
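
As a starting point for applying the mappings above, a small script can flag legacy imports in existing code. The sketch below is illustrative only: find_deprecated_imports and its lookup tables are hypothetical helpers (not part of Phoenix), filled in from the deprecated names listed in this guide. It inspects only "from ... import ..." statements, and it does not flag LLMEvaluator, because that name exists in both the old and the new API.

```python
import ast

# Hypothetical lookup tables, built from the mapping tables in this guide.
CLASSIFIER_HINT = "build with phoenix.evals.create_classifier"
DEPRECATED_NAMES = {
    "phoenix.evals": {
        "llm_classify": "create_classifier + evaluate_dataframe",
        "llm_generate": "LLM.generate_text or a custom evaluator",
        "run_evals": "evaluate_dataframe",
        "HallucinationEvaluator": "phoenix.evals.metrics.HallucinationEvaluator",
        "RelevanceEvaluator": CLASSIFIER_HINT,
        "ToxicityEvaluator": CLASSIFIER_HINT,
        "QAEvaluator": CLASSIFIER_HINT,
        "SummarizationEvaluator": CLASSIFIER_HINT,
        "SQLEvaluator": CLASSIFIER_HINT,
    },
    "phoenix.evals.templates": {
        "PromptTemplate": "raw strings or phoenix.evals.templating.Template",
        "ClassificationTemplate": "create_classifier with the choices parameter",
    },
}
MODELS_MODULE = "phoenix.evals.models"  # every class in this module is deprecated
MODELS_HINT = "phoenix.evals.llm.LLM"


def find_deprecated_imports(source):
    """Return sorted (line, deprecated name, suggested replacement) tuples."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ImportFrom):
            continue
        for alias in node.names:
            if node.module == MODELS_MODULE:
                hint = MODELS_HINT
            else:
                hint = DEPRECATED_NAMES.get(node.module, {}).get(alias.name)
            if hint is not None:
                findings.append((node.lineno, f"{node.module}.{alias.name}", hint))
    return sorted(findings)


example = (
    "from phoenix.evals import llm_classify, create_classifier\n"
    "from phoenix.evals.models import OpenAIModel\n"
)
for line, name, hint in find_deprecated_imports(example):
    print(f"line {line}: {name} -> use {hint}")
```

Plain "import phoenix.evals.models" statements and attribute access such as phoenix.evals.llm_classify(...) are not detected by this sketch and would need a separate check.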

Step-by-Step Migration Examples

Example 1: Basic Classification (llm_classify → create_classifier)

DEPRECATED:

from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

# Old way
model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(
    template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}. Respond either as 'helpful' or 'not_helpful'"
)

evals_df = llm_classify(
    data=spans_df,
    model=model,
    rails=["helpful", "not_helpful"],
    template=template,
    exit_on_error=False,
    provide_explanation=True,
)

# Manual score assignment
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "helpful" else 0)

NEW:

from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM

# New way
llm = LLM(provider="openai", model="gpt-4o")

helpfulness_evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},  # Automatic scoring
)

results_df = evaluate_dataframe(
    dataframe=spans_df,
    evaluators=[helpfulness_evaluator],
)
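
When many legacy calls need converting, the rails list and the manual score assignment can be folded into a choices dictionary mechanically. The helper below is a minimal sketch under that assumption; rails_to_choices is a hypothetical name, not a Phoenix function, and it only covers the binary 1.0/0.0 scoring used in this example.

```python
def rails_to_choices(rails, positive_labels):
    """Build a choices mapping: labels in positive_labels score 1.0, all others 0.0."""
    positive = set(positive_labels)
    unknown = positive - set(rails)
    if unknown:
        # Catch typos early instead of silently scoring every label as 0.0.
        raise ValueError(f"positive labels not in rails: {sorted(unknown)}")
    return {label: 1.0 if label in positive else 0.0 for label in rails}


choices = rails_to_choices(["helpful", "not_helpful"], positive_labels=["helpful"])
print(choices)  # {'helpful': 1.0, 'not_helpful': 0.0}
```

The resulting dictionary can be passed as the choices argument of create_classifier, replacing both the rails list and the lambda-based score column from the deprecated version.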

Example 2: Multiple Evaluators

DEPRECATED:

from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel

model = OpenAIModel(model="gpt-4o")

# Multiple separate calls
relevance_df = llm_classify(data=df, model=model, rails=["relevant", "irrelevant"], ...)
helpfulness_df = llm_classify(data=df, model=model, rails=["helpful", "not_helpful"], ...)
toxicity_df = llm_classify(data=df, model=model, rails=["toxic", "non_toxic"], ...)

# Manual merging required

NEW:

from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o")

# Create multiple evaluators
relevance_evaluator = create_classifier(
    name="relevance",
    prompt_template="Is the response relevant?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

helpfulness_evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},
)

toxicity_evaluator = create_classifier(
    name="toxicity",
    prompt_template="Is the response toxic?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"toxic": 0.0, "non_toxic": 1.0},
)

# Single call evaluates all metrics
results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance_evaluator, helpfulness_evaluator, toxicity_evaluator],
)

Example 3: Text Generation (llm_generate → LLM.generate_text)

DEPRECATED:

from phoenix.evals import llm_generate
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(template="Generate a response to: {query}")

generated_df = llm_generate(
    dataframe=df,
    template=template,
    model=model,
)

NEW:

from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o")

# For single generations
response = llm.generate_text(prompt="Generate a response to: How do I reset my password?")

# For batch processing with dataframes
def generate_responses(row):
    prompt = f"Generate a response to: {row['query']}"
    return llm.generate_text(prompt=prompt)

df['generated_response'] = df.apply(generate_responses, axis=1)
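
Note that a row-wise apply runs sequentially and stops at the first exception. One possible way to keep a batch going is to wrap the generation call so that each failure is recorded per prompt. The sketch below is an illustration, not part of Phoenix: generate_all is a hypothetical helper, and fake_generate stands in for llm.generate_text so the example runs without an API key. Whether the LLM wrapper is safe to call from several threads is an assumption to verify before raising max_workers.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_all(prompts, generate, max_workers=4):
    """Call generate(prompt) for each prompt; return (text, error) pairs in input order."""

    def safe_call(prompt):
        try:
            return generate(prompt), None
        except Exception as exc:  # record the failure and continue with other prompts
            return None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(safe_call, prompts))


def fake_generate(prompt):
    """Stand-in for llm.generate_text; raises on blank prompts to simulate a failure."""
    if not prompt.strip():
        raise ValueError("empty prompt")
    return f"echo: {prompt}"


results = generate_all(["How do I reset my password?", " "], fake_generate)
for text, error in results:
    print(text if error is None else f"FAILED: {error}")
```

With a real client, the callable could be lambda p: llm.generate_text(prompt=p), and the text column could then be assigned from [text for text, _ in results].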

Example 4: Custom Evaluators

DEPRECATED:

from phoenix.evals import LLMEvaluator
from phoenix.evals.models import OpenAIModel

class CustomEvaluator(LLMEvaluator):
    def evaluate(self, input_text, output_text):
        # Custom logic
        pass

evaluator = CustomEvaluator(model=OpenAIModel(model="gpt-4o"))

NEW:

from phoenix.evals import create_evaluator, LLMEvaluator
from phoenix.evals.llm import LLM

# Option 1: Function-based evaluator
@create_evaluator(name="custom_metric", direction="maximize")
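def custom_evaluator(input: str, output: str) -> float:
    # Custom heuristic logic: ratio of output length to input length
    if not input:
        return 0.0  # Avoid division by zero on empty input
    return len(output) / len(input)  # Example metric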

# Option 2: LLM-based evaluator
llm = LLM(provider="openai", model="gpt-4o")

class CustomLLMEvaluator(LLMEvaluator):
    def __init__(self):
        super().__init__(
            name="custom_llm_eval",
            llm=llm,
            prompt_template="Evaluate this response: {input} -> {output}",
        )

    def _evaluate(self, eval_input):
        # Custom LLM evaluation logic
        pass

Example 5: Different LLM Providers

DEPRECATED:

from phoenix.evals.models import OpenAIModel, AnthropicModel

openai_model = OpenAIModel(model="gpt-4o")
anthropic_model = AnthropicModel(model="claude-3-sonnet-20240229")

NEW:

from phoenix.evals.llm import LLM

# All providers use the same interface
openai_llm = LLM(provider="openai", model="gpt-4o")
litellm_llm = LLM(provider="litellm", model="claude-3-sonnet-20240229")

Migration Checklist

When migrating your code:

  1. ✅ Update imports
    • Replace phoenix.evals.models.* with phoenix.evals.llm.LLM
    • Replace phoenix.evals.llm_classify with phoenix.evals.create_classifier
    • Replace phoenix.evals.llm_generate with direct LLM calls

  2. ✅ Update model instantiation
    • Use unified LLM(provider="...", model="...") interface
    • Remove provider-specific model classes

  3. ✅ Replace function calls
    • Convert llm_classify to create_classifier + evaluate_dataframe
    • Convert llm_generate to LLM.generate_text
    • Convert run_evals to evaluate_dataframe

  4. ✅ Update templates
    • Use raw strings instead of PromptTemplate objects
    • Replace rails parameter with choices dictionary

  5. ✅ Update evaluators
    • Use create_classifier for classification tasks
    • Use create_evaluator decorator for custom metrics
    • Import built-in evaluators from phoenix.evals.metrics

  6. ✅ Test the migration
    • Verify outputs match expected format
    • Check that scores are properly assigned
    • Ensure error handling works as expected
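
The last checklist step can be partly automated by comparing labels from a legacy run with labels and scores from the migrated run on the same rows. The function below is a rough sketch, assuming both sets of results have already been extracted into plain Python lists; check_migration is a hypothetical helper, and how labels and scores are pulled out of the evaluate_dataframe output is left open here.

```python
def check_migration(legacy_labels, new_labels, new_scores, choices):
    """Compare legacy and migrated labels, and check scores against the choices mapping."""
    if not (len(legacy_labels) == len(new_labels) == len(new_scores)):
        raise ValueError("all inputs must have the same length")

    # Rows where the migrated evaluator disagrees with the legacy label.
    label_mismatches = [
        i for i, (old, new) in enumerate(zip(legacy_labels, new_labels)) if old != new
    ]
    # Rows where the score is not the one that choices assigns to the label.
    score_mismatches = [
        i
        for i, (label, score) in enumerate(zip(new_labels, new_scores))
        if choices.get(label) != score
    ]
    total = len(legacy_labels)
    agreement = 1.0 - len(label_mismatches) / total if total else 1.0
    return {
        "agreement": agreement,
        "label_mismatches": label_mismatches,
        "score_mismatches": score_mismatches,
    }


report = check_migration(
    legacy_labels=["helpful", "not_helpful", "helpful", "helpful"],
    new_labels=["helpful", "not_helpful", "not_helpful", "helpful"],
    new_scores=[1.0, 0.0, 0.0, 0.0],
    choices={"helpful": 1.0, "not_helpful": 0.0},
)
print(report)
```

Some label disagreement is expected because LLM outputs vary between runs, so the agreement rate is better read as a sanity check than as a pass/fail gate. A non-empty score_mismatches list, by contrast, points to a choices mapping that does not match what was intended.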

Getting Help
