Evals Migration.mdc
how to migrate to the new evals interfaces
Phoenix Evals Migration Guide
⚠️ DEPRECATED INTERFACES
The following interfaces are DEPRECATED and should no longer be used:
- phoenix.evals.models module (all model classes)
- phoenix.evals.llm_classify function
- phoenix.evals.llm_generate function
- phoenix.evals.run_evals function
- phoenix.evals.templates.PromptTemplate class
- All legacy evaluator classes in the phoenix.evals root module
Legacy documentation: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/legacy.html
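Before migrating, it can help to locate every use of the deprecated interfaces in a codebase. The following is a minimal sketch, not an official tool: the pattern table and the `find_deprecated_usages` helper are hypothetical names introduced here, and the regular expressions are deliberately simple, so they may flag unrelated identifiers that happen to share a name.

```python
import re

# Patterns for the deprecated interfaces listed above. These are
# illustrative and deliberately simple; they are not an official tool.
DEPRECATED_PATTERNS = {
    "phoenix.evals.models": re.compile(r"\bphoenix\.evals\.models\b"),
    "llm_classify": re.compile(r"\bllm_classify\b"),
    "llm_generate": re.compile(r"\bllm_generate\b"),
    "run_evals": re.compile(r"\brun_evals\b"),
    "PromptTemplate": re.compile(r"\bPromptTemplate\b"),
}


def find_deprecated_usages(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, interface, line_text) for each deprecated hit."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for interface, pattern in DEPRECATED_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, interface, line.strip()))
    return hits


sample = """from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.llm import LLM
"""

for line_number, interface, text in find_deprecated_usages(sample):
    print(f"line {line_number}: {interface}: {text}")
```

In practice the `source` string would come from reading each `.py` file in the project; the third sample line uses the new interface and is not reported.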
Migration Overview
The new Phoenix Evals API (v2.0+) provides:
- Unified LLM interface via phoenix.evals.llm.LLM
- Composable evaluators with create_classifier and create_evaluator
- Efficient batch processing with evaluate_dataframe
- Better error handling and async support
- Structured outputs with automatic scoring
Complete Migration Mapping
1. Model Interfaces
| DEPRECATED | NEW INTERFACE |
|---|---|
| from phoenix.evals.models import OpenAIModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import AnthropicModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import GeminiModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import VertexAIModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import BedrockModel | from phoenix.evals.llm import LLM |
| from phoenix.evals.models import LiteLLMModel | from phoenix.evals.llm import LLM |
2. Core Functions
| DEPRECATED | NEW INTERFACE |
|---|---|
| phoenix.evals.llm_classify | phoenix.evals.create_classifier + phoenix.evals.evaluate_dataframe |
| phoenix.evals.llm_generate | phoenix.evals.llm.LLM.generate_text or custom evaluator |
| phoenix.evals.run_evals | phoenix.evals.evaluate_dataframe |
3. Templates
| DEPRECATED | NEW INTERFACE |
|---|---|
| phoenix.evals.templates.PromptTemplate | Raw strings or phoenix.evals.templating.Template |
| phoenix.evals.templates.ClassificationTemplate | phoenix.evals.create_classifier with choices parameter |
4. Evaluators
| DEPRECATED | NEW INTERFACE |
|---|---|
| phoenix.evals.LLMEvaluator | phoenix.evals.LLMEvaluator (new implementation) |
| phoenix.evals.HallucinationEvaluator | phoenix.evals.metrics.HallucinationEvaluator |
| phoenix.evals.RelevanceEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.ToxicityEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.QAEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.SummarizationEvaluator | Create with phoenix.evals.create_classifier |
| phoenix.evals.SQLEvaluator | Create with phoenix.evals.create_classifier |
Step-by-Step Migration Examples
Example 1: Basic Classification (llm_classify → create_classifier)
DEPRECATED:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate
# Old way
model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(
    template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}. Respond either as 'helpful' or 'not_helpful'"
)
evals_df = llm_classify(
    data=spans_df,
    model=model,
    rails=["helpful", "not_helpful"],
    template=template,
    exit_on_error=False,
    provide_explanation=True,
)
# Manual score assignment
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "helpful" else 0)
NEW:
import pandas as pd
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM
# New way
llm = LLM(provider="openai", model="gpt-4o")
helpfulness_evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},  # Automatic scoring
)
results_df = evaluate_dataframe(
    dataframe=spans_df,
    evaluators=[helpfulness_evaluator],
)
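The change from a manual score assignment to a `choices` mapping can be illustrated without calling the library at all. The sketch below is plain Python: `legacy_score` mirrors the lambda from the deprecated example, and `choices_score` is a hypothetical helper that stands in for a dictionary lookup. It is not how the library computes scores internally. The `NOT_PARSABLE` label is used here as an example of an unexpected label.

```python
# Label-to-score mapping in the same shape as the `choices` argument above.
choices = {"helpful": 1.0, "not_helpful": 0.0}


def legacy_score(label):
    """The manual mapping from the deprecated example."""
    return 1 if label == "helpful" else 0


def choices_score(label, choices):
    """Look up the score for a label; unknown labels yield None, not 0."""
    return choices.get(label)


for label in ["helpful", "not_helpful", "NOT_PARSABLE"]:
    print(label, legacy_score(label), choices_score(label, choices))
```

The manual lambda silently scores any unexpected label as 0, which makes a parsing failure indistinguishable from a genuine "not_helpful" result. An explicit mapping keeps the two cases apart.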
Example 2: Multiple Evaluators
DEPRECATED:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
model = OpenAIModel(model="gpt-4o")
# Multiple separate calls
relevance_df = llm_classify(data=df, model=model, rails=["relevant", "irrelevant"], ...)
helpfulness_df = llm_classify(data=df, model=model, rails=["helpful", "not_helpful"], ...)
toxicity_df = llm_classify(data=df, model=model, rails=["toxic", "non_toxic"], ...)
# Manual merging required
NEW:
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o")
# Create multiple evaluators
relevance_evaluator = create_classifier(
    name="relevance",
    prompt_template="Is the response relevant?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
helpfulness_evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Is the response helpful?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},
)
toxicity_evaluator = create_classifier(
    name="toxicity",
    prompt_template="Is the response toxic?\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"toxic": 0.0, "non_toxic": 1.0},
)
# Single call evaluates all metrics
results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance_evaluator, helpfulness_evaluator, toxicity_evaluator],
)
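When several evaluators share one dataframe, a quick pre-flight check can confirm that every placeholder in each prompt template has a matching column. The sketch below uses only the standard library and assumes, as the examples above suggest, that placeholders such as `{input}` and `{output}` correspond to dataframe column names. `template_fields` and `missing_columns` are hypothetical helpers, not part of Phoenix.

```python
from string import Formatter


def template_fields(prompt_template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(prompt_template)
        if field_name
    }


def missing_columns(prompt_template: str, columns) -> set[str]:
    """Return placeholders that have no matching column name."""
    return template_fields(prompt_template) - set(columns)


template = "Is the response relevant?\n\nQuery: {input}\nResponse: {output}"
print(template_fields(template))
print(missing_columns(template, ["input", "context"]))
```

With a real dataframe, `columns` would be `df.columns`; a non-empty result points at a column that needs to be added or renamed before evaluation.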
Example 3: Text Generation (llm_generate → LLM.generate_text)
DEPRECATED:
from phoenix.evals import llm_generate
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate
model = OpenAIModel(model="gpt-4o")
template = PromptTemplate(template="Generate a response to: {query}")
generated_df = llm_generate(
    dataframe=df,
    template=template,
    model=model,
)
NEW:
from phoenix.evals.llm import LLM
llm = LLM(provider="openai", model="gpt-4o")
# For single generations
response = llm.generate_text(prompt="Generate a response to: How do I reset my password?")
# For batch processing with dataframes
def generate_responses(row):
    prompt = f"Generate a response to: {row['query']}"
    return llm.generate_text(prompt=prompt)
df['generated_response'] = df.apply(generate_responses, axis=1)
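The row-by-row `apply` shown above issues one request at a time and stops at the first exception. A possible alternative, sketched here under stated assumptions, runs the calls in a thread pool and records `None` for failed rows. `generate_batch` is a hypothetical helper, and `fake_generate` is a stand-in for `llm.generate_text` so that the sketch runs offline. Whether the real client is safe to call from multiple threads is an assumption to verify before using this pattern.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def generate_batch(
    prompts: list[str],
    generate: Callable[[str], str],
    max_workers: int = 4,
) -> list[Optional[str]]:
    """Run `generate` over prompts concurrently, preserving input order.

    A failed call yields None for that position instead of aborting the batch.
    """

    def safe_generate(prompt: str) -> Optional[str]:
        try:
            return generate(prompt)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_generate, prompts))


# Hypothetical stand-in for llm.generate_text so the sketch runs offline.
def fake_generate(prompt: str) -> str:
    if not prompt.strip():
        raise ValueError("empty prompt")
    return f"echo: {prompt}"


queries = ["How do I reset my password?", "", "Where is my invoice?"]
prompts = [f"Generate a response to: {q}" if q else "" for q in queries]
responses = generate_batch(prompts, fake_generate)
print(responses)
```

With a real client, the callable could be `lambda p: llm.generate_text(prompt=p)`, and the returned list can be assigned to a dataframe column because the input order is preserved.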
Example 4: Custom Evaluators
DEPRECATED:
from phoenix.evals import LLMEvaluator
from phoenix.evals.models import OpenAIModel
class CustomEvaluator(LLMEvaluator):
    def evaluate(self, input_text, output_text):
        # Custom logic
        pass
evaluator = CustomEvaluator(model=OpenAIModel(model="gpt-4o"))
NEW:
from phoenix.evals import create_evaluator, LLMEvaluator
from phoenix.evals.llm import LLM
# Option 1: Function-based evaluator
@create_evaluator(name="custom_metric", direction="maximize")
def custom_evaluator(input: str, output: str) -> float:
    # Custom heuristic logic: ratio of output length to input length.
    # max(..., 1) guards against division by zero on an empty input.
    return len(output) / max(len(input), 1)
# Option 2: LLM-based evaluator
llm = LLM(provider="openai", model="gpt-4o")
class CustomLLMEvaluator(LLMEvaluator):
    def __init__(self):
        super().__init__(
            name="custom_llm_eval",
            llm=llm,
            prompt_template="Evaluate this response: {input} -> {output}",
        )
    def _evaluate(self, eval_input):
        # Custom LLM evaluation logic
        pass
Example 5: Different LLM Providers
DEPRECATED:
from phoenix.evals.models import OpenAIModel, AnthropicModel, GeminiModel
openai_model = OpenAIModel(model="gpt-4o")
anthropic_model = AnthropicModel(model="claude-3-sonnet-20240229")
NEW:
from phoenix.evals.llm import LLM
# All providers use the same interface
openai_llm = LLM(provider="openai", model="gpt-4o")
litellm_llm = LLM(provider="litellm", model="claude-3-sonnet-20240229")
Migration Checklist
When migrating your code:
- ✅ Update imports
  - Replace phoenix.evals.models.* with phoenix.evals.llm.LLM
  - Replace phoenix.evals.llm_classify with phoenix.evals.create_classifier
  - Replace phoenix.evals.llm_generate with direct LLM calls
- ✅ Update model instantiation
  - Use unified LLM(provider="...", model="...") interface
  - Remove provider-specific model classes
- ✅ Replace function calls
  - Convert llm_classify to create_classifier + evaluate_dataframe
  - Convert llm_generate to LLM.generate_text
  - Convert run_evals to evaluate_dataframe
- ✅ Update templates
  - Use raw strings instead of PromptTemplate objects
  - Replace rails parameter with choices dictionary
- ✅ Update evaluators
  - Use create_classifier for classification tasks
  - Use create_evaluator decorator for custom metrics
  - Import built-in evaluators from phoenix.evals.metrics
- ✅ Test the migration
  - Verify outputs match expected format
  - Check that scores are properly assigned
  - Ensure error handling works as expected
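The final checklist item can be partly automated. The sketch below checks that each result carries a known label and that its score agrees with the `choices` mapping. The record shape (dicts with "label" and "score" keys) is an assumption made for illustration and is not the documented output format of `evaluate_dataframe`; results would need to be extracted into this shape first. `check_results` is a hypothetical helper.

```python
def check_results(results, choices):
    """Return a list of human-readable problems found in evaluator results.

    `results` is a list of dicts with "label" and "score" keys; this shape
    is an assumption for illustration, not the library's documented output.
    """
    problems = []
    for index, result in enumerate(results):
        label = result.get("label")
        score = result.get("score")
        if label not in choices:
            problems.append(f"row {index}: unexpected label {label!r}")
        elif score != choices[label]:
            problems.append(
                f"row {index}: score {score!r} does not match label {label!r}"
            )
    return problems


choices = {"helpful": 1.0, "not_helpful": 0.0}
results = [
    {"label": "helpful", "score": 1.0},
    {"label": "not_helpful", "score": 1.0},  # score disagrees with label
    {"label": None, "score": None},  # e.g. a failed evaluation
]

for problem in check_results(results, choices):
    print(problem)
```

An empty list means every row had a recognized label and a consistent score; rows with missing labels are reported separately so that failed evaluations are not mistaken for low scores.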
Getting Help
- New API Documentation: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/evals.html
- Legacy API Reference: https://arize-phoenix.readthedocs.io/projects/evals/en/latest/api/legacy.html