Creating a New Built-in Classification Metric

This guide describes how to create a new built-in classification evaluator metric for Phoenix evals. Follow these steps in order.

Views0
PublishedFeb 7, 2026

Loading actions...

5 minBeginnerpromptSingle file

Skill content

Main instructions and any bundled files for this skill.

markdown

Creating a New Built-in Classification Metric

This guide describes how to create a new built-in classification evaluator metric for Phoenix evals. Follow these steps in order.

Overview

Built-in metrics consist of:

  1. YAML Config - Prompt template with criteria
  2. Generated Types - Auto-generated Python and TypeScript code
  3. Python Evaluator - Python class wrapping the config
  4. TypeScript Evaluator - TypeScript factory function
  5. Benchmark Suite - Synthetic test examples

Step 1: Create the YAML Config

Create a new file in prompts/classification_evaluator_configs/ named {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.

Required fields:

name: metric_name # lowercase, snake_case
description: Brief description of what this metric evaluates
optimization_direction: maximize # or minimize or neutral
messages:
  - role: user
    content: >-
      # Your prompt template here
      # Use mustache {{placeholder}} for template variables
choices:
  correct: 1.0 # Map label to score
  incorrect: 0.0 # Adjust labels as needed

Template placeholders: Use {{variable_name}} syntax (Mustache format). IMPORTANT: If the user does not specify what input data is provided, ask follow-up questions so you know exactly what placeholders are needed in the prompt template and what they should be called.

Common placeholders:

  • {{input}} - User query or conversation context
  • {{output}} - LLM response to evaluate
  • {{reference}} - Ground truth or expected output

Reference existing configs:

  • TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Tool selection evaluation
  • TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Tool invocation evaluation
  • CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Response correctness
  • HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Hallucination detection

Step 2: Compile Prompts

Run to generate Python and TypeScript types:

tox -e compile_prompts

This generates:

  • packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/
  • js/packages/phoenix-evals/src/__generated__/default_templates/

Step 3: Create Python Evaluator

Create packages/phoenix-evals/src/phoenix/evals/metrics/{metric_name}.py:

from pydantic import BaseModel, Field

from ..__generated__.classification_evaluator_configs import (
    {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG,
)
from ..evaluators import ClassificationEvaluator
from ..llm import LLM
from ..llm.prompts import PromptTemplate


class {MetricName}Evaluator(ClassificationEvaluator):
    """
    Docstring describing the evaluator.

    Args:
        llm (LLM): The LLM instance to use for evaluation.

    Notes:
        - What this metric evaluates
        - What it returns
        - Requirements

    Examples::

        from phoenix.evals.metrics.{metric_name} import {MetricName}Evaluator
        from phoenix.evals import LLM

        llm = LLM(provider="openai", model="gpt-4o-mini")
        evaluator = {MetricName}Evaluator(llm=llm)
        scores = evaluator.evaluate({
            "input": "...",
            "output": "...",
        })
    """

    NAME = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name
    PROMPT = PromptTemplate(
        template=[
            msg.model_dump() for msg in {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.messages
        ],
    )
    CHOICES = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices
    DIRECTION = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimization_direction

    class {MetricName}InputSchema(BaseModel):
        # Define input fields matching template placeholders
        input: str = Field(description="Description of this field")
        output: str = Field(description="Description of this field")

    def __init__(self, llm: LLM):
        super().__init__(
            name=self.NAME,
            llm=llm,
            prompt_template=self.PROMPT.template,
            choices=self.CHOICES,
            direction=self.DIRECTION,
            input_schema=self.{MetricName}InputSchema,
        )

Step 4: Create TypeScript Evaluator

Create js/packages/phoenix-evals/src/llm/create{MetricName}Evaluator.ts:

import { {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG } from "../__generated__/default_templates";
import { CreateClassificationEvaluatorArgs } from "../types/evals";
import { ClassificationEvaluator } from "./ClassificationEvaluator";
import { createClassificationEvaluator } from "./createClassificationEvaluator";

export interface {MetricName}EvaluatorArgs<
  RecordType extends Record<string, unknown> = {MetricName}EvaluationRecord,
> extends Omit<
    CreateClassificationEvaluatorArgs<RecordType>,
    "promptTemplate" | "choices" | "optimizationDirection" | "name"
  > {
  optimizationDirection?: CreateClassificationEvaluatorArgs<RecordType>["optimizationDirection"];
  name?: CreateClassificationEvaluatorArgs<RecordType>["name"];
  choices?: CreateClassificationEvaluatorArgs<RecordType>["choices"];
  promptTemplate?: CreateClassificationEvaluatorArgs<RecordType>["promptTemplate"];
}

export type {MetricName}EvaluationRecord = {
  input: string;
  output: string;
  // Add fields matching template placeholders
};

/**
 * Creates a {metric_name} evaluator function.
 *
 * @example
 * ```ts
 * const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });
 * const result = await evaluator.evaluate({
 *   input: "...",
 *   output: "...",
 * });
 * ```
 */
export function create{MetricName}Evaluator<
  RecordType extends Record<string, unknown> = {MetricName}EvaluationRecord,
>(args: {MetricName}EvaluatorArgs<RecordType>): ClassificationEvaluator<RecordType> {
  const {
    choices = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices,
    promptTemplate = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.template,
    optimizationDirection = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimizationDirection,
    name = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name,
    ...rest
  } = args;
  return createClassificationEvaluator<RecordType>({
    ...rest,
    promptTemplate,
    choices,
    optimizationDirection,
    name,
  });
}

Add export to js/packages/phoenix-evals/src/llm/index.ts:

export * from "./create{MetricName}Evaluator";

Step 5: Build JS Packages

cd js && pnpm build

Step 6: Create Benchmark Suite

Create js/benchmarks/evals-benchmarks/src/{metric_name}_benchmark.ts.

Reference existing benchmarks in the same directory for patterns:

  • conciseness_benchmark.ts - Good example with failed examples printout
  • correctness_benchmark.ts - Good example of category-based organization
  • tool_invocation_benchmark.ts - Multi-tool and context handling
  • document_relevance_benchmark.ts - Large synthetic dataset

Target dataset size: Aim for 30-50 synthetic examples covering:

  • 2-4 examples per failure mode (incorrect cases)
  • 2-4 examples per success scenario (correct cases)
  • At least 2 edge case categories

Synthetic dataset creation: For complex metrics, consider initiating a separate AI agent session dedicated to synthetic dataset generation. This agent can:

  • Focus solely on creating realistic, diverse test cases
  • Iterate on edge cases without context switching
  • Generate examples in batches by category

Structure:

import { createDataset } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, getExperiment, runExperiment } from "@arizeai/phoenix-client/experiments";
import { create{MetricName}Evaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });

// Define examples by category (target: 30-50 total examples)
const examplesByCategory = {
  // Failure modes (2-4 examples each)
  failure_mode_1: [
    { input: "...", output: "...", expected_label: "incorrect" as const },
    // ...
  ],
  // Success cases (2-4 examples each)
  correct_case_1: [
    { input: "...", output: "...", expected_label: "correct" as const },
    // ...
  ],
  // Edge cases
  edge_cases: [
    // ...
  ],
};

// TaskOutput must include `input` and `output` text so failed examples can be
// printed with full context for debugging.
type TaskOutput = {
  expected_label: string;
  label: string;
  score: number;
  explanation: string;
  category: string;
  input: string;
  output: string;
};

// Accuracy evaluator to compare predicted vs expected labels
const accuracyEvaluator = asExperimentEvaluator({
  name: "accuracy",
  kind: "CODE",
  evaluate: async (args) => {
    const output = args.output as TaskOutput;
    const score = output.expected_label === output.label ? 1 : 0;
    return {
      label: score === 1 ? "accurate" : "inaccurate",
      score,
      explanation: `Expected: ${output.expected_label}, Got: ${output.label}`,
    };
  },
});

// The task function must return input/output text alongside the eval result
// so that the failed examples printer can display what went wrong.
const task = async (example) => {
  const input = example.input.question as string;
  const output = example.output.answer as string;
  const expectedLabel = example.output.expected_label as string;

  const evalResult = await evaluator.evaluate({ input, output });

  return {
    expected_label: expectedLabel,
    category: example.metadata?.category as string,
    input,
    output,
    ...evalResult,
  };
};

async function main() {
  const dataset = await createDataset({ ... });
  const experiment = await runExperiment({
    dataset,
    task,
    evaluators: [accuracyEvaluator],
  });

  const result = await getExperiment({ experimentId: experiment.id });

  // Print detailed results by category
  printResultsByCategory(result);

  // Print failed examples so the user can see what went wrong
  printFailedExamples(result);
}

// IMPORTANT: Always print failed examples (where the evaluator's label did not
// match the expected label). This is critical for diagnosing prompt issues and
// deciding whether benchmark examples need adjustment. For each failed example,
// print the category, input, output (truncated if long), expected vs actual
// label, and the LLM judge's explanation.
function printFailedExamples(result) {
  const failures = [];

  for (const run of Object.values(result.runs)) {
    const output = run.output;
    if (run.error || !output) continue;
    if (output.expected_label !== output.label) {
      failures.push(output);
    }
  }

  if (failures.length === 0) {
    console.log("\nAll examples matched expected labels.");
    return;
  }

  console.log(`\nFAILED EXAMPLES (${failures.length})`);
  console.log("=".repeat(80));

  for (const [i, ex] of failures.entries()) {
    const truncatedOutput =
      ex.output.length > 120 ? ex.output.slice(0, 120) + "..." : ex.output;
    const truncatedExplanation =
      ex.explanation.length > 200
        ? ex.explanation.slice(0, 200) + "..."
        : ex.explanation;

    console.log(`\n  ${i + 1}. [${ex.category}]`);
    console.log(`     Input:    ${ex.input}`);
    console.log(`     Output:   ${truncatedOutput}`);
    console.log(`     Expected: ${ex.expected_label}  |  Got: ${ex.label}`);
    console.log(`     Reason:   ${truncatedExplanation}`);
  }
}

main();

Benchmark requirements:

  • Positive examples (expected: correct/pass)
  • Negative examples for each failure mode
  • Edge cases
  • Different input formats if applicable
  • IMPORTANT: The benchmark must print failed examples (where the evaluator's label did not match the expected label). For each failure, print the category, input, output (truncated if long), expected vs actual label, and the LLM judge's explanation. This is critical for diagnosing prompt issues and tuning examples. The task function must return input and output text in its result so the printer has access to them.

Step 7: Run Benchmark

# Start Phoenix (or use Phoenix Cloud)
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve

# Run benchmark
cd js/benchmarks/evals-benchmarks
export OPENAI_API_KEY="..."
pnpm tsx src/{metric_name}_benchmark.ts

Step 8: Create Documentation

Create documentation page at docs/phoenix/evaluation/pre-built-metrics/{metric-name}.mdx.

Follow the approved template structure:

  1. Overview - When to use, what it measures
  2. Supported Levels - Span/trace/session, relevant span kinds
  3. Input Requirements - Required fields, formatting tips
  4. Output Interpretation - Labels, scores, direction
  5. Usage Examples - Python and TypeScript in tabs
  6. Using Input Mapping - With lambda example if applicable
  7. Viewing/Modifying the Prompt - Link to GitHub config, show custom prompt usage
  8. Configuration - Link to LLM config docs
  9. Using with Phoenix - Links to traces and experiments docs
  10. Benchmarks - "Coming soon" placeholder
  11. API Reference - Links to Python and TypeScript docs
  12. Related - Links to related evaluators

Reference file: Use docs/phoenix/evaluation/pre-built-metrics/faithfulness.mdx as a template.

Update navigation:

  1. Add the metric to the landing page card grid in docs/phoenix/evaluation/pre-built-metrics.mdx
  2. Add the URL to docs/phoenix/sitemap.xml

Checklist

  • YAML config created with clear criteria
  • tox -e compile_prompts run successfully
  • Python evaluator class with docstrings and examples
  • TypeScript evaluator wrapper with types
  • Export added to llm/index.ts
  • JS packages rebuilt (pnpm build)
  • Benchmark suite with diverse test cases
  • Benchmark prints failed examples with input, output, expected/actual labels, and explanation
  • Benchmark run with acceptable accuracy (>80% target)
  • Documentation page created following template
  • Landing page updated with new metric card
  • Sitemap updated with new URL

Tips for Good Prompts

  1. Be explicit about criteria - List what makes something correct vs incorrect
  2. Handle edge cases - Multi-item evaluations, context from earlier turns
  3. Separate concerns - If evaluating X, explicitly state you're NOT evaluating Y
  4. Provide reasoning guidance - Tell the judge what to consider before deciding
  5. Use clear data formatting - Wrap inputs in XML-style tags like <context>, <output>

Prompt Playground

5 Variables

Fill Variables

Preview

# Creating a New Built-in Classification Metric

This guide describes how to create a new built-in classification evaluator metric for Phoenix evals. Follow these steps in order.

## Overview

Built-in metrics consist of:

1. **YAML Config** - Prompt template with criteria
2. **Generated Types** - Auto-generated Python and TypeScript code
3. **Python Evaluator** - Python class wrapping the config
4. **TypeScript Evaluator** - TypeScript factory function
5. **Benchmark Suite** - Synthetic test examples

## Step 1: Create the YAML Config

Create a new file in `prompts/classification_evaluator_configs/` named `{METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml`.

**Required fields:**

```yaml
name: metric_name # lowercase, snake_case
description: Brief description of what this metric evaluates
optimization_direction: maximize # or minimize or neutral
messages:
  - role: user
    content: >-
      # Your prompt template here
      # Use mustache {{placeholder}} for template variables
choices:
  correct: 1.0 # Map label to score
  incorrect: 0.0 # Adjust labels as needed
```

**Template placeholders:** Use `{{variable_name}}` syntax (Mustache format). IMPORTANT: If the user does not specify what input data is provided, ask follow-up questions so you know exactly what placeholders are needed in the prompt template and what they should be called.

Common placeholders:

- `{{input}}` - User query or conversation context
- `{{output}}` - LLM response to evaluate
- `{{reference}}` - Ground truth or expected output

**Reference existing configs:**

- `TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Tool selection evaluation
- `TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Tool invocation evaluation
- `CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Response correctness
- `HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Hallucination detection

## Step 2: Compile Prompts

Run to generate Python and TypeScript types:

```bash
tox -e compile_prompts
```

This generates:

- `packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/`
- `js/packages/phoenix-evals/src/__generated__/default_templates/`

## Step 3: Create Python Evaluator

Create `packages/phoenix-evals/src/phoenix/evals/metrics/{metric_name}.py`:

```python
from pydantic import BaseModel, Field

from ..__generated__.classification_evaluator_configs import (
    {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG,
)
from ..evaluators import ClassificationEvaluator
from ..llm import LLM
from ..llm.prompts import PromptTemplate


class {MetricName}Evaluator(ClassificationEvaluator):
    """
    Docstring describing the evaluator.

    Args:
        llm (LLM): The LLM instance to use for evaluation.

    Notes:
        - What this metric evaluates
        - What it returns
        - Requirements

    Examples::

        from phoenix.evals.metrics.{metric_name} import {MetricName}Evaluator
        from phoenix.evals import LLM

        llm = LLM(provider="openai", model="gpt-4o-mini")
        evaluator = {MetricName}Evaluator(llm=llm)
        scores = evaluator.evaluate({
            "input": "...",
            "output": "...",
        })
    """

    NAME = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name
    PROMPT = PromptTemplate(
        template=[
            msg.model_dump() for msg in {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.messages
        ],
    )
    CHOICES = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices
    DIRECTION = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimization_direction

    class {MetricName}InputSchema(BaseModel):
        # Define input fields matching template placeholders
        input: str = Field(description="Description of this field")
        output: str = Field(description="Description of this field")

    def __init__(self, llm: LLM):
        super().__init__(
            name=self.NAME,
            llm=llm,
            prompt_template=self.PROMPT.template,
            choices=self.CHOICES,
            direction=self.DIRECTION,
            input_schema=self.{MetricName}InputSchema,
        )
```

## Step 4: Create TypeScript Evaluator

Create `js/packages/phoenix-evals/src/llm/create{MetricName}Evaluator.ts`:

````typescript
import { {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG } from "../__generated__/default_templates";
import { CreateClassificationEvaluatorArgs } from "../types/evals";
import { ClassificationEvaluator } from "./ClassificationEvaluator";
import { createClassificationEvaluator } from "./createClassificationEvaluator";

export interface {MetricName}EvaluatorArgs<
  RecordType extends Record<string, unknown> = {MetricName}EvaluationRecord,
> extends Omit<
    CreateClassificationEvaluatorArgs<RecordType>,
    "promptTemplate" | "choices" | "optimizationDirection" | "name"
  > {
  optimizationDirection?: CreateClassificationEvaluatorArgs<RecordType>["optimizationDirection"];
  name?: CreateClassificationEvaluatorArgs<RecordType>["name"];
  choices?: CreateClassificationEvaluatorArgs<RecordType>["choices"];
  promptTemplate?: CreateClassificationEvaluatorArgs<RecordType>["promptTemplate"];
}

export type {MetricName}EvaluationRecord = {
  input: string;
  output: string;
  // Add fields matching template placeholders
};

/**
 * Creates a {metric_name} evaluator function.
 *
 * @example
 * ```ts
 * const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });
 * const result = await evaluator.evaluate({
 *   input: "...",
 *   output: "...",
 * });
 * ```
 */
export function create{MetricName}Evaluator<
  RecordType extends Record<string, unknown> = {MetricName}EvaluationRecord,
>(args: {MetricName}EvaluatorArgs<RecordType>): ClassificationEvaluator<RecordType> {
  const {
    choices = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices,
    promptTemplate = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.template,
    optimizationDirection = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimizationDirection,
    name = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name,
    ...rest
  } = args;
  return createClassificationEvaluator<RecordType>({
    ...rest,
    promptTemplate,
    choices,
    optimizationDirection,
    name,
  });
}
````

**Add export to** `js/packages/phoenix-evals/src/llm/index.ts`:

```typescript
export * from "./create{MetricName}Evaluator";
```

## Step 5: Build JS Packages

```bash
cd js && pnpm build
```

## Step 6: Create Benchmark Suite

Create `js/benchmarks/evals-benchmarks/src/{metric_name}_benchmark.ts`.

**Reference existing benchmarks** in the same directory for patterns:

- `conciseness_benchmark.ts` - Good example with failed examples printout
- `correctness_benchmark.ts` - Good example of category-based organization
- `tool_invocation_benchmark.ts` - Multi-tool and context handling
- `document_relevance_benchmark.ts` - Large synthetic dataset

**Target dataset size:** Aim for **30-50 synthetic examples** covering:

- 2-4 examples per failure mode (incorrect cases)
- 2-4 examples per success scenario (correct cases)
- At least 2 edge case categories

**Synthetic dataset creation:** For complex metrics, consider initiating a **separate AI agent session** dedicated to synthetic dataset generation. This agent can:

- Focus solely on creating realistic, diverse test cases
- Iterate on edge cases without context switching
- Generate examples in batches by category

**Structure:**

```typescript
import { createDataset } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, getExperiment, runExperiment } from "@arizeai/phoenix-client/experiments";
import { create{MetricName}Evaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });

// Define examples by category (target: 30-50 total examples)
const examplesByCategory = {
  // Failure modes (2-4 examples each)
  failure_mode_1: [
    { input: "...", output: "...", expected_label: "incorrect" as const },
    // ...
  ],
  // Success cases (2-4 examples each)
  correct_case_1: [
    { input: "...", output: "...", expected_label: "correct" as const },
    // ...
  ],
  // Edge cases
  edge_cases: [
    // ...
  ],
};

// TaskOutput must include `input` and `output` text so failed examples can be
// printed with full context for debugging.
type TaskOutput = {
  expected_label: string;
  label: string;
  score: number;
  explanation: string;
  category: string;
  input: string;
  output: string;
};

// Accuracy evaluator to compare predicted vs expected labels
const accuracyEvaluator = asExperimentEvaluator({
  name: "accuracy",
  kind: "CODE",
  evaluate: async (args) => {
    const output = args.output as TaskOutput;
    const score = output.expected_label === output.label ? 1 : 0;
    return {
      label: score === 1 ? "accurate" : "inaccurate",
      score,
      explanation: `Expected: ${output.expected_label}, Got: ${output.label}`,
    };
  },
});

// The task function must return input/output text alongside the eval result
// so that the failed examples printer can display what went wrong.
const task = async (example) => {
  const input = example.input.question as string;
  const output = example.output.answer as string;
  const expectedLabel = example.output.expected_label as string;

  const evalResult = await evaluator.evaluate({ input, output });

  return {
    expected_label: expectedLabel,
    category: example.metadata?.category as string,
    input,
    output,
    ...evalResult,
  };
};

async function main() {
  const dataset = await createDataset({ ... });
  const experiment = await runExperiment({
    dataset,
    task,
    evaluators: [accuracyEvaluator],
  });

  const result = await getExperiment({ experimentId: experiment.id });

  // Print detailed results by category
  printResultsByCategory(result);

  // Print failed examples so the user can see what went wrong
  printFailedExamples(result);
}

// IMPORTANT: Always print failed examples (where the evaluator's label did not
// match the expected label). This is critical for diagnosing prompt issues and
// deciding whether benchmark examples need adjustment. For each failed example,
// print the category, input, output (truncated if long), expected vs actual
// label, and the LLM judge's explanation.
function printFailedExamples(result) {
  const failures = [];

  for (const run of Object.values(result.runs)) {
    const output = run.output;
    if (run.error || !output) continue;
    if (output.expected_label !== output.label) {
      failures.push(output);
    }
  }

  if (failures.length === 0) {
    console.log("\nAll examples matched expected labels.");
    return;
  }

  console.log(`\nFAILED EXAMPLES (${failures.length})`);
  console.log("=".repeat(80));

  for (const [i, ex] of failures.entries()) {
    const truncatedOutput =
      ex.output.length > 120 ? ex.output.slice(0, 120) + "..." : ex.output;
    const truncatedExplanation =
      ex.explanation.length > 200
        ? ex.explanation.slice(0, 200) + "..."
        : ex.explanation;

    console.log(`\n  ${i + 1}. [${ex.category}]`);
    console.log(`     Input:    ${ex.input}`);
    console.log(`     Output:   ${truncatedOutput}`);
    console.log(`     Expected: ${ex.expected_label}  |  Got: ${ex.label}`);
    console.log(`     Reason:   ${truncatedExplanation}`);
  }
}

main();
```

**Benchmark requirements:**

- Positive examples (expected: correct/pass)
- Negative examples for each failure mode
- Edge cases
- Different input formats if applicable
- **IMPORTANT: The benchmark must print failed examples** (where the evaluator's label did not match the expected label). For each failure, print the category, input, output (truncated if long), expected vs actual label, and the LLM judge's explanation. This is critical for diagnosing prompt issues and tuning examples. The task function must return `input` and `output` text in its result so the printer has access to them.

## Step 7: Run Benchmark

```bash
# Start Phoenix (or use Phoenix Cloud)
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve

# Run benchmark
cd js/benchmarks/evals-benchmarks
export OPENAI_API_KEY="..."
pnpm tsx src/{metric_name}_benchmark.ts
```

## Step 8: Create Documentation

Create documentation page at `docs/phoenix/evaluation/pre-built-metrics/{metric-name}.mdx`.

**Follow the approved template structure:**

1. **Overview** - When to use, what it measures
2. **Supported Levels** - Span/trace/session, relevant span kinds
3. **Input Requirements** - Required fields, formatting tips
4. **Output Interpretation** - Labels, scores, direction
5. **Usage Examples** - Python and TypeScript in tabs
6. **Using Input Mapping** - With lambda example if applicable
7. **Viewing/Modifying the Prompt** - Link to GitHub config, show custom prompt usage
8. **Configuration** - Link to LLM config docs
9. **Using with Phoenix** - Links to traces and experiments docs
10. **Benchmarks** - "Coming soon" placeholder
11. **API Reference** - Links to Python and TypeScript docs
12. **Related** - Links to related evaluators

**Reference file:** Use `docs/phoenix/evaluation/pre-built-metrics/faithfulness.mdx` as a template.

**Update navigation:**

1. Add the metric to the landing page card grid in `docs/phoenix/evaluation/pre-built-metrics.mdx`
2. Add the URL to `docs/phoenix/sitemap.xml`

## Checklist

- [ ] YAML config created with clear criteria
- [ ] `tox -e compile_prompts` run successfully
- [ ] Python evaluator class with docstrings and examples
- [ ] TypeScript evaluator wrapper with types
- [ ] Export added to `llm/index.ts`
- [ ] JS packages rebuilt (`pnpm build`)
- [ ] Benchmark suite with diverse test cases
- [ ] Benchmark prints failed examples with input, output, expected/actual labels, and explanation
- [ ] Benchmark run with acceptable accuracy (>80% target)
- [ ] Documentation page created following template
- [ ] Landing page updated with new metric card
- [ ] Sitemap updated with new URL

## Tips for Good Prompts

1. **Be explicit about criteria** - List what makes something correct vs incorrect
2. **Handle edge cases** - Multi-item evaluations, context from earlier turns
3. **Separate concerns** - If evaluating X, explicitly state you're NOT evaluating Y
4. **Provide reasoning guidance** - Tell the judge what to consider before deciding
5. **Use clear data formatting** - Wrap inputs in XML-style tags like `<context>`, `<output>`
Share: