Creating a New Built-in Classification Metric

This guide describes how to create a new built-in classification evaluator metric for Phoenix evals. Follow these steps in order.

PublishedFeb 7, 2026

Loading actions...

5 minBeginnerpromptSingle file

Skill content

Main instructions and any bundled files for this skill.

markdown

Creating a New Built-in Classification Metric

This guide describes how to create a new built-in classification evaluator metric for Phoenix evals. Follow these steps in order.

Overview

Built-in metrics consist of:

YAML Config - Prompt template with criteria
Generated Types - Auto-generated Python and TypeScript code
Python Evaluator - Python class wrapping the config
TypeScript Evaluator - TypeScript factory function
Benchmark Suite - Synthetic test examples

Step 1: Create the YAML Config

Create a new file in prompts/classification_evaluator_configs/ named {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.

Required fields:

name: metric_name # lowercase, snake_case
description: Brief description of what this metric evaluates
optimization_direction: maximize # or minimize or neutral
messages:
  - role: user
    content: >-
      # Your prompt template here
      # Use mustache {{placeholder}} for template variables
choices:
  correct: 1.0 # Map label to score
  incorrect: 0.0 # Adjust labels as needed

Template placeholders: Use {{variable_name}} syntax (Mustache format). IMPORTANT: If the user does not specify what input data is provided, ask follow-up questions so you know exactly what placeholders are needed in the prompt template and what they should be called.

Common placeholders:

{{input}} - User query or conversation context
{{output}} - LLM response to evaluate
{{reference}} - Ground truth or expected output

Reference existing configs:

TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Tool selection evaluation
TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Tool invocation evaluation
CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Response correctness
HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml - Hallucination detection

Step 2: Compile Prompts

Run to generate Python and TypeScript types:

tox -e compile_prompts

This generates:

packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/
js/packages/phoenix-evals/src/__generated__/default_templates/

Step 3: Create Python Evaluator

Create packages/phoenix-evals/src/phoenix/evals/metrics/{metric_name}.py:

from pydantic import BaseModel, Field

from ..__generated__.classification_evaluator_configs import (
    {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG,
)
from ..evaluators import ClassificationEvaluator
from ..llm import LLM
from ..llm.prompts import PromptTemplate


class {MetricName}Evaluator(ClassificationEvaluator):
    """
    Docstring describing the evaluator.

    Args:
        llm (LLM): The LLM instance to use for evaluation.

    Notes:
        - What this metric evaluates
        - What it returns
        - Requirements

    Examples::

        from phoenix.evals.metrics.{metric_name} import {MetricName}Evaluator
        from phoenix.evals import LLM

        llm = LLM(provider="openai", model="gpt-4o-mini")
        evaluator = {MetricName}Evaluator(llm=llm)
        scores = evaluator.evaluate({
            "input": "...",
            "output": "...",
        })
    """

    NAME = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name
    PROMPT = PromptTemplate(
        template=[
            msg.model_dump() for msg in {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.messages
        ],
    )
    CHOICES = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices
    DIRECTION = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimization_direction

    class {MetricName}InputSchema(BaseModel):
        # Define input fields matching template placeholders
        input: str = Field(description="Description of this field")
        output: str = Field(description="Description of this field")

    def __init__(self, llm: LLM):
        super().__init__(
            name=self.NAME,
            llm=llm,
            prompt_template=self.PROMPT.template,
            choices=self.CHOICES,
            direction=self.DIRECTION,
            input_schema=self.{MetricName}InputSchema,
        )

Step 4: Create TypeScript Evaluator

Create js/packages/phoenix-evals/src/llm/create{MetricName}Evaluator.ts:

import { {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG } from "../__generated__/default_templates";
import { CreateClassificationEvaluatorArgs } from "../types/evals";
import { ClassificationEvaluator } from "./ClassificationEvaluator";
import { createClassificationEvaluator } from "./createClassificationEvaluator";

export interface {MetricName}EvaluatorArgs&#x3C;
  RecordType extends Record&#x3C;string, unknown> = {MetricName}EvaluationRecord,
> extends Omit&#x3C;
    CreateClassificationEvaluatorArgs&#x3C;RecordType>,
    "promptTemplate" | "choices" | "optimizationDirection" | "name"
  > {
  optimizationDirection?: CreateClassificationEvaluatorArgs&#x3C;RecordType>["optimizationDirection"];
  name?: CreateClassificationEvaluatorArgs&#x3C;RecordType>["name"];
  choices?: CreateClassificationEvaluatorArgs&#x3C;RecordType>["choices"];
  promptTemplate?: CreateClassificationEvaluatorArgs&#x3C;RecordType>["promptTemplate"];
}

export type {MetricName}EvaluationRecord = {
  input: string;
  output: string;
  // Add fields matching template placeholders
};

/**
 * Creates a {metric_name} evaluator function.
 *
 * @example
 * ```ts
 * const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });
 * const result = await evaluator.evaluate({
 *   input: "...",
 *   output: "...",
 * });
 * ```
 */
export function create{MetricName}Evaluator&#x3C;
  RecordType extends Record&#x3C;string, unknown> = {MetricName}EvaluationRecord,
>(args: {MetricName}EvaluatorArgs&#x3C;RecordType>): ClassificationEvaluator&#x3C;RecordType> {
  const {
    choices = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices,
    promptTemplate = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.template,
    optimizationDirection = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimizationDirection,
    name = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name,
    ...rest
  } = args;
  return createClassificationEvaluator&#x3C;RecordType>({
    ...rest,
    promptTemplate,
    choices,
    optimizationDirection,
    name,
  });
}

Add export to js/packages/phoenix-evals/src/llm/index.ts:

export * from "./create{MetricName}Evaluator";

Step 5: Build JS Packages

cd js &#x26;&#x26; pnpm build

Step 6: Create Benchmark Suite

Create js/benchmarks/evals-benchmarks/src/{metric_name}_benchmark.ts.

Reference existing benchmarks in the same directory for patterns:

conciseness_benchmark.ts - Good example with failed examples printout
correctness_benchmark.ts - Good example of category-based organization
tool_invocation_benchmark.ts - Multi-tool and context handling
document_relevance_benchmark.ts - Large synthetic dataset

Target dataset size: Aim for 30-50 synthetic examples covering:

2-4 examples per failure mode (incorrect cases)
2-4 examples per success scenario (correct cases)
At least 2 edge case categories

Synthetic dataset creation: For complex metrics, consider initiating a separate AI agent session dedicated to synthetic dataset generation. This agent can:

Focus solely on creating realistic, diverse test cases
Iterate on edge cases without context switching
Generate examples in batches by category

Structure:

import { createDataset } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, getExperiment, runExperiment } from "@arizeai/phoenix-client/experiments";
import { create{MetricName}Evaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });

// Define examples by category (target: 30-50 total examples)
const examplesByCategory = {
  // Failure modes (2-4 examples each)
  failure_mode_1: [
    { input: "...", output: "...", expected_label: "incorrect" as const },
    // ...
  ],
  // Success cases (2-4 examples each)
  correct_case_1: [
    { input: "...", output: "...", expected_label: "correct" as const },
    // ...
  ],
  // Edge cases
  edge_cases: [
    // ...
  ],
};

// TaskOutput must include `input` and `output` text so failed examples can be
// printed with full context for debugging.
type TaskOutput = {
  expected_label: string;
  label: string;
  score: number;
  explanation: string;
  category: string;
  input: string;
  output: string;
};

// Accuracy evaluator to compare predicted vs expected labels
const accuracyEvaluator = asExperimentEvaluator({
  name: "accuracy",
  kind: "CODE",
  evaluate: async (args) => {
    const output = args.output as TaskOutput;
    const score = output.expected_label === output.label ? 1 : 0;
    return {
      label: score === 1 ? "accurate" : "inaccurate",
      score,
      explanation: `Expected: ${output.expected_label}, Got: ${output.label}`,
    };
  },
});

// The task function must return input/output text alongside the eval result
// so that the failed examples printer can display what went wrong.
const task = async (example) => {
  const input = example.input.question as string;
  const output = example.output.answer as string;
  const expectedLabel = example.output.expected_label as string;

  const evalResult = await evaluator.evaluate({ input, output });

  return {
    expected_label: expectedLabel,
    category: example.metadata?.category as string,
    input,
    output,
    ...evalResult,
  };
};

async function main() {
  const dataset = await createDataset({ ... });
  const experiment = await runExperiment({
    dataset,
    task,
    evaluators: [accuracyEvaluator],
  });

  const result = await getExperiment({ experimentId: experiment.id });

  // Print detailed results by category
  printResultsByCategory(result);

  // Print failed examples so the user can see what went wrong
  printFailedExamples(result);
}

// IMPORTANT: Always print failed examples (where the evaluator's label did not
// match the expected label). This is critical for diagnosing prompt issues and
// deciding whether benchmark examples need adjustment. For each failed example,
// print the category, input, output (truncated if long), expected vs actual
// label, and the LLM judge's explanation.
function printFailedExamples(result) {
  const failures = [];

  for (const run of Object.values(result.runs)) {
    const output = run.output;
    if (run.error || !output) continue;
    if (output.expected_label !== output.label) {
      failures.push(output);
    }
  }

  if (failures.length === 0) {
    console.log("\nAll examples matched expected labels.");
    return;
  }

  console.log(`\nFAILED EXAMPLES (${failures.length})`);
  console.log("=".repeat(80));

  for (const [i, ex] of failures.entries()) {
    const truncatedOutput =
      ex.output.length > 120 ? ex.output.slice(0, 120) + "..." : ex.output;
    const truncatedExplanation =
      ex.explanation.length > 200
        ? ex.explanation.slice(0, 200) + "..."
        : ex.explanation;

    console.log(`\n  ${i + 1}. [${ex.category}]`);
    console.log(`     Input:    ${ex.input}`);
    console.log(`     Output:   ${truncatedOutput}`);
    console.log(`     Expected: ${ex.expected_label}  |  Got: ${ex.label}`);
    console.log(`     Reason:   ${truncatedExplanation}`);
  }
}

main();

Benchmark requirements:

Positive examples (expected: correct/pass)
Negative examples for each failure mode
Edge cases
Different input formats if applicable
IMPORTANT: The benchmark must print failed examples (where the evaluator's label did not match the expected label). For each failure, print the category, input, output (truncated if long), expected vs actual label, and the LLM judge's explanation. This is critical for diagnosing prompt issues and tuning examples. The task function must return input and output text in its result so the printer has access to them.

Step 7: Run Benchmark

# Start Phoenix (or use Phoenix Cloud)
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve

# Run benchmark
cd js/benchmarks/evals-benchmarks
export OPENAI_API_KEY="..."
pnpm tsx src/{metric_name}_benchmark.ts

Step 8: Create Documentation

Create documentation page at docs/phoenix/evaluation/pre-built-metrics/{metric-name}.mdx.

Follow the approved template structure:

Overview - When to use, what it measures
Supported Levels - Span/trace/session, relevant span kinds
Input Requirements - Required fields, formatting tips
Output Interpretation - Labels, scores, direction
Usage Examples - Python and TypeScript in tabs
Using Input Mapping - With lambda example if applicable
Viewing/Modifying the Prompt - Link to GitHub config, show custom prompt usage
Configuration - Link to LLM config docs
Using with Phoenix - Links to traces and experiments docs
Benchmarks - "Coming soon" placeholder
API Reference - Links to Python and TypeScript docs
Related - Links to related evaluators

Reference file: Use docs/phoenix/evaluation/pre-built-metrics/faithfulness.mdx as a template.

Update navigation:

Add the metric to the landing page card grid in docs/phoenix/evaluation/pre-built-metrics.mdx
Add the URL to docs/phoenix/sitemap.xml

Checklist

Tips for Good Prompts

Be explicit about criteria - List what makes something correct vs incorrect
Handle edge cases - Multi-item evaluations, context from earlier turns
Separate concerns - If evaluating X, explicitly state you're NOT evaluating Y
Provide reasoning guidance - Tell the judge what to consider before deciding
Use clear data formatting - Wrap inputs in XML-style tags like <context>, <output>

Contents

Prompt Playground

5 Variables

Fill Variables

placeholder

variable_name

input

output

reference

Preview

# Creating a New Built-in Classification Metric

This guide describes how to create a new built-in classification evaluator metric for Phoenix evals. Follow these steps in order.

## Overview

Built-in metrics consist of:

1. **YAML Config** - Prompt template with criteria
2. **Generated Types** - Auto-generated Python and TypeScript code
3. **Python Evaluator** - Python class wrapping the config
4. **TypeScript Evaluator** - TypeScript factory function
5. **Benchmark Suite** - Synthetic test examples

## Step 1: Create the YAML Config

Create a new file in `prompts/classification_evaluator_configs/` named `{METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml`.

**Required fields:**

```yaml
name: metric_name # lowercase, snake_case
description: Brief description of what this metric evaluates
optimization_direction: maximize # or minimize or neutral
messages:
  - role: user
    content: >-
      # Your prompt template here
      # Use mustache {{placeholder}} for template variables
choices:
  correct: 1.0 # Map label to score
  incorrect: 0.0 # Adjust labels as needed
```

**Template placeholders:** Use `{{variable_name}}` syntax (Mustache format). IMPORTANT: If the user does not specify what input data is provided, ask follow-up questions so you know exactly what placeholders are needed in the prompt template and what they should be called.

Common placeholders:

- `{{input}}` - User query or conversation context
- `{{output}}` - LLM response to evaluate
- `{{reference}}` - Ground truth or expected output

**Reference existing configs:**

- `TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Tool selection evaluation
- `TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Tool invocation evaluation
- `CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Response correctness
- `HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` - Hallucination detection

## Step 2: Compile Prompts

Run to generate Python and TypeScript types:

```bash
tox -e compile_prompts
```

This generates:

- `packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/`
- `js/packages/phoenix-evals/src/__generated__/default_templates/`

## Step 3: Create Python Evaluator

Create `packages/phoenix-evals/src/phoenix/evals/metrics/{metric_name}.py`:

```python
from pydantic import BaseModel, Field

from ..__generated__.classification_evaluator_configs import (
    {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG,
)
from ..evaluators import ClassificationEvaluator
from ..llm import LLM
from ..llm.prompts import PromptTemplate


class {MetricName}Evaluator(ClassificationEvaluator):
    """
    Docstring describing the evaluator.

    Args:
        llm (LLM): The LLM instance to use for evaluation.

    Notes:
        - What this metric evaluates
        - What it returns
        - Requirements

    Examples::

        from phoenix.evals.metrics.{metric_name} import {MetricName}Evaluator
        from phoenix.evals import LLM

        llm = LLM(provider="openai", model="gpt-4o-mini")
        evaluator = {MetricName}Evaluator(llm=llm)
        scores = evaluator.evaluate({
            "input": "...",
            "output": "...",
        })
    """

    NAME = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name
    PROMPT = PromptTemplate(
        template=[
            msg.model_dump() for msg in {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.messages
        ],
    )
    CHOICES = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices
    DIRECTION = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimization_direction

    class {MetricName}InputSchema(BaseModel):
        # Define input fields matching template placeholders
        input: str = Field(description="Description of this field")
        output: str = Field(description="Description of this field")

    def __init__(self, llm: LLM):
        super().__init__(
            name=self.NAME,
            llm=llm,
            prompt_template=self.PROMPT.template,
            choices=self.CHOICES,
            direction=self.DIRECTION,
            input_schema=self.{MetricName}InputSchema,
        )
```

## Step 4: Create TypeScript Evaluator

Create `js/packages/phoenix-evals/src/llm/create{MetricName}Evaluator.ts`:

````typescript
import { {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG } from "../__generated__/default_templates";
import { CreateClassificationEvaluatorArgs } from "../types/evals";
import { ClassificationEvaluator } from "./ClassificationEvaluator";
import { createClassificationEvaluator } from "./createClassificationEvaluator";

export interface {MetricName}EvaluatorArgs<
  RecordType extends Record<string, unknown> = {MetricName}EvaluationRecord,
> extends Omit<
    CreateClassificationEvaluatorArgs<RecordType>,
    "promptTemplate" | "choices" | "optimizationDirection" | "name"
  > {
  optimizationDirection?: CreateClassificationEvaluatorArgs<RecordType>["optimizationDirection"];
  name?: CreateClassificationEvaluatorArgs<RecordType>["name"];
  choices?: CreateClassificationEvaluatorArgs<RecordType>["choices"];
  promptTemplate?: CreateClassificationEvaluatorArgs<RecordType>["promptTemplate"];
}

export type {MetricName}EvaluationRecord = {
  input: string;
  output: string;
  // Add fields matching template placeholders
};

/**
 * Creates a {metric_name} evaluator function.
 *
 * @example
 * ```ts
 * const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });
 * const result = await evaluator.evaluate({
 *   input: "...",
 *   output: "...",
 * });
 * ```
 */
export function create{MetricName}Evaluator<
  RecordType extends Record<string, unknown> = {MetricName}EvaluationRecord,
>(args: {MetricName}EvaluatorArgs<RecordType>): ClassificationEvaluator<RecordType> {
  const {
    choices = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.choices,
    promptTemplate = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.template,
    optimizationDirection = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.optimizationDirection,
    name = {METRIC_NAME}_CLASSIFICATION_EVALUATOR_CONFIG.name,
    ...rest
  } = args;
  return createClassificationEvaluator<RecordType>({
    ...rest,
    promptTemplate,
    choices,
    optimizationDirection,
    name,
  });
}
````

**Add export to** `js/packages/phoenix-evals/src/llm/index.ts`:

```typescript
export * from "./create{MetricName}Evaluator";
```

## Step 5: Build JS Packages

```bash
cd js && pnpm build
```

## Step 6: Create Benchmark Suite

Create `js/benchmarks/evals-benchmarks/src/{metric_name}_benchmark.ts`.

**Reference existing benchmarks** in the same directory for patterns:

- `conciseness_benchmark.ts` - Good example with failed examples printout
- `correctness_benchmark.ts` - Good example of category-based organization
- `tool_invocation_benchmark.ts` - Multi-tool and context handling
- `document_relevance_benchmark.ts` - Large synthetic dataset

**Target dataset size:** Aim for **30-50 synthetic examples** covering:

- 2-4 examples per failure mode (incorrect cases)
- 2-4 examples per success scenario (correct cases)
- At least 2 edge case categories

**Synthetic dataset creation:** For complex metrics, consider initiating a **separate AI agent session** dedicated to synthetic dataset generation. This agent can:

- Focus solely on creating realistic, diverse test cases
- Iterate on edge cases without context switching
- Generate examples in batches by category

**Structure:**

```typescript
import { createDataset } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, getExperiment, runExperiment } from "@arizeai/phoenix-client/experiments";
import { create{MetricName}Evaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const evaluator = create{MetricName}Evaluator({ model: openai("gpt-4o-mini") });

// Define examples by category (target: 30-50 total examples)
const examplesByCategory = {
  // Failure modes (2-4 examples each)
  failure_mode_1: [
    { input: "...", output: "...", expected_label: "incorrect" as const },
    // ...
  ],
  // Success cases (2-4 examples each)
  correct_case_1: [
    { input: "...", output: "...", expected_label: "correct" as const },
    // ...
  ],
  // Edge cases
  edge_cases: [
    // ...
  ],
};

// TaskOutput must include `input` and `output` text so failed examples can be
// printed with full context for debugging.
type TaskOutput = {
  expected_label: string;
  label: string;
  score: number;
  explanation: string;
  category: string;
  input: string;
  output: string;
};

// Accuracy evaluator to compare predicted vs expected labels
const accuracyEvaluator = asExperimentEvaluator({
  name: "accuracy",
  kind: "CODE",
  evaluate: async (args) => {
    const output = args.output as TaskOutput;
    const score = output.expected_label === output.label ? 1 : 0;
    return {
      label: score === 1 ? "accurate" : "inaccurate",
      score,
      explanation: `Expected: ${output.expected_label}, Got: ${output.label}`,
    };
  },
});

// The task function must return input/output text alongside the eval result
// so that the failed examples printer can display what went wrong.
const task = async (example) => {
  const input = example.input.question as string;
  const output = example.output.answer as string;
  const expectedLabel = example.output.expected_label as string;

  const evalResult = await evaluator.evaluate({ input, output });

  return {
    expected_label: expectedLabel,
    category: example.metadata?.category as string,
    input,
    output,
    ...evalResult,
  };
};

async function main() {
  const dataset = await createDataset({ ... });
  const experiment = await runExperiment({
    dataset,
    task,
    evaluators: [accuracyEvaluator],
  });

  const result = await getExperiment({ experimentId: experiment.id });

  // Print detailed results by category
  printResultsByCategory(result);

  // Print failed examples so the user can see what went wrong
  printFailedExamples(result);
}

// IMPORTANT: Always print failed examples (where the evaluator's label did not
// match the expected label). This is critical for diagnosing prompt issues and
// deciding whether benchmark examples need adjustment. For each failed example,
// print the category, input, output (truncated if long), expected vs actual
// label, and the LLM judge's explanation.
function printFailedExamples(result) {
  const failures = [];

  for (const run of Object.values(result.runs)) {
    const output = run.output;
    if (run.error || !output) continue;
    if (output.expected_label !== output.label) {
      failures.push(output);
    }
  }

  if (failures.length === 0) {
    console.log("\nAll examples matched expected labels.");
    return;
  }

  console.log(`\nFAILED EXAMPLES (${failures.length})`);
  console.log("=".repeat(80));

  for (const [i, ex] of failures.entries()) {
    const truncatedOutput =
      ex.output.length > 120 ? ex.output.slice(0, 120) + "..." : ex.output;
    const truncatedExplanation =
      ex.explanation.length > 200
        ? ex.explanation.slice(0, 200) + "..."
        : ex.explanation;

    console.log(`\n  ${i + 1}. [${ex.category}]`);
    console.log(`     Input:    ${ex.input}`);
    console.log(`     Output:   ${truncatedOutput}`);
    console.log(`     Expected: ${ex.expected_label}  |  Got: ${ex.label}`);
    console.log(`     Reason:   ${truncatedExplanation}`);
  }
}

main();
```

**Benchmark requirements:**

- Positive examples (expected: correct/pass)
- Negative examples for each failure mode
- Edge cases
- Different input formats if applicable
- **IMPORTANT: The benchmark must print failed examples** (where the evaluator's label did not match the expected label). For each failure, print the category, input, output (truncated if long), expected vs actual label, and the LLM judge's explanation. This is critical for diagnosing prompt issues and tuning examples. The task function must return `input` and `output` text in its result so the printer has access to them.

## Step 7: Run Benchmark

```bash
# Start Phoenix (or use Phoenix Cloud)
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve

# Run benchmark
cd js/benchmarks/evals-benchmarks
export OPENAI_API_KEY="..."
pnpm tsx src/{metric_name}_benchmark.ts
```

## Step 8: Create Documentation

Create documentation page at `docs/phoenix/evaluation/pre-built-metrics/{metric-name}.mdx`.

**Follow the approved template structure:**

1. **Overview** - When to use, what it measures
2. **Supported Levels** - Span/trace/session, relevant span kinds
3. **Input Requirements** - Required fields, formatting tips
4. **Output Interpretation** - Labels, scores, direction
5. **Usage Examples** - Python and TypeScript in tabs
6. **Using Input Mapping** - With lambda example if applicable
7. **Viewing/Modifying the Prompt** - Link to GitHub config, show custom prompt usage
8. **Configuration** - Link to LLM config docs
9. **Using with Phoenix** - Links to traces and experiments docs
10. **Benchmarks** - "Coming soon" placeholder
11. **API Reference** - Links to Python and TypeScript docs
12. **Related** - Links to related evaluators

**Reference file:** Use `docs/phoenix/evaluation/pre-built-metrics/faithfulness.mdx` as a template.

**Update navigation:**

1. Add the metric to the landing page card grid in `docs/phoenix/evaluation/pre-built-metrics.mdx`
2. Add the URL to `docs/phoenix/sitemap.xml`

## Checklist

- [ ] YAML config created with clear criteria
- [ ] `tox -e compile_prompts` run successfully
- [ ] Python evaluator class with docstrings and examples
- [ ] TypeScript evaluator wrapper with types
- [ ] Export added to `llm/index.ts`
- [ ] JS packages rebuilt (`pnpm build`)
- [ ] Benchmark suite with diverse test cases
- [ ] Benchmark prints failed examples with input, output, expected/actual labels, and explanation
- [ ] Benchmark run with acceptable accuracy (>80% target)
- [ ] Documentation page created following template
- [ ] Landing page updated with new metric card
- [ ] Sitemap updated with new URL

## Tips for Good Prompts

1. **Be explicit about criteria** - List what makes something correct vs incorrect
2. **Handle edge cases** - Multi-item evaluations, context from earlier turns
3. **Separate concerns** - If evaluating X, explicitly state you're NOT evaluating Y
4. **Provide reasoning guidance** - Tell the judge what to consider before deciding
5. **Use clear data formatting** - Wrap inputs in XML-style tags like `<context>`, `<output>`

View Original Source

Related Skills

General

PromptBeginner5 minmarkdown

Untitled Skill

193

Jan 12, 2026

General

PromptBeginner5 minmarkdown

Frontend Typescript Linting.mdc

TypeScript and ESLint rules that MUST be followed when creating, modifying, or reviewing any file under apps/frontend/, including .ts, .tsx, .js, and .jsx files. Also apply when discussing frontend li...

160

Feb 15, 2026

General

PromptBeginner5 minmarkdown

2. Apply Deepthink Protocol (reason about dependencies

risks

126

Jan 15, 2026