ABPrompt

**Scientific A/B Testing for LLM Prompts**

Published Jan 14, 2026


Run controlled experiments on your prompts with statistical rigor. Compare variants with hypothesis testing, effect sizes, and AI-powered insights.

Features

Experiment Framework: Define variants, test inputs, and sample sizes
LLM-as-Judge: Automated quality evaluation on 7 dimensions
Statistical Analysis: Welch's t-test, Mann-Whitney U, Cohen's d, confidence intervals
Comprehensive Telemetry: Track cost, latency, tokens, quality scores
AI Insights: Generate actionable recommendations automatically
Visual Reports: ASCII-based terminal visualizations

Installation

pip install -e .

Quick Start

from abprompt import (
    Variant, Experiment, ExperimentRunner,
    LLMJudge, StatisticalAnalyzer, Provider
)

# Define prompt variants
variant_a = Variant(
    name="concise",
    prompt_template="Answer briefly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)

variant_b = Variant(
    name="detailed",
    prompt_template="Explain thoroughly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)

# Create experiment
experiment = Experiment(
    id="prompt_test",
    name="Concise vs Detailed",
    variants=[variant_a, variant_b],
    test_inputs=[{"question": "What is machine learning?"}],
    sample_size=30,
)

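import asyncio


async def main():
    # Run with LLM-as-Judge (runner.run is a coroutine, so it must be awaited
    # inside an async function)
    runner = ExperimentRunner(judge=LLMJudge())
    trials = await runner.run(experiment)

    # Split trials by variant before comparing them.
    # ASSUMPTION: trials come back as one flat list and each trial exposes a
    # `variant_name` attribute; check core/types.py for the actual field name.
    trials_a = [t for t in trials if t.variant_name == "concise"]
    trials_b = [t for t in trials if t.variant_name == "detailed"]

    # Analyze statistically
    analyzer = StatisticalAnalyzer()
    results = analyzer.compare_all_metrics(trials_a, trials_b)
    winner, confidence, reason = analyzer.determine_winner(results)
    print(winner, confidence, reason)


asyncio.run(main())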
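
The `{question}` placeholder in each `prompt_template` is filled from the matching key in `test_inputs`. A minimal sketch of that substitution, assuming the templates follow Python's `str.format` conventions (the `render` helper is hypothetical and not part of ABPrompt):

```python
def render(prompt_template: str, test_input: dict) -> str:
    """Fill a template's named placeholders from one test input.

    Hypothetical helper for illustration; ABPrompt's own rendering may differ.
    """
    return prompt_template.format(**test_input)


test_input = {"question": "What is machine learning?"}
print(render("Answer briefly: {question}", test_input))
print(render("Explain thoroughly: {question}", test_input))
```

Under this assumption, a key that appears in the template but not in the test input raises `KeyError`, which surfaces a mismatched variant before any API call is made.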

Live Demo

export ANTHROPIC_API_KEY=your-key-here
python live_demo.py

Statistical Methods

Method	Purpose
Welch's t-test	Compare means with unequal variances
Mann-Whitney U	Non-parametric alternative
Cohen's d	Effect size measurement
95%/99% CI	Confidence intervals
Bonferroni	Multiple comparison correction
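
To make the table concrete, here is a standard-library sketch of the quantities behind these methods. It follows the textbook formulas and is not taken from `analysis/statistics.py`; all function names are illustrative. It returns test statistics only. Turning them into p-values requires a t-distribution or U-distribution CDF, which a library such as SciPy provides.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a beats b, ties counted as half."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)


def cohens_d(a, b):
    """Effect size: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled


def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Made-up scores for two variants, purely for illustration
scores_a = [1, 2, 3, 4, 5]
scores_b = [2, 3, 4, 5, 6]
print(welch_t(scores_a, scores_b))
print(mann_whitney_u(scores_a, scores_b))
print(cohens_d(scores_a, scores_b))
print(bonferroni_alpha(0.05, 5))
```

The Bonferroni step matters when several metrics are compared at once: testing five metrics at an overall 0.05 level means each individual test is held to 0.01.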

Quality Metrics (LLM-as-Judge)

Quality Score (overall)
Relevance
Clarity
Completeness
Accuracy
Engagement
Creativity
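
One way to summarize judge scores for a single dimension is a mean with a confidence interval. The sketch below uses only the standard library and a normal approximation. The snake_case dimension keys and the `summarize` helper are assumed names for illustration, not ABPrompt's API.

```python
import math
from statistics import NormalDist, mean, stdev

# The seven dimensions listed above; these key names are assumptions.
DIMENSIONS = [
    "quality_score", "relevance", "clarity", "completeness",
    "accuracy", "engagement", "creativity",
]


def summarize(scores, level=0.95):
    """Mean and normal-approximation confidence interval for one dimension."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    m = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m, (m - half_width, m + half_width)


# Made-up judge scores for one dimension of one variant
clarity_scores = [7, 8, 9]
print(summarize(clarity_scores))
print(summarize(clarity_scores, level=0.99))
```

With small samples the normal approximation gives intervals that are too narrow; a t-based interval is wider and is the safer choice below roughly 30 trials per variant.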

Tests

pytest tests/ -v
# 178 tests, 100% pass rate

Project Structure

abprompt/
  core/
    types.py       # Data models
    runner.py      # Experiment execution
  judge/
    llm_judge.py   # LLM-as-Judge evaluation
  telemetry/
    collector.py   # Metrics collection
  analysis/
    statistics.py  # Statistical tests
    insights.py    # AI insights generation
  visualization/
    reports.py     # Report generation
tests/
  test_*.py        # Comprehensive tests

Day 9/30 AI Challenge

This project is part of my 30-day AI challenge.

License

MIT
