ABPrompt

**Scientific A/B Testing for LLM Prompts**

Published Jan 14, 2026

Run controlled experiments on your prompts with statistical rigor. Compare variants with hypothesis testing, effect sizes, and AI-powered insights.

Features

  • Experiment Framework: Define variants, test inputs, and sample sizes
  • LLM-as-Judge: Automated quality evaluation on 7 dimensions
  • Statistical Analysis: Welch's t-test, Mann-Whitney U, Cohen's d, confidence intervals
  • Comprehensive Telemetry: Track cost, latency, tokens, quality scores
  • AI Insights: Generate actionable recommendations automatically
  • Visual Reports: ASCII-based terminal visualizations

Installation

pip install -e .

Quick Start

from abprompt import (
    Variant, Experiment, ExperimentRunner,
    LLMJudge, StatisticalAnalyzer, Provider
)

# Define prompt variants
variant_a = Variant(
    name="concise",
    prompt_template="Answer briefly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)

variant_b = Variant(
    name="detailed",
    prompt_template="Explain thoroughly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)

# Create experiment
experiment = Experiment(
    id="prompt_test",
    name="Concise vs Detailed",
    variants=[variant_a, variant_b],
    test_inputs=[{"question": "What is machine learning?"}],
    sample_size=30,
)

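# Run with LLM-as-Judge
# (runner.run is awaited, so call it from an async function or a notebook)
runner = ExperimentRunner(judge=LLMJudge())
trials = await runner.run(experiment)

# Split the trials by variant before comparing them.
# Assumption: each trial exposes its variant's name as `variant_name`;
# adjust the attribute to match the trial type in core/types.py.
trials_a = [t for t in trials if t.variant_name == variant_a.name]
trials_b = [t for t in trials if t.variant_name == variant_b.name]

# Analyze statistically
analyzer = StatisticalAnalyzer()
results = analyzer.compare_all_metrics(trials_a, trials_b)
winner, confidence, reason = analyzer.determine_winner(results)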
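
The `{question}` placeholder in each `prompt_template` lines up with the keys of the dictionaries in `test_inputs`. A minimal sketch of that relationship, assuming the templates are filled with Python's `str.format`-style substitution (an assumption, not confirmed from abprompt's source):

```python
# Templates and inputs copied from the Quick Start above.
templates = {
    "concise": "Answer briefly: {question}",
    "detailed": "Explain thoroughly: {question}",
}
test_inputs = [{"question": "What is machine learning?"}]

# Assumption: each input dict is expanded into the template's named placeholders.
prompts = {
    name: [template.format(**inputs) for inputs in test_inputs]
    for name, template in templates.items()
}

for name, rendered in prompts.items():
    print(name, "->", rendered[0])
```

If the library works this way, every key used in a template needs a matching key in each test input, otherwise `str.format` raises `KeyError`.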

Live Demo

export ANTHROPIC_API_KEY=your-key-here
python live_demo.py

Statistical Methods

| Method | Purpose |
|---|---|
| Welch's t-test | Compare means with unequal variances |
| Mann-Whitney U | Non-parametric alternative |
| Cohen's d | Effect size measurement |
| 95%/99% CI | Confidence intervals |
| Bonferroni | Multiple comparison correction |
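
To make the methods above concrete, here is a standard-library sketch of the underlying formulas. It is not abprompt's `StatisticalAnalyzer`: the function names are hypothetical and the scores are made-up illustration data. P-values are left out because the t distribution's CDF is not in the standard library.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a wins, with ties counted as half."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_comparisons

# Made-up quality scores for two variants (illustration only).
scores_a = [7.0, 8.0, 6.0, 7.0, 9.0]
scores_b = [5.0, 6.0, 5.0, 7.0, 6.0]

t, df = welch_t(scores_a, scores_b)
u = mann_whitney_u(scores_a, scores_b)
d = cohens_d(scores_a, scores_b)
alpha = bonferroni_alpha(0.05, 7)  # e.g. one test per judged dimension

print(f"t={t:.3f}, df={df:.2f}, U={u}, d={d:.2f}, per-test alpha={alpha:.4f}")
```

Welch's test does not pool the variances, which is why the degrees of freedom come out fractional. Cohen's d does pool them, so it describes the size of the difference separately from whether it is significant.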

Quality Metrics (LLM-as-Judge)

  • Quality Score (overall)
  • Relevance
  • Clarity
  • Completeness
  • Accuracy
  • Engagement
  • Creativity
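
As an illustration of how per-trial judge scores on these seven dimensions might be summarized for one variant, here is a small sketch. The dictionary keys, the 0-10 scale, and the `mean_scores` helper are assumptions for the example, not abprompt's actual data model.

```python
from statistics import mean

# Hypothetical keys for the seven judged dimensions listed above.
DIMENSIONS = [
    "quality", "relevance", "clarity", "completeness",
    "accuracy", "engagement", "creativity",
]

def mean_scores(judgements):
    """Average each dimension across a list of per-trial score dicts."""
    return {dim: mean(j[dim] for j in judgements) for dim in DIMENSIONS}

# Made-up scores for two trials of one variant (assumed 0-10 scale).
judgements = [
    {"quality": 8, "relevance": 9, "clarity": 8, "completeness": 6,
     "accuracy": 9, "engagement": 5, "creativity": 4},
    {"quality": 6, "relevance": 7, "clarity": 8, "completeness": 8,
     "accuracy": 7, "engagement": 7, "creativity": 6},
]

summary = mean_scores(judgements)
for dim in DIMENSIONS:
    print(f"{dim:<13}{summary[dim]}")
```

Per-dimension averages like these are the kind of input the statistical tests compare between variants, one test per dimension.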

Tests

pytest tests/ -v
# 178 tests, 100% pass rate

Project Structure

abprompt/
  core/
    types.py       # Data models
    runner.py      # Experiment execution
  judge/
    llm_judge.py   # LLM-as-Judge evaluation
  telemetry/
    collector.py   # Metrics collection
  analysis/
    statistics.py  # Statistical tests
    insights.py    # AI insights generation
  visualization/
    reports.py     # Report generation
tests/
  test_*.py        # Comprehensive tests

Day 9/30 AI Challenge

This project is part of my 30-day AI challenge.

License

MIT
