**Scientific A/B Testing for LLM Prompts**
Run controlled experiments on your prompts with statistical rigor. Compare variants with hypothesis testing, effect sizes, and AI-powered insights.
pip install -e .
import asyncio

from abprompt import (
    Variant, Experiment, ExperimentRunner,
    LLMJudge, StatisticalAnalyzer, Provider,
)

# Define prompt variants
variant_a = Variant(
    name="concise",
    prompt_template="Answer briefly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)
variant_b = Variant(
    name="detailed",
    prompt_template="Explain thoroughly: {question}",
    model="claude-3-5-haiku-20241022",
    provider=Provider.ANTHROPIC,
)

# Create experiment
experiment = Experiment(
    id="prompt_test",
    name="Concise vs Detailed",
    variants=[variant_a, variant_b],
    test_inputs=[{"question": "What is machine learning?"}],
    sample_size=30,
)


async def main():
    # Run with LLM-as-Judge (runner.run is a coroutine, so it must be awaited)
    runner = ExperimentRunner(judge=LLMJudge())
    trials = await runner.run(experiment)

    # Split trials by variant before comparing them.
    # NOTE: `variant_name` is an assumed attribute name; check the trial
    # model in core/types.py for the actual field.
    trials_a = [t for t in trials if t.variant_name == "concise"]
    trials_b = [t for t in trials if t.variant_name == "detailed"]

    # Analyze statistically
    analyzer = StatisticalAnalyzer()
    results = analyzer.compare_all_metrics(trials_a, trials_b)
    winner, confidence, reason = analyzer.determine_winner(results)
    print(winner, confidence, reason)


asyncio.run(main())
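The `{question}` placeholder in each `prompt_template` suggests that every entry in `test_inputs` is substituted into the template before the prompt is sent. A minimal sketch of that step, assuming `str.format`-style substitution (the source does not confirm how abprompt renders templates, and `render_prompts` is a hypothetical helper, not part of the library):

```python
# Assumption: templates use str.format-style placeholders, as the
# "{question}" syntax in the example suggests.
templates = {
    "concise": "Answer briefly: {question}",
    "detailed": "Explain thoroughly: {question}",
}
test_inputs = [{"question": "What is machine learning?"}]


def render_prompts(templates, test_inputs):
    """Yield (variant name, rendered prompt) for every template/input pair."""
    for name, template in templates.items():
        for inputs in test_inputs:
            # Each key in the input dict fills the placeholder of the same name.
            yield name, template.format(**inputs)


prompts = list(render_prompts(templates, test_inputs))
for name, prompt in prompts:
    print(f"{name}: {prompt}")
```

Keeping the test inputs identical across variants means any difference in judged scores can be attributed to the template wording rather than to the questions asked.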
export ANTHROPIC_API_KEY=your-key-here
python live_demo.py
| Method | Purpose |
|---|---|
| Welch's t-test | Compare means with unequal variances |
| Mann-Whitney U | Non-parametric alternative |
| Cohen's d | Effect size measurement |
| 95%/99% CI | Confidence intervals |
| Bonferroni | Multiple comparison correction |
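To make the methods in the table concrete, here is a standard-library sketch of each one. It illustrates the textbook formulas, not abprompt's own `StatisticalAnalyzer` code. The function names and the sample scores are hypothetical, and the confidence interval uses a normal critical value as a simplification (a t-based interval would be slightly wider for small samples).

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Return Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def mann_whitney_u(a, b):
    """Return the U statistic for sample a (ties count as one half)."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)


def mean_diff_ci(a, b, level=0.95):
    """Approximate CI for mean(a) - mean(b) using a normal critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    return diff - z * se, diff + z * se


def bonferroni(p_values, alpha=0.05):
    """Flag which p-values stay significant at alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical judge scores for two variants (illustrative numbers only)
scores_a = [7.0, 8.0, 6.0, 9.0, 7.0]
scores_b = [5.0, 6.0, 5.0, 7.0, 6.0]

t, df = welch_t(scores_a, scores_b)
d = cohens_d(scores_a, scores_b)
u = mann_whitney_u(scores_a, scores_b)
low, high = mean_diff_ci(scores_a, scores_b)
significant = bonferroni([0.01, 0.04, 0.20])  # hypothetical p-values
print(f"t={t:.3f}, df={df:.2f}, d={d:.2f}, U={u}, CI=({low:.2f}, {high:.2f})")
print(significant)
```

Turning `t` or `U` into a p-value needs the corresponding distribution function, which the standard library does not provide. A statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`, or `scipy.stats.mannwhitneyu`) is the usual route.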
pytest tests/ -v
# 178 tests, 100% pass rate
abprompt/
    core/
        types.py          # Data models
        runner.py         # Experiment execution
    judge/
        llm_judge.py      # LLM-as-Judge evaluation
    telemetry/
        collector.py      # Metrics collection
    analysis/
        statistics.py     # Statistical tests
        insights.py       # AI insights generation
    visualization/
        reports.py        # Report generation
tests/
    test_*.py             # Comprehensive tests
This project is part of my 30-day AI challenge.
License: MIT