v0.11.1 Alpha

Sample or replay
before full eval.

Driftcut is an alpha CLI for LLM migration canaries. It validates a structured prompt corpus, samples representative batches, and then either runs the baseline and candidate models live or replays historical paired outputs.

It retries transient provider failures, checks deterministic quality signals, judges only the ambiguous prompts, and tells you whether to stop, continue, or proceed before you spend on a full evaluation.


CLI-first
CSV / JSON corpus
Historical replay
Transient retries
Tiered judge escalation
Decision + HTML report
Optional Redis memory
Category scorecards

alpha run

$ driftcut run --config migration.yaml

GPT-4o to Claude Haiku migration gate
Mode: live
Baseline: openai/gpt-4o
Candidate: anthropic/claude-haiku
Corpus: 30 prompts, 4 categories

Batch 1: 12 prompts, 0 API errors, $0.18 cumulative
12/24
Judge coverage: 3/3 ambiguous prompts
Decision: CONTINUE (58% confidence)

Batch 2: 12 prompts, 0 API errors, $0.31 cumulative
24/24
Judge coverage: 4/4 ambiguous prompts
Decision: PROCEED (82% confidence)

Run complete
Prompts tested: 24/30
Total cost: $0.31
Judge cost: $0.03
Latency p50: 910ms → 690ms
Latency p95: 1480ms → 1100ms
Decision: PROCEED (82% confidence)
Output: ./driftcut-results/results.json + report.html

Tiered judging: light first, heavy on low confidence

Current Status

A real alpha migration gate.

The repository now includes deterministic checks, tiered judging that escalates from a light judge to a heavy judge when confidence is low, a decision engine with confidence and report output, a replay path for historical paired outputs, and an optional Redis memory layer for baseline caching and run-history persistence.

Shipping Now

What works today

Config and corpus validation
Stratified batch sampling by category and criticality
Concurrent baseline/candidate execution via LiteLLM
Transient retry handling for rate limits, timeouts, and 5xxs
Historical replay on canonical paired-output JSON
Deterministic checks and tiered judging (light + heavy escalation)
Latency, cost, category scorecards, decision output, and HTML reporting
Richer semantic failure archetypes beyond a generic judge-worse label
driftcut init scaffolding for instant project setup
driftcut bootstrap to classify raw prompts into a structured corpus
driftcut diff to compare two runs and see what changed
Public benchmark demo with offline replay (examples/demo/)
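
To illustrate the transient retry item above, here is a minimal sketch of retrying with exponential backoff and jitter. It is not Driftcut's implementation: `TransientError`, `call_with_retries`, and the default attempt count and delays are hypothetical names and values chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate limit, timeout, or 5xx response."""


def call_with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on each attempt; jitter spreads out concurrent retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Only errors classified as transient are retried here; any other exception propagates immediately, so genuine failures are not masked by the retry loop.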

Planned Next

What sharpens the wedge further

More production examples and case studies
Sequential hypothesis testing for formal confidence (if there is demand)

Workflow

What the app does and how it works.

Driftcut is built for teams that already have a real prompt corpus and want a cheaper first pass before a full migration evaluation.

Bring your corpus

Each prompt carries a category, a criticality level, an expected output type, and optional deterministic expectations such as required strings or JSON keys.
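
As a rough illustration of deterministic expectations, the sketch below checks an output for required strings and required JSON keys. The function name, parameter names, and failure messages are assumptions made for this example and do not reflect Driftcut's actual corpus schema or checker.

```python
import json


def deterministic_check(output, required_strings=(), required_json_keys=()):
    """Return a list of failure reasons; an empty list means the output passed."""
    failures = []
    for text in required_strings:
        if text not in output:
            failures.append(f"missing string: {text!r}")
    if required_json_keys:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return failures + ["output is not valid JSON"]
        if not isinstance(parsed, dict):
            return failures + ["output is not a JSON object"]
        for key in required_json_keys:
            if key not in parsed:
                failures.append(f"missing JSON key: {key!r}")
    return failures
```

Checks like these need no model call, which is why they can run on every prompt before any judge is involved.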

Sample strategically

The sampler builds balanced batches and prioritizes high-criticality prompts earlier, so the first slice is more informative than a naive random sample in both live and replay mode.
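
One way to picture this kind of sampling is a round-robin over categories that takes high-criticality prompts first. The sketch below is an assumption-laden illustration, not Driftcut's sampler: the `build_batches` name, the `high` / `medium` / `low` levels, and the dictionary fields are all hypothetical.

```python
import random
from collections import defaultdict

# Assumed criticality levels; lower rank is sampled earlier.
CRITICALITY_RANK = {"high": 0, "medium": 1, "low": 2}


def build_batches(prompts, batch_size, seed=0):
    """Round-robin across categories, taking high-criticality prompts first."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for prompt in prompts:
        by_category[prompt["category"]].append(prompt)
    for items in by_category.values():
        rng.shuffle(items)  # random order within a criticality level
        # Stable sort keeps the shuffled order among equal criticality.
        items.sort(key=lambda p: CRITICALITY_RANK[p["criticality"]])
    queues = [by_category[name] for name in sorted(by_category)]
    ordered = []
    while any(queues):
        for queue in queues:
            if queue:
                ordered.append(queue.pop(0))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Because each pass takes one prompt per category, the first batch covers every category it can, and within each category the most critical prompts surface first.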

Run or replay and decide

Baseline and candidate can run live for each prompt, or historical paired outputs can be replayed through the same runtime. Driftcut records latency and cost when available, judges only ambiguous prompts, and updates the migration decision after each batch.
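
The selective, tiered judging described here could look roughly like the sketch below: only ambiguous pairs are judged, and the heavy judge runs only when the light judge's confidence is low. The function names, the judge callables returning `(verdict, confidence)`, and the 0.7 threshold are assumptions for illustration rather than Driftcut's real interfaces or defaults.

```python
def tiered_judge(pair, light_judge, heavy_judge, min_confidence=0.7):
    """Run the light judge first; escalate to the heavy judge on low confidence."""
    verdict, confidence = light_judge(pair)
    if confidence >= min_confidence:
        return {"verdict": verdict, "confidence": confidence, "tier": "light"}
    # Light judge was unsure, so pay for the heavier judge on this pair only.
    verdict, confidence = heavy_judge(pair)
    return {"verdict": verdict, "confidence": confidence, "tier": "heavy"}


def judge_batch(pairs, is_ambiguous, light_judge, heavy_judge):
    """Judge only the pairs flagged as ambiguous; skip the rest."""
    return {
        pair["id"]: tiered_judge(pair, light_judge, heavy_judge)
        for pair in pairs
        if is_ambiguous(pair)
    }
```

Passing the judges in as callables keeps the escalation logic independent of any particular provider and makes it easy to test with stubs.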

Positioning

The wedge is still strong.

Eval frameworks answer "how good is this model?" Driftcut is aimed at the earlier question: "should I keep spending money testing this migration candidate?"

	Typical eval tooling	Driftcut direction
Primary question	How good is this model overall?	Should I keep testing this migration candidate?
Current alpha	Quality measurement	Sampling or replay + tiered semantic judging + migration decisions
Next milestone	More benchmarks and scoring	Real-world case studies and packaging polish
Output shape	Metrics to interpret	STOP / CONTINUE / PROCEED + HTML report

FAQ

Common questions

Is this a replacement for full evaluation?

No. It is the step before a full evaluation. Driftcut helps you run a cheaper first pass on real prompts before you invest in the larger comparison.

Do I need labeled benchmarks?

No ground-truth labels are required today, but you do need a structured prompt corpus. This project assumes the benchmark already exists in your product context.

What is implemented right now?

Validation, scaffolding, corpus bootstrap from raw prompts, run comparison, sampling, paired model execution, canonical replay loading, deterministic checks, tiered judging (light with heavy escalation) for ambiguous prompts, latency tracking, cost tracking, optional Redis-backed baseline caching, decision output, JSON export, and HTML reporting. Want to see it on a real cost-cutting decision? The examples/demo walkthrough reproduces a gpt-4o migration against two cheaper candidates with offline replay.

What does a run cost?

Today the cost is the sampled baseline and candidate model calls plus optional judge cost for ambiguous prompts only. The judge is designed to be selective, not universal.
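
The cost structure above can be written as simple arithmetic: paired model calls for every sampled prompt, plus judge calls for the ambiguous subset. The sketch below assumes flat per-call prices, which is a simplification, and the function and parameter names are hypothetical.

```python
def estimate_run_cost(sampled_prompts, ambiguous_prompts,
                      baseline_cost, candidate_cost, judge_cost):
    """Estimate run cost in dollars from assumed flat per-call prices."""
    # Every sampled prompt is sent to both the baseline and the candidate.
    model_total = sampled_prompts * (baseline_cost + candidate_cost)
    # Only ambiguous prompts incur a judge call.
    judge_total = ambiguous_prompts * judge_cost
    return model_total + judge_total
```

In the sample run shown earlier, the same split applies: a $0.31 total with $0.03 of judge cost leaves $0.28 for the baseline and candidate calls.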
