v0.11.1 Alpha

Sample or replay before full eval.

Driftcut is an alpha CLI for LLM migration canaries. It validates a structured prompt corpus, samples representative batches, and then either runs the baseline and candidate models live or replays historical paired outputs.

It retries transient provider failures, checks deterministic quality signals, judges only the ambiguous prompts, and tells you whether to stop, continue, or proceed before you spend on a full evaluation.

CLI-first · CSV / JSON corpus · Historical replay · Transient retries · Tiered judge escalation · Decision + HTML report · Optional Redis memory · Category scorecards

alpha run

    $ driftcut run --config migration.yaml

    GPT-4o to Claude Haiku migration gate
      Mode: live
      Baseline: openai/gpt-4o
      Candidate: anthropic/claude-haiku
      Corpus: 30 prompts, 4 categories

    Batch 1: 12 prompts, 0 API errors, $0.18 cumulative
      12/24
      Judge coverage: 3/3 ambiguous prompts
      Decision: CONTINUE (58% confidence)

    Batch 2: 12 prompts, 0 API errors, $0.31 cumulative
      24/24
      Judge coverage: 4/4 ambiguous prompts
      Decision: PROCEED (82% confidence)

    Run complete
      Prompts tested: 24/30
      Total cost: $0.31
      Judge cost: $0.03
      Latency p50: 910ms → 690ms
      Latency p95: 1480ms → 1100ms
      Decision: PROCEED (82% confidence)
      Output: ./driftcut-results/results.json + report.html

Tiered judging: light first, heavy on low confidence

A real alpha migration gate.

The repository now includes deterministic checks, tiered judging that escalates from a light judge to a heavy judge when confidence is low, a decision engine with confidence and report output, a replay path for historical paired outputs, and an optional Redis memory layer for baseline caching and run-history persistence.
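
The escalation idea can be sketched in a few lines. This is a minimal illustration, not Driftcut's actual implementation: the `Verdict` type, the `tiered_judge` function, the judge call signature, and the `min_confidence` threshold of 0.7 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    label: str         # hypothetical labels, e.g. "better", "same", "worse"
    confidence: float  # 0.0 to 1.0, as reported by the judge


# A judge takes (prompt, baseline_output, candidate_output) and returns a Verdict.
Judge = Callable[[str, str, str], Verdict]


def tiered_judge(
    prompt: str,
    baseline_out: str,
    candidate_out: str,
    light: Judge,
    heavy: Judge,
    min_confidence: float = 0.7,  # assumed threshold, not a Driftcut default
) -> Tuple[Verdict, str]:
    """Ask the light judge first; call the heavy judge only if it is unsure."""
    verdict = light(prompt, baseline_out, candidate_out)
    if verdict.confidence >= min_confidence:
        return verdict, "light"
    # Low confidence: pay for the heavy judge and use its verdict instead.
    return heavy(prompt, baseline_out, candidate_out), "heavy"
```

The second element of the returned tuple records which tier produced the verdict, which makes it easy to report how often escalation happened.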

Shipping Now

What works today

  • Config and corpus validation
  • Stratified batch sampling by category and criticality
  • Concurrent baseline/candidate execution via LiteLLM
  • Transient retry handling for rate limits, timeouts, and 5xxs
  • Historical replay on canonical paired-output JSON
  • Deterministic checks and tiered judging (light + heavy escalation)
  • Latency, cost, category scorecards, decision output, and HTML reporting
  • Richer semantic failure archetypes beyond a generic judge-worse label
  • driftcut init scaffolding for instant project setup
  • driftcut bootstrap to classify raw prompts into a structured corpus
  • driftcut diff to compare two runs and see what changed
  • Public benchmark demo with offline replay (examples/demo/)
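
As an illustration of the transient retry item above, here is a generic exponential-backoff sketch. It does not show Driftcut's or LiteLLM's internals: `TransientError`, `call_with_retries`, and the default attempt count and delays are hypothetical names and values chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for rate limits, timeouts, and 5xx responses."""


def call_with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            # Jitter spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Non-transient errors are deliberately not caught, so they fail fast.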

Planned Next

What sharpens the wedge further

  • More production examples and case studies
  • Sequential hypothesis testing for formal confidence (if demand)

What the app does and how it works.

Driftcut is built for teams that already have a real prompt corpus and want a cheaper first pass before a full migration evaluation.

01

Bring your corpus

Each prompt carries a category, a criticality level, an expected output type, and optional deterministic expectations such as required strings or JSON keys.
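
To make the shape of a corpus entry concrete, the sketch below models one prompt and its deterministic expectations. The field names (`required_strings`, `required_json_keys`, and so on) and the `deterministic_failures` helper are assumptions for illustration; the real corpus schema may use different names.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptRecord:
    # All field names here are hypothetical, not Driftcut's documented schema.
    id: str
    prompt: str
    category: str
    criticality: str           # e.g. "low", "medium", "high"
    expected_output_type: str  # e.g. "text" or "json"
    required_strings: List[str] = field(default_factory=list)
    required_json_keys: List[str] = field(default_factory=list)


def deterministic_failures(record: PromptRecord, output: str) -> List[str]:
    """Return a list of failed expectations; an empty list means all checks passed."""
    failures = [f"missing string: {s}" for s in record.required_strings if s not in output]
    if record.expected_output_type == "json":
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return failures + ["invalid JSON"]
        if not isinstance(parsed, dict):
            return failures + ["JSON is not an object"]
        failures += [f"missing key: {k}" for k in record.required_json_keys if k not in parsed]
    return failures
```

Checks like these need no model call, so they can run on every prompt before any judge is involved.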

02

Sample strategically

The sampler builds balanced batches and prioritizes high-criticality prompts earlier, so the first slice is more informative than a naive random sample in both live and replay mode.
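
One way to get this behavior is to order each category by criticality and then interleave the categories. The sketch below is an assumption about how such a sampler could work, not Driftcut's actual algorithm; `build_batches`, the `RANK` table, and the dict keys are hypothetical.

```python
import random
from collections import defaultdict
from itertools import zip_longest

# Hypothetical criticality levels; lower rank is sampled earlier.
RANK = {"high": 0, "medium": 1, "low": 2}


def build_batches(prompts, batch_size, seed=0):
    """Split prompts into batches that mix categories, most critical first."""
    rng = random.Random(seed)  # seeded so the same corpus yields the same batches
    by_category = defaultdict(list)
    for p in prompts:
        by_category[p["category"]].append(p)

    queues = []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)  # random order within a criticality level
        items.sort(key=lambda p: RANK[p["criticality"]])  # stable: keeps the shuffle
        queues.append(items)

    # Round-robin across categories: take the top item of each, then the next, ...
    ordered = [p for row in zip_longest(*queues) for p in row if p is not None]
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

With two categories and a batch size of two, the first batch holds the highest-criticality prompt from each category, which is the property the paragraph above describes.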

03

Run or replay and decide

Baseline and candidate can run live for each prompt, or historical paired outputs can be replayed through the same runtime. Driftcut records latency and cost when available, judges only ambiguous prompts, and updates the migration decision after each batch.
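
The per-batch decision can be pictured as a small rule over cumulative counts. The sketch below is purely illustrative: the `decide` function, its inputs, and the thresholds (`stop_rate`, `proceed_rate`, `min_tested`) are invented for the example and do not describe how Driftcut computes its decision or its confidence figure.

```python
def decide(tested, regressions, critical_regressions,
           stop_rate=0.20, proceed_rate=0.05, min_tested=20):
    """Return "STOP", "CONTINUE", or "PROCEED" from cumulative batch results.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if tested == 0:
        return "CONTINUE"  # nothing observed yet
    rate = regressions / tested
    # Any regression on a high-criticality prompt, or too many overall: stop early.
    if critical_regressions > 0 or rate >= stop_rate:
        return "STOP"
    # Enough evidence and few regressions: worth a full evaluation.
    if tested >= min_tested and rate <= proceed_rate:
        return "PROCEED"
    return "CONTINUE"
```

Calling such a rule after every batch is what lets a run end early, either because the candidate is clearly failing or because it has already earned a full evaluation.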


The wedge is still strong.

Eval frameworks answer "how good is this model?" Driftcut is aimed at the earlier question: "should I keep spending money testing this migration candidate?"

                  | Typical eval tooling             | Driftcut direction
------------------+----------------------------------+-------------------------------------------------------------------
Primary question  | How good is this model overall?  | Should I keep testing this migration candidate?
Current alpha     | Quality measurement              | Sampling or replay + tiered semantic judging + migration decisions
Next milestone    | More benchmarks and scoring      | Real-world case studies and packaging polish
Output shape      | Metrics to interpret             | STOP / CONTINUE / PROCEED + HTML report

Common questions

Is this a replacement for full evaluation?

No. It is the step before a full evaluation. Driftcut helps you run a cheaper first pass on real prompts before you invest in the larger comparison.

Do I need labeled benchmarks?

No ground-truth labels are required today, but you do need a structured prompt corpus. This project assumes the benchmark already exists in your product context.

What is implemented right now?

Validation, scaffolding, corpus bootstrap from raw prompts, run comparison, sampling, paired model execution, canonical replay loading, deterministic checks, tiered judging (light with heavy escalation) for ambiguous prompts, latency tracking, cost tracking, optional Redis-backed baseline caching, decision output, JSON export, and HTML reporting.

Want to see it on a real cost-cut decision? The examples/demo walkthrough reproduces a gpt-4o migration against two cheaper candidates with offline replay.

What does a run cost?

Today the cost is the sampled baseline and candidate model calls plus optional judge cost for ambiguous prompts only. The judge is designed to be selective, not universal.
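
The cost structure described above reduces to simple arithmetic. The helper below is a back-of-the-envelope sketch; `estimate_run_cost` and its per-prompt cost parameters are hypothetical, and real costs depend on token counts and provider pricing.

```python
def estimate_run_cost(sampled_prompts, baseline_cost_per_prompt,
                      candidate_cost_per_prompt, ambiguous_prompts,
                      judge_cost_per_prompt):
    """Estimate run cost: paired model calls for every sampled prompt,
    plus judge calls for the ambiguous subset only."""
    model_cost = sampled_prompts * (baseline_cost_per_prompt + candidate_cost_per_prompt)
    judge_cost = ambiguous_prompts * judge_cost_per_prompt
    return {"model": model_cost, "judge": judge_cost, "total": model_cost + judge_cost}


# Illustrative numbers only: 24 sampled prompts, 7 of them sent to a judge.
estimate = estimate_run_cost(24, 0.008, 0.003, 7, 0.004)
```

Because the judge term scales with the ambiguous subset rather than the whole sample, keeping the judge selective is what keeps the total close to the cost of the paired model calls alone.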