Driftcut is an alpha CLI for LLM migration canaries. It validates a structured prompt corpus, samples representative batches, runs baseline and candidate models or replays historical paired outputs, retries transient provider failures, checks deterministic quality signals, judges only the ambiguous prompts, and tells you whether to stop, continue, or proceed before you spend on a full evaluation.
The repository now includes deterministic checks, tiered judging that escalates from a light judge to a heavy judge when confidence is low, a decision engine with confidence and report output, a replay path for historical paired outputs, and an optional Redis memory layer for baseline caching and run-history persistence.
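The light-to-heavy escalation can be pictured with a minimal sketch. Everything here is hypothetical and for illustration only: the `Verdict` type, the `tiered_judge` function, the label names, and the `min_confidence` threshold are assumptions, not Driftcut's actual API or defaults.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    label: str           # assumed labels, e.g. "equivalent" or "regression"
    confidence: float    # 0.0 to 1.0
    tier: str = "light"  # which judge produced the verdict


# A judge takes (prompt, baseline_output, candidate_output) and returns a Verdict.
Judge = Callable[[str, str, str], Verdict]


def tiered_judge(prompt: str, baseline: str, candidate: str,
                 light: Judge, heavy: Judge,
                 min_confidence: float = 0.8) -> Verdict:
    """Ask the light judge first; call the heavy judge only when it is unsure."""
    verdict = light(prompt, baseline, candidate)
    if verdict.confidence >= min_confidence:
        return verdict
    # Light judge was not confident enough (hypothetical threshold): escalate.
    escalated = heavy(prompt, baseline, candidate)
    escalated.tier = "heavy"
    return escalated
```

In a sketch like this the heavy judge's cost is paid only for the prompts where the light judge falls below the threshold, which is the point of tiering.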
- driftcut init scaffolding for instant project setup
- driftcut bootstrap to classify raw prompts into a structured corpus
- driftcut diff to compare two runs and see what changed
- Demo walkthrough (examples/demo/)

Driftcut is built for teams that already have a real prompt corpus and want a cheaper first pass before a full migration evaluation.
Each prompt carries a category, a criticality level, an expected output type, and optional deterministic expectations such as required strings or JSON keys.
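As a rough illustration of what such a record and its deterministic expectations might look like, here is a sketch in Python. The field names (`required_strings`, `required_json_keys`, and so on) and the `deterministic_failures` helper are assumptions made for this example; Driftcut's real corpus schema may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PromptRecord:
    # Hypothetical field names, not Driftcut's actual schema.
    prompt: str
    category: str
    criticality: str           # e.g. "low", "medium", "high"
    expected_output_type: str  # e.g. "text" or "json"
    required_strings: list[str] = field(default_factory=list)
    required_json_keys: list[str] = field(default_factory=list)


def deterministic_failures(record: PromptRecord, output: str) -> list[str]:
    """Return human-readable failures; an empty list means the output passed."""
    failures = []
    for needle in record.required_strings:
        if needle not in output:
            failures.append(f"missing required string: {needle!r}")
    if record.expected_output_type == "json":
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return failures + ["output is not valid JSON"]
        if not isinstance(parsed, dict):
            return failures + ["JSON output is not an object"]
        for key in record.required_json_keys:
            if key not in parsed:
                failures.append(f"missing required JSON key: {key!r}")
    return failures
```

Checks of this kind need no model call, so they can settle the clear passes and failures cheaply and leave only the ambiguous outputs for a judge.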
The sampler builds balanced batches and prioritizes high-criticality prompts earlier, so the first slice is more informative than a naive random sample in both live and replay mode.
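One way to get batches that are both balanced and criticality-first is to sort within each category and then interleave the categories. The sketch below shows that idea under stated assumptions: the prompt dictionaries, the three criticality levels, and the `build_batches` name are all hypothetical, and Driftcut's actual sampler may work differently.

```python
import random
from collections import defaultdict
from itertools import zip_longest

CRITICALITY_RANK = {"high": 0, "medium": 1, "low": 2}  # assumed levels


def build_batches(prompts, batch_size, seed=0):
    """prompts: list of dicts with "category" and "criticality" keys (assumed shape)."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in prompts:
        by_category[p["category"]].append(p)

    # Within each category: shuffle, then stable-sort so high criticality comes first.
    for items in by_category.values():
        rng.shuffle(items)
        items.sort(key=lambda p: CRITICALITY_RANK[p["criticality"]])

    # Round-robin across categories so each batch mixes categories.
    ordered = [
        p
        for row in zip_longest(*(by_category[c] for c in sorted(by_category)))
        for p in row
        if p is not None
    ]
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

The seeded shuffle keeps runs reproducible while avoiding a fixed order among prompts of equal criticality.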
Baseline and candidate can run live for each prompt, or historical paired outputs can be replayed through the same runtime. Driftcut records latency and cost when available, judges only ambiguous prompts, and updates the migration decision after each batch.
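The per-batch decision update can be sketched as a small function that is re-evaluated on all results seen so far. The thresholds, the `min_prompts` floor, and the `decide` function itself are hypothetical illustrations rather than Driftcut's real decision engine, which also reports a confidence that this sketch omits.

```python
def decide(results_so_far, stop_failure_rate=0.30,
           proceed_failure_rate=0.05, min_prompts=20):
    """results_so_far: list of booleans, True when the candidate failed a prompt.

    All thresholds are assumed values chosen for illustration.
    """
    seen = len(results_so_far)
    if seen == 0:
        return "CONTINUE"
    failure_rate = sum(results_so_far) / seen
    if failure_rate >= stop_failure_rate:
        return "STOP"      # candidate is failing too often to justify more spend
    if seen >= min_prompts and failure_rate <= proceed_failure_rate:
        return "PROCEED"   # enough evidence to move on to the full evaluation
    return "CONTINUE"      # not enough evidence yet; run another batch
```

Calling a function like this after every batch is what lets a run end early: a STOP after the first batch means the remaining batches are never paid for.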
Eval frameworks answer "how good is this model?" Driftcut is aimed at the earlier question: "should I keep spending money testing this migration candidate?"
| | Typical eval tooling | Driftcut direction |
|---|---|---|
| Primary question | How good is this model overall? | Should I keep testing this migration candidate? |
| Current alpha | Quality measurement | Sampling or replay + tiered semantic judging + migration decisions |
| Next milestone | More benchmarks and scoring | Real-world case studies and packaging polish |
| Output shape | Metrics to interpret | STOP / CONTINUE / PROCEED + HTML report |
No. It is the step before a full evaluation. Driftcut helps you run a cheaper first pass on real prompts before you invest in the larger comparison.
No ground-truth labels are required today, but you do need a structured prompt corpus. This project assumes the benchmark already exists in your product context.
Validation, scaffolding, corpus bootstrap from raw prompts, run comparison, sampling, paired model execution, canonical replay loading, deterministic checks, tiered judging (light with heavy escalation) for ambiguous prompts, latency tracking, cost tracking, optional Redis-backed baseline caching, decision output, JSON export, and HTML reporting. Want to see it on a real cost-cut decision? The examples/demo walkthrough reproduces a gpt-4o migration against two cheaper candidates with offline replay.
Today the cost is the sampled baseline and candidate model calls plus optional judge cost for ambiguous prompts only. The judge is designed to be selective, not universal.
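The cost structure described above reduces to simple arithmetic. The helper below is a back-of-the-envelope sketch with hypothetical parameter names; it assumes a flat per-prompt cost for each model and ignores any savings from baseline caching.

```python
def canary_cost(sampled_prompts, baseline_cost_per_prompt,
                candidate_cost_per_prompt, ambiguous_prompts=0,
                judge_cost_per_prompt=0.0):
    """Estimate canary spend: paired model calls plus judging of ambiguous prompts only."""
    model_cost = sampled_prompts * (baseline_cost_per_prompt + candidate_cost_per_prompt)
    judge_cost = ambiguous_prompts * judge_cost_per_prompt
    return model_cost + judge_cost
```

Because the judge term scales with the number of ambiguous prompts rather than the whole sample, the estimate stays close to the cost of the paired model calls when most prompts are settled by deterministic checks.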