How to evaluate a new AI model release in one afternoon

Model releases used to be annual events. Now something ships every few weeks, each launch arrives with benchmark charts, and every one triggers the same client question: should we switch?

You cannot afford a quarter-long evaluation every time. You also shouldn't switch on vibes — or stay put on inertia. Here's the process we actually run, which fits in an afternoon per workload. Nothing here requires an ML background; it requires discipline and about four hours.

Prerequisite: a golden set (build once, reuse forever)

The one-time investment that makes afternoon evaluations possible: a golden set — 30 to 50 real examples from your actual workload, with known-correct answers.

For an extraction pipeline, that's 40 real invoices and the correct fields for each. For an internal assistant, 30 real questions your team has asked, with answers a domain expert has signed off on. Use real examples, including the ugly ones: the scanned-at-an-angle invoice, the ambiguous question. The vendor's demo inputs are the ones everything works on; your golden set exists to represent the inputs that break things.
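
As a rough illustration of what a golden set can look like on disk, here is a minimal sketch that stores one example per line as JSON. The field names (example_id, input_text, expected, notes) and the sample invoice values are our own assumptions for illustration, not a required schema; use whatever matches your workload.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GoldenExample:
    example_id: str   # stable ID so runs can be compared over time
    input_text: str   # the real input, ugly ones included
    expected: dict    # known-correct answer, signed off by a domain expert
    notes: str = ""   # why this example is in the set


def to_jsonl(examples):
    """Serialize examples to JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(ex)) for ex in examples)


def from_jsonl(text):
    """Parse JSON Lines back into GoldenExample objects, skipping blank lines."""
    return [GoldenExample(**json.loads(line))
            for line in text.splitlines() if line.strip()]


# Hypothetical entry for an invoice-extraction workload.
golden = [
    GoldenExample(
        example_id="inv-001",
        input_text="<OCR text of a real invoice>",
        expected={"vendor": "Example Supplies Ltd", "total": "118.40"},
        notes="scanned at an angle",
    )
]
```

One line per example keeps the file easy to diff and review when the domain expert corrects an answer.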

If you take one thing from this post: benchmarks measure the model on their distribution. The golden set measures it on yours. Only one of those pays your invoices.

The afternoon, hour by hour

Hour 1 — Run both, change nothing else. Same prompts, same settings, golden set through the current model and the new one. Resist the urge to also "improve" the prompt mid-test — change one variable. Log outputs, latency, and token usage for every example. (The token counts matter: new models sometimes tokenize differently, so the same document can cost a different amount even at the same per-token price.)
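
A runner for this step can be very small. The sketch below assumes a hypothetical call_model(model, prompt) function that you supply and that returns the output text plus input and output token counts; it stands in for whatever client your vendor provides and is not a real library call. The per-million-token prices passed to cost_usd are placeholders as well.

```python
import time
from dataclasses import dataclass


@dataclass
class RunRecord:
    example_id: str
    model: str
    output: str
    latency_s: float
    input_tokens: int
    output_tokens: int


def run_golden_set(examples, model, call_model):
    """Run every (example_id, prompt) pair through one model and log the results.

    call_model is a hypothetical callable you provide:
    call_model(model, prompt) -> (output_text, input_tokens, output_tokens)
    """
    records = []
    for example_id, prompt in examples:
        start = time.perf_counter()
        output, in_tok, out_tok = call_model(model, prompt)
        latency = time.perf_counter() - start
        records.append(RunRecord(example_id, model, output, latency, in_tok, out_tok))
    return records


def cost_usd(record, price_in_per_million, price_out_per_million):
    """Cost of one example from logged token counts and placeholder prices."""
    return (record.input_tokens * price_in_per_million
            + record.output_tokens * price_out_per_million) / 1_000_000
```

Call run_golden_set twice with identical examples, once per model name, and keep both lists of records. Computing cost from the logged token counts, rather than from the price sheet alone, is what surfaces a tokenizer change.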

Hour 2 — Grade blind. Put the outputs side by side with the model names hidden, and score each against the known-correct answer. Blind matters more than it sounds — everyone, including us, unconsciously grades the shiny new thing generously. For extraction-type work the grading is mechanical (fields match or they don't). For judgment-type work, have the person who'd catch the error in production do the grading.
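
One way to keep the grading blind is to shuffle which model appears as "A" and which as "B" for each example, and keep the answer key in a separate structure that the grader never sees. The sketch below does that, and adds a mechanical field-match score for extraction-type work. The function names and the exact-equality comparison are assumptions for illustration; real fields often need normalizing (whitespace, number formats) before comparison.

```python
import random


def blind_sheet(outputs_current, outputs_new, seed=0):
    """Build a grading sheet with model names hidden.

    Both arguments map example_id -> output text. Returns (sheet, key):
    the sheet goes to the grader, the key stays hidden until grading is done.
    """
    rng = random.Random(seed)
    sheet, key = [], {}
    for example_id in sorted(outputs_current):
        pair = [("current", outputs_current[example_id]),
                ("new", outputs_new[example_id])]
        rng.shuffle(pair)  # randomize left/right position per example
        sheet.append({"example_id": example_id, "A": pair[0][1], "B": pair[1][1]})
        key[example_id] = {"A": pair[0][0], "B": pair[1][0]}
    return sheet, key


def field_score(expected, actual):
    """Fraction of expected fields that the output matched exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for name, value in expected.items() if actual.get(name) == value)
    return hits / len(expected)
```

Shuffling per example, rather than once for the whole sheet, stops a grader from working out after a few rows that "B" is the new model.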

Hour 3 — Read the changelog like an engineer, not a marketer. Capability is only half a release. The other half is behavior and terms: parameters removed or renamed, defaults changed, new refusal behavior, latency profile, data-retention and privacy terms. Recent releases have changed all of these — including, in one case, a retention requirement that rules the model out for zero-retention workloads entirely. A model that's 3% better and breaks your error handling is a downgrade.

Hour 4 — Apply the decision rule. Ours is boring on purpose:

Switch only if the new model is measurably better on your golden set, or equal and meaningfully cheaper or faster — and the operational changes (API behavior, terms, latency) are ones you've explicitly accepted.
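
Written down as code, the rule might look like the sketch below. The thresholds are placeholders we picked for illustration (at least two more correct examples to count as "measurably better", at least 20% cheaper or faster to count as "meaningfully"); set your own before the run, not after you have seen the results.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    correct: int             # golden-set examples graded correct
    cost_usd: float          # total cost of the run
    median_latency_s: float  # median per-example latency


def should_switch(current, new, ops_changes_accepted,
                  min_extra_correct=2, min_saving=0.20):
    """Apply the decision rule. Both thresholds are illustrative assumptions."""
    # API behavior, terms, and latency changes must be explicitly accepted first.
    if not ops_changes_accepted:
        return False
    delta = new.correct - current.correct
    if delta >= min_extra_correct:
        return True  # measurably better on the golden set
    if delta < 0:
        return False  # worse: no price cut rescues it under this rule
    cheaper = new.cost_usd <= current.cost_usd * (1 - min_saving)
    faster = new.median_latency_s <= current.median_latency_s * (1 - min_saving)
    return cheaper or faster  # equal quality, meaningfully cheaper or faster
```

Counting correct examples instead of comparing percentages keeps the rule honest on a 30-to-50-example set, where a single example is worth two or three percentage points.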

"Better on benchmarks," "everyone's talking about it," and "the old one feels dated" do not appear in the rule. The second clause is what pays for the whole habit: several times now, the afternoon's answer has been "the cheaper tier now clears the bar," which makes this an evaluation that saves money rather than spends it.

What this costs and what it buys

Building the golden set the first time: about a day, mostly a domain expert's time. Each evaluation after that: an afternoon, most of it waiting for runs. What it buys: every "should we switch?" conversation ends with evidence instead of opinions. It's some of the highest-leverage time in an AI system's whole lifecycle.

This harness — golden set, runner, grading sheet — ships as part of every build we do, because a system you can't re-evaluate is a system you can't maintain. If you want help setting one up for a workload you already run, or want an honest read on whether the latest release changes anything for you, the free AI Review is the easiest place to start.
