What it actually costs to run an open model on your own hardware

There's a confident claim doing the rounds: "just self-host an open model, it's basically free." There's an equally confident reply: "self-hosting is a false economy, the cloud is always cheaper." Both are wrong as stated, because both skip the measurement.

So we measured. We ran two open models across three hardware setups, pushed real workloads through them, and recorded throughput, latency, and the number that actually matters to a business — cost per million tokens. Here's everything, including the setup where self-hosting clearly doesn't make sense.

The setups

We tested three configurations a small or mid-sized business might realistically choose:

The mini-PC — a £600 small-form-factor machine with an integrated NPU and 32GB of unified memory. The "can we just put it in a cupboard" option.
The single GPU — a used 24GB consumer GPU in a basic workstation, all-in around £1,400. The workhorse most self-hosting starts with.
The cloud instance — a rented GPU instance, billed by the hour, as the baseline everyone compares against.

The models: a small instruction-tuned model in the 7–8B range, and a mid-size one around 30B (quantized). Both open-weight, both things we deploy for clients regularly.

What we measured, and how

We didn't benchmark with a single tidy prompt. We replayed a realistic mix: short classification calls, medium retrieval-augmented answers (the bread and butter of an internal assistant), and a few long-document summaries. That mix matters — throughput on a 50-token reply tells you almost nothing about behaviour on a 2,000-token one.
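
To make the idea of a mixed workload concrete, here is a minimal sketch of how a replay mix could be described and sampled. It is not our actual harness. The request names, the weights, and the 400-token figure are hypothetical placeholders; only the 50-token and 2,000-token sizes echo the examples above.

```python
import random

# Hypothetical mix: weights and output sizes are illustrative placeholders,
# not the proportions used in the tests described in this article.
WORKLOAD_MIX = {
    "classification": {"weight": 0.6, "output_tokens": 50},
    "rag_answer": {"weight": 0.3, "output_tokens": 400},
    "long_summary": {"weight": 0.1, "output_tokens": 2000},
}


def sample_requests(n: int, seed: int = 0) -> list[str]:
    """Draw n request types at random, in proportion to their weights."""
    rng = random.Random(seed)  # seeded so a replay is repeatable
    kinds = list(WORKLOAD_MIX)
    weights = [WORKLOAD_MIX[kind]["weight"] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=n)


def mean_output_tokens() -> float:
    """Weighted average reply length across the mix."""
    total_weight = sum(spec["weight"] for spec in WORKLOAD_MIX.values())
    weighted = sum(
        spec["weight"] * spec["output_tokens"] for spec in WORKLOAD_MIX.values()
    )
    return weighted / total_weight
```

The point of the weighted average is that a handful of long summaries can dominate the token count even when short calls dominate the request count.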

For each setup we recorded tokens per second under sustained load, time-to-first-token (what a user actually feels as "lag"), and then converted everything to cost per million output tokens using real local power prices and real cloud hourly rates.
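
For anyone who wants to redo the conversion with their own figures, here is a minimal sketch of the arithmetic. It is not our exact tooling. It assumes the hardware is written off evenly over a fixed number of years and draws constant power, and every number in the usage lines apart from the £1,400 purchase price is a made-up placeholder.

```python
def owned_hourly_cost(
    hardware_price: float,
    lifetime_years: float,
    power_watts: float,
    price_per_kwh: float,
) -> float:
    """Hourly cost of an owned machine: straight-line write-off plus power.

    Assumes the machine is powered around the clock at a constant draw.
    """
    amortised = hardware_price / (lifetime_years * 365 * 24)
    power = (power_watts / 1000) * price_per_kwh
    return amortised + power


def cost_per_million_tokens(
    hourly_cost: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Cost per million output tokens for a machine billed (or costed) by the hour.

    utilization is the fraction of each hour spent generating tokens.
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("tokens_per_second must be > 0 and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Placeholder inputs: 3-year write-off, 350 W draw, 0.25 per kWh, 40 tokens/s.
hourly = owned_hourly_cost(1400, 3, 350, 0.25)
print(round(cost_per_million_tokens(hourly, 40, utilization=0.5), 2))
```

The same cost_per_million_tokens function works for a rented instance: pass its hourly rate instead of the owned figure.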

The numbers

For the small model, all three setups were genuinely usable for an internal tool serving a handful of concurrent users. The mini-PC handled the 7–8B model comfortably for one to three concurrent users: fine for a small team's internal assistant, not fine for a customer-facing service. The single GPU pulled well ahead on throughput and held its latency steady as concurrency climbed.

For the mid-size model, the mini-PC struggled — usable for a single patient user, painful under any real load. The single GPU was the sweet spot. The cloud instance was fastest of all, as you'd expect, because you're renting far more silicon.

On cost per million tokens, the picture inverts depending on utilization, and this is the part the confident claims miss:

At low, bursty usage (a few thousand queries a day) the cloud option was cheapest per token, because you only pay while you're using it. The owned hardware sits idle most of the day, and idle hardware is pure cost.
At steady, high usage — the hardware running most of the working day — the owned setups pulled decisively ahead. The single GPU's cost per million tokens dropped well below the cloud rate once it was busy more than roughly a third of the time.
The mini-PC's cost per token was excellent when it could keep up at all — its ceiling is throughput, not economics.
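
The inversion described above comes down to one ratio. Here is a minimal sketch, assuming the owned machine costs the same per hour whether busy or idle and the cloud has a flat effective price per million tokens. The function and parameter names are ours for illustration, and the inputs in the usage line are placeholders, not our measured figures.

```python
def break_even_utilization(
    owned_hourly_cost: float,
    tokens_per_second: float,
    cloud_price_per_million: float,
) -> float:
    """Fraction of the time an owned machine must be busy to match the cloud.

    A result above 1.0 means the owned machine cannot match the cloud price
    even when running flat out.
    """
    if tokens_per_second <= 0 or cloud_price_per_million <= 0:
        raise ValueError("throughput and cloud price must be positive")
    # Cost per million tokens if the machine never sat idle.
    full_load_cost = owned_hourly_cost / (tokens_per_second * 3600) * 1_000_000
    # Idle time scales that cost by 1 / utilization, so the break-even
    # point is simply the ratio of the two prices.
    return full_load_cost / cloud_price_per_million


# Placeholder inputs: 0.14 per hour owned, 40 tokens/s, cloud at 3.00 per million.
print(round(break_even_utilization(0.14, 40, 3.00), 2))
```

Below the returned fraction, idle hours make the owned machine the more expensive option per token; above it, the owned machine wins.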

Where self-hosting stops making sense

Three honest situations where we tell clients to stay on a cloud API:

Low or unpredictable volume. If your AI feature gets used in bursts, you're paying to keep a machine warm for nothing. Rent the compute.
You need frontier-model quality. No 30B open model matches the best hosted frontier models on hard reasoning. If the task genuinely needs that, self-hosting the wrong-sized model to save money is a false economy of a different kind.
You have no one to run it. A box in a cupboard needs patching, monitoring, and a plan for when it falls over. That's an operating cost too, and it's the one people forget. (It's also, not coincidentally, what our managed retainers exist to cover.)

Where it makes complete sense

And the flip side — when we reach for self-hosting without hesitation:

Steady, predictable volume. High utilization is where owned hardware wins on pure cost, often dramatically.
The data can't leave. This is the big one, and it's not about money at all. For a law firm or a clinic, "the cloud is 20% cheaper per token" is irrelevant if sending the data out is off the table. Self-hosting isn't the cost-optimal choice here; it's the only choice. (More on that in our Private & Secure AI work.)
You want a fixed cost. A one-off hardware purchase and a predictable monthly power bill are easier to plan around than a usage-metered API invoice that spikes in a busy month.

The takeaway

"Should I self-host?" has no general answer. It has a specific one, and it depends on three variables: your utilization, your quality bar, and whether your data is allowed to leave. Get those three straight and the math answers itself.

If you want us to run this analysis for your actual numbers rather than our test rig's, that's a large part of what an AI Roadmap is. Or just book a free AI Review and we'll give you the back-of-envelope version in the call.