Every leaderboard tells the same story: the biggest hosted models win. But the gap that matters for most software isn’t between a local 8B model and a frontier model — it’s between “needs a network round-trip and an API key” and “runs where the data already lives.”
What local actually buys you
Privacy by default. When inference happens on the machine that holds the data, nothing leaves. There’s no data processing agreement to read, no retention policy to trust. For anything touching personal notes, medical text, or internal code, this isn’t a nice-to-have.
Predictable cost. An API bill scales with usage. A local model scales with electricity, which for a quantized 8B model on a consumer GPU is measured in cents per day. Once the hardware exists, marginal inference is nearly free.
Latency floors. A network round-trip costs 50–200 ms before the model generates a single token. Local inference starts immediately. For interactive tools — autocomplete, classification, on-device search — that difference is the whole product.
The honest trade-offs
Local isn’t free. You trade capability for control:
- An 8B quantized model will lose to a frontier model on hard reasoning. Pretending otherwise wastes everyone’s time.
- VRAM is the real constraint. A Q4 quantization of an 8B model fits in ~6 GB; a 70B model does not fit in anything most people own.
- Tooling maturity varies. Ollama makes the happy path easy; the moment you need custom sampling or batching, you’re reading llama.cpp source.
Where I’ve landed
My rule of thumb after a year of running models locally: if the task is transformation (summarize, classify, extract, rewrite), a small local model is usually enough. If the task is open-ended reasoning, be honest and reach for something bigger.
The interesting engineering isn’t picking a side. It’s building software where the local model handles 90% of requests and knows when it’s out of its depth.