Everyone benchmarks tokens per second. Almost nobody benchmarks watt-hours per response. But if you care about the sustainability of AI — and the industry’s energy curve says we should — the second number is the one that matters.
The setup
The measurement is simpler than it sounds. Use a wall-power meter (whole-system draw) or sample nvidia-smi --query-gpu=power.draw --format=csv during generation (GPU board draw only); the two are not interchangeable, so stick with one. Add a fixed prompt set and you get:
energy per response (Wh) = avg power draw (W) × generation time (s) / 3600
Run that across quantization levels and model sizes and patterns fall out quickly.
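The formula above can be sketched in a few lines of Python. This is a minimal sketch, not a finished harness: the function names (read_gpu_power_watts, energy_wh, measure) and the 0.1 s sampling interval are my own choices for illustration, and generate stands in for whatever call runs your model on one prompt. It assumes an NVIDIA GPU whose driver reports power.draw; some GPUs report "[N/A]" instead, in which case the float conversion will fail.

```python
import subprocess
import threading
import time


def read_gpu_power_watts(gpu_index: int = 0) -> float:
    """Return the board power draw in watts as reported by nvidia-smi."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            "--query-gpu=power.draw",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    # One line per selected GPU, e.g. "45.32"
    return float(out.strip().splitlines()[0])


def energy_wh(samples_w, duration_s):
    """Energy in watt-hours: average power (W) x duration (s) / 3600."""
    if not samples_w:
        raise ValueError("need at least one power sample")
    avg_w = sum(samples_w) / len(samples_w)
    return avg_w * duration_s / 3600.0


def measure(generate, interval_s=0.1, read_power=read_gpu_power_watts):
    """Run generate() while sampling power in a background thread.

    Returns (energy in Wh, generation time in seconds).
    """
    samples = []
    done = threading.Event()

    def sampler():
        # Sample first, then wait, so there is always at least one reading.
        while True:
            samples.append(read_power())
            if done.wait(interval_s):
                break

    thread = threading.Thread(target=sampler)
    start = time.perf_counter()
    thread.start()
    try:
        generate()
    finally:
        duration_s = time.perf_counter() - start
        done.set()
        thread.join()
    return energy_wh(samples, duration_s), duration_s
```

Passing read_power as a parameter makes it easy to swap in a different source (a smart plug API, for instance) or a fake reader for testing, without touching the timing logic. A simple average of samples is the same approximation the formula makes; for spiky loads, a shorter interval gives a better estimate.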
What the numbers show
Three things surprised me when I measured my own setup:
Quantization is nearly free efficiency. Dropping from FP16 to Q4_K_M cuts the weight data read per token by more than half (Q4_K_M stores a little under 5 bits per weight versus 16), and on consumer GPUs memory bandwidth is usually the bottleneck during generation. Same response, noticeably less energy, quality loss that’s hard to detect below 8B scale on transformation tasks.
Idle draw dominates light workloads. If the GPU sits at 30 W idle waiting for occasional requests, a day of light usage is mostly idle energy. Power management and model load/unload policies matter more than inference speed for intermittent workloads.
Smaller models win harder than the speed numbers suggest. A 3B model isn’t just 2–3x faster than an 8B model; it also draws less power while running. Because energy is power multiplied by time, the two savings compound, so the energy gap per response is wider than the latency gap.
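The idle-draw and model-size points are both simple arithmetic, and a rough sketch makes the proportions easier to see. Only the 30 W idle figure comes from my setup as described above; the 200 W active draw, 10 s per response, 100 responses per day, and the 2.5x / 1.25x ratios are made-up values for illustration, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600


def daily_energy_wh(idle_w, active_w, seconds_per_response, responses_per_day):
    """Split one day's GPU energy into idle and active watt-hours."""
    active_s = seconds_per_response * responses_per_day
    if active_s > SECONDS_PER_DAY:
        raise ValueError("responses do not fit in one day")
    idle_s = SECONDS_PER_DAY - active_s
    return {
        "idle_wh": idle_w * idle_s / 3600,
        "active_wh": active_w * active_s / 3600,
    }


def energy_gap(speedup, power_ratio):
    """How many times less energy per response the smaller model uses.

    speedup:     big-model time per response / small-model time per response
    power_ratio: big-model average power / small-model average power
    Energy is power x time, so the two ratios multiply.
    """
    return speedup * power_ratio


# 30 W idle is from the text; every other number here is illustrative.
budget = daily_energy_wh(
    idle_w=30, active_w=200, seconds_per_response=10, responses_per_day=100
)
idle_share = budget["idle_wh"] / (budget["idle_wh"] + budget["active_wh"])
print(f"idle: {budget['idle_wh']:.0f} Wh, active: {budget['active_wh']:.0f} Wh")
print(f"idle share of the day: {idle_share:.0%}")

# A model that is 2.5x faster and draws 1.25x less power: 3.125x less energy.
print(f"energy gap: {energy_gap(2.5, 1.25)}x")
```

With these assumed numbers the GPU spends about 93% of its daily energy doing nothing, which is why unloading the model or letting the card drop to a lower power state can matter more than shaving seconds off generation.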
Why this matters beyond my electricity bill
Datacenter inference hides its costs behind an API. On-device inference makes the cost visible — you can literally watch the power draw. That visibility changes how you build: you start asking whether a request needs a model at all, whether a cached answer works, whether a regex would do.
Efficiency stops being an infrastructure team’s problem and becomes a design constraint. That’s the discipline I want to carry into whatever I build next.