
Tomasz Wosiński
Jun 16, 2026
In production, the smartest model and the right model are different questions. Most of the time the right answer is smaller, cheaper, and older than the leaderboard suggests.
By Tomasz Wosiński, Chief Solutions Officer, maiven
Every architecture review I sit in eventually reaches the same moment. Someone pulls up the latest model leaderboard, points at the top row, and says "let's use that one." It feels like the safe call. Nobody gets fired for picking the best model.
Years before maiven, working in enterprise software at companies like Oracle and HPE, I watched the same reflex play out in infrastructure. The biggest database, the top appliance, the most expensive compute tier, all specced for a workload that never needed any of it. We eventually learned to right-size. That discipline is arriving in AI now, and most teams are skipping straight past it.
The leaderboard is not your spec
A model that tops a public benchmark is winning at hard, general problems designed to stress it. Graduate-level math. Competitive coding. Long reasoning chains built specifically to break it. Your production task is almost never that. It is sorting an inbound email into one of eight buckets, or pulling five fields from a contract, or drafting a first reply a human reads before it ships. Those jobs are narrow, and a much smaller model usually clears them at the same accuracy as the frontier model, for a fraction of the cost and a fraction of the latency.
The number that decides this is not on the leaderboard. It is the pass rate on your own task, with your own data. Until you have measured that, "the best model" is a guess wearing a benchmark.
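Measuring that number does not take much machinery. The sketch below is one minimal way to do it, assuming you have a small set of labeled examples from your own traffic. The `keyword_router` function is a hypothetical stand-in for a real model call, and the emails and labels are invented for illustration.

```python
from typing import Callable, Sequence


def pass_rate(predict: Callable[[str], str],
              examples: Sequence[tuple[str, str]]) -> float:
    """Return the fraction of labeled examples the predictor gets right."""
    if not examples:
        raise ValueError("need at least one labeled example")
    passed = sum(
        1 for text, expected in examples
        if predict(text).strip().lower() == expected.strip().lower()
    )
    return passed / len(examples)


def keyword_router(email: str) -> str:
    """Hypothetical stand-in for a model call: route an email to a bucket."""
    text = email.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "password" in text or "login" in text:
        return "access"
    return "general"


# Invented examples; in practice these come from your own labeled data.
labeled_emails = [
    ("Please resend my invoice for March", "billing"),
    ("I cannot reset my password", "access"),
    ("Where is your office located?", "general"),
    ("I was charged twice and need a refund", "billing"),
    ("My payment did not go through", "billing"),  # the stand-in misses this one
]

rate = pass_rate(keyword_router, labeled_emails)
print(f"pass rate: {rate:.0%}")
```

Swap `keyword_router` for each candidate model and the same function gives you a like-for-like comparison on the task you actually run.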
Good enough is a spec, not a compromise
"Good enough" sounds like settling. It means you defined the bar before you went shopping. What accuracy does this task actually need to be useful? What does a wrong answer cost, and who catches it before it reaches a customer? A misrouted ticket that a human reviews is cheap to get wrong. An automated credit decision is not. Same company, same week, completely different bars, and they justify completely different models.
Set the acceptance threshold first. Then choose the cheapest model that clears it. That order is the whole game. Teams that pick the model first and reverse-engineer the threshold to match end up paying frontier prices to clear a bar a mid-tier model had already cleared.
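That order can be written down in a few lines. This is a sketch under stated assumptions, not a selection tool: the model names, pass rates, and prices are all hypothetical, and `cheapest_clearing` is a helper name invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    pass_rate: float          # measured on your own task, 0.0 to 1.0
    cost_per_1k_calls: float  # USD; hypothetical figures


def cheapest_clearing(candidates: Sequence[Candidate],
                      threshold: float) -> Optional[Candidate]:
    """Return the cheapest candidate whose pass rate meets the threshold."""
    eligible = [c for c in candidates if c.pass_rate >= threshold]
    # default=None covers the case where no model clears the bar.
    return min(eligible, key=lambda c: c.cost_per_1k_calls, default=None)


# Hypothetical measurements for illustration only.
candidates = [
    Candidate("frontier", pass_rate=0.97, cost_per_1k_calls=30.00),
    Candidate("mid-tier", pass_rate=0.95, cost_per_1k_calls=4.00),
    Candidate("small", pass_rate=0.91, cost_per_1k_calls=0.50),
]

# Two tasks, two bars, two different answers.
ticket_routing = cheapest_clearing(candidates, threshold=0.90)
credit_decision = cheapest_clearing(candidates, threshold=0.96)
print(ticket_routing.name, credit_decision.name)
```

The threshold is an input and the model is an output. If no candidate clears the bar, the function returns `None`, which is a signal to rethink the task or add human review rather than to lower the bar quietly.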
At scale, the model is your unit economics
This is where it stops being academic. The cost of one call to the best model is a rounding error. A production system does not make one call.
Take AITS, an agentic platform we built that tracks how AI search engines mention and recommend brands. A single report runs about 130 coordinated LLM operations across more than 60 agents. The platform serves more than 7,000 brands and moves through billions of tokens. At that volume, the per-token price of your model is the unit economics of the product. Multiply the price difference between the frontier model and a capable smaller one across billions of tokens, and it decides whether the product carries a margin or loses money every time it runs.
A system that looks brilliant in a demo and runs underwater per transaction is not a product. It is a liability with good lighting.
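To make the arithmetic concrete, here is a back-of-the-envelope sketch. Only the 130 operations per report comes from the figures above. The tokens per operation and both prices are assumptions chosen for illustration, not real AITS numbers or any vendor's price list.

```python
OPERATIONS_PER_REPORT = 130   # from the article
TOKENS_PER_OPERATION = 2_000  # assumption for illustration

# Assumed USD prices per million tokens; not real vendor pricing.
PRICE_PER_MILLION_TOKENS = {
    "frontier": 10.00,
    "small": 0.50,
}


def cost_per_report(price_per_million_tokens: float) -> float:
    """Estimate the model cost of one report at a flat per-token price."""
    tokens = OPERATIONS_PER_REPORT * TOKENS_PER_OPERATION
    return tokens / 1_000_000 * price_per_million_tokens


frontier_cost = cost_per_report(PRICE_PER_MILLION_TOKENS["frontier"])
small_cost = cost_per_report(PRICE_PER_MILLION_TOKENS["small"])

print(f"frontier: ${frontier_cost:.2f} per report")
print(f"small:    ${small_cost:.2f} per report")
print(f"ratio:    {frontier_cost / small_cost:.0f}x")
```

Under these assumptions the gap per report is a couple of dollars. Whether that matters depends on what a report sells for and how many run each month, which is exactly why the calculation belongs in the architecture review.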
Newest also means least predictable
Production rewards boring. The newest, largest model is usually the slowest to answer and the first to throttle when traffic spikes. It is also the one you have spent the least time with, so you meet its failure modes in production before anyone has written them down. A model that has been in the wild for a year is a known quantity. You know how it breaks and what it costs under load. For a system real users depend on, that track record is worth more than a few points on a benchmark you were never running.
The real answer is rarely one model
"Which model should we use" is the wrong shape of question. Mature systems do not run on one model. They route. The easy majority of operations go to a small, fast, cheap model. The genuinely hard cases escalate to the frontier model, which now earns its price on the small slice of work that actually needs it. Most of the system runs on good enough. The expensive model becomes a scalpel, used on purpose for the few cuts that need it.
This is how we build at maiven. A fleet of sixty agents does not mean sixty calls to the smartest model. Each step gets the model it needs, picked on the cost and latency it can carry and the accuracy it has to hit. That routing is most of the engineering. It is also most of the margin.
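A routing layer can be sketched in a few lines. The version below is one common pattern, not a description of maiven's implementation: it assumes each model returns a confidence score and escalates when the small model's score falls below a cutoff. The `Answer` type, both model functions, and the 0.8 cutoff are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0; assumes the model exposes a usable score
    model: str


ModelFn = Callable[[str], Answer]


def make_router(small: ModelFn, frontier: ModelFn,
                min_confidence: float) -> ModelFn:
    """Build a router that tries the small model and escalates when unsure."""
    def route(task: str) -> Answer:
        first = small(task)
        if first.confidence >= min_confidence:
            return first
        # The small model was not confident enough: pay for the frontier call.
        return frontier(task)
    return route


def small_model(task: str) -> Answer:
    """Hypothetical stand-in: confident only on short, simple tasks."""
    confident = len(task.split()) <= 8
    return Answer("small-answer", 0.95 if confident else 0.40, "small")


def frontier_model(task: str) -> Answer:
    """Hypothetical stand-in for the expensive model."""
    return Answer("frontier-answer", 0.99, "frontier")


route = make_router(small_model, frontier_model, min_confidence=0.8)

easy = route("Classify this email: invoice attached")
hard = route("Compare the indemnity clauses across these three contracts "
             "and flag every conflict between them")
print(easy.model, hard.model)
```

The escalation signal is the design choice that matters. A confidence score is one option; a validation check on the output or a rule based on the task type would slot into the same structure.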
The question to carry into your next review
When you rent intelligence by the token, model selection is a margin decision you remake every day the product runs. The team that right-sizes it controls its own economics. The team that defaults to the best model hands its economics to whoever sets the frontier price this quarter.
So the next time someone points at the top of the leaderboard, change the question on the table. Ask the one that actually decides whether you have a product. What is the smallest, cheapest model that reliably clears the bar this step needs?
Most of the time, the honest answer is smaller, cheaper, and already a year old.