Stop Using Your Best Model for Everything: A Practical Guide to Model Routing
Most teams point all agent tasks at one frontier model and watch the bill climb. Routing by task complexity can cut costs by roughly 20-40% with near-identical quality.
I got our AI infrastructure bill last quarter and stared at it for a minute longer than I should have. Not because it was catastrophic — it wasn’t — but because I recognised that a significant portion of it was waste I had designed into the system myself. Every task in our agent pipelines at Zero was going to the same model. The expensive one. The one I’d reached for instinctively when I first set things up because it was the best, and I didn’t want to think about it more than that.
That instinct is extremely common. And it’s costing teams a lot of money for not very much return.
The Default Is “Best Model, Always”
Here’s how most teams I’ve seen build with AI agents, including teams that are otherwise technically sharp: they pick one frontier model — Claude Opus, GPT-4 Turbo, whatever feels right — and route every task to it. Documentation generation. Config file changes. Boilerplate scaffolding. Refactoring a five-line function. All of it. Same model, same price, every time.
The justification is understandable. Frontier models make fewer mistakes. When you’re building fast, debugging routing logic is overhead you don’t want. And the per-token price difference feels abstract until the bill lands.
But here’s the thing: a frontier model writing a README update or a Dockerfile for a standard Node app is like hiring a senior architect to sweep the office floor. It’s not that they do it badly. They do it fine. You’re just paying architect rates for sweeping.
The pattern I’m describing — routing different tasks to different models based on complexity — isn’t new in distributed systems. We do it with compute all the time. You don’t run your analytics batch jobs on your API servers. You route by workload type. We just haven’t been doing it with LLMs yet, and the cost difference between tiers is large enough that we should be.
What the Price Gap Actually Looks Like
As of mid-2026, the rough tier structure for Anthropic’s models looks like this (input / output per million tokens):
- Claude Opus 4.8 (flagship): $5 input / $25 output
- Claude Sonnet 4.6 (mid-tier): $3 input / $15 output
- Claude Haiku 4.5 (efficient): $1 input / $5 output
Output tokens are the expensive side — they’re priced at 5x input across the board, which matters because agents generate a lot of output: code, explanations, diffs, structured JSON. So the meaningful ratio to think about is output cost. Opus is 5x more expensive per output token than Haiku. Sonnet is 3x more expensive than Haiku. That’s the gap you’re leaving on the table.
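To make those ratios easy to re-check when prices change, here is a minimal sketch that encodes the price list above and derives the output-cost multiples from it. The function names (`output_ratio`, `cost_usd`) and the short model keys are my own illustrative choices, not part of any SDK.

```python
# Prices in USD per million tokens, copied from the list above.
PRICES = {
    "opus": {"input": 5.0, "output": 25.0},
    "sonnet": {"input": 3.0, "output": 15.0},
    "haiku": {"input": 1.0, "output": 5.0},
}


def output_ratio(model: str, baseline: str = "haiku") -> float:
    """How many times more `model` costs than `baseline` per output token."""
    return PRICES[model]["output"] / PRICES[baseline]["output"]


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single call in USD for the given token counts."""
    price = PRICES[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


print(output_ratio("opus"))    # 5.0
print(output_ratio("sonnet"))  # 3.0
```

The real cost of a task depends on how its tokens split between input and output, which is why `cost_usd` takes the two counts separately.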
Now let me make that concrete with a representative agent session.
Say a developer kicks off a task: update the API docs for a new endpoint, scaffold three config files for a staging environment, write a basic test suite for a utility function, then refactor a cross-service authentication flow that touches four different modules. Four distinct subtasks in one session.
All-Opus session:
- Docs update: ~4K tokens total → ~$0.08
- Config scaffolding: ~6K tokens → ~$0.12
- Test suite: ~8K tokens → ~$0.16
- Auth refactor: ~25K tokens → ~$0.51
Total: roughly $0.87 per session. That sounds small. But at 1,000 sessions a month — not unusual for a team of 8-10 engineers using agents daily — that’s $870/month or ~$10,400/year for a single usage pattern.
Routed session (docs + config + tests → Haiku; auth refactor → Opus):
- Docs update on Haiku: ~$0.03
- Config scaffolding on Haiku: ~$0.04
- Test suite on Haiku: ~$0.06
- Auth refactor on Opus: ~$0.51
Total: roughly $0.64 per session. Same session, same outcomes, about 26% less.
These are illustrative numbers, not empirical benchmarks from our specific stack — the real savings vary by task mix and token volume. But the directional math is solid, and in my experience the higher the proportion of routine tasks in your agent sessions, the closer you get to 40%+ savings. Teams with very heavy documentation, test generation, or code generation workloads for standard patterns can push past that.
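The session arithmetic above is simple enough to check in a few lines. This sketch only re-adds the illustrative per-subtask figures from the two sessions. It does not derive them from token counts, and the variable names are mine.

```python
# Per-subtask costs in USD, copied from the two illustrative sessions above.
ALL_OPUS = {"docs": 0.08, "config": 0.12, "tests": 0.16, "auth_refactor": 0.51}
ROUTED = {"docs": 0.03, "config": 0.04, "tests": 0.06, "auth_refactor": 0.51}
SESSIONS_PER_MONTH = 1_000  # the volume used in the example above


def session_total(costs: dict) -> float:
    """Sum the subtask costs, rounded to whole cents."""
    return round(sum(costs.values()), 2)


opus_total = session_total(ALL_OPUS)
routed_total = session_total(ROUTED)
savings_fraction = (opus_total - routed_total) / opus_total
monthly_saved = round((opus_total - routed_total) * SESSIONS_PER_MONTH)

print(f"all-Opus: ${opus_total:.2f}, routed: ${routed_total:.2f}")  # $0.87, $0.64
print(f"savings: {savings_fraction:.0%}")                           # 26%
print(f"saved per month: ${monthly_saved}")                         # $230
```

Swapping in your own per-subtask costs is the quickest way to see how much your task mix moves the savings figure.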
Factory, who build coding agent infrastructure, claimed their routing product cuts token spend 20-25% while maintaining frontier performance — and separately, their Droid agent (which uses multi-LLM routing under the hood) ranked #1 on Terminal-Bench 2 with a score of 58.75%. I’m not going to use those numbers as proof of anything — benchmark conditions differ from production — but they point in the same direction. Routing isn’t sacrificing quality for cost. It’s matching model capability to task requirements.
How to Actually Classify Tasks
The routing decision comes down to one question: does this task require frontier-level reasoning, or is competent execution enough?
The answer depends on a handful of signals:
Strong signals that a task needs frontier:
- Cross-service or cross-module coordination (touching 3+ files with non-obvious interdependencies)
- Security-sensitive code paths: auth, payments, encryption, session handling
- Schema migrations or database changes where the blast radius of a mistake is high
- Novel architecture decisions — anything where the right answer isn’t clearly established in the codebase
- Debugging where the root cause is genuinely unclear and requires hypothesis generation
- Tasks where the model needs to maintain and reason about a long shared context across many prior turns
Signals that efficient models handle fine:
- Single-file changes with clear specifications
- Documentation updates, changelog generation, README edits
- Config file generation (Dockerfiles, CI YAML, .env.example)
- Boilerplate scaffolding following established project patterns
- Unit/integration tests for functions with clear inputs and outputs
- Linting fixes, import reorganization, formatting
- Simple bug fixes where the error message and fix are obvious
- Summarization of prior context for handoff
The middle: tasks you have to watch
There’s a grey zone. A “simple refactor” that turns out to require understanding an implicit contract across three calling sites is not simple. A documentation update that requires understanding a complex API design and accurately representing its behavior is not cheap work. The mistake people make when they start routing is over-routing to efficient models and then wondering why quality dipped. The fix is to set your threshold conservatively and expand as you build confidence.
Writing Complexity Signals Into Task Descriptions
One thing that surprised me: the easiest routing improvements came not from fancy routing logic but from changing how tasks were described to agents in the first place.
When a task description is vague — “fix the auth bug” — you can’t route it intelligently. When it’s specific — “fix the JWT expiry check in src/middleware/auth.ts:47 which isn’t comparing timestamps correctly; single file, no cross-service impact” — you can. A well-written task description encodes enough structure that even a simple keyword heuristic can route it correctly most of the time.
The practice I’ve landed on: before we dispatch a task in our agent pipelines, we write a one-line complexity annotation. It follows a consistent format:
[scope: single-file | multi-file | cross-service]
[domain: docs | config | tests | feature | security | schema | refactor]
[ambiguity: low | medium | high]
Tasks tagged single-file + docs/config/tests + low ambiguity go to efficient models automatically. Everything tagged cross-service or security or high ambiguity goes frontier. Everything else gets a soft default that we’re still tuning.
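The tagging rule above maps onto a very small function. This is a sketch under assumptions, not our production router: it assumes the annotation is written as `[key: value]` pairs on one line, and the names `parse_annotation`, `route`, and the `"soft-default"` placeholder are hypothetical.

```python
import re

FRONTIER = "frontier"
EFFICIENT = "efficient"
SOFT_DEFAULT = "soft-default"  # placeholder for the tier that is still being tuned

ROUTINE_DOMAINS = {"docs", "config", "tests"}


def parse_annotation(text: str) -> dict:
    """Extract `[key: value]` pairs from a one-line complexity annotation."""
    pairs = re.findall(r"\[(\w+):\s*([^\]]+)\]", text)
    return {key: value.strip() for key, value in pairs}


def route(tags: dict) -> str:
    """Apply the tagging rule: escalation checks first, then the cheap path."""
    scope = tags.get("scope")
    domain = tags.get("domain")
    ambiguity = tags.get("ambiguity")

    # Any one of these tags is enough to force the frontier tier.
    if scope == "cross-service" or domain == "security" or ambiguity == "high":
        return FRONTIER
    # All three conditions must hold before a task goes to the efficient tier.
    if scope == "single-file" and domain in ROUTINE_DOMAINS and ambiguity == "low":
        return EFFICIENT
    # Everything else, including incomplete annotations, gets the soft default.
    return SOFT_DEFAULT


tags = parse_annotation("[scope: single-file] [domain: docs] [ambiguity: low]")
print(route(tags))  # efficient
```

Checking the escalation conditions first means a single `security` or `high` tag always wins, even when the other tags look routine.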
This sounds like overhead, but it takes only 10-15 seconds per task to write. The discipline of being specific about scope also tends to make the tasks themselves better: smaller, more focused, easier for any model to complete well.
Building a Model Pool and a Routing Policy Doc
Once you decide to route, you need a written policy. Not because it’s bureaucratic, but because without it you’ll get inconsistency — different engineers making different calls, no way to audit what went where, and no baseline to measure savings against.
Our routing policy document is one page. It covers:
- The model pool: which models are in use and their tier (Frontier / Efficient). Right now we run Opus as frontier and Haiku as efficient, with Sonnet as a fallback for tasks that feel too complex for Haiku but don’t clearly need Opus. Having that middle tier is useful for the grey zone.
- The routing table: task domains and their default model tier, with override conditions.
- What MUST stay frontier: a hard list. For us it’s anything touching authentication, payment processing, production database schemas, and any task flagged as high ambiguity by the task author. This list is non-negotiable, and it’s short, which is the point.
- How context carries across switches: when a session switches from frontier to efficient mid-task (or vice versa), what context gets passed. You need to handle this explicitly. The model switch shouldn’t lose thread: the receiving model needs enough context to understand the task it’s inheriting.
- Who can override and how: in a regulated environment, you want an audit trail for routing decisions. Even if you’re not regulated, it’s useful to know which engineer overrode the default and why.
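A one-page policy like this translates naturally into a small data structure, which also makes it auditable. The sketch below is an assumption about how that might look, not a copy of our actual document: the routing-table entries, the domain keys, and every class and function name are hypothetical. Unknown domains fall back to frontier, on the principle of routing up when in doubt.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RoutingPolicy:
    model_pool: dict               # tier name -> model
    routing_table: dict            # task domain -> default tier
    must_stay_frontier: frozenset  # domains that no override may downgrade


@dataclass
class Override:
    engineer: str
    task_domain: str
    requested_tier: str
    reason: str
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


POLICY = RoutingPolicy(
    model_pool={"frontier": "opus", "fallback": "sonnet", "efficient": "haiku"},
    # Hypothetical defaults, for illustration only.
    routing_table={"docs": "efficient", "config": "efficient",
                   "tests": "efficient", "refactor": "fallback"},
    must_stay_frontier=frozenset({"auth", "payments", "prod-schema"}),
)

AUDIT_LOG: list = []


def resolve_tier(policy: RoutingPolicy, task_domain: str,
                 high_ambiguity: bool = False,
                 override: Optional[Override] = None) -> str:
    """Return the tier for a task, enforcing the hard frontier list first."""
    if task_domain in policy.must_stay_frontier or high_ambiguity:
        return "frontier"  # non-negotiable: overrides are ignored here
    if override is not None:
        AUDIT_LOG.append(override)  # record who overrode the default and why
        return override.requested_tier
    return policy.routing_table.get(task_domain, "frontier")


print(resolve_tier(POLICY, "docs"))  # efficient
print(resolve_tier(POLICY, "auth"))  # frontier
```

Keeping the hard list as the first check is the design choice that matters: no override path can reach a model below frontier for those domains.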
Measuring Weekly
Routing only produces value if you’re measuring it. The instrumentation is straightforward but you have to actually do it.
We track per session: which model tier was used for each subtask, token counts (input and output separately), and the actual cost. Every Monday I look at three numbers: total spend, average cost per session, and the routing efficiency ratio (the share of tasks that went to efficient models this week, compared with the share that could have gone there in hindsight).
The last metric is hard to calculate precisely because “what would have been optimal” requires a judgment call. We approximate it by having engineers flag tasks post-completion: “this could have used a cheaper model” or “we needed frontier and I’m glad we had it.” Imperfect, but it gives you signal over time.
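Here is a rough sketch of how those three Monday numbers could be computed from per-subtask records. The record shape and flag strings are hypothetical, and the efficiency ratio is my own reading of the metric: tasks that ran on the efficient tier, divided by those plus the frontier tasks an engineer flagged as "could have used a cheaper model".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subtask:
    session_id: str
    tier: str                   # "frontier" or "efficient"
    input_tokens: int
    output_tokens: int
    cost_usd: float
    flag: Optional[str] = None  # "could_be_cheaper", "needed_frontier", or None


def weekly_report(subtasks: list) -> dict:
    """Total spend, average cost per session, and routing efficiency ratio."""
    total = sum(s.cost_usd for s in subtasks)
    sessions = {s.session_id for s in subtasks}
    efficient = sum(1 for s in subtasks if s.tier == "efficient")
    # Frontier tasks that engineers flagged, after the fact, as over-provisioned.
    over_routed = sum(
        1 for s in subtasks
        if s.tier == "frontier" and s.flag == "could_be_cheaper"
    )
    optimal = efficient + over_routed
    return {
        "total_spend": round(total, 2),
        "avg_cost_per_session": round(total / len(sessions), 2) if sessions else 0.0,
        "routing_efficiency": efficient / optimal if optimal else 1.0,
    }


week = [
    Subtask("a", "efficient", 3000, 1000, 0.03),
    Subtask("a", "frontier", 15000, 10000, 0.51, "needed_frontier"),
    Subtask("b", "frontier", 5000, 3000, 0.16, "could_be_cheaper"),
    Subtask("b", "efficient", 4000, 2000, 0.04),
]
print(weekly_report(week))
```

A ratio of 1.0 means nobody flagged a frontier task as over-provisioned that week. It says nothing about efficient tasks that should have gone frontier, so it is worth tracking those separately through review.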
The first four weeks of measuring, you’ll find patterns you didn’t expect. For us, the biggest surprise was that test generation was one of our highest-volume task types and we were running almost all of it on Opus. Tests for utility functions with clear type signatures don’t need Opus. We moved 80% of test generation to Haiku and saw essentially no quality difference in the output.
CI Routing
If you run AI-assisted tasks in CI (code review agents, documentation generation, security scanning), the economics of routing hit even harder because the volume is much higher. The risk is higher too, because no human is in the loop to catch mistakes.
CI is also where I’d be more conservative about routing. The failure mode in CI is that a low-quality output slips through and gets merged, and cleaning that up costs far more than the tokens you saved. So our CI routing policy is stricter: anything in a pull request critical path (code review, security scan) stays frontier. Auxiliary tasks (updating the CHANGELOG, generating docs for new functions, summarizing test failures) go efficient.
The test: would a bad output from this task get caught before it caused a problem? If yes, route aggressively. If no, stay frontier.
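That test is small enough to write down directly. A minimal sketch, assuming each CI task carries a type string and a yes/no answer to the question above. The task type names and the `ci_tier` function are illustrative, not from any CI system.

```python
# Tasks in the pull request critical path always stay on the frontier tier.
CRITICAL_PATH = {"code_review", "security_scan"}


def ci_tier(task_type: str, caught_before_harm: bool) -> str:
    """Pick a tier for a CI task.

    `caught_before_harm` answers: would a bad output from this task be
    caught before it caused a problem?
    """
    if task_type in CRITICAL_PATH:
        return "frontier"
    return "efficient" if caught_before_harm else "frontier"


print(ci_tier("changelog_update", caught_before_harm=True))  # efficient
print(ci_tier("code_review", caught_before_harm=True))       # frontier
```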
Where Routing Hurts (Honest Take)
I want to be honest about the failure modes because most content on this topic is too optimistic.
You will route something wrong. A task you classify as routine will turn out to be complex, and an efficient model will produce something subtly wrong — not obviously broken, but wrong in a way that requires review. The mitigation isn’t to stop routing, it’s to have humans reviewing outputs and a feedback loop back to your routing policy.
Context bleed between sessions. If your agent architecture re-uses context across sessions in ways that aren’t explicit, model switches can cause inconsistency. The session that started on Opus with a deep understanding of your auth architecture doesn’t automatically give that context to Haiku when it takes over a documentation task. Make context passing explicit and serialized.
The grey zone is real and doesn’t shrink. You will always have tasks you’re not sure how to classify. My recommendation: when in doubt, route up, not down. The cost of a frontier call on a task that didn’t need it is small. The cost of an efficient model producing wrong output on a sensitive task is larger.
Routing logic is maintenance surface. A routing policy doc and a set of heuristics is code in a different form. It will get stale as your codebase evolves and as model capabilities change. Schedule a review quarterly — not just to tune the savings, but to check that the “must stay frontier” list is still accurate and complete.
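For the context bleed problem in particular, "explicit and serialized" can be as simple as a small handoff record that travels with the task when the model changes. This is a sketch with hypothetical field names. What belongs in the summary depends on your agent architecture.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    task: str
    from_model: str
    to_model: str
    summary: str  # condensed prior context the next model needs
    constraints: list = field(default_factory=list)  # decisions already made
    files: list = field(default_factory=list)        # paths the task touches


def serialize(handoff: Handoff) -> str:
    """Turn the handoff into JSON that can be logged and passed to the next model."""
    return json.dumps(asdict(handoff), indent=2, sort_keys=True)


def deserialize(payload: str) -> Handoff:
    """Rebuild the handoff record from its JSON form."""
    return Handoff(**json.loads(payload))


handoff = Handoff(
    task="Document the new token refresh endpoint",
    from_model="opus",
    to_model="haiku",
    summary="Auth flow was refactored; refresh tokens now rotate on every use.",
    constraints=["Do not change code, docs only"],
    files=["docs/api/auth.md"],
)
payload = serialize(handoff)
print(payload)
```

Because the payload is plain JSON, the same record doubles as an audit entry showing what the receiving model was actually told.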
The Bigger Point
I’m not making a case against frontier models. Opus, and whatever comes after Opus, is extraordinary — genuinely capable of reasoning about complex systems in ways that matter. When I’m designing a new data model for Zero or thinking through how to refactor our multi-tenant auth logic, I want the best model available. I’m not interested in saving fifty cents on that.
But “use the best model for everything” is not a technical strategy. It’s a default that you fell into and haven’t revisited. When I look at the tasks flowing through our agent sessions in a given week, a majority of them — by volume, if not by importance — are things that a capable, efficient model handles well. Docs. Tests. Config. Boilerplate. Mechanical refactors with clear specs.
The cost of routing those tasks correctly is some upfront work to write a routing policy and instrument your sessions. The return is 20-40% lower infrastructure costs that compound over time, plus a cleaner mental model of where your expensive compute is actually going.
At the scale most teams operate today — a few thousand agent sessions a month — that’s real money. At the scale serious AI-native companies are heading toward — tens of thousands of sessions, always-on background agents, CI automation that runs on every push — it becomes infrastructure discipline you can’t afford not to have.
Start by writing down what tasks your agents actually do in a typical week. You’ll almost certainly find that a larger fraction than you expect is routine work. That’s where the savings are. And unlike most cost-cutting, this one doesn’t require you to accept worse outcomes — just to be more deliberate about which tool you’re reaching for and why.