Cut Your Coding-Agent Bill, Part 2: Route the Loop to a Cheaper Model
Part 2 of 3: point the coding agent you already use at a cheaper open-weight model with two environment variables. Which models hold up for execution, how to wire it, and the real economics.
Part 1 made the case: plan high, execute cheap, because the long search-edit-test-fix loop is where the meter runs. This part is the cheapest version of “execute cheap” that still runs in the cloud: route the loop to an open-weight model that costs a fraction per token, without changing the agent you already like.
The headline is that this is almost embarrassingly easy. It is usually two environment variables.
Why this works at all: one API, many models
Almost every coding agent talks to its model over the same shape of HTTP request, the OpenAI-compatible chat-completions API. That format became a de facto standard, so a whole ecosystem of providers serves open-weight models behind the identical interface. Which means switching the engine under your agent is not a rewrite. It is a new base URL and a new key.
Your agent sends “here is the conversation, here are the tools, what next?” to an endpoint. Change the endpoint to one that serves a cheaper model, and the agent neither knows nor cares. The plan and the project stay the same; only the engine running the reps changes.
Layer 2 · Mechanism how it actually works
Two routes. An aggregator like OpenRouter exposes hundreds of models, open and closed, behind one OpenAI-compatible endpoint and one key, so you can A/B models by changing a single string. Or go direct to a provider that hosts a specific open-weight family. Either way the agent’s code is untouched; you are swapping the value of OPENAI_BASE_URL and OPENAI_API_KEY.
Layer 3 · Math & where it breaks go deeper
Agents that already speak the OpenAI format (Cline, Aider, Continue, and most editor extensions) take the swap directly:
# point an OpenAI-compatible agent at a cheaper provider
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_API_KEY="sk-or-..."
export OPENAI_MODEL="<an open-weight code model>"Agents that speak a different provider’s API can be pointed through a small gateway that translates to the OpenAI format, so the same trick applies. The practical test for any candidate model is not a benchmark score; it is whether it can drive your tools reliably over a long session, which is the next section.
You can stop after Layer 1 and still be correct about the model swap, just less complete.
Which models actually hold up for execution
Execution is not a reasoning beauty contest. The job is to call the test runner, grep the repo, read files, hit the database, and work through tools across a long session without losing the thread. The traits that matter:
- Tool-use reliability. It must emit well-formed tool calls, consistently, hundreds of times in a row. A model that is brilliant at prose but flaky at function-calling will waste your savings on retries. This is the single most important property for an executor.
- A genuinely long context. The loop accumulates file contents and logs; an executor needs to hold the thread across a big, growing prompt without falling apart in the back half.
- Code-specialization helps. The open-weight families built or tuned for code (the various code-focused releases from DeepSeek, Qwen, GLM, Mistral, Llama, Kimi and others) tend to be stronger executors than general chat models of the same size.
You do not pick by leaderboard. You pick by running your own loop on a real task and watching whether it drives the tools cleanly. Keep the planner on a frontier model; you are only swapping the executor.
The economics, honestly
Open-weight models served competitively often land around a fifth of the per-token price of a frontier model, sometimes less, and the gap is widest on output tokens, which is where the loop (and its reasoning) spends. Stack that on top of the Part 1 techniques (caching the static context, keeping prompts lean) and the loop’s cost can fall by most of itself while the planning, still on the premium model, barely moves.
“The premium model touches the work for the minutes that need genius. Everything else runs on an engine priced for volume. Same output, a fraction the meter.
”
Two honest caveats. First, an open-weight executor is not as strong a reasoner, so when the plan itself is wrong, that is a thinking problem, go back to the premium model, fix the plan, hand it back. You will need that less often than you fear. Second, your savings are real only if the executor is reliable; a cheap model that thrashes is not cheap. Measure cost per finished task, not per token. Quantify your own split with the Coding Agent Cost Split tool and the break-even with Self-Host vs API.
Set it up this weekend
Get a cheap endpoint
Sign up for an aggregator or a provider that serves an open-weight code model over the OpenAI-compatible API. Grab the base URL and a key.
Point your executor at it
Set OPENAI_BASE_URL, OPENAI_API_KEY, and the model name. If your agent speaks a different provider’s format, drop a translating gateway in front. Keep your planner on the frontier model.
Run a real task and watch the tools
Hand it the plan.md from Part 1 and let it run the loop on an actual feature. Watch for clean tool calls and a held thread, not benchmark scores. If it drives your tools reliably, you are done.
Escalate rarely
Bring the premium model back only when the plan was wrong. That is a thinking problem, not an execution one.
Part 3 takes the same loop off the cloud entirely and runs it on your own GPU, where the marginal token is free and your code never leaves the building, with the quantization, VRAM, and break-even math to decide if it is worth it.