Cut Your Coding-Agent Bill, Part 2: Route the Loop to a C...

Part 1 made the case: plan high, execute cheap, because the long search-edit-test-fix loop is where the meter runs. This part is the cheapest version of “execute cheap” that still runs in the cloud: route the loop to an open-weight model that costs a fraction per token, without changing the agent you already like.

The headline is that this is almost embarrassingly easy. It is usually two environment variables.

Why this works at all: one API, many models

Almost every coding agent talks to its model over the same shape of HTTP request, the OpenAI-compatible chat-completions API. That format became a de facto standard, so a whole ecosystem of providers serves open-weight models behind the identical interface. Which means switching the engine under your agent is not a rewrite. It is a new base URL and a new key.

The expensive model touches the work for the few minutes that matter. The cheap one does the hundreds of laps.

Layer 1 · Intuition

Your agent sends “here is the conversation, here are the tools, what next?” to an endpoint. Change the endpoint to one that serves a cheaper model, and the agent neither knows nor cares. The plan and the project stay the same; only the engine running the reps changes.

Layer 2 · Mechanism how it actually works

Two routes. An aggregator like OpenRouter exposes hundreds of models, open and closed, behind one OpenAI-compatible endpoint and one key, so you can A/B models by changing a single string. Or go direct to a provider that hosts a specific open-weight family. Either way the agent’s code is untouched; you are swapping the value of OPENAI_BASE_URL and OPENAI_API_KEY.

Layer 3 · Math & where it breaks go deeper

Agents that already speak the OpenAI format (Cline, Aider, Continue, and most editor extensions) take the swap directly:

# point an OpenAI-compatible agent at a cheaper provider
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_API_KEY="sk-or-..."
export OPENAI_MODEL="<an open-weight code model>"

Agents that speak a different provider’s API can be pointed through a small gateway that translates to the OpenAI format, so the same trick applies. The practical test for any candidate model is not a benchmark score; it is whether it can drive your tools reliably over a long session, which is the next section.

You can stop after Layer 1 and still be correct about the model swap, just less complete.

The swap is a string. The agent posts to whatever endpoint you set; the cheaper model answers in the same format.

Which models actually hold up for execution

Execution is not a reasoning beauty contest. The job is to call the test runner, grep the repo, read files, hit the database, and work through tools across a long session without losing the thread. The traits that matter:

Tool-use reliability. It must emit well-formed tool calls, consistently, hundreds of times in a row. A model that is brilliant at prose but flaky at function-calling will waste your savings on retries. This is the single most important property for an executor.
A genuinely long context. The loop accumulates file contents and logs; an executor needs to hold the thread across a big, growing prompt without falling apart in the back half.
Code-specialization helps. The open-weight families built or tuned for code (the various code-focused releases from DeepSeek, Qwen, GLM, Mistral, Llama, Kimi and others) tend to be stronger executors than general chat models of the same size.

You do not pick by leaderboard. You pick by running your own loop on a real task and watching whether it drives the tools cleanly. Keep the planner on a frontier model; you are only swapping the executor.

The economics, honestly

Open-weight models served competitively often land around a fifth of the per-token price of a frontier model, sometimes less, and the gap is widest on output tokens, which is where the loop (and its reasoning) spends. Stack that on top of the Part 1 techniques (caching the static context, keeping prompts lean) and the loop’s cost can fall by most of itself while the planning, still on the premium model, barely moves.

“
The premium model touches the work for the minutes that need genius. Everything else runs on an engine priced for volume. Same output, a fraction the meter.
”

The trade you want

Two honest caveats. First, an open-weight executor is not as strong a reasoner, so when the plan itself is wrong, that is a thinking problem, go back to the premium model, fix the plan, hand it back. You will need that less often than you fear. Second, your savings are real only if the executor is reliable; a cheap model that thrashes is not cheap. Measure cost per finished task, not per token. Quantify your own split with the Coding Agent Cost Split tool and the break-even with Self-Host vs API.

Set it up this weekend

Get a cheap endpoint

Sign up for an aggregator or a provider that serves an open-weight code model over the OpenAI-compatible API. Grab the base URL and a key.

Point your executor at it

Set OPENAI_BASE_URL, OPENAI_API_KEY, and the model name. If your agent speaks a different provider’s format, drop a translating gateway in front. Keep your planner on the frontier model.

Run a real task and watch the tools

Hand it the plan.md from Part 1 and let it run the loop on an actual feature. Watch for clean tool calls and a held thread, not benchmark scores. If it drives your tools reliably, you are done.

Escalate rarely

Bring the premium model back only when the plan was wrong. That is a thinking problem, not an execution one.

Part 3 takes the same loop off the cloud entirely and runs it on your own GPU, where the marginal token is free and your code never leaves the building, with the quantization, VRAM, and break-even math to decide if it is worth it.

What to remember

Most coding agents speak the OpenAI-compatible API, so swapping the executor model is a base URL and a key, not a rewrite.
Use an aggregator (one endpoint, many models) or go direct to a provider hosting an open-weight code family. Keep the planner on a frontier model; only swap the executor.
Pick an executor by tool-use reliability and long-context stamina over a real task, not by leaderboard. Code-tuned open-weight families tend to drive tools best.
Open-weight inference often runs around a fifth of frontier per-token price, widest on output, which is where the loop spends. Measure cost per finished task, not per token.
The catch is your code passes through a third party. For client code or real secrets, scope tokens and read the data terms, or run the loop locally (Part 3).

Cut Your Coding-Agent Bill, Part 2: Route the Loop to a Cheaper Model

Why this works at all: one API, many models

Which models actually hold up for execution

The economics, honestly

Set it up this weekend

Get a cheap endpoint

Point your executor at it

Run a real task and watch the tools

Escalate rarely

Anshad Ameenza

Only if you find it useful

Related Articles

Cut Your Coding-Agent Bill, Part 1: Where the Money Actually Goes

Cut Your Coding-Agent Bill, Part 3: Run the Loop on Your Own Hardware

500+ Open-Source AI Tools for AI Agents, Machine Learning, Computer Vision, NLP, and More

Cut Your Coding-Agent Bill, Part 2: Route the Loop to a Cheaper Model

Why this works at all: one API, many models

Which models actually hold up for execution

The economics, honestly

Set it up this weekend

Get a cheap endpoint

Point your executor at it

Run a real task and watch the tools

Escalate rarely

Anshad Ameenza

Only if you find it useful

Related Articles

Cut Your Coding-Agent Bill, Part 1: Where the Money Actually Goes

Cut Your Coding-Agent Bill, Part 3: Run the Loop on Your Own Hardware

500+ Open-Source AI Tools for AI Agents, Machine Learning, Computer Vision, NLP, and More

Cookie & Reality Check