Anshad Ameenza.
Technology · Updated: Jun 28, 2026

Cut Your Coding-Agent Bill, Part 1: Where the Money Actually Goes

Most of your coding-agent bill is grind, not genius. Part 1 of 3: how token pricing really works, why the loop bleeds you, and the plan-high, execute-cheap fix you can apply today.


You ship the side project and it works. Then you open the billing dashboard and run the math no builder wants to run: a few hundred dollars in a single week, for something nobody is paying you to build. The agent was incredible right up until the invoice arrived, and now you are rationing prompts, thinking twice before you let it try something, building like an accountant instead of a builder.

This is a three-part series on getting that number down without getting worse output. This part is the principle and the techniques you can use today on the frontier models you already pay for. Part 2 routes the expensive half of the work to cheaper open-weight models. Part 3 takes the loop fully local, onto hardware you own. And there is a calculator that does this arithmetic for you in the Coding Agent Cost Split tool.

The expensive habit nobody questions

Most of us reach for the smartest, priciest model for everything. It feels responsible: best model, best output. Here is what that misses. Almost none of your tokens go toward hard thinking. They go toward grind. The thirtieth file edit. Re-reading the same module for the fourth time. Run the tests, read the error, fix the typo, run them again. Search the codebase for where that function actually lives. Wire up boilerplate you have written a hundred times.

That loop runs through maybe ninety percent of shipping anything real, and you are running every lap of it through a model priced for genius-level reasoning. You are paying premium rates for grunt work.

The hard part of building was never the typing. It’s the thinking. The rest is volume, and volume is where an expensive model bleeds you.

The whole series in one line

Where the money actually goes

To fix it you have to see the meter clearly, so here is one level down into how you are billed.

[Figure: One session, by cost. Planning: short and rare. Execution loop: hundreds of calls, each re-sending context and paying to “think” (reasoning tokens).]
A long agent session: a few expensive planning calls, then hundreds of loop calls, each re-sending a growing context and paying for reasoning tokens. The loop is the bill.
Layer 1 · Intuition

You pay per token, separately for the tokens you send (input) and the tokens the model writes (output). An agent session is not one call; it is hundreds, and each one is more expensive than the last, because the conversation keeps growing and you re-send all of it every time.

Layer 2 · Mechanism (how it actually works)

Three things stack up. First, input grows every turn: each loop iteration re-sends the system prompt, the files in context, and the whole running transcript, so call number two hundred is paying for a huge prompt even if the new instruction is tiny. Second, reasoning is billed as output: modern coding models “think” before they answer, and those hidden thinking tokens are charged at the output rate, on every step. Third, tool results pile into context: every file read and test log the agent pulls in stays in the prompt for the rest of the session.
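To make the first point concrete, here is a minimal sketch of how billed input grows when the whole transcript is re-sent on every call. The numbers (a 20,000-token starting context, 1,500 new tokens per turn, 200 calls) are assumptions for illustration, and the model ignores context-window limits, compaction, and caching.

```python
def input_tokens_per_call(base_tokens, added_per_call, calls):
    """Input size of each call when the full transcript is re-sent every turn."""
    return [base_tokens + i * added_per_call for i in range(calls)]


# Hypothetical session: numbers chosen only to show the shape of the growth.
sizes = input_tokens_per_call(base_tokens=20_000, added_per_call=1_500, calls=200)

total_input = sum(sizes)   # what you are billed for across the session
new_material = sizes[-1]   # tokens that were ever genuinely new

print(f"first call: {sizes[0]:,} input tokens")
print(f"last call:  {sizes[-1]:,} input tokens")
print(f"total billed input: {total_input:,}")
print(f"billed vs. new material: {total_input / new_material:.0f}x")
```

Context grows linearly per call, but the total you pay for grows roughly with the square of the number of calls, which is why a long session costs far more than its length suggests.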

Layer 3 · Math & where it breaks (go deeper)

A rough model of one session’s cost:

cost  ≈  Σ over calls [ input_tokens(call) × in_price
                        + (answer + reasoning)_tokens(call) × out_price ]

The trap is that input_tokens(call) rises with the conversation, so cost per call drifts upward through the session, and there are hundreds of calls in the loop versus a handful in planning. Two prices matter most: the output price (typically 3 to 5 times the input price on frontier models, and the rate at which reasoning tokens are billed) and the gap between a frontier model and a cheaper one (often around 5 times per token, the subject of Part 2). The expensive part of building is short and rare; the cheap part is long and constant; and the default setup pays the expensive rate for the long part.
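The rough model above translates directly into a few lines of Python. This is a sketch under stated assumptions: the prices ($3 input and $15 output per million tokens) and every token count are placeholders, not any provider's actual rates or a measured session.

```python
from dataclasses import dataclass


@dataclass
class Call:
    input_tokens: int
    answer_tokens: int
    reasoning_tokens: int


def session_cost(calls, in_price, out_price):
    """Sum per-call cost; prices are dollars per million tokens."""
    total = 0.0
    for c in calls:
        total += c.input_tokens * in_price / 1e6
        # Reasoning tokens are billed at the output rate, like the answer.
        total += (c.answer_tokens + c.reasoning_tokens) * out_price / 1e6
    return total


# Hypothetical session: 5 heavy planning calls, then 300 loop calls
# whose input grows by 500 tokens each turn.
planning = [Call(30_000, 2_000, 6_000) for _ in range(5)]
loop = [Call(20_000 + i * 500, 300, 1_000) for i in range(300)]

plan_cost = session_cost(planning, in_price=3.0, out_price=15.0)
loop_cost = session_cost(loop, in_price=3.0, out_price=15.0)

print(f"planning: ${plan_cost:.2f}")
print(f"loop:     ${loop_cost:.2f}")
```

With these made-up numbers each planning call costs more than an early loop call, yet the loop dominates the total simply because there is so much more of it.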

You can stop after Layer 1 and still be correct about agent token economics, just less complete.

The fix: plan high, execute cheap

Stop asking “which model is best” and start asking “which part of the job am I paying for.” Then split the work. One model gets the architecture and the decisions where being wrong early gets expensive later. A second, cheaper one gets the search-edit-test-fix loop that runs hundreds of times a day.

[Figure: Premium model (architecture, decisions) writes plan.md; cheaper model runs the build loop (search, edit, test, fix) hundreds of times.]
The expensive model touches the work for the few minutes that matter. The cheap one does the hundreds of laps.

The split is not a downgrade. The premium model touches the project for the twenty minutes that actually need it; the cheaper engine does the eight hours of reps. Same app, same quality bar.
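A back-of-envelope sketch of what the split does to a bill. It takes this article's rough figures as inputs (about ninety percent of spend in the loop, a cheaper model at around a fifth of the price) and assumes, optimistically, that the cheaper model needs the same number of tokens and calls to finish the work. The $300 weekly bill is a hypothetical.

```python
def blended_cost(total_premium_cost, loop_share, cheap_ratio):
    """Cost after moving the loop to a model priced at cheap_ratio of premium.

    loop_share:  fraction of the current bill spent in the execution loop.
    cheap_ratio: cheaper model's per-token price divided by the premium price.
    """
    planning_part = total_premium_cost * (1 - loop_share)
    loop_part = total_premium_cost * loop_share * cheap_ratio
    return planning_part + loop_part


before = 300.0  # hypothetical weekly bill, everything on the premium model
after = blended_cost(before, loop_share=0.9, cheap_ratio=0.2)

print(f"before: ${before:.2f}  after: ${after:.2f}")
```

If the cheaper model needs extra retries to get the same result, raise cheap_ratio accordingly; the split still wins as long as the effective ratio stays well below 1.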

Five things you can do today, same models

You do not need to switch labs to start saving. These work inside the frontier tools you already use.

Make the expensive model write a plan, not code

Open with the hard questions: architecture, data model, build order, the parts most likely to break. Ask for a written plan and a task list, explicitly not an implementation. Save it as a plan.md the executor reads straight from source, so you are not re-explaining the project every session. That is the twenty minutes worth every cent of the premium.
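One way to wire this up, sketched with plain file handling and no particular agent tool in mind. The prompt wording and the helper names (save_plan, executor_prompt) are hypothetical; the only idea taken from the text is that the plan lives in plan.md and the executor reads it from there.

```python
import tempfile
from pathlib import Path

# Hypothetical prompt for the premium model: ask for decisions, not code.
PLANNING_PROMPT = (
    "Design this project. Cover architecture, data model, build order, "
    "and the parts most likely to break. Return a written plan and a "
    "numbered task list. Do not write any implementation code."
)


def save_plan(plan_text, path="plan.md"):
    """Persist the premium model's plan so later sessions can reuse it."""
    target = Path(path)
    target.write_text(plan_text, encoding="utf-8")
    return target


def executor_prompt(task, path="plan.md"):
    """Build the cheaper model's prompt from the saved plan plus one task."""
    plan = Path(path).read_text(encoding="utf-8")
    return (
        f"Project plan (source of truth):\n{plan}\n\n"
        f"Current task: {task}\n"
        "Implement only this task and follow the plan."
    )


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left on disk.
    with tempfile.TemporaryDirectory() as tmp:
        plan_path = Path(tmp) / "plan.md"
        save_plan("1. Define the schema\n2. Build the API", plan_path)
        print(executor_prompt("Build the API", plan_path))
```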

Turn on prompt caching

If your big context (the system prompt, the repo map, the plan) is mostly static, cache it. A cached read is typically billed at a fraction of the normal input price, often around a tenth, so the hundredth loop call stops re-paying full price for the same 20,000-token preamble. This is usually the single biggest line-item win, and it is one flag.
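Here is the arithmetic for that 20,000-token preamble over one hundred loop calls, as a rough sketch. The input price ($3 per million tokens) is a placeholder, the cached-read multiplier of 0.1 follows the "around a tenth" figure above, and the 1.25 cache-write surcharge is an assumption: some providers charge extra to write the cache and some do not. The sketch also assumes the cache never expires between calls.

```python
def preamble_cost(calls, preamble_tokens, in_price, cached=False,
                  read_mult=0.1, write_mult=1.25):
    """Dollars spent re-sending a static preamble; in_price is per million tokens.

    read_mult and write_mult are assumed multipliers on the normal input price.
    """
    if calls <= 0:
        return 0.0
    per_call = preamble_tokens * in_price / 1e6
    if not cached:
        return calls * per_call
    # First call writes the cache; every later call reads it.
    return per_call * write_mult + (calls - 1) * per_call * read_mult


uncached = preamble_cost(100, 20_000, in_price=3.0)
with_cache = preamble_cost(100, 20_000, in_price=3.0, cached=True)

print(f"uncached: ${uncached:.2f}")
print(f"cached:   ${with_cache:.2f}")
```

Only the static prefix benefits. The part of the prompt that changes every turn is still billed at the normal input rate, so keep the stable material at the front.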

Keep context small and fresh

Context is not free memory; it is re-billed every call. Don’t dump the whole repo in. Scope to the files that matter, clear the conversation between unrelated tasks, and use sub-agents with their own fresh context for side quests so the main thread stays lean.
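Most agent tools handle trimming for you, but the idea fits in a few lines. This sketch keeps only the newest messages that fit a token budget. The four-characters-per-token estimate is a crude assumption for English text, not a real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text):
    """Very rough heuristic: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(messages, budget):
    """Keep the newest messages that fit within budget tokens, in order."""
    kept = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))


history = ["old test log " * 100, "file contents " * 100, "fix the import"]
print(len(trim_to_budget(history, budget=400)), "of", len(history), "messages kept")
```

A real setup would usually pin the system prompt and the plan and trim only the middle of the transcript; this version drops everything older than the budget allows.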

Spend reasoning where it earns its keep

Thinking tokens are output tokens. For mechanical steps (rename this, fix this import, run the tests), dial the reasoning effort down or use a non-thinking model. Save the deep reasoning budget for the genuinely hard calls.
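A sketch of the routing decision, assuming you can label each step before you send it. The task labels and the "low" / "medium" / "high" levels are hypothetical; providers expose this control differently (some as named effort levels, others as a thinking-token budget), so map the result onto whatever your tool actually accepts.

```python
# Hypothetical task labels; adjust to how your own agent describes its steps.
MECHANICAL = {"rename", "fix_import", "run_tests", "format"}
HARD = {"architecture", "data_model", "concurrency_bug"}


def reasoning_effort(task_kind):
    """Pick a reasoning level for a step: spend thinking only where it pays."""
    if task_kind in HARD:
        return "high"
    if task_kind in MECHANICAL:
        return "low"
    return "medium"  # unknown work gets a middle setting


for kind in ["rename", "architecture", "write_docs"]:
    print(kind, "->", reasoning_effort(kind))
```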

Batch the patient work and cap the thrash

Anything latency-tolerant (a big refactor sweep, a docs pass) can go through a batch endpoint, often around half price. And cap retries: an agent that loops on a failing test doesn’t find the answer faster, it finds the bill faster.
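Capping the thrash can be as simple as a wrapper with two limits, one on attempts and one on spend. This is a minimal sketch: attempt is a hypothetical callable standing in for one fix-and-retest cycle, and it is assumed to report whether it succeeded and what it cost.

```python
def run_with_cap(attempt, max_attempts=3, max_spend=1.00):
    """Retry attempt() until it succeeds or a cap is hit.

    attempt() must return (succeeded, cost_in_dollars).
    """
    spent = 0.0
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        ok, cost = attempt()
        spent += cost
        if ok:
            return {"ok": True, "attempts": attempts, "spent": spent}
        if spent >= max_spend:
            break  # budget exhausted: stop and hand back to a human
    return {"ok": False, "attempts": attempts, "spent": spent}


def always_failing():
    """Stand-in for an agent stuck on the same failing test."""
    return (False, 0.40)


result = run_with_cap(always_failing, max_attempts=5, max_spend=1.00)
print(result["ok"], result["attempts"])
```

When the cap trips, that is usually a signal to escalate the step to the premium model or to a person rather than to raise the limit.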

Where this goes next

Tiering inside one lab gets you a long way. The bigger savings come from running that long, cheap half of the work on a model that costs a fraction per token, and that is a choice between someone else’s cheap API and your own hardware.

Part 2 is the open-weight route: how to point the agents you already use (Claude Code, Cline, Aider) at a cheaper model with two environment variables, which models hold up for execution, and the real economics. Part 3 goes all the way local: running the loop on your own GPU, the quantization and VRAM math, and exactly when owning the hardware beats renting the tokens.

