Anshad Ameenza.
Technology · · Updated: Jun 28, 2026

Cut Your Coding-Agent Bill, Part 3: Run the Loop on Your Own Hardware

Part 3 of 3: take the agent loop fully local. Ollama, llama.cpp and vLLM, quantization and VRAM math explained, the OpenAI-compatible local endpoint, and exactly when owning the hardware beats renting tokens.


Part 1 said plan high, execute cheap. Part 2 made “cheap” mean a cheaper cloud API. This part makes it mean free at the margin: run the execution loop on a model on your own machine, where every token after the electricity is zero and your code never leaves the building.

This is not for everyone, and I will be honest about when it is a worse idea than Part 2. But for high-volume builders and anyone working on code that cannot go to a third party, owning the loop changes the math entirely.

Frontier model planning, in the cloud plan.md Local model the loop, on your GPU · free at the margin search edit test fix × hundreds
The expensive model touches the work for the few minutes that matter. The cheap one does the hundreds of laps.

When local actually wins

Three situations flip the decision toward your own hardware. Volume: if the loop runs all day, a one-time hardware cost beats a meter that never stops. Privacy: client code, secrets, anything under an agreement that forbids third-party processing, never leaves your machine. Control: no rate limits, no price changes, no model deprecated out from under you, and it works on a plane. The cost stops being per-token and becomes per-watt.

The trade is that a model you can run at home is weaker than a frontier model, and it is slower unless you have serious hardware. So the tiering from Part 1 still holds, harder: keep planning on a frontier model in the cloud, and run only the execution loop locally.

The two numbers that decide everything: quantization and VRAM

Whether a model fits on your GPU comes down to two ideas. Get these and the rest is setup.

Layer 1 · Intuition

A model is billions of numbers (weights). Running it means holding those numbers in your GPU’s memory (VRAM). Quantization shrinks each number so more of them fit; the model gets smaller and faster, and slightly less precise.

Layer 2 · Mechanism how it actually works

Weights are trained at 16 bits each. Quantization stores them at fewer bits, commonly 8 or 4, which roughly halves or quarters the memory and speeds up inference, at a small and usually acceptable quality cost down to about 4 bits. The standard local format is GGUF (used by llama.cpp and Ollama), which ships a model at a chosen quantization level like Q4_K_M or Q8_0. For serving many requests fast, vLLM runs higher-throughput quantized formats on the GPU. Same model, different container; pick the quant that fits your card with room to spare.

Layer 3 · Math & where it breaks go deeper

The rule of thumb for memory:

VRAM  ≈  params × bytes_per_param  +  KV_cache(context)  +  overhead

bytes_per_param:  FP16 = 2.0   ·   Q8 ≈ 1.0   ·   Q4 ≈ 0.5

So a 14-billion-parameter model at Q4 needs roughly 14e9 × 0.5 = 7 GB for weights, plus a gigabyte or three for the KV cache (which grows with how much context you hold) and overhead. That fits comfortably on a 16 GB card. A 32B model at Q4 wants around 18 to 20 GB; a 70B at Q4 is in the 40 GB range, so two cards or a big one. Do not guess, the VRAM calculator does this exactly: pick a model and a quantization and it tells you whether it fits and how many GPUs you need.

You can stop after Layer 1 and still be correct about quantization and VRAM, just less complete.

A 14B model on your GPUFP16~28 GBQ8~14 GBQ4~7 GB · fits a 16 GB card
Same model, three sizes. Lower precision fits more model on the same card, for a small quality cost down to about 4 bits.

The setup: a local OpenAI-compatible endpoint

The beautiful part is that everything from Part 2 still applies. Your local server speaks the same OpenAI-compatible API, so the agent swap is the same two variables, just pointed at localhost.

Layer 1 · Intuition

You run a small server on your machine that loads the model and answers the same kind of request a cloud API does. Then you point your agent at it instead of the internet.

Layer 2 · Mechanism how it actually works

The easy on-ramp is Ollama (or LM Studio): install it, pull a quantized code model, and it serves an OpenAI-compatible endpoint with one command. For more speed and parallel requests, vLLM serves the model with high throughput. llama.cpp is the engine underneath much of this and runs well even without a top-end GPU.

Layer 3 · Math & where it breaks go deeper

Stand up the server and point the agent at it:

# 1. serve a quantized code model locally (OpenAI-compatible)
ollama pull <a-code-model>
ollama serve            # exposes http://localhost:11434/v1

# 2. point your executor at the local endpoint (same swap as Part 2)
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"        # any value; it's local
export OPENAI_MODEL="<a-code-model>"

For throughput instead of simplicity, vllm serve <model> exposes the same /v1 interface on a port you point the agent at. Either way, the planner stays on a frontier model in the cloud; only the loop runs on localhost.

You can stop after Layer 1 and still be correct about serving a model locally, just less complete.

The economics: a wall of fixed cost

Local inverts the cost shape. The cloud is pure variable cost, cheap to start, unbounded as you scale. Local is a fixed wall: you pay for the GPU (or the electricity on a card you already own), and then the loop is effectively free no matter how many laps it runs.

Renting tokens is cheaper until you use a lot of them. Owning the GPU is expensive until you use a lot of them. The crossover is the whole decision.

The break-even

So the question is volume. A weekend tinkerer who runs the agent a few hours a week will never amortize a GPU; Part 2 is the right answer for them. A builder running the loop all day, every day, crosses the break-even fast, and after that the marginal feature costs watts, not dollars. Put your real usage into Self-Host vs API and it finds the crossover for you.

The whole series in one move

Plan high, execute cheap, escalate rarely. Part 1 was the principle and the same-model wins. Part 2 routed the loop to a cheaper cloud model with two variables. Part 3 runs that loop on your own hardware, where the marginal token is free and your code stays home. You can mix all three: frontier planning, a cheap API for everyday execution, and a local model for the high-volume or sensitive work. The expensive model touches the twenty minutes that need it; everything else runs on an engine priced for volume, or on no meter at all.

AI Coding Agents Local LLMs Cost
Share:
Anshad Ameenza
About the Author

Anshad Ameenza

Lifelong Learner, Engineer, Technology Leader & Innovation Architect

20+ years of experience in technology leadership, innovation, and digital transformation. Building and scaling technology ventures.

Only if you find it useful

No pitch here. If these pieces are worth your time, you can get new ones in your inbox. If not, skip it with a clear conscience, nothing is being sold. Rare emails, no spam, leave whenever you like.

Continue Reading

Related Articles