The Token Economy: How Collapsing Inference Costs Rewrote My Architecture Decisions
Ethan Mollick tracks the economics. The story is a roughly 600x price collapse in about six years. Here’s what that actually changes about how you build.
The Shift I Almost Missed
In early 2023, I was helping a client decide whether to integrate LLM-based document processing into their workflow. We ran the numbers carefully. GPT-4 at launch was priced around $30 per million input tokens. The client’s volume was modest by enterprise standards but substantial enough that the monthly inference bill came out to something that needed a dedicated budget line. We built the feature, put a rate limiter on it, and treated it as a premium capability to use judiciously.
By mid-2025, the equivalent processing cost — on a model that was meaningfully more capable — had fallen to well under a dollar per million tokens. The rate limiter was still there, inherited from the original architecture, still shaping user experience based on a cost constraint that no longer existed.
I only caught this because I’d been reading Ethan Mollick.
What Mollick Has Been Tracking
Ethan Mollick is a professor at Wharton and runs a Substack called “One Useful Thing” — which, despite the modest name, has become one of the more honest and data-grounded places to read about how AI is actually reshaping work and economics. He’s also the author of Co-Intelligence, published in 2024, which made a case that AI would reshape knowledge work more profoundly than most frameworks were acknowledging.
One of the threads he’s tracked consistently is inference economics. His framing: when GPT-4 first launched, it cost around $50 to process a million tokens (higher than the $30 input price I quoted above, presumably because it blends in the pricier output tokens). By 2025, a much more capable model, GPT-5 nano, cost around 14 cents per million tokens. That’s a roughly 350x price reduction in about two and a half years. Other analysts tracking the broader market have documented price declines on the order of 600x across the six-year arc from early frontier models to economy-tier models in 2026 like Gemini 2.0 Flash at $0.10 per million input tokens.
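As a quick check on that arithmetic, here is a minimal Python sketch that uses only the prices quoted above. The function names are mine, and the constant-rate assumption inside `implied_yearly_multiplier` is a simplification: real price lists fall in steps, not smoothly.

```python
# Per-million-token prices quoted in the text (USD).
GPT4_LAUNCH = 50.00   # figure attributed to Mollick for GPT-4 at launch
GPT5_NANO = 0.14      # figure attributed to Mollick for GPT-5 nano


def fold_reduction(old_price: float, new_price: float) -> float:
    """Return how many times cheaper new_price is than old_price."""
    return old_price / new_price


def implied_yearly_multiplier(fold: float, years: float) -> float:
    """Yearly price multiplier if a total fold reduction happened at a constant rate.

    Assumption: smooth exponential decline, which real price lists do not follow.
    """
    return (1.0 / fold) ** (1.0 / years)


mollick_fold = fold_reduction(GPT4_LAUNCH, GPT5_NANO)
six_year_multiplier = implied_yearly_multiplier(600, 6)

print(f"GPT-4 launch to GPT-5 nano: about {mollick_fold:.0f}x cheaper")
print(f"600x over six years: each year's price is about {six_year_multiplier:.0%} of the prior year's")
```

Under those assumptions the first figure comes out near 357x, which is where "roughly 350x" comes from, and the second implies prices falling by about two thirds every year.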
The speed of that collapse is unlike anything I’ve seen in other infrastructure cost curves. Cloud compute dropped in cost significantly from 2010 to 2020, but that was a decade-long trend measured in small multiples, not orders of magnitude. AI inference costs are dropping far faster, driven by both hardware improvements and dramatically more efficient model architectures.
Mollick’s broader argument, which I find compelling, is that this cost collapse is what enables the billion-user scenario for AI — not some grand democratization initiative, but simple economics. When the marginal cost of serving one more user an intelligent response approaches zero, the business models that weren’t viable at $30 per million tokens become straightforwardly viable at $0.14.
What This Changes in Practice
I want to be specific about the architectural decisions this rewrites, because I’ve made several of these shifts myself in the past 18 months.
Classification as a first step. In 2023, adding an LLM call to classify user intent before routing to the right tool felt expensive. You were paying inference cost for a preprocessing step that might not add much value. In 2025, with economy-tier models priced at roughly 10 to 15 cents per million tokens, classification layers are cheap enough to add freely. We now use lightweight classification on almost every user interaction in the Zero platform before routing to more expensive operations. The quality improvement from better routing pays for the classification cost many times over.
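To make the classify-then-route pattern concrete, here is a minimal sketch. It is not how the Zero platform implements it: the intent labels, handler names, and `stub_classifier` are all hypothetical, and the stub uses keywords only so the example runs offline. In practice `classify` would wrap a short prompt to an economy-tier model.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def handle_billing(message: str) -> str:
    return "billing-tool"


def handle_search(message: str) -> str:
    return "document-search"


def handle_general(message: str) -> str:
    return "general-assistant"


# Hypothetical intent labels mapped to hypothetical downstream handlers.
ROUTES: Dict[str, Handler] = {
    "billing": handle_billing,
    "search": handle_search,
}


def route(message: str, classify: Callable[[str], str]) -> str:
    """Classify first, then dispatch; unknown labels fall back to the general handler."""
    label = classify(message).strip().lower()
    handler = ROUTES.get(label, handle_general)
    return handler(message)


def stub_classifier(message: str) -> str:
    """Stand-in for a cheap model call (assumption: the model returns one bare label)."""
    text = message.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "find" in text or "where" in text:
        return "search"
    return "unknown"


print(route("I need a refund for last month", stub_classifier))
print(route("Find the Q3 report", stub_classifier))
```

The fallback in `route` matters more than it looks: a cheap classifier will sometimes return a label you did not plan for, and the safe default is the general path rather than an exception.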
Ensemble approaches. Calling multiple models on the same task and comparing outputs, or using one model to critique another’s output, was economically absurd when each call had meaningful cost. Now it’s a design pattern I actually use. For anything where accuracy matters and the latency budget allows it, running two smaller models and reconciling their outputs often beats one large model at comparable total cost.
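One simple way to reconcile two outputs is to accept the answer when both models agree and pay for a third call only when they differ. The sketch below assumes short, comparable answers (labels, extracted fields), and every name in it is hypothetical, including the stub models that stand in for real API calls.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]
Arbiter = Callable[[str, str, str], str]


def normalize(answer: str) -> str:
    """Cheap comparison key: ignore case and whitespace differences."""
    return " ".join(answer.lower().split())


def reconcile(prompt: str, model_a: Model, model_b: Model, arbiter: Arbiter) -> Tuple[str, str]:
    """Return (answer, how_it_was_decided)."""
    a = model_a(prompt)
    b = model_b(prompt)
    if normalize(a) == normalize(b):
        return a, "agreed"
    # Disagreement: only now pay for an extra call to settle it.
    return arbiter(prompt, a, b), "arbitrated"


# Stubs standing in for model calls so the sketch runs offline.
def small_model_a(prompt: str) -> str:
    return "Invoice"


def small_model_b(prompt: str) -> str:
    return " invoice "


def small_model_c(prompt: str) -> str:
    return "Receipt"


def stub_arbiter(prompt: str, a: str, b: str) -> str:
    # A real arbiter would ask a stronger model to choose or critique.
    return a


print(reconcile("Classify this document", small_model_a, small_model_b, stub_arbiter))
print(reconcile("Classify this document", small_model_a, small_model_c, stub_arbiter))
```

Exact-match comparison will not work for free-form text; there the arbiter or critique step would have to run on every request, which changes the cost picture.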
Longer contexts. The cost-per-token pricing model means context length directly affects cost. When prices were high, there was constant pressure to trim context windows, to compress prompts aggressively, to figure out what you could leave out. At current price levels, I’ve largely stopped optimizing for prompt length on anything except the very highest volume endpoints. The time I would spend shaving tokens is worth more than the cost I’d save.
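The claim that my time is worth more than the tokens can be checked with a payback calculation. Every input below is an illustrative assumption (500 tokens trimmed per request, 100,000 requests a month, four hours of work at $100 an hour); swap in your own numbers.

```python
def monthly_savings(tokens_saved_per_request: int, requests_per_month: int,
                    price_per_million: float) -> float:
    """Dollars saved per month by trimming the prompt."""
    return tokens_saved_per_request * requests_per_month * price_per_million / 1_000_000


def payback_months(engineering_hours: float, hourly_rate: float,
                   savings_per_month: float) -> float:
    """Months until the one-off engineering cost is recovered."""
    return engineering_hours * hourly_rate / savings_per_month


# Compare the same optimization at a 2023-style price and an economy-tier price.
for label, price in [("$30.00 per million", 30.00), ("$0.14 per million", 0.14)]:
    saved = monthly_savings(500, 100_000, price)
    months = payback_months(4, 100, saved)
    print(f"{label}: saves ${saved:,.2f}/month, pays back in {months:.1f} months")
```

With these assumptions the same four hours pay for themselves in about a week at the old price and take nearly five years at the new one. The exception is the same one named above: at very high volume the request count dominates and trimming is worth it again.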
Richer intermediate reasoning. Chain-of-thought prompting — asking the model to reason through a problem before answering — produces better outputs on complex tasks but uses more tokens. In 2023 that was a meaningful tradeoff. In 2026 it’s almost always worth it, because the output quality improvement outweighs the cost increase on any task where correctness matters.
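A rough way to test that tradeoff is to ask how much accuracy the extra reasoning tokens have to buy before they pay for themselves. This sketch assumes you can put a dollar figure on a wrong answer; the token count, both prices, and the $1 error cost are illustrative assumptions, not measurements.

```python
def extra_reasoning_cost(extra_output_tokens: int, output_price_per_million: float) -> float:
    """Dollar cost of the additional reasoning tokens on one request."""
    return extra_output_tokens * output_price_per_million / 1_000_000


def break_even_accuracy_gain(extra_cost: float, cost_of_wrong_answer: float) -> float:
    """Smallest absolute accuracy improvement (as a fraction) that covers the extra cost."""
    return extra_cost / cost_of_wrong_answer


# Assumptions: 1,200 extra output tokens per request, a wrong answer costs $1.
for label, price in [("expensive output tokens ($60.00/M)", 60.00),
                     ("cheap output tokens ($0.40/M)", 0.40)]:
    cost = extra_reasoning_cost(1_200, price)
    gain = break_even_accuracy_gain(cost, 1.00)
    print(f"{label}: needs at least {gain * 100:.3f} percentage points of accuracy")
```

At the higher price the reasoning step has to buy several points of accuracy to break even; at the lower price almost any measurable improvement clears the bar.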
The Outcome-Based Pricing Shift
The cost collapse also sets up a structural shift in how AI capabilities get priced at the product level — away from token-based consumption and toward outcome-based pricing.
This is still early, but the logic is tight. If I’m an enterprise customer, I don’t intrinsically care about tokens. I care about outcomes: documents processed, tickets resolved, decisions supported. When per-token costs were high, per-token pricing made sense because tokens were the expensive resource. As the cost of tokens approaches zero, the value I’m delivering as an AI product builder is not the tokens but the outcome. The tokens are the medium.
We’re seeing early versions of this in how AI work tools are starting to be marketed. “Per task completed” or “per outcome” framings are appearing alongside or replacing “per token” or “per API call” models. The economic pressure is toward outcome measurement because that’s what aligns incentives between the AI product and the customer.
I think this shift will accelerate in 2026 and 2027, and it will change what you measure. If you’re pricing by outcome, you need to know your cost per outcome. That requires knowing your token consumption per outcome, your success rate per attempt, and your cost of the human escalation path when the AI fails. Suddenly the reliability metrics from Karpathy’s march-of-nines framework (the observation that each additional nine of reliability, from 90% to 99% to 99.9%, takes comparable effort) become cost metrics. Every failed agent run has a real dollar cost, either direct retry cost or human escalation cost. Reliability and economics become the same problem.
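Those three inputs combine into a single expected cost per outcome. The sketch below assumes attempts succeed independently with the same probability, that a fixed number of attempts is made before handing off to a human, and that the human always resolves the task. The function name and example numbers are mine.

```python
def expected_cost_per_outcome(tokens_per_attempt: int, price_per_million: float,
                              success_rate: float, max_attempts: int,
                              escalation_cost: float) -> float:
    """Expected dollars spent to resolve one task, including retries and human escalation."""
    attempt_cost = tokens_per_attempt * price_per_million / 1_000_000
    expected = 0.0
    p_reach = 1.0  # probability that this attempt is needed at all
    for _ in range(max_attempts):
        expected += p_reach * attempt_cost
        p_reach *= 1.0 - success_rate
    # p_reach is now the probability that every attempt failed.
    expected += p_reach * escalation_cost
    return expected


# Assumptions: 20,000 tokens per attempt at $0.14 per million, two attempts, $5 per human escalation.
for rate in (0.90, 0.99):
    cost = expected_cost_per_outcome(20_000, 0.14, rate, 2, 5.00)
    print(f"success rate {rate:.0%}: about ${cost:.5f} per outcome")
```

With these numbers, moving from 90% to 99% per-attempt success cuts the cost per outcome from about 5.3 cents to about 0.33 cents, and nearly all of that difference is escalation cost rather than tokens. That is the sense in which an extra nine of reliability is a cost metric.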
The Trap I See People Falling Into
Mollick is generally optimistic about the economic trajectory of AI, and I think he’s right to be. But I want to name a trap that the cost collapse creates.
When inference is cheap, it becomes tempting to use AI everywhere, including places where it doesn’t add value, because “why not, it’s almost free?” I’ve seen engineering teams add LLM calls to pipelines where a simple regex or lookup would have been correct, faster, and more predictable — because the LLM was available and cheap and felt sophisticated.
This is the opposite problem from 2023, when everything was too expensive to use. But it’s still a problem, because cheap inference + bad judgment + high volume still produces significant bills and unpredictable behavior. The economics changed. The need to think clearly about where AI adds genuine value versus where it adds complexity without benefit — that didn’t change.
Mollick’s work emphasizes using AI as a genuine thinking partner, not as a decoration layer. His framing of “co-intelligence” is about a real collaborative relationship where the AI is doing meaningful cognitive work you’d otherwise have to do yourself or not do at all. That framing is the right filter. Does this AI call represent cognitive work that would otherwise consume human time or not happen at all? If yes, it’s probably a good use. If it’s adding a language-model shaped step to something deterministic that was working fine, the low token cost doesn’t make it a good idea.
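One way to apply that filter in code is to try the deterministic path first and reach for a model only when it fails. The sketch below assumes the task is pulling an ISO-format date out of text; `extract_date` and `stub_llm` are hypothetical names, and the stub returns a canned answer so the example runs without any API.

```python
import re
from typing import Callable, Optional, Tuple

# Matches dates written as YYYY-MM-DD (does not validate the calendar values).
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_date(text: str,
                 llm_fallback: Callable[[str], Optional[str]]) -> Tuple[Optional[str], str]:
    """Return (date_or_None, which_path_produced_it)."""
    match = ISO_DATE.search(text)
    if match:
        # Deterministic, fast, and free: no model call needed.
        return match.group(0), "regex"
    # Only ambiguous input reaches the model.
    return llm_fallback(text), "llm"


def stub_llm(text: str) -> Optional[str]:
    """Stand-in for a model call that interprets free-form date phrases."""
    return "2025-03-14" if "pi day" in text.lower() else None


print(extract_date("Payment due 2025-06-30 at noon", stub_llm))
print(extract_date("Payment due on Pi Day", stub_llm))
```

Returning the path alongside the result makes it easy to log how often the model is actually needed, which is the number that tells you whether the call is doing cognitive work or decorating a lookup.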
Architecture Decisions for the Cheap-Token Era
The shift I’ve made in how I think about AI architecture:
Token cost is no longer the first constraint. Latency, reliability, and correctness are. Design around those; then check whether the resulting token consumption is acceptable. In most cases it will be.
Rethink every rate limiter and usage cap you implemented in 2022–2023. Some of them are protecting against real abuse or load patterns. Some of them are protecting against a cost that no longer exists. Audit which is which.
Build for outcome measurement from the start. If you can’t measure whether the AI-assisted workflow is producing better outcomes than the alternative, you don’t actually know what you’re getting for the token spend, even if the token spend is low.
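For the rate-limiter audit in particular, a back-of-the-envelope check is to price out what would happen if usage ran well past the current cap. The sketch below treats a cap as cost-driven only if ten times the capped usage would exceed the monthly budget. The function names, the 10x headroom factor, and all the example numbers are assumptions, and the check says nothing about abuse or load, which need their own review.

```python
def monthly_token_cost(requests_per_user_per_day: int, users: int, tokens_per_request: int,
                       price_per_million: float, days: int = 30) -> float:
    """Dollars per month if every user hits the given daily request count."""
    tokens = requests_per_user_per_day * users * tokens_per_request * days
    return tokens * price_per_million / 1_000_000


def cap_is_cost_driven(cap_per_user_per_day: int, users: int, tokens_per_request: int,
                       price_per_million: float, monthly_budget: float,
                       headroom: int = 10) -> bool:
    """True if usage at `headroom` times the current cap would exceed the budget."""
    worst_case = monthly_token_cost(cap_per_user_per_day * headroom, users,
                                    tokens_per_request, price_per_million)
    return worst_case > monthly_budget


# Assumptions: cap of 20 requests per user per day, 1,000 users,
# 4,000 tokens per request, $5,000 monthly budget.
for price in (30.00, 0.14):
    at_cap = monthly_token_cost(20, 1_000, 4_000, price)
    driven = cap_is_cost_driven(20, 1_000, 4_000, price, monthly_budget=5_000)
    print(f"${price:.2f}/M: ${at_cap:,.2f}/month at the cap, cost-driven cap: {driven}")
```

With these inputs the cap is clearly justified at $30 per million tokens and not justified by cost at $0.14, which is the situation the inherited limiter in my client project was in.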
And read Mollick. He’s not primarily an architecture writer (he’s a management professor studying AI adoption), but the economic framing he provides is exactly what’s missing from most engineering discussions about AI deployment. The tokens are not the point. The outcomes are.