Inference Ate the Datacenter
Jensen Huang's 'AI factories' framing isn't marketing. Inference is now the dominant AI workload — and it's reshaping how datacenters are designed.
Jensen Huang has a gift for naming things at the right moment. When he started calling datacenters “AI factories” in 2024, a lot of people dismissed it as NVIDIA showmanship. A rebranding play. But the framing is actually correct, and understanding why changes how you think about where compute is going.
At NVIDIA’s GTC 2024 event, Huang laid out the logic explicitly: “The raw material that goes in is data and electricity. What comes out of it is data tokens.” He described a structural shift where companies and countries are “partnering with NVIDIA to shift the trillion-dollar traditional data centers to accelerated computing and build a new type of data center — AI factories — to produce a new commodity: artificial intelligence.”
A factory is a useful mental model because it forces you to think about inputs, outputs, throughput, yield, and unit economics — the same way a manufacturing plant thinks. A traditional datacenter was more like a warehouse: you stored compute capacity and rented it out. An AI factory is a production system. And production systems are designed around the dominant production task.
The dominant production task has shifted from training to inference. That shift is what is actually reshaping the datacenter industry.
Training vs. Inference: The Economics Are Different
When the current wave of AI started, training was the headline. GPT-3 cost millions of dollars to train by most outside estimates. GPT-4 cost more. Every major lab was racing to train larger, more capable models, and the compute market was built around massive, long-duration training jobs: clusters of thousands of GPUs running synchronously for weeks or months.
Inference was almost an afterthought in the infrastructure conversation. Once you trained a model, serving it felt like a smaller, simpler problem.
That calculus has flipped.
Training a frontier model is a one-time or periodic event: a lab trains GPT-5, trains the next Gemini, trains the next Claude. Inference happens continuously, at every query, from every user, across every product that uses the model. ChatGPT, at its peak, was handling over 100 million daily active users, and every query from every one of those users is an inference call. Microsoft’s Copilot products, embedded across Office, Teams, and GitHub, generate inference demand at enterprise scale around the clock. Meta AI runs inside WhatsApp, which has more than two billion users.
The numbers compound fast. Jensen Huang, speaking about what he calls inference-time scaling (where models do extended reasoning and chain-of-thought deliberation before answering), described a dual-axis problem: throughput (how many tokens per second a system can generate in total) and responsiveness (tokens per second delivered to each individual user). Reasoning models like o3, Gemini Thinking, or DeepSeek-R1 don’t just generate an answer; they think through a problem, generating many intermediate tokens before producing output. That can multiply the inference compute per query by a factor of ten to a hundred compared to older models.
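To make the two axes concrete, here is a rough sketch in Python. It assumes a serving system that shares its aggregate token throughput evenly across concurrent streams, and it ignores time to first token and batching effects. The function names and every number are illustrative assumptions, not measurements of any real system.

```python
def per_user_rate(aggregate_tokens_per_s: float, concurrent_streams: int) -> float:
    """Tokens per second each user sees, assuming throughput is shared evenly."""
    if concurrent_streams <= 0:
        raise ValueError("concurrent_streams must be positive")
    return aggregate_tokens_per_s / concurrent_streams


def generation_time_s(tokens_to_generate: int, user_tokens_per_s: float) -> float:
    """Seconds needed to stream a response at a given per-user rate."""
    return tokens_to_generate / user_tokens_per_s


# Hypothetical system: 10,000 tokens/s in aggregate, 200 concurrent users.
rate = per_user_rate(10_000, 200)

# A direct answer of 400 tokens versus a reasoning response of 4,000 tokens.
direct_s = generation_time_s(400, rate)
reasoning_s = generation_time_s(4_000, rate)

print(f"per-user rate: {rate} tokens/s")
print(f"direct answer: {direct_s} s, reasoning answer: {reasoning_s} s")
```

Under these assumptions, the same hardware that feels fast for a short answer feels slow for a reasoning response, which is why throughput and responsiveness have to be considered together.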
At NVIDIA’s earnings calls through 2024 and 2025, Huang has consistently emphasized that the shift to reasoning AI (inference-time scaling) is driving “orders-of-magnitude increases” in the compute required for inference. That is not marketing hyperbole. If a model does 20 steps of chain-of-thought reasoning before answering, and each step generates about as many tokens as a direct answer would, you have roughly 20x the inference compute per query compared with a model that answers directly.
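The arithmetic behind that estimate can be sketched in a few lines of Python. The sketch assumes inference compute scales linearly with the number of generated tokens, which ignores prompt processing and the growing cost of attention over longer contexts. The function name and token counts are hypothetical.

```python
def reasoning_multiplier(answer_tokens: int, steps: int, tokens_per_step: int) -> float:
    """Ratio of tokens generated with reasoning to tokens generated without it.

    Assumes compute is proportional to generated tokens (a simplification).
    """
    direct = answer_tokens
    with_reasoning = steps * tokens_per_step + answer_tokens
    return with_reasoning / direct


# 20 reasoning steps, each about as long as the 200-token final answer.
print(reasoning_multiplier(answer_tokens=200, steps=20, tokens_per_step=200))  # 21.0
```

The result is 21 rather than exactly 20 because the model still has to produce the final answer after it finishes reasoning.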
What an AI Factory Actually Looks Like
The architectural implications are significant. A datacenter designed for training is optimized for sustained, batch, high-throughput computation. You want the biggest GPU clusters you can build, interconnected at maximum bandwidth (NVLink, InfiniBand), running the same job for days without interruption. Latency within a single request is less critical than aggregate throughput across the entire cluster.
An inference datacenter has different requirements. Latency matters enormously — users will not wait ten seconds for a response, even if the model is smarter. You need horizontal scalability, because demand spikes unpredictably. You need cost efficiency per query, because at consumer scale, the math has to work at fractions of a cent per token.
This is why you are seeing differentiated infrastructure emerge. NVIDIA’s Blackwell architecture was explicitly designed for inference-time scaling, with its NVLink rack-scale system aimed at what Huang called “reasoning AI models,” a source of demand that training-era GPUs weren’t optimized for. Inference-specialized chips (from NVIDIA, from AWS with Inferentia, from Google with TPUs) all target a different optimization surface than training chips.
The “AI factory” framing captures something else too: factories are purpose-built. You don’t run injection molding equipment in the same facility as precision machining if you can help it. The trend in serious AI infrastructure is toward purpose-built clusters — some optimized for large pretraining runs, others optimized for high-concurrency, low-latency inference, others for fine-tuning and experimentation.
The Builder’s Angle
I have built products on top of large-scale AI systems, and the inference-first reality shapes decisions at every layer.
The first is cost modeling. Training costs are bounded and predictable; you know roughly what a training run costs before you start. Inference costs are variable and scale with usage. If your product succeeds and usage grows tenfold, your inference bill grows tenfold too, unless you are actively optimizing. I have seen teams get blindsided by this. A product that is economical at 10,000 daily users becomes impossible to sustain at a million if revenue per user does not cover inference cost per user.
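A back-of-the-envelope cost model makes the scaling visible. The Python sketch below assumes a flat price per million tokens and a fixed usage pattern per user. The `UsageProfile` fields and all figures are hypothetical placeholders, not real pricing from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageProfile:
    queries_per_user_per_day: float
    tokens_per_query: float
    price_per_million_tokens: float  # USD, assumed flat with no volume discount


def monthly_inference_cost(daily_users: int, profile: UsageProfile, days: int = 30) -> float:
    """Estimated monthly inference bill in USD under the profile's assumptions."""
    tokens = daily_users * profile.queries_per_user_per_day * profile.tokens_per_query * days
    return tokens / 1_000_000 * profile.price_per_million_tokens


# Hypothetical product: 10 queries per user per day, 2,000 tokens each, $5 per million tokens.
profile = UsageProfile(queries_per_user_per_day=10, tokens_per_query=2_000, price_per_million_tokens=5.0)

for users in (10_000, 1_000_000):
    print(f"{users:>9,} daily users -> ${monthly_inference_cost(users, profile):,.0f} per month")
```

With these placeholder numbers the bill goes from $30,000 to $3,000,000 per month. The cost per user never changes, so the product only survives that growth if revenue per user was already above inference cost per user, or if the cost per query comes down.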
The second is latency architecture. Inference-time scaling means smarter models but slower responses. If your application can tolerate latency — async research tasks, background document processing, batch analysis — you can use the most capable reasoning models freely. If your application requires sub-second responses, you need a different model or a caching layer or a distilled model tuned for speed. Most product architectures now need to think about this explicitly.
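One way to make that decision explicit is a small routing layer that checks a cache first and then picks the most capable model that fits the latency budget. The Python sketch below is illustrative only: the tier names, capability ranks, and latency figures are made up, and `call_model` stands in for whatever client your stack actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    capability: int            # higher means more capable (hypothetical ranking)
    typical_latency_s: float   # assumed typical end-to-end latency


# Hypothetical tiers; real latencies must be measured for your workload.
TIERS: List[ModelTier] = [
    ModelTier("reasoning-large", capability=3, typical_latency_s=30.0),
    ModelTier("general-medium", capability=2, typical_latency_s=3.0),
    ModelTier("distilled-small", capability=1, typical_latency_s=0.4),
]


def choose_tier(latency_budget_s: float, tiers: List[ModelTier] = TIERS) -> Optional[ModelTier]:
    """Return the most capable tier whose typical latency fits the budget, if any."""
    eligible = [t for t in tiers if t.typical_latency_s <= latency_budget_s]
    return max(eligible, key=lambda t: t.capability) if eligible else None


def answer(prompt: str, latency_budget_s: float, cache: Dict[str, str],
           call_model: Callable[[str, str], str]) -> str:
    """Serve from cache when possible; otherwise route by latency budget."""
    if prompt in cache:
        return cache[prompt]
    tier = choose_tier(latency_budget_s)
    if tier is None:
        raise ValueError("no model tier fits the latency budget")
    result = call_model(tier.name, prompt)
    cache[prompt] = result
    return result


# Usage with a stand-in model client that just echoes the chosen tier.
fake_client = lambda model, prompt: f"[{model}] {prompt}"
cache: Dict[str, str] = {}
print(answer("summarize this contract", 60.0, cache, fake_client))  # background task
print(answer("autocomplete this line", 1.0, cache, fake_client))    # interactive task
```

An exact-match cache is the simplest possible choice here; it only helps when identical prompts repeat, which is common for some products and rare for others.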
The third is where the real value gets captured. Training a frontier model is something three to five organizations in the world can credibly do right now. Inference infrastructure is something thousands of companies are building and competing on. The application layer — the products, the interfaces, the integrations — is where the majority of the value will accrue over the next five years. The inference layer is the utility that makes those products possible.
The Industrial Metaphor Holds
I grew up technically in a world where the dominant metaphor for computing was either “the brain” (for AI) or “the library” (for databases and storage). Jensen’s factory metaphor cuts through both of those in a useful way.
A factory has production schedules. It has utilization rates. It has quality control. It has unit economics that have to work. It has capital investment cycles. These are not new concepts — they are how the manufacturing world has operated for 200 years. AI infrastructure is importing those disciplines, because AI at scale is genuinely a production problem, not a research problem.
The companies that understand this early — that think about inference the way a factory manager thinks about production throughput, yield, and cost per unit — are the ones that will build durable businesses on top of this infrastructure wave. The ones that still think about AI as a research or capability problem will be left managing infrastructure costs they didn’t plan for and can’t afford.
The datacenter got eaten by inference. The smart builders are designing for that reality from day one.