The End of NVIDIA's Ninety Percent

There is a peculiar dynamic in the AI chip market right now. NVIDIA’s best customers — the three or four companies spending the most on AI compute — are simultaneously its most serious competitors in the silicon layer. And they are not losing.

This is not a new dynamic in tech. Intel built a dominant server CPU business, then watched as Amazon, Google, and Microsoft all developed custom Arm-based chips for their own datacenters. Apple took a similar path in mobile. Hyperscalers have a long history of vertically integrating into silicon when the economics justify it and their scale makes the investment sensible. What is different now is that the AI chip market is large enough, and the total cost of ownership (TCO) advantages clear enough, that the move to custom AI silicon has accelerated well past the experimental phase.

Let me be concrete about what is real versus what is still mostly positioning.

The Chips That Actually Exist

Google’s TPU (Tensor Processing Unit) is the oldest and most mature custom AI accelerator in production. Google has been running TPUs in its datacenters since around 2015 and disclosed them publicly in 2016, before the current AI boom and before anyone was talking about GPU clusters at scale. The sixth generation is TPU v6, also called Trillium. Google has publicly stated a 4.7x improvement in peak compute per chip over TPU v5e, with 67% better energy efficiency. This is not a research project; these chips run Gemini inference and training at production scale. Google routinely deploys TPU clusters of 10,000 or more chips.

AWS Trainium is Amazon’s training-focused custom chip, with Inferentia handling inference. The adoption curve has been real but initially slow — internal AWS data from 2024 showed Trainium at under 1% of GPU usage, with Inferentia at around 2.7%. Those numbers have moved since, as AWS has scaled Trainium availability and Amazon has aggressively pushed it to customers, particularly for Bedrock workloads. But the honest picture is that AWS’s custom silicon is still maturing relative to Google’s.

Microsoft Maia was co-designed with OpenAI specifically for GPT-4 class model workloads. Microsoft has been more cautious about disclosing utilization numbers, but the stated purpose is handling a subset of Azure’s OpenAI inference at lower cost than equivalent NVIDIA hardware for that specific workload profile. Approximately 70% of Azure AI workloads reportedly still run on NVIDIA hardware as of late 2025. Maia is not a general-purpose replacement — it is a cost optimization for a specific, high-volume workload.

Meta MTIA (Meta Training and Inference Accelerator) has been deployed internally for Meta’s recommendation systems and, more recently, for inference on some Llama model workloads. Meta has not disclosed utilization at scale, but they have confirmed production deployment.

The TCO Math That Drives This

Why do hyperscalers bother? Building a competitive AI chip requires hundreds of millions of dollars in engineering investment, multiple years of development, and close coordination with foundries like TSMC. That is a serious commitment. The answer is in the economics at scale.

SemiAnalysis and others doing TCO analysis on hyperscaler custom silicon have found TCO advantages of 40 to 65% for targeted workloads — specifically high-volume, well-characterized inference jobs where you can optimize the chip for a narrow set of numerical operations. NVIDIA’s GPUs are general-purpose accelerators. They are optimized across a wide range of workloads, which means they are not optimally efficient for any single one.

When you are running ten billion inference calls per day on a specific model — which Google and Microsoft are doing with Gemini and Copilot respectively — a 40% reduction in per-call compute cost is worth billions of dollars per year. At that scale, the chip development investment amortizes quickly.
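To make that arithmetic concrete, here is a back-of-envelope sketch. Only the call volume and the 40% reduction come from the paragraph above; the baseline cost per call and the chip program cost are my own illustrative assumptions (the latter picked from the "hundreds of millions" range), not published figures.

```python
# Back-of-envelope sketch of the savings and payback arithmetic.
# CALLS_PER_DAY and COST_REDUCTION come from the article text;
# the two dollar figures below are illustrative assumptions.

CALLS_PER_DAY = 10_000_000_000       # ten billion inference calls per day
COST_REDUCTION = 0.40                # 40% lower per-call compute cost
BASELINE_COST_PER_CALL_USD = 0.002   # ASSUMPTION: not a published number
CHIP_PROGRAM_COST_USD = 500_000_000  # ASSUMPTION: "hundreds of millions"


def annual_savings(calls_per_day: float, cost_per_call: float, reduction: float) -> float:
    """Yearly compute spend avoided by cutting per-call cost by `reduction`."""
    return calls_per_day * 365 * cost_per_call * reduction


def payback_months(program_cost: float, yearly_savings: float) -> float:
    """Months of savings needed to recover the chip development cost."""
    return program_cost / (yearly_savings / 12)


savings = annual_savings(CALLS_PER_DAY, BASELINE_COST_PER_CALL_USD, COST_REDUCTION)
months = payback_months(CHIP_PROGRAM_COST_USD, savings)
print(f"Annual savings: ${savings / 1e9:.2f}B")
print(f"Payback period: {months:.1f} months")
```

Under these assumptions the savings come to roughly $2.9 billion a year, and the development cost pays back in about two months. The result is linear in the assumed per-call cost, so a workload ten times cheaper per call would save ten times less.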

NVIDIA’s data center AI market share was estimated at around 86% in 2024 and is projected to decline to approximately 75% by 2026 as hyperscalers deploy custom silicon at scale, with custom chips capturing 15 to 25% of the market in specific inference workloads. That is a material shift, though NVIDIA’s dominance in training and in general-purpose GPU workloads remains largely intact.

What NVIDIA Actually Defends

The nuanced reality — and this is where I think a lot of the coverage gets sloppy — is that NVIDIA is not losing across the board. The areas where custom silicon wins are structurally different from the areas where NVIDIA wins.

Custom silicon wins on inference for well-characterized, high-volume workloads. If you are running the same model on the same input distribution at massive scale, you can optimize aggressively for that specific task. Google does this with search-related Gemini inference. Amazon does this with certain Bedrock inference. The workload is known, the optimization surface is clear.

NVIDIA wins on training, on new model development, on flexibility, and on the long tail of workloads. The reason every AI research lab and every enterprise AI team buys NVIDIA GPUs is that CUDA and the software ecosystem are unmatched. The H100 and B200 are the tools that frontier model researchers want, because they work with everything, because PyTorch support is first-class, and because NVIDIA’s iteration on new features is faster than any custom silicon vendor’s.

Training runs are also inherently less predictable in their workload profile — researchers are constantly changing model architectures, trying new training techniques, scaling in ways that no fixed-function chip anticipated. GPUs handle this; custom chips often don’t.

The divide will probably deepen: hyperscaler custom silicon takes an increasing fraction of inference; NVIDIA dominates training and the heterogeneous long tail. The training market is also growing fast, so NVIDIA’s absolute revenue grows even as its share in inference erodes.
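The claim that revenue can grow while share shrinks is easy to check with a small sketch. The two share figures are the estimates quoted earlier in this article (about 86% in 2024, about 75% by 2026); the market growth scenarios in the loop are hypothetical, chosen only to show the shape of the relationship.

```python
# Share figures are the estimates quoted in this article;
# the market growth scenarios are HYPOTHETICAL.
SHARE_2024 = 0.86
SHARE_2026 = 0.75


def breakeven_market_growth(share_before: float, share_after: float) -> float:
    """Total-market growth at which revenue stays flat despite lower share."""
    return share_before / share_after - 1


def revenue_change(share_before: float, share_after: float, market_growth: float) -> float:
    """Relative revenue change when share shifts and the total market grows."""
    return share_after * (1 + market_growth) / share_before - 1


print(f"Break-even market growth: {breakeven_market_growth(SHARE_2024, SHARE_2026):.1%}")
for growth in (0.10, 0.50, 1.00):  # hypothetical two-year market growth
    change = revenue_change(SHARE_2024, SHARE_2026, growth)
    print(f"market {growth:+.0%} -> revenue {change:+.1%}")
```

With those share estimates, the total market only has to grow by a little under 15% over the period for absolute revenue to hold steady. Anything beyond that is growth, even with the lower share.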

What This Means for the Rest of Us

I spent years doing enterprise architecture at companies like Dell/EMC and HP. The question of “build vs. buy” for infrastructure is never really resolved once; it is re-evaluated as the economics and scale shift. What the hyperscalers are doing now is the infrastructure equivalent of a large manufacturer deciding to build its own machinery once it has enough volume to justify it.

If you are building AI products on top of cloud infrastructure, the custom silicon race has a few practical implications.

First, cloud pricing for AI inference is going to drop structurally over the next three to five years. When AWS can serve inference at 50% lower cost using Inferentia or Trainium than on NVIDIA hardware, those savings eventually flow into competitive pricing pressure. Google has already done this with TPU-based pricing. The inference cost curves are going to continue falling.

Second, portability matters more. If your inference stack is tightly coupled to NVIDIA-specific features — certain CUDA operations, specific memory patterns — you may find yourself paying a premium as more workloads migrate to custom silicon backends. Multi-backend inference frameworks and hardware-agnostic model formats are not just good engineering hygiene; they are increasingly good economics.
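One way to picture the portability point is a thin dispatch layer between application code and whatever hardware is available. The sketch below is purely illustrative: the registry, the backend names, and the toy "model" are all hypothetical and do not correspond to any real framework's API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# HYPOTHETICAL sketch: every name here is illustrative, not a real framework API.
Backend = Callable[[Sequence[float]], List[float]]

_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str) -> Callable[[Backend], Backend]:
    """Decorator that registers an inference function under a backend name."""
    def decorator(fn: Backend) -> Backend:
        _BACKENDS[name] = fn
        return fn
    return decorator


@register_backend("cpu-reference")
def cpu_reference(inputs: Sequence[float]) -> List[float]:
    # Stand-in for a real model: a portable pure-Python ReLU.
    return [max(0.0, x) for x in inputs]


def run_inference(inputs: Sequence[float], preferred: Sequence[str]) -> Tuple[str, List[float]]:
    """Run on the first registered backend in preference order."""
    for name in preferred:
        backend = _BACKENDS.get(name)
        if backend is not None:
            return name, backend(inputs)
    raise RuntimeError(f"no registered backend among {list(preferred)}")


# Only "cpu-reference" is registered here, so the call falls through to it.
used, outputs = run_inference(
    [-1.0, 2.0], preferred=["custom-accelerator", "gpu", "cpu-reference"]
)
print(used, outputs)
```

The design choice that matters is that callers name a preference order rather than a device. Moving a workload to a cheaper backend then becomes a registration change, not an application rewrite.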

Third, for the small number of teams actually building model-serving infrastructure at scale (not using managed APIs, but running their own inference stack), the evaluation of hardware is now genuinely complex. The right answer is no longer obviously “buy NVIDIA.” The TCO calculation depends on your specific workload, your scale, your team’s ability to work with less mature software stacks, and your tolerance for the operational complexity of custom silicon.
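For teams in that position, a useful first pass is a break-even volume: how many calls per day it takes before a lower per-call cost outweighs the extra fixed cost of a less mature stack. This is a minimal sketch with made-up numbers; the per-million-call prices and the overhead figure are assumptions, with the 40% gap taken from the low end of the TCO range cited above.

```python
from dataclasses import dataclass


@dataclass
class HardwareOption:
    name: str
    cost_per_million_calls_usd: float  # marginal compute cost
    annual_fixed_overhead_usd: float   # porting, tooling, extra operations

    def annual_tco(self, calls_per_day: float) -> float:
        """Yearly cost at a given daily call volume."""
        millions_per_year = calls_per_day * 365 / 1_000_000
        return (millions_per_year * self.cost_per_million_calls_usd
                + self.annual_fixed_overhead_usd)


def breakeven_calls_per_day(baseline: HardwareOption, alternative: HardwareOption) -> float:
    """Daily volume above which the alternative has the lower annual TCO."""
    saving_per_million = (baseline.cost_per_million_calls_usd
                          - alternative.cost_per_million_calls_usd)
    if saving_per_million <= 0:
        raise ValueError("alternative is not cheaper per call; no break-even volume")
    extra_fixed = alternative.annual_fixed_overhead_usd - baseline.annual_fixed_overhead_usd
    return max(0.0, extra_fixed) * 1_000_000 / (saving_per_million * 365)


# ASSUMPTIONS: illustrative prices only, not vendor figures.
gpu = HardwareOption("nvidia-gpu", cost_per_million_calls_usd=100.0,
                     annual_fixed_overhead_usd=0.0)
custom = HardwareOption("custom-accelerator", cost_per_million_calls_usd=60.0,
                        annual_fixed_overhead_usd=2_000_000.0)

breakeven = breakeven_calls_per_day(gpu, custom)
print(f"Break-even volume: {breakeven:,.0f} calls/day")
```

With these invented inputs the crossover sits at roughly 137 million calls per day. Below that volume the fixed overhead dominates and the general-purpose option is cheaper, which is the quantitative version of "your scale matters".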

The end of NVIDIA’s 90% dominance is not imminent, and it’s not a collapse. It’s a slow structural rotation, driven by the most sophisticated and well-resourced customers in the world deciding that at their scale, vertical integration into silicon is worth it. For everyone below hyperscaler scale, NVIDIA remains the default answer. But the direction is clear.

Anshad Ameenza
