The March of Nines: Why Agent Reliability Is the Real AI Problem in 2026

90% Working Is Not Working

There’s a specific kind of demo that I’ve stopped getting excited about. It’s the one where an AI agent does something impressive — books a meeting, writes and executes a data pipeline, submits a pull request — and everyone in the room nods along, and someone says “we should deploy this.” And I’ve learned to ask the same question every time: “What happened in the runs that didn’t work?”

That question usually produces an uncomfortable silence.

Andrej Karpathy gave the clearest articulation I’ve heard of why that silence matters. In his 2025 interview on Dwarkesh Patel’s podcast, he described what he calls the “march of nines” — drawing directly from his five years running Tesla’s Autopilot team. The framework is this: “When you get a demo and something works 90% of the time, that’s just the first nine. Then you need the second nine, a third nine, a fourth nine, a fifth nine. Every single nine is the same amount of work.”

That’s it. That’s the whole insight. And it’s devastating for how most organizations are thinking about AI agents right now.

What the Self-Driving Analogy Actually Teaches

The self-driving domain is the clearest prior art we have on the march-of-nines problem because the failure consequences are physical and immediate. A car that drives correctly 90% of the time is not a car you can sell. It’s not even a car you can run as a fleet product with a safety driver. It’s a demo.

Getting from 90% to 99% is a lot of work. Getting from 99% to 99.9% is roughly that much work again, even though the number barely moves. Pushing toward 99.999%, the kind of reliability where you can remove the safety driver entirely, is a multi-year effort: Tesla’s Autopilot team spent years of dedicated engineering on corner cases, edge conditions, sensor fusion failures, and the long tail of road situations that a demo never encounters.
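To make the scale concrete, here is a minimal sketch of the failure budget at each nine. It is my own illustration, not something from Karpathy's interview; the function name `failure_budget` and the per-million framing are assumptions chosen for readability.

```python
def failure_budget(max_nines: int = 5, runs: int = 1_000_000):
    """Return (nines, success_label, allowed_failures) rows.

    Illustrative only: "n nines" is taken to mean a failure rate of 10**-n.
    """
    rows = []
    for n in range(1, max_nines + 1):
        label = f"{100 - 100 / 10**n:g}%"   # 90%, 99%, 99.9%, ...
        allowed = runs // 10**n             # failures tolerated across `runs` runs
        rows.append((n, label, allowed))
    return rows


for n, label, allowed in failure_budget():
    print(f"{n} nine(s): {label:>8} success, {allowed:>7,} failures per million runs")
```

Each nine cuts the failure budget by a factor of ten, from 100,000 failures per million runs at one nine down to 10 at five nines. The success percentage barely changes; the number of failure cases you are allowed to leave unfixed collapses.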

Karpathy’s point is that the demo and the production system are separated by multiple full cycles of this work. And because each nine costs roughly as much effort as the one before it, organizations that treat “impressive demo” as “ready to ship” are systematically underestimating what comes next.

This maps onto AI agents almost perfectly.

Where Agent Chains Actually Break

An AI agent doing anything non-trivial is usually doing something like: parse intent, retrieve relevant context, plan a sequence of steps, execute tool calls, validate outputs, handle errors, format results. Call it six to ten sequential operations, each with its own reliability distribution.

The math is uncomfortable. If each step in a ten-step agent chain succeeds 95% of the time (which is already optimistic for anything involving real-world tool calls, API responses, or model reasoning on novel inputs), the end-to-end success rate is roughly 60%. That’s not a reliable product. That’s a coin flip with extra steps.
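The arithmetic behind that number is short enough to check yourself. The sketch below assumes every step must succeed and that steps fail independently, which is a simplification (real failures are often correlated); `chain_success_rate` is a name I made up for the example.

```python
def chain_success_rate(step_success: float, steps: int) -> float:
    """End-to-end success probability when every step must succeed.

    Assumes independent step failures, which is a simplification.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


for per_step in (0.90, 0.95, 0.99, 0.999):
    rate = chain_success_rate(per_step, steps=10)
    print(f"per-step {per_step:.1%} -> end-to-end {rate:.1%}")
```

Under these assumptions a ten-step chain at 95% per step lands just under 60% end to end, and even 99% per step only gets you to about 90%.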

I’ve been deploying agent systems in a few different contexts (some internal to Zero, some for clients who want to automate workflows) and the reliability patterns I see are consistent with what Karpathy describes. Things that work 90% of the time in controlled testing hit different failure modes in production because production has a longer tail. The retrieval step that worked fine on your evaluation set fails on the user’s actual query phrasing. The tool call that worked in staging hits a rate limit in production during peak hours. The output parser that handled every test case chokes on a response format that’s slightly different from what the model usually produces.

Each failure mode is individually addressable. But each fix reveals the next failure mode. The march of nines in agent systems is addressing the long tail of failure modes, one nine at a time.

The 2026 Landscape

What I’m seeing in 2026 is a divergence between organizations that understand this and those that don’t, and it’s becoming visible in deployment outcomes.

The organizations that don’t understand the march of nines are deploying agents broadly, measuring task completion rate on their initial evaluation set, declaring success, and then dealing with a slow drip of production failures that erode trust in the system. The failure isn’t catastrophic — the agent usually does something, just not always the right something — which means the signal is noisy and the response is slow. By the time there’s enough evidence to take the reliability problem seriously, the political cost of rolling back is high.

The organizations that do understand it are doing something different: they’re deploying agents narrowly, in domains where they can instrument every failure, and they’re running the march of nines deliberately. They pick one workflow, get it to four or five nines, validate that thoroughly, and only then expand scope. It’s slower to deploy this way. It’s dramatically more reliable when it reaches production.

This is exactly how responsible autonomous vehicle deployment worked. You don’t start in downtown San Francisco at rush hour. You start in a geofenced area with controlled conditions, measure everything, address failures systematically, and expand the operational design domain as you add nines.

Why Long-Task Reliability Is the Hard Problem

Single-step AI tasks — generate a summary, classify a support ticket, write a first draft — have gotten genuinely reliable. The better frontier models do these things at a quality level that’s consistent enough to build products on. That’s the 99%+ territory for narrow, bounded tasks.

The reliability cliff happens when you chain things together and the task duration extends. A task that takes one second to complete has a reliability ceiling set by the model’s per-call accuracy. A task that takes ten minutes (because it requires multiple tool calls, intermediate reasoning steps, external API dependencies, and user state that might change during execution) has a ceiling set by the product of every intermediate step’s success probability, so each additional step can only lower it.

This is why I’m skeptical of the 2025 wave of “fully autonomous” agent marketing. Fully autonomous over a long-horizon task means you’re trusting not just the model, but the model’s planner, the model’s error recovery behavior, every tool it calls, every API those tools depend on, and the model’s ability to know when it’s stuck and needs to escalate. Each of those has its own reliability. End-to-end reliability is the product of all of them, and it degrades as the task horizon extends.
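You can turn the same relationship around and ask how reliable each component has to be for a given horizon. This is a rough sketch under the same independence assumption as before, treating every step as equally reliable; `required_step_success` is a hypothetical helper, not part of any library.

```python
def required_step_success(target_end_to_end: float, steps: int) -> float:
    """Per-step success rate needed to hit an end-to-end target.

    Assumes equally reliable, independent steps that must all succeed.
    """
    if not 0.0 < target_end_to_end <= 1.0:
        raise ValueError("target_end_to_end must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target_end_to_end ** (1.0 / steps)


for steps in (1, 5, 10, 20, 50):
    needed = required_step_success(0.99, steps)
    print(f"{steps:>2} steps: each step must succeed {needed:.3%} of the time")
```

For a 99% end-to-end target, a ten-step task needs each step at about 99.9%. In other words, lengthening the horizon by a factor of ten costs you roughly one extra nine on every component.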

I had a direct experience of this building an automated research and synthesis workflow at Zero. The demo worked beautifully. Twenty minutes of autonomous research, structured output, nicely formatted. We ran it fifty times in testing and it worked forty-two times. That’s 84%. We thought that was acceptable. In the first month of production use it hit three failure modes we’d never seen in testing: a specific source returning an unexpected response format, a query that caused the retrieval step to loop, and an output that looked complete but had silently dropped a required section. All individually fixable. All invisible until real usage found them.
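In hindsight, fifty runs was also too few to know what we had. As an illustration, here is a sketch of a standard Wilson score interval applied to our 42-out-of-50 result. It assumes the runs were independent and representative of production traffic, which ours were not, so treat the output as an optimistic bound on what we actually knew.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(42, 50)
print(f"observed 84%, plausible range roughly {low:.0%} to {high:.0%}")
```

With only fifty runs, an observed 84% is consistent with a true success rate anywhere from roughly 71% to 92%. A test set that small cannot distinguish "almost one nine" from "well short of it."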

We’re now on the march. Each fix finds the next edge case. That’s what Karpathy means.

What This Means for Architecture

The practical implication of the march of nines for anyone building with agents: design for failure modes before you design for features.

First: every agent chain needs explicit failure states that are visible and recoverable. Not silent failures where the agent returns something plausible but wrong, but visible failures where the system can say “I got to step four and couldn’t proceed, here’s what I had.” Human escalation paths aren’t a fallback. They’re a feature. The handoff from agent to human should be graceful and informative, not a dropped context and a confused user.
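One way to make failure states explicit is to have the chain runner return a structured result instead of raising or returning a bare value. The sketch below is a minimal illustration, not how any particular framework does it; `ChainResult`, `run_chain`, and the step names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], Any]]


@dataclass
class ChainResult:
    ok: bool = True
    completed: List[str] = field(default_factory=list)
    outputs: Dict[str, Any] = field(default_factory=dict)
    failed_step: Optional[str] = None
    error: Optional[str] = None

    def handoff_message(self) -> str:
        """Summary a human can act on when the agent cannot proceed."""
        if self.ok:
            return f"Completed all {len(self.completed)} steps."
        done = ", ".join(self.completed) or "nothing"
        return f"Stopped at step '{self.failed_step}' ({self.error}). Completed: {done}."


def run_chain(steps: List[Step]) -> ChainResult:
    """Run steps in order; on failure, stop and keep the partial outputs."""
    result = ChainResult()
    for name, fn in steps:
        try:
            result.outputs[name] = fn(result.outputs)
        except Exception as exc:  # surface the failure instead of hiding it
            result.ok = False
            result.failed_step = name
            result.error = f"{type(exc).__name__}: {exc}"
            return result
        result.completed.append(name)
    return result


def parse_intent(ctx):
    return "summarize"


def retrieve(ctx):
    raise TimeoutError("search API timed out")  # simulated failure


def format_output(ctx):
    return "done"


result = run_chain(
    [("parse_intent", parse_intent), ("retrieve", retrieve), ("format", format_output)]
)
print(result.handoff_message())
```

The point of the design is that the partial outputs and the name of the failing step survive the failure, so whoever picks up the task does not start from zero.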

Second: monitor at the step level, not just at the outcome level. If you’re only measuring whether the final output was correct, you can’t see where in the chain the reliability is degrading. Instrumenting intermediate steps is overhead. It’s worth it.
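Step-level monitoring does not need much machinery to start. Here is a minimal in-memory sketch, assuming you can call a hook after each step; `StepMonitor` and the example step names and counts are made up for illustration, and a real deployment would send these counts to whatever metrics system you already run.

```python
from collections import Counter
from typing import Dict, Optional


class StepMonitor:
    """Counts attempts and failures per step so weak links are visible."""

    def __init__(self) -> None:
        self.attempts: Counter = Counter()
        self.failures: Counter = Counter()

    def record(self, step: str, ok: bool) -> None:
        self.attempts[step] += 1
        if not ok:
            self.failures[step] += 1

    def success_rates(self) -> Dict[str, float]:
        return {
            step: 1 - self.failures[step] / count
            for step, count in self.attempts.items()
        }

    def weakest_step(self) -> Optional[str]:
        rates = self.success_rates()
        return min(rates, key=rates.get) if rates else None


monitor = StepMonitor()
# Simulated history: (step name, attempts, failures)
for step, attempts, failed in [("parse", 100, 1), ("retrieve", 100, 8), ("format", 92, 2)]:
    for i in range(attempts):
        monitor.record(step, ok=(i >= failed))

print(monitor.weakest_step(), monitor.success_rates())
```

In this simulated history the outcome-level number would only tell you that about one run in ten fails. The per-step view tells you that most of those failures come from retrieval, which is where the next nine has to be earned.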

Third: build your evaluation suite to cover the long tail, not just the happy path. Demo scenarios are always the happy path. Real usage isn’t. If your evals only cover the scenarios you thought of, you’re testing for the first nine. Getting to the second nine requires actively hunting for the scenarios you didn’t think of.
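A cheap way to see whether your evals cover the tail is to tag every case with the scenario it exercises and report per tag rather than in aggregate. The sketch below is one possible shape for that; the tag names, the `min_cases` threshold of 20, and the function `pass_rate_by_tag` are assumptions for illustration, not a recommendation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_tag(
    results: Iterable[Tuple[str, bool]], min_cases: int = 20
) -> Dict[str, dict]:
    """Group (tag, passed) eval results and flag tags with too few cases."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for tag, passed in results:
        counts[tag][1] += 1
        if passed:
            counts[tag][0] += 1
    return {
        tag: {
            "pass_rate": passed / total,
            "cases": total,
            "thin_coverage": total < min_cases,
        }
        for tag, (passed, total) in counts.items()
    }


results = (
    [("happy_path", True)] * 40
    + [("malformed_source", True)] * 3
    + [("malformed_source", False)] * 2
    + [("rate_limited", False)]
)
report = pass_rate_by_tag(results)
for tag, row in report.items():
    print(tag, row)
```

In this made-up suite the aggregate pass rate looks healthy because the happy path dominates the count. The per-tag view shows that the tail scenarios are both failing more often and barely tested.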

Karpathy’s insight is ultimately about humility in the face of compounding complexity. The demo is not the product. The first nine is not the nines you need. Every additional nine costs about as much work as the one before it.

The question for 2026 isn’t whether your agent can do the thing. It’s whether it can do the thing reliably enough that you’d trust it to run while you sleep.

For most agents deployed today, the honest answer is not yet.

Anshad Ameenza
