The Productivity Paradox of AI Coding Tools
A rigorous 2025 study found experienced developers were 19% slower with AI tools — yet felt faster. Here's what that means for teams.
There’s a number I keep coming back to. Nineteen percent.
Not nineteen percent faster — slower. Developers using AI coding tools were 19% slower on real tasks compared to working without them. This is the headline finding from a randomized controlled trial published by METR (Model Evaluation and Threat Research) in July 2025, and it’s one of the most counterintuitive data points I’ve seen in years of building engineering teams.
The kicker? Those same developers predicted they’d be 24% faster before the study started. After completing their tasks with AI assistance, they estimated they had been 20% more productive. They were slower in reality, felt faster in the moment, and reported being faster after the fact. That’s not a tool problem — that’s a perception problem, and it has serious implications for how we manage engineering teams in the AI era.
What METR Actually Did
Let me give the study its due, because methodology matters. METR recruited 16 experienced open-source developers, people with an average of five years contributing to large, mature repositories (22,000+ GitHub stars, a million-plus lines of code, projects over a decade old). These aren’t boot camp graduates. The developers worked through 246 real tasks drawn from actual issue backlogs, each randomly assigned to either an “AI-allowed” condition (typically Cursor Pro with Claude 3.5/3.7 Sonnet) or an “AI-forbidden” condition. Developers recorded their screens while working and reported how long each task took to complete.
This is not a survey asking developers how productive they feel. It is a randomized controlled trial with a preregistered design on real production codebases, with completion time tracked task by task. The rigor here is unusually high for this type of research.
The result: a statistically significant 19% slowdown in the AI-assisted condition.
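To make the gap concrete, here is a small worked example using the three percentages quoted above. It is a sketch under two assumptions of mine: the 100-minute baseline is a made-up task length, and each percentage is read as a change in completion time relative to working without AI.

```python
# Worked arithmetic for the perception gap, using the three percentages
# quoted in this article. BASELINE_MINUTES is a hypothetical task length.

BASELINE_MINUTES = 100.0  # hypothetical time for one task without AI

# Assumption: each percentage is a fractional change in completion time
# relative to the no-AI baseline (negative means less time).
FORECAST_CHANGE = -0.24   # developers predicted they would be 24% faster
PERCEIVED_CHANGE = -0.20  # developers estimated afterwards they were 20% faster
MEASURED_CHANGE = 0.19    # measured result: 19% slower


def minutes_with_ai(baseline: float, change: float) -> float:
    """Return the task time after applying a fractional change to the baseline."""
    return baseline * (1.0 + change)


forecast = minutes_with_ai(BASELINE_MINUTES, FORECAST_CHANGE)
perceived = minutes_with_ai(BASELINE_MINUTES, PERCEIVED_CHANGE)
measured = minutes_with_ai(BASELINE_MINUTES, MEASURED_CHANGE)

# Gap between what developers believed happened and what was measured.
gap_minutes = measured - perceived
gap_ratio = measured / perceived

print(f"forecast:  {forecast:.0f} min")
print(f"perceived: {perceived:.0f} min")
print(f"measured:  {measured:.0f} min")
print(f"gap: {gap_minutes:.0f} min ({gap_ratio:.2f}x the perceived time)")
```

Under these assumptions, a task that felt like 80 minutes actually took 119, so the real time was roughly one and a half times what the developer believed.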
Gergely Orosz at The Pragmatic Engineer has been one of the most reliable voices tracking AI’s real (vs. marketed) impact on engineers. His newsletter has documented a recurring pattern: the productivity gains from AI tools tend to be concentrated in specific task types (greenfield boilerplate, test generation, documentation), while more complex tasks in large, established codebases show much weaker or negative effects. The METR study, conducted on exactly the kind of large, mature codebases that professional engineers actually work in, is consistent with what Orosz’s reporting has been pointing at.
Why This Happens — My Working Theory
I’ve been running engineering teams since before the AI coding era, from system engineer roles at HP and Nokia through to building out technology organizations at Intuit, Dell/EMC, and EY, and now scaling teams at Zero. I’ve watched this pattern play out in my own teams over the past two years.
My working theory: AI tools create a cognitive fluency trap.
When a developer types a prompt and sees code materialize, it feels like progress. The friction is gone. There’s no staring at a blank screen, no fighting with syntax. That sense of flow is real — but it’s decoupled from actual task completion. The trap is that what AI is really good at (generating plausible-looking code quickly) is not what slows experienced engineers down. Experienced engineers on large codebases are slow because of understanding, not because of typing.
Understanding why a 10-year-old codebase made the architectural choices it did. Understanding the failure modes of an existing abstraction before you extend it. Understanding which tests are actually meaningful. These are the slow parts. And for these, the AI-generated output is — at best — a distraction that needs to be verified, and at worst, a confident wrong answer that sends you down a rabbit hole for two hours.
There’s also an interruption cost. Switching between evaluating AI output, prompting again, evaluating more output — this breaks the deep focus mode that experienced engineers need for complex problem-solving. Ironically, the tool that’s supposed to speed things up may be fragmenting exactly the cognitive state that produces good work on hard problems.
The Perception Gap Is the Real Problem
Here’s what worries me more than the 19% slowdown: the perception gap.
If developers felt 20% faster while actually being 19% slower, you have a situation where developers will advocate enthusiastically for tools that are hurting their productivity. And that means managers who rely on developer self-reporting (which is most managers) will be systematically misled.
I’ve seen this at scale. When I ask engineers in my teams how AI tools are affecting them, the answer is almost always some version of “they’re amazing, I can’t imagine working without them.” When I look at actual cycle times and pull request throughput on complex features, the picture is more mixed. The correlation between “developer enthusiasm for a tool” and “measurable productivity improvement” turns out to be pretty weak.
This doesn’t mean the tools are bad. It means the measurement and management frameworks need to change.
Individual Speed vs. Team Throughput
This is the paradox I want to name explicitly, because I don’t see it talked about enough.
Even if an individual developer is slower on a given task with AI, there are team-level effects that could still be net positive. An AI tool that helps a mid-level engineer generate scaffolding and tests might free up a senior engineer to spend more time on architecture review. A tool that helps a developer in Trivandrum get unstuck on an API they haven’t used before might reduce the number of blocking questions they ping Bangalore with. These effects don’t show up in individual task timing.
Conversely, there are team-level effects that could make things worse. If every developer is generating more code faster (even if each task is slower end-to-end), you might end up with more code to review, more tests to maintain, more surface area for bugs — without a corresponding increase in feature value delivered. Code volume is not value. I’ve seen teams where the introduction of AI tools created a “code inflation” effect: PRs got bigger, review cycles got longer, and the team’s actual throughput on features the business cared about didn’t budge.
The right frame isn’t “is this individual faster?” It’s “is the team shipping more valuable things faster?” Those are different questions.
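One rough way to check for the code inflation effect described above is to compare PR size, review time, and features shipped across two periods. The sketch below is illustrative only: the `PullRequest` fields, the `looks_like_inflation` rule, and every number in the sample data are hypothetical, not measurements from my teams.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class PullRequest:
    lines_changed: int      # additions plus deletions
    review_hours: float     # time from PR opened to approval
    shipped_feature: bool   # True if the PR completed a tracked feature


def summarize(prs: list[PullRequest]) -> dict[str, float]:
    """Summarize size, review time, and features shipped for a set of PRs."""
    return {
        "median_lines": median(pr.lines_changed for pr in prs),
        "median_review_hours": median(pr.review_hours for pr in prs),
        "features_shipped": sum(pr.shipped_feature for pr in prs),
    }


def looks_like_inflation(before: list[PullRequest], after: list[PullRequest]) -> bool:
    """Hypothetical rule: PRs got bigger and slower to review, but no more features shipped."""
    b, a = summarize(before), summarize(after)
    return (
        a["median_lines"] > b["median_lines"]
        and a["median_review_hours"] > b["median_review_hours"]
        and a["features_shipped"] <= b["features_shipped"]
    )


# Made-up sample data for two equal-length periods.
before_ai = [
    PullRequest(120, 4.0, True),
    PullRequest(80, 3.0, False),
    PullRequest(200, 6.0, True),
    PullRequest(150, 5.0, False),
]
after_ai = [
    PullRequest(400, 9.0, True),
    PullRequest(260, 7.0, False),
    PullRequest(520, 12.0, True),
    PullRequest(310, 8.0, False),
]

print(summarize(before_ai))
print(summarize(after_ai))
print("inflation suspected:", looks_like_inflation(before_ai, after_ai))
```

Medians are used rather than means so that one unusually large PR does not dominate the comparison. A real check would also need to control for what kind of work each period contained.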
What I’m Actually Doing in My Teams
A few things I’ve changed based on thinking about this:
Measure outcomes, not activity. We’ve gotten stricter about tracking lead time (time from feature start to production) and deployment frequency rather than developer sentiment or lines of code. These are the metrics that actually correlate with business value.
Match tool to task type. We explicitly guide engineers on when AI assistance is likely helpful (new language/framework, boilerplate-heavy tasks, test case generation) vs. when it’s likely a net negative (deep debugging in legacy code, architecture decisions in a complex system, code review). This sounds obvious but it requires active culture-building — left to their own devices, engineers tend to use the tool everywhere because it feels good.
Slow down on AI-generated code review. We’ve added a lightweight norm: when a PR contains substantial AI-generated code, the author writes a short note explaining what they did to verify the logic — not just that tests pass, but that they actually understand and stand behind the implementation. This has surfaced a surprising number of “actually I’m not sure this is right” moments that would previously have slipped through.
Don’t penalize honesty. The perception gap is partly a social problem. Engineers may report that AI tools make them faster because that’s what they think leadership wants to hear, or because admitting “I spent two hours debugging AI output” feels like admitting failure. Making it safe to be honest about when the tools aren’t helping is part of building good measurement.
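For the first practice above, the two outcome metrics can be computed from very little data. This is a minimal sketch, assuming you can export a start timestamp and a production deploy timestamp per feature. The `Feature` record, the function names, and the sample dates are all hypothetical, and counting one deployment per feature is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

SECONDS_PER_DAY = 24 * 3600


@dataclass(frozen=True)
class Feature:
    name: str
    started_at: datetime   # when work on the feature began
    deployed_at: datetime  # when the feature reached production


def median_lead_time_days(features: list[Feature]) -> float:
    """Median time from feature start to production, in days."""
    return median(
        (f.deployed_at - f.started_at).total_seconds() / SECONDS_PER_DAY
        for f in features
    )


def deployments_per_week(
    features: list[Feature], window_start: datetime, window_end: datetime
) -> float:
    """Production deployments per week within [window_start, window_end)."""
    deployed = [f for f in features if window_start <= f.deployed_at < window_end]
    weeks = (window_end - window_start).total_seconds() / (7 * SECONDS_PER_DAY)
    return len(deployed) / weeks


# Made-up sample data covering a four-week window.
features = [
    Feature("export-csv", datetime(2025, 1, 6), datetime(2025, 1, 10)),
    Feature("sso-login", datetime(2025, 1, 8), datetime(2025, 1, 20)),
    Feature("audit-log", datetime(2025, 1, 13), datetime(2025, 1, 27)),
]
window = (datetime(2025, 1, 6), datetime(2025, 2, 3))

print("median lead time (days):", median_lead_time_days(features))
print("deployments per week:", deployments_per_week(features, *window))
```

Tracking these two numbers before and after a tooling change gives a comparison that does not depend on how fast anyone feels.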
The Broader Context
I want to be careful not to overclaim. The METR study is one data point: 16 developers, working on tasks from a specific type of large open-source project. There’s genuine evidence that AI tools help with certain tasks for certain developers. GitHub’s own Copilot study (a randomized trial published in 2022) found that developers completed a small, self-contained task, implementing an HTTP server in JavaScript, roughly 55% faster with Copilot. The picture likely varies by task complexity, codebase age, developer experience level, and how well the developer has learned to work with the tool.
But the METR study matters precisely because it tested the conditions that are hardest and most realistic: experienced engineers, on large production codebases, doing real work. This is not a toy problem. And in those conditions, the most honest read of the evidence is that we don’t yet know how to get AI tools to reliably improve productivity, and we may be systematically overestimating the gains.
That’s not a reason to stop using the tools. It’s a reason to stop assuming the tools are working and start measuring whether they actually are.
The best engineers I know are currently doing something interesting: they’re treating AI as a junior collaborator they have to supervise, not a productivity multiplier they can rely on. They’re faster in some ways (the tool handles a lot of low-level lookup and boilerplate), slower in others (supervision has a cost), and thoughtful about which mode they’re in. That’s probably the right posture for now.