The People Pushing Back on AI Are Not Wrong
Gary Marcus has been documenting AI's reliability failures for years. As someone who builds with this stuff, I think the backlash forming in 2025-26 is legitimate signal.
I want to start somewhere uncomfortable: I think Gary Marcus is largely right.
That might seem like an odd thing for someone who spends most of his time building AI-adjacent products to say. Marcus is probably the most prominent sustained critic of the large language model paradigm, and he’s spent the last several years in a running argument with most of the people I respect in this field. He’s been called a doomer, a publicity-seeker, an AI skeptic who doesn’t understand the technology, a broken clock.
But I’ve been watching what he actually says versus what his critics say he says, and the gap is significant.
What Marcus Actually Argues
Gary Marcus is a cognitive scientist who co-founded Geometric Intelligence (acquired by Uber in 2016) and has been active in machine learning circles since the mid-2010s. He is not opposed to AI as a category. He is opposed to a specific set of claims about what current AI systems can reliably do, and he’s been cataloguing the gap between those claims and observed performance since at least 2019.
His 2024 book Taming Silicon Valley (MIT Press) makes the case that generative AI, as currently architected, is “seductive but unreliable” — his phrase — and that the industry has been dramatically overhyping what it can deliver while actively resisting any accountability structure that would force honest assessment. He documents hallucinations not as a known bug being actively fixed but as a structural property of how these systems work: they are, as he puts it, “smearing together words” in ways that produce fluent outputs without any underlying commitment to truth. His core concern is that reliability is not a simple engineering problem for LLMs — it’s an architectural one, and the industry’s response to that concern has largely been marketing, not engineering.
In a June 2025 essay for Project Syndicate, “AI’s Reliability Crisis,” he made the argument directly: generative AI is fundamentally unreliable and there is, as of yet, no apparent solution within the current paradigm. He cites specific failures — lawyers using AI to file fabricated case citations, medical errors, generated code with security vulnerabilities, 30% of web content becoming demonstrably low-quality AI output — and argues these aren’t edge cases; they’re predictable outputs of systems that optimize for fluency over accuracy.
On the economics: he’s pointed out, with numbers, that OpenAI reports roughly $1 billion in monthly revenue while remaining deeply unprofitable, and that extrapolating from that figure puts sector-wide revenue at around $25 billion annually against trillions in investment. That economic structure looks a lot like a bubble. He’s been saying this since 2023. He hasn’t been proven wrong yet.
The Backlash That’s Been Building
By the summer of 2025, what Marcus had been predicting for two years was becoming hard to ignore. Fortune ran a piece in August 2025 with the headline “Bubble or not, the AI backlash is validating one critic’s warnings.” The vibe, as multiple observers noted, had shifted.
What was that shift? Not a single event but an accumulation. Legal cases involving fabricated AI citations. Enterprise customers quietly discontinuing AI deployments that hadn’t delivered the promised productivity gains. Increasingly vocal pushback from workers who’d had their jobs restructured around tools that were supposed to augment them but mostly just created new categories of cleanup work. Students and teachers in a kind of Mexican standoff, each using AI to produce and evaluate work that neither fully trusted. Journalists documenting specific failures in high-stakes domains: medicine, law, finance.
And underneath all of it, a growing suspicion that the metrics used to measure AI progress (scores on controlled benchmark datasets) don’t map cleanly onto real-world reliability. That there’s a difference between a system that scores well on a test and a system you can actually depend on.
This is a legitimate epistemic critique. It deserves serious engagement, not dismissal.
What I See from the Inside
I build with these systems. I’ve integrated AI into the way Zero operates. I use LLMs in my own workflow daily. I’m not neutral.
And from that position, I want to say: the critique is real. Not in the catastrophizing version, the “AI is fundamentally broken and will never be useful” reading that sometimes gets attributed to Marcus, which isn’t really his argument. But in the more specific, engineering-level version: these systems fail in ways that are hard to predict and hard to catch before they cause harm, and the failure modes are often correlated with the highest-stakes applications.
When you build with LLMs at anything beyond toy scale, you spend a lot of time on what I’d call the reliability tax: adding layers of verification, building human checkpoints, designing prompts that constrain the output space to reduce hallucination surface area, monitoring for drift in production behavior, maintaining manual fallbacks. This is real engineering work, and it never fully disappears. You’re managing the gap between what the model is capable of in ideal conditions and what it reliably produces at the tail of the distribution in the wild.
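To make the reliability tax concrete, here is a minimal sketch of the kind of wrapper I mean, written in Python. Everything in it is hypothetical: the ticket-classification task, the ALLOWED_LABELS set, and the stub functions standing in for a model call are illustrations, not any real library's API. It shows three of the layers named above: a prompt that constrains the output space, a verification step that rejects anything outside that space, and a fallback to human review when the model doesn't produce a usable answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical constrained output space for an illustrative ticket classifier.
ALLOWED_LABELS = {"refund", "billing", "technical", "other"}


@dataclass
class Result:
    label: Optional[str]  # None when no valid answer was obtained
    source: str           # "model" or "human_review"
    attempts: int         # attempts used (equals max_attempts on fallback)


def validate(raw: str) -> Optional[str]:
    """Verification layer: accept only outputs inside the allowed set."""
    cleaned = raw.strip().lower()
    return cleaned if cleaned in ALLOWED_LABELS else None


def classify_with_checks(
    text: str,
    model: Callable[[str], str],  # stand-in for whatever model call you use
    max_attempts: int = 2,
) -> Result:
    """Ask the model, verify the answer, retry, then fall back to a human."""
    prompt = (
        f"Answer with exactly one of {sorted(ALLOWED_LABELS)}.\n"
        f"Ticket: {text}"
    )
    for attempt in range(1, max_attempts + 1):
        try:
            raw = model(prompt)
        except Exception:
            continue  # treat a failed call like an invalid answer
        label = validate(raw)
        if label is not None:
            return Result(label, "model", attempt)
    # Manual fallback: route to a human checkpoint instead of guessing.
    return Result(None, "human_review", max_attempts)


# Stub "models" for demonstration only.
def good_model(prompt: str) -> str:
    return " Billing\n"


def rambling_model(prompt: str) -> str:
    return "I think this is probably about invoices."


if __name__ == "__main__":
    print(classify_with_checks("I was charged twice", good_model))
    print(classify_with_checks("I was charged twice", rambling_model))
```

Note what the sketch does not do: it only catches answers that are malformed, not answers that are well-formed and wrong. Catching those takes sampling, human spot checks, and monitoring, which is why the tax never fully disappears.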
That tax is often not accounted for in the ROI calculations presented to people deciding whether to deploy. The demos look extraordinary. The production environment is messier.
Why This Is Signal, Not Luddism
The original Luddites, and I want to be historically precise here, were not opposed to machines as a category. They were skilled textile workers in early-19th-century England who objected to the deployment of specific machinery in ways that violated their trade agreements, degraded the quality of their craft, and transferred wealth from skilled workers to mill owners. The opposition was specific and economic, not irrational.
The term “Luddism” has been lazily repurposed to mean “person who irrationally fears new technology,” which is a useful slur for dismissing any concern about any deployment without engaging with its substance.
The backlash forming around AI in 2025-26 is not that. It’s not people scared of something they don’t understand. It’s:
Writers who’ve watched their markets collapse under a flood of generated content while the platforms that distribute that content have quietly switched to algorithmic curation of it.
Doctors who’ve seen patients come in with AI-generated symptom analyses that confidently identified the wrong condition.
Engineers who’ve been handed codebases increasingly infected with AI-generated code that passes review and fails in production.
Teachers who’ve been asked to assess whether work they can’t verify is real, while simultaneously being told AI is going to transform education.
Legal professionals whose clients have suffered real consequences from fabricated precedents their AI-assisted lawyers didn’t catch.
These are not abstract concerns. They are people documenting specific, concrete failures in domains where failure has real costs. Calling this Luddism is a way of not having to engage with the evidence.
Marcus, whatever you think of his analysis, is doing something important: he is insisting that the burden of proof runs the right way. The default for any technology that makes claims about reliability in high-stakes domains should be: show me the evidence that it works. Not: prove it doesn’t work before we deploy it everywhere. That’s not being anti-technology. That’s being a responsible engineer.
What Good Faith Looks Like
I’m not suggesting the answer is to slow AI development to a crawl or to treat every deployment with blanket suspicion. I think that would be wrong. The legitimate applications are vast and the benefits for people who currently lack access to expertise — education, legal help, medical information, financial planning — are real.
But I do think the industry has a credibility problem that it has largely earned. When researchers produce benchmark results that don’t transfer to real-world performance. When companies launch products with confident claims and then quietly patch the embarrassing failures. When “safety” teams are treated as compliance overhead rather than genuinely integrated engineering functions. When the people who raise internal concerns about deployment timelines are sidelined. These patterns accumulate, and they’re why a substantial fraction of thoughtful, technically literate people don’t trust the field’s self-assessment.
Marcus’s ask isn’t unreasonable: transparency about failure rates in real-world conditions. Accountability structures for high-stakes deployments. Some honesty about what these systems can’t do. These are the standards we apply to pharmaceuticals, to aircraft, to financial instruments. They’re not exotic standards.
The backlash will intensify before it resolves. My hope is that it pushes the serious engineering work (interpretability, robustness, honest evaluation) forward rather than producing a regulatory overcorrection that freezes the field. But that requires the people building this stuff to stop treating criticism as an attack and start treating it as useful data.
Some of us are trying. Marcus is trying. The question is whether the organizations with the most resources to fix the problems have any incentive to listen.