Anshad Ameenza.
Engineering · Updated: Jun 25, 2026

Automated Development: Soon You Won't Write Code, You'll Build the Machine That Does

The job is shifting from writing code to building the pipeline that writes and ships it. A field guide to automated development: the history, the mechanism, and what comes next.


A bug gets filed at 9:14 in the morning. Refunds above a certain amount round the wrong way, and finance noticed before the customers did. By 9:20 there is a branch with a one-line fix, a regression test that fails on the old code and passes on the new, and a short note explaining the root cause. A reviewer leaves two comments. They are answered in seconds, not because someone was hovering at the keyboard, but because no one was. A person reads the final diff, agrees, and clicks merge. Keyboard time spent writing the fix: zero.

This is not a staged demo. It is becoming a normal Tuesday. And the surprising part is not that a machine wrote the code. We have been inching toward that for fifteen years. The surprising part is what it does to the engineer standing next to it. If the software writes the software, what exactly are you there for?

Here is the answer, and it is older than it looks.

You stopped being the author. You became the compiler engineer.

In 1957 a small team at IBM shipped Fortran, the first compiler most working programmers ever trusted. Before that, serious code was hand-written assembly, and the prevailing belief was that no automatic translator could match a careful human at choosing instructions and squeezing registers. The Fortran team spent most of their effort not on the language but on proving the generated code was fast enough that a professional would accept it. They won that argument. Within a decade, hand-writing assembly for general work went from the job to a specialty.

Notice what happened to the programmer. The job did not disappear. It moved up a level. You stopped writing the machine’s output and started writing your intent, then you trusted a piece of software to translate intent into output. Later, when the translation got good, the interesting work moved again: into the compiler itself, into the people who make the translation smarter. Register allocation and instruction scheduling, the very things humans were sure they did better, are now things compilers do better than almost anyone.

Automated development is that same move, one rung higher. We are getting a new compiler. Its input is not C or Rust. Its input is a goal, an issue, a half-formed intent in plain language. Its output is not an executable. Its output is a shipped change in production, reviewed and verified. The whole apparatus that turns “fix the refund rounding bug” into a deployed, monitored fix is the compiler. And the job that matters most is no longer feeding it tokens by hand. It is building and tuning the compiler.

This reframing is the whole point, so sit with it before the mechanics. Your output is no longer features. Your output is the machine that produces features. A feature you ship by hand is, in this view, a small failure of the machine, a car you had to pull off the line and assemble yourself because the line could not yet do it. Sometimes you have to. The goal is to have to less often.

The oldest move there is: coordination, one level up

The compiler analogy explains the mechanics. There is a deeper pattern underneath it, and it is the real reason I think this is inevitable rather than just trendy.

I argued in a longer essay that evolution’s true headline is not competition but coordination. Every giant leap, from the first cell to your own body to an ant colony, came from the opposite of cut-throat struggle: smaller units giving up some independence to merge into something larger, until cooperation turned out to be competition moved up a level. That essay is The Great Mergers, and automated development is a clean implementation of its thesis, running in software, on a timescale of years instead of eons.

Look at the pipeline the way a biologist would. Triage, specification, implementation, review, verification, monitoring: these are specialized cells. Not one of them is a whole engineer, and each is worse on its own than the generalist whose slice it took. Bound together with defined hand-offs, though, they become a higher-level individual, a thing that ships software. Selection stops acting on the parts and starts acting on the organism. A mitochondrion cannot survive outside your cells and has no need to; the implementation agent does not need to be a great standalone engineer, it needs to be a great cell. You, meanwhile, move up to tending the whole organism rather than being one of its hands.

A software factory is a major evolutionary transition you can watch happen in a year: specialized parts merging into a higher-level individual, with selection moving to the whole.

The deeper frame

Hold this lens for the rest of the post, because the major transitions have a shape, and once you see it you can see where this goes. The history of life already ran this experiment, many times, and it tells you what happens to the parts, to autonomy, and to the things that refuse to merge.

How we got here: three eras in fifteen years

The path to this point is short and easy to mark, because each era changed what the AI was allowed to touch.

Figure: Autocomplete (finishes your line; unit: a token), then interactive agents (does a task, you steer; unit: a task), then automated development (owns the whole loop; unit: a change).
Each era handed the machine a larger unit of work: a token, then a task, then the whole loop.

The autocomplete era gave the machine a token. GitHub previewed Copilot in 2021, built on OpenAI’s Codex, and the pitch was a better keyboard: it finished your line, your function, the boilerplate you had typed a thousand times. You were still the author. The AI just typed faster than you in the boring parts. Useful, but it never left your side, and it never owned anything end to end.

The interactive era gave the machine a task. Once ChatGPT landed in late 2022 and tools like Cursor, Aider, and later Claude Code matured over the following years, you could hand over a whole unit of work. “Add pagination to this endpoint.” The agent reads files, writes across several of them, runs the tests, and shows you the result. In early 2024, Cognition pitched Devin as the first AI software engineer, and the framing stuck: not a typist, a teammate. But you were still in the loop on every task, watching, correcting, approving. The agent was an extremely eager intern who could not be trusted to ship alone.

The automated era gives the machine a change. This is the shift happening now. Instead of one agent doing one task while you watch, a set of agents runs a full pipeline: triage, specification, implementation, review, verification, and monitoring, with humans pulled in only where the machine is not yet good enough. The unit of work is no longer a token or a task. It is an entire change, from the moment an issue is filed to the moment it is live and being watched in production.

The work has moved from writing the output, to delegating the task, to designing the system that owns the whole loop.

The thesis

The reason this era is different in kind, not just degree, is the loop. Autocomplete and interactive agents are open at both ends: a human starts them and a human finishes them. Automated development closes the loop. The output of monitoring becomes the input to triage. The system can, in principle, run without a human touching it, which is exactly why it is worth building carefully.

The mechanism: what the pipeline actually does

Abstractions are easy to nod along to and hard to act on, so here is the concrete machine. It is a small assembly line of specialized agents, each with one job, with defined hand-offs and human gates where trust is not yet earned. The shape below is what most serious attempts converge on.

Triage: understand and reproduce

A triage agent reads the new issue, tries to reproduce it, and decides the path. If the change is well-scoped and low-risk, it goes straight to implementation. If it is ambiguous or large, it needs a written spec first. If it is genuinely unclear, it asks a human a sharp question, or parks the issue with a note on what would unblock it. Most of the quality of the whole pipeline is decided here, because a good triage decision prevents wasted work downstream.
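The routing decision described above can be pictured as a small function. This is a minimal sketch, not a description of any particular tool: the `Issue` fields, the `Route` names, the size threshold, and the choice to treat "could not reproduce" as "genuinely unclear" are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    IMPLEMENT = "implement"    # well-scoped and low-risk: straight to implementation
    SPEC = "spec"              # ambiguous or large: needs a written spec first
    ASK_HUMAN = "ask_human"    # genuinely unclear: ask a sharp question or park it


@dataclass
class Issue:
    title: str
    reproduced: bool       # did the triage agent manage to reproduce the problem?
    estimated_files: int   # hypothetical size estimate produced during triage
    ambiguous: bool        # hypothetical flag: requirements open to interpretation


def triage(issue: Issue, max_small_change: int = 3) -> Route:
    """Pick a path for an issue. The threshold is an illustrative assumption."""
    if not issue.reproduced:
        # Assumption: an issue nobody can reproduce counts as "genuinely unclear".
        return Route.ASK_HUMAN
    if issue.ambiguous or issue.estimated_files > max_small_change:
        return Route.SPEC
    return Route.IMPLEMENT


print(triage(Issue("Refund rounding", reproduced=True, estimated_files=1, ambiguous=False)))
```

In a real pipeline the three inputs would come from a model and a sandbox rather than hand-set fields, but the point stands: the decision is cheap, and everything downstream inherits it.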

Spec: pin down the ambiguous and the large

For anything non-trivial, a spec agent writes down what “done” means: the intended behavior, the edge cases, the files likely to change, the test that would prove it works. A human reviews the spec, not the code. Reviewing intent is faster than reviewing a diff, and catching a wrong assumption here costs minutes instead of a wasted implementation cycle.
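One way to make "what done means" concrete is to give the spec a fixed shape and refuse to send it for human review until every part is filled in. The sketch below assumes a hypothetical `Spec` record with the four elements listed above; the field names and the example text are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    intended_behavior: str
    edge_cases: list[str] = field(default_factory=list)
    likely_files: list[str] = field(default_factory=list)
    proving_test: str = ""   # description of the test that would prove it works

    def missing_fields(self) -> list[str]:
        """Return the names of the parts a reviewer would expect but cannot find."""
        missing = []
        if not self.intended_behavior.strip():
            missing.append("intended_behavior")
        if not self.edge_cases:
            missing.append("edge_cases")
        if not self.likely_files:
            missing.append("likely_files")
        if not self.proving_test.strip():
            missing.append("proving_test")
        return missing


# Hypothetical draft: the behavior is stated, everything else is still open.
draft = Spec(intended_behavior="Refunds round to the cent in the documented direction")
print(draft.missing_fields())
```

A check like this does not judge whether the spec is right, only whether it is complete enough for a person to judge it.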

Implement: write the change

An implementation agent turns the spec into a real diff: code, tests, and a short writeup of what it did and why. This is the part everyone pictures when they imagine AI writing software, and it is, increasingly, the least interesting part. Generation is close to solved for well-specified work. The hard parts sit on either side of it.

Review: a second machine reads the diff

A separate review agent critiques the change with fresh context: style, edge cases, security smells, whether it actually matches the spec. Using a different agent for review matters, because the one that wrote the code is the worst judge of it. This is generate-and-critique, the oldest trick for making models more reliable, wired into the line.
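Generate-and-critique reduces to a short loop once the two agents are treated as plain functions. The sketch below is one possible shape, with assumptions stated up front: `generate` and `critique` are hypothetical callables standing in for the implementation and review agents, and the round limit is arbitrary.

```python
from typing import Callable, Optional

Generator = Callable[[str, list[str]], str]   # (spec, feedback so far) -> diff
Critic = Callable[[str, str], list[str]]      # (spec, diff) -> list of objections


def generate_and_critique(spec: str, generate: Generator, critique: Critic,
                          max_rounds: int = 3) -> Optional[str]:
    """Return a diff the critic accepts, or None if the rounds run out."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        diff = generate(spec, feedback)
        objections = critique(spec, diff)
        if not objections:
            return diff
        # Carry every objection forward so the next attempt sees the full history.
        feedback = feedback + objections
    return None
```

Returning `None` rather than the last rejected diff is a deliberate choice in this sketch: a change the reviewer never accepted should escalate to a human, not slip through because the loop got tired.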

Verify: prove it works, do not assume it

A verification agent runs the tests and, increasingly, drives the actual application with computer use: it opens the app, clicks through the flow, and confirms the refund now rounds correctly on screen, not just in a unit test. Verification is where trust is won or lost, and it is the stage most teams underbuild.
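The strongest cheap form of evidence is the one from the opening story: a regression test that fails on the old code and passes on the new. The sketch below shows that check for the refund example. The post does not say which way the refunds rounded, so the two rounding modes here are purely an assumption for illustration, as are all the function names.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP
from typing import Callable

CENT = Decimal("0.01")
Rounder = Callable[[Decimal], Decimal]


def round_refund_old(amount: Decimal) -> Decimal:
    # Assumed bug: ties round to the nearest even cent.
    return amount.quantize(CENT, rounding=ROUND_HALF_EVEN)


def round_refund_new(amount: Decimal) -> Decimal:
    # Assumed intent: ties round up to the next cent.
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def regression_check(round_refund: Rounder) -> bool:
    # 1000.125 is an exact tie, so the two rounding modes disagree on it.
    return round_refund(Decimal("1000.125")) == Decimal("1000.13")


def verifies_fix(old: Rounder, new: Rounder, check: Callable[[Rounder], bool]) -> bool:
    """A fix counts as verified only if the check fails on old and passes on new."""
    return (not check(old)) and check(new)


print(verifies_fix(round_refund_old, round_refund_new, regression_check))  # True
```

Requiring the failure on the old code matters as much as the pass on the new code: a test that passes on both proves nothing about the fix.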

Human gate: review code and evidence

A person reviews the diff together with the verification output and decides: ship, or bounce it back to spec, implementation, review, or verification. This is the gate that shrinks over time. Early on you stop here constantly. The goal is to need this less, by making the earlier stages good enough that the answer is almost always yes.

Ship and monitor: close the loop

The change goes through CI/CD and deploys. Then a monitoring agent watches production, and when it detects a regression or an anomaly, it files a new issue, which lands back at triage. The line feeds itself. That feedback edge is what turns a clever set of scripts into a system.
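The feedback edge is easy to show in miniature: monitoring output goes back onto the same queue that triage reads from. In the sketch below, `ship` stands in for every stage from triage through deploy and `monitor` for the monitoring agent; both are hypothetical callables, and the `max_changes` budget is an assumption added so a self-feeding loop cannot run forever.

```python
from collections import deque
from typing import Callable, Iterable


def run_loop(initial_issues: Iterable[str],
             ship: Callable[[str], None],
             monitor: Callable[[str], list[str]],
             max_changes: int = 10) -> list[str]:
    """Process issues until the queue is empty or the change budget is spent."""
    queue = deque(initial_issues)
    shipped: list[str] = []
    while queue and len(shipped) < max_changes:
        issue = queue.popleft()
        ship(issue)                    # triage -> spec -> implement -> ... -> deploy
        shipped.append(issue)
        queue.extend(monitor(issue))   # feedback edge: new issues land back at triage
    return shipped
```

Without the `queue.extend` line this is a batch script. With it, the output of one run is the input of the next, which is the difference the paragraph above is pointing at.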

If you have built anything with agent loops, this will look familiar, because it is the same outer-loop pattern applied to the whole software lifecycle rather than a single task. I have written before about building an agent loop you can actually trust and about moving from coding to conducting a team of agents; the pipeline here is those ideas grown up and pointed at production.

The uncomfortable part: productivity gets redefined

Here is where most people flinch, and they are right to.

If the machine ships changes, then “how many features did you ship this quarter” stops measuring you. In the old world, a prolific engineer was one who personally produced a lot. In this world, that same engineer is a bottleneck: a skilled person doing by hand what the line should do. The metric that matters flips to something like throughput per unit of cost. How much shipped product came out, divided by what it cost in compute and in human attention combined.

That second number is the one teams are about to feel in their budgets. For a decade, AI coding help has been treated as research and development, a fixed cost you wave through. That is ending. When software is produced by a metered pipeline burning tokens, its cost behaves like cost of goods sold: a variable that scales with output and shows up on the bottom line. Companies will start asking the blunt question they ask of any factory: if I spend another dollar here, do I get more than a dollar of value back? The era of unlimited token budgets for interactive agents is closing, and a discipline that looks a lot like cloud cost management is opening in its place. I dug into this tension, the bill that arrives after the easy wins, in the post on AI code quality and its hidden costs.
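The throughput-per-cost metric is simple arithmetic once you decide how to price human attention. The sketch below assumes attention is priced as hours times an hourly rate, which is one choice among several; the function names and every number in the example are hypothetical.

```python
def total_cost(compute_cost: float, human_hours: float, hourly_rate: float) -> float:
    """Compute spend plus human attention, priced as hours times an hourly rate."""
    return compute_cost + human_hours * hourly_rate


def throughput_per_cost(changes_shipped: int, compute_cost: float,
                        human_hours: float, hourly_rate: float) -> float:
    """Shipped changes per unit of combined cost."""
    cost = total_cost(compute_cost, human_hours, hourly_rate)
    if cost <= 0:
        raise ValueError("total cost must be positive")
    return changes_shipped / cost


def cost_per_change(changes_shipped: int, compute_cost: float,
                    human_hours: float, hourly_rate: float) -> float:
    """The same ratio inverted: what one shipped change costs on average."""
    if changes_shipped <= 0:
        raise ValueError("need at least one shipped change")
    return total_cost(compute_cost, human_hours, hourly_rate) / changes_shipped


# Hypothetical month: 40 changes, 500 in tokens, 20 review hours at 100 per hour.
print(cost_per_change(40, 500.0, 20.0, 100.0))  # 62.5
```

In this made-up month, human attention is four fifths of the bill, which is the usual early shape: the tokens are the visible cost, the review time is the large one.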

There is real pain in the transition, and it is worth naming honestly rather than selling around. In the short term you will review a lot of mediocre machine output. You will watch agents fail in dumb ways. It will sometimes feel slower than just writing the thing yourself, because for that one task, it is. The case for going through the pain anyway is simple: the task you do by hand teaches you nothing reusable, while the failure you fix in the pipeline makes every future task of that shape automatic. You are trading a fast win now for a compounding one later.

This is one bet, not the only one

I have described the pipeline as if it is the obvious endgame. It is the leading approach today, but honesty requires saying clearly: it is a bet, and there are serious people betting differently. Anyone who tells you the architecture is settled is selling something.

The monolith bet says the scaffolding is temporary. In 2019 Rich Sutton wrote an essay called The Bitter Lesson, arguing that across the history of AI, general methods that scale with compute keep beating clever hand-built structure. Apply that here and the multi-stage pipeline, with its carefully separated triage and spec and review agents, starts to look like scaffolding around a model that is not yet strong enough. The bet is that a single, much more capable generalist agent eventually swallows most of these stages, and the elaborate hand-wired factories age badly. I think this is the most important counter-argument, and the right response is to build pipelines that are easy to thin out, not cathedrals you will be unable to tear down.

The specification bet says English and tests become the source code. In this view the interesting artifact is not the pipeline but the spec. You write behavior precisely enough, in language and in tests, and the system’s only job is to satisfy it, the way a compiler satisfies a language standard. Programming becomes specification, and the skill that matters is stating exactly what you want, which has always been the genuinely hard part anyway.

The swarm bet says throughput beats elegance. Instead of one careful pass through tidy stages, you generate many candidate solutions cheaply and lean hard on a verifier to pick the best. If verification is strong, you do not need an elegant process; you need volume and a good judge. This is generate-and-test scaled up, and it gets more attractive every time inference gets cheaper.
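Stripped down, the swarm bet is best-of-n selection against a verifier. A minimal sketch, assuming candidates are already generated and that `verify` is a hypothetical function returning a score where higher is better; the `min_score` floor is also an assumption:

```python
from typing import Callable, Optional, Sequence


def best_of_n(candidates: Sequence[str],
              verify: Callable[[str], float],
              min_score: float = 1.0) -> Optional[str]:
    """Return the highest-scoring candidate that clears min_score, else None."""
    scored = [(verify(c), c) for c in candidates]
    passing = [(score, c) for score, c in scored if score >= min_score]
    if not passing:
        return None   # volume did not help: nothing was good enough to ship
    return max(passing, key=lambda pair: pair[0])[1]
```

All of the judgment lives in `verify`. With a weak verifier, this picks the most convincing wrong answer out of many, which is why the bet only pays when verification is strong.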

The amplification bet says keep the human in the chair. A serious camp argues that full autonomy is the wrong target, that the quality and trust ceiling is reached with humans central and agents as power tools. On this view the win is a ten-times-better engineer, not a removed one, and the teams chasing zero human involvement are optimizing a metric that quietly destroys the craft and the accountability.

My own read: these are not mutually exclusive, and the pipeline is the right thing to build first because it forces you to instrument every stage. Even if a stronger model later absorbs three of your six stages, the evals and verification you built to trust them are exactly what you keep. Build the factory partly so you learn precisely where the machine fails.

The tools making it happen

The lineup changes monthly, so treat these as examples of each role, not endorsements, and expect the names to shuffle. What is stable is the shape: one tool per stage of the line.

Inline and autocomplete. GitHub Copilot, Cursor’s tab completion, and Windsurf still own the keystroke-level layer. This is the most mature and least interesting frontier now, table stakes rather than edge.

Interactive build agents. Cursor, Claude Code, OpenAI’s Codex in its CLI and cloud forms, Aider, Cline, GitHub Copilot’s agent mode, and Google’s Jules are where most teams currently live: hand an agent a task, watch it work across files and run tests. This is the interactive era in full bloom, and for many teams it is still the daily driver.

Orchestration and harnesses. Turning single agents into a pipeline is the job of frameworks like LangGraph, CrewAI, and Microsoft’s AutoGen, agent SDKs from the model labs, and durable-execution engines like Temporal for the workflow plumbing that has to survive restarts and retries. This layer is young and where a lot of the real engineering now happens.

Review. Dedicated review agents such as CodeRabbit, Greptile, and Graphite’s review tooling, alongside the review modes built into the coding assistants, handle the second-reader stage. The differentiator is context: a reviewer that understands the whole repo catches what a diff-only reviewer cannot.

Verification. Playwright for driving browsers, the computer-use abilities from Anthropic and OpenAI for clicking through real apps, and benchmarks like SWE-bench as a yardstick for how well agents resolve real issues. This stage is the current frontier and the one most worth investing in, because it is where trust comes from.

Monitoring and the loop. Observability platforms like Sentry and Datadog, increasingly with AI triage that turns an alert into a filed, reproducible issue, are what close the circle and feed the next run of the pipeline.

Evals and self-improvement. LangSmith, Braintrust, and home-grown eval harnesses are how you measure whether a prompt or a model or a harness change actually made the line better. In a world where the pipeline is the product, your evals are the most valuable code you own.
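A home-grown eval harness can start very small: run the old and the new version of a stage over the same fixed cases and compare pass rates. The sketch below assumes exact-match scoring on string outputs, which is the crudest possible grader, and every name in it is hypothetical rather than taken from any of the tools above.

```python
from typing import Callable, Sequence

Case = tuple[str, str]   # (input, expected output)


def pass_rate(system: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Fraction of cases where the system's output matches the expectation exactly."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for given, expected in cases if system(given) == expected)
    return passed / len(cases)


def compare(baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            cases: Sequence[Case]) -> float:
    """Positive means the candidate passes more of the same cases than the baseline."""
    return pass_rate(candidate, cases) - pass_rate(baseline, cases)
```

The important property is that both versions see the same cases. A prompt change judged on a fresh set of examples tells you about the examples, not the change.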

What happens next

Predictions are cheap, so let me make specific ones you can hold me to.

Software production moves onto the books as a variable cost. Within a year or two, “cost per shipped change” becomes a tracked number in engineering-heavy companies, and a role that looks like cloud cost management appears for agent spend. Unlimited token budgets get replaced by metered ones with a return-on-investment expectation attached.

The headline engineering metric flips. Teams stop celebrating output and start reporting the share of changes that ship with no human edit, and at what cost. The uncomfortable corollary: a person hand-writing a routine change starts to read as a process failure, not a heroic act.

The pipeline starts improving itself, slowly. The highest-leverage agents will not be the ones writing features. They will be the ones watching the pipeline, finding the exact stage where a human had to step in, and rewriting that stage so the next similar case does not need a human. This is recursive, it starts clumsy, and it is the whole game. The line that improves its own line is the destination.
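Both the no-human-edit share and the "exact stage where a human had to step in" fall out of one log, if the pipeline records it. A rough sketch, assuming each shipped change carries either `None` (no human edit) or the name of the stage where a person intervened; the record format and stage names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def autonomy_report(interventions: Sequence[Optional[str]]) -> tuple[float, Optional[str]]:
    """Return (share of changes with no human edit, stage with the most interventions).

    interventions[i] is the stage where a human stepped in on change i, or None.
    """
    if not interventions:
        raise ValueError("need at least one shipped change")
    touched = [stage for stage in interventions if stage is not None]
    share = 1 - len(touched) / len(interventions)
    worst = Counter(touched).most_common(1)[0][0] if touched else None
    return share, worst


# Hypothetical log of eight shipped changes.
log = [None, "verify", None, "verify", "spec", None, None, None]
print(autonomy_report(log))  # (0.625, 'verify')
```

The second value is the to-do list: in this made-up log, verification is where people keep stepping in, so that is the stage to rebuild first.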

Verification, not generation, becomes the moat. Writing code is nearly commoditized; trusting it is not. The teams that win will be the ones with the best verification and evals, because that is what lets them safely shrink the human gate. Model choice will matter less than the harness around it.

The org chart thins and changes shape. Expect fewer engineers defined by the features they personally build and more defined by the part of the factory they own. New titles that mean “I improve the machine that builds the product” rather than “I build the product.” Small teams will ship like large ones, because leverage compounds when the marginal change costs tokens instead of weeks.

And the scaffolding will partly collapse. As base models get stronger, some of today’s elaborate pipelines will look over-engineered, and the teams that built rigid, hard-to-change factories will spend a painful season tearing them down. The ones who treated the pipeline as a learning instrument, instrumented and easy to thin, will simply remove the stages the model outgrew and keep the evals. Build for that future on purpose.

Selection moves to the whole, and the parts lose autonomy. This is the prediction the evolutionary lens makes that a roadmap cannot, and it is the one I am most confident about. Every major transition ends the same way: once units merge, the higher level becomes the thing that lives or dies, and the parts can no longer go it alone. Expect exactly that here. The unit that succeeds or fails in the market becomes the factory, not the feature and not the individual engineer. Hand-coding a routine change will come to feel like a cell trying to live outside the body, possible for a while and pointless in the end. Read that as loss if you identify with the part. Read it as leverage if you identify with the whole. The transitions also tell you who gets left behind: the units that refuse to specialize and merge keep their independence and their ceiling, and they are outcompeted by the organisms that gave theirs up. The same fate waits for teams that insist every change pass through a human’s hands as a point of pride.

None of this means the craft dies. It means the craft moves, the way it moved when we stopped writing assembly and started trusting compilers. The most interesting problem in software was never typing the code. It was designing the system in which good software gets built and shipped reliably. For the first time, that system is something you can actually build, end to end, and watch run. That is not the end of engineering. It is the most leveraged version of it we have ever had.
