Anshad Ameenza.
Engineering

Build a Claude Code Agent Team That Loops Until the Work Is Actually Done

A builder+checker agent team in Claude Code that cycles through build → check → fix until all tests pass, with hard stop rules to prevent token burn.


There’s a moment I’ve hit more times than I care to count. You give an AI coding agent a task — add rate limiting to the login route, fix the N+1 query in the dashboard loader, migrate the auth module to the new provider — and it comes back with something that looks plausible. The code is idiomatic. The diff makes sense on first read. You merge it, deploy it, and two days later a test in CI catches something you didn’t expect. Or worse, nothing catches it and you find out in production.

The problem isn’t that the agent wrote bad code. The problem is structural: the agent was checking its own work. It ran the tests mentally, or ran them once and reported the output, and nothing in the loop forced it to prove that things were actually green before calling it done.

I’ve been building systems at Zero, and before that across somewhere north of fifteen startup attempts in Bangalore, Kerala, Dubai, and a stretch in Vietnam. One lesson keeps coming back: you cannot trust a single actor to both produce and verify. Not humans, not agents. You need separation of concerns built into the process itself.

What I’m going to describe is a three-file setup in Claude Code that creates a looping team of agents: a builder that writes and fixes code, a checker that runs tests and reports failures verbatim, and a loop command that orchestrates them in cycles until everything is green — or until hard stop rules kick in and tell you something is wrong. I’ve had this running on real engineering tasks and it substantially changes what I feel comfortable letting an agent run unsupervised.


Why a Loop, and Why Two Agents

The single-agent approach breaks in a specific and predictable way: the agent optimizes to report success rather than to achieve success. This sounds harsh, but it’s just how these systems work under pressure. If an agent is asked to fix something and it’s running out of ideas, the path of least resistance is to find an interpretation of the task that it can call “done.” Weaken a test assertion. Skip a check. Mark a failing case as expected. None of this is malicious — it’s the agent pattern-matching on “what does a completed task look like” rather than “is the task actually complete.”

Separating builder from checker removes that escape hatch. The checker doesn’t know what the builder was trying to do. It doesn’t care. Its job is to run the tests in a fixed sequence and report exactly what came back. It has no Edit or Write tools, only Bash to run the checks. It can’t be cajoled into “adjusting” a test to see if maybe the failure is acceptable. It just runs and reports.

The loop is what makes this useful rather than just a two-step manual process. After each checker run, the loop either stops because everything passed, or it hands the failure output back to the builder and starts another cycle. The builder sees real failure messages — file and line numbers, actual assertion output, not a summary — and has to make them pass without changing what the tests are testing.

This is the same feedback loop you use in TDD, but running autonomously at agent speed.


The Three Files

File 1: The Agents

Two markdown files in .claude/agents/. Each is YAML frontmatter followed by the system prompt for that agent.

.claude/agents/builder.md

---
name: builder
description: Writes, edits, and fixes code to implement a feature or resolve test failures. Never modifies tests to make them pass. Invoked by the loop command with a task brief or a list of failures to fix.
model: sonnet
tools: Read, Write, Edit, Glob, Grep, Bash
---

You are a focused code implementation agent. You receive either an initial task brief or a set of test/lint failures from a previous checker run. Your job is to write or fix the implementation code — not the tests.

Hard rules you never break:
- You do not modify test files. If a test is failing, the implementation is wrong, not the test. The only exception is if the task brief explicitly asks you to add new tests.
- You do not skip, comment out, or weaken assertions to make a failure disappear.
- You do not mark tests as expected failures (pytest.mark.xfail, .skip(), xit(), etc.) unless the task brief explicitly told you to.
- When given a list of failures, fix them one by one, reading the relevant source files before writing. Do not guess at what needs to change.
- If you believe a test is wrong, note it in your response but do not change it. The human review step will handle it.
- When done, state what you changed and why, but do not run tests yourself. That is the checker's job.

.claude/agents/checker.md

---
name: checker
description: Runs tests, type checks, and linting in order. Reports failures verbatim. Never edits files. Invoked by the loop command after every builder run.
model: sonnet
tools: Bash
---

You are a verification agent. Your only job is to run the configured check sequence in order and report exactly what comes back.

Check sequence (run in this order, stop at first category failure):
1. Type checking: `npx tsc --noEmit` (or the project equivalent)
2. Linting: `npx eslint src/` (or project equivalent)
3. Unit tests: `npx jest` (or project equivalent)
4. Integration tests if present: `npm run test:integration`

Reporting rules:
- Report each failure as: `<file>:<line> — <what broke> — <which check>`
- Do not summarize or paraphrase error messages. Copy them verbatim.
- Do not suggest fixes. Do not editorialize. Just report.
- If all checks pass, output exactly: ALL GREEN
- If any check fails, output exactly: FAILED followed by the verbatim failure list.
- You never edit files. You have no Edit or Write tools. If something asks you to edit, refuse.

The model for both is sonnet. The builder needs to be capable enough to handle multi-file reasoning — Sonnet handles this well. The checker doesn’t need much intelligence; it’s mostly running shell commands and copying output. Haiku would work for the checker too if you want to save cost on long loops.
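The fixed report format is what makes the checker's output usable downstream. As a rough sketch of that idea (the function and type names here are hypothetical, not part of Claude Code or of the setup above), anything consuming the report can treat it as three cases: exactly ALL GREEN, FAILED followed by failure lines, or something else that should never count as success.

```typescript
// Hypothetical helper: names are illustrative, not part of Claude Code.
interface CheckerFailure {
  file: string;
  line: number;
  message: string;
  check: string;
}

type CheckerReport =
  | { status: "green" }
  | { status: "failed"; failures: CheckerFailure[]; raw: string }
  | { status: "malformed"; raw: string };

// Matches "<file>:<line> — <what broke> — <which check>" from the checker prompt.
const FAILURE_LINE = /^(.+?):(\d+) — (.+) — (.+)$/;

function parseCheckerReport(output: string): CheckerReport {
  const lines = output.trim().split("\n");
  const header = lines[0].trim();
  // Only an exact "ALL GREEN" counts as success; anything else is suspect.
  if (header === "ALL GREEN" && lines.length === 1) {
    return { status: "green" };
  }
  if (header !== "FAILED") {
    return { status: "malformed", raw: output };
  }
  const failures: CheckerFailure[] = [];
  for (const line of lines.slice(1)) {
    const match = FAILURE_LINE.exec(line.trim());
    // Continuation lines (assertion details, stack traces) stay in `raw` only.
    if (match !== null) {
      failures.push({
        file: match[1],
        line: Number(match[2]),
        message: match[3],
        check: match[4],
      });
    }
  }
  return { status: "failed", failures, raw: output };
}
```

The `raw` field is kept on purpose: the structured lines are handy for comparing cycles, but what goes back to the builder should still be the verbatim text.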


File 2: The Loop Command

.claude/commands/loop.md

This is the orchestrator. It lives in .claude/commands/ so you invoke it as /loop from inside Claude Code. It uses the Task tool to spawn the builder and checker as subagents.

---
model: opus
allowed-tools: Task, Read, Glob
---

You are the loop orchestrator. Your job is to coordinate the builder and checker agents in cycles until the work is done or a stop condition is met.

## On Invocation

You receive a task description from the user in $ARGUMENTS. Before dispatching anything:
1. Read the relevant files to understand the current state (use Read/Glob as needed)
2. Write a brief — a single focused paragraph — that describes exactly what the builder needs to do, what files are in scope, and what success looks like
3. Do not pad the brief. Agents work better with tight, specific instructions than vague long ones.

## The Cycle

Each cycle:
1. Dispatch the builder with the brief (on cycle 1) or the brief + failure list (on cycles 2+)
2. Dispatch the checker (always after the builder, never before)
3. Read the checker output:
   - If ALL GREEN: stop and report success, including the final checker output verbatim
   - If FAILED: check stop conditions (below), then send failures back to builder for next cycle

## Stop Conditions (check before each builder dispatch after cycle 1)

Stop and report failure if ANY of these are true:
- Cycle count reaches 5 (hard cap — do not continue past 5 cycles)
- The failure list is identical to the previous cycle's failure list (builder is guessing, not fixing)
- Any check that was passing in a previous cycle is now failing (a fix broke something else)

When you stop on a failure condition, say exactly which condition triggered and include the last checker output verbatim. Do not report success. Do not summarize.

## What You Never Do

- Never report ALL GREEN without including the checker's actual output
- Never skip the checker step and assume the builder's work is correct
- Never increase the cycle cap because you think the builder just needs one more try
- Never pass a summary of failures to the builder — pass the verbatim checker output

I use Opus for the orchestrator because it does the harder work: reading files to write a good brief, tracking cycle state, making the judgment call on stop conditions. Sonnet for the subagents keeps cost manageable across a five-cycle loop.
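The orchestrator is a prompt, not a program, but the stop conditions are simple enough to write down as code, which is a useful way to check that they don't contradict each other. A minimal sketch, assuming each cycle is recorded with its verbatim failure lines and the names of the checks that passed (`CycleResult`, `decideNextStep`, and the check names are hypothetical):

```typescript
// Illustrative model of the loop's decision step; the real orchestrator is a prompt.
interface CycleResult {
  failures: string[];          // verbatim failure lines; empty when green
  passedChecks: string[];      // checks that ran and passed, e.g. ["types", "lint"]
  failedCheck: string | null;  // first check that failed, or null when green
}

type LoopDecision =
  | { action: "success" }
  | { action: "continue" }
  | { action: "stop"; reason: "cycle-cap" | "identical-failures" | "regression" };

function sameList(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((line, i) => line === b[i]);
}

function decideNextStep(history: CycleResult[], maxCycles = 5): LoopDecision {
  const current = history[history.length - 1];
  if (current === undefined) return { action: "continue" }; // nothing has run yet
  if (current.failedCheck === null) return { action: "success" };

  // Hard cap: never dispatch a builder past the last allowed cycle.
  if (history.length >= maxCycles) return { action: "stop", reason: "cycle-cap" };

  const earlier = history.slice(0, -1);
  const previous = earlier[earlier.length - 1];

  // Same failures twice in a row: the builder is guessing, not fixing.
  if (previous !== undefined && sameList(previous.failures, current.failures)) {
    return { action: "stop", reason: "identical-failures" };
  }

  // A check that passed in an earlier cycle now fails: a fix broke something.
  const failed = current.failedCheck;
  if (earlier.some((cycle) => cycle.passedChecks.indexOf(failed) !== -1)) {
    return { action: "stop", reason: "regression" };
  }

  return { action: "continue" };
}
```

Note that this sketch detects regressions at the level of whole checks (lint, unit tests), not individual test cases. Catching a single test that flipped from passing to failing would need per-test results from the checker.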


File 3: Stop Rules in CLAUDE.md

The loop command has stop rules baked in, but CLAUDE.md is where you put the rules that apply project-wide, so the same limits hold outside the /loop command and are harder to sidestep by switching to a different command or agent.

Add this block to your project’s CLAUDE.md:

## Agent Loop Rules

These rules apply to all agent operations in this project and cannot be overridden by individual agents or commands:

**Mandatory stops:**
- Stop all loops at 5 cycles maximum. No exceptions. If the work isn't done in 5 cycles, report failure and explain where it stalled.
- Stop if the same failure appears in two consecutive checker outputs. This means the builder doesn't know how to fix it. Human review is needed.
- Stop if a check that was passing in cycle N fails in cycle N+1. A fix introduced a regression. Stop and report immediately.

**Honesty rules:**
- Never report success without the final checker's verbatim output in your response.
- "Looks like it should pass" is not a checker output. Run the checker.
- Never describe a test as "probably fine" or "should be okay." Run it and find out.

**Test integrity:**
- Builder agents do not modify test files. If the only path to green requires changing a test, stop the loop and flag this for human review with a specific explanation of which test and why.
- Do not use skip flags, xfail markers, or any mechanism that makes a test not run in order to make the suite pass.

The CLAUDE.md rules matter because they live at the project level, not inside the loop command. Even if you invoke a different command or an ad-hoc agent later, these constraints travel with the project. The no-test-modification rule especially: I’ve seen agents get creative about this when the checker keeps failing and they’ve run out of implementation ideas. Having it in CLAUDE.md makes it a standing project-level instruction rather than a line in one command’s prompt.
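The test-integrity rule is still an instruction, so it can help to back it with a mechanical check on what the builder actually changed. This is not part of the three-file setup; it is a sketch of one way to do it, assuming you can get the list of changed paths and added lines from your version control diff. The patterns and names below are hypothetical and would need adjusting to your project layout.

```typescript
// Hypothetical patterns; adjust to the project's layout and test framework.
const TEST_FILE_PATTERNS: RegExp[] = [
  /(^|\/)(tests?|__tests__)\//, // tests/, test/, __tests__/ directories
  /\.(test|spec)\.[jt]sx?$/,    // auth.test.ts, login.spec.js
  /(^|\/)test_[^/]+\.py$/,      // pytest-style test_login.py
  /_test\.go$/,                 // Go test files
];

const SKIP_MARKER_PATTERNS: RegExp[] = [
  /\.skip\(/,                   // it.skip(, describe.skip(, test.skip(
  /\bx(it|describe|test)\(/,    // xit(, xdescribe(, xtest(
  /pytest\.mark\.(xfail|skip)/, // pytest markers
];

interface IntegrityReport {
  touchedTestFiles: string[];
  skipMarkers: string[];
}

// Flags changes for human review; it does not decide whether they are legitimate
// (a brief may explicitly ask the builder to add new tests).
function checkTestIntegrity(changedFiles: string[], addedLines: string[]): IntegrityReport {
  return {
    touchedTestFiles: changedFiles.filter((file) =>
      TEST_FILE_PATTERNS.some((pattern) => pattern.test(file))
    ),
    skipMarkers: addedLines.filter((line) =>
      SKIP_MARKER_PATTERNS.some((pattern) => pattern.test(line))
    ),
  };
}
```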


A Real Example: Rate Limiting the Login Route

Let me walk through what a real run looks like. I used this setup when adding rate limiting to the login route in an Express API — a real task on a real codebase, not a toy.

I invoked /loop add rate limiting to the login route — max 5 attempts per 15 minutes per IP, return 429 with retry-after header, existing auth tests must still pass.
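To make the requirement in that task description concrete, here is a dependency-free sketch of the behavior being asked for: 5 attempts per 15-minute window per IP, then a 429 with a retry-after value. It uses a plain fixed-window counter, which is my simplification for illustration. It is not how express-rate-limit is implemented and not the code the builder produced, and `FixedWindowLimiter` is a hypothetical name.

```typescript
type LimitVerdict =
  | { allowed: true }
  | { allowed: false; status: 429; retryAfterSeconds: number };

class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `nowMs` is passed in rather than read from Date.now() so tests control the clock.
  attempt(ip: string, nowMs: number): LimitVerdict {
    const current = this.windows.get(ip);
    // First request from this IP, or the previous window has expired: start fresh.
    if (current === undefined || nowMs - current.start >= this.windowMs) {
      this.windows.set(ip, { start: nowMs, count: 1 });
      return { allowed: true };
    }
    if (current.count < this.limit) {
      current.count += 1;
      return { allowed: true };
    }
    // Over the limit: tell the client how long until the window resets.
    const retryAfterSeconds = Math.ceil((current.start + this.windowMs - nowMs) / 1000);
    return { allowed: false, status: 429, retryAfterSeconds };
  }
}
```

Passing the clock in explicitly is the design choice worth noticing: a limiter that reads the time itself is exactly the kind of thing that collides with a test suite that mocks Date.now.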

Cycle 1:

The orchestrator read src/routes/auth.ts, src/middleware/, and the test file tests/auth.test.ts. It wrote a brief that specified using express-rate-limit, targeting the /auth/login endpoint specifically (not the whole router), and noted that the existing tests mock Date.now which would interact with the rate limiter’s window calculation.

Builder runs. Installs express-rate-limit, wires it up, adds a test for the 429 response.

Checker runs. Output:

FAILED
tests/auth.test.ts:47 — Expected 200, received 429 — jest
  ● POST /auth/login › returns 200 on valid credentials
    expect(received).toBe(expected)
    Expected: 200
    Received: 429

The builder had applied the rate limiter at the router level, so the repeated login calls in the test’s beforeEach exhausted the limit and the valid-credentials request came back as a 429. The brief mentioned this risk but the builder hadn’t threaded the fix through correctly.

Cycle 2:

Orchestrator sends the failure verbatim to the builder. Builder reads the test, sees the beforeEach issue, enables the rate limiter only outside the test environment (checking process.env.NODE_ENV !== 'test'), and adds a test reset call.

Checker runs. Output:

FAILED
tests/auth.test.ts:89 — Expected header retry-after to be present — jest
  ● POST /auth/login › returns retry-after header on 429
    expect(received).toBeDefined()
    Received: undefined

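The builder had added the rate limiter, but the 429 response was coming back without a Retry-After header. express-rate-limit ties that header to its header options, so whether you get it depends on how the limiter is configured: you either set standardHeaders and legacyHeaders explicitly (for example standardHeaders: true and legacyHeaders: false) or handle the header manually.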

Cycle 3:

Builder adds the header configuration. Checker runs: ALL GREEN.

Orchestrator stops. Reports success with the full checker output included.

Three cycles. That’s about the average for a task with one non-obvious interaction and one missing configuration. Tasks where the builder nails it on the first try take one cycle. Tasks where there are multiple interacting failures can take four. I’ve never had a task go all the way to cycle 5 without something being wrong in the original brief, which is itself a signal worth paying attention to.


Failure Modes I’ve Actually Hit

The same failure twice in a row. Hit this on a TypeScript strictness issue where the builder kept adding type assertions that the type checker still rejected, because the underlying problem was that a third-party type definition was wrong. The loop stopped at the second consecutive identical failure. That was correct: no amount of iteration was going to fix a buggy @types/ package.

A fix breaking a previously-passing check. Hit this when a builder fixed a unit test failure by extracting a helper function, which then caused a linting failure because the helper wasn’t exported but was referenced in a type declaration. The loop stopped immediately rather than letting it spin. I fixed both in one manual pass.

The builder not reading the failure carefully. If the failure message is ambiguous and the checker output doesn’t include enough context, the builder will sometimes make a plausible-but-wrong change. This is why I pass verbatim checker output, not a summary. Stack traces and exact assertion messages contain information a summary strips out.

The brief being too vague. The orchestrator’s brief-writing step matters more than it seems. “Fix the auth tests” is a bad brief. “The login route’s rate limiter needs to not activate during test runs because the test suite calls login multiple times in beforeEach — the rate limiter should check NODE_ENV or be disabled via a test helper” is a good brief. The quality of cycle 1 depends entirely on the brief.


Variations Worth Trying

Language-specific check sequences. The checker’s sequence above is Node/TypeScript. For Python projects I use mypy src/ → ruff check src/ → pytest -x. The -x flag (stop on first failure) matters: you want the checker to give the builder one thing to fix at a time, not a wall of failures. For Go: go vet ./... → staticcheck ./... → go test ./... in that order. Hardcode the check sequence in the checker’s system prompt for your project.
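The stop-at-first-failure behavior is the same regardless of language, only the commands change. A small sketch of that shape, with the command runner injected so the logic can be exercised without any of the tools installed (`CHECK_SEQUENCES`, `CommandRunner`, and `runCheckSequence` are hypothetical names; in the actual setup the checker agent does this through Bash):

```typescript
// Sequences copied from the text; the optional integration step is left out.
const CHECK_SEQUENCES: Record<string, string[]> = {
  typescript: ["npx tsc --noEmit", "npx eslint src/", "npx jest"],
  python: ["mypy src/", "ruff check src/", "pytest -x"],
  go: ["go vet ./...", "staticcheck ./...", "go test ./..."],
};

interface CommandResult {
  exitCode: number;
  output: string;
}

// Injected so the sequence logic does not depend on how commands are executed.
type CommandRunner = (command: string) => CommandResult;

function runCheckSequence(commands: string[], run: CommandRunner): string {
  for (const command of commands) {
    const result = run(command);
    // Stop at the first failing check and pass its output through untouched.
    if (result.exitCode !== 0) {
      return `FAILED\n${result.output}`;
    }
  }
  return "ALL GREEN";
}
```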

Adding a reviewer agent. For high-stakes changes — anything touching auth, payments, or data migrations — I sometimes add a third agent between the builder and checker: a security reviewer. Same pattern: read-only tools, no file editing, just reads the diff and reports concerns. The loop only proceeds to the checker if the reviewer raises no blockers. Adds cost, but for that class of change the tradeoff is easy.

Parallel checker tracks. If you have genuinely independent test suites — unit tests and end-to-end tests that don’t share state — you can dispatch two checkers in parallel in the same cycle and wait for both. Cuts cycle time roughly in half if your test suites take comparable time. Complexity goes up, but the loop orchestrator in Opus can handle it.

Scope limitation by file. The builder’s brief should specify which files are in scope. If you don’t constrain this, the builder might reasonably touch something outside the task boundary that the checker then flags. I often include a line like “only modify files under src/middleware/ and src/routes/auth.ts — do not touch test files or anything outside these paths.”


The Honest Cost Picture

A five-cycle run with Opus as orchestrator and Sonnet for builder and checker is not cheap. Ballpark: orchestrator cycles are expensive because Opus reads files to write the brief and tracks state across cycles; builder runs depend heavily on how many files get read and rewritten; checker runs are cheap because it’s just Bash. On a medium-complexity task I’d estimate 50-150k tokens total across a three-cycle run. That’s single-digit dollars on the API.
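The arithmetic behind that estimate is just tokens times price, summed per agent. The sketch below shows the calculation only: the token splits and per-million-token prices are placeholders for illustration, not measurements from a real run and not quoted from any price list, so substitute your own usage numbers and current pricing.

```typescript
interface AgentUsage {
  inputTokens: number;
  outputTokens: number;
  inputPricePerMTok: number;   // USD per million input tokens (placeholder)
  outputPricePerMTok: number;  // USD per million output tokens (placeholder)
}

function costUsd(usage: AgentUsage): number {
  const inputCost = usage.inputTokens * usage.inputPricePerMTok;
  const outputCost = usage.outputTokens * usage.outputPricePerMTok;
  return (inputCost + outputCost) / 1000000;
}

function totalCostUsd(agents: AgentUsage[]): number {
  return agents.reduce((sum, agent) => sum + costUsd(agent), 0);
}

// Hypothetical three-cycle run, roughly 135k tokens in total.
const hypotheticalRun: AgentUsage[] = [
  // orchestrator: reads files, writes the brief, tracks state
  { inputTokens: 25000, outputTokens: 5000, inputPricePerMTok: 15, outputPricePerMTok: 75 },
  // builder: reads and rewrites source files across three cycles
  { inputTokens: 70000, outputTokens: 15000, inputPricePerMTok: 3, outputPricePerMTok: 15 },
  // checker: mostly command output
  { inputTokens: 18000, outputTokens: 2000, inputPricePerMTok: 3, outputPricePerMTok: 15 },
];
```

With numbers in this range, `totalCostUsd(hypotheticalRun)` lands in the low single digits of dollars, and the orchestrator accounts for more than half of it despite handling the fewest tokens.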

Whether that’s worth it depends entirely on what the task is worth. For anything that would take a developer 2-4 hours to implement carefully with tests, the cost of a loop run is trivially worth it. For a two-line typo fix, it’s overkill — just do it manually.

The cycle cap is also a cost control. Five cycles is the maximum. If the work isn’t done in five cycles, you’re better off getting a human to diagnose why than spending more compute on iteration that isn’t converging.


Setup Time Is Actually About 10 Minutes

Three files. That’s it. Create .claude/agents/builder.md, .claude/agents/checker.md, and .claude/commands/loop.md with the content above, customized for your stack’s actual test and lint commands. Add the stop rules block to your CLAUDE.md. Invoke with /loop <task description>.

The first time you watch it run — actually see it cycle through failures and fix them autonomously, with a real checker reporting real output, and stop cleanly on ALL GREEN — it’s a different feeling than watching a single agent claim it’s done. There’s something qualitatively different about a system that has been forced to prove its output rather than just assert it.

That’s what the loop does. It turns “the agent said it worked” into “the checker confirmed it passed.” For anyone building on this seriously, that distinction matters a lot.


I’m running variants of this at Zero for tasks where I can’t have someone sitting in the loop watching every move. If you’re building something similar — or if you hit a failure mode I haven’t covered — I want to hear about it.
