Teams of Agents: When Multi-Agent Systems Are Worth the Complexity
Anthropic's engineering blog revealed how they built a multi-agent research system that outperformed single agents by over 90%. But that number hides the real design question.
I remember the first time I ran a real engineering team, somewhere around 2004. The lesson that took me the longest to internalize: adding more people to a project doesn’t make it faster in proportion to the headcount. It adds coordination overhead. Communication paths multiply as O(n²). Dependencies form. People block each other.
Brooks’s Law — “adding manpower to a late software project makes it later” — applies to agents too, it turns out. Multi-agent systems are powerful, but they’re not free, and the “just spin up more agents” reflex can waste serious compute and produce worse outputs if you’re not thoughtful about when parallelism actually helps.
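To put a number on that coordination overhead: if n people (or agents) all need to talk to each other, the number of pairwise communication paths is n(n-1)/2. Here’s a quick sketch; `communication_paths` is just an illustrative helper name of mine.

```python
def communication_paths(n: int) -> int:
    """Distinct pairwise channels among n fully connected people or agents."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(f"{n:>2} members -> {communication_paths(n):>3} paths")
```

Going from 5 members to 20 takes you from 10 paths to 190, which is why headcount and throughput stop tracking each other.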
Anthropic published a detailed engineering writeup on how they built their multi-agent research system, and it’s one of the most useful technical documents I’ve seen on this topic — not because it tells you to always use multiple agents, but because it’s precise about when parallelism is worth it and when it isn’t. Let me unpack the key findings, and then tell you how I think about this in my own systems.
What Anthropic Actually Built and Found
The system uses an orchestrator-worker pattern: a lead agent (they used Claude Opus 4) receives the research query, develops a strategy, and spawns subagents (Claude Sonnet 4) that work in parallel to explore different aspects of the problem. Each subagent gets a self-contained task with its own context window — and critically, Anthropic noted that each subagent doesn’t know the other subagents exist. They work independently, return findings to the lead, and the lead synthesizes the final answer.
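Here’s a minimal sketch of that fan-out and fan-in shape, not Anthropic’s implementation. `call_model` is a hypothetical stand-in for a real LLM API call, and the subtasks are passed in by hand, whereas in the real system the lead agent works them out itself.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    subtask: str
    summary: str


async def call_model(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{role}] {prompt}"


async def run_subagent(subtask: str) -> Finding:
    # Each subagent sees only its own subtask: no shared history, no siblings.
    summary = await call_model("subagent", subtask)
    return Finding(subtask=subtask, summary=summary)


async def orchestrate(query: str, subtasks: list[str]) -> str:
    # Fan out: subagents run concurrently and independently.
    findings = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    # Fan in: the lead sees only the distilled findings.
    digest = "\n".join(f.summary for f in findings)
    return await call_model("lead", f"Synthesize for '{query}':\n{digest}")


def research(query: str, subtasks: list[str]) -> str:
    return asyncio.run(orchestrate(query, subtasks))


if __name__ == "__main__":
    print(research("battery recycling", ["policy landscape", "chemistry", "market size"]))
```

The point of the structure is that `run_subagent` takes nothing but its own subtask, so no subagent can depend on what another one found.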
The performance result Anthropic reported on their internal evaluation: the multi-agent setup outperformed a single-agent Claude Opus 4 by more than 90%.
That number gets cited a lot. But here’s what people miss: the task they were running was research — specifically, open-ended research on complex topics requiring parallel exploration of multiple sources. This is a task where the structure of the problem maps almost perfectly to parallel execution. You can fan out, explore independently, and aggregate. Each subagent is doing work that doesn’t depend on the other subagents’ results.
The 90%+ improvement on research tasks tells you about the ceiling for this architecture. It doesn’t tell you much about whether multi-agent is the right choice for your problem.
The Key Design Insight: Context Windows as Compression
One of the most interesting observations in Anthropic’s writeup — and the one that made the architecture click for me — is this: subagents are doing compression. Each subagent gets a fresh context window, explores a thread of a problem, and returns a distilled summary to the lead agent. The parallelism isn’t just about speed; it’s about getting more information into the final synthesis than a single agent with a single context window could ever see.
Imagine trying to synthesize 200 relevant documents in a single context window. Even with large-context models, that’s hard: you’d be hitting limits and degrading quality. Instead, spawn 5 subagents, each looking at 40 documents, each returning a concise synthesis to the lead. The lead sees 5 compressed summaries totaling maybe 10% of the raw tokens while still preserving most of the signal from all 200 documents.
That’s architecturally elegant. It’s essentially a reduce step — each subagent does a local reduce on its slice, and the lead does a global reduce on the outputs. If you’ve ever thought about map-reduce at scale, you can see why this works and why the performance gain on research tasks is so large.
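The arithmetic behind the 200-document example can be sketched in a few lines. The per-document size and the 10% compression ratio are assumptions I picked for illustration, and `chunk` is a helper of my own, not something from Anthropic’s writeup.

```python
RAW_TOKENS_PER_DOC = 2_000  # assumed average document size
COMPRESSION_PCT = 10        # assumed: each summary is ~10% of its slice


def chunk(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    slices, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


docs = [RAW_TOKENS_PER_DOC] * 200          # 200 documents, as token counts
slices = chunk(docs, 5)                    # one slice per subagent
raw_total = sum(docs)                      # what a single agent would have to read
# Local reduce: each subagent compresses its own slice.
summary_tokens = [sum(s) * COMPRESSION_PCT // 100 for s in slices]
# Global reduce: the lead reads only the summaries.
lead_input = sum(summary_tokens)

if __name__ == "__main__":
    print(f"raw corpus:       {raw_total:,} tokens")
    print(f"largest subagent: {max(sum(s) for s in slices):,} tokens")
    print(f"lead input:       {lead_input:,} tokens")
```

Under these assumptions no single context ever holds more than 80,000 tokens, and the lead reads 40,000 tokens instead of 400,000.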
The Anthropic team also noted that their subagents were explicitly prompted to use search strategies that mirror expert human research: start with short, broad queries, evaluate what’s available, then narrow. They found that without this prompting, agents defaulted to long, specific queries that performed worse. Prompt engineering — or, more accurately, context engineering — of the subagent behavior was their primary lever for improvement.
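One way to picture the broad-then-narrow strategy is the loop below. It’s my own toy illustration, not Anthropic’s prompt or code: `toy_search`, `CORPUS`, and the `min_hits` threshold are all hypothetical.

```python
CORPUS = [
    "multi agent systems overview",
    "multi agent research orchestration",
    "multi agent research token cost",
    "multi agent research evaluation",
    "single agent coding benchmarks",
]


def toy_search(query: str) -> list[str]:
    """Hypothetical search: return docs containing every word in the query."""
    words = query.split()
    return [doc for doc in CORPUS if all(w in doc.split() for w in words)]


def broad_to_narrow(terms, search, min_hits=2):
    """Start with one term; add terms only while enough results survive."""
    query_terms = terms[:1]
    results = search(" ".join(query_terms))
    for term in terms[1:]:
        candidate = query_terms + [term]
        narrowed = search(" ".join(candidate))
        if len(narrowed) < min_hits:
            break  # narrowing further starves the query; keep the broader set
        query_terms, results = candidate, narrowed
    return " ".join(query_terms), results


if __name__ == "__main__":
    query, hits = broad_to_narrow(["agent", "research", "evaluation"], toy_search)
    print(query, len(hits))
```

The failure mode Anthropic described corresponds to jumping straight to the full three-term query, which matches only one document in this toy corpus.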
The Token Cost Equation
Multi-agent systems consume approximately 15 times more tokens than standard chat interactions, according to Anthropic’s writeup. Let that land.
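This is not a footnote. At production scale, 15x token consumption is a business decision. Note the baseline, though: the 15x figure is relative to chat interactions, and the same writeup puts single agents at roughly 4x chat. So if a chat-style workload costs $X in API calls per month, the equivalent multi-agent system might cost around $15X, or roughly four times what a single agent would. For some use cases, the output quality improvement is worth that. For many, it isn’t.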
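Here’s a back-of-envelope version of that comparison. It uses the 15x multi-agent multiplier cited above and the roughly 4x single-agent multiplier from the same Anthropic writeup, both relative to chat. The $1,000 baseline is a made-up number, and I’m assuming cost scales linearly with tokens at a single model price.

```python
# Token multipliers relative to a plain chat interaction.
MULTIPLIERS = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def monthly_cost(chat_baseline_usd: float, architecture: str) -> float:
    """Estimated monthly spend, assuming cost is proportional to tokens."""
    return chat_baseline_usd * MULTIPLIERS[architecture]


if __name__ == "__main__":
    baseline = 1_000.0  # hypothetical monthly chat spend in USD
    for arch in MULTIPLIERS:
        print(f"{arch:<13} ${monthly_cost(baseline, arch):>9,.2f}")
    ratio = MULTIPLIERS["multi_agent"] / MULTIPLIERS["single_agent"]
    print(f"multi-agent vs single-agent: {ratio:.2f}x")
```

On those multipliers, moving from a single agent to multi-agent is closer to a 3.75x jump than a 15x one. That’s still a lot of money if the quality gain isn’t there.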
The honest framework here is: multi-agent architectures are worth the premium when the task has two properties simultaneously. First, the problem must be parallelizable — meaning it can be broken into independent subproblems that don’t block each other. Second, the value of a better answer must be high enough to justify the cost. Research tasks for high-stakes decisions score high on both. Generating a summary of last week’s meeting notes scores low on both.
I’ve seen teams reach for multi-agent patterns as a first reflex because it sounds more sophisticated, and then wonder why their latency doubled and their costs exploded without much quality improvement. The sophistication is in knowing when not to use it.
When Multi-Agent Breaks Down
Let me be concrete about the task types where multi-agent doesn’t help, based on both Anthropic’s writeup and what I’ve seen in practice.
Tightly coupled tasks. Anthropic specifically mentioned coding as an example where multi-agent is less effective — because coding tasks are often tightly sequential. Step N depends on the output of Step N-1. The function you’re refactoring depends on understanding the context of what calls it. You can’t usefully parallelize this work the way you can parallelize research across different sources. Spawning multiple code-writing subagents typically produces conflicts that the orchestrator then has to resolve, adding overhead without adding much value.
Tasks with unclear scope. Multi-agent systems need well-defined subproblems. If the problem is ambiguous — if you’re not sure yourself how to decompose it — spawning multiple agents amplifies the ambiguity. Each subagent will interpret the task differently, and the lead will get conflicting outputs that are hard to reconcile. For ambiguous problems, a single agent that can ask clarifying questions iteratively is usually more effective.
Short tasks. The orchestration overhead — spinning up subagents, managing their outputs, running the synthesis step — has a fixed cost that’s non-trivial. For tasks that a single agent can complete well in 30 seconds, the overhead of multi-agent coordination probably isn’t worth it.
When you care about reasoning transparency. A single agent’s chain of thought can be followed from start to finish. Multi-agent reasoning is distributed across multiple context windows, each opaque to the others. Debugging why a multi-agent system reached a particular conclusion is significantly harder than debugging a single agent. If you’re building in a domain where you need to audit AI reasoning (healthcare, finance, legal), this matters a lot.
How I Think About the Architecture Decision
When I’m designing a new system that involves LLMs, my first question is always: does the problem decompose naturally into independent subtasks? I mean this literally — I’ll sketch out a dependency graph. If I can draw the subproblems as nodes and there are few or no edges between them (few dependencies), parallelism is probably a net win. If the graph is dense, I stick with a single agent or a tightly sequential chain.
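The dependency-graph sketch can be made mechanical. This is one way to do it, using the standard library’s `graphlib` to group tasks into waves that can run at the same time. The task names and both example graphs are invented for illustration.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; tasks in the same wave can run concurrently.

    `deps` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Research-shaped problem: independent threads, one synthesis step.
research = {
    "sources_a": set(),
    "sources_b": set(),
    "sources_c": set(),
    "synthesis": {"sources_a", "sources_b", "sources_c"},
}

# Refactor-shaped problem: every step waits on the previous one.
refactor = {
    "read_callers": set(),
    "change_signature": {"read_callers"},
    "update_callers": {"change_signature"},
    "fix_tests": {"update_callers"},
}

if __name__ == "__main__":
    print(parallel_waves(research))
    print(parallel_waves(refactor))
```

A wide first wave followed by a single synthesis step is the shape that suits an orchestrator-worker setup. Four waves of one task each means there’s nothing to parallelize.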
The second question is: what’s the value asymmetry of better answers? For a consumer product feature, a 30% quality improvement might mean slightly higher retention. For an enterprise research tool used by consultants making $1,000/hour decisions, a 30% quality improvement might be worth a lot of money. The token cost only looks expensive until you price it against the value of what you’re generating.
Third: observability. Before I deploy any multi-agent system in production, I need logging at every orchestration boundary. What did the lead agent decide to delegate? What did each subagent return? Where did the synthesis happen and how? Without this, you’re flying blind. I’ve built systems where a subagent was silently returning empty results for certain query types, and the orchestrator was generating plausible-sounding answers with nothing behind them. We only caught it because we had per-subagent logging.
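A minimal sketch of what logging at the orchestration boundary can look like, using Python’s standard `logging` module. The function names, log fields, and the choice to refuse synthesis when every subagent comes back empty are my own illustrative choices, not a prescribed design.

```python
import logging

logger = logging.getLogger("orchestration")


def run_with_logging(subagent_id, task, subagent_fn):
    """Call one subagent, logging the delegation and what came back."""
    logger.info("delegate subagent=%s task=%r", subagent_id, task)
    result = subagent_fn(task)
    if not result or not result.strip():
        # The silent-empty-result failure mode: surface it loudly.
        logger.warning("empty result subagent=%s task=%r", subagent_id, task)
        return None
    logger.info("return subagent=%s chars=%d", subagent_id, len(result))
    return result


def synthesize(query, results):
    """Combine subagent outputs, refusing to answer from nothing."""
    usable = {k: v for k, v in results.items() if v is not None}
    logger.info(
        "synthesize query=%r usable=%d dropped=%d",
        query, len(usable), len(results) - len(usable),
    )
    if not usable:
        raise ValueError("no subagent returned content; refusing to synthesize")
    # Stand-in for the lead agent's real synthesis call.
    return " | ".join(usable[k] for k in sorted(usable))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    outputs = {
        "s1": run_with_logging("s1", "pricing", lambda t: f"notes on {t}"),
        "s2": run_with_logging("s2", "regulation", lambda t: "   "),
    }
    print(synthesize("market entry", outputs))
```

The `dropped` count in the synthesis log line is the number to alert on. A steady nonzero value for one query type is the kind of pattern that exposed the empty-result bug described above.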
The Org Design Parallel
There’s something philosophically interesting about multi-agent systems that I keep coming back to. Building them has forced me to think more clearly about something that applies to human teams too: the shape of the problem should determine the shape of the organization, not the other way around.
The reason Anthropic’s research system works well is that the architecture fits the problem. Research is inherently explorative and parallelizable. The orchestrator-worker pattern mirrors what a good research team actually does: a senior researcher defines the questions and assigns areas of inquiry, junior researchers go deep on their individual threads, and the senior synthesizes the findings into a coherent answer.
When I’ve seen multi-agent systems fail (and I’ve seen several now), it’s usually because someone applied an orchestrator-worker pattern to a problem that required sequential, interdependent reasoning. It’s like trying to parallelize a proof: you can’t work on the middle and the end before the beginning is done.
The meta-skill here — whether you’re designing human teams or agent teams — is the ability to look at a problem and see its natural structure. Where are the dependencies? Where is the real parallelism? What does the synthesis step require? These are old architecture questions. They just have new answers when the workers are LLMs.
A Note on Claude Opus 4 as the Orchestrator
Anthropic made a specific and interesting choice in their system: Claude Opus 4 as the orchestrator, Claude Sonnet 4 as the subagents. This maps to a real principle: the orchestrator has a harder job in some ways — it needs to hold the overall strategy, decompose intelligently, and synthesize coherently — while the subagents are doing more focused, bounded tasks.
This also has a cost logic: Opus 4 is more capable and more expensive; Sonnet 4 is strong at focused tasks and cheaper. You pay the premium for orchestration capability where it matters (at the coordination layer) and use cost-efficient intelligence for the execution layer. That’s good systems thinking.
Not all problems need this split. But it’s worth considering when you design your own multi-agent systems: what level of reasoning capability does the orchestrator actually need vs. what the subagents need? Matching model capability to task complexity is part of the architecture.
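To make the cost logic of the split concrete, here’s a small sketch that treats model choice as per-role configuration. The model labels are placeholders rather than real API model IDs, and the prices and token counts are hypothetical numbers chosen only to show the shape of the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    model: str                     # placeholder label, not a real API model ID
    usd_per_million_tokens: float  # hypothetical blended price


SPLIT = {
    "lead": RoleConfig("larger-model", 15.0),
    "subagent": RoleConfig("smaller-model", 3.0),
}
UNIFORM = {
    "lead": RoleConfig("larger-model", 15.0),
    "subagent": RoleConfig("larger-model", 15.0),
}


def run_cost(roles, lead_tokens, tokens_per_subagent, n_subagents):
    """Cost in USD of one research run under a given role-to-model mapping."""
    lead = lead_tokens * roles["lead"].usd_per_million_tokens / 1_000_000
    workers = (
        n_subagents * tokens_per_subagent
        * roles["subagent"].usd_per_million_tokens / 1_000_000
    )
    return lead + workers


if __name__ == "__main__":
    for name, roles in (("split", SPLIT), ("uniform", UNIFORM)):
        cost = run_cost(roles, lead_tokens=50_000,
                        tokens_per_subagent=100_000, n_subagents=5)
        print(f"{name:<8} ${cost:.2f} per run")
```

Because the subagents account for most of the tokens in this example, their price dominates the bill. The real question is whether the cheaper tier is good enough for the bounded tasks you hand it.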