Anshad Ameenza.
Engineering · Updated: Jul 25, 2026

How to Build Powerful Web-Search Agents That Actually Work

Nine concrete techniques for building web-research agents that are fast, cheap, and don't hallucinate — from query decomposition to adversarial cross-checking.


I’ve been building and breaking AI systems for a while now. Before Zero — the university I’m building now — I ran through fifteen-odd startups across Bangalore, Kerala, Dubai, and a stretch in Vietnam where the coffee kept me honest. In all of that, one thing that keeps coming back to bite teams is underestimating what it takes to actually build a good research agent. Not a demo. A thing that works in production, at scale, without hallucinating your users into a bad decision.

Web-search agents feel deceptively simple. You string together a search API and an LLM and call it done. A few thousand tokens later the thing confidently tells you something wrong with three fake citations. I’ve seen this kill trust in internal tools overnight.

So here’s what I actually know about building these well. Nine techniques, ordered roughly by how early you need to think about them.


1. Query Planning and Decomposition Before You Touch the Search API

The worst thing you can do is hand a user’s raw question directly to a search engine. “What’s the competitive landscape for fintech lending in Southeast Asia and how does it compare to India’s UPI ecosystem” is not a search query. It’s a research brief.

Before any retrieval happens, a planning step needs to break that down. I use a dedicated planning call — usually a strong frontier model — that takes the question and emits a structured list of sub-queries with explicit dependencies. Something like:

{
  "sub_queries": [
    { "id": 1, "query": "fintech lending startups Southeast Asia 2025", "depends_on": [] },
    { "id": 2, "query": "Vietnam Philippines Indonesia digital lending regulation", "depends_on": [1] },
    { "id": 3, "query": "India UPI merchant lending ecosystem 2025", "depends_on": [] },
    { "id": 4, "query": "UPI credit stack vs Southeast Asia BNPL comparison", "depends_on": [1, 3] }
  ]
}

Queries without dependencies run in parallel. Queries with dependencies wait for their upstream results and then optionally rewrite themselves based on what was found. This is called query rewriting — before executing query 4, you inject a summary of the results from 1 and 3 and ask the model to sharpen the query given what it already knows.
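One way to turn the depends_on fields into an execution order is to group the plan into "waves", where every query in a wave has all of its upstream results available. The sketch below is a minimal illustration, not a definitive scheduler: the function name execution_waves is my own, and it assumes the plan has the same shape as the JSON above.

```python
from typing import Dict, List, Set


def execution_waves(sub_queries: List[dict]) -> List[List[int]]:
    """Group sub-query ids into waves; ids in the same wave can run in parallel."""
    pending: Dict[int, Set[int]] = {q["id"]: set(q["depends_on"]) for q in sub_queries}
    done: Set[int] = set()
    waves: List[List[int]] = []
    while pending:
        # A query is ready once every upstream dependency has finished.
        ready = sorted(qid for qid, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies in plan")
        waves.append(ready)
        done.update(ready)
        for qid in ready:
            del pending[qid]
    return waves


plan = {
    "sub_queries": [
        {"id": 1, "query": "fintech lending startups Southeast Asia 2025", "depends_on": []},
        {"id": 2, "query": "Vietnam Philippines Indonesia digital lending regulation", "depends_on": [1]},
        {"id": 3, "query": "India UPI merchant lending ecosystem 2025", "depends_on": []},
        {"id": 4, "query": "UPI credit stack vs Southeast Asia BNPL comparison", "depends_on": [1, 3]},
    ]
}

print(execution_waves(plan["sub_queries"]))  # [[1, 3], [2, 4]]
```

The rewriting step would sit between waves: before running wave two, each query in it gets the summaries of its dependencies and a chance to sharpen itself.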

In my experience, this alone can roughly double the relevance of what you retrieve. The reason is simple: a search engine is not a reasoner. It can’t infer what you actually mean. The planning model does that work and converts intent into precise retrieval instructions.


2. Aggressive Model Routing — Don’t Use a Hammer for Everything

This is where most teams bleed money. They pick one model and use it for every step. That’s expensive and unnecessary.

The insight is that different tasks in a research pipeline have wildly different difficulty levels. Planning and synthesis — those need a strong model. But summarizing a 2,000-word webpage into three sentences? Extracting named entities from a search result? Classifying whether a source is relevant? Tiny models handle these fine.

Route aggressively at every layer:

  • Planning and final synthesis: frontier model (Claude Opus, GPT-4o, etc.)
  • Sub-query generation and rewriting: mid-tier model (Sonnet, GPT-4o mini)
  • Summarization of individual pages: small/cheap model (Haiku, Llama 3.1 8B)
  • Relevance classification: a fine-tuned small model or even heuristics

The math is not subtle. If your orchestrator decides that a summarization subtask can go to a model that costs 40x less, and you’re doing that subtask 50 times per research run, you’ve just saved a lot of money without touching quality. I’ve seen this cut total inference cost by 60–70% on real workloads.

The key is that routing decisions happen in the orchestrator, which runs on your best model. The orchestrator doesn’t need to be cheap — it needs to be smart enough to correctly classify what kind of work each step requires and send it to the right subagent.
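To make the arithmetic concrete, here is a rough sketch of a static routing table plus a cost estimate. Everything numeric in it is an assumption for illustration: the tier names, the per-million-token prices (the small tier is simply set to 1/40 of the frontier tier), and the token counts. It also ignores the input/output price split that real providers use.

```python
# Hypothetical tiers and prices (USD per million tokens); substitute real ones.
PRICE_PER_MTOK = {"frontier": 15.0, "mid": 3.0, "small": 0.375}  # small = frontier / 40

# Hypothetical task types, mapped to the tiers listed above.
ROUTES = {
    "planning": "frontier",
    "synthesis": "frontier",
    "query_rewrite": "mid",
    "page_summary": "small",
    "relevance": "small",
}


def route(task_type: str) -> str:
    """Pick a tier for a task; unknown task types fall back to the strongest tier."""
    return ROUTES.get(task_type, "frontier")


def run_cost(calls, router=route) -> float:
    """Total cost of a run, given (task_type, tokens) pairs and a routing function."""
    return sum(PRICE_PER_MTOK[router(task)] * tokens / 1_000_000 for task, tokens in calls)


# One made-up research run: a plan, 50 page summaries of 3,000 tokens, one synthesis.
calls = [("planning", 4_000)] + [("page_summary", 3_000)] * 50 + [("synthesis", 20_000)]

routed = run_cost(calls)
single_model = run_cost(calls, router=lambda task: "frontier")
print(f"routed: ${routed:.4f}  single model: ${single_model:.4f}")
print(f"fraction saved: {1 - routed / single_model:.2f}")
```

A static table like this is the simplest version. An orchestrator that classifies each step at runtime would replace the dictionary lookup with a model call, but the cost accounting stays the same.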


3. Subagents to Parallelize and Contain Scope

Long research tasks will kill you if you run them as a single sequential loop in one context window. Two problems: cost and contamination.

Cost: a single 200,000-token context doing ten rounds of search-read-reflect is brutally expensive. Every token already in the context window is billed again as input on every model call.

Contamination: when one search thread’s results bleed into another’s reasoning, you get weird cross-topic hallucinations. The model starts mixing up facts from different subtopics because they’re all swimming in the same context.

The fix is subagents. Each sub-query or research thread gets its own agent with its own isolated context. The subagent does its job — search, read, extract — and returns a compact summary to the orchestrator. The orchestrator never sees raw web content; it only sees curated summaries.

This is the multi-agent pattern that serious research systems are converging on. Parallelization is a first-class concern, not an afterthought. When queries 1 and 3 from my example above have no dependencies, they fire simultaneously in two separate subagents. Wall-clock time drops accordingly.

The orchestrator’s context stays clean. It holds the research plan, the growing set of summaries, and the synthesis task. That’s it.
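A minimal sketch of that isolation with asyncio, under some loud assumptions: fake_search and fake_summarize are stand-ins for a real search API and a small summarization model, and the function names are hypothetical. The point is only the shape: raw pages stay inside run_subagent, and the orchestrator receives compact summaries.

```python
import asyncio
from typing import Awaitable, Callable, List

SearchFn = Callable[[str], Awaitable[List[str]]]
SummarizeFn = Callable[[str, List[str]], Awaitable[str]]


async def run_subagent(query: str, search: SearchFn, summarize: SummarizeFn) -> dict:
    """One isolated research thread: raw pages never leave this function."""
    pages = await search(query)
    summary = await summarize(query, pages)
    return {"query": query, "summary": summary, "pages_read": len(pages)}


async def run_parallel(queries: List[str], search: SearchFn, summarize: SummarizeFn) -> List[dict]:
    """Run independent sub-queries concurrently; results come back in input order."""
    return await asyncio.gather(*(run_subagent(q, search, summarize) for q in queries))


async def fake_search(query: str) -> List[str]:
    await asyncio.sleep(0.01)  # stands in for network latency
    return [f"RAW PAGE about {query} " * 50]


async def fake_summarize(query: str, pages: List[str]) -> str:
    return f"{len(pages)} page(s) read for '{query}'"


queries = [
    "fintech lending startups Southeast Asia 2025",
    "India UPI merchant lending ecosystem 2025",
]
results = asyncio.run(run_parallel(queries, fake_search, fake_summarize))
for r in results:
    print(r["summary"])
```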


4. Use a Search Engine That Was Actually Built for Agents

Google is built for humans clicking ads. The result format, the ranking signals, the way results are returned — all of it optimizes for a person scanning a SERP on a screen. Agents don’t click ads. They want precise semantic retrieval.

Exa is a semantic search engine built largely for this use case. Instead of keyword matching, it uses neural embeddings to understand query intent and match on meaning. This matters enormously for research tasks. If I ask for “companies building infrastructure for AI agent memory,” I want companies that do that, not pages that happen to contain those exact words.

Exa can return things like 100–1,000 company homepages matching a semantic description, which is incredibly useful for market research and competitive intelligence. That’s a retrieval capability that keyword search engines simply can’t replicate cleanly. The precision at the top of the list is meaningfully higher for conceptual queries.

For targeted research — “find me all the papers from 2025 on reinforcement learning for robotics” or “find me fintechs operating in the GCC” — this kind of semantic retrieval is the difference between a research agent that produces usable output and one that wastes 80% of its retrieved content on irrelevant material.


5. Content Extraction — Pages Are 90% Noise

This is something Exa has done real work on that I find genuinely interesting. Raw webpages are terrible inputs for LLMs. You’ve got nav bars, footers, cookie banners, related articles, ads, comment sections — the actual information you want is maybe 10–20% of the token count.

Exa trains what they call “extraction” models — specialized models that strip up to roughly 90% of the tokens from a fetched page, keeping only the dense informational content. Their “highlights” feature in particular can reduce token usage by 79–90% compared to passing full page text to an LLM. For RAG pipelines with tight context budgets, that translates to fitting 4–5x more sources into the same token budget.

The lesson here, even if you’re not using Exa, is that content preprocessing is not optional. Whatever your fetching layer looks like, you should be running a stripping pass before anything hits an LLM. At minimum: remove HTML markup, extract main content blocks, drop boilerplate. Tools like Trafilatura, Readability.js, and similar article-extraction libraries do a reasonable job. But training an extraction model specifically for your retrieval pattern, the way Exa has, is a significant unlock if you’re operating at scale.

Every token you don’t send to the LLM is money you keep and latency you avoid.
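As a floor for what a stripping pass can look like, here is a small standard-library sketch that drops markup and the most common boilerplate containers. It is deliberately crude and is not how Exa or Trafilatura work; the SKIP_TAGS set and the class name are my own assumptions, and a real extractor would handle far more cases.

```python
from html.parser import HTMLParser

# Elements whose contents are almost never the main article text (an assumption).
SKIP_TAGS = {"script", "style", "noscript", "title", "nav", "header", "footer", "aside", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def strip_boilerplate(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = """<html><head><title>Page</title><style>body{}</style></head><body>
<nav><a href="/">Home</a></nav>
<article><h1>UPI lending</h1><p>Merchant credit grew in 2025.</p></article>
<footer>Cookie settings</footer><script>track()</script></body></html>"""

clean = strip_boilerplate(sample_html)
print(clean)
print(f"kept {len(clean)} of {len(sample_html)} characters")
```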


6. Iterative Search-Read-Reflect Loops with a Stopping Critic

One-shot retrieval doesn’t work for complex research. You don’t know exactly what you need until you start reading what you find. The right architecture is a loop:

  1. Search based on current plan
  2. Read and extract from top results
  3. Reflect: what did I learn? what am I still missing? is the question answered?
  4. If not done: rewrite queries based on what I found, repeat

The critical piece most people skip is step 3 — an explicit critic that decides whether to keep going. Without this, agents either stop too early (miss important information) or loop forever (runaway cost). I implement this as a separate model call with a strict schema:

{
  "coverage": "partial",
  "confidence": 0.6,
  "gaps": ["No data found on Vietnam specifically", "Conflicting figures on market size"],
  "should_continue": true,
  "next_queries": ["Vietnam fintech lending 2025 market size", "Southeast Asia BNPL growth rate source"]
}

The should_continue field is a hard gate. If coverage is complete and confidence is above your threshold, the loop terminates. You also want a hard maximum iteration count — I use 5–7 iterations for most research tasks — because sometimes a question just can’t be fully answered from available web content and you need to return what you have rather than spin indefinitely.
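The gate and the iteration cap can be sketched in a few lines. This assumes the critic returns a dictionary shaped like the schema above; the 0.8 confidence threshold, the function names, and the two fake_* stubs are hypothetical placeholders for real retrieval and a real critic call.

```python
MAX_ITERATIONS = 6          # within the 5-7 range mentioned above
CONFIDENCE_THRESHOLD = 0.8  # hypothetical; tune for your own tolerance


def research_loop(initial_queries, search_and_read, critic):
    """Search, read, reflect; stop when the critic says so or the cap is hit."""
    notes = []
    queries = list(initial_queries)
    for iteration in range(1, MAX_ITERATIONS + 1):
        notes.extend(search_and_read(queries))
        verdict = critic(notes)
        answered = (verdict["coverage"] == "complete"
                    and verdict["confidence"] >= CONFIDENCE_THRESHOLD)
        # Hard gate: stop if answered, told to stop, or there is nothing left to try.
        if answered or not verdict["should_continue"] or not verdict["next_queries"]:
            return {"notes": notes, "iterations": iteration, "stopped": "critic"}
        queries = verdict["next_queries"]
    return {"notes": notes, "iterations": MAX_ITERATIONS, "stopped": "max_iterations"}


def fake_search_and_read(queries):
    # Stand-in for retrieval plus extraction: one note per query.
    return [f"note on: {q}" for q in queries]


def fake_critic(notes):
    # Stand-in for the critic model call: satisfied once three notes exist.
    enough = len(notes) >= 3
    return {
        "coverage": "complete" if enough else "partial",
        "confidence": 0.9 if enough else 0.6,
        "gaps": [] if enough else ["still missing detail"],
        "should_continue": not enough,
        "next_queries": [] if enough else [f"follow-up {len(notes)}"],
    }


result = research_loop(["Vietnam fintech lending 2025 market size"],
                       fake_search_and_read, fake_critic)
print(result["iterations"], result["stopped"])  # 3 critic
```

Returning the reason the loop stopped is a small design choice that pays off later: a run that ended on max_iterations deserves a lower-confidence answer than one the critic signed off on.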

This loop structure is what separates real research agents from one-shot RAG.


7. Source Quality Ranking, Recency Filtering, and Deduplication

Not all sources are equal and your agent should know that. I maintain a lightweight source quality scorer that runs before extracted content goes to the LLM. It checks:

  • Domain authority heuristics: is this a primary source (company blog, official docs, regulatory filing) or a tertiary aggregator?
  • Recency: for fast-moving topics, a 2022 source about AI tooling is actively misleading. I filter by date aggressively and weight recent sources higher.
  • Semantic deduplication: five different tech blogs reporting on the same press release are one data point, not five. I embed all extracted summaries and cluster them; only one representative from each cluster proceeds to synthesis.

Deduplication is underrated. Without it, your synthesis model sees the same fact echoed five times and infers it’s well-established when it might just be a single press release that spread. The model can’t tell the difference between five independent confirmations and five copies of the same article. You have to handle this in the pipeline.

For recency filtering, I pass a published_after parameter to the search API where supported and always extract the publication date from pages. If the page doesn’t have a clear date, it gets flagged as low-confidence.
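Here is a rough sketch of the deduplication and recency steps. One substitution to name plainly: instead of embeddings and clustering, it uses word-overlap (Jaccard) similarity with a greedy pass, which is only a cheap stand-in that keeps the example dependency-free. The 0.6 threshold, the "published" field name, and the sample summaries are all made up for illustration.

```python
import re
from datetime import date


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for embedding cosine similarity."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def deduplicate(summaries, threshold=0.6):
    """Greedy pass: keep a summary only if it is not a near-copy of one already kept."""
    representatives = []
    for summary in summaries:
        if all(jaccard(summary, kept) < threshold for kept in representatives):
            representatives.append(summary)
    return representatives


def filter_recent(sources, published_after: date):
    """Split into (fresh, undated); sources older than the cutoff are dropped."""
    fresh, undated = [], []
    for source in sources:
        published = source.get("published")
        if published is None:
            undated.append(source)  # no clear date: flag as low-confidence
        elif published >= published_after:
            fresh.append(source)
    return fresh, undated


summaries = [
    "Acme raises $50M Series B to expand lending in Vietnam",
    "Acme raises $50M Series B to expand its lending in Vietnam",
    "Indonesia tightens rules on digital lending interest caps",
]
unique = deduplicate(summaries)
print(len(unique))  # 2

sources = [
    {"url": "a", "published": date(2022, 3, 1)},
    {"url": "b", "published": date(2025, 6, 1)},
    {"url": "c", "published": None},
]
fresh, undated = filter_recent(sources, date(2025, 1, 1))
```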


8. Adversarial Cross-Verification of Key Claims

Research agents fail in a specific way: they find a claim in one source, treat it as fact, and build subsequent reasoning on it. When the original claim was wrong, everything downstream is wrong too. This compounds.

For any claim the synthesizer marks as high-stakes — a statistic, a company valuation, a regulatory fact — I run an adversarial verification step: query for the same fact from independent sources with no prior context given to the retrieval. If source A says the Indian BNPL market is $X billion and I find three independent sources that also say $X billion, I have reasonable confidence. If source B says $Y billion, I surface the discrepancy rather than picking one.

Practically, this looks like a second subagent that knows only the specific claim to verify, not the rest of the research context. It does its own retrieval, extracts the relevant facts, and returns a verdict: confirmed, contradicted, or inconclusive. The orchestrator surfaces contradictions to the final synthesis step with explicit uncertainty.

This catches a surprising amount of bad data. The web is full of figures that got copy-pasted wrong from an original source, outdated statistics that still rank highly, and genuinely contested claims that look authoritative in isolation. Adversarial verification forces the agent to treat retrieval as hypothesis testing, not fact collection.
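The verdict logic for a numeric claim might look something like the sketch below. It covers only the final comparison, not the independent retrieval itself, and the 10% tolerance, the two-source minimum, and the function name are assumptions I picked for illustration.

```python
def verify_numeric_claim(claimed, independent_values, tolerance=0.10, min_sources=2):
    """Compare a claimed figure against values found by independent retrieval.

    Returns 'confirmed', 'contradicted', or 'inconclusive'. A contradiction is
    surfaced with the disagreeing values rather than resolved by picking a winner.
    """
    if len(independent_values) < min_sources:
        return {"verdict": "inconclusive", "agreeing": [], "disagreeing": []}
    margin = tolerance * abs(claimed)
    agreeing = [v for v in independent_values if abs(v - claimed) <= margin]
    disagreeing = [v for v in independent_values if abs(v - claimed) > margin]
    verdict = "contradicted" if disagreeing else "confirmed"
    return {"verdict": verdict, "agreeing": agreeing, "disagreeing": disagreeing}


# Made-up figures, in billions, purely to exercise the three outcomes.
print(verify_numeric_claim(10.0, [10.2, 9.8, 10.0])["verdict"])  # confirmed
print(verify_numeric_claim(10.0, [10.1, 14.0])["verdict"])       # contradicted
print(verify_numeric_claim(10.0, [10.0])["verdict"])             # inconclusive
```

Non-numeric claims such as regulatory facts need a model to judge agreement, but the three-way verdict and the rule of surfacing disagreement carry over unchanged.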


9. Structured Output with Inline Citation Grounding

The final output of a research agent is only as trustworthy as its citations. Every factual claim in the output should be tied to a specific source URL. Not a “sources” section at the bottom — inline grounding where each claim links directly to the document that supports it.

I enforce this by giving the synthesis model a strict output schema:

{
  "claims": [
    {
      "text": "The Indian UPI ecosystem processed $2.5 trillion in transactions in FY2025",
      "source_id": "src_003",
      "confidence": "high",
      "quote": "UPI recorded transactions worth ₹206 lakh crore..."
    }
  ],
  "sources": {
    "src_003": {
      "url": "https://...",
      "title": "NPCI Annual Report 2025",
      "retrieved_at": "2026-07-20"
    }
  }
}

The schema does not let the model make a claim without attaching a source ID. If it generates a claim with no supporting source in the retrieved set, that claim either gets marked as model_knowledge (lower confidence, not from retrieval) or gets flagged for human review.

This schema also makes evaluation possible. You can automatically check: what fraction of claims have sources? Are those sources actually in the retrieved set? Do the quotes match? This turns citation grounding from a nice-to-have into a machine-checkable property of every research output.
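Those three checks are simple enough to sketch directly. The validator below assumes the output follows the schema above and that you kept the extracted text of each retrieved source keyed by source ID; the function names, the retrieved_text mapping, and its sample contents are hypothetical. Quote matching here is a case-insensitive substring test that tolerates a trailing ellipsis, which is cruder than what a production check would want.

```python
def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def validate_output(output: dict, retrieved_text: dict) -> dict:
    """Check every claim for a source ID, a retrieved source, and a matching quote."""
    problems = []
    for index, claim in enumerate(output["claims"]):
        source_id = claim.get("source_id")
        if not source_id:
            problems.append((index, "missing source_id"))
            continue
        if source_id not in output["sources"] or source_id not in retrieved_text:
            problems.append((index, "source not in retrieved set"))
            continue
        # Drop a trailing ellipsis so truncated quotes can still match.
        quote = _normalize(claim.get("quote", "").rstrip(".… "))
        if not quote or quote not in _normalize(retrieved_text[source_id]):
            problems.append((index, "quote not found in source"))
    total = len(output["claims"])
    grounded = (total - len(problems)) / total if total else 0.0
    return {"grounded_fraction": grounded, "problems": problems}


output = {
    "claims": [
        {"text": "The Indian UPI ecosystem processed $2.5 trillion in transactions in FY2025",
         "source_id": "src_003", "confidence": "high",
         "quote": "UPI recorded transactions worth ₹206 lakh crore..."},
        {"text": "A claim whose quote is not in the source",
         "source_id": "src_003", "confidence": "high",
         "quote": "this sentence never appears"},
    ],
    "sources": {"src_003": {"url": "https://...", "title": "NPCI Annual Report 2025",
                            "retrieved_at": "2026-07-20"}},
}
# Hypothetical extracted text for src_003, standing in for the real fetched page.
retrieved_text = {"src_003": "In FY2025, UPI recorded transactions worth ₹206 lakh crore overall."}

report = validate_output(output, retrieved_text)
print(report)
```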


Putting It Together

A production web-research agent isn’t one loop and one model. It’s a system:

  • A planning layer that decomposes questions and generates typed sub-queries
  • A routing layer that sends each task to the cheapest model that can do it well
  • Parallel subagents with isolated contexts doing retrieval and extraction
  • A semantic search backend (like Exa) that retrieves by meaning, not keywords
  • A content extraction step that strips noise before any LLM sees a webpage
  • An iterative loop with a stopping critic that knows when to terminate
  • Source quality scoring, recency filtering, and semantic deduplication
  • Adversarial cross-verification for high-stakes claims
  • Structured output with inline citation grounding and machine-checkable evidence

None of these is exotic. All of them require intentional implementation. Skip any one of them and you’ll see it show up in cost, hallucination rate, or user trust.

The teams I’ve watched build the best research agents treat the pipeline the same way they treat any distributed system: each component has a clear interface, a defined failure mode, and an observable output. That mindset is what separates a demo that impresses from a tool people actually rely on.

I’m building a lot of this into how we think about knowledge infrastructure at Zero. If you’re working on something similar, I want to hear about it.

AI Developer Tools Architecture