RAG Grew Up: Context Engineering and the MCP Standard
From prompt engineering to context engineering to MCP — how the industry's mental model for building with LLMs has matured, and what it means architecturally.
Somewhere around mid-2024 I stopped using the phrase “prompt engineering” with my teams. Not because the work went away — we were doing more of it than ever — but because the term had become misleading. It implied that the craft involved was about phrasing questions cleverly. What we were actually doing was far more architectural.
By the time Tobi Lütke, CEO of Shopify, posted his tweet in June 2025 — “I really like the term ‘context engineering’ over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM” — I felt the industry had finally caught up to what practitioners had been doing for a year. Simon Willison, whose blog is one of the best technical traces of how this field has evolved, wrote on June 27, 2025 that he thought the term would stick. He was right.
And separately — almost quietly, given how consequential it turned out to be — Anthropic open-sourced the Model Context Protocol in November 2024. These two threads (the conceptual reframe and the technical standard) are related, and understanding both is important if you’re architecting systems that depend on LLMs.
The Actual Problem With “Prompt Engineering”
Here’s what we were actually doing when we called it prompt engineering: we were deciding what information to put in the LLM’s context window, in what format, in what order, and with what framing — so that the model could do useful work on a task it wasn’t directly trained for.
That’s not just clever wording. That’s a system design problem. The decisions involved include:
- What data sources to retrieve from, and how to prioritize when you have more data than fits in the context
- How to chunk, embed, and retrieve documents so that the most relevant content actually surfaces
- What tools to expose to the model, and what format their outputs should be in
- How much of the conversation history to preserve vs. summarize vs. discard as context fills up
- How to structure the system prompt to establish reliable behavior without eating too much of the window
Retrieval-Augmented Generation (RAG) was the first serious answer to part of this problem — specifically, the problem of giving the model access to external knowledge it wasn’t trained on. The idea: at query time, retrieve relevant documents from a vector store, stuff them into the context window, let the model generate an answer conditioned on those documents. It worked. It still works. But “doing RAG” came to be treated as if it solved the broader problem, when it was really solving one narrow slice of it.
Context engineering is the more complete framing. You’re engineering the entire context — everything the model sees when it goes to generate its response. Tools, memories, retrieval results, conversation history, system instructions, user state. All of it. And the quality of that engineering is now one of the most significant determinants of whether an LLM-based system actually works in production.
What MCP Is and Why It Matters
Anthropic open-sourced the Model Context Protocol in November 2024. The problem it addresses: before MCP, every AI application that needed to connect to external tools and data sources had to build custom connectors for each one. This created what Anthropic described as an N×M integration problem — every model needed to be connected to every tool separately, and every tool needed to know about every model. Messy and fragile.
MCP is a standardized open protocol (built on JSON-RPC 2.0, conceptually similar to how the Language Server Protocol standardized IDE integrations) that defines how AI systems connect to data sources and tools. An MCP server exposes capabilities — files, APIs, databases, whatever — in a standard way. Any MCP-compatible client (Claude, but also now Cursor, Zed, and others) can connect to any MCP server without custom integration work.
Anthropic shipped initial MCP SDKs for Python and TypeScript, along with pre-built servers for Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer. The adoption curve since then has been steep: OpenAI adopted MCP for its Agents SDK and Responses API in March 2025, Google DeepMind confirmed MCP support in Gemini in April 2025, and Microsoft and GitHub joined MCP’s steering committee in May 2025. At this point it’s effectively the industry standard for AI-to-tool connectivity.
Why does this matter for context engineering? Because MCP is what operationalizes it. When you’re thinking about what context to provide your model, MCP defines how you give it structured access to the external world — how it reaches into a database, how it calls an API, how it reads a file. It’s the plumbing for the context layer.
The Architecture of Good Context
Let me get specific about what context engineering looks like in practice, because I think the conceptual discussion stays too abstract.
When I’m designing an LLM-based system now, I think in terms of what I’d call the context stack:
Static context is what’s baked into the system prompt — the model’s role, constraints, output format, and any fixed knowledge it should always have. This is the cheapest context to provide but also the least adaptive. Don’t bloat it.
Retrieved context is what RAG pulls in — documents, code snippets, database records relevant to the specific query. The quality of retrieval (embedding model choice, chunking strategy, re-ranking) matters enormously here. Bad retrieval is worse than no retrieval — noise in the context window degrades output quality in ways that are subtle and hard to debug. In one system I built for Zero, we found that using a small re-ranking step (using a cross-encoder to reorder the top-k retrieved chunks) improved answer quality more than doubling the number of retrieved documents did.
Tool outputs are what the model gets back from tool calls during a multi-turn interaction. With MCP, this is standardized — the model asks for a tool call, the MCP server executes it, the result comes back in a structured format. The design question here is which tools to expose, how to format their outputs, and how to handle tool call failures gracefully.
Conversation history is the trickiest. Long conversations accumulate fast. A 40-message conversation in a coding assistant is thousands of tokens before you’ve retrieved anything. Most production systems need a strategy here — hierarchical summarization, selective retention of key decisions, or simply a sliding window — and the right strategy is highly task-dependent. There isn’t a universal answer.
Injected memory is the emerging frontier: external memory stores that the model can read from and write to, allowing state to persist across sessions. This is where a lot of the current agentic research is focused.
The craft of context engineering is knowing how to balance these layers given a fixed context budget. Models have gotten better (Gemini 2.5 Pro and Claude 3.7 support very long contexts), but longer context windows don’t eliminate the need for good context engineering — they raise the stakes, because what you can fit is now a design decision rather than a hard constraint.
What Changed for RAG Specifically
RAG hasn’t gone away — it’s grown up. The basic pattern (retrieve, stuff into context, generate) is still the foundation, but the production reality is more layered:
Hybrid search (combining dense vector search with BM25 keyword search and then re-ranking) has become the standard in serious deployments because pure vector search misses cases where keyword overlap matters a lot (like looking up a specific product name or error code).
Structured data + unstructured data is a real design problem. Most real enterprise data isn’t just documents — it’s databases, APIs, structured schemas. Text-to-SQL approaches work for simpler cases, but anything non-trivial needs MCP-style tool access rather than pure retrieval.
Evaluation is still the messy part. Knowing whether your retrieval is good — whether the right documents are coming back for the right queries — requires building an eval harness with labeled test cases. This is unglamorous infrastructure work, but teams that skip it pay for it later when their system degrades as their data changes.
Caching matters more than people realize at scale. Frequently retrieved documents that don’t change much can be cached in the context using semantic or prompt caching features (Anthropic offers prompt caching in the API), reducing both latency and token cost.
The Deeper Point
I want to end with something that I think the framing of “context engineering” illuminates that “prompt engineering” obscured: this is fundamentally a systems engineering problem, not a linguistics problem.
When we called it prompt engineering, it sounded like a skill related to natural language — you needed to know how to phrase things. Some people got good at it through intuition and trial and error. That approach doesn’t scale to production systems that need to work reliably across many users and use cases.
Context engineering, done well, involves: data architecture (what’s stored where and how it’s indexed), retrieval system design (embedding models, chunking, re-ranking), tool interface design (what capabilities to expose and how to format them), conversation state management, evaluation frameworks, and observability (logging what the model sees so you can debug when it goes wrong). These are engineering problems that have engineering solutions.
The terminology shift matters because it signals what kind of talent you need and what kind of investment is required. If you’re building production LLM systems and you think of it as prompt engineering, you’ll hire differently and invest differently than if you think of it as context engineering. The second framing is closer to the truth of what’s actually required.
MCP provides the plumbing. Context engineering provides the design practice. Together, they’re how you build AI systems that are actually reliable — not just impressive in a demo.