Two years ago, GPT-4 shipped with an 8K token window and everyone was building RAG pipelines to compensate. Today, Gemini 2.5 Pro handles 2 million tokens. Claude Sonnet 4 takes a million. Llama 4 claims 10 million. The question that keeps surfacing at every engineering standup: why bother with retrieval if you can just stuff everything into the prompt?

A January 2025 evaluation by Li et al. tested this head-to-head across 13,628 questions and 12 datasets. Long context scored 56.3% correct. RAG scored 49.0%. Clear win for the "just shove it all in" crowd.

But the same paper found that about 10% of questions, 1,294 out of 13,628, could only be answered correctly by RAG. The retriever found information that the long-context model missed entirely, even with the full document sitting right there in the prompt. That's not a rounding error. That's a capability gap.

The Middle Is Still a Dead Zone

Liu et al.'s 2023 "Lost in the Middle" paper showed that models perform best when the answer sits at the beginning or end of the context window, and degrade when it's buried in the middle. Two years and a dozen model generations later, the core finding still holds.

Leng et al. tested 20 LLMs on RAG workflows with context from 2,000 to 128,000 tokens. Only a handful of frontier models held accuracy past 64K: GPT-4o scored 0.769 at 64K and 0.767 at 100K. Most open-source models peaked around 16K-32K tokens and then started losing accuracy. Llama 3.1 405B declined after 32K.

Throwing more context at the problem works if you have a top-tier model. For everyone else, it doesn't.

Where Each Approach Wins

Li et al.'s Self-Route study confirmed what practitioners suspected. Long context beats RAG on tasks needing global understanding: summarization, multi-hop reasoning, pattern recognition across full documents. RAG fights back on dialogue-based queries and domain-specific questions where precision matters more than coverage.

The February 2025 LaRA benchmark formalized this with 2,326 test cases. The conclusion: no silver bullet. The optimal approach depends on model size, task type, and retrieval characteristics. Not a satisfying answer for product teams, but the honest one.
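
Self-Route's own answer to that ambiguity is a router. Here's a rough sketch of the idea, not the paper's exact prompts: the model first tries to answer from a few retrieved chunks and falls back to the full long-context pass only when it judges the chunks insufficient. The `retrieve` and `llm` callables stand in for whatever retriever and chat-model wrapper you already use.

```python
from typing import Callable, List

UNANSWERABLE = "unanswerable"

def self_route(
    query: str,
    full_document: str,
    retrieve: Callable[[str, int], List[str]],  # your retriever: (query, k) -> chunks
    llm: Callable[[str], str],                  # your chat-model wrapper: prompt -> answer
) -> str:
    # Cheap path: try to answer from a handful of retrieved chunks.
    chunks = retrieve(query, 5)
    rag_prompt = (
        "Answer the question using only the passages below. "
        f"If they are insufficient, reply '{UNANSWERABLE}'.\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = llm(rag_prompt)

    # Expensive fallback: pay for the full long-context pass only when needed.
    if UNANSWERABLE in answer.lower():
        answer = llm(f"{full_document}\n\nQuestion: {query}")
    return answer
```

The payoff depends on how often queries hit the fallback; the fewer that do, the closer your costs stay to plain RAG.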

The Cost Equation

Processing a million tokens isn't free. Every query against a large document set using long context means reprocessing all those tokens. Every time.

RAG stores documents once, retrieves the relevant 1,000-2,000 tokens, and pays for only those. A customer support system handling 10,000 daily queries against a product knowledge base? Long context would be ruinously expensive. RAG with a vector store would cost a fraction.
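
The arithmetic is easy to run yourself. Every number below is a placeholder, not any provider's actual rate card, but the shape of the result doesn't change much when you swap in real prices.

```python
# Back-of-the-envelope daily cost; all figures are illustrative assumptions.
QUERIES_PER_DAY = 10_000
KB_TOKENS = 500_000              # assumed size of the product knowledge base
RAG_TOKENS_PER_QUERY = 1_500     # retrieved context actually sent to the model
PRICE_PER_M_INPUT = 3.00         # placeholder dollars per 1M input tokens

long_context_cost = QUERIES_PER_DAY * KB_TOKENS / 1_000_000 * PRICE_PER_M_INPUT
rag_cost = QUERIES_PER_DAY * RAG_TOKENS_PER_QUERY / 1_000_000 * PRICE_PER_M_INPUT

print(f"Long context: ${long_context_cost:,.0f}/day")  # $15,000/day
print(f"RAG:          ${rag_cost:,.0f}/day")           # $45/day
```

At these assumptions the gap is a few hundred to one; the exact ratio moves with your prices, but the structure of the comparison doesn't.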

Latency follows the same curve. Long context responses at 128K+ tokens take 30-60 seconds. RAG returns in under a second. For user-facing applications, that's disqualifying.

What This Changes for Agent Memory

This matters most for anyone designing agent memory architectures. Agents accumulate knowledge over time, often reaching tens of millions of tokens. No shipping context window covers that, not even at two million tokens.

Jin et al. found that adding more retrieved passages improves results up to a point, then accuracy declines as irrelevant material starts interfering. Their fix: retrieval reordering, which places the strongest matches at the edges of the prompt and pushes the noise to the middle, where the model pays less attention. Ugly? Yes. Effective? Also yes.
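
A minimal version of that reordering, assuming your retriever hands back (score, passage) pairs: alternate the top hits between the front and back of the context so the weakest matches land in the middle. This is the shape of the idea, not the paper's exact implementation.

```python
from typing import List, Tuple

def reorder_for_edges(scored: List[Tuple[float, str]]) -> List[str]:
    """Put the most relevant passages at the start and end of the context,
    leaving the weakest matches in the middle, where models attend least."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    front, back = [], []
    for i, (_, passage) in enumerate(ranked):
        # Best -> front, second best -> back, third -> front, and so on.
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# Scores from a hypothetical retriever.
hits = [(0.91, "A"), (0.85, "B"), (0.40, "C"), (0.32, "D"), (0.10, "E")]
print(reorder_for_edges(hits))  # ['A', 'C', 'E', 'D', 'B'] -- weakest in the middle
```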

The most interesting recent work is Luo et al.'s RetroLM, which retrieves at the KV-cache page level instead of the document level. It beat both standard long-context LLMs and existing RAG on LongBench, InfiniteBench, and RULER. The future isn't "RAG or long context." It's hybrid systems that blur the line.

The smart bet for agents managing long-term memory: a tiered architecture with recent context in the window, older knowledge in a retrieval layer, and a routing mechanism that picks the right tool for each query.
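
For concreteness, here is one way that tiering could look. Every name and threshold is invented for this sketch rather than taken from any particular framework: recent turns stay in a bounded window, older turns spill into an embedded archive, and a similarity threshold decides per query whether archived memories ride along.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

class TieredMemory:
    """Illustrative tiered agent memory: a bounded in-context window (tier 1)
    plus an embedded archive used as a retrieval layer (tier 2)."""

    def __init__(self, embed: Callable[[str], Sequence[float]], window_turns: int = 20):
        self.window: deque = deque(maxlen=window_turns)
        self.archive: List[Tuple[Sequence[float], str]] = []
        self.embed = embed

    def add(self, text: str) -> None:
        # When the window is full, spill the oldest turn into the archive
        # before the deque silently drops it.
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]
            self.archive.append((self.embed(oldest), oldest))
        self.window.append(text)

    def build_context(self, query: str, k: int = 3, min_sim: float = 0.3) -> str:
        # Routing: the recent window always rides along; archived memories are
        # included only when they clear a similarity threshold for this query.
        recent = "\n".join(self.window)
        if not self.archive:
            return recent
        q = self.embed(query)
        scored = sorted(
            ((self._cosine(q, emb), text) for emb, text in self.archive),
            reverse=True,
        )
        hits = [text for sim, text in scored[:k] if sim >= min_sim]
        return ("\n".join(hits) + "\n---\n" + recent) if hits else recent

    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
```

In production the archive would be a real vector store and the router something smarter than a threshold, but the division of labor is the point.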

Anyone declaring RAG dead is reading the benchmarks wrong. Anyone ignoring million-token windows is going to get blindsided. The fight isn't over. It's just gotten sharper.

Sources

Research Papers:

Related Swarm Signal Coverage: