Context Window Management: When 1M Tokens Isn't Enough


Gemini 2.5 Pro advertises a 1 million token context window. Claude accepts a million. Llama 4 claims 10 million. On paper, the context problem is solved. In production, it isn't close.

Some 2026 long-context guidance puts effective capacity well below the advertised maximum: performance starts dropping before the formal context limit is reached. Benchmark writeups also show models diverging sharply at long context lengths. Treat exact model rankings and scores as time-bound; the durable lesson is the gap between "supports" and "performs at."

Many enterprise agent failures are better explained as context drift, stale state, or memory loss during multi-step reasoning than as raw context exhaustion. The context window didn't overflow. It rotted. This guide covers what actually happens when agents hit the limits of their context, and the engineering patterns that production teams use to manage it.

The Lost in the Middle Problem Hasn't Gone Away

Liu et al.'s 2023 "Lost in the Middle" paper showed that models perform best when the answer sits at the beginning or end of the context window and degrade when it's buried in the middle. Three years and multiple model generations later, the core finding still holds for most models in production.

Leng et al. tested 20 LLMs on RAG workflows with context from 2,000 to 128,000 tokens. Only some models held up as contexts grew; many declined after mid-range context lengths. Treat the exact scores as benchmark-specific, but the pattern matters: longer context does not automatically mean better retrieval.

A 2025 study (listed under Sources) reframed the phenomenon not as a bug to fix but as an emergent property of how LLMs learn information retrieval during pre-training. Larger models show a shallower U-shaped recall curve, so scale helps. But scale alone doesn't eliminate the problem. It reduces it from catastrophic to merely expensive.

The practical implication: if your agent retrieves a 50-page contract and the critical clause is on page 27, a frontier model might find it. A mid-tier model probably won't. Position-awareness in your context assembly isn't optional.
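One simple way to act on position-awareness is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt. The sketch below is a minimal illustration, not a method from the cited papers. It assumes the chunks arrive already sorted by relevance, best first, and the function name `edge_order` is hypothetical.

```python
def edge_order(chunks_best_first):
    """Place the most relevant chunks at the start and end of the context.

    Ranked chunks alternate between the front and the back, so the weakest
    chunks land in the middle, where recall tends to be lowest.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_best_first):
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


ranked = ["A (best)", "B", "C", "D", "E (worst)"]
print(edge_order(ranked))
```

With five ranked chunks, the best one opens the context, the second best closes it, and the weakest sits in the middle.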


Context Rot: The Silent Agent Killer

Context overflow gets the attention. Context rot does the damage.

Every message, tool call, file read, and response adds tokens to the window. In a multi-turn agent session, the signal-to-noise ratio drops with every interaction. The agent still has the information technically present in its context, but its ability to retrieve and act on that information degrades measurably after 20-30 turns. The model isn't forgetting. It's drowning.

A UC Berkeley, Stanford, and IBM Research study of 306 practitioners found that 68% of production agent systems execute 10 or fewer steps before requiring human intervention. Nearly half execute fewer than five. This isn't just a reliability problem. It's a context problem. Each step adds noise. Each tool call returns data the model must carry forward. By step 15, the agent is reasoning over a context that's mostly its own prior outputs, not the original task specification.

The accumulation is simple arithmetic. If each step adds 500 tokens of tool output and the agent runs 20 steps, that's 10,000 tokens of accumulated state competing with the original instructions for attention. The instructions haven't moved in the context window, but they now make up a shrinking share of what the model has to attend to.
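The dilution can be made concrete with a few lines of arithmetic. This is a worked illustration only: the 500 tokens per step comes from the example above, while the 1,000-token instruction size is an assumed figure chosen for the demonstration.

```python
def instruction_share(instruction_tokens, tokens_per_step, steps):
    """Fraction of the context occupied by the original instructions."""
    accumulated = tokens_per_step * steps
    return instruction_tokens / (instruction_tokens + accumulated)


# Assumed: a 1,000-token task specification, 500 tokens added per step.
for steps in (0, 5, 10, 20):
    share = instruction_share(1_000, 500, steps)
    print(f"after {steps:>2} steps: instructions are {share:.0%} of the context")
```

Under these assumptions the instructions fall from the whole context at step 0 to about 9% of it after 20 steps (1,000 out of 11,000 tokens).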

What a Context Window Actually Costs

The "just stuff it all in" approach has a price tag that most teams discover too late.

A European bank case study from Q3 2025 ran a controlled comparison between long-context and RAG approaches on the same document corpus. The long-context approach was 34% more accurate on simple queries. The RAG approach was 67% more accurate on queries requiring synthesis across documents from different time periods. Latency was 8x lower with RAG. Cost was 94% lower.

The cost asymmetry is structural. Self-attention compute scales quadratically with sequence length, so doubling your context roughly quadruples the attention work, which shows up as latency even where the bill grows only linearly per token. At 1M tokens per request with Gemini 2.5 Pro, a single inference call runs approximately $1.25 in input tokens alone. Run that 10,000 times a day and you're burning $12,500 daily on input tokens before output costs.
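The daily figure is easy to reproduce. The sketch below assumes a flat rate of $1.25 per million input tokens, taken from the figure above; real price lists change and may tier by prompt length, so treat the rate as a parameter rather than a fact. The function name is illustrative.

```python
def daily_input_cost(tokens_per_request, requests_per_day, usd_per_million_tokens):
    """Input-token spend per day at a flat per-token rate (an assumption)."""
    per_request = tokens_per_request / 1_000_000 * usd_per_million_tokens
    return per_request * requests_per_day


# 1M-token prompts, 10,000 requests a day, assumed $1.25 per million tokens.
print(daily_input_cost(1_000_000, 10_000, 1.25))  # 12500.0
```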

Prompt caching changes the math significantly. Anthropic's cache reads cost 0.1x the base price, a 90% savings on repeated context. OpenAI gives 50% off cached inputs automatically. But caching only helps when the context is stable across requests. For agents processing unique documents per session, caching provides minimal relief.
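How much caching helps depends almost entirely on how much of the prompt repeats. The sketch below is a simplified model, assuming a single cache-read multiplier (0.1 for the 90% discount mentioned above) and ignoring any surcharge a provider may apply when writing to the cache. The function and parameter names are hypothetical.

```python
def blended_input_cost(base_cost, cached_fraction, cache_read_multiplier):
    """Input cost when part of the prompt is served from cache.

    Simplification: ignores any extra charge for writing to the cache.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = base_cost * cached_fraction * cache_read_multiplier
    uncached = base_cost * (1.0 - cached_fraction)
    return cached + uncached


# Stable system prompt and documents: assume 90% of tokens repeat.
print(round(blended_input_cost(1.00, 0.90, 0.1), 3))
# Unique documents per session: assume only 5% of tokens repeat.
print(round(blended_input_cost(1.00, 0.05, 0.1), 3))
```

In this model a 90% cache hit rate cuts input cost to about a fifth of the uncached price, while a 5% hit rate saves under 5%.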

The cost calculation most teams miss: it's not the per-token price. It's the per-token price multiplied by every token you didn't need. Stuffing 200K tokens into a context window when your agent only needs 15K of relevant information means you're paying roughly 13x more than necessary for worse results.


Five Patterns That Production Teams Actually Use

The teams that ship working agents don't pick one strategy. They layer multiple approaches based on what each conversation turn actually requires. The Anthropic context engineering guide frames this as designing the agent's "mental world" rather than optimizing individual prompts.

1. Retrieval-Augmented Generation (RAG)

RAG remains the workhorse for document-heavy applications because it solves the cost and relevance problems simultaneously. Instead of loading entire document collections into context, a retriever identifies the most relevant chunks and injects only those.

The 2025 evaluation by Li et al. tested long context against RAG across 13,628 questions. Long context scored 56.3%. RAG scored 49.0%. But roughly 10% of questions, 1,294 out of 13,628, could only be answered correctly by RAG. The retriever found information that the long-context model missed entirely, even with the full document in the prompt.

The winning pattern for 2026 isn't RAG or long context. It's RAG to select the evidence set, then long context to reason over it. Use vector retrieval to find the 20 most relevant passages, assemble them into a focused context of 5-10K tokens, and let the model reason over a clean signal. The RAG architecture patterns guide covers the implementation details.
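The retrieve-then-reason step can be sketched without any external library. The example below is a toy under stated assumptions: embeddings are hand-written two-dimensional vectors, token counts are approximated by whitespace word counts, and `select_evidence` is a hypothetical helper rather than an API from any retrieval framework.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_evidence(query_vec, passages, k=20, token_budget=8_000):
    """Pick the top-k passages by similarity without exceeding the budget.

    `passages` is a list of (text, embedding) pairs. Word counts stand in
    for token counts, which is an assumption; use a real tokenizer in practice.
    """
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    chosen, used = [], 0
    for text, _ in ranked[:k]:
        cost = len(text.split())
        if used + cost > token_budget:
            continue  # skip a passage that would overflow the budget
        chosen.append(text)
        used += cost
    return chosen


passages = [
    ("Termination requires 90 days written notice.", [0.9, 0.1]),
    ("The office is closed on public holidays.", [0.1, 0.9]),
    ("Either party may terminate for material breach.", [0.8, 0.3]),
]
evidence = select_evidence([1.0, 0.0], passages, k=2, token_budget=50)
prompt = "Answer using only this evidence:\n" + "\n".join(evidence)
print(prompt)
```

The model then reasons over the small assembled prompt rather than the whole corpus.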

2. Sliding Window and Context Compression

For long-running agent sessions, sliding window approaches keep the most recent turns while summarizing or dropping older ones. The implementation varies by use case.

The simplest version keeps the system prompt and last N turns, dropping everything before that. More sophisticated versions use an LLM to summarize the dropped context into a condensed "memory" block that stays at the top of the window. Research on agent context compression shows that well-tuned compression can maintain 85-90% of task-relevant information in 20-30% of the original token count.

The trap is aggressive compression on the wrong content. Summarizing factual tool outputs works well. Summarizing the user's original intent or constraints works poorly, because the compressed version inevitably loses specificity. Keep the original task specification verbatim. Compress the intermediate work.
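A minimal sketch of that rule: pin the task specification, compress older turns into a memory block, and keep the most recent turns verbatim. The `summarize` argument is a hypothetical callable that would wrap an LLM call in practice; the default here only truncates each turn and is a placeholder, not a real summarizer.

```python
def _truncate_summary(old_turns):
    # Placeholder: a real system would call an LLM to summarize these turns.
    return " | ".join(turn[:40] for turn in old_turns)


def build_context(task_spec, turns, keep_last=4, summarize=_truncate_summary):
    """Keep the task spec verbatim, compress old turns, keep recent turns."""
    split = max(len(turns) - keep_last, 0)
    old, recent = turns[:split], turns[split:]
    context = [f"TASK (verbatim): {task_spec}"]
    if old:
        context.append(f"MEMORY (compressed): {summarize(old)}")
    context.extend(recent)
    return context


turns = [f"turn {i}: tool output" for i in range(1, 8)]
for line in build_context("Refund order 1234, never exceed $50.", turns, keep_last=2):
    print(line)
```

The original constraint survives word for word no matter how many turns are compressed away.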

3. Hierarchical Context Architecture

Instead of one flat context window, split the agent's information into tiers. The first tier holds the system prompt, current task, and active constraints, never compressed, always present. The second tier holds recent tool outputs and conversation turns, managed with a sliding window. The third tier holds summarized history and reference information, retrieved on demand.

This mirrors how the Model Context Protocol (MCP) structures external information delivery. MCP servers provide governed metadata on request rather than dumping everything into the prompt upfront. The agent asks for what it needs when it needs it, keeping the active context focused.

The architecture requires explicit decisions about what information lives at each tier and when to promote or demote content between tiers. Those decisions are specific to your domain. A legal research agent needs different tier boundaries than a customer service agent.
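One possible shape for the three tiers is a small container with explicit promote and demote operations. This is an illustrative data structure, not MCP or any framework's API. The class name, the `recent_limit` of three entries, and the choice to archive full text instead of a summary are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TieredContext:
    pinned: list = field(default_factory=list)    # tier 1: never compressed
    recent: list = field(default_factory=list)    # tier 2: sliding window
    archive: dict = field(default_factory=dict)   # tier 3: retrieved on demand
    recent_limit: int = 3

    def add_turn(self, key, text):
        """Add to tier 2; demote the oldest entries to tier 3 when full."""
        self.recent.append((key, text))
        while len(self.recent) > self.recent_limit:
            old_key, old_text = self.recent.pop(0)
            self.archive[old_key] = old_text

    def promote(self, key):
        """Pull an archived item back into tier 2 when the agent needs it."""
        if key in self.archive:
            self.add_turn(key, self.archive.pop(key))

    def render(self):
        """Active context: pinned content followed by recent turns."""
        return self.pinned + [text for _, text in self.recent]


ctx = TieredContext(pinned=["SYSTEM: legal research agent"])
for i in range(5):
    ctx.add_turn(f"t{i}", f"output {i}")
ctx.promote("t0")  # bring the first tool output back into the window
print(ctx.render())
```

The domain-specific decisions live in `recent_limit` and in the policy for when `promote` is called.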

4. Task Decomposition

When a task requires more context than a single window can effectively process, break it into subtasks that each fit comfortably within the effective window.

Long-running agent research confirms what practitioners have found: agents that decompose a 50-step task into five independent 10-step subtasks, each with a fresh context, outperform agents that try to maintain state across all 50 steps. The subtask agent starts each phase with a clean context containing only the original goal, the outputs from prior phases, and the specific instructions for the current phase.

This is the pattern behind the orchestrator-worker architecture described in Anthropic's "Building Effective Agents" guide. The orchestrator maintains a high-level plan in a compact context. Workers execute specific subtasks with focused contexts. The orchestrator integrates results and dispatches the next subtask.
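The control flow is small enough to sketch directly. In the example below, `worker` is a hypothetical callable standing in for an LLM agent call, and `fake_worker` is a stub used only to show what each phase receives. This illustrates the general pattern and is not code from the Anthropic guide.

```python
def run_plan(goal, phases, worker):
    """Run each phase in a fresh context built from the goal and prior outputs."""
    outputs = []
    for instructions in phases:
        context = {  # rebuilt from scratch for every phase
            "goal": goal,
            "prior_outputs": list(outputs),
            "instructions": instructions,
        }
        outputs.append(worker(context))
    return outputs


def fake_worker(context):
    # Stand-in for a model call; reports how much prior state it was given.
    seen = len(context["prior_outputs"])
    return f"{context['instructions']} (saw {seen} prior outputs)"


results = run_plan(
    "Migrate the billing service",
    ["inventory endpoints", "write tests", "cut over"],
    fake_worker,
)
for line in results:
    print(line)
```

Each worker sees only the goal, earlier phase outputs, and its own instructions, never the intermediate tool chatter from earlier phases.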

5. Attention Steering

Recent research shows that the context window problem is partly an attention allocation problem. Models have the information but can't keep attention aligned with it as decoding progresses. DySCO, a decoding algorithm from Princeton Language and Intelligence, dynamically boosts retrieval-specific attention heads during generation to counteract attention drift.

On long-context benchmarks including MRCR and LongBenchV2, DySCO showed relative accuracy improvements of up to 25% at 128K context length across multiple models, without any changes to model weights. The technique works at inference time, meaning you can apply it to existing models without retraining.

This is early-stage for production use, but it signals where the field is heading: treating attention as a runtime resource to be managed, not a fixed property of the model.

The Context Engineering Discipline

The shift from prompt engineering to context engineering is the defining trend in production AI for 2026. Elasticsearch Labs frames the distinction clearly: prompt engineering focuses on how you communicate with the model, while context engineering focuses on what information the model has access to when it generates responses.

The structured context engineering study quantified the impact across 9,649 experiments. Model capability accounts for approximately 21 percentage points of accuracy variation. Context architecture accounts for about 2.7 points. Prompt format accounts for less than 1 point. The industry's obsession with finding the perfect prompt template has been a distraction. By these numbers, what enters the context window matters several times more than how you phrase the question, and model choice outweighs both. See the full analysis for the breakdown.

In practice, context engineering means treating every token in the window as a scarce resource. Before adding a document to context, ask: does the agent need this to complete the current step? If not, leave it out and retrieve it later if needed. Before returning tool output, ask: does the agent need the full output, or would a summary suffice? Before carrying conversation history forward, ask: which turns contain information the agent will reference again?

The teams that manage context well share a pattern. They measure. They track token usage per step, context utilization ratios, and retrieval precision. They set budgets: this agent gets 30K tokens of context per step, split into 5K for instructions, 15K for retrieved content, and 10K for conversation history. They enforce those budgets programmatically, not through hope.
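Programmatic enforcement can be as simple as trimming each section to its allowance before the prompt is assembled. The sketch below uses the 5K/15K/10K split from the example above. The whitespace-based `count_tokens` is an assumed stand-in for a real tokenizer, and the section names and trimming policy are illustrative choices.

```python
BUDGET = {"instructions": 5_000, "retrieved": 15_000, "history": 10_000}


def count_tokens(text):
    # Assumption: whitespace word count as a stand-in for a real tokenizer.
    return len(text.split())


def enforce_budget(sections, budget=BUDGET):
    """Trim each section to its budget; refuse to truncate instructions."""
    trimmed = {}
    for name, limit in budget.items():
        items = list(sections.get(name, []))
        if name == "history":
            items.reverse()  # consider the newest history first
        kept, used = [], 0
        for item in items:
            cost = count_tokens(item)
            if used + cost > limit:
                if name == "instructions":
                    raise ValueError("instructions exceed their budget")
                break
            kept.append(item)
            used += cost
        if name == "history":
            kept.reverse()  # restore chronological order
        trimmed[name] = kept
    return trimmed


small_budget = {"instructions": 5, "retrieved": 4, "history": 3}
sections = {
    "instructions": ["do the task"],
    "retrieved": ["a b c", "d e", "f"],
    "history": ["one two", "three", "four five"],
}
print(enforce_budget(sections, small_budget))
```

Retrieved content is cut from the bottom of the ranking, history is cut from the oldest end, and an oversized instruction block raises an error instead of being silently truncated.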

What This Means for Agent Architecture

The million-token context window is a capability, not a strategy. Filling it is almost always the wrong move. The agents that work in production are the ones that use context windows the way a good researcher uses a desk: keeping only the relevant materials in front of them, filing everything else within reach, and knowing exactly where to find it when needed.

The engineering work isn't in expanding the window. It's in curating what goes into it. That's the gap between a demo that processes a full codebase in one pass and a production system that handles repeated tasks without degrading. The context window is big enough. The question is whether you're filling it with signal or noise.

Sources

Research Papers:

Lost in the Middle: How Language Models Use Long Contexts -- Liu et al. (2023)
Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs -- (2025)
Structured Context Engineering for File-Native Agentic Systems -- (2026)
DySCO: Dynamic Scaling of Context for Long-Context Retrieval -- Ye et al., Princeton (2026)
Self-Route: Long Context with RAG -- Li et al. (2025)
Measuring Agents in Production -- Kapoor et al., UC Berkeley, Stanford, IBM Research (2025)

Industry / Case Studies:

Effective Context Engineering for AI Agents -- Anthropic (2026)
Building Effective Agents -- Anthropic (2024)
LLM Context Window Limitations in 2026 -- Atlan (2026)
Context Engineering vs. Prompt Engineering -- Elasticsearch Labs (2025)
Long Context Windows: Capabilities, Costs, and Tradeoffs -- Jason Willems (2026)
AI Agent Context Compression Strategies -- Zylos Research (2026)
Long-Running AI Agents and Task Decomposition -- Zylos Research (2026)
