Most teams building memory-augmented agents obsess over how memories are written — extracting facts, summarizing episodes, compressing context. A new paper from the MemAgents Workshop (ICLR 2026), by researchers at UC San Diego, Carnegie Mellon, and UNC, shows those teams are optimizing the wrong stage. The bottleneck is retrieval, and it's not close.

The Experiment

Boqin Yuan, Yue Su, and Kun Yao designed a diagnostic framework that isolates where agent memory actually breaks down. They crossed three write strategies — raw chunking, Mem0-style fact extraction, and MemGPT-style episode summarization — with three retrieval methods: cosine similarity, BM25, and hybrid reranking. All nine configurations ran on GPT-5-mini against LoCoMo, a benchmark of 1,540 questions drawn from long multi-session conversations averaging 600 turns each.
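The experimental grid is just the cross product of the two axes. A minimal sketch of the nine configurations — the identifiers here are illustrative labels, not the authors' actual configuration keys:

```python
from itertools import product

# The paper's diagnostic grid: every write strategy crossed with every
# retrieval method. Names below are our shorthand, not the authors' code.
WRITE_STRATEGIES = ["raw_chunking", "fact_extraction", "episode_summary"]
RETRIEVAL_METHODS = ["cosine", "bm25", "hybrid_rerank"]

# Nine (write, retrieval) pairings, each run on all 1,540 LoCoMo questions.
configurations = list(product(WRITE_STRATEGIES, RETRIEVAL_METHODS))
```

Crossing the axes is what lets the paper attribute accuracy shifts to one stage or the other: hold the write strategy fixed and vary retrieval, or vice versa.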

The results from Table 1 are stark. Switching retrieval method shifts accuracy by 14 to 23 percentage points depending on the write strategy. Switching write strategy? Only 3 to 8 points. Average accuracy across retrieval methods ranged from 57.1% (BM25) to 77.2% (hybrid reranking). The gap between the cheapest write strategy and the most expensive barely registers by comparison.

Raw Storage Wins

Retrieval method shifts accuracy by up to 23 points. Write strategy shifts it by 8. Teams are optimizing the wrong stage.

Here is the finding that should make memory pipeline engineers uncomfortable: raw chunked storage — zero LLM calls, no preprocessing — scored 77.9% with cosine retrieval and 81.1% with hybrid reranking. Both figures match or beat the more expensive alternatives. Extracted facts topped out at 77.3% under hybrid retrieval. Summarized episodes peaked at 73.3%.
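"Zero preprocessing" really is that simple: raw chunked storage is a windowing pass over the transcript. A sketch, assuming fixed-size turn windows with overlap — the paper does not specify its chunking parameters, so the sizes here are placeholders:

```python
def chunk_turns(turns, chunk_size=5, overlap=1):
    """Raw chunked storage: group conversation turns into overlapping
    windows with no LLM calls. chunk_size and overlap are illustrative
    defaults, not the paper's settings."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(turns), step):
        window = turns[start:start + chunk_size]
        if window:
            chunks.append("\n".join(window))
        if start + chunk_size >= len(turns):
            break
    return chunks
```

Each chunk goes straight into the memory store as-is; the overlap keeps facts that straddle a window boundary retrievable from at least one chunk.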

The explanation is straightforward. Lossy compression discards conversational details that the backbone LLM can leverage directly when given the raw text. Fact extraction strips context. Summarization smooths over specifics. The downstream retriever cannot compensate for information that was thrown away upstream.

This echoes a pattern we have seen before. The tension between context windows and retrieval often comes down to what gets lost in translation — and the more you process memories before storing them, the more translation loss you introduce.

Where Failures Actually Happen

The diagnostic probes in Table 2 classify every incorrect answer into retrieval failure, utilization failure, or hallucination. Retrieval failure — where the system fails to surface relevant memories — accounts for 11% to 46% of all questions depending on configuration. Utilization failure, where relevant information is retrieved but the model ignores or misuses it, stays pinned between 4% and 8%. Hallucination barely registers at 0.4% to 1.4%.
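One way to read the taxonomy is as a decision order over two checks: was the relevant memory retrieved, and is the wrong answer grounded in anything that was retrieved? A hypothetical sketch of that logic — the flags are stand-ins for upstream probes, not the authors' implementation:

```python
def diagnose(is_correct: bool, relevant_in_top_k: bool,
             answer_grounded: bool) -> str:
    """Classify one answer under the three-way failure taxonomy.
    Inputs are assumed to come from separate probes (our reading of
    Table 2's categories, not the paper's code)."""
    if is_correct:
        return "correct"
    if not relevant_in_top_k:
        return "retrieval_failure"    # right memory never surfaced
    if answer_grounded:
        return "utilization_failure"  # memory retrieved but ignored/misused
    return "hallucination"            # claim unsupported by any memory
```

The ordering matters: an answer can only be blamed on utilization or hallucination once retrieval has demonstrably done its job.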

The worst case is BM25 with extracted facts: 46.3% retrieval failure rate. The best case is hybrid reranking with raw chunks: 11.4% retrieval failure. That is a fourfold difference driven entirely by retrieval quality, not memory formatting.

Retrieval precision at k=5 correlates with downstream accuracy at r = 0.98. When agents get the right memories, they almost always use them correctly. The problem is getting the right memories in front of the model in the first place — a challenge familiar to anyone who has worked through the RAG reliability gap.
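Precision at k=5 is just the fraction of the top five retrieved memories that are actually relevant. A minimal version, with our own identifier names:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved memories that appear in the
    relevant set. Divides by k, the standard convention."""
    return sum(1 for mem_id in retrieved_ids[:k] if mem_id in relevant_ids) / k
```

An r = 0.98 correlation between this number and answer accuracy is about as close to "retrieval is the whole ballgame" as a diagnostic gets.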

What This Means for Agent Builders

Raw chunked storage with zero LLM calls matches or beats expensive fact extraction and summarization pipelines.

The practical implications cut against the current trend in agent memory design. Teams are investing significant compute in memory preprocessing — running LLM calls to extract structured facts, generate summaries, and compress episode histories. This research suggests that compute is better spent on retrieval quality.

Hybrid reranking, which uses a secondary LLM (GPT-5.2 in this study) to rescore retrieved candidates, cut retrieval failures roughly in half compared to BM25 alone. That single change drove beneficial memory utilization up to 79.0% under raw chunking — the highest rate across all nine configurations.
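The two-stage shape is simple: a cheap scorer produces a shortlist, and a stronger scorer reorders it. A sketch with pluggable scoring functions — `first_stage_score` and `rerank_score` are hypothetical stand-ins for BM25/cosine and the LLM rescorer, not APIs from the paper:

```python
def hybrid_rerank(query, memories, first_stage_score, rerank_score,
                  shortlist_k=20, final_k=5):
    """Two-stage retrieval: cheap scoring over all memories, then a
    stronger (e.g. LLM-based) scorer reranks the shortlist. The k
    values are illustrative defaults."""
    # Stage 1: inexpensive lexical or vector scoring over the full store.
    shortlist = sorted(memories,
                       key=lambda m: first_stage_score(query, m),
                       reverse=True)[:shortlist_k]
    # Stage 2: the secondary model only ever sees the shortlist,
    # so its cost stays bounded regardless of store size.
    return sorted(shortlist,
                  key=lambda m: rerank_score(query, m),
                  reverse=True)[:final_k]
```

The design point is the cost asymmetry: the expensive scorer runs on at most `shortlist_k` candidates per query, which is why an LLM rescoring pass is affordable where LLM-based memory writing often is not.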

For teams building [agentic RAG](https://swarmsignal.net/agentic-rag/) systems or designing multi-agent memory architectures, the takeaway is pointed: store more, compress less, and spend your optimization budget on retrieval. The memories are probably already there. Your agent just cannot find them.