A deep-dive guide to the emerging engineering discipline of agent memory — budget tiers, shared memory banks, empirical grounding, and temporal knowledge graphs.


In January, we described the goldfish brain problem — the fundamental memory limitation that makes AI agents forget everything between sessions, lose track of multi-step tasks, and repeat the same mistakes. That post asked a question that anyone building long-running agents has confronted: what memory architecture actually scales?

This month, four independent research efforts started answering.

Between February 3 and February 7, 2026, papers appeared on arXiv proposing budget-tiered memory routing, learned admission control for shared memory banks, a 9,649-experiment empirical study of what actually matters in context engineering, and a multi-agent memory framework that customizes memories per agent role. Taken together with recent work on temporal knowledge graphs, these results outline something that did not exist six months ago: a coherent architecture for agent memory.

This guide maps that emerging architecture. If you are building agents that need to remember things — which is to say, any agent doing real work — this is the landscape as of early February 2026.


The Memory Landscape Before These Papers

For the past eighteen months, retrieval-augmented generation has been the default memory architecture for LLM agents. The recipe is familiar: chunk your documents, embed them into vectors, store them in a vector database, and retrieve the top-k most similar chunks at query time using cosine similarity.
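
For concreteness, here is that recipe as a minimal Python sketch. The embed() helper is a placeholder for whatever embedding model you use, and none of the names here come from the papers discussed below:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for any embedding model; returns a unit-length vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

class FlatVectorStore:
    """The default RAG memory: every chunk is a peer of every other chunk."""
    def __init__(self):
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        q = embed(query)
        # Cosine similarity reduces to a dot product on unit vectors.
        scores = [float(q @ v) for v in self.vectors]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.chunks[i] for i in top]
```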

RAG works. It solved the immediate problem of giving models access to information beyond their training data. But as agents moved from simple question-answering to multi-step workflows spanning hours or days, the limitations became structural.

First, RAG memory is flat. Every chunk lives at the same level of importance. A critical system constraint and a casual aside from a user get the same treatment — embedded, stored, retrieved by similarity. There is no hierarchy, no quality tier, no way for the system to say "this memory matters more than that one."

Second, RAG has no temporal awareness. A fact stored yesterday and a fact stored six months ago are indistinguishable at retrieval time unless you manually bolt on metadata filtering. For agents operating over long time horizons — monitoring markets, managing projects, maintaining codebases — this is a fundamental gap.

Third, when multiple agents share a memory store, the noise problem compounds. Every agent writes its observations into the same pool. Without curation, the shared memory becomes a landfill rather than a library.

We covered these problems in depth in the goldfish brain post. The short version: vector similarity is necessary but not sufficient for agent memory. What changed this month is that researchers started building the layers that go on top.


Budget-Tiered Memory: Learning What to Remember

The first piece of the emerging architecture comes from Zhang et al.'s BudgetMem [1], which introduces a deceptively simple insight: not all memories need the same fidelity.

Consider a software engineering agent working through a complex debugging session. When it retrieves background context about the codebase architecture, a rough summary suffices. When it retrieves the exact error message and stack trace, precision matters. When it retrieves the user's original requirements to verify its fix, it needs the complete, unaltered original.

BudgetMem formalizes this intuition. The framework structures memory processing into modules, each offered at three budget tiers — Low, Mid, and High. A Low-tier memory operation might use a simple, fast method. A High-tier operation uses more compute, a more capable model, or a more sophisticated retrieval strategy. The crucial element is the router: a compact neural policy, trained with reinforcement learning, that decides which tier to use for each incoming query.

The paper explores three complementary strategies for realizing these tiers. Implementation complexity varies the sophistication of the memory method itself. Reasoning depth varies how much inference-time compute the agent spends on organizing or retrieving the memory. Capacity varies the size of the model handling memory operations.
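
To make the routing idea concrete, here is a minimal Python sketch. The tier costs, thresholds, and score_query() policy are illustrative placeholders, not BudgetMem's implementation; in the paper the decision is made by a compact RL-trained policy rather than hand-set rules.

```python
from enum import Enum
from typing import Callable

class Tier(Enum):
    LOW = "low"    # cheap summary lookup
    MID = "mid"    # standard embedding retrieval
    HIGH = "high"  # full reranking with a more capable model

# Hypothetical per-tier costs; a real system would plug in actual
# retrieval pipelines of increasing fidelity and expense.
TIER_COST = {Tier.LOW: 1.0, Tier.MID: 4.0, Tier.HIGH: 20.0}

def route(query: str, score_query: Callable[[str], float],
          budget_remaining: float) -> Tier:
    """Pick the cheapest tier whose expected payoff justifies its cost.

    score_query stands in for a learned policy that estimates how much
    the query benefits from high-fidelity memory (0.0 to 1.0).
    """
    demand = score_query(query)
    if demand > 0.8 and budget_remaining >= TIER_COST[Tier.HIGH]:
        return Tier.HIGH
    if demand > 0.4 and budget_remaining >= TIER_COST[Tier.MID]:
        return Tier.MID
    return Tier.LOW
```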

The results across three benchmarks — LoCoMo, LongMemEval, and HotpotQA — demonstrate that this tiered approach outperforms strong baselines when performance is prioritized (the High-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. The RL router learns, without manual rules, when to spend memory budget and when to conserve it.

For agent builders, the practical implication is clear. If you are running agents at scale, you are paying for memory operations. A flat architecture where every memory gets the same expensive treatment is wasting resources. A tiered architecture that allocates budget based on query demands is both cheaper and, counterintuitively, often more accurate — because it forces the system to concentrate high-fidelity processing where it actually matters.

This echoes a pattern we see throughout the shift from prompt engineering to genuine agent architecture: the solutions that scale are the ones that make allocation decisions explicit rather than treating everything uniformly.


Shared Memory: Learning What to Share

The second architectural piece addresses a problem that grows with every agent you add to a system: what happens when parallel agents share a memory bank?

Fu et al.'s LatentMem [2] tackles this directly. In multi-agent systems — the kind we explored in our coverage of agents that coordinate, audit, and trade with each other — agents need to share knowledge. One agent discovers that a customer's preferences have changed. Another agent, handling fulfillment, needs to know. The naive solution is a shared memory pool where every agent dumps its observations and every agent reads from the same store.

This naive approach fails predictably. When multiple agents with different roles, different objectives, and different context windows all write to the same memory, the result is noise. An observation critical to one agent is irrelevant to another. The shared memory becomes polluted with role-specific artifacts that degrade retrieval quality for everyone.

LatentMem introduces a learnable framework with two core components. An experience bank stores raw interaction trajectories in a compact form. A memory composer then synthesizes condensed, relevant memories conditioned on both the retrieved experience and the specific agent's role and context. Crucially, the memory composer does not produce one-size-fits-all summaries. It customizes what each agent sees based on that agent's needs.
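
A rough sketch of that two-component shape, with plain dataclasses standing in for the paper's learned latent representations. The compose_for_role() function is an illustrative stand-in for the trained memory composer, not LatentMem's actual mechanism:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    agent_role: str
    observation: str
    tags: set[str] = field(default_factory=set)

@dataclass
class ExperienceBank:
    """Stores raw interaction trajectories written by all agents."""
    trajectories: list[Trajectory] = field(default_factory=list)

    def write(self, traj: Trajectory) -> None:
        self.trajectories.append(traj)

    def retrieve(self, tags: set[str]) -> list[Trajectory]:
        return [t for t in self.trajectories if t.tags & tags]

def compose_for_role(bank: ExperienceBank, role: str,
                     interests: set[str]) -> str:
    """Stand-in for a learned composer: condense retrieved experience
    into a memory tailored to one agent's role, rather than handing
    every agent the same raw pool."""
    relevant = bank.retrieve(interests)
    lines = [f"[{t.agent_role}] {t.observation}" for t in relevant]
    return f"Memory for {role}:\n" + "\n".join(lines)
```

The point of the sketch is only the interface: shared writes into one bank, role-conditioned reads out of it.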

The framework uses a training method called Latent Memory Policy Optimization to learn these compact representations. The results are striking: performance improvements of up to 19.36% over existing memory approaches, achieved without modifying the underlying multi-agent system architecture. This is a drop-in enhancement, not a ground-up redesign.

The deeper insight here is about the difference between sharing information and sharing memory. Information sharing is easy — broadcast everything to everyone. Memory sharing requires curation. It requires a system that understands which pieces of shared experience are relevant to which agent, and presents them accordingly.

This is the same principle that makes human organizational memory work. A company does not give every employee access to an unsorted dump of every email, meeting note, and document ever produced. It organizes knowledge by role, by relevance, by need-to-know. LatentMem brings this principle to multi-agent systems with a learned controller rather than hand-crafted rules.

For teams building multi-agent architectures, the takeaway is that shared memory needs an admission policy. Without one, scaling agents means scaling noise. With a learned admission and customization layer, scaling agents can mean scaling collective intelligence.


What Actually Matters: Empirical Evidence at Scale

The third contribution to the emerging architecture is the most sobering for practitioners who have spent months optimizing prompt formats and context window arrangements.

McMillan's structured context engineering study [3] ran 9,649 experiments across 11 models, 4 data formats (YAML, Markdown, JSON, and a custom Token-Oriented Object Notation, or TOON), and schemas ranging from 10 to 10,000 tables. The goal was to answer a question that the agent-building community has debated endlessly: how should you format and structure context for LLM agents?

The findings upend several common assumptions.

Format barely matters in aggregate. The choice between YAML, Markdown, JSON, and TOON showed no statistically significant effect on aggregate accuracy (chi-squared = 2.45, p = 0.484). Individual models showed format-specific preferences, but across the board, the format holy wars have been a distraction.

Model capability dominates everything. The study found a 21 percentage-point accuracy gap between frontier-tier models (Claude, GPT, Gemini) and open-source alternatives. This gap dwarfs any effect from formatting, context engineering, or architectural choices. If you have budget to spend, spend it on a better model before you spend it on fancier context formatting.

Domain partitioning is the key to scaling. File-native agents successfully scaled to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. The trick is not cramming more into a single context window — it is organizing information into domain-specific partitions that the agent can navigate.

Compact formats can backfire. Paradoxically, token-efficient formats sometimes consumed more tokens at scale, because navigating them pushed models into unfamiliar search patterns. Familiarity beats compression.
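
The domain-partitioning finding is the easiest of these to picture in code. A minimal sketch, assuming a hypothetical file-native layout with one schema file per domain, so the agent loads only the partition it needs (the paths and helper names are illustrative):

```python
from pathlib import Path
import json

def load_domain_index(root: Path) -> dict[str, Path]:
    """Map domain names to their schema files, e.g. schemas/sales.json."""
    return {p.stem: p for p in root.glob("*.json")}

def navigate(root: Path, domain: str, table: str) -> dict:
    """Two-hop navigation: pick the domain file first, then the table inside it.

    The agent only ever loads one domain's schema into context, which is
    what lets navigation scale to thousands of tables.
    """
    index = load_domain_index(root)
    schema = json.loads(index[domain].read_text())
    return schema["tables"][table]

# Usage (assuming a schemas/ directory with one JSON file per domain):
# columns = navigate(Path("schemas"), "sales", "orders")
```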

For agent builders, this study provides empirical grounding for resource allocation decisions. The community has spent enormous energy on prompt formatting and context window optimization. This study suggests that energy is better spent on model selection and information architecture. The connection to reasoning depth is direct: a more capable model with simple formatting outperforms a less capable model with meticulously engineered context.

This does not mean context engineering is irrelevant. It means the hierarchy of importance is: model capability first, domain architecture second, format details a distant third.


Time-Aware Memory: Knowing When It Matters

The final piece of the emerging architecture addresses a dimension that most agent memory systems ignore entirely: time.

Recent work on temporal knowledge graphs — notably Zep's Graphiti engine [4] and the MemoTime framework [5] — introduces memory systems that understand not just what was stored, but when it was relevant and how temporal relationships between facts affect reasoning.

Consider an agent managing a portfolio. A company's earnings report from last quarter is relevant context, but its relevance decays and transforms over time. The CEO's statement from three years ago might become critically relevant again when a pattern repeats. A market event from yesterday overrides a projection from last week. Traditional RAG retrieves by semantic similarity, treating all of these as equally current. Temporal knowledge graphs encode the time dimension directly.

Zep's Graphiti engine, introduced by Rasmussen et al., is a temporally aware knowledge graph engine that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships [4]. It tracks not just facts but fact validity periods — when something became true, when it ceased being true, and how facts relate across time. In benchmarks, this approach showed up to 18.5% accuracy improvement and 90% latency reduction compared to baselines, with particular strength in enterprise scenarios requiring temporal reasoning.
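
A minimal sketch of the fact-validity idea. This is not Graphiti's API, just an illustration of facts carrying validity intervals and supporting point-in-time queries:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TemporalFact:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: datetime | None = None  # None means "still true"

class TemporalGraph:
    def __init__(self):
        self.facts: list[TemporalFact] = []

    def assert_fact(self, fact: TemporalFact) -> None:
        # Close out any earlier, still-open version of the same relation.
        for f in self.facts:
            if (f.subject, f.predicate) == (fact.subject, fact.predicate) \
                    and f.valid_to is None:
                f.valid_to = fact.valid_from
        self.facts.append(fact)

    def as_of(self, subject: str, predicate: str,
              when: datetime) -> list[TemporalFact]:
        """What did we believe about this relation at a given moment?"""
        return [f for f in self.facts
                if f.subject == subject and f.predicate == predicate
                and f.valid_from <= when
                and (f.valid_to is None or when < f.valid_to)]
```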

MemoTime, from Tan et al., pushes further with a hierarchical "Tree of Time" structure that decomposes complex temporal questions while enforcing monotonic timestamps and unified temporal bounds across multiple entities [5]. The framework combines structured grounding with recursive reasoning and continual experience learning. The results are substantial: up to 24% improvement over strong baselines, with smaller models achieving performance comparable to GPT-4-Turbo.

For long-running agents — the kind that maintain state over days, weeks, or months — temporal awareness is not optional. An agent that cannot distinguish between current facts and historical facts, that cannot reason about the sequence of events, and that cannot understand temporal dependencies between entities is an agent that will make increasingly wrong decisions as time passes.

The temporal knowledge graph approach also connects naturally to the memory architecture described by the other papers. Budget tiers can prioritize recent, high-relevance temporal facts for high-fidelity processing while compressing older historical context into lower tiers. Shared memory banks can use temporal relevance to filter what gets admitted — an event from an hour ago is more likely to matter to collaborating agents than one from six months ago. The empirical findings on domain partitioning extend to temporal partitioning: organizing memory by time period enables the same scaling benefits as organizing by domain.
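
One way those pieces might compose is a blended admission score. The sketch below uses a hypothetical exponential-decay recency term; the half-life and weights are illustrative, not drawn from any of the cited papers:

```python
import math
from datetime import datetime, timedelta

def temporal_relevance(event_time: datetime, now: datetime,
                       half_life: timedelta = timedelta(days=7)) -> float:
    """Exponential decay: an hour-old event scores near 1.0,
    a six-month-old event near 0.0."""
    age = (now - event_time).total_seconds()
    return math.exp(-math.log(2) * age / half_life.total_seconds())

def admission_score(similarity: float, event_time: datetime,
                    now: datetime, recency_weight: float = 0.4) -> float:
    """Blend semantic similarity with temporal relevance before deciding
    whether a shared memory is worth surfacing to collaborating agents."""
    return (1 - recency_weight) * similarity + \
        recency_weight * temporal_relevance(event_time, now)
```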


The Emerging Architecture

These research threads are converging on something that looks like a genuine architecture for agent memory. Not a single monolithic system, but a set of composable layers.

At the base: a retrieval layer that goes beyond flat vector similarity. Domain-partitioned storage that organizes information by topic, by time period, by relevance tier. This is where the context engineering findings [3] inform the design — structure the information architecture before worrying about formatting.

Above that: a budget-routing layer that allocates computational resources based on query demands. Not every retrieval needs the same fidelity. The BudgetMem router [1] demonstrates that an RL-trained policy can make these allocation decisions better than fixed rules.

Parallel to that: a sharing and admission layer for multi-agent systems. When agents collaborate, their shared memory needs curation. LatentMem's learned composer [2] shows that customizing shared memory per agent role outperforms naive broadcasting.

Threading through all layers: temporal awareness. Knowledge graphs with time-aware edges [4][5] that track when facts were valid, how they relate across time, and which temporal context matters for the current query.
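
If the layers do get assembled, the interfaces already suggest a shape. A speculative sketch, with every class and method name hypothetical; this is a composition outline, not an implementation of any cited system:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Protocol

class Retriever(Protocol):
    def search(self, query: str, tier: str) -> list[str]: ...

class Router(Protocol):
    def route(self, query: str) -> str: ...

class Composer(Protocol):
    def compose(self, candidates: list[str], role: str) -> str: ...

class TemporalStore(Protocol):
    def current_facts(self, query: str, now: datetime) -> list[str]: ...

@dataclass
class MemoryStack:
    retrieval: Retriever     # domain/time-partitioned storage [3]
    router: Router           # query-aware budget-tier selection [1]
    composer: Composer       # per-agent curation of shared memory [2]
    temporal: TemporalStore  # validity-aware fact lookup [4][5]

    def recall(self, query: str, agent_role: str, now: datetime) -> str:
        tier = self.router.route(query)                  # how much to spend
        candidates = self.retrieval.search(query, tier)  # where to look
        candidates += self.temporal.current_facts(query, now)
        return self.composer.compose(candidates, agent_role)
```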

What is still missing? Forgetting policies — principled mechanisms for deciding when to discard or compress old memories. Cross-session persistence that survives agent restarts without accumulating unbounded storage. And integration: no single system yet combines all four of these layers into a unified agent memory stack. These are the pieces. The assembly is still ahead of us.


Conclusion

Six months ago, agent memory meant "add a vector store." The goldfish brain problem was real and unsolved. Today, the outlines of a proper engineering discipline are visible. Budget-aware routing. Learned admission for shared memory. Empirical grounding for design decisions. Temporal knowledge graphs that encode when, not just what.

The goldfish is not yet an elephant. But it is evolving toward something with real memory architecture — hierarchical, temporally aware, budget-conscious, and designed for multi-agent collaboration. The open questions remain substantial: how to forget gracefully, how to persist across sessions without drowning in history, and how to integrate these layers into production systems that run for months.

We will be tracking those questions in future coverage. For now, the research direction is clear: agent memory is becoming an engineering discipline, not a hack.


Disclosure: Swarm Signal is an independent publication. We have no financial relationships with the research teams, institutions, or companies cited in this article. All papers referenced are publicly available on arXiv.


References

[1] Zhang, H., Yue, H., Feng, T., Long, Q., Bao, J., Jin, B., Zhang, W., Li, X., You, J., Qin, C., & Wang, W. (2026). "Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory." arXiv:2602.06025. https://arxiv.org/abs/2602.06025

[2] Fu, M., Zhang, G., Xue, X., Li, Y., He, Z., Huang, S., Qu, X., Cheng, Y., & Yang, Y. (2026). "LatentMem: Customizing Latent Memory for Multi-Agent Systems." arXiv:2602.03036. https://arxiv.org/abs/2602.03036

[3] McMillan, D. (2026). "Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale." arXiv:2602.05447. https://arxiv.org/abs/2602.05447

[4] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). "Zep: A Temporal Knowledge Graph Architecture for Agent Memory." arXiv:2501.13956. https://arxiv.org/abs/2501.13956

[5] Tan, X., Wang, X., Liu, Q., Xu, X., Yuan, X., Zhu, L., & Zhang, W. (2025). "MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning." arXiv:2510.13614. https://arxiv.org/abs/2510.13614