The Real Problem With Agent Memory Isn't Storage. It's the Contract.

Why the real differentiator in agent systems is not the memory store, but the harness-enforced rules for what gets remembered, recalled, rewritten, and forgotten.

Estimated read time: 8 minutes

TL;DR – The industry is still talking about agent memory as if the main question is where to store it. That is no longer the hard part. The hard part is deciding what gets remembered, what gets compacted, what gets recalled, who can change it, and how it is governed. In other words: the real problem with agent memory is not storage. It is the memory contract.

Everyone agrees agent memory matters now.

LangChain is writing about it. TLDR keeps surfacing it. OpenAI is showing how memory fits into real internal agent systems. Infrastructure vendors are reframing it as a production architecture problem instead of a prompt trick.

That is progress.

But the conversation still has one big blind spot.

Most teams still treat memory like a storage upgrade.

They bolt on a vector database, save a few preferences, maybe summarize a transcript, and assume they have built an agent that remembers.

They have not.

They have built a place where information can accumulate. That is not the same thing as memory.

Because the hard part was never just storing information. The hard part is deciding:

what deserves to be remembered
what should stay transient
what survives compaction
what gets pulled back into context later
who can overwrite, delete, or share it
how any of this is evaluated over time

That is not a database question.

That is a runtime question.

Recent writing across the ecosystem all points in the same direction, even if it uses different language.

LangChain’s Memory for agents argued that memory is application-specific. Its newer Your harness, your memory made the sharper point: if you do not control the harness, you do not really control memory. TLDR recently highlighted Sebastian Raschka’s Components of a Coding Agent, which is one of the clearest explanations of how working memory, transcript compaction, repo context, and session state actually behave in modern harnesses. OpenAI’s in-house data agent shows memory as only one layer inside a broader context stack. Oracle’s Agent Memory: Why Your AI Has Amnesia and How to Fix It makes the enterprise point directly: retrieval, forgetting, scoping, and compliance are part of the problem, not implementation details.

Put differently:

Memory is not a sidecar. Memory is part of the runtime.

What I think is still missing is the right abstraction for that runtime.

My proposal is simple: agent memory needs a contract.

The Missing Abstraction: The Memory Contract

A memory contract is the explicit runtime agreement between the harness and the memory system.

It answers six questions:

What is allowed to enter memory?
What is kept only in the working set?
What survives compaction and becomes durable?
What gets retrieved back into context, and when?
Who is allowed to edit, delete, or share memory?
How do we audit, expire, and evaluate memory over time?

Short-term versus long-term memory is a useful starting point, but it is too coarse for production systems. It tells you where information lives. It does not tell you how information moves, mutates, earns the right to persist, or gets blocked from persisting.

That is why two agents with the same model and the same vector store can behave completely differently. Their memory contract is different.

Diagram showing the memory contract loop

Short-Term vs Long-Term Is Too Blunt

Most current discussions split memory into two buckets:

short-term memory: thread context, checkpoints, recent transcript, tool outputs
long-term memory: facts, preferences, episodic traces, learned behaviors across sessions

That framing is not wrong. It is just too blunt.

Consider a coding agent. Sebastian’s breakdown of the coding harness makes an important distinction between the full transcript and the distilled working memory. Those are not the same thing. One is an append-only record for resumption and audit. The other is a compact state optimized for the next model call.

Now consider OpenAI’s data agent. Its memory is not just “stored knowledge”. It sits alongside table usage, human annotations, code-derived enrichment, institutional knowledge, and live runtime queries. The system improves not because it “has memory” in the abstract, but because the harness decides when stored corrections outrank a naive fresh search.

Or take LangChain’s newer argument around open harnesses. The real concern is not just storage location. It is that hidden compaction, opaque summaries, and provider-managed state quietly define what the agent will remember tomorrow. That is a harness decision, even if it shows up as an API feature.

That is the punchline: memory quality is mostly a control problem.

The Four Planes of Agent Memory

The cleanest way to think about the memory contract is to break it into four planes.

Diagram showing the four planes of agent memory

1. Working Set

This is the actively promptable state for the current turn.

It should contain the live goal, a compact transcript, recent tool outputs, open decisions, and bounded reminders. It must stay small, relevant, and refreshable. If you let it become a dump of everything that happened, your agent becomes slower, noisier, and harder to steer.

This is where harness design matters most. A good harness clips logs, deduplicates repeated reads, preserves recent events at higher fidelity, and only carries forward what is still actionable.

2. Session Journal

This is the durable record of what actually happened in a session.

It includes user turns, tool calls, approvals, failures, retries, and final outputs. The session journal is not what the model should read every turn. It is what the system should preserve for replay, audit, debugging, evaluation, and background consolidation.

Many teams collapse this into chat history. That is a mistake. A useful journal is structured, queryable, and separate from the prompt budget.

3. Durable Memory

This is what survives across sessions and becomes part of the agent’s evolving baseline.

Recent articles are mostly aligned here. Durable memory usually has at least three subtypes:

semantic memory for facts, preferences, and stable relationships
episodic memory for past cases, traces, and successful or failed examples
procedural memory for learned instructions, policies, and behavioral updates

The important shift is that durable memory should not be treated as one undifferentiated retrieval corpus. A user’s preference, a successful debugging trace, and a revised system instruction have different update rules, different privacy scopes, and different recall conditions.

4. Policy Plane

This is the most overlooked layer.

The policy plane governs all the others. It decides namespace boundaries, user versus team memory, retention rules, expiration, deletion, provenance, trust level, and whether a memory can be promoted into the working set at all.

This is also where “forgetting” belongs. Forgetting should not be a bug or a cron side effect. It should be a deliberate rule inside the contract: decay low-value memories, invalidate contradicted facts, preserve audit-critical traces, and keep privileged memories out of the wrong context.

How the Memory Contract Works in Practice

Once you adopt this framing, the design questions get sharper fast.

Admission

Not everything deserves to become memory.

The harness should distinguish between:

transient noise: tool logs, one-off errors, exploratory dead ends
reusable evidence: stable facts, recurring constraints, preferred workflows
contract-changing events: corrections, approvals, policy updates, user preferences

Memory should be earned, not appended.

Consolidation

This is where the recent ecosystem work gets interesting.

LangChain distinguishes between hot-path updates and background updates. OpenAI describes a daily offline pipeline that converts multiple context layers into normalized retrieval artifacts. Google’s always-on memory direction points to the same pattern: many useful memories should be consolidated outside the latency-sensitive serving path.

I think this is the right design center: the serving agent should be fast, and the background agent should be thoughtful.

That means:

serving path for critical immediate saves
background path for extraction, deduplication, contradiction checks, and prompt optimization
explicit promotion rules from session journal to durable memory

Retrieval

Retrieval should not be a plain semantic search over everything that ever happened.

The contract should combine task context, permissions, recency, provenance, and memory type. Sometimes the best thing to recall is a user preference. Sometimes it is the last failed attempt. Sometimes it is a procedural rule that should always be present. Sometimes the right answer is to retrieve nothing at all.

This is why the phrase “memory system” can be misleading. What matters is not just the store. What matters is the retrieval planner in the harness.

Mutation and Deletion

One of the most underdeveloped parts of agent design today is deciding who gets to rewrite memory.

Should the agent be allowed to update its own instructions? Under what confidence threshold? Can user feedback directly overwrite durable memory, or does it create a pending candidate? Can a team-level correction override a personal memory? How are contradictions resolved?

Those are memory contract questions. They should be explicit.

Evaluation

OpenAI’s data agent article gets this right: evolving systems need evals, or they drift silently.

Memory should be evaluated on more than retrieval hit rate. The real metrics are things like:

did memory improve task success?
did it reduce repeated mistakes?
did it introduce stale or unsafe context?
did it increase latency or token cost beyond its value?
did it create cross-user leakage or governance risk?

If you cannot answer those questions, you do not yet have memory engineering. You have memory hope.

What This Changes for Real Agent Systems

Coding agents

For coding agents, the biggest win is separating working set from session journal and treating repo context as memory-adjacent but not identical to memory. The harness should decide what parts of file reads, diffs, test failures, and user preferences are worth carrying forward.

The most useful long-term memories are often not facts about the user. They are workflow patterns: preferred commands, recurring codebase caveats, naming conventions, and failure corrections that keep the agent from relearning the same lesson every week.

Research agents

For research agents, semantic memory is less about personalization and more about continuity: what counts as a trusted source, what claims were already verified, what contradictions remain unresolved, and how previous research trails ended.

The contract matters because the agent should not treat every retrieved note as equally authoritative. Provenance and freshness need to sit inside retrieval itself.

Support and operations agents

These systems need the strictest policy plane. User identity, approvals, compliance retention, and scoped recall matter more than clever summarization. A support agent that remembers everything but cannot explain or govern those memories is not production-ready.

The Strategic Takeaway

The next generation of strong agents will not win because they have “more memory.”

They will win because they have a better memory contract.

That contract will make short-term memory smaller and sharper, long-term memory more selective, retrieval more policy-aware, and background consolidation more intelligent. It will also make model portability more realistic, because the real asset will not be a hidden provider-managed state blob. It will be an explicit memory layer that your harness can inspect, evaluate, and evolve.

So if you are building agents right now, I would stop asking only:

Which vector database should we use?
Should we add long-term memory now or later?
Can the model fit more context?

And I would start asking:

What is our memory contract?
What enters memory, and why?
What survives compaction?
What is allowed back into the prompt?
Who owns memory when we switch models or providers?
How do we know memory is helping instead of polluting?

That is the level where agent memory stops being a demo feature and starts becoming infrastructure.

The Closing Bet

My bet is that over the next 12 to 18 months, the strongest agent products will not distinguish themselves primarily on model quality.

They will distinguish themselves on memory governance.

Not because memory is glamorous. Not because users will talk about namespaces, compaction policies, or background consolidation pipelines. But because those choices will quietly determine whether an agent becomes more useful with use, or more chaotic with age.

That is the real line between a memorable demo and a durable product.

The demo says, “Look, the agent remembers me.”

The product says, “The agent remembers the right things, forgets the wrong things, and improves without becoming dangerous, stale, or locked into somebody else’s black box.”

That is a much harder standard.

It is also the standard that matters.

If I were building an agent team today, I would treat memory contract design as a first-class systems problem, right alongside tool design, evaluations, and safety.

Because once agents become long-lived, memory is no longer a feature.

Memory becomes the product’s accumulated judgment.

Source Notes

This piece synthesizes ideas from recent posts by LangChain, TLDR-linked articles, OpenAI, and Oracle, especially: