Agent memory bridges the gap between a stateless LLM call and a system that knows what you did last week. The problem is simple to state: context windows are finite, sessions end, and agents need to remember facts, preferences, and task state across interactions. The solution space is messy.
Mikio Braun’s follow-up post surveys the current memory infrastructure landscape after his initial exploration. The key insight is that memory is not one problem. It is at least six different architectural patterns, each with different storage backends, retrieval latencies, and failure modes.
The Memory Stack
Braun’s survey identifies six categories of memory approaches. The three most relevant for production systems are:
Platform-native memory: ChatGPT’s saved memories, Claude’s conversation summaries, Gemini’s cross-product context. These are opaque, product-driven implementations. You get convenience but no control over retrieval logic or schema evolution.
Memory middleware: APIs like Mem0 and Letta that sit between your agent and storage. They extract facts from conversations, store them in vector or graph databases, and inject relevant context into prompts. You own the orchestration but delegate the memory layer.
Embedded memory libraries: Code you import into your agent runtime. Think LangChain’s memory modules or custom state managers in LangGraph. You control everything but also handle all the plumbing.
Platform memory gives you zero-config persistence but locks you into provider-specific retrieval logic. Middleware adds a network hop but lets you swap LLM providers. Embedded libraries give you full control at the cost of managing storage lifecycle yourself.
Storage Backend Trade-offs
Different memory types demand different storage shapes:
| Memory Type | Storage Backend | Retrieval Method | Latency Profile | Failure Mode |
|---|---|---|---|---|
| Episodic (conversation history) | Key-value store (Redis, DynamoDB) | Session ID lookup | <10ms | Without TTLs, session keys accumulate until Redis hits its memory limit; depending on the eviction policy, active sessions are evicted or writes start failing |
| Semantic (facts, preferences) | Vector DB (Pinecone, Weaviate) | Embedding similarity | 50-200ms | Embedding model updates invalidate similarity scores, returning irrelevant facts |
| Procedural (task workflows) | Graph DB (Neo4j, Memgraph) | Path traversal | 100-500ms | Graph traversal queries break when node properties change, requiring full reindex |
| Working memory (current task state) | In-memory (agent runtime) | Direct access | <1ms | Lost on crash, no persistence |
Most production systems use a hybrid: working memory in-process, episodic memory in a fast key-value store, and semantic memory in a vector database. The orchestration layer decides when to query each backend.
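The orchestration decision can be sketched in a few lines. This is an illustration only: `HybridMemory`, its two lookup callables, and the `needs_facts` flag are hypothetical names, and the backends are replaced by in-memory fakes so the example runs without Redis or a vector database.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HybridMemory:
    """Routes reads across three tiers. The lookup callables are stand-ins."""
    episodic_lookup: Callable[[str], list[str]]       # e.g. key-value fetch by session ID
    semantic_lookup: Callable[[str, int], list[str]]  # e.g. vector similarity search
    working: dict = field(default_factory=dict)       # in-process task state

    def recall(self, session_id: str, query: str, needs_facts: bool, k: int = 3) -> dict:
        context = {
            "task_state": dict(self.working),
            "recent_turns": self.episodic_lookup(session_id),
        }
        # Only pay for the slower semantic lookup when the caller asks for it.
        if needs_facts:
            context["relevant_facts"] = self.semantic_lookup(query, k)
        return context


# In-memory fakes (assumptions for the demo, not real backends).
sessions = {"s1": ["User: hi", "Agent: hello"]}
facts = ["prefers dark mode", "uses Python", "based in Berlin"]


def fake_semantic_lookup(query: str, k: int) -> list[str]:
    words = query.lower().split()
    return [f for f in facts if any(w in f.lower() for w in words)][:k]


memory = HybridMemory(
    episodic_lookup=lambda sid: sessions.get(sid, []),
    semantic_lookup=fake_semantic_lookup,
)
memory.working["step"] = "collect_requirements"

cheap = memory.recall("s1", "what language?", needs_facts=False)
full = memory.recall("s1", "python tips", needs_facts=True)
print(cheap)
print(full)
```

In a real system the `needs_facts` decision would come from the agent or a routing heuristic; here it is a plain boolean to keep the control flow visible.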
Retrieval Strategy Patterns
Storage is straightforward. Retrieval requires deciding which memories are relevant and when to query them.
Explicit retrieval: The agent explicitly calls a memory tool when it needs context. This keeps prompts lean but requires the agent to know when it is missing information. Works well for task-specific workflows where memory needs are predictable.
Automatic injection: The memory system queries relevant context before every agent turn and injects it into the prompt. Higher token cost, lower cognitive load on the agent. ChatGPT’s reference chat history works this way.
Hybrid with budget: Retrieve top-k memories by relevance, then prune to fit a token budget. Requires a scoring function (recency, relevance, user-flagged importance). Some middleware layers implement this pattern, though the specific retrieval logic varies by provider.
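As a rough sketch of the hybrid-with-budget pattern: rank candidates with a scoring function, take the top k, then drop whatever does not fit the token budget. The weights, the 30-day half-life, and the four-characters-per-token estimate are arbitrary assumptions for illustration, not values from any particular middleware.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    relevance: float      # similarity score in [0, 1], assumed to come from the retriever
    age_days: float
    pinned: bool = False  # user-flagged importance


def estimate_tokens(text: str) -> int:
    # Crude stand-in: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def score(m: Memory, half_life_days: float = 30.0) -> float:
    recency = 0.5 ** (m.age_days / half_life_days)  # exponential decay with age
    return 0.6 * m.relevance + 0.3 * recency + (0.1 if m.pinned else 0.0)


def select_within_budget(candidates: list[Memory], k: int, token_budget: int) -> list[Memory]:
    top_k = sorted(candidates, key=score, reverse=True)[:k]
    chosen, used = [], 0
    for m in top_k:
        cost = estimate_tokens(m.text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; a smaller entry may still fit
        chosen.append(m)
        used += cost
    return chosen


candidates = [
    Memory("User prefers concise answers.", relevance=0.9, age_days=2),
    Memory("User's deployment target is a Kubernetes cluster running in "
           "eu-west-1 with strict egress rules.", relevance=0.8, age_days=10),
    Memory("User likes tea.", relevance=0.2, age_days=400),
    Memory("Always cite sources.", relevance=0.5, age_days=90, pinned=True),
]
chosen = select_within_budget(candidates, k=3, token_budget=15)
for m in chosen:
    print(m.text)
```

The long Kubernetes memory ranks second but is skipped because it alone exceeds the remaining budget, which is the trade-off this pattern makes: relevance decides the order, the budget decides what survives.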
The latency problem is real. If your agent needs to query a vector database mid-execution, each retrieval adds roughly 100-200ms. For interactive workflows, that compounds fast: five sequential retrievals in a single turn add 0.5-1s of waiting. Some systems pre-fetch likely memories at session start, trading upfront latency for smoother execution.
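Pre-fetching can be sketched as a per-session cache that is warmed once and falls back to the slow path on a miss. `PrefetchingMemory` and the topic-keyed lookup are hypothetical; a real system would need its own way of guessing which memories a session is likely to need.

```python
from typing import Callable


class PrefetchingMemory:
    """Wraps a slow lookup with a per-session cache warmed at session start."""

    def __init__(self, slow_lookup: Callable[[str], list[str]]):
        self._slow_lookup = slow_lookup  # stand-in for e.g. a vector DB query
        self._cache: dict[str, list[str]] = {}
        self.slow_calls = 0              # counts round trips, for illustration

    def _fetch(self, topic: str) -> list[str]:
        self.slow_calls += 1
        return self._slow_lookup(topic)

    def warm(self, likely_topics: list[str]) -> None:
        # Pay the latency once, up front, for topics the session will probably need.
        for topic in likely_topics:
            self._cache[topic] = self._fetch(topic)

    def get(self, topic: str) -> list[str]:
        # Cache hit: no round trip. Miss: take the slow path and remember the result.
        if topic not in self._cache:
            self._cache[topic] = self._fetch(topic)
        return self._cache[topic]


store = {"billing": ["on annual plan"], "preferences": ["prefers email"]}
memory = PrefetchingMemory(lambda topic: store.get(topic, []))

memory.warm(["billing", "preferences"])
calls_after_warm = memory.slow_calls
billing = memory.get("billing")    # served from cache
shipping = memory.get("shipping")  # miss, one extra slow call
print(calls_after_warm, memory.slow_calls, billing, shipping)
```

The cache never invalidates in this sketch, so memories written mid-session would not show up until the next warm-up.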
Unbounded Growth and Pruning
Memory grows without bounds unless you intervene. Three common strategies:
- Time-based expiration: Drop memories older than N days. Simple but loses long-term context.
- Summarization: Periodically compress old memories into higher-level summaries. Reduces token count but loses detail. Requires a separate summarization pass (more LLM calls, more cost).
- Importance scoring: Tag memories with relevance scores and prune low-value entries. Requires either user feedback or a heuristic (access frequency, recency, explicit user saves).
None of these are free. Summarization adds latency and cost. Scoring requires instrumentation. Expiration loses data. The choice depends on whether you prioritize cost control, context preservation, or operational simplicity.
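Two of these strategies, time-based expiration and importance scoring, combine naturally in a single pruning pass. The sketch below assumes a hypothetical `StoredMemory` record with an access counter and a user-saved flag; the 90-day cutoff and the ordering heuristic are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredMemory:
    text: str
    created_at: datetime
    access_count: int = 0
    user_saved: bool = False


def prune(memories, now, max_age_days=90, max_entries=1000):
    """Drop expired entries, then keep the highest-value ones up to max_entries."""
    cutoff = now - timedelta(days=max_age_days)
    # Time-based expiration, except for memories the user explicitly saved.
    live = [m for m in memories if m.user_saved or m.created_at >= cutoff]

    def importance(m):
        age_days = (now - m.created_at).total_seconds() / 86400
        # Heuristic: explicit saves first, then access frequency, then recency.
        return (m.user_saved, m.access_count, -age_days)

    kept = sorted(live, key=importance, reverse=True)[:max_entries]
    kept_ids = {id(m) for m in kept}
    dropped = [m for m in memories if id(m) not in kept_ids]
    return kept, dropped


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
memories = [
    StoredMemory("allergic to peanuts", now - timedelta(days=400), user_saved=True),
    StoredMemory("asked about refunds", now - timedelta(days=200), access_count=1),
    StoredMemory("prefers email", now - timedelta(days=10), access_count=5),
    StoredMemory("mentioned the weather", now - timedelta(days=5)),
]
kept, dropped = prune(memories, now, max_age_days=90, max_entries=2)
print([m.text for m in kept])
print([m.text for m in dropped])
```

Returning the dropped entries alongside the kept ones makes it easy to record what was deleted and why.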
Schema Evolution and Versioning
Agent capabilities evolve. You add new tools, change prompt structures, or refactor your orchestration graph. Your memory schema needs to keep up.
The problem: memories stored six months ago reference tools that no longer exist or use a fact schema that has changed. Do you migrate old memories? Ignore them? Re-embed them with the new schema?
Migration strategies:
- Lazy migration: Re-process memories on retrieval if they match an old schema. Spreads the cost but adds retrieval latency.
- Batch migration: Run a background job to update all memories when the schema changes. Upfront cost, clean retrieval.
- Versioned schemas: Store schema version with each memory and handle multiple versions in retrieval logic. More complex but avoids data loss.
If you are building a memory layer, version your schema from day one. Retrofitting versioning into an existing system requires migrating all stored memories, which can be expensive at scale.
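A minimal sketch of versioned schemas with lazy migration, assuming memories are stored as dictionaries in some key-value store. The field names (`fact`, `text`, `source`, `tags`) and the three schema versions are invented for the example.

```python
CURRENT_VERSION = 3


def _v1_to_v2(record):
    # Hypothetical change: v1 stored the text under "fact"; v2 renames it and adds "source".
    return {"schema_version": 2, "text": record["fact"], "source": "unknown"}


def _v2_to_v3(record):
    # Hypothetical change: v3 adds a "tags" list.
    return {**record, "schema_version": 3, "tags": []}


MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(record):
    """Apply migrations one step at a time until the record is current."""
    record = dict(record)
    version = record.get("schema_version", 1)  # assume unversioned records are v1
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version = record["schema_version"]
    return record


def read_memory(store, key):
    """Lazy migration: upgrade on read and write the upgraded record back."""
    record = store[key]
    upgraded = upgrade(record)
    if upgraded != record:
        store[key] = upgraded
    return upgraded


store = {
    "m1": {"fact": "prefers dark mode"},  # legacy, unversioned
    "m2": {"schema_version": 3, "text": "uses vim", "source": "chat", "tags": ["editor"]},
}
migrated = read_memory(store, "m1")
untouched = read_memory(store, "m2")
print(migrated)
```

A batch migration would run the same `upgrade` function over every key in a background job instead of on read, so the two strategies can share the migration code.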
Implementation Sketch
Here is a minimal hybrid memory setup using Redis for episodic storage and Pinecone for semantic retrieval:
```python
import hashlib
import json
import os

from openai import OpenAI
from pinecone import Pinecone
from redis import Redis

# Configuration
TTL_SECONDS = 86400 * 7  # 7 days


class AgentMemory:
    def __init__(self, session_id):
        self.session_id = session_id
        self.redis = Redis(host='localhost', port=6379)
        self.pinecone = Pinecone(api_key=os.environ['PINECONE_API_KEY'])
        self.index = self.pinecone.Index('agent-memory')
        self.openai = OpenAI()  # reads OPENAI_API_KEY from the environment

    def store_turn(self, user_msg, agent_msg):
        # Episodic: append to session history and refresh the TTL
        key = f"session:{self.session_id}"
        self.redis.rpush(key, f"User: {user_msg}", f"Agent: {agent_msg}")
        self.redis.expire(key, TTL_SECONDS)

        # Semantic: extract and embed facts
        facts = self._extract_facts(user_msg, agent_msg)
        vectors = []
        for fact in facts:
            # Built-in hash() is salted per process for strings, so use a
            # stable digest to keep IDs consistent across restarts.
            fact_id = hashlib.sha256(fact.encode()).hexdigest()[:16]
            vectors.append((
                f"{self.session_id}:{fact_id}",
                self._embed(fact),
                {"text": fact, "session": self.session_id},
            ))
        if vectors:
            self.index.upsert(vectors=vectors)  # one batched write per turn

    def retrieve_context(self, query, k=5):
        # Episodic: last 10 stored messages (5 turns)
        recent = self.redis.lrange(f"session:{self.session_id}", -10, -1)

        # Semantic: top-k relevant facts. Metadata is only returned when
        # include_metadata=True is set. No metadata filter is applied, so
        # this searches facts from every session in the index.
        query_emb = self._embed(query)
        results = self.index.query(vector=query_emb, top_k=k, include_metadata=True)
        facts = [r.metadata['text'] for r in results.matches]

        return {
            "recent_turns": [t.decode() for t in recent],
            "relevant_facts": facts,
        }

    def _extract_facts(self, user_msg, agent_msg):
        # Use LLM to extract memorable facts
        try:
            response = self.openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "system",
                    "content": "Extract factual statements about user preferences, past actions, or stated information. Return as JSON array of strings."
                }, {
                    "role": "user",
                    "content": f"User: {user_msg}\nAgent: {agent_msg}"
                }]
            )
            facts = json.loads(response.choices[0].message.content)
        except (json.JSONDecodeError, TypeError, IndexError):
            # Malformed JSON, empty choices, or a None message body
            return []
        # Keep only strings in case the model returns another shape
        if not isinstance(facts, list):
            return []
        return [f for f in facts if isinstance(f, str)]

    def _embed(self, text):
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
```
This pattern separates fast session retrieval from slower semantic search. As written, retrieve_context runs both lookups on every call; splitting it into two methods would let the agent pull recent context synchronously and add semantic facts only when the query warrants it.
Observability Gaps
Memory systems are hard to debug. Out of the box, you cannot see what the agent retrieved or why certain facts ranked above others. Useful instrumentation includes:
- Retrieval logs: What memories were fetched for each agent turn, with relevance scores.
- Memory drift metrics: Track cosine similarity between fact embeddings. Flag pairs with similarity above 0.85 that an LLM judge finds contradictory.
- Token budget tracking: How much of your context window is memory vs. task instructions.
- Pruning audit logs: What got deleted and why.
Without these, you are flying blind when the agent hallucinates based on stale or conflicting memories.
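Two of the items above, retrieval logs and drift candidates, need very little code to get started. The sketch below is an assumption-laden starting point: the log fields, the result shape (`id` and `score` keys), and the toy three-dimensional embeddings are all invented, and the LLM judge step is left out, so the output is only a list of pairs worth checking.

```python
import json
import math
import time
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def log_retrieval(turn_id, query, results, memory_tokens, prompt_tokens):
    """Build one structured log line per agent turn."""
    return json.dumps({
        "ts": time.time(),
        "turn_id": turn_id,
        "query": query,
        "retrieved": [{"id": r["id"], "score": round(r["score"], 3)} for r in results],
        # Share of the prompt taken up by injected memory.
        "memory_token_share": round(memory_tokens / prompt_tokens, 3),
    })


def drift_candidates(embeddings, threshold=0.85):
    """Return pairs of fact IDs similar enough to deserve a contradiction check."""
    return [
        (a, b)
        for (a, va), (b, vb) in combinations(embeddings.items(), 2)
        if cosine(va, vb) > threshold
    ]


# Toy embeddings, made up for the example.
embeddings = {
    "likes_coffee": [0.9, 0.1, 0.0],
    "hates_coffee": [0.88, 0.15, 0.02],
    "lives_in_oslo": [0.0, 0.2, 0.95],
}
pairs = drift_candidates(embeddings)
line = log_retrieval("t1", "coffee?", [{"id": "likes_coffee", "score": 0.91234}], 300, 1200)
print(pairs)
print(line)
```

The pairwise comparison is quadratic in the number of facts, so at scale it would have to run per user or over a nearest-neighbour index rather than over the whole store.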
Technical Verdict
Use agent memory when:
- Your agent needs to maintain state across sessions (customer support, personal assistants, long-running workflows).
- You can tolerate 100-200ms of latency per retrieval. This is acceptable for async workflows like email agents but becomes a problem in real-time chat interfaces, where users expect sub-second responses and several retrievals per turn can consume most of that budget.
- You have a strategy for pruning or summarizing unbounded memory growth.
- You can instrument retrieval to debug when memory causes incorrect behavior.
Avoid or defer when:
- Your agent is stateless by design (one-shot tasks, ephemeral queries).
- You cannot afford the operational overhead of managing vector databases and schema migrations.
- Your context window is large enough to fit all relevant history in-prompt. This is possible for agents operating on fixed datasets like a 50-page product manual that fits in a 128k context window.
- You are still iterating on core agent capabilities and memory would add too much surface area.
Memory is not a feature you bolt on at the end. It is a persistence layer with all the usual database problems: schema evolution, query optimization, data consistency, and observability. Treat it like infrastructure, not a library.