Most agent memory systems fail because they treat persistence as a single problem. Six months of testing 33 memory frameworks point to a clear pattern: production agent state requires three distinct layers, each solving a different failure mode.
The typical approach (vector store + conversation history) breaks down when sessions run long, when agents need to write structured data, or when you need to audit what the agent actually remembers. The working architecture emerged from production use on OpenClaw, a framework with 350K+ GitHub stars.
The Three-Layer Memory Stack
Agent memory maps to three storage primitives, each with different latency, durability, and query characteristics.
Layer 1: Conversation Compression
This solves the context window problem. Without compression, an agent loses the start of a long conversation once the token limit is reached. The solution is a directed acyclic graph (DAG) of summaries.
Recent turns stay verbatim. Older turns get compressed into summaries. The oldest summaries get merged into higher-level summaries. The agent always has access to the full conversation arc without hitting token limits.
Lossless-Claw implements this with a DAG structure where each node is either a raw turn or a summary. When the context window fills, the compressor walks the DAG, merges the oldest leaf summaries, and writes a new parent node. The agent’s prompt includes the compressed DAG, not the raw history.
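The merge step can be sketched in a few lines. This is a simplified illustration, not Lossless-Claw’s actual API: `DagCompressor`, `Node`, `count_tokens`, and `summarize` are hypothetical names, the token counter just counts words, and `summarize` stands in for what would be a model call. It merges the oldest frontier nodes (raw turns or summaries) under a new parent and keeps the most recent turns verbatim.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    kind: str  # "turn" or "summary"
    children: list["Node"] = field(default_factory=list)


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def summarize(nodes: list[Node]) -> str:
    # Placeholder for a model call: keep the first four words of each child.
    return " / ".join(" ".join(n.text.split()[:4]) for n in nodes)


class DagCompressor:
    def __init__(self, max_tokens: int, keep_recent: int = 2, fan_in: int = 2):
        self.max_tokens = max_tokens
        self.keep_recent = keep_recent  # newest nodes that are never merged
        self.fan_in = fan_in            # nodes merged per new parent (must be >= 2)
        self.frontier: list[Node] = []  # what goes into the prompt, oldest first

    def total_tokens(self) -> int:
        return sum(count_tokens(n.text) for n in self.frontier)

    def add_turn(self, role: str, content: str) -> None:
        self.frontier.append(Node(f"{role}: {content}", "turn"))
        self._compress()

    def _compress(self) -> None:
        # Merge the oldest nodes under a new parent until the budget is met
        # or only the protected recent nodes would be left to merge.
        while (self.total_tokens() > self.max_tokens
               and len(self.frontier) - self.keep_recent >= self.fan_in):
            oldest = self.frontier[: self.fan_in]
            parent = Node(summarize(oldest), "summary", children=oldest)
            self.frontier[: self.fan_in] = [parent]

    def render(self) -> str:
        return "\n".join(n.text for n in self.frontier)
```

The children stay attached to each parent, so the raw turns remain reachable from the summary even though only the frontier is rendered into the prompt.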
Layer 2: File-Based Persistent Memory
This is the durable record. Plain markdown files the agent reads and writes: daily journals (2026-05-28.md), a curated MEMORY.md, preference files, project notes.
Why files instead of a database? Three reasons:
- Version control: Git tracks every change the agent makes
- Human readability: You can audit and edit memory without tooling
- Portability: No schema migrations, no database dependencies
The agent writes structured entries to these files. A local embedding model (QMD runs a 333MB GGUF model) indexes the files for semantic search. When the agent needs context, it queries by meaning, not keywords. “How did we handle the auth migration?” retrieves the right journal entry even if it never used the word “auth.”
Layer 3: Semantic Search Over Native Files
This bridges the gap between durable storage and retrieval. The embedding model runs locally, so search is sub-second and no data leaves the machine.
The search layer indexes markdown files as they change. Each paragraph becomes a vector. When the agent queries, the model returns the top-k most semantically similar paragraphs, along with file paths and line numbers.
This is not a conventional RAG pipeline. The agent doesn’t generate answers from the retrieved chunks alone. It uses search to find the right starting point, then reads the actual files.
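A rough sketch of paragraph-level indexing with file paths and line numbers follows. The function names are hypothetical, and `embed` is a deliberate stand-in: it builds a bag-of-words count vector, which matches keywords rather than meaning, so a real setup would swap in the local embedding model at that point.

```python
import math
import re
from collections import Counter


def split_paragraphs(path: str, text: str) -> list[dict]:
    """Split on blank lines, recording the 1-based line where each paragraph starts."""
    chunks, buf, start = [], [], None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if start is None:
                start = lineno
            buf.append(line)
        elif buf:
            chunks.append({"path": path, "line": start, "text": "\n".join(buf)})
            buf, start = [], None
    if buf:
        chunks.append({"path": path, "line": start, "text": "\n".join(buf)})
    return chunks


def embed(text: str) -> Counter:
    # Stand-in for the local embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(index: list[dict], question: str, k: int = 3) -> list[dict]:
    """Return the k paragraphs most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(index, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]
```

Each hit carries `path` and `line`, which is what lets the agent open the file at the right place instead of answering from the chunk itself.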
Storage Primitive Trade-offs
Different memory types map to different storage backends. Here’s what actually works in production.
| Memory Type | Storage Backend | Retrieval Pattern | Failure Mode |
|---|---|---|---|
| Episodic (what happened) | Markdown files + Git | Semantic search over journals | Embedding drift as vocabulary changes |
| Semantic (facts, preferences) | Curated markdown files | Direct file read + search fallback | Stale data if agent doesn’t update |
| Procedural (how to do things) | Code files, runbooks | Grep + semantic search | Conflicts when agent edits working code |
| Short-term (current session) | DAG compression in memory | Linear scan of recent nodes | Memory leak if DAG isn’t pruned |
The key insight: episodic and semantic memory need different write patterns. Episodic memory is append-only (daily journals). Semantic memory is edit-in-place (the agent updates MEMORY.md when preferences change).
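The two write patterns look roughly like this. It is a minimal sketch: the function names and the `- key: value` line format for MEMORY.md are assumptions made for illustration, not a format the source prescribes.

```python
from datetime import date
from pathlib import Path


def append_journal(root: Path, day: date, entry: str) -> Path:
    """Episodic write: only ever add to the end of the day's journal."""
    path = root / f"{day.isoformat()}.md"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {entry}\n")
    return path


def set_preference(root: Path, key: str, value: str) -> Path:
    """Semantic write: replace the existing line for `key`, or add one if absent."""
    path = root / "MEMORY.md"
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    prefix = f"- {key}:"
    new_line = f"{prefix} {value}"
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            lines[i] = new_line  # edit in place: the old value is gone from the file
            break
    else:
        lines.append(new_line)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

With the files under Git, the overwritten preference is still recoverable from history, which is why edit-in-place is tolerable for semantic memory.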
Implementation: The Hybrid Architecture
The working system combines all three layers with explicit handoffs.
```python
from datetime import datetime

# DAGCompressor, FileMemory, and LocalEmbedding are the three layer
# implementations; they are assumed to be defined elsewhere.


class AgentMemory:
    def __init__(self, workspace_path):
        self.compressor = DAGCompressor(max_tokens=8000)
        self.files = FileMemory(workspace_path)
        self.search = LocalEmbedding(model="qmd-333m.gguf")

    def add_turn(self, role, content):
        # Layer 1: Add to conversation DAG
        self.compressor.add_turn(role, content)
        # Layer 2: Write to daily journal if significant
        # (_is_significant is the agent's own judgment call, not shown here)
        if self._is_significant(content):
            today = datetime.now().strftime("%Y-%m-%d")
            self.files.append(f"{today}.md", content)

    def query(self, question):
        # Layer 3: Semantic search over files
        results = self.search.query(question, top_k=5)
        # Layer 2: Read full context from files, once per file even if
        # several hits point into the same file
        context = []
        seen = set()
        for result in results:
            if result.file_path in seen:
                continue
            seen.add(result.file_path)
            context.append(self.files.read(result.file_path))
        return context

    def get_context_for_prompt(self):
        # Layer 1: Compressed conversation history
        compressed = self.compressor.get_compressed_history()
        # Layer 2: Curated memory file
        memory = self.files.read("MEMORY.md")
        return f"{memory}\n\n{compressed}"
```
The handoff points matter. The compressor runs after every turn. The file writer runs only for significant events (the agent decides what’s significant). The search index rebuilds incrementally as files change.
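One way to drive the incremental rebuild is to compare content hashes and re-embed only the files that differ. The sketch below assumes that approach; `changed_files` and the `seen` mapping are hypothetical, and a file watcher or modification times would work as well.

```python
import hashlib
from pathlib import Path


def changed_files(root: Path, seen: dict[str, str]) -> list[Path]:
    """Return markdown files whose content hash differs from `seen`.

    `seen` maps relative path to the last indexed SHA-256 digest and is
    updated in place, so a second call with no edits returns nothing.
    """
    changed = []
    for path in sorted(root.rglob("*.md")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        key = str(path.relative_to(root))
        if seen.get(key) != digest:
            seen[key] = digest
            changed.append(path)
    return changed
```

Only the returned files need to be split and re-embedded; deleted files would need a separate pass that drops keys no longer on disk.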
Schema Evolution and Garbage Collection
Production memory systems need two maintenance operations: schema evolution and garbage collection.
Schema Evolution
As the agent’s tasks change, the memory schema changes. A project memory file might start with just a title and description, then add sections for architecture decisions, deployment notes, and known issues.
The file-based approach handles this naturally. The agent reads the current schema, decides what to add, and writes the new structure. No migration scripts, no downtime.
Garbage Collection
Old journal entries accumulate. The search index grows. The DAG compressor holds references to turns that are no longer relevant.
The working approach: time-based expiration with manual override. Journal entries older than 90 days get archived (moved to an archive/ directory, removed from the search index). The agent can still read archived files if it needs to, but they don’t pollute search results.
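The archival rule could be implemented along these lines. This is a sketch under assumptions: journals are named by ISO date as in the examples above, `archive_old_journals` is a hypothetical helper, and the `keep` set is one possible shape for the manual override.

```python
from datetime import date, timedelta
from pathlib import Path


def archive_old_journals(root: Path, today: date, max_age_days: int = 90,
                         keep: frozenset[str] = frozenset()) -> list[Path]:
    """Move dated journals older than `max_age_days` into root/archive/.

    Files named in `keep` are left alone. Returns the new paths so the
    caller can drop the same files from the search index.
    """
    archive = root / "archive"
    cutoff = today - timedelta(days=max_age_days)
    moved = []
    for path in sorted(root.glob("*.md")):
        try:
            day = date.fromisoformat(path.stem)
        except ValueError:
            continue  # not a dated journal (for example MEMORY.md)
        if day < cutoff and path.name not in keep:
            archive.mkdir(exist_ok=True)
            target = archive / path.name
            path.rename(target)
            moved.append(target)
    return moved
```

Because the files are moved rather than deleted, the agent can still read them by path when it needs to.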
The DAG compressor prunes nodes when the conversation ends. A “conversation” ends when the agent is idle for more than 30 minutes or when the user explicitly starts a new session.
Observability: What to Instrument
Memory systems fail silently. The agent retrieves the wrong context, writes to the wrong file, or skips writing entirely. You need instrumentation at every layer.
Layer 1 (Compression) Metrics
- DAG depth (how many summary levels exist)
- Compression ratio (original tokens / compressed tokens)
- Time since last prune
Layer 2 (Files) Metrics
- Write frequency per file
- File size growth rate
- Git commit frequency (if files aren’t being committed, something’s wrong)
Layer 3 (Search) Metrics
- Query latency
- Index size
- Top-k result relevance (manual spot checks)
The most useful metric: retrieval precision. Sample 20 queries per day and manually check whether the top-3 results are actually relevant. If precision drops below 70%, the likely causes are a stale index, vocabulary in the files that has shifted away from how queries are phrased, or a change in file structure.
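Both ratios are simple to compute once the manual judgments exist. The sketch below takes one reading of precision, the fraction of top-3 results judged relevant across all sampled queries; the function names are illustrative.

```python
def precision_at_k(judgments: list[list[bool]], k: int = 3) -> float:
    """judgments[i][j] is True if result j for sampled query i was judged relevant."""
    relevant = sum(sum(row[:k]) for row in judgments)
    total = sum(min(k, len(row)) for row in judgments)
    return relevant / total if total else 0.0


def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Layer 1 metric: original tokens divided by compressed tokens."""
    return original_tokens / compressed_tokens


# Two sampled queries, three judged results each: 3 of 6 were relevant.
sample = [[True, True, False], [True, False, False]]
needs_attention = precision_at_k(sample) < 0.70
```

Logging the per-query rows rather than just the aggregate makes it easier to see which kinds of question are degrading.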
Cross-Session State Migration
The hardest problem: an agent starts a task in one session, gets interrupted, and resumes in a new session. The new session needs to reconstruct the agent’s mental state.
The file-based approach solves this with explicit state files. When the agent starts a task, it writes a task-{id}.md file with:
- Task description
- Current step
- Blockers
- Next actions
When the session ends, the agent updates the file. When a new session starts, the agent reads all active task files and decides which one to resume.
This is not automatic. The agent must be prompted to write task files and to check for active tasks on startup. The prompt template includes:
On startup:
1. Read all files in tasks/ directory
2. If any task has status "in-progress", ask the user if they want to resume
3. If resuming, read the full task file and continue from the last step
Before ending a session:
1. Update all active task files with current status
2. If a task is blocked, write the blocker to the file
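A minimal sketch of the task files behind that prompt follows. The `name: value` line layout, the field names, and the helper functions are assumptions for illustration; values are limited to a single line to keep the parser trivial.

```python
from pathlib import Path

FIELDS = ("status", "description", "current_step", "blockers", "next_actions")


def write_task(tasks_dir: Path, task_id: str, **fields: str) -> Path:
    """Write (or overwrite) tasks/task-{id}.md with one `name: value` line per field."""
    tasks_dir.mkdir(parents=True, exist_ok=True)
    path = tasks_dir / f"task-{task_id}.md"
    body = "\n".join(f"{name}: {fields.get(name, '')}" for name in FIELDS)
    path.write_text(body + "\n", encoding="utf-8")
    return path


def read_task(path: Path) -> dict[str, str]:
    task = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        name, sep, value = line.partition(":")
        if sep:
            task[name.strip()] = value.strip()
    return task


def in_progress_tasks(tasks_dir: Path) -> list[dict[str, str]]:
    """Startup check: every task file whose status is "in-progress"."""
    active = []
    for path in sorted(tasks_dir.glob("task-*.md")):
        task = read_task(path)
        if task.get("status") == "in-progress":
            active.append(task)
    return active
```

On startup the agent would call `in_progress_tasks` and ask the user which, if any, to resume.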
Failure Modes and Mitigations
Every memory architecture has failure modes. Here’s what breaks and how to handle it.
Failure Mode 1: Embedding Drift
As the agent’s vocabulary changes, old embeddings become less relevant. A query for “authentication” might not retrieve entries that used “auth” or “login.”
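Mitigation: Rebuild the search index weekly. The local embedding model is fast enough (the 333MB GGUF model embeds roughly 100 paragraphs per second) that a full reindex takes minutes, not hours: 30,000 paragraphs is about five minutes. A rebuild only helps where the index is stale, since the same model maps unchanged text to the same vectors.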
Failure Mode 2: File Conflicts
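If the agent edits a file while you’re editing it, Git will surface the conflict only when the two edits land in separate commits or branches; in a shared working tree, the later write silently overwrites the earlier one. Either way, the agent won’t know how to resolve it.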
Mitigation: The agent writes to its own namespace. User files live in src/, agent files live in memory/. The agent never writes to user files without explicit permission.
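The namespace rule can be enforced in code rather than left to the prompt. The sketch below assumes every agent write goes through a single path resolver; `resolve_agent_path` is a hypothetical helper, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_agent_path(workspace: Path, relative: str, namespace: str = "memory") -> Path:
    """Resolve a path the agent wants to write, refusing anything outside memory/."""
    base = (workspace / namespace).resolve()
    target = (base / relative).resolve()
    # resolve() collapses ".." segments, so an escape attempt shows up here.
    if not target.is_relative_to(base):
        raise PermissionError(f"{relative!r} escapes the agent namespace")
    return target
```

A request for `../src/main.py` raises `PermissionError`, which the calling code can turn into an explicit permission prompt for the user.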
Failure Mode 3: Context Pollution
The agent retrieves too much irrelevant context and wastes tokens on noise.
Mitigation: Two-stage retrieval. First, semantic search returns 20 candidates. Second, a lightweight reranker (a smaller model or a simple heuristic like recency + relevance score) picks the top 5. The reranker runs in <100ms and dramatically improves precision.
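The heuristic variant of the second stage might look like this. The weighting is an assumption chosen for illustration (a 30-day recency half-life blended 30/70 with the similarity score), not a tuned value from the source, and `Candidate` and `rerank` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    path: str
    score: float   # similarity from stage one, assumed to lie in [0, 1]
    written: date  # when the entry was written


def rerank(candidates: list[Candidate], today: date, keep: int = 5,
           half_life_days: float = 30.0, recency_weight: float = 0.3) -> list[Candidate]:
    """Stage two: blend similarity with an exponential recency decay, keep the best."""
    def combined(c: Candidate) -> float:
        age = max((today - c.written).days, 0)
        recency = 0.5 ** (age / half_life_days)  # 1.0 today, 0.5 after one half-life
        return (1 - recency_weight) * c.score + recency_weight * recency

    return sorted(candidates, key=combined, reverse=True)[:keep]
```

Stage one would pass in its 20 candidates; with these weights a recent, slightly less similar entry can outrank an old, slightly more similar one.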
Technical Verdict
Use this architecture when:
- You need agents that remember across sessions
- You want human-readable, auditable memory
- You’re building on local infrastructure (no cloud dependencies)
- You need to version-control agent state
Avoid this architecture when:
- You need sub-10ms retrieval latency (file reads + embedding search take 50-200ms)
- You’re building multi-tenant systems (file-based memory doesn’t isolate well)
- You need complex graph queries (files don’t support graph traversal)
- You want zero maintenance (you’ll need to prune old journals and rebuild indexes)
The three-layer stack (compression + files + search) is not the only way to build agent memory, but it’s the only approach that survived six months of production use. The key insight: treat memory as a stack, not a single storage backend. Each layer solves a different problem, and the handoffs between layers are where the architecture either works or breaks.