33 AI Memory Engines Tested: What Persistent Agent State Actually Requires

Most agent memory systems fail because they treat persistence as a single problem. After six months of testing 33 memory frameworks, one pattern stands out: production agent state requires three distinct layers, each solving a different failure mode.

The typical approach (vector store + conversation history) breaks down when sessions run long, when agents need to write structured data, or when you need to audit what the agent actually remembers. The architecture described here emerged from production use on OpenClaw, a framework with 350K+ GitHub stars.

The Three-Layer Memory Stack

Agent memory maps to three storage primitives, each with different latency, durability, and query characteristics.

Layer 1: Conversation Compression

This solves the context window problem. Without compression, agents forget the start of long conversations when token limits hit. The solution is a directed acyclic graph (DAG) of summaries.

Recent turns stay verbatim. Older turns get compressed into summaries. The oldest summaries get merged into higher-level summaries. The agent always has access to the full conversation arc without hitting token limits.

Lossless-Claw implements this with a DAG structure where each node is either a raw turn or a summary. When the context window fills, the compressor walks the DAG, merges the oldest leaf summaries, and writes a new parent node. The agent’s prompt includes the compressed DAG, not the raw history.
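To make the merge step concrete, here is a minimal sketch of that kind of compression. It is not Lossless-Claw's code: the Node class, the compress function, the keep_recent and fan_in parameters, and the word-count tokenizer are all assumptions made for illustration, and summarize is a placeholder for whatever model call produces a real summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A DAG node: a raw turn (no children) or a summary of its children."""
    text: str
    children: List["Node"] = field(default_factory=list)

    @property
    def is_summary(self) -> bool:
        return bool(self.children)


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def compress(frontier: List[Node], max_tokens: int,
             summarize: Callable[[List[Node]], str],
             keep_recent: int = 2, fan_in: int = 2) -> List[Node]:
    """Merge the oldest nodes into parent summaries until the frontier fits.

    `frontier` is ordered oldest to newest. The last `keep_recent` nodes are
    never merged, so recent turns stay verbatim. `fan_in` must be at least 2.
    """
    frontier = list(frontier)
    while (sum(count_tokens(n.text) for n in frontier) > max_tokens
           and len(frontier) > keep_recent + 1):
        take = min(fan_in, len(frontier) - keep_recent)
        oldest, rest = frontier[:take], frontier[take:]
        # The new parent keeps references to what it replaced, so the
        # original turns remain reachable from the summary.
        parent = Node(text=summarize(oldest), children=oldest)
        frontier = [parent] + rest
    return frontier


def stub_summarize(nodes: List[Node]) -> str:
    # Placeholder summarizer: keeps the first two words of each node.
    return " | ".join(" ".join(n.text.split()[:2]) for n in nodes)


turns = [Node(f"turn {i} " + "word " * 8) for i in range(6)]
prompt_nodes = compress(turns, max_tokens=35, summarize=stub_summarize)
```

The prompt is then built from prompt_nodes, oldest summary first. Because each summary holds its children, nothing is deleted; the raw turns are simply no longer part of what gets sent to the model.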

Layer 2: File-Based Persistent Memory

This is the durable record. Plain markdown files the agent reads and writes: daily journals (2026-05-28.md), a curated MEMORY.md, preference files, project notes.

Why files instead of a database? Three reasons:

Version control: Git tracks every change the agent makes
Human readability: You can audit and edit memory without tooling
Portability: No schema migrations, no database dependencies

The agent writes structured entries to these files. A local embedding model (QMD runs a 333MB GGUF model) indexes the files for semantic search. When the agent needs context, it queries by meaning, not keywords. “How did we handle the auth migration?” retrieves the right journal entry even if that entry never used the word “auth.”

Layer 3: Semantic Search Over Native Files

This bridges the gap between durable storage and retrieval. The embedding model runs locally, so search is sub-second and no data leaves the machine.

The search layer indexes markdown files as they change. Each paragraph becomes a vector. When the agent queries, the model returns the top-k most semantically similar paragraphs, along with file paths and line numbers.
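The shape of that index can be sketched in a few lines. Everything here is hypothetical: the Hit tuple, split_paragraphs, and search are illustrative names, and embed is a word-count placeholder rather than a learned embedding, so this toy version matches on shared words, not on meaning. The part it does show faithfully is the bookkeeping: one vector per paragraph, each carrying its file path and starting line.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterator, List, NamedTuple, Tuple


class Hit(NamedTuple):
    score: float
    file_path: str
    line_no: int   # 1-based line where the paragraph starts
    text: str


def split_paragraphs(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (start_line, paragraph) for blank-line-separated paragraphs."""
    start, buf = None, []
    for i, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if start is None:
                start = i
            buf.append(line)
        elif buf:
            yield start, "\n".join(buf)
            start, buf = None, []
    if buf:
        yield start, "\n".join(buf)


def embed(text: str) -> Counter:
    # Placeholder for the local embedding model: plain word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(files: Dict[str, str], query: str, top_k: int = 5) -> List[Hit]:
    """Score every paragraph against the query and return the best top_k."""
    q = embed(query)
    hits = [Hit(cosine(q, embed(p)), path, line, p)
            for path, text in files.items()
            for line, p in split_paragraphs(text)]
    return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]


files = {
    "2026-05-28.md": "# Journal\n\nMigrated login tokens to the new auth service.\n\nLunch was fine.",
    "MEMORY.md": "User prefers tabs.",
}
hits = search(files, "auth migration tokens", top_k=2)
```

Swapping embed for a real model changes the vectors but not the structure: the agent still gets back a path and a line number, which is what lets it open the file at the right place.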

This is not a RAG system. The agent doesn’t generate answers from retrieved chunks. It reads the actual files, using search to find the right starting point.

Storage Primitive Trade-offs

Different memory types map to different storage backends. Here’s what actually works in production.

Memory Type	Storage Backend	Retrieval Pattern	Failure Mode
Episodic (what happened)	Markdown files + Git	Semantic search over journals	Embedding drift as vocabulary changes
Semantic (facts, preferences)	Curated markdown files	Direct file read + search fallback	Stale data if agent doesn’t update
Procedural (how to do things)	Code files, runbooks	Grep + semantic search	Conflicts when agent edits working code
Short-term (current session)	DAG compression in memory	Linear scan of recent nodes	Memory leak if DAG isn’t pruned

The key insight: episodic and semantic memory need different write patterns. Episodic memory is append-only (daily journals). Semantic memory is edit-in-place (the agent updates MEMORY.md when preferences change).
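The two write patterns look different in code. The sketch below is one way to express them, assuming a "- key: value" line format for MEMORY.md that the article does not specify; append_journal and set_memory_fact are hypothetical helper names.

```python
from pathlib import Path


def append_journal(root: Path, day: str, entry: str) -> None:
    """Episodic write: append only, earlier entries are never rewritten."""
    path = root / f"{day}.md"
    with path.open("a", encoding="utf-8") as f:
        f.write(entry.rstrip() + "\n\n")


def set_memory_fact(root: Path, key: str, value: str) -> None:
    """Semantic write: replace the line for `key` in MEMORY.md, or add it.

    Assumes one fact per line in the form "- key: value".
    """
    path = root / "MEMORY.md"
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    prefix = f"- {key}:"
    new_line = f"{prefix} {value}"
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            lines[i] = new_line   # edit in place: the old value is gone
            break
    else:
        lines.append(new_line)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

With Git tracking the directory, the overwritten value in MEMORY.md is still recoverable from history, which is part of why edit-in-place is safe here.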

Implementation: The Hybrid Architecture

The working system combines all three layers with explicit handoffs.

from datetime import datetime

# DAGCompressor, FileMemory, and LocalEmbedding stand for the three layers
# described above, and _is_significant for the agent's own judgment call;
# none of them is defined in this snippet.

class AgentMemory:
    def __init__(self, workspace_path):
        self.compressor = DAGCompressor(max_tokens=8000)
        self.files = FileMemory(workspace_path)
        self.search = LocalEmbedding(model="qmd-333m.gguf")

    def add_turn(self, role, content):
        # Layer 1: Add to conversation DAG
        self.compressor.add_turn(role, content)

        # Layer 2: Write to daily journal if significant
        if self._is_significant(content):
            today = datetime.now().strftime("%Y-%m-%d")
            self.files.append(f"{today}.md", content)

    def query(self, question):
        # Layer 3: Semantic search over files
        results = self.search.query(question, top_k=5)

        # Layer 2: Read full context from files, once per file even when
        # several hits land in the same one
        context = []
        seen_paths = set()
        for result in results:
            if result.file_path in seen_paths:
                continue
            seen_paths.add(result.file_path)
            context.append(self.files.read(result.file_path))

        return context

    def get_context_for_prompt(self):
        # Layer 1: Compressed conversation history
        compressed = self.compressor.get_compressed_history()

        # Layer 2: Curated memory file
        memory = self.files.read("MEMORY.md")

        return f"{memory}\n\n{compressed}"

The handoff points matter. The compressor receives every turn, though it only merges nodes once the context window fills. The file writer runs only for significant events (the agent decides what’s significant). The search index rebuilds incrementally as files change.
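One simple way to get incremental rebuilds is to fingerprint each file and re-embed only the ones whose fingerprint changed. This is a sketch of that idea, not a description of how QMD does it; changed_files and the seen dictionary are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def changed_files(files: Dict[str, str],
                  seen: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare current file contents against stored digests.

    Returns (to_reindex, to_drop) and updates `seen` in place, so the next
    call only reports files that changed since this one.
    """
    to_reindex = []
    for path, text in files.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen.get(path) != digest:
            to_reindex.append(path)
            seen[path] = digest
    # Files that disappeared should have their vectors removed.
    to_drop = [p for p in seen if p not in files]
    for p in to_drop:
        del seen[p]
    return to_reindex, to_drop
```

Hashing content rather than comparing modification times avoids re-embedding a file that was touched but not changed, for example by a Git checkout.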

Schema Evolution and Garbage Collection

Production memory systems need two maintenance operations: schema evolution and garbage collection.

Schema Evolution

As the agent’s tasks change, the memory schema changes. A project memory file might start with just a title and description, then add sections for architecture decisions, deployment notes, and known issues.

The file-based approach handles this naturally. The agent reads the current schema, decides what to add, and writes the new structure. No migration scripts, no downtime.
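In practice the agent makes this edit itself, but the operation is simple enough to sketch deterministically. The ensure_sections helper below is hypothetical and assumes sections are second-level markdown headings ("## Name"); it adds whatever is missing and leaves existing content alone.

```python
from typing import Iterable


def ensure_sections(markdown: str, sections: Iterable[str]) -> str:
    """Append any "## <section>" heading that the document does not have yet."""
    present = {line[3:].strip().lower()
               for line in markdown.splitlines() if line.startswith("## ")}
    parts = [markdown.rstrip("\n")]
    for name in sections:
        if name.lower() not in present:
            parts.append(f"## {name}")
            present.add(name.lower())
    return "\n\n".join(parts) + "\n"


doc = "# Project X\n\nA short description.\n\n## Known Issues\n\n- flaky test\n"
wanted = ["Architecture Decisions", "Known Issues", "Deployment Notes"]
evolved = ensure_sections(doc, wanted)
```

The function is idempotent: running it again on its own output changes nothing, so it is safe to call every time the agent opens a project file.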

Garbage Collection

Old journal entries accumulate. The search index grows. The DAG compressor holds references to turns that are no longer relevant.

The working approach: time-based expiration with manual override. Journal entries older than 90 days get archived (moved to an archive/ directory, removed from the search index). The agent can still read archived files if it needs to, but they don’t pollute search results.
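A minimal sketch of that archival pass, assuming journals sit directly in the memory root and are named YYYY-MM-DD.md. The archive_old_journals function is hypothetical, and its keep argument is one possible reading of "manual override": a list of file names that must never be archived.

```python
import re
from datetime import date
from pathlib import Path
from typing import Iterable, List

JOURNAL_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.md$")


def archive_old_journals(root: Path, today: date, max_age_days: int = 90,
                         keep: Iterable[str] = ()) -> List[str]:
    """Move journals older than max_age_days into root/archive.

    Returns the moved file names so the caller can drop them from the
    search index. Files named in `keep` are left in place.
    """
    pinned = set(keep)
    archive = root / "archive"
    moved = []
    for path in sorted(root.iterdir()):
        match = JOURNAL_NAME.match(path.name)
        if not match or path.name in pinned:
            continue
        try:
            day = date(*map(int, match.groups()))
        except ValueError:
            continue  # looks like a date but is not a valid one
        if (today - day).days > max_age_days:
            archive.mkdir(exist_ok=True)
            path.rename(archive / path.name)
            moved.append(path.name)
    return moved
```

Passing today in explicitly, instead of calling date.today() inside the function, keeps the pass easy to test and to dry-run against a past date.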

The DAG compressor prunes nodes when the conversation ends. A “conversation” ends when the agent is idle for more than 30 minutes or when the user explicitly starts a new session.
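The end-of-conversation rule reduces to a single predicate. The names below are hypothetical; only the 30-minute threshold comes from the text above.

```python
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(minutes=30)


def conversation_ended(last_activity: datetime, now: datetime,
                       new_session_requested: bool = False) -> bool:
    """True when the user started a new session or the idle limit has passed."""
    return new_session_requested or (now - last_activity) > IDLE_LIMIT
```

Whatever calls the pruner can check this before each turn and discard the in-memory DAG when it returns True.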

Observability: What to Instrument

Memory systems fail silently. The agent retrieves the wrong context, writes to the wrong file, or skips writing entirely. You need instrumentation at every layer.

Layer 1 (Compression) Metrics

DAG depth (how many summary levels exist)
Compression ratio (original tokens / compressed tokens)
Time since last prune

Layer 2 (Files) Metrics

Write frequency per file
File size growth rate
Git commit frequency (if files aren’t being committed, something’s wrong)

Layer 3 (Search) Metrics

Query latency
Index size
Top-k result relevance (manual spot checks)

The most useful metric: retrieval precision. Sample 20 queries per day and manually check whether the top-3 results are actually relevant. If precision drops below 70%, the index has likely gone stale, the vocabulary in the files has shifted away from the queries, or the file structure has changed.
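The spot check is easy to turn into a number. This sketch assumes precision means the fraction of top-3 results judged relevant, averaged over the sampled queries; the text does not pin down the exact definition, and precision_at_k is a hypothetical helper.

```python
from typing import Sequence


def precision_at_k(judgments: Sequence[Sequence[bool]], k: int = 3) -> float:
    """Mean fraction of relevant results among the top k, across queries.

    Each inner sequence holds the manual relevance judgments for one
    query's results, best-ranked first. Missing results count as misses.
    """
    if not judgments:
        raise ValueError("no judged queries")
    per_query = [sum(bool(r) for r in row[:k]) / k for row in judgments]
    return sum(per_query) / len(per_query)


# Three sampled queries, judged by hand.
sample = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
precision = precision_at_k(sample)
needs_attention = precision < 0.70
```

Logging this value daily gives a trend line, which is more informative than any single day's sample of 20 queries.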

Cross-Session State Migration

The hardest problem: an agent starts a task in one session, gets interrupted, and resumes in a new session. The new session needs to reconstruct the agent’s mental state.

The file-based approach solves this with explicit state files. When the agent starts a task, it writes a task-{id}.md file with:

Task description
Current step
Blockers
Next actions

When the session ends, the agent updates the file. When a new session starts, the agent reads all active task files and decides which one to resume.

This is not automatic. The agent must be prompted to write task files and to check for active tasks on startup. The prompt template includes:

On startup:
1. Read all files in tasks/ directory
2. If any task has status "in-progress", ask the user if they want to resume
3. If resuming, read the full task file and continue from the last step

Before ending a session:
1. Update all active task files with current status
2. If a task is blocked, write the blocker to the file
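The startup and shutdown steps above can be sketched as plain file operations. The Task fields mirror the four items the task file is said to contain, plus a status; the markdown layout produced by render and the helper names are assumptions, not a format the article defines.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class Task:
    task_id: str
    description: str
    status: str = "in-progress"
    current_step: str = ""
    blockers: List[str] = field(default_factory=list)
    next_actions: List[str] = field(default_factory=list)


def render(task: Task) -> str:
    """Serialize a task to markdown (this layout is an assumption)."""
    def bullets(items: List[str]) -> str:
        return "\n".join(f"- {item}" for item in items) or "- none"
    return (
        f"# Task: {task.description}\n\n"
        f"Status: {task.status}\n"
        f"Current step: {task.current_step}\n\n"
        f"## Blockers\n{bullets(task.blockers)}\n\n"
        f"## Next actions\n{bullets(task.next_actions)}\n"
    )


def save(tasks_dir: Path, task: Task) -> Path:
    """Before ending a session: write the task's current state to disk."""
    tasks_dir.mkdir(parents=True, exist_ok=True)
    path = tasks_dir / f"task-{task.task_id}.md"
    path.write_text(render(task), encoding="utf-8")
    return path


def active_task_ids(tasks_dir: Path) -> List[str]:
    """On startup: ids of task files whose status line says in-progress."""
    ids = []
    for path in sorted(tasks_dir.glob("task-*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        if "Status: in-progress" in lines:
            ids.append(path.stem[len("task-"):])
    return ids
```

The decision to resume still belongs to the agent and the user. This code only guarantees that the state it needs is on disk and easy to find.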

Failure Modes and Mitigations

Every memory architecture has failure modes. Here’s what breaks and how to handle it.

Failure Mode 1: Embedding Drift

As the agent’s vocabulary changes, old embeddings become less relevant. A query for “authentication” might not retrieve entries that used “auth” or “login.”

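Mitigation: Rebuild the search index weekly. The local embedding model is fast enough (the 333MB GGUF model runs at ~100 paragraphs/second) that a full reindex takes minutes, not hours. Note that a rebuild with the same model produces the same vectors for unchanged text, so it mainly repairs a stale or incomplete index; a genuine vocabulary gap persists until the model or the wording of the files changes.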

Failure Mode 2: File Conflicts

If the agent edits a file while you’re editing it, Git will catch the conflict, but the agent won’t know how to resolve it.

Mitigation: The agent writes to its own namespace. User files live in src/, agent files live in memory/. The agent never writes to user files without explicit permission.
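The namespace rule is easiest to trust when the write path enforces it. A minimal sketch, assuming all agent writes go through one helper; resolve_agent_path is a hypothetical name, and the "memory" default follows the directory layout described above.

```python
from pathlib import Path


def resolve_agent_path(workspace: Path, relative: str,
                       namespace: str = "memory") -> Path:
    """Resolve a write target and refuse anything outside the agent namespace."""
    base = (workspace / namespace).resolve()
    target = (base / relative).resolve()
    # resolve() collapses "..", so an escape attempt ends up outside `base`.
    if base not in target.parents:
        raise PermissionError(f"{relative!r} is outside {namespace}/")
    return target
```

Checking the resolved path, rather than scanning the string for "..", also covers absolute paths and symlinks that point outside the namespace.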

Failure Mode 3: Context Pollution

The agent retrieves too much irrelevant context and wastes tokens on noise.

Mitigation: Two-stage retrieval. First, semantic search returns 20 candidates. Second, a lightweight reranker (a smaller model or a simple heuristic like recency + relevance score) picks the top 5. The reranker runs in <100ms and dramatically improves precision.
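Here is a sketch of the heuristic variant of the second stage. The Candidate class, the 0.3 recency weight, and the 30-day half-life are illustrative assumptions, not values from the tested systems; stage one is assumed to supply similarity scores between 0 and 1.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Candidate:
    file_path: str
    score: float    # similarity from stage one, assumed to lie in [0, 1]
    written: date   # when the paragraph's file was last written


def rerank(candidates: List[Candidate], today: date, top_n: int = 5,
           recency_weight: float = 0.3,
           half_life_days: float = 30.0) -> List[Candidate]:
    """Blend similarity with an exponentially decaying recency bonus."""
    def combined(c: Candidate) -> float:
        age_days = max((today - c.written).days, 0)
        recency = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 at one half-life
        return (1 - recency_weight) * c.score + recency_weight * recency
    return sorted(candidates, key=combined, reverse=True)[:top_n]
```

Stage one would call the semantic search with top_k=20 and pass the results here. Setting recency_weight to 0 reduces the reranker to plain similarity ordering, which makes it easy to measure whether the recency term is helping.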

Technical Verdict

Use this architecture when:

You need agents that remember across sessions
You want human-readable, auditable memory
You’re building on local infrastructure (no cloud dependencies)
You need to version-control agent state

Avoid this architecture when:

You need sub-10ms retrieval latency (file reads + embedding search take 50-200ms)
You’re building multi-tenant systems (file-based memory doesn’t isolate well)
You need complex graph queries (files don’t support graph traversal)
You want zero maintenance (you’ll need to prune old journals and rebuild indexes)

The three-layer stack (compression + files + search) is not the only way to build agent memory, but it’s the only approach that survived six months of production use. The key insight: treat memory as a stack, not a single storage backend. Each layer solves a different problem, and the handoffs between layers are where the architecture either works or breaks.
