Building a Memory Server for Claude Code: Why Stateless Agents Need External Context Stores

Claude Code is stateless. Every session starts from zero. You explain your project structure, your deployment rules, your “don’t push without checking the allowlist” warning. Next morning, you do it again.

This isn’t a bug. It’s the default behavior of commercial coding agents. They don’t persist context across sessions because they don’t ship with a memory layer. The Model Context Protocol (MCP) gives you the hooks to build one yourself, but it doesn’t solve the hard problems: what to remember, how to retrieve it, and how to keep token budgets from exploding.

Tom Tokita built a memory server for Claude Code after burning ten minutes every morning re-explaining the same project logistics. The result is a custom MCP server that handles persistent memory, context condensation, delegated file reading, and compliance checking. This isn’t a conceptual exercise. It’s a working prototype submitted to the Hermes Agent Challenge.

The Stateless Agent Problem

Claude Code runs in a session. When the session ends, the context window disappears. No state persists. No project memory survives. The next session is a blank slate.

For a single project, this is annoying. For a live client portfolio, it’s a wall. You spend the first chunk of every session on logistics the system already knew yesterday.

The obvious fix is to add memory. The non-obvious part is deciding what memory means for a coding agent:

Session memory: What happened in this conversation?
Project memory: What are the rules, structure, and quirks of this codebase?
Workflow memory: What deployment steps, compliance checks, or approval gates apply?

Each type has different retrieval patterns, different staleness characteristics, and different token costs.

MCP as a Memory Hook

MCP lets you run a server that exposes tools to Claude. The agent calls those tools during a session. Your server handles the request and returns structured data.

The protocol is straightforward:

You define a tool (e.g., retrieve_context)
Claude sees the tool in its capability list
Claude calls the tool with parameters
Your server returns JSON
Claude uses the response in its next reasoning step

The hard part isn’t the protocol. It’s deciding what to retrieve, when to retrieve it, and how to keep the context window from overflowing.
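The five steps above can be sketched without any SDK. What follows is a minimal, hypothetical dispatcher that mimics the shape of MCP's JSON-RPC messages (`tools/list` and `tools/call`). The `MEMORY` dict, the `retrieve_context` handler, and `handle_request` are illustrative assumptions, not Tokita's code. A real server would use an MCP SDK, advertise an input schema for each tool, and speak over a transport such as stdio.

```python
import json

# Hypothetical in-memory store standing in for the real backend.
MEMORY = {"deployment_rules": "Check the allowlist before every push."}

def retrieve_context(key: str) -> str:
    """Illustrative tool handler: explicit lookup by key."""
    return MEMORY.get(key, "")

# Tool registry: name -> (description, handler). All names are assumptions.
TOOLS = {
    "retrieve_context": ("Return stored project context for a key.", retrieve_context),
}

def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request shaped like an MCP message."""
    req = json.loads(raw)
    method, params = req.get("method"), req.get("params", {})
    if method == "tools/list":
        # Step 2: the agent discovers which tools exist.
        result = {"tools": [{"name": n, "description": d} for n, (d, _) in TOOLS.items()]}
    elif method == "tools/call" and params.get("name") in TOOLS:
        # Steps 3 and 4: the agent calls a tool, the server returns content.
        _, handler = TOOLS[params["name"]]
        text = handler(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        error = {"code": -32601, "message": "Method or tool not found"}
        return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), "result": result})

# Simulate the agent asking for the deployment rules.
call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "retrieve_context", "arguments": {"key": "deployment_rules"}}}
response = json.loads(handle_request(json.dumps(call)))
```

The text inside `response["result"]["content"]` is what the agent would read in its next reasoning step.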

Architecture: Four Memory Primitives

Tokita’s server implements four capabilities:

1. Persistent Memory

The server stores facts across sessions. When Claude asks “what’s the deployment process for this project?”, the server retrieves stored context instead of forcing the user to re-explain.

Storage is key-value. Keys are project identifiers or topic tags. Values are text blobs. No vector search, no embeddings, just explicit retrieval by key.

This works because coding context is structured. You don’t need semantic search to find “deployment rules.” You need a lookup table.
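A lookup table like this needs very little code. Here is a minimal sketch using SQLite from the Python standard library. The table layout, the key format, and the function names `open_store`, `remember`, and `recall` are assumptions for illustration, not the server's actual schema.

```python
import sqlite3
import time

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a key-value fact store. Schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        "key TEXT NOT NULL, content TEXT NOT NULL, created_at REAL NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_facts_key ON facts(key)")
    return conn

def remember(conn: sqlite3.Connection, key: str, content: str) -> None:
    """Append a text blob under an explicit key (no updates, no embeddings)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO facts VALUES (?, ?, ?)", (key, content, time.time()))

def recall(conn: sqlite3.Connection, key: str) -> list[str]:
    """Return every blob stored under the key, newest first."""
    rows = conn.execute(
        "SELECT content FROM facts WHERE key = ? ORDER BY created_at DESC, rowid DESC",
        (key,),
    )
    return [row[0] for row in rows]

store = open_store()
remember(store, "project:acme:deployment_rules", "Check the allowlist before every push.")
rules = recall(store, "project:acme:deployment_rules")
```

Passing a file path instead of `":memory:"` is what makes the facts survive across sessions.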

2. Context Condensation

Raw memory grows unbounded. A three-month project accumulates hundreds of facts. You can’t inject all of them into every request.

The server condenses context before injection:

Recency weighting: Recent facts get priority
Relevance filtering: Only inject context related to the current task
Token budgeting: Cap injected context at a fixed token limit

Condensation is lossy. The trade-off is between completeness and token cost. A 200k-token context window buys you room, but you still need to decide what matters now.
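One way to combine the three steps is to score each fact and fill the budget greedily. The sketch below is an assumption, not the server's actual algorithm: it treats relevance as tag overlap with the current task, recency as an exponential decay with a hypothetical 30-day half-life, and token cost as roughly four characters per token.

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    content: str
    age_days: float                      # how long ago the fact was stored
    tags: set = field(default_factory=set)

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token), not a real tokenizer.
    return max(1, len(text) // 4)

def condense(facts, task_tags, max_tokens=10_000, half_life_days=30.0):
    """Filter by relevance, weight by recency, then fill the token budget."""
    scored = []
    for fact in facts:
        overlap = len(fact.tags & task_tags)
        if overlap == 0:
            continue  # relevance filtering: unrelated facts are dropped
        recency = 0.5 ** (fact.age_days / half_life_days)  # recency weighting
        scored.append((overlap * recency, fact))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    kept, used = [], 0
    for _, fact in scored:
        cost = estimate_tokens(fact.content)
        if used + cost <= max_tokens:  # token budgeting
            kept.append(fact)
            used += cost
    return kept
```

Everything the function drops is simply gone from the request, which is the lossy part.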

3. Delegated File Reading

Claude Code can read files, but it’s expensive. Every file read consumes tokens. For large codebases, naive file reading blows the budget.

The memory server acts as a file cache. It reads files once, stores summaries, and returns condensed versions on request. Claude gets the structure without paying for the full file every time.

This is retrieval-augmented generation (RAG) for code, minus the embeddings. You cache file summaries and return them on demand.

4. Compliance Checking

Some projects have hard rules: don’t deploy without approval, don’t modify certain files, don’t push to production on Fridays.

The server stores these rules and exposes a check_compliance tool. Before Claude executes a risky action, it calls the tool. The server returns pass/fail. If it fails, Claude stops.

This is a guardrail, not a security boundary. Claude can ignore the tool. But it usually doesn’t, because the tool is part of its capability list and the prompt instructs it to check before acting.

Retrieval Strategy: Explicit Over Semantic

The server uses explicit retrieval, not vector search. When Claude needs context, it calls a tool with a key. The server looks up the key and returns the value.

This is simpler than semantic search and works better for structured knowledge:

Approach	Pros	Cons
Explicit key lookup	Fast, deterministic, no embedding overhead	Requires structured keys, no fuzzy matching
Vector search	Handles fuzzy queries, discovers related context	Embedding cost, retrieval latency, relevance tuning
Recency-weighted cache	Prioritizes recent context, low complexity	Misses older but relevant facts

For coding agents, explicit retrieval wins. You know what you’re looking for: “deployment rules,” “API endpoints,” “test coverage requirements.” You don’t need semantic similarity. You need a lookup table.

Session Isolation and Concurrency

If multiple Claude sessions share the same memory backend, you need isolation. Otherwise, session A’s context leaks into session B.

The server handles this with session IDs. Every request includes a session identifier. The server scopes memory lookups to that session.

For shared project memory (facts that apply across sessions), the server uses project IDs. Session-specific memory is isolated. Project-wide memory is shared.

This is a namespace problem, not a locking problem. Reads are concurrent. Writes are append-only. No transactions, no coordination overhead.
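The namespace idea can be sketched as two key prefixes over one append-only log. The class below is an illustrative assumption, not the server's storage layer. `NamespacedMemory` and `new_session_id` are hypothetical names, and generating IDs with `uuid4` is one simple way to make accidental collisions unlikely.

```python
import uuid
from collections import defaultdict

def new_session_id() -> str:
    """Random session ID, so two sessions are very unlikely to collide."""
    return uuid.uuid4().hex

class NamespacedMemory:
    """Append-only facts, scoped either to one session or to a whole project."""

    def __init__(self):
        self._log = defaultdict(list)  # "scope:id" -> list of facts

    def append(self, scope: str, scope_id: str, content: str) -> None:
        if scope not in ("session", "project"):
            raise ValueError(f"unknown scope: {scope}")
        self._log[f"{scope}:{scope_id}"].append(content)  # never overwrites

    def read(self, session_id: str, project_id: str) -> dict:
        # Copies are returned so callers cannot mutate the log.
        return {
            "session": list(self._log.get(f"session:{session_id}", [])),
            "project": list(self._log.get(f"project:{project_id}", [])),
        }

memory = NamespacedMemory()
session_a, session_b = new_session_id(), new_session_id()
memory.append("session", session_a, "Debugging the login redirect.")
memory.append("project", "acme", "Deploys need approval.")
```

Reading with `session_b` returns the shared project fact but none of session A's notes.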

Token Budget Management

The biggest operational challenge is token budgeting. You can’t inject unlimited context. You have to decide what fits.

The server uses a fixed budget (e.g., 10k tokens for injected context). When retrieving memory, it:

Fetches all relevant facts
Sorts by recency and relevance
Truncates to fit the budget
Returns the truncated list

This is greedy and lossy. You might drop important context. But the alternative is exceeding the context window and failing the request.

A better approach is adaptive budgeting: allocate more tokens to critical context (compliance rules, deployment steps) and fewer to background facts (project history, team notes). The server doesn’t implement this yet, but it’s the next step.
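Since the server doesn't implement adaptive budgeting, the following is only a sketch of one possible design. The two tiers, the 70/30 split in `TIER_SHARE_PCT`, and the rule that unused critical budget rolls over to background facts are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    content: str
    tier: str         # "critical" (rules, deploy steps) or "background"
    timestamp: float  # larger means more recent

# Hypothetical split of the budget, in percent.
TIER_SHARE_PCT = {"critical": 70, "background": 30}

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token), not a real tokenizer.
    return max(1, len(text) // 4)

def allocate(facts, max_tokens=10_000):
    """Fill the critical tier first, then pass unused budget to background."""
    selected, leftover = [], 0
    for tier in ("critical", "background"):
        budget = max_tokens * TIER_SHARE_PCT[tier] // 100 + leftover
        tier_facts = sorted(
            (f for f in facts if f.tier == tier),
            key=lambda f: f.timestamp,
            reverse=True,  # newest first within a tier
        )
        used = 0
        for fact in tier_facts:
            cost = estimate_tokens(fact.content)
            if used + cost <= budget:
                selected.append(fact)
                used += cost
        leftover = budget - used
    return selected
```

With this split, a burst of project-history notes can no longer crowd out a compliance rule, because the two compete for different parts of the budget.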

Code Snippet: Memory Retrieval Tool

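Here’s the core retrieval logic as a runnable Python sketch. The Fact dataclass, the in-memory memory_store dict, and the rough count_tokens estimate are stand-ins for the real storage backend and tokenizer:

from dataclasses import dataclass

@dataclass
class Fact:
    content: str
    timestamp: float  # Unix time the fact was stored

# Stand-in backend: maps a namespaced key to a list of Fact objects
memory_store: dict[str, list[Fact]] = {}

def count_tokens(text):
    # Rough estimate (~4 characters per token); swap in a real tokenizer
    return max(1, len(text) // 4)

def retrieve_context(session_id, project_id, query_key, max_tokens=10000):
    # Fetch session-specific memory (empty list if the key is unknown)
    session_facts = memory_store.get(f"session:{session_id}", [])

    # Fetch project-wide memory
    project_facts = memory_store.get(f"project:{project_id}:{query_key}", [])

    # Combine and sort by recency, newest first
    all_facts = sorted(session_facts + project_facts, key=lambda f: f.timestamp, reverse=True)

    # Keep the newest facts until the next one would exceed the token budget
    result = []
    token_count = 0
    for fact in all_facts:
        fact_tokens = count_tokens(fact.content)
        if token_count + fact_tokens > max_tokens:
            break
        result.append(fact)
        token_count += fact_tokens

    return {"facts": result, "truncated": len(result) < len(all_facts)}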

The truncation flag tells Claude that some context was dropped. Claude can request more specific context if needed.

Failure Modes

This architecture has predictable failure modes:

Stale context: Facts become outdated. The server doesn’t expire old memory automatically.
Retrieval misses: If the query key is wrong, the server returns nothing. Claude has no fallback.
Token budget overflow: If critical context exceeds the budget, it gets truncated. Claude proceeds with incomplete information.
Session ID collisions: If two sessions share an ID, each one sees the other’s memory. The server doesn’t validate uniqueness.

The stale context problem is the worst. A deployment rule changes, but the memory server still returns the old rule. Claude follows the old rule and breaks production.

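One fix is versioning. Every fact gets a version number, and the server increments it whenever the fact changes. Before acting on a fact it retrieved earlier, Claude checks that its version still matches the server’s. If the version is stale, Claude requests a refresh. This only helps if someone updates the stored fact when the real rule changes.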
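The versioning scheme can be sketched with a small in-memory class. This is an assumption about how it might look, since the source describes the idea but not an implementation. `VersionedFacts` and its method names are hypothetical.

```python
class VersionedFacts:
    """Facts keyed by name, each carrying a version that bumps on every change."""

    def __init__(self):
        self._facts = {}  # key -> (version, content)

    def put(self, key: str, content: str) -> int:
        """Store or replace a fact and return its new version number."""
        version = self._facts.get(key, (0, None))[0] + 1
        self._facts[key] = (version, content)
        return version

    def get(self, key: str) -> dict:
        version, content = self._facts[key]
        return {"key": key, "version": version, "content": content}

    def is_current(self, key: str, version: int) -> bool:
        """True if the caller's copy matches what the server holds now."""
        current = self._facts.get(key)
        return current is not None and current[0] == version

facts = VersionedFacts()
facts.put("deployment_rules", "Deploy from main.")
held = facts.get("deployment_rules")             # the agent's copy, version 1
facts.put("deployment_rules", "Deploy from release branches only.")
needs_refresh = not facts.is_current("deployment_rules", held["version"])
```

Here `needs_refresh` comes out true, so the agent would re-fetch the rule before acting on it.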

Observability: What to Log

You need to log three things:

Retrieval requests: What context did Claude request, and what did the server return?
Token usage: How many tokens were injected, and how many were truncated?
Compliance checks: What rules were checked, and what was the result?

Without these logs, you can’t debug why Claude made a bad decision. With them, you can trace the decision back to the injected context.

The server should also expose metrics:

Retrieval latency (p50, p99)
Token budget utilization
Truncation rate
Compliance check failure rate

These metrics tell you when the memory system is under stress and when you need to adjust the token budget or prune old facts.
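A minimal sketch of the retrieval side of this, assuming one structured log line per request and nearest-rank percentiles computed in memory. The `Metrics` class, its field names, and the log format are hypothetical. Compliance checks would be counted the same way, and a production setup would more likely export to a metrics system.

```python
import json
import logging
import math

logger = logging.getLogger("memory_server")  # logger name is an assumption

class Metrics:
    """In-memory counters for retrieval requests."""

    def __init__(self):
        self.latencies_ms = []
        self.requests = 0
        self.truncated = 0
        self.tokens_injected = 0
        self.tokens_budgeted = 0

    def record(self, key, latency_ms, tokens_injected, budget, truncated):
        self.requests += 1
        self.truncated += int(truncated)
        self.latencies_ms.append(latency_ms)
        self.tokens_injected += tokens_injected
        self.tokens_budgeted += budget
        # One JSON line per retrieval, so a bad decision can be traced later.
        logger.info(json.dumps({
            "event": "retrieval", "key": key, "latency_ms": latency_ms,
            "tokens_injected": tokens_injected, "truncated": truncated,
        }))

    def percentile(self, p: int):
        """Nearest-rank percentile of recorded latencies (None if empty)."""
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self) -> dict:
        return {
            "latency_p50_ms": self.percentile(50),
            "latency_p99_ms": self.percentile(99),
            "truncation_rate": self.truncated / self.requests if self.requests else 0.0,
            "budget_utilization": (
                self.tokens_injected / self.tokens_budgeted if self.tokens_budgeted else 0.0
            ),
        }
```

A rising truncation rate with utilization near 1.0 is the signal that the budget is too small or that old facts need pruning.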

Deployment Shape

The memory server runs as a sidecar. Claude Code connects to it over localhost. The server stores memory in a local SQLite database.

For multi-user deployments, you’d replace SQLite with Postgres and run the server as a shared service. Each user gets a session ID. The server isolates memory by session.

For high-availability deployments, you’d replicate the memory store and run multiple server instances behind a load balancer. Because reads are concurrent and writes are append-only, the instances need little coordination beyond replicating the store itself.

The simplest deployment is single-user, single-machine. The server runs on the same box as Claude Code. Memory is local. No network latency, no replication overhead.

Technical Verdict

Use this pattern when:

You work with the same projects repeatedly and spend time re-explaining context
Your projects have structured rules (deployment steps, compliance checks, approval gates)
You need guardrails that persist across sessions
You control the Claude Code environment and can run a local MCP server

Avoid this pattern when:

Your projects are one-off or highly variable (no stable context to remember)
You need semantic search over unstructured knowledge (use vector RAG instead)
You’re working in a locked-down environment where you can’t run local servers
You need multi-agent coordination (this is single-agent memory, not shared state)

The memory server solves the stateless agent problem by externalizing context. It’s not a general-purpose knowledge base. It’s a project-specific cache optimized for coding workflows. If you’re burning time re-explaining the same project every morning, this is the fix.

Source Links

Original Article: Claude Code Forgets Everything. So I Built It a Memory Server.