mech.app
Dev Tools

MemEval: Testing Framework for AI Agent Memory Systems

How to benchmark retrieval accuracy, forgetting behavior, and cross-session state in agent memory with standardized test scenarios and metrics.

Source: dev.to
MemEval: Testing Framework for AI Agent Memory Systems

Agent memory fails in production because no one tests it before deployment. Prompts have LangSmith. RAG pipelines have Ragas. Memory backends have nothing. You find out about recall failures when a customer says “I already told you my name” or when your agent recommends steak to someone who mentioned being vegan three turns ago.

MemEval is an open-source testing framework that runs standardized scenarios against any memory backend and reports what passes, what fails, and why. It measures seven dimensions: recall accuracy, relevance filtering, consistency across sessions, latency, privacy leakage, forgetting behavior, and update propagation.

Architecture

The framework separates test scenarios from execution harness from backend adapters. You write scenarios in YAML, the harness runs them against any provider, and adapters translate calls into backend-specific APIs.

┌──────────────────┐
│  YAML Scenarios  │  30 built-in cases
│  (multi-turn,    │  (or write custom)
│   privacy,       │
│   recall)        │
└────────┬─────────┘

         v
┌──────────────────┐
│   Evaluation     │  Runs scenarios
│    Harness       │  against backends
└────────┬─────────┘

    ┌────┴────┬────┬────┬────┬────┐
    v         v    v    v    v    v
  Mem0      Zep  Letta Lang- Crew Custom
                      Graph  AI
    │         │    │    │    │    │
    └────┬────┴────┴────┴────┴────┘

         v
┌──────────────────┐
│   7 Metrics      │  recall, relevance,
│  + Visualizer    │  consistency, latency,
│                  │  privacy, forgetting,
│                  │  update propagation
└────────┬─────────┘

         v
┌──────────────────┐
│  Scorecard +     │  Console, JSON,
│  CI Reports      │  GitHub Actions
└──────────────────┘

Each scenario defines a conversation flow, expected memory state, and assertions. The harness injects messages, queries memory, and checks whether the backend retrieved the right facts at the right time.

Test Scenario Structure

A scenario is a YAML file with turns, memory expectations, and assertions. Here’s a simplified privacy leak test:

scenario:
  name: "SSN Privacy Leak"
  turns:
    - role: user
      content: "My SSN is 123-45-6789"
    - role: assistant
      content: "Got it, I've noted your details"
    - role: user
      content: "What do you know about me?"
  
  assertions:
    - type: privacy_leak
      field: ssn
      pattern: '\d{3}-\d{2}-\d{4}'
      should_appear: false
      context: assistant_response
    
    - type: recall
      query: "user identification"
      should_retrieve: true
      should_not_contain: ["123-45-6789"]

The harness runs the conversation, queries memory after each turn, and checks whether the SSN appears in retrieval results or assistant responses. If the backend stores raw SSN text, the test fails.

Seven Metrics Explained

MetricWhat It MeasuresFailure Mode
Recall AccuracyDoes the backend retrieve facts mentioned N turns ago?Agent asks for information already provided
Relevance FilteringDoes it return only contextually useful memories?Agent surfaces unrelated past conversations
ConsistencyDo facts remain stable across sessions?Agent contradicts itself after reload
LatencyQuery response time under loadTimeout errors in production
Privacy LeakageAre PII patterns masked or excluded?SSN, credit card numbers in logs
Forgetting BehaviorDoes old data decay or get pruned?Stale preferences override new ones
Update PropagationDo edits to facts reflect in retrieval?Agent uses outdated account status

Recall and relevance are measured with embedding similarity and keyword matching. Consistency runs the same query across session boundaries and checks for contradictions. Privacy uses regex patterns to detect PII in retrieval results.

Backend Adapter Contract

To test a new memory backend, you implement four methods:

class MemoryBackendAdapter:
    def add_message(self, user_id: str, message: dict) -> None:
        """Store a conversation turn"""
        pass
    
    def search(self, user_id: str, query: str, limit: int = 5) -> list[dict]:
        """Retrieve relevant memories"""
        pass
    
    def get_all(self, user_id: str) -> list[dict]:
        """Fetch full memory state for assertions"""
        pass
    
    def clear(self, user_id: str) -> None:
        """Reset state between test runs"""
        pass

The harness calls add_message for each conversation turn, search to simulate agent retrieval, and get_all to verify internal state. clear ensures test isolation.

For stateful backends like Zep or Letta, add_message might trigger summarization or entity extraction. For vector stores like Mem0, it embeds the message and stores it with metadata. The adapter hides these differences.

Cross-Session State Handling

Consistency tests require session boundaries. The harness creates a session, runs a conversation, closes the session, opens a new one, and queries memory. If the backend loses state between sessions, the test fails.

Some backends persist to disk (Letta), some use external databases (Zep with Postgres), and some hold everything in memory (simple RAG). The adapter’s clear method must handle all three:

def clear(self, user_id: str) -> None:
    if self.backend_type == "persistent":
        self.db.delete_user(user_id)
    elif self.backend_type == "stateful":
        self.session_store.invalidate(user_id)
    else:
        self.memory_cache.pop(user_id, None)

If your backend doesn’t expose a delete API, you can’t run isolation tests. This is a real limitation: some hosted memory services don’t let you clear state programmatically.

Forgetting Behavior Tests

Agents need to forget. A user changes their email, updates their dietary preferences, or closes an account. The memory backend should either overwrite old facts, mark them as stale, or prune them entirely.

MemEval tests this with update scenarios:

scenario:
  name: "Email Update Propagation"
  turns:
    - role: user
      content: "My email is old@example.com"
    - role: user
      content: "Actually, change that to new@example.com"
  
  assertions:
    - type: update_propagation
      query: "user email"
      should_retrieve: "new@example.com"
      should_not_retrieve: "old@example.com"
      grace_period_turns: 0

If the backend returns both emails, the test fails. If it returns only the old one, the test fails. If it returns the new one, it passes.

Some backends implement time-based decay: older facts get lower retrieval scores. Others use explicit versioning: each fact has a timestamp, and queries return the latest version. The test doesn’t care how it’s implemented, only that retrieval reflects the update.

Privacy Leak Detection

PII in memory is a compliance risk. MemEval checks whether backends store raw SSNs, credit card numbers, or API keys.

The test injects a message with a known pattern, queries memory, and scans results:

def check_privacy_leak(response: str, patterns: dict) -> bool:
    for pii_type, regex in patterns.items():
        if re.search(regex, response):
            return True  # Leak detected
    return False

Some backends redact PII automatically (Zep has built-in PII detection). Others store everything verbatim. If your backend doesn’t redact, you need to preprocess messages before calling add_message.

Benchmark Results

The framework ships with 30 scenarios covering multi-turn recall, privacy, and consistency. Here’s what came out of running them against four backends:

BackendRecallRelevanceConsistencyPrivacyLatency (p95)
Mem087%92%78%Fail120ms
Zep91%88%95%Pass95ms
Letta84%85%89%Pass140ms
Custom79%90%72%Fail110ms

Zep scored highest on consistency because it persists to Postgres and maintains session state. Mem0 failed privacy tests because it stores raw message text in vector embeddings. The custom implementation had the lowest consistency because it uses in-memory state that doesn’t survive process restarts.

Latency numbers are for 100-message memory stores with 5-result retrieval. All backends degrade past 1,000 messages except Zep, which uses indexed search.

CI Integration

The framework outputs JSON reports that CI systems can parse:

{
  "backend": "mem0",
  "timestamp": "2026-06-01T08:12:19Z",
  "scenarios_run": 30,
  "scenarios_passed": 24,
  "metrics": {
    "recall_accuracy": 0.87,
    "privacy_leakage": true,
    "avg_latency_ms": 120
  },
  "failures": [
    {
      "scenario": "SSN Privacy Leak",
      "assertion": "should_not_contain",
      "actual": "Retrieved SSN in response"
    }
  ]
}

GitHub Actions can fail the build if privacy tests fail or if recall drops below a threshold:

- name: Run MemEval
  run: memeval run --backend mem0 --threshold 0.85
  
- name: Check Privacy
  run: |
    if jq -e '.metrics.privacy_leakage == true' results.json; then
      echo "Privacy leak detected"
      exit 1
    fi

Failure Modes

Context window overflow: If your backend uses LLM summarization and the context window fills up, old facts get dropped. MemEval detects this by querying for facts from early turns.

Embedding drift: If you change the embedding model, old vectors become incompatible. Consistency tests catch this when retrieval results change without new data.

Race conditions: Some backends update memory asynchronously. If you query immediately after adding a message, the fact might not appear. MemEval adds a configurable delay between turns.

Storage backend failures: If Postgres goes down, Zep loses state. If Redis evicts keys, session data disappears. The framework can’t distinguish between memory logic bugs and infrastructure failures.

Technical Verdict

Use MemEval when:

  • You’re evaluating memory backends and need objective comparison data
  • You’re building a custom memory layer and want regression tests
  • You need to prove compliance with privacy requirements
  • You want CI gates that catch memory degradation before production

Avoid it when:

  • Your agent doesn’t use persistent memory (stateless RAG is fine)
  • You’re testing prompt quality, not memory retrieval (use LangSmith instead)
  • Your backend doesn’t expose APIs for clearing state (you can’t run isolated tests)
  • You need real-time monitoring (this is a test suite, not observability)

The framework is strongest for pre-deployment validation. It won’t catch production issues like memory corruption under load or distributed state inconsistencies. For that, you still need logging, tracing, and alerts.

Tags

testing memory-systems evaluation-framework

Primary Source

dev.to