Supermemory: Production Memory Engine Architecture for AI Agents

How Supermemory handles fact extraction, temporal updates, contradictions, and sub-50ms profile queries without managed services.

Source: github.com
AI agents forget everything between sessions. Supermemory is a memory engine that extracts facts from conversations, handles temporal updates and contradictions, and delivers user profiles in roughly 50ms. The project reports first-place results on LongMemEval, LoCoMo, and ConvoMem, three widely used AI memory benchmarks.

This is not a vector database with a chat wrapper. It’s a fact extraction pipeline, a temporal reasoning layer, and a hybrid search engine that merges RAG document retrieval with personalized memory context in a single query path.

Architecture: Fact Extraction vs. RAG Chunking

Traditional RAG systems chunk documents into fixed-size segments and embed them. Supermemory extracts discrete facts from conversations and stores them as structured entities with temporal metadata.

Fact extraction pipeline:

  1. Conversation input hits the Go backend
  2. LLM extracts atomic facts (subject, predicate, object triples)
  3. Each fact gets a timestamp, confidence score, and source reference
  4. Facts are indexed in a relational store (not just a vector DB)
  5. Contradictions trigger a resolution flow: newer facts deprecate older ones unless confidence scores override
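Steps 2 and 3 can be sketched as a validation pass over the LLM's output. This is a minimal illustration, not Supermemory's actual code: the JSON contract (an array of objects with `subject`, `predicate`, `object`, and `confidence`), the `ExtractedFact` type, and the `parseExtractedFacts` function are all assumptions made for the example.

```typescript
// Hypothetical record produced by the extraction step.
interface ExtractedFact {
  subject: string;
  predicate: string;
  object: string;
  confidence: number; // 0..1
  timestamp: number; // epoch milliseconds
  sourceId: string;
}

// Assumed LLM output contract: a JSON array of
// { subject, predicate, object, confidence } objects.
function parseExtractedFacts(raw: string, sourceId: string, now: number): ExtractedFact[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // malformed model output yields no facts
  }
  if (!Array.isArray(parsed)) return [];

  const facts: ExtractedFact[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) continue;
    const { subject, predicate, object, confidence } = item as Record<string, unknown>;
    // Drop anything that is not a complete string triple.
    if (typeof subject !== "string" || typeof predicate !== "string" || typeof object !== "string") continue;
    if (subject.trim() === "" || predicate.trim() === "" || object.trim() === "") continue;
    // Missing or out-of-range confidence defaults to 0 (an assumption of this sketch).
    const score =
      typeof confidence === "number" && confidence >= 0 && confidence <= 1 ? confidence : 0;
    facts.push({ subject, predicate, object, confidence: score, timestamp: now, sourceId });
  }
  return facts;
}
```

Validating before storage matters because a malformed triple that reaches the facts table is harder to detect later than one rejected at the boundary.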

Data structures:

  • Facts table: id, user_id, subject, predicate, object, timestamp, confidence, source_id
  • Contradictions table: Links conflicting facts with resolution status
  • User profiles: Pre-computed aggregations of stable facts plus a sliding window of recent activity
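A minimal sketch of the facts record and the "newer deprecates older unless confidence overrides" rule follows. The field names mirror the facts table above; `CONFIDENCE_MARGIN`, `conflicts`, and `resolveContradiction` are hypothetical, and the 0.3 margin is an illustrative value rather than anything the project documents.

```typescript
// Shape mirroring the facts table columns listed above.
interface Fact {
  id: string;
  userId: string;
  subject: string;
  predicate: string;
  object: string;
  timestamp: number; // epoch milliseconds
  confidence: number; // 0..1
  sourceId: string;
}

interface Resolution {
  active: Fact;
  deprecated: Fact;
}

// Assumed threshold: how much more confident the older fact must be
// to survive a newer, conflicting one.
const CONFIDENCE_MARGIN = 0.3;

// Two facts conflict when they share user, subject, and predicate but differ in object.
function conflicts(a: Fact, b: Fact): boolean {
  return (
    a.userId === b.userId &&
    a.subject === b.subject &&
    a.predicate === b.predicate &&
    a.object !== b.object
  );
}

// The newer fact wins unless the older one is more confident by at least the margin.
function resolveContradiction(a: Fact, b: Fact): Resolution | null {
  if (!conflicts(a, b)) return null;
  const older = a.timestamp <= b.timestamp ? a : b;
  const newer = older === a ? b : a;
  const olderOverrides = older.confidence - newer.confidence >= CONFIDENCE_MARGIN;
  return olderOverrides
    ? { active: older, deprecated: newer }
    : { active: newer, deprecated: older };
}
```

In a relational store, the returned pair would map to one row in the contradictions table plus a status change on the deprecated fact.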

This differs from RAG chunking because facts are semantically parsed, not mechanically split. A conversation about changing jobs generates discrete facts: “works at Company A” (deprecated), “works at Company B” (active), “started new role on 2026-05-15” (temporal anchor).

Temporal Reasoning and Automatic Forgetting

Supermemory handles three temporal cases:

  1. Updates: “User moved from SF to NYC” deprecates the old location fact
  2. Contradictions: “User prefers Python” vs. “User prefers Rust” triggers a resolution flow based on recency and confidence
  3. Expiration: Facts with explicit TTLs or relevance decay scores get pruned

Automatic forgetting mechanism:

  • Time-based TTL for ephemeral facts (e.g., “currently reading X”)
  • Relevance scoring decay: facts not referenced in recent queries lose weight
  • Explicit expiration signals from conversation context (“I used to like Y, but not anymore”)

The forgetting layer runs as a background job that scans the facts table for expired or low-relevance entries. This prevents memory bloat and keeps profile queries fast.
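The pruning pass might look like the sketch below. The source does not specify the decay function, so the exponential half-life model, the 30-day half-life, the 0.1 threshold, and every name here (`StoredFact`, `relevance`, `shouldForget`, `pruneFacts`) are assumptions chosen for illustration.

```typescript
// Hypothetical in-memory view of a stored fact.
interface StoredFact {
  id: string;
  createdAt: number; // epoch ms
  lastReferencedAt: number; // epoch ms of the last query that used this fact
  ttlMs?: number; // set only for ephemeral facts such as "currently reading X"
  expired?: boolean; // set when the conversation explicitly retracts the fact
}

interface ForgetOptions {
  halfLifeMs: number;
  minRelevance: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const DEFAULT_OPTIONS: ForgetOptions = { halfLifeMs: 30 * DAY_MS, minRelevance: 0.1 };

// Assumed decay model: relevance halves for every half-life spent unreferenced.
function relevance(fact: StoredFact, now: number, halfLifeMs: number): number {
  const idleMs = Math.max(0, now - fact.lastReferencedAt);
  return Math.pow(0.5, idleMs / halfLifeMs);
}

// A fact is forgotten if it was retracted, outlived its TTL, or decayed below the threshold.
function shouldForget(fact: StoredFact, now: number, opts: ForgetOptions = DEFAULT_OPTIONS): boolean {
  if (fact.expired) return true;
  if (fact.ttlMs !== undefined && now - fact.createdAt >= fact.ttlMs) return true;
  return relevance(fact, now, opts.halfLifeMs) < opts.minRelevance;
}

// One scan of the background job: split facts into kept and forgotten.
function pruneFacts(
  facts: StoredFact[],
  now: number,
  opts: ForgetOptions = DEFAULT_OPTIONS,
): { kept: StoredFact[]; forgotten: StoredFact[] } {
  const kept: StoredFact[] = [];
  const forgotten: StoredFact[] = [];
  for (const fact of facts) {
    (shouldForget(fact, now, opts) ? forgotten : kept).push(fact);
  }
  return { kept, forgotten };
}
```

Passing `now` explicitly instead of reading the clock inside the functions keeps the pass deterministic and easy to test.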

Sub-50ms User Profile Queries

User profile queries return in approximately 50ms. This is not magic. It’s pre-computed aggregation plus indexed relational lookups, with no aggregation work left for query time.

How it works:

  1. Pre-computed profiles: A materialized view of stable facts per user (name, preferences, long-term context)
  2. Recent activity window: Last N facts stored in a separate hot table
  3. Query path: Single SQL join between pre-computed profile and recent activity, indexed on user_id and timestamp

No vector search for profile queries. Vector embeddings are used for hybrid search (RAG + memory), but profile retrieval is pure relational lookups.
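The read path can be sketched in memory, with two maps standing in for the materialized view and the hot table. In the real system this would be the SQL join described above; the names `precomputedProfiles`, `recentActivity`, and `getProfileView`, and the rule that a newer recent fact overrides a pre-computed one with the same subject and predicate, are assumptions of this sketch.

```typescript
interface ProfileFact {
  subject: string;
  predicate: string;
  object: string;
  timestamp: number; // epoch ms
}

// Stand-ins for the two stores: a materialized profile per user
// and a hot table of recent facts.
const precomputedProfiles = new Map<string, ProfileFact[]>();
const recentActivity = new Map<string, ProfileFact[]>();

function getProfileView(userId: string, recentLimit = 20): ProfileFact[] {
  const stable = precomputedProfiles.get(userId) ?? [];
  // Take only the last N recent facts, newest first.
  const recent = (recentActivity.get(userId) ?? [])
    .slice()
    .sort((a, b) => b.timestamp - a.timestamp)
    .slice(0, recentLimit);

  // Key on (subject, predicate) so a newer recent fact overrides
  // the possibly stale pre-computed value.
  const merged = new Map<string, ProfileFact>();
  for (const fact of stable.concat(recent)) {
    const key = `${fact.subject}|${fact.predicate}`;
    const existing = merged.get(key);
    if (!existing || fact.timestamp > existing.timestamp) merged.set(key, fact);
  }
  return Array.from(merged.values());
}
```

Both lookups are keyed by user, which is why an index on user_id and timestamp is enough to keep the query path cheap.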

Trade-off table:

Approach                     Latency      Freshness        Complexity
On-demand aggregation        200-500ms    Real-time        Low
Pre-computed + hot window    ~50ms        Near real-time   Medium
Fully cached profiles        <10ms        Stale            High (cache invalidation)

Supermemory picks the middle path: pre-computed profiles updated every few minutes, recent activity appended in real time.

Hybrid Search: RAG + Memory in One Query

Hybrid search merges two retrieval paths:

  1. RAG document retrieval: Vector search over ingested files (PDFs, Notion pages, code)
  2. Personalized memory context: Fact-based user profile and recent conversation history

Query flow:

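// Simplified hybrid search orchestration.
// vectorSearch, getUserProfile, and rerank are helpers assumed to be defined elsewhere.
async function hybridSearch(query: string, userId: string) {
  // The two retrieval paths are independent, so they run concurrently.
  const [ragResults, memoryContext] = await Promise.all([
    vectorSearch(query, { limit: 10 }),
    getUserProfile(userId), // ~50ms pre-computed lookup
  ]);

  // Merge at the scoring layer: personalize and recency-boost the RAG hits.
  const rankedResults = rerank(ragResults, memoryContext, {
    personalizeWeight: 0.3,
    recencyBoost: 0.2,
  });

  return {
    documents: rankedResults,
    userContext: memoryContext,
  };
}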

The reranking step adjusts document scores based on user preferences and recent facts. If the user profile says “prefers TypeScript examples,” TypeScript-heavy documents get boosted.

This is not a sequential pipeline. Both retrieval paths run in parallel, and the merge happens at the scoring layer.
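One way the reranking step could combine these signals is sketched below. The scoring formula is an assumption, not the project's documented behavior: personalization is approximated by keyword overlap with stated preferences, recency by a score that halves every week, and the two weights from the snippet above are blended linearly with the vector score. `RetrievedDoc`, `MemoryContext`, and `RerankOptions` are hypothetical types.

```typescript
interface RetrievedDoc {
  id: string;
  text: string;
  score: number; // vector similarity in 0..1
  updatedAt: number; // epoch ms
}

interface MemoryContext {
  preferences: string[]; // e.g. ["typescript"]
}

interface RerankOptions {
  personalizeWeight: number;
  recencyBoost: number;
  now?: number; // injectable clock for deterministic tests
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function rerank(docs: RetrievedDoc[], memory: MemoryContext, opts: RerankOptions): RetrievedDoc[] {
  const now = opts.now ?? Date.now();
  const prefs = memory.preferences.map((p) => p.toLowerCase());
  // Whatever weight is not given to personalization or recency stays with the vector score.
  const baseWeight = 1 - opts.personalizeWeight - opts.recencyBoost;

  const scored = docs.map((doc) => {
    const text = doc.text.toLowerCase();
    // Fraction of the user's preferences mentioned in the document.
    const matched = prefs.filter((p) => text.indexOf(p) !== -1).length;
    const personal = prefs.length > 0 ? matched / prefs.length : 0;
    // Assumed recency model: the signal halves for every week of age.
    const ageMs = Math.max(0, now - doc.updatedAt);
    const recency = Math.pow(0.5, ageMs / WEEK_MS);
    const score =
      baseWeight * doc.score + opts.personalizeWeight * personal + opts.recencyBoost * recency;
    return { ...doc, score };
  });

  // Highest combined score first; the input array is left untouched.
  return scored.sort((a, b) => b.score - a.score);
}
```

With `personalizeWeight: 0.3` and `recencyBoost: 0.2`, half of the final score still comes from vector similarity, so personalization nudges the ranking rather than replacing it.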

Deployment Shape: No Vercel, No Supabase

Supermemory ships as two Docker containers:

  • Go backend: Fact extraction, memory API, connector webhooks
  • Next.js frontend: Dashboard and user-facing app

Deployment isolation trade-offs:

Deployment            Cost         Ops Burden   Vendor Lock-in
Vercel + Supabase     $20-50/mo    Low          High
Self-hosted VPS       $6/mo        Medium       None
Kubernetes cluster    $50+/mo      High         None

Supermemory targets the $6 VPS path. The Go backend is stateless (facts in Postgres, vectors in a separate index), so horizontal scaling is straightforward. The Next.js frontend is a static export that can be served from any CDN.

Container communication:

  • Backend exposes REST API on port 8080
  • Frontend calls backend via environment-configured base URL
  • No shared filesystem, no direct DB access from frontend
  • Webhooks from connectors (Google Drive, Notion) hit backend endpoints directly

This shape avoids managed service dependencies but requires you to handle Postgres backups, SSL termination, and log aggregation yourself.

Connectors and Multi-Modal Extractors

Supermemory includes connectors for Google Drive, Gmail, Notion, OneDrive, and GitHub. Each connector uses real-time webhooks to sync changes.

Connector flow:

  1. User authorizes OAuth app
  2. Connector registers webhook with provider
  3. Provider sends change notifications (new file, updated doc)
  4. Backend fetches changed content and triggers extraction

Multi-modal extractors:

  • PDFs: Text extraction via pdfplumber, chunked by section headings
  • Images: OCR via Tesseract, facts extracted from recognized text
  • Videos: Transcription via Whisper, facts extracted from transcript
  • Code: AST-aware chunking (function definitions, class declarations)

The AST-aware code chunker is notable. Instead of splitting code files at arbitrary line counts, it parses the syntax tree and chunks by semantic units (functions, classes, modules). This preserves context boundaries.

Likely Failure Modes

Fact extraction hallucinations:

If the LLM extracts incorrect facts, they propagate into the memory layer. Confidence scoring mitigates this, but there’s no ground truth validation.

Contradiction resolution loops:

Rapid back-and-forth contradictions (user changes preference multiple times in one session) can create resolution thrash. The system needs a cooldown period or a “pending” state for unstable facts.
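A "pending" state with a cooldown could be sketched as follows. This is a hypothetical design, not something the source describes as implemented: a changed value is held as pending and only becomes active once it has gone unchallenged for `COOLDOWN_MS` (10 minutes here, an arbitrary choice). `StabilizedFact` and its methods are illustrative names.

```typescript
const COOLDOWN_MS = 10 * 60 * 1000; // assumed cooldown window

class StabilizedFact {
  private active: string | null = null; // last value that survived the cooldown
  private pending: string | null = null; // most recent unconfirmed value
  private pendingSince = 0;

  // Record a new observation. Flips inside the cooldown replace or cancel
  // the pending value instead of triggering a resolution each time.
  observe(value: string, now: number): void {
    this.settle(now);
    if (this.active === null) {
      this.active = value; // the first observation has nothing to contradict
      return;
    }
    if (value === this.active) {
      this.pending = null; // user flipped back, so drop the pending change
      return;
    }
    if (value !== this.pending) {
      this.pending = value;
      this.pendingSince = now;
    }
  }

  // Current stable value as of `now`.
  current(now: number): string | null {
    this.settle(now);
    return this.active;
  }

  // Promote the pending value once it has gone unchallenged for the cooldown.
  private settle(now: number): void {
    if (this.pending !== null && now - this.pendingSince >= COOLDOWN_MS) {
      this.active = this.pending;
      this.pending = null;
    }
  }
}
```

The trade-off is latency: a genuine preference change takes one cooldown window to show up in the profile.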

Webhook delivery failures:

Connectors rely on provider webhooks. If a webhook is dropped, the memory layer misses updates. Polling fallbacks are necessary but add latency.
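A polling fallback can be reduced to a reconciliation pass: list what the provider currently has and compare it with what webhooks already delivered. The sketch below assumes the listing has already been fetched; `RemoteItem` and `findMissedChanges` are hypothetical, and real providers expose different change-listing APIs that would need their own adapters.

```typescript
interface RemoteItem {
  id: string;
  modifiedAt: number; // epoch ms reported by the provider
}

// `synced` maps item id to the modifiedAt of the last version processed via webhook.
// Returns the items that are new or newer than what was synced.
function findMissedChanges(listing: RemoteItem[], synced: Map<string, number>): RemoteItem[] {
  return listing.filter((item) => {
    const seen = synced.get(item.id);
    return seen === undefined || seen < item.modifiedAt;
  });
}
```

Running this on a slow schedule bounds how long a dropped webhook can leave the memory layer out of date, at the cost of extra provider API calls.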

Profile staleness:

Pre-computed profiles updated every few minutes can miss rapid context changes. The recent activity window helps, but there’s a trade-off between freshness and query speed.

Postgres write contention:

High-frequency fact extraction from multiple concurrent conversations can bottleneck on the facts table. Partitioning by user_id or sharding across multiple Postgres instances is the escape hatch.
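Routing writes by user_id can be sketched with a stable hash. The example below uses a 32-bit FNV-1a-style hash over UTF-16 code units; the hash choice, `fnv1a`, and `shardFor` are assumptions for illustration, and Postgres's built-in hash partitioning would be an alternative that needs no application code.

```typescript
// 32-bit FNV-1a-style hash computed over UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// Map a user to one of `shardCount` partitions. All facts for a user land on the
// same shard, so profile queries never have to cross shards.
function shardFor(userId: string, shardCount: number): number {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError("shardCount must be a positive integer");
  }
  return fnv1a(userId) % shardCount;
}
```

Plain modulo routing reshuffles most users when `shardCount` changes, so growing the cluster later would call for consistent hashing or a lookup table instead.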

Technical Verdict

Use Supermemory when:

  • You need persistent memory across agent sessions
  • You want fact-based reasoning, not just document retrieval
  • You can tolerate 50ms profile queries (not 5ms)
  • You prefer self-hosted deployments over managed services
  • You need connectors for Google Drive, Notion, or GitHub

Avoid Supermemory when:

  • You need sub-10ms memory lookups (use a pure cache)
  • Your use case is stateless (traditional RAG is simpler)
  • You can’t manage Postgres and webhook infrastructure
  • You need guaranteed fact accuracy (LLM extraction has error rates)
  • You want a fully managed, zero-ops solution

Supermemory is production-grade plumbing for AI agents that need to remember. It’s not a toy, but it’s also not a managed service. You own the ops burden.