mech.app
Dev Tools

Answers Rot, Store Questions Instead: A Memory Pattern for Long-Horizon Agent Projects

How storing questions instead of answers solves memory decay in multi-session agent workflows. Field-tested pattern from ten weeks of production use.

Source: dev.to
Answers Rot, Store Questions Instead: A Memory Pattern for Long-Horizon Agent Projects

A milestone got marked “done” in session 47. Five sessions later, an audit revealed it was 2-of-4 complete against its own written criteria. No session had lied. The answer (“done”) had simply outlived the reality it described, and every subsequent session trusted the notes because that’s what good session hygiene demands.

The thing that caught it wasn’t the memory system. It was a human asking: “Is this milestone actually complete against its written criteria?”

That question would have caught the drift immediately. But the question wasn’t stored. Only the answer was.

The Memory Decay Problem

After ten weeks and roughly 300 sessions of running a production ML project through Claude Code, a pattern emerged from the failure modes. Answers decay. They get stale, context-drifts, or become technically correct but operationally misleading as the codebase evolves.

The standard approach stores answers:

  • Session summaries
  • Handoff notes
  • Completion markers
  • State snapshots

Each session reads these artifacts, trusts them, and builds on them. When an answer goes stale, every downstream session inherits the rot.

The inversion is simple: store the questions that generated the answers, then re-execute them each session.

Standing Questions Architecture

A standing question is a serialized query that gets re-run against current state. Instead of storing “milestone X is done,” you store “does milestone X meet all four completion criteria?” and execute it at session start.

Core Components

Question Store: Append-only log of question definitions. Each entry contains:

  • Question text (natural language or structured query)
  • Target (file path, API endpoint, database table)
  • Expected answer shape (boolean, list, metric)
  • Trigger condition (every session, on file change, weekly)

Execution Engine: Runs questions against current state. For a coding agent, this means:

  • File system queries (grep, AST parsing, test output)
  • Git operations (blame, diff, log)
  • Build artifacts (test coverage, lint results)
  • External APIs (deployment status, error rates)

Answer Cache: Ephemeral storage for current session. Answers live here during execution but don’t persist. Next session re-runs the questions.

Drift Detector: Compares current answers to previous session’s answers. Flags changes for human review or agent attention.

Storage Primitives

PrimitiveUse CaseTrade-off
Append-only logQuestion history, audit trailNo updates; must version questions
Key-value storeFast lookup by question IDNo semantic search; requires exact keys
Vector DBSemantic similarity, question clusteringOverhead for simple boolean queries
SQLite fileStructured queries, joins across questionsSchema migration pain as questions evolve

The production implementation used a hybrid: append-only JSON log for question definitions, SQLite for execution results and drift tracking, no vector DB (questions were explicit enough that semantic search added no value).

Implementation Pattern

# Question definition schema
{
  "id": "milestone_3_complete",
  "text": "Does milestone 3 meet all four completion criteria listed in ROADMAP.md?",
  "executor": "file_grep",
  "target": ["ROADMAP.md", "src/milestone3/"],
  "expected_shape": "boolean",
  "trigger": "every_session",
  "created_at": "2024-03-15T10:30:00Z",
  "version": 1
}

# Execution at session start
def run_standing_questions(session_id):
    questions = load_questions(trigger="every_session")
    results = {}
    
    for q in questions:
        current_answer = execute_question(q)
        previous_answer = get_last_answer(q.id, session_id - 1)
        
        if current_answer != previous_answer:
            flag_drift(q.id, previous_answer, current_answer)
        
        results[q.id] = current_answer
        log_execution(session_id, q.id, current_answer)
    
    return results

# Agent prompt injection
system_prompt = f"""
Current standing question results:
{format_results(results)}

Drift detected in: {list_drifted_questions()}

Proceed with session goals, but verify any drifted items before building on them.
"""

The key insight: questions are code. They version, diff, and compose like any other artifact. Answers are runtime output.

Question Versioning

When the agent’s understanding of the domain evolves, questions need updates. The pattern handles this through versioning, not mutation.

Version Bump: Create a new question with incremented version number. Old question remains in the log but stops executing.

Parallel Execution: Run both old and new versions for N sessions to compare behavior. Useful when refining a vague question into something more precise.

Deprecation: Mark old version as deprecated but keep it in the log. Audit trail shows why the question changed.

Example from the production run: “Is the API stable?” (v1) became “Do all endpoints return 2xx for happy-path requests in the last 100 CI runs?” (v2) after three sessions of ambiguous answers.

Failure Modes and Mitigations

Question Explosion: Too many questions slow session startup. Mitigation: trigger conditions (not every question runs every session) and question pruning (archive questions for completed milestones).

Flaky Executors: File system state can be inconsistent mid-session. Mitigation: run questions at session boundaries (start and end), not during active work.

Answer Interpretation Drift: Agent interprets the same answer differently across sessions. Mitigation: structured answer shapes (boolean, enum, metric) instead of free text.

Human Review Bottleneck: Drift flags pile up faster than humans can triage. Mitigation: severity levels (blocking vs. informational) and auto-resolution for expected drift (version bumps, refactors).

Observability

The production implementation logged:

  • Question execution time (median 1.2s for file system queries, 8s for test suite runs)
  • Drift rate (11% of questions flagged drift per session on average)
  • False positive rate (37% of drift flags were expected changes, not errors)
  • Question lifespan (median 12 sessions before deprecation or completion)

The 37% false positive rate was the biggest operational pain point. It improved to 18% after adding “expected drift” annotations (e.g., “this metric should increase over time”).

When to Use This Pattern

Good fit:

  • Multi-session agent workflows (more than 5 sessions on the same codebase)
  • Projects with explicit milestones or acceptance criteria
  • Domains where ground truth is checkable (tests pass, files exist, metrics hit threshold)
  • Teams that can write good questions (this is harder than it sounds)

Poor fit:

  • Single-session tasks (overhead exceeds benefit)
  • Exploratory work with no clear success criteria
  • Domains where questions can’t be automated (subjective quality, user experience)
  • Teams without time to maintain the question set

Technical Verdict

This pattern trades upfront question-writing cost for reduced memory decay over long horizons. It works when you can articulate what “done” means as a checkable condition, and when the cost of re-execution is lower than the cost of trusting stale answers.

The production run caught 14 instances of answer rot that would have propagated undetected (observed). It also added 2-3 minutes to session startup and required about 30 minutes per week of question maintenance (observed).

Use it if your agent project spans more than a few weeks and has objective success criteria. Skip it if you’re prototyping or if your domain resists automation.

The code, schemas, and execution logs are at github.com/Rocco-alt/standing-questions.

Tags

agentic-ai orchestration infrastructure

Primary Source

dev.to