MemAudit: How Poisoned Agent Memory Gets Detected After the Fact

Persistent memory in LLM agents creates a new attack surface. An adversarial user can inject malicious records through normal interaction, and those records sit dormant until retrieval pulls them into context during a future decision. By the time you notice the agent approved a fraudulent transaction or leaked sensitive data, the damage is done.

MemAudit is a post-hoc auditing framework that traces harmful agent behavior back to specific poisoned memory records. It combines counterfactual causal attribution with structural anomaly detection to answer the forensic question: which stored memories caused this bad output?

The Memory Poisoning Problem

Memory-augmented agents store past interactions in vector databases or key-value stores. Retrieval mechanisms pull relevant records into the prompt context during inference. This pattern improves long-horizon task execution but introduces a supply-chain risk.

Attack vector:

Adversary interacts with the agent normally (no direct database access required)
Agent stores the interaction as a legitimate memory record
Later, semantic similarity or keyword matching retrieves the poisoned record
The agent’s reasoning incorporates the malicious content

This is not prompt injection. The adversarial payload persists in the memory store and affects future sessions, potentially across different users or tasks.

Why prevention fails:

Input filtering cannot distinguish adversarial records from legitimate edge cases at write time
Semantic similarity scoring does not flag malicious intent
The attack surface scales with memory size and retrieval complexity

MemAudit Architecture

MemAudit runs asynchronously after harmful behavior is detected. It does not block the agent’s forward execution. The framework has two components:

1. Counterfactual Memory Influence Score

For each stored memory record, MemAudit measures causal contribution to the harmful output.

Process:

Identify the harmful output (e.g., approved fraudulent transaction)
Retrieve the set of memory records that were in context during that decision
For each record, re-run inference with that record removed (counterfactual)
Compare the counterfactual output to the original harmful output
Assign an influence score based on output divergence

Key insight: If removing a memory record causes the agent to produce a safe output, that record likely contributed causally to the harm.

Implementation considerations:

Requires access to the agent’s inference pipeline and prompt construction logic
Counterfactual re-runs can be parallelized across memory records
Influence scores are relative, not absolute (compare across the candidate set)

2. Memory Consistency Graph

Structural anomaly detection identifies records that are outliers in the broader memory topology.

Graph construction:

Nodes represent memory records
Edges connect records with high semantic similarity or co-retrieval frequency
Node features include embedding distance, retrieval count, and temporal metadata

Anomaly signals:

Low clustering coefficient (isolated record with few semantic neighbors)
High betweenness centrality (record bridges unrelated clusters)
Temporal outliers (recent record with disproportionate retrieval frequency)

Why this matters: Adversarial injections often have unusual structural properties. A malicious record designed to trigger on specific keywords may not fit naturally into the agent’s normal memory distribution.

Evaluation Against MINJA Attack

The paper evaluates MemAudit against MINJA (Memory Injection Attack), a query-only attack where adversarial records are generated through normal agent interactions.

Attack setup:

Adversary crafts benign-looking queries that cause the agent to store malicious records
No direct database modification required
Records are semantically similar to legitimate interactions but contain adversarial payloads

Detection performance:

Metric	QA Setting	Reasoning Agent
Precision (top-5 flagged records)	0.87	0.82
Recall (poisoned records identified)	0.79	0.74
False positive rate	0.08	0.11
Audit latency (per harmful output)	12s	18s

Key result: Combining causal attribution with structural anomaly detection outperforms either signal alone. Causal scores identify records that directly influenced the harmful output. Structural anomalies catch records that are semantically unusual even if their causal contribution is indirect.

Deployment Shape

MemAudit runs as a separate audit service, not inline with agent execution.

Trigger conditions:

Automated: flag outputs that fail safety classifiers or policy checks
Manual: security team investigates a reported incident
Scheduled: periodic audits of high-risk agent deployments (e.g., financial transaction agents)

Pipeline:

Incident detection system logs the harmful output and retrieves the full prompt context
Audit service receives the context and queries the memory store for candidate records
Counterfactual inference runs in parallel across candidate records
Structural anomaly detector scores each record based on graph topology
Combined ranking surfaces the top-k suspicious records for human review

State management:

Audit service maintains a separate read-only replica of the memory store
Counterfactual inference uses the same LLM endpoint as the production agent (or a dedicated audit endpoint)
Results are logged to a forensic database for compliance and incident response

Failure Modes

Counterfactual inference limitations:

If the agent’s reasoning is stochastic, removing a single memory may not deterministically change the output
High-temperature sampling or retrieval randomness can obscure causal relationships
Mitigation: run multiple counterfactual samples and aggregate influence scores

Structural anomaly false positives:

Legitimate edge cases (e.g., rare user queries) may appear structurally anomalous
New agent deployments have sparse memory graphs, reducing anomaly signal
Mitigation: combine structural scores with causal attribution to reduce false positives

Adversarial evasion:

Attacker can craft records that blend into the memory distribution (low structural anomaly score)
Attacker can inject multiple low-influence records that collectively cause harm
Mitigation: audit at the cluster level, not just individual records

Audit latency:

Counterfactual inference requires multiple LLM calls per memory record
Large memory stores (10k+ records) make exhaustive audits impractical
Mitigation: pre-filter candidates using retrieval scores or temporal heuristics before running counterfactuals

Code Sketch: Counterfactual Influence Score

def compute_influence_score(
    agent_pipeline,
    harmful_output: str,
    retrieved_memories: list[MemoryRecord],
    target_memory: MemoryRecord
) -> float:
    """
    Measure causal contribution of a single memory record
    to a harmful output via counterfactual inference.
    """
    # Remove target memory from context
    counterfactual_context = [
        m for m in retrieved_memories if m.id != target_memory.id
    ]
    
    # Re-run agent inference without the target memory
    counterfactual_output = agent_pipeline.run(
        context=counterfactual_context,
        temperature=0.0  # Deterministic for comparison
    )
    
    # Compare outputs using semantic similarity
    influence = 1.0 - cosine_similarity(
        embed(harmful_output),
        embed(counterfactual_output)
    )
    
    return influence

# Parallel audit across all retrieved memories
influence_scores = {}
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {
        executor.submit(
            compute_influence_score,
            agent_pipeline,
            harmful_output,
            retrieved_memories,
            mem
        ): mem.id
        for mem in retrieved_memories
    }
    for future in as_completed(futures):
        mem_id = futures[future]
        influence_scores[mem_id] = future.result()

# Rank memories by influence
ranked = sorted(
    influence_scores.items(),
    key=lambda x: x[1],
    reverse=True
)

Observability Hooks

Metrics to track:

Audit trigger rate (harmful outputs per 1k agent runs)
Mean influence score of flagged records
Structural anomaly score distribution across memory store
Counterfactual inference latency (p50, p95, p99)
False positive rate (flagged records confirmed benign after human review)

Logging:

Store full prompt context for every audited output
Log counterfactual outputs for reproducibility
Track memory record lineage (when created, by which user, retrieval frequency)

Alerting:

High-influence records with low structural anomaly scores (potential evasion)
Clusters of related suspicious records (coordinated attack)
Audit latency exceeding SLA (infrastructure bottleneck)

Security Boundaries

Isolation:

Audit service runs in a separate namespace with read-only access to the memory store
Counterfactual inference uses a dedicated LLM endpoint to avoid poisoning production traffic
Forensic logs are write-once, append-only to prevent tampering

Access control:

Only security team and incident response have access to audit results
Memory store access is logged and rate-limited
Counterfactual inference requests are authenticated and audited

Data retention:

Flagged memory records are quarantined, not deleted (preserve evidence)
Audit logs are retained per compliance requirements (e.g., SOC 2, GDPR)
Counterfactual outputs are encrypted at rest

Technical Verdict

Use MemAudit when:

Your agent stores persistent memory across sessions or users
The agent makes high-stakes decisions (financial transactions, compliance checks, access control)
You need forensic evidence for incident response or compliance audits
Prevention-only defenses (input filtering, output blocking) are insufficient

Avoid or defer when:

Your agent is stateless or uses only ephemeral context
Audit latency is unacceptable (real-time blocking required)
Memory store is small enough to manually review
You lack infrastructure for parallel counterfactual inference

Practical next steps:

Instrument your agent pipeline to log full prompt context on every run
Build a read-only replica of your memory store for audit queries
Implement a basic influence scoring function before adding structural anomaly detection
Define trigger conditions for automated audits (safety classifier thresholds, policy violations)