Persistent memory in LLM agents creates a new attack surface. An adversarial user can inject malicious records through normal interaction, and those records sit dormant until retrieval pulls them into context during a future decision. By the time you notice the agent approved a fraudulent transaction or leaked sensitive data, the damage is done.
MemAudit is a post-hoc auditing framework that traces harmful agent behavior back to specific poisoned memory records. It combines counterfactual causal attribution with structural anomaly detection to answer the forensic question: which stored memories caused this bad output?
The Memory Poisoning Problem
Memory-augmented agents store past interactions in vector databases or key-value stores. Retrieval mechanisms pull relevant records into the prompt context during inference. This pattern improves long-horizon task execution but introduces a supply-chain risk.
Attack vector:
- Adversary interacts with the agent normally (no direct database access required)
- Agent stores the interaction as a legitimate memory record
- Later, semantic similarity or keyword matching retrieves the poisoned record
- The agent’s reasoning incorporates the malicious content
This is not prompt injection. The adversarial payload persists in the memory store and affects future sessions, potentially across different users or tasks.
Why prevention fails:
- Input filtering cannot distinguish adversarial records from legitimate edge cases at write time
- Semantic similarity scoring does not flag malicious intent
- The attack surface scales with memory size and retrieval complexity
MemAudit Architecture
MemAudit runs asynchronously after harmful behavior is detected. It does not block the agent’s forward execution. The framework has two components:
1. Counterfactual Memory Influence Score
For each stored memory record, MemAudit measures causal contribution to the harmful output.
Process:
- Identify the harmful output (e.g., approved fraudulent transaction)
- Retrieve the set of memory records that were in context during that decision
- For each record, re-run inference with that record removed (counterfactual)
- Compare the counterfactual output to the original harmful output
- Assign an influence score based on output divergence
Key insight: If removing a memory record causes the agent to produce a safe output, that record likely contributed causally to the harm.
Implementation considerations:
- Requires access to the agent’s inference pipeline and prompt construction logic
- Counterfactual re-runs can be parallelized across memory records
- Influence scores are relative, not absolute (compare across the candidate set)
2. Memory Consistency Graph
Structural anomaly detection identifies records that are outliers in the broader memory topology.
Graph construction:
- Nodes represent memory records
- Edges connect records with high semantic similarity or co-retrieval frequency
- Node features include embedding distance, retrieval count, and temporal metadata
Anomaly signals:
- Low clustering coefficient (isolated record with few semantic neighbors)
- High betweenness centrality (record bridges unrelated clusters)
- Temporal outliers (recent record with disproportionate retrieval frequency)
Why this matters: Adversarial injections often have unusual structural properties. A malicious record designed to trigger on specific keywords may not fit naturally into the agent’s normal memory distribution.
Evaluation Against MINJA Attack
The paper evaluates MemAudit against MINJA (Memory Injection Attack), a query-only attack where adversarial records are generated through normal agent interactions.
Attack setup:
- Adversary crafts benign-looking queries that cause the agent to store malicious records
- No direct database modification required
- Records are semantically similar to legitimate interactions but contain adversarial payloads
Detection performance:
| Metric | QA Setting | Reasoning Agent |
|---|---|---|
| Precision (top-5 flagged records) | 0.87 | 0.82 |
| Recall (poisoned records identified) | 0.79 | 0.74 |
| False positive rate | 0.08 | 0.11 |
| Audit latency (per harmful output) | 12s | 18s |
Key result: Combining causal attribution with structural anomaly detection outperforms either signal alone. Causal scores identify records that directly influenced the harmful output. Structural anomalies catch records that are semantically unusual even if their causal contribution is indirect.
Deployment Shape
MemAudit runs as a separate audit service, not inline with agent execution.
Trigger conditions:
- Automated: flag outputs that fail safety classifiers or policy checks
- Manual: security team investigates a reported incident
- Scheduled: periodic audits of high-risk agent deployments (e.g., financial transaction agents)
Pipeline:
- Incident detection system logs the harmful output and retrieves the full prompt context
- Audit service receives the context and queries the memory store for candidate records
- Counterfactual inference runs in parallel across candidate records
- Structural anomaly detector scores each record based on graph topology
- Combined ranking surfaces the top-k suspicious records for human review
State management:
- Audit service maintains a separate read-only replica of the memory store
- Counterfactual inference uses the same LLM endpoint as the production agent (or a dedicated audit endpoint)
- Results are logged to a forensic database for compliance and incident response
Failure Modes
Counterfactual inference limitations:
- If the agent’s reasoning is stochastic, removing a single memory may not deterministically change the output
- High-temperature sampling or retrieval randomness can obscure causal relationships
- Mitigation: run multiple counterfactual samples and aggregate influence scores
Structural anomaly false positives:
- Legitimate edge cases (e.g., rare user queries) may appear structurally anomalous
- New agent deployments have sparse memory graphs, reducing anomaly signal
- Mitigation: combine structural scores with causal attribution to reduce false positives
Adversarial evasion:
- Attacker can craft records that blend into the memory distribution (low structural anomaly score)
- Attacker can inject multiple low-influence records that collectively cause harm
- Mitigation: audit at the cluster level, not just individual records
Audit latency:
- Counterfactual inference requires multiple LLM calls per memory record
- Large memory stores (10k+ records) make exhaustive audits impractical
- Mitigation: pre-filter candidates using retrieval scores or temporal heuristics before running counterfactuals
Code Sketch: Counterfactual Influence Score
def compute_influence_score(
agent_pipeline,
harmful_output: str,
retrieved_memories: list[MemoryRecord],
target_memory: MemoryRecord
) -> float:
"""
Measure causal contribution of a single memory record
to a harmful output via counterfactual inference.
"""
# Remove target memory from context
counterfactual_context = [
m for m in retrieved_memories if m.id != target_memory.id
]
# Re-run agent inference without the target memory
counterfactual_output = agent_pipeline.run(
context=counterfactual_context,
temperature=0.0 # Deterministic for comparison
)
# Compare outputs using semantic similarity
influence = 1.0 - cosine_similarity(
embed(harmful_output),
embed(counterfactual_output)
)
return influence
# Parallel audit across all retrieved memories
influence_scores = {}
with ThreadPoolExecutor(max_workers=8) as executor:
futures = {
executor.submit(
compute_influence_score,
agent_pipeline,
harmful_output,
retrieved_memories,
mem
): mem.id
for mem in retrieved_memories
}
for future in as_completed(futures):
mem_id = futures[future]
influence_scores[mem_id] = future.result()
# Rank memories by influence
ranked = sorted(
influence_scores.items(),
key=lambda x: x[1],
reverse=True
)
Observability Hooks
Metrics to track:
- Audit trigger rate (harmful outputs per 1k agent runs)
- Mean influence score of flagged records
- Structural anomaly score distribution across memory store
- Counterfactual inference latency (p50, p95, p99)
- False positive rate (flagged records confirmed benign after human review)
Logging:
- Store full prompt context for every audited output
- Log counterfactual outputs for reproducibility
- Track memory record lineage (when created, by which user, retrieval frequency)
Alerting:
- High-influence records with low structural anomaly scores (potential evasion)
- Clusters of related suspicious records (coordinated attack)
- Audit latency exceeding SLA (infrastructure bottleneck)
Security Boundaries
Isolation:
- Audit service runs in a separate namespace with read-only access to the memory store
- Counterfactual inference uses a dedicated LLM endpoint to avoid poisoning production traffic
- Forensic logs are write-once, append-only to prevent tampering
Access control:
- Only security team and incident response have access to audit results
- Memory store access is logged and rate-limited
- Counterfactual inference requests are authenticated and audited
Data retention:
- Flagged memory records are quarantined, not deleted (preserve evidence)
- Audit logs are retained per compliance requirements (e.g., SOC 2, GDPR)
- Counterfactual outputs are encrypted at rest
Technical Verdict
Use MemAudit when:
- Your agent stores persistent memory across sessions or users
- The agent makes high-stakes decisions (financial transactions, compliance checks, access control)
- You need forensic evidence for incident response or compliance audits
- Prevention-only defenses (input filtering, output blocking) are insufficient
Avoid or defer when:
- Your agent is stateless or uses only ephemeral context
- Audit latency is unacceptable (real-time blocking required)
- Memory store is small enough to manually review
- You lack infrastructure for parallel counterfactual inference
Practical next steps:
- Instrument your agent pipeline to log full prompt context on every run
- Build a read-only replica of your memory store for audit queries
- Implement a basic influence scoring function before adding structural anomaly detection
- Define trigger conditions for automated audits (safety classifier thresholds, policy violations)