Most self-evolving multi-agent systems store learned experiences in a single centralized memory repository. Every agent reads from and writes to the same database. This creates predictable problems: communication overhead scales with agent count, coordination locks slow down parallel execution, and all agents converge toward identical behavior because they share the same training signal.
DecentMem, a new framework from researchers at multiple institutions, replaces the shared repository with per-agent memory pools. Each agent maintains two local stores: an exploitation pool of validated past trajectories and an exploration pool of LLM-generated candidates for unseen contexts. The system reweights these pools online using feedback from an LLM judge, achieving O(log T) cumulative regret while eliminating the central coordination bottleneck.
The Centralized Memory Problem
When agents share a single memory store, you pay three costs:
Communication overhead: Every memory read or write requires network round-trips. In a system with N agents, this scales as O(N) per operation. If agents query memory frequently during task execution, the repository becomes a hot path.
Coordination locks: Concurrent writes require synchronization. You either serialize updates (killing parallelism) or accept stale reads (breaking consistency). Most frameworks choose serialization, which means agent throughput drops as you add more agents.
Diversity collapse: All agents train on the same experiences. They develop identical strategies and make correlated errors. When one agent fails on a task type, all agents fail the same way.
The paper demonstrates this empirically across AutoGen, DyLAN, and AgentNet. Centralized memory systems show diminishing returns as agent count increases, and agents exhibit homogeneous failure modes.
Decentralized Architecture
DecentMem gives each agent two local memory pools:
Exploitation pool: Stores past trajectories that successfully solved tasks. These are concrete examples of what worked. The agent samples from this pool when it encounters similar contexts.
Exploration pool: Contains LLM-generated candidate strategies for contexts the agent has not seen. These are speculative plans, not validated solutions.
The system weights these pools dynamically. After each task, an LLM judge evaluates the outcome. If the agent succeeded using an exploration candidate, that candidate moves to the exploitation pool with increased weight. If the agent failed, the weight of the entry it used decreases.
This creates a per-agent bandit problem. Each agent learns which memory pool to trust for which context, without coordinating with other agents.
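One way to picture the per-agent bandit is Thompson sampling over the two pools, keyed by a coarse context bucket. The sketch below is an illustration under assumptions, not the paper's algorithm: `PoolBandit`, the bucket strings, and the uniform Beta(1, 1) prior are all hypothetical choices.

```python
import random
from collections import defaultdict

POOLS = ("exploit", "explore")


class PoolBandit:
    """Per-agent Thompson sampling over the two pools, keyed by context bucket."""

    def __init__(self, seed=None):
        # counts[bucket][pool] = [successes, failures]; hypothetical layout.
        self.counts = defaultdict(lambda: {p: [0, 0] for p in POOLS})
        self.rng = random.Random(seed)

    def choose(self, bucket):
        # Draw a plausible success rate for each pool from its Beta posterior
        # (uniform Beta(1, 1) prior) and use the pool with the larger draw.
        draws = {
            pool: self.rng.betavariate(s + 1, f + 1)
            for pool, (s, f) in self.counts[bucket].items()
        }
        return max(draws, key=draws.get)

    def record(self, bucket, pool, success):
        # Judge feedback updates only the pool that produced the action.
        self.counts[bucket][pool][0 if success else 1] += 1
```

Because the counts live inside the agent, no other agent is consulted when choosing a pool. A bucket the agent has never seen starts with equal odds for both pools.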
Memory Synchronization Model
Agents do not synchronize memory in real time. Instead, they share experiences asynchronously through a gossip-style protocol:
- Agent A completes a task and updates its local pools
- Periodically, A broadcasts a summary of high-weight exploitation entries
- Other agents receive the summary and optionally merge it into their exploration pools
- The merged entries compete with locally generated candidates
This is eventual consistency, not strong consistency. Agents can have divergent memory states. The paper proves this still guarantees global reachability: any agent can eventually discover any solution in the search space, given enough exploration budget.
The trade-off is clear. You lose the ability to enforce a single source of truth. You gain parallelism and fault tolerance. If one agent crashes, others continue learning from their local state.
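The gossip steps above can be sketched as two small functions. This is a minimal illustration, not the paper's protocol: `Entry`, `build_summary`, `merge_summary`, and the `top_k` parameter are hypothetical names, and the 0.7 threshold and 0.5 discount are taken from the values used elsewhere in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    trajectory: str
    weight: float


def build_summary(exploit_pool, top_k=3):
    """Pick the highest-weight exploitation entries to broadcast."""
    return sorted(exploit_pool, key=lambda e: e.weight, reverse=True)[:top_k]


def merge_summary(explore_pool, summary, merge_threshold=0.7, discount=0.5):
    """Return a new exploration pool with qualifying peer entries merged in.

    Peer entries enter at a discounted weight, so they have to compete
    with locally generated candidates instead of overriding them.
    """
    known = {e.trajectory for e in explore_pool}
    merged = list(explore_pool)
    for entry in summary:
        if entry.weight > merge_threshold and entry.trajectory not in known:
            merged.append(Entry(entry.trajectory, entry.weight * discount))
            known.add(entry.trajectory)
    return merged


# Agent A broadcasts; agent B merges into its own exploration pool.
a_exploit = [Entry("decompose then verify", 0.9), Entry("guess and check", 0.2)]
b_explore = [Entry("write unit tests first", 0.4)]
b_explore = merge_summary(b_explore, build_summary(a_exploit))
```

Nothing here blocks on a peer: if a summary never arrives, the receiving agent keeps working from its local pools.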
Implementation Details
The dual-pool design maps cleanly to existing multi-agent frameworks. Here is a simplified version of the memory update logic:
```python
class DecentMem:
    def __init__(self, agent_id, merge_threshold=0.7):
        self.agent_id = agent_id
        self.exploit_pool = []  # (trajectory, weight) tuples
        self.explore_pool = []  # (candidate, weight) tuples
        self.judge = LLMJudge()  # judge client, defined elsewhere
        self.merge_threshold = merge_threshold  # minimum peer weight to merge

    # _score_pool, _sample_from_pool, and _reweight_pool are omitted here.

    def sample_action(self, context):
        # Thompson sampling over both pools: _score_pool is expected to
        # return a sampled score, so the comparison below is randomized
        exploit_score = self._score_pool(self.exploit_pool, context)
        explore_score = self._score_pool(self.explore_pool, context)
        if exploit_score > explore_score:
            return self._sample_from_pool(self.exploit_pool, context)
        else:
            return self._sample_from_pool(self.explore_pool, context)

    def update(self, context, action, outcome):
        feedback = self.judge.evaluate(context, action, outcome)
        if feedback.success and action.source == "explore":
            # Promote successful exploration to exploitation
            self.exploit_pool.append((action.trajectory, 1.0))
        # Reweight pools based on feedback
        self._reweight_pool(self.exploit_pool, feedback)
        self._reweight_pool(self.explore_pool, feedback)

    def gossip_sync(self, peer_summaries):
        for summary in peer_summaries:
            # Merge high-value peer experiences into exploration pool
            # at half their reported weight
            if summary.weight > self.merge_threshold:
                self.explore_pool.append((summary.trajectory, summary.weight * 0.5))
```
The key mechanism is the reweighting step. The LLM judge returns structured feedback (success/failure, error type, context similarity). The system applies a multiplicative update rule to the pool weights based on this feedback, similar to Exp3 in the bandit literature.
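For readers unfamiliar with Exp3, the sketch below shows the textbook update applied to the entries of a single pool. It is an illustration of the standard algorithm, not DecentMem's exact rule: `Exp3Pool` is a hypothetical name, and mapping the judge's verdict to a reward in [0, 1] (1.0 for success, 0.0 for failure) is an assumption.

```python
import math
import random


class Exp3Pool:
    """Textbook Exp3 weights over the entries of one memory pool."""

    def __init__(self, entries, gamma=0.1, seed=None):
        self.entries = list(entries)
        self.weights = [1.0] * len(self.entries)
        self.gamma = gamma  # exploration rate in (0, 1]
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the normalized weights with a uniform distribution.
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def sample(self):
        probs = self.probabilities()
        index = self.rng.choices(range(len(probs)), weights=probs)[0]
        return index, probs[index]

    def update(self, index, prob, reward):
        # reward is assumed to lie in [0, 1]. Dividing by the sampling
        # probability keeps the estimate unbiased for rarely drawn entries.
        estimate = reward / prob
        self.weights[index] *= math.exp(self.gamma * estimate / len(self.weights))
```

Under this rule a reward of 0 leaves the raw weight unchanged. The entry's selection probability falls only relative to entries that go on to earn reward, which is one difference from an update that explicitly shrinks weights on failure.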
Failure Modes and Debugging
Decentralized memory introduces new failure modes:
Divergent specialization: Agents develop non-overlapping expertise. Agent A becomes good at math tasks, Agent B at code tasks. If you route a math task to Agent B, it performs poorly even though Agent A has the solution in memory.
Stale exploration pools: If gossip frequency is too low, agents waste time re-exploring solutions their peers already discovered. If gossip frequency is too high, you recreate the communication overhead of centralized memory.
Judge inconsistency: The LLM judge may evaluate the same outcome differently across agents. This creates conflicting weight updates. Agents that got positive feedback for a strategy will exploit it, while agents that got negative feedback will avoid it.
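The gossip-frequency trade-off behind stale exploration pools can be put into rough numbers. The model below is a back-of-the-envelope assumption, not something the paper specifies: every agent sends one summary to `fanout` peers after each `gossip_every` tasks it completes, and `gossip_tradeoff` is a hypothetical helper.

```python
def gossip_tradeoff(num_agents, gossip_every, fanout=None):
    """Estimate gossip cost and delay for a simple broadcast schedule.

    Assumes each agent sends one summary to `fanout` peers after every
    `gossip_every` tasks it completes; fanout defaults to all other agents.
    """
    if fanout is None:
        fanout = num_agents - 1
    return {
        # Average summaries sent per completed task, system-wide.
        "messages_per_task": fanout / gossip_every,
        # Tasks an agent completes between two of its own broadcasts.
        "tasks_between_broadcasts": gossip_every,
    }


for every in (1, 10, 100):
    print(every, gossip_tradeoff(num_agents=8, gossip_every=every))
```

With full broadcast, gossiping after every task costs one message per peer per task, which approaches the traffic of a shared store. Raising `gossip_every` cuts traffic proportionally but lengthens the window in which peers may re-explore a solution that already exists elsewhere.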
To debug these issues, you need per-agent observability:
- Log pool sizes and weight distributions over time
- Track which pool (exploit vs. explore) each action came from
- Measure inter-agent memory divergence using embedding similarity
- Monitor gossip message volume and latency
The paper does not provide production observability tooling, but the architecture suggests you need distributed tracing that correlates agent actions with memory state snapshots.
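Two of the signals in the list above, pool statistics and inter-agent divergence, can be computed with a few lines of standard-library Python. This is a sketch under assumptions: embeddings are taken as plain lists of floats produced by whatever embedding model you already use, pools are assumed non-empty when computing divergence, and `memory_divergence` and `pool_stats` are hypothetical helper names.

```python
import math


def centroid(vectors):
    """Mean embedding of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def memory_divergence(pool_a, pool_b):
    """1 - cosine similarity between the mean embeddings of two pools."""
    return 1.0 - cosine_similarity(centroid(pool_a), centroid(pool_b))


def pool_stats(agent_id, pool_name, weights):
    """One log record per pool; emit it with the logger you already use."""
    return {
        "agent_id": agent_id,
        "pool": pool_name,
        "size": len(weights),
        "mean_weight": sum(weights) / len(weights) if weights else 0.0,
        "max_weight": max(weights, default=0.0),
    }
```

Comparing centroids is a coarse measure: two pools with the same mean embedding can still hold very different entries. It is cheap enough to log on every gossip round, which makes it a reasonable first alarm for divergent specialization.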
Performance Trade-offs
The paper benchmarks DecentMem against centralized memory baselines across five tasks: GSM8K (math), HumanEval (code), HotpotQA (multi-hop reasoning), ALFWorld (embodied tasks), and WebShop (web navigation).
| Metric | Centralized Memory | DecentMem | Improvement |
|---|---|---|---|
| Average accuracy | 62.4% | 77.2% | +23.8% (relative; +14.8 points) |
| Token usage per task | 8,200 | 4,180 | -49% |
| Memory sync overhead | O(N) per write | O(1) local + periodic gossip | Sublinear |
| Agent diversity (measured by strategy variance) | 0.12 | 0.47 | +292% |
The token reduction comes from two sources. First, agents do not need to query a large centralized memory on every decision. Second, local pools are smaller and more specialized, so retrieval is faster and, presumably, places less irrelevant context in the prompt.
The diversity improvement is structural. Agents with different exploration histories develop different strategies. This reduces correlated failures and improves ensemble performance when you aggregate predictions across agents.
When Decentralized Memory Breaks Down
This architecture works when agents operate on independent tasks with occasional overlap. It breaks when:
Strong consistency is required: If agents must agree on a single global state (e.g., a shared bank account balance), eventual consistency is not enough. You need distributed transactions or consensus protocols, which reintroduce coordination overhead.
Task distribution is highly skewed: If 90% of tasks fall into one category, all agents will converge on similar exploitation pools through gossip. You lose the diversity benefit.
Exploration budget is too small: The theoretical O(log T) regret bound assumes agents have enough budget to explore the full solution space. In practice, if you set tight token limits, agents may never discover optimal strategies.
Judge quality is poor: If the LLM judge produces noisy feedback, weight updates become random walks. Agents cannot distinguish good strategies from bad ones.
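A short calculation shows why judge noise matters. The model below is an assumption for illustration only: the judge flips each verdict independently with probability `flip_prob`, and the two strategies have fixed true success rates.

```python
def observed_success_rate(true_rate, flip_prob):
    """Success rate the judge reports when it flips verdicts at random."""
    return true_rate * (1 - flip_prob) + (1 - true_rate) * flip_prob


def observed_gap(good_rate, bad_rate, flip_prob):
    """Difference the agent can actually see between two strategies."""
    return (observed_success_rate(good_rate, flip_prob)
            - observed_success_rate(bad_rate, flip_prob))


for flip_prob in (0.0, 0.25, 0.5):
    print(flip_prob, observed_gap(0.8, 0.4, flip_prob))
```

Algebraically the visible gap is the true gap multiplied by (1 - 2 * flip_prob). At a flip probability of 0.5 the gap vanishes, and feedback carries no information about which strategy is better.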
Deployment Shape
To run DecentMem in production, you need:
- Per-agent state stores: Redis or SQLite for local memory pools. Each agent gets its own database instance.
- Gossip transport: Message queue (RabbitMQ, Kafka) or peer-to-peer protocol (libp2p) for asynchronous memory sharing.
- LLM judge service: Centralized or replicated LLM endpoint for feedback generation. This is the one shared component, but it is stateless and can scale horizontally.
- Observability pipeline: Distributed tracing (Jaeger, Tempo) to correlate agent actions with memory state.
The paper does not specify gossip frequency or merge thresholds, so you will need to tune both against task arrival rate and network latency. A reasonable starting point is gossip every 10 tasks and a merge threshold of 0.7, meaning agents merge only peer experiences with weight above 0.7.
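As a concrete shape for the per-agent state store and the tuning knobs, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, `LocalPoolStore`, `GossipConfig`, and `should_gossip` are all hypothetical; the defaults of 10 tasks and 0.7 are only the starting values suggested in this article.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class GossipConfig:
    gossip_every: int = 10        # tasks between broadcasts (starting point)
    merge_threshold: float = 0.7  # minimum peer weight to merge


class LocalPoolStore:
    """Per-agent SQLite store holding both pools in one table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            " pool TEXT NOT NULL CHECK (pool IN ('exploit', 'explore')),"
            " trajectory TEXT NOT NULL,"
            " weight REAL NOT NULL,"
            " PRIMARY KEY (pool, trajectory))"
        )

    def upsert(self, pool, trajectory, weight):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO entries (pool, trajectory, weight)"
                " VALUES (?, ?, ?)",
                (pool, trajectory, weight),
            )

    def top_entries(self, pool, limit=3):
        rows = self.conn.execute(
            "SELECT trajectory, weight FROM entries WHERE pool = ?"
            " ORDER BY weight DESC LIMIT ?",
            (pool, limit),
        )
        return rows.fetchall()


def should_gossip(tasks_completed, config):
    """True on every `gossip_every`-th completed task."""
    return tasks_completed > 0 and tasks_completed % config.gossip_every == 0
```

Passing a file path instead of `":memory:"` gives each agent a durable database of its own, so an agent that restarts can resume from its last local state.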
Technical Verdict
Use decentralized memory when you have:
- More than 5 agents operating in parallel
- Tasks with diverse contexts where agent specialization is beneficial
- Tolerance for eventual consistency in learned experiences
- Budget for per-agent state storage and gossip overhead
Avoid it when you need:
- Strong consistency guarantees across all agents
- Centralized audit logs for compliance
- Minimal operational complexity (centralized memory is simpler to debug)
- Agents that must always have access to the full history of all other agents
The O(log T) regret bound is theoretically appealing, but the practical win is eliminating the coordination bottleneck. If your multi-agent system is CPU-bound rather than I/O-bound, decentralized memory will not help. If you are hitting memory repository locks or network saturation, this architecture is worth the added complexity.