Most agent systems that learn from experience require gradient updates, parameter servers, or distillation from stronger models. FORGE (Failure-Optimized Reflective Graduation and Evolution) takes a different path: it evolves agent memory through population broadcast of prompt-injected natural language artifacts. No weight updates, no backpropagation. The system coordinates memory sharing across a pool of agent instances.
The paper targets hierarchical ReAct agents running in stochastic environments where zero-shot performance is strongly negative. Examples include network defense, multi-step planning, and similar domains where the agent needs to build domain-specific knowledge without expensive retraining loops.
The Two-Loop Architecture
FORGE wraps a Reflexion-style inner loop with a population-based outer loop.
Inner loop (per agent instance):
- Agent executes a task trajectory using hierarchical ReAct (thought, action, observation).
- On failure, a dedicated reflection agent (same underlying LLM, no distillation) converts the failed trajectory into a memory artifact.
- Memory artifacts come in three flavors: Rules (textual heuristics), Examples (few-shot demonstrations), or Mixed (both).
- The agent retries the task with the new memory injected into its prompt context.
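The inner loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: `run_episode`, `reflect`, the `max_retries` bound, and the "return below a threshold means failure" test are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: an episode runner takes the current memory and
# returns (trajectory, episode_return); a reflector turns a failed
# trajectory into one memory artifact (plain text).
EpisodeFn = Callable[[List[str]], Tuple[List[str], float]]
ReflectFn = Callable[[List[str]], str]


def inner_loop(run_episode: EpisodeFn, reflect: ReflectFn,
               memory: List[str], max_retries: int = 3,
               success_threshold: float = 0.0) -> Tuple[List[str], float]:
    """Retry a task, adding one reflection artifact after each failure."""
    memory = list(memory)  # work on a copy, leave the caller's list alone
    ret = float("-inf")
    for _ in range(max_retries):
        trajectory, ret = run_episode(memory)  # memory is injected into the prompt
        if ret >= success_threshold:           # assumed failure signal
            break                              # successes are not reflected on
        memory.append(reflect(trajectory))     # failure becomes a new artifact
    return memory, ret


# Toy stand-ins so the sketch runs without an LLM or an environment.
def toy_episode(memory: List[str]) -> Tuple[List[str], float]:
    # The toy task "succeeds" once two artifacts are in memory.
    return ["step"], (10.0 if len(memory) >= 2 else -50.0)


def toy_reflect(trajectory: List[str]) -> str:
    return "Rule: restore the web server first."


final_memory, final_return = inner_loop(toy_episode, toy_reflect, [])
```

In the toy run, the first two attempts fail and each adds one artifact; the third attempt succeeds and the loop stops without reflecting.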
Outer loop (across population):
- Multiple agent instances run the inner loop in parallel on different task samples.
- After a fixed number of episodes (a “stage”), the system evaluates each instance’s performance.
- The best-performing instance’s memory is broadcast to every instance that has not yet graduated.
- Instances that meet a graduation criterion (performance threshold) are frozen and stop evolving.
- The population advances to the next stage with updated memory.
The key insight: memory evolution happens through selection and broadcast, not through gradient descent. The LLM weights never change. The prompt context does.
Memory Artifact Types and Token Budget
Each memory type has different token economics and failure modes. The paper evaluates three representation strategies across all experiments.
| Artifact Type | Avg Tokens* | Generalization | Failure Mode | Best For |
|---|---|---|---|---|
| Rules | 80-150 | High (broad patterns) | Over-generalization, contradictory rules accumulate | Common failure patterns, general heuristics |
| Examples | 500-800 | Low (specific cases) | High token cost, limited coverage | Rare edge cases, complex multi-step sequences |
| Mixed | 300-500 | Medium (balanced) | Redundancy between rule and example | Balancing coverage with token efficiency |
*Token counts derived from CybORG CAGE-2 experiments with 30-step trajectories.
Rules are natural language heuristics like “If the attacker is in subnet A, prioritize restoring host B before investigating logs.” They generalize well across similar states but risk over-generalization when the heuristic is too broad. Contradictory rules can accumulate if the reflection agent generates conflicting guidance in different stages.
Examples are full trajectory demonstrations showing the complete sequence of thoughts, actions, and observations that led to success or failure. They work well for rare edge cases and complex multi-step sequences but consume significantly more tokens. In a 30-step task, a single example can occupy 500-800 tokens depending on observation verbosity.
Mixed combines rules with selective examples. The reflection agent generates a rule for the general pattern and includes one or two examples for unusual cases. This balances coverage but introduces redundancy when the rule and example encode similar information.
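One way to represent the three artifact types and render them into a prompt is sketched below. The class names, the header line, and the four-characters-per-token estimate are assumptions made for illustration; the paper does not specify a data model or tokenizer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ArtifactKind(Enum):
    RULE = "rule"
    EXAMPLE = "example"
    MIXED = "mixed"


@dataclass(frozen=True)
class MemoryArtifact:
    kind: ArtifactKind
    text: str


def render_memory(artifacts: List[MemoryArtifact]) -> str:
    """Render artifacts as a text block to prepend to the agent prompt."""
    lines = ["Lessons from earlier failures:"]  # hypothetical header
    for i, artifact in enumerate(artifacts, start=1):
        lines.append(f"{i}. [{artifact.kind.value}] {artifact.text}")
    return "\n".join(lines)


def approx_tokens(text: str) -> int:
    """Very rough estimate (about 4 characters per token), not a tokenizer."""
    return max(1, len(text) // 4)
```

Tagging each entry with its kind keeps the rendered memory easy to audit and makes it simple to measure how much of the budget each type consumes.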
Token budget becomes the primary constraint. At a 30-step horizon with 4-6 memory artifacts per agent, context window utilization can reach 60-70% of an 8K window before the agent even starts executing. FORGE handles this through graduation: once an instance converges, it stops accumulating new memory, so its prompt footprint stays fixed and the remaining compute goes to instances that are still exploring.
An agent graduates only when both conditions hold: its average return over the last 5 episodes exceeds the population median by at least 15%, and that average reaches an absolute threshold (set at +40 reward in the CybORG experiments).
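The two-part graduation test can be written as a small predicate. This is a sketch, not the paper's implementation: the function name is hypothetical, and "exceeds the median by 15%" is interpreted here as `median + 0.15 * abs(median)`, which is an assumption that matters when the median is negative.

```python
from statistics import mean, median
from typing import Sequence


def should_graduate(recent_returns: Sequence[float],
                    population_avgs: Sequence[float],
                    margin: float = 0.15,
                    abs_threshold: float = 40.0,
                    window: int = 5) -> bool:
    """Return True only if both graduation conditions hold."""
    if len(recent_returns) < window:
        return False  # not enough episodes to judge yet
    avg = mean(recent_returns[-window:])
    med = median(population_avgs)
    # Assumed reading of "15% above the median"; abs() keeps the margin
    # positive when the population median is negative.
    beats_median = avg >= med + margin * abs(med)
    reaches_threshold = avg >= abs_threshold
    return beats_median and reaches_threshold
```

An agent averaging +35 against a median of +25 clears the relative test but not the absolute one, so it keeps evolving.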
Without graduation, memory growth is linear with stages. An agent running for 10 stages with 2 artifacts per stage accumulates 20 items. If each artifact averages 200 tokens, that’s 4,000 tokens of memory alone. Graduation caps this growth by freezing high-performing agents at stage 3-5, keeping their memory fixed while other agents continue exploring.
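The arithmetic above is easy to check with a tiny model. The function below is an illustrative simplification (constant artifacts per stage, constant artifact size), and the 8,192-token figure for an "8K window" is an assumption.

```python
from typing import Optional


def memory_tokens(stages: int, artifacts_per_stage: int,
                  tokens_per_artifact: int,
                  graduation_stage: Optional[int] = None) -> int:
    """Tokens of accumulated memory under a simple linear-growth model."""
    # After graduation the agent stops adding artifacts.
    active = stages if graduation_stage is None else min(stages, graduation_stage)
    return active * artifacts_per_stage * tokens_per_artifact


no_graduation = memory_tokens(10, 2, 200)         # 10 * 2 * 200 = 4000
graduated_at_4 = memory_tokens(10, 2, 200, 4)     # 4 * 2 * 200 = 1600
window_share = no_graduation / 8192               # roughly 0.49 of an 8K window
```

Under these numbers, freezing at stage 4 cuts the memory footprint from 4,000 to 1,600 tokens.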
Staged Population Protocol
The outer loop coordinates memory evolution without a central parameter server. The protocol operates in discrete stages, where each stage consists of multiple episodes per agent followed by a population-wide synchronization step.
At the end of each stage:
- All non-graduated agents are evaluated on their performance metrics.
- The best-performing agent’s memory is identified.
- That memory is copied to all other non-graduated agents (the broadcast step).
- Agents meeting the graduation criterion are frozen and excluded from future updates.
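The four end-of-stage steps can be sketched as a single function. The `Agent` class, its fields, and the `meets_criterion` callback are hypothetical; the ordering (broadcast first, then freeze) simply follows the list above and may differ from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    name: str
    memory: List[str] = field(default_factory=list)
    stage_score: float = 0.0
    graduated: bool = False


def end_of_stage(population: List[Agent],
                 meets_criterion: Callable[[Agent], bool]) -> None:
    """Evaluate, pick the best memory, broadcast it, then graduate agents."""
    active = [a for a in population if not a.graduated]
    if not active:
        return
    best = max(active, key=lambda a: a.stage_score)
    for agent in active:
        if agent is not best:
            agent.memory = list(best.memory)  # copy so memories evolve independently
    for agent in active:
        if meets_criterion(agent):
            agent.graduated = True  # frozen: excluded from later stages


population = [
    Agent("a", ["rule-1"], stage_score=10.0),
    Agent("b", ["rule-2", "rule-3"], stage_score=50.0),
    Agent("c", ["frozen-rule"], stage_score=99.0, graduated=True),
]
end_of_stage(population, lambda a: a.stage_score >= 40.0)
```

After the call, agent `a` holds a copy of agent `b`'s memory, `b` is marked graduated, and the already-graduated agent `c` is untouched.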
The broadcast step is critical. Without it, each agent learns in isolation (the Reflexion baseline). With broadcast, the population shares discoveries. If an agent stumbles on a useful heuristic in stage 2 and finishes that stage as the top performer, every non-graduated peer starts stage 3 with that heuristic in its memory.
Graduation prevents memory bloat. Once an agent meets the graduation criterion, it stops reflecting and stops accumulating artifacts. This keeps the token budget bounded and frees up compute for agents still exploring.
Failure-Optimized Reflection
The reflection agent is not a separate model. It’s the same LLM with a different system prompt. The prompt instructs it to:
- Identify the failure point in the trajectory.
- Extract the state context and action that led to failure.
- Formulate a corrective heuristic (Rule) or capture the full trajectory (Example).
- Return the artifact in a structured format for prompt injection.
The reflection prompt includes three hand-crafted few-shot demonstrations of good reflections. This is the only place FORGE uses manually created examples. Everything else is self-generated.
A typical generated Rule looks like this:
Rule: When the attacker has compromised a host in the DMZ subnet, restore the web server before investigating logs on internal hosts.
Rationale: The web server is the attacker’s entry point. Restoring it first prevents re-compromise while you investigate.
A typical generated Example, condensed here to the failing step and its correction, looks like this:
Example: Failed trajectory at step 12
State: Attacker in DMZ, web server down, internal host showing suspicious traffic
Thought: Should investigate internal host first to understand attack scope
Action: Analyze logs on internal host 10.0.1.5
Observation: Logs show lateral movement but attacker re-compromised web server
Corrective action: Restore web server immediately, then investigate
The orchestrator validates the reflection agent’s output before injection: it parses the artifact structure, checks for required fields (state context, action, rationale), and rejects malformed outputs. This structural check reduces the risk of prompt injection when the task environment allows adversarial inputs, but it does not verify that a well-formed artifact is safe to follow.
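A structural validator along these lines might look like the sketch below. The `Key: value` line format is inferred from the two sample artifacts shown earlier, and the required field sets are assumptions; the paper's actual schema may differ.

```python
from typing import Dict, Optional

# Assumed field sets, based on the sample Rule and Example artifacts.
REQUIRED_FIELDS = {
    "rule": {"rule", "rationale"},
    "example": {"example", "state", "action", "corrective action"},
}


def parse_artifact(raw: str) -> Optional[Dict[str, str]]:
    """Parse 'Key: value' lines; return None if the artifact is malformed."""
    fields: Dict[str, str] = {}
    for line in raw.strip().splitlines():
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            return None  # free text or an empty field: reject
        fields[key.strip().lower()] = value.strip()
    for required in REQUIRED_FIELDS.values():
        if required.issubset(fields):
            return fields
    return None  # no known artifact type has all its fields present
```

Rejecting any line that is not a `Key: value` pair is deliberately strict: free-form text is the most likely carrier of injected instructions.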
The “failure-optimized” part means the reflection agent only runs on failed trajectories. Successful trajectories are not reflected on. This keeps the memory focused on correcting mistakes, not reinforcing what already works.
Evaluation on CybORG CAGE-2
The paper tests FORGE on a network defense POMDP with a 30-step horizon. The environment is stochastic, the attacker is adversarial, and zero-shot performance is strongly negative (heavy-tailed, often below -100).
Four LLM families were tested: Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B. All showed similar patterns:
- Zero-shot baseline: strongly negative returns, high variance.
- Reflexion baseline (single-stream learning): modest improvement, still high failure rates.
- FORGE (population broadcast): 1.7x to 7.7x improvement over zero-shot, 29% to 72% improvement over Reflexion.
Major failure rates (episodes below -100) dropped to around 1% with FORGE, compared to 15-30% for Reflexion and 40-60% for zero-shot.
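The major failure rate is simply the fraction of episodes whose return falls below -100. A minimal helper, with a hypothetical name and made-up sample returns, could be:

```python
from typing import Sequence


def major_failure_rate(returns: Sequence[float], threshold: float = -100.0) -> float:
    """Fraction of episodes with a return strictly below the threshold."""
    if not returns:
        raise ValueError("need at least one episode return")
    return sum(r < threshold for r in returns) / len(returns)


# Illustrative data only, not results from the paper.
rate = major_failure_rate([-150.0, -20.0, 30.0, -101.0])
```

Here two of the four sample episodes fall below -100, so `rate` is 0.5.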
The ablation study confirmed that population broadcast is the critical mechanism. A no-graduation variant (where all agents keep evolving indefinitely) performed worse, suggesting that freezing converged instances prevents memory drift and keeps the population focused.
Observability and Debugging
FORGE exposes several observability hooks. Memory snapshots are captured at the end of each stage, showing each agent’s full list of artifacts. Reflection logs contain the raw output of the reflection agent for each failed trajectory, including the parsed state context and generated artifact. Graduation events are logged when an agent meets the criterion and stops evolving. Broadcast events record when the best agent’s memory is propagated to the population, including a diff of the memory changes.
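A broadcast diff like the one described can be produced with Python's standard `difflib`. This is only a sketch of one way to log the change; the function name and the `before`/`after` labels are assumptions, not FORGE's actual log format.

```python
import difflib
from typing import List


def memory_diff(before: List[str], after: List[str]) -> List[str]:
    """Unified diff between an agent's memory before and after a broadcast."""
    return list(difflib.unified_diff(
        before, after, fromfile="before", tofile="after", lineterm=""))


diff = memory_diff(
    ["Rule: restore the web server first."],
    ["Rule: restore the web server first.", "Rule: isolate the DMZ host."],
)
```

Added artifacts appear as lines prefixed with `+`, removed ones with `-`, and an unchanged memory yields an empty list.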
The main debugging challenge is diagnosing why a particular memory artifact didn’t help. If a rule is too vague, it won’t trigger. If an example is too specific, it won’t generalize. The reflection logs are the primary diagnostic tool.
Another failure mode is contradictory rules. If the reflection agent generates a rule in a later stage that contradicts a rule from an earlier one, the agent’s behavior becomes unpredictable. For example, if stage 1 produces “prioritize restoring host B when attacker is in DMZ” and stage 3 produces “prioritize investigating host C when attacker is in DMZ,” both rules sit in the same prompt, and which one the agent follows depends on how the LLM resolves the conflict in context. FORGE doesn’t currently deduplicate or reconcile rules. That’s left to the LLM’s in-context reasoning.
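FORGE itself does not do this, but a naive pre-broadcast check for such conflicts is easy to sketch. It assumes rules have already been split into (condition, action) pairs, and it only catches conditions that match after case and whitespace normalization; semantically equivalent conditions phrased differently would slip through.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def find_conflicts(rules: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each condition to its actions; keep only conditions with more than one."""
    actions_by_condition: Dict[str, Set[str]] = defaultdict(set)
    for condition, action in rules:
        actions_by_condition[_normalize(condition)].add(_normalize(action))
    return {c: a for c, a in actions_by_condition.items() if len(a) > 1}


conflicts = find_conflicts([
    ("attacker is in DMZ", "prioritize restoring host B"),
    ("Attacker is in  DMZ", "prioritize investigating host C"),
    ("attacker is in subnet A", "restore host B"),
])
```

The two DMZ rules are flagged because they share a condition but prescribe different actions; the subnet A rule is left alone.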
Deployment Shape
FORGE is designed for offline batch evaluation, not real-time inference. The population runs in parallel, but each agent instance is stateful and long-lived. Typical deployment:
- Orchestrator: Manages the population, coordinates stages, handles broadcast.
- Agent workers: Each runs an instance of the hierarchical ReAct loop with its own memory.
- Reflection worker: Shared across the population, processes failed trajectories on demand.
- Evaluation harness: Runs the task environment, collects metrics, determines graduation.
The orchestrator is the single point of coordination. It doesn’t need to be distributed because the population size is small (typically 4-8 agents). The agent workers can run on separate machines or in separate containers.
State management is straightforward: each agent’s memory is a list of text artifacts. Serialization is trivial. The orchestrator checkpoints the population state at the end of each stage.
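Because the state is just lists of strings, a stage checkpoint can be a single JSON file. The layout below (keys `stage`, `memories`, `graduated`) and the function names are assumptions for illustration, not the paper's format.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_checkpoint(path: Path, stage: int,
                    memories: Dict[str, List[str]],
                    graduated: List[str]) -> None:
    """Write population state to JSON via a temp file, then rename it into place."""
    state = {"stage": stage, "memories": memories, "graduated": sorted(graduated)}
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)  # avoids leaving a half-written checkpoint behind


def load_checkpoint(path: Path) -> dict:
    """Read a checkpoint written by save_checkpoint."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Writing to a temporary file first means a crash during the write leaves the previous checkpoint intact.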
Security Boundaries
The architecture introduces potential prompt injection risks if the task environment allows adversarial inputs. A malicious actor could craft a failure trajectory that causes the reflection agent to generate a harmful rule. This rule would then propagate to the entire population via broadcast. While the paper does not explicitly discuss these attack vectors, they are inferred from the architecture’s reliance on natural language artifacts that are checked for structure but not for content.
Practical mitigations:
- Sandbox the reflection agent’s output. Parse and validate the structure of generated artifacts before injecting them into agent prompts.
- Limit the scope of rules. Use a schema that constrains rule syntax (e.g., “If [condition], then [action]”).
- Audit broadcast events. Log the full memory diff when broadcasting, and flag anomalies (e.g., sudden rule count spikes).
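Two of these mitigations are simple to prototype. The sketch below assumes the “If [condition], then [action]” schema from the list, with a deliberately strict regular expression, and a hypothetical `max_new` limit for flagging rule count spikes.

```python
import re
from typing import Optional, Tuple

# Strict by design: one comma-free condition, one action, a final period.
RULE_SCHEMA = re.compile(r"^If (?P<condition>[^,]+), then (?P<action>[^.]+)\.$")


def parse_rule(text: str) -> Optional[Tuple[str, str]]:
    """Return (condition, action) if the rule fits the schema, else None."""
    match = RULE_SCHEMA.match(text.strip())
    if match is None:
        return None
    return match.group("condition").strip(), match.group("action").strip()


def rule_count_spike(before: int, after: int, max_new: int = 3) -> bool:
    """Flag a broadcast that adds more than max_new rules at once."""
    return after - before > max_new
```

A strict schema rejects some legitimate rules (for example, an action containing an IP address with dots), so the pattern would need loosening for real use.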
The lack of weight updates is a security advantage. An attacker can’t backdoor the model. The worst they can do is pollute the prompt context, which is easier to detect and roll back.
When to Use FORGE
Good fit:
- Stochastic environments where zero-shot performance is poor.
- Tasks with clear failure signals (e.g., negative rewards, exceptions, timeouts).
- Domains where human-readable memory artifacts are useful for debugging or auditing.
- Scenarios where retraining is expensive or impractical.
Poor fit:
- Real-time inference where latency matters. The reflection step adds overhead.
- Tasks with ambiguous failure modes. If the agent can’t tell when it failed, reflection won’t help.
- Environments with adversarial inputs. Prompt injection risks are higher.
- Single-agent deployments. The population broadcast mechanism is the main value add.
Technical Verdict
FORGE is a practical alternative to gradient-based agent learning when you need memory that evolves cheaply through inference calls alone, with no training step. The population broadcast mechanism is the key innovation: it turns agent learning into a selection problem rather than an optimization problem.
The main trade-off is token budget. Memory artifacts accumulate in the prompt context, and long-running tasks can hit context window limits. Graduation helps, but it’s a coarse-grained solution. A more sophisticated approach would prune or compress memory over time.
The lack of rule deduplication or contradiction detection is a gap. In production, you’d want a reconciliation step before broadcasting memory to the population.
Use FORGE when you have a pool of agents running similar tasks, clear failure signals, and a need for interpretable memory. Avoid it for single-agent deployments, real-time systems, or environments where prompt injection is a concern.