LLM Skirmish: Why Real-Time Strategy Games Expose Agent Coordination Bottlenecks

Frontier LLMs excel at generating complete applications in one shot, yet they struggle with sequential game navigation, as their poor performance in Pokémon Red’s Mt. Moon shows. LLM Skirmish exposes why: the benchmark requires models to write code that executes under latency constraints, resource contention, and adversarial pressure, all conditions absent from single-turn evaluations. This is a dev-tools story about benchmark design, not agent orchestration theory: the project tests how well code-generation models handle the plumbing of real-time execution.

The project adapts the open-source engine of Screeps, an MMO RTS in which players write JavaScript to control units and manage resources in persistent game worlds. Each tournament runs five rounds with a 2,000-frame limit per match. The engine executes each model’s submitted code every frame, with a one-second runtime budget per frame.

Why RTS Games Break Standard Benchmarks

Traditional LLM evaluations test isolated tasks: generate a function, answer a question, summarize text. Real-time strategy games add three constraints that test coordination primitives:

Concurrent state updates: Multiple units act simultaneously. If your harvester and soldier both try to move to the same tile, the engine must serialize the conflict.
Latency budgets: You get one second of compute per frame. If your pathfinding algorithm takes 1.2 seconds, you forfeit that turn.
Adversarial planning: Your opponent’s actions invalidate your assumptions mid-execution. Static plans fail.

The Screeps paradigm exposes the coordination layer that standalone coding tasks hide. Models that excel at generating standalone functions must now also handle event loops, state machines, and resource locks.
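To make the state-machine requirement concrete, here is a minimal sketch of a two-state harvester whose state survives between frames. The creep is a plain stand-in object, not the Screeps API: the fields `carry` and `capacity`, the state names, and the function `stepHarvester` are all hypothetical.

```javascript
// Hypothetical two-state harvester. The state is stored in creep.memory
// so it persists from one frame to the next.
function stepHarvester(creep) {
  const memory = creep.memory;
  if (!memory.state) memory.state = "HARVEST";

  // Check transitions first so each frame performs exactly one action.
  if (memory.state === "HARVEST" && creep.carry >= creep.capacity) {
    memory.state = "DELIVER";
  } else if (memory.state === "DELIVER" && creep.carry === 0) {
    memory.state = "HARVEST";
  }

  if (memory.state === "HARVEST") {
    creep.carry = Math.min(creep.capacity, creep.carry + 10); // stand-in for a harvest call
    return "harvest";
  }
  creep.carry = 0; // stand-in for a transfer call
  return "deliver";
}

// Simulate five frames with a mock creep.
const creep = { carry: 0, capacity: 20, memory: {} };
const trace = [];
for (let frame = 0; frame < 5; frame++) trace.push(stepHarvester(creep));
console.log(trace.join(","));
```

Because the only per-frame work is one comparison and one action, the cost stays flat no matter how long the match runs.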

Architecture: Code Submission to Execution

LLM Skirmish runs a modified Screeps engine with a thin orchestration layer:

Pre-match: LLM receives game rules, API documentation, and opponent history (in rounds 2-5).
Code generation: Model outputs a JavaScript strategy file.
Validation: Engine parses the code, checks for syntax errors and forbidden APIs.
Execution: Strategy runs in a sandboxed V8 context with access to game state via Memory and Game objects.
Frame tick: Engine advances one frame, applies all unit actions, resolves conflicts, updates state.
Repeat: Strategy code runs again with updated state until match ends.
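The execute-then-tick cycle above can be sketched as a small loop. This is an illustration under stated assumptions, not the LLM Skirmish engine: `runMatch`, the `score` action type, and the state fields are hypothetical, and validation and sandboxing are omitted.

```javascript
// Hypothetical match loop: run every strategy, queue its actions, then tick.
function runMatch(strategies, state, maxFrames) {
  for (let frame = 0; frame < maxFrames && !state.winner; frame++) {
    const queue = [];
    for (const [player, strategy] of Object.entries(strategies)) {
      // Each strategy receives a copy, so mutating it cannot change engine state.
      const view = JSON.parse(JSON.stringify(state));
      view.frame = frame;
      strategy(view, (action) => queue.push({ player, ...action }));
    }
    // Frame tick: apply queued actions in submission order.
    for (const action of queue) {
      if (action.type === "score") state.scores[action.player] += action.amount;
    }
    state.framesPlayed = frame + 1;
    for (const [player, score] of Object.entries(state.scores)) {
      if (!state.winner && score >= state.target) state.winner = player;
    }
  }
  return state;
}

const result = runMatch(
  {
    a: (view, act) => act({ type: "score", amount: 2 }),
    b: (view, act) => act({ type: "score", amount: 1 }),
  },
  { scores: { a: 0, b: 0 }, target: 10, winner: null, framesPlayed: 0 },
  2000
);
console.log(result.winner, result.framesPlayed);
```

The strategy never sees the effect of its own action until the next iteration, which is the "no intermediate feedback" property in miniature.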

The key bottleneck: models must predict how their code will behave across 2,000 frames without intermediate feedback. There’s no REPL, no debugger, no ability to inspect why your harvester stopped collecting resources mid-match.

State Management and Conflict Resolution

The engine uses a shared memory model with deterministic action serialization. All unit actions are queued during the strategy execution phase, then applied in strict submission order during the frame tick. This order is based on when each agent’s code calls the action method within the frame, not on model identity or ELO.

Given identical initial state and agent code, action order is deterministic across replays, ensuring consistent conflict resolution. This differs from priority-queue systems where ties might be broken by unit ID or random selection. The Screeps engine documentation confirms that action order within a frame follows call order in the user’s code, making conflict resolution predictable but requiring careful sequencing.
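A minimal sketch of submission-order resolution for movement follows. It assumes a simplified rule (the first queued move onto a tile wins, later moves are rejected) and ignores tiles that are already occupied; `resolveMoves` and the action shape are hypothetical, not the engine’s actual data structures.

```javascript
// Hypothetical resolver: the first queued move onto a tile wins.
function resolveMoves(queue) {
  const claimed = new Map(); // "x,y" -> id of the unit that claimed the tile
  const results = [];
  for (const { unit, x, y } of queue) {
    const key = `${x},${y}`;
    if (claimed.has(key)) {
      results.push({ unit, ok: false, blockedBy: claimed.get(key) });
    } else {
      claimed.set(key, unit);
      results.push({ unit, ok: true });
    }
  }
  return results;
}

const queue = [
  { unit: "harvester1", x: 5, y: 5 },
  { unit: "soldier1", x: 5, y: 5 }, // same tile, submitted second
  { unit: "soldier2", x: 6, y: 5 },
];
const first = resolveMoves(queue);
const second = resolveMoves(queue); // identical input yields an identical outcome
console.log(JSON.stringify(first));
```

Since the outcome depends only on queue order, reordering two calls in the strategy code is enough to change which unit gets the tile.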

Conflict Type	Resolution Strategy	Outcome
Movement collision	First submitted action wins tile	Units block each other, creating deadlocks
Resource contention	First harvester to call `harvest()` gets access	Other harvesters idle while waiting
Build orders	Spawn processes one unit per frame	Build queue stalls if spawn is under attack
API rate limits	Hard cap at 10 CPU per frame	Complex pathfinding triggers timeout

Models that generate monolithic strategy functions hit the CPU cap. Winning strategies use lazy evaluation: calculate paths only when units are idle, cache results in Memory, and bail early if the opponent’s position hasn’t changed.

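// Naive approach: recalculates the path every frame
const path = PathFinder.search(creep.pos, target.pos);
creep.moveByPath(path.path);

// Winning approach: cache and reuse
// Memory object persists across frames within a match
if (!creep.memory.path || creep.memory.target !== target.id) {
  // Recompute only when no path is cached or the target has changed
  creep.memory.path = PathFinder.search(creep.pos, target.pos).path;
  creep.memory.target = target.id;
}
// Caveat: stock Screeps stores Memory as JSON, so cached positions may need
// rehydrating (or a serialized path) before moveByPath accepts them on later frames
creep.moveByPath(creep.memory.path);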

The difference: 8 CPU per frame vs. 0.3 CPU per frame for a five-unit army.

Latency Budget and Turn Forfeiture

Each model’s strategy gets one second of wall-clock time per frame. The engine enforces this with a hard timeout:

Code runs in a separate process with ulimit constraints.
If execution exceeds 1,000ms, the process is killed.
The agent forfeits that frame: units don’t move, harvesters don’t collect, soldiers don’t attack.
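The forfeit rule can be illustrated with a small sketch. Note the substitution: the article describes a separate process killed after 1,000ms, whereas this example uses the in-process `timeout` option of Node’s `vm` module as a simpler stand-in. The helper `runFrame`, the `act` callback, and the 50ms budget are hypothetical.

```javascript
const vm = require("vm");

// Hypothetical helper: run one frame of strategy code and forfeit the frame
// if the code throws or exceeds its time budget.
function runFrame(code, gameView, budgetMs) {
  const actions = [];
  const context = vm.createContext({
    Game: gameView,
    act: (action) => actions.push(action),
  });
  try {
    vm.runInContext(code, context, { timeout: budgetMs });
    return { forfeited: false, actions };
  } catch (err) {
    // A forfeited frame discards every action queued before the failure.
    return { forfeited: true, reason: err.code || err.name, actions: [] };
  }
}

const fast = runFrame('act({ type: "move", dir: 1 });', { time: 1 }, 50);
const slow = runFrame("while (true) {}", { time: 1 }, 50);
console.log(fast.forfeited, slow.forfeited, slow.reason);
```

Dropping the partial action list on timeout mirrors the all-or-nothing behaviour described above: a frame that runs long loses everything, not just the unfinished work.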

This reveals a coordination failure that single-turn benchmarks miss: a model can generate syntactically correct code that works in isolation but becomes unusable under time pressure. According to the tournament documentation, GPT-5.2 ran at the high reasoning level and won 68% of matches. The creators note that xhigh slowed rounds down in initial testing without notable improvement over high, so it was not used in the official tournament.

Observable Failure Modes

The tournament leaderboard reveals three patterns:

Claude Opus 4.5 (85% win rate, 1778 ELO): Generates compact strategies with explicit state machines. Uses Memory aggressively to avoid recalculation. Rarely hits CPU cap.

GPT-5.2 (68% win rate, 1625 ELO): Produces more verbose code with better comments but occasionally exceeds CPU budget on complex maps. Recovers well in rounds 2-5 by simplifying logic.

Gemini 3 Pro (26% win rate, 1297 ELO): Strong performance in rounds 1-3, collapses in rounds 4-5. The model appears to incorporate opponent history in later rounds but generates increasingly complex conditionals that exceed the CPU timeout. This suggests a failure to balance historical analysis with execution efficiency.

The Gemini failure is instructive. The model correctly identifies that later rounds should use opponent data to refine strategy. But it lacks the architectural intuition to keep the decision tree shallow. Instead of caching opponent patterns in Memory, it re-analyzes the full history every frame.
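The alternative, analyzing opponent history once and caching the result, might look like the sketch below. The history format, the `opening` field, and the functions `summarizeHistory` and `planFrame` are hypothetical; the `analyses` counter exists only to show how often the expensive step runs.

```javascript
// Hypothetical summary of opponent history: which opening was used most often.
function summarizeHistory(history) {
  const counts = {};
  for (const match of history) {
    counts[match.opening] = (counts[match.opening] || 0) + 1;
  }
  const ranked = Object.entries(counts).sort((a, b) => b[1] - a[1]);
  return { favorite: ranked.length ? ranked[0][0] : null, matches: history.length };
}

let analyses = 0; // instrumentation for this demo only

function planFrame(memory, history) {
  // Re-analyze only when the history has grown, not on every frame.
  if (!memory.opponent || memory.opponent.matches !== history.length) {
    memory.opponent = summarizeHistory(history);
    analyses++;
  }
  return memory.opponent.favorite === "rush" ? "defend" : "expand";
}

const memory = {};
const history = [{ opening: "rush" }, { opening: "rush" }, { opening: "eco" }];
let plan;
for (let frame = 0; frame < 2000; frame++) plan = planFrame(memory, history);
console.log(plan, analyses);
```

The per-frame decision is then a single cached lookup, which keeps the decision tree shallow regardless of how much history accumulates.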

Inter-Agent Coordination Primitives

LLM Skirmish doesn’t support multi-agent teams yet, but the single-agent architecture reveals primitives that would apply:

Event sourcing: All game state changes are logged. Agents could replay history to detect opponent patterns.
Message passing: The Memory object acts as a key-value store. Multi-agent systems could use it for coordination signals.
Partial observability: Agents only see units within vision range. Fog of war forces probabilistic reasoning about opponent state.

The current design uses a synchronous execution model: all code runs, then the engine ticks. A multi-agent version would need to decide whether agents submit actions simultaneously (risking conflicts) or sequentially (giving later agents an information advantage).

Deployment Shape and Observability

The live demo at llmskirmish.com runs matches on-demand:

Frontend: Static site with match replay viewer.
Backend: Node.js server that spawns isolated V8 contexts for each agent.
Persistence: Match logs stored as JSON, replayed client-side for visualization.

There’s no distributed orchestration layer. Each match runs on a single core with a 30-second timeout for the full 2,000 frames. This works because the game engine is deterministic: given the same initial state and agent code, the outcome is identical.
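One way to check the determinism property is to run the same match twice and compare digests of the logs. This is a sketch only: `simulate`, the toy strategy, and the log format are hypothetical, and the hashing uses Node’s built-in `crypto` module.

```javascript
const crypto = require("crypto");

// Hypothetical deterministic simulation: no randomness and no clocks,
// only the previous state and the frame number.
function simulate(initialState, strategy, frames) {
  let state = { ...initialState };
  const log = [];
  for (let frame = 0; frame < frames; frame++) {
    state = strategy(state, frame);
    log.push(state);
  }
  return JSON.stringify(log);
}

function digest(text) {
  return crypto.createHash("sha256").update(text).digest("hex");
}

const strategy = (state, frame) => ({
  energy: state.energy + (frame % 3),
  x: (state.x + 1) % 50,
});
const runA = digest(simulate({ energy: 0, x: 10 }, strategy, 2000));
const runB = digest(simulate({ energy: 0, x: 10 }, strategy, 2000));
console.log(runA === runB);
```

A mismatch between the two digests would point to a hidden source of nondeterminism, such as wall-clock reads or unseeded randomness in the strategy.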

Observability is minimal. The replay viewer shows unit positions and resource counts but doesn’t expose internal agent state. You can’t see what the model cached in Memory or why it chose to attack instead of retreat.

Lessons for Production Agentic Systems

The sandbox uses the Node.js vm module with a restricted context:

No filesystem access.
No network access.
No access to process or require.
CPU and memory limits enforced by the parent process.

The vm module has a history of escapes, including prototype pollution via Object.prototype manipulation and context breakout through constructor chains. A production version would need a stronger boundary: separate containers, seccomp filters, or a WASM-based runtime.
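A restricted context of this kind can be sketched with the real `vm.createContext` and `vm.runInContext` calls. The helper `makeSandbox` and the shapes of `Game` and `Memory` are hypothetical. The Node.js documentation states that the vm module is not a security mechanism, and host objects passed in this way still expose their constructor chain, so treat this as an illustration of the restriction, not a safe boundary.

```javascript
const vm = require("vm");

// Hypothetical sandbox setup: expose only the objects a strategy may touch.
function makeSandbox(gameView, memory) {
  const sandbox = Object.create(null); // no properties inherited from the host
  sandbox.Game = gameView;
  sandbox.Memory = memory;
  return vm.createContext(sandbox);
}

const context = makeSandbox({ time: 42 }, { seen: 0 });
const probe = vm.runInContext(
  "Memory.seen = Game.time; [typeof require, typeof process].join(',')",
  context,
  { timeout: 100 }
);
console.log(probe, context.Memory.seen);
```

Inside the context, `require` and `process` are simply not defined, while writes to `Memory` remain visible to the parent, which is how per-frame state can be carried forward.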

The event-queue serialization model mitigates some risks. Because all actions are queued before execution, an agent can’t observe the immediate result of its own actions within the same frame. This prevents feedback loops where malicious code adapts in real time to engine responses.

These sandbox constraints mirror production challenges: any system that executes LLM-generated code must enforce resource limits, prevent side effects, and handle timeouts gracefully. The tournament data shows that models optimized for reasoning depth (like Gemini 3 Pro) struggle more with these constraints than models that generate compact, stateful code (like Claude Opus 4.5).

When to Use This Pattern

Real-time strategy games as benchmarks make sense when:

You need to test agent behavior under latency constraints.
Your production system involves concurrent state updates and resource contention.
Single-turn correctness is necessary but not sufficient.

They don’t make sense when:

Your agents operate in batch mode with no time pressure.
State updates are serialized and conflicts are rare.
You care more about reasoning depth than execution speed.

The Screeps paradigm (code as strategy) is particularly useful for testing models that generate long-running processes: workflow orchestrators, infrastructure controllers, trading bots. These systems share the same failure modes: they must handle partial failures, recover from timeouts, and adapt to adversarial inputs.

Technical Verdict

LLM Skirmish reveals a gap between generation quality and execution robustness. Models that produce clean code in isolation struggle when that code must run in a tight loop with strict latency budgets. The benchmark is valuable because it tests coordination primitives (memory management, conflict resolution, timeout handling) that matter in production agentic systems.

Use this pattern when you’re building agents that must operate in real-time environments with resource constraints. Avoid it if your agents can afford to take their time or if you’re optimizing for reasoning depth over execution speed.

The tournament data shows that explicit state machines with aggressive caching beat reactive strategies that try to reason through every decision. If you’re building orchestration layers for time-sensitive agents, prioritize compact code with disciplined memory management over verbose reasoning chains. Claude Opus 4.5’s 85% win rate demonstrates that execution discipline matters more than reasoning sophistication when latency budgets are tight.