Coding Agents Got Good in November: What Changed When RL-Trained Models Crossed the Daily-Driver Threshold

November 2025 marked a threshold crossing in coding agents. The change was a shift in quality, from “often-work” to “mostly-work,” rather than a product launch or benchmark milestone. It was subtle enough that it took weeks to become clear, but significant enough that developers started using agents as daily drivers instead of experimental toys.

Simon Willison’s PyCon US 2026 lightning talk provides a retrospective view with enough distance to separate signal from hype. The core technical change: Reinforcement Learning from Verifiable Rewards (RLVR) training runs that OpenAI and Anthropic had been executing throughout 2025 finally produced models that crossed a reliability threshold.

What RLVR Actually Means for Code Generation

Reinforcement Learning from Verifiable Rewards is not a new concept, but applying it to code generation at scale required solving a verification problem. Traditional RLHF (Reinforcement Learning from Human Feedback) doesn’t scale when you need millions of training examples, because each one needs a human judgment. RLVR replaces that judgment with automated verification.

The verification stack:

Unit test execution (pass/fail is binary and cheap)
Type checker output (mypy, TypeScript compiler errors)
Linter violations (style and common bug patterns)
Runtime error detection in sandboxed execution
Diff quality against known-good implementations

The reward signal comes from these verifiable checks, not from human preference rankings. A model receives a positive reward when generated code passes tests, type checks cleanly, and runs without errors, and a negative reward when it fails any of these gates.

This approach scales because verification is deterministic and parallelizable. You can run millions of code generation attempts, execute them in isolated containers, and feed the results back into training without human annotation bottlenecks.
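
As a rough illustration of why verification parallelizes well, the sketch below scores a batch of candidate programs concurrently. The check is deliberately trivial (does the source compile?), and the names `verify` and `score_batch` are hypothetical. A real pipeline would run tests, type checkers, and linters inside sandboxes instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def verify(source: str) -> float:
    """Return 1.0 if the candidate parses as valid Python, else 0.0.

    Stand-in for a real gate (tests, type checker, linter). The source is
    only compiled here, never executed.
    """
    try:
        compile(source, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0


def score_batch(candidates: List[str], workers: int = 4) -> List[float]:
    """Verify candidates concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(verify, candidates))


rewards = score_batch(["x = 1 + 2", "def f(:", "print('ok')"])
```

Each candidate is scored independently of the others, which is what lets the work fan out across threads here, or across containers and machines in a larger setup.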

The Harness Layer: What Codex and Claude Code Actually Provide

Base models generate text. Agent harnesses turn that text into executable workflows. Willison’s talk references OpenAI’s Codex harness and Anthropic’s Claude Code harness as orchestration layers that sit between the model and the execution environment.

Core harness responsibilities (inferred from agent behavior):

Context management: Maintaining file tree state, open buffers, cursor position, and edit history across multi-turn interactions
Tool routing: Deciding when to read files, write files, execute commands, or search codebases based on model output
Sandboxing: Running generated code in isolated environments with resource limits and network restrictions
Error recovery: Capturing execution failures and feeding them back to the model for correction attempts
State checkpointing: Saving intermediate states so failed attempts don’t corrupt working code

The harness is not just a wrapper. It’s a state machine that interprets model outputs as actions, executes those actions safely, and manages the feedback loop. Without the harness, a model can generate plausible code. With the harness, it can iteratively debug and refine until tests pass.
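
The feedback loop described above can be sketched in a few lines. This is a toy under stated assumptions, not how Codex or Claude Code are implemented: `Action`, `Harness`, `run_agent`, the scripted policy, and the fake test runner are all hypothetical stand-ins for a real model and a real sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    kind: str            # "write_file" or "run_tests"
    path: str = ""
    content: str = ""


@dataclass
class Harness:
    # run_tests returns "" on success, otherwise an error message
    run_tests: Callable[[Dict[str, str]], str]
    files: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def step(self, action: Action) -> str:
        """Execute one model-proposed action and return the observation."""
        if action.kind == "write_file":
            self.files[action.path] = action.content
            return f"wrote {action.path}"
        if action.kind == "run_tests":
            error = self.run_tests(self.files)
            return "tests passed" if not error else f"tests failed: {error}"
        return f"unknown action: {action.kind}"


def run_agent(policy: Callable[[str], Action], harness: Harness,
              max_turns: int = 10) -> bool:
    """Loop: the policy proposes an action, the harness executes it and reports back."""
    observation = "start"
    for _ in range(max_turns):
        observation = harness.step(policy(observation))
        harness.log.append(observation)
        if observation == "tests passed":
            return True
    return False


def scripted_policy(observation: str) -> Action:
    """Stand-in for a model: writes a buggy file, then fixes it after a failure."""
    if observation == "start":
        return Action("write_file", "add.py", "def add(a, b): return a - b")
    if observation.startswith("tests failed"):
        return Action("write_file", "add.py", "def add(a, b): return a + b")
    return Action("run_tests")


def fake_tests(files: Dict[str, str]) -> str:
    """Stand-in for sandboxed test execution."""
    return "" if "a + b" in files.get("add.py", "") else "add(2, 3) != 5"


harness = Harness(run_tests=fake_tests)
solved = run_agent(scripted_policy, harness)
```

The failed test run is what triggers the fix: the error message goes back to the policy as its next observation, and the loop stops only once the tests pass or the turn budget runs out.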

Security Boundaries in Harness Execution

Running generated code requires isolation. The November threshold crossing depended on improved sandboxing that made it safe to execute agent output in production-adjacent environments. Current harness implementations use layered sandboxing:

Container-based isolation:

Docker containers with restricted capabilities
No network access by default
Read-only filesystem mounts for dependencies
CPU and memory limits enforced by cgroups
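
One way to express the container restrictions listed above is as a `docker run` argument list. The sketch below only builds the command; it does not start a container. The function name, the image name, the `/work` mount point, and the specific limits are assumptions chosen for illustration.

```python
from typing import List


def sandbox_command(image: str, argv: List[str], workdir: str,
                    memory: str = "512m", cpus: str = "1") -> List[str]:
    """Build a `docker run` invocation with restrictive defaults."""
    return [
        "docker", "run", "--rm",
        "--network=none",                     # no network access
        "--cap-drop=ALL",                     # drop all Linux capabilities
        "--security-opt=no-new-privileges",   # block privilege escalation
        "--read-only",                        # read-only root filesystem
        "--pids-limit=128",                   # cap process count (illustrative value)
        f"--memory={memory}",                 # memory limit, enforced via cgroups
        f"--cpus={cpus}",                     # CPU limit, enforced via cgroups
        "-v", f"{workdir}:/work:ro",          # mount the job directory read-only
        "-w", "/work",
        image, *argv,
    ]


cmd = sandbox_command("sandbox-image", ["python", "-m", "pytest"], "/tmp/job")
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with generated file names.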

Language-level restrictions:

Disabled eval/exec in Python environments
No access to subprocess or os modules
Restricted file I/O to specific directories
Timeout enforcement on all execution
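
A static pre-screen for the first two restrictions might look like the sketch below, which walks the syntax tree and flags `eval`/`exec` calls and imports of `os` or `subprocess`. The function and constant names are hypothetical. A check like this is easy to bypass (for example through `__import__` or attribute lookups), so it can only complement the container isolation above, not replace it.

```python
import ast
from typing import List

FORBIDDEN_MODULES = {"os", "subprocess"}
FORBIDDEN_CALLS = {"eval", "exec"}


def policy_violations(source: str) -> List[str]:
    """List forbidden imports and calls found in the source (static check only)."""
    violations: List[str] = []
    for node in ast.walk(ast.parse(source)):
        # Collect module names from `import x` and `from x import y`
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in FORBIDDEN_MODULES:
                violations.append(f"import {name}")
        # Flag direct calls such as eval("...") or exec("...")
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            violations.append(f"call {node.func.id}")
    return violations


found = policy_violations("import os\nx = eval('1 + 1')")
clean = policy_violations("print(1 + 1)")
```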

Monitoring layer:

System call tracing to detect escape attempts
Resource usage tracking for cost attribution
Output capture and sanitization
Automatic termination on policy violations

The security model assumes generated code is hostile. This is the correct assumption. Models occasionally generate code that attempts to read SSH keys, exfiltrate environment variables, or spawn reverse shells. This happens not through malice, but through pattern matching on training data that included security examples.

Architecture: How RLVR Training Differs from Base Model Training

Standard language model training optimizes for next-token prediction on a static corpus. RLVR training adds a dynamic loop where the model generates code, that code gets executed, and execution results influence future training batches.

# Conceptual structure of an RLVR training loop.
# Illustrative sketch, not production code: real implementations use
# framework-specific APIs (PyTorch, JAX) for generation and weight updates.
import subprocess
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable

def execute_in_sandbox(code: str, tests: str, timeout: int = 30) -> Dict[str, Any]:
    """Run generated code with its tests in an isolated container.

    Returns test pass/fail status plus captured output. "sandbox-image" is a
    placeholder for an image with Python and pytest installed. Production code
    would add fuller error handling and retry logic.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write code and tests to temporary files the container can read.
        # tests.py is assumed to import the code under test from solution.py.
        Path(workdir, "solution.py").write_text(code)
        Path(workdir, "tests.py").write_text(tests)
        try:
            # Execute in a Docker container with resource limits,
            # capturing stdout, stderr, and the exit code
            result = subprocess.run(
                ["docker", "run", "--rm", "--network=none",
                 "--memory=512m", "--cpus=1",
                 "-v", f"{workdir}:/work:ro", "-w", "/work",
                 "sandbox-image",
                 "python", "-m", "pytest", "-p", "no:cacheprovider", "tests.py"],
                capture_output=True,
                text=True,
                timeout=timeout,
                check=False,  # Don't raise on non-zero exit
            )
        except subprocess.TimeoutExpired:
            # This stops waiting on the docker client; the container may keep
            # running, so a real harness would also kill it explicitly.
            return {
                "tests_passed": False,
                "stdout": "",
                "stderr": "Execution timeout exceeded",
            }
    return {
        "tests_passed": result.returncode == 0,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }

def calculate_reward(result: Dict[str, Any]) -> float:
    """Compute a scalar reward from execution results."""
    # pytest prints test tracebacks to stdout, so inspect both streams
    output = result["stdout"] + result["stderr"]
    reward = 0.0
    if result["tests_passed"]:
        reward += 1.0
    if "TypeError" in output:
        reward -= 0.5
    if "SyntaxError" in output:
        reward -= 1.0
    return reward

# Training loop (simplified). `model` is a hypothetical object exposing
# generate() and backward(); neither is a real framework API.
def train(model: Any, training_batches: Iterable[Dict[str, str]]) -> None:
    for batch in training_batches:
        generated_code = model.generate(batch["prompt"])
        result = execute_in_sandbox(generated_code, batch["tests"])
        reward = calculate_reward(result)
        # Update model weights using policy gradient methods
        model.backward(reward)

The key difference: the model learns from the consequences of its own generated code, not just from human-written examples. This creates a feedback loop where the model discovers which patterns lead to working code through trial and error at massive scale.
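
To make the "trial and error" part concrete, here is a toy version of a policy-gradient update (REINFORCE) on a three-armed bandit. The "policy" is just a softmax over three fixed candidate implementations, and the reward comes from running test cases against whichever candidate was sampled. Everything here is an assumption made for illustration: real RLVR runs update a language model's weights, not three logits, and execute candidates in a sandbox rather than with `exec`.

```python
import math
import random
from typing import List

CANDIDATES = [
    "def add(a, b): return a - b",
    "def add(a, b): return a * b",
    "def add(a, b): return a + b",
]


def verifiable_reward(source: str) -> float:
    """Return 1.0 if the candidate passes the test cases, else 0.0."""
    namespace: dict = {}
    exec(source, namespace)  # trusted toy strings only; real systems sandbox this
    add = namespace["add"]
    return 1.0 if add(2, 3) == 5 and add(-1, 1) == 0 else 0.0


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 500, lr: float = 0.5, seed: int = 0) -> List[float]:
    """Sample a candidate, score it, and nudge the logits toward rewarded choices."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = verifiable_reward(CANDIDATES[choice])
        # REINFORCE: d log pi(choice) / d logit_j = 1[j == choice] - probs[j]
        for j in range(len(logits)):
            indicator = 1.0 if j == choice else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return softmax(logits)


final_probs = train()
```

No human ever labels the third candidate as correct. Probability mass moves toward it only because it is the one that passes the tests when sampled.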

What Actually Changed in November

Willison’s talk identifies the shift from “often-work” to “mostly-work” as the critical threshold. The specific failure modes that improved are not enumerated in the source, but the observable change was clear: agents became reliable enough for daily use without constant manual intervention.

The quality barrier crossed was practical, not benchmarked. Developers stopped needing to “spend most of your time fixing their stupid mistakes” (Willison’s phrasing). This is a user-experience threshold, not a percentage-point improvement on a leaderboard.

What became visible after November: gaps in observability and monitoring that didn’t matter when agents failed frequently. When agents work most of the time, understanding why they fail the rest of the time becomes valuable. Current implementations lack standard trace formats for multi-turn sessions, cost attribution for failed versus successful attempts, and reliable confidence scoring that correlates with actual code quality.
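
As a sketch of what cost attribution could look like, the snippet below records one entry per agent turn and splits token usage by whether the session ultimately succeeded. No standard trace format exists, as noted above, so the `TurnRecord` fields, the function name, and the token counts are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TurnRecord:
    session_id: str
    turn: int
    tokens_in: int
    tokens_out: int
    tests_passed: bool


def tokens_by_outcome(records: List[TurnRecord]) -> Dict[str, int]:
    """Attribute each session's total tokens to 'succeeded' or 'failed'.

    A session counts as succeeded if its final turn passed the tests.
    """
    totals: Dict[str, int] = {}
    final_status: Dict[str, bool] = {}
    for record in sorted(records, key=lambda r: (r.session_id, r.turn)):
        spent = record.tokens_in + record.tokens_out
        totals[record.session_id] = totals.get(record.session_id, 0) + spent
        final_status[record.session_id] = record.tests_passed  # last turn wins
    summary = {"succeeded": 0, "failed": 0}
    for session_id, total in totals.items():
        summary["succeeded" if final_status[session_id] else "failed"] += total
    return summary


# Made-up numbers: session "a" recovers on turn 2, session "b" never passes
records = [
    TurnRecord("a", 1, 1000, 200, False),
    TurnRecord("a", 2, 1500, 300, True),
    TurnRecord("b", 1, 800, 100, False),
    TurnRecord("b", 2, 900, 150, False),
]
summary = tokens_by_outcome(records)
```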

The Five Model Leadership Changes

November 2025 saw unprecedented model churn at the frontier. Five different models held or contested the “best for coding” position across the month:

Claude Sonnet 4.5 (September 29): Baseline leader entering November
GPT-5.1 (early November): First to show RLVR improvements in production
Gemini 3 (mid-November): Strong multimodal performance, competitive coding
GPT-5.1 Codex Max (late November): Specialized coding variant with extended context
Claude Opus 4.5 (end of November): Reclaimed lead with balanced performance

Each model represented a different training approach and architectural choice, but all showed evidence of RLVR-style training. The rapid succession indicated that multiple labs had reached similar training milestones simultaneously, suggesting the technique had matured enough for production deployment.

Willison notes that “most practitioners will agree that Opus 4.5 held the crown for the next couple of months” after its late November release.

The Holiday Experimentation Wave

The December-January period saw a surge in ambitious projects testing agent limits. Willison’s micro-javascript project exemplifies the threshold crossing: a JavaScript interpreter in Python, running in Pyodide, running in WebAssembly, running in a browser. Technically impressive, practically useless.

The project’s completion signals that agents had become reliable enough for developers to pursue ambitious, impractical work. This experimentation wave is a leading indicator of tool maturity. When developers start building absurd projects just to see if they can, it means the underlying tool has crossed from “research toy” to “reliable enough to waste time with.”

Willison’s own assessment: “Did anyone out there need a buggy, slow, insecure half-baked implementation of JavaScript in Python? They did not.” But the fact that the agent could complete such a project without constant manual intervention demonstrated the November quality shift.

He describes this period as involving “a form of LLM psychosis as I started spinning up wildly ambitious projects to see how far I could push them,” and notes having “quite a few other projects from that holiday period that I have since quietly retired.”

Technical Verdict

Use coding agents when:

You have comprehensive test suites that define correctness (unit tests, integration tests, type checking). Agents excel when “working” has a verifiable definition.
The task is bounded and well-specified. Implementing a REST endpoint with clear requirements works better than “refactor this module to be more maintainable.”
You can afford multi-turn debugging sessions. Expect 3 to 5 iterations for complex tasks. Token costs add up.
Your environment supports sandboxed execution. Running untrusted code safely is non-negotiable.
Failure is cheap and obvious. If a bad implementation could ship to production undetected, the cost of that failure exceeds the time saved.

Avoid coding agents when:

You need novel algorithmic approaches not well-represented in training data. Agents pattern-match existing solutions; they don’t invent new algorithms.
Security requirements prohibit executing untrusted code in your infrastructure. Some compliance regimes don’t allow this risk.
The problem space is ambiguous. Requests like “make the UI better” or “optimize performance” lack verifiable success criteria.
You’re operating under strict token budgets. Failed attempts consume tokens without producing value.
The codebase is large enough that context management becomes the bottleneck. Current context windows handle medium-sized projects, not monorepos.

The November 2025 threshold made agents viable for daily use in bounded, verifiable tasks. It didn’t make them universal. They’re power tools that require skill to use effectively. The RLVR training breakthrough improved reliability, but the need for human judgment in architecture, security, and correctness verification remains.

Source Links

The last six months in LLMs in five minutes (Simon Willison)