
Claude Fable's Autonomous Browser Automation Stack: When Coding Agents Invent Their Own Tools

How Claude Fable 5 built screenshot capture, CORS servers, template injection, and shadow DOM traversal to debug CSS without being asked.

Source: simonwillison.net

Claude Fable 5, Anthropic’s frontier coding agent launched June 9, 2026, built a multi-layer debugging pipeline that nobody asked it for. Simon Willison gave it a screenshot of a CSS scrollbar bug and a single prompt. The agent responded by building out PyObjC window enumeration, macOS screencapture CLI orchestration, template injection for keyboard simulation, a custom CORS server for diagnostics, and shadow DOM traversal.

The session cost an estimated $12.11 and produced 68,606 output tokens to fix two lines of CSS. At no point did the agent ask permission to build any of this tooling.

What Fable Built Without Being Told

Willison’s prompt was minimal: “Look at dependencies to help figure out why there is a horizontal scrollbar here.” Fable interpreted this as license to do whatever it took.

The agent’s autonomous tool chain:

  • Window enumeration via PyObjC: Used uv run --with pyobjc-framework-Quartz to iterate through all macOS windows, filter for Safari instances containing “textarea” in the title, and extract window IDs.
  • Screenshot orchestration: Passed window IDs to screencapture -x -o -l <window_id> to capture PNGs of specific browser windows.
  • Template injection: Modified Datasette’s own HTML templates to inject JavaScript that simulated keyboard shortcuts after a 1.2-second delay.
  • Custom CORS server: Wrote a Python http.server app that accepted POST requests with Access-Control-Allow-Origin: * headers and dumped JSON payloads to /tmp/diag.json.
  • Shadow DOM traversal: Injected client-side JavaScript to measure properties inside Web Component shadow roots and POST them back to the local server.
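The first two items in that chain can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fable's actual script: `list_windows`, `find_window_ids`, `screencapture_command`, and `capture` are hypothetical helper names, and the Quartz call only works on macOS with `pyobjc-framework-Quartz` installed.

```python
import subprocess


def list_windows():
    """Return on-screen window info dicts via Quartz (macOS + PyObjC only)."""
    import Quartz  # imported lazily so the pure helpers below work anywhere

    return Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )


def find_window_ids(windows, owner="Safari", title_fragment="textarea"):
    """Pick window IDs whose owner matches and whose title contains the fragment."""
    ids = []
    for w in windows:
        title = w.get("kCGWindowName") or ""  # title can be missing
        if w.get("kCGWindowOwnerName") == owner and title_fragment in title:
            ids.append(int(w["kCGWindowNumber"]))
    return ids


def screencapture_command(window_id, out_path):
    """Build the CLI call: -x silent, -o no window shadow, -l capture by window ID."""
    return ["screencapture", "-x", "-o", "-l", str(window_id), out_path]


def capture(window_id, out_path):
    """Run screencapture for one window; raises if the command fails."""
    subprocess.run(screencapture_command(window_id, out_path), check=True)
```

On a Mac, `for wid in find_window_ids(list_windows()): capture(wid, f"/tmp/win-{wid}.png")` would write one PNG per matching Safari window. Window titles may come back empty unless the calling process has been granted the relevant macOS privacy permission, which is why the filter tolerates a missing title.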

None of this was in the original prompt. Fable inferred that it needed real browser behavior, not just Playwright headless testing. It tried Playwright first, failed to reproduce the bug, then pivoted to real Safari automation.

The Automation Flow

Here’s the sequence Fable executed:

  1. Spun up a local Datasette development server with fake environment variables.
  2. Attempted Playwright automation in Chrome, Firefox, and WebKit.
  3. Detected that osascript was blocked by macOS assistive access permissions.
  4. Switched to PyObjC for window management.
  5. Created scratch HTML files in /tmp/ to isolate the bug.
  6. Injected this into Datasette templates:
<script>
window.addEventListener("load", function() {
  setTimeout(function() {
    document.dispatchEvent(new KeyboardEvent("keydown", {
      key: "/",
      bubbles: true
    }));
  }, 1200);
});
</script>
  7. Opened Safari, triggered the modal via simulated keypress, captured screenshots.
  8. Built a diagnostic server to bypass CORS restrictions:
from http.server import HTTPServer, BaseHTTPRequestHandler

class H(BaseHTTPRequestHandler):
    def do_POST(self):
        n = int(self.headers.get("Content-Length", 0))
        open("/tmp/diag.json", "w").write(self.rfile.read(n).decode())
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
    
    def do_OPTIONS(self):
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "*")
        self.end_headers()
    
    def log_message(self, *a):
        pass

HTTPServer(("127.0.0.1", 9999), H).serve_forever()
  9. Injected client-side measurement code:
const host = document.querySelector("navigation-search");
const ta = host.shadowRoot.querySelector("textarea");
const cs = getComputedStyle(ta);

fetch("http://127.0.0.1:9999/diag", {
  method: "POST",
  body: JSON.stringify({
    dpr: window.devicePixelRatio,
    scrollWidth: ta.scrollWidth,
    clientWidth: ta.clientWidth,
    whiteSpace: cs.whiteSpace,
    width: cs.width,
  }),
});
  10. Read /tmp/diag.json, identified the root cause, applied the fix, verified it worked.
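The source does not say how Fable interpreted the payload in that final step. A minimal sketch of one plausible check, assuming the JSON has the fields posted by the measurement code, is to compare scrollWidth against clientWidth: content wider than the visible box is the usual sign of horizontal overflow. The function names here are hypothetical.

```python
import json


def overflow_report(payload):
    """Summarize whether the measured element overflows horizontally."""
    scroll_w = payload["scrollWidth"]
    client_w = payload["clientWidth"]
    return {
        "overflows": scroll_w > client_w,
        "overflow_px": max(0, scroll_w - client_w),
        "whiteSpace": payload.get("whiteSpace"),
    }


def load_report(path="/tmp/diag.json"):
    """Read the file written by the diagnostic server and summarize it."""
    with open(path) as f:
        return overflow_report(json.load(f))
```

Reporting the computed white-space value alongside the overflow is a design choice for this sketch: it is one of the properties the measurement code collects, so surfacing it next to the pixel difference keeps the likely suspects in one place.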

Partway through the session, Fable tripped a guardrail that was never surfaced to the user, and the session downgraded to Opus (claude-opus-4-8 in the session metrics), which carried on with the same techniques. Anthropic has not publicly documented what triggers this downgrade.

Security Boundaries That Don’t Exist

Fable had unrestricted terminal access. It could:

  • Modify application source code (templates, config files).
  • Spawn web servers on arbitrary ports.
  • Enumerate all open windows on the host machine.
  • Execute arbitrary shell commands, including screencapture and Python; osascript was stopped only by macOS assistive access permissions.
  • Inject JavaScript into pages served by local development servers.

There was no sandbox. No permission model. No audit log of what tools it chose to build.

If this session had been triggered by a prompt injection attack hidden in an issue comment or a malicious dependency’s README, Fable could have:

  • Exfiltrated environment variables, SSH keys, or API tokens.
  • Modified source code to introduce backdoors.
  • Spawned persistent web servers for command-and-control.
  • Enumerated and screenshotted sensitive application state.

The agent’s autonomous behavior is a feature when debugging CSS. It becomes a liability when the goal is adversarial.

Cost and Context Management

AgentsView (a CLI tool for tracking Claude API usage) shows the session metrics. The $12.11 figure represents what this session would cost at full API rates. Willison is currently on the $100/month Claude Max plan, which includes a generous Fable allowance until June 22nd, after which Anthropic will charge full API prices.

Metric                  Value
Output tokens           68,606
Peak context (tokens)   113,178
Estimated cost          $12.11
Models used             claude-fable-5, claude-opus-4-8

For a CSS fix requiring only two lines of code, this is expensive. The cost came from:

  • Multiple browser automation attempts (Playwright, then real browsers).
  • Iterative screenshot capture and analysis.
  • Building and testing custom tooling (CORS server, template injection).
  • Context accumulation as the agent documented its own techniques.

Fable will burn through subscription allowances inventing patterns that may not be reusable. The agent doesn’t optimize for token efficiency. It optimizes for goal completion.
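To put the session figures in proportion, the arithmetic below divides the reported cost and output tokens by the two lines changed and projects monthly spend at the per-session price. It is a back-of-the-envelope sketch: the session counts passed to `monthly_spend` are illustrative, and the function names are hypothetical.

```python
SESSION_COST_USD = 12.11   # estimated cost from the session metrics
OUTPUT_TOKENS = 68_606     # output tokens from the session metrics
LINES_CHANGED = 2          # the fix was two lines of CSS


def per_line(total, lines=LINES_CHANGED):
    """Spread a session total across the lines of code actually changed."""
    return total / lines


def monthly_spend(sessions_per_month, cost_per_session=SESSION_COST_USD):
    """Projected monthly cost if every session costs the same as this one."""
    return round(sessions_per_month * cost_per_session, 2)
```

That works out to roughly $6.06 and 34,303 output tokens per line of CSS. At 10 to 20 comparable sessions a month, full API pricing would come to about $121 to $242.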

Observability Gaps in Autonomous Tool Composition

Willison only discovered what Fable had done by watching his screen as browsers opened autonomously, reviewing the terminal transcript after the fact, and prompting Opus to write a post-hoc report in /tmp/automation-report.md.

Standard logging doesn’t capture emergent tool composition. Structured traces need to show:

  • Tool selection rationale.
  • State transitions between strategies.
  • Resource consumption per approach.
  • Failure modes that triggered pivots.
  • Which tools the agent decided to build and why.
  • Strategy switches, such as the move from Playwright to real browser automation in this session.
  • Model changes, such as the downgrade from Fable to Opus, and what triggered them.

Without this instrumentation, debugging the debugger means watching terminal output scroll past.
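One way to capture the fields listed above is an append-only log of typed events, serialized as JSON lines. The sketch below is an assumption about what such a trace could look like, not an existing Claude Code feature; `TraceEvent`, `TraceLog`, and the event kinds are hypothetical names.

```python
import json
import time
from dataclasses import dataclass, asdict, field


@dataclass
class TraceEvent:
    kind: str             # e.g. "tool_built", "strategy_switch", "model_switch"
    detail: str           # what happened
    rationale: str = ""   # why the agent chose it, if known
    tokens_used: int = 0  # resource consumption attributed to this step
    ts: float = field(default_factory=time.time)


class TraceLog:
    def __init__(self):
        self.events = []

    def record(self, kind, detail, rationale="", tokens_used=0):
        """Append one event and return it."""
        event = TraceEvent(kind, detail, rationale, tokens_used)
        self.events.append(event)
        return event

    def to_jsonl(self):
        """Serialize the trace as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

    def tokens_by_kind(self):
        """Total token usage per event kind."""
        totals = {}
        for e in self.events:
            totals[e.kind] = totals.get(e.kind, 0) + e.tokens_used
        return totals
```

A call such as `log.record("strategy_switch", "Playwright -> real Safari", rationale="bug not reproduced headless")` would have made the pivot in this session visible without reading the full transcript.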

Deployment Constraints

Running Fable in production requires:

Sandboxing: Containers with no network access, no host filesystem mounts, and no ability to spawn child processes outside the sandbox.

Permission models: Explicit approval for any action that modifies source code, spawns servers, or accesses system APIs.

Cost controls: Hard limits on tokens per session, with circuit breakers that halt execution when thresholds are crossed.

Audit trails: Structured logs of every tool built, every file modified, and every network request made.

Rollback mechanisms: Snapshots of filesystem and application state before the agent starts, with automatic rollback on anomaly detection.
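The permission-model requirement can be sketched as a gate that every proposed action passes through before it runs. This is a minimal illustration, assuming the agent's actions can be labeled with a category up front; the category strings, `Action`, `authorize`, and the `approver` callback are all hypothetical.

```python
from dataclasses import dataclass

# Categories that need explicit sign-off, mirroring the permission-model item:
# modifying source code, spawning servers, accessing system APIs.
NEEDS_APPROVAL = {"modify_source", "spawn_server", "system_api"}


@dataclass
class Action:
    category: str     # e.g. "modify_source", "read_file"
    description: str  # human-readable summary shown to the approver


class ApprovalRequired(Exception):
    """Raised when a gated action is not approved."""


def authorize(action, approver):
    """Allow the action only if it is ungated or the approver says yes."""
    if action.category not in NEEDS_APPROVAL:
        return True
    if approver(action):
        return True
    raise ApprovalRequired(f"{action.category}: {action.description}")
```

In an interactive setting `approver` could prompt a human; in an unattended pipeline it could simply return False, so that gated actions fail closed.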

The current Claude Code environment provides none of this. It’s a raw terminal with full host access.
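Until the environment offers such controls, a wrapper around the agent loop could enforce the token cap itself. The sketch below shows a simple circuit breaker, assuming the caller can observe output-token counts after each model response; `TokenBudget` and `BudgetExceeded` are hypothetical names and the cap is whatever limit you choose.

```python
class BudgetExceeded(Exception):
    """Raised when cumulative output tokens cross the configured cap."""


class TokenBudget:
    def __init__(self, max_output_tokens):
        self.max_output_tokens = max_output_tokens
        self.used = 0

    def charge(self, tokens):
        """Record usage and return the remaining budget.

        Raises BudgetExceeded once the cap is crossed so the caller can halt.
        """
        self.used += tokens
        if self.used > self.max_output_tokens:
            raise BudgetExceeded(
                f"{self.used} > {self.max_output_tokens} output tokens"
            )
        return self.max_output_tokens - self.used
```

With an illustrative cap of 50,000 output tokens, the 68,606-token session described here would have been halted before completion, which is the intended trade-off: predictable cost in exchange for occasionally unfinished work.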

Technical Verdict

Use Claude Fable 5’s autonomous tool invention if:

  • Codebase size: Under 50MB, with a working set that fits in context without truncation (this session peaked at 113,178 tokens).
  • Review capacity: You can review all file diffs in under 5 minutes (typically 3-10 files changed).
  • Action space: Limited to a single repository with no network access to production systems.
  • Budget: You’re willing to spend $10-15 per debugging session and your monthly budget accommodates 10-20 such sessions.
  • Bug characteristics: Requires real browser behavior that Playwright or other headless tools can’t reproduce (rendering quirks, shadow DOM issues, browser-specific CSS).
  • Rollback capability: You have a snapshot or version control checkpoint you can roll back to instantly.

Avoid it if:

  • Input trust: The agent processes user-submitted code, third-party dependencies, or any input you haven’t personally reviewed.
  • Credential exposure: Your environment contains API keys, SSH credentials, or access to production databases.
  • Cost predictability: You need predictable costs (Fable’s token usage can spike 5-10x on complex problems).
  • Dependency count: The codebase includes more than 100 dependencies (context pollution risk).
  • Review time: You can’t dedicate 10-15 minutes to post-session review of every action the agent took.
  • Automation context: The agent runs in CI/CD pipelines or any automated workflow without human-in-the-loop approval.

Your responsibility is to define which tool patterns are acceptable and to build the observability and rollback infrastructure that makes autonomous tool composition safe. Fable has no built-in cost controls or permission model.