Guardian Runtime: Local Firewall Architecture for AI Coding Agents

AI coding agents like Claude Code, Cursor, and Aider ship with impressive capabilities but few built-in limits on spend or file access. Left unsupervised, they can rack up API bills, read SSH keys, or overwrite production config files unless you audit every session by hand. Guardian Runtime is an open-source localhost proxy that intercepts LLM API calls to enforce cost budgets, block access to sensitive files, and log tool invocations, all without touching agent code.

You point your agent at localhost:8080 instead of api.anthropic.com, and the proxy sits in the request path. It parses streaming Server-Sent Events (SSE) responses, tracks cumulative token spend across multi-turn sessions, and rejects requests that violate policy rules before they hit the upstream API.

Architecture: Interception Without Instrumentation

Guardian Runtime runs as a local HTTP server. The agent behaves as if it is talking to the real LLM endpoint, but a base-URL environment variable (such as ANTHROPIC_BASE_URL) or a DNS override redirects its traffic through the proxy.

Request flow:

Agent sends POST to localhost:8080/v1/messages with API key in headers.
Proxy validates request against policy (token budget, file blocklist, allowed tools, rate limits).
If approved, proxy forwards request to real API and streams response back.
Proxy parses SSE chunks to count tokens and detect tool calls in real time.
If budget exceeded mid-stream, proxy closes connection and logs violation.
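
The validation step in the flow above can be pictured as an ordered chain of checks, where the first failure produces the rejection reason. The sketch below is only an illustration under assumptions: Decision, budget_check, tool_check, and evaluate are hypothetical names, the session is a plain dict, and the request is assumed to carry a "tools" list whose entries have a "name" field. It is not Guardian Runtime's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    approved: bool
    reason: Optional[str] = None


# A check returns a rejection reason, or None to let the request pass.
Check = Callable[[dict, dict], Optional[str]]


def budget_check(request: dict, session: dict) -> Optional[str]:
    # Hypothetical session fields: tokens_used, token_budget.
    if session["tokens_used"] >= session["token_budget"]:
        return "token_budget_exceeded"
    return None


def tool_check(request: dict, session: dict) -> Optional[str]:
    # Assumes the request declares its tools as a list of {"name": ...} dicts.
    allowed = set(session["allowed_tools"])
    for tool in request.get("tools", []):
        if tool["name"] not in allowed:
            return f"tool_not_allowed:{tool['name']}"
    return None


def evaluate(request: dict, session: dict, checks: list) -> Decision:
    """Run checks in order and stop at the first rejection."""
    for check in checks:
        reason = check(request, session)
        if reason is not None:
            return Decision(False, reason)
    return Decision(True)
```

Only an approved Decision would lead to the request being forwarded upstream; a rejected one carries the reason that ends up in the log.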

State model:

The proxy maintains a session store keyed by API key or session ID. Each session tracks:

Cumulative input and output tokens
Total cost (calculated from model pricing tables)
File paths accessed via tool calls
Timestamp of first and last request

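Sessions expire after a configurable idle timeout (default 30 minutes). Because state is keyed by API key or session ID rather than by agent process, restarting the agent does not reset its spend counters; they clear only once the session has been idle for the full timeout.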
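
A minimal sketch of the state model above, assuming an in-memory dict is enough. Session, SessionStore, and touch are hypothetical names, and the injectable clock exists only to make the idle timeout easy to exercise.

```python
import time
from dataclasses import dataclass, field

IDLE_TIMEOUT_SECONDS = 30 * 60  # the 30-minute default described above


@dataclass
class Session:
    first_seen: float
    last_seen: float
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    file_paths: set = field(default_factory=set)


class SessionStore:
    """In-memory store keyed by API key or session ID (hypothetical design)."""

    def __init__(self, idle_timeout=IDLE_TIMEOUT_SECONDS, clock=time.time):
        self._sessions = {}
        self._idle_timeout = idle_timeout
        self._clock = clock

    def touch(self, key: str) -> Session:
        """Return the live session for key, replacing it if it sat idle too long."""
        now = self._clock()
        session = self._sessions.get(key)
        if session is None or now - session.last_seen > self._idle_timeout:
            session = Session(first_seen=now, last_seen=now)
            self._sessions[key] = session
        session.last_seen = now
        return session
```

Calling touch on every request keeps the counters attached to the key, so a restarted agent that reuses the same API key picks up the same session.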

File access enforcement:

Agents often use shell tools (bash, zsh) instead of direct file APIs. Guardian Runtime cannot parse arbitrary shell commands, so it relies on two strategies:

Tool call inspection: When the agent invokes read_file or write_file tools, the proxy checks paths against a blocklist (e.g., ~/.ssh/*, .env).
Heuristic scanning: For shell commands, the proxy runs regex patterns to detect suspicious paths. This is not foolproof but catches common mistakes.
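
The two strategies above might look roughly like this. It is a sketch under assumptions: is_blocked and shell_command_suspicious are hypothetical names, glob matching uses Python's fnmatch (whose * also matches path separators), and the regex is a deliberately small example of the heuristic, not the project's real pattern set.

```python
import fnmatch
import os
import re

BLOCKLIST = ["~/.ssh/*", ".env", "*.pem"]


def is_blocked(path: str, blocklist=BLOCKLIST) -> bool:
    """Check a tool-call path against glob patterns, by full path and by file name."""
    full = os.path.normpath(os.path.expanduser(path))
    name = os.path.basename(full)
    for pattern in blocklist:
        expanded = os.path.expanduser(pattern)
        if fnmatch.fnmatch(full, expanded) or fnmatch.fnmatch(name, pattern):
            return True
    return False


# Heuristic only: flags shell commands that mention sensitive-looking paths.
SUSPICIOUS = re.compile(r"(\.ssh/|\.env\b|\.pem\b)")


def shell_command_suspicious(command: str) -> bool:
    return SUSPICIOUS.search(command) is not None
```

As the text notes, the regex pass is easy to evade (for example with an encoded path), so it is best read as a tripwire for accidents rather than a security boundary.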

If an agent uses curl or wget to exfiltrate data, the proxy will not catch it unless you also configure network-level controls (iptables, Docker network policies).

Policy Configuration

Guardian Runtime uses a YAML config file to define rules. Example:

policies:
  - name: dev_session
    token_budget: 50000
    cost_limit_usd: 5.00
    file_blocklist:
      - "~/.ssh/*"
      - ".env"
      - "*.pem"
    allowed_tools:
      - read_file
      - write_file
      - bash
    rate_limit:
      requests_per_minute: 30
    log_level: debug
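
Once the YAML has been parsed into a dict (for example with a YAML library's safe loader), each policy entry can be validated and frozen before use. The sketch below assumes exactly the keys shown in the example; Policy and parse_policy are hypothetical names, and the fallback defaults are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    token_budget: int
    cost_limit_usd: float
    file_blocklist: tuple = ()
    allowed_tools: tuple = ()
    requests_per_minute: int = 60  # hypothetical default
    log_level: str = "info"        # hypothetical default


def parse_policy(raw: dict) -> Policy:
    """Validate one entry of the policies list and return an immutable Policy."""
    if raw["token_budget"] <= 0:
        raise ValueError("token_budget must be positive")
    if raw["cost_limit_usd"] <= 0:
        raise ValueError("cost_limit_usd must be positive")
    rate = raw.get("rate_limit", {})
    return Policy(
        name=raw["name"],
        token_budget=int(raw["token_budget"]),
        cost_limit_usd=float(raw["cost_limit_usd"]),
        file_blocklist=tuple(raw.get("file_blocklist", ())),
        allowed_tools=tuple(raw.get("allowed_tools", ())),
        requests_per_minute=int(rate.get("requests_per_minute", 60)),
        log_level=raw.get("log_level", "info"),
    )
```

Failing at load time on a zero or negative budget avoids a proxy that silently rejects every request.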

Token budget enforcement:

The proxy tracks tokens in two places:

Request body: Estimates input tokens from the messages array before the request is forwarded.
SSE stream: Parses usage events in the response to capture output tokens.

Once cumulative tokens exceed the budget, the proxy rejects further requests in that session with HTTP 429 and a JSON error body. The agent sees this as a rate limit and typically backs off or halts.
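
To make the SSE half concrete, here is a minimal sketch of pulling usage numbers out of a stream. It assumes Anthropic-style events, where message_start carries message.usage and message_delta carries a cumulative usage.output_tokens; other providers report usage differently, and usage_from_sse is a hypothetical helper name.

```python
import json


def usage_from_sse(lines):
    """Return (input_tokens, output_tokens) from an iterable of SSE text lines."""
    input_tokens = 0
    output_tokens = 0
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines, comments, and blank separators
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # a real proxy would log the unparseable chunk here
        if not isinstance(event, dict):
            continue
        if event.get("type") == "message_start":
            usage = event.get("message", {}).get("usage", {})
            input_tokens = usage.get("input_tokens", 0)
            output_tokens = usage.get("output_tokens", 0)
        elif event.get("type") == "message_delta":
            # Assumed to be a running total, so it replaces the earlier value.
            usage = event.get("usage", {})
            output_tokens = usage.get("output_tokens", output_tokens)
    return input_tokens, output_tokens
```

A streaming proxy would run the same logic incrementally, chunk by chunk, so it can compare the running total against the budget before the stream ends.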

Cost calculation:

Guardian Runtime ships with a pricing table for common models (Claude 3.5 Sonnet, GPT-4, etc.). It multiplies token counts by per-token rates and sums across input and output. If you use a model not in the table, you can add custom pricing via config.
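
The arithmetic is simple enough to show in a few lines. In this sketch the pricing table, the model name, and the rates are illustrative placeholders rather than the values Guardian Runtime ships with; rates are expressed in USD per million tokens, and Decimal is used to avoid floating-point drift when summing many small charges.

```python
from decimal import Decimal

# USD per million tokens. Placeholder values for illustration only.
PRICING = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}

MILLION = Decimal(1_000_000)


def cost_usd(model: str, input_tokens: int, output_tokens: int, pricing=PRICING) -> Decimal:
    """Multiply token counts by per-token rates and sum input and output."""
    rates = pricing[model]  # raises KeyError for a model missing from the table
    return (rates["input"] * input_tokens + rates["output"] * output_tokens) / MILLION
```

With these placeholder rates, 40,000 input tokens and 10,000 output tokens come to 0.12 + 0.15 = 0.27 USD.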

Observability Hooks

The proxy logs every request and response to a structured JSON file. Each log entry includes:

Session ID
Model name
Input and output token counts
Tool calls (name, arguments, result status)
Policy decision (approved, rejected, reason)
Latency (time to first token, total duration)

You can tail the log file or ship it to a log aggregator (Loki, Elasticsearch) for dashboards.
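
One JSON object per line keeps the log easy to tail and easy to ship. The sketch below shows how such an entry could be assembled; the field names beyond those appearing in this article's own log example (timestamp, session_id, decision, reason) are assumptions, and format_log_entry is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def format_log_entry(session_id, model, input_tokens, output_tokens,
                     tool_calls, decision, reason=None,
                     ttft_ms=None, total_ms=None, now=None):
    """Serialize one request/response record as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    entry = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session_id": session_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "tool_calls": tool_calls,  # list of {"name", "arguments", "status"} dicts
        "decision": decision,      # "approved" or "rejected"
        "reason": reason,
        "latency_ms": {"first_token": ttft_ms, "total": total_ms},
    }
    return json.dumps(entry, sort_keys=True)
```

Appending the returned string plus a newline to a file yields a JSON Lines log that most aggregators can ingest without a custom parser.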

Debugging rejected requests:

When the proxy blocks a request, it logs the full policy evaluation trace. Example:

{
  "timestamp": "2026-06-09T08:15:32Z",
  "session_id": "abc123",
  "decision": "rejected",
  "reason": "token_budget_exceeded",
  "cumulative_tokens": 51200,
  "budget": 50000,
  "request_id": "req_xyz"
}

This helps you tune budgets or identify agents that loop indefinitely.
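
As a rough illustration of that kind of triage, the sketch below tallies rejections per session and reason from a JSON Lines log. It assumes each line is an object shaped like the example above; rejection_counts is a hypothetical helper, not part of the project.

```python
import json
from collections import Counter


def rejection_counts(log_lines):
    """Count rejected requests per (session_id, reason) pair."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("decision") == "rejected":
            counts[(entry.get("session_id"), entry.get("reason"))] += 1
    return counts
```

A session that keeps accumulating token_budget_exceeded rejections is a likely sign of an agent retrying in a loop instead of halting.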

Deployment Shapes

Shape	Use Case	Trade-offs
Single-user localhost	Developer workstation	No auth needed, easy setup, no HA
Shared proxy (Docker)	Team with multiple agents	Requires session isolation, API key per user
Sidecar in CI/CD	Agent runs in GitHub Actions	Ephemeral sessions, must persist logs to artifact store
Gateway with mTLS	Production agent fleet	Full auth, audit trail, but adds latency and ops burden

For single-user setups, run the proxy as a background process and set ANTHROPIC_BASE_URL=http://localhost:8080 in your shell. For shared deployments, add API key validation and session namespacing to prevent users from seeing each other’s logs.

Failure Modes and Mitigations

Proxy crashes:

If Guardian Runtime dies, the agent loses connectivity. Mitigation: run the proxy under a process supervisor (systemd, supervisord) with auto-restart.

SSE parsing errors:

If the upstream API changes its SSE format, the proxy may fail to count tokens. Mitigation: log unparseable chunks and fall back to request-level token estimates.

Shell command evasion:

An agent can bypass file blocklists by encoding paths (base64, hex) or using shell variables. Mitigation: combine Guardian Runtime with OS-level sandboxing (chroot, Docker volumes with read-only mounts).

Token budget races:

If two requests arrive simultaneously, both might pass the budget check before either updates the session state. Mitigation: use atomic compare-and-swap operations on the session store or add a mutex per session.
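
The mutex-per-session option can be sketched as a check-and-reserve step performed under one lock, so that two concurrent requests cannot both pass against the same remaining budget. BudgetedSession, try_reserve, and settle are hypothetical names, and reserving an estimate up front and correcting it afterwards is one possible design rather than the project's documented behavior.

```python
import threading


class BudgetedSession:
    """Per-session token counter whose check and update happen atomically."""

    def __init__(self, token_budget: int):
        self._budget = token_budget
        self._used = 0
        self._lock = threading.Lock()

    def try_reserve(self, tokens: int) -> bool:
        """Reserve tokens if they fit in the budget; reject otherwise."""
        with self._lock:
            if self._used + tokens > self._budget:
                return False
            self._used += tokens
            return True

    def settle(self, reserved: int, actual: int) -> None:
        """Replace an earlier reservation with the real usage once it is known."""
        with self._lock:
            self._used += actual - reserved

    @property
    def used(self) -> int:
        with self._lock:
            return self._used
```

Because the comparison and the increment share a single critical section, the race described above cannot let the session overshoot its budget through concurrent approvals.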

Technical Verdict

Use Guardian Runtime when:

You run AI coding agents locally and want cost or security guardrails without modifying agent code.
You need session-level observability (token counts, tool calls, file access) for debugging or compliance.
You are prototyping agent workflows and want to fail fast when budgets are exceeded.

Avoid it when:

Your agents run in untrusted environments (the proxy itself has no auth by default).
You need defense-in-depth against adversarial prompts (this is not a prompt injection firewall).
You require sub-millisecond overhead (the proxy adds 5-10 ms per request for policy checks).

Guardian Runtime is a practical interception layer for teams adopting AI coding agents. It does not replace OS-level sandboxing or network policies, but it fills the gap between agent capabilities and operational controls. The open-source repo includes deployment examples for Docker, systemd, and GitHub Actions.