Most agent security research focuses on making the LLM itself more robust: prompt injection filters, alignment techniques, adversarial training. A new arXiv paper from researchers at Microsoft, Wisconsin, and UCSD argues this is backwards. The model is the untrusted component. Security invariants must be enforced at the system level, not the model level.
The paper (arXiv:2605.18991) analyzes eleven real-world agent attacks and shows how systems principles (process isolation, capability boundaries, resource limits) could have prevented them. This is not a theoretical exercise. Production agent deployments already face these problems when one agent needs to read Slack messages but should not delete channels, or when a code-execution agent must run untrusted scripts without leaking credentials.
The Core Argument
Treat the LLM as you would treat any untrusted binary. You do not rely on a user-space process to police itself. You enforce boundaries at the OS level: file permissions, network policies, syscall filtering, resource quotas. The same logic applies to agents.
Key principles from the paper:
- Least privilege: An agent gets exactly the capabilities it needs, nothing more. A summarization agent does not need write access to your database.
- Isolation: Agents run in separate execution contexts. One compromised agent cannot pivot to another.
- Fail-safe defaults: If a capability check fails, the operation is denied. No fallback to a permissive mode.
- Complete mediation: Every tool call, API request, and file access goes through an authorization layer. No shortcuts.
These are not new ideas. They are the foundation of operating system security. The paper’s contribution is showing how to apply them to agent architectures where the “process” is an LLM inference loop and the “syscalls” are tool invocations.
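To make the process/syscall analogy concrete, here is a minimal sketch of complete mediation with fail-safe defaults. It is not taken from the paper: the `ToolMediator` class, its method names, and the example tools are all hypothetical. It assumes the agent can reach tools only through `call()`.

```python
from typing import Any, Callable


class ToolMediator:
    """Hypothetical mediator: the only path from an agent to a tool."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self._grants: dict[str, set[str]] = {}  # agent_id -> allowed tool names

    def register_tool(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def grant(self, agent_id: str, tool_name: str) -> None:
        self._grants.setdefault(agent_id, set()).add(tool_name)

    def call(self, agent_id: str, tool_name: str, **params: Any) -> Any:
        # Fail-safe default: unknown agents and ungranted tools are denied.
        if tool_name not in self._grants.get(agent_id, set()):
            raise PermissionError(f"{agent_id} may not call {tool_name}")
        # A granted but unregistered tool is also denied, never guessed at.
        if tool_name not in self._tools:
            raise PermissionError(f"unknown tool {tool_name}")
        return self._tools[tool_name](**params)


mediator = ToolMediator()
mediator.register_tool("summarize", lambda text: text[:20])
mediator.register_tool("drop_table", lambda name: f"dropped {name}")
mediator.grant("summarizer", "summarize")  # least privilege: one tool only

print(mediator.call("summarizer", "summarize", text="least privilege in one line"))
try:
    mediator.call("summarizer", "drop_table", name="users")
except PermissionError as exc:
    print("denied:", exc)
```

The registry plays the role of the syscall table and `call()` the role of the kernel entry point: the model can ask for anything, but only granted requests execute.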
What This Looks Like in Practice
Capability Model for Tool Calls
An agent orchestrator maintains a capability map. Each agent gets a token that encodes its allowed tools and parameters. When the agent requests a tool call, the orchestrator checks the token before execution.
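from __future__ import annotations  # lets AgentCapability reference ToolPolicy before it is defined


class AgentCapability:
    def __init__(self, agent_id: str, allowed_tools: dict[str, ToolPolicy]):
        self.agent_id = agent_id
        self.allowed_tools = allowed_tools  # tool_name -> ToolPolicy


class ToolPolicy:
    def __init__(self, read: bool, write: bool, scopes: list[str]):
        self.read = read
        self.write = write
        self.scopes = scopes  # e.g., ["channel:general", "user:self"]


# Illustrative set of state-changing operations (an assumption; adapt per tool).
WRITE_OPERATIONS = {"create", "update", "delete"}


def authorize_tool_call(capability: AgentCapability, tool_name: str, params: dict) -> bool:
    # Fail-safe default: tools not in the capability map are denied.
    if tool_name not in capability.allowed_tools:
        return False
    policy = capability.allowed_tools[tool_name]
    # Check operation type; missing or unrecognized operations are denied.
    operation = params.get("operation")
    if operation == "read":
        if not policy.read:
            return False
    elif operation in WRITE_OPERATIONS:
        if not policy.write:
            return False
    else:
        return False
    # Check scope boundaries
    requested_scope = params.get("scope")
    if requested_scope not in policy.scopes:
        return False
    return True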
This is a simplified example. Production systems need to handle dynamic scopes (e.g., “all channels the user has access to”), time-based policies, and audit logging. The point is that authorization happens outside the agent’s control.
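The token itself also has to be unforgeable, or the agent could simply mint a broader one. A minimal sketch of a signed, time-limited capability token follows, using only the standard library. The token layout, the `issue_token` and `verify_token` names, and the hard-coded key are assumptions for illustration; a real deployment would load the key from a secret store and would likely use an established token format.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"orchestrator-only-secret"  # assumption: known only to the orchestrator


def issue_token(agent_id: str, tools: list[str], ttl_seconds: float,
                now: Optional[float] = None) -> str:
    """Return 'payload.signature', both base64url-encoded."""
    now = time.time() if now is None else now
    claims = {"agent": agent_id, "tools": sorted(tools), "exp": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())


def verify_token(token: str, tool_name: str, now: Optional[float] = None) -> bool:
    """Deny on any malformed, forged, expired, or out-of-scope token."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or bad base64
        return False
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return False
    claims = json.loads(payload)
    return now < claims["exp"] and tool_name in claims["tools"]


token = issue_token("slack-reader", ["slack.read"], ttl_seconds=300)
print(verify_token(token, "slack.read"))    # tool is in the token
print(verify_token(token, "slack.delete"))  # tool is not in the token
```

Because the signature covers the tool list and expiry, an agent that edits the payload invalidates the token, and the time-based policy is enforced at verification rather than trusted to the holder.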
Sandboxing Code Execution
Many agents need to run code: data analysis, file processing, API scripting. The naive approach is to call subprocess.run() and hope the LLM does not generate malicious commands. The systems approach is to run the code in a container with strict resource limits and no network access.
# Kubernetes Pod spec for agent code execution under the gVisor (runsc) runtime.
# Assumes a RuntimeClass named "gvisor" is already installed in the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: agent-code-executor
spec:
  runtimeClassName: gvisor
  containers:
    - name: executor
      image: python:3.11-slim
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000  # the image defaults to root, which runAsNonRoot would reject
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        limits:
          cpu: "500m"
          memory: "256Mi"
          ephemeral-storage: "100Mi"
        requests:
          cpu: "100m"
          memory: "128Mi"
      volumeMounts:
        - name: workspace
          mountPath: /workspace
          readOnly: false
  volumes:
    - name: workspace
      emptyDir:
        sizeLimit: 50Mi
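The container has no persistent storage and limited CPU and memory. Network isolation is not something the pod spec provides on its own: it requires a separate deny-all egress NetworkPolicy (or an equivalent control) applied to the pod. With both in place, an agent that tries to mine cryptocurrency or exfiltrate data hits a wall. The orchestrator can also enforce execution time limits and kill runaway processes.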
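The execution time limit can be sketched at the process level too. The following is a minimal illustration, assuming a POSIX host; the `run_untrusted` helper and its return shape are hypothetical, and this complements a container sandbox rather than replacing it. It combines a wall-clock timeout, a CPU-time rlimit, and an empty environment so the parent's credentials are not inherited.

```python
import resource  # POSIX only
import subprocess
import sys


def _apply_limits() -> None:
    # Runs in the child process just before exec.
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))  # at most 2 seconds of CPU time


def run_untrusted(code: str, wall_timeout: float = 5.0) -> dict:
    """Run a Python snippet in a child process with time limits and no inherited env."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=wall_timeout,      # wall-clock limit; the child is killed on expiry
            env={},                    # do not pass the parent's environment variables
            preexec_fn=_apply_limits,  # not thread-safe; acceptable for a sketch
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}


print(run_untrusted("print(sum(range(10)))"))
print(run_untrusted("while True: pass", wall_timeout=1.0)["status"])
```

The wall-clock timeout catches code that sleeps or blocks, while the CPU rlimit catches busy loops even if the timeout is set generously.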
Resource Quotas Per Agent
In a multi-tenant system, one agent should not consume all available API quota or memory. The orchestrator tracks resource usage and enforces limits.
| Resource Type | Limit Mechanism | Failure Mode |
|---|---|---|
| API calls | Token bucket per agent | 429 response, backoff |
| Memory | cgroup limits | OOM kill, restart |
| CPU time | cgroup CPU quota | Throttling, timeout |
| Disk I/O | cgroup blkio (v1) or io (v2) throttling | Slow reads and writes |
| Network egress | iptables rate limit | Dropped packets, retry |
When an agent exceeds its quota, the system logs the event and either throttles or terminates the agent. This prevents one misbehaving agent from degrading the entire system.
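The token-bucket row can be sketched in a few lines. This is a minimal illustration rather than a production limiter: the class names are hypothetical, state is in-memory and single-process, and the clock is injectable so the behavior can be tested deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills continuously at refill_per_sec, up to capacity."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should answer with a 429 and let the agent back off


class AgentRateLimiter:
    """One bucket per agent, so a noisy agent cannot drain another's quota."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_sec
        self._clock = clock
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, agent_id: str) -> bool:
        bucket = self._buckets.get(agent_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, self._clock)
            self._buckets[agent_id] = bucket
        return bucket.try_acquire()


limiter = AgentRateLimiter(capacity=2, refill_per_sec=1.0)
print([limiter.allow("agent-1") for _ in range(3)])  # third call exceeds the burst
print(limiter.allow("agent-2"))                       # separate bucket, unaffected
```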
Observability and Boundary Detection
How do you tell the difference between an agent legitimately using a tool and an agent exceeding its security boundary? The paper does not provide a full answer, but the primitives are clear:
- Audit logs: Every tool call, with parameters, timestamp, and agent ID. Stored in an append-only log.
- Anomaly detection: Baseline normal behavior (e.g., “this agent calls the Slack API 10 times per hour”) and flag deviations.
- Capability violations: Any denied tool call is a potential security event. Log it, alert on repeated failures.
- Resource usage metrics: Track CPU, memory, API calls per agent. Spike detection catches runaway loops or exfiltration attempts.
The key is that these signals are generated by the orchestrator, not the agent. The agent cannot suppress its own audit logs or reset its resource counters.
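A minimal sketch of two of these primitives together: an append-only log made tamper-evident with a hash chain, plus a counter that flags repeated capability violations. The `AuditLog` class, its field names, and the alert threshold are assumptions for illustration; a production log would live in external storage the agent cannot reach.

```python
import hashlib
import json
from collections import Counter

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    def __init__(self, alert_threshold: int = 3) -> None:
        self._entries: list[dict] = []
        self._denials: Counter = Counter()
        self.alert_threshold = alert_threshold

    def record(self, agent_id: str, tool: str, params: dict,
               allowed: bool, ts: float) -> bool:
        """Append an entry; return True once the agent has reached the denial threshold."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {"agent": agent_id, "tool": tool, "params": params,
                "allowed": allowed, "ts": ts, "prev": prev}
        self._entries.append({**body, "hash": _digest(body)})
        if not allowed:
            self._denials[agent_id] += 1
        return self._denials[agent_id] >= self.alert_threshold

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = AuditLog(alert_threshold=2)
log.record("agent-1", "slack.read", {"scope": "channel:general"}, True, ts=1.0)
log.record("agent-1", "slack.delete", {"scope": "channel:general"}, False, ts=2.0)
print(log.record("agent-1", "slack.delete", {"scope": "channel:ops"}, False, ts=3.0))
print(log.verify())
```

Each entry commits to the hash of the one before it, so rewriting history requires recomputing every later hash, which is detectable if the latest hash is also stored elsewhere.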
Trade-Offs and Implementation Challenges
Functionality vs. Security
Strict sandboxing breaks some agent use cases. If an agent needs to call arbitrary APIs, you cannot whitelist every possible endpoint. If it needs to process user-uploaded files, you cannot predict the file format. The solution is layered defenses:
- Start with the most restrictive sandbox that still allows the core functionality.
- Add capability escalation paths that require human approval.
- Use runtime monitoring to detect unexpected behavior even within allowed capabilities.
Performance Overhead
Process isolation, capability checks, and resource accounting add latency. The paper does not quantify this; ballpark figures from production systems are 10-50ms of overhead per tool call for authorization checks and 100-500ms for container startup, and both vary widely with the deployment. For long-running agents, this is acceptable. For high-frequency tool calls, you need to batch operations or use lighter-weight isolation (e.g., seccomp-bpf syscall filtering on a long-lived process instead of a fresh container per call).
Complexity of Multi-Agent Systems
When agents call other agents, capability propagation becomes tricky. If Agent A delegates a task to Agent B, does B inherit A’s capabilities? Does it get a subset? The paper suggests explicit delegation tokens, but this requires careful design to avoid privilege escalation.
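One way to keep delegation from escalating privilege is attenuation: a delegated grant may only narrow what the parent holds, never widen it. The sketch below illustrates that rule; the `Grant` type, the `delegate` function, and the depth counter are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    tools: frozenset[str]
    scopes: frozenset[str]
    depth_left: int  # how many further delegations are permitted


def delegate(parent: Grant, tools: set[str], scopes: set[str]) -> Grant:
    """Return a child grant that is a subset of the parent, or raise."""
    if parent.depth_left <= 0:
        raise PermissionError("delegation depth exhausted")
    extra_tools = set(tools) - parent.tools
    extra_scopes = set(scopes) - parent.scopes
    if extra_tools or extra_scopes:
        # Fail closed: refuse rather than silently trimming the request.
        raise PermissionError(
            f"cannot delegate beyond parent: {sorted(extra_tools | extra_scopes)}"
        )
    return Grant(frozenset(tools), frozenset(scopes), parent.depth_left - 1)


agent_a = Grant(
    tools=frozenset({"slack.read", "slack.post"}),
    scopes=frozenset({"channel:general"}),
    depth_left=1,
)
agent_b = delegate(agent_a, {"slack.read"}, {"channel:general"})
print(agent_b)

try:
    delegate(agent_a, {"slack.delete"}, {"channel:general"})
except PermissionError as exc:
    print("denied:", exc)
```

The depth counter bounds delegation chains, so Agent B in this example cannot hand its grant on to a third agent.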
Real-World Attack Analysis
The paper examines eleven attacks, including:
- Prompt injection leading to data exfiltration: Agent tricked into calling an API with attacker-controlled parameters. Systems defense: capability check would have blocked the unauthorized API endpoint.
- Resource exhaustion via infinite loops: Agent generates code that runs forever. Systems defense: CPU quota and execution timeout.
- Credential leakage through tool misuse: Agent reads environment variables and sends them to an external service. Systems defense: no network access from the execution sandbox and a read-only filesystem, with secrets kept out of the sandbox's environment so there is nothing to read.
In every case, the vulnerability existed because the system trusted the LLM to make correct decisions. The fix is to remove that trust and enforce boundaries externally.
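For the exfiltration case, the external check can be as small as an egress allowlist evaluated by the orchestrator before any outbound request. The following is a minimal sketch; the host names and the `egress_allowed` function are hypothetical, and matching is exact on the parsed hostname rather than on a substring of the URL.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = frozenset({"slack.com", "api.internal.example"})  # hypothetical allowlist


def egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Allow only https requests whose parsed hostname is exactly on the allowlist."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:  # malformed URL: deny
        return False
    if parts.scheme != "https":
        return False
    return host in allowed_hosts


for candidate in [
    "https://slack.com/api/chat.postMessage",
    "https://slack.com@evil.example/collect",  # userinfo trick: real host is evil.example
    "https://slack.com.evil.example/collect",  # lookalike subdomain
    "http://slack.com/api/chat.postMessage",   # plaintext scheme
]:
    print(candidate, "->", egress_allowed(candidate))
```

Parsing the URL before comparing matters: a substring test for "slack.com" would accept both attacker-controlled examples above.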
Architecture Example: Multi-Agent Orchestrator with Capability Enforcement
┌─────────────────────────────────────────────────────────┐
│ Orchestrator (Trusted Computing Base) │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Capability │ │ Resource │ │
│ │ Manager │ │ Quota │ │
│ │ │ │ Tracker │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Tool Call Authorization Layer │ │
│ └─────────────────────────────────────┘ │
│ │ │
└─────────┼───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Agent Execution Environments (Untrusted) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Agent 1 │ │ Agent 2 │ │ Agent 3 │ │
│ │ (gVisor) │ │ (gVisor) │ │ (gVisor) │ │
│ │ │ │ │ │ │ │
│ │ LLM Inference│ │ LLM Inference│ │ LLM Inference│ │
│ │ Tool Calls │ │ Tool Calls │ │ Tool Calls │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
The orchestrator is the only trusted component. It holds the capability map, enforces resource limits, and mediates all tool calls. Agents run in isolated containers and cannot bypass the authorization layer.
Technical Verdict
Use this approach when:
- You are deploying agents in production with access to sensitive data or critical systems.
- You have multiple agents with different trust levels (e.g., user-facing agents vs. internal automation).
- You need to enforce compliance requirements (SOC 2, GDPR) that demand audit logs and access controls.
- You expect adversarial inputs (user-generated prompts, untrusted data sources).
- Your infrastructure already supports container orchestration (Kubernetes, Docker, or equivalent).
Avoid this approach when:
- You are prototyping in a sandboxed environment with no production data and no external tool access.
- Your agents have read-only access to non-sensitive information and no ability to modify state.
- Per-tool-call latency overhead exceeding 50ms breaks your use case (e.g., real-time chat interfaces requiring sub-100ms response times).
- You lack infrastructure for container orchestration or cannot justify the operational complexity of running gVisor, seccomp-bpf, or equivalent sandboxing technology.
- You have a single-agent system with no delegation, no tool calling, and no access to external APIs or filesystems.
- Your team does not have systems engineering expertise to maintain capability managers, resource quotas, and audit pipelines.
The paper’s core insight is correct: you cannot secure agents by making the LLM smarter. You secure them by treating the LLM as untrusted and enforcing invariants at the system level. This requires infrastructure investment (orchestrators, capability managers, sandboxes), but it is the only path to predictable security guarantees in production.