Twill.ai's Cloud Sandbox Architecture: How Delegated Coding Agents Return PRs Without Touching Your Laptop

Twill.ai runs Claude Code, OpenCode, and Codex in cloud sandboxes. You assign work through Slack, GitHub, Linear, a web app, or the CLI. The agent researches your codebase, writes a plan, waits for approval, implements the change, runs tests, and opens a PR. The entire loop runs remotely: no local compute, no laptop fan noise, no credential leakage into your shell history.

This is delegated execution, not local orchestration. It targets a specific problem: a coding agent that consumes 8 GB of RAM and pins your CPU for 20 minutes is tolerable for a one-off experiment, but the approach breaks down when you want to run three agents in parallel or hand off work before leaving for the day.

Isolation and Sandbox Lifecycle

Each task gets a fresh sandbox. Twill spins up ephemeral environments with:

Cloned repository state
Installed dependencies (detected from package manifests)
Build and test tooling (npm, cargo, pytest, or whatever the agent selects for the project)
Isolated network and filesystem boundaries

The sandbox persists for the duration of the task. Once the PR is opened or the task fails, the environment is torn down. No state bleeds between runs.

Credentials are scoped per sandbox. GitHub tokens, API keys, and cloud provider credentials are injected at runtime and expire with the environment. The agent cannot write secrets to disk in a way that survives the session.
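
The lifecycle described above (create, inject credentials, tear down) maps naturally onto a context manager. The following is a minimal local sketch, not Twill's implementation: a temporary directory stands in for the isolated filesystem, a plain dict stands in for the injected environment, and the names ephemeral_sandbox and run_task are hypothetical.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def ephemeral_sandbox(credentials: Dict[str, str]) -> Iterator[dict]:
    """Create a throwaway workspace and tear it down when the task ends.

    Hypothetical stand-in for a real sandbox: a temp directory plays the
    role of the isolated filesystem, and a dict plays the role of the
    environment that scoped credentials are injected into.
    """
    workdir = tempfile.mkdtemp(prefix="sandbox-")
    env = dict(credentials)  # injected at startup, held in memory only
    try:
        yield {"workdir": workdir, "env": env}
    finally:
        env.clear()  # credentials do not outlive the sandbox
        shutil.rmtree(workdir, ignore_errors=True)  # no state survives the run


def run_task(task_name: str) -> str:
    """Write some intermediate state inside a sandbox and return its path."""
    with ephemeral_sandbox({"GITHUB_TOKEN": "example-token"}) as sandbox:
        notes = os.path.join(sandbox["workdir"], "notes.txt")
        with open(notes, "w") as fh:
            fh.write(f"intermediate state for {task_name}")
        return sandbox["workdir"]
```

Because teardown lives in the finally clause, it runs whether the task succeeds or raises, which is the property that keeps state from bleeding between runs.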

You can SSH into a running sandbox to debug or inspect intermediate state. This is useful when the agent gets stuck or produces unexpected output. The SSH session uses ephemeral keys and terminates when the sandbox shuts down.

Orchestration Flow and Human Approval Gates

Twill enforces a fixed pipeline:

Research: Agent reads the codebase, issue description, and linked documentation. It clarifies ambiguities by asking questions in the same channel (Slack thread, GitHub comment, Linear update).
Plan: Agent writes an implementation spec. This is a structured document listing files to change, tests to add, and expected behavior.
Approval Gate: Human reviews the plan. If approved, the agent proceeds. If rejected, it loops back to research.
Implement: Agent writes code, runs builds, executes tests. If tests fail, it iterates (up to a limit).
AI Code Review: A second agent reviews the diff for style, correctness, and test coverage.
PR Creation: Agent opens a pull request with a summary, test results, and sandbox logs.

The approval gate is mandatory. The agent cannot skip from plan to implementation without explicit human confirmation. This prevents runaway execution and gives you a checkpoint to redirect or cancel work.
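
The pipeline and its gate can be sketched as a small state machine. This is an illustration under assumptions, not Twill's code: the Phase names, the ApprovalRequired exception, and next_phase are all hypothetical, and the approval step is modeled as its own waiting state.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"
    PLAN = "plan"
    AWAITING_APPROVAL = "awaiting_approval"
    IMPLEMENT = "implement"
    AI_REVIEW = "ai_review"
    PR_CREATED = "pr_created"


class ApprovalRequired(Exception):
    """Raised when a transition past the gate is attempted without a decision."""


_ORDER = [
    Phase.RESEARCH,
    Phase.PLAN,
    Phase.AWAITING_APPROVAL,
    Phase.IMPLEMENT,
    Phase.AI_REVIEW,
    Phase.PR_CREATED,
]


def next_phase(current: Phase, approved: Optional[bool] = None) -> Phase:
    """Return the phase that follows `current`.

    At the approval gate an explicit human decision is required:
    True moves on to implementation, False loops back to research.
    """
    if current is Phase.AWAITING_APPROVAL:
        if approved is None:
            raise ApprovalRequired("plan needs an explicit human decision")
        return Phase.IMPLEMENT if approved else Phase.RESEARCH
    if current is Phase.PR_CREATED:
        raise ValueError("pipeline already finished")
    return _ORDER[_ORDER.index(current) + 1]
```

Making the missing decision an exception rather than a default means there is no code path from plan to implementation that skips the human.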

State Handoff and Multi-Channel Integration

Twill serializes task state into structured JSON. When you assign work via Slack, the service:

Parses the message for repository, branch, and task description
Creates a task record with a unique ID
Spins up a sandbox and starts the research phase
Posts updates back to the Slack thread as the agent progresses
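
The first two steps above (parse the message, create a task record) might look like the sketch below. The article does not document Twill's message syntax or JSON schema, so everything here is an assumption: the "repo=... branch=..." convention, the TaskRecord fields, and the function names are hypothetical.

```python
import json
import re
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

# Hypothetical message convention: "repo=<owner/name> branch=<name> <task text>"
_ASSIGNMENT = re.compile(
    r"repo=(?P<repo>\S+)\s+branch=(?P<branch>\S+)\s+(?P<task>.+)", re.DOTALL
)


@dataclass
class TaskRecord:
    repository: str
    branch: str
    description: str
    channel: str  # where updates are posted back, e.g. "slack"
    phase: str = "research"
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def parse_assignment(message: str, channel: str) -> Optional[TaskRecord]:
    """Extract repository, branch, and task text; return None if absent."""
    match = _ASSIGNMENT.search(message)
    if match is None:
        return None
    return TaskRecord(
        repository=match["repo"],
        branch=match["branch"],
        description=match["task"].strip(),
        channel=channel,
    )


def serialize(record: TaskRecord) -> str:
    """Serialize the task state to JSON for handoff between services."""
    return json.dumps(asdict(record), sort_keys=True)
```

A call such as parse_assignment("@twill repo=acme/api branch=main fix the flaky login test", "slack") yields a record whose serialized form can be stored under its task_id and reloaded when the agent resumes.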

The same flow works for GitHub issues (via @twill mentions) and Linear tickets. The agent reads context from the issue body, linked PRs, and previous comments. It writes updates as new comments, preserving the discussion thread.

When the agent needs input, it blocks and pings you in the original channel. You respond inline. The agent resumes with your clarification injected into its context window.

PRs include:

Diff summary
Test output (pass/fail counts, logs)
Sandbox infrastructure logs (build steps, dependency installs)
Link to the sandbox session (for SSH inspection while the sandbox is still running)

Tool Boundaries and Failure Modes

The agent has access to:

Git (clone, branch, commit, push)
Package managers (npm, pip, cargo, go mod)
Build tools (make, gradle, webpack)
Test runners (jest, pytest, cargo test)
Cloud CLIs (gcloud, aws, terraform) if credentials are provided

It cannot:

Modify production infrastructure directly
Access secrets outside the scoped credential set
Make network requests to arbitrary endpoints (egress is filtered)
Persist state outside the sandbox filesystem

Common failure modes:

Failure	Cause	Recovery
Test loop	Flaky tests or incorrect fix strategy	Agent retries up to N times, then asks for help
Dependency conflict	Incompatible versions or missing system libraries	Agent logs the error, suggests manual intervention
Approval timeout	Human doesn’t respond within the SLA	Task pauses, can be resumed later
Sandbox OOM	Agent spawns too many processes or loads large datasets	Sandbox is killed, task fails with logs

The structured pipeline makes failures debuggable. You can see exactly which phase failed, inspect the sandbox logs, or SSH into a still-running sandbox to reproduce the issue.
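
The "test loop" row in the table amounts to a bounded retry with escalation. A minimal sketch follows, assuming a caller-supplied attempt function; the retry limit is left as a parameter because the article does not state its value, and implement_with_retries is a hypothetical name.

```python
from typing import Callable


def implement_with_retries(attempt_fix: Callable[[int], bool], max_attempts: int) -> str:
    """Run fix-and-test attempts until one passes or the limit is reached.

    `attempt_fix` receives the 1-based attempt number and returns True
    when the test suite passes. `max_attempts` is a hypothetical knob.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_fix(attempt):
            return f"passed on attempt {attempt}"
    # Limit reached: stop iterating and hand the problem to a human.
    return "needs_human_help"
```

The bound matters most for flaky tests, where an unbounded loop would burn sandbox time without converging.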

Deployment Shape and Observability

Twill runs on managed Kubernetes. Each sandbox is a pod with resource limits (CPU, memory, disk). The orchestrator schedules pods across a cluster and handles autoscaling based on task queue depth.
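
For readers less familiar with Kubernetes, a pod with CPU, memory, and disk limits could be described roughly as below. The field names are standard Kubernetes Pod fields; the image name, label key, naming scheme, and the function itself are hypothetical and not taken from Twill.

```python
import json


def sandbox_pod_manifest(task_id: str, cpu: str, memory: str, disk: str) -> dict:
    """Build a Pod manifest with hard resource limits for one sandbox."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"sandbox-{task_id}",
            "labels": {"task-id": task_id},  # hypothetical label key
        },
        "spec": {
            "restartPolicy": "Never",  # a failed sandbox is not restarted in place
            "containers": [
                {
                    "name": "agent",
                    "image": "example.invalid/agent-sandbox:latest",  # placeholder
                    "resources": {
                        "limits": {
                            "cpu": cpu,
                            "memory": memory,
                            "ephemeral-storage": disk,
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(sandbox_pod_manifest("abc123", "2", "8Gi", "20Gi"), indent=2))
```

A container that exceeds its memory limit is killed by the kernel's OOM killer, which corresponds to the "Sandbox OOM" failure mode listed earlier.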

Observability stack:

Logs: Structured JSON logs from agent, sandbox, and orchestrator. Searchable by task ID, repository, or user.
Metrics: Task duration, success rate, sandbox resource usage, approval latency.
Traces: Distributed traces across research, plan, implement, and review phases. Useful for debugging slow tasks.

You can export logs and metrics to your own observability platform (Datadog, Grafana, Splunk). Twill provides webhooks for task lifecycle events (started, approved, failed, completed).
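
On the receiving side, a webhook consumer mostly needs to dispatch on the event type. The sketch below uses the four lifecycle events named above; the payload shape (an "event" key plus a "task_id") is an assumption, since the article does not specify it, and handle_webhook is a hypothetical name.

```python
import json
from typing import Callable, Dict

LIFECYCLE_EVENTS = {"started", "approved", "failed", "completed"}


def handle_webhook(body: str, handlers: Dict[str, Callable[[dict], None]]) -> bool:
    """Dispatch one webhook delivery to the handler for its event type.

    Returns False for unrecognized events so the caller can log or
    reject them. The payload layout is assumed, not documented.
    """
    payload = json.loads(body)
    event = payload.get("event")
    if event not in LIFECYCLE_EVENTS:
        return False
    handler = handlers.get(event)
    if handler is not None:
        handler(payload)
    return True
```

For example, registering a handler only for "failed" is enough to page someone when a task dies overnight, while the other events are acknowledged and ignored.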

Security Boundaries

Credentials are injected via environment variables at sandbox startup. They are never written to disk or logged. When the sandbox terminates, the credentials are revoked.

Network egress is restricted to:

GitHub (for cloning and PR creation)
Package registries (npm, PyPI, crates.io)
Cloud provider APIs (if explicitly enabled)

All other outbound traffic is blocked. The agent cannot exfiltrate data to arbitrary endpoints.
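
An allowlist like the one above reduces to an exact hostname check. The sketch below is illustrative only: real egress filtering happens at the network layer rather than in application code, and the host list is an assumption based on the public endpoints of the services named, not Twill's actual configuration.

```python
from typing import FrozenSet
from urllib.parse import urlsplit

# Assumed hosts for the services named above; not Twill's real configuration.
ALLOWED_HOSTS = frozenset(
    {
        "github.com",
        "api.github.com",
        "registry.npmjs.org",
        "pypi.org",
        "files.pythonhosted.org",
        "crates.io",
        "static.crates.io",
    }
)


def is_egress_allowed(url: str, extra_hosts: FrozenSet[str] = frozenset()) -> bool:
    """Allow a request only if its hostname is exactly on the allowlist.

    `extra_hosts` models explicitly enabled cloud provider APIs.
    Matching on the parsed hostname (not a substring of the URL) keeps
    lookalikes such as github.com.evil.example from passing.
    """
    host = (urlsplit(url).hostname or "").lower()
    return host in ALLOWED_HOSTS or host in extra_hosts
```

The default-deny shape is the important part: anything not listed, including malformed URLs with no hostname, is rejected.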

Code changes are reviewed by a second agent before the PR is opened. This catches obvious security issues (hardcoded secrets, SQL injection, unsafe deserialization). It’s not a substitute for human review, but it reduces the number of obvious problems that reach a human reviewer.
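
To make one of those checks concrete: hardcoded secrets in a diff can be approximated with a pattern scan over added lines. This is not how Twill's reviewing agent works (the article describes an AI reviewer, not a regex); it is a deliberately narrow sketch that looks only for the well-known AWS access key ID format, and find_hardcoded_keys is a hypothetical name.

```python
import re
from typing import List

# AWS access key IDs start with "AKIA" followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_hardcoded_keys(unified_diff: str) -> List[str]:
    """Return AWS access key IDs that appear on added lines of a unified diff."""
    findings: List[str] = []
    for line in unified_diff.splitlines():
        # Added lines start with "+"; "+++" is the new-file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            findings.extend(AWS_ACCESS_KEY_ID.findall(line))
    return findings
```

Scanning only added lines avoids flagging secrets that the change is removing, though those would still need rotating.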

When to Run Agents in the Cloud

Use Twill when:

You want to run multiple agents in parallel without melting your laptop.
You need async execution (assign work, go to sleep, wake up to a PR).
Your team collaborates in Slack or Linear and wants to delegate work without switching tools.
You want isolation guarantees (no credential leakage, no cross-task contamination).

Avoid it when:

You need tight control over the execution environment (custom kernels, exotic dependencies).
Your codebase requires local hardware (GPU training, embedded device testing).
You prefer to run agents locally and inspect every step in real time.
Your security model prohibits sending code to a third-party service.

Technical Verdict

Twill solves the resource and isolation problems that make local coding agents impractical for teams. The structured pipeline and approval gates reduce runaway execution risk. The multi-channel integration (Slack, GitHub, Linear) fits into existing workflows without forcing tool switches.

The trade-off is control. You cannot customize the sandbox environment beyond what Twill supports. If your build process requires Docker-in-Docker, custom kernel modules, or GPU access, you’ll need a different solution.

The security model is sound for most use cases. Credentials are scoped and ephemeral. Network egress is filtered. Code review happens before PR creation. But if your threat model prohibits sending source code to a managed service, this is not the right tool.

For small teams shipping features and bug fixes, the async execution model is compelling. Assign work at 5pm, review the PR at 9am. The agent handles the grind (dependency updates, test fixes, docs) while you focus on architecture and product decisions.