DeltaBox: Millisecond Sandbox Checkpoints for Agent Search

Agent exploration workloads need to checkpoint and roll back sandbox state thousands of times per second. Tree search algorithms, reinforcement learning loops, and test-time compute scaling all depend on the ability to snapshot a running environment, try an action, then rewind if it fails. Existing sandbox mechanisms duplicate entire containers or VM images, which takes hundreds of milliseconds to seconds per checkpoint. That latency makes deep search trees and large fan-outs economically infeasible.

DeltaBox is an OS-level abstraction designed to solve this bottleneck. Instead of copying the entire sandbox state, it tracks only the deltas between consecutive checkpoints. The result is checkpoint and rollback operations that complete in under a millisecond, enabling agent workloads to explore more than a thousand execution paths per second per sandbox without duplicating gigabytes of container state.

The Checkpoint Bottleneck

When an agent needs to explore multiple tool-use paths, the orchestrator must:

Snapshot the current sandbox state (filesystem, process memory, open file descriptors)
Execute the candidate action
Evaluate the outcome
Roll back to the snapshot if the action failed or a better path exists

Traditional approaches duplicate the entire sandbox for each checkpoint. Docker container snapshots, VM clones, and CRIU (Checkpoint/Restore In Userspace) typically copy the full state. For a typical coding agent sandbox with a few hundred megabytes of filesystem state and process memory, this means 200-500ms per checkpoint. A tree search with a branching factor of 10 and depth of 5 (root level included) would require 11,110 checkpoints, one per non-root node (10 + 100 + 1,000 + 10,000). At 200-500ms each, that is roughly 37 to 93 minutes of state management overhead alone.

The observation: consecutive checkpoints in agent workloads are highly similar. An agent editing a file changes a few kilobytes, not the entire filesystem. A process executing a tool call modifies a small region of memory, not the entire address space.

DeltaState Architecture

DeltaBox introduces two co-designed OS mechanisms:

DeltaFS handles filesystem checkpoints using a layered copy-on-write structure. The filesystem is organized into immutable layers. When the agent writes to a file, the change is captured in the topmost writable layer. A checkpoint operation freezes the current writable layer and inserts a new empty layer above it. Rollback is a pointer swap that discards the top layer and restores the previous writable layer.
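To make the layer mechanics concrete, here is a toy in-memory model in Python. It is only a sketch under stated assumptions: the LayerStack class and its method names are invented for illustration, it works at file rather than block granularity, and rollback starts a fresh empty writable layer instead of reviving the previous one. It is not the DeltaFS kernel implementation.

```python
from types import MappingProxyType


class LayerStack:
    """Toy layered copy-on-write store (hypothetical, file granularity)."""

    def __init__(self):
        self.frozen = []  # read-only layers, oldest first
        self.top = {}     # writable layer: path -> bytes

    def write(self, path, data):
        self.top[path] = data  # changes always land in the writable layer

    def read(self, path):
        # Check the writable layer, then frozen layers from newest to oldest.
        for layer in (self.top, *reversed(self.frozen)):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def checkpoint(self):
        # Freeze the writable layer and push a new empty one above it.
        self.frozen.append(MappingProxyType(self.top))
        self.top = {}
        return len(self.frozen)  # checkpoint id = number of frozen layers

    def rollback(self, checkpoint_id):
        # Drop every layer created after the checkpoint; no data is copied.
        del self.frozen[checkpoint_id:]
        self.top = {}


fs = LayerStack()
fs.write("/app/main.py", b"v1")
cp = fs.checkpoint()
fs.write("/app/main.py", b"v2")    # shadows v1 in the new top layer
fs.write("/tmp/scratch", b"junk")
fs.rollback(cp)
print(fs.read("/app/main.py"))     # b'v1'
```

Both checkpoint and rollback touch only the list of layers, which is why their cost does not depend on how much data the sandbox holds.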

DeltaCR handles process state checkpoints using incremental dumps. Instead of serializing the entire process memory and context, DeltaCR tracks dirty pages since the last checkpoint. It maintains a frozen template process that represents the checkpoint state. Rollback bypasses the traditional restore pipeline and directly forks from the template process, inheriting its memory and file descriptor state.

Both mechanisms rely on kernel-level tracking of modifications. DeltaFS uses inode versioning and block-level copy-on-write. DeltaCR uses page table dirty bits and a custom fork path that skips unnecessary initialization.
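The dirty-page idea can be illustrated with a small userspace model. In this sketch a Python set stands in for the page table dirty bits, and TrackedMemory, incremental_dump, and rebuild are hypothetical names chosen for the example rather than anything DeltaCR exposes.

```python
PAGE_SIZE = 4096


class TrackedMemory:
    """Toy address space that remembers which pages were written."""

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()  # page numbers written since the last dump

    def write(self, addr, data):
        for i, value in enumerate(data):
            page_no, offset = divmod(addr + i, PAGE_SIZE)
            self.pages[page_no][offset] = value
            self.dirty.add(page_no)

    def incremental_dump(self):
        # Serialize only the dirty pages, then reset the tracking state.
        delta = {n: bytes(self.pages[n]) for n in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def rebuild(num_pages, deltas):
    """Reconstruct memory by replaying deltas, oldest first, over zero pages."""
    pages = [bytes(PAGE_SIZE)] * num_pages
    for delta in deltas:
        for page_no, content in delta.items():
            pages[page_no] = content
    return pages


mem = TrackedMemory(num_pages=256)   # 1 MiB of address space
mem.write(5000, b"hello")            # lands entirely in page 1
d1 = mem.incremental_dump()
mem.write(4095, b"ab")               # straddles pages 0 and 1
d2 = mem.incremental_dump()
print(sorted(d1), sorted(d2))        # [1] [0, 1]
```

Each dump carries one or two pages out of 256, which is the property that keeps checkpoint cost proportional to what changed instead of to the size of the address space.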

Implementation Details

The layered filesystem uses a structure similar to OverlayFS but optimized for rapid layer insertion. Each layer is a sparse directory tree containing only modified files. Reads traverse the layer stack from top to bottom until a file is found. Writes trigger copy-on-write at the block level, not the file level, so modifying a single byte in a large file only duplicates the affected 4KB block.
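The difference between file-level and block-level copy-on-write is easy to see in a toy model. The sketch below is illustrative only: CowFile is a hypothetical class, a snapshot is just a second list of block references, and for simplicity every write copies its block even when the block is no longer shared.

```python
BLOCK_SIZE = 4096


class CowFile:
    """Toy file stored as a list of block references."""

    def __init__(self, data=b""):
        self.blocks = [data[i:i + BLOCK_SIZE]
                       for i in range(0, len(data), BLOCK_SIZE)]
        self.blocks_copied = 0

    def snapshot(self):
        clone = CowFile()
        clone.blocks = list(self.blocks)  # copies references, not contents
        return clone

    def read_byte(self, offset):
        index, within = divmod(offset, BLOCK_SIZE)
        return self.blocks[index][within]

    def write_byte(self, offset, value):
        index, within = divmod(offset, BLOCK_SIZE)
        block = bytearray(self.blocks[index])  # duplicate just this block
        block[within] = value
        self.blocks[index] = bytes(block)
        self.blocks_copied += 1


live = CowFile(bytes(1024 * 1024))   # 1 MiB file = 256 blocks
frozen = live.snapshot()
live.write_byte(10_000, 0xFF)
print(len(live.blocks), live.blocks_copied)              # 256 1
print(live.read_byte(10_000), frozen.read_byte(10_000))  # 255 0
```

A one-byte write duplicates a single 4KB block while the snapshot keeps seeing the old contents. A file-level scheme would have copied the full megabyte.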

The incremental process dump tracks three categories of state:

Memory pages: Only dirty pages since the last checkpoint are serialized
File descriptors: Open file table entries are reference-counted across checkpoints
Kernel contexts: Signal handlers, timers, and namespace memberships are versioned

The template process mechanism avoids the cost of deserializing state during rollback. When a checkpoint is created, the current process is frozen in place. Rollback creates a new process by forking from the frozen template, then applies the incremental delta to restore the exact checkpoint state.
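Ordinary fork() already gives a rough feel for the template idea. In the sketch below the parent process plays the template: it never runs an action itself, and each branch runs in a forked child that inherits the parent's memory copy-on-write. This uses the standard Unix fork available through Python's os module, not DeltaCR's custom fork path, and run_in_fork is a name made up for the example. It requires a Unix-like system.

```python
import json
import os


def run_in_fork(state, action):
    """Run action(state) in a forked child and return the child's final state.

    The caller's copy of state is never modified.
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: works on a copy-on-write view of the parent's memory
        status = 1
        try:
            os.close(read_end)
            action(state)
            with os.fdopen(write_end, "w") as out:
                json.dump(state, out)
            status = 0
        finally:
            os._exit(status)  # skip the parent's cleanup handlers
    # parent: collect the child's result and reap it
    os.close(write_end)
    with os.fdopen(read_end) as src:
        payload = src.read()
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        raise RuntimeError("branch failed in child process")
    return json.loads(payload)


template = {"files": ["main.py"]}
branch = run_in_fork(template, lambda s: s["files"].append("test_main.py"))
print(template["files"])  # ['main.py']
print(branch["files"])    # ['main.py', 'test_main.py']
```

Discarding a branch costs nothing beyond letting the child exit, which is the intuition behind restoring by fork instead of deserializing a dump.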

Security Boundaries

The checkpoint mechanism introduces new attack surfaces:

Risk	Mitigation
Malicious code persisting across rollbacks	Each checkpoint is isolated in a separate namespace; rollback destroys all process state
Filesystem layer poisoning	Layers are immutable after freezing; writes to frozen layers trigger copy-on-write
Memory corruption in template process	Template processes are read-only; fork creates a new writable copy
Resource exhaustion via checkpoint spam	Checkpoint depth is capped; oldest layers are garbage collected

The key security property: rollback must guarantee that no state from the rolled-back execution can leak into the restored checkpoint. DeltaBox enforces this by making checkpoints immutable snapshots. The template process is frozen with mprotect(PROT_READ) on all memory regions. Filesystem layers are marked read-only at the kernel level.

One subtle issue is open file descriptors. If an agent opens a network socket, writes data, then rolls back, the socket remains open in the kernel. DeltaBox handles this by tracking file descriptor creation in each checkpoint layer. Rollback closes all descriptors created after the checkpoint, preventing leaked connections.
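The descriptor bookkeeping can be sketched in userspace as follows. FdTracker is a hypothetical helper written for this example, and /dev/null (via os.devnull) stands in for a network socket. A real implementation would live in the kernel and would also have to cope with descriptor numbers being reused after the agent closes them, which this sketch ignores.

```python
import os


class FdTracker:
    """Toy tracker that closes descriptors opened after a checkpoint."""

    def __init__(self):
        self.opened = []  # descriptors in creation order

    def open(self, path, flags=os.O_RDONLY):
        fd = os.open(path, flags)
        self.opened.append(fd)
        return fd

    def checkpoint(self):
        # The checkpoint is just a position in the creation log.
        return len(self.opened)

    def rollback(self, mark):
        # Close everything created after the checkpoint, newest first.
        for fd in reversed(self.opened[mark:]):
            try:
                os.close(fd)
            except OSError:
                pass  # the agent already closed it
        del self.opened[mark:]


tracker = FdTracker()
kept = tracker.open(os.devnull)      # exists before the checkpoint
mark = tracker.checkpoint()
leaked = tracker.open(os.devnull)    # created during exploration
tracker.rollback(mark)

os.fstat(kept)                       # still valid
try:
    os.fstat(leaked)
except OSError:
    print("descriptor opened after the checkpoint was closed")
```

Closing a socket on rollback stops further traffic, but it does not recall bytes the agent already sent, so externally visible side effects remain outside what rollback can undo.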

Exploration Economics

The performance improvement changes the cost structure of agent exploration:

Before DeltaBox (500ms per checkpoint):

2 checkpoints per second per sandbox
Tree search with branching factor 5, depth 4: 156 checkpoints = 78 seconds
Cost-prohibitive for real-time agent interactions

With DeltaBox (0.8ms per checkpoint):

1,250 checkpoints per second per sandbox
Same tree search: 156 checkpoints = 125ms
Enables interactive agent exploration

This makes previously impractical search strategies viable. Monte Carlo tree search, which requires thousands of rollouts, becomes feasible for coding agents. Reinforcement learning loops that need to explore hundreds of action sequences per episode can run in real time.
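The arithmetic behind these figures is short enough to check directly. The helper names below are made up for the example, and the tree size counts four levels with the root included, which is the convention that yields 156.

```python
def tree_nodes(branching, levels):
    """Nodes in a full tree with the given number of levels, root included."""
    return sum(branching ** k for k in range(levels))


def search_cost(checkpoints, latency_ms):
    """Return (checkpoints per second, total checkpoint time in ms)."""
    return 1000 / latency_ms, checkpoints * latency_ms


nodes = tree_nodes(branching=5, levels=4)  # 1 + 5 + 25 + 125 = 156
for label, latency_ms in [("full copy", 500), ("DeltaBox", 0.8)]:
    rate, total_ms = search_cost(nodes, latency_ms)
    print(f"{label}: {rate:,.0f} checkpoints/s, "
          f"{total_ms:,.1f} ms for {nodes} checkpoints")
```

At 500ms per checkpoint a sandbox manages 2 checkpoints per second, so the 625x latency reduction carries over directly to throughput and to total search time.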

The memory overhead is also lower. Instead of duplicating 500MB of sandbox state for each checkpoint, DeltaBox stores only the deltas. In the paper’s benchmarks, a checkpoint chain with 100 nodes consumed 2.3GB total, versus 50GB for full duplication.

Code Example: Checkpoint API

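import deltabox  # API reconstructed from the DeltaBox design; refer to the paper for the exact interface


def evaluate(result):
    # Placeholder scorer (an assumption, not part of DeltaBox): replace with a
    # task-specific metric. Assumes the result object exposes an exit_code.
    return 1.0 if getattr(result, "exit_code", None) == 0 else 0.0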

# Initialize sandbox with DeltaBox support
sandbox = deltabox.Sandbox(
    image="python:3.11-slim",
    memory_limit="512M",
    max_checkpoint_depth=50
)

# Create initial checkpoint
checkpoint_0 = sandbox.checkpoint()

# Execute agent action
result = sandbox.execute([
    "python", "-c",
    "import json; json.dump({'status': 'ok'}, open('/tmp/result.json', 'w'))"
])

# Create checkpoint after action
checkpoint_1 = sandbox.checkpoint()

# Try alternative action
sandbox.rollback(checkpoint_0)
alt_result = sandbox.execute([
    "python", "-c",
    "import json; json.dump({'status': 'failed'}, open('/tmp/result.json', 'w'))"
])

# Compare outcomes and choose the best path.
# Assumes checkpoint_1 stays restorable after rolling back past it to checkpoint_0.
if evaluate(result) > evaluate(alt_result):
    sandbox.rollback(checkpoint_1)
# Otherwise the current state is already the alternative.

# Checkpoint depth is automatically managed
# Oldest checkpoints are garbage collected when depth exceeds limit

The API exposes checkpoint and rollback as first-class operations. The orchestrator can create checkpoints at any point, explore multiple branches, and roll back to any previous state. The garbage collection policy prevents unbounded memory growth by pruning old checkpoints when the depth limit is reached.
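To show how an orchestrator might drive those two calls in a search loop, here is a depth-first exploration against an in-memory stand-in. Everything in it is an assumption made for illustration: FakeSandbox mimics only the checkpoint, rollback, and execute calls, takes Python callables instead of shell commands, and implements checkpoints with full deep copies, which is exactly the cost DeltaBox is designed to avoid.

```python
import copy


class FakeSandbox:
    """In-memory stand-in for a sandbox (hypothetical, full-copy snapshots)."""

    def __init__(self):
        self.state = {"value": 0}
        self._snapshots = []

    def checkpoint(self):
        self._snapshots.append(copy.deepcopy(self.state))
        return len(self._snapshots) - 1

    def rollback(self, checkpoint_id):
        self.state = copy.deepcopy(self._snapshots[checkpoint_id])

    def execute(self, action):
        action(self.state)


def explore(sandbox, actions, depth, score, path=()):
    """Depth-first search; returns (best_score, best_path)."""
    best = (score(sandbox.state), path)
    if depth == 0:
        return best
    here = sandbox.checkpoint()          # snapshot before trying any action
    for name, action in actions.items():
        sandbox.execute(action)
        candidate = explore(sandbox, actions, depth - 1, score, path + (name,))
        if candidate[0] > best[0]:
            best = candidate
        sandbox.rollback(here)           # rewind before the next sibling
    return best


def add_three(state):
    state["value"] += 3


def double(state):
    state["value"] *= 2


actions = {"add3": add_three, "double": double}
sandbox = FakeSandbox()
best_score, best_path = explore(
    sandbox, actions, depth=3, score=lambda s: -abs(s["value"] - 12)
)
print(best_score, best_path)  # 0 ('add3', 'add3', 'double')
```

The search takes one checkpoint per internal node and one rollback per edge, so rollback latency is paid even more often than checkpoint latency.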

Observability Challenges

Checkpoint-heavy workloads may create new observability problems. Traditional tracing assumes linear execution. When an agent rolls back and explores a different path, the trace becomes a tree. Spans that were created in a rolled-back execution need to be marked as discarded, not failed.

DeltaBox does not solve this directly, but it exposes checkpoint metadata that tracing systems can use:

Checkpoint ID and parent checkpoint ID
Timestamp of checkpoint creation and rollback
Number of filesystem blocks and memory pages modified since parent
Process tree at checkpoint time

An observability layer can use this metadata to reconstruct the exploration tree and attribute costs correctly. Without it, you see thousands of “failed” tool calls that were actually successful explorations of suboptimal paths.
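As a rough sketch of that reconstruction, the function below takes checkpoint records and labels each one as committed or discarded, depending on whether it lies on the path to the checkpoint the agent finally kept. The record field names (id, parent_id, pages_modified) are assumptions for the example, not DeltaBox's actual metadata schema.

```python
def classify(records, final_id):
    """Label checkpoints on the path to final_id as committed, others as discarded."""
    parent = {rec["id"]: rec["parent_id"] for rec in records}
    kept = set()
    node = final_id
    while node is not None:   # walk from the final checkpoint up to the root
        kept.add(node)
        node = parent[node]
    return {cid: "committed" if cid in kept else "discarded" for cid in parent}


records = [
    {"id": 0, "parent_id": None, "pages_modified": 0},
    {"id": 1, "parent_id": 0, "pages_modified": 12},   # explored, then rolled back
    {"id": 2, "parent_id": 0, "pages_modified": 7},
    {"id": 3, "parent_id": 2, "pages_modified": 3},
]

labels = classify(records, final_id=3)
print(labels)
# {0: 'committed', 1: 'discarded', 2: 'committed', 3: 'committed'}

wasted_pages = sum(
    rec["pages_modified"] for rec in records if labels[rec["id"]] == "discarded"
)
print(wasted_pages)  # 12
```

A tracing backend could attach the same labels to spans, so that rolled-back branches show up as exploration cost instead of as errors.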

Deployment Shape

DeltaBox requires kernel modifications, so it is not a drop-in replacement for Docker or Firecracker. The paper describes a prototype implementation on Linux 6.8 with custom kernel modules for DeltaFS and DeltaCR.

Deployment options:

Bare metal with custom kernel: Full performance, requires kernel patching
VM with nested virtualization: Run DeltaBox kernel inside a VM, some performance overhead
Userspace emulation: Implement layered filesystem and process snapshots in userspace, significantly slower but no kernel changes

For production agent workloads, the bare metal option makes sense if you control the infrastructure. For SaaS platforms, the VM approach provides isolation without requiring customers to run custom kernels.

The checkpoint state is stored in memory by default, but DeltaBox supports persisting checkpoints to disk for long-running explorations. This enables pause/resume of agent sessions and recovery from crashes.

Failure Modes

Checkpoint depth explosion: If the agent creates checkpoints in a loop without garbage collection, memory usage grows linearly. The depth limit prevents unbounded growth, but choosing the right limit requires profiling the agent’s exploration pattern.

Rollback to corrupted checkpoint: If the template process or filesystem layer is corrupted (e.g., by a kernel bug or hardware error), rollback restores corrupted state. DeltaBox does not include checksums or integrity verification, so this is a silent failure.

Resource leaks across rollbacks: File descriptors, network sockets, and kernel objects created during exploration must be cleaned up during rollback. DeltaBox tracks these resources, but bugs in the tracking logic can cause leaks.

Performance degradation with deep checkpoint chains: Filesystem reads traverse the layer stack from top to bottom. With a 50-layer chain, a read that misses the cache can take up to 50 directory lookups. The paper reports that read latency increases linearly with checkpoint depth, so very deep chains (>100) may need periodic compaction.
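The effect of compaction on lookup depth can be shown with layers modeled as plain dictionaries. This is a sketch only: lookup and compact are hypothetical helpers, and merging layers this way gives up the ability to roll back to any checkpoint inside the merged range.

```python
def lookup(layers, path):
    """Return (value, layers_probed), searching the newest layer first."""
    for probes, layer in enumerate(reversed(layers), start=1):
        if path in layer:
            return layer[path], probes
    raise FileNotFoundError(path)


def compact(layers, keep_newest=1):
    """Merge all but the newest keep_newest layers into a single base layer."""
    split = len(layers) - keep_newest
    base = {}
    for layer in layers[:split]:  # oldest first, so newer entries win
        base.update(layer)
    return [base, *layers[split:]]


# One old file at the bottom of a 50-layer chain.
layers = [{"/etc/conf": "v0"}] + [{f"/tmp/f{i}": i} for i in range(1, 50)]
print(lookup(layers, "/etc/conf"))     # ('v0', 50)

compacted = compact(layers, keep_newest=1)
print(len(compacted))                  # 2
print(lookup(compacted, "/etc/conf"))  # ('v0', 2)
```

In practice the choice of keep_newest is a trade-off between read latency and how far back the agent still needs to roll back.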

Technical Verdict

Use DeltaBox when your agent workload requires high-frequency exploration (tree search, reinforcement learning, test-time compute scaling) and you control the kernel environment. The sub-millisecond checkpoint latency makes previously infeasible search strategies practical, particularly for coding agents that need to explore hundreds of execution paths per interaction. The security model is sound for isolated sandboxes, with immutable checkpoints preventing state leakage across rollbacks.

Avoid DeltaBox when your agent uses single-shot inference without exploration, when you need a drop-in replacement for existing container runtimes, or when your deployment environment does not allow kernel modifications. The checkpoint primitive is powerful but not universal. For agents that execute a single tool call and return a result, the overhead is unnecessary. For agents that explore thousands of paths per second, it is the difference between feasible and infeasible. Shared-kernel deployments need careful security review, and unbounded checkpoint chains without pruning will exhaust memory.