Three Ways to Sandbox Agent Tool Calls: Docker, Managed Interpreters, and SDK Proxies

Programmatic tool calling (PTC) lets an agent write code that invokes multiple tools inside a sandbox instead of making round-trip LLM calls for every tool invocation. The model generates Python once, the sandbox executes it, and only the final result returns to the model context. This cuts latency and token consumption for multi-step workflows.

AWS just published implementation patterns for three distinct sandbox architectures on Bedrock. Each has different isolation boundaries, deployment complexity, and failure modes. Here’s how they compare.

The Problem PTC Solves

Traditional tool calling creates compounding latency. For a query like “Which engineering team members exceeded their Q3 travel budget?”, the agent must:

Call a tool to list team members (20 people)
Call a tool to fetch each person’s travel expenses (20 separate calls)
Call a tool to get budget thresholds
Reason about the results and filter

In this example that is 22 tool calls, and every intermediate result passes through the model’s context window. With PTC, the model writes code that loops through the list, fetches expenses, applies filters, and returns only the final answer: one code-generation call, one execution, one result.
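As a rough sketch, the script a model might generate for the travel-budget query could look like the following. The `call_tool` helper, the tool names, and the response fields (`id`, `name`, `total`, `limit`) are assumptions for illustration; `call_tool` is stubbed with in-memory data here so the example runs on its own.

```python
# Hypothetical in-memory stand-ins for the real tools; a sandbox would
# inject a call_tool that reaches the actual tool endpoints instead.
_FAKE_DATA = {
    "members": [
        {"id": "u1", "name": "Ana"},
        {"id": "u2", "name": "Ben"},
        {"id": "u3", "name": "Chi"},
    ],
    "expenses": {"u1": 4200.0, "u2": 1800.0, "u3": 5100.0},
    "budget": {"limit": 4000.0},
}


def call_tool(tool_name, **kwargs):
    """Stub dispatcher; tool names and payload shapes are assumed."""
    if tool_name == "get_team_members":
        return _FAKE_DATA["members"]
    if tool_name == "get_travel_expenses":
        return {"total": _FAKE_DATA["expenses"][kwargs["user_id"]]}
    if tool_name == "get_budget_thresholds":
        return _FAKE_DATA["budget"]
    raise ValueError(f"Tool {tool_name} not allowed")


# --- The part the model would generate ---
members = call_tool("get_team_members", team="engineering")
limit = call_tool("get_budget_thresholds", quarter="Q3")["limit"]

over_budget = []
for member in members:
    spent = call_tool("get_travel_expenses", user_id=member["id"])["total"]
    if spent > limit:
        over_budget.append({"name": member["name"], "spent": spent})

# Only this value goes back into the model's context.
result = sorted(over_budget, key=lambda row: row["spent"], reverse=True)
```

The per-person expense records never reach the model; only the short `result` list does.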

Architecture 1: Self-Hosted Docker on ECS

You run a containerized Python sandbox on ECS Fargate. The agent sends generated code to an API endpoint. The container executes it, calls tools via HTTP or SDK, and returns the output.

Isolation boundary: Network-level. The container runs in a private subnet with security group rules that whitelist only approved tool endpoints. No internet egress by default.
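Security groups match on IP ranges rather than hostnames, so the allowlist has to be expressed as CIDR blocks. The sketch below builds the `IpPermissions` structure that boto3's `authorize_security_group_egress` accepts; the CIDR ranges and descriptions are hypothetical. A newly created security group allows all outbound traffic by default, so that default rule would also need to be revoked for the allowlist to have any effect.

```python
import ipaddress


def build_egress_rules(tool_cidrs, port=443):
    """Build an IpPermissions list that allows TCP on one port to the given CIDRs."""
    ranges = []
    for cidr, description in tool_cidrs.items():
        # Raises ValueError on a malformed CIDR, before anything reaches AWS.
        network = ipaddress.ip_network(cidr, strict=True)
        ranges.append({"CidrIp": str(network), "Description": description})
    return [{
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": ranges,
    }]


# Hypothetical private ranges for two tool backends.
rules = build_egress_rules({
    "10.0.12.0/24": "team directory API",
    "10.0.20.0/24": "expenses API",
})

# With boto3 installed and credentials configured, the rules could be applied
# roughly like this (group_id is whatever security group the task uses):
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_egress(GroupId=group_id, IpPermissions=rules)
```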

Deployment shape:

ECS task definition with CPU and memory limits (the execution timeout is enforced by the sandbox API, not by the task definition)
Application Load Balancer for the code execution API
IAM task role scoped to specific tool APIs (DynamoDB, S3, Lambda)
CloudWatch Logs for execution traces
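To make the deployment shape concrete, here is a minimal sketch of a Fargate task definition expressed as the keyword arguments for boto3's `register_task_definition`. The family name, container name, port, image URI, and role ARNs are placeholders, not values from the AWS post. A task definition has no execution-timeout field, so this sketch assumes the sandbox service enforces the timeout itself.

```python
def build_task_definition(image, task_role_arn, execution_role_arn,
                          log_group, region, cpu="512", memory="1024"):
    """Return keyword arguments for ecs.register_task_definition (Fargate)."""
    return {
        "family": "ptc-sandbox",                  # hypothetical name
        "networkMode": "awsvpc",                  # required for Fargate
        "requiresCompatibilities": ["FARGATE"],
        "cpu": cpu,                               # CPU units, passed as a string
        "memory": memory,                         # MiB, passed as a string
        "taskRoleArn": task_role_arn,             # what sandboxed code may call
        "executionRoleArn": execution_role_arn,   # image pull and log delivery
        "containerDefinitions": [{
            "name": "sandbox",
            "image": image,
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": log_group,
                    "awslogs-region": region,
                    "awslogs-stream-prefix": "sandbox",
                },
            },
        }],
    }


# Placeholder identifiers for illustration only.
task_def = build_task_definition(
    image="123456789012.dkr.ecr.us-east-1.amazonaws.com/ptc-sandbox:latest",
    task_role_arn="arn:aws:iam::123456789012:role/ptc-sandbox-task",
    execution_role_arn="arn:aws:iam::123456789012:role/ptc-sandbox-exec",
    log_group="/ecs/ptc-sandbox",
    region="us-east-1",
)

# With boto3 available: boto3.client("ecs").register_task_definition(**task_def)
```

Keeping the task role separate from the execution role matters here: only the task role is visible to generated code, so it is the one to scope tightly.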

State management: Ephemeral. Each execution spins up a fresh container or reuses a warm one. No filesystem persistence between runs unless you mount EFS.

Failure modes:

Code execution timeout (default 5 minutes, configurable)
Out-of-memory crashes if the generated code accumulates data without bound (for example, appending inside an unbounded loop)
Network timeouts if a tool API is slow
IAM permission errors if the task role lacks access

When to use: You need full control over the execution environment, custom Python packages, or tools that require VPC access. You’re comfortable managing container lifecycle and security patching.

Architecture 2: Bedrock AgentCore Code Interpreter

Amazon Bedrock AgentCore includes a managed code interpreter. You define tools in the agent configuration. The model generates code, Bedrock executes it in a managed sandbox, and returns the result. No infrastructure to deploy.

Isolation boundary: AWS-managed. The sandbox has no internet access. Tool calls route through Bedrock’s internal service mesh. You cannot SSH into the environment or inspect the runtime.

Deployment shape:

Agent definition in Bedrock console or CloudFormation
Tool definitions as Lambda functions or API Gateway endpoints
IAM resource policy on the agent to allow invocation
CloudWatch Logs for agent traces (code generation and execution logs)

State management: Stateless. Each code execution is isolated. No shared filesystem. If you need to pass data between executions, store it in S3 or DynamoDB and reference it in the next tool call.
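One way to hand data from one execution to the next is a pair of helpers inside a tool (for example, a Lambda function) that write and read JSON in S3. This is a sketch under assumptions: the bucket name and key layout are made up, and an in-memory stand-in replaces the S3 client so the example runs without AWS access. The two client calls it mimics, `put_object` and `get_object`, use the same keyword arguments as the boto3 S3 client.

```python
import io
import json


def save_state(s3, bucket, key, data):
    """Serialize intermediate results to S3 so a later execution can read them."""
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(data).encode("utf-8"))
    return f"s3://{bucket}/{key}"


def load_state(s3, bucket, key):
    """Read back results written by an earlier execution."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


class InMemoryS3:
    """Test double that mimics the two S3 client calls used above."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body
        return {}

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


s3 = InMemoryS3()  # in a real tool: s3 = boto3.client("s3")
uri = save_state(s3, "agent-scratch", "runs/run-001/expenses.json", {"u1": 4200.0})
restored = load_state(s3, "agent-scratch", "runs/run-001/expenses.json")
```

The first execution would return `uri` as its result, and the next tool call would take that reference as input instead of the full payload.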

Failure modes:

Execution timeout (fixed at 120 seconds)
Memory limit exceeded (not documented, but observed around 512 MB)
Tool invocation errors if Lambda times out or returns malformed JSON
Code generation errors if the model halts mid-script

When to use: You want zero infrastructure management and your tools are already Lambda functions or HTTP APIs. You can tolerate a 120-second execution cap and don’t need custom Python libraries beyond the standard library and boto3.

Architecture 3: SDK Proxy for Anthropic Compatibility

Some teams use Anthropic’s SDK and want to switch to Bedrock without rewriting orchestration code. You deploy a proxy that translates Anthropic SDK tool calls into Bedrock API requests. The proxy handles schema conversion, auth, and response formatting.

Isolation boundary: Same as the underlying Bedrock agent. The proxy is a translation layer, not a sandbox. It forwards requests to Bedrock AgentCore or your self-hosted ECS sandbox.

Deployment shape:

Lambda function or ECS service running the proxy
API Gateway or ALB in front
IAM role with bedrock:InvokeAgent permissions
Environment variables for Bedrock agent ARN and region

State management: Proxy is stateless. It maps Anthropic SDK request schemas to Bedrock API schemas and vice versa. Session state lives in the Bedrock agent or your orchestration layer.
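The schema mapping can be sketched for tool definitions and tool-call blocks. This example assumes the proxy targets the `toolSpec` and `toolUse` shapes of the Bedrock Converse API; a proxy that calls `InvokeAgent` would need a different mapping. The function names are hypothetical.

```python
def anthropic_tool_to_bedrock(tool):
    """Map an Anthropic-style tool definition to a Converse-style toolSpec."""
    spec = {
        "name": tool["name"],
        "inputSchema": {"json": tool["input_schema"]},
    }
    if "description" in tool:
        spec["description"] = tool["description"]
    return {"toolSpec": spec}


def bedrock_tool_use_to_anthropic(block):
    """Map a Converse-style toolUse content block to an Anthropic tool_use block."""
    tool_use = block["toolUse"]
    return {
        "type": "tool_use",
        "id": tool_use["toolUseId"],
        "name": tool_use["name"],
        "input": tool_use["input"],
    }


anthropic_tool = {
    "name": "get_travel_expenses",
    "description": "Fetch travel expenses for one user.",
    "input_schema": {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    },
}
bedrock_tool = anthropic_tool_to_bedrock(anthropic_tool)

bedrock_block = {"toolUse": {"toolUseId": "tu-1", "name": "get_travel_expenses",
                             "input": {"user_id": "u1"}}}
anthropic_block = bedrock_tool_use_to_anthropic(bedrock_block)
```

Both functions are pure, which fits the stateless design: the proxy can translate each request and response without remembering anything between calls.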

Failure modes:

Schema mismatch if Anthropic adds new tool call parameters
Auth errors if the proxy’s IAM role lacks permissions
Latency overhead (typically 50-100ms per request)
Version skew if you update the Anthropic SDK but not the proxy
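Schema mismatch and version skew are easier to diagnose if the proxy rejects fields it does not know how to translate, rather than dropping them silently. A minimal sketch follows; the set of known fields and the exception name are assumptions about a hypothetical proxy, not a complete list of what either API accepts.

```python
# Fields this hypothetical proxy knows how to translate.
KNOWN_TOOL_FIELDS = frozenset({"name", "description", "input_schema"})


class UnsupportedToolField(ValueError):
    """Raised when a tool definition carries a field the proxy cannot map."""


def check_tool_fields(tool, known=KNOWN_TOOL_FIELDS):
    """Return the tool unchanged, or raise if it has untranslatable fields."""
    unknown = sorted(set(tool) - known)
    if unknown:
        raise UnsupportedToolField(
            f"Tool {tool.get('name', '<unnamed>')!r} has unsupported fields: "
            + ", ".join(unknown)
        )
    return tool
```

Failing on the first unknown field turns a quiet behavior change after an SDK upgrade into an explicit error that names the field.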

When to use: You have existing orchestration code built on Anthropic’s SDK and want to swap the backend without changing application logic. You’re willing to maintain a translation layer.

Comparison Table

Dimension	Docker on ECS	Bedrock Code Interpreter	SDK Proxy
Isolation	Network + IAM	AWS-managed black box	Inherits from backend
Execution timeout	Configurable (up to 15 min)	Fixed at 120 seconds	Backend-dependent
Custom packages	Yes (Dockerfile)	No (stdlib + boto3 only)	Backend-dependent
Deployment complexity	High (ECS, ALB, IAM)	Low (agent config only)	Medium (proxy + IAM)
Observability	Full (CloudWatch, X-Ray)	Limited (agent logs)	Proxy logs + backend logs
Cost model	ECS Fargate per-second	Bedrock agent invocation	Proxy compute + backend

Code Snippet: ECS Sandbox Execution Flow

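import json
import multiprocessing
from urllib.parse import quote

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Allowlist of tool endpoints the generated code may call
ALLOWED_TOOLS = {
    "get_team_members": "https://api.internal/team/members",
    "get_travel_expenses": "https://api.internal/expenses/{user_id}",
}

def call_tool(tool_name, **kwargs):
    """Call an allowlisted tool over HTTP and return its JSON response."""
    if tool_name not in ALLOWED_TOOLS:
        raise ValueError(f"Tool {tool_name} not allowed")
    # URL-encode arguments so generated code cannot rewrite the request path
    params = {key: quote(str(value), safe="") for key, value in kwargs.items()}
    url = ALLOWED_TOOLS[tool_name].format(**params)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

def run_code(code, conn):
    """Run generated code in a child process and send back its `result`."""
    namespace = {"call_tool": call_tool}
    try:
        exec(code, namespace)
        result = namespace.get("result")
        json.dumps(result)  # fail here if the result cannot be returned as JSON
        conn.send({"status": "success", "result": result})
    except BaseException as e:  # includes SystemExit raised by the generated code
        conn.send({"status": "error", "message": str(e)})

@app.route("/execute", methods=["POST"])
def execute_code():
    payload = request.get_json(silent=True) or {}
    code = payload.get("code")
    timeout = payload.get("timeout", 300)
    if not isinstance(code, str) or not isinstance(timeout, (int, float)):
        return jsonify({"status": "error", "message": "Invalid 'code' or 'timeout'"}), 400

    # Run in a separate process so the timeout can be enforced by killing it.
    # CPU and memory limits come from the ECS task definition, not from this code.
    parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
    worker = multiprocessing.Process(target=run_code, args=(code, child_conn))
    worker.start()
    child_conn.close()  # parent keeps only the read end
    try:
        if not parent_conn.poll(timeout):
            return jsonify({"status": "error", "message": "Execution timed out"}), 504
        outcome = parent_conn.recv()
    except EOFError:
        outcome = {"status": "error", "message": "Sandbox process exited unexpectedly"}
    finally:
        worker.kill()
        worker.join()
    return jsonify(outcome), 200 if outcome["status"] == "success" else 500

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)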

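The ECS task definition sets CPU and memory limits. The security group blocks all outbound traffic except to the IP ranges of the approved tool endpoints. A call to an unapproved tool name is rejected by call_tool at the application layer; if the code bypasses call_tool and opens its own connection, the request fails at the network layer.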

Observability Differences

Docker on ECS: You get full execution traces in CloudWatch Logs. You can enable X-Ray for distributed tracing across tool calls. Fargate tasks do not accept SSH, but you can open a shell in a running container with ECS Exec for debugging (not recommended in production, but possible).

Bedrock Code Interpreter: You see agent invocation logs and code generation logs, but not the Python execution environment. If the code crashes, you get a generic error message. No way to inspect the sandbox state.

SDK Proxy: You see proxy logs (request/response pairs) and backend logs (Bedrock or ECS). Debugging requires correlating logs across two layers. Latency breakdown is harder to trace.
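Correlating the two log layers gets easier when both carry the same identifier. The sketch below assumes a hypothetical `X-Correlation-Id` header and JSON log lines; neither is prescribed by Bedrock or the Anthropic SDK. The proxy attaches the ID to the forwarded request, and each layer includes it in every log line.

```python
import json
import logging
import uuid

logger = logging.getLogger("ptc.proxy")

CORRELATION_HEADER = "X-Correlation-Id"  # hypothetical header name


def with_correlation_id(headers):
    """Return a copy of headers carrying a correlation ID, reusing one if present."""
    outgoing = dict(headers)
    outgoing.setdefault(CORRELATION_HEADER, uuid.uuid4().hex)
    return outgoing


def log_event(layer, event, correlation_id, **fields):
    """Emit one JSON log line and return it."""
    line = json.dumps(
        {"layer": layer, "event": event, "correlation_id": correlation_id, **fields},
        sort_keys=True,
    )
    logger.info(line)
    return line


headers = with_correlation_id({"Content-Type": "application/json"})
cid = headers[CORRELATION_HEADER]
proxy_line = log_event("proxy", "request_forwarded", cid, latency_ms=72)
backend_line = log_event("sandbox", "execution_finished", cid, status="success")
```

A log query that filters on `correlation_id` then returns the proxy and backend entries for one request together, and the `latency_ms` field gives a per-layer breakdown.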

Security Boundaries

All three architectures assume the model can generate malicious code. The sandbox must prevent:

Network access to unauthorized endpoints
Filesystem writes outside a scratch directory
Subprocess spawns that escape resource limits
Credential leakage via environment variables
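Two of these controls can be illustrated at the process level, assuming a self-hosted sandbox on Linux: a child interpreter started with a scrubbed environment and a throwaway working directory, killed after a wall-clock timeout. The function name and the `DEMO_SECRET` variable are made up for the example. This sketch does not block network access, writes to absolute paths, or further subprocess spawns; those need OS-level or network-level controls.

```python
import os
import subprocess
import sys
import tempfile


def run_untrusted(code, timeout_s=5):
    """Run code in a child interpreter with a clean env, scratch cwd, and timeout."""
    with tempfile.TemporaryDirectory() as scratch:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=scratch,                     # relative writes land in scratch
                env={"PATH": "/usr/bin:/bin"},   # parent's variables not inherited
                capture_output=True,
                text=True,
                timeout=timeout_s,               # child is killed on expiry
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "success" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}


os.environ["DEMO_SECRET"] = "do-not-leak"  # visible to the parent process only
leak_check = run_untrusted("import os; print(os.environ.get('DEMO_SECRET'))")
```

Note that on ECS the task role credentials are served from a container metadata endpoint rather than stored in plain variables, so scrubbing the environment is one layer, not a complete defense.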

Docker on ECS: You control the security group, IAM task role, and container image. You can run static analysis on the generated code before execution. You can enforce seccomp profiles or AppArmor policies.
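A pre-execution static check could be sketched with Python's `ast` module. The blocked module and call lists below are an assumed policy, not a recommendation from the AWS post. A check like this is easy to evade (for example through `getattr` tricks), so it works as an early filter in front of the network and IAM controls, not as a replacement for them.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "socket", "ctypes"}          # assumed policy
BLOCKED_CALLS = {"eval", "exec", "compile", "__import__", "open"}   # assumed policy


def find_violations(code):
    """Return human-readable policy violations found in the code.

    Raises SyntaxError if the code does not parse.
    """
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            # Compare the top-level package, so "os.path" matches "os".
            if name.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"line {node.lineno}: import of {name}")
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            violations.append(f"line {node.lineno}: call to {node.func.id}()")
    return violations


suspicious = "import subprocess\nresult = eval('1 + 1')\n"
report = find_violations(suspicious)
```

The sandbox API would run this before execution and return the report to the agent as an error, giving the model a chance to regenerate the script.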

Bedrock Code Interpreter: AWS manages the sandbox. You trust their isolation. You cannot inspect the runtime or add custom security policies.

SDK Proxy: Security depends on the backend. The proxy is a translation layer, not a sandbox. With a Bedrock backend, you inherit Bedrock’s isolation; with your own ECS sandbox, you inherit your own security policies.

Technical Verdict

Use Docker on ECS when:

You need custom Python packages (pandas, numpy, domain-specific libraries)
Tool calls require VPC access (RDS, ElastiCache, internal APIs)
Execution time exceeds 120 seconds
You want full observability and crash dump access

Use Bedrock Code Interpreter when:

Your tools are Lambda functions or public HTTP APIs
Execution time is under 120 seconds
You want zero infrastructure management
Standard library and boto3 cover your needs

Use an SDK proxy when:

You have existing orchestration code built on Anthropic’s SDK
You want to A/B test Bedrock vs. Anthropic without rewriting application logic
You’re willing to maintain a translation layer and accept 50-100ms latency overhead

Avoid all three when:

Your workflow requires stateful execution (use Step Functions or Temporal instead)
You need to debug the model’s reasoning process (PTC hides intermediate steps)
Tool calls are already fast and parallel (traditional tool calling may be simpler)

Source Links

Implementing programmatic tool calling on Amazon Bedrock