Code as Agent Harness: When LLMs Generate Their Own Execution Scaffolding

Most agent frameworks treat code generation as a final output. The agent writes Python, you run it, done. A new pattern flips this: the agent generates code to control its own execution flow. The code becomes the harness, the orchestration layer, the runtime infrastructure.

arXiv paper 2605.18747v1 documents this shift. Instead of calling predefined tool APIs through JSON schemas, agents write the glue code that wires tools together, manages state, and decides what to call next. The code is not the artifact. It is the scaffolding.

Why This Matters Now

Traditional tool-calling works like this: you define a function schema, the LLM returns structured JSON, your framework parses it and invokes the function. The orchestration logic lives in your framework. The agent just picks from a menu.

Code-as-harness inverts the relationship. The agent writes a Python function that:

Imports libraries
Defines control flow (loops, conditionals, error handling)
Calls tools in sequence or parallel
Manages intermediate state
Decides when to stop

The framework executes the generated code. The agent owns the orchestration logic.
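To make the contrast concrete, here is a minimal sketch of both styles side by side. The `TOOLS` registry, the `search` and `summarize` stubs, and the JSON call shape are hypothetical placeholders, not taken from the paper or from any particular framework.

```python
import json

# Hypothetical tool stubs; real tools would call external services.
def search(query: str) -> list:
    return [f"{query} result {i}" for i in range(3)]

def summarize(text: str) -> str:
    return text.upper()

TOOLS = {"search": search, "summarize": summarize}

def dispatch_tool_call(raw_json: str):
    """Traditional style: the framework parses one structured call and invokes it."""
    call = json.loads(raw_json)
    return TOOLS[call["name"]](**call["arguments"])

# Code-as-harness style: the agent emits the orchestration itself.
GENERATED_HARNESS = '''
summaries = []
for hit in tools["search"]("agents"):
    summaries.append(tools["summarize"](hit))
'''

def run_harness(code: str) -> dict:
    """The framework only executes; the control flow lives in the generated code."""
    namespace = {"tools": TOOLS}
    exec(code, namespace)  # unsandboxed, for illustration only
    return namespace
```

In the first style the loop over search hits would live in framework code. In the second, the loop is part of what the agent wrote.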

This pattern shows up in:

Agents that generate Jupyter notebooks to explore data
Systems that write bash scripts to coordinate CLI tools
Workflows where the agent emits a Python module that other agents import

Architecture: Three Layers

The paper organizes code-as-harness into three layers.

Harness Interface

Code connects the agent to three surfaces:

Reasoning: The agent writes code that encodes its plan. A for-loop over search results. A try-except block for retries.
Action: Tool calls become function invocations in generated code. The agent decides argument order, error handling, and retry logic.
Environment modeling: The agent writes data structures (classes, dicts, dataframes) to represent the world state.

This is different from declarative tool schemas. The agent is not filling in blanks. It is writing the control flow.
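A generated harness that touches all three surfaces might look roughly like the sketch below. The `WorldState` class, the `fetch_with_retry` helper, and the `search` and `fetch` tool names are invented for illustration; a real agent would choose its own structures.

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Environment modeling: the agent's own picture of the task."""
    query: str
    fetched: dict = field(default_factory=dict)
    failed: list = field(default_factory=list)

def fetch_with_retry(fetch, url: str, attempts: int = 3):
    """Action: the generated code decides the retry policy around a tool call."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except ConnectionError as exc:
            last_error = exc
    raise last_error

def run_plan(tools: dict, state: WorldState) -> WorldState:
    """Reasoning: the plan is encoded as ordinary control flow."""
    for url in tools["search"](state.query):
        try:
            state.fetched[url] = fetch_with_retry(tools["fetch"], url)
        except ConnectionError:
            # Record the failure and keep going instead of aborting the plan
            state.failed.append(url)
    return state
```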

Harness Mechanisms

To make this reliable, you need:

Planning: The agent outlines the code structure before writing it. Often a comment block or docstring that serves as a spec.
Memory: Generated code can persist state to disk, a database, or a shared context object. The agent decides the serialization format.
Tool use: The agent imports libraries and calls functions. It writes the adapter code if the API does not match its needs.
Feedback loops: The agent runs the code, sees errors or output, and generates a new version. The harness becomes iterative.
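The planning and memory mechanisms can be sketched together: a docstring acts as the spec, and a JSON file acts as the persisted state. The file layout and the `process_items` name are assumptions made for this example, not a format the paper prescribes.

```python
import json
from pathlib import Path

def load_state(path: Path) -> dict:
    """Read prior progress, or start fresh if no checkpoint exists."""
    return json.loads(path.read_text()) if path.exists() else {"done": []}

def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state))

def process_items(items, handle, state_path: Path) -> dict:
    """
    Plan (spec written before the code):
      1. Load prior progress, if any.
      2. Skip items already handled.
      3. Checkpoint after every item so a rerun can resume.
    """
    state = load_state(state_path)
    for item in items:
        if item in state["done"]:
            continue
        handle(item)
        state["done"].append(item)
        save_state(state_path, state)
    return state
```

Because the checkpoint is written after each item, a second run with a longer item list only handles the new entries.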

Multi-Agent Scaling

When multiple agents share a codebase:

One agent writes a module, another imports it.
Agents review each other’s code through static analysis or test execution.
Shared code artifacts become the coordination protocol. No message bus, just function calls.
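One way this could work in practice, as a rough sketch: a writer agent emits module source, a reviewer checks it statically, and a consumer imports it from disk. `MODULE_SOURCE`, `review`, and `load_module` are hypothetical names, and the review here only confirms that the agreed functions exist.

```python
import ast
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical output of a "writer" agent.
MODULE_SOURCE = '''
def dedupe(items):
    """Return items in first-seen order without duplicates."""
    return list(dict.fromkeys(items))
'''

def review(source: str, required: set) -> bool:
    """Reviewer's static check: the module parses and defines the agreed functions."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    return required <= defined

def load_module(source: str, name: str):
    """Consumer imports the shared artifact from a file on disk."""
    path = Path(tempfile.mkdtemp()) / f"{name}.py"
    path.write_text(source)
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

A consumer would call `review(MODULE_SOURCE, {"dedupe"})` first and import only if it passes. Test execution, the other review path mentioned above, would run after the import.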

Security Boundaries

When an agent generates its own execution harness, you lose the safety of a fixed orchestration framework. Consider:

Risk	Fixed Framework	Code-as-Harness
Arbitrary code execution	Framework restricts callable functions	Agent writes any valid Python
Resource limits	Framework enforces timeouts, memory caps	Agent can spawn subprocesses, open sockets
Audit trail	Tool calls logged by framework	Must parse generated code to understand behavior
Privilege escalation	Framework mediates tool access	Agent can import any library, call any function

Mitigation strategies:

Run generated code in a sandbox (gVisor, Firecracker, Docker with seccomp).
Use static analysis to block dangerous imports and calls (os, subprocess, eval, exec).
Require the agent to emit a manifest of intended actions before execution.
Log the generated code alongside execution traces for post-hoc review.
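The static analysis and manifest ideas can be combined in one pass over the syntax tree. The sketch below assumes Python 3.9+ (where a subscript's slice is the constant node itself) and assumes tools are accessed as `tools['name']`. The block lists are illustrative, not exhaustive, and checks like this are easy to evade through aliasing or `getattr`, so they complement a sandbox rather than replace it.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "sys", "socket"}
BLOCKED_CALLS = {"eval", "exec", "__import__", "open"}

def find_violations(code: str, manifest: set) -> list:
    """Return reasons to reject the code; an empty list means it passed."""
    problems = []
    for node in ast.walk(ast.parse(code)):
        # Collect module names from both `import x` and `from x import y`
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in BLOCKED_MODULES:
                problems.append(f"blocked import: {name}")
        # Direct calls to blocked builtins, e.g. eval(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                problems.append(f"blocked call: {node.func.id}")
        # tools['name'] lookups must be declared in the manifest
        if (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id == "tools"
                and isinstance(node.slice, ast.Constant)
                and node.slice.value not in manifest):
            problems.append(f"undeclared tool: {node.slice.value}")
    return problems
```

The manifest here is just the set of tool names the agent declared up front. Any lookup outside that set is flagged before the code runs.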

Versioning and Audit

When orchestration logic is declarative config, you version it in Git. When it is generated code, you need a different approach.

Option 1: Treat generated code as ephemeral

Store only the prompt and the execution result.
Regenerate the code if you need to replay the task.
Problem: Generation is non-deterministic, so a regenerated harness may differ from the one that originally ran, and you cannot reproduce the exact behavior.

Option 2: Snapshot every generated harness

Save the code to a timestamped file or database row.
Tag it with the agent version, model checkpoint, and input context.
Problem: Storage grows fast. Diffing code is harder than diffing config.

Option 3: Hybrid

Store the generated code for critical workflows.
For exploratory tasks, log only the high-level plan and tool call sequence.
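For the snapshot path, one possible approach is to store each harness under a hash of its content, so identical code is written once and each run appends only a metadata line. The directory layout, the file names, and the `meta` fields are assumptions made for this sketch.

```python
import hashlib
import json
import time
from pathlib import Path

def snapshot_harness(code: str, meta: dict, root: Path) -> Path:
    """Write code plus run metadata under a content hash; identical code is stored once."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    target = root / digest[:16]
    target.mkdir(parents=True, exist_ok=True)
    (target / "harness.py").write_text(code)
    record = {"sha256": digest, "saved_at": time.time(), **meta}
    # Append-only log: one JSON line per run that used this harness
    with (target / "runs.jsonl").open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return target
```

A call such as `snapshot_harness(code, {"model": "checkpoint-name", "agent_version": "1.2"}, Path("snapshots"))` would tag the run with the fields from Option 2. Deduplication reduces storage growth but does nothing for the diffing problem.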

Failure Modes

The agent writes buggy harness code

The agent generates a script with a syntax error or logic bug. The framework catches the exception and feeds it back to the agent. The agent tries again.

This works until:

The agent enters a loop, repeatedly generating the same broken code.
The agent fixes the syntax but introduces a semantic bug (wrong API call, off-by-one error).
The agent gives up and returns a half-working harness.

Mitigation: Set a retry limit. Use a separate verification agent to review the code before execution. Require the agent to write unit tests alongside the harness.
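A retry limit with loop detection can be sketched in a few lines. Here `generate` and `execute` are hypothetical callables: the first stands in for a call to the agent, the second for whatever runs the code and returns a dict with `success` and `error` keys.

```python
import hashlib

def run_with_retries(generate, execute, max_attempts: int = 3) -> dict:
    """
    generate(feedback) -> code string   (hypothetical call into the agent)
    execute(code) -> dict with 'success' and 'error' keys
    """
    seen = set()
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)
        fingerprint = hashlib.sha256(code.encode("utf-8")).hexdigest()
        # Stop early if the agent resubmits code that already failed
        if fingerprint in seen:
            return {"success": False, "attempts": attempt,
                    "error": "agent repeated a previously failed harness"}
        seen.add(fingerprint)
        result = execute(code)
        if result["success"]:
            return {**result, "attempts": attempt}
        feedback = result["error"]
    return {"success": False, "attempts": max_attempts, "error": feedback}
```

Hashing catches only byte-identical repeats. A harness that differs by whitespace or a renamed variable would slip through, so the attempt cap remains the real backstop.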

The harness code is correct but inefficient

The agent writes a working script that makes 1000 sequential API calls instead of batching. Or it loads a 10GB file into memory instead of streaming.

Mitigation: Provide the agent with performance guidelines in the system prompt. Use a cost model that penalizes slow or expensive operations. Run the code in a resource-limited sandbox and fail fast if it exceeds limits.
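The fail-fast part can be approximated with a wall-clock budget on a child process. This sketch uses only the standard library, and `run_with_timeout` is a hypothetical helper name. It bounds time only: memory and CPU caps would need something like `resource.setrlimit` on Unix or container limits, and a child process is not a sandbox.

```python
import subprocess
import sys

def run_with_timeout(code: str, seconds: float) -> dict:
    """Run code in a child Python process and kill it if it exceeds the time budget."""
    try:
        proc = subprocess.run(
            # -I runs Python in isolated mode (ignores env vars and user site-packages)
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        return {"success": False, "error": f"exceeded {seconds}s time limit"}
    if proc.returncode != 0:
        return {"success": False, "error": proc.stderr.strip()}
    return {"success": True, "output": proc.stdout}
```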

The agent debugs its own scaffolding

The agent generates harness code, runs it, sees an error, and generates new harness code to debug the first harness. This can recurse.

Mitigation: Consider separating the harness generation phase from the execution phase. Avoid allowing the agent to modify the harness after execution starts. If the harness fails, restart from the planning phase instead of patching.

Implementation Example

Here is a minimal code-as-harness setup using Python’s exec:

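import ast
import sys
from io import StringIO

# Top-level modules the agent may not import
BLOCKED_MODULES = {'os', 'subprocess', 'sys'}

# Small allow-list of builtins; the example harness below needs print
SAFE_BUILTINS = {'print': print, 'len': len, 'range': range, 'enumerate': enumerate}

def execute_agent_harness(generated_code: str, tools: dict) -> dict:
    """
    Run agent-generated code in a restricted namespace.

    Args:
        generated_code: Python code string from the agent
        tools: Dict of allowed functions the agent can call

    Returns:
        Dict with 'success', 'output', and 'error' keys
    """
    # Parse to check for dangerous imports (both `import x` and `from x import y`)
    try:
        tree = ast.parse(generated_code)
    except SyntaxError as e:
        return {'success': False, 'output': '', 'error': f'Syntax error: {e}'}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or '']
        else:
            continue
        for module in modules:
            # Compare the top-level package so `os.path` is caught too
            if module.split('.')[0] in BLOCKED_MODULES:
                return {'success': False, 'output': '', 'error': f'Blocked import: {module}'}

    # Execute in a restricted namespace.
    # Note: Production code should use RestrictedPython or similar, inside a sandbox.
    # A trimmed __builtins__ is a convenience, not a security boundary.
    # Without __import__ in the builtins, every import statement fails at runtime.
    namespace = {'tools': tools, '__builtins__': dict(SAFE_BUILTINS)}

    # Capture stdout
    old_stdout = sys.stdout
    sys.stdout = captured_output = StringIO()
    try:
        exec(generated_code, namespace)
        return {'success': True, 'output': captured_output.getvalue(), 'error': None}
    except Exception as e:
        return {'success': False, 'output': captured_output.getvalue(),
                'error': f'{type(e).__name__}: {e}'}
    finally:
        sys.stdout = old_stdout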

The agent generates code like:

# Agent-generated harness
result = tools['search']('quantum computing')
papers = tools['parse_results'](result)
for paper in papers[:5]:
    summary = tools['summarize'](paper['abstract'])
    print(f"{paper['title']}: {summary}")

The framework executes it, captures output, and returns the result to the agent.

When to Use Code-as-Harness

Use it when:

The task requires complex control flow that is hard to express in declarative tool schemas.
You want the agent to own the orchestration logic and adapt it over time.
You have strong sandboxing and can tolerate the security risk.
The agent needs to coordinate multiple tools in ways you did not anticipate.

Avoid it when:

You need strict audit trails and reproducibility.
The task is simple enough for fixed tool-calling patterns.
You cannot sandbox execution safely.
You need to guarantee performance or resource limits.

Technical Verdict

Code-as-harness is a powerful pattern for agentic systems that need flexible orchestration. It shifts control from the framework to the agent, which unlocks new capabilities but introduces security and reliability risks.

If you go this route, invest in sandboxing, static analysis, and robust feedback loops. Treat generated code as untrusted input. Log everything. Set hard limits on retries and resource usage.

For most production workflows, start with declarative tool-calling and move to code-as-harness only when you hit the limits of fixed orchestration. The flexibility is real, but so is the operational complexity.