Most agent frameworks treat code generation as a final output. The agent writes Python, you run it, done. A new pattern flips this: the agent generates code to control its own execution flow. The code becomes the harness, the orchestration layer, the runtime infrastructure.
arXiv paper 2605.18747v1 documents this shift. Instead of calling predefined tool APIs through JSON schemas, agents write the glue code that wires tools together, manages state, and decides what to call next. The code is not the artifact. It is the scaffolding.
Why This Matters Now
Traditional tool-calling works like this: you define a function schema, the LLM returns structured JSON, your framework parses it and invokes the function. The orchestration logic lives in your framework. The agent just picks from a menu.
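For contrast, the schema-driven loop described above can be sketched in a few lines. The tool `get_weather`, the `TOOLS` registry, and the JSON shape are hypothetical and stand in for whatever a given framework defines.

```python
import json

# Hypothetical tool; a real framework would register many of these.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# The framework owns the registry, so the model can only pick from this menu.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse the model's JSON tool call and invoke the matching function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Stand-in for the structured JSON an LLM would return.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
print(dispatch(model_output))
```

All control flow (parsing, lookup, invocation) sits in `dispatch`, which is framework code. The model contributes only the name and arguments.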
Code-as-harness inverts the relationship. The agent writes a Python function that:
- Imports libraries
- Defines control flow (loops, conditionals, error handling)
- Calls tools in sequence or parallel
- Manages intermediate state
- Decides when to stop
The framework executes the generated code. The agent owns the orchestration logic.
This pattern shows up in:
- Agents that generate Jupyter notebooks to explore data
- Systems that write bash scripts to coordinate CLI tools
- Workflows where the agent emits a Python module that other agents import
Architecture: Three Layers
The paper organizes code-as-harness into three layers.
Harness Interface
Code connects the agent to three surfaces:
- Reasoning: The agent writes code that encodes its plan. A for-loop over search results. A try-except block for retries.
- Action: Tool calls become function invocations in generated code. The agent decides argument order, error handling, and retry logic.
- Environment modeling: The agent writes data structures (classes, dicts, dataframes) to represent the world state.
This is different from declarative tool schemas. The agent is not filling in blanks. It is writing the control flow.
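As an illustration of the three surfaces, here is a sketch of what a generated harness might look like. The `fetch` tool, the `WorldState` class, and the URLs are assumptions made up for this example; `fetch` fails once on purpose so the retry path runs.

```python
from dataclasses import dataclass, field

# Hypothetical tool supplied by the framework; the first call fails on purpose.
_calls = {"count": 0}

def fetch(url: str) -> str:
    _calls["count"] += 1
    if _calls["count"] == 1:
        raise ConnectionError("transient failure")
    return f"<html>{url}</html>"

# Environment modeling: the agent picks its own representation of world state.
@dataclass
class WorldState:
    pages: dict = field(default_factory=dict)
    failed: list = field(default_factory=list)

def run_harness(urls: list[str], max_attempts: int = 2) -> WorldState:
    state = WorldState()
    # Reasoning: the plan is encoded as a loop over targets.
    for url in urls:
        for attempt in range(max_attempts):
            try:
                # Action: a tool call is a plain function invocation.
                state.pages[url] = fetch(url)
                break
            except ConnectionError:
                # Record the URL only after the last attempt has failed.
                if attempt == max_attempts - 1:
                    state.failed.append(url)
    return state

state = run_harness(["https://example.com/a", "https://example.com/b"])
print(sorted(state.pages), state.failed)
```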
Harness Mechanisms
To make this reliable, you need:
- Planning: The agent outlines the code structure before writing it. Often a comment block or docstring that serves as a spec.
- Memory: Generated code can persist state to disk, a database, or a shared context object. The agent decides the serialization format.
- Tool use: The agent imports libraries and calls functions. It writes the adapter code if the API does not match its needs.
- Feedback loops: The agent runs the code, sees errors or output, and generates a new version. The harness becomes iterative.
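The memory mechanism can be as simple as a JSON checkpoint that the generated code reads on startup and rewrites after each step. The sketch below assumes a hypothetical state layout (`done`, `results`) and uses `item.upper()` as a placeholder for a real tool call.

```python
import json
import tempfile
from pathlib import Path

def load_state(path: Path) -> dict:
    """Return saved state, or a fresh one if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"done": [], "results": {}}

def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state))

def process(items: list[str], path: Path) -> dict:
    state = load_state(path)
    for item in items:
        if item in state["done"]:
            continue  # already handled in an earlier run
        state["results"][item] = item.upper()  # placeholder for a real tool call
        state["done"].append(item)
        save_state(path, state)  # checkpoint after every step
    return state

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "state.json"
    process(["a", "b"], checkpoint)
    # A second run resumes from the checkpoint and only handles "c".
    resumed = process(["a", "b", "c"], checkpoint)
    print(resumed["done"])
```

JSON is used here because it is easy to inspect and diff. As the text notes, the agent could just as well choose another serialization format.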
Multi-Agent Scaling
When multiple agents share a codebase:
- One agent writes a module, another imports it.
- Agents review each other’s code through static analysis or test execution.
- Shared code artifacts become the coordination protocol. No message bus, just function calls.
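A minimal sketch of the write-then-import handoff, using the standard-library `importlib`. The module name `shared_utils` and the function `normalize` are hypothetical; in a real system the source string would come from another agent.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source that a hypothetical "writer" agent emitted.
MODULE_SOURCE = '''
def normalize(title: str) -> str:
    """Collapse whitespace and lowercase a title."""
    return " ".join(title.split()).lower()
'''

def load_module(path: Path, name: str):
    """Import a module from an explicit file path."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "shared_utils.py"
    module_path.write_text(MODULE_SOURCE)
    shared = load_module(module_path, "shared_utils")
    # A second agent's harness calls the shared function directly.
    cleaned = shared.normalize("  Quantum   Computing ")
    print(cleaned)
```

Importing a module executes its top-level code, so a module written by another agent deserves the same sandboxing as any other generated harness.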
Security Boundaries
When an agent generates its own execution harness, you lose the safety of a fixed orchestration framework. Consider:
| Risk | Fixed Framework | Code-as-Harness |
|---|---|---|
| Arbitrary code execution | Framework restricts callable functions | Agent writes any valid Python |
| Resource limits | Framework enforces timeouts, memory caps | Agent can spawn subprocesses, open sockets |
| Audit trail | Tool calls logged by framework | Must parse generated code to understand behavior |
| Privilege escalation | Framework mediates tool access | Agent can import any library, call any function |
Mitigation strategies:
- Run generated code in a sandbox (gVisor, Firecracker, Docker with seccomp).
- Use static analysis to block dangerous imports and calls (for example the os and subprocess modules, and the eval and exec builtins).
- Require the agent to emit a manifest of intended actions before execution.
- Log the generated code alongside execution traces for post-hoc review.
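The manifest idea can be checked mechanically. The sketch below assumes Python 3.9+ and that tools are exposed as a dict named `tools`. It collects literal keys such as `tools['search']` from the generated code and reports any that the manifest did not declare. The manifest format (a list of tool names) is an assumption for this example.

```python
import ast

def tools_used(code: str) -> set[str]:
    """Collect string keys used in subscripts of the name `tools`."""
    used = set()
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and node.value.id == "tools"
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
        ):
            used.add(node.slice.value)
    return used

def undeclared_tools(code: str, manifest: list[str]) -> set[str]:
    """Return tools referenced in the code but missing from the manifest."""
    return tools_used(code) - set(manifest)

code = "r = tools['search']('q')\ntools['delete_file']('/tmp/x')"
print(undeclared_tools(code, ["search"]))
```

This only catches literal keys. A dynamic lookup such as `tools[name]` passes unnoticed, so a check like this complements a sandbox and does not replace one.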
Versioning and Audit
When orchestration logic is declarative config, you version it in Git. When it is generated code, you need a different approach.
Option 1: Treat generated code as ephemeral
- Store only the prompt and the execution result.
- Regenerate the code if you need to replay the task.
- Problem: Model output is non-deterministic, so a regenerated harness may differ from the one that actually ran.
Option 2: Snapshot every generated harness
- Save the code to a timestamped file or database row.
- Tag it with the agent version, model checkpoint, and input context.
- Problem: Storage grows fast. Diffing code is harder than diffing config.
Option 3: Hybrid
- Store the generated code for critical workflows.
- For exploratory tasks, log only the high-level plan and tool call sequence.
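One way to snapshot a harness is to store it under a hash of its content, with the metadata beside it. This is a sketch under assumptions: the directory layout, the `meta` keys, and the model name are invented for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def snapshot_harness(code: str, meta: dict, root: Path) -> Path:
    """Save code and metadata in a directory named after the code's SHA-256."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    target = root / digest[:16]
    target.mkdir(parents=True, exist_ok=True)
    (target / "harness.py").write_text(code)
    record = {
        "sha256": digest,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        **meta,
    }
    (target / "meta.json").write_text(json.dumps(record, indent=2))
    return target

with tempfile.TemporaryDirectory() as tmp:
    first = snapshot_harness("print('hi')", {"model": "example-model"}, Path(tmp))
    second = snapshot_harness("print('hi')", {"model": "example-model"}, Path(tmp))
    same_location = first == second
    saved_meta = json.loads((first / "meta.json").read_text())

print(same_location, saved_meta["model"])
```

Identical harnesses land in the same directory, which limits storage growth when the agent regenerates the same code. In this sketch a repeat snapshot overwrites `meta.json`, so a real store would append run records instead.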
Failure Modes
The agent writes buggy harness code
The agent generates a script with a syntax error or logic bug. The framework catches the exception and feeds it back to the agent. The agent tries again.
This works until:
- The agent enters a loop, repeatedly generating the same broken code.
- The agent fixes the syntax but introduces a semantic bug (wrong API call, off-by-one error).
- The agent gives up and returns a half-working harness.
Mitigation: Set a retry limit. Use a separate verification agent to review the code before execution. Require the agent to write unit tests alongside the harness.
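A retry limit combined with duplicate detection might look like the sketch below. The `generate` callback stands in for a model call, and both example agents are hypothetical stubs. Only syntax is checked here, via `compile`, so semantic bugs would still need tests or a reviewer.

```python
import hashlib
from typing import Callable, Optional

def run_with_retries(generate: Callable[[Optional[str]], str], max_attempts: int = 3) -> dict:
    seen = set()
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if digest in seen:
            # The agent resubmitted code that already failed: stop early.
            return {"success": False, "attempts": attempt, "error": "repeated identical code"}
        seen.add(digest)
        try:
            compile(code, "<harness>", "exec")  # syntax check only; nothing is executed
            return {"success": True, "attempts": attempt, "code": code}
        except SyntaxError as exc:
            feedback = f"SyntaxError: {exc.msg} (line {exc.lineno})"
    return {"success": False, "attempts": max_attempts, "error": "retry limit reached"}

# Hypothetical stand-ins for a model: one recovers after feedback, one never changes.
def fixing_agent(feedback):
    return "print('ok')" if feedback else "print('ok'"

def stuck_agent(feedback):
    return "print('ok'"

fixed = run_with_retries(fixing_agent)
stuck = run_with_retries(stuck_agent)
print(fixed["attempts"], stuck["error"])
```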
The harness code is correct but inefficient
The agent writes a working script that makes 1,000 sequential API calls instead of batching them. Or it loads a 10 GB file into memory instead of streaming it.
Mitigation: Provide the agent with performance guidelines in the system prompt. Use a cost model that penalizes slow or expensive operations. Run the code in a resource-limited sandbox and fail fast if it exceeds limits.
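The fail-fast part can be approximated with a wall-clock timeout on a child process. This is a sketch, not a sandbox: a separate process caps run time but does not isolate the file system or network. The function name and result keys are assumptions chosen for this example.

```python
import subprocess
import sys

def run_with_timeout(code: str, timeout_s: float = 2.0) -> dict:
    """Run code in a separate Python process and stop it at the time limit."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"success": False, "output": "", "error": f"timed out after {timeout_s}s"}
    return {"success": proc.returncode == 0, "output": proc.stdout, "error": proc.stderr}

quick = run_with_timeout("print(sum(range(10)))")
slow = run_with_timeout("while True: pass", timeout_s=0.5)
print(quick["output"].strip(), "|", slow["error"])
```

On Unix, the standard-library `resource` module can add CPU and memory caps to the child process. That step is omitted here to keep the sketch portable.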
The agent debugs its own scaffolding
The agent generates harness code, runs it, sees an error, and generates new harness code to debug the first harness. This can recurse.
Mitigation: Consider separating the harness generation phase from the execution phase. Avoid allowing the agent to modify the harness after execution starts. If the harness fails, restart from the planning phase instead of patching.
Implementation Example
Here is a minimal code-as-harness setup using Python’s exec:
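    import ast
    import contextlib
    import io

    BLOCKED_MODULES = {'os', 'subprocess', 'sys'}
    # Small allowlist so generated code can still print and iterate.
    SAFE_BUILTINS = {'print': print, 'len': len, 'range': range, 'enumerate': enumerate,
                     'str': str, 'int': int, 'float': float, 'list': list, 'dict': dict}

    def execute_agent_harness(generated_code: str, tools: dict) -> dict:
        """
        Run agent-generated code in a restricted namespace.

        Args:
            generated_code: Python code string from the agent
            tools: Dict of allowed functions the agent can call

        Returns:
            Dict with 'success', 'output', and 'error' keys
        """
        # Parse to check for dangerous imports, including `from x import y`
        try:
            tree = ast.parse(generated_code)
        except SyntaxError as e:
            return {'success': False, 'output': '', 'error': f'Syntax error: {e}'}
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                modules = [node.module or '']
            else:
                continue
            # Compare top-level package names so `os.path` is caught too
            if any(m.split('.')[0] in BLOCKED_MODULES for m in modules):
                return {'success': False, 'output': '', 'error': 'Blocked import'}

        # Execute in restricted namespace
        # Note: Production code should use RestrictedPython or similar.
        # The builtins allowlist omits __import__, open, eval, and exec, but it
        # is not a security boundary on its own.
        namespace = {'tools': tools, '__builtins__': dict(SAFE_BUILTINS)}
        captured_output = io.StringIO()
        try:
            # Capture stdout only while the generated code runs
            with contextlib.redirect_stdout(captured_output):
                exec(generated_code, namespace)
        except Exception as e:
            return {'success': False, 'output': captured_output.getvalue(), 'error': str(e)}
        return {'success': True, 'output': captured_output.getvalue(), 'error': None}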
The agent generates code like:
    # Agent-generated harness
    result = tools['search']('quantum computing')
    papers = tools['parse_results'](result)
    for paper in papers[:5]:
        summary = tools['summarize'](paper['abstract'])
        print(f"{paper['title']}: {summary}")
The framework executes it, captures output, and returns the result to the agent.
When to Use Code-as-Harness
Use it when:
- The task requires complex control flow that is hard to express in declarative tool schemas.
- You want the agent to own the orchestration logic and adapt it over time.
- You have strong sandboxing and can tolerate the security risk.
- The agent needs to coordinate multiple tools in ways you did not anticipate.
Avoid it when:
- You need strict audit trails and reproducibility.
- The task is simple enough for fixed tool-calling patterns.
- You cannot sandbox execution safely.
- You need to guarantee performance or resource limits.
Technical Verdict
Code-as-harness is a powerful pattern for agentic systems that need flexible orchestration. It shifts control from the framework to the agent, which unlocks new capabilities but introduces security and reliability risks.
If you go this route, invest in sandboxing, static analysis, and robust feedback loops. Treat generated code as untrusted input. Log everything. Set hard limits on retries and resource usage.
For most production workflows, start with declarative tool-calling and move to code-as-harness only when you hit the limits of fixed orchestration. The flexibility is real, but so is the operational complexity.