Diagnosing and Repairing Agent Harness Flaws

Most agent debugging focuses on the model: prompt engineering, workflow search, runtime supervision. But when your agent fails, the problem is often in the plumbing. The harness that wraps your LLM (tool interfaces, context management, lifecycle orchestration, observability hooks) can fail silently in ways that no amount of prompt tuning will fix.

A new paper from Chen et al. introduces HarnessFix, a trace-guided framework that diagnoses where failures occur in agent execution harnesses and generates scoped repairs. Instead of treating failed trajectories as opaque blobs, it compiles execution traces and harness code into an intermediate representation that exposes step-level provenance and control-flow relations.

The Harness Failure Problem

Agent harnesses provide seven critical layers:

Execution environment: sandboxing, resource limits, process isolation
Tool interfaces: function signatures, parameter validation, serialization
Context management: state persistence, memory boundaries, session handling
Lifecycle orchestration: retry logic, timeout handling, cleanup hooks
Observability: logging, tracing, metric collection
Verification: output validation, constraint checking, safety guards
Governance: access control, rate limiting, audit trails

When any of these layers breaks, the agent fails in ways that look like reasoning errors. A malformed tool schema causes the agent to hallucinate parameters. A missing cleanup hook leaves stale state that corrupts future runs. A broken trace collector hides the evidence you need to debug.
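
To make the first of those examples concrete, here is a minimal sketch of a harness-side check that catches a tool call whose arguments do not match the declared schema. The schema layout and the `validate_tool_call` helper are assumptions made for illustration; they are not part of HarnessFix.

```python
from typing import Any

# Hypothetical tool schema: parameter name -> (expected type, required?)
SEARCH_TOOL_SCHEMA: dict[str, tuple[type, bool]] = {
    "query": (str, True),
    "max_results": (int, False),
}


def validate_tool_call(schema: dict[str, tuple[type, bool]],
                       args: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means the call is valid."""
    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required parameter: {name}")
            continue
        if not isinstance(args[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    # Parameters the schema never declared often mean the model guessed.
    for name in args:
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
    return problems
```

Logging the returned problems next to the step that produced them is what lets a later diagnosis blame the tool interface rather than the model's reasoning.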

Existing self-improvement methods (runtime supervision, prompt optimization, workflow search) operate at the agent level. They assume the harness works. When it doesn’t, they make indirect changes that paper over symptoms without fixing root causes.

Harness-Aware Trace Intermediate Representation

HarnessFix compiles raw execution traces and harness code into HTIR (Harness-aware Trace Intermediate Representation). This normalization step solves three problems:

Fragmented evidence: Execution traces from different harnesses use different formats. HTIR provides a unified schema that captures step-level events, tool calls, state transitions, and error conditions.
Missing provenance: Raw logs don’t link failures to the harness code that caused them. HTIR tracks which harness layer (tool interface, context manager, lifecycle hook) was active at each step.
Lost control flow: Traces show what happened but not why. HTIR preserves branching logic, retry attempts, and exception handling paths.

The representation includes:

Step nodes with timestamps, inputs, outputs, and error states
Harness layer annotations (which component was responsible)
Control-flow edges (sequential, conditional, retry, exception)
State snapshots at decision points
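
One way to picture these four ingredients is as a small graph data structure. The sketch below is an assumption about what such a representation could look like; the class and field names (`StepNode`, `Edge`, `TraceIR`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class EdgeKind(Enum):
    SEQUENTIAL = "sequential"
    CONDITIONAL = "conditional"
    RETRY = "retry"
    EXCEPTION = "exception"


@dataclass
class StepNode:
    step_id: int
    timestamp: float
    layer: str                       # harness layer annotation, e.g. "tool_interface"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    error: Optional[str] = None      # None when the step succeeded
    state_snapshot: Optional[dict[str, Any]] = None  # set only at decision points


@dataclass
class Edge:
    src: int
    dst: int
    kind: EdgeKind


@dataclass
class TraceIR:
    steps: dict[int, StepNode] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add_step(self, step: StepNode) -> None:
        self.steps[step.step_id] = step

    def connect(self, src: int, dst: int,
                kind: EdgeKind = EdgeKind.SEQUENTIAL) -> None:
        if src not in self.steps or dst not in self.steps:
            raise KeyError("both endpoints must be added before connecting them")
        self.edges.append(Edge(src, dst, kind))

    def predecessors(self, step_id: int) -> list[int]:
        """Steps with an edge into step_id, used when tracing backward."""
        return [e.src for e in self.edges if e.dst == step_id]
```

Keeping the edge kind explicit is the point of the design: a retry edge and an exception edge between the same two steps tell very different stories about why the second step ran.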

Failure Attribution and Diagnosis

Once traces are in HTIR, HarnessFix attributes failures to specific trajectory steps and harness layers. The attribution process:

Identify failure symptoms: task incompletion, constraint violations, exceptions, timeouts
Trace backward: follow control-flow edges to find the earliest step where correct behavior diverged
Map to harness layer: match the failure step to the harness component that was active (tool interface, context manager, lifecycle hook)
Consolidate recurring patterns: group similar failures into flaw records

This produces actionable diagnoses such as “tool schema missing required parameter” or “context manager failed to persist state between steps” instead of a vague “agent failed to complete task.”
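
The backward trace and the consolidation step can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: it treats "earliest step where behavior diverged" as "earliest reachable step that recorded an error", assumes step ids increase with execution order, and groups diagnoses by exact match on layer and error text. The trace data is made up.

```python
from collections import Counter, deque

# Hypothetical trace: step id -> (harness layer, error message or None)
steps = {
    1: ("context_manager", None),
    2: ("tool_interface", "missing required parameter: query"),
    3: ("lifecycle", "retry exhausted"),
    4: ("lifecycle", "task_incomplete"),
}
# Control-flow edges as (source, destination) pairs.
edges = [(1, 2), (2, 3), (3, 4)]


def earliest_faulty_step(steps, edges, symptom_step):
    """Walk control-flow edges backward from the symptom and return the
    earliest reachable step that recorded an error, or None."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)
    seen, queue, faulty = {symptom_step}, deque([symptom_step]), []
    while queue:
        current = queue.popleft()
        if steps[current][1] is not None:
            faulty.append(current)
        for p in preds.get(current, []):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return min(faulty) if faulty else None


def attribute(steps, edges, symptom_step):
    """Map the earliest faulty step to the harness layer that was active."""
    step = earliest_faulty_step(steps, edges, symptom_step)
    if step is None:
        return None
    layer, error = steps[step]
    return {"step": step, "layer": layer, "error": error}


def consolidate(diagnoses):
    """Group diagnoses that share a layer and error text into flaw records."""
    counts = Counter((d["layer"], d["error"]) for d in diagnoses if d)
    return [
        {"layer": layer, "error": error, "occurrences": n}
        for (layer, error), n in counts.most_common()
    ]
```

Here the symptom surfaces at step 4 in the lifecycle layer, but the walk attributes it to the tool interface at step 2, which is the distinction the diagnosis is meant to draw.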

Repair Operators and Patch Generation

HarnessFix maps diagnosed flaws to scoped repair operators. Each operator targets a specific harness layer:

Harness Layer	Common Flaws	Repair Operators
Tool interface	Missing parameters, type mismatches, invalid schemas	Add parameter, fix type annotation, update schema
Context management	State corruption, memory leaks, session conflicts	Add cleanup hook, fix serialization, isolate sessions
Lifecycle orchestration	Missing retries, bad timeouts, broken error handling	Add retry logic, adjust timeout, fix exception handler
Observability	Missing logs, broken traces, incomplete metrics	Add trace point, fix log format, emit metric
Verification	Weak validation, missing constraints, unsafe outputs	Add validator, strengthen constraint, add safety check

The repair process:

Generate candidate patches: apply repair operators to harness code
Validate under flaw-specific specs: run tests that reproduce the original failure
Check for regressions: ensure the patch doesn’t break unrelated functionality
Accept or reject: only apply patches that fix the target flaw without introducing new failures
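
The accept-or-reject logic above amounts to a filter over candidate patches. The sketch below models a harness as a plain function and a patch as a function that wraps it; both are simplifications chosen for illustration, and every name in it is hypothetical.

```python
from typing import Callable, Optional, Sequence

Harness = Callable[[dict], dict]
Patch = Callable[[Harness], Harness]
Check = Callable[[Harness], bool]


def select_patch(harness: Harness,
                 candidates: Sequence[Patch],
                 flaw_check: Check,
                 regression_checks: Sequence[Check]) -> Optional[Patch]:
    """Return the first candidate that fixes the flaw and passes every
    regression check, or None if no candidate qualifies."""
    for patch in candidates:
        patched = patch(harness)
        if not flaw_check(patched):
            continue  # does not fix the target flaw
        if all(check(patched) for check in regression_checks):
            return patch
    return None


def base_harness(call: dict) -> dict:
    # Flaw: forwards the call without filling in a default "limit".
    return {"tool": call["tool"], "args": dict(call.get("args", {}))}


def add_default_limit(h: Harness) -> Harness:
    def patched(call: dict) -> dict:
        out = h(call)
        out["args"].setdefault("limit", 10)
        return out
    return patched


def rename_tool(h: Harness) -> Harness:
    # Fixes the flaw but breaks tool routing, so it should be rejected.
    def patched(call: dict) -> dict:
        out = h(call)
        out["tool"] = "search_v2"
        out["args"].setdefault("limit", 10)
        return out
    return patched


def fixes_missing_limit(h: Harness) -> bool:
    return "limit" in h({"tool": "search", "args": {"query": "x"}})["args"]


def keeps_tool_name(h: Harness) -> bool:
    return h({"tool": "search", "args": {}})["tool"] == "search"


chosen = select_patch(base_harness, [rename_tool, add_default_limit],
                      fixes_missing_limit, [keeps_tool_name])
```

In this toy run `rename_tool` passes the flaw-specific check but fails the regression check, so `chosen` ends up being `add_default_limit`.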

Implementation Shape

A HarnessFix pipeline might be wired together like this (the function names are illustrative):

# Trace compilation
htir = compile_trace(
    raw_trace=execution_log,
    harness_code=agent_harness_source,
    schema=harness_layer_schema
)

# Failure attribution
diagnosis = attribute_failure(
    htir=htir,
    failure_symptoms=["task_incomplete", "tool_error"],
    control_flow_graph=htir.cfg
)

# Consolidate recurring flaws
flaw_record = consolidate_diagnoses(
    diagnoses=[diagnosis],
    similarity_threshold=0.8
)

# Generate and validate patch
patch = generate_repair(
    flaw=flaw_record,
    harness_code=agent_harness_source,
    repair_operators=TOOL_INTERFACE_OPERATORS
)

validated_patch = validate_patch(
    patch=patch,
    flaw_spec=flaw_record.test_spec,
    regression_suite=harness_test_suite
)

The key architectural decision is where to run validation. Running it in the same environment as the agent risks contamination (the patch might only work in one specific context). Running it in a separate test harness adds latency but improves reliability.
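
A minimal sketch of the second option, assuming the patched harness and its tests can each be expressed as a single Python source file: write both to a fresh temporary directory and run the tests in a separate interpreter process. The `validate_in_isolation` helper and the file names are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def validate_in_isolation(patched_source: str,
                          test_source: str,
                          timeout_s: float = 30.0) -> bool:
    """Run test_source against patched_source in a separate process and a
    throwaway directory. Returns True only if the test script exits with 0."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "harness.py").write_text(patched_source)
        Path(workdir, "test_harness.py").write_text(test_source)
        try:
            result = subprocess.run(
                [sys.executable, "test_harness.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hung validation run as a rejection
        return result.returncode == 0
```

A separate process only isolates interpreter state and the working directory. A production setup would likely also want a clean environment and a sandbox, and the process startup cost is where the extra latency comes from.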

Evaluation Results

The paper evaluates HarnessFix on four benchmarks:

SWE-Bench Verified: software engineering tasks requiring code changes
Terminal-Bench 2.0 Verified: command-line tool usage
GAIA: general AI assistant tasks
AppWorld: application-level workflows

Across these benchmarks, HarnessFix improved held-out test performance over baseline agents. The gains came from fixing harness-level bugs that runtime supervision couldn’t address: malformed tool schemas, missing cleanup hooks, broken state persistence.

The most common diagnosed flaws:

Tool interface bugs (38%): missing parameters, type mismatches, invalid schemas
Context corruption (27%): state leaks, serialization errors, session conflicts
Lifecycle errors (19%): missing retries, bad timeouts, broken cleanup
Observability gaps (16%): missing traces, incomplete logs, lost metrics

Failure Modes and Observability

HarnessFix itself can fail in predictable ways:

Trace compilation errors: If the raw trace format is too fragmented or the harness code is obfuscated, HTIR compilation fails. The mitigation is to require structured logging from the start.

Attribution ambiguity: When multiple harness layers are active at the same step, attribution becomes uncertain. The framework falls back to conservative diagnosis (flag all candidate layers) rather than guessing.

Patch validation false positives: Validation can accept a patch that fixes the target flaw but introduces a subtle regression the test suite doesn’t catch. The mitigation is to run extended regression tests on a held-out set of tasks.

Repair operator coverage: If the diagnosed flaw doesn’t map to any known repair operator, the framework can’t generate a patch. This requires expanding the operator library over time.
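
Two of these fallbacks, conservative attribution and missing operator coverage, can be made explicit in the data the pipeline passes around. The sketch below is an illustration under assumed names: the `OPERATORS` table and the flaw-kind strings are hypothetical, loosely modeled on the repair-operator table earlier in the article.

```python
# Hypothetical operator library keyed by (harness layer, flaw kind).
OPERATORS = {
    ("tool_interface", "missing_parameter"): "add_parameter",
    ("tool_interface", "type_mismatch"): "fix_type_annotation",
    ("lifecycle", "missing_retry"): "add_retry_logic",
}


def plan_repair(active_layers: list[str], flaw_kind: str) -> dict:
    """Flag every layer active at the failing step instead of guessing one,
    and report explicitly when no known operator applies."""
    layers = sorted(set(active_layers))
    operators = {
        layer: OPERATORS[(layer, flaw_kind)]
        for layer in layers
        if (layer, flaw_kind) in OPERATORS
    }
    return {
        "candidate_layers": layers,
        "ambiguous": len(layers) > 1,
        "operators": operators,
        "uncovered": not operators,  # True means the operator library has a gap
    }
```

Counting how often `uncovered` comes back true is one simple way to decide which operators to add to the library next.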

Observability requirements:

Structured execution traces with step-level granularity
Harness code instrumented with layer annotations
Test suites that cover both happy paths and failure modes
Metrics on patch acceptance rate and regression frequency
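
The last item can be computed from a simple log of patch outcomes. The sketch below is one possible definition, with assumed field names: acceptance rate is accepted patches over all candidates, and regression frequency is the share of accepted patches later found to cause a regression.

```python
from dataclasses import dataclass


@dataclass
class PatchOutcome:
    patch_id: str
    accepted: bool            # passed flaw-specific and regression validation
    caused_regression: bool   # a regression was observed after acceptance


def patch_metrics(outcomes: list[PatchOutcome]) -> dict[str, float]:
    """Compute acceptance rate and regression frequency from patch outcomes."""
    total = len(outcomes)
    accepted = [o for o in outcomes if o.accepted]
    regressed = sum(1 for o in accepted if o.caused_regression)
    return {
        "acceptance_rate": len(accepted) / total if total else 0.0,
        "regression_frequency": regressed / len(accepted) if accepted else 0.0,
    }
```

A rising regression frequency alongside a steady acceptance rate suggests the regression suite is too weak rather than the patches getting worse.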

Technical Verdict

Use HarnessFix when:

You have structured execution traces from agent runs
Failures recur across multiple tasks in similar patterns
You suspect harness bugs (tool interfaces, context, lifecycle) rather than reasoning failures
You can afford the latency of trace compilation and patch validation
You have a test suite that covers harness functionality

Avoid it when:

Your traces are unstructured or missing step-level detail
Failures are one-off anomalies with no recurring pattern
The agent’s reasoning is clearly the bottleneck (prompt engineering will help more)
You need real-time fixes (the diagnosis and repair pipeline adds seconds to minutes)
Your harness code is too dynamic or obfuscated to analyze statically

The framework shines when you’re operating agents at scale and harness bugs create systematic failures. It’s less useful for one-off debugging sessions where manual inspection is faster.