Diagnosing and Repairing Agent Harness Flaws

How to systematically debug tool interfaces, context corruption, and lifecycle errors in LLM agent execution harnesses using trace-guided diagnosis.

Source: arxiv.org
Most agent debugging focuses on the model: prompt engineering, workflow search, runtime supervision. But when your agent fails, the problem is often in the plumbing. The harness that wraps your LLM (tool interfaces, context management, lifecycle orchestration, observability hooks) can fail silently in ways that no amount of prompt tuning will fix.

A new paper from Chen et al. introduces HarnessFix, a trace-guided framework that diagnoses where failures occur in agent execution harnesses and generates scoped repairs. Instead of treating failed trajectories as opaque blobs, it compiles execution traces and harness code into an intermediate representation that exposes step-level provenance and control-flow relations.

The Harness Failure Problem

Agent harnesses provide seven critical layers:

  • Execution environment: sandboxing, resource limits, process isolation
  • Tool interfaces: function signatures, parameter validation, serialization
  • Context management: state persistence, memory boundaries, session handling
  • Lifecycle orchestration: retry logic, timeout handling, cleanup hooks
  • Observability: logging, tracing, metric collection
  • Verification: output validation, constraint checking, safety guards
  • Governance: access control, rate limiting, audit trails

When any of these layers breaks, the agent fails in ways that look like reasoning errors. A malformed tool schema causes the agent to hallucinate parameters. A missing cleanup hook leaves stale state that corrupts future runs. A broken trace collector hides the evidence you need to debug.
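To make the first of these failures concrete, here is a minimal sketch of a harness-side check that compares a tool call against its declared schema. The schema layout and the names SEARCH_TOOL_SCHEMA and check_tool_call are assumptions made for illustration; they are not taken from HarnessFix.

```python
from typing import Any

# Hypothetical tool schema: parameter name -> (expected type, required?)
SEARCH_TOOL_SCHEMA: dict[str, tuple[type, bool]] = {
    "query": (str, True),
    "max_results": (int, False),
}


def check_tool_call(
    schema: dict[str, tuple[type, bool]], args: dict[str, Any]
) -> list[str]:
    """Return a list of schema violations for one tool call (empty means valid)."""
    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required parameter: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    # Parameters the schema never declared are reported separately.
    for name in args:
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
    return problems


# The agent guessed "q" because the schema it saw was unclear or incomplete.
violations = check_tool_call(SEARCH_TOOL_SCHEMA, {"q": "agent harness"})
```

A call that reports both a missing and an unknown parameter is a hint that the agent guessed at a name, which points at the schema the harness advertised rather than at the model's reasoning.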

Existing self-improvement methods (runtime supervision, prompt optimization, workflow search) operate at the agent level. They assume the harness works. When it doesn’t, they make indirect changes that paper over symptoms without fixing root causes.

Harness-Aware Trace Intermediate Representation

HarnessFix compiles raw execution traces and harness code into HTIR (Harness-aware Trace Intermediate Representation). This normalization step solves three problems:

  1. Fragmented evidence: Execution traces from different harnesses use different formats. HTIR provides a unified schema that captures step-level events, tool calls, state transitions, and error conditions.

  2. Missing provenance: Raw logs don’t link failures to the harness code that caused them. HTIR tracks which harness layer (tool interface, context manager, lifecycle hook) was active at each step.

  3. Lost control flow: Traces show what happened but not why. HTIR preserves branching logic, retry attempts, and exception handling paths.
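The first problem, fragmented evidence, can be sketched as a small normalization function. Both raw formats and every field name below are hypothetical; the point is only that differently shaped records end up in one shared event shape.

```python
from typing import Any


def normalize_event(raw: dict[str, Any]) -> dict[str, Any]:
    """Map one raw record from either hypothetical format onto a unified shape."""
    if "tool_name" in raw:  # hypothetical format A: flat tool-call records
        return {
            "step": raw["idx"],
            "kind": "tool_call",
            "name": raw["tool_name"],
            "error": raw.get("err"),
        }
    if "event" in raw:  # hypothetical format B: nested event records
        body = raw["event"]
        return {
            "step": raw["seq"],
            "kind": body["type"],
            "name": body.get("target"),
            "error": body.get("exception"),
        }
    raise ValueError(f"unrecognized trace record: {raw!r}")


raw_a = {"idx": 0, "tool_name": "search", "err": None}
raw_b = {
    "seq": 1,
    "event": {"type": "state_transition", "target": "session", "exception": "KeyError"},
}
unified = [normalize_event(record) for record in (raw_a, raw_b)]
```

Raising on unrecognized records, rather than silently dropping them, keeps gaps in the evidence visible.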

The representation includes:

  • Step nodes with timestamps, inputs, outputs, and error states
  • Harness layer annotations (which component was responsible)
  • Control-flow edges (sequential, conditional, retry, exception)
  • State snapshots at decision points
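One way to picture these four ingredients is as a small graph data structure. The sketch below is an assumption about how such a representation could be laid out; the class and field names (TraceIR, StepNode, Edge, EdgeKind) are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class EdgeKind(Enum):
    SEQUENTIAL = "sequential"
    CONDITIONAL = "conditional"
    RETRY = "retry"
    EXCEPTION = "exception"


@dataclass
class StepNode:
    step_id: int
    timestamp: float
    layer: str  # harness layer annotation, e.g. "tool_interface"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    error: Optional[str] = None
    state_snapshot: Optional[dict[str, Any]] = None  # set only at decision points


@dataclass
class Edge:
    src: int
    dst: int
    kind: EdgeKind


@dataclass
class TraceIR:
    nodes: dict[int, StepNode] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add_step(self, node: StepNode) -> None:
        self.nodes[node.step_id] = node

    def link(self, src: int, dst: int, kind: EdgeKind = EdgeKind.SEQUENTIAL) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.edges.append(Edge(src, dst, kind))

    def predecessors(self, step_id: int) -> list[int]:
        """Steps with a control-flow edge into step_id."""
        return [e.src for e in self.edges if e.dst == step_id]


ir = TraceIR()
ir.add_step(StepNode(0, 0.0, "tool_interface", {"query": "x"}, {}, error="missing parameter"))
ir.add_step(StepNode(1, 0.5, "lifecycle_orchestration", {}, {}))
ir.link(0, 1, EdgeKind.RETRY)  # step 1 is a retry triggered by step 0
```

Typed edges are what make later backward traversal possible: a retry or exception edge says why one step followed another, not only that it did.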

Failure Attribution and Diagnosis

Once traces are in HTIR, HarnessFix attributes failures to specific trajectory steps and harness layers. The attribution process:

  1. Identify failure symptoms: task incompletion, constraint violations, exceptions, timeouts
  2. Trace backward: follow control-flow edges to find the earliest step where correct behavior diverged
  3. Map to harness layer: match the failure step to the harness component that was active (tool interface, context manager, lifecycle hook)
  4. Consolidate recurring patterns: group similar failures into flaw records
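Steps 2 and 3 can be sketched as a backward walk over control-flow edges. This is a simplified illustration, not the paper's algorithm: it assumes each step already carries an "ok" flag from some per-step check, and that step ids increase with time. The function name find_earliest_divergence is hypothetical.

```python
from collections import deque
from typing import Optional


def find_earliest_divergence(
    steps: dict[int, dict], preds: dict[int, list[int]], failed_step: int
) -> Optional[dict]:
    """Walk backward from the failing step and return the earliest anomalous ancestor.

    steps: step_id -> {"layer": str, "ok": bool}
    preds: step_id -> predecessor step_ids (control-flow edges, reversed)
    """
    seen = {failed_step}
    queue = deque([failed_step])
    anomalous = []
    while queue:
        current = queue.popleft()
        if not steps[current]["ok"]:
            anomalous.append(current)
        for p in preds.get(current, []):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    if not anomalous:
        return None
    earliest = min(anomalous)  # assumes step ids increase with time
    return {"step": earliest, "layer": steps[earliest]["layer"]}


steps = {
    0: {"layer": "tool_interface", "ok": True},
    1: {"layer": "context_management", "ok": False},      # state was not persisted
    2: {"layer": "tool_interface", "ok": False},          # downstream tool error
    3: {"layer": "lifecycle_orchestration", "ok": False}, # final timeout (the symptom)
}
preds = {3: [2], 2: [1], 1: [0]}
diagnosis = find_earliest_divergence(steps, preds, failed_step=3)
```

Here the visible symptom is a timeout at step 3, but the walk attributes the failure to the context manager at step 1, the earliest ancestor that misbehaved.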

This produces actionable diagnoses such as “tool schema missing required parameter” or “context manager failed to persist state between steps” instead of a vague “agent failed to complete task.”
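Consolidating diagnoses into flaw records can be sketched as grouping by a normalized signature. The article's own pseudocode later uses a similarity threshold; this sketch swaps in a simpler exact match on a normalized message, and all names and message formats here are hypothetical.

```python
import re
from collections import defaultdict


def signature(message: str) -> str:
    """Normalize a diagnosis message so near-identical failures group together."""
    msg = message.lower()
    msg = re.sub(r"'[^']*'", "<name>", msg)  # quoted identifiers
    msg = re.sub(r"\d+", "<n>", msg)         # step numbers, counts
    return msg


def consolidate(diagnoses: list[dict]) -> list[dict]:
    """Group diagnoses by (layer, normalized message) into flaw records."""
    groups = defaultdict(list)
    for d in diagnoses:
        groups[(d["layer"], signature(d["message"]))].append(d["task"])
    return [
        {"layer": layer, "pattern": pattern, "tasks": tasks, "count": len(tasks)}
        for (layer, pattern), tasks in groups.items()
    ]


diagnoses = [
    {"task": "t1", "layer": "tool_interface",
     "message": "Missing required parameter 'path' at step 4"},
    {"task": "t2", "layer": "tool_interface",
     "message": "Missing required parameter 'query' at step 11"},
    {"task": "t3", "layer": "context_management",
     "message": "State not persisted between steps 2 and 3"},
]
records = consolidate(diagnoses)
```

The two tool-interface diagnoses differ only in parameter name and step number, so they collapse into one record with a count of two, which is the signal that the flaw is systematic.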

Repair Operators and Patch Generation

HarnessFix maps diagnosed flaws to scoped repair operators. Each operator targets a specific harness layer:

Harness Layer | Common Flaws | Repair Operators
Tool interface | Missing parameters, type mismatches, invalid schemas | Add parameter, fix type annotation, update schema
Context management | State corruption, memory leaks, session conflicts | Add cleanup hook, fix serialization, isolate sessions
Lifecycle orchestration | Missing retries, bad timeouts, broken error handling | Add retry logic, adjust timeout, fix exception handler
Observability | Missing logs, broken traces, incomplete metrics | Add trace point, fix log format, emit metric
Verification | Weak validation, missing constraints, unsafe outputs | Add validator, strengthen constraint, add safety check
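The layer-to-operator mapping in the table can be expressed as a simple lookup. The identifier spellings below are assumptions derived from the table's wording, not names from the paper.

```python
# Hypothetical registry mirroring the table: harness layer -> repair operators.
REPAIR_OPERATORS: dict[str, list[str]] = {
    "tool_interface": ["add_parameter", "fix_type_annotation", "update_schema"],
    "context_management": ["add_cleanup_hook", "fix_serialization", "isolate_sessions"],
    "lifecycle_orchestration": ["add_retry_logic", "adjust_timeout", "fix_exception_handler"],
    "observability": ["add_trace_point", "fix_log_format", "emit_metric"],
    "verification": ["add_validator", "strengthen_constraint", "add_safety_check"],
}


def operators_for(layer: str) -> list[str]:
    """Return candidate operators for a layer, or an empty list if none are known."""
    return REPAIR_OPERATORS.get(layer, [])


candidates = operators_for("tool_interface")
uncovered = operators_for("governance")  # no operators listed for this layer
```

Returning an empty list instead of raising makes the "no known operator" case explicit, so a caller can report the flaw as unrepairable rather than crash.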

The repair process:

  1. Generate candidate patches: apply repair operators to harness code
  2. Validate under flaw-specific specs: run tests that reproduce the original failure
  3. Check for regressions: ensure the patch doesn’t break unrelated functionality
  4. Accept or reject: only apply patches that fix the target flaw without introducing new failures
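The accept-or-reject gate can be sketched in a few lines. This is an illustration under simplifying assumptions: the harness is modeled as a flat config dict, patches are plain functions, and the names select_patch, flaw_check, and regression_checks are hypothetical.

```python
from typing import Callable, Optional

Harness = dict  # hypothetical stand-in: a flat harness config the checks inspect
Patch = Callable[[Harness], Harness]
Check = Callable[[Harness], bool]


def select_patch(
    harness: Harness,
    candidates: list[Patch],
    flaw_check: Check,
    regression_checks: list[Check],
) -> Optional[Patch]:
    """Return the first candidate that fixes the flaw and passes every regression check."""
    for patch in candidates:
        patched = patch(dict(harness))  # apply to a copy, never in place
        if not flaw_check(patched):
            continue  # does not fix the target flaw
        if all(check(patched) for check in regression_checks):
            return patch
    return None


def raise_retries(h: Harness) -> Harness:
    h["retries"] = 3
    return h


def drop_timeout(h: Harness) -> Harness:
    h["retries"] = 3
    h.pop("timeout_s")  # fixes the flaw but removes an unrelated setting
    return h


harness = {"timeout_s": 5, "retries": 0}
chosen = select_patch(
    harness,
    candidates=[drop_timeout, raise_retries],
    flaw_check=lambda h: h.get("retries", 0) > 0,
    regression_checks=[lambda h: "timeout_s" in h],
)
```

The first candidate fixes the missing retries but fails the regression check, so it is rejected; the second is accepted. The original harness is never mutated during validation.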

Implementation Shape

A HarnessFix pipeline has roughly the following shape, shown here as Python-style pseudocode:

# Trace compilation
htir = compile_trace(
    raw_trace=execution_log,
    harness_code=agent_harness_source,
    schema=harness_layer_schema
)

# Failure attribution
diagnosis = attribute_failure(
    htir=htir,
    failure_symptoms=["task_incomplete", "tool_error"],
    control_flow_graph=htir.cfg
)

# Consolidate recurring flaws
flaw_record = consolidate_diagnoses(
    diagnoses=[diagnosis],
    similarity_threshold=0.8
)

# Generate and validate patch
patch = generate_repair(
    flaw=flaw_record,
    harness_code=agent_harness_source,
    repair_operators=TOOL_INTERFACE_OPERATORS
)

validated_patch = validate_patch(
    patch=patch,
    flaw_spec=flaw_record.test_spec,
    regression_suite=harness_test_suite
)

The key architectural decision is where to run validation. Running it in the same environment as the agent risks contamination (the patch might only work in one specific context). Running it in a separate test harness adds latency but improves reliability.
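As a rough sketch of the second option, the patched harness and its flaw-specific test can be written to a fresh temporary directory and run in a new interpreter process. A separate process is much weaker isolation than a container or VM, and the file names and the run_isolated helper are assumptions for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated(patched_source: str, test_source: str, timeout_s: float = 30.0) -> bool:
    """Run a test script against patched harness code in a fresh directory and process."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "harness.py").write_text(patched_source)
        Path(tmp, "check.py").write_text(test_source)
        try:
            result = subprocess.run(
                [sys.executable, "check.py"],
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hanging patch counts as a failed validation
    return result.returncode == 0


# Hypothetical patched harness function and the test that reproduces the flaw.
patched = "def parse(args):\n    return args.get('path', '.')\n"
unpatched = "def parse(args):\n    return args['path']\n"
test_code = "import harness\nassert harness.parse({}) == '.'\n"

patch_passes = run_isolated(patched, test_code)
original_passes = run_isolated(unpatched, test_code)
```

Running the same test against the unpatched source first is a cheap sanity check that the test really reproduces the original failure.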

Evaluation Results

The paper evaluates HarnessFix on four benchmarks:

  • SWE-Bench Verified: software engineering tasks requiring code changes
  • Terminal-Bench 2.0 Verified: command-line tool usage
  • GAIA: general AI assistant tasks
  • AppWorld: application-level workflows

Across these benchmarks, HarnessFix improved held-out test performance over baseline agents. The gains came from fixing harness-level bugs that runtime supervision couldn’t address: malformed tool schemas, missing cleanup hooks, broken state persistence.

The most common diagnosed flaws:

  1. Tool interface bugs (38%): missing parameters, type mismatches, invalid schemas
  2. Context corruption (27%): state leaks, serialization errors, session conflicts
  3. Lifecycle errors (19%): missing retries, bad timeouts, broken cleanup
  4. Observability gaps (16%): missing traces, incomplete logs, lost metrics

Failure Modes and Observability

HarnessFix itself can fail in predictable ways:

Trace compilation errors: If the raw trace format is too fragmented or the harness code is obfuscated, HTIR compilation fails. The mitigation is to require structured logging from the start.
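A minimal sketch of what structured logging could look like is one JSON object per line, with the step number and harness layer recorded on every event. The field names and the emit_event helper are assumptions, not a format the paper prescribes.

```python
import io
import json
import time


def emit_event(stream, step: int, layer: str, event: str, **fields) -> None:
    """Append one trace event to the stream as a single JSON line."""
    record = {"ts": time.time(), "step": step, "layer": layer, "event": event, **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(stream) -> list[dict]:
    """Parse a JSON-lines stream back into event dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


buffer = io.StringIO()  # stands in for a log file
emit_event(buffer, 0, "tool_interface", "tool_call", tool="search", args={"query": "x"})
emit_event(buffer, 1, "lifecycle_orchestration", "retry", attempt=2)
buffer.seek(0)
events = read_events(buffer)
```

Because every line parses independently, a truncated or corrupted log loses individual events rather than the whole trace.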

Attribution ambiguity: When multiple harness layers are active at the same step, attribution becomes uncertain. The framework falls back to conservative diagnosis (flag all candidate layers) rather than guessing.

Undetected regressions: A patch might pass validation and fix the target flaw, yet introduce a subtle regression that the test suite doesn’t catch. The mitigation is to run extended regression tests on a held-out set of tasks.

Repair operator coverage: If the diagnosed flaw doesn’t map to any known repair operator, the framework can’t generate a patch. This requires expanding the operator library over time.

Observability requirements:

  • Structured execution traces with step-level granularity
  • Harness code instrumented with layer annotations
  • Test suites that cover both happy paths and failure modes
  • Metrics on patch acceptance rate and regression frequency
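The last item can be sketched as two ratios. The definitions used here are assumptions: acceptance rate is accepted patches over generated patches, and regression frequency is accepted patches later found to regress over accepted patches. The sample outcomes are made up purely to exercise the function.

```python
def patch_metrics(outcomes: list[dict]) -> dict[str, float]:
    """Summarize patch outcomes.

    Each outcome has two booleans: "accepted" (passed validation) and
    "regressed" (later found to break something).
    """
    generated = len(outcomes)
    accepted = sum(1 for o in outcomes if o["accepted"])
    regressed = sum(1 for o in outcomes if o["accepted"] and o["regressed"])
    return {
        "acceptance_rate": accepted / generated if generated else 0.0,
        "regression_frequency": regressed / accepted if accepted else 0.0,
    }


# Illustrative data only, not results from the paper.
outcomes = [
    {"accepted": True, "regressed": False},
    {"accepted": True, "regressed": True},
    {"accepted": False, "regressed": False},
    {"accepted": True, "regressed": False},
]
metrics = patch_metrics(outcomes)
```

A rising regression frequency alongside a steady acceptance rate would suggest the regression suite is too weak, not that the patches are getting better.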

Technical Verdict

Use HarnessFix when:

  • You have structured execution traces from agent runs
  • Failures recur across multiple tasks in similar patterns
  • You suspect harness bugs (tool interfaces, context, lifecycle) rather than reasoning failures
  • You can afford the latency of trace compilation and patch validation
  • You have a test suite that covers harness functionality

Avoid it when:

  • Your traces are unstructured or missing step-level detail
  • Failures are one-off anomalies with no recurring pattern
  • The agent’s reasoning is clearly the bottleneck (prompt engineering will help more)
  • You need real-time fixes (the diagnosis and repair pipeline adds seconds to minutes)
  • Your harness code is too dynamic or obfuscated to analyze statically

The framework shines when you’re operating agents at scale and harness bugs create systematic failures. It’s less useful for one-off debugging sessions where manual inspection is faster.
