Monitoring Agentic Systems Before They're Reliable: Why Structural Failure Detection Matters More Than Task-Level Evals

Structural defects dominate early agent systems. Learn how to instrument tool boundaries, state invariants, and integration gaps before task evals work.

Source: arxiv.org
Most agent observability tooling assumes your system works well enough to evaluate task outcomes. That assumption breaks when you deploy partially integrated assemblies where the plumbing itself is the problem.

A new paper from Boston et al. (arXiv:2606.02494v1) argues that structural defects, not task-level errors, dominate failures in early-stage agentic systems. The core insight: structural failure modes mask the signals that task-level monitors depend on, so you need a separate monitoring layer before outcome-based evals become feasible.

The Structural vs. Task Failure Distinction

Structural failures happen at the assembly layer:

  • Malformed tool calls (wrong schema, missing parameters)
  • State corruption between stages
  • Integration gaps (agent expects data that never arrives)
  • Broken boundaries (tool A writes state that tool B can’t parse)

Task failures happen at the outcome layer:

  • Wrong answer to a user query
  • Incomplete document generation
  • Incorrect classification

The problem: if your agent can’t reliably wire tools together, task-level metrics like accuracy or completion rate produce noise. A failed tool call and a bad reasoning step both show up as the same failed task, so the metric cannot tell them apart.

Why Task Evals Fail Early

The paper tested 220 runs across 120 document bundles with controlled error injection. Key finding: injected task-level errors were indistinguishable from clean baselines when structural defects were present.

When an integration gap causes a stage to receive no input, the agent might hallucinate a response, skip the stage silently, or throw an error. All three outcomes corrupt downstream task metrics. You can’t tell if the agent made a bad decision or if the plumbing broke.

Three Monitoring Scopes

The paper decomposes monitoring into three scopes, each catching different failure types:

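| Scope      | Detects                      | Coefficient of Variation  | Example                                        |
|------------|------------------------------|---------------------------|------------------------------------------------|
| Within-run | Deterministic stage defects  | 0.02                      | Tool call always fails with same schema error  |
| Cross-run  | Stochastic integration issues| 1.25 (24% at severity L2) | Intermittent state corruption across retries   |
| Structural | Integration gaps             | 0.00 (perfect consistency)| Missing data handoff between stages            |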

Within-run monitors catch repeatable defects in a single execution. Cross-run monitors surface variance across retries or different inputs. Structural monitors validate the assembly itself: does every stage have a valid input source? Are tool boundaries well-defined?

Instrumentation Primitives for Structural Monitoring

Here’s what you actually instrument:

1. Trace shape validation

Check that the execution graph matches the expected topology. If your orchestrator expects a linear chain (retrieve → reason → format → respond) but the trace shows (retrieve → respond), you have a missing stage.

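def validate_trace_shape(trace, expected_stages):
    """
    Structural monitor: verify all expected stages executed.

    Checks presence only, not ordering. `trace.spans` is assumed to be an
    iterable of objects that each have a `name` attribute.
    """
    executed = {span.name for span in trace.spans}
    # Coerce to a set so a list or tuple of stage names also works.
    missing = set(expected_stages) - executed

    if missing:
        return {
            "severity": "L1",  # Blocks task completion
            "defect_type": "missing_stage",
            "missing": sorted(missing),  # Sorted for stable, comparable output
            "cv": 0.0  # Deterministic if topology is fixed
        }
    return None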
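
The check above confirms that stages are present but not that they ran in the expected order. A minimal sketch of an order-aware variant follows, assuming a linear chain and a plain list of executed stage names. The function name `validate_stage_order` and the `stage_order_violation` defect type are illustrative, not taken from the paper.

```python
from typing import Optional, Sequence


def validate_stage_order(executed: Sequence[str],
                         expected: Sequence[str]) -> Optional[dict]:
    """Return a finding unless `expected` appears, in order, within `executed`.

    Extra stages (for example retries) are tolerated; only missing or
    out-of-order expected stages produce a finding.
    """
    position = 0  # index in `expected` of the next stage we are waiting for
    for name in executed:
        if position < len(expected) and name == expected[position]:
            position += 1

    if position < len(expected):
        return {
            "severity": "L1",
            "defect_type": "stage_order_violation",  # hypothetical label
            "first_unmatched": expected[position],
        }
    return None


EXPECTED = ["retrieve", "reason", "format", "respond"]
finding = validate_stage_order(["retrieve", "respond"], EXPECTED)
```

Here `finding["first_unmatched"]` is `"reason"`, the first expected stage the trace never reached in sequence.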

2. State invariants

Define what valid state looks like at each handoff. If stage A produces a dict with keys {query, context} and stage B expects {question, docs}, you have a schema mismatch.

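def check_state_invariants(stage_output, next_stage_input_schema):
    """
    Structural monitor: verify state shape at boundaries.

    The schema object is assumed to expose `validate(payload) -> bool`
    and a `fields` attribute describing the expected keys.
    """
    if not next_stage_input_schema.validate(stage_output):
        # Report key names as a plain list; fall back to the type name
        # when the stage produced something that is not a dict at all.
        if isinstance(stage_output, dict):
            actual = sorted(stage_output)
        else:
            actual = type(stage_output).__name__
        return {
            "severity": "L1",
            "defect_type": "state_schema_mismatch",
            "expected": next_stage_input_schema.fields,
            "actual": actual,
            "cv": 0.0  # Deterministic for fixed schemas
        }
    return None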
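
The schema object passed to that check is left unspecified. One way to picture it is a small class that lists required keys and their types. The sketch below is an assumption for illustration: `KeySchema` is a hypothetical name, and a real system might use a validation library instead.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class KeySchema:
    """Hypothetical minimal schema: required keys and their expected types."""
    fields: Dict[str, type]

    def validate(self, payload: Any) -> bool:
        # Anything that is not a mapping cannot satisfy a key-based schema.
        if not isinstance(payload, Mapping):
            return False
        return all(
            key in payload and isinstance(payload[key], expected_type)
            for key, expected_type in self.fields.items()
        )


# The mismatch described above: stage A emits {query, context},
# stage B expects {question, docs}.
stage_b_input = KeySchema(fields={"question": str, "docs": list})
stage_a_output = {"query": "refund policy", "context": ["doc-1"]}

print(stage_b_input.validate(stage_a_output))  # False: required keys are absent
```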

3. Tool boundary checks

Validate that every tool call conforms to its contract. Log the full request/response pair, not just success/failure.

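def validate_tool_call(tool_name, request, response, tool_registry):
    """
    Within-run monitor: check tool call conformance.

    `tool_registry` is assumed to be a dict of tool specs.
    `calculate_cv_across_runs` is assumed to be defined elsewhere and to
    return the CV of this tool's error counts over recent runs.
    """
    spec = tool_registry.get(tool_name)
    if spec is None:
        # A call to an unregistered tool is itself a structural defect.
        return {
            "severity": "L1",
            "defect_type": "unknown_tool",
            "tool": tool_name,
            "cv": None  # Not known until observed across runs
        }

    if not spec.request_schema.validate(request):
        return {
            "severity": "L1",
            "defect_type": "malformed_request",
            "tool": tool_name,
            "cv": 0.0  # Deterministic if agent always malforms
        }

    if response.status == "error":
        return {
            "severity": "L2",
            "defect_type": "tool_execution_failure",
            "tool": tool_name,
            "cv": calculate_cv_across_runs(tool_name, "error")
        }

    return None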
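
Logging the full request/response pair can be done with a thin wrapper around each tool invocation. The following is a minimal sketch under stated assumptions: tools are plain callables that take a dict, and `sink` is an in-memory list standing in for whatever log store you use. The name `call_with_record` is hypothetical.

```python
import json
import time
from typing import Any, Callable, Dict, List


def call_with_record(tool_name: str,
                     tool_fn: Callable[[Dict[str, Any]], Any],
                     request: Dict[str, Any],
                     sink: List[str]) -> Any:
    """Invoke a tool and append one JSON line describing the full exchange."""
    record: Dict[str, Any] = {
        "tool": tool_name,
        "request": request,
        "started_at": time.time(),
    }
    try:
        response = tool_fn(request)
        record.update(status="ok", response=response)
        return response
    except Exception as exc:
        # Keep the error type so findings can be grouped by signature later.
        record.update(status="error",
                      error_type=type(exc).__name__,
                      error=str(exc))
        raise
    finally:
        # Runs on both success and failure, so every call leaves a record.
        sink.append(json.dumps(record, default=str))
```

Because the record is written in the `finally` clause, a tool that raises still leaves its request in the log, which is what makes a failed call distinguishable from a stage that never ran.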

Variance as a Characterization Signal

The paper uses coefficient of variation (CV) to distinguish deterministic defects from stochastic ones:

  • CV near 0.0: Deterministic defect. Route to automated tracking.
  • CV > 1.0: High variance. Route to human investigation.
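
For reference, the coefficient of variation is the standard deviation divided by the mean. The sketch below computes it over per-run defect counts. Which quantity the paper measures, and whether it uses the population or sample standard deviation, is not stated here, so both choices are assumptions, as is treating an all-zero series as CV 0.0.

```python
from statistics import mean, pstdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = population standard deviation / mean."""
    if not values:
        raise ValueError("need at least one observation")
    mu = mean(values)
    if mu == 0:
        # Assumption: no observed defects is reported as no variation.
        return 0.0
    return pstdev(values) / mu


steady = coefficient_of_variation([3, 3, 3, 3])   # same defect count every run
bursty = coefficient_of_variation([0, 0, 0, 8])   # defect appears in one run only
```

A defect that shows up identically in every run gives `steady == 0.0`, while one that fires in a single run out of four gives a `bursty` value above 1.0.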

In their testbed, 97% of findings had CV < 0.1 and were routed to automated triage. The remaining 3% showed variable behavior (intermittent state corruption, race conditions) and required human attention.

This matters for operational cost. If you alert a human on every defect, you drown in noise. If you send deterministic defects to automated tracking and reserve human attention for high-variance issues, you spend that effort where non-determinism makes root cause analysis hard.

FMEA-Based Severity Classification

The paper adapts Failure Mode and Effects Analysis (FMEA) to route findings:

  • L1 (Critical): Blocks task completion. Example: missing stage, malformed tool call.
  • L2 (Major): Degrades output quality but task completes. Example: partial state corruption.
  • L3 (Minor): Inefficiency or cosmetic issue. Example: redundant tool calls.

Severity determines routing:

  • L1 + low CV → automated ticket, block deployment
  • L2 + high CV → human investigation queue
  • L3 → log for post-deployment analysis
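
The routing rules above can be sketched as a small function. Only three combinations are given in the list; the handling of L1 with high CV and L2 with low CV below is an assumption, as are the route names and the 0.1 cutoff (borrowed from the testbed figure of CV < 0.1).

```python
def route_finding(severity: str, cv: float, low_cv: float = 0.1) -> str:
    """Map a finding's severity and CV to a destination queue."""
    if severity not in {"L1", "L2", "L3"}:
        raise ValueError(f"unknown severity: {severity!r}")

    if severity == "L3":
        return "log"  # kept for post-deployment analysis

    if cv < low_cv:
        # Deterministic defect: safe to handle automatically.
        if severity == "L1":
            return "automated_ticket_block_deploy"
        return "automated_ticket"  # assumption: L2 + low CV does not block

    # Assumption: any L1/L2 finding that is not clearly deterministic
    # goes to a person, not just L2.
    return "human_investigation"
```

Keeping the thresholds as parameters makes it easy to tighten or loosen routing as the system moves between maturity stages.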

When Structural Monitoring Becomes Secondary

The paper proposes a maturity-staging model. As your agent system stabilizes:

  1. Stage 0 (Assembly): Structural monitors dominate. Task evals produce noise.
  2. Stage 1 (Integration): Cross-run monitors surface stochastic defects. Task evals start to correlate with outcomes.
  3. Stage 2 (Optimization): Task-level monitors become primary. Structural monitors shift to regression detection.

You don’t abandon structural monitoring. You shift it from primary signal to guardrail. Once your agent reliably completes tasks, structural defects become regressions, not the baseline.

Deployment Shape

For early-stage systems, run structural monitors synchronously in the hot path:

  • Validate trace shape after each run
  • Check state invariants at every stage boundary
  • Log tool call conformance in real time

Cost: about 5-10 ms per stage for schema validation, which is negligible next to LLM call latency.

For mature systems, shift structural monitors to asynchronous batch analysis:

  • Sample 10% of traces for structural validation
  • Run nightly jobs to detect schema drift
  • Alert only on new defect patterns
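
One way to sample 10% of traces is to hash the trace ID, so the same trace is always either in or out of the sample regardless of which worker sees it. This is a sketch of that approach, not something the paper prescribes; `should_sample` is a hypothetical name.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministically select roughly `rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < rate


sampled = [tid for tid in (f"trace-{i}" for i in range(10_000))
           if should_sample(tid)]
```

Hash-based selection avoids shared state between workers and makes a sampled trace reproducible when you re-run the batch job.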

Likely Failure Modes

False negatives on schema evolution

If you update a tool’s output schema but forget to update the next stage’s input schema, structural monitors won’t catch it until runtime. Mitigation: version your schemas and enforce compatibility checks in CI.
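
A CI compatibility check can be as simple as comparing what a producer stage declares it emits against what the consumer stage requires. The sketch below assumes schemas are stored as plain dicts of field name to type name; the representation and the function name `incompatibilities` are illustrative.

```python
from typing import Dict, List


def incompatibilities(producer_output: Dict[str, str],
                      consumer_input: Dict[str, str]) -> List[str]:
    """List every consumer requirement the producer does not satisfy."""
    problems = []
    for field, expected_type in sorted(consumer_input.items()):
        if field not in producer_output:
            problems.append(f"missing field: {field}")
        elif producer_output[field] != expected_type:
            problems.append(
                f"type mismatch on {field}: "
                f"{producer_output[field]} != {expected_type}"
            )
    return problems


# Hypothetical schemas for two adjacent stages.
retrieve_v2_output = {"question": "str", "docs": "list", "scores": "list"}
reason_input = {"question": "str", "docs": "list"}

problems = incompatibilities(retrieve_v2_output, reason_input)
```

A CI job would fail the build when `problems` is non-empty. Extra producer fields such as `scores` are allowed, since they do not break the consumer.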

Alert fatigue from deterministic defects

If a tool always fails, you’ll get an alert on every run. Mitigation: deduplicate by defect signature (tool name + error type) and alert once per unique signature.
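
Deduplication by signature needs only a set of signatures already seen. This sketch keeps that set in memory and reuses the `tool` and `defect_type` keys from the finding dicts shown earlier. A production system would presumably persist the set and expire entries, which is not shown.

```python
from typing import Dict, Set, Tuple


class AlertDeduplicator:
    """Alert once per unique (tool, defect type) signature."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def should_alert(self, finding: Dict[str, str]) -> bool:
        # Findings without a tool (e.g. missing stages) share an empty tool slot.
        signature = (finding.get("tool", ""), finding["defect_type"])
        if signature in self._seen:
            return False
        self._seen.add(signature)
        return True


dedup = AlertDeduplicator()
first = dedup.should_alert({"tool": "search", "defect_type": "malformed_request"})
repeat = dedup.should_alert({"tool": "search", "defect_type": "malformed_request"})
```

The first occurrence alerts (`first` is `True`) and the repeat is suppressed (`repeat` is `False`).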

Missed stochastic defects in low-traffic stages

If a stage executes only 10 times per day, you won’t have enough samples to calculate a meaningful CV. Mitigation: set a minimum sample threshold (e.g., 50 runs) before routing to human investigation.
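
The sample threshold can be enforced with a small gate in front of the CV-based routing. The sketch below uses the 50-run example and the CV > 1.0 cutoff from this article as defaults; the return labels are hypothetical.

```python
def cv_routing_decision(sample_count: int, cv: float,
                        min_samples: int = 50, high_cv: float = 1.0) -> str:
    """Hold back CV-based escalation until enough runs have been observed."""
    if sample_count < min_samples:
        # Too few runs: the CV estimate is not trustworthy yet.
        return "insufficient_data"
    if cv > high_cv:
        return "human_investigation"
    return "no_escalation"
```

With these defaults, a stage that has run 10 times returns `"insufficient_data"` even if its measured CV is high, so it keeps accumulating samples instead of paging anyone.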

Technical Verdict

Use structural monitoring when:

  • Your agent system is partially integrated (stages exist but handoffs are brittle)
  • Task-level metrics show high variance with no clear pattern
  • You’re debugging silent failures (agent completes but output is garbage)
  • You’re adding new tools or stages and need to validate integration

Scale structural monitoring back to sampled, asynchronous checks when:

  • Your agent system is mature and task evals correlate with outcomes
  • You’re optimizing for latency and can’t afford synchronous validation
  • Your orchestrator already enforces strict schemas (e.g., typed state machines)

Combine with task-level monitoring when:

  • You’re transitioning from Stage 0 to Stage 1 maturity
  • You need to prove that structural fixes improve task outcomes
  • You’re running A/B tests on agent architecture changes

The key insight: structural defects mask task-level signal. Fix the plumbing before you tune the reasoning. Monitor the assembly before you evaluate the outcome.
