Sibyl-AutoResearch: Why Autonomous Research Agents Need Trial Harnesses, Not Just Paper Generators

How autonomous research agents lose trial experience when they optimize for paper generation instead of building self-evolving experimental harnesses.

Source: arxiv.org

Autonomous research agents can now propose hypotheses, execute experiments, and draft papers. But most systems treat failed experiments as noise to discard rather than signal to preserve. The result is agents that repeat mistakes, overstate weak findings, and never learn why certain approaches fail.

Sibyl-AutoResearch addresses this by building trial harnesses instead of paper generators. A trial harness captures both positive and negative outcomes, routes failure signals into future planning, and updates its own execution logic when process failures recur.

The Trial Experience Problem

Current autonomous research systems lose information at four points:

  • Weak evidence becomes prose: Pilot results with low confidence get written as definitive claims
  • Negative results disappear: Failed experiments vanish from memory after the iteration completes
  • Memory stays textual: Agents store summaries instead of queryable evidence structures
  • Process failures don’t update behavior: When validation gates fail repeatedly, the system doesn’t change its approach

This happens because most systems optimize for paper generation. They execute a linear workflow (propose, experiment, write) without feedback loops that connect trial outcomes to future decisions.

What a Trial Harness Does

A trial harness is infrastructure that wraps experimental execution with three capabilities:

  1. Bounded trial execution: Run experiments with explicit success/failure criteria
  2. Outcome preservation: Store both positive and negative results in queryable form
  3. Behavioral routing: Connect trial signals to downstream actions (planning, validation, claim scope, critique)

The key difference from standard experiment tracking is the routing layer. A harness doesn’t just log what happened. It changes what the agent does next based on accumulated evidence.
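The three capabilities can be sketched in a few lines. This is a minimal illustration, not Sibyl's implementation: the class names, the `success_threshold` criterion, and the `revisit:` queue entries are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TrialRecord:
    trial_id: str
    hypothesis: str
    metric: float
    succeeded: bool


@dataclass
class TrialHarness:
    # Hypothetical success criterion: a trial passes when its metric
    # reaches this threshold.
    success_threshold: float
    records: list[TrialRecord] = field(default_factory=list)
    planning_queue: list[str] = field(default_factory=list)

    def run(self, trial_id: str, hypothesis: str,
            experiment: Callable[[], float]) -> TrialRecord:
        # 1. Bounded trial execution: explicit pass/fail criterion.
        metric = experiment()
        record = TrialRecord(trial_id, hypothesis, metric,
                             metric >= self.success_threshold)
        # 2. Outcome preservation: keep failures as well as successes.
        self.records.append(record)
        # 3. Behavioral routing: a failure changes what is planned next.
        if not record.succeeded:
            self.planning_queue.append(f"revisit:{hypothesis}")
        return record

    def failures_for(self, hypothesis: str) -> list[TrialRecord]:
        # Query preserved negative results for one hypothesis.
        return [r for r in self.records
                if r.hypothesis == hypothesis and not r.succeeded]
```

A plain experiment tracker would stop after step 2. The routing in step 3 is what makes the failed trial visible to the planner.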

Two Conversion Units

Sibyl formalizes trial learning through two auditable conversion types:

Trial-to-Behavior Conversion

Links trial signals to research actions:

  • Failed validation → narrower claim scope in next draft
  • Low-confidence pilot → additional experiment in planning queue
  • Negative result → explicit “tried and failed” note in memory

Trial-to-Harness-Behavior Conversion

Links recurring process failures to system updates:

  • Validation gate fails three times → update validation criteria
  • Experiment timeout pattern → adjust resource allocation
  • Critique loop stalls → modify critique prompt structure

Both conversions are file-backed and auditable. You can inspect which trial triggered which behavior change and when.
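One way to picture the two conversion units is as a pair of lookup tables plus an audit log. The sketch below assumes hypothetical signal and action names that mirror the bullets above, and it assumes the "three times" rule applies to every process failure; the paper's actual rule set may differ.

```python
from collections import Counter

# Hypothetical names mirroring the trial-to-behavior bullets.
TRIAL_TO_BEHAVIOR = {
    "failed_validation": "narrow_claim_scope",
    "low_confidence_pilot": "queue_additional_experiment",
    "negative_result": "record_tried_and_failed",
}

# Hypothetical names mirroring the trial-to-harness-behavior bullets.
PROCESS_TO_HARNESS = {
    "validation_gate_failed": "update_validation_criteria",
    "experiment_timeout": "adjust_resource_allocation",
    "critique_loop_stalled": "modify_critique_prompt",
}

HARNESS_UPDATE_THRESHOLD = 3  # assumed: same threshold for every failure


def route_trial_signal(trial_id, signal, audit_log):
    """Map a trial signal to a research action and record the conversion."""
    action = TRIAL_TO_BEHAVIOR.get(signal)
    if action is not None:
        audit_log.append({
            "conversion_type": "trial_to_behavior",
            "trigger_trial": trial_id,
            "trigger_signal": signal,
            "resulting_action": action,
        })
    return action


def route_process_failure(failure, counts, audit_log):
    """Trigger a harness update only after the failure recurs enough times."""
    update = PROCESS_TO_HARNESS.get(failure)
    if update is None:
        return None
    counts[failure] += 1
    if counts[failure] < HARNESS_UPDATE_THRESHOLD:
        return None
    counts[failure] = 0  # reset so the update fires once per threshold
    audit_log.append({
        "conversion_type": "trial_to_harness_behavior",
        "trigger_signal": failure,
        "resulting_action": update,
    })
    return update
```

Requiring several recurrences before a harness update also limits how often the system rewrites its own validation logic.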

Architecture: File-Backed State

Sibyl implements the framework with a file-backed architecture:

research_project/
├── trials/
│   ├── trial_001_hypothesis_a.json
│   ├── trial_002_hypothesis_a_variant.json
│   └── trial_003_hypothesis_b.json
├── memory/
│   ├── evidence_index.json
│   ├── failure_registry.json
│   └── claim_confidence.json
├── harness/
│   ├── validation_gates.py
│   ├── scheduling_policy.py
│   └── critique_prompts.json
└── artifacts/
    ├── draft_v1.md
    ├── draft_v2.md
    └── experiment_logs/

Each trial file contains:

  • Hypothesis being tested
  • Execution trace (code, data, results)
  • Success/failure classification
  • Confidence score
  • Links to related trials

The evidence index is a queryable structure that lets agents ask:

  • “What have we tried for hypothesis X?”
  • “Which experiments failed with error Y?”
  • “What’s the confidence distribution for claim Z?”
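The three questions above map naturally onto a small in-memory index. The sketch below is an assumption about how such an index could look, not the schema of `evidence_index.json`; the field names `hypothesis`, `confidence`, and `failure_mode` are taken from the examples in this article, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_index(trials):
    """Group trial records by hypothesis and by failure mode."""
    by_hypothesis = defaultdict(list)
    by_error = defaultdict(list)
    for trial in trials:
        by_hypothesis[trial["hypothesis"]].append(trial)
        if trial.get("failure_mode"):
            by_error[trial["failure_mode"]].append(trial["trial_id"])
    return {"by_hypothesis": dict(by_hypothesis), "by_error": dict(by_error)}


def tried_for(index, hypothesis):
    """What have we tried for hypothesis X?"""
    return [t["trial_id"] for t in index["by_hypothesis"].get(hypothesis, [])]


def failed_with(index, error):
    """Which experiments failed with error Y?"""
    return index["by_error"].get(error, [])


def confidence_summary(index, hypothesis):
    """What is the confidence distribution for claim Z?"""
    scores = [t["confidence"] for t in index["by_hypothesis"].get(hypothesis, [])]
    if not scores:
        return None
    return {"n": len(scores), "min": min(scores),
            "max": max(scores), "mean": mean(scores)}
```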

State Management for Experimental Branches

Research agents need to fork experimental paths, compare outcomes, and merge learnings without losing the failure branch. Sibyl handles this through trial lineage tracking:

{
  "trial_id": "trial_003",
  "parent_trial": "trial_001",
  "fork_reason": "low_confidence_pilot",
  "hypothesis": "variant with adjusted parameters",
  "outcome": "failed",
  "failure_mode": "convergence_timeout",
  "preserved_for": ["future_parameter_tuning", "failure_pattern_analysis"]
}

When an agent proposes a new experiment, it queries the evidence index for related trials. If a similar approach already failed, the system surfaces that information during planning.
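Lineage tracking and the planning-time lookup can be sketched as two small functions over the trial records. The `parent_trial`, `hypothesis`, and `outcome` fields come from the JSON above; the function names are hypothetical, and exact hypothesis matching stands in for whatever similarity measure the real system uses.

```python
def lineage(trials_by_id, trial_id):
    """Walk parent_trial links from a trial back to its root."""
    chain = []
    seen = set()
    current = trial_id
    while current is not None and current not in seen:
        trial = trials_by_id.get(current)
        if trial is None:
            break  # dangling parent reference: stop instead of failing
        seen.add(current)  # guards against accidental cycles
        chain.append(current)
        current = trial.get("parent_trial")
    return chain


def prior_failures(trials_by_id, hypothesis):
    """Surface failed trials for a hypothesis before planning a new one.

    Assumption: "similar" is approximated by an exact hypothesis match.
    """
    return [t["trial_id"] for t in trials_by_id.values()
            if t["hypothesis"] == hypothesis and t["outcome"] == "failed"]
```

Because the failed branch keeps its own record and parent link, forking never overwrites the evidence from the path that did not work.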

Evidence Accumulation: Experiment vs. Hypothesis Failure

The system distinguishes between two types of negative results:

Failure Type      | Meaning                                                | Agent Action
Experiment failed | Implementation error, resource limit, or execution bug | Retry with fixes, adjust harness
Hypothesis wrong  | Evidence contradicts the claim                         | Update claim scope, mark hypothesis as tested

This distinction matters for memory. An experiment failure should trigger debugging. A hypothesis failure should update the agent’s belief state and prevent future work on that path.

The failure registry tracks both:

{
  "experiment_failures": [
    {
      "trial_id": "trial_002",
      "error": "cuda_out_of_memory",
      "resolution": "reduced_batch_size",
      "harness_update": "added_memory_check_gate"
    }
  ],
  "hypothesis_failures": [
    {
      "hypothesis": "approach_x_improves_metric_y",
      "trials": ["trial_001", "trial_002", "trial_003"],
      "confidence": 0.05,
      "status": "rejected"
    }
  ]
}
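The routing between the two registry sections can be sketched as a single function. The set of execution errors below is a hypothetical stand-in; how Sibyl actually classifies a failure is not described here, so treat the rule "a known execution error means the experiment failed, anything else counts against the hypothesis" as an assumption.

```python
# Hypothetical list of errors that indicate an execution problem rather
# than evidence against the hypothesis.
EXECUTION_ERRORS = {"cuda_out_of_memory", "timeout", "import_error"}


def record_failure(registry, trial):
    """File a failed trial under the matching section of the registry."""
    if trial.get("error") in EXECUTION_ERRORS:
        # Experiment failed: keep the error so it can be debugged and retried.
        registry["experiment_failures"].append({
            "trial_id": trial["trial_id"],
            "error": trial["error"],
        })
        return "experiment_failed"

    # The run completed but the evidence contradicts the claim: attach the
    # trial to the hypothesis entry, creating the entry on first use.
    for entry in registry["hypothesis_failures"]:
        if entry["hypothesis"] == trial["hypothesis"]:
            entry["trials"].append(trial["trial_id"])
            break
    else:
        registry["hypothesis_failures"].append({
            "hypothesis": trial["hypothesis"],
            "trials": [trial["trial_id"]],
        })
    return "hypothesis_wrong"
```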

Observability: Auditing Conversion Events

The paper reports a retrospective audit that identified eight high-confidence conversion events with a median latency of one iteration. In other words, a trial signal typically changed agent behavior within one research cycle of being observed.

To make this auditable, Sibyl logs conversion events:

{
  "conversion_type": "trial_to_behavior",
  "trigger_trial": "trial_004",
  "trigger_signal": "low_confidence_pilot",
  "resulting_action": "added_validation_experiment",
  "iteration": 2,
  "latency": 1
}

You can trace from a trial outcome to the specific planning decision it influenced. This is critical for debugging why an agent chose a particular research path.
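A conversion log like this is easy to keep as one JSON object per line, which also makes the latency statistic straightforward to recompute. The sketch below assumes a hypothetical log file layout and that `latency` is measured in iterations, matching the example event above.

```python
import json
from pathlib import Path
from statistics import median


def log_conversion(path: Path, event: dict) -> None:
    """Append one conversion event as a single JSON line."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")


def load_conversions(path: Path) -> list:
    """Read all conversion events back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def median_latency(events: list) -> float:
    """Median number of iterations between signal and behavior change."""
    return median(event["latency"] for event in events)


def conversions_for(events: list, trial_id: str) -> list:
    """Trace which actions a given trial triggered."""
    return [event["resulting_action"] for event in events
            if event.get("trigger_trial") == trial_id]
```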

Failure Modes

Trial harnesses introduce new failure surfaces:

Evidence index corruption: If the queryable evidence structure becomes inconsistent, agents make decisions on incomplete information. Mitigation: append-only trial logs with periodic index rebuilds.

Conversion latency: If trial signals take too long to route into behavior changes, agents waste iterations. Mitigation: explicit conversion gates that block progression until key signals are processed.

Harness update loops: If trial-to-harness-behavior conversion is too aggressive, the system can thrash by constantly updating its own validation logic. Mitigation: require multiple failure instances before triggering harness updates.

Memory bloat: Preserving all negative results can overwhelm storage and query performance. Mitigation: time-based or relevance-based pruning with explicit preservation rules for high-value failures.
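Two of these mitigations are simple enough to sketch: rebuilding the index from the append-only log, and pruning with a preservation rule. The sketch assumes each trial record carries an `iteration` number (an assumption; the trial example earlier does not show one) and treats a non-empty `preserved_for` list as the rule that pins a failure.

```python
def rebuild_index(trial_log):
    """Derive the hypothesis -> trial ids index from the append-only log."""
    index = {}
    for trial in trial_log:
        index.setdefault(trial["hypothesis"], []).append(trial["trial_id"])
    return index


def find_drift(stored_index, trial_log):
    """List hypotheses whose stored index entry disagrees with the log."""
    fresh = rebuild_index(trial_log)
    keys = fresh.keys() | stored_index.keys()
    return sorted(k for k in keys if fresh.get(k) != stored_index.get(k))


def prune(trial_log, current_iteration, max_age):
    """Return the trials to keep in the working set.

    A trial is kept when it is recent or explicitly pinned through
    preserved_for. The input log is not modified, so the full history
    can still be archived elsewhere.
    """
    kept = []
    for trial in trial_log:
        recent = current_iteration - trial["iteration"] <= max_age
        pinned = bool(trial.get("preserved_for"))
        if recent or pinned:
            kept.append(trial)
    return kept
```

Since the log is the source of truth, a corrupted index can always be discarded and regenerated, and `find_drift` gives a cheap periodic consistency check.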

Deployment Shape

A production trial harness system needs:

  • Persistent storage: File-backed or database-backed trial logs that survive agent restarts
  • Query layer: Fast lookups for “similar trials” and “related failures”
  • Conversion scheduler: Background process that evaluates trial outcomes and routes signals to behavior modules
  • Audit interface: UI or CLI for inspecting conversion events and failure registries
  • Harness versioning: Track which versions of the validation gates, scheduling policies, and critique prompts were active for each trial
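For harness versioning, one lightweight option is to fingerprint the contents of the `harness/` directory and store that fingerprint in each trial record. This is a sketch of that idea under the assumption that the harness lives in plain files; the function name and the 12-character truncation are illustrative choices.

```python
import hashlib
from pathlib import Path


def harness_version(harness_dir: Path) -> str:
    """Return a short content hash covering every file in the harness."""
    digest = hashlib.sha256()
    files = sorted(p for p in harness_dir.rglob("*") if p.is_file())
    for path in files:
        # Hash the relative path as well as the bytes, so renaming a
        # file changes the version even if its content does not.
        digest.update(path.relative_to(harness_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:12]
```

Recording this value alongside each trial makes it possible to tell whether two trials were judged by the same validation gates.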

The file-backed approach in Sibyl works for single-agent research projects. Multi-agent research teams would need a shared evidence index with conflict resolution for concurrent trial execution.

Security Boundaries

Autonomous research agents that update their own harness logic create privilege escalation risks:

  • Validation bypass: Agent modifies validation gates to accept weak evidence
  • Memory poisoning: Agent writes false failure records to avoid certain research paths
  • Resource exhaustion: Agent schedules unbounded experiments that consume compute budget

Mitigation strategies:

  • Separate harness update permissions from trial execution permissions
  • Require human approval for harness modifications above a risk threshold
  • Implement resource quotas at the trial level with hard limits
  • Log all harness updates with diff tracking and rollback capability
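The mitigations above can be combined in a small guard object that sits between the agent and its harness. Everything in this sketch is hypothetical: the class and exception names, the GPU-hour budget, and the numeric risk score are assumptions used to show the shape of the checks, not a description of Sibyl.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    pass


class ApprovalRequired(Exception):
    pass


@dataclass
class GuardedHarness:
    gpu_hours_budget: float   # hard resource limit (assumed unit)
    risk_threshold: float     # updates at or above this need a human
    gpu_hours_used: float = 0.0
    gates: dict = field(default_factory=dict)
    update_log: list = field(default_factory=list)

    def charge(self, gpu_hours: float) -> None:
        """Trial execution path: refuse work that would exceed the budget."""
        if self.gpu_hours_used + gpu_hours > self.gpu_hours_budget:
            raise QuotaExceeded(f"budget of {self.gpu_hours_budget} exceeded")
        self.gpu_hours_used += gpu_hours

    def update_gate(self, name, new_value, risk, approved_by=None) -> None:
        """Harness update path: separate from execution, logged with a diff."""
        if risk >= self.risk_threshold and approved_by is None:
            raise ApprovalRequired(f"update to {name} needs human approval")
        self.update_log.append({"gate": name, "old": self.gates.get(name),
                                "new": new_value, "approved_by": approved_by})
        self.gates[name] = new_value

    def rollback_last(self) -> None:
        """Undo the most recent update and log the rollback itself."""
        last = self.update_log[-1]
        self.update_log.append({"gate": last["gate"], "old": last["new"],
                                "new": last["old"], "approved_by": "rollback"})
        if last["old"] is None:
            self.gates.pop(last["gate"], None)
        else:
            self.gates[last["gate"]] = last["old"]
```

Keeping `charge` and `update_gate` as separate entry points is what lets a deployment grant the agent the first without the second.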

Technical Verdict

Use trial harnesses when:

  • Your research agents run multiple iterations on the same problem space
  • Negative results contain valuable information for future planning
  • You need to audit why an agent chose a particular research direction
  • Process failures (validation, critique, scheduling) recur across iterations

Avoid trial harnesses when:

  • You’re running one-shot experiments with no iteration
  • Storage and query overhead outweigh the value of preserved failures
  • Your agents don’t have planning or validation stages that could consume trial signals
  • You can’t afford the engineering cost of building conversion routing and evidence indexing

The core insight is that research judgment requires memory of what didn’t work. If your autonomous research system treats failed experiments as errors to discard, it will never learn to avoid unproductive paths. Trial harnesses turn negative results into queryable infrastructure that shapes future behavior.
