Autonomous research agents can now propose hypotheses, execute experiments, and draft papers. But most systems treat failed experiments as noise to discard rather than signal to preserve. The result is agents that repeat mistakes, overstate weak findings, and never learn why certain approaches fail.
Sibyl-AutoResearch addresses this by building trial harnesses instead of paper generators. A trial harness captures both positive and negative outcomes, routes failure signals into future planning, and updates its own execution logic when process failures recur.
The Trial Experience Problem
Current autonomous research systems lose information at four points:
- Weak evidence becomes prose: Pilot results with low confidence get written as definitive claims
- Negative results disappear: Failed experiments vanish from memory after the iteration completes
- Memory stays textual: Agents store summaries instead of queryable evidence structures
- Process failures don’t update behavior: When validation gates fail repeatedly, the system doesn’t change its approach
This happens because most systems optimize for paper generation. They execute a linear workflow (propose, experiment, write) without feedback loops that connect trial outcomes to future decisions.
What a Trial Harness Does
A trial harness is infrastructure that wraps experimental execution with three capabilities:
- Bounded trial execution: Run experiments with explicit success/failure criteria
- Outcome preservation: Store both positive and negative results in queryable form
- Behavioral routing: Connect trial signals to downstream actions (planning, validation, claim scope, critique)
The key difference from standard experiment tracking is the routing layer. A harness doesn’t just log what happened. It changes what the agent does next based on accumulated evidence.
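The three capabilities above can be sketched in a few lines of Python. This is a minimal illustration, not Sibyl's actual implementation: the names `TrialHarness`, `TrialResult`, `threshold`, and `planning_queue` are hypothetical, and the success criterion is assumed to be a single metric threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrialResult:
    trial_id: str
    metric: float
    passed: bool


@dataclass
class TrialHarness:
    # Hypothetical success criterion: a trial passes if its metric reaches the threshold.
    threshold: float
    store: List[TrialResult] = field(default_factory=list)
    planning_queue: List[str] = field(default_factory=list)

    def run(self, trial_id: str, experiment: Callable[[], float]) -> TrialResult:
        # Bounded execution: the outcome is judged against an explicit criterion.
        metric = experiment()
        result = TrialResult(trial_id, metric, metric >= self.threshold)
        # Outcome preservation: passes and failures are both stored.
        self.store.append(result)
        # Behavioral routing: a failure changes what gets planned next.
        if not result.passed:
            self.planning_queue.append(f"follow_up:{trial_id}")
        return result


harness = TrialHarness(threshold=0.8)
harness.run("trial_001", lambda: 0.91)
harness.run("trial_002", lambda: 0.42)
```

After these two calls the store holds both results, and only the failed trial has added an entry to the planning queue.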
Two Conversion Units
Sibyl formalizes trial learning through two auditable conversion types:
Trial-to-Behavior Conversion
Links trial signals to research actions:
- Failed validation → narrower claim scope in next draft
- Low-confidence pilot → additional experiment in planning queue
- Negative result → explicit “tried and failed” note in memory
Trial-to-Harness-Behavior Conversion
Links recurring process failures to system updates:
- Validation gate fails three times → update validation criteria
- Experiment timeout pattern → adjust resource allocation
- Critique loop stalls → modify critique prompt structure
Both conversions are file-backed and auditable. You can inspect which trial triggered which behavior change and when.
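One way to picture a trial-to-behavior conversion is as a lookup table plus an audit record. The sketch below assumes hypothetical signal and action names derived from the list above. The `convert` function and its record layout are illustrative and not taken from the Sibyl codebase.

```python
# Hypothetical signal -> action table mirroring the trial-to-behavior examples.
TRIAL_TO_BEHAVIOR = {
    "failed_validation": "narrow_claim_scope",
    "low_confidence_pilot": "queue_additional_experiment",
    "negative_result": "record_tried_and_failed",
}


def convert(trial_id, signal, iteration, audit_log):
    """Return the action for a signal and append an auditable record of the conversion."""
    action = TRIAL_TO_BEHAVIOR.get(signal)
    if action is None:
        return None  # unknown signals produce no action and no audit record
    audit_log.append({
        "conversion_type": "trial_to_behavior",
        "trigger_trial": trial_id,
        "trigger_signal": signal,
        "resulting_action": action,
        "iteration": iteration,
    })
    return action


audit_log = []
action = convert("trial_004", "low_confidence_pilot", 2, audit_log)
```

Because every successful conversion appends a record, the audit log answers "which trial triggered which behavior change and when" without any extra bookkeeping.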
Architecture: File-Backed State
Sibyl implements the framework with a file-backed architecture:
research_project/
├── trials/
│   ├── trial_001_hypothesis_a.json
│   ├── trial_002_hypothesis_a_variant.json
│   └── trial_003_hypothesis_b.json
├── memory/
│   ├── evidence_index.json
│   ├── failure_registry.json
│   └── claim_confidence.json
├── harness/
│   ├── validation_gates.py
│   ├── scheduling_policy.py
│   └── critique_prompts.json
└── artifacts/
    ├── draft_v1.md
    ├── draft_v2.md
    └── experiment_logs/
Each trial file contains:
- Hypothesis being tested
- Execution trace (code, data, results)
- Success/failure classification
- Confidence score
- Links to related trials
The evidence index is a queryable structure that lets agents ask:
- “What have we tried for hypothesis X?”
- “Which experiments failed with error Y?”
- “What’s the confidence distribution for claim Z?”
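The three questions above map onto simple filters over trial records. The sketch below assumes each record carries `hypothesis`, `claim`, `outcome`, `error`, and `confidence` fields. Those field names, the function names, and the sample values are assumptions for illustration, not the actual schema of `evidence_index.json`.

```python
# Hypothetical trial records; field names and values are illustrative only.
TRIALS = [
    {"trial_id": "trial_001", "hypothesis": "hypothesis_a", "claim": "claim_z",
     "outcome": "succeeded", "error": None, "confidence": 0.62},
    {"trial_id": "trial_002", "hypothesis": "hypothesis_a", "claim": "claim_z",
     "outcome": "failed", "error": "cuda_out_of_memory", "confidence": 0.10},
    {"trial_id": "trial_003", "hypothesis": "hypothesis_b", "claim": "claim_w",
     "outcome": "failed", "error": "convergence_timeout", "confidence": 0.05},
]


def tried_for(trials, hypothesis):
    """What have we tried for hypothesis X?"""
    return [t["trial_id"] for t in trials if t["hypothesis"] == hypothesis]


def failed_with(trials, error):
    """Which experiments failed with error Y?"""
    return [t["trial_id"] for t in trials
            if t["outcome"] == "failed" and t["error"] == error]


def confidence_distribution(trials, claim):
    """What is the confidence distribution for claim Z? Returned as sorted scores."""
    return sorted(t["confidence"] for t in trials if t["claim"] == claim)
```

A real index would likely precompute these lookups rather than scan every record, but the query surface would look much the same.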
State Management for Experimental Branches
Research agents need to fork experimental paths, compare outcomes, and merge learnings without losing the failure branch. Sibyl handles this through trial lineage tracking:
{
  "trial_id": "trial_003",
  "parent_trial": "trial_001",
  "fork_reason": "low_confidence_pilot",
  "hypothesis": "variant with adjusted parameters",
  "outcome": "failed",
  "failure_mode": "convergence_timeout",
  "preserved_for": ["future_parameter_tuning", "failure_pattern_analysis"]
}
When an agent proposes a new experiment, it queries the evidence index for related trials. If a similar approach already failed, the system surfaces that information during planning.
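The planning-time lookup can be approximated by walking `parent_trial` links. The sketch below treats "related" as "shares the same root trial", which is an assumption made for illustration, since the source does not define how similarity is computed. The helper names `root_of` and `related_failures` are hypothetical.

```python
# Hypothetical lineage records keyed by trial id; only the fields used here are shown.
TRIALS = {
    "trial_001": {"parent_trial": None, "outcome": "inconclusive"},
    "trial_002": {"parent_trial": "trial_001", "outcome": "failed",
                  "failure_mode": "cuda_out_of_memory"},
    "trial_003": {"parent_trial": "trial_001", "outcome": "failed",
                  "failure_mode": "convergence_timeout"},
    "trial_004": {"parent_trial": None, "outcome": "succeeded"},
}


def root_of(trials, trial_id):
    """Follow parent_trial links up to the root, guarding against cycles."""
    seen = set()
    while trial_id not in seen:
        seen.add(trial_id)
        parent = trials[trial_id].get("parent_trial")
        if parent is None or parent not in trials:
            return trial_id
        trial_id = parent
    return trial_id  # reached only if the lineage contains a cycle


def related_failures(trials, parent_id):
    """Failed trials in the same lineage as a proposed fork of parent_id."""
    root = root_of(trials, parent_id)
    return sorted(
        tid for tid, t in trials.items()
        if t["outcome"] == "failed" and root_of(trials, tid) == root
    )
```

A planner proposing another fork of `trial_001` would see both failed variants before committing compute to a third.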
Evidence Accumulation: Experiment vs. Hypothesis Failure
The system distinguishes between two types of negative results:
| Failure Type | Meaning | Agent Action |
|---|---|---|
| Experiment failed | Implementation error, resource limit, or execution bug | Retry with fixes, adjust harness |
| Hypothesis wrong | Evidence contradicts the claim | Update claim scope, mark hypothesis as tested |
This distinction matters for memory. An experiment failure should trigger debugging. A hypothesis failure should update the agent’s belief state and prevent future work on that path.
The failure registry tracks both:
{
  "experiment_failures": [
    {
      "trial_id": "trial_002",
      "error": "cuda_out_of_memory",
      "resolution": "reduced_batch_size",
      "harness_update": "added_memory_check_gate"
    }
  ],
  "hypothesis_failures": [
    {
      "hypothesis": "approach_x_improves_metric_y",
      "trials": ["trial_001", "trial_002", "trial_003"],
      "confidence": 0.05,
      "status": "rejected"
    }
  ]
}
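A registry like this needs a rule for deciding which list a failed trial belongs to. The sketch below assumes a deliberately simple rule: a trial that raised an execution error counts as an experiment failure, and a trial that ran cleanly but contradicted the claim counts as a hypothesis failure. The rule and the function names are assumptions, not Sibyl's documented classifier.

```python
def new_registry():
    return {"experiment_failures": [], "hypothesis_failures": []}


def record_failure(registry, trial):
    """File a failed trial under the matching list and return its failure type."""
    if trial.get("error"):
        # The run itself broke, so this says nothing about the hypothesis.
        registry["experiment_failures"].append(
            {"trial_id": trial["trial_id"], "error": trial["error"]}
        )
        return "experiment_failed"
    # The run completed, so the negative result counts against the hypothesis.
    for entry in registry["hypothesis_failures"]:
        if entry["hypothesis"] == trial["hypothesis"]:
            entry["trials"].append(trial["trial_id"])
            return "hypothesis_wrong"
    registry["hypothesis_failures"].append(
        {"hypothesis": trial["hypothesis"], "trials": [trial["trial_id"]]}
    )
    return "hypothesis_wrong"


registry = new_registry()
first = record_failure(registry, {"trial_id": "trial_002",
                                  "hypothesis": "approach_x_improves_metric_y",
                                  "error": "cuda_out_of_memory"})
second = record_failure(registry, {"trial_id": "trial_003",
                                   "hypothesis": "approach_x_improves_metric_y",
                                   "error": None})
```

Keeping the two lists separate means a crash never lowers confidence in a hypothesis, and a clean negative result never triggers debugging.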
Observability: Auditing Conversion Events
The paper reports a retrospective audit that identified eight high-confidence conversion events with a median latency of one iteration. In other words, at least half of those trial signals changed agent behavior within a single research cycle.
To make this auditable, Sibyl logs conversion events:
{
  "conversion_type": "trial_to_behavior",
  "trigger_trial": "trial_004",
  "trigger_signal": "low_confidence_pilot",
  "resulting_action": "added_validation_experiment",
  "iteration": 2,
  "latency": 1
}
You can trace from a trial outcome to the specific planning decision it influenced. This is critical for debugging why an agent chose a particular research path.
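Given a list of such events, both the trace and the latency summary are short computations. The events below are made-up sample values in the same shape as the logged record. They are not the eight events from the paper's audit, and the helper names are hypothetical.

```python
from statistics import median

# Illustrative events only; these are not the audit data reported in the paper.
EVENTS = [
    {"trigger_trial": "trial_004", "trigger_signal": "low_confidence_pilot",
     "resulting_action": "added_validation_experiment", "iteration": 2, "latency": 1},
    {"trigger_trial": "trial_005", "trigger_signal": "failed_validation",
     "resulting_action": "narrowed_claim_scope", "iteration": 3, "latency": 1},
    {"trigger_trial": "trial_007", "trigger_signal": "negative_result",
     "resulting_action": "recorded_tried_and_failed", "iteration": 6, "latency": 3},
]


def actions_triggered_by(events, trial_id):
    """Trace from a trial outcome to the actions it caused."""
    return [e["resulting_action"] for e in events if e["trigger_trial"] == trial_id]


def median_latency(events):
    """Median number of iterations between a signal and its resulting action."""
    return median(e["latency"] for e in events)
```

Reporting the median rather than the mean keeps one slow conversion from masking that most signals were routed quickly.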
Failure Modes
Trial harnesses introduce new failure surfaces:
Evidence index corruption: If the queryable evidence structure becomes inconsistent, agents make decisions on incomplete or contradictory information. Mitigation: append-only trial logs with periodic index rebuilds.
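The append-only mitigation can be sketched with a JSON Lines log that is never rewritten, plus an index that is always derived from it. The file name `trials.jsonl` and both function names are assumptions for illustration. The source describes one JSON file per trial, so this shows one possible variant rather than Sibyl's actual layout.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def append_trial(log_path, record):
    """Append one trial record as a JSON line; existing lines are never modified."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def rebuild_index(log_path):
    """Recompute the hypothesis -> trial ids index from the log alone."""
    index = defaultdict(list)
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                index[record["hypothesis"]].append(record["trial_id"])
    return dict(index)


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "trials.jsonl"
    append_trial(log_path, {"trial_id": "trial_001", "hypothesis": "hypothesis_a"})
    append_trial(log_path, {"trial_id": "trial_002", "hypothesis": "hypothesis_a"})
    append_trial(log_path, {"trial_id": "trial_003", "hypothesis": "hypothesis_b"})
    index = rebuild_index(log_path)
```

Because the index is a pure function of the log, a corrupted index can be discarded and regenerated without losing any trial.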
Conversion latency: If trial signals take too long to route into behavior changes, agents waste iterations. Mitigation: explicit conversion gates that block progression until key signals are processed.
Harness update loops: If trial-to-harness-behavior conversion is too aggressive, the system can thrash by constantly updating its own validation logic. Mitigation: require multiple failure instances before triggering harness updates.
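A simple way to require multiple failure instances is a per-failure counter that only fires at a threshold and then resets. The class name `HarnessUpdateGate` and the default of three failures are assumptions. The default echoes the "fails three times" example earlier in the article, but the source does not specify a threshold policy.

```python
from collections import Counter


class HarnessUpdateGate:
    """Allow a harness update only after a failure has recurred min_failures times."""

    def __init__(self, min_failures=3):
        self.min_failures = min_failures
        self.counts = Counter()

    def observe(self, failure_key):
        """Record one failure and return True if a harness update should be triggered."""
        self.counts[failure_key] += 1
        if self.counts[failure_key] >= self.min_failures:
            # Reset so the next update again needs fresh evidence.
            self.counts[failure_key] = 0
            return True
        return False


gate = HarnessUpdateGate(min_failures=3)
decisions = [gate.observe("validation_gate") for _ in range(4)]
```

Resetting the counter after each update is the part that limits thrashing: a single new failure right after an update cannot trigger another one.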
Memory bloat: Preserving all negative results can overwhelm storage and query performance. Mitigation: time-based or relevance-based pruning with explicit preservation rules for high-value failures.
Deployment Shape
A production trial harness system needs:
- Persistent storage: File-backed or database-backed trial logs that survive agent restarts
- Query layer: Fast lookups for “similar trials” and “related failures”
- Conversion scheduler: Background process that evaluates trial outcomes and routes signals to behavior modules
- Audit interface: UI or CLI for inspecting conversion events and failure registries
- Harness versioning: Track which versions of the validation gates, scheduling policies, and critique prompts were active for each trial
The file-backed approach in Sibyl works for single-agent research projects. Multi-agent research teams would need a shared evidence index with conflict resolution for concurrent trial execution.
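For the harness versioning requirement, one lightweight option is to stamp each trial with a content hash of the harness files that were active when it ran. The sketch below hashes in-memory file contents so that it stays self-contained. The function name `harness_version` and the 12-character truncation are assumptions, not part of Sibyl.

```python
import hashlib
from typing import Mapping


def harness_version(files: Mapping[str, bytes]) -> str:
    """Derive a short, deterministic version id from harness file names and contents."""
    digest = hashlib.sha256()
    for name in sorted(files):  # sorted so that dict ordering cannot change the id
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()[:12]


# Hypothetical file contents standing in for the harness/ directory.
harness_v1 = {
    "validation_gates.py": b"MIN_CONFIDENCE = 0.5\n",
    "critique_prompts.json": b'{"style": "strict"}\n',
}
harness_v2 = dict(harness_v1, **{"validation_gates.py": b"MIN_CONFIDENCE = 0.7\n"})

version_1 = harness_version(harness_v1)
version_2 = harness_version(harness_v2)
```

Storing the version id in each trial record lets an audit distinguish "the hypothesis got weaker" from "the validation gate got stricter".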
Security Boundaries
Autonomous research agents that update their own harness logic create privilege escalation risks:
- Validation bypass: Agent modifies validation gates to accept weak evidence
- Memory poisoning: Agent writes false failure records to avoid certain research paths
- Resource exhaustion: Agent schedules unbounded experiments that consume compute budget
Mitigation strategies:
- Separate harness update permissions from trial execution permissions
- Require human approval for harness modifications above a risk threshold
- Implement resource quotas at the trial level with hard limits
- Log all harness updates with diff tracking and rollback capability
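Several of these mitigations reduce to small, explicit checks. The sketch below illustrates a hard per-trial quota and a permission check for harness updates. The role name `harness_maintainer`, the risk score, and the `risk_threshold` default are hypothetical placeholders rather than anything defined by Sibyl.

```python
from dataclasses import dataclass


class QuotaExceeded(RuntimeError):
    pass


class PermissionDenied(RuntimeError):
    pass


@dataclass
class TrialBudget:
    limit_seconds: float
    used_seconds: float = 0.0

    def charge(self, seconds: float) -> None:
        """Hard limit: refuse any charge that would exceed the trial's budget."""
        if self.used_seconds + seconds > self.limit_seconds:
            raise QuotaExceeded(f"would exceed the {self.limit_seconds}s limit")
        self.used_seconds += seconds


def authorize_harness_update(role, risk, approved_by_human, risk_threshold=0.5):
    """Harness updates need a separate role, plus human approval above the threshold."""
    if role != "harness_maintainer":
        raise PermissionDenied("trial executors cannot modify the harness")
    if risk > risk_threshold and not approved_by_human:
        raise PermissionDenied("high-risk update requires human approval")
    return True
```

The point of raising exceptions rather than returning flags is that a misbehaving agent cannot ignore the result and proceed anyway.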
Technical Verdict
Use trial harnesses when:
- Your research agents run multiple iterations on the same problem space
- Negative results contain valuable information for future planning
- You need to audit why an agent chose a particular research direction
- Process failures (validation, critique, scheduling) recur across iterations
Avoid trial harnesses when:
- You’re running one-shot experiments with no iteration
- Storage and query overhead outweigh the value of preserved failures
- Your agents don’t have planning or validation stages that could consume trial signals
- You can’t afford the engineering cost of building conversion routing and evidence indexing
The core insight is that research judgment requires memory of what didn’t work. If your autonomous research system treats failed experiments as errors to discard, it will never learn to avoid unproductive paths. Trial harnesses turn negative results into queryable infrastructure that shapes future behavior.