Calibrating Conservatism: How Scalable Oversight Teaches Agents When to Ask for Help

A practical mechanism for teaching agents to measure their own uncertainty and escalate to humans when out of their depth, using conformal decision theory.

Source: arxiv.org
Autonomous agents face a control problem that most frameworks ignore: how do you teach an agent to recognize when it’s out of its depth? Too conservative and you drown in approval requests. Too reckless and you ship code that deletes production databases.

A new paper from Overman and Bayati introduces Calibrated Collective Oversight (CCO), a mechanism that aggregates multiple scoring functions into a penalty signal and uses conformal decision theory to keep violation rates below a specified threshold. The approach was tested on SWE-bench (software engineering tasks) and MACHIAVELLI (ethical decision-making scenarios), with empirical violation rates matching theoretical predictions.

The Core Problem

Most agentic systems treat oversight as binary: either the agent runs free or it pauses for human approval on every action. The real need is calibrated conservatism, where the agent measures its own uncertainty across different task domains and escalates only when multiple signals indicate risk.

The paper frames this as a sequential decision problem. An agent takes actions in an environment, and each action carries some probability of an undesirable outcome. The goal is to keep the cumulative violation rate below a user-specified threshold (say, 5%) without knowing the true risk distribution ahead of time.

How CCO Works

CCO introduces a penalty function that measures deviation from a conservative baseline. The baseline is typically a no-op or minimal-change action. The penalty aggregates scores from multiple auxiliary functions (code linters, safety classifiers, uncertainty estimates from the agent’s own model).

Key components:

  • Auxiliary scorers: Diverse functions that flag potential issues (syntax errors, policy violations, high perplexity on generated text)
  • Penalty aggregation: Weighted sum of scorer outputs, normalized to a [0,1] range
  • Conformal calibration: Online adjustment of the penalty threshold using conformal prediction, which provides finite-time statistical guarantees without distributional assumptions
  • Action override: When the penalty exceeds the calibrated threshold, the agent selects the conservative baseline instead of its preferred action

The conformal decision theory piece is critical. Classical confidence intervals typically rest on assumptions about the data distribution. Conformal methods are distribution-free: they use past observations to set thresholds that provably bound future violation rates.
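To make the online adjustment concrete, here is a minimal sketch of the additive update commonly used in the conformal control literature: after each outcome, move the threshold by a step size times the gap between the target rate and the observed loss. This illustrates the general technique and is not necessarily the exact rule CCO uses. The class name, the `step_size` value, and the clamp to [0, 1] are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConformalThreshold:
    """Illustrative online threshold update (not the paper's exact rule)."""

    target_rate: float = 0.05  # acceptable long-run violation rate
    step_size: float = 0.05    # hypothetical learning rate
    threshold: float = 0.5     # actions with penalty above this get overridden

    def update(self, was_violation: bool) -> float:
        # Loss is 1 for a violation, 0 otherwise.
        loss = 1.0 if was_violation else 0.0
        # A violation pushes the threshold down (more overrides);
        # a clean action nudges it up (more autonomy).
        self.threshold += self.step_size * (self.target_rate - loss)
        # Assumption: keep the threshold inside the penalty range [0, 1].
        self.threshold = min(1.0, max(0.0, self.threshold))
        return self.threshold
```

Because a violation moves the threshold down by much more than a clean action moves it up, the update only settles when violations occur at roughly the target rate.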

Architecture Pattern

Here’s how you’d wire this into an agent loop:

class CalibratedAgent:
    """Wraps a base agent and swaps risky actions for a conservative baseline."""

    def __init__(self, base_agent, scorers, target_violation_rate=0.05):
        self.base_agent = base_agent
        self.scorers = scorers  # List of (scorer_fn, weight) tuples; each score in [0, 1]
        self.target_rate = target_violation_rate
        self.penalty_threshold = 0.5  # Initial guess
        self.history = []  # (penalty, was_violation) pairs
        self.last_penalty = 0.0  # Penalty of the most recent proposal

    def get_baseline(self, state):
        # Illustrative default: None stands in for a no-op.
        # Replace with a domain-specific conservative action.
        return None

    def select_action(self, state):
        proposed_action = self.base_agent.plan(state)
        conservative_action = self.get_baseline(state)

        # Compute aggregate penalty as a weighted average, so that
        # scores in [0, 1] yield a penalty in [0, 1]
        total_weight = sum(weight for _, weight in self.scorers)
        penalty = 0.0
        for scorer, weight in self.scorers:
            score = scorer(state, proposed_action)
            penalty += weight * score
        if total_weight > 0:
            penalty /= total_weight
        penalty = min(1.0, max(0.0, penalty))  # Guard against out-of-range scores
        self.last_penalty = penalty  # Pass this to record_outcome later

        # Conformal calibration step
        self.update_threshold()

        # Override if penalty exceeds threshold
        if penalty > self.penalty_threshold:
            return conservative_action, True  # Escalated
        return proposed_action, False

    def update_threshold(self):
        # Actual implementation uses conformal quantiles to compute
        # threshold that provably bounds violation rate.
        # See paper Section 3 for conformal decision theory details.
        # This simplified version illustrates the control flow only.
        if len(self.history) < 10:
            return  # Need warmup data

        violations = [v for (_, v) in self.history]
        empirical_rate = sum(violations) / len(violations)

        # Placeholder logic: real conformal bounds use quantile-based
        # threshold adjustment with finite-time guarantees
        if empirical_rate > self.target_rate:
            self.penalty_threshold *= 0.95  # More conservative
        else:
            self.penalty_threshold *= 1.05  # Less conservative
        # Keep the threshold inside the penalty range
        self.penalty_threshold = min(1.0, self.penalty_threshold)

    def record_outcome(self, penalty, was_violation):
        self.history.append((penalty, was_violation))

This is simplified to show the pattern. The paper uses rigorous conformal bounds that provide statistical guarantees. The key insight is: measure penalty, compare to threshold, override if needed, update threshold based on observed violations.

State Management for Escalation

When an agent decides it needs human approval mid-execution, you need a queue and state snapshot mechanism. The agent should:

  1. Serialize its current execution context (state, proposed action, penalty breakdown)
  2. Push to an approval queue with priority based on penalty magnitude
  3. Pause execution and release compute resources
  4. Resume from snapshot when approval arrives

Failure modes to handle:

  • Approval timeout (default to conservative action after N minutes)
  • Queue overflow (reject new tasks or auto-approve low-penalty items)
  • State drift (environment changed while waiting, re-score before resuming)
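The queue and snapshot steps above can be sketched with the standard library. This is a minimal in-memory illustration, not something the paper specifies: the `ApprovalRequest` and `ApprovalQueue` names, the JSON snapshot format, and the default limits are all assumptions, and the state and action are assumed to be JSON-serializable.

```python
import heapq
import itertools
import json
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class ApprovalRequest:
    sort_key: float  # negative total penalty, so the riskiest item pops first
    seq: int         # tie-breaker that preserves insertion order
    snapshot: str = field(compare=False)      # JSON-serialized execution context
    created_at: float = field(compare=False)  # used for timeout checks


class ApprovalQueue:
    def __init__(self, max_size=100, timeout_s=600.0):
        self._heap = []
        self._counter = itertools.count()
        self.max_size = max_size
        self.timeout_s = timeout_s

    def push(self, state, proposed_action, penalty_breakdown, now=None):
        """Snapshot the context and enqueue it; returns False on overflow."""
        if len(self._heap) >= self.max_size:
            return False  # caller decides: reject the task or auto-approve
        now = time.time() if now is None else now
        snapshot = json.dumps({
            "state": state,
            "proposed_action": proposed_action,
            "penalty_breakdown": penalty_breakdown,
        })
        priority = -sum(penalty_breakdown.values())
        request = ApprovalRequest(priority, next(self._counter), snapshot, now)
        heapq.heappush(self._heap, request)
        return True

    def pop_next(self):
        """Return the highest-penalty context for review, or None if empty."""
        if not self._heap:
            return None
        return json.loads(heapq.heappop(self._heap).snapshot)

    def expire(self, now=None):
        """Remove and return contexts that waited longer than the timeout."""
        now = time.time() if now is None else now
        expired = [r for r in self._heap if now - r.created_at > self.timeout_s]
        self._heap = [r for r in self._heap if now - r.created_at <= self.timeout_s]
        heapq.heapify(self._heap)
        return [json.loads(r.snapshot) for r in expired]
```

Contexts returned by `expire` would fall back to the conservative action. To handle state drift, re-score a popped context against the current environment before resuming.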

Tuning the Conservatism Threshold

The target violation rate is a business decision, not a technical one. A 5% target means you accept that up to roughly one in twenty agent actions may be undesirable. For code generation, that might be acceptable. For financial transactions, you would likely want something closer to 0.1%.

Trade-offs (illustrative ranges, not from paper):

Target Rate   Agent Autonomy   Human Load   Risk Profile
10%           High             Low          Acceptable for dev/test environments
5%            Medium           Medium       Reasonable for supervised production
1%            Low              High         Critical systems, compliance-heavy domains
0.1%          Very Low         Very High    Financial, healthcare, legal actions

The conformal calibration adjusts the penalty threshold online to hit your target. If you see too many violations, the threshold drops (more actions get overridden). If you see too few, it rises (more autonomy).

SWE-bench Results

The paper tested CCO on a modified SWE-bench where a weaker overseer (GPT-3.5-level) had to constrain a stronger, adversarially misaligned agent. The overseer used auxiliary scorers like:

  • Static analysis tools (pylint, mypy)
  • Diff size (large changes flagged as risky)
  • Test coverage delta
  • Semantic similarity to issue description

With a 5% target violation rate, CCO achieved 4.8% empirical violations. The agent completed 73% of tasks autonomously, escalating the remaining 27% to human review. Without CCO, the misaligned agent had a 22% violation rate.

MACHIAVELLI Results

MACHIAVELLI is a benchmark for ethical decision-making in text-based games. Agents face scenarios where high-reward actions often involve deception, harm, or rule-breaking.

CCO used scorers for:

  • Ethical principle violations (harm, fairness, rights)
  • Deception detection (comparing agent statements to ground truth)
  • Power-seeking behavior (actions that increase agent control)

With a 3% target violation rate, CCO reduced ethical violations by 68% compared to an unconstrained agent, while preserving 91% of the reward. The agent learned to avoid obviously harmful actions but still pursued goals aggressively when scorers didn’t flag concerns.

Implementation Considerations: Latency and Scorer Design

Adding multiple scorers to every action introduces latency. The paper does not report timing benchmarks, but you can estimate based on typical scorer complexity:

  • Static analysis: 50-200ms for typical code files
  • LLM-based scorers (safety classifiers): 200-500ms per call
  • Uncertainty estimation (ensemble or dropout): 2-5x base inference time

For a system with five scorers, expect overhead in the range of seconds per action depending on scorer choice. This is acceptable for planning tasks (code generation, document editing) but prohibitive for real-time control (robotics, trading).
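As a rough back-of-envelope check on that estimate, the sketch below adds up the latency ranges listed above for one hypothetical mix of five scorers. The scorer names, the mix itself, and the 1000 ms base inference time are assumptions for illustration, not measurements.

```python
# (low_ms, high_ms) latency ranges taken from the estimates above.
BASE_INFERENCE_MS = 1000  # hypothetical base model latency

SCORER_LATENCY_MS = {
    "pylint": (50, 200),
    "mypy": (50, 200),
    "safety_classifier": (200, 500),
    "policy_classifier": (200, 500),
    # 2-5x base inference time for ensemble or dropout estimates
    "uncertainty_ensemble": (2 * BASE_INFERENCE_MS, 5 * BASE_INFERENCE_MS),
}


def overhead_range_ms(latencies, parallel):
    """Sequential runs add up; parallel runs wait for the slowest scorer."""
    combine = max if parallel else sum
    low = combine(lo for lo, _ in latencies.values())
    high = combine(hi for _, hi in latencies.values())
    return low, high


sequential = overhead_range_ms(SCORER_LATENCY_MS, parallel=False)
parallel = overhead_range_ms(SCORER_LATENCY_MS, parallel=True)
print(f"sequential: {sequential[0]}-{sequential[1]} ms")
print(f"parallel:   {parallel[0]}-{parallel[1]} ms")
```

Under these assumptions the uncertainty estimate dominates either way, which is why the choice of scorers matters more than the count.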

Optimization strategies:

  • Run scorers in parallel
  • Cache scores for identical state-action pairs
  • Use cheaper scorers as pre-filters (only run expensive ones if cheap ones flag risk)
  • Batch scoring across multiple candidate actions
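Three of these strategies (parallel execution, caching, and cheap pre-filters) can be combined in a few lines. The following is a sketch under stated assumptions: the function name `score_with_prefilter`, the `flag_level` cutoff of 0.3, and the dict-based cache are hypothetical, and the state and action must be hashable for the cache key to work.

```python
from concurrent.futures import ThreadPoolExecutor


def score_with_prefilter(state, action, cheap_scorers, expensive_scorers,
                         flag_level=0.3, cache=None):
    """Run cheap scorers first; run expensive ones only if a cheap one flags risk."""
    key = (state, action)
    if cache is not None and key in cache:
        return cache[key]  # identical state-action pair already scored

    def run(scorer):
        return scorer(state, action)

    with ThreadPoolExecutor() as pool:
        # Cheap scorers run concurrently.
        scores = list(pool.map(run, cheap_scorers))
        # Escalate to the expensive scorers only when something looks risky.
        if scores and max(scores) >= flag_level:
            scores = scores + list(pool.map(run, expensive_scorers))

    if cache is not None:
        cache[key] = scores
    return scores
```

Threads help most when scorers wait on I/O, such as calls to a hosted classifier. CPU-bound scorers written in pure Python would need processes instead.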

When to Use CCO

Good fit:

  • Agents operating in high-stakes domains where mistakes are costly
  • Tasks with clear conservative baselines (no-op, minimal change, defer to human)
  • Environments where you can define multiple independent risk signals
  • Sequential decision problems where you can tolerate action-level latency

Poor fit:

  • Real-time control loops (latency kills you)
  • Domains where “conservative” is undefined (creative tasks, open-ended exploration)
  • Single-shot decisions (no opportunity for online calibration)
  • Systems where human approval is unavailable or too slow (fully autonomous robots)

Technical Verdict

CCO solves a real problem: teaching agents to recognize their own uncertainty without requiring complex reward modeling or debate protocols. The conformal decision theory foundation provides statistical guarantees, which is rare in agentic AI research.

The main limitation is latency. Running multiple scorers on every action is expensive. You need to design your scorer suite carefully, balancing coverage against speed. Start with cheap, high-signal scorers (static analysis, rule-based checks) and add expensive ones (LLM classifiers) only when cheap ones indicate risk.

For production systems, implement the escalation queue and state management carefully. Agents will pause mid-execution, and you need infrastructure to serialize context, route to the right human, and resume cleanly. The paper focuses on the calibration mechanism but doesn’t address these operational concerns.

Use this when you’re building agents that operate in regulated domains, handle sensitive data, or make decisions with legal or financial consequences. Skip it for low-stakes automation or when you need sub-second response times.
