Locally Coherent, Globally Incoherent: How Multi-Component LLM Agents Violate Probability Axioms

Multi-component agent architectures split reasoning across specialized LLMs. One retrieves context, another plans, a third evaluates. Each component produces locally valid probabilistic claims. The orchestration layer assembles them into a joint response. The problem: the assembled output can violate basic probability axioms even when every component passes validation.

A new paper formalizes this failure mode and introduces a runtime metric to detect it. The compositional residual (ε*) measures the L2 distance from the composed output to the nearest point in the joint coherent polytope. Across 1,876 ensemble cliques built from four mid-tier LLMs, the paper found ε* > 0 on 33-94% of cliques. Any ε* > 0 means the assembled claims cannot all hold under a single valid joint distribution. Once ε* exceeds a working threshold (the paper uses 0.05 for betting tasks), the violation is large enough that the output should be flagged or repaired.

The Failure Mode

Multi-component agents fail when components see different slices of a joint probability space. Each component produces a marginal distribution that looks correct in isolation. The orchestration layer concatenates or merges these marginals without checking whether they can coexist in a valid joint distribution.

Probability axioms require:

Non-negativity: P(A) ≥ 0 for all events A
Normalization: P(Ω) = 1 where Ω is the sample space
Additivity: P(A ∪ B) = P(A) + P(B) for disjoint events A, B

Multi-component compositions can violate these axioms when components make incompatible independence assumptions or condition on overlapping context with different priors.

Example scenario:

Component A (retrieval): “There’s a 70% chance the user wants product X.”
Component B (reasoning): “Given product X, there’s an 80% chance the user needs feature Y.”
Component C (planning): “There’s a 90% chance feature Y requires configuration Z.”

The orchestrator combines these probabilities using a product rule (0.7 × 0.8 × 0.9 = 0.504) to score decision paths. But if the components made incompatible independence assumptions (for example, if B and C both condition on overlapping retrieval context but assume different priors), the joint distribution violates probability axioms. The output claims can sum to more than 1.0 or assign negative probability mass.
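To make the incompatibility concrete, here is a minimal sketch that tests the first two claims against a third one. The third claim (a marginal P(Y) = 0.5 reported by some other component) and the helper name `frechet_bounds` are assumptions added for illustration, not taken from the paper. The check relies on the Fréchet bounds, which any valid joint distribution must satisfy: max(0, P(A) + P(B) - 1) ≤ P(A ∩ B) ≤ min(P(A), P(B)).

```python
def frechet_bounds(p_a: float, p_b: float) -> tuple[float, float]:
    """Range that P(A and B) must fall in for any valid joint distribution."""
    return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)

# Claims from the scenario above.
p_x = 0.7            # Component A: P(X)
p_y_given_x = 0.8    # Component B: P(Y | X)

# Hypothetical extra claim (an assumption for this example):
# another component reports the marginal P(Y) = 0.5.
p_y = 0.5

# Chain rule: P(X and Y) = P(X) * P(Y | X).
p_xy = p_x * p_y_given_x

lower, upper = frechet_bounds(p_x, p_y)
is_coherent = lower <= p_xy <= upper

print(f"implied P(X and Y) = {p_xy:.2f}, allowed range = [{lower:.2f}, {upper:.2f}]")
print("coherent" if is_coherent else "incoherent: no joint distribution fits all three claims")
```

Each claim is a valid probability on its own. Together they imply that X and Y co-occur more often than Y occurs at all, which no joint distribution can satisfy.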

Why local validation misses this:

Each component’s output is a valid probability distribution over its local variables.
Standard multi-agent evaluation checks per-component accuracy, retrieval precision, or planning success rate.
No validation step checks whether the assembled claims form a coherent joint distribution.

Compositional Residual: Runtime Detection

The paper introduces ε* (epsilon-star), the compositional residual. It measures how far the assembled output is from the nearest valid joint distribution in the coherent polytope.

Calculation:

Extract probabilistic claims from each component’s output (confidence scores, probability estimates, weighted predictions).
Represent the composed output as a point in probability space.
Compute the L2 distance from the composed output to the coherent polytope (the set of all valid joint distributions respecting the declared cross-component coupling constraints).

Interpretation:

ε* = 0: The composition is globally coherent.
ε* > 0: The composition violates probability axioms. Higher values indicate worse violations.

The metric is computable at runtime from system output and the declared coupling constraints between components. You do not need ground truth labels.
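The distance computation can be sketched for the simplest possible coherent set. The example below assumes the only constraint is that a handful of mutually exclusive, exhaustive outcomes must form a probability distribution, so the coherent set is the probability simplex. The paper's polytope is defined by cross-component coupling constraints and will generally be more complex. The sort-based simplex projection is a standard technique swapped in for illustration, and the names `project_to_simplex` and `toy_residual` are hypothetical.

```python
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sort-based method)."""
    n = v.size
    u = np.sort(v)[::-1]                  # values in descending order
    css = np.cumsum(u)                    # running sums of the sorted values
    ks = np.arange(1, n + 1)
    # Last position (in sorted order) that stays positive after the shift.
    rho = np.nonzero(u * ks > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)  # uniform shift applied to the kept entries
    return np.maximum(v - theta, 0.0)

def toy_residual(claims: np.ndarray) -> float:
    """L2 distance from the claims to the nearest valid distribution."""
    return float(np.linalg.norm(claims - project_to_simplex(claims)))

# Two components each assign a probability to one of two exclusive outcomes.
claims = np.array([0.7, 0.5])             # sums to 1.2, so not a distribution
nearest = project_to_simplex(claims)
eps = toy_residual(claims)

print("nearest coherent point:", nearest)
print("residual:", round(eps, 4))
```

No ground truth enters the calculation: the residual depends only on the claims and the constraint set.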

The paper also introduces a Rayleigh-quotient predictor that estimates ε* from component covariance structure before assembly. This predictor matched observed residuals within 7% on three of four relation classes tested. Performance degraded on the fourth class, suggesting domain-specific tuning may be required for reliable proactive coherence checks during orchestration design.

When Local Coherence Suffices

The paper identifies a product-structure dichotomy. Local coherence guarantees global coherence when components operate on independent subproblems. If your agent architecture satisfies all three of the following conditions, you can skip global coherence checks:

Independence: Each component reasons over disjoint variables.
No shared state: Components do not condition on overlapping context.
Deterministic orchestration: The merge operation is a simple concatenation or product, not a learned aggregator.

Most production agent architectures violate at least one condition. Retrieval and reasoning components share context. Planning components condition on retrieval outputs. Aggregator LLMs merge outputs using learned heuristics.
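The favorable case is easy to verify numerically. The sketch below assumes two components that reason over disjoint, independent variables, and merges them with a plain product. The variable names and values are illustrative and not from the paper.

```python
import numpy as np

# Two locally coherent marginals over disjoint variables (illustrative values).
retrieval = np.array([0.7, 0.3])   # P(product_x), P(product_y)
planning = np.array([0.9, 0.1])    # P(config_z), P(config_w)

# Under independence, the joint distribution is the outer product.
joint = np.outer(retrieval, planning)

# The product of valid marginals is always a valid joint distribution.
is_valid = bool(np.all(joint >= 0) and np.isclose(joint.sum(), 1.0))

# Summing out either variable recovers the original component output.
recovers_marginals = bool(
    np.allclose(joint.sum(axis=1), retrieval)
    and np.allclose(joint.sum(axis=0), planning)
)

print(joint)
print("valid joint:", is_valid, "| marginals recovered:", recovers_marginals)
```

Once components share variables or context, a product merge no longer comes with this guarantee, which is where the residual check becomes necessary.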

Architectural Patterns to Prevent Violations

Pattern	How It Helps	Trade-off
Explicit coupling constraints	Declare which variables each component can condition on. Enforce constraints during orchestration.	Requires upfront design. Adds orchestration overhead for constraint validation.
Hierarchical projection	Use Boyle-Dykstra projection to repair the composition deterministically. Projects the assembled output onto the coherent polytope.	Adds latency per request depending on constraint complexity. May change component outputs in high-incoherence cases.
Sequential coherence monitoring	Use an anytime-valid e-process to monitor ε* over time. Reject or flag outputs when ε* exceeds a threshold.	Requires threshold tuning (paper suggests 0.05 for betting tasks). Increased rejection rate (8-12% in paper’s experiments).
Single-component fallback	Route requests to a single large model when ε* is high. Avoid composition entirely for high-risk queries.	Loses specialization benefits. Increases cost per query for frontier models.

The e-process monitoring treats ε* as a test statistic in a sequential hypothesis test. It accumulates evidence against the null hypothesis (composition is coherent) and triggers an alert when the accumulated evidence, an e-value that plays the role of a likelihood ratio, exceeds a predefined threshold. Because the error guarantee holds no matter when you stop, monitoring can run continuously without fixing a sample size in advance.
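One way such a monitor could look is sketched below. This is not the paper's construction: it swaps in a simple betting-style e-process over a binary signal (did ε* exceed the threshold on this request?) and tests the null hypothesis that the exceedance rate is at most `p0`. The class name and the parameters `p0`, `lam`, and `alpha` are all assumptions chosen for illustration. Under the null, the running product is a nonnegative supermartingale, so by Ville's inequality the chance it ever reaches 1/α is at most α.

```python
import math

class ExceedanceEProcess:
    """Betting-style e-process for H0: P(eps* > threshold) <= p0 (illustrative)."""

    def __init__(self, p0: float = 0.10, lam: float = 2.0, alpha: float = 0.01):
        if not 0.0 < p0 < 1.0:
            raise ValueError("p0 must be in (0, 1)")
        if not 0.0 <= lam < 1.0 / p0:
            raise ValueError("lam must be in [0, 1/p0) so every factor stays positive")
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must be in (0, 1)")
        self.p0 = p0
        self.lam = lam
        self.log_threshold = math.log(1.0 / alpha)
        self.log_e = 0.0          # log of the accumulated e-value
        self.alarmed = False      # stays True after the first crossing

    def update(self, residual: float, eps_threshold: float = 0.05) -> bool:
        """Fold in one request's residual; return True once the alarm has fired."""
        x = 1.0 if residual > eps_threshold else 0.0
        # Each factor has expectation <= 1 under H0, so evidence cannot
        # grow systematically unless exceedances are more frequent than p0.
        self.log_e += math.log(1.0 + self.lam * (x - self.p0))
        if self.log_e >= self.log_threshold:
            self.alarmed = True
        return self.alarmed

monitor = ExceedanceEProcess()
alarms = [monitor.update(r) for r in [0.20, 0.18, 0.25, 0.30, 0.22]]
print(alarms)
```

Working in log space avoids overflow on long streams. A stream of coherent outputs drives the e-value down, so isolated spikes do not trigger an alert on their own.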

Failed Mitigations

The paper tested three intuitive fixes. All of them either failed or made performance worse:

Retrieval augmentation: Giving all components access to the full retrieval context. The paper found that components still made incompatible independence assumptions.
Partition-aware prompting: Explicitly telling each component which variables it should condition on. The paper found that LLMs ignored the instructions or misinterpreted them.
Aggregator LLM: Adding a final LLM to merge component outputs. The paper found that the aggregator introduced new incoherence or amplified existing violations. This is the most notable of the three failures, because a learned merge layer looks like the natural fix. The aggregator sees only the component outputs, not the underlying probability space, so it cannot detect or repair axiom violations.

Implementation: Coherence Check in Orchestration

The Boyle-Dykstra algorithm (Dykstra's projection algorithm) is a standard method for projecting onto an intersection of convex constraint sets. It cycles through the constraints, projecting onto one at a time, and converges to the nearest point in the coherent polytope without requiring explicit constraint matrix inversion. The correction term it keeps for each constraint is what distinguishes it from plain alternating projections, which reach some point in the intersection but not necessarily the nearest one.
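For reference, the textbook form of the iteration is given below. It is not transcribed from the paper. The notation is an assumption: z is the assembled output, C_1, ..., C_m are the closed convex constraint sets, P_{C_i} is Euclidean projection onto C_i, and p_i is the correction kept for constraint i.

```latex
\begin{aligned}
&x \leftarrow z, \qquad p_i \leftarrow 0 \quad (i = 1, \dots, m) \\
&\text{repeat until } x \text{ stops changing, cycling over } i = 1, \dots, m: \\
&\qquad y \leftarrow x + p_i, \qquad
        x \leftarrow P_{C_i}(y), \qquad
        p_i \leftarrow y - x
\end{aligned}
```

The iterates converge to the projection of z onto the intersection of the C_i. In the code that follows, `corrections[i]` stores the negative of p_i, which is why it is subtracted from the current point before each projection.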

Here’s the structure for adding runtime coherence checks to a multi-component agent orchestrator:

import numpy as np

class CoherenceChecker:
    def __init__(self, coupling_constraints, variable_order):
        """
        coupling_constraints: list of dicts, each specifying:
          - 'components': tuple of (component_i, component_j)
          - 'shared_vars': list of variable indices
          - 'constraint_type': 'sum_to_one' or 'non_negative'
        variable_order: list of (component_id, var_name) tuples 
                       defining global variable ordering
        """
        self.constraints = coupling_constraints
        self.variable_order = variable_order
    
    def compute_residual(self, component_outputs):
        """
        component_outputs: dict mapping component_id to 
        probability distribution (numpy array)
        
        Returns: (residual, repaired_output)
        """
        joint = self._assemble_joint(component_outputs)
        coherent = self._boyle_dykstra_project(joint)
        residual = np.linalg.norm(joint - coherent)
        return residual, coherent
    
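    def _assemble_joint(self, outputs):
        # Flatten component distributions into one joint vector,
        # taking one entry per (component, variable) pair in the
        # declared variable ordering
        flat = []
        position = {}  # next unread index within each component's array
        for comp_id, _var_name in self.variable_order:
            if comp_id in outputs:
                idx = position.get(comp_id, 0)
                flat.append(outputs[comp_id][idx])
                position[comp_id] = idx + 1
        return np.array(flat, dtype=float)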
    
    def _boyle_dykstra_project(self, point, max_iter=100, tol=1e-6):
        """
        Boyle-Dykstra alternating projection onto constraint polytope.
        Maintains incremental corrections to ensure convergence.
        """
        current = point.copy()
        corrections = [np.zeros_like(point) for _ in self.constraints]
        
        for iteration in range(max_iter):
            prev = current.copy()
            
            for i, constraint in enumerate(self.constraints):
                # Subtract previous correction
                temp = current - corrections[i]
                # Project onto constraint hyperplane
                projected = self._project_single_constraint(temp, constraint)
                # Update correction for this constraint
                corrections[i] = projected - temp
                current = projected
            
            if np.linalg.norm(current - prev) < tol:
                break
        
        return current
    
    def _project_single_constraint(self, point, constraint):
        """
        Euclidean projection onto a single constraint set.
        For sum-to-one: shift the specified indices equally so they sum
        to 1 (the nearest point on the hyperplane; rescaling by the sum
        is not a Euclidean projection).
        For non-negative: clip negative values to zero.
        """
        result = point.copy()
        indices = constraint['shared_vars']
        
        if constraint['constraint_type'] == 'sum_to_one':
            excess = result[indices].sum() - 1.0
            result[indices] -= excess / len(indices)
        elif constraint['constraint_type'] == 'non_negative':
            result[indices] = np.maximum(result[indices], 0)
        
        return result

# Example: 3-component retrieval → reasoning → planning pipeline
variable_order = [
    ('retrieval', 'product_x'),
    ('retrieval', 'product_y'),
    ('reasoning', 'feature_a'),
    ('reasoning', 'feature_b'),
    ('planning', 'config_z'),
    ('planning', 'config_w')
]

coupling_constraints = [
    {
        'components': ('retrieval', 'reasoning'),
        # indices [0, 1] correspond to retrieval product_x and product_y
        'shared_vars': [0, 1],
        'constraint_type': 'sum_to_one'
    },
    {
        'components': ('reasoning', 'planning'),
        # indices [2, 3] correspond to reasoning feature_a and feature_b
        'shared_vars': [2, 3],
        'constraint_type': 'sum_to_one'
    },
    {
        'components': ('retrieval', 'reasoning'),
        'shared_vars': [0, 1, 2, 3],
        'constraint_type': 'non_negative'
    }
]

checker = CoherenceChecker(coupling_constraints, variable_order)

component_outputs = {
    'retrieval': np.array([0.7, 0.3]),
    'reasoning': np.array([0.8, 0.2]),
    'planning': np.array([0.9, 0.1])
}

residual, repaired = checker.compute_residual(component_outputs)

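# Both branches yield a flat vector in variable_order, so downstream
# code receives the same type whether or not a repair was applied
if residual > 0.05:  # threshold from paper
    final_output = repaired
else:
    final_output = np.concatenate(
        [component_outputs[c] for c in ('retrieval', 'reasoning', 'planning')]
    )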

Observability: What to Monitor

Unlike latency or token-count metrics, ε* requires declaring coupling constraints upfront and tracking how component interactions violate those constraints. Track these metrics in your orchestration layer:

ε* distribution: Histogram of compositional residuals across requests. Spike indicates systemic composition failure.
Rejection rate: Percentage of requests where ε* exceeds threshold. High rate suggests architectural mismatch.
Per-component ε* contribution: Which component pairs produce the highest residuals. Guides refactoring.
Repair delta: How much the projection changes component outputs. Large deltas indicate aggressive corrections.

Correlate ε* with downstream task accuracy to tune rejection thresholds. The paper suggests a threshold of 0.05 for betting tasks. Tune this based on your domain: 0.02 for high-stakes decisions (financial, medical), 0.10 for exploratory or low-risk queries.
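A minimal sketch of how an orchestration layer might collect these numbers follows. The class `ResidualMonitor` and its field names are hypothetical. It assumes each request yields the assembled joint vector and its repaired counterpart, and it measures repair delta as the largest change to any single entry, which is one reasonable choice among several.

```python
import numpy as np

class ResidualMonitor:
    """Collects per-request residuals and repair deltas (illustrative)."""

    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self.residuals: list[float] = []
        self.repair_deltas: list[float] = []

    def record(self, assembled, repaired) -> None:
        diff = np.asarray(assembled, dtype=float) - np.asarray(repaired, dtype=float)
        self.residuals.append(float(np.linalg.norm(diff)))       # L2 residual
        self.repair_deltas.append(float(np.max(np.abs(diff))))   # largest single change

    def summary(self, bins: int = 10) -> dict:
        if not self.residuals:
            raise ValueError("no requests recorded yet")
        r = np.array(self.residuals)
        counts, edges = np.histogram(r, bins=bins)
        return {
            "count": int(r.size),
            "rejection_rate": float(np.mean(r > self.threshold)),
            "max_repair_delta": float(max(self.repair_deltas)),
            "histogram_counts": counts.tolist(),
            "histogram_edges": edges.tolist(),
        }

monitor = ResidualMonitor(threshold=0.05)
monitor.record([0.7, 0.5], [0.6, 0.4])   # incoherent request, repaired
monitor.record([0.3, 0.7], [0.3, 0.7])   # coherent request, unchanged
stats = monitor.summary()
print(stats["rejection_rate"], stats["max_repair_delta"])
```

Per-component contributions would need the residual split by constraint, which depends on how your coupling constraints are declared and is left out here.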

Failure Modes in Production

Betting and forecasting agents: The paper tested on a betting scenario using a four-LLM panel across 1,876 ensemble cliques. Each clique represented a set of probabilistic forecasts that should form a coherent joint distribution. The experiment resolved 1,770 bets using a proportional allocation rule (bet size proportional to claimed probability).

The paper found that incoherent compositions produced +0.115 nats per bet of regret compared to coherent baselines. The regret collapsed to +0.006 when bettors themselves coherentized (applied the projection before placing bets). This suggests downstream systems may silently correct your agent’s mistakes, hiding the problem until those downstream systems fail.

Multi-step planning: Planning agents that chain retrieval, reasoning, and action selection are particularly vulnerable. Each step conditions on previous outputs. Incoherence compounds across the chain.

Ensemble voting: Aggregating predictions from multiple models via weighted voting can violate axioms if weights are learned independently of the models’ correlation structure.

Technical Verdict

Use compositional coherence checks when:

Your agent architecture splits reasoning across multiple LLMs with overlapping context.
Components produce probabilistic outputs (confidence scores, probability distributions, weighted predictions).
Downstream systems make decisions based on the assembled probabilities (betting, resource allocation, risk assessment).
You can afford the computational overhead of projection (depends on constraint complexity and probability space size).

Avoid or deprioritize when:

Components operate on truly independent subproblems with no shared state.
Your orchestration layer uses deterministic rules (if-then logic, simple concatenation) rather than probabilistic merging.
Downstream systems ignore probabilities and use only the highest-confidence prediction.
You are prototyping and can tolerate occasional incoherent outputs.

The research shows that intuitive fixes (better prompts, learned aggregators) do not work. If you are building multi-component agents that produce probabilistic outputs, you need explicit coherence enforcement. The compositional residual gives you a runtime signal. The projection algorithm gives you a repair mechanism. Both are computable from system outputs without ground truth labels.