mech.app
Dev Tools

Goedel-Architect: How Blueprint Graphs Turn Theorem Proving into a Multi-Agent Dependency Pipeline

Blueprint dependency graphs structure agent task allocation, state management, and proof orchestration in Lean 4 formal verification workflows.

Source: arxiv.org
Goedel-Architect: How Blueprint Graphs Turn Theorem Proving into a Multi-Agent Dependency Pipeline

Goedel-Architect treats formal theorem proving as a dependency resolution problem. Instead of recursively decomposing lemmas until an agent gets stuck, it generates a blueprint: a directed acyclic graph of definitions and lemmas that build toward the target theorem. Agents then prove lemmas in parallel, respecting dependency order. Failed proofs trigger blueprint refinement rather than local retries.

This architecture shifts orchestration from recursive depth-first search to explicit graph traversal. The result is a multi-agent pipeline with clear state boundaries, observable failure modes, and a refinement loop that operates on the entire proof plan rather than individual steps.

Blueprint as Orchestration Primitive

A blueprint is a dependency graph where nodes are formally stated Lean 4 definitions or lemmas, and edges represent logical dependencies. The main theorem sits at the root. Leaf nodes are axioms or previously proven results.

Generation phase:

  • LLM produces blueprint from theorem statement and optional natural language proof sketch
  • Each node gets a formal Lean 4 signature (type, parameters, dependencies)
  • Dependencies are explicitly declared, not inferred during proof attempts

Execution phase:

  • Prover agents work on leaf nodes first (no unproven dependencies)
  • Parallel execution across independent branches
  • Proven nodes unlock dependent nodes for proving

This structure makes task allocation trivial. The scheduler assigns leaf nodes to available agents. When a node closes, its dependents become eligible. No agent needs global state beyond the current blueprint and proven node set.

State Management Across Refinement Loops

The blueprint is mutable global state. Agents read it to understand dependencies and write proof results back. Refinement mutates the graph when proofs fail.

State layers:

LayerScopeMutation Trigger
Blueprint graphGlobal, shared across agentsFailed proof attempts
Proven node setGlobal, append-onlySuccessful proof completion
Agent proof contextLocal to single lemmaTool calls, tactic execution
Lean 4 kernel statePer-node, isolatedType checking, unification

Refinement mechanics:

  • Failed lemma triggers blueprint regeneration
  • LLM receives failure context (error message, attempted tactics, dependency chain)
  • New blueprint may add intermediate lemmas, restate definitions, or restructure dependencies
  • Proven nodes are preserved; only failed branches get rewritten

This creates a feedback loop where proof failures inform structural changes. The system does not retry the same decomposition with different tactics. It changes the decomposition itself.

Proof Orchestration Flow

# Simplified orchestration pseudocode
class BlueprintOrchestrator:
    def __init__(self, theorem, optional_nl_proof):
        self.blueprint = generate_blueprint(theorem, optional_nl_proof)
        self.proven = set()
        self.failed = {}
    
    def execute(self):
        while not self.is_complete():
            # Get all leaf nodes with satisfied dependencies
            ready = [n for n in self.blueprint.nodes 
                     if self.dependencies_proven(n) and n not in self.proven]
            
            # Parallel proof attempts
            results = parallel_map(prove_node, ready)
            
            for node, result in results:
                if result.success:
                    self.proven.add(node)
                else:
                    self.failed[node] = result.error
            
            # Refine if any failures
            if self.failed and not ready:
                self.blueprint = refine_blueprint(
                    self.blueprint, 
                    self.failed, 
                    self.proven
                )
                self.failed.clear()
    
    def dependencies_proven(self, node):
        return all(dep in self.proven for dep in node.dependencies)

Key orchestration properties:

  • No recursive calls within proof attempts
  • Parallelism bounded by blueprint width, not depth
  • Failure handling is centralized, not distributed across agents
  • Refinement operates on graph structure, not proof tactics

Tool-Equipped Prover Component

Each prover agent has access to Lean 4 tactics as tools. The agent does not write complete proofs. It issues tactic calls, observes kernel responses, and iterates until the goal closes or a timeout hits.

Tool interface:

  • apply(theorem): attempt to apply existing result
  • intro(name): introduce hypothesis
  • cases(expr): case split on inductive type
  • simp(lemmas): simplification with specified lemmas
  • rw(equation): rewrite using equality

Execution model:

  • Agent maintains proof state (goals, hypotheses, context)
  • Each tool call updates state via Lean kernel
  • Kernel enforces type safety and logical correctness
  • Agent cannot produce invalid proofs, only fail to find valid ones

This is different from code generation. The agent does not emit a proof script for later verification. It interacts with the kernel in real time, and the kernel rejects invalid moves immediately.

Failure Modes and Observability

Common failure patterns:

  1. Lemma too hard for prover: Agent exhausts tactics without closing goal. Refinement adds intermediate steps.
  2. Incorrect dependency declaration: Lemma depends on unproven result. Blueprint regeneration fixes dependency edges.
  3. Type mismatch in blueprint: Formal statement does not match intended semantics. Refinement restates definition.
  4. Circular dependencies: Blueprint generation creates cycle. Validation catches this before execution.

Observability hooks:

  • Blueprint graph visualization (nodes colored by status: proven, in-progress, failed)
  • Per-node proof trace (tactic sequence, kernel responses, time spent)
  • Refinement diff (which nodes were added, removed, or restated)
  • Dependency chain for failed nodes (what was this lemma supposed to enable?)

The system logs every blueprint mutation. You can replay the refinement history to see how the proof plan evolved. This is critical for debugging because the final blueprint may look nothing like the initial one.

Deployment Shape

Compute requirements:

  • LLM inference for blueprint generation and refinement (DeepSeek-V4-Flash or equivalent)
  • Lean 4 kernel instances (one per prover agent, CPU-bound)
  • Orchestrator process (lightweight, mostly I/O and graph traversal)

Scaling strategy:

  • Horizontal: spin up more prover agents for wider blueprints
  • Vertical: faster LLM inference reduces refinement latency
  • Lean kernel is single-threaded per instance, so parallelism requires multiple processes

State persistence:

  • Blueprint stored as JSON or Lean-native structure
  • Proven nodes cached to avoid re-proving across refinement loops
  • Proof traces logged for audit and replay

You can checkpoint the system mid-execution and resume. The blueprint and proven set are sufficient to reconstruct state. Prover agents are stateless between lemma attempts.

Performance Characteristics

The paper reports 99.2% pass@1 on MiniF2F-test and 75.6% on PutnamBench using DeepSeek-V4-Flash. With natural language proof hints seeding the initial blueprint, it hits 100% on MiniF2F-test and 88.8% on PutnamBench.

What this means for infrastructure:

  • Blueprint generation is the bottleneck for cold starts
  • Refinement loops add latency but improve success rate
  • Parallel proving scales linearly with blueprint width
  • Natural language hints reduce refinement iterations

The system is not real-time. A single theorem can take minutes to hours depending on complexity and refinement depth. But it is deterministic given the same LLM and random seed. You can cache blueprints for common theorem patterns.

When Blueprint Graphs Beat Recursive Decomposition

Traditional autoformalization agents recursively decompose goals until they hit a base case or timeout. This creates deep call stacks and makes it hard to recover from dead ends. Goedel-Architect’s blueprint approach wins when:

  • Proof structure is predictable: Many formal proofs follow standard patterns (induction, case analysis, algebraic manipulation). A blueprint can encode this structure upfront.
  • Parallelism matters: Independent lemmas can be proven simultaneously. Recursive decomposition serializes work.
  • Failure recovery is expensive: Backtracking from a deep recursion wastes work. Blueprint refinement preserves proven nodes.

It loses when:

  • Proof discovery is exploratory: If the path to a theorem is unclear, generating a blueprint upfront is guesswork. Recursive search may find creative solutions faster.
  • Theorems are small: Overhead of blueprint generation and orchestration dominates for single-step proofs.

Technical Verdict

Use Goedel-Architect’s blueprint approach when you are formalizing mathematical results with known proof structures, need to parallelize proof work across multiple agents, or want explicit observability into proof plan evolution. The dependency graph makes orchestration straightforward and failure recovery surgical.

Avoid it for exploratory theorem proving where the proof strategy is unknown, or for simple lemmas where recursive decomposition completes faster than blueprint generation. The upfront planning cost only pays off for complex, multi-step proofs.

The architecture generalizes beyond theorem proving. Any task that decomposes into a dependency graph with formal verification at each node (build systems, data pipelines, compliance checks) could adopt this pattern. The key insight is treating the task graph as mutable state that agents refine based on execution feedback.


Tags

agentic-ai orchestration infrastructure

Primary Source

arxiv.org