SpecBench: Evaluating Agent Requirements Reasoning Before Code Generation

Most SWE agent benchmarks measure code generation against fixed requirements. SpecBench flips the script: it evaluates whether agents can identify flaws in the requirements themselves before any code gets written.

The benchmark draws tasks from real Request for Comments (RFC) processes in mature open-source projects. An agent receives an initial design proposal, the project codebase, and historical RFC discussions. Its job is to surface specification deficiencies: omissions, ambiguities, inconsistencies, or incorrect assumptions that human maintainers caught during review.

This matters because production software engineering is not a one-shot code generation problem. Requirements evolve through multiple review cycles. Agents that can participate in this phase unlock full-lifecycle automation instead of just acting as autocomplete on steroids.

Why Specification-Level Reasoning Is Different

Code generation benchmarks like SWE-Bench assume the spec is correct and complete. You get a GitHub issue, you write a patch, you pass tests. SpecBench targets the phase before that: when the spec itself is still wrong.

Key differences:

No single ground truth: Multiple valid critiques exist for any proposal. Human experts disagree on priority and severity.
Context window stress: Agents must synthesize the proposal, codebase architecture, past RFC discussions, and project conventions.
Subjective evaluation: Unlike test pass/fail, you are comparing agent critiques against what human maintainers actually raised during review.

The benchmark uses historical RFC data from five repositories: Kubernetes, Rust, Python, Envoy, and TiKV. Each task includes the initial proposal and the critiques that maintainers surfaced before accepting or rejecting the design.

Evaluation Mechanics

SpecBench does not measure code correctness. It measures critique quality across four dimensions:

Dimension	What It Measures	Example Failure
Completeness	Does the agent catch all critical omissions?	Missing error handling paths, unspecified edge cases
Ambiguity detection	Does it flag vague or underspecified behavior?	“Handle failures gracefully” without defining retry logic
Consistency	Does it spot contradictions with existing design or prior decisions?	New API violates established naming conventions
Correctness	Does it identify technically incorrect assumptions?	Proposed algorithm has wrong time complexity claim

Scoring is tricky because the ground truth is a set of human-written critiques, not a binary pass/fail. The paper uses overlap metrics: how many of the agent’s critiques match issues that maintainers actually raised (a precision-like measure), and how many maintainer critiques the agent missed (a recall-like measure).

False positives matter too. An agent that flags everything is useless. The benchmark penalizes critiques that maintainers never mentioned and that do not represent real issues upon expert review.
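
The overlap scoring described above can be sketched as precision and recall over matched critiques. This is an illustration, not the benchmark's actual scorer: the `score_critiques` function, the shared issue IDs, and the `expert_validated` set are all assumptions. In practice, deciding that an agent critique "matches" a maintainer critique needs a human or model judge, which this sketch takes as already done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlapScore:
    precision: float  # share of agent critiques that count as real issues
    recall: float     # share of maintainer critiques the agent found
    f1: float


def score_critiques(agent_ids, maintainer_ids, expert_validated=frozenset()):
    """Score agent critiques against maintainer critiques.

    Assumes a judge has already mapped critiques to shared issue IDs.
    `expert_validated` holds agent critiques that maintainers never raised
    but that an expert later confirmed as real, so they are not penalized.
    """
    agent, maintainer = set(agent_ids), set(maintainer_ids)
    matched = agent & maintainer
    valid = matched | (agent & set(expert_validated))
    precision = len(valid) / len(agent) if agent else 0.0
    recall = len(matched) / len(maintainer) if maintainer else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return OverlapScore(precision, recall, f1)


score = score_critiques(
    agent_ids={"missing-retry-policy", "naming-mismatch", "style-nit"},
    maintainer_ids={
        "missing-retry-policy",
        "naming-mismatch",
        "no-migration-plan",
        "wrong-complexity",
    },
)
print(score)
```

Here the agent matches two of four maintainer critiques and raises one unmatched critique, so recall is 0.5 and precision is 2/3. Passing `expert_validated={"style-nit"}` would lift precision to 1.0 without changing recall.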

Architecture Implications

Running a specification agent requires different plumbing than a code generation agent:

State management across review cycles: Real RFC processes involve multiple rounds. An agent needs to track which critiques were addressed, which were deferred, and which spawned new sub-proposals. This is not a stateless prompt-response loop.
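
One way to hold that state is a small record per critique plus a round counter. The sketch below is hypothetical: `ReviewState`, `TrackedCritique`, and the status names are invented for illustration and only mirror the three outcomes named above (addressed, deferred, spawned a sub-proposal).

```python
from dataclasses import dataclass, field
from enum import Enum


class CritiqueStatus(Enum):
    OPEN = "open"
    ADDRESSED = "addressed"
    DEFERRED = "deferred"
    SPAWNED_SUBPROPOSAL = "spawned_subproposal"


@dataclass
class TrackedCritique:
    critique_id: str
    description: str
    status: CritiqueStatus = CritiqueStatus.OPEN
    history: list = field(default_factory=list)  # (round_number, status) pairs


@dataclass
class ReviewState:
    round_number: int = 1
    critiques: dict = field(default_factory=dict)  # critique_id -> TrackedCritique

    def raise_critique(self, critique_id, description):
        critique = TrackedCritique(critique_id, description)
        critique.history.append((self.round_number, CritiqueStatus.OPEN))
        self.critiques[critique_id] = critique

    def resolve(self, critique_id, status):
        critique = self.critiques[critique_id]
        critique.status = status
        critique.history.append((self.round_number, status))

    def next_round(self):
        self.round_number += 1

    def unresolved(self):
        # Critiques still open carry over into the next review round.
        return [
            c.critique_id
            for c in self.critiques.values()
            if c.status is CritiqueStatus.OPEN
        ]


state = ReviewState()
state.raise_critique("c1", "No retry policy defined")
state.raise_critique("c2", "Conflicts with existing naming convention")
state.next_round()
state.resolve("c1", CritiqueStatus.ADDRESSED)
print(state.unresolved())  # ['c2']
```

Keeping the per-critique history makes it possible to replay how a proposal changed between rounds, which a stateless prompt-response loop cannot do.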

Artifact handoff to code agents: If the spec agent refines requirements, how does that flow to the implementation agent? Options include:

Shared context window (expensive, hits token limits fast)
Structured artifact pipeline (spec agent writes schema-conformant JSON, code agent consumes it)
Human-in-the-loop handoff (spec agent proposes, human approves, code agent executes)
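
The second and third options can be combined. The sketch below assumes a hypothetical `SpecArtifact` record that the spec agent serializes to JSON and that the code agent refuses to act on until a human has approved it and no critiques remain open. The field names are illustrative, not a published format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SpecArtifact:
    proposal_id: str
    requirements: list
    open_critiques: list = field(default_factory=list)
    approved_by_human: bool = False


def export_artifact(artifact):
    """Spec agent side: serialize the refined spec to JSON."""
    return json.dumps(asdict(artifact), sort_keys=True)


def load_for_implementation(payload):
    """Code agent side: parse the artifact and reject an unsettled spec."""
    # Unknown or missing fields raise TypeError, which acts as a crude schema check.
    artifact = SpecArtifact(**json.loads(payload))
    if not artifact.approved_by_human:
        raise ValueError("spec has not been approved by a human reviewer")
    if artifact.open_critiques:
        raise ValueError(f"{len(artifact.open_critiques)} critiques still open")
    return artifact


payload = export_artifact(
    SpecArtifact(
        proposal_id="rfc-example",
        requirements=["Retries use exponential backoff"],
        approved_by_human=True,
    )
)
spec = load_for_implementation(payload)
print(spec.requirements)
```

Passing a serialized artifact rather than the whole conversation keeps the code agent's context small, at the cost of losing any nuance that did not make it into the structured fields.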

Tool calls for codebase analysis: Specification reasoning requires understanding existing architecture. The agent needs tools to query dependency graphs, trace API usage, and retrieve past design decisions. Static analysis hooks are mandatory, not optional.
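
As a rough illustration of such a tool, the sketch below answers "what breaks if this module changes?" over a dependency graph. The graph is a hand-written dict and the function names are hypothetical. A real tool would build the graph from the project's build system or a static analyzer.

```python
from collections import defaultdict, deque


def build_reverse_graph(dependencies):
    """Invert {module: [modules it depends on]} into {module: {its dependents}}."""
    reverse = defaultdict(set)
    for module, deps in dependencies.items():
        for dep in deps:
            reverse[dep].add(module)
    return reverse


def affected_modules(dependencies, changed):
    """Return every module that transitively depends on `changed`."""
    reverse = build_reverse_graph(dependencies)
    seen, queue = {changed}, deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen - {changed})


# Toy graph: each key lists the modules it imports.
deps = {
    "api": ["storage", "auth"],
    "cli": ["api"],
    "auth": ["storage"],
    "storage": [],
}
print(affected_modules(deps, "storage"))  # ['api', 'auth', 'cli']
```

Exposed as a tool call, a query like this lets the agent check a proposal's claimed blast radius against the actual code instead of guessing from the proposal text.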

Observability for subjective tasks: You cannot just log “test passed.” You need to capture which critiques the agent raised, which it missed, and why. This requires structured logging of reasoning chains and comparison against human expert annotations.
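
A minimal sketch of such a record, assuming critiques have already been mapped to shared issue IDs and that the agent exposes some textual reasoning trace. The field names and the `spec_agent.eval` logger name are illustrative.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("spec_agent.eval")


def build_eval_record(task_id, agent_critiques, maintainer_critiques, reasoning):
    """Build one structured record comparing agent output to expert annotations."""
    agent, maintainer = set(agent_critiques), set(maintainer_critiques)
    return {
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "raised": sorted(agent),
        "matched": sorted(agent & maintainer),
        "missed": sorted(maintainer - agent),
        "unmatched": sorted(agent - maintainer),
        "reasoning": reasoning,  # free-text trace explaining the critiques
    }


def log_eval_record(record):
    # One JSON object per line keeps the log easy to query later.
    logger.info(json.dumps(record, sort_keys=True))


record = build_eval_record(
    task_id="example-task",
    agent_critiques=["missing-retry-policy", "style-nit"],
    maintainer_critiques=["missing-retry-policy", "no-migration-plan"],
    reasoning="Proposal defines no behavior for transient failures.",
)
log_eval_record(record)
print(record["missed"])  # ['no-migration-plan']
```

The `missed` and `unmatched` fields are the ones worth reviewing by hand, since they show where the agent and the maintainers disagreed.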

Failure Modes

Specification agents fail differently than code agents:

Over-critique: Flagging trivial style issues while missing fundamental design flaws. This happens when the agent lacks project-specific context about what maintainers actually care about.
Hallucinated constraints: Inventing requirements that do not exist in the codebase or project conventions. The agent “remembers” a pattern from its training data that does not apply here.
Missed implicit assumptions: Human experts catch issues based on unwritten knowledge (team norms, operational constraints, political considerations). Agents trained only on public RFC text miss this.
Context window collapse: A large proposal, the full codebase, and the RFC history together exceed token limits. The agent either truncates critical context or fails to synthesize across all inputs.
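
Context window collapse is the failure mode that is easiest to guard against mechanically. The sketch below packs inputs in priority order and reports what was cut, so truncation is at least explicit rather than silent. It is an assumption-laden illustration: `fit_context` is a hypothetical helper, and whitespace-separated words stand in for real tokenizer counts.

```python
def fit_context(sections, max_tokens):
    """Pack sections into a token budget, most important first.

    `sections` is a list of (name, text) pairs in priority order.
    Returns (kept, dropped): the text included per section, and the names
    of sections that were truncated or left out entirely.
    """
    kept, dropped, remaining = {}, [], max_tokens
    for name, text in sections:
        words = text.split()  # crude stand-in for a tokenizer
        if len(words) <= remaining:
            kept[name] = text
            remaining -= len(words)
        elif remaining > 0:
            kept[name] = " ".join(words[:remaining])
            dropped.append(f"{name} (truncated)")
            remaining = 0
        else:
            dropped.append(name)
    return kept, dropped


kept, dropped = fit_context(
    [
        ("proposal", "a b c"),
        ("codebase", "d e f g"),
        ("rfc_history", "h i"),
    ],
    max_tokens=5,
)
print(kept)     # {'proposal': 'a b c', 'codebase': 'd e'}
print(dropped)  # ['codebase (truncated)', 'rfc_history']
```

Surfacing the `dropped` list alongside the critiques lets a reviewer see when a missed issue may simply reflect context the agent never received.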

When This Matters

Specification-level reasoning is not useful for every automation scenario. It matters when:

Requirements are genuinely unclear or contested
Multiple stakeholders need to align before implementation
The cost of building the wrong thing exceeds the cost of extended design review
You are automating contributions to mature open-source projects with formal RFC processes

It does not matter when:

Requirements are already precise (e.g., “fix this bug”)
You are prototyping and iteration is cheap
The agent is operating in a narrow domain with well-established patterns

Implementation Sketch

A minimal specification agent pipeline:

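class SpecificationAgent:
    """Minimal sketch of a specification-critique agent.

    `llm`, `codebase_index`, and `rfc_history` are placeholder interfaces
    supplied by the caller, not the API of any specific library.
    """

    # JSON Schema for the structured critique list the LLM should return
    CRITIQUE_SCHEMA = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "category": {"enum": ["omission", "ambiguity", "inconsistency", "incorrectness"]},
                "severity": {"enum": ["critical", "major", "minor"]},
                "description": {"type": "string"},
                "suggested_fix": {"type": "string"},
            },
            "required": ["category", "severity", "description"],
        },
    }

    def __init__(self, llm, codebase_index, rfc_history):
        self.llm = llm
        self.codebase_index = codebase_index  # Vector store of architecture docs
        self.rfc_history = rfc_history        # Past RFC discussions

    def _build_critique_prompt(self, proposal, codebase_context, past_rfcs):
        # Join retrieved items into one prompt; assumes each item renders via str()
        code_block = "\n".join(str(item) for item in codebase_context)
        rfc_block = "\n".join(str(item) for item in past_rfcs)
        return (
            "Review the design proposal below. List omissions, ambiguities, "
            "inconsistencies, and incorrect assumptions.\n\n"
            f"PROPOSAL:\n{proposal}\n\n"
            f"CODEBASE CONTEXT:\n{code_block}\n\n"
            f"RELATED PAST RFCS:\n{rfc_block}"
        )

    def critique_proposal(self, proposal_text):
        # Retrieve relevant codebase context
        relevant_code = self.codebase_index.query(proposal_text, top_k=10)

        # Retrieve similar past RFCs
        similar_rfcs = self.rfc_history.find_similar(proposal_text, top_k=5)

        # Build critique prompt
        prompt = self._build_critique_prompt(
            proposal=proposal_text,
            codebase_context=relevant_code,
            past_rfcs=similar_rfcs,
        )

        # Generate critiques with structured output; assumes the LLM client
        # accepts a JSON Schema and returns the parsed list of critiques
        critiques = self.llm.generate(
            prompt,
            response_format={"type": "json_schema", "schema": self.CRITIQUE_SCHEMA},
        )

        return critiques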

The key is structured output. You need critiques as data, not prose, so downstream systems can route them to the right reviewers or feed them back into the next iteration.
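
To illustrate the routing point, the sketch below sorts critiques by severity and groups them into reviewer queues by category. The category and severity values come from the schema in the sketch above. The queue names and the `route_critiques` function are hypothetical.

```python
from collections import defaultdict

SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

# Hypothetical reviewer queues; a real project would map these to its own teams.
CATEGORY_QUEUE = {
    "omission": "design-review",
    "ambiguity": "design-review",
    "inconsistency": "api-owners",
    "incorrectness": "domain-experts",
}


def route_critiques(critiques):
    """Group critiques by reviewer queue, most severe first within each queue."""
    queues = defaultdict(list)
    for critique in sorted(critiques, key=lambda c: SEVERITY_ORDER[c["severity"]]):
        queues[CATEGORY_QUEUE[critique["category"]]].append(critique)
    return dict(queues)


critiques = [
    {"category": "ambiguity", "severity": "minor", "description": "Retry behavior undefined"},
    {"category": "inconsistency", "severity": "critical", "description": "API name breaks convention"},
    {"category": "omission", "severity": "major", "description": "No error path for timeouts"},
]
routed = route_critiques(critiques)
for queue, items in routed.items():
    print(queue, [item["description"] for item in items])
```

None of this works on free-form prose without a second parsing step, which is the practical argument for requesting structured output in the first place.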

Technical Verdict

Use SpecBench-style specification agents when:

You are building agents for mature projects with formal design review processes
Requirements ambiguity is a known source of implementation failures
You can afford the latency of multi-round review cycles
You have access to historical RFC or design review data for training and evaluation

Avoid when:

You need fast iteration and can tolerate building the wrong thing
Requirements are already precise and uncontested
You lack the infrastructure to manage multi-turn state and artifact handoff
Your domain has no historical critique data to validate against

The real unlock is not replacing human reviewers. It is surfacing the obvious issues faster so human experts can focus on the subtle judgment calls that actually require expertise.