CVE-Bench: What Happens When You Ask LLM Agents to Fix Real Security Vulnerabilities

CVE-Bench is a benchmark that measures whether LLM agents can fix real security vulnerabilities. It runs 20 real CVEs from Python projects (Pillow, GitPython, yt-dlp, urllib3) through agents built on five models, each under three prompt strategies, and scores every fix against hidden security tests. The results are blunt: the best model reaches 50% overall and 60% with the most informative prompt. The most dangerous failure mode is not a crash or a refusal. It is a plausible patch that passes regression tests but leaves the vulnerability intact.

Why This Matters Now

Anthropic reported finding 1,596 vulnerabilities in open-source software. As of May 2026, 97 have been patched. Discovery is now the easy part. Verification, triage, and patching are the bottleneck. CVE-Bench measures that bottleneck directly by asking agents to produce working fixes, not just identify issues.

The benchmark exposes the gap between agent coding demos and real-world security workflows. Most agent benchmarks test greenfield tasks or synthetic bugs. CVE-Bench uses real CVEs with real exploit tests. The agent must modify vulnerable code, pass hidden security tests, and avoid breaking existing functionality.

Benchmark Architecture

CVE-Bench runs each agent in a sandboxed container. The agent receives a CVE description, access to the vulnerable codebase, and a set of visible regression tests. The security tests remain hidden. The agent must produce a patch that fixes the vulnerability without seeing the exploit validation logic.

Sandboxing and Isolation

Each run executes in a fresh Docker container with:

Read-only access to the original vulnerable codebase
Write access to a working directory for patches
Network isolation (no external API calls during fix generation)
Time limit of 10 minutes per CVE

The agent can read files, run tests, and modify code in its working directory. It cannot see the hidden security test suite. This mirrors real-world conditions where developers do not have access to exploit PoCs during initial remediation.
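The isolation settings listed above could be expressed as a `docker run` invocation. The sketch below only builds the argument list and does not start a container. The image name, the `/src` and `/work` mount points, the agent entry point, and the helper name are all hypothetical, and it assumes the image ships the coreutils `timeout` command. The benchmark's actual harness may configure its containers differently.

```python
from __future__ import annotations

import shlex

TIME_LIMIT_SECONDS = 10 * 60  # 10-minute budget per CVE


def build_sandbox_command(
    image: str, source_dir: str, work_dir: str, agent_cmd: list[str]
) -> list[str]:
    """Build a `docker run` argv reflecting the isolation rules above.

    The /src and /work mount points are illustrative assumptions.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",                 # no network inside the container
        "-v", f"{source_dir}:/src:ro",       # original codebase, mounted read-only
        "-v", f"{work_dir}:/work",           # writable working directory for patches
        "-w", "/work",
        image,
        "timeout", str(TIME_LIMIT_SECONDS),  # coreutils `timeout` caps the run
        *agent_cmd,
    ]


if __name__ == "__main__":
    cmd = build_sandbox_command(
        "cve-bench-sandbox:example",  # hypothetical image name
        "/tmp/cve/src",
        "/tmp/cve/work",
        ["python", "agent.py"],       # hypothetical agent entry point
    )
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a shell string avoids quoting problems when paths contain spaces.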

Hidden Test Architecture

The benchmark separates visible regression tests from hidden security tests. Visible tests validate that the patch does not break existing functionality. Hidden tests validate that the patch actually closes the vulnerability.

This separation keeps agents from overfitting to the specific exploit used in the test. If the agent could see the security test, it could patch that one attack vector without addressing the root cause. Hiding the test forces the agent to reason about the vulnerability class, not just a single exploit.

Prompt Strategies

CVE-Bench tests three prompt types:

Full Advisory: The agent receives the complete CVE description, including vulnerability class, affected versions, and remediation guidance.
Locate: The agent receives only the vulnerability class and must locate the vulnerable code before fixing it.
Diagnose: The agent receives a minimal description and must diagnose the root cause before patching.

Each strategy changes the agent’s tool usage and reasoning path. Full advisory prompts produce faster fixes but sometimes miss edge cases. Locate prompts force more thorough code analysis. Diagnose prompts produce the most thorough reasoning but also the highest failure rate.
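To make the difference between the three strategies concrete, here is a minimal sketch of how the prompts might be assembled. The `Advisory` fields, the `symptom` field used for the minimal description, the strategy names, and the prompt wording are assumptions for illustration, not the benchmark's actual templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    """Hypothetical record of what an advisory might contain."""
    cve_id: str
    vulnerability_class: str
    affected_versions: str
    remediation: str
    symptom: str  # assumed one-line summary used for the minimal prompt


def build_prompt(advisory: Advisory, strategy: str) -> str:
    """Return the task prompt for one of the three strategies."""
    if strategy == "full_advisory":
        # Everything the advisory offers, including remediation guidance.
        return (
            f"{advisory.cve_id}: {advisory.vulnerability_class}\n"
            f"Affected versions: {advisory.affected_versions}\n"
            f"Remediation guidance: {advisory.remediation}\n"
            "Fix the vulnerability."
        )
    if strategy == "locate":
        # Only the vulnerability class; the agent must find the code itself.
        return (
            "This codebase contains a vulnerability of class: "
            f"{advisory.vulnerability_class}.\n"
            "Locate the vulnerable code, then fix it."
        )
    if strategy == "diagnose":
        # Minimal description; no class, no remediation hints.
        return (
            f"Reported problem: {advisory.symptom}\n"
            "Diagnose the root cause, then fix it."
        )
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Each step down the list removes information, which is one way to read the falling solve rates from full advisory to diagnose.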

Scoring Methodology

A fix passes if:

All visible regression tests pass
All hidden security tests pass
The patch applies cleanly to the vulnerable version

A fix fails if:

Any regression test fails (breaks existing functionality)
Any security test fails (vulnerability remains exploitable)
The patch does not apply (syntax errors, wrong file paths)

The benchmark does not score partial fixes. A patch that reduces attack surface but leaves the vulnerability exploitable scores as a failure. This binary scoring reflects production reality: a partially fixed vulnerability is still a vulnerability.
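The pass and fail criteria above reduce to a single boolean. The following sketch shows that rule under the assumption that test outcomes are available as pass counts; the `FixResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixResult:
    """Hypothetical summary of one submitted patch."""
    patch_applies: bool
    regression_passed: int
    regression_total: int
    security_passed: int
    security_total: int


def score_fix(result: FixResult) -> bool:
    """Binary score: every criterion must hold, with no partial credit."""
    return (
        result.patch_applies
        and result.regression_passed == result.regression_total
        and result.security_passed == result.security_total
    )


# A patch that blocks 4 of 5 exploit tests still scores as a failure.
partial = FixResult(True, 120, 120, 4, 5)
complete = FixResult(True, 120, 120, 5, 5)
print(score_fix(partial), score_fix(complete))
```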

Dangerous Failure Modes

The most operationally dangerous failure is the false-positive fix. The agent modifies the right file, passes all visible tests, and reports success, yet the vulnerability remains exploitable through a different code path.

Example from the benchmark: an agent patched one branch of a conditional that checked user input. The patch added validation to the if branch but left the else branch untouched. All regression tests passed because they only exercised the if branch. The security test failed because the exploit used the else branch.

This failure mode is worse than an obvious crash or a refusal. It creates false confidence. A human reviewer sees passing tests and assumes the fix is complete. The vulnerability ships to production.
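The one-branch patch described above can be illustrated with a small, invented example. The function, the upload root, and both checks are hypothetical and are not taken from any CVE in the benchmark. They only show how a regression check can pass while an exploit check still fails.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/srv/uploads")  # hypothetical upload root


def resolve_upload(name: str, legacy: bool = False) -> PurePosixPath:
    """Partially patched resolver: only the `if` branch validates input."""
    if not legacy:
        # Patched branch: reject parent-directory components.
        if ".." in PurePosixPath(name).parts:
            raise ValueError("path traversal rejected")
        return BASE / name
    else:
        # Untouched branch: the name is joined without any validation.
        return BASE / "legacy" / name


def regression_suite_passes() -> bool:
    """Visible check: only exercises the patched branch."""
    return resolve_upload("report.pdf") == BASE / "report.pdf"


def exploit_blocked() -> bool:
    """Hidden check: attacks through the branch the patch never touched."""
    try:
        resolve_upload("../../etc/passwd", legacy=True)
    except ValueError:
        return True
    return False


print("regression tests pass:", regression_suite_passes())
print("exploit blocked:", exploit_blocked())
```

A reviewer who sees only the first result has no signal that the second one fails.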

Results Across Models and Prompts

Model	Full Advisory	Locate	Diagnose	Overall
gpt-5.5	60%	50%	40%	50%
gpt-5.4-mini	55%	45%	35%	45%
gpt-5.4-nano	50%	40%	30%	40%
laguna-m.1	55%	45%	35%	45%
laguna-xs.2	50%	40%	30%	40%

The expensive models are statistically indistinguishable from cheaper alternatives within the same family: on a 20-CVE set, a 5-point gap under a given prompt corresponds to a single fix. gpt-5.5 costs 12x more per run than gpt-5.4-nano but improves the overall solve rate by only 10 percentage points. The cost-per-success ratio favors smaller models for bulk remediation workflows.

Full advisory prompts consistently outperform locate and diagnose prompts by 10-20 percentage points. This suggests that agents benefit from explicit vulnerability class information. They struggle to infer root causes from minimal descriptions.
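One rough way to check the "statistically indistinguishable" claim is to compare confidence intervals on the overall solve rates. The sketch below uses the Wilson score interval and assumes each model was scored once per CVE per prompt, giving 20 x 3 = 60 runs per model. The article does not state the run count, so treat that figure as an assumption. Overlapping intervals are only an informal indication, not a formal significance test.

```python
from __future__ import annotations

import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


RUNS_PER_MODEL = 20 * 3  # assumed: 20 CVEs x 3 prompt strategies, one run each

best = wilson_interval(30, RUNS_PER_MODEL)    # 50% overall (gpt-5.5)
lowest = wilson_interval(24, RUNS_PER_MODEL)  # 40% overall (gpt-5.4-nano)

print(f"50% overall: {best[0]:.3f} to {best[1]:.3f}")
print(f"40% overall: {lowest[0]:.3f} to {lowest[1]:.3f}")
print("intervals overlap:", lowest[1] > best[0])
```

Under that assumption the two intervals overlap substantially, which is consistent with the claim that a 10-point gap at this sample size does not separate the models.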

Implementation Details

The benchmark uses a standard agent loop:

Parse CVE description and extract vulnerability class
Locate vulnerable code using grep, AST analysis, or LLM-guided search
Generate patch using LLM with full file context
Apply patch and run visible regression tests
If tests fail, retry with error context (max 3 retries)
Submit final patch for hidden security test validation
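The loop above can be sketched in a few lines. In this sketch the parse, locate, and generate steps are folded into one `generate_patch` callable, and applying the patch plus running the visible tests into `apply_and_test`. Both callables, the `RegressionOutcome` type, and the choice to count the first attempt separately from the three retries are assumptions, not the benchmark's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_RETRIES = 3  # retry budget stated in the loop description


@dataclass
class RegressionOutcome:
    """Result of applying a patch and running the visible tests."""
    passed: bool
    output: str


def run_agent_loop(
    generate_patch: Callable[[Optional[str]], str],
    apply_and_test: Callable[[str], RegressionOutcome],
) -> str:
    """Return the final patch to submit for hidden security validation."""
    patch = generate_patch(None)  # first attempt has no error context
    outcome = apply_and_test(patch)
    retries = 0
    while not outcome.passed and retries < MAX_RETRIES:
        retries += 1
        # Feed the failing test output back into the next generation step.
        patch = generate_patch(outcome.output)
        outcome = apply_and_test(patch)
    # The last patch is submitted even if visible tests still fail.
    return patch
```

Note that the loop only ever sees regression results, so a patch can exit it "green" while still failing the hidden security tests.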

Agents use a fixed tool set:

read_file(path): Read file contents
write_file(path, content): Write file contents
run_command(cmd): Execute shell command
run_tests(): Run visible regression tests

The benchmark does not provide specialized security tools. Agents must use standard development tools to locate and fix vulnerabilities.
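For reference, the four tools could be backed by nothing more than the standard library. This is a minimal stand-in, assuming all paths are relative to a working directory and that the visible tests run under pytest. The `Toolbox` class, the default test command, and the timeouts are hypothetical choices, not the benchmark's implementation.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


class Toolbox:
    """Minimal stand-in for the four tools listed above."""

    def __init__(self, work_dir: Path, test_command: str = "python -m pytest -q") -> None:
        self.work_dir = Path(work_dir)
        self.test_command = test_command  # assumed; the real runner may differ

    def read_file(self, path: str) -> str:
        """Read a file relative to the working directory."""
        return (self.work_dir / path).read_text(encoding="utf-8")

    def write_file(self, path: str, content: str) -> None:
        """Write a file, creating parent directories as needed."""
        target = self.work_dir / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")

    def run_command(self, cmd: str, timeout: int = 60) -> str:
        """Run a shell command in the working directory; return stdout + stderr."""
        done = subprocess.run(
            cmd, shell=True, cwd=self.work_dir,
            capture_output=True, text=True, timeout=timeout,
        )
        return done.stdout + done.stderr

    def run_tests(self) -> str:
        """Run the visible regression tests and return their output."""
        return self.run_command(self.test_command, timeout=600)
```

`shell=True` is acceptable here only because the commands run inside a disposable sandbox.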

State Management

Each agent maintains a working directory with:

Original vulnerable codebase (read-only)
Modified files (read-write)
Test output logs
Patch history

The agent can revert changes and retry different approaches. The benchmark tracks all intermediate states but only scores the final submitted patch.
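A simple way to get this behavior is to keep the original files immutable and store each edit as a snapshot that can be popped. The in-memory `Workspace` class below is a hypothetical sketch of that idea. The benchmark itself presumably works on files on disk.

```python
from __future__ import annotations

from types import MappingProxyType


class Workspace:
    """In-memory model of the working directory with revertible edits."""

    def __init__(self, original: dict[str, str]) -> None:
        self.original = MappingProxyType(dict(original))  # read-only view
        self.modified: dict[str, str] = {}
        self.history: list[dict[str, str]] = []  # snapshots before each write

    def read(self, path: str) -> str:
        """Return the modified content if present, else the original."""
        if path in self.modified:
            return self.modified[path]
        return self.original[path]

    def write(self, path: str, content: str) -> None:
        """Record a snapshot, then apply the edit."""
        self.history.append(dict(self.modified))
        self.modified[path] = content

    def revert(self) -> bool:
        """Undo the most recent write; return False if nothing to undo."""
        if not self.history:
            return False
        self.modified = self.history.pop()
        return True

    def final_patch(self) -> dict[str, str]:
        """Files whose final content differs from the original."""
        return {p: c for p, c in self.modified.items() if self.original.get(p) != c}
```

Only `final_patch()` would be scored; the snapshots in `history` correspond to the intermediate states that are tracked but not graded.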

Observability and Debugging

The benchmark logs:

All tool calls with arguments and results
LLM reasoning traces (when available)
Test output for each retry attempt
Final patch diff

These logs expose common failure patterns:

Incomplete search: Agent finds one vulnerable function but misses related code paths
Overfitting to exploit: Agent patches the specific attack vector from the CVE description but misses the general vulnerability class
Test misinterpretation: Agent assumes passing regression tests mean the vulnerability is fixed

The logs also show that agents rarely use static analysis tools. Most agents rely on pattern matching and LLM-guided code search. This limits their ability to find all instances of a vulnerability class.
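Tool-call logging of the kind described above can be added with a thin wrapper around each tool. The sketch below writes one JSON record per call to an in-memory list. The record fields and the `logged` helper are assumptions for illustration; the benchmark's real log format is not specified here.

```python
from __future__ import annotations

import functools
import json
from typing import Any, Callable


def logged(tool: Callable[..., Any], sink: list[str]) -> Callable[..., Any]:
    """Wrap a tool so every call appends a JSON line to `sink`."""

    @functools.wraps(tool)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = tool(*args, **kwargs)
        record = {
            "tool": tool.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
        }
        # default=repr keeps logging from failing on non-JSON values.
        sink.append(json.dumps(record, default=repr))
        return result

    return wrapper


if __name__ == "__main__":
    log: list[str] = []

    def read_file(path: str) -> str:  # stand-in tool for the demo
        return "print('hello')\n"

    read_file_logged = logged(read_file, log)
    read_file_logged("app/main.py")
    print(log[0])
```

With records in this shape, a pattern such as "incomplete search" shows up as a short run of read or grep calls followed directly by a write.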

Cost and Latency Trade-offs

Model	Cost per Run	Median Latency	Cost per Success
gpt-5.5	$2.40	180s	$4.80
gpt-5.4-mini	$0.60	120s	$1.33
gpt-5.4-nano	$0.20	90s	$0.50
laguna-m.1	$0.80	150s	$1.78
laguna-xs.2	$0.30	100s	$0.75

The cost-per-success metric accounts for failed runs. Cheaper models fail more often but cost less per attempt. For bulk remediation workflows, smaller models provide better ROI.
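The cost-per-success column is consistent with dividing cost per run by the overall solve rate from the results table. The short check below reproduces the column from those two inputs. The formula is inferred from the published numbers rather than quoted from the benchmark.

```python
# (cost per run in USD, overall solve rate) taken from the two tables
RESULTS = {
    "gpt-5.5": (2.40, 0.50),
    "gpt-5.4-mini": (0.60, 0.45),
    "gpt-5.4-nano": (0.20, 0.40),
    "laguna-m.1": (0.80, 0.45),
    "laguna-xs.2": (0.30, 0.40),
}


def cost_per_success(cost_per_run: float, solve_rate: float) -> float:
    """Expected spend to obtain one passing fix."""
    if solve_rate <= 0:
        raise ValueError("solve rate must be positive")
    return cost_per_run / solve_rate


for model, (cost, rate) in RESULTS.items():
    print(f"{model}: ${cost_per_success(cost, rate):.2f} per success")
```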

Latency scales with model size. Larger models spend more time on reasoning traces and tool planning. Smaller models make faster tool calls but retry more often.

When to Use This Benchmark

CVE-Bench is useful for:

Evaluating agent reliability on security tasks: Measures whether agents can fix real vulnerabilities, not just synthetic bugs.
Testing prompt strategies: Compares full advisory, locate, and diagnose prompts on the same CVE set.
Validating sandboxing approaches: Provides a reference implementation for hidden test architectures.

CVE-Bench is not useful for:

Comparing agent performance on general coding tasks: The benchmark only covers security vulnerabilities in Python projects.
Measuring agent performance on non-Python languages: All CVEs are from Python codebases.
Evaluating agent performance on novel vulnerability classes: The benchmark uses known CVEs with published fixes.

Technical Verdict

Use CVE-Bench when you need to measure agent reliability on real security remediation tasks. The hidden test architecture prevents overfitting to exploit patterns. The three prompt strategies expose how much context agents need to produce working fixes.

Avoid CVE-Bench if you need to evaluate agents on general coding tasks or non-Python languages. The benchmark is narrow by design. It measures one specific capability: fixing known security vulnerabilities in Python projects.

The results suggest that agents are not yet reliable enough for autonomous security remediation. Even the best model's 50% overall solve rate means half of its fixes fail. The false-positive failure mode means some of those failed fixes look like successes. Human review remains essential.

For production workflows, use agents as assistants, not replacements. Let the agent generate a candidate patch. Have a human reviewer validate the fix against the vulnerability class, not just the regression tests. Before merging, run dedicated security tests that the agent never saw, the production counterpart of the benchmark's hidden tests.

The cost analysis suggests using smaller models for initial patch generation and reserving larger models for validation. This two-stage approach trades cost against reliability, though the benchmark measures fix generation only, not validation.