CVE-Bench is a benchmark that measures whether LLM agents can fix real security vulnerabilities. It runs 20 real CVEs from Python projects (Pillow, GitPython, yt-dlp, urllib3) through five agents under three prompt strategies, scoring each fix against hidden security tests. The results are blunt: the best model solves 50% of tasks overall and 60% when given the full advisory. The most dangerous failure mode is not a crash or a refusal. It is a plausible patch that passes regression tests but leaves the vulnerability intact.
Why This Matters Now
Anthropic reported finding 1,596 vulnerabilities in open-source software. As of May 2026, 97 have been patched. Discovery is now the easy part. Verification, triage, and patching are the bottleneck. CVE-Bench measures that bottleneck directly by asking agents to produce working fixes, not just identify issues.
The benchmark exposes the gap between agent coding demos and real-world security workflows. Most agent benchmarks test greenfield tasks or synthetic bugs. CVE-Bench uses real CVEs with real exploit tests. The agent must modify vulnerable code, pass hidden security tests, and avoid breaking existing functionality.
Benchmark Architecture
CVE-Bench runs each agent in a sandboxed container. The agent receives a CVE description, access to the vulnerable codebase, and a set of visible regression tests. The security tests remain hidden. The agent must produce a patch that fixes the vulnerability without seeing the exploit validation logic.
Sandboxing and Isolation
Each run executes in a fresh Docker container with:
- Read-only access to the original vulnerable codebase
- Write access to a working directory for patches
- Network isolation (no external API calls during fix generation)
- Time limit of 10 minutes per CVE
The agent can read files, run tests, and modify code. It cannot see the hidden security test suite. This mirrors real-world conditions where developers do not have access to exploit PoCs during initial remediation.
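A minimal sketch of how the constraints above could map onto a `docker run` invocation. The image name, mount paths, and the helpers `build_sandbox_command` and `run_sandboxed` are assumptions for illustration, not the benchmark's actual harness. The Docker flags themselves (`--rm`, `--network none`, a `:ro` volume suffix) are standard.

```python
import subprocess

TIME_LIMIT_SECONDS = 10 * 60  # 10-minute budget per CVE, from the list above


def build_sandbox_command(image, repo_dir, work_dir, agent_cmd):
    """Build a `docker run` argv for one isolated CVE run (hypothetical layout)."""
    return [
        "docker", "run",
        "--rm",                          # fresh container, discarded after the run
        "--network", "none",             # no external network access
        "-v", f"{repo_dir}:/repo:ro",    # original codebase mounted read-only
        "-v", f"{work_dir}:/work",       # writable working directory for patches
        "-w", "/work",
        image,
        *agent_cmd,
    ]


def run_sandboxed(cmd):
    """Run the container and enforce the wall-clock limit from the host side."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=TIME_LIMIT_SECONDS)
    except subprocess.TimeoutExpired:
        return None  # treated as a failed run in this sketch
```

The host-side timeout only stops the `docker` client process. A real harness would likely also need to stop the container itself once the limit is reached.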
Hidden Test Architecture
The benchmark separates visible regression tests from hidden security tests. Visible tests validate that the patch does not break existing functionality. Hidden tests validate that the patch actually closes the vulnerability.
This separation prevents agents from overfitting to exploit patterns. If the agent could see the security test, it could patch the specific attack vector without addressing the root cause. The hidden test architecture forces the agent to reason about the vulnerability class, not just the exploit.
Prompt Strategies
CVE-Bench tests three prompt types:
- Full Advisory: The agent receives the complete CVE description, including vulnerability class, affected versions, and remediation guidance.
- Locate: The agent receives only the vulnerability class and must locate the vulnerable code before fixing it.
- Diagnose: The agent receives a minimal description and must diagnose the root cause before patching.
Each strategy changes the agent’s tool usage and reasoning path. Full advisory prompts produce faster fixes but sometimes miss edge cases. Locate prompts force more thorough code analysis. Diagnose prompts produce the most thorough reasoning but also the highest failure rate.
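The three strategies differ mainly in how much of the advisory reaches the agent. The sketch below shows one way to express that. The `Advisory` fields, the strategy keys, and the prompt wording are all hypothetical, and what counts as a "minimal description" for the diagnose prompt is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Advisory:
    # Field names are hypothetical; they follow the description above.
    cve_id: str
    vulnerability_class: str
    affected_versions: str
    description: str
    remediation: str


def build_prompt(advisory, strategy):
    """Return the task prompt for one of the three strategies."""
    if strategy == "full_advisory":
        # Everything: description, class, versions, remediation guidance.
        return (
            f"{advisory.cve_id}: {advisory.description}\n"
            f"Class: {advisory.vulnerability_class}\n"
            f"Affected versions: {advisory.affected_versions}\n"
            f"Remediation guidance: {advisory.remediation}"
        )
    if strategy == "locate":
        # Only the vulnerability class; the agent must find the code.
        return (
            "This codebase contains a vulnerability of class "
            f"{advisory.vulnerability_class}. Locate and fix it."
        )
    if strategy == "diagnose":
        # Minimal description; the agent must work out the root cause.
        return ("This codebase contains a security vulnerability. "
                "Diagnose the root cause and fix it.")
    raise ValueError(f"unknown strategy: {strategy}")
```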
Scoring Methodology
A fix passes if:
- All visible regression tests pass
- All hidden security tests pass
- The patch applies cleanly to the vulnerable version
A fix fails if:
- Any regression test fails (breaks existing functionality)
- Any security test fails (vulnerability remains exploitable)
- The patch does not apply (syntax errors, wrong file paths)
The benchmark does not score partial fixes. A patch that reduces attack surface but leaves the vulnerability exploitable scores as a failure. This binary scoring reflects production reality: a partially fixed vulnerability is still a vulnerability.
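The pass and fail rules above reduce to a single conjunction. A minimal sketch, assuming a hypothetical `RunResult` record that carries the three outcomes:

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    # Hypothetical record of one agent run on one CVE.
    patch_applies: bool
    regression_failures: int
    security_failures: int


def score(result):
    """Binary score: 1 only if all three pass conditions hold, else 0."""
    passed = (
        result.patch_applies
        and result.regression_failures == 0
        and result.security_failures == 0
    )
    return 1 if passed else 0


def solve_rate(results):
    """Fraction of runs that scored 1."""
    return sum(score(r) for r in results) / len(results)
```

There is deliberately no intermediate value: a patch with one remaining security failure scores the same as a patch that does not apply.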
Dangerous Failure Modes
The most operationally dangerous failure is a false positive fix. The agent modifies the right file, passes all visible tests, and reports success. But the vulnerability remains exploitable through a different code path.
Example from the benchmark: an agent patched one branch of a conditional that checked user input. The patch added validation to the if branch but left the else branch untouched. All regression tests passed because they only exercised the if branch. The security test failed because the exploit used the else branch.
This failure mode is worse than an obvious crash or a refusal. It creates false confidence. A human reviewer sees passing tests and assumes the fix is complete. The vulnerability ships to production.
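The one-branch patch described above can be reproduced in a few lines. This is a hypothetical illustration of the pattern, not code from the benchmark: the function names, the path traversal scenario, and the `BASE` directory are all invented for the example.

```python
import posixpath

BASE = "/srv/files"


def is_safe(path):
    """True if the normalized path stays inside BASE."""
    norm = posixpath.normpath(path)
    return norm == BASE or norm.startswith(BASE + "/")


def resolve_patched(name, absolute=False):
    """Incompletely patched: validation was added to one branch only."""
    if not absolute:
        path = posixpath.join(BASE, name)
        if not is_safe(path):            # check added by the patch
            raise ValueError("path escapes base directory")
        return path
    else:
        return posixpath.normpath(name)  # untouched branch: no validation


def regression_test():
    # Visible test: exercises only the relative-name branch.
    return resolve_patched("report.txt") == "/srv/files/report.txt"


def security_test():
    # Hidden test: the exploit goes through the other branch.
    try:
        resolve_patched("/etc/passwd", absolute=True)
    except ValueError:
        return True   # rejected, vulnerability closed
    return False      # accepted, vulnerability still open
```

Here `regression_test()` passes while `security_test()` fails, which is exactly the combination a reviewer cannot see from the visible test results alone.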
Results Across Models and Prompts
| Model | Full Advisory | Locate | Diagnose | Overall |
|---|---|---|---|---|
| gpt-5.5 | 60% | 50% | 40% | 50% |
| gpt-5.4-mini | 55% | 45% | 35% | 45% |
| gpt-5.4-nano | 50% | 40% | 30% | 40% |
| laguna-m.1 | 55% | 45% | 35% | 45% |
| laguna-xs.2 | 50% | 40% | 30% | 40% |
The expensive models are statistically indistinguishable from cheaper alternatives within the same family. gpt-5.5 costs 12x more per run than gpt-5.4-nano but only improves solve rate by 10 percentage points. The cost-per-success ratio favors smaller models for bulk remediation workflows.
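One rough way to sanity-check the "statistically indistinguishable" claim is to compare confidence intervals on the overall solve rates. The sketch below uses the Wilson score interval and assumes each model was scored on 60 runs (20 CVEs times 3 prompt strategies, one run each). That run count is an assumption, not a figure the benchmark states.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Assumption: 20 CVEs x 3 prompt strategies = 60 scored runs per model.
TRIALS = 60
big = wilson_interval(30, TRIALS)    # gpt-5.5 at 50% overall
small = wilson_interval(24, TRIALS)  # gpt-5.4-nano at 40% overall

# True when the two intervals share at least one value.
overlap = big[0] < small[1] and small[0] < big[1]
```

Under these assumptions the intervals overlap. Overlapping intervals are a coarse heuristic rather than a formal significance test, but they are consistent with treating a 10-point gap on this sample size as inconclusive.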
Full advisory prompts consistently outperform the others: by 10 percentage points over locate prompts and 20 over diagnose prompts, for every model. This suggests that agents benefit from explicit vulnerability class information. They struggle to infer root causes from minimal descriptions.
Implementation Details
The benchmark uses a standard agent loop:
- Parse CVE description and extract vulnerability class
- Locate vulnerable code using grep, AST analysis, or LLM-guided search
- Generate patch using LLM with full file context
- Apply patch and run visible regression tests
- If tests fail, retry with error context (max 3 retries)
- Submit final patch for hidden security test validation
Agents use a fixed tool set:
- read_file(path): Read file contents
- write_file(path, content): Write file contents
- run_command(cmd): Execute shell command
- run_tests(): Run visible regression tests
The benchmark does not provide specialized security tools. Agents must use standard development tools to locate and fix vulnerabilities.
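The generate, apply, test, and retry steps of the loop can be sketched as follows. The three callables are hypothetical stand-ins for the LLM call and the tools, and reading "max 3 retries" as three retries after the first attempt (four attempts in total) is an interpretation.

```python
MAX_RETRIES = 3  # retries after the first attempt, per the loop above


def remediation_loop(generate_patch, apply_patch, run_tests):
    """Generate, apply, and test a patch, retrying with error context.

    All three callables are hypothetical stand-ins:
      generate_patch(error_context) -> patch text
      apply_patch(patch) -> None
      run_tests() -> (passed: bool, output: str)
    Returns the last patch produced, which is what gets submitted.
    """
    error_context = None
    patch = None
    for _attempt in range(1 + MAX_RETRIES):
        patch = generate_patch(error_context)
        apply_patch(patch)
        passed, output = run_tests()
        if passed:
            break
        error_context = output  # feed failure output into the next attempt
    return patch
```

Note that the loop exits as soon as the visible tests pass. Nothing in it can detect a patch that passes regression tests while leaving the vulnerability open.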
State Management
Each agent maintains a working directory with:
- Original vulnerable codebase (read-only)
- Modified files (read-write)
- Test output logs
- Patch history
The agent can revert changes and retry different approaches. The benchmark tracks all intermediate states but only scores the final submitted patch.
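A minimal in-memory sketch of revertible state with an append-only action log. The `PatchHistory` class and its methods are hypothetical; the benchmark's actual storage layout is not described in this article.

```python
class PatchHistory:
    """Track modified files, allow reverts, and log every action."""

    def __init__(self, original_files):
        self._original = dict(original_files)  # never mutated
        self.files = dict(original_files)      # current modified state
        self._undo = []                        # snapshots taken before each write
        self.log = []                          # append-only record of actions

    def write(self, path, content):
        self._undo.append(dict(self.files))    # snapshot state before the change
        self.files[path] = content
        self.log.append(("write", path))

    def revert(self):
        """Undo the most recent write; does nothing if there is no history."""
        if self._undo:
            self.files = self._undo.pop()
            self.log.append(("revert", None))

    def changed_paths(self):
        """Paths whose content differs from the original codebase."""
        return sorted(
            path for path, content in self.files.items()
            if self._original.get(path) != content
        )
```

Only the state at submission time would feed the final diff. The log is there for later inspection and does not affect scoring.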
Observability and Debugging
The benchmark logs:
- All tool calls with arguments and results
- LLM reasoning traces (when available)
- Test output for each retry attempt
- Final patch diff
These logs expose common failure patterns:
- Incomplete search: Agent finds one vulnerable function but misses related code paths
- Overfitting to exploit: Agent patches the specific attack vector from the CVE description but misses the general vulnerability class
- Test misinterpretation: Agent assumes passing regression tests mean the vulnerability is fixed
The logs also show that agents rarely use static analysis tools. Most agents rely on pattern matching and LLM-guided code search. This limits their ability to find all instances of a vulnerability class.
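Logging every tool call with its arguments and result can be done with a thin wrapper around each tool. The decorator below is a sketch under that assumption: `logged_tool`, `TOOL_LOG`, and `dump_log` are hypothetical names, and the JSON Lines output format is a choice made for the example.

```python
import functools
import json

TOOL_LOG = []  # one dict per tool call, in call order


def logged_tool(fn):
    """Wrap a tool so every call records its name, arguments, and outcome."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        entry = {"tool": fn.__name__, "args": list(args), "kwargs": kwargs}
        try:
            entry["result"] = fn(*args, **kwargs)
            return entry["result"]
        except Exception as exc:
            entry["error"] = repr(exc)  # record the failure, then re-raise
            raise
        finally:
            TOOL_LOG.append(entry)
    return wrapper


@logged_tool
def read_file(path):
    """Example tool: return the contents of a text file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def dump_log():
    """Serialize the log as JSON Lines, one tool call per line."""
    return "\n".join(json.dumps(entry, default=str) for entry in TOOL_LOG)
```

With a log in this shape, a pattern such as incomplete search shows up as a run whose `read_file` calls touch only one of the files involved in the vulnerable code path.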
Cost and Latency Trade-offs
| Model | Cost per Run | Median Latency | Cost per Success |
|---|---|---|---|
| gpt-5.5 | $2.40 | 180s | $4.80 |
| gpt-5.4-mini | $0.60 | 120s | $1.33 |
| gpt-5.4-nano | $0.20 | 90s | $0.50 |
| laguna-m.1 | $0.80 | 150s | $1.78 |
| laguna-xs.2 | $0.30 | 100s | $0.75 |
The cost-per-success metric accounts for failed runs. Cheaper models fail more often but cost less per attempt. For bulk remediation workflows, smaller models provide better ROI.
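The Cost per Success column is consistent with dividing cost per run by the overall solve rate. That formula is inferred from the numbers in the two tables rather than stated by the benchmark, so treat the sketch below as a reconstruction.

```python
# Figures copied from the two tables in this article.
COST_PER_RUN = {
    "gpt-5.5": 2.40,
    "gpt-5.4-mini": 0.60,
    "gpt-5.4-nano": 0.20,
    "laguna-m.1": 0.80,
    "laguna-xs.2": 0.30,
}
OVERALL_SOLVE_RATE = {
    "gpt-5.5": 0.50,
    "gpt-5.4-mini": 0.45,
    "gpt-5.4-nano": 0.40,
    "laguna-m.1": 0.45,
    "laguna-xs.2": 0.40,
}


def cost_per_success(cost_per_run, solve_rate):
    """Expected spend per successful fix: run cost / probability of success."""
    if solve_rate <= 0:
        raise ValueError("solve rate must be positive")
    return cost_per_run / solve_rate


# Rounded to cents for comparison with the table.
table = {
    model: round(cost_per_success(COST_PER_RUN[model], OVERALL_SOLVE_RATE[model]), 2)
    for model in COST_PER_RUN
}
```

Dividing by the solve rate is the expected number of runs needed for one success (1 / rate) multiplied by the cost of each run, which is why failed runs are already accounted for.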
Latency scales with model size. Larger models spend more time on reasoning traces and tool planning. Smaller models make faster tool calls but retry more often.
When to Use This Benchmark
CVE-Bench is useful for:
- Evaluating agent reliability on security tasks: Measures whether agents can fix real vulnerabilities, not just synthetic bugs.
- Testing prompt strategies: Compares full advisory, locate, and diagnose prompts on the same CVE set.
- Validating sandboxing approaches: Provides a reference implementation for hidden test architectures.
CVE-Bench is not useful for:
- Comparing agent performance on general coding tasks: The benchmark only covers security vulnerabilities in Python projects.
- Measuring agent performance on non-Python languages: All CVEs are from Python codebases.
- Evaluating agent performance on novel vulnerability classes: The benchmark uses known CVEs with published fixes.
Technical Verdict
Use CVE-Bench when you need to measure agent reliability on real security remediation tasks. The hidden test architecture prevents overfitting to exploit patterns. The three prompt strategies expose how much context agents need to produce working fixes.
Avoid CVE-Bench if you need to evaluate agents on general coding tasks or non-Python languages. The benchmark is narrow by design. It measures one specific capability: fixing known security vulnerabilities in Python projects.
The results suggest that agents are not yet reliable enough for autonomous security remediation. Even the best model's 50% overall solve rate means half of its fixes fail. The false positive failure mode means some of those failures look like successes and create false confidence. Human review remains essential.
For production workflows, use agents as assistants, not replacements. Let the agent generate a candidate patch. Have a human reviewer validate the fix against the vulnerability class, not just the regression tests. Run exploit-based security tests, the production counterpart of the benchmark's hidden tests, before merging.
The cost analysis suggests using smaller models for initial patch generation and larger models for validation. This two-stage approach balances cost and reliability.