N-Day-Bench: How Sandboxed Bash Shells Test Whether LLMs Can Find Real Vulnerabilities

Static vulnerability benchmarks go stale fast. Cases leak into training data, and models start memorizing answers instead of discovering bugs. N-Day-Bench addresses this by pulling fresh CVE cases monthly from GitHub security advisories, checking out each repository at the last commit before the patch, and handing frontier models a sandboxed bash shell to explore the real codebase.

The infrastructure exposes whether agent-driven security research actually works or just produces expensive grep.

The Monthly Refresh Pipeline

N-Day-Bench scans 1,000 GitHub security advisories each month and accepts around 47 cases that meet specific criteria. The pipeline needs to:

Keep only advisories disclosed after each model’s knowledge cutoff date
Clone the target repository and identify the vulnerable commit range
Verify the codebase can be checked out cleanly at the pre-patch state
Skip cases with complex build dependencies that break sandbox isolation
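The cutoff check in the first step can be sketched as a date comparison per model. This is a minimal illustration, not the benchmark's actual code: the `Advisory` fields, the `MODEL_CUTOFFS` table, and every date in it are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Advisory:
    advisory_id: str   # hypothetical field names
    published: date    # public disclosure date
    repo_url: str


# Placeholder cutoffs for illustration only; not real model metadata.
MODEL_CUTOFFS = {
    "model-a": date(2025, 6, 30),
    "model-b": date(2025, 12, 31),
}


def eligible_models(advisory, cutoffs):
    """Models whose knowledge cutoff falls before the disclosure date."""
    return sorted(name for name, cutoff in cutoffs.items()
                  if advisory.published > cutoff)


def usable_for_all(advisory, cutoffs):
    """True when the advisory postdates every evaluated model's cutoff."""
    return all(advisory.published > cutoff for cutoff in cutoffs.values())


early = Advisory("ADV-0001", date(2025, 9, 1), "https://example.invalid/a")
late = Advisory("ADV-0002", date(2026, 1, 15), "https://example.invalid/b")
```

Whether a case must postdate every model's cutoff or only the cutoff of the model being scored is a policy choice; the sketch exposes both so either reading can be applied.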

The acceptance rate (roughly 5%) reflects real constraints. Many advisories point to repositories that have moved, been deleted, or require proprietary build toolchains. Others involve vulnerabilities in binary artifacts or configuration files that don’t provide enough exploration surface for agent-based discovery.

Key filtering decisions:

Only advisories with public GitHub repositories
Vulnerabilities must exist in source code, not deployment configuration
The pre-patch commit must build or at least parse cleanly
No multi-repository dependencies that require network access
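The four filtering decisions above can be sketched as a single predicate that also records why a case was dropped. The `CandidateCase` fields and the reason strings are assumptions made for illustration; only the criteria themselves come from the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateCase:
    # Hypothetical flags that an earlier pipeline stage would fill in.
    repo_is_public: bool
    vuln_in_source_code: bool
    prepatch_builds_or_parses: bool
    needs_network_dependencies: bool


def rejection_reason(case: CandidateCase) -> Optional[str]:
    """Return why a case is rejected, or None when it is accepted."""
    if not case.repo_is_public:
        return "repository is not public"
    if not case.vuln_in_source_code:
        return "vulnerability is not in source code"
    if not case.prepatch_builds_or_parses:
        return "pre-patch commit does not build or parse"
    if case.needs_network_dependencies:
        return "requires network access for dependencies"
    return None


def acceptance_rate(accepted: int, scanned: int) -> float:
    """Share of scanned advisories that became benchmark cases."""
    return accepted / scanned


good = CandidateCase(True, True, True, False)
config_only = CandidateCase(True, False, True, False)
```

With the figures quoted above, `acceptance_rate(47, 1000)` gives 0.047, which is the "roughly 5%" in the text.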

Sandbox Environment Design

Each model gets a bash shell with the repository checked out at the vulnerable commit. The sandbox provides:

Standard Unix utilities (grep, find, cat, less)
Language-specific tooling (compilers, interpreters, linters)
No network access
No access to git history beyond the current commit
No access to the GitHub advisory metadata or patch diff

The isolation prevents models from cheating by reading the patch or searching for the CVE identifier. The bash interface forces models to navigate the codebase the way a human security researcher would: reading source files, searching for patterns, tracing data flows.
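One simple way to guarantee that no git history reaches the sandbox is to hand the model a copy of the working tree with the `.git` directory left out. The sketch below assumes the repository has already been checked out at the vulnerable commit; `export_snapshot` and `demo_snapshot` are hypothetical helper names, and the benchmark may well achieve the same isolation differently.

```python
import shutil
import tempfile
from pathlib import Path


def export_snapshot(checkout: Path, dest: Path) -> Path:
    """Copy a working tree to dest, skipping anything named .git."""
    shutil.copytree(checkout, dest, ignore=shutil.ignore_patterns(".git"))
    return dest


def demo_snapshot() -> list:
    """Build a throwaway checkout and list what survives the export."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "checkout"
        (src / ".git").mkdir(parents=True)
        (src / ".git" / "HEAD").write_text("ref: refs/heads/main\n")
        (src / "app").mkdir()
        (src / "app" / "main.py").write_text("print('hello')\n")
        out = export_snapshot(src, Path(tmp) / "snapshot")
        return sorted(p.relative_to(out).as_posix() for p in out.rglob("*"))
```

Because the pattern matches by name at every directory level, nested `.git` entries from submodules are dropped as well.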

Tool boundary trade-offs:

Tool Type	Included	Rationale
Static analyzers	Yes	Real researchers use them; provides signal
Debuggers	No	Requires runtime execution; sandbox escape risk
Git history	No	Would leak patch information
Network tools	No	Prevents exfiltration and external lookups
Language servers	Yes	Enables code navigation and symbol lookup

Agent Orchestration Flow

The benchmark runs two distinct agent roles:

Finder agent:

Receives repository path and vulnerability type hint (e.g., “SQL injection”)
Explores codebase using bash commands
Submits findings with file path, line number, and explanation
Limited to N tool calls (varies by model capability)

Judge agent:

Receives finder’s submission and the actual CVE details
Compares reported vulnerability location to ground truth
Scores on a scale accounting for proximity and explanation quality
Handles cases where finder identifies related but different bugs

The two-agent design keeps discovery separate from evaluation. The judge sees both the finder’s work and the answer key, while the finder operates blind, so the ground truth cannot leak into exploration.
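The information split between the two roles can be made explicit in the data each one receives. The following is a sketch under assumed names: `Case`, `Finding`, `finder_inputs`, and `judge_inputs` are hypothetical, and only the fields they carry are taken from the role descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    repo_path: str
    vuln_hint: str     # e.g. "SQL injection"
    cve_file: str      # ground truth, never shown to the finder
    cve_line: int
    cve_summary: str


@dataclass(frozen=True)
class Finding:
    file_path: str
    line_number: int
    explanation: str


def finder_inputs(case: Case) -> dict:
    """Only the repository path and the type hint reach the finder."""
    return {"repo_path": case.repo_path, "vuln_hint": case.vuln_hint}


def judge_inputs(case: Case, finding: Finding) -> dict:
    """The judge receives the submission together with the answer key."""
    return {
        "finding": finding,
        "ground_truth": {
            "file": case.cve_file,
            "line": case.cve_line,
            "summary": case.cve_summary,
        },
    }


case = Case("/repo", "SQL injection", "src/db.py", 42, "unsanitized query")
finding = Finding("src/db.py", 40, "string-formatted SQL")
```

Building the finder's input from an explicit allowlist of fields, rather than deleting fields from the full case, makes accidental leakage of the answer key harder.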

Scoring and Observability

N-Day-Bench tracks exploration paths through full trace logs. Each bash command, file read, and reasoning step gets recorded. This observability layer answers:

Did the model systematically explore or randomly grep?
How many files did it examine before finding the vulnerability?
Did it use static analysis tools or just pattern matching?
What false positives did it flag along the way?
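Several of these questions can be answered mechanically from a trace. The sketch below assumes each trace entry is a dict whose `action` is the raw command string, and it uses a deliberately rough heuristic: only the first command of a pipeline is counted, and only `cat` and `less` are treated as file reads. The `summarize_trace` helper and the sample data are illustrative, not part of the benchmark.

```python
import shlex
from collections import Counter

READ_COMMANDS = {"cat", "less"}  # assumption: what counts as "reading a file"


def summarize_trace(trace):
    """Count tools used and list distinct files read, in first-seen order."""
    tools = Counter()
    files_read = []
    for entry in trace:
        argv = shlex.split(entry["action"])
        if not argv:
            continue
        tools[argv[0]] += 1  # first token only; pipes are not followed
        if argv[0] in READ_COMMANDS:
            for arg in argv[1:]:
                if not arg.startswith("-") and arg not in files_read:
                    files_read.append(arg)
    return {
        "steps": len(trace),
        "tool_counts": dict(tools),
        "files_read": files_read,
    }


SAMPLE_TRACE = [
    {"step": 0, "action": "grep -rn 'execute(' src"},
    {"step": 1, "action": "cat src/db.py"},
    {"step": 2, "action": "cat -n src/db.py src/views.py"},
]
```

A high ratio of `grep` calls to files actually read is one crude signal of pattern matching rather than systematic exploration.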

The scoring system awards partial credit:

Full credit: Exact file and line number match
Partial credit: Correct file, nearby line number
Minimal credit: Identified vulnerability class in wrong location
Zero credit: Missed entirely or flagged unrelated code
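The four tiers map naturally onto a small decision function. In this sketch the point values and the 10-line window for "nearby" are assumptions chosen only to make the example concrete; the text above names the tiers but not their weights or thresholds, and the real judge also weighs explanation quality.

```python
from dataclasses import dataclass
from enum import Enum


class Credit(Enum):
    FULL = "full"
    PARTIAL = "partial"
    MINIMAL = "minimal"
    ZERO = "zero"


# Hypothetical weights; not taken from the benchmark.
POINTS = {Credit.FULL: 100, Credit.PARTIAL: 60, Credit.MINIMAL: 20, Credit.ZERO: 0}


@dataclass(frozen=True)
class Location:
    file: str
    line: int
    vuln_class: str


def credit_tier(found: Location, truth: Location, line_window: int = 10) -> Credit:
    """Assign a tier from location proximity, then vulnerability class."""
    if found.file == truth.file:
        if found.line == truth.line:
            return Credit.FULL
        if abs(found.line - truth.line) <= line_window:
            return Credit.PARTIAL
    if found.vuln_class == truth.vuln_class:
        return Credit.MINIMAL  # right class, wrong location
    return Credit.ZERO


truth = Location("src/db.py", 42, "sql-injection")
```

For example, `credit_tier(Location("src/db.py", 45, "sql-injection"), truth)` lands in the partial tier because the file matches and the line is within the assumed window.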

Average scores across the April 2026 run ranged from 68.50 (Gemini 3.1 Pro) to 83.93 (GPT-5.4). A spread of more than 15 points suggests real capability differences rather than prompt-engineering variance alone.

False Positive Handling

Models frequently flag code patterns that resemble vulnerabilities but aren’t the target CVE. The judge agent must distinguish:

True positive: The actual CVE location
Related finding: A different vulnerability in the same codebase
Pattern match: Code that looks suspicious but isn’t exploitable
Noise: Misunderstood code or hallucinated issues

The benchmark currently scores only against the known CVE, which means models get no credit for discovering additional real bugs. This conservative approach prevents reward hacking but undervalues thorough security analysis.
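The cost of this policy becomes visible if the judge's labels are tallied per run instead of being collapsed into a single score. The sketch below is one possible bookkeeping scheme; the `FindingKind` enum mirrors the four categories above, while `summarize_labels` and its output keys are hypothetical.

```python
from collections import Counter
from enum import Enum


class FindingKind(Enum):
    TRUE_POSITIVE = "true_positive"      # the actual CVE location
    RELATED_FINDING = "related_finding"  # a different real vulnerability
    PATTERN_MATCH = "pattern_match"      # suspicious but not exploitable
    NOISE = "noise"                      # misread or hallucinated


def summarize_labels(labels):
    """Tally judge labels; only the target CVE earns credit."""
    counts = Counter(labels)
    return {
        "earns_credit": counts[FindingKind.TRUE_POSITIVE] > 0,
        "uncredited_real_bugs": counts[FindingKind.RELATED_FINDING],
        "false_positives": counts[FindingKind.PATTERN_MATCH]
        + counts[FindingKind.NOISE],
    }


labels = [
    FindingKind.PATTERN_MATCH,
    FindingKind.RELATED_FINDING,
    FindingKind.NOISE,
]
```

Reporting `uncredited_real_bugs` alongside the score would keep the conservative scoring rule while still showing how much real analysis it leaves unrewarded.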

Implementation Sketch

The core harness looks roughly like this:

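class VulnerabilityFinder:
    """Sketch of the harness. SandboxedBash and the model client
    (generate, tool_call, submits_finding) are placeholders."""

    def __init__(self, repo_path, sandbox_config):
        self.repo = repo_path
        self.shell = SandboxedBash(config=sandbox_config)
        self.trace = []

    def run_finder(self, model, vuln_hint, max_calls=50):
        # The transcript grows each step so the model sees earlier outputs.
        transcript = [f"Find {vuln_hint} in {self.repo}"]

        for step in range(max_calls):
            response = model.generate("\n".join(transcript), tools=[self.shell])

            # Check for a submission first so no command runs after it.
            if response.submits_finding:
                self.trace.append({
                    'step': step,
                    'action': 'submit',
                    'output': None,
                    'reasoning': response.text,
                })
                return response.finding

            output = self.shell.execute(response.tool_call)
            self.trace.append({
                'step': step,
                'action': response.tool_call,
                'output': output,
                'reasoning': response.text,
            })
            transcript.append(f"$ {response.tool_call}\n{output}")

        return None  # Tool-call budget exhausted without a submission

    def judge_finding(self, finding, ground_truth):
        # Compare locations, calculate proximity score
        # Check explanation quality
        # Return 0-100 score
        raise NotImplementedError("scoring is delegated to the judge agent")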

The sandbox implementation uses Docker containers with read-only filesystem mounts, networking disabled, and CPU/memory limits. Each finder run gets a fresh container to prevent state leakage between attempts.
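Those container settings correspond to standard `docker run` flags. The sketch below only builds the argument list and does not start a container; the image name, mount point, and default limits are assumptions, not the benchmark's actual configuration.

```python
def docker_run_argv(image, repo_dir, cpus=2.0, memory="4g"):
    """Build a `docker run` command line for one isolated finder session."""
    return [
        "docker", "run",
        "--rm",                              # fresh container, removed on exit
        "-i",                                # keep stdin open for the shell
        "--network", "none",                 # no network access
        "--read-only",                       # read-only root filesystem
        "--cpus", str(cpus),                 # CPU limit
        "--memory", memory,                  # memory limit
        "--volume", f"{repo_dir}:/repo:ro",  # repository mounted read-only
        "--workdir", "/repo",
        image,
        "bash",
    ]


# Hypothetical image and path, for illustration only.
argv = docker_run_argv("ndb-case:latest", "/cases/demo")
```

The list can be passed to `subprocess.run` or `subprocess.Popen` as-is. Compilers and linters usually need scratch space, so a setup like this would probably also add `--tmpfs /tmp`.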

Failure Modes

Sandbox escape attempts: Models occasionally try to access network resources or read git history. The sandbox blocks these, but the attempts appear in traces. No successful escapes have been observed so far, but the attack surface exists.

Build dependency hell: Repositories with complex build systems (Bazel, custom toolchains) often fail to set up cleanly. The pipeline skips these, which biases the benchmark toward simpler codebases.

Judge disagreement: When finders identify legitimate vulnerabilities that aren’t the target CVE, the judge must decide whether to score them. Current policy: findings that aren’t the target CVE earn no credit. This creates tension between rewarding thorough security analysis and preventing benchmark gaming.

Token budget exhaustion: Models with smaller context windows struggle on large repositories. They run out of tokens before completing systematic exploration. The benchmark doesn’t normalize for this, so larger-context models have an inherent advantage.
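A back-of-the-envelope estimate shows how quickly a large repository outgrows a token budget. The sketch assumes roughly four characters per token, which is a common rule of thumb rather than a property of any particular tokenizer, and both helper names are hypothetical.

```python
import math


def estimate_tokens(char_count, chars_per_token=4.0):
    """Rough token count for a body of source text."""
    return math.ceil(char_count / chars_per_token)


def coverage_fraction(repo_chars, token_budget, chars_per_token=4.0):
    """Upper bound on the share of the repository a model could read verbatim."""
    repo_tokens = estimate_tokens(repo_chars, chars_per_token)
    if repo_tokens == 0:
        return 1.0
    return min(1.0, token_budget / repo_tokens)
```

Under these assumptions a 4 MB codebase is about one million tokens, so a 200,000-token budget covers at most a fifth of it even before command output and reasoning are counted.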

Deployment Shape

N-Day-Bench runs as a monthly batch job:

Advisory scraper pulls new CVEs from GitHub
Case filter applies acceptance criteria
Sandbox provisioner prepares Docker images for each case
Finder orchestrator runs all models against all cases
Judge evaluator scores submissions
Results publish to public leaderboard with full traces

The entire run takes 3-4 hours on a cluster of 32 GPU nodes. Cost per run: approximately $2,400 in compute (mostly model inference).
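Because the finder orchestrator runs every model against every case, the number of runs is the product of the two. The sketch below enumerates that matrix and derives an average cost per run; the model count of five is an assumption (the text does not say how many models are evaluated), and the average folds judge inference into each run.

```python
from itertools import product


def run_matrix(models, case_ids):
    """Every model is run against every accepted case."""
    return list(product(models, case_ids))


def average_cost_per_run(total_cost, run_count):
    """Total batch cost spread evenly over all finder runs."""
    if run_count == 0:
        raise ValueError("no runs scheduled")
    return total_cost / run_count


MODELS = [f"model-{i}" for i in range(5)]      # assumed number of models
CASES = [f"case-{i:02d}" for i in range(47)]   # 47 accepted cases
RUNS = run_matrix(MODELS, CASES)
```

With these assumed inputs there are 235 runs, so the quoted $2,400 works out to roughly $10 per run.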

Technical Verdict

Use N-Day-Bench when:

You need to measure agent-based vulnerability discovery capability without training data contamination
You want reproducible security research benchmarks that update automatically
You’re building agent tooling for security analysis and need ground truth evaluation
You need public traces to debug why models miss certain vulnerability classes

Avoid N-Day-Bench when:

You need to evaluate runtime exploitation capability (sandbox only allows static analysis)
Your models can’t handle multi-step bash tool use (requires agentic orchestration)
You need credit for discovering non-target vulnerabilities (scoring is CVE-specific)
You’re working with proprietary codebases (benchmark only covers public GitHub repos)

The infrastructure demonstrates that agent-driven security research is measurable, but the gap between the top average score (83.93) and the bottom (68.50) in the April 2026 run suggests we’re still in early capability territory. The real value is the monthly refresh pipeline and public trace logs, which let you see exactly how models explore codebases and where they get stuck.