Three-Class Credential Detection: Why Agent Secret Scanners Need to Distinguish Real Keys from Placeholders and False Positives

AI coding agents generate code at scale, often with little or no human review. When an agent writes a configuration file, a test fixture, or a deployment script, it may include credentials. Detecting secrets is only part of the problem. A scanner also has to distinguish real secrets from placeholders like YOUR_API_KEY_HERE or sk-test-1234567890abcdef, or it floods security teams with false positives.

Traditional secret scanners use binary classification: secret or not secret. This approach produces high false-positive rates because it treats placeholder credentials the same as real ones. A new hybrid CNN-CodeBERT framework introduces a third class specifically for placeholders and weak credentials, reducing high-severity alerts by 33% while maintaining 93% recall on genuine leaks.

Why Binary Classification Fails

Most secret scanners work in two stages:

Pattern matching to find strings that look like API keys, tokens, or passwords
Binary classification to decide if each match is a real secret

This breaks down when you encounter:

Test fixtures with placeholder keys (sk-test-...)
Documentation examples (YOUR_SECRET_HERE)
Weak or dummy credentials (password123)
Environment variable templates (${API_KEY})

A binary classifier must choose: is sk-test-1234567890abcdef a secret or not? If it says yes, you get a false positive. If it says no, you might miss a real test key that accidentally went to production.
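
To make the dilemma concrete, here is a minimal sketch of the two-stage pipeline. The regex and the length-based rule are illustrative assumptions, not taken from any particular scanner. The point is only that stage two has nowhere to put a test key except "secret" or "not secret".

```python
import re

# Illustrative pattern only: real scanners ship many provider-specific rules.
CANDIDATE = re.compile(r"\b(?:sk|pk)-[A-Za-z0-9-]{8,}\b")


def find_candidates(text: str) -> list[str]:
    """Stage 1: pattern matching for strings shaped like API keys."""
    return CANDIDATE.findall(text)


def binary_classify(candidate: str) -> bool:
    """Stage 2: a forced yes/no decision.

    Hypothetical rule for illustration: anything of 20+ characters is a secret.
    """
    return len(candidate) >= 20


snippet = 'client = Client(api_key="sk-test-1234567890abcdef")'
for candidate in find_candidates(snippet):
    verdict = "secret" if binary_classify(candidate) else "not secret"
    print(candidate, "->", verdict)
```

Whichever way the rule is tuned, the test key lands in one of two buckets, and neither describes it accurately.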

The Three-Class Architecture

The hybrid framework splits detection into three categories:

Class	Definition	Example
Genuine	Real credentials with entropy and context suggesting production use	`sk-live-7x9mK3nP2qR8vW4tY6uZ1aB5cD0eF`
Placeholder	Template strings, test keys, or weak credentials	`YOUR_API_KEY`, `sk-test-123`, `password`
Benign	Code that matches patterns but is not credential-related	Variable names, comments, random strings

The architecture combines two detection paths:

Character-level CNN extracts syntactic features:

Entropy distribution across the string
Character class transitions (alphanumeric to special)
Length and format patterns
Prefix/suffix markers (sk-, -test-)
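
The signals in this list can be computed by hand, which helps build intuition for what the character-level path has to pick up. The sketch below is an assumption-laden illustration: a CNN would learn such features from raw characters rather than have them coded explicitly, and the marker list here is made up for the example.

```python
import math
from collections import Counter

# Illustrative markers only; not an exhaustive or authoritative list.
MARKERS = ("sk-", "-test-", "YOUR_", "EXAMPLE_")


def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, based on character frequencies."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def char_class(ch: str) -> str:
    """Coarse character class used to count transitions."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    return "special"


def class_transitions(s: str) -> int:
    """Number of adjacent character pairs whose classes differ."""
    return sum(1 for a, b in zip(s, s[1:]) if char_class(a) != char_class(b))


def syntactic_features(s: str) -> dict:
    """Bundle the hand-computed signals for one candidate string."""
    return {
        "length": len(s),
        "entropy": shannon_entropy(s),
        "transitions": class_transitions(s),
        "markers": [m for m in MARKERS if m in s],
    }


for candidate in ("YOUR_API_KEY", "sk-test-123", "sk-live-7x9mK3nP2qR8vW4tY6uZ1aB5cD0eF"):
    print(candidate, syntactic_features(candidate))
```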

CodeBERT semantic encoder captures context:

Surrounding code structure
Variable naming conventions
File type and location
Comments and documentation markers
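
Before any encoder sees a candidate, something has to gather the context listed above. The following is a rough sketch of that step, not the framework's actual preprocessing: the `CandidateContext` fields, the directory names, and the documentation markers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical hints; a real system would tune or learn these.
TEST_DIRS = {"test", "tests", "fixtures"}
DOC_MARKERS = ("example", "placeholder", "replace with")


@dataclass
class CandidateContext:
    window: str           # surrounding lines, joined with newlines
    file_suffix: str      # e.g. ".py" or ".yaml"
    in_test_path: bool    # any directory component looks test-related
    has_doc_marker: bool  # the window mentions an example or placeholder


def build_context(source: str, line_no: int, path: str, radius: int = 10) -> CandidateContext:
    """Collect context around a zero-based line number."""
    lines = source.splitlines()
    start = max(0, line_no - radius)
    end = min(len(lines), line_no + radius + 1)
    window = "\n".join(lines[start:end])
    p = PurePosixPath(path)
    lowered = window.lower()
    return CandidateContext(
        window=window,
        file_suffix=p.suffix,
        in_test_path=bool({part.lower() for part in p.parts} & TEST_DIRS),
        has_doc_marker=any(marker in lowered for marker in DOC_MARKERS),
    )
```

The `window` string is what would be tokenized and passed to a semantic encoder; the boolean fields are cheap extras that could be logged or used as additional inputs.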

Both paths feed into a final classifier that outputs three probabilities. The model was trained on 9,426 samples across 10 programming languages.
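
The text only says that both paths feed a final classifier with three outputs. One common way to do that is late fusion: concatenate the two feature vectors, apply a linear layer, and normalize with softmax. The sketch below assumes that design; the weights are toy values, not trained parameters.

```python
import math

CLASSES = ("genuine", "placeholder", "benign")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(char_features, context_features, weights, biases):
    """Concatenate both feature vectors, apply one linear layer, then softmax.

    weights holds one row per class; each row has one entry per fused feature.
    """
    fused = list(char_features) + list(context_features)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(CLASSES, softmax(logits)))


# Toy example: two character features and two context features.
toy_weights = [
    [2.0, 0.0, 2.0, 0.0],  # genuine
    [0.0, 2.0, 0.0, 2.0],  # placeholder
    [0.0, 0.0, 0.0, 0.0],  # benign
]
probs = fuse_and_classify([0.9, 0.1], [0.8, 0.0], toy_weights, [0.0, 0.0, 0.0])
print(probs)
```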

Implementation Considerations

Where to Scan

Agent pipelines have multiple insertion points for secret detection:

Pre-generation filtering: Block the agent from accessing real secrets during code generation. This prevents leakage but limits the agent’s ability to write working deployment code.

Post-generation scanning: Run the three-class detector on generated code before committing. This catches leaks but requires a rollback mechanism.

Continuous monitoring: Scan repositories on every push. This is standard practice but happens too late if the agent already committed secrets.

Hybrid approach: Use placeholder injection during generation (agent writes {{API_KEY}}), scan the output, then substitute real secrets only in secure deployment contexts.
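
The substitution step of the hybrid approach can be sketched in a few lines. This assumes the `{{NAME}}` placeholder syntax mentioned above and uses a plain dictionary as a stand-in for whatever secret manager the deployment environment provides; `render_secrets` is a hypothetical helper name.

```python
import re

# Matches {{API_KEY}}-style placeholders with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z][A-Z0-9_]*)\s*\}\}")


def render_secrets(template: str, secrets: dict[str, str]) -> str:
    """Replace each placeholder with its secret; fail loudly if one is missing."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in secrets:
            raise KeyError(f"no secret provided for placeholder {name}")
        return secrets[name]

    # A function replacement avoids backslash-escape processing of secret values.
    return PLACEHOLDER.sub(replace, template)


generated = 'client = Client(api_key="{{API_KEY}}")'
print(render_secrets(generated, {"API_KEY": "value-from-secret-manager"}))
```

Failing on a missing name is a deliberate choice here: a silently unrendered `{{API_KEY}}` in production is harder to debug than a deployment that refuses to start.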

Training Pipeline

Building a three-class detector requires labeled data in all three categories:

# Pseudo-code for training data construction
def build_training_set():
    genuine = scrape_leaked_secrets(github_archive)
    placeholders = extract_from_docs_and_tests(
        patterns=["YOUR_", "EXAMPLE_", "test-", "dummy-"]
    )
    benign = sample_non_secret_strings(
        source=code_corpus,
        filters=["variable_names", "comments", "literals"]
    )
    
    # Balance classes to prevent bias toward benign
    return balance_classes(genuine, placeholders, benign)

The dataset needs:

Real leaked credentials (from public breach datasets)
Placeholder examples (from documentation and test suites)
Benign strings (from normal code that matches secret patterns)
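
The pseudo-code above leaves `balance_classes` undefined. One plausible reading, shown here as an assumption rather than the authors' method, is to downsample every class to the size of the smallest one.

```python
import random


def balance_classes(genuine, placeholders, benign, seed=0):
    """Downsample each class to the smallest class size and shuffle.

    Returns a list of (text, label) pairs. Downsampling is one option among
    several; oversampling or class weights would also address imbalance.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    groups = {"genuine": genuine, "placeholder": placeholders, "benign": benign}
    target = min(len(samples) for samples in groups.values())
    dataset = []
    for label, samples in groups.items():
        for text in rng.sample(list(samples), target):
            dataset.append((text, label))
    rng.shuffle(dataset)
    return dataset


balanced = balance_classes(
    genuine=["g1", "g2"],
    placeholders=["p1", "p2", "p3"],
    benign=["b1", "b2", "b3", "b4", "b5"],
)
print(balanced)
```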

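Cross-language generalization is critical. The model achieves F1 > 0.80 on 9 of 10 languages under leave-one-language-out evaluation, in which each language is held out of training in turn and used only for testing. This suggests the detector generalizes to languages it has not seen during training, although one of the ten languages did not clear that threshold.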
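
Leave-one-language-out evaluation is simple to set up. The sketch below assumes each sample is a dictionary with a `language` key, which is a convention invented for this example.

```python
def leave_one_language_out(samples):
    """Yield (held_out_language, train, test) once per language.

    The held-out language never appears in the training split of its fold.
    """
    languages = sorted({sample["language"] for sample in samples})
    for held_out in languages:
        train = [s for s in samples if s["language"] != held_out]
        test = [s for s in samples if s["language"] == held_out]
        yield held_out, train, test


samples = [
    {"language": "python", "text": 'API_KEY = "YOUR_API_KEY"', "label": "placeholder"},
    {"language": "go", "text": 'token := os.Getenv("TOKEN")', "label": "benign"},
    {"language": "java", "text": 'String pw = "password123";', "label": "placeholder"},
    {"language": "python", "text": "x = compute(y)", "label": "benign"},
]
for held_out, train, test in leave_one_language_out(samples):
    print(held_out, len(train), len(test))
```

A per-fold F1 score computed on each `test` split is what a per-language figure such as "F1 > 0.80 on 9 of 10 languages" would summarize.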

Integration with Agent Tool Boundaries

Secret scanning should happen at tool call boundaries:

File write operations: Scan before writing to disk

@tool
def write_file(path: str, content: str):
    scan_result = secret_scanner.classify(content)
    if scan_result.has_genuine_secrets():
        raise SecurityError(
            f"Genuine credentials detected: {scan_result.locations}"
        )
    if scan_result.has_placeholders():
        log_warning(f"Placeholders found: {scan_result.placeholders}")
    filesystem.write(path, content)

Git commit hooks: Block commits with genuine secrets

#!/bin/bash
# pre-commit hook
# Assumes the scanner exits with status 2 when it finds genuine credentials.
python -m secret_scanner --mode=three-class --input=staged_files
status=$?
if [ "$status" -eq 2 ]; then
    echo "Genuine credentials detected. Commit blocked."
    exit 1
fi

API response validation: Scan agent outputs before returning to users

def validate_agent_response(response: str) -> str:
    scan = secret_scanner.classify(response)
    if scan.genuine_count > 0:
        # Redact or reject
        return redact_secrets(response, scan.genuine_locations)
    return response
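
The `redact_secrets` helper is not defined above. A minimal version might look like the following, assuming locations are reported as non-overlapping `(start, end)` character offsets with an exclusive end. That format is an assumption, not something the scanner is known to produce.

```python
def redact_secrets(text, locations, mask="[REDACTED]"):
    """Replace each (start, end) span with a mask.

    Spans are processed from the end of the string backwards so that earlier
    offsets stay valid after each replacement. Overlapping spans are not handled.
    """
    result = text
    for start, end in sorted(locations, reverse=True):
        result = result[:start] + mask + result[end:]
    return result


print(redact_secrets("key=abc123 other=zzz", [(4, 10), (17, 20)]))
```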

Performance Characteristics

The hybrid model achieves:

Matthews Correlation Coefficient: 0.86 (strong correlation between predictions and ground truth)
Macro F1-score: 0.90 (balanced performance across all three classes)
Genuine credential recall: 93% (catches most real leaks)
Genuine credential precision: 89% (about 11% of alerts raised as genuine are false positives)
Placeholder F1: 0.81 (up from 0.54 in character-only models)
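
For readers who want to reproduce these kinds of numbers on their own data, the sketch below computes per-class precision, recall, F1, macro F1, and the multi-class Matthews Correlation Coefficient from a confusion matrix. The matrix values are invented for illustration and do not correspond to the reported results.

```python
import math


def per_class_prf(cm, k):
    """Precision, recall, and F1 for class k; cm[i][j] = true class i predicted as j."""
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)
    actual = sum(cm[k])
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_prf(cm, k)[2] for k in range(len(cm))) / len(cm)


def multiclass_mcc(cm):
    """Multi-class generalization of the Matthews Correlation Coefficient."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    true_counts = [sum(cm[k]) for k in range(n)]
    pred_counts = [sum(row[k] for row in cm) for k in range(n)]
    numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
    denominator = math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    return numerator / denominator if denominator else 0.0


# Rows and columns ordered as genuine, placeholder, benign (made-up counts).
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
print(per_class_prf(cm, 0), macro_f1(cm), multiclass_mcc(cm))
```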

The key improvement is placeholder detection. Earlier models treated placeholders as either genuine (false positive) or benign (missed opportunity to warn developers). The three-class approach explicitly models this middle ground.

Alert reduction matters for operational teams. Reducing high-severity alerts from 373 to 250 (33% reduction) without sacrificing recall means security engineers spend less time investigating false positives.
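
The arithmetic behind the 33% figure is easy to check. The helper name below is just for this example.

```python
def percent_reduction(before: int, after: int) -> float:
    """Relative reduction from before to after, as a percentage."""
    return 100.0 * (before - after) / before


alerts_before, alerts_after = 373, 250
print(
    f"{alerts_before - alerts_after} fewer alerts, "
    f"{percent_reduction(alerts_before, alerts_after):.1f}% reduction"
)
# prints: 123 fewer alerts, 33.0% reduction
```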

Failure Modes

Obfuscated secrets: Base64-encoded or hex-encoded credentials may bypass pattern matching. The semantic encoder helps but is not foolproof.
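
One partial mitigation is to decode anything that looks like Base64 and run the decoded text through the scanner as well. The sketch below is a heuristic under stated assumptions: the 16-character minimum is arbitrary, only UTF-8 payloads are kept, and hex or nested encodings are not covered.

```python
import base64
import binascii
import re

# Runs of Base64 alphabet characters, optionally padded. The threshold is arbitrary.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_variants(text: str) -> list[str]:
    """Return UTF-8 decodings of Base64-looking tokens found in text."""
    variants = []
    for token in B64_TOKEN.findall(text):
        try:
            raw = base64.b64decode(token, validate=True)
            variants.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64, or not text once decoded: skip it.
            continue
    return variants


encoded = base64.b64encode(b"sk-live-abcdef123456").decode("ascii")
line = f'API_KEY_B64 = "{encoded}"'
for variant in decoded_variants(line):
    print("also scan:", variant)
```

Each returned string can then be classified like any other candidate, at the cost of extra false matches on long identifiers that happen to be valid Base64.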

Dynamic secret generation: Secrets assembled at runtime from multiple variables never appear as a single literal, so a static scanner that classifies individual strings will not see them.

# Hard to detect
api_key = f"{prefix}{middle}{suffix}"

Context collapse: Short code snippets lack context. The model performs best with 10-20 lines of surrounding code.

Language drift: New languages or frameworks may introduce novel credential formats. Retraining is required.

Adversarial evasion: Developers who intentionally want to bypass scanning can craft strings that look benign to the model.

Observability and Tuning

Effective secret scanning requires monitoring:

Classification distribution: Track the ratio of genuine/placeholder/benign detections over time. A sudden spike in placeholders might indicate sloppy code generation.
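
A rolling window is one simple way to track that ratio. The class below is a sketch with made-up defaults: the window size and the 30% placeholder threshold are assumptions to tune against your own baseline.

```python
from collections import Counter, deque

CLASS_NAMES = ("genuine", "placeholder", "benign")


class ClassDistributionMonitor:
    """Tracks class ratios over the most recent classifications."""

    def __init__(self, window: int = 1000, placeholder_alert_ratio: float = 0.3):
        self.recent = deque(maxlen=window)  # oldest entries fall off automatically
        self.placeholder_alert_ratio = placeholder_alert_ratio

    def record(self, class_name: str) -> None:
        self.recent.append(class_name)

    def ratios(self) -> dict:
        """Share of each class in the window; empty dict if nothing recorded."""
        total = len(self.recent)
        if total == 0:
            return {}
        counts = Counter(self.recent)
        return {name: counts[name] / total for name in CLASS_NAMES}

    def placeholder_spike(self) -> bool:
        """True when the placeholder share exceeds the configured threshold."""
        return self.ratios().get("placeholder", 0.0) > self.placeholder_alert_ratio


monitor = ClassDistributionMonitor(window=4)
for name in ("benign", "benign", "benign", "placeholder"):
    monitor.record(name)
print(monitor.ratios(), monitor.placeholder_spike())
```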

False positive feedback loop: Let security teams mark false positives and feed them back into training.

Language-specific performance: Monitor F1 scores per language. If a new language shows poor performance, collect more training data for that language.

Latency budget: The hybrid model is slower than pure regex scanning. Budget 50-200ms per file depending on file size and hardware.

# Monitoring pseudo-code
metrics.histogram("secret_scanner.latency_ms", duration)
metrics.counter("secret_scanner.class", tags={"class": result.class_name})
if result.is_genuine():
    metrics.counter("secret_scanner.genuine.type", 
                   tags={"type": result.secret_type})

Technical Verdict

Use this approach when:

You have agent-generated code flowing into production without manual review
Your existing scanner produces too many false positives for security teams to triage
You need to distinguish test fixtures from real credentials
You work across multiple programming languages

Avoid or defer when:

You have strong pre-commit human review processes
Your agents never touch real credentials (they only write application logic)
You can enforce placeholder injection at the framework level
Your security team prefers high recall even at the cost of false positives

The three-class model is not a silver bullet. It reduces false positives but does not eliminate them. It works best as part of a layered defense: block agents from accessing real secrets during generation, scan outputs with the three-class detector, and use runtime secret management to inject credentials only in secure contexts.

For agent pipelines generating code at scale, the 33% reduction in high-severity alerts translates directly to reduced operational burden. The placeholder class gives you a middle ground: warn developers without blocking their workflow.