96 Hours of Autonomous Bounty Hunting: What Breaks When AI Agents Compete for Real GitHub Issues

An engineer ran an autonomous agent against real GitHub bounties for 96 hours. It submitted 240+ PRs, got 72 merged, and earned $500-800. The interesting part is not the money. It’s the failure telemetry from a multi-day stateful workflow that had to manage OAuth tokens, rate limits, repository forks, and competitive task acquisition without human intervention.

This is the plumbing breakdown.

Architecture: The Bounty Hunting Loop

The agent (called ZKA) runs on a 30-minute cron schedule. Each cycle:

Scans GitHub for open bounties via API
Evaluates legitimacy, difficulty, and competition
Clones repositories and analyzes codebases
Writes fixes with tests
Submits PRs with descriptions
Monitors review feedback and responds to bots

The stack is straightforward: GitHub CLI for API interactions, Python for orchestration, Hermes Agent (a self-hosted AI framework), and cron for scheduling.

# Simplified bounty hunting loop. The helper functions stand in for the
# agent's own modules and are not shown here.
import time

while True:
    bounties = search_bounties()
    for bounty in bounties:
        if is_legitimate(bounty) and is_low_competition(bounty):
            repo = clone_repository(bounty.repo_url)
            fix = generate_fix(repo, bounty.issue)
            if run_tests(fix):
                submit_pr(fix, bounty)
    time.sleep(1800)  # 30-minute interval; under cron, one pass runs per invocation instead

The critical detail: this is not a one-shot script. It’s a persistent agent that must maintain state across multiple days, handle API failures, and avoid triggering anti-abuse mechanisms.

Authentication Boundaries and Token Management

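GitHub tokens are limited by scopes or permissions. A token with read-only repository access cannot push. A token with write access can, and its writes are what GitHub’s abuse detection watches most closely. Note that classic OAuth scopes have no repo:read or repo:write: public_repo grants read and write access to public repositories, and repo extends that to private ones. A true read/write split is available through fine-grained personal access tokens or GitHub App permissions.

The agent needs to:

Fork repositories (write access)
Clone forks (read access; public forks can be cloned without a token)
Push commits (write access)
Open PRs (write access)
Comment on issues (write access; public_repo or repo on a classic token)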

The naive approach is to use a single token with full repo scope. This works until GitHub’s abuse detection flags the account for suspicious activity. The agent submitted 240+ PRs in 96 hours. That’s an average of at least 2.5 PRs per hour, around the clock, for four days straight. No human does that.

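The better approach is to use multiple tokens with different privileges and pick one based on operation type. Read operations use a low-privilege token. Write operations use a higher-privilege token. This reduces the blast radius if one token is leaked, revoked, or flagged. It does not buy extra rate limit budget when the tokens belong to the same account, because GitHub counts the primary limit per user, not per token.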
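
A minimal sketch of choosing a token by operation type, assuming the tokens are stored in environment variables. The variable names GH_READ_TOKEN and GH_WRITE_TOKEN and the operation names are hypothetical; the source does not describe how ZKA stores or selects its tokens.

```python
import os

# Hypothetical environment variable names (an assumption, not from the source).
TOKEN_ENV_BY_KIND = {"read": "GH_READ_TOKEN", "write": "GH_WRITE_TOKEN"}

# Each operation the agent performs, mapped to the privilege it needs.
OPERATION_KIND = {
    "clone": "read",
    "fork": "write",
    "push": "write",
    "open_pr": "write",
    "comment": "write",
}


def token_for(operation, env=None):
    """Return the least-privileged token that can perform `operation`."""
    env = os.environ if env is None else env
    kind = OPERATION_KIND[operation]  # unknown operations raise KeyError on purpose
    var = TOKEN_ENV_BY_KIND[kind]
    token = env.get(var)
    if not token:
        raise RuntimeError(f"{var} is not set; cannot perform {operation!r}")
    return token
```

Failing loudly on an unknown operation or a missing variable keeps the agent from silently falling back to the higher-privilege token.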

Token refresh is another issue. Tokens issued with an expiry (GitHub App user access tokens, for example, come with a refresh token) stop working mid-run. The agent needs to detect expiration (usually a 401 response) and refresh the token before retrying the operation. If the refresh fails, the agent should pause and alert a human.
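
The detect, refresh, retry, escalate sequence can be sketched as a small wrapper. AuthExpired and NeedsHuman are hypothetical exception types, and the operation and refresh callables are supplied by the caller; none of these names come from the source.

```python
class AuthExpired(Exception):
    """Raised by the caller's HTTP layer when the API answers 401."""


class NeedsHuman(Exception):
    """Raised when the agent should pause and alert an operator."""


def call_with_refresh(operation, refresh):
    """Run operation(); on a 401, refresh once and retry once."""
    try:
        return operation()
    except AuthExpired:
        pass  # fall through to the refresh path
    try:
        refresh()
    except Exception as exc:
        raise NeedsHuman("token refresh failed") from exc
    try:
        return operation()
    except AuthExpired as exc:
        # A fresh token that is still rejected is not an expiry problem.
        raise NeedsHuman("still unauthorized after refresh") from exc
```

Retrying exactly once keeps a revoked or mis-scoped token from turning into an endless refresh loop.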

Rate Limiting and Backoff Strategies

GitHub’s API has two rate limits:

Primary rate limit: 5,000 requests per hour for authenticated requests
Secondary rate limit: Triggered by rapid bursts of requests, even if under the primary limit

The agent hit the secondary rate limit multiple times. The symptom: 403 responses (GitHub may also answer with 429) carrying a Retry-After header. The cause: submitting PRs too quickly in succession.

The fix: exponential backoff with jitter. After each API call, the agent waits a random interval between 1 and 5 seconds. If it receives a 403, it waits for the duration specified in Retry-After. When that header is missing, it falls back to a wait that doubles with each retry, plus a small random jitter.

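import random
import time

class RateLimitError(Exception):
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, from Retry-After

def api_call_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            wait_time = e.retry_after or (2 ** attempt + random.uniform(0, 1))
            time.sleep(wait_time)
    raise RuntimeError("Max retries exceeded")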

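The agent also tracks its own rate limit budget. Every API response carries an X-RateLimit-Remaining header, and the agent reads it from the most recent response before making the next call. If the remaining budget is below 100, it pauses until the reset time given in X-RateLimit-Reset.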
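
A sketch of that budget check as a pure function. It assumes the caller passes in the headers of the previous response as a mapping; the function name and the decision to return 0 when headers are missing are illustrative choices, not the agent's actual code.

```python
import time


def seconds_to_pause(headers, threshold=100, now=None):
    """Return how long to sleep before the next call, or 0.0 if there is budget.

    `headers` is a mapping of response headers from the previous API call.
    """
    now = time.time() if now is None else now
    lowered = {k.lower(): v for k, v in headers.items()}
    remaining = lowered.get("x-ratelimit-remaining")
    reset = lowered.get("x-ratelimit-reset")  # UTC epoch seconds
    if remaining is None or reset is None:
        return 0.0  # no information; leave failures to the backoff logic
    if int(remaining) >= threshold:
        return 0.0
    return max(0.0, float(reset) - now)
```

Header names are lowercased first because HTTP header names are case-insensitive and client libraries differ in how they expose them.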

State Persistence Across Multi-Day Workflows

A bounty workflow is not atomic. It spans multiple steps:

Find bounty
Fork repository
Clone fork
Write fix
Run tests
Submit PR
Monitor review feedback
Respond to comments
Wait for merge
Claim bounty

If the agent crashes or restarts between steps, it needs to resume from where it left off. This requires persistent state.

The agent uses a SQLite database to track:

Bounties in progress
Repository forks created
PRs submitted
Review comments received
Merge status

Each bounty has a state machine:

discovered → forked → cloned → fixed → tested → submitted → under_review → merged → claimed

The agent queries the database at the start of each cycle to determine which bounties need attention. If a PR is in under_review, it checks for new comments. If a PR is in merged, it attempts to claim the bounty.
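
A minimal sketch of how such a state machine could be stored in SQLite. The table name, column names, and helper functions are assumptions made for illustration; the source only says that SQLite holds the state.

```python
import sqlite3

STATES = [
    "discovered", "forked", "cloned", "fixed", "tested",
    "submitted", "under_review", "merged", "claimed",
]
NEXT_STATE = dict(zip(STATES, STATES[1:]))  # each state maps to its successor


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS bounties ("
        " issue_url TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'discovered')"
    )
    return db


def advance(db, issue_url):
    """Move one bounty to its next state and return the new state."""
    row = db.execute(
        "SELECT state FROM bounties WHERE issue_url = ?", (issue_url,)
    ).fetchone()
    if row is None:
        raise KeyError(issue_url)
    new_state = NEXT_STATE.get(row[0])
    if new_state is None:
        raise ValueError(f"{row[0]!r} is a terminal state")
    with db:  # commits on success, rolls back if the UPDATE raises
        db.execute(
            "UPDATE bounties SET state = ? WHERE issue_url = ?",
            (new_state, issue_url),
        )
    return new_state


def needs_attention(db):
    """Unfinished bounties, queried at the start of each cycle."""
    return db.execute(
        "SELECT issue_url, state FROM bounties WHERE state != 'claimed' "
        "ORDER BY issue_url"
    ).fetchall()
```

Committing each transition on its own means a crash between steps leaves the bounty in the last completed state, which is exactly where the next cycle resumes.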

State	Next Action	Failure Mode
`discovered`	Fork repository	Repository is archived or private
`forked`	Clone fork	Fork creation failed silently
`cloned`	Generate fix	Codebase analysis timeout
`fixed`	Run tests	Tests fail due to environment mismatch
`tested`	Submit PR	PR already exists for the same issue
`submitted`	Monitor reviews	Review bot requests changes
`under_review`	Respond to comments	Comment parsing fails
`merged`	Claim bounty	Bounty platform API is down

The most common failure mode: tests pass locally but fail in CI. The agent has no visibility into the CI environment. It can only see the final status (pass/fail) and any logs the CI system exposes. If the logs are not machine-readable, the agent cannot diagnose the failure.

Isolation Boundaries: Preventing Accidental Commits

The agent operates on forks, not upstream repositories. This is critical. If the agent accidentally pushes to the upstream repository, it could corrupt the main branch or trigger security alerts.

The isolation strategy:

Always fork before cloning
Set the fork as the origin remote
Set the upstream repository as the upstream remote (read-only)
Never push to upstream

The agent validates the remote configuration before every push:

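import subprocess

def get_remote_url(repo_path, name):
    # `git remote get-url` exits non-zero when the remote does not exist
    return subprocess.check_output(
        ["git", "remote", "get-url", name], cwd=repo_path, text=True
    ).strip()

def safe_push(repo_path, branch):
    try:
        origin_url = get_remote_url(repo_path, "origin")
    except subprocess.CalledProcessError as exc:
        raise RuntimeError("No origin remote found") from exc

    # is_fork is the agent's own check: it must confirm that origin_url
    # points at the agent's fork rather than the upstream repository
    if not is_fork(origin_url):
        raise RuntimeError("Origin is not a fork")

    # Name the remote explicitly so a misconfigured default cannot redirect the push
    subprocess.run(["git", "push", "origin", branch], cwd=repo_path, check=True)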
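
The safe_push check above relies on an is_fork helper that the article does not define. One possible shape, assuming "fork" means "a GitHub repository under the agent's own account", is sketched below. The account name zka-bot is a made-up placeholder.

```python
import re

# Accepts https://github.com/owner/repo(.git) and git@github.com:owner/repo(.git)
_GITHUB_URL = re.compile(
    r"(?:^|[/@])github\.com[:/](?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def is_fork(origin_url, agent_login="zka-bot"):
    """Treat origin as safe only if it lives under the agent's own account.

    `agent_login` is a hypothetical account name used for illustration.
    """
    match = _GITHUB_URL.search(origin_url)
    if match is None:
        return False
    return match.group("owner").lower() == agent_login.lower()
```

This is an ownership check on the URL only. A stricter variant could also ask the GitHub API for the repository and confirm its fork field is true.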

Another isolation boundary: the agent runs in a containerized environment with limited file system access. It can only write to a designated workspace directory. This prevents it from accidentally modifying system files or other repositories.
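
Container limits can be backed by an in-process check on every path the agent is about to write. The sketch below is an assumption about how such a guard might look, not a description of ZKA's code; the function name is hypothetical.

```python
from pathlib import Path


def resolve_in_workspace(workspace, relative):
    """Return the absolute path for `relative`, refusing anything outside `workspace`."""
    root = Path(workspace).resolve()
    target = (root / relative).resolve()  # collapses ".." and follows symlinks
    if target != root and root not in target.parents:
        raise PermissionError(f"{target} is outside {root}")
    return target
```

Resolving before comparing matters: a plain string prefix check would accept paths such as workspace/../other-repo.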

Observability: What You Need to See

The agent logs every API call, every state transition, and every error. The logs are structured JSON, not plain text. This makes them queryable.
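
A minimal sketch of structured JSON logging with the standard library. The field names (ts, level, event) and the convention of passing extra data under a fields key are illustrative assumptions; the source does not show the agent's log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name="zka", stream=None):
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A state transition would then be logged as make_logger().info("state_transition", extra={"fields": {"from": "tested", "to": "submitted"}}), which produces a single line that log tooling can filter by key.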

Key metrics tracked:

Bounties discovered per cycle
PRs submitted per hour
Merge rate (merged PRs / submitted PRs)
Average time from submission to merge
Rate limit budget remaining
Error rate by type (API errors, test failures, merge conflicts)

The agent also exposes a Prometheus endpoint for real-time monitoring. Grafana dashboards show:

Active bounties by state
API call latency
Rate limit consumption over time
Error spikes

The most useful alert: “No PRs submitted in the last 2 hours.” This usually means the agent is stuck in a retry loop or the bounty platform API is down.
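
The stall condition behind that alert is simple enough to express directly. The function below is a plain Python stand-in for whatever alerting rule the real setup uses; its name and the choice to treat "nothing submitted yet" as stalled are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(last_pr_submitted_at, now=None, threshold=timedelta(hours=2)):
    """True when no PR has been submitted within `threshold`."""
    now = datetime.now(timezone.utc) if now is None else now
    if last_pr_submitted_at is None:
        return True  # nothing submitted yet also deserves a look
    return now - last_pr_submitted_at > threshold
```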

What Actually Broke

The agent ran for 96 hours. Here’s what failed:

Test environment mismatches: 40% of PRs that passed local tests failed in CI. The agent could not reproduce the CI environment locally. The fix: run tests in a Docker container that matches the CI environment.
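
A sketch of building the container invocation for that fix. The image name and test command must come from somewhere (for example the repository's CI configuration); here they are plain parameters, and the /work mount point is an arbitrary choice.

```python
def docker_test_command(image, repo_path, test_cmd):
    """Build a `docker run` argv that runs the test suite inside `image`."""
    return [
        "docker", "run",
        "--rm",                      # remove the container afterwards
        "-v", f"{repo_path}:/work",  # mount the checkout into the container
        "-w", "/work",               # run from the checkout root
        image,
        *test_cmd,
    ]
```

The result can be handed to subprocess.run, with the exit code deciding whether the bounty moves from fixed to tested. Parity is only as good as the image: if CI installs extra services or secrets, the container still will not match.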

Review bot parsing: CodeRabbit and Cubic post structured comments with suggested changes. The agent could not parse these comments reliably. It treated them as human feedback and generated nonsensical responses. The fix: add explicit parsers for known review bots.
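
A first step toward explicit parsers is telling bot comments from human ones and pulling out machine-applicable suggestions. The sketch assumes comments arrive in the GitHub API's JSON shape (a user object with login and type) and that the bot uses GitHub's suggested-change blocks, which are fenced blocks tagged suggestion. Real bots format their comments differently, so each one needs to be checked against live output.

```python
import re

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example
SUGGESTION = re.compile(
    re.escape(FENCE) + r"suggestion[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def classify_comment(comment):
    """Return the author kind and any suggested-change bodies for one comment."""
    user = comment.get("user") or {}
    is_bot = user.get("type") == "Bot" or user.get("login", "").endswith("[bot]")
    suggestions = SUGGESTION.findall(comment.get("body") or "")
    return {"is_bot": is_bot, "suggestions": [s.rstrip("\n") for s in suggestions]}
```

Bot comments with suggestions can be applied as patches. Bot comments without them are better routed to a bot-specific parser than to the language model that answers humans.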

Merge conflicts: When multiple agents (or humans) work on the same repository, merge conflicts are inevitable. The agent could not resolve conflicts automatically. It marked the PR as failed and moved on. The fix: implement a conflict resolution strategy (rebase on upstream, regenerate fix, resubmit).
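
The rebase-first part of that strategy might look like the sketch below. It assumes the upstream remote is already configured and that the default branch is called main; both are assumptions, as is the function name. The run parameter exists so the git calls can be replaced in tests.

```python
import subprocess


def rebase_on_upstream(repo_path, base_branch="main", run=subprocess.run):
    """Try to rebase the current branch on upstream.

    Returns "rebased" on success, or "regenerate" when the rebase conflicts,
    signalling the caller to rebuild the fix from a fresh upstream checkout.
    """
    run(["git", "fetch", "upstream", base_branch], cwd=repo_path, check=True)
    result = run(
        ["git", "rebase", f"upstream/{base_branch}"], cwd=repo_path, check=False
    )
    if result.returncode == 0:
        return "rebased"
    # Leave the working tree clean before handing control back.
    run(["git", "rebase", "--abort"], cwd=repo_path, check=True)
    return "regenerate"
```

After a successful rebase the PR branch has rewritten history, so the follow-up push to the fork needs git push --force-with-lease rather than a plain push.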

Bounty platform API instability: The bounty platform API went down twice during the 96-hour run. The agent could not claim bounties during these outages. The fix: implement a retry queue with exponential backoff for bounty claims.
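
A minimal in-memory sketch of such a retry queue. The class name, the delay values, and the claim callable are hypothetical; a real version would persist the queue alongside the rest of the agent's state so pending claims survive a restart.

```python
import heapq
import itertools


class ClaimRetryQueue:
    """Queue of bounty claims, each retried with an exponentially growing delay."""

    def __init__(self, base_delay=60.0, max_delay=3600.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._heap = []  # entries: (due_time, tiebreak, bounty_id, attempt)
        self._counter = itertools.count()

    def schedule(self, bounty_id, now, attempt=0):
        # First attempt runs immediately; retry n waits base_delay * 2**(n-1).
        delay = 0.0 if attempt == 0 else min(
            self.max_delay, self.base_delay * 2 ** (attempt - 1)
        )
        heapq.heappush(
            self._heap, (now + delay, next(self._counter), bounty_id, attempt)
        )

    def run_due(self, now, claim):
        """Call claim(bounty_id) for every due entry; reschedule the ones that fail."""
        claimed = []
        while self._heap and self._heap[0][0] <= now:
            _, _, bounty_id, attempt = heapq.heappop(self._heap)
            try:
                claim(bounty_id)
                claimed.append(bounty_id)
            except Exception:  # narrow this to the platform client's errors in practice
                self.schedule(bounty_id, now, attempt + 1)
        return claimed
```

Calling run_due once per cron cycle is enough: entries that are not yet due stay in the heap, and the delay cap keeps a long outage from pushing retries out indefinitely.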

Token expiration: One of the OAuth tokens expired mid-run. The agent detected the 401 response but failed to refresh the token because the refresh token was also expired. The fix: monitor token expiration proactively and refresh before expiration.
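
Proactive monitoring means checking both expiry times on every cycle, since a refresh token that has lapsed cannot be recovered without a human. The margins and return values below are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def token_health(
    access_expires_at,
    refresh_expires_at,
    now=None,
    access_margin=timedelta(minutes=30),
    refresh_margin=timedelta(days=7),
):
    """Classify a token pair as "ok", "refresh_now", or "alert_human"."""
    now = datetime.now(timezone.utc) if now is None else now
    if refresh_expires_at - now <= refresh_margin:
        return "alert_human"  # re-authorization needs a person; warn early
    if access_expires_at - now <= access_margin:
        return "refresh_now"
    return "ok"
```

The refresh token is checked first on purpose: once it is close to expiry, refreshing the access token only postpones the outage.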

Deployment Shape

The agent runs on a single EC2 instance (t3.medium). It does not need horizontal scaling because the bottleneck is API rate limits, not compute. Adding more instances would just hit rate limits faster.

The deployment includes:

Cron daemon for scheduling
SQLite database for state persistence
Docker for test isolation
Prometheus for metrics
Grafana for dashboards
CloudWatch for logs

The agent is stateful, so it is a poor fit for a stateless serverless function. It needs persistent storage for the SQLite database and cloned repositories, and its cycles can run longer than typical function timeouts allow.

Likely Failure Modes in Production

If you run a similar agent, expect these failures:

Rate limit exhaustion: GitHub’s secondary rate limit is unpredictable. You will hit it.
Test environment drift: Some tests that pass locally will fail in CI. You need environment parity.
Review bot incompatibility: Every repository uses different review bots. You need custom parsers.
Merge conflicts: Multiple agents competing for the same bounties will create conflicts.
Token expiration: OAuth tokens expire. You need proactive refresh.
Bounty platform downtime: Third-party APIs go down. You need retry queues.
Repository access changes: Repositories can become private or archived mid-workflow. You need to handle 404s gracefully.

Technical Verdict

Use this approach when:

You have a high-volume, repetitive workflow (e.g., bounty hunting, issue triage, dependency updates)
You can tolerate a low success rate (this run merged 72 of 240+ PRs, roughly 30%)
You have robust observability and alerting
You can handle OAuth token management and rate limiting
You are willing to invest in test environment parity

Avoid this approach when:

You need 100% reliability
You cannot afford to hit API rate limits
You do not have structured logs and metrics
You are working with repositories that have complex CI/CD pipelines
You cannot handle merge conflicts programmatically

The agent earned $500-800 in 96 hours, but it also generated noise: 160+ PRs that were never merged. If you run this in production, you need to balance throughput with quality. The best strategy: start with a whitelist of known-good repositories, monitor merge rates closely, and expand gradually.
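
That gating strategy can be sketched as a single decision function. The thresholds (a 30% minimum merge rate, five PRs of history before judging a repository) and the shape of the stats mapping are assumptions chosen for illustration.

```python
def should_attempt(repo, allowlist, stats, min_merge_rate=0.3, min_samples=5):
    """Decide whether to pick up a bounty in `repo`.

    `stats` maps repo name -> (merged, submitted) counts from earlier cycles.
    """
    if repo not in allowlist:
        return False
    merged, submitted = stats.get(repo, (0, 0))
    if submitted < min_samples:
        return True  # not enough history yet; listed repos get a trial
    return merged / submitted >= min_merge_rate
```

Expanding gradually then means adding repositories to the list one at a time and letting the merge rate decide whether they stay.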

Source Links

Primary source: 96-hour bounty hunting experiment