Building a GitHub Repo Tracker Agent: Polling, Diffing, and Notification Plumbing for Daily Updates

Source: dev.to
A GitHub repository tracker is a straightforward agentic pattern: poll APIs, detect changes, store state, and route notifications. The implementation choices matter because you are balancing API rate limits, storage overhead, notification noise, and failure recovery.

This article walks through the plumbing of a working repo tracker agent built with Python, SQLite, and cron. It covers polling vs webhooks, diff detection strategies, state persistence, notification routing, and rate-limit handling.

Why Poll Instead of Webhooks

GitHub webhooks deliver real-time events, but they require a publicly accessible endpoint, TLS termination, and webhook secret validation. For a personal tracker monitoring a dozen repos, polling is simpler:

  • No server infrastructure to maintain
  • No firewall rules or reverse proxy configuration
  • No webhook signature verification code
  • Works from a laptop, cron job, or serverless function

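The trade-off is API rate limits. GitHub allows 5,000 authenticated requests per hour. If you poll 20 repos every 15 minutes with one request per repo, you consume 80 requests per hour. A tracker that calls separate endpoints for commits, releases, and counts multiplies that figure by the number of calls per repo, which still leaves ample headroom for other API calls.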
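The budget arithmetic is easy to check for other repo counts and intervals. This is a minimal sketch; the `calls_per_repo` parameter is an assumption, since the number of endpoints hit per repo depends on how the state is fetched.

```python
def requests_per_hour(repo_count: int, interval_minutes: int, calls_per_repo: int = 1) -> int:
    """Estimate hourly API requests for a fixed polling schedule."""
    runs_per_hour = 60 // interval_minutes
    return repo_count * runs_per_hour * calls_per_repo


HOURLY_LIMIT = 5000  # authenticated limit quoted in the text

one_call = requests_per_hour(20, 15)       # 20 repos * 4 runs * 1 call
three_calls = requests_per_hour(20, 15, 3)  # same schedule, 3 calls per repo
print(one_call, three_calls, HOURLY_LIMIT - three_calls)
```

Even at three calls per repo, the schedule uses under 5% of the hourly limit.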

When to use webhooks instead:

  • You monitor hundreds of repos
  • You need sub-minute latency
  • You already run a persistent service with a public endpoint

State Persistence: What to Track

A repo tracker needs to persist enough state to detect changes between runs. The minimal schema includes:

  • Repository identifier (owner/name)
  • Last commit SHA on the default branch
  • Last release tag
  • Issue count
  • Pull request count
  • Last check timestamp

SQLite works well for this. It is a single file, requires no daemon, and supports concurrent reads. The schema:

CREATE TABLE repos (
    id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    name TEXT NOT NULL,
    last_commit_sha TEXT,
    last_release_tag TEXT,
    issue_count INTEGER,
    pr_count INTEGER,
    last_checked TIMESTAMP,
    UNIQUE(owner, name)
);

CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    repo_id INTEGER,
    event_type TEXT,
    event_data TEXT,
    detected_at TIMESTAMP,
    FOREIGN KEY(repo_id) REFERENCES repos(id)
);

The events table logs every detected change. This gives you an audit trail and supports notification deduplication.
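Reading and writing the per-repo snapshot might look like the following sketch. The helper names `save_state` and `load_state`, the state dictionary keys, and the `octocat/hello-world` repo are illustrative assumptions. The upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS repos (
    id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    name TEXT NOT NULL,
    last_commit_sha TEXT,
    last_release_tag TEXT,
    issue_count INTEGER,
    pr_count INTEGER,
    last_checked TIMESTAMP,
    UNIQUE(owner, name)
);
"""


def save_state(conn, owner, name, state):
    """Insert or update the stored snapshot for one repo."""
    conn.execute(
        """
        INSERT INTO repos (owner, name, last_commit_sha, last_release_tag,
                           issue_count, pr_count, last_checked)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(owner, name) DO UPDATE SET
            last_commit_sha = excluded.last_commit_sha,
            last_release_tag = excluded.last_release_tag,
            issue_count = excluded.issue_count,
            pr_count = excluded.pr_count,
            last_checked = excluded.last_checked
        """,
        (owner, name, state["commit_sha"], state["release_tag"],
         state["issue_count"], state["pr_count"],
         datetime.now(timezone.utc).isoformat(timespec="seconds")),
    )


def load_state(conn, owner, name):
    """Return the previous snapshot as a dict, or None on the first run."""
    row = conn.execute(
        "SELECT last_commit_sha, last_release_tag, issue_count, pr_count "
        "FROM repos WHERE owner = ? AND name = ?",
        (owner, name),
    ).fetchone()
    if row is None:
        return None
    return {"commit_sha": row[0], "release_tag": row[1],
            "issue_count": row[2], "pr_count": row[3]}


conn = sqlite3.connect(":memory:")  # use a file path in a real deployment
conn.executescript(SCHEMA)
save_state(conn, "octocat", "hello-world",
           {"commit_sha": "abc123", "release_tag": "v1.0.0",
            "issue_count": 4, "pr_count": 1})
print(load_state(conn, "octocat", "hello-world"))
```

Returning `None` for an unknown repo lets the caller treat the first run as a baseline rather than a change.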

Diff Detection Strategy

The agent fetches current state from the GitHub API and compares it to the last known state. Changes trigger events:

Change Type            Detection Method   Notification Threshold
New commit             SHA comparison     Always notify
New release            Tag comparison     Always notify
Issue count increase   Numeric diff       Notify if delta > 5
PR count increase      Numeric diff       Notify if delta > 2
Issue count decrease   Numeric diff       Silent log only
PR count decrease      Numeric diff       Silent log only

The thresholds prevent notification spam when issue counts fluctuate by one or two. You can tune these per repo or per event type.

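def detect_changes(repo_id, current_state, previous_state):
    """Compare two state snapshots and return a list of change events."""
    events = []

    # First run for this repo: there is no baseline to compare against.
    if previous_state is None:
        return events

    if current_state['commit_sha'] != previous_state['commit_sha']:
        events.append({
            'repo_id': repo_id,
            'type': 'new_commit',
            'data': current_state['commit_sha']
        })

    # A repo without releases has no tag, so only report an actual new tag.
    if (current_state['release_tag'] is not None
            and current_state['release_tag'] != previous_state['release_tag']):
        events.append({
            'repo_id': repo_id,
            'type': 'new_release',
            'data': current_state['release_tag']
        })

    # Count increases notify only above the thresholds listed in the table.
    issue_delta = current_state['issue_count'] - previous_state['issue_count']
    if issue_delta > 5:
        events.append({
            'repo_id': repo_id,
            'type': 'issue_spike',
            'data': issue_delta
        })

    pr_delta = current_state['pr_count'] - previous_state['pr_count']
    if pr_delta > 2:
        events.append({
            'repo_id': repo_id,
            'type': 'pr_spike',
            'data': pr_delta
        })

    return events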
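Per-repo tuning of the thresholds could be handled with a small lookup. This is a sketch under assumptions: the override table, the `example-org/busy-repo` name, and the `pr_spike` event name are hypothetical, and the defaults are the values from the threshold table.

```python
DEFAULT_THRESHOLDS = {"issue_spike": 5, "pr_spike": 2}  # values from the table

# Hypothetical per-repo overrides keyed by "owner/name".
REPO_OVERRIDES = {
    "example-org/busy-repo": {"issue_spike": 20},
}


def threshold_for(repo, event_type, overrides=REPO_OVERRIDES):
    """Return the override for this repo if present, else the default."""
    return overrides.get(repo, {}).get(event_type, DEFAULT_THRESHOLDS[event_type])


def exceeds_threshold(repo, event_type, delta):
    """True when a count increase is large enough to notify."""
    return delta > threshold_for(repo, event_type)


print(threshold_for("example-org/busy-repo", "issue_spike"))
print(exceeds_threshold("example-org/quiet-repo", "pr_spike", 3))
```

Keeping the overrides in data rather than in the comparison logic means a noisy repo can be quieted without touching the detection code.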

API Rate Limit Handling

GitHub returns rate limit headers with every response:

  • X-RateLimit-Limit: Total requests allowed per hour
  • X-RateLimit-Remaining: Requests left in current window
  • X-RateLimit-Reset: Unix timestamp when the limit resets

The agent checks these headers after every request. If remaining requests drop below a threshold (e.g., 100), it logs a warning and skips non-critical repos.

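import logging
import time

logger = logging.getLogger(__name__)

def check_rate_limit(response, threshold=100):
    remaining = response.headers.get('X-RateLimit-Remaining')
    if remaining is None:
        return True  # header absent: nothing to enforce
    reset_time = int(response.headers.get('X-RateLimit-Reset', 0))

    if int(remaining) < threshold:
        # Clamp at zero in case the reset time has already passed
        wait_seconds = max(0, int(reset_time - time.time()))
        logger.warning(f"Rate limit low: {remaining} remaining, resets in {wait_seconds}s")
        return False
    return True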

For long-running agents, you can implement exponential backoff or prioritize repos by importance.
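Both ideas fit in a few lines. The sketch below assumes repos carry a numeric priority (lower means more important) and that a reserve of 100 requests is held back. The function names and the jitter strategy are illustrative choices, not part of the original agent.

```python
import random


def select_repos(repos, remaining, reserve=100, calls_per_repo=1):
    """Pick the highest-priority repos that fit in the remaining budget.

    repos is a list of (name, priority) tuples; lower priority = more important.
    """
    budget = max(0, remaining - reserve) // calls_per_repo
    ordered = sorted(repos, key=lambda repo: repo[1])
    return [name for name, _ in ordered[:budget]]


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=True):
    """Exponential backoff: base * 2**attempt seconds, capped at `cap`.

    With jitter enabled, a random delay between 0 and that value is returned
    so that retries from several runs do not line up.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay


repos = [("org/docs", 3), ("org/core", 1), ("org/cli", 2)]
print(select_repos(repos, remaining=102))
print([backoff_delay(n, jitter=False) for n in range(4)])
```

With 102 requests remaining and 100 held in reserve, only the two most important repos are polled in this run.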

Notification Routing

The agent supports multiple notification channels: Slack, email, Discord, or stdout. The routing logic is a simple plugin pattern:

class NotificationRouter:
    def __init__(self):
        self.handlers = []
    
    def register(self, handler):
        self.handlers.append(handler)
    
    def send(self, event):
        for handler in self.handlers:
            try:
                handler.send(event)
            except Exception as e:
                logger.error(f"Notification failed: {e}")

Each handler implements a send(event) method. The Slack handler posts to a webhook URL, the email handler uses SMTP, and the stdout handler prints JSON.
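Two such handlers might look like this. The class names and message format are assumptions for illustration, and the webhook URL used with the Slack handler would be your own. Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import sys
import urllib.request


class StdoutHandler:
    """Print each event as one JSON line."""

    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stdout

    def send(self, event):
        self.stream.write(json.dumps(event, sort_keys=True) + "\n")


class SlackWebhookHandler:
    """Post each event to a Slack incoming webhook."""

    def __init__(self, webhook_url, timeout=10):
        self.webhook_url = webhook_url
        self.timeout = timeout

    def build_request(self, event):
        body = json.dumps({"text": f"[{event['type']}] {event['data']}"})
        return urllib.request.Request(
            self.webhook_url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def send(self, event):
        # Raises on network errors; the router's try/except logs the failure.
        with urllib.request.urlopen(self.build_request(event), timeout=self.timeout) as resp:
            resp.read()


StdoutHandler().send({"type": "new_release", "data": "v1.2.0"})
```

Separating `build_request` from `send` keeps the payload format testable without a network call.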

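Deduplication: Before sending, the router checks the events table. If an identical event (same repo, type, and data) was already recorded in the last 24 hours, it skips the notification. Because events are logged before routing, the check has to exclude the row just written for the current event; otherwise every event would match itself.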
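A deduplication lookup against the events table could be sketched as follows. The helper names, the `exclude_id` parameter, and storing timestamps as UTC ISO-8601 strings are assumptions; the foreign key is omitted from the demo table for brevity.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def _stamp(moment):
    """Format a datetime as a UTC ISO-8601 string that sorts chronologically."""
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")


def record_event(conn, repo_id, event_type, event_data, now=None):
    """Log an event and return its row id."""
    now = now or datetime.now(timezone.utc)
    cursor = conn.execute(
        "INSERT INTO events (repo_id, event_type, event_data, detected_at) "
        "VALUES (?, ?, ?, ?)",
        (repo_id, event_type, str(event_data), _stamp(now)),
    )
    return cursor.lastrowid


def is_duplicate(conn, repo_id, event_type, event_data,
                 exclude_id=None, window_hours=24, now=None):
    """True if a matching event, other than exclude_id, is inside the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = _stamp(now - timedelta(hours=window_hours))
    row = conn.execute(
        "SELECT 1 FROM events WHERE repo_id = ? AND event_type = ? "
        "AND event_data = ? AND detected_at >= ? "
        "AND (? IS NULL OR id != ?) LIMIT 1",
        (repo_id, event_type, str(event_data), cutoff, exclude_id, exclude_id),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, repo_id INTEGER, "
    "event_type TEXT, event_data TEXT, detected_at TIMESTAMP)"
)
first = record_event(conn, 1, "new_release", "v1.0.0")
second = record_event(conn, 1, "new_release", "v1.0.0")
print(is_duplicate(conn, 1, "new_release", "v1.0.0", exclude_id=second))
```

Passing the id of the row just written as `exclude_id` prevents an event from matching itself.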

Orchestration Flow

The agent runs as a cron job every 15 minutes. The execution flow:

  1. Load repo list from config file
  2. For each repo:
    • Fetch current state from GitHub API
    • Load previous state from SQLite
    • Detect changes
    • Log events to database
  3. For each new event:
    • Check deduplication rules
    • Route to notification handlers
  4. Save the new state and update the last_checked timestamp

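The entire run takes 5-10 seconds for 20 repos. If a run fails (network error, API timeout), cron does not retry it; the next scheduled run picks up 15 minutes later. No state is lost because the database transaction commits only after all repos are processed, so a failed run rolls back and the next run detects the same changes again.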
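The flow can be sketched as a single function wrapped in one transaction. To stay short, this version tracks only the commit SHA. The `fetch_state` and `notify` callables are hypothetical stand-ins for the GitHub client and the router, and sending notifications only after the commit succeeds is an assumption of this sketch.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS repos (
    id INTEGER PRIMARY KEY, owner TEXT NOT NULL, name TEXT NOT NULL,
    last_commit_sha TEXT, last_checked TIMESTAMP, UNIQUE(owner, name));
CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY, repo_id INTEGER, event_type TEXT,
    event_data TEXT, detected_at TIMESTAMP);
"""


def run_once(conn, repos, fetch_state, notify):
    """One polling pass over `repos`, a list of (owner, name) tuples."""
    pending = []
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with conn:  # commits on success, rolls back if an exception escapes
        for owner, name in repos:
            try:
                current = fetch_state(owner, name)
            except OSError as exc:  # network errors and timeouts
                print(f"skipping {owner}/{name}: {exc}")
                continue
            row = conn.execute(
                "SELECT id, last_commit_sha FROM repos WHERE owner = ? AND name = ?",
                (owner, name),
            ).fetchone()
            if row is None:
                # First sighting: store a baseline and emit no event.
                conn.execute(
                    "INSERT INTO repos (owner, name, last_commit_sha, last_checked) "
                    "VALUES (?, ?, ?, ?)",
                    (owner, name, current["commit_sha"], now),
                )
                continue
            repo_id, previous_sha = row
            if current["commit_sha"] != previous_sha:
                conn.execute(
                    "INSERT INTO events (repo_id, event_type, event_data, detected_at) "
                    "VALUES (?, ?, ?, ?)",
                    (repo_id, "new_commit", current["commit_sha"], now),
                )
                pending.append({"repo": f"{owner}/{name}", "type": "new_commit",
                                "data": current["commit_sha"]})
            conn.execute(
                "UPDATE repos SET last_commit_sha = ?, last_checked = ? WHERE id = ?",
                (current["commit_sha"], now, repo_id),
            )
    for event in pending:  # state is committed before anything is sent
        notify(event)
    return pending


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
fake_shas = {"hello-world": "abc123"}  # stand-in for real API responses
fetch = lambda owner, name: {"commit_sha": fake_shas[name]}
run_once(conn, [("octocat", "hello-world")], fetch, print)  # baseline run
fake_shas["hello-world"] = "def456"
run_once(conn, [("octocat", "hello-world")], fetch, print)  # reports one event
```

A repo whose fetch fails is skipped without aborting the run, which matches the "partial run" behavior described for API timeouts.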

Failure Modes and Recovery

Failure                     Impact                            Recovery
GitHub API timeout          Partial run, some repos skipped   Next cron run retries
SQLite lock contention      Write fails, no state update      Next run detects same changes
Notification webhook down   Event logged, notification lost   Manual replay from events table
Rate limit exhausted        All requests fail                 Wait for reset, resume next hour
Corrupt database file       Agent crashes                     Restore from backup, rebuild state

The agent logs every error with context (repo name, API endpoint, error message). For critical failures, it sends a notification to a separate alerting channel.

Observability

The agent writes structured logs to stdout and a rotating log file. Each log entry includes:

  • Timestamp
  • Repo identifier
  • Event type
  • API response time
  • Rate limit remaining

You can pipe logs to a log aggregator (Loki, Datadog) or parse them with jq for quick analysis:

jq 'select(.event_type == "new_release")' github-monitor.log
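For jq to parse the log, every line has to be a JSON object. One way to get that from Python's standard `logging` module is a custom formatter. This is a sketch; the field names other than `event_type` are assumptions chosen to match the list of log fields.

```python
import json
import logging
import sys
from datetime import datetime, timezone
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    FIELDS = ("repo", "event_type", "response_ms", "rate_limit_remaining")

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy over any structured fields passed through `extra=`.
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(log_path=None):
    """Log JSON to stdout, and to a rotating file when log_path is given."""
    logger = logging.getLogger("github_monitor")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handlers = [logging.StreamHandler(sys.stdout)]
    if log_path:
        handlers.append(RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=5))
    for handler in handlers:
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


log = build_logger()
log.info("release detected", extra={
    "repo": "octocat/hello-world", "event_type": "new_release",
    "response_ms": 182, "rate_limit_remaining": 4910,
})
```

If the cron line also redirects stderr into the same file, tracebacks will appear as non-JSON lines that jq rejects, so a separate file for stderr may be worth considering.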

For dashboards, the SQLite database is queryable directly. A simple Flask app can serve a web UI showing recent events, repo health, and API usage trends.

Deployment Shape

The minimal deployment is a cron job on a single machine:

*/15 * * * * /usr/bin/python3 /path/to/github_monitor.py >> /var/log/github-monitor.log 2>&1

For redundancy, run the agent on two machines with staggered schedules (e.g., 0,15,30,45 and 7,22,37,52). Both instances then need to share one state store. SQLite allows only one writer at a time, and its file locking is unreliable on network filesystems, so make sure only one instance runs at a time: use file locking on a single host or a distributed lock (Redis, etcd) across hosts.
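On a single host, the file-locking option can be sketched with `fcntl.flock`. This is Unix-only and guards against overlapping runs on one machine, not across machines. The context manager name and lock path are assumptions.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this process obtained the exclusive lock, else False."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking: fail immediately if another run holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


lock_file = os.path.join(tempfile.gettempdir(), "github-monitor.lock")
with single_instance(lock_file) as acquired:
    if acquired:
        print("lock held, running the agent")
    else:
        print("another run is in progress, exiting")
```

A run that finds the lock taken exits quietly and leaves the work to the next scheduled slot.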

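For serverless, package the agent as a Lambda function triggered by EventBridge. The SQLite file can live on EFS, or in S3 if each invocation downloads it before the run and uploads it afterward, since SQLite cannot open an S3 object directly. Cold start latency is acceptable because the agent runs every 15 minutes.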

Security Boundaries

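The agent needs a GitHub personal access token. The classic repo scope grants full read and write access to private repositories, so prefer the narrowest option that works: a fine-grained token with read-only permissions, or a classic token with no scopes if you only track public repos. Store it in an environment variable or secret manager, never in code.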
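Reading the token from the environment and building request headers might look like this. The `GITHUB_TOKEN` variable name and the `github_headers` helper are assumptions; the header values follow GitHub's REST API conventions.

```python
import os


def github_headers(env=os.environ):
    """Build REST API headers from a token held in the environment."""
    token = env.get("GITHUB_TOKEN")  # assumed variable name
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }


# Demo with a placeholder value instead of a real token.
headers = github_headers({"GITHUB_TOKEN": "example-token"})
print(sorted(headers))
```

Failing fast on a missing token avoids silently falling back to the much lower unauthenticated rate limit.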

If you expose the web dashboard, add authentication. The SQLite database contains repo names and event history, which may reveal internal projects or security issues.

For notification webhooks (Slack, Discord), validate the destination URL. A misconfigured webhook could leak event data to an attacker-controlled endpoint.
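A basic destination check can be an allowlist on scheme and host. This is a sketch; the set of allowed hosts is an assumption and should reflect the services you actually use.

```python
from urllib.parse import urlsplit

# Assumed allowlist; adjust to the webhook providers in your deployment.
ALLOWED_WEBHOOK_HOSTS = {"hooks.slack.com", "discord.com"}


def is_allowed_webhook(url, allowed_hosts=ALLOWED_WEBHOOK_HOSTS):
    """Accept only HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed_hosts


print(is_allowed_webhook("https://hooks.slack.com/services/T000/B000/XXXX"))
print(is_allowed_webhook("https://hooks.slack.com.evil.example/collect"))
```

Comparing the parsed hostname exactly, rather than testing whether the URL string contains an allowed name, rejects lookalike domains and URLs that hide the real host behind a userinfo prefix.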

Technical Verdict

Use this pattern when:

  • You monitor fewer than 100 repos
  • You need change detection, not real-time streaming
  • You run on a single machine or serverless function
  • You want minimal infrastructure (no webhook server, no message queue)

Avoid this pattern when:

  • You monitor hundreds of repos (webhooks scale better)
  • You need sub-minute latency (polling introduces delay)
  • You already run a persistent service (webhooks are simpler)
  • You need complex event correlation (use a proper event bus)

The repo tracker agent is a practical example of agentic infrastructure: autonomous, stateful, and resilient. The plumbing is straightforward, the failure modes are manageable, and the deployment footprint is small.
