Building a GitHub Repo Tracker Agent: Polling, Diffing, and Notification Plumbing for Daily Updates

A GitHub repository tracker is a straightforward agentic pattern: poll APIs, detect changes, store state, and route notifications. The implementation choices matter because you are balancing API rate limits, storage overhead, notification noise, and failure recovery.

This article walks through the plumbing of a working repo tracker agent built with Python, SQLite, and cron. It covers polling vs webhooks, diff detection strategies, state persistence, notification routing, and rate-limit handling.

Why Poll Instead of Webhooks

GitHub webhooks deliver real-time events, but they require a publicly accessible endpoint, TLS termination, and webhook secret validation. For a personal tracker monitoring a dozen repos, polling is simpler:

No server infrastructure to maintain
No firewall rules or reverse proxy configuration
No webhook signature verification code
Works from a laptop, cron job, or serverless function

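The trade-off is API rate limits. GitHub allows 5,000 authenticated requests per hour. At one request per repo per poll, polling 20 repos every 15 minutes consumes 80 requests per hour. A tracker that makes several calls per repo (commits, releases, issues, pull requests) multiplies that figure, but still leaves ample headroom for other API calls.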
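The budget arithmetic can be checked with a few lines of Python. This is a rough sketch: the function name and the figure of four calls per repo are illustrative assumptions, not values from the tracker itself.

```python
def requests_per_hour(repo_count, poll_interval_minutes, calls_per_repo=1):
    """Estimate hourly API requests for a fixed polling schedule."""
    polls_per_hour = 60 / poll_interval_minutes
    return int(repo_count * polls_per_hour * calls_per_repo)


HOURLY_LIMIT = 5000  # authenticated limit quoted in the text

# One call per repo per poll, as in the 20-repo example.
single_call = requests_per_hour(20, 15)
# Hypothetical: four calls per repo (repo, commit, release, pull requests).
four_calls = requests_per_hour(20, 15, calls_per_repo=4)

print(single_call, four_calls, HOURLY_LIMIT - four_calls)
```

Even with the assumed four calls per repo, the schedule uses a small fraction of the hourly limit.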

When to use webhooks instead:

You monitor hundreds of repos
You need sub-minute latency
You already run a persistent service with a public endpoint

State Persistence: What to Track

A repo tracker needs to persist enough state to detect changes between runs. The minimal schema includes:

Repository identifier (owner/name)
Last commit SHA on the default branch
Last release tag
Issue count
Pull request count
Last check timestamp

SQLite works well for this. It is a single file, requires no daemon, and supports concurrent reads. The schema:

CREATE TABLE repos (
    id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    name TEXT NOT NULL,
    last_commit_sha TEXT,
    last_release_tag TEXT,
    issue_count INTEGER,
    pr_count INTEGER,
    last_checked TIMESTAMP,
    UNIQUE(owner, name)
);

CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    repo_id INTEGER,
    event_type TEXT,
    event_data TEXT,
    detected_at TIMESTAMP,
    FOREIGN KEY(repo_id) REFERENCES repos(id)
);

The events table logs every detected change. This gives you an audit trail and supports notification deduplication.
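One way to use this schema from Python is sketched below with the standard sqlite3 module. The helper names (open_db, save_state, load_state) and the state dictionary keys are assumptions chosen to match the detection code later in the article. The upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS repos (
    id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    name TEXT NOT NULL,
    last_commit_sha TEXT,
    last_release_tag TEXT,
    issue_count INTEGER,
    pr_count INTEGER,
    last_checked TIMESTAMP,
    UNIQUE(owner, name)
);
CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY,
    repo_id INTEGER,
    event_type TEXT,
    event_data TEXT,
    detected_at TIMESTAMP,
    FOREIGN KEY(repo_id) REFERENCES repos(id)
);
"""


def open_db(path):
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    # SQLite does not enforce foreign keys unless asked to, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def save_state(conn, owner, name, state):
    """Insert or update one repo row, keyed on UNIQUE(owner, name)."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO repos (owner, name, last_commit_sha, last_release_tag,
                               issue_count, pr_count, last_checked)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(owner, name) DO UPDATE SET
                last_commit_sha = excluded.last_commit_sha,
                last_release_tag = excluded.last_release_tag,
                issue_count = excluded.issue_count,
                pr_count = excluded.pr_count,
                last_checked = excluded.last_checked
            """,
            (owner, name, state["commit_sha"], state["release_tag"],
             state["issue_count"], state["pr_count"], now),
        )


def load_state(conn, owner, name):
    """Return the stored state as a dict, or None for an unseen repo."""
    row = conn.execute(
        """
        SELECT last_commit_sha AS commit_sha, last_release_tag AS release_tag,
               issue_count, pr_count
        FROM repos WHERE owner = ? AND name = ?
        """,
        (owner, name),
    ).fetchone()
    return dict(row) if row else None
```

Storing timestamps as ISO 8601 strings in UTC is a design choice of this sketch; it keeps them sortable as plain text.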

Diff Detection Strategy

The agent fetches current state from the GitHub API and compares it to the last known state. Changes trigger events:

Change Type	Detection Method	Notification Threshold
New commit	SHA comparison	Always notify
New release	Tag comparison	Always notify
Issue count increase	Numeric diff	Notify if delta > 5
PR count increase	Numeric diff	Notify if delta > 2
Issue count decrease	Numeric diff	Silent log only
PR count decrease	Numeric diff	Silent log only

The thresholds prevent notification spam when issue counts fluctuate by one or two. You can tune these per repo or per event type.

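def detect_changes(repo_id, current_state, previous_state):
    """Compare two state dicts and return the events worth notifying about.

    repo_id is not used here; the caller attaches it when logging events.
    """
    events = []

    # A different head SHA on the default branch means new commits.
    if current_state['commit_sha'] != previous_state['commit_sha']:
        events.append({
            'type': 'new_commit',
            'data': current_state['commit_sha']
        })

    # Skip the case where the latest release was deleted (tag becomes None).
    if (current_state['release_tag']
            and current_state['release_tag'] != previous_state['release_tag']):
        events.append({
            'type': 'new_release',
            'data': current_state['release_tag']
        })

    # Count spikes use the thresholds from the table: issues > 5, PRs > 2.
    issue_delta = current_state['issue_count'] - previous_state['issue_count']
    if issue_delta > 5:
        events.append({
            'type': 'issue_spike',
            'data': issue_delta
        })

    pr_delta = current_state['pr_count'] - previous_state['pr_count']
    if pr_delta > 2:
        events.append({
            'type': 'pr_spike',
            'data': pr_delta
        })

    return events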
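The current_state dictionary has to be assembled from API responses first. The sketch below shows one possible way to do that with the REST endpoints for the repo, its head commit, its latest release, and its open pull requests. The function names and the injected get_json callable are assumptions for illustration. The pull request count is capped at 100 because pagination is left out.

```python
import json
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com"


def github_get_json(path, token):
    """GET one REST endpoint; returns parsed JSON, or None on a 404."""
    request = urllib.request.Request(
        API_ROOT + path,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 404:  # for example, a repo with no releases yet
            return None
        raise


def fetch_state(owner, name, get_json):
    """Build the state dict from four API calls made through get_json(path)."""
    base = f"/repos/{owner}/{name}"
    repo = get_json(base)
    commit = get_json(f"{base}/commits/{repo['default_branch']}")
    release = get_json(f"{base}/releases/latest")
    pulls = get_json(f"{base}/pulls?state=open&per_page=100") or []
    pr_count = len(pulls)  # undercounts beyond 100 open PRs (no pagination)
    return {
        "commit_sha": commit["sha"],
        "release_tag": release["tag_name"] if release else None,
        # open_issues_count includes pull requests, so subtract them.
        "issue_count": repo["open_issues_count"] - pr_count,
        "pr_count": pr_count,
    }
```

In a real run the callable would be something like `lambda path: github_get_json(path, token)`. Passing it in keeps fetch_state testable without network access.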

API Rate Limit Handling

GitHub returns rate limit headers with every response:

X-RateLimit-Limit: Total requests allowed per hour
X-RateLimit-Remaining: Requests left in current window
X-RateLimit-Reset: Unix timestamp when the limit resets

The agent checks these headers after every request. If remaining requests drop below a threshold (e.g., 100), it logs a warning and skips non-critical repos.

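import logging
import time

logger = logging.getLogger(__name__)


def check_rate_limit(response, threshold=100):
    """Return False when the remaining request budget is below threshold."""
    remaining = int(response.headers.get('X-RateLimit-Remaining', 0))
    reset_time = int(response.headers.get('X-RateLimit-Reset', 0))

    if remaining < threshold:
        # Clamp at zero: the reset time may already have passed.
        wait_seconds = max(0, int(reset_time - time.time()))
        logger.warning(f"Rate limit low: {remaining} remaining, resets in {wait_seconds}s")
        return False
    return True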

For long-running agents, you can implement exponential backoff or prioritize repos by importance.
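Both ideas can be sketched briefly. The block below is an illustrative assumption rather than the tracker's actual code: backoff_delay doubles the wait on each attempt with optional random jitter, call_with_backoff retries a callable, and repos_within_budget keeps only the most important repos when the budget is tight. The "priority" key on each repo is hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=True):
    """Delay before retry number attempt (0-based): base * 2**attempt, capped."""
    delay = min(cap, base * (2 ** attempt))
    # Jitter picks a random point in [0, delay] so clients do not retry in lockstep.
    return random.uniform(0, delay) if jitter else delay


def call_with_backoff(func, max_attempts=4, sleep=time.sleep, **backoff_kwargs):
    """Call func(); on an exception wait and retry, re-raising after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # narrow this to network errors in real use
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, **backoff_kwargs))


def repos_within_budget(repos, remaining, reserve=100, calls_per_repo=1):
    """Keep the highest-priority repos that fit in the remaining request budget."""
    budget = max(0, remaining - reserve) // calls_per_repo
    # Lower priority number means more important (an assumption of this sketch).
    ranked = sorted(repos, key=lambda repo: repo["priority"])
    return ranked[:budget]
```

The sleep function is a parameter so the retry loop can be exercised in tests without actually waiting.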

Notification Routing

The agent supports multiple notification channels: Slack, email, Discord, or stdout. The routing logic is a simple plugin pattern:

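import logging

logger = logging.getLogger(__name__)


class NotificationRouter:
    """Fan each event out to every registered handler."""

    def __init__(self):
        self.handlers = []

    def register(self, handler):
        self.handlers.append(handler)

    def send(self, event):
        for handler in self.handlers:
            try:
                handler.send(event)
            except Exception as e:
                # One failing channel must not block the others.
                logger.error(f"Notification failed: {e}")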

Each handler implements a send(event) method. The Slack handler posts to a webhook URL, the email handler uses SMTP, and the stdout handler prints JSON.
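Two of these handlers might look like the sketch below. The class names and the message format are assumptions. The Slack handler relies on the incoming-webhook convention of posting a JSON body with a "text" field.

```python
import json
import sys
import urllib.request


class StdoutHandler:
    """Print each event as one JSON line."""

    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stdout

    def send(self, event):
        self.stream.write(json.dumps(event, sort_keys=True) + "\n")


class SlackHandler:
    """Post a short text message to a Slack incoming webhook URL."""

    def __init__(self, webhook_url, timeout=10):
        self.webhook_url = webhook_url
        self.timeout = timeout

    def send(self, event):
        body = json.dumps({"text": f"{event['type']}: {event['data']}"}).encode("utf-8")
        # Supplying data makes urllib send a POST request.
        request = urllib.request.Request(
            self.webhook_url,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        # urlopen raises on HTTP errors; the router's try/except logs them.
        with urllib.request.urlopen(request, timeout=self.timeout):
            pass
```

Because both classes expose the same send(event) method, either can be passed to the router's register method without further glue.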

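Deduplication: Before sending, the router checks the events table. If an identical event (same repo, type, and data) was already recorded in the last 24 hours, it skips the notification. Because the agent logs each event before routing it, the lookup must ignore the row that was just written.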
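A deduplication lookup against the events table could be written as follows. This is a sketch under two assumptions: detected_at holds ISO 8601 UTC strings, so text comparison matches chronological order, and the check runs before the new event row is inserted, so the event cannot match itself. The function name is illustrative.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def is_duplicate(conn, repo_id, event_type, event_data, now=None, window_hours=24):
    """Return True if a matching event was recorded inside the time window."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(hours=window_hours)).isoformat(timespec="seconds")
    row = conn.execute(
        """
        SELECT 1 FROM events
        WHERE repo_id = ? AND event_type = ? AND event_data = ?
          AND detected_at >= ?
        LIMIT 1
        """,
        # event_data is a TEXT column, so compare against the string form.
        (repo_id, event_type, str(event_data), cutoff),
    ).fetchone()
    return row is not None
```

The now parameter exists mainly for testing. In normal use it defaults to the current UTC time.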

Orchestration Flow

The agent runs as a cron job every 15 minutes. The execution flow:

1. Load repo list from config file
2. For each repo:
   - Fetch current state from GitHub API
   - Load previous state from SQLite
   - Detect changes
   - Log events to database
3. For each new event:
   - Check deduplication rules
   - Route to notification handlers
4. Update last_checked timestamp

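The entire run takes 5-10 seconds for 20 repos. If a run fails (network error, API timeout), cron does not retry it; the next scheduled run, 15 minutes later, starts from the last committed state. No state is lost because the database transaction commits only after all repos are processed, so a failed run rolls back and the same changes are detected again.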
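The single-transaction behavior can be sketched with sqlite3's connection context manager, which commits when the block completes and rolls back when it raises. The function below is a simplified illustration, not the article's implementation: run_once, the injected fetch_state, detect_changes, and notify callables, and the choice to notify only after the commit are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone

STATE_KEYS = ("commit_sha", "release_tag", "issue_count", "pr_count")


def run_once(conn, repos, fetch_state, detect_changes, notify):
    """Process every repo inside one transaction; return the events detected."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    pending = []
    # Commits if the block finishes, rolls back if any repo raises.
    with conn:
        for owner, name in repos:
            current = fetch_state(owner, name)
            values = tuple(current[key] for key in STATE_KEYS)
            row = conn.execute(
                "SELECT id, last_commit_sha, last_release_tag, issue_count, pr_count"
                " FROM repos WHERE owner = ? AND name = ?",
                (owner, name),
            ).fetchone()
            if row is None:
                # First sighting: store a baseline, nothing to diff against yet.
                conn.execute(
                    "INSERT INTO repos (owner, name, last_commit_sha, last_release_tag,"
                    " issue_count, pr_count, last_checked) VALUES (?, ?, ?, ?, ?, ?, ?)",
                    (owner, name) + values + (now,),
                )
                continue
            repo_id = row[0]
            previous = dict(zip(STATE_KEYS, row[1:]))
            for event in detect_changes(repo_id, current, previous):
                conn.execute(
                    "INSERT INTO events (repo_id, event_type, event_data, detected_at)"
                    " VALUES (?, ?, ?, ?)",
                    (repo_id, event["type"], str(event["data"]), now),
                )
                pending.append((f"{owner}/{name}", event))
            conn.execute(
                "UPDATE repos SET last_commit_sha = ?, last_release_tag = ?,"
                " issue_count = ?, pr_count = ?, last_checked = ? WHERE id = ?",
                values + (now, repo_id),
            )
    # Notify after the commit so a rolled-back run never announces unsaved events.
    for repo, event in pending:
        notify(repo, event)
    return pending
```

Sending notifications after the commit means a notification failure leaves the event in the table for later replay instead of undoing the state update.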

Failure Modes and Recovery

Failure	Impact	Recovery
GitHub API timeout	Partial run, some repos skipped	Next cron run retries
SQLite lock contention	Write fails, no state update	Next run detects same changes
Notification webhook down	Event logged, notification lost	Manual replay from events table
Rate limit exhausted	All requests fail	Wait for reset, resume next hour
Corrupt database file	Agent crashes	Restore from backup, rebuild state

The agent logs every error with context (repo name, API endpoint, error message). For critical failures, it sends a notification to a separate alerting channel.
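For the corrupt-database row, a backup and a health check can be sketched with the standard library alone. The function names are illustrative assumptions. Connection.backup is available from Python 3.7, and PRAGMA integrity_check is SQLite's built-in consistency check.

```python
import sqlite3


def backup_database(source_path, backup_path):
    """Copy a live SQLite database using the online backup API."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


def database_is_healthy(path):
    """Return True when SQLite's integrity check reports 'ok'."""
    conn = sqlite3.connect(path)
    try:
        result = conn.execute("PRAGMA integrity_check").fetchone()
        return result is not None and result[0] == "ok"
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a database at all.
        return False
    finally:
        conn.close()
```

Running the health check at the start of each run and the backup after a successful run is one possible arrangement. The right cadence depends on how much event history you can afford to lose.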

Observability

The agent writes structured logs to stdout and a rotating log file. Each log entry includes:

Timestamp
Repo identifier
Event type
API response time
Rate limit remaining

You can pipe logs to a log aggregator (Loki, Datadog) or parse them with jq for quick analysis:

jq 'select(.event_type == "new_release")' github-monitor.log
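For jq to parse the log, each entry has to be a single JSON object per line. A minimal sketch of such a formatter for Python's logging module follows. The field names (repo, event_type, response_ms, rate_limit_remaining) and the logger name are assumptions modeled on the list above.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Optional context fields, supplied per call through logging's `extra` argument.
CONTEXT_FIELDS = ("repo", "event_type", "response_ms", "rate_limit_remaining")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(stream=None):
    """Return a logger that writes JSON lines to stream (stdout by default)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("github_monitor")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "release detected",
        extra={"repo": "octocat/Hello-World", "event_type": "new_release"},
    )
```

A rotating file can be added by attaching a logging.handlers.RotatingFileHandler with the same formatter.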

For dashboards, the SQLite database is queryable directly. A simple Flask app can serve a web UI showing recent events, repo health, and API usage trends.

Deployment Shape

The minimal deployment is a cron job on a single machine:

*/15 * * * * /usr/bin/python3 /path/to/github_monitor.py >> /var/log/github-monitor.log 2>&1

For redundancy, run the agent on two machines with staggered schedules (e.g., minutes 0,15,30,45 and 7,22,37,52). Both instances must work from the same state, and SQLite allows only one writer at a time, so only one instance should write at once. A file lock is enough to stop overlapping runs on one machine; across machines, use a distributed lock (Redis, etcd) to coordinate.
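For the file-locking option, a minimal sketch using flock is shown below. It covers only the single-machine case, where a slow run overlaps with the next cron invocation, and it assumes a Unix system because it uses the fcntl module. The function name and lock path are hypothetical.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive, non-blocking lock on lock_path while the block runs.

    Raises BlockingIOError if another run already holds the lock.
    """
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        # Closing the file releases the lock if it was acquired.
        handle.close()
```

A run would wrap its work in `with single_instance("/tmp/github-monitor.lock"):` and exit quietly on BlockingIOError, leaving the work to the instance that holds the lock.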

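For serverless, package the agent as a Lambda function triggered by EventBridge. The SQLite file lives either on EFS, mounted into the function, or in S3, in which case each invocation downloads it before the run and uploads it afterwards. Cold start latency is acceptable because the agent runs every 15 minutes.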

Security Boundaries

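The agent needs a GitHub personal access token. For private repos, a classic token needs the repo scope, which also grants write access; a fine-grained token limited to read-only permissions on the tracked repos is a narrower choice, and public repos can be read with a token that has no scopes at all. Store the token in an environment variable or secret manager, never in code.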
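Reading the token from the environment might look like the sketch below. The variable name GITHUB_TOKEN and the function name are assumptions. The header names are those the GitHub REST API documents.

```python
import os


def github_headers(env=None, variable="GITHUB_TOKEN"):
    """Build request headers from a token held in an environment variable."""
    env = os.environ if env is None else env
    token = env.get(variable, "").strip()
    if not token:
        # Fail fast instead of silently falling back to unauthenticated calls.
        raise RuntimeError(f"{variable} is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
```

Failing at startup when the variable is missing avoids a run that quietly drops to the much lower unauthenticated rate limit.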

If you expose the web dashboard, add authentication. The SQLite database contains repo names and event history, which may reveal internal projects or security issues.

For notification webhooks (Slack, Discord), validate the destination URL. A misconfigured webhook could leak event data to an attacker-controlled endpoint.
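One simple form of validation is an allowlist of webhook hosts, checked before any event is posted. The sketch below is an assumption about how that could be done; the host list and function name are illustrative and should be adjusted to the channels actually in use.

```python
from urllib.parse import urlsplit

# Assumed allowlist: the hosts used by Slack and Discord webhook URLs.
ALLOWED_WEBHOOK_HOSTS = frozenset({"hooks.slack.com", "discord.com"})


def is_allowed_webhook(url, allowed_hosts=ALLOWED_WEBHOOK_HOSTS):
    """Accept only HTTPS URLs whose host is exactly on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # Reject embedded credentials such as https://hooks.slack.com@evil.example/
    if parts.username or parts.password:
        return False
    return parts.hostname in allowed_hosts
```

Matching the full hostname, not a prefix or substring, is what stops lookalike domains such as hooks.slack.com.evil.example from passing.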

Technical Verdict

Use this pattern when:

You monitor fewer than 100 repos
You need change detection, not real-time streaming
You run on a single machine or serverless function
You want minimal infrastructure (no webhook server, no message queue)

Avoid this pattern when:

You monitor hundreds of repos (webhooks scale better)
You need sub-minute latency (polling introduces delay)
You already run a persistent service (webhooks are simpler)
You need complex event correlation (use a proper event bus)

The repo tracker agent is a practical example of agentic infrastructure: autonomous, stateful, and resilient. The plumbing is straightforward, the failure modes are manageable, and the deployment footprint is small.

Source Links

Primary Article: The Repo Tracker: Automating My Daily GitHub Catch-Up