A GitHub repository tracker is a straightforward agentic pattern: poll APIs, detect changes, store state, and route notifications. The implementation choices matter because you are balancing API rate limits, storage overhead, notification noise, and failure recovery.
This article walks through the plumbing of a working repo tracker agent built with Python, SQLite, and cron. It covers polling vs webhooks, diff detection strategies, state persistence, notification routing, and rate-limit handling.
Why Poll Instead of Webhooks
GitHub webhooks deliver real-time events, but they require a publicly accessible endpoint, TLS termination, and webhook secret validation. For a personal tracker monitoring a dozen repos, polling is simpler:
- No server infrastructure to maintain
- No firewall rules or reverse proxy configuration
- No webhook signature verification code
- Works from a laptop, cron job, or serverless function
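The trade-off is API rate limits. GitHub allows 5,000 authenticated requests per hour. Polling 20 repos every 15 minutes means 4 polls per hour, or 80 requests per hour if each poll costs one request per repo. Even if each poll needs several requests per repo (for example, separate calls for commits, releases, issues, and pull requests), the total stays in the low hundreds, leaving headroom for other API calls.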
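The budget arithmetic above can be sketched as a small helper. This is only an estimate under stated assumptions: the `requests_per_repo` parameter and the function name are illustrative, and the real cost depends on which endpoints the agent calls.

```python
def hourly_requests(repos: int, poll_interval_min: int, requests_per_repo: int = 1) -> int:
    """Estimate how many API requests a polling loop consumes per hour."""
    polls_per_hour = 60 // poll_interval_min
    return repos * polls_per_hour * requests_per_repo


AUTHENTICATED_LIMIT = 5000  # hourly limit for authenticated requests, as stated above

print(hourly_requests(20, 15))     # 80: one request per repo per poll
print(hourly_requests(20, 15, 4))  # 320: assumes four endpoints per repo
print(AUTHENTICATED_LIMIT - hourly_requests(20, 15, 4))  # remaining headroom
```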
When to use webhooks instead:
- You monitor hundreds of repos
- You need sub-minute latency
- You already run a persistent service with a public endpoint
State Persistence: What to Track
A repo tracker needs to persist enough state to detect changes between runs. The minimal schema includes:
- Repository identifier (owner/name)
- Last commit SHA on the default branch
- Last release tag
- Issue count
- Pull request count
- Last check timestamp
SQLite works well for this. It is a single file, requires no daemon, and supports concurrent reads. The schema:
    CREATE TABLE repos (
        id INTEGER PRIMARY KEY,
        owner TEXT NOT NULL,
        name TEXT NOT NULL,
        last_commit_sha TEXT,
        last_release_tag TEXT,
        issue_count INTEGER,
        pr_count INTEGER,
        last_checked TIMESTAMP,
        UNIQUE(owner, name)
    );

    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        repo_id INTEGER,
        event_type TEXT,
        event_data TEXT,
        detected_at TIMESTAMP,
        FOREIGN KEY(repo_id) REFERENCES repos(id)
    );
The events table logs every detected change. This gives you an audit trail and supports notification deduplication.
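As a rough illustration of how the repos table might be read and written from Python, here is a minimal sketch using the standard sqlite3 module. The helper names (`open_db`, `save_state`, `load_state`) and the state dictionary keys are assumptions made for this example, and the upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone

REPOS_DDL = """
CREATE TABLE IF NOT EXISTS repos (
    id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    name TEXT NOT NULL,
    last_commit_sha TEXT,
    last_release_tag TEXT,
    issue_count INTEGER,
    pr_count INTEGER,
    last_checked TIMESTAMP,
    UNIQUE(owner, name)
)
"""

def open_db(path=":memory:"):
    """Open the database and make sure the repos table exists."""
    conn = sqlite3.connect(path)
    conn.execute(REPOS_DDL)
    return conn

def save_state(conn, owner, name, state):
    """Insert or update one repo's stored state (hypothetical helper)."""
    conn.execute(
        """
        INSERT INTO repos (owner, name, last_commit_sha, last_release_tag,
                           issue_count, pr_count, last_checked)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(owner, name) DO UPDATE SET
            last_commit_sha = excluded.last_commit_sha,
            last_release_tag = excluded.last_release_tag,
            issue_count = excluded.issue_count,
            pr_count = excluded.pr_count,
            last_checked = excluded.last_checked
        """,
        (owner, name, state["commit_sha"], state["release_tag"],
         state["issue_count"], state["pr_count"],
         datetime.now(timezone.utc).isoformat(timespec="seconds")),
    )
    conn.commit()

def load_state(conn, owner, name):
    """Return the previously stored state as a dict, or None on the first run."""
    row = conn.execute(
        """
        SELECT last_commit_sha, last_release_tag, issue_count, pr_count
        FROM repos WHERE owner = ? AND name = ?
        """,
        (owner, name),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("commit_sha", "release_tag", "issue_count", "pr_count"), row))
```

Returning None for an unknown repo lets the caller treat the first run as a baseline instead of reporting every field as a change.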
Diff Detection Strategy
The agent fetches current state from the GitHub API and compares it to the last known state. Changes trigger events:
| Change Type | Detection Method | Notification Threshold |
|---|---|---|
| New commit | SHA comparison | Always notify |
| New release | Tag comparison | Always notify |
| Issue count increase | Numeric diff | Notify if delta > 5 |
| PR count increase | Numeric diff | Notify if delta > 2 |
| Issue count decrease | Numeric diff | Silent log only |
| PR count decrease | Numeric diff | Silent log only |
The thresholds prevent notification spam when issue counts fluctuate by one or two. You can tune these per repo or per event type.
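    def detect_changes(repo_id, current_state, previous_state):
        """Compare two state dicts and return a list of change events."""
        events = []
        # Any SHA change on the default branch counts as a new commit.
        if current_state['commit_sha'] != previous_state['commit_sha']:
            events.append({
                'type': 'new_commit',
                'data': current_state['commit_sha']
            })
        # A different latest release tag means a new release was published.
        if current_state['release_tag'] != previous_state['release_tag']:
            events.append({
                'type': 'new_release',
                'data': current_state['release_tag']
            })
        # Report issue growth only above the threshold from the table.
        issue_delta = current_state['issue_count'] - previous_state['issue_count']
        if issue_delta > 5:
            events.append({
                'type': 'issue_spike',
                'data': issue_delta
            })
        # Same idea for pull requests, using the lower threshold from the table.
        pr_delta = current_state['pr_count'] - previous_state['pr_count']
        if pr_delta > 2:
            events.append({
                'type': 'pr_spike',
                'data': pr_delta
            })
        return events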
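The thresholds do not have to be hard-coded. One possible way to tune them per repo or per event type is a lookup with defaults, sketched below. The dictionary names, the example repo key, and the `pr_spike` event type are assumptions for illustration only.

```python
# Default thresholds, taken from the table above.
DEFAULT_THRESHOLDS = {"issue_spike": 5, "pr_spike": 2}

# Hypothetical per-repo overrides, keyed by "owner/name".
REPO_THRESHOLDS = {
    "example-org/busy-repo": {"issue_spike": 20},
}

def threshold_for(repo, event_type):
    """Return the threshold for a repo, falling back to the default."""
    return REPO_THRESHOLDS.get(repo, {}).get(event_type, DEFAULT_THRESHOLDS[event_type])

def should_notify(repo, event_type, delta):
    """Notify only when the count grew by more than the configured threshold."""
    return delta > threshold_for(repo, event_type)
```

A busy repo can then use a higher issue threshold while every other repo keeps the defaults.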
API Rate Limit Handling
GitHub returns rate limit headers with every response:
- X-RateLimit-Limit: Total requests allowed per hour
- X-RateLimit-Remaining: Requests left in the current window
- X-RateLimit-Reset: Unix timestamp when the limit resets
The agent checks these headers after every request. If remaining requests drop below a threshold (e.g., 100), it logs a warning and skips non-critical repos.
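    import logging
    import time

    logger = logging.getLogger(__name__)

    def check_rate_limit(response):
        """Return False when the remaining request budget is below the threshold."""
        remaining = int(response.headers.get('X-RateLimit-Remaining', 0))
        reset_time = int(response.headers.get('X-RateLimit-Reset', 0))
        if remaining < 100:
            # Clamp at zero in case the reset time has already passed.
            wait_seconds = max(0, int(reset_time - time.time()))
            logger.warning(f"Rate limit low: {remaining} remaining, resets in {wait_seconds}s")
            return False
        return True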
For long-running agents, you can implement exponential backoff or prioritize repos by importance.
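Both ideas can be sketched in a few lines. The version below assumes full-jitter exponential backoff and a numeric `priority` field on each repo (lower means more important). The function names, the field name, and the reserve of 100 requests are illustrative choices, not part of the agent described here.

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=True):
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    delay = min(cap, base * (2 ** attempt))
    # Full jitter spreads retries out so parallel agents do not retry in lockstep.
    return random.uniform(0, delay) if jitter else delay

def select_repos(repos, remaining, reserve=100, cost_per_repo=1):
    """Keep the highest-priority repos that fit in the budget above the reserve."""
    budget = max(0, remaining - reserve) // cost_per_repo
    ranked = sorted(repos, key=lambda r: r["priority"])  # lower number = more important
    return ranked[:budget]
```

With this shape, a run that finds the budget nearly exhausted still checks the few repos that matter most and leaves the rest for the next window.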
Notification Routing
The agent supports multiple notification channels: Slack, email, Discord, or stdout. The routing logic is a simple plugin pattern:
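    import logging

    logger = logging.getLogger(__name__)

    class NotificationRouter:
        """Fan an event out to every registered handler."""

        def __init__(self):
            self.handlers = []

        def register(self, handler):
            self.handlers.append(handler)

        def send(self, event):
            for handler in self.handlers:
                try:
                    handler.send(event)
                except Exception as e:
                    # One failing channel must not block the others.
                    logger.error(f"Notification failed: {e}")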
Each handler implements a send(event) method. The Slack handler posts to a webhook URL, the email handler uses SMTP, and the stdout handler prints JSON.
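Two handlers might look like the sketch below. The class names, the message format, and the `build_request` helper are assumptions for illustration. The Slack payload uses the simple text field that incoming webhooks accept, and the standard library's urllib stands in for whatever HTTP client the agent actually uses.

```python
import json
import sys
import urllib.request

class StdoutHandler:
    """Print each event as one JSON line."""

    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stdout

    def send(self, event):
        self.stream.write(json.dumps(event, sort_keys=True) + "\n")

class SlackHandler:
    """Post each event to a Slack incoming webhook URL."""

    def __init__(self, webhook_url, timeout=10):
        self.webhook_url = webhook_url
        self.timeout = timeout

    def build_request(self, event):
        # Incoming webhooks accept a JSON body with a "text" field.
        body = json.dumps({"text": f"[{event['type']}] {event['data']}"}).encode("utf-8")
        return urllib.request.Request(
            self.webhook_url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def send(self, event):
        # Raises on network or HTTP errors; the router catches and logs them.
        with urllib.request.urlopen(self.build_request(event), timeout=self.timeout) as resp:
            resp.read()
```

Keeping request construction separate from the network call makes the handler easy to test without contacting Slack.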
Deduplication: Before sending, the router checks the events table. If an identical event (same repo, event type, and event data) was already recorded in the last 24 hours, it skips the notification.
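A deduplication check along these lines can be a single query against the events table. The sketch below makes two assumptions: detected_at is stored as an ISO-8601 UTC string, and the check runs before the new event row is inserted (otherwise the event would match itself). The function name and parameters are illustrative.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def is_duplicate(conn, repo_id, event_type, event_data, now=None, window_hours=24):
    """Return True if an identical event was recorded within the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(hours=window_hours)).isoformat(timespec="seconds")
    row = conn.execute(
        """
        SELECT 1 FROM events
        WHERE repo_id = ? AND event_type = ? AND event_data = ?
          AND detected_at >= ?
        LIMIT 1
        """,
        (repo_id, event_type, str(event_data), cutoff),
    ).fetchone()
    return row is not None
```

Comparing timestamps as strings only works when every row uses the same format and time zone, which is why the sketch fixes both.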
Orchestration Flow
The agent runs as a cron job every 15 minutes. The execution flow:
- Load repo list from config file
- For each repo:
  - Fetch current state from GitHub API
  - Load previous state from SQLite
  - Detect changes
  - Log events to database
- For each new event:
  - Check deduplication rules
  - Route to notification handlers
- Update last_checked timestamp
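The entire run takes 5-10 seconds for 20 repos. Cron itself does not retry: if a run fails (network error, API timeout), the next scheduled run 15 minutes later detects the same changes. No state is lost because the database transaction commits only after all repos are processed, so a failed run leaves the previous state in place.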
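The flow above could be wired together roughly as follows. This is a sketch under several assumptions: the repo list is read from the repos table rather than a config file, the fetch, detect, and notify steps are passed in as callables, and a fetch error on one repo skips that repo instead of aborting the run. The function name and signature are hypothetical.

```python
import logging
import sqlite3
from datetime import datetime, timezone

logger = logging.getLogger("github_monitor")

def run_once(conn, fetch_state, detect_changes, notify):
    """Process every repo inside one transaction; commit only if the loop finishes."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with conn:  # commits on normal exit, rolls back if an exception escapes
        rows = conn.execute(
            "SELECT id, owner, name, last_commit_sha, last_release_tag, "
            "issue_count, pr_count FROM repos"
        ).fetchall()
        for repo_id, owner, name, sha, tag, issues, prs in rows:
            previous = {"commit_sha": sha, "release_tag": tag,
                        "issue_count": issues, "pr_count": prs}
            try:
                current = fetch_state(owner, name)
            except OSError as exc:  # covers network errors and timeouts
                logger.error("fetch failed for %s/%s: %s", owner, name, exc)
                continue  # skip this repo; the next scheduled run picks it up
            for event in detect_changes(repo_id, current, previous):
                conn.execute(
                    "INSERT INTO events (repo_id, event_type, event_data, detected_at) "
                    "VALUES (?, ?, ?, ?)",
                    (repo_id, event["type"], str(event["data"]), now),
                )
                notify(event)
            conn.execute(
                "UPDATE repos SET last_commit_sha = ?, last_release_tag = ?, "
                "issue_count = ?, pr_count = ?, last_checked = ? WHERE id = ?",
                (current["commit_sha"], current["release_tag"],
                 current["issue_count"], current["pr_count"], now, repo_id),
            )
```

One consequence of committing at the end is that a notification can go out before a later failure rolls the transaction back, so the next run may detect and send the same event again. The deduplication step is what keeps that from becoming noise.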
Failure Modes and Recovery
| Failure | Impact | Recovery |
|---|---|---|
| GitHub API timeout | Partial run, some repos skipped | Next cron run retries |
| SQLite lock contention | Write fails, no state update | Next run detects same changes |
| Notification webhook down | Event logged, notification lost | Manual replay from events table |
| Rate limit exhausted | All requests fail | Wait for reset, resume next hour |
| Corrupt database file | Agent crashes | Restore from backup, rebuild state |
The agent logs every error with context (repo name, API endpoint, error message). For critical failures, it sends a notification to a separate alerting channel.
Observability
The agent writes structured logs to stdout and a rotating log file. Each log entry includes:
- Timestamp
- Repo identifier
- Event type
- API response time
- Rate limit remaining
You can pipe logs to a log aggregator (Loki, Datadog) or parse them with jq for quick analysis:
    jq 'select(.event_type == "new_release")' github-monitor.log
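One way to produce log entries in that shape is a small JSON formatter on top of the standard logging module, sketched below. The field names (`repo`, `event_type`, `api_response_ms`, `rate_limit_remaining`), the logger name, and the rotation settings are assumptions for this example.

```python
import json
import logging
import logging.handlers
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Optional fields supplied through the `extra` argument of logging calls.
    FIELDS = ("repo", "event_type", "api_response_ms", "rate_limit_remaining")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

def build_logger(log_path, max_bytes=1_000_000, backups=3):
    """Log JSON lines to stdout and to a size-rotated file."""
    logger = logging.getLogger("github_monitor")
    logger.setLevel(logging.INFO)
    handlers = (
        logging.StreamHandler(sys.stdout),
        logging.handlers.RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups),
    )
    for handler in handlers:
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

A call such as `logger.info("release detected", extra={"repo": "octocat/hello-world", "event_type": "new_release"})` then yields a line that the jq filter above can select.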
For dashboards, the SQLite database is queryable directly. A simple Flask app can serve a web UI showing recent events, repo health, and API usage trends.
Deployment Shape
The minimal deployment is a cron job on a single machine:
    */15 * * * * /usr/bin/python3 /path/to/github_monitor.py >> /var/log/github-monitor.log 2>&1
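For redundancy, run the agent on two machines with staggered schedules (e.g., minutes 0,15,30,45 and 7,22,37,52). SQLite allows only one writer at a time, and both machines have to work against the same database file, so only one instance should run at a time. Use file locking where the instances share a host or a filesystem with reliable locks, and a distributed lock (Redis, etcd) otherwise.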
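For the file-locking option, a minimal sketch using an advisory flock lock is shown below. It is Unix-only (the fcntl module is not available on Windows), and it only coordinates processes that see the same lock file through a filesystem with working locks. The function name and lock path are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def single_instance(lock_path):
    """Try to take an exclusive lock; yield True if acquired, False if it is held elsewhere."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A run would wrap its work in `with single_instance("/path/to/monitor.lock") as acquired:` and exit early when `acquired` is False, leaving the changes for whichever instance holds the lock.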
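For serverless, package the agent as a Lambda function triggered by an EventBridge schedule. The SQLite file lives on an EFS mount, or in S3 if the function downloads it at the start of each run and uploads it at the end. Cold start latency is acceptable because a job that runs every 15 minutes is not latency-sensitive.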
Security Boundaries
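The agent needs a GitHub personal access token. The repo scope on a classic token grants read and write access to private repositories, so prefer the narrowest grant that works: a fine-grained token with read-only permissions, or a classic token with no scopes if you only track public repos. Store it in an environment variable or secret manager, never in code.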
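Reading the token from the environment can be as small as the sketch below. The variable name GITHUB_TOKEN and the helper name are assumptions; the headers follow the form the GitHub REST API documents for token authentication.

```python
import os

def github_headers(env=None):
    """Build request headers from a token stored in the environment."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN")  # assumed variable name
    if not token:
        # Fail fast rather than silently falling back to unauthenticated limits.
        raise RuntimeError("GITHUB_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
```

Taking the environment as a parameter keeps the function testable without touching the real process environment.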
If you expose the web dashboard, add authentication. The SQLite database contains repo names and event history, which may reveal internal projects or security issues.
For notification webhooks (Slack, Discord), validate the destination URL. A misconfigured webhook could leak event data to an attacker-controlled endpoint.
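One simple form of that validation is an allowlist of expected hosts, checked before any handler is registered. The sketch below assumes Slack and Discord webhook URLs are served from hooks.slack.com and discord.com; adjust the set for your own channels. The names are illustrative.

```python
from urllib.parse import urlsplit

# Assumed allowlist; extend it for the services you actually use.
ALLOWED_WEBHOOK_HOSTS = {"hooks.slack.com", "discord.com"}

def is_allowed_webhook(url):
    """Accept only HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_WEBHOOK_HOSTS
```

Matching the exact hostname, rather than searching the URL for a substring, rejects lookalikes such as hooks.slack.com.evil.example.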
Technical Verdict
Use this pattern when:
- You monitor fewer than 100 repos
- You need change detection, not real-time streaming
- You run on a single machine or serverless function
- You want minimal infrastructure (no webhook server, no message queue)
Avoid this pattern when:
- You monitor hundreds of repos (webhooks scale better)
- You need sub-minute latency (polling introduces delay)
- You already run a persistent service (webhooks are simpler)
- You need complex event correlation (use a proper event bus)
The repo tracker agent is a practical example of agentic infrastructure: autonomous, stateful, and resilient. The plumbing is straightforward, the failure modes are manageable, and the deployment footprint is small.