HackerRank’s Hiring Agent hit GitHub trending at #10 for Python because it solves a real B2B problem: turning unstructured resume PDFs into explainable, fair candidate scores. The repository exposes a five-stage pipeline that decouples PDF extraction, LLM-based section parsing, GitHub API enrichment, schema normalization, and rule-constrained scoring. Each boundary teaches a lesson about when to trust an LLM and when to enforce hard constraints.
Why This Pipeline Matters
Most hiring automation demos stop at “LLM reads resume, outputs score.” Production systems need:
- Error isolation: If PDF extraction fails, the LLM never sees garbage.
- Auditability: Scores must cite evidence, not hallucinate qualifications.
- Fairness constraints: Category weights and deduction rules must be deterministic.
- External enrichment: GitHub signals add objective data the resume omits.
Hiring Agent implements all four. The architecture shows how to compose stochastic (LLM) and deterministic (API, scoring rules) components without letting one contaminate the other.
Pipeline Architecture
The orchestrator lives in score.py. It calls five modules in sequence:
| Stage | Module | Input | Output | Failure Mode |
|---|---|---|---|---|
| 1. PDF to Markdown | pymupdf_rag.py | PDF bytes | Markdown text per page | Corrupted PDF, scanned image |
| 2. Section extraction | pdf.py | Markdown + Jinja templates | Pydantic-validated JSON per section | LLM refuses, malformed JSON |
| 3. GitHub enrichment | github.py | GitHub username | Profile + top 7 repos | Rate limit, private profile |
| 4. Normalization | transform.py | Loose JSON | JSON Resume schema | Missing required fields |
| 5. Scoring | evaluator.py | Normalized JSON + job description | Category scores + evidence | LLM ignores constraints |
Each stage writes intermediate artifacts when DEVELOPMENT_MODE=true. The CSV output includes raw LLM responses, GitHub API payloads, and final scores for audit trails.
PDF Extraction Boundary
pymupdf_rag.py uses PyMuPDF to convert each page into Markdown-like text. It handles:
- Multi-column layouts (common in academic CVs)
- Embedded images (extracts alt text when present)
- Tables (converts to pipe-delimited Markdown)
The output is plain text. No LLM involvement. This boundary matters because:
- Error handling is deterministic: If PyMuPDF raises an exception, you know the PDF is corrupt. You do not waste tokens on a retry loop.
- Observability is simple: Log the Markdown. Diff it against the PDF. No prompt archaeology.
- Cost control: You pay for LLM tokens only after extraction succeeds.
The alternative (feeding raw PDF bytes to a vision model) burns tokens on layout parsing and introduces a second source of hallucination.
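The gate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: `extract_markdown` and `parse_with_llm` are hypothetical callables standing in for the PyMuPDF stage and the LLM stage, and `ExtractionError` is an invented exception type.

```python
from typing import Callable


class ExtractionError(Exception):
    """Raised when the deterministic PDF step yields nothing usable (hypothetical)."""


def run_gated(
    pdf_bytes: bytes,
    extract_markdown: Callable[[bytes], str],  # stand-in for the PyMuPDF stage
    parse_with_llm: Callable[[str], dict],     # stand-in for the LLM stage
) -> dict:
    try:
        markdown = extract_markdown(pdf_bytes)
    except Exception as exc:
        # Deterministic failure: report it and stop before spending any tokens.
        raise ExtractionError(f"PDF extraction failed: {exc}") from exc

    if not markdown.strip():
        # Image-only PDFs often extract to empty text instead of raising.
        raise ExtractionError("PDF produced no text (possibly a scanned image)")

    # Only reached when extraction produced real text.
    return parse_with_llm(markdown)
```

The empty-text check is an addition of this sketch: a scanned page usually has no text layer, so an exception handler alone would let an empty prompt through.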
LLM Section Parsing
pdf.py splits the Markdown into sections (contact, experience, education, skills) using Jinja templates under prompts/templates. Each template asks the LLM to extract structured fields:
```python
# Simplified from pdf.py
import json

def extract_section(markdown: str, section: str, provider: LLMProvider) -> dict:
    # jinja_env and LLMProvider are defined elsewhere in the repository
    template = jinja_env.get_template(f"{section}.jinja")
    prompt = template.render(resume_text=markdown)
    response = provider.generate(prompt)
    return json.loads(response)
```
Each response is validated against a Pydantic schema defined in models.py. For example, WorkExperience requires:
- company (string)
- position (string)
- start_date and end_date (ISO 8601)
- responsibilities (list of strings)
If the LLM returns malformed JSON, pdf.py retries once with the error message appended to the prompt. After two failures, it logs the raw response and moves on. The pipeline does not block.
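The retry behaviour described above might look like the sketch below. It is an assumption-laden illustration rather than the code in pdf.py: `generate` is a hypothetical stand-in for `provider.generate`, the wording of the correction prompt is invented, and the function assumes the model is asked for a JSON object.

```python
import json
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def generate_json_with_retry(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical stand-in for provider.generate
    max_attempts: int = 2,           # first call plus one retry
) -> Optional[dict]:
    current_prompt = prompt
    raw = ""
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # Feed the parser error back so the model can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was not valid JSON "
                f"({exc}). Return only valid JSON."
            )
    # All attempts failed: keep the raw text for debugging and do not block.
    logger.warning("Giving up after %d attempts; raw response: %r", max_attempts, raw)
    return None
```

Returning `None` instead of raising is what lets the rest of the pipeline continue with the sections that did parse.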
Why Per-Section Calls?
Sending the entire resume in one prompt risks:
- Context window overflow: A 10-page CV can exceed 8k tokens.
- Attention dilution: The LLM might skip minor sections.
- Retry cost: If one section fails, you re-process everything.
Per-section calls cost more in latency (serial requests) but isolate failures. You can parallelize them with asyncio if the provider supports concurrent requests.
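As a rough sketch of the parallel variant, assuming the provider exposes an async call: `extract_section_async` is a hypothetical coroutine (the repository's `extract_section` shown earlier is synchronous), and the section names come from the list above.

```python
import asyncio
from typing import Awaitable, Callable

SECTIONS = ["contact", "experience", "education", "skills"]


async def extract_all_sections(
    markdown: str,
    extract_section_async: Callable[[str, str], Awaitable[dict]],  # hypothetical
) -> dict:
    # return_exceptions=True collects failures as values instead of
    # raising the first one, so every section gets a result slot.
    results = await asyncio.gather(
        *(extract_section_async(markdown, name) for name in SECTIONS),
        return_exceptions=True,
    )
    parsed = {}
    for name, result in zip(SECTIONS, results):
        # A failed section is recorded as None so it can be retried on its own.
        parsed[name] = None if isinstance(result, BaseException) else result
    return parsed
```

Total latency then approaches that of the slowest section rather than the sum of all four, while failure isolation is preserved.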
GitHub Enrichment
github.py fetches the candidate’s profile and repositories via the REST API. It classifies each repo by language and topic, then asks the LLM to select the top 7 based on:
- Stars and forks (popularity signal)
- Commit frequency (activity signal)
- README quality (documentation signal)
The LLM prompt includes the full repo list as JSON. The response is a ranked list of repo names. github.py validates that all names exist in the input, then fetches detailed stats for the top 7.
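The name validation step could be sketched like this. The function name and the shape of the repo dictionaries (a `name` key) are assumptions for illustration; the source only states that every returned name must exist in the input and that seven repos are kept.

```python
def validate_ranked_repos(
    ranked: list[str],
    known_repos: list[dict],
    limit: int = 7,
) -> list[str]:
    known = {repo["name"] for repo in known_repos}
    unknown = [name for name in ranked if name not in known]
    if unknown:
        # The model invented or misspelled a repository: reject the whole ranking.
        raise ValueError(f"LLM returned repos not in the input: {unknown}")
    # Drop duplicates while preserving the model's ranking, then cap at the limit.
    deduped = list(dict.fromkeys(ranked))
    return deduped[:limit]
```

Rejecting the entire ranking on one unknown name is a conservative choice of this sketch; silently dropping the unknown entries would also be defensible.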
Tool Boundary Lesson
The agent does not call the GitHub API directly. github.py is a Python module that wraps requests and handles:
- Rate limiting (respects X-RateLimit-Remaining)
- Authentication (accepts GITHUB_TOKEN env var)
- Pagination (fetches all repos, not just the first page)
The LLM sees only the final JSON. This separation prevents:
- Credential leakage: The LLM never touches the API token.
- Retry loops: If the API returns 429, the module sleeps and retries. The LLM does not see the error.
- Hallucinated API calls: The LLM cannot invent endpoints or parameters.
If you let the LLM generate API calls (function calling), you must sandbox execution and validate every parameter. Wrapping the API in a module is simpler.
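To make the rate-limit handling concrete, here is a minimal sketch of the decision a wrapper module has to make before each request. The function name is hypothetical and the logic is an assumption, not the code in github.py; the header names are the ones GitHub documents. It expects a plain mapping with exactly these key spellings.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to sleep before the next GitHub request (0 means go ahead)."""
    now = time.time() if now is None else now
    # Retry-After, when present, is the server's explicit instruction.
    if "Retry-After" in headers:
        return float(headers["Retry-After"])
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining == 0:
        # X-RateLimit-Reset is the UTC epoch second at which the quota refills.
        reset_at = float(headers.get("X-RateLimit-Reset", now))
        return max(0.0, reset_at - now)
    return 0.0
```

Because this runs entirely inside the module, the LLM never sees a 429 and cannot be tempted to "fix" it.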
Scoring and Fairness Constraints
evaluator.py runs the final evaluation. It takes:
- Normalized JSON Resume
- Job description
- Scoring rubric (category weights, bonus rules, deduction rules)
The rubric is a YAML file. Example:
```yaml
categories:
  - name: Technical Skills
    weight: 0.3
    max_score: 10
  - name: Experience
    weight: 0.4
    max_score: 10
  - name: Education
    weight: 0.2
    max_score: 10
  - name: Projects
    weight: 0.1
    max_score: 10
bonuses:
  - condition: "GitHub stars > 100"
    points: 2
  - condition: "Open source contributions > 10"
    points: 1
deductions:
  - condition: "Employment gap > 1 year"
    points: -1
```
The LLM receives the rubric and the resume. It must return:
- A score (0-10) for each category
- Evidence (quoted text from the resume)
- Applied bonuses and deductions
evaluator.py validates:
- Scores are within bounds
- Evidence exists in the input
- Bonuses and deductions match the rubric
If validation fails, the score is rejected. The pipeline does not trust the LLM to do arithmetic or follow rules. It treats the LLM as a text-to-structured-data translator, then enforces constraints in Python.
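The three checks listed above can be sketched as a single validator. The response shape used here (`scores` mapping category names to `score` and `evidence`, plus an `applied_rules` list) is an assumption made for illustration; the actual schema in evaluator.py may differ.

```python
def validate_evaluation(evaluation: dict, rubric: dict, resume_text: str) -> list[str]:
    """Return a list of problems; an empty list means the response is accepted."""
    problems = []
    limits = {c["name"]: c["max_score"] for c in rubric["categories"]}
    scores = evaluation.get("scores", {})

    for name, max_score in limits.items():
        entry = scores.get(name)
        if entry is None:
            problems.append(f"missing category: {name}")
            continue
        if not 0 <= entry["score"] <= max_score:
            problems.append(f"{name}: score {entry['score']} outside 0-{max_score}")
        for quote in entry.get("evidence", []):
            # Exact substring check: reworded quotes are rejected as well.
            if quote not in resume_text:
                problems.append(f"{name}: evidence not found in resume: {quote!r}")

    # Every bonus or deduction the model claims must be a rule from the rubric.
    allowed = {
        rule["condition"]
        for rule in rubric.get("bonuses", []) + rubric.get("deductions", [])
    }
    for condition in evaluation.get("applied_rules", []):
        if condition not in allowed:
            problems.append(f"rule not in rubric: {condition!r}")
    return problems
```

Collecting all problems instead of stopping at the first makes the rejection log more useful when tuning prompts.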
Deterministic vs. Stochastic Split
| Component | Type | Why |
|---|---|---|
| Category score (0-10) | Stochastic | Requires judgment (e.g., “Is 5 years of Python experience worth 8/10?”) |
| Evidence extraction | Stochastic | Requires reading comprehension |
| Bonus/deduction application | Deterministic | Rules are boolean (stars > 100 is true or false) |
| Final score calculation | Deterministic | Weighted sum of category scores + bonuses + deductions |
The LLM proposes category scores and evidence. Python enforces the rest. This split prevents the LLM from inventing bonus points or ignoring deductions.
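A worked example of the deterministic half, using the rubric values shown earlier. The function and the `facts` mapping are hypothetical: this sketch assumes Python has already evaluated each rule condition to a boolean (for instance from GitHub API data), and that the final score is the weighted sum plus the points of every triggered rule.

```python
RUBRIC = {
    "categories": [
        {"name": "Technical Skills", "weight": 0.3, "max_score": 10},
        {"name": "Experience", "weight": 0.4, "max_score": 10},
        {"name": "Education", "weight": 0.2, "max_score": 10},
        {"name": "Projects", "weight": 0.1, "max_score": 10},
    ],
    "bonuses": [
        {"condition": "GitHub stars > 100", "points": 2},
        {"condition": "Open source contributions > 10", "points": 1},
    ],
    "deductions": [{"condition": "Employment gap > 1 year", "points": -1}],
}


def final_score(category_scores: dict[str, float], rubric: dict, facts: dict[str, bool]) -> float:
    weighted = sum(
        c["weight"] * category_scores[c["name"]] for c in rubric["categories"]
    )
    # A rule contributes only when its condition was established in code.
    adjustments = sum(
        rule["points"]
        for rule in rubric.get("bonuses", []) + rubric.get("deductions", [])
        if facts.get(rule["condition"], False)
    )
    return round(weighted + adjustments, 2)


scores = {"Technical Skills": 8, "Experience": 7, "Education": 6, "Projects": 9}
facts = {
    "GitHub stars > 100": True,
    "Open source contributions > 10": False,
    "Employment gap > 1 year": True,
}
print(final_score(scores, RUBRIC, facts))  # 8.3
```

The weighted part is 2.4 + 2.8 + 1.2 + 0.9 = 7.3, and the two triggered rules contribute 2 - 1 = 1, giving 8.3.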
Provider Flexibility
llm_utils.py abstracts LLM providers. The repository supports:
- Ollama (local): Runs llama3.2, qwen2.5, or any model you pull.
- Google Gemini (hosted): Uses gemini-1.5-flash or gemini-1.5-pro.
You set LLM_PROVIDER=ollama or LLM_PROVIDER=gemini in .env. The provider interface is:
```python
from typing import Protocol

class LLMProvider(Protocol):
    def generate(self, prompt: str, schema: dict | None = None) -> str:
        ...
```
If schema is provided, the provider uses structured output mode (Gemini’s response_schema or Ollama’s JSON mode). If not, it returns raw text and llm_utils.py cleans it (strips Markdown fences, fixes trailing commas).
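The clean-up path for raw text might be sketched as below. This is an illustrative assumption about what such cleaning involves, not the implementation in llm_utils.py; `clean_llm_json` is a hypothetical name.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker, built without typing it literally


def clean_llm_json(raw: str) -> dict:
    text = raw.strip()
    # Drop a leading fence line (with optional language tag) and a trailing fence.
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)]
    # Remove commas that sit directly before a closing brace or bracket.
    # Caveat: this naive regex would also alter such a pattern inside a string value.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```

Structured output mode avoids this step entirely, which is one reason to pass `schema` whenever the provider supports it.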
Why Two Providers?
- Ollama: Zero API cost, full control, slower inference.
- Gemini: Fast, cheap ($0.075 per 1M input tokens for Flash), no local GPU required.
For development, Ollama lets you iterate on prompts without burning credits. For production, Gemini handles scale. The abstraction makes switching trivial.
Observability and Audit Trails
When DEVELOPMENT_MODE=true, score.py writes:
- resume_markdown.txt: Raw PDF extraction output
- sections.json: Per-section LLM responses
- github_data.json: API payloads
- evaluation.json: Final scores and evidence
- results.csv: One row per candidate with all fields
The CSV is the audit trail. You can:
- Spot-check evidence against the resume
- Compare GitHub signals to claimed experience
- Identify scoring drift (if category scores creep up over time)
The repository does not include observability hooks (OpenTelemetry, Datadog). You would add spans around each stage and log:
- Latency per stage
- Token counts per LLM call
- GitHub API rate limit headroom
- Validation failure rates
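Without committing to a tracing vendor, per-stage spans can be approximated with the standard library. This is a minimal sketch; `stage_span`, the logger name, and the metric field names are all hypothetical and not part of the repository.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("hiring_agent.metrics")  # hypothetical logger name


@contextmanager
def stage_span(name: str, metrics: dict):
    """Record wall-clock latency and outcome for one pipeline stage."""
    start = time.perf_counter()
    record = {"ok": True}
    try:
        yield record  # callers may attach extras such as token counts
    except Exception:
        record["ok"] = False
        raise
    finally:
        record["latency_s"] = time.perf_counter() - start
        metrics[name] = record
        logger.info("stage=%s %s", name, record)


metrics: dict = {}
with stage_span("section_extraction", metrics) as span:
    span["tokens"] = 1234  # placeholder value; a real call would report usage
```

The same context manager maps naturally onto an OpenTelemetry span later, since the start, end, and attributes are already separated.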
Deployment Shape
The repository is a CLI tool. You run:
```shell
python -m hiring_agent.score --resume resume.pdf --job-description jd.txt
```
It prints the final score and writes artifacts to output/. For production, you would wrap it in:
- API server: FastAPI endpoint that accepts PDF uploads and returns JSON.
- Queue worker: Celery task that processes resumes asynchronously.
- Batch job: Kubernetes CronJob that scores a directory of resumes nightly.
The pipeline is stateless. Each invocation is independent. You can scale horizontally by running multiple workers. The only shared state is the GitHub API rate limit (5,000 requests/hour for authenticated users).
Likely Failure Modes
| Failure | Cause | Mitigation |
|---|---|---|
| PDF extraction fails | Scanned image, password-protected | Pre-flight check for encryption (e.g. PyPDF2.PdfReader.is_encrypted); treat empty extracted text as a likely scanned PDF |
| LLM returns malformed JSON | Prompt drift, model update | Retry with error message, log raw response |
| GitHub API rate limit | Too many candidates | Cache profile data, use conditional requests (If-None-Match) |
| Scoring hallucination | LLM ignores rubric | Validate all scores in Python, reject invalid responses |
| Evidence fabrication | LLM quotes text not in resume | Fuzzy match evidence against input, flag low-confidence matches |
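The pipeline handles the first four. Evidence fabrication is harder: checking that a quote exists in the input catches outright inventions, but lightly reworded or paraphrased quotes need fuzzy matching, a second LLM call to verify them, or a vector search over the resume text.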
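The fuzzy-match mitigation from the table could be approximated with the standard library's difflib. This is a sketch under assumptions: `evidence_confidence` is a hypothetical helper, and any threshold for "low confidence" would have to be tuned on real resumes.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so layout differences do not matter.
    return " ".join(text.lower().split())


def evidence_confidence(quote: str, resume_text: str) -> float:
    """Best similarity (0.0-1.0) between the quote and any same-length window of the resume."""
    q, doc = _normalize(quote), _normalize(resume_text)
    if not q or not doc:
        return 0.0
    if q in doc:
        return 1.0
    q_words, doc_words = q.split(), doc.split()
    width = len(q_words)
    best = 0.0
    # Slide a window of the quote's word count across the resume.
    for start in range(max(1, len(doc_words) - width + 1)):
        window = " ".join(doc_words[start : start + width])
        best = max(best, SequenceMatcher(None, q, window).ratio())
    return best
```

A verbatim quote scores 1.0, a quote with a small rewording scores close to it, and an invented qualification scores much lower. Character-level similarity cannot judge meaning, so this flags candidates for review rather than proving fabrication.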
Technical Verdict
Use Hiring Agent when:
- You need explainable scores with cited evidence.
- You want to enrich resumes with GitHub data.
- You can tolerate 10-30 seconds of latency per resume.
- You need both local (Ollama) and hosted (Gemini) LLM options.
Avoid it when:
- You need sub-second scoring (the pipeline is serial, not optimized for speed).
- You cannot validate LLM outputs (the scoring logic assumes you enforce constraints in code).
- You need multi-modal inputs (the pipeline does not handle video interviews or coding tests).
The repository is a teaching tool. It shows how to build a multi-stage agent pipeline with clear boundaries between stochastic and deterministic components. The code is readable, the prompts are versioned, and the audit trail is built in. If you are building hiring automation, this is a better starting point than a single LLM call.