The Problem: Building a Labour Market Intelligence Product

An enterprise HR analytics company was building a product that would let large organisations benchmark compensation, track hiring trends and identify skill demand patterns across global markets. The product was only as good as its underlying data — and they had no data.

They needed a continuously updated feed of job listings from across the world — structured, deduplicated, normalised and delivered in a format their product could query. Not a one-time export. Not a CSV from a data broker. A living, breathing pipeline that refreshed daily and kept pace with the job market in real time.

The added complexity: job listings are notoriously messy. The same role can appear on five to ten different platforms with different titles, salary formats, location formats and closing dates. The real engineering challenge wasn't collection — it was normalisation.

❌ Challenge: 15,000+ job boards, aggregators and company career pages — all with different structures and anti-bot protections
✓ Solution: Tiered scraper architecture — high-volume sources on dedicated spiders, long-tail sources batched via a generalised crawler

❌ Challenge: Same job duplicated across 5–10 platforms with different titles, salary formats and date formats
✓ Solution: Fuzzy deduplication engine using title + company + location hashing — reduces 52K raw listings to ~48K unique per day

❌ Challenge: Salary data wildly inconsistent — hourly, monthly, annual, currency variations, ranges vs single values
✓ Solution: Salary normalisation layer converts all compensation data to a standardised annual USD range with original values preserved
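The title + company + location hashing behind the deduplication engine can be sketched in a few lines. This is a simplified illustration, not the production code: the noise-word list, normalisation rules and choice of SHA-256 are assumptions for the example.

```python
import hashlib
import re

# Words boards commonly append to titles; illustrative, not exhaustive
NOISE = {"remote", "urgent", "new", "hiring"}

def _norm(text: str) -> str:
    """Lowercase, strip punctuation, drop noise tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in NOISE)

def canonical_key(title: str, company: str, location: str) -> str:
    """Fuzzy dedup key built from the three fields the engine hashes."""
    raw = "|".join((_norm(title), _norm(company), _norm(location)))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Two postings of the same role on different boards collapse to one key
a = canonical_key("Senior Software Engineer (Remote)", "Acme Ltd.", "London, UK")
b = canonical_key("SENIOR SOFTWARE ENGINEER", "Acme Ltd", "London UK")
assert a == b
```

Hashing the normalised tuple rather than comparing raw strings means deduplication is a single indexed lookup per incoming listing, which is what makes it tractable at 52K records per day.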

What Each Job Record Contains

Every job listing is normalised into a consistent schema before delivery — making it immediately queryable without any client-side cleaning.

Job Listing Record — Normalised Schema
Field               Type      Description
job_id              string    Unique canonical ID (deduplicated across sources)
title_normalised    string    "Senior Software Engineer" (standardised)
company_name        string    Resolved against company entity database
location            object    { city, country, remote: true/false, lat, lng }
salary_usd_annual   object    { min: 80000, max: 120000, currency_orig: "GBP" }
skills_extracted    array     ["Python", "PostgreSQL", "AWS"] — NLP extracted
seniority_level     enum      junior / mid / senior / lead / executive
posted_at           datetime  Normalised UTC timestamp
source_count        integer   Number of platforms this listing appeared on
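The salary_usd_annual field above is produced by the normalisation layer. A minimal sketch of that conversion follows — the FX rates and full-time hour assumptions here are illustrative only; the production pipeline would use live rates across its 12 supported currencies.

```python
# Illustrative static FX rates — assumptions for the example, not live data
FX_TO_USD = {"USD": 1.0, "GBP": 1.27, "EUR": 1.08}
# Annualisation factors: 40h x 52 weeks for hourly, 12 months for monthly
ANNUAL_FACTOR = {"hour": 2080, "month": 12, "year": 1}

def normalise_salary(lo: float, hi: float, currency: str, period: str) -> dict:
    """Convert any (currency, period) range to annual USD, preserving the
    original currency as in the salary_usd_annual schema field."""
    scale = FX_TO_USD[currency] * ANNUAL_FACTOR[period]
    return {
        "min": round(lo * scale),
        "max": round(hi * scale),
        "currency_orig": currency,
    }

# A GBP annual range normalised to USD
record = normalise_salary(60000, 80000, "GBP", "year")
# → {"min": 76200, "max": 101600, "currency_orig": "GBP"}
```

Keeping currency_orig alongside the converted range means clients can always audit the conversion against the source listing.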

The Pipeline Architecture

The system runs as a distributed pipeline across AWS infrastructure — each stage is independently scalable and has its own monitoring, alerting and dead-letter queue so failures in one source don't cascade.

🕸️ 15K+ Sources — boards, aggregators, careers pages
🤖 Scrapy + Playwright — JS-heavy sites handled
🔄 Dedupe + Normalise — fuzzy hash + schema mapping
🧠 NLP Enrichment — skills & seniority extraction
📡 API + S3 Delivery — daily batch + real-time
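The NLP enrichment stage can be illustrated with a simplified dictionary-matching sketch. The production system used a trained spaCy model over a far larger skills vocabulary; the tiny taxonomy and regex matching here are stand-ins for the example.

```python
import re

# Tiny illustrative taxonomy — the real system used a trained spaCy model
SKILL_TAXONOMY = ["Python", "PostgreSQL", "AWS", "Docker", "Redis"]

def extract_skills(description: str) -> list[str]:
    """Case-insensitive whole-word match against the taxonomy,
    returning canonical casing for the skills_extracted field."""
    found = []
    for skill in SKILL_TAXONOMY:
        if re.search(rf"\b{re.escape(skill)}\b", description, re.IGNORECASE):
            found.append(skill)
    return found

skills = extract_skills("Experience with python, Postgresql and aws required.")
# → ["Python", "PostgreSQL", "AWS"]
```

Matching against a canonical taxonomy, rather than emitting raw tokens, is what keeps skills_extracted queryable — "python", "Python" and "Python 3" all resolve to one canonical value.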

Apache Airflow orchestrates the daily schedule — high-priority sources (LinkedIn, Indeed, Glassdoor) are scraped every 4 hours while long-tail sources run once daily. A dead-letter queue catches any failed crawls and retries them automatically within the same daily window.
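The tiered schedule above can be sketched independently of Airflow as a simple policy function — in production this lives as two Airflow DAG schedules, and the 02:00 UTC long-tail window here is an assumed value for illustration.

```python
from datetime import datetime, timezone

def tiers_due(now: datetime) -> list[str]:
    """Which scrape tiers run at this hour: tier 1 every 4 hours,
    the long tail once per day (02:00 UTC is an assumed window)."""
    due = []
    if now.hour % 4 == 0:
        due.append("tier1")   # LinkedIn, Indeed, Glassdoor, ...
    if now.hour == 2:
        due.append("tier2")   # 14,000+ long-tail sources
    return due

# 08:00 UTC is a tier-1 run; 02:00 UTC is the long-tail window
assert tiers_due(datetime(2024, 1, 1, 8, tzinfo=timezone.utc)) == ["tier1"]
assert tiers_due(datetime(2024, 1, 1, 2, tzinfo=timezone.utc)) == ["tier2"]
```

Splitting the two tiers into separate schedules is also what lets the dead-letter retries for failed long-tail crawls complete within the same daily window without blocking the 4-hourly tier-1 runs.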

Technologies Used

Python 3.11, Scrapy, Playwright, Apache Airflow, PostgreSQL 15, Redis, spaCy (NLP), AWS EC2 + S3, SQS (dead-letter), FastAPI, Docker, Bright Data Proxies

Delivered in 10 Weeks

1. Weeks 1–2 — Source Mapping & Schema Design: Catalogued all 15,000+ target sources by priority tier. Designed the normalised job record schema and agreed on the delivery format with the client's data team.

2. Weeks 3–5 — Tier 1 Scrapers + Core Pipeline: Built scrapers for the top 50 highest-volume sources, with the core deduplication and normalisation pipeline running. First 20K daily records delivered to the staging environment.

3. Weeks 6–7 — NLP Enrichment Layer: Trained a spaCy model for skill extraction from job descriptions. Seniority classification model built and validated. Salary normalisation implemented across 12 currencies.

4. Weeks 8–9 — Long-Tail Sources + Full Scale: Generalised crawler deployed for 14,000+ lower-volume sources. Full 50K+ daily volume achieved and validated across all 12 countries. QA benchmarks met.

5. Week 10 — Production Handover & Monitoring Live: Production deployment on AWS. Grafana dashboards and PagerDuty alerts configured. Client data team trained on the API. Pipeline running autonomously from day one.

The Data That Powers a Product

The pipeline became the core data layer powering the client's HR analytics product — used by enterprise customers across 12 countries to benchmark compensation, identify skill demand trends and track hiring velocity by role, location and industry.

50K+ — job listings collected and delivered every day
99% — pipeline uptime, with zero missed daily deliveries since launch
10 weeks — from brief to full production deployment

The client launched their analytics product six months after we delivered the pipeline — and credited the data quality as a key differentiator versus competitors using lower-quality third-party data feeds. The salary normalisation layer in particular was called out by enterprise buyers as something they hadn't seen elsewhere.