The Problem: Building a Labour Market Intelligence Product

An enterprise HR analytics company was building a product that would let large organisations benchmark compensation, track hiring trends and identify skill demand patterns across global markets. The product was only as good as its underlying data — and they had no data.

They needed a continuously updated feed of job listings from across the world — structured, deduplicated, normalised and delivered in a format their product could query. Not a one-time export. Not a CSV from a data broker. A living, breathing pipeline that refreshed daily and kept pace with the job market in real time.

The added complexity: job listings are notoriously messy. The same role can appear on five to ten different platforms with different titles, salary formats, location formats and closing dates. The real engineering challenge wasn't collection — it was normalisation.

❌ Challenge: 15,000+ job boards, aggregators and company career pages — all with different structures and anti-bot protections
✓ Solution: Tiered scraper architecture — high-volume sources on dedicated spiders, long-tail sources batched via a generalised crawler

❌ Challenge: Same job duplicated across 5–10 platforms with different titles, salary formats and date formats
✓ Solution: Fuzzy deduplication engine using title + company + location hashing — reduces 52K raw listings to ~48K unique per day

❌ Challenge: Salary data wildly inconsistent — hourly, monthly, annual, currency variations, ranges vs single values
✓ Solution: Salary normalisation layer converts all compensation data to a standardised annual USD range with original values preserved
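The title + company + location hashing behind the deduplication engine can be sketched in a few lines. This is a simplified illustration, not the production code: the noise-word list, normalisation rules and choice of SHA-256 are assumptions for the example.

```python
import hashlib
import re

# Words boards commonly append to titles; illustrative, not exhaustive
NOISE = {"remote", "urgent", "new", "hiring"}

def _norm(text: str) -> str:
    """Lowercase, strip punctuation, drop noise tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in NOISE)

def canonical_key(title: str, company: str, location: str) -> str:
    """Fuzzy dedup key built from the three fields the engine hashes."""
    raw = "|".join((_norm(title), _norm(company), _norm(location)))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Two postings of the same role on different boards collapse to one key
a = canonical_key("Senior Software Engineer (Remote)", "Acme Ltd.", "London, UK")
b = canonical_key("SENIOR SOFTWARE ENGINEER", "Acme Ltd", "London UK")
assert a == b
```

Hashing the normalised tuple rather than comparing raw strings means deduplication is a single indexed lookup per incoming listing, which is what makes it tractable at 52K records per day.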

What Each Job Record Contains

Every job listing is normalised into a consistent schema before delivery — making it immediately queryable without any client-side cleaning.

Job Listing Record — Normalised Schema
Field               Type      Description
job_id              string    Unique canonical ID (deduplicated across sources)
title_normalised    string    "Senior Software Engineer" (standardised)
company_name        string    Resolved against company entity database
location            object    { city, country, remote: true/false, lat, lng }
salary_usd_annual   object    { min: 80000, max: 120000, currency_orig: "GBP" }
skills_extracted    array     ["Python", "PostgreSQL", "AWS"] — NLP extracted
seniority_level     enum      junior / mid / senior / lead / executive
posted_at           datetime  Normalised UTC timestamp
source_count        integer   Number of platforms this listing appeared on
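The salary_usd_annual field above is produced by the normalisation layer. A minimal sketch of that conversion follows — the FX rates and full-time hour assumptions here are illustrative only; the production pipeline would use live rates across its 12 supported currencies.

```python
# Illustrative static FX rates — assumptions for the example, not live data
FX_TO_USD = {"USD": 1.0, "GBP": 1.27, "EUR": 1.08}
# Annualisation factors: 40h x 52 weeks for hourly, 12 months for monthly
ANNUAL_FACTOR = {"hour": 2080, "month": 12, "year": 1}

def normalise_salary(lo: float, hi: float, currency: str, period: str) -> dict:
    """Convert any (currency, period) range to annual USD, preserving the
    original currency as in the salary_usd_annual schema field."""
    scale = FX_TO_USD[currency] * ANNUAL_FACTOR[period]
    return {
        "min": round(lo * scale),
        "max": round(hi * scale),
        "currency_orig": currency,
    }

# A GBP annual range normalised to USD
record = normalise_salary(60000, 80000, "GBP", "year")
# → {"min": 76200, "max": 101600, "currency_orig": "GBP"}
```

Keeping currency_orig alongside the converted range means clients can always audit the conversion against the source listing.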

The Pipeline Architecture

The system runs as a distributed pipeline across AWS infrastructure — each stage is independently scalable and has its own monitoring, alerting and dead-letter queue so failures in one source don't cascade.

🕸️ 15K+ Sources — boards, aggregators, careers pages
🤖 Scrapy + Playwright — JS-heavy sites handled
🔄 Dedupe + Normalise — fuzzy hash + schema mapping
🧠 NLP Enrichment — skills & seniority extraction
📡 API + S3 Delivery — daily batch + real-time
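The NLP enrichment stage can be illustrated with a simplified dictionary-matching sketch. The production system used a trained spaCy model over a far larger skills vocabulary; the tiny taxonomy and regex matching here are stand-ins for the example.

```python
import re

# Tiny illustrative taxonomy — the real system used a trained spaCy model
SKILL_TAXONOMY = ["Python", "PostgreSQL", "AWS", "Docker", "Redis"]

def extract_skills(description: str) -> list[str]:
    """Case-insensitive whole-word match against the taxonomy,
    returning canonical casing for the skills_extracted field."""
    found = []
    for skill in SKILL_TAXONOMY:
        if re.search(rf"\b{re.escape(skill)}\b", description, re.IGNORECASE):
            found.append(skill)
    return found

skills = extract_skills("Experience with python, Postgresql and aws required.")
# → ["Python", "PostgreSQL", "AWS"]
```

Matching against a canonical taxonomy, rather than emitting raw tokens, is what keeps skills_extracted queryable — "python", "Python" and "Python 3" all resolve to one canonical value.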

Apache Airflow orchestrates the daily schedule — high-priority sources (LinkedIn, Indeed, Glassdoor) are scraped every 4 hours while long-tail sources run once daily. A dead-letter queue catches any failed crawls and retries them automatically within the same daily window.
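The tiered schedule above can be sketched independently of Airflow as a simple policy function — in production this lives as two Airflow DAG schedules, and the 02:00 UTC long-tail window here is an assumed value for illustration.

```python
from datetime import datetime, timezone

def tiers_due(now: datetime) -> list[str]:
    """Which scrape tiers run at this hour: tier 1 every 4 hours,
    the long tail once per day (02:00 UTC is an assumed window)."""
    due = []
    if now.hour % 4 == 0:
        due.append("tier1")   # LinkedIn, Indeed, Glassdoor, ...
    if now.hour == 2:
        due.append("tier2")   # 14,000+ long-tail sources
    return due

# 08:00 UTC is a tier-1 run; 02:00 UTC is the long-tail window
assert tiers_due(datetime(2024, 1, 1, 8, tzinfo=timezone.utc)) == ["tier1"]
assert tiers_due(datetime(2024, 1, 1, 2, tzinfo=timezone.utc)) == ["tier2"]
```

Splitting the two tiers into separate schedules is also what lets the dead-letter retries for failed long-tail crawls complete within the same daily window without blocking the 4-hourly tier-1 runs.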

Technologies Used

Python 3.11, Scrapy, Playwright, Apache Airflow, PostgreSQL 15, Redis, spaCy (NLP), AWS EC2 + S3, SQS (dead-letter), FastAPI, Docker, Bright Data Proxies

Delivered in 10 Weeks

1. Weeks 1–2 — Source Mapping & Schema Design: Catalogued all 15,000+ target sources by priority tier. Designed the normalised job record schema and agreed on the delivery format with the client's data team.

2. Weeks 3–5 — Tier 1 Scrapers + Core Pipeline: Built scrapers for the top 50 highest-volume sources, with the core deduplication and normalisation pipeline running. First 20K daily records delivered to the staging environment.

3. Weeks 6–7 — NLP Enrichment Layer: Trained a spaCy model for skill extraction from job descriptions. Seniority classification model built and validated. Salary normalisation implemented across 12 currencies.

4. Weeks 8–9 — Long-Tail Sources + Full Scale: Generalised crawler deployed for 14,000+ lower-volume sources. Full 50K+ daily volume achieved and validated across all 12 countries. QA benchmarks met.

5. Week 10 — Production Handover & Monitoring Live: Production deployment on AWS. Grafana dashboards and PagerDuty alerts configured. Client data team trained on the API. Pipeline running autonomously from day one.

The Data That Powers a Product

The pipeline became the core data layer powering the client's HR analytics product — used by enterprise customers across 12 countries to benchmark compensation, identify skill demand trends and track hiring velocity by role, location and industry.

50K+ — job listings collected and delivered every day
99% — pipeline uptime, with zero missed daily deliveries since launch
10 weeks — from brief to full production deployment

The client launched their analytics product six months after we delivered the pipeline — and credited the data quality as a key differentiator versus competitors using lower-quality third-party data feeds. The salary normalisation layer in particular was called out by enterprise buyers as something they hadn't seen elsewhere.