📋 The Brief
The Problem: Blind Spots in Every Direction
A PropTech startup was building an investment intelligence platform for property buyers, agents and portfolio managers. The core product promise was simple: tell users where prices are rising before the mainstream news does, flag undervalued properties the moment they hit the market, and show which areas are cooling before buyers overpay.
The problem was data. Public property data in the UK is fragmented — listing portals don't offer APIs, Land Registry data is months old by the time it's published, and no single source captures the full market. Building a product on stale, incomplete data would undermine the entire value proposition.
The brief was to build a scraping infrastructure that treated eight major portals as a distributed data source — ingesting, deduplicating and normalising listings daily, tracking price changes and status updates over time, and generating the market signals that would power the product's intelligence layer.
❌ Challenge
No official APIs from UK listing portals — all data behind JavaScript rendering, anti-bot protections and rate limiting
✓ Solution
Playwright-based scraper fleet with rotating residential proxies, fingerprint randomisation and adaptive request pacing
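The pacing and fingerprint-rotation logic can be sketched in pure Python. The proxy endpoints, user-agent strings and delay values below are illustrative placeholders, not the production configuration — in the real fleet these values are fed into Playwright browser contexts per scrape session.

```python
import random
import time

# Illustrative pools — real proxy endpoints and fingerprints are configured elsewhere.
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
VIEWPORTS = [(1920, 1080), (1366, 768), (1536, 864)]


def random_fingerprint():
    """Pick a randomised browser identity for each new browser context."""
    width, height = random.choice(VIEWPORTS)
    return {
        "proxy": {"server": random.choice(PROXIES)},
        "user_agent": random.choice(USER_AGENTS),
        "viewport": {"width": width, "height": height},
    }


class AdaptivePacer:
    """Backs off when the portal pushes back (HTTP 403/429) and slowly
    recovers toward the base delay on success."""

    def __init__(self, base_delay=2.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def record(self, status_code):
        if status_code in (403, 429):
            self.delay = min(self.delay * 2, self.max_delay)   # exponential backoff
        else:
            self.delay = max(self.delay * 0.9, self.base_delay)  # gradual recovery

    def wait(self):
        # Jitter avoids a detectable fixed request cadence.
        time.sleep(self.delay * random.uniform(0.8, 1.2))
```

The jitter matters as much as the backoff — a perfectly regular request interval is itself a bot signature.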
❌ Challenge
The same property listed on 5 different portals under different reference numbers and slightly different addresses
✓ Solution
Multi-signal deduplication: postcode + bedrooms + price + address fuzzy match gives 96.4% accuracy on cross-portal matching
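A minimal sketch of the matching rule, using the standard library's SequenceMatcher for the fuzzy address signal — the tolerance and threshold values here are illustrative, not the production-tuned numbers behind the 96.4% figure:

```python
from difflib import SequenceMatcher


def address_similarity(a: str, b: str) -> float:
    """Fuzzy ratio on lightly normalised address strings."""
    def norm(s: str) -> str:
        return " ".join(s.lower().replace(",", " ").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def is_same_property(a: dict, b: dict,
                     price_tolerance=0.02, address_threshold=0.85) -> bool:
    """Multi-signal match: hard keys (postcode, bedrooms) must agree,
    price must be within tolerance, address must fuzzy-match."""
    if a["postcode"].replace(" ", "").upper() != b["postcode"].replace(" ", "").upper():
        return False
    if a["bedrooms"] != b["bedrooms"]:
        return False
    # Portals sometimes round or stage the same price slightly differently.
    if abs(a["price"] - b["price"]) > price_tolerance * max(a["price"], b["price"]):
        return False
    return address_similarity(a["address"], b["address"]) >= address_threshold
```

Comparing every pair across 500K listings would be quadratic, so in practice candidates are first grouped on the hard keys (postcode + bedrooms) and the fuzzy comparison runs only within each group.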
❌ Challenge
Price, status and description changes happen intraday — daily scrapes miss the full picture of market dynamics
✓ Solution
High-velocity listings (flagged by a previous-day engagement proxy) scraped 3× daily — full delta tracking stored per listing
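Delta tracking itself reduces to a field-level diff between consecutive snapshots of the same listing. A sketch — the change-record shape is illustrative:

```python
from datetime import datetime, timezone

TRACKED_FIELDS = ("price", "status", "description")


def listing_delta(previous: dict, current: dict) -> list[dict]:
    """Compare two scrape snapshots of one listing and emit a change
    record for every tracked field that moved."""
    seen_at = datetime.now(timezone.utc).isoformat()
    deltas = []
    for field in TRACKED_FIELDS:
        if previous.get(field) != current.get(field):
            deltas.append({
                "field": field,
                "old": previous.get(field),
                "new": current.get(field),
                "seen_at": seen_at,
            })
    return deltas
```

Appending each delta rather than overwriting the record is what makes signals like "reduced twice in ten days" possible downstream.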
🗃️ Data Model
What a Normalised Listing Looks Like
Every listing from every portal is normalised into the same schema before it touches the database — regardless of source format. This is the record structure that powers every signal and dashboard in the product.
listing_id: uuid
source_refs: string[]
address_normalised: object
property_type: enum
bedrooms: int
price_history: object[]
price_per_sqft: number | null
status_history: enum[]
first_seen_at: timestamp
days_on_market: int
postcode_avg_price: number
discount_to_avg_pct: number
features_extracted: string[]
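Expressed as a Python dataclass, the schema above might look like this — the enum members are illustrative (the production values aren't listed here), and the two fields with defaults are moved to the end as dataclasses require:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class PropertyType(Enum):
    # Illustrative members — the production enum may differ.
    FLAT = "flat"
    TERRACED = "terraced"
    SEMI_DETACHED = "semi_detached"
    DETACHED = "detached"


class ListingStatus(Enum):
    FOR_SALE = "for_sale"
    UNDER_OFFER = "under_offer"
    SOLD_STC = "sold_stc"
    WITHDRAWN = "withdrawn"


@dataclass
class NormalisedListing:
    listing_id: str                    # uuid
    source_refs: list[str]             # per-portal reference numbers
    address_normalised: dict
    property_type: PropertyType
    bedrooms: int
    price_history: list[dict]          # e.g. [{"price": ..., "seen_at": ...}]
    status_history: list[ListingStatus]
    first_seen_at: datetime
    days_on_market: int
    postcode_avg_price: float
    discount_to_avg_pct: float
    price_per_sqft: Optional[float] = None   # null when floor area is unknown
    features_extracted: list[str] = field(default_factory=list)
```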
🌐 Sources Covered
Eight Portals, One Unified Feed
Each portal required its own scraper implementation — different rendering engines, different anti-bot approaches, different data structures. All eight feed into the same normalisation layer and land in the database in one identical record format.
Rightmove · ~220K · Sale + Rental · 🇬🇧 UK
Zoopla · ~140K · Sale + Rental · 🇬🇧 UK
OnTheMarket · ~65K · Sale · 🇬🇧 UK
PrimeLocation · ~28K · Luxury / Sale · 🇬🇧 UK
SpareRoom · ~18K · Rental · 🇬🇧 UK
OpenRent · ~12K · Rental · 🇬🇧 UK
Auction Sites · ~8K · Auction · 🇬🇧 3 sources
🏗️ Pipeline Architecture
How 500K Listings Flow Every Day
The pipeline runs on a distributed worker fleet orchestrated by Apache Airflow. Each portal has an independent scraper worker that can scale horizontally — if one source becomes temporarily unavailable, the rest of the pipeline continues unaffected.
🌐 8 Portals (Playwright scrapers) → 🧹 Normalise (address, price, type) → 🔁 Deduplicate (96.4% accuracy) → 🧠 NLP Enrich (feature extraction) → 📊 Signal Engine (alerts + trends)
The signal engine is the product layer — it runs on top of the clean, normalised data and computes market signals using configurable rules. A "below-market cluster" alert fires when 10+ listings in a postcode are priced more than 7% below the rolling 90-day average. A "velocity spike" fires when days-on-market drops more than 30% week-over-week. All thresholds are configurable by the product team without touching the pipeline code.
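The two rules described above can be sketched directly against the normalised schema — listing dicts carry the `price` and `postcode_avg_price` fields, the thresholds mirror the defaults quoted above, and the function shapes are illustrative:

```python
from statistics import mean


def below_market_cluster(listings, min_count=10, discount_pct=7.0):
    """Fires when at least `min_count` listings in a postcode are priced
    more than `discount_pct` below the rolling 90-day postcode average."""
    flagged = [
        l for l in listings
        if l["price"] < l["postcode_avg_price"] * (1 - discount_pct / 100)
    ]
    return len(flagged) >= min_count, flagged


def velocity_spike(dom_this_week, dom_last_week, drop_pct=30.0):
    """Fires when average days-on-market drops more than `drop_pct`
    week-over-week."""
    if not dom_this_week or not dom_last_week:
        return False
    prev, curr = mean(dom_last_week), mean(dom_this_week)
    return prev > 0 and (prev - curr) / prev * 100 > drop_pct
```

Because the thresholds are plain keyword arguments, the product team can tune them from configuration without touching the pipeline code, exactly as described above.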
⚙️ Tech Stack
Technologies Used
Python 3.11
Playwright
Scrapy
Apache Airflow
PostgreSQL 15
TimescaleDB
Redis
spaCy (NLP)
Bright Data Proxies
AWS EC2 + S3
FastAPI
Docker + ECS
📅 Timeline
Pipeline Live in 11 Weeks
1
Week 1–2
Portal Audit & Architecture Design
Profiled all 8 portals — rendering engine, anti-bot measures, data structure, update frequency. Designed the normalised schema and deduplication strategy. Architecture reviewed and signed off.
2
Week 3–5
Tier 1 Scrapers — Rightmove & Zoopla
The two highest-volume scrapers built and stabilised. Proxy rotation, fingerprinting and adaptive pacing tuned. First 200K normalised listings in the database by end of week 5. Deduplication engine v1 live.
3
Week 6–8
Remaining 6 Sources + NLP Enrichment
OnTheMarket, PrimeLocation, SpareRoom, OpenRent, Gumtree and auction scrapers built. spaCy NLP pipeline for feature extraction from descriptions live. Price history delta tracking operational.
4
Week 9–10
Signal Engine + API Layer
Market signal detection engine built with configurable rules. FastAPI query layer live — product team can query any postcode, filter by signal type, pull price history time series. 18-month backfill completed.
5
Week 11
Monitoring, Alerting & Production Handover
Airflow DAG monitoring live with PagerDuty alerting for scraper failures. Dead-letter queue for failed pages. Runbook documented. Pipeline handed over — client's product team shipped their first dashboard feature the same week.
📈 Results
Half a Million Listings. Daily. Reliably.
The pipeline has run every day since launch without a single full-day outage. Individual portal scrapers occasionally encounter temporary blocks — the monitoring system detects these within 15 minutes and the dead-letter queue ensures no listings are permanently lost, just delayed.
500K+
Property listings ingested and normalised every day
99.2%
Pipeline uptime since launch — zero full-day outages
18mo
Historical data backfilled — giving the product a head start
The client launched their beta product six weeks after pipeline handover. The investment signal feature — which surfaces below-market listings and velocity hotspots — became the product's primary acquisition driver, with users sharing alerts with their networks. The pipeline is the product's moat.
The client has since commissioned a second phase to expand coverage to Ireland and Scotland, and to add rental yield calculations using Land Registry sold prices as a benchmark layer.