📋 The Brief
The Problem: Blind Spots in Every Direction
A PropTech startup was building an investment intelligence platform for property buyers, agents and portfolio managers. The core product promise was simple: tell users where prices are rising before the mainstream news does, flag undervalued properties the moment they hit the market, and show which areas are cooling before buyers overpay.
The problem was data. Public property data in the UK is fragmented — listing portals don't offer APIs, Land Registry data is months old by the time it's published, and no single source captures the full market. Building a product on stale, incomplete data would undermine the entire value proposition.
The brief was to build a scraping infrastructure that treated eight major portals as a distributed data source — ingesting, deduplicating and normalising listings daily, tracking price changes and status updates over time, and generating the market signals that would power the product's intelligence layer.
❌ Challenge
No official APIs from UK listing portals — all data behind JavaScript rendering, anti-bot protections and rate limiting
✓ Solution
Playwright-based scraper fleet with rotating residential proxies, fingerprint randomisation and adaptive request pacing
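The pacing and fingerprint-rotation logic can be sketched in pure Python. The proxy endpoints, user-agent strings and delay values below are illustrative placeholders, not the production configuration — in the real fleet these values are fed into Playwright browser contexts per scrape session.

```python
import random
import time

# Illustrative pools — real proxy endpoints and fingerprints are configured elsewhere.
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
VIEWPORTS = [(1920, 1080), (1366, 768), (1536, 864)]


def random_fingerprint():
    """Pick a randomised browser identity for each new browser context."""
    width, height = random.choice(VIEWPORTS)
    return {
        "proxy": {"server": random.choice(PROXIES)},
        "user_agent": random.choice(USER_AGENTS),
        "viewport": {"width": width, "height": height},
    }


class AdaptivePacer:
    """Backs off when the portal pushes back (HTTP 403/429) and slowly
    recovers toward the base delay on success."""

    def __init__(self, base_delay=2.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def record(self, status_code):
        if status_code in (403, 429):
            self.delay = min(self.delay * 2, self.max_delay)   # exponential backoff
        else:
            self.delay = max(self.delay * 0.9, self.base_delay)  # gradual recovery

    def wait(self):
        # Jitter avoids a detectable fixed request cadence.
        time.sleep(self.delay * random.uniform(0.8, 1.2))
```

The jitter matters as much as the backoff — a perfectly regular request interval is itself a bot signature.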
❌ Challenge
The same property listed on 5 different portals under different reference numbers and slightly different addresses
✓ Solution
Multi-signal deduplication: postcode + bedrooms + price + address fuzzy match gives 96.4% accuracy on cross-portal matching
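A minimal sketch of the matching rule, using the standard library's SequenceMatcher for the fuzzy address signal — the tolerance and threshold values here are illustrative, not the production-tuned numbers behind the 96.4% figure:

```python
from difflib import SequenceMatcher


def address_similarity(a: str, b: str) -> float:
    """Fuzzy ratio on lightly normalised address strings."""
    def norm(s: str) -> str:
        return " ".join(s.lower().replace(",", " ").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def is_same_property(a: dict, b: dict,
                     price_tolerance=0.02, address_threshold=0.85) -> bool:
    """Multi-signal match: hard keys (postcode, bedrooms) must agree,
    price must be within tolerance, address must fuzzy-match."""
    if a["postcode"].replace(" ", "").upper() != b["postcode"].replace(" ", "").upper():
        return False
    if a["bedrooms"] != b["bedrooms"]:
        return False
    # Portals sometimes round or stage the same price slightly differently.
    if abs(a["price"] - b["price"]) > price_tolerance * max(a["price"], b["price"]):
        return False
    return address_similarity(a["address"], b["address"]) >= address_threshold
```

Comparing every pair across 500K listings would be quadratic, so in practice candidates are first grouped on the hard keys (postcode + bedrooms) and the fuzzy comparison runs only within each group.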
❌ Challenge
Price, status and description changes happen intraday — daily scrapes miss the full picture of market dynamics
✓ Solution
High-velocity listings (flagged by a previous-day engagement proxy) scraped 3× daily — full delta tracking stored per listing
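Delta tracking itself reduces to a field-level diff between consecutive snapshots of the same listing. A sketch — the change-record shape is illustrative:

```python
from datetime import datetime, timezone

TRACKED_FIELDS = ("price", "status", "description")


def listing_delta(previous: dict, current: dict) -> list[dict]:
    """Compare two scrape snapshots of one listing and emit a change
    record for every tracked field that moved."""
    seen_at = datetime.now(timezone.utc).isoformat()
    deltas = []
    for field in TRACKED_FIELDS:
        if previous.get(field) != current.get(field):
            deltas.append({
                "field": field,
                "old": previous.get(field),
                "new": current.get(field),
                "seen_at": seen_at,
            })
    return deltas
```

Appending each delta rather than overwriting the record is what makes signals like "reduced twice in ten days" possible downstream.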
🗃️ Data Model
What a Normalised Listing Looks Like
Every listing from every portal is normalised into the same schema before it touches the database — regardless of source format. This is the record structure that powers every signal and dashboard in the product.
listing_id: uuid
source_refs: string[]
address_normalised: object
property_type: enum
bedrooms: int
price_history: object[]
price_per_sqft: number | null
status_history: enum[]
first_seen_at: timestamp
days_on_market: int
postcode_avg_price: number
discount_to_avg_pct: number
features_extracted: string[]
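Expressed as a Python dataclass, the schema above might look like this — the enum members are illustrative (the production values aren't listed here), and the two fields with defaults are moved to the end as dataclasses require:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class PropertyType(Enum):
    # Illustrative members — the production enum may differ.
    FLAT = "flat"
    TERRACED = "terraced"
    SEMI_DETACHED = "semi_detached"
    DETACHED = "detached"


class ListingStatus(Enum):
    FOR_SALE = "for_sale"
    UNDER_OFFER = "under_offer"
    SOLD_STC = "sold_stc"
    WITHDRAWN = "withdrawn"


@dataclass
class NormalisedListing:
    listing_id: str                    # uuid
    source_refs: list[str]             # per-portal reference numbers
    address_normalised: dict
    property_type: PropertyType
    bedrooms: int
    price_history: list[dict]          # e.g. [{"price": ..., "seen_at": ...}]
    status_history: list[ListingStatus]
    first_seen_at: datetime
    days_on_market: int
    postcode_avg_price: float
    discount_to_avg_pct: float
    price_per_sqft: Optional[float] = None   # null when floor area is unknown
    features_extracted: list[str] = field(default_factory=list)
```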
🌐 Sources Covered
Eight Portals, One Unified Feed
Each portal required its own scraper implementation — different rendering engines, different anti-bot approaches, different data structures. All eight feed into the same normalisation layer and land in the database in one identical record format.
Rightmove · ~220K · Sale + Rental · 🇬🇧 UK
Zoopla · ~140K · Sale + Rental · 🇬🇧 UK
OnTheMarket · ~65K · Sale · 🇬🇧 UK
PrimeLocation · ~28K · Luxury / Sale · 🇬🇧 UK
SpareRoom · ~18K · Rental · 🇬🇧 UK
OpenRent · ~12K · Rental · 🇬🇧 UK
Auction Sites · ~8K · Auction · 🇬🇧 3 sources
🏗️ Pipeline Architecture
How 500K Listings Flow Every Day
The pipeline runs on a distributed worker fleet orchestrated by Apache Airflow. Each portal has an independent scraper worker that can scale horizontally — if one source becomes temporarily unavailable, the rest of the pipeline continues unaffected.
🌐 8 Portals (Playwright scrapers) → 🧹 Normalise (address, price, type) → 🔁 Deduplicate (96.4% accuracy) → 🧠 NLP Enrich (feature extraction) → 📊 Signal Engine (alerts + trends)
The signal engine is the product layer — it runs on top of the clean, normalised data and computes market signals using configurable rules. A "below-market cluster" alert fires when 10+ listings in a postcode are priced more than 7% below the rolling 90-day average. A "velocity spike" fires when days-on-market drops more than 30% week-over-week. All thresholds are configurable by the product team without touching the pipeline code.
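The two rules described above can be sketched directly against the normalised schema — listing dicts carry the `price` and `postcode_avg_price` fields, the thresholds mirror the defaults quoted above, and the function shapes are illustrative:

```python
from statistics import mean


def below_market_cluster(listings, min_count=10, discount_pct=7.0):
    """Fires when at least `min_count` listings in a postcode are priced
    more than `discount_pct` below the rolling 90-day postcode average."""
    flagged = [
        l for l in listings
        if l["price"] < l["postcode_avg_price"] * (1 - discount_pct / 100)
    ]
    return len(flagged) >= min_count, flagged


def velocity_spike(dom_this_week, dom_last_week, drop_pct=30.0):
    """Fires when average days-on-market drops more than `drop_pct`
    week-over-week."""
    if not dom_this_week or not dom_last_week:
        return False
    prev, curr = mean(dom_last_week), mean(dom_this_week)
    return prev > 0 and (prev - curr) / prev * 100 > drop_pct
```

Because the thresholds are plain keyword arguments, the product team can tune them from configuration without touching the pipeline code, exactly as described above.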
⚙️ Tech Stack
Technologies Used
Python 3.11
Playwright
Scrapy
Apache Airflow
PostgreSQL 15
TimescaleDB
Redis
spaCy (NLP)
Bright Data Proxies
AWS EC2 + S3
FastAPI
Docker + ECS
📅 Timeline
Pipeline Live in 11 Weeks
1
Week 1–2
Portal Audit & Architecture Design
Profiled all 8 portals — rendering engine, anti-bot measures, data structure, update frequency. Designed the normalised schema and deduplication strategy. Architecture reviewed and signed off.
2
Week 3–5
Tier 1 Scrapers — Rightmove & Zoopla
The two highest-volume scrapers built and stabilised. Proxy rotation, fingerprinting and adaptive pacing tuned. First 200K normalised listings in the database by end of week 5. Deduplication engine v1 live.
3
Week 6–8
Remaining 6 Sources + NLP Enrichment
OnTheMarket, PrimeLocation, SpareRoom, OpenRent, Gumtree and auction scrapers built. spaCy NLP pipeline for feature extraction from descriptions live. Price history delta tracking operational.
4
Week 9–10
Signal Engine + API Layer
Market signal detection engine built with configurable rules. FastAPI query layer live — product team can query any postcode, filter by signal type, pull price history time series. 18-month backfill completed.
5
Week 11
Monitoring, Alerting & Production Handover
Airflow DAG monitoring live with PagerDuty alerting for scraper failures. Dead-letter queue for failed pages. Runbook documented. Pipeline handed over — client's product team shipped their first dashboard feature the same week.
📈 Results
Half a Million Listings. Daily. Reliably.
The pipeline has run every day since launch without a single full-day outage. Individual portal scrapers occasionally encounter temporary blocks — the monitoring system detects these within 15 minutes and the dead-letter queue ensures no listings are permanently lost, just delayed.
500K+
Property listings ingested and normalised every day
99.2%
Pipeline uptime since launch — zero full-day outages
18mo
Historical data backfilled — giving the product a head start
The client launched their beta product six weeks after pipeline handover. The investment signal feature — which surfaces below-market listings and velocity hotspots — became the product's primary acquisition driver, with users sharing alerts with their networks. The pipeline is the product's moat.
The client has since commissioned a second phase to expand coverage to Ireland and Scotland, and to add rental yield calculations using Land Registry sold prices as a benchmark layer.