The Problem: A Monolith Masquerading as an API Business

The client had a commercially successful API product — real estate data sold to PropTech companies, agents and investment platforms. But underneath the commercial success was a technical time bomb: a single Node.js monolith that treated all tenants identically, had no usage metering, applied blanket rate limits shared across every customer, and had no developer portal or self-service tooling.

The consequences were piling up. A single heavy tenant could degrade response times for every other customer simultaneously. The billing team was manually reconciling API usage against invoices at month end because there was no reliable metering. New clients had to email the support team to get API keys. And the engineering team had no way to identify which endpoints were under load — they found out about problems when customers complained.

The rebuild brief had three non-negotiable requirements: complete tenant isolation so no customer can affect any other, real-time usage metering so billing is automatic and auditable, and a self-serve developer portal so the sales and support teams are never in the API onboarding loop again.

❌ Challenge
Shared rate limits — one high-volume tenant's burst traffic degraded response times for all other customers simultaneously
✓ Solution
Per-tenant rate limiting with token bucket algorithm in Redis — each tenant has an isolated quota with configurable burst allowance
❌ Challenge
No usage metering — billing team manually tallied API calls from logs at month end, taking 2+ days and producing errors
✓ Solution
Real-time usage metering via event stream — every call atomically incremented in Redis, persisted to TimescaleDB, feeds billing automatically
❌ Challenge
No developer portal — new customers emailed support for API keys, waited 1–2 days, and had nowhere to view their usage or read docs
✓ Solution
Full self-serve developer portal — sign up, generate keys, explore interactive docs, view real-time usage graphs and manage billing in one place
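
To make the per-tenant rate limiting solution above concrete, here is a minimal token bucket sketch. It is an illustration under stated assumptions, not the client's implementation: an in-memory Map stands in for Redis so the example runs on its own, and every name (`TenantRateLimiter`, `BucketConfig`, `tryConsume`) is hypothetical. In Redis, the refill-and-take step would need to run atomically, for example inside a Lua script.

```typescript
// Hypothetical sketch: an in-memory Map stands in for Redis.
interface Bucket {
  tokens: number;       // tokens currently available
  lastRefillMs: number; // when the bucket was last topped up
}

interface BucketConfig {
  capacity: number;        // maximum burst size
  refillPerSecond: number; // sustained request rate
}

class TenantRateLimiter {
  private buckets = new Map<string, Bucket>();
  private configFor: (tenantId: string) => BucketConfig;

  constructor(configFor: (tenantId: string) => BucketConfig) {
    this.configFor = configFor;
  }

  /** Returns true if the request is allowed, false if it should receive a 429. */
  tryConsume(tenantId: string, nowMs: number): boolean {
    const cfg = this.configFor(tenantId);
    // A tenant's first request starts with a full bucket.
    const bucket: Bucket =
      this.buckets.get(tenantId) ?? { tokens: cfg.capacity, lastRefillMs: nowMs };

    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSec = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(cfg.capacity, bucket.tokens + elapsedSec * cfg.refillPerSecond);
    bucket.lastRefillMs = nowMs;

    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(tenantId, bucket);
    return allowed;
  }
}
```

Because each tenant ID maps to its own bucket, one tenant draining its tokens leaves every other tenant's bucket untouched, which is the isolation property the brief asked for.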

Four Layers. Zero Single Points of Failure.

The platform is built in clearly separated layers — each with a single responsibility, independently scalable and replaceable without touching the others. The gateway layer handles auth, rate limiting and metering before a request ever touches business logic.

Client Layer
🌐REST API clients
🔗Webhook consumers
💻Developer portal
📱SDK users
API Gateway Layer — Auth · Rate Limiting · Metering · Routing
🔑API key validation
⏱️Token bucket rate limiter
📊Usage metering (Redis)
🔀Tenant-aware routing
Service Layer — Business Logic (isolated per service)
🏠Listings service
Signals service
🔍Enrichment service
📜History service
🪝Webhooks service
Data Layer
🗃️PostgreSQL (primary)
Redis (cache + rate limits)
📈TimescaleDB (metering)
📨SQS (async jobs)
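
The ordering inside the gateway layer can be sketched as a single function. This is a simplified illustration with hypothetical names (`GatewayDeps`, `handle`), written as plain functions rather than Fastify hooks so it stays self-contained. It assumes that only requests reaching a service are metered; whether throttled or unknown-path calls count toward usage is a billing decision the case study does not describe.

```typescript
interface GatewayRequest { apiKey: string; path: string; }
interface GatewayResponse { status: number; body: string; }

type Handler = (tenantId: string, req: GatewayRequest) => GatewayResponse;

// Each dependency corresponds to one gateway responsibility.
interface GatewayDeps {
  resolveTenant: (apiKey: string) => string | undefined; // API key validation
  allow: (tenantId: string) => boolean;                  // per-tenant rate limiter
  meter: (tenantId: string, path: string) => void;       // usage metering
  routes: Map<string, Handler>;                          // tenant-aware routing
}

function handle(deps: GatewayDeps, req: GatewayRequest): GatewayResponse {
  // 1. Authenticate: map the API key to a tenant, or reject.
  const tenantId = deps.resolveTenant(req.apiKey);
  if (tenantId === undefined) return { status: 401, body: "invalid API key" };

  // 2. Rate limit against this tenant's quota only.
  if (!deps.allow(tenantId)) return { status: 429, body: "rate limit exceeded" };

  // 3. Route to the owning service.
  const handler = deps.routes.get(req.path);
  if (handler === undefined) return { status: 404, body: "unknown endpoint" };

  // 4. Meter the call (assumption: only routable requests are billed).
  deps.meter(tenantId, req.path);

  // 5. Only now does the request touch business logic.
  return handler(tenantId, req);
}
```

Passing the dependencies in as functions keeps each concern replaceable on its own, mirroring the layer separation described above.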

Per-Tenant Isolation, Exposed to Developers

Every API response includes rate limit headers — so developers always know exactly where they stand without needing to check a dashboard. The token bucket algorithm means burst traffic is handled gracefully rather than immediately throttled, and the 429 response includes a Retry-After header so well-behaved clients can back off correctly.

HTTP Response Headers — every API response
HTTP/2 200 OK
X-RateLimit-Tenant: acme-corp
X-RateLimit-Plan: enterprise
X-RateLimit-Limit: 1000000            // daily quota
X-RateLimit-Remaining: 289543         // calls left today
X-RateLimit-Reset: 1740960000         // unix timestamp of quota reset
X-RateLimit-BurstRemaining: 847       // burst tokens left (per-minute bucket)
X-Request-ID: req_4f8a2c91d3e7b6      // traceable in logs
X-Response-Time: 38ms
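
On the client side, a well-behaved consumer might turn these headers into a wait time along the following lines. This is a sketch, not part of any published SDK: `retryDelayMs` is a hypothetical helper, header names are assumed to arrive lowercased (as Node.js delivers them), and the one-second default is an arbitrary assumption. It accepts both standard forms of Retry-After (a number of seconds or an HTTP date) and falls back to X-RateLimit-Reset.

```typescript
type ResponseHeaders = Record<string, string | undefined>;

/** Milliseconds to wait before retrying, or 0 if no wait is needed. */
function retryDelayMs(status: number, headers: ResponseHeaders, nowMs: number): number {
  if (status !== 429) return 0; // only throttled responses require a wait

  const retryAfter = headers["retry-after"];
  if (retryAfter !== undefined) {
    const value = retryAfter.trim();
    // Form 1: delay expressed in whole seconds.
    if (/^\d+$/.test(value)) return Number(value) * 1000;
    // Form 2: an HTTP date to wait until.
    const dateMs = Date.parse(value);
    if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs);
  }

  // Fallback: wait until the advertised quota reset (unix seconds).
  const resetSec = Number(headers["x-ratelimit-reset"]);
  if (Number.isFinite(resetSec) && resetSec > 0) {
    return Math.max(0, resetSec * 1000 - nowMs);
  }

  return 1000; // assumed default when the response carries no hint
}
```

A caller would sleep for the returned duration before retrying, so it never has to guess when the quota frees up.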

Everything a Commercial API Business Needs

🔑
Self-Serve Developer Portal
Developers sign up, generate API keys, explore interactive OpenAPI docs and start making calls — all without touching the support team.
⏱️
Per-Tenant Rate Limiting
Token bucket algorithm in Redis. Each tenant has independent daily quotas and per-minute burst limits. One tenant's usage never affects another.
📊
Real-Time Usage Metering
Every call metered atomically. Tenants see live usage graphs in their dashboard. Billing integrates directly with metering data — no manual reconciliation.
🪝
Webhook Delivery System
Tenants subscribe to data events via webhook. Guaranteed delivery with exponential backoff retry. Dead-letter queue for persistent failures. Full delivery log per tenant.
📈
Admin Operations Console
Internal team view — per-tenant usage, latency percentiles, error rates, quota management and billing controls. PagerDuty alerting on anomalies.
🛡️
Versioned API with Deprecation
Full API versioning (v1, v2) with 6-month deprecation notices. Migration guides auto-generated per tenant. Breaking changes never surprise a customer.
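
The webhook retry behaviour described in the feature list can be sketched as follows. The policy values, the `deliver` function and the `DeliveryOutcome` type are hypothetical; the case study does not state the actual delays or attempt limits. The sketch doubles the delay after each failure up to a cap and reports "dead-lettered" once attempts run out, at which point a real system would push the event to the dead-letter queue. Jitter, which production systems often add, is left out for clarity.

```typescript
interface RetryPolicy {
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // cap on any single delay
  maxAttempts: number; // total attempts, including the first
}

type DeliveryOutcome =
  | { kind: "delivered"; attempts: number }
  | { kind: "dead-lettered"; attempts: number };

/** Delay before retry number `retry` (1-based): doubles each time, up to the cap. */
function backoffDelayMs(policy: RetryPolicy, retry: number): number {
  return Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, retry - 1));
}

async function deliver(
  send: () => Promise<boolean>,         // resolves true when the subscriber accepts
  sleep: (ms: number) => Promise<void>, // injected so tests need not actually wait
  policy: RetryPolicy,
): Promise<DeliveryOutcome> {
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    if (await send()) return { kind: "delivered", attempts: attempt };
    // Wait only if another attempt remains.
    if (attempt < policy.maxAttempts) await sleep(backoffDelayMs(policy, attempt));
  }
  return { kind: "dead-lettered", attempts: policy.maxAttempts };
}
```

With a base delay of 1 second and a cap of 60 seconds, the waits between attempts would run 1s, 2s, 4s, 8s and so on until they reach the cap.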

Technologies Used

Node.js / Fastify · TypeScript · PostgreSQL 15 · TimescaleDB · Redis (rate limits) · AWS SQS · AWS ECS (Fargate) · AWS CloudFront · Stripe (billing) · OpenAPI 3.1 · PagerDuty · Datadog APM

Rebuilt Without Breaking Existing Customers

1
Week 1–2
Architecture Audit & Migration Plan
Audited the existing monolith — traffic patterns, tenant distribution, endpoint usage. Designed the new architecture and the zero-downtime migration strategy. Existing customers couldn't be disrupted at any point.
2
Week 3–5
Gateway Layer — Auth, Rate Limiting & Metering
The gateway was built and deployed in shadow mode first — processing all real traffic in parallel with the old system but not yet serving responses. Metering validated against old log counts. Rate limiting logic stress-tested.
3
Week 6–9
Service Layer & Data Migration
Five business logic services extracted from the monolith and redeployed as independent services. Database schema migrated with zero downtime using expand-contract pattern. 80 tenants migrated in batches of 10.
4
Week 10–12
Developer Portal & Webhook System
Self-serve developer portal built — API key management, interactive docs, real-time usage dashboard, billing integration. Webhook delivery system with retry logic and dead-letter queue delivered.
5
Week 13–14
Full Cutover, Monitoring & Handover
Final traffic cutover from old monolith to new platform. Datadog APM and PagerDuty alerts configured. Old system decommissioned after 2-week parallel run. Zero tenant disruptions throughout the entire migration.
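
Two mechanical pieces of the migration timeline above lend themselves to a short sketch: splitting tenants into batches, and checking shadow-mode metering against legacy log counts. Both helpers (`toBatches`, `meteringMismatches`) are hypothetical illustrations, and the comparison assumes per-tenant call counts are available from both systems as simple maps, which the case study does not detail.

```typescript
/** Split items into consecutive batches of at most `size`. */
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/** Tenants whose shadow-metered call count differs from the legacy log count. */
function meteringMismatches(
  legacy: Map<string, number>,
  shadow: Map<string, number>,
): string[] {
  // Consider every tenant seen by either system; a missing entry counts as 0.
  const tenants = new Set<string>(
    Array.from(legacy.keys()).concat(Array.from(shadow.keys())),
  );
  return Array.from(tenants)
    .filter((t) => (legacy.get(t) ?? 0) !== (shadow.get(t) ?? 0))
    .sort();
}
```

With 80 tenants and a batch size of 10, `toBatches` yields 8 batches. One plausible gating rule, offered here as an assumption, is to start the next batch only when `meteringMismatches` returns an empty list for the tenants already moved.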

1M+ Calls Daily. Zero Tenant Complaints.

The migration was completed without a single customer-facing incident. The key was the "shadow mode" approach of running the new gateway against real traffic in parallel before cutting over: the edge cases that surfaced were caught and resolved before the new system served a single live response.

99.95%
Uptime SLA achieved and maintained since launch
42ms
Median response latency — down from 180ms on the old monolith
0
Tenant disruptions during the full 14-week migration

The commercial impact was immediate. Self-serve onboarding cut average time-to-first-API-call from 2 days to 8 minutes. The billing team's monthly reconciliation task — previously 2 days of manual work — was replaced entirely by automated metering. The sales team could close deals and have the customer calling the API the same afternoon.

The platform currently handles peak loads of around 2,200 requests per minute without degradation, with headroom for 10× that load before infrastructure changes are needed. The client has since onboarded 30 new tenants that it says would have been commercially impossible to support on the old system.