Rate limiting / quota enforcement

Token bucket + distributed counter. The bucket math is trivial; the distribution, identity, and failure posture are the real interview.

A fair rate limit protects users from each other; a good one protects the system from attackers. A public API without rate limits is not a product — it is a target.

Architecture diagram· Layered rate limiting spine

Three enforcement layers — edge (per-IP), gateway (per-key), app (per-user/path) — backed by a shared counter store.

You’re looking at this pattern when

Public API with tiered plans (free, paid, enterprise)
Anti-abuse on write-heavy or expensive endpoints
Per-tenant fairness in a multi-tenant system

Shows up in

API gateway quotas (Stripe, GitHub, Twilio)
Login / signup flood control

What most people get wrong

Internal service-to-service where every caller is trusted (use backpressure / load shedding instead)

When to reach for this

Reach for this when…

Public API with tiered plans (free, paid, enterprise)
Anti-abuse on write-heavy or expensive endpoints
Per-tenant fairness in a multi-tenant system
Protection against volumetric / credential-stuffing attacks
Interviewer says "how do you stop one client taking down the system?"

Not really this pattern when…

Internal service-to-service where every caller is trusted (use backpressure / load shedding instead)
Predictable fixed-capacity batch jobs (schedule them; don't throttle them)
A single client whose burst you actually want to absorb (use a queue, not a 429)

Good vs bad answer

Interviewer probe

“How do you rate-limit your public API?”

Weak answer

"1000 requests per minute per IP, return a 429 when they go over."

Strong answer

"Token bucket per API key — burst 100, sustained 10/s for free tier, 10× for paid. State in Redis as a hash with the refill-and-spend in a Lua script so it's atomic and globally consistent across gateways. Edge CDN adds a per-IP volumetric layer at ~1000/min as a DDoS safety net only — I won't make IP the primary axis because corporate NATs would punish whole offices. Expensive endpoints like /export and /search get their own per-path buckets so cheap traffic can't starve them. Response is 429 with Retry-After and RateLimit-Limit/-Remaining/-Reset so clients back off with exponential jitter. If Redis is unreachable I fail open to a conservative local bucket — I won't DoS myself to protect a quota. The hot-key risk (a celebrity key overloading one shard) I'd handle with local pre-limiting at the gateway and a dedicated quota shard for known-hot tenants."

Why it wins: Names all four axes — algorithm, identity, storage, failure posture — plus per-path budgets, the response contract, and the hot-key mitigation. The weak answer picks the one identity (IP) that breaks on NATs and ignores distribution and outage behaviour.

Cheat sheet

• Four axes: algorithm, identity, storage, failure posture. Name all four.
• Default algorithm: token bucket. Burst B + sustained R are separate knobs.
• Sliding-window counter for cheap O(1) edge limits; fixed window never.
• Identity: per-IP (edge net) < per-key (gateway) < per-user/per-path (app). Layer them.
• Per-IP punishes NATs: safety net only, never the primary axis.
• Storage: Redis INCR with the expiry set in the same atomic unit (MULTI/EXEC or Lua), or a Lua refill-and-spend script. No read-modify-write race.
• Shard by identity hash; isolate hot tenants on dedicated quota shards.
• Response: 429 + Retry-After + RateLimit-Limit/-Remaining/-Reset. Always.
• Clients: exponential backoff with jitter, never lockstep retries.
• Fail-open (or local fallback) for abuse limits; the limiter must never be the outage.
• Charge tokens by cost for variable-cost endpoints (GitHub GraphQL point costs).
• Concurrency limit (in-flight cap) for slow expensive endpoints, alongside the rate limit.

Core concept

Rate limiting answers two questions at once: "how fast may this client go?" (capacity / fairness) and "is this client abusing us?" (anti-abuse). Every strong answer decomposes the problem into three axes — algorithm, identity, and storage — and then adds a fourth that juniors forget: failure posture.

Architecture diagram· Layered rate limiting spine

Three enforcement layers — edge (per-IP), gateway (per-key), app (per-user/path) — backed by a shared counter store.

Algorithm — pick by burst tolerance.

Token bucket is the default. A bucket holds up to B tokens and refills at R per second; each request spends one token. It permits bursts up to B and sustains R/s. Two numbers per identity capture the whole state: token count and last-refill timestamp.
Leaky bucket smooths output to a constant R/s — use it when a downstream needs a steady drip rather than bursts.
Sliding-window counter approximates a rolling window with O(1) storage and avoids the fixed-window doubling bug.
Fixed window is the simplest, but a client can send 2B (twice the per-window limit B) in a short span straddling a window boundary. Use it only for coarse guardrails.

Architecture diagram· Token bucket mechanics

Bucket holds up to B tokens, refilled at R/s. Each request spends one. Empty bucket → reject.

Identity — layer them, never rely on one.

Per-IP at the edge is a volumetric safety net, but corporate and carrier NATs hide thousands of users behind one address, so it is a net, not an identity.
Per-API-key at the gateway is the right axis for authenticated public APIs and plan quotas.
Per-user-id at the app enforces fairness between users that share a single tenant key.
Per-path at the app gives expensive endpoints (exports, search, inference) their own budget so a cheap endpoint's traffic can't starve them.

Architecture diagram· Defence in depth across layers

Each layer catches a distinct abuse class. One layer failing does not expose the origin.

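Storage: a global limit needs a shared counter. A single logical limit enforced across a fleet of gateways requires shared state. The default is Redis, using either an INCR per (identity, window) with the expiry set in the same atomic unit (a MULTI/EXEC transaction or a Lua script, so a crash between the two commands cannot leave a counter that never expires), or a Lua script that implements token-bucket refill atomically. At one round-trip per request, a single Redis shard can typically serve on the order of 100k req/s, depending on hardware and script cost; shard by identity hash beyond that.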

Architecture diagram· Shared counter across gateway fleet

N gateways enforce one global limit by sharing a Redis counter keyed by (identity, window).

Failure posture — the limiter must never become the outage. If the counter store is unreachable, decide deliberately: fail-open (or fall back to a conservative local bucket) for abuse guardrails, because dropping all traffic to protect against a hypothetical flood is self-inflicted denial of service; fail-closed only for hard commercial quotas where over-serving has real cost. Name the difference — it shows you understand the business meaning of the limit, not just the algorithm.

Architecture diagram· Local fallback when the counter store is unreachable

On Redis failure, gateways fall back to conservative local buckets so the limiter never becomes the outage.

Response contract: non-optional. Every throttled request returns 429 Too Many Requests with Retry-After and the RateLimit-Limit / RateLimit-Remaining / RateLimit-Reset headers (from the IETF Internet-Draft on RateLimit header fields, which is still a draft rather than a ratified standard). Well-behaved clients read these and back off; without them, clients retry blindly and amplify the very overload you are defending against.

Interview walkthrough

Worked example: rate limiting for a public API with free and paid plans

Scenario. A public REST API with three tiers: free (60 req/min), paid (1,000 req/min), enterprise (custom). Endpoints range from a trivial GET /me to an expensive POST /reports/export that fans out across the database. Abuse history: credential-stuffing on POST /login and scrapers hammering GET /search.

Architecture diagram· Layered rate limiting spine

Three enforcement layers — edge (per-IP), gateway (per-key), app (per-user/path) — backed by a shared counter store.

Step 1 — separate abuse from quota. Two different problems wearing the same hat. Plan limits (60 / 1,000 req/min) are a quota keyed by API key. Credential stuffing and scraping are abuse, better caught by per-IP edge limits and per-path app limits. Don't conflate them into one number.

Step 2 — edge layer (volumetric). At the CDN, a per-IP sliding-window limit at, say, 2,000 req/min — generous, because it must not punish a NAT'd office, but low enough to blunt a single-host flood. Fails open: if the edge limiter has trouble, traffic passes to the gateway, which still enforces the plan.

Step 3: gateway layer (plan quota). Token bucket per API key in Redis. Free: B=60, R=1/s (60/min). Paid: B=200, R≈16.7/s (1,000/min). Enterprise: per-contract. Refill-and-spend in a Lua script for atomicity. Every response carries RateLimit-Remaining so integrators self-throttle. Over limit → 429 + Retry-After.

Step 4 — app layer (expensive paths + auth abuse). Per-path budgets: POST /reports/export gets a tight bucket (B=5, R=1/min) and a concurrency cap of 2 in-flight per tenant, because each export is heavy. POST /login gets a per-IP-plus-username bucket (e.g. 5 attempts / 15 min) to throttle credential stuffing without locking out a whole NAT.

Step 5 — cost-weighting. For GET /search, charge tokens proportional to result-set fan-out so a wildcard query costs more than a keyed lookup — aligning the limit with real work rather than request count.

Step 6: failure posture. Redis down? Plan and abuse limits fall back to conservative local buckets per gateway and reconcile on recovery; the API stays up. The only path that leans towards failing closed is metered enterprise overage billing, where over-serving has direct cost, and even there the fallback is a conservative local cap rather than hard rejection.

Step 7 — observability. Emit metrics for 429 rate per tier and per endpoint, false-positive signals (sudden 429 spikes for paid users), Redis latency on the limiter path, and top throttled identities. A rising paid-tier 429 rate is an incident, not a success.

Result. Free users get a fair, predictable budget; paid users rarely see a 429; /login and /export are protected without collateral damage to legitimate offices; and a Redis blip degrades to slightly looser local enforcement instead of an outage.

Interview playbook

5-7 minutes in a 45-minute round: 1 min on the four axes, 2 min on algorithm + layered identity, 2 min on distribution and failure posture, 1-2 min on the response contract and hot-key handling.

When it comes up

The prompt has public APIs, abusive clients, tenants, or paid plan tiers
A shared multi-tenant system must protect users from each other
Requests differ in cost by endpoint or by plan
Interviewer asks "what stops one client overwhelming the system?"

Order of reveal

1. Name the four axes. Rate limiting is algorithm, identity, storage, and failure posture. Most candidates name the first three and forget the fourth.
2. Pick token bucket. Token bucket is my default: independent burst and sustained knobs match how real clients behave (bursty, then idle).
3. Layer the identities. Per-IP at the edge as a volumetric net, per-API-key at the gateway for plan quota, per-user and per-path at the app for fairness and expensive endpoints.
4. Distribute the counter. Shared Redis with an atomic Lua refill-and-spend for global consistency; shard by identity hash and isolate hot tenants.
5. State the response contract. 429 with Retry-After and RateLimit-* headers; clients use exponential backoff with jitter.
6. Define the failure posture. Abuse guardrails fail open or to a local bucket; commercial quotas may fail conservative. The limiter must never be the outage.

Signature phrases

“Per-IP is a safety net, not an identity” — Shows you know NATs hide many users behind one address.
“Burst and sustained are separate knobs” — Demonstrates token-bucket understanding over a flat rate.
“The limiter cannot become the outage” — Names the fail-open / local-fallback insight that juniors miss.
“Charge tokens by cost, not by count” — Connects the limit to real server work for expensive endpoints.
“Atomic refill or you have a race” — Shows awareness of the read-modify-write counter bug.

Likely follow-ups

“What if one API key is extremely hot?”

Shard the counter by identity hash so keys spread across Redis shards, add gateway-local pre-limiting so most requests are decided in-process and only reconcile with Redis periodically, and for known-hot tenants give them a dedicated quota shard so their load is isolated from everyone else.

“What if Redis is down?”

Depends on the limit class. Abuse guardrails fail open or fall back to a conservative per-gateway local bucket — I will not drop all traffic to defend against a hypothetical flood. Billing quotas use a conservative local cap and reconcile on recovery. The hot-path Redis call always has a tight timeout so a slow Redis cannot block requests or exhaust the connection pool.

“How do you handle expensive vs cheap endpoints under one budget?”

Give expensive endpoints their own per-path bucket and, if they are long-running, a concurrency cap on in-flight requests. For variable-cost endpoints like search or GraphQL, charge tokens proportional to the work — GitHub does this with query point costs — so one expensive call consumes more budget than one cheap call.

“How do clients know how to behave?”

The RateLimit-Limit/-Remaining/-Reset headers let them self-throttle before hitting the wall, and Retry-After tells them exactly how long to wait after a 429. Clients should implement exponential backoff with jitter so a fleet of throttled clients does not retry in lockstep and create a thundering herd.

Canonical examples

→API gateway quotas (Stripe, GitHub, Twilio)
→Login / signup flood control
→Email and SMS send-rate caps
→Per-tenant request budgets in SaaS
→Public chatbot / LLM inference APIs

Variants

Per-key token bucket in Redis

Atomic refill in a Lua script per (api_key, bucket), one round-trip per request.

Architecture diagram· Token bucket mechanics

Bucket holds up to B tokens, refilled at R/s. Each request spends one. Empty bucket → reject.

The default for authenticated public APIs. State per identity is two fields — current token count and last-refill timestamp — stored in a Redis hash. A Lua script runs the refill-and-spend atomically so concurrent requests on the same key can't both read a stale count and over-spend:

1. Read tokens and last_refill.
2. Add (now - last_refill) × R tokens, capped at B.
3. If tokens >= 1, decrement and allow; else reject.
4. Write back tokens and last_refill with a TTL so idle keys expire.

This is globally consistent: every gateway sees the same bucket. The cost is one Redis round-trip (~0.5–1 ms) per request and one hot key per identity. It scales to millions of keys because each key is tiny and idle keys self-expire.
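
The four steps above can be sketched in-process as follows. This is a minimal illustration, not the Lua script itself: the class name `TokenBucket` and the injectable `clock` parameter are assumptions made for the example, and step 4 (the TTL) is left to the store because it only matters once state lives in Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket: capacity B, refill rate R tokens per second."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity          # B: maximum burst size
        self.refill_rate = refill_rate    # R: tokens added per second
        self.clock = clock                # injectable so tests can control time
        self.tokens = capacity            # start full so a new client may burst
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Steps 1-2: read state, add elapsed * R tokens, cap at B.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # Step 3: spend if enough tokens remain, otherwise reject.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In the shared-store version the same arithmetic would run inside the Lua script, ideally reading the time from the Redis server rather than from each gateway so that clock skew between gateways cannot mint extra tokens.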

Pros

+Globally consistent across the whole gateway fleet
+Atomic refill via Lua — no read-modify-write race
+Cheap per request (~1 ms) and scales to millions of keys

Cons

−Redis sits on the hot path — needs its own HA and a fallback
−A celebrity / abusive key becomes a hot shard

Choose this variant when

Public authenticated APIs with plan quotas
You need cross-gateway consistency for billing-grade limits

Local counter with periodic sync

Each gateway enforces its own bucket and reconciles to a shared store every few seconds.

Architecture diagram· Local fallback when the counter store is unreachable

On Redis failure, gateways fall back to conservative local buckets so the limiter never becomes the outage.

When per-request Redis round-trips are too expensive (very high QPS edges, or latency budgets in the single-digit milliseconds), give each gateway a local bucket sized to its share of the global limit and reconcile periodically. Worst-case over-admission is roughly gateway_count × local_slack during a sync interval — acceptable for DDoS-scale guardrails, not for precise billing quotas.
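
As a rough worked example of that bound, the sketch below assumes an even split of the global limit across gateways and reads `local_slack` as the extra requests each gateway may admit beyond its share before the next sync. Both function names and the numbers in the usage note are hypothetical.

```python
def local_share(global_limit: int, gateway_count: int) -> int:
    """Even split of a global per-window limit across gateways (rounded down)."""
    return global_limit // gateway_count


def worst_case_admitted(global_limit: int, gateway_count: int, local_slack: int) -> int:
    """Upper bound on requests admitted fleet-wide within one sync interval."""
    per_gateway = local_share(global_limit, gateway_count) + local_slack
    return per_gateway * gateway_count


def worst_case_over_admission(global_limit: int, gateway_count: int, local_slack: int) -> int:
    """How far the fleet can exceed the evenly split budget: gateway_count * local_slack."""
    evenly_split_budget = local_share(global_limit, gateway_count) * gateway_count
    return worst_case_admitted(global_limit, gateway_count, local_slack) - evenly_split_budget
```

With a global limit of 10,000 per window, 20 gateways, and a slack of 50, each gateway may admit 550, so the fleet can admit 11,000: an over-admission of 1,000, or 10%.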

A common refinement is the "approximated global" scheme used by Stripe and others: each node keeps a local view, periodically pushes its consumption to Redis, and pulls back the global total to adjust its local allowance. Convergence is eventual but the per-request path stays in-process.

Pros

+Zero per-request external call — lowest possible latency
+Survives a counter-store outage by design

Cons

−Over-limit by up to (gateway_count × local_slack) in a window
−Not suitable for precise commercial quotas

Choose this variant when

Very high QPS edge where a Redis hop per request is too costly
Approximate enforcement is acceptable (abuse guardrails)

Layered (edge + gateway + app)

Per-IP at the CDN, per-key at the gateway, per-user-and-path at the app.

Architecture diagram· Defence in depth across layers

Each layer catches a distinct abuse class. One layer failing does not expose the origin.

Defence in depth. Each layer targets a different abuse class, and the failure of any single layer does not expose the origin:

Edge (per-IP / ASN): absorbs volumetric floods and trivial bots cheaply, before traffic reaches your fleet.
Gateway (per-API-key / tenant): enforces plan quotas and tenant fairness.
App (per-user + per-path): protects expensive endpoints and enforces fairness between users inside a shared tenant key.

The subtlety is coordinating the limits so a legitimate user can't trip two layers for one logical action (which produces confusing, hard-to-debug 429s). Document the budget at each layer and keep the edge limit generously above the gateway limit.

Pros

+Covers volumetric, plan-abuse, and hot-endpoint abuse separately
+Failure of one layer still leaves the others enforcing

Cons

−Operational complexity — three configs to keep consistent
−Mis-tuned layers cause false-positive 429s for real users

Choose this variant when

Production public APIs exposed to the internet
Systems with a history of abuse or scraping

Concurrency limit (in-flight cap)

Cap simultaneous in-flight requests per identity instead of a rate over time.

Some resources are bounded by concurrency, not throughput — a tenant running 50 simultaneous report exports can exhaust worker pools even at a modest request rate. A concurrency limiter tracks in-flight requests per identity (a semaphore in Redis: INCR on entry, DECR on completion, with a safety TTL so crashed requests don't leak permits).

This composes with a token bucket: the bucket governs how often you may start work, the semaphore governs how much may run at once. For expensive, long-running endpoints (LLM inference, video transcode kickoff), the concurrency cap is often the more important of the two.
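
A minimal in-process sketch of the in-flight cap follows. Instead of a single counter with one TTL, it tracks each permit as a lease with its own expiry, which is one way to make sure a crashed request frees its slot without resetting everyone else's. The class name `ConcurrencyLimiter`, the `lease_ttl` parameter, and the injectable `clock` are assumptions for the example; it is not thread-safe as written.

```python
import time
import uuid
from typing import Callable, Dict, Optional


class ConcurrencyLimiter:
    """Caps in-flight requests per identity; permits expire after lease_ttl seconds."""

    def __init__(self, max_in_flight: int, lease_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_in_flight = max_in_flight
        self.lease_ttl = lease_ttl
        self.clock = clock
        # identity -> {permit_id: expiry_time}
        self._leases: Dict[str, Dict[str, float]] = {}

    def acquire(self, identity: str) -> Optional[str]:
        """Return a permit id if a slot is free, otherwise None."""
        now = self.clock()
        leases = self._leases.setdefault(identity, {})
        # Safety TTL: drop permits whose holders never released them.
        for permit_id in [p for p, expiry in leases.items() if expiry <= now]:
            del leases[permit_id]
        if len(leases) >= self.max_in_flight:
            return None
        permit_id = uuid.uuid4().hex
        leases[permit_id] = now + self.lease_ttl
        return permit_id

    def release(self, identity: str, permit_id: str) -> None:
        """Free a slot on normal completion; unknown permits are ignored."""
        self._leases.get(identity, {}).pop(permit_id, None)
```

The lease TTL should comfortably exceed the longest legitimate request, otherwise a slow but healthy export would lose its permit while still running.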

Pros

+Directly protects bounded resources (worker pools, DB connections)
+Catches slow, expensive requests that a rate limit misses

Cons

−Leaked permits if you forget the safety TTL on crash
−Harder for clients to reason about than a simple rate

Choose this variant when

Expensive, long-running endpoints (exports, inference, transcode)
Downstream has a hard concurrency ceiling (connection pools)

Scaling path

v1 — single-instance in-memory bucket

Stop the obvious abuse with the simplest thing that works.

One app instance, one in-process token bucket per identity in a local map. Zero infrastructure, sub-microsecond checks. This is genuinely the right answer when you have a single instance or when each instance owns a disjoint set of clients.

Architecture diagram· Token bucket mechanics

Bucket holds up to B tokens, refilled at R/s. Each request spends one. Empty bucket → reject.

The honest caveat: the moment you run two instances behind a load balancer, each enforces the limit independently, so the effective global limit is instance_count × local_limit. That is the trigger to move to v2.

What triggers the next iteration

Multiple instances each enforce locally — global limit is N× too loose
State lost on restart — buckets reset, briefly allowing bursts
No cross-instance view of an abusive client

v2 — shared Redis counter

Enforce one global limit across the whole fleet.

Move bucket state into Redis. Each request runs an atomic refill-and-spend Lua script keyed by (identity, bucket). Now every gateway sees the same bucket and the global limit is exact.

Architecture diagram· Shared counter across gateway fleet

N gateways enforce one global limit by sharing a Redis counter keyed by (identity, window).

Add the response contract here: 429 + Retry-After + RateLimit-* headers. The cost is one Redis round-trip per request and a hard dependency on Redis availability — which sets up the next two steps.

What triggers the next iteration

Redis is now on the hot path — its outage threatens every request
A single hot identity hammers one Redis shard
Per-request round-trip adds ~1 ms latency

v3 — sharding + local fallback

Survive hot keys and counter-store outages.

Shard the counter store by identity hash so no single shard owns all the hot keys. Add a local-fallback path: if Redis times out, gateways switch to a conservative in-process bucket rather than failing every request.

Architecture diagram· Local fallback when the counter store is unreachable

On Redis failure, gateways fall back to conservative local buckets so the limiter never becomes the outage.

Decide the failure posture explicitly per limit class: abuse guardrails fail open (or local), commercial quotas may fail closed. This is the difference between "we briefly over-served during a Redis blip" and "we took the whole API down to protect a quota."

What triggers the next iteration

Local fallback admits more traffic than the global limit during outages
Hot tenants still need dedicated quota shards
Coordinating fail-open vs fail-closed across limit classes

v4 — layered defence in depth

Separate volumetric, plan, and endpoint abuse into independent layers.

Add edge per-IP limiting at the CDN for volumetric floods and per-path limits at the app for expensive endpoints. Each layer is independently tuned and independently fails. The gateway per-key layer handles plan quotas in the middle.

Architecture diagram· Defence in depth across layers

Each layer catches a distinct abuse class. One layer failing does not expose the origin.

This is the mature shape for a public internet-facing API. The remaining work is continuous tuning: watching false-positive 429 rates, adjusting per-tenant overrides, and feeding abusive-IP signals to the edge.

What triggers the next iteration

Mis-tuned layers produce confusing double-throttling
Per-tenant overrides multiply configuration surface
Edge IP signals need a feedback loop from app-level abuse detection

Deep dives

Token bucket vs sliding window — when each wins

Architecture diagram· Sliding-window counter vs fixed window

Fixed window allows 2B at the boundary; sliding window weights the previous window to smooth it.

The four classic algorithms differ only in how they treat bursts and how much state they need.

Fixed window counts requests per calendar window (e.g. per minute) and resets at the boundary. It needs one integer per identity but has a fatal edge: a client can send B requests in the last second of one window and B in the first second of the next — 2B in two seconds, double the intended rate.

Sliding-window counter fixes this cheaply: keep the current and previous window counts and compute a weighted estimate — current + previous × (overlap_fraction). It needs two integers and a multiply, has no boundary doubling, and is accurate to within a few percent. This is what Cloudflare popularised for edge rate limiting because it is O(1) and approximate-but-good-enough.
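
The weighted estimate can be sketched as below. This is an illustrative in-process version under stated assumptions: the class name `SlidingWindowCounter` and the injectable `clock` are made up for the example, and the admission rule (reject when the estimate plus this request would exceed the limit) is one reasonable choice among several.

```python
import time
from typing import Callable, Optional


class SlidingWindowCounter:
    """Approximates a rolling window from the current and previous fixed-window counts."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.current_window: Optional[int] = None
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = self.clock()
        window = int(now // self.window_seconds)
        if self.current_window is None:
            self.current_window = window
        if window != self.current_window:
            # The old count only carries over if exactly one window has passed.
            carried = self.current_count if window == self.current_window + 1 else 0
            self.previous_count = carried
            self.current_count = 0
            self.current_window = window
        # Fraction of the rolling window that still overlaps the previous fixed window.
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        overlap_fraction = 1.0 - elapsed_fraction
        estimate = self.current_count + self.previous_count * overlap_fraction
        if estimate + 1 > self.limit:
            return False
        self.current_count += 1
        return True
```

With a limit of 10 per 60 s, a client that spends all 10 at second 59 is rejected at second 61 (the estimate is still about 9.8), whereas a fixed window would have handed it a fresh 10.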

Token bucket is the most flexible: B and R are independent knobs, so you can allow a generous burst (B = 100) on top of a modest sustained rate (R = 10/s). State is two fields. It is the right default for API quotas because real clients are bursty — a page load fires ten requests at once, then nothing for a second.

Leaky bucket is token bucket's mirror: it enforces a constant output rate by draining a queue at R/s. Reach for it only when a downstream genuinely needs a smooth drip (e.g. a third-party API with its own strict steady-rate limit you must not exceed).

The interview-grade summary: token bucket for API quotas, sliding window for cheap edge volumetric limits, leaky bucket for smoothing into a rate-sensitive downstream, fixed window never (except as a crude guardrail).

Identity choice decides your false-positive rate

Architecture diagram· Defence in depth across layers

Each layer catches a distinct abuse class. One layer failing does not expose the origin.

The single most common production incident with rate limiting is throttling legitimate users — and the cause is almost always the wrong identity axis.

Per-IP is cheap and available pre-auth, which makes it tempting as the primary axis. But a corporate office, a university, or a mobile carrier can put thousands of distinct users behind a single NAT'd IP. Rate-limit that IP and you deny service to an entire building. Per-IP belongs at the edge as a volumetric safety net with a generous threshold, not as the primary fairness mechanism.

Per-API-key is the right primary axis for public APIs: it maps to a billing account and a plan tier. But a single key can represent a whole company with many end users, so within a key you may still need per-user fairness.

Per-user-id is the fairest axis but is only available after authentication. Use it at the app layer to stop one user inside a shared tenant from starving the others.

Per-path is orthogonal to all of the above: GET /search and POST /export have wildly different costs, so they deserve separate budgets even for the same identity. Without it, a flood of cheap requests can exhaust the budget an expensive endpoint needed.

Strong designs layer these and tune each independently. The edge blocks obvious floods, the gateway enforces the plan, and the app enforces fairness and protects expensive paths.

Distributed enforcement and the hot-key problem

Architecture diagram· Shared counter across gateway fleet

N gateways enforce one global limit by sharing a Redis counter keyed by (identity, window).

A global limit enforced across a fleet needs shared state, and the default is Redis. The naive approach (GET, check, INCR) has a read-modify-write race: two concurrent requests both read count = 99 against a limit of 100 and both proceed. The fix is an atomic server-side operation. For window counters, INCR first and compare the value it returns against the limit, setting PEXPIRE in the same MULTI/EXEC transaction or Lua script so the key always gets its expiry. For token buckets, use a Lua script that does the whole refill-and-spend in one round-trip so no interleaving is possible.
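
The race is easy to see in a deterministic simulation. In the sketch below a plain dictionary stands in for Redis and the interleaving of two gateways is written out by hand; the function names and the `LIMIT` constant are illustrative only.

```python
from typing import Dict

LIMIT = 100


def naive_interleaved(store: Dict[str, int], key: str) -> int:
    """Two gateways run GET, check, INCR with their steps interleaved."""
    seen_a = store.get(key, 0)   # gateway A: GET
    seen_b = store.get(key, 0)   # gateway B: GET, before A has written
    admitted = 0
    for seen in (seen_a, seen_b):
        if seen < LIMIT:         # both checks pass on the same stale read
            store[key] = store.get(key, 0) + 1
            admitted += 1
    return admitted


def atomic_incr_then_check(store: Dict[str, int], key: str) -> int:
    """Each gateway increments first, then compares the value the increment returned."""
    admitted = 0
    for _ in range(2):
        store[key] = store.get(key, 0) + 1   # stands in for one atomic INCR
        if store[key] <= LIMIT:
            admitted += 1
    return admitted
```

Starting from a count of 99, the naive version admits both requests and the atomic version admits one. The atomic counter still overshoots to 101, but the overshoot only records a rejected attempt.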

The hot-key problem. A celebrity API key, a viral tenant, or an attacker hammering one identity concentrates all traffic on the single Redis shard that owns that key. Mitigations, in order of escalation:

1. Shard by identity hash so different identities land on different shards (helps the aggregate, not a single hot identity).
2. Local pre-limiting at the gateway: each gateway enforces a fraction of the global limit locally and only consults Redis to reconcile, cutting the per-request load on the hot shard.
3. Dedicated quota shard for known-hot tenants, isolating their load from everyone else.
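
Mitigations 1 and 3 can be combined in a small routing function. The sketch below is a simplified illustration: the shard names, the `DEDICATED_SHARDS` override table, and the tenant id in it are all hypothetical.

```python
import hashlib
from typing import Dict, List

# Hypothetical topology: shared shards plus an override table for known-hot tenants.
SHARED_SHARDS: List[str] = ["quota-shard-0", "quota-shard-1", "quota-shard-2", "quota-shard-3"]
DEDICATED_SHARDS: Dict[str, str] = {"tenant-hot-1": "quota-shard-dedicated-1"}


def shard_for(identity: str) -> str:
    """Route an identity to its counter shard."""
    # Mitigation 3: known-hot tenants bypass hashing and get their own shard.
    if identity in DEDICATED_SHARDS:
        return DEDICATED_SHARDS[identity]
    # Mitigation 1: a stable hash spreads the remaining identities across shards.
    digest = hashlib.sha256(identity.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARED_SHARDS)
    return SHARED_SHARDS[index]
```

Plain modulo hashing remaps most identities whenever the shard count changes; a consistent-hashing ring or fixed hash slots avoid that if the fleet is resized often.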

Latency budget. Each Redis hop is ~0.5–1 ms. If your endpoint's p99 budget is 20 ms, one hop is fine; if it is 2 ms, you need the local-counter variant. Always set a tight timeout on the Redis call (a few milliseconds: a small multiple of the normal hop and well inside the endpoint's budget) and a fallback, so a slow Redis can't blow your latency SLO or, worse, queue requests until the pool exhausts.

Fail-open vs fail-closed is a product decision

Architecture diagram· Local fallback when the counter store is unreachable

On Redis failure, gateways fall back to conservative local buckets so the limiter never becomes the outage.

A limiter on the hot path is itself a dependency, and dependencies fail. The question "what happens when the counter store is down?" separates senior answers from junior ones, because the right answer depends on the business meaning of the limit, not the algorithm.

Abuse guardrails should fail open (or fail to a conservative local bucket). The purpose of a volumetric limit is to protect the origin from a flood. If Redis is down and you fail closed, you have just denied 100% of traffic to defend against a hypothetical flood that may not even be happening — you have DoS'd yourself. Far better to admit traffic (the origin's own capacity and load-shedding become the backstop) or fall back to a generous local bucket.

Commercial quotas may fail closed — but carefully. If over-serving a metered, billable resource (e.g. paid LLM tokens) has real cost, a brief fail-closed window may be cheaper than the over-serve. Even then, a conservative local fallback that reconciles later is usually better than hard rejection.

The discipline is to classify each limit as abuse or quota and document its outage behaviour. State it explicitly in the interview: "volumetric limits fail open with a local fallback; billing quotas fail to a conservative local cap and reconcile on recovery; we never take the API down because the limiter's datastore blinked."
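
One way to make that classification explicit in code is a small decision function. The sketch below assumes the shared-store check enforces its own timeout and raises a hypothetical `CounterStoreUnavailable` on failure; `LimitClass`, `decide`, and the `quota_fail_closed` flag are likewise illustrative names.

```python
from enum import Enum
from typing import Callable


class LimitClass(Enum):
    ABUSE = "abuse"   # volumetric / anti-abuse guardrail
    QUOTA = "quota"   # commercial, billable quota


class CounterStoreUnavailable(Exception):
    """Raised by the shared-store check on timeout or connection failure."""


def decide(limit_class: LimitClass,
           shared_check: Callable[[], bool],
           local_check: Callable[[], bool],
           quota_fail_closed: bool = False) -> bool:
    """Return True to admit the request, False to throttle it."""
    try:
        return shared_check()
    except CounterStoreUnavailable:
        # Only a quota explicitly marked fail-closed rejects outright.
        if limit_class is LimitClass.QUOTA and quota_fail_closed:
            return False
        # Everything else degrades to the conservative per-gateway bucket.
        return local_check()
```

Keeping the posture as a per-limit parameter rather than a global switch means the outage behaviour of each limit is written down next to the limit itself.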

The response contract clients depend on

Architecture diagram· 429 response contract

A throttled response carries Retry-After and RateLimit-* headers so well-behaved clients self-regulate.

Rate limiting is a collaboration with well-behaved clients, not just a wall against bad ones. The response contract is what makes that collaboration possible.

When you throttle, return `429 Too Many Requests` — never 503 (which implies your fault) and never a silent drop (which forces clients to guess). Attach:

`Retry-After` — seconds (or an HTTP date) until the client may retry. This is the single most important header; without it clients retry immediately and amplify the overload.
`RateLimit-Limit`, `RateLimit-Remaining`, `RateLimit-Reset`: headers from the IETF Internet-Draft on RateLimit header fields (still a draft, not a ratified standard) that let clients self-regulate before hitting the limit, smoothing their own traffic.
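
A throttled response built to this contract might look like the sketch below. The function name `throttle_response` is hypothetical, and expressing `RateLimit-Reset` as seconds until reset is an assumption; some APIs publish an epoch timestamp instead, so check the convention your clients expect.

```python
import math
from typing import Dict, Tuple


def throttle_response(limit: int, remaining: int,
                      reset_seconds: float) -> Tuple[int, Dict[str, str]]:
    """Build the status code and headers for a throttled request."""
    # Round up so a client that waits exactly this long is not throttled again.
    wait = max(1, math.ceil(reset_seconds))
    headers = {
        "Retry-After": str(wait),
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(wait),  # assumed: delta seconds, not a timestamp
    }
    return 429, headers
```

For a token bucket, `reset_seconds` can be taken as the time until one token is available, roughly (1 - tokens) / R.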

On the client side, the contract is exponential backoff with jitter: on a 429, wait Retry-After if present, otherwise back off exponentially with randomised jitter to avoid a thundering-herd retry storm where every throttled client retries in lockstep.
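
The client half of the contract can be sketched as a delay function. This version uses the "full jitter" form of exponential backoff and only handles the seconds form of Retry-After, not the HTTP-date form; the name `retry_delay` and the default `base` and `cap` values are assumptions for the example.

```python
import random
from typing import Callable, Optional


def retry_delay(attempt: int,
                retry_after: Optional[float] = None,
                base: float = 0.5,
                cap: float = 60.0,
                rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    # The server's instruction wins when it is present.
    if retry_after is not None:
        return max(0.0, retry_after)
    # Otherwise: exponential ceiling, capped, with a uniformly random wait below it
    # so throttled clients do not all retry at the same instant.
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

A caller would sleep for the returned value and give up after a bounded number of attempts rather than retrying forever.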

The anti-pattern is the limit that returns a bare 429 with no headers. Clients can't tell whether to retry in one second or one minute, so they guess — usually too aggressively — and turn your protective limit into a retry amplifier.

Cost-weighted limits and priority load shedding

Not all requests cost the same, and not all clients matter equally under stress. Two refinements separate a staff-level answer.

Cost-weighted (weighted) rate limiting. Instead of "one request, one token," charge tokens proportional to the work a request triggers. A simple GET costs 1 token; a fan-out search across ten shards costs 10; an LLM completion costs tokens proportional to output length. GitHub's GraphQL API does exactly this — it computes a "point cost" per query and debits the budget accordingly, so a single expensive query consumes more of the quota than a cheap one. This aligns the limit with actual resource consumption rather than request count.
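
A minimal sketch of cost-weighted charging follows. The request kinds, the rule of one token per shard touched, and the rate of one budget token per 100 output tokens are all made-up values for illustration, not any provider's actual pricing.

```python
import math
from typing import Tuple


def request_cost(kind: str, shards_touched: int = 1, output_tokens: int = 0) -> int:
    """Budget tokens charged for one request (hypothetical cost table)."""
    if kind == "search":
        return max(1, shards_touched)                   # fan-out: one token per shard
    if kind == "completion":
        return max(1, math.ceil(output_tokens / 100))   # scales with output length
    return 1                                            # simple reads cost one token


def spend(budget_remaining: int, cost: int) -> Tuple[bool, int]:
    """Debit the budget if it covers the cost; return (allowed, new_remaining)."""
    if cost <= budget_remaining:
        return True, budget_remaining - cost
    return False, budget_remaining
```

When the cost is only known after the work is done, as with output length, a common approach is to reserve an estimate up front and settle the difference afterwards.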

Priority-aware load shedding. When the system is genuinely overloaded (not just one client misbehaving), rate limiting blurs into load shedding. Tag traffic by priority — paid before free, interactive before batch, write-path health checks before analytics — and shed the lowest priority first. This keeps the system up for the traffic that matters when capacity is the hard constraint. The mental model: rate limiting enforces fairness per client; load shedding enforces survival under aggregate overload. A mature system does both, and a strong candidate names the distinction.
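One simple way to express "shed the lowest priority first" is a per-class utilisation threshold. The classes and thresholds below are hypothetical values chosen for the sketch, not recommendations.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0  # e.g. paid, interactive traffic
    NORMAL = 1    # e.g. free-tier interactive traffic
    BATCH = 2     # e.g. analytics and backfills


# Hypothetical utilisation levels at which each class starts being shed.
SHED_AT = {
    Priority.BATCH: 0.80,
    Priority.NORMAL: 0.90,
    Priority.CRITICAL: 1.00,
}


def should_shed(priority: Priority, utilisation: float) -> bool:
    """True if a request of this priority should be rejected at this load."""
    return utilisation >= SHED_AT[priority]
```

At 85% utilisation only batch traffic is rejected; at 95% normal traffic is shed as well, and critical traffic is refused only once the system is fully saturated.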

Decision levers

Algorithm

Token bucket is the default (independent burst B and sustained R knobs). Sliding-window counter for cheap O(1) edge limits. Leaky bucket only when a downstream needs a smooth constant drip. Fixed window never, except as a crude guardrail — it allows 2B across the boundary.

Identity axis

Layer per-IP (edge, volumetric net), per-API-key (gateway, plan quota), per-user (app, fairness within a tenant), and per-path (expensive endpoints get their own budget). Never make per-IP the primary axis on authenticated traffic — corporate NATs make it punish whole offices.

Storage / distribution

Redis INCR+PEXPIRE (run together in MULTI/EXEC or a Lua script so the increment and the TTL are applied atomically) or a Lua token-bucket script for globally consistent limits (~1 ms/request). Local counters with periodic sync when the round-trip is too expensive, accepting brief over-admission. Shard by identity hash; isolate hot tenants on dedicated shards.

Failure posture

Classify each limit as abuse-guardrail (fail open or local fallback — never DoS yourself) or commercial-quota (may fail closed/conservative). Always set a tight timeout on the counter-store call so a slow store cannot blow latency or exhaust the connection pool.

Response contract

429 + Retry-After + RateLimit-Limit/-Remaining/-Reset on every throttle. Clients must back off exponentially with jitter. Expose remaining budget proactively so good clients self-regulate before hitting the wall.

Failure modes

Per-IP as the primary identity

Corporate, university, and carrier NATs hide thousands of users behind one IP. Limiting that IP denies service to all of them. Use per-IP only as an edge volumetric net; key authenticated traffic on API key or user id.

Fixed-window boundary doubling

A client sends B requests at the end of one window and B at the start of the next — 2B in two seconds. Use sliding-window counter or token bucket instead.
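The boundary effect is easy to reproduce with a toy simulation. This sketch assumes a 60-second window and B = 100; `fixed_window_admits` is an illustrative helper, not a production limiter.

```python
def fixed_window_admits(timestamps: list[float], limit: int, window: float) -> int:
    """Count the requests a fixed-window counter would admit."""
    counts: dict[int, int] = {}
    admitted = 0
    for t in timestamps:
        bucket = int(t // window)  # which fixed window this request falls in
        if counts.get(bucket, 0) < limit:
            counts[bucket] = counts.get(bucket, 0) + 1
            admitted += 1
    return admitted


# 100 requests in the last second of window 0, then 100 in the first
# second of window 1: every one of them is admitted.
burst = [59.0 + i / 100 for i in range(100)] + [60.0 + i / 100 for i in range(100)]
```

Calling `fixed_window_admits(burst, limit=100, window=60.0)` admits all 200 requests even though they span under two seconds.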

No Retry-After / RateLimit headers

Clients retry blindly, often immediately, amplifying the overload the limit was meant to prevent. Always emit Retry-After and the RateLimit-* headers and document exponential-backoff-with-jitter for clients.

Read-modify-write counter race (Advanced)

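GET-check-INCR lets two concurrent requests both read 99/100 and both proceed. Use a single server-side atomic operation instead: INCR (with PEXPIRE set in the same MULTI/EXEC or Lua script), deciding from the returned value, or a Lua refill script, so the check and the update are indivisible.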
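The race can be replayed deterministically against an in-memory stand-in for the counter store. `CounterStore` is a hypothetical class for illustration only; a real deployment would rely on the store's own atomic increment or a server-side script.

```python
import threading


class CounterStore:
    """In-memory stand-in for a shared counter store (hypothetical)."""

    def __init__(self) -> None:
        self._values: dict[str, int] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> int:
        with self._lock:
            return self._values.get(key, 0)

    def set(self, key: str, value: int) -> None:
        with self._lock:
            self._values[key] = value

    def incr(self, key: str) -> int:
        # Increment and return the new value as one indivisible step.
        with self._lock:
            self._values[key] = self._values.get(key, 0) + 1
            return self._values[key]


LIMIT = 100


def racy_interleaving(store: CounterStore, key: str) -> list[bool]:
    """Replay the bad schedule by hand: both callers read before either writes."""
    seen_a = store.get(key)
    seen_b = store.get(key)
    admit_a = seen_a < LIMIT
    admit_b = seen_b < LIMIT
    if admit_a:
        store.set(key, seen_a + 1)
    if admit_b:
        store.set(key, seen_b + 1)
    return [admit_a, admit_b]


def atomic_admit(store: CounterStore, key: str) -> bool:
    """Increment first, then decide from the value the store returned."""
    return store.incr(key) <= LIMIT
```

Starting from 99, the racy schedule admits both callers and leaves the counter at 100, while two `atomic_admit` calls admit the first and reject the second.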

Limiter on the hot path with no fallback (Advanced)

Counter store dies → every request fails → self-inflicted outage. Fail open or fall back to a conservative local bucket for abuse guardrails; never take the API down to protect a limit.
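A sketch of that posture, assuming the counter-store client enforces its own tight timeout and raises on failure. `StoreUnavailable`, `admit_with_fallback`, and the `"guardrail"` / `"quota"` labels are hypothetical names introduced for this example.

```python
from typing import Callable


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) counter-store client on error."""


def admit_with_fallback(
    key: str,
    remote_check: Callable[[str], bool],
    local_check: Callable[[str], bool],
    limit_class: str = "guardrail",
) -> bool:
    """Consult the shared store; degrade by limit class if it is unreachable."""
    try:
        # Assumption: remote_check applies a short timeout internally and
        # raises TimeoutError or StoreUnavailable rather than blocking.
        return remote_check(key)
    except (StoreUnavailable, TimeoutError):
        if limit_class == "guardrail":
            # Abuse guardrail: fall back to a conservative local bucket.
            return local_check(key)
        # Commercial quota: this sketch fails closed.
        return False
```

Whether a commercial quota should really fail closed is a product decision; the branch is there to show that the choice is made per limit class rather than globally.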

Hot-key concentration (Advanced)

A celebrity or abusive key sends all its traffic to one Redis shard, overloading it. Shard by identity hash, add gateway-local pre-limiting, and isolate known-hot tenants on dedicated quota shards.

Uniform cost assumption (Advanced)

Treating a cheap GET and an expensive multi-shard search as one token each lets a flood of expensive calls exhaust capacity within the rate limit. Use cost-weighted tokens and per-path budgets for expensive endpoints.

Case studies

Stripe

Stripe — request rate limiters and concurrency limiters in production

Stripe published one of the most-cited engineering posts on production rate limiting. They run two distinct limiter types because they protect against two different failure modes.

The request rate limiter caps requests per second per API key using a token-bucket scheme backed by Redis, with the refill logic implemented as a Redis Lua script so the read-modify-write is atomic. This handles the common case of a client looping too fast.

The concurrency limiter caps the number of in-flight requests per client — because a handful of slow, expensive requests can exhaust capacity even at a low request rate. This is the in-flight semaphore pattern: increment on entry, decrement on exit, with a safety timeout to release permits leaked by crashed requests.
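The in-flight semaphore pattern can be sketched as follows. This is an illustrative single-process version, not Stripe's implementation: `ConcurrencyLimiter`, `lease_ttl`, and the request-id bookkeeping are assumptions, and a shared deployment would keep the leases in the counter store.

```python
class ConcurrencyLimiter:
    """Caps in-flight requests per client, with expiring permits."""

    def __init__(self, max_in_flight: int, lease_ttl: float) -> None:
        self.max_in_flight = max_in_flight
        self.lease_ttl = lease_ttl
        # client -> {request_id: lease expiry time}
        self._leases: dict[str, dict[str, float]] = {}

    def acquire(self, client: str, request_id: str, now: float) -> bool:
        leases = self._leases.setdefault(client, {})
        # Safety timeout: drop permits whose holder never released them.
        for rid in [r for r, expiry in leases.items() if expiry <= now]:
            del leases[rid]
        if len(leases) >= self.max_in_flight:
            return False
        leases[request_id] = now + self.lease_ttl
        return True

    def release(self, client: str, request_id: str) -> None:
        # Called on request exit; a missing lease is ignored.
        self._leases.get(client, {}).pop(request_id, None)
```

The lease expiry is what keeps a crashed request from holding its permit forever; without it the cap would slowly leak down to zero available slots.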

Crucially, Stripe also separates limiters (protect Stripe's own infrastructure, fail toward protecting the system) from load shedders (shed non-critical traffic first when overloaded — they reserve capacity so that critical request types like charge creation are served before less critical ones). Their guidance: build the limiter as a thin, fast, fail-safe layer, and always return clear 429s with retry guidance so clients back off correctly.

Takeaway: Rate limit and concurrency limit are different defences — cap both requests/sec and in-flight requests, and separate "protect the system" limiters from "shed low-priority traffic" load shedders.

GitHub

GitHub API — cost-weighted limits and explicit headers

GitHub's REST API enforces 5,000 requests/hour for authenticated users (60 requests/hour for unauthenticated callers, keyed by IP), and every response carries X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset. Clients are expected to read these and self-throttle; GitHub's own documentation tells integrators to watch Remaining and pause before they hit zero.

The more interesting design is the GraphQL API's cost-weighted limit. Because a single GraphQL query can request wildly different amounts of work, GitHub computes a point cost for each query based on the number of nodes it will touch, and debits a 5,000-point-per-hour budget. A trivial query costs 1 point; a deeply nested one fetching thousands of connected objects costs hundreds. This decouples the limit from raw request count and ties it to actual server work — the same query-cost idea that protects expensive endpoints.

GitHub also layers secondary rate limits that catch abusive patterns (too many concurrent requests, too many points per minute, too many content-creating requests) on top of the primary hourly budget — a concrete example of defence in depth.

Takeaway: Tie the limit to real work (query cost), expose RateLimit headers so good clients self-regulate, and layer secondary limits to catch abuse the primary budget misses.

Cloudflare

Cloudflare — sliding-window counters at the edge

Cloudflare enforces rate limits at the edge across a global network of data centres, where the per-request cost of the algorithm matters enormously because it runs on every request to millions of sites. They popularised the sliding-window counter approach precisely because it is O(1) in storage and computation while avoiding the fixed-window boundary-doubling problem.

The trick: rather than store a precise timestamp for every request (accurate but expensive — that is the "sliding log" approach, which is O(requests) memory), they keep just the count for the current and previous fixed windows and estimate the rolling rate as current_count + previous_count × (overlap fraction of the previous window). This is approximate — it assumes requests in the previous window were evenly distributed — but the error is small (single-digit percent) and bounded, and the memory cost is two integers per key regardless of traffic.

At edge scale, that approximation is the entire point: an exact sliding log would require storing and scanning a timestamp per request per identity across the whole network. The lesson is that the "best" algorithm depends on where it runs — at the edge, "cheap, O(1), and 99% accurate" beats "exact but O(n)."
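The two-counter estimate described above can be sketched in a few lines. The function names are illustrative, and the sketch assumes windows are aligned to multiples of `window` on the clock passed in as `now`.

```python
def sliding_window_estimate(
    prev_count: int, curr_count: int, now: float, window: float
) -> float:
    """Approximate the request count over the trailing `window` seconds."""
    elapsed_fraction = (now % window) / window  # how far into the current window
    overlap = 1.0 - elapsed_fraction            # share of the previous window still in view
    return curr_count + prev_count * overlap


def allow(prev_count: int, curr_count: int, now: float, window: float, limit: int) -> bool:
    """Admit the next request only if the estimated rolling count is under the limit."""
    return sliding_window_estimate(prev_count, curr_count, now, window) < limit
```

For example, 15 seconds into a 60-second window with 84 requests in the previous window and 36 in the current one, the estimate is 36 + 84 × 0.75 = 99. At the very start of a new window, a full previous window still counts in full, which is what closes the boundary-doubling gap.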

Takeaway: At edge scale the algorithm's per-request cost dominates — sliding-window counters trade a few percent of accuracy for O(1) memory, which is the right trade when you run on every request globally.

Decision table

Rate-limit design is algorithm + identity + storage + failure posture.

Layer	Identity	Purpose	Failure posture
Edge / CDN	IP / ASN / country	Volumetric abuse, cheap bot filtering	Fail-open or coarse local rule
Gateway	API key / tenant	Plan quota and tenant fairness	Conservative local fallback
App	User + path	Fairness and expensive-endpoint protection	Graceful 429 over dependency failure
Billing quota	Subscription / metered resource	Precise commercial limit	Fail-closed only if product demands it

Per-IP is a safety net, not a primary identity — NATs hide many users behind one address.
Layer limits so one logical action does not trip two layers and confuse legitimate clients.

Drills

Explain a token bucket in 30 seconds.

A bucket holds up to B tokens and refills at R per second. Each request spends one token; if the bucket is empty, reject. It allows bursts up to B then sustains R/s. State is just two numbers per identity — token count and last-refill timestamp — and the refill is computed lazily as min(B, tokens + elapsed × R) on the next access.
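A minimal single-process sketch of that description, with the clock passed in so the behaviour is easy to check. `TokenBucket` and its parameter names are illustrative; a distributed version would keep the same two numbers in the shared store.

```python
class TokenBucket:
    """Token bucket with lazy refill: state is a token count and a timestamp."""

    def __init__(self, burst: float, rate: float, now: float = 0.0) -> None:
        self.burst = burst       # B: maximum tokens the bucket can hold
        self.rate = rate         # R: tokens added per second
        self.tokens = burst      # start full so an initial burst is allowed
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Lazy refill: min(B, tokens + elapsed * R), computed on access.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens < 1.0:
            return False
        self.tokens -= 1.0
        return True
```

With B = 3 and R = 1/s, three back-to-back requests pass, the fourth is rejected, and one more is admitted after a second has elapsed. A long idle period refills the bucket only up to B, never beyond it.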

Why is a fixed-window counter dangerous?

Boundary doubling. A client can send B requests in the last second of one window and B in the first second of the next — 2B in two seconds, double the intended rate. A sliding-window counter fixes this with two integers by weighting the previous window's count by the overlap fraction.

Why not rate-limit purely by IP?

Corporate offices, universities, and mobile carriers put thousands of distinct users behind one NAT'd IP. Limiting that IP denies service to all of them. Per-IP belongs at the edge as a generous volumetric safety net; authenticated traffic should be keyed on API key or user id where each identity maps to one actor.

Your limiter's Redis is down. What happens?

Decide by limit class. Abuse / volumetric guardrails fail open or fall back to a conservative per-gateway local bucket — the origin's own capacity is the backstop, and you must not DoS yourself. Commercial quotas use a conservative local cap and reconcile on recovery. The Redis call always has a tight timeout so a slow Redis can't block requests or exhaust the pool.

Two requests hit the same key concurrently at 99/100. What can go wrong and how do you prevent it?

With a GET-check-INCR sequence both requests read 99, both decide they're under 100, and both proceed — over-admitting. Prevent it with an atomic server-side operation: INCR+PEXPIRE for window counters, or a Lua script that does the entire token-bucket refill-and-spend in one round-trip so no interleaving is possible.

How do you protect one expensive endpoint sharing a budget with cheap ones?

Give it its own per-path bucket so a flood of cheap requests can't drain the budget it needs, and if it's long-running, add a concurrency cap on in-flight requests per identity. For variable-cost endpoints (search, GraphQL), charge tokens proportional to the work — a wildcard query costs more tokens than a keyed lookup — so the limit tracks real server load rather than raw request count.
