Rate limiting / quota enforcement
Token bucket + distributed counter. The bucket math is trivial; the distribution, identity, and failure posture are the real interview.
When to reach for this
Reach for this when…
- Public API with tiered plans (free, paid, enterprise)
- Anti-abuse on write-heavy or expensive endpoints
- Per-tenant fairness in a multi-tenant system
- Protection against volumetric / credential-stuffing attacks
- Interviewer says "how do you stop one client taking down the system?"
Not really this pattern when…
- Internal service-to-service where every caller is trusted (use backpressure / load shedding instead)
- Predictable fixed-capacity batch jobs (schedule them; don't throttle them)
- A single client whose burst you actually want to absorb (use a queue, not a 429)
Good vs bad answer
Interviewer probe
“How do you rate-limit your public API?”
Weak answer
"1000 requests per minute per IP, return a 429 when they go over."
Strong answer
"Token bucket per API key — burst 100, sustained 10/s for free tier, 10× for paid. State in Redis as a hash with the refill-and-spend in a Lua script so it's atomic and globally consistent across gateways. Edge CDN adds a per-IP volumetric layer at ~1000/min as a DDoS safety net only — I won't make IP the primary axis because corporate NATs would punish whole offices. Expensive endpoints like /export and /search get their own per-path buckets so cheap traffic can't starve them. Response is 429 with Retry-After and RateLimit-Limit/-Remaining/-Reset so clients back off with exponential jitter. If Redis is unreachable I fail open to a conservative local bucket — I won't DoS myself to protect a quota. The hot-key risk (a celebrity key overloading one shard) I'd handle with local pre-limiting at the gateway and a dedicated quota shard for known-hot tenants."
Why it wins: Names all four axes — algorithm, identity, storage, failure posture — plus per-path budgets, the response contract, and the hot-key mitigation. The weak answer picks the one identity (IP) that breaks on NATs and ignores distribution and outage behaviour.
Cheat sheet
- Four axes: algorithm, identity, storage, failure posture. Name all four.
- Default algorithm: token bucket. Burst B + sustained R are separate knobs.
- Sliding-window counter for cheap O(1) edge limits; fixed window never.
- Identity: per-IP (edge net) < per-key (gateway) < per-user/per-path (app). Layer them.
- Per-IP punishes NATs: safety net only, never the primary axis.
- Storage: Redis atomic INCR+PEXPIRE or a Lua refill-and-spend script. No read-modify-write race.
- Shard by identity hash; isolate hot tenants on dedicated quota shards.
- Response: 429 + Retry-After + RateLimit-Limit/-Remaining/-Reset. Always.
- Clients: exponential backoff with jitter, never lockstep retries.
- Fail-open (or local fallback) for abuse limits; the limiter must never be the outage.
- Charge tokens by cost for variable-cost endpoints (GitHub GraphQL point costs).
- Concurrency limit (in-flight cap) for slow expensive endpoints, alongside the rate limit.
Core concept
Rate limiting answers two questions at once: "how fast may this client go?" (capacity / fairness) and "is this client abusing us?" (anti-abuse). Every strong answer decomposes the problem into three axes — algorithm, identity, and storage — and then adds a fourth that juniors forget: failure posture.
Three enforcement layers — edge (per-IP), gateway (per-key), app (per-user/path) — backed by a shared counter store.
Algorithm — pick by burst tolerance.
- Token bucket is the default. A bucket holds up to B tokens and refills at R per second; each request spends one token. It permits bursts up to B and sustains R/s. Two numbers per identity capture the whole state: token count and last-refill timestamp.
- Leaky bucket smooths output to a constant R/s; use it when a downstream needs a steady drip rather than bursts.
- Sliding-window counter approximates a rolling window with O(1) storage and avoids the fixed-window doubling bug.
- Fixed window is the simplest but lets a client send 2B across a window boundary. Use it only for coarse guardrails.
Bucket holds up to B tokens, refilled at R/s. Each request spends one. Empty bucket → reject.
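As a rough illustration of the bucket just described, here is a minimal in-process sketch. The class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for the example, not names from any particular library.

```python
import time


class TokenBucket:
    """In-process token bucket: capacity B, refill rate R tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)        # B: maximum burst size
        self.refill_rate = float(refill_rate)  # R: sustained tokens per second
        self.clock = clock                     # injectable so tests can control time
        self.tokens = self.capacity            # start full so a new client may burst
        self.last_refill = self.clock()

    def allow(self, cost=1.0):
        """Refill lazily from elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # empty bucket: reject
```

The whole state is the two fields `tokens` and `last_refill`; there is no background refill thread, because the refill is computed from elapsed time on each call.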
Identity — layer them, never rely on one.
- Per-IP at the edge is a volumetric safety net, but corporate and carrier NATs hide thousands of users behind one address, so it is a net, not an identity.
- Per-API-key at the gateway is the right axis for authenticated public APIs and plan quotas.
- Per-user-id at the app enforces fairness between users that share a single tenant key.
- Per-path at the app gives expensive endpoints (exports, search, inference) their own budget so a cheap endpoint's traffic can't starve them.
Each layer catches a distinct abuse class. One layer failing does not expose the origin.
Storage — a global limit needs a shared counter. A single logical limit enforced across a fleet of gateways requires shared state. The default is Redis with an atomic INCR + PEXPIRE per (identity, window), or a Lua script that implements token-bucket refill atomically. One round-trip per request scales to 100k+ req/s per Redis shard; shard by identity hash beyond that.
N gateways enforce one global limit by sharing a Redis counter keyed by (identity, window).
Failure posture — the limiter must never become the outage. If the counter store is unreachable, decide deliberately: fail-open (or fall back to a conservative local bucket) for abuse guardrails, because dropping all traffic to protect against a hypothetical flood is self-inflicted denial of service; fail-closed only for hard commercial quotas where over-serving has real cost. Name the difference — it shows you understand the business meaning of the limit, not just the algorithm.
On Redis failure, gateways fall back to conservative local buckets so the limiter never becomes the outage.
Response contract — non-optional. Every throttled request returns 429 Too Many Requests with Retry-After and the RateLimit-Limit / RateLimit-Remaining / RateLimit-Reset headers (the IETF draft standard). Well-behaved clients read these and back off; without them, clients retry blindly and amplify the very overload you are defending against.
Interview walkthrough
Worked example: rate limiting for a public API with free and paid plans
Scenario. A public REST API with three tiers: free (60 req/min), paid (1,000 req/min), enterprise (custom). Endpoints range from a trivial GET /me to an expensive POST /reports/export that fans out across the database. Abuse history: credential-stuffing on POST /login and scrapers hammering GET /search.
Step 1 — separate abuse from quota. Two different problems wearing the same hat. Plan limits (60 / 1,000 req/min) are a quota keyed by API key. Credential stuffing and scraping are abuse, better caught by per-IP edge limits and per-path app limits. Don't conflate them into one number.
Step 2 — edge layer (volumetric). At the CDN, a per-IP sliding-window limit at, say, 2,000 req/min — generous, because it must not punish a NAT'd office, but low enough to blunt a single-host flood. Fails open: if the edge limiter has trouble, traffic passes to the gateway, which still enforces the plan.
Step 3 — gateway layer (plan quota). Token bucket per API key in Redis. Free: B=60, R=1/s (60/min). Paid: B=200, R≈16.7/s (1,000/min). Enterprise: per-contract. Refill-and-spend in a Lua script for atomicity. Every response carries RateLimit-Remaining so integrators self-throttle. Over limit → 429 + Retry-After.
Step 4 — app layer (expensive paths + auth abuse). Per-path budgets: POST /reports/export gets a tight bucket (B=5, R=1/min) and a concurrency cap of 2 in-flight per tenant, because each export is heavy. POST /login gets a per-IP-plus-username bucket (e.g. 5 attempts / 15 min) to throttle credential stuffing without locking out a whole NAT.
Step 5 — cost-weighting. For GET /search, charge tokens proportional to result-set fan-out so a wildcard query costs more than a keyed lookup — aligning the limit with real work rather than request count.
Step 6 — failure posture. Redis down? Plan and abuse limits fall back to conservative local buckets per gateway and reconcile on recovery; the API stays up. The only thing that fails closed is the metered enterprise overage billing path, where over-serving has direct cost — and even that uses a conservative local cap, not hard rejection.
Step 7 — observability. Emit metrics for 429 rate per tier and per endpoint, false-positive signals (sudden 429 spikes for paid users), Redis latency on the limiter path, and top throttled identities. A rising paid-tier 429 rate is an incident, not a success.
Result. Free users get a fair, predictable budget; paid users rarely see a 429; /login and /export are protected without collateral damage to legitimate offices; and a Redis blip degrades to slightly looser local enforcement instead of an outage.
Interview playbook
When it comes up
- The prompt has public APIs, abusive clients, tenants, or paid plan tiers
- A shared multi-tenant system must protect users from each other
- Requests differ in cost by endpoint or by plan
- Interviewer asks "what stops one client overwhelming the system?"
Order of reveal
1. Name the four axes. Rate limiting is algorithm, identity, storage, and failure posture. Most candidates name the first three and forget the fourth.
2. Pick token bucket. Token bucket is my default, because independent burst and sustained knobs match how real clients behave: bursty, then idle.
3. Layer the identities. Per-IP at the edge as a volumetric net, per-API-key at the gateway for plan quota, per-user and per-path at the app for fairness and expensive endpoints.
4. Distribute the counter. Shared Redis with an atomic Lua refill-and-spend for global consistency; shard by identity hash and isolate hot tenants.
5. State the response contract. 429 with Retry-After and RateLimit-* headers; clients back off with exponential jitter.
6. Define the failure posture. Abuse guardrails fail open or to a local bucket; commercial quotas may fail conservative. The limiter must never be the outage.
Signature phrases
- “Per-IP is a safety net, not an identity” — Shows you know NATs hide many users behind one address.
- “Burst and sustained are separate knobs” — Demonstrates token-bucket understanding over a flat rate.
- “The limiter cannot become the outage” — Names the fail-open / local-fallback insight that juniors miss.
- “Charge tokens by cost, not by count” — Connects the limit to real server work for expensive endpoints.
- “Atomic refill or you have a race” — Shows awareness of the read-modify-write counter bug.
Likely follow-ups
“What if one API key is extremely hot?”
Shard the counter by identity hash so keys spread across Redis shards, add gateway-local pre-limiting so most requests are decided in-process and only reconcile with Redis periodically, and for known-hot tenants give them a dedicated quota shard so their load is isolated from everyone else.
“What if Redis is down?”
Depends on the limit class. Abuse guardrails fail open or fall back to a conservative per-gateway local bucket — I will not drop all traffic to defend against a hypothetical flood. Billing quotas use a conservative local cap and reconcile on recovery. The hot-path Redis call always has a tight timeout so a slow Redis cannot block requests or exhaust the connection pool.
“How do you handle expensive vs cheap endpoints under one budget?”
Give expensive endpoints their own per-path bucket and, if they are long-running, a concurrency cap on in-flight requests. For variable-cost endpoints like search or GraphQL, charge tokens proportional to the work — GitHub does this with query point costs — so one expensive call consumes more budget than one cheap call.
“How do clients know how to behave?”
The RateLimit-Limit/-Remaining/-Reset headers let them self-throttle before hitting the wall, and Retry-After tells them exactly how long to wait after a 429. Clients should implement exponential backoff with jitter so a fleet of throttled clients does not retry in lockstep and create a thundering herd.
Canonical examples
- API gateway quotas (Stripe, GitHub, Twilio)
- Login / signup flood control
- Email and SMS send-rate caps
- Per-tenant request budgets in SaaS
- Public chatbot / LLM inference APIs
Variants
Per-key token bucket in Redis
Atomic refill in a Lua script per (api_key, bucket), one round-trip per request.
The default for authenticated public APIs. State per identity is two fields — current token count and last-refill timestamp — stored in a Redis hash. A Lua script runs the refill-and-spend atomically so concurrent requests on the same key can't both read a stale count and over-spend:
1. Read tokens and last_refill.
2. Add (now - last_refill) × R tokens, capped at B.
3. If tokens >= 1, decrement and allow; else reject.
4. Write back tokens and last_refill with a TTL so idle keys expire.
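The four steps above can be sketched in Python against an in-memory stand-in for the store. This is not the Lua script itself: the module-level `_store` dict plays the role of the Redis hash, the `threading.Lock` stands in for the atomicity a server-side script would give, and the function name `refill_and_spend` and the default TTL of twice the full-refill time are assumptions for illustration.

```python
import threading
import time

_store = {}                # stand-in for Redis: key -> tokens, last_refill, expires_at
_lock = threading.Lock()   # stand-in for the atomicity of a server-side script


def refill_and_spend(key, capacity, rate, now=None, ttl=None):
    """Return True if the request is allowed; steps 1-4 run as one atomic unit."""
    now = time.time() if now is None else now
    # Assumed TTL: twice the time to refill from empty, so an expired key
    # can safely be treated as a full bucket.
    ttl = 2.0 * capacity / rate if ttl is None else ttl
    with _lock:
        state = _store.get(key)
        if state is None or state["expires_at"] <= now:
            state = {"tokens": float(capacity), "last_refill": now}
        # 1. Read tokens and last_refill.
        tokens, last_refill = state["tokens"], state["last_refill"]
        # 2. Add (now - last_refill) * R tokens, capped at B.
        tokens = min(float(capacity), tokens + max(0.0, now - last_refill) * rate)
        # 3. If tokens >= 1, decrement and allow; else reject.
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        # 4. Write back with a TTL so idle keys expire.
        _store[key] = {"tokens": tokens, "last_refill": now, "expires_at": now + ttl}
        return allowed
```

The TTL only needs to be at least B / R: once a key has been idle that long its bucket would be full anyway, so losing the state changes nothing.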
This is globally consistent: every gateway sees the same bucket. The cost is one Redis round-trip (~0.5–1 ms) per request and one hot key per identity. It scales to millions of keys because each key is tiny and idle keys self-expire.
Pros
- Globally consistent across the whole gateway fleet
- Atomic refill via Lua: no read-modify-write race
- Cheap per request (~1 ms) and scales to millions of keys
Cons
- Redis sits on the hot path, so it needs its own HA and a fallback
- A celebrity / abusive key becomes a hot shard
Choose this variant when
- Public authenticated APIs with plan quotas
- You need cross-gateway consistency for billing-grade limits
Local counter with periodic sync
Each gateway enforces its own bucket and reconciles to a shared store every few seconds.
When per-request Redis round-trips are too expensive (very high QPS edges, or latency budgets in the single-digit milliseconds), give each gateway a local bucket sized to its share of the global limit and reconcile periodically. Worst-case over-admission is roughly gateway_count × local_slack during a sync interval — acceptable for DDoS-scale guardrails, not for precise billing quotas.
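The over-admission bound above is simple arithmetic, sketched here under the assumption of an even split of the global limit plus a fixed per-gateway slack. The function name `plan_local_buckets` and the example numbers are hypothetical.

```python
def plan_local_buckets(global_limit, gateway_count, local_slack):
    """Split a global limit evenly across gateways and bound the over-admission.

    Assumes an even split; local_slack is the extra headroom each gateway
    is allowed between syncs.
    """
    share = global_limit // gateway_count
    return {
        "local_limit": share + local_slack,
        # If every gateway uses all of its slack in the same sync interval:
        "worst_case_over_admission": gateway_count * local_slack,
    }


plan = plan_local_buckets(global_limit=10_000, gateway_count=20, local_slack=50)
print(plan)
```

With these example numbers each gateway admits up to 550 per interval and the fleet can exceed the global limit by at most 1,000, which is why the scheme suits guardrails better than billing.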
A common refinement is an "approximated global" scheme: each node keeps a local view, periodically pushes its consumption to Redis, and pulls back the global total to adjust its local allowance. Convergence is eventual but the per-request path stays in-process.
Pros
- Zero per-request external call: lowest possible latency
- Survives a counter-store outage by design
Cons
- Over-limit by up to (gateway_count × local_slack) in a window
- Not suitable for precise commercial quotas
Choose this variant when
- Very high QPS edge where a Redis hop per request is too costly
- Approximate enforcement is acceptable (abuse guardrails)
Layered (edge + gateway + app)
Per-IP at the CDN, per-key at the gateway, per-user-and-path at the app.
Defence in depth. Each layer targets a different abuse class, and the failure of any single layer does not expose the origin:
- Edge (per-IP / ASN): absorbs volumetric floods and trivial bots cheaply, before traffic reaches your fleet.
- Gateway (per-API-key / tenant): enforces plan quotas and tenant fairness.
- App (per-user + per-path): protects expensive endpoints and enforces fairness between users inside a shared tenant key.
The subtlety is coordinating the limits so a legitimate user can't trip two layers for one logical action (which produces confusing, hard-to-debug 429s). Document the budget at each layer and keep the edge limit generously above the gateway limit.
Pros
- Covers volumetric, plan-abuse, and hot-endpoint abuse separately
- Failure of one layer still leaves the others enforcing
Cons
- Operational complexity: three configs to keep consistent
- Mis-tuned layers cause false-positive 429s for real users
Choose this variant when
- Production public APIs exposed to the internet
- Systems with a history of abuse or scraping
Concurrency limit (in-flight cap)
Cap simultaneous in-flight requests per identity instead of a rate over time.
Some resources are bounded by concurrency, not throughput — a tenant running 50 simultaneous report exports can exhaust worker pools even at a modest request rate. A concurrency limiter tracks in-flight requests per identity (a semaphore in Redis: INCR on entry, DECR on completion, with a safety TTL so crashed requests don't leak permits).
This composes with a token bucket: the bucket governs how often you may start work, the semaphore governs how much may run at once. For expensive, long-running endpoints (LLM inference, video transcode kickoff), the concurrency cap is often the more important of the two.
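A minimal in-memory sketch of the in-flight cap follows. It differs slightly from the plain INCR/DECR counter described above: each permit is tracked as its own lease with an expiry, which is one way to make the safety TTL release only the leaked permit. The class name `ConcurrencyLimiter` and its methods are assumptions for illustration.

```python
import time
import uuid


class ConcurrencyLimiter:
    """Caps in-flight requests per identity; leases expire so crashes cannot leak permits."""

    def __init__(self, max_in_flight, lease_ttl=30.0, clock=time.monotonic):
        self.max_in_flight = max_in_flight
        self.lease_ttl = lease_ttl   # safety TTL for a single permit
        self.clock = clock
        self._leases = {}            # identity -> {lease_id: expires_at}

    def acquire(self, identity):
        """Return a lease id if a slot is free, else None."""
        now = self.clock()
        leases = self._leases.setdefault(identity, {})
        # Reclaim permits whose holders never released them (e.g. crashed).
        for lease_id in [lid for lid, exp in leases.items() if exp <= now]:
            del leases[lease_id]
        if len(leases) >= self.max_in_flight:
            return None
        lease_id = uuid.uuid4().hex
        leases[lease_id] = now + self.lease_ttl
        return lease_id

    def release(self, identity, lease_id):
        """Release on completion; a no-op if the lease already expired."""
        self._leases.get(identity, {}).pop(lease_id, None)
```

The lease TTL should comfortably exceed the slowest legitimate request, otherwise a still-running request loses its permit and the cap is briefly exceeded.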
Pros
- Directly protects bounded resources (worker pools, DB connections)
- Catches slow, expensive requests that a rate limit misses
Cons
- Leaked permits if you forget the safety TTL on crash
- Harder for clients to reason about than a simple rate
Choose this variant when
- Expensive, long-running endpoints (exports, inference, transcode)
- Downstream has a hard concurrency ceiling (connection pools)
Scaling path
v1 — single-instance in-memory bucket
Stop the obvious abuse with the simplest thing that works.
One app instance, one in-process token bucket per identity in a local map. Zero infrastructure, sub-microsecond checks. This is genuinely the right answer when you have a single instance or when each instance owns a disjoint set of clients.
The honest caveat: the moment you run two instances behind a load balancer, each enforces the limit independently, so the effective global limit is instance_count × local_limit. That is the trigger to move to v2.
What triggers the next iteration
- Multiple instances each enforce locally — global limit is N× too loose
- State lost on restart — buckets reset, briefly allowing bursts
- No cross-instance view of an abusive client
v2 — shared Redis counter
Enforce one global limit across the whole fleet.
Move bucket state into Redis. Each request runs an atomic refill-and-spend Lua script keyed by (identity, bucket). Now every gateway sees the same bucket and the global limit is exact.
Add the response contract here: 429 + Retry-After + RateLimit-* headers. The cost is one Redis round-trip per request and a hard dependency on Redis availability — which sets up the next two steps.
What triggers the next iteration
- Redis is now on the hot path — its outage threatens every request
- A single hot identity hammers one Redis shard
- Per-request round-trip adds ~1 ms latency
v3 — sharding + local fallback
Survive hot keys and counter-store outages.
Shard the counter store by identity hash so no single shard owns all the hot keys. Add a local-fallback path: if Redis times out, gateways switch to a conservative in-process bucket rather than failing every request.
Decide the failure posture explicitly per limit class: abuse guardrails fail open (or local), commercial quotas may fail closed. This is the difference between "we briefly over-served during a Redis blip" and "we took the whole API down to protect a quota."
What triggers the next iteration
- Local fallback admits more traffic than the global limit during outages
- Hot tenants still need dedicated quota shards
- Coordinating fail-open vs fail-closed across limit classes
v4 — layered defence in depth
Separate volumetric, plan, and endpoint abuse into independent layers.
Add edge per-IP limiting at the CDN for volumetric floods and per-path limits at the app for expensive endpoints. Each layer is independently tuned and independently fails. The gateway per-key layer handles plan quotas in the middle.
This is the mature shape for a public internet-facing API. The remaining work is continuous tuning: watching false-positive 429 rates, adjusting per-tenant overrides, and feeding abusive-IP signals to the edge.
What triggers the next iteration
- Mis-tuned layers produce confusing double-throttling
- Per-tenant overrides multiply configuration surface
- Edge IP signals need a feedback loop from app-level abuse detection
Deep dives
Token bucket vs sliding window — when each wins
Fixed window allows 2B at the boundary; sliding window weights the previous window to smooth it.
The four classic algorithms differ only in how they treat bursts and how much state they need.
Fixed window counts requests per calendar window (e.g. per minute) and resets at the boundary. It needs one integer per identity but has a fatal edge: a client can send B requests in the last second of one window and B in the first second of the next — 2B in two seconds, double the intended rate.
Sliding-window counter fixes this cheaply: keep the current and previous window counts and compute a weighted estimate — current + previous × (overlap_fraction). It needs two integers and a multiply, has no boundary doubling, and is accurate to within a few percent. This is what Cloudflare popularised for edge rate limiting because it is O(1) and approximate-but-good-enough.
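The weighted estimate can be sketched as a small in-process class. The name `SlidingWindowCounter` and the choice to pass `now` explicitly are assumptions for the example; the estimate is the `current + previous × overlap_fraction` formula from the paragraph above.

```python
class SlidingWindowCounter:
    """Approximate rolling-window limit using two integers per identity."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_index = None  # which fixed window `current` belongs to
        self.current = 0
        self.previous = 0

    def allow(self, now):
        index = int(now // self.window)
        if index != self.current_index:
            # Roll forward; the old count only matters if the windows are adjacent.
            adjacent = self.current_index is not None and index == self.current_index + 1
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.current_index = index
        elapsed = now - index * self.window
        overlap_fraction = (self.window - elapsed) / self.window
        estimate = self.current + self.previous * overlap_fraction
        if estimate < self.limit:
            self.current += 1
            return True
        return False
```

With a limit of 10 per 60 s, a client that spends all 10 at t=59 gets only one more request through at t=61, where a fixed window would have admitted another 10.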
Token bucket is the most flexible: B and R are independent knobs, so you can allow a generous burst (B = 100) on top of a modest sustained rate (R = 10/s). State is two fields. It is the right default for API quotas because real clients are bursty — a page load fires ten requests at once, then nothing for a second.
Leaky bucket is token bucket's mirror: it enforces a constant output rate by draining a queue at R/s. Reach for it only when a downstream genuinely needs a smooth drip (e.g. a third-party API with its own strict steady-rate limit you must not exceed).
The interview-grade summary: token bucket for API quotas, sliding window for cheap edge volumetric limits, leaky bucket for smoothing into a rate-sensitive downstream, fixed window never (except as a crude guardrail).
Identity choice decides your false-positive rate
The single most common production incident with rate limiting is throttling legitimate users — and the cause is almost always the wrong identity axis.
Per-IP is cheap and available pre-auth, which makes it tempting as the primary axis. But a corporate office, a university, or a mobile carrier can put thousands of distinct users behind a single NAT'd IP. Rate-limit that IP and you deny service to an entire building. Per-IP belongs at the edge as a volumetric safety net with a generous threshold, not as the primary fairness mechanism.
Per-API-key is the right primary axis for public APIs: it maps to a billing account and a plan tier. But a single key can represent a whole company with many end users, so within a key you may still need per-user fairness.
Per-user-id is the fairest axis but is only available after authentication. Use it at the app layer to stop one user inside a shared tenant from starving the others.
Per-path is orthogonal to all of the above: GET /search and POST /export have wildly different costs, so they deserve separate budgets even for the same identity. Without it, a flood of cheap requests can exhaust the budget an expensive endpoint needed.
Strong designs layer these and tune each independently. The edge blocks obvious floods, the gateway enforces the plan, and the app enforces fairness and protects expensive paths.
Distributed enforcement and the hot-key problem
A global limit enforced across a fleet needs shared state, and the default is Redis. The naive approach — GET, check, INCR — has a read-modify-write race: two concurrent requests both read count = 99 against a limit of 100 and both proceed. The fix is an atomic server-side operation: either INCR followed by PEXPIRE (for window counters) or a Lua script that does the whole token-bucket refill-and-spend in one round-trip so no interleaving is possible.
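The race is easier to see with the interleaving written out explicitly. The sketch below is a deterministic simulation, not real concurrent code: the `Counter` class is a hypothetical in-memory stand-in for the shared store, and `racy_interleaving` hard-codes the unlucky ordering in which both requests read before either writes.

```python
LIMIT = 100


class Counter:
    """Hypothetical in-memory stand-in for a shared counter store."""

    def __init__(self, value=0):
        self.value = value

    def get(self):
        return self.value

    def incr(self):
        self.value += 1
        return self.value  # like an atomic increment, returns the new value


def racy_interleaving(counter):
    """GET, check, INCR with both reads happening before either write."""
    read_a = counter.get()
    read_b = counter.get()
    allowed = 0
    for seen in (read_a, read_b):
        if seen < LIMIT:       # both saw 99, so both pass the check
            counter.incr()
            allowed += 1
    return allowed


def atomic_check(counter):
    """Increment first, then decide on the value the increment returned."""
    return counter.incr() <= LIMIT


racy = Counter(99)
print(racy_interleaving(racy), racy.value)       # both admitted, count overshoots
safe = Counter(99)
print(atomic_check(safe), atomic_check(safe))    # only the first is admitted
```

In the atomic version a rejected request still increments the counter, which is harmless for a window counter that expires but is one reason token buckets do the whole refill-and-spend in a single script.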
The hot-key problem. A celebrity API key, a viral tenant, or an attacker hammering one identity concentrates all traffic on the single Redis shard that owns that key. Mitigations, in order of escalation:
1. Shard by identity hash so different identities land on different shards (helps the aggregate, not a single hot identity).
2. Local pre-limiting at the gateway: each gateway enforces a fraction of the global limit locally and only consults Redis to reconcile, cutting the per-request load on the hot shard.
3. Dedicated quota shard for known-hot tenants, isolating their load from everyone else.
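Mitigations 1 and 3 amount to a routing function. The sketch below assumes a static shard list and a hand-maintained override table; the shard names, the `tenant-hot` identity, and the function name `shard_for` are all hypothetical.

```python
import hashlib

SHARDS = ["quota-0", "quota-1", "quota-2", "quota-3"]      # hypothetical shard names
DEDICATED_SHARDS = {"tenant-hot": "quota-dedicated-1"}     # known-hot tenants


def shard_for(identity):
    """Route an identity to its quota shard."""
    # Known-hot tenants are isolated on their own shard.
    if identity in DEDICATED_SHARDS:
        return DEDICATED_SHARDS[identity]
    # Everyone else is spread by a stable hash of the identity.
    digest = hashlib.sha256(identity.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

Plain modulo hashing remaps most identities when the shard count changes, so a real deployment would more likely rely on consistent hashing or the store's own cluster slotting.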
Latency budget. Each Redis hop is ~0.5–1 ms. If your endpoint's p99 budget is 20 ms, one hop is fine; if it is 2 ms, you need the local-counter variant. Always set a tight timeout on the Redis call, sized inside the endpoint's own latency budget (50 ms is tight for a 500 ms endpoint but not for a 20 ms one), and a fallback, so a slow Redis can't blow your latency SLO or, worse, queue requests until the pool exhausts.
Fail-open vs fail-closed is a product decision
A limiter on the hot path is itself a dependency, and dependencies fail. The question "what happens when the counter store is down?" separates senior answers from junior ones, because the right answer depends on the business meaning of the limit, not the algorithm.
Abuse guardrails should fail open (or fail to a conservative local bucket). The purpose of a volumetric limit is to protect the origin from a flood. If Redis is down and you fail closed, you have just denied 100% of traffic to defend against a hypothetical flood that may not even be happening — you have DoS'd yourself. Far better to admit traffic (the origin's own capacity and load-shedding become the backstop) or fall back to a generous local bucket.
Commercial quotas may fail closed — but carefully. If over-serving a metered, billable resource (e.g. paid LLM tokens) has real cost, a brief fail-closed window may be cheaper than the over-serve. Even then, a conservative local fallback that reconciles later is usually better than hard rejection.
The discipline is to classify each limit as abuse or quota and document its outage behaviour. State it explicitly in the interview: "volumetric limits fail open with a local fallback; billing quotas fail to a conservative local cap and reconcile on recovery; we never take the API down because the limiter's datastore blinked."
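The classification can be made explicit in code. In this sketch `shared_check` and `local_check` are hypothetical callables that take an identity and return a bool, `CounterStoreUnavailable` is an assumed exception type, and the `fail_closed` flag is the per-limit-class decision described above.

```python
class CounterStoreUnavailable(Exception):
    """Raised by the (hypothetical) shared limiter when the store cannot be reached."""


def check_with_fallback(shared_check, local_check, identity, fail_closed=False):
    """Consult the shared limiter; on a store failure apply the documented posture."""
    try:
        return shared_check(identity)
    except (CounterStoreUnavailable, TimeoutError):
        if fail_closed:
            # Hard commercial quota: reject rather than over-serve.
            return False
        # Abuse guardrail: degrade to a conservative per-gateway bucket.
        return local_check(identity)
```

The shared call is assumed to enforce its own tight timeout and surface it as one of the two exceptions; without that, a slow store blocks the request before the fallback ever runs.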
The response contract clients depend on
A throttled response carries Retry-After and RateLimit-* headers so well-behaved clients self-regulate.
Rate limiting is a collaboration with well-behaved clients, not just a wall against bad ones. The response contract is what makes that collaboration possible.
When you throttle, return `429 Too Many Requests` — never 503 (which implies your fault) and never a silent drop (which forces clients to guess). Attach:
- `Retry-After` — seconds (or an HTTP date) until the client may retry. This is the single most important header; without it clients retry immediately and amplify the overload.
- `RateLimit-Limit`, `RateLimit-Remaining`, `RateLimit-Reset` — the IETF draft standard headers that let clients self-regulate before hitting the limit, smoothing their own traffic.
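On the server side, the headers listed above might be assembled roughly like this. The function name `throttled_response` is hypothetical, and the sketch assumes `RateLimit-Reset` is expressed as seconds until the budget resets; some APIs send an epoch timestamp instead, so check what your clients expect.

```python
import math


def throttled_response(limit, reset_after_seconds):
    """Build the status code and headers for a throttled request."""
    # Round up so a client that waits exactly this long is not throttled again.
    wait = max(1, math.ceil(reset_after_seconds))
    headers = {
        "Retry-After": str(wait),
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": str(wait),  # assumed to be delta-seconds
    }
    return 429, headers
```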
On the client side, the contract is exponential backoff with jitter: on a 429, wait Retry-After if present, otherwise back off exponentially with randomised jitter to avoid a thundering-herd retry storm where every throttled client retries in lockstep.
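A client-side sketch of that rule follows. The function name `retry_delay` and the `base` and `cap` defaults are assumptions; the jitter shown is the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling.

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        # The server told us exactly how long to wait; honour it.
        return float(retry_after)
    # Otherwise back off exponentially, capped, with randomised jitter so a
    # fleet of throttled clients does not retry in lockstep.
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests.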
The anti-pattern is the limit that returns a bare 429 with no headers. Clients can't tell whether to retry in one second or one minute, so they guess — usually too aggressively — and turn your protective limit into a retry amplifier.
Cost-weighted limits and priority load shedding
Not all requests cost the same, and not all clients matter equally under stress. Two refinements separate a staff-level answer.
Cost-weighted (weighted) rate limiting. Instead of "one request, one token," charge tokens proportional to the work a request triggers. A simple GET costs 1 token; a fan-out search across ten shards costs 10; an LLM completion costs tokens proportional to output length. GitHub's GraphQL API does exactly this — it computes a "point cost" per query and debits the budget accordingly, so a single expensive query consumes more of the quota than a cheap one. This aligns the limit with actual resource consumption rather than request count.
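A minimal sketch of cost-weighted charging, using the example costs from the paragraph above (1 token for a simple GET, one token per shard for a fan-out search). The function names and the endpoint strings are hypothetical, and this is not how GitHub computes its point costs.

```python
def request_cost(endpoint, shards_touched=1):
    """Tokens to charge for a request, proportional to the work it triggers."""
    if endpoint == "GET /search":
        return max(1, shards_touched)  # fan-out across N shards costs N tokens
    return 1                           # simple requests cost one token


def spend(tokens, cost):
    """Try to debit `cost` from the remaining budget; return (allowed, remaining)."""
    if tokens >= cost:
        return True, tokens - cost
    return False, tokens


budget = 15
allowed, budget = spend(budget, request_cost("GET /search", shards_touched=10))
print(allowed, budget)
```

After the first ten-shard search only 5 tokens remain, so a second one is rejected while cheap requests still go through.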
Priority-aware load shedding. When the system is genuinely overloaded (not just one client misbehaving), rate limiting blurs into load shedding. Tag traffic by priority — paid before free, interactive before batch, write-path health checks before analytics — and shed the lowest priority first. This keeps the system up for the traffic that matters when capacity is the hard constraint. The mental model: rate limiting enforces fairness per client; load shedding enforces survival under aggregate overload. A mature system does both, and a strong candidate names the distinction.
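Priority-aware shedding can be as simple as a per-class utilisation threshold. In the sketch below the traffic classes, the threshold values, and the notion of a single `utilization` number between 0 and 1 are all assumptions chosen for illustration.

```python
# Hypothetical thresholds: the lowest-priority class is shed first as load rises.
SHED_AT_UTILIZATION = {
    "batch": 0.70,
    "free": 0.80,
    "paid": 0.90,
    "critical": float("inf"),  # never shed by this layer
}


def should_shed(traffic_class, utilization):
    """Return True if a request of this class should be rejected at this load."""
    return utilization >= SHED_AT_UTILIZATION[traffic_class]
```

At 75% utilisation only batch traffic is shed; paid and critical requests are untouched until the system is much closer to its ceiling.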
Decision levers
Algorithm
Token bucket is the default (independent burst B and sustained R knobs). Sliding-window counter for cheap O(1) edge limits. Leaky bucket only when a downstream needs a smooth constant drip. Fixed window never, except as a crude guardrail — it allows 2B across the boundary.
Identity axis
Layer per-IP (edge, volumetric net), per-API-key (gateway, plan quota), per-user (app, fairness within a tenant), and per-path (expensive endpoints get their own budget). Never make per-IP the primary axis on authenticated traffic — corporate NATs make it punish whole offices.
Storage / distribution
Redis with atomic INCR+PEXPIRE or a Lua token-bucket script for globally consistent limits (~1 ms/request). Local counters with periodic sync when the round-trip is too expensive, accepting brief over-admission. Shard by identity hash; isolate hot tenants on dedicated shards.
Failure posture
Classify each limit as abuse-guardrail (fail open or local fallback — never DoS yourself) or commercial-quota (may fail closed/conservative). Always set a tight timeout on the counter-store call so a slow store cannot blow latency or exhaust the connection pool.
Response contract
429 + Retry-After + RateLimit-Limit/-Remaining/-Reset on every throttle. Clients must back off exponentially with jitter. Expose remaining budget proactively so good clients self-regulate before hitting the wall.
Failure modes
Corporate, university, and carrier NATs hide thousands of users behind one IP. Limiting that IP denies service to all of them. Use per-IP only as an edge volumetric net; key authenticated traffic on API key or user id.
A client sends B requests at the end of one window and B at the start of the next — 2B in two seconds. Use sliding-window counter or token bucket instead.
When a 429 carries no retry guidance, clients retry blindly, often immediately, amplifying the overload the limit was meant to prevent. Always emit Retry-After and the RateLimit-* headers and document exponential-backoff-with-jitter for clients.
GET-check-INCR lets two concurrent requests both read 99/100 and both proceed. Use a server-side INCR and decide on the value it returns (with the PEXPIRE in the same transaction or script), or a Lua refill script, so the read-modify-write is indivisible.
If the limiter fails closed, a dead counter store means every request fails: a self-inflicted outage. Fail open or fall back to a conservative local bucket for abuse guardrails; never take the API down to protect a limit.
A celebrity or abusive key sends all its traffic to one Redis shard, overloading it. Shard by identity hash, add gateway-local pre-limiting, and isolate known-hot tenants on dedicated quota shards.
Treating a cheap GET and an expensive multi-shard search as one token each lets a flood of expensive calls exhaust capacity within the rate limit. Use cost-weighted tokens and per-path budgets for expensive endpoints.
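The hot-key mitigation in the list above (shard by identity hash, dedicated shards for known-hot tenants) could be routed roughly like this. The shard names and the `DEDICATED` mapping are hypothetical, and a stable hash is used so every gateway picks the same shard for the same identity.

```python
import hashlib

# Hypothetical shard names and hot-tenant overrides.
SHARDS = ["quota-0", "quota-1", "quota-2", "quota-3"]
DEDICATED = {"tenant:celebrity": "quota-hot-1"}


def shard_for(identity: str) -> str:
    """Pick the counter-store shard that owns this identity's bucket."""
    if identity in DEDICATED:
        return DEDICATED[identity]  # isolate known-hot tenants
    digest = hashlib.sha256(identity.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; stable across processes and hosts.
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

Python's built-in `hash()` is avoided here because it is randomised per process for strings, which would send the same key to different shards from different gateways.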
Case studies
Stripe
Stripe — request rate limiters and concurrency limiters in production
Stripe published one of the most-cited engineering posts on production rate limiting. They run two distinct limiter types because they protect against two different failure modes.
The request rate limiter caps requests per second per API key using a token-bucket scheme backed by Redis, with the refill logic implemented as a Redis Lua script so the read-modify-write is atomic. This handles the common case of a client looping too fast.
The concurrency limiter caps the number of in-flight requests per client — because a handful of slow, expensive requests can exhaust capacity even at a low request rate. This is the in-flight semaphore pattern: increment on entry, decrement on exit, with a safety timeout to release permits leaked by crashed requests.
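The in-flight semaphore pattern described above can be sketched in memory as follows. This is a single-process illustration, not Stripe's implementation: the `ConcurrencyLimiter` name, the lease length, and the injectable clock are assumptions, and a production version would keep the leases in a shared store.

```python
import time
import uuid


class ConcurrencyLimiter:
    """Cap in-flight requests per client, with leases that expire on their own."""

    def __init__(self, max_in_flight: int, lease_s: float, clock=time.monotonic):
        self.max_in_flight = max_in_flight
        self.lease_s = lease_s
        self.clock = clock
        self._leases: dict[str, dict[str, float]] = {}  # client -> {permit: expiry}

    def acquire(self, client: str):
        """Return a permit id, or None if the client is at its cap."""
        now = self.clock()
        leases = self._leases.setdefault(client, {})
        # Safety timeout: reclaim permits whose holder never released them.
        for permit in [p for p, expiry in leases.items() if expiry <= now]:
            del leases[permit]
        if len(leases) >= self.max_in_flight:
            return None
        permit = uuid.uuid4().hex
        leases[permit] = now + self.lease_s
        return permit

    def release(self, client: str, permit: str) -> None:
        """Give the permit back on request exit; unknown permits are ignored."""
        self._leases.get(client, {}).pop(permit, None)
```

The lease should be comfortably longer than the slowest legitimate request, otherwise the limiter reclaims permits that are still in use.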
Crucially, Stripe also separates limiters (protect Stripe's own infrastructure, fail toward protecting the system) from load shedders (shed non-critical traffic first when overloaded — they reserve capacity so that critical request types like charge creation are served before less critical ones). Their guidance: build the limiter as a thin, fast, fail-safe layer, and always return clear 429s with retry guidance so clients back off correctly.
GitHub
GitHub API — cost-weighted limits and explicit headers
GitHub's REST API enforces 5,000 requests/hour for authenticated users (much lower for unauthenticated, keyed by IP), and every response carries X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset. Clients are expected to read these and self-throttle; GitHub's own documentation tells integrators to watch Remaining and pause before they hit zero.
The more interesting design is the GraphQL API's cost-weighted limit. Because a single GraphQL query can request wildly different amounts of work, GitHub computes a point cost for each query based on the number of nodes it will touch, and debits a 5,000-point-per-hour budget. A trivial query costs 1 point; a deeply nested one fetching thousands of connected objects costs hundreds. This decouples the limit from raw request count and ties it to actual server work — the same query-cost idea that protects expensive endpoints.
GitHub also layers secondary rate limits that catch abusive patterns (too many concurrent requests, too many points per minute, too many content-creating requests) on top of the primary hourly budget — a concrete example of defence in depth.
Cloudflare
Cloudflare — sliding-window counters at the edge
Cloudflare enforces rate limits at the edge across a global network of data centres, where the per-request cost of the algorithm matters enormously because it runs on every request to millions of sites. They popularised the sliding-window counter approach precisely because it is O(1) in storage and computation while avoiding the fixed-window boundary-doubling problem.
The trick: rather than store a precise timestamp for every request (accurate but expensive — that is the "sliding log" approach, which is O(requests) memory), they keep just the count for the current and previous fixed windows and estimate the rolling rate as current_count + previous_count × (overlap fraction of the previous window). This is approximate — it assumes requests in the previous window were evenly distributed — but the error is small (single-digit percent) and bounded, and the memory cost is two integers per key regardless of traffic.
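The two-counter estimate above can be sketched like this. It is an illustration of the technique, not Cloudflare's code; the class name is made up and time is passed in explicitly so the behaviour is easy to check.

```python
class SlidingWindowCounter:
    """Approximate a rolling window with counts for two fixed windows."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.window_index = 0
        self.current = 0
        self.previous = 0

    def _roll(self, now: float) -> None:
        index = int(now // self.window_s)
        if index <= self.window_index:
            return
        # Moving one window ahead keeps the old count; a longer gap clears it.
        self.previous = self.current if index == self.window_index + 1 else 0
        self.current = 0
        self.window_index = index

    def estimate(self, now: float) -> float:
        """current_count + previous_count * overlap fraction of the previous window."""
        self._roll(now)
        elapsed_fraction = (now % self.window_s) / self.window_s
        return self.current + self.previous * (1.0 - elapsed_fraction)

    def allow(self, now: float) -> bool:
        if self.estimate(now) + 1 > self.limit:
            return False
        self.current += 1
        return True
```

With a limit of 100 per 60 s, a client that spends all 100 at t=59 gets only one more request through at t=61, because the previous window still weighs 59/60 of its count.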
At edge scale, that approximation is the entire point: an exact sliding log would require storing and scanning a timestamp per request per identity across the whole network. The lesson is that the "best" algorithm depends on where it runs — at the edge, "cheap, O(1), and 99% accurate" beats "exact but O(n)."
Decision table
Rate-limit design is algorithm + identity + storage + failure posture.
| Layer | Identity | Purpose | Failure posture |
|---|---|---|---|
| Edge / CDN | IP / ASN / country | Volumetric abuse, cheap bot filtering | Fail-open or coarse local rule |
| Gateway | API key / tenant | Plan quota and tenant fairness | Conservative local fallback |
| App | User + path | Fairness and expensive-endpoint protection | Graceful 429 over dependency failure |
| Billing quota | Subscription / metered resource | Precise commercial limit | Fail-closed only if product demands it |
- Per-IP is a safety net, not a primary identity — NATs hide many users behind one address.
- Layer limits so one logical action does not trip two layers and confuse legitimate clients.
Drills
Explain a token bucket in 30 seconds.
A bucket holds up to B tokens and refills at R per second. Each request spends one token; if the bucket is empty, reject. It allows bursts up to B then sustains R/s. State is just two numbers per identity — token count and last-refill timestamp — and the refill is computed lazily as min(B, tokens + elapsed × R) on the next access.
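The lazy refill in that answer fits in a few lines. This is a single-process sketch with the clock passed in; `burst` and `rate_per_s` stand for B and R, and the class name is illustrative.

```python
class TokenBucket:
    """Two numbers of state per identity: token count and last-refill time."""

    def __init__(self, burst: float, rate_per_s: float, now: float = 0.0):
        self.burst = burst
        self.rate_per_s = rate_per_s
        self.tokens = burst  # start full so a new client can burst
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Lazy refill: min(B, tokens + elapsed * R), computed on access.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.last_refill = max(self.last_refill, now)
        if self.tokens < 1.0:
            return False
        self.tokens -= 1.0
        return True
```

No background timer is needed: an idle bucket costs nothing until its next request arrives.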
Why is a fixed-window counter dangerous?
Boundary doubling. A client can send B requests in the last second of one window and B in the first second of the next: 2B in two seconds, twice the limit inside a span far shorter than one window. A sliding-window counter fixes this with two integers by weighting the previous window's count by the overlap fraction.
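The boundary problem is easy to reproduce. A small sketch, assuming a limit of 100 per 60-second window and an illustrative `FixedWindowCounter` class:

```python
class FixedWindowCounter:
    """Naive counter that resets at every window boundary."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.window_index = None
        self.count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_s)
        if index != self.window_index:
            # New window: the count starts again from zero.
            self.window_index = index
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowCounter(limit=100, window_s=60)
late = sum(limiter.allow(59.5) for _ in range(100))   # end of window 0
early = sum(limiter.allow(60.5) for _ in range(100))  # start of window 1
print(late + early)  # 200 admitted within one second
```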
Why not rate-limit purely by IP?
Corporate offices, universities, and mobile carriers put thousands of distinct users behind one NAT'd IP. Limiting that IP denies service to all of them. Per-IP belongs at the edge as a generous volumetric safety net; authenticated traffic should be keyed on API key or user id where each identity maps to one actor.
Your limiter's Redis is down. What happens?
Decide by limit class. Abuse / volumetric guardrails fail open or fall back to a conservative per-gateway local bucket — the origin's own capacity is the backstop, and you must not DoS yourself. Commercial quotas use a conservative local cap and reconcile on recovery. The Redis call always has a tight timeout so a slow Redis can't block requests or exhaust the pool.
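One way to encode that posture is a thin wrapper around the store call. Everything here is an assumption for illustration: `store_check` is a hypothetical callable that takes a timeout and raises the built-in `TimeoutError` or `ConnectionError` (a real client library has its own exception types that would need mapping), and `local_check` is an optional per-gateway fallback bucket.

```python
def check_limit(limit_class: str, store_check, local_check=None,
                timeout_s: float = 0.05) -> bool:
    """Return True to admit the request, False to throttle it."""
    try:
        # Tight timeout so a slow store cannot stall the request path.
        return store_check(timeout_s)
    except (TimeoutError, ConnectionError):
        if local_check is not None:
            # Conservative local bucket while the store is unavailable.
            return local_check()
        # No fallback configured: guardrails fail open, quotas fail closed.
        return limit_class == "abuse_guardrail"


def store_down(timeout_s: float) -> bool:
    """Stand-in for a counter store that is unreachable."""
    raise ConnectionError("counter store unreachable")


print(check_limit("abuse_guardrail", store_down))    # True: fail open
print(check_limit("commercial_quota", store_down))   # False: fail closed
```

Reconciling commercial quotas after recovery is out of scope for the sketch; it would need the local admissions to be recorded and replayed.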
Two requests hit the same key concurrently at 99/100. What can go wrong and how do you prevent it?
With a GET-check-INCR sequence both requests read 99, both decide they're under 100, and both proceed — over-admitting. Prevent it with an atomic server-side operation: INCR+PEXPIRE for window counters, or a Lua script that does the entire token-bucket refill-and-spend in one round-trip so no interleaving is possible.
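The same idea can be shown in a single process, with a lock standing in for the indivisibility Redis gives a server-side command or script. This is an analogy, not a Redis client example; `AtomicWindowCounter` is an illustrative name.

```python
import threading


class AtomicWindowCounter:
    """Check and increment as one indivisible step."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # The lock makes read-compare-increment atomic, so two callers
        # can never both observe 99 and both proceed.
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


counter = AtomicWindowCounter(limit=100)
results = []


def worker() -> None:
    results.append(counter.try_acquire())


threads = [threading.Thread(target=worker) for _ in range(300)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

admitted = sum(results)
print(admitted)  # 100, never more, regardless of interleaving
```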
How do you protect one expensive endpoint sharing a budget with cheap ones?
Give it its own per-path bucket so a flood of cheap requests can't drain the budget it needs, and if it's long-running, add a concurrency cap on in-flight requests per identity. For variable-cost endpoints (search, GraphQL), charge tokens proportional to the work — a wildcard query costs more tokens than a keyed lookup — so the limit tracks real server load rather than raw request count.
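A sketch of per-path buckets with cost-weighted spending. The budgets, the `/search` costs, and the function names are hypothetical; the point is only that the expensive path has its own bucket and that a wildcard query debits more than a keyed lookup.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    burst: float
    rate_per_s: float
    tokens: float
    last_refill: float = 0.0

    def spend(self, now: float, cost: float) -> bool:
        """Lazy refill, then debit `cost` tokens if they are available."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.last_refill = max(self.last_refill, now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


# Hypothetical budgets as (burst, sustained rate per second).
PATH_BUDGETS = {"/search": (40.0, 2.0)}
DEFAULT_BUDGET = (100.0, 10.0)
_buckets: dict[tuple[str, str], Bucket] = {}


def query_cost(path: str, wildcard: bool = False) -> float:
    """Charge in proportion to expected work (placeholder weights)."""
    if path == "/search":
        return 20.0 if wildcard else 5.0
    return 1.0


def allow(api_key: str, path: str, now: float, wildcard: bool = False) -> bool:
    scope = path if path in PATH_BUDGETS else "default"
    burst, rate = PATH_BUDGETS.get(path, DEFAULT_BUDGET)
    bucket = _buckets.setdefault(
        (api_key, scope), Bucket(burst, rate, tokens=burst, last_refill=now)
    )
    return bucket.spend(now, query_cost(path, wildcard))
```

Because the bucket key includes the path scope, draining the `/search` budget leaves the default budget for the same API key untouched, and vice versa.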
When to reach for this