Learn
A complete, interview-graded curriculum — 19 graded skills, 15 deep dives, 17 patterns, and 14 technology breakdowns, each with worked answers, decision tables, and an interview playbook. Start with a guided track, or jump to the concept your last debrief flagged.
Fundamentals
Skills: Estimation, API contracts, storage, reliability, trade-offs, and other scored skills.
Deep dives
Topics: Idempotency, indexing, protocol choice, backpressure, leader election, CDC, and more.
Patterns
Shapes: Use these when you need the system shape, not just one isolated concept.
Technologies
Tools: Redis, Postgres, Kafka and friends, with internals, performance numbers, and interview playbooks.
New here? Start with the framework
The six phases of a design round, with time budgets and exactly what to say in each. The meta-skill candidates fail on most — learn it before the concepts.
Start here
Pick the closest track, work top to bottom, then let debriefs choose the next lesson.
First time
Build the core vocabulary every senior engineer has.
Crunch time
The 90-minute crash course that plugs the most common gaps.
Data modelling
For engineers whose designs are strong on compute, weak on storage.
Recognise fast
Learn the six system shapes interviewers expect you to name on sight.
Full library
Fundamentals, deep dives, patterns, technologies, and paths are below. Search first if a debrief named a concept.
Frame a prompt, bound its scope, and draft a defensible API contract in under 10 minutes.
Best for: Engineers who run out of time before reaching the architecture
Fundamentals
Every card is a graded skill, grouped by area and labelled so you can build the skill map deliberately — or jump straight to the one your last debrief flagged.
Requirements & scope
The first five minutes decide the next forty. Candidates who skip scoping design the wrong system brilliantly — and still fail. The ones who nail it look senior before they've drawn a single box.
Capacity & estimation
Every downstream decision — cache size, shard count, replica count — collapses onto one question: what are the numbers? Candidates who skip this are designing vibes, not systems.
Capacity & estimation
If your design decisions aren't backed by numbers, they're opinions. Knowing that Redis handles 100K ops/sec or that a cross-region RTT is 60ms isn't trivia — it's what separates "we'll add a cache" from "we need 3 Redis instances because our hot-path is 200K reads/sec."
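The Redis sizing in that example can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the ballpark figures quoted above (100K ops/sec per instance, 200K reads/sec on the hot path) plus a hypothetical 70% target utilisation for headroom. The function name and the utilisation figure are illustrative assumptions, not measured values.

```python
import math


def instances_needed(peak_ops_per_sec: float,
                     per_instance_ops_per_sec: float,
                     target_utilisation: float = 0.7) -> int:
    """Smallest instance count that keeps each instance at or below the target utilisation."""
    usable_ops_per_instance = per_instance_ops_per_sec * target_utilisation
    return math.ceil(peak_ops_per_sec / usable_ops_per_instance)


# Hot path of 200K reads/sec against instances assumed to handle 100K ops/sec each.
redis_nodes = instances_needed(200_000, 100_000)
print(redis_nodes)
```

With no headroom (a target utilisation of 1.0) the same inputs give two instances, which is why the headroom assumption is worth stating out loud.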
Architecture
The high-level design (HLD) is the diagram every subsequent question is asked against. Clear boundaries plus explicit dataflow beat clever components every time. Most candidates overdraw; seniors underdraw and label.
Architecture
You can't design distributed systems without understanding how bytes travel from client to server and back. DNS, TCP, TLS, HTTP — these aren't trivia. They're the latency budget, the failure modes, and the protocol choices that underpin every design decision you make.
Architecture
Queues decouple producers from consumers — but the delivery semantics come with sharp edges. Exactly-once is a lie; at-least-once + idempotent consumers is the truth.
API design
The API is the contract every client writes code against. Vague endpoints here metastasize into ambiguity everywhere else in the design. Interviewers use API design to separate candidates who have shipped from candidates who have read blog posts.
Data & storage
"We'll put it in Postgres" is not a data model. The data model is entities, keys, relationships, cardinalities, and the access patterns each one has to serve — and it locks in every trade-off you will chase for the rest of the design.
Data & storage
Picking a database is a first-principles decision, not a defaults one. "We use Postgres" is a cultural statement; "the access pattern is point-lookup at 100k QPS with eventual consistency, so we use DynamoDB" is a design.
Data & storage
The partition key is the single most consequential decision in a distributed data design. Pick it wrong and no amount of horsepower recovers you — the hot shard stays hot, the rebalance never finishes, and the team spends a quarter migrating.
Scalability
"We'll add a cache" is where weak designs die. Interviewers ask: which one, caching what exactly, with what TTL, invalidated how, behind which API boundary? If you can't answer all five, the cache line in your diagram is decoration.
Scalability
L4 vs L7 is not a trivia question — it's about whether the LB can make decisions based on the request content. One is dumb and fast; the other is smart and expensive. Most prompts want L7 at the edge and L4 between services.
Scalability
Simple hash(key) % N breaks the moment N changes: nearly every key remaps, every cache goes cold, and every shard rebalances. Consistent hashing moves only about 1/N of the keys on average when a node joins or leaves. It's the algorithm behind many production cache clusters and distributed databases.
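The difference is easy to measure. The sketch below compares how many keys change owner when a fifth node joins a four-node cluster, first with modulo placement and then with a hash ring. The node names, the key format, the MD5-based hash, and the 100 virtual nodes per server are all illustrative assumptions, not a description of any particular product.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_owner(key: str, nodes: list) -> str:
    return nodes[h(key) % len(nodes)]


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._points = sorted((h(f"{node}#{i}"), node)
                              for node in nodes for i in range(vnodes))
        self._hashes = [point for point, _ in self._points]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, h(key)) % len(self._points)
        return self._points[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
old_nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
new_nodes = old_nodes + ["cache-e"]

modulo_moved = sum(modulo_owner(k, old_nodes) != modulo_owner(k, new_nodes)
                   for k in keys) / len(keys)

old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved_keys = [k for k in keys if old_ring.owner(k) != new_ring.owner(k)]
ring_moved = len(ring_moved_keys) / len(keys)

print(f"modulo: {modulo_moved:.0%} of keys moved")
print(f"ring:   {ring_moved:.0%} of keys moved")
```

Going from four to five nodes, modulo placement should move roughly four keys in five, while the ring should move roughly one in five, and every key that moves on the ring lands on the new node.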
Reliability
Systems don't fail because you didn't think they could. They fail in the ways you failed to think about. Failure-mode analysis is structured paranoia, and interviewers grade on whether you can produce it on demand.
Reliability
Replication is how you survive a node death; durability is how you survive a bad deploy. Candidates confuse the two and end up with a design that's highly available but cheerfully corrupt.
Reliability
You cannot operate what you cannot see; you cannot page on what you cannot measure. Candidates who design beautiful systems with no metrics, no logs, and no alerts are designing systems their on-call team will hate.
Trade-offs
CAP is not a trivia question. It's the trade-off that every distributed system lives under, and getting it wrong is how you end up with "strong consistency" backed by a single node — or "eventual consistency" on data that absolutely cannot be eventually wrong.
Performance
A budget you don't compute is a budget you'll blow. Every synchronous hop costs milliseconds you don't get back — and tail latency isn't the average plus a bit, it's a different animal.
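Both halves of that claim reduce to short arithmetic. This is a rough sketch, assuming sequential hops whose latencies simply add and parallel backend calls that are slow independently of each other. The function names, the 200 ms budget, and the hop latencies are hypothetical.

```python
def remaining_budget_ms(budget_ms: float, hop_latencies_ms: list) -> float:
    """Budget left after a chain of synchronous, sequential hops."""
    return budget_ms - sum(hop_latencies_ms)


def p_any_slow(fan_out: int, per_call_slow_prob: float = 0.01) -> float:
    """Chance that at least one of `fan_out` parallel calls lands in its slow tail.

    With per_call_slow_prob = 0.01, "slow" means "worse than that backend's p99".
    Assumes the calls are independent.
    """
    return 1 - (1 - per_call_slow_prob) ** fan_out


# Hypothetical request: 200 ms budget, three sequential hops.
print(remaining_budget_ms(200, [5, 20, 60]))

# A request that fans out to 100 backends and waits for all of them.
print(f"{p_any_slow(100):.0%}")
```

Under these assumptions a single call hits the p99 tail 1% of the time, but a request that waits on 100 such calls hits it well over half the time, which is why the tail cannot be treated as the average plus a margin.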
Security & abuse
Rate limits are the only thing between your free tier and a botnet. A system without them is not a product — it's a target.
Deep dives
Single-concept lessons that extend the grading taxonomy — idempotency, indexing, CDC, protocol choice, backpressure, leader election, and more. If a topic is also a pattern, read the lesson for the mechanism and the pattern for the end-to-end answer shape.
Architecture
REST for humans, gRPC for services, GraphQL for views, async for anything that takes more than a second. Most candidates default to "REST everywhere" — that's fine until it isn't, and the interviewer will find the seam.
Architecture
Polling, long-polling, SSE, WebSocket — the transport choice is about connection count and direction, not cleverness. Pick wrong and you pay in money or in user-visible lag.
API design
Every retry is a test of your idempotency design. Networks drop, clients retry, at-least-once queues redeliver — if your write path can't absorb a duplicate without double-charging or double-sending, you'll discover it at 2 AM on a Sunday.
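One common way to absorb duplicates is an idempotency key: the client sends the same key on every retry, and the server returns the stored result instead of repeating the side effect. The sketch below is an in-memory illustration only. `PaymentService`, `charge`, and the key format are hypothetical names, and a real write path would keep the key-to-result mapping in a durable store rather than a dict.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    # Maps idempotency key -> the result first produced for that key.
    _results: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock)
    charges_made: int = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # The lock makes "check, then record" atomic within this process.
        with self._lock:
            if idempotency_key in self._results:
                # Duplicate delivery: replay the stored result, no new charge.
                return self._results[idempotency_key]
            self.charges_made += 1
            result = {"charge_id": f"ch_{self.charges_made}",
                      "amount_cents": amount_cents}
            self._results[idempotency_key] = result
            return result


service = PaymentService()
first = service.charge("order-42-attempt", 500)
retry = service.charge("order-42-attempt", 500)  # e.g. client timed out and retried
print(first == retry, service.charges_made)
```

The retry receives the same charge record as the first call, and only one charge is recorded.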
Data & storage
The index you picked three months ago decides your query latency today — and the one you didn't create decides which queries you can't ship. Indexing is not "add indexes until it's fast"; it's a first-principles match between query shape and index structure.
Data & storage
Full-text search is an inverted index plus relevance; the hard part is relevance. LIKE '%foo%' on a SQL table is not search — it's a table scan waiting to die.
Data & storage
Files above a few MB should never touch your app server. Every byte that flows through your app is bandwidth you pay for, memory you stress, and latency you inflict.
Data & storage
Picking the right partition key is more important than picking the right database. Wrong key = hot shards, unhappy scans, and painful rebalancing — regardless of how good the database is.
Data & storage
Change-data-capture is how you keep read models fresh without dual writes. The DB's log is already your most reliable event stream — use it.
Data & storage
Rectangular lat/lng queries on a plain B-tree die at a few thousand rows. "Find all riders within 5 km" is not a B-tree query — it needs a spatial index.
Scalability
The best cache is the one that never hits your origin. A CDN turns "my origin is overloaded" into "my origin is bored" — if you use it right.
Reliability
Exactly-once is a lie you tell clients; at-least-once + idempotent consumers is the truth. Even Kafka's "exactly-once" is exactly-once *within Kafka* — not end to end.
Reliability
Raft and Paxos aren't trivia — they're the reason your leader-election design either works or deadlocks. Most interview failures here are: "we'll elect a leader somehow" with no quorum story and no fencing.
Reliability
Active-active is not just active-passive with two active sides — it's a whole conflict-resolution design. You either pick partitioning, last-write-wins, or CRDTs. There is no fourth option.
Performance
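Little's Law isn't optional: in a stable system, items in flight equal arrival rate times time in the system, and once the producer outruns the consumer there is no stable state, so the queue grows without bound. Unbounded queues are a bug, not a feature.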
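Little's Law (L = λW) and a bounded queue both fit in a few lines. This is a minimal sketch using Python's standard `queue` module. The helper names, the arrival rate, and the queue size of 3 are illustrative assumptions, and rejecting work when the queue is full is only one possible backpressure policy.

```python
import queue


def items_in_system(arrival_rate_per_sec: float, avg_time_in_system_sec: float) -> float:
    """Little's Law: average items in a stable system = arrival rate x time in system."""
    return arrival_rate_per_sec * avg_time_in_system_sec


def offer(q: queue.Queue, item) -> bool:
    """Try to enqueue without blocking; False tells the producer to back off."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


# 200 requests/sec, each spending 0.5 s in the system on average.
print(items_in_system(200, 0.5))

# A bounded queue with no consumer running: the producer is refused once it is full.
work = queue.Queue(maxsize=3)
accepted = [offer(work, i) for i in range(5)]
print(accepted)
```

The first three offers are accepted and the rest are refused, so the overload becomes visible to the producer instead of accumulating silently in memory.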
Security & abuse
Session vs JWT is not a religion; it's a trade-off between revocation and statelessness. Most systems need both: short JWT access tokens for speed, long refresh tokens with server state for revocation.
Patterns
Patterns compose fundamentals and deep dives into complete architectures. Once the mechanisms click, naming the shape on sight is what separates a strong design round from a struggle.
Workload shape
Reads dominate writes by 10:1 or more. Every layer exists to keep the primary out of the hot path.
Workload shape
Writes arrive faster than any single node can persist them synchronously. The design is about absorbing, spreading, and deferring them.
Workload shape
The boring default. Synchronous HTTP with a cache. Works for 70% of APIs and you should say so.
Execution shape
Accept work with 202 + job_id, process asynchronously, and let clients track progress via poll, push, or webhook.
Execution shape
Decouple rate of production from rate of consumption with a durable queue and autoscaled workers.
Execution shape
Coordinate multi-service workflows via compensating transactions instead of distributed locks — choreography for simple flows, orchestration for everything else.
Execution shape
Separate the connection tier from the delivery tier so live updates scale without losing reconnect and catch-up guarantees.
Data movement & fan-out
Where does the work live — at write time (push to every follower's inbox) or at read time (gather from each followed user)? Both break at the extremes; hybrids win.
Data movement & fan-out
The cheapest request is the one that never hits your origin. Push static and near-static content to the edge and let the CDN absorb 80–99% of reads.
Data movement & fan-out
Geographic distribution for latency, DR, and compliance. Active-passive is operationally sane; active-active is a conflict-resolution project.
Specialised shapes
Inverted index + ranking service. The hard part isn't indexing — it's relevance, freshness, and a rebuild path.
Specialised shapes
Geohash / S2 / H3 spatial index for "nearby X" queries. Straight lat/lng on a B-tree dies at a few thousand rows.
Specialised shapes
Chunked, resumable uploads direct to blob storage via signed URLs. The app server never touches the bytes; processing is always async.
Specialised shapes
Feature store + candidate generation + ranking. Offline training, online serving. Separate the cheap retrieval from the expensive scoring.
Specialised shapes
Token bucket + distributed counter. The bucket math is trivial; the distribution, identity, and failure posture are the real interview.
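The bucket math really is short. The sketch below is a single-process token bucket with an injectable clock so it can be exercised deterministically. The class and parameter names are illustrative, and it deliberately leaves out the parts the card calls the real interview: shared counters across nodes, caller identity, and what to do when the counter store is unavailable.

```python
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity      # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill lazily for the time elapsed since the last call, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock: burst of 2, refilling 1 token per second.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_sec=1, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]
fake_now[0] = 1.0                    # one second later, one token has refilled
after_wait = [bucket.allow() for _ in range(2)]
print(burst, after_wait)
```

The burst admits two requests and rejects the third; after one simulated second a single further request is admitted.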
Reliability & infra
Redundancy + graceful degradation + operational discipline. You don't buy 99.99% — you earn it.
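The arithmetic behind an availability target can be sketched as follows, assuming component failures are independent, which real systems only approximate. The function names and the example availabilities are illustrative.

```python
import math


def serial(*availabilities: float) -> float:
    """Every component must be up: availabilities multiply."""
    return math.prod(availabilities)


def redundant(availability: float, replicas: int) -> float:
    """At least one of `replicas` identical, independent copies must be up."""
    return 1 - (1 - availability) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * 365 * 24 * 60


# Two 99.9% services in series are worse than either alone.
print(f"{serial(0.999, 0.999):.4%}")

# Two independent 99% replicas behind a failover reach about 99.99%.
print(f"{redundant(0.99, 2):.4%}")

# What a 99.99% target allows per year.
print(f"{downtime_minutes_per_year(0.9999):.1f} minutes")
```

A 99.99% target leaves roughly 53 minutes of downtime per year, and every dependency added in series spends part of that allowance.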
Reliability & infra
Raft / Paxos via etcd / ZooKeeper. When exactly-one-of-N must do the thing, use a consensus service — don't roll your own.
Technologies
Interviewers expect you to name a specific technology and go deep — a Redis sorted set, a Postgres GIN index — not just 'a cache' or 'a database'. Each deep dive covers internals, real performance numbers, interview capabilities, and failure modes.
Databases & datastores
The sensible default database for almost every product-design interview — ACID, rich SQL, JSONB, full-text and geo extensions, and far more scale on one node than candidates assume.
Databases & datastores
A fully-managed, horizontally-scaling key-value and document store with single-digit-millisecond latency at any scale — when your access patterns are known and key-based, it removes sharding and capacity planning entirely.
Databases & datastores
A masterless, linearly-scalable wide-column store built for enormous write throughput and multi-region availability — when you have a write firehose and known partition-key queries, it scales by simply adding nodes.
Cache, search & streaming
An in-memory, single-threaded data-structure store — the most versatile tool in an interview because one technology covers caching, locks, leaderboards, rate limits, geo, queues, and pub/sub.
Cache, search & streaming
A distributed search and analytics engine built on Lucene — the default answer for full-text search, faceted filtering, and log/observability analytics when SQL LIKE and a B-tree index fall over.
Cache, search & streaming
A store that indexes high-dimensional embedding vectors for fast similarity search — the retrieval engine behind semantic search, recommendations, and RAG over your own data.
Cache, search & streaming
A distributed, partitioned, replicated append-only log — the default backbone for high-throughput event streaming, decoupling services, and feeding many consumers from one durable stream.
Cache, search & streaming
A distributed stream-processing engine for stateful, low-latency computation over unbounded data: windowed aggregations, stream joins, and real-time analytics with exactly-once state-consistency guarantees.
Storage, edge & coordination
Effectively infinite, cheap, durable object storage for large unstructured files — images, video, backups, logs — that should never live in your database.
Storage, edge & coordination
The single front door to a backend: one entry point that authenticates, rate-limits, routes, and shapes every request so individual services do not each re-implement cross-cutting concerns.
Storage, edge & coordination
The component that spreads incoming traffic across a pool of servers, removes unhealthy ones, and gives clients one stable endpoint — the foundation of horizontal scaling and availability.
Storage, edge & coordination
A globally distributed cache that serves content from edge locations near users — cutting latency, absorbing read traffic off your origin, and shielding it from spikes.
Storage, edge & coordination
Strongly-consistent coordination services that solve the hard distributed problems — leader election, distributed locks, configuration, and service discovery — so you never have to implement consensus yourself.
Storage, edge & coordination
A buffer between producers and workers that decouples them, absorbs bursts, and processes work asynchronously with built-in retries and dead-letter handling — the simple task-queue answer when you do not need a full event log.
Reading paths
Each path is an opinionated sequence of lessons and patterns that builds one capability end to end. Pick the closest to your gap — or the recommended one above if you have practice data.
6 stops
Turn a vague prompt into a designable problem, sketch the right high-level shape, and defend the API contract.
6 stops
Walk into the round with the five most-tested patterns and the failure-mode framework rehearsed.
5 stops
Pick the right store, the right partition key, and the right indexes for a given prompt — with defensible reasoning.
7 stops
Recognise which of six system shapes a prompt maps to within the first two minutes, and narrate the v1 → v2 → v3 scaling path cold.
6 stops
Name a failure mode for each component, a mitigation for each, and an availability target + topology that matches.
5 stops
Defend every design choice with a specific trade-off — not "it's faster" but "we trade X for Y at our scale".
5 stops
Frame a prompt, bound its scope, and draft a defensible API contract in under 10 minutes.
6 stops
Sound like an engineer who has shipped systems at scale, not one who has read about them.
6 stops
Design a push-based real-time system end-to-end: protocol, fan-out strategy, presence, back-pressure, and reconnection semantics.
7 stops
Name the right store, the right partition key, the right indexes, and the right replication mode — and defend each.