Data partitioning strategies
Range, hash, directory, tenant, and time partitioning with resharding consequences.
Picking the right partition key is more important than picking the right database. Wrong key = hot shards, unhappy scans, and painful rebalancing — regardless of how good the database is.
Read this if your last attempt…
- You picked a partition key by vibes
- You can't state how a query routes to a shard
- You don't have a hot-key mitigation plan
- You need a refresher before designing a multi-TB system
The concept
Partitioning divides a dataset across N nodes. The partition key determines which node owns each row. Three classic strategies:
- Range partitioning — partition by ordered ranges (user_id 0–999, 1000–1999, …). Natural for range scans (timestamps). Hotspot risk: all writes land on the last partition if the key is monotonic (auto-increment, now()).
- Hash partitioning — hash the key, mod by N. Even distribution by default. No efficient range scans. Resharding on N change is brutal unless you use consistent hashing.
- Directory / lookup — explicit mapping (user_id → shard_id) stored in a small coordination service. Flexible; adds a lookup hop. Used when access patterns are irregular (some users massive, most small).
Range: natural scans, monotonic hotspot. Hash: even, no scans. Directory: flexible, extra hop.
Partitioning strategies — pick by query pattern.
| Strategy | Best for | Weakness |
|---|---|---|
| Range by key | Range scans (time windows, pagination) | Monotonic keys = hotspot on last shard |
| Hash (simple mod N) | Even distribution, known N | Rehash storm when N changes |
| Consistent hash (ring + vnodes) | Dynamic scaling | No efficient range scans |
| Directory / lookup | Uneven tenants, explicit control | Lookup hop + coordination service |
| Compound (hash + range) | Both point and range queries | More complex key design |
How interviewers grade this
- You state the partition key and why it matches the dominant query.
- You name the strategy (range / hash / directory / consistent hash).
- You identify hot-key risk and the mitigation.
- You describe how resharding works when you add nodes.
- You acknowledge queries that don't use the partition key need fan-out.
Variants
Consistent hash ring
Hash both keys and nodes onto a ring; each key goes to the next node clockwise.
Modern default for caches and wide-column stores. Virtual nodes smooth the distribution. Adding a node moves only ~1/N of the keys instead of reshuffling everything.
Pros
- +Trivial scaling up/down
- +Even distribution with vnodes
- +Industry standard (Cassandra, Redis, DynamoDB)
Cons
- −No range scans
- −Does not solve hot-key problem
Choose this variant when
- Cache clusters
- Wide-column stores
- When N will change over time
Range partitioning
Partitions own contiguous ranges of the key; routing is by range lookup.
Great for time-series and pagination. Split a hot range in two when it grows. Be careful: auto-increment ids or timestamps make the newest partition a write hotspot — bucket with a hash prefix if needed.
Pros
- +Natural range scans
- +Incremental rebalancing via range splits
- +Good for time-series
Cons
- −Monotonic-key hotspot
- −Router must track range bounds
Choose this variant when
- Time-series
- Lexicographically-queried keys
- Workloads dominated by range scans
Directory / lookup
Explicit (key → shard) map in a coordination service.
For uneven tenants or workloads where strategy-based routing can't express the required placement. Move a specific big tenant to its own shard without rehashing the world.
Pros
- +Maximum flexibility
- +Easy per-tenant isolation
- +Good for multi-tenancy with skewed tenants
Cons
- −Directory is on the hot path — cache aggressively
- −Directory itself must be HA
Choose this variant when
- Multi-tenant SaaS with big + small tenants
- When predictable placement matters more than simplicity
Worked example
Design: partitioning for a messaging app, 100M users, skewed activity.
Dominant query: "recent messages for user X" — strongly user-scoped.
Choice: hash partition by user_id. 256 shards to start; consistent-hash ring for growth.
Hot key handling: celebrity users (million-follower accounts) produce disproportionate fan-out on write (message delivered to followers). Mitigation:
- Outbound delivery fanned through Kafka partitioned by recipient, so delivery load distributes even if one sender is hot.
- Inbox reads for celebrities cached aggressively at CDN.
- Outbox for celebrities stored in a separate shard with dedicated capacity.
Resharding: Each shard ~300k users at steady state. When a shard exceeds 500k, we split using consistent-hash ring ops: add 2 new virtual nodes, migrate the affected key ranges with a dual-write window, verify, cut over. Takes hours per shard; done behind the scenes.
Query patterns that DON'T match:
- "all messages sent in the last 5 minutes across everyone" = fan-out to all 256 shards. Expensive; we don't support it from the primary path. Instead, a separate analytics pipeline (CDC → ClickHouse) handles time-windowed global queries.
Good vs bad answer
Interviewer probe
“How do you shard your messaging data?”
Weak answer
"Hash partition by message_id. Even distribution."
Strong answer
"Hash by user_id, not message_id. The dominant query is 'recent messages for user X' — if I partition by message_id, every read fans out to all shards. user_id keeps reads shard-local. Consistent-hash ring with virtual nodes for easy scaling. Celebrity hot keys mitigated by (1) aggressive CDN caching of their inboxes, (2) Kafka fan-out for delivery so recipients distribute the write load. Global-time queries (analytics) go through a separate CDC-fed warehouse, not the primary shards. Resharding is a ring op + dual-write migration; done per shard when a shard exceeds ~500k users."
Why it wins: Matches key to dominant query, names hot-key mitigation, explains scaling, acknowledges the queries that don't fit.
When it comes up
- The dataset is multi-TB and will not fit on one node
- The interviewer asks "how do you shard this?"
- Write or storage volume exceeds a single primary
- A hot tenant / celebrity / viral item threatens one shard
Order of reveal
- 11. Key matches the dominant query. I pick the partition key that matches the most common access pattern, so the hot query stays on one shard instead of fanning out to all of them.
- 22. Strategy. Hash for even distribution, range for scan-heavy workloads, directory when tenants are wildly uneven.
- 33. Consistent hashing. For elastic membership I use a consistent-hash ring with virtual nodes so adding a node moves ~1/N of keys, not all of them.
- 44. Hot-key plan. I name the hot-key risk up front and the mitigation: cache it, split it with a suffix, or isolate it onto its own shard.
- 55. Cross-partition queries. Queries that do not include the partition key scatter-gather; I push those to an analytics replica rather than the primary shards.
Signature phrases
- “The partition key matters more than the database.” — Reframes sharding as an access-pattern decision, which is the senior lens.
- “Pick the key that matches the dominant query, or every read fans out.” — Ties the key choice directly to latency and cost.
- “Consistent hashing so adding a node moves 1/N of keys, not all of them.” — Shows you know why naive mod-N is a trap.
- “Name the hot key before launch — one celebrity takes down a shard.” — Demonstrates you anticipate skew instead of discovering it in prod.
Likely follow-ups
?“A query does not include the partition key. How is it served?”Reveal
It scatter-gathers: the router fans the query to every shard, each returns partial results, and a coordinator merges them — O(N shards) per query, fine occasionally but not on the hot path. If such a query is frequent, I maintain a global secondary index (a separate table partitioned by that other key) or serve it from an analytics replica fed by CDC, so the primary shards stay dedicated to their access pattern.
?“Walk me through adding a shard to a consistent-hash ring.”Reveal
The new node is hashed onto the ring at its virtual-node positions. Only the keys falling between each new vnode and its predecessor move — about 1/N of the total — and they migrate from exactly one existing node per range. During migration I dual-read (check new owner, fall back to old) until the copy completes, then cut over and drop the old copies. Contrast with mod-N, where changing N remaps nearly every key at once and cold-starts the whole cache.
?“Range or hash partitioning for a time-series workload?”Reveal
Range by time enables the killer query — efficient scans over a time window — but a monotonic key (now()) sends every write to the newest partition, a hotspot. The usual compromise is a compound key that buckets: partition by (date, hash(entity) % 16) so writes spread across 16 buckets per day while still allowing day-ranged scans. Pure hash kills the range scan; pure range hotspots the tail — bucketing balances both.
Common mistakes
Reads fan out to all shards; latency + cost explode. The key must align with your common access pattern.
Auto-increment ids or timestamps write to the last shard only. Bucket with a hash prefix or use hash partitioning.
Adding a node rehashes every key. Use consistent hashing.
One celebrity, one viral product, one power user will bring down a shard. Identify hot keys up-front and have a mitigation.
Practice drills
Why isn't simple `hash(key) mod N` enough?Reveal
Adding a node changes N → nearly every key remaps → massive data movement. Consistent hashing moves only ~1/N of keys when one is added. For a stable-N cache, mod N is fine. For any system where N will grow, use consistent hashing.
Interviewer: "what if one user has 10× the data of average?"Reveal
Hash partitioning spreads that one user's rows across one shard — they become a skew. Options: (1) shard their data at a finer granularity (by (user_id, object_type) or (user_id, time_bucket)); (2) move that user to a dedicated shard via directory partitioning; (3) cache their hot reads so the shard isn't reads-bound; (4) if the user is genuinely 100×, it's a product decision — a separate tier.
You're at the whiteboard. Shard count: 16 or 256?Reveal
Err higher. Resharding is the painful operation; you can run 256 shards on 8 physical nodes (multi-tenant the shards) initially, then scale nodes as shards grow. Going 16 → 256 later is 16× more painful than starting at 256 and growing from 8 physical nodes to 256 over time.
Cheat sheet
- •Partition key = the axis of your dominant query.
- •Hash for even distribution; range for scans; directory for uneven tenants.
- •Consistent hashing for dynamic N.
- •Hot keys: cache, split, or isolate. Plan before launch.
- •Cross-shard queries exist but are expensive — offload to analytics pipeline.
- •Resharding is an operational migration — plan the runbook.
Practice this skill
These problems exercise Data partitioning strategies. Try one now to apply what you just learned.
Read this if