Consensus and leader election
Raft-style coordination, leases, fencing tokens, quorum, and exactly-one leadership.
Raft and Paxos aren't trivia — they're the reason your leader-election design either works or deadlocks. Most interview failures here are: "we'll elect a leader somehow" with no quorum story and no fencing.
Read this if your last attempt…
- You said "we'll have a leader" without a quorum
- You can't explain why majority quorum matters
- You don't know what a fencing token is
- You think ZooKeeper and etcd are the same as your DB
The concept
Consensus protocols (Raft, Paxos, Multi-Paxos, ZAB) solve one question: how do a group of N nodes agree on one value in the face of crashes and network partitions?
Majority quorum (N/2 + 1): the only way to guarantee that two concurrent decisions can't both succeed. 3 nodes → 2-of-3. 5 nodes → 3-of-5. Even-numbered clusters are wasteful — 4 nodes still needs 3, same as 5 but with less fault tolerance.
Majority (3) needed. Odd numbers only. Terms prevent split-brain. Fencing tokens protect shared resources from stale leaders.
Where consensus lives in your system.
| Need | Use | Why |
|---|---|---|
| Distributed lock / single writer | etcd, ZooKeeper | Built-in leader election + fencing tokens |
| Replicated state machine | Raft-based DB (etcd, TiKV, CockroachDB) | Linearizable writes via consensus |
| Config / service discovery | Consul, etcd | Strong consistency on a small dataset |
| Leader per shard (app-level) | ZooKeeper ephemeral nodes | Established pattern; many libraries |
| Eventual consistency is fine | Don't use consensus | It's expensive; avoid when not needed |
How interviewers grade this
- You name the consensus system (etcd / ZooKeeper / Raft-based DB).
- You state the quorum (majority of N, N odd).
- You have a fencing story for shared-resource access.
- You don't reinvent consensus — you delegate to a proven system.
- You understand that consensus is slow (N RTTs) and keep it off the hot path.
Variants
Delegate to etcd/ZK/Consul
Don't implement consensus; run a battle-tested system.
The correct default. etcd / ZooKeeper / Consul each expose leader-election + distributed-lock primitives. Your app uses the library; the heavy lifting is someone else's problem.
Pros
- +Battle-tested (years of production)
- +Fencing tokens / lease IDs built in
- +Well-understood failure modes
Cons
- −Another system to run / depend on
- −Adds network hops — keep off the hot path
Choose this variant when
- Distributed locks
- Leader-per-shard coordination
- Config / service discovery
Raft inside the DB
Use a DB whose replication is Raft-based.
CockroachDB, TiKV, FoundationDB, some MongoDB configs — the DB handles consensus for you. Every write is Raft-committed across replicas. You get linearizable reads/writes for free.
Pros
- +Consensus invisible to app code
- +Strong consistency out of the box
- +Failover is automatic
Cons
- −Cross-region writes are slow (consensus latency)
- −Operational complexity of distributed DB
- −Fewer DBs support this than support async replication
Choose this variant when
- Need linearizable writes
- Can tolerate consensus latency
- Team can operate a distributed DB
Don't use consensus
Eventual consistency is usually fine.
The honest answer for most interview systems. Tweets, photos, likes, timelines — none need linearizability. Consensus is for coordination hot-spots (inventory, payments, locks), not for most data.
Pros
- +No consensus cost
- +Easy to scale reads/writes independently
- +Simpler operationally
Cons
- −Can't provide "right now" consistency
- −Some workflows fundamentally need it
Choose this variant when
- Feeds, social, analytics, most data
- When eventual is acceptable
Worked example
Scenario: a payment processor needs to ensure only one worker processes a given order at a time, and that a slow-and-then-resumed worker can't double-charge.
Bad design: "lock in Redis with SET NX EX 30". Worker takes lock, GCs for 40s, lock expires, another worker takes it — two workers processing the same order. Classic fencing-token failure.
Correct design: use etcd's lease + compare-and-swap (or ZooKeeper's sequential ephemeral nodes).
- 1Worker acquires lock on order_id, gets lease_id = 42.
- 2Every downstream call (payment gateway, DB update) includes the lease_id.
- 3Downstream storage rejects writes with lease_id < current_max_for_this_order.
- 4If worker GCs and a new worker takes the lock, the new worker gets lease_id = 43.
- 5Old worker comes back, tries to commit with lease 42 — rejected. No double-charge.
Cluster sizing: etcd with 3 nodes (tolerates 1 failure) or 5 nodes (tolerates 2). 7+ rarely justified — the quorum-RTT cost grows without much fault-tolerance gain. Never 2 or 4 (same fault tolerance as 1 or 3, with more cost).
Good vs bad answer
Interviewer probe
“How do you elect a leader for a shard?”
Weak answer
"Redis SETNX as a lock with a TTL."
Strong answer
"Delegate to etcd or ZooKeeper. Either gives us a proven leader-election + lease primitive. etcd's lease_id is a monotonic fencing token — I include it in every downstream write, and the downstream resource rejects stale tokens. This prevents the classic zombie-leader bug where a GC-paused old leader comes back and clobbers the new leader's writes. Cluster size: 3 or 5, odd, for majority quorum. Never roll our own consensus — Raft is 30 pages of subtle correctness, and etcd has 8+ years of production hardening."
Why it wins: Delegates to proven system, explains fencing, gets quorum right, rejects DIY.
When it comes up
- You need a distributed lock or a single writer per shard
- A primary/leader must be elected among replicas
- Config, metadata, or service discovery needs strong consistency
- The interviewer asks "how do you avoid two leaders?"
Order of reveal
- 11. Delegate, do not build. I reach for etcd, ZooKeeper, or Consul rather than implementing Raft — the spec is subtle and these have years of production hardening.
- 22. Quorum + odd count. Majority quorum on an odd cluster: 3 tolerates one failure, 5 tolerates two. Even sizes cost more for the same fault tolerance.
- 33. Fencing tokens. Every lease carries a monotonic fencing token; the shared resource rejects writes with a stale token, so a paused old leader cannot clobber the new one.
- 44. Keep it off the hot path. Consensus costs multiple round-trips, so I use it for coordination points — leader changes, config — not on every request.
- 55. Or do not use it. For feeds, social, most data, eventual consistency is fine and consensus is wasted cost. I only pay for it at real coordination hot-spots.
Signature phrases
- “I delegate consensus to etcd or ZooKeeper — Raft is 30 pages of subtle correctness.” — Signals you know not to roll your own.
- “Majority quorum, odd node count — 3 tolerates one failure, 5 tolerates two.” — Gets the quorum math right, which many candidates fumble.
- “Every lock carries a fencing token the downstream validates.” — The detail that makes a distributed lock actually safe.
- “Consensus is slow by design — keep it off the request path.” — Shows you understand the latency cost.
Likely follow-ups
?“Why is a Redis SETNX lock with a TTL unsafe?”Reveal
It has no fencing. The holder can stall (a GC pause, a slow disk) past the TTL; the lock expires; a second worker acquires it; then the first worker wakes up still believing it owns the lock and writes — two writers clobber the resource. The fix is a monotonic fencing token issued with the lock that the downstream resource checks, rejecting any write carrying a token lower than the highest it has seen.
?“Why odd cluster sizes?”Reveal
Fault tolerance is determined by the majority threshold. 3 nodes need 2 for quorum (tolerate 1 failure); 4 nodes need 3 (still tolerate only 1) — so 4 costs more than 3 for identical resilience. 5 needs 3 (tolerate 2). Each even size has the same tolerance as the odd below it at higher cost, so you always pick odd: 3 for most, 5 for critical.
?“5-node cluster, 3 nodes are partitioned away from the other 2. Can the 2 serve writes?”Reveal
No — 2 is not a majority of 5, so the minority side cannot elect a leader or commit writes. The 3-node side has quorum, elects a leader, and keeps serving. The 2-node side can at most serve stale follower reads. When the partition heals, the minority rejoins as followers and catches up from the leader’s log. This is exactly how majority quorum prevents split-brain.
Common mistakes
Redis SETNX-style locks: a slow process thinks it still owns the lock, writes unsafely. Any distributed lock MUST have a monotonic fencing token that downstream resources validate.
4-node cluster has the same fault tolerance as 3 (tolerates 1 failure) but requires 3-of-4 quorum. Wasted. Always odd.
Consensus is slow (multiple RTTs). Don't put it on every request. Use it for coordination points (leader changes, config updates) and keep normal ops out of it.
The spec is subtle; edge cases are hostile. Use etcd, ZooKeeper, Consul, or a consensus-based DB. Roll-your-own is an antipattern outside academia.
Practice drills
Why exactly 3 or 5 nodes?Reveal
3 tolerates 1 failure (quorum 2 of 3). 5 tolerates 2 failures (quorum 3 of 5). 7 tolerates 3 but the write latency grows (more nodes to ack). 4 tolerates 1 (quorum 3 of 4) but costs more than 3 for the same fault tolerance. Even numbers are strictly worse than the next-lower odd. Production default: 3 for most; 5 for critical systems that need survival of simultaneous-two-node failures.
Interviewer: "what's a fencing token, concretely?"Reveal
A monotonically increasing number issued by the coordinator (etcd lease_id, ZooKeeper zxid). Every lease/lock reissue bumps the token. Downstream resources (DBs, storage) track the highest token they've seen and reject writes with a lower token. This prevents a slow ex-leader from committing writes after a new leader has been elected. Without fencing, distributed locks are fundamentally unsafe.
Your Raft cluster has 5 nodes; 3 are in a partition. Can the other 2 serve writes?Reveal
No. 2 is not a majority of 5. They can't elect a leader; they can't commit writes. They can serve stale reads (if the DB allows "follower reads") but anything requiring consensus halts. The 3-node side has quorum, elects a leader, continues serving. When the partition heals, the 2-node side rejoins as followers and catches up.
Cheat sheet
- •Consensus = majority agreement in N nodes. Majority = N/2 + 1.
- •Always odd cluster sizes. 3 and 5 are normal.
- •Delegate to etcd / ZooKeeper / Consul / Raft-based DB. Don't build it.
- •Fencing tokens prevent zombie-leader clobber. Non-optional for locks.
- •Split-brain impossible with majority quorum + term numbers.
- •Keep consensus off the hot path — it's slow by design.
Practice this skill
These problems exercise Consensus and leader election. Try one now to apply what you just learned.
Read this if