advanceddeep dive

Consensus and leader election

Raft-style coordination, leases, fencing tokens, quorum, and exactly-one leadership.

~10 min read

Raft and Paxos aren't trivia — they're the reason your leader-election design either works or deadlocks. Most interview failures here are: "we'll elect a leader somehow" with no quorum story and no fencing.

Read this if your last attempt…

You said "we'll have a leader" without a quorum
You can't explain why majority quorum matters
You don't know what a fencing token is
You think ZooKeeper and etcd are the same as your DB

The concept

Consensus protocols (Raft, Paxos, Multi-Paxos, ZAB) solve one question: how do a group of N nodes agree on one value in the face of crashes and network partitions?

Majority quorum (N/2 + 1): the only way to guarantee that two concurrent decisions can't both succeed. 3 nodes → 2-of-3. 5 nodes → 3-of-5. Even-numbered clusters are wasteful — 4 nodes still needs 3, same as 5 but with less fault tolerance.

Architecture diagram· Raft election with 5 nodes

Majority (3) needed. Odd numbers only. Terms prevent split-brain. Fencing tokens protect shared resources from stale leaders.

Where consensus lives in your system.

Need	Use	Why
Distributed lock / single writer	etcd, ZooKeeper	Built-in leader election + fencing tokens
Replicated state machine	Raft-based DB (etcd, TiKV, CockroachDB)	Linearizable writes via consensus
Config / service discovery	Consul, etcd	Strong consistency on a small dataset
Leader per shard (app-level)	ZooKeeper ephemeral nodes	Established pattern; many libraries
Eventual consistency is fine	Don't use consensus	It's expensive; avoid when not needed

How interviewers grade this

You name the consensus system (etcd / ZooKeeper / Raft-based DB).
You state the quorum (majority of N, N odd).
You have a fencing story for shared-resource access.
You don't reinvent consensus — you delegate to a proven system.
You understand that consensus is slow (N RTTs) and keep it off the hot path.

Variants

Delegate to etcd/ZK/Consul

Don't implement consensus; run a battle-tested system.

The correct default. etcd / ZooKeeper / Consul each expose leader-election + distributed-lock primitives. Your app uses the library; the heavy lifting is someone else's problem.

Pros

+Battle-tested (years of production)
+Fencing tokens / lease IDs built in
+Well-understood failure modes

Cons

−Another system to run / depend on
−Adds network hops — keep off the hot path

Choose this variant when

Distributed locks
Leader-per-shard coordination
Config / service discovery

Raft inside the DB

Use a DB whose replication is Raft-based.

CockroachDB, TiKV, FoundationDB, some MongoDB configs — the DB handles consensus for you. Every write is Raft-committed across replicas. You get linearizable reads/writes for free.

Pros

+Consensus invisible to app code
+Strong consistency out of the box
+Failover is automatic

Cons

−Cross-region writes are slow (consensus latency)
−Operational complexity of distributed DB
−Fewer DBs support this than support async replication

Choose this variant when

Need linearizable writes
Can tolerate consensus latency
Team can operate a distributed DB

Don't use consensus

Eventual consistency is usually fine.

The honest answer for most interview systems. Tweets, photos, likes, timelines — none need linearizability. Consensus is for coordination hot-spots (inventory, payments, locks), not for most data.

Pros

+No consensus cost
+Easy to scale reads/writes independently
+Simpler operationally

Cons

−Can't provide "right now" consistency
−Some workflows fundamentally need it

Choose this variant when

Feeds, social, analytics, most data
When eventual is acceptable

Worked example

Scenario: a payment processor needs to ensure only one worker processes a given order at a time, and that a slow-and-then-resumed worker can't double-charge.

Bad design: "lock in Redis with SET NX EX 30". Worker takes lock, GCs for 40s, lock expires, another worker takes it — two workers processing the same order. Classic fencing-token failure.

Correct design: use etcd's lease + compare-and-swap (or ZooKeeper's sequential ephemeral nodes).

1Worker acquires lock on order_id, gets lease_id = 42.
2Every downstream call (payment gateway, DB update) includes the lease_id.
3Downstream storage rejects writes with lease_id < current_max_for_this_order.
4If worker GCs and a new worker takes the lock, the new worker gets lease_id = 43.
5Old worker comes back, tries to commit with lease 42 — rejected. No double-charge.

Cluster sizing: etcd with 3 nodes (tolerates 1 failure) or 5 nodes (tolerates 2). 7+ rarely justified — the quorum-RTT cost grows without much fault-tolerance gain. Never 2 or 4 (same fault tolerance as 1 or 3, with more cost).

Good vs bad answer

Interviewer probe

“How do you elect a leader for a shard?”

Weak answer

"Redis SETNX as a lock with a TTL."

Strong answer

"Delegate to etcd or ZooKeeper. Either gives us a proven leader-election + lease primitive. etcd's lease_id is a monotonic fencing token — I include it in every downstream write, and the downstream resource rejects stale tokens. This prevents the classic zombie-leader bug where a GC-paused old leader comes back and clobbers the new leader's writes. Cluster size: 3 or 5, odd, for majority quorum. Never roll our own consensus — Raft is 30 pages of subtle correctness, and etcd has 8+ years of production hardening."

Why it wins: Delegates to proven system, explains fencing, gets quorum right, rejects DIY.

Interview playbook2–3 min on coordination-heavy designs (locks, scheduler, primary election)

When it comes up

You need a distributed lock or a single writer per shard
A primary/leader must be elected among replicas
Config, metadata, or service discovery needs strong consistency
The interviewer asks "how do you avoid two leaders?"

Order of reveal

1
1. Delegate, do not build. I reach for etcd, ZooKeeper, or Consul rather than implementing Raft — the spec is subtle and these have years of production hardening.
2
2. Quorum + odd count. Majority quorum on an odd cluster: 3 tolerates one failure, 5 tolerates two. Even sizes cost more for the same fault tolerance.
3
3. Fencing tokens. Every lease carries a monotonic fencing token; the shared resource rejects writes with a stale token, so a paused old leader cannot clobber the new one.
4
4. Keep it off the hot path. Consensus costs multiple round-trips, so I use it for coordination points — leader changes, config — not on every request.
5
5. Or do not use it. For feeds, social, most data, eventual consistency is fine and consensus is wasted cost. I only pay for it at real coordination hot-spots.

Signature phrases

“I delegate consensus to etcd or ZooKeeper — Raft is 30 pages of subtle correctness.”

“Majority quorum, odd node count — 3 tolerates one failure, 5 tolerates two.”

“Every lock carries a fencing token the downstream validates.”

“Consensus is slow by design — keep it off the request path.”

“I delegate consensus to etcd or ZooKeeper — Raft is 30 pages of subtle correctness.” — Signals you know not to roll your own.
“Majority quorum, odd node count — 3 tolerates one failure, 5 tolerates two.” — Gets the quorum math right, which many candidates fumble.
“Every lock carries a fencing token the downstream validates.” — The detail that makes a distributed lock actually safe.
“Consensus is slow by design — keep it off the request path.” — Shows you understand the latency cost.

Likely follow-ups

?“Why is a Redis SETNX lock with a TTL unsafe?”Reveal

It has no fencing. The holder can stall (a GC pause, a slow disk) past the TTL; the lock expires; a second worker acquires it; then the first worker wakes up still believing it owns the lock and writes — two writers clobber the resource. The fix is a monotonic fencing token issued with the lock that the downstream resource checks, rejecting any write carrying a token lower than the highest it has seen.

?“Why odd cluster sizes?”Reveal

Fault tolerance is determined by the majority threshold. 3 nodes need 2 for quorum (tolerate 1 failure); 4 nodes need 3 (still tolerate only 1) — so 4 costs more than 3 for identical resilience. 5 needs 3 (tolerate 2). Each even size has the same tolerance as the odd below it at higher cost, so you always pick odd: 3 for most, 5 for critical.

?“5-node cluster, 3 nodes are partitioned away from the other 2. Can the 2 serve writes?”Reveal

No — 2 is not a majority of 5, so the minority side cannot elect a leader or commit writes. The 3-node side has quorum, elects a leader, and keeps serving. The 2-node side can at most serve stale follower reads. When the partition heals, the minority rejoins as followers and catches up from the leader’s log. This is exactly how majority quorum prevents split-brain.

Common mistakes

Lock without fencingAdvanced

Redis SETNX-style locks: a slow process thinks it still owns the lock, writes unsafely. Any distributed lock MUST have a monotonic fencing token that downstream resources validate.

Even-numbered quorum

4-node cluster has the same fault tolerance as 3 (tolerates 1 failure) but requires 3-of-4 quorum. Wasted. Always odd.

Consensus on the hot pathAdvanced

Consensus is slow (multiple RTTs). Don't put it on every request. Use it for coordination points (leader changes, config updates) and keep normal ops out of it.

Rolling your own Raft

The spec is subtle; edge cases are hostile. Use etcd, ZooKeeper, Consul, or a consensus-based DB. Roll-your-own is an antipattern outside academia.

Practice drills

Why exactly 3 or 5 nodes?Reveal

3 tolerates 1 failure (quorum 2 of 3). 5 tolerates 2 failures (quorum 3 of 5). 7 tolerates 3 but the write latency grows (more nodes to ack). 4 tolerates 1 (quorum 3 of 4) but costs more than 3 for the same fault tolerance. Even numbers are strictly worse than the next-lower odd. Production default: 3 for most; 5 for critical systems that need survival of simultaneous-two-node failures.

Interviewer: "what's a fencing token, concretely?"Reveal

A monotonically increasing number issued by the coordinator (etcd lease_id, ZooKeeper zxid). Every lease/lock reissue bumps the token. Downstream resources (DBs, storage) track the highest token they've seen and reject writes with a lower token. This prevents a slow ex-leader from committing writes after a new leader has been elected. Without fencing, distributed locks are fundamentally unsafe.

Your Raft cluster has 5 nodes; 3 are in a partition. Can the other 2 serve writes?Reveal

No. 2 is not a majority of 5. They can't elect a leader; they can't commit writes. They can serve stale reads (if the DB allows "follower reads") but anything requiring consensus halts. The 3-node side has quorum, elects a leader, continues serving. When the partition heals, the 2-node side rejoins as followers and catches up.

Cheat sheet

•Consensus = majority agreement in N nodes. Majority = N/2 + 1.
•Always odd cluster sizes. 3 and 5 are normal.
•Delegate to etcd / ZooKeeper / Consul / Raft-based DB. Don't build it.
•Fencing tokens prevent zombie-leader clobber. Non-optional for locks.
•Split-brain impossible with majority quorum + term numbers.
•Keep consensus off the hot path — it's slow by design.

Practice this skill

These problems exercise Consensus and leader election. Try one now to apply what you just learned.

job scheduler

Read this if

Need

Use

Why

Distributed lock / single writer

etcd, ZooKeeper

Built-in leader election + fencing tokens

Replicated state machine

Raft-based DB (etcd, TiKV, CockroachDB)

Linearizable writes via consensus

Config / service discovery

Consul, etcd

Strong consistency on a small dataset

Leader per shard (app-level)

ZooKeeper ephemeral nodes

Established pattern; many libraries

Eventual consistency is fine

Don't use consensus

It's expensive; avoid when not needed