Leader election / consensus
Raft / Paxos via etcd / ZooKeeper. When exactly-one-of-N must do the thing, use a consensus service — don't roll your own.
When to reach for this
Reach for this when…
- Distributed locks
- Primary election in a replica set
- Job / workflow coordination ("only one runs at a time")
- Configuration-change coordination
- Singleton services across a fleet
Not really this pattern when…
- Work is naturally partitioned (each worker owns disjoint keys)
- At-most-once can be relaxed (most async work tolerates retries)
- A single-master database already serialises it for you
Good vs bad answer
Interviewer probe
“One worker must run a nightly migration. Ensure only one runs.”
Weak answer
"Each worker checks a flag in the database; whoever sets it first wins, and clears it when done."
Strong answer
"A DB flag races under concurrent reads and can't survive leader failure without manual cleanup — and it has no fencing. I'd use etcd: each worker grants a 30-second lease and does a transactional put to /migrations/nightly conditioned on the key being empty. The CAS winner is leader; the others watch. The leader KeepAlives the lease every ~10 s while working, so if it crashes or partitions the lease expires within 30 s and a watcher takes over automatically — no human in the loop. Every write the worker makes to the migration tracker carries the fencing token from its lease (etcd's revision), and the tracker rejects writes with a stale token, so even an old leader that resumes after a GC pause can't corrupt state. As defence in depth, each migration step is idempotent, keyed by step_id, so a duplicate attempt in the worst case is a safe no-op. The etcd cluster is 3 nodes, isolated from the workload, and I set the lease TTL above our worst-case GC pause to avoid thrash."
Why it wins: Rejects the DB flag with the *three* reasons, uses a real consensus lease with heartbeat/watch/automatic failover, enforces fencing at the sink (the actual correctness property), adds idempotency as a safety net, and gets the quorum size, cluster isolation, and TTL-vs-pause tuning right. The weak answer races, can't detect a dead leader, blocks on crash, and has no split-brain defence.
Cheat sheet
- •Use etcd / ZooKeeper / Consul — don't roll your own consensus.
- •Lease + heartbeat + watch; CAS to acquire leadership.
- •Fencing tokens are the real safety property — enforce at the sink.
- •A lock without fencing is not a correctness guarantee.
- •Quorum: 3 or 5 nodes; never 2 or 4; survive ⌊(N-1)/2⌋ failures.
- •On quorum loss, refuse progress (consistency over availability).
- •Lease TTL longer than the worst-case GC pause; 10–30 s typical.
- •Trust the consensus service's clock, not node wall time.
- •Redlock / SET NX is best-effort only — not for correctness.
- •Idempotent writes (keyed by business id) as defence in depth.
- •Isolate the consensus cluster; don't co-locate with the workload.
- •Per-entity locks cut contention vs one global lock.
Core concept
Consensus gives you one guarantee: exactly one node holds this role, and every other node agrees. The algorithms that provide it are Paxos (the theory) and Raft (the practical, understandable one that powers etcd, Consul, and CockroachDB). The single most important sentence in this whole pattern is don't implement them — use a battle-tested consensus service and spend your effort on using it correctly.
Nodes register candidacy with etcd/ZooKeeper; the winner takes a lease, heartbeats to keep it, and the rest watch — on lease expiry they race again.
The practical recipe with etcd / ZooKeeper:
- 1Each node calls
LeaseGrant(ttl=10s)and receives a unique lease. - 2Each node attempts
Put(key="/service/leader", value=node_id, lease=<mine>)via a compare-and-swap that only succeeds if the key is empty. - 3The node that wins the CAS is the leader; the rest
Watchthe key for changes. - 4The leader
KeepAlives its lease (heartbeat). If it dies or partitions, the lease expires, the key is deleted, watchers observe the change, and they race to acquire it again.
Each node grants a lease and CAS-puts the leader key; the winner KeepAlives while watchers wait; on crash/partition the lease expires, the key is deleted, and watchers race.
The fencing problem is the heart of the pattern. A network partition (or a long GC pause) can make an old leader still believe it's the leader while a new one is elected. Both then try to write — split-brain. The fix is fencing tokens: every lease carries a monotonically increasing token, and the downstream sink (storage) rejects any write bearing a token lower than the highest it has seen. Without fencing at the sink, a lock is not a correctness guarantee — split-brain corruption is only a pause away.
Each lease carries a monotonically increasing token; the storage sink rejects any write bearing a token older than the highest it has seen, so a revived old leader cannot corrupt state.
Quorum sizing: run 3 or 5 nodes in the consensus cluster. An N-node cluster survives ⌊(N-1)/2⌋ failures. Never run 2 or 4 (no majority on a tie buys you nothing), and avoid going past 7 (every write needs a majority ack, so latency suffers). And prefer a client library over rolling your own: Curator (ZooKeeper), etcd's lease/election helpers, and similar — they encode the subtleties you'd otherwise rediscover in production.
Interview walkthrough
Worked example: ensure exactly one scheduler runs hourly billing reconciliation
Nodes register candidacy with etcd/ZooKeeper; the winner takes a lease, heartbeats to keep it, and the rest watch — on lease expiry they race again.
Step 0 — reject the shortcuts. Not a process-local flag (lost on restart, no coordination across replicas) and not a database boolean (races without CAS, sticks on crash, no dead-leader detection, no fencing). This is a consensus problem.
Step 1 — acquire leadership via a lease. Each scheduler replica calls LeaseGrant(ttl=30s) on etcd and attempts a transactional Put(/jobs/billing-reconcile, node_id, lease) that only succeeds if the key is empty. The CAS winner is the leader; the losers Watch the key.
Each node grants a lease and CAS-puts the leader key; the winner KeepAlives while watchers wait; on crash/partition the lease expires, the key is deleted, and watchers race.
Step 2 — heartbeat while working. The leader KeepAlives its lease (every ~10 s) for as long as it runs reconciliation. The consensus service's clock owns the lease — the node trusts the service's notion of expiry, not its own wall clock.
Step 3 — automatic failover. If the leader crashes, pauses past the TTL, or partitions onto the minority side, the lease expires within 30 s, etcd deletes the key, watchers are notified, and one of them wins the next CAS and becomes leader — no manual cleanup.
A network partition can leave an old leader still believing it leads while a new one is elected on the majority side; only fencing at the sink prevents both from committing.
Step 4 — fence every write. Each write to the reconciliation output carries the fencing token (etcd revision/epoch) from the leader's lease. The billing store remembers the highest token it has accepted and rejects any write with a lower one. So if the old leader resumes after a GC pause and tries to commit with token 33 while the new leader is on token 34, the store rejects it — no split-brain corruption.
Each lease carries a monotonically increasing token; the storage sink rejects any write bearing a token older than the highest it has seen, so a revived old leader cannot corrupt state.
Step 5 — idempotent steps as a safety net. Each reconciliation step is keyed by billing_period + account_id and written as an upsert, so even if a duplicate ever slipped through, it's a no-op rather than a double-charge.
Even with fencing, downstream work is keyed so a duplicate attempt under the worst case is a safe no-op — belt and braces against double execution.
Step 6 — size and operate the cluster. Run a 3-node etcd cluster (survives 1 failure; use 5 if a double failure must be survivable), isolated from the scheduler workload so app load can't starve it, with the lease TTL set above the worst-case GC pause to avoid thrash, plus health monitoring and a quorum-loss recovery playbook.
Odd cluster sizes (3 or 5) maximise fault tolerance per node; on quorum loss the safe behaviour is to stop electing rather than risk split-brain.
Result. Exactly one scheduler reconciles at a time; failover is automatic and human-free; a partitioned or paused old leader is fenced out at the store; duplicates are harmless by construction; and the consensus cluster is sized, isolated, and operated as the critical infrastructure it now is.
Interview playbook
When it comes up
- The prompt says "exactly one worker", primary election, distributed lock, or singleton job
- Duplicate execution would corrupt data or double-charge users
- There are multiple replicas for availability but only one should act
- A scheduler, controller, migration runner, or replica-set primary is in scope
Order of reveal
- 1Reject DIY locks. Use a consensus service for correctness-critical leadership — don't roll your own or use a DB flag.
- 2Explain the lease flow. CAS to acquire, heartbeat to hold, watchers compete on expiry — automatic failover.
- 3Add fencing. Every leader gets a monotonic token; the sink rejects stale tokens. This is the real safety property.
- 4Choose quorum size. 3 or 5 nodes, never 2 or 4; on quorum loss, refuse progress.
- 5Add idempotency. Even with fencing, key downstream writes so a duplicate is a safe no-op.
- 6Operate the cluster. Isolate the consensus cluster, tune TTL above worst-case pause, monitor quorum health.
Signature phrases
- “A lock without fencing is not a correctness guarantee” — Names the split-brain failure and the real safety property.
- “The lease is owned by the consensus service's clock” — Avoids the clock-skew class of bugs.
- “Quorum loss means no new leader” — Shows the consistency-over-availability trade-off.
- “Don't roll your own consensus” — Signals you'll use etcd/ZK rather than reinvent Raft.
- “Fencing prevents the write; idempotency makes the slip harmless” — Names the defence-in-depth pairing.
- “Redlock is best-effort, not correctness” — Shows awareness of the Kleppmann critique.
Likely follow-ups
?“What if the old leader keeps running after a network partition?”Reveal
It may keep executing locally on the minority side, but its fencing token is now stale. The new leader (on the majority side) was granted a higher token, and the storage/sink remembers the highest token it has accepted and rejects anything lower — so the old leader's writes are refused and it cannot corrupt shared state. Quorum guarantees only one new leader is elected; fencing neutralises the old one. As a belt-and-braces measure the writes are also idempotent.
?“Why not just use Redis SET NX?”Reveal
It's fine for best-effort duplicate reduction where occasional double execution is tolerable, but it's not a correctness guarantee. As Kleppmann's Redlock critique shows, a GC pause or clock skew can let a node believe it still holds the lock after it has actually expired, so two nodes act at once — Redis isn't a Raft/Paxos consensus system and the timing assumptions don't hold under real pauses. For correctness-critical leadership I'd use etcd/ZooKeeper and fence at the sink.
?“How do you size the consensus cluster and what happens if it loses quorum?”Reveal
An odd number — 3 (survives 1 failure) or 5 (survives 2) — because even sizes like 4 need the same majority as 3 but add latency for no extra tolerance. Past 7, the cost of larger majority acks on every write outweighs the marginal fault tolerance. If the cluster loses quorum (2 of 3 down), the safe behaviour is to stop electing leaders and refuse writes — go read-only — rather than let a lone node act and risk split-brain. Consensus deliberately chooses consistency over availability under partition, so I'd monitor cluster health and keep a DR playbook for the cluster itself.
?“How do you pick the lease TTL?”Reveal
Longer than the worst-case realistic pause on the nodes — primarily the longest stop-the-world GC pause, plus network jitter — so a normal hiccup doesn't trigger a false expiry and leadership thrash. But not so long that a genuinely dead leader blocks progress for ages; 10–30 s is typical. Critically, the lease is refreshed against the consensus service's own clock, not the node's wall clock, so clock skew between nodes doesn't create disagreement about whether the lease is still valid.
?“The sink can't support fencing tokens. Now what?”Reveal
Then I can't get a hard correctness guarantee, so I lean fully on idempotency: key every downstream write by a business identity (step_id, period+account) and make it an upsert, so a duplicate from a brief split-brain is a safe no-op rather than a double-effect. I'd also minimise the dual-leader window (TTL tuning) and, if the work is genuinely correctness-critical, push to add a monotonic version/epoch column to the sink so it can fence — because without either fencing or idempotency, split-brain corruption is inevitable.
Canonical examples
- →Kafka controller election
- →MongoDB / Redis replica-set primary election
- →Distributed cron (one runner per job)
- →Kubernetes controller/scheduler leader
- →A custom migration runner ("only one pod migrates")
Variants
etcd lease + election
Raft-backed leases with CAS acquire and watch — the cloud-native default.
Each node grants a lease and CAS-puts the leader key; the winner KeepAlives while watchers wait; on crash/partition the lease expires, the key is deleted, and watchers race.
etcd is the default for the Kubernetes ecosystem and Go-centric stacks. It's a Raft-replicated key-value store with first-class leases and an election API. A node grants a lease, does a transactional put on the leader key conditioned on the key being absent, and the winner holds leadership for the lease's life, renewing via KeepAlive. Losers Watch the key and get notified the instant leadership frees up, so failover is fast.
etcd exposes a revision number on every key, which doubles as a natural fencing token — monotonic, server-assigned, and perfect to stamp on downstream writes. Kubernetes itself uses etcd for leader election of its controllers, which is the strongest possible production endorsement. Reach for etcd when you're already in that ecosystem or want a clean Go client and built-in fencing material.
Pros
- +Raft-correct leases with a clean election API
- +Revision numbers give you fencing tokens for free
- +Fast failover via watches; Kubernetes-proven
Cons
- −You operate (or pay for) an etcd cluster
- −Write latency bounded by Raft majority acks
Choose this variant when
- Kubernetes / cloud-native or Go-heavy stack
- You want built-in fencing (revision) material
- Correctness-critical singleton leadership
ZooKeeper ephemeral sequential nodes
The classic recipe: ephemeral-sequential znodes, watch your predecessor, no thundering herd.
Nodes register candidacy with etcd/ZooKeeper; the winner takes a lease, heartbeats to keep it, and the rest watch — on lease expiry they race again.
ZooKeeper is the mature, JVM-ecosystem choice (Kafka, Hadoop, HBase). Its canonical election recipe uses ephemeral sequential znodes: each candidate creates a child node under /election/ that is ephemeral (auto-deleted when the session dies) and sequential (gets a monotonically increasing suffix). The node with the lowest sequence number is the leader. Crucially, each node watches only the node immediately before it — so when a node leaves, exactly one watcher wakes up, avoiding the thundering herd where every node stampedes the same key.
The ephemeral property gives you automatic failure detection for free: if a leader's session times out (it crashed or partitioned), its znode vanishes and the next-in-line is promoted. The Apache Curator library wraps this recipe (and fencing) so you don't hand-roll it. Choose ZooKeeper when you're in the JVM/Kafka world or want this well-worn, herd-free election primitive.
Pros
- +Battle-tested recipe; Curator library encodes it
- +Ephemeral nodes = automatic failure detection
- +Watch-predecessor avoids thundering-herd re-election
Cons
- −Operationally heavier; JVM-centric
- −Session/timeout tuning has sharp edges
Choose this variant when
- JVM ecosystem (Kafka, Hadoop, HBase)
- You want the herd-free ephemeral-sequential recipe
- Mature, widely-understood coordination
Consul sessions + KV lock
Session-bound KV locks with health checks, plus service discovery in the same system.
Nodes register candidacy with etcd/ZooKeeper; the winner takes a lease, heartbeats to keep it, and the rest watch — on lease expiry they race again.
Consul bundles a Raft-backed KV store, sessions, and service discovery. Leader election uses a session tied to a health check; the leader acquires a lock on a KV key bound to that session, and renews it. If the node fails its health check or its session TTL lapses, the lock releases and another node acquires it. The bonus is that Consul also gives you service discovery and health checking in one system, so the newly-elected leader can register itself and clients can find it without a second tool.
This is attractive when you already run Consul for service mesh / discovery and want coordination without adding etcd or ZooKeeper. As with all of these, enforce fencing at the sink — Consul's lock-delay and session semantics reduce but don't eliminate the need for fencing on correctness-critical writes.
Pros
- +KV locks + sessions + service discovery in one system
- +Health-check-driven session expiry
- +Convenient if you already run Consul
Cons
- −Still need sink-side fencing for correctness
- −Another distributed system to operate if you don't
Choose this variant when
- You already run Consul for discovery/service mesh
- You want leader + discovery in one place
- Health-check-based liveness fits your fleet
Redis lock (best-effort only)
SET NX with TTL / Redlock — fine for reducing duplicates, NOT for correctness.
Even with fencing, downstream work is keyed so a duplicate attempt under the worst case is a safe no-op — belt and braces against double execution.
Redis-based locking — SET key value NX PX ttl, or the multi-node Redlock algorithm — is the tempting shortcut, and it's the one to be careful with. Martin Kleppmann's well-known critique shows that Redlock does not provide a correctness guarantee: a GC pause or clock skew can let a node believe it still holds the lock after it has expired, so two nodes act at once. Redis isn't a consensus system in the Raft/Paxos sense, and timing assumptions don't hold under real-world pauses.
So the honest framing in an interview is: a Redis lock is acceptable for best-effort duplicate reduction where occasional double-execution is tolerable (e.g. a cache-warm job, a non-critical periodic task) — but for anything where double execution corrupts data or double-charges a user, use a real consensus service and fence at the sink. Naming this distinction explicitly is itself a strong signal; pretending Redlock is correct is a red flag.
Pros
- +Trivial to set up if you already run Redis
- +Fine for best-effort, tolerant-of-duplicates work
- +Very low latency to acquire
Cons
- −NOT a correctness guarantee (Kleppmann critique)
- −Pauses / clock skew enable double execution
- −Needs idempotent sink to be safe at all
Choose this variant when
- Double execution is genuinely tolerable
- You only need to reduce, not eliminate, duplicates
- Latency matters more than strict correctness
Scaling path
v1 — single instance, no election needed
Run the singleton work in one place and recognise when that stops being safe.
If exactly one process runs the job and you can tolerate downtime while it restarts, you don't need consensus at all — a single scheduler instance is the "exactly one". This is the right answer when the work is infrequent and a few minutes of unavailability on crash is acceptable.
Nodes register candidacy with etcd/ZooKeeper; the winner takes a lease, heartbeats to keep it, and the rest watch — on lease expiry they race again.
It breaks the moment you add replicas for availability: now two instances are alive and both will run the job. Resist the urge to coordinate them with a database flag — that's the trap the next step exists to avoid.
What triggers the next iteration
- Single instance is a downtime risk on crash
- Adding replicas immediately causes double execution
- No automatic failover
v2 — consensus lease with heartbeat and watch
Elect exactly one active leader among replicas with automatic failover.
Introduce a consensus service (etcd/ZooKeeper/Consul). Each replica grants a lease and CAS-acquires the leader key; the winner heartbeats while it works, and losers watch. On crash or partition the lease expires and a watcher is promoted — automatic failover with no manual cleanup.
Each node grants a lease and CAS-puts the leader key; the winner KeepAlives while watchers wait; on crash/partition the lease expires, the key is deleted, and watchers race.
This is correct for liveness (one leader at a time under normal conditions), but it is not yet safe under partition: a paused old leader can resume and write alongside the new one. That gap is exactly what v3 closes.
What triggers the next iteration
- Split-brain still possible on partition / GC pause
- Lease TTL must be tuned against worst-case pauses
- A lock without fencing is not a correctness guarantee
v3 — fencing tokens enforced at the sink
Make leadership correct under partition, not just under normal operation.
Stamp every leader's writes with a monotonic fencing token from its lease (etcd's revision is ideal), and make the storage sink reject any write with a stale token. Now even a revived old leader on the wrong side of a partition cannot commit — the sink fences it out.
Each lease carries a monotonically increasing token; the storage sink rejects any write bearing a token older than the highest it has seen, so a revived old leader cannot corrupt state.
This is the step that turns "usually one leader" into "provably one writer". It requires the sink to participate (a compare-and-set on a version/epoch column), which is why you design the write path and the election together.
What triggers the next iteration
- Sink must support a monotonic version / CAS to fence
- Token must thread through every downstream write
- Some sinks can't fence — fall back to idempotency
v4 — idempotent work + per-entity locks at scale
Add defence-in-depth and reduce contention as the workload grows.
Make the downstream work idempotent (keyed by step_id / business period) so a duplicate attempt under the absolute worst case is a safe no-op — belt-and-braces behind fencing. And when one global leader becomes a bottleneck, switch to per-entity locks (per-job, per-shard) so many leaders run disjoint work concurrently instead of serialising through one.
Even with fencing, downstream work is keyed so a duplicate attempt under the worst case is a safe no-op — belt and braces against double execution.
Fine-grained locks cut contention at the cost of managing more locks; you also want monitoring of the consensus cluster's health and a recovery playbook for quorum loss. This is the mature shape: consensus + fencing + idempotency + right-sized lock granularity.
What triggers the next iteration
- More locks to manage and observe
- Per-shard rebalancing complexity
- Consensus cluster health becomes critical infra
Deep dives
Why "set a flag in the database" is wrong
Each node grants a lease and CAS-puts the leader key; the winner KeepAlives while watchers wait; on crash/partition the lease expires, the key is deleted, and watchers race.
The instinct to coordinate replicas with a database boolean ("whoever sets is_leader=true first wins") fails in three distinct ways, and naming all three is the signal.
Each node grants a lease and CAS-puts the leader key; the winner KeepAlives while watchers wait; on crash/partition the lease expires, the key is deleted, and watchers race.
- 1Crash leaves the flag stuck. If the "leader" crashes with the flag set, it stays set forever — the job is permanently blocked until a human notices and clears it manually. There's no expiry.
- 2No dead-leader detection. A flag can't tell you the holder died. There's no heartbeat, so a dead leader and a working leader look identical, and no other node knows it's safe to take over.
- 3No fencing. Even if you add a timeout, when the old "leader" wakes up it still believes it holds the flag and will act — there's nothing to stop it.
Leases solve all three by construction: a TTL gives automatic expiry on death, the heartbeat gives liveness detection, watches give automatic transfer to the next node, and the fencing token defeats the revived old leader. The naive flag also typically races under concurrent reads (two readers both see it unset and both set it) unless you use a real compare-and-swap — which is exactly the primitive a consensus service provides and a plain row update does not. The lesson: leadership is a distributed problem with crash, pause, and partition failure modes, and a single-row boolean models none of them.
Fencing tokens are the real safety property
Each lease carries a monotonically increasing token; the storage sink rejects any write bearing a token older than the highest it has seen, so a revived old leader cannot corrupt state.
A lease tells nodes who should be leader. A fencing token is what lets downstream systems reject a stale leader — and it, not the lease, is the actual correctness guarantee.
Each lease carries a monotonically increasing token; the storage sink rejects any write bearing a token older than the highest it has seen, so a revived old leader cannot corrupt state.
The scenario it defends against: the leader acquires a lease and starts work, then suffers a long stop-the-world GC pause (or a network partition). During the pause, the consensus service's lease expires and a new leader is elected. The old leader then resumes — it has no idea time has passed and still believes it's the leader. Now two nodes issue writes. A lock alone cannot stop this, because the old leader genuinely held a valid lock a moment ago.
Fencing closes it: every time leadership is granted, the leader gets a monotonically increasing token (etcd's key revision, a ZooKeeper zxid, a database epoch). The leader includes this token on every write, and the sink enforces monotonicity — it remembers the highest token it has accepted and rejects any write with a lower one. So when the paused old leader (token 33) resumes and writes after the new leader (token 34) has written, the sink rejects token 33 outright. The token must be monotonic and enforced at the sink — enforcement anywhere else is theatre. If a sink genuinely can't fence (no version column, no CAS), the fallback is to make writes idempotent and accept weaker guarantees, which is why idempotency is the companion safety net. "A lock without fencing is not a correctness guarantee" is the line to land.
Split-brain: how partitions create two leaders
A network partition can leave an old leader still believing it leads while a new one is elected on the majority side; only fencing at the sink prevents both from committing.
Split-brain is the failure that consensus exists to prevent, and understanding the mechanism — not just the word — is what's tested.
A network partition can leave an old leader still believing it leads while a new one is elected on the majority side; only fencing at the sink prevents both from committing.
A network partition splits the cluster into two groups that can't talk to each other. The old leader may end up on the minority side, still running and still believing it leads, while the majority side — which retains quorum — notices the lease lapse and elects a new leader. For a moment, two nodes both think they're in charge.
Consensus protocols prevent the election of two leaders by requiring a majority quorum to win: the minority side cannot elect anyone because it can't reach a majority, so at most one new leader emerges. But that doesn't stop the old leader on the minority side from continuing to act locally — it was elected before the partition. Two defences combine:
- Quorum ensures only one new leader is elected (the minority can't form a majority).
- Fencing at the sink ensures the old leader's writes are rejected once the new leader's higher token has been seen.
Together they guarantee a single effective writer even though, briefly, two processes believe they're leader. The CAP-theorem framing is worth stating: under partition, consensus chooses consistency over availability — the minority side stops making progress rather than risk a conflicting write. A system that "keeps both halves running" during a partition has chosen availability and will corrupt state without fencing.
Quorum sizing and the availability trade-off
Odd cluster sizes (3 or 5) maximise fault tolerance per node; on quorum loss the safe behaviour is to stop electing rather than risk split-brain.
The consensus cluster's size directly sets how many failures it tolerates, and the rule is precise.
Odd cluster sizes (3 or 5) maximise fault tolerance per node; on quorum loss the safe behaviour is to stop electing rather than risk split-brain.
An N-node cluster needs a majority (⌊N/2⌋ + 1) to make progress, so it survives ⌊(N-1)/2⌋ failures:
- 3 nodes → majority 2 → survives 1 failure.
- 5 nodes → majority 3 → survives 2 failures.
- 7 nodes → majority 4 → survives 3, but now every write waits on 4 acks, so latency climbs.
Two rules fall out. First, always use an odd number: a 4-node cluster also needs 3 for majority, so it tolerates only 1 failure — exactly like a 3-node cluster but with more machines and more latency. 2 and 4 are strictly worse than 1 and 3. Second, don't over-size: past 5–7 nodes the write-latency cost of larger majorities outweighs the marginal fault tolerance. 3 is the common default; 5 for workloads that can't tolerate even a brief write outage from a double failure.
The deeper point is what happens on quorum loss: in a 3-node cluster, if 2 nodes are down, the survivor cannot form a majority and the safe behaviour is to stop electing leaders and refuse writes (go read-only) rather than let a lone node act and risk split-brain. "Consensus protects correctness by refusing progress when it can't prove a majority" — stating that trade-off explicitly is a senior signal, because the naive expectation is that a system should always stay available.
Lease TTL, GC pauses, and trusting the right clock
Too short and a GC pause makes the leader thrash; too long and a dead leader blocks progress. The lease is refreshed by the consensus service's clock, not node wall time.
The lease TTL is a deceptively tricky tuning knob with correctness implications at both extremes.
Too short and a GC pause makes the leader thrash; too long and a dead leader blocks progress. The lease is refreshed by the consensus service's clock, not node wall time.
Too short and a normal stop-the-world GC pause (or a brief network blip) makes the leader miss a heartbeat, its lease expires, a new leader is elected, and then the original leader wakes up — leadership thrashes back and forth, and you get exactly the dual-leader window fencing must cover. Too long and a genuinely dead leader holds the lease for the full TTL, blocking all progress until it expires. The TTL must be set longer than the longest realistic pause on your nodes — which means you have to actually know your worst-case GC and scheduling pauses. 10–30 seconds is typical.
The subtler trap is whose clock you trust. A tempting-but-wrong mental model is "the leader is valid for 10 seconds from the moment of grant" measured on the node's wall clock — but node clocks drift and skew, so this leads to two nodes disagreeing about whether the lease is still valid. The correct model: the lease is owned and refreshed by the consensus service's own clock, and the node trusts the service's notion of expiry, not its local wall time. The node's job is to keep heartbeating; the service decides when the lease is dead. "The lease is owned by the consensus service's clock" is the precise framing that avoids the clock-skew class of bugs.
Idempotency, lock granularity, and operating the cluster
Even with fencing, downstream work is keyed so a duplicate attempt under the worst case is a safe no-op — belt and braces against double execution.
Three operational realities turn a correct election into a correct system.
Even with fencing, downstream work is keyed so a duplicate attempt under the worst case is a safe no-op — belt and braces against double execution.
Idempotency is the safety net behind fencing. Even with fencing, you want downstream work keyed by a business identity — step_id, billing_period + account_id, migration_step — so that if a duplicate attempt ever slips through (a sink that couldn't fence, a retry after an ambiguous failure), it's an upsert/no-op rather than a double-effect. Fencing prevents the stale write; idempotency makes the rare slip harmless. Belt and braces.
Lock granularity is a contention dial. A single global leader is simplest but serialises all the work through one node. Per-entity locks (per-job, per-shard) let many leaders run disjoint work in parallel — far less contention — at the cost of managing many locks and handling rebalancing when shards move. Choose granularity by how much parallelism you need.
Operate the consensus cluster like critical infrastructure. Two common operational failures: (1) co-locating etcd with the workload, so heavy app load starves etcd, leases miss, and leadership thrashes — isolate the consensus cluster; and (2) quorum loss of the cluster itself, which requires monitoring and a disaster-recovery playbook (the consensus cluster going down takes leadership with it). The consensus service is now part of your critical path, so it deserves the same care as your primary datastore.
Decision levers
Consensus service choice
etcd: default for Kubernetes/Go, revision = free fencing. ZooKeeper: mature, JVM (Kafka, Hadoop), ephemeral-sequential recipe. Consul: KV + service discovery in one. Avoid DIY Redis for anything that must be correct.
Lease TTL
Too short → a GC pause causes false expiry and leadership thrash. Too long → a dead leader blocks progress. Set it longer than your worst-case GC/network pause; 10–30 s typical. Trust the consensus service's clock, not node wall time.
Fencing enforcement
Non-optional when correctness matters. The sink must reject writes with stale (lower) tokens via a monotonic version / CAS. If the sink genuinely cannot fence, accept split-brain risk and design idempotent, replay-safe writes.
Quorum size
3 or 5 nodes — survive ⌊(N-1)/2⌋ failures. Never 2 or 4 (no extra tolerance over 1 or 3). Avoid > 7 (write latency from larger majorities). On quorum loss, stop electing and go read-only.
Lock granularity
One global lock = simplest but serialises all work through one leader. Per-entity locks (per-job, per-shard) = many leaders on disjoint work, less contention, more locks to manage and rebalance.
Idempotency safety net
Key downstream work by business identity (step_id, period+account) so a duplicate under the worst case is an upsert/no-op. Fencing prevents the stale write; idempotency makes a rare slip harmless. Use both.
Cluster isolation & ops
Run the consensus cluster on isolated hosts — co-locating with the workload starves it under load and causes lease thrash. Monitor cluster health and keep a quorum-loss recovery playbook; it is now critical-path infra.
Failure modes
A paused/partitioned old leader and a freshly-elected new leader both write — split-brain corrupts data. Add monotonic fencing tokens and enforce them at the sink (reject lower tokens).
Redlock has documented correctness issues (Kleppmann): pauses and clock skew permit double execution. Use it only for best-effort duplicate reduction; use real consensus + fencing when correctness matters.
A boolean row stays stuck on crash, can't detect a dead leader, has no fencing, and races without CAS. Use a lease (expiry + heartbeat + watch + token) instead.
Even cluster sizes give no extra fault tolerance over the odd size below them, just more latency. Run 3 or 5; never 2 or 4.
"Valid for 10 s from grant on the node's wall clock" assumes clocks agree; they drift. Trust the consensus service's clock for expiry, not local wall time.
etcd sharing a host with the app gets starved under heavy load, missing leases and thrashing leadership. Isolate the consensus cluster on its own nodes.
A normal stop-the-world GC pause exceeds the TTL, triggering needless re-election and thrash. Set the TTL above the worst-case realistic pause.
In a 3-node cluster, 2 down = no majority = no writes. Monitor cluster health and keep a disaster-recovery playbook for the consensus cluster itself.
Case studies
Apache Kafka
Kafka — from a ZooKeeper-elected controller to self-managed Raft (KRaft)
For most of its history Kafka used ZooKeeper for cluster coordination, including electing a single controller broker responsible for partition leadership and metadata. This is a textbook leader-election use: exactly one broker must be the controller, and on its failure ZooKeeper's ephemeral-node mechanism triggers a fast re-election. Kafka also elects a partition leader per partition (from the in-sync replica set) so that every read and write for a partition goes through one authoritative replica — leader election at two levels.
The instructive twist is Kafka's move to KRaft (Kafka Raft), which removes the ZooKeeper dependency and runs the controller quorum on an internal Raft implementation instead. The motivations map onto this pattern's themes: operating a separate ZooKeeper ensemble was operational overhead and a second system to keep healthy (the "isolate and operate the consensus cluster" burden), and folding metadata into a Raft log let metadata changes scale and propagate faster with a single consistency model. KRaft keeps a small controller quorum (an odd number of nodes) that elects a leader and replicates the metadata log — the same quorum/Raft machinery, now embedded.
The lesson for an interview: leader election shows up at multiple levels of a real system (controller and per-partition), and even mature systems consolidate onto Raft rather than hand-rolling coordination — the "use consensus, don't invent it" principle, taken to the point of embedding a proper Raft quorum in-product.
Kubernetes
Kubernetes — etcd-backed leader election for active-passive controllers
Kubernetes runs its control-plane components — the kube-controller-manager and kube-scheduler — in an active-passive configuration: you deploy several replicas for availability, but only one may be active at a time, because two schedulers making placement decisions simultaneously would conflict. It achieves this with leader election backed by etcd, the Raft-replicated store at the heart of the cluster.
The mechanism is exactly this pattern's recipe: candidates contend for a lease object; the holder renews it within the lease duration (the heartbeat), and the other replicas watch and wait. If the active leader fails to renew — it crashed, partitioned, or paused — the lease lapses and a standby acquires leadership and becomes active. The tunables Kubernetes exposes are precisely the ones this pattern flags: leaseDuration, renewDeadline, and retryPeriod — i.e. the TTL vs worst-case-pause trade-off, set so a transient hiccup doesn't cause needless failover while a real failure is detected promptly.
Kubernetes is the canonical "consume consensus, don't build it" example: the controllers don't implement Raft, they use etcd's lease primitives via a shared client-go election library. It's also a clean illustration of active-passive singleton leadership — many replicas for resilience, one actor for correctness — which is the shape behind most "only one worker may run this" interview prompts.
Google Chubby
Google Chubby — the lock service that taught the industry to centralise consensus
Google's Chubby is the lock service that inspired ZooKeeper and shaped how the industry thinks about coordination. Its central design insight, argued in Google's paper, is counter-intuitive: rather than have every distributed system embed Paxos and implement its own consensus (hard to get right, easy to get subtly wrong), provide a single, well-engineered, centralised lock/coordination service that everyone else consumes. Chubby runs a small 5-node Paxos group (a quorum that tolerates 2 failures), elects a master, and exposes a simple file-system-like lock API.
Two ideas from Chubby are baked into this pattern. First, leadership is delegated to experts: most engineers should acquire a lock from a consensus service, not implement consensus — exactly the "don't roll your own" rule. Second, Chubby formalised sequencers — opaque tokens a lock holder passes to a backend, which the backend validates — which is precisely the fencing token mechanism. Google explicitly designed for the case where a lock holder pauses and a new holder is elected, and sequencers let the protected resource reject the stale holder. Chubby also doubles as a name service, since a highly-available, consistent place to store "who is the current leader" is exactly what service discovery needs.
The enduring lesson: consensus is hard enough that the right architecture is to centralise it once behind a clean lock API with fencing sequencers — which is why etcd, ZooKeeper, and Consul exist and why you should use them.
Decision table
Leader election is about correctness under pause, crash, and partition.
| Need | Use | Risk | Robust answer includes |
|---|---|---|---|
| Exactly one active leader | etcd / ZooKeeper / Consul lease | Split-brain under partition | Lease, watch, quorum, fencing token |
| Best-effort duplicate avoidance | Redis lock / DB advisory lock | Double execution possible | Idempotent sink, timeout cleanup |
| Partitioned ownership | Shard-ownership table | Rebalance complexity | Epoch / fencing per shard |
| One-off migration | Consensus lease + idempotent steps | Crash mid-step | Lease expiry, step-level idempotency |
| Singleton controller | etcd lease (active-passive) | Two active actors | Lease-duration vs renew-deadline tuning |
- A lock without fencing is not a correctness guarantee — enforce the token at the sink.
- Run an odd quorum (3 or 5); on quorum loss, refuse progress rather than risk split-brain.
Drills
Why is "set a flag in a DB row" wrong?Reveal
Three reasons. (1) Crash leaves it stuck — if the holder dies with the flag set, it stays set forever and blocks the job until a human clears it. (2) No dead-leader detection — a flag has no heartbeat, so a dead leader and a live one look identical. (3) No fencing — even with a timeout, when the old "leader" wakes it still believes it holds the flag and acts. It also races without a real CAS. Leases solve all of these: automatic expiry on death, liveness via heartbeat, automatic transfer to a watcher, and a fencing token to defeat the revived old leader.
What is a fencing token and why is it the real safety property?Reveal
A lease says who should be leader; a fencing token lets the sink reject a stale leader, which is the actual correctness guarantee. Each leadership grant carries a monotonically increasing token (etcd revision, zxid, DB epoch). The leader stamps it on every write, and the sink remembers the highest token accepted and rejects any lower one. So when a leader pauses (GC) past its lease, a new leader is elected with a higher token, and the old leader resumes and tries to write — the sink fences its stale token out. Without sink-enforced fencing, split-brain corruption is inevitable. "A lock without fencing is not a correctness guarantee."
A network partition splits your cluster. How are two leaders prevented from corrupting state?Reveal
Two defences combine. Quorum ensures only one new leader is elected: the minority side can't reach a majority, so it can't elect anyone, and at most one new leader emerges on the majority side. But the old leader on the minority side may keep acting — so fencing at the sink rejects its now-stale (lower) token once the new leader's higher token has been seen. Quorum handles election; fencing handles the lingering old leader; together they guarantee a single effective writer. This is consensus choosing consistency over availability under partition — the minority stops rather than risk a conflicting write.
Why 3 or 5 nodes, never 2 or 4?Reveal
A cluster needs a majority (⌊N/2⌋+1) to make progress, so it tolerates ⌊(N-1)/2⌋ failures. 3 → majority 2 → survives 1. 5 → majority 3 → survives 2. But 4 also needs 3 for majority, so it survives only 1 — exactly like 3 but with more machines and higher write latency. Even sizes give no extra tolerance over the odd size below them. And past 7, every write waits on a larger majority, so latency outweighs the marginal fault tolerance. 3 is the default; 5 when a double failure must be survivable.
How do you choose the lease TTL, and whose clock matters?Reveal
Set the TTL longer than the worst-case realistic pause — chiefly the longest stop-the-world GC pause plus network jitter — so a normal hiccup doesn't cause a false expiry and leadership thrash; but not so long that a dead leader blocks progress for ages. 10–30 s is typical. The clock that matters is the consensus service's, not the node's: the lease is refreshed against the service's own clock, and the node trusts the service's notion of expiry rather than its local wall time — otherwise clock skew between nodes creates disagreement about whether the lease is still valid.
When is a Redis lock (SET NX / Redlock) acceptable, and when not?Reveal
Acceptable for best-effort duplicate reduction where occasional double execution is tolerable — a cache-warm job, a non-critical periodic task. Not acceptable for correctness-critical leadership, because (per Kleppmann's Redlock critique) a GC pause or clock skew can let a node believe it still holds an already-expired lock, so two nodes act simultaneously — Redis isn't a Raft/Paxos consensus system and its timing assumptions don't survive real pauses. For correctness, use a real consensus service and fence at the sink (and make writes idempotent). Stating this distinction honestly is itself a strong signal.
When to reach for this