
ZooKeeper / etcd

Strongly-consistent coordination services that solve the hard distributed problems — leader election, distributed locks, configuration, and service discovery — so you never have to implement consensus yourself.

Also worth naming: Apache ZooKeeper · etcd (Kubernetes’ backing store) · HashiCorp Consul · Google Chubby (the original)

~25 min read·15 sections

When you need exactly one of something across a cluster — one leader, one lock holder, one source of truth for config — you delegate to a consensus-backed coordination store. Rolling your own is how distributed systems get subtle, fatal bugs.

What it is

ZooKeeper and etcd are distributed coordination services: small, strongly-consistent key-value stores whose job is to answer the questions that are genuinely hard in a distributed system — who is the leader? who holds this lock right now? what is the current config? which service instances are alive? They achieve this by running a consensus protocol (ZooKeeper uses ZAB, etcd uses Raft) across a small cluster of nodes, so every client sees the same answer even through failures and network partitions.

They are deliberately not general databases. They hold a small amount of critical metadata (kilobytes to low megabytes), prioritise consistency and correctness over throughput, and expose coordination primitives: ephemeral keys that vanish when a client dies (perfect for liveness and locks), watches that notify clients of changes, leases with TTLs, and atomic compare-and-swap. etcd is the store behind Kubernetes; ZooKeeper underpins Kafka (historically), HBase, and countless clusters.

The interview framing is simple and important: do not implement consensus yourself. Raft and Paxos are subtle, and a homegrown leader election or distributed lock will have race conditions you won't find until production. When a design needs single-leader coordination, a distributed lock with a fencing token, dynamic config, or service discovery, you say "a coordination service like etcd or ZooKeeper" — a battle-tested system with years of hardening — and you keep it off the hot path, because consensus is correct but slow.

When to reach for it

Reach for this when…

You need leader election or a single active coordinator across a cluster
You need a correct distributed lock (with fencing) for mutual exclusion
Dynamic configuration that must be consistent across many nodes
Service discovery / membership — which instances are alive right now

Not really this pattern when…

You need throughput or to store real application data (it is metadata-only, low-throughput)
Eventual consistency is acceptable — you may not need consensus at all
A best-effort lock is fine and you already run Redis (cheaper, with caveats — no true consensus)
The coordination is on a hot per-request path — consensus latency will hurt

How it works

Four ideas explain how to use a coordination service:

1. Consensus over a small odd ensemble. A handful of nodes (almost always 3 or 5, odd) run ZAB/Raft. Writes go through an elected leader and commit only when a majority quorum acknowledges, which is the only way two concurrent decisions cannot both win. 3 nodes tolerate 1 failure (quorum 2); 5 tolerate 2 (quorum 3). Even counts are wasteful — 4 tolerates the same 1 failure as 3. The quorum is also what prevents split-brain: a minority partition cannot reach a majority, so it cannot elect a leader or commit.

Architecture diagram: a small consensus ensemble with one leader and a majority quorum

An odd number of nodes (3 or 5) run a consensus protocol (ZAB / Raft). Writes go through the leader and are committed once a majority acknowledges, so the cluster stays correct as long as a majority survives — 3 nodes tolerate 1 failure, 5 tolerate 2.
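The quorum arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names `quorum` and `tolerated_failures` are illustrative and do not come from any ZooKeeper or etcd API.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node ensemble."""
    if n < 1:
        raise ValueError("an ensemble needs at least one node")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a majority still survives."""
    return n - quorum(n)


# Compare odd and even ensemble sizes side by side.
for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)}")
```

Running it shows that adding a fourth or sixth node raises the quorum without raising the number of failures the ensemble survives.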

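2. Strong consistency, low throughput, by design. A write is acknowledged only after a majority has agreed on it, so every node applies the same writes in the same order. Reads are where the two systems differ: etcd serves linearizable reads by default, while a ZooKeeper server answers reads from its local copy, which can lag slightly behind the leader unless the client issues a sync first. The price of agreement is that each write costs multiple round trips, and the store is sized for small, critical metadata, not bulk data or high write rates. You store coordination state, not application data.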

3. The primitives: ephemeral keys, watches, leases, CAS. An ephemeral key exists only while its client session is alive — when the client dies, the key vanishes, which is how you build liveness, membership, and locks. Watches notify clients when a key changes (no polling). Leases give a key a TTL the holder must renew. Compare-and-swap enables atomic state transitions. From these you compose leader election, locks, queues, and config.
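To make compare-and-swap and watches concrete, here is a toy in-memory sketch. `ToyStore` and its methods are hypothetical stand-ins, not the client API of etcd or ZooKeeper, and nothing here is replicated; it only illustrates how a versioned write lets exactly one of two racing writers succeed and how a watcher hears about the change.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A watcher receives (key, new value, new version).
Watcher = Callable[[str, Optional[str], int], None]


class ToyStore:
    """In-memory stand-in for a coordination store (hypothetical API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, int]] = {}  # key -> (value, version)
        self._watchers: Dict[str, List[Watcher]] = {}

    def get(self, key: str) -> Tuple[Optional[str], int]:
        # A key that was never written is reported as (None, 0).
        return self._data.get(key, (None, 0))

    def watch(self, key: str, callback: Watcher) -> None:
        self._watchers.setdefault(key, []).append(callback)

    def compare_and_swap(self, key: str, expected_version: int, value: str) -> bool:
        """Write only if the key is still at expected_version."""
        _, current = self.get(key)
        if current != expected_version:
            return False  # another writer got there first
        new_version = current + 1
        self._data[key] = (value, new_version)
        for callback in self._watchers.get(key, []):
            callback(key, value, new_version)
        return True


store = ToyStore()
events = []
store.watch("/config/mode", lambda k, v, ver: events.append((k, v, ver)))

_, seen = store.get("/config/mode")  # both writers read version 0
first = store.compare_and_swap("/config/mode", seen, "blue")
second = store.compare_and_swap("/config/mode", seen, "green")
print(first, second, store.get("/config/mode"), events)
```

The second writer fails because the version it read is no longer current, which is the same check-then-write atomicity that leader election and locks are built from.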

4. Locks need fencing tokens. A coordination lock gives mutual exclusion, but a holder can stall (a GC pause) past its lease while still believing it holds the lock. The fix is a monotonically increasing fencing token issued with the lock; the protected resource rejects any operation carrying a token lower than the highest it has seen — so a zombie ex-leader cannot corrupt state.

Architecture diagram: leader election with a lease and a fencing token

Workers compete to acquire a lease in the coordination store; exactly one wins and becomes leader. The lease carries a monotonically increasing fencing token that the protected resource checks, so a stalled old leader cannot act after a new one is elected.
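The resource-side check is small enough to sketch directly. `FencedResource` and `StaleTokenError` are hypothetical names for illustration; the assumption is that the protected resource can remember the highest token it has accepted and compare each incoming write against it.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedResource:
    """A protected resource that validates fencing tokens on every write."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.log = []

    def write(self, token: int, payload: str) -> None:
        # Reject anything below the highest token seen so far.
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.log.append((token, payload))


resource = FencedResource()
resource.write(7, "old leader, before pause")
resource.write(8, "new leader")  # a new lease was issued with token 8

try:
    resource.write(7, "old leader, after pause")  # zombie resumes
    zombie_rejected = False
except StaleTokenError:
    zombie_rejected = True

print(zombie_rejected, resource.log)
```

The check lives in the resource, not in the lock holder, because the stalled holder is precisely the party that cannot be trusted to know its lease has expired.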

Performance envelope

Coordination service characteristics — what to reason about.

Dimension	Reality	Why it matters
Cluster size	3 or 5 nodes (odd)	Majority quorum; 3 tolerates 1, 5 tolerates 2 failures
Consistency	Strong (linearizable) via consensus	Every client sees one correct answer
Throughput	Low — thousands of writes/sec	Metadata coordination, not bulk data
Data size	Small (KB–low MB per key/total)	Config, locks, membership — not app data
Latency	Multiple round trips per write	Keep it off the hot per-request path
Failure tolerance	Survives a minority loss; no split-brain	Minority partition cannot elect or commit

Capabilities in interviews

Leader election

Guarantee exactly one active coordinator across a cluster, with automatic failover.

Workers compete to create a lease/ephemeral node; exactly one wins and is leader. If the leader dies, its ephemeral key vanishes and a follower is elected — automatic failover with no human in the loop:

```text
acquire(/service/leader)  → one winner becomes leader (gets a fencing token)
leader dies → ephemeral key expires → next worker elected
```

This is how you get a single primary for a sharded job, a single scheduler, or a single writer — without the split-brain risk of homegrown election. Pass the fencing token to whatever the leader controls.
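The acquire, renew, and failover cycle can be simulated with a single lease and an explicit clock. This is a sketch under loud assumptions: `LeaseStore` is a hypothetical single-process stand-in (a real store would replicate this state through consensus), and time is passed in as a `now` argument so the example is deterministic.

```python
from typing import Optional


class LeaseStore:
    """Toy single-key lease; not replicated, for illustration only."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.expires_at = 0.0
        self.next_token = 1

    def try_acquire(self, worker: str, now: float, ttl: float) -> Optional[int]:
        """Return a fencing token if worker wins the lease, else None."""
        if self.holder is not None and now < self.expires_at:
            return None  # lease is still held and has not expired
        self.holder = worker
        self.expires_at = now + ttl
        token = self.next_token
        self.next_token += 1  # tokens only ever increase
        return token

    def renew(self, worker: str, now: float, ttl: float) -> bool:
        """Extend the lease; fails if the caller no longer holds it."""
        if self.holder != worker or now >= self.expires_at:
            return False
        self.expires_at = now + ttl
        return True


lease = LeaseStore()
a = lease.try_acquire("worker-a", now=0.0, ttl=10.0)   # wins the election
b = lease.try_acquire("worker-b", now=1.0, ttl=10.0)   # loses, lease is held
renewed = lease.renew("worker-a", now=5.0, ttl=10.0)   # lease now runs to 15.0
# worker-a crashes here and stops renewing.
b_retry = lease.try_acquire("worker-b", now=16.0, ttl=10.0)
print(a, b, renewed, b_retry, lease.holder)
```

Each successful acquisition hands out a larger token than the last, so the new leader's token always outranks whatever the crashed or stalled leader was holding.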

Choose this variant when

Single active scheduler / coordinator / primary
Failover without manual intervention
One-writer-per-shard designs

Distributed locks (with fencing)

Mutual exclusion across processes, safe against stalled holders via fencing tokens.

When only one process may act on a resource at a time across machines, a coordination lock provides it correctly:

```text
lock(/locks/order-42) → token 7 → do work, passing token 7 downstream
```

Crucially, the lease carries a monotonic fencing token the resource validates, so a holder that pauses past its lease cannot clobber the new holder's work. This is the airtight version of the "Redis lock" — when correctness genuinely matters (not just best-effort), you use a consensus-backed lock, not a TTL key.

Choose this variant when

Exactly-one-actor mutual exclusion that must be correct
Distributed cron / single-runner jobs
Protecting a resource from concurrent writers

Dynamic configuration

A consistent, watchable source of truth for config that updates across the fleet instantly.

Store feature flags, runtime settings, or topology in the coordination store; every node watches the keys and is notified shortly after they change, with no redeploy and no polling. Notifications arrive asynchronously, so nodes do not switch at the same instant, but every node observes the same sequence of values:

```text
set(/config/feature.x = on) → all watchers notified → consistent rollout
```

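Because writes are strongly consistent and totally ordered, every node converges on the same value, and no two nodes ever see updates in a different order. During the brief window while notifications propagate, some nodes may still hold the previous value, so a change that must flip atomically across the whole fleet needs more than a watch. Within that limit, this is the backbone of consistent dynamic configuration and controlled rollouts.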
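A watch-driven config client might look like the sketch below. `ConfigStore` and `ConfigClient` are hypothetical in-memory classes, not a real client library; the point is that each node keeps a local copy updated by pushed events, applies events by revision so stale ones are ignored, and serves reads locally.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[int, str, str]  # (revision, key, value)


class ConfigStore:
    """Toy config store that pushes every change to its subscribers."""

    def __init__(self) -> None:
        self.revision = 0
        self.values: Dict[str, str] = {}
        self.subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self.subscribers.append(callback)

    def put(self, key: str, value: str) -> int:
        self.revision += 1  # every write gets the next revision
        self.values[key] = value
        for callback in self.subscribers:
            callback((self.revision, key, value))
        return self.revision


class ConfigClient:
    """Keeps a local copy so request handlers never call the store."""

    def __init__(self, store: ConfigStore) -> None:
        self.local: Dict[str, str] = dict(store.values)  # initial snapshot
        self.applied_revision = store.revision
        store.subscribe(self.on_event)

    def on_event(self, event: Event) -> None:
        revision, key, value = event
        if revision <= self.applied_revision:
            return  # duplicate or out-of-date event, ignore it
        self.local[key] = value
        self.applied_revision = revision

    def get(self, key: str, default: str = "off") -> str:
        return self.local.get(key, default)


store = ConfigStore()
store.put("/config/feature.x", "off")
node_a, node_b = ConfigClient(store), ConfigClient(store)
store.put("/config/feature.x", "on")
print(node_a.get("/config/feature.x"), node_b.get("/config/feature.x"))
```

Reads in this design cost a dictionary lookup, which is what keeps the coordination store off the per-request path.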

Choose this variant when

Cluster-wide settings that must agree
Instant config propagation without redeploy
Coordinated, consistent feature rollouts

Service discovery & membership

Track which service instances are alive using ephemeral keys, so clients always find healthy nodes.

Each instance registers an ephemeral key on startup; when it dies (or its session lapses), the key disappears automatically, so the registry reflects live membership without stale entries:

```text
instance up → create ephemeral /services/api/<id>
instance dies → key auto-removed → clients re-resolve
```

Consumers watch the service path to discover the current healthy set. This register-and-watch pattern is how ZooKeeper- and Consul-based service registries work; Kubernetes applies the same idea through its API server, which persists cluster state in etcd.
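The ephemeral-key behaviour can be sketched as a registry whose entries expire unless the owning instance keeps heartbeating. `Registry`, its method names, and the addresses are hypothetical; a real store ties key lifetime to a client session or lease in the same spirit.

```python
from typing import Dict, List


class Registry:
    """Toy registry: each entry lives only while its session is renewed."""

    def __init__(self, session_timeout: float) -> None:
        self.session_timeout = session_timeout
        self.deadline: Dict[str, float] = {}  # instance id -> expiry time
        self.address: Dict[str, str] = {}     # instance id -> address

    def register(self, instance_id: str, addr: str, now: float) -> None:
        self.address[instance_id] = addr
        self.deadline[instance_id] = now + self.session_timeout

    def heartbeat(self, instance_id: str, now: float) -> bool:
        if instance_id not in self.deadline or now >= self.deadline[instance_id]:
            return False  # session already lapsed; the instance must re-register
        self.deadline[instance_id] = now + self.session_timeout
        return True

    def live_instances(self, now: float) -> List[str]:
        # Remove every entry whose session has expired, then list the rest.
        for instance_id in [i for i, d in self.deadline.items() if now >= d]:
            del self.deadline[instance_id]
            del self.address[instance_id]
        return sorted(self.address.values())


registry = Registry(session_timeout=5.0)
registry.register("api-1", "10.0.0.1:8080", now=0.0)
registry.register("api-2", "10.0.0.2:8080", now=0.0)
registry.heartbeat("api-1", now=4.0)  # api-1 stays alive
# api-2 crashes and never heartbeats again.
live = registry.live_instances(now=6.0)
print(live)
```

No one has to deregister the crashed instance: its entry disappears because nothing renewed it.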

Choose this variant when

Dynamic service registries / membership
Health-aware discovery without stale nodes
Cluster topology that changes as nodes come and go

Operating knobs

Delegate — never roll your own

The most important decision: use etcd, ZooKeeper, or Consul rather than implementing consensus. Raft and Paxos are subtle to implement correctly, and homegrown election or locking has race conditions you will only find in production. These systems have years of hardening, so reach for them and move on.

Cluster size (always odd)

3 nodes for most systems (tolerates 1 failure), 5 for critical ones (tolerates 2). Never use even numbers: 4 nodes cost more than 3 for the same fault tolerance, and 6 cost more than 5. Larger clusters tolerate more failures but make every write slower, because the leader must wait for more acknowledgements to reach quorum, so rarely go past 5.

Fencing tokens for any lock/leader

A lock or leadership without a fencing token is unsafe: a paused holder can act after losing the lock. Always issue a monotonically increasing token with the lease and have the protected resource reject stale tokens. This is non-optional for correctness, not an enhancement.

Keep it off the hot path

Consensus is correct but slow (multiple round trips per write). Use the coordination store for coordination events — leader changes, lock acquisition, config updates — not on every request. If you find yourself calling it per request, redesign so the coordination decision is made occasionally and cached.
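One way to keep a coordination read off the request path is to cache it for a bounded time. The sketch below assumes a hypothetical `fetch` callable standing in for the store lookup and an injected clock so the behaviour is deterministic; in real code the clock would typically be `time.monotonic`.

```python
from typing import Callable, Optional


class CachedLookup:
    """Caches the result of a coordination read for max_age seconds."""

    def __init__(self, fetch: Callable[[], str], max_age: float,
                 clock: Callable[[], float]) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at: Optional[float] = None
        self.store_calls = 0  # how often the store was actually consulted

    def get(self) -> Optional[str]:
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._max_age)
        if expired:
            self._value = self._fetch()
            self._fetched_at = now
            self.store_calls += 1
        return self._value


fake_now = [0.0]  # mutable cell so the lambda below sees updates
leader = CachedLookup(fetch=lambda: "worker-a", max_age=5.0,
                      clock=lambda: fake_now[0])

for _ in range(1000):   # a burst of requests within the cache window
    leader.get()
fake_now[0] = 6.0       # the cache window has passed
leader.get()
print(leader.store_calls)
```

A cached answer can be stale for up to `max_age`, which is acceptable for routing decisions but is another reason writes to the protected resource should still carry a fencing token.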

Versus the alternatives

Coordination service vs the alternatives.

Dimension	etcd / ZooKeeper	Redis lock	A regular database
Consistency	Strong (consensus, no split-brain)	Best-effort (no consensus)	Strong, but not built for coordination
Lock safety	Fencing tokens, correct	TTL only — unsafe under GC pause	Row locks work; awkward for leader election
Throughput	Low (metadata)	Very high	High
Primitives	Ephemeral keys, watches, leases	SET NX, basic	SELECT FOR UPDATE, advisory locks
Best for	Leader election, locks, config, discovery	Fast best-effort locks	App data, not coordination

Failure modes & gotchas

Rolling your own consensus

Implementing leader election or distributed locking from scratch produces subtle race conditions and split-brain bugs that surface only under failure. Delegate to etcd/ZooKeeper/Consul — proven, hardened systems — rather than re-deriving Raft incorrectly.

Locks without fencing tokens (Advanced)

A lock with only a TTL is unsafe: a holder that stalls (GC pause, slow disk) past the lease can resume and act while a new holder also holds the lock, corrupting the resource. Always issue a monotonic fencing token and validate it at the resource — this is the difference between a correct lock and a 2 AM data-corruption incident.

Even-numbered or oversized clusters

A 4-node ensemble tolerates the same single failure as 3 but needs a larger quorum — pure waste. Oversized clusters (7+) slow every write because more nodes must ack. Stick to 3 or 5.

Consensus on the hot path (Advanced)

Putting a coordination read/write on every request adds multiple round trips of latency and makes the store a bottleneck. Use it for occasional coordination decisions (who is leader, current config) and cache the result; never per request.

Storing real data in it

These stores are sized for small, critical metadata and low write throughput. Dumping application data or high-rate writes into them overwhelms the consensus layer and degrades the coordination everything else depends on. Keep it to kilobytes of coordination state.

In production

Kubernetes

etcd as the brain of every Kubernetes cluster

Standard Kubernetes runs on etcd: it is the single, strongly-consistent source of truth for all cluster state, including which pods should run where, services, config, secrets, and the desired state that the control loops reconcile against the actual state. The API server is the only component that talks to etcd directly; controllers watch resources through the API server, which relays etcd's watch events, so they react soon after state changes.

It is the definitive coordination-service case study because it shows exactly what these systems are for and what they are not: etcd holds a relatively small amount of critical metadata with strong consistency (Raft consensus over a 3- or 5-node ensemble), watched by many clients — and it is emphatically not where application data or high-throughput writes go. When people ask "what would actually use ZooKeeper or etcd?", Kubernetes is the answer that makes it concrete.

Takeaway: A coordination service holds small, critical, strongly-consistent state watched by many clients (Kubernetes' entire cluster state in etcd) — never application data or high write volume.

Apache Kafka

From ZooKeeper-coordinated to self-managed consensus (KRaft)

For most of its life, Kafka relied on ZooKeeper for all its coordination: electing the controller broker, tracking which brokers are alive, storing topic/partition metadata, and managing leader election for each partition's replicas. It is a canonical example of the pattern — Kafka delegated the hard "who is the leader, who is alive, what is the config" problems to a proven consensus service rather than building its own.

Notably, Kafka has since replaced ZooKeeper with KRaft, its own built-in Raft implementation, to simplify operations and scale metadata further. That evolution is itself the lesson: coordination is so critical and subtle that even Kafka first delegated it for a decade, and when it did bring it in-house, it implemented real consensus (Raft) rather than a homegrown scheme. The interview takeaway holds — you delegate coordination to consensus; you don't improvise it.

Takeaway: Even Kafka delegated leader election and membership to ZooKeeper for years — coordination is subtle enough that you use real consensus (delegated or Raft), never a homegrown scheme.

Good vs bad answer

Interviewer probe

“A distributed cron system must run each scheduled job exactly once across many worker nodes, even as workers crash. How do you coordinate?”

Weak answer

"Each worker checks a 'last run' timestamp in the database before running, and uses a Redis SET NX lock with a TTL so only one runs the job. If the lock expires, another worker can take over."

Strong answer

"I'd use a coordination service like etcd or ZooKeeper for leader election: the workers elect a single active scheduler via a lease/ephemeral node, and only the leader dispatches jobs — so 'exactly one runner' is guaranteed by consensus, not hoped for. If the leader crashes, its ephemeral key expires and a follower is elected automatically, with no split-brain because a minority partition can't win a majority quorum. For the actual job execution I'd hand out a fencing token with the leadership lease and have the job's side-effecting target reject any work carrying a stale token — so a leader that GC-pauses past its lease can't double-run a job after a new leader takes over. The cluster is 3 or 5 nodes, odd, and I keep coordination off the per-job hot path. A plain Redis SET NX lock is only best-effort — it has no consensus and no fencing, so a paused holder can let the same job run twice, which violates exactly-once. When correctness is the requirement, I delegate to a consensus-backed coordinator rather than rolling my own."

Why it wins: Delegates to a proven coordination service, uses leader election + automatic failover, adds fencing tokens for true exactly-once, gets quorum/cluster sizing right, keeps it off the hot path, and explains precisely why a Redis TTL lock is insufficient.

Interview playbook

2–3 min on coordination-heavy designs (locks, scheduler, primary election, config)

When it comes up

You need exactly one of something — one leader, one lock holder, one writer
Distributed cron / single-runner jobs, primary election, sharded coordinators
Dynamic config or service discovery that must be consistent across nodes
The interviewer asks "how do you avoid two leaders / two runners?"

Order of reveal

1. Delegate, do not build. I use a coordination service like etcd or ZooKeeper rather than implementing consensus; the proven systems handle the hard part.
2. Leader election with quorum. Workers elect one leader via a lease; a majority quorum on an odd 3- or 5-node cluster prevents split-brain.
3. Fencing tokens. The lease carries a monotonic token the protected resource validates, so a stalled old leader cannot act.
4. Automatic failover. Ephemeral keys vanish when a node dies, so a new leader is elected with no manual intervention.
5. Off the hot path. Consensus is slow, so I use it for coordination decisions and cache the result, never per request.

Signature phrases

“Delegate consensus to etcd or ZooKeeper — never roll your own.” — The single most important instinct here.
“Majority quorum on an odd cluster means no split-brain.” — Shows you understand why it is correct.
“Every lock carries a fencing token the resource validates.” — The detail that makes a distributed lock actually safe.
“Consensus is correct but slow — keep it off the request path.” — Demonstrates you respect the latency cost.

Likely follow-ups

“Why is a Redis SET NX lock not enough for this?”

Because it has no consensus and no fencing. Redis (single-node or even Redlock) gives a best-effort lock with a TTL, but a holder can stall past the TTL — a GC pause, a slow disk — while still believing it holds the lock; the TTL then lets a second worker acquire it, and now two processes act on the same resource. Without a fencing token the resource cannot tell the zombie holder to stop. A consensus-backed coordinator gives a correct lock and a monotonic fencing token the resource validates, which is what "exactly once / exactly one" actually requires. Redis locks are fine when best-effort is acceptable; they are not when correctness is the requirement.

“Why 3 or 5 nodes, and what happens in a network partition?”

Odd sizes maximise fault tolerance per node: 3 needs a quorum of 2 (tolerates 1 failure), 5 needs 3 (tolerates 2); a 4-node cluster needs 3 too, so it costs more for the same tolerance as 3. In a partition, only the side with a majority can form a quorum, so only it can elect a leader and commit writes — the minority side cannot, which is exactly how split-brain is prevented. When the partition heals, the minority rejoins and catches up from the leader. You never get two leaders both accepting writes, because both sides cannot simultaneously hold a majority.

“How is etcd related to Kubernetes?”

etcd is the source of truth for a Kubernetes cluster: all cluster state (which pods should run where, services, config, secrets) lives in etcd, and the control-plane components read, watch, and update it through the API server. Its strong consistency is why Kubernetes can make coherent scheduling decisions, and its watch primitive is how controllers learn of state changes without polling. It is the canonical production example of a coordination service: small critical state, strong consistency, watched by many clients. It is not a place for application data, but the brain that keeps the cluster coherent.

Worked example

Setup. A distributed cron system runs scheduled jobs across many worker nodes. Each job must run exactly once even as workers crash, and a job must never run on two workers at the same time.

The move. Use a coordination service (etcd or ZooKeeper) for leader election: the workers elect a single active scheduler by competing to acquire a lease / ephemeral node — exactly one wins. Only the leader dispatches jobs, so "exactly one runner" is guaranteed by consensus, not hoped for. If the leader crashes, its ephemeral key expires and a follower is elected automatically — failover with no human in the loop and no split-brain, because a minority partition can't reach a majority quorum.

Fencing. Electing a leader isn't enough: a leader that GC-pauses past its lease could resume and double-dispatch. So the lease carries a monotonically increasing fencing token, and the job-execution target rejects any dispatch carrying a token lower than the highest it has seen. A zombie ex-leader's writes are rejected, and the new leader's writes go through.

Cluster shape. The ensemble is 3 nodes (tolerates 1 failure) or 5 for critical systems (tolerates 2) — always odd, because 4 costs more than 3 for the same fault tolerance. I keep coordination off the hot path: the leader is elected occasionally and the result cached, not re-checked per job.

What breaks. The tempting wrong answer is a Redis `SET NX` lock with a TTL — but that's best-effort with no consensus and no fencing, so a paused holder can let the same job run twice, violating exactly-once. The other failure is dumping real data into the store: it's sized for small, critical metadata at low write throughput, so it stays coordination-only.

The result. Exactly-one active scheduler, automatic failover with no split-brain, fencing tokens that make double-execution impossible, on a 3-node consensus cluster kept off the hot path — correct coordination delegated to a proven system instead of hand-rolled.

Cheat sheet

•Coordination service = strongly-consistent metadata store for leader election, locks, config, discovery.
•Consensus (ZAB/Raft) over an odd ensemble: 3 tolerates 1 failure, 5 tolerates 2. Majority quorum, no split-brain.
•Delegate — never implement consensus yourself. etcd, ZooKeeper, Consul are hardened.
•Primitives: ephemeral keys (liveness/locks), watches (no polling), leases (TTL), compare-and-swap.
•Locks/leaders MUST carry a fencing token the resource validates — TTL alone is unsafe.
•Strong consistency, low throughput, small data. Store coordination state, not app data.
•Keep it off the hot path — consensus is correct but slow (multiple round trips per write).
•etcd backs Kubernetes; ZooKeeper backs Kafka (historically), HBase, and many clusters.

Drills

Why must distributed locks from a coordination service carry a fencing token?

Because a lock holder can stall — a GC pause, a slow disk, a long syscall — past its lease while still believing it holds the lock. The coordination service, seeing the lease expire, lets another process acquire the lock; now two processes think they hold it. Without a fencing token the protected resource cannot distinguish them, and the zombie holder can corrupt state. A monotonically increasing fencing token issued with each lease, validated by the resource (reject anything below the highest token seen), guarantees the stale holder's writes are rejected. This is the difference between a lock that looks correct and one that is correct.

Interviewer: "Your 5-node etcd cluster has 3 nodes partitioned from the other 2. Who can serve writes?"

Only the 3-node side. A write needs a majority quorum, which for 5 nodes is 3. The 3-node partition has a majority, so it can elect/keep a leader and commit writes; the 2-node side cannot reach 3, so it cannot elect a leader or accept writes — it can at most serve stale reads. This is precisely how consensus prevents split-brain: both sides can never simultaneously hold a majority, so there is never more than one leader accepting writes. When the partition heals, the 2-node side rejoins as followers and catches up.
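The partition reasoning in this answer reduces to one comparison per side. The helper below is an illustrative sketch; `writable_sides` is a made-up name, and it assumes the partition sizes describe disjoint groups of the same cluster.

```python
from typing import List


def writable_sides(cluster_size: int, partition_sizes: List[int]) -> List[bool]:
    """For each side of a partition, can it form a majority and commit writes?"""
    if sum(partition_sizes) > cluster_size:
        raise ValueError("partitions contain more nodes than the cluster")
    majority = cluster_size // 2 + 1
    return [size >= majority for size in partition_sizes]


print(writable_sides(5, [3, 2]))  # the 3-node side keeps a quorum
print(writable_sides(4, [2, 2]))  # an even split leaves no side with a quorum
```

Because two disjoint sides cannot both contain more than half the nodes, at most one entry in the result can ever be true.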

When would you NOT reach for a coordination service?

When you do not actually need strong coordination. If eventual consistency is acceptable (most application data — feeds, profiles, analytics), consensus is wasted cost and latency. If you only need a best-effort lock and already run Redis, a SET NX lock may be good enough (accepting it is not airtight). If the "coordination" is really just storing data, use a database. Coordination services are for the specific, narrow problems of single-leadership, correct mutual exclusion, consistent config, and live membership — and they are deliberately low-throughput, so you keep them off hot paths and out of bulk-data roles.
