Coordination Patterns
The recipes that make ZooKeeper useful — leader election, distributed locks, service discovery, configuration management, barriers, and queues. These patterns are the reason ZooKeeper exists.
Leader Election
Leader election is the most common ZooKeeper pattern. The goal: exactly one process is the leader at any time. If the leader crashes, a new one is elected automatically. The key insight is using ephemeral sequential nodes to create a fair, scalable election without the "herd effect."
The Numbered Ticket System
Imagine a group of people competing to be king. Each person takes a numbered ticket (ephemeral sequential node). The person with the lowest number is king. Everyone else watches the person with the number just below theirs. If the king leaves (session expires), their ticket disappears, and the person with the next lowest number becomes king — without everyone rushing to check at once.
Create Election Node
Each candidate creates an EPHEMERAL_SEQUENTIAL node under /election: /election/candidate-0000000001, /election/candidate-0000000002, etc.
Get All Children
Call getChildren('/election') to see all candidates and their sequence numbers.
Check If Leader
If your node has the lowest sequence number, you are the leader. Start performing leader duties.
Watch Previous Node
If not the leader, set a watch on the node with the next-lower sequence number (not the leader!). This avoids the herd effect.
Handle Watch Event
When the watched node is deleted (previous candidate crashed or stepped down), re-check if you now have the lowest number.
```
Leader Election — Avoiding the Herd Effect

Candidates: [candidate-0001, candidate-0002, candidate-0003, candidate-0004]

Watch chain (each watches the one before it):
  candidate-0001 (LEADER)  ← watched by candidate-0002
  candidate-0002           ← watched by candidate-0003
  candidate-0003           ← watched by candidate-0004
  candidate-0004

When candidate-0001 (leader) crashes:
  1. /election/candidate-0001 is deleted (ephemeral)
  2. ONLY candidate-0002 gets notified (it was watching 0001)
  3. candidate-0002 checks: "Am I lowest?" → YES → becomes leader
  4. candidate-0003 and 0004 are NOT disturbed

Compare to the naive approach (everyone watches the leader):
  1. Leader crashes
  2. ALL candidates get notified simultaneously
  3. ALL candidates call getChildren at once (thundering herd)
  4. Server overwhelmed with N simultaneous requests

The "watch previous" pattern: O(1) notifications per failure
The "watch leader" pattern:  O(N) notifications per failure
```
```typescript
// Leader election implementation (pseudocode)
async function participateInElection(zk: ZooKeeper) {
  // Step 1: Create our candidate node
  const myNode = await zk.create(
    "/election/candidate-",
    myInfo,
    EPHEMERAL_SEQUENTIAL
  );
  const mySeq = extractSequence(myNode); // e.g., 0000000003

  async function checkLeadership() {
    // Step 2: Get all candidates
    const children = await zk.getChildren("/election");
    const sorted = children.sort(); // zero-padded names sort by sequence number

    if (sorted[0] === basename(myNode)) {
      // Step 3: I have the lowest number — I'm the leader!
      onBecomeLeader();
      return;
    }

    // Step 4: Watch the node just before me
    const myIndex = sorted.indexOf(basename(myNode));
    const previousNode = sorted[myIndex - 1];
    const exists = await zk.exists(`/election/${previousNode}`, (event) => {
      // Step 5: Previous node deleted — re-check
      checkLeadership();
    });

    // Edge case: previous node already gone before the watch was set
    if (!exists) {
      checkLeadership();
    }
  }

  await checkLeadership();
}
```
Herd Effect Avoidance is Critical
In a cluster with 100 candidates, the naive "everyone watches the leader" approach causes 99 simultaneous getChildren calls when the leader fails. The "watch previous" approach causes exactly 1 notification. This is the difference between O(N) and O(1) load on the ensemble during failover.
Distributed Lock
Distributed locks provide mutual exclusion across machines. The ZooKeeper implementation uses the same ephemeral sequential pattern as leader election, but with acquire/release semantics rather than permanent leadership.
| Lock Type | Mechanism | Fairness | Use Case |
|---|---|---|---|
| Simple (non-fair) | Single ephemeral node, all contenders watch it | ❌ No ordering | Low contention, simple cases |
| Fair (queue-based) | Ephemeral sequential, watch previous | ✅ FIFO order | High contention, ordered access |
| Read-Write Lock | Separate read/write sequential nodes | ✅ Readers concurrent, writers exclusive | Read-heavy workloads |
```
Fair Distributed Lock (Queue-Based)

Lock path: /locks/resource-X

Acquire lock:
  1. Create EPHEMERAL_SEQUENTIAL: /locks/resource-X/lock-0000000005
  2. getChildren("/locks/resource-X") → [lock-0003, lock-0004, lock-0005]
  3. Am I the lowest? (lock-0003 is lowest, not me)
  4. Watch lock-0004 (the one just before me)
  5. Wait for notification...

When lock-0004 is deleted:
  6. getChildren again → [lock-0003, lock-0005]
  7. Still not lowest (lock-0003 exists)
  8. Watch lock-0003

When lock-0003 is deleted:
  9. getChildren → [lock-0005]
  10. I'm the lowest! → LOCK ACQUIRED ✅

Release lock:
  11. Delete /locks/resource-X/lock-0005
  12. Next waiter (lock-0006) gets notified → acquires lock

Crash recovery (automatic):
  If lock holder crashes → ephemeral node deleted → next waiter proceeds
  No deadlocks from dead processes!

Read-Write Lock:
  Writers: /locks/resource-X/write-0000000001
  Readers: /locks/resource-X/read-0000000002
  Writer acquires: must be lowest of ALL nodes (read + write)
  Reader acquires: must be lowest of all WRITE nodes before it
  Multiple readers can hold the lock simultaneously
```
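The acquire walkthrough above boils down to one decision per `getChildren` call: am I the lowest, and if not, whom do I watch? A minimal sketch of that decision logic, assuming a plain Python list of child names stands in for the `getChildren` result (the helper name `lock_status` is ours, not a ZooKeeper API):

```python
def lock_status(children, my_node):
    """Given the children of the lock path and our own node name,
    return ("acquired", None) or ("waiting", node_to_watch)."""
    # Sort by the numeric sequence suffix of each node name
    ordered = sorted(children, key=lambda n: int(n.rsplit("-", 1)[1]))
    idx = ordered.index(my_node)
    if idx == 0:
        return ("acquired", None)           # lowest sequence number holds the lock
    return ("waiting", ordered[idx - 1])    # watch only the node just before us

# Walkthrough matching steps 1-5 above:
children = ["lock-0000000003", "lock-0000000004", "lock-0000000005"]
print(lock_status(children, "lock-0000000005"))  # ('waiting', 'lock-0000000004')
```

A real client would set a watch on the returned node and re-run this check whenever the watch fires; the function itself never watches the current lock holder, which is what avoids the herd effect.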
Fencing Tokens
A subtle problem: if a lock holder has a long GC pause, its session might expire and another process acquires the lock. When the original process resumes, it still thinks it holds the lock. Solution: use the znode's czxid as a fencing token. Pass it to any resource you're protecting — the resource rejects operations with stale (lower) fencing tokens.
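What the token check looks like on the resource side can be sketched as follows, assuming the czxid is available as a plain integer (czxids are drawn from zxids, which increase monotonically). The `FencedResource` class is illustrative, not part of any ZooKeeper client:

```python
class FencedResource:
    """A protected resource that rejects writes from stale lock holders."""

    def __init__(self):
        self.highest_token = -1   # highest fencing token seen so far
        self.data = {}

    def write(self, token, key, value):
        # A stale holder (e.g. one that resumed after a GC pause, whose
        # session expired and whose lock was re-acquired by someone else)
        # carries an older czxid than the current holder, so it is fenced off.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value

r = FencedResource()
r.write(33, "balance", 100)   # holder A, lock node czxid 33
r.write(34, "balance", 250)   # holder B acquired after A's session expired
try:
    r.write(33, "balance", 999)   # A resumes after GC pause: rejected
except PermissionError as err:
    print(err)
```

The key property is that the check lives in the resource, not the lock: even if the stale holder believes it owns the lock, its writes cannot do damage.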
Lock Implementation Rules
- ✅ Always use EPHEMERAL nodes — ensures lock is released if holder crashes
- ✅ Use SEQUENTIAL for fairness — prevents starvation of waiting processes
- ✅ Watch only the previous node — avoids herd effect on lock release
- ✅ Implement fencing tokens — protects against stale lock holders after GC pauses
- ✅ Set reasonable session timeouts — too short = false lock release, too long = slow recovery
Service Discovery & Group Membership
Service discovery answers: "which instances of service X are alive right now?" Group membership answers: "which nodes are part of this cluster?" Both use the same pattern: ephemeral nodes for registration, getChildren + watches for discovery.
Service Starts
Service instance creates an ephemeral node: /services/payment/instance-1 with data containing host:port.
Consumers Discover
Consumers call getChildren('/services/payment', watch=true) to get all live instances.
Instance Crashes
When an instance crashes, its ephemeral node is automatically deleted after session timeout.
Consumers Notified
The child watch fires, consumers re-read getChildren to get the updated list of live instances.
New Instance Joins
A new instance creates its ephemeral node. Child watch fires again, consumers discover the new instance.
```
Service Discovery Pattern

ZooKeeper namespace:
  /services
  ├── payment
  │   ├── instance-1   data: {"host":"10.0.1.5","port":8080,"version":"2.1"}
  │   ├── instance-2   data: {"host":"10.0.1.6","port":8080,"version":"2.1"}
  │   └── instance-3   data: {"host":"10.0.2.1","port":8080,"version":"2.2"}
  ├── inventory
  │   ├── instance-1   data: {"host":"10.0.1.7","port":9090}
  │   └── instance-2   data: {"host":"10.0.2.2","port":9090}
  └── notification
      └── instance-1   data: {"host":"10.0.1.8","port":7070}

All instance nodes are EPHEMERAL:
  - Instance crashes → node auto-deleted → consumers notified
  - Instance gracefully shuts down → deletes node → consumers notified
  - No stale registrations (unlike DNS-based discovery with TTLs)

Consumer pattern:
  1. getChildren("/services/payment", watch=true)
     → ["instance-1", "instance-2", "instance-3"]
  2. For each instance: getData to get host:port
  3. Build connection pool to all instances
  4. On watch event: re-read, update connection pool

Advantages over DNS-based discovery:
  ✅ Instant notification (vs DNS TTL delay)
  ✅ No stale entries (ephemeral = auto-cleanup)
  ✅ Rich metadata in node data (version, capabilities)
  ✅ Consistent view (all consumers see same state)
```
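The consumer pattern above can be sketched as follows. `FakeClient` is an in-memory stand-in for a real ZooKeeper client (method names loosely modeled on kazoo's, not a real API), and it reproduces one-shot watch semantics, so the crucial re-register step in `_refresh` is visible:

```python
class FakeClient:
    """In-memory stand-in for a ZooKeeper client (sketch only)."""

    def __init__(self, tree):
        self.tree = dict(tree)   # {path: data}
        self.watch = None        # at most one pending child watch

    def get_children(self, path, watch=None):
        self.watch = watch       # registering a watch is a one-shot operation
        prefix = path + "/"
        return sorted(p[len(prefix):] for p in self.tree if p.startswith(prefix))

    def get(self, path):
        return self.tree[path]

    def delete(self, path):
        del self.tree[path]
        if self.watch:
            cb, self.watch = self.watch, None   # watch fires once, then is gone
            cb()

class ServiceWatcher:
    """Keeps a live view of service instances, re-registering its watch."""

    def __init__(self, client, path):
        self.client = client
        self.path = path
        self.instances = {}
        self._refresh()

    def _refresh(self, event=None):
        # Re-register the watch on every refresh; forgetting this step is
        # the "watch re-registration" mistake described later in this page.
        children = self.client.get_children(self.path, watch=self._refresh)
        self.instances = {c: self.client.get(f"{self.path}/{c}") for c in children}

zk = FakeClient({
    "/services/payment/instance-1": "10.0.1.5:8080",
    "/services/payment/instance-2": "10.0.1.6:8080",
})
watcher = ServiceWatcher(zk, "/services/payment")
zk.delete("/services/payment/instance-1")   # crash → ephemeral node gone
print(sorted(watcher.instances))            # ['instance-2']
```

The deletion triggers the watch, `_refresh` rebuilds the instance map, and a fresh watch is left in place for the next membership change.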
Group Membership vs Service Discovery
They're the same pattern with different semantics. Service discovery: "which instances can I send requests to?" Group membership: "which nodes are part of my cluster for internal coordination?" Both use ephemeral nodes + child watches. The difference is who's consuming the information.
Configuration Management
ZooKeeper excels at distributed configuration management. Store config in a persistent znode, and all consumers watch it. When config changes, all consumers are notified instantly — no polling, no stale reads, no deployment required.
```
Configuration Management Pattern

Structure:
  /config
  ├── database        data: {"host":"db.prod","port":5432,"pool_size":20}
  ├── feature-flags   data: {"dark_mode":true,"new_checkout":false}
  ├── rate-limits     data: {"api_rpm":1000,"upload_mb":50}
  └── maintenance     data: {"enabled":false,"message":""}

Update flow:
  1. Admin updates: setData("/config/feature-flags", newFlags, version)
  2. ZooKeeper replicates via Zab (consistent across ensemble)
  3. All watchers receive NodeDataChanged event
  4. Each service re-reads: getData("/config/feature-flags", watch=true)
  5. Services apply new config immediately

Advantages:
  ✅ Push-based — no polling interval, instant propagation
  ✅ Consistent — all services see the same config (sequential consistency)
  ✅ Versioned — version field prevents lost updates (CAS)
  ✅ Auditable — mzxid and mtime track when config last changed
  ✅ No deployment — change config without restarting services

Atomic multi-config update (multi() operation):
  // Update database AND rate-limits atomically
  zk.multi([
    Op.setData("/config/database", newDbConfig, dbVersion),
    Op.setData("/config/rate-limits", newLimits, limitsVersion),
  ]);
  // Either both succeed or both fail — no partial config state
```
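The version-check step can be sketched with an in-memory znode standing in for `/config/feature-flags`. `ConfigNode` is illustrative; it mirrors the `setData(path, data, version)` compare-and-set semantics, where a mismatched version raises instead of silently overwriting a concurrent admin's change:

```python
class ConfigNode:
    """In-memory stand-in for a config znode with CAS semantics."""

    def __init__(self, data):
        self.data = data
        self.version = 0   # znode data version, bumped on every setData

    def set_data(self, new_data, expected_version):
        # Mirrors setData's version check: fail on mismatch rather than
        # clobber an update we haven't seen.
        if expected_version != self.version:
            raise RuntimeError("BadVersion: config changed since it was read")
        self.data = new_data
        self.version += 1

flags = ConfigNode({"dark_mode": True, "new_checkout": False})

v = flags.version   # admins A and B both read version 0
flags.set_data({"dark_mode": False, "new_checkout": False}, v)   # A wins
try:
    # B, still holding version 0, gets BadVersion instead of losing A's update
    flags.set_data({"dark_mode": True, "new_checkout": True}, v)
except RuntimeError as err:
    print(err)
```

On a BadVersion error the losing admin re-reads the node (getting the new data and version) and decides whether their change still applies.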
Config Management Best Practices
- ✅ Use version checks on updates — prevents concurrent admin changes from overwriting each other
- ✅ Keep config data small — JSON with essential settings, not entire application configs
- ✅ Use multi() for related config changes — ensures atomic updates across multiple znodes
- ✅ Implement fallback defaults — if ZK is unavailable, services should use last-known-good config
- ✅ Watch for NodeDeleted too — handle the case where a config node is accidentally removed
Barrier Synchronization
A barrier is a synchronization point where N processes must all arrive before any can proceed. ZooKeeper implements this with a "double barrier" pattern — processes enter the barrier, wait for all to arrive, do work, then exit the barrier together.
The Starting Line
A barrier is like a race starting line. All runners (processes) must be at the line before the gun fires. Each runner registers their presence (creates a node). A referee (watch on children count) checks if all runners are present. When the last runner arrives, the gun fires (barrier opens) and everyone proceeds simultaneously.
```
Double Barrier Pattern

Phase 1: ENTER (wait for all N processes to arrive)
  1. Each process creates: /barrier/job-123/process-X (ephemeral)
  2. Each process calls: getChildren("/barrier/job-123")
  3. If numChildren < N: set watch, wait
  4. When numChildren == N: barrier is open, all proceed

Phase 2: DO WORK
  All processes perform their computation in parallel

Phase 3: EXIT (wait for all N processes to finish)
  1. Each process deletes its node when done
  2. Each process calls: getChildren("/barrier/job-123")
  3. If numChildren > 0: set watch, wait
  4. When numChildren == 0: all done, barrier complete

Example: MapReduce phase transition
  - 10 mappers must all finish before reducers start
  - Each mapper creates /barrier/map-phase/mapper-X
  - When all 10 exist → map phase complete
  - Reducers were watching → they start
  - Each reducer creates /barrier/reduce-phase/reducer-X
  - When all reducers finish → job complete

Edge case: What if a process crashes mid-barrier?
  - Its ephemeral node is deleted
  - numChildren decreases
  - Barrier may never reach N → need timeout/abort logic
```
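The enter/exit bookkeeping can be sketched with a Python set standing in for the children of `/barrier/job-123`. `DoubleBarrier` is illustrative; a real implementation would block on child watches rather than inspect return values:

```python
class DoubleBarrier:
    """In-memory sketch of the double barrier's child-count logic."""

    def __init__(self, n):
        self.n = n             # number of processes that must participate
        self.present = set()   # stands in for the barrier node's children

    def enter(self, proc):
        self.present.add(proc)                # create ephemeral /barrier/job/proc
        return len(self.present) == self.n    # True once all N have arrived

    def leave(self, proc):
        self.present.discard(proc)            # delete our node when work is done
        return len(self.present) == 0         # True once everyone has left

b = DoubleBarrier(3)
opened = [b.enter(p) for p in ("p1", "p2", "p3")]
print(opened)   # [False, False, True] — barrier opens on the last arrival
```

In a real deployment each `False` return would be replaced by "set a watch and wait", and, per the edge case above, the wait needs a timeout in case a participant crashes before arriving.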
Barriers in Practice
Barriers are less common than leader election or locks, but they're essential for batch processing systems. Hadoop uses them for MapReduce phase transitions. Any system where "all workers must complete phase 1 before phase 2 begins" needs a barrier.
Queue Pattern
ZooKeeper can implement a simple FIFO queue using sequential nodes. Producers create sequential nodes, consumers process and delete the lowest-numbered node. However, this pattern has significant limitations and is generally not recommended for high-throughput use cases.
```
FIFO Queue Pattern

Queue path: /queue/job-processing

Producer (enqueue):
  zk.create("/queue/job-processing/task-", jobData, PERSISTENT_SEQUENTIAL);
  // Creates: /queue/job-processing/task-0000000001
  //          /queue/job-processing/task-0000000002
  //          /queue/job-processing/task-0000000003

Consumer (dequeue):
  1. children = getChildren("/queue/job-processing")
  2. Sort children by sequence number
  3. Try to process lowest: task-0000000001
  4. Delete task-0000000001 (use version check for atomicity)
  5. If delete fails (another consumer got it) → retry with next

Priority Queue:
  Use different prefixes: /queue/high-001, /queue/med-002, /queue/low-003
  Consumers process high-* first, then med-*, then low-*

⚠️ LIMITATIONS — Why NOT to use ZK as a queue:
  - No consumer groups (every consumer sees all messages)
  - No replay (deleted = gone forever)
  - getChildren returns ALL items (expensive with many items)
  - No backpressure mechanism
  - ZK is not designed for high-throughput data movement
  - Thousands of znodes = memory pressure on ensemble

✅ USE INSTEAD:
  - Kafka: high-throughput, consumer groups, replay, partitioned
  - RabbitMQ: routing, acknowledgments, dead letter queues
  - SQS: managed, scalable, no infrastructure to maintain
```
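The dequeue-with-retry step can be sketched with a dict of sequential node names standing in for the queue path. `dict.pop` models the atomic delete that claims a task; the `KeyError` branch models the `NoNode` error a real client sees when a competing consumer claims the task first:

```python
def dequeue(queue):
    """Claim and return the lowest-numbered task, or None if the queue is empty."""
    while queue:
        # Sort by the numeric sequence suffix, as a consumer would after getChildren
        lowest = min(queue, key=lambda n: int(n.rsplit("-", 1)[1]))
        try:
            data = queue.pop(lowest)   # atomic delete = claim; models delete()
        except KeyError:
            continue                   # another consumer won the race; retry
        return lowest, data
    return None

q = {"task-0000000002": "send-email", "task-0000000001": "charge-card"}
print(dequeue(q))   # ('task-0000000001', 'charge-card')
```

Note that even this sketch makes the cost visible: every dequeue scans all outstanding items, which is exactly the `getChildren` expense called out in the limitations above.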
When ZK Queues Are Acceptable
ZooKeeper queues work for very low-volume coordination tasks: distributing a handful of jobs among workers, or implementing a simple task assignment system with <100 items. For anything higher volume, use a purpose-built message queue.
Two-Phase Commit Coordination
ZooKeeper can coordinate a two-phase commit (2PC) protocol across distributed participants. The coordinator creates a transaction node, participants vote by creating child nodes, and the coordinator makes the commit/abort decision based on votes.
```
Two-Phase Commit via ZooKeeper

Phase 1: PREPARE (voting)
  Coordinator:
    1. Creates /transactions/txn-42 (persistent)
    2. Creates /transactions/txn-42/participants with list of expected voters
    3. Notifies participants (via watch or external message)
  Each participant:
    4. Prepares locally (acquire locks, validate)
    5. Votes: creates /transactions/txn-42/votes/participant-X
       data: "COMMIT" or "ABORT"
  Coordinator:
    6. Watches /transactions/txn-42/votes (child watch)
    7. Waits until all expected participants have voted

Phase 2: DECIDE
  If all votes are "COMMIT":
    8. Coordinator sets /transactions/txn-42 data to "COMMIT"
    9. Participants watch this node → see COMMIT → apply changes
  If any vote is "ABORT" (or timeout):
    8. Coordinator sets /transactions/txn-42 data to "ABORT"
    9. Participants watch this node → see ABORT → rollback

Crash recovery:
  - Coordinator crashes before decision: participants see no decision,
    time out and abort (safe)
  - Coordinator crashes after decision: decision is persisted in ZK,
    participants can read it on recovery
  - Participant crashes before voting: coordinator times out, aborts
  - Participant crashes after voting: recovers, reads decision from ZK
```
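The coordinator's decision rule can be sketched as a pure function over the collected votes. `decide` is illustrative; in the flow above, `expected` would come from `/transactions/txn-42/participants` and `votes` from the children of the votes node:

```python
def decide(expected, votes):
    """Return "COMMIT", "ABORT", or None if votes are still outstanding.

    expected: iterable of participant names that must vote
    votes:    dict mapping participant name -> "COMMIT" or "ABORT"
    """
    if set(votes) != set(expected):
        return None   # still waiting; a real coordinator aborts on timeout
    if all(v == "COMMIT" for v in votes.values()):
        return "COMMIT"
    return "ABORT"    # any single ABORT vote dooms the transaction

votes = {"svc-a": "COMMIT", "svc-b": "COMMIT"}
print(decide(["svc-a", "svc-b", "svc-c"], votes))   # None — svc-c hasn't voted
votes["svc-c"] = "ABORT"
print(decide(["svc-a", "svc-b", "svc-c"], votes))   # ABORT
```

Once this function returns a non-None result, the coordinator writes it to the transaction znode, which is the step that makes the decision durable across coordinator crashes.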
| Aspect | 2PC Without ZK | 2PC With ZK |
|---|---|---|
| Coordinator failure | Participants blocked indefinitely | Decision persisted, recoverable |
| Participant discovery | Hardcoded list | Dynamic via group membership |
| Vote collection | Custom protocol | Child watches on votes node |
| Decision broadcast | Point-to-point messages | Single znode update + watches |
| Recovery | Complex log-based | Read decision from ZK |
2PC Limitations Remain
ZooKeeper makes 2PC easier to implement but doesn't solve its fundamental limitations: it's blocking (participants hold locks while waiting for decision) and has a single point of failure (the coordinator). For most distributed transactions, consider Saga patterns instead.
Interview Questions
Q: How does ZooKeeper leader election avoid the herd effect?
A: Instead of all candidates watching the leader node (which causes N notifications when the leader fails), each candidate watches only the node with the next-lower sequence number. This creates a chain: candidate-0003 watches candidate-0002, which watches candidate-0001 (leader). When the leader fails, only candidate-0002 is notified — it checks if it's now the lowest and becomes leader. This is O(1) notifications per failure instead of O(N). The pattern is called 'watch previous' and is critical for scalability with many candidates.
Q: Explain how to implement a fair distributed lock with ZooKeeper.
A: (1) Create EPHEMERAL_SEQUENTIAL node under /locks/resource: /locks/resource/lock-0000000005. (2) getChildren to see all lock nodes. (3) If yours is the lowest → you hold the lock. (4) If not, watch the node just before yours (not the lock holder — avoids herd effect). (5) When that node is deleted, re-check if you're now lowest. Release: delete your node. Crash safety: ephemeral node auto-deleted on session expiry → lock auto-released. Fairness: FIFO order guaranteed by sequence numbers. Add fencing tokens (czxid) to protect against stale lock holders after GC pauses.
Q: How does service discovery work with ZooKeeper?
A: Registration: each service instance creates an EPHEMERAL node under /services/service-name/ with host:port data. Discovery: consumers call getChildren('/services/service-name', watch=true) to get all live instances. Failure detection: when an instance crashes, its ephemeral node is auto-deleted after session timeout. The child watch fires, consumers re-read the list. Advantages over DNS: instant notification (no TTL delay), no stale entries (ephemeral = auto-cleanup), rich metadata in node data, consistent view across all consumers.
Q: What is the double barrier pattern and when would you use it?
A: A double barrier synchronizes N processes at two points: entry (all must arrive before any proceeds) and exit (all must finish before the group moves on). Implementation: (1) Enter: each process creates an ephemeral child under /barrier/X. Watch children count. When count == N, barrier opens. (2) Exit: each process deletes its node when done. Watch children count. When count == 0, all have finished. Use cases: MapReduce phase transitions (all mappers finish before reducers start), batch processing stages, distributed test coordination. Edge case: if a process crashes, its ephemeral node is deleted — need timeout/abort logic.
Q: Why shouldn't you use ZooKeeper as a message queue?
A: ZooKeeper queues (sequential nodes) have severe limitations: (1) No consumer groups — every consumer sees all messages. (2) No replay — deleted messages are gone. (3) getChildren returns ALL items — O(N) per dequeue operation. (4) All nodes in memory — thousands of queue items = memory pressure. (5) No backpressure. (6) Write throughput limited by Zab consensus. ZK queues are acceptable for <100 low-frequency coordination tasks. For real queuing, use Kafka (high throughput, replay, partitions), RabbitMQ (routing, DLQ), or SQS (managed, scalable).
Common Mistakes
Herd effect in leader election
All candidates watch the leader node directly. When the leader fails, all N candidates simultaneously call getChildren, overwhelming the ensemble.
✅ Each candidate watches only the node with the next-lower sequence number. This creates a chain where only one candidate is notified per failure — O(1) instead of O(N).
Not using fencing tokens with distributed locks
Assuming that holding a lock means exclusive access forever. If a GC pause causes session expiry, another process acquires the lock while the original still thinks it holds it.
✅ Use the lock node's czxid as a fencing token. Pass it to protected resources. Resources reject operations with stale (lower) fencing tokens. This detects and prevents stale lock holders from causing damage.
Using ZooKeeper as a high-throughput queue
Creating thousands of sequential nodes for job processing. getChildren becomes expensive, memory grows, and write throughput is limited by consensus.
✅ Use ZK queues only for very low-volume coordination (<100 items). For real queuing workloads, use Kafka, RabbitMQ, or SQS. ZooKeeper is a coordination service, not a data pipeline.
Not handling watch re-registration in service discovery
Setting a child watch once on /services/X and assuming it will keep notifying. After the first event, the watch is gone — new instances joining or leaving are silently missed.
✅ Always re-register the watch in the callback: watch fires → getChildren with new watch → update service list. Or use persistent recursive watches (3.6+) which don't require re-registration.
No timeout in barrier implementations
Waiting indefinitely for all N processes to arrive at a barrier. If one process crashes before creating its node, the barrier never opens — all other processes hang forever.
✅ Implement a timeout: if the barrier hasn't opened within X seconds, abort and clean up. Also handle the case where a process's ephemeral node disappears (crash) — adjust N or abort the barrier.