Coordination Patterns
The recipes that make ZooKeeper useful — leader election, distributed locks, service discovery, configuration management, barriers, and queues. These patterns are the reason ZooKeeper exists.
Leader Election
Leader election is the most common ZooKeeper pattern. The goal: exactly one process is the leader at any time. If the leader crashes, a new one is elected automatically. The key insight is using ephemeral sequential nodes to create a fair, scalable election without the "herd effect."
The Numbered Ticket System
Imagine a group of people competing to be king. Each person takes a numbered ticket (ephemeral sequential node). The person with the lowest number is king. Everyone else watches the person with the number just below theirs. If the king leaves (session expires), their ticket disappears, and the person with the next lowest number becomes king — without everyone rushing to check at once.
Create Election Node
Each candidate creates an EPHEMERAL_SEQUENTIAL node under /election: /election/candidate-0000000001, /election/candidate-0000000002, etc.
Get All Children
Call getChildren('/election') to see all candidates and their sequence numbers.
Check If Leader
If your node has the lowest sequence number, you are the leader. Start performing leader duties.
Watch Previous Node
If not the leader, set a watch on the node with the next-lower sequence number (not the leader!). This avoids the herd effect.
Handle Watch Event
When the watched node is deleted (previous candidate crashed or stepped down), re-check if you now have the lowest number.
```
Leader Election — Avoiding the Herd Effect

Candidates: [candidate-0001, candidate-0002, candidate-0003, candidate-0004]

Watch chain (each watches the one before it):
  candidate-0001 (LEADER)  ← watched by candidate-0002
  candidate-0002           ← watched by candidate-0003
  candidate-0003           ← watched by candidate-0004
  candidate-0004

When candidate-0001 (leader) crashes:
  1. /election/candidate-0001 is deleted (ephemeral)
  2. ONLY candidate-0002 gets notified (it was watching 0001)
  3. candidate-0002 checks: "Am I lowest?" → YES → becomes leader
  4. candidate-0003 and 0004 are NOT disturbed

Compare to the naive approach (everyone watches the leader):
  1. Leader crashes
  2. ALL candidates get notified simultaneously
  3. ALL candidates call getChildren at once (thundering herd)
  4. Server overwhelmed with N simultaneous requests

The "watch previous" pattern: O(1) notifications per failure
The "watch leader" pattern:  O(N) notifications per failure
```
```typescript
// Leader election implementation (pseudocode)
async function participateInElection(zk: ZooKeeper) {
  // Step 1: Create our candidate node
  const myNode = await zk.create(
    "/election/candidate-",
    myInfo,
    EPHEMERAL_SEQUENTIAL
  );
  const mySeq = extractSequence(myNode); // e.g., 0000000003

  async function checkLeadership() {
    // Step 2: Get all candidates
    const children = await zk.getChildren("/election");
    const sorted = children.sort(); // zero-padded names sort by sequence number

    if (sorted[0] === basename(myNode)) {
      // Step 3: I have the lowest number — I'm the leader!
      onBecomeLeader();
      return;
    }

    // Step 4: Watch the node just before me
    const myIndex = sorted.indexOf(basename(myNode));
    const previousNode = sorted[myIndex - 1];
    const exists = await zk.exists(`/election/${previousNode}`, (event) => {
      // Step 5: Previous node deleted — re-check
      checkLeadership();
    });

    // Edge case: previous node already gone before the watch was set
    if (!exists) {
      checkLeadership();
    }
  }

  await checkLeadership();
}
```
Herd Effect Avoidance is Critical
In a cluster with 100 candidates, the naive "everyone watches the leader" approach causes 99 simultaneous getChildren calls when the leader fails. The "watch previous" approach causes exactly 1 notification. This is the difference between O(N) and O(1) load on the ensemble during failover.
Distributed Lock
Distributed locks provide mutual exclusion across machines. The ZooKeeper implementation uses the same ephemeral sequential pattern as leader election, but with acquire/release semantics rather than permanent leadership.
| Lock Type | Mechanism | Fairness | Use Case |
|---|---|---|---|
| Simple (non-fair) | Single ephemeral node, all contenders watch it | ❌ No ordering | Low contention, simple cases |
| Fair (queue-based) | Ephemeral sequential, watch previous | ✅ FIFO order | High contention, ordered access |
| Read-Write Lock | Separate read/write sequential nodes | ✅ Readers concurrent, writers exclusive | Read-heavy workloads |
```
Fair Distributed Lock (Queue-Based)

Lock path: /locks/resource-X

Acquire lock:
  1. Create EPHEMERAL_SEQUENTIAL: /locks/resource-X/lock-0000000005
  2. getChildren("/locks/resource-X") → [lock-0003, lock-0004, lock-0005]
  3. Am I the lowest? (lock-0003 is lowest, not me)
  4. Watch lock-0004 (the one just before me)
  5. Wait for notification...

When lock-0004 is deleted:
  6. getChildren again → [lock-0003, lock-0005]
  7. Still not lowest (lock-0003 exists)
  8. Watch lock-0003

When lock-0003 is deleted:
  9. getChildren → [lock-0005]
  10. I'm the lowest! → LOCK ACQUIRED ✅

Release lock:
  11. Delete /locks/resource-X/lock-0005
  12. Next waiter (lock-0006) gets notified → acquires lock

Crash recovery (automatic):
  If lock holder crashes → ephemeral node deleted → next waiter proceeds
  No deadlocks from dead processes!

Read-Write Lock:
  Writers: /locks/resource-X/write-0000000001
  Readers: /locks/resource-X/read-0000000002
  Writer acquires: must be lowest of ALL nodes (read + write)
  Reader acquires: must be lowest of all WRITE nodes before it
  Multiple readers can hold the lock simultaneously
```
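The acquire walkthrough above boils down to one decision per `getChildren` call: am I the lowest, and if not, whom do I watch? A minimal sketch of that decision logic, assuming a plain Python list of child names stands in for the `getChildren` result (the helper name `lock_status` is ours, not a ZooKeeper API):

```python
def lock_status(children, my_node):
    """Given the children of the lock path and our own node name,
    return ("acquired", None) or ("waiting", node_to_watch)."""
    # Sort by the numeric sequence suffix of each node name
    ordered = sorted(children, key=lambda n: int(n.rsplit("-", 1)[1]))
    idx = ordered.index(my_node)
    if idx == 0:
        return ("acquired", None)           # lowest sequence number holds the lock
    return ("waiting", ordered[idx - 1])    # watch only the node just before us

# Walkthrough matching steps 1-5 above:
children = ["lock-0000000003", "lock-0000000004", "lock-0000000005"]
print(lock_status(children, "lock-0000000005"))  # ('waiting', 'lock-0000000004')
```

A real client would set a watch on the returned node and re-run this check whenever the watch fires; the function itself never watches the current lock holder, which is what avoids the herd effect.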
Fencing Tokens
A subtle problem: if a lock holder has a long GC pause, its session might expire and another process acquires the lock. When the original process resumes, it still thinks it holds the lock. Solution: use the znode's czxid as a fencing token. Pass it to any resource you're protecting — the resource rejects operations with stale (lower) fencing tokens.
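What the token check looks like on the resource side can be sketched as follows, assuming the czxid is available as a plain integer (czxids are drawn from zxids, which increase monotonically). The `FencedResource` class is illustrative, not part of any ZooKeeper client:

```python
class FencedResource:
    """A protected resource that rejects writes from stale lock holders."""

    def __init__(self):
        self.highest_token = -1   # highest fencing token seen so far
        self.data = {}

    def write(self, token, key, value):
        # A stale holder (e.g. one that resumed after a GC pause, whose
        # session expired and whose lock was re-acquired by someone else)
        # carries an older czxid than the current holder, so it is fenced off.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value

r = FencedResource()
r.write(33, "balance", 100)   # holder A, lock node czxid 33
r.write(34, "balance", 250)   # holder B acquired after A's session expired
try:
    r.write(33, "balance", 999)   # A resumes after GC pause: rejected
except PermissionError as err:
    print(err)
```

The key property is that the check lives in the resource, not the lock: even if the stale holder believes it owns the lock, its writes cannot do damage.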
Lock Implementation Rules
- ✅ Always use EPHEMERAL nodes — ensures lock is released if holder crashes
- ✅ Use SEQUENTIAL for fairness — prevents starvation of waiting processes
- ✅ Watch only the previous node — avoids herd effect on lock release
- ✅ Implement fencing tokens — protects against stale lock holders after GC pauses
- ✅ Set reasonable session timeouts — too short = false lock release, too long = slow recovery
Service Discovery & Group Membership
Service discovery answers: "which instances of service X are alive right now?" Group membership answers: "which nodes are part of this cluster?" Both use the same pattern: ephemeral nodes for registration, getChildren + watches for discovery.
Service Starts
Service instance creates an ephemeral node: /services/payment/instance-1 with data containing host:port.
Consumers Discover
Consumers call getChildren('/services/payment', watch=true) to get all live instances.
Instance Crashes
When an instance crashes, its ephemeral node is automatically deleted after session timeout.
Consumers Notified
The child watch fires, consumers re-read getChildren to get the updated list of live instances.
New Instance Joins
A new instance creates its ephemeral node. Child watch fires again, consumers discover the new instance.
```
Service Discovery Pattern

ZooKeeper namespace:
  /services
  ├── payment
  │   ├── instance-1   data: {"host":"10.0.1.5","port":8080,"version":"2.1"}
  │   ├── instance-2   data: {"host":"10.0.1.6","port":8080,"version":"2.1"}
  │   └── instance-3   data: {"host":"10.0.2.1","port":8080,"version":"2.2"}
  ├── inventory
  │   ├── instance-1   data: {"host":"10.0.1.7","port":9090}
  │   └── instance-2   data: {"host":"10.0.2.2","port":9090}
  └── notification
      └── instance-1   data: {"host":"10.0.1.8","port":7070}

All instance nodes are EPHEMERAL:
  - Instance crashes → node auto-deleted → consumers notified
  - Instance gracefully shuts down → deletes node → consumers notified
  - No stale registrations (unlike DNS-based discovery with TTLs)

Consumer pattern:
  1. getChildren("/services/payment", watch=true)
     → ["instance-1", "instance-2", "instance-3"]
  2. For each instance: getData to get host:port
  3. Build connection pool to all instances
  4. On watch event: re-read, update connection pool

Advantages over DNS-based discovery:
  ✅ Instant notification (vs DNS TTL delay)
  ✅ No stale entries (ephemeral = auto-cleanup)
  ✅ Rich metadata in node data (version, capabilities)
  ✅ Consistent view (all consumers see same state)
```
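The consumer pattern above can be sketched as follows. `FakeClient` is an in-memory stand-in for a real ZooKeeper client (method names loosely modeled on kazoo's, not a real API), and it reproduces one-shot watch semantics, so the crucial re-register step in `_refresh` is visible:

```python
class FakeClient:
    """In-memory stand-in for a ZooKeeper client (sketch only)."""

    def __init__(self, tree):
        self.tree = dict(tree)   # {path: data}
        self.watch = None        # at most one pending child watch

    def get_children(self, path, watch=None):
        self.watch = watch       # registering a watch is a one-shot operation
        prefix = path + "/"
        return sorted(p[len(prefix):] for p in self.tree if p.startswith(prefix))

    def get(self, path):
        return self.tree[path]

    def delete(self, path):
        del self.tree[path]
        if self.watch:
            cb, self.watch = self.watch, None   # watch fires once, then is gone
            cb()

class ServiceWatcher:
    """Keeps a live view of service instances, re-registering its watch."""

    def __init__(self, client, path):
        self.client = client
        self.path = path
        self.instances = {}
        self._refresh()

    def _refresh(self, event=None):
        # Re-register the watch on every refresh; forgetting this step is
        # the "watch re-registration" mistake described later in this page.
        children = self.client.get_children(self.path, watch=self._refresh)
        self.instances = {c: self.client.get(f"{self.path}/{c}") for c in children}

zk = FakeClient({
    "/services/payment/instance-1": "10.0.1.5:8080",
    "/services/payment/instance-2": "10.0.1.6:8080",
})
watcher = ServiceWatcher(zk, "/services/payment")
zk.delete("/services/payment/instance-1")   # crash → ephemeral node gone
print(sorted(watcher.instances))            # ['instance-2']
```

The deletion triggers the watch, `_refresh` rebuilds the instance map, and a fresh watch is left in place for the next membership change.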
Group Membership vs Service Discovery
They're the same pattern with different semantics. Service discovery: "which instances can I send requests to?" Group membership: "which nodes are part of my cluster for internal coordination?" Both use ephemeral nodes + child watches. The difference is who's consuming the information.
Configuration Management
ZooKeeper excels at distributed configuration management. Store config in a persistent znode, and all consumers watch it. When config changes, all consumers are notified instantly — no polling, no stale reads, no deployment required.
```
Configuration Management Pattern

Structure:
  /config
  ├── database        data: {"host":"db.prod","port":5432,"pool_size":20}
  ├── feature-flags   data: {"dark_mode":true,"new_checkout":false}
  ├── rate-limits     data: {"api_rpm":1000,"upload_mb":50}
  └── maintenance     data: {"enabled":false,"message":""}

Update flow:
  1. Admin updates: setData("/config/feature-flags", newFlags, version)
  2. ZooKeeper replicates via Zab (consistent across ensemble)
  3. All watchers receive NodeDataChanged event
  4. Each service re-reads: getData("/config/feature-flags", watch=true)
  5. Services apply new config immediately

Advantages:
  ✅ Push-based — no polling interval, instant propagation
  ✅ Consistent — all services see the same config (sequential consistency)
  ✅ Versioned — version field prevents lost updates (CAS)
  ✅ Auditable — mzxid and mtime track when config last changed
  ✅ No deployment — change config without restarting services

Atomic multi-config update (multi() operation):
  // Update database AND rate-limits atomically
  zk.multi([
    Op.setData("/config/database", newDbConfig, dbVersion),
    Op.setData("/config/rate-limits", newLimits, limitsVersion),
  ]);
  // Either both succeed or both fail — no partial config state
```
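The version-check step can be sketched with an in-memory znode standing in for `/config/feature-flags`. `ConfigNode` is illustrative; it mirrors the `setData(path, data, version)` compare-and-set semantics, where a mismatched version raises instead of silently overwriting a concurrent admin's change:

```python
class ConfigNode:
    """In-memory stand-in for a config znode with CAS semantics."""

    def __init__(self, data):
        self.data = data
        self.version = 0   # znode data version, bumped on every setData

    def set_data(self, new_data, expected_version):
        # Mirrors setData's version check: fail on mismatch rather than
        # clobber an update we haven't seen.
        if expected_version != self.version:
            raise RuntimeError("BadVersion: config changed since it was read")
        self.data = new_data
        self.version += 1

flags = ConfigNode({"dark_mode": True, "new_checkout": False})

v = flags.version   # admins A and B both read version 0
flags.set_data({"dark_mode": False, "new_checkout": False}, v)   # A wins
try:
    # B, still holding version 0, gets BadVersion instead of losing A's update
    flags.set_data({"dark_mode": True, "new_checkout": True}, v)
except RuntimeError as err:
    print(err)
```

On a BadVersion error the losing admin re-reads the node (getting the new data and version) and decides whether their change still applies.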
Config Management Best Practices
- ✅ Use version checks on updates — prevents concurrent admin changes from overwriting each other
- ✅ Keep config data small — JSON with essential settings, not entire application configs
- ✅ Use multi() for related config changes — ensures atomic updates across multiple znodes
- ✅ Implement fallback defaults — if ZK is unavailable, services should use last-known-good config
- ✅ Watch for NodeDeleted too — handle the case where a config node is accidentally removed
Barrier Synchronization
A barrier is a synchronization point where N processes must all arrive before any can proceed. ZooKeeper implements this with a "double barrier" pattern — processes enter the barrier, wait for all to arrive, do work, then exit the barrier together.
The Starting Line
A barrier is like a race starting line. All runners (processes) must be at the line before the gun fires. Each runner registers their presence (creates a node). A referee (watch on children count) checks if all runners are present. When the last runner arrives, the gun fires (barrier opens) and everyone proceeds simultaneously.
```
Double Barrier Pattern

Phase 1: ENTER (wait for all N processes to arrive)
  1. Each process creates: /barrier/job-123/process-X (ephemeral)
  2. Each process calls: getChildren("/barrier/job-123")
  3. If numChildren < N: set watch, wait
  4. When numChildren == N: barrier is open, all proceed

Phase 2: DO WORK
  All processes perform their computation in parallel

Phase 3: EXIT (wait for all N processes to finish)
  1. Each process deletes its node when done
  2. Each process calls: getChildren("/barrier/job-123")
  3. If numChildren > 0: set watch, wait
  4. When numChildren == 0: all done, barrier complete

Example: MapReduce phase transition
  - 10 mappers must all finish before reducers start
  - Each mapper creates /barrier/map-phase/mapper-X
  - When all 10 exist → map phase complete
  - Reducers were watching → they start
  - Each reducer creates /barrier/reduce-phase/reducer-X
  - When all reducers finish → job complete

Edge case: What if a process crashes mid-barrier?
  - Its ephemeral node is deleted
  - numChildren decreases
  - Barrier may never reach N → need timeout/abort logic
```
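The enter/exit bookkeeping can be sketched with a Python set standing in for the children of `/barrier/job-123`. `DoubleBarrier` is illustrative; a real implementation would block on child watches rather than inspect return values:

```python
class DoubleBarrier:
    """In-memory sketch of the double barrier's child-count logic."""

    def __init__(self, n):
        self.n = n             # number of processes that must participate
        self.present = set()   # stands in for the barrier node's children

    def enter(self, proc):
        self.present.add(proc)                # create ephemeral /barrier/job/proc
        return len(self.present) == self.n    # True once all N have arrived

    def leave(self, proc):
        self.present.discard(proc)            # delete our node when work is done
        return len(self.present) == 0         # True once everyone has left

b = DoubleBarrier(3)
opened = [b.enter(p) for p in ("p1", "p2", "p3")]
print(opened)   # [False, False, True] — barrier opens on the last arrival
```

In a real deployment each `False` return would be replaced by "set a watch and wait", and, per the edge case above, the wait needs a timeout in case a participant crashes before arriving.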
Barriers in Practice
Barriers are less common than leader election or locks, but they're essential for batch processing systems. Hadoop uses them for MapReduce phase transitions. Any system where "all workers must complete phase 1 before phase 2 begins" needs a barrier.
Queue Pattern
ZooKeeper can implement a simple FIFO queue using sequential nodes. Producers create sequential nodes, consumers process and delete the lowest-numbered node. However, this pattern has significant limitations and is generally not recommended for high-throughput use cases.
```
FIFO Queue Pattern

Queue path: /queue/job-processing

Producer (enqueue):
  zk.create("/queue/job-processing/task-", jobData, PERSISTENT_SEQUENTIAL);
  // Creates: /queue/job-processing/task-0000000001
  //          /queue/job-processing/task-0000000002
  //          /queue/job-processing/task-0000000003

Consumer (dequeue):
  1. children = getChildren("/queue/job-processing")
  2. Sort children by sequence number
  3. Try to process lowest: task-0000000001
  4. Delete task-0000000001 (use version check for atomicity)
  5. If delete fails (another consumer got it) → retry with next

Priority Queue:
  Use different prefixes: /queue/high-001, /queue/med-002, /queue/low-003
  Consumers process high-* first, then med-*, then low-*

⚠️ LIMITATIONS — Why NOT to use ZK as a queue:
  - No consumer groups (every consumer sees all messages)
  - No replay (deleted = gone forever)
  - getChildren returns ALL items (expensive with many items)
  - No backpressure mechanism
  - ZK is not designed for high-throughput data movement
  - Thousands of znodes = memory pressure on ensemble

✅ USE INSTEAD:
  - Kafka: high-throughput, consumer groups, replay, partitioned
  - RabbitMQ: routing, acknowledgments, dead letter queues
  - SQS: managed, scalable, no infrastructure to maintain
```
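The dequeue-with-retry step can be sketched with a dict of sequential node names standing in for the queue path. `dict.pop` models the atomic delete that claims a task; the `KeyError` branch models the `NoNode` error a real client sees when a competing consumer claims the task first:

```python
def dequeue(queue):
    """Claim and return the lowest-numbered task, or None if the queue is empty."""
    while queue:
        # Sort by the numeric sequence suffix, as a consumer would after getChildren
        lowest = min(queue, key=lambda n: int(n.rsplit("-", 1)[1]))
        try:
            data = queue.pop(lowest)   # atomic delete = claim; models delete()
        except KeyError:
            continue                   # another consumer won the race; retry
        return lowest, data
    return None

q = {"task-0000000002": "send-email", "task-0000000001": "charge-card"}
print(dequeue(q))   # ('task-0000000001', 'charge-card')
```

Note that even this sketch makes the cost visible: every dequeue scans all outstanding items, which is exactly the `getChildren` expense called out in the limitations above.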
When ZK Queues Are Acceptable
ZooKeeper queues work for very low-volume coordination tasks: distributing a handful of jobs among workers, or implementing a simple task assignment system with <100 items. For anything higher volume, use a purpose-built message queue.
Two-Phase Commit Coordination
ZooKeeper can coordinate a two-phase commit (2PC) protocol across distributed participants. The coordinator creates a transaction node, participants vote by creating child nodes, and the coordinator makes the commit/abort decision based on votes.
```
Two-Phase Commit via ZooKeeper

Phase 1: PREPARE (voting)
  Coordinator:
    1. Creates /transactions/txn-42 (persistent)
    2. Creates /transactions/txn-42/participants with list of expected voters
    3. Notifies participants (via watch or external message)
  Each participant:
    4. Prepares locally (acquire locks, validate)
    5. Votes: creates /transactions/txn-42/votes/participant-X
       data: "COMMIT" or "ABORT"
  Coordinator:
    6. Watches /transactions/txn-42/votes (child watch)
    7. Waits until all expected participants have voted

Phase 2: DECIDE
  If all votes are "COMMIT":
    8. Coordinator sets /transactions/txn-42 data to "COMMIT"
    9. Participants watch this node → see COMMIT → apply changes
  If any vote is "ABORT" (or timeout):
    8. Coordinator sets /transactions/txn-42 data to "ABORT"
    9. Participants watch this node → see ABORT → rollback

Crash recovery:
  - Coordinator crashes before decision: participants see no decision,
    time out and abort (safe)
  - Coordinator crashes after decision: decision is persisted in ZK,
    participants can read it on recovery
  - Participant crashes before voting: coordinator times out, aborts
  - Participant crashes after voting: recovers, reads decision from ZK
```
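The coordinator's decision rule can be sketched as a pure function over the collected votes. `decide` is illustrative; in the flow above, `expected` would come from `/transactions/txn-42/participants` and `votes` from the children of the votes node:

```python
def decide(expected, votes):
    """Return "COMMIT", "ABORT", or None if votes are still outstanding.

    expected: iterable of participant names that must vote
    votes:    dict mapping participant name -> "COMMIT" or "ABORT"
    """
    if set(votes) != set(expected):
        return None   # still waiting; a real coordinator aborts on timeout
    if all(v == "COMMIT" for v in votes.values()):
        return "COMMIT"
    return "ABORT"    # any single ABORT vote dooms the transaction

votes = {"svc-a": "COMMIT", "svc-b": "COMMIT"}
print(decide(["svc-a", "svc-b", "svc-c"], votes))   # None — svc-c hasn't voted
votes["svc-c"] = "ABORT"
print(decide(["svc-a", "svc-b", "svc-c"], votes))   # ABORT
```

Once this function returns a non-None result, the coordinator writes it to the transaction znode, which is the step that makes the decision durable across coordinator crashes.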
| Aspect | 2PC Without ZK | 2PC With ZK |
|---|---|---|
| Coordinator failure | Participants blocked indefinitely | Decision persisted, recoverable |
| Participant discovery | Hardcoded list | Dynamic via group membership |
| Vote collection | Custom protocol | Child watches on votes node |
| Decision broadcast | Point-to-point messages | Single znode update + watches |
| Recovery | Complex log-based | Read decision from ZK |
2PC Limitations Remain
ZooKeeper makes 2PC easier to implement but doesn't solve its fundamental limitations: it's blocking (participants hold locks while waiting for decision) and has a single point of failure (the coordinator). For most distributed transactions, consider Saga patterns instead.
Interview Questions
Q: How does ZooKeeper leader election avoid the herd effect?
A: Instead of all candidates watching the leader node (which causes N notifications when the leader fails), each candidate watches only the node with the next-lower sequence number. This creates a chain: candidate-0003 watches candidate-0002, which watches candidate-0001 (leader). When the leader fails, only candidate-0002 is notified — it checks if it's now the lowest and becomes leader. This is O(1) notifications per failure instead of O(N). The pattern is called 'watch previous' and is critical for scalability with many candidates.
Q: Explain how to implement a fair distributed lock with ZooKeeper.
A: (1) Create EPHEMERAL_SEQUENTIAL node under /locks/resource: /locks/resource/lock-0000000005. (2) getChildren to see all lock nodes. (3) If yours is the lowest → you hold the lock. (4) If not, watch the node just before yours (not the lock holder — avoids herd effect). (5) When that node is deleted, re-check if you're now lowest. Release: delete your node. Crash safety: ephemeral node auto-deleted on session expiry → lock auto-released. Fairness: FIFO order guaranteed by sequence numbers. Add fencing tokens (czxid) to protect against stale lock holders after GC pauses.
Q: How does service discovery work with ZooKeeper?
A: Registration: each service instance creates an EPHEMERAL node under /services/service-name/ with host:port data. Discovery: consumers call getChildren('/services/service-name', watch=true) to get all live instances. Failure detection: when an instance crashes, its ephemeral node is auto-deleted after session timeout. The child watch fires, consumers re-read the list. Advantages over DNS: instant notification (no TTL delay), no stale entries (ephemeral = auto-cleanup), rich metadata in node data, consistent view across all consumers.
Q: What is the double barrier pattern and when would you use it?
A: A double barrier synchronizes N processes at two points: entry (all must arrive before any proceeds) and exit (all must finish before the group moves on). Implementation: (1) Enter: each process creates an ephemeral child under /barrier/X. Watch children count. When count == N, barrier opens. (2) Exit: each process deletes its node when done. Watch children count. When count == 0, all have finished. Use cases: MapReduce phase transitions (all mappers finish before reducers start), batch processing stages, distributed test coordination. Edge case: if a process crashes, its ephemeral node is deleted — need timeout/abort logic.
Q: Why shouldn't you use ZooKeeper as a message queue?
A: ZooKeeper queues (sequential nodes) have severe limitations: (1) No consumer groups — every consumer sees all messages. (2) No replay — deleted messages are gone. (3) getChildren returns ALL items — O(N) per dequeue operation. (4) All nodes in memory — thousands of queue items = memory pressure. (5) No backpressure. (6) Write throughput limited by Zab consensus. ZK queues are acceptable for <100 low-frequency coordination tasks. For real queuing, use Kafka (high throughput, replay, partitions), RabbitMQ (routing, DLQ), or SQS (managed, scalable).
Common Mistakes
Herd effect in leader election
All candidates watch the leader node directly. When the leader fails, all N candidates simultaneously call getChildren, overwhelming the ensemble.
✅ Each candidate watches only the node with the next-lower sequence number. This creates a chain where only one candidate is notified per failure — O(1) instead of O(N).
Not using fencing tokens with distributed locks
Assuming that holding a lock means exclusive access forever. If a GC pause causes session expiry, another process acquires the lock while the original still thinks it holds it.
✅ Use the lock node's czxid as a fencing token. Pass it to protected resources. Resources reject operations with stale (lower) fencing tokens. This detects and prevents stale lock holders from causing damage.
Using ZooKeeper as a high-throughput queue
Creating thousands of sequential nodes for job processing. getChildren becomes expensive, memory grows, and write throughput is limited by consensus.
✅ Use ZK queues only for very low-volume coordination (<100 items). For real queuing workloads, use Kafka, RabbitMQ, or SQS. ZooKeeper is a coordination service, not a data pipeline.
Not handling watch re-registration in service discovery
Setting a child watch once on /services/X and assuming it will keep notifying. After the first event, the watch is gone — new instances joining or leaving are silently missed.
✅ Always re-register the watch in the callback: watch fires → getChildren with new watch → update service list. Or use persistent recursive watches (3.6+) which don't require re-registration.
No timeout in barrier implementations
Waiting indefinitely for all N processes to arrive at a barrier. If one process crashes before creating its node, the barrier never opens — all other processes hang forever.
✅ Implement a timeout: if the barrier hasn't opened within X seconds, abort and clean up. Also handle the case where a process's ephemeral node disappears (crash) — adjust N or abort the barrier.