Ensemble · Zab Protocol · Quorum · Leader Election · Observers · Write Path

Architecture & The Ensemble

ZooKeeper runs as an ensemble of servers using the Zab atomic broadcast protocol. One leader handles all writes, followers replicate, and quorum ensures consistency even when nodes fail.

40 min read · 9 sections
01

The Ensemble

A ZooKeeper deployment is called an ensemble — a group of servers (typically 3, 5, or 7) that work together to provide the coordination service. The ensemble should have an odd number of nodes so that a clear majority can always be established.

āš–ļø

The Jury System

An ensemble works like a jury. You need a majority to reach a verdict (quorum). With 5 jurors, you need 3 to agree. If 2 jurors are absent, the remaining 3 can still reach a verdict. But if 3 are absent, the 2 remaining cannot — they must wait. This is why odd numbers matter: with 4 jurors, a 2-2 split is a deadlock. With 5, someone always breaks the tie.

ensemble-sizing.txt
Ensemble Size vs Fault Tolerance:
═══════════════════════════════════════════════════════════
Nodes │ Quorum │ Can Tolerate │ Notes
═══════════════════════════════════════════════════════════
  1   │   1    │  0 failures  │ Development only
  3   │   2    │  1 failure   │ Minimum production
  5   │   3    │  2 failures  │ Recommended production
  7   │   4    │  3 failures  │ Large/critical deployments
═══════════════════════════════════════════════════════════

Formula: Quorum = ⌊N/2āŒ‹ + 1
         Tolerable failures = N - Quorum = ⌊(N-1)/2āŒ‹

Why NOT even numbers?
  - 4 nodes: quorum = 3, tolerates 1 failure (same as 3 nodes!)
  - 6 nodes: quorum = 4, tolerates 2 failures (same as 5 nodes!)
  - Even numbers add cost without improving fault tolerance
  - An even split (e.g. a 2-2 partition of 4 nodes) leaves neither side with a quorum, so no leader can be elected
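
The sizing table boils down to the two formulas above. Here is a minimal Python sketch (illustrative only; the script and filename are not part of ZooKeeper) that makes the even-vs-odd comparison concrete:

quorum-calc.py
def quorum(n: int) -> int:
    """Strict majority: floor(n/2) + 1."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """Failures the ensemble can survive while still holding a quorum."""
    return n - quorum(n)   # equals (n - 1) // 2

for n in (1, 3, 4, 5, 6, 7):
    print(f"{n} nodes -> quorum {quorum(n)}, tolerates {tolerable_failures(n)} failure(s)")
# 3 and 4 nodes both tolerate 1 failure; 5 and 6 both tolerate 2.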

Why 5 is the Sweet Spot

A 3-node ensemble tolerates only 1 failure — during maintenance of one node, you have zero fault tolerance. A 5-node ensemble tolerates 2 failures, meaning you can lose one node to maintenance and still survive an unexpected failure. Seven nodes are rarely needed unless you have extreme availability requirements.

Ensemble Deployment Rules

  • āœ…Always use odd numbers (3, 5, 7) — even numbers waste resources
  • āœ…Place nodes across failure domains (different racks, AZs, or DCs)
  • āœ…All nodes should have similar hardware — the slowest node limits write throughput
  • āœ…Use dedicated machines — ZooKeeper is latency-sensitive and GC pauses affect the whole ensemble
  • āœ…Network latency between nodes should be low (<10ms) — Zab requires round-trips for every write
02

Leader and Followers

Unlike Cassandra's masterless architecture, ZooKeeper has a clear leader/follower hierarchy. Exactly one node is the leader at any time. The leader handles all write requests and coordinates replication. Followers serve read requests and participate in write quorum.

Role     │ Handles Writes                  │ Handles Reads           │ Votes in Quorum │ Count
Leader   │ ✅ All writes go through leader │ ✅ Can serve reads      │ ✅ Yes          │ Exactly 1
Follower │ ❌ Forwards writes to leader    │ ✅ Serves reads locally │ ✅ Yes          │ N-1 (or fewer with observers)
Observer │ ❌ Forwards writes to leader    │ ✅ Serves reads locally │ ❌ No           │ 0 or more (optional)
request-routing.txt
Client Request Routing:

Read Request (getData, getChildren, exists):
  Client → Any Server (Leader, Follower, or Observer)
  → Server reads from local in-memory data tree
  → Returns immediately (no consensus needed)
  ⚔ Fast: single node, no network round-trips

Write Request (create, setData, delete):
  Client → Follower (or Observer)
  → Follower forwards to Leader
  → Leader runs Zab protocol (propose → ack → commit)
  → Leader responds to originating server
  → Server responds to client
  🐢 Slower: requires quorum agreement

Write Request (direct to Leader):
  Client → Leader
  → Leader runs Zab protocol
  → Leader responds to client
  ⚔ One fewer hop (but clients don't choose which server)

Reads Are Local, Writes Are Global

This is the key performance characteristic of ZooKeeper. Reads scale linearly with the number of servers (each serves reads independently). Writes do NOT scale — every write must go through the single leader and achieve quorum. This is why ZooKeeper is optimized for read-heavy coordination workloads.
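
To make the read/write asymmetry concrete, here is a small client sketch using the kazoo Python client (hostnames and the znode path are placeholders): the read is answered by whichever server the client happens to be connected to, while the writes are routed through the leader and committed only after quorum.

client-example.py
from kazoo.client import KazooClient

# The client connects to one server from this list (any role).
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Writes: routed to the leader, committed only after quorum ACKs.
zk.ensure_path("/config")
zk.create("/config/feature-flag", b"on")        # fails if the znode already exists

# Reads: served from the connected server's in-memory tree, no consensus round-trip.
data, stat = zk.get("/config/feature-flag")
print(data, stat.version)

zk.stop()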

03

Zab Protocol

ZooKeeper Atomic Broadcast (Zab) is the consensus protocol that ensures all servers in the ensemble agree on the order of state changes. It is NOT Paxos and NOT Raft — it was designed specifically for ZooKeeper's primary-backup architecture.

Property        │ Zab                                │ Paxos                         │ Raft
Design goal     │ Primary-backup replication         │ General consensus             │ Understandable consensus
Leader required │ Yes, always                        │ No (but Multi-Paxos uses one) │ Yes, always
Ordering        │ Total order of all transactions    │ Per-instance ordering         │ Log-based total order
Recovery        │ Leader syncs followers on election │ Complex catch-up              │ Leader sends missing entries
Used by         │ ZooKeeper                          │ Google Chubby, Spanner        │ etcd, CockroachDB, Consul
šŸ“¢

The News Anchor

Zab works like a news broadcast. The leader is the anchor — all news (writes) must go through them. The anchor broadcasts each story to all stations (followers). Each station confirms receipt. Once a majority of stations confirm, the story is 'committed' and can be reported to viewers (clients). If the anchor goes off-air, stations elect a new anchor who first catches up on any stories they missed.

zab-phases.txt
Zab Protocol Phases:

Phase 1: Leader Election (LOOKING state)
  - All servers vote for the server with the highest zxid
  - If tied, vote for the highest server ID (myid)
  - Once a server receives votes from a quorum → it's the leader
  - Others become followers

Phase 2: Discovery / Synchronization
  - New leader collects the latest transaction from each follower
  - Leader determines the authoritative history
  - Leader sends DIFF, TRUNC, or SNAP to sync followers:
    • DIFF: "here are the transactions you're missing"
    • TRUNC: "you have transactions I don't — roll back"
    • SNAP: "you're too far behind — here's a full snapshot"
  - Once all followers are synced → ready to serve

Phase 3: Broadcast (normal operation)
  - Leader receives write request
  - Leader assigns next zxid (epoch + counter)
  - Leader sends PROPOSAL to all followers
  - Followers write to transaction log, send ACK
  - Once leader has ACKs from quorum → sends COMMIT
  - All servers apply the transaction to in-memory data tree

Key guarantee: All committed transactions are delivered in the
same order to all servers. No gaps, no reordering.
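
A toy sketch of the broadcast phase (Phase 3) shows the commit-on-quorum rule. It is purely illustrative: the server names, operations, and helper function are invented for this example and are not ZooKeeper source code.

zab-broadcast-sketch.py
ENSEMBLE = ["leader", "f1", "f2", "f3", "f4"]   # 5 voting members
QUORUM = len(ENSEMBLE) // 2 + 1                  # 3

def broadcast(zxid, op, acked_by):
    """acked_by: followers whose transaction-log fsync (and ACK) succeeded."""
    acks = {"leader"} | set(acked_by)            # the leader logs and counts itself
    if len(acks) >= QUORUM:
        return f"COMMIT zxid={hex(zxid)} ({op}), acks={sorted(acks)}"
    return f"zxid={hex(zxid)} pending: {len(acks)} ACK(s), need {QUORUM}"

print(broadcast(0x300000048, "setData /config/flag", ["f1", "f2"]))  # committed
print(broadcast(0x300000049, "create /locks/l-1", ["f3"]))           # still pending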

Zab vs Raft: Key Difference

The main difference is in recovery. In Raft, a candidate can only win election if its log is at least as up-to-date as each voter's (compared by last entry's term, then log length), and the leader repairs follower logs during normal replication. In Zab, the leader with the highest zxid wins, and it actively synchronizes followers in a dedicated discovery/synchronization phase before serving requests. Zab also separates the broadcast protocol from leader election more cleanly than Raft.

04

Leader Election Process

When the ensemble starts or the current leader fails, all servers enter the LOOKING state and begin the leader election process. The goal: elect the server with the most up-to-date transaction history as the new leader.

1

Enter LOOKING State

When a server starts or loses contact with the leader, it transitions to LOOKING and begins an election round.

2

Cast Initial Vote

Each server votes for itself initially, sending (myid, last_zxid, epoch) to all other servers.

3

Compare Votes

When receiving a vote, compare: highest epoch wins. If tied, highest zxid wins. If still tied, highest myid wins.

4

Update Vote

If a received vote is 'better' than your current vote, adopt it and re-broadcast your updated vote.

5

Quorum Reached

Once a server sees that a quorum of servers have voted for the same candidate, the election is decided.

6

Role Assignment

The winner becomes LEADING, all others become FOLLOWING. The new leader begins the synchronization phase.

leader-election.txt
Example: 5-node ensemble, server 3 (previous leader) crashes

Server states: [1: LOOKING, 2: LOOKING, 4: LOOKING, 5: LOOKING]

Round 1 — Initial votes (each votes for itself):
  Server 1 broadcasts: (myid=1, zxid=0x300000045, epoch=3)
  Server 2 broadcasts: (myid=2, zxid=0x300000047, epoch=3)
  Server 4 broadcasts: (myid=4, zxid=0x300000047, epoch=3)
  Server 5 broadcasts: (myid=5, zxid=0x300000044, epoch=3)

Round 2 — Compare and update:
  Server 1 sees Server 2's vote: zxid 47 > 45 → adopts vote for 2
  Server 1 sees Server 4's vote: zxid 47 = 47, myid 4 > 2 → adopts vote for 4
  Server 5 sees Server 4's vote: zxid 47 > 44 → adopts vote for 4
  Server 2 sees Server 4's vote: zxid 47 = 47, myid 4 > 2 → adopts vote for 4

Result: All servers vote for Server 4 (quorum = 3, got 4 votes)
  Server 4 → LEADING
  Servers 1, 2, 5 → FOLLOWING

Election time: typically 200ms - 2 seconds

Why Highest zxid Wins

The server with the highest zxid has the most recent committed transaction. Electing it as leader minimizes the amount of synchronization needed — it already has the most complete history. Raft reaches a similar outcome differently: voters only grant their vote to a candidate whose log is at least as up-to-date as their own.
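
The vote-comparison rule from steps 3 and 4 fits in a few lines. This sketch (illustrative only, not the real FastLeaderElection code) replays server 1's view of the election above:

vote-comparison-sketch.py
# A vote is (epoch, zxid, myid); "better" = higher epoch, then higher zxid, then higher myid.
def better(candidate, current):
    return candidate > current          # Python tuple comparison matches the rule order

my_vote = (3, 0x300000045, 1)           # server 1 initially votes for itself
for received in [(3, 0x300000047, 2), (3, 0x300000047, 4), (3, 0x300000044, 5)]:
    if better(received, my_vote):
        my_vote = received              # adopt the better vote and re-broadcast it
print(my_vote)                          # (3, 0x300000047, 4): server 4 wins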

05

Quorum Mechanics

Quorum is the minimum number of servers that must agree for an operation to succeed. It's the mechanism that ensures consistency — any two quorums must overlap by at least one server, guaranteeing that committed data is never lost.

quorum-math.txt
Quorum = ⌊N/2āŒ‹ + 1 (strict majority)

Why majority works:
  - Any two majorities of N must share at least one member
  - This shared member has seen both operations
  - Therefore: no committed write can be "forgotten"

Example with 5 nodes:
  Quorum = 3 (majority of 5)

  Write W1 committed by: {A, B, C}     (3 of 5 = quorum āœ…)
  Write W2 committed by: {C, D, E}     (3 of 5 = quorum āœ…)
  
  Overlap: Server C has both W1 and W2
  → No matter which servers you ask, at least one has the latest data

What happens when quorum is lost:
  5 nodes, 3 crash → only 2 remaining
  2 < quorum (3) → ensemble STOPS serving writes
  Reads may still work (stale) but writes are blocked
  → This is the CP trade-off: unavailable rather than inconsistent

Scenario (5-node)     │ Available Nodes │ Has Quorum?                │ Can Write?
Normal operation      │ 5               │ ✅ Yes (5 ≥ 3)             │ ✅ Yes
One node down         │ 4               │ ✅ Yes (4 ≥ 3)             │ ✅ Yes
Two nodes down        │ 3               │ ✅ Yes (3 ≥ 3)             │ ✅ Yes (barely)
Three nodes down      │ 2               │ ❌ No (2 < 3)              │ ❌ No — unavailable
Network partition 3|2 │ 3 in majority   │ ✅ Majority partition only │ ✅ Majority only

Quorum Intersection Property

The mathematical guarantee: for any N, two groups of size ⌊N/2āŒ‹ + 1 must share at least one member. This single property is what makes distributed consensus possible — it ensures that no two conflicting decisions can both achieve quorum.
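
For small ensembles the guarantee can be checked by brute force; a short illustrative sketch:

quorum-overlap-check.py
from itertools import combinations

def quorum(n: int) -> int:
    return n // 2 + 1

for n in (3, 4, 5, 7):
    q = quorum(n)
    disjoint_pair_exists = any(
        set(a).isdisjoint(b)
        for a in combinations(range(n), q)
        for b in combinations(range(n), q)
    )
    print(f"n={n}, quorum={q}: disjoint quorums possible? {disjoint_pair_exists}")
# Prints False for every n: any two majorities share at least one server.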

06

Observer Nodes

Observers are a special type of ZooKeeper server that receive all state updates but do NOT participate in voting. They scale read throughput without affecting write performance — adding observers doesn't increase quorum size.

Property               │ Follower                    │ Observer
Receives proposals     │ ✅ Yes                      │ ✅ Yes
Sends ACKs to leader   │ ✅ Yes (counted for quorum) │ ❌ No
Votes in elections     │ ✅ Yes                      │ ❌ No
Serves client reads    │ ✅ Yes                      │ ✅ Yes
Forwards client writes │ ✅ Yes (to leader)          │ ✅ Yes (to leader)
Affects write latency  │ ✅ Yes (must ACK)           │ ❌ No
Use case               │ Core ensemble               │ Read scaling, remote DC
zoo.cfg
# 5-node ensemble with 2 observers
# Voting members (participate in quorum):
server.1=zk1.dc1:2888:3888
server.2=zk2.dc1:2888:3888
server.3=zk3.dc1:2888:3888
server.4=zk4.dc1:2888:3888
server.5=zk5.dc1:2888:3888

# Observers (read-only replicas, no voting):
server.6=zk6.dc2:2888:3888:observer
server.7=zk7.dc2:2888:3888:observer

# Quorum is still 3 (majority of 5 voting members)
# Observers in DC2 serve local reads with low latency
# Writes still require round-trip to DC1 for quorum

When to Use Observers

  • āœ…Cross-datacenter reads — place observers in remote DCs for low-latency local reads
  • āœ…Read scaling — add observers to handle more read clients without affecting write quorum
  • āœ…Non-critical clients — connect monitoring/analytics tools to observers instead of voting members
  • āœ…Gradual scaling — add read capacity without changing quorum dynamics
07

Write Path Through the Ensemble

Every write in ZooKeeper follows the same path through the ensemble. Understanding this path explains why writes are slower than reads and why write throughput doesn't scale with more nodes.

1

Client Sends Write

Client sends create/setData/delete to any server. If it's a follower, it forwards to the leader.

2

Leader Assigns zxid

Leader assigns the next globally unique transaction ID (zxid = epoch << 32 | counter). This ensures total ordering; see the packing sketch after these steps.

3

Leader Sends PROPOSAL

Leader sends the transaction proposal to all followers. Each proposal contains the zxid and the operation details.

4

Followers Write to Transaction Log

Each follower writes the proposal to its transaction log on disk (fsync) and sends an ACK back to the leader.

5

Leader Collects Quorum ACKs

Once the leader has ACKs from a quorum (including itself), the transaction is committed.

6

Leader Sends COMMIT

Leader sends COMMIT to all followers. Each server applies the transaction to its in-memory data tree.

7

Response to Client

The originating server (leader or follower) sends the response back to the client.
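
The zxid assigned in step 2 is a single 64-bit integer: the epoch occupies the upper 32 bits and the per-epoch counter the lower 32. A small sketch of the packing (the helper names are made up for illustration):

zxid-sketch.py
def make_zxid(epoch: int, counter: int) -> int:
    return (epoch << 32) | counter

def split_zxid(zxid: int) -> tuple[int, int]:
    return zxid >> 32, zxid & 0xFFFFFFFF

zxid = make_zxid(3, 0x47)
print(hex(zxid))           # 0x300000047, matching the election example earlier
print(split_zxid(zxid))    # (3, 71)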

write-latency.txt
Write Latency Breakdown (5-node ensemble, same DC):

Client → Follower:           ~0.5ms (network)
Follower → Leader:           ~0.5ms (forward)
Leader assigns zxid:         ~0.01ms (in-memory)
Leader → Followers (PROPOSAL): ~0.5ms (broadcast)
Followers fsync to disk:     ~1-5ms (SSD) or ~10ms (HDD)
Followers → Leader (ACK):    ~0.5ms (network)
Leader → Followers (COMMIT): ~0.5ms (broadcast)
Follower → Client:           ~0.5ms (response)
─────────────────────────────────────────────────
Total:                       ~4-8ms (SSD), ~12-15ms (HDD)

Key insight: disk fsync dominates write latency.
This is why SSDs are critical for ZooKeeper performance.

Throughput: ~10,000-50,000 writes/sec (depending on hardware)
Compare: ~100,000+ reads/sec (no disk, no consensus)

Why Writes Don't Scale

Adding more followers doesn't help write throughput — it actually hurts it slightly because the leader must send proposals to more servers. Write throughput is bounded by the leader's ability to process proposals and the disk fsync latency. This is by design — ZooKeeper is a coordination service, not a data store.

08

Interview Questions

Q: Why does a ZooKeeper ensemble need an odd number of nodes?

A: Because quorum requires a strict majority (⌊N/2āŒ‹ + 1). With even numbers, you get the same fault tolerance as N-1: a 4-node ensemble tolerates 1 failure (quorum=3), same as 3 nodes. The 4th node adds cost without improving resilience. Odd numbers also prevent tie situations during leader election. The sweet spot is 5 nodes: tolerates 2 failures, allowing maintenance of one node while still surviving an unexpected failure.

Q: How does the Zab protocol differ from Raft?

A: Key differences: (1) Recovery — Zab has an explicit discovery/synchronization phase where the new leader actively syncs followers. Raft relies on the leader sending missing log entries during normal operation. (2) Leader election — Zab elects the server with the highest zxid (most recent transaction). Raft only grants votes to candidates whose log is at least as up-to-date as the voter's (compared by last term, then length). (3) Design philosophy — Zab was designed specifically for primary-backup state machine replication. Raft was designed for general-purpose consensus with understandability as a goal. Both provide the same safety guarantees.

Q: What happens when the ZooKeeper leader crashes?

A: (1) Followers detect leader loss via missed heartbeats. (2) All servers enter LOOKING state. (3) Fast Leader Election begins — servers exchange votes preferring highest zxid, then highest myid. (4) Once quorum agrees on a candidate, it becomes leader. (5) New leader runs discovery phase — collects latest state from followers. (6) Synchronization — leader sends DIFF/TRUNC/SNAP to bring followers up to date. (7) Ensemble resumes serving requests. Total time: typically 200ms-2s. During election, the ensemble is unavailable for writes.

Q: Why can't you scale ZooKeeper write throughput by adding more nodes?

A: Every write must go through the single leader and achieve quorum acknowledgment. Adding followers means the leader must send proposals to more servers and wait for more ACKs. Write throughput is bounded by: (1) Leader's processing capacity, (2) Network round-trip time for proposals/ACKs, (3) Disk fsync latency on followers. More nodes actually slightly decrease write throughput. To scale reads, add observers (non-voting replicas). To scale writes, you'd need to shard — but ZooKeeper doesn't support that natively.

Q: What are observer nodes and when would you use them?

A: Observers receive all state updates but don't vote in quorum or elections. Use cases: (1) Cross-DC reads — place observers in remote datacenters for low-latency local reads without affecting write quorum latency. (2) Read scaling — handle more read clients without changing quorum size. (3) Non-critical clients — connect monitoring tools to observers. Key benefit: adding observers doesn't increase quorum size, so write latency is unaffected. Trade-off: observer reads may be slightly stale (they receive COMMIT asynchronously).

09

Common Mistakes

šŸ”¢

Using an even number of nodes

Deploying 4 or 6 nodes thinking more is always better. A 4-node ensemble has the same fault tolerance as 3 nodes (both tolerate 1 failure) but costs more and has slightly higher write latency.

āœ…Always use odd numbers: 3, 5, or 7. For most production deployments, 5 is the sweet spot — tolerates 2 failures, allowing safe rolling upgrades.

šŸŒ

Spreading voting members across high-latency links

Placing 5 voting members across 5 different regions. Every write requires quorum ACKs, so write latency equals the round-trip to the slowest quorum member.

āœ…Keep voting members in the same region (or nearby regions with <10ms latency). Use observers in remote regions for local reads. Write latency is bounded by the slowest quorum member.

šŸ’¾

Running ZooKeeper on spinning disks

Using HDD for the transaction log. Every write requires an fsync to the transaction log before ACKing. HDD fsync takes 10-15ms vs 1-2ms for SSD.

āœ…Always use SSDs for ZooKeeper's transaction log (dataLogDir). Separate the transaction log from snapshots (dataDir) on different disks for best performance.

šŸ 

Co-locating ZooKeeper with other services

Running ZooKeeper on the same machines as Kafka brokers, application servers, or databases. GC pauses or CPU contention from other processes cause ZK to miss heartbeats and trigger unnecessary leader elections.

āœ…Run ZooKeeper on dedicated machines. It's latency-sensitive and needs predictable performance. A single GC pause can cause the leader to lose quorum and trigger a disruptive election.

šŸ“ˆ

Adding nodes to scale write throughput

Thinking that adding more followers will increase write capacity, like adding Cassandra nodes increases throughput.

āœ…ZooKeeper writes don't scale horizontally — all writes go through one leader. Add observers for read scaling. If you need more write throughput, optimize hardware (faster SSDs, more RAM) or reduce write frequency.