Architecture & The Ensemble
ZooKeeper runs as an ensemble of servers using the Zab atomic broadcast protocol. One leader handles all writes, followers replicate, and quorum ensures consistency even when nodes fail.
Table of Contents
The Ensemble
A ZooKeeper deployment is called an ensemble: a group of servers (typically 3, 5, or 7) that work together to provide the coordination service. The ensemble should have an odd number of nodes so that a clear majority can always be established.
The Jury System
An ensemble works like a jury. You need a majority to reach a verdict (quorum). With 5 jurors, you need 3 to agree. If 2 jurors are absent, the remaining 3 can still reach a verdict. But if 3 are absent, the remaining 2 cannot; they must wait. This is why odd numbers matter: with 4 jurors, a 2-2 split is a deadlock; with 5, someone always breaks the tie.
Ensemble Size vs Fault Tolerance:

| Nodes | Quorum | Can Tolerate | Notes |
|---|---|---|---|
| 1 | 1 | 0 failures | Development only |
| 3 | 2 | 1 failure | Minimum production |
| 5 | 3 | 2 failures | Recommended production |
| 7 | 4 | 3 failures | Large/critical deployments |

Formula: Quorum = ⌊N/2⌋ + 1. Tolerable failures = N - Quorum = ⌊(N-1)/2⌋.

Why NOT even numbers?
- 4 nodes: quorum = 3, tolerates 1 failure (same as 3 nodes!)
- 6 nodes: quorum = 4, tolerates 2 failures (same as 5 nodes!)
- Even numbers add cost without improving fault tolerance
- They also increase the chance of a tie during leader election
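The quorum arithmetic above can be reproduced with two one-line functions (a quick sketch of the formulas, not ZooKeeper code):

```python
def quorum(n: int) -> int:
    """Strict majority of n servers: floor(n/2) + 1."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """Failures the ensemble survives while still holding quorum."""
    return n - quorum(n)  # equals (n - 1) // 2

for n in (1, 3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerable_failures(n)} failures")
```

Running this makes the even-number waste obvious: 4 nodes tolerate the same single failure as 3, and 6 tolerate the same two failures as 5.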
Why 5 is the Sweet Spot
A 3-node ensemble tolerates only 1 failure, so during maintenance of one node you have zero fault tolerance. A 5-node ensemble tolerates 2 failures, meaning you can take one node down for maintenance and still survive an unexpected failure. 7 nodes is rarely needed unless you have extreme availability requirements.
Ensemble Deployment Rules
- ✅ Always use odd numbers (3, 5, 7); even numbers waste resources
- ✅ Place nodes across failure domains (different racks, AZs, or DCs)
- ✅ Give all nodes similar hardware; the slowest node limits write throughput
- ✅ Use dedicated machines; ZooKeeper is latency-sensitive and GC pauses affect the whole ensemble
- ✅ Keep network latency between nodes low (<10ms); Zab requires round-trips for every write
Leader and Followers
Unlike Cassandra's masterless architecture, ZooKeeper has a clear leader/follower hierarchy. Exactly one node is the leader at any time. The leader handles all write requests and coordinates replication. Followers serve read requests and participate in write quorum.
| Role | Handles Writes | Handles Reads | Votes in Quorum | Count |
|---|---|---|---|---|
| Leader | ✅ All writes go through the leader | ✅ Can serve reads | ✅ Yes | Exactly 1 |
| Follower | ❌ Forwards writes to the leader | ✅ Serves reads locally | ✅ Yes | N-1 (or fewer with observers) |
| Observer | ❌ Forwards writes to the leader | ✅ Serves reads locally | ❌ No | 0 or more (optional) |
Client Request Routing:

```
Read request (getData, getChildren, exists):
  Client → any server (leader, follower, or observer)
    → server reads from its local in-memory data tree
    → returns immediately (no consensus needed)
  ⚡ Fast: single node, no network round-trips

Write request (create, setData, delete):
  Client → follower (or observer)
    → follower forwards to leader
    → leader runs Zab protocol (propose → ack → commit)
    → leader responds to the originating server
    → server responds to the client
  🐢 Slower: requires quorum agreement

Write request (direct to leader):
  Client → leader
    → leader runs Zab protocol
    → leader responds to the client
  ⚡ One fewer hop (but clients don't choose which server they connect to)
```
Reads Are Local, Writes Are Global
This is the key performance characteristic of ZooKeeper. Reads scale linearly with the number of servers (each serves reads independently). Writes do NOT scale: every write must go through the single leader and achieve quorum. This is why ZooKeeper is optimized for read-heavy coordination workloads.
Zab Protocol
ZooKeeper Atomic Broadcast (Zab) is the consensus protocol that ensures all servers in the ensemble agree on the order of state changes. It is NOT Paxos and NOT Raft; it was designed specifically for ZooKeeper's primary-backup architecture.
| Property | Zab | Paxos | Raft |
|---|---|---|---|
| Design goal | Primary-backup replication | General consensus | Understandable consensus |
| Leader required | Yes, always | No (but Multi-Paxos uses one) | Yes, always |
| Ordering | Total order of all transactions | Per-instance ordering | Log-based total order |
| Recovery | Leader syncs followers on election | Complex catch-up | Leader sends missing entries |
| Used by | ZooKeeper | Google Chubby, Spanner | etcd, CockroachDB, Consul |
The News Anchor
Zab works like a news broadcast. The leader is the anchor: all news (writes) must go through them. The anchor broadcasts each story to all stations (followers). Each station confirms receipt. Once a majority of stations confirm, the story is 'committed' and can be reported to viewers (clients). If the anchor goes off-air, stations elect a new anchor, who first catches up on any stories they missed.
Zab Protocol Phases:

```
Phase 1: Leader Election (LOOKING state)
  - All servers vote for the server with the highest zxid
  - If tied, vote for the highest server ID (myid)
  - Once a server receives votes from a quorum → it is the leader
  - Others become followers

Phase 2: Discovery / Synchronization
  - New leader collects the latest transaction from each follower
  - Leader determines the authoritative history
  - Leader sends DIFF, TRUNC, or SNAP to sync followers:
      • DIFF:  "here are the transactions you're missing"
      • TRUNC: "you have transactions I don't; roll back"
      • SNAP:  "you're too far behind; here's a full snapshot"
  - Once all followers are synced → ready to serve

Phase 3: Broadcast (normal operation)
  - Leader receives write request
  - Leader assigns next zxid (epoch + counter)
  - Leader sends PROPOSAL to all followers
  - Followers write to transaction log, send ACK
  - Once leader has ACKs from quorum → sends COMMIT
  - All servers apply the transaction to the in-memory data tree

Key guarantee: all committed transactions are delivered in the
same order to all servers. No gaps, no reordering.
```
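The broadcast phase (Phase 3) can be sketched as a toy simulation. The `Leader` and `Follower` classes below are illustrative only, assuming an in-memory dict stands in for the fsync'd transaction log:

```python
class Follower:
    def __init__(self):
        self.log = {}    # zxid -> op; stands in for the fsync'd transaction log
        self.data = {}   # in-memory data tree

    def propose(self, zxid, op):
        self.log[zxid] = op   # persist the proposal before ACKing
        return True           # ACK back to the leader

    def commit(self, zxid):
        path, value = self.log[zxid]
        self.data[path] = value


class Leader(Follower):
    def __init__(self, followers, epoch):
        super().__init__()
        self.followers = followers
        self.epoch, self.counter = epoch, 0

    def write(self, path, value):
        self.counter += 1
        zxid = (self.epoch << 32) | self.counter   # global total order
        op = (path, value)
        self.propose(zxid, op)                     # leader logs its own proposal
        acks = 1 + sum(f.propose(zxid, op) for f in self.followers)
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks < quorum:
            raise RuntimeError("no quorum; cannot commit")
        for server in [self, *self.followers]:     # COMMIT to everyone
            server.commit(zxid)
        return zxid


# A 5-server ensemble: one leader, four followers.
followers = [Follower() for _ in range(4)]
leader = Leader(followers, epoch=3)
leader.write("/config", "v1")
```

After the write, every server's in-memory tree holds the same `/config` value, in the same order it was committed; the real protocol differs mainly in that ACKs and COMMITs travel over the network asynchronously.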
Zab vs Raft: Key Difference
The main difference is in recovery. In Raft, a candidate can only win votes if its log is at least as up-to-date as each voter's (compared by last entry term, then log length). In Zab, the leader with the highest zxid wins, and it actively synchronizes followers during the discovery phase. Zab also separates the broadcast protocol from leader election more cleanly than Raft.
Leader Election Process
When the ensemble starts or the current leader fails, all servers enter the LOOKING state and begin the leader election process. The goal: elect the server with the most up-to-date transaction history as the new leader.
Enter LOOKING State
When a server starts or loses contact with the leader, it transitions to LOOKING and begins an election round.
Cast Initial Vote
Each server votes for itself initially, sending (myid, last_zxid, epoch) to all other servers.
Compare Votes
When receiving a vote, compare: highest epoch wins. If tied, highest zxid wins. If still tied, highest myid wins.
Update Vote
If a received vote is 'better' than your current vote, adopt it and re-broadcast your updated vote.
Quorum Reached
Once a server sees that a quorum of servers have voted for the same candidate, the election is decided.
Role Assignment
The winner becomes LEADING, all others become FOLLOWING. The new leader begins the synchronization phase.
```
Example: 5-node ensemble, server 3 (previous leader) crashes

Server states: [1: LOOKING, 2: LOOKING, 4: LOOKING, 5: LOOKING]

Round 1 (initial votes, each server votes for itself):
  Server 1 broadcasts: (myid=1, zxid=0x300000045, epoch=3)
  Server 2 broadcasts: (myid=2, zxid=0x300000047, epoch=3)
  Server 4 broadcasts: (myid=4, zxid=0x300000047, epoch=3)
  Server 5 broadcasts: (myid=5, zxid=0x300000044, epoch=3)

Round 2 (compare and update):
  Server 1 sees Server 2's vote: zxid 0x47 > 0x45       → adopts vote for 2
  Server 1 sees Server 4's vote: zxid equal, myid 4 > 2 → adopts vote for 4
  Server 5 sees Server 4's vote: zxid 0x47 > 0x44       → adopts vote for 4
  Server 2 sees Server 4's vote: zxid equal, myid 4 > 2 → adopts vote for 4

Result: all servers vote for Server 4 (quorum = 3, got 4 votes)
  Server 4        → LEADING
  Servers 1, 2, 5 → FOLLOWING

Election time: typically 200ms - 2 seconds
```
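The vote-comparison rule (epoch first, then zxid, then myid) maps neatly onto tuple comparison. This sketch replays the votes from the example above; the `better` helper is a hypothetical name for illustration, not ZooKeeper's API:

```python
# Fast Leader Election compares votes as (epoch, zxid, myid): higher epoch
# wins; on a tie, higher zxid; on another tie, higher myid. Representing a
# vote as that tuple makes the rule plain tuple comparison.

def better(a, b):
    """Return the winning vote; each vote is (epoch, zxid, myid)."""
    return max(a, b)

# Round 1 votes from the example (server 3 has crashed).
votes = {
    1: (3, 0x300000045, 1),
    2: (3, 0x300000047, 2),
    4: (3, 0x300000047, 4),
    5: (3, 0x300000044, 5),
}

# Each server adopts the best vote it has seen; with full vote exchange,
# every server converges on the global maximum.
winner = max(votes.values())
print(f"elected leader: server {winner[2]}")  # elected leader: server 4
```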
Why Highest zxid Wins
The server with the highest zxid has the most recent committed transaction. Electing it as leader minimizes the amount of synchronization needed; it already has the most complete history. Raft's election applies a similar up-to-date check (comparing last log term, then length), but only Zab's leader then actively syncs followers in a dedicated discovery phase.
Quorum Mechanics
Quorum is the minimum number of servers that must agree for an operation to succeed. It's the mechanism that ensures consistency: any two quorums must overlap in at least one server, guaranteeing that committed data is never lost.
```
Quorum = ⌊N/2⌋ + 1 (strict majority)

Why majority works:
  - Any two majorities of N must share at least one member
  - This shared member has seen both operations
  - Therefore: no committed write can be "forgotten"

Example with 5 nodes (quorum = 3):
  Write W1 committed by: {A, B, C}  (3 of 5 = quorum ✅)
  Write W2 committed by: {C, D, E}  (3 of 5 = quorum ✅)
  Overlap: server C has both W1 and W2
  → no matter which quorum you ask, at least one server has the latest data

What happens when quorum is lost:
  5 nodes, 3 crash → only 2 remaining
  2 < quorum (3) → ensemble STOPS serving writes
  Reads may still work (stale) but writes are blocked
  → this is the CP trade-off: unavailable rather than inconsistent
```
| Scenario (5-node) | Available Nodes | Has Quorum? | Can Write? |
|---|---|---|---|
| Normal operation | 5 | ✅ Yes (5 ≥ 3) | ✅ Yes |
| One node down | 4 | ✅ Yes (4 ≥ 3) | ✅ Yes |
| Two nodes down | 3 | ✅ Yes (3 ≥ 3) | ✅ Yes (barely) |
| Three nodes down | 2 | ❌ No (2 < 3) | ❌ No, unavailable |
| Network partition 3\|2 | 3 in majority | ✅ Majority partition only | ✅ Majority only |
Quorum Intersection Property
The mathematical guarantee: for any N, two groups of size ⌊N/2⌋ + 1 must share at least one member. This single property is what makes distributed consensus possible; it ensures that no two conflicting decisions can both achieve quorum.
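The intersection property is easy to check exhaustively for small N (a brute-force sketch rather than a proof, though the pigeonhole argument covers all N):

```python
from itertools import combinations

def quorum(n: int) -> int:
    """Strict majority: floor(n/2) + 1."""
    return n // 2 + 1

# For each ensemble size, enumerate every possible quorum and verify that
# every pair of quorums shares at least one server.
for n in (3, 4, 5, 7):
    quorums = [set(q) for q in combinations(range(n), quorum(n))]
    assert all(a & b for a, b in combinations(quorums, 2))
    print(f"N={n}: any two size-{quorum(n)} quorums overlap")
```

The pigeonhole intuition: two disjoint groups of ⌊N/2⌋ + 1 servers would need at least N + 1 (odd N) or N + 2 (even N) distinct servers, which is more than exist.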
Observer Nodes
Observers are a special type of ZooKeeper server that receive all state updates but do NOT participate in voting. They scale read throughput without affecting write performance; adding observers doesn't increase quorum size.
| Property | Follower | Observer |
|---|---|---|
| Receives proposals | ✅ Yes | ✅ Yes |
| Sends ACKs to leader | ✅ Yes (counted for quorum) | ❌ No |
| Votes in elections | ✅ Yes | ❌ No |
| Serves client reads | ✅ Yes | ✅ Yes |
| Forwards client writes | ✅ Yes (to leader) | ✅ Yes (to leader) |
| Affects write latency | ✅ Yes (must ACK) | ❌ No |
| Use case | Core ensemble | Read scaling, remote DC |
```
# 5-node ensemble with 2 observers

# Voting members (participate in quorum):
server.1=zk1.dc1:2888:3888
server.2=zk2.dc1:2888:3888
server.3=zk3.dc1:2888:3888
server.4=zk4.dc1:2888:3888
server.5=zk5.dc1:2888:3888

# Observers (read-only replicas, no voting):
server.6=zk6.dc2:2888:3888:observer
server.7=zk7.dc2:2888:3888:observer

# Quorum is still 3 (majority of 5 voting members).
# Observers in DC2 serve local reads with low latency.
# Writes still require a round-trip to DC1 for quorum.
```
When to Use Observers
- ✅ Cross-datacenter reads: place observers in remote DCs for low-latency local reads
- ✅ Read scaling: add observers to handle more read clients without affecting write quorum
- ✅ Non-critical clients: connect monitoring/analytics tools to observers instead of voting members
- ✅ Gradual scaling: add read capacity without changing quorum dynamics
Write Path Through the Ensemble
Every write in ZooKeeper follows the same path through the ensemble. Understanding this path explains why writes are slower than reads and why write throughput doesn't scale with more nodes.
Client Sends Write
Client sends create/setData/delete to any server. If it's a follower, it forwards to the leader.
Leader Assigns zxid
Leader assigns the next globally unique transaction ID (zxid = epoch << 32 | counter). This ensures total ordering.
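The zxid packing from this step can be sketched with a few helpers (illustrative names; ZooKeeper itself just manipulates a 64-bit long):

```python
# zxid layout: high 32 bits = epoch, low 32 bits = per-epoch counter.

def make_zxid(epoch: int, counter: int) -> int:
    return (epoch << 32) | counter

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

# Matches the election example earlier: epoch 3, counter 0x47.
assert make_zxid(3, 0x47) == 0x300000047

# zxids sort in transaction order: a newer epoch outranks any counter
# value from an older epoch, so plain integer comparison gives total order.
assert make_zxid(4, 0) > make_zxid(3, 0xFFFFFFFF)
```

This is why comparing raw zxids during leader election is sufficient: the epoch in the high bits dominates, and the counter breaks ties within an epoch.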
Leader Sends PROPOSAL
Leader sends the transaction proposal to all followers. Each proposal contains the zxid and the operation details.
Followers Write to Transaction Log
Each follower writes the proposal to its transaction log on disk (fsync) and sends an ACK back to the leader.
Leader Collects Quorum ACKs
Once the leader has ACKs from a quorum (including itself), the transaction is committed.
Leader Sends COMMIT
Leader sends COMMIT to all followers. Each server applies the transaction to its in-memory data tree.
Response to Client
The originating server (leader or follower) sends the response back to the client.
Write Latency Breakdown (5-node ensemble, same DC):

```
Client → Follower:              ~0.5ms   (network)
Follower → Leader:              ~0.5ms   (forward)
Leader assigns zxid:            ~0.01ms  (in-memory)
Leader → Followers (PROPOSAL):  ~0.5ms   (broadcast)
Followers fsync to disk:        ~1-5ms   (SSD) or ~10ms (HDD)
Followers → Leader (ACK):       ~0.5ms   (network)
Leader → Followers (COMMIT):    ~0.5ms   (broadcast)
Follower → Client:              ~0.5ms   (response)
─────────────────────────────────────────
Total:                          ~4-8ms (SSD), ~12-15ms (HDD)

Key insight: disk fsync dominates write latency.
This is why SSDs are critical for ZooKeeper performance.

Throughput: ~10,000-50,000 writes/sec (depending on hardware)
Compare:    ~100,000+ reads/sec (no disk, no consensus)
```
Why Writes Don't Scale
Adding more followers doesn't help write throughput; it actually hurts it slightly because the leader must send proposals to more servers. Write throughput is bounded by the leader's ability to process proposals and by disk fsync latency. This is by design: ZooKeeper is a coordination service, not a data store.
Interview Questions
Q: Why does a ZooKeeper ensemble need an odd number of nodes?
A: Because quorum requires a strict majority (⌊N/2⌋ + 1). With even numbers, you get the same fault tolerance as N-1 nodes: a 4-node ensemble tolerates 1 failure (quorum = 3), same as 3 nodes. The 4th node adds cost without improving resilience. Odd numbers also prevent tie situations during leader election. The sweet spot is 5 nodes: it tolerates 2 failures, allowing maintenance of one node while still surviving an unexpected failure.
Q: How does the Zab protocol differ from Raft?
A: Key differences: (1) Recovery: Zab has an explicit discovery/synchronization phase where the new leader actively syncs followers, while Raft relies on the leader sending missing log entries during normal operation. (2) Leader election: Zab elects the server with the highest zxid (most recent transaction); Raft only elects a candidate whose log is at least as up-to-date as each voter's. (3) Design philosophy: Zab was designed specifically for primary-backup state machine replication; Raft was designed for general-purpose consensus with understandability as a goal. Both provide the same safety guarantees.
Q: What happens when the ZooKeeper leader crashes?
A: (1) Followers detect leader loss via missed heartbeats. (2) All servers enter the LOOKING state. (3) Fast Leader Election begins: servers exchange votes, preferring the highest zxid, then the highest myid. (4) Once a quorum agrees on a candidate, it becomes leader. (5) The new leader runs the discovery phase, collecting the latest state from followers. (6) Synchronization: the leader sends DIFF/TRUNC/SNAP to bring followers up to date. (7) The ensemble resumes serving requests. Total time: typically 200ms-2s. During the election, the ensemble is unavailable for writes.
Q: Why can't you scale ZooKeeper write throughput by adding more nodes?
A: Every write must go through the single leader and achieve quorum acknowledgment. Adding followers means the leader must send proposals to more servers and wait for more ACKs. Write throughput is bounded by: (1) the leader's processing capacity, (2) network round-trip time for proposals/ACKs, (3) disk fsync latency on followers. More nodes actually slightly decrease write throughput. To scale reads, add observers (non-voting replicas). To scale writes, you'd need to shard, but ZooKeeper doesn't support that natively.
Q: What are observer nodes and when would you use them?
A: Observers receive all state updates but don't vote in quorum or elections. Use cases: (1) Cross-DC reads: place observers in remote datacenters for low-latency local reads without affecting write quorum latency. (2) Read scaling: handle more read clients without changing quorum size. (3) Non-critical clients: connect monitoring tools to observers. Key benefit: adding observers doesn't increase quorum size, so write latency is unaffected. Trade-off: observer reads may be slightly stale (they receive COMMIT messages asynchronously).
Common Mistakes
Using an even number of nodes
Deploying 4 or 6 nodes thinking more is always better. A 4-node ensemble has the same fault tolerance as 3 nodes (both tolerate 1 failure) but costs more and has slightly higher write latency.
✅ Always use odd numbers: 3, 5, or 7. For most production deployments, 5 is the sweet spot: it tolerates 2 failures, allowing safe rolling upgrades.
Spreading voting members across high-latency links
Placing 5 voting members across 5 different regions. Every write requires quorum ACKs, so write latency equals the round-trip to the slowest quorum member.
✅ Keep voting members in the same region (or nearby regions with <10ms latency). Use observers in remote regions for local reads. Write latency is bounded by the slowest quorum member.
Running ZooKeeper on spinning disks
Using HDD for the transaction log. Every write requires an fsync to the transaction log before ACKing. HDD fsync takes 10-15ms vs 1-2ms for SSD.
✅ Always use SSDs for ZooKeeper's transaction log (dataLogDir). Put the transaction log and snapshots (dataDir) on separate disks for best performance.
Co-locating ZooKeeper with other services
Running ZooKeeper on the same machines as Kafka brokers, application servers, or databases. GC pauses or CPU contention from other processes cause ZK to miss heartbeats and trigger unnecessary leader elections.
✅ Run ZooKeeper on dedicated machines. It's latency-sensitive and needs predictable performance. A single GC pause can cause the leader to lose quorum and trigger a disruptive election.
Adding nodes to scale write throughput
Thinking that adding more followers will increase write capacity, like adding Cassandra nodes increases throughput.
✅ ZooKeeper writes don't scale horizontally; all writes go through one leader. Add observers for read scaling. If you need more write throughput, optimize hardware (faster SSDs, more RAM) or reduce write frequency.