Consistency Guarantees
ZooKeeper provides sequential consistency for writes and a weaker guarantee for reads. Understanding exactly what it promises, and what it doesn't, is critical for building correct coordination systems.
Sequential Consistency
ZooKeeper guarantees sequential consistency for all state changes. This means all updates from all clients are applied in a single, well-defined order that respects the order each client submitted them. If client A sends write W1 before W2, the ensemble applies W1 before W2, guaranteed.
The Single Ledger
Imagine a single ledger book where all transactions are recorded in order. Multiple people can submit entries, and each person's entries appear in the order they submitted them. The ledger never reorders entries from the same person. However, entries from different people may be interleaved in any order, as long as each person's entries maintain their relative order. That's sequential consistency.
```
Sequential Consistency Guarantee:

Client A sends: W1 (set /x = 1), W2 (set /y = 2)
Client B sends: W3 (set /x = 3), W4 (set /z = 4)

Valid orderings (respect per-client order):
  ✓ W1, W2, W3, W4   (A's writes first, then B's)
  ✓ W1, W3, W2, W4   (interleaved; A: W1<W2, B: W3<W4)
  ✓ W3, W1, W4, W2   (interleaved; A: W1<W2, B: W3<W4)
  ✓ W3, W4, W1, W2   (B's writes first, then A's)

Invalid orderings:
  ✗ W2, W1, W3, W4   (violates A's order: W2 before W1)
  ✗ W1, W4, W3, W2   (violates B's order: W4 before W3)

What this means in practice:
- If you set config then set "ready=true", all observers
  will see the config BEFORE seeing ready=true
- Causal ordering within a single client is preserved
- Cross-client ordering is NOT guaranteed (no global real-time order)
```
Sequential ≠ Linearizable
Sequential consistency is weaker than linearizability. It guarantees per-client ordering but NOT real-time ordering across clients. If client A writes at 10:00:00 and client B writes at 10:00:01, ZooKeeper might order B's write before A's in the global sequence. For most coordination use cases, sequential consistency is sufficient.
Atomicity & Single System Image
ZooKeeper provides two additional guarantees beyond sequential consistency: atomicity (updates either fully succeed or fully fail) and single system image (a client sees the same view regardless of which server it connects to).
ZooKeeper's Consistency Guarantees
- ✓ Sequential consistency - updates are applied in the order sent by each client
- ✓ Atomicity - updates either succeed completely or fail completely; no partial state
- ✓ Single system image - a client sees the same view regardless of which server it's connected to
- ✓ Reliability - once an update is applied, it persists until overwritten (durability)
- ✓ Timeliness - the system view is guaranteed to be up-to-date within a bounded time
```
Atomicity Guarantee:

Single operation: setData("/config", newData, version)
  - Either the data is fully updated OR the operation fails
  - No partial writes (half the bytes updated)
  - No torn reads (reading during a write)

Multi operation (atomic batch):
  zk.multi([
      Op.create("/txn/step1", data1),
      Op.setData("/txn/step2", data2, version2),
      Op.delete("/txn/step3", version3),
  ]);
  - ALL three operations succeed, OR NONE of them do
  - If step2 fails (wrong version), step1 is rolled back
  - Observers never see an intermediate state

Single System Image:

  Client connected to Server A, reads /config → "v5"
  Client reconnects to Server B (failover)
  Client reads /config → guaranteed to see "v5" or newer

  - Client NEVER goes backward in time after reconnection
  - Implemented via zxid: the client sends its last-seen zxid,
    and a server won't serve the client if it is behind that zxid
```
Multi Operations Are Powerful
The multi() operation (ZooKeeper 3.4+) enables atomic transactions across multiple znodes. This is essential for patterns like "update config AND set version flag" or "delete old leader AND create new leader" atomically. Without multi(), you'd need complex two-phase patterns.
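As a minimal sketch of the first pattern with the Java client (the paths, payloads, and version numbers are hypothetical; `zk` is an already-connected handle):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.OpResult;
import org.apache.zookeeper.ZooKeeper;

public class AtomicConfigUpdate {
    /** Writes new config bytes and bumps a version flag in one atomic batch. */
    static List<OpResult> updateConfig(ZooKeeper zk,
                                       byte[] newConfig, int configVersion,
                                       byte[] newFlag, int flagVersion)
            throws KeeperException, InterruptedException {
        return zk.multi(Arrays.asList(
                // Both ops are guarded by expected versions; if either
                // check fails, the whole batch fails and nothing is applied
                Op.setData("/app/config", newConfig, configVersion),
                Op.setData("/app/config-version", newFlag, flagVersion)));
    }
}
```

The version arguments double as optimistic-concurrency guards: a concurrent writer causes the entire multi() to fail, never a half-applied state.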
What ZooKeeper Does NOT Guarantee
Understanding what ZooKeeper does NOT guarantee is just as important as understanding what it does. The most critical gap: reads are NOT linearizable by default. A client reading from a follower may see stale data.
| Guarantee | Provided? | Explanation |
|---|---|---|
| Linearizable writes | ✓ Yes | All writes go through the leader, totally ordered |
| Linearizable reads | ✗ No (by default) | Reads from followers may be stale |
| Real-time ordering | ✗ No | Cross-client writes may be reordered |
| Read-your-own-writes (same server) | ✓ Yes | The same server always returns your latest write |
| Read-your-own-writes (different server) | ✗ Not guaranteed | The new server may be behind; use sync() |
```
The Stale Read Problem:

Timeline:
  t=0  Client A writes /config = "v2"  (goes to leader)
  t=1  Leader commits, sends COMMIT to followers
  t=1  Client A gets success response
  t=2  Client B reads /config from Follower 3
       → Follower 3 hasn't received COMMIT yet!
       → Client B reads "v1" (stale!)
  t=3  Follower 3 receives COMMIT, applies "v2"
  t=4  Client B reads /config from Follower 3 → "v2" (now current)

This is CORRECT behavior per ZooKeeper's guarantees!
Sequential consistency doesn't promise real-time reads.

Solutions:
1. sync() — force the follower to catch up
   zk.sync("/config");                                // wait until follower is current
   byte[] data = zk.getData("/config", false, stat);  // now guaranteed fresh
   // Cost: one round-trip to leader + wait for sync
2. Read from the leader (not standard API, but some clients support it)
   // Always gets the latest committed value
   // Cost: all reads go to one server (no read scaling)
3. Use watches instead of polling
   // A watch fires AFTER the update is committed,
   // so re-reading after a watch event always gets the new value

When stale reads are OK:
- Reading config that changes rarely
- Service discovery (a slightly stale list is fine)
- Monitoring / dashboards

When stale reads are NOT OK:
- Checking whether you still hold a lock (use session events instead)
- Reading a value you just wrote from a different server
```
sync() Is Your Friend
If you need a linearizable read, call sync(path) before getData(). sync() forces the connected server to catch up with the leader before responding. This adds latency (one round-trip to leader) but guarantees you see the latest committed state. Use it sparingly β most coordination patterns don't need it.
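The `zk.sync(...)` call shown above is a conceptual shorthand: in the stock Java client, sync() is asynchronous and takes a callback. A minimal sketch of a "fresh read", assuming an already-connected handle (error handling on the callback's return code is omitted):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class FreshRead {
    /** Reads path only after the connected server has caught up with the leader. */
    static byte[] freshRead(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        CountDownLatch synced = new CountDownLatch(1);
        // sync() is asynchronous; the callback fires once this server is current
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        Stat stat = new Stat();
        return zk.getData(path, false, stat); // sees all commits that preceded sync()
    }
}
```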
The zxid: Transaction ID
Every state change in ZooKeeper is assigned a globally unique, monotonically increasing transaction ID called a zxid. The zxid is a 64-bit number that encodes both the leader epoch and a per-epoch counter. It is the backbone of ZooKeeper's ordering guarantees.
```
zxid Structure (64 bits):

┌─────────────────────────────┬─────────────────────────────────┐
│ High 32 bits: EPOCH         │ Low 32 bits: COUNTER            │
│ (leader term)               │ (transaction within epoch)      │
└─────────────────────────────┴─────────────────────────────────┘

Example: zxid = 0x0000000300000047
  Epoch:   3   (third leader since the cluster started)
  Counter: 71  (71st transaction in this epoch)

Properties:
- Globally unique (no two transactions share a zxid)
- Monotonically increasing (zxid N+1 is always after zxid N)
- Encodes causality (higher zxid = happened later in ZK's view)
- Epoch increases on every leader election
- Counter resets to 0 on a new epoch

Where zxid appears:
- Stat.czxid → zxid that created this znode
- Stat.mzxid → zxid that last modified this znode
- Stat.pzxid → zxid of the last child modification
- Watch events → include the zxid that triggered the watch
- Client session → tracks the last-seen zxid for ordering

Uses:
- Fencing tokens (lock holder's czxid)
- Ordering events (which happened first?)
- Leader election (highest zxid = most up-to-date)
- Follower sync (send all transactions after zxid X)
```
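Since the layout is plain bit-packing, a zxid can be decoded without any ZooKeeper dependency - a minimal sketch:

```java
public class ZxidDecode {
    static long epoch(long zxid)   { return zxid >>> 32; }        // high 32 bits: leader term
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; } // low 32 bits: txn within epoch

    public static void main(String[] args) {
        long zxid = 0x0000000300000047L;
        System.out.println("epoch=" + epoch(zxid) + ", counter=" + counter(zxid));
        // prints: epoch=3, counter=71
    }
}
```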
Why zxid Matters
- ✓ Total ordering - every change has a unique position in the global history
- ✓ Leader election - the server with the highest zxid becomes leader (most up-to-date)
- ✓ Follower synchronization - the leader sends transactions after the follower's last zxid
- ✓ Fencing tokens - use czxid to detect stale lock holders (see the sketch after this list)
- ✓ Single system image - a client won't connect to a server behind its last-seen zxid
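A minimal sketch of the fencing-token item, assuming a hypothetical /locks/resource znode: the lock holder reads back the node's czxid and attaches it to downstream requests, and because zxids are monotonic, a later holder's token is always strictly larger.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class FencingToken {
    /** Creates the (hypothetical) lock znode and returns its czxid as a fencing token. */
    static long acquireWithToken(ZooKeeper zk, byte[] ownerInfo)
            throws KeeperException, InterruptedException {
        zk.create("/locks/resource", ownerInfo,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        Stat stat = zk.exists("/locks/resource", false);
        // czxid only grows across lock generations, so storage-side checks
        // can reject any request carrying a smaller (stale) token
        return stat.getCzxid();
    }
}
```

Downstream services then accept a request only if its token is at least as large as the largest token they have seen.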
Epochs
An epoch is the high 32 bits of the zxid; it represents a leader's term. Every time a new leader is elected, the epoch increments. This ensures that transactions from different leaders are never confused, and prevents "zombie leaders" from committing stale transactions.
```
Epoch Progression:

Epoch 1 (Leader: Server A)
  zxid 0x0000000100000001  ← first transaction
  zxid 0x0000000100000002  ← second transaction
  ...
  zxid 0x000000010000002F  ← 47th transaction
  [Server A crashes]

Epoch 2 (Leader: Server B)
  zxid 0x0000000200000001  ← first transaction in new epoch
  zxid 0x0000000200000002  ← second transaction
  ...
  [Server B steps down gracefully]

Epoch 3 (Leader: Server C)
  zxid 0x0000000300000001  ← first transaction in epoch 3

Why epochs prevent zombie leaders:
1. Server A was leader in epoch 1
2. Server A gets network-partitioned (but doesn't know it's not leader)
3. Server B is elected leader in epoch 2
4. Server A recovers and tries to send proposals with epoch 1
5. Followers reject them: "I'm in epoch 2, your epoch 1 is stale"
6. Server A realizes it's no longer leader and becomes a follower

Without epochs:
- The old leader could commit transactions after the new leader starts
- Two leaders writing simultaneously = inconsistency
- Epochs are the "term number" that prevents this
```
Presidential Terms
Epochs are like presidential terms. President #1 can sign executive orders during their term. When President #2 takes office, any orders signed by #1 after the transition are invalid; they're from a previous epoch. Government officials (followers) check the term number before accepting orders. A former president showing up and trying to sign orders is rejected because their epoch is stale.
Epoch + Counter = Total Order
The combination of epoch (which leader) and counter (which transaction within that leader's term) gives every transaction a unique, totally ordered position. You can always determine which of two transactions happened first by comparing their zxids: higher epoch wins, and within the same epoch, higher counter wins.
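Because the epoch occupies the high 32 bits, that rule collapses to a single unsigned 64-bit comparison - a minimal sketch in plain Java:

```java
public class ZxidOrder {
    /** Negative if a happened before b, zero if equal, positive if a happened after b. */
    static int compare(long a, long b) {
        // The epoch is in the high bits, so unsigned comparison orders
        // by epoch first, then by counter within the same epoch.
        return Long.compareUnsigned(a, b);
    }
}
```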
Performance Characteristics
ZooKeeper is designed for read-heavy coordination workloads. Reads are fast (served locally from memory), writes are slower (require consensus). Understanding the performance profile helps you design systems that use ZooKeeper correctly.
| Metric | Reads | Writes |
|---|---|---|
| Throughput | 100,000+ ops/sec (scales with nodes) | 10,000-50,000 ops/sec (doesn't scale) |
| Latency (same DC) | < 1ms | 4-10ms (SSD) |
| Latency (cross DC) | < 1ms (local follower) | 50-200ms (round-trip to leader) |
| Scales with nodes | ✓ Yes (more readers) | ✗ No (one leader) |
| Requires consensus | ✗ No (local memory) | ✓ Yes (Zab protocol) |
| Disk I/O | None (in-memory) | fsync per write (transaction log) |
```
Performance Bottlenecks and Solutions:

1. Write latency (biggest factor: disk fsync)
   Problem:  Every write fsyncs to the transaction log before ACK
   Solution: Use SSDs (1-2ms fsync vs 10-15ms HDD)
   Solution: Separate dataLogDir from dataDir (dedicated disk for the txn log)

2. Write throughput (bounded by the leader)
   Problem:  A single leader processes all writes sequentially
   Solution: Batch writes where possible (multi() operation)
   Solution: Reduce write frequency (batch config updates)
   Anti-pattern: Don't write metrics/logs to ZK

3. Read latency (usually not a problem)
   Problem:  Stale reads from followers
   Solution: Use sync() only when freshness is critical
   Solution: Connect to the leader for critical reads (rare)

4. Large ensemble overhead
   Problem:  More followers = more proposals to send
   Solution: Use observers for read scaling (they don't affect quorum)
   Solution: Keep the voting ensemble small (5-7 nodes max)

5. Watch storm
   Problem:  Many clients watching the same node = many notifications
   Solution: Use hierarchical watches (watch the parent, not all children)
   Solution: Batch watch processing in the client

Typical production numbers (5-node ensemble, SSDs):
  Read throughput:  ~150,000 ops/sec (distributed across 5 nodes)
  Write throughput: ~30,000 ops/sec (all through the leader)
  Read latency:     ~0.5ms (p99: ~2ms)
  Write latency:    ~5ms (p99: ~15ms)
  Ensemble size:    ~100K-500K znodes comfortably
```
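A hedged zoo.cfg sketch addressing points 1 and 4 above (host names and paths are placeholders; dataDir, dataLogDir, and the :observer suffix are standard ZooKeeper settings):

```
# Snapshots on one disk; transaction log on a dedicated fast disk (point 1)
dataDir=/var/lib/zookeeper/data
dataLogDir=/ssd/zookeeper/txnlog

tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181

# Small voting ensemble (point 4)...
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
# ...plus observers for read scaling; they receive commits but don't vote
server.4=zk4.example.com:2888:3888:observer
server.5=zk5.example.com:2888:3888:observer
```

Each observer's own config additionally needs peerType=observer.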
Read:Write Ratio Matters
ZooKeeper is optimized for workloads with a high read-to-write ratio (10:1 or higher). Coordination metadata is read frequently but written rarely: "who is the leader?" is read constantly but changes only on failure. If your workload is write-heavy, ZooKeeper is the wrong tool.
Interview Questions
Q: What consistency model does ZooKeeper provide? Is it linearizable?
A: ZooKeeper provides sequential consistency for writes (all updates applied in a single total order respecting per-client ordering) but NOT linearizable reads by default. Reads from followers may be stale: a follower might not have received the latest COMMIT yet. To get a linearizable read, call sync() before getData(), which forces the server to catch up with the leader. Writes ARE linearizable because they all go through the single leader and are totally ordered by zxid.
Q: What is a zxid and why is it important?
A: A zxid (ZooKeeper Transaction ID) is a 64-bit number assigned to every state change. High 32 bits = epoch (leader term), low 32 bits = counter (transaction within epoch). It's important because: (1) It provides a total ordering of all changes. (2) It drives leader election: the highest zxid marks the most up-to-date server. (3) It enables follower sync: the leader sends all transactions after the follower's last zxid. (4) It serves as a fencing token for locks. (5) It ensures the single system image: clients won't connect to servers behind their last-seen zxid.
Q: Can a client read stale data from ZooKeeper? How do you prevent it?
A: Yes! Reads from followers can be stale because followers apply COMMITs asynchronously. A client reading from Follower X might not see a write that was just committed by the leader. Prevention: (1) sync(): call it before the read to force the follower to catch up (adds one round-trip of latency). (2) Watches: watch events are delivered after the change is committed, so re-reading after a watch is always fresh. (3) Same-server guarantee: if you wrote to server X and read from server X, you'll see your own write. Stale reads are usually acceptable for coordination (config, service lists) but not for lock verification.
Q: What are epochs and how do they prevent split-brain?
A: An epoch is the high 32 bits of the zxid; it increments on every leader election. Epochs prevent split-brain by acting as a 'term number': (1) The new leader gets epoch N+1. (2) The old leader (epoch N) might still be alive but partitioned. (3) If the old leader tries to send proposals with epoch N, followers reject them ('I'm in epoch N+1'). (4) The old leader discovers it's been superseded and becomes a follower. Without epochs, two leaders could both commit transactions, causing inconsistency. Epochs ensure only the current leader's proposals are accepted.
Q: Why does ZooKeeper read throughput scale with nodes but write throughput doesn't?
A: Reads are served locally from each server's in-memory data tree, so no consensus is needed. More servers = more read capacity (linear scaling). Writes must go through the single leader and achieve quorum ACKs via Zab; adding followers means more proposals to send and more ACKs to collect. Write throughput is bounded by the leader's processing speed, the network round-trip for proposals, and disk fsync latency. Adding observers (non-voting) scales reads without affecting the write quorum. This is by design: ZooKeeper targets read-heavy coordination workloads (10:1+ read:write ratio).
Common Mistakes
Assuming reads are always fresh
Reading from a follower and assuming it reflects the latest write. A follower may be one or more transactions behind the leader, returning stale data.
→ Use sync() before critical reads, or rely on watches (which fire after commit). For most coordination patterns, stale reads are acceptable, but for lock verification or leader checks, use sync().
Write-heavy workloads on ZooKeeper
Writing metrics, counters, or frequently-changing state to ZooKeeper. Every write requires consensus, which overwhelms the ensemble and increases latency for all clients.
→ ZooKeeper is designed for read-heavy workloads (10:1+ ratio). Write coordination metadata that changes rarely (leader, config, membership). For high-frequency writes, use Redis, Kafka, or a time-series database.
Not using sync() when reading after a cross-client write
Client A writes a value, notifies Client B out-of-band, Client B reads from a different server and gets stale data because the follower hasn't caught up yet.
→ Client B should call sync() before reading, or Client A should include the zxid in the out-of-band notification so Client B can verify it's reading at least that version (see the sketch below).
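A sketch of that zxid-carrying handshake (the helper and path are hypothetical): Client A takes the token from the Stat returned by setData (Stat.getMzxid()) and sends it with the notification; Client B syncs, reads, and asserts its view has caught up to the token.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZxidVerifiedRead {
    /** Client B: read path, verifying the view includes the write that produced minMzxid. */
    static byte[] readAtLeast(ZooKeeper zk, String path, long minMzxid)
            throws KeeperException, InterruptedException {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null); // catch up with the leader
        synced.await();
        Stat stat = new Stat();
        byte[] data = zk.getData(path, false, stat);
        if (Long.compareUnsigned(stat.getMzxid(), minMzxid) < 0) {
            // Shouldn't happen after a successful sync(); guards against misuse
            throw new IllegalStateException("view is older than the notified zxid");
        }
        return data;
    }
}
```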
Confusing sequential consistency with linearizability
Designing systems that assume real-time ordering across clients. Sequential consistency only guarantees per-client ordering; cross-client writes may be reordered relative to wall-clock time.
→ If you need cross-client real-time ordering, use sync() or design your protocol not to depend on it. Most coordination patterns (leader election, locks, config) work correctly with sequential consistency.