Sessions & Watches
Sessions bind clients to the ensemble with heartbeats and timeouts. Watches provide event-driven notifications when znodes change — the mechanism that makes coordination reactive instead of polling-based.
What a Session Is
A ZooKeeper session represents the relationship between a client and the ensemble. When a client connects, ZooKeeper creates a session with a unique 64-bit session ID and a session password (for reconnection authentication). The session is the unit of liveness — ephemeral nodes, watches, and pending requests are all tied to it.
The Hotel Key Card
A session is like a hotel key card. When you check in (connect), you get a card (session ID) that opens your room (ephemeral nodes). The card has an expiry (session timeout). As long as you swipe it periodically (heartbeats), it stays active. If you don't swipe for too long, the hotel deactivates it (session expired) and cleans your room (deletes ephemeral nodes). You can use the card at any door (any server in the ensemble) — it's not tied to one specific entrance.
Session Lifecycle:

1. CONNECTING
   - Client initiates a TCP connection to a ZK server
   - Sends: ConnectRequest(sessionTimeout, sessionId=0, password=empty)
   - Server responds: ConnectResponse(sessionId, negotiatedTimeout, password)
2. CONNECTED
   - Client can now perform operations
   - Client sends periodic heartbeats (PING)
   - Server tracks session liveness
3. DISCONNECTED (temporary)
   - TCP connection lost (network blip, server restart)
   - Client enters CONNECTING state and tries other servers in the ensemble
   - Session is still alive on the server side (timeout hasn't expired)
4. RECONNECTED
   - Client connects to another server
   - Sends: ConnectRequest(sessionTimeout, sessionId=X, password=Y)
   - Server validates the session is still alive → CONNECTED again
   - All ephemeral nodes and watches are preserved!
5. EXPIRED (terminal)
   - Session timeout elapsed without a heartbeat
   - Server deletes all ephemeral nodes owned by this session
   - All watches are removed
   - Client receives the SESSION_EXPIRED event
   - Client must create a completely new session

Session state machine:

CONNECTING → CONNECTED → DISCONNECTED → CONNECTING → CONNECTED → EXPIRED (terminal)
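As a rough sketch, the lifecycle can be modeled as a tiny state machine. The enum and transition table below are illustrative only, not the real client library's internals; note that EXPIRED is only reachable via CONNECTING, because the client learns about expiry when it tries to reconnect.

```java
import java.util.*;

public class SessionStateMachine {
    enum State { CONNECTING, CONNECTED, DISCONNECTED, EXPIRED }

    // Allowed transitions; EXPIRED is terminal (no outgoing edges).
    static final Map<State, Set<State>> TRANSITIONS = Map.of(
        State.CONNECTING,   EnumSet.of(State.CONNECTED, State.EXPIRED),
        State.CONNECTED,    EnumSet.of(State.DISCONNECTED),
        State.DISCONNECTED, EnumSet.of(State.CONNECTING),
        State.EXPIRED,      EnumSet.noneOf(State.class)
    );

    State state = State.CONNECTING;

    void moveTo(State next) {
        if (!TRANSITIONS.get(state).contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " not allowed");
        }
        state = next;
    }

    public static void main(String[] args) {
        SessionStateMachine s = new SessionStateMachine();
        // Happy path with one reconnect, then expiry discovered on reconnect.
        s.moveTo(State.CONNECTED);
        s.moveTo(State.DISCONNECTED);
        s.moveTo(State.CONNECTING);
        s.moveTo(State.EXPIRED);
        System.out.println("final=" + s.state); // final=EXPIRED
    }
}
```

Modeling EXPIRED as terminal captures the key rule of this section: the only way out is to throw the object away and build a new one, exactly as a client must create a new session.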
What's Tied to a Session
- ✅Ephemeral nodes — automatically deleted when session expires
- ✅Watches — removed when session expires (no more notifications)
- ✅Pending requests — cancelled if session expires before completion
- ✅Session password — used to reconnect to a different server securely
- ✅Session timeout — negotiated at connect time, enforced by the server
Session Timeout & Heartbeats
The session timeout determines how long ZooKeeper waits before declaring a client dead. It's negotiated at connection time — the client proposes a timeout, and the server may adjust it within bounds (2× to 20× tickTime). Heartbeats (PING messages) keep the session alive.
Timeout Negotiation:

Server config (zoo.cfg):

```
tickTime = 2000            # Base time unit in ms
minSessionTimeout = 4000   # 2 × tickTime (minimum)
maxSessionTimeout = 40000  # 20 × tickTime (maximum)
```

- Client requests sessionTimeout = 30000 (30 seconds) → server responds negotiatedTimeout = 30000 (within bounds ✅)
- Client requests sessionTimeout = 1000 (1 second) → server responds negotiatedTimeout = 4000 (clamped to the minimum)

Heartbeat interval:

- The client sends a PING every negotiatedTimeout / 3
- Example: 30s timeout → PING every 10 seconds

Why timeout/3?

- Gives two missed heartbeats before expiry
- Accounts for network latency and GC pauses
- Client: "I'll ping at 10s, 20s" → server expires at 30s

Server-side tracking:

- The server checks session liveness every tickTime (2s)
- If no heartbeat is received within sessionTimeout → expire
- The leader is responsible for session expiry decisions
| Timeout Value | Failure Detection | False Positives | Use Case |
|---|---|---|---|
| 4-6 seconds | Very fast (4-6s) | High (GC pauses trigger expiry) | Low-latency, stable network |
| 10-15 seconds | Fast (10-15s) | Medium | Most production deployments |
| 30-40 seconds | Slow (30-40s) | Low | Unstable networks, large heaps |
Choosing the Right Timeout
The timeout is a trade-off between failure detection speed and false positives. Too short: GC pauses or network blips cause unnecessary session expiry (ephemeral nodes deleted, locks lost). Too long: actual failures take too long to detect. Start with 10-15 seconds and adjust based on your GC behavior and network stability.
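The clamping and heartbeat arithmetic described above is simple enough to sketch directly. This is a toy calculation, not the client library's actual code, and the method names are made up:

```java
public class SessionTimeoutMath {
    // Server clamps the client's requested timeout to [2*tickTime, 20*tickTime].
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    // Client pings at roughly a third of the negotiated timeout,
    // leaving room for two missed heartbeats before expiry.
    static int heartbeatInterval(int negotiatedMs) {
        return negotiatedMs / 3;
    }

    public static void main(String[] args) {
        int tick = 2000;
        System.out.println(negotiate(30000, tick));   // within bounds -> 30000
        System.out.println(negotiate(1000, tick));    // clamped up -> 4000
        System.out.println(negotiate(90000, tick));   // clamped down -> 40000
        System.out.println(heartbeatInterval(30000)); // -> 10000
    }
}
```

Running the numbers like this makes the trade-off visible: a 30s timeout means a partitioned client's locks survive for up to 30 seconds, while a 4s timeout can be eaten entirely by one bad GC pause.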
Session Expiry
Session expiry is the most critical event in ZooKeeper client programming. When a session expires, everything associated with it is destroyed — ephemeral nodes deleted, watches removed, pending operations cancelled. Your application must handle this gracefully.
Timeout Elapses
The leader hasn't received a heartbeat from the client within the session timeout period.
Leader Declares Expiry
The leader generates a session expiry transaction and commits it through Zab (replicated to all servers).
Ephemeral Nodes Deleted
All ephemeral nodes created by this session are deleted. This triggers watches on those nodes and their parents.
Watches Removed
All watches registered by this session are removed. The client will never receive these notifications.
Client Notified
If/when the client reconnects, it receives SESSION_EXPIRED. It cannot recover — must create a new session from scratch.
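The ephemeral-deletion step above also fires watches on the parents of the deleted nodes. A toy in-memory simulation makes that cascade concrete; it is illustrative only, not ZooKeeper's real data structures:

```java
import java.util.*;

public class ExpiryCascadeModel {
    // path -> owning session id (toy model: every node here is ephemeral)
    final Map<String, Long> nodes = new LinkedHashMap<>();
    // parents with a registered (one-time) child watch
    final Set<String> childWatches = new HashSet<>();
    final List<String> firedEvents = new ArrayList<>();

    void expireSession(long sessionId) {
        Iterator<Map.Entry<String, Long>> it = nodes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() == sessionId) {
                String path = e.getKey();
                it.remove();
                String parent = path.substring(0, path.lastIndexOf('/'));
                // Deleting an ephemeral fires the child watch on its parent
                // (and removes it: watches are one-time triggers).
                if (childWatches.remove(parent)) {
                    firedEvents.add("NodeChildrenChanged " + parent);
                }
            }
        }
    }

    public static void main(String[] args) {
        ExpiryCascadeModel zk = new ExpiryCascadeModel();
        zk.nodes.put("/services/web-1", 42L);  // ephemeral, owned by session 42
        zk.nodes.put("/services/web-2", 99L);  // ephemeral, another session
        zk.childWatches.add("/services");      // someone watches the children

        zk.expireSession(42L);
        System.out.println(zk.nodes.keySet()); // [/services/web-2]
        System.out.println(zk.firedEvents);    // [NodeChildrenChanged /services]
    }
}
```

This is why other clients find out about a peer's death without polling: the expiry transaction deletes the peer's ephemeral node, which fires the child watch they registered on the parent.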
```java
// Handling session expiry (conceptual)
zk.addAuthInfo("digest", "user:pass".getBytes());

// Register a session watcher
Watcher sessionWatcher = (event) -> {
    if (event.getState() == KeeperState.Expired) {
        // SESSION EXPIRED — everything is gone!
        // - All our ephemeral nodes: DELETED
        // - All our watches: REMOVED
        // - All pending operations: CANCELLED
        //
        // We MUST:
        //   1. Create a completely new ZooKeeper client
        //   2. Re-create all ephemeral nodes (re-register as service, re-acquire locks)
        //   3. Re-set all watches
        //   4. Re-initialize application state
        //
        // We CANNOT:
        //   - Reconnect with the old session (it's gone forever)
        //   - Assume our ephemeral nodes still exist
        //   - Assume we still hold any locks
        reinitialize();
    } else if (event.getState() == KeeperState.Disconnected) {
        // DISCONNECTED — session might still be alive!
        // Don't panic yet. The client library will try to reconnect.
        // If it reconnects before timeout, everything is preserved.
        log.warn("Disconnected from ZK — attempting reconnect...");
    }
};
```
Expiry is Server-Side
Session expiry is decided by the server (specifically the leader), not the client. The client might not even know its session has expired until it tries to reconnect. This means: even if your client process is alive but network-partitioned, the server will expire the session and delete ephemeral nodes after the timeout.
Session Reconnection
When a client loses its TCP connection to a ZooKeeper server, it doesn't mean the session is dead. The client has until the session timeout to reconnect to any server in the ensemble. If it reconnects in time, the session (and all ephemeral nodes) is preserved.
Reconnection Scenarios:

Scenario 1: Quick reconnect (SUCCESS)

```
t=0s   Client connected to Server A
t=1s   Server A crashes
t=1s   Client enters DISCONNECTED state
t=2s   Client tries Server B → connection established
t=2s   Client sends session ID + password to Server B
t=2s   Server B validates session is still alive → CONNECTED
```

Result: ✅ Session preserved, ephemeral nodes intact, watches intact

Scenario 2: Slow reconnect (SUCCESS, barely)

```
t=0s   Client connected to Server A
t=1s   Network partition begins
t=1s   Client enters DISCONNECTED state
t=5s   Client tries Server B → fails (also partitioned)
t=10s  Client tries Server C → fails
t=25s  Network heals, client connects to Server B
t=25s  Session timeout is 30s → session still alive!
```

Result: ✅ Session preserved (reconnected before the 30s timeout)

Scenario 3: Too slow (EXPIRED)

```
t=0s   Client connected to Server A
t=1s   Network partition begins
t=30s  Server side: session timeout reached → SESSION EXPIRED
t=30s  Ephemeral nodes deleted, watches removed
t=45s  Network heals, client connects to Server B
t=45s  Client sends old session ID → server says "EXPIRED"
```

Result: ❌ Session gone, must create a new session

Key insight: the client doesn't know whether its session has expired until it successfully reconnects. During the DISCONNECTED state, it should assume the session MIGHT still be alive and keep trying.
Don't Give Up During DISCONNECTED
A common mistake is treating DISCONNECTED as EXPIRED. During DISCONNECTED, your session might still be alive on the server. Keep trying to reconnect. Only when you receive SESSION_EXPIRED (after reconnecting) should you reinitialize everything.
Watch Types
Watches are ZooKeeper's event notification mechanism. Instead of polling for changes, clients register a watch and receive a callback when the watched znode changes. There are two types of watches: data watches and child watches.
| Watch Type | Registered By | Triggered By | Use Case |
|---|---|---|---|
| Data Watch | getData(), exists() | setData(), delete(), create() (for exists) | Config changes, leader data updates |
| Child Watch | getChildren() | Child added or removed | Service discovery, lock queue changes |
Watch Registration and Triggering:

Data watches (registered by getData or exists):

```java
// Register: "notify me when /config/db data changes"
byte[] data = zk.getData("/config/db", true, stat);

// Triggered by:
//   NodeDataChanged → someone called setData on /config/db
//   NodeDeleted     → someone deleted /config/db
//   NodeCreated     → (only for an exists watch on a non-existent node)
```

Child watches (registered by getChildren):

```java
// Register: "notify me when children of /services change"
List<String> children = zk.getChildren("/services", true);

// Triggered by:
//   NodeChildrenChanged → child added or removed under /services
//   NodeDeleted         → /services itself was deleted
// NOT triggered by:
//   ❌ Data changes in children (only structure changes)
//   ❌ Data changes in /services itself
```

IMPORTANT: watches are one-time triggers! After firing once, the watch is gone. You must re-register it to get the next notification.

```java
// Pattern: read + watch in a loop
while (true) {
    data = zk.getData("/config/db", watchCallback, stat);
    // ... use data ...
    // watchCallback fires → loop re-reads with a new watch
}
```
The Doorbell
A watch is like a doorbell that only rings once. You install it (register the watch), and when someone arrives (data changes), it rings (callback fires). But then it's disconnected — you have to reinstall it to hear the next visitor. This one-time nature is intentional: it forces you to re-read the current state, preventing you from missing changes that happened between the notification and your re-registration.
Watch Guarantees
ZooKeeper provides strong guarantees about watch delivery that make them safe for coordination. Understanding these guarantees is essential for building correct distributed algorithms.
- ✅Ordered — watch events are delivered in the same order as the changes that triggered them
- ✅Once-triggered — a watch fires at most once; re-registration is required for subsequent events
- ✅Delivered before new data — a client sees the watch event before seeing the new data from a subsequent read
- ✅Tied to session — watches are removed when the session expires (no stale notifications)
- ✅Server-local — the server that the client is connected to delivers the watch (no cross-server coordination needed)
Watch Ordering Guarantee:

```
Client A sets watch on /config
Client B updates /config to "v2"
Client B updates /config to "v3"

Client A receives:
  1. Watch event: NodeDataChanged on /config
  2. (Client A re-reads /config → gets "v3", NOT "v2")
```

Key insight: Client A might "miss" seeing "v2" — but that's OK! The watch told it "something changed." The re-read gets the LATEST value. For coordination, you care about the current state, not the history.

No-Miss Guarantee (with proper re-registration): if you always re-register your watch immediately after it fires (in the same callback), you will never miss a change. The pattern:

```java
void watchCallback(WatchedEvent event) {
    // Watch fired — re-read with a new watch immediately
    byte[] newData = zk.getData("/config", this, stat);
    // Process newData...
}
```

Between the watch firing and re-registration, changes are captured by the re-read (you get the latest state).
Watches Are Not Message Queues
Watches don't deliver every intermediate value. If /config changes from "v1" to "v2" to "v3" before your watch fires, you get one notification and read "v3". You never see "v2". This is fine for coordination (you want current state) but wrong for event sourcing (use Kafka for that).
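A toy simulation of the one-shot semantics (not the real server code) shows exactly why intermediate values are skipped: the watch fires on the first change and is gone, and the re-read only ever sees the latest state.

```java
import java.util.*;

public class OneShotWatchModel {
    String data;
    private final List<Runnable> watchers = new ArrayList<>();

    void setData(String newData) {
        data = newData;
        // Fire and clear: watches are one-time triggers.
        List<Runnable> toFire = new ArrayList<>(watchers);
        watchers.clear();
        toFire.forEach(Runnable::run);
    }

    // Read the current value and register a (one-time) watch, like getData.
    String getDataAndWatch(Runnable watcher) {
        watchers.add(watcher);
        return data;
    }

    public static void main(String[] args) {
        OneShotWatchModel node = new OneShotWatchModel();
        node.data = "v1";
        int[] fires = {0};

        node.getDataAndWatch(() -> fires[0]++);
        node.setData("v2");  // watch fires once...
        node.setData("v3");  // ...but is already gone: no second event
        System.out.println("fires=" + fires[0]); // fires=1

        // The re-read sees the LATEST value; "v2" is never observed.
        String latest = node.getDataAndWatch(() -> fires[0]++);
        System.out.println("latest=" + latest);  // latest=v3
    }
}
```

One notification for two changes, and the re-read returns "v3": fine for convergence on current state, useless as an event log.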
The One-Time Watch Problem
The one-time nature of watches is both a feature and a challenge. It prevents resource leaks (forgotten watches accumulating) but requires careful re-registration patterns. ZooKeeper 3.6+ introduced persistent watches to address common pain points.
| Aspect | One-Time Watch (Classic) | Persistent Watch (3.6+) |
|---|---|---|
| Fires | Once, then removed | Repeatedly until removed |
| Re-registration | Required after every event | Not needed |
| Miss window | Between fire and re-register | None (always active) |
| Resource cleanup | Automatic (fires once) | Must explicitly remove |
| API | getData(path, true) | addWatch(path, mode) |
| Modes | N/A | PERSISTENT, PERSISTENT_RECURSIVE |
```java
// Classic one-time watch pattern (pre-3.6)
// Problem: gap between watch fire and re-registration
void watchConfig() {
    byte[] data = zk.getData("/config", event -> {
        // Watch fired! But between now and re-registration,
        // another change could happen that we'd miss...
        // (In practice, the re-read catches it, but the pattern is complex)
        watchConfig(); // re-register
    }, stat);
    processConfig(data);
}

// Persistent watch (3.6+) — no re-registration needed
zk.addWatch("/config", event -> {
    // This fires for EVERY change, no re-registration needed
    byte[] newData = zk.getData("/config", null, stat);
    processConfig(newData);
}, AddWatchMode.PERSISTENT);

// Persistent recursive watch — watches the entire subtree
zk.addWatch("/services", event -> {
    // Fires for any change under /services (any depth):
    // NodeCreated, NodeDeleted, NodeDataChanged for any descendant
    refreshServiceRegistry();
}, AddWatchMode.PERSISTENT_RECURSIVE);

// Remove when done
zk.removeWatches("/config", watcher, WatcherType.Any, false);
```
When to Use Persistent Watches
Use persistent watches when you need continuous monitoring without the complexity of re-registration loops. They're ideal for service discovery (watch /services subtree) and configuration management (watch /config). Use classic one-time watches for one-shot coordination (waiting for a specific node to appear or disappear).
Watch Best Practices
- ✅Always re-read after a watch fires — the watch tells you WHAT changed, the read tells you the CURRENT state
- ✅Re-register watches in the callback — minimizes the window for missed events
- ✅Use persistent watches (3.6+) for long-lived monitoring to simplify code
- ✅Don't use watches for high-frequency changes — each watch event is a network message
- ✅Handle SESSION_EXPIRED by re-establishing all watches from scratch
Interview Questions
Q: What happens when a ZooKeeper session expires? Walk through the consequences.
A: When a session expires: (1) The leader generates a session expiry transaction (replicated via Zab). (2) All ephemeral nodes created by that session are deleted — this triggers watches on those nodes and their parents. (3) All watches registered by that session are removed. (4) Any pending operations are cancelled. (5) If/when the client reconnects, it receives SESSION_EXPIRED and must create a completely new session, re-create ephemeral nodes, and re-register watches. The client cannot recover the old session — it's gone permanently.
Q: How do watches work and what guarantees do they provide?
A: Watches are one-time event notifications registered during read operations (getData, getChildren, exists). Guarantees: (1) Ordered — events delivered in the order changes occurred. (2) Delivered before new data — client sees the watch event before any subsequent read returns new data. (3) Once-triggered — fires at most once, must re-register. (4) Session-bound — removed on session expiry. Two types: data watches (triggered by setData/delete) and child watches (triggered by child add/remove). They're NOT message queues — intermediate values may be skipped.
Q: What's the difference between DISCONNECTED and EXPIRED states?
A: DISCONNECTED means the TCP connection was lost but the session might still be alive on the server. The client should keep trying to reconnect — if it succeeds before the session timeout, everything (ephemeral nodes, watches) is preserved. EXPIRED means the server has declared the session dead (timeout elapsed without heartbeat). All ephemeral nodes are deleted, watches removed. The client must create a completely new session. Key rule: don't treat DISCONNECTED as EXPIRED — keep trying to reconnect.
Q: Why are watches one-time triggers? What problem does this solve?
A: One-time triggers solve two problems: (1) Resource management — if watches were permanent, forgotten watches would accumulate indefinitely, consuming server memory and generating unwanted traffic. (2) Correctness — the one-time nature forces clients to re-read the current state after a notification, ensuring they always act on the latest data rather than a potentially stale notification. The re-read pattern (watch fires → re-read with new watch) guarantees no changes are missed. ZooKeeper 3.6+ added persistent watches for cases where the re-registration pattern is too complex.
Q: How does session timeout negotiation work and how do you choose the right value?
A: The client proposes a timeout at connect time. The server clamps it between minSessionTimeout (2×tickTime) and maxSessionTimeout (20×tickTime). The client sends heartbeats every timeout/3. Choosing the value: too short (4-6s) causes false expirations during GC pauses or network blips. Too long (30-40s) means slow failure detection. Most production deployments use 10-15 seconds. Consider: your JVM's worst-case GC pause, network stability, and how quickly you need to detect failures. The timeout should be at least 2-3× your worst GC pause.
Common Mistakes
Treating DISCONNECTED as EXPIRED
Immediately reinitializing everything when the connection drops. This causes unnecessary ephemeral node recreation, lock re-acquisition attempts, and service disruption.
✅Only reinitialize on SESSION_EXPIRED. During DISCONNECTED, wait for the client library to reconnect. If it reconnects before timeout, everything is preserved — no action needed.
Setting session timeout too short
Using 2-4 second timeouts in production. A single GC pause (common in Java applications) can exceed this, causing session expiry, ephemeral node deletion, and cascading failures.
✅Set timeout to at least 2-3× your worst-case GC pause. For Java applications with default GC, 10-15 seconds is a safe starting point. Monitor session expiry rates and adjust.
Not re-registering watches after they fire
Setting a watch once and assuming it will keep notifying. After the first event, the watch is gone — subsequent changes are silently missed.
✅Always re-register watches in the callback handler. Use the pattern: watch fires → re-read with new watch → process data. Or use persistent watches (3.6+) which don't require re-registration.
Using watches for high-frequency monitoring
Watching a znode that changes hundreds of times per second. Each watch event is a network message — this overwhelms both the server and client.
✅Watches are designed for low-frequency coordination events (config changes, membership changes). For high-frequency data, poll at intervals or use a streaming system like Kafka.
Not handling SESSION_EXPIRED in lock implementations
Assuming that once you acquire a lock, you hold it forever. If your session expires (network partition, long GC), your ephemeral lock node is deleted and another process acquires the lock — but you don't know.
✅Always monitor session state. On SESSION_EXPIRED, assume you've lost all locks and ephemeral nodes. Implement fencing tokens (use the znode's czxid as a fence) to detect stale lock holders.
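The fencing idea can be sketched as pure logic. This assumes you use the lock znode's czxid as a monotonically increasing token; the store class and the token values below are hypothetical:

```java
public class FencedStore {
    // Highest fencing token (e.g. the lock znode's czxid) accepted so far.
    private long highestToken = -1;
    private String value;

    // A writer presents the token it obtained when it acquired the lock.
    // A stale holder (session expired, lock re-acquired by someone whose
    // lock znode has a larger czxid) presents an old token and is rejected.
    synchronized boolean write(long token, String newValue) {
        if (token < highestToken) {
            return false; // stale lock holder: fence it off
        }
        highestToken = token;
        value = newValue;
        return true;
    }

    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        boolean a = store.write(100, "from holder A");  // czxid 100: accepted
        boolean b = store.write(107, "from holder B");  // new lock, czxid 107: accepted
        boolean stale = store.write(100, "A again");    // A lost its session: rejected
        System.out.println(a + " " + b + " " + stale);  // true true false
    }
}
```

The point is that the protected resource, not ZooKeeper, enforces the fence: even a client that never learned its session expired cannot corrupt state, because its token is older than the current holder's.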