
Sessions & Watches

Sessions bind clients to the ensemble with heartbeats and timeouts. Watches provide event-driven notifications when znodes change — the mechanism that makes coordination reactive instead of polling-based.

01

What a Session Is

A ZooKeeper session represents the relationship between a client and the ensemble. When a client connects, ZooKeeper creates a session with a unique 64-bit session ID and a session password (for reconnection authentication). The session is the unit of liveness — ephemeral nodes, watches, and pending requests are all tied to it.

🎫

The Hotel Key Card

A session is like a hotel key card. When you check in (connect), you get a card (session ID) that opens your room (ephemeral nodes). The card has an expiry (session timeout). As long as you swipe it periodically (heartbeats), it stays active. If you don't swipe for too long, the hotel deactivates it (session expired) and cleans your room (deletes ephemeral nodes). You can use the card at any door (any server in the ensemble) — it's not tied to one specific entrance.

session-lifecycle.txt
Session Lifecycle:

1. CONNECTING
   Client initiates TCP connection to a ZK server
   Sends: ConnectRequest(sessionTimeout, sessionId=0, password=empty)
   Server responds: ConnectResponse(sessionId, negotiatedTimeout, password)

2. CONNECTED
   Client can now perform operations
   Client sends periodic heartbeats (PING)
   Server tracks session liveness

3. DISCONNECTED (temporary)
   TCP connection lost (network blip, server restart)
   Client enters CONNECTING state
   Client tries other servers in the ensemble
   Session is still alive on the server side (timeout hasn't expired)

4. RECONNECTED
   Client connects to another server
   Sends: ConnectRequest(sessionTimeout, sessionId=X, password=Y)
   Server validates session is still alive → CONNECTED again
   All ephemeral nodes and watches are preserved!

5. EXPIRED (terminal)
   Session timeout elapsed without heartbeat
   Server deletes all ephemeral nodes owned by this session
   All watches are removed
   Client receives SESSION_EXPIRED event
   Client must create a completely new session

Session state machine:
  CONNECTING → CONNECTED → DISCONNECTED → CONNECTING → CONNECTED
  A reconnect attempt after the session timeout has elapsed → EXPIRED (terminal)

What's Tied to a Session

  • Ephemeral nodes — automatically deleted when session expires
  • Watches — removed when session expires (no more notifications)
  • Pending requests — cancelled if session expires before completion
  • Session password — used to reconnect to a different server securely
  • Session timeout — negotiated at connect time, enforced by the server
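
The lifecycle above can be sketched as a small state machine. This is a toy model, not ZooKeeper's own client classes: the `SessionState` enum and its method names are illustrative, and it assumes the client learns about expiry only while it is trying to reconnect, as the lifecycle describes.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy model of the session states described above (hypothetical names, not ZooKeeper's API). */
enum SessionState {
    CONNECTING, CONNECTED, DISCONNECTED, EXPIRED;

    /** States reachable from this one in a single step. */
    Set<SessionState> next() {
        switch (this) {
            case CONNECTING:
                // A connect attempt either succeeds or reveals that the session has expired.
                return EnumSet.of(CONNECTED, EXPIRED);
            case CONNECTED:
                return EnumSet.of(DISCONNECTED);
            case DISCONNECTED:
                // The client keeps trying other servers in the ensemble.
                return EnumSet.of(CONNECTING);
            default:
                // EXPIRED is terminal: a brand-new session is required.
                return EnumSet.noneOf(SessionState.class);
        }
    }

    boolean canMoveTo(SessionState target) {
        return next().contains(target);
    }
}
```

For example, `SessionState.EXPIRED.canMoveTo(SessionState.CONNECTED)` is false, which mirrors the rule that an expired session can never be resumed.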
02

Session Timeout & Heartbeats

The session timeout determines how long ZooKeeper waits before declaring a client dead. It is negotiated at connection time: the client proposes a timeout, and the server clamps it to its configured bounds (by default 2× to 20× tickTime). Heartbeats (PING messages) keep the session alive.

timeout-negotiation.txt
Timeout Negotiation:

Server config (zoo.cfg):
  tickTime = 2000        # Base time unit in ms
  minSessionTimeout = 4000   # 2 × tickTime (minimum)
  maxSessionTimeout = 40000  # 20 × tickTime (maximum)

Client requests: sessionTimeout = 30000 (30 seconds)
Server responds: negotiatedTimeout = 30000 (within bounds ✅)

Client requests: sessionTimeout = 1000 (1 second)
Server responds: negotiatedTimeout = 4000 (clamped to minimum)

Heartbeat interval:
  Client sends PING every: negotiatedTimeout / 3
  Example: 30s timeout → PING every 10 seconds

  Why timeout/3?
  - Gives 2 missed heartbeats before expiry
  - Accounts for network latency and GC pauses
  - Client: "I'll ping at 10s, 20s" → Server expires at 30s

Server-side tracking:
  - Server checks session liveness every tickTime (2s)
  - If no heartbeat received within sessionTimeout → expire
  - The leader is responsible for session expiry decisions

| Timeout Value | Failure Detection | False Positives | Use Case |
| --- | --- | --- | --- |
| 4-6 seconds | Very fast (4-6s) | High (GC pauses trigger expiry) | Low-latency, stable network |
| 10-15 seconds | Fast (10-15s) | Medium | Most production deployments |
| 30-40 seconds | Slow (30-40s) | Low | Unstable networks, large heaps |

Choosing the Right Timeout

The timeout is a trade-off between failure detection speed and false positives. Too short: GC pauses or network blips cause unnecessary session expiry (ephemeral nodes deleted, locks lost). Too long: actual failures take too long to detect. Start with 10-15 seconds and adjust based on your GC behavior and network stability.
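
The negotiation arithmetic in this section is small enough to sketch directly. The class and method names below are illustrative, and the bounds assume the 2× and 20× tickTime limits used in the example above.

```java
/** Sketch of the timeout arithmetic from this section (hypothetical helper, not a ZooKeeper class). */
final class SessionTimeouts {
    private SessionTimeouts() {}

    /** Clamp the client's proposed timeout into [2 x tickTime, 20 x tickTime]. */
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    /** Heartbeat interval as described in this article: one third of the negotiated timeout. */
    static int pingIntervalMs(int negotiatedMs) {
        return negotiatedMs / 3;
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000;
        System.out.println(negotiate(30000, tickTimeMs)); // 30000 (within bounds)
        System.out.println(negotiate(1000, tickTimeMs));  // 4000 (clamped to the minimum)
        System.out.println(pingIntervalMs(30000));        // 10000
    }
}
```

A request above the upper bound is clamped the same way, so `negotiate(60000, 2000)` yields 40000.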

03

Session Expiry

Session expiry is the most critical event in ZooKeeper client programming. When a session expires, everything associated with it is destroyed — ephemeral nodes deleted, watches removed, pending operations cancelled. Your application must handle this gracefully.

1

Timeout Elapses

The leader hasn't received a heartbeat from the client within the session timeout period.

2

Leader Declares Expiry

The leader generates a session expiry transaction and commits it through Zab (replicated to all servers).

3

Ephemeral Nodes Deleted

All ephemeral nodes created by this session are deleted. This triggers watches on those nodes and their parents.

4

Watches Removed

All watches registered by this session are removed. The client will never receive these notifications.

5

Client Notified

If/when the client reconnects, it receives SESSION_EXPIRED. It cannot recover — must create a new session from scratch.
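
Steps 1 to 3 can be illustrated with a toy, single-process tracker. This is only a sketch under simplifying assumptions: there is no replication or Zab, time is passed in explicitly, and `ToySessionTracker` and its methods are hypothetical names rather than ZooKeeper internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy model of server-side session expiry (not ZooKeeper's real session tracking). */
final class ToySessionTracker {
    private final long timeoutMs;
    private final Map<Long, Long> lastHeartbeat = new HashMap<>();    // sessionId -> last PING time
    private final Map<String, Long> ephemeralOwner = new HashMap<>(); // znode path -> owning sessionId

    ToySessionTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void heartbeat(long sessionId, long nowMs) {
        lastHeartbeat.put(sessionId, nowMs);
    }

    void createEphemeral(String path, long sessionId) {
        ephemeralOwner.put(path, sessionId);
    }

    boolean exists(String path) {
        return ephemeralOwner.containsKey(path);
    }

    /** Expire every session whose last heartbeat is at least timeoutMs old; return the expired ids. */
    List<Long> expireStale(long nowMs) {
        List<Long> expired = new ArrayList<>();
        Iterator<Map.Entry<Long, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        // Delete the ephemeral nodes owned by the sessions that just expired.
        ephemeralOwner.values().removeIf(expired::contains);
        return expired;
    }
}
```

With a 30-second timeout, a session that last sent a heartbeat at t=0 is expired by `expireStale(30000)` and its ephemeral nodes disappear, while a session that pinged at t=20s keeps its nodes.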

session-expiry-handling.txt
// Handling session expiry (conceptual)
zk.addAuthInfo("digest", "user:pass".getBytes());

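// Define a session watcher (pass it to the ZooKeeper constructor,
// or install it as the default watcher with zk.register(sessionWatcher))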
Watcher sessionWatcher = (event) -> {
  if (event.getState() == KeeperState.Expired) {
    // SESSION EXPIRED — everything is gone!
    // - All our ephemeral nodes: DELETED
    // - All our watches: REMOVED  
    // - All pending operations: CANCELLED
    
    // We MUST:
    // 1. Create a completely new ZooKeeper client
    // 2. Re-create all ephemeral nodes (re-register as service, re-acquire locks)
    // 3. Re-set all watches
    // 4. Re-initialize application state
    
    // We CANNOT:
    // - Reconnect with the old session (it's gone forever)
    // - Assume our ephemeral nodes still exist
    // - Assume we still hold any locks
    
    reinitialize();
  } else if (event.getState() == KeeperState.Disconnected) {
    // DISCONNECTED — session might still be alive!
    // Don't panic yet. The client library will try to reconnect.
    // If it reconnects before timeout, everything is preserved.
    log.warn("Disconnected from ZK — attempting reconnect...");
  }
};

Expiry is Server-Side

Session expiry is decided by the server (specifically the leader), not the client. The client might not even know its session has expired until it tries to reconnect. This means: even if your client process is alive but network-partitioned, the server will expire the session and delete ephemeral nodes after the timeout.

04

Session Reconnection

When a client loses its TCP connection to a ZooKeeper server, it doesn't mean the session is dead. The client has until the session timeout to reconnect to any server in the ensemble. If it reconnects in time, the session (and all ephemeral nodes) is preserved.

reconnection-flow.txt
Reconnection Scenarios:

Scenario 1: Quick reconnect (SUCCESS)
  t=0s    Client connected to Server A
  t=1s    Server A crashes
  t=1s    Client enters DISCONNECTED state
  t=2s    Client tries Server B → connection established
  t=2s    Client sends session ID + password to Server B
  t=2s    Server B validates session is still alive → CONNECTED
  Result: ✅ Session preserved, ephemeral nodes intact, watches intact

Scenario 2: Slow reconnect (SUCCESS, barely)
  t=0s    Client connected to Server A
  t=1s    Network partition begins
  t=1s    Client enters DISCONNECTED state
  t=5s    Client tries Server B → fails (also partitioned)
  t=10s   Client tries Server C → fails
  t=25s   Network heals, Client connects to Server B
  t=25s   Session timeout is 30s → session still alive!
  Result: ✅ Session preserved (reconnected before 30s timeout)

Scenario 3: Too slow (EXPIRED)
  t=0s    Client connected to Server A
  t=1s    Network partition begins
  t=30s   Server-side: session timeout reached → SESSION EXPIRED
  t=30s   Ephemeral nodes deleted, watches removed
  t=45s   Network heals, Client connects to Server B
  t=45s   Client sends old session ID → Server says "EXPIRED"
  Result: ❌ Session gone, must create new session

Key insight: The client doesn't know if its session is expired
until it successfully reconnects. During DISCONNECTED state,
it should assume the session MIGHT still be alive and keep trying.

Don't Give Up During DISCONNECTED

A common mistake is treating DISCONNECTED as EXPIRED. During DISCONNECTED, your session might still be alive on the server. Keep trying to reconnect. Only when you receive SESSION_EXPIRED (after reconnecting) should you reinitialize everything.
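
The rule can be written down as a tiny decision function, shown here as a sketch. The `Event` and `Action` enums are illustrative stand-ins (they are not ZooKeeper's `KeeperState`), and `sessionSurvives` assumes the server measures the timeout from the last heartbeat it received.

```java
/** Sketch of the DISCONNECTED-vs-EXPIRED rule (hypothetical names, not ZooKeeper's API). */
final class SessionEventPolicy {
    enum Event { CONNECTED, DISCONNECTED, EXPIRED }
    enum Action { RESUME, WAIT_FOR_RECONNECT, REINITIALIZE }

    private SessionEventPolicy() {}

    /** Map a session event to what the application should do next. */
    static Action onEvent(Event event) {
        switch (event) {
            case CONNECTED:
                return Action.RESUME;              // session, ephemeral nodes, and watches are intact
            case DISCONNECTED:
                return Action.WAIT_FOR_RECONNECT;  // the session may still be alive on the server
            case EXPIRED:
                return Action.REINITIALIZE;        // new session, re-create nodes, re-set watches
            default:
                throw new IllegalArgumentException("unknown event: " + event);
        }
    }

    /** True if a reconnect at reconnectMs still falls inside the session timeout window. */
    static boolean sessionSurvives(long lastHeartbeatMs, long reconnectMs, long timeoutMs) {
        return reconnectMs - lastHeartbeatMs < timeoutMs;
    }
}
```

Plugging in Scenario 2 (last contact at 1s, reconnect at 25s, 30s timeout) gives `true`; Scenario 3 (reconnect at 45s) gives `false`.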

05

Watch Types

Watches are ZooKeeper's event notification mechanism. Instead of polling for changes, clients register a watch and receive a callback when the watched znode changes. There are two types of watches: data watches and child watches.

| Watch Type | Registered By | Triggered By | Use Case |
| --- | --- | --- | --- |
| Data Watch | getData(), exists() | setData(), delete(), create() (for exists) | Config changes, leader data updates |
| Child Watch | getChildren() | Child added or removed | Service discovery, lock queue changes |

watch-types.txt
Watch Registration and Triggering:

DATA WATCHES (registered by getData or exists):
  // Register: "notify me when /config/db data changes"
  byte[] data = zk.getData("/config/db", true, stat);
  
  // Triggered by:
  //   NodeDataChanged  → someone called setData on /config/db
  //   NodeDeleted      → someone deleted /config/db
  //   NodeCreated      → (only for exists watch on non-existent node)

CHILD WATCHES (registered by getChildren):
  // Register: "notify me when children of /services change"
  List<String> children = zk.getChildren("/services", true);
  
  // Triggered by:
  //   NodeChildrenChanged → child added or removed under /services
  //   NodeDeleted         → /services itself was deleted
  
  // NOT triggered by:
  //   ❌ Data changes in children (only structure changes)
  //   ❌ Data changes in /services itself

IMPORTANT: Watches are one-time triggers!
  After firing once, the watch is gone.
  You must re-register it to get the next notification.
  
  // Pattern: read + watch, then wait for the watch before looping
  while (true) {
    data = zk.getData("/config/db", watchCallback, stat);
    // ... use data ...
    // block here until watchCallback fires (e.g. on a latch), then
    // loop to re-read and set a new watch
  }
🔔

The Doorbell

A watch is like a doorbell that only rings once. You install it (register the watch), and when someone arrives (data changes), it rings (callback fires). But then it's disconnected — you have to reinstall it to hear the next visitor. This one-time nature is intentional: it forces you to re-read the current state, preventing you from missing changes that happened between the notification and your re-registration.
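
The doorbell behaviour can be modelled in a few lines of plain Java. This is a toy in-memory registry, not the ZooKeeper client: `OneShotWatches`, `watch`, and `change` are hypothetical names used only to show that a watch is consumed by the first change.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy in-memory model of one-time watches (not the ZooKeeper client). */
final class OneShotWatches {
    private final Map<String, List<Consumer<String>>> watchers = new HashMap<>();

    /** Register a watch on a path; the next change to that path consumes it. */
    void watch(String path, Consumer<String> callback) {
        watchers.computeIfAbsent(path, p -> new ArrayList<>()).add(callback);
    }

    /** Simulate a change: detach the registered watches first, then fire each one once. */
    void change(String path) {
        List<Consumer<String>> fired = watchers.remove(path);
        if (fired == null) {
            return; // nobody is watching, so this change goes unnoticed
        }
        for (Consumer<String> callback : fired) {
            callback.accept(path);
        }
    }
}
```

Because the watches are detached before the callbacks run, a callback can safely call `watch` again to re-register itself, which is the re-registration pattern this section describes.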

06

Watch Guarantees

ZooKeeper provides strong guarantees about watch delivery that make them safe for coordination. Understanding these guarantees is essential for building correct distributed algorithms.

Watch Guarantees

  • Ordered — watch events are delivered in the same order as the changes that triggered them
  • Once-triggered — a watch fires at most once; re-registration is required for subsequent events
  • Delivered before new data — a client sees the watch event before seeing the new data from a subsequent read
  • Tied to session — watches are removed when the session expires (no stale notifications)
  • Server-local — the server that the client is connected to delivers the watch (no cross-server coordination needed)
watch-guarantees.txt
Watch Ordering Guarantee:

Client A sets watch on /config
Client B updates /config to "v2"
Client B updates /config to "v3"

Client A receives:
  1. Watch event: NodeDataChanged on /config
  2. (Client A re-reads /config → gets "v3", NOT "v2")

Key insight: Client A might "miss" seeing "v2", but that's OK!
The watch told it "something changed." The re-read gets the LATEST value.
For coordination, you care about the current state, not the history.

No-Miss Guarantee (with proper re-registration):
  If you always re-register your watch immediately after it fires
  (in the same callback), you will never miss a change. The pattern:

  void watchCallback(WatchedEvent event) {
    // Watch fired — re-read with new watch immediately
    byte[] newData = zk.getData("/config", this, stat);
    // Process newData...
  }

  Between the watch firing and re-registration, changes are
  captured by the re-read (you get the latest state).

Watches Are Not Message Queues

Watches don't deliver every intermediate value. If /config changes from "v1" to "v2" to "v3" before your watch fires, you get one notification and read "v3". You never see "v2". This is fine for coordination (you want current state) but wrong for event sourcing (use Kafka for that).
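
The skipped "v2" can be reproduced with a toy model of one znode and one one-shot watch whose notification is processed only after both writes have happened. The class and method names are illustrative and nothing here touches a real ensemble.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: one znode, one one-shot watch, notifications processed later (hypothetical names). */
final class CoalescingDemo {
    private String value = "v1";
    private boolean watchSet = false;
    private int pendingEvents = 0;                       // notifications queued for the client
    private final List<String> observed = new ArrayList<>();

    /** Client read that also (re)registers the one-shot watch. */
    String readAndWatch() {
        watchSet = true;
        return value;
    }

    /** A writer updates the znode; only an armed watch produces a notification. */
    void set(String newValue) {
        value = newValue;
        if (watchSet) {
            watchSet = false;
            pendingEvents++;
        }
    }

    /** Client handles its queued notifications; each one triggers a re-read of the current value. */
    List<String> drain() {
        while (pendingEvents > 0) {
            pendingEvents--;
            observed.add(readAndWatch());
        }
        return observed;
    }
}
```

After `readAndWatch()`, `set("v2")`, and `set("v3")`, a call to `drain()` returns a list containing only "v3": one notification, one re-read, and no trace of the intermediate value.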

07

The One-Time Watch Problem

The one-time nature of watches is both a feature and a challenge. It prevents resource leaks (forgotten watches accumulating) but requires careful re-registration patterns. ZooKeeper 3.6+ introduced persistent watches to address common pain points.

| Aspect | One-Time Watch (Classic) | Persistent Watch (3.6+) |
| --- | --- | --- |
| Fires | Once, then removed | Repeatedly until removed |
| Re-registration | Required after every event | Not needed |
| Miss window | Between fire and re-register | None (always active) |
| Resource cleanup | Automatic (fires once) | Must explicitly remove |
| API | getData(path, true) | addWatch(path, mode) |
| Modes | N/A | PERSISTENT, PERSISTENT_RECURSIVE |

persistent-watches.txt
// Classic one-time watch pattern (pre-3.6)
// Problem: gap between watch fire and re-registration

void watchConfig() {
  byte[] data = zk.getData("/config", event -> {
    // Watch fired! But between now and re-registration,
    // another change could happen that we'd miss...
    // (In practice, the re-read catches it, but the pattern is complex)
    watchConfig(); // re-register
  }, stat);
  processConfig(data);
}

// Persistent watch (3.6+) — no re-registration needed
zk.addWatch("/config", event -> {
  // This fires for EVERY change, no re-registration needed
  byte[] newData = zk.getData("/config", null, stat);
  processConfig(newData);
}, AddWatchMode.PERSISTENT);

// Persistent recursive watch — watches entire subtree
zk.addWatch("/services", event -> {
  // Fires for any change under /services (any depth)
  // NodeCreated, NodeDeleted, NodeDataChanged for any descendant
  refreshServiceRegistry();
}, AddWatchMode.PERSISTENT_RECURSIVE);

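// Remove when done. `watcher` must be the same Watcher instance that was
// passed to addWatch, so keep a reference instead of using an inline lambda.
zk.removeWatches("/config", watcher, WatcherType.Any, false);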

When to Use Persistent Watches

Use persistent watches when you need continuous monitoring without the complexity of re-registration loops. They're ideal for service discovery (watch /services subtree) and configuration management (watch /config). Use classic one-time watches for one-shot coordination (waiting for a specific node to appear or disappear).
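
The difference between the three modes can be sketched with a toy watch manager. It is deliberately simplified and makes assumptions: matching is done on path strings only, a non-recursive watch matches only its exact path, and `ToyWatchManager` with its `Mode` enum are hypothetical names rather than ZooKeeper's `AddWatchMode`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of one-time, persistent, and persistent-recursive watches (not ZooKeeper's API). */
final class ToyWatchManager {
    enum Mode { ONE_TIME, PERSISTENT, PERSISTENT_RECURSIVE }

    private static final class Registration {
        final String path;
        final Mode mode;
        final Consumer<String> callback;

        Registration(String path, Mode mode, Consumer<String> callback) {
            this.path = path;
            this.mode = mode;
            this.callback = callback;
        }
    }

    private final List<Registration> registrations = new ArrayList<>();

    void addWatch(String path, Mode mode, Consumer<String> callback) {
        registrations.add(new Registration(path, mode, callback));
    }

    /** Simulate a change at changedPath and notify every matching registration. */
    void change(String changedPath) {
        List<Registration> toFire = new ArrayList<>();
        Iterator<Registration> it = registrations.iterator();
        while (it.hasNext()) {
            Registration r = it.next();
            boolean exact = r.path.equals(changedPath);
            String prefix = r.path.equals("/") ? "/" : r.path + "/";
            boolean descendant = changedPath.startsWith(prefix);
            boolean matches = exact || (r.mode == Mode.PERSISTENT_RECURSIVE && descendant);
            if (!matches) {
                continue;
            }
            toFire.add(r);
            if (r.mode == Mode.ONE_TIME) {
                it.remove(); // classic watches are consumed by the first event
            }
        }
        for (Registration r : toFire) {
            r.callback.accept(changedPath);
        }
    }
}
```

Registering a one-time and a persistent watch on `/config` and changing it twice delivers one event to the former and two to the latter, while a recursive watch on `/services` also sees a change at `/services/api/host1`.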

Watch Best Practices

  • Always re-read after a watch fires — the watch tells you WHAT changed, the read tells you the CURRENT state
  • Re-register watches in the callback — minimizes the window for missed events
  • Use persistent watches (3.6+) for long-lived monitoring to simplify code
  • Don't use watches for high-frequency changes — each watch event is a network message
  • Handle SESSION_EXPIRED by re-establishing all watches from scratch
08

Interview Questions

Q: What happens when a ZooKeeper session expires? Walk through the consequences.

A: When a session expires: (1) The leader generates a session expiry transaction (replicated via Zab). (2) All ephemeral nodes created by that session are deleted — this triggers watches on those nodes and their parents. (3) All watches registered by that session are removed. (4) Any pending operations are cancelled. (5) If/when the client reconnects, it receives SESSION_EXPIRED and must create a completely new session, re-create ephemeral nodes, and re-register watches. The client cannot recover the old session — it's gone permanently.

Q: How do watches work and what guarantees do they provide?

A: Watches are one-time event notifications registered during read operations (getData, getChildren, exists). Guarantees: (1) Ordered — events delivered in the order changes occurred. (2) Delivered before new data — client sees the watch event before any subsequent read returns new data. (3) Once-triggered — fires at most once, must re-register. (4) Session-bound — removed on session expiry. Two types: data watches (triggered by setData/delete) and child watches (triggered by child add/remove). They're NOT message queues — intermediate values may be skipped.

Q: What's the difference between DISCONNECTED and EXPIRED states?

A: DISCONNECTED means the TCP connection was lost but the session might still be alive on the server. The client should keep trying to reconnect — if it succeeds before the session timeout, everything (ephemeral nodes, watches) is preserved. EXPIRED means the server has declared the session dead (timeout elapsed without heartbeat). All ephemeral nodes are deleted, watches removed. The client must create a completely new session. Key rule: don't treat DISCONNECTED as EXPIRED — keep trying to reconnect.

Q: Why are watches one-time triggers? What problem does this solve?

A: One-time triggers solve two problems: (1) Resource management — if watches were permanent, forgotten watches would accumulate indefinitely, consuming server memory and generating unwanted traffic. (2) Correctness — the one-time nature forces clients to re-read the current state after a notification, ensuring they always act on the latest data rather than a potentially stale notification. The re-read pattern (watch fires → re-read with new watch) guarantees no changes are missed. ZooKeeper 3.6+ added persistent watches for cases where the re-registration pattern is too complex.

Q: How does session timeout negotiation work and how do you choose the right value?

A: The client proposes a timeout at connect time. The server clamps it between minSessionTimeout (2×tickTime) and maxSessionTimeout (20×tickTime). The client sends heartbeats every timeout/3. Choosing the value: too short (4-6s) causes false expirations during GC pauses or network blips. Too long (30-40s) means slow failure detection. Most production deployments use 10-15 seconds. Consider: your JVM's worst-case GC pause, network stability, and how quickly you need to detect failures. The timeout should be at least 2-3× your worst GC pause.

09

Common Mistakes

💀

Treating DISCONNECTED as EXPIRED

Immediately reinitializing everything when the connection drops. This causes unnecessary ephemeral node recreation, lock re-acquisition attempts, and service disruption.

Only reinitialize on SESSION_EXPIRED. During DISCONNECTED, wait for the client library to reconnect. If it reconnects before timeout, everything is preserved — no action needed.

Setting session timeout too short

Using 2-4 second timeouts in production. A single GC pause (common in Java applications) can exceed this, causing session expiry, ephemeral node deletion, and cascading failures.

Set timeout to at least 2-3× your worst-case GC pause. For Java applications with default GC, 10-15 seconds is a safe starting point. Monitor session expiry rates and adjust.

🔄

Not re-registering watches after they fire

Setting a watch once and assuming it will keep notifying. After the first event, the watch is gone — subsequent changes are silently missed.

Always re-register watches in the callback handler. Use the pattern: watch fires → re-read with new watch → process data. Or use persistent watches (3.6+) which don't require re-registration.

📡

Using watches for high-frequency monitoring

Watching a znode that changes hundreds of times per second. Each watch event is a network message — this overwhelms both the server and client.

Watches are designed for low-frequency coordination events (config changes, membership changes). For high-frequency data, poll at intervals or use a streaming system like Kafka.

🔐

Not handling SESSION_EXPIRED in lock implementations

Assuming that once you acquire a lock, you hold it forever. If your session expires (network partition, long GC), your ephemeral lock node is deleted and another process acquires the lock — but you don't know.

Always monitor session state. On SESSION_EXPIRED, assume you've lost all locks and ephemeral nodes. Implement fencing tokens (use the znode's czxid as a fence) to detect stale lock holders.
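
A fencing check on the protected resource might look like the sketch below. It assumes the lock holder sends its lock znode's czxid with every write and that the resource remembers the highest token it has seen; `FencedResource` is a hypothetical class, not part of ZooKeeper.

```java
/** Toy protected resource that rejects writes carrying a stale fencing token (illustrative only). */
final class FencedResource {
    private long highestToken = -1;
    private String value;

    /** Accept the write only if the token is at least as new as any token seen so far. */
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestToken) {
            // Stale holder: its session expired and a newer lock node has since been created.
            return false;
        }
        highestToken = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String read() {
        return value;
    }
}
```

Because zxids only increase, a lock node created after the old holder's session expired carries a larger czxid, so a late write from the old holder is rejected even though that process still believes it holds the lock.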