Sessions & Watches
Sessions bind clients to the ensemble with heartbeats and timeouts. Watches provide event-driven notifications when znodes change — the mechanism that makes coordination reactive instead of polling-based.
What a Session Is
A ZooKeeper session represents the relationship between a client and the ensemble. When a client connects, ZooKeeper creates a session with a unique 64-bit session ID and a session password (for reconnection authentication). The session is the unit of liveness — ephemeral nodes, watches, and pending requests are all tied to it.
The Hotel Key Card
A session is like a hotel key card. When you check in (connect), you get a card (session ID) that opens your room (ephemeral nodes). The card has an expiry (session timeout). As long as you swipe it periodically (heartbeats), it stays active. If you don't swipe for too long, the hotel deactivates it (session expired) and cleans your room (deletes ephemeral nodes). You can use the card at any door (any server in the ensemble) — it's not tied to one specific entrance.
Session Lifecycle:

1. CONNECTING
   - Client initiates a TCP connection to a ZK server
   - Sends: ConnectRequest(sessionTimeout, sessionId=0, password=empty)
   - Server responds: ConnectResponse(sessionId, negotiatedTimeout, password)
2. CONNECTED
   - Client can now perform operations
   - Client sends periodic heartbeats (PING)
   - Server tracks session liveness
3. DISCONNECTED (temporary)
   - TCP connection lost (network blip, server restart)
   - Client enters CONNECTING state and tries other servers in the ensemble
   - Session is still alive on the server side (timeout hasn't expired)
4. RECONNECTED
   - Client connects to another server
   - Sends: ConnectRequest(sessionTimeout, sessionId=X, password=Y)
   - Server validates the session is still alive → CONNECTED again
   - All ephemeral nodes and watches are preserved!
5. EXPIRED (terminal)
   - Session timeout elapsed without a heartbeat
   - Server deletes all ephemeral nodes owned by this session
   - All watches are removed
   - Client receives the SESSION_EXPIRED event
   - Client must create a completely new session

Session state machine:

CONNECTING → CONNECTED → DISCONNECTED → CONNECTING → CONNECTED → EXPIRED (terminal)
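As a rough sketch, the lifecycle can be modeled as a tiny state machine. The enum and transition table below are illustrative only, not the real client library's internals; note that EXPIRED is only reachable via CONNECTING, because the client learns about expiry when it tries to reconnect.

```java
import java.util.*;

public class SessionStateMachine {
    enum State { CONNECTING, CONNECTED, DISCONNECTED, EXPIRED }

    // Allowed transitions; EXPIRED is terminal (no outgoing edges).
    static final Map<State, Set<State>> TRANSITIONS = Map.of(
        State.CONNECTING,   EnumSet.of(State.CONNECTED, State.EXPIRED),
        State.CONNECTED,    EnumSet.of(State.DISCONNECTED),
        State.DISCONNECTED, EnumSet.of(State.CONNECTING),
        State.EXPIRED,      EnumSet.noneOf(State.class)
    );

    State state = State.CONNECTING;

    void moveTo(State next) {
        if (!TRANSITIONS.get(state).contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " not allowed");
        }
        state = next;
    }

    public static void main(String[] args) {
        SessionStateMachine s = new SessionStateMachine();
        // Happy path with one reconnect, then expiry discovered on reconnect.
        s.moveTo(State.CONNECTED);
        s.moveTo(State.DISCONNECTED);
        s.moveTo(State.CONNECTING);
        s.moveTo(State.EXPIRED);
        System.out.println("final=" + s.state); // final=EXPIRED
    }
}
```

Modeling EXPIRED as terminal captures the key rule of this section: the only way out is to throw the object away and build a new one, exactly as a client must create a new session.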
What's Tied to a Session
- ✅Ephemeral nodes — automatically deleted when session expires
- ✅Watches — removed when session expires (no more notifications)
- ✅Pending requests — cancelled if session expires before completion
- ✅Session password — used to reconnect to a different server securely
- ✅Session timeout — negotiated at connect time, enforced by the server
Session Timeout & Heartbeats
The session timeout determines how long ZooKeeper waits before declaring a client dead. It's negotiated at connection time — the client proposes a timeout, and the server may adjust it within bounds (2× to 20× tickTime). Heartbeats (PING messages) keep the session alive.
Timeout Negotiation:

Server config (zoo.cfg):

```
tickTime = 2000            # Base time unit in ms
minSessionTimeout = 4000   # 2 × tickTime (minimum)
maxSessionTimeout = 40000  # 20 × tickTime (maximum)
```

- Client requests sessionTimeout = 30000 (30 seconds) → server responds negotiatedTimeout = 30000 (within bounds ✅)
- Client requests sessionTimeout = 1000 (1 second) → server responds negotiatedTimeout = 4000 (clamped to the minimum)

Heartbeat interval:

- The client sends a PING every negotiatedTimeout / 3
- Example: 30s timeout → PING every 10 seconds

Why timeout/3?

- Gives two missed heartbeats before expiry
- Accounts for network latency and GC pauses
- Client: "I'll ping at 10s, 20s" → server expires at 30s

Server-side tracking:

- The server checks session liveness every tickTime (2s)
- If no heartbeat is received within sessionTimeout → expire
- The leader is responsible for session expiry decisions
| Timeout Value | Failure Detection | False Positives | Use Case |
|---|---|---|---|
| 4-6 seconds | Very fast (4-6s) | High (GC pauses trigger expiry) | Low-latency, stable network |
| 10-15 seconds | Fast (10-15s) | Medium | Most production deployments |
| 30-40 seconds | Slow (30-40s) | Low | Unstable networks, large heaps |
Choosing the Right Timeout
The timeout is a trade-off between failure detection speed and false positives. Too short: GC pauses or network blips cause unnecessary session expiry (ephemeral nodes deleted, locks lost). Too long: actual failures take too long to detect. Start with 10-15 seconds and adjust based on your GC behavior and network stability.
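The clamping and heartbeat arithmetic described above is simple enough to sketch directly. This is a toy calculation, not the client library's actual code, and the method names are made up:

```java
public class SessionTimeoutMath {
    // Server clamps the client's requested timeout to [2*tickTime, 20*tickTime].
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    // Client pings at roughly a third of the negotiated timeout,
    // leaving room for two missed heartbeats before expiry.
    static int heartbeatInterval(int negotiatedMs) {
        return negotiatedMs / 3;
    }

    public static void main(String[] args) {
        int tick = 2000;
        System.out.println(negotiate(30000, tick));   // within bounds -> 30000
        System.out.println(negotiate(1000, tick));    // clamped up -> 4000
        System.out.println(negotiate(90000, tick));   // clamped down -> 40000
        System.out.println(heartbeatInterval(30000)); // -> 10000
    }
}
```

Running the numbers like this makes the trade-off visible: a 30s timeout means a partitioned client's locks survive for up to 30 seconds, while a 4s timeout can be eaten entirely by one bad GC pause.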
Session Expiry
Session expiry is the most critical event in ZooKeeper client programming. When a session expires, everything associated with it is destroyed — ephemeral nodes deleted, watches removed, pending operations cancelled. Your application must handle this gracefully.
Timeout Elapses
The leader hasn't received a heartbeat from the client within the session timeout period.
Leader Declares Expiry
The leader generates a session expiry transaction and commits it through Zab (replicated to all servers).
Ephemeral Nodes Deleted
All ephemeral nodes created by this session are deleted. This triggers watches on those nodes and their parents.
Watches Removed
All watches registered by this session are removed. The client will never receive these notifications.
Client Notified
If/when the client reconnects, it receives SESSION_EXPIRED. It cannot recover — must create a new session from scratch.
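The ephemeral-deletion step above also fires watches on the parents of the deleted nodes. A toy in-memory simulation makes that cascade concrete; it is illustrative only, not ZooKeeper's real data structures:

```java
import java.util.*;

public class ExpiryCascadeModel {
    // path -> owning session id (toy model: every node here is ephemeral)
    final Map<String, Long> nodes = new LinkedHashMap<>();
    // parents with a registered (one-time) child watch
    final Set<String> childWatches = new HashSet<>();
    final List<String> firedEvents = new ArrayList<>();

    void expireSession(long sessionId) {
        Iterator<Map.Entry<String, Long>> it = nodes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() == sessionId) {
                String path = e.getKey();
                it.remove();
                String parent = path.substring(0, path.lastIndexOf('/'));
                // Deleting an ephemeral fires the child watch on its parent
                // (and removes it: watches are one-time triggers).
                if (childWatches.remove(parent)) {
                    firedEvents.add("NodeChildrenChanged " + parent);
                }
            }
        }
    }

    public static void main(String[] args) {
        ExpiryCascadeModel zk = new ExpiryCascadeModel();
        zk.nodes.put("/services/web-1", 42L);  // ephemeral, owned by session 42
        zk.nodes.put("/services/web-2", 99L);  // ephemeral, another session
        zk.childWatches.add("/services");      // someone watches the children

        zk.expireSession(42L);
        System.out.println(zk.nodes.keySet()); // [/services/web-2]
        System.out.println(zk.firedEvents);    // [NodeChildrenChanged /services]
    }
}
```

This is why other clients find out about a peer's death without polling: the expiry transaction deletes the peer's ephemeral node, which fires the child watch they registered on the parent.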
```java
// Handling session expiry (conceptual)
zk.addAuthInfo("digest", "user:pass".getBytes());

// Register a session watcher
Watcher sessionWatcher = (event) -> {
    if (event.getState() == KeeperState.Expired) {
        // SESSION EXPIRED — everything is gone!
        // - All our ephemeral nodes: DELETED
        // - All our watches: REMOVED
        // - All pending operations: CANCELLED
        //
        // We MUST:
        //   1. Create a completely new ZooKeeper client
        //   2. Re-create all ephemeral nodes (re-register as service, re-acquire locks)
        //   3. Re-set all watches
        //   4. Re-initialize application state
        //
        // We CANNOT:
        //   - Reconnect with the old session (it's gone forever)
        //   - Assume our ephemeral nodes still exist
        //   - Assume we still hold any locks
        reinitialize();
    } else if (event.getState() == KeeperState.Disconnected) {
        // DISCONNECTED — session might still be alive!
        // Don't panic yet. The client library will try to reconnect.
        // If it reconnects before timeout, everything is preserved.
        log.warn("Disconnected from ZK — attempting reconnect...");
    }
};
```
Expiry is Server-Side
Session expiry is decided by the server (specifically the leader), not the client. The client might not even know its session has expired until it tries to reconnect. This means: even if your client process is alive but network-partitioned, the server will expire the session and delete ephemeral nodes after the timeout.
Session Reconnection
When a client loses its TCP connection to a ZooKeeper server, it doesn't mean the session is dead. The client has until the session timeout to reconnect to any server in the ensemble. If it reconnects in time, the session (and all ephemeral nodes) is preserved.
Reconnection Scenarios:

Scenario 1: Quick reconnect (SUCCESS)

```
t=0s   Client connected to Server A
t=1s   Server A crashes
t=1s   Client enters DISCONNECTED state
t=2s   Client tries Server B → connection established
t=2s   Client sends session ID + password to Server B
t=2s   Server B validates session is still alive → CONNECTED
```

Result: ✅ Session preserved, ephemeral nodes intact, watches intact

Scenario 2: Slow reconnect (SUCCESS, barely)

```
t=0s   Client connected to Server A
t=1s   Network partition begins
t=1s   Client enters DISCONNECTED state
t=5s   Client tries Server B → fails (also partitioned)
t=10s  Client tries Server C → fails
t=25s  Network heals, client connects to Server B
t=25s  Session timeout is 30s → session still alive!
```

Result: ✅ Session preserved (reconnected before the 30s timeout)

Scenario 3: Too slow (EXPIRED)

```
t=0s   Client connected to Server A
t=1s   Network partition begins
t=30s  Server side: session timeout reached → SESSION EXPIRED
t=30s  Ephemeral nodes deleted, watches removed
t=45s  Network heals, client connects to Server B
t=45s  Client sends old session ID → server says "EXPIRED"
```

Result: ❌ Session gone, must create a new session

Key insight: the client doesn't know whether its session has expired until it successfully reconnects. During the DISCONNECTED state, it should assume the session MIGHT still be alive and keep trying.
Don't Give Up During DISCONNECTED
A common mistake is treating DISCONNECTED as EXPIRED. During DISCONNECTED, your session might still be alive on the server. Keep trying to reconnect. Only when you receive SESSION_EXPIRED (after reconnecting) should you reinitialize everything.
Watch Types
Watches are ZooKeeper's event notification mechanism. Instead of polling for changes, clients register a watch and receive a callback when the watched znode changes. There are two types of watches: data watches and child watches.
| Watch Type | Registered By | Triggered By | Use Case |
|---|---|---|---|
| Data Watch | getData(), exists() | setData(), delete(), create() (for exists) | Config changes, leader data updates |
| Child Watch | getChildren() | Child added or removed | Service discovery, lock queue changes |
Watch Registration and Triggering:

Data watches (registered by getData or exists):

```java
// Register: "notify me when /config/db data changes"
byte[] data = zk.getData("/config/db", true, stat);

// Triggered by:
//   NodeDataChanged → someone called setData on /config/db
//   NodeDeleted     → someone deleted /config/db
//   NodeCreated     → (only for an exists watch on a non-existent node)
```

Child watches (registered by getChildren):

```java
// Register: "notify me when children of /services change"
List<String> children = zk.getChildren("/services", true);

// Triggered by:
//   NodeChildrenChanged → child added or removed under /services
//   NodeDeleted         → /services itself was deleted
// NOT triggered by:
//   ❌ Data changes in children (only structure changes)
//   ❌ Data changes in /services itself
```

IMPORTANT: watches are one-time triggers! After firing once, the watch is gone. You must re-register it to get the next notification.

```java
// Pattern: read + watch in a loop
while (true) {
    data = zk.getData("/config/db", watchCallback, stat);
    // ... use data ...
    // watchCallback fires → loop re-reads with a new watch
}
```
The Doorbell
A watch is like a doorbell that only rings once. You install it (register the watch), and when someone arrives (data changes), it rings (callback fires). But then it's disconnected — you have to reinstall it to hear the next visitor. This one-time nature is intentional: it forces you to re-read the current state, preventing you from missing changes that happened between the notification and your re-registration.
Watch Guarantees
ZooKeeper provides strong guarantees about watch delivery that make them safe for coordination. Understanding these guarantees is essential for building correct distributed algorithms.
- ✅Ordered — watch events are delivered in the same order as the changes that triggered them
- ✅Once-triggered — a watch fires at most once; re-registration is required for subsequent events
- ✅Delivered before new data — a client sees the watch event before seeing the new data from a subsequent read
- ✅Tied to session — watches are removed when the session expires (no stale notifications)
- ✅Server-local — the server that the client is connected to delivers the watch (no cross-server coordination needed)
Watch Ordering Guarantee:

```
Client A sets watch on /config
Client B updates /config to "v2"
Client B updates /config to "v3"

Client A receives:
  1. Watch event: NodeDataChanged on /config
  2. (Client A re-reads /config → gets "v3", NOT "v2")
```

Key insight: Client A might "miss" seeing "v2" — but that's OK! The watch told it "something changed." The re-read gets the LATEST value. For coordination, you care about the current state, not the history.

No-Miss Guarantee (with proper re-registration): if you always re-register your watch immediately after it fires (in the same callback), you will never miss a change. The pattern:

```java
void watchCallback(WatchedEvent event) {
    // Watch fired — re-read with a new watch immediately
    byte[] newData = zk.getData("/config", this, stat);
    // Process newData...
}
```

Between the watch firing and re-registration, changes are captured by the re-read (you get the latest state).
Watches Are Not Message Queues
Watches don't deliver every intermediate value. If /config changes from "v1" to "v2" to "v3" before your watch fires, you get one notification and read "v3". You never see "v2". This is fine for coordination (you want current state) but wrong for event sourcing (use Kafka for that).
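A toy simulation of the one-shot semantics (not the real server code) shows exactly why intermediate values are skipped: the watch fires on the first change and is gone, and the re-read only ever sees the latest state.

```java
import java.util.*;

public class OneShotWatchModel {
    String data;
    private final List<Runnable> watchers = new ArrayList<>();

    void setData(String newData) {
        data = newData;
        // Fire and clear: watches are one-time triggers.
        List<Runnable> toFire = new ArrayList<>(watchers);
        watchers.clear();
        toFire.forEach(Runnable::run);
    }

    // Read the current value and register a (one-time) watch, like getData.
    String getDataAndWatch(Runnable watcher) {
        watchers.add(watcher);
        return data;
    }

    public static void main(String[] args) {
        OneShotWatchModel node = new OneShotWatchModel();
        node.data = "v1";
        int[] fires = {0};

        node.getDataAndWatch(() -> fires[0]++);
        node.setData("v2");  // watch fires once...
        node.setData("v3");  // ...but is already gone: no second event
        System.out.println("fires=" + fires[0]); // fires=1

        // The re-read sees the LATEST value; "v2" is never observed.
        String latest = node.getDataAndWatch(() -> fires[0]++);
        System.out.println("latest=" + latest);  // latest=v3
    }
}
```

One notification for two changes, and the re-read returns "v3": fine for convergence on current state, useless as an event log.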
The One-Time Watch Problem
The one-time nature of watches is both a feature and a challenge. It prevents resource leaks (forgotten watches accumulating) but requires careful re-registration patterns. ZooKeeper 3.6+ introduced persistent watches to address common pain points.
| Aspect | One-Time Watch (Classic) | Persistent Watch (3.6+) |
|---|---|---|
| Fires | Once, then removed | Repeatedly until removed |
| Re-registration | Required after every event | Not needed |
| Miss window | Between fire and re-register | None (always active) |
| Resource cleanup | Automatic (fires once) | Must explicitly remove |
| API | getData(path, true) | addWatch(path, mode) |
| Modes | N/A | PERSISTENT, PERSISTENT_RECURSIVE |
```java
// Classic one-time watch pattern (pre-3.6)
// Problem: gap between watch fire and re-registration
void watchConfig() {
    byte[] data = zk.getData("/config", event -> {
        // Watch fired! But between now and re-registration,
        // another change could happen that we'd miss...
        // (In practice, the re-read catches it, but the pattern is complex)
        watchConfig(); // re-register
    }, stat);
    processConfig(data);
}

// Persistent watch (3.6+) — no re-registration needed
zk.addWatch("/config", event -> {
    // This fires for EVERY change, no re-registration needed
    byte[] newData = zk.getData("/config", null, stat);
    processConfig(newData);
}, AddWatchMode.PERSISTENT);

// Persistent recursive watch — watches the entire subtree
zk.addWatch("/services", event -> {
    // Fires for any change under /services (any depth):
    // NodeCreated, NodeDeleted, NodeDataChanged for any descendant
    refreshServiceRegistry();
}, AddWatchMode.PERSISTENT_RECURSIVE);

// Remove when done
zk.removeWatches("/config", watcher, WatcherType.Any, false);
```
When to Use Persistent Watches
Use persistent watches when you need continuous monitoring without the complexity of re-registration loops. They're ideal for service discovery (watch /services subtree) and configuration management (watch /config). Use classic one-time watches for one-shot coordination (waiting for a specific node to appear or disappear).
Watch Best Practices
- ✅Always re-read after a watch fires — the watch tells you WHAT changed, the read tells you the CURRENT state
- ✅Re-register watches in the callback — minimizes the window for missed events
- ✅Use persistent watches (3.6+) for long-lived monitoring to simplify code
- ✅Don't use watches for high-frequency changes — each watch event is a network message
- ✅Handle SESSION_EXPIRED by re-establishing all watches from scratch
Interview Questions
Q: What happens when a ZooKeeper session expires? Walk through the consequences.
A: When a session expires: (1) The leader generates a session expiry transaction (replicated via Zab). (2) All ephemeral nodes created by that session are deleted — this triggers watches on those nodes and their parents. (3) All watches registered by that session are removed. (4) Any pending operations are cancelled. (5) If/when the client reconnects, it receives SESSION_EXPIRED and must create a completely new session, re-create ephemeral nodes, and re-register watches. The client cannot recover the old session — it's gone permanently.
Q: How do watches work and what guarantees do they provide?
A: Watches are one-time event notifications registered during read operations (getData, getChildren, exists). Guarantees: (1) Ordered — events delivered in the order changes occurred. (2) Delivered before new data — client sees the watch event before any subsequent read returns new data. (3) Once-triggered — fires at most once, must re-register. (4) Session-bound — removed on session expiry. Two types: data watches (triggered by setData/delete) and child watches (triggered by child add/remove). They're NOT message queues — intermediate values may be skipped.
Q: What's the difference between DISCONNECTED and EXPIRED states?
A: DISCONNECTED means the TCP connection was lost but the session might still be alive on the server. The client should keep trying to reconnect — if it succeeds before the session timeout, everything (ephemeral nodes, watches) is preserved. EXPIRED means the server has declared the session dead (timeout elapsed without heartbeat). All ephemeral nodes are deleted, watches removed. The client must create a completely new session. Key rule: don't treat DISCONNECTED as EXPIRED — keep trying to reconnect.
Q: Why are watches one-time triggers? What problem does this solve?
A: One-time triggers solve two problems: (1) Resource management — if watches were permanent, forgotten watches would accumulate indefinitely, consuming server memory and generating unwanted traffic. (2) Correctness — the one-time nature forces clients to re-read the current state after a notification, ensuring they always act on the latest data rather than a potentially stale notification. The re-read pattern (watch fires → re-read with new watch) guarantees no changes are missed. ZooKeeper 3.6+ added persistent watches for cases where the re-registration pattern is too complex.
Q: How does session timeout negotiation work and how do you choose the right value?
A: The client proposes a timeout at connect time. The server clamps it between minSessionTimeout (2×tickTime) and maxSessionTimeout (20×tickTime). The client sends heartbeats every timeout/3. Choosing the value: too short (4-6s) causes false expirations during GC pauses or network blips. Too long (30-40s) means slow failure detection. Most production deployments use 10-15 seconds. Consider: your JVM's worst-case GC pause, network stability, and how quickly you need to detect failures. The timeout should be at least 2-3× your worst GC pause.
Common Mistakes
Treating DISCONNECTED as EXPIRED
Immediately reinitializing everything when the connection drops. This causes unnecessary ephemeral node recreation, lock re-acquisition attempts, and service disruption.
✅Only reinitialize on SESSION_EXPIRED. During DISCONNECTED, wait for the client library to reconnect. If it reconnects before timeout, everything is preserved — no action needed.
Setting session timeout too short
Using 2-4 second timeouts in production. A single GC pause (common in Java applications) can exceed this, causing session expiry, ephemeral node deletion, and cascading failures.
✅Set timeout to at least 2-3× your worst-case GC pause. For Java applications with default GC, 10-15 seconds is a safe starting point. Monitor session expiry rates and adjust.
Not re-registering watches after they fire
Setting a watch once and assuming it will keep notifying. After the first event, the watch is gone — subsequent changes are silently missed.
✅Always re-register watches in the callback handler. Use the pattern: watch fires → re-read with new watch → process data. Or use persistent watches (3.6+) which don't require re-registration.
Using watches for high-frequency monitoring
Watching a znode that changes hundreds of times per second. Each watch event is a network message — this overwhelms both the server and client.
✅Watches are designed for low-frequency coordination events (config changes, membership changes). For high-frequency data, poll at intervals or use a streaming system like Kafka.
Not handling SESSION_EXPIRED in lock implementations
Assuming that once you acquire a lock, you hold it forever. If your session expires (network partition, long GC), your ephemeral lock node is deleted and another process acquires the lock — but you don't know.
✅Always monitor session state. On SESSION_EXPIRED, assume you've lost all locks and ephemeral nodes. Implement fencing tokens (use the znode's czxid as a fence) to detect stale lock holders.
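The fencing idea can be sketched as pure logic. This assumes you use the lock znode's czxid as a monotonically increasing token; the store class and the token values below are hypothetical:

```java
public class FencedStore {
    // Highest fencing token (e.g. the lock znode's czxid) accepted so far.
    private long highestToken = -1;
    private String value;

    // A writer presents the token it obtained when it acquired the lock.
    // A stale holder (session expired, lock re-acquired by someone whose
    // lock znode has a larger czxid) presents an old token and is rejected.
    synchronized boolean write(long token, String newValue) {
        if (token < highestToken) {
            return false; // stale lock holder: fence it off
        }
        highestToken = token;
        value = newValue;
        return true;
    }

    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        boolean a = store.write(100, "from holder A");  // czxid 100: accepted
        boolean b = store.write(107, "from holder B");  // new lock, czxid 107: accepted
        boolean stale = store.write(100, "A again");    // A lost its session: rejected
        System.out.println(a + " " + b + " " + stale);  // true true false
    }
}
```

The point is that the protected resource, not ZooKeeper, enforces the fence: even a client that never learned its session expired cannot corrupt state, because its token is older than the current holder's.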