
Operations & Four Letter Words

Running ZooKeeper in production — configuration, monitoring with four letter words, JVM tuning, multi-datacenter deployment, and handling common failure scenarios.

01

zoo.cfg Core Configuration

The zoo.cfg file is ZooKeeper's primary configuration file. It controls timing, storage, networking, and ensemble membership. Getting these settings right is critical for production stability.

zoo.cfg
# ═══════════════════════════════════════════════════════════
# TIMING
# ═══════════════════════════════════════════════════════════
tickTime=2000
# Base time unit in milliseconds. Used as the basis for:
#   - Session timeouts (min=2Ɨtick, max=20Ɨtick)
#   - Heartbeat interval between leader and followers
#   - Default: 2000ms (2 seconds)

initLimit=10
# Ticks allowed for followers to connect and sync with leader
# during initial startup. 10 ticks = 20 seconds.
# Increase for large datasets that take longer to sync.

syncLimit=5
# Ticks allowed for followers to fall behind the leader.
# If a follower is more than 5 ticks (10s) behind, it's dropped.
# Increase for high-latency networks.

# ═══════════════════════════════════════════════════════════
# STORAGE
# ═══════════════════════════════════════════════════════════
dataDir=/var/lib/zookeeper/data
# Where ZooKeeper stores snapshots and the myid file.
# Should be on a reliable filesystem (not tmpfs).

dataLogDir=/var/lib/zookeeper/txnlog
# Where transaction logs are written (CRITICAL for performance).
# MUST be on a dedicated SSD — separate from dataDir.
# Transaction log fsync is the #1 write latency factor.

# ═══════════════════════════════════════════════════════════
# SNAPSHOTS & CLEANUP
# ═══════════════════════════════════════════════════════════
autopurge.snapRetainCount=5
# Keep the 5 most recent snapshots (delete older ones)

autopurge.purgeInterval=1
# Run purge every 1 hour (0 = disabled)

snapCount=100000
# Take a snapshot every 100,000 transactions (approximate)

# ═══════════════════════════════════════════════════════════
# NETWORKING
# ═══════════════════════════════════════════════════════════
clientPort=2181
# Port for client connections

maxClientCnxns=60
# Max concurrent connections from a single IP (0 = unlimited)

# ═══════════════════════════════════════════════════════════
# ENSEMBLE MEMBERS
# ═══════════════════════════════════════════════════════════
server.1=zk1.prod:2888:3888
server.2=zk2.prod:2888:3888
server.3=zk3.prod:2888:3888
server.4=zk4.prod:2888:3888
server.5=zk5.prod:2888:3888

Separate dataLogDir

The single most impactful performance optimization is putting dataLogDir on a dedicated SSD separate from dataDir. Transaction log writes are sequential and fsync'd — they should never compete with snapshot I/O or OS activity. This alone can cut write latency in half.
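
To verify the transaction log disk actually delivers low synchronous-write latency, a rough dd probe can help. A sketch, assuming the dataLogDir path from the config above:

# Rough fsync latency probe on the txn log disk (oflag=dsync forces a
# synchronous write per block, similar to ZooKeeper's fsync pattern).
$ dd if=/dev/zero of=/var/lib/zookeeper/txnlog/fsync-test bs=512 count=1000 oflag=dsync
$ rm /var/lib/zookeeper/txnlog/fsync-test
# Divide the total time dd reports by 1000 to estimate per-fsync latency.
# ~1ms per write suggests a healthy dedicated SSD; ~10ms suggests a shared
# or spinning disk that will throttle every ZooKeeper write.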

02

The myid File & Port Usage

Each ZooKeeper server needs a unique identity (myid) and uses three ports for different types of communication. Understanding the port layout is essential for firewall configuration and troubleshooting.

myid-and-ports.txt
The myid File:
  Location: {dataDir}/myid
  Content: a single integer (the server's ID)
  
  Example: echo "3" > /var/lib/zookeeper/data/myid
  
  This ID must match the server.X line in zoo.cfg:
    server.3=zk3.prod:2888:3888  ← this server is ID 3

Port Usage:
═══════════════════════════════════════════════════════════
Port │ Purpose                    │ Who Connects
═══════════════════════════════════════════════════════════
2181 │ Client connections         │ Application clients
2888 │ Follower → Leader          │ Followers connect to leader
3888 │ Leader election            │ All servers during election
═══════════════════════════════════════════════════════════

Firewall rules needed:
  - 2181: Open to application servers (clients)
  - 2888: Open between all ZK servers (follower-leader)
  - 3888: Open between all ZK servers (election)
  - JMX port (if enabled): Restrict to monitoring systems only

server.X format:
  server.{myid}={hostname}:{follower_port}:{election_port}
  server.1=zk1.prod:2888:3888
  
  For observers:
  server.6=zk6.prod:2888:3888:observer
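
A quick consistency check that the myid on disk matches the ensemble definition — assuming zoo.cfg lives at /etc/zookeeper/zoo.cfg (adjust to your install):

# On zk3.prod: write the ID, then confirm a matching server.N line exists.
$ echo "3" > /var/lib/zookeeper/data/myid
$ grep "^server.$(cat /var/lib/zookeeper/data/myid)=" /etc/zookeeper/zoo.cfg
server.3=zk3.prod:2888:3888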

Port Configuration Best Practices

  • āœ… Use consistent port numbers across all servers (2181, 2888, 3888)
  • āœ… Firewall 2181 to only allow known application servers
  • āœ… Keep 2888 and 3888 open only between ZK ensemble members (see the iptables sketch below)
  • āœ… Never expose ZK ports to the public internet
  • āœ… Use separate network interfaces for client traffic vs inter-server traffic if possible
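
As a concrete illustration of these rules, a hedged iptables sketch — the subnets are placeholders, not a prescription:

# Clients (app subnet, placeholder CIDR) may reach the client port:
$ iptables -A INPUT -p tcp --dport 2181 -s 10.0.1.0/24 -j ACCEPT
# Only ensemble peers (placeholder CIDR) may reach quorum/election ports:
$ iptables -A INPUT -p tcp --dport 2888 -s 10.0.2.0/24 -j ACCEPT
$ iptables -A INPUT -p tcp --dport 3888 -s 10.0.2.0/24 -j ACCEPT
# Everything else to these ports is dropped:
$ iptables -A INPUT -p tcp -m multiport --dports 2181,2888,3888 -j DROP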
03

JVM Tuning & Hardware

ZooKeeper runs on the JVM, making GC pauses a critical concern. A GC pause that outlasts the follower sync limit makes the leader appear dead and triggers an unnecessary election; a pause longer than the session timeout expires client sessions. Proper JVM tuning and hardware selection prevent both.

jvm-tuning.txt
# ZooKeeper JVM settings (java.env or JVMFLAGS)

# Heap size — match to your dataset size + overhead
# All znodes live in memory. Typical: 1-4GB for most deployments.
-Xms4g -Xmx4g    # Fixed heap (no resizing pauses)

# GC — use G1GC for predictable pause times
-XX:+UseG1GC
-XX:MaxGCPauseMillis=50        # Target 50ms max pause
-XX:G1HeapRegionSize=16m       # Larger regions for large heaps
-XX:InitiatingHeapOccupancyPercent=35  # Start GC early

# GC logging (essential for troubleshooting)
-Xlog:gc*:file=/var/log/zookeeper/gc.log:time,uptime:filecount=10,filesize=50m

# Avoid full GC pauses
-XX:+ParallelRefProcEnabled
-XX:+DisableExplicitGC         # Prevent System.gc() calls

# For ZK 3.7+ on Java 17+, consider ZGC instead of G1GC
# (replace -XX:+UseG1GC and the G1 flags above):
# -XX:+UseZGC                  # Sub-millisecond pauses; largely removes
#                              # GC pauses as an operational concern
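
Where these flags live depends on the install; a common pattern is a java.env file next to zoo.cfg, sourced by the start scripts. A minimal sketch using the flags above:

# conf/java.env — sourced by zkEnv.sh/zkServer.sh at startup
export JVMFLAGS="-Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 \
  -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC \
  -Xlog:gc*:file=/var/log/zookeeper/gc.log:time,uptime:filecount=10,filesize=50m"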
═══════════════════════════════════════════════════════════
Hardware          │ Recommendation       │ Why
═══════════════════════════════════════════════════════════
CPU               │ 4-8 cores            │ ZK is not CPU-intensive; more cores help GC threads
RAM               │ 8-16 GB              │ Heap (4GB) + OS cache + overhead; all data in memory
Disk (txn log)    │ NVMe SSD, dedicated  │ fsync latency dominates write performance
Disk (snapshots)  │ SSD (can share)      │ Periodic writes, less latency-sensitive
Network           │ Low latency (<1ms)   │ Zab round-trips on every write
Dedicated machine │ Yes                  │ No resource contention from other processes
═══════════════════════════════════════════════════════════

GC Pauses Are the #1 Operational Risk

A GC pause stops the leader from sending heartbeats. Once followers have waited longer than syncLimit Ɨ tickTime (10s with the defaults above), they conclude the leader is dead and start an election. The leader comes back from GC to find it's been deposed. This causes a brief outage for all clients. Use G1GC with low pause targets, or ZGC (Java 17+) for sub-ms pauses.

04

The Four Letter Words

ZooKeeper provides a set of diagnostic commands called "four letter words" (because each command is exactly four characters). You send them via telnet or netcat to the client port, and ZooKeeper responds with status information.

four-letter-words.txt
# Enable four letter words in zoo.cfg (3.5.3+):
4lw.commands.whitelist=mntr,srvr,stat,ruok,dump,envi,wchs,cons

# Usage: echo <command> | nc localhost 2181

# ─────────────────────────────────────────────────────────
# HEALTH CHECK
# ─────────────────────────────────────────────────────────
$ echo ruok | nc localhost 2181
imok
# Returns "imok" if server is running. Use for load balancer health checks.

# ─────────────────────────────────────────────────────────
# SERVER STATUS
# ─────────────────────────────────────────────────────────
$ echo srvr | nc localhost 2181
Zookeeper version: 3.8.1
Latency min/avg/max: 0/0/12
Received: 4521
Sent: 4520
Connections: 3
Outstanding: 0
Zxid: 0x300000047
Mode: leader          ← This server is the leader
Node count: 1423

# ─────────────────────────────────────────────────────────
# DETAILED METRICS (best for monitoring)
# ─────────────────────────────────────────────────────────
$ echo mntr | nc localhost 2181
zk_version  3.8.1
zk_avg_latency  0
zk_max_latency  12
zk_min_latency  0
zk_packets_received  4521
zk_packets_sent  4520
zk_num_alive_connections  3
zk_outstanding_requests  0
zk_server_state  leader
zk_znode_count  1423
zk_watch_count  45
zk_ephemerals_count  12
zk_approximate_data_size  98304
zk_open_file_descriptor_count  32
zk_max_file_descriptor_count  65536
zk_followers  4
zk_synced_followers  4
zk_pending_syncs  0

# ─────────────────────────────────────────────────────────
# OTHER USEFUL COMMANDS
# ─────────────────────────────────────────────────────────
$ echo stat | nc localhost 2181    # srvr + connection details
$ echo dump | nc localhost 2181    # sessions + ephemeral nodes
$ echo wchs | nc localhost 2181    # watch summary
$ echo cons | nc localhost 2181    # client connections detail
$ echo envi | nc localhost 2181    # environment (Java version, OS, etc.)

mntr for Monitoring Systems

The mntr command outputs key=value pairs perfect for ingestion by Prometheus, Datadog, or other monitoring systems. It's the most useful four letter word for production monitoring. Set up alerts on zk_outstanding_requests, zk_avg_latency, zk_synced_followers, and zk_open_file_descriptor_count.
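
A minimal alerting sketch built on mntr — the threshold and hostname are illustrative, and the awk field split relies on mntr's key/value output shown above:

#!/usr/bin/env bash
# Alert if queued requests exceed a threshold (100 is illustrative).
host=localhost port=2181
outstanding=$(echo mntr | nc "$host" "$port" | awk '/^zk_outstanding_requests/ {print $2}')
if [ "${outstanding:-0}" -gt 100 ]; then
  echo "ALERT: $outstanding outstanding requests on $host:$port"
fi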

05

Key Metrics to Monitor

Monitoring ZooKeeper effectively means tracking a small set of critical metrics that indicate health, performance, and capacity. Alert on these before your clients notice problems.

═══════════════════════════════════════════════════════════
Metric                        │ Healthy Value  │ Alert Threshold   │ What It Means
═══════════════════════════════════════════════════════════
zk_avg_latency                │ < 5ms          │ > 20ms            │ Average request processing time
zk_outstanding_requests       │ 0-10           │ > 100             │ Requests queued but not processed
zk_num_alive_connections      │ Expected count │ Sudden drop/spike │ Connected clients
zk_watch_count                │ Stable         │ Rapid growth      │ Active watches (memory pressure)
zk_synced_followers           │ N-1            │ < expected        │ Followers in sync with leader
zk_pending_syncs              │ 0              │ > 0 sustained     │ Followers waiting to sync
zk_open_file_descriptor_count │ < 50% of max   │ > 80% of max      │ FD exhaustion risk
zk_approximate_data_size      │ < 500MB        │ > 1GB             │ Total data in memory
═══════════════════════════════════════════════════════════

Critical Alerts to Configure

  • āŒLeader loss — zk_server_state changes from 'leader' on all nodes (no leader elected)
  • āŒQuorum loss — zk_synced_followers drops below quorum threshold
  • āŒLatency spike — zk_max_latency exceeds session timeout (clients will disconnect)
  • āŒRequest backlog — zk_outstanding_requests growing continuously (server overwhelmed)
  • āŒFD exhaustion — approaching max file descriptors (new connections will fail)

Monitor All Nodes, Not Just Leader

Monitor every node in the ensemble independently. A follower falling behind (high pending_syncs) is an early warning of network issues or disk problems. The leader's metrics show write throughput; followers' metrics show replication health.
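
A quick way to eyeball every node at once — hostnames are the example ensemble from zoo.cfg above:

# Print role and backlog indicators for each ensemble member.
for h in zk1.prod zk2.prod zk3.prod zk4.prod zk5.prod; do
  echo "== $h =="
  echo mntr | nc "$h" 2181 | grep -E 'zk_server_state|zk_outstanding_requests|zk_pending_syncs'
done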

06

Multi-Datacenter Deployment

Deploying ZooKeeper across multiple datacenters is complex because every write requires quorum — and cross-DC latency directly impacts write performance. There are three common approaches, each with different trade-offs.

═══════════════════════════════════════════════════════════
Approach                               │ Write Latency       │ Survives DC Loss           │ Complexity
═══════════════════════════════════════════════════════════
Option 1: Single DC + observers        │ Low (intra-DC)      │ āŒ No (primary DC = down)  │ Low
Option 2: 3 DCs (2+2+1)                │ High (cross-DC RTT) │ āœ… Yes (any 1 DC)          │ Medium
Option 3: 2 DCs + hierarchical quorum  │ Medium              │ Partial                    │ High
═══════════════════════════════════════════════════════════
multi-dc-options.txt
Option 1: Single DC with Observers in Secondary
═══════════════════════════════════════════════════════════
DC1 (Primary):  server.1, server.2, server.3, server.4, server.5
DC2 (Secondary): server.6:observer, server.7:observer

Quorum: 3 of 5 (all in DC1)
Write latency: ~5ms (intra-DC only)
Read latency in DC2: ~1ms (local observer)
DC1 failure: TOTAL OUTAGE (no quorum possible)
DC2 failure: No impact on writes, DC2 reads unavailable

Best for: Most deployments where one DC is "primary"

Option 2: Three DCs (true multi-DC)
═══════════════════════════════════════════════════════════
DC1: server.1, server.2          (2 nodes)
DC2: server.3, server.4          (2 nodes)
DC3: server.5                    (1 node — tiebreaker)

Quorum: 3 of 5
Write latency: ~50-100ms (must reach quorum across DCs)
Any single DC failure: quorum still possible (3+ nodes remain)
Two DC failure: OUTAGE

Best for: Critical systems requiring DC-level fault tolerance
Trade-off: Every write pays cross-DC latency

Option 3: Hierarchical Quorum (advanced)
═══════════════════════════════════════════════════════════
DC1: server.1, server.2, server.3  (group weight: 1)
DC2: server.4, server.5, server.6  (group weight: 1)

Quorum: majority of groups AND majority within each group
Write latency: cross-DC (must reach both groups)
More flexible but more complex to reason about
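
For reference, hierarchical quorum is expressed with group/weight entries in zoo.cfg. A sketch for the two-DC layout above — note that weights are assigned per server, not per group, and this mode deserves careful testing:

# zoo.cfg — hierarchical quorum groups (Option 3, illustrative)
group.1=1:2:3      # DC1 servers
group.2=4:5:6      # DC2 servers
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1
weight.6=1
# ...plus the usual server.N lines for servers 1-6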

Option 1 is Usually Right

For most deployments, Option 1 (single DC + observers) is the best choice. Cross-DC write latency is painful (50-100ms per write) and most organizations have a primary DC anyway. Use observers in the secondary DC for local reads and plan for DC failover at the application level.
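
Note that an observer is declared in two places: the shared server list and the observer's own configuration. A minimal sketch:

# zoo.cfg on every node — mark the observer in the server list:
server.6=zk6.prod:2888:3888:observer
# zoo.cfg on zk6.prod only — declare its own role:
peerType=observer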

07

Common Failure Scenarios

Understanding common failure scenarios and their symptoms helps you diagnose and resolve ZooKeeper issues quickly. Most production incidents fall into a few well-known categories.

failure-scenarios.txt
Scenario 1: QUORUM LOSS
Symptom: All clients get CONNECTION_LOSS, no writes possible
Cause: More than ⌊N/2āŒ‹ nodes are down
Fix: Bring nodes back online. ZK auto-recovers once quorum is restored.
Prevention: 5-node ensemble (tolerates 2 failures), cross-rack placement

Scenario 2: SESSION EXPIRY CASCADE
Symptom: Many clients expire simultaneously, ephemeral nodes mass-deleted
Cause: Leader GC pause > session timeout, or network partition
Effect: All services lose their registrations, locks released, chaos
Fix: Increase session timeout, tune GC, use ZGC
Prevention: Session timeout > 3Ɨ worst GC pause, dedicated hardware

Scenario 3: DISK FULL
Symptom: Write failures, server may crash or become read-only
Cause: Transaction logs and snapshots fill the disk
Fix: Enable autopurge, manually clean old snapshots
Prevention: autopurge.purgeInterval=1, monitor disk usage, alerts at 80%

Scenario 4: GC PAUSE (Leader)
Symptom: Brief unavailability, unnecessary leader election
Cause: Full GC on leader node pauses all processing
Effect: Followers think leader is dead, elect new leader
Fix: Tune GC (G1GC with low pause target), or use ZGC
Prevention: Fixed heap size, avoid System.gc(), monitor GC logs

Scenario 5: SPLIT BRAIN (should be impossible)
Symptom: Two nodes both claim to be leader
Cause: Bug or misconfiguration (should never happen with correct Zab)
Reality: ZK's quorum mechanism prevents this. If you see it, check:
  - Are all nodes using the same zoo.cfg?
  - Is the myid file correct on each node?
  - Are there network issues causing false positives?

Scenario 6: SLOW FOLLOWER
Symptom: One follower has high pending_syncs, clients on it see stale data
Cause: Disk I/O issues, network congestion, or resource contention
Fix: Check disk health, network, competing processes
Prevention: Dedicated hardware, SSD for txn log, monitoring

Operational Runbook Essentials

  • āœ… Always check 'srvr' on all nodes first — identify who is leader, who is follower, who is down
  • āœ… Check 'mntr' for outstanding_requests and latency — indicates if the ensemble is overloaded
  • āœ… Check GC logs when latency spikes — GC pauses are the most common cause
  • āœ… Never restart more than one node at a time — maintain quorum during rolling restarts
  • āœ… Wait for the restarted node to fully sync before restarting the next one (see the sketch below)
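
The last two runbook items can be scripted. A hedged sketch of the wait-for-sync step, using the srvr/mntr commands shown earlier (hostnames and follower count are from the example ensemble):

#!/usr/bin/env bash
# Block until a restarted node rejoins and the leader sees all followers synced.
wait_for_sync() {
  local node=$1 leader=$2 expected=$3
  until echo srvr | nc "$node" 2181 | grep -qE 'Mode: (follower|leader)'; do sleep 2; done
  until [ "$(echo mntr | nc "$leader" 2181 | awk '/^zk_synced_followers/ {print $2}')" = "$expected" ]; do
    sleep 2
  done
}
wait_for_sync zk2.prod zk1.prod 4   # then, and only then, restart the next node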
08

Interview Questions

Q: What are the four letter words and how would you use them in production?

A: Four letter words are diagnostic commands sent via TCP to ZK's client port. Key ones: (1) ruok — health check ('imok' = alive), used by load balancers. (2) mntr — detailed metrics in key=value format, used by monitoring systems (Prometheus, Datadog). (3) srvr — server status including mode (leader/follower), zxid, connections. (4) stat — srvr + client connection details. (5) dump — sessions and ephemeral nodes. In production: use ruok for health checks, mntr for metrics collection, srvr for quick diagnosis during incidents.

Q: How would you deploy ZooKeeper across multiple datacenters?

A: Three options: (1) Single DC + observers — voting members in primary DC, observers in secondary for local reads. Low write latency but no DC-level fault tolerance. Best for most deployments. (2) Three DCs (2+2+1) — voting members across 3 DCs. Survives any single DC loss but every write pays cross-DC latency (50-100ms). For critical systems. (3) Hierarchical quorum — complex, rarely used. Key insight: every write requires quorum ACKs, so cross-DC members directly impact write latency. Use observers for read scaling in remote DCs without affecting write performance.

Q: What happens during a GC pause on the ZooKeeper leader?

A: If the GC pause exceeds tickTime: (1) Leader stops sending heartbeats to followers. (2) Followers' syncLimit timer expires — they think leader is dead. (3) Followers enter LOOKING state, start leader election. (4) New leader elected, ensemble briefly unavailable during election. (5) Old leader comes back from GC, discovers it's been deposed (epoch increased), becomes follower. (6) Clients connected to old leader get DISCONNECTED, reconnect to new leader. Prevention: use G1GC with MaxGCPauseMillis=50, or ZGC for sub-ms pauses. Fixed heap size. Dedicated hardware.

Q: What's the most important ZooKeeper performance optimization?

A: Putting the transaction log (dataLogDir) on a dedicated SSD separate from everything else. Every write requires an fsync to the transaction log before the follower can ACK. If the txn log shares a disk with snapshots, OS activity, or other applications, fsync latency increases dramatically. A dedicated NVMe SSD can achieve 1-2ms fsync vs 10-15ms on a shared HDD. This single change can cut write latency by 50-80%. Second most important: proper GC tuning to prevent pauses that trigger unnecessary elections.

09

Common Mistakes

šŸ’¾

Sharing the transaction log disk

Putting dataLogDir on the same disk as dataDir, application logs, or the OS. Snapshot writes and other I/O compete with transaction log fsyncs, causing write latency spikes.

āœ… Always put dataLogDir on a dedicated SSD with no other I/O. This is the #1 performance optimization for ZooKeeper. Use a separate disk for dataDir (snapshots).

šŸ—‘ļø

Not enabling autopurge

Transaction logs and snapshots accumulate indefinitely, eventually filling the disk. When the disk is full, ZooKeeper crashes or becomes read-only.

āœ… Enable autopurge: autopurge.purgeInterval=1 (hourly), autopurge.snapRetainCount=5 (keep last 5). Also monitor disk usage and alert at 80% capacity.

šŸ”„

Rolling restarts too fast

Restarting the next node before the previous one has fully synced. If two nodes are restarting simultaneously in a 5-node ensemble, you're down to 3 — one more failure loses quorum.

āœ… After restarting a node, wait until 'mntr' shows zk_synced_followers equals the expected count. Only then restart the next node. Never restart more than one at a time.

🌐

Exposing ZooKeeper ports to the internet

Leaving port 2181 open to the world. ZooKeeper has no built-in authentication by default — anyone can connect and read/write data.

āœ… Firewall 2181 to only allow known application servers. Firewall 2888/3888 to only allow ensemble members. Enable SASL authentication for client connections. Never expose ZK to the public internet.
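
Server-side SASL is a zoo.cfg plus JAAS change. A minimal sketch with assumed paths (the JAAS file contents are omitted):

# zoo.cfg — enable the SASL authentication provider:
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
# java.env — point the JVM at a JAAS config (path is an assumption):
export SERVER_JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/jaas.conf"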

ā˜•

Ignoring JVM GC tuning

Running ZooKeeper with default JVM settings. Default GC can cause multi-second pauses that trigger unnecessary leader elections and client session expirations.

āœ… Use G1GC with -XX:MaxGCPauseMillis=50, fixed heap size (-Xms = -Xmx), and enable GC logging. For Java 17+, use ZGC for sub-millisecond pauses. Monitor GC logs for pause times exceeding tickTime.