ZNodes & The Data Model
ZooKeeper's data model is a hierarchical tree of znodes, each holding a small byte array, version info, and ACLs. The magic is in the node types: persistent, ephemeral, and sequential.
ZNode Hierarchy
ZooKeeper's namespace is organized as a hierarchical tree, similar to a filesystem. Each node in the tree is called a znode. Every znode is identified by a path (like /app/config/database) and can have both data and children.
A Filesystem That's Also a Database
Think of ZooKeeper's namespace like a filesystem where every directory can also hold a small file. /app is both a 'directory' (it has children like /app/config) and a 'file' (it can store data bytes). Unlike a real filesystem, there are no 'files' vs 'directories': every znode is both.
ZooKeeper Namespace (typical production layout):

```
/
├── kafka
│   ├── brokers
│   │   ├── ids
│   │   │   ├── 0            data: {"host":"broker0","port":9092}
│   │   │   ├── 1            data: {"host":"broker1","port":9092}
│   │   │   └── 2            data: {"host":"broker2","port":9092}
│   │   └── topics
│   │       └── orders
│   │           └── partitions
│   │               ├── 0    data: {"leader":1,"isr":[1,2]}
│   │               └── 1    data: {"leader":2,"isr":[2,0]}
│   └── controller           data: {"brokerid":0}
├── hbase
│   ├── master               data: <master server address>
│   └── rs                   (region servers registered here)
└── myapp
    ├── config
    │   ├── database         data: {"host":"db.prod","port":5432}
    │   └── feature-flags    data: {"dark_mode":true,"beta":false}
    ├── leader               data: "service-instance-3"
    └── locks
        └── payment-processing
            ├── _c_...0001   (lock holder)
            └── _c_...0002   (waiting)
```

Key rules:
- Paths are absolute (always start with /)
- No relative paths (no . or ..)
- Path components cannot be empty (/a//b is invalid)
- "zookeeper" is reserved (/zookeeper is system-internal)
Hierarchy Design Principles
- ✅ Use paths like namespaces: /app-name/feature/specific-node
- ✅ Keep the tree shallow: deep nesting adds latency to path resolution
- ✅ Use the hierarchy for logical grouping, not for data relationships
- ✅ Reserve top-level paths for different applications or services
- ✅ The root (/) always exists and cannot be deleted
ZNode Data
Each znode can store a small byte array: up to 1MB by default, though in practice you should keep data much smaller (bytes to kilobytes). Reads and writes to znode data are atomic: a read always returns the complete data, and a write replaces it entirely.
| Property | Value | Why |
|---|---|---|
| Max data size | 1MB (configurable via jute.maxbuffer) | All data is replicated to all nodes in memory |
| Typical data size | Bytes to KB | Coordination metadata is small |
| Read semantics | Atomic: full data or nothing | No partial reads possible |
| Write semantics | Atomic: replaces entire data | No append or partial update |
| Data format | Opaque byte array | ZK doesn't interpret content: JSON, protobuf, plain text all work |
What to store in znode data:

```
✅ Good - small coordination metadata:
   /myapp/leader         → "instance-id-42" (15 bytes)
   /myapp/config/db      → {"host":"db.prod","port":5432} (35 bytes)
   /myapp/locks/job-1    → "" (0 bytes - existence is enough)
   /kafka/brokers/ids/0  → {"host":"10.0.1.1","port":9092,"rack":"us-east-1a"}

❌ Bad - too large or wrong use case:
   /myapp/user-sessions  → <50KB JSON blob> (too large, too many updates)
   /myapp/logs           → <application logs> (wrong tool entirely)
   /myapp/images/avatar  → <binary image> (use S3/object storage)
```

Why the 1MB limit exists:
- All znode data lives in memory on EVERY server
- Every write replicates to ALL servers via Zab
- Large znodes = more memory, more network, slower snapshots
- If you need more than 1MB, you're using ZK wrong
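To make the read/write semantics concrete, here is a minimal sketch in the Java client. The already-connected `zk` handle and the `/myapp/config/db` node are assumptions for illustration: the read returns the complete byte array plus a Stat, and the write replaces the whole value.

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;
import java.nio.charset.StandardCharsets;

// Assumes an already-connected ZooKeeper handle `zk` and an existing
// /myapp/config/db znode (both hypothetical).
Stat stat = new Stat();
byte[] raw = zk.getData("/myapp/config/db", false, stat);   // full data; fills `stat` as a side effect
String config = new String(raw, StandardCharsets.UTF_8);    // ZK treats the bytes as opaque

// A write replaces the ENTIRE value - there is no append or partial update.
byte[] updated = "{\"host\":\"db.prod\",\"port\":5433}".getBytes(StandardCharsets.UTF_8);
zk.setData("/myapp/config/db", updated, stat.getVersion()); // version check - see versioning below
```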
Data is Versioned
Every write to a znode increments its data version. You can use this for optimistic concurrency control: setData(path, data, expectedVersion) will fail if the version doesn't match. This is ZooKeeper's compare-and-swap (CAS) primitive.
Persistent ZNodes
Persistent znodes are the default type. Once created, they exist until explicitly deleted by a client. They survive client disconnections, session expirations, and server restarts. They are the foundation for storing configuration and metadata.
Persistent ZNode Characteristics
- ✅ Survive client disconnection: the creating client can crash and the node persists
- ✅ Survive server restarts: persisted to disk via transaction log and snapshots
- ✅ Must be explicitly deleted: no automatic cleanup
- ✅ Can have children: both persistent and ephemeral children
- ✅ Used for: configuration, metadata, namespace structure, lock queues
```java
// Creating persistent znodes (Java client)
zk.create("/myapp/config", "v1".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);

// This node survives:
// ✅ Client disconnects
// ✅ Client session expires
// ✅ ZooKeeper server restarts
// ✅ Leader election
// ❌ Only removed by explicit delete() call

// Use cases:
// /myapp/config/database   → connection string (rarely changes)
// /myapp/config/features   → feature flags
// /kafka/brokers/topics/X  → topic metadata
// /myapp/locks/            → parent node for lock children
```
Persistent Nodes as Namespace
In practice, persistent nodes often serve as the "directory structure" ā they define the namespace hierarchy. Ephemeral nodes (which represent live processes) are created as children of persistent parent nodes.
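A minimal sketch of that pattern, assuming an already-connected `zk` handle and a hypothetical `/myapp/workers` namespace: the persistent parents are created once, idempotently, and the live process then registers itself as an ephemeral child.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical namespace: /myapp/workers holds one ephemeral child per live worker.
// Assumes an already-connected ZooKeeper handle `zk`.
for (String path : new String[] {"/myapp", "/myapp/workers"}) {
    try {
        // Persistent parents define the namespace; creating them is idempotent
        // as long as we tolerate "already exists".
        zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
        // Another process created it first - that's fine.
    }
}

// The live process announces itself; this node vanishes when its session ends.
zk.create("/myapp/workers/worker-1", "10.0.1.5:8080".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
```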
Ephemeral ZNodes
Ephemeral znodes are ZooKeeper's most powerful primitive. They are tied to the client session that created them: when the session ends (client disconnects, crashes, or times out), the ephemeral node is automatically deleted. This is the foundation for failure detection.
The Candle in the Window
An ephemeral node is like a candle in a window that says 'I'm home.' As long as the person is home, the candle burns. When they leave (or fall asleep and can't maintain it), the candle goes out. Neighbors watching the window (via watches) immediately know the person is gone. No one needs to knock on the door and wait: the absence of the candle IS the signal.
| Property | Persistent ZNode | Ephemeral ZNode |
|---|---|---|
| Lifetime | Until explicitly deleted | Until session ends |
| Survives disconnect | ✅ Yes | ❌ Deleted automatically |
| Can have children | ✅ Yes | ❌ No children allowed |
| Created by | Any client | Bound to creating session |
| Use case | Config, metadata, structure | Liveness, locks, membership |
```java
// Service registration with ephemeral nodes
zk.create("/services/payment/instance-1",
          "10.0.1.5:8080".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL);

// What happens when the service crashes:
// 1. TCP connection to ZooKeeper drops
// 2. ZooKeeper waits for session timeout (e.g., 30 seconds)
// 3. No heartbeat received → session expired
// 4. /services/payment/instance-1 is AUTOMATICALLY deleted
// 5. Watches on /services/payment fire (children changed)
// 6. Other services discover instance-1 is gone

// Why no children?
// If ephemeral nodes could have children, what happens to the
// children when the parent is auto-deleted? Cascading deletes
// would be surprising and dangerous. So ZK forbids it.

// The stat structure reveals the owner:
// ephemeralOwner = session ID of the creating client
// (0 for persistent nodes)
```
Ephemeral Node Use Cases
- ✅ Service discovery: each service instance creates an ephemeral node; crash = automatic deregistration
- ✅ Leader election: leader holds an ephemeral node; crash = node deleted = new election triggered
- ✅ Distributed locks: lock holder creates ephemeral node; crash = lock auto-released
- ✅ Health monitoring: existence of ephemeral node = process is alive
- ✅ Session tracking: know exactly which clients are currently connected
The Most Important Primitive
Ephemeral nodes are what make ZooKeeper fundamentally different from a regular key-value store. They provide automatic failure detection without any polling or heartbeat logic in your application. Combined with watches, they enable instant notification of failures.
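A minimal sketch of that combination, assuming an already-connected `zk` handle and the hypothetical `/services/payment` parent from above: list the children with a watch attached, and re-arm the watch each time it fires, since ZooKeeper watches are one-shot.

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import java.util.List;

// Assumes an already-connected ZooKeeper handle `zk` and a persistent
// /services/payment parent (both hypothetical).
void watchMembers(ZooKeeper zk) throws Exception {
    List<String> members = zk.getChildren("/services/payment", event -> {
        if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
            // An instance registered, or an ephemeral node vanished (crash/timeout).
            try {
                watchMembers(zk);   // re-read the list and re-arm the one-shot watch
            } catch (Exception e) {
                // handle connection loss / retry here
            }
        }
    });
    System.out.println("Live instances: " + members);
}
```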
Sequential ZNodes
When you create a sequential znode, ZooKeeper appends a monotonically increasing 10-digit counter to the node name. This counter is unique within the parent node and never reused. Combined with ephemeral nodes, sequential nodes enable fair locking and ordered operations.
```java
// Creating sequential nodes
zk.create("/locks/job-lock-", data, acl, CreateMode.EPHEMERAL_SEQUENTIAL);

// ZooKeeper appends a 10-digit sequence number:
// First call:  /locks/job-lock-0000000001
// Second call: /locks/job-lock-0000000002
// Third call:  /locks/job-lock-0000000003

// The counter is:
// - Monotonically increasing (never goes backward)
// - Unique within the parent (/locks/)
// - Never reused (even after deletion)
// - 10 digits, zero-padded
// - Maintained per-parent (different parents have independent counters)

// Four combinations of node types:
// ┌────────────────────────┬───────────────────────────────┐
// │ PERSISTENT             │ /config/db                    │
// │ PERSISTENT_SEQUENTIAL  │ /queue/task-0000000001        │
// │ EPHEMERAL              │ /services/payment/instance-1  │
// │ EPHEMERAL_SEQUENTIAL   │ /locks/write-0000000003       │
// └────────────────────────┴───────────────────────────────┘

// EPHEMERAL_SEQUENTIAL is the most powerful combination:
// - Auto-deleted on session end (ephemeral)
// - Globally ordered (sequential)
// - Enables: fair locks, leader election, ordered queues
```
The Deli Counter
Sequential nodes work like a deli counter ticket system. Each customer takes a number (sequential node). The numbers always increase and are never reused. The customer with the lowest number gets served first (holds the lock). If a customer leaves without being served (session expires), their ticket disappears (ephemeral) and the next number is served.
Sequential Node Use Cases
- ✅ Fair distributed locks: lowest sequence number holds the lock, others wait in order (see the sketch after this list)
- ✅ Leader election: lowest sequence number is the leader
- ✅ FIFO queues: process nodes in sequence order
- ✅ Ordering guarantees: establish a global order among competing processes
- ✅ Barrier implementation: wait until N sequential nodes exist
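To show how the fair-lock use case falls out of EPHEMERAL_SEQUENTIAL, here is a hedged sketch. The `/locks/payment-processing` parent, the `zk` handle, and the `lock` helper are assumptions for illustration, not ZooKeeper API: each contender creates a sequential node, and only watches the node immediately ahead of it.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Assumes an already-connected ZooKeeper handle `zk` and an existing
// persistent (or container) parent /locks/payment-processing.
String lock(ZooKeeper zk) throws Exception {
    String me = zk.create("/locks/payment-processing/lock-", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL_SEQUENTIAL);     // e.g. .../lock-0000000007
    while (true) {
        List<String> children = zk.getChildren("/locks/payment-processing", false);
        Collections.sort(children);                             // zero-padded numbers sort lexically
        String myName = me.substring(me.lastIndexOf('/') + 1);
        int myIndex = children.indexOf(myName);
        if (myIndex == 0) {
            return me;                                           // lowest number → we hold the lock
        }
        // Watch only the node directly ahead of us to avoid the herd effect.
        String predecessor = "/locks/payment-processing/" + children.get(myIndex - 1);
        CountDownLatch gone = new CountDownLatch(1);
        if (zk.exists(predecessor, event -> gone.countDown()) != null) {
            gone.await();                                        // woken when the predecessor changes/disappears
        }
        // Predecessor released (or its session expired) - loop and re-check.
    }
}
```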
Container & TTL ZNodes
ZooKeeper 3.5+ introduced two additional node types to address common operational pain points: container nodes that auto-delete when empty, and TTL nodes that expire after a time period.
| Type | Introduced | Behavior | Use Case |
|---|---|---|---|
| Container | 3.5.3 | Auto-deleted when last child is removed | Parent nodes for locks/elections |
| TTL (Persistent) | 3.6.0 | Deleted after TTL expires if no children and not modified | Temporary config, lease tokens |
```java
// Container nodes - auto-cleanup of empty parents
zk.create("/locks/job-123", null, acl, CreateMode.CONTAINER);

// Children (lock holders) come and go:
//   /locks/job-123/lock-0000000001   (created, then deleted)
//   /locks/job-123/lock-0000000002   (created, then deleted)
// When the last child is deleted → /locks/job-123 is auto-deleted

// Without container nodes, you'd accumulate thousands of empty
// /locks/job-XXX parent nodes that need manual cleanup.

// TTL nodes - time-based expiration
// Requires: zookeeper.extendedTypesEnabled=true in zoo.cfg
zk.create("/leases/client-42", data, acl,
          CreateMode.PERSISTENT_WITH_TTL,
          null,      // Stat output parameter (not needed here)
          30000);    // TTL in milliseconds (30 seconds)

// If not modified and has no children after 30s → auto-deleted
// Useful for:
// - Lease tokens that should expire
// - Temporary coordination state
// - Cleanup of abandoned resources

// Note: TTL is checked lazily (not exactly at expiry time)
// Don't rely on precise timing - it's approximate cleanup
```
Container Nodes Solve a Real Problem
Before container nodes, lock and election implementations would create parent nodes like /locks/resource-X that accumulated forever. Operators had to write cleanup scripts. Container nodes solve this automatically ā when the last lock holder releases, the parent disappears.
The Stat Structure
Every znode has metadata stored in a Stat structure. This metadata tracks creation time, modification time, versions, data length, number of children, and the session that owns ephemeral nodes. Understanding Stat is essential for implementing coordination patterns correctly.
| Field | Type | Description |
|---|---|---|
| czxid | long | Transaction ID that created this znode |
| mzxid | long | Transaction ID of last modification |
| pzxid | long | Transaction ID of last child change |
| ctime | long | Time created (ms since epoch) |
| mtime | long | Time last modified (ms since epoch) |
| version | int | Data version (incremented on setData) |
| cversion | int | Child version (incremented on child add/remove) |
| aversion | int | ACL version (incremented on setACL) |
| ephemeralOwner | long | Session ID of owner (0 if persistent) |
| dataLength | int | Length of data in bytes |
| numChildren | int | Number of children |

Example stat output (zkCli):

```
[zk: localhost:2181(CONNECTED)] stat /myapp/leader
cZxid = 0x300000012
ctime = Wed Mar 15 10:23:45 UTC 2024
mZxid = 0x300000015
mtime = Wed Mar 15 10:24:01 UTC 2024
pZxid = 0x300000012
cversion = 0
dataVersion = 3
aclVersion = 0
ephemeralOwner = 0x18e3a4c2b010001   ← this is an ephemeral node!
dataLength = 24
numChildren = 0
```
How Stat Fields Are Used
- ✅ version: optimistic locking; setData(path, data, version) fails if version doesn't match (CAS)
- ✅ czxid/mzxid: determine ordering, i.e. which node was created/modified first
- ✅ ephemeralOwner: identify who owns a lock or leadership node (see the sketch after this list)
- ✅ cversion: detect child changes without listing all children
- ✅ numChildren: quick check for barrier conditions (are N nodes present?)
- ✅ dataLength: verify data was written (non-zero) without reading it
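A small sketch of reading those fields. The `/myapp/leader` path and the `zk` handle are assumptions; exists() returns the Stat without fetching the data.

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Assumes an already-connected ZooKeeper handle `zk`; /myapp/leader is hypothetical.
Stat stat = zk.exists("/myapp/leader", false);   // null if the node doesn't exist
if (stat != null) {
    System.out.printf("ephemeral=%b owner-session=0x%x version=%d children=%d bytes=%d%n",
                      stat.getEphemeralOwner() != 0,   // 0 means persistent
                      stat.getEphemeralOwner(),
                      stat.getVersion(),
                      stat.getNumChildren(),
                      stat.getDataLength());
}
```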
Version for Compare-and-Swap
The version field enables optimistic concurrency control. When updating config: (1) Read data + version. (2) Modify locally. (3) Write with expected version. If another client modified it between your read and write, the version won't match and your write fails. Retry from step 1. This prevents lost updates without locks.
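A hedged sketch of that loop, assuming a `zk` handle; the `transform` helper is hypothetical and stands in for whatever local modification you need. BadVersionException is the retry signal.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;
import java.nio.charset.StandardCharsets;

// Assumes an already-connected ZooKeeper handle `zk`.
void updateConfig(ZooKeeper zk, String path) throws Exception {
    while (true) {
        // 1. Read data + version.
        Stat stat = new Stat();
        byte[] current = zk.getData(path, false, stat);

        // 2. Modify locally (hypothetical transformation).
        byte[] modified = transform(current);

        try {
            // 3. Write with the expected version - fails if someone wrote in between.
            zk.setData(path, modified, stat.getVersion());
            return;                                    // success
        } catch (KeeperException.BadVersionException e) {
            // Someone else won the race; retry from step 1.
        }
    }
}

// Hypothetical local modification - replace with real logic.
byte[] transform(byte[] data) {
    return new String(data, StandardCharsets.UTF_8).toUpperCase()
            .getBytes(StandardCharsets.UTF_8);
}
```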
Interview Questions
Q: What are ephemeral nodes and why are they ZooKeeper's most powerful primitive?
A: Ephemeral nodes are automatically deleted when the creating client's session ends (disconnect, crash, or timeout). They're the most powerful primitive because they provide automatic failure detection without any application-level heartbeat logic. Use cases: (1) Service discovery: crash = automatic deregistration. (2) Leader election: leader crash = node deleted = new election. (3) Distributed locks: holder crash = lock auto-released. No other key-value store provides this session-bound lifecycle natively.
Q: Why can't ephemeral nodes have children?
A: If ephemeral nodes could have children, ZooKeeper would need to decide what happens to children when the parent is auto-deleted. Cascading deletes would be surprising and could destroy important data. Leaving orphaned children would break the tree structure. Rather than making either dangerous choice, ZooKeeper simply forbids it. If you need a hierarchy under an ephemeral concept, use a persistent parent with ephemeral children.
Q: How do sequential nodes enable fair distributed locking?
A: Each lock contender creates an EPHEMERAL_SEQUENTIAL node under a lock path (e.g., /locks/resource/lock-0000000001). The node with the lowest sequence number holds the lock. Others watch the node immediately before them (not the lock holder, which avoids the herd effect). When a node is deleted (lock released or session expired), only the next waiter is notified. This creates a fair FIFO queue where processes acquire the lock in the order they requested it.
Q: What is the Stat structure and how is the version field used for optimistic concurrency?
A: Stat is metadata attached to every znode: czxid (creation txn), mzxid (last modification txn), version (data version), cversion (children version), ephemeralOwner (session ID), etc. The version field enables CAS (compare-and-swap): setData(path, data, expectedVersion) atomically fails if the current version doesn't match. This prevents lost updates: read version, modify locally, write with version check. If someone else modified it, your write fails and you retry.
Q: What's the 1MB data limit and why does it exist?
A: Each znode can store at most 1MB of data (configurable via jute.maxbuffer). The limit exists because: (1) All znode data is stored in memory on every server in the ensemble. (2) Every write replicates the full data to all servers via Zab. (3) Large znodes would increase memory usage, network traffic, and snapshot time. In practice, coordination metadata (leader IDs, config JSON, service addresses) is bytes to kilobytes. If you need more than 1MB, you're misusing ZooKeeper; use a database instead.
Common Mistakes
Storing large data in znodes
Putting serialized objects, JSON blobs, or binary data approaching the 1MB limit. This increases memory pressure on all ensemble nodes and slows down snapshots.
✅ Keep znode data small (bytes to low KB). Store only references (IDs, addresses, small config). Put large data in a database and store the pointer in ZooKeeper.
Trying to create children under ephemeral nodes
Attempting to build a hierarchy under an ephemeral node for service metadata. ZooKeeper will reject this with NoChildrenForEphemeralsException.
✅ Use a persistent parent node with ephemeral children. Example: /services/payment (persistent) with /services/payment/instance-1 (ephemeral).
Not cleaning up persistent nodes
Creating persistent nodes for temporary purposes (job locks, task queues) without cleanup. Over time, thousands of abandoned nodes accumulate, slowing getChildren and increasing memory.
✅ Use container nodes (3.5+) for parent nodes that should auto-delete when empty. For temporary data, use TTL nodes (3.6+). Otherwise, implement explicit cleanup in your application.
Ignoring the version field for updates
Using setData(path, data, -1), which skips version checking. This means concurrent updates silently overwrite each other: last writer wins with no conflict detection.
✅ Always read the current version first, then use setData(path, data, expectedVersion). Handle BadVersionException by re-reading and retrying. This is your CAS primitive; don't bypass it.
Creating too many znodes
Using ZooKeeper as a queue with millions of sequential nodes, or creating a znode per user/request. ZooKeeper keeps all nodes in memory: millions of nodes = GBs of RAM on every server.
✅ ZooKeeper should have thousands to low hundreds of thousands of znodes, not millions. For queues, use Kafka. For per-user data, use a database. ZooKeeper is for coordination metadata only.