ZNodes & The Data Model
ZooKeeper's data model is a hierarchical tree of znodes, each holding a small byte array, version info, and ACLs. The magic is in the node types: persistent, ephemeral, and sequential.
ZNode Hierarchy
ZooKeeper's namespace is organized as a hierarchical tree, similar to a filesystem. Each node in the tree is called a znode. Every znode is identified by a path (like /app/config/database) and can have both data and children.
A Filesystem That's Also a Database
Think of ZooKeeper's namespace like a filesystem where every directory can also hold a small file. /app is both a 'directory' (it has children like /app/config) and a 'file' (it can store data bytes). Unlike a real filesystem, there are no 'files' vs 'directories': every znode is both.
ZooKeeper Namespace (typical production layout):

```
/
├── kafka
│   ├── brokers
│   │   ├── ids
│   │   │   ├── 0            data: {"host":"broker0","port":9092}
│   │   │   ├── 1            data: {"host":"broker1","port":9092}
│   │   │   └── 2            data: {"host":"broker2","port":9092}
│   │   └── topics
│   │       └── orders
│   │           └── partitions
│   │               ├── 0    data: {"leader":1,"isr":[1,2]}
│   │               └── 1    data: {"leader":2,"isr":[2,0]}
│   └── controller           data: {"brokerid":0}
├── hbase
│   ├── master               data: <master server address>
│   └── rs                   (region servers registered here)
└── myapp
    ├── config
    │   ├── database         data: {"host":"db.prod","port":5432}
    │   └── feature-flags    data: {"dark_mode":true,"beta":false}
    ├── leader               data: "service-instance-3"
    └── locks
        └── payment-processing
            ├── _c_...0001   (lock holder)
            └── _c_...0002   (waiting)
```

Key rules:
- Paths are absolute (always start with /)
- No relative paths (no . or ..)
- Path components cannot be empty (/a//b is invalid)
- "zookeeper" is reserved (/zookeeper is system-internal)
Hierarchy Design Principles
- ✅ Use paths like namespaces: /app-name/feature/specific-node
- ✅ Keep the tree shallow: deep nesting adds latency to path resolution
- ✅ Use the hierarchy for logical grouping, not for data relationships
- ✅ Reserve top-level paths for different applications or services
- ✅ The root (/) always exists and cannot be deleted
ZNode Data
Each znode can store a small byte array: up to 1MB by default, though in practice you should keep data much smaller (bytes to kilobytes). Reads and writes to znode data are atomic: a read always returns the complete data, and a write replaces it entirely.
| Property | Value | Why |
|---|---|---|
| Max data size | 1MB (configurable via jute.maxbuffer) | All data is replicated to all nodes in memory |
| Typical data size | Bytes to KB | Coordination metadata is small |
| Read semantics | Atomic: full data or nothing | No partial reads possible |
| Write semantics | Atomic: replaces entire data | No append or partial update |
| Data format | Opaque byte array | ZK doesn't interpret content: JSON, protobuf, plain text all work |
What to store in znode data:

```
✅ Good - small coordination metadata:
   /myapp/leader         → "instance-id-42" (15 bytes)
   /myapp/config/db      → {"host":"db.prod","port":5432} (35 bytes)
   /myapp/locks/job-1    → "" (0 bytes - existence is enough)
   /kafka/brokers/ids/0  → {"host":"10.0.1.1","port":9092,"rack":"us-east-1a"}

❌ Bad - too large or wrong use case:
   /myapp/user-sessions  → <50KB JSON blob> (too large, too many updates)
   /myapp/logs           → <application logs> (wrong tool entirely)
   /myapp/images/avatar  → <binary image> (use S3/object storage)
```

Why the 1MB limit exists:
- All znode data lives in memory on EVERY server
- Every write replicates to ALL servers via Zab
- Large znodes = more memory, more network, slower snapshots
- If you need more than 1MB, you're using ZK wrong
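To make the read/write semantics concrete, here is a minimal sketch in the Java client. The already-connected `zk` handle and the `/myapp/config/db` node are assumptions for illustration: the read returns the complete byte array plus a Stat, and the write replaces the whole value.

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;
import java.nio.charset.StandardCharsets;

// Assumes an already-connected ZooKeeper handle `zk` and an existing
// /myapp/config/db znode (both hypothetical).
Stat stat = new Stat();
byte[] raw = zk.getData("/myapp/config/db", false, stat);   // full data; fills `stat` as a side effect
String config = new String(raw, StandardCharsets.UTF_8);    // ZK treats the bytes as opaque

// A write replaces the ENTIRE value - there is no append or partial update.
byte[] updated = "{\"host\":\"db.prod\",\"port\":5433}".getBytes(StandardCharsets.UTF_8);
zk.setData("/myapp/config/db", updated, stat.getVersion()); // version check - see versioning below
```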
Data is Versioned
Every write to a znode increments its data version. You can use this for optimistic concurrency control: setData(path, data, expectedVersion) will fail if the version doesn't match. This is ZooKeeper's compare-and-swap (CAS) primitive.
Persistent ZNodes
Persistent znodes are the default type. Once created, they exist until explicitly deleted by a client. They survive client disconnections, session expirations, and server restarts. They are the foundation for storing configuration and metadata.
Persistent ZNode Characteristics
- ✅ Survive client disconnection: the creating client can crash and the node persists
- ✅ Survive server restarts: persisted to disk via transaction log and snapshots
- ✅ Must be explicitly deleted: no automatic cleanup
- ✅ Can have children: both persistent and ephemeral children
- ✅ Used for: configuration, metadata, namespace structure, lock queues
```java
// Creating persistent znodes (Java client)
zk.create("/myapp/config", "v1".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);

// This node survives:
// ✅ Client disconnects
// ✅ Client session expires
// ✅ ZooKeeper server restarts
// ✅ Leader election
// ❌ Only removed by explicit delete() call

// Use cases:
// /myapp/config/database   → connection string (rarely changes)
// /myapp/config/features   → feature flags
// /kafka/brokers/topics/X  → topic metadata
// /myapp/locks/            → parent node for lock children
```
Persistent Nodes as Namespace
In practice, persistent nodes often serve as the "directory structure" ā they define the namespace hierarchy. Ephemeral nodes (which represent live processes) are created as children of persistent parent nodes.
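A minimal sketch of that pattern, assuming an already-connected `zk` handle and a hypothetical `/myapp/workers` namespace: the persistent parents are created once, idempotently, and the live process then registers itself as an ephemeral child.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical namespace: /myapp/workers holds one ephemeral child per live worker.
// Assumes an already-connected ZooKeeper handle `zk`.
for (String path : new String[] {"/myapp", "/myapp/workers"}) {
    try {
        // Persistent parents define the namespace; creating them is idempotent
        // as long as we tolerate "already exists".
        zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
        // Another process created it first - that's fine.
    }
}

// The live process announces itself; this node vanishes when its session ends.
zk.create("/myapp/workers/worker-1", "10.0.1.5:8080".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
```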
Ephemeral ZNodes
Ephemeral znodes are ZooKeeper's most powerful primitive. They are tied to the client session that created them: when the session ends (client disconnects, crashes, or times out), the ephemeral node is automatically deleted. This is the foundation for failure detection.
The Candle in the Window
An ephemeral node is like a candle in a window that says 'I'm home.' As long as the person is home, the candle burns. When they leave (or fall asleep and can't maintain it), the candle goes out. Neighbors watching the window (via watches) immediately know the person is gone. No one needs to knock on the door and wait: the absence of the candle IS the signal.
| Property | Persistent ZNode | Ephemeral ZNode |
|---|---|---|
| Lifetime | Until explicitly deleted | Until session ends |
| Survives disconnect | ✅ Yes | ❌ Deleted automatically |
| Can have children | ✅ Yes | ❌ No children allowed |
| Created by | Any client | Bound to creating session |
| Use case | Config, metadata, structure | Liveness, locks, membership |
```java
// Service registration with ephemeral nodes
zk.create("/services/payment/instance-1",
          "10.0.1.5:8080".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL);

// What happens when the service crashes:
// 1. TCP connection to ZooKeeper drops
// 2. ZooKeeper waits for session timeout (e.g., 30 seconds)
// 3. No heartbeat received → session expired
// 4. /services/payment/instance-1 is AUTOMATICALLY deleted
// 5. Watches on /services/payment fire (children changed)
// 6. Other services discover instance-1 is gone

// Why no children?
// If ephemeral nodes could have children, what happens to the
// children when the parent is auto-deleted? Cascading deletes
// would be surprising and dangerous. So ZK forbids it.

// The stat structure reveals the owner:
// ephemeralOwner = session ID of the creating client
// (0 for persistent nodes)
```
Ephemeral Node Use Cases
- ✅ Service discovery: each service instance creates an ephemeral node; crash = automatic deregistration
- ✅ Leader election: leader holds an ephemeral node; crash = node deleted = new election triggered
- ✅ Distributed locks: lock holder creates ephemeral node; crash = lock auto-released
- ✅ Health monitoring: existence of ephemeral node = process is alive
- ✅ Session tracking: know exactly which clients are currently connected
The Most Important Primitive
Ephemeral nodes are what make ZooKeeper fundamentally different from a regular key-value store. They provide automatic failure detection without any polling or heartbeat logic in your application. Combined with watches, they enable instant notification of failures.
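A minimal sketch of that combination, assuming an already-connected `zk` handle and the hypothetical `/services/payment` parent from above: list the children with a watch attached, and re-arm the watch each time it fires, since ZooKeeper watches are one-shot.

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import java.util.List;

// Assumes an already-connected ZooKeeper handle `zk` and a persistent
// /services/payment parent (both hypothetical).
void watchMembers(ZooKeeper zk) throws Exception {
    List<String> members = zk.getChildren("/services/payment", event -> {
        if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
            // An instance registered, or an ephemeral node vanished (crash/timeout).
            try {
                watchMembers(zk);   // re-read the list and re-arm the one-shot watch
            } catch (Exception e) {
                // handle connection loss / retry here
            }
        }
    });
    System.out.println("Live instances: " + members);
}
```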
Sequential ZNodes
When you create a sequential znode, ZooKeeper appends a monotonically increasing 10-digit counter to the node name. This counter is unique within the parent node and never reused. Combined with ephemeral nodes, sequential nodes enable fair locking and ordered operations.
```java
// Creating sequential nodes
zk.create("/locks/job-lock-", data, acl, CreateMode.EPHEMERAL_SEQUENTIAL);

// ZooKeeper appends a 10-digit sequence number:
// First call:  /locks/job-lock-0000000001
// Second call: /locks/job-lock-0000000002
// Third call:  /locks/job-lock-0000000003

// The counter is:
// - Monotonically increasing (never goes backward)
// - Unique within the parent (/locks/)
// - Never reused (even after deletion)
// - 10 digits, zero-padded
// - Maintained per-parent (different parents have independent counters)

// Four combinations of node types:
// ┌────────────────────────┬───────────────────────────────┐
// │ PERSISTENT             │ /config/db                    │
// │ PERSISTENT_SEQUENTIAL  │ /queue/task-0000000001        │
// │ EPHEMERAL              │ /services/payment/instance-1  │
// │ EPHEMERAL_SEQUENTIAL   │ /locks/write-0000000003       │
// └────────────────────────┴───────────────────────────────┘

// EPHEMERAL_SEQUENTIAL is the most powerful combination:
// - Auto-deleted on session end (ephemeral)
// - Globally ordered (sequential)
// - Enables: fair locks, leader election, ordered queues
```
The Deli Counter
Sequential nodes work like a deli counter ticket system. Each customer takes a number (sequential node). The numbers always increase and are never reused. The customer with the lowest number gets served first (holds the lock). If a customer leaves without being served (session expires), their ticket disappears (ephemeral) and the next number is served.
Sequential Node Use Cases
- ✅ Fair distributed locks: lowest sequence number holds the lock, others wait in order (see the sketch after this list)
- ✅ Leader election: lowest sequence number is the leader
- ✅ FIFO queues: process nodes in sequence order
- ✅ Ordering guarantees: establish a global order among competing processes
- ✅ Barrier implementation: wait until N sequential nodes exist
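To show how the fair-lock use case falls out of EPHEMERAL_SEQUENTIAL, here is a hedged sketch. The `/locks/payment-processing` parent, the `zk` handle, and the `lock` helper are assumptions for illustration, not ZooKeeper API: each contender creates a sequential node, and only watches the node immediately ahead of it.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Assumes an already-connected ZooKeeper handle `zk` and an existing
// persistent (or container) parent /locks/payment-processing.
String lock(ZooKeeper zk) throws Exception {
    String me = zk.create("/locks/payment-processing/lock-", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL_SEQUENTIAL);     // e.g. .../lock-0000000007
    while (true) {
        List<String> children = zk.getChildren("/locks/payment-processing", false);
        Collections.sort(children);                             // zero-padded numbers sort lexically
        String myName = me.substring(me.lastIndexOf('/') + 1);
        int myIndex = children.indexOf(myName);
        if (myIndex == 0) {
            return me;                                           // lowest number → we hold the lock
        }
        // Watch only the node directly ahead of us to avoid the herd effect.
        String predecessor = "/locks/payment-processing/" + children.get(myIndex - 1);
        CountDownLatch gone = new CountDownLatch(1);
        if (zk.exists(predecessor, event -> gone.countDown()) != null) {
            gone.await();                                        // woken when the predecessor changes/disappears
        }
        // Predecessor released (or its session expired) - loop and re-check.
    }
}
```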
Container & TTL ZNodes
ZooKeeper 3.5+ introduced two additional node types to address common operational pain points: container nodes that auto-delete when empty, and TTL nodes that expire after a time period.
| Type | Introduced | Behavior | Use Case |
|---|---|---|---|
| Container | 3.5.3 | Auto-deleted when last child is removed | Parent nodes for locks/elections |
| TTL (Persistent) | 3.6.0 | Deleted after TTL expires if no children and not modified | Temporary config, lease tokens |
```java
// Container nodes - auto-cleanup of empty parents
zk.create("/locks/job-123", null, acl, CreateMode.CONTAINER);

// Children (lock holders) come and go:
//   /locks/job-123/lock-0000000001   (created, then deleted)
//   /locks/job-123/lock-0000000002   (created, then deleted)
// When the last child is deleted → /locks/job-123 is auto-deleted

// Without container nodes, you'd accumulate thousands of empty
// /locks/job-XXX parent nodes that need manual cleanup.

// TTL nodes - time-based expiration
// Requires: zookeeper.extendedTypesEnabled=true in zoo.cfg
zk.create("/leases/client-42", data, acl,
          CreateMode.PERSISTENT_WITH_TTL,
          null,      // Stat output parameter (not needed here)
          30000);    // TTL in milliseconds (30 seconds)

// If not modified and has no children after 30s → auto-deleted
// Useful for:
// - Lease tokens that should expire
// - Temporary coordination state
// - Cleanup of abandoned resources

// Note: TTL is checked lazily (not exactly at expiry time)
// Don't rely on precise timing - it's approximate cleanup
```
Container Nodes Solve a Real Problem
Before container nodes, lock and election implementations would create parent nodes like /locks/resource-X that accumulated forever. Operators had to write cleanup scripts. Container nodes solve this automatically ā when the last lock holder releases, the parent disappears.
The Stat Structure
Every znode has metadata stored in a Stat structure. This metadata tracks creation time, modification time, versions, data length, number of children, and the session that owns ephemeral nodes. Understanding Stat is essential for implementing coordination patterns correctly.
| Field | Type | Description |
|---|---|---|
| czxid | long | Transaction ID that created this znode |
| mzxid | long | Transaction ID of last modification |
| pzxid | long | Transaction ID of last child change |
| ctime | long | Time created (ms since epoch) |
| mtime | long | Time last modified (ms since epoch) |
| version | int | Data version (incremented on setData) |
| cversion | int | Child version (incremented on child add/remove) |
| aversion | int | ACL version (incremented on setACL) |
| ephemeralOwner | long | Session ID of owner (0 if persistent) |
| dataLength | int | Length of data in bytes |
| numChildren | int | Number of children |

Example stat output (zkCli):

```
[zk: localhost:2181(CONNECTED)] stat /myapp/leader
cZxid = 0x300000012
ctime = Wed Mar 15 10:23:45 UTC 2024
mZxid = 0x300000015
mtime = Wed Mar 15 10:24:01 UTC 2024
pZxid = 0x300000012
cversion = 0
dataVersion = 3
aclVersion = 0
ephemeralOwner = 0x18e3a4c2b010001   ← this is an ephemeral node!
dataLength = 24
numChildren = 0
```
How Stat Fields Are Used
- ✅ version: optimistic locking; setData(path, data, version) fails if version doesn't match (CAS)
- ✅ czxid/mzxid: determine ordering, i.e. which node was created/modified first
- ✅ ephemeralOwner: identify who owns a lock or leadership node (see the sketch after this list)
- ✅ cversion: detect child changes without listing all children
- ✅ numChildren: quick check for barrier conditions (are N nodes present?)
- ✅ dataLength: verify data was written (non-zero) without reading it
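A small sketch of reading those fields. The `/myapp/leader` path and the `zk` handle are assumptions; exists() returns the Stat without fetching the data.

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Assumes an already-connected ZooKeeper handle `zk`; /myapp/leader is hypothetical.
Stat stat = zk.exists("/myapp/leader", false);   // null if the node doesn't exist
if (stat != null) {
    System.out.printf("ephemeral=%b owner-session=0x%x version=%d children=%d bytes=%d%n",
                      stat.getEphemeralOwner() != 0,   // 0 means persistent
                      stat.getEphemeralOwner(),
                      stat.getVersion(),
                      stat.getNumChildren(),
                      stat.getDataLength());
}
```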
Version for Compare-and-Swap
The version field enables optimistic concurrency control. When updating config: (1) Read data + version. (2) Modify locally. (3) Write with expected version. If another client modified it between your read and write, the version won't match and your write fails. Retry from step 1. This prevents lost updates without locks.
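A hedged sketch of that loop, assuming a `zk` handle; the `transform` helper is hypothetical and stands in for whatever local modification you need. BadVersionException is the retry signal.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;
import java.nio.charset.StandardCharsets;

// Assumes an already-connected ZooKeeper handle `zk`.
void updateConfig(ZooKeeper zk, String path) throws Exception {
    while (true) {
        // 1. Read data + version.
        Stat stat = new Stat();
        byte[] current = zk.getData(path, false, stat);

        // 2. Modify locally (hypothetical transformation).
        byte[] modified = transform(current);

        try {
            // 3. Write with the expected version - fails if someone wrote in between.
            zk.setData(path, modified, stat.getVersion());
            return;                                    // success
        } catch (KeeperException.BadVersionException e) {
            // Someone else won the race; retry from step 1.
        }
    }
}

// Hypothetical local modification - replace with real logic.
byte[] transform(byte[] data) {
    return new String(data, StandardCharsets.UTF_8).toUpperCase()
            .getBytes(StandardCharsets.UTF_8);
}
```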
Interview Questions
Q: What are ephemeral nodes and why are they ZooKeeper's most powerful primitive?
A: Ephemeral nodes are automatically deleted when the creating client's session ends (disconnect, crash, or timeout). They're the most powerful primitive because they provide automatic failure detection without any application-level heartbeat logic. Use cases: (1) Service discovery: crash = automatic deregistration. (2) Leader election: leader crash = node deleted = new election. (3) Distributed locks: holder crash = lock auto-released. No other key-value store provides this session-bound lifecycle natively.
Q: Why can't ephemeral nodes have children?
A: If ephemeral nodes could have children, ZooKeeper would need to decide what happens to children when the parent is auto-deleted. Cascading deletes would be surprising and could destroy important data. Leaving orphaned children would break the tree structure. Rather than making either dangerous choice, ZooKeeper simply forbids it. If you need a hierarchy under an ephemeral concept, use a persistent parent with ephemeral children.
Q: How do sequential nodes enable fair distributed locking?
A: Each lock contender creates an EPHEMERAL_SEQUENTIAL node under a lock path (e.g., /locks/resource/lock-0000000001). The node with the lowest sequence number holds the lock. Others watch the node immediately before them (not the lock holder, which avoids the herd effect). When a node is deleted (lock released or session expired), only the next waiter is notified. This creates a fair FIFO queue where processes acquire the lock in the order they requested it.
Q: What is the Stat structure and how is the version field used for optimistic concurrency?
A: Stat is metadata attached to every znode: czxid (creation txn), mzxid (last modification txn), version (data version), cversion (children version), ephemeralOwner (session ID), etc. The version field enables CAS (compare-and-swap): setData(path, data, expectedVersion) atomically fails if the current version doesn't match. This prevents lost updates: read version, modify locally, write with version check. If someone else modified it, your write fails and you retry.
Q: What's the 1MB data limit and why does it exist?
A: Each znode can store at most 1MB of data (configurable via jute.maxbuffer). The limit exists because: (1) All znode data is stored in memory on every server in the ensemble. (2) Every write replicates the full data to all servers via Zab. (3) Large znodes would increase memory usage, network traffic, and snapshot time. In practice, coordination metadata (leader IDs, config JSON, service addresses) is bytes to kilobytes. If you need more than 1MB, you're misusing ZooKeeper; use a database instead.
Common Mistakes
Storing large data in znodes
Putting serialized objects, JSON blobs, or binary data approaching the 1MB limit. This increases memory pressure on all ensemble nodes and slows down snapshots.
✅ Keep znode data small (bytes to low KB). Store only references (IDs, addresses, small config). Put large data in a database and store the pointer in ZooKeeper.
Trying to create children under ephemeral nodes
Attempting to build a hierarchy under an ephemeral node for service metadata. ZooKeeper will reject this with NoChildrenForEphemeralsException.
✅ Use a persistent parent node with ephemeral children. Example: /services/payment (persistent) with /services/payment/instance-1 (ephemeral).
Not cleaning up persistent nodes
Creating persistent nodes for temporary purposes (job locks, task queues) without cleanup. Over time, thousands of abandoned nodes accumulate, slowing getChildren and increasing memory.
✅ Use container nodes (3.5+) for parent nodes that should auto-delete when empty. For temporary data, use TTL nodes (3.6+). Otherwise, implement explicit cleanup in your application.
Ignoring the version field for updates
Using setData(path, data, -1), which skips version checking. This means concurrent updates silently overwrite each other: last writer wins with no conflict detection.
✅ Always read the current version first, then use setData(path, data, expectedVersion). Handle BadVersionException by re-reading and retrying. This is your CAS primitive; don't bypass it.
Creating too many znodes
Using ZooKeeper as a queue with millions of sequential nodes, or creating a znode per user/request. ZooKeeper keeps all nodes in memory: millions of nodes = GBs of RAM on every server.
✅ ZooKeeper should have thousands to low hundreds of thousands of znodes, not millions. For queues, use Kafka. For per-user data, use a database. ZooKeeper is for coordination metadata only.