ZooKeeper vs Alternatives
etcd, Consul, and Apache Curator — how they compare to ZooKeeper, when to use each, and how to discuss coordination systems correctly in system design interviews.
etcd
etcd is a distributed key-value store written in Go, originally built by CoreOS and now a graduated CNCF project. It uses the Raft consensus protocol and is the backbone of Kubernetes. It provides coordination primitives similar to ZooKeeper's, but with a more modern API and a simpler operational model.
| Property | ZooKeeper | etcd |
|---|---|---|
| Language | Java | Go |
| Consensus | Zab (custom) | Raft |
| API | Custom TCP protocol | gRPC + HTTP/JSON |
| Data model | Hierarchical tree (znodes) | Flat key-value (with prefix ranges) |
| Watch model | One-time triggers (persistent in 3.6+) | Streaming watches with revision history |
| Lease (ephemeral) | Session-based ephemeral nodes | TTL-based leases attached to keys |
| Transactions | multi() operation | Mini-transactions (if/then/else) |
| Linearizable reads | Requires sync() | Default (serializable reads are opt-in) |
| Primary user | Kafka, HBase, Hadoop | Kubernetes |
| Operational complexity | Higher (JVM, GC tuning) | Lower (single binary, no JVM) |
etcd Key Differences:

1. Flat namespace (not hierarchical)

ZooKeeper: /services/payment/instance-1 (tree structure)
etcd: key="/services/payment/instance-1" (flat, but prefix queries work)

etcd range query: get all keys with prefix "/services/payment/"
→ Functionally similar to getChildren, but flat

2. Streaming watches (not one-time)

ZooKeeper: watch fires once, must re-register
etcd: watch is a stream — keeps sending events until cancelled

etcd also provides revision-based watches: "Watch key X starting from revision 42"
→ Never miss events, even if the client was disconnected

3. Leases instead of sessions

ZooKeeper: ephemeral nodes tied to client session
etcd: leases with explicit TTL, attached to one or more keys

```
// Create a lease (like a session)
lease = client.lease.grant(ttl=30)

// Attach keys to the lease
client.put("/services/payment/inst-1", "10.0.1.5", lease=lease)

// Keep alive (like heartbeat)
client.lease.keepAlive(lease)

// If keepAlive stops → lease expires → keys deleted
```

4. Mini-transactions (more powerful than multi)

```
// Atomic compare-and-swap with conditions
client.txn()
    .if(key("/leader").value == "")
    .then(put("/leader", myId, lease))
    .else(get("/leader"))
    .commit()
```
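The lease semantics in point 3 can be sketched as a toy model (plain Python, no etcd dependency; the `LeaseStore` class and its methods are invented for illustration, not the real client API). Keys attached to a lease survive only as long as keep-alives continue:

```python
class LeaseStore:
    """Toy model of etcd's lease semantics (not a real client)."""
    def __init__(self):
        self.kv = {}      # key -> (value, lease_id or None)
        self.leases = {}  # lease_id -> expiry timestamp

    def grant(self, lease_id, ttl, now):
        self.leases[lease_id] = now + ttl

    def put(self, key, value, lease=None):
        self.kv[key] = (value, lease)

    def keep_alive(self, lease_id, ttl, now):
        # Refreshes the lease expiry, like etcd's KeepAlive heartbeat
        self.leases[lease_id] = now + ttl

    def tick(self, now):
        # Expire overdue leases and delete their keys
        # (etcd does this server-side)
        expired = {lid for lid, exp in self.leases.items() if exp <= now}
        for lid in expired:
            del self.leases[lid]
        self.kv = {k: v for k, v in self.kv.items() if v[1] not in expired}

    def get(self, key):
        entry = self.kv.get(key)
        return entry[0] if entry else None

store = LeaseStore()
store.grant("L1", ttl=30, now=0)
store.put("/services/payment/inst-1", "10.0.1.5", lease="L1")
store.keep_alive("L1", ttl=30, now=20)  # heartbeat at t=20 pushes expiry to t=50
store.tick(now=45)
print(store.get("/services/payment/inst-1"))  # 10.0.1.5 (lease still live)
store.tick(now=60)  # no more keep-alives, lease expired at t=50
print(store.get("/services/payment/inst-1"))  # None (key was deleted)
```

This is the whole point of leases: registration and liveness are coupled, so a crashed service deregisters itself automatically.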
etcd's Raft vs ZooKeeper's Zab
Both provide the same safety guarantees (linearizable writes, leader-based replication). The practical difference: Raft is simpler to understand and has more implementations. Zab was designed specifically for ZooKeeper's primary-backup model. For users, the choice between them is about ecosystem, not consensus protocol quality.
Apache Curator
Apache Curator is NOT an alternative to ZooKeeper — it's a high-level Java client library that makes ZooKeeper easier to use. It provides pre-built "recipes" for common patterns (leader election, locks, barriers) so you don't have to implement them from scratch.
The Recipe Book
If ZooKeeper is a kitchen with raw ingredients (znodes, watches, ephemeral nodes), Curator is the recipe book that tells you exactly how to combine them into dishes (leader election, locks, service discovery). You still use ZooKeeper underneath — Curator just handles the complex patterns correctly so you don't have to.
Curator Recipes
- ✅ LeaderLatch / LeaderSelector — leader election with automatic failover
- ✅ InterProcessMutex — reentrant distributed lock with fair queuing
- ✅ InterProcessReadWriteLock — concurrent readers, exclusive writers
- ✅ ServiceDiscovery — service registration and discovery with JSON payloads
- ✅ PathChildrenCache / TreeCache — local cache of a ZK subtree with automatic sync
- ✅ DistributedBarrier / DistributedDoubleBarrier — synchronization barriers
- ✅ DistributedQueue / DistributedPriorityQueue — ordered task distribution
- ✅ NodeCache — watch a single node with automatic re-registration
Curator Leader Election (vs raw ZooKeeper)

Raw ZooKeeper (50+ lines, error-prone):
- Create an ephemeral sequential node
- getChildren, sort, check if lowest
- Watch the previous node
- Handle watch re-registration
- Handle session expiry
- Handle edge cases (node deleted before watch set)

Curator (a few lines, battle-tested):

```java
LeaderLatch latch = new LeaderLatch(client, "/election", myId);
latch.addListener(new LeaderLatchListener() {
    public void isLeader() { startLeaderDuties(); }
    public void notLeader() { stopLeaderDuties(); }
});
latch.start();
```

Curator Distributed Lock (note that the timed acquire returns a boolean rather than throwing, so the result must be checked):

```java
InterProcessMutex lock = new InterProcessMutex(client, "/locks/resource");
if (lock.acquire(30, TimeUnit.SECONDS)) {  // blocks until acquired or timeout
    try {
        // critical section
    } finally {
        lock.release();
    }
}
```

Curator Service Discovery:

```java
ServiceDiscovery<InstanceDetails> discovery = ServiceDiscoveryBuilder
    .builder(InstanceDetails.class)
    .client(client)
    .basePath("/services")
    .thisInstance(myInstance)
    .build();
discovery.start();  // registers this instance

// Other services query:
discovery.queryForInstances("payment");
```
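The "raw ZooKeeper" checklist above is the classic ephemeral-sequential election recipe that LeaderLatch implements. Its core logic, lowest sequence number wins, can be sketched as a toy model (plain Python, no real ensemble; `ToyElection` is invented for illustration and ignores the watch and session-handling machinery that makes the real recipe long):

```python
class ToyElection:
    """Toy model of the ephemeral-sequential election recipe
    that Curator's LeaderLatch builds on (no real ZooKeeper)."""
    def __init__(self):
        self.seq = 0
        self.nodes = {}  # znode name -> client id

    def join(self, client_id):
        # Like create("/election/n-", EPHEMERAL_SEQUENTIAL):
        # the server assigns a monotonically increasing suffix
        name = f"n-{self.seq:010d}"
        self.seq += 1
        self.nodes[name] = client_id
        return name

    def leader(self):
        # The client holding the lowest sequence number is leader
        return self.nodes[min(self.nodes)] if self.nodes else None

    def session_expired(self, name):
        # Ephemeral node vanishes when its session dies; in the real
        # recipe, only the next node in line is watching it (no herd)
        del self.nodes[name]

e = ToyElection()
a = e.join("A"); b = e.join("B"); c = e.join("C")
print(e.leader())     # A
e.session_expired(a)  # leader crashes, its ephemeral node vanishes
print(e.leader())     # B takes over automatically
```

What the sketch deliberately leaves out (watch re-registration, session expiry, races between delete and watch) is exactly the error-prone part Curator handles for you.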
Always Use Curator for Java
If you're using ZooKeeper from Java, always use Curator instead of the raw ZooKeeper client. The raw client requires you to handle connection management, watch re-registration, session expiry, and all edge cases yourself. Curator handles all of this correctly and has been battle-tested in production for years.
Consul
HashiCorp Consul is a service mesh and coordination platform that combines service discovery, health checking, KV store, and multi-datacenter support into a single tool. It's more opinionated than ZooKeeper — it provides higher-level abstractions out of the box.
| Feature | ZooKeeper | Consul |
|---|---|---|
| Service discovery | DIY with ephemeral nodes | Built-in with health checks |
| Health checking | Session timeout only | HTTP, TCP, script, gRPC checks |
| KV store | Hierarchical znodes | Flat KV with folders |
| DNS interface | ❌ No | ✅ Built-in DNS for service lookup |
| Multi-DC | Complex (observers) | Native multi-DC with WAN gossip |
| Service mesh | ❌ No | ✅ Connect (sidecar proxies, mTLS) |
| Consensus | Zab | Raft |
| Language | Java | Go |
| Lock/leader | DIY recipes | Built-in sessions + locks API |
| UI | ❌ No (third-party) | ✅ Built-in web UI |
Consul's Advantages Over ZooKeeper:

1. Built-in Health Checking

ZooKeeper: session timeout is the only liveness signal
→ Service is "alive" if its session exists (even if it's broken)

Consul: multiple health check types
→ HTTP: GET /health returns 200?
→ TCP: can connect to port 8080?
→ Script: custom check script returns 0?
→ gRPC: health check RPC succeeds?
→ TTL: service must heartbeat within interval

Consul removes unhealthy services from discovery BEFORE they crash.

2. DNS Interface

ZooKeeper: clients need a ZK client library
Consul: services discoverable via DNS
→ dig payment.service.consul
→ Returns healthy instances only
→ Works with any language, no special library needed

3. Multi-Datacenter Native

ZooKeeper: single ensemble, observers in remote DCs
Consul: separate server clusters per DC, WAN gossip between DCs
→ Each DC has its own Raft cluster (fast local writes)
→ Cross-DC queries via WAN federation
→ Prepared queries can fail over across DCs

4. Service Mesh (Connect)

ZooKeeper: coordination only
Consul: also provides service-to-service mTLS, traffic routing, intentions (authorization), and sidecar proxy management
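The health-filtering behavior in point 1 can be modeled in a few lines (plain Python; `ToyCatalog` is an invented stand-in, not the Consul API): discovery queries return only instances whose checks currently pass, so consumers never see a registered-but-broken service.

```python
class ToyCatalog:
    """Toy model of Consul-style discovery: instances with failing
    health checks are filtered out of query results."""
    def __init__(self):
        self.instances = {}  # address -> health check callable

    def register(self, address, check):
        self.instances[address] = check

    def healthy_instances(self):
        # Consul runs HTTP/TCP/script/gRPC checks out of band;
        # here each check is just a callable returning True/False
        return sorted(addr for addr, check in self.instances.items()
                      if check())

catalog = ToyCatalog()
catalog.register("10.0.1.5:8080", lambda: True)   # passing check
catalog.register("10.0.1.6:8080", lambda: False)  # failing check (e.g. HTTP 500)
print(catalog.healthy_instances())  # ['10.0.1.5:8080']
```

With ZooKeeper's session-based liveness, both instances would look "alive" as long as their sessions were open; this is the gap Consul's active checks close.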
Consul is More Than Coordination
Consul is a full service networking platform, not just a coordination service. If you need service discovery + health checks + service mesh + multi-DC, Consul provides all of these in one tool. ZooKeeper only provides the low-level coordination primitives — you build everything else yourself.
When ZooKeeper is Still Right
Despite newer alternatives, ZooKeeper remains the right choice in several scenarios. Its maturity, ecosystem integration, and battle-tested reliability make it irreplaceable for certain use cases.
Choose ZooKeeper When
- ✅ You're running Kafka (pre-KRaft), HBase, Solr, or Hadoop — they require ZooKeeper
- ✅ You're in a Java/JVM ecosystem and want deep integration with Curator recipes
- ✅ You need the hierarchical namespace model for complex coordination patterns
- ✅ You have existing ZooKeeper expertise and infrastructure
- ✅ You need proven stability — ZooKeeper has 15+ years of production hardening
- ✅ Your coordination patterns are complex (read-write locks, barriers, 2PC) and Curator provides them
Don't Migrate Without Reason
If you have a working ZooKeeper deployment, don't migrate to etcd or Consul just because they're newer. Migration is risky and expensive. Only migrate if you have a concrete problem that ZooKeeper can't solve (e.g., you're moving to Kubernetes and want native etcd integration).
When to Use etcd Instead
etcd is the better choice for new systems, especially those in the Kubernetes ecosystem or non-Java environments. Its simpler operational model and modern API make it easier to adopt.
Choose etcd When
- ✅ You're building on Kubernetes — etcd is already there, no additional infrastructure
- ✅ You're not in a Java ecosystem — etcd's gRPC API has excellent Go, Python, and Rust clients
- ✅ You want simpler operations — single Go binary, no JVM tuning, no GC concerns
- ✅ You need linearizable reads by default — etcd provides this without sync()
- ✅ You want streaming watches — etcd watches are persistent streams, not one-time triggers
- ✅ You're building a new system from scratch with no existing ZooKeeper dependency
| Concern | ZooKeeper | etcd |
|---|---|---|
| Deployment | JVM + zoo.cfg + myid + tuning | Single binary + config file |
| GC pauses | Major concern (Java) | Not applicable (Go) |
| Client libraries | Best in Java (Curator) | Best in Go, good everywhere |
| Kubernetes native | Separate deployment | Already running in cluster |
| Learning curve | Higher (Zab, sessions, watches) | Lower (Raft, leases, streams) |
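The streaming-watch advantage listed above can be made concrete with a toy event log (plain Python, no etcd; `ToyWatchLog` is invented for illustration). Every write gets a monotonically increasing revision, and a reconnected client can replay from any past revision, so no events are missed:

```python
class ToyWatchLog:
    """Toy model of etcd's revision-based watch: an append-only
    event log that can be replayed from any revision."""
    def __init__(self):
        self.revision = 0
        self.events = []  # (revision, key, value)

    def put(self, key, value):
        # Every write bumps the cluster-wide revision counter
        self.revision += 1
        self.events.append((self.revision, key, value))
        return self.revision

    def watch_from(self, start_revision, prefix=""):
        # Stream every event at or after start_revision under a prefix,
        # roughly what etcd's Watch with a start revision provides
        return [e for e in self.events
                if e[0] >= start_revision and e[1].startswith(prefix)]

log = ToyWatchLog()
log.put("/services/payment/inst-1", "10.0.1.5")
log.put("/services/payment/inst-2", "10.0.1.6")
log.put("/services/billing/inst-1", "10.0.2.1")

# A client that disconnected after revision 1 replays what it missed:
missed = log.watch_from(2, prefix="/services/payment/")
print(missed)  # [(2, '/services/payment/inst-2', '10.0.1.6')]
```

Contrast this with ZooKeeper's classic one-time watches, where any change between the watch firing and re-registration is simply invisible to the client.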
When to Use Consul Instead
Consul is the right choice when service discovery and health checking are your primary needs, especially in multi-datacenter environments. It provides a complete service networking solution rather than just coordination primitives.
Choose Consul When
- ✅ Service discovery is your primary use case — Consul's health checks are far superior to ZK's session-based liveness
- ✅ You need DNS-based discovery — any service can discover others via DNS without a client library
- ✅ Multi-datacenter is a first-class requirement — Consul's WAN federation is purpose-built for this
- ✅ You want a service mesh — Consul Connect provides mTLS, traffic routing, and authorization
- ✅ You need a web UI for operations — Consul includes a built-in dashboard
- ✅ You're in a polyglot environment — DNS and HTTP APIs work with any language
| Use Case | Best Tool | Why |
|---|---|---|
| Kafka/HBase coordination | ZooKeeper | Required dependency, deep integration |
| Kubernetes coordination | etcd | Already present, native integration |
| Service discovery + health | Consul | Built-in health checks, DNS, multi-DC |
| Complex coordination (locks, barriers) | ZooKeeper + Curator | Most mature recipes |
| Simple leader election | Any of the three | All support it well |
| Service mesh | Consul | Only one with built-in mesh capabilities |
ZooKeeper in System Design Interviews
In system design interviews, ZooKeeper (or "a coordination service like ZooKeeper/etcd") comes up frequently. Knowing when to mention it, how to describe it correctly, and what level of detail to provide is important.
When to Mention ZooKeeper in System Design:

1. Leader Election
"We need exactly one instance processing payments at a time. We'll use ZooKeeper for leader election — ephemeral nodes ensure automatic failover if the leader crashes."

2. Service Discovery
"Services register with ZooKeeper using ephemeral nodes. Consumers watch the service path and get notified instantly when instances join or leave."

3. Distributed Configuration
"Feature flags are stored in ZooKeeper. All services watch the config node and apply changes instantly — no deployment needed, no polling delay."

4. Distributed Locking
"To prevent duplicate payment processing, we use a ZooKeeper distributed lock. Ephemeral nodes ensure the lock is released if the holder crashes."

5. Cluster Membership
"Each node registers an ephemeral node in ZooKeeper. The controller watches the membership path to detect failures and trigger rebalancing."

How to Describe It Correctly:

- ✅ "A distributed coordination service that provides leader election, locks, and service discovery primitives"
- ✅ "A CP system — chooses consistency over availability"
- ✅ "Uses consensus (Zab) to replicate state across an ensemble"
- ❌ "A distributed database" (it's not — 1MB limit, not for app data)
- ❌ "A message queue" (it's not — use Kafka)
- ❌ "Always available" (it's CP — unavailable without quorum)

Level of Detail:

- Mention it exists and what it does: always
- Explain the mechanism (ephemeral nodes, watches): if asked
- Discuss internals (Zab, quorum, zxid): only if specifically asked
Say 'Coordination Service' Not 'ZooKeeper'
In interviews, it's often better to say "we'll use a coordination service like ZooKeeper or etcd" rather than committing to one. This shows you understand the category, not just one tool. Then if asked to go deeper, you can discuss ZooKeeper's specific mechanisms.
Interview Questions
Q: Compare ZooKeeper and etcd. When would you choose each?
A: Both provide distributed coordination with consensus. Key differences: (1) Language/ops: etcd is Go (single binary, no GC), ZK is Java (JVM tuning required). (2) API: etcd uses gRPC/HTTP, ZK uses custom protocol. (3) Watches: etcd has streaming watches (persistent), ZK has one-time triggers (persistent in 3.6+). (4) Reads: etcd provides linearizable reads by default, ZK requires sync(). Choose ZK: existing Kafka/HBase/Hadoop dependency, Java ecosystem, complex Curator recipes. Choose etcd: Kubernetes environment, non-Java, new systems, simpler operations.
Q: What is Consul and how does it differ from ZooKeeper?
A: Consul is a full service networking platform (not just coordination). It includes: service discovery with health checks (HTTP, TCP, script — not just session liveness), DNS interface (no client library needed), multi-DC native (WAN gossip federation), KV store, and service mesh (mTLS, traffic routing). ZooKeeper provides only low-level coordination primitives — you build everything else yourself. Choose Consul when service discovery + health checking is primary, you need DNS-based discovery, or you want multi-DC with minimal effort. Choose ZK for complex coordination patterns or existing dependencies.
Q: What is Apache Curator and why would you use it?
A: Curator is a high-level Java client library for ZooKeeper (NOT an alternative). It provides pre-built, battle-tested 'recipes' for common patterns: LeaderLatch (leader election), InterProcessMutex (distributed locks), ServiceDiscovery, PathChildrenCache (local ZK subtree cache), DistributedBarrier, etc. Use it because: (1) Raw ZK client requires handling connection management, watch re-registration, session expiry, and edge cases manually. (2) Curator handles all this correctly. (3) Recipes are production-tested. Rule: if you're using ZK from Java, always use Curator.
Q: Is ZooKeeper still relevant? Isn't Kafka removing its ZooKeeper dependency?
A: ZooKeeper is still relevant but its role is narrowing. Kafka's KRaft mode (internal Raft, production-ready since Kafka 3.3) eliminates the ZK dependency — but KRaft provides the same coordination primitives internally. HBase, Solr, and Hadoop YARN still require ZK. For new systems, etcd (Kubernetes) or Consul (service discovery) are often better choices. ZK remains relevant for: existing deployments, Java ecosystems with Curator, and systems that specifically require it. The coordination PRIMITIVES (leader election, locks, discovery) are eternal — only the implementation choice changes.
Q: In a system design interview, when should you mention ZooKeeper?
A: Mention it (or 'a coordination service') when you need: (1) Leader election — exactly one process performing a role. (2) Service discovery — knowing which instances are alive. (3) Distributed locking — mutual exclusion across machines. (4) Configuration management — push-based config updates. (5) Cluster membership — detecting node failures. Say 'coordination service like ZooKeeper or etcd' to show category understanding. Describe the mechanism (ephemeral nodes, watches) only if asked. Never call it a database, message queue, or cache. Emphasize it's CP (consistent, not always available).
Common Mistakes
Migrating from ZooKeeper without a concrete reason
Replacing a working ZooKeeper deployment with etcd or Consul just because they're newer. Migration is risky, expensive, and introduces new failure modes.
✅ Only migrate if you have a specific problem ZK can't solve (Kubernetes native integration, non-Java ecosystem, operational complexity). If ZK is working, keep it.
Confusing Curator with a ZooKeeper alternative
Thinking Curator replaces ZooKeeper. Curator is a client library that runs ON TOP of ZooKeeper — you still need a ZK ensemble.
✅ Curator = high-level Java client for ZK. etcd and Consul = actual alternatives to ZK. If someone asks 'what's your coordination service?', the answer is ZooKeeper (with Curator as the client library), not Curator alone.
Using ZooKeeper when Consul would be simpler
Building service discovery with ephemeral nodes + custom health checking + DNS integration on top of ZooKeeper, when Consul provides all of this out of the box.
✅ If your primary need is service discovery with health checks, DNS interface, and multi-DC — use Consul. It's purpose-built for this. Use ZooKeeper for lower-level coordination (complex locks, barriers, 2PC) or when required by existing systems.
Describing ZooKeeper incorrectly in interviews
Calling ZooKeeper a 'distributed database', 'message queue', or saying it's 'always available'. These show fundamental misunderstanding.
✅ Correct description: 'A distributed coordination service that provides primitives like leader election, locks, and service discovery. It's a CP system — chooses consistency over availability. Uses consensus (Zab) across an ensemble of servers.'
Choosing etcd just because it's in Kubernetes
Using the Kubernetes etcd cluster for application coordination. The K8s etcd is sized and tuned for K8s API server load — adding application traffic can destabilize the cluster.
✅ If you need etcd for application coordination in a K8s environment, deploy a separate etcd cluster dedicated to your application. Never share the K8s control plane etcd with application workloads.