Kafka Brokers & Replication
The physical layer — how Kafka stores data across brokers, replicates for durability, and survives failures through leader election and ISR management.
Brokers & Cluster Topology
A broker is a single Kafka server. It receives records from producers, stores them on disk, and serves them to consumers. A Kafka cluster is a group of brokers working together, with partitions distributed across them.
```
Broker 1 (id=1)         Broker 2 (id=2)         Broker 3 (id=3)
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ P0 (Leader)      │    │ P0 (Follower)    │    │ P0 (Follower)    │
│ P1 (Follower)    │    │ P1 (Leader)      │    │ P1 (Follower)    │
│ P2 (Follower)    │    │ P2 (Follower)    │    │ P2 (Leader)      │
│ P3 (Leader)      │    │ P3 (Follower)    │    │ P3 (Follower)    │
│ P4 (Follower)    │    │ P4 (Leader)      │    │ P4 (Follower)    │
│ P5 (Follower)    │    │ P5 (Follower)    │    │ P5 (Leader)      │
└──────────────────┘    └──────────────────┘    └──────────────────┘
```

Key points:
- Each partition has exactly ONE leader and (RF-1) followers
- Leaders handle ALL reads and writes for their partition
- Followers only replicate data from the leader (fetch requests)
- Partitions are distributed to balance load across brokers
- If Broker 2 dies: P1 and P4 elect new leaders from the remaining ISR
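The round-robin layout in the diagram can be sketched as a small simulation. This is an illustrative model (the `assign_partitions` helper is hypothetical), not Kafka's actual assignment algorithm, which also accounts for racks and existing load:

```python
def assign_partitions(num_partitions: int, broker_ids: list[int], rf: int):
    """Round-robin replica assignment: partition p's leader is
    broker_ids[p % n], and its rf-1 followers are the next brokers
    in the ring. This spreads leadership evenly across brokers."""
    assignment = {}
    n = len(broker_ids)
    for p in range(num_partitions):
        replicas = [broker_ids[(p + j) % n] for j in range(rf)]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment

layout = assign_partitions(6, [1, 2, 3], rf=3)
print(layout[0])  # {'leader': 1, 'followers': [2, 3]}
print(layout[4])  # {'leader': 2, 'followers': [3, 1]}
```

With 6 partitions and 3 brokers, each broker ends up leading exactly 2 partitions, matching the diagram above.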
The Office Building
A Kafka cluster is like an office building with multiple floors (brokers). Each floor has filing cabinets (partitions). For each document type, one floor is the 'primary office' (leader) where all new documents arrive. The other floors keep photocopies (replicas). If a floor floods (broker failure), the primary office moves to another floor that has up-to-date copies.
What Each Broker Manages
- ✅ A subset of partition replicas (both leaders and followers)
- ✅ Disk storage for log segments of its assigned partitions
- ✅ Network connections from producers and consumers
- ✅ Fetch requests from follower replicas replicating data
- ✅ Heartbeats to the controller for liveness detection
Partition Leadership
Every partition has exactly one leader replica and zero or more follower replicas. The leader handles ALL read and write requests for that partition. Followers exist solely for durability — they replicate data from the leader via fetch requests.
```
Normal operation:
  Producer → writes to  → Partition 0 Leader (Broker 1)
  Consumer → reads from → Partition 0 Leader (Broker 1)
  Broker 2 (follower) → fetches from → Broker 1 (leader)
  Broker 3 (follower) → fetches from → Broker 1 (leader)

Broker 1 fails:
  1. Controller detects Broker 1 is unresponsive (session timeout)
  2. Controller checks ISR for Partition 0: [Broker 2, Broker 3]
  3. Controller elects Broker 2 as the new leader (first in ISR)
  4. Controller updates metadata and notifies all brokers
  5. Producers/consumers discover the new leader via metadata refresh
  6. Broker 2 now serves reads and writes for Partition 0
```

Preferred leader election:
- Each partition has a "preferred leader" (the first replica in the assignment)
- Kafka periodically rebalances leadership back to preferred leaders
- This prevents all leaders from accumulating on one broker after failures
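The failover steps above can be sketched as a toy controller. This is a simplified model of steps 2-3 only, not Kafka's controller implementation:

```python
def elect_leader(isr: list[int], failed_broker: int):
    """On broker failure, remove it from the ISR and promote the
    first surviving ISR member. If no in-sync replica survives, the
    partition goes offline (with unclean leader election disabled)."""
    surviving = [b for b in isr if b != failed_broker]
    if not surviving:
        return None  # no eligible leader: partition is unavailable
    return surviving[0]

# Partition 0: leader was Broker 1, ISR = [1, 2, 3]
print(elect_leader([1, 2, 3], failed_broker=1))  # 2 — first remaining ISR member
print(elect_leader([1], failed_broker=1))        # None — partition offline
```

Restricting the candidate pool to the ISR is what guarantees the new leader has every committed record.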
🔑 Why Only Leaders Serve Reads
Unlike databases where read replicas serve queries, Kafka followers do NOT serve consumer reads (by default). This simplifies consistency — consumers always see the latest committed data. Kafka 2.4+ introduced follower fetching for rack-aware consumers, but the leader remains the source of truth.
Replication
The replication factor determines how many copies of each partition exist across the cluster. A replication factor of 3 means each partition has 1 leader + 2 followers = 3 total copies. This is the standard production configuration.
```
Topic: "payments", Partitions: 3, Replication Factor: 3

  Partition 0: Leader=B1, Followers=[B2, B3]
  Partition 1: Leader=B2, Followers=[B1, B3]
  Partition 2: Leader=B3, Followers=[B1, B2]
```

How followers replicate:
1. The follower sends a FetchRequest to the leader (same protocol as consumers)
2. The leader responds with new records since the follower's last fetch
3. The follower appends the records to its local log
4. The follower advances its "log end offset" (LEO)

Replication lag:
- The difference between the leader's LEO and a follower's LEO
- If a follower has not fully caught up within replica.lag.time.max.ms, it is removed from the ISR
- Common causes: slow disk, network congestion, GC pauses on the follower

Rack awareness (recommended for production):
- Spread replicas across different racks/availability zones
- Set broker.rack=us-east-1a in each broker's config
- Kafka places replicas on different racks automatically
- Survives an entire rack/AZ failure
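The rack-aware and replication settings above map to broker configuration like this. A representative `server.properties` fragment — the rack name and values are placeholders to adapt to your environment:

```properties
# server.properties (set on each broker)
broker.rack=us-east-1a           # this broker's rack / availability zone
default.replication.factor=3     # copies per partition for auto-created topics
min.insync.replicas=2            # minimum ISR size required when acks=all
replica.lag.time.max.ms=30000    # lag threshold before ISR eviction (default)
```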
| Replication Factor | Fault Tolerance | Storage Cost | Use Case |
|---|---|---|---|
| RF=1 | None — single point of failure | 1x | Development/testing only |
| RF=2 | Survives 1 broker failure | 2x | Non-critical data with cost constraints |
| RF=3 (standard) | Survives 1 broker failure with acks=all | 3x | Production — the default recommendation |
In-Sync Replicas (ISR)
The ISR (In-Sync Replica set) is the subset of replicas that are "caught up" with the leader — meaning they have replicated all records within replica.lag.time.max.ms (default 30s). The ISR is the key to Kafka's durability guarantees.
```
Partition 0: Leader=B1, All Replicas=[B1, B2, B3]

Normal state (all caught up):
  ISR = [B1, B2, B3]   ← all replicas in sync

B3 falls behind (slow disk, network issue):
  ISR = [B1, B2]       ← B3 removed from ISR
  (B3 keeps fetching, trying to catch up)

B3 catches up (within replica.lag.time.max.ms):
  ISR = [B1, B2, B3]   ← B3 re-added to ISR
```

The critical config: min.insync.replicas
- The minimum number of replicas that must be in the ISR for a write to succeed
- Only matters when acks=all
- If the ISR shrinks below min.insync.replicas, the producer gets NotEnoughReplicasException

The golden combination for production:

```
replication.factor = 3
min.insync.replicas = 2
acks = all
```

This guarantees:
- Every acknowledged write exists on at least 2 brokers
- The cluster survives 1 broker failure without data loss
- If 2 brokers fail, writes are rejected (availability is sacrificed for durability)
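The shrink/re-expand cycle can be modeled in a few lines. This is a toy model that tracks only the timestamp of each follower's last full catch-up, not the broker's real bookkeeping:

```python
MAX_LAG_MS = 30_000  # replica.lag.time.max.ms (default 30s)

def current_isr(leader: int, last_caught_up_ms: dict[int, int], now_ms: int):
    """A follower stays in the ISR while it has fully caught up to the
    leader within the last replica.lag.time.max.ms. The leader is
    always a member of its own ISR."""
    isr = [leader]
    for replica, ts in last_caught_up_ms.items():
        if replica != leader and now_ms - ts <= MAX_LAG_MS:
            isr.append(replica)
    return sorted(isr)

# B3's last full catch-up was 45s ago → evicted; B2 is 1s behind → stays
print(current_isr(1, {2: 99_000, 3: 55_000}, now_ms=100_000))  # [1, 2]
# Both followers within the window → full ISR
print(current_isr(1, {2: 99_000, 3: 95_000}, now_ms=100_000))  # [1, 2, 3]
```

Membership is recomputed continuously, so a follower that catches back up rejoins the ISR automatically.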
The Witness Signatures
Think of ISR as witnesses to a legal document. The leader writes the document (record). min.insync.replicas=2 means at least 2 witnesses (including the leader) must sign before the document is considered official. If only 1 witness is available, the notary (broker) refuses to process the document — better to reject than to have an unverifiable record.
Acknowledgement Modes (acks)
The acks producer configuration controls how many replicas must acknowledge a write before the producer considers it successful. This is the primary knob for trading latency against durability.
| acks | Behavior | Durability | Latency | Use Case |
|---|---|---|---|---|
| acks=0 | Fire and forget — don't wait for any acknowledgment | None — data may be lost | Lowest | Metrics, logs where occasional loss is acceptable |
| acks=1 | Leader acknowledges — followers not guaranteed | Medium — lost if leader dies before replication | Low | Most use cases where some risk is acceptable |
| acks=all (-1) | All ISR replicas must acknowledge | Highest — survives any single broker failure | Higher | Financial transactions, critical events |
Scenario: RF=3, min.insync.replicas=2, acks=all

Write arrives at the leader:
1. The leader appends the record to its local log
2. The leader waits for all current ISR members to replicate it; min.insync.replicas=2 guarantees the ISR holds at least 2 replicas (counting the leader)
3. Once the ISR has the data → acknowledge to the producer
4. A replica that has fallen out of the ISR catches up asynchronously

What this guarantees:
✓ Data exists on at least 2 brokers before the producer gets a success
✓ If the leader dies immediately after the ack, the data is on at least 1 follower
✓ No data loss for any single broker failure

What this does NOT guarantee:
✗ If 2 brokers die simultaneously, data may be lost
✗ Duplicates are not prevented (use the idempotent producer for that)

Failure scenarios:

```
ISR=[B1, B2, B3] → write succeeds (3 >= 2)
ISR=[B1, B2]     → write succeeds (2 >= 2)
ISR=[B1]         → write FAILS with NotEnoughReplicasException (1 < 2)
                   The producer must retry or handle the error
```
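The accept/reject decision in the failure scenarios above reduces to a simple size check. A sketch — `NotEnoughReplicasError` here is a Python stand-in for Kafka's NotEnoughReplicasException:

```python
class NotEnoughReplicasError(Exception):
    """Stand-in for Kafka's NotEnoughReplicasException."""

def handle_write(isr_size: int, min_insync_replicas: int = 2) -> str:
    """With acks=all, the leader rejects a write when the current ISR
    is smaller than min.insync.replicas, rather than acknowledging a
    record that exists on too few brokers."""
    if isr_size < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR size {isr_size} < min.insync.replicas {min_insync_replicas}")
    return "ack"  # record is replicated on at least min_insync_replicas brokers

print(handle_write(3))  # ack
print(handle_write(2))  # ack
try:
    handle_write(1)
except NotEnoughReplicasError as e:
    print("rejected:", e)
```

In a real producer this surfaces as a retriable error; with retries enabled the write succeeds once the ISR recovers.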
⚠️ acks=all Without min.insync.replicas=2 Loses Its Guarantee
If min.insync.replicas=1 (the default), acks=all still waits for the current ISR — but the ISR is allowed to shrink to the leader alone, at which point the guarantee silently degrades to acks=1. Always set min.insync.replicas=2 when using acks=all in production.
Unclean Leader Election
When all ISR replicas are unavailable, Kafka faces a choice: remain unavailable (no leader) or elect an out-of-sync replica as leader (potentially losing data). This is controlled by unclean.leader.election.enable.
| Setting | Behavior | Trade-off |
|---|---|---|
| unclean.leader.election.enable=false (default since 0.11) | Partition stays offline until an ISR replica recovers | Durability over availability — no data loss, but the partition is unavailable |
| unclean.leader.election.enable=true | Out-of-sync replica becomes leader; records it missed are lost | Availability over durability — partition stays online but may lose recent data |
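The trade-off can be made concrete with a toy model of replica logs (hypothetical broker ids and log contents, chosen for illustration):

```python
def pick_leader(replica_logs: dict[int, list], isr: list[int],
                alive: list[int], unclean: bool):
    """Prefer a live ISR replica. With unclean election enabled, fall
    back to any live replica — the partition's history is then
    truncated to whatever that replica managed to replicate."""
    live_isr = [b for b in isr if b in alive]
    if live_isr:
        return live_isr[0], replica_logs[live_isr[0]]
    if unclean and alive:
        leader = alive[0]
        return leader, replica_logs[leader]  # records it missed are gone
    return None, None  # partition offline (the default behavior)

logs = {1: ["a", "b", "c", "d"], 2: ["a", "b", "c", "d"], 3: ["a", "b"]}
# Both ISR members (1 and 2) are down; only the lagging replica 3 survives.
print(pick_leader(logs, isr=[1, 2], alive=[3], unclean=True))
# (3, ['a', 'b']) — records c and d are permanently lost
print(pick_leader(logs, isr=[1, 2], alive=[3], unclean=False))
# (None, None) — partition stays offline, no data lost
```

The enabled case trades records "c" and "d" for immediate availability, which is exactly the choice the table above describes.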
When to Enable Unclean Leader Election
- ✅ Metrics/logging topics where availability matters more than losing a few records
- ✅ Non-critical data where downtime is more costly than occasional data loss
When to Keep It Disabled (Default)
- ❌ Financial transactions — losing a payment event is unacceptable
- ❌ Event sourcing — gaps in the event log corrupt downstream state
- ❌ Any topic where data loss has business consequences
ZooKeeper vs KRaft
Historically, Kafka depended on Apache ZooKeeper for cluster metadata management: controller election, broker registration, topic configuration, and ISR tracking. KRaft (Kafka Raft) removes this dependency by having Kafka manage its own metadata using a Raft-based consensus protocol.
| Aspect | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Metadata storage | External ZooKeeper ensemble (3-5 nodes) | Internal Kafka controller quorum (3 nodes) |
| Controller election | ZooKeeper ephemeral nodes + watches | Raft leader election among controller nodes |
| Operational complexity | Two separate systems to manage, monitor, upgrade | Single system — Kafka manages everything |
| Partition limit | ~200K partitions (ZooKeeper becomes bottleneck) | Millions of partitions (metadata in Raft log) |
| Recovery time | Slow controller failover (ZK session timeout + full metadata reload) | Fast failover (Raft leader election, metadata already replicated) |
| Status | Deprecated since 3.5; removed in Kafka 4.0 | Production-ready since Kafka 3.3, default since 3.5 |
KRaft Architecture:

```
┌─────────────────────────────────────────────────────┐
│             Controller Quorum (3 nodes)             │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐      │
│  │Controller │   │Controller │   │Controller │      │
│  │ (Active)  │   │ (Standby) │   │ (Standby) │      │
│  │  Leader   │   │ Follower  │   │ Follower  │      │
│  └───────────┘   └───────────┘   └───────────┘      │
└─────────────────────────────────────────────────────┘
             │ Metadata Log (Raft replicated)
             │
┌───────────┐   ┌───────────┐   ┌───────────┐
│ Broker 1  │   │ Broker 2  │   │ Broker 3  │
│  (data)   │   │  (data)   │   │  (data)   │
└───────────┘   └───────────┘   └───────────┘
```

Key improvements:
- No external dependency (ZooKeeper is gone)
- Metadata replicated via Raft — fast failover
- Controllers can also serve as brokers (combined mode)
- Scales to millions of partitions
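A minimal KRaft combined-mode configuration looks like this. Representative values only — node ids, host names, and ports are placeholders:

```properties
# server.properties for a KRaft node running as both broker and controller
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
listeners=PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093
controller.listener.names=CONTROLLER
```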
💡 Migration Path
Existing clusters can migrate from ZooKeeper to KRaft without downtime using the bridge release process. New clusters should always use KRaft mode. ZooKeeper support was deprecated in Kafka 3.5 and removed in Kafka 4.0.
Interview Questions
Q: What happens when a Kafka broker fails?
A: The controller detects the failure via missed heartbeats. For each partition where the failed broker was leader, the controller elects a new leader from the ISR (in-sync replicas). It updates cluster metadata and notifies all brokers. Producers and consumers discover the new leaders via metadata refresh and redirect traffic. If the failed broker was only a follower, no leader election is needed — it's simply removed from the ISR until it recovers.
Q: Explain the relationship between replication factor, min.insync.replicas, and acks.
A: Replication factor (RF) determines how many copies of each partition exist. min.insync.replicas (MIR) sets the minimum ISR size required for writes when acks=all. acks controls how many replicas must acknowledge before the producer gets success. The golden production config is RF=3, MIR=2, acks=all: every write is confirmed on 2 brokers, surviving any single failure. If ISR drops below MIR, writes are rejected to prevent under-replicated data.
Q: Why did Kafka move from ZooKeeper to KRaft?
A: ZooKeeper added operational complexity (separate cluster to manage), limited partition scalability (~200K), caused slow controller failover (full metadata reload), and created a split-brain risk. KRaft eliminates the external dependency by using Raft consensus within Kafka itself. Benefits: single system to operate, faster failover, millions of partitions, and simpler deployment.
Q: What is the ISR and why does it matter?
A: ISR (In-Sync Replicas) is the set of replicas caught up with the leader within replica.lag.time.max.ms. It matters because: (1) Only ISR members are eligible for leader election (preventing data loss). (2) acks=all waits for all ISR members to acknowledge. (3) min.insync.replicas enforces a minimum ISR size for writes. The ISR is Kafka's mechanism for balancing durability and availability dynamically.
Common Mistakes
acks=all with default min.insync.replicas
Setting acks=all but leaving min.insync.replicas=1 (the default). The ISR may then shrink to just the leader, at which point acks=all degrades to acks=1 — the ack no longer proves any follower has the data.
✅ Always pair acks=all with min.insync.replicas=2. This ensures at least 2 replicas confirm every write, giving a real durability guarantee.
Replication factor = 1 in production
Running production topics with RF=1 to save storage costs. A single disk failure permanently loses all data for that partition with no recovery possible.
✅ Always use RF=3 in production. Storage is cheap; permanent data loss is not.
Ignoring under-replicated partitions
Not monitoring the UnderReplicatedPartitions metric — letting the ISR shrink silently until a second failure causes data loss.
✅ Alert on UnderReplicatedPartitions > 0. This is the earliest signal of broker health issues, disk problems, or network congestion. A shrinking ISR means you are one failure away from data loss.
All replicas on the same rack
Not configuring broker.rack — all replicas may land on the same physical rack. A single rack failure takes out all copies of a partition.
✅ Configure broker.rack for each broker. Kafka will spread replicas across racks automatically, surviving entire rack/AZ failures.