
Kafka Brokers & Replication

The physical layer — how Kafka stores data across brokers, replicates for durability, and survives failures through leader election and ISR management.

01

Brokers & Cluster Topology

A broker is a single Kafka server. It receives records from producers, stores them on disk, and serves them to consumers. A Kafka cluster is a group of brokers working together, with partitions distributed across them.

Cluster Topology — 3 Brokers, 1 Topic, 6 Partitions, RF=3

Broker 1 (id=1)          Broker 2 (id=2)          Broker 3 (id=3)
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ P0 (Leader)      │    │ P0 (Follower)    │    │ P0 (Follower)    │
│ P1 (Follower)    │    │ P1 (Leader)      │    │ P1 (Follower)    │
│ P2 (Follower)    │    │ P2 (Follower)    │    │ P2 (Leader)      │
│ P3 (Leader)      │    │ P3 (Follower)    │    │ P3 (Follower)    │
│ P4 (Follower)    │    │ P4 (Leader)      │    │ P4 (Follower)    │
│ P5 (Follower)    │    │ P5 (Follower)    │    │ P5 (Leader)      │
└──────────────────┘    └──────────────────┘    └──────────────────┘

Key points:
  - Each partition has exactly ONE leader and (RF-1) followers
  - Leaders handle ALL reads and writes for their partition
  - Followers only replicate data from the leader (fetch requests)
  - Partitions are distributed to balance load across brokers
  - If Broker 2 dies: P1 and P4 elect new leaders from remaining ISR
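The layout above can be approximated with a simple round-robin assignment. This is a simplified sketch, not Kafka's actual algorithm (which adds a randomized starting offset and rack awareness), but it reproduces the "leader of partition p lands on broker p mod N, followers on the next RF-1 brokers" pattern shown in the diagram:

```python
def assign_replicas(num_partitions, broker_ids, rf):
    """Round-robin replica assignment sketch: the preferred leader of
    partition p is broker_ids[p % N]; the rf-1 followers are the next
    brokers in the ring. Returns {partition: [replica ids]} where the
    first replica is the preferred leader."""
    n = len(broker_ids)
    return {
        p: [broker_ids[(p + i) % n] for i in range(rf)]
        for p in range(num_partitions)
    }

layout = assign_replicas(6, [1, 2, 3], rf=3)
print(layout[0])  # [1, 2, 3] -> P0 leader on Broker 1
print(layout[1])  # [2, 3, 1] -> P1 leader on Broker 2
print(layout[3])  # [1, 2, 3] -> P3 leader wraps back to Broker 1
```

Each broker ends up leading two of the six partitions, matching the balanced topology in the diagram.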
🏢

The Office Building

A Kafka cluster is like an office building with multiple floors (brokers). Each floor has filing cabinets (partitions). For each document type, one floor is the 'primary office' (leader) where all new documents arrive. The other floors keep photocopies (replicas). If a floor floods (broker failure), the primary office moves to another floor that has up-to-date copies.

What Each Broker Manages

  • A subset of partition replicas (both leaders and followers)
  • Disk storage for log segments of its assigned partitions
  • Network connections from producers and consumers
  • Fetch requests from follower replicas replicating data
  • Heartbeats to the controller for liveness detection
02

Partition Leadership

Every partition has exactly one leader replica and zero or more follower replicas. The leader handles ALL read and write requests for that partition. Followers exist solely for durability — they replicate data from the leader via fetch requests.

Leader Election Flow

Normal operation:
  Producer ──writes to──▶ Partition 0 Leader (Broker 1)
  Consumer ──reads from──▶ Partition 0 Leader (Broker 1)
  Broker 2 (follower) ──fetches from──▶ Broker 1 (leader)
  Broker 3 (follower) ──fetches from──▶ Broker 1 (leader)

Broker 1 fails:
  1. Controller detects Broker 1 is unresponsive (session timeout)
  2. Controller checks ISR for Partition 0: [Broker 2, Broker 3]
  3. Controller elects Broker 2 as new leader (first in ISR)
  4. Controller updates metadata and notifies all brokers
  5. Producers/consumers discover new leader via metadata refresh
  6. Broker 2 now serves reads and writes for Partition 0

Preferred leader election:
  - Each partition has a "preferred leader" (first replica in assignment)
  - Kafka periodically rebalances leadership back to preferred leaders
  - Prevents all leaders from accumulating on one broker after failures
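The controller's election step (step 3 above) can be sketched as "pick the first surviving replica in the ISR." This is a simplified model of the controller's behavior, not its actual implementation:

```python
def elect_leader(isr, failed_broker):
    """Controller-side leader election sketch: on leader failure, the
    first surviving ISR member becomes the new leader. Returns None if
    no in-sync replica remains -- the partition goes offline (unless
    unclean leader election is enabled)."""
    candidates = [b for b in isr if b != failed_broker]
    return candidates[0] if candidates else None

# Partition 0: ISR was [1, 2, 3], leader Broker 1 fails
print(elect_leader([1, 2, 3], failed_broker=1))  # 2
# Only the leader was in the ISR -> partition offline
print(elect_leader([1], failed_broker=1))        # None
```

Restricting candidates to the ISR is what guarantees the new leader has every committed record.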

🔑 Why Only Leaders Serve Reads

Unlike databases where read replicas serve queries, Kafka followers do NOT serve consumer reads (by default). This simplifies consistency — consumers always see the latest committed data. Kafka 2.4+ introduced follower fetching for rack-aware consumers, but the leader remains the source of truth.

03

Replication

The replication factor determines how many copies of each partition exist across the cluster. A replication factor of 3 means each partition has 1 leader + 2 followers = 3 total copies. This is the standard production configuration.

Replication Mechanics

Topic: "payments", Partitions: 3, Replication Factor: 3

Partition 0: Leader=B1, Followers=[B2, B3]
Partition 1: Leader=B2, Followers=[B1, B3]
Partition 2: Leader=B3, Followers=[B1, B2]

How followers replicate:
  1. Follower sends FetchRequest to leader (same protocol as consumers)
  2. Leader responds with new records since follower's last fetch
  3. Follower appends records to its local log
  4. Follower advances its "log end offset" (LEO)

Replication lag:
  - Difference between leader's LEO and follower's LEO
  - If lag exceeds replica.lag.time.max.ms, follower is removed from ISR
  - Causes: slow disk, network congestion, GC pauses on follower

Rack awareness (recommended for production):
  - Spread replicas across different racks/availability zones
  - broker.rack=us-east-1a on broker config
  - Kafka places replicas on different racks automatically
  - Survives entire rack/AZ failure
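The lag-based ISR membership rule above can be modeled in a few lines. This is a deliberately simplified sketch of `replica.lag.time.max.ms` semantics (real brokers track the timestamp of each follower's last fully-caught-up fetch):

```python
def current_isr(leader_id, followers, now_ms, max_lag_ms=30_000):
    """ISR membership sketch: a follower stays in the ISR as long as it
    was fully caught up to the leader within max_lag_ms (modeling
    replica.lag.time.max.ms). `followers` maps broker id -> timestamp
    (ms) of its last fully-caught-up fetch."""
    isr = [leader_id]  # the leader is always in its own ISR
    for broker_id, last_caught_up_ms in followers.items():
        if now_ms - last_caught_up_ms <= max_lag_ms:
            isr.append(broker_id)
    return isr

# B2 caught up 5s ago, B3 caught up 60s ago (slow disk or GC pause)
followers = {2: 95_000, 3: 40_000}
print(current_isr(1, followers, now_ms=100_000))  # [1, 2] -> B3 dropped
```

Once B3 catches back up (its lag drops under the threshold), the same check re-admits it to the ISR.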
Replication Factor | Fault Tolerance                         | Storage Cost | Use Case
RF=1               | None — single point of failure          | 1x           | Development/testing only
RF=2               | Survives 1 broker failure               | 2x           | Non-critical data with cost constraints
RF=3 (standard)    | Survives 1 broker failure with acks=all | 3x           | Production — the default recommendation
04

In-Sync Replicas (ISR)

The ISR (In-Sync Replica set) is the subset of replicas that are "caught up" with the leader — meaning they have replicated all records within replica.lag.time.max.ms (default 30s). The ISR is the key to Kafka's durability guarantees.

ISR Mechanics

Partition 0: Leader=B1, All Replicas=[B1, B2, B3]

Normal state (all caught up):
  ISR = [B1, B2, B3]  ← all replicas in sync

B3 falls behind (slow disk, network issue):
  ISR = [B1, B2]      ← B3 removed from ISR
  B3 continues fetching, trying to catch up

B3 catches up (within replica.lag.time.max.ms):
  ISR = [B1, B2, B3]  ← B3 re-added to ISR

The critical config: min.insync.replicas
  - Minimum number of replicas that must be in ISR for a write to succeed
  - Only matters when acks=all
  - If ISR shrinks below min.insync.replicas, producer gets NotEnoughReplicasException

The golden combination for production:
  replication.factor = 3
  min.insync.replicas = 2
  acks = all

  This guarantees:
    - Every acknowledged write exists on at least 2 brokers
    - Can survive 1 broker failure without data loss
    - If 2 brokers fail, writes are rejected (availability sacrifice for durability)
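The golden combination can be applied when creating a topic. A sketch using the standard `kafka-topics.sh` CLI — the bootstrap address is a placeholder, and the topic name reuses the "payments" example from earlier:

```shell
# Create the topic with RF=3 and min.insync.replicas=2
# (localhost:9092 is a placeholder bootstrap address)
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic payments \
  --partitions 3 \
  --replication-factor 3 \
  --config min.insync.replicas=2

# The third leg of the combination, acks=all, is a PRODUCER setting:
#   acks=all
```

`min.insync.replicas` can also be set cluster-wide in broker config, but setting it per topic makes the durability contract explicit for that topic.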
✍️

The Witness Signatures

Think of ISR as witnesses to a legal document. The leader writes the document (record). min.insync.replicas=2 means at least 2 witnesses (including the leader) must sign before the document is considered official. If only 1 witness is available, the notary (broker) refuses to process the document — better to reject than to have an unverifiable record.

05

Acknowledgement Modes (acks)

The acks producer configuration controls how many replicas must acknowledge a write before the producer considers it successful. This is the primary knob for trading latency against durability.

acks          | Behavior                                       | Durability                                      | Latency | Use Case
acks=0        | Fire and forget — don't wait for any ack       | None — data may be lost                         | Lowest  | Metrics, logs where occasional loss is acceptable
acks=1        | Leader acknowledges — followers not guaranteed | Medium — lost if leader dies before replication | Low     | Most use cases where some risk is acceptable
acks=all (-1) | All ISR replicas must acknowledge              | Highest — survives any single broker failure    | Higher  | Financial transactions, critical events
acks + min.insync.replicas — What They Actually Guarantee

Scenario: RF=3, min.insync.replicas=2, acks=all

Write arrives at leader:
  1. Leader writes to its local log
  2. Leader waits for at least 1 follower to replicate (min.insync.replicas=2 total)
  3. Once 2 replicas have the data → acknowledge to producer
  4. Third replica catches up asynchronously

What this guarantees:
  ✓ Data exists on at least 2 brokers before producer gets success
  ✓ If leader dies immediately after ack, data is on at least 1 follower
  ✓ No data loss for any single broker failure

What this does NOT guarantee:
  ✗ If 2 brokers die simultaneously, data may be lost
  ✗ Does not prevent duplicates (use idempotent producer for that)

Failure scenarios:
  ISR=[B1, B2, B3] → write succeeds (3 >= 2)
  ISR=[B1, B2]     → write succeeds (2 >= 2)
  ISR=[B1]         → write FAILS with NotEnoughReplicasException (1 < 2)
                      Producer must retry or handle the error
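The failure scenarios above reduce to a single comparison on the leader: is the current ISR at least `min.insync.replicas` large? A simplified sketch of that check (the exception class models Kafka's `NotEnoughReplicasException`):

```python
class NotEnoughReplicasError(Exception):
    """Simplified stand-in for Kafka's NotEnoughReplicasException."""

def accept_write(isr_size, min_insync_replicas=2):
    """Leader-side admission check for acks=all: reject the write when
    the ISR has shrunk below min.insync.replicas, otherwise ack once
    the ISR has replicated the record."""
    if isr_size < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR size {isr_size} < min.insync.replicas {min_insync_replicas}")
    return "ack"

print(accept_write(3))  # ISR=[B1,B2,B3] -> ack
print(accept_write(2))  # ISR=[B1,B2]    -> ack
try:
    accept_write(1)     # ISR=[B1]       -> rejected
except NotEnoughReplicasError as e:
    print("rejected:", e)
```

The producer sees the rejection as a retriable error; with retries enabled it will keep attempting until the ISR recovers or the delivery timeout expires.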

⚠️ acks=all Without min.insync.replicas is Useless

If min.insync.replicas=1 (default), then acks=all can be satisfied by the leader alone whenever the ISR shrinks to just the leader — effectively the same durability risk as acks=1. Always set min.insync.replicas=2 when using acks=all in production.

06

Unclean Leader Election

When all ISR replicas are unavailable, Kafka faces a choice: remain unavailable (no leader) or elect an out-of-sync replica as leader (potentially losing data). This is controlled by unclean.leader.election.enable.

Setting                                                   | Behavior                                                       | Trade-off
unclean.leader.election.enable=false (default since 0.11) | Partition stays offline until an ISR replica recovers          | Durability over availability — no data loss, but partition is unavailable
unclean.leader.election.enable=true                       | Out-of-sync replica becomes leader; records it missed are lost | Availability over durability — partition stays online but may lose recent data

When to Enable Unclean Leader Election

  • Metrics/logging topics where availability matters more than losing a few records
  • Non-critical data where downtime is more costly than occasional data loss
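For cases like these, the setting can be enabled per topic rather than cluster-wide, so critical topics keep the safe default. A sketch using the standard `kafka-configs.sh` CLI — the topic name `app-metrics` and bootstrap address are illustrative placeholders:

```shell
# Enable unclean leader election ONLY for a low-value metrics topic
# (topic name and broker address are placeholders)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name app-metrics \
  --add-config unclean.leader.election.enable=true
```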

When to Keep It Disabled (Default)

  • Financial transactions — losing a payment event is unacceptable
  • Event sourcing — gaps in the event log corrupt downstream state
  • Any topic where data loss has business consequences
07

ZooKeeper vs KRaft

Historically, Kafka depended on Apache ZooKeeper for cluster metadata management: controller election, broker registration, topic configuration, and ISR tracking. KRaft (Kafka Raft) removes this dependency by having Kafka manage its own metadata using a Raft-based consensus protocol.

Aspect                 | ZooKeeper Mode                                                  | KRaft Mode
Metadata storage       | External ZooKeeper ensemble (3-5 nodes)                         | Internal Kafka controller quorum (3 nodes)
Controller election    | ZooKeeper ephemeral nodes + watches                             | Raft leader election among controller nodes
Operational complexity | Two separate systems to manage, monitor, upgrade                | Single system — Kafka manages everything
Partition limit        | ~200K partitions (ZooKeeper becomes bottleneck)                 | Millions of partitions (metadata in Raft log)
Recovery time          | Slow controller failover (ZK session timeout + metadata reload) | Fast failover (Raft leader election, metadata already replicated)
Status                 | Deprecated — removal planned                                    | Production-ready since Kafka 3.3, default since 3.5
KRaft Controller Quorum

KRaft Architecture:
  ┌─────────────────────────────────────────────────────┐
  │            Controller Quorum (3 nodes)              │
  │  ┌───────────┐  ┌───────────┐  ┌───────────┐        │
  │  │Controller │  │Controller │  │Controller │        │
  │  │  (Active) │  │ (Standby) │  │ (Standby) │        │
  │  │  Leader   │  │ Follower  │  │ Follower  │        │
  │  └───────────┘  └───────────┘  └───────────┘        │
  └─────────────────────────────────────────────────────┘

                    Metadata Log
                    (Raft replicated)

  ┌───────────┐  ┌───────────┐  ┌───────────┐
  │  Broker 1 │  │  Broker 2 │  │  Broker 3 │
  │  (data)   │  │  (data)   │  │  (data)   │
  └───────────┘  └───────────┘  └───────────┘

Key improvements:
  - No external dependency (ZooKeeper gone)
  - Metadata replicated via Raft → fast failover
  - Controllers can also serve as brokers (combined mode)
  - Scales to millions of partitions
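Bootstrapping a new KRaft cluster follows the storage-format flow from the Kafka quickstart. A sketch — the config path assumes the layout of a standard Kafka distribution, which may differ by release:

```shell
# Generate a cluster ID, format the metadata log directory, and start
# the broker in KRaft mode (paths assume a stock Kafka distribution)
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" \
  -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties
```

The `process.roles` setting in server.properties decides whether the node runs as a controller, a broker, or both (combined mode).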

💡 Migration Path

Existing clusters can migrate from ZooKeeper to KRaft without downtime using the bridge release process. New clusters should always use KRaft mode. ZooKeeper support is deprecated and will be removed in a future Kafka release.

08

Interview Questions

Q: What happens when a Kafka broker fails?

A: The controller detects the failure via missed heartbeats. For each partition where the failed broker was leader, the controller elects a new leader from the ISR (in-sync replicas). It updates cluster metadata and notifies all brokers. Producers and consumers discover the new leaders via metadata refresh and redirect traffic. If the failed broker was only a follower, no leader election is needed — it's simply removed from the ISR until it recovers.

Q: Explain the relationship between replication factor, min.insync.replicas, and acks.

A: Replication factor (RF) determines how many copies of each partition exist. min.insync.replicas (MIR) sets the minimum ISR size required for writes when acks=all. acks controls how many replicas must acknowledge before the producer gets success. The golden production config is RF=3, MIR=2, acks=all: every write is confirmed on 2 brokers, surviving any single failure. If ISR drops below MIR, writes are rejected to prevent under-replicated data.

Q: Why did Kafka move from ZooKeeper to KRaft?

A: ZooKeeper added operational complexity (separate cluster to manage), limited partition scalability (~200K), caused slow controller failover (full metadata reload), and created a split-brain risk. KRaft eliminates the external dependency by using Raft consensus within Kafka itself. Benefits: single system to operate, faster failover, millions of partitions, and simpler deployment.

Q: What is the ISR and why does it matter?

A: ISR (In-Sync Replicas) is the set of replicas caught up with the leader within replica.lag.time.max.ms. It matters because: (1) Only ISR members are eligible for leader election (preventing data loss). (2) acks=all waits for all ISR members to acknowledge. (3) min.insync.replicas enforces a minimum ISR size for writes. The ISR is Kafka's mechanism for balancing durability and availability dynamically.

09

Common Mistakes

⚙️

acks=all with default min.insync.replicas

Setting acks=all but leaving min.insync.replicas=1 (default). This is identical to acks=1 — only the leader needs to acknowledge, providing no additional durability.

Always pair acks=all with min.insync.replicas=2. This ensures at least 2 replicas confirm every write, providing actual durability guarantees.

💾

Replication factor = 1 in production

Running production topics with RF=1 to save storage costs. A single disk failure permanently loses all data for that partition with no recovery possible.

Always use RF=3 in production. Storage is cheap; data loss is not. RF=1 means a single disk failure loses all data for that partition permanently.

📊

Ignoring under-replicated partitions

Not monitoring UnderReplicatedPartitions metric — letting ISR shrink silently until a second failure causes data loss.

Alert on UnderReplicatedPartitions > 0. This is the earliest signal of broker health issues, disk problems, or network congestion. A shrinking ISR means you're one failure away from data loss.

🏗️

All replicas on the same rack

Not configuring broker.rack — all replicas may land on the same physical rack. A single rack failure takes out all copies of a partition.

Configure broker.rack for each broker. Kafka will spread replicas across racks automatically, surviving entire rack/AZ failures.