
Kafka Core Architecture

Kafka is not a message queue — it's a distributed commit log. Understand topics, partitions, offsets, and records before anything else.

01. What is Kafka & Why It Exists

Kafka is widely misunderstood as "just a message queue." It is not. Kafka is a distributed commit log — an append-only, immutable, ordered sequence of records that multiple consumers can read independently at their own pace.

Traditional message queues (RabbitMQ, SQS) delete messages after a consumer acknowledges them. Kafka retains messages for a configurable period regardless of consumption. This single design decision unlocks replay, event sourcing, and independent consumer groups — capabilities that fundamentally change system architecture.

📜 The Newspaper vs The Phone Call

A traditional message queue is like a phone call — once the message is delivered, it's gone. If you weren't listening, you missed it. Kafka is like a newspaper archive. Every event is printed and stored in order. Any reader can start from today's edition, last week's, or the very first issue. Multiple readers can read the same archive independently without affecting each other. The archive is retained for a configurable period (or forever with compaction).

The Three Roles Kafka Plays

📨 Message Queue

  • Decouples producer from consumer
  • Absorbs traffic spikes
  • Consumer groups for load balancing

🌊 Event Streaming

  • Real-time stream processing
  • Kafka Streams / Flink integration
  • Windowed aggregations

💾 Storage Layer

  • Source of truth (event log)
  • Replay from any offset
  • Log compaction for latest state

Kafka vs Traditional Message Queues

Aspect             | Traditional Queue (RabbitMQ/SQS)              | Kafka
-------------------|-----------------------------------------------|-----------------------------------------------------------
Message retention  | Deleted after acknowledgment                  | Retained for a configurable period (days/weeks/forever)
Consumer model     | Push-based, competing consumers               | Pull-based, consumer groups with independent offsets
Replay             | Not possible — message is gone                | Any consumer can seek to any offset and replay
Ordering           | Best-effort (per-queue in SQS FIFO)           | Guaranteed within a partition
Throughput         | Thousands/sec (broker bottleneck)             | Millions/sec (partitioned, sequential I/O)
Multiple consumers | Requires separate queues or fan-out exchanges | Multiple consumer groups read the same topic independently
Best for           | Task queues, RPC, simple decoupling           | Event streaming, event sourcing, high-throughput pipelines

🔑 The Key Insight

Kafka's fundamental innovation is treating messages as a durable, ordered log rather than a transient queue. This means the same data can serve multiple purposes: real-time processing, batch analytics, audit trails, and system recovery — all from the same stream.

02. The Log as a Data Structure

Before diving into Kafka's components, you need to internalize the data structure that underpins everything: the append-only log. Every topic partition in Kafka is an ordered, immutable sequence of records. New records are always appended to the end. Existing records are never modified or deleted (until retention expires).

The Append-Only Log

┌─────────────────────────────────────────────────────────┐
│                      Partition Log                      │
├──────┬──────┬──────┬──────┬──────┬──────┬──────┬────────┤
│  0   │  1   │  2   │  3   │  4   │  5   │  6   │ ← NEW  │
├──────┼──────┼──────┼──────┼──────┼──────┼──────┼────────┤
│ msg  │ msg  │ msg  │ msg  │ msg  │ msg  │ msg  │ writes │
│  A   │  B   │  C   │  D   │  E   │  F   │  G   │ append │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴────────┘
   ↑                              ↑                    ↑
   │                              │                    │
 oldest                     consumer               newest
 (may be                    position              (producer
  garbage                   (offset 4)             writes
  collected)                                       here)

Properties:
  1. Append-only: new records go to the end
  2. Immutable: existing records never change
  3. Ordered: offsets give a total order within the partition
  4. Durable: written to disk, replicated across brokers

Why is this powerful? Because an append-only log gives you:

Why the Log Changes Everything

  • Sequential I/O — appending to the end of a file is the fastest disk operation (no seeks)
  • Replayability — any consumer can re-read from any position without affecting others
  • Durability — data survives consumer crashes; just resume from last committed offset
  • Decoupling — producers and consumers operate at different speeds independently
  • Time travel — replay events from yesterday to rebuild state or debug issues
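A few lines of Python make these properties concrete. This is a toy model of a single partition, not broker code:

```python
class PartitionLog:
    """Toy model of one partition: an append-only, ordered list of records."""

    def __init__(self):
        self._records = []          # list index doubles as the offset

    def append(self, record):
        """Append-only: new records always go to the end. Returns the offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Reads never mutate the log, so any number of consumers can
        read from any position without affecting each other."""
        return self._records[offset:]


log = PartitionLog()
for msg in ["A", "B", "C", "D"]:
    log.append(msg)

print(log.append("E"))     # 5th record lands at offset 4 (sequential)
print(log.read_from(2))    # ['C', 'D', 'E'] -- a consumer resuming at offset 2
```

Note there is no `update` or `delete` method: mutation simply isn't part of the data structure's interface, which is exactly what makes replay and independent readers safe.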

💡 Why Kafka is Fast Despite Using Disk

Kafka writes sequentially to disk and relies on the OS page cache for reads. Sequential disk I/O (100+ MB/s) is faster than random memory access in many scenarios. Combined with zero-copy transfers (sendfile syscall) that move data directly from page cache to network socket without copying through application memory, Kafka achieves throughput that rivals in-memory systems.

03. Topics

A topic is a named, durable, ordered stream of records. Topics are logical categories — not queues. Think of a topic as a table in a database: "orders", "user-events", "payment-transactions". Producers write to topics, consumers read from topics.

📁 Topics as Filing Cabinets

A topic is like a labeled filing cabinet. 'orders' cabinet holds all order events. 'payments' cabinet holds all payment events. Each cabinet has multiple drawers (partitions) for parallel access. You can have as many cabinets as you need — they're cheap to create. The label (topic name) is how producers and consumers find the right data.

Topic Configuration

Key Topic Configurations

# Create a topic with specific settings:
# 7 days retention, 100 GB max per partition, delete old segments
kafka-topics.sh --create \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config retention.bytes=107374182400 \
  --config cleanup.policy=delete

# Key configurations:
retention.ms=604800000        # How long to keep records (7 days)
retention.bytes=-1            # Max size per partition (-1 = unlimited)
cleanup.policy=delete         # 'delete' old records OR 'compact' (keep latest per key)
segment.bytes=1073741824      # 1 GB log segment files
min.insync.replicas=2         # Minimum ISR for acks=all to succeed
compression.type=producer     # Use producer's compression (or force: lz4, zstd)

Topic Naming Conventions

Topic Naming Best Practices
# Pattern: <domain>.<entity>.<event-type>
orders.created              # Order was placed
orders.shipped              # Order was shipped
payments.processed          # Payment completed
users.profile-updated       # User updated their profile

# Anti-patterns:
topic1                      # Meaningless name
all-events                  # Too broad: no way to subscribe selectively
order_created_v2            # Version in name: use Schema Registry instead

# Rules:
#   - Use dots or hyphens as separators (not underscores; they conflict with metric names)
#   - Be specific: one event type per topic (not a grab-bag)
#   - Include the domain/bounded context
#   - Don't encode version in the topic name
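These rules are easy to enforce with a small check in a topic-provisioning script. The function below encodes this page's conventions, not any official Kafka validation (Kafka itself only restricts names to the characters [a-zA-Z0-9._-] and 249 characters):

```python
import re

# Characters and length Kafka itself accepts for topic names.
KAFKA_LEGAL = re.compile(r"[a-zA-Z0-9._-]+")
# This page's convention: lowercase <domain>.<event-type> segments.
CONVENTION = re.compile(r"^[a-z0-9]+(\.[a-z0-9-]+)+$")

def check_topic_name(name: str) -> list[str]:
    """Return a list of problems with a proposed topic name (empty = OK)."""
    problems = []
    if len(name) > 249 or not KAFKA_LEGAL.fullmatch(name):
        problems.append("illegal characters or too long for Kafka")
    if "_" in name:
        problems.append("underscores conflict with dots in metric names")
    if not CONVENTION.match(name):
        problems.append("expected <domain>.<event-type>, e.g. orders.created")
    if re.search(r"v\d+$", name):
        problems.append("version in name: prefer Schema Registry")
    return problems

print(check_topic_name("orders.created"))     # [] -- passes every rule
print(check_topic_name("order_created_v2"))   # underscore, shape, and version problems
```

Run as a pre-flight step before calling kafka-topics.sh, this turns the naming convention from a wiki page into something enforced.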

Cleanup Policy | Behavior                                                          | Use Case
---------------|-------------------------------------------------------------------|----------------------------------------------------------
delete         | Removes segments older than retention.ms or over retention.bytes  | Event streams, logs, metrics — you only need recent data
compact        | Keeps only the latest record per key; deletes older duplicates    | Changelog topics, current state snapshots, CDC
compact,delete | Compacts AND deletes segments older than retention.ms             | State snapshots with bounded retention

04. Partitions

A topic is split into partitions — each partition is an independent, ordered, immutable sequence of records. Partitions are Kafka's unit of parallelism and scalability. More partitions = more consumers can read in parallel = higher throughput.

Topic with 3 Partitions
Topic: "orders" (3 partitions, replication-factor=3)

Partition 0: [0] [1] [2] [3] [4] [5] [6] [7] → (Leader: Broker 1)
Partition 1: [0] [1] [2] [3] [4] [5]         → (Leader: Broker 2)
Partition 2: [0] [1] [2] [3] [4] [5] [6]     → (Leader: Broker 3)

Key points:
  - Each partition is hosted on a different broker (leader)
  - Ordering is guaranteed WITHIN a partition, NOT across partitions
  - Each partition can be consumed by at most one consumer in a group
  - Partitions are the unit of parallelism

🏎️ Highway Lanes

A topic with one partition is a single-lane road — all traffic goes through one lane sequentially. Adding partitions is like adding highway lanes. Each lane (partition) handles traffic independently and in parallel. Cars (records) in the same lane maintain their order, but there's no ordering guarantee between lanes. More lanes = more throughput, but you can't merge lanes back into one without losing the cross-lane ordering.

How Records Are Distributed

Partition Assignment
# If record has a key:
partition = hash(key) % numPartitions

# Example: topic "orders" with 6 partitions
#   key = "user-123" → hash("user-123") % 6 = 2 → Partition 2
#   key = "user-456" → hash("user-456") % 6 = 5 → Partition 5
#   key = "user-123" → hash("user-123") % 6 = 2 → Partition 2 (same!)

# All records with the same key go to the same partition
# This guarantees ordering for that key

# If record has NO key (null key):
#   Sticky partitioner (Kafka 2.4+): batches to one partition, then rotates
#   Older behavior: round-robin across partitions

# Why this matters:
#   - All events for user-123 are in Partition 2, in order
#   - A single consumer handles all of user-123's events
#   - You get per-entity ordering without global ordering
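The keyed path above can be sketched in Python. One caveat: the real Java client hashes keys with murmur2 (and uses the sticky partitioner for null keys); zlib.crc32 below is a stand-in to show the mechanism, so the partition numbers it produces will not match a real producer's:

```python
from zlib import crc32

def partition_for(key: str, num_partitions: int) -> int:
    """Keyed records go to hash(key) % numPartitions.
    Any stable hash gives the same-key -> same-partition property;
    Kafka's Java client actually uses murmur2, not crc32."""
    return crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("user-123", 6)
p2 = partition_for("user-123", 6)
print(p1 == p2)    # True: same key always routes to the same partition

# The one-way door in miniature: change num_partitions and the modulus
# changes, so partition_for("user-123", 12) may no longer equal p1.
```

This is why per-key ordering survives rebalances and restarts (the hash is deterministic) but does not survive a change in partition count.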

Choosing Partition Count

Partition Count Guidelines

  • Start with: max(expected throughput / partition throughput, expected consumer count)
  • Single partition throughput: ~10 MB/s write, ~30 MB/s read (varies by hardware)
  • More partitions = more parallelism but also more file handles, memory, and leader elections
  • You CAN increase partitions later, but records with the same key may land in different partitions after the change
  • You CANNOT decrease partitions — this is a one-way door
  • Rule of thumb: 6-12 partitions for moderate topics, 30-100 for high-throughput topics

⚠️ Partition Count is a One-Way Door

Once you increase partitions, records with the same key may hash to a different partition than before. This breaks ordering guarantees for in-flight data. Plan your partition count carefully upfront. Over-provisioning slightly is safer than under-provisioning.

05. Offsets

An offset is a unique, sequential, immutable ID assigned to each record within a partition. Offsets start at 0 and increment by 1 for each new record. They are the mechanism by which consumers track their position — "I've processed up to offset 42 in partition 3."

Offset Mechanics

Partition 0:
  Offset:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
  Record:   A   B   C   D   E   F   G   H   I   J

Consumer Group "analytics":
  Last committed offset: 5
  → Next poll() returns records at offsets 5, 6, 7, 8, 9
  → After processing, commits offset 10 (next to read)

Consumer Group "notifications":
  Last committed offset: 8
  → Next poll() returns records at offsets 8, 9
  → Independent of "analytics" group; different pace

Key rules:
  1. Offsets are per-partition (not global across the topic)
  2. Offsets are immutable: once assigned, never change
  3. Consumers manage their own offsets (stored in the __consumer_offsets topic)
  4. Gaps in offsets are possible (compaction, transactions)
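The two groups in the diagram can be modeled with nothing more than a shared list and one committed offset per group. This is a toy sketch; real committed offsets live in __consumer_offsets, and poll/commit here only mimic the shape of the consumer API:

```python
partition = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]  # offsets 0..9
committed = {"analytics": 5, "notifications": 8}                # next offset to read

def poll(group: str, max_records: int = 500):
    """Return (offset, record) pairs from the group's committed position."""
    start = committed[group]
    return list(enumerate(partition[start:start + max_records], start=start))

def commit(group: str, next_offset: int):
    """Committed offset is the NEXT offset to read, not the last one read."""
    committed[group] = next_offset

print(poll("analytics"))      # [(5, 'F'), (6, 'G'), (7, 'H'), (8, 'I'), (9, 'J')]
commit("analytics", 10)       # done processing: commit 10

print(poll("notifications"))  # [(8, 'I'), (9, 'J')] -- untouched by analytics
```

Notice that committing for "analytics" changes nothing for "notifications": the log itself is shared and immutable, and the only per-group state is a single integer.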

Offset Reset Policies

Policy   | Behavior                                             | When to Use
---------|------------------------------------------------------|-------------------------------------------------------------------
earliest | Start from the beginning of the partition (offset 0) | New consumer that needs all historical data (replay)
latest   | Start from the end (only new records)                | New consumer that only cares about future events
none     | Throw an exception if no committed offset exists     | Strict environments — fail loudly rather than silently skip or replay

Committing Offsets

Strategy                   | How It Works                                                 | Guarantee                                                   | Risk
---------------------------|--------------------------------------------------------------|-------------------------------------------------------------|------------------------------------------------------------
Auto-commit                | Kafka commits every auto.commit.interval.ms (default 5s)     | At-most-once (if crash after commit but before processing)  | Duplicate processing or data loss on crash
Manual sync (commitSync)   | Application commits after processing; blocks until confirmed | At-least-once (if crash after processing but before commit) | Slower — blocks on every commit
Manual async (commitAsync) | Application commits after processing; non-blocking           | At-least-once                                               | Commit may fail silently; use a callback for error handling

🎯 Best Practice: Manual Commit After Processing

Disable auto-commit (enable.auto.commit=false) and commit manually after successfully processing each batch. This gives you at-least-once semantics — you may process a record twice on crash recovery, but you'll never lose one. Make your consumers idempotent to handle the duplicates.
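A sketch of that loop in miniature. The names are illustrative and the "crash" is only simulated; a real consumer would set enable.auto.commit=false and call commitSync after each batch:

```python
# Toy at-least-once loop: commit AFTER processing, dedupe on replay.
records = [(0, "order-1"), (1, "order-2"), (2, "order-3")]  # (offset, id)
committed_offset = 0
processed_ids = set()        # idempotency guard, e.g. a DB unique constraint

def process_batch(batch):
    global committed_offset
    for offset, order_id in batch:
        if order_id not in processed_ids:   # duplicate-safe processing
            processed_ids.add(order_id)
    committed_offset = batch[-1][0] + 1     # commit only after the whole batch

process_batch(records[:2])
# A crash HERE resumes from committed_offset (2): nothing is lost.
# A crash BEFORE the commit above would replay offsets 0-1 instead,
# and the processed_ids check absorbs those duplicates.
process_batch(records[committed_offset:])
print(committed_offset, sorted(processed_ids))
# 3 ['order-1', 'order-2', 'order-3']
```

The pairing matters: commit-after-processing gives at-least-once delivery, and the idempotency check upgrades that to effectively-once processing.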

06. Records (Messages)

A Kafka record (also called a message) is the atomic unit of data. Every record has a well-defined structure with specific fields that serve different purposes.

Anatomy of a Kafka Record

┌─────────────────────────────────────────────────────┐
│                    Kafka Record                     │
├─────────────┬───────────────────────────────────────┤
│ Key         │ "user-123" (nullable, determines      │
│             │  partition assignment)                │
├─────────────┼───────────────────────────────────────┤
│ Value       │ {"event":"order_placed","amount":99}  │
│             │ (the actual payload; binary safe)     │
├─────────────┼───────────────────────────────────────┤
│ Timestamp   │ 1706745600000 (CreateTime or          │
│             │  LogAppendTime; configurable)         │
├─────────────┼───────────────────────────────────────┤
│ Headers     │ [{"traceId":"abc"},{"source":"api"}]  │
│             │ (metadata key-value pairs)            │
├─────────────┼───────────────────────────────────────┤
│ Offset      │ 42 (assigned by broker on append)     │
├─────────────┼───────────────────────────────────────┤
│ Partition   │ 3 (determined by key hash or          │
│             │  explicit assignment)                 │
└─────────────┴───────────────────────────────────────┘
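The fields above map naturally onto a small dataclass. Treat this as a mental model only: the actual wire format adds batch headers, CRCs, and compact varint encodings:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    key: Optional[bytes]             # nullable; drives partition routing
    value: bytes                     # the payload, binary-safe
    timestamp_ms: int                # CreateTime or LogAppendTime
    headers: list = field(default_factory=list)   # (name, bytes) pairs
    # Assigned by the partitioner/broker, not by application code:
    partition: Optional[int] = None
    offset: Optional[int] = None

r = Record(
    key=b"user-123",
    value=b'{"event":"order_placed","amount":99}',
    timestamp_ms=1706745600000,
    headers=[("traceId", b"abc"), ("source", b"api")],
)
print(r.key, r.offset)   # b'user-123' None -- offset unset until appended
```

Keeping partition and offset as broker-assigned (defaulted to None) mirrors the real split of responsibilities: the producer fills in key, value, timestamp, and headers; the broker fills in the rest on append.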

Keys Are Not IDs

The key is not a unique identifier for the record. Its primary purpose is partition routing — all records with the same key go to the same partition, guaranteeing ordering for that key. Multiple records can (and will) share the same key.

Key Design Guidelines

  • Use the entity ID as key (user-id, order-id) to guarantee per-entity ordering
  • Null keys use round-robin/sticky partitioning — no ordering guarantee
  • Keys determine data locality — same key = same partition = same consumer
  • Choose keys carefully: hot keys create partition hotspots
  • Keys are used by log compaction to retain only the latest value per key

Timestamps

Type                 | Set By                                                 | Use Case
---------------------|--------------------------------------------------------|-------------------------------------------------------------------------
CreateTime (default) | Producer sets the timestamp when the record is created | Event time processing — when the event actually happened
LogAppendTime        | Broker sets the timestamp when the record is appended  | Ingestion time — when Kafka received it (useful for ordering guarantees)

💡 Record Size Limits

Default max record size is 1 MB (message.max.bytes on broker, max.request.size on producer). For larger payloads, store the data in object storage (S3) and put a reference/URL in the Kafka record. Kafka is optimized for high-throughput small messages, not large blobs.
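This reference-passing approach is often called the claim-check pattern. A sketch with an in-memory dict standing in for S3; put_blob, the bucket name, and the 1 MB threshold are all illustrative:

```python
# Claim-check pattern sketch: park the blob, ship the pointer.
blob_store = {}   # stand-in for object storage (e.g. S3)

def put_blob(key: str, data: bytes) -> str:
    """Store the blob and return a reference to put in the Kafka record."""
    blob_store[key] = data
    return f"s3://media-bucket/{key}"        # hypothetical bucket/URI scheme

def build_record(order_id: str, payload: bytes, limit: int = 1_000_000):
    """Inline small payloads; send only a reference for large ones."""
    if len(payload) <= limit:
        return {"order_id": order_id, "payload": payload}
    ref = put_blob(order_id, payload)
    return {"order_id": order_id, "payload_ref": ref}

small = build_record("o-1", b"x" * 100)
large = build_record("o-2", b"x" * 2_000_000)
print("payload" in small, large["payload_ref"])
# True s3://media-bucket/o-2
```

Consumers that receive a payload_ref fetch the blob from object storage; the Kafka record itself stays small, fast to replicate, and cheap to retain.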

07. Log Compaction

Log compaction is an alternative retention strategy where Kafka retains only the latest record for each key, discarding older records with the same key. The result is a compacted topic that acts as a distributed key-value store — you can always read the current state of every key.

Log Compaction — Before and After
Before compaction (cleanup.policy=compact):
  Offset: [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]
  Key:     A    B    A    C    B    A    C    B
  Value:  v1   v1   v2   v1   v2   v3   v2   v3

After compaction:
  Offset: [5]  [6]  [7]
  Key:     A    C    B
  Value:  v3   v2   v3

What happened:
  - Key A: kept offset 5 (v3), deleted offsets 0, 2
  - Key B: kept offset 7 (v3), deleted offsets 1, 4
  - Key C: kept offset 6 (v2), deleted offset 3
  - Only the LATEST value per key survives
  - Offsets are preserved (not renumbered)

Tombstones (deletion):
  - A record with a key and NULL value is a tombstone
  - After compaction, the tombstone is retained briefly (delete.retention.ms)
  - Then the key is fully removed; consumers know it was deleted
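The keep-latest-per-key behavior above, including tombstones, can be reproduced in a few lines. This is a logical model of the end state; the broker actually compacts segment files incrementally in the background:

```python
def compact(records):
    """Keep only the latest (offset, key, value) per key; offsets preserved.
    A None value is a tombstone: after full compaction the key disappears
    (real Kafka keeps the tombstone around for delete.retention.ms first)."""
    latest = {}                      # key -> (offset, value)
    for offset, key, value in records:
        if value is None:
            latest.pop(key, None)    # tombstone removes the key entirely
        else:
            latest[key] = (offset, value)
    return sorted((off, k, v) for k, (off, v) in latest.items())

log = [(0, "A", "v1"), (1, "B", "v1"), (2, "A", "v2"), (3, "C", "v1"),
       (4, "B", "v2"), (5, "A", "v3"), (6, "C", "v2"), (7, "B", "v3")]

print(compact(log))
# [(5, 'A', 'v3'), (6, 'C', 'v2'), (7, 'B', 'v3')] -- matches the diagram

print(compact(log + [(8, "A", None)]))   # tombstone deletes key A
# [(6, 'C', 'v2'), (7, 'B', 'v3')]
```

The result reads exactly like a key-value snapshot, which is why compacted topics work as changelogs and bootstrap sources.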

📸 Snapshots vs Full History

Delete retention is like a security camera that records 7 days and overwrites old footage. Log compaction is like a photo album where you only keep the most recent photo of each person. You don't need 50 photos of Alice — just the latest one tells you what she looks like now. Compacted topics give you the 'current state' of every entity without storing the full history.

When to Use Log Compaction

  • CDC (Change Data Capture) — latest state of each database row
  • KTable changelog topics in Kafka Streams — materialized view state
  • Configuration distribution — latest config per service/key
  • User profile snapshots — latest profile state per user-id
  • Cache rebuild — new consumers can bootstrap current state without full replay

When NOT to Use Log Compaction

  • Event streams where history matters (audit logs, analytics events)
  • Topics where you need to process every event in sequence
  • Null-keyed records — compaction requires keys to identify duplicates
  • When you need time-based retention (use delete policy instead)

08. Interview Questions

Q: Why is Kafka not a message queue?

A: Kafka is a distributed commit log. Unlike queues that delete messages after consumption, Kafka retains records for a configurable period. Multiple consumer groups can read the same data independently at different speeds. Consumers can replay from any offset. This makes Kafka suitable as an event streaming platform and storage layer — not just a transport mechanism.

Q: How does Kafka guarantee ordering?

A: Kafka guarantees ordering WITHIN a partition only. Records with the same key always go to the same partition (hash(key) % numPartitions), so all events for a given entity are ordered. There is NO ordering guarantee across partitions. If you need total ordering across all records, you need a single partition — but this limits throughput to one consumer.

Q: What happens when you increase the partition count?

A: New records with the same key may hash to a different partition than before (because numPartitions changed in hash(key) % numPartitions). Existing records stay where they are. This means per-key ordering is broken for in-flight data during the transition. You cannot decrease partition count. This is why partition count is a one-way door that should be planned carefully.

Q: Explain the difference between delete and compact retention policies.

A: Delete policy removes entire log segments older than retention.ms or exceeding retention.bytes — it's time/size-based garbage collection. Compact policy keeps only the latest record per key, regardless of age — it's a key-based deduplication that turns the topic into a distributed key-value snapshot. You can combine both: compact,delete removes old compacted segments after retention.ms.

Q: Why is Kafka fast despite writing to disk?

A: Three reasons: (1) Sequential I/O — Kafka only appends to the end of files, which is the fastest disk operation (no random seeks). (2) OS page cache — Kafka relies on the OS to cache frequently-read data in RAM automatically. (3) Zero-copy (sendfile) — data moves directly from page cache to network socket without copying through JVM heap. This avoids garbage collection and memory copies.

09. Common Mistakes

🗄️ Using Kafka as a database

Storing application state directly in Kafka and querying it for reads. Kafka is optimized for sequential access, not random lookups by key.

Use Kafka as the event log (source of truth), but materialize views in a proper database to serve queries.

📏 Too few partitions

Creating a topic with 1 partition 'because we don't have much traffic yet.' This limits you to a single consumer and makes future scaling painful.

Start with enough partitions for expected growth (6-12 minimum). You can increase later but it breaks key-based ordering. You can never decrease.

🔀 Expecting global ordering

Assuming records are ordered across all partitions in a topic. Kafka only guarantees ordering within a single partition.

Ordering is per-partition only. Use the same key for records that must be ordered relative to each other. If you need total order, use a single partition (sacrificing parallelism).

📦 Large messages in Kafka

Sending 10 MB payloads (images, videos, large documents) through Kafka. This bloats broker storage, slows replication, and hurts throughput for all topics.

Store large blobs in object storage (S3) and send a reference/URL in the Kafka record. Kafka is optimized for high-throughput small messages (< 1 MB).

🔑 Ignoring key design

Using null keys everywhere, or choosing keys without checking their distribution. Null keys lose per-entity ordering guarantees; hot keys create partition hotspots where one partition receives most of the traffic.

Choose keys that represent the entity you need ordering for (user-id, order-id), and verify the key distribution: a single hot key pins its entire load onto one partition and one consumer.