Kafka Operations & Tuning
Schema Registry, performance tuning, monitoring, security, and knowing when Kafka is the wrong choice.
Schema Registry
Producers and consumers are decoupled: they don't share code or deploy together. The schema is the contract between them. Without schema enforcement, a producer can silently break consumers by changing the message format. Schema Registry solves this.
Architecture:

```
 Producer                   Schema Registry                 Consumer
┌──────────┐   register    ┌──────────────────┐   lookup   ┌──────────┐
│          │──────────────▶│ Stores schemas   │◀───────────│          │
│          │  (gets ID)    │ by subject +     │  (by ID)   │          │
│          │               │ version          │            │          │
└──────────┘               └──────────────────┘            └──────────┘
```

Flow:
1. Producer registers schema with Schema Registry (gets a schema ID)
2. Producer embeds the schema ID in each Kafka record (first 5 bytes)
3. Consumer reads the schema ID from the record
4. Consumer fetches the schema from the Registry (cached locally)
5. Consumer deserializes the record using the fetched schema

Record format with Schema Registry:

```
[0]    Magic byte (0x00)
[1-4]  Schema ID (4 bytes, big-endian int)
[5...] Serialized payload (Avro/Protobuf/JSON)
```

Subjects (how schemas are organized):
- TopicNameStrategy (default): subject = "<topic>-value" or "<topic>-key"
- RecordNameStrategy: subject = fully qualified record name
- TopicRecordNameStrategy: subject = "<topic>-<record-name>"
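To make the flow concrete, here is a minimal producer sketch using Confluent's Avro serializer, which performs the register-and-embed steps automatically. The broker address, Registry URL, and `users` topic are assumptions for illustration.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers the schema on first use and
        // prepends the magic byte + 4-byte schema ID to every record.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed Registry address

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", "42");
        user.put("name", "Ada");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Subject defaults to "<topic>-value" (TopicNameStrategy).
            producer.send(new ProducerRecord<>("users", user.get("id").toString(), user));
        }
    }
}
```

On the consumer side, `KafkaAvroDeserializer` performs the reverse lookup: it reads the embedded schema ID, fetches the writer's schema from the Registry, and caches it locally.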
Serialization Formats
| Format | Size | Schema Required | Evolvability | Best For |
|---|---|---|---|---|
| Avro | Compact binary | Yes (for deserialization) | Excellent (aliases, defaults) | Most Kafka deployments; the standard choice |
| Protobuf | Compact binary | Yes (.proto files) | Good (field numbers, not names) | Polyglot environments, gRPC integration |
| JSON Schema | Larger (text-based) | Optional (for validation) | Moderate | Human-readable debugging, simple schemas |
Schema Evolution & Compatibility
Schemas change over time: new fields are added, old fields deprecated. Compatibility modes control which changes are allowed without breaking existing producers or consumers.
| Mode | Rule | Safe Changes | Use Case |
|---|---|---|---|
| BACKWARD (default) | New schema can read data written by old schema | Add optional fields (with defaults), remove fields | Consumers upgrade first: they can read old and new data |
| FORWARD | Old schema can read data written by new schema | Add fields, remove optional fields (with defaults) | Producers upgrade first: old consumers can still read |
| FULL | Both BACKWARD and FORWARD compatible | Add/remove optional fields with defaults only | Strictest: both sides can upgrade independently |
| NONE | No compatibility checking | Any change allowed | Development only, never in production |
Safe changes (BACKWARD compatible):
- ✅ Add a new field WITH a default value
- ✅ Remove a field that has a default value
- ✅ Rename a field using an alias (Avro)

Unsafe changes (BREAKING):
- ❌ Remove a required field (no default)
- ❌ Change a field's type (int → string)
- ❌ Rename a field without an alias
- ❌ Add a required field without a default

Example: Avro schema evolution.

Version 1:

```json
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "string"},
  {"name": "name", "type": "string"},
  {"name": "email", "type": "string"}
]}
```

Version 2 (BACKWARD compatible; adds the optional phone field):

```json
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "string"},
  {"name": "name", "type": "string"},
  {"name": "email", "type": "string"},
  {"name": "phone", "type": ["null", "string"], "default": null}
]}
```

New consumers can read old records (phone defaults to null). Old consumers can still read new records (Avro schema resolution skips the unfamiliar phone field).
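You can see this resolution mechanism in action without any Kafka cluster: write a record under the v1 schema, then read it back under v2. A minimal standalone sketch using only the Avro Java library:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionDemo {
    static final String V1 = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":\"string\"}]}";
    static final String V2 = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":\"string\"},"
        + "{\"name\":\"phone\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    public static void main(String[] args) throws Exception {
        Schema v1 = new Schema.Parser().parse(V1);
        Schema v2 = new Schema.Parser().parse(V2);

        // Serialize a record under the old (v1) writer schema.
        GenericRecord old = new GenericData.Record(v1);
        old.put("id", "42");
        old.put("name", "Ada");
        old.put("email", "ada@example.com");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(v1).write(old, enc);
        enc.flush();

        // Deserialize with the new (v2) reader schema: schema resolution
        // fills in the missing phone field from its default.
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord upgraded = new GenericDatumReader<GenericRecord>(v1, v2).read(null, dec);
        System.out.println(upgraded.get("phone")); // prints "null" (the default)
    }
}
```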
🎯 Always Use BACKWARD Compatibility
BACKWARD is the default and recommended mode. It means consumers can always be upgraded first: they can read both old and new data. This gives you a safe deployment order: upgrade consumers, then upgrade producers. If something goes wrong, consumers still work with old data.
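With Confluent Schema Registry, the compatibility mode can be pinned per subject through its REST API (`PUT /config/<subject>`). A sketch using Java's built-in HTTP client; the Registry address and the `orders-value` subject are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        // Pin the compatibility mode for one subject (assumed Registry address).
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/config/orders-value"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```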
Producer Tuning
| Goal | Config | Value | Effect |
|---|---|---|---|
| Max throughput | batch.size | 65536-131072 (64-128 KB) | Larger batches = fewer network calls |
| Max throughput | linger.ms | 20-100 | Wait longer to fill batches |
| Max throughput | compression.type | lz4 or zstd | Reduces network and disk I/O |
| Max throughput | buffer.memory | 67108864 (64 MB) | More buffer for async batching |
| Min latency | linger.ms | 0 | Send immediately, no batching delay |
| Min latency | acks | 1 | Only leader ack (less durable) |
| Max durability | acks | all | All ISR must acknowledge |
| Max durability | enable.idempotence | true | No duplicates from retries |
| Max durability | min.insync.replicas (broker) | 2 | At least 2 replicas confirm |
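Putting the table together, here is a sketch of a producer tuned for high throughput with durability: large compressed batches plus acks=all and idempotence. Broker address, topic, and the exact values are assumptions to adapt to your workload.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Throughput: large lz4-compressed batches, linger to let them fill.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 131072);       // 128 KB batches
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);            // wait up to 50 ms per batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L); // 64 MB send buffer

        // Durability: all ISR must acknowledge; retries cannot create duplicates.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key-1", "payload")); // placeholder topic
        }
    }
}
```

This configuration accepts higher per-record latency (the linger window) in exchange for throughput and durability, which is the trade-off described next.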
💡 The Throughput-Latency-Durability Triangle
You can optimize for at most two: high throughput + low latency (sacrifice durability with acks=0), high throughput + durability (sacrifice latency with large batches + acks=all), or low latency + durability (sacrifice throughput with small batches + acks=all). Choose based on your use case.
Consumer Tuning
| Config | Default | Tuning Direction | Effect |
|---|---|---|---|
| fetch.min.bytes | 1 | Increase (e.g., 1024-65536) | Broker waits to accumulate data → fewer fetches, higher throughput |
| fetch.max.wait.ms | 500 | Increase with fetch.min.bytes | How long broker waits to fill fetch.min.bytes |
| max.partition.fetch.bytes | 1048576 (1 MB) | Increase for large messages | Max data per partition per fetch |
| max.poll.records | 500 | Decrease if processing is slow | Fewer records per poll → less time between polls → avoids rebalance |
| max.poll.interval.ms | 300000 (5 min) | Increase if processing is legitimately slow | Time budget for processing a batch before rebalance |
| session.timeout.ms | 45000 (45s) | Decrease for faster failure detection | How quickly a dead consumer is detected |
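As an illustration of these knobs together, a consumer tuned for batched fetches with a generous processing budget. Group ID, topic, and the specific values are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TunedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65536);       // batch fetches for throughput
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);       // how long the broker may wait
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);        // smaller batches if processing is slow
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000); // 10-minute processing budget

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Process records here; each batch must finish within max.poll.interval.ms,
                // or the group coordinator evicts this consumer and triggers a rebalance.
            }
        }
    }
}
```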
Scaling consumers for throughput:
1. Start with 1 consumer and measure the processing rate per partition.
2. If the consumer can't keep up (lag increasing):
   - First: optimize processing (batch DB writes, async I/O)
   - Then: add consumers up to the partition count
   - If still not enough: increase partitions, then add consumers

Max parallelism = min(partition count, consumer count in group)

Example with a 12-partition topic:
- 1 consumer → handles 12 partitions (max throughput: 1x processing speed)
- 3 consumers → each handles 4 partitions (3x throughput)
- 6 consumers → each handles 2 partitions (6x throughput)
- 12 consumers → each handles 1 partition (12x throughput, the maximum)
- 15 consumers → 12 active + 3 idle (wasted resources)
Broker & OS Tuning
```properties
# Threading
num.network.threads = 8           # Threads handling network requests
num.io.threads = 16               # Threads handling disk I/O
num.replica.fetchers = 4          # Threads for follower replication

# Log management
log.segment.bytes = 1073741824    # 1 GB segment files
log.retention.hours = 168         # 7 days retention
log.retention.bytes = -1          # No size limit (use time-based)
log.cleanup.policy = delete       # Or "compact" for changelog topics

# Replication
default.replication.factor = 3
min.insync.replicas = 2
unclean.leader.election.enable = false
```
Why Kafka Loves the OS Page Cache
Brokers append records sequentially to log segment files and serve most reads straight from the OS page cache. With plaintext listeners, data moves from cache to socket via zero-copy (sendfile) without ever entering the JVM heap, which is why RAM left to the OS matters more than a big heap.
Hardware & OS Considerations
- ✅ Kafka is I/O bound, not CPU bound: invest in fast disks and network, not more cores
- ✅ Sequential I/O makes HDDs viable, but SSDs reduce tail latency for random reads (consumer catch-up)
- ✅ Page cache is Kafka's best friend: allocate 60-70% of RAM to page cache (don't over-allocate JVM heap)
- ✅ JVM heap: 4-6 GB is usually sufficient; more causes long GC pauses
- ✅ Network: 10 Gbps minimum for production; Kafka saturates network before CPU
- ✅ Separate disks for data logs vs OS/application: prevents I/O contention
💡 Why Not More JVM Heap?
Kafka stores data on disk and reads it via the OS page cache, not the JVM heap. Giving the JVM 32 GB means 32 GB less for page cache, which actually hurts performance. Keep the JVM heap small (4-6 GB) and let the OS use the rest of RAM for caching log segments.
Monitoring Key Metrics
| Metric | What It Means | Alert Threshold | Action |
|---|---|---|---|
| UnderReplicatedPartitions | Partitions where followers are behind the leader | > 0 for > 5 min | Check broker health, disk I/O, network. Most important broker alert. |
| OfflinePartitionsCount | Partitions with no available leader | > 0 (critical) | Data is unavailable. Check if brokers are down. Immediate action required. |
| ActiveControllerCount | Number of active controllers in the cluster | ≠ 1 | Should always be exactly 1. 0 = no controller, > 1 = split brain. |
| Consumer Lag | Latest offset minus committed offset, per partition | Increasing trend | Consumer can't keep up. Scale consumers or optimize processing. |
| RequestHandlerAvgIdlePercent | Fraction of time request handler threads sit idle | < 20% | Broker is saturated. Add brokers or reduce load. |
| BytesIn/BytesOut Rate | Network throughput in/out of brokers | Approaching NIC capacity | Network bottleneck. Upgrade NICs or add brokers. |
| ISR Shrink/Expand Rate | How often replicas fall out of / rejoin ISR | Frequent shrinks | Indicates unstable replicas. Check disk, GC, network on affected brokers. |
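Consumer lag is usually scraped by a dedicated exporter, but you can also compute it directly with the AdminClient: fetch the group's committed offsets, fetch the latest offsets for the same partitions, and subtract. A sketch, with group name and broker address assumed:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group (group name is an assumption).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("analytics-group")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag = latest offset - committed offset, per partition.
            committed.forEach((tp, meta) -> System.out.printf(
                "%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```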
⚠️ The Three Critical Alerts
At minimum, alert on: (1) UnderReplicatedPartitions > 0, the earliest signal of broker problems. (2) OfflinePartitionsCount > 0: data is unavailable, immediate action needed. (3) Consumer lag increasing: processing is falling behind real-time.
Security
| Layer | Mechanism | Purpose |
|---|---|---|
| Authentication | SASL/PLAIN, SASL/SCRAM, SASL/GSSAPI (Kerberos), mTLS | Verify identity of clients connecting to brokers |
| Authorization | ACLs (Access Control Lists) | Control who can read/write/describe which topics |
| Encryption in transit | TLS (SSL) | Encrypt data between clients and brokers, and broker-to-broker |
| Encryption at rest | Disk encryption (OS-level) | Kafka doesn't encrypt data on disk natively β use OS/cloud encryption |
```bash
# Allow user "order-service" to produce to the "orders" topic
kafka-acls.sh --add \
  --allow-principal User:order-service \
  --operation Write \
  --topic orders

# Allow user "analytics" to consume from the "orders" topic
kafka-acls.sh --add \
  --allow-principal User:analytics \
  --operation Read \
  --topic orders \
  --group analytics-group

# Deny all other access (default deny)
# Best practice: start with deny-all, explicitly allow what's needed

# Multi-tenant isolation:
# - Separate topics per tenant (tenant-a.orders, tenant-b.orders)
# - ACLs restrict each tenant to their own topics
# - Quotas limit throughput per client to prevent noisy neighbors
```
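On the client side, each service then authenticates as its principal. A sketch of the matching client properties for SASL/SCRAM over TLS; the listener port, mechanism, credentials, and truststore path are all assumptions for your deployment:

```java
import java.util.Properties;

public class SecureClientConfig {
    // Settings for a client connecting to a broker that runs SASL/SCRAM over TLS.
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");      // assumed TLS listener
        props.put("security.protocol", "SASL_SSL");          // TLS encryption + SASL auth
        props.put("sasl.mechanism", "SCRAM-SHA-512");        // assumed mechanism
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"order-service\" password=\"change-me\";"); // assumed credentials
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // assumed path
        props.put("ssl.truststore.password", "change-me");
        return props;
    }
}
```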
Kafka vs Alternatives
| Scenario | Kafka? | Better Alternative | Why |
|---|---|---|---|
| Simple task queue, no replay needed | Overkill | RabbitMQ, SQS, Redis Streams | Simpler ops, built-in retry/DLQ, no partition management |
| Small scale, low ops overhead | Overkill | Redis Streams, SQS | Kafka's operational cost isn't justified at small scale |
| Exactly-once to external systems | Hard | Transactional outbox + Kafka | Kafka EOS is within-Kafka only; use the outbox pattern for DB+Kafka atomicity (see the sketch below the table) |
| Complex stream processing | Use Kafka + Flink | Flink for processing, Kafka for transport | Flink has superior windowing, CEP, and multi-source support |
| Real-time pub/sub, ephemeral | Overkill | Redis Pub/Sub, WebSockets | No need for durability/replay β lighter solutions work |
| Durable, replayable event log at scale | ✅ Ideal | (none needed) | This is exactly what Kafka was built for |
| Multi-region event replication | ✅ Ideal | Kafka MirrorMaker 2 | Built-in cross-cluster replication |
| CDC from databases | ✅ Ideal | Kafka + Debezium | Mature ecosystem, exactly the intended use case |
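For the "exactly-once to external systems" row: the transactional outbox pattern has the service write the business change and an outbox row in one database transaction, while a relay process ships unsent outbox rows to Kafka. A minimal polling-relay sketch of one common shape of the pattern, assuming a single relay instance and a Postgres outbox table with id/topic/key/payload/sent columns (all hypothetical names):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OutboxRelay {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true"); // no duplicates from producer retries

        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app"); // assumed
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                try (Statement st = db.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT id, topic, key, payload FROM outbox "
                         + "WHERE sent = false ORDER BY id LIMIT 100")) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        // Synchronous send: only mark the row after the broker acks.
                        producer.send(new ProducerRecord<>(
                            rs.getString("topic"), rs.getString("key"), rs.getString("payload"))).get();
                        try (PreparedStatement upd =
                                 db.prepareStatement("UPDATE outbox SET sent = true WHERE id = ?")) {
                            upd.setLong(1, id);
                            upd.executeUpdate();
                        }
                    }
                }
                Thread.sleep(1000); // poll interval
            }
        }
    }
}
```

Note this delivers at-least-once into Kafka (a crash between the send and the UPDATE replays the row), so downstream consumers should be idempotent; the atomicity win is that a business change and its event can never be persisted without each other.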
Kafka in the Cloud
| Option | Ops Burden | Control | Cost |
|---|---|---|---|
| AWS MSK (Managed Kafka) | Low: AWS manages brokers, ZK/KRaft | Medium: limited config options | Pay per broker-hour + storage |
| Confluent Cloud | Lowest: fully serverless option available | Low: opinionated, Schema Registry included | Pay per throughput unit (CKU) |
| Self-hosted (K8s / EC2) | Highest: you manage everything | Full: any config, any version | Infrastructure + ops team cost |
Common Mistakes
No Schema Registry in production
Producers and consumers agree on JSON format via documentation or Slack messages. A producer changes a field name and silently breaks all consumers.
✅ Use Schema Registry with Avro/Protobuf and BACKWARD compatibility. It enforces contracts at write time, prevents breaking changes, and provides schema evolution without downtime.
Using Kafka for simple task queues
Deploying a 3-broker Kafka cluster for a background job queue processing 100 jobs/minute. Massive operational overhead for a simple use case.
✅ Use RabbitMQ or SQS for simple task queues. Kafka's value is durability, replay, and high throughput. If you don't need those, the operational overhead isn't justified.
Ignoring broker disk space
Not monitoring disk usage β broker runs out of space and crashes, triggering cascading failures across the cluster.
✅ Monitor disk usage per broker. Set retention.ms and retention.bytes appropriately. Alert at 70% capacity. A full disk causes broker failure, which triggers cascading rebalances and potential data loss.
Over-allocating JVM heap
Giving Kafka brokers 32 GB JVM heap thinking more memory = better performance. This actually starves the OS page cache that Kafka depends on.
✅ Keep JVM heap at 4-6 GB. Kafka uses the OS page cache for data, not JVM heap. More heap = less page cache = worse performance + longer GC pauses.
No encryption or ACLs
Running Kafka with PLAINTEXT listeners and no authentication in production. Any client on the network can read/write any topic.
✅ Enable TLS for encryption in transit. Use SASL for authentication. Configure ACLs to restrict topic access per service. Default-deny, explicitly allow.