Kafka Operations & Tuning
Schema Registry, performance tuning, monitoring, security, and knowing when Kafka is the wrong choice.
Schema Registry
Producers and consumers are decoupled: they don't share code or deploy together. The schema is the contract between them. Without schema enforcement, a producer can silently break consumers by changing the message format. Schema Registry solves this.
Architecture:

```
 Producer                   Schema Registry                 Consumer
┌──────────┐   register    ┌──────────────────┐   lookup   ┌──────────┐
│          │──────────────▶│ Stores schemas   │◀───────────│          │
│          │  (gets ID)    │ by subject +     │  (by ID)   │          │
│          │               │ version          │            │          │
└──────────┘               └──────────────────┘            └──────────┘
```

Flow:
1. Producer registers schema with Schema Registry (gets a schema ID)
2. Producer embeds the schema ID in each Kafka record (first 5 bytes)
3. Consumer reads the schema ID from the record
4. Consumer fetches the schema from the Registry (cached locally)
5. Consumer deserializes the record using the fetched schema

Record format with Schema Registry:

```
[0]    Magic byte (0x00)
[1-4]  Schema ID (4 bytes, big-endian int)
[5...] Serialized payload (Avro/Protobuf/JSON)
```

Subjects (how schemas are organized):
- TopicNameStrategy (default): subject = "<topic>-value" or "<topic>-key"
- RecordNameStrategy: subject = fully qualified record name
- TopicRecordNameStrategy: subject = "<topic>-<record-name>"
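To make the flow concrete, here is a minimal producer sketch using Confluent's Avro serializer, which performs the register-and-embed steps automatically. The broker address, Registry URL, and `users` topic are assumptions for illustration.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers the schema on first use and
        // prepends the magic byte + 4-byte schema ID to every record.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed Registry address

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", "42");
        user.put("name", "Ada");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Subject defaults to "<topic>-value" (TopicNameStrategy).
            producer.send(new ProducerRecord<>("users", user.get("id").toString(), user));
        }
    }
}
```

On the consumer side, `KafkaAvroDeserializer` performs the reverse lookup: it reads the embedded schema ID, fetches the writer's schema from the Registry, and caches it locally.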
Serialization Formats
| Format | Size | Schema Required | Evolvability | Best For |
|---|---|---|---|---|
| Avro | Compact binary | Yes (for deserialization) | Excellent (aliases, defaults) | Most Kafka deployments; the standard choice |
| Protobuf | Compact binary | Yes (.proto files) | Good (field numbers, not names) | Polyglot environments, gRPC integration |
| JSON Schema | Larger (text-based) | Optional (for validation) | Moderate | Human-readable debugging, simple schemas |
Schema Evolution & Compatibility
Schemas change over time: new fields are added, old fields deprecated. Compatibility modes control which changes are allowed without breaking existing producers or consumers.
| Mode | Rule | Safe Changes | Use Case |
|---|---|---|---|
| BACKWARD (default) | New schema can read data written by old schema | Add optional fields (with defaults), remove fields | Consumers upgrade first: they can read old and new data |
| FORWARD | Old schema can read data written by new schema | Add fields, remove optional fields (with defaults) | Producers upgrade first: old consumers can still read |
| FULL | Both BACKWARD and FORWARD compatible | Add/remove optional fields with defaults only | Strictest: both sides can upgrade independently |
| NONE | No compatibility checking | Any change allowed | Development only, never in production |
Safe changes (BACKWARD compatible):
- ✅ Add a new field WITH a default value
- ✅ Remove a field that has a default value
- ✅ Rename a field using an alias (Avro)

Unsafe changes (BREAKING):
- ❌ Remove a required field (no default)
- ❌ Change a field's type (int → string)
- ❌ Rename a field without an alias
- ❌ Add a required field without a default

Example: Avro schema evolution.

Version 1:

```json
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "string"},
  {"name": "name", "type": "string"},
  {"name": "email", "type": "string"}
]}
```

Version 2 (BACKWARD compatible; adds the optional phone field):

```json
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "string"},
  {"name": "name", "type": "string"},
  {"name": "email", "type": "string"},
  {"name": "phone", "type": ["null", "string"], "default": null}
]}
```

New consumers can read old records (phone defaults to null). Old consumers can still read new records (Avro schema resolution skips the unfamiliar phone field).
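You can see this resolution mechanism in action without any Kafka cluster: write a record under the v1 schema, then read it back under v2. A minimal standalone sketch using only the Avro Java library:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionDemo {
    static final String V1 = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":\"string\"}]}";
    static final String V2 = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":\"string\"},"
        + "{\"name\":\"phone\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    public static void main(String[] args) throws Exception {
        Schema v1 = new Schema.Parser().parse(V1);
        Schema v2 = new Schema.Parser().parse(V2);

        // Serialize a record under the old (v1) writer schema.
        GenericRecord old = new GenericData.Record(v1);
        old.put("id", "42");
        old.put("name", "Ada");
        old.put("email", "ada@example.com");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(v1).write(old, enc);
        enc.flush();

        // Deserialize with the new (v2) reader schema: schema resolution
        // fills in the missing phone field from its default.
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord upgraded = new GenericDatumReader<GenericRecord>(v1, v2).read(null, dec);
        System.out.println(upgraded.get("phone")); // prints "null" (the default)
    }
}
```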
🎯 Always Use BACKWARD Compatibility
BACKWARD is the default and recommended mode. It means consumers can always be upgraded first: they can read both old and new data. This gives you a safe deployment order: upgrade consumers, then upgrade producers. If something goes wrong, consumers still work with old data.
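With Confluent Schema Registry, the compatibility mode can be pinned per subject through its REST API (`PUT /config/<subject>`). A sketch using Java's built-in HTTP client; the Registry address and the `orders-value` subject are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        // Pin the compatibility mode for one subject (assumed Registry address).
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/config/orders-value"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```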
Producer Tuning
| Goal | Config | Value | Effect |
|---|---|---|---|
| Max throughput | batch.size | 65536-131072 (64-128 KB) | Larger batches = fewer network calls |
| Max throughput | linger.ms | 20-100 | Wait longer to fill batches |
| Max throughput | compression.type | lz4 or zstd | Reduces network and disk I/O |
| Max throughput | buffer.memory | 67108864 (64 MB) | More buffer for async batching |
| Min latency | linger.ms | 0 | Send immediately, no batching delay |
| Min latency | acks | 1 | Only leader ack (less durable) |
| Max durability | acks | all | All ISR must acknowledge |
| Max durability | enable.idempotence | true | No duplicates from retries |
| Max durability | min.insync.replicas (broker) | 2 | At least 2 replicas confirm |
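Putting the table together, here is a sketch of a producer tuned for high throughput with durability: large compressed batches plus acks=all and idempotence. Broker address, topic, and the exact values are assumptions to adapt to your workload.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Throughput: large lz4-compressed batches, linger to let them fill.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 131072);       // 128 KB batches
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);            // wait up to 50 ms per batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L); // 64 MB send buffer

        // Durability: all ISR must acknowledge; retries cannot create duplicates.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key-1", "payload")); // placeholder topic
        }
    }
}
```

This configuration accepts higher per-record latency (the linger window) in exchange for throughput and durability, which is the trade-off described next.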
💡 The Throughput-Latency-Durability Triangle
You can optimize for at most two: high throughput + low latency (sacrifice durability with acks=0), high throughput + durability (sacrifice latency with large batches + acks=all), or low latency + durability (sacrifice throughput with small batches + acks=all). Choose based on your use case.
Consumer Tuning
| Config | Default | Tuning Direction | Effect |
|---|---|---|---|
| fetch.min.bytes | 1 | Increase (e.g., 1024-65536) | Broker waits to accumulate data → fewer fetches, higher throughput |
| fetch.max.wait.ms | 500 | Increase with fetch.min.bytes | How long broker waits to fill fetch.min.bytes |
| max.partition.fetch.bytes | 1048576 (1 MB) | Increase for large messages | Max data per partition per fetch |
| max.poll.records | 500 | Decrease if processing is slow | Fewer records per poll → less time between polls → avoids rebalance |
| max.poll.interval.ms | 300000 (5 min) | Increase if processing is legitimately slow | Time budget for processing a batch before rebalance |
| session.timeout.ms | 45000 (45s) | Decrease for faster failure detection | How quickly a dead consumer is detected |
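As an illustration of these knobs together, a consumer tuned for batched fetches with a generous processing budget. Group ID, topic, and the specific values are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TunedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65536);       // batch fetches for throughput
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);       // how long the broker may wait
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);        // smaller batches if processing is slow
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000); // 10-minute processing budget

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Process records here; each batch must finish within max.poll.interval.ms,
                // or the group coordinator evicts this consumer and triggers a rebalance.
            }
        }
    }
}
```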
Scaling consumers for throughput:
1. Start with 1 consumer and measure the processing rate per partition.
2. If the consumer can't keep up (lag increasing):
   - First: optimize processing (batch DB writes, async I/O)
   - Then: add consumers up to the partition count
   - If still not enough: increase partitions, then add consumers

Max parallelism = min(partition count, consumer count in group)

Example with a 12-partition topic:
- 1 consumer → handles 12 partitions (max throughput: 1x processing speed)
- 3 consumers → each handles 4 partitions (3x throughput)
- 6 consumers → each handles 2 partitions (6x throughput)
- 12 consumers → each handles 1 partition (12x throughput, the maximum)
- 15 consumers → 12 active + 3 idle (wasted resources)
Broker & OS Tuning
```properties
# Threading
num.network.threads = 8           # Threads handling network requests
num.io.threads = 16               # Threads handling disk I/O
num.replica.fetchers = 4          # Threads for follower replication

# Log management
log.segment.bytes = 1073741824    # 1 GB segment files
log.retention.hours = 168         # 7 days retention
log.retention.bytes = -1          # No size limit (use time-based)
log.cleanup.policy = delete       # Or "compact" for changelog topics

# Replication
default.replication.factor = 3
min.insync.replicas = 2
unclean.leader.election.enable = false
```
Why Kafka Loves the OS Page Cache
Brokers append records sequentially to log segment files and serve most reads straight from the OS page cache. With plaintext listeners, data moves from cache to socket via zero-copy (sendfile) without ever entering the JVM heap, which is why RAM left to the OS matters more than a big heap.
Hardware & OS Considerations
- ✅ Kafka is I/O bound, not CPU bound: invest in fast disks and network, not more cores
- ✅ Sequential I/O makes HDDs viable, but SSDs reduce tail latency for random reads (consumer catch-up)
- ✅ Page cache is Kafka's best friend: allocate 60-70% of RAM to page cache (don't over-allocate JVM heap)
- ✅ JVM heap: 4-6 GB is usually sufficient; more causes long GC pauses
- ✅ Network: 10 Gbps minimum for production; Kafka saturates network before CPU
- ✅ Separate disks for data logs vs OS/application: prevents I/O contention
💡 Why Not More JVM Heap?
Kafka stores data on disk and reads it via the OS page cache, not the JVM heap. Giving the JVM 32 GB means 32 GB less for page cache, which actually hurts performance. Keep the JVM heap small (4-6 GB) and let the OS use the rest of RAM for caching log segments.
Monitoring Key Metrics
| Metric | What It Means | Alert Threshold | Action |
|---|---|---|---|
| UnderReplicatedPartitions | Partitions where followers are behind the leader | > 0 for > 5 min | Check broker health, disk I/O, network. Most important broker alert. |
| OfflinePartitionsCount | Partitions with no available leader | > 0 (critical) | Data is unavailable. Check if brokers are down. Immediate action required. |
| ActiveControllerCount | Number of active controllers in the cluster | ≠ 1 | Should always be exactly 1. 0 = no controller, > 1 = split brain. |
| Consumer Lag | Latest offset minus committed offset, per partition | Increasing trend | Consumer can't keep up. Scale consumers or optimize processing. |
| RequestHandlerAvgIdlePercent | Fraction of time request handler threads sit idle | < 20% | Broker is saturated. Add brokers or reduce load. |
| BytesIn/BytesOut Rate | Network throughput in/out of brokers | Approaching NIC capacity | Network bottleneck. Upgrade NICs or add brokers. |
| ISR Shrink/Expand Rate | How often replicas fall out of / rejoin ISR | Frequent shrinks | Indicates unstable replicas. Check disk, GC, network on affected brokers. |
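Consumer lag is usually scraped by a dedicated exporter, but you can also compute it directly with the AdminClient: fetch the group's committed offsets, fetch the latest offsets for the same partitions, and subtract. A sketch, with group name and broker address assumed:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group (group name is an assumption).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("analytics-group")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag = latest offset - committed offset, per partition.
            committed.forEach((tp, meta) -> System.out.printf(
                "%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```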
⚠️ The Three Critical Alerts
At minimum, alert on: (1) UnderReplicatedPartitions > 0, the earliest signal of broker problems. (2) OfflinePartitionsCount > 0: data is unavailable, immediate action needed. (3) Consumer lag increasing: processing is falling behind real-time.
Security
| Layer | Mechanism | Purpose |
|---|---|---|
| Authentication | SASL/PLAIN, SASL/SCRAM, SASL/GSSAPI (Kerberos), mTLS | Verify identity of clients connecting to brokers |
| Authorization | ACLs (Access Control Lists) | Control who can read/write/describe which topics |
| Encryption in transit | TLS (SSL) | Encrypt data between clients and brokers, and broker-to-broker |
| Encryption at rest | Disk encryption (OS-level) | Kafka doesn't encrypt data on disk natively β use OS/cloud encryption |
```bash
# Allow user "order-service" to produce to the "orders" topic
kafka-acls.sh --add \
  --allow-principal User:order-service \
  --operation Write \
  --topic orders

# Allow user "analytics" to consume from the "orders" topic
kafka-acls.sh --add \
  --allow-principal User:analytics \
  --operation Read \
  --topic orders \
  --group analytics-group

# Deny all other access (default deny)
# Best practice: start with deny-all, explicitly allow what's needed

# Multi-tenant isolation:
# - Separate topics per tenant (tenant-a.orders, tenant-b.orders)
# - ACLs restrict each tenant to their own topics
# - Quotas limit throughput per client to prevent noisy neighbors
```
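On the client side, each service then authenticates as its principal. A sketch of the matching client properties for SASL/SCRAM over TLS; the listener port, mechanism, credentials, and truststore path are all assumptions for your deployment:

```java
import java.util.Properties;

public class SecureClientConfig {
    // Settings for a client connecting to a broker that runs SASL/SCRAM over TLS.
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");      // assumed TLS listener
        props.put("security.protocol", "SASL_SSL");          // TLS encryption + SASL auth
        props.put("sasl.mechanism", "SCRAM-SHA-512");        // assumed mechanism
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"order-service\" password=\"change-me\";"); // assumed credentials
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // assumed path
        props.put("ssl.truststore.password", "change-me");
        return props;
    }
}
```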
Kafka vs Alternatives
| Scenario | Kafka? | Better Alternative | Why |
|---|---|---|---|
| Simple task queue, no replay needed | Overkill | RabbitMQ, SQS, Redis Streams | Simpler ops, built-in retry/DLQ, no partition management |
| Small scale, low ops overhead | Overkill | Redis Streams, SQS | Kafka's operational cost isn't justified at small scale |
| Exactly-once to external systems | Hard | Transactional outbox + Kafka | Kafka EOS is within-Kafka only; use the outbox pattern for DB+Kafka atomicity (see the sketch below the table) |
| Complex stream processing | Use Kafka + Flink | Flink for processing, Kafka for transport | Flink has superior windowing, CEP, and multi-source support |
| Real-time pub/sub, ephemeral | Overkill | Redis Pub/Sub, WebSockets | No need for durability/replay β lighter solutions work |
| Durable, replayable event log at scale | ✅ Ideal | (none needed) | This is exactly what Kafka was built for |
| Multi-region event replication | ✅ Ideal | Kafka MirrorMaker 2 | Built-in cross-cluster replication |
| CDC from databases | ✅ Ideal | Kafka + Debezium | Mature ecosystem, exactly the intended use case |
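For the "exactly-once to external systems" row: the transactional outbox pattern has the service write the business change and an outbox row in one database transaction, while a relay process ships unsent outbox rows to Kafka. A minimal polling-relay sketch of one common shape of the pattern, assuming a single relay instance and a Postgres outbox table with id/topic/key/payload/sent columns (all hypothetical names):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OutboxRelay {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true"); // no duplicates from producer retries

        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app"); // assumed
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                try (Statement st = db.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT id, topic, key, payload FROM outbox "
                         + "WHERE sent = false ORDER BY id LIMIT 100")) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        // Synchronous send: only mark the row after the broker acks.
                        producer.send(new ProducerRecord<>(
                            rs.getString("topic"), rs.getString("key"), rs.getString("payload"))).get();
                        try (PreparedStatement upd =
                                 db.prepareStatement("UPDATE outbox SET sent = true WHERE id = ?")) {
                            upd.setLong(1, id);
                            upd.executeUpdate();
                        }
                    }
                }
                Thread.sleep(1000); // poll interval
            }
        }
    }
}
```

Note this delivers at-least-once into Kafka (a crash between the send and the UPDATE replays the row), so downstream consumers should be idempotent; the atomicity win is that a business change and its event can never be persisted without each other.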
Kafka in the Cloud
| Option | Ops Burden | Control | Cost |
|---|---|---|---|
| AWS MSK (Managed Kafka) | Low: AWS manages brokers, ZK/KRaft | Medium: limited config options | Pay per broker-hour + storage |
| Confluent Cloud | Lowest: fully serverless option available | Low: opinionated, Schema Registry included | Pay per throughput unit (CKU) |
| Self-hosted (K8s / EC2) | Highest: you manage everything | Full: any config, any version | Infrastructure + ops team cost |
Common Mistakes
No Schema Registry in production
Producers and consumers agree on JSON format via documentation or Slack messages. A producer changes a field name and silently breaks all consumers.
✅ Use Schema Registry with Avro/Protobuf and BACKWARD compatibility. It enforces contracts at write time, prevents breaking changes, and provides schema evolution without downtime.
Using Kafka for simple task queues
Deploying a 3-broker Kafka cluster for a background job queue processing 100 jobs/minute. Massive operational overhead for a simple use case.
✅ Use RabbitMQ or SQS for simple task queues. Kafka's value is durability, replay, and high throughput. If you don't need those, the operational overhead isn't justified.
Ignoring broker disk space
Not monitoring disk usage β broker runs out of space and crashes, triggering cascading failures across the cluster.
✅ Monitor disk usage per broker. Set retention.ms and retention.bytes appropriately. Alert at 70% capacity. A full disk causes broker failure, which triggers cascading rebalances and potential data loss.
Over-allocating JVM heap
Giving Kafka brokers 32 GB JVM heap thinking more memory = better performance. This actually starves the OS page cache that Kafka depends on.
✅ Keep JVM heap at 4-6 GB. Kafka uses the OS page cache for data, not JVM heap. More heap = less page cache = worse performance + longer GC pauses.
No encryption or ACLs
Running Kafka with PLAINTEXT listeners and no authentication in production. Any client on the network can read/write any topic.
✅ Enable TLS for encryption in transit. Use SASL for authentication. Configure ACLs to restrict topic access per service. Default-deny, explicitly allow.