
Kafka Operations & Tuning

Schema Registry, performance tuning, monitoring, security, and knowing when Kafka is the wrong choice.

01

Schema Registry

Producers and consumers are decoupled — they don't share code or deploy together. The schema is the contract between them. Without schema enforcement, a producer can silently break consumers by changing the message format. Schema Registry solves this.

Schema Registry — How It Works
Architecture:

  Producer                Schema Registry              Consumer
  ┌──────┐              ┌──────────────┐              ┌──────┐
  │      │──register──▶ │  Stores      │ ◀──lookup──  │      │
  │      │  schema      │  schemas by  │   schema     │      │
  │      │              │  subject +   │   by ID      │      │
  │      │              │  version     │              │      │
  └──────┘              └──────────────┘              └──────┘

Flow:
  1. Producer registers schema with Schema Registry (gets schema ID)
  2. Producer prefixes each Kafka record with a 5-byte header (magic byte + 4-byte schema ID)
  3. Consumer reads schema ID from record
  4. Consumer fetches schema from Registry (cached locally)
  5. Consumer deserializes record using the fetched schema

Record format with Schema Registry:
  [0]         Magic byte (0x00)
  [1-4]       Schema ID (4 bytes, big-endian int)
  [5...]      Serialized payload (Avro/Protobuf/JSON)

Subjects (how schemas are organized):
  - TopicNameStrategy (default): subject = "<topic>-value" or "<topic>-key"
  - RecordNameStrategy: subject = fully qualified record name
  - TopicRecordNameStrategy: subject = "<topic>-<record-name>"
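The 5-byte wire format above is easy to sketch in Python (`frame` and `unframe` are hypothetical helper names; Confluent's serializers do this internally):

```python
import struct

MAGIC_BYTE = 0x00  # byte [0] of every Schema Registry-framed record

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized payload with the wire-format header:
    1 magic byte + 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(record: bytes) -> tuple[int, bytes]:
    """Split a record into (schema_id, payload), validating the magic byte."""
    magic, schema_id = struct.unpack(">bI", record[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, record[5:]

record = frame(42, b"avro-bytes")
assert unframe(record) == (42, b"avro-bytes")
```

The consumer only ships the 4-byte ID per record, not the whole schema, which is why payloads stay compact even with full schema enforcement.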

Serialization Formats

| Format | Size | Schema Required | Evolvability | Best For |
|---|---|---|---|---|
| Avro | Compact binary | Yes (for deserialization) | Excellent (aliases, defaults) | Most Kafka deployments — the standard choice |
| Protobuf | Compact binary | Yes (.proto files) | Good (field numbers, not names) | Polyglot environments, gRPC integration |
| JSON Schema | Larger (text-based) | Optional (for validation) | Moderate | Human-readable debugging, simple schemas |
02

Schema Evolution & Compatibility

Schemas change over time — new fields are added, old fields deprecated. Compatibility modes control which changes are allowed without breaking existing producers or consumers.

| Mode | Rule | Safe Changes | Use Case |
|---|---|---|---|
| BACKWARD (default) | New schema can read data written by old schema | Add optional fields (with defaults), remove fields | Consumers upgrade first — they can read old and new data |
| FORWARD | Old schema can read data written by new schema | Add fields, remove optional fields (with defaults) | Producers upgrade first — old consumers can still read |
| FULL | Both BACKWARD and FORWARD compatible | Add/remove optional fields with defaults only | Strictest — both sides can upgrade independently |
| NONE | No compatibility checking | Any change allowed | Development only — never in production |
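The BACKWARD rule can be sketched as a simplified check (a toy `backward_compatible` helper; real registries also verify type promotions, aliases, and union changes):

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Simplified BACKWARD check: the new (reader) schema can decode data
    written with the old schema iff every field it adds has a default.
    Each dict maps field name -> field definition."""
    added = set(new_fields) - set(old_fields)
    return all("default" in new_fields[name] for name in added)

v1 = {"id": {"type": "string"}, "email": {"type": "string"}}
v2 = {**v1, "phone": {"type": ["null", "string"], "default": None}}
v3 = {**v1, "age": {"type": "int"}}  # new required field, no default

assert backward_compatible(v1, v2) is True   # safe change
assert backward_compatible(v1, v3) is False  # breaking change
```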
Schema Evolution Rules
Safe changes (BACKWARD compatible):
  ✓ Add a new field WITH a default value
  ✓ Remove a field that has a default value
  ✓ Rename a field using an alias (Avro)

Unsafe changes (BREAKING):
  ✗ Remove a required field (no default)
  ✗ Change a field's type (int → string)
  ✗ Rename a field without alias
  ✗ Add a required field without default

Example — Avro schema evolution:

Version 1:
  {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"}
  ]}

Version 2 (BACKWARD compatible — added optional field):
  {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "phone", "type": ["null", "string"], "default": null}  ← optional
  ]}

New consumers can read old records (phone defaults to null).
Old consumers can still read old records (they ignore phone).
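The defaulting behavior described above can be sketched in plain Python (illustrative only; real Avro libraries perform full schema resolution):

```python
def read_with_schema(record: dict, reader_fields: list) -> dict:
    """Decode an old record against a newer reader schema: fields absent
    from the record take the reader schema's default (Avro-style)."""
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"missing field with no default: {field['name']}")
    return out

# Version 2 schema from the example above
v2_fields = [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "phone", "type": ["null", "string"], "default": None},
]
old_record = {"id": "u1", "name": "Ada", "email": "ada@example.com"}
assert read_with_schema(old_record, v2_fields)["phone"] is None
```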

🎯 Always Use BACKWARD Compatibility

BACKWARD is the default and recommended mode. It means consumers can always be upgraded first — they can read both old and new data. This gives you a safe deployment order: upgrade consumers, then upgrade producers. If something goes wrong, consumers still work with old data.

03

Producer Tuning

| Goal | Config | Value | Effect |
|---|---|---|---|
| Max throughput | batch.size | 65536-131072 (64-128 KB) | Larger batches = fewer network calls |
| Max throughput | linger.ms | 20-100 | Wait longer to fill batches |
| Max throughput | compression.type | lz4 or zstd | Reduces network and disk I/O |
| Max throughput | buffer.memory | 67108864 (64 MB) | More buffer for async batching |
| Min latency | linger.ms | 0 | Send immediately, no batching delay |
| Min latency | acks | 1 | Only leader ack (less durable) |
| Max durability | acks | all | All ISR must acknowledge |
| Max durability | enable.idempotence | true | No duplicates from retries |
| Max durability | min.insync.replicas (broker) | 2 | At least 2 replicas confirm |

💡 The Throughput-Latency-Durability Triangle

You can optimize for at most two: high throughput + low latency (sacrifice durability with acks=0), high throughput + durability (sacrifice latency with large batches + acks=all), or low latency + durability (sacrifice throughput with small batches + acks=all). Choose based on your use case.
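The table's settings group naturally into profiles. A sketch using the Java-client property names from the table (values are starting points to benchmark, not universal answers; `producer_config` is a hypothetical helper):

```python
# Tuning profiles assembled from the table above.
THROUGHPUT = {
    "batch.size": 131072,        # 128 KB batches -> fewer network calls
    "linger.ms": 50,             # wait up to 50 ms to fill a batch
    "compression.type": "lz4",   # cheap CPU for big network/disk savings
}
LOW_LATENCY = {
    "linger.ms": 0,              # send immediately, no batching delay
    "acks": "1",                 # leader-only ack (less durable)
}
DURABLE = {
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # no duplicates from retries
    # broker side: min.insync.replicas=2
}

def producer_config(profile: dict) -> dict:
    """Merge a tuning profile over base connection settings."""
    return {"bootstrap.servers": "localhost:9092", **profile}

assert producer_config(DURABLE)["acks"] == "all"
```

Keeping profiles explicit like this makes the throughput/latency/durability trade-off a deliberate, reviewable choice rather than scattered one-off overrides.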

04

Consumer Tuning

| Config | Default | Tuning Direction | Effect |
|---|---|---|---|
| fetch.min.bytes | 1 | Increase (e.g., 1024-65536) | Broker waits to accumulate data → fewer fetches, higher throughput |
| fetch.max.wait.ms | 500 | Increase with fetch.min.bytes | How long broker waits to fill fetch.min.bytes |
| max.partition.fetch.bytes | 1048576 (1 MB) | Increase for large messages | Max data per partition per fetch |
| max.poll.records | 500 | Decrease if processing is slow | Fewer records per poll → less time between polls → avoids rebalance |
| max.poll.interval.ms | 300000 (5 min) | Increase if processing is legitimately slow | Time budget for processing a batch before rebalance |
| session.timeout.ms | 45000 (45 s) | Decrease for faster failure detection | How quickly a dead consumer is detected |
Consumer Scaling Strategy
Scaling consumers for throughput:

  1. Start: 1 consumer, measure processing rate per partition
  2. If consumer can't keep up (lag increasing):
     a. First: optimize processing (batch DB writes, async I/O)
     b. Then: add consumers up to partition count
     c. If still not enough: increase partitions, then add consumers

  Max parallelism = min(partition count, consumer count in group)

  Example:
    Topic: 12 partitions
    1 consumer  → handles 12 partitions (max throughput: 1x processing speed)
    3 consumers → each handles 4 partitions (3x throughput)
    6 consumers → each handles 2 partitions (6x throughput)
    12 consumers → each handles 1 partition (12x throughput — maximum)
    15 consumers → 12 active + 3 idle (wasted resources)
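The parallelism rule above can be sketched with a toy round-robin assignor (real Kafka assignors such as range or cooperative-sticky differ in detail, but the min() bound is the same):

```python
def assign_partitions(partitions: int, consumers: list) -> dict:
    """Round-robin sketch of partition assignment within one consumer
    group. Effective parallelism is min(partitions, len(consumers));
    extra consumers receive nothing and sit idle."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 12 partitions, 15 consumers -> 12 do work, 3 sit idle
a = assign_partitions(12, [f"c{i}" for i in range(15)])
active = [c for c, parts in a.items() if parts]
assert len(active) == 12
```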
05

Broker & OS Tuning

Broker Configuration Essentials
# Threading
num.network.threads = 8          # Threads handling network requests
num.io.threads = 16              # Threads handling disk I/O
num.replica.fetchers = 4         # Threads for follower replication

# Log management
log.segment.bytes = 1073741824   # 1 GB segment files
log.retention.hours = 168        # 7 days retention
log.retention.bytes = -1         # No size limit (use time-based)
log.cleanup.policy = delete      # Or "compact" for changelog topics

# Replication
default.replication.factor = 3
min.insync.replicas = 2
unclean.leader.election.enable = false
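The retention settings above translate directly into disk requirements. A back-of-envelope sizing sketch, with illustrative numbers only:

```python
def disk_per_broker(ingest_mb_per_s: float, retention_hours: int,
                    replication_factor: int, brokers: int,
                    headroom: float = 0.3) -> float:
    """Rough disk need per broker, in GB: ingest rate x retention window
    x replication factor, spread across brokers, plus headroom for
    closed segments awaiting deletion."""
    total_gb = (ingest_mb_per_s * 3600 * retention_hours
                * replication_factor) / 1024
    return total_gb * (1 + headroom) / brokers

# Example: 10 MB/s ingest, 7-day retention (168 h), RF=3, 6 brokers
need_gb = disk_per_broker(10, 168, 3, 6)  # roughly 3.8 TB per broker
```

Segment deletion happens a whole segment (log.segment.bytes) at a time, so actual usage can overshoot the retention target; hence the headroom factor.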

Hardware & OS Considerations

  • ✅ Kafka is I/O bound, not CPU bound — invest in fast disks and network, not more cores
  • ✅ Sequential I/O makes HDDs viable, but SSDs reduce tail latency for random reads (consumer catch-up)
  • ✅ Page cache is Kafka's best friend — allocate 60-70% of RAM to page cache (don't over-allocate JVM heap)
  • ✅ JVM heap: 4-6 GB is usually sufficient; more causes long GC pauses
  • ✅ Network: 10 Gbps minimum for production; Kafka saturates network before CPU
  • ✅ Separate disks for data logs vs OS/application — prevents I/O contention

💡 Why Not More JVM Heap?

Kafka stores data on disk and reads it via the OS page cache — not JVM heap. Giving the JVM 32 GB means 32 GB less for page cache, which actually hurts performance. Keep JVM heap small (4-6 GB) and let the OS use the rest of RAM for caching log segments.

06

Monitoring Key Metrics

| Metric | What It Means | Alert Threshold | Action |
|---|---|---|---|
| UnderReplicatedPartitions | Partitions where followers are behind the leader | > 0 for > 5 min | Check broker health, disk I/O, network. Most important broker alert. |
| OfflinePartitionsCount | Partitions with no available leader | > 0 (critical) | Data is unavailable. Check if brokers are down. Immediate action required. |
| ActiveControllerCount | Number of active controllers in cluster | ≠ 1 | Should always be exactly 1. 0 = no controller, > 1 = split brain. |
| Consumer Lag | Latest offset − committed offset per partition | Increasing trend | Consumer can't keep up. Scale consumers or optimize processing. |
| RequestHandlerAvgIdlePercent | How idle broker request handler threads are (low = saturated) | < 20% | Broker is saturated. Add brokers or reduce load. |
| BytesIn/BytesOut Rate | Network throughput in/out of brokers | Approaching NIC capacity | Network bottleneck. Upgrade NICs or add brokers. |
| ISR Shrink/Expand Rate | How often replicas fall out of / rejoin ISR | Frequent shrinks | Indicates unstable replicas. Check disk, GC, network on affected brokers. |
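Consumer lag from the table is a simple per-partition subtraction. A sketch with stand-in offset maps (in practice these numbers come from the AdminClient APIs or `kafka-consumer-groups.sh --describe`):

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag = log-end (latest) offset minus committed offset.
    Keys are (topic, partition) tuples; a partition with no committed
    offset is treated as lagging from offset 0."""
    return {tp: log_end_offsets[tp] - committed_offsets.get(tp, 0)
            for tp in log_end_offsets}

latest = {("orders", 0): 1500, ("orders", 1): 900}
committed = {("orders", 0): 1480, ("orders", 1): 900}
lag = consumer_lag(latest, committed)  # {("orders", 0): 20, ("orders", 1): 0}
```

Alert on the trend, not the absolute value: a steady lag of 20 is healthy backpressure, while a lag growing every sample means processing has fallen behind.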

⚠️ The Three Critical Alerts

At minimum, alert on: (1) UnderReplicatedPartitions > 0 — earliest signal of broker problems. (2) OfflinePartitionsCount > 0 — data is unavailable, immediate action needed. (3) Consumer lag increasing — processing is falling behind real-time.

07

Security

| Layer | Mechanism | Purpose |
|---|---|---|
| Authentication | SASL/PLAIN, SASL/SCRAM, SASL/GSSAPI (Kerberos), mTLS | Verify identity of clients connecting to brokers |
| Authorization | ACLs (Access Control Lists) | Control who can read/write/describe which topics |
| Encryption in transit | TLS (SSL) | Encrypt data between clients and brokers, and broker-to-broker |
| Encryption at rest | Disk encryption (OS-level) | Kafka doesn't encrypt data on disk natively — use OS/cloud encryption |
ACL Examples
# Allow user "order-service" to produce to "orders" topic
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:order-service \
  --operation Write \
  --topic orders

# Allow user "analytics" to consume from "orders" topic
# (grants Read on both the topic and the consumer group)
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:analytics \
  --operation Read \
  --topic orders \
  --group analytics-group

# Deny all other access (default deny)
# Best practice: start with deny-all, explicitly allow what's needed

# Multi-tenant isolation:
#   - Separate topics per tenant (tenant-a.orders, tenant-b.orders)
#   - ACLs restrict each tenant to their own topics
#   - Quotas limit throughput per client to prevent noisy neighbors
08

Kafka vs Alternatives

| Scenario | Kafka? | Recommended Approach | Why |
|---|---|---|---|
| Simple task queue, no replay needed | Overkill | RabbitMQ, SQS, Redis Streams | Simpler ops, built-in retry/DLQ, no partition management |
| Small scale, low ops overhead | Overkill | Redis Streams, SQS | Kafka's operational cost isn't justified at small scale |
| Exactly-once to external systems | Hard | Transactional outbox + Kafka | Kafka EOS is within-Kafka only; outbox pattern for DB+Kafka atomicity |
| Complex stream processing | Use Kafka + Flink | Flink for processing, Kafka for transport | Flink has superior windowing, CEP, and multi-source support |
| Real-time pub/sub, ephemeral | Overkill | Redis Pub/Sub, WebSockets | No need for durability/replay — lighter solutions work |
| Durable, replayable event log at scale | ✅ Ideal | — | This is exactly what Kafka was built for |
| Multi-region event replication | ✅ Ideal | Kafka MirrorMaker 2 | Built-in cross-cluster replication |
| CDC from databases | ✅ Ideal | Kafka + Debezium | Mature ecosystem, exactly the intended use case |

Kafka in the Cloud

| Option | Ops Burden | Control | Cost |
|---|---|---|---|
| AWS MSK (Managed Kafka) | Low — AWS manages brokers, ZK/KRaft | Medium — limited config options | Pay per broker-hour + storage |
| Confluent Cloud | Lowest — fully serverless option available | Low — opinionated, Schema Registry included | Pay per throughput unit (CKU) |
| Self-hosted (K8s / EC2) | Highest — you manage everything | Full — any config, any version | Infrastructure + ops team cost |
09

Common Mistakes

📋

No Schema Registry in production

Producers and consumers agree on JSON format via documentation or Slack messages. A producer changes a field name and silently breaks all consumers.

✅ Use Schema Registry with Avro/Protobuf and BACKWARD compatibility. It enforces contracts at write time, prevents breaking changes, and provides schema evolution without downtime.

🔨

Using Kafka for simple task queues

Deploying a 3-broker Kafka cluster for a background job queue processing 100 jobs/minute. Massive operational overhead for a simple use case.

✅ Use RabbitMQ or SQS for simple task queues. Kafka's value is durability, replay, and high throughput. If you don't need those, the operational overhead isn't justified.

💾

Ignoring broker disk space

Not monitoring disk usage — broker runs out of space and crashes, triggering cascading failures across the cluster.

✅ Monitor disk usage per broker. Set retention.ms and retention.bytes appropriately. Alert at 70% capacity. A full disk causes broker failure, which triggers cascading rebalances and potential data loss.

🧠

Over-allocating JVM heap

Giving Kafka brokers 32 GB JVM heap thinking more memory = better performance. This actually starves the OS page cache that Kafka depends on.

✅ Keep JVM heap at 4-6 GB. Kafka uses the OS page cache for data, not JVM heap. More heap = less page cache = worse performance + longer GC pauses.

🔓

No encryption or ACLs

Running Kafka with PLAINTEXT listeners and no authentication in production. Any client on the network can read/write any topic.

✅ Enable TLS for encryption in transit. Use SASL for authentication. Configure ACLs to restrict topic access per service. Default-deny, explicitly allow.