LinkedIn Is Moving Beyond Kafka — A Deep Dive into Northguard
The company that created Kafka is building its replacement. We break down the architectural limits they hit, how Northguard solves them, and what it means for the rest of us.
Context — Kafka's Role in the Modern Stack
Before we talk about replacing Kafka, we need to understand why it became irreplaceable in the first place. Kafka solved a problem that every growing company eventually hits: how do you move data between systems in real time without losing anything?
Before Kafka, companies stitched together point-to-point integrations. Service A writes to a database, Service B polls that database, Service C gets a webhook. It works until it doesn't — and it stops working the moment you need ordering guarantees, replay capability, or more than a handful of consumers.
```
Service A ──→ Database ←── Service B (polling)
Service A ──→ Webhook  ──→ Service C
Service A ──→ Queue    ──→ Service D
Service B ──→ REST API ──→ Service E

Problems:
• No single source of truth for event ordering
• Adding a new consumer means modifying the producer
• No replay — if Service C was down, those events are gone
• Scaling consumers independently is nearly impossible
```
Kafka introduced a deceptively simple abstraction: an append-only, partitioned, replicated log. Producers write events to the end of the log. Consumers read from wherever they left off. The log retains data for a configurable period (or forever). Multiple consumers can read the same data independently, at their own pace.
Append-Only Log
Events are immutable once written. This gives you a reliable audit trail and the ability to replay history — something traditional message queues can't do.
Partitioned
A topic is split into partitions, each an independent ordered log. This is how Kafka parallelizes reads and writes — more partitions means more throughput.
Replicated
Each partition is copied across multiple brokers. If a broker dies, another replica takes over as leader. No data loss, no downtime.
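In client code, this abstraction boils down to two operations: append on the producing side, read from an offset on the consuming side. Here is a minimal sketch using the kafkajs client (v2 API); the broker address, topic, and consumer group names are placeholders:

```typescript
// Minimal Kafka produce/consume sketch with kafkajs (v2 API).
// Broker address, topic, and group names are illustrative placeholders.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'example-app', brokers: ['localhost:9092'] });

async function main() {
  // Producer: append an event to the end of the log.
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'user-events',
    messages: [{ key: 'user-123', value: JSON.stringify({ action: 'page_view' }) }],
  });
  await producer.disconnect();

  // Consumer: read from wherever this group left off (or from the beginning,
  // since the log is retained). Other groups read the same data independently.
  const consumer = kafka.consumer({ groupId: 'analytics' });
  await consumer.connect();
  await consumer.subscribe({ topics: ['user-events'], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(`partition=${partition} offset=${message.offset} value=${message.value?.toString()}`);
    },
  });
}

main().catch(console.error);
```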
This model was so effective that Kafka became the backbone of event-driven architecture at thousands of companies. Netflix uses it for real-time recommendations. Uber uses it for trip event processing. Banks use it for transaction streams. It's everywhere.
🔥 The Key Point
Kafka didn't just solve a LinkedIn problem — it defined an entire category. The "distributed commit log" pattern is now foundational to modern data engineering. That's what makes LinkedIn's decision to move beyond it so significant.
What Changed at LinkedIn
When LinkedIn created Kafka in 2010, they had around 90 million members. The platform handled activity feeds, metrics pipelines, and log aggregation. Kafka was designed for this workload and it handled it well.
Fast forward to 2026. LinkedIn has over 1.2 billion members. The platform processes more than 32 trillion messages per day across 150+ Kafka clusters containing over 400,000 topics. To put that in perspective: that's roughly 370 million messages per second, sustained, 24/7.
| Metric | 2010 | 2026 |
|---|---|---|
| Members | 90 million | 1.2 billion (13x) |
| Daily messages | millions | 32 trillion (millions-x) |
| Clusters | single digits | 150+ |
| Topics | hundreds | 400,000+ |
| Partitions | thousands | millions |

The architecture that worked at 90M members doesn't necessarily work at 1.2B members. That's not a flaw — that's physics.
The important thing to understand is that LinkedIn didn't wake up one morning and decide Kafka was bad. They spent over a decade optimizing it. They contributed heavily to the open-source project. They built custom tooling around it. They squeezed every last drop of performance out of the architecture. And then they hit walls that no amount of tuning could fix.
The Skyscraper Foundation
Imagine you built a 10-story building with a foundation designed for 10 stories. Over the years, you've added floors — reinforcing beams, upgrading elevators, strengthening the base. You're now at 50 stories. The building still stands, but every new floor requires exponentially more reinforcement. At some point, it's cheaper and safer to build a new tower with a foundation designed for 100 stories.
💡 This Is Normal
Every successful distributed system eventually hits a scale where its original design assumptions break down. Google replaced GFS with Colossus. Facebook replaced Cassandra (which it created) with HBase for its Messages storage. Twitter replaced its Ruby on Rails monolith with JVM services. LinkedIn replacing Kafka follows the same pattern — it's a sign of growth, not failure.
The Three Architectural Walls
LinkedIn's engineers identified three fundamental design constraints in Kafka that couldn't be solved with configuration changes or incremental improvements. These aren't bugs — they're consequences of architectural decisions that were perfectly reasonable at Kafka's original scale.
Wall 1: The Partition Is the Unit of Everything
In Kafka, a partition is simultaneously the unit of storage, replication, consumption, and rebalancing. This tight coupling is elegant at moderate scale but becomes a liability at extreme scale.
When a partition grows to hundreds of gigabytes (common for high-throughput topics), every operation that involves moving that partition becomes expensive. Adding a broker? You need to copy entire partitions — potentially terabytes of data — across the network. A broker fails? Its partition replicas need to be fully reconstructed elsewhere.
```
Topic: user-activity (high throughput)

Partition 0: 340 GB on Broker A
Partition 1: 280 GB on Broker B
Partition 2: 510 GB on Broker C   ← hot partition
Partition 3: 190 GB on Broker D

Scenario: Add Broker E to the cluster

Step 1: Decide which partitions to move (manual or semi-auto)
Step 2: Copy selected partitions to Broker E
        → Partition 2 alone is 510 GB
        → At 100 MB/s network throughput: ~85 minutes just for one partition
Step 3: During copy: Broker C has elevated disk I/O + network usage
Step 4: Consumers on Partition 2 may see latency spikes
Step 5: If the copy fails midway, start over

Total time for a full rebalance across the cluster: hours to days
```
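The transfer times in that scenario are simple arithmetic, and that is exactly the problem: they scale with partition size, and partition size is unbounded. A quick sketch of the estimate (decimal GB, a single copy stream at full line rate, no throttling):

```typescript
// Back-of-the-envelope: how long does it take to move whole partitions
// over the network? (Decimal units; one copy stream; no throttling.)
function transferMinutes(partitionGB: number, throughputMBps: number): number {
  const seconds = (partitionGB * 1000) / throughputMBps; // GB -> MB, then divide by MB/s
  return seconds / 60;
}

// The hot partition from the scenario above: 510 GB at 100 MB/s
console.log(transferMinutes(510, 100).toFixed(0)); // "85" minutes for one partition

// Every partition of the topic (340 + 280 + 510 + 190 GB)
console.log(transferMinutes(340 + 280 + 510 + 190, 100).toFixed(0)); // "220" minutes
```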
Wall 2: Centralized Metadata Coordination
Kafka originally used ZooKeeper for metadata management, and more recently moved to KRaft (Kafka Raft) to eliminate the ZooKeeper dependency. In both models, a single controller node is responsible for all metadata decisions: which broker leads which partition, where replicas live, handling broker failures.
At LinkedIn's scale — millions of partitions across 150 clusters — this single controller becomes a coordination bottleneck. Leader elections pile up. Topic creation slows down. Broker failure recovery takes longer because the controller is processing a queue of metadata changes.
```
Single Kafka Controller responsibilities:
├── Track state of every partition (millions)
├── Elect new leaders when brokers fail
├── Process topic creation/deletion requests
├── Handle consumer group rebalances
└── Propagate metadata updates to all brokers

At 400,000 topics × avg 10 partitions = 4,000,000 partition states

When Broker X fails:
→ Controller must reassign all partitions led by Broker X
→ While processing reassignments, other operations queue up
→ New topic creation? Wait.
→ Another broker failure? Queue behind the first one.
→ Metadata propagation to 100+ brokers? Delayed.

The controller isn't slow — it's doing too many things sequentially.
```
Wall 3: Static Partition Assignment Creates Skew
Kafka assigns partitions to brokers at creation time. Once assigned, a partition stays on that broker unless explicitly moved. Over time, this creates imbalances: some brokers end up with more data, more traffic, or both. The industry calls this "hot spots."
At small scale, you fix this with manual rebalancing. At LinkedIn's scale, with millions of partitions and constantly shifting traffic patterns, manual rebalancing is like trying to level a swimming pool with a teaspoon.
| Wall | Root Cause | Why Tuning Can't Fix It |
|---|---|---|
| Monolithic partitions | Partition = unit of storage + replication + movement | You can't move half a partition. The coupling is in the design. |
| Centralized metadata | Single controller handles all coordination | You can make the controller faster, but it's still one node doing everything. |
| Static assignment | Partitions are pinned to brokers at creation | Auto-rebalancers exist but still move whole partitions — the cost is inherent. |
🔥 The Fundamental Insight
All three walls trace back to one design decision: the partition is an indivisible unit. Northguard's entire architecture is built around breaking that assumption.
Northguard's Core Idea — Log Striping
If the root problem is that partitions are too large and indivisible, the solution is to make them smaller and divisible. That's exactly what log striping does.
In Northguard, a logical partition still exists as a concept — producers write to it, consumers read from it, ordering is preserved. But under the hood, the data isn't stored as one contiguous blob on a single broker. Instead, it's broken into fixed-size segments (roughly 1 GB each) called "stripes," and those stripes are distributed across multiple brokers.
The Encyclopedia Analogy
Kafka stores each topic-partition like a single massive book — one shelf, one location. If you need to move it, you carry the whole book. Northguard stores the same content as individual chapters spread across different shelves. The table of contents (metadata) tells you where each chapter lives. Moving a chapter is trivial. Rebalancing the library means shuffling a few chapters, not hauling entire encyclopedias.
```
KAFKA — One partition, one broker:

Broker A
┌─────────────────────────────────────────┐
│ Partition 0                             │
│ [offset 0 ─────────────── offset 50M]   │
│ Total: 340 GB, all on Broker A          │
└─────────────────────────────────────────┘

NORTHGUARD — One logical partition, many stripes:

Logical Partition 0 (same offsets, same ordering)
┌────────┐ ┌────────┐ ┌────────┐     ┌────────┐
│Stripe 1│ │Stripe 2│ │Stripe 3│ ... │Stripe N│
│  1 GB  │ │  1 GB  │ │  1 GB  │     │  1 GB  │
│Broker A│ │Broker C│ │Broker B│     │Broker D│
└────────┘ └────────┘ └────────┘     └────────┘

Consumer sees: one continuous ordered log (unchanged API)
System sees:   340 independent 1 GB chunks across the cluster
```
Why 1 GB Changes Everything
The choice of a small, fixed stripe size has cascading benefits across every operational dimension:
Rebalancing becomes trivial
Moving a 1 GB stripe takes seconds, not hours. The system can continuously shuffle stripes in the background to maintain even distribution — no operator intervention needed.
New brokers absorb load immediately
When you add a broker, new stripes are written to it right away. You don't need to wait for a full rebalance. The cluster naturally evens out as new data arrives.
Failure recovery is parallelized
If a broker dies, its stripes are scattered across the cluster. Every surviving broker can participate in re-replicating those stripes simultaneously. Recovery time drops from 'proportional to total data on the failed broker' to 'proportional to data divided by cluster size.'
Hot spots self-heal
If one broker is getting disproportionate read traffic, the system can replicate its hot stripes to other brokers and spread the load. No manual partition reassignment needed.
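To make the self-balancing behavior concrete, here is a toy sketch of load-aware stripe placement. It illustrates the general idea only; it is not Northguard's actual placement algorithm:

```typescript
// Toy load-aware stripe placement (illustrative, not Northguard's algorithm).
// When a new ~1 GB stripe is opened, its replicas go to the least-loaded
// brokers, so a freshly added broker starts absorbing writes immediately.
interface Broker {
  name: string;
  storedGB: number;
}

const brokers: Broker[] = [
  { name: 'broker-a', storedGB: 1200 },
  { name: 'broker-b', storedGB: 1150 },
  { name: 'broker-c', storedGB: 1300 },
  { name: 'broker-e', storedGB: 0 }, // just added to the cluster
];

const STRIPE_GB = 1;
const REPLICATION_FACTOR = 3;

// Choose the least-loaded brokers to host the next stripe's replicas.
function placeNextStripe(cluster: Broker[]): string[] {
  const chosen = [...cluster]
    .sort((a, b) => a.storedGB - b.storedGB)
    .slice(0, REPLICATION_FACTOR);
  for (const b of chosen) b.storedGB += STRIPE_GB; // account for the new stripe
  return chosen.map(b => b.name);
}

console.log(placeNextStripe(brokers)); // [ 'broker-e', 'broker-b', 'broker-a' ]
```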
💡 If This Sounds Familiar...
Log striping is conceptually similar to RAID striping in storage systems, or to how HDFS breaks files into 128 MB blocks distributed across a cluster. The insight isn't new — what's new is applying it to an event streaming log while preserving partition ordering guarantees and consumer API compatibility.
“Striping breaks ordering guarantees — consumers will see events out of order”
No. The stripes within a logical partition are strictly ordered. Stripe 1 contains offsets 0–N, Stripe 2 contains offsets N+1–M, and so on. A consumer reads stripes sequentially, just like reading chapters of a book in order. The ordering guarantee is preserved at the logical level — the physical distribution is invisible to consumers.
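For intuition, here is a toy illustration (not Northguard's actual data structures) of how that works: the metadata records which contiguous offset range each stripe owns, so any logical offset resolves to exactly one stripe, and reading stripes in offset order is identical to reading one continuous log:

```typescript
// Toy stripe index: each stripe owns a contiguous, non-overlapping offset
// range, so the logical partition stays totally ordered even though the
// stripes live on different brokers. Illustrative only.
interface Stripe {
  id: number;
  broker: string;
  startOffset: number; // first offset in this stripe (inclusive)
  endOffset: number;   // last offset in this stripe (inclusive)
}

// The "table of contents" for one logical partition, in offset order.
const stripes: Stripe[] = [
  { id: 1, broker: 'broker-a', startOffset: 0,         endOffset: 999_999 },
  { id: 2, broker: 'broker-c', startOffset: 1_000_000, endOffset: 1_999_999 },
  { id: 3, broker: 'broker-b', startOffset: 2_000_000, endOffset: 2_999_999 },
];

// Resolve a logical offset to the stripe that holds it (binary search at scale).
function stripeFor(offset: number): Stripe | undefined {
  return stripes.find(s => offset >= s.startOffset && offset <= s.endOffset);
}

// A consumer at offset 1,500,000 fetches from broker-c; once it passes
// 1,999,999 it continues with stripe 3 on broker-b. Ordering is unchanged.
console.log(stripeFor(1_500_000)?.broker); // "broker-c"
```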
Sharded Metadata via Raft
The second major architectural change in Northguard addresses the centralized controller problem. Instead of one node managing all metadata, Northguard distributes metadata across multiple independent groups, each running the Raft consensus protocol.
Quick Primer: What Is Raft?
Raft is a consensus algorithm that allows a group of nodes to agree on a shared state, even if some nodes fail. It works by electing a leader within the group. The leader accepts writes, replicates them to followers, and commits once a majority acknowledges. If the leader dies, the remaining nodes elect a new one — typically within milliseconds.
The key property: as long as a majority of nodes in the group are alive, the group continues operating correctly. This is much more resilient than a single-node controller.
```
KAFKA (KRaft mode):

┌──────────────────────────────────────────┐
│ Single Controller Quorum                 │
│                                          │
│ Manages ALL metadata for ALL topics:     │
│ • 400,000 topics                         │
│ • 4,000,000 partitions                   │
│ • All leader elections                   │
│ • All broker state                       │
│                                          │
│ Throughput ceiling: one quorum's worth   │
└──────────────────────────────────────────┘

NORTHGUARD:

┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Raft Group 1 │ │ Raft Group 2 │ │ Raft Group 3 │ ...
│              │ │              │ │              │
│ Topics A–D   │ │ Topics E–H   │ │ Topics I–L   │
│ 3 replicas   │ │ 3 replicas   │ │ 3 replicas   │
│              │ │              │ │              │
│ Independent  │ │ Independent  │ │ Independent  │
│ leader       │ │ leader       │ │ leader       │
└──────────────┘ └──────────────┘ └──────────────┘

Each group handles a subset of topics independently.
Total metadata throughput = sum of all groups.
Failure in Group 2 doesn't affect Groups 1 or 3.
```
Why Sharding Metadata Matters
The benefit isn't just fault isolation — it's horizontal scalability of the control plane itself. Need to handle more topics? Add more Raft groups. Each group operates independently, so the total metadata throughput scales linearly with the number of groups.
Compare this to Kafka's approach: even with KRaft replacing ZooKeeper, you still have a single quorum handling all metadata. You can make that quorum faster (better hardware, optimized code), but you can't shard it. It's a vertical scaling ceiling on the control plane.
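To see why the control plane scales horizontally, consider the simplest possible sharding scheme: hash each topic to one of N metadata groups. Northguard's real assignment scheme isn't public; the sketch below only illustrates the scaling property:

```typescript
// Hypothetical topic-to-Raft-group assignment by hashing (illustrative only;
// not Northguard's actual scheme). Each group serializes metadata operations
// for its own topics, independently of every other group.
const RAFT_GROUP_COUNT = 16;

function raftGroupFor(topic: string): number {
  let hash = 0;
  for (const ch of topic) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // 32-bit rolling hash
  }
  return hash % RAFT_GROUP_COUNT;
}

// Topic creation, leader election, and stripe placement for 'user-events'
// only ever touch its owning group; the other 15 groups are unaffected.
console.log(raftGroupFor('user-events')); // deterministic index in [0, 15]
console.log(raftGroupFor('page-views'));  // likely a different group

// Need more metadata throughput? Raise RAFT_GROUP_COUNT (and move ownership
// of some topics). Total throughput grows with the number of groups.
```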
| Property | Kafka (KRaft) | Northguard (Sharded Raft) |
|---|---|---|
| Metadata scaling | Vertical — faster single quorum | Horizontal — add more Raft groups |
| Failure blast radius | All topics affected | Only topics in the failed group |
| Leader election scope | Cluster-wide | Per-group (milliseconds) |
| Topic creation throughput | Bounded by single quorum | Parallelized across groups |
| Operational complexity | One quorum to monitor | Multiple groups to monitor (but each is simpler) |
🔥 The Tradeoff
Sharded metadata adds operational complexity — you now have many Raft groups to monitor instead of one controller. But at LinkedIn's scale, the alternative (a single controller managing 4 million partitions) is worse. This is a classic distributed systems tradeoff: simplicity vs. scalability. At some point, you have to choose scalability.
Xinfra — Migrating Without a Big Bang
Building a better streaming platform is one thing. Migrating 32 trillion daily messages to it without downtime is another thing entirely. This is arguably the most impressive part of the whole effort — and the part most engineers underestimate.
LinkedIn's solution is Xinfra: a virtualized Pub/Sub abstraction layer that sits between applications and the underlying streaming infrastructure. Applications talk to Xinfra through a unified API. Xinfra decides whether to route messages to Kafka, Northguard, or both.
The Power Adapter Analogy
Imagine your house has 200 appliances all plugged directly into the wall with hardwired connections. Now you need to replace the electrical grid. Do you rewire every appliance? No — you install universal outlets first. Once every appliance uses a standard plug, you can swap the grid behind the wall without touching a single appliance. Xinfra is the universal outlet.
The Four-Phase Migration
LinkedIn follows a careful, reversible migration strategy. Each phase builds confidence before moving to the next:
Adopt Xinfra API
Applications are migrated from the raw Kafka client to the Xinfra API. This is a code change, but it's behavior-preserving — Xinfra initially routes everything to Kafka. The app doesn't know the difference.
Dual-write to both systems
For a given topic, Xinfra starts writing to both Kafka and Northguard simultaneously. Engineers compare the two systems' behavior: latency, throughput, data integrity. Any discrepancy is investigated before proceeding.
Shadow-read from Northguard
Consumers read from Northguard in parallel with Kafka, but only Kafka's results are used for production decisions. Northguard's results are compared for correctness. This catches subtle bugs like ordering differences or missing records.
Cutover
Once the topic has been validated in dual-write and shadow-read for a sufficient period, Xinfra flips the routing. The application's code doesn't change — only the Xinfra configuration does. And it's reversible: if something goes wrong, flip back to Kafka.
```
Application code (unchanged across all phases):

  xinfra.produce("user-events", event);
  xinfra.consume("user-events", handler);

Phase 1 — Xinfra config:
  user-events → backend: kafka

Phase 2 — Xinfra config:
  user-events → backend: kafka + northguard (dual-write)

Phase 3 — Xinfra config:
  user-events → write: kafka + northguard
                read:  kafka (primary) + northguard (shadow)

Phase 4 — Xinfra config:
  user-events → backend: northguard

Rollback at any phase:
  user-events → backend: kafka
```
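For illustration, here is a simplified sketch of the per-topic routing such a layer could apply on the produce path. The Backend interface, phase names, and configuration shape are hypothetical; they are not LinkedIn's actual Xinfra API:

```typescript
// Simplified sketch of phase-based routing behind a unified produce API.
// Interfaces and phase names are hypothetical, not LinkedIn's Xinfra API.
interface Backend {
  produce(topic: string, payload: string): Promise<void>;
}

type Phase = 'kafka-only' | 'dual-write' | 'shadow-read' | 'northguard-only';

class RoutingProducer {
  constructor(
    private kafka: Backend,
    private northguard: Backend,
    private phases: Map<string, Phase>, // per-topic config; flipped without code changes
  ) {}

  async produce(topic: string, payload: string): Promise<void> {
    const phase = this.phases.get(topic) ?? 'kafka-only';
    switch (phase) {
      case 'kafka-only':
        return this.kafka.produce(topic, payload);
      case 'dual-write':
      case 'shadow-read': // writes stay dual while shadow reads are validated
        // Write to both systems so their contents can be compared for parity.
        await Promise.all([
          this.kafka.produce(topic, payload),
          this.northguard.produce(topic, payload),
        ]);
        return;
      case 'northguard-only': // post-cutover; rollback = flip the config back
        return this.northguard.produce(topic, payload);
    }
  }
}
```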
💡 The Pattern to Study
This is the Strangler Fig pattern applied to infrastructure. You don't replace the old system — you grow the new system around it, gradually redirecting traffic until the old system can be decommissioned. It's the same pattern used for database migrations, monolith-to-microservice transitions, and cloud migrations. If you take one thing from this article, let it be this migration strategy.
As of early 2026, roughly 90% of LinkedIn's applications have been migrated to the Xinfra API. The hardest part — decoupling applications from the Kafka client — is largely complete. The remaining work is routing individual topics from Kafka to Northguard, which is a configuration change per topic.
Architecture Comparison
Let's put the two systems side by side across the dimensions that matter most for operating a streaming platform at scale.
| Dimension | Apache Kafka | Northguard |
|---|---|---|
| Storage unit | Monolithic partition (unbounded size) | Fixed 1 GB stripes |
| Rebalancing cost | Proportional to partition size (GBs–TBs) | Proportional to stripe size (1 GB) |
| Rebalancing trigger | Manual or semi-automated | Continuous and automatic |
| Metadata architecture | Single controller quorum (KRaft) | Sharded Raft groups |
| Metadata scaling | Vertical only | Horizontal — add more groups |
| Broker addition | Requires partition reassignment | New stripes auto-route to new broker |
| Failure recovery | Re-replicate full partitions | Re-replicate small stripes in parallel |
| Consumer API | Kafka consumer protocol | Xinfra abstraction (Kafka-compatible) |
| Ecosystem | Massive — 14 years of connectors, tools, managed services | Internal to LinkedIn |
| Availability | Open source, multiple managed offerings | Not publicly available |
Where Each System Excels
🟢 Kafka's Strengths
- ✓ Battle-tested across thousands of production deployments worldwide
- ✓ Rich ecosystem: Kafka Connect, Kafka Streams, ksqlDB, Schema Registry
- ✓ Multiple managed services: Confluent Cloud, AWS MSK, Redpanda, Aiven
- ✓ Extensive documentation, community support, and hiring pool
- ✓ Well-understood operational model — teams know how to run it
🔵 Northguard's Strengths
- ✓ Designed from scratch for planet-scale (trillions of messages/day)
- ✓ Zero-downtime rebalancing via small, movable stripes
- ✓ Horizontally scalable control plane — no single-controller ceiling
- ✓ Self-healing: automatic load distribution without operator intervention
- ✓ Faster failure recovery through parallelized stripe re-replication
🔥 The Honest Assessment
Northguard solves problems that most companies will never have. Its advantages only materialize at extreme scale — hundreds of thousands of topics, millions of partitions, trillions of daily messages. Below that threshold, Kafka's ecosystem advantages far outweigh Northguard's architectural advantages.
What This Means for Your Stack
Let's be direct: unless you work at one of maybe 20 companies on Earth, this changes nothing about your technology choices today. Northguard is not available outside LinkedIn. You cannot download it, deploy it, or buy it as a managed service.
But the ideas behind it are worth understanding, because they reveal where the streaming ecosystem is heading — and because the migration strategy (Xinfra) is a pattern you can apply regardless of scale.
Kafka remains the right choice when...
- ✅ Your daily message volume is in the billions, not trillions
- ✅ You have fewer than a few thousand topics
- ✅ Rebalancing is infrequent and tolerable (minutes, not days)
- ✅ You benefit from the Kafka Connect ecosystem for integrations
- ✅ Your team's operational expertise is built around Kafka
- ✅ You use a managed service (Confluent, MSK) and don't manage brokers directly
Start thinking about alternatives when...
- ❌ Rebalancing operations regularly cause production incidents
- ❌ Your controller node is a recurring bottleneck in incident reports
- ❌ You have a dedicated platform team of 20+ engineers just for streaming
- ❌ Adding brokers is a multi-day project instead of a routine operation
- ❌ You're spending more time operating Kafka than building on top of it
Alternatives You Can Actually Use Today
If you're hitting Kafka's limits but can't build your own Northguard, there are production-ready alternatives that address some of the same pain points:
Redpanda
A Kafka-compatible streaming platform written in C++ with a thread-per-core architecture. Eliminates the JVM, reduces tail latency, and simplifies operations. Drop-in replacement for Kafka's protocol.
WarpStream
A Kafka-compatible system that separates compute from storage by writing directly to object storage (S3). Zero local disks means zero disk management, zero rebalancing. Trades some latency for massive operational simplicity.
Apache Pulsar
Separates serving (brokers) from storage (BookKeeper). This decoupling gives it some of the same rebalancing benefits as Northguard's log striping, though with a different implementation.
Confluent Cloud (Kora)
Confluent's managed Kafka runs on Kora, their custom storage engine that also uses tiered storage and decoupled compute. You get some of Northguard's benefits without leaving the Kafka ecosystem.
The Transferable Lesson: Build Abstraction Layers Early
The most practical takeaway from LinkedIn's story isn't about Northguard — it's about Xinfra. By wrapping their Kafka dependency behind an abstraction layer, they made the underlying infrastructure swappable. This is a pattern you can apply at any scale:
```js
// ❌ Tight coupling — every service imports the Kafka client directly
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ brokers: ['broker1:9092'] });
const producer = kafka.producer();
await producer.send({ topic: 'user-events', messages: [{ value: payload }] });

// ✅ Abstraction layer — services import your messaging interface
import { messageBus } from '@/lib/messaging';

await messageBus.publish('user-events', payload);

// The implementation behind messageBus can be:
// - KafkaJS today
// - Redpanda tomorrow
// - An in-memory bus for testing
// - A dual-write setup during migration
// No application code changes required.
```
💡 You Don't Need LinkedIn's Scale for This
Even at modest scale, wrapping your message broker behind an interface pays dividends. It makes testing easier (swap in an in-memory implementation), makes migrations possible (dual-write during cutover), and prevents vendor lock-in. LinkedIn just proved it works at the extreme end of the spectrum.
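One possible shape for that interface, along with the in-memory test double mentioned above (illustrative only, not any specific library's API):

```typescript
// A minimal messaging interface plus an in-memory test double (illustrative).
type Handler = (payload: unknown) => Promise<void> | void;

interface MessageBus {
  publish(topic: string, payload: unknown): Promise<void>;
  subscribe(topic: string, handler: Handler): void;
}

// In-memory implementation: no broker required, handy for unit tests.
class InMemoryBus implements MessageBus {
  private handlers = new Map<string, Handler[]>();

  async publish(topic: string, payload: unknown): Promise<void> {
    for (const handler of this.handlers.get(topic) ?? []) {
      await handler(payload);
    }
  }

  subscribe(topic: string, handler: Handler): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }
}

// Application code depends only on MessageBus; swapping in a Kafka-backed,
// Redpanda-backed, or dual-writing implementation is a wiring change.
const bus: MessageBus = new InMemoryBus();
bus.subscribe('user-events', p => console.log('received', p));
bus.publish('user-events', { action: 'signup' }).catch(console.error);
```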
Common Questions
Q: Will Northguard be open-sourced?
A: LinkedIn has hinted at it but made no commitment. They have a track record of open-sourcing infrastructure projects (Kafka itself, Samza, Brooklin, Gobblin), so it's plausible. But even if they do, adoption would take years — building an ecosystem around a new streaming platform is a massive undertaking.
Q: Should I stop learning Kafka?
A: No. Kafka is the industry standard and will remain so for the foreseeable future. Understanding Kafka's architecture — partitions, consumer groups, replication, exactly-once semantics — is foundational knowledge that transfers to any streaming system, including Northguard.
Q: Is this similar to the ZooKeeper → KRaft migration?
A: In spirit, yes — both replace a core subsystem. But the scope is different. KRaft replaced Kafka's metadata store while keeping the storage engine intact. Northguard replaces the storage engine, the metadata system, and the rebalancing model. It's a much deeper architectural change.
Q: Could Kafka adopt log striping itself?
A: Theoretically, but it would be an enormous change to Kafka's core. The partition-as-unit-of-everything assumption is deeply embedded in Kafka's codebase, protocol, and ecosystem tooling. Retrofitting log striping would essentially mean rewriting Kafka — which is what LinkedIn did by building Northguard.
Q: How does this compare to Pulsar's architecture?
A: Apache Pulsar also separates serving from storage (using BookKeeper), which gives it some similar rebalancing benefits. The key difference is that Northguard was designed specifically for LinkedIn's workload and integrates with their Xinfra migration layer. Pulsar is a general-purpose alternative you can use today.
Q: What happens to Kafka's open-source community?
A: Nothing changes. LinkedIn is one contributor among many. Confluent, the company founded by Kafka's original creators, continues to drive the open-source project. The KRaft migration, tiered storage, and other improvements continue regardless of what LinkedIn does internally.
Key Takeaways
LinkedIn's move beyond Kafka is a landmark moment in data infrastructure. Not because Kafka is dying — it isn't — but because it demonstrates what happens when a system designed for one era of scale meets the demands of the next.
🧠 The Three Ideas Worth Remembering
1. Log striping breaks the "partition = indivisible unit" assumption. By splitting logs into small, fixed-size segments distributed across brokers, rebalancing goes from a multi-hour operation to a background process that takes seconds. This is the single biggest architectural insight in Northguard.
2. Sharding the control plane removes the metadata ceiling. Instead of one controller managing millions of partitions, independent Raft groups each handle a subset. The control plane scales horizontally, just like the data plane.
3. The migration strategy matters as much as the new system. Xinfra's abstraction layer — wrap, dual-write, shadow-read, cutover — is a reusable pattern for replacing any critical infrastructure without downtime. This is the most practically useful idea in the entire story.
Quick Revision Cheat Sheet
Why LinkedIn moved on: Monolithic partitions, single controller, and static assignment don't scale to 32T messages/day
Log striping: Break partitions into 1 GB stripes distributed across brokers — rebalancing becomes trivial
Sharded metadata: Replace single controller with independent Raft groups, each managing a subset of topics
Xinfra: Abstraction layer between apps and streaming infra — enables gradual, reversible migration
Migration pattern: Wrap → dual-write → shadow-read → cutover (Strangler Fig pattern)
Current status: 90% of LinkedIn apps on Xinfra, topic-by-topic cutover to Northguard in progress
Impact on you: Kafka remains the right choice for most. Study the migration pattern — it's universally applicable
Alternatives today: Redpanda, WarpStream, Pulsar, Confluent Kora — all address some of the same pain points
Your team is evaluating streaming platforms for a new project
Should you choose Kafka or wait for Northguard?
Answer: Choose Kafka (or a Kafka-compatible alternative like Redpanda). Northguard isn't available, and Kafka's ecosystem is unmatched. But wrap your messaging behind an abstraction layer from day one — you'll thank yourself later.
You're hitting Kafka scaling limits at your company
Should you build your own Northguard?
Answer: Almost certainly not. First exhaust Kafka tuning (partition count, compression, batching, tiered storage). Then evaluate managed alternatives (Confluent Kora, WarpStream). Building a custom streaming platform requires a dedicated team of dozens of engineers over multiple years.
You're preparing for a system design interview
How should you talk about this?
Answer: Mention it as an example of how even well-designed systems hit architectural limits at extreme scale. Discuss the specific tradeoffs (monolithic partitions vs. striping, centralized vs. sharded metadata). Show you understand that the right architecture depends on the scale you're designing for.