LinkedIn Created Kafka, But It is Ditching It for Something Better

Fifteen years ago, LinkedIn engineers gave the world Kafka, a resilient, distributed event-streaming platform that was subsequently open-sourced and widely adopted across the industry.

Now, the company is working to replace it. After pushing Kafka to its operational limits while serving over a billion users and processing trillions of events, LinkedIn has unveiled Northguard and Xinfra, systems designed to take ordered log storage and the publish/subscribe (Pub/Sub) pattern, which decouples data producers from data consumers, further than Kafka could.

Kafka isn’t going away overnight. It still works, and it works well. However, LinkedIn believes that the scale and complexity of its operations have outgrown Kafka’s original design.

Why Kafka Couldn’t Keep Up

Kafka was originally designed for a much smaller LinkedIn. The company says it had 90 million members back in 2010; today, that number has ballooned to over 1.2 billion. Along the way, Kafka came to carry over 32 trillion records a day, 17 petabytes of data, and hundreds of thousands of topics stretched across 150 clusters. Running Kafka at this scale wasn’t just difficult; it took constant effort just to keep systems functional.

As Kafka scaled, several cracks began to appear. Metadata bottlenecks, resource skews, replication delays, and limited durability forced teams to make compromises. Partition-based replication struggled with balancing consistency and availability.

Adding new brokers or restoring replication factors involved painful data moves. To operate Kafka smoothly, LinkedIn ran an entire ecosystem of support services—some of which were as complex as Kafka itself.

“We needed a system that scales well not just in terms of data, but also in terms of its metadata and cluster size, all while supporting lights-out operations with even load distribution by design and fast cluster deployments, regardless of scale,” the company said.

Enter Northguard and Xinfra

Northguard approaches log storage differently. Instead of treating logs as monolithic partitions, it breaks them into smaller, self-contained units called segments and ranges. 

This fine-grained design enables log striping, a built-in mechanism for balancing workloads across the cluster without manual intervention. New brokers can be added without reshuffling old data. Fault tolerance improves because producers can skip failed segments and continue writing to new ones.
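To make the idea concrete, here is a minimal sketch of what striping a log’s segments across brokers could look like. The `Range`, `Segment`, and round-robin placement below are illustrative assumptions, not Northguard’s actual data structures or API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a log split into ranges, each range made of segments,
// with segments striped round-robin across brokers. Names are illustrative,
// not Northguard's real implementation.
public class LogStripingSketch {

    record Segment(int rangeId, int segmentId, String broker) {}

    static List<Segment> stripe(int ranges, int segmentsPerRange, List<String> brokers) {
        List<Segment> placement = new ArrayList<>();
        int next = 0;
        for (int r = 0; r < ranges; r++) {
            for (int s = 0; s < segmentsPerRange; s++) {
                // Round-robin placement spreads load evenly by construction, so
                // adding a broker only affects segments created afterwards.
                String broker = brokers.get(next++ % brokers.size());
                placement.add(new Segment(r, s, broker));
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> brokers = List.of("broker-1", "broker-2", "broker-3");
        stripe(2, 4, brokers).forEach(seg ->
            System.out.printf("range %d / segment %d -> %s%n",
                seg.rangeId(), seg.segmentId(), seg.broker()));
    }
}
```

Because placement is decided per segment rather than per whole log, no single broker has to absorb an entire topic, which is the property that makes rebalancing and failure handling cheaper.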

To manage this complexity, Northguard introduces a highly distributed metadata system. Rather than relying on a single controller, it uses a network of vnode leaders, each managing a slice of metadata via Raft-based state machines. 
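LinkedIn hasn’t published the vnode layout in detail, but the ownership model can be sketched roughly as follows. The hash-based key-to-vnode mapping and the static leader list are simplifying assumptions; in a real system each vnode’s leader would be elected by its own Raft group.

```java
import java.util.List;

// Illustrative sketch of sharded metadata ownership: metadata keys are hashed
// onto virtual nodes (vnodes), each of which would be backed by its own Raft
// group. Names and layout are assumptions, not Northguard's implementation.
public class VnodeMetadataSketch {

    static final int VNODE_COUNT = 8;

    // Each vnode has a current leader; in practice the leader is elected by
    // the vnode's Raft group rather than assigned statically as done here.
    static final List<String> VNODE_LEADERS = List.of(
        "node-a", "node-b", "node-c", "node-d",
        "node-a", "node-b", "node-c", "node-d");

    static int vnodeFor(String metadataKey) {
        // Stable hash -> vnode id; spreads metadata traffic instead of
        // funneling every update through a single controller.
        return Math.floorMod(metadataKey.hashCode(), VNODE_COUNT);
    }

    public static void main(String[] args) {
        for (String key : List.of("topic:page-views", "topic:ad-clicks", "topic:metrics")) {
            int vnode = vnodeFor(key);
            System.out.printf("%s -> vnode %d (leader %s)%n",
                key, vnode, VNODE_LEADERS.get(vnode));
        }
    }
}
```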

But the more radical shift is Xinfra, a virtualised Pub/Sub layer that overlays both Kafka and Northguard. It allows applications to interact with a unified API, while hiding the details of the underlying system. 
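From an application’s point of view, the promise is a single client interface that hides which backend a topic currently lives on. The sketch below is a hypothetical illustration of that idea; `PubSubClient`, `VirtualClient`, and the routing map are invented names, not Xinfra’s real API.

```java
import java.util.Map;

// Hypothetical facade: applications code against one publish() call and the
// virtualized layer decides which physical backend a topic lives on.
public class UnifiedPubSubSketch {

    interface PubSubClient {
        void publish(String topic, String key, String value);
    }

    enum Backend { KAFKA, NORTHGUARD }

    static class VirtualClient implements PubSubClient {
        private final Map<String, Backend> topicPlacement;

        VirtualClient(Map<String, Backend> topicPlacement) {
            this.topicPlacement = topicPlacement;
        }

        @Override
        public void publish(String topic, String key, String value) {
            // The application never sees this routing decision.
            Backend backend = topicPlacement.getOrDefault(topic, Backend.KAFKA);
            System.out.printf("publish(%s, key=%s) routed to %s%n", topic, key, backend);
        }
    }

    public static void main(String[] args) {
        PubSubClient client = new VirtualClient(
            Map.of("page-views", Backend.NORTHGUARD, "ad-clicks", Backend.KAFKA));
        client.publish("page-views", "member-42", "viewed profile");
        client.publish("ad-clicks", "member-42", "clicked ad");
    }
}
```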

“Xinfra topics” can span Kafka and Northguard clusters, allowing migrations to happen gradually. During a migration, new data is written to both systems while older data is still read from the original cluster, which preserves ordering and allows safe rollbacks without interruptions.
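A rough sketch of that dual-write phase is below; the phase names and the rule for when reads switch over are assumptions made for illustration, not LinkedIn’s actual migration logic.

```java
// Sketch of the dual-write migration phase described above: while a topic is
// migrating, every new record goes to both clusters, but consumers keep
// reading from the original cluster until its backlog drains.
public class DualWriteMigrationSketch {

    enum Phase { KAFKA_ONLY, DUAL_WRITE, NORTHGUARD_ONLY }

    static void produce(Phase phase, String topic, String payload) {
        switch (phase) {
            case KAFKA_ONLY -> System.out.printf("[%s] %s -> Kafka only%n", topic, payload);
            case DUAL_WRITE -> {
                // Writing to both systems keeps each log internally ordered and
                // makes rollback safe: Kafka still holds a complete copy.
                System.out.printf("[%s] %s -> Kafka%n", topic, payload);
                System.out.printf("[%s] %s -> Northguard%n", topic, payload);
            }
            case NORTHGUARD_ONLY -> System.out.printf("[%s] %s -> Northguard only%n", topic, payload);
        }
    }

    static String readSource(Phase phase, boolean kafkaBacklogDrained) {
        // Reads stay on the original Kafka cluster until older data is consumed.
        if (phase == Phase.NORTHGUARD_ONLY) return "Northguard";
        if (phase == Phase.DUAL_WRITE && kafkaBacklogDrained) return "Northguard";
        return "Kafka";
    }

    public static void main(String[] args) {
        produce(Phase.DUAL_WRITE, "page-views", "profile-view event");
        System.out.println("reads served from: " + readSource(Phase.DUAL_WRITE, false));
    }
}
```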

The company claims that more than 90% of LinkedIn applications already use Xinfra clients, and thousands of topics have been quietly shifted to Northguard. 

“The migration is transparent to users, and the migration state is delivered via Xinfra topic metadata update to the client,” writes the company.

The motivation behind Xinfra is rooted in lessons from Kafka, where infrastructure growth isn’t transparent to applications, making migration a nightmare. By virtualising the entire layer, Xinfra separates physical deployment concerns from application logic, just like virtual machines once did for bare-metal servers.

The Future of the Infrastructure 

Kafka isn’t obsolete. It remains a key piece of open-source infrastructure used worldwide and sits at the core of Confluent’s platform, which companies such as Swiggy rely on.

But for LinkedIn, the path forward is one of reinvention. Northguard is tailored for a world where logs aren’t just about append-only durability, but also about elasticity, observability, and high availability at scale.

Xinfra, meanwhile, hints at a future where Pub/Sub systems behave more like cloud-native abstractions: elastic, pluggable, and mostly invisible. With plans to add auto-scaling topics and more resilient virtual operations, LinkedIn’s infra team is engineering itself out of manual labour.

Ironically, the company that once built Kafka is now building something Kafka was never meant to be: a Pub/Sub cloud that runs itself.

