Running Kafka At Scale

Todd Palino
10 min read, intermediate

Overview

The article 'Running Kafka At Scale' describes how LinkedIn uses Apache Kafka as its central messaging system for moving very large volumes of data. It covers Kafka's architecture, the types of messages LinkedIn sends through it, and the work required to keep throughput and reliability high across multiple clusters.

What You'll Learn

1. How to implement a tiered Kafka architecture for data management
2. Why auditing message completeness is critical in Kafka systems
3. When to separate message types into different Kafka clusters

Prerequisites & Requirements

  • Understanding of messaging systems and data streaming concepts
  • Familiarity with Apache Kafka and its ecosystem (optional)

Key Questions Answered

What is Apache Kafka and how does it function?
Apache Kafka is a publish/subscribe messaging system that combines queuing with message retention on disk. Messages are organized into topics, and each topic is split into partitions, so multiple producers and consumers can write and read the same data in parallel. This design is what gives Kafka its reliability and high throughput.
How does LinkedIn manage over 800 billion messages daily with Kafka?
LinkedIn's Kafka infrastructure handles over 800 billion messages per day, and consumers read more than 650 terabytes of data from it daily. The load is spread across more than 1,100 Kafka brokers organized into more than 60 clusters.
What are the four categories of messages used in LinkedIn's Kafka system?
LinkedIn categorizes messages into queuing, metrics, logs, and tracking data. Respectively, these are used to coordinate actions between applications, monitor system health, aggregate logs, and track user interactions.
How does LinkedIn ensure message completeness in Kafka?
LinkedIn uses an internal tool called Kafka Audit to verify that every message produced is copied to every tier without loss. It compares the message counts reported by producers with the counts observed by consumers, so a mismatch points to lost or duplicated messages somewhere in the Kafka infrastructure.
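
The count comparison described above can be sketched as follows. This is a minimal illustration of the idea, not LinkedIn's Kafka Audit implementation: the 10-minute window, the tier names, and the functions `bucket`, `count_by_window`, and `audit` are all assumptions made for the example.

```python
from collections import Counter

# Assumed audit window size in seconds (hypothetical, chosen for the example).
WINDOW_SECONDS = 600


def bucket(timestamp):
    """Map a message timestamp to the start of its audit window."""
    return timestamp - timestamp % WINDOW_SECONDS


def count_by_window(timestamps):
    """Count messages per audit window."""
    return Counter(bucket(ts) for ts in timestamps)


def audit(produced, tiers):
    """Compare per-window counts at each tier against the producer counts.

    Returns {tier: {window_start: difference}} for mismatching windows only.
    A negative difference suggests loss, a positive one duplication.
    """
    expected = count_by_window(produced)
    report = {}
    for tier, seen in tiers.items():
        actual = count_by_window(seen)
        diffs = {
            window: actual.get(window, 0) - count
            for window, count in expected.items()
            if actual.get(window, 0) != count
        }
        if diffs:
            report[tier] = diffs
    return report


# Timestamps (in seconds) of messages the producers report having sent.
produced = [0, 10, 599, 600, 601]
# Timestamps of messages observed at each tier (hypothetical tier names).
tiers = {
    "local": [0, 10, 599, 600, 601],
    "aggregate": [0, 10, 600, 601],  # the message at t=599 never arrived
}
result = audit(produced, tiers)
print(result)
```

Here the "aggregate" tier is one message short in the window starting at 0, so the report flags that tier and window while the complete "local" tier produces no entry.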

Key Statistics & Figures

Messages processed daily
800 billion
This figure highlights the scale at which LinkedIn operates its Kafka infrastructure.
Data consumed daily
650 terabytes
This statistic emphasizes the volume of data handled by LinkedIn's Kafka system.
Messages received at peak times
13 million messages per second
This peak throughput showcases Kafka's ability to handle high-load scenarios efficiently.
Data throughput at peak times
2.75 gigabytes per second
This measure indicates the speed at which data is processed during peak usage.
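
To put these figures in relation to each other, a short back-of-the-envelope calculation helps. It is only a rough sketch: the inputs are the lower bounds quoted above ("over 800 billion"), and it assumes a gigabyte means 10^9 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 800e9        # "over 800 billion" messages per day
peak_messages_per_sec = 13e6    # 13 million messages per second at peak
peak_bytes_per_sec = 2.75e9     # 2.75 GB/s at peak, assuming 1 GB = 10**9 bytes

# Average rate implied by the daily total.
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY

# How much higher the peak is than the daily average.
peak_to_avg = peak_messages_per_sec / avg_messages_per_sec

# Average message size implied by the two peak figures.
bytes_per_message_at_peak = peak_bytes_per_sec / peak_messages_per_sec

print(f"average rate: {avg_messages_per_sec:,.0f} messages/s")
print(f"peak / average: {peak_to_avg:.2f}")
print(f"bytes per message at peak: {bytes_per_message_at_peak:.0f}")
```

Under these assumptions the daily total works out to roughly 9.3 million messages per second on average, the peak is about 1.4 times that, and the average message at peak is a little over 200 bytes.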

Technologies & Tools


Messaging System
Apache Kafka
Used for moving data between systems at LinkedIn.
Stream Processing Framework
Apache Samza
Layered over Kafka for processing streaming data.
Batch Processing Framework
Apache Hadoop
Used in conjunction with Kafka for processing large data sets.

Key Actionable Insights

1. Implement a tiered Kafka architecture to optimize data flow across multiple data centers.
   Producers and consumers talk only to clusters in their own data center, which avoids repeated cross-datacenter reads and reduces network cost and latency.
2. Use auditing tools to verify message integrity and completeness in your Kafka setup.
   By regularly comparing message counts between producers and consumers, you can quickly identify and resolve issues that lead to data loss or duplication.
3. Consider separating different message types into distinct Kafka clusters for better management and performance.
   Separate clusters can be sized and tuned for their own workload and monitored independently, which makes it easier to keep each one healthy.
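
The tiered layout in the first insight can be sketched as a mirroring plan. This is a simplified model under stated assumptions: each data center has one "local" cluster that applications produce to, plus one "aggregate" cluster that receives a copy of every local cluster's messages. The cluster naming scheme and the `mirror_plan` function are hypothetical.

```python
def mirror_plan(datacenters):
    """Return, for each aggregate cluster, the local clusters it copies from.

    Assumed model: every data center's aggregate cluster mirrors the local
    cluster of every data center, so consumers that need a global view can
    read it from a cluster in their own data center.
    """
    return {
        f"aggregate-{destination}": [f"local-{source}" for source in datacenters]
        for destination in datacenters
    }


plan = mirror_plan(["dc1", "dc2"])
for aggregate, sources in plan.items():
    print(aggregate, "<-", ", ".join(sources))
```

In this model each message crosses the link between two data centers once per remote aggregate cluster, no matter how many consumers later read it, which is where the network savings come from.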

Common Pitfalls

1. Failing to monitor the health of Kafka clusters can lead to undetected message loss.
   Without proper monitoring, lost or duplicated messages can go unnoticed, which undermines data integrity and system reliability.
2. Overloading a single Kafka cluster with too many message types can complicate management.
   Mixed workloads can create performance bottlenecks and make the system harder to monitor and maintain.
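
One way to avoid the second pitfall is to make the mapping from message category to cluster explicit in client configuration. The sketch below uses the four categories named earlier; the host names and the `cluster_for` helper are hypothetical, and it is not meant to describe LinkedIn's actual cluster layout.

```python
# Hypothetical bootstrap addresses, one cluster per message category.
CLUSTERS = {
    "queuing": "kafka-queuing.example.internal:9092",
    "metrics": "kafka-metrics.example.internal:9092",
    "logs": "kafka-logs.example.internal:9092",
    "tracking": "kafka-tracking.example.internal:9092",
}


def cluster_for(category):
    """Look up the cluster that should carry a given message category."""
    try:
        return CLUSTERS[category]
    except KeyError:
        # Fail loudly so an unknown category is never sent to a default cluster.
        raise ValueError(f"unknown message category: {category!r}") from None


print(cluster_for("metrics"))
```

Rejecting unknown categories, rather than falling back to a shared cluster, keeps new message types from quietly piling onto an existing cluster.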

Related Concepts

Message Queuing Systems
Data Streaming Architectures
Distributed Systems Design
Kafka Ecosystem Components