Introduction to Kafka Tiered Storage at Uber

Satish Duggana, Kamal Chandraprakash, Abhijeet Kumar

Uber

Overview

This article introduces Kafka Tiered Storage at Uber, detailing its architecture and the motivation behind its implementation. It highlights how tiered storage separates local and remote data, enhancing scalability and operational efficiency while reducing costs associated with Kafka clusters.

What You'll Learn

1. How to implement Kafka Tiered Storage to optimize data retention
2. Why separating local and remote storage improves Kafka cluster efficiency
3. When to use remote storage for older Kafka data

Key Questions Answered

What is Kafka Tiered Storage and how does it work?

Kafka Tiered Storage separates data into local and remote tiers, allowing older data to be stored externally while maintaining quick access to recent data. This architecture reduces the burden on Kafka brokers and improves scalability, enabling longer data retention without needing additional broker nodes.

How does the copying of segments to remote storage function?

The leader broker of a partition copies eligible log segments from local storage to remote storage, with eligibility determined by the partition's last-stable-offset. The copy itself goes through the RemoteStorageManager, and the metadata for each copied segment is recorded in the RemoteLogMetadataManager.
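The selection step can be sketched in a few lines. This is a minimal illustration, not Kafka's implementation: the `Segment` type and the `eligible_for_copy` function are hypothetical, and the exact rule used here (a segment qualifies when it is no longer the active segment and its end offset is below the last-stable-offset) is an assumption about what "eligible" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a local log segment."""
    base_offset: int  # offset of the first record in the segment
    end_offset: int   # offset of the last record in the segment


def eligible_for_copy(segments, last_stable_offset):
    """Return the segments that may be copied to remote storage, oldest first.

    Assumed rule: the newest segment is still active (being written to) and is
    never copied; every older segment qualifies only if all of its records lie
    below the last-stable-offset.
    """
    if not segments:
        return []
    ordered = sorted(segments, key=lambda s: s.base_offset)
    closed = ordered[:-1]  # drop the active segment
    return [s for s in closed if s.end_offset < last_stable_offset]


segments = [
    Segment(0, 99),
    Segment(100, 199),
    Segment(200, 299),
    Segment(300, 350),  # active segment
]
to_copy = eligible_for_copy(segments, last_stable_offset=250)
```

With a last-stable-offset of 250, only the first two segments are selected: the third still contains offsets at or above 250, and the fourth is the active segment.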

What are the goals of implementing tiered storage in Kafka?

The primary goals are extending storage beyond the brokers, ensuring durability and consistency, isolating reads of the latest data from reads of historical data, and improving operational and cost efficiency, all without requiring any changes from clients.

How does Kafka handle the cleanup of remote segments?

Remote segments are cleaned up at regular intervals by a dedicated thread pool, ensuring that segments are deleted based on the retention configuration. This process is asynchronous and does not block other operations, allowing for efficient management of remote data.
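A time-based cleanup pass of this kind might look like the sketch below. Everything here is illustrative: `RemoteSegment`, `expired_segments`, and `cleanup_pass` are hypothetical names, the in-memory dictionary stands in for a real remote store, and only time-based retention is modeled. The single-worker pool is there to show the pass running off the caller's thread.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass(frozen=True)
class RemoteSegment:
    """Hypothetical record describing one segment held in remote storage."""
    segment_id: str
    max_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_segments(segments, now_ms, retention_ms):
    """Segments whose newest record is older than the retention window."""
    cutoff = now_ms - retention_ms
    return [s for s in segments if s.max_timestamp_ms < cutoff]


def cleanup_pass(store, now_ms, retention_ms):
    """Delete expired segments from `store` and return the deleted ids."""
    doomed = expired_segments(list(store.values()), now_ms, retention_ms)
    for seg in doomed:
        del store[seg.segment_id]
    return [seg.segment_id for seg in doomed]


store = {
    seg.segment_id: seg
    for seg in [
        RemoteSegment("seg-a", 1 * DAY_MS),
        RemoteSegment("seg-b", 5 * DAY_MS),
        RemoteSegment("seg-c", 9 * DAY_MS),
    ]
}

# Run one pass on a dedicated worker thread, as a periodic cleaner would.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(cleanup_pass, store, 10 * DAY_MS, 7 * DAY_MS)
    deleted = future.result()
```

With "now" at day 10 and a 7-day retention window, the cutoff falls at day 3, so only `seg-a` is removed.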

Technologies & Tools

Backend

Apache Kafka

Used as the core messaging system for handling data streams at Uber.

Key Actionable Insights

1. Implementing Kafka Tiered Storage can significantly reduce operational costs by offloading older data to remote storage such as S3 or GCS. This lets organizations run a leaner Kafka cluster while still meeting data retention requirements, which matters most for businesses with large data volumes.

2. The RemoteStorageManager and RemoteLogMetadataManager interfaces give flexibility in integrating different storage systems. Because developers can plug in their own implementations, Kafka can adapt to different storage needs across diverse environments.

3. Regularly review and adjust the retention policies for both local and remote storage to balance performance and cost. As data usage patterns evolve, tuning these settings keeps the system efficient and within budget.
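To see how the two retention settings trade off, a rough sizing sketch can help. The topic-level keys shown (`remote.storage.enable`, `retention.ms`, `local.retention.ms`) are the ones used by upstream Apache Kafka's tiered storage and are not taken from this article; the `storage_split` function, the ingest rate, and the retention values are made-up assumptions. The arithmetic ignores replication, compression, and the overlap period in which a segment exists in both tiers.

```python
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

# Illustrative topic settings: keep 7 days in total, but only 6 hours on brokers.
topic_config = {
    "remote.storage.enable": "true",
    "retention.ms": str(7 * DAY_MS),
    "local.retention.ms": str(6 * HOUR_MS),
}


def storage_split(ingest_bytes_per_sec, retention_ms, local_retention_ms):
    """Estimate (local_bytes, remote_only_bytes) for one partition replica.

    Hypothetical helper: assumes a constant ingest rate and treats data older
    than the local retention window as living only in remote storage.
    """
    if local_retention_ms > retention_ms:
        raise ValueError("local retention cannot exceed total retention")
    local_bytes = ingest_bytes_per_sec * local_retention_ms // 1000
    total_bytes = ingest_bytes_per_sec * retention_ms // 1000
    return local_bytes, total_bytes - local_bytes


local_bytes, remote_only_bytes = storage_split(
    ingest_bytes_per_sec=10_000_000,  # assumed 10 MB/s ingest
    retention_ms=int(topic_config["retention.ms"]),
    local_retention_ms=int(topic_config["local.retention.ms"]),
)
```

Under these assumptions the brokers hold about 216 GB while roughly 5.8 TB sits only in remote storage. Shortening or lengthening `local.retention.ms` moves bytes between the two tiers without changing the total retained.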

Common Pitfalls

1. Failing to configure retention policies properly can lead to excessive costs or data loss. Reassess and adjust these settings regularly as data volumes and access patterns change.

2. Adding broker nodes only to gain storage capacity complicates deployment and increases operational costs. Tiered storage reduces the need for those extra nodes, which simplifies management and lowers expenses.

Related Concepts

Data Retention Strategies

Scalability In Distributed Systems

Cost Management In Cloud Storage Solutions