Pinterest Tiered Storage for Apache Kafka®️: A Broker-Decoupled Approach

Pinterest Engineering

Overview

The article discusses Pinterest's implementation of Tiered Storage for Apache Kafka®️, highlighting a broker-decoupled approach that offloads data to cheaper remote storage. This method enhances flexibility, reduces costs, and improves resource utilization compared to the native broker-coupled implementation.

What You'll Learn

1. How to implement broker-decoupled Tiered Storage for Apache Kafka
2. Why offloading data to remote storage reduces costs in Kafka
3. How to configure a Segment Uploader for efficient log segment management

Prerequisites & Requirements

  • Understanding of Apache Kafka architecture and PubSub systems
  • Familiarity with remote storage solutions like Amazon S3 (optional)

Key Questions Answered

How does Pinterest's Tiered Storage implementation differ from KIP-405?
Pinterest's implementation decouples storage from the broker, allowing direct consumption from remote storage, which enhances flexibility and reduces costs compared to the broker-coupled approach in KIP-405.
What are the main components of the broker-decoupled Tiered Storage?
The main components include the Segment Uploader, which uploads log segments to remote storage, and the Tiered Storage Consumer, which reads data from both remote storage and local broker disk, optimizing resource usage.
What challenges does the Segment Uploader face in ensuring data integrity?
The Segment Uploader must prevent missed uploads due to log segment deletion and handle transient upload failures, ensuring that all log segments are uploaded before they are cleaned up by Kafka's retention policies.
What factors should be considered when choosing a remote storage system for Tiered Storage?
Key factors include interface compatibility, pricing models for storage and data transfer, scalability, and the ability to support necessary operations for effective Tiered Storage implementation.
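
The transient-failure concern raised above for the Segment Uploader can be illustrated with a retry wrapper. This is a minimal sketch, not Pinterest's implementation: the function name `upload_with_retry`, its parameters, and the use of `OSError` to stand in for a transient failure are all assumptions made for illustration. Retrying only helps while the segment file still exists on the broker, so in practice the total retry window would need to stay well inside the topic's retention period.

```python
import time
from typing import Callable


def upload_with_retry(
    upload: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call upload() until it succeeds; return the number of attempts used.

    Hypothetical helper: OSError stands in for any transient upload failure.
    The delay doubles after each failed attempt (exponential backoff).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            upload()
            return attempt
        except OSError:
            if attempt == max_attempts:
                # Out of attempts: surface the failure so it can be alerted on
                # before retention deletes the segment.
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting in real time.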

Key Statistics & Figures

Data offloaded daily: 200 TB
The amount of data currently offloaded each day from broker disk to remote storage by the Tiered Storage implementation.
Production topics onboarded: 20+
The scale at which Pinterest has run broker-decoupled Tiered Storage in production since May 2024.

Technologies & Tools

Backend
Apache Kafka®️
Used as the primary messaging system for data transportation.
Storage
Amazon S3®️
Utilized as the remote storage solution for offloading data.

Key Actionable Insights

1. Implementing broker-decoupled Tiered Storage can significantly lower storage costs for Kafka clusters.
   By offloading data to cheaper remote storage, organizations can reduce the per-unit cost of storage while maintaining high availability and performance.
2. Using a Segment Uploader allows for efficient log segment management and helps ensure data integrity.
   This independent process monitors log directories and uploads finalized segments, preventing data loss and optimizing resource utilization.
3. Choosing the right remote storage system is crucial for the success of Tiered Storage.
   Evaluate storage systems based on their operational capabilities and pricing models to ensure they meet the demands of your Kafka workloads.
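
To make "uploads finalized segments" more concrete, the sketch below shows one way a process outside the broker could tell finalized segments apart from the active one. It relies on two facts about Kafka's on-disk layout: segment files in a partition directory are named by their zero-padded base offset, and the segment with the highest base offset is the one still being written. The function name `finalized_segments` is hypothetical, and the summary does not say how Pinterest's Segment Uploader actually detects rolled segments, so treat this as an assumption.

```python
from pathlib import Path


def finalized_segments(partition_dir: Path) -> list[Path]:
    """Return the .log segments in a partition directory that are no longer active.

    Assumption for illustration: every segment except the one with the
    highest base offset has been rolled and is safe to upload.
    """
    logs = sorted(
        (p for p in partition_dir.glob("*.log") if p.stem.isdigit()),
        key=lambda p: int(p.stem),  # file name is the segment's base offset
    )
    return logs[:-1]  # drop the active segment (highest base offset)
```

A real uploader would also need to record which segments have already been uploaded, so that a restart neither skips nor repeats work.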

Common Pitfalls

1. Failing to monitor the log segment lifecycle can lead to missed uploads and data loss.
   Because the Segment Uploader operates independently of the broker, it must track log segment states accurately so that every segment is uploaded before Kafka deletes it.
2. Inadequate configuration of remote storage can result in performance bottlenecks.
   Choosing a remote storage system without considering request rate limits and partitioning strategies can lead to hotspots and degraded performance.
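
One common way to reduce hotspots on object stores that apply request-rate limits per key prefix (Amazon S3 documents its limits this way) is to put a short hash at the front of each object key so that writes spread across many prefixes. The sketch below shows the idea. The key layout, the function name `segment_key`, and the choice of SHA-256 are assumptions for illustration and are not taken from Pinterest's implementation.

```python
import hashlib


def segment_key(cluster: str, topic: str, partition: int,
                base_offset: int, prefix_len: int = 4) -> str:
    """Build an object key whose leading characters are a stable hash.

    Hypothetical layout: <hash>/<cluster>/<topic>-<partition>/<base offset>.log
    The hash depends only on cluster, topic, and partition, so all segments
    of one partition share a prefix and can still be listed together.
    """
    ident = f"{cluster}/{topic}-{partition}"
    digest = hashlib.sha256(ident.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{digest}/{ident}/{base_offset:020d}.log"
```

Hashing per partition rather than per segment keeps listing cheap for a consumer reading one partition, at the cost of concentrating a very hot partition's traffic on a single prefix.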

Related Concepts

Apache Kafka Architecture
PubSub Design Patterns
Remote Storage Solutions
Data Lifecycle Management