How Pinterest runs Kafka at scale

Pinterest Engineering

Overview

Pinterest operates one of the largest Kafka deployments in the cloud, using Apache Kafka as a message bus for data transport and as the backbone of its real-time streaming services. The article describes the Kafka setup, the operational challenges it raises, and the automation that keeps it reliable and performant.

What You'll Learn

1. How to set up a Kafka cluster across multiple AWS regions
2. Why automated broker replacement is crucial for Kafka reliability
3. How to implement partition reassignment during broker failures

Prerequisites & Requirements

Understanding of Kafka architecture and operations
Familiarity with AWS services, particularly EC2

Key Questions Answered

How does Pinterest manage Kafka broker failures?

Pinterest uses DoctorKafka, a tool it has open-sourced, to automate partition reassignment when brokers fail. The tool reduces operational overhead and speeds up recovery from failures, and it has cut Kafka-related alerts by more than 95%.
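The reassignment step can be sketched as follows. This is a minimal illustration, not DoctorKafka's actual algorithm: the function name `plan_reassignment`, the "least-loaded live broker" policy, and the sample topic and broker ids are all assumptions. Only the shape of the output follows the JSON layout that Kafka's `kafka-reassign-partitions` tool accepts.

```python
import json
from collections import Counter


def plan_reassignment(assignments, failed_broker, live_brokers):
    """Build a plan that moves replicas off a failed broker.

    assignments: dict mapping (topic, partition) -> list of broker ids.
    Hypothetical policy: swap the failed broker for the live broker that
    hosts the fewest replicas and is not already a replica of the partition.
    """
    # Count replicas per live broker so replacements are spread evenly.
    load = Counter({b: 0 for b in live_brokers})
    for replicas in assignments.values():
        for b in replicas:
            if b in load:
                load[b] += 1

    moves = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if failed_broker not in replicas:
            continue  # this partition is unaffected
        candidates = [b for b in live_brokers if b not in replicas]
        if not candidates:
            raise ValueError(f"no spare broker for {topic}-{partition}")
        # Pick the least-loaded candidate; break ties by broker id.
        target = min(candidates, key=lambda b: (load[b], b))
        load[target] += 1
        new_replicas = [target if b == failed_broker else b for b in replicas]
        moves.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return {"version": 1, "partitions": moves}


# Hypothetical cluster: broker 3 has failed, broker 5 is a fresh replacement.
sample = {
    ("events", 0): [1, 2, 3],
    ("events", 1): [2, 3, 4],
    ("events", 2): [3, 4, 1],
}
plan = plan_reassignment(sample, failed_broker=3, live_brokers=[1, 2, 4, 5])
print(json.dumps(plan, indent=2))
```

Keeping the replacement in the failed broker's position preserves the replica order, so a partition whose preferred leader was the failed broker gets the new broker as its preferred leader.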

What is the default replication factor used in Pinterest's Kafka setup?

Pinterest sets the default replication factor to 3, so every partition has three replicas and a single cluster can lose up to two brokers while still keeping at least one replica of each partition. This configuration is essential for maintaining data availability and reliability.
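The "up to two broker failures" claim can be checked with a small simulation. This is a sketch under stated assumptions: the five-broker layout and the `partition_status` helper are hypothetical, and the model only asks whether any replica survives. Real availability also depends on which replicas are in sync and on settings such as `min.insync.replicas`.

```python
from itertools import combinations


def partition_status(assignments, failed_brokers):
    """Classify each partition given a collection of failed broker ids."""
    failed = set(failed_brokers)
    status = {}
    for partition, replicas in assignments.items():
        alive = [b for b in replicas if b not in failed]
        if not alive:
            status[partition] = "offline"
        elif len(alive) < len(replicas):
            status[partition] = "under-replicated"
        else:
            status[partition] = "healthy"
    return status


# Hypothetical 5-broker cluster with replication factor 3.
assignments = {0: [1, 2, 3], 1: [2, 3, 4], 2: [3, 4, 5], 3: [4, 5, 1]}
brokers = [1, 2, 3, 4, 5]

# Any two simultaneous broker failures leave every partition with a replica.
survives_two = all(
    "offline" not in partition_status(assignments, pair).values()
    for pair in combinations(brokers, 2)
)

# A third failure can take a partition offline, e.g. brokers 1, 2 and 3.
three_down = partition_status(assignments, {1, 2, 3})
print(survives_two, three_down)
```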

What instance types does Pinterest use for Kafka brokers?

Pinterest primarily uses d2.2xlarge instances for Kafka brokers, with some smaller clusters using d2.8xlarge instances for workloads that require higher read fanout. Pinterest chose d2 instances, which use local storage, after comparing their performance against Elastic Block Store (EBS) volumes.

How does Pinterest ensure data is spread across availability zones?

To ensure resilience, Pinterest spreads the brokers of each cluster across three availability zones and distributes the replicas of each topic partition among those zones. This setup lets a cluster withstand the failure of up to two brokers.
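A zone-aware placement like the one described above can be sketched in a few lines. This is an illustration only, not Pinterest's tooling or Kafka's built-in rack-aware assignment (which relies on the `broker.rack` broker setting). The function name, zone names, and broker ids are assumptions.

```python
from itertools import cycle


def assign_replicas_across_zones(brokers_by_zone, num_partitions, replication_factor=3):
    """Place each replica of a partition in a different availability zone."""
    zones = sorted(brokers_by_zone)
    if replication_factor > len(zones):
        raise ValueError("need at least one zone per replica")
    # One round-robin iterator per zone so brokers in a zone share the load.
    next_broker = {z: cycle(sorted(brokers_by_zone[z])) for z in zones}
    assignment = {}
    for p in range(num_partitions):
        # Rotate the starting zone so first replicas spread across zones.
        chosen = [zones[(p + i) % len(zones)] for i in range(replication_factor)]
        assignment[p] = [next(next_broker[z]) for z in chosen]
    return assignment


# Hypothetical cluster: two brokers in each of three zones.
brokers_by_zone = {
    "zone-a": [1, 4],
    "zone-b": [2, 5],
    "zone-c": [3, 6],
}
layout = assign_replicas_across_zones(brokers_by_zone, num_partitions=4)
print(layout)
```

Because every partition has exactly one replica per zone in this sketch, losing any single zone still leaves two replicas of each partition.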

Key Statistics & Figures

Number of Kafka brokers: >2000
Pinterest runs over 2000 Kafka brokers on Amazon Web Services.

Messages transported per day: >800 billion
The Kafka deployment at Pinterest transports over 800 billion messages daily.

Messages handled per second during peak hours: >15 million
During peak hours, Pinterest's Kafka setup handles over 15 million messages per second.
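To relate the daily and peak figures, the short calculation below converts the daily volume into an average per-second rate. Both inputs are the lower bounds quoted above ("over 800 billion" and "over 15 million"), so the results are rough estimates rather than measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 800_000_000_000   # lower bound quoted above
peak_per_second = 15_000_000         # lower bound quoted above

# Average rate if the daily volume were spread evenly over 24 hours.
average_per_second = messages_per_day / SECONDS_PER_DAY
# How far the peak rate sits above that average.
peak_to_average = peak_per_second / average_per_second

print(f"average: {average_per_second / 1e6:.2f} million msg/s")  # average: 9.26 million msg/s
print(f"peak/average: {peak_to_average:.2f}x")                   # peak/average: 1.62x
```

On these figures the peak is roughly 1.6 times the daily average, which gives a sense of the headroom a deployment of this size has to provision for.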

Technologies & Tools

Backend

Apache Kafka

Used extensively as a message bus for data transport and real-time streaming services.

Cloud

AWS

Hosting the Kafka deployment across multiple regions.

Key Actionable Insights

1. Implementing automated broker replacement can drastically reduce operational overhead.
By using tools like DoctorKafka, teams can minimize manual intervention during broker failures, allowing for quicker recovery and less downtime.

2. Utilizing the right instance types for Kafka brokers is crucial for performance.
Pinterest found that d2 instances with local storage outperformed EBS storage, highlighting the importance of testing various configurations for optimal performance.

3. Maintaining a proper replication factor is essential for data reliability.
Setting a replication factor of 3 allows for resilience against broker failures, ensuring that data remains accessible even during outages.

Common Pitfalls

1. Failing to automate broker management can lead to increased operational overhead.
Manual handling of broker failures requires significant resources and can slow down recovery times. Automating these processes with tools like DoctorKafka can alleviate this burden.

Related Concepts

Kafka Architecture And Operations

AWS EC2 Instance Types

Data Replication Strategies