Overview
Pinterest operates one of the largest Kafka deployments in the cloud, using Apache Kafka as a message bus for data transport and for real-time streaming services. The article describes Pinterest's Kafka setup, the operational challenges it faced, and the automation it built to maintain reliability and performance.
What You'll Learn
1. How to set up a Kafka cluster across multiple AWS regions
2. Why automated broker replacement is crucial for Kafka reliability
3. How to implement partition reassignment during broker failures
Prerequisites & Requirements
- Understanding of Kafka architecture and operations
- Familiarity with AWS services, particularly EC2
Key Questions Answered
How does Pinterest manage Kafka broker failures?
Pinterest uses DoctorKafka, a tool it has open-sourced, to automate partition reassignment during broker failures. The tool reduces operational overhead and speeds up recovery from failures, and it has cut Kafka-related alerts by more than 95%.
What is the default replication factor used in Pinterest's Kafka setup?
Pinterest has set the default replication factor to 3, allowing the system to withstand up to two broker failures within a single cluster. This configuration is essential for maintaining data availability and reliability.
What instance types does Pinterest use for Kafka brokers?
Pinterest primarily uses d2.2xlarge instances for Kafka brokers, with some smaller clusters using d2.8xlarge instances for workloads that require higher read fanout. Pinterest chose d2 instances, which use local storage, after comparing their performance against Elastic Block Store (EBS) storage.
How does Pinterest ensure data is spread across availability zones?
To ensure resilience, Pinterest spreads Kafka brokers across three availability zones and distributes the replicas of each topic partition among those zones. With this layout, a cluster can withstand the failure of up to two brokers.
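The zone-spreading idea can be illustrated with a small Python sketch. This is not Pinterest's tooling. The function name `assign_replicas`, the zone names, and the broker IDs are all hypothetical, and the placement rule is a simple round-robin chosen for illustration. Kafka itself offers rack-aware assignment when each broker sets `broker.rack`, and that is usually the better starting point in practice.

```python
def assign_replicas(brokers_by_az, num_partitions, replication_factor=3):
    """Place each replica of a partition in a different availability zone.

    brokers_by_az maps a zone name to the list of broker IDs in that zone.
    Returns {partition: [broker_id, ...]}.
    """
    azs = sorted(brokers_by_az)
    if replication_factor > len(azs):
        raise ValueError("need at least one availability zone per replica")

    assignment = {}
    for p in range(num_partitions):
        replicas = []
        for r in range(replication_factor):
            # Rotate the starting zone per partition so leaders spread out.
            az = azs[(p + r) % len(azs)]
            brokers = brokers_by_az[az]
            # Step to the next broker in a zone after one full zone rotation.
            replicas.append(brokers[(p // len(azs)) % len(brokers)])
        assignment[p] = replicas
    return assignment


# Hypothetical layout: three zones, two brokers each.
brokers_by_az = {
    "us-east-1a": [1, 4],
    "us-east-1b": [2, 5],
    "us-east-1c": [3, 6],
}
plan = assign_replicas(brokers_by_az, num_partitions=6)
for partition, replicas in plan.items():
    print(partition, replicas)
```

With a replication factor of 3 and three zones, every partition ends up with exactly one replica per zone, so losing a whole zone still leaves two copies of each partition.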
Key Statistics & Figures
Number of Kafka brokers
>2000
Pinterest runs over 2000 Kafka brokers on Amazon Web Services.
Messages transported per day
>800 billion
The Kafka deployment at Pinterest transports over 800 billion messages daily.
Messages handled per second during peak hours
>15 million
During peak hours, Pinterest's Kafka setup handles over 15 million messages per second.
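As a rough consistency check on these figures, the daily volume can be converted to an average per-second rate and compared with the peak. Both inputs are lower bounds (the article says "over"), so the ratio below is indicative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

messages_per_day = 800e9   # "over 800 billion", treated as a lower bound
peak_per_second = 15e6     # "over 15 million", treated as a lower bound

average_per_second = messages_per_day / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second

print(f"average: {average_per_second / 1e6:.2f} million msg/s")  # 9.26
print(f"peak/average: {peak_to_average:.2f}")                    # 1.62
```

Under these assumptions the average rate is about 9.26 million messages per second, so peak traffic is roughly 1.6 times the daily average.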
Technologies & Tools
Backend
Apache Kafka
Used extensively as a message bus for data transport and real-time streaming services.
Cloud
AWS
Hosting the Kafka deployment across multiple regions.
Key Actionable Insights
1. Implementing automated broker replacement can drastically reduce operational overhead. By using tools like DoctorKafka, teams can minimize manual intervention during broker failures, allowing for quicker recovery and less downtime.
2. Utilizing the right instance types for Kafka brokers is crucial for performance. Pinterest found that d2 instances with local storage outperformed EBS storage, highlighting the importance of testing various configurations for optimal performance.
3. Maintaining a proper replication factor is essential for data reliability. Setting a replication factor of 3 allows for resilience against broker failures, ensuring that data remains accessible even during outages.
Common Pitfalls
1. Failing to automate broker management can lead to increased operational overhead.
Manual handling of broker failures requires significant resources and can slow down recovery times. Automating these processes with tools like DoctorKafka can alleviate this burden.
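To make the automation concrete, here is a minimal Python sketch of how a reassignment plan might be generated when a broker dies. It is not DoctorKafka's algorithm. The function `build_reassignment`, the topic name, and the broker IDs are hypothetical, and the "least-loaded live broker" rule is an assumption made for illustration. It also ignores availability-zone placement, which a production tool would need to respect. The output uses the JSON layout that Kafka's `kafka-reassign-partitions.sh` accepts through `--reassignment-json-file`.

```python
import json


def build_reassignment(assignment, dead_broker, live_brokers):
    """Swap dead_broker out of every replica list it appears in.

    assignment maps (topic, partition) to a list of broker IDs.
    The replacement is the live broker currently holding the fewest replicas.
    """
    # Count replicas already hosted by each live broker.
    load = {b: 0 for b in live_brokers}
    for replicas in assignment.values():
        for b in replicas:
            if b in load:
                load[b] += 1

    partitions = []
    for (topic, partition), replicas in sorted(assignment.items()):
        if dead_broker not in replicas:
            continue
        # A broker may hold at most one replica of a given partition.
        candidates = [b for b in live_brokers if b not in replicas]
        if not candidates:
            raise RuntimeError(f"no spare broker for {topic}-{partition}")
        target = min(candidates, key=lambda b: (load[b], b))
        load[target] += 1
        new_replicas = [target if b == dead_broker else b for b in replicas]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return {"version": 1, "partitions": partitions}


# Hypothetical cluster state: broker 3 has failed, broker 5 is a fresh spare.
current = {
    ("events", 0): [1, 2, 3],
    ("events", 1): [2, 3, 4],
    ("events", 2): [3, 4, 1],
}
plan = build_reassignment(current, dead_broker=3, live_brokers=[1, 2, 4, 5])
print(json.dumps(plan, indent=2))
```

In this example the spare broker 5 absorbs the first two partitions, after which its load matches the others and the third partition goes to broker 2. The printed JSON could then be reviewed and handed to the reassignment tool.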
Related Concepts
Kafka Architecture And Operations
AWS EC2 Instance Types
Data Replication Strategies