Overview
This article discusses Pinterest's transition to using Druid as its next-generation analytics data store, detailing the architecture and optimization strategies for host types. It highlights the limitations of HBase and the benefits of Druid in handling analytics workloads.
What You'll Learn
1
How to optimize host types for memory in Druid deployments
2
Why Druid is preferred over HBase for analytics workloads
3
How to implement a lambda architecture for data ingestion
Prerequisites & Requirements
- Understanding of data analytics and data warehousing concepts
- Familiarity with Druid and Kafka(optional)
Key Questions Answered
What are the limitations of using HBase for analytics at Pinterest?
HBase's limitations include its unsuitability for analytics query patterns, high costs of pre-calculating filter combinations, and operational challenges as data sets grow. These issues prompted Pinterest to transition to Druid for better performance and flexibility.
How does Pinterest's architecture utilize Druid for data ingestion?
Pinterest employs a lambda architecture for data ingestion, combining batch processes that transform data from S3 into Druid index files with real-time streaming from Kafka. This dual approach allows for efficient handling of both historical and real-time analytics.
What are the best practices for optimizing memory in Druid hosts?
Best practices include provisioning hosts with sufficient memory to keep frequently accessed segments in the page cache, aiming for a 1:1 ratio of disk segment size to available RAM. This minimizes disk reads and enhances query performance.
What types of host configurations are recommended for high I/O workloads?
For high I/O workloads, it is recommended to use host types with high random read IOPS, such as AWS i3en instances, which provide a balance of price and performance. This is crucial when data volumes exceed available RAM for caching.
Key Statistics & Figures
Number of nodes in Druid infrastructure
More than 2,000 nodes
This extensive setup supports multiple clusters and handles significant data volumes efficiently.
Data volume for the largest offline use case
More than 1 PB of data
This showcases the scale at which Pinterest operates its analytics platform.
Queries per second (QPS) for the largest online use case
More than 1,000 QPS
This indicates the high performance and responsiveness of the Druid system in real-time analytics.
P99 latency
Less than 250 ms
This performance metric highlights the efficiency of Druid in handling queries.
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Data Store
Druid
Used as the next-generation analytics data store for Pinterest.
Streaming Platform
Kafka
Utilized for real-time data ingestion.
Database
Mysql
Stores metadata records for the data ingested into Druid.
Key Actionable Insights
1Optimize your Druid deployment by selecting the right host types based on workload characteristics.Choosing between memory-optimized and I/O-optimized hosts can significantly impact performance. Understanding your data access patterns will guide this decision.
2Implement a lambda architecture for data ingestion to balance batch and real-time processing needs.This approach allows for flexibility in handling different data sources and types, ensuring that analytics can keep pace with real-time events while also processing historical data.
3Regularly evaluate and adjust your caching strategies based on query patterns and data access frequency.As data grows and access patterns change, maintaining optimal cache performance is essential for minimizing latency and maximizing throughput.
Common Pitfalls
1
Underestimating the impact of disk reads on query performance.
Many deployments fail to account for how frequent disk access can degrade performance, especially in high-load scenarios. Ensuring adequate memory for caching can mitigate this issue.
2
Neglecting to optimize host types based on specific workload requirements.
Using inappropriate host configurations can lead to inefficiencies and increased costs. It's crucial to match host capabilities with the expected data access patterns.
Related Concepts
Data Ingestion Strategies
Lambda Architecture
Performance Optimization In Data Analytics