Overview
This article discusses Pinterest's transition to using Druid as their analytical data store, detailing the challenges faced with HBase, the architecture of their Druid implementation, and insights on optimizing host types for memory mapping. It serves as the first part of a three-part series aimed at sharing learnings from their analytics platform.
What You'll Learn
1
How to optimize host types for memory mapping in Druid
2
Why Druid is preferred over HBase for analytics at Pinterest
3
When to use memory optimized versus IO optimized host types
Prerequisites & Requirements
- Understanding of data storage and analytics concepts
- Familiarity with Druid and its architecture(optional)
Key Questions Answered
What were the limitations of using HBase for Pinterest's analytics?
HBase's key value data model did not align well with analytics query patterns, leading to increased complexity in aggregation, high cardinality issues, limited filter choices, and unmanageable operational costs as data sets grew. These limitations prompted Pinterest to switch to Druid.
How does Pinterest's Druid architecture handle data ingestion?
Pinterest employs a lambda architecture for data ingestion, utilizing both daily/hourly batch pipelines that transform data from S3 into Druid formatted index files and a streaming pipeline that consumes real-time events from Kafka. This dual approach ensures efficient data processing.
What are the benefits of using memory optimized host types in Druid?
Memory optimized host types, such as AWS R series, are beneficial for low latency use cases as they minimize disk reads by ensuring that frequently accessed segments are cached in RAM. This leads to improved query performance by reducing major page faults.
When should IO optimized host types be used in Druid?
IO optimized host types are recommended when the data volume is too large to cache all segments in RAM or when the use case does not require sub-second latency. This approach focuses on maximizing disk read performance for large datasets.
Key Statistics & Figures
Druid cluster size
more than 2,000 nodes
This multi-cluster setup supports various critical use cases, including reporting and metrics analysis.
Largest offline use case data size
over 1 PB
This indicates the scale at which Pinterest operates its analytics platform.
Largest online use case QPS
1000+ QPS
This demonstrates the high query performance capabilities of the Druid implementation.
P99 latency
under 250 ms
This latency metric reflects the responsiveness of the Druid system for online queries.
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Database
Druid
Used as the next-gen analytical data store for Pinterest's analytics platform.
Database
Hbase
Previously used for analytics before switching to Druid.
Streaming
Kafka
Utilized for real-time data ingestion in the streaming pipeline.
Database
Mysql
Used for storing metadata records related to the data ingested into Druid.
Cloud Provider
AWS
Hosting environment for the Druid implementation, including specific instance types like R series and i3en.
Key Actionable Insights
1Evaluate the specific analytics query patterns your application requires before choosing a data store.Understanding the query patterns can help determine whether a key-value store like HBase or a column-oriented store like Druid is more suitable for your use case.
2Consider using memory optimized hosts for analytics workloads that require low latency.By ensuring that frequently accessed data resides in RAM, you can significantly reduce query response times and improve overall system performance.
3Implement a pre-cache strategy for new segments to enhance query performance.This strategy allows the operating system to load segments into the page cache proactively, reducing the need for costly disk reads during query execution.
Common Pitfalls
1
Over-relying on memory optimized hosts for large datasets can lead to performance issues.
If the data volume exceeds the RAM capacity, the performance will degrade due to frequent disk reads, making it crucial to assess the workload requirements before provisioning.
2
Neglecting to implement a pre-cache strategy can result in suboptimal query performance.
Without pre-caching, new segments will cause major page faults, leading to increased latency until the segments are fully cached.
Related Concepts
Data Ingestion Strategies
Lambda Architecture
Memory Mapping Optimization
Druid Performance Tuning