Powering Pinterest ads analytics with Apache Druid

Pinterest Engineering
10 min readadvanced
--
View Original

Overview

The article discusses Pinterest's transition from using Apache HBase to Apache Druid for ads analytics, highlighting the challenges faced and the benefits of Druid's capabilities in handling complex metrics and real-time data ingestion. It details the implementation process, including data ingestion methods, cluster setup, query construction, and system tuning.

What You'll Learn

1

How to implement real-time data ingestion using Apache Druid

2

Why to use namespaces for versioning data in Druid

3

How to optimize query performance in Druid by batching requests

4

When to choose native queries over SQL queries in Druid

Prerequisites & Requirements

  • Understanding of data analytics and database management
  • Familiarity with Apache Druid and its ecosystem(optional)

Key Questions Answered

What are the advantages of using Apache Druid for analytics?
Apache Druid provides real-time ingestion, automatic versioning, data pre-aggregation, and a SQL interface, making it suitable for complex analytics needs. It simplifies data handling and allows for efficient querying, which is essential for scaling analytics in a growing business like Pinterest.
How does Pinterest handle data ingestion in Druid?
Pinterest uses two ingestion modes in Druid: Native and Hadoop. Native ingestion allows for direct data reading on the MiddleManager node, while Hadoop ingestion utilizes a MapReduce job for parallel processing. Both methods ensure data is automatically versioned and ready for querying as soon as it's ingested.
What is the structure of Pinterest's Druid cluster?
Pinterest's Druid cluster is organized into three tiers: a 'hot' tier for recent data on optimized nodes, a 'cold' tier for the last year of data, and an 'icy' tier for older historical data. This structure helps manage query loads effectively and ensures low latency for recent queries.
What are common performance bottlenecks in Druid queries?
Performance bottlenecks in Druid often arise from high QPS due to individual queries for each entity, which can overload the system. To mitigate this, Pinterest implemented query batching, which significantly reduced the number of connections and improved overall performance.

Key Statistics & Figures

Data ingestion speed improvement
5 minutes for mirroring tier vs. 2 hours for natural rebalancing
This statistic highlights the efficiency gained by implementing a mirroring tier for quick data availability during high traffic.
Query performance after batching
15,000 requests per second peaks lowered by 10x
This demonstrates the effectiveness of batching queries in reducing the load on the Druid cluster and improving overall system performance.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Implement a mirroring tier in your Druid setup to enhance auto-scaling capabilities during peak loads.
This approach allows for quick scaling without waiting for Druid's rebalancing algorithm, which can be slow. By linking mirror nodes to primary nodes, you can efficiently serve increased query demands.
2
Utilize namespaces for managing data versions in Druid to prevent data overwrites from multiple ingestion pipelines.
This strategy allows for better organization of data segments and ensures that different data sources can coexist without conflict, which is crucial for maintaining data integrity.
3
Batch your queries to optimize performance and reduce the load on your Druid cluster.
By grouping requests into batches of 1,000 to 2,000 entities, you can minimize the number of network connections and improve query response times, which is essential for high QPS environments.
4
Choose native queries over SQL for performance-critical applications in Druid.
While SQL offers user-friendly syntax, native queries can handle complex filter conditions more efficiently, making them preferable for low-latency requirements.

Common Pitfalls

1
Overloading the Druid cluster with too many individual queries can lead to performance degradation.
This occurs because Druid opens new connections and object handles for each query, which can exhaust resources. To avoid this, implement query batching to reduce the number of simultaneous requests.
2
Relying solely on SQL queries for performance-sensitive applications can result in slow execution times.
SQL queries, while user-friendly, may introduce performance bottlenecks, especially with complex filters. It's important to evaluate the use of native queries for critical operations.

Related Concepts

Real-time Data Analytics
Data Ingestion Strategies
Performance Optimization Techniques In Druid
Versioning And Data Management In Distributed Systems