Pinterest’s Analytics as a Platform on Druid (Part 2 of 3)

Pinterest Engineering
9 min read · Intermediate

Overview

This article is the second part of a series discussing Pinterest's use of Druid for analytics. It focuses on optimizing Druid for batch use cases, sharing insights on system visibility, request pattern tiering, secondary key pruning, and partitioning strategies for skewed data.

What You'll Learn

  1. How to enhance system visibility in Druid by adding critical metrics
  2. Why tiering segments based on request patterns can optimize performance
  3. How to implement secondary key pruning to reduce query load
  4. How to create a custom shard spec for better data distribution

Prerequisites & Requirements

  • Understanding of Druid architecture and batch processing concepts
  • Familiarity with performance optimization techniques in data systems (optional)

Key Questions Answered

How can Druid's performance be optimized for batch use cases?
Druid's performance can be optimized by enhancing system visibility through critical metrics, implementing tiering based on request patterns, and utilizing secondary key pruning to reduce the number of segments scanned during queries. These strategies help identify bottlenecks and improve resource allocation.
What is the impact of secondary key pruning on query performance?
Secondary key pruning allows Druid to skip scanning segments that are guaranteed to return empty results, significantly reducing the number of segments to scan. This can lead to a threefold reduction in the segments scanned, relieving the load on data nodes and improving query performance.
Why is custom shard specification important for skewed data?
Custom shard specifications are crucial for skewed data because they allow data to be distributed more evenly across segments, particularly for high-cardinality dimensions. This helps avoid long-tail ingestion latencies and improves query performance for large partners.

Key Statistics & Figures

Reduction in segments scanned: 3x
After implementing secondary key pruning, the number of segments scanned during queries dropped threefold, relieving load on the data nodes.
Percentage of requests hitting recent segments: 98%
Analysis showed that 98% of requests targeted the most recent 35 days of data, which guided how segments were tiered.
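
The request-pattern analysis behind the 98% figure can be sketched roughly as follows. This is a minimal illustration, not Pinterest's tooling: the request log, the function names, and the "hot"/"cold" tier labels are all hypothetical, and the 35-day window is simply the value quoted above.

```python
from datetime import date, timedelta


def share_of_recent_requests(requested_days, today, window_days):
    """Fraction of requests whose queried data day falls in the last window_days days."""
    if not requested_days:
        return 0.0
    cutoff = today - timedelta(days=window_days)
    recent = sum(1 for day in requested_days if day > cutoff)
    return recent / len(requested_days)


def assign_tier(segment_day, today, hot_window_days=35):
    """Label a segment 'hot' if it is inside the window, otherwise 'cold' (hypothetical tier names)."""
    cutoff = today - timedelta(days=hot_window_days)
    return "hot" if segment_day > cutoff else "cold"


today = date(2021, 6, 30)
# Hypothetical request log: each entry is the data day that one query touched.
offsets = (0, 1, 1, 3, 7, 20, 34, 34, 90, 400)
requested_days = [today - timedelta(days=n) for n in offsets]

share = share_of_recent_requests(requested_days, today, 35)
print(share)  # 0.8 for this made-up log
```

If the share of requests inside the window is high, the small set of recent segments can be placed on hosts sized for heavy traffic, while older segments go to cheaper capacity.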

Technologies & Tools

Backend
Druid
Used as the analytics platform for processing and querying large datasets.

Key Actionable Insights

  1. Implementing system visibility metrics can greatly enhance your ability to diagnose performance issues in Druid.
     By adding metrics related to processing threads and memory usage, you can identify bottlenecks and make informed decisions about capacity provisioning.
  2. Using request pattern analysis to inform segment tiering can lead to significant infrastructure cost savings.
     By understanding which segments are accessed most frequently, you can allocate resources more efficiently, ensuring that high-demand segments are served by optimized hosts.
  3. Adopting secondary key pruning can drastically reduce the workload on data nodes during query execution.
     This technique allows Druid to avoid unnecessary scans, which not only improves performance but also optimizes resource utilization.
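
To make the idea of secondary key pruning concrete, here is a minimal sketch. It assumes that the segments of a time interval are hash-partitioned on a secondary key and that the partition number of each segment is known at query-planning time. The `Segment` class, the `partition_for` helper, and the use of CRC32 are illustrative stand-ins, not Druid's actual shard spec classes or hash function.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    interval: str        # time interval covered by the segment
    partition_num: int   # which hash partition this segment holds
    num_partitions: int  # total partitions in the interval


def partition_for(key, num_partitions):
    """Map a secondary key to a partition using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def prune(segments, filter_key):
    """Keep only segments that could contain rows matching filter_key."""
    return [
        s for s in segments
        if s.partition_num == partition_for(filter_key, s.num_partitions)
    ]


# One interval split into three hash partitions on the secondary key.
segments = [Segment("2021-06-01/2021-06-02", p, 3) for p in range(3)]
to_scan = prune(segments, "partner_42")
print(len(segments), "->", len(to_scan))  # 3 -> 1
```

Because ingestion and query planning use the same hash, any segment whose partition number does not match is guaranteed to return nothing for that key and can be skipped before the query is fanned out to data nodes.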

Common Pitfalls

  1. Relying solely on hash-based partitioning can lead to inefficiencies with skewed data.
     This approach may not distribute data evenly, resulting in long ingestion times and slow query performance for high-volume partners. It's important to consider custom partitioning strategies to address these issues.
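
The effect of skew on plain hash partitioning, and one possible way to counter it, can be sketched as below. This is an assumption-laden illustration rather than the custom shard spec described in the article: the `fanout` table, the `assign_shard` function, and the sample data are hypothetical. The idea shown is that known heavy keys are spread over several shards while every other key still maps to a single shard.

```python
import zlib
from collections import Counter


def stable_hash(value):
    """Stable hash of a string key (CRC32 as a stand-in)."""
    return zlib.crc32(value.encode("utf-8"))


def assign_shard(key, row_id, num_shards, fanout):
    """Pick a shard for a row.

    fanout maps a heavy key to the number of shards it is spread over;
    keys missing from fanout always land on exactly one shard.
    """
    spread = fanout.get(key, 1)
    base = stable_hash(key) % num_shards
    return (base + row_id % spread) % num_shards


# Hypothetical skewed data set: one partner owns 90% of the rows.
rows = [("big_partner", i) for i in range(900)]
rows += [(f"small_{i}", i) for i in range(100)]
num_shards = 4

plain = Counter(assign_shard(k, r, num_shards, {}) for k, r in rows)
spread = Counter(assign_shard(k, r, num_shards, {"big_partner": 4}) for k, r in rows)

# Largest shard under each scheme: plain hashing keeps all 900 heavy rows together.
print(max(plain.values()), max(spread.values()))
```

Since the shards for a heavy key are still derived deterministically from the key and its fanout, a query filtering on that key only needs to visit that known set of shards, so pruning on the key remains possible.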

Related Concepts

Druid Architecture
Batch Processing Optimization
Data Partitioning Strategies
Performance Monitoring in Data Systems