Improving the scalability of a Spark pipeline for conversion attribution

Pinterest Engineering

•

Pinterest Engineering

•9 min read•intermediate•

--

•View Original

ShellSQL

Overview

The article discusses the enhancements made to a Spark pipeline for conversion attribution at Pinterest, focusing on scalability as the number of users and advertisers grows. It outlines two primary solutions, bucketing and delta processing, which significantly reduce resource consumption while maintaining performance.

What You'll Learn

1

How to implement bucketing in Spark to optimize data processing

2

Why delta processing can reduce data size before joining datasets

3

How to apply ID joins to minimize data shuffling in Spark

Prerequisites & Requirements

Understanding of Spark and data processing concepts
Familiarity with HDFS and Spark APIs(optional)

Key Questions Answered

How does bucketing improve the performance of Spark pipelines?

Bucketing improves performance by reducing the need for repeated shuffling of data across multiple executions. By saving the shuffled state of actions, the pipeline can avoid redundant shuffle operations, leading to significant resource savings and faster processing times.

What is the delta processing method in Spark?

Delta processing involves filtering data before joining datasets, which reduces the size of the data being processed. By maintaining only the latest actions in a sliding window, the pipeline can efficiently handle large datasets without unnecessary shuffling.

What challenges arise when implementing bucketing in Spark?

Challenges include determining the optimal number of buckets to balance read parallelism and storage costs, as well as ensuring that bucketed data is not reshuffled during processing. Additionally, API limitations can complicate the saving of bucketed data.

How can ID joins reduce data shuffling in conversion attribution?

ID joins can minimize data shuffling by only joining the necessary fields to create a list of unique identifiers. This smaller dataset can then be joined back with the full inputs using a more efficient BroadcastJoin, reducing the overall shuffle size.

Key Statistics & Figures

Days of action logs considered for attribution

60 days

The pipeline attributes conversions to actions that occurred up to 60 days prior.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend

Spark

Used for implementing the data pipeline for conversion attribution.

Storage

Hdfs

Used for storing bucketed data files.

Key Actionable Insights

1
Implement bucketing in your Spark pipelines to reduce redundant data shuffling.
This method allows you to persist the shuffled state of actions, which can significantly lower resource usage and improve processing times, especially in large-scale data environments.

2
Utilize delta processing to filter data before joining, minimizing the dataset size.
By applying filtering operations early, you can streamline the data processing workflow, making it more efficient and less resource-intensive.

3
Consider using ID joins to limit the amount of data shuffled during joins.
This technique can help maintain performance by ensuring that only necessary data is processed, which is particularly useful in conversion attribution scenarios.

Common Pitfalls

1

Overestimating the number of buckets can lead to excessive file creation and increased storage costs.

It's important to balance the number of buckets with the need for parallelism and storage efficiency to avoid under-utilization of HDFS blocks.

2

Not filtering data before joining can lead to unnecessary shuffling and increased processing times.

By delaying filtering until after the join, you may end up processing larger datasets than necessary, which can strain resources.

Related Concepts

Data Processing Optimization Techniques

Spark Performance Tuning

Big Data Management Strategies