Building a new experiment pipeline with Spark

Pinterest Engineering
5 min read, intermediate

Overview

The article discusses how Pinterest revamped its legacy experiment pipeline using Spark to improve computational speed, scalability, and performance. With over 1,000 experiments running daily and billions of records processed, the new pipeline significantly reduces execution time and enhances data handling capabilities.

What You'll Learn

  1. How to leverage Spark for building scalable data pipelines
  2. Why job abstraction is crucial for reducing code duplication
  3. How to optimize data storage using Parquet format

Prerequisites & Requirements

  • Understanding of A/B testing and data processing concepts
  • Familiarity with Spark and data pipeline tools (optional)

Key Questions Answered

What were the main disadvantages of Pinterest's legacy experiment pipeline?
The legacy pipeline suffered from growing computation times as user activity increased, and from redundant jobs introduced when tasks were split up to run in parallel. These issues caused bottlenecks and made debugging and maintenance difficult, which motivated a new approach.
How does the new Spark pipeline improve execution time?
The new pipeline reduces job execution time from over four hours to below two hours by simplifying the logic and utilizing Spark's in-memory execution capabilities. This allows for better performance and faster availability of experiment results.
What technologies are integrated into the new experiment pipeline?
The new pipeline incorporates Kafka for log transport, Pinball for orchestration, Spark Streaming for real-time validation, HBase and MemSQL for the dashboard backend, and Presto for interactive analysis, creating a robust framework for experiments.
What benefits does job abstraction provide in the new pipeline?
Job abstraction allows for a more flexible and maintainable pipeline by reducing code duplication and enabling the addition of new metrics. This design helps ensure that the pipeline can scale with the growing number of experiments and users.
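
As a rough illustration of the job abstraction idea, the sketch below uses plain Python rather than Spark so that it runs anywhere. It is not Pinterest's implementation: the record layout and the names MetricJob, ActionCount, and ActiveUsers are hypothetical. The shared filter, group, and aggregate flow lives in one base class, and each metric supplies only the parts that differ.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (experiment, group, user_id, action)
Record = Tuple[str, str, str, str]


class MetricJob(ABC):
    """Shared job skeleton: filter records, group them, then aggregate."""

    @abstractmethod
    def keep(self, record: Record) -> bool:
        """Return True if the record is relevant to this metric."""

    @abstractmethod
    def aggregate(self, records: List[Record]) -> float:
        """Reduce one (experiment, group) bucket to a single number."""

    def run(self, records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
        # The grouping logic is written once and reused by every metric.
        grouped: Dict[Tuple[str, str], List[Record]] = defaultdict(list)
        for rec in records:
            if self.keep(rec):
                grouped[(rec[0], rec[1])].append(rec)
        return {key: self.aggregate(recs) for key, recs in grouped.items()}


class ActionCount(MetricJob):
    """Counts occurrences of one action per experiment group."""

    def __init__(self, action: str) -> None:
        self.action = action

    def keep(self, record: Record) -> bool:
        return record[3] == self.action

    def aggregate(self, records: List[Record]) -> float:
        return float(len(records))


class ActiveUsers(MetricJob):
    """Counts distinct users per experiment group."""

    def keep(self, record: Record) -> bool:
        return True

    def aggregate(self, records: List[Record]) -> float:
        return float(len({rec[2] for rec in records}))
```

Under this design, adding a new metric means writing one small subclass instead of copying an entire job, which is the kind of duplication the abstraction is meant to remove.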

Key Statistics & Figures

Average execution time of new job
below two hours
This is a significant reduction from the previous execution time of over four hours.
Number of experiments running daily
more than 1,000
This high volume called for a more efficient pipeline.
Monthly active users growth
from 100 million in 2015 to 175 million today
This growth contributed to the challenges faced by the legacy pipeline.

Technologies & Tools

Backend
Spark
Used for processing and analyzing experiment data efficiently.
Backend
Kafka
Serves as the log transport layer for the pipeline.
Backend
Pinball
Orchestrates the Spark workflow.
Database
HBase
Powers the dashboard backend.
Database
MemSQL
Also powers the dashboard backend.
Database
Presto
Used for interactive analysis.

Key Actionable Insights

  1. Implement job abstraction in your data pipelines to minimize code duplication and enhance maintainability.
     By abstracting jobs, you can streamline your pipeline and make it easier to adapt to new requirements, which is crucial as data needs evolve.
  2. Utilize Spark's in-memory processing capabilities to significantly reduce job execution times.
     This approach not only speeds up processing but also allows for dynamic tuning of job parameters, leading to more efficient resource usage.
  3. Consider using Parquet format for data storage to optimize space and cost.
     As data volumes grow, efficient storage formats can lead to significant savings and improved performance in data retrieval.

Common Pitfalls

  1. Failing to address job redundancy can lead to increased complexity and maintenance challenges.
     As the number of jobs increases, so do dependencies and potential delays, making it crucial to implement job abstraction to streamline processes.

Related Concepts

Data Processing Frameworks
A/B Testing Methodologies
Scalable Data Architectures