How Pinterest Accelerates ML Feature Iterations via Effective Backfill

Pinterest Engineering

Apache Spark

Overview

This article explains how Pinterest speeds up machine learning feature iteration with an effective backfill process. By moving from forward logging to a two-stage backfill method, Pinterest significantly reduces cost and shortens feature iteration time, with backfills completing up to 90 times faster.

What You'll Learn

1. How to implement a two-stage backfill process for machine learning features
2. Why the Iceberg table format improves data management in backfilling
3. How to leverage Ray for efficient data loading during model training

Prerequisites & Requirements

Understanding of machine learning feature engineering
Familiarity with Spark and Airflow (optional)

Key Questions Answered

What are the challenges of forward logging in ML feature training?

Forward logging is slow and expensive: it costs calendar days and development time, provides no isolation between production and experimental features, and wastes resources. A new feature can take 3 to 6 months to be hydrated in the training dataset, which makes rapid feature experimentation impractical.

How does the two-stage backfill process improve efficiency?

The two-stage backfill process splits the work into a staging stage and a promotion stage. Features are staged in parallel, which cuts wait times and compute cost, and separating the two stages minimizes data shuffling and lets engineers work on different features concurrently. Overall completion time drops by 82% compared to the previous method.
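
The staging and promotion split can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Pinterest's Spark implementation: the table layout, the `stage_feature` and `promote` helpers, and both feature definitions are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical base training rows, keyed by (date partition, request id).
TRAINING_TABLE = {
    ("2024-01-01", "req1"): {"label": 1},
    ("2024-01-01", "req2"): {"label": 0},
    ("2024-01-02", "req3"): {"label": 1},
}


def stage_feature(feature_name, compute):
    """Stage 1: materialize one feature into its own staging table.

    Staging never writes to the main table, so many features can be
    staged at the same time without conflicting writes.
    """
    staging = {key: compute(key) for key in TRAINING_TABLE}
    return feature_name, staging


def promote(training_table, staged):
    """Stage 2: fold all staged features into the main table in one pass."""
    promoted = {}
    for key, row in training_table.items():
        new_row = dict(row)
        for feature_name, staging in staged.items():
            new_row[feature_name] = staging[key]
        promoted[key] = new_row
    return promoted


# Two illustrative feature definitions (both made up for this sketch).
features = {
    "id_length": lambda key: len(key[1]),
    "is_new_year": lambda key: key[0] == "2024-01-01",
}

# Stage every feature in parallel, then promote them together.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(stage_feature, n, f) for n, f in features.items()]
    staged = dict(f.result() for f in futures)

result = promote(TRAINING_TABLE, staged)
print(result[("2024-01-01", "req1")])
```

The point of the design is that the expensive, conflict-prone write to the main table happens once, in promotion, rather than once per feature.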

What benefits does the Iceberg table format provide?

Switching to the Iceberg table format allows for dynamic partition insertion, reducing the manual overhead of inserting into individual partitions. This change improves version control and rollback support, enabling faster recovery and minimizing downtime during data operations.
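
To make the two ideas concrete, here is a toy in-memory model of dynamic partition insertion and snapshot rollback. It does not use the Iceberg API; `ToyTable`, `dynamic_insert`, and `rollback` are hypothetical names, and real Iceberg snapshot handling is more involved than popping a list.

```python
from collections import defaultdict
from copy import deepcopy


class ToyTable:
    """In-memory stand-in for a partitioned table with snapshots."""

    def __init__(self, partition_column):
        self.partition_column = partition_column
        self.snapshots = [{}]  # each snapshot maps partition value -> rows

    @property
    def current(self):
        return self.snapshots[-1]

    def dynamic_insert(self, rows):
        """Route each row to its partition based on the row's own value."""
        new_state = deepcopy(self.current)
        grouped = defaultdict(list)
        for row in rows:
            grouped[row[self.partition_column]].append(row)
        for partition, part_rows in grouped.items():
            new_state.setdefault(partition, []).extend(part_rows)
        self.snapshots.append(new_state)  # commit as a new snapshot

    def rollback(self):
        """Discard the latest snapshot and return to the previous one."""
        if len(self.snapshots) > 1:
            self.snapshots.pop()


table = ToyTable(partition_column="dt")
# One write call populates several partitions; no per-partition inserts.
table.dynamic_insert([
    {"dt": "2024-01-01", "feature": 0.1},
    {"dt": "2024-01-02", "feature": 0.2},
    {"dt": "2024-01-01", "feature": 0.3},
])
print(sorted(table.current))

table.dynamic_insert([{"dt": "2024-01-03", "feature": 9.9}])  # a bad write
table.rollback()  # the bad partition disappears with its snapshot
print(sorted(table.current))
```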

Key Statistics & Figures

Improvement in completion times

82%

The new two-stage backfill approach reduced the total time taken for backfilling features from 140 days to 26 days.
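
As a quick check on how the day counts relate to the percentage, the reduction can be computed directly. This assumes the 140-day and 26-day totals are the figures behind the quoted percentage, which the summary does not state explicitly.

```python
def percent_reduction(before_days, after_days):
    """Relative reduction in completion time, as a percentage."""
    return (before_days - after_days) / before_days * 100


reduction = percent_reduction(140, 26)
print(f"{reduction:.1f}%")  # 81.4%
```

The result is close to the quoted 82%; the summary does not say whether that figure was rounded or derived from additional measurements.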

Speed up in backfill completion

90 times

The enhancements in the backfill process have enabled faster iterations and more efficient use of engineering resources.

Technologies & Tools

Backend

Spark

Used for processing and materializing features in the backfill solution.

Tools

Airflow

Used to manage the execution of backfill jobs through DAGs.

Data Management

Iceberg

Facilitates dynamic partition insertion and enhances version control in the backfilling process.

Data Processing

Ray

Optimizes data loading speeds and enables efficient joining of datasets during model training.

Key Actionable Insights

1. Implementing a two-stage backfill process can drastically reduce feature iteration times.
By separating the backfill into staging and promotion stages, teams can work concurrently on different features, leading to faster deployment and testing cycles.

2. Utilizing Iceberg for data management can streamline partition handling and improve performance.
The dynamic partition insertion capability of Iceberg reduces the time spent on data writes and enhances the efficiency of data operations, making it a valuable tool for large-scale data processing.

3. Adopting Ray can optimize data loading speeds during model training.
Ray speeds up data loading and enables efficient joining of datasets during model training, which shortens the overall training workflow.

Common Pitfalls

1. Not managing concurrent backfills can lead to data overwrites and delays.

Since each backfill writes data in place, multiple backfills cannot occur simultaneously on the same partition. Organizing a queue to manage backfill sequences is essential to avoid conflicts.

2. Underestimating the compute costs associated with backfilling.

Backfills can be extremely expensive because they shuffle large volumes of data, and EC2 costs can run into the millions. Budget for this up front and allocate resources deliberately to keep the expense under control.
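
The queueing idea from the first pitfall can be sketched as a small scheduler that lets only one backfill write to a given partition at a time. This is a minimal illustration under assumed names (`BackfillQueue`, `submit`, `finish`); the source does not describe how the queue is actually implemented.

```python
from collections import defaultdict, deque


class BackfillQueue:
    """Serializes backfills that target the same partition."""

    def __init__(self):
        self.waiting = defaultdict(deque)  # partition -> queued job names
        self.running = {}                  # partition -> job currently writing

    def submit(self, job, partition):
        """Start the job if the partition is free, otherwise queue it."""
        if partition in self.running:
            self.waiting[partition].append(job)
            return "queued"
        self.running[partition] = job
        return "started"

    def finish(self, partition):
        """Mark the running job done and start the next one, if any."""
        del self.running[partition]
        if self.waiting[partition]:
            next_job = self.waiting[partition].popleft()
            self.running[partition] = next_job
            return next_job
        return None


queue = BackfillQueue()
print(queue.submit("feature_a", "2024-01-01"))  # started
print(queue.submit("feature_b", "2024-01-01"))  # queued
print(queue.submit("feature_c", "2024-01-02"))  # started
print(queue.finish("2024-01-01"))               # feature_b
```

Backfills on different partitions proceed independently; only jobs that would overwrite the same partition wait for each other.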

Related Concepts

Machine Learning Feature Engineering

Data Processing Optimization Techniques

Distributed Computing Frameworks