Scaling Pinterest ML Infrastructure with Ray: From Training to End-to-End ML Pipelines

Pinterest Engineering

Overview

This article describes how Pinterest scaled its machine learning infrastructure with Ray, extending its use beyond training to feature development, sampling, and labeling. The resulting workflow cut ML iteration times by 10X and reduced infrastructure costs.

What You'll Learn

1. How to implement a Ray Data native transformation API for ML workflows
2. Why Iceberg bucket joins improve feature joining efficiency
3. How to optimize Ray's data processing capabilities for large-scale ML workloads
4. How to use Ray for data persistence to enhance feature iteration

Key Questions Answered

What challenges did Pinterest face before integrating Ray into its ML infrastructure?

Pinterest faced slow data pipelines, costly feature iterations, and inefficient compute usage. Feature development was bottlenecked by days-long backfill jobs, and inefficient sampling and slow labeling experimentation further reduced productivity.

How did Pinterest expand Ray's role in its ML infrastructure?

Pinterest expanded Ray's role by developing a Ray Data native pipeline API for ML data transformations, implementing Iceberg bucket joins for efficient data merging, and introducing data persistence mechanisms to enhance iteration speed. These innovations collectively improved the efficiency and scalability of their ML workflows.
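The on-the-fly transformation idea can be illustrated with a minimal sketch in plain Python. This is not Pinterest's API and does not use Ray; the function names, column names, and the 0.5 cap are all hypothetical. It only shows the shape of the approach: features are derived per batch while data is being read, rather than materialized ahead of time by a backfill job. In Ray Data, a per-batch function of this kind would typically be passed to `Dataset.map_batches`.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A batch is a dict of column name -> list of values (an assumed layout).
Batch = Dict[str, List[float]]
Transform = Callable[[Batch], Batch]


def add_ctr(batch: Batch) -> Batch:
    """Hypothetical feature: click-through rate from two raw columns."""
    ctr = [c / i if i else 0.0
           for c, i in zip(batch["clicks"], batch["impressions"])]
    return {**batch, "ctr": ctr}


def clip_ctr(batch: Batch) -> Batch:
    """Hypothetical feature: cap the derived value at 0.5."""
    return {**batch, "ctr": [min(v, 0.5) for v in batch["ctr"]]}


def apply_pipeline(batches: Iterable[Batch],
                   transforms: List[Transform]) -> Iterator[Batch]:
    """Lazily apply every transform to each batch as it is read."""
    for batch in batches:
        for transform in transforms:
            batch = transform(batch)
        yield batch


raw_batches = [
    {"clicks": [1.0, 3.0], "impressions": [10.0, 4.0]},
    {"clicks": [0.0], "impressions": [0.0]},
]
transformed = list(apply_pipeline(raw_batches, [add_ctr, clip_ctr]))
```

Because each transform returns a new dict instead of mutating its input, changing a feature definition only means editing a function and rerunning the read; the raw batches are left untouched.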

What impact did the new Ray-powered ML workflow have on Pinterest?

The new Ray-powered ML workflow reduced ML iteration times by 10X while significantly cutting infrastructure costs. This transformation enabled faster experimentation and hyperparameter tuning, allowing ML engineers to launch new features more efficiently.

What are the key technical innovations introduced by Pinterest using Ray?

Key innovations include a Ray Data native pipeline API for on-the-fly feature transformations, Iceberg bucket joins for efficient data merging, and mechanisms for data persistence that allow transformed data to be reused across experiments, enhancing overall workflow efficiency.
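To make the bucket join idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Iceberg's or Pinterest's implementation: CRC32 stands in for Iceberg's own bucket hash, and `NUM_BUCKETS`, `write_bucketed`, `bucket_join`, and the sample columns are hypothetical. The point it shows is that when both tables are bucketed on the join key with the same bucket count, matching keys always land in the same bucket, so each bucket pair can be joined independently without a global shuffle.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List

NUM_BUCKETS = 4  # assumed value; both tables must use the same count

Row = Dict[str, Any]
Buckets = Dict[int, List[Row]]


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministic bucket id (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_bucketed(rows: List[Row], key: str) -> Buckets:
    """Group rows by bucket, as a bucketed table layout would on write."""
    buckets: Buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(str(row[key]))].append(row)
    return buckets


def bucket_join(left: Buckets, right: Buckets, key: str) -> List[Row]:
    """Inner join, one bucket pair at a time."""
    joined: List[Row] = []
    for bucket_id in left.keys() & right.keys():
        # Build a small hash index over the right side of this bucket only.
        index = defaultdict(list)
        for r in right[bucket_id]:
            index[r[key]].append(r)
        for l in left[bucket_id]:
            for r in index.get(l[key], []):
                joined.append({**r, **l})
    return joined


examples = [
    {"pin_id": "p1", "label": 1},
    {"pin_id": "p2", "label": 0},
    {"pin_id": "p3", "label": 1},
]
features = [
    {"pin_id": "p1", "embedding_norm": 0.8},
    {"pin_id": "p3", "embedding_norm": 1.2},
]
joined = sorted(
    bucket_join(write_bucketed(examples, "pin_id"),
                write_bucketed(features, "pin_id"),
                "pin_id"),
    key=lambda row: row["pin_id"],
)
```

Each bucket pair is independent of the others, which is what would let a distributed engine assign bucket pairs to separate workers.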

Key Statistics & Figures

Reduction in ML iteration times: 10X
This improvement was achieved through the implementation of a fully Ray-powered ML workflow.

Speedup across different pipelines: 2-3X
This speedup was realized by optimizing Ray's data processing capabilities.

Technologies & Tools

Backend

Ray

Used for model training and, after the expansion, for feature development, sampling, and labeling in Pinterest's ML workflows.

Data Processing

Iceberg

Implemented for efficient data joining and persistence in ML workflows.

Key Actionable Insights

1. Implementing a Ray Data native transformation API can significantly reduce preprocessing time in ML workflows.
By allowing on-the-fly feature transformations, this API eliminates the need for lengthy Spark backfills, streamlining the feature development process.

2. Utilizing Iceberg bucket joins can enhance the efficiency of feature joins across datasets.
This approach allows for dynamic joining of datasets at runtime without the need for expensive precomputations, enabling faster iterations in feature experimentation.

3. Adopting data persistence strategies can improve the efficiency of ML iterations.
By caching transformed features and reusing them in subsequent training jobs, Pinterest reduced redundant computations, accelerating the overall ML workflow.

4. Optimizing Ray's data processing capabilities can lead to significant performance improvements.
Pinterest achieved 2-3X speedup across different pipelines by optimizing underlying data structures and enhancing UDF efficiency, which is crucial for handling large workloads.
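
The data persistence insight above (caching transformed features and reusing them in later jobs) can be sketched as follows. This is a minimal illustration, not Pinterest's mechanism: JSON files in a temporary directory stand in for whatever table storage is actually used, and `cache_key`, `load_or_compute`, the dataset name, and the config fields are hypothetical. The cache key is derived from the dataset identifier and the transformation config, so a later job with the same config reads the stored result instead of recomputing it.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Rows = List[Dict[str, object]]


def cache_key(dataset_id: str, transform_config: dict) -> str:
    """Stable key: same dataset and config always map to the same entry."""
    payload = json.dumps(
        {"dataset": dataset_id, "config": transform_config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_compute(cache_dir: Path, dataset_id: str, transform_config: dict,
                    compute: Callable[[], Rows]) -> Tuple[Rows, bool]:
    """Return (rows, cache_hit). Compute and persist only on a miss."""
    path = cache_dir / f"{cache_key(dataset_id, transform_config)}.json"
    if path.exists():
        return json.loads(path.read_text()), True
    rows = compute()
    path.write_text(json.dumps(rows))
    return rows, False


calls = {"n": 0}


def expensive_transform() -> Rows:
    """Hypothetical stand-in for a costly feature transformation."""
    calls["n"] += 1
    return [{"pin_id": "a", "ctr": 0.1}]


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    config = {"features": ["ctr"], "version": 1}
    first, first_hit = load_or_compute(
        cache_dir, "engagement_logs", config, expensive_transform)
    second, second_hit = load_or_compute(
        cache_dir, "engagement_logs", config, expensive_transform)
```

Changing any field of the config produces a different key, so an edited feature definition triggers a fresh computation rather than silently reusing stale data.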

Common Pitfalls

1. Over-reliance on traditional Spark workflows can lead to inefficiencies in ML pipelines.

Many organizations may struggle with slow data processing and high costs associated with Spark-based workflows. Transitioning to a more efficient framework like Ray can mitigate these issues.