Training Foundation Improvements for Closeup Recommendation Ranker

Pinterest Engineering

Overview

The article discusses improvements made to the training foundations of Pinterest's Closeup Recommendation Ranker, focusing on data logging, sampling strategies, and an auto-retraining framework. It highlights how these enhancements contribute to better model performance and user engagement.

What You'll Learn

1. How to implement a hybrid data logging approach for machine learning models
2. Why randomized traffic is essential for improving model training data
3. How to leverage an auto-retraining framework to maintain model performance
4. When to refresh machine learning models to adapt to user trends

Prerequisites & Requirements

Understanding of machine learning concepts and model training
Familiarity with data logging and sampling techniques (optional)

Key Questions Answered

What is the hybrid logging approach used in Pinterest's Closeup Recommendation Ranker?

The hybrid logging approach combines data logging through both backend services and frontend clients, allowing for efficient data storage while capturing essential training features. This method ensures that only relevant impressions and positive engagements are logged, significantly reducing data volume without sacrificing training quality.
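
The join at the heart of such a hybrid scheme can be sketched as follows. This is a minimal illustration, not Pinterest's implementation: the `BackendRecord` and `FrontendEvent` schemas, the `(request_id, pin_id)` join key, and the event names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendRecord:
    """Features captured by the backend at serving time (hypothetical schema)."""
    request_id: str
    pin_id: str
    features: dict


@dataclass(frozen=True)
class FrontendEvent:
    """Event reported by the client (hypothetical schema)."""
    request_id: str
    pin_id: str
    event_type: str  # "impression" or a positive engagement such as "repin"


def build_training_rows(backend_records, frontend_events):
    """Keep features only for candidates the client reported as seen or engaged with."""
    events_by_key = {}
    for ev in frontend_events:
        events_by_key.setdefault((ev.request_id, ev.pin_id), set()).add(ev.event_type)

    rows = []
    for rec in backend_records:
        seen = events_by_key.get((rec.request_id, rec.pin_id))
        if not seen:
            continue  # served but never shown: nothing is stored for this candidate
        rows.append({
            "request_id": rec.request_id,
            "pin_id": rec.pin_id,
            "features": rec.features,
            # Everything other than the impression itself is treated as a positive label.
            "labels": sorted(seen - {"impression"}),
        })
    return rows
```

Candidates that were served but never reported by the client are dropped, which is where the reduction in logged volume would come from under these assumptions.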

How does Pinterest's auto-retraining framework work?

Pinterest's Auto-Retraining Framework (ARF) automates model training and retraining on a specified schedule. It consists of an offline Airflow workflow that trains and validates models, and an online deployment pipeline that releases a new model only after it passes a series of performance checks.
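
The release gate described above can be illustrated with a small check that compares a freshly trained model against the one in production. This is a sketch only: the function name, the metric names, the "higher is better" convention, and the 1% tolerance are assumptions for the example, not details of ARF.

```python
def passes_release_checks(candidate_metrics, production_metrics, max_regression=0.01):
    """Return (ok, failures).

    ok is True only if every production metric is present in the candidate and
    has not dropped by more than max_regression (relative). All metrics are
    assumed to be higher-is-better.
    """
    failures = []
    for name, prod_value in production_metrics.items():
        cand_value = candidate_metrics.get(name)
        if cand_value is None:
            failures.append(f"{name}: missing from candidate")
            continue
        if cand_value < prod_value * (1 - max_regression):
            failures.append(f"{name}: {cand_value:.4f} vs production {prod_value:.4f}")
    return (not failures, failures)


ok, failures = passes_release_checks(
    candidate_metrics={"auc": 0.805},
    production_metrics={"auc": 0.800},
)
```

In a scheduled retraining setup, a gate like this would sit between validation and deployment so that a regressed model never replaces the serving one automatically.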

What benefits does randomized traffic provide in model training?

Randomized traffic allows for logging all candidates served to users, not just those that received impressions. This approach enhances the training dataset by providing a more comprehensive view of user interactions, which is beneficial for offline replay experimentation and model evaluations.
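
One way to picture the logging decision is shown below. The sketch covers only which candidates get logged, not how candidates are randomized for serving. The hash-based bucketing, the `fraction` parameter, and both function names are assumptions for illustration; the source does not state how randomized traffic is allocated.

```python
import hashlib


def in_randomized_bucket(request_id: str, fraction: float = 0.01) -> bool:
    """Deterministically assign a request to the randomized slice (hypothetical scheme)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < fraction


def candidates_to_log(request_id, candidates, impressed_ids, fraction=0.01):
    """Log every served candidate for randomized requests, impressions only otherwise."""
    if in_randomized_bucket(request_id, fraction):
        return list(candidates)
    return [c for c in candidates if c in impressed_ids]
```

Hashing the request ID keeps the assignment stable across services, so every component that sees the same request makes the same logging decision.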

What are the key improvements made to the sampling strategy?

Simple downsampling was replaced by a dedicated sampling job that supports customized sampling logic. This makes the training data more efficient and helps mitigate bias in the model.
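
As a rough illustration of what customized sampling logic can look like, the sketch below applies a separate keep rate to each label instead of one global downsampling rate. It uses plain Python rather than the PySpark job described in the article, and the label names, rates, and `sample_rows` function are hypothetical.

```python
import random


def sample_rows(rows, keep_rates, default_rate=1.0, seed=0):
    """Keep each row with a probability chosen by its label.

    keep_rates maps a label to a keep probability in [0, 1]; labels that are
    not listed fall back to default_rate.
    """
    rng = random.Random(seed)  # seeded so a rerun produces the same sample
    kept = []
    for row in rows:
        rate = keep_rates.get(row["label"], default_rate)
        if rng.random() < rate:
            kept.append(row)
    return kept


rows = [
    {"pin_id": "a", "label": "repin"},
    {"pin_id": "b", "label": "no_action"},
    {"pin_id": "c", "label": "repin"},
]
# Keep all positives and drop all negatives to make the effect obvious.
sampled = sample_rows(rows, keep_rates={"repin": 1.0, "no_action": 0.0})
```

In practice the rates would sit somewhere between 0 and 1 and be tuned through experiments such as the A/B tests the article reports.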

Key Statistics & Figures

Engagement gains from sampling experiments

+0.3% impressions, +1% repins, +3% longclicks, +2% product longclicks, +2% video repins

These metrics were observed site-wide as a result of the new sampling configurations tested in A/B experiments.

Technologies & Tools

Backend

PySpark

Used for processing large datasets and implementing the sampling job in the training data generation pipeline.

Workflow Management

Airflow

Used for orchestrating the offline workflow for training, validating, and registering models.

Model Management

MLflow

Used for tracking and managing model artifacts throughout the training and deployment process.

Deployment

Spinnaker

Used for deploying new model versions and managing their release to production.

Key Actionable Insights

1. Implement a hybrid logging system to improve data storage efficiency while capturing essential training features.
This approach reduces the volume of logged data and ensures that only relevant interactions are recorded, which is crucial for maintaining high-quality training datasets.

2. Utilize randomized traffic to enhance the training dataset for your models.
By logging all candidates served, you can create a richer dataset that supports better model evaluations and offline experimentation, leading to improved model performance.

3. Adopt an auto-retraining framework to keep your models updated with the latest user interactions.
This framework automates the retraining process, so your models adapt to changing user trends without extensive manual intervention.

Common Pitfalls

1. Neglecting the importance of data logging efficiency can lead to bloated datasets that slow down training processes.
Without a proper logging strategy, you may end up with excessive data that complicates data management and analysis, ultimately impacting model performance.

2. Failing to refresh models regularly can result in performance degradation over time.
Models trained on outdated data may not perform well as user preferences change, leading to a poor user experience. Implementing an auto-retraining framework can help mitigate this risk.

Related Concepts

Machine Learning Model Training

Data Logging Techniques

Sampling Strategies in ML

Auto-retraining Frameworks