Handling Online-Offline Discrepancy in Pinterest Ads Ranking System

Pinterest Engineering

16 min read • Intermediate

Apache, AWS, AWS S3, Machine Learning

Overview

The article discusses the challenges of online-offline discrepancies in Pinterest's ads ranking system, emphasizing the importance of aligning offline model performance with online business metrics. It explores various scenarios that lead to discrepancies and shares insights from the engineering team's experiences and solutions.

What You'll Learn

1. How to identify and analyze online-offline discrepancies in machine learning models
2. Why aligning offline evaluation metrics with online business metrics is crucial
3. How to implement feature validation checks to ensure data consistency

Prerequisites & Requirements

Understanding of machine learning model evaluation metrics
Familiarity with Apache Superset for monitoring (optional)

Key Questions Answered

What are the common scenarios leading to online-offline discrepancies in Pinterest's ads ranking?

The article identifies two main scenarios: a bug-free scenario where offline performance gains do not correlate with online metrics, and a buggy scenario where issues in the ads ranking system lead to diminished online gains. Each scenario presents unique challenges in model evaluation and performance.

How does Pinterest ensure data consistency between training and serving?

Pinterest employs Feature Stats Validation checks, monitoring dashboards, and a Unified Feature Representation to maintain data consistency. These measures help detect irregularities in feature distribution and ensure that the same feature values are used during both training and serving.
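
A feature stats validation check of the kind described above can be sketched as follows. This is a minimal illustration, not Pinterest's implementation: the function names, the choice of coverage and mean as the statistics, and both thresholds are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Sequence


@dataclass
class FeatureStats:
    coverage: float  # fraction of rows with a non-missing value
    mean: float      # mean of the non-missing values (0.0 if there are none)


def compute_stats(values: Sequence[Optional[float]]) -> FeatureStats:
    """Summarize one feature column; None marks a missing value."""
    present = [v for v in values if v is not None]
    coverage = len(present) / len(values) if values else 0.0
    return FeatureStats(coverage=coverage, mean=mean(present) if present else 0.0)


def validate_feature(
    training: Sequence[Optional[float]],
    serving: Sequence[Optional[float]],
    max_coverage_drop: float = 0.05,        # hypothetical threshold
    max_relative_mean_shift: float = 0.10,  # hypothetical threshold
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    train, serve = compute_stats(training), compute_stats(serving)
    problems = []
    if train.coverage - serve.coverage > max_coverage_drop:
        problems.append(
            f"coverage dropped from {train.coverage:.2f} to {serve.coverage:.2f}"
        )
    denominator = abs(train.mean) or 1.0  # avoid dividing by zero
    if abs(train.mean - serve.mean) / denominator > max_relative_mean_shift:
        problems.append(f"mean shifted from {train.mean:.2f} to {serve.mean:.2f}")
    return problems


# Synthetic example: half of the serving values are missing.
training_values = [1.0, 2.0, 3.0, 4.0]
serving_values = [1.0, None, None, 4.0]
problems = validate_feature(training_values, serving_values)
print(problems)
```

In this synthetic case the check reports the coverage drop but not a mean shift, since the surviving serving values happen to have the same mean as the training values. A production check would likely compare richer distribution statistics as well.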

What hypotheses explain the online-offline discrepancies observed in Pinterest's ads ranking?

The article outlines several hypotheses, including misalignment between offline evaluation metrics and online business metrics, potential learning from treatment traffic by control models, and feature delays in online settings that may dilute performance gains.
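
To make the first hypothesis concrete, the sketch below computes the two metrics involved: ROC-AUC as an offline ranking metric and CPA (cost per acquisition) as an online business metric. The helper names and the toy data are assumptions for illustration. The point is that the two are computed from different inputs, so a gain in one does not mechanically imply a gain in the other.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def cpa(total_spend: float, conversions: int) -> float:
    """Cost per acquisition: advertiser spend divided by conversions (lower is better)."""
    if conversions <= 0:
        raise ValueError("conversions must be positive")
    return total_spend / conversions


def moved_in_expected_direction(auc_delta: float, cpa_delta: float) -> bool:
    """True when an offline AUC gain coincides with an online CPA reduction."""
    return auc_delta > 0 and cpa_delta < 0


# Toy data: three of the four positive/negative pairs are ranked correctly.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])  # 0.75
example_cpa = cpa(total_spend=100.0, conversions=4)        # 25.0
```

ROC-AUC depends only on how predictions order positives against negatives, while CPA also depends on spend, auction dynamics, and traffic, which is one reason the two can diverge.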

What steps did Pinterest take to diagnose a significant performance deterioration during an online experiment?

Pinterest examined the model training pipeline for issues, checked the consistency of predictions between offline and online data, and investigated traffic patterns. They discovered that peak traffic caused feature servers to return 'Null' values, impacting model performance.
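
One way to surface the kind of problem found here is to track the share of missing feature values per hour of serving traffic. The sketch below assumes a hypothetical log format of `(hour, feature_value)` pairs, with `None` standing in for a 'Null' returned by the feature server; the 20% threshold is likewise an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def null_rate_by_hour(
    records: Iterable[Tuple[int, Optional[float]]]
) -> Dict[int, float]:
    """Fraction of requests per hour whose feature value came back missing."""
    totals: Dict[int, int] = defaultdict(int)
    nulls: Dict[int, int] = defaultdict(int)
    for hour, value in records:
        totals[hour] += 1
        if value is None:
            nulls[hour] += 1
    return {hour: nulls[hour] / totals[hour] for hour in totals}


def hours_over_threshold(rates: Dict[int, float], threshold: float = 0.2) -> List[int]:
    """Hours whose null rate exceeds the (hypothetical) alerting threshold."""
    return sorted(hour for hour, rate in rates.items() if rate > threshold)


# Synthetic log: hour 3 is quiet, hour 20 is a traffic peak with missing values.
serving_log = [
    (3, 0.5), (3, 0.7),
    (20, None), (20, None), (20, 0.4), (20, 0.9),
]
rates = null_rate_by_hour(serving_log)
suspect_hours = hours_over_threshold(rates)
print(rates, suspect_hours)
```

Comparing the flagged hours against request volume would then show whether missing values cluster around traffic peaks.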

Key Statistics & Figures

Number of major conversion model iterations analyzed

15

In 2023, Pinterest analyzed 15 major conversion model iterations to observe online business metric movements.

Statistically significant movements observed

8

Out of the 15 iterations, 8 showed statistically significant movements between ROC-AUC and CPA.

Technologies & Tools

Monitoring

Apache Superset

Used for monitoring feature coverage, freshness, and distribution over time.

Storage

AWS S3

Used for storing data utilized in offline model training and online model serving.

Key Actionable Insights

1. Run Feature Stats Validation checks regularly in your ML pipelines to ensure data integrity.
This practice helps identify issues with feature distribution and coverage, which can significantly impact model performance and reliability.

2. Align your offline evaluation metrics closely with your online business metrics to reduce discrepancies.
Understanding the relationship between metrics like ROC-AUC and CPA helps you predict online performance from offline results.

3. Monitor for feature delays in your online serving environment to mitigate performance issues.
Being aware of the timing of feature updates can prevent stale data from affecting model predictions, especially during peak traffic periods.
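
A freshness monitor for the third insight can be as small as the sketch below, which flags features whose last update is older than an allowed age. The feature names, the 24-hour limit, and the idea of a per-feature "last updated" timestamp are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_features(
    last_updated: Dict[str, datetime],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Names of features whose most recent update is older than max_age."""
    return sorted(
        name for name, updated_at in last_updated.items() if now - updated_at > max_age
    )


# Hypothetical feature names and timestamps.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "user_click_count_7d": now - timedelta(hours=1),
    "advertiser_conversion_rate": now - timedelta(hours=30),
}
stale = stale_features(last_updated, now=now, max_age=timedelta(hours=24))
print(stale)
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic and easy to test.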

Common Pitfalls

1. Failing to align offline evaluation metrics with online business metrics can lead to misleading results.
This misalignment can cause teams to invest time in improving models that do not yield tangible online performance gains, ultimately slowing down development cycles.

2. Neglecting to monitor feature freshness can result in using stale data during online serving.
Stale features can lead to inaccurate predictions, especially during critical traffic periods, which can significantly impact user engagement and conversion rates.

Related Concepts

Machine Learning Model Evaluation

Feature Engineering

Data Consistency In ML Systems

Performance Monitoring In Large-scale Systems