Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models (Part 1)

Pinterest Engineering

Overview

This article discusses the re-architecture of the serving stack for next-generation ads lightweight ranking models at Pinterest, moving from a traditional Two-Tower architecture to a more complex GPU-based model inference system. It highlights the challenges faced, optimizations made, and the resulting improvements in recommendation quality and latency.

What You'll Learn

1. How to implement a GPU-based model inference stage in a serving stack
2. Why moving business logic into the model can improve performance
3. How to optimize feature fetching to reduce latency
4. When to apply inventory segmentation strategies for high-value documents

Prerequisites & Requirements

  • Understanding of neural network architectures and model inference
  • Familiarity with PyTorch and GPU computing

Key Questions Answered

What are the limitations of the Two-Tower architecture in recommendation systems?
The Two-Tower architecture struggles to leverage interaction features and complex signals, limiting its expressiveness. It decouples user and item data, preventing advanced architectural patterns like target attention and early feature crossing, which are essential for modeling deep interactions.
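To make the limitation concrete, here is a minimal pure-Python sketch, not Pinterest's model. It contrasts a Two-Tower score, where the user vector is computed without ever seeing the candidate, with a simple form of target attention, where the candidate acts as the query over the user's history. The function names and the toy two-dimensional vectors are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(history):
    """User-tower stand-in: average the history vectors; the candidate is unseen."""
    n = len(history)
    return [sum(col) / n for col in zip(*history)]


def two_tower_score(history, candidate):
    """Score is a dot product of two independently computed vectors."""
    return dot(mean_pool(history), candidate)


def target_attention_user_vector(history, candidate):
    """Weight each history item by its softmax similarity to the candidate."""
    logits = [dot(h, candidate) for h in history]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * h[i] for w, h in zip(weights, history))
        for i in range(len(candidate))
    ]
    return pooled, weights


# Toy data (hypothetical): one history item along axis 0, two along axis 1.
history = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
cand_a = [2.0, 0.0]
cand_b = [0.0, 2.0]

user_a, weights_a = target_attention_user_vector(history, cand_a)
user_b, weights_b = target_attention_user_vector(history, cand_b)
```

With target attention the user representation changes per candidate (`user_a` differs from `user_b`), whereas `mean_pool(history)` is the same vector for every candidate. That independence is what allows Two-Tower user embeddings to be precomputed, and it is also what rules out this kind of early interaction.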
How did Pinterest reduce GPU inference latency from 4000ms to 20ms?
Pinterest cut p90 latency by using multiple CUDA streams to overlap operations, aligning worker threads with CPU cores to avoid context switching, fusing kernels to reduce memory bandwidth pressure, and adopting the bfloat16 (BF16) format for faster computation.
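Two of these techniques, multiple CUDA streams and BF16, can be sketched in PyTorch as follows. This is a hedged illustration under stated assumptions, not Pinterest's serving code: the tiny `nn.Sequential` model, the `infer` helper, and the one-stream-per-batch policy are hypothetical, and the sketch falls back to plain CPU execution when no GPU is present. Thread-to-core alignment and kernel fusion are not shown.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the ranking model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device).eval()


def infer(batches):
    """Run each batch under BF16 autocast; on GPU, use one CUDA stream per batch."""
    outputs = []
    with torch.inference_mode(), torch.autocast(device_type=device, dtype=torch.bfloat16):
        if device == "cuda":
            main = torch.cuda.current_stream()
            for batch in batches:
                stream = torch.cuda.Stream()
                # The weights were placed on the main stream, so wait for it first.
                stream.wait_stream(main)
                with torch.cuda.stream(stream):
                    # Copy and compute are queued on this stream, so work from
                    # different batches is allowed to overlap.
                    outputs.append(model(batch.to(device, non_blocking=True)))
            # Wait for every stream before the results are read.
            torch.cuda.synchronize()
        else:
            outputs = [model(batch) for batch in batches]
    return outputs


scores = infer([torch.randn(4, 16) for _ in range(3)])
```

Whether the streams actually overlap depends on the GPU having spare capacity and, for the host-to-device copies, on the input tensors living in pinned memory.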
What is the impact of moving business logic into the model?
By moving business logic like utility calculations and top-k sorting into the PyTorch model, Pinterest reduced data transmission time significantly and improved processing speed. This allows the model to output only the final winners instead of all scores, enhancing efficiency.
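A minimal PyTorch sketch of this idea follows. The summary does not give the utility formula, so the one used here (predicted probability times bid) is an assumption, as are the `RankerWithTopK` name, the linear scorer, and the tensor shapes.

```python
import torch
from torch import nn


class RankerWithTopK(nn.Module):
    """Wraps a scorer so utility and top-k selection run inside forward()."""

    def __init__(self, scorer: nn.Module, k: int):
        super().__init__()
        self.scorer = scorer
        self.k = k

    def forward(self, features: torch.Tensor, bids: torch.Tensor):
        # Assumed utility: predicted engagement probability times bid.
        prob = torch.sigmoid(self.scorer(features)).squeeze(-1)
        utility = prob * bids
        k = min(self.k, utility.shape[0])
        top = torch.topk(utility, k)  # sorted in descending order by default
        # Only k values and k indices are returned, not one score per candidate.
        return top.values, top.indices


torch.manual_seed(0)
ranker = RankerWithTopK(nn.Linear(8, 1), k=5).eval()
with torch.inference_mode():
    values, indices = ranker(torch.randn(1000, 8), torch.rand(1000))
```

Here 1000 candidates go in and 5 winners come out, so the caller receives 5 values and 5 indices instead of 1000 scores.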
What changes were made to the retrieval data flow to improve performance?
The retrieval engine was restructured to return a lightweight Thrift structure containing only essential data, reducing serialization time. Heavy metadata is fetched only for the top-k documents after filtering, which decreased retrieval latency from 200ms to 75ms.
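The "light struct first, heavy metadata later" flow can be sketched in plain Python. The `LightCandidate` dataclass stands in for the lightweight Thrift structure, and the in-memory `METADATA` dict and `fetch_metadata` function stand in for a remote metadata lookup; all of these names and the toy scoring rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LightCandidate:
    """Stand-in for the lightweight Thrift struct: ID and score only."""
    doc_id: int
    score: float


# Hypothetical metadata store; in production this would be a remote lookup.
METADATA = {i: {"title": f"ad-{i}", "payload": "x" * 1024} for i in range(1000)}
fetch_count = 0


def fetch_metadata(doc_id):
    """Count calls so the number of heavy fetches is visible."""
    global fetch_count
    fetch_count += 1
    return METADATA[doc_id]


def retrieve(n):
    """Return light candidates only; no heavy metadata is serialized here."""
    return [LightCandidate(doc_id=i, score=(i * 37 % 101) / 101) for i in range(n)]


def rank_and_hydrate(candidates, k):
    """Rank on the light structs, then fetch metadata only for the winners."""
    winners = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]
    return [(c, fetch_metadata(c.doc_id)) for c in winners]


results = rank_and_hydrate(retrieve(1000), k=10)
```

In this sketch 1000 candidates are retrieved but `fetch_metadata` runs only 10 times, which is the property the restructured data flow relies on.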

Key Statistics & Figures

Initial p90 latency
4000ms
This was the latency before optimizations were applied to the GPU inference stage.
Final p90 latency
20ms
This is the latency achieved after implementing various optimizations in the serving stack.
Reduction in retrieval latency
125ms
The retrieval stage latency was reduced from 200ms to 75ms through structural changes.
Reduction in model loss
20%
Early offline results showed a significant reduction in model loss, indicating improved performance.

Technologies & Tools


Machine Learning Framework
PyTorch
Used for building and deploying the neural network models.
Parallel Computing Platform
CUDA
Utilized for optimizing GPU operations and improving inference speed.
Data Serialization
Thrift
Used for efficient data transmission in the retrieval engine.

Key Actionable Insights

1. Implementing a GPU-based model inference stage can significantly enhance the performance of recommendation systems.
   This approach allows for more complex models that can handle deep interactions, leading to improved recommendation quality and user engagement.
2. Adopting an inventory segmentation strategy for high-value documents can optimize feature fetching and reduce latency.
   By bundling features directly into the model for high-value candidates, you can eliminate network overhead and improve response times.
3. Moving business logic into the model can streamline processing and reduce unnecessary data transmission.
   This allows for parallel execution on the GPU, making the system more efficient and responsive, especially under high load.
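The inventory segmentation insight can be sketched in plain Python as a two-path feature lookup. This is an illustration under assumptions, not the production design: `SegmentedFeatureSource`, the `bundled` table (standing in for features shipped alongside the model), and `fake_remote_fetch` (standing in for a network call to a feature store) are all hypothetical.

```python
class SegmentedFeatureSource:
    """Serve high-value documents from a local table, everything else remotely."""

    def __init__(self, bundled, remote_fetch):
        self.bundled = bundled            # features bundled with the model
        self.remote_fetch = remote_fetch  # fallback for the rest of the inventory
        self.remote_calls = 0

    def get(self, doc_id):
        if doc_id in self.bundled:
            # High-value segment: no network round trip.
            return self.bundled[doc_id]
        self.remote_calls += 1
        return self.remote_fetch(doc_id)


def fake_remote_fetch(doc_id):
    """Placeholder for a remote feature-store lookup."""
    return [0.0, float(doc_id)]


source = SegmentedFeatureSource(
    bundled={1: [0.9, 1.0], 2: [0.8, 2.0]},
    remote_fetch=fake_remote_fetch,
)
features = [source.get(doc_id) for doc_id in (1, 2, 3, 1)]
```

Of the four lookups, only document 3 triggers a remote call. How the high-value segment is chosen and how large it can grow are tuning decisions the summary does not specify.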

Common Pitfalls

1. Failing to consider the impact of distribution shifts when changing ranking algorithms can lead to unexpected metric changes.
   This can occur when the new system processes all candidates globally, altering the composition of served ads. It's crucial to analyze and tune these shifts to maintain business performance.
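One simple way to quantify such a shift is to compare the share of each ad segment served before and after the change. The sketch below is a generic illustration; the segment labels and counts are invented for the example and are not Pinterest data.

```python
from collections import Counter


def share_by_segment(served):
    """Fraction of served ads that fall into each segment."""
    counts = Counter(served)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def composition_shift(old_served, new_served):
    """Per-segment change in share (new minus old)."""
    old = share_by_segment(old_served)
    new = share_by_segment(new_served)
    segments = sorted(set(old) | set(new))
    return {s: new.get(s, 0.0) - old.get(s, 0.0) for s in segments}


# Hypothetical served-ad logs, one segment label per served ad.
old_served = ["shopping"] * 50 + ["video"] * 30 + ["static"] * 20
new_served = ["shopping"] * 65 + ["video"] * 25 + ["static"] * 10
shift = composition_shift(old_served, new_served)
```

The shifts sum to zero across segments, so a gain in one segment always shows up as a loss elsewhere. That makes it easy to see which parts of the inventory a global ranking change favors.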

Related Concepts

Neural Network Architectures
Model Inference Optimization
Recommendation System Design
Feature Engineering Strategies