A Decade of AI Platform at Pinterest

Pinterest Engineering

22 min read, advanced

AutoML, Docker, Embedding, Generative AI, Java, Kubernetes, LightGBM, PySpark, Python, PyTorch, Seed, SQL, TensorFlow, Thrift, Transformer

Overview

The article reflects on a decade of AI platform development at Pinterest, detailing the evolution from fragmented machine learning stacks to a unified AI platform that supports various models. Key lessons are shared regarding platform building, adoption dynamics, and the interplay between modeling and infrastructure.

What You'll Learn

1. How to implement a unified AI platform that supports various machine learning models

2. Why organizational alignment is crucial for the adoption of new infrastructure

3. When to rebuild machine learning foundations in response to new modeling techniques

Prerequisites & Requirements

Understanding of machine learning concepts and infrastructure
Experience with AI/ML platforms and their challenges (optional)

Key Questions Answered

What were the key lessons learned from building the AI platform at Pinterest?

The key lessons include that adoption follows alignment, foundations are layered and temporary, local innovations need shared foundations, and enablement, efficiency, and velocity multiply each other. These insights highlight the importance of organizational incentives and the need for continuous adaptation as technology evolves.

How did Pinterest evolve its machine learning infrastructure over the years?

Pinterest's ML infrastructure evolved through five distinct eras, starting from fragmented individual team stacks to a unified AI platform. Each era brought new challenges and innovations, such as the introduction of Linchpin and Scorpion for feature unification, and later advancements like MLEnv and TabularML for standardization.

What is the significance of the Unified Feature Representation (UFR) in Pinterest's ML platform?

The Unified Feature Representation (UFR) was crucial for standardizing feature representation across different machine learning frameworks. It allowed for better integration with TensorFlow and PyTorch, facilitating the transition away from older systems and enabling more efficient model training and serving.
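To make the idea of a framework-neutral feature format concrete, here is a minimal sketch. It does not show Pinterest's actual UFR schema, which this summary does not describe; `FeatureRecord`, `to_dense`, and the feature names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureRecord:
    """Hypothetical framework-neutral container (not the real UFR schema)."""
    dense: Dict[str, float] = field(default_factory=dict)
    sparse_ids: Dict[str, List[int]] = field(default_factory=dict)


def to_dense(record: FeatureRecord, dense_order: List[str],
             default: float = 0.0) -> List[float]:
    """Flatten dense features into a fixed order, filling gaps with a default.

    A plain list like this can be wrapped by any framework's tensor type,
    which is the property a shared representation is meant to provide.
    """
    return [record.dense.get(name, default) for name in dense_order]


# Illustrative feature names only.
record = FeatureRecord(
    dense={"ctr_7d": 0.12, "pin_age_days": 3.0},
    sparse_ids={"topic_ids": [17, 42]},
)
row = to_dense(record, ["pin_age_days", "ctr_7d", "missing_feature"])
```

Because `row` is an ordinary list of floats, the same record could feed either `torch.tensor(row)` or `tf.constant(row)` without a framework-specific feature pipeline.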

What challenges did Pinterest face with GPU serving and how were they addressed?

Pinterest faced challenges with low GPU utilization and the need for efficient data processing. They addressed these by consolidating workloads, minimizing CPU-GPU handoffs, and implementing remote inference, which allowed for independent scaling of CPU and GPU resources, ultimately improving efficiency and model performance.
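One common way to reduce CPU-GPU handoffs is to group requests into batches so that each transfer carries more work. The sketch below illustrates that general technique only; it is not Pinterest's serving code. `batched_score` and `fake_gpu_model` are hypothetical names, and the "model" is a CPU stand-in that sums features.

```python
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]


def batched_score(
    requests: Sequence[Features],
    score_batch: Callable[[Sequence[Features]], List[float]],
    max_batch_size: int,
) -> Tuple[List[float], int]:
    """Score requests in batches and count how many model calls were made.

    Each call to score_batch stands in for one CPU-to-GPU handoff.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    results: List[float] = []
    handoffs = 0
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        results.extend(score_batch(batch))
        handoffs += 1
    return results, handoffs


def fake_gpu_model(batch: Sequence[Features]) -> List[float]:
    """Stand-in for a GPU model: the score is the sum of the features."""
    return [sum(features) for features in batch]


requests = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
scores, handoffs = batched_score(requests, fake_gpu_model, max_batch_size=4)
_, unbatched_handoffs = batched_score(requests, fake_gpu_model, max_batch_size=1)
```

With five requests and a batch size of four, the model is called twice instead of five times. A real server would also bound how long a request waits for its batch to fill, which this sketch leaves out.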

Key Statistics & Figures

Inferences served per second

hundreds of millions

This performance metric illustrates the scale at which Pinterest's AI platform operates, showcasing its capability to handle extensive user requests efficiently.

Model evaluation time

under 100 milliseconds

This indicates the speed at which user requests are processed, highlighting the efficiency of the AI infrastructure.

Engagement boost from GPU-based models

16%

This statistic reflects the immediate impact of implementing GPU serving on user engagement in the Home Feed.

Adoption rate of MLEnv

~95%

This high adoption rate demonstrates the effectiveness of standardizing training practices across the organization.

Technologies & Tools

Framework

TensorFlow

Used for building and training machine learning models.

Framework

PyTorch

Standardized for deep learning tasks across Pinterest.

Framework

Ray

Utilized for efficient data processing and training workflows.

Orchestration

Kubernetes

Employed for managing containerized applications and workloads.

Inference Service

Scorpion

First company-wide online inference engine for scoring Pins.

Domain-specific Language

Linchpin

Used for defining feature transformations and models.

Key Actionable Insights

1. Focus on aligning organizational goals with infrastructure needs to drive adoption.
When building new platforms, ensure that the infrastructure directly supports the business objectives of the teams involved. This alignment can significantly enhance the likelihood of adoption and successful implementation.

2. Embrace a layered approach to building machine learning foundations.
Recognize that no foundation is permanent and be prepared to rebuild as new technologies emerge. This mindset allows for flexibility and adaptability in the fast-evolving field of AI/ML.

3. Invest in shared foundations to prevent local innovations from decaying.
Encourage collaboration across teams to create shared resources and frameworks. This can help maintain the longevity and applicability of innovative solutions developed in isolated contexts.

Common Pitfalls

1. Failing to align infrastructure development with organizational goals can lead to poor adoption.
When teams prioritize immediate product metrics over infrastructure improvements, they may resist adopting new systems, leading to fragmentation and inefficiency.

2. Over-relying on local innovations without building shared foundations can cause decay.
Local innovations may not be sustainable without a common framework, resulting in duplicated efforts and wasted resources across teams.

Related Concepts

AI/ML Infrastructure Development

Machine Learning Model Optimization

Organizational Alignment In Tech Teams