Unifying visual embeddings for visual search at Pinterest

Pinterest Engineering
7 min read · advanced

Overview

This article traces how Pinterest moved from separate visual embeddings, each trained for a single application, to one unified multi-task visual embedding for visual search. The unified embedding improves performance and simplifies the supporting infrastructure.

What You'll Learn

1. How to implement a unified visual embedding for multiple applications
2. Why proxy-based metric learning is beneficial for training visual embeddings
3. When to transition from specialized to unified embeddings in visual search systems

Prerequisites & Requirements

  • Understanding of visual embeddings and metric learning concepts
  • Familiarity with PyTorch for model training (optional)

Key Questions Answered

How does Pinterest improve its visual search capabilities?
Pinterest enhances its visual search by unifying multiple visual embeddings into a single multi-task embedding system. This approach allows for better performance across different applications, such as Visual Cropper, Lens camera search, and Shop the Look, while simplifying the training and maintenance of the underlying infrastructure.
What are the challenges in developing visual embeddings for different applications?
The main challenges include managing the domain shift between camera images and Pin images, and ensuring that the embeddings are optimized for specific tasks. This often leads to technical debt as separate embeddings are developed for each application, complicating maintenance and improvement efforts.
What metrics indicate the success of the unified visual embedding?
The success of the unified visual embedding is measured through offline retrieval metrics and online A/B testing. These metrics showed improved engagement and relevance compared to the previously deployed specialized embeddings, demonstrating the effectiveness of the unified approach.
What is the role of proxy-based metric learning in visual embedding training?
Proxy-based metric learning trains on classification datasets where relationships between images are implicitly defined. This method alleviates negative sampling issues and has shown comparable performance to traditional metric learning approaches, making it suitable for training Pinterest's visual embeddings.
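
To make the proxy idea concrete, here is a minimal sketch of one common proxy-based loss: a softmax cross-entropy over cosine similarities between an image embedding and one proxy vector per class. It is written in plain Python so it runs without dependencies; in practice this would be a PyTorch module. The function names, the `temperature` value, and the toy vectors are assumptions for illustration, not details taken from the article.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def proxy_softmax_loss(embedding, proxies, label, temperature=0.05):
    """Cross-entropy over cosine similarities between an embedding and all class proxies.

    `proxies` holds one learnable vector per class (hypothetical layout) and
    `label` is the index of the image's class. No negative images are sampled:
    the proxies of the other classes act as the negatives.
    """
    e = l2_normalize(embedding)
    logits = []
    for proxy in proxies:
        p = l2_normalize(proxy)
        logits.append(sum(a * b for a, b in zip(e, p)) / temperature)
    # Numerically stable log-sum-exp, then negative log-probability of the true class.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

# Toy example: three class proxies in 2-D and an embedding close to class 0.
proxies = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
near_class_0 = [0.9, 0.1]
loss_correct = proxy_softmax_loss(near_class_0, proxies, label=0)
loss_wrong = proxy_softmax_loss(near_class_0, proxies, label=1)
print(loss_correct < loss_wrong)
```

The loss is small when the embedding points toward its own class proxy and large otherwise, so training pulls images toward their proxy and away from the rest without mining negative pairs or triplets.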

Key Statistics & Figures

Monthly searches on visual search products
Hundreds of millions
This statistic highlights the scale and importance of visual search at Pinterest.
Number of ideas browsed via visual embeddings
200B+
This showcases the extensive corpus that the visual search technology operates on.

Technologies & Tools

Backend
PyTorch
Used for training the unified visual embedding model.
Backend
Caffe
Previously used for training older visual embeddings.
Backend
Caffe2
Used in conjunction with PyTorch for serving newer embeddings.
Tools
NVIDIA Apex
Utilized for mixed precision training to improve performance.

Key Actionable Insights

1. Implement a unified visual embedding system to streamline your visual search applications.
   This approach can reduce technical debt and improve performance across multiple applications, allowing for faster iterations and enhancements.
2. Leverage proxy-based metric learning for training visual embeddings to enhance model performance.
   Because class proxies stand in for sampled negatives, this method sidesteps negative sampling and lets you train directly on classification-style datasets.
3. Regularly evaluate the performance of your visual search systems using A/B testing.
   This practice lets you measure how changes and optimizations affect user engagement and relevance.
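
Online A/B tests are usually preceded by the offline retrieval metrics mentioned earlier. As a rough sketch of one such metric, the snippet below computes recall@k: the fraction of queries whose top-k nearest neighbors (by cosine similarity) include an item with the same label. The function names, labels, and vectors are hypothetical, and the article does not specify which exact metric Pinterest used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recall_at_k(queries, index, k=1):
    """Fraction of queries with at least one same-label item among their top-k neighbors.

    `queries` and `index` are lists of (embedding, label) pairs.
    """
    hits = 0
    for q_vec, q_label in queries:
        # Brute-force ranking; a real system would use an approximate nearest-neighbor index.
        ranked = sorted(index, key=lambda item: cosine(q_vec, item[0]), reverse=True)
        if any(label == q_label for _, label in ranked[:k]):
            hits += 1
    return hits / len(queries)

index = [([1.0, 0.0], "chair"), ([0.0, 1.0], "lamp"), ([0.7, 0.7], "sofa")]
queries = [([0.9, 0.1], "chair"), ([0.1, 0.9], "lamp"), ([1.0, 0.2], "sofa")]
print(recall_at_k(queries, index, k=1))
print(recall_at_k(queries, index, k=2))
```

In this toy data the third query retrieves "chair" first and "sofa" second, so two of three queries succeed at k=1 and all three at k=2. Running the same metric on the old and new embeddings gives a quick comparison before committing to an online experiment.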

Common Pitfalls

1. Developing separate embeddings for each application can lead to technical debt.
   This occurs because maintaining multiple models complicates updates and improvements, making it harder to leverage advancements in training infrastructure.
2. Neglecting the domain shift between different types of images can hinder embedding performance.
   When queries are camera images but the corpus consists of Pin images, the model needs training data that covers both domains to avoid poor performance.
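
One way to keep every image domain represented during training is to build each batch from a fixed number of examples per task dataset. The sketch below illustrates that idea only; the dataset names, batch sizes, and the `blended_batches` helper are assumptions, and the summary does not state how Pinterest's pipeline actually samples its batches.

```python
import random

def blended_batches(datasets, per_task, num_batches, seed=0):
    """Yield batches that contain a fixed number of examples from every task.

    `datasets` maps a task name to its list of examples; `per_task` maps the
    same names to how many examples each batch draws from that task.
    """
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for task, examples in datasets.items():
            # Sample without replacement within a batch.
            for example in rng.sample(examples, per_task[task]):
                batch.append((task, example))
        rng.shuffle(batch)
        yield batch

# Hypothetical datasets: a large Pin-image set and a smaller camera-image set.
datasets = {
    "pin_images": [f"pin_{i}" for i in range(100)],
    "camera_images": [f"camera_{i}" for i in range(20)],
}
per_task = {"pin_images": 4, "camera_images": 4}
batches = list(blended_batches(datasets, per_task, num_batches=3))
for batch in batches:
    print([task for task, _ in batch])
```

Fixing the per-task count means the smaller camera dataset is not drowned out by the larger Pin dataset, at the cost of revisiting its examples more often.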

Related Concepts

Visual Embeddings
Metric Learning
Proxy-based Learning
Visual Search Technologies