LLM-Powered Relevance Assessment for Pinterest Search

Pinterest Engineering

9 min read • intermediate

Tags: BERT, BLIP, Machine Learning, RoBERTa, T5

Overview

The article discusses the implementation of LLM-powered relevance assessment at Pinterest Search, focusing on how fine-tuned large language models (LLMs) can enhance search relevance measurement while reducing costs and improving efficiency. It outlines the methodology, results, and future directions for leveraging LLMs in search relevance tasks.

What You'll Learn

1. How to implement LLM-based relevance assessment for search queries
2. Why stratified sampling improves measurement sensitivity in A/B testing
3. When to use cross-encoder models for relevance prediction
4. How to leverage multilingual LLMs for cross-lingual relevance tasks

Prerequisites & Requirements

Understanding of machine learning concepts and A/B testing methodologies
Familiarity with LLMs and their fine-tuning processes (optional)

Key Questions Answered

How does Pinterest use LLMs for relevance assessment in search?

Pinterest employs fine-tuned LLMs to predict the relevance of Pins to search queries, utilizing a 5-level relevance guideline. This approach allows for efficient evaluation of ranking results across experimental groups in A/B tests, significantly reducing labeling costs and improving evaluation quality.
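
As a rough illustration of what a 5-level relevance prediction can look like downstream of the model, the sketch below turns five raw scores into a predicted level and averages the levels per experiment group. It assumes the model emits one score (logit) per relevance level. The function names, the 1 to 5 numbering, and the toy numbers are illustrative assumptions, not Pinterest's implementation.

```python
import math

# Assumption: levels are numbered 1 (least relevant) to 5 (most relevant).
# The source only states that the guideline has 5 levels.
LEVELS = (1, 2, 3, 4, 5)


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_level(logits):
    """Return the most probable relevance level and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LEVELS[best], probs


def mean_level(rows):
    """Average predicted level over (query, pin, logits) rows of one group."""
    levels = [predict_level(logits)[0] for _, _, logits in rows]
    return sum(levels) / len(levels)


# Toy logits standing in for model output (hypothetical values).
control = [("sofa", "pin_a", [0.1, 0.2, 2.0, 0.5, 0.1]),
           ("sofa", "pin_b", [0.0, 0.1, 0.3, 1.9, 0.4])]
treatment = [("sofa", "pin_c", [0.0, 0.1, 0.2, 2.2, 0.9]),
             ("sofa", "pin_d", [0.0, 0.0, 0.1, 0.8, 2.5])]

control_score = mean_level(control)      # predicted levels 3 and 4
treatment_score = mean_level(treatment)  # predicted levels 4 and 5
```

Comparing `control_score` with `treatment_score` is the kind of group-level comparison an A/B relevance evaluation relies on. A real pipeline would get the logits from the fine-tuned model instead of hard-coded lists.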

What are the benefits of using stratified sampling in A/B testing?

Stratified sampling allows Pinterest to create a more representative sample population and reduces the minimum detectable effects (MDEs) by an order of magnitude. This method enhances the sensitivity of experiments, enabling the detection of smaller changes in relevance metrics.
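
To make the idea concrete, here is a minimal sketch of stratified sampling with proportional allocation. The strata used here (a single interest label per query) and all names are hypothetical. The source does not spell out how Pinterest defines or allocates its strata.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, stratum_of, n, seed=0):
    """Draw roughly n items, allocating draws to strata by their share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    total = len(items)
    sample = []
    for key in sorted(strata):  # sorted for reproducible ordering
        members = strata[key]
        # Proportional allocation, with at least one draw per stratum.
        # Rounding means the final size can differ slightly from n.
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Toy query log: 60 "home", 30 "fashion", 10 "food" queries (made-up data).
queries = [
    (f"q{i}", "home" if i < 60 else "fashion" if i < 90 else "food")
    for i in range(100)
]
sample = stratified_sample(queries, stratum_of=lambda q: q[1], n=10)
counts = Counter(stratum for _, stratum in sample)
```

With this toy data the sample keeps the 6 : 3 : 1 mix of the full log, whereas a simple random sample of 10 could easily miss the smallest stratum altogether.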

What LLMs were tested for relevance prediction at Pinterest?

Pinterest experimented with various LLMs including multilingual BERT-base, T5-base, mDeBERTa-V3-base, XLM-RoBERTa-large, and Llama-3-8B. The XLM-RoBERTa-large model was chosen for its balance of prediction quality and inference efficiency.

What is the significance of the MDE reduction achieved through LLM labeling?

The introduction of LLM labeling reduced the minimum detectable effects (MDEs) from 1.3%-1.5% to ≤ 0.25%. This significant reduction allows for more sensitive detection of relevance shifts in A/B testing, enhancing the team's ability to iterate and improve features.
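
For intuition on why more labels and lower variance shrink the MDE, the sketch below uses the textbook two-sample approximation, MDE ≈ (z for 1 − α/2 plus z for the target power) × sqrt(2 × variance / n per group). This is a generic formula assumed here for illustration. The source does not state how Pinterest computes its MDEs, and the inputs below are made up.

```python
from statistics import NormalDist


def minimum_detectable_effect(variance, n_per_group, alpha=0.05, power=0.8):
    """Textbook MDE for a two-sided, two-sample comparison of means.

    Assumes equal group sizes and equal per-unit variance in both groups.
    The result is in the same absolute units as the metric.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * (2 * variance / n_per_group) ** 0.5


# Hypothetical numbers, chosen only to show the scaling behaviour.
small = minimum_detectable_effect(variance=1.0, n_per_group=100)
large = minimum_detectable_effect(variance=1.0, n_per_group=3600)
ratio = small / large  # 36x more labels per group gives a 6x smaller MDE
```

Because the MDE scales with the square root of variance over sample size, cheap LLM labels (larger n) and stratification (lower variance) both push it down.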

Key Statistics & Figures

Exact match rate of LLM-generated labels to human labels

73.7%

This statistic reflects the alignment between LLM-generated relevance labels and those from human annotators, indicating the effectiveness of LLMs in this context.
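
An exact match rate of this kind is simply the share of rows where the two labels agree. A minimal sketch, using made-up labels on an assumed 1 to 5 scale:

```python
def exact_match_rate(llm_labels, human_labels):
    """Fraction of positions where the LLM label equals the human label."""
    if len(llm_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not llm_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(a == b for a, b in zip(llm_labels, human_labels))
    return matches / len(llm_labels)


# Toy labels (hypothetical): the two lists disagree at positions 2 and 6.
human = [5, 4, 4, 3, 1, 2, 5, 3]
llm = [5, 4, 3, 3, 1, 2, 4, 3]
rate = exact_match_rate(llm, human)  # 6 of 8 labels agree
```

Exact match treats a near miss (4 versus 5) the same as a far miss (1 versus 5), so it is a strict view of agreement.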

Minimum detectable effect (MDE) with LLM labeling

≤ 0.25%

Down from 1.3%-1.5% before LLM labeling, this lower MDE allows for more sensitive detection of relevance shifts in A/B testing.

Inference time for labeling 150,000 rows

30 minutes

Measured with the XLM-RoBERTa-large model, this throughput shows that LLM labeling is practical at the scale of a relevance evaluation.
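
The quoted figure can be turned into a throughput for back-of-the-envelope planning. The sketch below assumes labeling time scales linearly with row count on the same setup, which the source does not claim.

```python
ROWS = 150_000  # rows labeled, from the figure above
MINUTES = 30    # reported inference time

rows_per_minute = ROWS / MINUTES        # 5000.0
rows_per_second = rows_per_minute / 60  # about 83.3


def estimated_minutes(rows, rate_per_minute=rows_per_minute):
    """Labeling time under an assumed linear scaling with row count."""
    return rows / rate_per_minute


one_million = estimated_minutes(1_000_000)  # 200.0 minutes under that assumption
```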

Technologies & Tools

Machine Learning Model

XLM-RoBERTa-large

Used as the backbone for the relevance model due to its balance of prediction quality and inference efficiency.

Machine Learning Model

DistilBERT

Utilized in the in-house query-to-interest model for stratified sampling design.

Key Actionable Insights

1. Implement LLM-based relevance assessment to streamline your A/B testing processes.
By adopting LLMs for relevance labeling, teams can significantly reduce manual annotation costs and turnaround times, allowing for more efficient experimentation and faster feature deployment.

2. Use stratified sampling to enhance the sensitivity of your experiments.
Stratified sampling helps the sample reflect the overall population being measured, leading to more reliable results and enabling the detection of smaller changes in metrics.

3. Leverage multilingual capabilities of LLMs for global applications.
Multilingual models improve relevance assessment for non-English queries and extend relevance measurement across different markets.

Common Pitfalls

1. Relying solely on human annotations for relevance measurement can lead to high costs and inefficiencies.
High labeling cost limits how much data can be annotated, which restricts the ability to measure nuanced effects and small topline changes and makes it difficult to iterate on product features effectively.

Related Concepts

A/B Testing Methodologies

Large Language Models (LLMs)

Stratified Sampling Techniques

Relevance Assessment in Search Systems