Embedding-Based Retrieval for Airbnb Search

Our journey in applying embedding-based retrieval techniques to build an accurate and scalable candidate retrieval system for Airbnb Homes…

Huiji Gao
7 min read, advanced

Overview

The article discusses the development of Airbnb's first Embedding-Based Retrieval (EBR) search system, which aims to improve the relevance of search results for users by narrowing down the pool of listings based on their queries. It outlines the challenges faced during the implementation, including training data construction, model architecture, and online serving strategies.

What You'll Learn

1. How to construct training data for machine learning models using contrastive learning
2. Why a two-tower architecture is beneficial for embedding-based retrieval systems
3. How to select the appropriate approximate nearest neighbor solution for online serving

Prerequisites & Requirements

  • Understanding of machine learning concepts and embedding techniques
  • Experience with building and deploying machine learning models (optional)

Key Questions Answered

What are the key challenges in building an embedding-based retrieval system?
The key challenges include constructing training data, designing the model architecture, and developing an online serving strategy using Approximate Nearest Neighbor (ANN) solutions. Each of these components plays a critical role in ensuring the system can effectively retrieve relevant homes while maintaining performance.
How does Airbnb's EBR system improve search result relevance?
The EBR system improves search result relevance by effectively incorporating query context, which allows homes to be ranked more accurately during retrieval. This leads to displaying more relevant results, especially for queries with a high number of eligible listings.
What approximate nearest neighbor solutions were considered for online serving?
Airbnb explored two main ANN solutions: inverted file index (IVF) and hierarchical navigable small worlds (HNSW). While HNSW performed slightly better on evaluation metrics, IVF was chosen because it offered a better trade-off between serving speed and retrieval quality, especially given the high volume of real-time updates.
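
To make the IVF idea concrete, here is a minimal pure-Python sketch of an inverted file index. It is an illustration under stated assumptions, not Airbnb's implementation: the class name `IVFIndex`, the `nprobe` parameter, and the use of squared Euclidean distance are all choices made for this example, and the centroids are assumed to come from a clustering step (for example k-means) that has already been run offline.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class IVFIndex:
    """Toy inverted file index: each vector is stored under its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        # cluster id -> {listing_id: vector}
        self.lists = {i: {} for i in range(len(centroids))}
        # listing_id -> cluster id, so updates can find the old entry
        self.assignment = {}

    def _rank_clusters(self, vector):
        """Cluster ids ordered from nearest to farthest centroid."""
        return sorted(
            range(len(self.centroids)),
            key=lambda i: squared_distance(vector, self.centroids[i]),
        )

    def remove(self, listing_id):
        cluster = self.assignment.pop(listing_id, None)
        if cluster is not None:
            del self.lists[cluster][listing_id]

    def upsert(self, listing_id, vector):
        """Insert or update a listing; only two dictionary entries change."""
        self.remove(listing_id)
        cluster = self._rank_clusters(vector)[0]
        self.lists[cluster][listing_id] = vector
        self.assignment[listing_id] = cluster

    def search(self, query, k, nprobe=1):
        """Scan only the nprobe clusters nearest to the query, return top-k ids."""
        candidates = []
        for cluster in self._rank_clusters(query)[:nprobe]:
            for listing_id, vector in self.lists[cluster].items():
                candidates.append((squared_distance(query, vector), listing_id))
        candidates.sort()
        return [listing_id for _, listing_id in candidates[:k]]


# Hypothetical data: two centroids and four listings in a 2-d embedding space.
index = IVFIndex(centroids=[[0.0, 0.0], [5.0, 5.0]])
for listing_id, vector in {
    "a": [0.0, 0.1], "b": [0.3, 0.0], "c": [5.0, 5.1], "d": [5.1, 5.0],
}.items():
    index.upsert(listing_id, vector)

nearest = index.search([0.0, 0.0], k=2, nprobe=1)
```

In this sketch an update touches only the listing's own entry, which is one reason a cluster-based index can be easier to keep current than a graph that must be rewired. Raising `nprobe` scans more clusters, trading speed for recall.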

Key Statistics & Figures

Bookings lift from EBR system
statistically significant gain
This improvement was observed during A/B testing. The bookings lift was comparable to that of major machine learning enhancements made in the past two years.

Technologies & Tools

Algorithm
Approximate Nearest Neighbor (ANN)
Used for online serving to efficiently retrieve relevant listings based on user queries.

Key Actionable Insights

1. Implement a two-tower architecture for your embedding-based retrieval system so that listing features and query features are processed separately.
   Listing embeddings can then be computed offline, which reduces online latency because only the query tower has to run at request time.
2. Use contrastive learning to construct training data that captures user behavior.
   By pairing positive and negative examples based on user interactions, you can train models that better understand the context of user queries, leading to improved retrieval accuracy.
3. Choose the ANN solution that fits your specific use case, balancing retrieval quality against speed.
   For systems with frequent updates, like Airbnb's, an inverted file index may offer a better trade-off between speed and retrieval quality than a graph-based index such as HNSW, even if the latter scores slightly higher on offline metrics.
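
The offline/online split behind the two-tower insight can be sketched in a few lines of Python. This is a toy illustration, not the model described in the article: each "tower" here is a single hand-written linear layer with L2 normalization, and the weight matrices, feature vectors, and function names (`embed_listings_offline`, `retrieve_online`) are hypothetical. A real system would use trained neural networks and an ANN index rather than a full scan.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (leave all-zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def linear_tower(weights, features):
    """Toy tower: one linear layer followed by L2 normalization."""
    return normalize([sum(w * x for w, x in zip(row, features)) for row in weights])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical weights standing in for trained towers.
LISTING_WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3 listing features -> 2-d
QUERY_WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]              # 2 query features -> 2-d


def embed_listings_offline(listing_features):
    """Batch job: run the listing tower once and store the embeddings."""
    return {
        listing_id: linear_tower(LISTING_WEIGHTS, features)
        for listing_id, features in listing_features.items()
    }


def retrieve_online(query_features, listing_embeddings, k):
    """Request path: run only the query tower, then score stored embeddings."""
    query_embedding = linear_tower(QUERY_WEIGHTS, query_features)
    ranked = sorted(
        listing_embeddings.items(),
        key=lambda item: dot(query_embedding, item[1]),
        reverse=True,
    )
    return [listing_id for listing_id, _ in ranked[:k]]


# Made-up listings and query for illustration.
embedding_table = embed_listings_offline({
    "cabin": [1.0, 0.0, 0.5],
    "loft": [0.0, 1.0, 0.5],
    "villa": [0.7, 0.7, 0.2],
})
top_two = retrieve_online([0.9, 0.1], embedding_table, k=2)
```

Because the listing tower never sees the query, its output can be cached ahead of time. The cost at request time is one query-tower pass plus the similarity search.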

Common Pitfalls

1. Randomly sampling negative examples can lead to poor model performance.
   A likely reason is that random negatives are too easy to tell apart from the positive example, so the model never has to learn the finer distinctions in user interactions and preferences.
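
One way to see why easy negatives teach the model little is a small sketch of training-triplet construction with a margin loss. The details are assumptions made for illustration: the `journey` record layout, the choice of "interacted with but did not book" listings as negatives, the triplet-style margin loss (one common contrastive objective, not necessarily the one used in the article), and the margin value are all hypothetical.

```python
def build_triplets(journey):
    """Pair the booked listing (positive) with every other listing the
    same user interacted with during the search (negatives)."""
    negatives = [
        listing_id
        for listing_id in journey["interacted"]
        if listing_id != journey["booked"]
    ]
    return [(journey["query"], journey["booked"], neg) for neg in negatives]


def triplet_loss(sim_positive, sim_negative, margin=0.2):
    """Zero once the positive beats the negative by at least the margin."""
    return max(0.0, margin - (sim_positive - sim_negative))


# Hypothetical search journey: the user looked at three homes and booked h3.
triplets = build_triplets({
    "query": "q1",
    "booked": "h3",
    "interacted": ["h1", "h3", "h7"],
})

# Made-up similarity scores between the query and each listing.
easy_loss = triplet_loss(sim_positive=0.8, sim_negative=0.1)  # random, unrelated home
hard_loss = triplet_loss(sim_positive=0.8, sim_negative=0.7)  # home the user considered
```

With these numbers the unrelated negative already sits far outside the margin, so its loss is zero and it contributes no training signal. The negative the user actually considered still produces a positive loss, which is what pushes the model to separate close alternatives.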