Distributed tier merge: How LinkedIn tackles stragglers in search index build

Andy Li

11 min read • advanced

Apache Spark

Overview

This article describes how LinkedIn mitigates stragglers during search index builds with a technique called Distributed Tier Merge (DTM). It covers the challenges of the earlier indexing pipeline and how DTM cut index build time by 30-40% for major search products.

What You'll Learn

1. How to implement Distributed Tier Merge (DTM) to optimize indexing processes
2. Why stragglers occur in distributed computing and how to mitigate them
3. How to improve indexing speed using Apache Spark

Prerequisites & Requirements

Understanding of distributed systems and indexing processes
Familiarity with Apache Spark and Hadoop MapReduce (optional)

Key Questions Answered

How does LinkedIn tackle stragglers in search index builds?

LinkedIn addresses stragglers during search index builds by implementing Distributed Tier Merge (DTM), which allows individual merges to run on separate executors in a Spark cluster. This approach reduces the chance of stragglers affecting the overall indexing time, leading to a 30-40% reduction in build time for major search products.
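
As a rough illustration of the idea, the sketch below merges sorted segments in tiers, with the merges inside each tier submitted concurrently. It is not LinkedIn's implementation and makes several assumptions: a Python thread pool stands in for Spark executors, sorted integer lists stand in for index segments, and the names `merge_group`, `tiered_merge`, `fan_in`, and `workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge as merge_sorted


def merge_group(segments):
    """Merge a small group of sorted segments into one sorted segment."""
    return list(merge_sorted(*segments))


def tiered_merge(segments, fan_in=2, workers=4):
    """Merge sorted segments tier by tier.

    Assumption: each tier splits the remaining segments into groups of
    `fan_in`, and every group is merged as an independent task, so one
    slow merge only delays its own group rather than the whole build.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not segments:
        return []
    # The thread pool is a stand-in for separate executors in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(segments) > 1:
            groups = [segments[i:i + fan_in]
                      for i in range(0, len(segments), fan_in)]
            segments = list(pool.map(merge_group, groups))
    return list(segments[0])


# Toy subpartitions; real inputs would be index segments, not integers.
subpartitions = [[1, 9], [2, 8], [3, 7], [4, 6], [5]]
merged = tiered_merge(subpartitions)
```

With five input segments and `fan_in=2`, the sketch runs three tiers (5 to 3 to 2 to 1 segments). The point is the shape of the work, many small independent merges instead of one long sequential merge, not the performance of this toy version.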

What are the main challenges faced in LinkedIn's indexing pipeline?

The main challenge was scaling the existing Hadoop MapReduce pipeline, where stragglers caused long indexing times. Stragglers are machines that complete tasks more slowly than expected because of resource constraints, and the resulting delays significantly affect service availability.

What improvements were observed after migrating to Apache Spark?

After migrating to Apache Spark, LinkedIn observed a 30% improvement in the speed of building indexes for the combiner, divider, and indexer stages. However, the overall end-to-end build time did not improve significantly due to the time-intensive nature of the merging process.

What is the impact of Distributed Tier Merge on LinkedIn's search products?

Distributed Tier Merge (DTM) reduced index build time by 30-40%, improved service availability, and let LinkedIn deliver fresh indexes more quickly, which improves the user experience.

Key Statistics & Figures

Index build time reduction

30-40%

Achieved through the implementation of Distributed Tier Merge for major search products.

Time taken by mergers

60-70%

The merger stage accounts for this percentage of the total IndexGen time when indexing jobs run at normal speed.

Speed improvement after migrating to Spark

30%

Observed in the combiner, divider, and indexer stages, although overall end-to-end build time did not improve significantly.
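
The figures above suggest why the Spark migration alone moved the end-to-end number so little. The sketch below works through the arithmetic under two assumptions that are not stated in the source: that the "30% improvement" means those stages take 30% less time, and that the combiner, divider, and indexer account for all of the time not spent in the merger. The function name `end_to_end_reduction` is hypothetical.

```python
def end_to_end_reduction(merge_fraction, other_stage_reduction):
    """Fraction of total build time saved when only non-merge stages speed up.

    merge_fraction: share of total time spent in the merger (unchanged).
    other_stage_reduction: fractional time saved in all remaining stages.
    """
    other_fraction = 1.0 - merge_fraction
    new_total = merge_fraction + other_fraction * (1.0 - other_stage_reduction)
    return 1.0 - new_total


# Merger share of 60% and 70%, the two ends of the range quoted above.
for merge_fraction in (0.60, 0.70):
    saved = end_to_end_reduction(merge_fraction, 0.30)
    print(f"merge share {merge_fraction:.0%}: "
          f"end-to-end time falls by about {saved:.0%}")
```

Under these assumptions the end-to-end saving is only about 12% when the merger takes 60% of the time and about 9% when it takes 70%, which is consistent with the observation that the merge stage had to be addressed directly.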

Technologies & Tools

Backend

Apache Spark

Used for indexing processes to improve speed and efficiency.

Backend

Hadoop MapReduce

Original framework used for LinkedIn's indexing pipeline before migration to Spark.

Backend

Lucene

Used for indexing documents in LinkedIn's search architecture.

Key Actionable Insights

1. Implement Distributed Tier Merge (DTM) to enhance the efficiency of your indexing processes.
DTM allows for concurrent merging of subpartitions across different machines, significantly reducing the likelihood of stragglers and improving overall indexing time.

2. Consider migrating to Apache Spark for better performance in data processing tasks.
The migration to Spark not only improved the speed of certain indexing stages but also provided a more flexible framework for handling large datasets.

3. Monitor and analyze the performance of your indexing pipeline to identify potential stragglers.
Understanding where stragglers occur can help in optimizing resource allocation and improving the overall performance of distributed systems.
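
One simple way to act on the monitoring advice is to compare each task's duration with the median of its peers. The sketch below is a minimal example of that heuristic, not something described in the source: the function name `find_stragglers`, the 1.5x threshold, and the sample task names and durations are all hypothetical.

```python
from statistics import median


def find_stragglers(durations, factor=1.5):
    """Return task names whose duration exceeds `factor` times the median.

    durations: mapping of task name to elapsed seconds.
    The threshold is an arbitrary illustrative choice, not a recommendation.
    """
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(name for name, seconds in durations.items()
                  if seconds > factor * typical)


# Made-up durations for four merge tasks, in seconds.
task_seconds = {"merge-0": 610, "merge-1": 590, "merge-2": 640, "merge-3": 2400}
stragglers = find_stragglers(task_seconds)
```

Here the median is 625 seconds, so only `merge-3` crosses the 937.5-second threshold. In practice the durations would come from whatever task metrics your cluster exposes.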

Common Pitfalls

1. Relying solely on speculative execution to mitigate stragglers can be ineffective.

Speculative execution may not help with merge operations that take hours to complete, as backup mergers would still require significant time, leading to high latency.
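
A small calculation makes the point concrete. The sketch below models a backup task that is launched some time after the original starts and then needs the full normal run time. The model and every number in it are illustrative assumptions, and `finish_time_with_backup` is a hypothetical helper, not a Spark API.

```python
def finish_time_with_backup(straggler_hours, normal_hours, launch_after_hours):
    """Hours until the merge finishes when a backup copy is launched.

    The job completes when either the original (straggling) task or the
    backup finishes, whichever comes first. The backup starts from scratch.
    """
    backup_finish = launch_after_hours + normal_hours
    return min(straggler_hours, backup_finish)


# Hypothetical case: a merge that normally takes 3 hours straggles to 8 hours,
# and the backup is launched 2 hours in, once the slowdown is detected.
finish = finish_time_with_backup(straggler_hours=8.0,
                                 normal_hours=3.0,
                                 launch_after_hours=2.0)
```

In this example the backup brings completion down from 8 hours to 5, but that is still 2 hours later than a healthy 3-hour run. The longer the individual merge, the more latency the restart adds, which is why splitting the merge into smaller independent pieces helps more than retrying it.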

Related Concepts

Distributed Systems

Indexing Techniques

Data Processing Frameworks