Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam

LinkedIn Engineering Team


11 min read • advanced


Apache, Apache Spark

Overview

This article describes how LinkedIn unified its streaming and batch pipelines with Apache Beam, cutting backfill processing time by 94% (from seven hours to 25 minutes). It covers the drawbacks of the traditional lambda architecture and how a unified approach simplifies data processing workflows.

What You'll Learn

1. How to unify streaming and batch data processing using Apache Beam
2. Why using a single codebase for data pipelines reduces operational complexity
3. How to implement backfilling as a batch job to improve resource efficiency

Prerequisites & Requirements

Understanding of data processing concepts and Apache Beam
Familiarity with Apache Samza and Apache Spark (optional)

Key Questions Answered

How does LinkedIn reduce processing time using Apache Beam?

LinkedIn reduced processing time by 94% by unifying their streaming and batch pipelines with Apache Beam. This approach allowed them to run a single codebase for both real-time processing and periodic backfilling, significantly improving efficiency and resource utilization.
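The single-codebase idea can be illustrated with a minimal sketch in plain Python. It deliberately does not use the Apache Beam API, and every name in it (`standardize`, `run_pipeline`, `backfill_source`, `live_source`, the record fields) is a hypothetical stand-in rather than LinkedIn's code. The point is only that the transformation logic is written once and applied to both a bounded input (a backfill) and a stream-like input.

```python
from typing import Dict, Iterable, Iterator, List


def standardize(record: Dict) -> Dict:
    """Shared business logic (hypothetical): normalize one record."""
    return {
        "member_id": record["member_id"],
        "title": record["title"].strip().lower(),
    }


def run_pipeline(source: Iterable[Dict]) -> Iterator[Dict]:
    """Apply the same logic lazily to any source, bounded or not."""
    for record in source:
        yield standardize(record)


def backfill_source() -> List[Dict]:
    # Bounded input: a stand-in for a historical dataset read by a batch job.
    return [
        {"member_id": 1, "title": " Engineer "},
        {"member_id": 2, "title": "DATA Scientist"},
    ]


def live_source() -> Iterator[Dict]:
    # Stand-in for an unbounded stream; a real one would never finish.
    yield {"member_id": 3, "title": "Product Manager "}


# The same run_pipeline function serves both execution modes.
batch_results = list(run_pipeline(backfill_source()))
stream_results = list(run_pipeline(live_source()))
```

In a real Beam pipeline the runner, not a Python generator, decides how the work is executed; the sketch only shows why keeping the logic independent of the input type removes the need for a second codebase.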

What challenges did LinkedIn face with traditional lambda architecture?

The traditional lambda architecture required maintaining two separate codebases for batch and streaming jobs, leading to increased complexity, operational overhead, and the need for engineers to learn and manage different systems and languages.

What are the performance gains from using unified pipelines at LinkedIn?

After migrating to a unified pipeline, LinkedIn cut memory allocation and CPU time by approximately 50%. Backfilling job duration dropped from seven hours to 25 minutes.
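As a quick check, the headline 94% figure is consistent with the backfill durations quoted here, assuming it refers to the change from seven hours to 25 minutes. The variable names below are illustrative only.

```python
# Backfill duration before and after the migration, in minutes.
before_minutes = 7 * 60  # seven hours
after_minutes = 25

# Fraction of the original duration that was eliminated.
reduction = (before_minutes - after_minutes) / before_minutes
reduction_percent = round(reduction * 100)

# Equivalent speedup factor.
speedup = before_minutes / after_minutes

print(f"reduction: {reduction_percent}%, speedup: {speedup:.1f}x")
```

The same change can be read as a speedup factor of 420 / 25, which is sometimes easier to compare across jobs than a percentage.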

Key Statistics & Figures

Processing time reduction: 94%
Achieved by unifying streaming and batch pipelines using Apache Beam.

Backfilling duration: 25 minutes
Reduced from seven hours after migrating to a Beam unified pipeline.

Resource reduction: 50%
Memory allocation and CPU time fell by approximately half after implementing the unified pipeline.

Technologies & Tools


Data Processing Framework: Apache Beam
Used for building unified streaming and batch data processing pipelines.

Stream Processing Engine: Apache Samza
Powers stream processing applications at LinkedIn.

Batch Processing Engine: Apache Spark
Used for handling sophisticated batch scenarios.

Key Actionable Insights

1. Implementing unified streaming and batch pipelines can drastically reduce processing times and resource usage. By adopting Apache Beam, organizations can streamline their data processing workflows, making it easier to manage and maintain codebases.

2. Standardizing data processing logic across streaming and batch jobs enhances engineer productivity. With a single codebase, developers can focus on writing efficient code without the overhead of managing multiple systems.

3. Utilizing batch processing for backfilling can optimize resource allocation during peak loads. This approach allows for better handling of complex training models without compromising on time or resource efficiency.

Common Pitfalls

1. Failing to properly manage different data sources in batch and stream environments can lead to inefficiencies. This happens when the shared codebase does not adequately abstract over the different data sources, so source-specific handling leaks into the processing logic and adds complexity and potential errors.
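One way to avoid this pitfall is to confine source-specific parsing to small adapter functions that all emit the same record type, so the processing logic never sees where the data came from. The following is a minimal sketch in plain Python, not Beam's I/O API; `Event`, `read_from_log`, `read_from_snapshot`, and `count_by_member` are hypothetical names, and the line and row formats are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """Common record type that both adapters produce."""
    member_id: int
    payload: str


def read_from_log(lines: List[str]) -> Iterable[Event]:
    """Streaming-style adapter (assumed format): 'member_id|payload' lines."""
    for line in lines:
        member_id, payload = line.split("|", 1)
        yield Event(int(member_id), payload)


def read_from_snapshot(rows: List[dict]) -> Iterable[Event]:
    """Batch-style adapter (assumed format): rows from a stored dataset."""
    for row in rows:
        yield Event(row["member_id"], row["payload"])


def count_by_member(events: Iterable[Event]) -> Dict[int, int]:
    """Source-agnostic logic: count events per member."""
    counts: Dict[int, int] = {}
    for event in events:
        counts[event.member_id] = counts.get(event.member_id, 0) + 1
    return counts


stream_counts = count_by_member(read_from_log(["1|a", "2|b", "1|c"]))
batch_counts = count_by_member(
    read_from_snapshot(
        [
            {"member_id": 1, "payload": "a"},
            {"member_id": 2, "payload": "b"},
            {"member_id": 1, "payload": "c"},
        ]
    )
)
```

Because both adapters return `Event` objects, adding a new source means writing one more adapter rather than touching `count_by_member`.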

Related Concepts

Data Streaming

Batch Processing

Apache Beam

Data Pipeline Optimization