An inside look at LinkedIn’s data pipeline monitoring system

Krishnan Raman



Overview

This article takes an in-depth look at LinkedIn's data pipeline monitoring system: the limitations of traditional monitoring methods and how LinkedIn's approach evolved to improve visibility and efficiency. It describes an architecture that breaks big data pipelines into measurable segments using events, with a focus on data ingestion through Apache Gobblin.

What You'll Learn

1. How to implement real-time monitoring for data pipelines using events
2. Why leveraging Apache Gobblin can enhance data ingestion processes
3. How to identify and remediate data ingestion issues automatically

Prerequisites & Requirements

Understanding of data pipeline concepts and monitoring techniques
Familiarity with Apache Gobblin and Kafka (optional)

Key Questions Answered

How does LinkedIn monitor its data ingestion pipelines?

LinkedIn monitors its data ingestion pipelines by breaking them down into smaller, measurable segments using events emitted by Apache Gobblin. This allows for real-time visibility into job progress and helps identify issues before they impact downstream consumers.
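The idea of splitting a pipeline into measurable segments can be illustrated with a minimal Python sketch. The event fields (`job_id`, `segment`, `records`, `timestamp`) and the segment names used with it are hypothetical and chosen for illustration; they are not Gobblin's actual event schema or LinkedIn's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineEvent:
    """Hypothetical progress event; the field names are assumptions."""
    job_id: str
    segment: str       # a pipeline stage, e.g. "extract" or "publish"
    records: int       # records handled since the previous event
    timestamp: float   # epoch seconds when the event was emitted


def summarize_progress(events):
    """Aggregate events into per-(job, segment) record counts and last-seen times."""
    summary = defaultdict(lambda: {"records": 0, "last_seen": 0.0})
    for event in events:
        entry = summary[(event.job_id, event.segment)]
        entry["records"] += event.records
        entry["last_seen"] = max(entry["last_seen"], event.timestamp)
    return dict(summary)


def stalled_segments(summary, now, max_silence_s):
    """Return (job, segment) keys that have emitted no event for too long."""
    return sorted(
        key for key, entry in summary.items()
        if now - entry["last_seen"] > max_silence_s
    )
```

Because each segment reports its own progress, a segment that goes silent can be flagged while the job is still running, rather than after a final pass/fail status arrives.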

What are the challenges with traditional data pipeline monitoring?

Traditional data pipeline monitoring often relies on pass/fail statuses, which lack visibility into job progress and can delay issue detection. LinkedIn faced challenges such as not recognizing ingestion problems until downstream users reported data unavailability, leading to significant delays in data processing.

What metrics are used to assess data availability in LinkedIn's pipelines?

Data availability is assessed through metrics such as ingestion lag, data loss, and the timestamp of the last file written into HDFS. These metrics help determine whether data is being ingested on time and accurately.
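As a rough illustration, these metrics could be computed as in the Python sketch below. The formulas and the default thresholds (`max_lag_s`, `max_loss`) are assumptions made for the example, not the definitions or values LinkedIn uses.

```python
def ingestion_lag_seconds(now, last_file_written):
    """Seconds since the last file landed in storage (never negative)."""
    return max(0.0, now - last_file_written)


def data_loss_ratio(records_expected, records_ingested):
    """Fraction of expected records that did not arrive."""
    if records_expected <= 0:
        return 0.0
    missing = max(0, records_expected - records_ingested)
    return missing / records_expected


def is_available(now, last_file_written, records_expected, records_ingested,
                 max_lag_s=3600.0, max_loss=0.001):
    """Treat data as available only if both lag and loss are within limits.

    The default thresholds are placeholders, not recommended values.
    """
    lag = ingestion_lag_seconds(now, last_file_written)
    loss = data_loss_ratio(records_expected, records_ingested)
    return lag <= max_lag_s and loss <= max_loss
```

Checking lag and loss together matters because data can arrive on time but incomplete, or complete but late.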

Key Statistics & Figures

Time taken to catch up after ingestion issues

almost two weeks

After a rollback due to ingestion problems, it took nearly two weeks to recover from being behind by 1.5 days.

Delay in recognizing ingestion problems

1.5 days

The team often did not notice ingestion issues until downstream consumers reported data unavailability, which could take up to 1.5 days.

Technologies & Tools

Backend

Apache Gobblin

Used for data ingestion from various sources including Kafka and HDFS.

Messaging

Kafka

Serves as a source for data ingestion and event emission.

Storage

Hadoop Distributed File System (HDFS)

Used for storing ingested data.

Key Actionable Insights

1. Implement event-driven monitoring to enhance visibility into data pipeline performance.
By using events emitted during data processing, teams can gain real-time insights into job statuses and quickly address issues, reducing downtime and improving data availability.

2. Utilize Apache Gobblin for efficient data ingestion from multiple sources.
Gobblin's architecture allows for scalable and reliable data ingestion, making it easier to manage large datasets and integrate various data sources seamlessly.

3. Adopt automated remediation strategies to handle common ingestion issues.
Automating the remediation process can significantly reduce the time spent on manual interventions and help maintain data pipeline health without constant oversight.
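
One way to picture automated remediation is a small dispatcher that maps a detected issue type to a handler and escalates to a human only when no handler exists or retries run out. This is a minimal Python sketch under assumptions: the issue names, the handler contract (a callable returning `True` on success), and the retry limit are all hypothetical.

```python
def remediate(issue_kind, handlers, max_attempts=3):
    """Try the registered handler for an issue; escalate if it cannot be fixed.

    handlers maps a hypothetical issue name to a zero-argument callable
    that returns True when the remediation succeeded.
    """
    handler = handlers.get(issue_kind)
    if handler is None:
        # Unknown issue types are never auto-remediated.
        return "escalate"
    for _ in range(max_attempts):
        if handler():
            return "resolved"
    return "escalate"
```

Capping the attempts keeps a failing remediation from looping indefinitely and ensures unresolved issues still reach an on-call engineer.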

Common Pitfalls

1. Relying solely on pass/fail statuses for monitoring data pipelines can lead to delayed issue detection.
This approach lacks the granularity needed to understand job progress and can result in significant downtime before problems are identified.

2. Failing to correlate alerts can lead to false positives and unnecessary escalations.
Without proper correlation of metrics and alerts, teams may react to alerts that do not indicate actual issues, wasting resources and time.
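
A simple form of alert correlation is to escalate only when several distinct metrics raise alerts for the same dataset within a short time window. The Python sketch below assumes alerts arrive as `(dataset, metric, timestamp)` tuples; that shape, the window, and the `min_metrics` rule are illustrative choices, not the correlation logic described in the original article.

```python
from collections import defaultdict


def correlated_datasets(alerts, window_s, min_metrics=2):
    """Return datasets where at least min_metrics distinct metrics alerted
    within window_s seconds of each other.

    alerts is an iterable of (dataset, metric, timestamp) tuples.
    """
    by_dataset = defaultdict(list)
    for dataset, metric, timestamp in alerts:
        by_dataset[dataset].append((timestamp, metric))

    flagged = []
    for dataset, items in by_dataset.items():
        items.sort()
        # Slide a window that starts at each alert in time order.
        for i, (start, _) in enumerate(items):
            metrics = {m for ts, m in items[i:] if ts - start <= window_s}
            if len(metrics) >= min_metrics:
                flagged.append(dataset)
                break
    return sorted(flagged)
```

A single metric firing repeatedly, or two metrics firing far apart in time, does not trigger escalation under this rule, which is one way to cut down on false positives.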

Related Concepts

Data Pipeline Monitoring Techniques

Real-time Data Processing

Event-driven Architecture

Automated Remediation Strategies