Overview
This article takes an in-depth look at LinkedIn's data pipeline monitoring system, focusing on the limits of traditional monitoring methods and how LinkedIn's approach evolved to improve visibility and efficiency. It describes an architecture that breaks big data pipelines into measurable segments using events, in particular those emitted by Apache Gobblin during data ingestion.
What You'll Learn
1. How to implement real-time monitoring for data pipelines using events
2. Why leveraging Apache Gobblin can enhance data ingestion processes
3. How to identify and remediate data ingestion issues automatically
Prerequisites & Requirements
- Understanding of data pipeline concepts and monitoring techniques
- Familiarity with Apache Gobblin and Kafka (optional)
Key Questions Answered
How does LinkedIn monitor its data ingestion pipelines?
LinkedIn monitors its data ingestion pipelines by breaking them down into smaller, measurable segments using events emitted by Apache Gobblin. This allows for real-time visibility into job progress and helps identify issues before they impact downstream consumers.
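The idea of splitting a pipeline into measurable segments can be sketched in a few lines of Python. This is a minimal illustration, not LinkedIn's implementation: the `PipelineEvent` shape, the segment names, and the `silent_jobs` helper are all hypothetical and do not reflect Gobblin's actual event schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineEvent:
    """Hypothetical event shape; not Gobblin's actual event schema."""
    job_id: str
    segment: str       # illustrative names such as "extract" or "write"
    records: int       # records handled by this segment in this event
    timestamp: float   # epoch seconds when the event was emitted


def summarize_progress(events):
    """Per job and segment, total the records and keep the latest timestamp."""
    progress = defaultdict(dict)
    for ev in events:
        seg = progress[ev.job_id].setdefault(
            ev.segment, {"records": 0, "last_seen": 0.0}
        )
        seg["records"] += ev.records
        seg["last_seen"] = max(seg["last_seen"], ev.timestamp)
    return {job: dict(segs) for job, segs in progress.items()}


def silent_jobs(progress, now, max_silence_seconds):
    """Jobs whose newest event, across all segments, is older than the limit.

    Assumption: finished jobs are filtered out before this check, since a
    completed job also stops emitting events.
    """
    return sorted(
        job
        for job, segs in progress.items()
        if now - max(s["last_seen"] for s in segs.values()) > max_silence_seconds
    )
```

With a summary like this, a job that stops reporting can be flagged while it is still running, instead of waiting for a final pass/fail status.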
What are the challenges with traditional data pipeline monitoring?
Traditional data pipeline monitoring often relies on pass/fail statuses, which lack visibility into job progress and can delay issue detection. LinkedIn faced challenges such as not recognizing ingestion problems until downstream users reported data unavailability, leading to significant delays in data processing.
What metrics are used to assess data availability in LinkedIn's pipelines?
Data availability is assessed through metrics such as ingestion lag, data loss, and the timestamp of the last file written into HDFS. These metrics help determine whether data is being ingested on time and accurately.
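As a rough sketch of how such metrics might be combined into a single availability check, consider the Python below. The field names and thresholds are hypothetical, and deriving lag from the last-file timestamp is a simplifying assumption; the article does not specify how LinkedIn computes these values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySnapshot:
    """Hypothetical inputs; field names are illustrative only."""
    now: float                    # epoch seconds at evaluation time
    last_file_written_at: float   # epoch seconds of the newest file in storage
    records_produced: int         # records counted at the source
    records_ingested: int         # records counted at the destination


def ingestion_lag_seconds(snap):
    """Time since the last file landed (assumed proxy for ingestion lag)."""
    return max(0.0, snap.now - snap.last_file_written_at)


def data_loss_ratio(snap):
    """Fraction of source records missing at the destination."""
    if snap.records_produced == 0:
        return 0.0
    missing = max(0, snap.records_produced - snap.records_ingested)
    return missing / snap.records_produced


def is_available(snap, max_lag_seconds, max_loss_ratio):
    """True when both lag and loss are within the given thresholds."""
    return (
        ingestion_lag_seconds(snap) <= max_lag_seconds
        and data_loss_ratio(snap) <= max_loss_ratio
    )
```

Evaluating both conditions together keeps a dataset from being reported as available when files are arriving on time but records are missing, or the reverse.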
Key Statistics & Figures
Time taken to catch up after ingestion issues
almost two weeks
After a rollback due to ingestion problems, it took nearly two weeks to recover from being behind by 1.5 days.
Delay in recognizing ingestion problems
1.5 days
The team often did not notice ingestion issues until downstream consumers reported data unavailability, which could take up to 1.5 days.
Technologies & Tools
Backend
Apache Gobblin
Used for data ingestion from various sources including Kafka and HDFS.
Messaging
Kafka
Serves as a source for data ingestion and event emission.
Storage
Hadoop Distributed File System (HDFS)
Used for storing ingested data.
Key Actionable Insights
1. Implement event-driven monitoring to enhance visibility into data pipeline performance.
By using events emitted during data processing, teams can gain real-time insights into job statuses and quickly address issues, reducing downtime and improving data availability.
2. Utilize Apache Gobblin for efficient data ingestion from multiple sources.
Gobblin's architecture allows for scalable and reliable data ingestion, making it easier to manage large datasets and integrate various data sources seamlessly.
3. Adopt automated remediation strategies to handle common ingestion issues.
Automating the remediation process can significantly reduce the time spent on manual interventions and help maintain data pipeline health without constant oversight.
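One way to picture automated remediation is a small dispatcher that maps known issue types to fixes and escalates everything else. The sketch below is an assumption-laden illustration: the issue names, the handlers, and the retry limit are hypothetical, and the handlers only return strings where a real system would call a scheduler or cluster API.

```python
def restart_job(job_id):
    # Placeholder: a real handler would call the job scheduler.
    return f"restarted {job_id}"


def increase_parallelism(job_id):
    # Placeholder: a real handler would change the job's resource settings.
    return f"scaled up {job_id}"


# Hypothetical mapping from issue type to automatic fix.
REMEDIATIONS = {
    "job_stalled": restart_job,
    "lag_growing": increase_parallelism,
}


def remediate(issue_type, job_id, attempts, max_attempts=3):
    """Apply the automatic fix for a known issue, or escalate to a human.

    `attempts` is a dict the caller keeps between calls so that a fix that
    keeps failing is not retried forever.
    """
    key = (issue_type, job_id)
    handler = REMEDIATIONS.get(issue_type)
    if handler is None or attempts.get(key, 0) >= max_attempts:
        return f"escalated {issue_type} on {job_id}"
    attempts[key] = attempts.get(key, 0) + 1
    return handler(job_id)
```

Capping the number of automatic attempts is a deliberate choice in this sketch: unknown issues and repeated failures still reach a person, while routine ones are handled without manual intervention.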
Common Pitfalls
1. Relying solely on pass/fail statuses for monitoring data pipelines can lead to delayed issue detection.
This approach lacks the granularity needed to understand job progress and can result in significant downtime before problems are identified.
2. Failing to correlate alerts can lead to false positives and unnecessary escalations.
Without proper correlation of metrics and alerts, teams may react to alerts that do not indicate actual issues, wasting resources and time.
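Alert correlation can be approximated by escalating only when several different metrics fire for the same pipeline within a short window. The following Python is a minimal sketch under that assumption; the tuple layout, metric names, and window size are hypothetical and are not taken from LinkedIn's system.

```python
from collections import defaultdict


def correlated_pipelines(alerts, window_seconds, min_distinct_metrics=2):
    """Return pipelines where enough different metrics alerted close together.

    `alerts` is an iterable of (pipeline, metric, timestamp) tuples.
    """
    by_pipeline = defaultdict(list)
    for pipeline, metric, ts in alerts:
        by_pipeline[pipeline].append((ts, metric))

    confirmed = []
    for pipeline, items in by_pipeline.items():
        items.sort()
        # Slide a window starting at each alert and count distinct metrics.
        for i, (start, _) in enumerate(items):
            metrics = {m for ts, m in items[i:] if ts - start <= window_seconds}
            if len(metrics) >= min_distinct_metrics:
                confirmed.append(pipeline)
                break
    return sorted(confirmed)
```

A single metric that fires repeatedly, or two metrics that fire hours apart, does not pass this check, which is how the sketch filters out likely false positives before anyone is paged.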
Related Concepts
Data Pipeline Monitoring Techniques
Real-time Data Processing
Event-driven Architecture
Automated Remediation Strategies