Gobblin' Big Data With Ease

LinkedIn Engineering Team
5 min read · Advanced

Overview

The article discusses LinkedIn's efforts to simplify big data ingestion for Hadoop-based warehouses using a framework called Gobblin. It highlights the challenges faced in managing diverse datasets and the solutions developed to streamline the ingestion process.

What You'll Learn

1. How to simplify big data ingestion for Hadoop-based warehouses
2. Why a centralized data lake is essential for analytics
3. When to implement an ingestion framework like Gobblin

Prerequisites & Requirements

  • Understanding of data ingestion processes and Hadoop ecosystems
  • Familiarity with Kafka and data pipeline technologies (optional)

Key Questions Answered

What challenges does LinkedIn face in data ingestion?
LinkedIn has to manage diverse datasets from many sources while handling schema evolution and maintaining data quality. Maintaining more than 15 separate ingestion pipelines became complex enough that a unified framework was needed to ensure data quality and operational efficiency.
How does Gobblin improve data ingestion at LinkedIn?
Gobblin consolidates various data ingestion processes into a single framework, allowing LinkedIn to efficiently manage both internal and external datasets. It addresses issues like schema evolution, data quality, and operational ease, processing tens of terabytes of data daily.
What types of data sources does LinkedIn ingest?
LinkedIn ingests data from internal sources such as member profiles and from external platforms like Salesforce, Google, Facebook, and Twitter. Internal datasets are significantly larger in volume, while external datasets present challenges in quality and availability.
What are the common patterns in LinkedIn's data ingestion?
Common patterns include centralized data lakes with standardized formats, lightweight transformations, data quality measurements, scalable ingestion, and ease of operations. These patterns help streamline the ingestion process and maintain data integrity.
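
The patterns above (lightweight transformation, data quality measurement, and a single operational path) can be sketched as one small pipeline. This is a minimal illustration in Python and not Gobblin's actual API: `run_ingestion`, `QualityReport`, the `min_pass_ratio` threshold, and the sample records are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

Record = dict


@dataclass
class QualityReport:
    """Counts gathered while records flow through the pipeline."""
    extracted: int = 0
    staged: int = 0
    rejected: int = 0

    def pass_ratio(self) -> float:
        # An empty run counts as passing so it does not trip the threshold.
        return 1.0 if self.extracted == 0 else self.staged / self.extracted


def run_ingestion(
    extract: Callable[[], Iterable[Record]],
    convert: Callable[[Record], Record],
    is_valid: Callable[[Record], bool],
    publish: Callable[[List[Record]], None],
    min_pass_ratio: float = 0.95,  # hypothetical threshold
) -> QualityReport:
    """Extract, convert, and check records; publish only if quality passes."""
    report = QualityReport()
    staged: List[Record] = []
    for raw in extract():
        report.extracted += 1
        record = convert(raw)  # lightweight, per-record transformation
        if is_valid(record):
            staged.append(record)
            report.staged += 1
        else:
            report.rejected += 1
    # Nothing reaches the final location unless the batch meets the threshold.
    if report.pass_ratio() < min_pass_ratio:
        raise ValueError(
            f"quality check failed: pass ratio {report.pass_ratio():.2f}"
        )
    publish(staged)
    return report


# Hypothetical sample data: the second record is missing its member_id.
rows = [
    {"member_id": "1", "country": "us"},
    {"member_id": "", "country": "de"},
]
published: List[Record] = []
report = run_ingestion(
    extract=lambda: rows,
    convert=lambda r: {**r, "country": r["country"].upper()},
    is_valid=lambda r: bool(r["member_id"]),
    publish=published.extend,
    min_pass_ratio=0.5,
)
```

Staging records and publishing them only after the quality check keeps a bad batch out of the shared data lake. Because every source goes through the same function, onboarding a new dataset means supplying new `extract` and `convert` callables rather than building another pipeline.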

Key Statistics & Figures

Daily data processed: tens of terabytes
Gobblin is already processing this volume of data in production, showcasing its capability to handle large-scale data ingestion.
Number of data ingestion pipelines previously managed: more than 15
This complexity led to challenges in maintaining data quality and operability across different pipelines.

Technologies & Tools

Hadoop (Backend)
Used as the primary data warehouse for storing ingested data.
Kafka (Backend)
Utilized for streaming important events through LinkedIn's central activity pipeline.
Databus (Backend)
Employed for continuous change capture streams from source databases.

Key Actionable Insights

1. Implement a centralized data lake to unify data sources for analytics.
Centralizing data allows for better management and analysis, ensuring that insights are derived from a comprehensive dataset rather than fragmented sources.
2. Standardize data formats and directory layouts for easier data ingestion.
Having a standardized approach reduces complexity and improves the efficiency of data pipelines, making it easier for engineers to onboard new datasets.
3. Utilize Gobblin to streamline your data ingestion processes.
By adopting Gobblin, organizations can reduce the operational burden of managing multiple ingestion pipelines and improve data quality across the board.
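
As one way to picture a standardized directory layout, the sketch below builds every dataset's output path from the same rule. The layout `<root>/<source>/<dataset>/daily/YYYY/MM/DD`, the function name `dataset_path`, and the sample values are assumptions made for illustration. The article does not specify LinkedIn's actual layout.

```python
from datetime import date
from pathlib import PurePosixPath


def dataset_path(root: str, source: str, dataset: str, day: date) -> PurePosixPath:
    """Build a daily partition path using one hypothetical, shared layout."""
    for part in (source, dataset):
        # Reject empty names and embedded separators so the layout stays flat.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return (
        PurePosixPath(root)
        / source.lower()
        / dataset.lower()
        / "daily"
        / f"{day:%Y/%m/%d}"
    )


# Hypothetical example values.
path = dataset_path("/data", "Salesforce", "Accounts", date(2015, 3, 7))
print(path)
```

Because every dataset resolves to a path through the same function, downstream jobs can locate any source's data for a given day without source-specific logic.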

Common Pitfalls

1. Overcomplicating data ingestion processes with too many pipelines.
Having multiple ingestion pipelines can lead to inconsistencies in data quality and operational difficulties. It's crucial to streamline and standardize these processes to maintain efficiency.

Related Concepts

Data Streaming/Processing
Data Management
Distributed Systems