Solving the data integration variety problem at scale, with Gobblin

Chris L.
11 min read · intermediate

Overview

The article discusses the challenges of data integration at scale within LinkedIn's big data ecosystem and presents Gobblin's Data Integration Library (DIL) as a solution to streamline and standardize data integration processes. It highlights the importance of a configuration-driven framework that accommodates various data standards while ensuring scalability and compliance.

What You'll Learn

1. How to use the Data Integration Library (DIL) for efficient data integration
2. Why a multistage architecture enhances data ingestion processes
3. How to configure data integration pipelines using HOCON, YAML, or JSON

Prerequisites & Requirements

  • Understanding of data integration concepts and frameworks
  • Familiarity with Gobblin and its execution framework (optional)

Key Questions Answered

What are the main challenges of data integration at scale?
The main challenges include linearly growing engineering costs due to a 'bag of connectors' strategy, long lead times for connector development, and varying levels of compliance and security across different connectors. These issues arise from the need to create numerous custom connectors for specific use cases, leading to inefficiencies.
How does the Data Integration Library (DIL) simplify data integration?
DIL simplifies data integration by providing a library of generic components that can be configured for various data sources, reducing the need for custom-built connectors. It allows users to write configurations in HOCON, YAML, or JSON, which streamlines the integration process and improves maintainability.
What benefits does DIL offer to the open source community?
DIL offers significant benefits such as quicker time-to-market for new business initiatives, reduced lead times for onboarding, and lower maintenance costs. Its standardized design can lead to widespread adoption across companies using Gobblin, potentially impacting hundreds or thousands of customized connectors.
What is the significance of the multistage architecture in DIL?
The multistage architecture in DIL allows for the decomposition of complex ingestion jobs into smaller, manageable tasks. This approach enhances efficiency and scalability by enabling independent execution and management of each stage, facilitating easier recovery and state tracking.
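
The multistage idea in the last answer can be illustrated with a minimal Python sketch. It is not Gobblin's or DIL's API: `PipelineState`, `run_pipeline`, and the three stage names are hypothetical, chosen only to show how a job split into stages can record which stages have finished and resume from the failed one on retry.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types and names for illustration; not part of Gobblin or DIL.
Stage = Callable[[dict], dict]


@dataclass
class PipelineState:
    """Checkpoint shared across runs: finished stage names plus their output."""
    completed: List[str] = field(default_factory=list)
    payload: dict = field(default_factory=dict)


def run_pipeline(stages: List[Tuple[str, Stage]], state: PipelineState) -> PipelineState:
    """Run stages in order, skipping any that an earlier run already finished."""
    for name, stage in stages:
        if name in state.completed:
            continue
        # If the stage raises, the state still reflects the last good stage.
        state.payload = stage(state.payload)
        state.completed.append(name)
    return state


def demo():
    """Fail once in the last stage, then retry using the same state."""
    calls: Dict[str, int] = {"authenticate": 0, "list_objects": 0, "download": 0}

    def authenticate(payload: dict) -> dict:
        calls["authenticate"] += 1
        return {**payload, "token": "demo-token"}

    def list_objects(payload: dict) -> dict:
        calls["list_objects"] += 1
        return {**payload, "objects": ["a.csv", "b.csv"]}

    def download(payload: dict) -> dict:
        calls["download"] += 1
        if calls["download"] == 1:
            raise RuntimeError("simulated transient failure")
        return {**payload, "rows": 10 * len(payload["objects"])}

    stages = [
        ("authenticate", authenticate),
        ("list_objects", list_objects),
        ("download", download),
    ]
    state = PipelineState()
    try:
        run_pipeline(stages, state)
    except RuntimeError:
        pass  # the first attempt fails in the last stage
    run_pipeline(stages, state)  # the retry resumes at "download"
    return state, calls
```

Calling `demo()` runs the first two stages once each and only repeats the stage that failed, which is the recovery property the answer attributes to independently managed stages.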

Key Statistics & Figures

Daily Kafka events processed
Seven trillion
This highlights the scale at which LinkedIn operates and the complexity of data integration required to manage such a volume.
Reduction in independently-maintained connectors
89
DIL has replaced these connectors with generic connectors, simplifying maintenance and enhancing scalability.

Technologies & Tools

Data Integration Framework
Gobblin
Used as the underlying framework for implementing the Data Integration Library.

Key Actionable Insights

1. Implement the Data Integration Library (DIL) to standardize your data integration processes.
   By using DIL, organizations can significantly reduce the complexity of managing multiple data sources and formats, leading to faster integration times and lower maintenance costs.
2. Leverage the multistage architecture to break down complex data ingestion jobs.
   This approach allows teams to manage smaller tasks independently, improving efficiency and making it easier to troubleshoot issues during data integration.
3. Utilize configuration files in HOCON, YAML, or JSON for data integration setups.
   This reduces the need for extensive coding, allowing teams without full engineering resources to effectively manage data integration tasks.
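
To make the configuration-driven idea concrete, here is a small Python sketch in which a JSON document selects generic components instead of a custom connector being written per source. The config keys (`source`, `type`, `delimiter`, `fields`) and the function `build_ingestor` are assumptions made for this example; they are not DIL's actual property names or API.

```python
import csv
import io
import json

# Hypothetical config schema for illustration; DIL's real properties differ.
CSV_CONFIG = json.loads("""
{
  "source": {"type": "csv", "delimiter": ","},
  "fields": {"id": "int", "name": "str"}
}
""")


def parse_csv(text, options):
    """Generic CSV parser; the delimiter comes from configuration."""
    reader = csv.DictReader(io.StringIO(text), delimiter=options.get("delimiter", ","))
    return [dict(row) for row in reader]


def parse_jsonl(text, options):
    """Generic parser for one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Registries of reusable components, looked up by name from the config.
PARSERS = {"csv": parse_csv, "jsonl": parse_jsonl}
CASTS = {"int": int, "float": float, "str": str}


def build_ingestor(config):
    """Assemble an ingest function from generic parts chosen by the config."""
    source = config["source"]
    parse = PARSERS[source["type"]]
    casts = {name: CASTS[kind] for name, kind in config["fields"].items()}

    def ingest(raw_text):
        records = parse(raw_text, source)
        # Keep only the configured fields and cast each to its declared type.
        return [{name: cast(rec[name]) for name, cast in casts.items()} for rec in records]

    return ingest


ingest_csv = build_ingestor(CSV_CONFIG)
records = ingest_csv("id,name\n1,Ada\n2,Lin\n")
```

Onboarding a new source in this sketch means writing a new config, for example one with `"type": "jsonl"`, while the parser and cast components stay shared. That is the maintenance advantage the insight describes.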

Common Pitfalls

1. Relying on a 'bag of connectors' strategy can lead to increased engineering and maintenance costs.
   This approach often results in redundant connectors that do not leverage existing code, making it difficult to adapt to new use cases without significant investment.

Related Concepts

Data Integration Frameworks
Big Data Processing
Apache Gobblin
Multistage Data Ingestion Architectures