Bridging Batch and Streaming Data Ingestion with Gobblin

Shirshanka Das



Overview

The article discusses Gobblin, a unified data ingestion framework developed by LinkedIn, designed to bridge batch and streaming data ingestion. It highlights the challenges faced with disparate pipelines and outlines Gobblin's capabilities, including its integration with Apache Kafka and the transition from Camus to Gobblin.

What You'll Learn

1. How to integrate Apache Kafka as a data source in Gobblin

2. Why a unified ingestion framework is essential for managing batch and streaming data

3. How to leverage Apache YARN and Apache Helix for resource management in Gobblin

Prerequisites & Requirements

Understanding of data ingestion frameworks and big data concepts
Familiarity with Apache Kafka and Hadoop ecosystems (optional)

Key Questions Answered

What are the main features of Gobblin 0.5.0?

Gobblin 0.5.0 includes production-grade integration with Apache Kafka as a data source and support for operational monitoring and metadata integration. This release marks a significant milestone in providing a unified framework for data ingestion across batch and streaming sources.

How does Gobblin compare to Camus in terms of functionality?

Gobblin offers better support for robust hourly compaction, simpler configuration, and uniformity in debugging compared to Camus. It addresses issues related to operability, data integrity, and flexibility, making it a more efficient choice for ingesting Kafka data into Hadoop.
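To make "hourly compaction" concrete, here is a minimal sketch of the idea, not Gobblin's own compaction code. It assumes compaction means grouping ingested records into hourly buckets and keeping only the latest record per key within each bucket. The record fields (`key`, `ts`, `value`) and the function name `compact_hourly` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def compact_hourly(records):
    """Group records into UTC hour buckets and deduplicate by key.

    Each record is a dict with hypothetical fields:
      'key'   - record identity used for deduplication
      'ts'    - event time in epoch seconds
      'value' - payload
    Within a bucket, the record with the latest 'ts' wins for each key.
    """
    buckets = defaultdict(dict)
    for rec in records:
        hour = datetime.fromtimestamp(rec["ts"], tz=timezone.utc).strftime("%Y/%m/%d/%H")
        seen = buckets[hour]
        current = seen.get(rec["key"])
        if current is None or rec["ts"] >= current["ts"]:
            seen[rec["key"]] = rec
    # Sort each bucket by key so the output is deterministic.
    return {
        hour: sorted(by_key.values(), key=lambda r: r["key"])
        for hour, by_key in buckets.items()
    }


records = [
    {"key": "a", "ts": 0, "value": "old"},
    {"key": "a", "ts": 10, "value": "new"},    # same hour, same key: replaces "old"
    {"key": "b", "ts": 3600, "value": "next"}, # falls into the following hour
]
compacted = compact_hourly(records)
```

A production compactor would also have to handle late-arriving data and rerun safely over an hour that was already compacted, which is presumably where the robustness mentioned above comes in.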

What challenges does Gobblin face in continuous ingestion?

Building a single ingestion framework for both batch and streaming data exposes impedance mismatches between the source, the sink, and the execution environment. These mismatches carry an efficiency cost at peak times, so operators must trade off the resources they allocate against the data lag they are willing to tolerate.
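The trade-off between resource allocation and data lag can be illustrated with a small back-of-the-envelope simulation. This is only a sketch: the arrival volumes, the per-container capacity, and the function name `simulate_backlog` are made-up assumptions, not figures or APIs from Gobblin.

```python
def simulate_backlog(arrivals_gb, containers, gb_per_container):
    """Return the unprocessed backlog (in GB) at the end of each interval.

    arrivals_gb      - data arriving in each interval (hypothetical numbers)
    containers       - fixed number of containers allocated to ingestion
    gb_per_container - how much one container can ingest per interval
    """
    capacity = containers * gb_per_container
    backlog = 0.0
    history = []
    for arrived in arrivals_gb:
        # Whatever exceeds capacity carries over as lag into the next interval.
        backlog = max(0.0, backlog + arrived - capacity)
        history.append(backlog)
    return history


# Hypothetical load with a peak in the middle of the window.
arrivals = [40, 40, 100, 120, 40, 40]

# Sized just above average load: cheaper, but lag builds during the peak.
sized_for_average = simulate_backlog(arrivals, containers=7, gb_per_container=10)

# Sized for the peak: no lag, but most capacity sits idle off-peak.
sized_for_peak = simulate_backlog(arrivals, containers=12, gb_per_container=10)

print(sized_for_average)
print(sized_for_peak)
```

With these numbers, the smaller allocation falls behind during the two peak intervals and takes several more intervals to drain the backlog, while the larger one never lags but uses only a third of its capacity most of the time.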

Key Statistics & Figures

Data ingested daily

hundreds of terabytes

Gobblin currently ingests about a thousand Kafka topics, aggregating hundreds of terabytes of data per day.

Number of pipelines previously run

more than 15

LinkedIn was managing over 15 different pipelines for various data sources before adopting Gobblin.

Technologies & Tools


Data Ingestion Framework

Gobblin

Used for unified ingestion of batch and streaming data.

Streaming Platform

Apache Kafka

Integrated as a data source in Gobblin for real-time data ingestion.

Data Storage

Hadoop

Primary storage system for data ingested by Gobblin.

Resource Management

Apache YARN

Used for macro-level container allocation in Gobblin.

Resource Management

Apache Helix

Handles micro-level resource assignment and fault tolerance in Gobblin.

Key Actionable Insights

1. Implementing Gobblin can streamline your data ingestion processes by consolidating batch and streaming pipelines into a single framework.
This is particularly beneficial for organizations dealing with multiple data sources, as it reduces operational complexity and improves data quality management.

2. Utilizing Apache YARN and Apache Helix can enhance resource management in Gobblin, allowing for elastic scaling based on data throughput.
This approach ensures optimal resource utilization during varying load conditions, which is crucial for maintaining performance in a multi-tenant Hadoop environment.
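The split between macro-level container allocation and micro-level task assignment can be sketched in plain Python. This is a conceptual illustration only: it does not call the YARN or Helix APIs, and every name in it (`containers_needed`, `assign_tasks`, `reassign_on_failure`, the throughput figures) is a hypothetical stand-in.

```python
import math


def containers_needed(throughput_mb_s, mb_s_per_container, min_containers=1):
    """Macro level (the role YARN plays): size the container pool to throughput."""
    return max(min_containers, math.ceil(throughput_mb_s / mb_s_per_container))


def assign_tasks(tasks, containers):
    """Micro level (the role Helix plays): spread tasks round-robin over containers."""
    assignment = {c: [] for c in containers}
    for i, task in enumerate(tasks):
        assignment[containers[i % len(containers)]].append(task)
    return assignment


def reassign_on_failure(assignment, failed):
    """Fault tolerance: move a failed container's tasks to the least-loaded survivors."""
    survivors = [c for c in assignment if c != failed]
    if not survivors:
        raise RuntimeError("no surviving containers to take over tasks")
    new_assignment = {c: list(assignment[c]) for c in survivors}
    for task in assignment[failed]:
        target = min(survivors, key=lambda c: len(new_assignment[c]))
        new_assignment[target].append(task)
    return new_assignment


# Hypothetical numbers: 250 MB/s of incoming data, 100 MB/s per container.
n = containers_needed(250, 100)
containers = [f"container-{i}" for i in range(n)]
tasks = [f"topic-partition-{i}" for i in range(6)]

assignment = assign_tasks(tasks, containers)
after_failure = reassign_on_failure(assignment, "container-1")
```

Recomputing `containers_needed` as throughput changes is what makes the pool elastic in this sketch. Keeping sizing separate from assignment means a container failure only triggers a reassignment of its tasks, not a resize of the whole pool.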

Common Pitfalls

1. Failing to address impedance mismatches between data sources and sinks can lead to inefficiencies in data ingestion.
This often results in increased latency and resource wastage, especially during peak loads. It's crucial to design ingestion frameworks that can adapt to varying data flows.

Related Concepts

Data Ingestion Frameworks

Batch vs. Streaming Data Processing

Apache Kafka Integration

Resource Management in Big Data