Overview
The article discusses Gobblin, a unified data ingestion framework developed by LinkedIn, designed to bridge batch and streaming data ingestion. It highlights the challenges faced with disparate pipelines and outlines Gobblin's capabilities, including its integration with Apache Kafka and the transition from Camus to Gobblin.
What You'll Learn
1. How to integrate Apache Kafka as a data source in Gobblin
2. Why a unified ingestion framework is essential for managing batch and streaming data
3. How to leverage Apache YARN and Apache Helix for resource management in Gobblin
Prerequisites & Requirements
- Understanding of data ingestion frameworks and big data concepts
- Familiarity with the Apache Kafka and Hadoop ecosystems (optional)
Key Questions Answered
What are the main features of Gobblin 0.5.0?
Gobblin 0.5.0 includes production-grade integration with Apache Kafka as a data source and support for operational monitoring and metadata integration. This release marks a significant milestone in providing a unified framework for data ingestion across batch and streaming sources.
How does Gobblin compare to Camus in terms of functionality?
Gobblin offers better support for robust hourly compaction, simpler configuration, and uniformity in debugging compared to Camus. It addresses issues related to operability, data integrity, and flexibility, making it a more efficient choice for ingesting Kafka data into Hadoop.
What challenges does Gobblin face in continuous ingestion?
Gobblin encounters impedance mismatches between the source, sink, and execution environment when building a single ingestion framework for both batch and streaming data. This leads to efficiency costs during peak times, necessitating a balance between resource allocation and data lag.
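The trade-off between resource allocation and data lag can be illustrated with a small backlog model. This is a minimal sketch, not Gobblin code: the function name `simulate_backlog`, the hourly inflow figures, and the capacity values are all hypothetical and chosen only to show how a fixed allocation behaves during a peak.

```python
def simulate_backlog(inflow_per_hour, capacity_per_hour):
    """Return the unprocessed backlog at the end of each hour.

    Hypothetical model: units are arbitrary (for example GB), and
    capacity_per_hour stands in for a fixed resource allocation.
    """
    backlog = 0
    history = []
    for inflow in inflow_per_hour:
        # Waiting data plus new arrivals compete for the same fixed capacity.
        backlog = max(0, backlog + inflow - capacity_per_hour)
        history.append(backlog)
    return history


# Hypothetical hourly inflow: quiet, peak, peak, quiet, quiet.
inflow = [50, 200, 200, 50, 50]

# Sized for the average: cheap, but lag builds up during the peak.
print(simulate_backlog(inflow, capacity_per_hour=100))  # [0, 100, 200, 150, 100]

# Sized for the peak: no lag, but half the capacity sits idle off-peak.
print(simulate_backlog(inflow, capacity_per_hour=200))  # [0, 0, 0, 0, 0]
```

Under these assumptions, provisioning for the average leaves data lagging for hours after the peak ends, while provisioning for the peak wastes capacity the rest of the time.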
Key Statistics & Figures
Data ingested daily
hundreds of terabytes
Gobblin currently ingests about a thousand Kafka topics, aggregating hundreds of terabytes of data per day.
Number of pipelines previously run
more than 15
LinkedIn was managing over 15 different pipelines for various data sources before adopting Gobblin.
Technologies & Tools
Data Ingestion Framework
Gobblin
Used for unified ingestion of batch and streaming data.
Streaming Platform
Apache Kafka
Integrated as a data source in Gobblin for real-time data ingestion.
Data Storage
Hadoop
Primary storage system for data ingested by Gobblin.
Resource Management
Apache YARN
Used for macro-level container allocation in Gobblin.
Resource Management
Apache Helix
Handles micro-level resource assignment and fault tolerance in Gobblin.
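The split between macro-level and micro-level resource management can be pictured as two steps: first a number of containers is granted, then individual pieces of work are placed on them. The sketch below illustrates only the second step with a greedy least-loaded placement. It does not use the YARN or Helix APIs, and Helix's actual rebalancing logic may differ; `assign_partitions` and the partition sizes are hypothetical.

```python
import heapq


def assign_partitions(partition_sizes, num_containers):
    """Greedily place each partition on the currently least-loaded container.

    Hypothetical illustration of micro-level assignment; not Helix code.
    """
    # Min-heap of (current_load, container_id).
    heap = [(0, c) for c in range(num_containers)]
    heapq.heapify(heap)
    assignment = {c: [] for c in range(num_containers)}

    # Place the largest partitions first so small ones can even out the load.
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, container = heapq.heappop(heap)
        assignment[container].append(name)
        heapq.heappush(heap, (load + size, container))
    return assignment


sizes = {"a": 5, "b": 3, "c": 2, "d": 2}

# Two containers granted at the macro level.
print(assign_partitions(sizes, 2))  # {0: ['a', 'd'], 1: ['b', 'c']}

# If one container is lost, rerunning the assignment moves all work
# onto the survivor, which is one simple way to model fault tolerance.
print(assign_partitions(sizes, 1))  # {0: ['a', 'b', 'c', 'd']}
```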
Key Actionable Insights
1. Implementing Gobblin can streamline your data ingestion processes by consolidating batch and streaming pipelines into a single framework. This is particularly beneficial for organizations dealing with multiple data sources, as it reduces operational complexity and improves data quality management.
2. Utilizing Apache YARN and Apache Helix can enhance resource management in Gobblin, allowing for elastic scaling based on data throughput. This approach ensures optimal resource utilization during varying load conditions, which is crucial for maintaining performance in a multi-tenant Hadoop environment.
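Elastic scaling based on throughput can be reduced to a simple sizing rule: request enough containers to cover the observed data rate, within fixed bounds. The following is a minimal sketch of that idea, not Gobblin's actual scaling policy. The function `containers_needed`, its parameters, and the default bounds are assumptions made for illustration.

```python
import math


def containers_needed(throughput_mb_s, per_container_mb_s,
                      min_containers=1, max_containers=50):
    """Return how many containers to request for the observed throughput.

    Hypothetical sizing rule: all parameter names and defaults are
    illustrative and not taken from Gobblin, YARN, or Helix.
    """
    if per_container_mb_s <= 0:
        raise ValueError("per_container_mb_s must be positive")
    wanted = math.ceil(throughput_mb_s / per_container_mb_s)
    # Clamp so the job never drops to zero containers or exceeds its quota.
    return max(min_containers, min(max_containers, wanted))


print(containers_needed(250, 100))    # 3
print(containers_needed(0, 100))      # 1
print(containers_needed(10000, 100))  # 50
```

The upper bound matters in a multi-tenant cluster, where one ingestion job should not be able to claim every available container during a spike.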
Common Pitfalls
1. Failing to address impedance mismatches between data sources and sinks can lead to inefficiencies in data ingestion.
This often results in increased latency and resource wastage, especially during peak loads. It's crucial to design ingestion frameworks that can adapt to varying data flows.
Related Concepts
Data Ingestion Frameworks
Batch vs. Streaming Data Processing
Apache Kafka Integration
Resource Management in Big Data