Gobblin Enters Apache Incubation

Abhishek Tiwari
6 min read · advanced

Overview

This article covers Gobblin's entry into the Apache Incubator and traces its evolution as a distributed data integration framework since its inception in 2014. It describes the framework's capabilities, what joining the Apache Software Foundation means for the project, and the community's contributions to its growth.

What You'll Learn

1. How to utilize Gobblin for both streaming and batch data processing
2. Why joining the Apache Software Foundation enhances project sustainability
3. When to implement global throttling in data integration processes

Prerequisites & Requirements

  • Understanding of distributed data integration concepts
  • Familiarity with Apache projects (optional)

Key Questions Answered

What is Gobblin and what are its main features?
Gobblin is a distributed data integration framework that simplifies data ingestion, replication, organization, and lifecycle management for both streaming and batch ecosystems. It has evolved from a basic ingestion framework to a robust system supporting various execution environments and data velocities.
Why did Gobblin enter Apache Incubation?
Gobblin entered Apache Incubation to ensure self-sustenance and durability, allowing the community to nurture it under 'The Apache Way'. This transition aims to enhance its adoption and support across various organizations.
What enhancements has Gobblin gained since its inception?
Since its inception in 2014, Gobblin has added multiple execution modes, support for both stream and batch processing, and global throttling. The project is also working toward a Gobblin-as-a-Service model to unify data management deployments.
How does Gobblin support different execution environments?
Gobblin supports several execution modes, including Embedded, CLI, Standalone, MapReduce, and Cluster, on platforms such as bare metal, AWS, and YARN, which gives flexibility in deployment.

Technologies & Tools

  • Gobblin (Data Integration Framework): Used for simplifying data ingestion, replication, organization, and lifecycle management.
  • Apache Kafka (Stream Processing): Gobblin supports Kafka for data streaming capabilities.
  • AWS (Cloud Platform): Gobblin can run in AWS environments for scalable data processing.

Key Actionable Insights

1. Consider adopting Gobblin for your data integration needs to leverage its robust features for both streaming and batch processing.
   Gobblin's capabilities allow organizations to efficiently manage data ingestion and processing, making it a valuable tool for big data environments.
2. Engage with the Gobblin community to contribute to its development and stay updated on new features and enhancements.
   Community involvement is crucial for the growth of open-source projects, and contributing can provide valuable experience and networking opportunities.
3. Utilize the global throttling feature in Gobblin to manage API quotas effectively during data integration tasks.
   This feature can help prevent overloading resources and ensure smooth operation across distributed systems.
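
The quota-management idea behind throttling can be illustrated with a token bucket. The sketch below is a minimal, single-process example in Python and is not Gobblin's API: the names `TokenBucket`, `FakeClock`, and `try_acquire` are hypothetical, and Gobblin's global throttling coordinates permits across distributed processes, which this sketch does not attempt. It only shows how a caller might be limited to a fixed request rate against an API quota.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: permits bursts of up to `capacity` calls,
    refilled continuously at `rate_per_sec` permits per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock                # injectable so tests are deterministic
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = clock()

    def _refill(self):
        # Add permits earned since the last refill, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now

    def try_acquire(self, permits=1):
        """Consume `permits` and return True if available, else return False."""
        self._refill()
        if self.tokens >= permits:
            self.tokens -= permits
            return True
        return False


class FakeClock:
    """Manually advanced clock, used here instead of real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


# Usage: a quota of 2 calls per second with a burst size of 2.
clock = FakeClock()
bucket = TokenBucket(rate_per_sec=2, capacity=2, clock=clock)
results = [bucket.try_acquire() for _ in range(3)]  # third call exceeds the burst
clock.advance(0.5)                                  # 0.5 s refills one permit
results.append(bucket.try_acquire())
print(results)  # [True, True, False, True]
```

A caller that receives `False` would typically wait or back off before retrying. In a distributed setting, the permit count would have to live in a shared service rather than in each process, which is the part a global throttler adds.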

Common Pitfalls

1. Neglecting community involvement can hinder the growth and sustainability of open-source projects like Gobblin.
   Active participation helps in receiving feedback, improving the project, and fostering a supportive ecosystem.

Related Concepts

Distributed Data Integration
Apache Software Foundation
Data Processing Frameworks
Community-driven Development