FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format

Zihan Li
15 min read · intermediate

Overview

The article introduces FastIngest, an evolution of Apache Gobblin that enables low-latency data ingestion from Kafka to HDFS, writing the ORC file format directly and using Apache Iceberg for metadata management. It reports a reduction in ingestion latency from 45 minutes to 5 minutes and discusses the challenges of moving from batch to streaming ingestion.

What You'll Learn

1. How to implement a low-latency data ingestion pipeline using FastIngest
2. Why Apache Iceberg is beneficial for managing metadata in fast-moving data environments
3. How to optimize data ingestion performance by using ORC format directly

Prerequisites & Requirements

  • Understanding of data ingestion frameworks and Apache Gobblin
  • Familiarity with Kafka and HDFS

Key Questions Answered

How does FastIngest improve data ingestion speed?
FastIngest reduces data ingestion latency from 45 minutes to 5 minutes by implementing a streaming-based pipeline that continuously writes to HDFS. This approach allows for more efficient resource usage and faster data availability for downstream processing.
What are the challenges of migrating from batch to streaming ingestion?
Migrating from batch to streaming ingestion introduces challenges such as estimating work sizes dynamically, managing continuous data publishing, and ensuring schema changes are handled effectively. These challenges require robust monitoring and a well-defined architecture to maintain performance.
Why is the ORC file format chosen for data ingestion?
ORC is chosen for its I/O efficiency and its support for predicate pushdown, both of which improve query performance in engines such as Spark and Presto. Writing ORC directly at ingestion time also removes the need for a separate Avro-to-ORC conversion pipeline.
What role does Apache Iceberg play in FastIngest?
Apache Iceberg serves as the metadata catalog for FastIngest, providing snapshot isolation and enabling incremental data processing. This allows for better management of fast-moving data while ensuring data consistency and performance.
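
The snapshot isolation and incremental processing described in the last answer can be illustrated with a small in-memory model. This is only a sketch of the idea, not the Iceberg API: `ToyTable`, `Snapshot`, `append_files`, and `files_added_since` are hypothetical names invented for the example, and the file names are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: every data file visible at commit time."""
    snapshot_id: int
    files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical in-memory table used for illustration; not the Iceberg API."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None

    def append_files(self, new_files: List[str]) -> Snapshot:
        # Each commit creates a new snapshot; earlier snapshots are never mutated.
        base = self.current().files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, base + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def files_added_since(self, snapshot_id: int) -> List[str]:
        # Incremental read: only files committed after the given snapshot.
        # Snapshot ids start at 1; 0 means "from the beginning".
        if not self.snapshots:
            return []
        seen = set(self.snapshots[snapshot_id - 1].files) if snapshot_id > 0 else set()
        return [f for f in self.current().files if f not in seen]


table = ToyTable()
s1 = table.append_files(["part-0001.orc"])
reader_view = table.current()  # a reader pins the snapshot it started with
table.append_files(["part-0002.orc", "part-0003.orc"])

print(reader_view.files)                        # unchanged by the later commit
print(table.files_added_since(s1.snapshot_id))  # only the newly committed files
```

A reader that pinned the first snapshot keeps a consistent view while ingestion continues, and a downstream job can ask only for the files committed since the snapshot it last processed.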

Key Statistics & Figures

Reduction in ingestion latency
From 45 minutes to 5 minutes
This significant decrease in latency is achieved through the implementation of the FastIngest pipeline.

Technologies & Tools


Data Ingestion Framework
Apache Gobblin
Used for building the FastIngest pipeline to facilitate low-latency data ingestion.
Metadata Management
Apache Iceberg
Used for managing metadata and ensuring data consistency in fast-moving data environments.
File Format
ORC
Chosen for its efficiency in data storage and query performance.
Message Broker
Kafka
Serves as the source of data events for ingestion into HDFS.
Storage
HDFS
Destination for ingested data from Kafka.

Key Actionable Insights

1. Implementing a streaming data ingestion pipeline can significantly reduce latency and improve data availability.
Organizations should consider moving from batch processing to streaming models, especially for time-sensitive data. The shift can improve operational efficiency and responsiveness to data changes.
2. Using the ORC format directly for data ingestion can streamline processes and reduce operational overhead.
By avoiding a conversion step from Avro to ORC, teams save time and resources, which leads to faster data processing and better analytics performance.
3. Employing Apache Iceberg for metadata management can improve data integrity and facilitate incremental processing.
Iceberg lets teams manage fast-moving data more effectively, so that data consumers can access the latest committed data with minimal delay.
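
The query-side benefit of ORC behind the second insight comes largely from predicate pushdown: ORC stores min/max statistics per stripe, so a reader can skip stripes that cannot match a filter. The sketch below models that idea in plain Python under simplifying assumptions. `Stripe`, `make_stripe`, and `scan_greater_than` are hypothetical names, and real ORC readers work on encoded column data rather than Python lists.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Hypothetical stand-in for one ORC stripe with min/max stats on one column."""
    min_ts: int
    max_ts: int
    rows: List[int]  # values of the timestamp column stored in this stripe


def make_stripe(rows: List[int]) -> Stripe:
    # Statistics are computed once at write time and stored next to the data.
    return Stripe(min(rows), max(rows), rows)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return (matching values, number of stripes actually read) for ts > threshold."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # Pushdown: if even the stripe's max fails the predicate, skip it unread.
        if stripe.max_ts <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


stripes = [make_stripe([1, 5, 9]), make_stripe([10, 14, 18]), make_stripe([20, 25, 30])]
matches, stripes_read = scan_greater_than(stripes, 15)
print(matches, stripes_read)  # the first stripe is skipped without reading its rows
```

A row-oriented format without such statistics would have to read every record to evaluate the same filter, which is the main reason query engines favor ORC for this kind of scan.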

Common Pitfalls

1. Failing to dynamically adjust work sizes for varying data volumes can lead to inefficiencies.
Without a mechanism to rebalance workloads, ingestion tasks may become bottlenecks, causing delays in data availability. Implementing a replanning strategy can help mitigate this issue.
2. Overloading the metadata management system with too many concurrent commits can introduce latency.
Asynchronous metadata publishing is crucial to avoid conflicts and reduce the overhead associated with frequent metadata updates, ensuring smoother operations.
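
One way to picture the asynchronous publishing in the second pitfall is to decouple writers from commits: writers only enqueue "file written" events, and a separate publisher drains the queue and issues one metadata commit per batch. The sketch below is an assumption about how this could be structured, not FastIngest's actual design. `ToyCatalog` and `MetadataPublisher` are hypothetical, and `publish_once` is called by hand here where a real system would presumably run it on a timer in its own thread.

```python
import queue
from typing import List


class ToyCatalog:
    """Hypothetical metadata store that records commits; not a real Iceberg catalog."""

    def __init__(self) -> None:
        self.commits: List[List[str]] = []

    def commit(self, files: List[str]) -> None:
        self.commits.append(list(files))


class MetadataPublisher:
    """Collects file-written events and publishes them as one commit per drain."""

    def __init__(self, catalog: ToyCatalog) -> None:
        self.catalog = catalog
        self.pending = queue.Queue()

    def file_written(self, path: str) -> None:
        # Writers only enqueue; they never wait on a metadata commit.
        self.pending.put(path)

    def publish_once(self) -> int:
        # Drain everything queued so far and commit it as a single batch.
        batch: List[str] = []
        while True:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        if batch:
            self.catalog.commit(batch)
        return len(batch)


catalog = ToyCatalog()
publisher = MetadataPublisher(catalog)
for i in range(6):
    publisher.file_written(f"part-{i:04d}.orc")  # six writer events, zero commits so far

published = publisher.publish_once()
print(published, len(catalog.commits))  # six files land in a single commit
```

Batching this way trades a little publishing delay for far fewer commits, which reduces the chance that concurrent writers conflict on the same table metadata.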

Related Concepts

Data Ingestion Frameworks
Streaming Data Processing
Metadata Management with Apache Iceberg
Performance Optimization in Data Pipelines