Incremental Processing using Netflix Maestro and Apache Iceberg

Netflix Technology Blog
19 min read · advanced

Overview

The article discusses the implementation of incremental processing at Netflix using Netflix Maestro and Apache Iceberg. It highlights the advantages of processing only new or changed data to enhance data freshness, accuracy, and backfill capabilities while reducing compute costs and execution time.

What You'll Learn

1. How to implement incremental processing using Netflix Maestro and Apache Iceberg
2. Why incremental processing improves data freshness and accuracy
3. When to use the incremental change capture design for efficient data workflows

Prerequisites & Requirements

  • Understanding of data processing workflows and ETL concepts
  • Familiarity with Netflix Maestro and Apache Iceberg (optional)

Key Questions Answered

What challenges does Netflix face with data processing workflows?
Netflix faces three main challenges: data freshness, data accuracy in the presence of late-arriving data, and backfilling datasets. These make processing large datasets inefficient, which the incremental processing solution aims to address.
How does incremental processing enhance data workflows?
Incremental processing enhances data workflows by allowing only new or changed data to be processed, which significantly reduces compute costs and execution time. This approach also improves data accuracy and enables efficient backfilling of datasets.
What is the role of Apache Iceberg in incremental processing?
Apache Iceberg provides a high-performance table format that supports features like schema evolution and time travel, which are leveraged in the incremental processing solution to efficiently capture changes without duplicating data.
What are the main advantages of using Netflix Maestro for incremental processing?
Netflix Maestro enables seamless orchestration of workflows while integrating incremental processing features. This allows users to adopt incremental processing with minimal changes to existing workflows, enhancing productivity and reducing complexity.
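
To make the idea of capturing changes without duplicating data more concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's or Maestro's actual API: `ToyTable`, `Snapshot`, and `files_added_after` are hypothetical names introduced for illustration. The assumption is that every commit produces a snapshot listing the files it added, so a workflow that remembers the last snapshot it processed can pick up only references to newer files, including late-arriving data for older partitions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Toy stand-in for a table snapshot: an id plus the files it added."""
    snapshot_id: int
    added_files: tuple[str, ...]


@dataclass
class ToyTable:
    """Append-only table that records one snapshot per commit (hypothetical)."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def append(self, *files: str) -> int:
        """Commit new files and return the id of the snapshot created."""
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(snapshot_id, tuple(files)))
        return snapshot_id

    def files_added_after(self, last_processed_id: int) -> list[str]:
        """Return references to files committed after the given snapshot."""
        return [
            f
            for snap in self.snapshots
            if snap.snapshot_id > last_processed_id
            for f in snap.added_files
        ]


table = ToyTable()
table.append("day=01/a.parquet", "day=01/b.parquet")
checkpoint = table.append("day=02/a.parquet")  # last snapshot already processed
table.append("day=01/late.parquet")            # late-arriving data, older partition
table.append("day=03/a.parquet")

changed = table.files_added_after(checkpoint)
print(changed)  # ['day=01/late.parquet', 'day=03/a.parquet']
```

The change set holds only file references, so no data is copied, and the late file for `day=01` is picked up without rescanning that whole partition.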

Key Statistics & Figures

Cost reduction in data processing: > 80%
The new pipeline using incremental processing showed over 80% cost reduction compared to the traditional lookback window approach.
Execution time for first stage workflow: 30 minutes
The new incremental processing pipeline reduced the execution time for the first stage workflow to about 30 minutes.
Execution time for second stage workflow: 15 minutes
The second stage workflow takes about 15 minutes to process change data using the new pipeline.
Resource usage for new pipeline: 10%
The new incremental processing pipeline requires only around 10% of the resources used by the original pipeline.

Technologies & Tools

Workflow Orchestration
Netflix Maestro
Used for managing and orchestrating data workflows at Netflix.
Data Management
Apache Iceberg
Provides a high-performance format for managing large analytic tables.

Key Actionable Insights

1. Adopt incremental processing to optimize your data workflows and reduce costs.
   By implementing incremental processing, organizations can minimize the amount of data that needs to be reprocessed, leading to significant savings in compute resources and time.
2. Utilize Apache Iceberg's features to enhance data management capabilities.
   Leveraging Iceberg's capabilities such as time travel and schema evolution can simplify the management of large datasets and improve data accuracy.
3. Integrate Maestro with existing workflows to streamline data processing.
   Using Maestro for orchestration can help automate and manage complex workflows, making it easier to implement incremental processing without extensive rework.

Common Pitfalls

1. Relying solely on lookback windows to handle late-arriving data.
   This approach often reprocesses large datasets unnecessarily, increasing cost and execution time. Prefer incremental processing, which touches only the changed data.
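
The trade-off can be illustrated with a small back-of-the-envelope sketch. All numbers below are made up for illustration and are not Netflix's figures; `lookback_scan`, `missed_by_lookback`, and `incremental_scan` are hypothetical helper names. The sketch assumes a lookback job rescans every partition inside its window, while an incremental job reads only the batches that landed since the last run.

```python
from __future__ import annotations

# Hypothetical daily partitions (day number -> total rows), after new data landed.
partition_rows = {1: 1010, 2: 1050, 3: 1000, 4: 200}

# Hypothetical batches committed since the last run: (event day, rows).
# Day 4 is today; the rows for days 2 and 1 arrived late.
new_batches = [(4, 200), (2, 50), (1, 10)]


def lookback_scan(partitions: dict[int, int], today: int, window_days: int) -> int:
    """Rows read when every partition inside the window is reprocessed."""
    return sum(
        rows for day, rows in partitions.items()
        if today - window_days < day <= today
    )


def missed_by_lookback(
    batches: list[tuple[int, int]], today: int, window_days: int
) -> list[tuple[int, int]]:
    """Late batches whose event day falls outside the lookback window."""
    return [(day, rows) for day, rows in batches if day <= today - window_days]


def incremental_scan(batches: list[tuple[int, int]]) -> int:
    """Rows read when only newly committed batches are processed."""
    return sum(rows for _day, rows in batches)


today, window_days = 4, 3
print(lookback_scan(partition_rows, today, window_days))   # 2250
print(missed_by_lookback(new_batches, today, window_days)) # [(1, 10)]
print(incremental_scan(new_batches))                       # 260
```

In this toy setup the lookback job reads 2250 rows and still misses the late batch for day 1, whereas the incremental job reads 260 rows and covers every batch. Widening the window would catch the late batch, but only by reprocessing more unchanged data.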

Related Concepts

Data Processing Workflows
ETL Processes
Incremental Processing Techniques
Data Freshness and Accuracy