Psyberg: Automated end to end catch up

Netflix Technology Blog

Netflix


Scala

Overview

This article covers Psyberg, a tool Netflix developed to automate end-to-end catchup of data pipelines, with a focus on how it handles late-arriving data. It outlines the architecture and processing modes, and reports the improvements in resource utilization and data accuracy that followed its adoption.

What You'll Learn

1

How to automate data pipeline catchup using Psyberg

2

Why late-arriving data management is crucial in ETL processes

3

How to implement both stateless and stateful processing with Psyberg

Prerequisites & Requirements

Understanding of ETL processes and data pipelines
Familiarity with Apache Iceberg and Spark (optional)

Key Questions Answered

How does Psyberg automate the catchup of late-arriving data?

Psyberg automates catchup by utilizing a workflow that identifies new events since the last high watermark and processes late data efficiently without reprocessing already handled data. This is achieved through a combination of stateless and stateful processing modes that ensure minimal change footprint and accurate data updates.
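The high-watermark idea in this answer can be sketched in a few lines. This is only an illustration, not Psyberg's API: the `Event` fields, `new_events_since`, and `affected_hours` are hypothetical names, and the watermark is assumed to be a simple integer arrival timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All fields are hypothetical; the source does not describe a schema.
    event_id: str
    event_hour: int  # hour in which the event actually occurred
    landed_at: int   # arrival time in the source table


def new_events_since(events, high_watermark):
    """Keep only events that arrived after the last processed watermark."""
    return [e for e in events if e.landed_at > high_watermark]


def affected_hours(events):
    """Event hours touched by the new arrivals, including late ones."""
    return sorted({e.event_hour for e in events})


events = [
    Event("a", event_hour=1, landed_at=100),  # already processed
    Event("b", event_hour=2, landed_at=105),  # new, on time
    Event("c", event_hour=1, landed_at=110),  # new, late for hour 1
]
fresh = new_events_since(events, high_watermark=100)
hours_to_update = affected_hours(fresh)
```

Filtering on arrival time rather than event time is what lets the late event `c` be picked up without rereading event `a`, which was handled in an earlier run.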

What improvements has Psyberg brought to Netflix's data workflows?

Netflix reports a 90% drop in Spark core usage, 30 tables and 13 workflows onboarded into incremental processing, and zero manual catchups or missing-data incidents since Psyberg was implemented. Together these results reflect gains in both reliability and performance.

What are the key components of the Psyberg workflow?

The Psyberg workflow consists of three main components: initialization, where it identifies new events; a Write-Audit-Publish process for applying ETL logic and quality checks; and a commit phase that updates the high watermark for future processing. This structured approach ensures consistent and efficient data handling.
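The three phases described above can be sketched as a single function. This is a minimal sketch under stated assumptions: `run_once`, the `state` dictionary, and the `transform` and `audit` callables are hypothetical stand-ins, and an in-memory list stands in for the target table.

```python
def run_once(source_rows, state, transform, audit):
    """One pipeline run: initialize, write-audit-publish, then commit."""
    # 1. Initialization: find rows that arrived after the last watermark.
    hwm = state["high_watermark"]
    new_rows = [r for r in source_rows if r["landed_at"] > hwm]
    if not new_rows:
        return []

    # 2. Write-Audit-Publish: stage the output, check it, then expose it.
    staged = [transform(r) for r in new_rows]       # write
    if not audit(staged):                           # audit
        raise ValueError("audit failed; nothing published")
    state["published"].extend(staged)               # publish

    # 3. Commit: advance the watermark only after a successful publish.
    state["high_watermark"] = max(r["landed_at"] for r in new_rows)
    return staged
```

Because the watermark moves only in the last step, a failed audit leaves the state untouched and the same rows are picked up again on the next run.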

Key Statistics & Figures

Reduction in Spark core usage

90%

Observed after adopting Psyberg, relative to the earlier approach of processing fixed lookback windows.

Number of tables onboarded into incremental processing

30

These tables have been onboarded since Psyberg was implemented.

Number of workflows onboarded into incremental processing

13

These workflows have moved to incremental processing since Psyberg was implemented.

Technologies & Tools

Database

Apache Iceberg

Used for managing large datasets in a way that supports incremental processing.

Backend

Spark

Utilized for executing ETL processes and managing data transformations.

Key Actionable Insights

1
Implementing Psyberg can drastically reduce the complexity of managing late-arriving data in ETL workflows.
By automating the catchup process, data engineers can focus on more critical tasks, thereby improving overall productivity and workflow efficiency.

2
Utilizing both stateless and stateful processing modes allows for flexible data handling tailored to specific pipeline requirements.
This flexibility can lead to better resource management and improved performance metrics, as seen in the significant reduction of Spark core usage.

3
Regularly auditing and monitoring the data processing stages can help maintain data integrity and reliability.
Incorporating thorough quality checks during the Write-Audit-Publish process ensures that only accurate data is published, preventing issues downstream.
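The summary does not spell out how the two processing modes mentioned in insight 2 differ. The sketch below illustrates the general distinction rather than Psyberg's own implementation: in the stateless case each output row depends only on its own input event, while in the stateful case a new event forces a recomputation for its key from that key's full history. The function names and the `account` and `seconds` fields are hypothetical.

```python
from collections import defaultdict


def stateless_update(new_events):
    """Each output row is derived from one input event alone."""
    return [
        {"account": e["account"], "minutes": e["seconds"] // 60}
        for e in new_events
    ]


def stateful_update(all_events, new_events):
    """Recompute totals only for accounts touched by new events."""
    touched = {e["account"] for e in new_events}
    totals = defaultdict(int)
    for e in all_events:
        if e["account"] in touched:
            totals[e["account"]] += e["seconds"]
    return dict(totals)


history = [
    {"account": "a", "seconds": 60},
    {"account": "b", "seconds": 120},
    {"account": "a", "seconds": 120},  # the newly arrived event
]
new = [history[-1]]
rows = stateless_update(new)
totals = stateful_update(history, new)
```

In both cases only data related to the new arrivals is rewritten, which is one way to read the "minimal change footprint" the article mentions: account `b` is never touched.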

Common Pitfalls

1

Failing to properly configure the Psyberg initialization can lead to missed data events.

It's crucial to ensure that the input source tables and processing modes are accurately defined to avoid gaps in data processing.

2

Neglecting to conduct thorough audits during the Write-Audit-Publish process may result in publishing inaccurate data.

Quality checks are essential to maintain data integrity, and skipping these can lead to downstream issues that are costly to rectify.
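As an illustration of the second pitfall, an audit step can be as simple as a function that returns a list of failures and blocks publishing when the list is non-empty. The specific checks here (empty batch, missing key, duplicate key) are assumed examples; the source does not list which checks Netflix runs.

```python
def audit(staged_rows, key="id"):
    """Return failure messages; an empty list means the batch may be published."""
    failures = []
    if not staged_rows:
        failures.append("batch is empty")
    keys = [r.get(key) for r in staged_rows]
    if any(k is None for k in keys):
        failures.append(f"missing {key}")
    if len(set(keys)) != len(keys):
        failures.append(f"duplicate {key}")
    return failures
```

Returning messages instead of a bare boolean makes it easier to see why a batch was held back before it reaches downstream consumers.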

Related Concepts

Incremental Processing

ETL Workflows

Data Pipeline Architecture

Stateful vs. Stateless Processing