Diving Deeper into Psyberg: Stateless vs Stateful Data Processing

Netflix Technology Blog

Overview

The article explores the Psyberg framework used by Netflix's Membership and Finance data engineering team, focusing on its two primary operational modes: stateless and stateful data processing. It details the initialization phases for both modes, the Write Audit Publish (WAP) process, and the importance of audits and high watermark commits in ensuring data integrity.

What You'll Learn

1. How to differentiate between stateless and stateful data processing in data pipelines
2. Why understanding event sequences is crucial for accurate data analysis
3. How to configure Psyberg for different data processing patterns
4. When to apply the Write Audit Publish (WAP) process in ETL workflows

Prerequisites & Requirements

  • Understanding of data processing concepts and ETL workflows
  • Familiarity with Iceberg tables and Spark (optional)

Key Questions Answered

What is the difference between stateless and stateful data processing?
Stateless data processing involves scenarios where the output does not depend on the order of incoming events, allowing for independent processing of each event. In contrast, stateful data processing requires tracking the sequence of events across multiple streams, as the output is contingent on the order and relationship of these events.
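
The distinction can be illustrated with a minimal sketch. The event tuples and both function names below are hypothetical and are not part of Psyberg; the point is only that a count gives the same answer for any event order, while a derived account status does not.

```python
from collections import Counter

# Hypothetical events: (account_id, event_type, event_time)
events = [
    ("a1", "signup", 1),
    ("a1", "cancel", 2),
    ("a1", "signup", 3),
]


def count_signups(evts):
    """Stateless: each event is handled on its own, so order is irrelevant."""
    return Counter(event_type for _, event_type, _ in evts)["signup"]


def latest_status(evts):
    """Stateful: the result is the state left after replaying events in the order given."""
    status = {}
    for account_id, event_type, _ in evts:
        status[account_id] = "active" if event_type == "signup" else "cancelled"
    return status


in_order = events
out_of_order = [events[2], events[0], events[1]]

# The stateless count is identical for both orderings.
same_count = count_signups(in_order) == count_signups(out_of_order)

# The stateful result changes when the cancel event is replayed last.
status_in_order = latest_status(in_order)
status_out_of_order = latest_status(out_of_order)
```

Here the out-of-order replay ends on the cancel event, so the derived status differs even though the set of events is the same.
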
How does Psyberg handle late-arriving data in stateful processing?
In stateful processing, late-arriving data is managed by overwriting previously processed data to ensure that all events are accounted for. This is critical for maintaining the accuracy of derived states, such as customer account lifecycles, where missing events can lead to incorrect analyses.
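
One way to picture the overwrite is as a range computation: find the earliest event hour touched by newly landed data and recompute everything from there to the current hour. The sketch below assumes hours are plain integers and that this is how the range is chosen; the function name is hypothetical and this is not Psyberg's actual logic.

```python
def hours_to_overwrite(new_event_hours, current_hour):
    """Return the inclusive list of hours to recompute (illustrative assumption).

    new_event_hours: event hours of records that landed since the last run,
                     possibly including late arrivals for older hours.
    current_hour:    the hour being processed now.
    """
    if not new_event_hours:
        return []  # nothing new landed, nothing to overwrite
    earliest = min(new_event_hours)
    return list(range(earliest, current_hour + 1))


# A late event for hour 2 arrives while hour 6 is being processed,
# so hours 2 through 6 are recomputed.
affected = hours_to_overwrite([5, 2, 6], current_hour=6)
```

Recomputing the whole range, rather than only the late hour, reflects that later derived states may depend on the event that arrived late.
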
What parameters are used during the initialization phase of Psyberg?
During the initialization phase, Psyberg uses parameters such as process name, source tables, session ID, high watermark table, and session metadata table to determine the data range for processing. These parameters help configure the pipeline according to the specific needs of either stateless or stateful processing.
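
For orientation, the parameters listed above could be grouped as follows. This is a hypothetical container, not Psyberg's interface; the class name, field names, and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InitParams:
    """Hypothetical grouping of the initialization inputs named in the text."""
    process_name: str
    source_tables: List[str]
    session_id: str
    high_watermark_table: str
    session_metadata_table: str


# All values below are made up for the example.
params = InitParams(
    process_name="signup_facts",
    source_tables=["raw.signup_events"],
    session_id="session-001",
    high_watermark_table="meta.high_watermark",
    session_metadata_table="meta.session_metadata",
)
```
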
What is the purpose of the Write Audit Publish (WAP) process?
The Write Audit Publish (WAP) process is used to validate writes to the uncommitted Iceberg snapshot before publishing to the target table. It ensures data integrity by checking for consistency and completeness of the data being processed.
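
The write, audit, and publish steps can be sketched in plain Python. The in-memory `Table` class below is a stand-in for an Iceberg table with an uncommitted snapshot; it does not use the Iceberg or Spark APIs, and every name in it is hypothetical.

```python
class Table:
    """Toy stand-in for a table with a staged (uncommitted) snapshot."""

    def __init__(self):
        self.published = []  # rows visible to readers
        self.staged = None   # rows written but not yet published

    def write(self, rows):
        self.staged = list(rows)

    def audit(self, checks):
        # Every check must pass on the staged rows.
        return all(check(self.staged) for check in checks)

    def publish(self):
        self.published = self.staged
        self.staged = None


def write_audit_publish(table, rows, checks):
    """Write to the staged snapshot, audit it, and publish only if audits pass."""
    table.write(rows)
    if not table.audit(checks):
        table.staged = None  # discard the staged snapshot
        return False
    table.publish()
    return True


# Example audits (assumptions): the batch is non-empty and has no null account ids.
checks = [
    lambda rows: len(rows) > 0,
    lambda rows: all(r.get("account_id") is not None for r in rows),
]

table = Table()
ok = write_audit_publish(table, [{"account_id": "a1"}], checks)
rejected = write_audit_publish(table, [{"account_id": None}], checks)
```

In this sketch a failed audit leaves the previously published rows untouched. A caller would presumably advance its high watermark only when the function returns `True`, though that ordering is an assumption here.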

Technologies & Tools


Psyberg (framework): Used for managing data processing workflows at Netflix.
Iceberg (table format): Utilized for storing raw signup events and managing snapshots.
Spark (processing engine): Used for processing data in the Psyberg framework.

Key Actionable Insights

1. Implementing the correct data processing pattern is crucial for accurate analytics. Choose stateless processing when event order does not matter, and stateful processing when it does.
   Understanding the nature of your data and its processing requirements can significantly enhance the accuracy of your analytics, leading to better business decisions.
2. Utilizing the Write Audit Publish (WAP) process can prevent data integrity issues in your ETL workflows.
   By validating writes before they are committed, you can ensure that only accurate and complete data is published, which is essential for maintaining trust in your data pipelines.
3. Regular audits on uncommitted Iceberg snapshots can help identify discrepancies early in the data processing lifecycle.
   Conducting audits can save time and resources by catching issues before they propagate to final datasets, ensuring higher data quality.

Common Pitfalls

1. Failing to correctly identify whether to use stateless or stateful processing can lead to inaccurate data analysis.
   Choosing the wrong processing pattern may result in missing critical insights or deriving incorrect conclusions from the data.
2. Neglecting to handle late-arriving data appropriately in stateful processing can compromise data integrity.
   If late events are not accounted for, the derived states may be incorrect, leading to flawed business decisions based on inaccurate data.

Related Concepts

ETL Processes
Data Integrity
Data Pipeline Management
Event-driven Architecture