Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and Efficiency

Netflix Technology Blog

Netflix

Overview

The article discusses Netflix's journey in building and scaling a comprehensive data lineage system to enhance data infrastructure reliability and efficiency. It highlights the complexities of Netflix's data landscape, the design goals for the lineage system, and the challenges faced during its development.

What You'll Learn

1. How to design a scalable data lineage system for complex data environments
2. Why data integrity is crucial for building trust in data lineage systems
3. How to integrate disparate data sources for lineage tracking

Prerequisites & Requirements

Understanding of data lineage concepts and challenges
Familiarity with data ingestion tools like Kafka and ETL processes (optional)

Key Questions Answered

What are the design goals for Netflix's data lineage system?

The design goals include ensuring data integrity, enabling seamless integration with various data tools, and creating a flexible data model that represents a wide range of data artifacts and relationships. These principles guide the architecture and development of the lineage system.
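One way to picture a flexible data model of this kind is a graph of typed entities connected by typed relationships. The sketch below is only an illustration under that assumption: the class names, the type labels ("job", "table"), the relation labels ("reads", "writes"), and the example entity names are all hypothetical and are not taken from Netflix's system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node in the lineage graph: a table, a job, a report, and so on."""
    entity_type: str   # hypothetical type labels, e.g. "table" or "job"
    name: str


@dataclass(frozen=True)
class Relationship:
    """A directed, typed edge between two entities."""
    source: Entity
    target: Entity
    relation: str      # hypothetical labels, e.g. "reads" or "writes"


@dataclass
class LineageGraph:
    entities: set = field(default_factory=set)
    relationships: set = field(default_factory=set)

    def add(self, source, target, relation):
        # Register both endpoints so an edge never points at an unknown entity.
        self.entities.update((source, target))
        self.relationships.add(Relationship(source, target, relation))

    def outgoing(self, entity):
        # Sorted (relation, target name) pairs for edges leaving `entity`.
        return sorted(
            (r.relation, r.target.name)
            for r in self.relationships
            if r.source == entity
        )


graph = LineageGraph()
job = Entity("job", "daily_playback_agg")
src = Entity("table", "playback_events")
dst = Entity("table", "playback_daily")
graph.add(job, src, "reads")
graph.add(job, dst, "writes")
```

Keeping entity and relationship types as plain labels rather than fixed classes is what lets a model like this absorb new artifact kinds without schema changes, at the cost of weaker validation.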

How does Netflix handle data ingestion for lineage tracking?

Netflix relies mainly on a pull model for data ingestion, scanning system logs and metadata produced by its various compute engines. It complements this with a push model in which platform tools publish lineage events to Kafka topics, which lets ingestion scale as more tools are onboarded.
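The push side of this design can be sketched as a tool serializing a lineage event and handing it to a topic. The example below is a minimal illustration, not Netflix's event format: the field names, the `build_lineage_event` helper, and the `InMemoryTopic` class (a stand-in so the code runs without a Kafka broker or client library) are all assumptions.

```python
import json
from datetime import datetime, timezone


def build_lineage_event(job_name, inputs, outputs, emitted_at=None):
    """Serialize one lineage event as JSON bytes, ready for a message queue."""
    emitted_at = emitted_at or datetime.now(timezone.utc)
    event = {
        "job": job_name,
        "inputs": sorted(inputs),
        "outputs": sorted(outputs),
        "emitted_at": emitted_at.isoformat(),
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")


class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic; it only collects messages."""

    def __init__(self):
        self.messages = []

    def publish(self, key, value):
        self.messages.append((key, value))


topic = InMemoryTopic()
payload = build_lineage_event(
    "daily_playback_agg",
    inputs=["playback_events"],
    outputs=["playback_daily"],
    emitted_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
topic.publish(key="daily_playback_agg", value=payload)
```

In a real deployment the `publish` call would be replaced by a Kafka producer's send method; keying messages by job name is one plausible choice for keeping a job's events ordered within a partition.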

What challenges did Netflix face in building their data lineage system?

Netflix faced challenges related to capturing diverse data accurately, conforming it to a common data model, and managing operational complexity due to varied ingestion patterns. Continuous changes in the Spark compute engine also introduced unexpected issues.

What are the current use cases for Netflix's data lineage data?

Current use cases include powering a data lineage service for SLA alerts and enhancing data efficiency by providing visibility on entities and their relationships. This helps in managing data lifecycle and identifying potential SLA misses.
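To illustrate how lineage data could support SLA alerting, the sketch below walks a dependency map to find everything downstream of a delayed dataset. It assumes lineage is available as a simple producer-to-consumers dictionary; the `downstream_of` function and the dataset names are hypothetical, and the real service is certainly more involved.

```python
from collections import deque


def downstream_of(edges, start):
    """Return every entity reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical producer -> consumers map derived from lineage data.
edges = {
    "playback_events": ["playback_daily"],
    "playback_daily": ["engagement_report", "retention_model"],
    "engagement_report": [],
}

# If "playback_events" lands late, these entities are at risk of missing their SLA.
at_risk = downstream_of(edges, "playback_events")
```

The `seen` set guards against revisiting nodes, so the traversal terminates even if the lineage graph contains a cycle.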

Technologies & Tools

Data Streaming

Kafka

Used for publishing lineage events to facilitate scalable data ingestion.

Cloud Storage

AWS S3

Its access logs are used to improve the accuracy of lineage data.

Workflow Orchestration

Meson

Source of the workflow definitions and runtime metadata that are extracted to build lineage data.

Key Actionable Insights

1. Implement a robust data lineage system to enhance data reliability and decision-making.
   By mapping out data dependencies, teams can make informed decisions and address issues proactively, which improves operational efficiency.

2. Focus on data integrity to build trust in your lineage data.
   Accurate representation of data relationships is essential; without trust, lineage data can mislead users and hinder effective decision-making.

3. Use both push and pull models for data ingestion to maximize scalability.
   Combining the two models gives flexibility in capturing lineage data from varied sources, which is what a data landscape as diverse as Netflix's requires.

Common Pitfalls

1. Failing to accurately capture and conform data from disparate sources can lead to unreliable lineage information.

This often occurs due to the complexity of integrating various data tools and platforms. To avoid this, ensure a robust conformance process is in place to standardize data representation.
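A conformance step of this kind can be sketched as a function that maps each source's record shape onto one shared shape and normalizes identifiers along the way. Everything here is an assumption made for illustration: the source labels, the per-source field names, and the choice to lower-case and trim names (which would be wrong for case-sensitive catalogs).

```python
def conform(record, source):
    """Map a source-specific record onto a shared job/inputs/outputs shape."""
    if source == "spark_log":
        # Hypothetical field names for a record scraped from engine logs.
        job = record["app_name"]
        inputs, outputs = record["read_tables"], record["written_tables"]
    elif source == "pushed_event":
        # Hypothetical field names for an event published by a platform tool.
        job = record["job"]
        inputs, outputs = record["inputs"], record["outputs"]
    else:
        raise ValueError(f"unknown lineage source: {source}")

    def normalize(name):
        # Standardize identifiers so the same table matches across sources.
        return name.strip().lower()

    return {
        "job": normalize(job),
        "inputs": sorted({normalize(n) for n in inputs}),
        "outputs": sorted({normalize(n) for n in outputs}),
    }


from_log = conform(
    {"app_name": "Daily_Agg",
     "read_tables": ["DW.Playback_Events "],
     "written_tables": ["dw.playback_daily"]},
    "spark_log",
)
from_event = conform(
    {"job": "daily_agg",
     "inputs": ["dw.playback_events"],
     "outputs": ["DW.PLAYBACK_DAILY"]},
    "pushed_event",
)
```

Rejecting unknown sources with an error, rather than passing records through unchanged, keeps unconformed data from silently entering the lineage store.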

Related Concepts

Data Lineage

Data Integrity

Data Ingestion

ETL Processes

Data Efficiency