How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale

Netflix Technology Blog
8 min readadvanced
--
View Original

Overview

Netflix built a Real-Time Distributed Graph (RDG) to connect member interaction data across their expanding business verticals including streaming, live events, and mobile games. This first part of a multi-part series covers the ingestion and processing pipeline architecture, which uses Apache Kafka for event ingestion and Apache Flink for stream processing to generate graph nodes and edges at internet scale.

What You'll Learn

1

Why Netflix chose a graph representation over traditional data warehouses for cross-domain member interaction analysis

2

How to architect a real-time data ingestion pipeline using Apache Kafka and Apache Flink at million-messages-per-second scale

3

How to transform raw event streams into graph nodes and edges through filtering, enrichment, and deduplication stages

4

When to split a monolithic Flink job into multiple jobs mapped 1:1 to Kafka source topics for better operational stability

5

Why microservices data isolation creates challenges for cross-domain analytics and how a graph solves them

Prerequisites & Requirements

  • Understanding of stream processing concepts and event-driven architectures
  • Familiarity with graph data structures (nodes, edges, traversals)
  • Basic understanding of Apache Kafka topics, consumers, and message processing
  • Understanding of microservices architecture and its data isolation implications(optional)
  • Experience with distributed systems and data pipeline design(optional)

Key Questions Answered

Why did Netflix build a Real-Time Distributed Graph instead of using a traditional data warehouse?
Netflix's expansion into ad-supported plans, live programming, and mobile games created a need to analyze member interactions across business verticals. In a microservices architecture, data was siloed across hundreds of services, making it extremely difficult for data science and engineering teams to manually stitch data together from the data warehouse. A graph representation enables fast relationship-centric queries, flexible schema evolution, and natural pattern detection without expensive joins.
How does Netflix ingest data into their Real-Time Distributed Graph at scale?
Member actions in the Netflix app are published to an API Gateway, which writes them as records to Apache Kafka topics. Each topic generates up to roughly 1 million messages per second, encoded in Apache Avro format with schemas stored in a centralized registry. Retention policies are tailored per topic based on throughput and record size, and records are also persisted to Apache Iceberg tables for backfill scenarios.
What does the Apache Flink processing pipeline look like for Netflix's graph system?
Flink jobs consume Kafka events and process them through a series of stages: filtering and projections to remove noise, enriching events with metadata via side inputs, transforming events into graph nodes and edges, buffering and deduplicating overlapping updates within configurable time windows using stateful process functions, and finally publishing more than 5 million records per second to Data Mesh for persistence.
Why did Netflix move from one monolithic Flink job to multiple jobs?
A single Flink job consuming all Kafka source topics became an operational headache because different topics have varying data volumes and throughputs at different times of day. Tuning CPU, memory, parallelism, and checkpointing intervals for the monolithic job was extremely difficult. Netflix pivoted to a 1:1 mapping of Kafka source topic to Flink job, which made each job simpler to maintain, analyze, and tune despite increased operational overhead.
What are the advantages of using a graph over relational tables for cross-domain data analysis?
Graphs enable fast relationship-centric queries by allowing quick hops across nodes and edges without expensive joins or manual denormalization. They offer flexibility as new connections and entities emerge without significant schema changes. Additionally, graphs naturally support pattern and anomaly detection through graph traversals, which are much more efficient than siloed point lookups for identifying hidden relationships, cycles, or groupings.
How does Netflix handle deduplication in their real-time graph processing pipeline?
Netflix uses stateful Flink process functions with timers to buffer, detect, and deduplicate overlapping updates that occur to the same graph nodes and edges within a small, configurable time window. This deduplication step reduces the data throughput published downstream, helping manage the volume before writing to Data Mesh and subsequent storage systems.
What is Netflix Data Mesh and how does it connect to the graph system?
Data Mesh is Netflix's abstraction layer that connects data applications and storage systems. The RDG Flink jobs publish more than 5 million node and edge records per second to Data Mesh, which handles persisting the records to various data stores that other internal services can query. It serves as the bridge between the processing pipeline and the storage layer.

Key Statistics & Figures

Kafka topic message throughput
~1 million messages per second
Per Kafka source topic consumed by RDG team applications
Total records published to Data Mesh
More than 5 million records per second
Combined total of graph nodes and edges written to Data Mesh from Flink processing

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Map each Kafka source topic to its own dedicated Flink job rather than building a monolithic consumer. A 1:1 mapping from source topic to processing job allows independent tuning of CPU, memory, parallelism, and checkpointing intervals per data stream, which is critical when topics have varying throughput patterns throughout the day.
Netflix initially tried a single Flink job consuming all topics but found it impossible to find stable configurations. The tradeoff of more jobs to manage was worth the improved maintainability and operational stability.
2
Implement a deduplication stage using stateful process functions and timers in your stream processing pipeline. Buffering and merging overlapping updates to the same entities within a configurable time window significantly reduces downstream write volume, which is essential when operating at millions of records per second.
Netflix uses this approach before publishing to Data Mesh to reduce the 1M+ messages/second input down to manageable write volumes across their storage systems.
3
Persist streaming data to both real-time systems and batch-accessible data warehouse tables (such as Apache Iceberg) simultaneously. This dual-write strategy enables data backfill scenarios when older data is no longer available in Kafka topics due to retention policies, providing a safety net for data recovery and historical analysis.
Netflix tailors Kafka retention policies per topic based on throughput and record size to balance data availability with storage costs, making the Iceberg backup essential for completeness.
4
Choose a graph data model when your use case requires analyzing relationships across multiple microservices' data domains. Graphs excel at relationship-centric queries, flexible schema evolution, and pattern detection — capabilities that require expensive joins or manual denormalization in traditional table-based models.
Netflix's microservices architecture resulted in siloed data across hundreds of services, making cross-domain member interaction analysis extremely difficult until they adopted a graph representation.
5
Design your graph data model to be as generic and flexible as possible from the start, so that adding new node and edge types is an infrequent operation. Separating each node and edge type into its own Kafka topic allows bespoke tuning and scaling per type, even though it increases the number of topics to manage.
Netflix accepted the tradeoff of managing more Kafka topics in exchange for the ability to independently scale and tune each graph entity type as their business verticals expanded.

Common Pitfalls

1
Building a single monolithic Flink job to consume all Kafka source topics. When different topics have varying data volumes and throughputs at different times of day, finding CPU, memory, parallelism, and checkpointing configurations that ensure stability for a monolithic job becomes extremely difficult.
Netflix experienced this firsthand and pivoted to a 1:1 mapping from Kafka topic to Flink job, accepting additional operational overhead for much simpler maintenance and tuning.
2
Attempting to analyze cross-domain data by manually stitching together data from a data warehouse and siloed microservice databases. In a microservices architecture with hundreds of services, this approach becomes an onerous task as data is separated across different schemas, storage technologies, and processing cadences.
Netflix's data science and engineering teams struggled with this before the RDG was built. A graph representation provides a unified view of interconnected data without manual denormalization.
3
Not implementing deduplication before writing to downstream storage. At internet scale, overlapping updates to the same graph entities are common and writing every update individually creates unnecessary throughput pressure on downstream storage systems.
Netflix uses stateful Flink process functions with configurable time windows to buffer and merge overlapping updates, significantly reducing the write volume to Data Mesh.

Related Concepts

Graph Databases
Stream Processing Architecture
Event-driven Architecture
Microservices Data Integration
Stateful Stream Processing
Data Deduplication
Event Sourcing
Real-time Analytics
Data Mesh Architecture
Schema Registry
Exactly-once Processing Semantics
Graph Traversal Algorithms