Overview
Netflix built a Real-Time Distributed Graph (RDG) to connect member interaction data across their expanding business verticals including streaming, live events, and mobile games. This first part of a multi-part series covers the ingestion and processing pipeline architecture, which uses Apache Kafka for event ingestion and Apache Flink for stream processing to generate graph nodes and edges at internet scale.
What You'll Learn
Why Netflix chose a graph representation over traditional data warehouses for cross-domain member interaction analysis
How to architect a real-time data ingestion pipeline using Apache Kafka and Apache Flink at million-messages-per-second scale
How to transform raw event streams into graph nodes and edges through filtering, enrichment, and deduplication stages
When to split a monolithic Flink job into multiple jobs mapped 1:1 to Kafka source topics for better operational stability
Why microservices data isolation creates challenges for cross-domain analytics and how a graph solves them
Prerequisites & Requirements
- Understanding of stream processing concepts and event-driven architectures
- Familiarity with graph data structures (nodes, edges, traversals)
- Basic understanding of Apache Kafka topics, consumers, and message processing
- Understanding of microservices architecture and its data isolation implications(optional)
- Experience with distributed systems and data pipeline design(optional)
Key Questions Answered
Why did Netflix build a Real-Time Distributed Graph instead of using a traditional data warehouse?
How does Netflix ingest data into their Real-Time Distributed Graph at scale?
What does the Apache Flink processing pipeline look like for Netflix's graph system?
Why did Netflix move from one monolithic Flink job to multiple jobs?
What are the advantages of using a graph over relational tables for cross-domain data analysis?
How does Netflix handle deduplication in their real-time graph processing pipeline?
What is Netflix Data Mesh and how does it connect to the graph system?
Key Statistics & Figures
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Key Actionable Insights
1Map each Kafka source topic to its own dedicated Flink job rather than building a monolithic consumer. A 1:1 mapping from source topic to processing job allows independent tuning of CPU, memory, parallelism, and checkpointing intervals per data stream, which is critical when topics have varying throughput patterns throughout the day.Netflix initially tried a single Flink job consuming all topics but found it impossible to find stable configurations. The tradeoff of more jobs to manage was worth the improved maintainability and operational stability.
2Implement a deduplication stage using stateful process functions and timers in your stream processing pipeline. Buffering and merging overlapping updates to the same entities within a configurable time window significantly reduces downstream write volume, which is essential when operating at millions of records per second.Netflix uses this approach before publishing to Data Mesh to reduce the 1M+ messages/second input down to manageable write volumes across their storage systems.
3Persist streaming data to both real-time systems and batch-accessible data warehouse tables (such as Apache Iceberg) simultaneously. This dual-write strategy enables data backfill scenarios when older data is no longer available in Kafka topics due to retention policies, providing a safety net for data recovery and historical analysis.Netflix tailors Kafka retention policies per topic based on throughput and record size to balance data availability with storage costs, making the Iceberg backup essential for completeness.
4Choose a graph data model when your use case requires analyzing relationships across multiple microservices' data domains. Graphs excel at relationship-centric queries, flexible schema evolution, and pattern detection — capabilities that require expensive joins or manual denormalization in traditional table-based models.Netflix's microservices architecture resulted in siloed data across hundreds of services, making cross-domain member interaction analysis extremely difficult until they adopted a graph representation.
5Design your graph data model to be as generic and flexible as possible from the start, so that adding new node and edge types is an infrequent operation. Separating each node and edge type into its own Kafka topic allows bespoke tuning and scaling per type, even though it increases the number of topics to manage.Netflix accepted the tradeoff of managing more Kafka topics in exchange for the ability to independently scale and tune each graph entity type as their business verticals expanded.