Announcing Suro: Backbone of Netflix’s Data Pipeline

Netflix Technology Blog

Overview

This article announces Suro, the backbone of Netflix's data pipeline, which collects and dispatches the event data generated by Netflix's applications. Suro is built for scalability, resilience, and dynamic configuration, and it handles over 1.5 million events per second at peak.

What You'll Learn

1

How to implement a scalable data pipeline using Suro

2

Why dynamic event dispatching is crucial for data processing

3

When to use batch processing versus real-time computation

Key Questions Answered

What is Suro and how does it function as a data pipeline?

Suro is Netflix's data pipeline backbone that collects and dispatches events generated by applications. It consists of a producer client, a collector server, and a plugin framework, enabling dynamic filtering and dispatching of events to multiple consumers.
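The filtering and dispatching step described above can be sketched in a few lines. This is a minimal illustration, not Suro's actual API: the `Event`, `Route`, and `Dispatcher` names, the `routing_key` field, and the sink labels are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    routing_key: str  # hypothetical field naming the stream an event belongs to
    payload: bytes


def accept_all(event: Event) -> bool:
    """Default filter: let every event through."""
    return True


@dataclass(frozen=True)
class Route:
    sink: str  # hypothetical sink label, e.g. "s3" or "kafka"
    accepts: Callable[[Event], bool] = accept_all


class Dispatcher:
    """Sends each event to every sink whose filter accepts it."""

    def __init__(self, routes: Dict[str, List[Route]]) -> None:
        self.routes = routes
        # Stand-in for real consumers: events grouped by sink label.
        self.delivered: Dict[str, List[Event]] = {}

    def dispatch(self, event: Event) -> List[str]:
        matched = []
        for route in self.routes.get(event.routing_key, []):
            if route.accepts(event):
                self.delivered.setdefault(route.sink, []).append(event)
                matched.append(route.sink)
        return matched


routes = {
    "request_trace": [
        Route("s3"),  # everything goes to the batch sink
        Route("kafka", lambda e: b"ERROR" in e.payload),  # only errors go real-time
    ]
}
dispatcher = Dispatcher(routes)
ok_sinks = dispatcher.dispatch(Event("request_trace", b"GET /home 200"))
error_sinks = dispatcher.dispatch(Event("request_trace", b"ERROR upstream timeout"))
print(ok_sinks)     # ['s3']
print(error_sinks)  # ['s3', 'kafka']
```

One event can reach several consumers because routing is a list of (filter, sink) pairs rather than a single destination; an event with no matching routing key is simply not delivered anywhere in this sketch.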

How does Suro handle different data formats?

Suro supports arbitrary data formats, allowing users to plug in their own serialization and deserialization code. This flexibility is crucial for processing diverse types of events generated by various applications.
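One way to picture pluggable serialization is a registry that maps a format name to a pair of functions. The sketch below is illustrative only; `SerDeRegistry` and the format names are hypothetical and do not reflect Suro's real plugin interface.

```python
import json
from typing import Any, Callable, Dict, Tuple

# A codec is a (serialize, deserialize) pair supplied by the user.
Codec = Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]


class SerDeRegistry:
    """Hypothetical registry of user-supplied codecs, keyed by format name."""

    def __init__(self) -> None:
        self._codecs: Dict[str, Codec] = {}

    def register(
        self,
        name: str,
        serialize: Callable[[Any], bytes],
        deserialize: Callable[[bytes], Any],
    ) -> None:
        self._codecs[name] = (serialize, deserialize)

    def serialize(self, name: str, value: Any) -> bytes:
        return self._codecs[name][0](value)

    def deserialize(self, name: str, data: bytes) -> Any:
        return self._codecs[name][1](data)


registry = SerDeRegistry()
registry.register(
    "json",
    lambda value: json.dumps(value, sort_keys=True).encode("utf-8"),
    lambda data: json.loads(data.decode("utf-8")),
)
registry.register(
    "text",
    lambda value: value.encode("utf-8"),
    lambda data: data.decode("utf-8"),
)

event = {"app": "api", "status": 200}
wire = registry.serialize("json", event)
restored = registry.deserialize("json", wire)
print(wire)      # b'{"app": "api", "status": 200}'
print(restored)  # {'app': 'api', 'status': 200}
```

The pipeline itself only ever moves bytes; what those bytes mean is decided by whichever codec the producer and consumer agree on.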

What are the performance metrics for Suro?

At peak hours, Suro handles over 1.5 million events per second, which demonstrates that it can manage large-scale event traffic efficiently.

How does Suro ensure resilience against failures?

Suro is designed to tolerate failures, including those deliberately injected by Netflix's Simian Army tools such as Chaos Monkey, so the data pipeline keeps running through unexpected disruptions.

Key Statistics & Figures

Events processed per second

Over 1.5 million

This is the event rate Suro handles at peak, during high-traffic periods.

Events processed per day

80 billion

This daily volume shows the scale of Netflix's data pipeline and why it needs a robust backbone like Suro.
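The two figures can be related with simple arithmetic. The sketch below uses only the numbers quoted above (80 billion events per day, 1.5 million events per second at peak) and assumes the daily total is spread over a full 24-hour day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 80_000_000_000       # figure quoted above
peak_events_per_second = 1_500_000    # figure quoted above

# Average rate implied by the daily total.
average_events_per_second = events_per_day / SECONDS_PER_DAY

# How much higher the peak is than that average.
peak_to_average = peak_events_per_second / average_events_per_second

print(round(average_events_per_second))  # 925926
print(f"{peak_to_average:.2f}")          # 1.62
```

In other words, the daily total implies an average of roughly 926,000 events per second, and the quoted peak is about 1.6 times that average.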

Technologies & Tools

Cloud Infrastructure

AWS EC2

Hosts the web services and applications that generate events for Suro.

Data Processing

Hadoop

Processes collected events to generate offline business reports.

Message Broker

Kafka

Used for dispatching events to designated topics for real-time processing.

Storage

S3

Stores aggregated data for further processing by Hadoop jobs.

Analytics

Druid

Indexes log lines on the fly for immediate querying.

Search Engine

Elasticsearch

Ingests log lines for querying and analysis.

Key Actionable Insights

1

Suro can feed both batch and real-time computation, so a single pipeline can serve offline reporting and immediate querying.

Organizations that handle large volumes of event data can benefit from Suro's architecture, which supports dynamic event dispatching and resilience against failures.

2

Suro's plugin framework lets you customize how events are handled.

Because users can plug in their own serialization and deserialization code, Suro can accommodate application-specific data formats.

Common Pitfalls

1

Treating Suro's configuration as static can lead to inefficient data processing and missed operational insights.

If filtering and dispatching rules are not updated as data needs change, the pipeline cannot adapt, which results in bottlenecks and delays in data availability.
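To make the idea of dynamic configuration concrete, here is a minimal sketch of a routing table that can be replaced at runtime without restarting the process. The JSON layout, the `DynamicRoutes` class, and the sink names are assumptions for illustration; this summary does not describe Suro's actual configuration format.

```python
import json
import threading
from typing import Dict, Tuple


class DynamicRoutes:
    """Hypothetical routing table that can be swapped while the process runs."""

    def __init__(self, config_json: str) -> None:
        self._lock = threading.Lock()
        self._routes = self._parse(config_json)

    @staticmethod
    def _parse(config_json: str) -> Dict[str, Tuple[str, ...]]:
        raw = json.loads(config_json)
        if not isinstance(raw, dict):
            raise ValueError("config must be a JSON object")
        routes = {}
        for routing_key, sinks in raw.items():
            is_list_of_names = isinstance(sinks, list) and all(
                isinstance(sink, str) for sink in sinks
            )
            if not is_list_of_names:
                raise ValueError(f"sinks for {routing_key!r} must be a list of names")
            routes[routing_key] = tuple(sinks)
        return routes

    def reload(self, config_json: str) -> None:
        # Parse and validate first, so a bad config never replaces a good one.
        new_routes = self._parse(config_json)
        with self._lock:
            self._routes = new_routes

    def sinks_for(self, routing_key: str) -> Tuple[str, ...]:
        with self._lock:
            return self._routes.get(routing_key, ())


routes = DynamicRoutes('{"request_trace": ["s3"]}')
before = routes.sinks_for("request_trace")

# Add a real-time sink without restarting anything.
routes.reload('{"request_trace": ["s3", "kafka"]}')
after = routes.sinks_for("request_trace")

# An invalid update is rejected and the previous routes stay active.
try:
    routes.reload('{"request_trace": "kafka"}')
    rejected = False
except ValueError:
    rejected = True
still_active = routes.sinks_for("request_trace")

print(before)        # ('s3',)
print(after)         # ('s3', 'kafka')
print(rejected)      # True
print(still_active)  # ('s3', 'kafka')
```

Validating before swapping is the design choice that matters here: a mistyped update fails loudly instead of silently dropping traffic.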

Related Concepts

Data Pipeline Architecture

Event-driven Systems

Big Data Processing Frameworks