Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Netflix Technology Blog

Netflix

•

Netflix Technology Blog

•10 min read•advanced•

--

•View Original

ApacheApache SparkGraphQLJavaReactSpringSpring Boot

Overview

The article discusses how Netflix scales its Muse application to provide data-driven creative insights at a massive scale, focusing on the architectural evolution and optimizations made to handle trillions of rows of data. Key advancements include the use of HyperLogLog sketches for distinct counts and the implementation of the Hollow library for efficient data access.

What You'll Learn

1

How to implement HyperLogLog sketches for efficient distinct counting in analytics applications

2

Why using the Hollow library can improve data access performance in large-scale applications

3

How to optimize Druid configurations for better query performance

Prerequisites & Requirements

Understanding of Online Analytical Processing (OLAP) concepts
Familiarity with Apache Druid and its architecture(optional)
Experience with data processing frameworks like Apache Spark(optional)

Key Questions Answered

How does Netflix handle distinct counting in its analytics systems?

Netflix uses HyperLogLog sketches to perform distinct counting efficiently, allowing them to estimate unique user interactions with a margin of error of 1-2%. This method significantly reduces resource consumption compared to traditional counting methods, especially over large datasets.

What architectural changes were made to the Muse application for scalability?

The Muse application evolved from a simple dashboard to a React app using a GraphQL layer and Spring Boot GRPC microservices. This architecture supports advanced filtering and grouping capabilities while maintaining high performance and data accuracy.

What optimizations were implemented for the Druid cluster?

Key optimizations for the Druid cluster included increasing the broker count, tuning segment sizes to the 300-700 MB range, leveraging Druid lookups for data enrichment, and filtering data at ingestion to improve query performance and reduce latency.

Key Statistics & Figures

Latency reduction

Approx 50%

Achieved across common OLAP query patterns after implementing HyperLogLog sketches.

Error margin for distinct counts

1-2%

This margin is achieved using the Apache Datasketches library for estimating distinct counts.

Druid p99 latencies

Decreased by roughly 50%

This reduction was a result of various optimizations implemented in the Druid cluster.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Data Processing

Apache Spark

Used for batch data processing and ETL tasks.

Analytics Database

Apache Druid

Serves as the primary analytics database for querying large volumes of data.

Data Access

Hollow

Provides efficient in-memory storage and access to precomputed aggregates.

API

Graphql

Used as the querying layer for the Muse application.

Backend Framework

Spring Boot

Used to build microservices for the Muse application.

Key Actionable Insights

1
Implement HyperLogLog sketches in your analytics applications to improve performance in counting distinct users.
This method allows for efficient estimation of unique counts with minimal resource usage, making it ideal for applications dealing with large datasets.

2
Utilize the Hollow library for managing in-memory data access to enhance application performance.
Hollow allows for quick access to precomputed aggregates, reducing the load on primary data stores like Druid and improving response times.

3
Regularly tune your Druid configurations based on query patterns to maintain optimal performance.
Adjusting parameters like broker counts and segment sizes can lead to significant improvements in query throughput and latency.

Common Pitfalls

1

Overlooking the importance of tuning Druid configurations can lead to suboptimal performance.

Without proper tuning, query latencies can increase significantly, affecting user experience and system performance.

2

Failing to validate new metrics against legacy systems can result in data integrity issues.

It's crucial to have a robust validation process to ensure that new implementations do not introduce discrepancies in reported metrics.

Related Concepts

Online Analytical Processing (olap)

Data Pipeline Architecture

Performance Optimization Techniques

Data Aggregation Methods