Evolution of ML Fact Store

Netflix Technology Blog

Netflix

AWS, Machine Learning

Overview

The article discusses the evolution of the Axion ML fact store at Netflix, focusing on its design, components, and the lessons learned during its development. It highlights the importance of high-quality data for machine learning algorithms and how Axion helps in reducing training-serving skew and improving offline experimentation.

What You'll Learn

1

How to implement a fact logging client for machine learning applications

2

Why monitoring data quality is crucial for machine learning models

3

How to optimize query performance in data storage solutions

Prerequisites & Requirements

Understanding of machine learning concepts and data pipelines
Familiarity with ETL processes and data storage solutions like Iceberg (optional)

Key Questions Answered

What are the main components of the Axion ML fact store?

The Axion ML fact store consists of four main components: the fact logging client, ETL, query client, and data quality infrastructure. These components work together to collect, process, and ensure the quality of data used for machine learning features and recommendations.
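To make the first of these components more concrete, here is a minimal sketch of what a fact logging client could look like: it wraps each fact with an identifier, a type, and a timestamp, serializes it, and hands it to a sink. The names `Fact`, `FactLoggingClient`, `member_id`, and the in-memory sink are hypothetical and chosen for illustration; they are not Axion's actual API.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Fact:
    """One logged fact: who it is about, what kind of fact, and the payload."""
    member_id: str
    fact_type: str
    payload: Dict[str, Any]
    logged_at: float


class FactLoggingClient:
    """Hypothetical client that serializes facts and forwards them to a sink."""

    def __init__(self, sink: Callable[[str], None]) -> None:
        self._sink = sink

    def log(self, member_id: str, fact_type: str, payload: Dict[str, Any]) -> None:
        fact = Fact(member_id, fact_type, payload, logged_at=time.time())
        # sort_keys keeps the serialized form stable across calls.
        self._sink(json.dumps(asdict(fact), sort_keys=True))


# An in-memory list stands in for the real transport and storage (assumption).
records: List[str] = []
client = FactLoggingClient(records.append)
client.log("member-1", "viewing_history", {"title_ids": [10, 20]})
```

In a real deployment the sink would write to durable storage that the ETL component reads from; the list here only keeps the sketch self-contained.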

How does Axion reduce training-serving skew?

Axion reduces training-serving skew by ensuring that the same data and code are used for both online and offline feature generation. This synchronization allows for more accurate model training and improved recommendation accuracy.
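The idea can be sketched as a single feature encoder that both the online and the offline path call, with the offline path replaying the facts that were logged at request time. This is an illustrative sketch only; `encode_features`, `serve_online`, `build_training_row`, and the `watched_title_ids` field are hypothetical names, not taken from Axion.

```python
from typing import Dict, List

Facts = Dict[str, List[int]]


def encode_features(facts: Facts) -> Dict[str, float]:
    """Single feature encoder shared by the online and offline paths."""
    watched = facts.get("watched_title_ids", [])
    return {
        "num_watched": float(len(watched)),
        "has_history": 1.0 if watched else 0.0,
    }


def serve_online(live_facts: Facts) -> Dict[str, float]:
    """Online path: encode the facts available at request time."""
    return encode_features(live_facts)


def build_training_row(logged_facts: Facts) -> Dict[str, float]:
    """Offline path: encode the facts replayed from the fact store."""
    return encode_features(logged_facts)


facts_at_request_time: Facts = {"watched_title_ids": [3, 7, 9]}
logged_copy: Facts = dict(facts_at_request_time)  # what was logged for replay

online = serve_online(facts_at_request_time)
offline = build_training_row(logged_copy)
```

Because both paths share the same data and the same encoder, `online` and `offline` are identical, which is the property that keeps training and serving from drifting apart.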

What strategies are used to monitor data quality in Axion?

Axion employs three strategies to monitor data quality: aggregations to track data trends, consistent sampling for quick checks, and random sampling to catch rare issues. These methods help identify data corruption and maintain trust in the data quality.
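A minimal sketch of the three strategies follows, under stated assumptions: the aggregation is a simple null rate, consistent sampling is implemented by hashing a key so the same keys are picked on every run, and random sampling draws a fresh subset. The function names, the `member_id` and `payload` fields, and the 10% sample rate are hypothetical choices for illustration, not details from Axion.

```python
import hashlib
import random
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def null_rate(rows: List[Row], field: str) -> float:
    """Aggregation: fraction of rows where a field is missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def in_consistent_sample(key: str, percent: int) -> bool:
    """Consistent sampling: hashing the key selects the same keys every run."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def random_sample(rows: List[Row], k: int, seed: Optional[int] = None) -> List[Row]:
    """Random sampling: a different subset each run unless a seed is fixed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


rows: List[Row] = [
    {"member_id": f"m{i}", "payload": None if i % 10 == 0 else "ok"}
    for i in range(100)
]
missing = null_rate(rows, "payload")
tracked = [r for r in rows if in_consistent_sample(str(r["member_id"]), 10)]
spot_check = random_sample(rows, 5, seed=42)
```

Tracking `missing` over time surfaces trends, the `tracked` subset gives a small stable set for quick comparisons, and `spot_check` can reach rows that the stable subset never covers.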

Why was querying the single Iceberg table slow?

Querying the single Iceberg table was slow because each query had to filter several hundred million rows down to a much smaller dataset, and all of the data was downloaded from S3 before the filter was applied.
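The cost difference can be illustrated with a toy in-memory comparison: scanning every row and then filtering touches the whole table, while a lookup by key touches only the requested entries. This sketch uses plain Python lists and dicts as stand-ins; it does not use the Iceberg or EVCache APIs, and the table size and key names are made up for illustration.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str]  # (member_id, fact blob)

# Stand-in for a large table; the real one held several hundred million rows.
table: List[Row] = [(f"m{i}", f"blob-{i}") for i in range(100_000)]
wanted: Set[str] = {"m5", "m99999"}


def scan_then_filter(rows: List[Row], keys: Set[str]) -> Tuple[Dict[str, str], int]:
    """Reads every row, then keeps the few that match."""
    rows_read = 0
    matches: Dict[str, str] = {}
    for member_id, blob in rows:
        rows_read += 1
        if member_id in keys:
            matches[member_id] = blob
    return matches, rows_read


# Stand-in for a key-value store keyed by member_id.
index: Dict[str, str] = dict(table)


def keyed_lookup(store: Dict[str, str], keys: Set[str]) -> Tuple[Dict[str, str], int]:
    """Fetches only the requested keys."""
    matches = {k: store[k] for k in keys if k in store}
    return matches, len(keys)


scan_result, scan_reads = scan_then_filter(table, wanted)
lookup_result, lookup_reads = keyed_lookup(index, wanted)
```

Both paths return the same two rows, but the scan reads 100,000 rows to find them while the keyed lookup issues two reads.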

Key Statistics & Figures

Data quality issues identified early

95%

Share of data quality issues that Axion's monitoring approaches caught early over the past two years.

Query performance improvement with EVCache

3x-50x faster

Speedup observed when queries that match specific access patterns are served from EVCache instead of the Iceberg table.

Technologies & Tools

Data Storage

Axion

Used as the primary ML fact store for generating features and recommendations.

Data Storage

Iceberg

Utilized for storing large blobs of unstructured data logged by the fact logging client.

Data Storage

EVCache

A key-value store introduced to optimize query performance for specific patterns.

Key Actionable Insights

1
Implement a robust data quality monitoring system to ensure the integrity of your datasets.
Data quality is crucial for the performance of machine learning models. By proactively monitoring for data corruption, you can maintain trust in your data and improve model outcomes.

2
Consider denormalizing your data storage to enhance query performance.
While normalization can save space, denormalization can significantly improve query speed, especially for large datasets, as seen in Axion's transition to using nested Parquet format.

3
Utilize a key-value store like EVCache for low-latency queries.
By optimizing access patterns and using a key-value store, you can achieve faster query performance, which is essential for real-time applications.
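The normalization trade-off in the second insight can be sketched with two toy layouts: a normalized one where rows reference facts by id and every read needs a join, and a nested one where each row carries its facts inline. The data, field names, and helper functions below are hypothetical and use plain Python structures instead of Parquet.

```python
from typing import Any, Dict, List

# Normalized layout: rows reference facts by id, so reads need a join.
facts_by_id: Dict[int, str] = {1: "viewing_history_blob", 2: "my_list_blob"}
normalized_rows: List[Dict[str, Any]] = [
    {"member_id": "m1", "fact_ids": [1, 2]},
    {"member_id": "m2", "fact_ids": [1]},
]


def read_normalized(member_id: str) -> List[str]:
    """Finds the row, then resolves each fact id against the facts table."""
    for row in normalized_rows:
        if row["member_id"] == member_id:
            return [facts_by_id[fid] for fid in row["fact_ids"]]
    return []


# Denormalized (nested) layout: each row stores its facts inline.
nested_rows: List[Dict[str, Any]] = [
    {"member_id": r["member_id"], "facts": [facts_by_id[f] for f in r["fact_ids"]]}
    for r in normalized_rows
]


def read_nested(member_id: str) -> List[str]:
    """Finds the row and returns its facts with no second lookup."""
    for row in nested_rows:
        if row["member_id"] == member_id:
            return row["facts"]
    return []


stored_blobs_normalized = len(facts_by_id)
stored_blobs_nested = sum(len(r["facts"]) for r in nested_rows)
```

The nested layout stores three blobs where the normalized one stores two, since the shared fact is duplicated, but each read avoids the join. That is the space-for-speed trade described above.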

Common Pitfalls

1

Over-optimizing data storage solutions can lead to complexity that hinders performance.

Premature optimizations often complicate systems and make them harder to maintain. It's essential to prioritize simplicity and scalability in initial designs.

2

Neglecting comprehensive testing frameworks can result in significant issues down the line.

Without robust testing, including scalability and performance tests, systems can encounter unforeseen problems that are costly to resolve later.

Related Concepts

Data Quality Monitoring

Feature Generation in Machine Learning

ETL Processes and Optimization

Denormalization vs. Normalization in Data Storage