Formulating ‘Out of Memory Kill’ Prediction on the Netflix App as a Machine Learning Problem

Netflix Technology Blog
11 min read | Advanced

Overview

This article discusses the formulation of predicting 'out of memory' (OOM) kills in the Netflix app as a machine learning problem. It highlights the challenges of dataset curation, the importance of device characteristics, and the steps taken to analyze and predict OOM kills to enhance user experience.

What You'll Learn

1. How to analyze device capabilities for OOM kill prediction
2. Why dataset curation is critical for machine learning applications
3. How to implement a graded window approach for labeling data
4. When to apply feature engineering techniques in memory management

Prerequisites & Requirements

  • Understanding of machine learning concepts and memory management
  • Experience with data engineering and analysis (optional)

Key Questions Answered

What are the challenges in curating a dataset for OOM kill prediction?
Curating a dataset for OOM kill prediction is challenging because the data must be gathered from several sources, including device characteristics and real-time user data. The resulting dataset is also heavily imbalanced: OOM kills are infrequent, so normal runtime states dominate the data.
How is the OOM kill prediction problem formulated as a machine learning task?
The OOM kill prediction problem is formulated as a multi-class classification task using features from runtime memory readings and device characteristics. Each memory reading is labeled according to its proximity to an OOM kill, which allows the use of algorithms such as artificial neural networks (ANNs) and XGBoost.
What labeling strategy is used for OOM kill data?
A sliding window approach is used for labeling OOM kill data, where memory readings are categorized based on their proximity to the OOM kill event. This includes a graded window approach that assigns levels to readings based on their closeness to the kill, facilitating a multi-class classification model.
What insights can be gained from analyzing memory readings prior to OOM kills?
Analyzing memory readings prior to OOM kills can reveal patterns and peaks in memory usage, indicating potential causes for crashes. For instance, early peaks may represent crashes not visible to users, while sharp declines can indicate user-facing issues, aiding in preemptive actions.
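
The graded window labeling described above can be illustrated with a minimal Python sketch. The function name, the window edges (in seconds), and the convention that a higher level means a reading closer to the kill are all assumptions made for illustration; the summary does not state the actual window sizes or level numbering.

```python
from typing import List, Optional


def graded_window_labels(
    timestamps: List[float],
    kill_time: Optional[float],
    window_edges: List[float],
) -> List[int]:
    """Label each memory reading by how close it is to an OOM kill.

    window_edges holds ascending distances before the kill (hypothetical
    values). A reading within window_edges[0] of the kill gets the highest
    level, len(window_edges). Readings farther away than the last edge, or
    readings from a session with no kill, get level 0 (normal).
    """
    n_levels = len(window_edges)
    labels = []
    for t in timestamps:
        level = 0
        if kill_time is not None and t <= kill_time:
            gap = kill_time - t
            # Find the tightest window that still contains this reading.
            for i, edge in enumerate(window_edges):
                if gap <= edge:
                    level = n_levels - i
                    break
        labels.append(level)
    return labels


# Hypothetical session: readings taken at these seconds, kill at t = 100.
readings = [0, 30, 60, 85, 95, 100]
print(graded_window_labels(readings, 100, [10, 30, 60]))
# prints [0, 0, 1, 2, 3, 3]
```

The integer levels can then serve as the target variable of a multi-class classifier, with level 0 standing for the normal runtime state.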

Key Statistics & Figures

Percentage of non-kill related entries
99.1%
This statistic highlights the challenge of predicting OOM kills, as the dataset is heavily skewed towards normal runtime states.
Frequency of OOM kills
0.9%
This low frequency shows that OOM kills are rare events, which leaves few positive examples to learn from and makes them harder to predict and prevent.
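
To get a feel for how skewed a 99.1% / 0.9% split is, the sketch below computes inverse-frequency class weights, one common way to compensate for imbalance during training. The summary does not say which rebalancing technique was actually used, so this is only an illustration; the class names and the two-class simplification are assumptions.

```python
def inverse_frequency_weights(class_fractions):
    """Return a weight of 1 / (n_classes * fraction) for each class.

    This is the common "balanced" heuristic: a class that makes up half
    of a two-class dataset gets weight 1.0, and rarer classes get more.
    """
    n_classes = len(class_fractions)
    return {
        name: 1.0 / (n_classes * fraction)
        for name, fraction in class_fractions.items()
    }


# Fractions taken from the statistics above; class names are hypothetical.
fractions = {"normal": 0.991, "oom_kill": 0.009}
weights = inverse_frequency_weights(fractions)

for name, weight in weights.items():
    print(f"{name}: {weight:.2f}")
# prints:
# normal: 0.50
# oom_kill: 55.56
```

Under this heuristic a single kill-related entry counts roughly 110 times as much as a normal one, which shows how little signal the minority class carries if left unweighted.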

Technologies & Tools

Data Processing
Spark SQL
Used for querying and processing large datasets in the context of OOM kill prediction.
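
As a rough illustration of the kind of query involved, the sketch below joins per-device characteristics onto runtime memory readings. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL so that it runs without a cluster. The table names, column names, and sample rows are all hypothetical and are not taken from the source.

```python
import sqlite3

# In-memory database standing in for the real data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE device_capabilities (
        device_type_id INTEGER PRIMARY KEY,
        total_ram_mb   INTEGER
    );
    CREATE TABLE memory_readings (
        session_id     TEXT,
        device_type_id INTEGER,
        ts             INTEGER,
        used_mb        REAL
    );
    INSERT INTO device_capabilities VALUES (1, 1024), (2, 2048);
    INSERT INTO memory_readings VALUES
        ('s1', 1, 0, 512.0),
        ('s1', 1, 10, 768.0),
        ('s2', 2, 0, 512.0);
    """
)

# Attach device characteristics to every reading and derive a
# device-relative feature (fraction of total RAM in use).
query = """
    SELECT r.session_id,
           r.ts,
           r.used_mb,
           d.total_ram_mb,
           r.used_mb / d.total_ram_mb AS used_fraction
    FROM memory_readings AS r
    JOIN device_capabilities AS d
      ON r.device_type_id = d.device_type_id
    ORDER BY r.session_id, r.ts
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The SELECT statement uses only standard SQL, so a similar join could be expressed in Spark SQL once the two tables are available there. Normalizing memory usage by device capacity is one way to make readings comparable across devices with different amounts of RAM.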

Key Actionable Insights

1. Implement a robust data curation process to ensure high-quality datasets for machine learning.
A well-structured dataset is crucial for accurate predictions. By addressing biases and ensuring comprehensive data collection, you can enhance the reliability of your models.
2. Utilize a graded window approach for labeling data to improve classification accuracy.
This approach allows for more nuanced labeling of memory readings, which can lead to better model performance and insights into OOM kill occurrences.
3. Incorporate device-specific characteristics into your predictive models.
Understanding the unique capabilities and limitations of different devices can help tailor predictions and preemptive actions, ultimately enhancing user experience.

Common Pitfalls

1. Failing to account for class imbalance in the dataset can lead to inaccurate predictions.
Since OOM kills are infrequent, the dataset will be skewed towards normal states. A model trained on it can reach high accuracy simply by predicting the normal state every time, unless techniques for balancing the dataset are applied.
2. Neglecting feature engineering can result in suboptimal model performance.
Without careful consideration of which features to include, the model may miss signals in the memory readings and device characteristics that would improve its predictions.

Related Concepts

Machine Learning
Data Engineering
Memory Management
Predictive Analytics