Spike detection in Alert Correlation

Nishant Singh


10 min read • intermediate


Iris

Overview

The article discusses the implementation of spike detection in Alert Correlation at LinkedIn, aimed at improving incident response times and reducing false positives in alert systems. It details the methodologies used for real-time anomaly detection and the significance of utilizing metrics and confidence scores to identify root causes of service issues.

What You'll Learn

1. How to implement median absolute deviation for anomaly detection
2. Why dynamic alert thresholds improve alert accuracy
3. How to classify alerts as real issues or spikes

Prerequisites & Requirements

Understanding of alert correlation and anomaly detection concepts
Familiarity with LinkedIn's monitoring systems and metrics framework (optional)

Key Questions Answered

How does LinkedIn's Alert Correlation improve incident response times?

LinkedIn's Alert Correlation improves incident response times by utilizing a combination of alerts and metrics to identify root causes of service outages. It employs a confidence score to determine the likelihood of a service being responsible for an issue, which helps in prioritizing alerts and reducing false escalations.
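As a rough illustration of how a confidence score can gate escalations, the sketch below picks the most likely root-cause service and returns nothing when no candidate is confident enough. The function name `pick_root_cause`, the service names, and the 0.8 cutoff are all hypothetical; the article does not describe how LinkedIn computes or thresholds the score.

```python
def pick_root_cause(candidates, min_confidence=0.8):
    """Return (service, score) for the most likely root cause, or None.

    candidates: dict mapping a service name to a confidence score in [0, 1].
    min_confidence: hypothetical cutoff below which nothing is escalated.
    """
    if not candidates:
        return None
    # Highest-confidence candidate wins.
    service, score = max(candidates.items(), key=lambda item: item[1])
    # Suppress the recommendation when even the best candidate is weak.
    return (service, score) if score >= min_confidence else None


# Hypothetical scores for two upstream services.
print(pick_root_cause({"service-a": 0.92, "service-b": 0.41}))
```

Returning `None` for weak candidates is one way to avoid paging a team whose service is unlikely to be at fault.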

What is the role of median absolute deviation in spike detection?

Median absolute deviation (MAD) is used in spike detection as a robust measure of spread for flagging outliers in alert data. It is computed around the median of the past 30 minutes of metrics: MAD is the median of each point's absolute deviation from that median, so extreme values do not skew it the way they skew a mean or standard deviation. This improves the accuracy of alerts.
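A minimal sketch of MAD-based outlier scoring over one window of samples follows. It uses the modified z-score convention (scale constant 0.6745, cutoff 3.5), which is a common default rather than something the article states; the function names and sample values are illustrative assumptions.

```python
from statistics import median


def mad_scores(window, scale=0.6745):
    """Return a modified z-score for every sample in the window."""
    med = median(window)
    abs_dev = [abs(x - med) for x in window]
    mad = median(abs_dev)
    if mad == 0:
        # Flat window: samples equal to the median score 0, anything else is infinite.
        return [0.0 if d == 0 else float("inf") for d in abs_dev]
    return [scale * (x - med) / mad for x in window]


def find_outliers(window, cutoff=3.5):
    """Return the indices of samples whose |score| exceeds the cutoff."""
    return [i for i, s in enumerate(mad_scores(window)) if abs(s) > cutoff]


# Illustrative samples; a real window would hold 30 minutes of metric values.
window = [100, 102, 98, 101, 99, 103, 100, 97, 250, 101]
print(find_outliers(window))  # [8]
```

Because both the center and the spread are medians, the single value of 250 barely moves the baseline, which is why it stands out so clearly.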

What metrics are used to classify alerts as real issues or spikes?

Alerts are classified based on several metrics including the duration of the alert, the confidence score, and the number of affected services. This classification helps to distinguish between genuine issues and transient spikes, reducing alert fatigue for on-call engineers.
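The classification described above could be sketched as a simple rule over those three signals. The article does not give the actual rules or cutoffs, so the `AlertSummary` type, the `classify` function, and every threshold below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertSummary:
    duration_minutes: float   # how long the alert has been firing
    confidence: float         # confidence score in [0, 1]
    affected_services: int    # number of services showing the problem


def classify(alert, min_duration=5.0, min_confidence=0.7, min_services=2):
    """Label an alert 'real issue' or 'spike' using hypothetical cutoffs."""
    # Short-lived alerts are treated as transient spikes regardless of score.
    if alert.duration_minutes < min_duration:
        return "spike"
    # A sustained alert counts as real if it is confident or widespread.
    if alert.confidence >= min_confidence or alert.affected_services >= min_services:
        return "real issue"
    return "spike"


print(classify(AlertSummary(duration_minutes=1, confidence=0.95, affected_services=4)))
print(classify(AlertSummary(duration_minutes=12, confidence=0.9, affected_services=1)))
```

Checking duration first reflects the idea that a spike is transient by definition; a production system would tune these cutoffs against labeled incidents.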

Key Statistics & Figures

Percentage of recommendations classified as spikes

36%

Averaged over a week, this share of recommendations was classified as spikes, which significantly improved the quality of alerts posted to Slack.

Accuracy of alert recommendations

99%

This accuracy was achieved after implementing the spike detection algorithm, enhancing the reliability of alerts.

Technologies & Tools


Backend

Auto Metrics Framework

Used for fetching metrics with active alerts to identify root causes of service issues.

Communication

Slack

Used to share alert recommendations and notifications with service owners.

Key Actionable Insights

1. Implement median absolute deviation for real-time anomaly detection to enhance alert accuracy.
This method allows for robust outlier detection, minimizing the impact of extreme values on alert thresholds. By applying MAD, teams can better differentiate between real issues and false positives.

2. Use dynamic alert thresholds that adapt based on historical data trends.
Dynamic thresholds help reduce alert fatigue by ensuring that alerts are triggered only when metrics deviate significantly from normal patterns, improving the overall efficiency of incident response.
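One way to combine the two insights is a threshold that is recomputed from a rolling window as median plus a multiple of MAD. This is a sketch under assumptions, not the article's implementation: the `DynamicThreshold` class, the window of 30 samples, the multiplier `k`, and the `min_band` floor are all hypothetical choices.

```python
from collections import deque
from statistics import median


class DynamicThreshold:
    """Upper alert threshold that adapts to a rolling window of recent samples."""

    def __init__(self, window_size=30, k=3.0, min_samples=5, min_band=1.0):
        self.history = deque(maxlen=window_size)  # oldest samples drop off automatically
        self.k = k                      # how many MADs above the median count as a breach
        self.min_samples = min_samples  # no threshold until this much history exists
        self.min_band = min_band        # floor so a flat history does not yield a zero-width band

    def threshold(self):
        """Return the current threshold, or None while history is too short."""
        if len(self.history) < self.min_samples:
            return None
        med = median(self.history)
        mad = median(abs(x - med) for x in self.history)
        return med + max(self.k * mad, self.min_band)

    def observe(self, value):
        """Check value against the threshold built from past samples, then record it."""
        limit = self.threshold()
        breached = limit is not None and value > limit
        self.history.append(value)
        return breached


detector = DynamicThreshold()
readings = [100, 102, 98, 101, 99, 101, 180]
flags = [detector.observe(r) for r in readings]
print(flags)  # only the final reading (180) breaches
```

Note that breaching values are also appended to the history, so a sustained shift gradually becomes the new normal; whether that is desirable depends on the metric.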

Common Pitfalls

1. Relying on historically configured alerts can lead to false positives, because static thresholds are sensitive to transient anomalies.
Static thresholds cut both ways: set too sensitively, they fire on short-lived anomalies, and set high to avoid false alarms, they can cause genuine issues to be overlooked. Regularly reviewing and adjusting alert configurations can help mitigate this risk.