Incident Report: Spotify Outage on April 16, 2025

Spotify Engineering

Overview

On April 16, 2025, Spotify experienced a significant outage affecting users worldwide. A change in Envoy Proxy filter order triggered a bug that crashed all Envoy instances, and a mismatch between the Envoy max heap size and Kubernetes memory limits prolonged the recovery. Spotify has committed to fixing both issues, improving its rollout process for configuration changes, and enhancing its monitoring.

What You'll Learn

1
How to identify and mitigate risks associated with configuration changes in production systems
2
Why proper resource allocation is crucial in Kubernetes environments
3
How to enhance monitoring capabilities to detect issues early

Prerequisites & Requirements

  • Understanding of Envoy Proxy and Kubernetes concepts
  • Familiarity with cloud infrastructure and monitoring tools (optional)

Key Questions Answered

What caused the Spotify outage on April 16, 2025?
The outage was caused by a change in the order of Envoy filters, which triggered a bug that crashed all Envoy instances simultaneously. It was compounded by a misconfiguration: the Envoy max heap size was set higher than the Kubernetes memory limit, which caused the servers to cycle continuously.
What steps is Spotify taking to prevent future outages?
Spotify is addressing the bug that caused the Envoy crash, fixing the configuration mismatch between Envoy heap size and Kubernetes memory limits, improving the rollout process for configuration changes, and enhancing monitoring capabilities to catch issues sooner.
How did the outage affect different global regions?
The outage affected users worldwide except in the Asia Pacific region. Traffic there was lower because of timezone differences, so the region's Envoy memory usage never reached the Kubernetes limit and it remained unaffected.
What was the timeline of the Spotify outage?
The Envoy filter order was changed at 12:18 UTC, which led to crashes and a significant drop in traffic by 12:20 UTC. Recovery began at 14:20 UTC for Europe and 15:10 UTC for the US, with all traffic patterns normal by 15:40 UTC.
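
The heap size versus memory limit mismatch described above is the kind of thing a pre-deploy check can catch. The following is a minimal sketch, not Spotify's actual tooling: the function names `parse_quantity` and `heap_fits_limit`, the 10% headroom, and the reduced set of quantity suffixes are all assumptions made for illustration.

```python
# Subset of the binary and decimal suffixes used in Kubernetes memory quantities.
# (Illustrative only; the full quantity grammar has more cases.)
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> int:
    """Convert a memory quantity such as '4Gi' or '512M' to bytes."""
    # Try two-letter suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # no suffix: plain byte count


def heap_fits_limit(max_heap_bytes: int, memory_limit: str, headroom: float = 0.1) -> bool:
    """Return True if the heap cap leaves `headroom` of the container limit free.

    `headroom` is a hypothetical safety margin, not a value from the report.
    """
    limit_bytes = parse_quantity(memory_limit)
    return max_heap_bytes <= limit_bytes * (1 - headroom)
```

For example, `heap_fits_limit(5 * 2**30, "4Gi")` returns `False`, flagging a heap cap that sits above the container limit, while `heap_fits_limit(3 * 2**30, "4Gi")` returns `True`.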

Key Statistics & Figures

Duration of outage
3 hours and 25 minutes
The outage occurred between 12:20 and 15:45 UTC.
Time of Envoy filter change
12:18 UTC
This change triggered the subsequent outage.

Technologies & Tools

Networking
Envoy Proxy
Used for managing network traffic and implementing custom filters.
Orchestration
Kubernetes
Used for managing containerized applications and enforcing their memory limits.

Key Actionable Insights

1
Test configuration changes thoroughly before deploying them to production so that potential issues surface early.
This can help prevent incidents like the Spotify outage, where a seemingly low-risk change led to widespread service disruption.
2
Regularly review and adjust resource allocations in Kubernetes to ensure they align with application demands and avoid memory limit issues.
Proper resource management can prevent crashes and keep services stable under load. In the Spotify incident, a max heap size above the Kubernetes memory limit caused servers to cycle continuously.
3
Enhance monitoring systems to provide real-time alerts for unusual traffic patterns or system behavior.
Improved monitoring can help teams respond quickly to emerging issues, potentially mitigating the impact of outages.
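
As one illustration of the monitoring insight above, an alert on unusual traffic can be as simple as comparing the latest request-rate sample with a recent baseline. This is a minimal sketch under stated assumptions: the function name `traffic_dropped`, the five-sample window, and the 50% threshold are hypothetical and not taken from the report.

```python
from statistics import mean


def traffic_dropped(
    samples: list[float],
    baseline_window: int = 5,
    drop_ratio: float = 0.5,
) -> bool:
    """Return True if the latest sample is below drop_ratio * the recent baseline.

    The baseline is the mean of the `baseline_window` samples that precede
    the latest one.
    """
    if len(samples) <= baseline_window:
        return False  # not enough history to form a baseline
    baseline = mean(samples[-baseline_window - 1 : -1])
    return baseline > 0 and samples[-1] < baseline * drop_ratio
```

With samples of `[100, 102, 98, 101, 99, 20]` the baseline is 100 and the last value falls below half of it, so the function returns `True`. A production alert would usually also account for daily traffic cycles, which this sketch ignores.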

Common Pitfalls

1
Applying configuration changes simultaneously across all regions without adequate testing can lead to widespread failures.
This was a key factor in the Spotify outage, where a change considered low-risk triggered a critical bug affecting all Envoy instances.
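
One common way to avoid this pitfall is to roll a change out one region at a time and stop at the first sign of trouble. The sketch below illustrates that idea only; it is not Spotify's rollout process, and `apply_change`, `is_healthy`, and `rollback` are hypothetical callables the caller would supply.

```python
from typing import Callable, Iterable


def staged_rollout(
    regions: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> list[str]:
    """Apply a change one region at a time, halting on the first unhealthy region.

    Returns the regions where the change remains applied.
    """
    applied: list[str] = []
    for region in regions:
        apply_change(region)
        if not is_healthy(region):
            rollback(region)
            break  # leave the remaining regions untouched
        applied.append(region)
    return applied
```

In practice the health check would run after a waiting period long enough for problems to show up in metrics. The sketch leaves that delay to the `is_healthy` callable.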

Related Concepts

Envoy Proxy
Kubernetes Resource Management
Traffic Management Strategies