Overview
On March 8, 2022, Spotify experienced a global outage caused by problems in its cloud-hosted service discovery system, primarily linked to Google Cloud Traffic Director and a bug in the gRPC library. Users were logged out and could not log back in. Spotify began rolling out fixes at 18:39 UTC and fully mitigated the incident by 20:35 UTC.
What You'll Learn
1. How to identify and mitigate service discovery issues in microservices
2. Why monitoring and alerting are crucial for cloud services
3. When to revert to DNS-based service discovery for stability
Key Questions Answered
What caused the Spotify outage on March 8, 2022?
The outage was caused by an issue with Google Cloud Traffic Director and a bug in the gRPC library, which prevented users from logging back into the Spotify app after being logged out. This highlighted vulnerabilities in the service discovery systems used by Spotify.
What steps did Spotify take to resolve the outage?
Spotify began implementing fixes at 18:39 UTC, reverting affected systems to DNS-based service discovery, which gradually restored functionality. The incident was fully mitigated by 20:35 UTC.
What improvements will Spotify implement post-outage?
Spotify plans to work with Google Cloud to understand the Traffic Director issues better, enhance monitoring and alerting systems, and invest in resiliency measures to prevent similar outages in the future.
Key Statistics & Figures
Incident start time
18:12 UTC / 13:12 ET
This is when reports of users being logged out began to surface.
Remediation start time
18:39 UTC / 13:39 ET
This marks when Spotify began implementing fixes to restore affected systems.
Incident resolution time
20:35 UTC / 15:35 ET
This is when the incident was fully mitigated at Spotify.
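The three timestamps above imply how long each phase of the incident lasted. The following is a small sketch of that arithmetic, not part of Spotify's report. The variable names are illustrative, and the fixed UTC-5 offset is used because US daylight saving time had not yet started on March 8, 2022.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Fixed UTC-5 offset: US Eastern was still on standard time on March 8, 2022.
EASTERN = timezone(timedelta(hours=-5), "ET")

# Timestamps taken from the figures listed above.
incident_start = datetime(2022, 3, 8, 18, 12, tzinfo=UTC)
remediation_start = datetime(2022, 3, 8, 18, 39, tzinfo=UTC)
incident_resolved = datetime(2022, 3, 8, 20, 35, tzinfo=UTC)


def minutes_between(start, end):
    """Return the whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


time_to_remediation = minutes_between(incident_start, remediation_start)
remediation_duration = minutes_between(remediation_start, incident_resolved)
total_duration = minutes_between(incident_start, incident_resolved)

print(time_to_remediation)   # 27
print(remediation_duration)  # 116
print(total_duration)        # 143
print(incident_start.astimezone(EASTERN).strftime("%H:%M %Z"))  # 13:12 ET
```

In other words, fixes began 27 minutes after the first reports, and the incident lasted 2 hours 23 minutes in total.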
Technologies & Tools
Cloud Service
Google Cloud Traffic Director
Used for service discovery in some of Spotify's microservices.
Library
gRPC
A client library that had a bug contributing to the outage.
Key Actionable Insights
1. Implement robust monitoring and alerting systems to catch service discovery issues early. By enhancing monitoring, teams can identify potential outages before they impact users, ensuring a more reliable service.
2. Consider reverting to DNS-based service discovery as a fallback during critical outages. This strategy can provide immediate stability while investigating the root cause of the failure in more complex systems.
3. Collaborate closely with cloud service providers to understand their systems and potential points of failure. Building a strong partnership can lead to quicker resolutions and better preparedness for future incidents.
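The DNS fallback idea in the second insight can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how Spotify or Traffic Director implement discovery. `resolve_with_fallback`, `failing_primary`, and `static_fallback` are hypothetical names, and the primary resolver stands in for any control-plane lookup. Only `socket.getaddrinfo` is a real standard-library call.

```python
import socket
from typing import Callable, List

# A resolver maps a service name to a list of endpoint addresses.
Resolver = Callable[[str], List[str]]


def dns_resolver(hostname: str) -> List[str]:
    """Resolve a hostname to IP addresses using the system DNS resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolve_with_fallback(service: str, primary: Resolver, fallback: Resolver) -> List[str]:
    """Try the primary resolver; use the fallback if it fails or returns nothing."""
    try:
        endpoints = primary(service)
        if endpoints:
            return endpoints
    except Exception:
        # Broad catch is deliberate in this sketch: any primary failure triggers the fallback.
        pass
    return fallback(service)


# Hypothetical stand-ins used only to demonstrate the control flow.
def failing_primary(service: str) -> List[str]:
    raise ConnectionError("control plane unavailable")


def static_fallback(service: str) -> List[str]:
    return ["10.0.0.2"]


print(resolve_with_fallback("login-service", failing_primary, static_fallback))
```

In a real client, `dns_resolver` could be passed as the `fallback` argument so that a control-plane failure degrades to plain DNS lookups instead of failing requests outright.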
Common Pitfalls
1. Relying solely on one service discovery method can lead to significant outages.
When issues arise in that method, as seen with Traffic Director, it can cause widespread service disruptions. Diversifying service discovery strategies can mitigate this risk.
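One way to notice that a single discovery method is degrading is to track its recent failure rate and alert once it crosses a threshold. The sketch below assumes a simple sliding window over the last N lookups. `FailureRateMonitor`, the window size, and the threshold are illustrative choices, not values from the incident report.

```python
from collections import deque


class FailureRateMonitor:
    """Track the outcomes of the most recent discovery lookups."""

    def __init__(self, window_size: int = 100, threshold: float = 0.5):
        # deque with maxlen drops the oldest result automatically.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        """Record one lookup outcome (True for success, False for failure)."""
        self.results.append(success)

    def failure_rate(self) -> float:
        """Return the fraction of failed lookups in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        """Alert only once the window is full and the failure rate meets the threshold."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.failure_rate() >= self.threshold


monitor = FailureRateMonitor(window_size=4, threshold=0.5)
for outcome in [True, True, False]:
    monitor.record(outcome)
alert_before_full = monitor.should_alert()  # window not yet full

monitor.record(False)
alert_when_full = monitor.should_alert()    # 2 of 4 lookups failed

print(alert_before_full, alert_when_full)   # False True
```

Requiring a full window before alerting avoids paging on the first failed lookup after startup. The same signal could also be used to trigger a switch to a second discovery method.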