Incident Report: Spotify Outage on March 8, 2022

Spotify Engineering

Google Cloud, gRPC

Overview

On March 8, 2022, Spotify experienced a global outage caused by problems in its cloud-hosted service discovery system, specifically an issue with Google Cloud Traffic Director combined with a bug in the gRPC client library. Users were logged out and could not log back in. Spotify began remediation at 18:39 UTC and fully mitigated the incident by 20:35 UTC.

What You'll Learn

1. How to identify and mitigate service discovery issues in microservices

2. Why monitoring and alerting are crucial for cloud services

3. When to revert to DNS-based service discovery for stability

Key Questions Answered

What caused the Spotify outage on March 8, 2022?

The outage was caused by an issue with Google Cloud Traffic Director and a bug in the gRPC library, which prevented users from logging back into the Spotify app after being logged out. This highlighted vulnerabilities in the service discovery systems used by Spotify.

What steps did Spotify take to resolve the outage?

Spotify began implementing fixes at 18:39 UTC, reverting affected systems to DNS-based service discovery, which gradually restored functionality. The incident was fully mitigated by 20:35 UTC.

What improvements will Spotify implement post-outage?

Spotify plans to work with Google Cloud to understand the Traffic Director issues better, enhance monitoring and alerting systems, and invest in resiliency measures to prevent similar outages in the future.

Key Statistics & Figures

Incident start time

18:12 UTC / 13:12 ET

This is when reports of users being logged out began to surface.

Remediation start time

18:39 UTC / 13:39 ET

This marks when Spotify began implementing fixes to restore affected systems.

Incident resolution time

20:35 UTC / 15:35 ET

This is when the incident was fully mitigated at Spotify.
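
The three timestamps above imply how long each phase lasted. The short calculation below derives those durations from the reported UTC times only. The phase names (`time_to_remediation` and so on) are labels chosen here for illustration, not terms from the incident report.

```python
from datetime import datetime, timezone

# Timestamps taken from the figures above (all UTC, March 8, 2022).
incident_start = datetime(2022, 3, 8, 18, 12, tzinfo=timezone.utc)
remediation_start = datetime(2022, 3, 8, 18, 39, tzinfo=timezone.utc)
incident_resolved = datetime(2022, 3, 8, 20, 35, tzinfo=timezone.utc)


def minutes_between(start, end):
    """Return the whole number of minutes between two datetimes."""
    return int((end - start).total_seconds() // 60)


# Illustrative labels, not terminology from the report.
time_to_remediation = minutes_between(incident_start, remediation_start)
remediation_duration = minutes_between(remediation_start, incident_resolved)
total_duration = minutes_between(incident_start, incident_resolved)

print(time_to_remediation, remediation_duration, total_duration)  # 27 116 143
```

In other words, roughly 27 minutes passed between the first reports and the start of remediation, and the incident lasted about 2 hours 23 minutes in total.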

Technologies & Tools

Cloud Service

Google Cloud Traffic Director

Used for service discovery in some of Spotify's microservices.

Library

gRPC

A client library that had a bug contributing to the outage.

Key Actionable Insights

1. Implement robust monitoring and alerting systems to catch service discovery issues early.
By enhancing monitoring, teams can identify potential outages before they impact users, ensuring a more reliable service.

2. Consider reverting to DNS-based service discovery as a fallback during critical outages.
This strategy can provide immediate stability while investigating the root cause of the failure in more complex systems.

3. Collaborate closely with cloud service providers to understand their systems and potential points of failure.
Building a strong partnership can lead to quicker resolutions and better preparedness for future incidents.
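
To make the first insight more concrete, here is a minimal sketch of one way to alert on service discovery failures: track the outcome of recent lookups in a sliding window and flag when the failure rate crosses a threshold. The class name `DiscoveryMonitor`, the window size, and the threshold values are all hypothetical choices for illustration. The source does not describe how Spotify's monitoring works.

```python
from collections import deque


class DiscoveryMonitor:
    """Tracks recent discovery lookups and flags a high failure rate.

    Hypothetical helper for illustration; all defaults are assumptions.
    """

    def __init__(self, window_size=100, failure_threshold=0.2, min_samples=10):
        # Only the most recent `window_size` outcomes are kept.
        self.results = deque(maxlen=window_size)
        self.failure_threshold = failure_threshold
        self.min_samples = min_samples

    def record(self, success):
        """Record the outcome of one discovery lookup."""
        self.results.append(bool(success))

    def failure_rate(self):
        """Fraction of failed lookups in the current window."""
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def should_alert(self):
        """Alert only once enough samples exist to avoid noisy startup alerts."""
        if len(self.results) < self.min_samples:
            return False
        return self.failure_rate() >= self.failure_threshold


monitor = DiscoveryMonitor(window_size=10, failure_threshold=0.5, min_samples=4)
for outcome in (True, True, False, False, False):
    monitor.record(outcome)
print(monitor.should_alert())  # True: 3 of 5 lookups failed
```

In practice the `should_alert` signal would feed an existing alerting pipeline rather than a print statement. The point of the sketch is only that discovery health can be measured directly instead of inferred from downstream symptoms such as failed logins.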

Common Pitfalls

1. Relying solely on one service discovery method can lead to significant outages.

When issues arise in that method, as seen with Traffic Director, it can cause widespread service disruptions. Diversifying service discovery strategies can mitigate this risk.
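
One way to diversify is to wrap the primary discovery mechanism with a DNS fallback. The sketch below is an assumption-laden illustration, not Spotify's implementation: according to the report, Spotify reverted affected systems to DNS-based discovery during remediation, whereas this code falls back automatically. `primary_lookup` is a hypothetical callable standing in for whatever control-plane client a service uses, and `discover` is a name invented here. Only `socket.getaddrinfo` is a real standard-library call.

```python
import socket


def resolve_via_dns(hostname, port):
    """Resolve a hostname to sorted, unique (ip, port) pairs via the system resolver."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # info[4] is the socket address; its first two fields are (ip, port)
    # for both IPv4 and IPv6 results.
    return sorted({(info[4][0], info[4][1]) for info in infos})


def discover(service, port, primary_lookup, dns_lookup=resolve_via_dns):
    """Try the primary discovery mechanism, then fall back to DNS.

    `primary_lookup` is a hypothetical callable taking a service name and
    returning a list of (ip, port) endpoints. Returns (endpoints, source).
    """
    try:
        endpoints = primary_lookup(service)
        if endpoints:
            return list(endpoints), "primary"
    except Exception:
        # Sketch only: a real client would log the error and emit a metric here.
        pass
    # Primary failed or returned no endpoints: use DNS instead.
    return dns_lookup(service, port), "dns"


def unavailable_control_plane(service):
    """Simulates a primary discovery backend that is down."""
    raise RuntimeError("control plane unavailable")


# A stubbed DNS lookup keeps the example independent of real network access.
endpoints, source = discover(
    "login.example.internal",
    443,
    unavailable_control_plane,
    dns_lookup=lambda host, port: [("10.0.0.1", port)],
)
print(source, endpoints)  # dns [('10.0.0.1', 443)]
```

Returning the source alongside the endpoints lets callers record how often the fallback path is taken, which is itself a useful signal that the primary mechanism is degraded.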