Slack’s Outage on January 4th 2021

"And now we welcome the new year. Full of things that have never been." (Rainer Maria Rilke)

January 4th 2021 was the first working day of the year for many around the globe, and for most of us at Slack too (except of course for our on-callers and our customer experience team, who never…

Laura Nolan
10 min read · advanced

Overview

This article details the outage experienced by Slack on January 4th, 2021, highlighting the causes, the incident response, and the lessons learned. It discusses the impact of network degradation on service availability and the subsequent recovery efforts involving AWS infrastructure.

What You'll Learn

1. How to effectively manage incident response during service outages
2. Why monitoring tools are critical for diagnosing infrastructure issues
3. When to escalate network issues to cloud providers like AWS

Prerequisites & Requirements

  • Understanding of cloud infrastructure and incident management
  • Familiarity with monitoring and alerting tools (optional)

Key Questions Answered

What caused Slack's outage on January 4th, 2021?
The outage was primarily caused by network degradation within AWS infrastructure, which led to packet loss and increased latency. The degraded network affected Slack's ability to serve messages, resulting in a significant drop in service availability.
How did Slack respond to the incident?
Slack initiated its incident response protocol, rolling back recent changes and escalating network issues to AWS. They faced challenges due to the unavailability of their monitoring dashboards, which hampered their ability to diagnose the problem effectively.
What lessons did Slack learn from the outage?
Slack learned the importance of having independent monitoring tools and the need to regularly load test their provisioning services. They also recognized the necessity of preemptively scaling AWS resources after holiday periods to prevent similar incidents.

Key Statistics & Figures

Slack message success rate
99%
During the outage, this was a significant drop from the usual success rate of over 99.999%.
Number of servers added to the web tier
1,200
Slack attempted to add this many servers between 7:01am PST and 7:15am PST to handle increased load.
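
To put these two figures in perspective, a quick back-of-the-envelope calculation helps. The sketch below uses only the numbers quoted above; the function and variable names are illustrative, and the 14-minute window is simply the gap between 7:01am and 7:15am.

```python
from fractions import Fraction


def failure_rate(success_percent: str) -> Fraction:
    """Convert a success percentage such as '99.999' into an exact failure fraction."""
    return 1 - Fraction(success_percent) / 100


# Figures quoted in the summary above.
usual = failure_rate("99.999")   # at most 1 failed send per 100,000
degraded = failure_rate("99")    # 1 failed send per 100

# How many times more sends failed than in normal operation.
failure_increase = degraded / usual

# 1,200 servers requested between 7:01am and 7:15am PST.
servers_added = 1200
window_minutes = 15 - 1
servers_per_minute = servers_added / window_minutes

print(f"Failure rate rose by a factor of about {int(failure_increase)}")
print(f"Provisioning pace: about {servers_per_minute:.0f} servers per minute")
```

Because the usual success rate is described as "over" 99.999%, the factor computed here is a lower bound: a drop to 99% means at least a thousandfold increase in failed sends.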

Key Actionable Insights

1. Implement independent monitoring systems that are not reliant on the same infrastructure as your primary services.
   This ensures that even during outages, you can still monitor the health of your services and quickly diagnose issues.
2. Regularly conduct load testing on critical services like provisioning to identify bottlenecks before they impact production.
   This proactive approach helps in understanding how your systems will behave under stress and allows for timely adjustments.
3. Establish clear escalation protocols with cloud providers to address network issues swiftly.
   Having a direct line of communication can significantly reduce downtime and improve response times during incidents.
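
The load-testing insight can be illustrated with a small stepped load test. This is a minimal sketch, not Slack's tooling: `run_load_step` and `fake_provision` are hypothetical names, and `fake_provision` merely sleeps as a stand-in for a real call to a provisioning service. The idea is to raise concurrency step by step and watch where failures or tail latency start to climb.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_load_step(call: Callable[[], None], concurrency: int, requests: int) -> Dict[str, float]:
    """Issue `requests` calls with `concurrency` worker threads and summarize the outcome."""

    def timed_call(_: int):
        start = time.perf_counter()
        try:
            call()
            ok = True
        except Exception:
            # Any exception counts as a failed request for this step.
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(requests)))

    latencies = sorted(latency for _, latency in results)
    # Nearest-rank 95th percentile of the observed latencies.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "concurrency": concurrency,
        "requests": requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "p95_s": latencies[p95_index],
    }


def fake_provision() -> None:
    """Hypothetical stand-in for a provisioning request."""
    time.sleep(0.001)


# Step the concurrency up and compare failures and tail latency per step.
for concurrency in (1, 4, 16):
    print(run_load_step(fake_provision, concurrency, requests=32))
```

In a real test, `fake_provision` would be replaced by a call against a staging copy of the service, and the steps would extend past the peak request rate expected in production.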

Common Pitfalls

1. Relying on a single monitoring system that depends on the same infrastructure as the services it watches can lead to blind spots during outages.
   This can prevent teams from diagnosing issues effectively, as seen when Slack's monitoring tools failed during the incident.
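
One way to avoid this blind spot is a small health probe that runs from somewhere outside the primary infrastructure, such as a different provider or network. The sketch below is an assumption-laden illustration rather than a description of Slack's setup: `probe`, `http_status`, and `ProbeResult` are hypothetical names, and the URL to check is whatever health endpoint your service exposes. The fetch function is injectable so the probe logic can be exercised without network access.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool          # True only for a 2xx response
    latency_s: float  # wall-clock time spent on the attempt
    detail: str       # status code or error description


def http_status(url: str, timeout_s: float) -> int:
    """Fetch `url` and return the HTTP status code; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(
    url: str,
    fetch: Callable[[str, float], int] = http_status,
    timeout_s: float = 5.0,
) -> ProbeResult:
    """Check one endpoint and report success, latency, and a short detail string."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
    except Exception as exc:
        # Timeouts, DNS failures, refused connections, and HTTP error
        # responses raised by urllib all count as a failed probe.
        return ProbeResult(False, time.monotonic() - start, f"error: {exc}")
    return ProbeResult(200 <= status < 300, time.monotonic() - start, f"status {status}")
```

Calling `probe(url)` on a schedule and alerting through a channel that does not depend on the monitored service keeps at least one signal available when the main dashboards are unreachable.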

Related Concepts

Incident Management Best Practices
Cloud Infrastructure Scaling
Network Performance Monitoring