Slack’s Outage on January 4th 2021

Laura Nolan

"And now we welcome the new year. Full of things that have never been." (Rainer Maria Rilke)

January 4th 2021 was the first working day of the year for many around the globe, and for most of us at Slack too (except of course for our on-callers and our customer experience team, who never…

Slack


Overview

This article details the outage experienced by Slack on January 4th, 2021, highlighting the causes, the incident response, and the lessons learned. It discusses the impact of network degradation on service availability and the subsequent recovery efforts involving AWS infrastructure.

What You'll Learn

1. How to effectively manage incident response during service outages
2. Why monitoring tools are critical for diagnosing infrastructure issues
3. When to escalate network issues to cloud providers like AWS

Prerequisites & Requirements

Understanding of cloud infrastructure and incident management
Familiarity with monitoring and alerting tools (optional)

Key Questions Answered

What caused Slack's outage on January 4th, 2021?

The outage was primarily caused by network degradation within AWS infrastructure, which led to packet loss and increased latency. This degradation impaired Slack's ability to serve messages and caused a significant drop in service availability.

How did Slack respond to the incident?

Slack initiated its incident response protocol, rolling back recent changes and escalating network issues to AWS. They faced challenges due to the unavailability of their monitoring dashboards, which hampered their ability to diagnose the problem effectively.

What lessons did Slack learn from the outage?

Slack learned the importance of having independent monitoring tools and the need to regularly load test their provisioning services. They also recognized the necessity of preemptively scaling AWS resources after holiday periods to prevent similar incidents.

Key Statistics & Figures

Slack message success rate

99%

During the outage, the message success rate fell to about 99%, a significant drop from the usual rate of over 99.999%.
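
To put the two success rates in perspective, the short Python calculation below converts each one into a failure fraction and compares them. It is an illustrative piece of arithmetic using only the two figures quoted here; the function and variable names are invented for the example. Because the usual rate is stated as "over 99.999%", the result is a lower bound on the increase.

```python
def failure_rate(success_rate_percent: float) -> float:
    """Convert a success rate in percent into a failure fraction."""
    return (100.0 - success_rate_percent) / 100.0


# Figures quoted in this summary; the names are illustrative only.
usual = failure_rate(99.999)   # about 1 failed request per 100,000
outage = failure_rate(99.0)    # about 1 failed request per 100

increase = outage / usual
print(f"Failures rose by at least roughly {increase:.0f}x")
```

A drop that looks small when written as a percentage therefore corresponds to about a thousandfold rise in failed requests.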

Number of servers added to the web tier

1,200

Slack attempted to add this many servers between 7:01am PST and 7:15am PST to handle increased load.

Technologies & Tools


Cloud Infrastructure

AWS

Used as the primary cloud provider for hosting Slack's services.

Key Actionable Insights

1. Implement independent monitoring systems that are not reliant on the same infrastructure as your primary services.
   This ensures that even during outages, you can still monitor the health of your services and quickly diagnose issues.

2. Regularly conduct load testing on critical services like provisioning to identify bottlenecks before they impact production.
   This proactive approach helps in understanding how your systems will behave under stress and allows for timely adjustments.

3. Establish clear escalation protocols with cloud providers to address network issues swiftly.
   Having a direct line of communication can significantly reduce downtime and improve response times during incidents.
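
The load-testing insight can be sketched in a few lines of Python. This is a minimal illustration, not Slack's tooling: `provision_server` is a hypothetical stub that sleeps briefly in place of a real provisioning request, and the request count and concurrency are arbitrary assumptions. The point is the shape of the test: issue many concurrent requests, then look at the latency distribution rather than the average alone.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def provision_server(server_id: int) -> float:
    """Hypothetical stand-in for one provisioning call; returns latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for the real request to the provisioning service
    return time.perf_counter() - start


def run_load_test(total_requests: int, concurrency: int) -> dict:
    """Issue provisioning calls concurrently and summarize their latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(provision_server, range(total_requests)))
    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    return {
        "count": len(latencies),
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
    }


# Assumed numbers for illustration only.
report = run_load_test(total_requests=200, concurrency=20)
print(report)
```

Raising `concurrency` step by step and watching where the 95th percentile starts to climb is one simple way to find the point at which a provisioning path begins to saturate.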

Common Pitfalls

1. Relying on a single monitoring system that is dependent on the same infrastructure can lead to blind spots during outages.
   This can prevent teams from diagnosing issues effectively, as seen when Slack's monitoring tools failed during the incident.
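
One way to avoid this pitfall is a small health probe that runs from outside the primary network path. The Python sketch below is an assumption-laden illustration, not a description of Slack's monitoring: the `probe` function, its `fetch` parameter, and the example URL are all invented for this example. It uses only the standard library so that it has as few dependencies on shared infrastructure as possible.

```python
import time
import urllib.request


def probe(url: str, timeout_s: float = 2.0, fetch=urllib.request.urlopen) -> dict:
    """Check one endpoint and report whether it answered with a 2xx status.

    `fetch` is injectable so the probe can be exercised without network access.
    """
    start = time.monotonic()
    try:
        with fetch(url, timeout=timeout_s) as response:
            healthy = 200 <= response.status < 300
    except OSError as exc:
        # urllib's URLError and HTTPError, and socket timeouts, all derive from OSError.
        return {"url": url, "healthy": False, "error": str(exc)}
    return {"url": url, "healthy": healthy, "latency_s": time.monotonic() - start}


# Example usage (hypothetical endpoint), run from a host outside the primary network:
# print(probe("https://status.example.com/health"))
```

Running such a probe from a separate provider or region, and alerting through a channel that does not depend on the monitored service, keeps at least one signal available when the main dashboards are unreachable.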

Related Concepts

Incident Management Best Practices

Cloud Infrastructure Scaling

Network Performance Monitoring
