All Hands on Deck

Ryan Katkov

This story speaks to the process behind incident response at Slack and uses the May 12th, 2020 outage as an example. For a deeper technical review of the same outage, read Laura Nolan’s post, “A Terrible, Horrible, No-Good, Very Bad Day at Slack”. Slack is a critical tool for millions of people, so it’s natural when…

Slack • 11 min read • Intermediate

Tags: Chef, HAProxy, Machine Learning, PagerDuty

Overview

The article 'All Hands on Deck' details Slack's incident response process during a significant outage on May 12, 2020. It outlines the steps taken to restore service, the roles involved, and the importance of a structured response to ensure reliability for millions of users.

What You'll Learn

1. How to invoke an incident response using Slack's custom Incident Bot
2. Why a structured incident response process is essential for service reliability
3. When to escalate an incident to a Sev-1 level
4. How to conduct an effective incident review to prevent future issues

Prerequisites & Requirements

Understanding of incident management processes
Experience in software engineering or operations (optional)

Key Questions Answered

What steps does Slack take during an incident response?

Slack's incident response involves detecting issues, assembling a response team via the Incident Bot, verifying the scope of the incident, dispatching additional resources, and mitigating the problem. This structured approach ensures a coordinated effort to restore service efficiently.
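
The sequence described above can be modeled as an ordered set of phases. The following Python sketch is illustrative only: the phase names are taken from the answer above, but the `Incident` class and its `advance` method are hypothetical and do not reflect Slack's actual tooling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # Phase names follow the steps listed above; the numeric values encode order.
    DETECT = 1
    ASSEMBLE = 2
    VERIFY_SCOPE = 3
    DISPATCH = 4
    MITIGATE = 5


@dataclass
class Incident:
    """Hypothetical incident record; not Slack's actual data model."""
    title: str
    phase: Phase = Phase.DETECT
    log: list = field(default_factory=list)

    def advance(self, note: str) -> Phase:
        """Record a note for the current phase, then move to the next phase.

        Raises ValueError if the incident is already in the final phase.
        """
        if self.phase is Phase.MITIGATE:
            raise ValueError("incident is already in the final phase")
        self.log.append((self.phase.name, note))
        self.phase = Phase(self.phase.value + 1)
        return self.phase


incident = Incident(title="Elevated API errors")
incident.advance("alert fired")
incident.advance("response team assembled")
print(incident.phase.name)  # VERIFY_SCOPE
```

Keeping a note per phase gives the later incident review a simple timeline to work from.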

How does Slack manage communication during outages?

During outages, Slack utilizes alternative communication methods such as Zoom calls and company-wide emails to keep teams informed. This ensures that all stakeholders are aware of the situation and can contribute to the resolution process effectively.

What is the Major Incident Command structure at Slack?

The Major Incident Command structure at Slack involves trained engineers who facilitate incident responses, ensuring that each development team is responsible for their services. This model enhances operational awareness and distributes the incident management workload.

What does the All Clear signal indicate in incident management?

The All Clear signal indicates that the incident has been resolved, and systems are stable. It involves reviewing signals, assessing risks of recurrence, and ensuring that all customer issues have been addressed before officially closing the incident.
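
As a rough illustration, the All Clear decision can be treated as a checklist in which every condition must hold. The sketch below assumes three boolean checks paraphrased from the answer above; the class, field, and function names are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class AllClearChecks:
    """Hypothetical checklist; each field paraphrases a condition from the text."""
    signals_stable: bool             # monitoring signals reviewed and healthy
    recurrence_risk_assessed: bool   # risk of the problem recurring was evaluated
    customer_issues_addressed: bool  # no outstanding customer-facing issues


def outstanding_items(checks: AllClearChecks) -> List[str]:
    """Return the names of the checks that still block an All Clear."""
    return [name for name, ok in asdict(checks).items() if not ok]


def can_declare_all_clear(checks: AllClearChecks) -> bool:
    """An All Clear is allowed only when no check is outstanding."""
    return not outstanding_items(checks)


checks = AllClearChecks(
    signals_stable=True,
    recurrence_risk_assessed=True,
    customer_issues_addressed=False,
)
print(can_declare_all_clear(checks))  # False
print(outstanding_items(checks))      # ['customer_issues_addressed']
```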

Key Statistics & Figures

Time taken to resolve the outage

48 minutes

The service was restored 48 minutes after the start of the outage, demonstrating the effectiveness of Slack's incident response process.

API service scaling

80%

The API service automatically scaled up by 80% to accommodate unexpected demand during the incident.

Technologies & Tools

Backend

HAProxy

Used for load balancing in Slack's web service architecture.

Backend

consul-template

Manages the configuration of active servers for Slack's web service backend.

Tools

PagerDuty

Used for alerting critical responders during incidents.

Key Actionable Insights

1. Implement a structured incident response process similar to Slack's to enhance reliability.
A well-defined process allows teams to respond quickly and effectively during outages, minimizing downtime and customer impact.

2. Utilize tools like incident bots to streamline communication and coordination during incidents.
Automating the assembly of response teams can significantly reduce response times and improve the overall efficiency of incident management.

3. Conduct regular incident reviews to foster a culture of learning and continuous improvement.
Reviewing incidents helps teams identify weaknesses in their processes and implement changes to prevent future occurrences.
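
To make the second insight concrete, here is a minimal sketch of how a bot might decide whom to page when an incident is opened. Everything in it is an assumption for illustration: the `ON_CALL` roster, the `Page` record, and `assemble_response_team` are hypothetical and are not Slack's Incident Bot or any paging service's API. Actually delivering the pages is left out.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical on-call roster: service name -> responder to page.
ON_CALL = {
    "webapp": "alice",
    "load-balancer": "bob",
    "database": "carol",
}


@dataclass
class Page:
    responder: str
    reason: str


def assemble_response_team(
    summary: str, affected_services: List[str], commander: str
) -> List[Page]:
    """Build the list of pages to send for a new incident.

    The incident commander is always paged first. Services without an
    on-call entry are skipped rather than guessed, and each responder
    is paged at most once.
    """
    pages = [Page(commander, f"Incident commander for: {summary}")]
    seen = {commander}
    for service in affected_services:
        responder = ON_CALL.get(service)
        if responder is None or responder in seen:
            continue
        seen.add(responder)
        pages.append(Page(responder, f"{service} on-call for: {summary}"))
    return pages


team = assemble_response_team(
    "Elevated API errors", ["webapp", "load-balancer", "unknown"], "dana"
)
print([p.responder for p in team])  # ['dana', 'alice', 'bob']
```

Deriving the team from a roster keeps the person opening the incident from having to remember who owns which service.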

Common Pitfalls

1. Failing to have a clear escalation path can lead to confusion during incidents.
Without defined roles and responsibilities, teams may struggle to coordinate effectively, prolonging the resolution process.

2. Neglecting to conduct incident reviews can result in repeated mistakes.
If teams do not analyze past incidents, they miss opportunities to learn and improve their processes, increasing the risk of future outages.

Related Concepts

Incident Management Best Practices

Reliability Engineering Principles

Incident Response Frameworks
