All Hands on Deck

This story speaks to the process behind incident response at Slack and uses the May 12th, 2020 outage as an example. For a deeper technical review of the same outage, read Laura Nolan’s post, “A Terrible, Horrible, No-Good, Very Bad Day at Slack.” Slack is a critical tool for millions of people, so it’s natural when…

Ryan Katkov
11 min read, intermediate

Overview

The article 'All Hands on Deck' details Slack's incident response process during a significant outage on May 12, 2020. It outlines the steps taken to restore service, the roles involved, and the importance of a structured response to ensure reliability for millions of users.

What You'll Learn

1. How to invoke an incident response using Slack's custom Incident Bot
2. Why a structured incident response process is essential for service reliability
3. When to escalate an incident to a Sev-1 level
4. How to conduct an effective incident review to prevent future issues

Prerequisites & Requirements

  • Understanding of incident management processes
  • Experience in software engineering or operations (optional)

Key Questions Answered

What steps does Slack take during an incident response?
Slack's incident response involves detecting issues, assembling a response team via the Incident Bot, verifying the scope of the incident, dispatching additional resources, and mitigating the problem. This structured approach ensures a coordinated effort to restore service efficiently.
How does Slack manage communication during outages?
During outages, Slack utilizes alternative communication methods such as Zoom calls and company-wide emails to keep teams informed. This ensures that all stakeholders are aware of the situation and can contribute to the resolution process effectively.
What is the Major Incident Command structure at Slack?
Under the Major Incident Command structure at Slack, trained engineers facilitate incident responses while each development team remains responsible for its own services. This model enhances operational awareness and distributes the incident management workload.
What does the All Clear signal indicate in incident management?
The All Clear signal indicates that the incident has been resolved, and systems are stable. It involves reviewing signals, assessing risks of recurrence, and ensuring that all customer issues have been addressed before officially closing the incident.
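
The response steps and the All Clear conditions described above can be sketched as a small state machine. This is only an illustration, not Slack's Incident Bot or any real tooling: the names Phase, Incident, advance, and ready_for_all_clear are hypothetical, and the three boolean checks are simply the three All Clear conditions named in the answer above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """Response phases, in the order the summary lists them (names are hypothetical)."""
    DETECT = 1
    ASSEMBLE = 2
    VERIFY_SCOPE = 3
    DISPATCH = 4
    MITIGATE = 5
    ALL_CLEAR = 6


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DETECT
    # Hypothetical checklist mirroring the three All Clear conditions in the text.
    signals_reviewed: bool = False
    recurrence_risk_assessed: bool = False
    customer_issues_addressed: bool = False
    log: list = field(default_factory=list)

    def ready_for_all_clear(self) -> bool:
        """True only when every All Clear condition has been met."""
        return (
            self.signals_reviewed
            and self.recurrence_risk_assessed
            and self.customer_issues_addressed
        )

    def advance(self) -> Phase:
        """Move to the next phase; entering ALL_CLEAR is gated on the checklist."""
        if self.phase is Phase.ALL_CLEAR:
            raise ValueError("incident is already closed")
        next_phase = Phase(self.phase.value + 1)
        if next_phase is Phase.ALL_CLEAR and not self.ready_for_all_clear():
            raise ValueError("all-clear checklist is incomplete")
        self.log.append(f"{self.phase.name} -> {next_phase.name}")
        self.phase = next_phase
        return next_phase
```

In this sketch, calling advance() four times on a new Incident moves it from DETECT to MITIGATE. A fifth call raises ValueError until all three checklist fields are set to True, which models the idea that an incident is not closed just because mitigation has been applied.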

Key Statistics & Figures

Time taken to resolve the outage
48 minutes
The service was restored 48 minutes after the start of the outage, demonstrating the effectiveness of Slack's incident response process.
API service scaling
80%
The API service automatically scaled up by 80% to accommodate unexpected demand during the incident.

Technologies & Tools

HAProxy (Backend)
Used for load balancing in Slack's web service architecture.
consul-template (Backend)
Manages the configuration of active servers for Slack's web service backend.
PagerDuty (Tools)
Used for alerting critical responders during incidents.

Key Actionable Insights

1. Implement a structured incident response process similar to Slack's to enhance reliability.
A well-defined process allows teams to respond quickly and effectively during outages, minimizing downtime and customer impact.
2. Utilize tools like incident bots to streamline communication and coordination during incidents.
Automating the assembly of response teams can significantly reduce response times and improve the overall efficiency of incident management.
3. Conduct regular incident reviews to foster a culture of learning and continuous improvement.
Reviewing incidents helps teams identify weaknesses in their processes and implement changes to prevent future occurrences.

Common Pitfalls

1. Failing to have a clear escalation path can lead to confusion during incidents.
Without defined roles and responsibilities, teams may struggle to coordinate effectively, prolonging the resolution process.
2. Neglecting to conduct incident reviews can result in repeated mistakes.
If teams do not analyze past incidents, they miss opportunities to learn and improve their processes, increasing the risk of future outages.

Related Concepts

Incident Management Best Practices
Reliability Engineering Principles
Incident Response Frameworks