Disasterpiece Theater: Slack’s process for approachable Chaos Engineering

Richard Crowley

Slack is a large and complex piece of software that’s been added to and changed many times over the last five years. We added features, grew to 10,000,000 DAUs, and made major architectural changes. We made assumptions and tested them with processes that often resembled science. Whenever we launch features or make changes, we test…

Slack


11 min read • advanced


AWS, Chef, Consul, Grafana, Jenkins, MySQL, Python, TypeScript

Overview

The article discusses Slack's approach to Chaos Engineering through a process called Disasterpiece Theater, which aims to enhance the reliability of their systems by intentionally causing failures in a controlled environment. It outlines the preparation, execution, and learning outcomes from these exercises, emphasizing the importance of testing fault tolerance in production systems.

What You'll Learn

1. How to conduct a controlled failure exercise in production environments
2. Why regular testing of fault tolerance is essential for system reliability
3. When to abort a production exercise based on system response

Prerequisites & Requirements

Understanding of Chaos Engineering principles
Experience with production system monitoring (optional)

Key Questions Answered

What is Disasterpiece Theater and how does it work?

Disasterpiece Theater is Slack's process for conducting controlled failure exercises to test the fault tolerance of their systems. It involves identifying potential failures, preparing a detailed plan, and executing the exercise in both development and production environments to observe system behavior and gather insights.

What are the key steps in preparing for a Disasterpiece Theater exercise?

Preparation involves writing a detailed plan that outlines the failure to be incited, documenting commands, selecting affected EC2 instances, and establishing metrics and logs to monitor during the exercise. This ensures safety and maximizes learning during the exercise.
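The plan described above can be pictured as a small record with one field per item. The following is a minimal sketch, not Slack's actual template: the class name `ExercisePlan`, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExercisePlan:
    """Hypothetical schema for a written exercise plan."""
    failure: str                                            # the failure to be incited
    commands: List[str] = field(default_factory=list)       # documented commands, in run order
    instances: List[str] = field(default_factory=list)      # affected EC2 instance IDs
    metrics: List[str] = field(default_factory=list)        # metrics to watch during the exercise
    log_queries: List[str] = field(default_factory=list)    # log searches to keep open

    def missing_sections(self) -> List[str]:
        """Return the names of plan sections that are still empty."""
        sections = {
            "failure": self.failure.strip(),
            "commands": self.commands,
            "instances": self.instances,
            "metrics": self.metrics,
            "log_queries": self.log_queries,
        }
        return [name for name, value in sections.items() if not value]


# Illustrative values only; none of these come from the article.
plan = ExercisePlan(
    failure="Hypothetical: stop one cache host",
    instances=["i-0123456789abcdef0"],  # placeholder instance ID
    metrics=["error rate", "request latency"],
    log_queries=["level:error"],
)
print(plan.missing_sections())  # the commands have not been written down yet
```

A check like `missing_sections` is one way to make sure a plan is complete before anyone schedules the exercise.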

What outcomes have been observed from Disasterpiece Theater exercises?

The exercises have revealed vulnerabilities in system availability and correctness, leading to improvements in Slack's infrastructure. For instance, one exercise identified a cache inconsistency issue that was addressed before it could impact users.

How does Slack ensure safety during these exercises?

Slack ensures safety by conducting exercises at publicized times, involving relevant experts, and monitoring system metrics closely. They also have a go/no-go decision process to abort exercises if the risk of disruption is deemed too high.
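The summary does not say how the go/no-go decision is reached, and it may well be a judgment call made by the people in the room. As a rough sketch under that caveat, the decision could be supported by comparing watched metrics against abort thresholds agreed in advance. The names `Check` and `go_no_go`, the metric names, and the threshold values below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Check:
    """One watched metric and the level at which the exercise should stop."""
    name: str
    observed: float
    abort_above: float  # hypothetical threshold agreed before the exercise


def go_no_go(checks: List[Check]) -> Tuple[str, List[str]]:
    """Return ("go", []) or ("no-go", [names of metrics over their threshold])."""
    breaches = [c.name for c in checks if c.observed > c.abort_above]
    return ("no-go" if breaches else "go", breaches)


# Illustrative numbers only.
decision, reasons = go_no_go([
    Check("error_rate_percent", observed=0.2, abort_above=1.0),
    Check("p99_latency_ms", observed=950.0, abort_above=800.0),
])
print(decision, reasons)
```

Returning the list of breached metrics alongside the decision keeps the reason for an abort visible to everyone following the exercise.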

Key Statistics & Figures

Daily Active Users

10,000,000

Slack's growth in user base necessitated rigorous testing of system reliability.

Technologies & Tools


Cloud Infrastructure

EC2

Hosts Slack's services; each exercise plan names the EC2 instances that will be affected.

Monitoring

Grafana

Utilized for projecting dashboards during the exercises to monitor system performance.

Monitoring

Kibana

Used for searching logs and metrics during the exercises.

Key Actionable Insights

1. Regularly conduct controlled failure exercises to enhance system reliability.
   By intentionally causing failures in a safe environment, teams can identify weaknesses and improve their systems before real incidents occur, fostering a culture of resilience.

2. Document all hypotheses and expected outcomes before each exercise.
   This practice not only guides the exercise but also provides a benchmark for assessing the system's performance during and after the test.

3. Involve cross-functional teams in the preparation and execution of exercises.
   Engaging diverse expertise ensures comprehensive coverage of potential failure scenarios and enhances the learning experience for all participants.
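The second insight, writing down hypotheses and expected outcomes before the exercise, can be sketched as a record that is filled in twice: once beforehand and once with what was actually seen. This is an illustrative sketch; the `Hypothesis` class and the example statement are assumptions, not taken from the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    """A prediction written before the exercise, compared with what happened."""
    statement: str
    expected: str
    observed: Optional[str] = None  # filled in during or after the exercise

    def outcome(self) -> str:
        if self.observed is None:
            return "not observed"
        return "confirmed" if self.observed == self.expected else "refuted"


# Hypothetical example.
h = Hypothesis(
    statement="Losing one cache host does not raise the API error rate",
    expected="error rate unchanged",
)
print(h.outcome())  # nothing has been recorded yet

h.observed = "error rate doubled"
print(h.outcome())  # the observation differs from the expectation
```

A refuted hypothesis is the useful result here: it marks exactly where the system behaved differently from what the team believed.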

Common Pitfalls

1. Failing to accurately predict the impact of a controlled failure can lead to unexpected disruptions.

This often occurs when the failure scenario is not thoroughly tested in a development environment first. To avoid this, always conduct preliminary tests to validate assumptions before moving to production.
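The rule of validating in development before moving to production can be expressed as a simple gate. The sketch below assumes two stages and a dictionary recording whether each completed run matched expectations; the names `STAGES` and `next_stage` are hypothetical.

```python
from typing import Dict, Optional

# Assumed order: the exercise runs in development first, then in production.
STAGES = ("development", "production")


def next_stage(passed: Dict[str, bool]) -> Optional[str]:
    """Return the next environment to run in.

    Returns None when a completed run did not match expectations
    (revise the plan first) or when every stage has already passed.
    """
    for stage in STAGES:
        if stage not in passed:
            return stage  # not run yet, so run here next
        if not passed[stage]:
            return None   # assumptions were wrong here: stop and revise
    return None           # all stages ran and matched expectations


print(next_stage({}))                      # nothing run yet
print(next_stage({"development": True}))   # development matched expectations
print(next_stage({"development": False}))  # development surprised the team
```

The gate never returns "production" unless a development run is on record and matched expectations.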

Related Concepts

Chaos Engineering

Fault Tolerance

Resilience Engineering
