Disasterpiece Theater: Slack’s process for approachable Chaos Engineering

Slack is a large and complex piece of software that’s been added to and changed many times over the last five years. We added features, grew to 10,000,000 DAUs, and made major architectural changes. We made assumptions and tested them with processes that often resembled science. Whenever we launch features or make changes, we test…

Richard Crowley
11 min read, advanced

Overview

The article discusses Slack's approach to Chaos Engineering through a process called Disasterpiece Theater, which aims to enhance the reliability of their systems by intentionally causing failures in a controlled environment. It outlines the preparation, execution, and learning outcomes from these exercises, emphasizing the importance of testing fault tolerance in production systems.

What You'll Learn

1. How to conduct a controlled failure exercise in production environments
2. Why regular testing of fault tolerance is essential for system reliability
3. When to abort a production exercise based on system response

Prerequisites & Requirements

  • Understanding of Chaos Engineering principles
  • Experience with production system monitoring (optional)

Key Questions Answered

What is Disasterpiece Theater and how does it work?
Disasterpiece Theater is Slack's process for conducting controlled failure exercises to test the fault tolerance of their systems. It involves identifying potential failures, preparing a detailed plan, and executing the exercise in both development and production environments to observe system behavior and gather insights.
What are the key steps in preparing for a Disasterpiece Theater exercise?
Preparation involves writing a detailed plan that outlines the failure to be incited, documenting commands, selecting affected EC2 instances, and establishing metrics and logs to monitor during the exercise. This ensures safety and maximizes learning during the exercise.
What outcomes have been observed from Disasterpiece Theater exercises?
The exercises have revealed vulnerabilities in system availability and correctness, leading to improvements in Slack's infrastructure. For instance, one exercise identified a cache inconsistency issue that was addressed before it could impact users.
How does Slack ensure safety during these exercises?
Slack ensures safety by conducting exercises at publicized times, involving relevant experts, and monitoring system metrics closely. They also have a go/no-go decision process to abort exercises if the risk of disruption is deemed too high.
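
The preparation checklist and go/no-go decision described above could be captured in a small data structure. The following is a minimal sketch, not Slack's actual tooling: the `ExercisePlan` class, its field names, and the rule used in `go_no_go` (every checklist item filled in, and the development run matching the hypothesis) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExercisePlan:
    """Hypothetical checklist for one exercise; field names are illustrative."""
    failure: str                                            # the failure to be incited
    commands: List[str] = field(default_factory=list)       # documented commands to run
    instances: List[str] = field(default_factory=list)      # affected EC2 instance IDs
    dashboards: List[str] = field(default_factory=list)     # metrics to watch
    log_queries: List[str] = field(default_factory=list)    # log searches to watch


def missing_items(plan: ExercisePlan) -> List[str]:
    """Return the names of checklist fields that are still empty."""
    checks = {
        "failure": plan.failure,
        "commands": plan.commands,
        "instances": plan.instances,
        "dashboards": plan.dashboards,
        "log_queries": plan.log_queries,
    }
    return [name for name, value in checks.items() if not value]


def go_no_go(plan: ExercisePlan, dev_run_matched_hypothesis: bool) -> str:
    """Assumed rule: proceed only with a complete plan and an unsurprising dev run."""
    if missing_items(plan) or not dev_run_matched_hypothesis:
        return "no-go"
    return "go"
```

In practice the decision is a human judgment about the risk of disruption; a check like this only guards against starting an exercise with an incomplete plan.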

Key Statistics & Figures

Daily Active Users
10,000,000
Slack's growth in user base necessitated rigorous testing of system reliability.

Technologies & Tools

Cloud Infrastructure
EC2: Hosts Slack's services; each exercise targets specific EC2 instances selected in advance.
Monitoring
Grafana: Used to project dashboards during the exercises to monitor system performance.
Kibana: Used to search logs and metrics during the exercises.

Key Actionable Insights

1. Regularly conduct controlled failure exercises to enhance system reliability.
By intentionally causing failures in a safe environment, teams can identify weaknesses and improve their systems before real incidents occur, fostering a culture of resilience.
2. Document all hypotheses and expected outcomes before each exercise.
This practice not only guides the exercise but also provides a benchmark for assessing the system's performance during and after the test.
3. Involve cross-functional teams in the preparation and execution of exercises.
Engaging diverse expertise ensures comprehensive coverage of potential failure scenarios and enhances the learning experience for all participants.
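
Writing hypotheses down before an exercise makes it possible to compare them mechanically with what was observed. The sketch below assumes each hypothesis can be expressed as an upper bound on a named metric; the `Hypothesis` class, the `surprises` function, and the metric names in the usage example are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hypothesis:
    """A prediction recorded before the exercise (illustrative shape)."""
    metric: str          # name of the metric being predicted
    max_expected: float  # highest value the team expects to see


def surprises(hypotheses: List[Hypothesis], observed: Dict[str, float]) -> List[str]:
    """List hypotheses that were contradicted or could not be checked."""
    findings = []
    for h in hypotheses:
        value = observed.get(h.metric)
        if value is None:
            # No data is itself a finding: the metric was not being watched.
            findings.append(f"{h.metric}: no observation recorded")
        elif value > h.max_expected:
            findings.append(
                f"{h.metric}: observed {value}, expected at most {h.max_expected}"
            )
    return findings


# Usage with made-up metric names and values:
predictions = [Hypothesis("error_rate", 0.01), Hypothesis("p99_latency_ms", 500.0)]
print(surprises(predictions, {"error_rate": 0.05}))
```

Every entry in the returned list is a candidate follow-up item for the exercise write-up.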

Common Pitfalls

1. Failing to accurately predict the impact of a controlled failure can lead to unexpected disruptions.
This often occurs when the failure scenario is not thoroughly tested in a development environment first. To avoid this, always conduct preliminary tests to validate assumptions before moving to production.

Related Concepts

Chaos Engineering
Fault Tolerance
Resilience Engineering