Break Stuff on Purpose

Sean Madden

Incidents are stressful but inevitable. Even services designed for availability will eventually encounter a failure. Engineers naturally find it daunting to defend their systems against the “infinite number of ways” things can go wrong.

Slack

8 min read • Intermediate

Chef, Elasticsearch, Jenkins, Kubernetes, Python, TypeScript

Overview

'Break Stuff on Purpose' argues that intentionally causing failures in your systems improves recovery processes and resilience. It recounts a real incident at Slack in which a failure led to significant data loss, and describes how the team turned the experience into a learning opportunity by running controlled exercises to test their recovery procedures.

What You'll Learn

1. How to conduct controlled failure exercises to improve system resilience

2. Why regular testing of backup and recovery processes is essential for system reliability

3. How to identify and fix issues in runbooks and recovery procedures

Prerequisites & Requirements

Basic understanding of system architecture and incident response
Familiarity with Elasticsearch and Kibana (optional)

Key Questions Answered

What incident prompted Slack engineers to improve their recovery processes?

On January 29th, 2024, Slack's Kibana cluster ran out of disk space and failed. Recovery attempts did not succeed, and the result was significant data loss. The incident exposed the need for better backup procedures and incident response practices.

How did Slack engineers test their new recovery processes?

The engineers conducted a planned exercise where they intentionally filled the disk on a development Kibana cluster to simulate a failure. They then executed their new backup and recovery procedures to ensure they worked effectively, learning valuable lessons in the process.
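The disk-filling step of such an exercise can be sketched in a few lines of Python. This is a minimal illustration, not the tooling Slack used: the function name `fill_disk`, the filler file name, and both safety limits are assumptions made for the example. It writes a filler file until free space drops to a floor, and it refuses to write more than a fixed cap so that a mistyped floor cannot consume the whole volume.

```python
import os
import shutil
import tempfile


def fill_disk(directory, min_free_bytes, max_write_bytes, chunk_size=1024 * 1024):
    """Write a filler file until free space on the volume drops to min_free_bytes.

    Hypothetical drill helper. Stops early once max_write_bytes have been
    written. Returns the filler file's path so it can be deleted afterwards.
    """
    filler_path = os.path.join(directory, "drill_filler.bin")
    # Random bytes, so a compressing filesystem cannot shrink the filler.
    chunk = os.urandom(chunk_size)
    written = 0
    with open(filler_path, "wb") as filler:
        while written < max_write_bytes:
            free = shutil.disk_usage(directory).free
            if free <= min_free_bytes:
                break  # reached the target floor
            step = min(chunk_size, max_write_bytes - written, free - min_free_bytes)
            filler.write(chunk[:step])
            filler.flush()
            os.fsync(filler.fileno())  # make sure the bytes really hit the disk
            written += step
    return filler_path


if __name__ == "__main__":
    # Demo in a throwaway directory with a tiny cap; never point this at production.
    with tempfile.TemporaryDirectory() as sandbox:
        path = fill_disk(sandbox, min_free_bytes=0, max_write_bytes=4096, chunk_size=1024)
        print(os.path.getsize(path), "bytes written to", path)
        os.remove(path)
```

Deleting the filler file ends the drill, which makes the failure easy to reverse if the recovery procedure itself turns out to be broken.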

What are the benefits of intentionally breaking systems during testing?

Intentionally breaking systems allows teams to uncover hidden issues and test their recovery processes in a controlled environment. This proactive approach can lead to improved system resilience and better preparedness for real incidents.

What challenges did the Slack team face during their recovery exercise?

During the recovery exercise, the team encountered issues with their runbook, such as unclear commands and formatting problems. These challenges highlighted the need for better documentation and understanding of the recovery process.
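Some runbook problems of this kind can be caught before an exercise with a simple lint pass. The sketch below is an assumption about what such a check might look like, not something described in the article: the three rules (smart quotes, long dashes where `--` was probably meant, and unfilled placeholders) and the name `lint_runbook` are hypothetical and should be adapted to whatever actually broke in your own runbook.

```python
import re

# Hypothetical rules; each one targets text that tends to break when a
# command is copied from a document into a shell.
SMART_QUOTES = re.compile("[\u2018\u2019\u201c\u201d]")
LONG_DASHES = re.compile("[\u2013\u2014]")
PLACEHOLDERS = re.compile(r"<[A-Za-z_][A-Za-z0-9_-]*>|\bTODO\b|\bTBD\b")


def lint_runbook(text):
    """Return (line_number, message) pairs for suspicious runbook lines."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if SMART_QUOTES.search(line):
            findings.append((lineno, "smart quote; may break when pasted into a shell"))
        if LONG_DASHES.search(line):
            findings.append((lineno, "long dash; a '--flag' may have been auto-formatted"))
        if PLACEHOLDERS.search(line):
            findings.append((lineno, "unfilled placeholder; the command is not runnable as written"))
    return findings
```

A check like this only finds mechanical problems. Whether a step is clear to someone running it under pressure still has to be tested by having a person follow the runbook during an exercise.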

Technologies & Tools

Frontend

Kibana

Used for visualizing application performance data and managing dashboards.

Backend

Elasticsearch

Serves as the data store for Kibana, providing the necessary backend support for dashboard functionalities.

Key Actionable Insights

1. Conduct regular chaos engineering exercises to test your systems' resilience.
By intentionally causing failures, teams can identify weaknesses in their systems and improve their incident response capabilities, ensuring better preparedness for real-world outages.

2. Keep your backup and recovery procedures up to date and regularly test them.
Outdated backups can lead to significant data loss during incidents. Regular testing ensures that recovery processes are effective and that teams are familiar with the steps needed to restore services.

3. Document and refine your runbooks based on real incidents and testing outcomes.
Clear and comprehensive runbooks are essential for effective incident response. Regularly updating them based on lessons learned helps teams respond more efficiently during actual incidents.
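The second insight, keeping backups current and testing restores, lends itself to two small automated checks. The following is a sketch under stated assumptions: the function names, the idea of tracking a last-success timestamp per backup, and the use of record counts as a restore check are all illustrative choices, not details from the article.

```python
from datetime import datetime, timedelta, timezone


def stale_backups(last_success, max_age, now=None):
    """Return the names of backups whose last successful run is older than max_age.

    last_success maps a backup name to a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, ts in last_success.items() if now - ts > max_age)


def restore_mismatches(expected_counts, restored_counts):
    """Compare record counts before backup and after a test restore.

    Returns {name: (expected, restored)} for every item that does not match;
    an item missing from the restore is reported with a restored count of 0.
    """
    return {
        name: (expected, restored_counts.get(name, 0))
        for name, expected in expected_counts.items()
        if restored_counts.get(name, 0) != expected
    }
```

Running the first check on a schedule catches backups that silently stopped. The second only has value when fed by an actual restore into a scratch environment, which is the step that proves the backup is usable.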

Common Pitfalls

1. Neglecting to regularly test backup and recovery processes can lead to outdated procedures that fail during an incident.
Many teams assume their backups are functioning without verification. This can result in significant data loss and recovery failures when an actual incident occurs.

2. Failing to document and update runbooks can lead to confusion and inefficiency during incident response.
Runbooks that are not regularly reviewed and updated can become obsolete, making it difficult for teams to execute recovery procedures effectively under pressure.

Related Concepts

Chaos Engineering

Incident Response

Backup and Recovery Strategies

System Resilience
