Break Stuff on Purpose

Incidents are stressful but inevitable. Even services designed for availability will eventually encounter a failure. Engineers naturally find it daunting to defend their systems against the “infinite number of ways” things can go wrong.

Sean Madden
8 min read · intermediate

Overview

'Break Stuff on Purpose' argues that deliberately causing failures in a system is a practical way to improve recovery processes and build resilience. It recounts a real incident at Slack in which a failure led to significant data loss, and describes how the team turned that experience into a learning opportunity by running controlled exercises against their recovery procedures.

What You'll Learn

1. How to conduct controlled failure exercises to improve system resilience
2. Why regular testing of backup and recovery processes is essential for system reliability
3. How to identify and fix issues in runbooks and recovery procedures

Prerequisites & Requirements

  • Basic understanding of system architecture and incident response
  • Familiarity with Elasticsearch and Kibana (optional)

Key Questions Answered

What incident prompted Slack engineers to improve their recovery processes?
On January 29th, 2024, Slack's Kibana cluster ran out of disk space and failed. When the recovery efforts also failed, the result was significant data loss. The incident exposed the need for better backup procedures and incident response practices.
How did Slack engineers test their new recovery processes?
The engineers conducted a planned exercise where they intentionally filled the disk on a development Kibana cluster to simulate a failure. They then executed their new backup and recovery procedures to ensure they worked effectively, learning valuable lessons in the process.
What are the benefits of intentionally breaking systems during testing?
Intentionally breaking systems allows teams to uncover hidden issues and test their recovery processes in a controlled environment. This proactive approach can lead to improved system resilience and better preparedness for real incidents.
What challenges did the Slack team face during their recovery exercise?
During the recovery exercise, the team encountered issues with their runbook, such as unclear commands and formatting problems. These challenges highlighted the need for better documentation and understanding of the recovery process.
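
The fill-the-disk exercise described above can be planned before anything is written. The following is a minimal Python sketch, not Slack's actual tooling: the function names `bytes_to_fill` and `plan_fill` and the integer-percent target are assumptions made for illustration. It only computes how many bytes would bring a filesystem to a chosen usage level, a number worth agreeing on before running the exercise against a development cluster.

```python
import shutil


def bytes_to_fill(total: int, used: int, target_percent: int) -> int:
    """Return how many bytes must be written to reach target_percent usage.

    Returns 0 when the disk is already at or above the target.
    All names here are hypothetical, chosen for this sketch.
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding on large disks.
    target_used = total * target_percent // 100
    return max(0, target_used - used)


def plan_fill(path: str, target_percent: int) -> int:
    """Inspect the filesystem that holds `path` and report the bytes needed."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return bytes_to_fill(usage.total, usage.used, target_percent)


if __name__ == "__main__":
    # "." stands in for the data directory of a development cluster.
    print(f"bytes needed to reach 95% usage: {plan_fill('.', 95)}")
```

The sketch deliberately stops short of writing any data. Keeping the calculation separate from the destructive step makes it easy to review the plan first and to confirm that the target is a development system.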

Technologies & Tools


Kibana (frontend): Used for visualizing application performance data and managing dashboards.
Elasticsearch (backend): Serves as the data store for Kibana, backing its dashboards.
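
Because Elasticsearch is the data store here, one common way to back it up is its snapshot REST API. The source does not say which backup mechanism Slack used, so the following Python sketch is only an illustration under that assumption. The function name `snapshot_requests`, the base URL, and the repository and snapshot names are hypothetical. The sketch builds the HTTP method and URL for creating, inspecting, and restoring a snapshot without sending anything.

```python
from typing import Dict, Tuple
from urllib.parse import quote


def snapshot_requests(base_url: str, repository: str,
                      snapshot: str) -> Dict[str, Tuple[str, str]]:
    """Build (method, url) pairs for the Elasticsearch snapshot endpoints.

    Nothing is sent; the caller decides which HTTP client to use.
    """
    base = base_url.rstrip("/")
    # Percent-encode the names so they are safe inside a URL path.
    path = f"{base}/_snapshot/{quote(repository, safe='')}/{quote(snapshot, safe='')}"
    return {
        "create": ("PUT", f"{path}?wait_for_completion=true"),
        "inspect": ("GET", path),
        "restore": ("POST", f"{path}/_restore?wait_for_completion=true"),
    }


if __name__ == "__main__":
    # Hypothetical cluster address, repository, and snapshot name.
    plan = snapshot_requests("http://localhost:9200", "dev_backups", "drill-1")
    for step, (method, url) in plan.items():
        print(f"{step}: {method} {url}")
```

Printing the requests before running them also gives a runbook author exact commands to paste in, rather than a prose description of them.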

Key Actionable Insights

1. Conduct regular chaos engineering exercises to test your systems' resilience.
   By intentionally causing failures, teams can identify weaknesses in their systems and improve their incident response capabilities, ensuring better preparedness for real-world outages.
2. Keep your backup and recovery procedures up to date and regularly test them.
   Outdated backups can lead to significant data loss during incidents. Regular testing ensures that recovery processes are effective and that teams are familiar with the steps needed to restore services.
3. Document and refine your runbooks based on real incidents and testing outcomes.
   Clear and comprehensive runbooks are essential for effective incident response. Regularly updating them based on lessons learned helps teams respond more efficiently during actual incidents.
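
The advice above on keeping backups current and testing them can be turned into two small checks. This is a minimal Python sketch under stated assumptions, not a description of Slack's process: the function names, the one-day age limit, and the idea of comparing item counts after a restore are all illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def backup_is_fresh(last_backup: Optional[datetime], now: datetime,
                    max_age: timedelta) -> bool:
    """Return True only if a backup exists and is no older than max_age."""
    if last_backup is None:
        return False
    return now - last_backup <= max_age


def restore_mismatches(expected: Dict[str, int],
                       restored: Dict[str, int]) -> List[str]:
    """List every item whose restored count differs from the expected count."""
    problems = []
    for name, count in sorted(expected.items()):
        found = restored.get(name)  # None when the item is missing entirely
        if found != count:
            problems.append(f"{name}: expected {count}, restored {found}")
    return problems


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical values; in practice these would come from the backup system.
    print(backup_is_fresh(now - timedelta(hours=6), now, timedelta(days=1)))
    print(restore_mismatches({"dashboards": 10}, {"dashboards": 9}))
```

Running the second check as the last step of a recovery drill gives the exercise a concrete pass or fail result instead of relying on a visual spot check.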

Common Pitfalls

1. Neglecting to regularly test backup and recovery processes can lead to outdated procedures that fail during an incident.
   Many teams assume their backups are functioning without verification. This can result in significant data loss and recovery failures when an actual incident occurs.
2. Failing to document and update runbooks can lead to confusion and inefficiency during incident response.
   Runbooks that are not regularly reviewed and updated can become obsolete, making it difficult for teams to execute recovery procedures effectively under pressure.
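
Some runbook problems can be caught mechanically before an incident. The source mentions unclear commands and formatting problems without naming them, so the specific checks in this Python sketch (curly quotes, long dashes, and unfilled `<UPPER_CASE>` placeholders) are assumed examples, and `lint_runbook_commands` is a hypothetical helper rather than an existing tool.

```python
import re
from typing import List

# Typographic characters that word processors and wikis often substitute
# for plain ASCII, which breaks commands pasted into a shell.
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"
LONG_DASHES = "\u2013\u2014"
# Heuristic: placeholders written as <UPPER_CASE_NAME> that nobody filled in.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*>")


def lint_runbook_commands(commands: List[str]) -> List[str]:
    """Return one message per suspicious command, numbered from step 1."""
    problems = []
    for number, command in enumerate(commands, start=1):
        if any(ch in CURLY_QUOTES for ch in command):
            problems.append(f"step {number}: curly quotes instead of plain quotes")
        if any(ch in LONG_DASHES for ch in command):
            problems.append(f"step {number}: long dash where a hyphen was likely intended")
        if PLACEHOLDER.search(command):
            problems.append(f"step {number}: unfilled placeholder")
    return problems


if __name__ == "__main__":
    # Hypothetical runbook steps used only to exercise the checks.
    steps = [
        'curl -X GET "http://localhost:9200/_cat/indices"',
        "restore-cluster --name <CLUSTER_NAME>",
    ]
    for message in lint_runbook_commands(steps):
        print(message)
```

A check like this does not replace a practice run, which is how the team in the article found their runbook issues, but it can remove the most mechanical errors before the next exercise.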

Related Concepts

Chaos Engineering
Incident Response
Backup and Recovery Strategies
System Resilience