A Terrible, Horrible, No-Good, Very Bad Day at Slack

This story describes the technical details of the problems that caused the Slack downtime on May 12th, 2020. To learn more about the process behind incident response for same outage, read Ryan Katkov’s post, “All Hands on Deck”. On May 12, 2020, Slack had our first significant outage in a long time. We published a summary…

Laura Nolan
9 min readadvanced
--
View Original

Overview

The article provides an in-depth analysis of a significant outage experienced by Slack on May 12, 2020, detailing the technical issues that led to the incident. It discusses the response strategies employed to mitigate the impact and the lessons learned for future improvements.

What You'll Learn

1

How to identify and mitigate database performance issues during high load

2

Why effective load balancing is crucial for web application performance

3

How to implement dynamic configuration updates in HAProxy without downtime

Prerequisites & Requirements

  • Understanding of load balancing and web application architecture
  • Familiarity with HAProxy and Consul(optional)

Key Questions Answered

What caused the Slack outage on May 12, 2020?
The outage was triggered by a significant load increase in the database infrastructure due to a configuration change that exposed a longstanding performance bug. This led to a cascading failure in the webapp tier, resulting in HTTP 503 errors.
How did Slack respond to the incident?
Slack's response involved quickly rolling back the problematic configuration change, scaling up the webapp fleet, and investigating the load balancer's performance. They identified a bug in the synchronization process between the webapp instances and the HAProxy state, which contributed to the outage.
What architectural changes are planned to prevent future outages?
Slack is moving towards using Envoy Proxy for ingress load balancing, which offers better integration with dynamic service discovery and aims to eliminate the operational complexities associated with HAProxy.

Key Statistics & Figures

Increase in webapp instance count during incident
75%
This scaling was necessary to handle the increased load after the initial database performance issues.
Duration of the initial customer impact
3 minutes
This brief incident occurred before the configuration rollback was implemented.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Implement robust monitoring for critical infrastructure components to catch issues early.
Monitoring systems should be regularly tested and updated to ensure they can detect anomalies, especially in static environments where changes are infrequent.
2
Consider using dynamic configuration management for load balancers to reduce downtime during updates.
Dynamic configurations allow for quick adjustments to backend services without the need for full reloads, minimizing the risk of service disruption.
3
Regularly review and test incident response plans to ensure effectiveness.
Conducting drills and simulations can help teams prepare for real incidents and refine their response strategies.

Common Pitfalls

1
Failing to update monitoring systems can lead to undetected issues.
In this case, the monitoring for HAProxy did not catch the stale backend state, which contributed to the outage. Regular reviews and updates of monitoring systems are essential.

Related Concepts

Incident Management
Load Balancing Strategies
Database Performance Optimization
Service Discovery Mechanisms