A Terrible, Horrible, No-Good, Very Bad Day at Slack

Laura Nolan

This story describes the technical details of the problems that caused the Slack downtime on May 12th, 2020. To learn more about the process behind incident response for the same outage, read Ryan Katkov’s post, “All Hands on Deck”. On May 12, 2020, Slack had our first significant outage in a long time. We published a summary…

Slack


AWS, Chef, Consul, Envoy, HAProxy

Overview

The article provides an in-depth analysis of a significant outage experienced by Slack on May 12, 2020, detailing the technical issues that led to the incident. It discusses the response strategies employed to mitigate the impact and the lessons learned for future improvements.

What You'll Learn

1. How to identify and mitigate database performance issues during high load
2. Why effective load balancing is crucial for web application performance
3. How to implement dynamic configuration updates in HAProxy without downtime

Prerequisites & Requirements

Understanding of load balancing and web application architecture
Familiarity with HAProxy and Consul (optional)

Key Questions Answered

What caused the Slack outage on May 12, 2020?

The outage was triggered by a significant load increase in the database infrastructure due to a configuration change that exposed a longstanding performance bug. This led to a cascading failure in the webapp tier, resulting in HTTP 503 errors.

How did Slack respond to the incident?

Slack's response involved quickly rolling back the problematic configuration change, scaling up the webapp fleet, and investigating the load balancer's performance. They identified a bug in the synchronization process between the webapp instances and the HAProxy state, which contributed to the outage.

What architectural changes are planned to prevent future outages?

Slack is moving towards using Envoy Proxy for ingress load balancing, which offers better integration with dynamic service discovery and aims to eliminate the operational complexities associated with HAProxy.

Key Statistics & Figures

Increase in webapp instance count during the incident: 75%
This scaling was necessary to handle the increased load after the initial database performance issues.

Duration of the initial customer impact: 3 minutes
This brief incident occurred before the configuration rollback was implemented.

Technologies & Tools


Load Balancer

HAProxy

Used to distribute requests to the webapp tier and manage backend server states.

Service Discovery

Consul

Employed for service discovery and to manage backend health checks.

Load Balancer

Envoy Proxy

Planned for future use to improve load balancing and service discovery integration.

Key Actionable Insights

1. Implement robust monitoring for critical infrastructure components to catch issues early.
Monitoring systems should be regularly tested and updated to ensure they can detect anomalies, especially in static environments where changes are infrequent.

2. Consider using dynamic configuration management for load balancers to reduce downtime during updates.
Dynamic configurations allow for quick adjustments to backend services without the need for full reloads, minimizing the risk of service disruption.

3. Regularly review and test incident response plans to ensure effectiveness.
Conducting drills and simulations can help teams prepare for real incidents and refine their response strategies.
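
The second insight, updating backends dynamically instead of reloading, can be illustrated with a small sketch. It assumes the load balancer exposes a fixed-size table of backend slots and that service discovery supplies the set of addresses that should be serving traffic. The slot model, the function name `reconcile_slots`, and the example addresses are all hypothetical; this is not Slack's synchronization code. The sketch frees stale slots before placing new backends, so a full table can still absorb replacements, and it reports any backend that could not be placed instead of dropping it silently.

```python
from typing import Dict, List, Optional, Set


def reconcile_slots(
    slots: List[Optional[str]], desired: Set[str]
) -> Dict[str, List[str]]:
    """Update a fixed-size slot table in place so it reflects `desired`.

    Hypothetical model: each slot holds one backend address or None.
    Stale entries are released first, so freed slots can be reused by
    new backends in the same pass.
    """
    removed: List[str] = []
    added: List[str] = []
    unplaced: List[str] = []

    # Step 1: free every slot whose backend is no longer desired.
    for i, addr in enumerate(slots):
        if addr is not None and addr not in desired:
            removed.append(addr)
            slots[i] = None

    # Step 2: place new backends into free slots, in a stable order.
    present = {addr for addr in slots if addr is not None}
    free = [i for i, addr in enumerate(slots) if addr is None]
    for addr in sorted(desired - present):
        if free:
            slots[free.pop(0)] = addr
            added.append(addr)
        else:
            # Capacity exhausted: report it so the caller can alert.
            unplaced.append(addr)

    return {"added": added, "removed": removed, "unplaced": unplaced}


if __name__ == "__main__":
    table: List[Optional[str]] = ["10.0.0.1", "10.0.0.2", None]
    changes = reconcile_slots(table, {"10.0.0.2", "10.0.0.3"})
    print(table, changes)
```

Pushing the resulting changes to a real load balancer is deliberately left out. A non-empty `unplaced` list is the signal worth alerting on, since it means live backends are receiving no traffic.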

Common Pitfalls

1. Failing to update monitoring systems can lead to undetected issues.

In this case, the monitoring for HAProxy did not catch the stale backend state, which contributed to the outage. Regular reviews and updates of monitoring systems are essential.
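
One way to catch stale backend state is to compare what the load balancer is configured to use against what service discovery reports. The following is a minimal sketch under stated assumptions: both inputs are plain sets of backend addresses gathered elsewhere, and `DriftReport`, `backend_drift`, `should_alert`, and the 10% threshold are hypothetical names and values chosen for illustration, not part of HAProxy, Consul, or Slack's tooling.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class DriftReport:
    stale: FrozenSet[str]    # configured in the load balancer, gone from discovery
    missing: FrozenSet[str]  # known to discovery, absent from the load balancer
    stale_ratio: float       # fraction of configured backends that are stale


def backend_drift(configured: Set[str], discovered: Set[str]) -> DriftReport:
    """Compare load balancer backends with the service discovery view."""
    stale = frozenset(configured - discovered)
    missing = frozenset(discovered - configured)
    ratio = len(stale) / len(configured) if configured else 0.0
    return DriftReport(stale=stale, missing=missing, stale_ratio=ratio)


def should_alert(report: DriftReport, max_stale_ratio: float = 0.1) -> bool:
    """Alert when too many configured backends no longer exist.

    The default threshold is an arbitrary illustration; a real value
    would depend on how quickly instances normally turn over.
    """
    return report.stale_ratio > max_stale_ratio


if __name__ == "__main__":
    report = backend_drift({"a", "b", "c", "d"}, {"c", "d", "e"})
    print(report, should_alert(report))
```

A check like this tests the outcome (does the load balancer's view match reality?) and does not depend on the synchronization process reporting its own failures.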

Related Concepts

Incident Management

Load Balancing Strategies

Database Performance Optimization

Service Discovery Mechanisms
