Slack’s Migration to a Cellular Architecture

Summary: In recent years, cellular architectures have become increasingly popular for large online services as a way to increase redundancy and limit the blast radius of site failures. In pursuit of these goals, we have migrated the most critical user-facing services at Slack from a monolithic to a cell-based architecture over the last 1.5 years.…

Cooper Bethea

Overview

Slack has migrated its critical user-facing services from a monolithic architecture to a cellular architecture over the past 1.5 years. This transition aims to enhance redundancy and limit the impact of site failures, allowing for more resilient service delivery.

What You'll Learn

1. How to implement a cellular architecture for improved service redundancy
2. Why understanding gray failures is crucial in distributed systems
3. How to design a traffic draining mechanism for availability zones

Key Questions Answered

What led Slack to migrate to a cellular architecture?
Slack's migration was primarily driven by the need to enhance redundancy and limit the blast radius of site failures. An incident on June 30, 2021, where a network disruption in one availability zone affected service delivery, highlighted the limitations of their previous architecture.
How does the AZ drain mechanism work?
The AZ drain mechanism allows Slack to quickly remove traffic from an affected availability zone within 5 minutes without causing user-visible errors. This is achieved by using Envoy's weighted clusters and dynamic weight assignment to reroute traffic seamlessly.
What is a gray failure and how does it affect Slack's services?
A gray failure occurs when different components of a system have inconsistent views of availability. In Slack's case, during an outage, some systems saw backends as available while others did not, leading to user-visible errors despite the infrastructure's redundancy.
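
To make the idea of a gray failure concrete, here is a minimal sketch that classifies a backend by comparing what several observers report about it. The observer names and the `classify_backend` function are hypothetical illustrations, not part of Slack's tooling; the point is only that disagreement between observers is itself a signal worth detecting.

```python
from typing import Dict


def classify_backend(views: Dict[str, bool]) -> str:
    """Classify a backend from per-observer availability reports.

    `views` maps an observer name (hypothetical, e.g. a load balancer
    health check or a frontend in another AZ) to True if that observer
    can reach the backend and False otherwise.
    """
    if not views:
        raise ValueError("at least one observer is required")
    reports = set(views.values())
    if reports == {True}:
        return "healthy"   # every observer agrees the backend is up
    if reports == {False}:
        return "down"      # every observer agrees the backend is down
    return "gray"          # observers disagree: a gray failure


# Example: the health checker sees the backend, a frontend in the
# affected AZ does not, so the views are inconsistent.
status = classify_backend({
    "health_checker": True,
    "frontend_in_affected_az": False,
})
print(status)
```

A hard failure is comparatively easy to handle because everyone agrees on it; the "gray" branch is the case where automated remediation tends to struggle, since each component acts on its own view.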

Key Statistics & Figures

Slack's availability SLA: 99.99%
This SLA allows for less than 1 hour (about 52.6 minutes) of total unavailability per year.
Time to drain traffic from an AZ: 5 minutes
This is the target time for removing traffic from an affected availability zone to maintain service availability.
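
The relationship between the SLA figure and the yearly downtime budget is simple arithmetic, sketched below. The function name is an illustrative choice and the calculation assumes a 365-day year; it is not taken from the original article.

```python
def downtime_budget_minutes(sla_percent: float, days: float = 365.0) -> float:
    """Return the downtime allowed by an availability SLA, in minutes."""
    unavailable_fraction = 1.0 - sla_percent / 100.0
    return unavailable_fraction * days * 24 * 60


budget = downtime_budget_minutes(99.99)
# Fraction of the yearly budget that a single 5-minute drain window
# would consume if users were affected for its whole duration.
drain_share = 5 / budget
print(f"{budget:.2f} minutes per year, one drain = {drain_share:.1%} of budget")
```

Under these assumptions the budget is a little under 53 minutes per year, so a 5-minute drain window is already close to a tenth of it. That is why the speed of the drain matters.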

Key Actionable Insights

1. Implementing a cellular architecture can significantly enhance service resilience against outages.
   By isolating services within availability zones, companies can limit the impact of failures to a single zone, allowing for quicker recovery and reduced downtime.
2. Understanding and addressing gray failures is essential for maintaining service reliability.
   Recognizing that different components may perceive availability differently can help engineers design better failure detection and remediation strategies.
3. Utilizing traffic draining mechanisms can improve user experience during outages.
   By quickly rerouting traffic away from affected zones, companies can maintain service availability and minimize user disruption.
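
The traffic draining idea can be sketched as a weight calculation: each availability zone gets a share of traffic, and draining a zone sets its weight to zero while the remaining zones absorb the load. This is a simplified illustration only. The `az_weights` function, the AZ names, and the output shape are hypothetical, and it does not reproduce Envoy's configuration schema or Slack's actual implementation.

```python
from typing import Dict, Iterable, Set


def az_weights(azs: Iterable[str], drained: Set[str], total: int = 100) -> Dict[str, int]:
    """Split `total` weight evenly across AZs that are not drained.

    Drained AZs receive weight 0. Any remainder from integer division is
    handed out one unit at a time, in sorted AZ order, so that the
    weights always sum to `total`.
    """
    ordered = sorted(set(azs))
    active = [az for az in ordered if az not in drained]
    if not active:
        raise ValueError("cannot drain every availability zone")
    share, remainder = divmod(total, len(active))
    weights = {az: 0 for az in ordered}
    for i, az in enumerate(active):
        weights[az] = share + (1 if i < remainder else 0)
    return weights


# Normal operation: traffic is spread across all three AZs.
normal = az_weights(["az-a", "az-b", "az-c"], drained=set())
# Draining az-b: its weight drops to 0 and the others pick up the load.
during_drain = az_weights(["az-a", "az-b", "az-c"], drained={"az-b"})
print(normal, during_drain)
```

In a real deployment the computed weights would be pushed to the proxies dynamically, so that a drain takes effect without a redeploy. The sketch covers only the calculation, not the distribution mechanism.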

Common Pitfalls

1. Assuming that redundancy across availability zones guarantees complete service availability.
   Redundancy alone does not protect against gray failures, in which components disagree about which backends are available; the result can be user-visible errors during an outage even though spare capacity exists.