Slack’s Migration to a Cellular Architecture

Cooper Bethea

Summary: In recent years, cellular architectures have become increasingly popular for large online services as a way to increase redundancy and limit the blast radius of site failures. In pursuit of these goals, we have migrated the most critical user-facing services at Slack from a monolithic to a cell-based architecture over the last 1.5 years.…

Slack


Overview

Slack has migrated its critical user-facing services from a monolithic architecture to a cellular architecture over the past 1.5 years. This transition aims to enhance redundancy and limit the impact of site failures, allowing for more resilient service delivery.

What You'll Learn

1. How to implement a cellular architecture for improved service redundancy

2. Why understanding gray failures is crucial in distributed systems

3. How to design a traffic draining mechanism for availability zones

Key Questions Answered

What led Slack to migrate to a cellular architecture?

Slack's migration was driven primarily by the need to increase redundancy and limit the blast radius of site failures. An incident on June 30, 2021, in which a network disruption in one availability zone affected service delivery, highlighted the limitations of the previous architecture.

How does the AZ drain mechanism work?

The AZ drain mechanism allows Slack to quickly remove traffic from an affected availability zone within 5 minutes without causing user-visible errors. This is achieved by using Envoy's weighted clusters and dynamic weight assignment to reroute traffic seamlessly.
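As a rough illustration of weight-based draining, the sketch below recomputes per-zone traffic weights when one zone is drained, spreading its share proportionally across the remaining zones. The function name `drained_weights`, the zone names, and the integer weights summing to 100 are assumptions made for this example; the summary does not describe Slack's actual control plane or Envoy configuration.

```python
def drained_weights(weights: dict[str, int], drained: str) -> dict[str, int]:
    """Return new weights with `drained` set to 0 and its share redistributed.

    Hypothetical helper: the total weight is preserved so the result could be
    fed to any weighted router that expects a fixed total.
    """
    if drained not in weights:
        raise KeyError(f"unknown zone: {drained}")

    # Zones that keep serving traffic after the drain.
    remaining = {az: w for az, w in weights.items() if az != drained and w > 0}
    if not remaining:
        raise ValueError("cannot drain the only zone carrying traffic")

    total = sum(weights.values())
    remaining_total = sum(remaining.values())

    # Scale each surviving zone proportionally, rounding down.
    new_weights = {az: 0 for az in weights}
    for az, w in remaining.items():
        new_weights[az] = total * w // remaining_total

    # Hand out the units lost to rounding, in a deterministic order.
    leftover = total - sum(new_weights.values())
    for az in sorted(remaining)[:leftover]:
        new_weights[az] += 1

    return new_weights
```

For example, draining `az-b` from `{"az-a": 34, "az-b": 33, "az-c": 33}` yields `{"az-a": 51, "az-b": 0, "az-c": 49}`. Keeping the total constant means the drained zone's share moves to the other zones rather than being dropped.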

What is a gray failure and how does it affect Slack's services?

A gray failure occurs when different components of a system have inconsistent views of availability. In Slack's case, during an outage, some systems saw backends as available while others did not, leading to user-visible errors despite the infrastructure's redundancy.
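To make the definition concrete, here is a minimal sketch that classifies a backend from the health views reported by several observers. The observer names and the `classify` function are hypothetical and stand in for whatever health-checking components a real system has; the point is only that disagreement between observers is the signature of a gray failure.

```python
def classify(observations: dict[str, bool]) -> str:
    """Classify a backend from per-observer health views.

    `observations` maps an observer name to whether that observer currently
    sees the backend as available. Returns "healthy", "down", or "gray".
    """
    if not observations:
        raise ValueError("need at least one observation")

    views = set(observations.values())
    if views == {True}:
        return "healthy"   # every observer agrees the backend is up
    if views == {False}:
        return "down"      # every observer agrees the backend is down
    return "gray"          # observers disagree: a gray failure
```

A clean outage is comparatively easy to handle because everyone stops sending traffic. In the gray case, observers that still see the backend as available keep routing requests to it, which is how errors reach users despite redundancy.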

Key Statistics & Figures

Slack's availability SLA

99.99%

This SLA allows for roughly 52.6 minutes of total unavailability per year, i.e. less than 1 hour.
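The downtime budget follows directly from the SLA percentage. The short sketch below works through the arithmetic; the function name `downtime_budget_minutes` and the 365-day year are assumptions for illustration.

```python
def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Minutes of allowed unavailability over `days` for a given availability.

    `availability` is a fraction, e.g. 0.9999 for a 99.99% SLA.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    minutes_in_period = days * 24 * 60
    return (1.0 - availability) * minutes_in_period


budget = downtime_budget_minutes(0.9999)
print(f"{budget:.2f} minutes per year")  # about 52.56 minutes
```

With a budget this small, a drain that takes 5 minutes already consumes close to a tenth of the yearly allowance, which helps explain why the drain time is treated as a key figure.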

Time to drain traffic from an AZ

5 minutes

This is the target time for removing traffic from an affected availability zone to maintain service availability.

Technologies & Tools


Load Balancer

Envoy

Used for managing traffic routing and implementing the AZ drain mechanism.

Database

Vitess

Serves as Slack's main datastore, providing strongly consistent semantics.

Key Actionable Insights

1. Implementing a cellular architecture can significantly enhance service resilience against outages.
By isolating services within availability zones, companies can limit the impact of failures to a single zone, allowing for quicker recovery and reduced downtime.

2. Understanding and addressing gray failures is essential for maintaining service reliability.
Recognizing that different components may perceive availability differently can help engineers design better failure detection and remediation strategies.

3. Using traffic draining mechanisms can improve user experience during outages.
By quickly rerouting traffic away from affected zones, companies can maintain service availability and minimize user disruption.
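The isolation idea in the first insight can be sketched as a routing rule: a caller only considers backends in its own availability zone, so a failure elsewhere cannot reach it. The `Backend` type and `same_az_backends` function below are hypothetical names for this example and are not taken from Slack's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    host: str
    az: str  # availability zone the backend runs in


def same_az_backends(caller_az: str, backends: list[Backend]) -> list[Backend]:
    """Keep only backends in the caller's AZ so requests stay inside one cell."""
    return [b for b in backends if b.az == caller_az]
```

Under this assumption an empty result is not a cue to fall back to another zone. The caller's zone would instead be drained at the edge, which keeps the failure contained to a single cell.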

Common Pitfalls

1. Assuming that redundancy across availability zones guarantees complete service availability.

This misconception can lead to gray failures where some components perceive availability differently, resulting in user-visible errors during outages.
