The Query Strikes Again

Emad Mokhtar

On Thursday, 12 Oct. 2022, the EMEA part of the Datastores team, the team responsible for Slack’s database clusters, was having an onsite day in Amsterdam, the Netherlands. We were sitting together for the first time since new engineers had joined the team when suddenly a few of us were paged: There was an…

Slack


Overview

The article discusses a critical incident experienced by Slack's Datastores team due to a spike in database load from a mass user deletion, leading to failed queries and system instability. It details the root causes, the response strategies employed, and the preventive measures implemented to avoid future occurrences.

What You'll Learn

1. How to manage database load during high-volume operations
2. Why sharding is essential for large datasets in MySQL
3. How to optimize asynchronous job processing in a database

Prerequisites & Requirements

Understanding of database sharding and replication concepts
Experience with MySQL and asynchronous job management (optional)

Key Questions Answered

What caused the database overload during the incident?

The database overload was triggered by a customer removing a large number of users in a single operation, which initiated the 'forget user' asynchronous job. This led to a spike in database load as the job required unsubscribing users from multiple channels and threads, overwhelming the database shards.

How did Slack's team respond to the database failures?

The team responded by disabling the problematic job, manually provisioning larger replicas to mitigate memory issues, and optimizing the 'leave channel' job to reduce database contention. These actions helped stabilize the database and restore service.

What measures are being taken to prevent similar incidents?

To prevent future incidents, Slack's Datastores team has implemented throttling mechanisms and the circuit breaker pattern to manage query loads effectively. These strategies help protect the database from being overwhelmed by excessive queries.
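
The circuit breaker pattern mentioned above can be sketched as follows. This is a minimal illustration, not Slack's implementation: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example. After a run of consecutive failures the breaker rejects calls outright, giving the database time to recover, and then lets a single trial call through once the timeout has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    """Hypothetical breaker guarding calls to an overloaded backend."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of adding load to the database.
                raise CircuitOpenError("circuit open; query rejected")
            # Timeout elapsed: half-open, so this one call is allowed as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each query in `breaker.call(run_query, sql)` (where `run_query` is whatever function issues the query) and treat `CircuitOpenError` as a signal to back off or retry later.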

What optimizations were made to the 'forget user' job?

The 'forget user' job was optimized to issue a single 'unsubscribe from all threads' job instead of multiple 'leave channel' jobs, significantly reducing database contention during user deletions. This change improves performance during high-volume operations.
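
The effect of that change on job volume can be illustrated with a small sketch. The `Job` type, the function names, and the job name strings are hypothetical and chosen only to mirror the description above; the real job definitions are not shown in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical record for an enqueued asynchronous job."""
    name: str
    user_id: int
    channel_id: Optional[int] = None


def forget_user_per_channel(user_id: int, channel_ids: List[int]) -> List[Job]:
    """Before: one 'leave channel' job per channel the user belongs to."""
    return [Job("leave_channel", user_id, cid) for cid in channel_ids]


def forget_user_single_job(user_id: int, channel_ids: List[int]) -> List[Job]:
    """After: one job that unsubscribes the user from all threads at once."""
    # channel_ids is accepted only to keep the two signatures comparable.
    return [Job("unsubscribe_from_all_threads", user_id)]
```

Under this sketch, the number of jobs per deleted user grows with channel membership in the first version and stays constant in the second, which is why a mass deletion produces far less contention after the change.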

Key Statistics & Figures

Share of user subscription data held on the affected shard

6%

This percentage indicates the portion of user subscription data that was concentrated in one shard, which contributed to the overload during the incident.

Technologies & Tools


Database

Vitess

Used to manage Slack’s MySQL clusters and facilitate sharding.

Database

MySQL

Serves as the underlying database system for Slack's data storage.

Key Actionable Insights

1. Implement sharding in your database architecture to handle large datasets more efficiently.
Sharding allows for distributed database management, reducing the load on individual instances and improving overall performance, especially during high-volume operations.

2. Optimize asynchronous job processing to minimize database contention.
By refining job logic to limit the scope of database queries, you can significantly reduce the load on your database during peak operations, ensuring smoother performance.

3. Use monitoring tools to identify and respond to database load spikes in real time.
Having a robust monitoring system in place allows teams to make informed decisions quickly, preventing incidents from escalating and impacting users.
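
As an illustration of the first insight, the sketch below routes rows to shards by hashing a sharding key. It is a generic hash-based scheme written for this example, not a description of how Vitess or Slack assigns keys to shards; the function name `shard_for` and the choice of SHA-256 are assumptions.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # A stable hash keeps the mapping identical across processes and restarts,
    # unlike Python's built-in hash(), which is salted per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Even with an even hash, the choice of sharding key matters: if a single key (for example, one very active user) owns a large share of the rows, its shard still becomes a hotspot.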

Common Pitfalls

1. Failing to account for the cumulative load of asynchronous jobs can lead to database overload.

When multiple jobs are triggered simultaneously, especially during high-volume operations, it can overwhelm the database, leading to failures. It's crucial to optimize job processing and monitor load effectively.
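
One common way to cap the cumulative load of many jobs is a token bucket in front of the job queue or the database client. The sketch below is a minimal single-threaded version under stated assumptions: the class name `TokenBucket`, its `rate` and `capacity` parameters, and the injectable `clock` are illustrative and not taken from the source.

```python
import time


class TokenBucket:
    """Hypothetical throttle: allows short bursts, caps the sustained rate."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable for testing
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Return True and spend tokens if the work may run now, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A worker would call `try_acquire()` before running each job and requeue or delay the job when it returns `False`, so a burst of simultaneous jobs is spread out instead of hitting the database at once.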

Related Concepts

Database Sharding

Asynchronous Job Management

Database Replication

Performance Optimization Strategies
