The Query Strikes Again

On Wednesday, 12 Oct. 2022, the EMEA part of the Datastores team (the team responsible for Slack’s database clusters) was having an onsite day in Amsterdam, the Netherlands. We were sitting together for the first time after new engineers had joined the team, when suddenly a few of us were paged: There was an…

Emad Mokhtar
16 min read · advanced

Overview

The article discusses a critical incident experienced by Slack's Datastores team due to a spike in database load from a mass user deletion, leading to failed queries and system instability. It details the root causes, the response strategies employed, and the preventive measures implemented to avoid future occurrences.

What You'll Learn

1. How to manage database load during high-volume operations
2. Why sharding is essential for large datasets in MySQL
3. How to optimize asynchronous job processing in a database

Prerequisites & Requirements

  • Understanding of database sharding and replication concepts
  • Experience with MySQL and asynchronous job management (optional)

Key Questions Answered

What caused the database overload during the incident?
The database overload was triggered by a customer removing a large number of users in a single operation, which initiated the 'forget user' asynchronous job. This led to a spike in database load as the job required unsubscribing users from multiple channels and threads, overwhelming the database shards.
How did Slack's team respond to the database failures?
The team responded by disabling the problematic job, manually provisioning larger replicas to mitigate memory issues, and optimizing the 'leave channel' job to reduce database contention. These actions helped stabilize the database and restore service.
What measures are being taken to prevent similar incidents?
To prevent future incidents, Slack's Datastores team has implemented throttling mechanisms and the circuit breaker pattern to manage query loads effectively. These strategies help protect the database from being overwhelmed by excessive queries.
What optimizations were made to the 'forget user' job?
The 'forget user' job was optimized to issue a single 'unsubscribe from all threads' job instead of multiple 'leave channel' jobs, significantly reducing database contention during user deletions. This change improves performance during high-volume operations.
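
The circuit breaker pattern mentioned in the prevention answer above can be sketched in a few lines. This is a minimal, single-threaded illustration and not Slack's implementation: the class name `CircuitBreaker`, the exception `CircuitOpenError`, and the parameters `failure_threshold` and `reset_timeout` are all assumptions made for the example. After a run of consecutive failures the breaker "opens" and rejects calls immediately, so a struggling database is not hit with more queries; once the timeout has elapsed it lets a trial call through and closes again if that call succeeds.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self._clock = clock                 # injectable for testing
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast instead of querying the database.
                raise CircuitOpenError("circuit open; query rejected")
            # Timeout elapsed: half-open, let this trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success: reset the failure count and close the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each query, for example `breaker.call(run_query, sql)` where `run_query` is whatever function executes the statement, and treat `CircuitOpenError` as a signal to back off or retry later. A production version would also need locking and a limit on concurrent trial calls.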

Key Statistics & Figures

Share of user subscription data held by the overloaded shard
6%
This is the portion of user subscription data that was concentrated in a single shard, which contributed to the overload during the incident.

Key Actionable Insights

1. Implement sharding in your database architecture to handle large datasets more efficiently.
   Sharding distributes data across multiple database instances, reducing the load on each one and improving overall performance, especially during high-volume operations.
2. Optimize asynchronous job processing to minimize database contention.
   By refining job logic to limit the scope of database queries, you can significantly reduce the load on your database during peak operations.
3. Use monitoring tools to identify and respond to database load spikes in real time.
   A robust monitoring system lets teams make informed decisions quickly, preventing incidents from escalating and impacting users.
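
The second insight, refining job logic to limit the scope of database work, is what the 'forget user' change described earlier amounts to: one 'unsubscribe from all threads' job per user instead of one 'leave channel' job per channel. The sketch below only models how many jobs get enqueued under each plan. The `Job` type, the function names, and the job name strings are hypothetical and do not reflect Slack's actual job queue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical queue entry; not Slack's real job schema."""
    name: str
    user_id: int
    channel_id: Optional[int] = None


def plan_fan_out(user_id, channel_ids):
    """One 'leave channel' job per channel the user belongs to."""
    return [Job("leave_channel", user_id, c) for c in channel_ids]


def plan_coalesced(user_id, channel_ids):
    """A single job covering all of the user's thread subscriptions."""
    return [Job("unsubscribe_from_all_threads", user_id)]


def jobs_for_mass_deletion(users, planner):
    """Collect the jobs a planner would enqueue for every deleted user.

    `users` maps a user id to the list of channel ids that user is in.
    """
    return [job
            for user_id, channel_ids in users.items()
            for job in planner(user_id, channel_ids)]
```

With `users = {1: [10, 11, 12], 2: [10, 13]}`, the fan-out plan enqueues five jobs and the coalesced plan enqueues two. The fan-out count grows with users times channels, while the coalesced count grows only with the number of users, which is why the difference matters most during a mass deletion.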

Common Pitfalls

1. Failing to account for the cumulative load of asynchronous jobs can lead to database overload.
   When multiple jobs are triggered simultaneously, especially during high-volume operations, it can overwhelm the database, leading to failures. It's crucial to optimize job processing and monitor load effectively.
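
One common way to bound the cumulative load described above is to throttle how fast workers may start jobs. The following token bucket is a minimal sketch under assumed names (`TokenBucket`, `rate`, `capacity`, `try_acquire`); the source says throttling was introduced but does not describe the mechanism, so this illustrates the general technique rather than Slack's design.

```python
import time


class TokenBucket:
    """Hypothetical throttle: allows short bursts, caps the sustained rate."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self._clock = clock              # injectable for testing
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost=1.0):
        """Return True and spend `cost` tokens if the job may run now."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A worker would call `try_acquire()` before picking up a job and, on `False`, leave the job queued and try again later. Heavier jobs can pass a larger `cost` so that they consume more of the budget.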

Related Concepts

Database Sharding
Asynchronous Job Management
Database Replication
Performance Optimization Strategies