Scaling Slack’s Job Queue

Saroj Yadav

Slack uses a job queue system for business logic that is too time-consuming to run in the context of a web request. This system is a critical component of our architecture, used for every Slack message post, push notification, URL unfurl, calendar reminder, and billing calculation. On our busiest days, the system processes over 1.4…

Slack

AWS, Chef, Consul, JSON, PHP, Redis

Overview

The article describes how Slack scaled its job queue system, which processes over a billion jobs on peak days, by combining Kafka with Redis. It details the architectural changes made to improve performance, reliability, and operational flexibility in job execution.

What You'll Learn

1. How to integrate Kafka with existing job queue systems
2. Why decoupling job execution from Redis improves scalability
3. How to implement Kafkagate for efficient job enqueuing

Prerequisites & Requirements

Understanding of distributed systems and message queues
Familiarity with Kafka and Redis (optional)

Key Questions Answered

What architectural changes were made to Slack's job queue?

Slack placed Kafka in front of Redis as durable storage instead of replacing Redis outright. Jobs are enqueued into Kafka through Kafkagate, and JQRelay relays them from Kafka into Redis for execution. This keeps bursts of enqueued jobs out of Redis's memory and addresses the scalability, reliability, and operational limitations of the previous architecture.

How does Kafkagate facilitate job enqueuing?

Kafkagate is a stateless service that exposes an HTTP POST interface for enqueuing jobs into Kafka. It uses the Sarama Golang driver to relay requests to Kafka while maintaining persistent connections, ensuring low latency and high availability for job submissions from the web application.

What were the performance metrics of Slack's job queue system?

On peak days, Slack's job queue processes over 1.4 billion jobs, with a peak rate of 33,000 jobs per second. This high throughput demonstrates the system's capability to handle large volumes of tasks efficiently.

Key Statistics & Figures

Peak job processing rate

33,000 jobs per second

This rate is achieved during Slack's busiest days, showcasing the system's scalability.

Total jobs processed on peak days

1.4 billion jobs

This is the total number of jobs the system handles on Slack's busiest days.
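
As a rough sanity check on how the two figures relate, the short Go sketch below converts the peak-day total into an average per-second rate and compares it with the quoted peak rate. It assumes, for illustration only, that both figures describe the same day and that the daily total is averaged evenly over 24 hours.

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60

// averageRate converts a daily job count into an average jobs-per-second rate.
func averageRate(jobsPerDay float64) float64 {
	return jobsPerDay / secondsPerDay
}

func main() {
	avg := averageRate(1.4e9) // peak-day total from the article
	peak := 33000.0           // peak rate from the article
	fmt.Printf("average: %.0f jobs/s, peak/average: %.2f\n", avg, peak/avg)
	// prints "average: 16204 jobs/s, peak/average: 2.04"
}
```

Under these assumptions the peak rate is roughly twice the day's average, which suggests the system has to absorb bursts well above its steady load.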

Technologies & Tools

Backend

Kafka

Used for durable storage and buffering of jobs, so that bursts of enqueues do not exhaust Redis's memory.

Backend

Redis

Initially the sole store for queued jobs; now sits behind Kafka and holds jobs for execution.

Programming Language

Go

Used to develop Kafkagate and JQRelay for job enqueuing and relaying tasks.

Key Actionable Insights

1. Implementing Kafka in front of Redis can significantly enhance job queue performance by providing durable storage and preventing memory exhaustion.
   This approach allows for high enqueue rates without overwhelming the Redis cluster, which is crucial for maintaining system availability during peak loads.

2. Using a stateless service like Kafkagate for job enqueuing can simplify integration with existing applications while optimizing performance.
   Kafkagate's design minimizes latency and ensures that the application can handle job submissions efficiently, which is essential for high-demand environments.

3. Regular load and failure testing of the Kafka cluster is vital to ensure system reliability and performance under various conditions.
   By simulating different failure scenarios, teams can identify potential weaknesses in the architecture and address them proactively, ensuring robust operation in production.
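
The first insight above, a durable buffer in front of a memory-bound queue, can be sketched in Go as follows. This is a toy model, not the JQRelay implementation: the slice `log` stands in for a Kafka partition, `boundedQueue` is a hypothetical stand-in for Redis with a fixed capacity, and `relay` is an assumed name for the step that copies jobs from one to the other.

```go
package main

import "fmt"

// boundedQueue is a hypothetical stand-in for Redis: it refuses writes
// once it holds `capacity` jobs, modelling a fixed memory budget.
type boundedQueue struct {
	capacity int
	jobs     []string
}

// push adds a job and reports whether there was room for it.
func (q *boundedQueue) push(job string) bool {
	if len(q.jobs) >= q.capacity {
		return false
	}
	q.jobs = append(q.jobs, job)
	return true
}

// pop removes up to n jobs, simulating workers completing them.
func (q *boundedQueue) pop(n int) int {
	if n > len(q.jobs) {
		n = len(q.jobs)
	}
	q.jobs = q.jobs[n:]
	return n
}

// relay copies jobs from the durable log into the queue, starting at offset,
// and stops as soon as the queue is full. It returns the new offset, so
// unrelayed jobs simply wait in the log instead of being lost.
func relay(log []string, offset int, q *boundedQueue) int {
	for offset < len(log) && q.push(log[offset]) {
		offset++
	}
	return offset
}

func main() {
	log := make([]string, 10) // stand-in for a Kafka partition
	for i := range log {
		log[i] = fmt.Sprintf("job-%d", i)
	}
	q := &boundedQueue{capacity: 4}

	offset := relay(log, 0, q)
	fmt.Println(offset, len(q.jobs)) // prints "4 4"

	q.pop(3) // workers finish three jobs
	offset = relay(log, offset, q)
	fmt.Println(offset, len(q.jobs)) // prints "7 4"
}
```

The point of the sketch is that a burst of ten enqueues never pushes the bounded queue past its capacity; the backlog accumulates in the log, where running out of memory is not the limiting factor.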

Common Pitfalls

1. Failing to account for memory limits in Redis can lead to system outages during high enqueue rates.
   This issue arises when the enqueue rate exceeds the dequeue rate for long enough that Redis runs out of memory, at which point new jobs can no longer be enqueued.

2. Overloading Redis with too many job workers can create a feedback loop that hampers performance.
   When job workers are added without considering Redis's capacity, the extra polling increases load on Redis, which in turn slows down job processing.
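
The first pitfall above comes down to simple arithmetic: if jobs arrive faster than they leave, the backlog grows linearly until memory is gone. The Go sketch below estimates how long that takes. Only the 33,000 jobs-per-second enqueue rate comes from the article; the dequeue rate, job size, and memory budget are hypothetical values chosen for illustration.

```go
package main

import "fmt"

// secondsUntilFull estimates how long a queue can absorb a sustained
// imbalance between enqueue and dequeue rates before its memory is used up.
// It returns false when the backlog is not growing.
func secondsUntilFull(enqueuePerSec, dequeuePerSec, bytesPerJob, memoryBytes float64) (float64, bool) {
	growth := (enqueuePerSec - dequeuePerSec) * bytesPerJob // bytes added per second
	if growth <= 0 {
		return 0, false
	}
	return memoryBytes / growth, true
}

func main() {
	const (
		enqueue = 33000.0                   // jobs/s, peak rate from the article
		dequeue = 30000.0                   // jobs/s, hypothetical
		jobSize = 1024.0                    // bytes per job, hypothetical
		memory  = 16 * 1024 * 1024 * 1024.0 // 16 GiB, hypothetical
	)
	if secs, ok := secondsUntilFull(enqueue, dequeue, jobSize, memory); ok {
		fmt.Printf("memory exhausted after about %.0f minutes\n", secs/60)
	}
}
```

With these assumed numbers, a shortfall of under ten percent in dequeue capacity fills the memory budget in roughly an hour and a half, which illustrates why a sustained imbalance, rather than a momentary spike, is the dangerous case.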

Related Concepts

Distributed Systems

Message Queuing

Job Scheduling

Scalability In Software Architecture
