Building Self-driving Kafka clusters using open source components

Suman Karumuri

In this article, I will talk about how Slack uses Kafka, and how a small-but-mighty team built and operationalized a self-driving Kafka cluster over the last four years to run at scale. Kafka is used at Slack as a pub-sub system, playing an essential role in the all-important Job Queue, our asynchronous job execution framework…

Slack


Apache, AWS, Chef, Consul, Prometheus, Python, Terraform, TypeScript

Overview

This article discusses how Slack built and operationalized self-driving Kafka clusters using open source components over four years. It highlights the challenges faced, the automation of Kafka operations, and the architecture that supports Slack's data processing needs.

What You'll Learn

1. How to automate Kafka operational tasks to reduce overhead

2. Why tuning partition counts can improve Kafka cluster stability

3. How to implement chaos engineering practices for Kafka

4. When to split Kafka clusters to enhance performance

Prerequisites & Requirements

Understanding of Kafka architecture and operations
Familiarity with Chef and Terraform (optional)

Key Questions Answered

What challenges did Slack face in managing Kafka clusters?

Slack faced fragmentation of Kafka versions and duplication of efforts across teams, leading to operational inefficiencies. They aimed to standardize and automate Kafka management to reduce the overhead associated with daily operations.

How does Slack ensure stability in their Kafka clusters?

Slack ensures stability by tuning partition counts to be multiples of broker counts, which helps in evenly distributing load. They also limit replication bandwidth during partition rebalances to prevent resource starvation for producers and consumers.
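As a rough illustration of the partition-count rule above, the following sketch rounds a desired partition count up to the nearest multiple of the broker count. The function name and the sample numbers are hypothetical and do not come from Slack's tooling.

```python
def round_up_to_multiple(desired_partitions: int, broker_count: int) -> int:
    """Return the smallest multiple of broker_count that is >= desired_partitions."""
    if desired_partitions < 1 or broker_count < 1:
        raise ValueError("both arguments must be positive")
    # Ceiling division (via negated floor division), then scale back up.
    return -(-desired_partitions // broker_count) * broker_count


# Hypothetical example: a topic sized for 50 partitions on a 12-broker cluster.
print(round_up_to_multiple(50, 12))  # 60
```

With 60 partitions on 12 brokers, each broker can hold exactly 5 partitions of the topic, which is the even spread the answer describes.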

What is the process for upgrading Kafka clusters at Slack?

Slack employs a cutover process for upgrading Kafka clusters, which includes starting a new cluster, running validation tests, stopping data production to the old cluster, and then switching to the new cluster. This minimizes downtime and ensures a smooth transition.
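The cutover steps above can be sketched as an ordered list of checks that halts at the first failure. This is only a minimal sketch: the step names mirror the summary, while the `run_cutover` helper and the stub actions are hypothetical placeholders, not Slack's actual automation.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_cutover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # Do not continue the cutover past a failed step.
        completed.append(name)
    return completed


# Stub actions stand in for real operations (all hypothetical).
steps: List[Step] = [
    ("start new cluster", lambda: True),
    ("run validation tests", lambda: True),
    ("stop producing to old cluster", lambda: True),
    ("switch to new cluster", lambda: True),
]
print(run_cutover(steps))
```

Stopping at the first failed step means producers are never cut off from the old cluster unless the new one has already passed validation.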

What tools does Slack use for managing Kafka clusters?

Slack utilizes tools like Chef for OS management, Terraform for provisioning, and Cruise Control for automating cluster rebalances. Additionally, they use Kafka Manager for operational visibility and monitoring.

Key Statistics & Figures

Data managed by Slack's Kafka clusters: 0.7 petabytes
This data is distributed across 10 Kafka clusters running on hundreds of nodes.

Messages processed per second: millions
Slack's Kafka infrastructure achieves an aggregate throughput of 6.5 Gbps at peak.

On-call alerts for the logging pipeline: 71 alerts in a month
After migrating topics to smaller clusters, alerts dropped to 9 in the following month.

Improvement in log latency: from 1.5 hours to 3-4 minutes
This improvement was observed after splitting large topics off the main Kafka cluster.

Technologies & Tools


Kafka (Backend): Used as a pub-sub system for managing asynchronous job execution and data movement.

Chef (Tools): Used to manage the base OS and deploy Kafka software.

Terraform (Tools): Used for provisioning and managing infrastructure in AWS.

Cruise Control (Tools): Automates cluster rebalance operations to ensure even utilization of nodes.

Kafka Manager (Tools): Provides visibility into Kafka cluster metadata and simplifies routine operations.

Prometheus (Monitoring): Used for exporting consumer offset information as metrics.

Key Actionable Insights

1. Automate routine Kafka operations to minimize manual intervention and reduce on-call burdens.
   By automating tasks like topic creation and partition management, teams can focus on more strategic initiatives rather than day-to-day operational issues.

2. Implement chaos engineering practices to test the resilience of Kafka clusters under load.
   Conducting controlled chaos experiments helps identify potential failure modes and improves the overall reliability of the Kafka infrastructure.

3. Standardize Kafka configurations and operational runbooks to enhance team efficiency.
   Having a single source of truth for Kafka operations reduces confusion and ensures that all team members are aligned on best practices.

4. Monitor consumer offsets and cluster health using Prometheus metrics.
   Real-time monitoring allows teams to quickly identify and address issues related to consumer lag and cluster performance.
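To make the consumer-offset insight concrete, here is a minimal sketch of how per-partition lag could be derived from offsets and rendered as Prometheus-style metric lines. The metric name `kafka_consumer_lag`, the label names, and the sample offsets are assumptions for illustration; the summary does not say which exporter or metric names Slack uses. A partition with no committed offset is treated as committed at 0, which is also an assumption.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus committed offset, never below 0."""
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def to_prometheus_lines(group: str, topic: str, lag: Dict[int, int]) -> List[str]:
    """Render lag values as text lines with a hypothetical metric name."""
    return [
        f'kafka_consumer_lag{{group="{group}",topic="{topic}",partition="{p}"}} {value}'
        for p, value in sorted(lag.items())
    ]


# Hypothetical offsets for a two-partition topic.
lag = consumer_lag({0: 100, 1: 50}, {0: 90})
for line in to_prometheus_lines("job-workers", "jobs", lag):
    print(line)
```

An alert on a metric like this would fire when lag keeps growing, which is the signal that consumers are falling behind producers.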

Common Pitfalls

1. Failing to properly manage partition counts can lead to hot spotting in Kafka clusters.
   Hot spotting occurs when some brokers handle significantly more load than others, causing instability. To avoid this, ensure that partition counts are multiples of the number of brokers.

2. Neglecting chaos engineering can result in unpreparedness for real-world failures.
   Without testing the system under stress, teams may overlook critical failure modes that could impact service reliability.
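The hot-spotting pitfall can be seen in a small simulation. The sketch below assumes an idealized round-robin placement in which partition p lands on broker p modulo the broker count; real Kafka assignment is more involved, so treat this only as an illustration of why non-multiples leave some brokers with extra partitions. All names are hypothetical.

```python
from collections import Counter
from typing import List


def partitions_per_broker(partition_count: int, broker_count: int) -> List[int]:
    """Count partitions per broker under idealized round-robin placement."""
    counts = Counter(p % broker_count for p in range(partition_count))
    return [counts[b] for b in range(broker_count)]


def imbalance(partition_count: int, broker_count: int) -> int:
    """Difference between the most and least loaded broker."""
    per_broker = partitions_per_broker(partition_count, broker_count)
    return max(per_broker) - min(per_broker)


print(partitions_per_broker(10, 4))  # [3, 3, 2, 2]
print(partitions_per_broker(12, 4))  # [3, 3, 3, 3]
```

With 10 partitions on 4 brokers, two brokers carry 50% more partitions of the topic than the others; with 12 partitions the spread is even.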

Related Concepts

Kafka Architecture And Operations

Automation In Infrastructure Management

Chaos Engineering Principles

Monitoring And Observability In Distributed Systems
