Building Self-driving Kafka clusters using open source components

In this article, I will talk about how Slack uses Kafka, and how a small-but-mighty team built and operationalized a self-driving Kafka cluster over the last four years to run at scale. Kafka is used at Slack as a pub-sub system, playing an essential role in the all-important Job Queue, our asynchronous job execution framework…

Suman Karumuri
14 min read · intermediate

Overview

This article discusses how Slack built and operationalized self-driving Kafka clusters using open source components over four years. It highlights the challenges faced, the automation of Kafka operations, and the architecture that supports Slack's data processing needs.

What You'll Learn

  1. How to automate Kafka operational tasks to reduce overhead
  2. Why tuning partition counts can improve Kafka cluster stability
  3. How to implement chaos engineering practices for Kafka
  4. When to split Kafka clusters to enhance performance

Prerequisites & Requirements

  • Understanding of Kafka architecture and operations
  • Familiarity with Chef and Terraform (optional)

Key Questions Answered

What challenges did Slack face in managing Kafka clusters?
Slack faced fragmentation of Kafka versions and duplication of efforts across teams, leading to operational inefficiencies. They aimed to standardize and automate Kafka management to reduce the overhead associated with daily operations.
How does Slack ensure stability in their Kafka clusters?
Slack ensures stability by tuning partition counts to be multiples of broker counts, which helps in evenly distributing load. They also limit replication bandwidth during partition rebalances to prevent resource starvation for producers and consumers.
What is the process for upgrading Kafka clusters at Slack?
Slack employs a cutover process for upgrading Kafka clusters, which includes starting a new cluster, running validation tests, stopping data production to the old cluster, and then switching to the new cluster. This minimizes downtime and ensures a smooth transition.
What tools does Slack use for managing Kafka clusters?
Slack utilizes tools like Chef for OS management, Terraform for provisioning, and Cruise Control for automating cluster rebalances. Additionally, they use Kafka Manager for operational visibility and monitoring.
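
The cutover sequence described in the upgrade answer above can be sketched as an ordered list of gated steps. This is a minimal illustration and not Slack's tooling: the step names mirror the answer, while `run_cutover`, `build_steps`, and the stub actions are hypothetical placeholders that a real operator would replace with calls into their own provisioning and validation systems.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (name, action); an action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_cutover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first failure.

    Returns (completed step names, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name  # halt so later steps never run
        completed.append(name)
    return completed, None


def build_steps(validation_passes: bool) -> List[Step]:
    # Hypothetical stubs standing in for real operations.
    return [
        ("start new cluster", lambda: True),
        ("run validation tests", lambda: validation_passes),
        ("stop producing to old cluster", lambda: True),
        ("switch to new cluster", lambda: True),
    ]


if __name__ == "__main__":
    print(run_cutover(build_steps(validation_passes=True)))
    print(run_cutover(build_steps(validation_passes=False)))
```

Because a failed validation halts the sequence before producers are stopped, the old cluster keeps serving traffic in this sketch.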

Key Statistics & Figures

Data managed by Slack's Kafka clusters
0.7 petabytes
This data is distributed across 10 Kafka clusters running on hundreds of nodes.
Messages processed per second
millions
Slack's Kafka infrastructure achieves an aggregate throughput of 6.5 Gbps at peak.
On-call alerts for the logging pipeline
71 alerts in a month
After migrating topics to smaller clusters, alerts dropped to 9 in the following month.
Improvement in log latency
from 1.5 hours to 3-4 minutes
This improvement was observed after splitting large topics off the main Kafka cluster.
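
To put the last two figures in relative terms, the arithmetic can be worked through as below. The inputs are the numbers quoted above; the helper names `percent_reduction` and `speedup` are illustrative only.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from before to after, as a percentage."""
    return (before - after) / before * 100.0


def speedup(before: float, after: float) -> float:
    """How many times smaller the new value is than the old one."""
    return before / after


# On-call alerts: 71 in one month, 9 in the following month.
alerts_drop = percent_reduction(71, 9)

# Log latency: 1.5 hours before, 3 to 4 minutes after.
latency_before_min = 1.5 * 60
speedup_low = speedup(latency_before_min, 4)   # slower end of the new range
speedup_high = speedup(latency_before_min, 3)  # faster end of the new range

print(f"alerts reduced by {alerts_drop:.1f}%")            # 87.3%
print(f"latency improved {speedup_low}x to {speedup_high}x")  # 22.5x to 30.0x
```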

Technologies & Tools

  • Kafka (backend): Used as a pub-sub system for managing asynchronous job execution and data movement.
  • Chef (tool): Used to manage the base OS and deploy Kafka software.
  • Terraform (tool): Used for provisioning and managing infrastructure in AWS.
  • Cruise Control (tool): Automates cluster rebalance operations to ensure even utilization of nodes.
  • Kafka Manager (tool): Provides visibility into Kafka cluster metadata and simplifies routine operations.
  • Prometheus (monitoring): Used for exporting consumer offset information as metrics.

Key Actionable Insights

  1. Automate routine Kafka operations to minimize manual intervention and reduce on-call burdens.
     By automating tasks like topic creation and partition management, teams can focus on more strategic initiatives rather than day-to-day operational issues.
  2. Implement chaos engineering practices to test the resilience of Kafka clusters under load.
     Conducting controlled chaos experiments helps identify potential failure modes and improves the overall reliability of the Kafka infrastructure.
  3. Standardize Kafka configurations and operational runbooks to enhance team efficiency.
     Having a single source of truth for Kafka operations reduces confusion and ensures that all team members are aligned on best practices.
  4. Monitor consumer offsets and cluster health using Prometheus metrics.
     Real-time monitoring allows teams to quickly identify and address issues related to consumer lag and cluster performance.
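
The consumer-offset monitoring in the last insight comes down to one subtraction per partition: log end offset minus committed offset. The sketch below computes that lag and renders it in the Prometheus text exposition format. The metric name `kafka_consumer_lag`, the label set, and the sample offsets are assumptions made for illustration; the source does not say which exporter or metric names Slack uses.

```python
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]


def consumer_lag(end_offsets: Dict[TopicPartition, int],
                 committed: Dict[TopicPartition, int]) -> Dict[TopicPartition, int]:
    """Lag per partition = log end offset - committed offset, floored at 0."""
    lag: Dict[TopicPartition, int] = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        lag[tp] = max(end - committed.get(tp, 0), 0)
    return lag


def to_prometheus_lines(group: str, lag: Dict[TopicPartition, int]) -> List[str]:
    """Render lag values as Prometheus text-format samples."""
    lines: List[str] = []
    for (topic, partition), value in sorted(lag.items()):
        lines.append(
            f'kafka_consumer_lag{{group="{group}",topic="{topic}",'
            f'partition="{partition}"}} {value}'
        )
    return lines


if __name__ == "__main__":
    # Made-up offsets for a hypothetical "jobs" topic.
    ends = {("jobs", 0): 1200, ("jobs", 1): 950}
    done = {("jobs", 0): 1150, ("jobs", 1): 950}
    for line in to_prometheus_lines("job-workers", consumer_lag(ends, done)):
        print(line)
```

An alert on this value rising steadily for a consumer group is one common way to catch a stalled consumer before a backlog builds up.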

Common Pitfalls

  1. Failing to properly manage partition counts can lead to hot spotting in Kafka clusters.
     Hot spotting occurs when some brokers handle significantly more load than others, causing instability. To avoid this, ensure that partition counts are multiples of the number of brokers.
  2. Neglecting chaos engineering can result in unpreparedness for real-world failures.
     Without testing the system under stress, teams may overlook critical failure modes that could impact service reliability.
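
The partition-count rule in the first pitfall can be checked with a small simulation. The sketch below spreads partitions over brokers round-robin and counts how many land on each one. It is a deliberate simplification: Kafka's actual replica placement is more involved, and the function names here are hypothetical.

```python
from collections import Counter
from typing import Dict


def partitions_per_broker(num_partitions: int, num_brokers: int) -> Dict[int, int]:
    """Assign partition p to broker p % num_brokers and count per broker."""
    counts = Counter(p % num_brokers for p in range(num_partitions))
    return {b: counts.get(b, 0) for b in range(num_brokers)}


def round_up_to_multiple(num_partitions: int, num_brokers: int) -> int:
    """Smallest multiple of num_brokers that is >= num_partitions."""
    return -(-num_partitions // num_brokers) * num_brokers


if __name__ == "__main__":
    # 10 partitions on 4 brokers: two brokers carry 3, two carry 2.
    print(partitions_per_broker(10, 4))
    # Rounding up to 12 gives every broker exactly 3.
    print(partitions_per_broker(round_up_to_multiple(10, 4), 4))
```

With 10 partitions on 4 brokers, the busiest brokers carry 50% more partitions than the lightest ones; at 12 partitions the split is even.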

Related Concepts

Kafka Architecture And Operations
Automation In Infrastructure Management
Chaos Engineering Principles
Monitoring And Observability In Distributed Systems