Continuous Load Testing

Shreya Ramesh

Building load test infrastructure is tricky and poses many questions. How can we identify performance regressions in newly deployed builds, given the overhead of spinning up test clients? To gather the most representative results, should we load test at our peak hours or when there’s a lull? How do we incentivize engineers to invest time…

Slack


AWS, Chef, DynamoDB, Jenkins, JSON, Kubernetes, Prometheus

Overview

The article discusses the implementation of continuous load testing at Slack using a tool called Koi Pond. It highlights the challenges faced, the technical background of the solution, and the benefits of integrating load testing into the development process.

What You'll Learn

1. How to implement continuous load testing using Koi Pond

2. Why building a culture of performance is crucial in software development

3. How to ensure safety and resilience in load testing environments

Prerequisites & Requirements

Understanding of load testing concepts and practices
Familiarity with Kubernetes and AWS services (optional)

Key Questions Answered

What is Koi Pond and how does it facilitate load testing?

Koi Pond is Slack's load testing tool. It simulates user behavior by making API requests and sending messages over WebSocket. The simulated clients run inside Kubernetes pods, and because the tests run continuously, engineers can spot performance regressions in real time.

What safety measures are implemented in continuous load testing?

Safety measures include the Automatic Shutdown service, which halts load testing if performance metrics fall below defined thresholds. This ensures that load tests do not negatively impact production services and helps maintain system integrity.

How does Koi Pond ensure resilience during load testing?

Koi Pond is backed by AWS DynamoDB, which persists load test data so that state survives pod restarts. This resilience is crucial for continuous testing, and the stored data also supports analysis of historical performance.
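
The idea of keeping load test state outside the pod can be sketched as follows. This is a minimal illustration, not Koi Pond's actual code: `StateStore`, `InMemoryStore`, `LoadTestController`, and the state fields are all hypothetical names, and the in-memory dictionary only stands in for an external table such as DynamoDB.

```python
from typing import Dict, Optional, Protocol


class StateStore(Protocol):
    """Hypothetical interface for an external key-value store."""

    def put(self, key: str, item: Dict[str, object]) -> None: ...

    def get(self, key: str) -> Optional[Dict[str, object]]: ...


class InMemoryStore:
    """Stand-in for an external table; outlives any single controller object."""

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, object]] = {}

    def put(self, key: str, item: Dict[str, object]) -> None:
        self._items[key] = dict(item)  # store a copy, as a remote store would

    def get(self, key: str) -> Optional[Dict[str, object]]:
        item = self._items.get(key)
        return dict(item) if item is not None else None


class LoadTestController:
    """Hypothetical controller that reloads its state from the store on startup."""

    def __init__(self, store: StateStore, test_id: str) -> None:
        self.store = store
        self.test_id = test_id
        saved = store.get(test_id)
        # Resume from persisted state if present, otherwise start idle.
        self.state = saved if saved is not None else {"status": "idle", "target_koi": 0}

    def start(self, target_koi: int) -> None:
        self.state = {"status": "running", "target_koi": target_koi}
        self.store.put(self.test_id, self.state)  # persist before doing any work
```

Constructing a second `LoadTestController` with the same store and test ID models a pod restart: the new instance picks up the running test instead of starting from scratch.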

What are the benefits of integrating load testing into release cycles?

Integrating load testing into release cycles allows teams to verify the performance of features before deployment. It helps catch performance regressions early, ensuring a smoother user experience and reducing the risk of incidents post-release.

Key Statistics & Figures

Maximum Koi per School

5,000

Koi are spun up in Kubernetes pods, referred to as Schools, with a maximum of 5,000 koi per School.
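
As a rough illustration of what the per-School cap implies for capacity planning, the arithmetic can be written out as below. Only the 5,000 limit comes from the article; the function names and the even-spread policy are assumptions made for this sketch.

```python
from typing import List

MAX_KOI_PER_SCHOOL = 5_000  # per-School cap stated in the article


def schools_needed(total_koi: int) -> int:
    """Return the minimum number of Schools needed to host total_koi."""
    if total_koi < 0:
        raise ValueError("total_koi must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_koi // MAX_KOI_PER_SCHOOL)


def koi_per_school(total_koi: int) -> List[int]:
    """Spread koi as evenly as possible across the minimum number of Schools."""
    count = schools_needed(total_koi)
    if count == 0:
        return []
    base, extra = divmod(total_koi, count)
    # The first `extra` Schools each take one additional koi.
    return [base + 1 if i < extra else base for i in range(count)]
```

For example, a test with 12,000 koi needs at least three Schools under this cap.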

API success rate threshold

95%

If the web API success rate is sustained below 95% for five minutes, the Automatic Shutdown service will halt active load tests.
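
The "sustained below a threshold" rule can be sketched as a small state machine. This is an assumption-laden illustration rather than the Automatic Shutdown service itself: the 95% threshold and five-minute window come from the article, while `ShutdownMonitor`, its fields, and the sampling model (timestamps in seconds, one success-rate sample per call) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SUCCESS_RATE_THRESHOLD = 0.95  # from the article
SUSTAINED_SECONDS = 5 * 60     # from the article


@dataclass
class ShutdownMonitor:
    """Hypothetical monitor that tracks how long the success rate has been low."""

    threshold: float = SUCCESS_RATE_THRESHOLD
    sustained_seconds: float = SUSTAINED_SECONDS
    below_since: Optional[float] = None  # timestamp of the first low sample

    def observe(self, timestamp: float, success_rate: float) -> bool:
        """Record one sample; return True if load tests should be halted."""
        if success_rate >= self.threshold:
            # Any healthy sample resets the window.
            self.below_since = None
            return False
        if self.below_since is None:
            self.below_since = timestamp
        return timestamp - self.below_since >= self.sustained_seconds
```

Resetting the window on any healthy sample means a brief dip does not stop a test; only an uninterrupted five minutes below the threshold does.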

Technologies & Tools


Orchestration

Kubernetes

Used to manage the deployment of Koi Pond pods for load testing.

Database

AWS DynamoDB

Provides a NoSQL database backend for Koi Pond to persist load test data.

Monitoring

Prometheus

Used for querying metrics to support the Automatic Shutdown service.
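
A service that reads metrics from Prometheus might do so through its HTTP API, which exposes instant queries at `/api/v1/query`. The sketch below only builds the request URL and parses a response body, so it runs without a server. The helper names, the host, and the metric name `api_requests_total` are assumptions for illustration; the article does not say which queries Slack runs.

```python
import json
from urllib.parse import urlencode

# Hypothetical PromQL: ratio of successful to total API requests over 5 minutes.
EXAMPLE_QUERY = (
    'sum(rate(api_requests_total{status="ok"}[5m]))'
    " / sum(rate(api_requests_total[5m]))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_single_value(body: str) -> float:
    """Extract the value from an instant-query response with exactly one sample."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected one sample, got {len(result)}")
    _timestamp, value = result[0]["value"]  # Prometheus returns values as strings
    return float(value)
```

The parsed value could then be compared against a threshold such as the 95% success rate described above.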

Key Actionable Insights

1. Implement continuous load testing to proactively identify performance issues before they reach production.
By continuously running load tests, teams can catch performance regressions early, which reduces the risk of negative impacts on user experience during high-traffic events.

2. Utilize the Automatic Shutdown service to safeguard production environments during load testing.
This service helps prevent load tests from causing disruptions by automatically stopping tests if performance metrics drop below acceptable levels, thus maintaining system integrity.

3. Leverage historical data from continuous load testing to validate significant changes in the system.
With a robust dataset reflecting the usage of large customers, teams can confidently deploy changes, knowing they have tested against realistic load scenarios.

Common Pitfalls

1. Failing to account for shared infrastructure during load testing can lead to unintended consequences.
Since some parts of the load test environment are shared with production, it's crucial to implement safety features to prevent load tests from affecting live services.

Related Concepts

Load Testing

Performance Engineering

Continuous Integration/Continuous Deployment (CI/CD)
