Load Testing with Koi Pond

Shreya Ramesh

Complex systems are difficult to reason about at scale; we often can’t accurately extrapolate system behavior and performance, so we need to derive that data empirically. We use load testing to do just that: find the limits of our systems and weed out bugs at a large scale in a controlled environment. Slack is a…

Slack


16 min read • advanced


Chef, Golang, Java, JSON, Kubernetes, Puppet, Python, TypeScript

Overview

The article discusses Slack's approach to load testing using a tool called Koi Pond, which simulates user interactions to assess system performance under heavy loads. It highlights the complexities of testing a system like Slack and the evolution of their load testing strategies to ensure reliability and scalability.

What You'll Learn

1. How to effectively simulate user behavior for load testing using Koi Pond
2. Why it's crucial to model complex user interactions in load testing
3. When to use formations to test thundering herd scenarios

Prerequisites & Requirements

Understanding of API interactions and real-time services
Familiarity with load testing tools and methodologies (optional)

Key Questions Answered

How does Koi Pond simulate user interactions for load testing?

Koi Pond simulates user interactions by spinning up slimmed-down versions of Slack clients, called koi, which establish websocket connections and send API calls. This allows for realistic modeling of user behavior and testing of both backend and real-time services under load.
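The idea of a koi as a slimmed-down client can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Koi` and `FakeTransport` classes, the action names, and the weighted per-tick choice of action are hypothetical and are not Koi Pond's actual API. The transport is a stand-in that only records calls, so the example runs without a network.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeTransport:
    """Hypothetical stand-in for a websocket + HTTP client; it only records activity."""
    connected: bool = False
    api_calls: list = field(default_factory=list)

    def open_websocket(self) -> None:
        self.connected = True

    def call_api(self, method: str) -> None:
        if not self.connected:
            raise RuntimeError("koi must connect before calling the API")
        self.api_calls.append(method)


@dataclass
class Koi:
    """A simulated client: one connection plus a stream of API calls."""
    user_id: int
    transport: FakeTransport
    rng: random.Random

    def boot(self) -> None:
        self.transport.open_websocket()

    def tick(self, action_weights: dict) -> str:
        # Assumption: each tick the koi picks one action, weighted by how
        # often a real user is believed to perform it.
        methods = list(action_weights)
        weights = list(action_weights.values())
        method = self.rng.choices(methods, weights=weights, k=1)[0]
        self.transport.call_api(method)
        return method


def run_koi(user_id: int, ticks: int, action_weights: dict, seed: int = 0) -> FakeTransport:
    """Boot one koi, let it act for `ticks` steps, and return what it did."""
    transport = FakeTransport()
    koi = Koi(user_id=user_id, transport=transport, rng=random.Random(seed))
    koi.boot()
    for _ in range(ticks):
        koi.tick(action_weights)
    return transport
```

Calling `run_koi(1, 50, {"send_message": 1, "add_reaction": 3})` would produce 50 recorded calls drawn from those two hypothetical actions. A real load test would swap the fake transport for actual websocket and HTTP clients.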

What are the benefits of using Koi Pond over previous load testing tools?

Koi Pond is significantly more cost-effective and scalable compared to previous tools like Puppet Show, allowing Slack to simulate up to 2 million users at a fraction of the cost. It also provides a more realistic testing environment by mimicking actual user behavior and interactions.

What challenges does Slack face in load testing?

Slack faces challenges such as accurately modeling complex user interactions, ensuring the load testing environment reflects real-world usage, and scaling up gradually to avoid overwhelming system components. These challenges necessitate careful planning and execution of load tests.

When should formations be used in load testing?

Formations should be used when testing scenarios that involve multiple users performing the same action simultaneously, such as reacting to a message in a channel. This helps to assess the impact on real-time services and backend systems under realistic load conditions.
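One way to picture a formation is as a group of simulated users released at the same instant. The sketch below uses a thread barrier to do that; `run_formation` and the `action` callback are hypothetical names, and this is an illustration of the thundering-herd pattern rather than how Koi Pond implements formations.

```python
import threading


def run_formation(koi_ids, action):
    """Release every koi at once so they all perform `action` together."""
    koi_ids = list(koi_ids)
    if not koi_ids:
        return []

    barrier = threading.Barrier(len(koi_ids))
    results = []
    lock = threading.Lock()

    def worker(koi_id):
        barrier.wait()  # block until every koi in the formation is ready
        outcome = action(koi_id)
        with lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker, args=(k,)) for k in koi_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_formation(range(20), lambda k: ("add_reaction", k))` makes 20 koi react at roughly the same moment. The order of the returned results is not deterministic, because the threads race once the barrier opens.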

Key Statistics & Figures

Cost efficiency of Koi Pond: 0.26%

Running 2 million users with Koi Pond costs only 0.26% of running 150,000 users with Puppet Show.

Maximum simulated users tested: 2 million

Koi Pond successfully simulated up to 2 million users in a single workspace without major issues.

Initial number of koi simulated: 5,000

Testing began with 5,000 koi and scaled up to 500,000 leading up to a customer launch.
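Taking the cost figure above at face value, the per-user comparison is more striking than the headline number, because the 0.26% covers far more users. The short calculation below is a derived illustration, not a figure from the article; the function name is hypothetical.

```python
def per_user_cost_ratio(total_cost_ratio: float, new_users: int, old_users: int) -> float:
    """Ratio of per-user cost (new tool / old tool), given the ratio of total costs."""
    # per-user ratio = (new_total / new_users) / (old_total / old_users)
    #                = total_cost_ratio * old_users / new_users
    return total_cost_ratio * old_users / new_users


# 0.26% of the total cost, for 2,000,000 users instead of 150,000.
ratio = per_user_cost_ratio(0.0026, new_users=2_000_000, old_users=150_000)
```

Here `ratio` comes out to about 0.000195, so under these figures a simulated Koi Pond user would cost roughly 0.02% of a Puppet Show user.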

Technologies & Tools

Load Testing Tool: Koi Pond

Used to simulate user interactions and assess system performance under load.

Communication Protocol: Websocket

Facilitates real-time communication between clients and the server during load tests.

Backend Service: API

Endpoints called by koi to simulate user actions during load testing.

Key Actionable Insights

1. Implement Koi Pond for load testing to simulate realistic user behavior and interactions.
Using Koi Pond allows for comprehensive testing of both backend services and real-time interactions, ensuring that Slack can handle high loads effectively.

2. Gradually scale the number of simulated users during load tests to identify potential bottlenecks.
This approach helps in understanding the system's limits and prevents overwhelming any single component, which can lead to failures.

3. Combine manual QA testing with automated load testing for critical features.
This hybrid approach ensures that nuanced user interactions are accurately tested, which is crucial for complex systems like Slack.
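The gradual scale-up described in the second insight can be sketched as a ramp schedule with a health check between steps. The doubling factor, the function names, and the `is_healthy` callback are assumptions for illustration; the article only states that testing began at 5,000 koi and scaled up to 500,000.

```python
def ramp_steps(start: int, target: int, factor: float = 2.0) -> list:
    """Geometric ramp from `start` to `target`, always ending exactly at `target`."""
    if start <= 0 or target < start or factor <= 1:
        raise ValueError("need 0 < start <= target and factor > 1")
    steps = []
    current = start
    while current < target:
        steps.append(current)
        # Guarantee progress even when the factor is very close to 1.
        current = max(current + 1, int(current * factor))
    steps.append(target)
    return steps


def run_ramp(steps, is_healthy) -> int:
    """Walk the ramp; stop at the first unhealthy step and return the last healthy size."""
    reached = 0
    for size in steps:
        # `is_healthy(size)` stands in for "scale to `size` koi, then check dashboards".
        if not is_healthy(size):
            break
        reached = size
    return reached
```

With the sizes mentioned in the summary, `ramp_steps(5_000, 500_000)` yields eight steps: 5,000, 10,000, 20,000, 40,000, 80,000, 160,000, 320,000, and 500,000. Stopping at the first unhealthy step narrows a bottleneck to the interval between two known sizes.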

Common Pitfalls

1. Scaling up too quickly can strain system resources and lead to failures.
When preparing for a customer launch, going from 5,000 to 100,000 koi caused strain on a database. Gradual scaling allows for better monitoring of system performance and identification of issues.

2. Failing to accurately replicate real-world usage patterns can lead to misleading test results.
Using a single SSO token for high-rate calls resulted in no websocket events. Testing with multiple tokens provided a more accurate representation of load on the system.
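The second pitfall suggests spreading calls across many tokens rather than concentrating them on one. A minimal round-robin sketch follows; the function names and token strings are placeholders, and how real tokens are provisioned is outside the scope of this summary.

```python
import itertools
from collections import Counter


def assign_tokens(num_calls: int, tokens: list) -> list:
    """Assign each call a token in round-robin order so no single token carries the load."""
    if not tokens:
        raise ValueError("at least one token is required")
    pool = itertools.cycle(tokens)
    return [next(pool) for _ in range(num_calls)]


def max_calls_per_token(assignments: list) -> int:
    """Largest number of calls that landed on any one token."""
    return max(Counter(assignments).values(), default=0)
```

Spreading 1,000 calls over 10 placeholder tokens caps each token at 100 calls, whereas a single token would carry all 1,000. The spread version is closer to many distinct users each making a modest number of calls.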

Related Concepts

Load Testing Methodologies

Real-time Communication Protocols

API Design And Interaction Patterns
