Balancing Safety and Velocity in CI/CD at Slack

Carlos Valdez

In 2021, we changed developer testing workflows for Webapp, Slack’s main monorepo, from predominantly testing before merging to a multi-tiered testing workflow after merging. This changed our previous definition of safety and developer workflows between testing and deploys. In this project, we aimed to ensure frequent, reliable, and high-quality releases to our customers for a…


Overview

The article discusses how Slack evolved its CI/CD workflows to balance safety and developer velocity. It highlights the transition from pre-merge testing to a multi-tiered testing approach that improved testing efficiency and reliability while maintaining code quality.

What You'll Learn

1. How to implement a multi-tiered testing workflow in CI/CD
2. Why reducing pre-merge test coverage can improve developer velocity
3. How to create effective alerting mechanisms for CI failures

Prerequisites & Requirements

Understanding of CI/CD principles and practices
Familiarity with automated testing frameworks (optional)

Key Questions Answered

What changes did Slack make to improve CI/CD workflows?

Slack transitioned from a predominantly pre-merge testing approach to a multi-tiered testing workflow that includes pre-merge, post-merge, and regression pipelines. This allowed for faster feedback loops and reduced test flakiness, ultimately improving developer velocity without compromising safety.
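
As a rough illustration of the three pipelines named above, the following minimal sketch routes test suites into pre-merge, post-merge, and regression tiers. The `Suite` fields, the routing rules, and the suite names are assumptions made for illustration; the source does not describe how Slack actually classifies its tests.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PRE_MERGE = "pre-merge"
    POST_MERGE = "post-merge"
    REGRESSION = "regression"


@dataclass(frozen=True)
class Suite:
    name: str
    critical: bool  # hypothetical flag: must pass before a change merges
    slow: bool      # hypothetical flag: too expensive to run on every merge


def assign_tier(suite: Suite) -> Tier:
    """Route a suite to a pipeline tier (assumed rules, not Slack's)."""
    if suite.critical:
        return Tier.PRE_MERGE   # small, critical subset blocks the merge
    if suite.slow:
        return Tier.REGRESSION  # expensive suites run on their own schedule
    return Tier.POST_MERGE      # everything else is batched after merge


def plan_pipelines(suites):
    """Group suite names by the tier they were assigned to."""
    plan = {tier: [] for tier in Tier}
    for suite in suites:
        plan[assign_tier(suite)].append(suite.name)
    return plan
```

The point of the sketch is only that the pre-merge tier stays small, so developers wait on fewer tests before merging while the rest of the coverage still runs afterwards.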

How did Slack reduce test turnaround time?

By implementing a new testing pipeline structure, Slack decreased the test turnaround time (p95) from over 30 minutes to consistently below 18 minutes. This was achieved by executing a smaller subset of critical tests pre-merge and batching the remaining tests post-merge.
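
To make the p95 figure concrete, here is a small sketch that computes a 95th-percentile turnaround time from a list of per-run durations and checks it against the 18-minute mark. The nearest-rank percentile method and the function names are assumptions; the source does not say how the metric is calculated.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_target(turnaround_minutes, target_minutes=18.0):
    """True when the p95 turnaround time is below the target (18 minutes in the text)."""
    return percentile(turnaround_minutes, 95) < target_minutes
```

A p95 target tolerates a few slow outliers but fails once more than 5% of runs exceed the threshold, which is why it tracks the experience of most developers better than an average would.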

What was the impact on test flakiness after the workflow changes?

After the new testing workflows were implemented, test flakiness per pull request dropped by over 90%, to consistently less than 5%. This improvement was crucial for enhancing developer experience and maintaining code quality.
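
One way to measure "flakiness per pull request" is sketched below. It assumes a definition the source does not spell out: a test counts as flaky when it both failed and passed on the same commit, and the metric is the share of pull requests that hit at least one such test. The record layout `(pr_id, commit_sha, test_name, passed)` is likewise hypothetical.

```python
from collections import defaultdict


def flaky_pr_rate(runs):
    """Fraction of PRs with at least one test that both failed and passed on one commit.

    runs: iterable of (pr_id, commit_sha, test_name, passed) tuples (assumed layout).
    """
    outcomes = defaultdict(set)  # (pr, commit, test) -> set of observed results
    prs = set()
    for pr_id, commit_sha, test_name, passed in runs:
        prs.add(pr_id)
        outcomes[(pr_id, commit_sha, test_name)].add(bool(passed))
    # A key that saw both True and False changed outcome without a code change.
    flaky_prs = {key[0] for key, seen in outcomes.items() if seen == {True, False}}
    return len(flaky_prs) / len(prs) if prs else 0.0
```

Under this definition a test that fails consistently is treated as a real failure, not a flake, so the metric isolates nondeterminism from genuine breakage.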

Key Statistics & Figures

Test turnaround time (p95)

Consistently below 18 minutes

Before the new testing workflows were implemented, this metric was over 30 minutes.

Test flakiness per pull request

Consistently less than 5%

This marked a significant decrease of over 90% from previous levels.

Technologies & Tools

Backend

Hacklang

Used for the backend API of Slack's Webapp.

Frontend

TypeScript

Used for the frontend client of Slack's Webapp.

Key Actionable Insights

1. Implement a multi-tiered testing workflow to balance safety and speed.
By separating tests into pre-merge and post-merge pipelines, teams can reduce the time developers spend waiting for test results while still ensuring critical tests are run before deployment.

2. Create human-centric alerting mechanisms for CI failures.
Establishing clear escalation paths and alert systems can help teams quickly identify and resolve issues, minimizing the impact of failures on the development process.

3. Focus on reducing flaky tests to improve developer productivity.
By addressing the root causes of test flakiness, teams can enhance the reliability of their testing processes, leading to faster development cycles and a better overall experience for engineers.
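
The escalation-path idea from the second insight can be sketched as a small lookup that widens the audience as post-merge failures persist. The thresholds, audience labels, and function names here are all hypothetical; the source only says that clear escalation paths help, not how Slack configures them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_failures: int  # consecutive failing runs before this step fires (assumed)
    notify: str          # hypothetical audience label


# Hypothetical path: start with the people closest to the change, then widen.
ESCALATION_PATH = (
    EscalationStep(1, "commit authors"),
    EscalationStep(2, "owning team channel"),
    EscalationStep(4, "on-call triage"),
)


def consecutive_failures(results):
    """Count trailing failures in a chronological list of run results (True = pass)."""
    count = 0
    for passed in reversed(results):
        if passed:
            break
        count += 1
    return count


def audiences_to_notify(failure_count, path=ESCALATION_PATH):
    """Return every audience whose threshold has been reached."""
    return [step.notify for step in path if failure_count >= step.after_failures]
```

For example, `audiences_to_notify(consecutive_failures([True, False, False]))` returns the first two audiences, so a single failure reaches only the authors and a persistent one reaches progressively more people.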

Common Pitfalls

1. Relying solely on pre-merge tests can lead to slow feedback loops and increased frustration among developers.
This often results in developers waiting long periods for test results, which can hinder productivity and lead to a negative development experience.
