Balancing Safety and Velocity in CI/CD at Slack

In 2021, we changed developer testing workflows for Webapp, Slack’s main monorepo, from predominantly testing before merging to a multi-tiered testing workflow after merging. This changed our previous definition of safety and developer workflows between testing and deploys. In this project, we aimed to ensure frequent, reliable, and high-quality releases to our customers for a…

Carlos Valdez
15 min read, intermediate

Overview

The article discusses how Slack evolved its CI/CD workflows to balance safety and developer velocity. It highlights the transition from pre-merge testing to a multi-tiered testing approach that improved testing efficiency and reliability while maintaining code quality.

What You'll Learn

1. How to implement a multi-tiered testing workflow in CI/CD
2. Why reducing pre-merge test coverage can improve developer velocity
3. How to create effective alerting mechanisms for CI failures

Prerequisites & Requirements

  • Understanding of CI/CD principles and practices
  • Familiarity with automated testing frameworks (optional)

Key Questions Answered

What changes did Slack make to improve CI/CD workflows?
Slack transitioned from a predominantly pre-merge testing approach to a multi-tiered testing workflow that includes pre-merge, post-merge, and regression pipelines. This allowed for faster feedback loops and reduced test flakiness, ultimately improving developer velocity without compromising safety.
How did Slack reduce test turnaround time?
By implementing a new testing pipeline structure, Slack decreased the test turnaround time (p95) from over 30 minutes to consistently below 18 minutes. This was achieved by executing a smaller subset of critical tests pre-merge and batching the remaining tests post-merge.
What was the impact on test flakiness after the workflow changes?
After the new testing workflows were implemented, test flakiness per pull request fell by over 90%, to consistently less than 5%. This improvement was crucial for enhancing developer experience and maintaining code quality.
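As a rough illustration of the tiering described above, the following sketch assigns test suites to pre-merge, post-merge, and regression pipelines and groups merged commits into batches for post-merge runs. The `Suite` type, its `critical` and `long_running` flags, the assignment rules, and the reading of "batching" as grouping merged commits are all assumptions made for this example; the article summary does not say how Slack actually classifies or batches tests.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PRE_MERGE = "pre-merge"
    POST_MERGE = "post-merge"
    REGRESSION = "regression"


@dataclass(frozen=True)
class Suite:
    """Hypothetical description of a test suite (not Slack's data model)."""
    name: str
    critical: bool       # assumed flag: must pass before a change can merge
    long_running: bool   # assumed flag: too slow to run on every batch


def assign_tier(suite: Suite) -> Tier:
    """Pick a pipeline for a suite using the assumed rules below."""
    if suite.critical:
        return Tier.PRE_MERGE      # small, critical subset gates the merge
    if suite.long_running:
        return Tier.REGRESSION     # slow suites run in the regression pipeline
    return Tier.POST_MERGE         # everything else runs after merge


def batch_commits(commits: list[str], batch_size: int) -> list[list[str]]:
    """Group merged commits, oldest first, so one post-merge run covers several."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [commits[i:i + batch_size] for i in range(0, len(commits), batch_size)]


suites = [
    Suite("login-smoke", critical=True, long_running=False),
    Suite("settings-ui", critical=False, long_running=False),
    Suite("full-e2e", critical=False, long_running=True),
]
plan = {s.name: assign_tier(s).value for s in suites}
batches = batch_commits(["c1", "c2", "c3", "c4", "c5"], batch_size=2)
```

The trade-off in this kind of design is that a failure found post-merge may belong to any commit in the batch, so smaller batches make triage easier at the cost of more test runs.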

Key Statistics & Figures

Test turnaround time (p95)
Consistently below 18 minutes
Down from over 30 minutes before the new testing workflows were implemented.
Test flakiness per pull request
Consistently less than 5%
This marked a significant decrease of over 90% from previous levels.
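For readers who want to track the same two metrics, here is a minimal sketch of how they could be computed. It assumes a nearest-rank definition of p95 and assumes that a pull request counts as "flaky" when any test both failed and passed across retries on that PR. Both definitions, and all function names, are illustrative assumptions; the summary does not state how Slack measures either figure.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def had_flaky_test(attempts_by_test: dict[str, list[bool]]) -> bool:
    """True if any test has both a pass (True) and a fail (False) on this PR."""
    return any(True in runs and False in runs for runs in attempts_by_test.values())


def flaky_pr_rate(prs: list[dict[str, list[bool]]]) -> float:
    """Fraction of pull requests that saw at least one flaky test."""
    if not prs:
        raise ValueError("prs must not be empty")
    return sum(had_flaky_test(pr) for pr in prs) / len(prs)


# Made-up sample data, for illustration only.
turnaround_minutes = [float(m) for m in range(1, 21)]
p95 = percentile_nearest_rank(turnaround_minutes, 95)

sample_prs = [
    {"test_login": [False, True]},    # failed, then passed on retry: flaky
    {"test_login": [True]},           # passed first time
    {"test_login": [False, False]},   # consistently failing: not flaky
    {"test_search": [True]},
]
rate = flaky_pr_rate(sample_prs)
```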

Technologies & Tools


Backend
Hacklang
Used for the backend API of Slack's Webapp.
Frontend
TypeScript
Used for the frontend client of Slack's Webapp.

Key Actionable Insights

1. Implement a multi-tiered testing workflow to balance safety and speed.
By separating tests into pre-merge and post-merge pipelines, teams can reduce the time developers spend waiting for test results while still ensuring critical tests are run before deployment.
2. Create human-centric alerting mechanisms for CI failures.
Establishing clear escalation paths and alert systems can help teams quickly identify and resolve issues, minimizing the impact of failures on the development process.
3. Focus on reducing flaky tests to improve developer productivity.
By addressing the root causes of test flakiness, teams can enhance the reliability of their testing processes, leading to faster development cycles and a better overall experience for engineers.
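One way to picture an escalation path for CI failures is a small table that maps the number of consecutive failing runs to the people who should be notified. The sketch below is purely hypothetical: the thresholds, the recipient names, and the `who_to_notify` helper are invented for illustration and are not taken from Slack's alerting setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_failures: int  # notify once this many consecutive runs have failed
    notify: str          # hypothetical recipient label


# Assumed escalation path; real thresholds and recipients would differ per team.
ESCALATION_PATH = [
    EscalationStep(after_failures=1, notify="commit-authors"),
    EscalationStep(after_failures=2, notify="owning-team-channel"),
    EscalationStep(after_failures=4, notify="on-call-triage"),
]


def consecutive_failures(results: list[bool]) -> int:
    """Count trailing failures; results are oldest first, True means passed."""
    count = 0
    for passed in reversed(results):
        if passed:
            break
        count += 1
    return count


def who_to_notify(results: list[bool],
                  path: list[EscalationStep] = ESCALATION_PATH) -> list[str]:
    """Return every recipient whose threshold the current failure streak has reached."""
    streak = consecutive_failures(results)
    return [step.notify for step in path if streak >= step.after_failures]


# Two failing runs in a row after a passing one.
recipients = who_to_notify([True, False, False])
```

Keeping the path as data rather than nested conditionals makes it easy to review with the people who will receive the alerts, which is the point of a human-centric design.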

Common Pitfalls

1. Relying solely on pre-merge tests can lead to slow feedback loops and increased frustration among developers.
This often results in developers waiting long periods for test results, which can hinder productivity and lead to a negative development experience.