Handling Flaky Tests at Scale: Auto Detection & Suppression

At Slack, the goal of the Mobile Developer Experience Team (DevXp) is to empower developers to ship code with confidence while enjoying a pleasant and productive engineering experience. We use metrics and surveys to measure productivity and developer experience, such as developer sentiment, CI stability, time to merge (TTM), and test failure rate. The DevXp…

Arpita Patel

Overview

The article discusses how Slack's Mobile Developer Experience Team tackled the challenge of flaky tests in their CI/CD pipeline by implementing an automated detection and suppression system. This initiative significantly improved the stability of test jobs, reduced failure rates, and enhanced developer confidence.

What You'll Learn

1. How to automate the detection and suppression of flaky tests in a CI/CD pipeline
2. Why manual triaging of flaky tests is inefficient and how automation can improve developer experience
3. When to implement a suppression system for flaky tests based on historical data

Prerequisites & Requirements

  • Understanding of CI/CD processes and automated testing
  • Experience with test automation frameworks (optional)

Key Questions Answered

How did Slack reduce test job failures caused by flaky tests?
Slack implemented an automated suppression system that detects flaky tests based on their historical failure rates. This system allowed them to reduce test job failures from 57% to less than 5%, significantly improving CI stability and developer productivity.
What types of flaky tests did Slack identify?
Slack categorized flaky tests into two types: independent flaky tests, which fail regardless of which other tests run alongside them, and systemic flaky tests, which fail because of shared state or differences in the CI environment. This distinction helps target troubleshooting and resolution.
What impact did the automation of flaky test handling have on developer sentiment?
The automation of flaky test handling led to improved developer sentiment, with 74% of developers reporting a positive impact on main branch stability. This indicates that the initiative not only stabilized the CI/CD process but also enhanced developer confidence in the system.

Key Statistics & Figures

Test job failure rate
Reduced from 57% to less than 5%
This reduction was achieved through the implementation of an automated flaky test suppression system.
PR build stability
Increased from 71% to 88%
The improvement in stability was observed shortly after the rollout of the automation project.
Main branch build stability
Improved from 61% to 90%
This increase in stability reflects the effectiveness of the automated system in handling flaky tests.
Time saved in triage
Saved 553 hours of triage time
This was achieved through the automation of the flaky test handling process, allowing developers to focus on more critical tasks.

Key Actionable Insights

1. Implement an automated system for detecting and suppressing flaky tests to improve CI/CD stability.
   This approach minimizes the manual effort required to triage flaky tests and allows developers to focus on more critical tasks, thus enhancing overall productivity.
2. Categorize flaky tests to better understand their behavior and improve troubleshooting efforts.
   By distinguishing between independent flaky tests and those affected by systemic issues, teams can apply targeted fixes, reducing the time spent on resolving test failures.
3. Regularly review and adjust the thresholds for test suppression based on historical data.
   This ensures that tests are accurately classified and helps maintain the integrity of the CI/CD process, preventing flaky tests from leaking into the main branch.
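
The second insight, categorizing flaky tests, could be approximated by comparing how often a test fails when run alone with how often it fails inside the full suite. This is a hedged sketch of one possible heuristic, not the method used at Slack: the `Flakiness` enum, the `categorize` function, and the `tolerance` value are hypothetical, and it assumes both failure rates have already been measured.

```python
from enum import Enum


class Flakiness(Enum):
    STABLE = "stable"
    INDEPENDENT = "independent"  # fails regardless of the surrounding test set
    SYSTEMIC = "systemic"        # fails only with shared state or in the CI environment


def categorize(isolated_failure_rate: float,
               suite_failure_rate: float,
               tolerance: float = 0.02) -> Flakiness:
    """Classify a test from two measured failure rates.

    Assumption: a test that still fails when run on its own is flaky by
    itself, while a test that fails only inside the suite is reacting to
    shared state or environment differences.
    """
    if isolated_failure_rate > tolerance:
        return Flakiness.INDEPENDENT
    if suite_failure_rate > tolerance:
        return Flakiness.SYSTEMIC
    return Flakiness.STABLE
```

A test labeled `SYSTEMIC` by this heuristic points at the suite or the CI environment, so fixing the test body alone is unlikely to help.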

Common Pitfalls

1. Relying solely on manual triaging of flaky tests can lead to inefficiencies and increased frustration among developers.
   This happens because manual processes are time-consuming and can result in delays in merging PRs, ultimately affecting productivity.
2. Suppressing test results instead of execution can allow failing tests to leak into the main branch.
   This occurs when new tests are incorrectly classified as flaky due to insufficient historical data, leading to confusion and instability in the CI/CD pipeline.
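
One way to guard against the second pitfall is to keep running every test but ignore a failure only when the test is both flagged as flaky and has enough recorded history. The sketch below illustrates that guard under stated assumptions: `JobVerdict`, `evaluate_job`, and the `min_runs` value are hypothetical names and numbers, not taken from the original article.

```python
from dataclasses import dataclass, field


@dataclass
class JobVerdict:
    passed: bool
    suppressed: list = field(default_factory=list)  # failures that were ignored
    blocking: list = field(default_factory=list)    # failures that fail the job


def evaluate_job(results, prior_runs, flaky, min_runs=20):
    """Decide a job's status after all tests have executed.

    results:    {test_name: passed} for this job
    prior_runs: {test_name: number of earlier recorded runs}
    flaky:      set of test names currently flagged as flaky

    Assumption: a test with fewer than `min_runs` recorded runs is too new
    to trust its flaky flag, so its failures always block the job.
    """
    verdict = JobVerdict(passed=True)
    for name, passed in results.items():
        if passed:
            continue
        has_history = prior_runs.get(name, 0) >= min_runs
        if name in flaky and has_history:
            verdict.suppressed.append(name)
        else:
            verdict.blocking.append(name)
    verdict.passed = not verdict.blocking
    return verdict
```

Because suppressed failures are still recorded in the verdict, they remain visible for later triage instead of disappearing from the job output.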

Related Concepts

Continuous Integration
Continuous Deployment
Automated Testing
Flaky Test Management