Handling Flaky Tests at Scale: Auto Detection & Suppression

Arpita Patel

At Slack, the goal of the Mobile Developer Experience Team (DevXp) is to empower developers to ship code with confidence while enjoying a pleasant and productive engineering experience. We use metrics and surveys to measure productivity and developer experience, such as developer sentiment, CI stability, time to merge (TTM), and test failure rate. The DevXp…

Slack


Overview

The article discusses how Slack's Mobile Developer Experience Team tackled the challenge of flaky tests in their CI/CD pipeline by implementing an automated detection and suppression system. This initiative significantly improved the stability of test jobs, reduced failure rates, and enhanced developer confidence.

What You'll Learn

1. How to automate the detection and suppression of flaky tests in a CI/CD pipeline

2. Why manual triaging of flaky tests is inefficient and how automation can improve developer experience

3. When to implement a suppression system for flaky tests based on historical data

Prerequisites & Requirements

Understanding of CI/CD processes and automated testing
Experience with test automation frameworks (optional)

Key Questions Answered

How did Slack reduce test job failures caused by flaky tests?

Slack implemented an automated suppression system that detects flaky tests based on their historical failure rates. This system allowed them to reduce test job failures from 57% to less than 5%, significantly improving CI stability and developer productivity.
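As a rough illustration of rate-based detection, the sketch below flags a test as flaky when its failure rate over recorded runs reaches a cutoff. The `RunRecord` shape, the helper names, and the 10% threshold are assumptions made for this example; the summary does not state Slack's actual data model or thresholds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One historical execution of a test (hypothetical record shape)."""
    test_name: str
    passed: bool


def failure_rates(runs):
    """Return {test_name: fraction of recorded runs that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in runs:
        totals[run.test_name] += 1
        if not run.passed:
            failures[run.test_name] += 1
    return {name: failures[name] / totals[name] for name in totals}


def flaky_tests(runs, threshold=0.10):
    """Return the sorted names of tests whose failure rate is at least
    `threshold` but below 1.0.

    Assumption: a test that fails on every run is treated as broken
    rather than flaky, so it is excluded here.
    """
    rates = failure_rates(runs)
    return sorted(name for name, rate in rates.items()
                  if threshold <= rate < 1.0)
```

In practice the history would likely be limited to a recent window of runs so that a test which has since been fixed stops being flagged; that windowing is left out here to keep the sketch short.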

What types of flaky tests did Slack identify?

Slack categorized flaky tests into two types: independent flaky tests, which fail regardless of the test set, and flaky tests due to systemic issues, which fail based on shared state or CI environment differences. This distinction helps in targeted troubleshooting and resolution.
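One possible way to approximate this distinction from run history is sketched below. It is a heuristic invented for illustration, not the method described in the article: a test that has failed in a large share of the distinct test sets it ran in is labeled "independent", while one whose failures are confined to a few test sets is labeled "systemic". The tuple layout, the `classify_flaky` name, and the 0.5 cutoff are all assumptions.

```python
from collections import defaultdict


def classify_flaky(records, spread=0.5):
    """Label each test that has failed at least once.

    records: iterable of (test_name, test_set_id, passed) tuples, where
             test_set_id identifies the group of tests run together
             (hypothetical layout).
    spread:  assumed cutoff on the share of test sets with a failure.
    """
    seen = defaultdict(set)    # test sets each test has run in
    failed = defaultdict(set)  # test sets in which it failed at least once
    for name, test_set, passed in records:
        seen[name].add(test_set)
        if not passed:
            failed[name].add(test_set)

    labels = {}
    for name, failed_sets in failed.items():
        share = len(failed_sets) / len(seen[name])
        # Failures spread across many test sets suggest the test is flaky
        # on its own; failures tied to few sets suggest shared state or
        # environment differences.
        labels[name] = "independent" if share >= spread else "systemic"
    return labels
```

A label produced this way is only a starting point for triage; confirming a systemic cause still requires inspecting the shared state or CI environment involved.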

What impact did the automation of flaky test handling have on developer sentiment?

The automation of flaky test handling led to improved developer sentiment, with 74% of developers reporting a positive impact on main branch stability. This indicates that the initiative not only stabilized the CI/CD process but also enhanced developer confidence in the system.

Key Statistics & Figures

Test job failure rate

Reduced from 57% to less than 5%

This reduction was achieved through the implementation of an automated flaky test suppression system.

PR build stability

Increased from 71% to 88%

The improvement in stability was observed shortly after the rollout of the automation project.

Main branch build stability

Improved from 61% to 90%

This increase in stability reflects the effectiveness of the automated system in handling flaky tests.

Time saved in triage

Saved 553 hours of triage time

This was achieved through the automation of the flaky test handling process, allowing developers to focus on more critical tasks.

Key Actionable Insights

1. Implement an automated system for detecting and suppressing flaky tests to improve CI/CD stability.
This approach minimizes the manual effort required to triage flaky tests and allows developers to focus on more critical tasks, thus enhancing overall productivity.

2. Categorize flaky tests to better understand their behavior and improve troubleshooting efforts.
By distinguishing between independent flaky tests and those affected by systemic issues, teams can apply targeted fixes, reducing the time spent on resolving test failures.

3. Regularly review and adjust the thresholds for test suppression based on historical data.
This ensures that tests are accurately classified and helps maintain the integrity of the CI/CD process, preventing flaky tests from leaking into the main branch.

Common Pitfalls

1. Relying solely on manual triaging of flaky tests can lead to inefficiencies and increased frustration among developers.

This happens because manual processes are time-consuming and can result in delays in merging PRs, ultimately affecting productivity.

2. Suppressing test results instead of execution can allow failing tests to leak into the main branch.

This occurs when new tests are incorrectly classified as flaky due to insufficient historical data, leading to confusion and instability in the CI/CD pipeline.
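The second pitfall can be made concrete with a small sketch of result suppression that refuses to suppress tests lacking enough history. Every test still executes; only the job verdict ignores failures from tests with an established flaky record. The function names, the `history` layout, and the `threshold` and `min_runs` defaults are assumptions chosen for illustration, not values from the article.

```python
def suppressible(failure_rate, run_count, threshold=0.10, min_runs=20):
    """Decide whether a failing test's result may be ignored.

    A test qualifies only if it has enough recorded runs (assumed
    min_runs) and fails intermittently: at or above the assumed
    threshold, but not on every run.
    """
    return run_count >= min_runs and threshold <= failure_rate < 1.0


def job_passes(results, history, threshold=0.10, min_runs=20):
    """Return True if no unsuppressed test failed.

    results: {test_name: passed} for the current job.
    history: {test_name: (failure_rate, run_count)} from past runs
             (hypothetical layout). Tests absent from history are
             treated as having zero runs, so they are never suppressed.
    """
    for name, passed in results.items():
        if passed:
            continue
        rate, count = history.get(name, (0.0, 0))
        if not suppressible(rate, count, threshold, min_runs):
            return False  # a real or not-yet-understood failure
    return True
```

With this guard, a newly added test that fails is always treated as a real failure until it has accumulated `min_runs` executions, which is one way to keep genuinely broken new tests from being masked.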

Related Concepts

Continuous Integration

Continuous Deployment

Automated Testing

Flaky Tests Management
