Improving developer productivity via flaky test management

Flaky tests are a well-known problem across the industry and Microsoft is no exception. In this post, Suresh Thummalapenta walks us through the team's comprehensive flaky test management system that helps to infer, triage, and quarantine those tests.

Suresh Thummalapenta
4 min read · intermediate

Overview

The article discusses the challenges posed by flaky tests in software development and presents a comprehensive flaky test management system developed by Microsoft to enhance developer productivity. It details the system's phases of inference, reporting, and mitigation, which collectively aim to identify, manage, and suppress flaky tests effectively.

What You'll Learn

1. How to identify flaky tests using telemetry data
2. Why managing flaky tests is crucial for developer productivity
3. When to quarantine tests to prevent misleading failures

Key Questions Answered

What are flaky tests and why are they problematic?
Flaky tests are tests that pass and fail non-deterministically on the same code in the same environment. They create confusion for developers, leading to wasted time investigating failures that are not related to their changes, thus negatively impacting productivity.
How does Microsoft's flaky test management system work?
The system infers flaky tests from test execution telemetry, quarantines them to avoid false failures, and files bugs to notify developers. It operates in three phases: inference, reporting, and mitigation, ensuring that developers are informed and can address issues effectively.
What results has the flaky test management system achieved?
The system has identified approximately 49,000 flaky tests and helped pass 160,000 sessions that would have failed due to flaky test failures. This indicates a significant improvement in the quality of test collateral at Microsoft.
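
The inference phase can be illustrated with a minimal sketch built directly on the definition above: a test is flagged when it both passes and fails on the same code in the same environment. The record schema (`RunRecord` and its fields) and the function name are assumptions made for illustration; the article does not describe Microsoft's actual telemetry format or inference logic.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RunRecord(NamedTuple):
    """One row of test-execution telemetry (hypothetical schema)."""
    test_name: str
    code_version: str   # e.g. a commit hash
    environment: str    # e.g. an OS/configuration label
    passed: bool


def infer_flaky_tests(runs: Iterable[RunRecord]) -> set[str]:
    """Return tests that both passed and failed on the same code and environment."""
    outcomes: dict[tuple[str, str, str], set[bool]] = defaultdict(set)
    for run in runs:
        key = (run.test_name, run.code_version, run.environment)
        outcomes[key].add(run.passed)
    # A key that has seen both True and False is non-deterministic.
    return {key[0] for key, seen in outcomes.items() if seen == {True, False}}


runs = [
    RunRecord("test_login", "abc123", "win-x64", True),
    RunRecord("test_login", "abc123", "win-x64", False),  # same code and env, different result
    RunRecord("test_upload", "abc123", "win-x64", False),
    RunRecord("test_upload", "def456", "win-x64", True),  # code changed: not evidence of flakiness
]
flaky = infer_flaky_tests(runs)
print(sorted(flaky))  # ['test_login']
```

Grouping by code version and environment matters here: `test_upload` changes outcome only after the code changes, so it is treated as a genuine failure that was later fixed rather than as a flaky test.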

Key Statistics & Figures

Number of flaky tests identified
Approximately 49,000
This statistic reflects the scale of the flaky test issue addressed by the management system.
Sessions passed due to flaky test management
160,000
This demonstrates the effectiveness of the system in preventing unnecessary test failures.
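
How quarantining lets a session pass can be sketched as follows. This is a simplified illustration, not the system's actual logic: it assumes a session is reduced to a set of failed test names and that the quarantine list is a plain set, and both function names are hypothetical.

```python
def blocking_failures(failed_tests: set[str], quarantined: set[str]) -> set[str]:
    """Failures that still count after known-flaky (quarantined) tests are suppressed."""
    return failed_tests - quarantined


def session_passes(failed_tests: set[str], quarantined: set[str]) -> bool:
    """A session passes when every failure comes from a quarantined test."""
    return not blocking_failures(failed_tests, quarantined)


quarantined = {"test_login"}

# Only a quarantined test failed, so the session is not failed on its account.
print(session_passes({"test_login"}, quarantined))                 # True

# A non-quarantined test also failed, so the session still fails.
print(session_passes({"test_login", "test_upload"}, quarantined))  # False
```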

Technologies & Tools

Backend
CloudBuild
Microsoft's distributed build service that integrates with the flaky test management system.
Backend
CloudTest
Microsoft's verification service that aids in the execution and monitoring of tests.

Key Actionable Insights

1. Implement a flaky test management system to enhance developer productivity.
   By effectively identifying and managing flaky tests, teams can reduce the time spent on false failures, allowing developers to focus on actual issues and improving overall workflow.
2. Utilize telemetry data to monitor test executions and identify flaky tests.
   Monitoring telemetry can provide insights into test reliability, enabling teams to proactively address flaky tests before they disrupt the development process.
3. Establish policies for handling flaky tests, such as blocking PRs for developers with excessive flaky test bugs.
   This can encourage accountability among developers and promote a culture of quality in testing practices.
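
A policy like the PR-blocking example could be expressed as a simple threshold check over open flaky test bugs. The sketch below is hypothetical throughout: the threshold value, the function name, and the idea of representing bugs as a list of owner names are assumptions, since the article does not say what counts as "excessive".

```python
from collections import Counter
from typing import Iterable

MAX_OPEN_FLAKY_BUGS = 5  # hypothetical threshold; a real team would choose its own


def developers_to_block(bug_owners: Iterable[str],
                        limit: int = MAX_OPEN_FLAKY_BUGS) -> set[str]:
    """Return owners whose count of open flaky test bugs exceeds the limit."""
    counts = Counter(bug_owners)
    return {owner for owner, n in counts.items() if n > limit}


# One entry per open flaky test bug, keyed by the developer it is assigned to.
open_bugs = ["alice"] * 6 + ["bob"] * 2
print(developers_to_block(open_bugs))  # {'alice'}
```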

Common Pitfalls

1. Failing to address flaky tests can lead to significant productivity losses.
   When flaky tests are ignored, developers may spend excessive time troubleshooting issues that do not relate to their code changes, ultimately slowing down the development process.