Choosing a Sequential Testing Framework — Comparisons and Discussions

Mårten Schultzberg (Staff Data Scientist) and Sebastian Ankargren (Sr. Data Scientist)

Spotify

Overview

The article discusses how to choose a sequential testing framework for online experiments, with a particular focus on group sequential tests (GSTs), which Spotify uses. It compares several statistical methods, weighs their advantages and disadvantages, and explains how the way data is delivered (in batches or as a stream) affects the choice of test.

What You'll Learn

1

How to choose between different sequential tests based on data infrastructure and sample size estimates

2

Why group sequential tests are preferred for batch data analysis

3

When to use always valid inference tests for streaming data

Prerequisites & Requirements

Understanding of statistical testing concepts
Experience with A/B testing frameworks (optional)

Key Questions Answered

What are the main advantages of using group sequential tests?

Group sequential tests (GSTs) allow repeated interim analyses without inflating the false positive rate, which makes them well suited to experiments where data arrives in batches. Through alpha spending, they let experimenters decide when to peek at results as the experiment runs, without fixing the number or timing of analyses in advance.
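The alpha spending idea can be illustrated with a minimal sketch. It assumes the quadratic spending function has the power-family form alpha * t^2, where t is the fraction of the expected sample size observed so far; the function names are hypothetical and not taken from the article. The sketch only tracks the alpha budget. Turning each increment into a critical boundary requires additional numerical work that is not shown here.

```python
def quadratic_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative alpha that may be spent at a given information fraction.

    Assumes a power-family spending function alpha * t**2, capped at t = 1.
    """
    if information_fraction < 0:
        raise ValueError("information_fraction must be non-negative")
    t = min(information_fraction, 1.0)
    return alpha * t ** 2


def incremental_alpha(information_fractions, alpha=0.05):
    """Alpha newly released at each analysis, given increasing fractions."""
    increments = []
    spent_so_far = 0.0
    for t in information_fractions:
        cumulative = quadratic_alpha_spent(t, alpha)
        increments.append(cumulative - spent_so_far)
        spent_so_far = cumulative
    return increments


if __name__ == "__main__":
    # Analyses at arbitrary, unequally spaced points of the experiment.
    for t, a in zip([0.2, 0.5, 1.0], incremental_alpha([0.2, 0.5, 1.0])):
        print(f"fraction={t:.1f} alpha released={a:.5f}")
```

Because the budget depends only on the information fraction at each look, the looks do not have to be equally spaced or planned ahead. The increments always add up to the overall alpha once the full expected sample has been observed.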

How does peeking affect the false positive rate in statistical tests?

Peeking during data collection inflates the false positive rate because every additional look is another chance to cross the significance threshold by chance alone. For example, repeatedly applying a standard z-test to accumulating data can raise the overall false positive rate from the intended 5% to approximately 10%, leading to misleading conclusions.
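A small simulation makes the inflation concrete. This is an illustrative sketch, not the article's simulation setup: it assumes two groups with unit-variance normal outcomes and no true effect, equally sized batches, and an uncorrected two-sided z-test at every peek. The function name and all parameter defaults are hypothetical.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_peeks, batch_size=100, alpha=0.05,
                                n_simulations=20000, seed=1):
    """Estimate the false positive rate of an uncorrected z-test under peeking."""
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    # Shortcut: under the null with unit variance per group, the sum of
    # (treatment - control) differences over one batch is N(0, 2 * batch_size),
    # so each batch is drawn as a single normal variate.
    batch_sd = math.sqrt(2 * batch_size)
    false_positives = 0
    for _ in range(n_simulations):
        running_sum = 0.0
        for peek in range(1, n_peeks + 1):
            running_sum += rng.gauss(0.0, batch_sd)
            z = running_sum / math.sqrt(2 * batch_size * peek)
            if abs(z) > critical_value:
                # Stop at the first "significant" peek, as a peeker would.
                false_positives += 1
                break
    return false_positives / n_simulations


if __name__ == "__main__":
    for peeks in (1, 2, 5, 10):
        rate = peeking_false_positive_rate(peeks)
        print(f"peeks={peeks:2d} estimated false positive rate={rate:.3f}")
```

With a single analysis the estimate stays near the nominal level, and it grows as more peeks are added. The exact figure depends on how many peeks are made and how they are spaced.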

What is the impact of sample size estimation on the choice of sequential test?

An accurate estimate of the expected sample size is crucial when choosing a sequential test. Group sequential tests depend on that estimate: if the expected sample size is underestimated, their false positive rate can become inflated. Always valid inference tests, by contrast, keep the false positive rate bounded without relying on a sample size estimate.
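The dependence on the estimate can be seen by tracking the alpha budget against the data that actually arrives. The sketch below is a simplified illustration, assuming a quadratic spending function of the form alpha * t^2 capped at t = 1; the function name, batch sizes, and expected totals are hypothetical. It shows only the budget bookkeeping, not how any particular implementation handles data that arrives after the budget is exhausted.

```python
def alpha_budget_by_batch(batch_sizes, expected_total, alpha=0.05):
    """Return (cumulative_n, information_fraction, alpha_spent) per batch.

    The information fraction is measured against the *expected* total, so a
    wrong estimate shifts when the alpha budget is released.
    """
    rows = []
    cumulative_n = 0
    for size in batch_sizes:
        cumulative_n += size
        fraction = cumulative_n / expected_total
        spent = alpha * min(fraction, 1.0) ** 2
        rows.append((cumulative_n, fraction, spent))
    return rows


if __name__ == "__main__":
    batches = [100] * 5  # 500 units actually arrive
    for label, expected in (("underestimated", 300), ("overestimated", 1000)):
        print(label, "expected total:", expected)
        for n, fraction, spent in alpha_budget_by_batch(batches, expected):
            print(f"  n={n} fraction={fraction:.2f} alpha spent={spent:.4f}")
```

When the total is underestimated, the information fraction passes 1 and the full budget is spent before all the data is in, so any further testing has no alpha left to draw on. When it is overestimated, part of the budget is still unspent at the final analysis, which makes the test conservative.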

What are the limitations of always valid inference tests?

Always valid inference tests require careful selection of parameters related to the mixing distribution, which can affect their statistical properties. Additionally, they may have lower power when analyzing batch data compared to streaming data, making them less effective in certain scenarios.
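To make the role of the mixing distribution concrete, here is a minimal sketch of one well-known always valid test, the mixture sequential probability ratio test (mSPRT) with a normal mixing distribution. It is an illustrative choice, not necessarily the variant the article evaluates. It assumes a one-sample stream of observations with known standard deviation sigma; tau is the standard deviation of the mixing distribution and is the parameter that has to be chosen with care. The function name is hypothetical.

```python
import math
import random


def msprt_p_values(observations, sigma, tau, theta0=0.0):
    """Always valid p-values from a normal-mixture SPRT (known sigma).

    The mixture likelihood ratio after n observations with running mean m is
        sqrt(sigma^2 / (sigma^2 + n tau^2))
        * exp(n^2 tau^2 (m - theta0)^2 / (2 sigma^2 (sigma^2 + n tau^2)))
    and the p-value is the running minimum of its reciprocal, capped at 1.
    """
    p_values = []
    p = 1.0
    total = 0.0
    for n, x in enumerate(observations, start=1):
        total += x
        mean = total / n
        variance_term = sigma ** 2 + n * tau ** 2
        log_ratio = (0.5 * math.log(sigma ** 2 / variance_term)
                     + (n ** 2 * tau ** 2 * (mean - theta0) ** 2)
                     / (2 * sigma ** 2 * variance_term))
        p = min(p, math.exp(-log_ratio))
        p_values.append(p)
    return p_values


if __name__ == "__main__":
    rng = random.Random(0)
    stream = [rng.gauss(0.2, 1.0) for _ in range(1000)]  # simulated true effect
    for tau in (0.05, 0.5, 5.0):
        final_p = msprt_p_values(stream, sigma=1.0, tau=tau)[-1]
        print(f"tau={tau} final always valid p-value={final_p:.4g}")
```

The p-value can be checked after every observation, which is why this style of test fits streaming data. Running the same stream with different values of tau shows how much the result depends on the mixing distribution.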

Key Statistics & Figures

False positive rate increase due to peeking

10%

When using a z-test repeatedly, the overall false positive rate can increase from the intended 5% to approximately 10%.

Empirical power results for group sequential tests

0.90

For a sample size of 500, the group sequential test shows an empirical power of 0.90 when using a quadratic alpha spending function.

Key Actionable Insights

1
Implement group sequential tests in your experimentation framework to optimize statistical power while managing risks associated with peeking.
This is particularly useful for companies that analyze data in batches, as it allows for flexible testing without compromising the integrity of results.

2
Use always valid inference tests when working with streaming data to ensure bounded false positive rates.
These tests are designed to handle continuous data collection, making them ideal for scenarios where immediate feedback is necessary.

3
Establish a clear understanding of your data infrastructure before selecting a sequential testing method.
Knowing whether your data is delivered in batches or streams will significantly influence the effectiveness of the chosen testing framework.

Common Pitfalls

1

Underestimating the expected sample size can lead to an inflated false positive rate in group sequential tests.

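This occurs because the test's design relies on the expected sample size to decide how much alpha to spend at each analysis. If the actual sample size turns out larger than expected, the false positive rate can exceed the intended level; if it turns out smaller than expected, the test becomes overly conservative instead.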

2

Misunderstanding the flexibility of group sequential tests regarding the timing and number of intermittent analyses.

Many believe that these analyses must be predetermined, but in reality, they can be conducted as needed, allowing for more adaptive experimentation.

Related Concepts

Statistical Testing Frameworks

A/B Testing Methodologies

Data Analysis Techniques