How Airbnb safeguards changes in production

Zack Loebel-Begelman

Part II: Near Real-time Experiments

Airbnb

Apache, Java, MySQL, Scala

Overview

The article discusses Airbnb's Safe Deploy system, focusing on its architecture and engineering choices for implementing near real-time experiments. It highlights the components of the system, including the Ramp Controller, Near Real Time (NRT) pipeline, and the Measured framework, emphasizing the importance of safeguarding changes in production.

What You'll Learn

1. How to design a near real-time experimentation system for production environments
2. Why limiting near real-time results to the first 24 hours is effective for catching major issues
3. How to utilize Apache Flink for processing event streams in real-time
4. When to implement automated experiment ramping to minimize negative impacts

Prerequisites & Requirements

Understanding of event-driven architectures and data processing pipelines
Familiarity with Apache Flink and Kafka (optional)

Key Questions Answered

What are the main components of Airbnb's Safe Deploy system?

The Safe Deploy system consists of three main components: the Ramp Controller, which coordinates experiment configurations; the Near Real Time (NRT) pipeline, which processes and enriches data; and the Measured framework, which computes metrics and statistical significance of changes.

How does the Ramp Controller minimize negative impacts during experiments?

The Ramp Controller automates the ramping of experiments, gradually increasing exposure while monitoring metrics. If any egregiously negative metric is detected, it immediately shuts down the experiment to prevent further negative impacts.
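The ramp-and-shut-down behavior described above can be sketched as a small loop. This is a minimal illustration, not Airbnb's implementation: `run_ramp`, the stage list, and the three callbacks are hypothetical names, and a real controller would wait for enough data to accumulate between stages.

```python
from typing import Callable, Dict, Sequence


def run_ramp(
    stages: Sequence[float],
    set_exposure: Callable[[float], None],
    fetch_metrics: Callable[[], Dict[str, float]],
    is_egregious: Callable[[str, float], bool],
) -> float:
    """Step through exposure stages and return the final exposure.

    All parameters are hypothetical stand-ins:
      stages        - increasing exposure fractions, e.g. [0.05, 0.25, 1.0]
      set_exposure  - applies an exposure fraction to the experiment
      fetch_metrics - returns {metric name: percent change vs. control}
      is_egregious  - decides whether a metric reading should stop the ramp
    """
    current = 0.0
    for exposure in stages:
        set_exposure(exposure)
        current = exposure
        # A real controller would pause here until fresh metrics are available.
        metrics = fetch_metrics()
        if any(is_egregious(name, change) for name, change in metrics.items()):
            # Shut the experiment down immediately on an egregious metric.
            set_exposure(0.0)
            return 0.0
    return current
```

Passing the stop condition in as a callback keeps the ramp logic separate from the statistical rule, so the rule can change without touching the loop.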

What challenges did Airbnb face when implementing the NRT pipeline?

Airbnb encountered challenges such as handling out-of-order events and managing data aging. They addressed these by implementing a custom join mechanism and buffering strategies to ensure timely and accurate data processing.
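To make the out-of-order problem concrete, here is a generic event-time reorder buffer. It is a sketch of one common buffering technique, not the custom join mechanism the article refers to; `ReorderBuffer` and `allowed_lateness` are illustrative names, and dropping events that arrive too late is an assumed policy.

```python
import heapq
from typing import Any, List, Tuple


class ReorderBuffer:
    """Hold events briefly so they can be released in event-time order."""

    def __init__(self, allowed_lateness: float) -> None:
        self.allowed_lateness = allowed_lateness
        self.dropped = 0  # events that arrived too late to be ordered
        self._heap: List[Tuple[float, int, Any]] = []
        self._seq = 0  # tie-breaker so payloads are never compared
        self._max_seen = float("-inf")
        self._released_up_to = float("-inf")

    def add(self, event_time: float, payload: Any) -> List[Tuple[float, Any]]:
        """Buffer one event and return the events that are now safe to emit."""
        if event_time < self._released_up_to:
            self.dropped += 1  # older than anything we may already have emitted
            return []
        heapq.heappush(self._heap, (event_time, self._seq, payload))
        self._seq += 1
        self._max_seen = max(self._max_seen, event_time)
        # Events at or below the watermark are assumed to be complete.
        watermark = self._max_seen - self.allowed_lateness
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            t, _, p = heapq.heappop(self._heap)
            ready.append((t, p))
        self._released_up_to = max(self._released_up_to, watermark)
        return ready

    def flush(self) -> List[Tuple[float, Any]]:
        """Emit everything still buffered, in event-time order."""
        ready = []
        while self._heap:
            t, _, p = heapq.heappop(self._heap)
            ready.append((t, p))
        return ready
```

A larger `allowed_lateness` tolerates more disorder but delays every result, which is the trade-off any buffering strategy in a near real-time pipeline has to balance.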

Why was the initial focus of Safe Deploys on A/B tests?

The initial focus on A/B tests was to build trust in the system and gain experience with automated anomaly detection and remediation, which would help in safeguarding changes in production more effectively.

Key Statistics & Figures

Percentage of experiment starts using Safe Deploys

85%

Since being enabled by default, Safe Deploys has been used for over 85% of experiment starts.

Threshold for marking a metric as egregious

Percent change below -20% with adjusted p-value ≤ 0.01

A metric is considered egregious if its percent change is below -20% (that is, a drop of more than 20%) and its adjusted p-value is less than or equal to 0.01.
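The threshold above translates directly into a predicate. The function name and parameters are illustrative; the sketch assumes the percent change and the adjusted p-value have already been computed upstream.

```python
def is_egregious(
    pct_change: float,
    adjusted_p_value: float,
    pct_threshold: float = -20.0,
    p_threshold: float = 0.01,
) -> bool:
    """Return True when a metric crosses the egregious threshold.

    pct_change is in percent (e.g. -25.0 for a 25% drop). The comparison is
    strict for the percent change and inclusive for the p-value, matching
    the wording of the rule.
    """
    return pct_change < pct_threshold and adjusted_p_value <= p_threshold
```

Both conditions must hold: a large drop that is not statistically significant, or a significant but small drop, does not count as egregious under this rule.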

Technologies & Tools

Backend

Apache Flink

Used for building the Near Real Time (NRT) pipeline to process event streams.

Messaging

Kafka

Employed for event streaming and communication between components.

Programming

Python

Used in the Measured framework for defining metrics and statistical models.

Database

DuckDB

Utilized for aggregating user-level data from event files.

Storage

S3

Serves as the storage solution for enriched measures and event data.

Key Actionable Insights

1. Implement a Ramp Controller to automate the ramping of experiments and monitor metrics effectively.
   This approach minimizes human error and allows for quicker responses to negative impacts, enhancing the reliability of experiments.

2. Utilize Apache Flink for real-time data processing to improve the responsiveness of your experimentation system.
   Flink's capabilities in handling event streams can significantly enhance the performance and scalability of your data processing pipelines.

3. Limit near real-time results to the first 24 hours of an experiment to focus on catching major issues.
   After that window, teams transition to batch results, which provide comprehensive insights without overwhelming the near real-time system with data.
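The 24-hour cutoff in the last insight can be expressed as a simple routing rule. This is a sketch under assumptions: `result_source` and the `"nrt"`/`"batch"` labels are hypothetical, and treating exactly 24 hours as the start of the batch phase is an arbitrary boundary choice.

```python
from datetime import datetime, timedelta, timezone

# Window during which near real-time results are served (from the text above).
NRT_WINDOW = timedelta(hours=24)


def result_source(started_at: datetime, now: datetime) -> str:
    """Pick which results to show based on the experiment's age."""
    return "nrt" if now - started_at < NRT_WINDOW else "batch"


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
source_after_one_hour = result_source(start, start + timedelta(hours=1))
```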

Common Pitfalls

1. Relying solely on batch results for decision-making can lead to delayed responses to negative impacts.
   This happens because batch results may not provide timely insights, making it crucial to implement near real-time monitoring for immediate feedback.

2. Underestimating the complexity of managing out-of-order events in streaming data.
   This can lead to inaccurate data processing and results, so it's important to design robust mechanisms for handling event timing and ordering.

Related Concepts

Event-driven Architectures

Data Processing Pipelines

A/B Testing Methodologies

Anomaly Detection Systems