How Airbnb Built “Wall” to prevent data bugs

Gaining trust in data with extensive data quality, accuracy and anomaly checks

Subrata (Subu) Biswas
10 min read · intermediate

Overview

The article discusses how Airbnb developed the Wall framework to enhance data quality and prevent data bugs across its data engineering workflows. It outlines the challenges faced with existing data quality checks and provides insights into the architecture and functionality of the Wall framework.

What You'll Learn

1. How to implement data quality checks using the Wall framework
2. Why centralized data quality checks improve data reliability
3. How to simplify ETL pipelines with the Wall framework

Prerequisites & Requirements

  • Understanding of data quality concepts and ETL processes
  • Familiarity with Apache Airflow
  • Experience with Python programming (optional)

Key Questions Answered

What challenges did Airbnb face in implementing data quality checks?
Airbnb faced challenges such as multiple approaches to adding data checks, redundant efforts among teams building tools in silos, and complicated Airflow DAG code that became difficult to maintain. These issues hindered the scalability and efficiency of data quality checks across the organization.
How does the Wall framework improve data quality at Airbnb?
The Wall framework standardizes data quality checks by providing a unified interface for adding checks to ETL pipelines. It allows users to define checks in a configuration file, simplifying the process and reducing the complexity of Airflow DAGs, ultimately ensuring more reliable data across the organization.
What are the key components of the Wall framework?
The Wall framework consists of three main components: WallApiManager, which orchestrates checks; WallConfigManager, which parses and validates configuration files; and CheckConfigModel, which defines individual checks and generates Airflow tasks. This architecture allows for extensibility and ease of use.
How can teams add their specific data quality checks to Wall?
Teams can easily add their specific checks to the Wall framework by creating their own CheckConfigModel classes. This allows for customization while adhering to the standardized approach of the Wall framework, promoting collaboration across the data engineering community.
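
The division of labor among the three components can be sketched in plain Python. This is a minimal illustration, not Airbnb's implementation: only the class names WallApiManager, WallConfigManager, and CheckConfigModel come from the article, while every method name, the registry dictionary, the two example checks, and the config layout are assumptions made for the example. Task ids are returned as strings in place of real Airflow tasks so the sketch runs without Airflow installed.

```python
from typing import Dict, List, Type


class CheckConfigModel:
    """Base class for one kind of check (hypothetical interface)."""

    check_type = "base"

    def __init__(self, table: str, params: dict):
        self.table = table
        self.params = params

    def validate(self) -> None:
        """Raise ValueError if this check's parameters are malformed."""

    def generate_tasks(self) -> List[str]:
        """Return task ids; a real model would build Airflow tasks here."""
        raise NotImplementedError


class RowCountCheck(CheckConfigModel):
    """Example built-in check (hypothetical)."""

    check_type = "row_count"

    def validate(self) -> None:
        if "min_rows" not in self.params:
            raise ValueError("row_count check requires 'min_rows'")

    def generate_tasks(self) -> List[str]:
        return [f"wall_{self.table}_row_count"]


class NullRateCheck(CheckConfigModel):
    """Example team-specific check added by subclassing (hypothetical)."""

    check_type = "null_rate"

    def validate(self) -> None:
        rate = self.params.get("max_rate")
        if "column" not in self.params:
            raise ValueError("null_rate check requires 'column'")
        if not isinstance(rate, (int, float)) or not 0 <= rate <= 1:
            raise ValueError("null_rate check requires 'max_rate' in [0, 1]")

    def generate_tasks(self) -> List[str]:
        return [f"wall_{self.table}_null_rate_{self.params['column']}"]


class WallConfigManager:
    """Parses a config and validates each check entry."""

    def __init__(self, registry: Dict[str, Type[CheckConfigModel]]):
        self.registry = registry

    def parse(self, config: dict) -> List[CheckConfigModel]:
        models = []
        for entry in config["checks"]:
            model_cls = self.registry.get(entry["type"])
            if model_cls is None:
                raise ValueError(f"unknown check type: {entry['type']}")
            model = model_cls(config["table"], entry.get("params", {}))
            model.validate()
            models.append(model)
        return models


class WallApiManager:
    """Orchestrates the checks: parse the config, then collect every task."""

    def __init__(self, config_manager: WallConfigManager):
        self.config_manager = config_manager

    def build_tasks(self, config: dict) -> List[str]:
        tasks: List[str] = []
        for model in self.config_manager.parse(config):
            tasks.extend(model.generate_tasks())
        return tasks


# Registering a new check type is the only step a team needs to extend the set.
registry = {cls.check_type: cls for cls in (RowCountCheck, NullRateCheck)}
manager = WallApiManager(WallConfigManager(registry))

config = {
    "table": "bookings",
    "checks": [
        {"type": "row_count", "params": {"min_rows": 1}},
        {"type": "null_rate", "params": {"column": "guest_id", "max_rate": 0.01}},
    ],
}
task_ids = manager.build_tasks(config)
print(task_ids)
```

In this sketch a team-specific check is just another subclass plus a registry entry, so the orchestration code does not change when new check types appear.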

Key Actionable Insights

1. Implementing the Wall framework can drastically reduce the complexity of your ETL pipelines.
   By centralizing data quality checks within Wall, teams can avoid duplicating code and streamline their workflows, leading to more maintainable and efficient data engineering practices.
2. Standardizing data quality checks across teams enhances data reliability.
   When all teams use the same framework for data checks, it reduces inconsistencies and ensures that data quality standards are uniformly applied, which is crucial for making informed business decisions.
3. Utilizing configuration-driven checks allows for faster development and easier maintenance.
   By defining checks in YAML files, teams can quickly adapt to changing requirements without modifying the underlying codebase, making it easier to scale data quality efforts.
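
To make the configuration-driven idea concrete, here is a small sketch of checks that are selected and parameterized by data rather than by code. It is not Wall's actual schema or API: the config keys, the check names, and the helper functions are all hypothetical, and a plain Python dictionary stands in for the parsed contents of a YAML file so the example needs only the standard library.

```python
# Stand-in for a parsed YAML file; the keys and check names are hypothetical.
CONFIG = {
    "table": "bookings",
    "checks": [
        {"type": "not_null", "column": "listing_id"},
        {"type": "min_row_count", "threshold": 2},
    ],
}


def check_not_null(rows, spec):
    """Pass only if no row has a missing value in the configured column."""
    return all(row.get(spec["column"]) is not None for row in rows)


def check_min_row_count(rows, spec):
    """Pass only if the table has at least the configured number of rows."""
    return len(rows) >= spec["threshold"]


# Maps a check type named in the config to the function that implements it.
CHECKS = {
    "not_null": check_not_null,
    "min_row_count": check_min_row_count,
}


def run_checks(rows, config):
    """Run every configured check and return {check type: passed?}."""
    results = {}
    for spec in config["checks"]:
        results[spec["type"]] = CHECKS[spec["type"]](rows, spec)
    return results


# In-memory rows stand in for a real table.
rows = [{"listing_id": 1}, {"listing_id": None}]
results = run_checks(rows, CONFIG)
print(results)
```

Tightening a threshold or adding another check of an existing type is then an edit to the config alone; code changes are needed only when a new kind of check is introduced.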

Common Pitfalls

1. Failing to standardize data quality checks can lead to inconsistent implementations across teams.
   Without a unified approach, teams may create redundant tools and frameworks, increasing maintenance overhead and complicating the data quality landscape.
2. Overcomplicating Airflow DAGs with numerous individual checks can make them hard to maintain.
   When each data quality check is written as a separate task, DAG files become bloated and difficult to manage. Consolidating the checks in a framework like Wall can alleviate these issues.

Related Concepts

Data Quality Assurance
ETL Processes
Data Engineering Best Practices