How Airbnb Standardized Metric Computation at Scale

Amit Pahwa

Part II: The six design principles of Minerva compute infrastructure

Airbnb

Overview

This article discusses how Airbnb standardized metric computation at scale through its Minerva platform, focusing on the design principles that enable efficient dataset management, consistency, and user experience. It highlights the importance of declarative configurations, data versioning, and self-healing mechanisms in ensuring reliable data insights.

What You'll Learn

1. How to define metrics and dimensions using Minerva's standardized approach
2. Why data versioning is crucial for maintaining dataset consistency
3. How to implement automated backfilling for datasets with zero downtime
4. When to utilize the Staging environment for testing changes before production

Prerequisites & Requirements

Understanding of data metrics and dimensions
Familiarity with data processing tools and platforms (optional)

Key Questions Answered

How does Minerva ensure data consistency across datasets?

Minerva uses a data versioning system that generates a unique hash for configuration changes. When any field impacting data generation is modified, the version updates automatically, triggering backfills for affected datasets. This ensures that all datasets remain consistent and up-to-date with the latest definitions.
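The versioning idea can be sketched roughly as follows. This is an illustration under stated assumptions, not Minerva's actual implementation: the field names in `DATA_FIELDS`, the functions `config_version` and `needs_backfill`, and the choice of SHA-256 are all hypothetical. The point is only that the hash covers the fields that affect generated data, so editing a description leaves the version unchanged while editing the query does not.

```python
import hashlib
import json

# Hypothetical list of config fields that influence the generated data.
# Minerva's real set of version-relevant fields is not given in the source.
DATA_FIELDS = ("sql", "dimensions", "aggregation")


def config_version(config: dict) -> str:
    """Return a short, deterministic hash over the data-affecting fields."""
    relevant = {key: config[key] for key in DATA_FIELDS if key in config}
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def needs_backfill(old_config: dict, new_config: dict) -> bool:
    """A changed version means the affected dataset should be backfilled."""
    return config_version(old_config) != config_version(new_config)


base = {
    "name": "bookings",
    "description": "Number of bookings",
    "sql": "SELECT COUNT(*) FROM bookings",
    "dimensions": ["country"],
}
doc_only_edit = dict(base, description="Total number of bookings")
logic_edit = dict(base, sql="SELECT COUNT(DISTINCT id) FROM bookings")
```

With these sample configs, `needs_backfill(base, doc_only_edit)` is false and `needs_backfill(base, logic_edit)` is true, which mirrors the behavior described above: only changes that affect data generation trigger a backfill.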

What are the key design principles of Minerva?

Minerva is built on six design principles: Standardized, Declarative, Scalable, Consistent, Highly Available, and Well Tested. These principles guide the development of a robust metric computation infrastructure that simplifies user interactions and maintains data integrity.

What is the role of the Staging environment in Minerva?

The Staging environment allows users to test changes without affecting the Production environment. It backfills modified datasets automatically, ensuring that data consumers experience no downtime while changes are validated and merged into Production.

How does Minerva handle automated backfilling?

Minerva implements automated backfilling by identifying missing data during job execution. It dynamically includes this data in the current run, allowing for efficient recovery from transient issues and ensuring that datasets are always up-to-date.
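A minimal sketch of this self-healing check, assuming datasets are partitioned by day and that each run inspects a fixed lookback window. The function name `missing_partitions` and the `lookback_days` parameter are hypothetical and are not taken from Minerva.

```python
from datetime import date, timedelta
from typing import List, Set


def missing_partitions(existing: Set[date], run_date: date, lookback_days: int) -> List[date]:
    """List the daily partitions in the lookback window that have no data yet.

    The window covers run_date and the lookback_days days before it,
    returned oldest first so earlier gaps are filled before later ones.
    """
    window = [run_date - timedelta(days=offset) for offset in range(lookback_days, -1, -1)]
    return [day for day in window if day not in existing]


# Partitions already present for some dataset (sample data for illustration).
landed = {date(2021, 1, 1), date(2021, 1, 3)}
to_compute = missing_partitions(landed, run_date=date(2021, 1, 4), lookback_days=3)
```

Here `to_compute` holds January 2 (a gap left by an earlier failed run) and January 4 (the current run), so one job execution covers both the new data and the recovery.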

Key Statistics & Figures

Number of datasets served by Minerva: 5,000+

Minerva serves over 5,000 datasets across hundreds of users and 80+ teams, showcasing its scalability.

Technologies & Tools

Data Platform

Minerva

Used for standardized metric computation and dataset management at Airbnb.

Key Actionable Insights

1. Utilize Minerva's declarative configuration to streamline metric definitions.
By focusing on 'what' rather than 'how', users can quickly create and modify metrics without getting bogged down in implementation details, enhancing productivity.

2. Leverage the self-healing capabilities of Minerva to maintain data integrity.
This feature allows the system to automatically recover from transient issues, reducing the need for manual intervention and ensuring continuous data availability.

3. Implement batched backfills for efficient data recovery.
Batched backfills split large jobs into smaller, manageable tasks, which can run in parallel, minimizing the risk of long-running queries and improving overall system performance.

4. Use the prototyping tool in Minerva for rapid validation of new metrics.
This tool allows users to test changes in a sandbox environment, speeding up the iteration process and ensuring data accuracy before merging into Production.
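The batched backfill idea from the insights above can be illustrated with a short sketch. It assumes a backfill is described by an inclusive date range and that batches are fixed-size runs of days. The helper `split_into_batches` and its `batch_days` parameter are hypothetical names chosen for this example.

```python
from datetime import date, timedelta
from typing import List, Tuple


def split_into_batches(start: date, end: date, batch_days: int) -> List[Tuple[date, date]]:
    """Split the inclusive range [start, end] into consecutive date ranges.

    Each range spans at most batch_days days. The ranges do not overlap,
    so they could be submitted as independent jobs and run in parallel.
    """
    if batch_days < 1:
        raise ValueError("batch_days must be at least 1")
    batches = []
    current = start
    while current <= end:
        batch_end = min(current + timedelta(days=batch_days - 1), end)
        batches.append((current, batch_end))
        current = batch_end + timedelta(days=1)
    return batches


batches = split_into_batches(date(2021, 1, 1), date(2021, 1, 10), batch_days=4)
```

For this ten-day range the result is three batches: January 1 to 4, January 5 to 8, and January 9 to 10. A failed batch can be retried on its own without rerunning the whole backfill.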

Common Pitfalls

1. Failing to properly version datasets can lead to inconsistencies.
Without a robust versioning system, changes to metrics may not propagate correctly, causing discrepancies in data analysis and reporting.

2. Neglecting to utilize the Staging environment can result in production issues.
By not testing changes in a Staging environment, users risk introducing errors into Production, leading to potential downtime or data inaccuracies.

Related Concepts

Data Versioning In Data Management

Declarative Programming In Data Processing

Automated Data Recovery Techniques