Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Netflix Technology Blog

Netflix

AWS, AWS S3, Cassandra, gRPC, Memcached, SQL, YAML

Overview

The article discusses Bulldozer, a self-serve data platform developed by Netflix for efficiently moving batch data from data warehouse tables to online key-value stores. It highlights how Bulldozer simplifies data movement, enhances microservices performance, and manages data versioning effectively.

What You'll Learn

1. How to configure a Bulldozer job using YAML files
2. Why using a Key-Value Data Abstraction Layer improves data management
3. When to implement data version control in data pipelines

Prerequisites & Requirements

Understanding of data warehousing concepts
Familiarity with YAML and Protobuf (optional)

Key Questions Answered

What is Bulldozer and how does it work?

Bulldozer is a self-serve data platform that efficiently moves data from data warehouse tables to key-value stores in batches. It uses Netflix Scheduler for job scheduling and leverages Spark for data processing, allowing users to specify data sources and destinations in a YAML file.

How does Bulldozer handle data versioning?

Bulldozer creates a new namespace for each job execution, suffixed with the date, so every run produces a separate, versioned dataset. Consumers read through an alias namespace that points to the latest version, which keeps access seamless and makes it possible to fall back to an earlier version if the data is corrupted.
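
The date-suffixed namespace and alias scheme can be illustrated with a small in-memory model. This is only a sketch under assumptions: `VersionedStore`, its method names, the `title_scores` dataset, and the `YYYYMMDD` suffix format are hypothetical and do not come from Bulldozer or the KV DAL.

```python
from datetime import date


class VersionedStore:
    """Toy in-memory stand-in for a key-value store with namespaces and aliases."""

    def __init__(self):
        self.namespaces = {}  # namespace name -> {key: value}
        self.aliases = {}     # alias name -> namespace name

    def write_version(self, base, run_date, records):
        # Each job execution writes to its own namespace, suffixed with the run date.
        namespace = f"{base}_{run_date:%Y%m%d}"
        self.namespaces[namespace] = dict(records)
        return namespace

    def point_alias(self, alias, namespace):
        # Consumers only ever read the namespace the alias currently points to.
        if namespace not in self.namespaces:
            raise KeyError(f"unknown namespace: {namespace}")
        self.aliases[alias] = namespace

    def get(self, alias, key):
        return self.namespaces[self.aliases[alias]][key]


store = VersionedStore()

# First run: write version 1, then expose it through the alias.
v1 = store.write_version("title_scores", date(2020, 10, 1), {"t1": 0.4})
store.point_alias("title_scores", v1)

# Second run: write version 2 and move the alias to it.
v2 = store.write_version("title_scores", date(2020, 10, 2), {"t1": 0.9})
store.point_alias("title_scores", v2)
latest = store.get("title_scores", "t1")

# Fallback: if version 2 turns out to be bad, point the alias back at version 1.
store.point_alias("title_scores", v1)
previous = store.get("title_scores", "t1")
```

Because readers address only the alias, switching versions is a single pointer update and the older namespaces stay available for fallback.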

What technologies does Bulldozer utilize?

Bulldozer utilizes technologies such as Spark for data processing, Protobuf for data serialization, and a Key-Value Data Abstraction Layer (KV DAL) to decouple applications from specific storage engines, enhancing flexibility and maintainability.
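
To show what decoupling applications from a storage engine can look like, here is a minimal sketch of an abstraction-layer interface. The names `KeyValueStore`, `InMemoryStore`, and `load_batch` are assumptions made for illustration; the actual KV DAL API is not described in this summary.

```python
from typing import Dict, Iterable, Optional, Protocol, Tuple


class KeyValueStore(Protocol):
    """Hypothetical interface that application code depends on."""

    def put(self, namespace: str, key: str, value: bytes) -> None: ...

    def get(self, namespace: str, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """One possible engine; another engine could implement the same two methods."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, bytes]] = {}

    def put(self, namespace: str, key: str, value: bytes) -> None:
        self._data.setdefault(namespace, {})[key] = value

    def get(self, namespace: str, key: str) -> Optional[bytes]:
        return self._data.get(namespace, {}).get(key)


def load_batch(store: KeyValueStore, namespace: str,
               rows: Iterable[Tuple[str, bytes]]) -> int:
    # The loader sees only the interface, never a concrete engine.
    count = 0
    for key, value in rows:
        store.put(namespace, key, value)
        count += 1
    return count


backend = InMemoryStore()
written = load_batch(backend, "demo_ns", [("a", b"1"), ("b", b"2")])
```

Swapping the engine then means passing a different object to `load_batch`, with no change to the loader itself.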

What are the requirements for a Bulldozer job?

A Bulldozer job requires a YAML configuration that specifies the data movement properties, including the source data and destination namespace in the Key-Value DAL. It also defines key and value columns for data mapping.
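
The shape of such a job definition can be sketched as follows, with a plain dictionary standing in for the parsed YAML. The field names (`source_table`, `destination_namespace`, `key_columns`, `value_columns`) and the sample table are hypothetical and are not Bulldozer's real schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping, Tuple


@dataclass(frozen=True)
class JobConfig:
    source_table: str           # warehouse table to read from
    destination_namespace: str  # namespace to write to in the key-value layer
    key_columns: List[str]      # columns that form the key
    value_columns: List[str]    # columns that form the value


REQUIRED_FIELDS = ("source_table", "destination_namespace",
                   "key_columns", "value_columns")


def parse_job_config(raw: Mapping[str, Any]) -> JobConfig:
    # Reject configurations where a required field is absent or empty.
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return JobConfig(
        source_table=str(raw["source_table"]),
        destination_namespace=str(raw["destination_namespace"]),
        key_columns=list(raw["key_columns"]),
        value_columns=list(raw["value_columns"]),
    )


def to_key_value(row: Mapping[str, Any],
                 config: JobConfig) -> Tuple[tuple, Dict[str, Any]]:
    # Map one warehouse row to a (key, value) pair using the configured columns.
    key = tuple(row[column] for column in config.key_columns)
    value = {column: row[column] for column in config.value_columns}
    return key, value


config = parse_job_config({
    "source_table": "warehouse.member_scores",
    "destination_namespace": "member_scores",
    "key_columns": ["profile_id"],
    "value_columns": ["score"],
})
pair = to_key_value({"profile_id": 7, "score": 0.5, "unused": "x"}, config)
```

Columns not listed in the configuration, such as `unused` here, are simply left out of the stored record.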

Key Statistics & Figures

Number of records transferred

billions

Bulldozer transfers billions of records from the data warehouse to key-value stores every day.

Number of Netflix subscribers

195 million

Netflix has over 195 million subscribers generating petabytes of data daily.

Technologies & Tools

Backend

Spark

Used for reading data from the data warehouse and processing it into DataFrames.

Data Serialization

Protobuf

Used for defining data schemas and serializing/deserializing data in Bulldozer.

Backend

Key-Value Data Abstraction Layer (KV DAL)

Provides a standardized interface for applications to interact with various storage engines.

Configuration

YAML

Used for specifying configurations for Bulldozer jobs.

Key Actionable Insights

1. Implementing a configuration-based approach for data movement can significantly reduce the complexity of ETL processes.
   By using YAML files for configuration, teams can streamline data transfers without needing extensive coding, which saves time and reduces errors.

2. Leveraging a Key-Value Data Abstraction Layer can enhance application flexibility and reduce maintenance overhead.
   This approach allows applications to interact with a standardized interface, making it easier to adapt to changes in underlying storage technologies.

3. Regularly scheduled data versioning is crucial for maintaining data integrity and availability.
   By creating new namespaces for each version, teams can ensure that data consumers always access the most recent and complete dataset, facilitating seamless updates.

Common Pitfalls

1. Failing to ensure data integrity during batch transfers can lead to inconsistent data states.
   It's crucial to implement mechanisms that guarantee either the full dataset is written or none at all, preventing partial updates that could confuse consumers.

2. Neglecting to manage data versioning can result in data consumers accessing outdated or corrupted data.
   Without a proper version control system, data consumers may read from multiple versions, leading to inconsistencies and potential data corruption.
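
The all-or-nothing guarantee from the first pitfall can be approximated by staging a batch and exposing it only after a completeness check. The sketch below assumes a simple record-count comparison as that check; the summary does not say how Bulldozer verifies a load, and `publish_if_complete` and the `scores` names are hypothetical.

```python
def publish_if_complete(namespaces, aliases, alias, new_namespace,
                        records, expected_count):
    """Stage a batch and switch the alias only if every record arrived."""
    staged = {}
    for key, value in records:
        staged[key] = value

    if len(staged) != expected_count:
        # Incomplete batch: discard it and leave the alias untouched,
        # so consumers keep reading the previous full dataset.
        return False

    namespaces[new_namespace] = staged
    aliases[alias] = new_namespace
    return True


namespaces = {"scores_v1": {"a": 1, "b": 2}}
aliases = {"scores": "scores_v1"}

# A batch that lost a record is rejected and nothing changes for readers.
partial_ok = publish_if_complete(namespaces, aliases, "scores", "scores_v2",
                                 [("a", 10)], expected_count=2)

# A complete batch is published by moving the alias.
full_ok = publish_if_complete(namespaces, aliases, "scores", "scores_v3",
                              [("a", 10), ("b", 20)], expected_count=2)
```

Readers never observe the rejected batch because it is never registered under any namespace or alias.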

Related Concepts

Data Warehousing

ETL Processes

Microservices Architecture

Data Serialization