Opal: Building a mutable dataset in data lake

Bhupendra Jain

16 min read • advanced

Apache, Avro, MySQL, Oracle, SQL

Overview

The article discusses Opal, a system developed at LinkedIn to manage mutable datasets within a data lake. It addresses the challenges of reflecting updates from online data sources in an immutable data lake and outlines the architecture and functionality of Opal, including its ingestion process and metadata management.

What You'll Learn

1. How to implement a mutable dataset architecture in a data lake using Opal
2. Why read-time filtering improves data visibility and reduces latency
3. How to manage metadata efficiently in a mutable dataset environment

Prerequisites & Requirements

Understanding of data lakes and data management concepts
Familiarity with Apache ORC and Gobblin (optional)

Key Questions Answered

How does Opal handle mutable data in an immutable data lake?

Opal ingests mutable data from various online sources and builds a mutable dataset on top of an immutable file system. It utilizes read-time filtering and validity metadata to present an updated view of the data without requiring data to be sorted or reorganized, thus minimizing latency.
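Read-time filtering can be illustrated with a minimal sketch. This is not Opal's implementation: the `Segment` class, its field names, and the use of a plain list of booleans as the validity metadata are all assumptions made for illustration. The point it shows is that stored rows are never sorted or rewritten; a reader simply skips rows whose validity flag is cleared.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical immutable segment: rows are written once and never rewritten."""
    segment_id: int
    rows: list
    # Assumed validity metadata: one flag per row, True if the row is the latest version.
    valid: list


def read_visible(segments):
    """Yield only rows whose validity flag is set, in stored order."""
    for seg in segments:
        for row, is_valid in zip(seg.rows, seg.valid):
            if is_valid:
                yield row


# Row id=1 was updated upstream: the old copy in segment 0 is marked invalid,
# and the new copy lives in segment 1. No data file was reorganized.
seg0 = Segment(0, [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], [False, True])
seg1 = Segment(1, [{"id": 1, "name": "a2"}], [True])
visible = list(read_visible([seg0, seg1]))
```

Here `visible` contains the unchanged row for id 2 followed by the updated row for id 1, while the superseded row stays on disk but is filtered out at read time.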

What are the main components of the Opal data format?

The Opal data format consists of logical segments identified by unique IDs, with each row having a unique Opal row ID that tracks upstream mutations. It uses a columnar file format like Apache ORC to optimize full-table scan speed.
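One way to picture a row ID that is unique across segments is to combine a segment ID with the row's position inside that segment. The sketch below assumes exactly that composition and a 32-bit position field; neither detail is stated in this summary, and the names `RowId`, `pack`, and `unpack` are hypothetical.

```python
from typing import NamedTuple

# Assumed layout: high bits hold the segment ID, low 32 bits hold the row's ordinal.
ORDINAL_BITS = 32
ORDINAL_MASK = (1 << ORDINAL_BITS) - 1


class RowId(NamedTuple):
    segment_id: int
    ordinal: int


def pack(row_id: RowId) -> int:
    """Encode a (segment, ordinal) pair as a single integer."""
    if not 0 <= row_id.ordinal <= ORDINAL_MASK:
        raise ValueError("ordinal does not fit in the assumed 32-bit field")
    if row_id.segment_id < 0:
        raise ValueError("segment_id must be non-negative")
    return (row_id.segment_id << ORDINAL_BITS) | row_id.ordinal


def unpack(packed: int) -> RowId:
    """Recover the (segment, ordinal) pair from the packed integer."""
    return RowId(packed >> ORDINAL_BITS, packed & ORDINAL_MASK)


packed = pack(RowId(segment_id=7, ordinal=42))
restored = unpack(packed)
```

With an ID of this shape, a mutation to an upstream record can be tracked by pointing its key at a new row ID rather than by rewriting the segment that held the old row.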

What challenges does Opal address compared to previous systems?

Opal addresses the write-amplification problem and significant latency issues associated with previous systems like Lumos by decoupling data layout from filtering conditions, allowing for real-time updates without extensive data rewriting.

Technologies & Tools

Storage Format

Apache ORC

Used to maximize full-table scan speed in Opal datasets.

Data Ingestion Tool

Gobblin

Drives offline ingestion at LinkedIn and is leveraged for ingesting data into the Opal format.

Key Actionable Insights

1. Implementing Opal can significantly reduce the latency of data updates in your data lake architecture.
   By utilizing read-time filtering and validity metadata, Opal allows for real-time data visibility, which is crucial for applications requiring up-to-date information.

2. Leveraging a columnar file format like Apache ORC can enhance the performance of data retrieval operations.
   This is particularly beneficial in environments where full-table scans are common, as it optimizes both storage and access speed.

3. Understanding the role of metadata in Opal is essential for effective data management.
   The validity metadata bitmap and key store are critical for determining record visibility and managing updates efficiently.
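The interplay between a key store and a validity bitmap can be sketched as follows. This is a toy, in-memory model under stated assumptions, not Opal's code: the class `MutableView`, its methods, the dictionary used as the key store, and the list of booleans used as the bitmap are all hypothetical stand-ins.

```python
class MutableView:
    """Toy model of a mutable view over append-only storage."""

    def __init__(self):
        self.rows = []       # append-only data, standing in for immutable files
        self.valid = []      # assumed validity bitmap: one flag per appended row
        self.key_store = {}  # assumed key store: primary key -> index of latest row

    def upsert(self, key, value):
        """Append the new version and invalidate the previous one, if any."""
        old = self.key_store.get(key)
        if old is not None:
            self.valid[old] = False
        self.rows.append((key, value))
        self.valid.append(True)
        self.key_store[key] = len(self.rows) - 1

    def delete(self, key):
        """Clear the validity flag of the latest version; nothing is rewritten."""
        old = self.key_store.pop(key, None)
        if old is not None:
            self.valid[old] = False

    def scan(self):
        """Return only the rows that are currently visible."""
        return [row for row, ok in zip(self.rows, self.valid) if ok]


view = MutableView()
view.upsert("a", 1)
view.upsert("b", 2)
view.upsert("a", 3)  # supersedes the first row for key "a"
view.delete("b")
result = view.scan()
```

After these operations `result` holds only the latest version of key "a", while all three appended rows remain in `view.rows`. The update cost is a key lookup plus a flag flip, which is the reason the metadata has to stay consistent with ingestion.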

Common Pitfalls

1. Failing to manage the validity metadata can lead to outdated or incorrect data being presented to users.
   This can happen if the metadata is not updated in real time during ingestion, which Opal addresses through its design.

Related Concepts

Data Lakes

Metadata Management

Data Ingestion Frameworks

Columnar Storage Formats