Riverbed Data Hydration — Part 1

A deep dive into the streaming aspect of the Lambda architecture framework of Riverbed.

Xiangmin Liang
9 min read · advanced

Overview

This article explores Riverbed, a framework in Airbnb's tech stack that consumes data from system-of-record data stores and keeps secondary read-optimized stores up to date. It focuses on the streaming side of Riverbed's Lambda architecture: how materialized views are built from Change Data Capture (CDC) events, and how the Notification Pipeline is designed.

What You'll Learn

1. How to define Riverbed pipelines using a declarative schema-based interface
2. Why using Directed Acyclic Graphs (DAGs) optimizes data joining in streaming systems
3. How to implement the Notification Pipeline to construct materialized views

Prerequisites & Requirements

  • Understanding of Lambda architecture and Change Data Capture (CDC)
  • Familiarity with Apache Kafka® and data streaming concepts (optional)

Key Questions Answered

How does the Notification Pipeline in Riverbed work?
The Notification Pipeline consumes Notification events from Kafka®, queries dependent data sources, and stitches together documents to be written into a read-optimized sink. It involves operations such as ingestion, join, stitch, and sink to ensure data freshness and consistency.
What is the purpose of JoinConditionsDag in Riverbed?
JoinConditionsDag is a Directed Acyclic Graph used to store the relationship metadata among data sources in Riverbed. Each node represents a unique data source, and edges represent join conditions, guiding the Notification Pipeline in fetching necessary data for materialized views.
What are the key operations in the Notification Pipeline?
The key operations include ingestion, where Notification events are consumed; join, where data is fetched from various sources; stitch, which models join results into a usable format; and sink, where the final documents are written into data sinks.
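
The answers above can be tied together in a small sketch. It is not Riverbed's implementation: the class and function names, the in-memory `SOURCES` dictionary, and the shape of the Notification event are all assumptions made for illustration, and a plain dictionary stands in for both Kafka® and the read-optimized sink. It shows a DAG whose edges carry join conditions guiding the join step, followed by stitch and sink.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory stand-ins for the system-of-record stores.
SOURCES = {
    "listing": [{"id": 1, "title": "Loft", "host_id": 10}],
    "host": [{"id": 10, "name": "Ana"}],
    "review": [
        {"id": 100, "listing_id": 1, "stars": 5},
        {"id": 101, "listing_id": 1, "stars": 4},
    ],
}


@dataclass
class JoinConditionsDag:
    """Nodes are data sources; each edge holds one join condition."""
    root: str
    # parent source -> list of (child source, parent key, child key)
    edges: dict = field(default_factory=dict)

    def add_join(self, parent, child, parent_key, child_key):
        self.edges.setdefault(parent, []).append((child, parent_key, child_key))


def ingest(event):
    """Ingestion: extract the root document id from a Notification event."""
    return event["document_id"]


def join(dag, source, row):
    """Join: walk the DAG from `source`, fetching rows that match each edge."""
    children = {}
    for child, parent_key, child_key in dag.edges.get(source, []):
        matches = [r for r in SOURCES[child] if r[child_key] == row[parent_key]]
        children[child] = [join(dag, child, m) for m in matches]
    return {"row": row, "children": children}


def stitch(joined):
    """Stitch: turn the join result tree into one nested document."""
    doc = dict(joined["row"])
    for child, results in joined["children"].items():
        doc[child] = [stitch(r) for r in results]
    return doc


def sink(store, doc):
    """Sink: write the finished document to the read-optimized store."""
    store[doc["id"]] = doc


def handle_notification(dag, event, store):
    doc_id = ingest(event)
    root_row = next(r for r in SOURCES[dag.root] if r["id"] == doc_id)
    sink(store, stitch(join(dag, dag.root, root_row)))


dag = JoinConditionsDag(root="listing")
dag.add_join("listing", "host", "host_id", "id")
dag.add_join("listing", "review", "id", "listing_id")

read_store = {}
handle_notification(dag, {"document_id": 1}, read_store)
```

After the call, `read_store[1]` holds the listing row with its host and both reviews nested under the `host` and `review` keys. Keeping the join conditions in the DAG, rather than in the pipeline code, means a new data source can be added with one more `add_join` call.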

Technologies & Tools

Messaging
Apache Kafka®
Used for publishing and consuming Notification events in the Riverbed framework.

Key Actionable Insights

1. Implementing a DAG structure for data joins can significantly reduce memory usage and improve performance in streaming applications.
   This approach is particularly beneficial for high-cardinality joins, as it avoids the pitfalls of traditional flat table structures.
2. Utilizing a declarative schema-based interface for defining data pipelines can streamline the integration of multiple data sources.
   This method simplifies the process for developers, enabling more efficient data management and retrieval.
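
To make the first insight concrete, the sketch below compares the two result shapes for a single parent row joined to two child sources. The source names and row counts are made up for illustration; the point is only that a flat join pairs every child row with every other child row, while a DAG-shaped result stores each child row once.

```python
from itertools import product

listing = {"id": 1, "title": "Loft"}
reviews = [{"review_id": i} for i in range(50)]
photos = [{"photo_id": i} for i in range(40)]

# Flat-table join: every review is paired with every photo.
flat_rows = [{**listing, **r, **p} for r, p in product(reviews, photos)]

# DAG-shaped result: each child source hangs off the listing once.
nested = {"row": listing, "children": {"review": reviews, "photo": photos}}
nested_rows = 1 + sum(len(rows) for rows in nested["children"].values())

print(len(flat_rows), nested_rows)  # 2000 91
```

The flat result grows with the product of the child cardinalities, while the nested result grows with their sum.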

Common Pitfalls

1. Failing to properly manage concurrency and versioning in data pipelines can lead to inconsistencies.
   This often occurs when changes in the underlying data sources are not accurately captured and communicated, highlighting the importance of robust event-driven architectures.
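
One common guard against this pitfall is a version check at write time. The sketch below is a generic technique, not a description of how Riverbed handles it: the `VersionedSink` class, its methods, and the integer version numbers are all hypothetical. It shows a sink that rejects a document whose version is not newer than the one it already holds, so an out-of-order write cannot overwrite fresher data.

```python
class VersionedSink:
    """Hypothetical read-optimized store that rejects stale writes."""

    def __init__(self):
        self._docs = {}  # document id -> (version, document)

    def write(self, doc_id, version, doc):
        """Store `doc` only if `version` is newer than what is held."""
        current = self._docs.get(doc_id)
        if current is not None and current[0] >= version:
            return False  # stale or duplicate write: keep existing document
        self._docs[doc_id] = (version, doc)
        return True

    def read(self, doc_id):
        return self._docs[doc_id][1]


sink = VersionedSink()
# Two workers processed the same document; the newer result arrives first.
applied_new = sink.write("listing-1", version=7, doc={"title": "Loft (renovated)"})
applied_old = sink.write("listing-1", version=6, doc={"title": "Loft"})
```

Here the second write returns `False` and the sink keeps the version 7 document. In a real store the comparison and the write would need to happen atomically, for example through a conditional update.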

Related Concepts

Lambda Architecture
Change Data Capture (CDC)
Data Streaming
Materialized Views