Open sourcing Brooklin: Near real-time data streaming at scale

Celia K.
10 min read | Intermediate

Overview

The article discusses the open-sourcing of Brooklin, a distributed service for near real-time data streaming at scale, which has been in production at LinkedIn since 2016. It highlights Brooklin's capabilities, use cases, and its advantages over previous solutions like Kafka MirrorMaker.

What You'll Learn

1. How to implement a streaming bridge for data across different environments
2. Why Brooklin is a suitable replacement for Kafka MirrorMaker
3. How to utilize change data capture (CDC) for real-time database updates

Prerequisites & Requirements

  • Understanding of data streaming concepts and distributed systems
  • Familiarity with Kafka and cloud services like AWS and Azure (optional)

Key Questions Answered

What is Brooklin and how does it function?
Brooklin is a distributed system designed for streaming data across various data stores and messaging systems with high reliability. It can be extended with new consumers and producers, so additional sources and destinations can be supported across diverse data environments.
What are the primary use cases for Brooklin?
Brooklin primarily serves two use cases: as a streaming bridge that connects different data environments, and for change data capture (CDC), which streams database updates in real time. CDC lets applications respond to changes quickly while keeping their load isolated from the source database.
How does Brooklin improve upon Kafka MirrorMaker?
Brooklin addresses scaling issues faced by Kafka MirrorMaker by allowing multiple independent data pipelines to coexist within a single cluster, significantly reducing operational complexity and improving stability.
What features does Brooklin offer for managing data streams?
Brooklin provides features such as dynamic provisioning of data pipelines via REST endpoints, diagnostics for monitoring stream status, and enhanced failure isolation to ensure that issues with one partition do not affect the entire pipeline.
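
The failure isolation described in the last answer can be illustrated with a minimal sketch. This is not Brooklin's API or implementation: `PartitionMirror`, `flaky_send`, and the in-memory batches are hypothetical names invented for illustration, assuming a pipeline that copies records partition by partition. The point is only that a failing partition is paused on its own while the others keep flowing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class PartitionMirror:
    """Toy mirroring loop with per-partition failure isolation (hypothetical, not Brooklin's API)."""

    send: Callable[[int, str], None]  # delivers one record to the destination
    paused: Set[int] = field(default_factory=set)
    delivered: Dict[int, List[str]] = field(default_factory=dict)

    def mirror(self, batches: Dict[int, List[str]]) -> None:
        for partition, records in batches.items():
            if partition in self.paused:
                continue  # skip partitions that failed earlier; the rest keep flowing
            try:
                for record in records:
                    self.send(partition, record)
                    self.delivered.setdefault(partition, []).append(record)
            except Exception:
                # Pause only the failing partition instead of stopping the whole pipeline.
                self.paused.add(partition)


def flaky_send(partition: int, record: str) -> None:
    """Simulated destination that rejects every record for partition 1."""
    if partition == 1:
        raise RuntimeError("destination rejected record")


mirror = PartitionMirror(send=flaky_send)
mirror.mirror({0: ["a", "b"], 1: ["c"], 2: ["d"]})
```

After this run, partition 1 is in `mirror.paused`, while partitions 0 and 2 have delivered all of their records. A real system would also need a policy for resuming paused partitions, which this sketch leaves out.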

Key Statistics & Figures

Messages processed daily
Over 2 trillion
This highlights Brooklin's capability to handle massive data streams efficiently.

Technologies & Tools

  • Kafka (Messaging System): Used as one of the primary messaging systems for data streaming within Brooklin.
  • AWS Kinesis (Messaging System): Serves as a data source and destination for streaming data.
  • Azure Event Hubs (Messaging System): Another messaging system integrated with Brooklin for data streaming.
  • Oracle (Database): Used as a data source for change data capture within Brooklin.
  • Espresso (Database): LinkedIn's distributed document store utilized in conjunction with Brooklin.

Key Actionable Insights

1. Using Brooklin as a streaming bridge can simplify data movement across different cloud services and data centers.
   This is particularly useful for organizations operating in hybrid environments, as it centralizes data management and reduces the complexity of maintaining multiple data pipelines.
2. Implementing change data capture (CDC) with Brooklin can enhance application performance by reducing the need for frequent database queries.
   By streaming updates in real time, applications can react promptly to changes, improving user experience and system efficiency.
3. Transitioning from Kafka MirrorMaker to Brooklin can streamline operations and reduce the number of required clusters.
   This consolidation leads to easier management and better resource utilization, allowing teams to focus on development rather than maintenance.
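
The change data capture insight above can be sketched in a few lines. This is a minimal illustration, not Brooklin's event format or client API: `ChangeEvent`, its `op` values, and `apply_changes` are hypothetical names. It assumes the application keeps a local copy of the data and applies a stream of change events to it instead of re-querying the database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ChangeEvent:
    """One captured database change (hypothetical shape, for illustration only)."""

    op: str  # "upsert" or "delete"
    key: str
    value: str = ""


def apply_changes(cache: Dict[str, str], events: Iterable[ChangeEvent]) -> Dict[str, str]:
    """Apply change events in order so the cache tracks the source table without polling it."""
    for event in events:
        if event.op == "upsert":
            cache[event.key] = event.value
        elif event.op == "delete":
            cache.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return cache


profiles = {"u1": "Engineer"}
apply_changes(
    profiles,
    [
        ChangeEvent("upsert", "u2", "Designer"),
        ChangeEvent("upsert", "u1", "Senior Engineer"),
        ChangeEvent("delete", "u2"),
    ],
)
```

Events must be applied in the order they were captured. Here the later upsert for `u1` overwrites the earlier value, and the delete removes `u2` after it was inserted.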

Common Pitfalls

1. Failing to manage multiple Kafka clusters effectively can lead to operational complexities.
   Many organizations struggle with maintaining numerous Kafka MirrorMaker instances, which can be mitigated by using Brooklin's single-cluster approach for multiple pipelines.
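
As a rough illustration of the single-cluster idea, the sketch below spreads tasks from several pipelines across one shared pool of workers. It is not how Brooklin assigns work: `assign_tasks`, the pipeline names, and the round-robin strategy are all assumptions chosen for simplicity.

```python
from itertools import cycle
from typing import Dict, List


def assign_tasks(pipelines: Dict[str, int], workers: List[str]) -> Dict[str, List[str]]:
    """Spread tasks from every pipeline round-robin across one shared worker pool."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment: Dict[str, List[str]] = {worker: [] for worker in workers}
    next_worker = cycle(workers)
    for pipeline, task_count in pipelines.items():
        for task_index in range(task_count):
            # Each task is labeled with its pipeline so pipelines stay distinguishable.
            assignment[next(next_worker)].append(f"{pipeline}#{task_index}")
    return assignment


# Hypothetical pipeline and host names.
plan = assign_tasks(
    {"tracking-east-to-west": 3, "metrics-east-to-west": 2},
    ["host-a", "host-b"],
)
```

Both pipelines end up with tasks on both hosts, so adding a pipeline means adding tasks to the existing pool, not standing up a separate cluster for it.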

Related Concepts

Data Streaming
Distributed Systems
Change Data Capture (CDC)
Kafka and Its Ecosystem