Stream Processing Hard Problems Part II: Data Access

LinkedIn Engineering Team

•

LinkedIn Engineering Team

•21 min read•advanced•

--

•View Original

ApacheApache KafkaAWSAzureCassandraOracleSQL

Overview

This article discusses the challenges of data access in high-scale stream processing, particularly focusing on the read/write and read-only data access patterns. It explores solutions for efficient data access, including the use of local and remote stores, and highlights the importance of partitioning and dataset size in stream processing architectures.

What You'll Learn

1

How to implement local state management in stream processing applications

2

Why partitioned data access is crucial for performance in stream processing

3

When to choose between local and remote data stores for stream processing

Prerequisites & Requirements

Basic understanding of stream processing concepts
Familiarity with Apache Samza and Apache Kafka(optional)

Key Questions Answered

What are the main data access patterns in stream processing?

The article identifies two main data access patterns in stream processing: read/write data, which involves maintaining state for each member, and read-only data (adjunct data), which requires looking up additional information, such as member profiles, to process events. Understanding these patterns is essential for optimizing data access in high-scale applications.

How does partitioning affect data access in stream processing?

Partitioning data access allows each event processing node to access a mutually exclusive set of members, significantly improving efficiency. When data is partitioned, applications can implement caching optimizations, leading to faster access times and reduced load on remote databases, which is crucial for high-scale stream processing.

What are the performance implications of using local versus remote data stores?

Local embedded databases can handle much higher throughput compared to remote databases. For instance, performance tests showed that a Samza job could process 1.1 million requests per second with a local store, while accessing a remote database dropped performance to less than 10,000 requests per second, highlighting the efficiency of local state management.

Key Statistics & Figures

Throughput of local state management

1.1 million requests per second

Achieved on a single SSD-based machine with a local embedded database.

Throughput of remote database access

Less than 10,000 requests per second

Observed when the same Samza job accessed a remote database.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Stream Processing Framework

Apache Samza

Used for building and managing stream processing applications at LinkedIn.

Messaging System

Apache Kafka

Serves as the durable pub-sub messaging pipe for event processing.

Embedded Database

Rocksdb

Used for local state management within Samza applications.

Nosql Database

Couchbase

Utilized as a remote cache to improve access times for adjunct data.

Key Actionable Insights

1
Implementing local state management can significantly enhance the performance of stream processing applications.
By co-locating state with event processing, applications can achieve higher throughput and lower latency, which is particularly beneficial in high-scale environments.

2
Utilizing partitioned data access can optimize resource usage and improve processing times.
When data is partitioned, each processing node can operate independently on a subset of data, reducing contention and improving overall system efficiency.

3
Consider the size of your dataset when choosing between local and remote data stores.
For smaller datasets, local stores can provide significant performance benefits, while larger datasets may necessitate remote databases for scalability.

Common Pitfalls

1

Neglecting to partition data can lead to performance bottlenecks in stream processing applications.

Without proper partitioning, multiple processing nodes may contend for the same data, leading to increased latency and reduced throughput.

2

Relying solely on remote databases for state management can hinder application performance.

As throughput demands increase, remote database access can become a significant bottleneck, making it essential to consider local state management for high-scale applications.

Related Concepts

Stream Processing Architectures

Data Caching Strategies

Event-driven Programming