Prototyping Venice: Derived Data Platform

Félix GV

7 min read • advanced

Apache Kafka

Overview

This article describes the development of Venice, a serving platform for derived data at LinkedIn. It covers the challenges of the existing systems and explains how Venice aims to unify batch and stream processing for better efficiency.

What You'll Learn

1. How to design a derived data serving platform using modern distributed systems principles
2. Why using Apache Kafka can streamline data ingestion processes
3. When to transition from batch processing to stream processing for real-time data applications

Prerequisites & Requirements

Understanding of distributed systems and data processing concepts
Familiarity with Apache Kafka and Hadoop (optional)

Key Questions Answered

What problems does Venice aim to solve in data processing?

Venice addresses data staleness and the inefficiency of transferring data from Hadoop to Voldemort. The existing process re-pushes entire datasets daily or more often, which is costly and still leaves the served data out of date between pushes. Venice aims to accept real-time updates and to reduce the need for complete data rebuilds.

How does Venice differ from Voldemort in handling data?

Venice handles both large bulk loads and streaming updates in a single system. Voldemort, by contrast, operates as two separate systems, one for read-only datasets and one for read-write datasets. Unifying the two simplifies data querying and improves performance by reducing latency and reliance on multiple systems.
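
As a rough illustration of what serving both write paths from one system could look like, the sketch below models a single store that accepts a bulk load and streaming updates. The class `UnifiedStore` and its methods `bulk_load` and `apply_update` are hypothetical names invented for this example, not Venice's actual API, and the version swap is an assumption about how a bulk load might be made atomic for readers.

```python
from typing import Dict, Optional


class UnifiedStore:
    """Toy key-value store accepting bulk loads and streaming updates.

    Hypothetical illustration only; not Venice's real interface.
    """

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.version = 0

    def bulk_load(self, dataset: Dict[str, str]) -> None:
        # Build the new version off to the side, then swap it in at once
        # so readers never observe a half-loaded dataset.
        new_data = dict(dataset)
        self._data = new_data
        self.version += 1

    def apply_update(self, key: str, value: str) -> None:
        # A streaming update touches a single key in the current version.
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


store = UnifiedStore()
store.bulk_load({"member:1": "rec-a", "member:2": "rec-b"})  # batch path
store.apply_update("member:2", "rec-c")                      # stream path
print(store.get("member:1"), store.get("member:2"), store.version)
```

Readers query one `get` method regardless of which path wrote the value, which is the simplification the comparison with Voldemort's two systems points at.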

What is derived data and why is it important?

Derived data is information generated from other data signals, such as aggregates or machine learning outputs. It is crucial for applications that require real-time insights and recommendations, as it allows for more dynamic and relevant data usage compared to static source data.
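
To make the definition concrete, here is a minimal sketch of deriving an aggregate from source events. The event shape (pairs of viewer and viewed member) and the function name `top_viewed` are assumptions made for this example and do not come from the article.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_viewed(events: Iterable[Tuple[str, str]], n: int) -> List[Tuple[str, int]]:
    """Derive a 'most viewed profiles' aggregate from raw (viewer, viewed) events."""
    counts = Counter(viewed for _viewer, viewed in events)
    return counts.most_common(n)


# Source data: individual profile-view events (hypothetical sample).
source_events = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "bob"),
    ("dave", "alice"),
]

# Derived data: an aggregate computed from the source events.
derived = top_viewed(source_events, 2)
print(derived)
```

The source events record what happened; the derived list is what an application would actually serve, and it has to be recomputed or updated whenever new events arrive.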

Key Statistics & Figures

Data pushed from Hadoop to Voldemort: more than 25 terabytes per day per datacenter

This statistic highlights the scale of data management challenges that Venice aims to address.
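
For a sense of scale, the daily volume can be converted into an average sustained transfer rate. This is back-of-the-envelope arithmetic assuming decimal units (1 TB = 10^6 MB) and a perfectly even spread over 24 hours; real pushes are likely burstier, and since the figure is "more than" 25 TB the result is a lower bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_throughput_mb_per_s(terabytes_per_day: float) -> float:
    """Average sustained rate in MB/s, using decimal units (1 TB = 10**6 MB)."""
    megabytes_per_day = terabytes_per_day * 1_000_000
    return megabytes_per_day / SECONDS_PER_DAY


rate = average_throughput_mb_per_s(25)
print(f"{rate:.1f} MB/s")  # roughly 289 MB/s per datacenter
```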

Technologies & Tools


Apache Kafka (Backend)

Used as the entry point for all writes in the Venice architecture, facilitating data ingestion from both batch and stream sources.

Hadoop (Backend)

Serves as the source of truth for data that is processed and then served through Venice.

Voldemort (Backend)

The existing serving system, which receives read-only datasets pushed from Hadoop and runs separately for read-write datasets. Venice aims to handle both cases in a single system.

Key Actionable Insights

1. Implementing a unified data serving platform like Venice can significantly enhance data processing efficiency.
   By consolidating batch and stream processing, organizations can reduce latency and improve data freshness, which is essential for applications that rely on real-time data.

2. Utilizing Apache Kafka as the primary data ingestion point can streamline the handling of both batch and stream data.
   Kafka's log-based structure allows for asynchronous data writes, which can improve overall system performance and reduce the complexity of managing multiple data sources.
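
The log-based, asynchronous write path described in the second insight can be sketched with an in-memory stand-in for a Kafka topic. Nothing here uses a real Kafka client: `Log` and `StorageNode` are hypothetical classes for illustration, and the single-partition, single-consumer setup is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Log:
    """Append-only log standing in for a Kafka topic partition (in-memory toy)."""
    records: List[Tuple[str, str]] = field(default_factory=list)

    def append(self, key: str, value: str) -> int:
        self.records.append((key, value))
        return len(self.records) - 1  # offset of the record just written


@dataclass
class StorageNode:
    """Consumer that applies log records to a local key-value map."""
    data: Dict[str, str] = field(default_factory=dict)
    next_offset: int = 0

    def poll(self, log: Log) -> int:
        # Apply everything written since the last poll, in log order.
        pending = log.records[self.next_offset:]
        for key, value in pending:
            self.data[key] = value
        self.next_offset += len(pending)
        return len(pending)


log = Log()
node = StorageNode()

# Batch and stream producers write to the same log and return immediately.
for key, value in {"member:1": "v1", "member:2": "v1"}.items():
    log.append(key, value)        # batch source
log.append("member:2", "v2")      # stream source

assert node.data == {}            # nothing applied yet: writes are asynchronous
applied = node.poll(log)
print(applied, node.data)
```

Because producers only append, the batch and stream sources need no knowledge of each other or of the storage node; the consumer catches up at its own pace by tracking an offset.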

Common Pitfalls

1. Relying solely on batch processing can lead to data staleness and inefficiencies.

   This stems from the need for frequent full rebuilds and pushes, which are resource-intensive and slow. Moving to a system that supports real-time updates can mitigate these issues.
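
The staleness cost of a batch-only pipeline can be estimated with a simple model. The sketch below assumes periodic full pushes where a snapshot is taken at a fixed interval and becomes visible once the job finishes; the function name and the 3-hour job duration are hypothetical values chosen for illustration, not figures from the article.

```python
def worst_case_staleness_hours(push_interval_h: float, job_duration_h: float) -> float:
    """Upper bound on the age of served data under periodic full pushes.

    A record changed right after a snapshot is taken waits one full interval
    for the next snapshot, then the job duration before it becomes visible.
    """
    return push_interval_h + job_duration_h


# Daily push with an assumed 3-hour build-and-load job.
daily = worst_case_staleness_hours(push_interval_h=24, job_duration_h=3)
print(daily)
```

Under this model, shortening the interval helps only until pushes start to overlap with the job duration, which is one reason an incremental update path is attractive.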

Related Concepts

Distributed Systems

Data Management

Stream Processing

Batch Processing