Open Sourcing Venice – LinkedIn’s Derived Data Platform

Félix GV


Overview

The article announces the open sourcing of Venice, LinkedIn's derived data platform, which powers over 1800 datasets used by more than 300 applications. It covers Venice's architecture, its write and read paths, and its operational capabilities, with an emphasis on scalability and efficiency at large data volumes.

What You'll Learn

1. How to write data to Venice using Push Jobs and stream processors
2. Why Venice's architecture supports high throughput and low latency
3. How to implement hybrid write workloads in Venice

Prerequisites & Requirements

Understanding of data storage systems and their architectures
Familiarity with Apache Hadoop and stream processing frameworks (optional)

Key Questions Answered

What is Venice and what are its main features?

Venice is LinkedIn's derived data platform, built for high-throughput ingestion and low-latency reads. It powers over 1800 datasets used by more than 300 applications, and it offers asynchronous writes along with several read APIs.

How does Venice handle data ingestion at scale?

Venice ingests an average of 14 GB per second, equating to 39 million rows per second, with peak throughput reaching 50 GB per second. This capability is supported by its architecture that allows for asynchronous writes and hybrid workloads.
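
As a rough illustration of what asynchronous writes imply for readers, the sketch below models a store that accepts a write immediately but only makes it readable once a background step applies it. `AsyncStore`, `put`, `get`, and `drain` are hypothetical names for this toy model; they are not part of Venice's API.

```python
from collections import deque


class AsyncStore:
    """Toy store where writes are queued first and applied later (hypothetical)."""

    def __init__(self):
        self._pending = deque()  # writes accepted but not yet visible to readers
        self._data = {}          # state that reads are served from

    def put(self, key, value):
        """Accept a write and return immediately; it is not readable yet."""
        self._pending.append((key, value))

    def get(self, key):
        """Read only from applied state, so recent writes may be missing."""
        return self._data.get(key)

    def drain(self):
        """Stand-in for background ingestion that applies queued writes in order."""
        while self._pending:
            key, value = self._pending.popleft()
            self._data[key] = value


store = AsyncStore()
store.put("member:42", "v2")
stale = store.get("member:42")  # None: the write has not been applied yet
store.drain()
fresh = store.get("member:42")  # "v2" once ingestion has caught up
```

The gap between `put` returning and `get` seeing the value is the behavior an application has to design around when writes are asynchronous.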

What are the different writing methods supported by Venice?

Venice supports multiple writing methods including Full Push jobs for complete dataset swaps, Incremental Push jobs for adding data, and Streaming writes for real-time data ingestion. These methods enable flexibility in how data is managed and updated.
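
The three write methods can be contrasted with a small in-memory model. This is only a sketch under simplifying assumptions: `ToyStore` and its method names are hypothetical, a "version" is just a dictionary, and none of this reflects Venice's real interfaces or storage layout.

```python
class ToyStore:
    """In-memory stand-in for a single dataset (hypothetical, not the Venice API)."""

    def __init__(self):
        self.versions = {0: {}}  # version number -> {key: value}; 0 is an empty start
        self.current = 0         # version that reads are served from

    def full_push(self, rows):
        """Load a complete dataset into a new version, then swap reads over to it."""
        new_version = max(self.versions) + 1
        self.versions[new_version] = dict(rows)
        self.current = new_version  # keys absent from `rows` are no longer readable
        return new_version

    def incremental_push(self, rows):
        """Add a batch of rows to the current version without swapping versions."""
        self.versions[self.current].update(rows)

    def stream_write(self, key, value):
        """Apply one real-time write to the current version."""
        self.versions[self.current][key] = value

    def get(self, key):
        return self.versions[self.current].get(key)


store = ToyStore()
store.full_push({"a": 1, "b": 2})   # version 1 holds a and b
store.incremental_push({"c": 3})    # version 1 now also holds c
store.stream_write("a", 10)         # a is overwritten in place
store.full_push({"a": 100})         # version 2 replaces the whole dataset
```

After the second full push, `b` and `c` are gone from the serving version, which is the practical difference between a complete dataset swap and the two additive write paths.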

What are the operational capabilities of Venice?

Venice is designed for massive scale with features like multi-region support, self-healing capabilities, and elastic scalability. It can handle administrative operations asynchronously, ensuring data remains available even during regional outages.

Key Statistics & Figures

Average data ingestion rate

14 GB per second

This rate translates to approximately 39 million rows ingested per second.

Peak data ingestion rate

50 GB per second

At peak throughput, Venice can handle up to 113 million rows per second.

Total daily ingestion

1.2 PB

Venice processes over three trillion rows daily.

Number of datasets supported

1800

Venice powers over 1800 datasets across various applications.
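
The ingestion figures above can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and treats the quoted averages as constant over a full day; the per-row sizes are derived values, not numbers stated in the article.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

avg_gb_per_sec = 14
avg_rows_per_sec = 39_000_000
peak_gb_per_sec = 50
peak_rows_per_sec = 113_000_000

# Daily totals implied by the average rates.
daily_pb = avg_gb_per_sec * SECONDS_PER_DAY / 1_000_000  # 1.2096 PB
daily_rows = avg_rows_per_sec * SECONDS_PER_DAY          # 3,369,600,000,000 rows

# Implied row sizes in bytes (derived, assuming 1 GB = 1e9 bytes).
avg_bytes_per_row = avg_gb_per_sec * 1e9 / avg_rows_per_sec     # about 359
peak_bytes_per_row = peak_gb_per_sec * 1e9 / peak_rows_per_sec  # about 442

print(f"{daily_pb:.2f} PB/day, {daily_rows / 1e12:.2f} trillion rows/day")
```

Both daily totals agree with the stated 1.2 PB and "over three trillion rows" figures.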

Technologies & Tools

Backend

Apache Hadoop

Used for writing data into Venice via Push Jobs.

Backend

Apache Samza

Integrated for stream processing to write data into Venice.

Database

RocksDB

Used as the storage engine for Venice, supporting high-performance data access.

Key Actionable Insights

1
Implementing Venice can significantly enhance your data ingestion capabilities, especially for AI applications that require real-time data access.
Given Venice's ability to handle high throughput and low latency, it is particularly beneficial for organizations looking to optimize their data pipelines for machine learning and analytics.

2
Utilizing hybrid write workloads in Venice allows for seamless integration of batch and stream processing, enhancing data freshness and availability.
This approach is crucial for applications that require up-to-date information while maintaining high performance, such as recommendation systems and A/B testing frameworks.

3
The Da Vinci client library can reduce read latency by serving data from local state.
This is particularly useful for applications with stringent latency requirements, since reads avoid the overhead of remote queries.
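
To make the local-state idea in the third insight concrete, the sketch below compares a client that goes to a backend on every read with one that loads the data once and then reads from memory. All class and method names here are hypothetical, and the assumption that the data is loaded up front in a single snapshot is a simplification; the sketch does not show how local state would be kept in sync with new writes.

```python
class RemoteBackend:
    """Stand-in for a remote storage service that counts round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.calls = 0  # number of simulated network round trips

    def fetch(self, key):
        self.calls += 1
        return self._data.get(key)

    def snapshot(self):
        self.calls += 1
        return dict(self._data)


class RemoteClient:
    """Every read is a round trip to the backend."""

    def __init__(self, backend):
        self._backend = backend

    def get(self, key):
        return self._backend.fetch(key)


class LocalStateClient:
    """Loads the data once, then serves reads from process memory."""

    def __init__(self, backend):
        self._local = backend.snapshot()

    def get(self, key):
        return self._local.get(key)


backend = RemoteBackend({"k1": "v1", "k2": "v2"})

remote = RemoteClient(backend)
for _ in range(3):
    remote.get("k1")
remote_calls = backend.calls  # 3: one round trip per read

backend.calls = 0
local = LocalStateClient(backend)
for _ in range(3):
    local.get("k1")
local_calls = backend.calls   # 1: only the initial load
```

The trade-off is memory and startup cost on the application side in exchange for reads that never leave the process.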

Common Pitfalls

1

Assuming that Venice supports strongly consistent online write requests can lead to design flaws.

Venice's architecture is based on asynchronous writes, which means developers need to adapt their expectations and designs to accommodate eventual consistency.

2

Neglecting to configure quotas for multi-tenancy can result in performance degradation.

Without proper quotas, a noisy neighbor can degrade the performance of other tenants sharing the same cluster, leading to suboptimal application behavior.
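
The second pitfall can be illustrated with a generic per-tenant rate limiter. The article does not describe how Venice enforces quotas, so the token bucket below is a swapped-in, commonly used technique rather than Venice's mechanism; `TokenBucket`, `allow`, and the tenant names are hypothetical. Time is passed in explicitly so the example stays deterministic.

```python
class TokenBucket:
    """Generic token bucket: refills at a fixed rate up to a fixed capacity."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, now, cost=1):
        """Return True and spend tokens if the request fits within the quota."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant, so one tenant's burst cannot spend another's budget.
quotas = {
    "tenant_a": TokenBucket(rate_per_sec=2, capacity=2),
    "tenant_b": TokenBucket(rate_per_sec=2, capacity=2),
}

burst_a = [quotas["tenant_a"].allow(now=0.0) for _ in range(5)]
first_b = quotas["tenant_b"].allow(now=0.0)
later_a = quotas["tenant_a"].allow(now=1.0)
```

Here `tenant_a` has its burst throttled after two requests while `tenant_b` is unaffected, and `tenant_a` is admitted again once its bucket has refilled.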

Related Concepts

Data Ingestion Strategies

Eventual Consistency in Distributed Systems

Stream Processing Frameworks