Real-time Analytics at Massive Scale with Pinot

LinkedIn Engineering Team


Overview

The article describes the development and deployment of Pinot, a distributed real-time analytics engine built at LinkedIn to serve analytics over very large datasets. It covers the limitations of the systems LinkedIn used previously and how Pinot addresses them across a range of analytics products.

What You'll Learn

1. How to implement a distributed real-time analytics infrastructure using Pinot
2. Why specialized distributed systems are necessary for OLAP needs at scale
3. How to achieve low latency and high queries-per-second (QPS) throughput with large datasets

Prerequisites & Requirements

Understanding of distributed systems and OLAP concepts
Familiarity with Apache Kafka and Hadoop for data ingestion (optional)

Key Questions Answered

What challenges did LinkedIn face before implementing Pinot?

Before Pinot, LinkedIn's analytics products relied on generic storage systems like Oracle and Voldemort, which were not optimized for OLAP needs. The growing data volume and complexity of queries required a specialized solution, leading to the development of Pinot.

How does Pinot support real-time analytics at scale?

Pinot is designed to handle massive data volumes with low latency and high query performance. It supports real-time data ingestion from Kafka and uses a distributed architecture to parallelize query processing, enabling efficient analytics across various products.
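
As a rough illustration of parallelized query processing, the sketch below fans a group-by aggregation out over several data segments and merges the partial results. It is a minimal single-process sketch, not Pinot's implementation: the `SEGMENTS` data, the `query_segment` and `scatter_gather` names, and the use of threads in place of separate server nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "segment" holds (viewer_industry, views) rows.
SEGMENTS = [
    [("software", 3), ("finance", 1)],
    [("software", 2), ("healthcare", 4)],
    [("finance", 5)],
]


def query_segment(segment):
    """Server side: partial SUM(views) GROUP BY industry for one segment."""
    partial = Counter()
    for industry, views in segment:
        partial[industry] += views
    return partial


def scatter_gather(segments, max_workers=3):
    """Broker side: send the query to every segment, then merge the partials."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(query_segment, segments):
            total.update(partial)  # Counter.update adds counts together
    return dict(total)


result = scatter_gather(SEGMENTS)
print(result)
```

Because each segment produces an independent partial aggregate, adding segments (or nodes holding them) spreads the work without changing the merge step.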

What products at LinkedIn are powered by Pinot?

Pinot powers 18 member-facing analytics products and over 15 internal analytics products at LinkedIn, including 'Who’s Viewed Your Profile' and 'Company Follow Analytics', providing real-time insights and complex query capabilities.

What operational challenges does Pinot address?

Pinot simplifies operational aspects such as cluster rebalancing, adding or removing nodes, and re-bootstrapping, which are critical for managing large volumes of data effectively in a distributed environment.
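
To make the rebalancing idea concrete, here is a small sketch of redistributing data segments when a node joins the cluster. It assumes a simple even-share policy that leaves segments in place where capacity allows; the `rebalance` function, the segment names, and the server names are hypothetical, and this is not the algorithm Pinot or Apache Helix actually uses.

```python
def rebalance(assignment, servers):
    """Return a new {segment: server} map spread evenly over `servers`,
    keeping segments where they already are when capacity allows."""
    servers = sorted(servers)
    segments = sorted(assignment)
    # Even share per server; any remainder goes to the first servers.
    base, extra = divmod(len(segments), len(servers))
    capacity = {s: base + (1 if i < extra else 0) for i, s in enumerate(servers)}

    new_assignment = {}
    homeless = []
    # Pass 1: keep a segment in place if its server still exists and has room.
    for seg in segments:
        current = assignment[seg]
        if capacity.get(current, 0) > 0:
            new_assignment[seg] = current
            capacity[current] -= 1
        else:
            homeless.append(seg)
    # Pass 2: place the remaining segments on servers with spare capacity.
    for seg in homeless:
        target = next(s for s in servers if capacity[s] > 0)
        new_assignment[seg] = target
        capacity[target] -= 1
    return new_assignment


# Two servers hold three segments each; a third server, "C", joins.
before = {"seg0": "A", "seg1": "A", "seg2": "A",
          "seg3": "B", "seg4": "B", "seg5": "B"}
after = rebalance(before, ["A", "B", "C"])
moved = sorted(seg for seg in before if before[seg] != after[seg])
print(after, moved)
```

In this example only two of the six segments move to the new server, which is the kind of limited data movement that keeps adding or removing nodes cheap.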

Key Statistics & Figures

SLA for 'Who’s Viewed Your Profile'

Tens of milliseconds

Pinot is able to serve thousands of requests while maintaining this SLA.

Number of analytics products powered by Pinot

18 member-facing and over 15 internal products

These products utilize Pinot to deliver real-time insights and analytics.

Technologies & Tools


Data Ingestion

Apache Kafka

Source of the event streams that Pinot ingests and indexes in real time for its analytics products.

Data Processing

Hadoop

Supports the bootstrapping and reconciliation needs for Pinot.

Cluster Management

Apache Helix

Used to manage Pinot's distributed cluster.

Key Actionable Insights

1. Implementing a distributed analytics system like Pinot can significantly enhance the ability to process and analyze large datasets in real-time.
This is particularly useful for companies experiencing rapid data growth and needing immediate insights for decision-making.

2. Utilizing real-time data ingestion from Kafka can streamline the analytics process and reduce latency.
This approach allows for immediate access to data, which is crucial for applications requiring up-to-the-minute information.

3. Designing for operational simplicity in distributed systems can reduce maintenance overhead and improve system reliability.
By focusing on ease of operation, teams can spend more time on development and less on managing infrastructure.
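
The second insight, immediate access to streamed data, can be sketched as an in-memory segment that answers queries as soon as each event is appended. This is an illustrative sketch only: the `ConsumingSegment` class and the event schema are hypothetical, a plain list stands in for messages read from a Kafka topic, and Pinot's own ingestion path is considerably more involved.

```python
class ConsumingSegment:
    """In-memory segment that is queryable as soon as an event is appended."""

    def __init__(self):
        self.rows = []

    def ingest(self, event):
        # A real system would also update indexes here; this sketch only appends.
        self.rows.append(event)

    def count_views(self, profile_id):
        """Count profile-view events recorded so far for one profile."""
        return sum(1 for row in self.rows if row["profile_id"] == profile_id)


# Stand-in for messages consumed from a Kafka topic (hypothetical schema).
stream = [
    {"profile_id": 42, "viewer_id": 7},
    {"profile_id": 42, "viewer_id": 9},
    {"profile_id": 13, "viewer_id": 7},
]

segment = ConsumingSegment()
counts_seen = []
for event in stream:
    segment.ingest(event)
    # The query runs right after ingestion, with no batch load in between.
    counts_seen.append(segment.count_views(42))

print(counts_seen)
```

In a real deployment a Kafka consumer loop would replace the list, but the property shown is the same: each event becomes visible to queries without waiting for a batch job.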

Common Pitfalls

1. Relying on generic storage systems for OLAP needs can lead to performance bottlenecks and scalability issues.
These systems are not designed for the complex queries and high data volumes typical in analytics, which can result in slow response times and limited functionality.

Related Concepts

Distributed Systems

Real-time Data Processing

OLAP Systems

Data Analytics