Real-time analytics on network flow data with Apache Pinot

LinkedIn Engineering Team

10 min read • intermediate

Apache, Apache Kafka, SQL

Overview

This article describes how LinkedIn uses Apache Pinot for real-time analytics on network flow data and why observability matters for its infrastructure. It covers the architecture of InFlow, the system built to collect and analyze network flow data, and the gains in data freshness and query latency that followed.

What You'll Learn

1. How to implement real-time analytics using Apache Pinot

2. Why observability is crucial for large-scale infrastructure

3. When to use sFlow and IPFIX protocols for flow data collection

Prerequisites & Requirements

Understanding of network flow concepts and data analytics
Familiarity with Apache Kafka and Apache Pinot (optional)

Key Questions Answered

How does LinkedIn collect and analyze network flow data?

LinkedIn collects network flow data using InFlow, which receives flows from over 100 network devices at a rate of 50k flows per second. The data is enriched with additional fields and stored in Apache Pinot for real-time analytics, allowing engineers to troubleshoot and monitor network health effectively.

What improvements were made in query latency after onboarding to Pinot?

After the move to Apache Pinot, query latencies improved by as much as 95%, with some complex queries dropping from 6 minutes to just 4 seconds. Engineers can therefore reach the data they need for troubleshooting and analytics much sooner.

What architecture does InFlow use for processing flow data?

InFlow's architecture consists of three main components: a flow collector, a flow enricher, and an InFlow API that utilizes Apache Pinot for storage. This microservices approach allows for independent scaling and adherence to the single responsibility principle.
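
The three-stage split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FlowRecord` fields, the `DEVICE_SITES` lookup, and the `FlowStore` class are hypothetical names invented here, plain generators stand in for the messaging layer between stages, and an in-memory list stands in for Apache Pinot.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class FlowRecord:
    # Illustrative fields only; not InFlow's actual schema.
    src_ip: str
    dst_ip: str
    byte_count: int
    device: str
    site: Optional[str] = None  # populated by the enricher


# Hypothetical device-to-site lookup used for enrichment.
DEVICE_SITES: Dict[str, str] = {
    "edge-router-1": "site-a",
    "backbone-router-7": "site-b",
}


def collect(raw_flows: Iterable[dict]) -> Iterator[FlowRecord]:
    """Collector stage: parse raw flow dicts into typed records."""
    for raw in raw_flows:
        yield FlowRecord(
            src_ip=raw["src_ip"],
            dst_ip=raw["dst_ip"],
            byte_count=int(raw["bytes"]),
            device=raw["device"],
        )


def enrich(flows: Iterable[FlowRecord]) -> Iterator[FlowRecord]:
    """Enricher stage: attach a site label derived from the exporting device."""
    for flow in flows:
        yield replace(flow, site=DEVICE_SITES.get(flow.device, "unknown"))


class FlowStore:
    """In-memory stand-in for the storage layer behind the API."""

    def __init__(self) -> None:
        self._rows: List[FlowRecord] = []

    def ingest(self, flows: Iterable[FlowRecord]) -> None:
        self._rows.extend(flows)

    def bytes_by_site(self) -> Dict[str, int]:
        """Sum byte counts per site, the kind of aggregate an API might serve."""
        totals: Dict[str, int] = {}
        for row in self._rows:
            totals[row.site] = totals.get(row.site, 0) + row.byte_count
        return totals


raw = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.1", "bytes": "1500", "device": "edge-router-1"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.1.2", "bytes": "900", "device": "backbone-router-7"},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.1.3", "bytes": "400", "device": "edge-router-1"},
]
store = FlowStore()
store.ingest(enrich(collect(raw)))
print(store.bytes_by_site())
```

Because each stage only consumes and produces records, any one of them could be replaced or scaled without touching the others, which is the property the summary attributes to the microservices design.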

Key Statistics & Figures

Flow collection rate

50k flows per second

This rate is achieved from over 100 different network devices on LinkedIn's backbone and edge.
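
To put the collection rate in perspective, the short calculation below converts it into per-minute and per-day volumes. It assumes, purely for illustration, that 50k flows per second is sustained at a constant rate; the source does not say whether the figure is a peak or an average.

```python
FLOWS_PER_SECOND = 50_000          # rate quoted in the summary
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400


def flows_in(seconds: int) -> int:
    """Flows received in a window, assuming a constant rate (an assumption)."""
    return FLOWS_PER_SECOND * seconds


flows_per_minute = flows_in(60)
flows_per_day = flows_in(SECONDS_PER_DAY)

print(f"{flows_per_minute:,} flows per minute")  # 3,000,000
print(f"{flows_per_day:,} flows per day")        # 4,320,000,000
```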

Data freshness improvement

from 15 minutes to 1 minute

This improvement was a result of onboarding flow data to a real-time table on Apache Pinot.

Query latency reduction

up to 95%

This reduction was observed after transitioning to Apache Pinot for data storage.

Technologies & Tools

Database

Apache Pinot

Used for real-time storage and querying of network flow data.

Stream Processing

Apache Kafka

Utilized for handling the flow data ingestion pipeline.

Protocol

sFlow

One of the protocols used for collecting flow data from network devices.

Protocol

IPFIX

Another protocol supported for flow data collection.

Stream Processing

Apache Samza

Used for stream processing of incoming flow events.

Key Actionable Insights

1. Implementing a microservices architecture can significantly enhance scalability and maintainability in large systems. By breaking down components into microservices, each can be scaled independently based on its requirements, preventing the system from becoming a monolith.

2. Utilizing real-time analytics tools like Apache Pinot can drastically reduce query latencies. This is essential for operational environments where timely data access is critical for troubleshooting and decision-making.

3. Integrating flow data collection protocols like sFlow and IPFIX can improve network observability. These protocols enable detailed insights into network traffic, which is vital for capacity planning and operational health monitoring.
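
As an illustration of the second insight, the sketch below builds the kind of aggregation query an engineer might run against flow data in Pinot: the heaviest site-to-site pairs in a recent time window. The table name and the columns `src_site`, `dst_site`, `byte_count`, and `flow_time_ms` are hypothetical, not InFlow's schema, and sending the query to a Pinot broker is deliberately left out.

```python
def top_talkers_query(table: str, since_epoch_ms: int, limit: int = 10) -> str:
    """Build a SQL string ranking site pairs by total bytes since a timestamp.

    All column names are assumptions made for this example.
    """
    if not table.isidentifier():
        # Guard against injecting arbitrary text through the table name.
        raise ValueError(f"invalid table name: {table!r}")
    return (
        "SELECT src_site, dst_site, SUM(byte_count) AS total_bytes "
        f"FROM {table} "
        f"WHERE flow_time_ms >= {int(since_epoch_ms)} "
        "GROUP BY src_site, dst_site "
        "ORDER BY SUM(byte_count) DESC "
        f"LIMIT {int(limit)}"
    )


query = top_talkers_query("flows", since_epoch_ms=1_700_000_000_000, limit=5)
print(query)
```

Restricting the query to a time window and a small `LIMIT` keeps the result set bounded, which is usually what a troubleshooting dashboard needs.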

Common Pitfalls

1. Failing to optimize query performance can lead to significant delays in data retrieval. Delays typically arise when indexes are not used properly or when inefficient queries are run, and they slow down troubleshooting efforts.
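
The effect of an index on a filtered query can be shown with a toy example. The sketch below compares a full scan with a lookup through a simple inverted index (a map from column value to matching row positions). It is a conceptual illustration only: the rows and column names are made up, and Pinot's real index implementations are more sophisticated than a Python dictionary.

```python
from collections import defaultdict
from typing import Dict, List

# Made-up rows standing in for stored flow records.
rows = [
    {"device": "edge-router-1", "byte_count": 1500},
    {"device": "backbone-router-7", "byte_count": 900},
    {"device": "edge-router-1", "byte_count": 400},
]


def scan_bytes(rows: List[dict], device: str) -> int:
    """Full scan: every row is inspected, whether or not it matches."""
    return sum(r["byte_count"] for r in rows if r["device"] == device)


def build_inverted_index(rows: List[dict], column: str) -> Dict[str, List[int]]:
    """Map each distinct value of `column` to the positions of matching rows."""
    index: Dict[str, List[int]] = defaultdict(list)
    for row_id, row in enumerate(rows):
        index[row[column]].append(row_id)
    return index


def indexed_bytes(rows: List[dict], index: Dict[str, List[int]], device: str) -> int:
    """Indexed lookup: only the rows listed for `device` are read."""
    return sum(rows[i]["byte_count"] for i in index.get(device, []))


device_index = build_inverted_index(rows, "device")
assert scan_bytes(rows, "edge-router-1") == indexed_bytes(rows, device_index, "edge-router-1")
```

Both functions return the same answer, but the indexed version reads only the matching rows, so the gap between them grows with table size.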

Related Concepts

Network Flow Data Analysis

Real-time Analytics

Microservices Architecture

Data Collection Protocols