Real-time analytics on network flow data with Apache Pinot

LinkedIn Engineering Team
10 min read · Intermediate

Overview

This article describes how LinkedIn uses Apache Pinot for real-time analytics on network flow data and why observability matters for its infrastructure. It details the architecture of InFlow, a system built to collect and analyze network flow data, and reports the improvements in data freshness and query latency that followed.

What You'll Learn

  1. How to implement real-time analytics using Apache Pinot
  2. Why observability is crucial for large-scale infrastructure
  3. When to use sFlow and IPFIX protocols for flow data collection

Prerequisites & Requirements

  • Understanding of network flow concepts and data analytics
  • Familiarity with Apache Kafka and Apache Pinot (optional)

Key Questions Answered

How does LinkedIn collect and analyze network flow data?
LinkedIn collects network flow data using InFlow, which receives flows from over 100 network devices at a rate of 50k flows per second. The data is enriched with additional fields and stored in Apache Pinot for real-time analytics, allowing engineers to troubleshoot and monitor network health effectively.
What improvements were made in query latency after onboarding to Pinot?
After the move to Apache Pinot, query latencies improved by as much as 95%, and some complex queries dropped from 6 minutes to just 4 seconds. Faster queries give engineers quicker access to data for troubleshooting and analytics.
What architecture does InFlow use for processing flow data?
InFlow's architecture consists of three main components: a flow collector, a flow enricher, and an InFlow API that utilizes Apache Pinot for storage. This microservices approach allows for independent scaling and adherence to the single responsibility principle.
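
The three-stage split described above can be sketched in a few lines. This is a minimal illustration, not InFlow's actual code: the stages are plain functions rather than separate services, the `Flow` fields, the prefix-to-site map, and every function name are assumptions made for the example, and an in-memory aggregation stands in for the Pinot-backed API.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Flow:
    """One flow record. Field names are illustrative assumptions."""
    src_ip: str
    dst_ip: str
    byte_count: int
    device: str
    src_site: Optional[str] = None  # filled in by the enricher
    dst_site: Optional[str] = None  # filled in by the enricher


def collect(raw_records: Iterable[dict]) -> List[Flow]:
    """Collector stage: turn raw records (stand-ins for decoded sFlow/IPFIX) into Flow objects."""
    return [
        Flow(r["src_ip"], r["dst_ip"], int(r["byte_count"]), r["device"])
        for r in raw_records
    ]


def enrich(flows: Iterable[Flow], site_by_prefix: Dict[str, str]) -> List[Flow]:
    """Enricher stage: attach site labels from a hypothetical prefix-to-site map."""
    def site(ip: str) -> str:
        prefix = ip.rsplit(".", 1)[0]  # crude /24-style key, for illustration only
        return site_by_prefix.get(prefix, "unknown")

    return [replace(f, src_site=site(f.src_ip), dst_site=site(f.dst_ip)) for f in flows]


def bytes_by_site_pair(flows: Iterable[Flow]) -> Dict[Tuple[Optional[str], Optional[str]], int]:
    """API stage: the kind of aggregation a query layer would serve, computed in memory here."""
    totals: Dict[Tuple[Optional[str], Optional[str]], int] = {}
    for f in flows:
        key = (f.src_site, f.dst_site)
        totals[key] = totals.get(key, 0) + f.byte_count
    return totals


raw = [
    {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.9", "byte_count": 1200, "device": "edge-1"},
    {"src_ip": "10.0.1.7", "dst_ip": "10.0.2.3", "byte_count": 800, "device": "edge-1"},
]
sites = {"10.0.1": "site-a", "10.0.2": "site-b"}
totals = bytes_by_site_pair(enrich(collect(raw), sites))
```

Because each stage only consumes the previous stage's output, any one of them can be replaced or scaled without touching the others, which is the single-responsibility point the answer above makes.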

Key Statistics & Figures

Flow collection rate
50k flows per second
This rate is achieved from over 100 different network devices on LinkedIn's backbone and edge.
Data freshness improvement
from 15 minutes to 1 minute
This improvement was a result of onboarding flow data to a real-time table on Apache Pinot.
Query latency reduction
up to 95%
This reduction was observed after transitioning to Apache Pinot for data storage.
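
A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below assumes the 50k flows per second rate is sustained around the clock, which the article does not state; both helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def flows_per_day(flows_per_second: int) -> int:
    """Daily flow volume, assuming the per-second rate holds all day."""
    return flows_per_second * SECONDS_PER_DAY


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


daily_flows = flows_per_day(50_000)             # 4_320_000_000 flows per day
freshness_gain = percent_reduction(15.0, 1.0)   # about 93.3 (15 minutes -> 1 minute)
```

At that rate the store absorbs a little over four billion flow records per day, and the freshness change from 15 minutes to 1 minute is roughly a 93% reduction in staleness.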

Technologies & Tools

Database
Apache Pinot
Used for real-time storage and querying of network flow data.
Stream Processing
Apache Kafka
Utilized for handling the flow data ingestion pipeline.
Protocol
sFlow
One of the protocols used for collecting flow data from network devices.
Protocol
IPFIX
Another protocol supported for flow data collection.
Stream Processing
Apache Samza
Used for stream processing of incoming flow events.

Key Actionable Insights

  1. Implementing a microservices architecture can significantly enhance scalability and maintainability in large systems.
     By breaking down components into microservices, each can be scaled independently based on its requirements, preventing the system from becoming a monolith.
  2. Utilizing real-time analytics tools like Apache Pinot can drastically reduce query latencies.
     This is essential for operational environments where timely data access is critical for troubleshooting and decision-making.
  3. Integrating flow data collection protocols like sFlow and IPFIX can improve network observability.
     These protocols enable detailed insights into network traffic, which is vital for capacity planning and operational health monitoring.

Common Pitfalls

  1. Failing to optimize query performance can lead to significant delays in data retrieval.
     This can occur if indexes are not properly utilized or if inefficient queries are run, which can slow down troubleshooting efforts.
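
One common way to keep analytical queries cheap is to bound the time range and cap the result size. The sketch below only builds a SQL string in that shape; the table name, column names (`src_site`, `dst_site`, `byte_count`, `event_time_ms`), and the function itself are hypothetical, not taken from the article, and the statement may need adjusting to the target store's SQL dialect.

```python
def top_talkers_sql(table: str, start_ms: int, end_ms: int, limit: int = 10) -> str:
    """Build a time-bounded, size-capped aggregation query (illustrative schema)."""
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")
    if start_ms >= end_ms:
        raise ValueError("start_ms must be earlier than end_ms")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        f"SELECT src_site, dst_site, SUM(byte_count) "
        f"FROM {table} "
        # The time filter keeps the scan to the window of interest.
        f"WHERE event_time_ms >= {int(start_ms)} AND event_time_ms < {int(end_ms)} "
        f"GROUP BY src_site, dst_site "
        f"ORDER BY SUM(byte_count) DESC "
        # The limit caps how many groups are returned.
        f"LIMIT {int(limit)}"
    )


query = top_talkers_sql("flows", start_ms=0, end_ms=60_000, limit=5)
```

Rejecting an unbounded or inverted time window up front is a cheap guard against the kind of full-table query the pitfall above warns about.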

Related Concepts

Network Flow Data Analysis
Real-time Analytics
Microservices Architecture
Data Collection Protocols