Merging Telemetry and Logs from Microservices at Scale with Apache Spark

One of the most common challenges with big data is the ability to merge data from several sources with minimal cost and latency. It’s an even bigger challenge…

Niranjan Nataraja
11 min readintermediate
--
View Original

Overview

The article discusses the challenges and solutions involved in merging telemetry and logs from microservices at scale using Apache Spark. It highlights NVIDIA's implementation for their GeForce NOW service, detailing the architecture, optimization strategies, and the significant reduction in data processing latency from three hours to 15 minutes.

What You'll Learn

1

How to merge telemetry and logs from microservices using Apache Spark

2

Why using watermarking and stream-stream joins can reduce latency in data processing

3

When to implement a two-stage data processing pipeline for better scalability

4

How to optimize Spark streaming jobs to reduce costs and improve performance

Prerequisites & Requirements

  • Understanding of microservices architecture and big data concepts
  • Familiarity with Apache Spark and Kafka

Key Questions Answered

How did NVIDIA reduce data processing latency from three hours to 15 minutes?
NVIDIA achieved this significant reduction by implementing a two-stage data processing pipeline using Apache Spark's stream-stream joins and watermarking features. This allowed them to merge telemetry and logs in near-real time, addressing scalability and reliability requirements without increasing costs.
What architecture does NVIDIA use for merging telemetry and logs?
NVIDIA's architecture includes Apache Kafka as the message queue, Apache Spark for compute, Amazon S3 for storage, Delta Lake on Databricks as the data warehouse, and Elastic for search and analytics. This setup enables efficient data ingestion and processing from multiple microservices.
What challenges are faced when merging data from microservices?
The main challenges include handling different data arrival times for telemetry and logs, managing both structured and unstructured data, and computing thousands of metrics to build a holistic session document. Additionally, traditional big data systems often struggle with upserts.
What optimizations were implemented to improve Spark streaming jobs?
Key optimizations included using the for each batch writer for multiple sinks, employing unions and group by instead of joins to reduce checkpointing costs, and adjusting shuffle partitions to enhance throughput and reduce costs. These strategies helped maintain performance while managing expenses.

Key Statistics & Figures

End-to-end latency reduction
From three hours to 15 minutes
This improvement was achieved through the implementation of a streaming pipeline using Apache Spark.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend
Apache Kafka
Used as the message queue for telemetry and logs.
Backend
Apache Spark
Utilized for compute in the data processing pipeline.
Storage
Amazon S3
Serves as the storage solution for logs.
Data Warehouse
Delta Lake On Databricks
Acts as the data warehouse for processed data.
Search
Elastic
Used for search, analytics, and alerting.

Key Actionable Insights

1
Implement a two-stage data processing pipeline to enhance scalability and reduce latency.
This approach allows for better isolation of source and sink issues, ensuring that problems in one part of the pipeline do not affect the overall system performance.
2
Utilize watermarking in Spark streaming to efficiently manage late data and improve join operations.
Watermarking helps maintain performance by allowing Spark to manage data in memory efficiently, flushing old data automatically and ensuring that the system remains responsive.
3
Optimize checkpointing strategies to reduce costs associated with data processing.
By minimizing the number of checkpoints and using a single checkpointing directory for multiple sinks, you can significantly lower the operational costs of running Spark streaming jobs.

Common Pitfalls

1
Failing to optimize checkpointing can lead to high operational costs.
Checkpointing frequently incurs costs, especially when dealing with large datasets. By consolidating checkpoints and using efficient writing strategies, you can significantly reduce these expenses.
2
Not utilizing watermarking can result in performance degradation.
Without watermarking, Spark may struggle with late data, leading to increased latency and potential data loss in join operations. Implementing watermarking helps maintain system performance.

Related Concepts

Microservices Architecture
Big Data Processing
Real-time Analytics
Data Streaming