Apache Spark @Scale: A 60 TB+ production use case

Avery Ching

Visit the post for more.

Overview

The article discusses Facebook's experience in scaling Apache Spark to handle a 60 TB+ production use case, focusing on the migration from a Hive-based pipeline to a more efficient Spark implementation. It highlights the performance improvements achieved through various reliability and optimization strategies, demonstrating Spark's capability to manage large-scale data processing effectively.

What You'll Learn

1

How to migrate a Hive-based data processing pipeline to Apache Spark

2

Why optimizing Spark jobs for large data sets is crucial for performance

3

How to implement reliability fixes in long-running Spark jobs

Prerequisites & Requirements

Understanding of data processing frameworks like Hive and Spark
Experience with large-scale data processing(optional)

Key Questions Answered

What were the main challenges faced when scaling Spark for large data processing?

The main challenges included managing frequent node reboots, handling excessive output file generation, and ensuring fault tolerance during long-running jobs. These issues required significant reliability fixes and optimizations to the Spark infrastructure to achieve stable performance.

How does the performance of Spark compare to Hive for large data sets?

The Spark-based pipeline demonstrated significant performance improvements over the Hive pipeline, achieving 4.5-6x CPU efficiency, 3-4x resource reservation efficiency, and approximately 5x reduction in latency. This showcases Spark's ability to manage large-scale data processing more effectively than Hive.

What specific performance optimizations were implemented in Spark?

Key performance optimizations included fixing memory leaks in the sorter, reducing shuffle write latency, and caching index files to speed up shuffle fetches. These changes collectively improved job execution speed and resource utilization significantly.

What tools were used to identify performance bottlenecks in Spark jobs?

Tools such as Spark UI Metrics, Jstack, and Spark Linux Perf/Flame Graph support were utilized to identify performance bottlenecks. These tools provided insights into task execution times and CPU profiling across multiple machines.

Key Statistics & Figures

Input data size

60 TB

This is the size of the data processed in the Spark implementation.

Shuffle and sort data size

90 TB

This represents the amount of intermediate data handled during the Spark job.

Number of tasks

250,000

This is the total number of tasks spawned for the Spark job.

Performance improvement

4.5-6x CPU efficiency

This indicates the increase in CPU efficiency compared to the previous Hive-based pipeline.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Data Processing

Apache Spark

Used for large-scale data processing and analytics.

Data Processing

Hive

Previous data processing framework used before migrating to Spark.

Key Actionable Insights

1
Implement a single Spark job to replace multiple Hive jobs for better manageability and performance.
Migrating from a sharded Hive pipeline to a unified Spark job can significantly reduce complexity and improve execution speed, as demonstrated by the 60 TB data processing case at Facebook.

2
Utilize Spark's metrics and profiling tools to identify and resolve performance bottlenecks.
Regularly analyzing performance metrics can help maintain optimal job performance and resource utilization, especially in large-scale data processing scenarios.

3
Incorporate reliability fixes to handle common failure scenarios in long-running Spark jobs.
Ensuring fault tolerance through reliability improvements can prevent job failures and enhance overall system stability during extensive data processing tasks.

Common Pitfalls

1

Overcomplicating data processing pipelines by using too many small jobs.

This can lead to difficulties in monitoring and managing jobs, as seen in the original Hive implementation with hundreds of sharded jobs.

2

Neglecting to implement fault tolerance in long-running jobs.

Without proper reliability fixes, jobs can fail due to common issues like node reboots, leading to wasted resources and time.

Related Concepts

Data Processing Frameworks

Big Data Analytics

Performance Optimization Techniques