Apache Spark @Scale: A 60 TB+ production use case

Visit the post for more.

Avery Ching
15 min readadvanced
--
View Original

Overview

The article discusses Facebook's experience in scaling Apache Spark to handle a 60 TB+ production use case, focusing on the migration from a Hive-based pipeline to a more efficient Spark implementation. It highlights the performance improvements achieved through various reliability and optimization strategies, demonstrating Spark's capability to manage large-scale data processing effectively.

What You'll Learn

1

How to migrate a Hive-based data processing pipeline to Apache Spark

2

Why optimizing Spark jobs for large data sets is crucial for performance

3

How to implement reliability fixes in long-running Spark jobs

Prerequisites & Requirements

  • Understanding of data processing frameworks like Hive and Spark
  • Experience with large-scale data processing(optional)

Key Questions Answered

What were the main challenges faced when scaling Spark for large data processing?
The main challenges included managing frequent node reboots, handling excessive output file generation, and ensuring fault tolerance during long-running jobs. These issues required significant reliability fixes and optimizations to the Spark infrastructure to achieve stable performance.
How does the performance of Spark compare to Hive for large data sets?
The Spark-based pipeline demonstrated significant performance improvements over the Hive pipeline, achieving 4.5-6x CPU efficiency, 3-4x resource reservation efficiency, and approximately 5x reduction in latency. This showcases Spark's ability to manage large-scale data processing more effectively than Hive.
What specific performance optimizations were implemented in Spark?
Key performance optimizations included fixing memory leaks in the sorter, reducing shuffle write latency, and caching index files to speed up shuffle fetches. These changes collectively improved job execution speed and resource utilization significantly.
What tools were used to identify performance bottlenecks in Spark jobs?
Tools such as Spark UI Metrics, Jstack, and Spark Linux Perf/Flame Graph support were utilized to identify performance bottlenecks. These tools provided insights into task execution times and CPU profiling across multiple machines.

Key Statistics & Figures

Input data size
60 TB
This is the size of the data processed in the Spark implementation.
Shuffle and sort data size
90 TB
This represents the amount of intermediate data handled during the Spark job.
Number of tasks
250,000
This is the total number of tasks spawned for the Spark job.
Performance improvement
4.5-6x CPU efficiency
This indicates the increase in CPU efficiency compared to the previous Hive-based pipeline.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Data Processing
Apache Spark
Used for large-scale data processing and analytics.
Data Processing
Hive
Previous data processing framework used before migrating to Spark.

Key Actionable Insights

1
Implement a single Spark job to replace multiple Hive jobs for better manageability and performance.
Migrating from a sharded Hive pipeline to a unified Spark job can significantly reduce complexity and improve execution speed, as demonstrated by the 60 TB data processing case at Facebook.
2
Utilize Spark's metrics and profiling tools to identify and resolve performance bottlenecks.
Regularly analyzing performance metrics can help maintain optimal job performance and resource utilization, especially in large-scale data processing scenarios.
3
Incorporate reliability fixes to handle common failure scenarios in long-running Spark jobs.
Ensuring fault tolerance through reliability improvements can prevent job failures and enhance overall system stability during extensive data processing tasks.

Common Pitfalls

1
Overcomplicating data processing pipelines by using too many small jobs.
This can lead to difficulties in monitoring and managing jobs, as seen in the original Hive implementation with hundreds of sharded jobs.
2
Neglecting to implement fault tolerance in long-running jobs.
Without proper reliability fixes, jobs can fail due to common issues like node reboots, leading to wasted resources and time.

Related Concepts

Data Processing Frameworks
Big Data Analytics
Performance Optimization Techniques