Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack

Slack Data Engineering recently migrated its data workloads from AWS EMR 5 (Spark 2/Hive 2 processing engine) to EMR 6 (Spark 3 processing engine). In this blog, we will share our migration journey, the challenges we faced, and the performance gains we observed in the process. This blog aims to assist Data Engineers, Data Infrastructure Engineers, and Product…

Nilanjana Mukherjee

Overview

This article details Slack's migration from AWS EMR 5 with Spark 2 to EMR 6 with Spark 3, highlighting the challenges faced and the performance improvements achieved. It serves as a guide for Data Engineers and Product Managers considering similar upgrades, emphasizing the benefits of Adaptive Query Execution and enhanced security features.

What You'll Learn

1. How to migrate from EMR 5 to EMR 6 while minimizing disruption
2. Why Adaptive Query Execution is crucial for optimizing data processing
3. How to enhance Airflow operators for better job management across Spark versions

Prerequisites & Requirements

  • Understanding of AWS EMR and Spark
  • Familiarity with Apache Airflow and Hive Metastore (optional)

Key Questions Answered

What were the main challenges faced during the migration to EMR 6?
The main challenges included supporting the same Hive catalog across Spark 2 and Spark 3 workloads, provisioning different versions of EMR clusters, controlling costs, and managing job libraries across these clusters. A phased approach was essential to avoid disrupting existing workflows.
How did Slack ensure data consistency during the migration?
Slack ensured data consistency by migrating their existing Hive Metastore catalog to a new version while maintaining backward compatibility. They took backups and executed a schema upgrade, allowing both EMR 5 and EMR 6 clusters to access the same catalog during the transition.
What performance improvements were observed post-migration?
Post-migration, Slack observed runtime performance improvements across most Airflow tasks, with enhancements ranging from 30% to 60%, and some jobs achieving up to a 90% boost in efficiency. These improvements were attributed to the new features in Spark 3 and EMR 6.
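
The backup-then-upgrade sequence described in the answers above can be sketched as a small command planner. This is a minimal illustration, not Slack's actual procedure: it assumes the metastore is backed by MySQL (the source does not say which database is used), and the host, database, user, and backup directory are hypothetical placeholders. `mysqldump` and Hive's `schematool` are real CLIs, but check the flags against the versions you run.

```python
import shlex
from datetime import date


def metastore_upgrade_plan(db_host, db_name, db_user, backup_dir, today=None):
    """Return the shell commands for a backup-then-upgrade run, in order.

    Nothing is executed here; the caller reviews and runs the commands.
    Assumes a MySQL-backed Hive Metastore (an assumption, not from the source).
    """
    today = today or date.today()
    backup_file = f"{backup_dir}/{db_name}-{today:%Y%m%d}.sql"
    steps = [
        # 1. Back up the metastore database before touching its schema.
        ["mysqldump", "--single-transaction", "-h", db_host, "-u", db_user,
         "-p", f"--result-file={backup_file}", db_name],
        # 2. Record the schema version currently in place.
        ["schematool", "-dbType", "mysql", "-info"],
        # 3. Preview the upgrade scripts without applying them.
        ["schematool", "-dbType", "mysql", "-upgradeSchema", "-dryRun"],
        # 4. Apply the schema upgrade.
        ["schematool", "-dbType", "mysql", "-upgradeSchema"],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    for command in metastore_upgrade_plan(
        "metastore-db.example.internal", "hive_metastore", "hive", "/backups"
    ):
        print(command)
```

Keeping the plan as data rather than running it directly makes it easy to review the commands, or to stop after the dry run if the listed upgrade scripts look wrong.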

Key Statistics & Figures

Performance improvement in Airflow tasks
30% to 90%
Observed post-migration across various pipeline tasks
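
To make the figures concrete, the snippet below shows how such a percentage could be computed, assuming the improvement is measured as the reduction in task runtime relative to the pre-migration run (the summary does not state the exact metric). The runtimes are hypothetical values chosen only to illustrate the arithmetic.

```python
def runtime_improvement_pct(before_minutes, after_minutes):
    """Percentage reduction in runtime relative to the pre-migration run."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) * 100 / before_minutes


if __name__ == "__main__":
    # Hypothetical (before, after) runtimes in minutes, not measured data.
    for before, after in [(100, 70), (100, 40), (100, 10)]:
        pct = runtime_improvement_pct(before, after)
        print(f"{before} min -> {after} min: {pct:.0f}% less runtime")
```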

Technologies & Tools

  • Cloud Service: AWS EMR. Used for managing data analytics workloads.
  • Data Processing Engine: Apache Spark. Used for data processing tasks in the migration.
  • Workflow Orchestration: Apache Airflow. Used to manage and schedule data processing jobs.
  • Metadata Storage: Hive Metastore. Used for storing table metadata for data processing.

Key Actionable Insights

1. Implement a phased migration strategy when upgrading data processing frameworks to minimize disruption.
   This approach allows teams to gradually transition workloads without impacting ongoing projects, ensuring a smoother upgrade process.
2. Leverage Adaptive Query Execution in Spark 3 to optimize performance for skewed datasets.
   By utilizing AQE, teams can reduce the need for complex code adjustments, simplifying the migration process and enhancing overall efficiency.
3. Enhance Airflow operators to support multiple Spark versions for better job management.
   This allows teams to maintain flexibility in job submissions and ensures compatibility across different Spark environments, facilitating smoother transitions.
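
Insights 2 and 3 can be combined in one sketch: a helper that chooses the target cluster by Spark major version and adds Adaptive Query Execution settings only for Spark 3 jobs. This is an illustration under stated assumptions, not Slack's operator. The cluster names and the `build_spark_submit` function are hypothetical, while the three `spark.sql.adaptive.*` keys are real Spark 3 configuration properties (from Spark 3.2 onward, `spark.sql.adaptive.enabled` is already on by default). In practice, logic like this might live inside a custom Airflow operator's `execute` method.

```python
# Hypothetical cluster names; the source does not give real identifiers.
CLUSTERS = {2: "emr5-spark2", 3: "emr6-spark3"}

# Spark 3 properties that enable Adaptive Query Execution features.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def build_spark_submit(app_path, spark_major_version, extra_conf=None):
    """Return (target_cluster, spark-submit argv) for the requested version."""
    if spark_major_version not in CLUSTERS:
        raise ValueError(f"unsupported Spark major version: {spark_major_version}")
    conf = dict(extra_conf or {})
    if spark_major_version >= 3:
        # Add AQE settings for Spark 3 jobs only; caller-supplied values win.
        conf = {**AQE_CONF, **conf}
    argv = ["spark-submit", "--deploy-mode", "cluster"]
    for key, value in sorted(conf.items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_path)
    return CLUSTERS[spark_major_version], argv


if __name__ == "__main__":
    # Hypothetical job path, for illustration only.
    cluster, argv = build_spark_submit("s3://example-bucket/jobs/daily_agg.py", 3)
    print(cluster, " ".join(argv))
```

Making the Spark version an explicit per-job parameter is one way to support a phased migration: each job can be flipped from 2 to 3 independently and flipped back if a regression appears.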

Common Pitfalls

1. Failing to maintain compatibility between different versions of Spark can lead to job failures and data inconsistencies.
   This often occurs when teams do not adequately test their code against both Spark versions during migration, resulting in unexpected errors.
2. Neglecting to back up the Hive Metastore before migration can result in lost table metadata, and with it access to the underlying data.
   Taking backups is crucial to ensure that the catalog can be restored if the migration runs into issues, especially when dealing with critical datasets.

Related Concepts

Data Migration Strategies
Performance Optimization Techniques
Data Processing Frameworks