Overview
This article discusses Pinterest's transition from a Hadoop-based platform to a Kubernetes-based data processing solution named Moka. It outlines the rationale behind this shift, the architectural design of Moka, and the insights gained during the implementation process.
What You'll Learn
1. How to evaluate and select a data processing platform based on specific criteria
2. Why Kubernetes is a suitable replacement for Hadoop in large-scale data processing
3. How to implement a job submission service for Spark on Kubernetes using Archer
Prerequisites & Requirements
- Understanding of Kubernetes and Spark
- Familiarity with AWS services like EKS and S3 (optional)
Key Questions Answered
What are the advantages of using Kubernetes for data processing?
Kubernetes offers container-based isolation, ease of deployment, built-in frameworks, and performance tuning options. These features enhance data privacy, security, and operational efficiency compared to traditional Hadoop systems.
How does the Archer job submission service work?
Archer converts job specifications into Kubernetes custom resources (objects defined by a Custom Resource Definition, or CRD) and submits them to EKS clusters. It tracks job status and integrates with existing UI frameworks for user interaction, ensuring efficient job management.
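The conversion step can be sketched as follows. This is a minimal illustration, not Archer's actual code: it assumes the target resource is the Spark Operator's SparkApplication (the summary does not say which CRD Archer uses), and the JobSpec class, its field names, and the resource sizes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class JobSpec:
    """Hypothetical job specification; field names are illustrative only."""
    name: str
    main_class: str
    jar_uri: str
    image: str
    spark_version: str
    namespace: str = "spark-jobs"
    executor_instances: int = 2
    arguments: List[str] = field(default_factory=list)


def to_spark_application(spec: JobSpec) -> Dict[str, Any]:
    """Build a SparkApplication custom resource as a plain dict.

    Assumes the Spark Operator schema (sparkoperator.k8s.io/v1beta2).
    Driver and executor sizes are placeholder values.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": spec.name, "namespace": spec.namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": spec.image,
            "sparkVersion": spec.spark_version,
            "mainClass": spec.main_class,
            "mainApplicationFile": spec.jar_uri,
            "arguments": list(spec.arguments),
            "driver": {"cores": 1, "memory": "2g"},
            "executor": {
                "instances": spec.executor_instances,
                "cores": 2,
                "memory": "4g",
            },
        },
    }
```

A submission service would then hand this dict to a Kubernetes client pointed at the chosen EKS cluster and poll the resource's status to track the job; that part is omitted here because it depends on cluster access.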
What challenges were faced during the migration from Hadoop to Moka?
Challenges included recompiling libraries for ARM architecture, upgrading to Java 11, and ensuring compatibility of containerized Spark images with the existing Hadoop environment. A dry run process was implemented to validate job submissions before production.
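A dry run of this kind might look like the sketch below. It is an assumption-laden outline rather than Pinterest's process: the run_in_staging callable, the "SUCCEEDED" status string, and the DryRunResult type are all hypothetical stand-ins for whatever the staging environment actually exposes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DryRunResult:
    job_name: str
    passed: bool
    reason: str = ""


def dry_run(
    job_names: Iterable[str],
    run_in_staging: Callable[[str], str],
) -> List[DryRunResult]:
    """Run each job once in staging and record whether it succeeded.

    run_in_staging is a hypothetical callable that submits the job to a
    staging cluster, waits for it to finish, and returns its final status.
    """
    results: List[DryRunResult] = []
    for name in job_names:
        try:
            status = run_in_staging(name)
        except Exception as exc:  # a failed submission counts as a failed dry run
            results.append(DryRunResult(name, False, f"submission error: {exc}"))
            continue
        passed = status == "SUCCEEDED"
        reason = "" if passed else f"final status {status}"
        results.append(DryRunResult(name, passed, reason))
    return results


def ready_for_production(results: Iterable[DryRunResult]) -> List[str]:
    """Return the names of jobs whose dry run passed."""
    return [r.job_name for r in results if r.passed]
```

Only jobs returned by ready_for_production would be switched over; the rest keep running on the old platform until the recorded reason is resolved.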
What is the role of the Remote Shuffle Service in Moka?
The Remote Shuffle Service, implemented using Apache Celeborn, improves data shuffling efficiency by decoupling storage and compute clusters. This reduces shuffle timeouts and enhances overall Spark job performance.
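As a rough illustration, pointing a Spark job at a Celeborn cluster is mostly a matter of configuration. The sketch below uses the property names from Celeborn's Spark client documentation; the helper functions and the endpoint host names are hypothetical, and the settings Pinterest actually uses are not given in this summary.

```python
from typing import Dict, List


def celeborn_spark_conf(master_endpoints: List[str]) -> Dict[str, str]:
    """Spark settings that route shuffle data through a Celeborn cluster."""
    if not master_endpoints:
        raise ValueError("at least one Celeborn master endpoint is required")
    return {
        # Swap Spark's built-in shuffle manager for the Celeborn client.
        "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        # Comma-separated host:port list of Celeborn masters.
        "spark.celeborn.master.endpoints": ",".join(master_endpoints),
        # The node-local external shuffle service is not used with Celeborn.
        "spark.shuffle.service.enabled": "false",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a conf mapping into spark-submit `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args
```

Because shuffle data is written to the Celeborn workers instead of executor-local disks, executors can be released or lost without forcing the shuffle output to be recomputed.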
Key Statistics & Figures
Percentage of batch Spark workloads migrated to Moka
70%
As of the article's publication, approximately 70% of batch Spark workloads have been successfully migrated from Monarch (Pinterest's Hadoop-based platform) to Moka.
Average performance improvement with Celeborn
5%
The use of Celeborn for the Remote Shuffle Service has led to an average improvement of 5% in Spark job performance.
Technologies & Tools
Orchestration
Kubernetes
Used as the foundation for the new data processing platform, Moka.
Data Processing
Apache Spark
The primary framework for processing data workloads on the Moka platform.
Cloud Service
Amazon EKS
Provides the managed Kubernetes service for deploying and managing Moka.
Data Processing
Apache Celeborn
Serves as the Remote Shuffle Service to improve data shuffling efficiency.
Scheduling
Apache YuniKorn
Used for queue-based scheduling and resource management in Moka.
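For readers unfamiliar with how a Spark job ends up in a YuniKorn queue, the sketch below builds the relevant Spark properties. The property names follow the Spark on Kubernetes documentation for using YuniKorn as a custom scheduler (Spark 3.3 or later); the helper function and the queue name in the test are hypothetical, and Moka's actual queue layout is not described in this summary.

```python
from typing import Dict


def yunikorn_spark_conf(queue: str) -> Dict[str, str]:
    """Spark settings that schedule driver and executor pods via YuniKorn."""
    if not queue:
        raise ValueError("queue name must not be empty")
    # Spark substitutes {{APP_ID}} with the application ID at submit time.
    app_id = "{{APP_ID}}"
    return {
        "spark.kubernetes.scheduler.name": "yunikorn",
        # YuniKorn reads the target queue from a pod label.
        "spark.kubernetes.driver.label.queue": queue,
        "spark.kubernetes.executor.label.queue": queue,
        # The app-id annotation groups driver and executors into one application.
        "spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id": app_id,
        "spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id": app_id,
    }
```

Giving the driver and executors the same queue label and app-id lets the scheduler account for the whole application against a single queue's resource limits.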
Key Actionable Insights
1. Evaluate Kubernetes-based frameworks for their container management capabilities when transitioning from Hadoop. This is crucial as it allows for better data privacy and security, which are essential in modern data processing environments.
2. Implement a dry run process for validating job submissions to ensure reliability in production. This approach helps detect unexpected failures in a staging environment, reducing the risk of job failures in production.
3. Utilize Apache Celeborn as a Remote Shuffle Service to enhance data processing performance. This service can significantly reduce shuffle timeouts and improve IO efficiency, which is vital for large-scale data operations.
Common Pitfalls
1. Failing to account for library dependencies when migrating to ARM architecture can lead to performance issues. This often occurs because certain libraries may not be optimized for ARM, resulting in increased memory usage and potential compatibility problems.
2. Neglecting to validate job submissions in a staging environment can lead to unexpected failures in production. Without a proper validation process, discrepancies between environments can cause significant disruptions and downtime.
Related Concepts
Kubernetes Orchestration And Management
Data Processing Frameworks Like Apache Spark
Cloud Services For Scalable Data Solutions
Job Scheduling And Resource Management Techniques