Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2)

Pinterest Engineering


Apache, AWS, Envoy, Helm, Java, Kubernetes, Prometheus, Puppet, PySpark, SQL, Terraform

Overview

This article discusses Pinterest's transition from a Hadoop-based platform to a Kubernetes-based data processing solution named Moka. It outlines the rationale behind this shift, the architectural design of Moka, and the insights gained during the implementation process.

What You'll Learn

1. How to evaluate and select a data processing platform based on specific criteria
2. Why Kubernetes is a suitable replacement for Hadoop in large-scale data processing
3. How to implement a job submission service for Spark on Kubernetes using Archer

Prerequisites & Requirements

Understanding of Kubernetes and Spark
Familiarity with AWS services such as EKS and S3 (optional)

Key Questions Answered

What are the advantages of using Kubernetes for data processing?

Kubernetes offers container-based isolation, ease of deployment, built-in frameworks, and performance tuning options. These features enhance data privacy, security, and operational efficiency compared to traditional Hadoop systems.

How does the Archer job submission service work?

Archer converts each job specification into a Kubernetes custom resource (an object whose schema is defined by a Custom Resource Definition, or CRD) and submits it to an EKS cluster. It then tracks job status and integrates with existing UI frameworks so users can monitor and manage their jobs.
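To make the conversion step concrete, here is a minimal sketch of turning a simplified job specification into a custom resource manifest. It assumes the target is the `SparkApplication` resource of the open source Spark Operator (`sparkoperator.k8s.io/v1beta2`), which this summary does not confirm. The `build_spark_application` function, the job spec keys, the default values, and the image name are all hypothetical and are not Archer's actual interface.

```python
from typing import Any, Dict


def build_spark_application(job_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a simplified job spec into a SparkApplication-style manifest.

    Assumption: the manifest follows the Spark Operator's v1beta2 schema.
    The job_spec keys are hypothetical, chosen only for this illustration.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {
            "name": job_spec["name"],
            "namespace": job_spec.get("namespace", "default"),
        },
        "spec": {
            "type": job_spec.get("type", "Python"),
            "mode": "cluster",
            "image": job_spec["image"],
            "mainApplicationFile": job_spec["main_file"],
            "sparkVersion": job_spec.get("spark_version", "3.5.0"),
            "driver": {
                "cores": 1,
                "memory": job_spec.get("driver_memory", "2g"),
            },
            "executor": {
                "cores": job_spec.get("executor_cores", 2),
                "instances": job_spec.get("executors", 2),
                "memory": job_spec.get("executor_memory", "4g"),
            },
        },
    }


# Hypothetical job spec; the image and file path are placeholders.
example_spec = {
    "name": "daily-aggregation",
    "image": "registry.example.internal/spark:latest",
    "main_file": "local:///opt/jobs/daily_aggregation.py",
    "executors": 4,
}
manifest = build_spark_application(example_spec)
```

A submission service would then hand a manifest like this to the Kubernetes API of the chosen cluster and poll the resulting object for status; that part is omitted here because it depends on cluster access.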

What challenges were faced during the migration from Hadoop to Moka?

Challenges included recompiling libraries for ARM architecture, upgrading to Java 11, and ensuring compatibility of containerized Spark images with the existing Hadoop environment. A dry run process was implemented to validate job submissions before production.

What is the role of the Remote Shuffle Service in Moka?

The Remote Shuffle Service, implemented using Apache Celeborn, improves data shuffling efficiency by decoupling storage and compute clusters. This reduces shuffle timeouts and enhances overall Spark job performance.
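As a rough illustration of how a Spark job is pointed at a remote shuffle service, the sketch below builds the Spark properties that the Apache Celeborn client documentation describes for enabling its shuffle manager. The helper names and the master endpoint are hypothetical, and the settings Pinterest actually uses are not given in this summary.

```python
from typing import Dict, List


def celeborn_spark_conf(master_endpoints: List[str]) -> Dict[str, str]:
    """Return Spark properties that route shuffle data through Celeborn.

    Assumption: property names follow the Apache Celeborn Spark client docs.
    """
    return {
        # Replace Spark's built-in shuffle manager with the Celeborn client.
        "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        # Comma-separated host:port list of Celeborn masters.
        "spark.celeborn.master.endpoints": ",".join(master_endpoints),
        # The external shuffle service is not used with a remote shuffle service.
        "spark.shuffle.service.enabled": "false",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dict into spark-submit style --conf arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical endpoint, for illustration only.
conf = celeborn_spark_conf(["celeborn-master-0.example.internal:9097"])
submit_args = to_submit_args(conf)
```

Because shuffle data then lives on the Celeborn workers rather than on executor-local disk, executors can be lost or scaled down without forcing the affected shuffle stages to be recomputed.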

Key Statistics & Figures

Percentage of batch Spark workloads migrated to Moka

70%

As of the article's publication, approximately 70% of batch Spark workloads had been migrated from Monarch, the Hadoop-based platform, to Moka.

Average performance improvement with Celeborn

5%

The use of Celeborn for the Remote Shuffle Service has led to an average improvement of 5% in Spark job performance.

Technologies & Tools

Kubernetes (orchestration): Used as the foundation for the new data processing platform, Moka.

Apache Spark (data processing): The primary framework for processing data workloads on the Moka platform.

AWS EKS (cloud service): Provides the managed Kubernetes service for deploying and managing Moka.

Apache Celeborn (data processing): Serves as the Remote Shuffle Service to improve data shuffling efficiency.

Apache YuniKorn (scheduling): Used for queue-based scheduling and resource management in Moka.
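Queue-based scheduling means each Spark job is tagged with the queue it should run in. The sketch below shows one way to express that through Spark properties, following the pattern in the Apache YuniKorn guide for running Spark: set the scheduler name and add a `queue` label to the driver and executor pods. The `yunikorn_spark_conf` helper and the queue name are hypothetical, and Moka's actual queue layout is not described in this summary.

```python
from typing import Dict


def yunikorn_spark_conf(queue: str) -> Dict[str, str]:
    """Return Spark properties that place a job in a YuniKorn queue.

    Assumption: the cluster runs YuniKorn under the scheduler name
    "yunikorn" and reads the target queue from the pod label "queue".
    """
    if not queue.startswith("root."):
        # YuniKorn queue paths are hierarchical and start at "root".
        raise ValueError(f"queue must be a path under root, got: {queue!r}")
    conf = {"spark.kubernetes.scheduler.name": "yunikorn"}
    for role in ("driver", "executor"):
        conf[f"spark.kubernetes.{role}.label.queue"] = queue
    return conf


# Hypothetical queue name, for illustration only.
queue_conf = yunikorn_spark_conf("root.batch.daily")
```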

Key Actionable Insights

1. Evaluate Kubernetes-based frameworks for their container management capabilities when transitioning from Hadoop.
   Container-based isolation supports better data privacy and security, which are essential in modern data processing environments.

2. Implement a dry run process for validating job submissions to ensure reliability in production.
   This approach surfaces unexpected failures in a staging environment, reducing the risk of job failures in production.

3. Use Apache Celeborn as a Remote Shuffle Service to enhance data processing performance.
   This service can reduce shuffle timeouts and improve I/O efficiency, which is vital for large-scale data operations.
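The dry run idea in insight 2 can be sketched as a two-stage check: validate the job specification statically, then run it once against staging before it is allowed into production. Everything in this example is hypothetical, including the required field names and the `staging_submit` callable, which stands in for whatever mechanism actually launches a job in a staging cluster.

```python
from typing import Any, Callable, Dict, List

# Hypothetical minimum set of fields a job spec must carry.
REQUIRED_FIELDS = ("name", "image", "main_file")


def validate_spec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of problems found without running the job."""
    return [f"missing field: {field}" for field in REQUIRED_FIELDS if not spec.get(field)]


def dry_run(spec: Dict[str, Any], staging_submit: Callable[[Dict[str, Any]], bool]) -> Dict[str, Any]:
    """Validate a spec, then try it in staging; report where it failed, if anywhere."""
    problems = validate_spec(spec)
    if problems:
        return {"ok": False, "stage": "validation", "problems": problems}
    try:
        succeeded = staging_submit(spec)
    except Exception as exc:  # any staging error counts as a failed dry run
        return {"ok": False, "stage": "staging", "problems": [str(exc)]}
    if not succeeded:
        return {"ok": False, "stage": "staging", "problems": ["staging run did not succeed"]}
    return {"ok": True, "stage": "staging", "problems": []}


# Stand-in for a real staging submission; always reports success.
def fake_staging_submit(spec: Dict[str, Any]) -> bool:
    return True


good = dry_run({"name": "job-a", "image": "img", "main_file": "main.py"}, fake_staging_submit)
bad = dry_run({"name": "job-b"}, fake_staging_submit)
```

Passing the staging submitter in as a callable keeps the validation logic testable without a cluster, which is the main design choice in this sketch.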

Common Pitfalls

1. Failing to account for library dependencies when migrating to ARM architecture can lead to performance issues.
   This often occurs because certain libraries may not be optimized for ARM, resulting in increased memory usage and potential compatibility problems.

2. Neglecting to validate job submissions in a staging environment can lead to unexpected failures in production.
   Without a proper validation process, discrepancies between environments can cause significant disruptions and downtime.

Related Concepts

Kubernetes Orchestration And Management

Data Processing Frameworks Like Apache Spark

Cloud Services For Scalable Data Solutions

Job Scheduling And Resource Management Techniques