Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest

Pinterest Engineering

Tags: Apache, Apache Spark, AWS, Kubernetes

Overview

The article discusses Pinterest's transition from Apache Hadoop YARN to Apache YuniKorn for resource management as its Spark workloads move from Monarch, the Hadoop-based batch processing platform, to Moka, the new Kubernetes-based platform on AWS EKS. It covers the challenges encountered during the migration and the solutions implemented to improve resource efficiency and maintain service quality.

What You'll Learn

1. How to implement Apache YuniKorn for resource scheduling in Kubernetes

2. Why migrating from Apache Hadoop to Kubernetes improves resource management

3. How to monitor workflow SLO performance effectively

Prerequisites & Requirements

Understanding of Kubernetes and resource management concepts
Familiarity with Apache YuniKorn and its features (optional)

Key Questions Answered

What limitations of Apache Hadoop prompted the migration to Kubernetes?

The main limitations were weak application isolation due to the lack of containerization, limited GPU support, the significant effort required to upgrade Hadoop, and diminishing community support for Hadoop as the industry shifts towards Kubernetes. These issues prompted Pinterest to adopt a Kubernetes-based platform.

How does Apache YuniKorn improve resource scheduling for batch processing?

Apache YuniKorn improves resource scheduling by supporting hierarchical queues and application-aware scheduling, which takes users, application priorities, and preemption into account. This allows more efficient allocation of resources for batch workloads than the default Kubernetes scheduler, which places pods individually.
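The idea of hierarchical queues combined with application-aware scheduling can be illustrated with a minimal Python sketch. It is not YuniKorn's implementation or data model: the `Queue` and `App` classes, the queue names, and the vcore limits are all hypothetical, and preemption is left out. The sketch only shows that a request must fit under the limit of its own queue and of every ancestor queue, and that whole applications are considered in priority order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Queue:
    """One node in a hypothetical queue hierarchy (not YuniKorn's data model)."""
    name: str
    max_vcores: int
    parent: Optional["Queue"] = None
    used_vcores: int = 0

    def path(self) -> str:
        # Fully qualified name, e.g. "root.ads".
        return self.name if self.parent is None else f"{self.parent.path()}.{self.name}"

    def can_fit(self, vcores: int) -> bool:
        # The request must fit under this queue's limit and every ancestor's limit.
        node: Optional[Queue] = self
        while node is not None:
            if node.used_vcores + vcores > node.max_vcores:
                return False
            node = node.parent
        return True

    def allocate(self, vcores: int) -> None:
        # Usage is charged to the leaf queue and rolled up to the root.
        node: Optional[Queue] = self
        while node is not None:
            node.used_vcores += vcores
            node = node.parent


@dataclass
class App:
    app_id: str
    queue: Queue
    priority: int
    vcores: int  # total vcores requested by the whole application


def schedule(apps: List[App]) -> List[str]:
    """Admit whole applications, highest priority first, while the hierarchy has room."""
    admitted = []
    for app in sorted(apps, key=lambda a: -a.priority):
        if app.queue.can_fit(app.vcores):
            app.queue.allocate(app.vcores)
            admitted.append(app.app_id)
    return admitted


# Hypothetical hierarchy: two team queues sharing a 100-vcore root.
root = Queue("root", max_vcores=100)
ads = Queue("ads", max_vcores=60, parent=root)
search = Queue("search", max_vcores=60, parent=root)

apps = [
    App("ads-etl", ads, priority=1, vcores=50),
    App("search-index", search, priority=5, vcores=60),
    App("ads-report", ads, priority=3, vcores=40),
]
admitted = schedule(apps)
print(admitted)
```

In this example the lowest-priority application is not admitted because the `ads` queue would exceed its own limit, even though it was submitted first.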

What features were added to Apache YuniKorn during the migration?

Features added include support for logging resource usage information for finished applications: the scheduler aggregates the resource usage of an application's pods and reports a summary. This makes it easier to track resource consumption and to improve future resource allocation decisions.
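As a rough illustration of what aggregating pod resource usage into a per-application summary could look like, here is a small Python sketch. The `PodUsage` record, its fields, and the vcore-seconds and GB-seconds units are assumptions made for this example; the article summary does not specify the format YuniKorn reports.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PodUsage:
    """Hypothetical record for one finished pod."""
    app_id: str
    vcores: float
    memory_gb: float
    runtime_seconds: float


def summarize(pods: Iterable[PodUsage]) -> Dict[str, Dict[str, float]]:
    """Roll up per-pod usage into one summary per application."""
    summary: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"vcore_seconds": 0.0, "gb_seconds": 0.0, "pods": 0}
    )
    for pod in pods:
        entry = summary[pod.app_id]
        # Resource-time products capture both the size and the duration of each pod.
        entry["vcore_seconds"] += pod.vcores * pod.runtime_seconds
        entry["gb_seconds"] += pod.memory_gb * pod.runtime_seconds
        entry["pods"] += 1
    return dict(summary)


# One Spark application: a driver pod plus two executor pods (made-up numbers).
pods = [
    PodUsage("spark-app-1", vcores=1, memory_gb=4, runtime_seconds=600),
    PodUsage("spark-app-1", vcores=4, memory_gb=16, runtime_seconds=500),
    PodUsage("spark-app-1", vcores=4, memory_gb=16, runtime_seconds=500),
]
report = summarize(pods)
print(report["spark-app-1"])
```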

How does Pinterest ensure workflow SLO performance in Moka?

Pinterest extended its existing Workflow SLO Performance Dashboards to include daily runtime results for applications running on Moka, aiming for at least 90% of tier 1 workflows to meet their SLO at least 90% of the time.
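The "90% of tier 1 workflows meet their SLO at least 90% of the time" target is a two-level ratio, which the following Python sketch makes explicit. The workflow names and run histories are invented, and the function names are hypothetical; this is not how Pinterest's dashboards are implemented.

```python
from typing import Dict, List


def slo_attainment(runs: List[bool]) -> float:
    """Fraction of a workflow's runs that met its SLO (True = met)."""
    return sum(runs) / len(runs) if runs else 0.0


def fraction_meeting_target(history: Dict[str, List[bool]], target: float = 0.9) -> float:
    """Fraction of workflows whose own SLO attainment is at least `target`."""
    if not history:
        return 0.0
    passing = sum(1 for runs in history.values() if slo_attainment(runs) >= target)
    return passing / len(history)


# Made-up daily results for three tier 1 workflows over ten days.
history = {
    "wf_a": [True] * 9 + [False],      # met SLO on 9 of 10 days
    "wf_b": [True] * 10,               # met SLO every day
    "wf_c": [True] * 7 + [False] * 3,  # met SLO on 7 of 10 days
}
fraction = fraction_meeting_target(history, target=0.9)
platform_meets_goal = fraction >= 0.9
print(f"{fraction:.0%} of workflows meet their SLO target; goal met: {platform_meets_goal}")
```

Here two of the three workflows reach the per-workflow threshold, so the platform-level goal is not met.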

Key Statistics & Figures

Percentage of Spark workload migrated to Moka

50%

As of the article's writing, half of the Spark workload running on Monarch has been successfully migrated to Moka.

Target SLO performance for tier 1 workflows

90%

Pinterest aims for at least 90% of tier 1 workflows to meet their SLO at least 90% of the time.

Technologies & Tools


Resource Scheduler

Apache YuniKorn

Used for managing resources in the Moka platform to improve scheduling and resource allocation.

Container Orchestration

Kubernetes

The underlying platform for the new Moka architecture, replacing Apache Hadoop.

Batch Processing Framework

Apache Hadoop

The legacy system, with YARN as its resource manager, from which Pinterest is migrating to improve resource management.

Key Actionable Insights

1. Implement Apache YuniKorn to manage resources more effectively in Kubernetes environments.
   Using YuniKorn allows for hierarchical queue management and application-aware scheduling, which can significantly improve resource utilization and job performance in batch processing workloads.

2. Monitor workflow SLO performance to ensure high service quality.
   By tracking SLO performance and adjusting resource allocations based on historical data, teams can maintain service reliability and meet user expectations.

3. Leverage insights from resource usage data to optimize future resource allocations.
   Collecting and analyzing historical resource usage helps in predicting future needs, allowing for proactive adjustments to resource allocations and improving overall efficiency.
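One simple way to turn historical usage into a future allocation is to take a high percentile of past peaks and add headroom. The Python sketch below assumes that approach purely for illustration; the percentile, the headroom value, and the sample data are hypothetical, and the article summary does not say which method Pinterest uses.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct is an integer in 1..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_memory_gb(peaks_gb: Sequence[float], pct: int = 90, headroom_pct: int = 20) -> int:
    """Recommend a memory request: a percentile of past peaks plus headroom, rounded up."""
    base = percentile(peaks_gb, pct)
    return math.ceil(base * (100 + headroom_pct) / 100)


# Made-up peak memory (GB) from the last ten runs of a recurring job; 30 is an outlier.
peaks_gb = [10, 12, 11, 15, 13, 12, 14, 11, 12, 30]
recommended = recommend_memory_gb(peaks_gb)
print(recommended)
```

Using a percentile rather than the maximum keeps a single outlier run from inflating the request, at the cost of occasionally under-provisioning; the right trade-off depends on how expensive a failed or slowed run is.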

Common Pitfalls

1. Underestimating the complexity of migrating from Apache Hadoop to Kubernetes.

   Many organizations may not fully grasp the engineering effort required to transition workloads, which can lead to delays and resource mismanagement during the migration process.

2. Neglecting to monitor resource usage effectively post-migration.

   Failing to track resource consumption can result in inefficient resource allocation, leading to performance issues and unmet service level objectives.

Related Concepts

Resource Management In Cloud Environments

Batch Processing Frameworks Comparison

Kubernetes Best Practices