Uber’s Journey to Ray on Kubernetes: Resource Management

Bharat Joshi, Anant Vyas, Ben Wang, Axansh Sheth, Abhinav Dixit

Uber

Tags: Apache, Apache Spark, Grafana, Kubernetes

Overview

This article discusses Uber's implementation of elastic resource management on Kubernetes, focusing on enhancements made to support Ray-based job management. It highlights the motivation behind these changes, the architecture for resource management, and the scheduling of workloads on heterogeneous clusters.

What You'll Learn

1. How to implement elastic resource management in Kubernetes
2. Why heterogeneous clusters improve resource utilization
3. How to schedule workloads on specific GPU models

Prerequisites & Requirements

Understanding of Kubernetes and container orchestration
Familiarity with Ray for distributed computing (optional)

Key Questions Answered

How does Uber manage resources in Kubernetes for Ray workloads?

Uber implements elastic resource management by creating resource pools that allow for dynamic sharing and preemption of resources among different workloads. This system ensures that resource pools can borrow resources from each other based on demand, optimizing overall resource utilization.
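The borrowing and preemption behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not Uber's implementation: the `Pool` class, the `allocate` function, the round-robin lending rule, and the GPU counts are all hypothetical, and the sketch assumes the pools' reservations sum to no more than the cluster capacity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical resource pool; units are GPUs for illustration."""
    name: str
    reservation: int  # capacity guaranteed to this pool
    demand: int = 0   # capacity its workloads currently request


def allocate(pools: list[Pool], capacity: int) -> dict[str, int]:
    """Grant each pool up to its reservation, then lend idle capacity.

    Assumes sum(reservation) <= capacity. Lending is round-robin,
    one unit at a time, to pools whose demand is still unmet.
    """
    alloc = {p.name: min(p.demand, p.reservation) for p in pools}
    spare = capacity - sum(alloc.values())
    while spare > 0:
        hungry = [p for p in pools if alloc[p.name] < p.demand]
        if not hungry:
            break
        for p in hungry:
            if spare == 0:
                break
            alloc[p.name] += 1
            spare -= 1
    return alloc


pools = [Pool("training", reservation=6, demand=2),
         Pool("batch", reservation=4, demand=9)]
before = allocate(pools, capacity=10)  # "batch" borrows idle capacity

pools[0].demand = 6                    # "training" now needs its full share
after = allocate(pools, capacity=10)   # borrowed capacity is reclaimed

# Units each pool must give back (preemption) after the demand change.
preempted = {name: max(0, before[name] - after[name]) for name in before}
```

Recomputing the allocation whenever demand changes is what makes the borrowed capacity preemptible in this sketch: a pool can exceed its reservation only while another pool leaves its own share idle.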

What are the benefits of using heterogeneous clusters in machine learning?

Heterogeneous clusters improve resource utilization by offloading non-GPU tasks, such as data processing, to CPU-only nodes, so that GPU nodes stay available for the work that needs them, such as training. This setup enables better management of workloads whose stages require different types of resources.

What scheduling strategies are used for GPU workloads?

Uber employs a bin-packing scheduling strategy for GPU workloads to minimize fragmentation and ensure efficient use of GPU resources. Additionally, a GPU filter plugin is used to restrict non-GPU pods from running on GPU nodes, ensuring that these resources are reserved for appropriate workloads.
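The two ideas in this answer, a filter step that keeps non-GPU pods off GPU nodes and a bin-packing choice among the remaining nodes, can be sketched as plain Python. This is an illustration under stated assumptions rather than the actual scheduler plugin: the `Node` class, `gpu_filter`, `pick_node`, and the node names are hypothetical, and the scoring rule (fewest free GPUs left after placement) is one simple way to express bin-packing.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a cluster node's GPU capacity."""
    name: str
    gpus_total: int
    gpus_free: int


def gpu_filter(node: Node, gpus_requested: int) -> bool:
    """Return True if a pod requesting this many GPUs may run on the node."""
    if gpus_requested == 0:
        # Non-GPU pods are only allowed on CPU-only nodes.
        return node.gpus_total == 0
    return node.gpus_free >= gpus_requested


def pick_node(nodes: list[Node], gpus_requested: int) -> Node | None:
    """Filter the nodes, then bin-pack among the feasible ones."""
    feasible = [n for n in nodes if gpu_filter(n, gpus_requested)]
    if not feasible:
        return None
    # Bin-packing: prefer the node left with the fewest free GPUs,
    # so nearly full nodes fill up before empty ones are touched.
    return min(feasible, key=lambda n: (n.gpus_free - gpus_requested, n.name))


nodes = [Node("cpu-1", gpus_total=0, gpus_free=0),
         Node("gpu-1", gpus_total=8, gpus_free=8),
         Node("gpu-2", gpus_total=8, gpus_free=3)]

small_job = pick_node(nodes, gpus_requested=2)   # fits the partly used node
large_job = pick_node(nodes, gpus_requested=4)   # needs the empty node
cpu_job = pick_node(nodes, gpus_requested=0)     # kept off GPU nodes
```

Packing the small job onto the partly used node leaves the empty 8-GPU node whole for a later job that needs many GPUs on one machine, which is the fragmentation that bin-packing is meant to avoid.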

Technologies & Tools


Orchestration

Kubernetes

Used for managing containerized applications and implementing resource management strategies.

Distributed Computing

Ray

Used for managing and executing machine learning workloads in a distributed environment.

Key Actionable Insights

1. Implementing elastic resource management can improve resource utilization in Kubernetes environments.
By allowing resource pools to share and borrow resources dynamically, organizations can make fuller use of their infrastructure, especially during peak demand periods.

2. Using heterogeneous clusters can lower costs and improve performance for machine learning tasks.
By offloading non-GPU tasks to CPU nodes, teams can keep GPU resources busy with GPU work, which can shorten training times and reduce costs.

Common Pitfalls

1. Failing to properly configure resource pools can lead to inefficient resource utilization.
Without careful management of resource pools, teams may experience resource contention, leading to performance degradation and increased costs.

Related Concepts

Resource Management In Kubernetes

Distributed Computing With Ray

Machine Learning Workload Orchestration