Uber’s Journey to Ray on Kubernetes: Ray Setup

Bharat Joshi, Anant Vyas, Ben Wang, Min Cai, Axansh Sheth, Abhinav Dixit
18 min read, intermediate

Overview

Uber's blog post describes the migration of its machine learning workloads to Ray on Kubernetes, covering the problems with the previous setup and the improvements achieved with the new architecture. The article outlines the motivations behind the transition, the goal of simplifying the user experience, and the federated resource management system implemented to optimize resource usage.

What You'll Learn

  1. How to migrate machine learning workloads to Kubernetes using Ray
  2. Why federated resource management is essential for optimizing resource utilization
  3. How to implement job execution and monitoring in a Kubernetes environment

Prerequisites & Requirements

  • Understanding of Kubernetes and Ray
  • Experience with machine learning workflows (optional)

Key Questions Answered

What challenges did Uber face with their previous machine learning workload management?
Uber faced several challenges with their previous system, including difficult and manual resource management, inefficient resource utilization, inflexible capacity planning, and tight coupling with underlying infrastructure. These issues led to increased complexity and longer turnaround times for machine learning experiments.
How does Uber's new architecture improve machine learning workload management?
The new architecture simplifies the user experience by abstracting away infrastructure complexity and providing a declarative interface for job specifications. It also optimizes resource usage through federated resource management, which allocates GPU resources more effectively across clusters and results in faster training.
What is the role of the global control plane in Uber's new system?
The global control plane consists of an API server and a controller manager that manage job requests and resource allocation. It utilizes custom resources to represent machine learning artifacts and monitors job lifecycles, ensuring efficient execution and resource management.
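The control-plane description above can be illustrated with a small in-memory sketch. It is not Uber's implementation: the `JobResource`, `ApiServer`, and `JobController` names, the four lifecycle phases, and the first-fit scheduling rule are all assumptions made for illustration. The sketch only shows the general pattern of an API server that stores job resources and a controller that repeatedly reconciles each job toward completion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Phase(Enum):
    # Hypothetical lifecycle phases; the source does not name them.
    PENDING = "Pending"
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"


@dataclass
class JobResource:
    """Hypothetical custom resource describing one training job."""
    name: str
    gpus: int
    phase: Phase = Phase.PENDING
    cluster: Optional[str] = None


class ApiServer:
    """In-memory stand-in for the API server: stores submitted job resources."""

    def __init__(self) -> None:
        self.jobs: Dict[str, JobResource] = {}

    def submit(self, job: JobResource) -> None:
        self.jobs[job.name] = job


class JobController:
    """Stand-in for one controller: advances each job one step per pass."""

    def __init__(self, api: ApiServer, free_gpus: Dict[str, int]) -> None:
        self.api = api
        self.free_gpus = free_gpus  # cluster name -> unallocated GPUs

    def reconcile(self) -> None:
        for job in self.api.jobs.values():
            if job.phase is Phase.PENDING:
                self._schedule(job)
            elif job.phase is Phase.SCHEDULED:
                job.phase = Phase.RUNNING
            elif job.phase is Phase.RUNNING:
                # A real controller would poll the cluster; here the job
                # simply finishes and returns its GPUs to the pool.
                job.phase = Phase.SUCCEEDED
                self.free_gpus[job.cluster] += job.gpus

    def _schedule(self, job: JobResource) -> None:
        # First-fit placement (an assumption, chosen for brevity).
        for cluster, free in self.free_gpus.items():
            if free >= job.gpus:
                self.free_gpus[cluster] = free - job.gpus
                job.cluster = cluster
                job.phase = Phase.SCHEDULED
                return
        # No capacity: the job stays PENDING and is retried next pass.


api = ApiServer()
api.submit(JobResource(name="train-a", gpus=4))
controller = JobController(api, {"cluster-1": 2, "cluster-2": 8})
for _ in range(3):
    controller.reconcile()
print(api.jobs["train-a"].phase.value, api.jobs["train-a"].cluster)
```

Keeping the desired state in a stored resource and letting a loop converge on it means a job that cannot be placed is simply retried on the next pass, rather than failing at submission time.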

Key Statistics & Figures

Improvement in training speed: 1.5 to 4 times
This improvement was observed after all machine learning projects were migrated to the new Ray on Kubernetes architecture.

Key Actionable Insights

  1. Implement a federated resource management system to optimize resource allocation across clusters.
     This approach allows for better utilization of available resources, especially in environments with varying workloads, ensuring that resources are allocated efficiently and reducing costs.
  2. Utilize a global control plane to manage job execution and monitoring effectively.
     By centralizing control and monitoring, teams can streamline operations and improve response times to job failures or resource shortages, enhancing overall system reliability.
  3. Adopt a declarative interface for job specifications to simplify user interactions with the system.
     This method reduces the complexity for users, allowing them to focus on job requirements rather than underlying infrastructure details, which can lead to faster deployment and iteration cycles.
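To make the first and third insights concrete, here is a minimal sketch of a declarative job spec being placed across several clusters. Everything in it is hypothetical: the `JobSpec` and `Cluster` types, the `place` function, the cluster names, and the GPU type labels are illustrative, and the "most free GPUs wins" policy is an assumption rather than the scheduler described in the original article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical declarative spec: what the job needs, not where it runs."""
    name: str
    gpu_type: str
    gpus: int


@dataclass
class Cluster:
    """Hypothetical view of one cluster's spare capacity."""
    name: str
    gpu_type: str
    free_gpus: int


def place(spec: JobSpec, clusters: List[Cluster]) -> Optional[str]:
    """Return the chosen cluster's name, or None if no cluster fits.

    Assumed policy: among clusters with the requested GPU type and enough
    free GPUs, pick the one with the most free GPUs to spread load.
    """
    candidates = [
        c for c in clusters
        if c.gpu_type == spec.gpu_type and c.free_gpus >= spec.gpus
    ]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda c: c.free_gpus)
    chosen.free_gpus -= spec.gpus  # reserve the capacity
    return chosen.name


clusters = [
    Cluster("east", "A100", 8),
    Cluster("west", "A100", 16),
    Cluster("central", "H100", 32),
]
first = place(JobSpec("job-1", "A100", 8), clusters)
second = place(JobSpec("job-2", "A100", 8), clusters)
third = place(JobSpec("job-3", "A100", 16), clusters)
print(first, second, third)
```

The user-facing spec never names a cluster. That separation is what lets a placement layer move work to wherever capacity exists without changes to the job definition.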

Common Pitfalls

  1. Overcommitting resources based on static configurations can lead to inefficient resource utilization.
     This often happens when teams do not dynamically adjust their resource requests based on actual workload needs, resulting in some clusters being overutilized while others remain underutilized.
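One way to spot the imbalance described above is to compare GPUs in use against capacity for each cluster. The sketch below assumes hypothetical thresholds (30% and 90%), made-up cluster names, and made-up usage figures. It illustrates the check itself, not a tool from the original article.

```python
from typing import Dict, Tuple


def classify_utilization(
    usage: Dict[str, Tuple[int, int]],
    low: float = 0.3,
    high: float = 0.9,
) -> Dict[str, str]:
    """Label each cluster by its used/capacity ratio.

    `usage` maps cluster name -> (gpus_in_use, gpu_capacity).
    The `low` and `high` thresholds are illustrative assumptions.
    """
    report = {}
    for cluster, (used, capacity) in usage.items():
        if capacity <= 0:
            raise ValueError(f"{cluster}: capacity must be positive")
        ratio = used / capacity
        if ratio < low:
            report[cluster] = "underutilized"
        elif ratio > high:
            report[cluster] = "overutilized"
        else:
            report[cluster] = "balanced"
    return report


report = classify_utilization({
    "cluster-1": (62, 64),
    "cluster-2": (10, 64),
    "cluster-3": (40, 64),
})
print(report)
```

A report that shows overutilized and underutilized clusters side by side suggests that static allocations no longer match the actual workload.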

Related Concepts

Kubernetes Orchestration
Ray Framework for Distributed Computing
Resource Management Strategies in Machine Learning