Scaling AI/ML Infrastructure at Uber

Nav Kankani, Rush Tehrani, Anant Vyas

Uber

10 min read • advanced

Tags: Apache, Apache Kafka, Apache Spark, Generative AI, Kubernetes, LLaMA, Machine Learning

Overview

The article describes how Uber scaled its AI/ML infrastructure: moving from on-premises to cloud solutions, adopting new technologies, and optimizing existing systems. Key focuses include maximizing resource utilization, improving training efficiency, and meeting the demands of emerging workloads such as Generative AI.

What You'll Learn

1. How to implement a unified federation layer for batch workloads using Ray and Apache Spark

2. Why maximizing GPU utilization is critical for efficient AI/ML training

3. How to evaluate price-performance ratios of cloud SKUs for AI/ML workloads

Prerequisites & Requirements

Understanding of AI/ML concepts and infrastructure
Familiarity with Kubernetes and cloud platforms (optional)

Key Questions Answered

What are the key metrics for scaling AI/ML infrastructure at Uber?

Uber focuses on maximizing infrastructure utilization, establishing systems for emerging workloads, and ensuring reliability, with a target uptime SLA of 99% for training dependencies. Efficiency is measured through metrics such as Model FLOPs Utilization (MFU) and developer velocity.
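To make the MFU metric concrete, here is a minimal sketch of how it is commonly estimated for dense transformer training. The article summary does not give Uber's formula; this uses the widely cited approximation of about 6 FLOPs per parameter per token, and every number below is hypothetical.

```python
def model_flops_utilization(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Estimate MFU as achieved training FLOP/s divided by peak hardware FLOP/s.

    Assumes roughly 6 FLOPs per parameter per token (forward plus backward
    pass) and ignores attention FLOPs, so it is only an approximation.
    """
    achieved_flops_per_s = 6.0 * params * tokens_per_second
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical values for illustration only (not from the article).
mfu = model_flops_utilization(
    params=7e9,                 # 7B-parameter model
    tokens_per_second=20_000,   # measured cluster-wide throughput
    num_gpus=8,
    peak_flops_per_gpu=312e12,  # assumed peak FLOP/s of one GPU
)
print(f"Estimated MFU: {mfu:.1%}")
```

A low value from a calculation like this suggests the GPUs are waiting on something else, such as data loading, communication, or memory.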

How does Uber optimize its existing on-prem infrastructure?

Uber has created a unified federation layer for batch workloads that utilizes Ray and Apache Spark, addressing challenges like resource exposure and inconsistent utilization across Kubernetes clusters. This allows for better workload scheduling and resource allocation.
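The idea of routing a batch job across clusters can be sketched as follows. This is not Uber's implementation: the `Cluster` class, the `pick_cluster` function, and the least-utilized placement policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Cluster:
    """Hypothetical view of one Kubernetes cluster's GPU capacity."""
    name: str
    total_gpus: int
    allocated_gpus: int

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.allocated_gpus


def pick_cluster(clusters: Sequence[Cluster], gpus_needed: int) -> Optional[Cluster]:
    """Return the least-utilized cluster that can fit the job, or None."""
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    if not candidates:
        return None
    # Prefer the cluster with the lowest fraction of GPUs already allocated.
    return min(candidates, key=lambda c: c.allocated_gpus / c.total_gpus)


clusters = [
    Cluster("cluster-a", total_gpus=64, allocated_gpus=60),
    Cluster("cluster-b", total_gpus=128, allocated_gpus=32),
    Cluster("cluster-c", total_gpus=32, allocated_gpus=0),
]
small_job = pick_cluster(clusters, gpus_needed=16)
large_job = pick_cluster(clusters, gpus_needed=64)
too_big = pick_cluster(clusters, gpus_needed=200)
```

A single entry point like this is what lets a federation layer hide per-cluster capacity from users and even out utilization; a production scheduler would also weigh queues, priorities, and data locality.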

What improvements were made for LLM training efficiency?

Uber upgraded its network infrastructure to support higher bandwidth and better congestion control, which resulted in nearly a two-fold increase in training speed and significant reductions in training duration for large models.
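One way to see why bandwidth matters for distributed training is a bandwidth-only estimate of gradient synchronization time. The sketch below uses the standard ring all-reduce traffic model; the gradient size, worker count, and link speeds are hypothetical and are not taken from the article, which does not state what collective algorithm or link speeds Uber uses.

```python
def ring_allreduce_seconds(gradient_bytes, num_workers, bandwidth_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce.

    In a ring all-reduce each worker sends (and receives) about
    2 * (N - 1) / N times the gradient size. Latency, congestion, and
    overlap with compute are ignored.
    """
    if num_workers < 2:
        return 0.0
    traffic_bytes = 2.0 * (num_workers - 1) / num_workers * gradient_bytes
    return traffic_bytes / bandwidth_bytes_per_s


# Hypothetical setup: 7B parameters in 16-bit precision across 8 workers.
gradient_bytes = 7e9 * 2
slow = ring_allreduce_seconds(gradient_bytes, 8, 100e9 / 8)  # 100 Gb/s link
fast = ring_allreduce_seconds(gradient_bytes, 8, 200e9 / 8)  # 200 Gb/s link
print(f"100 Gb/s: {slow:.2f} s per all-reduce, 200 Gb/s: {fast:.2f} s")
```

In this simplified model, communication time scales inversely with bandwidth; real speedups also depend on congestion control and on how much communication overlaps with compute.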

What memory upgrades are being implemented to improve GPU allocation rates?

Uber is doubling the memory on GPU servers from 16 GB to 32 GB per DIMM channel to meet the demands of newer AI/ML workloads, which improves GPU allocation and utilization.

Key Statistics & Figures

Target uptime SLA for training dependencies

99%

This target ensures consistent and reliable outcomes for machine learning training tasks.
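As a quick worked example, the 99% target can be translated into a downtime budget. The 30-day window below is an assumption; the article summary does not say over what period the SLA is measured.

```python
def allowed_downtime_hours(sla, window_days):
    """Hours of downtime permitted by an uptime SLA over a given window."""
    return (1.0 - sla) * window_days * 24.0


# Assumed 30-day measurement window (not stated in the article).
budget = allowed_downtime_hours(sla=0.99, window_days=30)
print(f"Downtime budget: {budget:.1f} hours per 30 days")
```

Under that assumption, a 99% target leaves roughly 7.2 hours of downtime per month across all training dependencies.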

Increase in training speed due to network upgrades

nearly two-fold

This improvement was observed when higher networking bandwidth and better congestion control mechanisms were implemented.

Reduction in GPU usage due to memory offloading

34%

This reduction was achieved while doubling Model FLOPs Utilization (MFU).

Technologies & Tools

Orchestration

Kubernetes

Used for managing distributed GPU resources across various clusters.

Framework

Ray

Utilized for workload scheduling in the unified federation layer.

Framework

Apache Spark

Used alongside Ray for batch workloads in the unified federation layer.

Optimization

TensorRT

Used for optimizing deep learning model performance during serving.

Key Actionable Insights

1. Implementing a unified federation layer for batch workloads can improve resource utilization and scheduling efficiency.
This approach allows distributed resources to be managed across Kubernetes clusters, so that workloads are allocated based on demand and availability.

2. Upgrading network infrastructure to support higher bandwidth is crucial for improving the training efficiency of large models.
By enhancing network capabilities, Uber achieved faster training times and better performance for Generative AI applications.

3. Doubling the memory on GPU servers can improve allocation rates and enable larger batch sizes.
This change helps meet the growing memory demands of modern AI/ML workloads and makes training more efficient.

Common Pitfalls

1. Failing to optimize resource allocation across clusters can lead to underutilization of available infrastructure.
This often occurs when inter-cluster scheduling capabilities are missing, which prevents efficient use of resources.

2. Neglecting network bandwidth can severely impact training efficiency.
Without sufficient bandwidth, large models may take longer to train and perform worse, particularly in distributed settings.

Related Concepts

AI/ML Infrastructure Scaling

Resource Optimization Techniques

Cloud Computing For AI/ML Workloads

Generative AI Applications