Open Source and In-House: How Uber Optimizes LLM Training

Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang

Uber


Tags: Apache, Apache Kafka, Apache Spark, Comet, Docker, Google Cloud, GPT, GPT-4, Hugging Face, Kubernetes, Mistral, PyTorch, SQL, Transformers

Overview

This article describes how Uber trains and fine-tunes Large Language Models (LLMs), combining open-source models with in-house training infrastructure. It covers the training stack and the GPU memory optimizations that raise training throughput for applications such as Uber Eats recommendations and customer support.

What You'll Learn

1. How to leverage open-source models for LLM training
2. Why continuous pre-training and instruction fine-tuning improve LLM performance
3. How to optimize GPU memory usage during LLM training

Prerequisites & Requirements

Understanding of Large Language Models and their applications
Familiarity with PyTorch and Ray for distributed training (optional)

Key Questions Answered

How does Uber optimize LLM training using both open-source and in-house models?

Uber utilizes a combination of open-source models like Meta® Llama 2 and Mistral AI Mixtral® along with proprietary models from OpenAI and Google. This hybrid approach allows Uber to integrate domain-specific knowledge, enhancing the performance of LLMs for applications such as Uber Eats and customer support.

What hardware does Uber use for LLM training?

Uber uses NVIDIA® A100 and H100 GPU instances for LLM workflows. On-premises, each A100 host has 4 GPUs and 600 GB of memory, while each Google Cloud host has 8 H100 GPUs and 1872 GB of CPU memory. Both are used for training and evaluating LLMs.

What are the key components of Uber's LLM training stack?

Uber's LLM training stack includes PyTorch, Ray, Hugging Face, and DeepSpeed. These tools facilitate distributed training, model optimization, and efficient resource management, allowing Uber to train state-of-the-art LLMs effectively.

What optimizations did Uber implement to improve training throughput?

Uber implemented optimizations such as DeepSpeed ZeRO-stage-3 CPU Optimizer Offload and flash attention. Both reduce GPU memory usage, which allows larger batch sizes and improved training throughput by up to 2-3 times.
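As a rough illustration of the CPU optimizer offload mentioned above, the following is a minimal sketch of a DeepSpeed configuration expressed as a Python dictionary. The keys shown (`zero_optimization`, `stage`, `offload_optimizer`, `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `bf16`) are standard DeepSpeed config fields, but the helper name `build_zero3_offload_config`, the batch values, and the choice of bf16 are assumptions made for illustration and are not taken from Uber's setup.

```python
import json


def build_zero3_offload_config(micro_batch_size: int, grad_accum_steps: int) -> dict:
    """Build a DeepSpeed config dict: ZeRO stage 3 with optimizer state offloaded to CPU.

    Hypothetical helper for illustration; the values are not Uber's settings.
    """
    return {
        # Per-GPU batch size; freeing GPU memory is what lets this number grow.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        # Assumption: bf16 mixed precision (the source does not state the precision).
        "bf16": {"enabled": True},
        "zero_optimization": {
            # Stage 3 partitions parameters, gradients, and optimizer state across GPUs.
            "stage": 3,
            # Keep the optimizer state in host memory instead of GPU memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
    }


config = build_zero3_offload_config(micro_batch_size=2, grad_accum_steps=8)
print(json.dumps(config, indent=2))
```

A dictionary like this can be saved as JSON and handed to DeepSpeed, or passed through the `deepspeed` argument of Hugging Face `TrainingArguments`. The right batch size depends on the model and hardware and has to be found empirically.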

Key Statistics & Figures

GPU memory reduction with DeepSpeed ZeRO-stage-3

34%

This optimization allowed for a 2-3 times increase in training throughput.

Batch size increase during Llama 2 70B training

2 to 7 times

This increase was achieved while maintaining maximum GPU memory usage.

Throughput increase with flash attention

50%

This optimization allowed for doubling the batch size while keeping training speeds consistent.
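To see why offloading optimizer state frees so much GPU memory, a back-of-the-envelope estimate helps. The sketch below uses the common accounting for mixed-precision Adam training: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer state (full-precision master weights, momentum, and variance). The GPU count of 32 and the function name are hypothetical. The estimate covers model states only, so it ignores activations, temporary buffers, and fragmentation, and measured savings such as the 34% above will differ from it.

```python
BYTES_PER_GB = 1e9  # decimal gigabytes, for readability

# Per-parameter byte counts for mixed-precision Adam training.
WEIGHT_BYTES = 2      # half-precision weights
GRADIENT_BYTES = 2    # half-precision gradients
OPTIMIZER_BYTES = 12  # full-precision master weights + momentum + variance


def model_state_gb_per_gpu(num_params: int, num_gpus: int, offload_optimizer: bool) -> float:
    """Estimate GPU memory for model states under ZeRO stage 3.

    Stage 3 partitions weights, gradients, and optimizer state evenly across GPUs.
    With optimizer offload, the optimizer state lives in CPU memory instead.
    """
    bytes_per_param_on_gpu = WEIGHT_BYTES + GRADIENT_BYTES
    if not offload_optimizer:
        bytes_per_param_on_gpu += OPTIMIZER_BYTES
    return bytes_per_param_on_gpu * num_params / num_gpus / BYTES_PER_GB


# Hypothetical setup: a 70B-parameter model on 32 GPUs.
NUM_PARAMS = 70_000_000_000
NUM_GPUS = 32

without_offload = model_state_gb_per_gpu(NUM_PARAMS, NUM_GPUS, offload_optimizer=False)
with_offload = model_state_gb_per_gpu(NUM_PARAMS, NUM_GPUS, offload_optimizer=True)

print(f"Model states per GPU without offload: {without_offload:.2f} GB")
print(f"Model states per GPU with offload:    {with_offload:.2f} GB")
```

Under these assumptions the optimizer state accounts for three quarters of the model-state footprint, which is why moving it to CPU memory leaves room for larger batches.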

Technologies & Tools


Backend

PyTorch

Used as the deep learning framework for training LLMs.

Backend

Ray

Facilitates distributed training and resource management.

Backend

DeepSpeed

Optimizes deep learning training and inference.

Backend

Hugging Face

Provides APIs and tools for training transformer-based models.

Orchestration

Kubernetes

Manages computing resources for LLM training workloads.

Key Actionable Insights

1. Utilizing a hybrid model approach can significantly enhance LLM performance.
By combining open-source models with proprietary insights, organizations can leverage domain-specific knowledge to improve model accuracy and user experience.

2. Optimizing GPU memory usage is crucial for efficient LLM training.
Implementing techniques like CPU offload and flash attention can lead to substantial improvements in training throughput and resource utilization.

3. Continuous pre-training and fine-tuning can yield better results in LLM applications.
By regularly updating models with domain-specific data, organizations can maintain high accuracy and relevance in their AI-driven services.
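For the second insight, a memory optimization only pays off if throughput actually rises, since a larger batch can also lengthen each training step. The sketch below compares tokens processed per second before and after such a change. All numbers (batch sizes, sequence length, GPU count, step times) and the function name are hypothetical and are not measurements from the article.

```python
def tokens_per_second(micro_batch_size: int, seq_len: int, num_gpus: int, step_seconds: float) -> float:
    """Tokens processed per second for one data-parallel training step."""
    tokens_per_step = micro_batch_size * seq_len * num_gpus
    return tokens_per_step / step_seconds


# Hypothetical baseline: batch size limited by GPU memory.
baseline = tokens_per_second(micro_batch_size=1, seq_len=4096, num_gpus=32, step_seconds=10.0)

# Hypothetical optimized run: freed memory allows double the batch size,
# at the cost of a somewhat longer step.
optimized = tokens_per_second(micro_batch_size=2, seq_len=4096, num_gpus=32, step_seconds=12.0)

speedup = optimized / baseline
print(f"Baseline:  {baseline:.1f} tokens/s")
print(f"Optimized: {optimized:.1f} tokens/s")
print(f"Speedup:   {speedup:.2f}x")
```

Doubling the batch size yields a full 2x gain only if the step time stays constant, so it is worth measuring step time alongside memory usage.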

Common Pitfalls

1. Failing to optimize GPU memory can lead to inefficient training processes.
Without proper memory management, organizations may experience bottlenecks that hinder performance and increase costs.

2. Neglecting the importance of continuous model updates can result in outdated AI capabilities.
Models that are not regularly fine-tuned may fail to meet user expectations and business needs, leading to reduced effectiveness.

Related Concepts

Distributed Training

Large Language Models

Model Optimization Techniques