Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang • 11 min read • advanced
Overview
The article discusses how Uber optimizes the training of Large Language Models (LLMs) using both open-source and in-house models. It highlights the integration of advanced technologies and methodologies to enhance performance and scalability in various applications such as Uber Eats recommendations and customer support.
What You'll Learn
1. How to leverage open-source models for LLM training
2. Why continuous pre-training and instruction fine-tuning improve LLM performance
3. How to optimize GPU memory usage during LLM training
Prerequisites & Requirements
- Understanding of Large Language Models and their applications
- Familiarity with PyTorch and Ray for distributed training (optional)
Key Questions Answered
How does Uber optimize LLM training using both open-source and in-house models?
Uber utilizes a combination of open-source models like Meta® Llama 2 and Mistral AI Mixtral® along with proprietary models from OpenAI and Google. This hybrid approach allows Uber to integrate domain-specific knowledge, enhancing the performance of LLMs for applications such as Uber Eats and customer support.
What hardware does Uber use for LLM training?
Uber employs NVIDIA® A100 and H100 GPU instances for LLM workflows. On-premises, each A100 host has 4 GPUs and 600 GB memory, while Google Cloud hosts feature 8 H100 GPUs with 1872 GB CPU memory, enabling efficient training and evaluation of LLMs.
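A rough calculation illustrates why the CPU memory on these hosts matters for a 70B-parameter model. This is a minimal sketch, not Uber's own sizing: it assumes mixed-precision Adam training at 16 bytes per parameter and 80 GB of memory per H100 GPU. Neither figure is stated in the article, and activation memory is ignored.

```python
# Back-of-the-envelope estimate of training-state memory (illustrative only).
PARAMS = 70e9          # approximate parameter count of Llama 2 70B
BYTES_WEIGHTS = 2      # 16-bit weights
BYTES_GRADS = 2        # 16-bit gradients
BYTES_OPTIMIZER = 12   # fp32 master weights + Adam momentum + Adam variance

GPUS_PER_HOST = 8      # H100 host described in the text
GPU_MEMORY_GB = 80     # ASSUMPTION: 80 GB per H100, not stated in the text
CPU_MEMORY_GB = 1872   # CPU memory per H100 host, from the text


def gigabytes(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


weights_and_grads_gb = gigabytes(PARAMS * (BYTES_WEIGHTS + BYTES_GRADS))
optimizer_gb = gigabytes(PARAMS * BYTES_OPTIMIZER)
total_gb = weights_and_grads_gb + optimizer_gb
gpu_pool_gb = GPUS_PER_HOST * GPU_MEMORY_GB

print(f"weights + gradients: {weights_and_grads_gb:.0f} GB")
print(f"optimizer states:    {optimizer_gb:.0f} GB")
print(f"total model states:  {total_gb:.0f} GB")
print(f"GPU memory per host: {gpu_pool_gb} GB")
print(f"optimizer fits in CPU memory: {optimizer_gb < CPU_MEMORY_GB}")
```

Under these assumptions the model states (about 1,120 GB) exceed the 640 GB of GPU memory on a single 8-GPU host, while the optimizer states alone (about 840 GB) fit within the 1872 GB of CPU memory.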
What are the key components of Uber's LLM training stack?
Uber's LLM training stack includes PyTorch, Ray, Hugging Face, and DeepSpeed. These tools facilitate distributed training, model optimization, and efficient resource management, allowing Uber to train state-of-the-art LLMs effectively.
What optimizations did Uber implement to improve training throughput?
Uber implemented optimizations such as DeepSpeed ZeRO-stage-3 CPU Optimizer Offload and flash attention, which reduced GPU memory usage and allowed for larger batch sizes. The larger batches increased training throughput by 2-3 times.
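As an illustration of the CPU optimizer offload mentioned above, here is a minimal sketch of a DeepSpeed configuration that enables ZeRO stage 3 and places optimizer states in CPU memory. The keys are standard DeepSpeed configuration options, but the batch size, accumulation steps, and precision setting are placeholder assumptions, not values from the article.

```python
import json

# Sketch of a DeepSpeed config with ZeRO stage 3 and CPU optimizer offload.
# Numeric values are hypothetical placeholders, not Uber's settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,  # placeholder per-GPU batch size
    "gradient_accumulation_steps": 1,     # placeholder
    "bf16": {"enabled": True},            # assumption: bf16 mixed precision
    "zero_optimization": {
        "stage": 3,                       # partition params, grads, optimizer states
        "offload_optimizer": {
            "device": "cpu",              # keep optimizer states in CPU memory
            "pin_memory": True,           # pinned memory for faster host-device copies
        },
    },
}

# Serialize so the config can be saved to a file and handed to a trainer.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

A configuration like this is commonly saved as a JSON file and passed to the Hugging Face `TrainingArguments` through its `deepspeed` parameter. In recent Hugging Face Transformers versions, flash attention is typically requested with `attn_implementation="flash_attention_2"` when loading a model; the article does not say which mechanism Uber used.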
Key Statistics & Figures
GPU memory reduction with DeepSpeed ZeRO-stage-3
34%
This optimization allowed for a 2-3 times increase in training throughput.
Batch size increase during Llama 2 70B training
2 to 7 times
This increase was achieved while maintaining maximum GPU memory usage.
Throughput increase with flash attention
50%
This optimization allowed for doubling the batch size while keeping training speeds consistent.
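The link between batch size and throughput in the figures above can be sketched with simple arithmetic: throughput is samples processed per step divided by step time, so freeing GPU memory helps only to the extent that the batch grows faster than the step time. The function names and the example numbers below are hypothetical and chosen for illustration; they are not measurements from the article.

```python
def global_batch_size(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step in data-parallel training."""
    return micro_batch * grad_accum * num_gpus


def throughput_speedup(old_batch: int, new_batch: int,
                       old_step_s: float, new_step_s: float) -> float:
    """Ratio of new to old throughput, where throughput = batch / step time."""
    return (new_batch / new_step_s) / (old_batch / old_step_s)


# Hypothetical case 1: batch size doubles and step time is unchanged.
same_speed = throughput_speedup(old_batch=2, new_batch=4,
                                old_step_s=1.0, new_step_s=1.0)

# Hypothetical case 2: batch size triples but each step takes 20% longer.
slower_steps = throughput_speedup(old_batch=2, new_batch=6,
                                  old_step_s=1.0, new_step_s=1.2)

print(f"global batch on 8 GPUs: {global_batch_size(2, 1, 8)}")
print(f"speedup, same step time: {same_speed:.2f}x")
print(f"speedup, slower steps:   {slower_steps:.2f}x")
```

In the second hypothetical case the speedup is about 2.5 times rather than 3, which shows why a batch-size multiple and a throughput multiple need not be equal.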
Technologies & Tools
Backend
PyTorch
Used as the deep learning framework for training LLMs.
Backend
Ray
Facilitates distributed training and resource management.
Backend
DeepSpeed
Optimizes deep learning training and inference.
Backend
Hugging Face
Provides APIs and tools for training transformer-based models.
Orchestration
Kubernetes
Manages computing resources for LLM training workloads.
Key Actionable Insights
1. Utilizing a hybrid model approach can significantly enhance LLM performance. By combining open-source models with proprietary insights, organizations can leverage domain-specific knowledge to improve model accuracy and user experience.
2. Optimizing GPU memory usage is crucial for efficient LLM training. Implementing techniques like CPU offload and flash attention can lead to substantial improvements in training throughput and resource utilization.
3. Continuous pre-training and fine-tuning can yield better results in LLM applications. By regularly updating models with domain-specific data, organizations can maintain high accuracy and relevance in their AI-driven services.
Common Pitfalls
1. Failing to optimize GPU memory can lead to inefficient training processes.
Without proper memory management, organizations may experience bottlenecks that hinder performance and increase costs.
2. Neglecting the importance of continuous model updates can result in outdated AI capabilities.
Models that are not regularly fine-tuned may fail to meet user expectations and business needs, leading to reduced effectiveness.
Related Concepts
Distributed Training
Large Language Models
Model Optimization Techniques