Training a Recommender System on DGX A100 with 100B+ Parameters in TensorFlow 2

Tomasz Grel

Learn how combining model parallelism and data parallelism lets practitioners train large-scale recommender systems in minutes instead of days.

NVIDIA

Deep Learning, Embedding, TensorFlow

Overview

This article discusses training a large-scale recommender system with over 113 billion parameters using hybrid-parallel training on the NVIDIA DGX A100 with TensorFlow 2. It highlights the challenge of fitting large embedding tables in GPU memory and presents solutions that combine model parallelism and data parallelism to shorten training time significantly.

What You'll Learn

1. How to implement hybrid-parallel training for large recommender systems

2. Why model parallelism is essential for handling large embedding tables

3. How to optimize training speed using mixed precision and XLA

Prerequisites & Requirements

Understanding of deep learning concepts and recommender systems
Familiarity with TensorFlow 2 and NVIDIA hardware (optional)

Key Questions Answered

What is the hybrid-parallel approach in training recommender systems?

The hybrid-parallel approach combines model parallelism for the embedding layers with data parallelism for the dense layers, so that multiple GPUs are used efficiently. It makes training large models such as the 113-billion-parameter DLRM feasible and significantly reduces training time.
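The split described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the article's implementation: the worker count, table sizes, the round-robin placement in `owner`, and the `all_to_all` helper are all hypothetical, and the exchange between the two phases is assumed to be an all-to-all style re-shard.

```python
import random

NUM_WORKERS = 2      # hypothetical GPU count
NUM_TABLES = 4       # hypothetical number of embedding tables
ROWS, DIM = 10, 3    # hypothetical table shape
GLOBAL_BATCH = 4

random.seed(0)
# tables[t][row] is an embedding vector of length DIM
tables = [[[random.random() for _ in range(DIM)] for _ in range(ROWS)]
          for _ in range(NUM_TABLES)]

# Model parallelism: each table lives on exactly one worker (round-robin here).
owner = {t: t % NUM_WORKERS for t in range(NUM_TABLES)}

# indices[s][t] is the row that sample s looks up in table t
indices = [[random.randrange(ROWS) for _ in range(NUM_TABLES)]
           for _ in range(GLOBAL_BATCH)]

def local_lookup(worker):
    """A worker looks up only its own tables, but for the whole global batch."""
    return {t: [tables[t][indices[s][t]] for s in range(GLOBAL_BATCH)]
            for t in range(NUM_TABLES) if owner[t] == worker}

def all_to_all(per_worker):
    """Re-shard so each worker holds all tables for its own slice of the batch."""
    shard = GLOBAL_BATCH // NUM_WORKERS
    out = []
    for w in range(NUM_WORKERS):
        samples = range(w * shard, (w + 1) * shard)
        out.append([[per_worker[owner[t]][t][s] for t in range(NUM_TABLES)]
                    for s in samples])
    return out

# Phase 1: model-parallel embedding lookups.
model_parallel_out = [local_lookup(w) for w in range(NUM_WORKERS)]
# Phase 2: input to the data-parallel dense layers, one batch shard per worker.
data_parallel_in = all_to_all(model_parallel_out)
```

After the exchange, `data_parallel_in[w][i][t]` is the embedding from table `t` for the `i`-th sample in worker `w`'s shard, so each worker can run the dense layers on its shard independently.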

What speedup can be achieved using the DGX A100 for training large models?

Training a 113-billion-parameter DLRM model on the DGX A100 achieved a 672x speedup over a dual-socket CPU system. This demonstrates the effectiveness of high memory bandwidth and fast GPU-to-GPU communication for large-scale training.

How does the column-wise split mode work for large embedding tables?

In column-wise split mode, each GPU holds a subset of columns from every embedding table, so a single table can be spread across multiple GPUs. This mode is essential when an individual table exceeds the memory capacity of a single GPU.
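A minimal sketch of the column-wise idea for one table, in plain Python. The worker count, the table shape, and the `lookup` helper are hypothetical, and `full_table` exists here only so the result can be checked; in a real setup the complete table would never sit in one place.

```python
NUM_WORKERS = 4    # hypothetical GPU count
ROWS, DIM = 6, 8   # hypothetical table shape; DIM is divisible by NUM_WORKERS

# Reference copy of the full table, used only to verify the sharded lookup.
full_table = [[r * 100 + c for c in range(DIM)] for r in range(ROWS)]

cols_per_worker = DIM // NUM_WORKERS

# Each worker stores every row, but only its own contiguous slice of columns.
shards = [
    [row[w * cols_per_worker:(w + 1) * cols_per_worker] for row in full_table]
    for w in range(NUM_WORKERS)
]

def lookup(row_id):
    """Every worker reads its slice of the row; the slices are concatenated."""
    pieces = [shards[w][row_id] for w in range(NUM_WORKERS)]
    return [value for piece in pieces for value in piece]
```

Each worker here stores `ROWS * DIM / NUM_WORKERS` values, so the per-GPU footprint of a table shrinks in proportion to the number of GPUs, at the cost of gathering the column slices after each lookup.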

What performance optimizations were implemented for the DLRM model?

Three optimizations were applied: mixed precision, which gave a 23% speedup; fusing embedding tables of the same width, which gave a 39% speedup; and the XLA compiler, which gave a 3.36x improvement.
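Fusing tables of the same width can be pictured as stacking them into one table and addressing rows through per-table offsets. The sketch below is a plain-Python illustration under that assumption; the three small tables and the `fused_lookup` helper are hypothetical and are not taken from the article's code.

```python
# Three hypothetical tables that share the same width (2 columns).
table_a = [[1.0, 1.1], [2.0, 2.1]]
table_b = [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]]
table_c = [[6.0, 6.1]]
tables = [table_a, table_b, table_c]

# Fuse: stack all rows into one table and record where each table starts.
fused = [row for table in tables for row in table]
offsets = []
start = 0
for table in tables:
    offsets.append(start)
    start += len(table)

def fused_lookup(row_ids):
    """Single pass over one table instead of one lookup per table.

    row_ids[t] is the row requested from original table t.
    """
    return [fused[offsets[t] + r] for t, r in enumerate(row_ids)]
```

For example, `fused_lookup([1, 2, 0])` fetches row 1 of the first table, row 2 of the second, and row 0 of the third with one indexing pattern, which is the property that lets a framework issue one larger lookup instead of many small ones.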

Key Statistics & Figures

Total parameters in DLRM model

113 billion

This is the size of the model trained using the hybrid-parallel approach.

Speedup over dual-socket CPU system

672x

Achieved using the DGX A100 for training the DLRM model.

Total size of embeddings

421 GiB

The total size of all embeddings for the model being trained.
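The parameter count and the embedding size above are consistent with each other if parameters are stored in 32-bit floats, which is an assumption here rather than something this summary states. The short calculation below checks that, and compares the result with the aggregate memory of an eight-GPU node. The 40 and 80 GiB per-GPU figures are the two A100 memory variants, treated as GiB for simplicity; which variant was used is not stated in this summary.

```python
PARAMS = 113e9          # parameter count quoted above
BYTES_PER_PARAM = 4     # assumption: FP32 storage
GIB = 2 ** 30

# Size of all parameters in GiB under the FP32 assumption.
embedding_gib = PARAMS * BYTES_PER_PARAM / GIB

GPUS_PER_NODE = 8       # a DGX A100 contains eight A100 GPUs

# Does the model fit in the node's combined GPU memory, per A100 variant?
# This ignores activations, optimizer state, and framework overhead.
fits = {gpu_gib: embedding_gib < GPUS_PER_NODE * gpu_gib
        for gpu_gib in (40, 80)}
```

Under these assumptions the parameters come to roughly 421 GiB, which is far more than any single GPU holds and more than eight 40 GiB GPUs combined, but within the combined memory of eight 80 GiB GPUs. That gap is why the embedding tables have to be distributed across GPUs.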

Technologies & Tools

Framework

TensorFlow 2

Used for training the recommender system.

Hardware

NVIDIA DGX A100

The platform used for training the large-scale recommender system.

Library

Horovod

Used for collective communication operations in multi-GPU training.

Key Actionable Insights

1. Implement hybrid-parallel training to optimize the performance of large recommender systems.
This approach uses multiple GPUs efficiently and can cut training time from days to minutes, which matters for rapid experimentation and production deployment.

2. Use mixed precision training to increase computational speed and reduce memory usage.
Mixed precision gives faster training while maintaining model accuracy, which makes it a valuable technique for deep learning practitioners.

3. Consider column-wise split mode for embedding tables that exceed the memory of a single GPU.
This mode distributes a large embedding table across multiple GPUs, so very large models can be trained without hitting per-GPU memory limits.
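On the mixed-precision insight above: one common reason accuracy can be preserved is loss scaling, which keeps small gradients from underflowing in 16-bit floats. The summary does not say which mechanism the article relied on, so the following is a generic standard-library illustration; the gradient value and scale factor are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Half precision uses half the bytes of single precision per value.
fp16_bytes = struct.calcsize("e")
fp32_bytes = struct.calcsize("f")

tiny_grad = 1e-8       # hypothetical gradient value
LOSS_SCALE = 1024.0    # hypothetical scale factor

# Without scaling, the gradient underflows to zero when cast to FP16.
naive = to_fp16(tiny_grad)

# With scaling: multiply before the FP16 cast, divide afterwards in full precision.
recovered = to_fp16(tiny_grad * LOSS_SCALE) / LOSS_SCALE
```

Here `naive` is exactly zero, while `recovered` stays close to the original value. Because embedding lookups are memory-bound, halving the bytes per value also reduces the data moved per lookup.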

Common Pitfalls

1. Neglecting to optimize embedding lookups, which can create significant performance bottlenecks.

Embedding lookups are memory-bound operations, and skipping strategies such as mixed precision or table fusion can slow training considerably.

2. Training a large model on a single GPU without considering memory constraints.

This can result in out-of-memory errors. Large embedding tables call for model parallelism or column-wise splitting instead.

Related Concepts

Deep Learning Recommendation Models

Model Parallelism

Data Parallelism

Performance Optimization Techniques