Facebook Trains ImageNet in 1 Hour

Facebook published a paper today detailing how they trained a model on nearly 1.3 million images in under an hour using 256 Tesla P100 GPUs, a job that previously took…

Brad Nemire

Overview

Facebook's recent paper describes training a ResNet-50 deep learning model on ImageNet (nearly 1.3 million images) in under an hour using 256 Tesla P100 GPUs, down from 29 hours. This was made possible by distributing training with larger minibatches across more GPUs and by optimizing communication between the GPUs.

What You'll Learn

1. How to train deep learning models using distributed systems
2. Why using larger minibatch sizes can improve training efficiency
3. How to implement the linear scaling rule for learning rates

Prerequisites & Requirements

  • Understanding of deep learning concepts and GPU architecture
  • Familiarity with NVIDIA Collective Communications Library (NCCL) (optional)

Key Questions Answered

How did Facebook reduce ImageNet training time from 29 hours to 1 hour?
Facebook achieved this reduction by distributing training across 256 Tesla P100 GPUs with minibatches of up to 8,192 images. They applied a linear scaling rule to the learning rate and developed a warmup scheme to overcome optimization difficulties early in training, which yielded near-linear SGD scaling.
What technology did Facebook use for deep learning in this project?
Facebook utilized the open-source deep learning framework Caffe2 and their Big Basin GPU server, which features eight NVIDIA Tesla P100 GPU accelerators interconnected using NVIDIA NVLink. This setup facilitated efficient training across multiple GPUs.
What is the significance of using NCCL in distributed training?
NVIDIA Collective Communications Library (NCCL) provides optimized multi-GPU collective communication. In this project it handled efficient local reduction, which cuts communication overhead during distributed training and improves scalability.
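
To make the idea of a collective reduction concrete, here is a minimal pure-Python simulation of what an allreduce accomplishes in data-parallel training. It does not call NCCL. The function names `local_gradient` and `allreduce_mean` are hypothetical, and the toy model (a single weight fit by least squares) is an assumption chosen for brevity. Each simulated worker computes a gradient on its own shard, and the averaged result matches the gradient over the full minibatch when the shards are equal in size.

```python
# Simulation only: plain Python lists stand in for per-GPU gradient buffers.
# No NCCL calls are made here.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)**2 with respect to w over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Mimic an allreduce: every worker ends up holding the same average."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


# A minibatch of 8 (x, y) pairs split evenly across 4 simulated workers.
minibatch = [(float(i), 2.0 * i) for i in range(1, 9)]
num_workers = 4
shard_size = len(minibatch) // num_workers
shards = [minibatch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]

w = 0.5
per_worker = [local_gradient(w, shard) for shard in shards]
reduced = allreduce_mean(per_worker)        # what each worker sees after the reduction
full_batch = local_gradient(w, minibatch)   # single-device reference
```

Because every worker ends up with the same averaged gradient, each can apply an identical weight update and the model replicas stay in sync.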

Key Statistics & Figures

Training time reduction
From 29 hours to 1 hour
This statistic highlights the efficiency achieved by Facebook's distributed training approach.
Number of training images
Nearly 1.3 million images
This figure illustrates the scale of the training task that was accomplished in a significantly reduced timeframe.
Minibatch size
Up to 8,192 images
Split evenly across 256 GPUs, this minibatch works out to 32 images per GPU per iteration.
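
The figures above can be related to one another with a little arithmetic. The sketch below uses the GPU count, server size, and minibatch size stated in the article; the exact training-set size of 1,281,167 images is an assumption (the commonly cited ImageNet-1k training split), since the article only says "nearly 1.3 million".

```python
import math

# Figures stated in the article.
total_gpus = 256
gpus_per_server = 8        # one Big Basin server holds eight Tesla P100s
global_minibatch = 8192

# Assumption: size of the ImageNet-1k training split.
train_images = 1_281_167

per_gpu_minibatch = global_minibatch // total_gpus
servers = total_gpus // gpus_per_server
iterations_per_epoch = math.ceil(train_images / global_minibatch)

print(per_gpu_minibatch, servers, iterations_per_epoch)  # 32 32 157
```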

Technologies & Tools

Framework
Caffe2
Used for building and training deep learning models.
Hardware
NVIDIA Tesla P100
GPU accelerators used for distributed training.
Hardware
NVIDIA NVLink
Interconnect technology used to link multiple GPUs.
Library
NVIDIA Collective Communications Library (NCCL)
Used for optimizing multi-GPU communication during training.

Key Actionable Insights

1. Using larger minibatches lets training spread across more GPUs, which can drastically reduce training time.
By increasing the minibatch size to 8,192 images across 256 GPUs, Facebook maintained accuracy while significantly speeding up training, which is essential for large-scale image recognition tasks.
2. Applying a linear scaling rule to the learning rate is vital when increasing the minibatch size.
Under this rule, when the minibatch size is multiplied by k, the learning rate is multiplied by k as well. This helps maintain model performance and stability during training, especially when the work is spread across many GPUs.
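
A minimal sketch of the linear scaling rule combined with a gradual warmup might look like the following. The function names are hypothetical, and the base learning rate of 0.1 per 256 images, the five warmup epochs, and the 157 iterations per epoch are illustrative assumptions rather than values stated in this article.

```python
def scaled_learning_rate(base_lr, base_batch, batch_size):
    """Linear scaling rule: multiply the learning rate by k = batch_size / base_batch."""
    return base_lr * batch_size / base_batch


def warmup_learning_rate(iteration, warmup_iterations, base_lr, target_lr):
    """Ramp linearly from base_lr to target_lr, then hold target_lr."""
    if iteration >= warmup_iterations:
        return target_lr
    fraction = iteration / warmup_iterations
    return base_lr + fraction * (target_lr - base_lr)


# Illustrative assumptions, not values from the article.
BASE_LR = 0.1
BASE_BATCH = 256
WARMUP_EPOCHS = 5
ITERATIONS_PER_EPOCH = 157   # assumed for an 8,192-image minibatch on ImageNet-1k

target_lr = scaled_learning_rate(BASE_LR, BASE_BATCH, 8192)   # k = 32
warmup_iterations = WARMUP_EPOCHS * ITERATIONS_PER_EPOCH

# Learning rate at the start, middle, and end of warmup.
schedule = [
    warmup_learning_rate(i, warmup_iterations, BASE_LR, target_lr)
    for i in (0, warmup_iterations // 2, warmup_iterations)
]
```

Starting from the small-batch learning rate and ramping up avoids taking very large steps while the weights are still changing rapidly early in training, which is the difficulty the warmup scheme is meant to address.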

Common Pitfalls

1. Failing to adjust the learning rate when increasing the minibatch size can lead to suboptimal training outcomes.
Without a proper scaling rule, models may converge poorly or take longer to train, negating the benefits of larger batches.
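
A toy illustration of this pitfall, under strong simplifying assumptions: plain (non-stochastic) gradient descent on the quadratic f(w) = 0.5 * w**2, with hypothetical step counts and learning rates. Growing the minibatch by a factor of k over the same number of epochs means k times fewer weight updates, so an unscaled learning rate makes far less progress. This is not a model of ResNet-50 training, only a sketch of the mechanism.

```python
def final_error(learning_rate, steps, start=1.0):
    """Run plain gradient descent on f(w) = 0.5 * w**2 and return |w| at the end."""
    w = start
    for _ in range(steps):
        w -= learning_rate * w   # the gradient of 0.5 * w**2 is w
    return abs(w)


k = 32                     # minibatch grows 32x, e.g. 256 -> 8,192
small_batch_steps = 320    # hypothetical step budget at the small minibatch
large_batch_steps = small_batch_steps // k   # same epochs, 32x fewer updates
base_lr = 0.01             # hypothetical base learning rate

baseline = final_error(base_lr, small_batch_steps)
unscaled = final_error(base_lr, large_batch_steps)      # learning rate left unchanged
scaled = final_error(k * base_lr, large_batch_steps)    # linear scaling rule applied

print(f"baseline={baseline:.3f} unscaled={unscaled:.3f} scaled={scaled:.3f}")
```

With these toy numbers, the unscaled run barely moves from its starting point, while the scaled run ends close to the small-batch baseline.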