Scaling TensorFlow and Caffe to 256 GPUs

Brad Nemire

IBM Research unveiled a “Distributed Deep Learning” (DDL) library that enables cuDNN-accelerated deep learning frameworks like TensorFlow, Caffe…

NVIDIA


Tags: Deep Learning, Machine Learning, ResNet, TensorFlow

Overview

IBM Research introduced a Distributed Deep Learning (DDL) library that lets deep learning frameworks such as TensorFlow and Caffe scale across hundreds of GPUs spread over multiple IBM servers. The library sharply reduces training time: IBM reports a 58x speedup when training ResNet-101 on ImageNet-22K with 256 NVIDIA P100 GPUs.

What You'll Learn

1. How to utilize the Distributed Deep Learning library for scaling deep learning models
2. Why using multiple GPUs can drastically reduce training times for AI models
3. When to implement distributed training in your deep learning projects

Key Questions Answered

How does the Distributed Deep Learning library improve training times?

The Distributed Deep Learning library lets deep learning frameworks use hundreds of GPUs across multiple IBM servers, which significantly reduces training time. For instance, training ResNet-101 on ImageNet-22K dropped from 16 days to 7 hours, which the article reports as a 58x speedup.
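The speedup figure can be sanity-checked from the two durations quoted above. The following is a minimal sketch, not part of the original article; the helper name `speedup` is introduced here for illustration only.

```python
def speedup(baseline_hours: float, new_hours: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    if new_hours <= 0:
        raise ValueError("new_hours must be positive")
    return baseline_hours / new_hours


baseline_hours = 16 * 24  # 16 days, as quoted in the article
ddl_hours = 7             # 7 hours, as quoted in the article

ratio = speedup(baseline_hours, ddl_hours)
print(f"{baseline_hours} h / {ddl_hours} h = {ratio:.1f}x")  # 384 h / 7 h = 54.9x
```

The rounded durations give roughly 55x, slightly below the reported 58x. A likely explanation, though the article does not say so, is that the published "16 days" and "7 hours" are themselves rounded.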

What deep learning frameworks are compatible with the DDL library?

The Distributed Deep Learning library supports several popular deep learning frameworks, including TensorFlow, Caffe, Torch, and Chainer, so teams can scale training without switching frameworks.

What hardware was used to achieve the training speedup mentioned in the article?

The training speedup was achieved using 64 IBM Power Systems servers equipped with a total of 256 NVIDIA P100 GPU accelerators, an average of four GPUs per server.
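To make the cluster shape concrete, here is a small sketch of how a distributed job of this size might number its workers. It assumes the 256 GPUs are spread evenly over the 64 servers, which the article implies but does not state, and the `locate` helper is hypothetical rather than part of DDL.

```python
from typing import Tuple

TOTAL_GPUS = 256   # from the article
NUM_SERVERS = 64   # from the article

# Assumption: every server hosts the same number of GPUs.
GPUS_PER_SERVER, remainder = divmod(TOTAL_GPUS, NUM_SERVERS)
if remainder != 0:
    raise ValueError("GPUs do not divide evenly across servers")


def locate(global_rank: int) -> Tuple[int, int]:
    """Map a global worker rank to (server index, local GPU index)."""
    if not 0 <= global_rank < TOTAL_GPUS:
        raise ValueError(f"rank must be in [0, {TOTAL_GPUS})")
    return divmod(global_rank, GPUS_PER_SERVER)


print(GPUS_PER_SERVER)  # 4
print(locate(0))        # (0, 0)
print(locate(255))      # (63, 3)
```

Distinguishing the server index from the local GPU index matters in practice because communication between GPUs in one server is typically much faster than communication across the network.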

Key Statistics & Figures

Training time reduction

58x speedup

Reported for training ResNet-101 on ImageNet-22K with 256 NVIDIA P100 GPUs.

Total GPUs used

256 NVIDIA P100

Utilized across 64 IBM Power Systems servers for training.

Original training duration

16 days

Training time for ResNet-101 on ImageNet-22K before using the DDL library.

New training duration

7 hours

Training time for the same workload with the DDL library on 256 GPUs.

Technologies & Tools

Library

Distributed Deep Learning

Enables scaling of deep learning frameworks across multiple GPUs.

Framework

TensorFlow

One of the deep learning frameworks that can utilize the DDL library.

Framework

Caffe

Another deep learning framework compatible with the DDL library.

Hardware

NVIDIA P100

GPU accelerators used for training deep learning models.

Software

IBM PowerAI

Enterprise deep learning software that includes the DDL library.

Key Actionable Insights

1. Use the Distributed Deep Learning library to make model training more efficient.
Adopting the library can significantly reduce the time needed to train complex models, which lets data scientists iterate faster and improve model performance.

2. Consider scaling your deep learning projects across multiple GPUs to handle larger datasets.
Multiple GPUs not only speed up training but also make larger datasets and more complex models practical, both of which are often needed for state-of-the-art results.

3. Explore the technical preview of DDL available in IBM's PowerAI software.
The preview lets organizations experiment with the scaling features without a full commitment, making it easier to assess the potential benefits for their specific use cases.
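Multi-GPU scaling of the kind described in the second insight is commonly done with synchronous data parallelism: each worker computes a gradient on its own shard of the data, the gradients are averaged across workers, and every worker applies the same update. The article does not describe DDL's internals, so the sketch below is a generic pure-Python illustration of that technique, not DDL's algorithm or API. All names (`local_gradient`, `allreduce_mean`, `train`) are hypothetical, and the "workers" are simulated in a single process.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values: List[float]) -> float:
    """Stand-in for an allreduce: every worker would receive this average."""
    return sum(values) / len(values)


def train(data: List[Sample], num_workers: int,
          steps: int = 200, lr: float = 0.01) -> float:
    """Synchronous data-parallel gradient descent with simulated workers."""
    # Equal-sized shards, one per worker; leftover samples are dropped.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        # On real hardware these gradients are computed in parallel.
        grads = [local_gradient(w, shard) for shard in shards]
        w -= lr * allreduce_mean(grads)
    return w


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples from y = 3x
w_single = train(data, num_workers=1)
w_four = train(data, num_workers=4)
print(w_single, w_four)  # both converge to approximately 3.0
```

With equal-sized shards, the average of the per-worker gradients equals the full-batch gradient, so one worker and four workers follow the same optimization path. The wall-clock gain on real hardware comes from computing the per-shard gradients simultaneously, and it is limited by the cost of the averaging step, which is the communication that a library like DDL is meant to make efficient.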