Getting Kubernetes Ready for the NVIDIA A100 GPU with Multi-Instance GPU

Multi-Instance GPU (MIG) is a new feature of the latest generation of NVIDIA GPUs, such as A100. It enables users to maximize the utilization of a single GPU by…

Arts Yang
12 min read, advanced

Overview

This article explains how to prepare Kubernetes to use the NVIDIA A100 GPU with the Multi-Instance GPU (MIG) feature. It outlines the benefits of MIG, including better GPU utilization and support for multiple concurrent workloads, and gives detailed instructions for configuring Kubernetes to use these capabilities.

What You'll Learn

1. How to enable Multi-Instance GPU (MIG) support in Kubernetes
2. When to use the none, single, or mixed strategies for MIG in Kubernetes
3. How to configure Kubernetes job scripts for different MIG strategies

Prerequisites & Requirements

  • A supported Docker version with the latest version of nvidia-docker2
  • Basic understanding of Kubernetes and GPU concepts (optional)

Key Questions Answered

What is Multi-Instance GPU (MIG) and how does it work with Kubernetes?
Multi-Instance GPU (MIG) is a feature of NVIDIA GPUs such as the A100 that lets multiple workloads run concurrently on a single GPU with hardware-level isolation. This improves resource utilization and lets multiple users share one GPU effectively. Kubernetes supports MIG through two NVIDIA plugins, k8s-device-plugin and gpu-feature-discovery, which manage the GPU resources on each node.
How do I enable MIG support for Kubernetes?
To enable MIG support for Kubernetes, add the k8s-device-plugin and gpu-feature-discovery Helm repositories and make sure the plugin versions you install are compatible. Then choose the MIG strategy you want to use (none, single, or mixed) and deploy both plugins with that strategy.
What are the differences between the none, single, and mixed strategies in MIG?
With the none strategy, the plugins make no use of MIG and GPUs are exposed as traditional whole-GPU resources. With the single strategy, a node exposes only one type of MIG device across all of its GPUs. With the mixed strategy, a node may expose a combination of different MIG device types and non-MIG GPUs.
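The enablement steps described above can be sketched as a small Python helper that only builds the Helm command lines and does not run them. The repository aliases (nvdp, nvgfd), the chart names, and the migStrategy chart value are assumptions based on NVIDIA's published charts and may differ between chart versions, so check them against the chart README before use. No versions are pinned here because the summary does not name them.

```python
import shlex

# Assumed Helm repository aliases and URLs for the two NVIDIA plugins.
REPOS = {
    "nvdp": "https://nvidia.github.io/k8s-device-plugin",
    "nvgfd": "https://nvidia.github.io/gpu-feature-discovery",
}
# Assumed chart names inside those repositories.
CHARTS = ["nvdp/nvidia-device-plugin", "nvgfd/gpu-feature-discovery"]
STRATEGIES = ("none", "single", "mixed")


def helm_commands(strategy):
    """Return the helm commands (as argument lists) for one MIG strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown MIG strategy: {strategy!r}")
    # Register both repositories, then refresh the local chart index.
    cmds = [["helm", "repo", "add", name, url] for name, url in REPOS.items()]
    cmds.append(["helm", "repo", "update"])
    # Both plugins are deployed with the same strategy value.
    for chart in CHARTS:
        cmds.append(["helm", "install", "--generate-name",
                     "--set", f"migStrategy={strategy}", chart])
    return cmds


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in helm_commands("mixed"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without Helm or cluster access.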

Technologies & Tools

  • Kubernetes (orchestration): used for managing containerized applications and enabling GPU resource allocation.
  • nvidia-docker2 (containerization): facilitates the use of NVIDIA GPUs in Docker containers.

Key Actionable Insights

1. Implementing MIG can significantly enhance GPU utilization in your Kubernetes cluster, allowing multiple deep learning workloads to run simultaneously on a single A100 GPU. This is particularly beneficial in environments where GPU resources are underutilized, as it maximizes the return on investment in hardware.
2. When configuring Kubernetes for MIG, carefully choose the appropriate strategy (none, single, or mixed) based on your workload requirements. Selecting the right strategy ensures optimal performance and resource allocation, preventing job fragmentation across nodes.
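How the chosen strategy changes a job definition can be illustrated with a minimal Python sketch that builds a Pod manifest. It assumes the resource naming used by NVIDIA's device plugin: nvidia.com/gpu under the none and single strategies, and nvidia.com/mig-<profile> under the mixed strategy. The pod name, the image, and the 1g.5gb profile are placeholders chosen for illustration, not values taken from the article.

```python
import json


def gpu_resource_name(strategy, profile=None):
    """Map a MIG strategy to the resource name a pod would request (assumed naming)."""
    if strategy in ("none", "single"):
        # Whole GPUs and single-strategy MIG devices share one resource name.
        return "nvidia.com/gpu"
    if strategy == "mixed":
        if not profile:
            raise ValueError("the mixed strategy needs a MIG profile such as '1g.5gb'")
        return f"nvidia.com/mig-{profile}"
    raise ValueError(f"unknown MIG strategy: {strategy!r}")


def pod_manifest(name, image, strategy, profile=None, count=1):
    """Build a Pod manifest (as a dict) that requests `count` GPU devices."""
    resource = gpu_resource_name(strategy, profile)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # List the devices visible inside the container.
                "command": ["nvidia-smi", "-L"],
                "resources": {"limits": {resource: count}},
            }],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; kubectl accepts JSON manifests as well as YAML.
    manifest = pod_manifest("mig-demo", "nvidia/cuda:11.0-base", "mixed", "1g.5gb")
    print(json.dumps(manifest, indent=2))
```

Under the single strategy the same function returns nvidia.com/gpu, so a job written for whole GPUs keeps its resource request unchanged.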

Common Pitfalls

1. Failing to properly configure the MIG strategy can lead to inefficient resource utilization and job failures. Ensure that the selected strategy aligns with the types of workloads you intend to run, as mismatches can prevent jobs from executing correctly.
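One way to catch such a mismatch before submitting a job is to compare the resources a job requests with what a node advertises. The sketch below is a hypothetical pre-submit check, not part of any NVIDIA or Kubernetes tool. It assumes the allocatable map has the shape of the status.allocatable field returned by kubectl get node -o json, where quantities are strings, and the resource names in the example are illustrative.

```python
def unschedulable_requests(requested, allocatable):
    """Return the resource names whose request exceeds what the node advertises."""
    return sorted(
        name
        for name, quantity in requested.items()
        # A resource missing from the node counts as zero available.
        if int(allocatable.get(name, 0)) < quantity
    )


if __name__ == "__main__":
    # Illustrative node running the mixed strategy: it advertises MIG devices only.
    node_allocatable = {"cpu": "64", "nvidia.com/mig-1g.5gb": "7"}

    # A job written for whole GPUs does not match this node.
    print(unschedulable_requests({"nvidia.com/gpu": 1}, node_allocatable))

    # A job that requests the advertised MIG device fits.
    print(unschedulable_requests({"nvidia.com/mig-1g.5gb": 2}, node_allocatable))
```

An empty result means every requested resource name exists on the node in sufficient quantity. It does not guarantee scheduling, since other pods may already hold those devices.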

Related Concepts

Deep Learning Workloads
NVIDIA A100 GPU Capabilities
Kubernetes Resource Management