Orchestrating Accelerated Virtual Machines with Kubernetes Using NVIDIA GPU Operator

The latest release of GPU Operator adds support for KubeVirt and OpenShift Virtualization, enabling the use of Kubernetes to orchestrate GPU-accelerated…

Charu Chaubal

Overview

This article covers how NVIDIA GPU Operator v22.9 lets Kubernetes orchestrate GPU-accelerated virtual machines through KubeVirt and OpenShift Virtualization. With this support, containerized and virtualized GPU workloads can be managed side by side in the same cluster.

What You'll Learn

1. How to deploy GPU-accelerated virtual machines using NVIDIA GPU Operator
2. Why KubeVirt and OpenShift Virtualization are essential for managing VMs in Kubernetes
3. When to use PCI passthrough versus NVIDIA vGPU for GPU workloads

Prerequisites & Requirements

  • Understanding of Kubernetes and virtualization concepts
  • Familiarity with NVIDIA GPU Operator and KubeVirt (optional)

Key Questions Answered

How does NVIDIA GPU Operator support KubeVirt and OpenShift Virtualization?
NVIDIA GPU Operator v22.9 introduces support for GPU-accelerated virtual machines, allowing them to run alongside GPU-accelerated containers in the same Kubernetes cluster. This integration simplifies the management of both workloads and enhances operational efficiency by providing unified orchestration capabilities.
What are the limitations of using NVIDIA GPU Operator with virtual machines?
Currently, MIG-backed vGPU instances are not supported, and a GPU worker node can run only one type of GPU workload at a time: containers, VMs with PCI passthrough, or VMs with NVIDIA vGPU. GPU nodes therefore need to be assigned to a workload type as part of capacity planning.
What configurations are necessary to enable GPU support for virtual machines?
To enable GPU support for virtual machines, set the 'sandboxWorkloads.enabled' option to 'true' in ClusterPolicy. The option is disabled by default; enabling it lets the GPU Operator deploy and manage the software components that virtual machines need.
What is the role of the NVIDIA KubeVirt device plug-in?
The NVIDIA KubeVirt device plug-in discovers and advertises both physical and NVIDIA vGPU devices to kubelet, enabling them to be requested and assigned to virtual machines. This facilitates the integration of GPU resources into the Kubernetes orchestration framework.
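
The ClusterPolicy setting described above can be sketched as a small patch document. This is a minimal illustration rather than the operator's official procedure: it assumes the option lives at `spec.sandboxWorkloads.enabled` (following the dotted name used in the text), and the helper name `sandbox_workloads_patch` is hypothetical.

```python
import json


def sandbox_workloads_patch(enabled: bool = True) -> dict:
    """Build a merge-style patch that toggles sandbox (VM) workload support.

    Assumption: the ClusterPolicy field path is spec.sandboxWorkloads.enabled,
    inferred from the 'sandboxWorkloads.enabled' option named in the article.
    """
    return {"spec": {"sandboxWorkloads": {"enabled": enabled}}}


# Render the patch as JSON so it can be reviewed before being applied.
patch = sandbox_workloads_patch()
print(json.dumps(patch, indent=2))
```

The printed JSON could be supplied to a merge-style patch against the ClusterPolicy resource. Check the resource name and field path against the installed GPU Operator version before applying it.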

Technologies & Tools

  • Kubernetes (Orchestration): Used for managing containerized and virtualized workloads in a unified environment.
  • NVIDIA GPU Operator (Software): Facilitates GPU management and orchestration for both containers and virtual machines.
  • KubeVirt (Virtualization): Enables the management of virtual machines within Kubernetes.
  • OpenShift Virtualization (Virtualization): Provides virtualization capabilities integrated with Kubernetes.

Key Actionable Insights

1. Enable the 'sandboxWorkloads.enabled' option in ClusterPolicy to run GPU-accelerated virtual machines.
   With this setting enabled, the GPU Operator deploys and manages the software components that virtual machines need.
2. Use node labels to control which GPU workloads each node runs.
   The 'nvidia.com/gpu.workload.config' node label tells the GPU Operator which type of GPU workload a node supports, so administrators can dedicate nodes to containers, VMs with PCI passthrough, or VMs with NVIDIA vGPU.
3. Understand the trade-offs between PCI passthrough and NVIDIA vGPU.
   PCI passthrough offers the highest performance but does not allow GPU sharing, while NVIDIA vGPU lets multiple VMs share a single GPU. Choose between them based on workload requirements.
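
The node-labeling insight above can be sketched as follows. The label key comes from the article; the three label values (`container`, `vm-passthrough`, `vm-vgpu`) are assumed here to correspond to the three workload types and should be verified against the GPU Operator documentation. The node names and helper functions are hypothetical.

```python
import json

LABEL_KEY = "nvidia.com/gpu.workload.config"

# Assumed label values, one per workload type named in the article.
WORKLOAD_TYPES = ("container", "vm-passthrough", "vm-vgpu")


def node_label_patch(workload: str) -> dict:
    """Build a merge-style patch that sets the workload label on a node."""
    if workload not in WORKLOAD_TYPES:
        raise ValueError(f"unknown workload type: {workload!r}")
    return {"metadata": {"labels": {LABEL_KEY: workload}}}


def plan_labels(assignments: dict) -> dict:
    """Map each node name to the label patch for its intended workload."""
    return {node: node_label_patch(w) for node, w in assignments.items()}


# Hypothetical cluster: one node per workload type, since a node can
# run only one type of GPU workload at a time.
plan = plan_labels({
    "worker-0": "container",
    "worker-1": "vm-passthrough",
    "worker-2": "vm-vgpu",
})
for node, patch in plan.items():
    print(node, json.dumps(patch))
```

Rejecting unknown values up front keeps a typo in a label from silently leaving a node on the default workload type.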

Common Pitfalls

1. Assuming that MIG-backed vGPU instances are supported when they are not.
   This misconception can lead to deployment failures, so verify the supported configurations before planning your GPU workloads.
2. Neglecting to set the correct node labels for GPU workload management.
   Without the 'nvidia.com/gpu.workload.config' label, a node falls back to the GPU Operator's default workload type, which may not match what you intend to run there and can leave GPUs underused.
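
The labeling pitfall above can be caught with a simple audit before workloads are scheduled. This is a sketch under stated assumptions: the default workload type is taken to be `container`, and the node data is a hypothetical in-memory mapping rather than a live query against a cluster.

```python
LABEL_KEY = "nvidia.com/gpu.workload.config"


def effective_workload(node_labels: dict, default_workload: str = "container") -> str:
    """Return the workload type a node would end up with.

    Assumption: an unlabeled node falls back to the operator-wide default,
    taken here to be 'container'.
    """
    return node_labels.get(LABEL_KEY, default_workload)


def unlabeled_nodes(nodes: dict) -> list:
    """List the nodes that rely on the default instead of an explicit label."""
    return sorted(name for name, labels in nodes.items() if LABEL_KEY not in labels)


# Hypothetical node inventory: node name -> labels.
nodes = {
    "worker-0": {LABEL_KEY: "vm-passthrough"},
    "worker-1": {},
    "worker-2": {LABEL_KEY: "vm-vgpu"},
}

for name, labels in sorted(nodes.items()):
    print(name, effective_workload(labels))
print("relying on default:", unlabeled_nodes(nodes))
```

Listing the nodes that rely on the default makes the fallback visible, so an unlabeled node intended for VMs is noticed before it is configured for containers.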

Related Concepts

Cloud-native Application Deployment
Container Orchestration
Virtualization Technologies
GPU Acceleration in Computing