Orchestrating Accelerated Virtual Machines with Kubernetes Using NVIDIA GPU Operator

The latest release of GPU Operator adds support for KubeVirt and OpenShift Virtualization, enabling the use of Kubernetes to orchestrate GPU-accelerated…

NVIDIA

Charu Chaubal

Overview

The article discusses how NVIDIA GPU Operator v22.9 enhances Kubernetes orchestration by enabling GPU-accelerated virtual machines through KubeVirt and OpenShift Virtualization. It highlights the integration of NVIDIA technologies that allow for efficient management of both containerized and virtualized workloads in a unified environment.

What You'll Learn

How to deploy GPU-accelerated virtual machines using NVIDIA GPU Operator

Why KubeVirt and OpenShift Virtualization are essential for managing VMs in Kubernetes

When to use PCI passthrough versus NVIDIA vGPU for GPU workloads

Prerequisites & Requirements

Understanding of Kubernetes and virtualization concepts
Familiarity with NVIDIA GPU Operator and KubeVirt (optional)

Key Questions Answered

How does NVIDIA GPU Operator support KubeVirt and OpenShift Virtualization?

NVIDIA GPU Operator v22.9 introduces support for GPU-accelerated virtual machines, allowing them to run alongside GPU-accelerated containers in the same Kubernetes cluster. This integration simplifies the management of both workloads and enhances operational efficiency by providing unified orchestration capabilities.

What are the limitations of using NVIDIA GPU Operator with virtual machines?

As of v22.9, MIG-backed vGPU instances are not supported, and a GPU worker node can run only one type of GPU workload at a time: containers, VMs with PCI passthrough, or VMs with NVIDIA vGPU. GPU resource allocation therefore has to be planned node by node.
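The one-workload-type-per-node rule can be checked against a placement plan before anything is deployed. The following is a minimal sketch, not part of the GPU Operator: the function name, the node names, and the plan format are hypothetical, and the three type strings are assumed identifiers for the workload types listed above.

```python
from collections import defaultdict

# Assumed identifiers for the three workload types named in the text.
WORKLOAD_TYPES = {"container", "vm-passthrough", "vm-vgpu"}


def find_mixed_nodes(placements):
    """Return {node: sorted workload types} for nodes assigned more than one type.

    `placements` is an iterable of (node_name, workload_type) pairs.
    """
    types_by_node = defaultdict(set)
    for node, workload_type in placements:
        if workload_type not in WORKLOAD_TYPES:
            raise ValueError(f"unknown workload type: {workload_type!r}")
        types_by_node[node].add(workload_type)
    # Keep only the nodes that violate the single-type rule.
    return {
        node: sorted(types)
        for node, types in types_by_node.items()
        if len(types) > 1
    }


if __name__ == "__main__":
    # Hypothetical plan: gpu-node-2 is asked to host two workload types.
    plan = [
        ("gpu-node-1", "container"),
        ("gpu-node-2", "vm-passthrough"),
        ("gpu-node-2", "vm-vgpu"),
    ]
    print(find_mixed_nodes(plan))
```

An empty result means every node in the plan carries a single workload type.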

What configurations are necessary to enable GPU support for virtual machines?

To enable GPU support for virtual machines, set the 'sandboxWorkloads.enabled' option to 'true' in ClusterPolicy. The option is disabled by default; enabling it lets the GPU Operator deploy and manage the software components that virtual machines require.
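One way to flip this option on an existing installation is a JSON merge patch applied with kubectl. The sketch below only builds the patch and the command line; it does not contact a cluster. The resource name "cluster-policy" is an assumption (confirm it with `kubectl get clusterpolicy`), and the helper names are illustrative.

```python
import json
import shlex


def sandbox_workloads_patch(enabled=True):
    """Build a JSON merge patch that sets spec.sandboxWorkloads.enabled."""
    return {"spec": {"sandboxWorkloads": {"enabled": enabled}}}


def kubectl_patch_command(resource_name, patch):
    """Return the kubectl argument list that applies `patch` to a ClusterPolicy."""
    return [
        "kubectl", "patch", "clusterpolicy", resource_name,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # "cluster-policy" is an assumed resource name, not taken from the article.
    cmd = kubectl_patch_command("cluster-policy", sandbox_workloads_patch())
    # Print a shell-safe command line for review before running it by hand.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the change reviewable; the same value can also be set at install time through the operator's configuration.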

What is the role of the NVIDIA KubeVirt device plug-in?

The NVIDIA KubeVirt device plug-in discovers and advertises both physical and NVIDIA vGPU devices to kubelet, enabling them to be requested and assigned to virtual machines. This facilitates the integration of GPU resources into the Kubernetes orchestration framework.
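Once a device is advertised, a KubeVirt virtual machine requests it by resource name in the `gpus` list of its domain devices. The sketch below builds only that fragment of a VM spec. The resource name "nvidia.com/GA102GL_A10" is an illustrative placeholder (the names actually advertised on a cluster depend on its GPUs), the helper name is hypothetical, and the KubeVirt configuration that permits the device is assumed to be in place already.

```python
import json


def gpu_devices_fragment(device_names):
    """Build the spec.domain.devices.gpus fragment of a KubeVirt VM spec.

    Each entry pairs a VM-local alias ("gpu0", "gpu1", ...) with the
    resource name advertised to kubelet by the device plug-in.
    """
    gpus = [
        {"name": f"gpu{index}", "deviceName": device_name}
        for index, device_name in enumerate(device_names)
    ]
    return {"spec": {"domain": {"devices": {"gpus": gpus}}}}


if __name__ == "__main__":
    # Placeholder resource name; substitute one advertised on your nodes.
    fragment = gpu_devices_fragment(["nvidia.com/GA102GL_A10"])
    print(json.dumps(fragment, indent=2))
```

The fragment would be merged into a full VirtualMachine or VirtualMachineInstance manifest, which also needs the usual memory, disk, and volume settings.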

Technologies & Tools

Orchestration

Kubernetes

Used for managing containerized and virtualized workloads in a unified environment.

Software

NVIDIA GPU Operator

Facilitates GPU management and orchestration for both containers and virtual machines.

Virtualization

KubeVirt

Enables the management of virtual machines within Kubernetes.

Virtualization

OpenShift Virtualization

Provides virtualization capabilities integrated with Kubernetes.

Key Actionable Insights

1
Enable the 'sandboxWorkloads.enabled' option in ClusterPolicy to run GPU-accelerated virtual machines.
This setting lets the GPU Operator deploy and manage the software components that virtual machines need.

2
Use node labels to control GPU workload deployment.
With the 'nvidia.com/gpu.workload.config' node label, administrators specify which type of GPU workload each node supports, so GPU capacity can be divided deliberately between containers and VMs.

3
Understand the trade-offs between PCI passthrough and NVIDIA vGPU.
PCI passthrough offers the highest performance but does not allow GPU sharing, while NVIDIA vGPU lets multiple VMs share a single GPU. Choose between them based on whether the workload needs a whole GPU or can share one.
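The choice between the two VM modes and the resulting node label can be sketched as follows. This is a simplified decision rule under stated assumptions, not NVIDIA's guidance: the label values "container", "vm-passthrough", and "vm-vgpu" are assumed to be the accepted values of 'nvidia.com/gpu.workload.config', and the function and node names are hypothetical. The script only prints a kubectl command and does not run it.

```python
import shlex

LABEL_KEY = "nvidia.com/gpu.workload.config"


def choose_workload(runs_in_vm, needs_gpu_sharing):
    """Pick a workload type from two yes/no questions.

    Containers are used when no VM is required; vGPU when several VMs
    must share one GPU; passthrough when a VM needs a whole GPU.
    """
    if not runs_in_vm:
        return "container"
    return "vm-vgpu" if needs_gpu_sharing else "vm-passthrough"


def label_command(node_name, workload):
    """Return the kubectl argument list that labels a node for `workload`."""
    return [
        "kubectl", "label", "node", node_name,
        f"{LABEL_KEY}={workload}",
        "--overwrite",
    ]


if __name__ == "__main__":
    # Hypothetical node that should host VMs sharing a single GPU.
    workload = choose_workload(runs_in_vm=True, needs_gpu_sharing=True)
    print(shlex.join(label_command("gpu-node-1", workload)))
```

Real decisions usually weigh more than these two questions (licensing, driver support, GPU model), so treat the rule as a starting point.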

Common Pitfalls

Assuming that MIG-backed vGPU instances are supported when they are not.

This misconception can lead to deployment failures or performance issues, so it's essential to verify the supported configurations before planning your GPU workloads.

Neglecting to set the correct node labels for GPU workload management.

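Nodes without the 'nvidia.com/gpu.workload.config' label fall back to the GPU Operator's default workload type (container, unless configured otherwise). That default may not match the workload you intend to run on the node, which leaves its GPUs unusable for VMs or underused.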

Related Concepts

Cloud-native Application Deployment

Container Orchestration

Virtualization Technologies

GPU Acceleration in Computing
