How NVIDIA Uses PyTorch

566 engineering articles about PyTorch from NVIDIA's engineering team

Other NVIDIA Technologies

Python (740), Deep Learning (505), TensorFlow (444), Docker (292), Kubernetes (251), AWS (202)

Articles

NVIDIA

Advanced

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy

The article discusses using NVFP4 low-precision training to achieve higher throughput without sacrificing model accuracy.

Hugging Face, PyTorch

Aditya Vavre

7 min read

Includes Code

Has Summary

NVIDIA

Advanced

Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute

The article discusses how the NVIDIA cuda.compute library enables Python developers to write high-performance GPU code without resorting to C++.

Python, PyTorch

Daniel Rodriguez

5 min read

Includes Code

Has Summary

NVIDIA

Advanced

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s

The article discusses how NVIDIA's hardware-software co-design significantly enhanced the inference performance of Sarvam AI's Sovereign 30B model, achieving a 4x speedup on NVIDIA Blackwell archit...

Hugging Face, PyTorch, Transformer

Utkarsh Uppal

14 min read

Has Summary

NVIDIA

Advanced

Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy

The article discusses NVIDIA TensorRT LLM AutoDeploy, a beta feature that automates the inference optimization process for large language models (LLMs).

Hugging Face, PyTorch, Transformers, V

Lucas Liebenwein

8 min read

Includes Code

Has Summary

NVIDIA

Advanced

Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints

Kimi K2.5 is an advanced multimodal vision language model (VLM) developed by Kimi, optimized for various AI tasks.

Embedding, Fine-tuning, Hugging Face, PyTorch

Anu Srivastava

4 min read

Includes Code

Has Summary

NVIDIA

Advanced

Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

The article discusses the challenges of Expert Parallel communication in training Mixture-of-Experts (MoE) models and introduces Hybrid-EP, an efficient communication solution that leverages NVIDIA...

Python, PyTorch

Fan Yu

10 min read

Has Summary

NVIDIA

Intermediate

Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor

The article discusses the Universal Sparse Tensor (UST), a framework designed to efficiently handle sparse tensors across various applications, including scientific computing and deep learning.

PyTorch, SciPy

Aart J.C. Bik

13 min read

Includes Code

Has Summary

NVIDIA

Advanced

Streamlining CUB with a Single-Call API

The article discusses the transition from the traditional two-phase API of the CUB library to a new single-call API introduced in CUDA 13.1.

PyTorch

Giannis Gonidelis

8 min read

Includes Code

Has Summary

NVIDIA

Intermediate

How to Write High-Performance Matrix Multiply in NVIDIA CUDA Tile

This article provides a detailed guide on implementing high-performance matrix multiplication using NVIDIA's cuTile framework in CUDA.

Python, PyTorch

Jinman Xie

13 min read

Includes Code

Has Summary

NVIDIA

Advanced

Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell

The article discusses NVIDIA's advancements in AI model inference performance through the Blackwell architecture, emphasizing improvements in token throughput per watt and the enhancements made to ...

Deep Learning, Python, PyTorch

Ashraf Eassa

5 min read

Has Summary

NVIDIA

Advanced

Open Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs

The article discusses how recent upgrades to open source AI tools enhance the performance of small language models (SLMs) and diffusion models on NVIDIA RTX PCs.

Diffusion Models, GPT, Ollama, PyTorch

Annamalai Chockalingam

7 min read

Has Summary

NVIDIA

Intermediate

New Software and Model Optimizations Supercharge NVIDIA DGX Spark

The article discusses the latest software and model optimizations for NVIDIA DGX Spark, highlighting significant performance improvements in AI workflows.

GPT, Hugging Face, PyTorch

Allen Bourgoyne

5 min read

Has Summary

NVIDIA

Advanced

Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer

The article discusses the NVIDIA Rubin platform, which introduces six new chips designed to create a powerful AI supercomputer.

Assembly, Hugging Face, JAX, Kubernetes, Less, PyTorch, RLHF, Transformer

Kyle Aubrey

59 min read

Has Summary

NVIDIA

Advanced

Accelerate AI Inference for Edge and Robotics with NVIDIA Jetson T4000 and NVIDIA JetPack 7.1

NVIDIA introduces the Jetson T4000, enhancing AI and real-time reasoning for robotics and edge AI applications with up to 1200 FP4 TFLOPs of AI compute and 64 GB of memory.

Mistral, Python, PyTorch

Shashank Maheshwari

9 min read

Has Summary

NVIDIA

Advanced

Accelerating AI-Powered Chemistry and Materials Science Simulations with NVIDIA ALCHEMI Toolkit-Ops

The article discusses the NVIDIA ALCHEMI Toolkit-Ops, a specialized toolkit designed to accelerate AI-powered atomistic simulations in chemistry and materials science.

JAX, Python, PyTorch, Warp

Justin S. Smith

10 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Simulate Robotic Environments Faster with NVIDIA Isaac Sim and World Labs Marble

This article discusses how to rapidly simulate robotic environments using NVIDIA Isaac Sim and World Labs Marble.

Kong, Python, PyTorch

Wonsik Han

10 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Using AI Physics for Technology Computer-Aided Design Simulations

The article discusses the integration of AI Physics into Technology Computer-Aided Design (TCAD) simulations, highlighting its significance in semiconductor manufacturing.

Graph Neural Networks, Hugging Face, Neural Networks, Python, PyTorch

Ram Cherukuri

7 min read

Has Summary

NVIDIA

Intermediate

Achieve CUTLASS C++ Performance with Python APIs Using CuTe DSL

The article discusses how CuTe DSL, a new Python API for CUTLASS 4, simplifies GPU kernel development by reducing compilation times and maintaining performance efficiency similar to CUTLASS C++.

Multi-Head Attention, Python, PyTorch

Brandon Sun

8 min read

Includes Code

Has Summary

NVIDIA

Advanced

Building Scalable and Fault-Tolerant NCCL Applications

The article discusses the NVIDIA Collective Communications Library (NCCL) and its capabilities for building scalable and fault-tolerant applications.

Kubernetes, PyTorch

Luke Robison

11 min read

Includes Code

Has Summary

NVIDIA

Advanced

Gen AI Super-Resolution Accelerates Weather Prediction with Scalable, Low-Compute Models

The article discusses how NVIDIA's CorrDiff model leverages generative AI for downscaling weather predictions, significantly improving efficiency and reducing computational costs.

Fine-tuning, Python, PyTorch, YAML

Alicia Sui

11 min read

Includes Code

Has Summary

NVIDIA

Advanced

How to Achieve 4x Faster Inference for Math Problem Solving

This article discusses how to achieve 4x faster inference for math problem solving using large language models by optimizing the serving stack, quantization strategy, and decoding methods.

Hugging Face, Python, PyTorch

Igor Gitman

7 min read

Includes Code

Has Summary

NVIDIA

Advanced

Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond

The article discusses the introduction of a new Kubernetes abstraction called ComputeDomains, designed to facilitate secure GPU-to-GPU memory operations across node boundaries in multi-node NVLink ...

Helm, Kubernetes, PyTorch

Kevin Klues

13 min read

Includes Code

Has Summary

NVIDIA

Advanced

Enhancing GPU-Accelerated Vector Search in Faiss with NVIDIA cuVS

The article discusses how NVIDIA cuVS enhances GPU-accelerated vector search in the Faiss library, providing significant performance improvements for similarity search and clustering of dense vecto...

Python, PyTorch

Tarang Jain

10 min read

Includes Code

Has Summary

NVIDIA

Advanced

Democratizing Large-Scale Mixture-of-Experts Training with NVIDIA PyTorch Parallelism

The article discusses how NVIDIA's NeMo Automodel simplifies the training of large-scale mixture-of-experts (MoE) models in PyTorch, making it accessible to a broader audience.

GPT, Hugging Face, PyTorch, Transformer

Hemil Desai

7 min read

Includes Code

Has Summary

NVIDIA

Advanced

Scale Biology Transformer Models with PyTorch and NVIDIA BioNeMo Recipes

The article discusses how to scale biology transformer models using PyTorch and NVIDIA BioNeMo Recipes, focusing on advanced parallel computing techniques and the integration of the NVIDIA Transfor...

Hugging Face, PyTorch, Transformer, Transformers

Kyle Tretina

6 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Streamline AI Infrastructure with NVIDIA Run:ai on Microsoft Azure

The article discusses how NVIDIA Run:ai enhances AI infrastructure management on Microsoft Azure by optimizing GPU utilization and simplifying workload orchestration.

Azure, Azure Blob Storage, Hugging Face, Kubernetes, PyTorch

Julie Adrounie

8 min read

Has Summary

NVIDIA

Intermediate

How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks

The article discusses how the NVIDIA DGX Spark supercomputer enhances performance for intensive AI tasks, providing a local alternative to cloud computing.

Fine-tuning, GPT, Hugging Face, PyTorch, scikit-learn

Allen Bourgoyne

5 min read

Has Summary

NVIDIA

Advanced

Enabling Scalable AI-Driven Molecular Dynamics Simulations

The article discusses the integration of machine learning interatomic potentials (MLIPs) into molecular dynamics (MD) simulations using the ML-IAP-Kokkos interface within the LAMMPS MD package.

Cython, Python, PyTorch

Justin S. Smith

14 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Train a Quadruped Locomotion Policy and Simulate Cloth Manipulation with NVIDIA Isaac Lab and Newton

This article discusses the integration of the Newton physics engine with NVIDIA Isaac Lab for training quadruped locomotion policies and simulating cloth manipulation.

Apache, NumPy, Python, PyTorch, Reinforcement Learning, Warp, YAML

Mohammad Mohajerani

13 min read

Includes Code

Has Summary

NVIDIA

Advanced

Reducing Cold Start Latency for LLM Inference with NVIDIA Run:ai Model Streamer

The article discusses the challenges of cold start latency in deploying large language models (LLMs) and introduces the NVIDIA Run:ai Model Streamer, an open-source Python SDK designed to optimize ...

AWS, AWS S3, HTTPS, Hugging Face, Python, PyTorch, Transformers

Omer Dayan

12 min read

Has Summary

NVIDIA

Advanced

Autodesk Research Brings Warp Speed to Computational Fluid Dynamics on NVIDIA GH200

The article discusses Autodesk Research's development of the Accelerated Lattice Boltzmann (XLB) library, which enhances computational fluid dynamics (CFD) performance using NVIDIA's Warp and GH200...

Fortran, JAX, Numba, NumPy, Python, PyTorch, Warp

Mehdi Ataei

7 min read

Has Summary

NVIDIA

Advanced

Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6

This article discusses the optimization of vision AI workloads using NVIDIA's CUDA-accelerated implementation of SMPTE VC-6, a codec designed for efficient interaction with modern compute architect...

Python, PyTorch, V

Andreas Kieslinger

12 min read

Includes Code

Has Summary

NVIDIA

Advanced

How Quantization Aware Training Enables Low-Precision Accuracy Recovery

The article discusses how Quantization Aware Training (QAT) and Quantization Aware Distillation (QAD) can enhance low-precision model accuracy recovery beyond traditional Post-Training Quantization...

Hugging Face, PyTorch

Eduardo Alvarez

9 min read

Includes Code

Has Summary

NVIDIA

Beginner

Developers Can Now Get NVIDIA CUDA Directly from Their Favorite Third-Party Platforms

NVIDIA is simplifying the deployment of its CUDA software stack by collaborating with various third-party platforms, enabling developers to access CUDA directly through their preferred package mana...

OpenCV, Python, PyTorch

Jonathan Bentz

3 min read

Has Summary

NVIDIA

Advanced

Accelerate Large-Scale LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing

The article discusses how to enhance the efficiency of Large Language Models (LLMs) during inference by utilizing CPU-GPU memory sharing through NVIDIA's NVLink C2C technology.

Hugging Face, Large Language Models, Python, PyTorch

Afroze Syed

6 min read

Includes Code

Has Summary

NVIDIA

Advanced

Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2

The article discusses the challenges of selecting optimal General Matrix Multiplication (GEMM) kernels on NVIDIA GPUs and introduces NVIDIA Matmul Heuristics (nvMatmulHeuristics) as a solution to i...

JSON, Python, PyTorch

Harrison Barclay

7 min read

Includes Code

Has Summary

NVIDIA

Advanced

Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training

The article discusses fine-tuning the gpt-oss model for improved accuracy and performance through Quantization Aware Training (QAT) and Supervised Fine-Tuning (SFT).

GPT, Hugging Face, PyTorch, Transformer, Transformers

Eduardo Alvarez

7 min read

Includes Code

Has Summary

NVIDIA

Advanced

Introducing NVIDIA Jetson Thor, the Ultimate Platform for Physical AI

The article introduces the NVIDIA Jetson Thor, a powerful platform designed for physical AI and humanoid robotics.

Gemini, Hugging Face, PyTorch, Transformer

Shashank Maheshwari

13 min read

Has Summary

NVIDIA

Advanced

NVIDIA Hardware Innovations and Open Source Contributions Are Shaping AI

The article discusses how NVIDIA's hardware innovations, particularly the Blackwell architecture and NVFP4 precision, along with their open source contributions, are driving advancements in AI.

GPT, Hugging Face, JAX, Kubernetes, Python, PyTorch, Transformer

George Chellapa

8 min read

Has Summary

NVIDIA

Intermediate

Reinforcement Learning with NVIDIA NeMo-RL: Megatron-Core Support for Optimized Training Throughput

The article discusses the enhancements in reinforcement learning training throughput using NVIDIA NeMo-RL with Megatron-Core support.

PyTorch, Reinforcement Learning, YAML

Anna Shors

7 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Streamline CUDA-Accelerated Python Install and Packaging Workflows with Wheel Variants

The article discusses the introduction of Wheel Variants, a new Python packaging standard aimed at improving the installation and packaging workflows for CUDA-accelerated Python packages.

Docker, JAX, Python, PyTorch, SciPy

Jonathan Dekhtiar

15 min read

Includes Code

Has Summary

NVIDIA

Advanced

Optimizing LLMs for Performance and Accuracy with Post-Training Quantization

The article discusses the optimization of large language models (LLMs) through post-training quantization (PTQ), emphasizing its benefits in enhancing inference performance while maintaining accura...

Hugging Face, PyTorch, V

Eduardo Alvarez

12 min read

Includes Code

Has Summary

NVIDIA

Advanced

Double PyTorch Inference Speed for Diffusion Models Using Torch-TensorRT

The article discusses how to double the inference speed of diffusion models in PyTorch using Torch-TensorRT, an AI inference library that optimizes machine learning models for NVIDIA GPUs.

Diffusers, Diffusion Models, Hugging Face, PyTorch, Stable Diffusion

Adrian Wang

8 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Driving AI-Powered Robotics Development with NVIDIA Isaac for Healthcare

The article discusses the impending shortage of healthcare workers and how AI-enabled robotic systems, powered by NVIDIA Isaac for Healthcare, can address these challenges.

PyTorch, TensorFlow

Ansley Dunn

6 min read

Includes Code

Has Summary

NVIDIA

Intermediate

NVIDIA Dynamo Adds Support for AWS Services to Deliver Cost-Efficient Inference at Scale

NVIDIA Dynamo has integrated support for AWS services, enhancing cost-efficient inference for large language models (LLMs) on NVIDIA GPU-based Amazon EC2 instances.

AWS, Kubernetes, PyTorch

Amr Elmeleegy

4 min read

Has Summary

NVIDIA

Advanced

Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO

The article introduces NVIDIA NeMo-RL, an open-source library for reinforcement learning that supports scalable training from single-GPU to thousand-GPU models.

Hugging Face, Python, PyTorch, Reinforcement Learning

Alexander Bukharin

5 min read

Includes Code

Has Summary

NVIDIA

Advanced

Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python

The article discusses the introduction of cuda-cccl, a Python library that provides high-level building blocks for NVIDIA CUDA kernel fusion, enabling developers to write efficient algorithms witho...

Less, Python, PyTorch, TensorFlow, XGBoost

Ashwin Srinath

5 min read

Includes Code

Has Summary

NVIDIA

Advanced

LLM Inference Benchmarking: Performance Tuning with TensorRT-LLM

This article provides a comprehensive guide on benchmarking LLM inference using TensorRT-LLM, focusing on performance tuning techniques.

JSON, Python, PyTorch

Francesco Di Natale

10 min read

Includes Code

Has Summary

NVIDIA

Intermediate

RAPIDS Adds GPU Polars Streaming, a Unified GNN API, and Zero-Code ML Speedups

RAPIDS version 25.

Dask, Polars, Python, PyTorch, scikit-learn

Brian Tepera

6 min read

Includes Code

Has Summary

NVIDIA

Intermediate

Advanced NVIDIA CUDA Kernel Optimization Techniques: Handwritten PTX

The article discusses advanced optimization techniques for NVIDIA CUDA kernels, specifically focusing on handwritten Parallel Thread Execution (PTX) code.

Fortran, Python, PyTorch

Jonathan Bentz

11 min read

Includes Code

Has Summary