How NVIDIA Uses PyTorch
566 engineering articles about PyTorch from NVIDIA's engineering team
Articles
The article discusses how NVFP4 low-precision training achieves higher throughput without sacrificing accuracy in AI model training.
Aditya Vavre
7 min read
Includes Code
Has Summary
--
The article discusses how the NVIDIA cuda.compute library enables Python developers to write high-performance GPU code without needing to resort to C++.
The article discusses how NVIDIA's hardware-software co-design significantly enhanced the inference performance of Sarvam AI's Sovereign 30B model, achieving a 4x speedup on NVIDIA Blackwell archit...
Utkarsh Uppal
14 min read
Has Summary
--
The article discusses NVIDIA TensorRT LLM AutoDeploy, a beta feature that automates the inference optimization process for large language models (LLMs).
Lucas Liebenwein
8 min read
Includes Code
Has Summary
--
Kimi K2.5 is an advanced multimodal vision language model (VLM) developed by Kimi, optimized for various AI tasks.
Anu Srivastava
4 min read
Includes Code
Has Summary
--
The article discusses the challenges of Expert Parallel communication in training Mixture-of-Experts (MoE) models and introduces Hybrid-EP, an efficient communication solution that leverages NVIDIA...
The article discusses the Universal Sparse Tensor (UST), a framework designed to efficiently handle sparse tensors across various applications, including scientific computing and deep learning.
The article discusses the transition from the traditional two-phase API of the CUB library to a new single-call API introduced in CUDA 13.1.
Giannis Gonidelis
8 min read
Includes Code
Has Summary
--
This article provides a detailed guide on implementing high-performance matrix multiplication using NVIDIA's cuTile framework in CUDA.
The article discusses NVIDIA's advancements in AI model inference performance through the Blackwell architecture, emphasizing improvements in token throughput per watt and the enhancements made to ...
Ashraf Eassa
5 min read
Has Summary
--
The article discusses how recent upgrades to open source AI tools enhance the performance of small language models (SLMs) and diffusion models on NVIDIA RTX PCs.
Annamalai Chockalingam
7 min read
Has Summary
--
The article discusses the latest software and model optimizations for NVIDIA DGX Spark, highlighting significant performance improvements in AI workflows.
Allen Bourgoyne
5 min read
Has Summary
--
The article discusses the NVIDIA Rubin platform, which introduces six new chips designed to create a powerful AI supercomputer.
Kyle Aubrey
59 min read
Has Summary
--
NVIDIA introduces the Jetson T4000, enhancing AI and real-time reasoning for robotics and edge AI applications with up to 1200 FP4 TFLOPs of AI compute and 64 GB of memory.
The article discusses the NVIDIA ALCHEMI Toolkit-Ops, a specialized toolkit designed to accelerate AI-powered atomistic simulations in chemistry and materials science.
This article discusses how to rapidly simulate robotic environments using NVIDIA Isaac Sim and World Labs Marble.
The article discusses the integration of AI Physics into Technology Computer-Aided Design (TCAD) simulations, highlighting its significance in semiconductor manufacturing.
Ram Cherukuri
7 min read
Has Summary
--
The article discusses how CuTe DSL, a new Python API for CUTLASS 4, simplifies GPU kernel development by reducing compilation times and maintaining performance efficiency similar to CUTLASS C++.
Brandon Sun
8 min read
Includes Code
Has Summary
--
The article discusses the NVIDIA Collective Communications Library (NCCL) and its capabilities for building scalable and fault-tolerant applications.
Luke Robison
11 min read
Includes Code
Has Summary
--
The article discusses how NVIDIA's CorrDiff model leverages generative AI for downscaling weather predictions, significantly improving efficiency and reducing computational costs.
Alicia Sui
11 min read
Includes Code
Has Summary
--
This article discusses how to achieve 4x faster inference for math problem solving using large language models by optimizing the serving stack, quantization strategy, and decoding methods.
Igor Gitman
7 min read
Includes Code
Has Summary
--
The article discusses the introduction of a new Kubernetes abstraction called ComputeDomains, designed to facilitate secure GPU-to-GPU memory operations across node boundaries in multi-node NVLink ...
Kevin Klues
13 min read
Includes Code
Has Summary
--
The article discusses how NVIDIA cuVS enhances GPU-accelerated vector search in the Faiss library, providing significant performance improvements for similarity search and clustering of dense vecto...
The article discusses how NVIDIA's NeMo Automodel simplifies the training of large-scale mixture-of-experts (MoE) models in PyTorch, making it accessible to a broader audience.
Hemil Desai
7 min read
Includes Code
Has Summary
--
The article discusses how to scale biology transformer models using PyTorch and NVIDIA BioNeMo Recipes, focusing on advanced parallel computing techniques and the integration of the NVIDIA Transfor...
Kyle Tretina
6 min read
Includes Code
Has Summary
--
The article discusses how NVIDIA Run:ai enhances AI infrastructure management on Microsoft Azure by optimizing GPU utilization and simplifying workload orchestration.
Julie Adrounie
8 min read
Has Summary
--
The article discusses how the NVIDIA DGX Spark supercomputer enhances performance for intensive AI tasks, providing a local alternative to cloud computing.
Allen Bourgoyne
5 min read
Has Summary
--
The article discusses the integration of machine learning interatomic potentials (MLIPs) into molecular dynamics (MD) simulations using the ML-IAP-Kokkos interface within the LAMMPS MD package.
Train a Quadruped Locomotion Policy and Simulate Cloth Manipulation with NVIDIA Isaac Lab and Newton
This article discusses the integration of the Newton physics engine with NVIDIA Isaac Lab for training quadruped locomotion policies and simulating cloth manipulation.
The article discusses the challenges of cold start latency in deploying large language models (LLMs) and introduces the NVIDIA Run:ai Model Streamer, an open-source Python SDK designed to optimize ...
Omer Dayan
12 min read
Has Summary
--
The article discusses Autodesk Research's development of the Accelerated Lattice Boltzmann (XLB) library, which enhances computational fluid dynamics (CFD) performance using NVIDIA's Warp and GH200...
This article discusses the optimization of vision AI workloads using NVIDIA's CUDA-accelerated implementation of SMPTE VC-6, a codec designed for efficient interaction with modern compute architect...
The article discusses how Quantization Aware Training (QAT) and Quantization Aware Distillation (QAD) can enhance low-precision model accuracy recovery beyond traditional Post-Training Quantization...
Eduardo Alvarez
9 min read
Includes Code
Has Summary
--
NVIDIA is simplifying the deployment of its CUDA software stack by collaborating with various third-party platforms, enabling developers to access CUDA directly through their preferred package mana...
The article discusses how to enhance the efficiency of Large Language Models (LLMs) during inference by utilizing CPU-GPU memory sharing through NVIDIA's NVLink C2C technology.
Afroze Syed
6 min read
Includes Code
Has Summary
--
The article discusses the challenges of selecting optimal General Matrix Multiplication (GEMM) kernels on NVIDIA GPUs and introduces NVIDIA Matmul Heuristics (nvMatmulHeuristics) as a solution to i...
The article discusses fine-tuning the gpt-oss model for improved accuracy and performance through Quantization Aware Training (QAT) and Supervised Fine-Tuning (SFT).
Eduardo Alvarez
7 min read
Includes Code
Has Summary
--
The article introduces the NVIDIA Jetson Thor, a powerful platform designed for physical AI and humanoid robotics.
Shashank Maheshwari
13 min read
Has Summary
--
The article discusses how NVIDIA's hardware innovations, particularly the Blackwell architecture and NVFP4 precision, along with their open source contributions, are driving advancements in AI.
George Chellapa
8 min read
Has Summary
--
The article discusses the enhancements in reinforcement learning training throughput using NVIDIA NeMo-RL with Megatron-Core support.
Anna Shors
7 min read
Includes Code
Has Summary
--
The article discusses the introduction of Wheel Variants, a new Python packaging standard aimed at improving the installation and packaging workflows for CUDA-accelerated Python packages.
The article discusses the optimization of large language models (LLMs) through post-training quantization (PTQ), emphasizing its benefits in enhancing inference performance while maintaining accura...
Eduardo Alvarez
12 min read
Includes Code
Has Summary
--
The article discusses how to double the inference speed of diffusion models in PyTorch using Torch-TensorRT, an AI inference library that optimizes machine learning models for NVIDIA GPUs.
Adrian Wang
8 min read
Includes Code
Has Summary
--
The article discusses the impending shortage of healthcare workers and how AI-enabled robotic systems, powered by NVIDIA Isaac for Healthcare, can address these challenges.
Ansley Dunn
6 min read
Includes Code
Has Summary
--
NVIDIA Dynamo has integrated support for AWS services, enhancing cost-efficient inference for large language models (LLMs) on NVIDIA GPU-based Amazon EC2 instances.
Amr Elmeleegy
4 min read
Has Summary
--
The article introduces NVIDIA NeMo-RL, an open-source library for reinforcement learning that supports training scaled from a single GPU to thousands of GPUs.
Alexander Bukharin
5 min read
Includes Code
Has Summary
--
The article discusses the introduction of cuda-cccl, a Python library that provides high-level building blocks for NVIDIA CUDA kernel fusion, enabling developers to write efficient algorithms witho...
Ashwin Srinath
5 min read
Includes Code
Has Summary
--
This article provides a comprehensive guide on benchmarking LLM inference using TensorRT-LLM, focusing on performance tuning techniques.
RAPIDS version 25.
Brian Tepera
6 min read
Includes Code
Has Summary
--
The article discusses advanced optimization techniques for NVIDIA CUDA kernels, specifically focusing on handwritten Parallel Thread Execution (PTX) code.