How NVIDIA Uses Python

Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute

Advanced

The article discusses how the NVIDIA cuda.compute library enables Python developers to write high-performance GPU code without resorting to C++.

Daniel Rodriguez

5 min read

Includes Code

Has Summary

R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab

Advanced

The article discusses NVIDIA Isaac Lab, a GPU-native simulation framework designed to enhance multimodal robot learning by addressing the challenges of traditional simulation methods.

Modal, Python, Warp

Oyindamola Omotuyi

9 min read

Includes Code

Has Summary

Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities

Advanced

The article discusses how accelerated computing, particularly through NVIDIA's technologies, is transforming scientific experiments at large research facilities like the NSF-DOE Vera C.

NumPy, Python, SciPy

Quynh L. Nguyen

12 min read

Has Summary

How to Build a Document Processing Pipeline for RAG with Nemotron

Advanced

The article provides a comprehensive guide on building a document processing pipeline using NVIDIA Nemotron RAG, focusing on the extraction of structured data from complex documents like PDFs.

Docker, Embedding, Hugging Face, JSON, Python, Redis, torchvision

Chia-Chih Chen

9 min read

Includes Code

Has Summary

Accelerating Long-Context Model Training in JAX and XLA

Advanced

The article discusses the integration of the NVSHMEM communication library into the Accelerated Linear Algebra (XLA) compiler to optimize long-context model training in JAX.

Docker, JAX, Python

Sevin Fide Varoglu

9 min read

Includes Code

Has Summary

Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

Advanced

The article discusses the challenges of Expert Parallel communication in training Mixture-of-Experts (MoE) models and introduces Hybrid-EP, an efficient communication solution that leverages NVIDIA...

Fan Yu

10 min read

Has Summary

Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton

Advanced

The article discusses the integration of CUDA Tile as a backend for OpenAI Triton, a Python DSL for writing GPU kernels.

Jie Xin

7 min read

Includes Code

Has Summary

How to Unlock Local Detail in Coarse Climate Projections with NVIDIA Earth-2

Advanced

The article discusses how to utilize NVIDIA Earth-2 to downscale coarse climate projections into high-resolution, bias-corrected fields, enabling better assessment of local climate extremes.

Deep Learning, Hugging Face, Python, YAML

Georg Ertl

11 min read

Includes Code

Has Summary

How to Train an AI Agent for Command-Line Tasks with Synthetic Data and Reinforcement Learning

Advanced

This article explores how to train an AI agent to operate a new Command Line Interface (CLI) using synthetic data generation and reinforcement learning.

Hugging Face, JSON, Python, Reinforcement Learning, RLHF, Shell

Chris Alexiuk

11 min read

Includes Code

Has Summary

How to Write High-Performance Matrix Multiply in NVIDIA CUDA Tile

Intermediate

This article provides a detailed guide on implementing high-performance matrix multiplication using NVIDIA's cuTile framework in CUDA.

Jinman Xie

13 min read

Includes Code

Has Summary

Build an AI Catalog System That Delivers Localized, Interactive Product Experiences

Advanced

This article provides a comprehensive tutorial on building an AI-powered catalog enrichment system that enhances e-commerce product listings using NVIDIA's advanced models.

Docker, FastAPI, Generative AI, JSON, Python

Antonio Martinez

10 min read

Includes Code

Has Summary

Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell

Advanced

The article discusses NVIDIA's advancements in AI model inference performance through the Blackwell architecture, emphasizing improvements in token throughput per watt and the enhancements made to ...

Deep Learning, Python, PyTorch

Ashraf Eassa

5 min read

Has Summary

Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM

Intermediate

The article discusses the introduction of NVIDIA TensorRT Edge-LLM, an open-source C++ framework designed for high-performance inference of Large Language Models (LLMs) and Vision Language Models (...

Chi, Hugging Face, Python, Transformers

Lin Chai

5 min read

Includes Code

Has Summary

Build and Orchestrate End-to-End SDG Workflows with NVIDIA Isaac Sim and NVIDIA OSMO

Intermediate

The article discusses how to build and orchestrate end-to-end synthetic data generation (SDG) workflows using NVIDIA Isaac Sim and NVIDIA OSMO.

Azure, Gradio, Kubernetes, PostgreSQL, Python, Redis, YAML

Asawaree Bhide

11 min read

Includes Code

Has Summary

Accelerate AI Inference for Edge and Robotics with NVIDIA Jetson T4000 and NVIDIA JetPack 7.1

Advanced

NVIDIA introduces the Jetson T4000, enhancing AI and real-time reasoning for robotics and edge AI applications with up to 1200 FP4 TFLOPs of AI compute and 64 GB of memory.

Mistral, Python, PyTorch

Shashank Maheshwari

9 min read

Has Summary

How to Build a Voice Agent with RAG and Safety Guardrails

Advanced

This article provides a comprehensive tutorial on building a voice agent using NVIDIA's Nemotron models, focusing on retrieval-augmented generation (RAG) and safety guardrails.

Embedding, Hugging Face, Python, Transformer, Transformers

Chris Alexiuk

8 min read

Includes Code

Has Summary

Building Autonomous Vehicles That Reason with NVIDIA Alpamayo

Intermediate

The article discusses NVIDIA's Alpamayo, a comprehensive ecosystem designed for developing reasoning-based autonomous vehicle (AV) systems.

gRPC, Hugging Face, Python

Marco Pavone

11 min read

Includes Code

Has Summary

Accelerating AI-Powered Chemistry and Materials Science Simulations with NVIDIA ALCHEMI Toolkit-Ops

Advanced

The article discusses the NVIDIA ALCHEMI Toolkit-Ops, a specialized toolkit designed to accelerate AI-powered atomistic simulations in chemistry and materials science.

JAX, Python, PyTorch, Warp

Justin S. Smith

10 min read

Includes Code

Has Summary

Simulate Robotic Environments Faster with NVIDIA Isaac Sim and World Labs Marble

Intermediate

This article discusses how to rapidly simulate robotic environments using NVIDIA Isaac Sim and World Labs Marble.

Kong, Python, PyTorch

Wonsik Han

10 min read

Includes Code

Has Summary

Simulate an Accurate Radio Environment Using NVIDIA Aerial Omniverse Digital Twin

Advanced

The article discusses how to simulate an accurate radio environment for 5G and 6G systems using the NVIDIA Aerial Omniverse Digital Twin (AODT).

gRPC, MATLAB, NumPy, Python, YAML

Tommaso Balercia

10 min read

Includes Code

Has Summary

Using AI Physics for Technology Computer-Aided Design Simulations

Intermediate

The article discusses the integration of AI Physics into Technology Computer-Aided Design (TCAD) simulations, highlighting its significance in semiconductor manufacturing.

Graph Neural Networks, Hugging Face, Neural Networks, Python, PyTorch

Ram Cherukuri

7 min read

Has Summary

Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT-LLM

Advanced

The article discusses the Skip Softmax technique, a method for accelerating long-context inference in large language models (LLMs) using NVIDIA TensorRT-LLM.

Python, V, YAML

Laikh Tewari

6 min read

Includes Code

Has Summary

Advanced Large-Scale Quantum Simulation Techniques in cuQuantum SDK v25.11

Intermediate

The article discusses advanced techniques for large-scale quantum simulations using the cuQuantum SDK v25.11, focusing on the new functionalities for Pauli propagation and stabilizer simulations.

Tom Lubowe

11 min read

Includes Code

Has Summary

Reducing CUDA Binary Size to Distribute cuML on PyPI

Intermediate

The article discusses the efforts made by the NVIDIA team to reduce the binary size of CUDA C++ libraries, specifically for the cuML library, enabling its distribution via PyPI.

Divye Gala

8 min read

Includes Code

Has Summary

How to Train Scientific Agents with Reinforcement Learning

Intermediate

The article discusses the development of scientific AI agents using reinforcement learning (RL) techniques, specifically through the NVIDIA NeMo framework.

Apache, Azure, Python, Reinforcement Learning, RLHF

Christian Munley

12 min read

Includes Code

Has Summary

How to Scale Fast Fourier Transforms to Exascale on Modern NVIDIA GPU Architectures

Advanced

The article discusses the advancements in scaling Fast Fourier Transforms (FFTs) using NVIDIA's cuFFTMp library on modern GPU architectures, particularly focusing on performance improvements on the...

Zan Xu

7 min read

Includes Code

Has Summary

Enhancing Communication Observability of AI Workloads with NCCL Inspector

Advanced

The article discusses the NCCL Inspector, a profiling and analysis tool designed to enhance communication observability for AI workloads using the NVIDIA Collective Communication Library (NCCL).

JSON, Python

Sirshak Das

6 min read

Includes Code

Has Summary

Improve AI-Native 6G Design with the NVIDIA Aerial Omniverse Digital Twin

Advanced

The article discusses the transformation of AI-native 6G network design through the NVIDIA Aerial Omniverse Digital Twin, emphasizing the need for a dynamic, continuous integration approach to Radi...

gRPC, MATLAB, Python

Tommaso Balercia

7 min read

Has Summary

NVIDIA CUDA 13.1 Powers Next-Gen GPU Programming with NVIDIA CUDA Tile and Performance Gains

Advanced

NVIDIA CUDA 13.

Jonathan Bentz

10 min read

Includes Code

Has Summary

Simplify GPU Programming with NVIDIA CUDA Tile in Python

Intermediate

The article discusses the introduction of NVIDIA CUDA 13.1 and its new tile-based programming model for GPUs, which simplifies GPU programming in Python through cuTile.

Jonathan Bentz

7 min read

Includes Code

Has Summary

Focus on Your Algorithm—NVIDIA CUDA Tile Handles the Hardware

Advanced

The article discusses the launch of NVIDIA CUDA Tile with CUDA 13.1, which introduces a virtual instruction set for tile-based parallel programming.

NumPy, Python

Jonathan Bentz

5 min read

Has Summary

Making Robot Perception More Efficient on NVIDIA Jetson Thor

Advanced

The article discusses enhancing robot perception efficiency on the NVIDIA Jetson Thor platform by utilizing specialized hardware accelerators alongside powerful GPUs.

NumPy, OpenCV, PIL, Python

Chintan Intwala

15 min read

Includes Code

Has Summary

NVIDIA NVQLink Architecture Integrates Accelerated Computing with Quantum Processors

Advanced

The article discusses NVIDIA's NVQLink architecture, which integrates accelerated computing with quantum processors to enhance quantum error correction and calibration.

Shane Caldwell

7 min read

Includes Code

Has Summary

Achieve CUTLASS C++ Performance with Python APIs Using CuTe DSL

Intermediate

The article discusses how CuTe DSL, a new Python API for CUTLASS 4, simplifies GPU kernel development by reducing compilation times and maintaining performance efficiency similar to CUTLASS C++.

Multi-Head Attention, Python, PyTorch

Brandon Sun

8 min read

Includes Code

Has Summary

How to Get Started with Neural Shading for Your Game or Application

Intermediate

The article discusses neural shading as a transformative approach to real-time rendering, integrating trainable models into graphics pipelines to enhance visual fidelity and performance.

Python, Render, V

Shannon Woods

20 min read

Includes Code

Has Summary

Gen AI Super-Resolution Accelerates Weather Prediction with Scalable, Low-Compute Models

Advanced

The article discusses how NVIDIA's CorrDiff model leverages generative AI for downscaling weather predictions, significantly improving efficiency and reducing computational costs.

Fine-tuning, Python, PyTorch, YAML

Alicia Sui

11 min read

Includes Code

Has Summary

How to Achieve 4x Faster Inference for Math Problem Solving

Advanced

This article discusses how to achieve 4x faster inference for math problem solving using large language models by optimizing the serving stack, quantization strategy, and decoding methods.

Hugging Face, Python, PyTorch

Igor Gitman

7 min read

Includes Code

Has Summary

Building an Interactive AI Agent for Lightning-Fast Machine Learning Tasks

Intermediate

The article discusses the development of an interactive AI agent designed to streamline machine learning workflows by leveraging GPU acceleration.

Machine Learning, Python, scikit-learn, Streamlit

Allison Ding

7 min read

Includes Code

Has Summary

Enhancing GPU-Accelerated Vector Search in Faiss with NVIDIA cuVS

Advanced

The article discusses how NVIDIA cuVS enhances GPU-accelerated vector search in the Faiss library, providing significant performance improvements for similarity search and clustering of dense vecto...

Tarang Jain

10 min read

Includes Code

Has Summary

How to Predict Biomolecular Structures Using the OpenFold3 NIM

Advanced

The article discusses the advancements in biomolecular structure prediction using OpenFold3, a deep learning model integrated into the NVIDIA ecosystem.

Docker, Python

Kyle Tretina

5 min read

Includes Code

Has Summary

How Code Execution Drives Key Risks in Agentic AI Systems

Intermediate

The article discusses the security risks associated with AI-driven applications that generate and execute code autonomously.

AWS, AWS EC2, Docker, Python

John Irwin

8 min read

Includes Code

Has Summary

Powering AI-Native 6G Research with the NVIDIA Sionna Research Kit

Advanced

The article discusses the NVIDIA Sionna Research Kit, an open-source platform designed to facilitate AI-native 6G research through GPU acceleration.

Sebastian Cammerer

5 min read

Includes Code

Has Summary

Train an LLM on NVIDIA Blackwell with Unsloth—and Scale for Production

Intermediate

The article discusses how to fine-tune and scale large language models (LLMs) using the open-source Unsloth framework on NVIDIA Blackwell GPUs.

Docker, Fine-tuning, Hugging Face, Python

Paul Abruzzo

5 min read

Includes Code

Has Summary

Create Your Own Bash Computer Use Agent with NVIDIA Nemotron in One Hour

Advanced

This article guides readers through the process of creating a Bash computer use agent using the NVIDIA Nemotron Nano v2 model.

Hugging Face, JSON, Python

Mehran Maghoumi

14 min read

Includes Code

Has Summary

Enabling Scalable AI-Driven Molecular Dynamics Simulations

Advanced

The article discusses the integration of machine learning interatomic potentials (MLIPs) into molecular dynamics (MD) simulations using the ML-IAP-Kokkos interface within the LAMMPS MD package.

Cython, Python, PyTorch

Justin S. Smith

14 min read

Includes Code

Has Summary

Accelerate Qubit Research with NVIDIA cuQuantum Integrations in QuTiP and scQubits

Advanced

The article discusses the integration of NVIDIA cuQuantum with the Quantum Toolbox in Python (QuTiP) and scQubits, highlighting how these integrations accelerate quantum simulations for novel qubit...

AWS, Python, Rapids

Tom Lubowe

4 min read

Includes Code

Has Summary

From Assistant to Adversary: Exploiting Agentic AI Developer Tools

Intermediate

The article discusses the dual role of AI-enabled developer tools, highlighting both their potential to accelerate coding and the security vulnerabilities they introduce.

Claude, Copilot, Python

Becca Lynch

9 min read

Includes Code

Has Summary

Train a Quadruped Locomotion Policy and Simulate Cloth Manipulation with NVIDIA Isaac Lab and Newton

Intermediate

This article discusses the integration of the Newton physics engine with NVIDIA Isaac Lab for training quadruped locomotion policies and simulating cloth manipulation.

Apache, NumPy, Python, PyTorch, Reinforcement Learning, Warp, YAML

Mohammad Mohajerani

13 min read

Includes Code

Has Summary

3 Easy Ways to Supercharge Your Robotics Development Using OpenUSD

Intermediate

The article discusses how OpenUSD can enhance robotics development through improved data ingestion, aggregation, and the use of SimReady assets.

Hugging Face, Python

Matias Codesal

6 min read

Includes Code

Has Summary