PyTorch Programming Tutorials & Engineering Articles

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy

Advanced

The article discusses the use of NVFP4 low-precision model training to achieve higher throughput without sacrificing accuracy in AI model training.

Hugging Face, PyTorch

Aditya Vavre

7 min read

Includes Code

Has Summary

Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute

Advanced

The article discusses how the NVIDIA cuda.compute library enables Python developers to write high-performance GPU code without needing to resort to C++.

Daniel Rodriguez

5 min read

Includes Code

Has Summary

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s

Advanced

The article discusses how NVIDIA's hardware-software co-design significantly enhanced the inference performance of Sarvam AI's Sovereign 30B model, achieving a 4x speedup on NVIDIA Blackwell archit...

Hugging Face, PyTorch, Transformer

Utkarsh Uppal

14 min read

Has Summary

Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy

Advanced

The article discusses NVIDIA TensorRT LLM AutoDeploy, a beta feature that automates the inference optimization process for large language models (LLMs).

Hugging Face, PyTorch, Transformers, V

Lucas Liebenwein

8 min read

Includes Code

Has Summary

Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints

Advanced

Kimi K2.5 is an advanced multimodal vision language model (VLM) developed by Kimi, optimized for various AI tasks.

Embedding, Fine-tuning, Hugging Face, PyTorch

Anu Srivastava

4 min read

Includes Code

Has Summary

Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

Advanced

The article discusses the challenges of Expert Parallel communication in training Mixture-of-Experts (MoE) models and introduces Hybrid-EP, an efficient communication solution that leverages NVIDIA...

Fan Yu

10 min read

Has Summary

Advanced

Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models (Part 1)

This article discusses the re-architecture of the serving stack for next-generation ads lightweight ranking models at Pinterest, moving from a traditional Two-Tower architecture to a more complex G...

Machine Learning, PyTorch, Thrift

Pinterest Engineering

11 min read

Has Summary

Establishing a Scalable Sparse Ecosystem with the Universal Sparse Tensor

Intermediate

The article discusses the Universal Sparse Tensor (UST), a framework designed to efficiently handle sparse tensors across various applications, including scientific computing and deep learning.

PyTorch, SciPy

Aart J.C. Bik

13 min read

Includes Code

Has Summary

Google

Advanced

LiteRT: The Universal Framework for On-Device AI

LiteRT has evolved from its TensorFlow Lite foundation into a universal on-device AI inference framework, now offering production-ready GPU acceleration across six platforms and streamlined NPU int...

Gemini, Hugging Face, JAX, PyTorch, TensorFlow

Lu Wang, Chintan Parikh, Jingjiang Li, Terry Heo

9 min read

Includes Code

Has Summary

Streamlining CUB with a Single-Call API

Advanced

The article discusses the transition from the traditional two-phase API of the CUB library to a new single-call API introduced in CUDA 13.1.

PyTorch

Giannis Gonidelis

8 min read

Includes Code

Has Summary

How to Write High-Performance Matrix Multiply in NVIDIA CUDA Tile

Intermediate

This article provides a detailed guide on implementing high-performance matrix multiplication using NVIDIA's cuTile framework in CUDA.

Jinman Xie

13 min read

Includes Code

Has Summary

Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell

Advanced

The article discusses NVIDIA's advancements in AI model inference performance through the Blackwell architecture, emphasizing improvements in token throughput per watt and the enhancements made to ...

Deep Learning, Python, PyTorch

Ashraf Eassa

5 min read

Has Summary

Open Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs

Advanced

The article discusses how recent upgrades to open source AI tools enhance the performance of small language models (SLMs) and diffusion models on NVIDIA RTX PCs.

Diffusion Models, GPT, Ollama, PyTorch

Annamalai Chockalingam

7 min read

Has Summary

New Software and Model Optimizations Supercharge NVIDIA DGX Spark

Intermediate

The article discusses the latest software and model optimizations for NVIDIA DGX Spark, highlighting significant performance improvements in AI workflows.

GPT, Hugging Face, PyTorch

Allen Bourgoyne

5 min read

Has Summary

Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer

Advanced

The article discusses the NVIDIA Rubin platform, which introduces six new chips designed to create a powerful AI supercomputer.

Assembly, Hugging Face, JAX, Kubernetes, Less, PyTorch, RLHF, Transformer

Kyle Aubrey

59 min read

Has Summary

Accelerate AI Inference for Edge and Robotics with NVIDIA Jetson T4000 and NVIDIA JetPack 7.1

Advanced

NVIDIA introduces the Jetson T4000, enhancing AI and real-time reasoning for robotics and edge AI applications with up to 1200 FP4 TFLOPs of AI compute and 64 GB of memory.

Mistral, Python, PyTorch

Shashank Maheshwari

9 min read

Has Summary

Accelerating AI-Powered Chemistry and Materials Science Simulations with NVIDIA ALCHEMI Toolkit-Ops

Advanced

The article discusses the NVIDIA ALCHEMI Toolkit-Ops, a specialized toolkit designed to accelerate AI-powered atomistic simulations in chemistry and materials science.

JAX, Python, PyTorch, Warp

Justin S. Smith

10 min read

Includes Code

Has Summary

Simulate Robotic Environments Faster with NVIDIA Isaac Sim and World Labs Marble

Intermediate

This article discusses how to rapidly simulate robotic environments using NVIDIA Isaac Sim and World Labs Marble.

Kong, Python, PyTorch

Wonsik Han

10 min read

Includes Code

Has Summary

Using AI Physics for Technology Computer-Aided Design Simulations

Intermediate

The article discusses the integration of AI Physics into Technology Computer-Aided Design (TCAD) simulations, highlighting its significance in semiconductor manufacturing.

Graph Neural Networks, Hugging Face, Neural Networks, Python, PyTorch

Ram Cherukuri

7 min read

Has Summary

OpenAI

Intermediate

OpenAI co-founds the Agentic AI Foundation under the Linux Foundation

OpenAI has co-founded the Agentic AI Foundation (AAIF) under the Linux Foundation to promote open-source agentic AI.

AWS, Copilot, Gemini, Kubernetes, Node.js, PyTorch

OpenAI

5 min read

Has Summary

Uber

Advanced

Evolution and Scale of Uber’s Delivery Search Platform

The article discusses the evolution and scaling of Uber's Delivery Search Platform, emphasizing the transition from traditional lexical search to a semantic search model that enhances user experien...

Apache, Embedding, Hugging Face, PyTorch, Transformers

Divya Nagar, Zheng Liu, Jiasen Xu, Bo Ling, Haoyang Chen

11 min read

Has Summary

Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization

Advanced

The article introduces Zoomer, Meta's automated debugging and optimization platform designed to enhance AI performance across its extensive infrastructure.

PyTorch, Thrift

Prashant Gupta

10 min read

Has Summary

Achieve CUTLASS C++ Performance with Python APIs Using CuTe DSL

Intermediate

The article discusses how CuTe DSL, a new Python API for CUTLASS 4, simplifies GPU kernel development by reducing compilation times and maintaining performance efficiency similar to CUTLASS C++.

Multi-Head Attention, Python, PyTorch

Brandon Sun

8 min read

Includes Code

Has Summary

Building Scalable and Fault-Tolerant NCCL Applications

Advanced

The article discusses the NVIDIA Collective Communications Library (NCCL) and its capabilities for building scalable and fault-tolerant applications.

Kubernetes, PyTorch

Luke Robison

11 min read

Includes Code

Has Summary

Gen AI Super-Resolution Accelerates Weather Prediction with Scalable, Low-Compute Models

Advanced

The article discusses how NVIDIA's CorrDiff model leverages generative AI for downscaling weather predictions, significantly improving efficiency and reducing computational costs.

Fine-tuning, Python, PyTorch, YAML

Alicia Sui

11 min read

Includes Code

Has Summary

How to Achieve 4x Faster Inference for Math Problem Solving

Advanced

This article discusses how to achieve 4x faster inference for math problem solving using large language models by optimizing the serving stack, quantization strategy, and decoding methods.

Hugging Face, Python, PyTorch

Igor Gitman

7 min read

Includes Code

Has Summary

Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond

Advanced

The article discusses the introduction of a new Kubernetes abstraction called ComputeDomains, designed to facilitate secure GPU-to-GPU memory operations across node boundaries in multi-node NVLink ...

Helm, Kubernetes, PyTorch

Kevin Klues

13 min read

Includes Code

Has Summary

Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation

Intermediate

Meta's Generative Ads Recommendation Model (GEM) is a cutting-edge foundation model designed to enhance ad performance and advertiser ROI by improving the relevance of ad recommendations.

PyTorch

Huayu Li

12 min read

Has Summary

Enhancing GPU-Accelerated Vector Search in Faiss with NVIDIA cuVS

Advanced

The article discusses how NVIDIA cuVS enhances GPU-accelerated vector search in the Faiss library, providing significant performance improvements for similarity search and clustering of dense vecto...

Tarang Jain

10 min read

Includes Code

Has Summary

ClickHouse

Intermediate

I've created a website to track the team's activity

The article discusses the creation of a website for tracking team activity across GitHub repositories, initially intended as a single report but evolved into a comprehensive tool for comparing vari...

Elasticsearch, HTML, MongoDB, Neon, Polars, PyTorch, Redis, Rust, SQL, Supabase, Zig

Alexey Milovidov

4 min read

Includes Code

Has Summary

Democratizing Large-Scale Mixture-of-Experts Training with NVIDIA PyTorch Parallelism

Advanced

The article discusses how NVIDIA's NeMo Automodel simplifies the training of large-scale mixture-of-experts (MoE) models in PyTorch, making it accessible to a broader audience.

GPT, Hugging Face, PyTorch, Transformer

Hemil Desai

7 min read

Includes Code

Has Summary

Scale Biology Transformer Models with PyTorch and NVIDIA BioNeMo Recipes

Advanced

The article discusses how to scale biology transformer models using PyTorch and NVIDIA BioNeMo Recipes, focusing on advanced parallel computing techniques and the integration of the NVIDIA Transfor...

Hugging Face, PyTorch, Transformer, Transformers

Kyle Tretina

6 min read

Includes Code

Has Summary

Advanced

A Decade of AI Platform at Pinterest

The article reflects on a decade of AI platform development at Pinterest, detailing the evolution from fragmented machine learning stacks to a unified AI platform that supports various models.

AutoML, Docker, Embedding, Generative AI, Java, Kubernetes, LightGBM, PySpark, Python, PyTorch, Seed, SQL, TensorFlow, Thrift, Transformer

Pinterest Engineering

22 min read

Has Summary

Video Invisible Watermarking at Scale

Advanced

The article discusses Meta's implementation of invisible watermarking technology for video content, focusing on its applications for content provenance, AI detection, and source identification.

PyTorch

Wes Castro

10 min read

Has Summary

Streamline AI Infrastructure with NVIDIA Run:ai on Microsoft Azure

Intermediate

The article discusses how NVIDIA Run:ai enhances AI infrastructure management on Microsoft Azure by optimizing GPU utilization and simplifying workload orchestration.

Azure, Azure Blob Storage, Hugging Face, Kubernetes, PyTorch

Julie Adrounie

8 min read

Has Summary

Cursor

Intermediate

Composer: Building a fast frontier model with RL

The article discusses Composer, a new agent model designed for software engineering that achieves coding results four times faster than similar models.

Gemini, GPT, PyTorch

4 min read

Has Summary

How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks

Intermediate

The article discusses how the NVIDIA DGX Spark supercomputer enhances performance for intensive AI tasks, providing a local alternative to cloud computing.

Fine-tuning, GPT, Hugging Face, PyTorch, scikit-learn

Allen Bourgoyne

5 min read

Has Summary

Uber

Advanced

Enabling Deep Model Explainability with Integrated Gradients at Uber

This article discusses how Uber has integrated explainability into its machine learning platform, Michelangelo, using Integrated Gradients (IG) to provide interpretable attributions for deep learni...

Embedding, Keras, LIME, Machine Learning, PyTorch, SHAP, TensorFlow, XGBoost, YAML

Hugh Chen, Eric Wang, Gaoyuan Huang, Howard Yu, Jia Li, Sally Lee

14 min read

Has Summary

Enabling Scalable AI-Driven Molecular Dynamics Simulations

Advanced

The article discusses the integration of machine learning interatomic potentials (MLIPs) into molecular dynamics (MD) simulations using the ML-IAP-Kokkos interface within the LAMMPS MD package.

Cython, Python, PyTorch

Justin S. Smith

14 min read

Includes Code

Has Summary

Google

Intermediate

Introducing Coral NPU: A full-stack platform for Edge AI

The article introduces Coral NPU, a full-stack, open-source platform designed to enhance Edge AI capabilities on low-power devices.

Generative AI, JAX, PyTorch, TensorFlow

Billy Rutledge

8 min read

Has Summary

Train a Quadruped Locomotion Policy and Simulate Cloth Manipulation with NVIDIA Isaac Lab and Newton

Intermediate

This article discusses the integration of the Newton physics engine with NVIDIA Isaac Lab for training quadruped locomotion policies and simulating cloth manipulation.

Apache, NumPy, Python, PyTorch, Reinforcement Learning, Warp, YAML

Mohammad Mohajerani

13 min read

Includes Code

Has Summary

Meta’s Infrastructure Evolution and the Advent of AI

Intermediate

The article discusses Meta's evolution in infrastructure over 21 years, highlighting the significant changes brought about by AI.

Apache, Large Language Models, MySQL, Prometheus, PyTorch

Yee Jiun Song

20 min read

Has Summary

Reducing Cold Start Latency for LLM Inference with NVIDIA Run:ai Model Streamer

Advanced

The article discusses the challenges of cold start latency in deploying large language models (LLMs) and introduces the NVIDIA Run:ai Model Streamer, an open-source Python SDK designed to optimize ...

AWS, AWS S3, HTTPS, Hugging Face, Python, PyTorch, Transformers

Omer Dayan

12 min read

Has Summary

Autodesk Research Brings Warp Speed to Computational Fluid Dynamics on NVIDIA GH200

Advanced

The article discusses Autodesk Research's development of the Accelerated Lattice Boltzmann (XLB) library, which enhances computational fluid dynamics (CFD) performance using NVIDIA's Warp and GH200...

Fortran, JAX, Numba, NumPy, Python, PyTorch, Warp

Mehdi Ataei

7 min read

Has Summary

Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6

Advanced

This article discusses the optimization of vision AI workloads using NVIDIA's CUDA-accelerated implementation of SMPTE VC-6, a codec designed for efficient interaction with modern compute architect...

Python, PyTorch, V

Andreas Kieslinger

12 min read

Includes Code

Has Summary

How Quantization Aware Training Enables Low-Precision Accuracy Recovery

Advanced

The article discusses how Quantization Aware Training (QAT) and Quantization Aware Distillation (QAD) can enhance low-precision model accuracy recovery beyond traditional Post-Training Quantization...

Hugging Face, PyTorch

Eduardo Alvarez

9 min read

Includes Code

Has Summary

Cursor

Intermediate

Improving Cursor Tab with online RL

The article discusses how Cursor enhances its Tab model for predicting developer actions using online reinforcement learning.

Copilot, PyTorch

Jacob Jackson, Phillip Kravtsov, Shomil Jain

6 min read

Has Summary

Advanced

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 2 of 2)

The article discusses Pinterest's transition to Moka, a next-generation data processing platform built on AWS Elastic Kubernetes Service (EKS).

AWS, Helm, Java, Kubernetes, Load Balancer, Prometheus, PySpark, Python, PyTorch, React, Terraform

Pinterest Engineering

16 min read

Has Summary

Developers Can Now Get NVIDIA CUDA Directly from Their Favorite Third-Party Platforms

Beginner

NVIDIA is simplifying the deployment of its CUDA software stack by collaborating with various third-party platforms, enabling developers to access CUDA directly through their preferred package mana...

OpenCV, Python, PyTorch

Jonathan Bentz

3 min read

Has Summary