# PyTorch Programming Tutorials & Engineering Articles
716 PyTorch tutorials, guides, and engineering insights from NVIDIA, Meta, Uber, and more
PyTorch Articles & Tutorials
The article discusses using NVFP4 low-precision training to achieve higher throughput without sacrificing model accuracy.
Aditya Vavre
7 min read
Includes Code
Has Summary
--
The article discusses how the NVIDIA cuda.compute library enables Python developers to write high-performance GPU code without resorting to C++.
The article discusses how NVIDIA's hardware-software co-design significantly enhanced the inference performance of Sarvam AI's Sovereign 30B model, achieving a 4x speedup on NVIDIA Blackwell archit...
Utkarsh Uppal
14 min read
Has Summary
--
The article discusses NVIDIA TensorRT LLM AutoDeploy, a beta feature that automates the inference optimization process for large language models (LLMs).
Lucas Liebenwein
8 min read
Includes Code
Has Summary
--
Kimi K2.5 is an advanced multimodal vision language model (VLM) developed by Kimi, optimized for various AI tasks.
Anu Srivastava
4 min read
Includes Code
Has Summary
--
The article discusses the challenges of Expert Parallel communication in training Mixture-of-Experts (MoE) models and introduces Hybrid-EP, an efficient communication solution that leverages NVIDIA...
This article discusses the re-architecture of the serving stack for next-generation ads lightweight ranking models at Pinterest, moving from a traditional Two-Tower architecture to a more complex G...
Pinterest Engineering
11 min read
Has Summary
--
The article discusses the Universal Sparse Tensor (UST), a framework designed to efficiently handle sparse tensors across various applications, including scientific computing and deep learning.
LiteRT has evolved from its TensorFlow Lite foundation into a universal on-device AI inference framework, now offering production-ready GPU acceleration across six platforms and streamlined NPU int...
Lu Wang, Chintan Parikh, Jingjiang Li, Terry Heo
9 min read
Includes Code
Has Summary
--
The article discusses the transition from the traditional two-phase API of the CUB library to a new single-call API introduced in CUDA 13.1.
Giannis Gonidelis
8 min read
Includes Code
Has Summary
--
This article provides a detailed guide on implementing high-performance matrix multiplication using NVIDIA's cuTile framework in CUDA.
The article discusses NVIDIA's advancements in AI model inference performance through the Blackwell architecture, emphasizing improvements in token throughput per watt and the enhancements made to ...
Ashraf Eassa
5 min read
Has Summary
--
The article discusses how recent upgrades to open source AI tools enhance the performance of small language models (SLMs) and diffusion models on NVIDIA RTX PCs.
Annamalai Chockalingam
7 min read
Has Summary
--
The article discusses the latest software and model optimizations for NVIDIA DGX Spark, highlighting significant performance improvements in AI workflows.
Allen Bourgoyne
5 min read
Has Summary
--
The article discusses the NVIDIA Rubin platform, which introduces six new chips designed to create a powerful AI supercomputer.
Kyle Aubrey
59 min read
Has Summary
--
NVIDIA introduces the Jetson T4000, enhancing AI and real-time reasoning for robotics and edge AI applications with up to 1200 FP4 TFLOPs of AI compute and 64 GB of memory.
The article discusses the NVIDIA ALCHEMI Toolkit-Ops, a specialized toolkit designed to accelerate AI-powered atomistic simulations in chemistry and materials science.
This article discusses how to rapidly simulate robotic environments using NVIDIA Isaac Sim and World Labs Marble.
The article discusses the integration of AI Physics into Technology Computer-Aided Design (TCAD) simulations, highlighting its significance in semiconductor manufacturing.
Ram Cherukuri
7 min read
Has Summary
--
OpenAI has co-founded the Agentic AI Foundation (AAIF) under the Linux Foundation to promote open-source agentic AI.
The article discusses the evolution and scaling of Uber's Delivery Search Platform, emphasizing the transition from traditional lexical search to a semantic search model that enhances user experien...
Divya Nagar, Zheng Liu, Jiasen Xu, Bo Ling, Haoyang Chen
11 min read
Has Summary
--
The article introduces Zoomer, Meta's automated debugging and optimization platform designed to enhance AI performance across its extensive infrastructure.
The article discusses how CuTe DSL, a new Python API for CUTLASS 4, simplifies GPU kernel development by reducing compilation times while maintaining performance comparable to CUTLASS C++.
Brandon Sun
8 min read
Includes Code
Has Summary
--
The article discusses the NVIDIA Collective Communications Library (NCCL) and its capabilities for building scalable and fault-tolerant applications.
Luke Robison
11 min read
Includes Code
Has Summary
--
The article discusses how NVIDIA's CorrDiff model leverages generative AI for downscaling weather predictions, significantly improving efficiency and reducing computational costs.
Alicia Sui
11 min read
Includes Code
Has Summary
--
This article discusses how to achieve 4x faster inference for math problem solving using large language models by optimizing the serving stack, quantization strategy, and decoding methods.
Igor Gitman
7 min read
Includes Code
Has Summary
--
The article discusses the introduction of a new Kubernetes abstraction called ComputeDomains, designed to facilitate secure GPU-to-GPU memory operations across node boundaries in multi-node NVLink ...
Kevin Klues
13 min read
Includes Code
Has Summary
--
Meta's Generative Ads Recommendation Model (GEM) is a cutting-edge foundation model designed to enhance ad performance and advertiser ROI by improving the relevance of ad recommendations.
Huayu Li
12 min read
Has Summary
--
The article discusses how NVIDIA cuVS enhances GPU-accelerated vector search in the Faiss library, providing significant performance improvements for similarity search and clustering of dense vecto...
The article discusses the creation of a website for tracking team activity across GitHub repositories, initially intended as a single report but evolved into a comprehensive tool for comparing vari...
The article discusses how NVIDIA's NeMo Automodel simplifies the training of large-scale mixture-of-experts (MoE) models in PyTorch, making it accessible to a broader audience.
Hemil Desai
7 min read
Includes Code
Has Summary
--
The article discusses how to scale biology transformer models using PyTorch and NVIDIA BioNeMo Recipes, focusing on advanced parallel computing techniques and the integration of the NVIDIA Transfor...
Kyle Tretina
6 min read
Includes Code
Has Summary
--
The article reflects on a decade of AI platform development at Pinterest, detailing the evolution from fragmented machine learning stacks to a unified AI platform that supports various models.
AutoML, Docker, Embedding, Generative AI, Java, Kubernetes, LightGBM, PySpark, Python, PyTorch, Seed, SQL, TensorFlow, Thrift, Transformer
Pinterest Engineering
22 min read
Has Summary
--
The article discusses Meta's implementation of invisible watermarking technology for video content, focusing on its applications for content provenance, AI detection, and source identification.
Wes Castro
10 min read
Has Summary
--
The article discusses how NVIDIA Run:ai enhances AI infrastructure management on Microsoft Azure by optimizing GPU utilization and simplifying workload orchestration.
Julie Adrounie
8 min read
Has Summary
--
The article discusses Composer, a new agent model designed for software engineering that achieves coding results four times faster than similar models.
The article discusses how the NVIDIA DGX Spark supercomputer enhances performance for intensive AI tasks, providing a local alternative to cloud computing.
Allen Bourgoyne
5 min read
Has Summary
--
This article discusses how Uber has integrated explainability into its machine learning platform, Michelangelo, using Integrated Gradients (IG) to provide interpretable attributions for deep learni...
Hugh Chen, Eric Wang, Gaoyuan Huang, Howard Yu, Jia Li, Sally Lee
14 min read
Has Summary
--
The article discusses the integration of machine learning interatomic potentials (MLIPs) into molecular dynamics (MD) simulations using the ML-IAP-Kokkos interface within the LAMMPS MD package.
The article introduces Coral NPU, a full-stack, open-source platform designed to enhance Edge AI capabilities on low-power devices.
Billy Rutledge
8 min read
Has Summary
--
Train a Quadruped Locomotion Policy and Simulate Cloth Manipulation with NVIDIA Isaac Lab and Newton
This article discusses the integration of the Newton physics engine with NVIDIA Isaac Lab for training quadruped locomotion policies and simulating cloth manipulation.
The article discusses Meta's evolution in infrastructure over 21 years, highlighting the significant changes brought about by AI.
Yee Jiun Song
20 min read
Has Summary
--
The article discusses the challenges of cold start latency in deploying large language models (LLMs) and introduces the NVIDIA Run:ai Model Streamer, an open-source Python SDK designed to optimize ...
Omer Dayan
12 min read
Has Summary
--
The article discusses Autodesk Research's development of the Accelerated Lattice Boltzmann (XLB) library, which enhances computational fluid dynamics (CFD) performance using NVIDIA's Warp and GH200...
This article discusses the optimization of vision AI workloads using NVIDIA's CUDA-accelerated implementation of SMPTE VC-6, a codec designed for efficient interaction with modern compute architect...
The article discusses how Quantization Aware Training (QAT) and Quantization Aware Distillation (QAD) can enhance low-precision model accuracy recovery beyond traditional Post-Training Quantization...
Eduardo Alvarez
9 min read
Includes Code
Has Summary
--
The article discusses how Cursor enhances its Tab model for predicting developer actions using online reinforcement learning.
The article discusses Pinterest's transition to Moka, a next-generation data processing platform built on AWS Elastic Kubernetes Service (EKS).
Pinterest Engineering
16 min read
Has Summary
--
NVIDIA is simplifying the deployment of its CUDA software stack by collaborating with various third-party platforms, enabling developers to access CUDA directly through their preferred package mana...
The article discusses how to enhance the efficiency of Large Language Models (LLMs) during inference by utilizing CPU-GPU memory sharing through NVIDIA's NVLink C2C technology.
Afroze Syed
6 min read
Includes Code
Has Summary
--