Kubernetes Programming Tutorials & Engineering Articles

Safeguarding Dynamic Configuration Changes at Scale

Advanced

The article discusses how Airbnb manages dynamic configuration changes safely and reliably at scale.

AWS, Git, Kubernetes

Cosmo W. Q

9 min read

Has Summary

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

Advanced

The article discusses how NVIDIA Run:ai enhances AI workload performance through dynamic GPU fractioning, enabling efficient resource allocation and high throughput for large language models (LLMs).

Balancing cost and reliability for Spark on Kubernetes

Boskey Savla

12 min read

Has Summary

Notion

Intermediate

The article discusses the development and implementation of Spot Balancer, a tool created by Notion in collaboration with AWS to optimize the cost and reliability of running Apache Spark on Kuberne...

AWS, Kubernetes, Redis

Justin Lee

7 min read

Includes Code

Has Summary

Advanced

Is it over for metrics?

The article discusses the evolving role of metrics in observability, emphasizing that while they remain important, their function is shifting towards being an optimization layer rather than the cor...

Kubernetes, Prometheus

7 min read

Includes Code

Has Summary

Quantifying infrastructure noise in agentic coding evals

Intermediate

ClickHouse Release 26.1

ClickHouse 26.1 is a major release featuring 25 new features, 43 performance optimizations, and 176 bug fixes.

Kubernetes, SQL

17 min read

Includes Code

Has Summary

Anthropic

Intermediate

This article from Anthropic's engineering team quantifies how infrastructure configuration—specifically resource allocation and enforcement methodology—introduces significant noise into agentic cod...

Claude, Kubernetes, scikit-learn

9 min read

Includes Code

Has Summary

Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare

Advanced

The article discusses the introduction of time-based fairshare in NVIDIA Run:ai v2.

Kubernetes, Prometheus, YAML

Ekin Karabulut

11 min read

Has Summary

Shopify

Intermediate

SkyPilot at Shopify: Multi-cloud GPUs without the pain

Shopify uses SkyPilot, an open-source framework, to manage GPU-intensive ML training workloads across multiple cloud providers (Nebius and GCP).

Docker, Kubernetes, Machine Learning, YAML

Javier Moreno

7 min read

Includes Code

Has Summary

Palantir

Intermediate

Securing Agents in Production (Agentic Runtime, #1)

The article discusses securing agents in production using Palantir's Agentic Runtime, focusing on the security architecture necessary for operational AI agents.

Scaling PostgreSQL to power 800 million ChatGPT users

Palantir

13 min read

Has Summary

OpenAI

Advanced

OpenAI details how they scaled PostgreSQL to support 800 million ChatGPT users, achieving millions of queries per second through a single-primary architecture with nearly 50 read replicas across mu...

Azure, Azure Cosmos DB, Caching, Kubernetes, PostgreSQL, SQL

Bohan Zhang

13 min read

Has Summary

Build and Orchestrate End-to-End SDG Workflows with NVIDIA Isaac Sim and NVIDIA OSMO

Intermediate

The article discusses how to build and orchestrate end-to-end synthetic data generation (SDG) workflows using NVIDIA Isaac Sim and NVIDIA OSMO.

Azure, Gradio, Kubernetes, PostgreSQL, Python, Redis, YAML

Asawaree Bhide

11 min read

Includes Code

Has Summary

From Monitoring to Observability: Our Ultra-Marathon to a Cloud-Native Platform

Intermediate

Uber Engineering details their migration from a legacy monolithic monitoring system to a modern, cloud-native observability platform for their corporate network infrastructure.

Elasticsearch, FastAPI, Grafana, Kubernetes, PagerDuty, Prometheus, Redis

Razvan Cicu, Giovanni Pepe

9 min read

Has Summary

Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer

Advanced

The article discusses the NVIDIA Rubin platform, which introduces six new chips designed to create a powerful AI supercomputer.

Assembly, Hugging Face, JAX, Kubernetes, Less, PyTorch, RLHF, Transformer

Kyle Aubrey

59 min read

Has Summary

Enabling Horizontal Autoscaling of Enterprise RAG Components on Kubernetes

Advanced

This article discusses the implementation of horizontal autoscaling for Retrieval-Augmented Generation (RAG) components on Kubernetes, focusing on NVIDIA's microservices architecture.

Docker, Grafana, Helm, Kubernetes, Microservices, Prometheus

Juana Nakfour

23 min read

Includes Code

Has Summary

OpenAI

Intermediate

OpenAI co-founds the Agentic AI Foundation under the Linux Foundation

OpenAI has co-founded the Agentic AI Foundation (AAIF) under the Linux Foundation to promote open-source agentic AI.

AWS, Copilot, Gemini, Kubernetes, Node.js, PyTorch

OpenAI

5 min read

Has Summary

Automate Kubernetes AI Cluster Health with NVSentinel

Intermediate

The article discusses NVSentinel, an open-source system designed to automate the monitoring and health management of Kubernetes AI clusters, particularly those utilizing NVIDIA GPUs.

AWS, Kubernetes

Lalit Adithya

6 min read

Includes Code

Has Summary

Build Efficient Financial Data Workflows with AI Model Distillation

Advanced

The article discusses the use of AI Model Distillation to create efficient financial data workflows, focusing on the optimization of large language models (LLMs) for applications in quantitative fi...

Deep Learning, Docker, Elasticsearch, Fine-tuning, JSON, Kubernetes, Microservices, YAML

Dhruv Desai

10 min read

Includes Code

Has Summary

Build and Run Secure, Data-Driven AI Agents

Advanced

The article discusses the deployment of secure, data-driven AI agents using NVIDIA's AI-Q Research Assistant and Enterprise RAG Blueprints on AWS.

AWS, Docker, Git, Grafana, Helm, Kubernetes, Prometheus, Serverless, Terraform

Abdullahi Olaoye

8 min read

Includes Code

Has Summary

Building Scalable and Fault-Tolerant NCCL Applications

Advanced

The article discusses the NVIDIA Collective Communications Library (NCCL) and its capabilities for building scalable and fault-tolerant applications.

Kubernetes, PyTorch

Luke Robison

11 min read

Includes Code

Has Summary

Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond

Advanced

The article discusses the introduction of a new Kubernetes abstraction called ComputeDomains, designed to facilitate secure GPU-to-GPU memory operations across node boundaries in multi-node NVLink ...

Helm, Kubernetes, PyTorch

Kevin Klues

13 min read

Includes Code

Has Summary

Streamline Complex AI Inference on Kubernetes with NVIDIA Grove

Advanced

The article discusses NVIDIA Grove, a Kubernetes API designed to streamline complex AI inference workloads by managing multicomponent systems.

Helm, Hugging Face, Kubernetes, YAML

Sanjay Chatterjee

9 min read

Includes Code

Has Summary

What's new in ClickStack. October '25.

Advanced

The October 2025 edition of What's New in ClickStack highlights significant updates to the open-source observability stack for ClickHouse, including the introduction of alerting features, customiza...

Kubernetes, PagerDuty, Python, SQL

9 min read

Includes Code

Has Summary

Building Zone Failure Resilience in Apache Pinot™ at Uber

Advanced

This article discusses the implementation of zone failure resilience in Apache Pinot at Uber, detailing strategies to ensure uninterrupted service during zone failures.

Apache, Apache Kafka, Grafana, Kubernetes

Si Lao, Christina Li, Xuanyi Li, Yang Yang, Ujwala Tulshigiri

10 min read

Has Summary

Netflix

Advanced

Supercharging the ML and AI Development Experience at Netflix with Metaflow

Netflix introduces Spin, a new feature in Metaflow 2.

AWS, Claude, Kubernetes

Netflix Technology Blog

10 min read

Includes Code

Has Summary

Advanced

A Decade of AI Platform at Pinterest

The article reflects on a decade of AI platform development at Pinterest, detailing the evolution from fragmented machine learning stacks to a unified AI platform that supports various models.

AutoML, Docker, Embedding, Generative AI, Java, Kubernetes, LightGBM, PySpark, Python, PyTorch, Seed, SQL, TensorFlow, Thrift, Transformer

Pinterest Engineering

22 min read

Has Summary

Make Sense of Video Analytics by Integrating NVIDIA AI Blueprints

Advanced

This article discusses the integration of NVIDIA AI Blueprints for enhancing video analytics through the combination of Video Search and Summarization (VSS) and Retrieval-Augmented Generation (RAG).

Ilyas Bankole-Hameed

10 min read

Includes Code

Has Summary

Streamline AI Infrastructure with NVIDIA Run:ai on Microsoft Azure

Intermediate

The article discusses how NVIDIA Run:ai enhances AI infrastructure management on Microsoft Azure by optimizing GPU utilization and simplifying workload orchestration.

Azure, Azure Blob Storage, Hugging Face, Kubernetes, PyTorch

Julie Adrounie

8 min read

Has Summary

Slack

Intermediate

Advancing Our Chef Infrastructure: Safety Without Disruption

This article details Slack's approach to making Chef infrastructure deployments safer by splitting a single production Chef environment into six bucketed environments (prod-1 through prod-6) mapped...

AWS, Chef, JSON, Kubernetes, Python, TypeScript

Archie Gunasekara

16 min read

Includes Code

Has Summary

Fly.io

Advanced

Corrosion

The article discusses Corrosion, a novel service discovery system developed by Fly.io that addresses the challenges of state synchronization in distributed systems.

Consul, Docker, Kubernetes, Rust, SQL, SQLite, Vault

Thomas Ptacek, Peter Cai

11 min read

Includes Code

Has Summary

Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems

Advanced

The article discusses the challenges and solutions for scaling large Mixture-of-Experts (MoE) models using Wide Expert Parallelism on NVIDIA's NVL72 rack-scale systems.

Kubernetes, Less, Load Balancer

Eduardo Alvarez

10 min read

Has Summary

Understanding Memory Management on Hardware-Coherent Platforms

Advanced

The article discusses memory management on hardware-coherent platforms, specifically focusing on the differences between Non-Uniform Memory Access (NUMA) and Coherent Driver-based Memory Management...

Kumar Sankaran

6 min read

Includes Code

Has Summary

From Static Rate Limiting to Adaptive Traffic Management in Airbnb’s Key-Value Store

Intermediate

This article details how Airbnb evolved the traffic management system for Mussel, their multi-tenant key-value store for derived data, from simple QPS-based rate limiting to a layered, adaptive qua...

Kubernetes, Rate Limiting, Redis

Shravan Gaonkar

11 min read

Includes Code

Has Summary

Slack

Intermediate

Deploy Safety: Reducing customer impact from change

Slack's Deploy Safety Program, launched in mid-2023, achieved a 90% reduction in customer impact hours by January 2025 through automated detection, remediation, and cultural changes across all depl...

AWS, Chef, Jenkins, Kubernetes, Python, Solid, Terraform, TypeScript

Sam Bailey

12 min read

Has Summary

Cadence Workflow Joins the Cloud Native Computing Foundation

Advanced

The article announces that the Cadence project has joined the Cloud Native Computing Foundation (CNCF), highlighting its commitment to open-source development.

Envoy, Kubernetes, Prometheus

Uber Engineering

3 min read

Has Summary

Smarter Anomaly Detection in Semiconductor Manufacturing with NVIDIA NV-Tesseract and NVIDIA NIM

Advanced

The article discusses the implementation of NVIDIA NV-Tesseract and NVIDIA NIM for smarter anomaly detection in semiconductor manufacturing.

Docker, Fine-tuning, JSON, Kubernetes

Aditi Gautam

7 min read

Includes Code

Has Summary

Enable Gang Scheduling and Workload Prioritization in Ray with NVIDIA KAI Scheduler

Advanced

The article discusses the integration of the NVIDIA KAI Scheduler with Ray, enabling advanced scheduling features like gang scheduling, workload prioritization, and autoscaling in Ray clusters.

Helm, Hugging Face, Kubernetes, YAML

Ekin Karabulut

9 min read

Includes Code

Has Summary

Smart Multi-Node Scheduling for Fast and Efficient LLM Inference with NVIDIA Run:ai and NVIDIA Dynamo

Advanced

The article discusses the integration of NVIDIA Run:ai v2.23 with NVIDIA Dynamo to address the challenges of large language model (LLM) inference across distributed environments.

Helm, Hugging Face, JSON, Kubernetes, YAML

Ekin Karabulut

9 min read

Includes Code

Has Summary

Google

Intermediate

Apigee Operator for Kubernetes and GKE Inference Gateway integration for Auth and AI/LLM policies

The article discusses the integration of the Apigee Operator for Kubernetes with the GKE Inference Gateway to enhance API management for AI and Large Language Models (LLMs).

Artificial Intelligence, Google Cloud, Kubernetes, Large Language Models, OpenAI API

Sanjay Pujare, Jennifer Bennett

4 min read

Includes Code

Has Summary

Building a Next-Generation Key-Value Store at Airbnb

Intermediate

Airbnb completely rearchitected Mussel, their core key-value store for derived data, migrating from v1 to v2 with a NewSQL backend.

Chef, Kubernetes

Shravan Gaonkar

10 min read

Has Summary

Uber’s Strategy to Upgrading 2M+ Spark Jobs

Advanced

Uber's migration from Spark 2.4 to Spark 3.3 involved upgrading over 2 million Spark applications, utilizing innovative automation tools like Iron Dome.

Apache, Apache Spark, Java, Kubernetes, MySQL, Oracle, PySpark, Python, Scala, SQL

Amruth Sampath, Arnav Balyan, Nimesh Khandelwal, Sumit Singh, Parth Halani, Suprit Acharya

8 min read

Has Summary

Breaking free from rising observability costs with open, cost-efficient architectures

Intermediate

The article discusses the rising costs associated with observability in software engineering and proposes a shift towards open, cost-efficient architectures.

Apache, Claude, Datadog, Elasticsearch, Grafana, JSON, Kubernetes, Prometheus, Splunk, SQL

Mike Shi

13 min read

Has Summary

Viaduct, Five Years On: Modernizing the Data-Oriented Service Mesh

Intermediate

The article discusses the evolution and modernization of Viaduct, Airbnb's data-oriented service mesh, highlighting its transition to open-source software.

GraphQL, Kotlin, Kubernetes, Microservices, Service Mesh, TypeScript

Adam Miskiewicz

10 min read

Includes Code

Has Summary

Deploy Scalable AI Inference with NVIDIA NIM Operator 3.0.0

Intermediate

The article discusses the deployment of scalable AI inference using NVIDIA NIM Operator 3.0.0, highlighting its capabilities in managing AI inference pipelines across Kubernetes environments.

Generative AI, Hugging Face, Kubernetes, Microservices, Serverless

Meenakshi Kaushik

6 min read

Includes Code

Has Summary

Advanced

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 2 of 2)

The article discusses Pinterest's transition to Moka, a next-generation data processing platform built on AWS Elastic Kubernetes Service (EKS).

AWS, Helm, Java, Kubernetes, Load Balancer, Prometheus, PySpark, Python, PyTorch, React, Terraform

Pinterest Engineering

16 min read

Has Summary

Instrumenting your NextJS application with OpenTelemetry and ClickStack

Advanced

This article discusses how to instrument a Next.js application using OpenTelemetry and ClickStack, focusing on the integration of observability and analytics through ClickHouse.

Apache, Grafana, JavaScript, JSON, Kubernetes, New Relic, Next.js, Node.js, Python, React, SQL, Vercel

Dale McDiarmid

19 min read

Includes Code

Has Summary

Palantir

Advanced

How Palantir Mastered In-Toto

The article discusses how Palantir implemented in-toto to enhance their software supply chain security, detailing the challenges faced and lessons learned throughout the process.

Docker, Git, Helm, Kubernetes

Palantir

27 min read

Has Summary

Intermediate

Developer Experience at Pinterest: The Journey to PinConsole

The article discusses Pinterest's journey in enhancing developer experience through the creation of PinConsole, an Internal Developer Platform built on Backstage.

AWS, AWS RDS, Caching, CDN, GraphQL, Kubernetes, OAuth, PagerDuty, PostgreSQL, React

Pinterest Engineering

15 min read

Has Summary

NVIDIA Hardware Innovations and Open Source Contributions Are Shaping AI

Advanced

The article discusses how NVIDIA's hardware innovations, particularly the Blackwell architecture and NVFP4 precision, along with their open source contributions, are driving advancements in AI.

GPT, Hugging Face, JAX, Kubernetes, Python, PyTorch, Transformer

George Chellapa

8 min read

Has Summary