Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer

Kyle Aubrey

AI has entered an industrial phase. What began as systems performing discrete AI model training and human-facing inference has evolved into always-on AI…

NVIDIA



Overview

The article examines the NVIDIA Rubin platform, which combines six new chips into a single AI supercomputer. It argues that modern AI factories need a new architectural approach built on extreme co-design, in which hardware and software are developed together for sustained performance and efficiency.

What You'll Learn

1. How to leverage the NVIDIA Rubin platform for AI factory deployments

2. Why extreme co-design is essential for modern AI workloads

3. How to optimize power and cooling in AI data centers

Key Questions Answered

What are the key features of the NVIDIA Rubin platform?

The NVIDIA Rubin platform features six new chips, including the Vera CPU and Rubin GPU, designed for high-performance AI workloads. It emphasizes extreme co-design for efficient power, cooling, and data movement, enabling sustained performance and lower costs per token.

How does the Rubin platform improve AI factory economics?

The Rubin platform lowers the cost per token while increasing tokens per watt and tokens per rack. By maximizing utilization and minimizing operational friction, it transforms AI factory operations from traditional batch processing to continuous, efficient intelligence production.

What advancements does the Vera CPU bring to AI factories?

The Vera CPU features 88 custom-designed Olympus cores optimized for AI workloads, providing high bandwidth and low latency for data movement. This design enhances GPU utilization and supports efficient orchestration across training and inference tasks.

What is the significance of NVLink 6 in the Rubin platform?

NVLink 6 provides 3.6 TB/s of bidirectional GPU-to-GPU bandwidth per GPU, enabling all-to-all communication across 72 GPUs in a single rack. This bandwidth matters most for communication-heavy workloads, where it improves efficiency and reduces latency.
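As a back-of-the-envelope illustration of the NVLink 6 figures, the sketch below multiplies the quoted bandwidth by the number of GPUs in a rack. It assumes the 3.6 TB/s figure is a per-GPU number, and the function name is hypothetical, chosen only for this example.

```python
def aggregate_nvlink_bandwidth_tbps(gpus: int, per_gpu_tbps: float) -> float:
    """Return the summed bidirectional NVLink bandwidth of a rack, in TB/s.

    Assumption: every GPU contributes the same per-GPU bandwidth.
    """
    if gpus < 0 or per_gpu_tbps < 0:
        raise ValueError("gpus and per_gpu_tbps must be non-negative")
    return gpus * per_gpu_tbps


# Figures quoted in the summary: 72 GPUs per rack, 3.6 TB/s each (assumed per GPU).
rack_total = aggregate_nvlink_bandwidth_tbps(gpus=72, per_gpu_tbps=3.6)
print(f"Aggregate NVLink bandwidth: {rack_total:.1f} TB/s")
```

Under that assumption the rack-level total comes to roughly 259 TB/s. This is a sum of link capacities, not a measured throughput for any particular workload.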

Key Statistics & Figures

Cost per token for inference

up to 10x lower compared to Blackwell NVL72

This improvement is particularly notable in interactive agent workloads, where responsiveness is critical.

Tokens per second per GPU

up to 10x higher throughput than Blackwell NVL72

This performance is achieved under interactive operating conditions, showcasing the efficiency of the Rubin architecture.

Power efficiency improvement

up to 30% more compute provisioning within the same power envelope

This is facilitated by the power smoothing and energy storage mechanisms integrated into the Rubin platform.
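The "up to" figures above are ratios, so a short calculation can make them concrete. The sketch below is illustrative only: the baseline cost, facility power budget, and per-rack power are hypothetical placeholders rather than NVIDIA data, and the helper names are invented for this example. It applies the quoted upper bounds (10x lower cost per token, 30% more compute in the same power envelope) to those placeholders, and treats "more compute provisioning" as "more racks", which is a simplification.

```python
def scaled_cost_per_token(baseline_cost: float, reduction_factor: float) -> float:
    """Apply an 'N times lower' claim to a baseline cost per token."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return baseline_cost / reduction_factor


def racks_in_power_envelope(facility_kw: int, rack_kw: int, gain_pct: int = 0) -> int:
    """Racks that fit in a fixed facility power budget.

    gain_pct models extra provisioning headroom (for example from power
    smoothing) as a whole-number percentage; integer math avoids float rounding.
    """
    if rack_kw <= 0:
        raise ValueError("rack_kw must be positive")
    baseline_racks = facility_kw // rack_kw
    return baseline_racks * (100 + gain_pct) // 100


# Hypothetical placeholders, not published figures.
BASELINE_COST_PER_M_TOKENS = 2.0   # dollars per million tokens (assumed)
FACILITY_KW = 10_000               # facility power budget (assumed)
RACK_KW = 100                      # power per rack (assumed)

best_case_cost = scaled_cost_per_token(BASELINE_COST_PER_M_TOKENS, 10)
baseline_racks = racks_in_power_envelope(FACILITY_KW, RACK_KW)
best_case_racks = racks_in_power_envelope(FACILITY_KW, RACK_KW, gain_pct=30)

print(f"Cost per million tokens: {BASELINE_COST_PER_M_TOKENS} -> {best_case_cost:.2f}")
print(f"Racks in the same power envelope: {baseline_racks} -> {best_case_racks}")
```

Because each source figure is an upper bound, the outputs are best-case values for the chosen placeholders, not forecasts.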

Technologies & Tools

Processor

NVIDIA Vera CPU

Optimized for data movement and orchestration in AI workloads.

GPU

NVIDIA Rubin GPU

Designed for high-performance AI compute with enhanced memory bandwidth.

Interconnect

NVIDIA NVLink 6

Provides high bandwidth for GPU-to-GPU communication.

Data Processing Unit

NVIDIA BlueField-4 DPU

Handles control, security, and orchestration for AI factories.

Network Switch

NVIDIA Spectrum-6 Ethernet Switch

Facilitates scale-out connectivity for AI workloads.

Key Actionable Insights

1. Implementing the NVIDIA Rubin platform can improve AI workload performance through its extreme co-design across compute, networking, power, and cooling. This is most relevant for organizations that need to scale AI capacity while keeping performance high and operational costs low.

2. Pairing the Vera CPU with Rubin GPUs improves data orchestration and memory access, which leads to higher GPU utilization. This matters for applications that require continuous data processing and real-time inference, making it a key consideration in AI factory design.

3. Adopting NVLink 6 can reduce GPU-to-GPU communication bottlenecks, allowing faster data transfer and better overall system performance. This is vital for models that depend on rapid data exchange between GPUs, especially in large-scale training.

Related Concepts

AI Factory Architecture

Extreme Co-design Principles

NVIDIA AI Enterprise Software Stack