How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s

Utkarsh Uppal

As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost…

NVIDIA

Hugging Face, PyTorch, Transformer

Overview

The article describes how NVIDIA's hardware-software co-design improved the inference performance of Sarvam AI's sovereign Sarvam 30B model, achieving a 4x speedup on NVIDIA Blackwell over baseline NVIDIA H100 GPUs. The collaboration focused on optimizing the model while meeting strict latency and cost requirements.

What You'll Learn

1. How to optimize large language models for inference performance using NVIDIA GPUs
2. Why kernel-level optimizations are crucial for reducing latency in AI models
3. When to implement disaggregated serving to improve throughput in AI applications
4. How to leverage mixed scheduling strategies for better GPU utilization

Prerequisites & Requirements

Understanding of AI model architectures and inference optimization techniques
Familiarity with NVIDIA GPUs and related software frameworks (optional)

Key Questions Answered

What performance improvements were achieved with the Sarvam 30B model?

The Sarvam 30B model achieved a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs, with specific kernel and scheduling optimizations contributing to a 2x speedup and NVFP4 weight quantization providing an additional 2x speedup.
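The 4x figure is consistent with the two quoted 2x contributions compounding multiplicatively. As a minimal arithmetic sketch (the function and dictionary names are illustrative, and the factors are the rounded values quoted above, not measured data):

```python
from math import prod

def combined_speedup(factors):
    """Overall speedup when each factor applies on top of the previous one."""
    return prod(factors)

# Rounded factors quoted in the summary (assumed to compound):
# ~2x from kernel and scheduling work, then ~2x from NVFP4 weight quantization.
stages = {"kernel_and_scheduling": 2.0, "nvfp4_quantization": 2.0}
total = combined_speedup(stages.values())
print(f"overall speedup: {total:.1f}x")
```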

How does the mixture-of-experts architecture enhance model performance?

The Sarvam models use a mixture-of-experts (MoE) architecture with a shared expert design, aimed at scaling reasoning ability and linguistic density. The same design brings high active parameter counts and complex memory access patterns, which is what makes inference optimization demanding.

What are the service level agreements (SLAs) for the Sarvam 30B model?

The SLAs established for the Sarvam 30B model include a P95 time to first token (TTFT) of less than 1000 ms and a P95 inter-token latency (ITL) of less than 15 ms, ensuring a responsive user experience even under load.
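To make the SLA concrete, the sketch below checks a set of latency samples against the two P95 limits. It assumes a nearest-rank percentile definition and hypothetical function names; the article does not say how its percentiles were computed.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

def meets_sla(ttft_ms, itl_ms, ttft_limit_ms=1000.0, itl_limit_ms=15.0):
    """True when P95 TTFT and P95 ITL are both strictly under their limits."""
    p95_ttft = percentile_nearest_rank(ttft_ms, 95)
    p95_itl = percentile_nearest_rank(itl_ms, 95)
    return p95_ttft < ttft_limit_ms and p95_itl < itl_limit_ms
```

Because the limits apply at P95, a single slow outlier in twenty requests does not violate the SLA, but two do.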

What optimizations were made to the MoE routing mechanism?

The MoE routing mechanism was optimized by implementing a Fused TopK kernel that combines logit computation and selection into a single CUDA kernel, significantly reducing latency and improving throughput.
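The idea behind fusing logit computation and selection can be illustrated on the CPU. The following is a plain Python sketch, not the CUDA kernel from the article: `route_unfused` and `route_fused` are hypothetical names, and the router is assumed to be a simple dot product between the token vector and one weight row per expert.

```python
import heapq

def route_unfused(token, router_weights, k):
    """Two passes: materialize every expert logit, then select the top k."""
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    ranked = sorted(range(len(logits)), key=lambda e: (-logits[e], e))
    return [(e, logits[e]) for e in ranked[:k]]

def route_fused(token, router_weights, k):
    """One pass: compute each logit and update a size-k heap immediately."""
    heap = []  # min-heap of (logit, -expert_id); the root is the weakest kept expert
    for expert_id, row in enumerate(router_weights):
        logit = sum(t * w for t, w in zip(token, row))
        item = (logit, -expert_id)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    best = sorted(heap, reverse=True)  # highest logit first, lower id wins ties
    return [(-neg_id, logit) for logit, neg_id in best]

token = [1.0, 2.0]
router_weights = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(route_fused(token, router_weights, k=2))
```

Both functions return the same experts; the fused version never stores the full logit list, which loosely mirrors how a fused GPU kernel avoids writing intermediate logits to memory and launching a second kernel.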

Key Statistics & Figures

Inference speedup on NVIDIA Blackwell

4x

Achieved over baseline NVIDIA H100 GPUs

P95 time to first token (TTFT)

< 1000 ms

Service level agreement for the Sarvam 30B model

P95 inter-token latency (ITL)

< 15 ms

Service level agreement for the Sarvam 30B model

Total layer time reduction

1.34x faster

Total time per transformer layer in a prefill iteration

Throughput improvement with mixed chunks

1.15x

Compared to separate prefill and decode chunks

Throughput increase with disaggregated serving

1.5x

Compared to baseline H100 performance

Technologies & Tools

Hardware

NVIDIA Blackwell

Used to enhance inference performance for Sarvam AI's models

Hardware

NVIDIA H100

Baseline GPU for performance comparisons

Software

NVIDIA NeMo Framework

Framework used for training and optimizing Sarvam AI's models

Software

NVIDIA Nemotron

Libraries for training, fine-tuning, and deploying models

Key Actionable Insights

1. Implement kernel-level optimizations to reduce latency in AI models.
Replacing standard implementations with architecture-specific fused kernels yielded significant speedups in the Sarvam 30B model optimizations.

2. Use mixed prefill and decode scheduling to improve GPU utilization.
This strategy makes better use of each iteration's capacity, leading to a 15% increase in total system throughput while maintaining SLA requirements.

3. Consider disaggregated serving for models that fit within a single GPU's memory.
This approach can eliminate inter-GPU communication overhead, resulting in a 1.5x increase in decode throughput, as shown in the Sarvam 30B model's performance.
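The mixed prefill and decode scheduling in the second insight can be sketched as a per-iteration token budget shared by both phases. This is a simplified illustration under stated assumptions: the `Request` class, the `build_mixed_batch` function, and the "decodes first, then fill with prefill chunks" policy are hypothetical and not taken from the article.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    prompt_tokens_left: int  # tokens still waiting for prefill; 0 means decoding

def build_mixed_batch(requests, token_budget):
    """Return {req_id: tokens_this_step}: decode steps first, then prefill chunks."""
    batch = {}
    remaining = token_budget
    # Decode steps cost one token each and are latency sensitive, so place them first.
    for req in requests:
        if req.prompt_tokens_left == 0 and remaining > 0:
            batch[req.req_id] = 1
            remaining -= 1
    # Fill whatever budget is left with chunks of pending prompts.
    for req in requests:
        if req.prompt_tokens_left > 0 and remaining > 0:
            chunk = min(req.prompt_tokens_left, remaining)
            batch[req.req_id] = chunk
            req.prompt_tokens_left -= chunk
            remaining -= chunk
    return batch

requests = [Request("a", 0), Request("b", 0), Request("c", 10)]
first_step = build_mixed_batch(requests, token_budget=8)
second_step = build_mixed_batch(requests, token_budget=8)
```

With a budget of 8 tokens, the two decoding requests take one token each and the remaining 6 go to the prompt of request "c", so the iteration is full instead of running a small decode-only batch.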

Common Pitfalls

1. Neglecting the importance of kernel optimizations can lead to suboptimal model performance.
Without targeted optimizations, models may fail to meet latency requirements, resulting in poor user experiences.

2. Overlooking the benefits of disaggregated serving can hinder throughput improvements.
Failing to separate prefill and decode processes can introduce unnecessary overhead, limiting the model's ability to scale effectively.

Related Concepts

Mixture-of-experts Architecture

Kernel Optimization Techniques

AI Model Performance Metrics

NVIDIA GPU Architectures