How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s

As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost…

Utkarsh Uppal
14 min read · advanced

Overview

The article describes how NVIDIA's hardware-software co-design improved the inference performance of Sarvam AI's Sovereign 30B model, reaching a 4x speedup on NVIDIA Blackwell over baseline NVIDIA H100 GPUs. The collaboration focused on raising throughput while staying within strict latency and cost requirements.

What You'll Learn

1. How to optimize large language models for inference performance using NVIDIA GPUs
2. Why kernel-level optimizations are crucial for reducing latency in AI models
3. When to implement disaggregated serving to improve throughput in AI applications
4. How to leverage mixed scheduling strategies for better GPU utilization

Prerequisites & Requirements

  • Understanding of AI model architectures and inference optimization techniques
  • Familiarity with NVIDIA GPUs and related software frameworks (optional)

Key Questions Answered

What performance improvements were achieved with the Sarvam 30B model?
The Sarvam 30B model achieved a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs. Kernel and scheduling optimizations contributed a 2x speedup, and NVFP4 weight quantization provided a further 2x; the two gains compound to 4x.
How does the mixture-of-experts architecture enhance model performance?
The Sarvam models use a mixture-of-experts (MoE) architecture with a shared expert design, aimed at reasoning and linguistic density. The design scales model capacity efficiently, but it also brings high active parameter counts and complex memory access patterns, which is what makes inference optimization necessary.
What are the service level agreements (SLAs) for the Sarvam 30B model?
The SLAs established for the Sarvam 30B model include a P95 time to first token (TTFT) of less than 1000 ms and a P95 inter-token latency (ITL) of less than 15 ms, ensuring a responsive user experience even under load.
What optimizations were made to the MoE routing mechanism?
The MoE routing mechanism was optimized by implementing a Fused TopK kernel that combines logit computation and selection into a single CUDA kernel, significantly reducing latency and improving throughput.
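
The actual fused kernel is written in CUDA and is not reproduced in this summary. As a rough illustration of what such a kernel computes, the following plain-Python sketch scores every expert for one token and keeps a running top-k in the same pass, so the full list of logits is never stored. The function name, the weight layout, and the softmax over the selected logits are assumptions made for illustration; they are not taken from the Sarvam or NVIDIA implementation.

```python
import heapq
import math

def route_token(hidden, router_weights, k):
    """Pick the top-k experts for one token in a single pass over experts.

    hidden: list of floats (the token's activation vector).
    router_weights: one weight vector per expert (hypothetical layout).
    Returns [(expert_id, gate_weight), ...] sorted by descending gate weight.
    """
    if not 0 < k <= len(router_weights):
        raise ValueError("k must be between 1 and the number of experts")
    heap = []  # min-heap of (logit, -expert_id): the best k seen so far
    for expert_id, w in enumerate(router_weights):
        # Logit computation and selection happen in the same loop iteration.
        logit = sum(h * wi for h, wi in zip(hidden, w))
        item = (logit, -expert_id)  # ties go to the lower expert id
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    top = sorted(heap, reverse=True)
    # Softmax over the selected logits only (an assumed normalization).
    peak = top[0][0]
    exps = [math.exp(logit - peak) for logit, _ in top]
    total = sum(exps)
    return [(-neg_id, e / total) for (_, neg_id), e in zip(top, exps)]

# Toy example: 4 experts, hidden size 2, route to the best 2 experts.
routes = route_token([1.0, 2.0],
                     [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
                     k=2)
```

On a GPU the benefit of fusing comes from avoiding a separate kernel launch and a round trip through global memory for the logits; the Python loop only mirrors the data flow, not the performance.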

Key Statistics & Figures

Inference speedup on NVIDIA Blackwell: 4x (achieved over baseline NVIDIA H100 GPUs)
P95 time to first token (TTFT): < 1000 ms (service level agreement for the Sarvam 30B model)
P95 inter-token latency (ITL): < 15 ms (service level agreement for the Sarvam 30B model)
Total layer time reduction: 1.34x faster (total time per transformer layer in a prefill iteration)
Throughput improvement with mixed chunks: 1.15x (compared to separate prefill and decode chunks)
Throughput increase with disaggregated serving: 1.5x (compared to baseline H100 performance)
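
The P95 TTFT and ITL limits listed above can be checked against measured latencies with a few lines of Python. This is a minimal sketch that assumes the nearest-rank definition of a percentile; the article summary does not say which percentile method or benchmarking tool was used, and the sample values below are invented for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

def meets_sla(ttft_ms, itl_ms, ttft_limit_ms=1000.0, itl_limit_ms=15.0):
    """Return (ok, p95_ttft, p95_itl) for the limits quoted in the summary."""
    p95_ttft = percentile(ttft_ms, 95)
    p95_itl = percentile(itl_ms, 95)
    ok = p95_ttft < ttft_limit_ms and p95_itl < itl_limit_ms
    return ok, p95_ttft, p95_itl

# Hypothetical measurements in milliseconds, for illustration only.
ttft_samples = [420.0, 510.0, 380.0, 950.0, 610.0,
                700.0, 455.0, 530.0, 480.0, 1200.0]
itl_samples = [11.2, 12.0, 10.8, 13.5, 14.1,
               12.7, 11.9, 12.3, 13.0, 14.9]
ok, p95_ttft, p95_itl = meets_sla(ttft_samples, itl_samples)
```

With only ten samples the nearest-rank P95 is the slowest request, so the single 1200 ms outlier is enough to fail the TTFT limit here even though the ITL limit is met. Real SLA checks would use far more samples collected under production-like load.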

Technologies & Tools

Hardware: NVIDIA Blackwell. Used to enhance inference performance for Sarvam AI's models.
Hardware: NVIDIA H100. Baseline GPU for performance comparisons.
Software: NVIDIA NeMo Framework. Framework used for training and optimizing Sarvam AI's models.
Software: NVIDIA Nemotron. Libraries for training, fine-tuning, and deploying models.

Key Actionable Insights

1. Implement kernel-level optimizations to reduce latency in AI models.
   Replacing standard implementations with architecture-specific fused kernels can yield significant speedups, as the Sarvam 30B optimizations demonstrate.
2. Utilize mixed prefill and decode scheduling to enhance GPU utilization.
   Mixing prefill and decode work in the same chunk keeps the GPU busier, yielding a 15% increase in total system throughput over separate prefill and decode chunks while still meeting the SLA requirements.
3. Consider disaggregated serving for models that fit within a single GPU's memory.
   This approach can eliminate inter-GPU communication overhead; for the Sarvam 30B model it produced a 1.5x increase in decode throughput.
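
The mixed prefill and decode scheduling described in the second insight can be sketched as a per-iteration token budget: decode steps are scheduled first, and whatever budget remains is filled with chunks of waiting prompts. This is a simplified illustration, not the scheduler used in the Sarvam deployment; the function name, the `token_budget` parameter, and the decode-first priority are all assumptions.

```python
from collections import deque

def build_mixed_batch(decode_ids, prefill_queue, token_budget):
    """Plan one iteration that mixes decode steps with prefill chunks.

    decode_ids: list of request ids that each need one decode token.
    prefill_queue: deque of [request_id, remaining_prompt_tokens],
        mutated in place as prompt tokens are scheduled.
    token_budget: hypothetical cap on tokens processed per iteration.
    Returns (scheduled_decode_ids, [(request_id, chunk_tokens), ...]).
    """
    # Decodes go first so running requests keep a steady inter-token latency.
    scheduled_decodes = list(decode_ids[:token_budget])
    budget_left = token_budget - len(scheduled_decodes)
    prefill_chunks = []
    while budget_left > 0 and prefill_queue:
        entry = prefill_queue[0]
        chunk = min(entry[1], budget_left)
        prefill_chunks.append((entry[0], chunk))
        entry[1] -= chunk
        budget_left -= chunk
        if entry[1] == 0:
            prefill_queue.popleft()  # this prompt is fully prefilled
    return scheduled_decodes, prefill_chunks

# Two running requests plus two waiting prompts (hypothetical sizes).
queue = deque([["req-a", 300], ["req-b", 50]])
decodes, chunks = build_mixed_batch(["req-x", "req-y"], queue, token_budget=256)
```

In this example the two decode tokens leave 254 tokens of budget, so the first iteration prefills 254 tokens of `req-a` and carries the remaining 46 into the next iteration. Choosing the budget is a trade-off: a larger budget improves utilization but lengthens each iteration, which raises inter-token latency for the decodes sharing it.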

Common Pitfalls

1. Neglecting kernel optimizations can lead to suboptimal model performance.
   Without targeted optimizations, models may fail to meet latency requirements, resulting in a poor user experience.
2. Overlooking disaggregated serving can hinder throughput improvements.
   Failing to separate prefill and decode processing can introduce unnecessary overhead, limiting the model's ability to scale effectively.

Related Concepts

Mixture-of-Experts Architecture
Kernel Optimization Techniques
AI Model Performance Metrics
NVIDIA GPU Architectures