How Early Access to NVIDIA GB200 Systems Helped LMArena Build a Model to Evaluate LLMs

LMArena at the University of California, Berkeley is making it easier to see which large language models excel at specific tasks, thanks to help from NVIDIA and…

Jason Perlow

Overview

LMArena, in collaboration with NVIDIA and Nebius, has developed the Prompt-to-Leaderboard (P2L) model to evaluate how large language models (LLMs) perform on specific tasks. Running on NVIDIA GB200 NVL72 systems, the team scaled the workload and trained the model in four days.

What You'll Learn

1. How to deploy the Prompt-to-Leaderboard (P2L) model using NVIDIA GB200 NVL72 systems
2. Why using human-generated rankings improves model evaluation for LLMs
3. How to leverage cost-based routing for AI model selection

Prerequisites & Requirements

  • Understanding of large language models and their evaluation metrics
  • Familiarity with NVIDIA DGX Cloud and Nebius AI Cloud platforms (optional)

Key Questions Answered

How does LMArena evaluate which LLMs perform best for specific tasks?
LMArena collects human votes on model responses across tasks and fits Bradley-Terry coefficients to those votes to rank the models. P2L makes the ranking prompt-specific, so leaderboards can reflect the strengths of different models in areas like math, coding, and creative writing.
What are the key features of the NVIDIA GB200 NVL72 system?
The NVIDIA GB200 NVL72 integrates 36 Grace CPUs and 72 Blackwell GPUs, connected over NVLink for high-bandwidth, low-latency communication. It also provides up to 30 TB of fast, unified LPDDR5X and HBM3E memory, which helps with memory-intensive AI workloads.
What benefits does the P2L model provide to developers?
The P2L model reduces guesswork in model selection: it uses data-driven rankings to route each query to the model expected to perform best on that task within a given budget, balancing performance against cost.
How quickly can models be trained on the NVIDIA GB200 NVL72?
LMArena demonstrated that they could train their state-of-the-art model on the NVIDIA GB200 NVL72 in just four days, showcasing the system's rapid time-to-value for AI workloads.
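
To make the Bradley-Terry ranking step concrete, here is a minimal sketch that fits one global set of coefficients from pairwise votes using the classic minorization-maximization update. It is an illustration under stated assumptions, not LMArena's code: the function names, the model names, and the vote counts are all hypothetical, and P2L itself is a trained model that produces prompt-specific coefficients rather than a single global fit like this one.

```python
import math
from collections import Counter


def fit_bradley_terry(votes, iterations=500):
    """Estimate Bradley-Terry coefficients from (winner, loser) votes.

    Returns log-strengths centered to average zero. This sketch assumes
    every model wins at least once; a model with zero wins would need a
    prior or regularization, which is omitted here.
    """
    models = sorted({name for vote in votes for name in vote})
    wins = Counter(winner for winner, _ in votes)
    pair_counts = Counter(frozenset(vote) for vote in votes)

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            # Sum over opponents: comparisons played / combined strength.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Only ratios of strengths are identifiable, so rescale each round.
        mean = sum(updated.values()) / len(models)
        strength = {m: s / mean for m, s in updated.items()}

    logs = {m: math.log(s) for m, s in strength.items()}
    offset = sum(logs.values()) / len(logs)
    return {m: value - offset for m, value in logs.items()}


def win_probability(coefs, a, b):
    """Probability that model a is preferred over model b under the fit."""
    return 1.0 / (1.0 + math.exp(-(coefs[a] - coefs[b])))


# Hypothetical votes: each tuple is (preferred model, other model).
votes = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)

coefs = fit_bradley_terry(votes)
leaderboard = sorted(coefs, key=coefs.get, reverse=True)
```

Sorting models by coefficient gives the leaderboard, and the difference between two coefficients maps to a predicted preference probability through the logistic function.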

Key Statistics & Figures

Training time for state-of-the-art model
4 days
This was achieved using the NVIDIA GB200 NVL72 system.
Number of Grace CPUs in NVIDIA GB200 NVL72
36
These CPUs work in conjunction with the GPUs for enhanced performance.
Number of Blackwell GPUs in NVIDIA GB200 NVL72
72
The GPUs are crucial for handling intensive AI workloads.
Total memory capacity of NVIDIA GB200 NVL72
30 TB
This includes fast, unified LPDDR5X and HBM3E memory.

Technologies & Tools

Hardware
NVIDIA GB200 NVL72
Used for deploying the P2L model and handling AI workloads.
Cloud Platform
NVIDIA DGX Cloud
Facilitates the deployment and scaling of AI models.
Cloud Platform
Nebius AI Cloud
Collaborates with NVIDIA to provide a shared environment for AI workloads.
Framework
PyTorch
One of the key AI frameworks validated for use with GB200 NVL72.
Framework
DeepSpeed
Another validated framework for optimizing AI workloads.
Framework
Hugging Face Transformers
Used for building and deploying transformer models.

Key Actionable Insights

1
Leverage the P2L model to enhance your AI application’s performance evaluation.
By using human-generated rankings, you can create more nuanced evaluations of LLMs, leading to better model selection for specific tasks.
2
Utilize cost-based routing in your AI applications to optimize resource allocation.
Setting budget constraints allows your system to automatically select the best-performing model within those limits, improving efficiency and cost-effectiveness.
3
Take advantage of the NVIDIA GB200 NVL72's architecture for scalable AI workloads.
The integration of Grace CPUs and Blackwell GPUs allows for high throughput and efficient resource management, making it ideal for demanding AI tasks.
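
The cost-based routing idea in the second insight can be sketched as a simple rule: among the models whose cost fits the budget, pick the one with the highest predicted score for the prompt. This is a deliberately simplified illustration, not the routing method used by P2L. The `Candidate` type, the `route` function, the model names, the scores, and the per-query costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float           # predicted quality for this prompt (hypothetical)
    cost_per_query: float  # cost in arbitrary currency units (hypothetical)


def route(candidates, budget):
    """Return the highest-scoring candidate whose cost fits the budget.

    Ties on score are broken in favor of the cheaper model.
    """
    affordable = [c for c in candidates if c.cost_per_query <= budget]
    if not affordable:
        raise ValueError("no candidate model fits the budget")
    return max(affordable, key=lambda c: (c.score, -c.cost_per_query))


# Hypothetical candidates for a single prompt.
candidates = [
    Candidate("large_model", score=0.92, cost_per_query=0.030),
    Candidate("medium_model", score=0.85, cost_per_query=0.008),
    Candidate("small_model", score=0.70, cost_per_query=0.001),
]

choice = route(candidates, budget=0.01)  # selects medium_model
```

Raising the budget lets the router reach a stronger model, and lowering it falls back to a cheaper one. A production router would likely also account for uncertainty in the predicted scores, which this sketch ignores.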

Common Pitfalls

1
Underestimating the complexity of deploying AI workloads on new architectures.
Many developers may find it challenging to adapt their applications to the unique features of the GB200 NVL72, which requires careful planning and testing.
2
Neglecting the importance of real-time feedback in model evaluation.
Failing to implement a feedback loop can lead to suboptimal model performance, as it prevents continuous improvement based on user interactions.

Related Concepts

Large Language Models (LLMs)
AI Workload Scalability
Cost-based Routing in AI Applications
Human-in-the-loop Evaluation Methods