Introducing Big Basin: Our next-generation AI hardware

Kevin Lee

Overview

This article introduces Big Basin, Facebook's next-generation hardware for training AI models. Big Basin succeeds the Big Sur GPU server and offers greater arithmetic throughput and memory capacity, so larger machine learning models can be trained faster.

What You'll Learn

1
How to leverage modular design in AI hardware for scalability

2
Why disaggregated architecture improves serviceability and thermal efficiency

3
How to utilize NVIDIA Tesla P100 GPUs for deep learning training

4
When to implement open-source hardware designs for AI applications

Key Questions Answered

What are the key improvements of Big Basin over Big Sur?

Big Basin can train models that are 30% larger than Big Sur could, thanks to greater arithmetic throughput and a memory increase from 12 GB to 16 GB. It also delivers better performance per watt: single-precision floating-point throughput per GPU rises from 7 teraflops to 10.6 teraflops, which allows faster training of complex models.
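As a quick sanity check on these figures, the relative gains can be worked out directly. The sketch below uses only the numbers quoted above; the dictionary keys and the helper `relative_gain` are illustrative names, not part of any Facebook or NVIDIA tooling.

```python
# Figures quoted in the text; the key names are illustrative only.
BIG_SUR = {"memory_gb": 12, "fp32_tflops": 7.0}
BIG_BASIN = {"memory_gb": 16, "fp32_tflops": 10.6}


def relative_gain(old: float, new: float) -> float:
    """Return the fractional increase going from old to new."""
    return (new - old) / old


# Relative gain for each quoted metric.
gains = {key: relative_gain(BIG_SUR[key], BIG_BASIN[key]) for key in BIG_SUR}

for key, gain in gains.items():
    print(f"{key}: +{gain:.0%}")
```

This prints a memory gain of about 33% and a single-precision throughput gain of about 51%. The memory gain is close to the quoted 30% increase in model size, although the source attributes that figure to both memory and throughput rather than to memory alone.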

How does the modular design of Big Basin benefit AI training?

The modular design disaggregates the CPU and GPU components, so each can be scaled and upgraded independently. This improves serviceability, reduces operational complexity, and improves thermal efficiency, because the GPUs sit directly in front of the cool air intake.

What is the architecture of the Big Basin GPU system?

Big Basin features eight NVIDIA Tesla P100 GPU accelerators connected via NVIDIA NVLink in an eight-GPU hybrid cube mesh. The system is designed to work with the NVIDIA Deep Learning SDK, which uses this architecture and interconnect to improve deep learning training performance.
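The text does not spell out how the eight GPUs are wired, so the following is only a sketch under an assumption: that the hybrid cube mesh matches the commonly described layout of two fully connected groups of four GPUs, with each GPU also linked to its counterpart in the other group. The GPU numbering and the function name `hybrid_cube_mesh_links` are hypothetical.

```python
from itertools import combinations

NUM_GPUS = 8
QUAD_SIZE = 4  # assumption: two groups ("quads") of four GPUs each


def hybrid_cube_mesh_links() -> set:
    """Build the assumed link set as (gpu_a, gpu_b) pairs with gpu_a < gpu_b."""
    links = set()
    # Every pair of GPUs inside a quad is directly connected.
    for start in range(0, NUM_GPUS, QUAD_SIZE):
        quad = range(start, start + QUAD_SIZE)
        links.update(combinations(quad, 2))
    # Each GPU is also connected to its counterpart in the other quad.
    for gpu in range(QUAD_SIZE):
        links.add((gpu, gpu + QUAD_SIZE))
    return links


links = hybrid_cube_mesh_links()
# Number of direct links attached to each GPU.
degree = {gpu: sum(gpu in link for link in links) for gpu in range(NUM_GPUS)}

print(f"total links: {len(links)}")
print(f"links per GPU: {degree}")
```

Under this assumption the mesh has 16 links in total and every GPU has four direct neighbours, so any GPU can reach any other in at most two hops.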

What role does open-source play in the development of Big Basin?

Facebook is open-sourcing the design of Big Basin through the Open Compute Project, allowing for collaborative innovation in AI hardware development. This approach aims to foster advancements in building complex AI systems for a more connected world.

Key Statistics & Figures

Memory increase

from 12 GB to 16 GB

Together with greater arithmetic throughput, this increase allows training models that are 30% larger.

Single-precision floating-point performance

increased from 7 teraflops to 10.6 teraflops

This per-GPU increase contributes to better performance per watt during AI training.

Throughput improvement with ResNet-50

almost 100 percent

Measured against the previous Big Sur GPU server.
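To relate a throughput gain to training time, a rough model is that the wall-clock time for a fixed amount of work scales inversely with throughput. The sketch below assumes exactly that and ignores data loading, communication, and other overheads; `relative_training_time` is a hypothetical helper, and the sample gains are illustrative values rather than measurements.

```python
def relative_training_time(throughput_gain: float) -> float:
    """Fraction of the original time needed after a fractional throughput gain.

    A gain of 1.0 means throughput doubled (a 100 percent improvement).
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1.0")
    return 1.0 / (1.0 + throughput_gain)


# Illustrative gains only; the source reports "almost 100 percent" for ResNet-50.
for gain in (0.5, 0.9, 1.0):
    print(f"+{gain:.0%} throughput -> {relative_training_time(gain):.0%} of the original time")
```

Under this simplification, a throughput improvement approaching 100 percent means the same training run takes roughly half as long.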

Technologies & Tools

Hardware

NVIDIA Tesla P100

Used as GPU accelerators in the Big Basin system for deep learning training.

Technology

NVIDIA NVLink

Connects the GPUs in an eight-GPU hybrid cube mesh for improved performance.

Initiative

Open Compute Project

Platform for open-sourcing the design of Big Basin.

Key Actionable Insights

1
Implementing a modular design in AI hardware can improve scalability and serviceability.
By decoupling components like CPUs and GPUs, organizations can upgrade parts independently, leading to better resource management and adaptability to new technologies.

2
Using NVIDIA Tesla P100 GPUs can shorten deep learning training times.
With increased arithmetic throughput and memory, these GPUs allow researchers to experiment with larger models more efficiently, which is crucial in fast-paced AI development environments.

3
Adopting open-source hardware designs can drive innovation in AI applications.
Collaborative efforts in hardware design can lead to breakthroughs in AI capabilities, benefiting the entire tech community and enhancing the development of more sophisticated AI systems.

Common Pitfalls

1
Neglecting the importance of modularity in hardware design can lead to scalability issues.
Without a modular approach, upgrading or replacing components can become cumbersome, leading to increased downtime and operational complexity.