The end-to-end refresh of our server hardware fleet


Overview

Facebook refreshed its server hardware fleet end to end to improve the performance and scalability of its applications and services. The refresh introduces four new storage and compute platforms: Bryce Canyon (high-density storage), Big Basin (AI model training), Tioga Pass (dual-socket compute), and Yosemite v2 (multi-node 1S compute). Each is designed to improve efficiency and support more demanding computing tasks.

What You'll Learn

1. How to leverage modular server designs for improved efficiency
2. Why high-density storage solutions are crucial for media applications
3. When to implement dual-socket motherboards for enhanced performance

Key Questions Answered

What is the purpose of the Bryce Canyon storage chassis?
The Bryce Canyon storage chassis is designed for high-density storage, supporting 72 HDDs in 4 Open Rack units, providing increased efficiency and performance for applications like photo and video storage.
How does Big Basin improve AI model training?
Big Basin can train models that are 30 percent larger, because its memory grows from 12 GB to 16 GB, and it reaches nearly 100 percent higher throughput on image classification models such as ResNet-50 compared with its predecessor, Big Sur.
What enhancements does Tioga Pass offer over Leopard?
Tioga Pass features a dual-socket motherboard with upgraded PCIe slots, allowing for greater flexibility and bandwidth, and introduces support for M.2 NVMe SSDs, enhancing performance for various compute services.
What are the key features of Yosemite v2?
Yosemite v2 supports four 1S server cards in a new chassis design, allows hot servicing without powering down, and is compatible with both Mono Lake and Twin Lakes server cards, enhancing flexibility and power efficiency.

Key Statistics & Figures

HDD density improvement
20 percent higher than Open Vault
This improvement is achieved with the new Bryce Canyon storage chassis.
Compute capability increase
4x increase over Honey Badger
This gain comes from using the Mono Lake server card as the compute module in the Bryce Canyon storage server.
GPU memory increase
From 12 GB to 16 GB in Big Basin
This increase allows for training larger AI models.
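
The figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original article. The Bryce Canyon numbers (72 HDDs in 4 Open Rack units) and the Big Basin memory sizes come from the text above; the Open Vault capacity of 30 HDDs in 2 Open Rack units is an assumption brought in for the comparison, and the function names are hypothetical.

```python
from fractions import Fraction


def drives_per_rack_unit(drives: int, rack_units: int) -> Fraction:
    """Return HDD density as drives per Open Rack unit (OU)."""
    return Fraction(drives, rack_units)


def percent_increase(new, old) -> float:
    """Return the relative increase of `new` over `old`, in percent."""
    return float((Fraction(new) - Fraction(old)) / Fraction(old) * 100)


# From the text above: Bryce Canyon holds 72 HDDs in 4 OU.
bryce_canyon = drives_per_rack_unit(72, 4)

# Assumption (not stated in the text above): Open Vault holds 30 HDDs in 2 OU.
open_vault = drives_per_rack_unit(30, 2)

# Density gain of Bryce Canyon over Open Vault under that assumption.
density_gain = percent_increase(bryce_canyon, open_vault)

# From the text above: Big Basin memory grows from 12 GB to 16 GB.
memory_gain = percent_increase(16, 12)

print(f"Bryce Canyon density: {float(bryce_canyon):.0f} HDDs per OU")
print(f"Open Vault density (assumed): {float(open_vault):.0f} HDDs per OU")
print(f"Density gain: {density_gain:.1f} percent")
print(f"Memory gain: {memory_gain:.1f} percent")
```

Under the stated assumption, the density gain works out to the 20 percent quoted above. The memory gain of roughly 33 percent is in line with the claim that Big Basin can train models about 30 percent larger.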

Technologies & Tools

Framework
Open Compute Project
Provides design specifications and hardware files for the new server hardware.
Standard
Open Rack V2
Compatibility standard for the new server chassis designs.
Software
OpenBMC
Management framework for thermals and power in new hardware.

Key Actionable Insights

1. Adopting modular server designs like those in Bryce Canyon can improve storage efficiency.
   Modular components make upgrades and scaling easier, which matters when media applications keep adding data.
2. Deploying high-performance compute platforms such as Big Basin can shorten AI model training times.
   With nearly double the throughput of Big Sur on models like ResNet-50 and room for models 30 percent larger, teams can iterate more quickly on AI projects.
3. Using dual-socket motherboards, as in Tioga Pass, can provide more compute and I/O bandwidth per server.
   This is particularly relevant for applications requiring high computational power, such as deep learning and complex data processing.

Common Pitfalls

1. Neglecting the importance of thermal management in server design can lead to performance degradation.
   Without proper cooling solutions, servers can overheat, resulting in reduced efficiency and potential hardware failure.

Related Concepts

Modular Server Architecture
High-density Storage Solutions
AI Model Training Optimization
Thermal Management in Data Centers