Power Loss Siren: Making Meta resilient to power loss events

There are thousands of distributed services running on millions of servers in Meta’s data centers. Part of ensuring the reliability of those services means making them resilient to power loss event…

Raghunathan Modoor Jagannathan
11 min read, intermediate

Overview

The article discusses the development and implementation of the Power Loss Siren (PLS) at Meta, a system designed to enhance the resilience of data center services against power loss events. It details how PLS operates at the rack level to detect impending power loss and facilitate proactive service mitigation, thereby improving overall service reliability and simplifying infrastructure management.

What You'll Learn

1. How to implement a proactive power loss detection system using existing infrastructure
2. Why leveraging in-rack batteries can enhance service reliability during power outages
3. How to configure mitigation handlers for services in response to power loss alerts

Prerequisites & Requirements

  • Understanding of data center power distribution and server architecture
  • Familiarity with monitoring and alert systems (optional)

Key Questions Answered

What is the Power Loss Siren and how does it work?
The Power Loss Siren (PLS) is a low-latency, distributed power loss detection and alert system that operates at the rack level in Meta's data centers. It uses existing in-rack batteries to notify services of impending power loss, enabling proactive mitigation actions to prevent service degradation or downtime.
How does PLS improve service reliability during power loss events?
PLS enhances service reliability by allowing services to fail over proactively rather than reactively. It provides alerts at least 45 seconds before a power loss occurs, enabling services to initiate mitigation handlers and maintain operations while running on battery power.
What are the common causes of power loss events in data centers?
Common causes of power loss events include device faults, voltage sags, and maintenance failures. Device faults are the most frequent, often resulting from catastrophic failures or electrical short circuits in power devices.
What are the main components of the PLS architecture?
The PLS architecture consists of two main components: PLS Relay, which monitors power supply units for outages, and PLS Handler, which listens for alerts and executes mitigation actions on the servers. Both components are designed for high reliability and low latency.
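
The division of labor between the two components can be illustrated with a minimal in-process sketch. This is not Meta's implementation: the class layout, the `observe` and `register` methods, and the rule that an alert fires once no power supply unit reports AC input are all assumptions made for illustration. A real deployment would use a network transport between the rack-level monitor and the servers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PowerLossAlert:
    rack_id: str
    battery_seconds: float  # assumed battery runtime reported with the alert


class PLSHandler:
    """Hypothetical per-server listener that runs mitigation actions."""

    def __init__(self, hostname: str) -> None:
        self.hostname = hostname
        self.results: List[str] = []
        self._actions: List[Callable[[PowerLossAlert], None]] = []

    def register(self, action: Callable[[PowerLossAlert], None]) -> None:
        self._actions.append(action)

    def on_alert(self, alert: PowerLossAlert) -> None:
        # Run every action, even if an earlier one raises.
        for action in self._actions:
            try:
                action(alert)
                self.results.append(f"{action.__name__}: ok")
            except Exception as exc:
                self.results.append(f"{action.__name__}: failed ({exc})")


class PLSRelay:
    """Hypothetical rack-level monitor that watches power supply units."""

    def __init__(self, rack_id: str, handlers: List[PLSHandler],
                 battery_seconds: float = 90.0) -> None:
        self.rack_id = rack_id
        self.handlers = handlers
        self.battery_seconds = battery_seconds
        self._alerted = False

    def observe(self, psu_ac_ok: Dict[str, bool]) -> bool:
        """Process one PSU reading; return True if an alert was sent."""
        # Assumption: input AC is considered lost when no PSU reports it.
        ac_lost = bool(psu_ac_ok) and not any(psu_ac_ok.values())
        if not ac_lost:
            self._alerted = False  # power is back, re-arm the relay
            return False
        if self._alerted:
            return False  # already alerted for this outage
        self._alerted = True
        alert = PowerLossAlert(self.rack_id, self.battery_seconds)
        for handler in self.handlers:
            handler.on_alert(alert)
        return True


if __name__ == "__main__":
    def drain_traffic(alert: PowerLossAlert) -> None:
        print(f"draining traffic, rack {alert.rack_id} is on battery")

    handler = PLSHandler("host-1")
    handler.register(drain_traffic)
    relay = PLSRelay("rack-7", [handler])
    relay.observe({"psu0": True, "psu1": True})    # normal operation
    relay.observe({"psu0": False, "psu1": False})  # input AC lost
```

Catching exceptions per action reflects one possible design choice: a failing mitigation step should not stop the remaining steps from running while the rack is on battery.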

Key Statistics & Figures

Error rate reduction during power loss events
100x reduction
By leveraging PLS signals, the peak error rate during large-scale power loss events can be reduced from hundreds of thousands of failed requests per second to a few thousand.
Battery backup duration
90 seconds
In-rack batteries can provide backup power for 90 seconds during an input AC power loss, allowing time for mitigation actions.
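
The timing figures in this summary (90 seconds of battery backup, alerts at least 45 seconds before power is lost) suggest a simple budget check for a service's mitigation steps. The sketch below is only an illustration: the function names, the per-step durations, and the 5-second safety margin are assumptions, not values from the article.

```python
from typing import Sequence

BATTERY_BACKUP_SECONDS = 90.0  # battery backup duration stated in the summary
MIN_ALERT_LEAD_SECONDS = 45.0  # minimum alert lead time stated in the summary


def mitigation_slack(step_seconds: Sequence[float],
                     lead_seconds: float = MIN_ALERT_LEAD_SECONDS,
                     safety_margin_seconds: float = 5.0) -> float:
    """Return the seconds left over after all steps and a safety margin.

    The safety margin is a hypothetical buffer, not a figure from the article.
    """
    if any(s < 0 for s in step_seconds):
        raise ValueError("step durations must be non-negative")
    return lead_seconds - safety_margin_seconds - sum(step_seconds)


def fits_budget(step_seconds: Sequence[float],
                lead_seconds: float = MIN_ALERT_LEAD_SECONDS,
                safety_margin_seconds: float = 5.0) -> bool:
    """True if the steps can finish inside the guaranteed alert lead time."""
    return mitigation_slack(step_seconds, lead_seconds,
                            safety_margin_seconds) >= 0


if __name__ == "__main__":
    # Hypothetical steps: drain traffic (10 s), then fail over a primary (20 s).
    print(fits_budget([10.0, 20.0]))
    print(mitigation_slack([10.0, 20.0]))
```

Budgeting against the 45-second lead time rather than the full 90 seconds of battery is the conservative choice, since the lead time is the only interval the alert is described as guaranteeing.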

Technologies & Tools


Database
MySQL
Used for user data storage, with a geo-distributed architecture to handle writes and ensure data availability.
Backend
PLS Relay
Monitoring daemon that detects power loss and alerts servers within the rack.
Backend
PLS Handler
Listener daemon that executes mitigation actions based on alerts from PLS Relay.

Key Actionable Insights

1. Implementing a proactive power loss detection system can significantly reduce downtime for critical services.
By utilizing existing in-rack batteries and configuring mitigation handlers, services can maintain operations during power outages, which is crucial for maintaining user experience.
2. Regularly review and update the configuration of mitigation handlers to adapt to changing service requirements.
As services evolve, their response to power loss events may need adjustments to ensure optimal performance and reliability.
3. Consider the hierarchical power distribution model when designing data center infrastructure.
This model helps in fault isolation and can prevent larger outages by containing issues within lower levels of the power distribution hierarchy.
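
The fault-isolation property of a hierarchical power distribution model can be made concrete by treating the hierarchy as a tree and computing which racks sit downstream of a faulted device. The topology and device names below are hypothetical and do not describe Meta's actual hierarchy.

```python
from typing import Dict, List

# Hypothetical topology: each power device maps to the devices it feeds.
TOPOLOGY: Dict[str, List[str]] = {
    "main-switchboard": ["row-panel-A", "row-panel-B"],
    "row-panel-A": ["rack-A1", "rack-A2"],
    "row-panel-B": ["rack-B1"],
}


def racks_affected(topology: Dict[str, List[str]],
                   faulted_device: str) -> List[str]:
    """Return the racks that lose power when faulted_device fails.

    A device with no downstream entries is treated as a rack.
    """
    children = topology.get(faulted_device)
    if not children:
        return [faulted_device]
    affected: List[str] = []
    for child in children:
        affected.extend(racks_affected(topology, child))
    return affected


if __name__ == "__main__":
    print(racks_affected(TOPOLOGY, "row-panel-A"))
    print(racks_affected(TOPOLOGY, "main-switchboard"))
```

In this toy model a fault at a row-level device affects only the racks below it, while a fault higher in the tree affects every rack, which is why containing faults at lower levels limits the size of an outage.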

Common Pitfalls

1. Relying solely on dual-powered rows for power redundancy can lead to underutilization and limited fault tolerance.
While dual-powered rows provide some redundancy, they do not protect against all power loss scenarios, particularly at higher levels of the power distribution hierarchy. Transitioning to a system like PLS can alleviate these issues.

Related Concepts

Power Distribution in Data Centers
Proactive Service Mitigation Strategies
Impact of Power Loss on Distributed Systems