A Closer Look at the Christmas Eve Outage

Netflix Technology Blog

AWS, Load Balancer

Overview

The article examines the Netflix streaming outage on Christmas Eve 2012, which was caused by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) service. It covers the timeline of the outage, its impact on Netflix streaming, and the measures taken to restore service and improve resiliency.

What You'll Learn

1. How to analyze the impact of cloud service outages on streaming applications
2. Why maintaining multiple availability zones is critical for service reliability
3. When to implement regional resiliency strategies in cloud architectures

Prerequisites & Requirements

Understanding of cloud services and load balancing concepts
Experience with AWS services, particularly Elastic Load Balancer (optional)

Key Questions Answered

What caused the Netflix streaming outage on Christmas Eve 2012?

A maintenance process was inadvertently run against the production ELB state data and deleted part of it. Several ELBs then failed, disrupting Netflix streaming on various devices in the US, Canada, and Latin America.

How long did the outage last and when was service restored?

The outage began at around 12:30 PM Pacific Time on December 24 and grew in scope that afternoon. The devices hit in the later wave were affected for about seven hours, and most customers could use the service again by about 10:30 PM on Christmas Eve. All ELBs were fully restored by around 8 AM on December 25.

What measures did AWS take to prevent future outages?

AWS put safeguards in place against the failure that caused the outage and said it expects to recover ELB state data significantly faster in any future incident. During this outage, restoring the missing state data from backups took all night.

How did Netflix manage to keep some services operational during the outage?

The Netflix website stayed up throughout the outage, so new customers could still sign up and members could stream on Macs and PCs, although with higher latency. Some devices were not affected at all because the ELBs they depend on kept running.

Key Statistics & Figures

Duration of outage

About 7 hours for the most affected devices

About ten hours elapsed between the first impact at around 12:30 PM PST on December 24 and 10:30 PM the same day, when the outage was mostly resolved.

Time to restore service to all ELBs

By around 8 AM on December 25

Full restoration of service occurred after additional cleanup work was completed.
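
The figures above can be cross-checked with simple date arithmetic. The sketch below uses only the times quoted in this summary and assumes they are all Pacific Time on the same calendar; the variable and function names are illustrative. The last line is an inference, not a reported fact: it shows when a seven-hour window would have started if it is measured back from 10:30 PM.

```python
from datetime import datetime, timedelta

# Times as quoted in this summary (assumed Pacific Time, 2012).
outage_start = datetime(2012, 12, 24, 12, 30)
mostly_restored = datetime(2012, 12, 24, 22, 30)
fully_restored = datetime(2012, 12, 25, 8, 0)


def hours_between(start, end):
    """Return the elapsed time between two datetimes in hours."""
    return (end - start).total_seconds() / 3600


print(hours_between(outage_start, mostly_restored))  # 10.0
print(hours_between(outage_start, fully_restored))   # 19.5

# Inferred start of a seven-hour window that ends at 10:30 PM.
seven_hour_window_start = mostly_restored - timedelta(hours=7)
print(seven_hour_window_start.strftime("%H:%M"))     # 15:30
```

The ten-hour span from first impact and the seven-hour device impact are therefore different measurements, not conflicting ones.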

Technologies & Tools

Cloud Service

Amazon Web Services

Problems in the AWS Elastic Load Balancer (ELB) service led to the outage.

Key Actionable Insights

1. Implementing a multi-availability zone strategy can significantly enhance service reliability.
Distributing services across multiple zones lets an application keep working when one zone has problems, and Netflix runs across three zones for this reason. Zone redundancy did not cover this incident, because the ELB failure was not confined to a single zone; that gap is the case for regional resiliency.

2. Regularly review and test backup and recovery processes for critical cloud services.
The outage showed why recovery mechanisms need to be fast as well as robust: AWS had to restore missing state data from backups, which took all night.

3. Monitoring and alerting systems should be in place to detect anomalies in service performance.
Detecting a problem early lets teams respond quickly and limit the disruption to customers.
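
As an illustration of the third insight, here is a minimal sketch of a traffic-drop check across load balancers. It is not Netflix's or AWS's monitoring; the function name, the thresholds, and the sample load balancer names are all hypothetical. It flags any load balancer whose latest request count has collapsed relative to its own recent baseline, and it skips low-traffic ones to avoid noisy alerts.

```python
from statistics import mean


def stalled_load_balancers(request_counts, min_baseline=100.0, drop_ratio=0.1):
    """Return names whose latest sample is below drop_ratio * trailing mean.

    request_counts maps a load balancer name to a list of per-interval
    request counts, oldest first. Thresholds are illustrative assumptions.
    """
    stalled = []
    for name, samples in request_counts.items():
        if len(samples) < 2:
            continue  # not enough history to form a baseline
        baseline = mean(samples[:-1])
        latest = samples[-1]
        # Ignore load balancers that normally carry very little traffic.
        if baseline >= min_baseline and latest < drop_ratio * baseline:
            stalled.append(name)
    return sorted(stalled)


# Hypothetical sample data: one load balancer stops passing traffic.
counts = {
    "api-tv-devices": [1200, 1180, 1210, 0],
    "website": [900, 950, 920, 910],
    "low-traffic-internal": [3, 2, 4, 0],
}
print(stalled_load_balancers(counts))  # ['api-tv-devices']
```

Comparing each load balancer with its own baseline matters in a setup with many load balancers, where a handful can fail while aggregate traffic still looks close to normal.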

Common Pitfalls

1. Failing to implement adequate safeguards against data loss during maintenance processes.

The outage was triggered by a maintenance process that inadvertently deleted critical production data. Maintenance tasks should be tested before they run, restricted from touching production by default, and backed by rollback procedures.
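
One generic way to guard against this pitfall is to make destructive maintenance refuse to touch production unless the caller explicitly opts in, and to default to a dry run. The sketch below is an illustration under those assumptions, not the safeguard AWS implemented; every name in it (`run_destructive_maintenance`, `confirm_production`, `ProductionGuardError`) is hypothetical.

```python
class ProductionGuardError(RuntimeError):
    """Raised when a destructive task targets production without opt-in."""


def run_destructive_maintenance(environment, records, should_delete,
                                confirm_production=False, dry_run=True):
    """Delete matching records, with a production guard and a dry-run default."""
    if environment == "production" and not confirm_production:
        raise ProductionGuardError(
            "refusing to modify production data without explicit confirmation"
        )
    matched = [r for r in records if should_delete(r)]
    if dry_run:
        # Report what would be deleted, but leave the data untouched.
        return {"deleted": [], "would_delete": matched, "remaining": list(records)}
    remaining = [r for r in records if not should_delete(r)]
    return {"deleted": matched, "would_delete": [], "remaining": remaining}


# Hypothetical state records; names are made up for the example.
state = ["elb-a:stale", "elb-b:active", "elb-c:stale"]
is_stale = lambda record: record.endswith(":stale")

# A dry run against a test environment changes nothing.
preview = run_destructive_maintenance("test", state, is_stale)
print(preview["would_delete"])  # ['elb-a:stale', 'elb-c:stale']

# The same call against production is rejected unless explicitly confirmed.
try:
    run_destructive_maintenance("production", state, is_stale, dry_run=False)
except ProductionGuardError as error:
    print(error)
```

The guard only reduces the chance of an accidental run; tested backups and a rehearsed restore path are still needed for the cases it does not catch.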

Related Concepts

Cloud Service Resiliency

Load Balancing Strategies

Disaster Recovery Planning

Multi-region Architecture