Post-mortem of October 22, 2012 AWS degradation

Netflix Technology Blog


AWS, Cassandra, Kong

Overview

This article reviews the AWS service degradation of October 22, 2012, and how Netflix limited customer impact during it. It covers the event timeline, best practices for building highly available systems, and lessons learned from the incident.

What You'll Learn

1. How to implement zone evacuation drills to handle AWS outages
2. Why building redundancy across multiple Availability Zones is crucial for service reliability
3. How to utilize the Simian Army for testing system resilience

Prerequisites & Requirements

Understanding of cloud architecture and AWS services
Familiarity with cloud management tools such as Asgard (optional)

Key Questions Answered

What caused the AWS service degradation on October 22, 2012?

The degradation was caused by problems in the AWS Elastic Block Store (EBS) service, which affected many websites. Netflix initially saw no impact because of its design choices, but some customers experienced intermittent problems later in the day.

How did Netflix manage to minimize customer impact during the outage?

Netflix minimized customer impact by evacuating the affected Availability Zone, a procedure it had rehearsed in zone evacuation drills; the evacuation took just 20 minutes. Its architecture, designed for resilience, kept services running despite the degradation.
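To make the idea of a zone evacuation concrete, here is a minimal sketch of the planning step: given instance counts per zone, empty the degraded zone and spread its instances across the remaining zones. This is an illustration only, not Netflix's or Asgard's implementation. The function name `evacuate_zone` and the zone names are hypothetical, and the sketch makes no AWS API calls.

```python
def evacuate_zone(instances_per_zone: dict[str, int], bad_zone: str) -> dict[str, int]:
    """Return a new zone -> instance-count plan with bad_zone emptied.

    Instances from bad_zone are spread as evenly as possible across the
    remaining zones, so the total instance count is preserved.
    """
    if bad_zone not in instances_per_zone:
        raise KeyError(f"unknown zone: {bad_zone}")
    # Sort for a deterministic assignment of any leftover instances.
    healthy = sorted(z for z in instances_per_zone if z != bad_zone)
    if not healthy:
        raise ValueError("no healthy zone left to absorb the load")

    to_move = instances_per_zone[bad_zone]
    share, extra = divmod(to_move, len(healthy))

    plan = {bad_zone: 0}
    for i, zone in enumerate(healthy):
        # The first `extra` zones each take one additional instance.
        plan[zone] = instances_per_zone[zone] + share + (1 if i < extra else 0)
    return plan


# Hypothetical fleet: zone names and counts are made up for illustration.
fleet = {"zone-a": 10, "zone-b": 10, "zone-c": 10}
plan = evacuate_zone(fleet, "zone-b")
print(plan)
```

In a real evacuation the plan would be applied by resizing the groups in the healthy zones and draining traffic from the degraded one; the sketch covers only the arithmetic of where the capacity goes.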

What best practices did Netflix implement for high availability?

Netflix employs several best practices for high availability, including building redundancy across three Availability Zones, using the Simian Army for resilience testing, and conducting incident reviews to learn from outages. These practices help keep its services reliable even during failures.
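Redundancy across three zones only helps if the surviving zones can carry the full load. The following sketch works through that arithmetic under a simplifying assumption that capacity is identical in every zone. The function names and the numbers are hypothetical; the source does not give Netflix's actual capacity figures.

```python
def max_safe_utilization(zones: int, zones_lost: int = 1) -> float:
    """Highest average utilization at which losing `zones_lost` zones
    still leaves enough capacity, assuming equal capacity per zone."""
    if zones <= 0 or not 0 <= zones_lost < zones:
        raise ValueError("need 0 <= zones_lost < zones")
    return (zones - zones_lost) / zones


def survives_zone_loss(peak_load: float, capacity_per_zone: float,
                       zones: int, zones_lost: int = 1) -> bool:
    """True if the remaining zones can serve peak_load on their own."""
    return peak_load <= capacity_per_zone * (zones - zones_lost)


# With three equal zones, losing one is safe only if the fleet
# runs at or below two thirds of its total capacity.
print(max_safe_utilization(3))
print(survives_zone_loss(peak_load=200, capacity_per_zone=100, zones=3))
```

The same calculation shows why two zones are more expensive to protect than three: with two zones each must be able to carry the entire load, so half of the fleet is headroom.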

Technologies & Tools


Cloud Management Tool

Asgard

Used for managing application deployments and facilitating zone evacuations.

Database

Cassandra

Configured with a replication factor of three across different Availability Zones to ensure data availability during outages.
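A short sketch can show why a replication factor of three, with one replica per Availability Zone, tolerates the loss of a zone. It assumes quorum reads and writes, where a quorum is floor(RF / 2) + 1 replicas; the source does not say which consistency level Netflix used, so that choice is an assumption made for illustration. The zone names and function names are hypothetical, and the code does not talk to a Cassandra cluster.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum operation."""
    return replication_factor // 2 + 1


def quorum_available(replica_zones: list[str], failed_zones: set[str]) -> bool:
    """True if enough replicas sit outside the failed zones to form a quorum.

    replica_zones lists the zone of each replica, one entry per replica.
    """
    live = sum(1 for zone in replica_zones if zone not in failed_zones)
    return live >= quorum(len(replica_zones))


spread = ["zone-a", "zone-b", "zone-c"]    # one replica per zone
stacked = ["zone-a", "zone-a", "zone-a"]   # all replicas in one zone

print(quorum_available(spread, {"zone-b"}))   # two of three replicas remain
print(quorum_available(stacked, {"zone-a"}))  # no replicas remain
```

Spreading the replicas is what matters here: the same replication factor gives no protection against a zone outage if all three copies live in the zone that fails.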

Key Actionable Insights

1. Implement zone evacuation drills as part of your disaster recovery plan to ensure a quick response to outages.
By regularly practicing zone evacuations, teams can reduce recovery time during real incidents, as demonstrated by Netflix's ability to evacuate the affected zone in just 20 minutes.

2. Design your applications to be resilient against single instance failures by using multi-AZ architectures.
This approach not only enhances availability but also simplifies recovery during outages, as seen in Netflix's handling of the AWS degradation.

3. Use chaos engineering tools like the Simian Army to continuously test and improve system resilience.
Regularly testing the system against failures helps identify weaknesses and improve overall service reliability.
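The chaos-testing idea in the third insight can be sketched with a simulated fleet: remove one instance at random, then check that the service still meets its minimum healthy count. This is a toy model, not the Simian Army's code. The fleet layout, instance ids, and function names are all hypothetical, and nothing here terminates real instances.

```python
import random


def terminate_random_instance(fleet: dict[str, list[str]],
                              rng: random.Random) -> tuple[str, str]:
    """Remove one randomly chosen instance from the simulated fleet.

    Returns the (zone, instance_id) that was removed.
    """
    candidates = [(zone, inst)
                  for zone, insts in sorted(fleet.items())
                  for inst in insts]
    if not candidates:
        raise ValueError("fleet is empty")
    zone, inst = rng.choice(candidates)
    fleet[zone].remove(inst)
    return zone, inst


def is_healthy(fleet: dict[str, list[str]], min_instances: int) -> bool:
    """True if the fleet still has at least min_instances running."""
    return sum(len(insts) for insts in fleet.values()) >= min_instances


# Hypothetical fleet; a seeded generator keeps the run repeatable.
fleet = {"zone-a": ["i-1", "i-2"], "zone-b": ["i-3", "i-4"]}
victim = terminate_random_instance(fleet, random.Random(0))
print("terminated:", victim, "healthy:", is_healthy(fleet, min_instances=3))
```

Running a check like this on a schedule, rather than once, is what turns it into a resilience test: a regression that makes the service depend on a single instance shows up the next time that instance is picked.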

Common Pitfalls

1. Failing to open alerts early during a service degradation can delay the response and increase customer impact.
Netflix recognized that opening an alert earlier could have helped the team identify the issue more quickly, which underlines the importance of timely monitoring and alerting.

Related Concepts

High Availability Systems

Disaster Recovery Planning

Chaos Engineering