Building Services at Airbnb, Part 3

The third in a series on building services architecture, this article looks at how we built resilience engineering practices into the…

Liang Guo
14 min read, advanced

Overview

In this article, the authors discuss resilience engineering practices integrated into Airbnb's service platform, which supports their service-oriented architecture. They highlight the importance of resilience as a requirement, not just a feature, and share various strategies implemented to enhance service availability and performance.

What You'll Learn

1. How to implement asynchronous request processing in Java services
2. Why request queuing is essential for handling burst traffic
3. How to apply load shedding techniques to prevent service overload
4. When to use dependency isolation to enhance service resilience

Prerequisites & Requirements

  • Understanding of service-oriented architecture and resilience engineering concepts
  • Familiarity with Java and the Dropwizard framework (optional)

Key Questions Answered

What resilience engineering practices are implemented at Airbnb?
Airbnb has integrated various resilience engineering practices into its service platform, including asynchronous request processing, request queuing, load shedding, dependency isolation, and outlier server host detection. These practices aim to improve service availability and handle traffic spikes effectively.
How does Airbnb handle service overload and prevent cascading failures?
Airbnb implements load shedding techniques, such as service back pressure and client quota-based rate limiting, to manage excessive load. These strategies prevent cascading failures by ensuring that overloaded services can gracefully reject requests, allowing for system recovery.
What is the impact of resilience on service availability at Airbnb?
Resilience is crucial for maintaining service availability, especially in a distributed architecture. For instance, if a service needs 20 dependent services that each have 99.9% availability, its overall uptime could drop to 98.0%. Without resilience measures, that is roughly 14 hours of downtime in a 30-day month.
What role does request queuing play in service performance?
Request queuing helps services absorb burst traffic and prevents resource exhaustion. By implementing a controlled delay queue, Airbnb can manage incoming requests more effectively, ensuring that services remain responsive even under heavy load.
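To make the controlled delay idea concrete, here is a minimal Java sketch. It is a simplified variant written for illustration, not Airbnb's implementation: the class name `ControlledDelayQueue`, its parameters (`targetDelayMillis`, `intervalMillis`, `capacity`), and the injected clock are all assumptions. The queue timestamps each request on arrival. If the queue has not been empty for longer than `intervalMillis`, it treats the service as overloaded and drops any request that has already waited more than the much shorter `targetDelayMillis`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.function.LongSupplier;

/** Illustrative controlled-delay queue: stale requests are dropped at dequeue time. */
final class ControlledDelayQueue<T> {
    private static final class Entry<T> {
        final T item;
        final long enqueuedAtMillis;

        Entry(T item, long enqueuedAtMillis) {
            this.item = item;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    private final Deque<Entry<T>> queue = new ArrayDeque<>();
    private final int capacity;
    private final long targetDelayMillis; // allowed wait while the queue is overloaded
    private final long intervalMillis;    // allowed wait under normal conditions
    private final LongSupplier clock;     // injected so the behavior is testable
    private long lastEmptyAtMillis;       // last time the queue was observed empty
    private int dropped;

    ControlledDelayQueue(int capacity, long targetDelayMillis,
                         long intervalMillis, LongSupplier clock) {
        this.capacity = capacity;
        this.targetDelayMillis = targetDelayMillis;
        this.intervalMillis = intervalMillis;
        this.clock = clock;
        this.lastEmptyAtMillis = clock.getAsLong();
    }

    /** Returns false when the queue is full so the caller can reject right away. */
    synchronized boolean offer(T item) {
        if (queue.size() >= capacity) {
            return false;
        }
        long now = clock.getAsLong();
        if (queue.isEmpty()) {
            lastEmptyAtMillis = now;
        }
        queue.addLast(new Entry<>(item, now));
        return true;
    }

    /** Returns the next request that is still fresh enough to be worth serving. */
    synchronized Optional<T> poll() {
        long now = clock.getAsLong();
        // A queue that has not drained for a whole interval is a standing queue.
        boolean overloaded = now - lastEmptyAtMillis > intervalMillis;
        long maxWait = overloaded ? targetDelayMillis : intervalMillis;
        while (!queue.isEmpty()) {
            Entry<T> entry = queue.pollFirst();
            if (queue.isEmpty()) {
                lastEmptyAtMillis = now;
            }
            if (now - entry.enqueuedAtMillis <= maxWait) {
                return Optional.of(entry.item);
            }
            dropped++; // waited too long; the caller has likely timed out already
        }
        return Optional.empty();
    }

    synchronized int droppedCount() {
        return dropped;
    }
}
```

A worker thread would call `poll()` in a loop and answer dropped or refused requests with a fast failure. Dropping stale requests first means the service spends its capacity on callers that are probably still waiting.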

Key Statistics & Figures

Uptime impact of dependent services: 98.0%
If 20 dependent services each have 99.9% availability, the overall uptime for a service drops to 98.0%.

OAuth service error rate spike: 25%
During a traffic surge, the OAuth service experienced an initial error rate spike to 25% over a period of 5 minutes.
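
The 98.0% uptime figure can be reproduced with a short calculation. This assumes that the service needs all 20 dependencies for every request and that their failures are independent. Both are simplifying assumptions, and the downtime conversion uses a 30-day month.

```latex
A_{\text{service}} = \prod_{i=1}^{20} a_i = 0.999^{20} \approx 0.9802

\text{downtime per 30-day month} \approx (1 - 0.9802) \times 720\ \text{h} \approx 14.3\ \text{h}
```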

Technologies & Tools

Backend
Dropwizard
Used for developing Java services with enhanced asynchronous request processing capabilities.
Networking
Envoy
Implemented for client-side smart load balancing and outlier detection.

Key Actionable Insights

1. Implement asynchronous request processing to enhance throughput and resource utilization in your services.
   Asynchronous processing allows services to handle more concurrent requests without blocking I/O threads, which is particularly beneficial during traffic spikes.
2. Utilize request queuing techniques to manage burst traffic effectively.
   By applying a controlled delay queue, services can prevent overload and maintain performance during high demand periods.
3. Adopt load shedding strategies to protect services from excessive load.
   Implementing service back pressure and client quota-based rate limiting can help maintain service stability and prevent cascading failures during traffic surges.
4. Use dependency isolation to mitigate the impact of problematic downstream services.
   By isolating dependencies, services can continue to function even if one or more downstream services experience issues, thereby enhancing overall resilience.
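
One common way to isolate a dependency is a bulkhead: a cap on the number of in-flight calls to each downstream service, so that a slow dependency cannot absorb all of the caller's capacity. The Java sketch below illustrates the idea with asynchronous calls. The source does not say how Airbnb implements isolation, so the class `DependencyBulkhead`, its `call` method, and the semaphore-based design are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative bulkhead: limits concurrent async calls to one downstream dependency. */
final class DependencyBulkhead {
    private final String dependencyName;
    private final Semaphore permits;

    DependencyBulkhead(String dependencyName, int maxConcurrentCalls) {
        this.dependencyName = dependencyName;
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the async call if a permit is free; otherwise fails fast without calling. */
    <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> asyncCall) {
        if (!permits.tryAcquire()) {
            CompletableFuture<T> rejected = new CompletableFuture<>();
            rejected.completeExceptionally(
                    new IllegalStateException("bulkhead full for " + dependencyName));
            return rejected;
        }
        CompletableFuture<T> result;
        try {
            result = asyncCall.get();
        } catch (RuntimeException e) {
            // The call failed before producing a future; give the permit back.
            permits.release();
            CompletableFuture<T> failed = new CompletableFuture<>();
            failed.completeExceptionally(e);
            return failed;
        }
        // Release the permit once the downstream call finishes, successfully or not.
        return result.whenComplete((value, error) -> permits.release());
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With one `DependencyBulkhead` per downstream service, requests that only need healthy dependencies keep flowing while calls to the troubled one are rejected immediately. The right limit for each dependency depends on its latency and traffic and would need tuning.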

Common Pitfalls

1. Inconsistent implementation of fault tolerance measures can lead to service failures.
   Without a standardized approach to resilience engineering, services may become weak links, causing overall system instability during high traffic or failure scenarios.

Related Concepts

Service-oriented Architecture
Resilience Engineering
Load Balancing Techniques
Dependency Management in Microservices