LinkedOut: A Request-Level Failure Injection Framework

Logan Rosen
12 min read, intermediate

Overview

LinkedOut is a request-level failure injection framework developed at LinkedIn to support resilience engineering. The article describes how engineers use it to simulate several types of failure in a controlled way, testing system robustness while keeping the impact on users minimal.

What You'll Learn

1. How to implement request-level failure injection in distributed systems
2. Why controlled experimentation is essential for resilience testing
3. How to utilize the LiX framework for A/B testing and feature gating

Prerequisites & Requirements

  • Understanding of resilience engineering concepts
  • Familiarity with the Rest.li framework (optional)

Key Questions Answered

What types of failures can LinkedOut simulate?
LinkedOut can simulate three types of failures: Error, which throws a DisruptException to mock resource unavailability; Delay, which injects latency before passing the request downstream; and Timeout, which waits for a specified timeout period before throwing a TimeoutException.
How does the LinkedOut web application facilitate automated testing?
The LinkedOut web application allows for automated testing by leveraging a service account with access to all products, enabling engineers to run tests across various parts of LinkedIn. It uses the Celery distributed task queue framework to scale testing across multiple hosts.
What is the purpose of the Chrome extension developed for LinkedOut?
The Chrome extension lets users inject failure data into requests with minimal effort: it discovers the services involved and applies disruptions to them. This gives engineers a simple interface for testing resilience in real time.
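
The three failure modes described above (Error, Delay, Timeout) can be illustrated with a minimal sketch. This is not LinkedOut's implementation, which the article places in the Rest.li request path; it is a plain Python stand-in under stated assumptions. `DisruptException` and `TimeoutException` are named in the article, but the classes defined here, along with `Disruption` and `call_with_disruption`, are hypothetical. The sketch also assumes that a Timeout disruption fails without calling downstream, which the summary does not specify.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class DisruptException(Exception):
    """Hypothetical stand-in for the error an Error disruption throws."""


class TimeoutException(Exception):
    """Hypothetical stand-in for the error a Timeout disruption throws."""


@dataclass(frozen=True)
class Disruption:
    mode: str             # "error", "delay", or "timeout"
    seconds: float = 0.0  # latency to inject, or time to wait before timing out


def call_with_disruption(
    handler: Callable[[str], str],
    request: str,
    disruption: Optional[Disruption] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Apply at most one disruption to a request before the downstream call."""
    if disruption is None:
        return handler(request)
    if disruption.mode == "error":
        # Fail immediately, as if the downstream resource were unavailable.
        raise DisruptException(f"injected error for {request}")
    if disruption.mode == "delay":
        # Add latency, then let the real call proceed.
        sleep(disruption.seconds)
        return handler(request)
    if disruption.mode == "timeout":
        # Wait out the timeout period, then fail (assumed: no downstream call).
        sleep(disruption.seconds)
        raise TimeoutException(f"injected timeout for {request}")
    raise ValueError(f"unknown disruption mode: {disruption.mode}")
```

For example, `call_with_disruption(handler, "profile", Disruption("delay", 0.5))` would sleep for half a second and then return whatever `handler` returns. The `sleep` parameter is injectable so that tests can record the requested wait instead of actually waiting.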

Key Actionable Insights

1. Request-level failure injection lets you find weaknesses before they affect users: by simulating errors, delays, and timeouts, you can see how your application behaves when a dependency misbehaves. This kind of proactive resilience testing matters most in complex distributed systems, where failures occur unexpectedly.
2. Using the LiX framework for A/B testing lets you target specific segments of traffic for failure testing, which minimizes user impact while still showing how the system behaves under stress. Carefully chosen test groups yield data that informs better system design and user experience.
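
Targeting a segment of traffic for failure testing can be sketched with deterministic hash bucketing, a common technique in A/B testing systems. The summary does not show LiX's API, so nothing below should be read as LiX's interface: `in_treatment`, the experiment name, and the member IDs are all hypothetical.

```python
import hashlib


def in_treatment(member_id: str, experiment: str, percent: float) -> bool:
    """Place roughly `percent` percent of members in the treatment group.

    Hypothetical helper, not LiX's API. The same member and experiment
    always map to the same bucket, so a member's assignment is stable.
    """
    key = f"{experiment}:{member_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Only members in the treatment group would have a disruption attached
# to their requests; everyone else is served normally.
targeted = [m for m in ("member-1", "member-2", "member-3")
            if in_treatment(m, "feed-delay-test", 1.0)]
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same small group of members is not hit by every failure test.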

Common Pitfalls

1. Relying solely on service accounts for testing can produce misleading results, because a service account may not accurately represent real user experiences. Tests can then fail under conditions that would never occur for actual users, which is why realistic test scenarios matter.

Related Concepts

Resilience Engineering
Failure Injection Testing
A/B Testing Frameworks
Distributed Systems