Resilience Engineering at LinkedIn with Project Waterbear

Bhaskaran Devaraj
12 min read, intermediate

Overview

The article discusses LinkedIn's approach to resilience engineering through Project Waterbear, which aims to enhance the reliability of services and infrastructure. It outlines the project's design goals, chaos engineering practices, cultural changes, and improvements to the Rest.li framework.

What You'll Learn

1. How to implement chaos engineering practices using LinkedOut
2. Why graceful degradation is essential for user experience during failures
3. How to enhance the Rest.li framework for better resilience

Prerequisites & Requirements

  • Understanding of resilience engineering concepts
  • Familiarity with the Rest.li framework (optional)

Key Questions Answered

What is Project Waterbear and its goals?
Project Waterbear at LinkedIn aims to provide application resilience as a service, focusing on improving the reliability of applications through chaos engineering, cultural changes, and enhancements to the Rest.li framework. The project addresses increasing complexity and interdependencies in LinkedIn's infrastructure.
How does LinkedOut facilitate chaos engineering?
LinkedOut is a framework that allows developers to simulate failures in production environments with minimal effort. It integrates with the Rest.li client, enabling granular control over failure scenarios, which helps ensure that applications can handle various failure modes without impacting user experience.
What are the key design goals of Project Waterbear?
The design goals of Project Waterbear include ensuring resilient resource clusters, maintaining robust infrastructure, intelligently handling failures, gracefully degrading services, and increasing SRE happiness through self-healing systems. These goals are essential for scaling LinkedIn's services effectively.
What cultural changes were implemented for resilience engineering?
Cultural changes for resilience engineering at LinkedIn include promoting transparency around service failures, encouraging graceful degradation, and involving all teams in resilience efforts. This approach fosters a collaborative environment where every team can contribute to improving system resilience.
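
The kind of client-side failure injection described for LinkedOut can be pictured with a small sketch. This is not LinkedOut's or Rest.li's actual API: `DisruptRule`, `DisruptingClient`, and the `"error"`/`"timeout"` modes are hypothetical names chosen for illustration. The idea shown is that a wrapper around the outgoing call decides, per downstream resource, whether to let the request through or fail it on purpose.

```python
import random
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised in place of a real downstream error."""


@dataclass
class DisruptRule:
    # Hypothetical rule shape, not LinkedOut's real configuration.
    resource: str      # downstream resource to disrupt, e.g. "profiles"
    mode: str          # "error" or "timeout"
    rate: float = 1.0  # fraction of matching calls to disrupt


class DisruptingClient:
    """Wraps a call function and injects failures for targeted resources."""

    def __init__(self, call, rules, rng=None):
        self._call = call
        self._rules = {rule.resource: rule for rule in rules}
        self._rng = rng or random.Random()

    def get(self, resource, key):
        rule = self._rules.get(resource)
        # Only calls to a targeted resource are candidates for disruption.
        if rule is not None and self._rng.random() < rule.rate:
            if rule.mode == "timeout":
                raise TimeoutError(f"injected timeout for {resource}")
            raise InjectedFailure(f"injected error for {resource}")
        return self._call(resource, key)
```

Wrapping the client rather than the downstream service keeps the disruption scoped to the caller under test, so other consumers of the same resource are unaffected.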

Technologies & Tools

Backend
Rest.li
Used as the framework for building LinkedIn's microservices and for implementing resilience features.
Infrastructure
SaltStack
Utilized for simulating infrastructure failures in production environments.

Key Actionable Insights

1. Implement chaos engineering practices using LinkedOut to proactively identify weaknesses in your applications.
   By simulating failures in a controlled manner, teams can better understand how their applications will behave under stress, leading to improved resilience and user experience.
2. Adopt graceful degradation strategies to enhance user experience during service disruptions.
   Planning for graceful degradation ensures that when non-core services fail, the main functionalities remain operational, minimizing user impact and maintaining satisfaction.
3. Enhance the Rest.li framework settings to optimize performance and resilience.
   Customizing Rest.li settings based on service-specific needs can prevent single-node failures from affecting user-facing applications, thus improving overall system reliability.
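
Graceful degradation, as described in the second insight, can be sketched in a few lines. The function and field names below (`render_profile_page`, `fetch_profile`, `fetch_recommendations`, `degraded`) are assumptions made for illustration and do not come from the article. The sketch treats the profile as core data and the recommendations as a non-core module that may fail without breaking the page.

```python
def render_profile_page(fetch_profile, fetch_recommendations, member_id):
    """Build a page dict; only the core profile call is allowed to fail the page."""
    # Core data: an exception here propagates, because the page is useless without it.
    page = {"profile": fetch_profile(member_id)}

    # Non-core data: any failure degrades to an empty module instead of an error page.
    try:
        page["recommendations"] = fetch_recommendations(member_id)
        page["degraded"] = False
    except Exception:
        page["recommendations"] = []
        page["degraded"] = True
    return page
```

The `degraded` flag is one possible way to let the caller log or surface the partial response. Deciding which calls count as core is a product decision that has to be made per page.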

Common Pitfalls

1. Failing to account for the complexity of interdependencies can lead to unexpected failures during service disruptions.
   As systems grow, the number of dependencies increases, making it crucial to plan for how failures in one area can impact others. Regular testing and chaos engineering can help identify these risks.
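
One way to reason about how a failure in one area can reach others is to walk the dependency graph backwards from the failed service. The sketch below assumes a hand-written map of which service calls which. The service names and the `affected_by` helper are hypothetical and are not part of Project Waterbear.

```python
from collections import deque


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each service, who calls it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk from the failed service through its callers.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen and caller != failed:
                seen.add(caller)
                queue.append(caller)
    return seen
```

For example, with `{"feed": ["profiles", "ads"], "profiles": ["storage"], "ads": []}`, a failure in `storage` reaches `profiles` directly and `feed` through `profiles`. A list like this is only a starting point for choosing which failure experiments to run first.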

Related Concepts

Resilience Engineering
Chaos Engineering
Graceful Degradation
Microservices Architecture