parking_lot: ffffffffffffffff...

The basic idea is simple: customers give us Docker containers, and tell us which one of 30+ regions around the world they want them to run in. We convert the containers into lightweight virtual machines, and then link them to an Anycast network. If y

Peter Cai, Pavel Borzenkov
19 min read, advanced

Overview

The article discusses a complex concurrency bug encountered in Fly.io's Anycast router, which is implemented in Rust. It details the challenges of managing state in a distributed system and traces the eventual root cause to a bug in the parking_lot library's RwLock implementation.

What You'll Learn

1. How to manage state in a distributed system effectively
2. Why using optimized lock implementations can prevent deadlocks
3. How to implement lazy-loading for state management in proxies

Prerequisites & Requirements

  • Understanding of Rust programming and concurrency concepts
  • Familiarity with the parking_lot library (optional)

Key Questions Answered

What caused the deadlocks in the Fly.io Anycast router?
The deadlocks were caused by a concurrency bug in the parking_lot library's RwLock implementation: a bitwise double-free that occurred while managing read and write locks. The bug corrupted the lock state, which caused the proxy to lock up.
How does the Corrosion routing protocol manage state across servers?
Corrosion uses a globally replicated SQLite database structured as CRDTs, with individual worker servers acting as the source of truth. Updates are disseminated using SWIM gossip, allowing for rapid state synchronization across thousands of servers.
What is the role of the watchdog system in the Fly.io proxy?
The watchdog system monitors the internal control channel of the fly-proxy for responsiveness. If it detects a deadlock or exhaustion, it automatically restarts the proxy, mitigating the impact of these issues on service availability.
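
The watchdog idea above can be sketched with the standard library alone. This is a minimal illustration, not fly-proxy's actual code: the names `spawn_worker`, `probe`, `Health`, and `Ping` are hypothetical, and the "control channel" is modeled as an in-process `mpsc` channel. A worker thread answers pings, and the probe treats a missing reply within the deadline as unresponsive.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Result of a single watchdog probe (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Health {
    Responsive,
    Unresponsive,
}

/// A ping carries the channel on which the worker should reply.
type Ping = Sender<()>;

/// Spawns a worker that answers each ping after `reply_delay`.
/// A long delay stands in for a deadlocked or exhausted worker.
fn spawn_worker(reply_delay: Duration) -> Sender<Ping> {
    let (tx, rx) = mpsc::channel::<Ping>();
    thread::spawn(move || {
        for reply in rx {
            thread::sleep(reply_delay);
            // The prober may have given up already; ignore send errors.
            let _ = reply.send(());
        }
    });
    tx
}

/// Sends one ping over the control channel and waits up to `deadline`
/// for the reply. A timeout or a closed channel counts as unresponsive.
fn probe(control: &Sender<Ping>, deadline: Duration) -> Health {
    let (reply_tx, reply_rx) = mpsc::channel();
    if control.send(reply_tx).is_err() {
        return Health::Unresponsive; // worker thread is gone
    }
    match reply_rx.recv_timeout(deadline) {
        Ok(()) => Health::Responsive,
        Err(_) => Health::Unresponsive,
    }
}
```

A supervisor would call `probe` on a timer and restart the process when it returns `Health::Unresponsive`. The restart itself is left out of the sketch because it depends on how the service is deployed.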

Key Actionable Insights

1. Implement a watchdog system to monitor application responsiveness and recover from deadlocks automatically.
   This approach can significantly reduce downtime in distributed systems by ensuring that temporary issues do not lead to prolonged outages.
2. Consider using optimized lock implementations like parking_lot for better performance and reduced contention.
   These libraries can provide additional features such as lock timeouts, which help in managing concurrency more effectively.
3. Adopt lazy-loading strategies for state management to optimize resource usage in distributed applications.
   This can prevent unnecessary loading of state information that is not needed for specific instances of your application, improving efficiency.
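
As a rough illustration of the lazy-loading insight, the sketch below loads an application's state only on first use and caches it afterwards. Everything here is an assumption made for the example: `AppState`, `LazyCatalog`, and the `loader` closure are hypothetical names, and the real proxy's state and storage are not described in this summary.

```rust
use std::collections::HashMap;

/// Hypothetical per-application routing state; the field is illustrative.
#[derive(Debug, Clone, PartialEq)]
struct AppState {
    backends: Vec<String>,
}

/// Loads an application's state the first time it is requested and
/// serves later requests from memory.
struct LazyCatalog<F: Fn(&str) -> AppState> {
    loader: F,                        // stands in for a query against the state store
    cache: HashMap<String, AppState>, // state loaded so far
    loads: usize,                     // how many times the loader actually ran
}

impl<F: Fn(&str) -> AppState> LazyCatalog<F> {
    fn new(loader: F) -> Self {
        LazyCatalog { loader, cache: HashMap::new(), loads: 0 }
    }

    /// Returns the state for `app`, invoking the loader only on a cache miss.
    fn get(&mut self, app: &str) -> &AppState {
        if !self.cache.contains_key(app) {
            let state = (self.loader)(app);
            self.cache.insert(app.to_string(), state);
            self.loads += 1;
        }
        &self.cache[app]
    }
}
```

An instance that never receives traffic for a given application never pays for loading that application's state. A production version would also need invalidation when the underlying state changes, which this sketch does not attempt.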

Common Pitfalls

1. Over-reliance on RAII-style lock acquisition can lead to unclear locking intervals and potential deadlocks.
   It's crucial to be explicit about lock lifetimes, as implicit behavior can obscure the actual flow of control and lead to concurrency issues.
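
One common shape of this pitfall, sketched here with the standard library's `RwLock` and a hypothetical `get_or_insert` helper, is a read guard that stays alive longer than the code suggests. The example is an assumption-laden illustration rather than the router's code. It shows the risky form only in a comment and makes both locking intervals explicit in the working version.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Returns the value stored under `key`, inserting `default` if it is missing.
fn get_or_insert(map: &RwLock<HashMap<String, u32>>, key: &str, default: u32) -> u32 {
    // Risky shape, shown only as a comment. In editions before Rust 2024 the
    // temporary read guard in an `if let` scrutinee lives until the end of
    // the whole `if`/`else`, so requesting the write lock in the `else`
    // branch can deadlock:
    //
    //   if let Some(v) = map.read().unwrap().get(key) { /* ... */ }
    //   else { map.write().unwrap().insert(/* ... */); }

    // Explicit shape: the read guard lives in its own block, so it is
    // visibly released before the write lock is requested.
    let cached = {
        let guard = map.read().unwrap();
        guard.get(key).copied()
    }; // read guard dropped here

    if let Some(value) = cached {
        return value;
    }

    let mut guard = map.write().unwrap();
    // Re-check under the write lock: another thread may have inserted the
    // key between the two locking intervals.
    let value = *guard.entry(key.to_string()).or_insert(default);
    drop(guard); // make the end of the write interval explicit
    value
}
```

The explicit block and the `drop` call cost nothing at runtime. Their purpose is to let a reviewer see exactly where each lock is held.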

Related Concepts

Concurrency In Distributed Systems
State Management Strategies
Optimized Locking Mechanisms