Taming A Voracious Rust Proxy

The basic idea of our service is that we run containers for our users, as hardware-isolated virtual machines (Fly Machines), on hardware we own around the world. What makes that interesting is that we also connect every Fly Machine to a global Anycas

Peter Cai
7 min read · intermediate

Overview

The article discusses a performance issue encountered with the Rust-based proxy service, fly-proxy, which is part of Fly.io's infrastructure. It details the investigation into elevated HTTP errors and CPU utilization, leading to the discovery of a bug in the TlsStream state machine that caused busy loops under certain conditions.

What You'll Learn

1. How to diagnose performance issues in Rust applications using profiling tools
2. Why understanding the async Rust ecosystem is crucial for performance optimization
3. When to update dependencies to avoid bugs and vulnerabilities

Prerequisites & Requirements

  • Understanding of Rust programming and async/await concepts
  • Experience with performance profiling and debugging in Rust

Key Questions Answered

What caused the elevated HTTP errors and CPU utilization in fly-proxy?
The cause was a bug in the TlsStream state machine, the async TLS stream type used with Rustls. Under certain conditions during TLS session closure, the state machine entered a busy loop. Load testing by a partner, Tigris Data, triggered the bug.
How does the async Rust ecosystem affect performance in applications?
In async Rust, the runtime polls a Future only when its Waker signals that progress may be possible. If a Future is woken repeatedly while its underlying state never advances, each wakeup produces a poll that does no work, and the task turns into a busy loop. The TlsStream bug behaved this way: it drove CPU utilization up sharply without performing any actual I/O.
What lessons were learned from the incident with fly-proxy?
Key lessons include the importance of keeping dependencies updated to avoid vulnerabilities and the need for better instrumentation to detect spurious wakeups. Additionally, understanding the async Rust framework is critical for optimizing performance and avoiding similar issues in the future.
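
The busy-loop pattern described in these answers can be illustrated with a minimal sketch that uses only the standard library. `SelfWaking`, `CountingWaker`, and `run_counting` are hypothetical names made up for this illustration; none of this is fly-proxy or TlsStream code, and the toy executor stands in for a real runtime such as Tokio.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical waker that only counts how often it is invoked.
struct CountingWaker {
    wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical future that does no I/O. Until `remaining` reaches zero it
/// wakes itself and returns Pending, so the executor polls it again at once.
struct SelfWaking {
    remaining: usize,
}

impl Future for SelfWaking {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.remaining == 0 {
            return Poll::Ready(());
        }
        self.remaining -= 1;
        // The problem pattern: request another poll without any progress.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Toy executor: re-polls only if the waker fired since the last poll.
/// Returns (number of polls, number of wakeups).
fn run_counting<F: Future>(fut: F) -> (usize, usize) {
    let counter = Arc::new(CountingWaker { wakes: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    let mut seen = 0;
    loop {
        polls += 1;
        if fut.as_mut().poll(&mut cx).is_ready() {
            break;
        }
        let now = counter.wakes.load(Ordering::SeqCst);
        if now == seen {
            // Nobody asked for another poll; a real runtime would sleep here.
            break;
        }
        seen = now;
    }
    (polls, counter.wakes.load(Ordering::SeqCst))
}
```

Calling `run_counting(SelfWaking { remaining: 1000 })` polls the future over a thousand times with no I/O in between, while a future that returns Pending without waking itself (for example `std::future::pending::<()>()`) is polled once and then left alone.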

Key Statistics & Figures

CPU utilization: skyrocketing. Observed during the incident with elevated fly-proxy HTTP errors.

Technologies & Tools


  • Programming Language: Rust. Used to develop the fly-proxy service and manage asynchronous operations.
  • Runtime: Tokio. Provides the async runtime for managing Futures and Wakers in Rust applications.
  • Library: Rustls. Used for TLS handling in the fly-proxy service.

Key Actionable Insights

1. Regularly profile your Rust applications to identify performance bottlenecks early. Profiling can reveal unexpected CPU usage patterns, allowing for timely fixes before they escalate into larger issues.
2. Implement robust monitoring for your async operations to catch spurious wakeups. By tracking these events, you can gain insights into potential inefficiencies in your async code and address them proactively.
3. Stay updated with the latest changes in your dependencies, especially those related to critical components like TlsStream. This can prevent running into known bugs that could lead to performance degradation or security vulnerabilities.
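
One way the monitoring in the second insight could look is a wrapper future that counts polls. The sketch below is an assumption about how such instrumentation might be built, not the instrumentation Fly.io uses; `PollStats`, `Instrumented`, `NoopWake`, and `drive` are hypothetical names, and `drive` is a deliberately naive test driver, not a real executor.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical counters shared between a task and whatever exports metrics.
#[derive(Default)]
struct PollStats {
    polls: AtomicU64,
    pending: AtomicU64,
}

impl PollStats {
    /// True if the future returned Pending more often than `max_pending`,
    /// a hint (not proof) that it is being woken without making progress.
    fn looks_busy(&self, max_pending: u64) -> bool {
        self.pending.load(Ordering::Relaxed) > max_pending
    }
}

/// Wraps any future and records every poll and every Pending result.
struct Instrumented<F> {
    inner: Pin<Box<F>>,
    stats: Arc<PollStats>,
}

impl<F: Future> Instrumented<F> {
    fn new(inner: F, stats: Arc<PollStats>) -> Self {
        Self { inner: Box::pin(inner), stats }
    }
}

impl<F: Future> Future for Instrumented<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        self.stats.polls.fetch_add(1, Ordering::Relaxed);
        let result = self.inner.as_mut().poll(cx);
        if result.is_pending() {
            self.stats.pending.fetch_add(1, Ordering::Relaxed);
        }
        result
    }
}

/// Waker that does nothing; only needed to build a Context for `drive`.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Naive driver for trying the wrapper out: polls unconditionally until the
/// future is ready. A real runtime would wait for a wakeup between polls.
fn drive<F: Future + Unpin>(mut fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return out;
        }
    }
}
```

In a service, the counters would be exported per task or per connection, and an alert on a high pending-to-completion ratio would surface tasks that spin without doing I/O. The threshold passed to `looks_busy` is workload-specific and would need tuning.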

Common Pitfalls

1. Entering a polling loop without actual progress can lead to busy loops and high CPU usage. This often occurs when the underlying state machine does not advance, causing the application to waste resources on unnecessary polling.
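
A common guard against this pitfall is to treat "no progress" as an explicit outcome instead of retrying. The synchronous sketch below shows the idea on a write loop; it is an analogy under stated assumptions, not the actual TlsStream fix. `flush_buffer` and `ClosedSink` are hypothetical names, with `ClosedSink` standing in for a peer that no longer accepts data.

```rust
use std::io::{self, ErrorKind, Write};

/// Writes all of `buf` to `sink`, returning the number of bytes written.
/// If the sink accepts zero bytes, the loop stops with an error instead of
/// retrying forever on a write that can never advance.
fn flush_buffer<W: Write>(sink: &mut W, mut buf: &[u8]) -> io::Result<usize> {
    let mut total = 0;
    while !buf.is_empty() {
        match sink.write(buf) {
            // No progress: bail out rather than spin.
            Ok(0) => {
                return Err(io::Error::new(
                    ErrorKind::WriteZero,
                    "sink accepted no bytes",
                ));
            }
            // Progress: advance past the bytes that were written.
            Ok(n) => {
                total += n;
                buf = &buf[n..];
            }
            // Interrupted writes are safe to retry.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}

/// Hypothetical sink that never accepts data, like a closed connection.
struct ClosedSink;

impl Write for ClosedSink {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Ok(0)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Without the `Ok(0)` arm, `flush_buffer(&mut ClosedSink, b"hello")` would loop forever at full CPU. The same principle applies to async state machines: a poll that cannot advance should either return an error or return Pending without scheduling an immediate re-poll.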

Related Concepts

Asynchronous Programming In Rust
Profiling And Debugging Techniques
TLS And Security In Network Applications