Engineering Failover Handling in Uber’s Mobile Networking Infrastructure

Sivabalan Narayanan, Rajesh Mahindra, Christopher Francis
21 min read, advanced

Overview

The article discusses the implementation of a failover handling mechanism in Uber's mobile networking infrastructure, designed to ensure reliable network communication for its applications. It highlights the challenges faced during the design process, the evolution of the failover handler, and the performance improvements achieved post-implementation.

What You'll Learn

  1. How to design a failover handler using a finite state machine
  2. Why distinguishing between network errors and domain outages is crucial for application performance
  3. How to implement canary requests to improve domain availability detection

Prerequisites & Requirements

  • Understanding of network protocols like TCP and QUIC
  • Experience with mobile application development (optional)

Key Questions Answered

How does Uber's failover handler improve network reliability?
Uber's failover handler intelligently routes traffic based on network conditions, maximizing the use of primary domains and minimizing unnecessary switches to backup domains. This design reduces latency and error rates, enhancing user experience during outages.
What are the key performance improvements observed after implementing the failover handler?
After rolling out the failover handler, Uber observed a 25-30 percent reduction in tail-end latencies for HTTPS traffic compared to previous solutions, along with lower error rates during cloud outages, leading to better user experiences.
What challenges did Uber face while designing the failover handler?
Uber faced challenges in distinguishing between user-end connectivity failures and actual domain outages, particularly due to the unreliable nature of mobile networks. This complexity necessitated a robust design to ensure seamless user experiences.
What role do canary requests play in the failover handler?
Canary requests are used to verify the availability of backup domains before routing traffic to them. This mechanism helps in accurately detecting domain failures and prevents unnecessary switches during transient network issues.
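
As an illustration of the canary idea described above, here is a minimal Python sketch of a probe that reports whether a domain answers a lightweight request. The `/health` path, the 2-second timeout, and the names `canary_ok` and `http_status` are assumptions made for this example; the article summary does not specify how Uber's canary requests are built.

```python
import http.client
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float) -> int:
    """Default fetcher: issue a GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def canary_ok(
    domain: str,
    path: str = "/health",          # hypothetical health endpoint
    timeout: float = 2.0,           # hypothetical probe timeout, in seconds
    fetch: Callable[[str, float], int] = http_status,
) -> bool:
    """Return True if the domain answers the canary with a 2xx status."""
    try:
        return 200 <= fetch(f"https://{domain}{path}", timeout) < 300
    except (OSError, http.client.HTTPException):
        # Timeouts, DNS failures, refused connections, and HTTP errors
        # all count as "canary failed".
        return False
```

The `fetch` parameter is injectable so the probe can be exercised without a network, for example `canary_ok("backup.example", fetch=lambda url, timeout: 204)`.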

Key Statistics & Figures

  • Reduction in tail-end latencies: 25-30%. Observed after implementing the failover handler compared to previous solutions.
  • Percentage of traffic routed through primary domains: 99%. Achieved with the new failover handler, significantly higher than the previous round-robin method.
  • Increase in QUIC usage: 5-25%. Noted with the new failover handler compared to the round-robin approach.

Technologies & Tools

  • QUIC (protocol): Used to improve performance and reduce latencies during poor network conditions.
  • TCP (protocol): Used as the transport for secure (TLS) connections in Uber's mobile applications.

Key Actionable Insights

  1. Implementing a failover handler as a finite state machine can significantly enhance application reliability.
     This design allows for clear state transitions based on network conditions, ensuring that applications can quickly adapt to outages without sacrificing performance.
  2. Utilizing canary requests can improve the accuracy of domain availability detection.
     By sending dedicated health checks to backup domains, applications can avoid unnecessary latency caused by switching to domains that may also be experiencing issues.
  3. Monitoring and adjusting recovery timers can optimize user experience during primary domain outages.
     Fine-tuning recovery timers helps balance responsiveness with stability, ensuring that the system does not switch domains too aggressively during intermittent connectivity.
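
The three insights above can be combined in a small finite state machine. The Python sketch below is illustrative only: the state names, the consecutive-failure threshold, the 60-second recovery timer, and the `canary` callable are all assumptions for this example, not a description of Uber's actual implementation.

```python
from enum import Enum, auto
from typing import Callable


class State(Enum):
    PRIMARY = auto()   # traffic goes to the primary domain
    FAILOVER = auto()  # primary suspected down; probing the backup
    BACKUP = auto()    # traffic goes to the backup domain
    RECOVERY = auto()  # recovery timer expired; probing the primary


class FailoverHandler:
    def __init__(
        self,
        primary: str,
        backup: str,
        canary: Callable[[str], bool],   # hypothetical probe: domain -> healthy?
        failure_threshold: int = 3,      # assumed value
        recovery_seconds: float = 60.0,  # assumed value
    ) -> None:
        self.primary = primary
        self.backup = backup
        self.canary = canary
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.state = State.PRIMARY
        self.failures = 0
        self.backup_since = 0.0

    def current_domain(self) -> str:
        """Domain that regular requests should use right now."""
        on_backup = self.state in (State.BACKUP, State.RECOVERY)
        return self.backup if on_backup else self.primary

    def on_result(self, ok: bool, now: float) -> None:
        """Record the outcome of a regular request sent to the primary."""
        if self.state is not State.PRIMARY:
            return
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.state = State.FAILOVER
            self._probe_backup(now)

    def _probe_backup(self, now: float) -> None:
        if self.canary(self.backup):
            self.state = State.BACKUP
            self.backup_since = now
        else:
            # Backup is unreachable too, so the device has probably lost
            # connectivity; stay on the primary instead of switching.
            self.state = State.PRIMARY
        self.failures = 0

    def tick(self, now: float) -> None:
        """Call periodically; probes the primary once the timer expires."""
        if self.state is not State.BACKUP:
            return
        if now - self.backup_since < self.recovery_seconds:
            return
        self.state = State.RECOVERY
        if self.canary(self.primary):
            self.state = State.PRIMARY
            self.failures = 0
        else:
            self.state = State.BACKUP
            self.backup_since = now  # restart the recovery timer
```

In this sketch the canary runs synchronously, so `FAILOVER` and `RECOVERY` are passed through within a single call. A mobile client would more likely run the probe asynchronously and keep serving requests while it is in flight.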

Common Pitfalls

  1. Aggressive domain switching can lead to unnecessary latency and degraded user experience.
     This occurs when the system reacts to transient network errors without confirming the primary domain's availability, potentially causing users to experience slower performance.
  2. Failing to accurately distinguish between mobile connectivity issues and actual domain outages can result in poor application performance.
     If the failover handler switches to a backup domain during temporary connectivity loss, it may lead to a suboptimal experience once the network recovers.
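
One way to avoid reacting to isolated transient errors is to count failures only within a recent time window. The Python sketch below illustrates that idea. The class name `FailureWindow`, the threshold of 3, and the 30-second window are assumptions for this example, and a sliding window is one possible technique rather than the one the article confirms Uber uses.

```python
from collections import deque


class FailureWindow:
    """Flags a suspected outage only when failures cluster in time."""

    def __init__(self, threshold: int = 3, window_seconds: float = 30.0) -> None:
        self.threshold = threshold            # assumed value
        self.window_seconds = window_seconds  # assumed value
        self._failures: deque = deque()       # timestamps of recent failures

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        self._expire(now)

    def record_success(self) -> None:
        # A successful request shows the domain is reachable.
        self._failures.clear()

    def outage_suspected(self, now: float) -> bool:
        self._expire(now)
        return len(self._failures) >= self.threshold

    def _expire(self, now: float) -> None:
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
```

A suspected outage from this detector would still need confirmation, for example by a canary request to the backup domain, before any traffic is switched.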

Related Concepts

Network Reliability in Mobile Applications
Finite State Machines in Software Design
Impact of Latency on User Experience