A Foolish Consistency: Consul at Fly.io

We set the scene, as usual, with sandwiches. Dig if you will the picture: a global Sandwich Bracket application, ascertaining once and for all the greatest sandwich on the planet. Fly.io wants our app, sandwich-bracket, deployed close to users aroun

Overview

The article discusses the challenges and strategies of using Consul for service discovery at Fly.io, highlighting the complexities of maintaining consistency across a global infrastructure. It emphasizes the importance of adapting service discovery mechanisms to better suit the needs of a large-scale application platform.

What You'll Learn

  1. How to implement service discovery using Consul in a distributed system
  2. Why maintaining consistency across global services can be problematic
  3. When to use alternative messaging systems like NATS for load tracking

Prerequisites & Requirements

  • Understanding of distributed systems and service discovery concepts
  • Familiarity with Consul and its API (optional)

Key Questions Answered

What is Consul and how does it function in service discovery?
Consul is a distributed database that serves as a source of truth for services running in a system. It uses a cluster of servers that maintain a log of updates via the Raft consensus protocol, allowing agents on machines to report service events and maintain a consistent view of the services across the infrastructure.
What challenges does Fly.io face with Consul's consistency?
Fly.io struggles with maintaining a consistent view of services across its global infrastructure, as keeping all regions synchronized about service states can lead to inefficiencies and increased complexity. The article suggests that aiming for perfect consistency may not be practical.
How does Fly.io handle routing for services deployed in different regions?
Fly.io uses Anycast routing to direct traffic to the nearest instance of a service based on its unique IPv4 address. This allows votes for different sandwiches to be routed to the appropriate instances in various global locations, optimizing user experience.
What alternative systems does Fly.io use for tracking load?
Fly.io has transitioned from using Consul for load tracking to a messaging system called NATS. This change allows for more efficient load tracking without the overhead of maintaining consistency that Consul requires, as NATS is simpler and more flexible.
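
The appeal of a message bus for load tracking can be sketched without running a NATS server. What follows is a minimal in-process sketch under stated assumptions, not Fly.io's implementation: `LoadBus`, `LoadTracker`, the subject name `load.sandwich-bracket`, and the message fields `instance`, `seq`, and `load` are all hypothetical, and `LoadBus` merely stands in for a real NATS connection. It illustrates load reports as fire-and-forget messages where only the newest value per instance matters, so no consensus log is involved.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class LoadBus:
    """Hypothetical in-process stand-in for a pub/sub system such as NATS."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, subject: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[subject].append(handler)

    def publish(self, subject: str, message: dict) -> None:
        # Fire and forget: no replicated log, no quorum, no delivery guarantee.
        for handler in self._subscribers[subject]:
            handler(message)


class LoadTracker:
    """Keeps only the most recent load report per instance."""

    def __init__(self) -> None:
        self.latest: Dict[str, dict] = {}

    def on_report(self, message: dict) -> None:
        current = self.latest.get(message["instance"])
        # Drop reports that are older than (or equal to) what we already hold.
        if current is None or message["seq"] > current["seq"]:
            self.latest[message["instance"]] = message

    def least_loaded(self) -> str:
        return min(self.latest.values(), key=lambda m: m["load"])["instance"]


bus = LoadBus()
tracker = LoadTracker()
bus.subscribe("load.sandwich-bracket", tracker.on_report)

bus.publish("load.sandwich-bracket", {"instance": "syd-1", "seq": 1, "load": 40})
bus.publish("load.sandwich-bracket", {"instance": "fra-1", "seq": 1, "load": 10})
bus.publish("load.sandwich-bracket", {"instance": "fra-1", "seq": 2, "load": 75})
# A late duplicate of an old report arrives and is ignored.
bus.publish("load.sandwich-bracket", {"instance": "fra-1", "seq": 1, "load": 10})

print(tracker.least_loaded())
```

A lost or reordered report costs little here, because the next report overwrites it. That tolerance is the reason load data is a poor fit for a strongly consistent store.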

Key Statistics & Figures

Consul traffic across Fly.io's fleet
10 gb/sec
This high volume of traffic was driven by the inefficiencies in how Consul handled long-polling queries, leading to excessive data refreshes.
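
The long-polling pattern behind that figure can be sketched as follows. Consul's HTTP API supports blocking queries through the `index` and `wait` query parameters and reports the current index in the `X-Consul-Index` response header; everything else here is an assumption made for illustration. The `fetch` callable is a hypothetical stand-in for an HTTP client, the address `http://127.0.0.1:8500` is only the conventional local agent address, and the canned responses are invented. The point to notice is that each wake-up carries the full result set, not a delta.

```python
from typing import Callable, Iterator, List, Tuple
from urllib.parse import urlencode

# Hypothetical client signature: fetch(url) -> (X-Consul-Index value, JSON body)
Fetch = Callable[[str], Tuple[int, List[dict]]]


def health_url(base: str, service: str, index: int, wait: str = "5m") -> str:
    """Build a blocking-query URL for Consul's health endpoint."""
    query = urlencode({"index": index, "wait": wait})
    return f"{base}/v1/health/service/{service}?{query}"


def watch_service(fetch: Fetch, base: str, service: str, rounds: int) -> Iterator[List[dict]]:
    """Yield the full instance list each time the reported index moves forward."""
    index = 0
    for _ in range(rounds):
        new_index, body = fetch(health_url(base, service, index))
        if new_index < index:
            # The index went backwards: reset and query again from scratch.
            index = 0
            continue
        if new_index != index:
            index = new_index
            yield body  # the whole result, even if one instance changed
        # Otherwise the wait timed out with no change; poll again.


def make_fake_fetch():
    """Canned responses so the sketch runs without a Consul agent."""
    responses = iter([
        (10, [{"Service": {"ID": "syd-1"}}]),
        (10, [{"Service": {"ID": "syd-1"}}]),  # wait expired, nothing changed
        (11, [{"Service": {"ID": "syd-1"}}, {"Service": {"ID": "fra-1"}}]),
    ])
    seen_urls: List[str] = []

    def fetch(url: str) -> Tuple[int, List[dict]]:
        seen_urls.append(url)
        return next(responses)

    return fetch, seen_urls


fetch, seen_urls = make_fake_fetch()
snapshots = list(watch_service(fetch, "http://127.0.0.1:8500", "sandwich-bracket", rounds=3))
print(len(snapshots), seen_urls[-1])
```

With many watchers each re-reading a large result on every change, traffic grows with both the size of the catalog and the rate of updates.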

Key Actionable Insights

  1. Consider using a messaging system like NATS for tracking load across distributed services instead of relying solely on Consul.
     NATS provides a simpler and more flexible approach to managing load information, which can help alleviate the complexity and overhead associated with maintaining a consistent state in a service discovery system.
  2. Evaluate the necessity of global consistency in your service architecture; it may be more beneficial to focus on resilient routing strategies.
     The article highlights that aiming for perfect consistency can lead to inefficiencies. Instead, implementing strategies that allow for smart routing based on potentially stale data can improve performance.
  3. Explore the potential of local caching solutions to reduce the dependency on Consul for real-time data access.
     By creating a local cache of service states, Fly.io has reduced the load on Consul and improved response times, suggesting that similar strategies could benefit other distributed systems.
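
Insights 2 and 3 combine naturally: a router can read from a local cache and treat its contents as a hint that might be stale. Here is one way that could look, as a rough sketch only and not Fly.io's actual proxy logic. `Instance`, `LocalCatalog`, `route`, the region codes, and the distance figures are all hypothetical. The router tries the nearest instance the cache believes is healthy and simply moves to the next candidate when the cache turns out to be wrong.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Instance:
    id: str
    region: str
    healthy: bool  # last known state from the cache; may be out of date


class LocalCatalog:
    """Hypothetical per-host cache of service state, updated as events arrive."""

    def __init__(self) -> None:
        self._by_service: Dict[str, Dict[str, Instance]] = {}

    def apply_update(self, service: str, instance: Instance) -> None:
        self._by_service.setdefault(service, {})[instance.id] = instance

    def candidates(self, service: str, distance: Dict[str, int]) -> List[Instance]:
        # Nearest first, skipping instances the cache believes are unhealthy.
        live = [i for i in self._by_service.get(service, {}).values() if i.healthy]
        return sorted(live, key=lambda i: distance[i.region])


def route(
    catalog: LocalCatalog,
    service: str,
    distance: Dict[str, int],
    try_connect: Callable[[Instance], bool],
) -> Optional[str]:
    """Return the id of the first candidate that accepts a connection."""
    for instance in catalog.candidates(service, distance):
        if try_connect(instance):
            return instance.id
        # The cache was stale for this instance; fall through to the next one.
    return None


catalog = LocalCatalog()
catalog.apply_update("sandwich-bracket", Instance("syd-1", "syd", healthy=True))
catalog.apply_update("sandwich-bracket", Instance("fra-1", "fra", healthy=True))
catalog.apply_update("sandwich-bracket", Instance("ord-1", "ord", healthy=True))

# Illustrative distances from an edge in Sydney (made-up numbers).
distance_from_syd = {"syd": 1, "ord": 150, "fra": 280}

# Pretend syd-1 has just died and the cache has not heard about it yet.
actually_up = {"fra-1", "ord-1"}
chosen = route(catalog, "sandwich-bracket", distance_from_syd,
               lambda inst: inst.id in actually_up)
print(chosen)
```

The cost of stale data in this design is one failed connection attempt, which is often cheaper than keeping every host's view perfectly synchronized.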

Common Pitfalls

  1. Relying on Consul for tracking load can lead to inefficiencies and increased traffic due to its design limitations.
     The article emphasizes that using Consul for load tracking is not ideal because it can result in excessive data refreshes and high traffic, suggesting that simpler alternatives like NATS are preferable.
  2. Attempting to maintain global consistency across distributed services can complicate architecture and lead to performance issues.
     The article points out that striving for perfect consistency may not be practical, and instead, focusing on resilient routing strategies can yield better performance outcomes.

Related Concepts

Service Discovery Mechanisms
Distributed Systems Architecture
Load Balancing Strategies
Consensus Protocols