Seamless Istio Upgrades at Scale

How Airbnb upgrades tens of thousands of pods on dozens of Kubernetes clusters to new Istio versions

Rushy R. Panchal

Overview

This article describes how Airbnb upgrades Istio across tens of thousands of pods on dozens of Kubernetes clusters. It outlines the challenges involved, the architecture Airbnb uses, and the upgrade process that achieves zero downtime.

What You'll Learn

1. How to implement a seamless Istio upgrade process across Kubernetes clusters
2. Why gradual rollouts minimize risk during upgrades
3. How to use a mutation framework like Krispr for workload management

Prerequisites & Requirements

  • Understanding of Kubernetes and service mesh concepts
  • Familiarity with Istio and its architecture (optional)

Key Questions Answered

How does Airbnb ensure zero downtime during Istio upgrades?
Airbnb achieves zero downtime by implementing a canary upgrade model, allowing both the current and new versions of Istiod to run simultaneously. This setup enables workloads to communicate across different Istiod versions without requiring workload owners to coordinate, thus maintaining high availability.
What are the main challenges faced during Istio upgrades at Airbnb?
The main challenges include coordinating thousands of workloads independently, minimizing risks through gradual rollouts, and ensuring that upgrades can be rolled back without requiring coordination from all workload teams. These challenges necessitate a robust upgrade process.
What is the role of Krispr in the upgrade process?
Krispr is an in-house mutation framework that injects the Istio revision label into Kubernetes workloads. This decouples infrastructure upgrades from workload deployments, allowing for seamless upgrades without requiring individual teams to update their workloads.
How does Airbnb manage Istio upgrades for virtual machines?
For virtual machines, Airbnb deploys an artifact containing istio-proxy and a script that runs istio-iptables. The on-host daemon mxagent manages the installation: it polls key-value tags to determine which version to run, which keeps upgrades atomic and manageable.
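
To make the revision-label idea concrete, here is a minimal sketch of a mutation step that sets a revision label on a workload's pod template. It is not Krispr's actual interface, which the summary does not describe. The function name `inject_revision_label`, the example manifest, and the revision value `1-20-0` are all hypothetical. The sketch also assumes the label in question is Istio's standard `istio.io/rev` label.

```python
import copy

# Istio's standard revision label (assumed to be the label meant here).
REVISION_LABEL = "istio.io/rev"


def inject_revision_label(workload: dict, revision: str) -> dict:
    """Return a copy of a Deployment-like manifest whose pod template
    carries the given Istio revision label.

    Hypothetical helper for illustration only; not Krispr's API.
    """
    mutated = copy.deepcopy(workload)  # leave the caller's manifest untouched
    template = mutated.setdefault("spec", {}).setdefault("template", {})
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels[REVISION_LABEL] = revision
    return mutated


# Hypothetical manifest as a team might write it, with no Istio details.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "example-service"},
    "spec": {"template": {"metadata": {"labels": {"app": "example-service"}}}},
}

mutated = inject_revision_label(deployment, "1-20-0")
print(mutated["spec"]["template"]["metadata"]["labels"])
```

Because the label is added by the mutation step rather than written into the team's manifest, changing the revision later requires no edit from the workload owner.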

Key Statistics & Figures

Total number of Istio upgrades performed: 14
Airbnb has successfully upgraded Istio 14 times since they began using it.
Peak QPS handled by workloads: tens of millions
The workloads running on Airbnb's infrastructure send tens of millions of queries per second at peak.

Key Actionable Insights

1. Implement a canary upgrade model to minimize risk during software upgrades.
   Rolling a new version out gradually lets teams monitor its impact before full deployment, which reduces the risk of widespread issues.
2. Use a mutation framework like Krispr to streamline workload management.
   Decoupling infrastructure upgrades from workload deployments lets workloads pick up the new Istio version without each team updating its own configuration.
3. Establish robust monitoring to track workload health during upgrades.
   Monitoring helps identify regressions early in the upgrade process, so teams can respond quickly and maintain service availability.
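
The summary does not say how workloads are chosen for each stage of a gradual rollout. One common approach, sketched here as an assumption rather than as Airbnb's method, is to hash each workload name into a stable bucket from 0 to 99 and move every workload whose bucket falls below the current rollout percentage. The names `bucket` and `pick_revision`, the workload names, and the revision strings are all hypothetical.

```python
import hashlib


def bucket(workload_name: str) -> int:
    """Map a workload name to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(workload_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def pick_revision(workload_name: str, new_revision: str,
                  old_revision: str, percent_on_new: int) -> str:
    """Choose a revision for a workload at a given rollout percentage.

    A workload moves to the new revision once the percentage exceeds its
    bucket, and stays there as the percentage keeps rising.
    """
    if bucket(workload_name) < percent_on_new:
        return new_revision
    return old_revision


# Hypothetical workloads and revisions, for illustration only.
workloads = [f"service-{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    on_new = sum(
        pick_revision(name, "1-20-0", "1-19-5", percent) == "1-20-0"
        for name in workloads
    )
    print(f"{percent}% rollout: {on_new} of {len(workloads)} workloads on the new revision")
```

Because the bucket depends only on the workload name, raising the percentage never moves a workload back to the old revision, and lowering it to 0 rolls everything back without coordinating with workload owners.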

Common Pitfalls

1. Relying on individual teams to update workload configurations can lead to inconsistent upgrade statuses.
   This can result in some workloads running outdated versions while others are on the latest, complicating the upgrade process and increasing the risk of issues.

Related Concepts

Service Mesh Architecture
Canary Deployment Strategies
Kubernetes Workload Management
Infrastructure as Code