Reliably Upgrading Apache Airflow at Slack’s Scale

Apache Airflow is a tool for describing, executing, and monitoring workflows. At Slack, we use Airflow to orchestrate and manage our data warehouse workflows, which include product and business metrics and also support various engineering use cases (e.g. search and offline indexing). For two years we’ve been running Airflow 1.8, and it was time for…

Ashwin Shankar
11 min read · intermediate

Overview

This article details Slack's experience upgrading Apache Airflow from version 1.8 to 1.10, focusing on the challenges faced and the strategies employed to ensure a smooth transition without impacting their extensive data workflows. Key points include the importance of reliability, fast rollback, minimized downtime, and preserving historical data during the upgrade process.

What You'll Learn

1. How to upgrade Apache Airflow while ensuring minimal downtime
2. Why preserving historical data is crucial during an upgrade
3. How to implement a fast rollback strategy for database upgrades

Prerequisites & Requirements

  • Understanding of Apache Airflow and its architecture
  • Experience with database management and schema upgrades

Key Questions Answered

What were the main requirements for upgrading Apache Airflow at Slack?
The main requirements were keeping the Airflow scheduler and webserver reliable, being able to roll back quickly, minimizing downtime during the upgrade, and preserving historical metadata for previous runs. These factors were crucial to maintaining the integrity of workflows and meeting service level agreements.

What upgrade strategies were considered for Apache Airflow?
Two strategies were considered: a Red-Black upgrade, which runs the old and new versions side by side, and a Big-Bang upgrade, which moves all DAGs to the new version at once. The Red-Black upgrade was deemed infeasible due to database sharing issues, so the Big-Bang upgrade was chosen.

What issues were encountered during the Airflow 1.10 upgrade?
Issues included the removal of the adhoc attribute from task objects, which required consolidating the affected tasks into new DAGs, and compatibility problems with the Presto operator caused by changes in the Python future package. UI issues also arose, affecting data quality when marking tasks as successful.
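
The fast-rollback requirement described above can be illustrated with a minimal sketch. It is not Slack's procedure: it assumes a file-based SQLite database as a stand-in for the Airflow metadata database, and the names upgrade_with_rollback, add_hypothetical_column, and hypothetical_new_column are invented for illustration. The idea is to snapshot the database before the schema change, run the migration, check health, and restore the snapshot if anything looks wrong, so historical run metadata survives a failed upgrade.

```python
import shutil
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path
from typing import Callable, List


def run_sql(db_path: Path, sql: str) -> list:
    """Run one statement against the database and return any rows."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back the transaction on error
            return conn.execute(sql).fetchall()


def column_names(db_path: Path, table: str) -> List[str]:
    """Column names of a table, read from SQLite's table_info pragma."""
    return [row[1] for row in run_sql(db_path, f"PRAGMA table_info({table})")]


def upgrade_with_rollback(
    db_path: Path,
    migrate: Callable[[Path], None],
    healthy: Callable[[Path], bool],
) -> bool:
    """Snapshot, migrate, verify; restore the snapshot if the upgrade fails."""
    backup = db_path.parent / (db_path.name + ".pre_upgrade")
    shutil.copy2(db_path, backup)  # snapshot taken before any schema change
    try:
        migrate(db_path)
        if healthy(db_path):
            return True
    except sqlite3.Error:
        pass  # fall through to the restore below
    shutil.copy2(backup, db_path)  # fast rollback: put the old database back
    return False


def add_hypothetical_column(db_path: Path) -> None:
    """Stand-in for a real migration; the column name is made up."""
    run_sql(db_path, "ALTER TABLE task_instance ADD COLUMN hypothetical_new_column TEXT")


def demo(healthy: Callable[[Path], bool]):
    """Build a tiny metadata table, attempt the upgrade, report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        db = Path(tmp) / "metadata.db"
        run_sql(db, "CREATE TABLE task_instance (task_id TEXT, state TEXT)")
        run_sql(db, "INSERT INTO task_instance VALUES ('load_metrics', 'success')")
        ok = upgrade_with_rollback(db, add_hypothetical_column, healthy)
        cols = column_names(db, "task_instance")
        rows = run_sql(db, "SELECT task_id, state FROM task_instance")
        return ok, cols, rows
```

Calling demo(lambda path: False) simulates a failed health check: the old schema and the historical row are restored. A production metadata database would more likely rely on the database engine's own snapshot or backup tooling than on a file copy.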

Key Statistics & Figures

Daily records processed: 700 billion
This statistic highlights the scale at which Slack operates, emphasizing the need for a reliable upgrade process.

Daily active users: 12 million
This figure illustrates the extensive user base that relies on Slack's data workflows.

Key Actionable Insights

1. Implement a comprehensive runbook for upgrades to ensure all steps are followed accurately.
   A runbook can help prevent mistakes during the upgrade process by providing clear, step-by-step instructions, which is especially important in complex environments like Slack's.
2. Regularly upgrade systems to avoid large, disruptive changes.
   Frequent upgrades can help mitigate risks associated with major version changes, making it easier to manage dependencies and maintain system stability.
3. Enhance testing environments by increasing the number of development DAGs.
   Having more DAGs in the development environment can help catch issues earlier in the upgrade process, reducing the likelihood of problems arising after deployment.
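
One way to make a runbook harder to get wrong is to encode it as ordered steps, each paired with a verification check. The sketch below is only an illustration of that idea, not the runbook Slack used: the Step class, run_runbook, and the example step names are all hypothetical. Execution stops at the first step whose check fails, so the operator knows exactly which steps completed and where to begin a rollback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One runbook entry: what to do, and how to confirm it worked."""
    name: str
    action: Callable[[], None]
    verify: Callable[[], bool]


def run_runbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; return (completed step names, failed step or None)."""
    completed: List[str] = []
    for step in steps:
        step.action()
        if not step.verify():
            return completed, step.name  # stop here, later steps never run
        completed.append(step.name)
    return completed, None


def example_run() -> Tuple[List[str], Optional[str]]:
    """Hypothetical upgrade in which the schema step fails its check."""
    state: Dict[str, bool] = {"scheduler_running": True, "snapshot_taken": False}

    def pause_scheduler() -> None:
        state["scheduler_running"] = False

    def take_snapshot() -> None:
        state["snapshot_taken"] = True

    def upgrade_schema() -> None:
        pass  # placeholder: imagine the migration silently doing nothing

    def start_scheduler() -> None:
        state["scheduler_running"] = True

    steps = [
        Step("pause scheduler", pause_scheduler, lambda: not state["scheduler_running"]),
        Step("snapshot database", take_snapshot, lambda: state["snapshot_taken"]),
        Step("upgrade schema", upgrade_schema, lambda: False),  # simulated failure
        Step("start scheduler", start_scheduler, lambda: state["scheduler_running"]),
    ]
    return run_runbook(steps)
```

In example_run, the first two steps complete and the run halts at "upgrade schema", leaving the scheduler paused rather than restarting it against a half-migrated database.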

Common Pitfalls

1. Failing to adequately test the upgrade process can lead to unexpected issues post-deployment.
   Without thorough testing in a development environment, critical issues may arise during or after the upgrade, impacting user workflows and data integrity.
2. Neglecting to communicate upgrade plans to stakeholders can result in confusion and dissatisfaction.
   Clear communication about the upgrade timeline and expected downtime is essential to manage stakeholder expectations and reduce frustration.

Related Concepts

Data Workflow Management
Database Schema Upgrades
Cloud Computing with AWS
Data Processing Frameworks like Apache Spark and Hive