Lessons Learned From Running Apache Airflow at Scale

Shopify shares some of the lessons we learned and solutions we built in order to run Airflow at scale.

Overview

This article covers lessons learned from running Apache Airflow at scale at Shopify. It highlights the key challenges encountered, such as slow file access and resource contention, and presents the solutions implemented to optimize performance and manage workflows effectively.

What You'll Learn

1. How to optimize file access in Airflow using NFS and Google Cloud Storage (GCS)
2. Why implementing a metadata retention policy is crucial for performance
3. How to create a manifest file for tracking DAG ownership
4. When to apply DAG policies to enforce job standards
5. How to implement a randomized schedule interval for DAGs

Prerequisites & Requirements

  • Understanding of Apache Airflow and its architecture
  • Familiarity with Google Cloud Storage and Kubernetes (optional)

Key Questions Answered

How can slow file access impact Airflow performance?
Slow file access can significantly degrade the performance of Airflow by delaying the scheduler's ability to process DAG files. This is especially critical when using cloud storage solutions like Google Cloud Storage, where each file read incurs a request that can become a bottleneck at scale.
What is the impact of metadata volume on Airflow operations?
As metadata accumulates in Airflow, performance degrades, most noticeably in slow Web UI page loads and in upgrades that can take hours. Implementing a metadata retention policy mitigates this by limiting the amount of historical data stored.
How can DAG authors be tracked in a multi-tenant environment?
To trace DAGs back to individual users or teams, a manifest file is created where users register a namespace for their DAGs, including job ownership and source repository information. This helps administrators quickly identify the responsible parties for any issues.
What strategies can be used to manage resource contention in Airflow?
Resource contention can be managed through the use of Airflow pools to limit task concurrency, priority weights to prioritize critical tasks, and isolated worker queues for different environments. These strategies help ensure that workloads do not interfere with each other.
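
As an illustration of the manifest approach described above for tracing DAGs back to their owners, here is a minimal sketch in plain Python. The field names (`prefix`, `owner`, `repository`), the dotted-prefix naming convention, the example entries, and the `PolicyViolation` exception are all assumptions made for this example; the article only states that the manifest records a namespace, job ownership, and source repository information. In a real deployment, a check like this could plausibly run inside an Airflow cluster policy (a `dag_policy` function in `airflow_local_settings.py`), but the sketch uses stand-ins so it runs without Airflow installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    # Hypothetical manifest entry: field names are illustrative only.
    prefix: str      # DAG ids in this namespace start with "<prefix>."
    owner: str       # team or person responsible for the DAGs
    repository: str  # source repository the DAGs come from


class PolicyViolation(Exception):
    """Stand-in for the error a DAG policy would raise to reject a DAG."""


def find_namespace(dag_id: str, manifest: list[Namespace]) -> Namespace:
    """Return the registered namespace that owns dag_id, or raise."""
    matches = [ns for ns in manifest if dag_id.startswith(ns.prefix + ".")]
    if not matches:
        raise PolicyViolation(f"DAG {dag_id!r} is not in any registered namespace")
    # Prefer the longest matching prefix so nested namespaces resolve
    # to the most specific owner.
    return max(matches, key=lambda ns: len(ns.prefix))


# Hypothetical manifest contents, for illustration only.
MANIFEST = [
    Namespace("marketing", "marketing-data@example.com", "example-org/marketing-dags"),
    Namespace("marketing.ads", "ads-team@example.com", "example-org/ads-dags"),
]
```

With this layout, an administrator investigating a misbehaving DAG can call `find_namespace(dag_id, MANIFEST)` to get the owner and repository, and an unregistered DAG id is rejected outright.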

Key Statistics & Figures

DAGs running in the largest environment
over 10,000
This number reflects the scale at which Shopify operates Airflow, highlighting the need for efficient management strategies.
Average tasks running at a given moment
over 400
This statistic indicates the real-time load on the Airflow system, emphasizing the importance of performance optimization.
Daily runs executed
over 150,000
This figure showcases the high throughput of workflows managed by Airflow at Shopify.
Metadata retention policy duration
28 days
This retention period was chosen to balance historical data needs with performance considerations.
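
To make the 28-day retention policy concrete, here is a minimal sketch of a cleanup routine. It is not Shopify's implementation: the `runs` table, its columns, and the batching strategy are assumptions chosen so the example can run against an in-memory SQLite database. Airflow's real metadata schema has many more tables and relationships.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION = timedelta(days=28)  # retention period quoted in the article

# Hypothetical, simplified stand-in for a metadata table.
# finished_at is stored as a Unix timestamp in seconds.
SCHEMA = "CREATE TABLE runs (id INTEGER PRIMARY KEY, finished_at REAL)"


def purge_old_runs(conn: sqlite3.Connection, now: datetime, batch_size: int = 1000) -> int:
    """Delete rows older than the retention window; return the number deleted."""
    cutoff = (now - RETENTION).timestamp()
    deleted = 0
    while True:
        # Delete in small batches so each transaction stays short.
        cur = conn.execute(
            "DELETE FROM runs WHERE id IN ("
            "SELECT id FROM runs WHERE finished_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount
```

Deleting in batches is a design choice of this sketch rather than something the article specifies; it keeps individual transactions small, which tends to matter when the table is large and the database is serving live traffic.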

Technologies & Tools

Orchestration Platform
Apache Airflow
Used for development, scheduling, and monitoring of workflows.
Cloud Storage
Google Cloud Storage
Utilized for storing DAGs in the Airflow environment.
Container Orchestration
Kubernetes
Hosts the Airflow deployment and manages scalability.
Database
MySQL
Stores metadata for Airflow operations.
File System
NFS
Improves file access performance by serving as a read-write volume for Airflow.

Key Actionable Insights

1. Combine Google Cloud Storage with NFS to speed up file access in Airflow.
Google Cloud Storage remains a stable external source of truth, while the NFS read-write volume gives Airflow fast access to DAG files and allows quick modifications, which is crucial for maintaining performance in a high-throughput environment.
2. Establish a metadata retention policy to prevent performance degradation in Airflow.
By limiting the retention period to 28 days, you can maintain sufficient historical data for incident management while keeping database load manageable.
3. Create a manifest file for tracking DAG ownership to improve accountability.
This practice is especially valuable in multi-tenant settings, allowing administrators to quickly identify and resolve issues related to specific DAGs.
4. Use DAG policies to enforce standards and limitations on jobs.
These policies help maintain control over what users can do within Airflow, which is essential for preventing conflicts and ensuring system stability.
5. Adopt a randomized schedule interval for automatically generated DAGs to smooth out traffic.
This approach reduces the risk of traffic surges that can overwhelm the scheduler and external services, and it improves resource utilization.
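
The randomized schedule idea can be sketched as follows. This is a minimal example under stated assumptions, not the article's code: it assumes the offset is derived from a hash of the DAG id, so that each DAG gets a schedule that looks random across the fleet but stays the same every time the DAG file is parsed. The function name `staggered_hourly_cron` and the choice of an hourly cadence are hypothetical.

```python
import hashlib


def staggered_hourly_cron(dag_id: str) -> str:
    """Map a DAG id to a stable minute offset and return an hourly cron string."""
    # hashlib is used instead of the built-in hash() because hash() of a str
    # is salted per process and would change the schedule on every restart.
    digest = hashlib.sha256(dag_id.encode("utf-8")).digest()
    minute = int.from_bytes(digest[:8], "big") % 60
    return f"{minute} * * * *"
```

A generated DAG could pass `staggered_hourly_cron(dag_id)` as its schedule instead of a fixed `0 * * * *`, so that hourly jobs start throughout the hour rather than all at minute zero.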

Common Pitfalls

1. Failing to manage the volume of metadata can lead to significant performance issues in Airflow.
As metadata accumulates, it slows down operations such as Web UI loading and upgrades. Implementing a retention policy is crucial to avoid this.
2. Using absolute schedule intervals can create surges in traffic that overwhelm the system.
This happens when many DAGs are scheduled to start at the same moment. A randomized schedule distributes the load more evenly.

Related Concepts

Apache Airflow Architecture
Cloud Storage Solutions
Kubernetes Deployment Strategies
Metadata Management In Databases