Apache Flink® on Kubernetes

Ran Zhang

Airbnb’s use of a new Flink platform evolved from Apache Hadoop® Yarn

Airbnb

11 min read, advanced

Tags: Apache, Apache Spark, AWS, Kubernetes

Overview

This article covers how Airbnb migrated its Apache Flink stream processing platform from Hadoop Yarn to Kubernetes. It traces the evolution of the Flink architecture, the challenges of the transition, and the benefits of adopting Kubernetes: a better developer experience, higher job availability, and improved cost efficiency.

What You'll Learn

1. How to deploy Apache Flink on Kubernetes for improved scalability
2. Why migrating from Hadoop Yarn to Kubernetes enhances job availability
3. How to implement a lightweight job scheduler for Flink jobs

Prerequisites & Requirements

Understanding of Apache Flink and Kubernetes concepts
Familiarity with CI/CD systems (optional)

Key Questions Answered

How does the migration from Hadoop Yarn to Kubernetes benefit Flink jobs?

Migrating from Hadoop Yarn to Kubernetes lets Flink jobs be deployed directly on the cluster, which improves scalability and job availability and reduces latency. The Kubernetes integration also simplifies job management and enables features such as efficient autoscaling, improving the overall developer experience.
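To make "direct deployment" concrete, here is a minimal sketch of the Kubernetes objects one Flink job might map to: one Deployment for the JobManager and one for the TaskManagers. It is an illustration under stated assumptions, not Airbnb's actual setup. The function name `flink_manifests`, the labels, the image name, and the job class are hypothetical; the `standalone-job` and `taskmanager` arguments are the ones accepted by the entrypoint of the official Flink Docker image.

```python
import json


def flink_manifests(job_name, image, job_class, task_managers=2):
    """Build Deployment manifests for one Flink job (hypothetical helper)."""
    labels = {"app": "flink", "job": job_name}

    def deployment(component, replicas, args, ports):
        # Each component gets its own labels so selectors do not overlap.
        component_labels = {**labels, "component": component}
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {
                "name": f"{job_name}-{component}",
                "labels": component_labels,
            },
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": component_labels},
                "template": {
                    "metadata": {"labels": component_labels},
                    "spec": {
                        "containers": [{
                            "name": component,
                            "image": image,
                            "args": args,
                            "ports": [{"containerPort": p} for p in ports],
                        }]
                    },
                },
            },
        }

    return [
        # One JobManager that runs the job's main class in application mode.
        deployment("jobmanager", 1,
                   ["standalone-job", "--job-classname", job_class],
                   [6123, 8081]),
        # Scaling the job means changing the TaskManager replica count.
        deployment("taskmanager", task_managers, ["taskmanager"], [6122]),
    ]


if __name__ == "__main__":
    # Image and class names below are placeholders, not real artifacts.
    manifests = flink_manifests("clicks", "example/flink-job:1.0",
                                "com.example.ClickJob", task_managers=3)
    print(json.dumps(manifests, indent=2))
```

The sketch leaves out the Service and the Flink configuration (for example the JobManager address) that a working cluster needs; it only shows that a job becomes a small set of ordinary Kubernetes objects whose replica count can be changed to scale it.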

What challenges did Airbnb face during the migration to Kubernetes?

During the migration, Airbnb faced challenges such as job restart issues due to node rotations, lack of CI/CD integration, and complexities in service discovery and monitoring. These challenges were addressed by implementing a lightweight job scheduler and leveraging Kubernetes features for better resource management.

What are the advantages of using a lightweight job scheduler for Flink jobs?

The lightweight job scheduler improved turnaround time and reduced downtime during job restarts. It also eliminated single points of failure associated with Zookeeper, allowing for more reliable job management and faster recovery from failures.
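The summary does not describe the scheduler's internals, so the following is only a generic sketch of the idea: a reconcile pass that compares the jobs that should be running against what the cluster reports and resubmits anything that is not running. `FakeCluster`, `reconcile`, and the state names are hypothetical stand-ins; a real scheduler would query Kubernetes instead of an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for the cluster API (an assumption for this sketch)."""
    states: dict = field(default_factory=dict)       # job name -> state string
    submissions: list = field(default_factory=list)  # record of submit calls

    def state(self, job):
        # Jobs the cluster has never seen are reported as "MISSING".
        return self.states.get(job, "MISSING")

    def submit(self, job):
        self.submissions.append(job)
        self.states[job] = "RUNNING"


def reconcile(cluster, desired_jobs):
    """One scheduler pass: (re)submit every desired job that is not running."""
    restarted = []
    for job in desired_jobs:
        if cluster.state(job) != "RUNNING":
            cluster.submit(job)
            restarted.append(job)
    return restarted


if __name__ == "__main__":
    cluster = FakeCluster(states={"a": "RUNNING", "b": "FAILED"})
    print(reconcile(cluster, ["a", "b", "c"]))
```

Running such a pass on a short interval keeps restart latency bounded by the interval rather than by an external workflow system's scheduling delay, which is the kind of turnaround improvement the answer above refers to.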

How does Flink on Kubernetes handle secrets management?

Flink on Kubernetes allows each job to securely store its own secrets within the pods. This approach enhances security by providing a dedicated and isolated environment for managing sensitive information, unlike the previous setup on Hadoop Yarn.
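The summary does not say which mechanism holds the per-job secrets. One common option on Kubernetes is a Secret object scoped to the job and exposed to its pods as environment variables; the sketch below assumes that approach. The helper names `job_secret` and `secret_env`, the naming scheme, and the sample key are hypothetical.

```python
import base64


def job_secret(job_name, values):
    """Build a Secret manifest that belongs to a single Flink job."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"{job_name}-secrets", "labels": {"job": job_name}},
        "type": "Opaque",
        # Kubernetes expects the values under `data` to be base64-encoded.
        "data": {
            key: base64.b64encode(value.encode()).decode()
            for key, value in values.items()
        },
    }


def secret_env(job_name, key, env_name):
    """Container env entry that reads one key of the job's own Secret."""
    return {
        "name": env_name,
        "valueFrom": {
            "secretKeyRef": {"name": f"{job_name}-secrets", "key": key}
        },
    }


if __name__ == "__main__":
    # Placeholder job name and value, for illustration only.
    print(job_secret("clicks", {"api-token": "not-a-real-token"}))
    print(secret_env("clicks", "api-token", "API_TOKEN"))
```

Because each Secret is named after its job, only that job's pods reference it, which matches the isolation described above. Note that base64 is an encoding, not encryption; access control on the Secret is what protects the value.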

Key Statistics & Figures

Job onboarding time

Hours instead of days

Developers noted that onboarding Flink jobs is now significantly faster, allowing them to focus more on application logic.

Technologies & Tools

Stream Processing

Apache Flink

Used as the primary stream processing platform at Airbnb.

Orchestration

Kubernetes

Used for deploying and managing Flink jobs.

Workflow Management

Apache Airflow

Previously used for job scheduling before transitioning to a lightweight scheduler.

Storage

Amazon S3

Used for storing Flink job dependencies and checkpoints.

Coordination

Apache Zookeeper

Used for managing metadata and job states in the earlier architecture.

Key Actionable Insights

1. Adopt Kubernetes for deploying Apache Flink to enhance scalability and job management.
   Kubernetes simplifies the deployment process and allows for features like autoscaling, which can significantly improve the efficiency of Flink jobs.

2. Implement a lightweight job scheduler to reduce downtime and improve job recovery times.
   This approach addresses the limitations of using Apache Airflow, particularly in low-latency scenarios, ensuring that jobs can be restarted quickly without significant delays.

3. Utilize CI/CD practices to streamline Flink job deployments and version control.
   Integrating Flink with existing CI/CD systems can improve developer velocity by enabling faster onboarding and deployment processes.

Common Pitfalls

1. Relying on a single job scheduler can lead to bottlenecks and delays in job execution.
   This was a significant issue with Apache Airflow, which caused delays in job start and failure recovery, leading to SLA violations.

2. Neglecting the importance of CI/CD integration can hinder deployment efficiency.
   Without proper CI/CD practices, developers had to manage their own version control strategies, which complicated the deployment process.

Related Concepts

Stream Processing With Apache Flink

Kubernetes Orchestration

CI/CD Practices For Data Applications

Lightweight Job Scheduling Techniques