Building Jetflow: a framework for flexible, performant data pipelines at Cloudflare

Harry Hough

Cloudflare

11 min read • advanced

Golang, Transformers, YAML

Overview

The article discusses the development of Jetflow, a framework designed by Cloudflare's Business Intelligence team to manage complex data ingestion tasks efficiently. It highlights the challenges faced with existing ELT solutions and how Jetflow significantly improves performance and extensibility for handling petabyte-scale data pipelines.

What You'll Learn

1. How to build a flexible data ingestion framework using modular design principles
2. Why using Arrow as an internal data format can optimize performance in data pipelines
3. How to achieve over 100x efficiency improvements in data processing tasks

Prerequisites & Requirements

Understanding of data ingestion processes and ELT frameworks
Familiarity with data storage solutions like ClickHouse and PostgreSQL (optional)

Key Questions Answered

What performance improvements does Jetflow provide over existing ELT solutions?

Jetflow reports an efficiency improvement of over 100x when runtime and memory are considered together: processing 19 billion rows went from 48 hours to 5.5 hours, while memory usage fell from 300 GB to 4 GB. Runtime alone improved by a little under 10x, so most of the gain comes from the smaller memory footprint.

How does Jetflow ensure extensibility for various data sources?

Jetflow's modular design allows it to easily integrate with multiple data sources such as ClickHouse, PostgreSQL, and various SaaS APIs. This flexibility supports the addition of new use cases without disrupting existing workflows.
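
To make the idea of a modular design concrete, here is a minimal Go sketch of a pipeline assembled from interchangeable stages. The names Record, Source, Transformer, Sink and Pipeline are hypothetical and chosen for illustration; they are not Jetflow's actual types, which this summary does not describe.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a hypothetical unit of data passed between stages.
type Record map[string]string

// Source, Transformer and Sink are hypothetical stage interfaces.
// Supporting a new data source means adding one more Source implementation.
type Source interface {
	Read() ([]Record, error)
}

type Transformer interface {
	Apply(Record) (Record, error)
}

type Sink interface {
	Write([]Record) error
}

// Pipeline wires one source, any number of transformers and one sink.
type Pipeline struct {
	Source       Source
	Transformers []Transformer
	Sink         Sink
}

// Run reads all records, applies each transformer in order, then writes.
func (p Pipeline) Run() error {
	records, err := p.Source.Read()
	if err != nil {
		return fmt.Errorf("read: %w", err)
	}
	out := make([]Record, 0, len(records))
	for i, rec := range records {
		for _, t := range p.Transformers {
			if rec, err = t.Apply(rec); err != nil {
				return fmt.Errorf("transform record %d: %w", i, err)
			}
		}
		out = append(out, rec)
	}
	if err := p.Sink.Write(out); err != nil {
		return fmt.Errorf("write: %w", err)
	}
	return nil
}

// staticSource stands in for a database or SaaS API.
type staticSource struct{ rows []Record }

func (s staticSource) Read() ([]Record, error) { return s.rows, nil }

// upperCase upper-cases one named field and leaves its input untouched.
type upperCase struct{ field string }

func (u upperCase) Apply(r Record) (Record, error) {
	out := make(Record, len(r))
	for k, v := range r {
		out[k] = v
	}
	if v, ok := out[u.field]; ok {
		out[u.field] = strings.ToUpper(v)
	}
	return out, nil
}

// memorySink collects records in memory instead of loading a warehouse.
type memorySink struct{ rows []Record }

func (m *memorySink) Write(rs []Record) error {
	m.rows = append(m.rows, rs...)
	return nil
}

func main() {
	sink := &memorySink{}
	p := Pipeline{
		Source:       staticSource{rows: []Record{{"colo": "lhr"}, {"colo": "sfo"}}},
		Transformers: []Transformer{upperCase{field: "colo"}},
		Sink:         sink,
	}
	if err := p.Run(); err != nil {
		fmt.Println("pipeline failed:", err)
		return
	}
	fmt.Println(len(sink.rows), sink.rows[0]["colo"])
}
```

Because Pipeline depends only on the three interfaces, a new source or destination can be added without touching existing stages, which is the property the answer above attributes to Jetflow.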

What are the key requirements for building Jetflow?

The key requirements for Jetflow include performance efficiency, backwards compatibility, ease of use, customizability, and testability. These ensure that it meets the complex needs of data ingestion while remaining user-friendly and adaptable.

What challenges did Cloudflare face with their previous ELT solutions?

Cloudflare's existing ELT solutions could not handle the increasing complexity and volume of data, which included ingesting 141 billion rows daily. They needed a more efficient and scalable solution, leading to the development of Jetflow.

Key Statistics & Figures

Daily rows ingested

141 billion

This is the total number of rows ingested daily by Cloudflare using Jetflow.

Efficiency improvement

Over 100x

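This figure reflects runtime and memory together. Processing time for 19 billion rows dropped from 48 hours to 5.5 hours, which on its own is a little under 10x; the rest of the gain comes from the lower memory footprint.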

Memory usage reduction

From 300 GB to 4 GB

This significant reduction in memory usage was achieved while processing large datasets.

Row ingestion speed

2-5 million rows per second

This is the new ingestion speed per database connection, compared to the previous rate of 60,000 to 80,000 rows per second.
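
As a sanity check on how the runtime and memory figures above relate to the "over 100x" claim, the ratios can be worked out directly. Treating efficiency as runtime multiplied by memory is an assumption made here for illustration; the summary does not state which measure was used.

```latex
\[
\frac{48\ \text{h}}{5.5\ \text{h}} \approx 8.7,
\qquad
\frac{300\ \text{GB}}{4\ \text{GB}} = 75,
\qquad
\frac{48\ \text{h} \times 300\ \text{GB}}{5.5\ \text{h} \times 4\ \text{GB}}
  = \frac{14400}{22} \approx 655
\]
```

Under that assumption the combined improvement is several hundred times, which is consistent with "over 100x", while the runtime improvement alone is roughly 8.7x.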

Technologies & Tools

Database

ClickHouse

Used as a source for data ingestion in Jetflow.

Database

PostgreSQL

Another source for data ingestion, optimized through the jackc/pgx driver.

Data Format

Arrow

Used as an internal data format to optimize performance and reduce serialization overhead.

Key Actionable Insights

1. Implement modular design principles in your data ingestion frameworks to enhance flexibility and scalability.
   Modular designs allow for easier integration of new data sources and functionalities, which is crucial as data requirements evolve.

2. Consider using Arrow as an internal data format to reduce serialization overhead and improve performance.
   Arrow's in-memory columnar format minimizes data conversion steps, leading to faster processing times and lower memory usage.

3. Focus on optimizing database drivers for specific use cases to maximize data ingestion speeds.
   As demonstrated with ClickHouse, using a specialized driver can dramatically enhance performance compared to generic solutions.
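
The second and third insights both come down to avoiding per-value conversion work. The Go sketch below contrasts a row-at-a-time path, where every value is boxed as `any`, with a columnar batch that holds one typed slice per column. Batch, toBatch, rowSum and columnSum are hypothetical names, and the code only illustrates the layout idea; it does not use the Arrow library or any real database driver.

```go
package main

import "fmt"

// Batch is a hypothetical columnar batch: one typed slice per column.
// It only illustrates the layout; it is not the Arrow library's API.
type Batch struct {
	IDs   []int64
	Bytes []int64
	Hosts []string
}

// toBatch converts boxed rows into typed columns once, at the edge of the
// pipeline, so later stages never type-assert individual values.
func toBatch(rows [][]any) (Batch, error) {
	b := Batch{
		IDs:   make([]int64, 0, len(rows)),
		Bytes: make([]int64, 0, len(rows)),
		Hosts: make([]string, 0, len(rows)),
	}
	for i, row := range rows {
		if len(row) != 3 {
			return Batch{}, fmt.Errorf("row %d: want 3 values, got %d", i, len(row))
		}
		id, okID := row[0].(int64)
		n, okBytes := row[1].(int64)
		host, okHost := row[2].(string)
		if !okID || !okBytes || !okHost {
			return Batch{}, fmt.Errorf("row %d: unexpected value types", i)
		}
		b.IDs = append(b.IDs, id)
		b.Bytes = append(b.Bytes, n)
		b.Hosts = append(b.Hosts, host)
	}
	return b, nil
}

// rowSum models the generic path: each value is checked and unboxed
// every time a stage touches it.
func rowSum(rows [][]any, col int) (int64, error) {
	var total int64
	for i, row := range rows {
		if col < 0 || col >= len(row) {
			return 0, fmt.Errorf("row %d: no column %d", i, col)
		}
		v, ok := row[col].(int64)
		if !ok {
			return 0, fmt.Errorf("row %d: column %d is not int64", i, col)
		}
		total += v
	}
	return total, nil
}

// columnSum works on a typed, contiguous slice with no per-value checks.
func columnSum(col []int64) int64 {
	var total int64
	for _, v := range col {
		total += v
	}
	return total
}

func main() {
	rows := [][]any{
		{int64(1), int64(100), "a.example"},
		{int64(2), int64(250), "b.example"},
	}
	slow, err := rowSum(rows, 1)
	if err != nil {
		fmt.Println("row path failed:", err)
		return
	}
	batch, err := toBatch(rows)
	if err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	fmt.Println(slow, columnSum(batch.Bytes))
}
```

Both paths compute the same sum. The difference is that the columnar path pays the type-checking cost once during conversion, and a driver that produces typed columns directly would avoid even that step.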

Common Pitfalls

1. Relying on generic database drivers can lead to suboptimal performance in data ingestion tasks.
   Generic drivers often lack optimizations for specific use cases, resulting in higher overhead and slower processing speeds.

2. Failing to account for data schema compatibility can complicate data ingestion processes.
   Incompatibilities can arise when merging disparate schemas, making it essential to design for flexibility and compatibility from the outset.
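
One way to guard against the schema pitfall is to compare schemas before a job runs. The sketch below is a minimal Go example under stated assumptions: Schema is a hypothetical map from column name to type name, and the rule that removed or retyped columns are breaking while added columns are not is an illustrative policy, not one taken from the article.

```go
package main

import (
	"fmt"
	"sort"
)

// Schema is a hypothetical mapping from column name to type name.
type Schema map[string]string

// Incompatibilities lists the reasons next cannot replace prev without
// breaking readers of prev: removed columns and changed types.
// Columns that exist only in next are treated as compatible additions.
func Incompatibilities(prev, next Schema) []string {
	var problems []string
	for name, oldType := range prev {
		newType, ok := next[name]
		switch {
		case !ok:
			problems = append(problems, fmt.Sprintf("column %q was removed", name))
		case newType != oldType:
			problems = append(problems,
				fmt.Sprintf("column %q changed type from %s to %s", name, oldType, newType))
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(problems)
	return problems
}

func main() {
	prev := Schema{"id": "int64", "host": "string"}
	next := Schema{"id": "string", "ts": "timestamp"}
	for _, p := range Incompatibilities(prev, next) {
		fmt.Println(p)
	}
}
```

Running a check like this before ingestion turns a silent mismatch into an explicit error list that can fail the job early.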

Related Concepts

Data Ingestion Frameworks

Extract, Load, Transform (ELT) Processes

Performance Optimization Techniques

Modular Design in Software Development