Supercharging your large ClickHouse data loads - Part 3: Making a large data load resilient

Tom Schreiber

ClickHouse

•

Tom Schreiber

•15 min read•beginner•

--

•View Original

ApacheApache KafkaAWSAzure

Overview

This article discusses strategies for making large data loads into ClickHouse resilient and efficient, particularly when migrating from other systems. It introduces ClickPipes, a managed solution for data ingestion, and ClickLoad, a script for loading large datasets incrementally and reliably.

What You'll Learn

1

How to use ClickPipes for resilient data ingestion into ClickHouse

2

How to implement the ClickLoad script for incremental data loading

3

Why staging tables are essential for ensuring data integrity during loads

Prerequisites & Requirements

Basic understanding of data loading processes and ClickHouse
Familiarity with object storage services like AWS S3 or Google Cloud Storage(optional)

Key Questions Answered

What is ClickPipes and how does it improve data loading?

ClickPipes is a fully managed integration solution in ClickHouse Cloud that supports continuous, fast, resilient, and scalable data ingestion from external systems. It automatically retries data transfers in case of failures, ensuring that large datasets can be loaded without interruptions.

How does ClickLoad ensure reliable data loading?

ClickLoad orchestrates data loading by splitting files into repeatable tasks, using staging tables to manage data integrity. If a load fails, it can retry without duplicating data, ensuring each row is stored exactly once in the target table.

What are the benefits of using staging tables during data loads?

Staging tables allow for efficient data management during loads, enabling easy truncation and retry of failed tasks without affecting the target table. This approach guarantees that each row is stored exactly once, preventing data duplication.

What file sizes are recommended for optimal performance with ClickLoad?

It is recommended to use moderately sized files between 50 and 150MB for optimal performance with ClickLoad. This size ensures efficient retries and minimizes the impact of failures during data loading.

Key Statistics & Figures

Data loading throughput

10 million rows per second

This throughput indicates that 36 billion rows can be loaded per hour and 864 billion rows per day.

Ingest throughput increase

from 4 million rows/second to 8 million rows/second

This increase was observed when doubling the number of ClickHouse servers from 3 to 6 while loading a 600+ billion row dataset.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database

Clickhouse

Used for storing and querying large datasets efficiently.

Data Streaming

Apache Kafka

Supported by ClickPipes for resilient data ingestion.

Object Storage

AWS S3

Recommended for exporting data before loading into ClickHouse.

Object Storage

Google Cloud Storage

Another option for exporting data before loading into ClickHouse.

Key Actionable Insights

1
Utilize ClickPipes for seamless data ingestion into ClickHouse Cloud to minimize downtime during large data loads.
ClickPipes provides built-in resiliency and automatic retries, making it ideal for continuous data ingestion without interruptions.

2
Implement the ClickLoad script for environments not supported by ClickPipes to ensure reliable data loading.
ClickLoad allows for incremental data loading and can handle large datasets efficiently, ensuring that data integrity is maintained even during failures.

3
Adopt a staging table approach to manage data integrity and simplify error handling during data loads.
Using staging tables allows for easy retries and ensures that data duplication is avoided, which is crucial when dealing with large datasets.

Common Pitfalls

1

Starting data loads from scratch after a failure can lead to significant delays and frustration.

Without a managed solution, users often truncate the target table and restart the load, which increases the risk of further failures and wasted time.

2

Not using moderately sized files can complicate the retry process during data loads.

Loading very large files can lead to longer recovery times if a failure occurs, making it harder to manage data integrity.

Related Concepts

Data Migration Strategies

Resilient Data Architectures

Data Ingestion Best Practices