Supercharging your large ClickHouse data loads - Part 3: Making a large data load resilient

Tom Schreiber
15 min readbeginner
--
View Original

Overview

This article discusses strategies for making large data loads into ClickHouse resilient and efficient, particularly when migrating from other systems. It introduces ClickPipes, a managed solution for data ingestion, and ClickLoad, a script for loading large datasets incrementally and reliably.

What You'll Learn

1

How to use ClickPipes for resilient data ingestion into ClickHouse

2

How to implement the ClickLoad script for incremental data loading

3

Why staging tables are essential for ensuring data integrity during loads

Prerequisites & Requirements

  • Basic understanding of data loading processes and ClickHouse
  • Familiarity with object storage services like AWS S3 or Google Cloud Storage(optional)

Key Questions Answered

What is ClickPipes and how does it improve data loading?
ClickPipes is a fully managed integration solution in ClickHouse Cloud that supports continuous, fast, resilient, and scalable data ingestion from external systems. It automatically retries data transfers in case of failures, ensuring that large datasets can be loaded without interruptions.
How does ClickLoad ensure reliable data loading?
ClickLoad orchestrates data loading by splitting files into repeatable tasks, using staging tables to manage data integrity. If a load fails, it can retry without duplicating data, ensuring each row is stored exactly once in the target table.
What are the benefits of using staging tables during data loads?
Staging tables allow for efficient data management during loads, enabling easy truncation and retry of failed tasks without affecting the target table. This approach guarantees that each row is stored exactly once, preventing data duplication.
What file sizes are recommended for optimal performance with ClickLoad?
It is recommended to use moderately sized files between 50 and 150MB for optimal performance with ClickLoad. This size ensures efficient retries and minimizes the impact of failures during data loading.

Key Statistics & Figures

Data loading throughput
10 million rows per second
This throughput indicates that 36 billion rows can be loaded per hour and 864 billion rows per day.
Ingest throughput increase
from 4 million rows/second to 8 million rows/second
This increase was observed when doubling the number of ClickHouse servers from 3 to 6 while loading a 600+ billion row dataset.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database
Clickhouse
Used for storing and querying large datasets efficiently.
Data Streaming
Apache Kafka
Supported by ClickPipes for resilient data ingestion.
Object Storage
AWS S3
Recommended for exporting data before loading into ClickHouse.
Object Storage
Google Cloud Storage
Another option for exporting data before loading into ClickHouse.

Key Actionable Insights

1
Utilize ClickPipes for seamless data ingestion into ClickHouse Cloud to minimize downtime during large data loads.
ClickPipes provides built-in resiliency and automatic retries, making it ideal for continuous data ingestion without interruptions.
2
Implement the ClickLoad script for environments not supported by ClickPipes to ensure reliable data loading.
ClickLoad allows for incremental data loading and can handle large datasets efficiently, ensuring that data integrity is maintained even during failures.
3
Adopt a staging table approach to manage data integrity and simplify error handling during data loads.
Using staging tables allows for easy retries and ensures that data duplication is avoided, which is crucial when dealing with large datasets.

Common Pitfalls

1
Starting data loads from scratch after a failure can lead to significant delays and frustration.
Without a managed solution, users often truncate the target table and restart the load, which increases the risk of further failures and wasted time.
2
Not using moderately sized files can complicate the retry process during data loads.
Loading very large files can lead to longer recovery times if a failure occurs, making it harder to manage data integrity.

Related Concepts

Data Migration Strategies
Resilient Data Architectures
Data Ingestion Best Practices