How we scaled raw GROUP BY to 100 B+ rows in under a second

Tom Schreiber
27 min readbeginner
--
View Original

Overview

The article discusses how ClickHouse Cloud has achieved the capability to scale complex GROUP BY queries across thousands of cores, processing over 100 billion rows in under a second. This is made possible through a feature called parallel replicas, which allows for infinite horizontal query scaling without data reshuffling.

What You'll Learn

1

How to utilize ClickHouse's parallel replicas for efficient query processing

2

Why GROUP BY operations are critical for analytics performance

3

When to scale horizontally versus vertically in ClickHouse

Key Questions Answered

How does ClickHouse achieve sub-second aggregation for 100 billion rows?
ClickHouse achieves sub-second aggregation for 100 billion rows through its parallel replicas feature, which allows a single query to utilize all cores across multiple nodes. This results in a runtime of just 414 milliseconds and a throughput of 241.83 billion rows per second.
What is the significance of GROUP BY in analytics?
GROUP BY is central to analytics as it powers the majority of analytical queries. A recent study found that over half of queries contained a GROUP BY, highlighting its importance in generating insights from large datasets.
What are the benefits of using parallel replicas in ClickHouse?
Parallel replicas in ClickHouse allow for infinite horizontal query scaling, enabling queries to run faster as more nodes are added. This feature eliminates the need for data reshuffling and simplifies the scaling process, making it easier to achieve interactive speeds on large datasets.

Key Statistics & Figures

Runtime for 100 billion rows aggregation
414 ms
This is the time taken to process 100 billion rows using ClickHouse with parallel replicas enabled.
Throughput during aggregation
241.83 billion rows/s
This throughput was achieved while processing the 100 billion rows in under half a second.
Rows processed in a single query
100 billion
This number highlights the scale at which ClickHouse can operate effectively.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Leverage ClickHouse's parallel replicas for large-scale data processing to achieve near-instantaneous query results.
This is particularly useful for organizations dealing with massive datasets, as it allows for real-time analytics without the need for complex data management strategies.
2
Understand the importance of GROUP BY in analytical queries to optimize query performance.
Since GROUP BY operations are foundational to analytics, ensuring they are efficient can significantly enhance overall data processing speed and responsiveness.
3
Consider horizontal scaling with parallel replicas for workloads that exceed single-node capabilities.
When dealing with large datasets, horizontal scaling allows for better resource utilization and faster query execution, especially when data volumes grow beyond what a single node can handle.

Common Pitfalls

1
Overlooking the need for parallel replicas in large-scale queries can lead to suboptimal performance.
Without enabling parallel replicas, queries on large datasets may run significantly slower, as they would only utilize the resources of a single node.

Related Concepts

Parallel Processing
Massively Parallel Processing (mpp)
Columnar Storage
Data Aggregation Techniques