ClickHouse Release 25.3

The ClickHouse Team

Overview

ClickHouse 25.3 ships 18 new features, 13 performance optimizations, and 48 bug fixes. Highlights include support for querying AWS Glue and Unity catalogs, a new query condition cache, automatic parallelization for S3 queries, a production-ready JSON data type, and new array functions.

What You'll Learn

1. How to query Apache Iceberg tables using AWS Glue in ClickHouse
2. Why ClickHouse's new JSON data type outperforms traditional JSON stores
3. How to implement query condition caching for improved performance
4. When to use automatic parallelization for querying external data in ClickHouse
5. How to utilize the new array functions like arraySymmetricDifference

Key Questions Answered

How can I query Apache Iceberg tables using AWS Glue in ClickHouse?
To query Apache Iceberg tables via AWS Glue, first create a database engine with the DataLakeCatalog and specify the catalog type as 'glue'. You can then run queries on the tables created in the Glue catalog.
What are the benefits of ClickHouse's new JSON data type?
According to the release announcement, the new JSON data type is thousands of times faster than traditional JSON stores such as MongoDB and stores data more compactly than compressed files. It also supports dynamic JSON paths without forcing their values into a least common type.
What is the query condition cache in ClickHouse and how does it work?
The query condition cache accelerates repeated queries with selective WHERE clauses that do not benefit from the primary index. It caches the scan results from the first query and reuses them in subsequent queries with the same conditions, significantly speeding up execution.
How does automatic parallelization improve querying external data in ClickHouse?
Automatic parallelization lets ClickHouse distribute the workload across multiple nodes when querying external data, such as files in S3, which can cut query execution time substantially. In the release's example, querying the data directly took 64.902 seconds, while the parallelized query took 16.689 seconds.
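
For the AWS Glue answer above, the statement might look like the sketch below. It is wrapped in a small Python helper only so the SQL can be handed to whichever client you use. The database name, region, and credential placeholders are hypothetical, and the setting names (`catalog_type`, `region`, `aws_access_key_id`, `aws_secret_access_key`) are assumptions that should be checked against the DataLakeCatalog documentation for your ClickHouse version.

```python
def glue_catalog_ddl(database: str, region: str, access_key: str, secret_key: str) -> str:
    """Build a CREATE DATABASE statement for a Glue-backed DataLakeCatalog.

    Setting names are assumptions; verify them for your server version.
    Values are interpolated without escaping, so pass trusted input only.
    """
    return (
        f"CREATE DATABASE {database}\n"
        "ENGINE = DataLakeCatalog\n"
        "SETTINGS\n"
        "    catalog_type = 'glue',\n"
        f"    region = '{region}',\n"
        f"    aws_access_key_id = '{access_key}',\n"
        f"    aws_secret_access_key = '{secret_key}'"
    )


# Hypothetical values, for illustration only.
ddl = glue_catalog_ddl("glue_demo", "us-west-2", "<access-key>", "<secret-key>")
print(ddl)
```

Once such a database exists, the tables registered in the Glue catalog can be queried from it with ordinary SELECT statements.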

Key Statistics & Figures

Number of new features
18
ClickHouse version 25.3 includes 18 new features that enhance its capabilities.
Performance improvement in query execution
About 3.9x faster
Using automatic parallelization reduced the query execution time from 64.902 seconds to 16.689 seconds.
Number of bug fixes
48
The release also includes 48 bug fixes, improving overall stability and performance.
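
The speedup figure follows directly from the two timings quoted for the S3 example (64.902 seconds and 16.689 seconds). A quick check of the arithmetic, which comes out just under 4x:

```python
direct_seconds = 64.902    # querying the data directly (figure from the text)
parallel_seconds = 16.689  # with automatic parallelization (figure from the text)

# Ratio of the two timings, and the share of wall-clock time saved.
speedup = direct_seconds / parallel_seconds
time_saved_pct = (1 - parallel_seconds / direct_seconds) * 100

print(f"speedup: {speedup:.2f}x")            # speedup: 3.89x
print(f"time saved: {time_saved_pct:.1f}%")  # time saved: 74.3%
```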

Technologies & Tools

Database
ClickHouse
Used as the primary data processing engine with new features and optimizations introduced in version 25.3.
Cloud Service
AWS Glue
Supported as a catalog for querying data lake tables from ClickHouse, enabling integration with lakehouse architectures.
Cloud Storage
S3
Used for storing and querying large datasets in ClickHouse with automatic parallelization.

Key Actionable Insights

1. Implement the new query condition cache to optimize repeated query performance in your ClickHouse applications.
   This cache is particularly useful in scenarios like dashboarding or observability where the same conditions are reused across different queries, leading to faster response times.
2. Leverage the new JSON data type for handling semi-structured data in ClickHouse to achieve better performance and storage efficiency.
   This new implementation allows for faster queries and better compression than traditional JSON stores, making it ideal for applications that require high-speed data processing.
3. Utilize the automatic parallelization feature for querying large datasets stored in S3 to significantly reduce query execution times.
   By distributing the workload across multiple nodes, you can achieve faster data retrieval, which is crucial for performance-sensitive applications.
4. Explore the new array functions like arraySymmetricDifference to simplify your data manipulation tasks.
   These functions can streamline operations that previously required multiple steps, enhancing code readability and maintainability.
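
To make the semantics of arraySymmetricDifference concrete, here is a minimal Python sketch of the two-array case: it returns the distinct elements that appear in exactly one of the inputs. This illustrates the set operation only and is not ClickHouse's implementation. The helper name is hypothetical, and the result is sorted here purely for readability, since no assumption is made about the element order ClickHouse returns.

```python
from typing import List


def array_symmetric_difference(a: List[int], b: List[int]) -> List[int]:
    """Distinct elements found in exactly one of the two lists (sorted)."""
    return sorted(set(a) ^ set(b))


print(array_symmetric_difference([1, 2, 3], [3, 4, 5]))  # [1, 2, 4, 5]
```

In ClickHouse itself the corresponding call would be along the lines of `SELECT arraySymmetricDifference([1, 2, 3], [3, 4, 5])`.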

Common Pitfalls

1. Failing to utilize the query condition cache can lead to slower performance in applications that run similar queries repeatedly.
   Without caching, each query execution will require full data scans, which can be inefficient and time-consuming, especially in dashboarding scenarios.
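
The cost of rescanning can be illustrated with a toy model of a condition cache. The sketch below assumes the cache records, per condition, which blocks of data contained no matching rows, so a repeated query with the same condition can skip them. It is a simplified illustration, not ClickHouse's implementation; the class name, block layout, and string cache key are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


class ConditionCache:
    """Toy cache: remembers which blocks held no rows matching a condition."""

    def __init__(self) -> None:
        self._no_match: Dict[str, Set[int]] = {}

    def scan(
        self,
        blocks: List[List[int]],
        key: str,
        predicate: Callable[[int], bool],
    ) -> Tuple[List[int], int]:
        """Return (matching rows, number of blocks actually read)."""
        skip = self._no_match.setdefault(key, set())
        rows: List[int] = []
        blocks_read = 0
        for block_id, block in enumerate(blocks):
            if block_id in skip:
                continue  # known from an earlier scan to hold no matches
            blocks_read += 1
            hits = [v for v in block if predicate(v)]
            if hits:
                rows.extend(hits)
            else:
                skip.add(block_id)  # remember that this block can be skipped
        return rows, blocks_read


blocks = [[1, 2, 3], [40, 50, 60], [7, 8, 9], [70, 80, 90]]
cache = ConditionCache()
first = cache.scan(blocks, "value > 45", lambda v: v > 45)
second = cache.scan(blocks, "value > 45", lambda v: v > 45)
print(first)   # ([50, 60, 70, 80, 90], 4)
print(second)  # ([50, 60, 70, 80, 90], 2)
```

The first scan reads all four blocks; the repeat reads only the two that contained matches, which is the effect a dashboard issuing the same filter repeatedly would benefit from.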

Related Concepts

AWS Glue Integration
JSON Data Handling
Performance Optimization Techniques in ClickHouse