Inside ClickHouse full-text search: fast, native, and columnar

Jimmy Aguilar, Elmi Ahmadov, and Robert Schulze
22 min read · beginner

Overview

This article discusses the new full-text search implementation in ClickHouse, highlighting its performance improvements and integration with the columnar database architecture. It covers the transition from legacy systems to a native inverted index, explaining the underlying data structures and optimizations that enhance search speed and efficiency.

What You'll Learn

  1. How to implement the new full-text search in ClickHouse
  2. Why inverted indexes improve search performance in large datasets
  3. When to use different tokenizers for specific data types
  4. How to optimize full-text search queries for better performance

Prerequisites & Requirements

  • Understanding of full-text search concepts and database indexing
  • Familiarity with ClickHouse and its SQL syntax (optional)

Key Questions Answered

How does ClickHouse's new full-text search implementation enhance performance?
The new full-text search in ClickHouse uses a native inverted index, which supports fast token lookups and stores posting lists compactly. Compared with the older implementations that relied on bloom filters, this design cuts search time substantially and can return results from massive datasets in milliseconds.
What are the advantages of using inverted indexes over bloom filters?
Inverted indexes provide a direct mapping from terms to documents, eliminating false positives and offering greater versatility in query expressions. Unlike bloom filters, which require manual tuning and can lead to inefficiencies, inverted indexes streamline the search process and enhance performance across various query types.
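As a minimal sketch of that direct mapping, the Python below builds a toy in-memory inverted index. It is not ClickHouse's on-disk format, and the `tokenize` and `build_inverted_index` helpers are illustrative names, not part of any ClickHouse API. Each token maps to the sorted list of row ids that contain it, so a lookup never returns a row that lacks the token.

```python
from collections import defaultdict

def tokenize(text):
    # Toy tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()

def build_inverted_index(rows):
    # Map each token to the sorted list of row ids containing it (its posting list).
    postings = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in tokenize(text):
            postings[token].add(row_id)
    return {token: sorted(ids) for token, ids in postings.items()}

rows = [
    "disk full on host a",
    "login failed for user bob",
    "disk error on host b",
]
index = build_inverted_index(rows)
print(index["disk"])             # posting list for "disk"
print(index.get("timeout", []))  # unknown token: empty list, no candidate rows to re-check
```

A bloom filter, by contrast, can only answer "possibly present" for a block of rows, so matching blocks still have to be scanned to confirm the hit.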
What is the role of Roaring bitmaps in ClickHouse's full-text search?
Roaring bitmaps are used to store posting lists efficiently, allowing for fast set operations like intersections and unions. This modern format enhances the performance of queries that involve multiple tokens, making it easier to compute results quickly without excessive memory usage.
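The set operations mentioned above can be sketched as follows. Plain Python `set` objects stand in for Roaring bitmaps here; a real Roaring bitmap is a compressed structure, but it supports the same intersection and union operations. The posting lists and helper names are made up for illustration.

```python
# Posting lists as sets of row ids (plain sets stand in for Roaring bitmaps).
postings = {
    "disk": {0, 2, 5, 9},
    "error": {2, 3, 9},
    "timeout": {4, 9},
}

def rows_with_all(tokens):
    # AND semantics: intersect the posting lists of all tokens.
    lists = [postings.get(t, set()) for t in tokens]
    return set.intersection(*lists) if lists else set()

def rows_with_any(tokens):
    # OR semantics: union the posting lists of all tokens.
    return set().union(*(postings.get(t, set()) for t in tokens))

print(sorted(rows_with_all(["disk", "error"])))
print(sorted(rows_with_any(["error", "timeout"])))
```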
How do new search functions like searchAny and searchAll improve usability?
The searchAny and searchAll functions allow for more intuitive querying by eliminating the need for special character separators around tokens. They utilize the tokenizer defined in the index, making it easier to perform searches across multiple tokens without unexpected results.
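To illustrate the any/all semantics, here is a small Python sketch that applies one tokenizer to both the stored text and the search string. The `search_any` and `search_all` functions below are hypothetical stand-ins written for this example; they are not the ClickHouse functions and do not reflect their exact signatures. The tokenizer rule (split on non-alphanumeric characters) is likewise an assumption.

```python
import re

def tokenize(text):
    # Stand-in for the tokenizer configured on the index (an assumption):
    # lowercase, then split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def search_any(document, needle):
    # True if the document contains at least one token of the needle.
    doc_tokens = set(tokenize(document))
    return any(t in doc_tokens for t in tokenize(needle))

def search_all(document, needle):
    # True if the document contains every token of the needle.
    doc_tokens = set(tokenize(document))
    return all(t in doc_tokens for t in tokenize(needle))

doc = "Connection reset by peer (host=db-01)"
print(search_any(doc, "timeout reset"))
print(search_all(doc, "reset peer"))
print(search_all(doc, "reset timeout"))
```

Because the needle goes through the same tokenizer as the indexed text, the caller does not need to add separators around each token by hand.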

Key Statistics & Figures

  • Performance improvement for frequent terms: 90%+ reduction in query time. This improvement is achieved by filtering directly using the index without needing to read the text column.
  • Storage footprint reduction for posting lists: up to 30%. This is achieved through the implementation of PFOR compression on top of delta encoding.
  • Average size reduction for FSTs: 10%. FSTs are now Zstd-compressed when written to disk, lowering their average size.
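The idea behind PFOR on top of delta encoding, noted above for posting lists, can be sketched in Python. This is a simplified illustration under stated assumptions, not ClickHouse's implementation: a real PFOR codec packs the small values into fixed-width bit fields, while this sketch keeps them as Python integers and only shows the split into a packed area and a list of exceptions. All function names and the sample ids are hypothetical.

```python
def delta_encode(sorted_ids):
    # Store the first id, then the gaps between consecutive ids.
    if not sorted_ids:
        return []
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]

def delta_decode(deltas):
    # Rebuild the original ids with a running sum.
    ids, total = [], 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids

def pfor_split(values, bit_width):
    # Values that fit in bit_width bits stay in the packed area;
    # larger ones are recorded as (position, value) exceptions.
    limit = 1 << bit_width
    packed = [v if v < limit else 0 for v in values]
    exceptions = [(i, v) for i, v in enumerate(values) if v >= limit]
    return packed, exceptions

def pfor_merge(packed, exceptions):
    # Patch the exceptions back into their positions.
    values = list(packed)
    for i, v in exceptions:
        values[i] = v
    return values

ids = [1000, 1003, 1004, 1010, 5000, 5002]
deltas = delta_encode(ids)
packed, exceptions = pfor_split(deltas, bit_width=3)
print(deltas, packed, exceptions)
print(delta_decode(pfor_merge(packed, exceptions)) == ids)
```

Sorted row ids tend to produce small gaps, so most deltas fit in a few bits and only the occasional large jump needs to be stored as an exception.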

Key Actionable Insights

  1. Implement the new full-text search feature in ClickHouse to leverage its performance benefits.
     By using the native inverted index, you can significantly reduce search times in large datasets, making your applications more responsive and efficient.
  2. Utilize the new split tokenizer for structured data formats like CSV to ensure accurate tokenization.
     This tokenizer allows for precise control over token boundaries, which is crucial when dealing with semi-structured data, ensuring that your search queries return the expected results.
  3. Take advantage of Roaring bitmaps for efficient storage and fast operations on posting lists.
     This modern approach to storing sets of integers can greatly enhance the performance of complex queries involving multiple tokens, making your data retrieval processes more efficient.
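To show what control over token boundaries means for the split tokenizer recommended above for CSV-like data, the following Python sketch splits a line on a caller-supplied list of separator strings. It only mimics the general idea; the function name, the separator list, and the sample line are assumptions for illustration, and the actual tokenizer parameters should be taken from the ClickHouse documentation.

```python
import re

def split_tokenize(text, separators):
    # Split on any of the given separator strings and drop empty tokens.
    # Longer separators are tried first so ", " wins over ",".
    if not separators:
        return [text] if text else []
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return [t for t in re.split(pattern, text) if t]

line = "alice, 42;berlin"
print(split_tokenize(line, [", ", ";"]))
print(split_tokenize(line, [","]))
```

With separators chosen to match the data format, each field becomes one token; with only `","` as the separator, the second token keeps its leading space and the `;` stays inside it, which is the kind of mismatch that makes searches miss.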

Common Pitfalls

  1. Relying on bloom filters for full-text search can lead to inefficiencies.
     Bloom filters require manual tuning and can produce false positives, which may necessitate additional row scans, reducing overall search performance.

Related Concepts

Full-text Search Optimization
Inverted Indexing
Tokenization Techniques
Database Performance Tuning