Improve logs compression with log clustering

Lionel Palacin
17 min read · intermediate

Overview

This article discusses how to enhance log compression through log clustering techniques in ClickHouse, focusing on transforming unstructured logs into structured data for efficient storage. It highlights the challenges of dealing with custom application logs and presents a detailed implementation using the Drain3 tool for pattern extraction.

What You'll Learn

1. How to automate log clustering in ClickHouse for improved compression
2. Why structured logs enhance query performance and storage efficiency
3. How to use Drain3 for extracting log templates from unstructured logs

Prerequisites & Requirements

  • Understanding of log formats and compression techniques
  • Familiarity with ClickHouse and, optionally, Python

Key Questions Answered

What is log clustering and how does it work?
Log clustering is a technique that groups similar log lines based on their structure and content, allowing for the identification of recurring patterns in unstructured logs. This process helps automate the extraction of meaningful information, making it easier to store logs efficiently in a columnar format.
How can Drain3 be used to extract log templates?
Drain3 is a Python package designed for mining log templates from streams of log messages. It can identify patterns in logs, allowing users to automate the transformation of unstructured logs into structured data, which can then be stored efficiently in databases like ClickHouse.
What are the benefits of structuring logs?
Structuring logs improves compression and enhances query performance by allowing repetitive log entries to be stored in a more efficient format. This leads to faster troubleshooting and better detection of unusual patterns in log data.
What compression ratios were achieved with structured logs?
The article reports a compression ratio of 22x for structured logs compared to 18x for raw logs, indicating a modest improvement in storage efficiency. However, the article also notes that achieving higher compression ratios is challenging with less consistent application logs.
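The clustering idea described in these answers can be illustrated with a deliberately simplified sketch. It is not Drain3's algorithm (Drain3 mines templates with a parse tree) and it is not the article's implementation; it only masks obviously variable tokens with regular expressions and groups lines by the resulting template. The sample log lines, the mask names, and the helper names `to_template` and `cluster` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical masks for variable tokens. Alternatives are tried left to
# right, so the most specific shape (an IPv4 address) comes first.
TOKEN = re.compile(
    r"(?P<IP>\b\d{1,3}(?:\.\d{1,3}){3}\b)"
    r"|(?P<HEX>\b0x[0-9a-fA-F]+\b)"
    r"|(?P<NUM>\b\d+\b)"
)


def to_template(line):
    """Return (template, params): variable tokens replaced by placeholders."""
    params = []

    def mask(match):
        params.append(match.group(0))
        return f"<{match.lastgroup}>"

    return TOKEN.sub(mask, line), params


def cluster(lines):
    """Group lines by template; each cluster keeps only the extracted params."""
    clusters = defaultdict(list)
    for line in lines:
        template, params = to_template(line)
        clusters[template].append(params)
    return dict(clusters)


# Illustrative log lines (invented for this sketch, not from the article).
sample_lines = [
    "Connection from 10.0.0.1 port 5432 closed after 120 ms",
    "Connection from 10.0.0.2 port 5433 closed after 98 ms",
    "Cache miss for key 0x1f3a",
]

clusters = cluster(sample_lines)
for template, rows in clusters.items():
    print(template, rows)
```

Each template is stored once, and the extracted parameters become columns of similar values, which is the property that lets a columnar store compress structured logs better than raw text.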

Key Statistics & Figures

Compression ratio for structured logs
22x
Achieved after structuring the logs; the raw logs compressed at 18x.
Uncompressed size of raw logs
37.67 GiB
This is the total size before any compression was applied.
Compressed size of structured logs
1.71 GiB
This reflects the size after applying log clustering and structuring techniques.
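The figures above can be cross-checked with a short calculation. This sketch assumes the 22x ratio is the raw uncompressed size divided by the compressed size of the structured logs. The compressed size of the raw logs is not given in this summary, so the value below is derived from the reported 18x ratio and is only approximate.

```python
def compression_ratio(uncompressed_gib, compressed_gib):
    """Compression ratio = uncompressed size / compressed size."""
    return uncompressed_gib / compressed_gib


# Figures reported in the summary.
RAW_UNCOMPRESSED_GIB = 37.67
STRUCTURED_COMPRESSED_GIB = 1.71
RAW_RATIO = 18.0  # reported as a rounded value

structured_ratio = compression_ratio(RAW_UNCOMPRESSED_GIB, STRUCTURED_COMPRESSED_GIB)

# Derived, not reported: compressed size of the raw logs implied by 18x.
implied_raw_compressed_gib = RAW_UNCOMPRESSED_GIB / RAW_RATIO

# Fraction of compressed storage saved by structuring, under the same assumption.
saving = 1 - STRUCTURED_COMPRESSED_GIB / implied_raw_compressed_gib

print(f"structured ratio: {structured_ratio:.2f}x")
print(f"implied raw compressed size: {implied_raw_compressed_gib:.2f} GiB")
print(f"storage saved by structuring: {saving:.1%}")
```

Under these assumptions the structured ratio comes out at roughly 22x, matching the reported figure, and the saving in compressed storage is a little under one fifth, which is consistent with the article calling the improvement modest.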

Technologies & Tools

Database
ClickHouse
Used for storing and querying logs efficiently.
Tool
Drain3
A Python package for mining log templates from unstructured log data.

Key Actionable Insights

1
Implement log clustering to enhance the efficiency of log storage in your observability stack.
By automating the transformation of unstructured logs into structured formats, you can improve compression (the article reports 22x for structured logs versus 18x for raw logs) and query performance, making log analysis more effective.
2
Utilize Drain3 for mining log templates to streamline log processing workflows.
This tool can help identify patterns in logs, allowing for better organization and storage of log data, which is crucial for maintaining performance in large-scale applications.
3
Consider creating separate tables for different services to optimize compression further.
By tailoring the data types and sorting keys for each service's logs, you can achieve better compression ratios and improve query performance.
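The third insight, one table per service with its own data types and sorting key, can be sketched as a small helper that renders ClickHouse DDL. The table name, the column names, and the choice of sorting key below are hypothetical and are not taken from the article; the column types and the `MergeTree` engine with an `ORDER BY` clause are standard ClickHouse features.

```python
def create_table_sql(table, columns, order_by):
    """Render a ClickHouse CREATE TABLE statement for one service's logs.

    columns maps column name -> ClickHouse type; order_by lists the sorting key.
    """
    unknown = [col for col in order_by if col not in columns]
    if unknown:
        raise ValueError(f"sorting key uses unknown columns: {unknown}")
    column_defs = ",\n".join(f"    `{name}` {ctype}" for name, ctype in columns.items())
    sorting_key = ", ".join(order_by)
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n"
        f"(\n{column_defs}\n)\n"
        f"ENGINE = MergeTree\n"
        f"ORDER BY ({sorting_key})"
    )


# Hypothetical table for a web server's access logs.
access_log_sql = create_table_sql(
    "logs_web_access",
    {
        "timestamp": "DateTime",
        "remote_addr": "IPv4",
        "method": "LowCardinality(String)",
        "path": "String",
        "status": "UInt16",
        "bytes_sent": "UInt64",
    },
    # Low-cardinality columns first, so similar rows end up next to each other.
    ["status", "method", "timestamp"],
)
print(access_log_sql)
```

Putting low-cardinality columns first in the sorting key is a common starting point for compression in ClickHouse, but the best key depends on each service's data and query patterns, so treat this ordering as an assumption to test rather than a rule.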

Common Pitfalls

1
Failing to account for unparsed logs can lead to data loss.
It's important to handle logs that do not fit into defined patterns separately to maintain visibility and ensure no critical information is lost.
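One way to avoid this pitfall is to route every line that matches no known pattern into a fallback destination with its raw text intact. The sketch below assumes two made-up patterns and sample lines, and the helper name `route` is hypothetical. In practice the two result lists would correspond to a structured table and a catch-all table.

```python
import re

# Hypothetical patterns; real ones would come from the mined log templates.
PATTERNS = {
    "request": re.compile(
        r"^(?P<method>GET|POST|PUT|DELETE) (?P<path>\S+) (?P<status>\d{3})$"
    ),
    "slow_query": re.compile(r"^Query took (?P<duration_ms>\d+) ms$"),
}


def route(lines):
    """Split lines into parsed rows and unparsed rows; no line is dropped."""
    parsed, unparsed = [], []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                parsed.append({"pattern": name, **match.groupdict()})
                break
        else:
            # No pattern matched: keep the raw message so it stays queryable.
            unparsed.append({"raw": line})
    return parsed, unparsed


# Illustrative input (invented for this sketch).
incoming = [
    "GET /health 200",
    "Query took 350 ms",
    "panic: unexpected nil pointer",
]

parsed_rows, unparsed_rows = route(incoming)
print(parsed_rows)
print(unparsed_rows)
```

Because the two lists together always contain as many rows as there were input lines, comparing those counts is a cheap check that nothing was lost, and a growing share of unparsed rows signals that new patterns need to be mined.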

Related Concepts

Log Compression Techniques
Observability Best Practices
Data Structuring For Performance Optimization