Scaling our Observability platform beyond 100 Petabytes by embracing wide events and replacing OTel

Rory Crispin, Dale McDiarmid
30 min read · Advanced

Overview

The article traces how ClickHouse's observability platform, LogHouse, scaled beyond 100 petabytes of data. It covers the move from OpenTelemetry (OTel) to SysEx, a custom-built solution that copies data directly from ClickHouse's system tables, which cut collector CPU usage from over 800 cores to 70 while ingesting far more logs per second and preserving data fidelity.

What You'll Learn

1
How to efficiently scale observability platforms to handle petabyte-scale data
2
Why custom pipelines can outperform general-purpose solutions like OpenTelemetry in high-volume scenarios
3
How to implement a specialized data transfer system using SysEx for ClickHouse
4
When to use OpenTelemetry versus custom solutions for observability

Key Questions Answered

What are the main challenges faced when scaling observability platforms?
The article identifies inefficiencies and data loss in traditional OpenTelemetry pipelines as key challenges. As data volume surged, the need for specialized tools like SysEx became apparent to maintain performance and fidelity in log ingestion.
How does SysEx improve log ingestion efficiency compared to OpenTelemetry?
SysEx enables a byte-for-byte copy of data from ClickHouse's system tables directly to LogHouse, drastically reducing CPU usage. It handles 37 million logs per second with just 70 CPU cores, compared to OpenTelemetry's requirement of over 800 CPU cores for 2 million logs per second.
What role does HyperDX play in ClickHouse's observability stack?
HyperDX provides a ClickHouse-native UI that enhances log and trace exploration, correlation, and analysis. It simplifies querying with Lucene syntax while still allowing SQL for complex analyses, making it easier for engineers to interact with large datasets.
When is OpenTelemetry still a viable option for observability?
OpenTelemetry remains useful when a service is down and its system tables cannot be read, because it collects the logs the service emits to stdout and stderr. It also provides a standardized format and eases onboarding for new users, even as high-volume ingestion shifts to specialized tools like SysEx.

Key Statistics & Figures

Uncompressed data stored in LogHouse
100 PB
LogHouse has scaled from 19 PiB to 100 PB over the past year.
Rows stored in LogHouse
500 trillion
The number of rows has increased from approximately 40 trillion to 500 trillion.
CPU cores used by OTel collectors
800
OTel collectors required over 800 CPU cores to handle 2 million logs per second.
CPU cores used by LogHouse scrapers (SysEx)
70
SysEx scrapers handle 37 million logs per second with just 70 CPU cores.
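
The two CPU figures above are easier to compare once normalized to throughput per core. The short calculation below is only a back-of-the-envelope sketch: it treats "over 800" cores as exactly 800, so the OTel per-core figure is an upper bound and the resulting ratio is a lower bound. The variable and function names are illustrative and do not come from the article.

```python
# Figures quoted in the statistics above. "Over 800" cores is taken as exactly 800,
# which makes the OTel per-core number an upper bound.
OTEL_LOGS_PER_SEC = 2_000_000
OTEL_CORES = 800
SYSEX_LOGS_PER_SEC = 37_000_000
SYSEX_CORES = 70


def logs_per_core(logs_per_sec: float, cores: int) -> float:
    """Return sustained throughput per CPU core."""
    return logs_per_sec / cores


otel_per_core = logs_per_core(OTEL_LOGS_PER_SEC, OTEL_CORES)
sysex_per_core = logs_per_core(SYSEX_LOGS_PER_SEC, SYSEX_CORES)
ratio = sysex_per_core / otel_per_core

# Cores the OTel pipeline would need at SysEx's volume, assuming linear scaling.
otel_cores_at_sysex_volume = SYSEX_LOGS_PER_SEC / otel_per_core

print(f"OTel:  {otel_per_core:,.0f} logs/s per core")
print(f"SysEx: {sysex_per_core:,.0f} logs/s per core")
print(f"SysEx is at least {ratio:.0f}x more efficient per core")
print(f"OTel at 37M logs/s would need about {otel_cores_at_sysex_volume:,.0f} cores")
```

Under these assumptions the OTel collectors process about 2,500 logs per second per core against roughly 528,571 for the SysEx scrapers, a gap of at least 211x. Scaling the OTel figure linearly to 37 million logs per second would require around 14,800 cores, though real pipelines rarely scale perfectly linearly.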

Key Actionable Insights

1
Transitioning to specialized data ingestion pipelines can significantly enhance performance and reduce resource usage.
As demonstrated with SysEx, moving away from general-purpose solutions like OpenTelemetry can lead to better efficiency and lower costs, especially when dealing with high data volumes.
2
Embrace high cardinality observability by storing wide events instead of pre-aggregating data.
This approach allows for greater flexibility in querying and analysis, enabling engineers to perform detailed investigations without losing fidelity in the data.
3
Utilize tools like HyperDX to improve user experience and data accessibility in observability platforms.
A well-integrated UI can streamline the process of exploring and analyzing large datasets, making it easier for teams to derive insights and respond to incidents.
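
The second insight, storing wide events rather than pre-aggregating, can be illustrated with a toy in-memory example. This is a sketch only: the event fields (`service`, `region`, `pod`, and so on) and the `count_by` helper are hypothetical, and a real deployment would run the equivalent grouping as a query in ClickHouse rather than in Python.

```python
from collections import Counter

# Hypothetical wide events: one row per log line, with every dimension kept.
events = [
    {"service": "api", "region": "us-east", "status": 500, "pod": "api-1", "latency_ms": 912},
    {"service": "api", "region": "us-east", "status": 200, "pod": "api-2", "latency_ms": 35},
    {"service": "api", "region": "eu-west", "status": 500, "pod": "api-7", "latency_ms": 1204},
    {"service": "ingest", "region": "us-east", "status": 200, "pod": "ing-3", "latency_ms": 18},
]

# Pre-aggregation: the grouping is chosen at ingest time, so only
# "errors per service" survives and the pod and region detail is gone.
errors_by_service = Counter(e["service"] for e in events if e["status"] >= 500)


def count_by(rows, *dims, where=lambda row: True):
    """Group raw events by any combination of dimensions at query time."""
    return Counter(tuple(row[d] for d in dims) for row in rows if where(row))


# With the raw events still available, a new question needs no pipeline change.
errors_by_region_and_pod = count_by(
    events, "region", "pod", where=lambda e: e["status"] >= 500
)

print(errors_by_service)
print(errors_by_region_and_pod)
```

The pre-aggregated counter can only say that `api` had two errors. The wide events can additionally attribute them to specific pods and regions, which is the kind of high-cardinality question that pre-aggregation rules out.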

Common Pitfalls

1
Relying solely on general-purpose observability tools can lead to inefficiencies and data loss at scale.
As seen with OpenTelemetry, the overhead from multiple data transformations can degrade performance and result in dropped logs, especially when handling high volumes of data.
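
The cost of repeated transformations can be shown with a small, deliberately simplified sketch. It is not how SysEx or the OTel collector is implemented: JSON stands in for any intermediate format, and both hop functions are hypothetical. It contrasts a hop that decodes and re-encodes every record with one that forwards the bytes untouched.

```python
import json


def transform_hop(payload: bytes) -> bytes:
    """A generic pipeline hop: decode the record, then re-encode it."""
    record = json.loads(payload)
    return json.dumps(record).encode("utf-8")


def passthrough_hop(payload: bytes) -> bytes:
    """A hop that forwards the bytes exactly as received."""
    return payload


# Hypothetical log record with a nanosecond-precision timestamp.
original = b'{"ts": 1700000000.123456789, "msg": "query finished"}'

transformed = transform_hop(original)
forwarded = passthrough_hop(original)

# The round trip parses the timestamp into a 64-bit float, which cannot
# hold all 19 significant digits, so the re-encoded bytes differ.
print("transform preserved bytes:  ", transformed == original)
print("passthrough preserved bytes:", forwarded == original)
```

Besides the CPU spent parsing and serializing at every hop, the decoded path silently changes the data, while the pass-through path is identical byte for byte. That is the general property a direct copy from system tables is meant to preserve.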

Related Concepts

High Cardinality Observability
Data Ingestion Strategies
Performance Optimization Techniques