Fast and Reliable Schema-Agnostic Log Analytics Platform

Chao Wang, Xiaobing Li
20 min read, advanced

Overview

Uber has developed a centralized, schema-agnostic log analytics platform that enhances logging efficiency and reliability. The platform is capable of ingesting millions of logs per second, addressing the challenges faced with the previous ELK stack, including operational costs and performance inefficiencies.

What You'll Learn

  1. How to implement a schema-agnostic logging system using ClickHouse
  2. Why the new platform reduces operational overhead compared with the ELK stack
  3. When to use materialized columns for performance optimization

Prerequisites & Requirements

  • Understanding of logging systems and data ingestion
  • Familiarity with ClickHouse and Kafka (optional)

Key Questions Answered

What challenges did Uber face with the ELK stack?
Uber faced significant challenges with the ELK stack, including operational costs due to running multiple Elasticsearch clusters, type conflicts leading to dropped logs, and performance inefficiencies during aggregation queries. These issues prompted the need for a more scalable and reliable logging solution.
How does the new logging platform improve performance?
The new logging platform improves performance by using ClickHouse, which ingests about 300K logs per second per node, roughly ten times faster than a single Elasticsearch node. It also supports efficient querying and aggregation, reducing query times to just a few seconds for recent logs.
What is the schema-agnostic data model used in the new platform?
The schema-agnostic data model allows logs to be formatted as JSON with evolving schemas. It tracks field types during ingestion, enabling flexible querying without the constraints of a fixed schema, thus enhancing developer productivity.
What are the benefits of using materialized columns in ClickHouse?
Materialized columns in ClickHouse allow for faster query performance by pre-populating frequently accessed fields. This reduces the need for expensive JSON unmarshalling during query execution, significantly speeding up data retrieval.
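
To make the materialized-column idea concrete, here is a minimal sketch that builds the kind of ClickHouse statement involved. The table name `logs`, the raw-JSON column `_source`, and the helper `materialize_field_ddl` are assumptions chosen for illustration; the summary does not describe the platform's actual schema or tooling. `JSONExtractString` and `ADD COLUMN ... MATERIALIZED` are standard ClickHouse features.

```python
import re

# Hypothetical names: the table `logs` and the raw-JSON column `_source`
# are illustrative assumptions, not the platform's actual schema.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def materialize_field_ddl(table: str, field: str, source_col: str = "_source") -> str:
    """Build a ClickHouse statement that adds a MATERIALIZED String column
    computed from one JSON field, so queries can read the column directly
    instead of parsing the JSON document on every read."""
    # Reject anything that is not a plain identifier before interpolating it.
    for name in (table, field, source_col):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"ALTER TABLE {table} "
        f"ADD COLUMN IF NOT EXISTS {field} String "
        f"MATERIALIZED JSONExtractString({source_col}, '{field}')"
    )


print(materialize_field_ddl("logs", "endpoint"))
```

A query such as `SELECT count() FROM logs WHERE endpoint = '/v1/trips'` could then filter on the stored column rather than extracting the field from JSON at query time.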

Key Statistics & Figures

Logs ingested per second per node: 300K
A single ClickHouse node can ingest 300K logs per second, which is about ten times faster than a single Elasticsearch node.
Operational cost reduction: more than half
The new platform has reduced hardware cost by more than half compared to the ELK stack while serving more production traffic.
Ingestion latency: under one minute
The platform keeps ingestion latency under one minute, ensuring timely access to log data.
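
As a rough back-of-the-envelope reading of the per-node figures above, the sketch below estimates the minimum number of ingestion nodes for a target rate. The 3 million logs per second target is a hypothetical value standing in for "millions of logs per second", and the calculation ignores replication, headroom, and query load, so treat it as arithmetic rather than a sizing guide.

```python
import math

LOGS_PER_SEC_PER_CLICKHOUSE_NODE = 300_000  # per-node figure quoted above
CLICKHOUSE_VS_ES_SPEEDUP = 10               # "about ten times faster"


def min_nodes(target_logs_per_sec: int, per_node: int) -> int:
    """Smallest node count whose combined ingest rate covers the target."""
    return math.ceil(target_logs_per_sec / per_node)


# Implied single-node Elasticsearch rate, derived from the two figures above.
es_per_node = LOGS_PER_SEC_PER_CLICKHOUSE_NODE // CLICKHOUSE_VS_ES_SPEEDUP

target = 3_000_000  # hypothetical cluster-wide ingest rate
print(min_nodes(target, LOGS_PER_SEC_PER_CLICKHOUSE_NODE))  # 10
print(min_nodes(target, es_per_node))                       # 100
```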

Key Actionable Insights

  1. Implement a schema-agnostic logging strategy to enhance flexibility in log data handling.
     This approach allows developers to evolve log schemas without the constraints of a fixed structure, thus improving productivity and reducing errors related to schema conflicts.
  2. Utilize ClickHouse for high-performance log ingestion and querying.
     By leveraging ClickHouse, organizations can achieve significant improvements in log processing speeds and query performance, which is critical for real-time analytics.
  3. Adopt materialized columns for frequently queried fields to optimize performance.
     This technique can drastically reduce query times, especially for large datasets, making it easier to diagnose production issues quickly.

Common Pitfalls

  1. Failing to account for schema evolution in logging systems can lead to type conflicts and dropped logs.
     This often occurs when developers assume a fixed schema, which can hinder productivity and result in lost log data. Adopting a schema-agnostic approach mitigates this risk.
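
One way to picture how tracking field types at ingestion avoids type conflicts is the sketch below. It flattens each JSON log into per-type lists of field names and values, so a field that is a number in one log and a string in another lands in different lists instead of clashing. The row layout and the helpers `flatten`, `type_tag`, and `to_row` are assumptions for illustration; the summary does not specify the platform's actual storage format.

```python
import json
from collections import defaultdict


def flatten(obj, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested JSON object."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def type_tag(value):
    """Classify a leaf value by its JSON type."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "other"  # arrays and nulls, kept as serialized JSON below


def to_row(raw_log: str) -> dict:
    """Group a log's fields by type into parallel name/value lists.
    Assumes the top-level JSON value is an object."""
    row = defaultdict(lambda: {"names": [], "values": []})
    for path, value in flatten(json.loads(raw_log)):
        tag = type_tag(value)
        if tag == "other":
            value = json.dumps(value)
        row[tag]["names"].append(path)
        row[tag]["values"].append(value)
    return dict(row)


# "status" is a number in the first log and a string in the second;
# both are accepted because each value is stored under its own type.
a = to_row('{"status": 200, "req": {"path": "/v1/trips"}}')
b = to_row('{"status": "OK"}')
print(a)
print(b)
```

With a fixed schema, the second log would conflict with the numeric `status` mapping and could be dropped; here both rows are kept and the type is resolved when the field is queried.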

Related Concepts

Log Analytics
Data Ingestion Strategies
Real-time Data Processing
Distributed Systems Architecture