Evolution of Data Structures in Yandex.Metrica

Alexey Milovidov

ClickHouse

Overview

The article discusses the evolution of data structures in Yandex.Metrica, detailing the transition from MyISAM tables to LSM-trees and ultimately to the column-oriented database ClickHouse. It highlights the challenges faced in data storage organization and the improvements in performance and flexibility achieved through these transitions.

What You'll Learn

1. How to transition from MyISAM to a more efficient data structure for analytics
2. Why ClickHouse is a suitable choice for handling large datasets in real-time analytics
3. When to use LSM-trees for write-intensive workloads

Prerequisites & Requirements

Understanding of database indexing and data storage concepts
Familiarity with SQL and database management systems (optional)

Key Questions Answered

What were the limitations of using MyISAM for Yandex.Metrica?

MyISAM suffered from slow reads because of poor data locality (rows were laid out on disk in effectively random order), from operational drawbacks such as slow replication, and from consistency and recovery issues. These limitations necessitated a shift to more efficient data storage solutions.

How did Yandex.Metrica improve performance with Metrage?

The transition to Metrage, which implements LSM-trees, resulted in significant performance improvements, with page-title reports loading in 0.8 seconds compared to 26 seconds previously. This was achieved through better data locality and efficient compression.
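To make the LSM-tree idea concrete, here is a minimal sketch in Python. It is not Metrage's actual implementation: the class name `TinyLSM`, the `memtable_limit` parameter, and the `(counter_id, date)` key shape are all assumptions made for illustration. Writes go to an in-memory buffer, the buffer is flushed as a sorted immutable run, and compaction merges runs so that keys close in sort order end up stored together.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: an in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # recent writes, not yet sorted
        self.runs = []       # sorted lists of (key, value), oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the buffer out as one sorted run (sequential I/O on a real disk).
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest run first, so the latest write for a key wins.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def scan(self, lo, hi):
        """Return {key: value} for lo <= key < hi, newest value winning."""
        result = {}
        for run in self.runs:  # oldest to newest
            start = bisect.bisect_left(run, (lo,))
            end = bisect.bisect_left(run, (hi,))
            result.update(run[start:end])  # one contiguous slice per run
        result.update({k: v for k, v in self.memtable.items() if lo <= k < hi})
        return dict(sorted(result.items()))

    def compact(self):
        # Merge all runs into a single sorted run; newer values overwrite older.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

In this sketch a range read such as `scan((1, ""), (2, ""))` touches one contiguous slice per run, which is the locality property the answer above refers to. Compression and on-disk layout are left out entirely.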

Why was ClickHouse developed for Yandex.Metrica?

ClickHouse was developed to handle large datasets efficiently, allowing real-time analytics over non-aggregated data. It offers high query performance and linear scalability, and is capable of processing over 2 terabytes of data per second.

What are the advantages of using ClickHouse over traditional databases?

ClickHouse offers superior performance, processing queries 2.8-3.4 times faster than Vertica, and supports SQL with extensions for web analytics. Its ability to scale and handle large datasets makes it ideal for Yandex.Metrica's needs.
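The article describes ClickHouse as column-oriented. The Python sketch below illustrates, under simplified assumptions, why that layout tends to suit analytic queries: a query that needs two fields reads only two columns instead of every field of every record. The field names (`counter_id`, `url`, `duration_ms`, `browser`) and the "values read" counter are hypothetical and say nothing about ClickHouse's real storage format.

```python
# Row-oriented layout: each event is one record holding all of its fields.
ROWS = [
    {"counter_id": 1, "url": "/a", "duration_ms": 120, "browser": "Firefox"},
    {"counter_id": 2, "url": "/b", "duration_ms": 300, "browser": "Chrome"},
    {"counter_id": 1, "url": "/c", "duration_ms": 80, "browser": "Chrome"},
]


def to_columns(rows):
    """Column-oriented layout: one list per field (assumes uniform records)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_duration_rows(rows, counter_id):
    """Sum durations for one counter, counting every field value loaded."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # a row store loads the whole record
        if row["counter_id"] == counter_id:
            total += row["duration_ms"]
    return total, values_read


def total_duration_columns(columns, counter_id):
    """Same query, but only the two referenced columns are read."""
    ids = columns["counter_id"]
    durations = columns["duration_ms"]
    values_read = len(ids) + len(durations)
    total = sum(d for c, d in zip(ids, durations) if c == counter_id)
    return total, values_read
```

Both functions return the same total; the difference is the number of values touched, and the gap widens as tables gain more columns.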

Key Statistics & Figures

Rows stored in MyISAM tables as of 2011

580 billion

This highlights the scale of data Yandex.Metrica was managing before transitioning to Metrage.

Rows stored in Metrage as of 2015

3.37 trillion

This shows the significant growth in data storage requirements that led to the adoption of more efficient data structures.

Query processing time for Metrage

average = 6 ms, 90th percentile = 31 ms, 99th percentile = 334 ms

These metrics demonstrate the performance improvements achieved with Metrage compared to previous systems.

Number of servers in Yandex.Metrica's main cluster

426

This indicates the scale at which ClickHouse operates to handle analytics for Yandex.Metrica.

Technologies & Tools

Database

MyISAM

Initially used for storing statistics in Yandex.Metrica.

Database

Metrage

An implementation of LSM-trees for handling write-intensive workloads.

Database

ClickHouse

Developed for real-time analytics and handling large datasets efficiently.

Key Actionable Insights

1. Consider transitioning to a column-oriented database like ClickHouse for large-scale analytics.
   If your application requires real-time analytics and handles massive datasets, ClickHouse's performance and scalability can significantly enhance user experience and operational efficiency.

2. Utilize LSM-trees for write-heavy workloads to optimize data ingestion.
   When dealing with high-frequency event data, LSM-trees can improve write performance and reduce latency, making them suitable for applications like web analytics.

3. Prioritize data locality in your database design to enhance read performance.
   Understanding how data is accessed and stored can lead to better performance optimizations, especially in systems that require frequent read operations.
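The data locality point can be illustrated with a small simulation. This Python sketch models a disk as a list of rows split into fixed-size blocks and counts how many blocks must be read to fetch all rows for one counter. The block size, event counts, and function names are assumptions chosen for illustration, not figures from the article.

```python
import random

BLOCK_SIZE = 100  # rows per simulated disk block (hypothetical)


def make_events(n_counters=50, per_counter=100, seed=0):
    """Generate (counter_id, timestamp) rows in interleaved arrival order."""
    rng = random.Random(seed)
    events = [(c, t) for c in range(n_counters) for t in range(per_counter)]
    rng.shuffle(events)  # events from different counters arrive mixed together
    return events


def blocks_touched(rows, counter_id, block_size=BLOCK_SIZE):
    """Count distinct blocks holding at least one row for counter_id."""
    return len({i // block_size
                for i, row in enumerate(rows)
                if row[0] == counter_id})


arrival_order = make_events()
sorted_by_key = sorted(arrival_order)  # clustered by (counter_id, timestamp)
```

With these parameters, `blocks_touched(sorted_by_key, 7)` is 1, because all rows for one counter sit next to each other, while the same call on `arrival_order` touches many blocks. On spinning disks each extra block roughly corresponds to an extra seek.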

Common Pitfalls

1. Failing to account for data locality can lead to poor read performance.
   When data is not stored in a way that optimizes for how it will be accessed, it can result in increased latency and slower query times, especially in analytics applications.

2. Over-aggregating data can lead to unnecessary complexity and storage bloat.
   Pre-aggregating data without understanding user needs can result in wasted resources and a system that is difficult to maintain, as users may not utilize all the aggregated reports.
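The trade-off between pre-aggregated and non-aggregated data can be sketched as follows. This is an illustrative Python example, not code from Yandex.Metrica: the event fields and function names are hypothetical. A pre-aggregated table answers exactly one fixed report, while keeping raw events lets any grouping and filter be computed on demand.

```python
from collections import Counter

EVENTS = [
    {"counter_id": 1, "browser": "Firefox", "country": "RU"},
    {"counter_id": 1, "browser": "Firefox", "country": "TR"},
    {"counter_id": 1, "browser": "Chrome", "country": "RU"},
    {"counter_id": 2, "browser": "Firefox", "country": "RU"},
]


def preaggregate_by_browser(events):
    """Fixed report: hits per (counter_id, browser). Other dimensions are lost."""
    table = Counter()
    for e in events:
        table[(e["counter_id"], e["browser"])] += 1
    return table


def ad_hoc_report(events, group_by, where=None):
    """Aggregate raw events on the fly for any grouping and optional filter."""
    result = Counter()
    for e in events:
        if where is None or where(e):
            result[tuple(e[field] for field in group_by)] += 1
    return result
```

The pre-aggregated table can say that counter 1 had two Firefox hits, but it cannot break them down by country. `ad_hoc_report(EVENTS, ["country"], where=lambda e: e["browser"] == "Firefox")` can, at the cost of scanning raw events at query time, which is the cost a fast column-oriented store is meant to absorb.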

Related Concepts

Data Storage Optimization Techniques

Performance Tuning For Databases

Real-time Analytics Frameworks