Overview
The article discusses the evolution of data structures in Yandex.Metrica, detailing the transition from MyISAM tables to LSM-trees and ultimately to the column-oriented database ClickHouse. It highlights the challenges faced in data storage organization and the improvements in performance and flexibility achieved through these transitions.
What You'll Learn
1. How to transition from MyISAM to a more efficient data structure for analytics
2. Why ClickHouse is a suitable choice for handling large datasets in real-time analytics
3. When to use LSM-trees for write-intensive workloads
Prerequisites & Requirements
- Understanding of database indexing and data storage concepts
- Familiarity with SQL and database management systems (optional)
Key Questions Answered
What were the limitations of using MyISAM for Yandex.Metrica?
MyISAM suffered from slow reads caused by poor data locality, which forces random disk reads, along with operational drawbacks such as slow replication and problems with consistency and recovery. These limitations necessitated a shift to more efficient data storage solutions.
How did Yandex.Metrica improve performance with Metrage?
The transition to Metrage, which implements LSM-trees, resulted in significant performance improvements, with page-title reports loading in 0.8 seconds compared to 26 seconds previously. This was achieved through better data locality and efficient compression.
Why was ClickHouse developed for Yandex.Metrica?
ClickHouse was developed to handle large datasets efficiently, allowing real-time analytics on non-aggregated data. It offers high query performance and linear scalability, and can process over 2 terabytes of data per second.
What are the advantages of using ClickHouse over traditional databases?
ClickHouse offers superior performance, processing queries 2.8-3.4 times faster than Vertica, and supports SQL with extensions for web analytics. Its ability to scale and handle large datasets makes it ideal for Yandex.Metrica's needs.
Key Statistics & Figures
Rows stored in MyISAM tables as of 2011
580 billion
This highlights the scale of data Yandex.Metrica was managing before transitioning to Metrage.
Rows stored in Metrage as of 2015
3.37 trillion
This shows the significant growth in data storage requirements that led to the adoption of more efficient data structures.
Query processing time for Metrage
average = 6 ms, 90th percentile = 31 ms, 99th percentile = 334 ms
These metrics demonstrate the performance improvements achieved with Metrage compared to previous systems.
Number of servers in Yandex.Metrica's main cluster
426
This indicates the scale at which ClickHouse operates to handle analytics for Yandex.Metrica.
Technologies & Tools
Database
MyISAM
Initially used for storing statistics in Yandex.Metrica.
Database
Metrage
An implementation of LSM-trees for handling write-intensive workloads.
Database
ClickHouse
Developed for real-time analytics and handling large datasets efficiently.
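Metrage is described above as an implementation of LSM-trees. The sketch below is a toy illustration of the general LSM-tree idea only, not Metrage's actual code: writes go to a small in-memory buffer, full buffers are flushed as sorted immutable runs, reads check the newest data first, and compaction merges runs. The class name `TinyLSM` and the `memtable_limit` parameter are hypothetical, chosen for this example.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: an in-memory buffer plus sorted, immutable runs.

    Illustrative only; real systems persist runs to disk and compact
    them in the background.
    """

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # hypothetical tuning knob
        self.memtable = {}  # most recent writes, unsorted
        self.runs = []      # sorted lists of (key, value), newest last

    def put(self, key, value):
        # Writes only touch memory, so ingestion stays cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Turn the buffer into one sorted run (sequential write on disk).
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the buffer, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one; later (newer) runs overwrite older values.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

After compaction, all values for adjacent keys sit next to each other in a single sorted run, which is the property that helps range reads.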
Key Actionable Insights
1. Consider transitioning to a column-oriented database like ClickHouse for large-scale analytics. If your application requires real-time analytics and handles massive datasets, ClickHouse's performance and scalability can significantly enhance user experience and operational efficiency.
2. Utilize LSM-trees for write-heavy workloads to optimize data ingestion. When dealing with high-frequency event data, LSM-trees can improve write performance and reduce latency, making them suitable for applications like web analytics.
3. Prioritize data locality in your database design to enhance read performance. Understanding how data is accessed and stored can lead to better performance optimizations, especially in systems that require frequent read operations.
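The data locality point in the last insight can be made concrete with a small sketch. It assumes a simplified model in which each contiguous group of matching rows costs one disk seek; the function name `ranges_to_read` and the sample rows are hypothetical.

```python
def ranges_to_read(rows, counter_id):
    """Count contiguous groups of rows that belong to counter_id.

    In this simplified model, each group stands for one disk seek.
    """
    groups = 0
    prev_match = False
    for cid, _payload in rows:
        match = cid == counter_id
        if match and not prev_match:
            groups += 1
        prev_match = match
    return groups


# Rows stored in the order they arrived: one counter's data is scattered.
arrival_order = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (1, "e"), (2, "f")]

# The same rows stored sorted by counter: one counter's data is contiguous.
by_counter = sorted(arrival_order, key=lambda row: row[0])
```

Here `ranges_to_read(arrival_order, 1)` returns 3, while `ranges_to_read(by_counter, 1)` returns 1: sorting by the key that reports filter on turns several scattered reads into one sequential read.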
Common Pitfalls
1. Failing to account for data locality can lead to poor read performance.
When data is not stored in a way that optimizes for how it will be accessed, it can result in increased latency and slower query times, especially in analytics applications.
2. Over-aggregating data can lead to unnecessary complexity and storage bloat.
Pre-aggregating data without understanding user needs can result in wasted resources and a system that is difficult to maintain, as users may not utilize all the aggregated reports.
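The trade-off behind the over-aggregation pitfall can be sketched as follows. This is a minimal illustration with made-up events, not Yandex.Metrica's schema: a pre-aggregated table answers only the question it was built for, while non-aggregated rows can be grouped by any dimension at query time. The `report` helper and the field names are hypothetical.

```python
from collections import Counter

# Hypothetical raw (non-aggregated) events.
events = [
    {"counter": 1, "page": "/home", "browser": "Firefox"},
    {"counter": 1, "page": "/home", "browser": "Chrome"},
    {"counter": 1, "page": "/docs", "browser": "Chrome"},
]

# Pre-aggregated at write time: only "views per page" can be answered.
views_by_page = Counter(e["page"] for e in events)


def report(rows, dimension, **filters):
    """Aggregate raw rows by any dimension at query time."""
    return Counter(
        row[dimension]
        for row in rows
        if all(row[key] == value for key, value in filters.items())
    )


# A question nobody planned for: browsers of visitors to one page.
browsers_on_home = report(events, "browser", page="/home")
```

The pre-aggregated `views_by_page` cannot produce `browsers_on_home`, because the browser dimension was discarded when the table was built; the raw rows can, at the cost of scanning more data per query.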
Related Concepts
Data Storage Optimization Techniques
Performance Tuning For Databases
Real-time Analytics Frameworks