How we built the Internal Data Warehouse at ClickHouse

Dmitry Pavlov

ClickHouse

•

Dmitry Pavlov

•18 min read•beginner•

--

•View Original

AWSAWS EC2AWS RDSAWS S3DockerKotlinPostgreSQLPythonRedisSQL

Overview

The article details the construction of ClickHouse's Internal Data Warehouse (DWH), emphasizing its architecture, data sources, and operational strategies. It highlights the importance of data-driven decision-making and outlines the tools and methodologies employed to ensure efficient data processing and analysis.

What You'll Learn

1

How to design a scalable data warehouse using ClickHouse

2

Why to implement idempotency in data processing pipelines

3

How to ensure data consistency in distributed systems

4

How to manage user permissions effectively in a data warehouse

5

When to use intermediate data layers in data warehousing

Prerequisites & Requirements

Understanding of data warehousing concepts and architectures
Familiarity with ClickHouse and Airflow(optional)

Key Questions Answered

What are the main data sources for ClickHouse's internal data warehouse?

The main data sources include the Control Plane, Data Plane, AWS Billing, GCP Billing, Salesforce, M3ter, and others. Each source provides specific data such as database services metadata, cost information, and customer usage details, facilitating comprehensive internal analytics.

How does ClickHouse ensure data consistency in its data warehouse?

ClickHouse ensures data consistency by using the insert_quorum setting, which requires that data is successfully inserted into a specified number of replicas before confirming the operation. This approach prevents partial data reads during transformations, crucial for maintaining data integrity in the DWH.

What is the architecture of ClickHouse's internal data warehouse?

The architecture includes ClickHouse Cloud as the main database, Airflow for scheduling, AWS S3 for raw data storage, and Superset for BI and ad-hoc analysis. Data flows from various sources into S3, then into ClickHouse, where it is transformed and stored in a structured format for analysis.

What are the future plans for ClickHouse's data warehouse?

Future plans include introducing a third logical layer called the Detail Data Store (DDS) to manage complex metrics and dependencies, integrating DBT for enhanced data management, and improving naming conventions for clarity in data structures.

Key Statistics & Figures

Monthly active users

>70

Indicates the level of engagement with the DWH.

Queries per day

~40,000

Demonstrates the DWH's usage intensity and performance.

Total data stored

~115 Tb

Represents the scale of data managed within the DWH.

Actual stored data size after compression

~13 Tb

Highlights the efficiency of ClickHouse's data compression capabilities.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database

Clickhouse Cloud

Serves as the main database for the internal data warehouse.

Scheduling Tool

Airflow

Used for scheduling data ingestion and transformation tasks.

Storage

AWS S3

Acts as intermediate storage for raw data.

Bi Tool

Superset

Provides internal users with tools for querying and visualizing data.

Key Actionable Insights

1
Implementing idempotency in your data processing pipeline can significantly reduce errors and ensure data integrity. This means designing your ETL processes to handle duplicate data gracefully, which is crucial for maintaining accurate analytics.
Idempotency allows for safe re-execution of data loads without the risk of duplicating records, making it essential for robust data warehousing.

2
Utilizing a multi-layer architecture in your data warehouse can enhance flexibility and performance. By separating raw data, transformed data, and business entities, you can streamline data processing and improve query performance.
This approach allows for more efficient data management and easier updates, which is vital as data complexity grows.

3
Regularly reviewing and refining user permissions in your data warehouse can enhance security and compliance. Implementing role-based access controls ensures that sensitive data is only accessible to authorized users.
This is particularly important in environments handling PII and financial data, where compliance with regulations like GDPR is critical.

Common Pitfalls

1

Assuming a simple two-layer architecture is sufficient for complex data needs can lead to inefficiencies and complications in data processing.

As data requirements grow, it's essential to adapt the architecture to include additional layers that can handle complex metrics and dependencies.

Related Concepts

Data Warehousing

Etl Processes

Data Governance

Data Compression Techniques

User Access Control