Overview
This article provides an update on the evolution of ClickHouse's internal data warehouse (DWH) over the past year, highlighting improvements in data processing capabilities and user access. It discusses the integration of dbt for batch reporting, the incorporation of real-time data, and the configuration of various access points for users.
What You'll Learn
1. How to integrate dbt for modular SQL transformations in data pipelines
2. Why real-time data processing is essential for modern data warehouses
3. How to configure multiple access points for enhanced user experience
4. When to decompose workloads for scalable data processing
Prerequisites & Requirements
- Understanding of data warehousing concepts and SQL
- Familiarity with dbt and ClickHouse (optional)
Key Questions Answered
What improvements have been made to ClickHouse's internal data warehouse over the past year?
Over the past year, ClickHouse's internal data warehouse has evolved to support more diverse users and data sources, with enhanced capabilities for real-time data processing and batch reporting. The integration of dbt has centralized transformation logic, allowing for more complex metrics and improved scalability.
How does dbt enhance batch reporting in ClickHouse's DWH?
dbt enhances batch reporting by centralizing transformation logic, allowing data engineers to write modular SQL code for complex metrics. This simplifies the process of adding new data sources and managing dependencies, making it easier to track data lineage and document changes.
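The dependency management described above can be illustrated with a small sketch. dbt infers a dependency graph from the `ref()` calls between models and builds them in dependency order; the Python below mimics that idea with a topological sort. It is not dbt's implementation, and the model names (`stg_billing`, `rpt_monthly_revenue`, and so on) are hypothetical, not taken from the ClickHouse DWH.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models. Each maps to the set of models it depends on,
# similar to what dbt would infer from ref() calls in the SQL.
MODEL_DEPENDENCIES = {
    "stg_billing": set(),
    "stg_usage": set(),
    "int_customer_usage": {"stg_billing", "stg_usage"},
    "rpt_monthly_revenue": {"int_customer_usage"},
}


def build_order(dependencies):
    """Return model names ordered so each model comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def upstream_of(model, dependencies):
    """Return every direct and transitive dependency of a model (its lineage)."""
    seen = set()
    stack = list(dependencies.get(model, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependencies.get(current, ()))
    return seen


order = build_order(MODEL_DEPENDENCIES)
lineage = upstream_of("rpt_monthly_revenue", MODEL_DEPENDENCIES)
```

Adding a new data source under this scheme means adding one entry and its dependencies; the build order and lineage follow from the graph, with no hand-maintained run sequence.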
What are the benefits of incorporating real-time data into the DWH?
Incorporating real-time data into the DWH allows users to perform ad-hoc analysis and gain insights from less structured data. This capability has proven valuable for various teams, enabling them to track metrics like website activity and customer queries more intuitively.
What access points have been configured for users of the DWH?
Multiple access points have been configured for the DWH, including the ClickHouse Cloud SQL console for ad-hoc queries and a connection with GrowthBook for A/B testing. Users can query data directly without relying on external BI tools.
Key Statistics & Figures
Data volume written daily
6 billion rows and 50 TB
This data is written to the DWH every day, showcasing the scale at which the warehouse operates.
Compressed data stored
470 TB
This represents the total amount of data collected over the last two years, indicating significant growth.
User engagement
~70 unique users
This number reflects the users generating at least one SELECT query in a given week.
Data growth rate
23%
This increase in stored data occurred over the last 30 days, highlighting the rapid expansion of data volume.
Technologies & Tools
Database
ClickHouse
Used as the primary data warehouse for storing and processing data.
Data Transformation Tool
dbt
Facilitates modular SQL transformations and centralized logic for batch reporting.
Workflow Orchestration
Apache Airflow
Initially used for configuring batch reporting processes.
BI Tool
Apache Superset
Originally configured for user data access before transitioning to ClickHouse Cloud's SQL console.
A/B Testing Tool
GrowthBook
Configured to run A/B tests using data from the DWH.
Key Actionable Insights
1. Integrate dbt into your data pipeline to manage complex transformations and dependencies more effectively.
Using dbt can significantly streamline the process of adding new data sources and defining complex metrics, making your data warehouse more scalable and maintainable.
2. Consider exposing real-time data alongside batch reports to provide users with a comprehensive view of business metrics.
Real-time data can enhance decision-making by allowing teams to respond quickly to changing conditions and user behavior, thus improving overall business agility.
3. Utilize ClickHouse's SQL console for a better user experience when querying data.
The native SQL console offers a more reliable and user-friendly interface compared to other BI tools, making it easier for users to explore data and perform ad-hoc analysis.
4. Plan for scalable architecture by decomposing workloads into independent processes.
This approach can improve performance and reliability, especially as the number of data sources and complexity of data models increase.
Common Pitfalls
1. Relying on a single ETL job for all data sources can lead to performance bottlenecks.
As data sources and models grow in complexity, a single job can become a point of failure. Decomposing workloads into multiple processes can enhance reliability and performance.
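As a rough illustration of this pitfall and its remedy, the sketch below runs each source's load as an independent task, so one failing source does not block the rest. Everything in it is assumed for illustration: the loader functions, source names, and row counts are hypothetical, and threads stand in for what would more likely be separate scheduled jobs or processes in a real warehouse.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical loaders, one per data source. Each returns a row count.
def load_billing():
    return 120


def load_website_events():
    # Simulate a source that is down during this run.
    raise ConnectionError("source unavailable")


def load_support_tickets():
    return 45


LOADERS = {
    "billing": load_billing,
    "website_events": load_website_events,
    "support_tickets": load_support_tickets,
}


def run_independently(loaders):
    """Run every loader as its own task and record a per-source outcome."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
        futures = {name: pool.submit(fn) for name, fn in loaders.items()}
        for name, future in futures.items():
            try:
                results[name] = ("ok", future.result())
            except Exception as exc:
                # A failure is recorded for this source only.
                results[name] = ("failed", str(exc))
    return results


results = run_independently(LOADERS)
```

In a single monolithic job, the exception in `load_website_events` would typically abort the whole run; here the other two sources still load, and only the failed one needs a retry.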
Related Concepts
Data Warehousing Best Practices
Real-time Data Processing
Data Pipeline Architecture
Batch Reporting Techniques