How we made our internal data warehouse AI-first

Dmitry Pavlov
12 min readintermediate
--
View Original

Overview

This article details the transformation of ClickHouse's internal data warehouse from a traditional BI-first approach to an AI-first model, significantly enhancing user accessibility to analytics. It discusses the integration of advanced AI technologies, specifically LLMs, and the creation of DWAINE, an AI assistant that streamlines data queries for various teams within the organization.

What You'll Learn

1

How to enable non-technical users to access data insights without SQL

2

Why integrating LLMs can reduce the burden on data teams

3

How to implement the Model Context Protocol for data integration

Prerequisites & Requirements

  • Understanding of data warehousing concepts
  • Familiarity with ClickHouse and Superset(optional)

Key Questions Answered

How did ClickHouse transition from a BI-first to an AI-first data warehouse?
ClickHouse transitioned by integrating advanced LLMs, specifically Claude 3.5 and later versions, which improved SQL query writing and data accessibility. The introduction of the Model Context Protocol (MCP) facilitated structured access to multiple data sources, enabling a seamless connection between LLMs and the data warehouse.
What are the key components of DWAINE's architecture?
DWAINE's architecture includes a user interface powered by LibreChat, an LLM provider using Anthropic Claude 4.0, an integration layer with MCP servers, and ClickHouse Cloud as the main analytical database. This setup allows for efficient data querying and visualization.
What types of queries can DWAINE handle effectively?
DWAINE excels at handling complex queries that require multiple JOINs and aggregations across different data marts. Examples include generating service utilization reports and forecasting data storage needs, which would typically take analysts considerable time to execute manually.
What challenges did ClickHouse face during the AI integration?
Challenges included managing context window sizes for large schemas, ensuring data quality, and handling errors gracefully when LLM queries failed. These issues required careful prompt engineering and robust validation processes to maintain accuracy.

Key Statistics & Figures

Number of internal users utilizing DWAINE
more than 250
DWAINE has been adopted by over 250 internal users, demonstrating its effectiveness in enhancing data accessibility.
Percentage of analytics workload handled by DWAINE
approximately 70%
DWAINE now manages around 70% of internal analytics use cases, significantly reducing the burden on the data warehouse team.
Reduction in routine data requests pressure on DWH team
approximately 50-70%
The implementation of DWAINE has alleviated pressure on the three-person DWH team, allowing them to focus on more complex analyses.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database
Clickhouse
Main analytical database used for data warehousing.
Frontend
Librechat
User interface for interacting with the AI assistant DWAINE.
AI/ML
Anthropic Claude 4.0
LLM used for natural language processing and SQL generation.
Integration
Model Context Protocol (mcp)
Standard for connecting LLMs to external data sources.

Key Actionable Insights

1
Implementing an AI-first approach can significantly reduce the time spent on routine data queries.
By leveraging LLMs like Claude, organizations can empower non-technical users to obtain insights without needing SQL skills, freeing up data teams for more strategic tasks.
2
Establishing a comprehensive business glossary is crucial for effective AI integration.
A well-documented glossary provides necessary context for LLMs, reducing the likelihood of errors and improving the quality of insights generated from complex data structures.
3
Utilizing open standards like the Model Context Protocol can prevent vendor lock-in.
By adopting MCP, organizations can ensure flexibility in integrating various data sources and LLMs, enhancing their data architecture's resilience and adaptability.

Common Pitfalls

1
Relying solely on AI-generated insights without validating results can lead to inaccurate decisions.
AI models can produce errors, especially in complex queries. It's essential to cross-verify AI outputs with traditional SQL queries to ensure accuracy before making significant business decisions.