Goldsky - A Gold Standard Architecture with ClickHouse and Redpanda

The ClickHouse Team

Overview

The article discusses Goldsky's architecture utilizing ClickHouse, Redpanda, and Apache Flink for efficient blockchain data analytics. It highlights how this architecture enables users to deliver transformed blockchain datasets to multiple ClickHouse instances, providing real-time analytics capabilities.

What You'll Learn

1. How to efficiently stream blockchain data to multiple ClickHouse instances using Redpanda
2. Why Apache Flink is suitable for real-time data processing in blockchain applications
3. When to use ClickHouse for advanced analytics on large blockchain datasets

Prerequisites & Requirements

  • Understanding of blockchain technology and data streaming concepts
  • Familiarity with ClickHouse and Redpanda (optional)

Key Questions Answered

How does Goldsky utilize Redpanda for blockchain data streaming?
Goldsky uses Redpanda as the backing store that holds and streams blockchain data. Redpanda's tiered storage keeps long-term retention cost-effective while still delivering data quickly, so downstream consumers such as ClickHouse see minimal latency.
What role does Apache Flink play in Goldsky's architecture?
Apache Flink is utilized for processing and transforming blockchain data streams. It allows users to apply complex transformations using FlinkSQL, enabling efficient filtering and aggregation of data before it is delivered to ClickHouse for analytics.
What challenges did Goldsky face when using ClickHouse?
Goldsky found it challenging to optimize the ReplacingMergeTree engine in ClickHouse for handling updates and duplicate events. Its work focused on emulating PREWHERE conditions and using partitions to keep queries efficient.
What is the significance of the Base blockchain dataset shared by Goldsky?
The Base blockchain dataset, which has almost 72 million transactions, is significant as it provides a real-time, publicly accessible resource for users to develop blockchain analytics applications using ClickHouse, showcasing the capabilities of Goldsky's architecture.
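
The filter-and-aggregate step attributed to Flink above can be illustrated without Flink itself. The sketch below is plain Python, not FlinkSQL, and only mimics what a `WHERE ... GROUP BY` transformation would do before rows are delivered to ClickHouse. The field names (`block_number`, `to_address`, `value`), the sample events, and the function name are all hypothetical and are not taken from Goldsky's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical decoded events; the field names are illustrative only.
EVENTS = [
    {"block_number": 1, "to_address": "0xaaa", "value": 5},
    {"block_number": 1, "to_address": "0xbbb", "value": 0},
    {"block_number": 2, "to_address": "0xaaa", "value": 7},
    {"block_number": 2, "to_address": "0xccc", "value": 3},
]


def filter_and_aggregate(events: Iterable[dict], min_value: int = 1) -> Dict[str, dict]:
    """Drop events below min_value, then count and sum the rest per address.

    Roughly the shape of:
        SELECT to_address, COUNT(*), SUM(value)
        FROM events WHERE value >= min_value GROUP BY to_address
    """
    totals: Dict[str, dict] = defaultdict(lambda: {"transfers": 0, "total_value": 0})
    for event in events:
        if event["value"] < min_value:
            continue  # filter step: discard rows the sink does not need
        row = totals[event["to_address"]]
        row["transfers"] += 1
        row["total_value"] += event["value"]
    return dict(totals)


if __name__ == "__main__":
    for address, stats in filter_and_aggregate(EVENTS).items():
        print(address, stats)
```

Filtering before delivery means the destination only stores and scans the rows a given user actually asked for.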

Key Statistics & Figures

Transaction count on Base blockchain
Almost 72 million
This figure is the number of transactions in the Base blockchain dataset that Goldsky shares, an indication of the chain's activity and its relevance for analytics.
Data processing rate with Apache Flink
500k events/sec
This rate highlights Flink's capability to handle high-throughput data streams effectively, which is essential for real-time blockchain analytics.
Size of the largest dataset in ClickHouse
1 TiB
This indicates the scale of data that can be managed and queried within ClickHouse, showcasing its suitability for large-scale blockchain data.

Key Actionable Insights

1. Implementing a multi-tenant architecture with Redpanda can make data streaming significantly more efficient.
   This approach lets organizations deliver real-time data to multiple endpoints without managing a separate stream for each one, which suits applications that need rapid analytics.
2. Using FlinkSQL for data transformations can simplify complex data processing tasks.
   Developers can write SQL queries to filter and aggregate data, which is particularly useful in blockchain applications where data complexity is high.
3. Choosing the right table engine in ClickHouse is crucial for optimizing data handling.
   Understanding the strengths of engines such as ReplacingMergeTree helps in managing updates efficiently and keeping queries on large datasets fast.
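
The idea of serving several endpoints from one stream can be sketched with a shared append-only log in which each sink tracks its own read position, which is the general model behind Kafka-compatible brokers such as Redpanda. The code below is an in-memory stand-in written for illustration; `SharedLog`, `poll`, and the sink names are hypothetical and are not Redpanda or Goldsky APIs.

```python
from typing import Dict, List


class SharedLog:
    """In-memory stand-in for one topic partition: records are appended once
    and every sink reads them at its own pace via a private offset."""

    def __init__(self) -> None:
        self._records: List[dict] = []
        self._offsets: Dict[str, int] = {}  # sink name -> next offset to read

    def append(self, record: dict) -> None:
        self._records.append(record)

    def poll(self, sink: str, max_records: int = 100) -> List[dict]:
        """Return the next batch for this sink and advance only its offset."""
        start = self._offsets.get(sink, 0)
        batch = self._records[start:start + max_records]
        self._offsets[sink] = start + len(batch)
        return batch


if __name__ == "__main__":
    log = SharedLog()
    for n in range(3):
        log.append({"block_number": n})

    # Two hypothetical ClickHouse sinks consume the same data independently.
    print(log.poll("clickhouse-tenant-a", max_records=2))
    print(log.poll("clickhouse-tenant-b"))
    print(log.poll("clickhouse-tenant-a"))
```

Because the data is written once and each reader only stores an offset, adding another destination does not require a second copy of the stream.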

Common Pitfalls

1. Misconfiguring the ReplacingMergeTree engine can lead to inefficient data handling.
   This can degrade performance and make duplicate events hard to manage, so it is essential to understand the engine's capabilities and optimize its use.
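
To make the pitfall concrete, here is a minimal Python sketch of the replacement rule that ReplacingMergeTree applies when parts are merged: among rows that share a sorting key, the row with the highest version survives, and on a tie the most recently inserted row wins. In ClickHouse this happens during background merges, so duplicates can remain visible to queries until a merge has run. The function name, column names, and sample rows below are hypothetical and only model the rule; they are not ClickHouse code.

```python
from typing import Dict, List, Sequence, Tuple


def merge_replacing(rows: List[dict], key_fields: Sequence[str], version_field: str) -> List[dict]:
    """Keep one row per sorting key: the one with the highest version.

    Rows are processed in insertion order, and `>=` lets a later row
    replace an earlier one that has the same version.
    """
    latest: Dict[Tuple, dict] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        current = latest.get(key)
        if current is None or row[version_field] >= current[version_field]:
            latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    # Hypothetical rows: an update to 0x1 and an exact duplicate of 0x2.
    rows = [
        {"tx_hash": "0x1", "status": "pending", "version": 1},
        {"tx_hash": "0x2", "status": "confirmed", "version": 1},
        {"tx_hash": "0x1", "status": "confirmed", "version": 2},
        {"tx_hash": "0x2", "status": "confirmed", "version": 1},
    ]
    for row in merge_replacing(rows, key_fields=["tx_hash"], version_field="version"):
        print(row)
```

If the sorting key in this model omitted `tx_hash`, or the version column did not increase with each update, the wrong rows would survive, which is the kind of misconfiguration the pitfall describes.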

Related Concepts

Blockchain Analytics
Data Streaming
Real-time Data Processing
Multi-tenant Architecture