Exploring Global Internet Speeds using Apache Iceberg and ClickHouse

Dale McDiarmid
28 min readbeginner
--
View Original

Overview

This article explores the use of Apache Iceberg and ClickHouse to analyze global internet speeds using the Ookla dataset. It covers the capabilities of Iceberg in a lake house architecture and demonstrates how to visualize geographical data using SQL, including techniques for converting polygons to SVGs and leveraging h3 indexing.

What You'll Learn

1

How to query Iceberg files using ClickHouse

2

How to visualize geographical data with SQL

3

Why using h3 indexing can improve data visualization

4

How to compute centroids for polygons in SQL

Prerequisites & Requirements

  • Familiarity with SQL and data visualization concepts
  • Access to ClickHouse and Apache Iceberg

Key Questions Answered

What is Apache Iceberg and how does it relate to ClickHouse?
Apache Iceberg is an open-table format that enhances data lakes by providing table-like semantics, allowing for schema evolution, ACID transactions, and efficient querying. It integrates with ClickHouse to support real-time analytics while leveraging Iceberg for cold storage.
How can geographical data be visualized using SQL?
Geographical data can be visualized by converting geo polygons to SVG format using SQL functions in ClickHouse. This involves using functions like SVG and Mercator projections to render the data accurately on a 2D map.
What are the limitations of ClickHouse's Iceberg support?
ClickHouse currently cannot read Iceberg files with row-based deletes or evolved schemas, and it does not exploit partitions during query evaluation. These limitations may affect query performance and compatibility with certain Iceberg features.
How does h3 indexing improve data visualization?
h3 indexing allows for the Earth's surface to be divided into hierarchical hexagonal grids, enabling efficient aggregation and visualization of data at various resolutions. This flexibility helps in rendering detailed maps without overwhelming file sizes.

Key Statistics & Figures

Total number of rows processed in Ookla dataset
128,006,990
This number reflects the total entries in the dataset used for querying and analysis.
Average download speed in Mbps from the dataset
51.25
This metric represents the average download speed calculated from the dataset for visualization.
Size of the generated SVG file for h3 resolution 6
165MB
This file size indicates the output of the visualization process at a specific resolution.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Leverage Apache Iceberg for efficient data management in your lake house architecture.
Using Iceberg allows for better schema management and concurrent access by multiple query engines, which can significantly enhance data processing workflows.
2
Utilize ClickHouse's geo functions to visualize complex geographical datasets.
By converting polygons to SVGs, you can create interactive visualizations that help in understanding spatial data trends and patterns effectively.
3
Consider using h3 indexing for large-scale geographical data analysis.
h3's hierarchical structure allows for efficient data aggregation and visualization, making it easier to manage and analyze extensive datasets.
4
Implement centroid calculations in SQL for better data representation.
Calculating centroids can simplify the visualization of geographical data by providing a single point representation for complex polygons.

Common Pitfalls

1
Failing to account for Iceberg's limitations can lead to read exceptions.
Users should be aware that ClickHouse cannot read Iceberg files with row-based deletes or evolved schemas unless explicitly configured to ignore these settings.
2
Not optimizing queries for partitioning can result in slower performance.
ClickHouse does not currently exploit partitions during query evaluation, so users should consider structuring their data and queries to improve efficiency.

Related Concepts

Data Lake Vs. Lake House Architecture
Geospatial Data Analysis Techniques
Real-time Analytics With Clickhouse