Skyline: ETL-as-a-Service

Pinterest Engineering

Overview

Skyline is an ETL-as-a-Service platform developed by Pinterest to streamline data processing and reporting for its users. It enables self-service ETL management, data exploration, and lineage visualization, addressing the challenges posed by increasing data volume and user demand.

What You'll Learn

1. How to create and schedule ETL jobs using Skyline's self-service UI
2. Why data lineage is important for understanding data dependencies
3. How to explore datasets in the Pinterest warehouse effectively

Key Questions Answered

What are the core use cases of Skyline?
Skyline enables three core use cases: a self-service UI for creating and scheduling data workflows, visual exploration of data in Pinterest's warehouse, and understanding data lineage to track dependencies. This functionality aims to empower data users without heavy reliance on data engineers.
How does Skyline facilitate self-service ETL management?
Skyline provides a self-service ETL manager that allows data users to create and schedule jobs using simple SQL queries. This includes moving data from Hive to Redshift, building Pinalytics reports, and backfilling historical data on demand, significantly reducing the need for data engineer intervention.
What features does Skyline Data Warehouse offer?
Skyline Data Warehouse offers a catalog view of core datasets. For each dataset it shows which workflow generated the data, who owns it, how current it is, its column names and types, its size, its estimated row count, and user comments. The catalog is updated daily so that users see the latest information.
What components make up the Skyline architecture?
Skyline comprises several components: a web application built on Flask with a React.js frontend, a Thrift service for data interaction, an ETL Driver for parsing workflows, and parsers for detecting input/output tables and S3 paths. These components work together to facilitate data processing and management.
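
To make the parser component more concrete, here is a minimal sketch of detecting input and output tables in a SQL statement. It is an illustration only, not Skyline's implementation: the regular expressions, the `extract_tables` function, and the table names are all hypothetical, and a real parser would likely rely on a full SQL grammar rather than pattern matching.

```python
import re

# Hypothetical, simplified patterns. They ignore comments, CTEs, and quoted
# identifiers, which a production parser would need to handle.
_OUTPUT_RE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?"
    r"|CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?)([\w.]+)",
    re.IGNORECASE,
)
_INPUT_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_tables(sql: str) -> tuple[set[str], set[str]]:
    """Return (input_tables, output_tables) referenced by a SQL statement."""
    outputs = {name.lower() for name in _OUTPUT_RE.findall(sql)}
    # A table that is written by the statement is not reported as an input.
    inputs = {name.lower() for name in _INPUT_RE.findall(sql)} - outputs
    return inputs, outputs


# Example with made-up table names.
query = (
    "INSERT OVERWRITE TABLE analytics.daily_pins "
    "SELECT p.id FROM raw.pins p JOIN raw.users u ON p.user_id = u.id"
)
inputs, outputs = extract_tables(query)
```

The input and output sets produced this way are the raw material for lineage: each statement contributes edges from its inputs to its outputs.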

Key Statistics & Figures

Percentage of Pinalytics reports created using Skyline: 40%
This statistic highlights the rapid adoption of Skyline's ETL Manager by data analysts and product managers at Pinterest.

Key Actionable Insights

1. Use Skyline's self-service ETL manager to empower data analysts and product managers in your organization.
By allowing users to create and schedule their own ETL jobs, you remove the bottleneck of waiting on data engineers and improve productivity across teams.
2. Leverage the data lineage feature to troubleshoot and optimize data workflows.
Understanding how tables are derived from one another helps identify issues quickly and maintain data integrity, especially in complex systems with many dependencies.
3. Encourage the use of Skyline Data Warehouse for better data exploration.
This tool provides essential metadata about datasets, making it easier for users to find and understand the data they need for analysis and reporting.
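
As a rough illustration of the dataset metadata the catalog exposes, the record below models the fields the summary mentions (generating workflow, owner, freshness, columns, size, row estimate, comments). The class names, field names, and the `is_stale` helper are assumptions made for this sketch and do not reflect Skyline's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ColumnInfo:
    name: str
    type: str


@dataclass
class DatasetEntry:
    """One catalog record; field names are illustrative, not Skyline's schema."""
    table: str
    owner: str
    generated_by: str        # workflow that produced the table
    last_updated: date       # how current the data is
    size_bytes: int
    estimated_rows: int
    columns: list[ColumnInfo] = field(default_factory=list)
    comments: list[str] = field(default_factory=list)

    def is_stale(self, today: date, max_age_days: int = 1) -> bool:
        """True if the table was last refreshed more than max_age_days ago."""
        return (today - self.last_updated).days > max_age_days
```

A freshness check like `is_stale` is one way a user could decide whether a table is safe to report on; it is an addition for illustration, not a documented Skyline feature.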

Common Pitfalls

1. Failing to understand data lineage can lead to issues when jobs break in production.
Without clear visibility into how data tables are interdependent, users may struggle to identify affected downstream tables, complicating troubleshooting efforts.
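
Finding the affected downstream tables can be sketched as a graph traversal. The example below assumes lineage is available as a mapping from each table to the tables derived directly from it; that representation, the `downstream_tables` function, and the table names are hypothetical and chosen only to illustrate the idea.

```python
from collections import deque


def downstream_tables(lineage: dict[str, list[str]], broken: str) -> list[str]:
    """Breadth-first walk over table -> direct-dependents edges.

    Returns every table that transitively depends on `broken`,
    nearest dependents first, without duplicates.
    """
    seen = {broken}
    order: list[str] = []
    queue = deque([broken])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Made-up lineage: raw.pins feeds core.pins_daily, which feeds two reports.
example_lineage = {
    "raw.pins": ["core.pins_daily"],
    "core.pins_daily": ["report.engagement", "report.growth"],
}
affected = downstream_tables(example_lineage, "raw.pins")
```

The `seen` set keeps the walk from revisiting a table that is reachable by more than one path, so the function also terminates if the lineage data accidentally contains a cycle.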