How to Build a Production Grade Workflow with SQL Modelling

Michelle Ark

I’ll show you how we moved to a SQL modelling workflow by leveraging dbt (data build tool) and created tooling for testing and documentation on top of it.

Shopify



Overview

This article discusses the development of Seamster, a production-grade SQL modeling workflow created by Shopify to improve data reporting efficiency. It highlights the transition from a PySpark-based system to a dbt and Google BigQuery-based approach, addressing challenges faced by data scientists in their workflow.

What You'll Learn

1. How to create a production-ready SQL modeling workflow using dbt
2. Why using a base layer of models protects against breaking changes in raw sources
3. How to implement unit testing for data models in Seamster
4. When to apply CI pipelines for data model validation

Prerequisites & Requirements

Familiarity with SQL and data modeling concepts
Experience with dbt and Google BigQuery (optional)

Key Questions Answered

What challenges did Shopify face with their original data pipeline, Starscream?

Shopify's data scientists faced long development times and had to translate SQL queries into Python, which was slow and inefficient. At the same time, about 70 percent of the PySpark jobs were full batch queries that did not need the generalized computing capabilities of PySpark, which created an opportunity for optimization.

How does Seamster improve the data modeling workflow for data scientists?

Seamster enables faster creation of simple reports by using dbt for SQL modeling, allowing data scientists to focus on their work without the bottlenecks of boilerplate code. It also incorporates unit testing and CI pipelines to ensure reliability and consistency in data models.

What is the significance of the base layer of models in Seamster?

The base layer serves as a one-to-one interface to raw sources, protecting users from breaking changes in the raw data. This design allows for easier maintenance and updates, as changes only need to be made at the base model level rather than across all dependent models.
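As a rough illustration of the base-layer idea, the sketch below uses Python's built-in `sqlite3` as a stand-in for the warehouse. The table, view, and column names (`raw_orders`, `base_orders`, `ttl`, `total_amount_cents`) are hypothetical and not taken from Seamster. The base model is a view that only renames raw columns, and the downstream query reads from that view alone.

```python
import sqlite3

# Downstream model: reads only from the base model, never from the raw table.
DOWNSTREAM_SQL = "SELECT SUM(total_cents) FROM base_orders"


def run_downstream(raw_ddl: str, base_model_sql: str) -> int:
    """Build the raw table and base view in a fresh database, then run the downstream query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(raw_ddl)
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 1250), (2, 800)])
    conn.execute(f"CREATE VIEW base_orders AS {base_model_sql}")
    total = conn.execute(DOWNSTREAM_SQL).fetchone()[0]
    conn.close()
    return total


# Version 1 of the raw source and its one-to-one base model (hypothetical names).
total_v1 = run_downstream(
    "CREATE TABLE raw_orders (id INTEGER, ttl INTEGER)",
    "SELECT id AS order_id, ttl AS total_cents FROM raw_orders",
)

# Upstream renames a column. Only the base model changes; DOWNSTREAM_SQL is untouched.
total_v2 = run_downstream(
    "CREATE TABLE raw_orders (id INTEGER, total_amount_cents INTEGER)",
    "SELECT id AS order_id, total_amount_cents AS total_cents FROM raw_orders",
)

print(total_v1, total_v2)
```

The rename in the raw source is absorbed by a one-line change to the base model, while the downstream query stays the same in both runs.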

What are some key features of the unit testing framework in Seamster?

Seamster's unit testing framework allows data scientists to write tests against fixed input data, enabling them to check edge cases not present in production. It supports assertions from the Great Expectations library, enhancing testing capabilities and error messaging.
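The general pattern of testing a model against fixed input data can be sketched as follows. This is not Seamster's framework: it swaps in Python's built-in `sqlite3` for BigQuery and a plain `assert` for Great Expectations assertions, and the model, table, and column names are hypothetical. The fixed input deliberately contains edge cases that might never show up in production data.

```python
import sqlite3

# Hypothetical model under test: net revenue per shop, ignoring cancelled orders.
MODEL_SQL = """
SELECT shop_id, SUM(amount_cents) AS net_cents
FROM orders
WHERE status != 'cancelled'
GROUP BY shop_id
ORDER BY shop_id
"""


def run_model(input_rows):
    """Load fixed input rows into a fresh in-memory table and run the model query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (shop_id INTEGER, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", input_rows)
    result = conn.execute(MODEL_SQL).fetchall()
    conn.close()
    return result


def test_cancelled_and_refunded_orders():
    fixed_input = [
        (1, 1000, "paid"),
        (1, -1000, "refunded"),  # edge case: a refund nets the shop to zero
        (2, 500, "cancelled"),   # edge case: shop with only cancelled orders
        (3, 700, "paid"),
    ]
    expected = [(1, 0), (3, 700)]
    actual = run_model(fixed_input)
    assert actual == expected, f"expected {expected}, got {actual}"


test_cancelled_and_refunded_orders()
```

Because the input is fixed, the test is deterministic and does not depend on whatever happens to be in the warehouse that day.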

Key Statistics & Figures

Jobs run on Starscream

76,000

This number reflects the scale of operations handled by Shopify's original data pipeline.

Data written per day

300 terabytes

This statistic illustrates the significant data volume managed by Shopify's data pipeline.

Percentage of PySpark jobs that were full batch queries

70 percent

This finding indicated the opportunity for optimization by transitioning to a more efficient SQL-based workflow.

Technologies & Tools

Data Modeling

dbt

Used to create a structured SQL modeling workflow for reporting.

Data Storage

Google BigQuery

Serves as the data store for the Seamster system.

Data Processing

PySpark

Previously used for the data pipeline before transitioning to Seamster.

Data Manipulation

Pandas

Used in the unit testing framework for constructing mock data models.

Data Validation

Great Expectations

Integrated into the unit testing framework to provide assertions and error messaging.

Key Actionable Insights

1. Implement a base layer of models in your data pipeline to safeguard against breaking changes from raw sources.
   When a raw source changes, the fix is made once at the base model level instead of in every downstream model, which improves maintainability and reduces the risk of errors.

2. Run CI pipelines that validate your dbt models before deployment, so that changes do not introduce errors.
   By running validation tests on every commit, teams can catch potential issues early, maintaining the integrity of the data warehouse.

3. Adopt a structured approach to model ownership by organizing models into directories based on data science teams.
   This organization helps clarify responsibilities and improves collaboration among teams, making it easier to manage and discover data models.

4. Use a unit testing framework to validate data models against fixed input data.
   This practice lets data scientists confirm that their models behave as expected, including on edge cases that are not present in production data.
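The directory-based ownership idea from insight 3 can be sketched in a few lines. The layout `models/<team>/.../<model>.sql` and the team names below are assumptions for illustration only; the source does not describe Seamster's actual directory structure.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def owners_by_team(model_paths, root="models"):
    """Group model names by the first directory under `root`, treated as the owning team."""
    owners = defaultdict(list)
    for raw in model_paths:
        path = PurePosixPath(raw)
        relative = path.relative_to(root)
        if len(relative.parts) < 2:
            # A model sitting directly under the root has no owning team.
            raise ValueError(f"{raw} is not inside a team directory")
        owners[relative.parts[0]].append(path.stem)
    return dict(owners)


# Hypothetical repository layout.
paths = [
    "models/finance/base/base_orders.sql",
    "models/finance/orders_daily.sql",
    "models/marketing/campaign_rollup.sql",
]
print(owners_by_team(paths))
```

A lookup like this could feed a discovery page or a review-assignment step, since the owning team follows from the path alone.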

Common Pitfalls

1. Failing to implement a base layer of models can lead to frequent breaking changes and increased maintenance efforts.
   Without a base layer, data scientists must address issues across all dependent models whenever raw sources change, which can be time-consuming and error-prone.

2. Neglecting to run validation tests on data models can result in undetected errors being deployed to production.
   Regular validation ensures that changes do not introduce new issues, maintaining the reliability of the data warehouse.
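A validation gate of the kind described in the second pitfall might look like the minimal sketch below. The two checks (every model has a description and at least one test) and the metadata shape are assumptions chosen for illustration; the source does not list the checks that Seamster's CI pipeline runs.

```python
def validate(models):
    """Return human-readable failures; an empty list means the commit passes."""
    failures = []
    for name, meta in sorted(models.items()):
        if not meta.get("description"):
            failures.append(f"{name}: missing description")
        if not meta.get("tests"):
            failures.append(f"{name}: no tests defined")
    return failures


def ci_exit_code(models):
    """Print each failure and return a process exit code (0 = pass, 1 = fail)."""
    failures = validate(models)
    for failure in failures:
        print(failure)
    return 1 if failures else 0


# Hypothetical model metadata, as a CI job might collect it from the repository.
models = {
    "base_orders": {"description": "One-to-one view of raw orders", "tests": ["not_null"]},
    "orders_daily": {"description": "", "tests": []},
}
exit_code = ci_exit_code(models)
```

A CI job would pass the returned code to `sys.exit`, so that a failing check blocks the change before it reaches production.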

Related Concepts

Data Modeling Best Practices

Unit Testing In Data Pipelines

Continuous Integration In Data Workflows

Dimensional Modeling Techniques