I’ll show you how we moved to a SQL modeling workflow by leveraging dbt (data build tool) and created tooling for testing and documentation on top of it.
Overview
This article discusses the development of Seamster, a production-grade SQL modeling workflow created by Shopify to improve data reporting efficiency. It highlights the transition from a PySpark-based system to a dbt and Google BigQuery-based approach, addressing challenges faced by data scientists in their workflow.
What You'll Learn
How to create a production-ready SQL modeling workflow using dbt
Why using a base layer of models protects against breaking changes in raw sources
How to implement unit testing for data models in Seamster
When to apply CI pipelines for data model validation
Prerequisites & Requirements
- Familiarity with SQL and data modeling concepts
- Experience with dbt and Google BigQuery (optional)
Key Questions Answered
What challenges did Shopify face with their original data pipeline, Starscream?
How does Seamster improve the data modeling workflow for data scientists?
What is the significance of the base layer of models in Seamster?
What are some key features of the unit testing framework in Seamster?
Key Actionable Insights
1. Implement a base layer of models in your data pipeline to safeguard against breaking changes from raw sources. Because downstream models read only from base models, a change in a raw source can be absorbed by adjusting the base model alone, which improves maintainability and reduces the risk of errors.
2. Use a CI pipeline to validate your dbt models before deployment, so that changes do not introduce errors. By running validation tests on every commit, teams can catch potential issues early and maintain the integrity of the data warehouse.
3. Adopt a structured approach to model ownership by organizing models into directories based on data science teams. This organization clarifies responsibilities and improves collaboration among teams, making data models easier to manage and discover.
4. Use a unit testing framework to validate data models against fixed input data. Comparing a model's output on known input with a hand-checked expected result lets data scientists confirm that the model's logic behaves as intended, independent of whatever data is currently in the warehouse.
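To make the base-layer and unit-testing ideas above concrete, here is a minimal sketch in Python. It is not Seamster's actual framework: it uses the standard-library `sqlite3` module as a stand-in for BigQuery, and the table, column, and function names (`raw_orders`, `base_orders`, `run_model`, and so on) are all hypothetical. The base model is the only query that touches the raw table, and the test feeds the models a small fixed input and compares the output with a hand-computed expected result.

```python
import sqlite3

# Hypothetical base model: the only query that reads the raw table directly.
# If a raw column is renamed, only this query needs to change.
BASE_ORDERS_SQL = """
CREATE VIEW base_orders AS
SELECT
    id AS order_id,
    shop AS shop_id,
    total_cents / 100.0 AS total_amount
FROM raw_orders
"""

# Hypothetical downstream model: reads only from the base model.
SHOP_REVENUE_SQL = """
SELECT shop_id, SUM(total_amount) AS revenue
FROM base_orders
GROUP BY shop_id
ORDER BY shop_id
"""


def run_model(raw_rows):
    """Load fixed raw rows into an in-memory database and run both models."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE raw_orders (id INTEGER, shop INTEGER, total_cents INTEGER)"
        )
        conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)
        conn.execute(BASE_ORDERS_SQL)
        return conn.execute(SHOP_REVENUE_SQL).fetchall()
    finally:
        conn.close()


def test_shop_revenue():
    # Fixed input: two orders for shop 10, one order for shop 20.
    fixed_input = [(1, 10, 1500), (2, 10, 500), (3, 20, 250)]
    # Expected output, computed by hand from the fixed input.
    expected = [(10, 20.0), (20, 2.5)]
    assert run_model(fixed_input) == expected


test_shop_revenue()
```

Because the test builds its own input, it can run on every commit in a CI pipeline without depending on production data. In a real dbt project the two queries would live in separate model files rather than Python strings.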