Open sourcing Querybook, Pinterest’s collaborative big data hub

Pinterest Engineering

Overview

Pinterest has open sourced Querybook, a collaborative big data hub designed to improve data access and analysis for teams, especially in a remote working environment. The article details the motivation behind its development, its features, architecture, and the path to open sourcing.

What You'll Learn

1. How to use Querybook for collaborative data analysis
2. Why a responsive web UI is crucial for data scientists and engineers
3. When to implement automated query analytics in your data workflows

Key Questions Answered

What is Querybook and how does it improve data analysis?
Querybook is a collaborative big data hub that allows data scientists, product managers, and engineers to compose queries, create analyses, and share findings in a responsive web UI. It integrates with various SQL engines and provides features like real-time collaboration, automated query analytics, and customizable dashboards.
What are the main features of Querybook?
Key features of Querybook include a DataDoc interface for composing queries, real-time collaboration, automated query analytics, and visualization options for creating dashboards. Users can also schedule updates for visualizations and utilize a plugin system for customization.
How does Querybook's architecture support query execution?
In Querybook's architecture, a user composes a query in a DataDoc, the query is streamed to the server, and execution is handed off through a task queue. Results are then stored and made available to users, and the same flow applies across the supported SQL engines.
What challenges did Pinterest face when open sourcing Querybook?
Pinterest aimed to make Querybook generic while retaining Pinterest-specific integrations. They implemented a plugin system and an Admin UI to allow for easy configuration and customization, enabling broader usability for the open-source community.
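
The query execution flow described above (submit a query, run it through a task queue, store the result for later retrieval) can be sketched as follows. This is a minimal illustration using only the Python standard library, not Querybook's actual implementation: `queue.Queue` stands in for the real task queue, an in-memory SQLite database stands in for a query engine, and a plain dictionary stands in for the result store. The names `submit_query`, `worker`, and `demo` are hypothetical.

```python
import queue
import sqlite3
import threading

# Assumption: these stand-ins replace the real task queue and result storage.
task_queue = queue.Queue()
result_store = {}


def run_query(sql):
    """Run one query against a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def worker():
    """Pull query tasks off the queue until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        try:
            if task is None:
                break
            execution_id, sql = task
            try:
                rows = run_query(sql)
                result_store[execution_id] = {"status": "done", "rows": rows}
            except sqlite3.Error as exc:
                result_store[execution_id] = {"status": "error", "error": str(exc)}
        finally:
            task_queue.task_done()


def submit_query(execution_id, sql):
    """Record the query as pending, then enqueue it for a worker."""
    result_store[execution_id] = {"status": "pending"}
    task_queue.put((execution_id, sql))


def demo():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    submit_query(1, "SELECT 1 + 1")
    submit_query(2, "SELECT * FROM missing_table")
    task_queue.put(None)  # sentinel: tell the worker to stop
    task_queue.join()
    thread.join()
    return result_store


if __name__ == "__main__":
    for execution_id, result in sorted(demo().items()):
        print(execution_id, result["status"])
```

Decoupling submission from execution this way lets the web server respond immediately while long-running queries finish in the background, and a failed query only marks its own record as an error.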

Key Statistics & Figures

Daily Active Users (DAUs)
500
Querybook has an average of 500 daily active users within Pinterest.
Daily Query Runs
7000
On average, Querybook handles 7000 query runs daily.
Internal User Rating
8.1/10
Querybook is rated 8.1 out of 10 by its internal users, indicating high satisfaction.

Technologies & Tools

Backend
SparkSQL
Used as one of the query engines for executing queries within Querybook.
Backend
Hive
Another query engine compatible with Querybook for executing queries.
Backend
Presto
A query engine supported by Querybook for data analysis.
Backend
SQLAlchemy
Any SQLAlchemy-compatible engine can also be used to execute queries within Querybook.
Backend
Redis
Used for real-time updates and task queue management in Querybook.
Backend
Elasticsearch
Utilized for searching DataDoc content in Querybook.
Tools
Jinja
Used for templating options in Querybook, allowing for dynamic content generation.
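
To make the Jinja entry concrete, here is a small sketch of rendering a parameterized SQL query with the `jinja2` package. The query text, the table name, and the variables `table` and `ds` are made up for illustration, and this is not necessarily how Querybook wires templating into a DataDoc.

```python
from jinja2 import Template

# Hypothetical query template; the table and column names are illustrative only.
QUERY_TEMPLATE = Template(
    "SELECT country, COUNT(*) AS signups\n"
    "FROM {{ table }}\n"
    "WHERE dt = '{{ ds }}'\n"
    "GROUP BY country"
)


def render_query(table, ds):
    """Fill the template variables and return the final SQL text."""
    return QUERY_TEMPLATE.render(table=table, ds=ds)


if __name__ == "__main__":
    print(render_query("events", "2021-03-01"))
```

Template variables are substituted as raw text, so values from untrusted sources should not be passed into a query template like this one.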

Key Actionable Insights

1. Implement Querybook to streamline your data analysis workflows and enhance collaboration among team members.
As remote work becomes more prevalent, tools like Querybook can help teams efficiently compose queries and share insights, improving overall productivity.
2. Utilize the automated query analytics feature to enhance your data documentation and schema management.
This feature helps maintain up-to-date metadata and usage statistics, making it easier to understand data sources and their relevance over time.
3. Leverage the visualization capabilities of Querybook to create dynamic dashboards that can be updated automatically.
This allows teams to visualize data trends and insights in real-time, facilitating quicker decision-making based on the latest data.
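
As a rough illustration of the usage statistics mentioned in the second insight, the sketch below counts how often each table appears across a batch of queries. It is a deliberately naive, regex-based approximation and not Querybook's analytics implementation: it ignores subqueries, CTE names, quoted identifiers, and comments. The function names `tables_in_query` and `table_usage` are hypothetical.

```python
import re
from collections import Counter

# Naive assumption: a table name is the identifier right after FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in_query(sql):
    """Return the set of lowercased table names referenced by one query."""
    return {name.lower() for name in TABLE_PATTERN.findall(sql)}


def table_usage(queries):
    """Count, per table, how many queries reference it at least once."""
    usage = Counter()
    for sql in queries:
        usage.update(tables_in_query(sql))
    return usage


if __name__ == "__main__":
    sample = [
        "SELECT * FROM db.users u JOIN db.orders o ON u.id = o.user_id",
        "select count(*) from db.users",
        "SELECT 1",
    ]
    for table, count in table_usage(sample).most_common():
        print(table, count)
```

Counts like these could feed table documentation, for example by surfacing the most frequently queried tables first.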

Common Pitfalls

1. Failing to leverage the collaborative features of Querybook can limit the effectiveness of data analysis.
Many teams may continue to work in isolation, missing out on the benefits of real-time collaboration and shared insights that Querybook facilitates.
2. Overlooking the importance of automated query analytics can lead to outdated metadata and inefficiencies.
Without utilizing this feature, teams may struggle to maintain an accurate understanding of their data sources and how they are being used.