Super Tables: The road to building reliable and discoverable data products

LinkedIn Engineering Team

15 min read • intermediate

Apache Avro

Overview

This article describes LinkedIn's Super Tables, which are designed to address the challenges of data discoverability, reliability, and change management in a rapidly growing data ecosystem. It outlines the design principles behind Super Tables, their benefits, and the lessons learned from using them to build high-quality data products.

What You'll Learn

1. How to define and implement Super Tables for data products
2. Why establishing SLAs is crucial for data reliability
3. When to consolidate datasets into Super Tables to improve discoverability

Prerequisites & Requirements

Understanding of data ecosystems and data product management
Experience with data governance and quality assurance practices (optional)

Key Questions Answered

What are Super Tables and how do they improve data management?

Super Tables are pre-computed, denormalized datasets that consolidate attributes and insights optimized for analytics. They provide high data quality, availability, and formal ownership, simplifying data discovery and processing across various teams.
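To make "pre-computed, denormalized" concrete, here is a minimal sketch of the idea in plain Python. The source names (`job_postings`, `companies`, `job_views`), the columns, and the values are all hypothetical and are not taken from LinkedIn's actual tables. The join is done once, up front, so consumers read one wide row instead of repeating the same lookups.

```python
from typing import Any

# Hypothetical normalized sources (illustrative data only).
job_postings = {
    101: {"title": "Data Engineer", "company_id": 7},
    102: {"title": "Analyst", "company_id": 8},
}
companies = {7: {"company_name": "Acme"}, 8: {"company_name": "Globex"}}
job_views = {101: 340}  # job 102 has no recorded views yet


def build_super_table_rows(
    postings: dict[int, dict[str, Any]],
    company_lookup: dict[int, dict[str, Any]],
    views: dict[int, int],
) -> list[dict[str, Any]]:
    """Precompute one wide row per job by joining the sources once."""
    rows = []
    for job_id, posting in sorted(postings.items()):
        company = company_lookup.get(posting["company_id"], {})
        rows.append(
            {
                "job_id": job_id,
                "title": posting["title"],
                # Missing company stays None instead of dropping the row.
                "company_name": company.get("company_name"),
                # Missing view counts default to zero.
                "view_count": views.get(job_id, 0),
            }
        )
    return rows


super_table = build_super_table_rows(job_postings, companies, job_views)
```

In a real pipeline this join would likely run in a batch engine rather than in-process, but the shape of the result is the same: one row per entity, with attributes from several sources already attached.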

What benefits do Super Tables provide to data teams?

Super Tables enhance discoverability by reducing data duplication, strengthen reliability through established SLAs, and improve change management with defined governance policies. This leads to more efficient data analytics and resource utilization.

How does LinkedIn ensure the quality and reliability of Super Tables?

LinkedIn ensures quality and reliability through proactive data quality checks, established SLAs for data sources, and continuous monitoring of data flows. This includes maintaining high availability and implementing governance processes for change management.
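A proactive quality check of the kind described above might look like the following sketch. The function name `check_quality`, the thresholds, and the sample rows are assumptions made for illustration; the article does not specify which checks LinkedIn runs. The idea is to validate a freshly built partition before it is published to consumers.

```python
from typing import Any


def check_quality(
    rows: list[dict[str, Any]],
    required_columns: list[str],
    min_rows: int,
    max_null_rate: float,
) -> list[str]:
    """Return human-readable failures; an empty list means safe to publish."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} is below minimum {min_rows}")
    for column in required_columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        # An empty dataset is treated as fully null for every column.
        null_rate = nulls / len(rows) if rows else 1.0
        if null_rate > max_null_rate:
            failures.append(
                f"{column}: null rate {null_rate:.2f} exceeds {max_null_rate:.2f}"
            )
    return failures


# Hypothetical partition with one missing title out of four rows.
sample_rows = [
    {"job_id": 1, "title": "A"},
    {"job_id": 2, "title": None},
    {"job_id": 3, "title": "C"},
    {"job_id": 4, "title": "D"},
]
failures = check_quality(
    sample_rows, ["job_id", "title"], min_rows=3, max_null_rate=0.1
)
```

Returning a list of failures rather than raising on the first one lets an operator see every problem with a partition in a single run.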

What lessons were learned from implementing Super Tables?

Key lessons include the importance of understanding use cases before building new Super Tables, the need for clear communication regarding semantic logic ownership, and the necessity of a governance body to manage changes and maintain data quality.

Key Statistics & Figures

Number of data sources combined in JOBS Super Table

57+

The JOBS Super Table integrates data from over 57 critical data sources to provide comprehensive job-related insights.

Number of columns in JOBS Super Table

158

The JOBS Super Table contains 158 columns, precomputing essential information for job analytics.

Availability target for Super Tables

99+%

Super Tables aim for over 99% availability, translating to approximately one SLA miss per quarter.
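The "one SLA miss per quarter" figure can be checked with simple arithmetic. The sketch below assumes one delivery per day and a 91-day quarter; neither the cadence nor the quarter length is stated in the text above.

```python
def allowed_sla_misses(availability: float, deliveries: int) -> float:
    """Expected number of missed deliveries at a given availability level."""
    return (1.0 - availability) * deliveries


DAYS_PER_QUARTER = 91  # assumption: daily delivery, roughly 13 weeks
misses = allowed_sla_misses(0.99, DAYS_PER_QUARTER)
print(f"{misses:.2f} allowed misses per quarter")
```

Under these assumptions, 99% availability leaves room for just under one missed delivery per quarter, which is consistent with the approximation above.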

Key Actionable Insights

1. Implementing Super Tables can significantly streamline data access and reduce redundancy in datasets. By consolidating multiple similar datasets into a single Super Table, teams can save time in data discovery and minimize the complexity of data management, ultimately leading to more efficient analytics.

2. Establishing clear SLAs for data products is essential for maintaining trust and reliability. When teams have defined SLAs, they can ensure that data consumers are aware of the quality and availability of the datasets, which helps in managing expectations and improving overall data governance.

3. Regularly review and update governance processes to adapt to changing data needs. As data requirements evolve, it is crucial to have a governance body that can assess and recommend changes to Super Tables, ensuring they continue to meet business needs effectively.
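One input a governance body could use when assessing a proposed change is an automated comparison of the old and new schema. The sketch below is a simplified illustration: schemas are modeled as plain column-to-type mappings, the column names are hypothetical, and the rule that added columns are safe while removed or retyped columns are breaking is an assumption, not a documented LinkedIn policy and not Apache Avro's schema resolution rules.

```python
def find_breaking_changes(
    old_fields: dict[str, str], new_fields: dict[str, str]
) -> list[str]:
    """List changes that would break consumers reading the old columns."""
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"removed column: {name}")
        elif new_fields[name] != old_type:
            problems.append(
                f"type change on {name}: {old_type} -> {new_fields[name]}"
            )
    # Columns present only in new_fields are additions and are not flagged.
    return problems


# Hypothetical schema versions for illustration.
old_schema = {
    "job_id": "long",
    "title": "string",
    "posted_at": "long",
    "legacy_flag": "boolean",
}
new_schema = {
    "job_id": "long",
    "title": "string",
    "posted_at": "string",
    "company_name": "string",
}
problems = find_breaking_changes(old_schema, new_schema)
```

A non-empty result would be a signal to notify downstream teams and route the change through review before it ships.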

Common Pitfalls

1. Changing a dataset without telling downstream consumers can break their pipelines. When changes are made without notifying all affected teams, they can disrupt business continuity and create confusion, which is why a structured change management process is needed.

2. Overcomplicating a Super Table by consolidating too many datasets can jeopardize its SLA commitments. While it may seem beneficial to include as much data as possible, doing so can cause performance issues and make the Super Table's quality and reliability harder to maintain.
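One way to see why adding sources puts pressure on an SLA is a back-of-the-envelope availability model. The sketch below assumes that every upstream source fails independently, that each has the same on-time rate, and that any single late source makes the Super Table late. These are simplifying assumptions for illustration, not measurements from LinkedIn, and the 99.9% per-source figure is hypothetical.

```python
def combined_availability(per_source: float, source_count: int) -> float:
    """On-time probability when every one of N independent sources must be on time."""
    return per_source ** source_count


# Hypothetical per-source availability of 99.9%.
for source_count in (5, 20, 57):
    value = combined_availability(0.999, source_count)
    print(f"{source_count} sources: {value:.4f}")
```

Under this model the combined figure drops as sources are added, so a table with many dependencies needs either stricter upstream SLAs or fewer hard dependencies to hold a 99% target.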

Related Concepts

Data Mesh Principles

Data Quality Management

Data Governance Frameworks