Schematizing Deletion at Scale

Building Shopify’s schematization platform for managing Personally Identifiable Information (PII) within our data warehouse.

Behrooz Shafiee
16 min read, beginner

Overview

The article describes how Shopify manages personally identifiable information (PII) at scale through a schematization platform that improves the reliability, performance, and efficiency of data processing. It details how the Privacy team worked with the Data Science & Engineering teams to implement PII deletion using two pseudonymization techniques, obfuscation and tokenization.

What You'll Learn

1. How to design and implement a schematization system for event data
2. Why obfuscation and tokenization are critical for handling PII
3. How to effectively delete PII across multiple data controllers
4. When to apply pseudonymization techniques in data processing

Prerequisites & Requirements

  • Understanding of data privacy regulations and PII management
  • Familiarity with Kafka and data warehousing concepts (optional)

Key Questions Answered

How does Shopify handle the deletion of PII at scale?
Shopify uses a tokenization vault: PII is replaced with tokens, and the vault stores the mapping from each token back to the personal data. Deleting PII then means deleting its mapping from the vault. Once the mapping is gone, the remaining tokens are just random strings that can no longer be resolved to the original PII, which simplifies the deletion process.
What are the benefits of using a schematization platform?
The schematization platform at Shopify enhances privacy education, ensures the correct structure of event data, and facilitates the reuse and observability of data. It allows data scientists to better understand and manage PII, leading to improved compliance with privacy regulations.
What types of pseudonymization techniques are used in data processing?
Shopify utilizes two main pseudonymization techniques: obfuscation, which masks identifying data while preserving analytical value, and tokenization, which replaces PII with a consistent random token stored in a secure vault. This dual approach enhances data privacy while maintaining usability.
What challenges did Shopify face in adopting the new PII management tools?
Shopify faced challenges in stakeholder engagement and ensuring the adoption of new tools across various teams. Effective communication and making the new tools the default option were critical strategies to overcome these hurdles and promote compliance with privacy standards.
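
The two pseudonymization techniques described above, and deletion by removing a vault mapping, can be illustrated with a minimal sketch. This is not Shopify's implementation: the `TokenVault` class, the `obfuscate_ip` helper, and the choice of masking the last octet of an IPv4 address are all hypothetical, and an in-memory dictionary stands in for a secure, persistent vault.

```python
import secrets
from typing import Dict, Optional


def obfuscate_ip(ip: str) -> str:
    """Hypothetical obfuscation: zero the last IPv4 octet, keep the network part."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(parts[:3] + ["0"])


class TokenVault:
    """Toy in-memory vault (an assumption, not the real system)."""

    def __init__(self) -> None:
        self._token_by_pii: Dict[str, str] = {}
        self._pii_by_token: Dict[str, str] = {}

    def tokenize(self, pii: str) -> str:
        """Return the same random token every time for the same PII value."""
        token = self._token_by_pii.get(pii)
        if token is None:
            token = secrets.token_hex(16)  # random, not derived from the PII
            self._token_by_pii[pii] = token
            self._pii_by_token[token] = pii
        return token

    def detokenize(self, token: str) -> Optional[str]:
        """Resolve a token to its PII, or None if the mapping was deleted."""
        return self._pii_by_token.get(token)

    def delete(self, pii: str) -> bool:
        """Remove the mapping; tokens already stored elsewhere become unresolvable."""
        token = self._token_by_pii.pop(pii, None)
        if token is None:
            return False
        del self._pii_by_token[token]
        return True


if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenize("buyer@example.com")
    vault.delete("buyer@example.com")
    # The token may still sit in warehouse tables, but it no longer resolves.
    print(vault.detokenize(token))
```

Because the token is random rather than a hash of the input, nothing about the PII can be recovered from the token once the mapping is deleted. In this toy version, tokenizing the same value again after deletion produces a fresh, unrelated token.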

Key Statistics & Figures

Active schemas
Over 4,500
Shopify currently maintains over 4,500 active schemas for event data.
Daily event processing rate
About 20 billion events per day
The platform processes approximately 20 billion events per day.
Tokenization vault size
About 500 billion distinct PII-to-token mappings
The tokenization vault holds around 500 billion distinct PII-to-token mappings.
Data deletion rate
Tens to hundreds of millions of mappings per day
The tokenization vault deletes tens to hundreds of millions of mappings daily in response to deletion requests.
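
To give a sense of scale, the daily figures above can be converted into an average per-second rate and a fraction of the vault. This is a rough back-of-the-envelope sketch: it assumes events arrive evenly over the day, and the value of 100 million deletions per day is an assumed illustrative point inside the stated "tens to hundreds of millions" range, not a reported number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 20_000_000_000      # from the figures above
vault_mappings = 500_000_000_000     # from the figures above
deletions_per_day = 100_000_000      # assumption: one point within the stated range

# Average ingest rate if events were spread evenly across the day.
avg_events_per_second = events_per_day / SECONDS_PER_DAY

# Share of the vault removed on a day with the assumed deletion volume.
daily_deleted_fraction = deletions_per_day / vault_mappings

print(f"{avg_events_per_second:,.0f} events/s on average")
print(f"{daily_deleted_fraction:.4%} of vault mappings deleted per day")
```

Under these assumptions the platform averages roughly 230,000 events per second, and a 100-million-deletion day touches about 0.02% of the vault.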

Key Actionable Insights

1. Implement a schematization system to standardize event data collection.
   This ensures that all event data adheres to a defined structure, improving data quality and compliance with privacy regulations.
2. Utilize obfuscation techniques to protect sensitive data while maintaining its analytical value.
   Obfuscation allows for the analysis of data without exposing personal identifiers, which is essential for privacy compliance.
3. Adopt a tokenization strategy to facilitate the deletion of PII across multiple data controllers.
   This approach simplifies the deletion process, allowing organizations to comply with data protection regulations efficiently.
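
The first insight, enforcing a defined structure on event data, can be sketched as a schema in which each field declares its type and how its contents should be handled for privacy. Everything here is hypothetical and for illustration only: the `FieldSpec` class, the `handling` labels, the `CHECKOUT_SCHEMA` fields, and the `validate` function are assumptions, not the format of Shopify's platform.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class FieldSpec:
    py_type: type
    handling: str = "none"  # hypothetical labels: "none", "obfuscate", "tokenize"


# Hypothetical schema for a checkout event.
CHECKOUT_SCHEMA: Dict[str, FieldSpec] = {
    "event_id": FieldSpec(str),
    "amount_cents": FieldSpec(int),
    "email": FieldSpec(str, handling="tokenize"),
    "ip_address": FieldSpec(str, handling="obfuscate"),
}


def validate(event: Dict[str, Any], schema: Dict[str, FieldSpec]) -> List[str]:
    """Return a sorted list of problems; an empty list means the event conforms."""
    errors = []
    for name in schema.keys() - event.keys():
        errors.append(f"missing field: {name}")
    for name in event.keys() - schema.keys():
        errors.append(f"unknown field: {name}")
    for name, spec in schema.items():
        if name in event and not isinstance(event[name], spec.py_type):
            errors.append(f"wrong type for {name}")
    return sorted(errors)


def fields_needing(schema: Dict[str, FieldSpec], handling: str) -> List[str]:
    """List the fields that declare a given privacy handling."""
    return sorted(n for n, s in schema.items() if s.handling == handling)


if __name__ == "__main__":
    event = {"event_id": "e-1", "amount_cents": 1999,
             "email": "buyer@example.com", "ip_address": "203.0.113.42"}
    print(validate(event, CHECKOUT_SCHEMA))
    print(fields_needing(CHECKOUT_SCHEMA, "tokenize"))
```

Declaring the handling next to the field is one way to make the privacy-preserving path the default: a pipeline can look up which fields to obfuscate or tokenize from the schema instead of relying on each producer to remember.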

Common Pitfalls

1. Failing to engage all stakeholders during the implementation of new privacy tools.
   This can lead to resistance and poor adoption of the tools, making it essential to involve all relevant parties in the process.
2. Neglecting the importance of making the right tools the default option.
   If the new tools are not the easiest option, teams may revert to old practices, undermining the goals of privacy compliance.

Related Concepts

Data Privacy Regulations
Pseudonymization Techniques
Data Warehousing Best Practices