Automating Data Protection at Scale, Part 2

Elizabeth Nammour

Part two of a series on how we provide powerful, automated, and scalable data privacy and security engineering capabilities at Airbnb

Airbnb


AWS, AWS S3, DynamoDB, Elasticsearch, Git, JSON, Julia, Kubernetes, Machine Learning, MySQL, Redis, Thrift

Overview

This article discusses the architecture and functionality of Airbnb's data classification systems, Inspekt and Angmar, which automate the detection of personal data, sensitive data, and secrets across its infrastructure. It highlights the challenges of manual data classification and outlines the technical components and methodologies used to enhance data protection at scale.

What You'll Learn

1. How to implement automated data classification using Inspekt

2. Why continuous quality measurement is essential for data verifiers

3. How to prevent secrets from entering codebases with Angmar

Key Questions Answered

What are the main components of the Inspekt data classification system?

Inspekt consists of two main components: the Task Creator, which identifies what data needs to be scanned, and the Scanner, which samples and scans the data to detect personal and sensitive information. This architecture allows for automated and scalable data classification across various data stores.
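The split between the two components can be illustrated with a minimal Python sketch. It is not Inspekt's actual code: the names `ScanTask`, `Verifier`, `create_tasks`, and `scan`, the inventory shape, the sampling strategy, and the regex-based email verifier are all assumptions made for illustration.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


@dataclass(frozen=True)
class ScanTask:
    """Hypothetical unit of work: one table in one data store."""
    data_store: str
    table: str


@dataclass(frozen=True)
class Verifier:
    """Hypothetical verifier: a data element name plus a predicate on one value."""
    data_element: str
    matches: Callable[[str], bool]


# Simplified email pattern, used only to make the sketch concrete.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def create_tasks(inventory: Dict[str, List[str]]) -> List[ScanTask]:
    """Task Creator: decide what to scan from an assumed {store: [tables]} inventory."""
    return [
        ScanTask(store, table)
        for store, tables in sorted(inventory.items())
        for table in tables
    ]


def scan(
    task: ScanTask,
    fetch_rows: Callable[[ScanTask], Iterable[Dict[str, object]]],
    verifiers: List[Verifier],
    sample_size: int = 100,
    seed: int = 0,
) -> Dict[str, Set[str]]:
    """Scanner: sample rows for one task and map each column to matched data elements."""
    rows = list(fetch_rows(task))
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)

    findings: Dict[str, Set[str]] = {}
    for row in sample:
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # this sketch only inspects string values
            for verifier in verifiers:
                if verifier.matches(value):
                    findings.setdefault(column, set()).add(verifier.data_element)
    return findings
```

In a real deployment the row fetcher would read from the data store named in the task; here it is passed in as a function so the sketch stays self-contained.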

How does the Inspekt Scanner ensure data quality?

The Inspekt Quality Measurement Service continuously monitors the performance of data verifiers by calculating precision, recall, and accuracy against true positive and true negative data sets. This keeps the classification results reliable and helps keep false positives and false negatives low.
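The three metrics follow the standard definitions, and a small sketch shows how they could be computed for one verifier from the two labeled sets. The function and class names here are hypothetical, and treating a verifier as a simple predicate over a single value is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class QualityReport:
    precision: float
    recall: float
    accuracy: float


def measure_quality(
    verifier: Callable[[str], bool],
    true_positives: Sequence[str],
    true_negatives: Sequence[str],
) -> QualityReport:
    """Score a verifier against values known to be (or not to be) the data element."""
    tp = sum(1 for value in true_positives if verifier(value))   # correctly flagged
    fn = len(true_positives) - tp                                # missed
    fp = sum(1 for value in true_negatives if verifier(value))   # wrongly flagged
    tn = len(true_negatives) - fp                                # correctly ignored

    # Standard definitions; return 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return QualityReport(precision, recall, accuracy)
```

Running such a measurement on a schedule, and alerting when a metric drops below a chosen threshold, is one way to make the monitoring continuous; the threshold itself would be a policy decision rather than something this sketch prescribes.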

What strategies does Angmar use to detect secrets in code?

Angmar employs a CI check that scans every commit for secrets and a pre-commit hook that prevents secrets from being committed, so that sensitive data does not enter the codebase. If secrets are detected, a Jira ticket is automatically created for resolution.
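The core of such a check can be sketched as a function that scans the added lines of a unified diff and returns a non-zero exit code when something looks like a secret. This is an illustration under stated assumptions, not Angmar's implementation: the two regex rules, the `Finding` type, and the function names are hypothetical, and a production scanner would use a far richer rule set.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    diff_line: int  # 1-based line number within the diff text, not the source file
    rule: str


# Illustrative rules only: the AWS access key ID prefix format and PEM private key headers.
RULES = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_diff(diff_text: str) -> List[Finding]:
    """Return findings for added lines ('+' prefix) in a unified diff."""
    findings: List[Finding] = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Skip removed/context lines and the '+++ b/file' header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append(Finding(number, rule))
    return findings


def hook_exit_code(diff_text: str) -> int:
    """Exit code a pre-commit hook could return: 1 blocks the commit, 0 allows it."""
    return 1 if scan_diff(diff_text) else 0
```

A pre-commit hook could feed this function the output of `git diff --cached`, while a CI job could run the same scan over each pushed commit and open a ticket when findings are returned.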

What are the challenges of manual data classification?

Manual data classification is error-prone because data evolves constantly and security and privacy requirements keep expanding. Manual processes also do little to stop secrets from leaking into the codebase. These challenges necessitate automated solutions like Inspekt and Angmar to improve efficiency and accuracy.

Technologies & Tools


Storage

AWS S3

Used for storing data assets and facilitating access during scanning processes.

Orchestration

Kubernetes

Used to deploy and manage the Inspekt Scanner as a distributed system.

Search

Elasticsearch

Utilized for querying application logs to enhance data sampling during scans.

AI/ML

Machine Learning

Employed in Inspekt to detect complex data elements that cannot be identified by traditional methods.

Key Actionable Insights

1. Implement automated data classification tools like Inspekt to enhance data protection.
Automating data classification reduces errors associated with manual tracking and ensures compliance with evolving privacy regulations, ultimately saving time and resources.

2. Regularly measure the quality of data verifiers to maintain trust in classification results.
By continuously monitoring precision and recall, organizations can ensure that their data classification systems are effective and reliable, minimizing disruptions caused by false alerts.

3. Adopt proactive secret detection strategies to prevent sensitive data exposure.
Using tools like Angmar to block secrets before they enter the codebase can significantly reduce security risks and the costs associated with secret rotation.

Common Pitfalls

1. Relying solely on manual data classification can lead to significant errors and inefficiencies.
As data evolves rapidly and privacy regulations change, manual tracking becomes increasingly challenging, making automated solutions essential for accurate data protection.

Related Concepts

Data Classification Techniques

Automated Data Protection Strategies

Machine Learning In Data Security

Secrets Management In Software Development