Data @Scale – Boston recap

Meta

Recently, we hosted Data @Scale, an invitation-only technical conference for engineers working on large-scale storage systems and analytics. Facebook’s Seth Silverman, engineering manager, an…

Overview

The article provides a recap of the Data @Scale conference held in Boston, focusing on the challenges and advancements in large-scale data storage and analytics. It features insights from various industry leaders on topics such as patient privacy, database deployments, and data pipeline reliability.

What You'll Learn

1. How to protect patient privacy while utilizing large-scale health care data
2. Why balancing flexibility and control is crucial in database deployments
3. How to leverage sampling techniques to optimize data warehouse resource consumption
4. How to implement a cloud-native data warehouse architecture
5. How to build reliable data pipelines that can handle trillions of data points

Key Questions Answered

What are the challenges of protecting patient privacy in health care data?

The article discusses the challenges OM1 faces in safeguarding patient privacy while managing large datasets. It covers how the data is represented and the solutions OM1 implemented to preserve privacy while using real-world evidence in clinical research.

How can organizations balance flexibility and control in database deployments?

Tanya Cashorali explains that organizations must navigate various constraints when choosing between on-premises and cloud-based infrastructures. The talk outlines the trade-offs involved with different database technologies and deployment strategies, providing recommendations for achieving an optimal balance.

What techniques can reduce resource consumption in data warehouses?

Gabriela Jacques Da Silva and Donghui Zhang present sampling as a technique to manage the increasing compute and storage demands of Facebook's analytics workloads. They discuss how sampling can provide approximate query results while addressing challenges like uncertainty propagation in aggregated metrics.
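
To make the idea of approximate results with propagated uncertainty concrete, here is a minimal sketch in Python. It is not the approach presented in the talk. It assumes uniform Bernoulli row sampling and a Horvitz-Thompson style estimator, and the function names `estimate_sum` and `combine_independent` are hypothetical.

```python
import math
import random
from typing import Iterable, Sequence, Tuple


def estimate_sum(values: Sequence[float], rate: float, seed: int = 0) -> Tuple[float, float]:
    """Estimate sum(values) from a Bernoulli sample that keeps each row with probability `rate`.

    Returns (estimate, standard_error). Assumption: uniform, independent row sampling.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < rate]
    # Scale the sampled total back up by the inverse sampling rate.
    estimate = sum(sample) / rate
    # Unbiased variance estimate under Bernoulli sampling:
    # (1 - rate) / rate^2 * sum of squared sampled values.
    variance = (1.0 - rate) / (rate * rate) * sum(v * v for v in sample)
    return estimate, math.sqrt(variance)


def combine_independent(parts: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Add up per-partition estimates and propagate their uncertainty.

    Assumption: the partitions were sampled independently, so variances add.
    """
    total = 0.0
    variance = 0.0
    for estimate, std_error in parts:
        total += estimate
        variance += std_error ** 2
    return total, math.sqrt(variance)
```

A query that scans 10% of the rows does roughly a tenth of the work, and the standard error tells the consumer how far the scaled-up total may be from the exact answer. Metrics built from several sampled aggregates need their errors combined, which is what the second function illustrates for the simplest case of a sum of independent parts.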

What lessons were learned from scaling a timeseries database?

Ryan Betts shares insights from scaling InfluxDB, detailing lessons learned from various deployment scenarios. He discusses failure conditions that influenced design choices and the trade-offs between monolithic and service-oriented database implementations.

Technologies & Tools

Orchestration

Kubernetes

The platform on which the Kubeflow framework runs to manage machine learning workflows.

Database

Apache Cassandra

Discussed for its replication techniques, including an adaptation for transient replication.

Cloud Services

AWS

Leveraged by DataXu for building a cloud-native data warehouse.

Search Engine

Elasticsearch

Migrated by HubSpot to enhance security and manageability.

Query Engine

Presto

Used to query data, with query plans optimized by its Cost-Based Optimizer.

Stream Processing

Kafka

Employed by Wayfair as part of its data plumbing.

Key Actionable Insights

1. Implementing a robust deletion framework is essential for managing user data privacy at scale.
As highlighted by Ben Strahs, deletion is crucial for user control over data. Organizations should develop systems that can automatically detect gaps and ensure completeness in data deletion processes.

2. Cloud-native architectures can reduce operational costs and improve scalability.
Suchi Raman's presentation on DataXu's transition to a cloud-native warehouse illustrates the benefits of using AWS services such as Glue Data Catalog and Athena for data processing.

3. Sampling techniques can help manage resource consumption in data analytics.
As discussed by Gabriela Jacques Da Silva and Donghui Zhang, sampling lets organizations handle growing data volumes by returning approximate results, provided the resulting uncertainty is carried through to aggregated metrics. That makes it a valuable strategy for data-heavy applications.

4. Building highly reliable data pipelines requires a focus on job isolation and ephemeral clusters.
Jeremy Karn emphasizes the importance of these practices at Datadog, where pipelines process trillions of data points daily. Isolating each job on its own short-lived cluster helps the pipeline recover quickly from failures.
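
As an illustration of job isolation on ephemeral clusters, the following Python sketch gives every attempt of a job its own short-lived cluster and tears it down afterwards, so a retry never inherits state from a failed run. It is a toy model under stated assumptions, not Datadog's implementation: `EphemeralCluster`, `cluster_for`, and `run_isolated` are hypothetical names, and the cluster is an in-memory stand-in rather than a real provisioning call.

```python
import itertools
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")
_cluster_ids = itertools.count(1)


class EphemeralCluster:
    """In-memory stand-in for a short-lived compute cluster (hypothetical)."""

    def __init__(self, job_name: str) -> None:
        self.name = f"{job_name}-cluster-{next(_cluster_ids)}"
        self.alive = True

    def terminate(self) -> None:
        self.alive = False


@contextmanager
def cluster_for(job_name: str) -> Iterator[EphemeralCluster]:
    """Provision a cluster for one attempt and always tear it down afterwards."""
    cluster = EphemeralCluster(job_name)
    try:
        yield cluster
    finally:
        cluster.terminate()


def run_isolated(job_name: str, job: Callable[[EphemeralCluster], T], max_attempts: int = 3) -> T:
    """Run `job` on a fresh cluster per attempt; no state is shared between attempts."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        with cluster_for(job_name) as cluster:
            try:
                return job(cluster)
            except Exception as exc:  # a failed attempt is retried on a new cluster
                last_error = exc
    raise RuntimeError(f"{job_name} failed after {max_attempts} attempts") from last_error
```

Because each attempt owns its cluster, one job's failure cannot degrade another job, and recovery amounts to rerunning the job from its inputs on new hardware.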

Common Pitfalls

1. Failing to adequately plan for data deletion can lead to compliance issues and loss of user trust.
As mentioned by Ben Strahs, organizations must ensure their deletion frameworks work across distributed systems and can confirm that data is completely and accurately deleted.
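
One way to picture automatic gap detection is a completeness check that compares what a deletion plan covers against everything the user's data actually reaches. The Python sketch below is a simplified illustration, not the framework described in the talk. It assumes data ownership can be modeled as a directed graph, and the function name `find_deletion_gaps` and the object identifiers are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set


def find_deletion_gaps(owns: Dict[str, Iterable[str]], root: str, planned: Set[str]) -> Set[str]:
    """Return objects reachable from `root` via ownership edges that the plan misses.

    `owns` maps an object id to the ids it owns (assumed model of the data graph).
    `planned` is the set of object ids the deletion job intends to remove.
    """
    reachable: Set[str] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in reachable:
            continue  # already visited; also guards against cycles
        reachable.add(node)
        queue.extend(owns.get(node, ()))
    # Anything the user owns that is not in the plan is a gap.
    return reachable - planned
```

For example, with `owns = {"user:1": ["post:1", "photo:1"], "post:1": ["comment:1"]}` and a plan that covers only the user, the post, and the photo, the check reports `{"comment:1"}` as data that would be left behind.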