Data Wrangling at Slack

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem that helps people in the company quickly and easily answer questions about usage, so they can make better, data-informed decisions:…

Ronnie Chen
11 min read, intermediate

Overview

The article describes how the Data Engineering team at Slack wrangles user data: the tools and strategies it uses to handle that data efficiently, the interoperability problems that arise among its data processing engines, and the solutions it implemented to maintain data integrity.

What You'll Learn

1. How to implement a data processing pipeline using Hive, Presto, and Spark
2. Why using a common data format like Parquet is essential for data interoperability
3. When to use Amazon EMR for creating ephemeral clusters for data processing

Prerequisites & Requirements

  • Understanding of data warehousing concepts
  • Familiarity with AWS services, particularly S3 and EMR (optional)

Key Questions Answered

How does Slack manage data from multiple sources?
Slack collects data from various sources, including MySQL databases, servers, clients, and job queues, and pushes this data to Amazon S3. This allows for centralized data management, enabling efficient querying and analysis using tools like Hive, Presto, and Spark.
What challenges does Slack face with data processing engines?
Slack encounters issues with interoperability among Hive, Presto, and Spark due to differences in how they handle the Parquet format. These discrepancies can lead to data being written in a way that is unreadable by other tools, causing significant operational challenges.
What is the role of the Hive Metastore in Slack's data architecture?
The Hive Metastore serves as the ground truth for data schema at Slack, ensuring that all processing engines are aware of the latest schema. This helps maintain consistency and integrity across different data processing tasks.
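
To make the "ground truth" idea concrete, here is a minimal sketch of checking a data file's columns against a single authoritative schema. It is an illustration only, not Slack's implementation: `METASTORE_SCHEMA` is a hypothetical stand-in for a table definition held in the Hive Metastore, the column names and types are invented, and `diff_schema` is a hypothetical helper.

```python
from typing import Dict, List

# Hypothetical stand-in for a table definition stored in the Hive Metastore.
# Column names and types are invented for illustration.
METASTORE_SCHEMA: Dict[str, str] = {
    "user_id": "bigint",
    "team_id": "bigint",
    "event": "string",
}


def diff_schema(truth: Dict[str, str], file_schema: Dict[str, str]) -> List[str]:
    """Return readable differences between the ground-truth schema and the
    schema found in a data file. An empty list means the two agree."""
    problems = []
    for name, col_type in truth.items():
        if name not in file_schema:
            problems.append(f"missing column: {name}")
        elif file_schema[name] != col_type:
            problems.append(
                f"type mismatch on {name}: expected {col_type}, found {file_schema[name]}"
            )
    for name in file_schema:
        if name not in truth:
            problems.append(f"unexpected column: {name}")
    return problems


# A file written with one column typed differently and one extra column.
written = {"user_id": "bigint", "team_id": "int", "event": "string", "ts": "bigint"}
for problem in diff_schema(METASTORE_SCHEMA, written):
    print(problem)
```

A check like this could run after each job writes its output, so that drift is caught before another engine tries to read the files.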

Technologies & Tools

  • Amazon S3 (Storage): Used as the central data warehouse for storing and querying data.
  • Hive (Data Processing): Used for handling larger datasets and performing ETL tasks.
  • Presto (Data Processing): Optimized for interactive queries and ad-hoc analysis.
  • Spark (Data Processing): Used for batch processing and data aggregation tasks.
  • Kafka (Data Streaming): Used for collecting and streaming data to S3.
  • Secor (Data Processing): Used for persisting data from Kafka to S3.

Key Actionable Insights

1. Implement a centralized data warehouse using Amazon S3 to streamline data collection from various sources. This approach allows for easier querying and analysis, enabling better data-driven decision-making across teams.
2. Utilize a common data format like Parquet to enhance the interoperability of different processing engines. By standardizing data formats, you can minimize compatibility issues and ensure that data can be accessed and processed by multiple tools without errors.
3. Regularly review and upgrade your data processing tools to incorporate bug fixes and performance improvements. Upgrading tools like EMR can resolve existing issues but requires careful management to avoid introducing new incompatibilities.
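
One way to catch incompatibilities introduced by an upgrade is a round-trip matrix: write a small sample with every engine, then read it back with every engine. The sketch below shows only the shape of such a check. The writers and readers are hypothetical stand-ins that use JSON in place of Parquet, and the engine names are placeholders; a real check would call the actual engines against real files.

```python
import json
from itertools import product
from typing import Callable, Dict, List, Tuple

Rows = List[dict]


# Hypothetical stand-ins for two engines. "engine_b" writes a layout that
# the "engine_a" reader does not understand.
def write_plain(rows: Rows) -> bytes:
    return json.dumps(rows).encode()


def write_wrapped(rows: Rows) -> bytes:
    return json.dumps({"rows": rows}).encode()


def read_plain(data: bytes) -> Rows:
    decoded = json.loads(data)
    if not isinstance(decoded, list):
        raise ValueError("unsupported layout")
    return decoded


def read_either(data: bytes) -> Rows:
    decoded = json.loads(data)
    return decoded["rows"] if isinstance(decoded, dict) else decoded


def check_round_trips(
    writers: Dict[str, Callable[[Rows], bytes]],
    readers: Dict[str, Callable[[bytes], Rows]],
    sample: Rows,
) -> List[Tuple[str, str]]:
    """Return the (writer, reader) pairs whose round trip fails or
    does not reproduce the sample exactly."""
    failures = []
    for (w_name, write), (r_name, read) in product(writers.items(), readers.items()):
        try:
            ok = read(write(sample)) == sample
        except Exception:
            ok = False
        if not ok:
            failures.append((w_name, r_name))
    return failures


writers = {"engine_a": write_plain, "engine_b": write_wrapped}
readers = {"engine_a": read_plain, "engine_b": read_either}
sample = [{"user_id": 1, "event": "login"}]
failures = check_round_trips(writers, readers, sample)
print(failures)
```

Running the matrix before and after an upgrade shows which writer and reader pairs changed behavior.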

Common Pitfalls

1. Failing to manage schema evolution can lead to data being misaligned across different processing engines. When schemas change, old and new data files may not match, causing errors during data processing. To avoid this, it's crucial to implement a strategy for managing schema changes effectively.
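
One common strategy for handling schema changes is to match columns by name rather than by position when reading older files. The sketch below illustrates that idea in plain Python. It is an assumption for illustration rather than a description of what Slack does, and `align_to_schema` and the column names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def align_to_schema(
    row: Dict[str, Any], current_columns: List[str]
) -> Dict[str, Optional[Any]]:
    """Project a row onto the current schema by column name.

    Columns that an older file lacks become None; columns that are no
    longer in the schema are dropped.
    """
    return {name: row.get(name) for name in current_columns}


# The current schema gained a "client" column after old_row was written.
current_columns = ["user_id", "team_id", "client"]
old_row = {"user_id": 1, "team_id": 7}
new_row = {"user_id": 2, "team_id": 7, "client": "desktop"}

aligned = [align_to_schema(r, current_columns) for r in (old_row, new_row)]
print(aligned)
```

Matching by name keeps values attached to the right column even if columns are added or reordered, whereas matching by position can silently assign a value to the wrong column.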

Related Concepts

Data Warehousing
Data Processing Frameworks
Big Data Analytics