Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning

Robbie Gruener, Owen Cheng, Yevgeni Litvin

Uber

16 min read, advanced

Apache, Apache Arrow, Apache Spark, Deep Learning, NumPy, PySpark, PyTorch, SQL

Overview

The article introduces Petastorm, an open-source data access library developed by Uber's Advanced Technologies Group (ATG) for facilitating deep learning model training and evaluation directly from large datasets stored in Apache Parquet format. It outlines the library's capabilities, architecture, and how it integrates with popular machine learning frameworks like TensorFlow and PyTorch.

What You'll Learn

1. How to efficiently manage large datasets for deep learning using Petastorm

2. Why Apache Parquet is beneficial for storing and accessing large datasets

3. When to use row predicates for efficient data selection in Petastorm

Prerequisites & Requirements

Understanding of deep learning concepts and frameworks
Familiarity with Apache Parquet and its advantages (optional)

Key Questions Answered

How does Petastorm facilitate data access for deep learning?

Petastorm enables efficient access to large datasets by providing a unified interface for reading data stored in Apache Parquet format. It supports both single-machine and distributed training, allowing seamless integration with machine learning frameworks like TensorFlow and PyTorch, thus streamlining the model training process.

What are the advantages of using Apache Parquet with Petastorm?

Apache Parquet offers several advantages such as efficient storage, fast access to individual columns, and compatibility with big data processing frameworks like Apache Spark. Petastorm leverages these benefits to optimize data loading and processing for deep learning tasks, making it easier for researchers to experiment with large datasets.

What features does Petastorm provide for handling datasets?

Petastorm includes features like efficient row filtering, data sharding, and support for time-series data. These capabilities allow researchers to manage large datasets effectively, enabling them to focus on model experimentation without worrying about data access complexities.
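
To make row filtering and data sharding concrete, here is a minimal conceptual sketch in plain Python. It is not Petastorm's implementation or API: the toy dataset, the round-robin assignment of row groups to shards, and the names `shard_row_groups`, `read_shard`, and `is_car` are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


def shard_row_groups(num_row_groups: int, cur_shard: int, shard_count: int) -> List[int]:
    """Return the row-group indices assigned to one shard (round-robin, an assumption)."""
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in [0, shard_count)")
    return [i for i in range(num_row_groups) if i % shard_count == cur_shard]


def read_shard(
    row_groups: List[List[Row]],
    cur_shard: int,
    shard_count: int,
    predicate: Callable[[Row], bool],
) -> Iterator[Row]:
    """Yield rows from this shard's row groups that satisfy the row predicate."""
    for index in shard_row_groups(len(row_groups), cur_shard, shard_count):
        for row in row_groups[index]:
            if predicate(row):
                yield row


# Hypothetical toy dataset: 4 row groups of 3 rows each, ids 0..11.
ROW_GROUPS: List[List[Row]] = [
    [
        {"id": g * 3 + k, "label": "car" if (g * 3 + k) % 2 == 0 else "bike"}
        for k in range(3)
    ]
    for g in range(4)
]


def is_car(row: Row) -> bool:
    """Row predicate: keep only rows labeled 'car'."""
    return row["label"] == "car"


# Two trainers each read a disjoint half of the row groups.
shard0 = list(read_shard(ROW_GROUPS, cur_shard=0, shard_count=2, predicate=is_car))
shard1 = list(read_shard(ROW_GROUPS, cur_shard=1, shard_count=2, predicate=is_car))
```

Because each shard owns whole row groups, the two trainers never read the same rows, and the predicate discards unwanted rows before they reach the training loop.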

Key Statistics & Figures

Number of files in datasets used by Uber ATG

over 100 million files

This statistic highlights the scale of data management challenges faced by the team.

Technologies & Tools

Storage

Apache Parquet

Used for efficient storage and access of large datasets in Petastorm.

Framework

TensorFlow

One of the machine learning frameworks supported by Petastorm.

Framework

PyTorch

Another machine learning framework that integrates with Petastorm.

Processing

Apache Spark

Used to generate Petastorm datasets and to query them for analysis.

Key Actionable Insights

1. Utilize Petastorm to streamline your data access for deep learning projects.
By integrating Petastorm into your workflow, you can efficiently manage large datasets, enhancing your model training and evaluation processes.

2. Leverage Apache Parquet's columnar storage to optimize data retrieval.
Using columnar storage allows you to load only the necessary data, significantly reducing memory usage and improving performance during model training.
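
The effect of reading only the needed columns can be sketched with a toy in-memory table. This is a plain-Python illustration of the idea, not Parquet or Petastorm code; the column names, the payload sizes, and the helpers `read_columns` and `payload_bytes` are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical toy table stored column by column, as a columnar format would lay it out.
COLUMNS: Dict[str, List[object]] = {
    "frame_id": [0, 1, 2],
    "label": ["car", "bike", "car"],
    "image": [b"\x00" * 1000, b"\x01" * 1000, b"\x02" * 1000],  # large payload column
}


def read_columns(table: Dict[str, List[object]], wanted: Sequence[str]) -> List[Dict[str, object]]:
    """Materialize rows from only the requested columns; other columns are never touched."""
    missing = [name for name in wanted if name not in table]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    selected = [table[name] for name in wanted]
    return [dict(zip(wanted, values)) for values in zip(*selected)]


def payload_bytes(rows: List[Dict[str, object]]) -> int:
    """Total size of the bytes-typed values in the loaded rows."""
    return sum(len(v) for row in rows for v in row.values() if isinstance(v, bytes))


# A label-statistics job needs two small columns and skips the image payload entirely.
labels_only = read_columns(COLUMNS, ["frame_id", "label"])
everything = read_columns(COLUMNS, list(COLUMNS))
```

In this toy case the projected read loads no image bytes at all, while the full read loads all 3,000 of them; with real image or sensor columns the gap is correspondingly larger.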

Common Pitfalls

1. Failing to optimize row group sizes can lead to out-of-memory errors during data processing.

A row group is loaded into memory as a unit, so row groups that are too large can exhaust available memory, while very small row groups make reads less efficient. Choose a row group size that fits comfortably in memory without giving up read performance.
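
A back-of-the-envelope calculation helps when choosing a row group size. The sketch below is only a rough estimate under stated assumptions: the 256 MiB target, the 2 MiB average row size, and the 10 concurrent readers are hypothetical numbers, and decoded rows can occupy more memory than their stored size.

```python
MIB = 1024 * 1024


def rows_per_row_group(target_group_bytes: int, avg_row_bytes: int) -> int:
    """Number of rows that fit in a row group of roughly the target size."""
    if target_group_bytes <= 0 or avg_row_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Always keep at least one row per group, even if a single row exceeds the target.
    return max(1, target_group_bytes // avg_row_bytes)


def peak_buffer_bytes(rows_per_group: int, avg_row_bytes: int, concurrent_groups: int) -> int:
    """Rough memory needed when several row groups are held in memory at once."""
    return rows_per_group * avg_row_bytes * concurrent_groups


# Hypothetical numbers: 256 MiB target groups, 2 MiB rows, 10 readers loading in parallel.
rows = rows_per_row_group(256 * MIB, 2 * MIB)
peak = peak_buffer_bytes(rows, 2 * MIB, concurrent_groups=10)
```

With these assumed numbers each row group holds 128 rows, and ten readers together hold about 2,560 MiB, which shows how a group size that looks modest on its own can still exceed a machine's memory once reads run in parallel.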

Related Concepts

Data Management In Deep Learning

Apache Parquet Storage Format

Distributed Training Techniques