Open Sourcing TonY: Native Support of TensorFlow on Hadoop

Jonathan Hung

Tags: Apache, Apache Spark, Machine Learning, Python, TensorBoard, TensorFlow

Overview

The article discusses the open sourcing of TonY, a framework designed to enable native support for TensorFlow on Hadoop. It highlights the challenges faced in integrating TensorFlow with Hadoop's capabilities and details the features and architecture of TonY, along with experimental results demonstrating its effectiveness.

What You'll Learn

1. How to run distributed TensorFlow jobs on Hadoop using TonY

2. Why GPU scheduling is important for TensorFlow training on Hadoop

3. When to use TonY for large-scale machine learning applications

Prerequisites & Requirements

Understanding of TensorFlow and Hadoop
Familiarity with YARN and distributed computing concepts (optional)

Key Questions Answered

What is TonY and how does it support TensorFlow on Hadoop?

TonY is a framework that allows TensorFlow to run natively on Hadoop by handling resource negotiation and container management. It enables users to leverage Hadoop's computing power for distributed TensorFlow training, addressing the complexities of orchestrating such jobs.

What are the main components of TonY and their functions?

TonY consists of three main components: Client, ApplicationMaster, and TaskExecutor. The Client submits the TensorFlow job, the ApplicationMaster negotiates resources with YARN, and the TaskExecutors execute the training code on allocated nodes.
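The hand-off between these components can be pictured with a small sketch. It assumes that each TaskExecutor reports a host:port address to the ApplicationMaster, and that the resulting cluster layout is passed to the training code through the TF_CONFIG environment variable, which is the JSON convention distributed TensorFlow reads. The summary above does not spell out either step, so treat both as illustrative assumptions. The function names build_cluster_spec and tf_config_for are hypothetical and are not part of TonY's API.

```python
import json


def build_cluster_spec(registrations):
    """Group (job_name, task_index, "host:port") tuples into a cluster spec.

    Sketch of the ApplicationMaster side: once every TaskExecutor has
    reported its address, the addresses are grouped by job name and
    ordered by task index.
    """
    cluster = {}
    for job_name, _task_index, address in sorted(registrations):
        cluster.setdefault(job_name, []).append(address)
    return cluster


def tf_config_for(cluster, job_name, task_index):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    return json.dumps(
        {"cluster": cluster, "task": {"type": job_name, "index": task_index}}
    )


if __name__ == "__main__":
    # Hypothetical addresses, registered in arbitrary order.
    registrations = [
        ("worker", 1, "host-b:2222"),
        ("ps", 0, "host-c:2222"),
        ("worker", 0, "host-a:2222"),
    ]
    cluster = build_cluster_spec(registrations)
    # A TaskExecutor would export this value before launching the training script.
    print(tf_config_for(cluster, "worker", 1))
```

Each task receives the same cluster section and a different task section, which is how a single training script can act as either a worker or a parameter server.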

What experimental results were achieved using TonY?

With TonY, an Inception v3 model trained on 8 workers using GPUs reached a top-5 error rate of 26.3% after 100,000 steps. GPU training ran about four times faster than CPU training for the same model, which the article presents as evidence of TonY's effectiveness.

Key Statistics & Figures

Top-5 error rate

26.3%

Achieved after training the Inception v3 model with 8 workers using GPU training.
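For readers unfamiliar with the metric, top-5 error is the fraction of examples whose true label is not among the model's five highest-scoring predictions. The sketch below shows that standard definition in plain Python. The function name top_k_error and the toy scores are illustrative only and do not come from the article or its experiments.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is outside the top-k predictions.

    scores: one list of per-class scores per example.
    labels: the true class index for each example.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


if __name__ == "__main__":
    # Toy data with 6 classes; class 0 has the lowest score in both rows.
    toy_scores = [
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    ]
    toy_labels = [0, 5]
    print(top_k_error(toy_scores, toy_labels, k=5))
```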

Speedup factor

4 times

Observed when comparing GPU training to CPU training for the same model.

Technologies & Tools

Machine Learning Framework

TensorFlow

Used for building and training deep learning models.

Big Data Platform

Hadoop

Provides the underlying infrastructure for data storage and processing.

Resource Management

YARN

Manages resources in the Hadoop cluster for running TonY applications.

Key Actionable Insights

1. Leverage TonY for efficient distributed TensorFlow training on Hadoop to maximize resource utilization.
TonY simplifies the orchestration of TensorFlow jobs on Hadoop, making it easier for data scientists to focus on model development rather than infrastructure management.

2. Utilize GPU scheduling capabilities in TonY to ensure optimal performance for deep learning tasks.
With native GPU support in Hadoop, TonY allows users to request specific GPU resources, enhancing the training speed and efficiency of machine learning models.

3. Implement fault tolerance mechanisms in your TensorFlow jobs using TonY to safeguard against node failures.
TonY's ability to restart applications from checkpoints ensures that long-running training processes can recover from transient errors, maintaining productivity.
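Restarting from a checkpoint only helps if the training code itself resumes from the last saved step. The following is a minimal sketch of that pattern in plain Python, not TonY or TensorFlow code. The ckpt-<step> file naming, the function names, and the empty training step are all assumptions made for illustration; a real job would typically rely on TensorFlow's own checkpointing utilities.

```python
import os
import re
import tempfile

# Hypothetical naming scheme: one file per checkpoint, e.g. "ckpt-20".
CHECKPOINT_PATTERN = re.compile(r"^ckpt-(\d+)$")


def latest_checkpoint_step(checkpoint_dir):
    """Return the highest step with a saved checkpoint, or None if there is none."""
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def run_training(checkpoint_dir, total_steps, save_every):
    """Resume after the latest checkpoint and return the step resumed from."""
    last_step = latest_checkpoint_step(checkpoint_dir)
    start = 0 if last_step is None else last_step
    for step in range(start + 1, total_steps + 1):
        # A real job would run one training step here.
        if step % save_every == 0:
            with open(os.path.join(checkpoint_dir, f"ckpt-{step}"), "w") as f:
                f.write(str(step))
    return start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        # First attempt stops at step 25, as if the job had been interrupted.
        run_training(ckpt_dir, total_steps=25, save_every=10)
        # The restarted attempt resumes from the last saved checkpoint.
        resumed_from = run_training(ckpt_dir, total_steps=40, save_every=10)
        print("resumed from step", resumed_from)
```

Work done after the last checkpoint is lost on restart, so the save interval trades checkpoint overhead against the amount of recomputation after a failure.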

Common Pitfalls

1. Assuming that existing solutions like TensorFlow on Spark will meet all requirements for distributed TensorFlow training.

The article highlights that while TensorFlow on Spark was initially used, it lacked critical features like GPU scheduling, leading to the development of TonY, which provides a more tailored solution.

Related Concepts

Distributed Machine Learning

TensorFlow Architecture

Hadoop Ecosystem