The Present and Future of Apache Hadoop: A Community Meetup at LinkedIn

Overview

The article discusses a community meetup held at LinkedIn focused on Apache Hadoop, highlighting contributions from various organizations and key presentations on topics like TensorFlow on YARN, Hadoop encryption, and HDFS scalability. It emphasizes the importance of community collaboration in maintaining and evolving the Hadoop ecosystem.

What You'll Learn

1

How to implement TensorFlow on YARN using TonY

2

Why HDFS scalability is crucial for handling increased traffic

3

How to manage Hadoop encryption using KMS

4

When to utilize router-based federation for HDFS

Prerequisites & Requirements

  • Understanding of Hadoop and its ecosystem
  • Familiarity with TensorFlow and YARN(optional)

Key Questions Answered

What are the benefits of using TonY for TensorFlow on YARN?
TonY allows for distributed deep learning with TensorFlow on YARN, providing a seamless integration that enhances resource management and scalability. It supports additional runtimes like PyTorch, which broadens its applicability in machine learning workflows.
How does Hadoop handle encryption with KMS?
Hadoop uses a Key Management Server (KMS) to manage encryption keys, which is essential for complying with data protection regulations like GDPR. This system helps administrators encrypt sensitive data while managing the load on the KMS effectively.
What improvements were made to HDFS scalability?
Recent enhancements in HDFS allow clients to read metadata from standby NameNodes, significantly increasing the number of read requests the cluster can handle. This feature supports high availability and reduces latency during peak traffic.
What is the purpose of router-based federation in HDFS?
Router-based federation simplifies the management of multiple HDFS clusters by directing client requests to the appropriate cluster. This approach enhances scalability and reduces configuration complexity, allowing for better resource utilization.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Data Processing
Hadoop
Used for offline data infrastructure needs at LinkedIn.
Machine Learning
Tensorflow
Utilized for distributed deep learning via TonY on YARN.
Resource Management
Yarn
Framework for managing resources in Hadoop.
Data Format
Apache Parquet
Used for schema-controlled column-level access control via encryption.

Key Actionable Insights

1
Implementing TonY can significantly enhance your deep learning workflows on Hadoop.
By using TonY, you can leverage the power of TensorFlow within the Hadoop ecosystem, enabling efficient resource management and scalability for machine learning tasks.
2
Utilizing HDFS's new features can improve your data processing capabilities.
The ability to read from standby NameNodes allows for increased throughput and reduced latency, making it essential for organizations experiencing high traffic.
3
Managing encryption effectively is crucial for compliance and data security.
Understanding how to use KMS for encryption in Hadoop will help you meet regulatory requirements while ensuring that sensitive data is protected.

Common Pitfalls

1
Failing to keep up with Hadoop version upgrades can lead to compatibility issues.
Many operators are hesitant to upgrade from Hadoop 2 to 3 due to the complexity involved, which can result in missed features and improvements.

Related Concepts

Apache Spark
Apache Impala
Presto
Erasure Coding