Streamific, the Ingestion Service for Hadoop Big Data at Uber Engineering

Mrina Natarajan, Naveen Somasundaram

Uber

Overview

The article discusses Streamific, Uber's in-house ingestion service designed for efficiently streaming data into Hadoop's ecosystem. It highlights the complexities of data ingestion at Uber, the role of various technologies like Kafka and Schemaless, and the architecture of Streamific.

What You'll Learn

1. How to utilize Streamific for data ingestion in Hadoop
2. Why Kafka is used as an intermediary in data ingestion processes
3. When to implement custom solutions for data ingestion challenges

Prerequisites & Requirements

Understanding of data ingestion concepts and technologies like Kafka and Hadoop
Familiarity with Schemaless and HDFS (optional)

Key Questions Answered

How does Streamific streamline data ingestion at Uber?

Streamific simplifies data ingestion by routing data through a Schemaless Kafka cluster to HDFS and HBase, ensuring higher availability and consistency. It reduces the number of open HDFS files, improving performance and reliability during data processing.

What are the advantages of using Kafka in the ingestion process?

Kafka reduces the number of shards from 4096 to around 32, preventing HDFS from being overwhelmed by too many open files. It also serializes data from different data centers to the same partition, avoiding conflicting updates and allowing for reprocessing without impacting live database performance.
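The shard reduction and the ordering of cross-data-center updates can be illustrated with a minimal sketch. It assumes that the roughly 32 units are Kafka partitions and that shards are assigned by a simple modulo rule; the article summary does not state the actual mapping, and `partition_for_shard` and `route` are hypothetical names used only for illustration.

```python
from collections import defaultdict

NUM_SHARDS = 4096      # Schemaless shard count quoted above
NUM_PARTITIONS = 32    # assumed: the "around 32" figure treated as Kafka partitions


def partition_for_shard(shard_id: int) -> int:
    """Map a shard to a partition (modulo is an assumed, illustrative rule)."""
    if not 0 <= shard_id < NUM_SHARDS:
        raise ValueError(f"shard_id out of range: {shard_id}")
    return shard_id % NUM_PARTITIONS


def route(updates):
    """Append each (datacenter, shard_id, payload) update to its partition log."""
    logs = defaultdict(list)
    for datacenter, shard_id, payload in updates:
        logs[partition_for_shard(shard_id)].append((datacenter, shard_id, payload))
    return logs


updates = [
    ("dc1", 40, "v1"),
    ("dc2", 40, "v2"),   # same shard, written from a different data center
    ("dc1", 4095, "x"),
]
logs = route(updates)

# Each partition carries the traffic of this many shards.
shards_per_partition = NUM_SHARDS // NUM_PARTITIONS
```

Because both updates to shard 40 land in the same partition log, a consumer sees them in one fixed order rather than as two competing writes, and a downstream writer needs one output stream per partition instead of one per shard.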

What are the main components of the Streamific architecture?

Streamific consists of several actors, including the source stream actor (SSA) for data origin, the destination stream actor (DSA) for target data storage, and the routing actor (RA) which manages state and checkpoints. This architecture facilitates efficient data flow and management.
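The division of labor between the three actors can be sketched as follows. This is a simplified, single-threaded illustration in plain Python rather than Akka; the class names, the `read_from`, `write`, and `run_once` methods, and the in-memory lists standing in for the source and destination are all assumptions, not Streamific's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class SourceStreamActor:
    """Hypothetical SSA: serves records from an in-memory list by offset."""
    records: list

    def read_from(self, offset: int, max_records: int) -> list:
        return self.records[offset:offset + max_records]


@dataclass
class DestinationStreamActor:
    """Hypothetical DSA: collects records in a list standing in for HDFS or HBase."""
    written: list = field(default_factory=list)

    def write(self, batch: list) -> None:
        self.written.extend(batch)


@dataclass
class RoutingActor:
    """Hypothetical RA: moves batches from source to destination and tracks a checkpoint."""
    source: SourceStreamActor
    destination: DestinationStreamActor
    checkpoint: int = 0  # offset of the next record to read

    def run_once(self, batch_size: int = 2) -> int:
        batch = self.source.read_from(self.checkpoint, batch_size)
        if batch:
            self.destination.write(batch)
            # Advance the checkpoint only after the write has completed.
            self.checkpoint += len(batch)
        return len(batch)


ssa = SourceStreamActor(records=["a", "b", "c", "d", "e"])
dsa = DestinationStreamActor()
ra = RoutingActor(ssa, dsa)
while ra.run_once():
    pass

# A restarted router resumes from the saved checkpoint instead of re-reading everything.
ssa.records.append("f")
resumed = RoutingActor(ssa, dsa, checkpoint=ra.checkpoint)
resumed.run_once()
```

Keeping the checkpoint in the routing actor, rather than in the source or destination, means either end can be swapped out without changing how progress is tracked.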

What are the pros and cons of the Streamific approach?

The advantages of Streamific include fewer shards to manage, serialized data for consistency, and a uniform way to read from various sources. The main drawback is the operational overhead of maintaining the Schemaless Kafka cluster; reducing that overhead is a long-term goal.

Key Statistics & Figures

Number of shards after routing through Kafka

Around 32

Routing through Kafka reduces the 4096 Schemaless shards to around 32, which keeps the number of simultaneously open HDFS files manageable.

Technologies & Tools

Messaging System

Kafka

Used as an intermediary to manage data flow and prevent conflicts during ingestion.

Data Processing Framework

Hadoop

Serves as the primary platform for data storage and processing at Uber.

Storage System

HDFS

Used for storing large datasets ingested through Streamific.

NoSQL Database

HBase

Used in conjunction with HDFS for real-time data access.

Data Storage

Schemaless

Uber's custom data store that Streamific interacts with for data ingestion.

Cluster Management

Apache Helix

Provides fault tolerance and resource distribution for Streamific nodes.

Actor-based Concurrency Model

Akka

Facilitates asynchronous messaging between components in the Streamific architecture.
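The resource distribution and fault tolerance attributed to Apache Helix above can be pictured with a small sketch. It does not use the Helix API; the round-robin policy, the `assign` function, and the node names are assumptions chosen only to show partitions being spread across nodes and reassigned when one node disappears.

```python
def assign(partitions, nodes):
    """Spread partitions across nodes round-robin (an assumed, simplified policy)."""
    if not nodes:
        raise ValueError("at least one live node is required")
    assignment = {node: [] for node in nodes}
    for i, partition in enumerate(partitions):
        assignment[nodes[i % len(nodes)]].append(partition)
    return assignment


partitions = list(range(32))
nodes = ["node-a", "node-b", "node-c", "node-d"]
before = assign(partitions, nodes)

# Simulate a node failure: recompute the assignment over the surviving nodes.
survivors = [n for n in nodes if n != "node-b"]
after = assign(partitions, survivors)
```

A real cluster manager would also try to limit how many partitions move during a rebalance; this sketch recomputes everything from scratch for brevity.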

Key Actionable Insights

1. A Streamific-style ingestion service can make a data ingestion pipeline more efficient by routing and processing data through a single path. This is particularly beneficial for organizations dealing with large volumes of data across multiple sources, as it streamlines the ingestion process and improves data accessibility.

2. Utilizing Kafka as an intermediary can help manage data flow and prevent system overloads during peak processing times. By reducing the number of simultaneous open HDFS files, organizations can maintain system performance and reliability, especially in high-demand environments.

3. Consider developing custom ingestion solutions tailored to your specific data architecture needs. This approach allows for greater flexibility and control over data management, enabling organizations to adapt to evolving data challenges effectively.

Common Pitfalls

1. Neglecting the operational overhead associated with maintaining a Kafka cluster can lead to increased costs and complexity. Organizations should weigh the benefits of using Kafka against the resources required to manage it effectively, ensuring that the ingestion process remains efficient and cost-effective.

Related Concepts

Data Ingestion Strategies

Real-time Data Processing

Big Data Architecture Patterns