Streamific, the Ingestion Service for Hadoop Big Data at Uber Engineering

Mrina Natarajan, Naveen Somasundaram
7 min read · intermediate

Overview

The article discusses Streamific, Uber's in-house ingestion service designed for efficiently streaming data into Hadoop's ecosystem. It highlights the complexities of data ingestion at Uber, the role of various technologies like Kafka and Schemaless, and the architecture of Streamific.

What You'll Learn

1. How to utilize Streamific for data ingestion in Hadoop
2. Why Kafka is used as an intermediary in data ingestion processes
3. When to implement custom solutions for data ingestion challenges

Prerequisites & Requirements

  • Understanding of data ingestion concepts and technologies like Kafka and Hadoop
  • Familiarity with Schemaless and HDFS (optional)

Key Questions Answered

How does Streamific streamline data ingestion at Uber?
Streamific routes data from Schemaless through a dedicated Kafka cluster to HDFS and HBase, which improves availability and consistency. Because far fewer HDFS files are open at once, ingestion is faster and more reliable.
What are the advantages of using Kafka in the ingestion process?
Kafka funnels the 4096 Schemaless shards into around 32 partitions, so HDFS is not overwhelmed by too many open files. It also serializes updates from different data centers into the same partition, which avoids conflicting updates, and it allows data to be reprocessed without affecting live database performance.
What are the main components of the Streamific architecture?
Streamific consists of several actors, including the source stream actor (SSA) for data origin, the destination stream actor (DSA) for target data storage, and the routing actor (RA) which manages state and checkpoints. This architecture facilitates efficient data flow and management.
What are the pros and cons of the Streamific approach?
The advantages of Streamific include fewer shards to manage, serialized data for consistency, and a uniform way to read from various sources. The drawback is the operational overhead of maintaining the Schemaless Kafka cluster; reducing that overhead is a long-term goal.
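
The actor roles described above can be illustrated with a small sketch. This is not Streamific's code: the class names, the in-memory lists standing in for Schemaless/Kafka and HDFS/HBase, and the batch size are all assumptions made for illustration. It only shows the idea of a routing actor that advances a checkpoint after each successful write, so a restarted router can resume where the previous one stopped.

```python
from dataclasses import dataclass, field


@dataclass
class SourceStreamActor:
    """Hypothetical SSA: serves records from an in-memory list by offset."""
    records: list

    def read_from(self, offset: int, batch_size: int) -> list:
        return self.records[offset:offset + batch_size]


@dataclass
class DestinationStreamActor:
    """Hypothetical DSA: a list stands in for the HDFS/HBase target."""
    written: list = field(default_factory=list)

    def write(self, batch: list) -> None:
        self.written.extend(batch)


@dataclass
class RoutingActor:
    """Hypothetical RA: moves batches from source to destination and keeps
    a checkpoint (the next offset to read) as its state."""
    source: SourceStreamActor
    destination: DestinationStreamActor
    checkpoint: int = 0

    def route_once(self, batch_size: int = 2) -> bool:
        batch = self.source.read_from(self.checkpoint, batch_size)
        if not batch:
            return False  # nothing left to route
        self.destination.write(batch)
        self.checkpoint += len(batch)  # advance only after the write succeeds
        return True


source = SourceStreamActor(records=["r0", "r1", "r2", "r3", "r4"])
sink = DestinationStreamActor()
router = RoutingActor(source, sink)

router.route_once()  # routes the first batch: r0, r1
saved_checkpoint = router.checkpoint

# Simulate a restart: a new routing actor resumes from the saved checkpoint.
resumed = RoutingActor(source, sink, checkpoint=saved_checkpoint)
while resumed.route_once():
    pass
```

In this sketch the checkpoint lives in memory; a real service would need to persist it somewhere durable, which the summary does not describe.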

Key Statistics & Figures

Number of shards after routing through Kafka
Around 32
Routing through Kafka reduces the 4096 Schemaless shards to around 32, so far fewer HDFS files are open at once.
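
To make the arithmetic concrete, here is a minimal sketch of folding 4096 shards into 32 partitions. The modulo mapping and the `partition_for_shard` helper are assumptions for illustration; the summary does not say how Streamific or Kafka actually assigns shards to partitions. The sketch also shows why updates to the same shard from two data centers end up in one ordered stream.

```python
from collections import Counter

NUM_SHARDS = 4096     # Schemaless shard count cited in the summary
NUM_PARTITIONS = 32   # approximate partition count cited in the summary


def partition_for_shard(shard_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a shard to a partition with a simple modulo (an assumed scheme)."""
    if not 0 <= shard_id < NUM_SHARDS:
        raise ValueError(f"shard_id out of range: {shard_id}")
    return shard_id % num_partitions


# Count how many shards land on each partition.
shards_per_partition = Counter(partition_for_shard(s) for s in range(NUM_SHARDS))

# Updates to the same shard from two data centers map to the same partition,
# so a single consumer sees them in one ordered stream.
update_dc1 = {"datacenter": "dc1", "shard": 1234}
update_dc2 = {"datacenter": "dc2", "shard": 1234}
same_partition = (
    partition_for_shard(update_dc1["shard"]) == partition_for_shard(update_dc2["shard"])
)

print(len(shards_per_partition), "partitions,", shards_per_partition[0], "shards each")
```

Under this assumed mapping each partition carries 128 shards, so a writer keeps on the order of 32 output files open instead of 4096.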

Technologies & Tools

Messaging System
Kafka
Used as an intermediary to manage data flow and prevent conflicts during ingestion.
Data Processing Framework
Hadoop
Serves as the primary platform for data storage and processing at Uber.
Storage System
HDFS
Used for storing large datasets ingested through Streamific.
NoSQL Database
HBase
Used in conjunction with HDFS for real-time data access.
Data Storage
Schemaless
Uber's custom data store that Streamific interacts with for data ingestion.
Cluster Management
Apache Helix
Provides fault tolerance and resource distribution for Streamific nodes.
Actor-based Concurrency Model
Akka
Facilitates asynchronous messaging between components in the Streamific architecture.
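
The asynchronous messaging that an actor model provides can be sketched with a mailbox queue. Akka runs on the JVM, so the example below is not Akka's API; it uses Python's standard `asyncio.Queue` as a stand-in mailbox, and the actor function names are hypothetical. The sender puts messages in the mailbox and moves on, while the receiver processes them one at a time.

```python
import asyncio


async def destination_actor(mailbox: asyncio.Queue, written: list) -> None:
    """Consume messages from the mailbox until a None sentinel arrives."""
    while True:
        message = await mailbox.get()
        if message is None:
            break
        written.append(message)


async def source_actor(mailbox: asyncio.Queue, records: list) -> None:
    """Send each record as a message without waiting for it to be processed."""
    for record in records:
        await mailbox.put(record)
    await mailbox.put(None)  # sentinel: no more messages


async def run_pipeline(records: list) -> list:
    mailbox: asyncio.Queue = asyncio.Queue()  # unbounded, so put() never blocks
    written: list = []
    await asyncio.gather(
        source_actor(mailbox, records),
        destination_actor(mailbox, written),
    )
    return written


result = asyncio.run(run_pipeline(["a", "b", "c"]))
```

Because the queue is first-in, first-out and has a single consumer, messages are handled in the order they were sent.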

Key Actionable Insights

1. A Streamific-style ingestion service can improve a data ingestion pipeline by centralizing data routing and processing.
This is particularly beneficial for organizations dealing with large volumes of data across multiple sources, as it streamlines the ingestion process and improves data accessibility.
2. Using Kafka as an intermediary can help manage data flow and prevent system overloads during peak processing times.
By reducing the number of simultaneously open HDFS files, organizations can maintain system performance and reliability, especially in high-demand environments.
3. Consider developing custom ingestion solutions tailored to your specific data architecture needs.
This approach allows for greater flexibility and control over data management, enabling organizations to adapt to evolving data challenges.

Common Pitfalls

1. Neglecting the operational overhead associated with maintaining a Kafka cluster can lead to increased costs and complexity.
Organizations should weigh the benefits of using Kafka against the resources required to manage it effectively, ensuring that the ingestion process remains efficient and cost-effective.

Related Concepts

Data Ingestion Strategies
Real-time Data Processing
Big Data Architecture Patterns