Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework and Library for Apache Hadoop

Danny Chen, Omkar Joshi

Overview

Marmaray is an open-source data ingestion and dispersal framework designed for Apache Hadoop, enabling Uber to manage large volumes of data efficiently. It allows users to ingest data from various sources and disperse it to different sinks while ensuring high data quality and reliability.

What You'll Learn

1
How to implement a data ingestion pipeline using Marmaray
2
Why using AvroPayload improves data handling in Hadoop environments
3
When to use ForkOperator for data quality management

Prerequisites & Requirements

  • Understanding of data ingestion and dispersal concepts
  • Familiarity with Apache Hadoop and Apache Spark (optional)

Key Questions Answered

How does Marmaray ensure data quality during ingestion?
Marmaray ensures data quality by conforming all ingested raw data to an appropriate source schema, filtering out any malformed or incomplete records. This process allows data scientists to focus on extracting insights rather than managing data quality issues.
What are the main use cases for Marmaray at Uber?
Marmaray is used at Uber for various applications, including enhancing restaurant recommendations in the Uber Eats app and improving data analytics capabilities in Uber Freight. It enables efficient ingestion and dispersal of data across different systems.
What challenges did Uber face with data ingestion before Marmaray?
Prior to Marmaray, Uber struggled with maintaining multiple ad-hoc ingestion pipelines, which became cumbersome as data volumes grew. The need for a reliable and scalable ingestion solution led to the development of Marmaray.
How does Marmaray track data delivery?
Marmaray tracks data delivery using custom-authored accumulators in Spark, allowing users to monitor data delivery with minimal overhead. This system ensures high confidence in data delivery rates, aiming for 99.99 to 99.999 percent accuracy.
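
The accumulator-based tracking described in the last answer can be sketched in plain Python. This is only an illustration of the pattern, not Marmaray's code and not the Spark API: per-partition counters are collected as a side effect of the write and merged afterwards, which is the role an accumulator plays in Spark. The names write_partition, run_job, delivery_ratio, and list_sink are hypothetical.

```python
from collections import Counter


def write_partition(records, sink):
    """Write one partition to the sink and return its local counters.

    Stand-in for a Spark task updating an accumulator: the counts are
    gathered while writing, with no extra pass over the data.
    """
    counts = Counter()
    for record in records:
        counts["read"] += 1
        try:
            sink(record)
            counts["written"] += 1
        except ValueError:
            counts["failed"] += 1
    return counts


def run_job(partitions, sink):
    """Merge the per-partition counters, as the driver does for accumulators."""
    totals = Counter()
    for partition in partitions:
        totals += write_partition(partition, sink)
    return totals


def delivery_ratio(totals):
    """Fraction of records read that reached the sink."""
    return totals["written"] / totals["read"] if totals["read"] else 1.0


# Hypothetical sink that rejects records without an "id" field.
delivered = []


def list_sink(record):
    if "id" not in record:
        raise ValueError("missing id")
    delivered.append(record)


partitions = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"name": "no id"}]]
totals = run_job(partitions, list_sink)
print(dict(totals), f"delivery ratio: {delivery_ratio(totals):.2%}")
```

Comparing the "written" count at the sink with the "read" count at the source is what lets a job report a delivery ratio without re-scanning either side.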

Key Statistics & Figures

Number of messages processed daily
greater than 100 billion
This statistic highlights the scale at which Marmaray operates, showcasing its capability to handle massive data volumes efficiently.
Percentage of data delivery accuracy
99.99 to 99.999 percent
This level of accuracy is critical for ensuring that data analytics are based on complete and reliable datasets.
Number of jobs onboarded through the self-service platform
over 3,300
This figure demonstrates the effectiveness of Marmaray's self-service capabilities in facilitating user adoption and operational efficiency.

Technologies & Tools

Backend
Apache Hadoop
Used as the primary data platform for managing large datasets.
Backend
Apache Spark
Serves as the main data processing engine for Marmaray.
Message Broker
Kafka
Used for handling real-time data streams during ingestion.
Database
Cassandra
Utilized for low-latency data storage and retrieval.
Data Management
Hudi
Manages storage of large analytical datasets and supports incremental updates.

Key Actionable Insights

1
Implementing a self-service platform like Marmaray can significantly reduce the onboarding time for data ingestion jobs.
By allowing users to set up pipelines through a user-friendly interface, organizations can empower teams to manage their data needs without deep technical expertise.
2
Utilizing AvroPayload for data processing can enhance performance and reduce network overhead.
Avro's compact binary encoding writes field values without repeating field names in every record, because the schema is carried separately. This keeps large-scale data transfers and processing in Hadoop environments efficient.
3
Consolidating ingestion pipelines into a single framework can streamline maintenance and reduce operational overhead.
Marmaray's design allows for a source-agnostic approach, simplifying the addition of new data sources and reducing the complexity of managing multiple codebases.
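
To make the second insight concrete, the sketch below hand-encodes one record using the rules the Avro specification gives for long and string values (zigzag variable-length integers, length-prefixed UTF-8) and compares the size with the same record as JSON. It is a minimal illustration covering only those two types, not a replacement for an Avro library and not Marmaray's AvroPayload; the record, the field list, and the function names are assumptions made for the example.

```python
import json


def encode_long(n: int) -> bytes:
    """Zigzag + variable-length encoding used by Avro for int and long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small negatives become small positives
    out = bytearray()
    while z & ~0x7F:  # more than 7 bits left
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "string": encode_string}


def encode_record(fields, record) -> bytes:
    """Concatenate field values in schema order; no field names are written."""
    return b"".join(ENCODERS[ftype](record[name]) for name, ftype in fields)


# Hypothetical schema and record, for illustration only.
fields = [("trip_id", "long"), ("city", "string")]
record = {"trip_id": 12345, "city": "SF"}

binary = encode_record(fields, record)
as_json = json.dumps(record).encode("utf-8")
print(len(binary), "bytes in binary form vs", len(as_json), "bytes as JSON")
```

The saving comes from the reader already knowing the schema: field names and types never travel with each record.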

Common Pitfalls

1
Failing to ensure data quality can lead to unreliable analytics.
Without proper validation and filtering of incoming data, organizations risk making decisions based on flawed insights, which can have significant operational impacts.
2
Overcomplicating ingestion pipelines can hinder scalability.
Maintaining multiple ad-hoc pipelines increases the complexity of the data architecture, making it difficult to adapt to changing data needs and to scale operations effectively.
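
The validation and filtering mentioned in the first pitfall can be sketched as a simple fork: each incoming record is checked against a source schema and routed either to a schema-conforming stream or to an error stream. This is a plain-Python illustration of the idea behind splitting valid and malformed records, not Marmaray's ForkOperator API; the schema, the sample records, and the names conforms and fork_by_schema are hypothetical.

```python
# Hypothetical source schema: field name -> expected Python type.
SOURCE_SCHEMA = {"trip_id": int, "city": str}


def conforms(record, schema):
    """True if every schema field is present with the expected type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in schema.items()
    )


def fork_by_schema(records, schema):
    """Split records into (valid, errors) without dropping anything."""
    valid, errors = [], []
    for record in records:
        (valid if conforms(record, schema) else errors).append(record)
    return valid, errors


raw = [
    {"trip_id": 1, "city": "SF"},
    {"trip_id": "two", "city": "NYC"},  # wrong type for trip_id
    {"city": "LA"},                     # missing trip_id
]
valid, errors = fork_by_schema(raw, SOURCE_SCHEMA)
print(len(valid), "valid;", len(errors), "routed to the error stream")
```

Keeping the rejected records in their own stream, rather than discarding them, makes it possible to inspect and replay them once the upstream problem is fixed.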

Related Concepts

Data Ingestion Frameworks
Data Quality Management
Apache Hadoop Ecosystem
Real-time Data Processing