Aegisthus is Now Part of NetflixOSS

Netflix Technology Blog

Netflix

Cassandra, MySQL, Oracle

Overview

The article announces that Aegisthus, a map/reduce program for reading Cassandra SSTables, is now open source as part of NetflixOSS. It discusses the evolution of Aegisthus, its integration with other Netflix data platform services, and its role in processing large datasets efficiently.

What You'll Learn

1. How to leverage Aegisthus for processing Cassandra SSTables
2. Why integrating Aegisthus with Genie enhances job scheduling
3. How to implement incremental data processing with Aegisthus

Prerequisites & Requirements

Understanding of map/reduce concepts and data processing pipelines
Familiarity with Hadoop and Cassandra (optional)

Key Questions Answered

How does Aegisthus process data from Cassandra?

Aegisthus reads SSTables from Cassandra, compacts records to eliminate duplicates, and serializes the data to JSON for batch processing. It uses a custom SSTableReader that is about 2-3 times faster than Cassandra's internal file reader, and its jobs are launched through Genie.
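
The compaction and serialization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not Aegisthus's actual code: the `Column` type, the `compact` and `to_json_lines` functions, and the JSON field names (`rowKey`, `columns`) are all hypothetical, and the sketch assumes that the newest write timestamp wins when the same column appears in more than one SSTable.

```python
import json
from typing import Dict, Iterable, List, NamedTuple


class Column(NamedTuple):
    """One column version as it might be read from an SSTable (hypothetical shape)."""
    row_key: str
    name: str
    value: str
    timestamp: int  # write time; larger means newer


def compact(columns: Iterable[Column]) -> Dict[str, Dict[str, Column]]:
    """Keep only the newest version of each (row key, column name) pair."""
    rows: Dict[str, Dict[str, Column]] = {}
    for col in columns:
        row = rows.setdefault(col.row_key, {})
        current = row.get(col.name)
        # Assumption: the highest timestamp wins; ties keep the first version seen.
        if current is None or col.timestamp > current.timestamp:
            row[col.name] = col
    return rows


def to_json_lines(rows: Dict[str, Dict[str, Column]]) -> List[str]:
    """Serialize each compacted row as one JSON document, ordered by row key."""
    lines = []
    for row_key in sorted(rows):
        payload = {
            "rowKey": row_key,
            "columns": {
                name: {"value": col.value, "timestamp": col.timestamp}
                for name, col in rows[row_key].items()
            },
        }
        lines.append(json.dumps(payload, sort_keys=True))
    return lines
```

Feeding the columns from several SSTables through `compact` and then `to_json_lines` yields one JSON line per row key, with duplicate column versions removed.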

What improvements were made to Aegisthus since its initial release?

Since its initial release, Aegisthus has been refactored to leverage Genie for job management, allowing better scalability and integration with Netflix's data platform. The configuration of jobs is now stored in Franklin, enhancing metadata management and data lineage.

What is the role of Franklin in Aegisthus?

Franklin serves as a metadata service that stores information about datasets, including their origins and how to consume them. This integration allows Aegisthus to maintain clear data lineage and simplifies job scheduling.

How does Aegisthus handle incremental data processing?

Aegisthus processes incremental data by reading the most recently flushed SSTables and applying them to the latest JSON dataset. Because only the new SSTables are read, the job handles roughly 10-20 times less data each day.
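
As a rough sketch of the incremental step, the snippet below applies a small batch of newly flushed columns to an existing dataset. The data layout is an assumption made for illustration: each cell is a `(value, timestamp)` pair, a value of `None` stands in for a tombstone, and `apply_increment` and `live_rows` are hypothetical helper names rather than part of Aegisthus.

```python
from typing import Dict, Optional, Tuple

# column name -> (value, timestamp); a value of None marks a delete (tombstone).
Cell = Tuple[Optional[str], int]
Dataset = Dict[str, Dict[str, Cell]]


def apply_increment(base: Dataset, increment: Dataset) -> Dataset:
    """Return a new dataset with the incremental cells applied on top of base."""
    merged: Dataset = {row_key: dict(cols) for row_key, cols in base.items()}
    for row_key, cols in increment.items():
        row = merged.setdefault(row_key, {})
        for name, (value, ts) in cols.items():
            existing = row.get(name)
            # Only newer writes replace what is already there, so a late-arriving
            # older cell cannot overwrite a newer value or resurrect a delete.
            if existing is None or ts > existing[1]:
                row[name] = (value, ts)
    return merged


def live_rows(dataset: Dataset) -> Dict[str, Dict[str, str]]:
    """Drop tombstoned cells and empty rows, leaving only visible values."""
    live: Dict[str, Dict[str, str]] = {}
    for row_key, cols in dataset.items():
        visible = {name: value for name, (value, _ts) in cols.items() if value is not None}
        if visible:
            live[row_key] = visible
    return live
```

Tombstones are kept in the merged dataset and filtered only when reading, which is one simple way to stop an older write in a later batch from bringing deleted data back.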

Key Statistics & Figures

Daily datasets processed

Over 100 datasets

Aegisthus is currently processing over 100 datasets daily, representing more than 20TB of incremental SSTables.

Data processing efficiency

10-20 times less data

By compacting the current JSON dataset with only the incremental data, rather than reprocessing a full snapshot, Aegisthus processes 10-20 times less data each day.

Performance increase

2-3 times faster

Aegisthus achieves a speed increase of about 2-3 times compared to using Cassandra's internal file reader.

Technologies & Tools

Database

Cassandra

Used as the primary data source for Aegisthus.

Data Processing Framework

Hadoop

Aegisthus operates as a map/reduce job within a Hadoop cluster.

Job Scheduling

Genie

Used to launch Aegisthus jobs on EMR-based Hadoop clusters.

Metadata Service

Franklin

Provides metadata management for datasets processed by Aegisthus.

Key Actionable Insights

1. Integrate Aegisthus with your data processing pipeline to streamline the handling of Cassandra data.
   By using Aegisthus, you can efficiently convert Cassandra SSTables into a format suitable for analysis, which is crucial for organizations dealing with large volumes of data.

2. Utilize Franklin for enhanced metadata management in your data workflows.
   Storing job configurations and dataset origins in Franklin helps maintain data lineage and simplifies the management of complex data processing tasks.

3. Adopt incremental processing strategies to minimize data handling and improve performance.
   Processing only the changes in data rather than the entire dataset can lead to significant performance gains, especially in environments with large datasets.

Common Pitfalls

1. Neglecting to periodically resync data to a full snapshot can lead to inconsistencies.
   Without regular verification against a full snapshot, expired data and tombstones may go unhandled, so outdated or incorrect data can end up being processed.
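
One way to picture such a verification pass is a simple comparison between the incrementally maintained dataset and a dataset rebuilt from a full snapshot. The sketch below is illustrative only: `diff_against_snapshot` and the report keys are hypothetical names, and both datasets are assumed to have already been reduced to plain row key to column-value mappings.

```python
from typing import Dict, List

Rows = Dict[str, Dict[str, str]]


def diff_against_snapshot(incremental: Rows, snapshot: Rows) -> Dict[str, List[str]]:
    """Report row keys where the incremental dataset has drifted from the snapshot."""
    report: Dict[str, List[str]] = {"missing": [], "stale": [], "mismatched": []}
    for row_key in sorted(incremental.keys() | snapshot.keys()):
        if row_key not in incremental:
            # Present in the snapshot but never picked up incrementally.
            report["missing"].append(row_key)
        elif row_key not in snapshot:
            # Still in the incremental dataset but gone from the snapshot,
            # for example because it expired or was deleted.
            report["stale"].append(row_key)
        elif incremental[row_key] != snapshot[row_key]:
            report["mismatched"].append(row_key)
    return report
```

An empty report suggests the incremental dataset is still in step with the snapshot; any non-empty list is a signal to resync from the full snapshot.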

Related Concepts

Data Processing Pipelines

Map/Reduce Frameworks

Cassandra SSTables

Incremental Data Processing Strategies