Hadoop and Near Real Time Analytics at SlideShare

Nikhil Prabhakar

21 min read • intermediate

Apache, HAProxy, Java, MongoDB, MySQL, REST API, Ruby, SQL, Thrift

Overview

The article discusses the implementation of Hadoop and HBase for near real-time analytics at SlideShare, detailing the upgrade process, technology selection, and the resulting improvements in data processing speed. It highlights the transition from a MySQL-based analytics system to a Hadoop ecosystem, achieving data updates with a lag of just 30-90 minutes compared to the previous 24-36 hours.

What You'll Learn

How to implement near real-time analytics using Hadoop and HBase

Why HBase schema design is critical for performance

How to optimize Hadoop configurations for better performance

When to use Oozie for managing Hadoop workflows

Prerequisites & Requirements

Understanding of Hadoop ecosystem and analytics concepts
Familiarity with HBase and Pig (optional)

Key Questions Answered

How did SlideShare migrate from MySQL to Hadoop for analytics?

SlideShare migrated from MySQL to Hadoop by manually dumping MySQL tables into HDFS and using Pig scripts to transform and load data into HBase. This approach allowed them to handle new data points and improve analytics speed significantly.
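
As a rough illustration of the transform step, the sketch below turns rows from a tab-separated table dump into HBase-style cells (row key, column, value). The original pipeline used Pig scripts for this; Python is used here only to show the shape of the transformation. The column layout, the `stats` column family, and the sample data are all hypothetical.

```python
import csv
import io

# Hypothetical column layout of a dumped MySQL table (not from the article).
COLUMNS = ["slideshow_id", "day", "views"]


def dump_to_cells(dump_text, family="stats"):
    """Convert tab-separated dump rows into (row_key, column, value) cells."""
    reader = csv.reader(io.StringIO(dump_text), delimiter="\t")
    cells = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines in the dump
        row = dict(zip(COLUMNS, fields))
        # The row key combines the entity id and the day, so all rows for
        # one slideshow sort next to each other.
        row_key = f"{row['slideshow_id']}:{row['day']}"
        cells.append((row_key, f"{family}:views", row["views"]))
    return cells


# Hypothetical two-row dump for a single slideshow.
sample_dump = "42\t2013-05-01\t120\n42\t2013-05-02\t95\n"
cells = dump_to_cells(sample_dump)
```

Each resulting tuple corresponds to one cell that a loader would write into the target table.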

What technologies were selected for the Hadoop ecosystem at SlideShare?

The technologies selected included HBase for database storage, Pig for writing MapReduce jobs, Oozie for workflow scheduling, Java for various processing tasks, TorqueBox/JRuby for REST API development, and Phoenix for SQL queries over HBase tables.

What improvements were achieved with the new analytics system?

The new analytics system reduced data update lag from 24-36 hours to just 30-90 minutes, enabling near real-time analytics and enhancing the overall user experience on SlideShare.

What are the key challenges faced during the Hadoop implementation?

Key challenges included configuring Hadoop optimally, ensuring HBase schema design supported performance needs, and managing the complexity of data migration from MySQL to HBase without data discrepancies.
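
One way to catch data discrepancies after a migration is to compare per-key totals between the source and the target. The sketch below is a minimal version of such a check, not the method described in the article; the function name and the sample rows are hypothetical.

```python
from collections import Counter


def find_discrepancies(source_rows, target_rows):
    """Return keys whose summed values differ between source and target."""
    source_totals = Counter()
    for key, value in source_rows:
        source_totals[key] += value
    target_totals = Counter()
    for key, value in target_rows:
        target_totals[key] += value
    all_keys = set(source_totals) | set(target_totals)
    # Map each mismatching key to its (source_total, target_total) pair.
    return {
        key: (source_totals.get(key, 0), target_totals.get(key, 0))
        for key in sorted(all_keys)
        if source_totals.get(key, 0) != target_totals.get(key, 0)
    }


# Hypothetical (slideshow_id, views) rows from each system.
mysql_rows = [("42", 120), ("42", 95), ("7", 10)]
hbase_rows = [("42", 215), ("7", 9)]
mismatches = find_discrepancies(mysql_rows, hbase_rows)
```

An empty result means the two sides agree on every key; any entry points at a key worth reprocessing or investigating.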

Key Statistics & Figures

Data update lag

30-90 minutes

This is a significant improvement from the previous lag of 24-36 hours.

Number of nodes in Hadoop cluster

31 nodes

This includes a highly available setup with no single point of failure for masters.

HDFS block size

128 MB

This configuration was adjusted to improve data processing efficiency.
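
To see why block size matters, it helps to count blocks. By default MapReduce creates roughly one map task per input split, and a split typically matches an HDFS block, so larger blocks mean fewer tasks and less block metadata for the same file. The arithmetic below is a sketch: the 10 GB file size is hypothetical, and the 64 MB comparison value is an assumption, not a figure from the article.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes):
    """Number of HDFS blocks needed to store a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


file_size = 10 * 1024 * MB  # hypothetical 10 GB input file

blocks_at_64mb = block_count(file_size, 64 * MB)    # assumed earlier setting
blocks_at_128mb = block_count(file_size, 128 * MB)  # setting from the article
```

Under these assumptions the file needs half as many blocks at 128 MB, which roughly halves the number of map tasks a job over it would launch.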

Technologies & Tools

Backend

Hadoop

Used for processing large datasets and enabling near real-time analytics.

Database

HBase

Serves as the database layer leveraging Hadoop HDFS for storage.

Data Processing

Pig

Used for writing MapReduce jobs in a high-level language.

Workflow Scheduler

Oozie

Manages and schedules Hadoop jobs.

Backend

TorqueBox

Facilitates REST API development for analytics data points.

Database

Phoenix

Allows SQL queries over HBase tables.

Key Actionable Insights

1. Implementing a near real-time analytics system can significantly enhance user experience by providing timely insights.
This is particularly important for platforms like SlideShare, where user engagement relies on up-to-date analytics to inform content strategies.

2. Careful schema design in HBase is crucial for optimizing query performance and reducing latency.
Understanding how data will be queried can guide the design process, ensuring that the system can handle expected workloads efficiently.

3. Utilizing Oozie for managing complex workflows can streamline the execution of interdependent MapReduce jobs.
As workflows grow in complexity, Oozie provides a more manageable solution compared to using cron jobs.
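
The advantage of a workflow scheduler over cron is that job order comes from declared dependencies rather than from guessed start times. Oozie itself expresses workflows as XML definitions; the sketch below is not Oozie's API, only an illustration of dependency-ordered execution using Python's standard `graphlib` module (Python 3.9 or later). The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each mapped to the set of jobs it depends on.
dependencies = {
    "import_logs": set(),
    "aggregate_views": {"import_logs"},
    "aggregate_downloads": {"import_logs"},
    "load_hbase": {"aggregate_views", "aggregate_downloads"},
}


def run_workflow(dependencies, run_job):
    """Run every job only after all of its dependencies have run."""
    executed = []
    for job in TopologicalSorter(dependencies).static_order():
        run_job(job)
        executed.append(job)
    return executed


# A no-op stands in for actually submitting a job to the cluster.
executed = run_workflow(dependencies, lambda job: None)
```

With cron, the same ordering would have to be approximated by spacing out start times, which breaks as soon as an upstream job runs longer than expected.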

Common Pitfalls

Overlooking the importance of HBase schema design can lead to significant performance issues.

If the schema does not align with query patterns, it may necessitate creating new tables and reprocessing data, which can be resource-intensive.
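
HBase keeps rows sorted lexicographically by row key, so the key layout largely decides which queries are cheap. The sketch below models that with a sorted list: a key made of a zero-padded user id plus a reversed timestamp lets a prefix scan return one user's rows newest first. The key layout, `MAX_TS`, and the sample events are hypothetical and are not the schema used at SlideShare.

```python
import bisect

MAX_TS = 10**13 - 1  # upper bound for the 13-digit millisecond timestamps


def make_row_key(user_id, timestamp_ms):
    """Build a key that sorts by user, then newest event first."""
    reversed_ts = MAX_TS - timestamp_ms
    return f"{user_id:010d}:{reversed_ts:013d}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix from a sorted key list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


# Hypothetical (user_id, timestamp_ms) events, stored in key order.
events = [(7, 1000), (7, 3000), (12, 2000), (7, 2000)]
table = sorted(make_row_key(user, ts) for user, ts in events)

user_7_keys = prefix_scan(table, f"{7:010d}:")
user_7_timestamps = [MAX_TS - int(key.split(":")[1]) for key in user_7_keys]
```

If the dominant query were instead "all events in a time range across users", this key would force a full scan, which is the kind of mismatch that leads to new tables and reprocessing.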

Failing to monitor configuration changes can result in unexpected performance degradation.

It's crucial to change configuration parameters one at a time and assess their impact to avoid compounding issues.
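
A one-change-at-a-time tuning loop can be sketched as follows. Each candidate setting is applied to a copy of the baseline configuration in isolation, so any change in the measurement can be attributed to that one parameter. The parameter names and the `fake_benchmark` function are hypothetical stand-ins; a real benchmark would run a representative job and return its runtime.

```python
def one_at_a_time(baseline, candidates, benchmark):
    """Measure each candidate change in isolation against the baseline."""
    baseline_score = benchmark(baseline)
    deltas = {}
    for name, value in candidates.items():
        trial = dict(baseline)
        trial[name] = value  # change exactly one parameter
        deltas[name] = benchmark(trial) - baseline_score
    return deltas


def fake_benchmark(config):
    """Hypothetical stand-in returning a made-up runtime in seconds."""
    return 600 - config["block_size_mb"] + 5 * abs(config["reducers"] - 8)


baseline = {"block_size_mb": 64, "reducers": 4}
candidates = {"block_size_mb": 128, "reducers": 8}
deltas = one_at_a_time(baseline, candidates, fake_benchmark)
```

A negative delta means the change reduced the measured runtime relative to the baseline; changes that look good alone can then be combined and measured again.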

Related Concepts

Hadoop Ecosystem

HBase Schema Design

MapReduce Programming

Data Migration Strategies