Overview
The article discusses the implementation of Hadoop and HBase for near real-time analytics at SlideShare, detailing the upgrade process, technology selection, and the resulting improvements in data processing speed. It highlights the transition from a MySQL-based analytics system to a Hadoop ecosystem, achieving data updates with a lag of just 30-90 minutes compared to the previous 24-36 hours.
What You'll Learn
1. How to implement near real-time analytics using Hadoop and HBase
2. Why HBase schema design is critical for performance
3. How to optimize Hadoop configurations for better performance
4. When to use Oozie for managing Hadoop workflows
Prerequisites & Requirements
- Understanding of Hadoop ecosystem and analytics concepts
- Familiarity with HBase and Pig (optional)
Key Questions Answered
How did SlideShare migrate from MySQL to Hadoop for analytics?
SlideShare migrated from MySQL to Hadoop by manually dumping MySQL tables into HDFS and using Pig scripts to transform and load data into HBase. This approach allowed them to handle new data points and improve analytics speed significantly.
What technologies were selected for the Hadoop ecosystem at SlideShare?
The technologies selected included HBase for database storage, Pig for writing MapReduce jobs, Oozie for workflow scheduling, Java for various processing tasks, TorqueBox/JRuby for REST API development, and Phoenix for SQL queries over HBase tables.
What improvements were achieved with the new analytics system?
The new analytics system reduced data update lag from 24-36 hours to just 30-90 minutes, enabling near real-time analytics and enhancing the overall user experience on SlideShare.
What are the key challenges faced during the Hadoop implementation?
Key challenges included configuring Hadoop optimally, ensuring HBase schema design supported performance needs, and managing the complexity of data migration from MySQL to HBase without data discrepancies.
Key Statistics & Figures
Data update lag
30-90 minutes
This is a significant improvement from the previous lag of 24-36 hours.
Number of nodes in Hadoop cluster
31 nodes
This includes a highly available setup with no single point of failure for masters.
HDFS block size
128 MB
This configuration was adjusted to improve data processing efficiency.
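A rough way to see why block size matters is to count blocks: by default, each HDFS block of an input file typically becomes one map task, so fewer and larger blocks mean less per-task overhead. The sketch below is a back-of-the-envelope calculation only. The 10 GiB input size and the 64 MB comparison value are assumptions chosen for illustration, not figures from the article.

```python
MIB = 1024 * 1024


def hdfs_block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return how many HDFS blocks a file of the given size occupies."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Integer ceiling division: a partially filled last block still counts.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    input_size = 10 * 1024 * MIB  # hypothetical 10 GiB input file
    for block_mib in (64, 128):   # 64 MB is an assumed comparison point
        blocks = hdfs_block_count(input_size, block_mib * MIB)
        print(f"{block_mib} MB blocks -> {blocks} blocks")
```

Halving the number of blocks roughly halves the number of map tasks for the same input, which is one plausible reason a larger block size improves throughput on large files.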
Technologies & Tools
Backend
Hadoop
Used for processing large datasets and enabling near real-time analytics.
Database
HBase
Serves as the database layer leveraging Hadoop HDFS for storage.
Data Processing
Pig
Used for writing MapReduce jobs in a high-level language.
Workflow Scheduler
Oozie
Manages and schedules Hadoop jobs.
Backend
TorqueBox
Facilitates REST API development for analytics data points.
Database
Phoenix
Allows SQL queries over HBase tables.
Key Actionable Insights
1. Implementing a near real-time analytics system can significantly enhance user experience by providing timely insights. This is particularly important for platforms like SlideShare, where user engagement relies on up-to-date analytics to inform content strategies.
2. Careful schema design in HBase is crucial for optimizing query performance and reducing latency. Understanding how data will be queried can guide the design process, ensuring that the system can handle expected workloads efficiently.
3. Utilizing Oozie for managing complex workflows can streamline the execution of interdependent MapReduce jobs. As workflows grow in complexity, Oozie provides a more manageable solution compared to using cron jobs.
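The schema design point can be illustrated with a row key sketch. HBase keeps rows sorted by the bytes of the row key, so a key that begins with the entity being queried and ends with a reversed timestamp lets a prefix scan return that entity's newest rows first. The key layout, the `make_row_key` helper, and the field names below are hypothetical and are not SlideShare's actual schema.

```python
import struct

MAX_INT64 = 2**63 - 1


def make_row_key(slideshow_id: int, epoch_seconds: int) -> bytes:
    """Build a hypothetical 16-byte row key: id followed by reversed timestamp."""
    if slideshow_id < 0 or not 0 <= epoch_seconds <= MAX_INT64:
        raise ValueError("id and timestamp must be non-negative 64-bit values")
    # Subtracting from the maximum makes newer timestamps sort earlier.
    reversed_ts = MAX_INT64 - epoch_seconds
    # Big-endian unsigned packing keeps byte order equal to numeric order.
    return struct.pack(">QQ", slideshow_id, reversed_ts)


if __name__ == "__main__":
    keys = [make_row_key(42, 1000), make_row_key(42, 2000), make_row_key(7, 1500)]
    for key in sorted(keys):  # mimics HBase's lexicographic row ordering
        print(key.hex())
```

With this layout, all rows for one id are contiguous, so a query such as "latest views for slideshow 42" becomes a short scan instead of a full-table filter. A different query pattern would call for a different key.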
Common Pitfalls
1. Overlooking the importance of HBase schema design can lead to significant performance issues. If the schema does not align with query patterns, it may necessitate creating new tables and reprocessing data, which can be resource-intensive.
2. Failing to monitor configuration changes can result in unexpected performance degradation. It's crucial to change configuration parameters one at a time and assess their impact to avoid compounding issues.
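The advice to change one parameter at a time can be sketched as a small harness: start from a baseline configuration, apply each candidate change in isolation, and record its measured effect before deciding whether to keep it. The `benchmark` callable and the parameter names are placeholders. In practice the benchmark would be a representative job run on the cluster, repeated enough times to smooth out noise.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def evaluate_one_at_a_time(
    baseline: Config,
    candidates: List[Tuple[str, int]],
    benchmark: Callable[[Config], float],
) -> List[Tuple[str, int, float]]:
    """Apply each candidate change alone and return its runtime delta vs. baseline."""
    baseline_runtime = benchmark(baseline)
    results = []
    for param, value in candidates:
        trial = dict(baseline)  # copy, so changes never accumulate
        trial[param] = value
        delta = benchmark(trial) - baseline_runtime
        results.append((param, value, delta))
    return results


if __name__ == "__main__":
    # Stand-in for a real job run; the formula is made up for the demo.
    def fake_benchmark(cfg: Config) -> float:
        return 100.0 - cfg["param_a"] + 2 * cfg["param_b"]

    base = {"param_a": 1, "param_b": 1}
    for name, value, delta in evaluate_one_at_a_time(
        base, [("param_a", 5), ("param_b", 3)], fake_benchmark
    ):
        print(f"{name}={value}: runtime change {delta:+.1f}")
```

A negative delta means the change helped on its own. Because every trial starts from the same baseline, a regression can be attributed to a single parameter.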
Related Concepts
Hadoop Ecosystem
HBase Schema Design
MapReduce Programming
Data Migration Strategies