Overview
This article discusses Pinterest's Batch Processing Platform, Monarch, focusing on efficient resource management to ensure quality of service (QoS) while maintaining cost efficiency. It outlines challenges faced in resource allocation, the implementation of workflow tiering, and the development of a resource allocation algorithm to optimize performance.
What You'll Learn
1. How to implement workflow tiering in batch processing systems
2. Why resource allocation algorithms are crucial for maintaining QoS
3. How to monitor and adjust cluster resource usage effectively
Prerequisites & Requirements
- Understanding of batch processing and resource management concepts
- Familiarity with AWS EC2 and Hadoop YARN (optional)
Key Questions Answered
What challenges does Pinterest face in resource management for batch processing?
Pinterest's Batch Processing Platform, Monarch, faces three main challenges: resource interference between workflows, no notion of priority in queue management, and inefficient resource allocation caused by an inconsistent queue structure. As a result, non-critical workflows can delay critical ones, which calls for a more structured approach to resource management.
How does the Percentile Algorithm improve resource allocation in Monarch?
The Percentile Algorithm analyzes historical run data to determine the optimal queue weight for resource allocation, ensuring that critical workflows receive the necessary resources to meet their service level objectives. By using a time window to evaluate resource needs, it helps avoid waste and ensures efficient resource usage.
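The idea of deriving queue weights from historical usage can be sketched as follows. This is a minimal illustration, not Monarch's actual implementation: the nearest-rank percentile, the per-queue percentile settings, the sample values, and the function names `percentile` and `queue_weights` are all assumptions made for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def queue_weights(usage_by_queue, pct_by_queue):
    """Turn historical usage samples into normalized queue weights.

    usage_by_queue: queue name -> resource usage samples collected over
                    a time window (hypothetical units, e.g. GB of memory).
    pct_by_queue:   queue name -> percentile of that usage to provision for
                    (assumption: more critical queues use a higher percentile).
    """
    needs = {q: percentile(s, pct_by_queue[q]) for q, s in usage_by_queue.items()}
    total = sum(needs.values())
    if total == 0:
        raise ValueError("no recorded usage in the window")
    return {q: need / total for q, need in needs.items()}


# Hypothetical samples: both queues saw the same usage pattern.
usage = {"tier1": [10, 20, 30, 40], "tier3": [10, 20, 30, 40]}
weights = queue_weights(usage, {"tier1": 100, "tier3": 50})
```

In this sketch the critical queue is provisioned for its peak usage in the window, while the less critical queue is provisioned for its median, so the critical queue ends up with the larger weight even though the raw usage is identical.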
What is the purpose of workflow tiering in Monarch?
Workflow tiering in Monarch categorizes workflows into three tiers based on their criticality, allowing for differentiated resource allocation. This ensures that more critical workflows receive priority in resource allocation, improving overall performance and meeting service level objectives effectively.
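A three-tier classification could be modeled along these lines. The `Tier` and `Workflow` types, the `project` field, and the `root.<project>.tier<N>` queue naming are hypothetical choices for illustration; the source only states that workflows are grouped into three tiers by criticality.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Lower value means more critical (assumed ordering)."""
    TIER1 = 1
    TIER2 = 2
    TIER3 = 3


@dataclass(frozen=True)
class Workflow:
    name: str
    project: str
    tier: Tier


def queue_for(workflow):
    """Map a workflow to a per-project, per-tier queue (hypothetical naming)."""
    return f"root.{workflow.project}.tier{workflow.tier.value}"


def by_criticality(workflows):
    """Order workflows so the most critical tier comes first."""
    return sorted(workflows, key=lambda w: w.tier)


# Hypothetical workflows.
flows = [
    Workflow("adhoc_export", "ads", Tier.TIER3),
    Workflow("daily_report", "ads", Tier.TIER1),
]
ordered = by_criticality(flows)
```

Giving every project the same tier sub-queues keeps the queue structure consistent, which makes it possible to assign resources per tier rather than per ad hoc queue.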
How does Pinterest monitor workflow performance in Monarch?
Pinterest monitors workflow performance through a dashboard that tracks the success ratio of workflows meeting their runtime service level objectives (SLOs). By analyzing the SLO success ratio, the team can assess and improve the performance of workflows across different tiers.
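A per-tier SLO success ratio can be computed as the fraction of runs whose runtime stayed within the SLO. The record layout `(tier, runtime_minutes, slo_minutes)` and the function name below are assumptions for this sketch, not the dashboard's actual data model.

```python
from collections import defaultdict


def slo_success_ratio(runs):
    """Return tier -> fraction of runs that finished within their runtime SLO.

    runs: iterable of (tier, runtime_minutes, slo_minutes) tuples
          (hypothetical record layout).
    """
    met = defaultdict(int)
    total = defaultdict(int)
    for tier, runtime, slo in runs:
        total[tier] += 1
        if runtime <= slo:
            met[tier] += 1
    return {tier: met[tier] / total[tier] for tier in total}


# Hypothetical run history.
history = [
    ("tier1", 50, 60),  # met the SLO
    ("tier1", 70, 60),  # missed the SLO
    ("tier2", 10, 20),  # met the SLO
]
ratios = slo_success_ratio(history)
```

Tracking the ratio separately per tier shows whether the most critical workflows actually benefit from the resources allocated to them.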
Key Statistics & Figures
Daily S3 data write: ~30PB
This statistic reflects the scale at which Pinterest's Batch Processing Platform operates, indicating the volume of data processed daily.
Daily S3 data read: ~120PB
This highlights the significant data retrieval demands placed on the Monarch platform, showcasing the need for efficient resource management.
Workflow SLO success ratio for tier1 workflows: 90%
The improvement in SLO success ratio for tier1 workflows indicates the effectiveness of the implemented resource management strategies.
Technologies & Tools
Backend
Hadoop YARN
Used to manage cluster resources and task scheduling in the Monarch platform.
Cloud
AWS EC2
Provides the infrastructure for running the Monarch batch processing workflows.
Data Streaming
Kafka
Used for ingesting logs generated by users into Pinterest's data system.
Key Actionable Insights
1. Implement workflow tiering to prioritize critical tasks in batch processing environments. By categorizing workflows into tiers, you can ensure that critical tasks receive the necessary resources and attention, leading to improved performance and adherence to service level objectives.
2. Utilize historical data to inform resource allocation strategies. Analyzing past resource usage patterns can help in setting more accurate resource allocations, reducing waste, and ensuring that workflows are adequately supported during peak times.
3. Develop monitoring dashboards to track workflow performance metrics. Having a visual representation of workflow performance allows teams to quickly identify issues and make informed decisions to enhance efficiency and effectiveness in resource management.
Common Pitfalls
1. Failing to establish a consistent queue structure can lead to resource interference among workflows.
Without a well-defined queue hierarchy, critical workflows may suffer delays due to competition for resources with non-critical tasks. Implementing a tiered queue structure can mitigate this issue.
2. Neglecting to monitor resource usage can result in over- or under-utilization of cluster resources.
Regular monitoring lets teams adjust resources proactively so that clusters are neither overburdened, which delays workflows, nor underutilized, which wastes money.
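A simple utilization check of the kind such monitoring might feed is sketched below. The 50% and 90% thresholds, the status labels, and the function name are illustrative assumptions; suitable values depend on the cluster and its workload.

```python
def classify_utilization(used, capacity, low=0.50, high=0.90):
    """Classify cluster usage against hypothetical low/high thresholds."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio > high:
        return "over-utilized"   # candidate for adding capacity
    if ratio < low:
        return "under-utilized"  # candidate for reducing capacity
    return "ok"


# Hypothetical readings: resource units used vs. total cluster capacity.
status_busy = classify_utilization(95, 100)
status_idle = classify_utilization(30, 100)
status_fine = classify_utilization(70, 100)
```

Running a check like this on a schedule, rather than once, is what allows capacity to be adjusted before critical workflows start missing their SLOs.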
Related Concepts
Resource Management In Batch Processing
Workflow Orchestration With Airflow
Cloud Infrastructure Optimization Strategies