Introducing WorkflowGuard: The Workflow Governance and Observability System That Oversees over 120,000 Data Workflows

Chengchun Yan, Jing Shi, Sudhir Mallem
14 min read · Beginner

Overview

The article introduces WorkflowGuard, a workflow governance and observability system developed by Uber that manages over 120,000 data workflows. It addresses challenges such as resource demand, execution delays, and workflow inefficiencies, while enhancing user experience and cost efficiency.

What You'll Learn

1. How to implement workflow task prioritization using WorkflowGuard
2. Why resource isolation is critical for workflow efficiency
3. When to recycle inefficient workflows to improve system performance

Prerequisites & Requirements

  • Understanding of workflow management concepts
  • Familiarity with Uber's Data Workflow Platform (optional)

Key Questions Answered

What is WorkflowGuard and how does it function?
WorkflowGuard is a governance and observability system that oversees the entire lifecycle of workflows, ensuring efficient resource use, prioritization of tasks, and compliance with governance policies. It enhances user experience by providing a centralized interface for managing workflows and monitoring performance.
How does WorkflowGuard improve workflow performance?
Since the introduction of WorkflowGuard, Uber has seen a 66% reduction in inactive Presto workflows, leading to an increase in execution success rates from 69.28% to 85.22%. This improvement is attributed to the cleanup of legacy workflows that previously hindered performance.
What are the main components of WorkflowGuard?
WorkflowGuard consists of five main components: event detector, policy validator, governance executor, notification service, and governance observability service. These components work together to monitor workflows, enforce policies, and notify users of governance actions.
What challenges does WorkflowGuard address?
WorkflowGuard addresses challenges such as increasing compute resource demands, execution delays during traffic bursts, and the management of inefficient workflows. It implements prioritization, resource isolation, and recycling of workflows to enhance overall system reliability.
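
To make the division of labor among the five components more concrete, here is a minimal sketch of how an event could flow from detection through validation, execution, notification, and observability. Every class, field, policy, and action name in it is hypothetical and chosen for illustration; the summary does not describe WorkflowGuard's real interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class WorkflowEvent:
    """One detected workflow event (hypothetical shape)."""
    workflow_id: str
    owner: str
    tier: Optional[str]  # tier tag, or None if the workflow is untagged


@dataclass
class Violation:
    workflow_id: str
    owner: str
    policy: str
    action: str  # illustrative action name, not a real WorkflowGuard action


# A policy inspects an event and returns a Violation, or None if it passes.
Policy = Callable[[WorkflowEvent], Optional[Violation]]


def require_tier_tag(event: WorkflowEvent) -> Optional[Violation]:
    """Hypothetical policy: every workflow must carry a tier tag."""
    if event.tier is None:
        return Violation(event.workflow_id, event.owner, "require_tier_tag", "flag")
    return None


@dataclass
class GovernancePipeline:
    policies: List[Policy]
    enforced: List[str] = field(default_factory=list)         # governance executor
    notifications: List[str] = field(default_factory=list)    # notification service
    audit_log: List[Violation] = field(default_factory=list)  # observability record

    def handle(self, event: WorkflowEvent) -> None:
        """Entry point an event detector would call for each detected event."""
        for policy in self.policies:  # policy validator
            violation = policy(event)
            if violation is None:
                continue
            self.enforced.append(f"{violation.workflow_id}:{violation.action}")
            self.notifications.append(
                f"{violation.owner}: {violation.workflow_id} violated {violation.policy}"
            )
            self.audit_log.append(violation)


pipeline = GovernancePipeline(policies=[require_tier_tag])
pipeline.handle(WorkflowEvent("wf-untagged", "alice", tier=None))
pipeline.handle(WorkflowEvent("wf-tagged", "bob", tier="tier1"))
print(pipeline.enforced)  # ['wf-untagged:flag']
```

Keeping policies as plain callables means a new governance rule can be added without touching the executor, notification, or audit paths, which is one plausible reason to separate the components this way.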

Key Statistics & Figures

Reduction in inactive Presto workflows: 66%
This reduction was achieved after implementing the workflow retention policy.

Increase in Presto execution success rate: from 69.28% to 85.22%
This improvement resulted from cleaning up workflows with consecutive failed executions.

Reduction in overall median task execution latency: from 40 to 15 minutes
This 62.5% reduction resulted from the computing power released by cleaning up inefficient workflows.

Amortized annual savings from Presto computation: $200,000
These savings were identified as a result of implementing the workflow retention policy.
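
The percentages above can be checked with a few lines of arithmetic. The helper name below is ours, and the only inputs are the figures quoted in this section.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Median task execution latency: 40 minutes down to 15 minutes.
latency_cut = percent_reduction(40, 15)

# Presto execution success rate: 69.28% up to 85.22%.
# This is a difference in percentage points, not a relative change.
success_gain_points = 85.22 - 69.28

print(f"{latency_cut:.1f}%")                          # 62.5%
print(f"{success_gain_points:.2f} percentage points")  # 15.94 percentage points
```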

Key Actionable Insights

1. Implementing WorkflowGuard can significantly enhance workflow governance and observability, leading to improved performance and cost savings.
By utilizing WorkflowGuard, organizations can better manage their workflows, ensuring that resources are allocated efficiently and that high-priority tasks are executed without delays.

2. Prioritizing workflows based on business impact can help in resource allocation and task scheduling.
Using tier tags to classify workflows allows teams to isolate critical workflows, ensuring they receive the necessary resources during peak times.

3. Regularly reviewing and recycling inefficient workflows can prevent resource wastage and improve system performance.
WorkflowGuard's ability to identify and manage legacy workflows helps maintain a clean and efficient workflow environment, reducing the likelihood of performance bottlenecks.
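
As an illustration of tier-based prioritization, the sketch below orders pending tasks by tier tag using a heap, with arrival order as the tie-breaker within a tier. The tier names, their ranking, and the `TierQueue` class are assumptions made for this example, not WorkflowGuard's actual scheme. It covers scheduling order only; resource isolation would additionally require separate resource pools per tier.

```python
import heapq
import itertools
from typing import List, Tuple

# Hypothetical tier tags: a lower rank means higher business impact.
TIER_RANK = {"tier1": 1, "tier2": 2, "tier3": 3}


class TierQueue:
    """Priority queue that dequeues higher-tier tasks first, FIFO within a tier."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker preserving arrival order

    def submit(self, task_id: str, tier: str) -> None:
        heapq.heappush(self._heap, (TIER_RANK[tier], next(self._counter), task_id))

    def next_task(self) -> str:
        _, _, task_id = heapq.heappop(self._heap)
        return task_id

    def __len__(self) -> int:
        return len(self._heap)


queue = TierQueue()
queue.submit("adhoc-report", "tier3")
queue.submit("billing-etl", "tier1")
queue.submit("dashboard-refresh", "tier2")
queue.submit("payments-etl", "tier1")

order = [queue.next_task() for _ in range(len(queue))]
print(order)
# ['billing-etl', 'payments-etl', 'dashboard-refresh', 'adhoc-report']
```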

Common Pitfalls

1. Failing to prioritize workflows can lead to resource contention and execution delays.
Without proper prioritization, critical workflows may be delayed or fail due to resource competition, impacting overall performance.

2. Neglecting to recycle inefficient workflows can cause performance degradation.
Inefficient workflows that are not regularly reviewed can consume valuable resources, leading to increased costs and reduced system reliability.
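
A regular review of the kind described in the second pitfall could start from a simple selection rule such as the one sketched below, which flags workflows that have not succeeded recently or that keep failing. Both thresholds and all field names are hypothetical; the summary does not state the values the retention policy actually uses.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

# Hypothetical thresholds; the real retention policy values are not given here.
MAX_INACTIVE_DAYS = 90
MAX_CONSECUTIVE_FAILURES = 10


@dataclass
class WorkflowStats:
    workflow_id: str
    last_success: date
    consecutive_failures: int


def recycle_candidates(workflows: List[WorkflowStats], today: date) -> List[str]:
    """Return IDs of workflows that are stale or failing repeatedly."""
    cutoff = today - timedelta(days=MAX_INACTIVE_DAYS)
    return [
        wf.workflow_id
        for wf in workflows
        if wf.last_success < cutoff
        or wf.consecutive_failures >= MAX_CONSECUTIVE_FAILURES
    ]


today = date(2024, 6, 1)
fleet = [
    WorkflowStats("daily-kpis", date(2024, 5, 31), 0),
    WorkflowStats("legacy-export", date(2023, 11, 2), 0),
    WorkflowStats("broken-join", date(2024, 5, 1), 42),
]
candidates = recycle_candidates(fleet, today)
print(candidates)  # ['legacy-export', 'broken-join']
```

In practice the flagged list would more likely feed an owner notification with a grace period than an immediate deletion, so that a workflow still in use can be kept.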