Every Day Is Monday In Operations

Benjamin Purgason
7 min read, intermediate

Overview

The article discusses the ongoing challenges of operations work, particularly keeping online services running 24/7. It stresses addressing systematic failures and learning from operational experience to improve service reliability.

What You'll Learn

1. How to identify and mitigate systematic failures in operations
2. Why consistent monitoring is crucial for 24/7 services
3. When to escalate issues based on operational changes

Key Questions Answered

What should you do when tests yield different results?
When tests yield different results, it is essential to investigate all potential causes, including code, environment, and configuration issues. This relentless pursuit of understanding helps in building more predictable systems and prevents future discrepancies.
How can systematic failures affect large distributed systems?
In large distributed systems, even a small per-server failure rate adds up. A 1% daily chance of a process segfaulting on each of 70,000 servers means roughly 700 segfaults per day. Understanding and addressing these systematic failures can drastically improve operational reliability and reduce the engineer-hours spent on firefighting.
What is the impact of operational changes on system reliability?
Operational changes can expose previously hidden bugs, leading to failures that may not have been apparent before. Continuous monitoring and a comprehensive change log are critical to identifying and resolving these issues effectively.
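
When two test runs disagree, one practical first step is to compare the environment and configuration of the two runs side by side. The following is a minimal sketch, assuming each run's settings have already been collected into a plain dictionary. The function name `diff_settings` and the sample keys and values are hypothetical and do not come from the article.

```python
# Placeholder shown when a setting exists on one side only (hypothetical marker).
_MISSING = "<missing>"


def diff_settings(left, right):
    """Return {key: (left_value, right_value)} for every setting that differs."""
    differences = {}
    for key in sorted(left.keys() | right.keys()):
        left_value = left.get(key, _MISSING)
        right_value = right.get(key, _MISSING)
        if left_value != right_value:
            differences[key] = (left_value, right_value)
    return differences


if __name__ == "__main__":
    # Hypothetical settings captured from a passing run and a failing run.
    passing_run = {"interpreter_build": "A", "locale": "en_US", "feature_flag": "off"}
    failing_run = {"interpreter_build": "B", "locale": "en_US"}
    for key, (passing, failing) in diff_settings(passing_run, failing_run).items():
        print(f"{key}: passing={passing!r} failing={failing!r}")
```

Each key that appears in the output is a candidate cause to investigate. Settings that match in both runs are left out so the list stays short.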

Key Statistics & Figures

Average segfaults per day
700
This number is derived from a 1% chance of segfaulting a process per day across 70,000 servers.
Engineer-hours saved per day after fixing a bug
233
This was the result of addressing a systematic issue that had been causing widespread failures.
Total engineer-hours spent to mitigate a bug
750
This was the effort required to identify and resolve a segfault issue affecting the deployment system.
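
The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses only the numbers quoted above (70,000 servers, a 1% daily chance, 750 engineer-hours spent, 233 engineer-hours saved per day). The function names are illustrative, and the payback calculation is a derived estimate, not a figure stated in the article.

```python
def expected_daily_failures(servers, daily_failure_probability):
    """Expected number of failures per day across the fleet."""
    return servers * daily_failure_probability


def payback_days(hours_invested, hours_saved_per_day):
    """Days until the hours saved equal the hours invested in the fix."""
    return hours_invested / hours_saved_per_day


if __name__ == "__main__":
    failures = expected_daily_failures(70_000, 0.01)
    print(f"Expected segfaults per day: {failures:.0f}")
    print(f"Payback period in days: {payback_days(750, 233):.1f}")
```

On these figures, the 750 engineer-hours spent on the fix would be recovered in a little over three days.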

Technologies & Tools

Deployment System
SaltStack
Used as a key component of the deployment system that faced issues due to segfaults.
Programming Language
Python
The Python interpreter was identified as the source of segfaults affecting the deployment processes.

Key Actionable Insights

1. Implement a robust monitoring system to catch issues early.
By having a monitoring system in place, teams can quickly identify and address problems before they escalate, ensuring smoother operations and less downtime.
2. Maintain a comprehensive change log for all systems.
A detailed change log helps teams correlate issues with recent changes, making it easier to identify the root cause of failures and improve response times.
3. Foster a culture of relentless problem-solving among engineers.
Encouraging engineers to persistently seek out and address the root causes of issues leads to more reliable systems and a more resilient operational environment.

Common Pitfalls

1. Failing to correlate operational changes with system failures.
This often leads to prolonged troubleshooting efforts and missed opportunities for quick fixes. Keeping a detailed change log can help avoid this mistake.
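
Correlating a failure with recent changes can be as simple as filtering the change log by time. The following is a minimal sketch, assuming the change log is available as a list of timestamped entries. The `Change` record, the `changes_before` function, and the 24-hour default window are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    """One change-log entry (hypothetical schema)."""
    when: datetime
    system: str
    summary: str


def changes_before(failure_time, change_log, window=timedelta(hours=24)):
    """Return changes made within `window` before the failure, newest first."""
    earliest = failure_time - window
    candidates = [c for c in change_log if earliest <= c.when <= failure_time]
    return sorted(candidates, key=lambda c: c.when, reverse=True)


if __name__ == "__main__":
    # Hypothetical entries; a real log would come from the team's tooling.
    log = [
        Change(datetime(2024, 1, 1, 9, 0), "deployment", "rolled out new agent"),
        Change(datetime(2024, 1, 2, 8, 0), "os", "applied package updates"),
    ]
    for change in changes_before(datetime(2024, 1, 2, 10, 0), log):
        print(change.when.isoformat(), change.system, change.summary)
```

Listing the newest changes first puts the most likely suspects at the top. Widening the window is a cheap next step when nothing in the first pass explains the failure.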

Related Concepts

Operational Reliability
Systematic Failures
Monitoring And Alerting Systems
Distributed Systems Management