Overview
The article 'What Gets Measured Gets Fixed' discusses the importance of measurement in engineering, illustrating this principle through two case studies: a database migration failure and the establishment of a tools status standup. It emphasizes that effective measurement leads to improved performance and problem resolution.
What You'll Learn
1. How to implement a rollback plan during major system migrations
2. Why measuring mean time to detect (MTTD) and mean time to resolve (MTTR) outages is crucial
3. How to effectively use Net Promoter Score (NPS) to gauge tool reliability
Key Questions Answered
What lessons were learned from the 10g database migration failure?
The 10g migration failure taught two lessons: a rollback plan is necessary, and peak traffic must be measured. Because neither was in place, the team faced significant operational issues and prolonged resolution times.
How did TS3 improve the reliability of internal tools?
TS3 improved tool reliability by establishing daily meetings to discuss outages, measuring MTTD and MTTR, and focusing on reducing user impact. This structured approach led to an 85.39% reduction in MTTD and an 85.57% reduction in MTTR over the year.
What is the significance of Net Promoter Score (NPS) in evaluating tools?
Net Promoter Score (NPS) is used to measure user satisfaction with tools and how reliable users perceive them to be. A score of -46 indicated significant dissatisfaction, prompting further investigation and validation of tool performance through user feedback and outage data.
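As a rough illustration of how an NPS figure such as -46 is produced, the sketch below applies the standard NPS definition: the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6) on a 0-10 scale. The function name and the sample ratings are hypothetical and do not come from the article.

```python
def net_promoter_score(ratings):
    """Return the NPS (from -100 to 100) for a list of 0-10 survey ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)    # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)   # ratings of 0 through 6
    # Passives (7 or 8) count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical survey responses, for illustration only.
sample_ratings = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(net_promoter_score(sample_ratings))  # -40.0
```

A negative score means detractors outnumber promoters, which is why a result like -46 reads as a strong signal of dissatisfaction.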
Key Statistics & Figures
NPS score
-46
Indicated significant dissatisfaction with internal tools before improvements were made.
Reduction in MTTD
85.39%
Achieved through the TS3 initiative over the course of a year.
Reduction in MTTR
85.57%
Also achieved through the TS3 initiative, demonstrating improved operational efficiency.
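The MTTD and MTTR reductions above are percentage changes in two averages. The following sketch shows one way such figures could be computed from outage records. It assumes that MTTD is measured from outage start to detection and MTTR from outage start to resolution; the article does not state the exact definitions TS3 used, and the `Outage` record, function names, and sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Outage:
    started: datetime    # when the outage began
    detected: datetime   # when someone noticed it
    resolved: datetime   # when service was restored


def mttd_minutes(outages):
    """Mean time to detect, in minutes (assumed: start to detection)."""
    return mean([(o.detected - o.started).total_seconds() / 60 for o in outages])


def mttr_minutes(outages):
    """Mean time to resolve, in minutes (assumed: start to resolution)."""
    return mean([(o.resolved - o.started).total_seconds() / 60 for o in outages])


def percent_reduction(before, after):
    """Percentage by which a metric dropped from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical outage log, for illustration only.
t0 = datetime(2024, 1, 1, 9, 0)
log = [
    Outage(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=60)),
    Outage(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=120)),
]
print(mttd_minutes(log))               # 15.0
print(mttr_minutes(log))               # 90.0
print(percent_reduction(100.0, 15.0))  # 85.0
```

Tracking both averages per period, then comparing periods with `percent_reduction`, gives the kind of year-over-year figure quoted above.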
Technologies & Tools
Database
Oracle 10g
Used in the context of the database migration case study.
Operating System
Linux
The target operating system for the database migration.
Database
Oracle RAC
Implemented as part of the migration strategy to improve database performance.
Key Actionable Insights
1. Establish a rollback plan for all major system migrations to minimize risks. Having a rollback plan can prevent prolonged outages and operational chaos, as seen in the 10g migration failure where the absence of such a plan led to significant challenges.
2. Regularly measure MTTD and MTTR to enhance operational reliability. By focusing on these metrics, teams can identify areas for improvement and ensure faster resolution of outages, as demonstrated by the success of TS3.
3. Utilize NPS surveys to gather user feedback on tool reliability. NPS can provide valuable insights into user satisfaction and highlight areas needing attention, helping to prioritize improvements in internal tools.
Common Pitfalls
1. Failing to measure critical performance metrics can lead to significant operational issues. Without understanding peak load and other metrics, teams may overlook potential failures, as seen in the 10g migration where lack of measurement led to repeated crashes.
2. Neglecting to establish a rollback plan during major changes can result in prolonged outages. The 10g migration highlighted the dangers of not having a clear plan to revert changes, leading to a 'fix-forward hell' scenario.
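To make the first pitfall concrete, here is a minimal sketch of measuring peak load from request timestamps by counting requests per one-minute bucket. It is only an illustration of the idea; the article does not describe how peak traffic was or should have been measured, and the function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def peak_requests_per_minute(timestamps):
    """Return (minute, count) for the busiest one-minute bucket."""
    # Truncate each timestamp to the start of its minute and count per bucket.
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    if not buckets:
        raise ValueError("at least one timestamp is required")
    return max(buckets.items(), key=lambda item: item[1])


# Hypothetical request log, for illustration only.
requests = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 40),
    datetime(2024, 1, 1, 12, 1, 10),
]
minute, count = peak_requests_per_minute(requests)
print(minute, count)  # 2024-01-01 12:00:00 2
```

Comparing a peak figure like this against the capacity of the target system before cutover is one way to catch a load mismatch early.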
Related Concepts
Operational Reliability
Performance Metrics
Database Migration Strategies
User Feedback Mechanisms