Overview
Netflix has announced a significant upgrade to Chaos Monkey, an open-source tool designed to improve the resiliency of microservices by randomly terminating instances in production. The new version, Chaos Monkey 2.0, is easier to maintain, integrates with Spinnaker, offers a simpler way to schedule terminations, and can report each termination to external tracking systems.
What You'll Learn
1. How to configure Chaos Monkey with Spinnaker for enhanced service resiliency
2. Why tracking terminations is crucial for monitoring service health
3. When to use Chaos Monkey for testing redundancy in microservices
Prerequisites & Requirements
- Understanding of microservices architecture and resiliency concepts
- Familiarity with Spinnaker for continuous delivery (optional)
Key Questions Answered
What are the new features in Chaos Monkey 2.0?
Chaos Monkey 2.0 introduces several new features, including integration with Spinnaker for service configuration, an improved user experience for scheduling terminations, and the ability to specify trackers for external notifications. These enhancements aim to streamline the management of instance failures in production environments.
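To make the Spinnaker-managed configuration concrete, here is a minimal Python sketch that parses and validates a per-application settings blob. The JSON key names (`meanTimeBetweenKillsInWorkDays`, `minTimeBetweenKillsInWorkDays`, `grouping`, `regionsAreIndependent`) are modeled on the settings Spinnaker exposes for Chaos Monkey, but treat them as assumptions; `ChaosConfig`, `parse_config`, and the validation rules are hypothetical illustrations, not Chaos Monkey's own code.

```python
import json
from dataclasses import dataclass

# Assumed grouping levels; adjust to what your Chaos Monkey version supports.
VALID_GROUPINGS = {"app", "stack", "cluster"}


@dataclass(frozen=True)
class ChaosConfig:
    enabled: bool
    mean_days: int             # mean time between terminations, in work days
    min_days: int              # minimum time between terminations, in work days
    grouping: str              # unit from which one instance is chosen
    regions_independent: bool  # whether each region is treated as its own group


def parse_config(raw: str) -> ChaosConfig:
    """Parse and validate a JSON settings blob (key names are assumptions)."""
    data = json.loads(raw)
    cfg = ChaosConfig(
        enabled=bool(data.get("enabled", False)),
        mean_days=int(data["meanTimeBetweenKillsInWorkDays"]),
        min_days=int(data["minTimeBetweenKillsInWorkDays"]),
        grouping=data.get("grouping", "cluster"),
        regions_independent=bool(data.get("regionsAreIndependent", True)),
    )
    if cfg.grouping not in VALID_GROUPINGS:
        raise ValueError(f"unknown grouping: {cfg.grouping!r}")
    if cfg.mean_days < 1 or cfg.min_days < 0:
        raise ValueError("mean must be >= 1 work day and minimum >= 0")
    if cfg.min_days > cfg.mean_days:
        raise ValueError("minimum time cannot exceed the mean time")
    return cfg


settings = ('{"enabled": true, "meanTimeBetweenKillsInWorkDays": 2, '
            '"minTimeBetweenKillsInWorkDays": 1, "grouping": "cluster"}')
print(parse_config(settings))
```

Rejecting bad values at parse time means a typo in a service's settings surfaces as an error instead of silently changing how often instances are terminated.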
How does Chaos Monkey improve service resiliency?
Chaos Monkey improves service resiliency by randomly terminating instances during business hours, which forces engineers to build redundancy and automation into their applications. This proactive approach helps ensure that services can withstand unexpected failures without impacting users.
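The business-hours idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Chaos Monkey's scheduler: `plan_termination` is a hypothetical helper, and the 09:00 to 15:00 weekday window is an assumed default. It picks one instance from a group and a random minute inside that window, so failures land when engineers are around to respond.

```python
import random
from datetime import date, datetime, timedelta
from typing import Optional


def plan_termination(day: date, instances: list[str], rng: random.Random,
                     start_hour: int = 9, end_hour: int = 15
                     ) -> Optional[tuple[str, datetime]]:
    """Pick one victim and a random time inside business hours (hypothetical helper)."""
    # Only weekdays (Mon=0 .. Fri=4) and non-empty groups are eligible.
    if day.weekday() >= 5 or not instances:
        return None
    victim = rng.choice(instances)
    # Uniformly random minute in [start_hour:00, end_hour:00).
    window_minutes = (end_hour - start_hour) * 60
    offset = timedelta(minutes=rng.randrange(window_minutes))
    when = datetime(day.year, day.month, day.day, start_hour) + offset
    return victim, when


rng = random.Random(42)  # seeded so the plan is reproducible
print(plan_termination(date(2024, 6, 3), ["i-a", "i-b", "i-c"], rng))
```

Passing the random generator in explicitly keeps the helper deterministic under test while still being random in production.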
What is the primary function of Chaos Monkey?
The primary function of Chaos Monkey is to terminate instances in production environments to test the resilience of applications. By simulating failures, it helps teams ensure that their systems can handle unexpected outages effectively.
What limitations does Chaos Monkey 2.0 have compared to previous versions?
Chaos Monkey 2.0 can only terminate instances, whereas earlier versions supported additional actions such as SSH access and CPU stress testing. Users who rely on those capabilities should weigh this change before upgrading.
Technologies & Tools
Backend
Chaos Monkey
A resiliency tool that helps applications tolerate random instance failures.
Tools
Spinnaker
Continuous delivery platform integrated with Chaos Monkey for managing service configurations.
Monitoring
Atlas
Telemetry platform used for reporting metrics related to Chaos Monkey terminations.
Monitoring
Chronos
Event tracking system used internally to monitor Chaos Monkey activities.
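The tracker idea behind the Atlas and Chronos integrations can be sketched as a small plug-in interface. Everything here is hypothetical: `Termination`, `Tracker`, `record`, and the two in-memory trackers are illustrative stand-ins for a telemetry counter and an event log, not the real Atlas or Chronos clients. Each termination is handed to every configured tracker, and one failing tracker does not stop the others.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Protocol


@dataclass(frozen=True)
class Termination:
    app: str
    instance_id: str
    at: datetime


class Tracker(Protocol):
    def track(self, t: Termination) -> None: ...


@dataclass
class CounterTracker:
    """Stand-in for a telemetry metric, such as a counter reported to Atlas."""
    counts: dict[str, int] = field(default_factory=dict)

    def track(self, t: Termination) -> None:
        self.counts[t.app] = self.counts.get(t.app, 0) + 1


@dataclass
class EventLogTracker:
    """Stand-in for an event-tracking system such as Chronos."""
    events: list[str] = field(default_factory=list)

    def track(self, t: Termination) -> None:
        self.events.append(f"{t.at.isoformat()} terminated {t.instance_id} ({t.app})")


def record(t: Termination, trackers: list[Tracker]) -> list[Exception]:
    """Notify every tracker; collect failures so one bad tracker cannot block the rest."""
    errors: list[Exception] = []
    for tracker in trackers:
        try:
            tracker.track(t)
        except Exception as exc:
            errors.append(exc)
    return errors


counter, log = CounterTracker(), EventLogTracker()
record(Termination("api", "i-123", datetime(2024, 6, 3, 10, 30)), [counter, log])
print(counter.counts, log.events)
```

Returning the collected errors, rather than raising on the first one, lets the caller log tracker outages without losing the termination record in the trackers that did work.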
Key Actionable Insights
1. Integrate Chaos Monkey with Spinnaker to manage termination settings per service. Service owners can then configure Chaos Monkey directly in Spinnaker, which simplifies operations and helps keep deployments resilient against unexpected failures.
2. Use the new tracking feature to monitor instance terminations. By configuring external trackers, teams can receive notifications and report metrics to telemetry platforms, which helps them keep oversight of service health and performance.
3. Schedule terminations by mean time between terminations rather than by arbitrary probabilities. This scheduling model is more intuitive for service owners and aligns chaos testing more closely with their operational needs and redundancy architecture.
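Scheduling by mean time between terminations can be sketched as one random trial per work day. This is a minimal sketch assuming a per-day probability of `1 / mean_days`, which makes the gap between terminations geometric with mean `mean_days`; `termination_days` is a hypothetical helper, not Chaos Monkey's actual scheduler, and the minimum-gap rule is an assumed interpretation of a minimum time between terminations.

```python
import random


def termination_days(num_work_days: int, mean_days: float, min_days: int,
                     rng: random.Random) -> list[int]:
    """Return the work-day indices on which a group would lose an instance.

    Assumes one Bernoulli trial per work day with p = 1 / mean_days. Days that
    fall fewer than `min_days` after the previous termination are skipped,
    which stretches the effective mean when min_days > 1.
    """
    if mean_days < 1:
        raise ValueError("mean_days must be at least 1")
    p = 1.0 / mean_days
    days: list[int] = []
    for day in range(num_work_days):
        too_soon = bool(days) and day - days[-1] < min_days
        if not too_soon and rng.random() < p:
            days.append(day)
    return days


# With mean_days=5, a group loses about one instance per work week on average.
print(termination_days(20, mean_days=5, min_days=1, rng=random.Random(7)))
```

Expressing the schedule as "about once every N work days" is easier to relate to a service's redundancy than a raw daily probability, which is the usability argument behind this insight.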
Common Pitfalls
1. Upgrading to Chaos Monkey 2.0 removes functionality that earlier versions provided.
Users who relied on features such as SSH access or CPU stress testing should evaluate their use cases before upgrading, because the new version no longer supports these capabilities.
Related Concepts
Chaos Engineering
Microservices Resiliency
Continuous Delivery With Spinnaker