ChAP: Chaos Automation Platform

Netflix Technology Blog
6 min read, intermediate

Overview

ChAP, the Chaos Automation Platform, extends Netflix's chaos engineering capabilities by enabling controlled experiments at the microservice level. It builds on earlier tools such as Chaos Monkey and FIT, making it safer and more efficient to test how resilient the system is to failures.

What You'll Learn

1. How to run controlled chaos experiments in production environments
2. Why ChAP improves upon previous chaos engineering tools like FIT
3. When to apply concentrated experiments to test system resilience

Prerequisites & Requirements

  • Understanding of chaos engineering principles
  • Familiarity with Netflix's CI/CD tools like Spinnaker (optional)

Key Questions Answered

How does ChAP enhance chaos engineering at Netflix?
ChAP makes chaos experiments more controlled by launching an experiment cluster and a control cluster for the microservice under test. It routes a small amount of traffic to each cluster and applies a specific failure scenario to the experiment cluster, so system resilience can be analyzed without significantly impacting customer experience.

What are the benefits of running concentrated experiments?
Concentrated experiments inject failures or latency into a higher proportion of the traffic under test while limiting user impact. By directing specific traffic to the experiment clusters, ChAP can test system responses under stress without affecting the entire production environment.

What is the role of automation in ChAP?
ChAP includes a circuit breaker that ends an experiment when a predefined error budget is exceeded, so experiments can run unsupervised. ChAP also integrates with existing canary analysis processes.

When should Netflix engineers use ChAP for experiments?
Engineers should use ChAP when they need to test a microservice's resilience against specific failure scenarios without disrupting the overall customer experience. This is particularly useful when deploying new features or updates that may introduce risk.
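
The circuit breaker described above can be illustrated with a minimal sketch. This is not ChAP's implementation: the names ClusterStats and should_abort are hypothetical, and the rule used here (abort when the experiment cluster's error rate exceeds the control cluster's by more than the budget) is an assumption about how an error budget might be checked.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Request and error counts observed for one cluster (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid division by zero before any traffic has arrived.
        return self.errors / self.requests if self.requests else 0.0


def should_abort(control: ClusterStats, experiment: ClusterStats,
                 error_budget: float) -> bool:
    """Return True when the experiment's error rate exceeds the control's
    by more than error_budget (an assumed rule, not ChAP's actual one)."""
    return experiment.error_rate - control.error_rate > error_budget


if __name__ == "__main__":
    control = ClusterStats(requests=1000, errors=5)
    experiment = ClusterStats(requests=1000, errors=80)
    if should_abort(control, experiment, error_budget=0.05):
        print("error budget exceeded: ending experiment")
```

Comparing against the control cluster, rather than against a fixed threshold, means a platform-wide error spike unrelated to the experiment would not by itself trip the breaker.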

Technologies & Tools

Chaos Engineering Tool
ChAP
Used to automate chaos experiments in production environments.
CI/CD Tool
Spinnaker
Integrated with ChAP to run experiments continuously.

Key Actionable Insights

1. Implement ChAP to enhance your chaos engineering practices by allowing for controlled experimentation at the microservice level.
   Using ChAP can help identify vulnerabilities in your system without risking customer experience, making it a valuable tool for continuous improvement.
2. Utilize concentrated experiments to test system resilience under stress while minimizing user impact.
   This approach allows for effective testing of system behavior during failure scenarios, providing clearer insights into performance metrics.
3. Integrate ChAP with your CI/CD pipeline to ensure ongoing resilience testing.
   By automating experiments through your deployment process, you can continuously identify and mitigate potential issues before they affect users.
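
To make the idea of routing a small amount of traffic to an experiment cluster and a control cluster concrete, here is a minimal sketch. It assumes requests carry a stable identifier and uses a hash of that identifier to pick a cluster. The function name assign_cluster, the cluster labels, and the default fraction are hypothetical and are not taken from ChAP.

```python
import hashlib

BUCKETS = 10_000  # resolution of the traffic split (assumed)


def assign_cluster(request_id: str, fraction: float = 0.01) -> str:
    """Deterministically route a request to 'experiment', 'control',
    or 'production'. Each of the first two receives roughly `fraction`
    of all traffic; the rest stays on the normal production cluster."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto an integer bucket 0..BUCKETS-1.
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    threshold = round(fraction * BUCKETS)
    if bucket < threshold:
        return "experiment"
    if bucket < 2 * threshold:
        return "control"
    return "production"


if __name__ == "__main__":
    counts = {"experiment": 0, "control": 0, "production": 0}
    for i in range(20_000):
        counts[assign_cluster(f"request-{i}")] += 1
    print(counts)
```

Hashing keeps the assignment stable for a given identifier, and sizing the control cluster's share to match the experiment's gives the analysis two comparable populations.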

Common Pitfalls

1. Running experiments that are too large can lead to unnecessary disruptions for customers.
   To avoid this, ChAP emphasizes the importance of keeping experiments small and focused, ensuring that customer experience remains unaffected.

Related Concepts

Chaos Engineering
Microservices
Resilience Testing
Continuous Integration/Continuous Deployment