Overview
SafetyCulture documents its migration from Helm-based deployment pipelines to GitOps with ArgoCD for hundreds of microservices across multiple Kubernetes clusters. The article details the decision to build a custom DSL in the CUE language, a zero-downtime cutover strategy involving temporary suffixed deployments, and the key lessons learned from migrating approximately 250-300 applications across 20 teams.
What You'll Learn
How to migrate hundreds of microservices from Helm to ArgoCD with zero customer downtime
How to build a domain-specific language in CUE to replace Helm chart configurations
Why GitOps eliminates configuration drift and improves audit capabilities in multi-cluster Kubernetes environments
How to implement a controlled cutover strategy using temporary suffixed deployments for zero-downtime migrations
When to adopt a team-by-team migration strategy versus a big-bang approach for large infrastructure changes
Prerequisites & Requirements
- Understanding of Kubernetes concepts including deployments, clusters, and namespaces
- Familiarity with Helm charts, values files, and templating
- Basic understanding of GitOps principles and declarative configuration management
- Experience operating microservices at scale across multiple Kubernetes clusters (optional)
- Familiarity with ArgoCD or similar continuous delivery tools for Kubernetes (optional)
Key Questions Answered
How do you migrate from Helm to ArgoCD with zero downtime?
Why choose CUE over Helm for Kubernetes configuration management?
What problems does configuration drift cause in Kubernetes deployments?
How does ArgoCD improve disaster recovery for Kubernetes applications?
What is a team-by-team migration strategy and why use it for ArgoCD adoption?
What are the limitations of Helm-based deployment pipelines at scale?
How do ArgoCD PreSync hooks help with database migrations?
How does environment-wide deployment in ArgoCD differ from per-cluster deployment?
Key Actionable Insights
1. Start infrastructure migrations with high-value, low-risk services to validate your approach before tackling complex ones. Begin with simpler, less critical services that allow you to refine the migration process with minimal business impact, then progressively move to more complex services as confidence grows. SafetyCulture used this approach across 250-300 applications and 20 teams, incorporating feedback from early migrations to improve the process for later ones.
2. Invest in custom abstractions like a domain-specific language early in a migration rather than directly porting existing configurations. Building a CUE-based DSL that abstracts away Kubernetes complexity provides type safety, schema validation, and a cleaner interface for engineering teams, even though it requires upfront investment. SafetyCulture's CUE DSL prevented entire categories of configuration errors at the DSL level, validated against Kubernetes and CRD schemas before deployment, and provided a more intuitive interface than Helm templating.
3. Implement a controlled cutover strategy using temporary suffixed deployments to achieve zero-downtime migrations. Create new deployment resources alongside existing ones and only remove the originals once the new resources are verified healthy and serving traffic, ensuring no customer disruption if issues occur. The argoMigration flag pattern with -temp suffixed deployments allowed SafetyCulture to safely transition each service, with the ability to roll back if the new deployment wasn't healthy.
4. Respect muscle memory by keeping familiar interfaces while changing the implementation behind the scenes. When possible, maintain similar workflows and gradually introduce new capabilities rather than requiring completely new processes, as teams adapt more quickly when existing patterns are preserved. SafetyCulture learned that people develop strong habits with daily tools. Overhauling everything at once created more resistance than gradually evolving workflows.
5. Plan for scale from the beginning by thoroughly testing infrastructure changes at production-level volume before deployment. Avoid the pattern of reacting to performance issues in production, as having to scale components while they're under stress creates unnecessary complexity. SafetyCulture experienced ArgoCD performance degradation as they added more resources, particularly in their development cluster with the highest application count and fastest change rate, leading to slow reconciliation times and resource exhaustion.
6. Communicate benefits rather than just technical changes when driving adoption of new infrastructure. Focus on how GitOps solves each team's specific pain points (such as eliminating manual redeployments or preventing configuration drift) rather than explaining the technical details of the migration. SafetyCulture found that teams adapted more quickly when the messaging was centered on solving their existing frustrations rather than on the mechanics of ArgoCD and CUE.
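The controlled cutover in insight 3 can be sketched in a few lines. What follows is a minimal in-memory simulation, not SafetyCulture's implementation: only the -temp suffix comes from the text, while `Cluster`, `Deployment`, `cut_over`, and the `is_healthy` callback are hypothetical names invented for illustration, and no real Kubernetes or ArgoCD API is called. The sketch stops once the original deployment is removed, since this summary does not describe what happens to the suffixed name afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

TEMP_SUFFIX = "-temp"  # suffix named in the text; all other names are hypothetical


@dataclass
class Deployment:
    name: str
    managed_by: str  # illustrative label: "helm" or "argocd"


@dataclass
class Cluster:
    """In-memory stand-in for a namespace; makes no real API calls."""
    deployments: Dict[str, Deployment] = field(default_factory=dict)

    def apply(self, deployment: Deployment) -> None:
        self.deployments[deployment.name] = deployment

    def delete(self, name: str) -> None:
        self.deployments.pop(name, None)


def cut_over(cluster: Cluster, service: str,
             is_healthy: Callable[[str], bool]) -> bool:
    """Cut one service over to a suffixed deployment; return True on success."""
    if service not in cluster.deployments:
        raise KeyError(f"no existing deployment named {service!r}")
    temp_name = service + TEMP_SUFFIX

    # Step 1: create the new deployment alongside the original.
    cluster.apply(Deployment(name=temp_name, managed_by="argocd"))

    # Step 2: verify the new deployment before touching the original.
    if not is_healthy(temp_name):
        # Roll back: drop the new copy and leave the original in place.
        cluster.delete(temp_name)
        return False

    # Step 3: the new deployment is healthy, so remove the original.
    cluster.delete(service)
    return True


if __name__ == "__main__":
    cluster = Cluster()
    cluster.apply(Deployment("checkout", "helm"))
    ok = cut_over(cluster, "checkout", is_healthy=lambda name: True)
    print(ok, sorted(cluster.deployments))
```

The point of the ordering is that the original deployment is deleted only after the health check passes, so a failed check leaves the service exactly as it was before the attempt.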