Our Journey to GitOps: Migrating to ArgoCD with Zero Downtime

Andrew Jeffree

SafetyCulture

13 min read • advanced

Git, Golang, Helm, Istio, Kubernetes, YAML

Overview

SafetyCulture documents its migration from Helm-based deployment pipelines to GitOps with ArgoCD for hundreds of microservices across multiple Kubernetes clusters. The article details the decision to build a custom DSL in CUE, the zero-downtime cutover strategy built on temporary suffixed deployments, and the key lessons learned from migrating approximately 250-300 applications across roughly 20 teams.

What You'll Learn

1

How to migrate hundreds of microservices from Helm to ArgoCD with zero customer downtime

2

How to build a domain-specific language (DSL) in CUE to replace Helm chart configurations

3

Why GitOps eliminates configuration drift and improves audit capabilities in multi-cluster Kubernetes environments

4

How to implement a controlled cutover strategy using temporary suffixed deployments for zero-downtime migrations

5

When to adopt a team-by-team migration strategy versus a big-bang approach for large infrastructure changes

Prerequisites & Requirements

Understanding of Kubernetes concepts including deployments, clusters, and namespaces
Familiarity with Helm charts, values files, and templating
Basic understanding of GitOps principles and declarative configuration management
Experience operating microservices at scale across multiple Kubernetes clusters (optional)
Familiarity with ArgoCD or similar continuous delivery tools for Kubernetes (optional)

Key Questions Answered

How do you migrate from Helm to ArgoCD with zero downtime?

SafetyCulture cut each service over in six steps: (1) add the service to ArgoCD with sync disabled to observe drift; (2) align the configurations; (3) block the service's pipeline for about 30 minutes during the cutover; (4) annotate the existing deployments so that ArgoCD adopts them; (5) create temporary suffixed deployments that replace the originals only once they are healthy; and (6) remove the migration flag so that standard deployments are created.
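The ordering of these steps matters, since each one assumes the previous one has finished. The following is a minimal sketch of that ordering as a small state tracker. It does not call ArgoCD or Kubernetes; the class names, step names, and the service name in the usage note are assumptions made for illustration, not SafetyCulture's tooling.

```python
from enum import IntEnum


class Step(IntEnum):
    """Cutover phases, numbered in the order the answer above lists them."""
    ADDED_WITH_SYNC_DISABLED = 1
    CONFIG_ALIGNED = 2
    PIPELINE_BLOCKED = 3
    DEPLOYMENTS_ANNOTATED = 4
    TEMP_DEPLOYMENT_HEALTHY = 5
    MIGRATION_FLAG_REMOVED = 6


class Cutover:
    """Tracks one service's cutover and rejects steps taken out of order."""

    def __init__(self, service):
        self.service = service
        self.completed = []

    @property
    def done(self):
        # True once every step in Step has been recorded.
        return len(self.completed) == len(Step)

    def advance(self, step):
        if self.done:
            raise ValueError(f"{self.service}: cutover already complete")
        # The next allowed step is always the one after the last completed.
        expected = Step(len(self.completed) + 1)
        if step != expected:
            raise ValueError(
                f"{self.service}: expected {expected.name}, got {step.name}"
            )
        self.completed.append(step)
```

Calling `Cutover("example-service").advance(Step.PIPELINE_BLOCKED)` on a fresh tracker raises `ValueError`, because the service has not yet been added to ArgoCD with sync disabled.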

Why choose CUE over Helm for Kubernetes configuration management?

CUE offers a powerful type system that catches configuration errors at the DSL level rather than at deployment time. It can import and validate against Go types and OpenAPI schemas from Kubernetes and from CRDs such as Istio's, it provides more elegant abstraction mechanisms than Helm's template functions, and it offers a programming-like experience that reduces the learning curve for engineers.
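To make "errors caught at the DSL level" concrete, here is a rough Python analogue, not CUE itself: a definition that rejects invalid values when it is constructed, before anything is rendered or deployed. The `ServiceSpec` class and its fields are hypothetical and are not taken from SafetyCulture's DSL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical service definition; field names are illustrative only."""
    name: str
    replicas: int
    port: int

    def __post_init__(self):
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")
        for field_name in ("replicas", "port"):
            value = getattr(self, field_name)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{field_name} must be an integer, got {value!r}")
        if self.replicas < 1:
            raise ValueError("replicas must be at least 1")
        if not 1 <= self.port <= 65535:
            raise ValueError("port must be between 1 and 65535")
```

A definition such as `ServiceSpec(name="example-service", replicas="3", port=8080)` fails immediately with `TypeError`. In a plain values file, the same mistake might only surface once the manifest reaches the cluster.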

What problems does configuration drift cause in Kubernetes deployments?

Configuration drift became a significant problem: manual changes made during incidents, such as scaling adjustments applied through separate pipelines, were overwritten by the next deployment because nothing reconciled them with the declared configuration. Services then reverted to problematic states, which undermined incident response and made behavior across clusters unpredictable.
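Drift can be pictured as the difference between what is declared and what is actually running. The sketch below compares two flat dictionaries and models reconciliation as resetting the live state to the declared one. The function names and example values are assumptions for illustration, and a real controller compares full Kubernetes objects rather than flat dictionaries.

```python
def find_drift(desired, live):
    """Return {field: (desired_value, live_value)} for every field that differs."""
    keys = desired.keys() | live.keys()
    return {
        key: (desired.get(key), live.get(key))
        for key in sorted(keys)
        if desired.get(key) != live.get(key)
    }


def reconcile(desired, live):
    """Model reconciliation: the live state is reset to the declared state."""
    return dict(desired)


if __name__ == "__main__":
    declared = {"replicas": 3, "image": "example-service:1.4.2"}
    # Someone scaled the service by hand during an incident.
    running = {"replicas": 10, "image": "example-service:1.4.2"}
    print(find_drift(declared, running))
```

In this model a manual scale-up shows up as a visible difference on `replicas`. For the change to survive, it has to be made in the declared state, which is the property the GitOps approach relies on.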

How does ArgoCD improve disaster recovery for Kubernetes applications?

With ArgoCD, the entire application state is defined in Git, allowing teams to quickly redeploy services after a cluster failure with confidence that they match the previous state. Recovery time is significantly reduced because there's no need to manually reconstruct configurations or trace through multiple deployment pipelines.

What is a team-by-team migration strategy and why use it for ArgoCD adoption?

SafetyCulture migrated approximately 250-300 applications across roughly 20 teams by focusing on one team at a time. This approach allowed dedicated support for each team, ensured understanding of the new system before moving on, enabled process refinement based on feedback from early migrations, and minimized business risk by starting with simpler, less critical services.

What are the limitations of Helm-based deployment pipelines at scale?

At scale, the Helm pipelines suffered from several problems: forced use of the latest chart versions pushed untested changes to production; per-cluster pipelines created unofficial canary clusters whose customers saw more issues; configuration changes required manual redeployments; the bash scripts grew increasingly complex; Helm rendering was slow; layered inheritance made the YAML hard to read; and manual interventions during incidents caused configuration drift.

How do ArgoCD PreSync hooks help with database migrations?

ArgoCD PreSync hooks allow database schema changes to be automatically applied before related application changes are deployed. This decouples CI from CD by ensuring proper sequencing of database migrations without requiring pipeline coordination, making the deployment process more reliable and reducing the risk of application errors from schema mismatches.
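As an illustration, a PreSync hook is typically an ordinary Kubernetes Job carrying the `argocd.argoproj.io/hook: PreSync` annotation. The sketch below builds such a manifest as a Python dictionary. The Job name, image, and command are hypothetical placeholders; the article summary does not show SafetyCulture's actual hook.

```python
import json


def presync_migration_job(service, image, command):
    """Build a Job manifest that ArgoCD runs before syncing other resources."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{service}-db-migrate",  # hypothetical naming scheme
            "annotations": {
                # Marks the Job as a hook that runs before the main sync.
                "argocd.argoproj.io/hook": "PreSync",
                # Removes the Job once it has completed successfully.
                "argocd.argoproj.io/hook-delete-policy": "HookSucceeded",
            },
        },
        "spec": {
            "backoffLimit": 0,  # do not retry a failed migration automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "migrate", "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    job = presync_migration_job(
        "example-service", "example/migrator:1.0.0", ["/migrate", "up"]
    )
    print(json.dumps(job, indent=2))
```

If the Job fails, the sync does not proceed to the application resources, which is what keeps the schema change and the application change in sequence.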

How does environment-wide deployment in ArgoCD differ from per-cluster deployment?

Instead of deploying to individual clusters sequentially, ArgoCD deploys across all clusters in an environment simultaneously. This eliminates the unofficial canary cluster problem where some customers experienced more issues, forces teams to conduct more thorough testing in earlier environments, and ensures a uniform experience for all users.
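The difference can be sketched as a fan-out: one declared version expands to a target for every cluster in the environment, so no cluster is left ahead of or behind the others. The environment names and cluster names below are invented for illustration and say nothing about SafetyCulture's real topology or about how ArgoCD is configured there.

```python
# Hypothetical environment-to-cluster mapping.
CLUSTERS = {
    "staging": ["staging-a", "staging-b"],
    "production": ["prod-a", "prod-b", "prod-c"],
}


def rollout_targets(environment, service, version, clusters=CLUSTERS):
    """Expand one declared version into a target per cluster in the environment."""
    if environment not in clusters:
        raise KeyError(f"unknown environment: {environment}")
    return [
        {"cluster": cluster, "service": service, "version": version}
        for cluster in clusters[environment]
    ]


if __name__ == "__main__":
    for target in rollout_targets("production", "example-service", "1.4.2"):
        print(target)
```

Because every target is derived from the same declared version, there is no step at which a single production cluster can quietly act as the canary.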

Key Statistics & Figures

Microservices migrated

250-300 applications

Total applications migrated across the platform

Teams involved in migration

~20 teams

Teams migrated using a team-by-team strategy

Pipeline block duration during cutover

~30 minutes

Service pipelines were temporarily blocked during each cutover to prevent new deployments

Scale description

Hundreds of microservices

Services deployed across multiple Kubernetes clusters

Technologies & Tools

Continuous Delivery

ArgoCD

GitOps-based continuous delivery tool for Kubernetes that manages service deployments and auto-reconciliation

Configuration Language

CUE

Used to build a domain-specific language for defining microservice configurations with type safety and schema validation

Package Manager

Helm

Previous deployment tool using standardized charts for microservices and cronjobs, replaced by ArgoCD and CUE

Container Orchestration

Kubernetes

Underlying platform running hundreds of microservices across multiple clusters

Service Mesh

Istio

Service mesh configurations managed through standardized Helm charts and later through CUE DSL

Monitoring

Prometheus

PrometheusRules implemented as part of platform standards across services

Deployment Strategy

Argo Rollouts

Being implemented for canary deployments with manual promotion steps aligned across all clusters in an environment

Version Control

Git

Single source of truth for all application state and configuration in the GitOps model

Key Actionable Insights

1
Start infrastructure migrations with high-value, low-risk services to validate your approach before tackling complex ones. Begin with simpler, less critical services that allow you to refine the migration process with minimal business impact, then progressively move to more complex services as confidence grows.
SafetyCulture used this approach across 250-300 applications and roughly 20 teams, incorporating feedback from early migrations to improve the process for later ones.

2
Invest in custom abstractions like a domain-specific language early in a migration rather than directly porting existing configurations. Building a CUE-based DSL that abstracts away Kubernetes complexity provides type safety, schema validation, and a cleaner interface for engineering teams, even though it requires upfront investment.
SafetyCulture's CUE DSL prevented entire categories of configuration errors at the DSL level, validated against Kubernetes and CRD schemas before deployment, and provided a more intuitive interface than Helm templating.

3
Implement a controlled cutover strategy using temporary suffixed deployments to achieve zero-downtime migrations. Create new deployment resources alongside existing ones and only remove the originals once the new resources are verified healthy and serving traffic, ensuring no customer disruption if issues occur.
The argoMigration flag pattern with -temp suffixed deployments allowed SafetyCulture to safely transition each service, with the ability to roll back if the new deployment wasn't healthy.
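A minimal sketch of the health-gated swap follows, under the assumption that the argoMigration flag simply switches a service to a -temp suffixed deployment name. How the flag is actually implemented in SafetyCulture's DSL is not described, so the function names and the `is_healthy` callback are hypothetical.

```python
TEMP_SUFFIX = "-temp"


def deployment_name(service, argo_migration):
    """Name of the deployment to create; suffixed while the flag is set."""
    return service + TEMP_SUFFIX if argo_migration else service


def cut_over(live, service, is_healthy):
    """Return the set of deployment names present after one cutover attempt.

    live: names of deployments that currently exist.
    is_healthy: callable taking a deployment name and returning True or False.
    """
    temp = deployment_name(service, argo_migration=True)
    after = set(live) | {temp}  # the temporary deployment runs alongside the original
    if is_healthy(temp):
        after.discard(service)  # the original goes away only once temp is healthy
    else:
        after.discard(temp)  # roll back: the original was never touched
    return after
```

With a healthy temporary deployment, `cut_over({"example-service"}, "example-service", lambda name: True)` leaves only `example-service-temp`. With an unhealthy one, the original set is returned unchanged, so customers keep being served by the existing deployment.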

4
Respect muscle memory by keeping familiar interfaces while changing the implementation behind the scenes. When possible, maintain similar workflows and gradually introduce new capabilities rather than requiring completely new processes, as teams adapt more quickly when existing patterns are preserved.
SafetyCulture learned that people develop strong habits with daily tools. Overhauling everything at once created more resistance than gradually evolving workflows.

5
Plan for scale from the beginning by thoroughly testing infrastructure changes at production-level volume before deployment. Avoid the pattern of reacting to performance issues in production, as having to scale components while they're under stress creates unnecessary complexity.
SafetyCulture experienced ArgoCD performance degradation as they added more resources, particularly in their development cluster with the highest application count and fastest change rate, leading to slow reconciliation times and resource exhaustion.

6
Communicate benefits rather than just technical changes when driving adoption of new infrastructure. Focus on how GitOps solves each team's specific pain points, such as eliminating manual redeployments or preventing configuration drift, rather than explaining the technical details of the migration.
SafetyCulture found that teams adapted more quickly when the messaging was centered on solving their existing frustrations rather than on the mechanics of ArgoCD and CUE.

Common Pitfalls

1

Forcing deployment pipelines to always use the latest version of Helm charts means untested chart changes can reach production before completing testing in development and staging environments. This creates a risk where infrastructure changes bypass the normal validation pipeline.

SafetyCulture experienced this when chart changes would go into production before testing was completed in lower environments. Pinning chart versions and promoting them through environments would have prevented this.
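The promotion rule suggested here can be sketched as a simple gate: a pinned chart version may move to an environment only after it has been validated in every earlier one. The environment names, the version numbers, and the shape of the `validated` record are assumptions for illustration only.

```python
# Hypothetical promotion order, lowest environment first.
ENV_ORDER = ["development", "staging", "production"]


def can_promote(chart_version, target_env, validated):
    """True if chart_version has been validated in every environment before target_env.

    validated: dict mapping environment name to the set of chart versions
    that have passed testing there.
    """
    earlier = ENV_ORDER[: ENV_ORDER.index(target_env)]
    return all(chart_version in validated.get(env, set()) for env in earlier)


if __name__ == "__main__":
    validated = {
        "development": {"1.2.0", "1.3.0"},
        "staging": {"1.2.0"},
    }
    print(can_promote("1.2.0", "production", validated))
    print(can_promote("1.3.0", "production", validated))
```

In this example chart 1.3.0 has passed in development but not in staging, so the gate blocks it from production, whereas a pipeline that always pulls the latest chart would have shipped it.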

2

Using per-cluster deployment pipelines naturally leads teams to treat one production cluster as an unofficial canary, resulting in significantly more issues for customers on that cluster compared to others. This creates an uneven customer experience without the proper guardrails of an intentional canary deployment strategy.

This pattern emerged organically at SafetyCulture and was only resolved by moving to environment-wide simultaneous deployments with ArgoCD.

3

Not planning for ArgoCD performance at scale can lead to controller resource exhaustion, slow reconciliation times, and degraded performance as application count grows. The development cluster with the highest application count and fastest change rate is particularly susceptible.

SafetyCulture experienced significant performance degradation that required reactive scaling of ArgoCD components under stress, which could have been avoided with upfront load testing at production scale.

4

Building a DSL without sufficient early user feedback results in missing edge cases and configurations that only surface during later team migrations. The initial implementation may not cover the diverse configuration needs across all teams, requiring rapid enhancement during the migration.

SafetyCulture acknowledged they should have involved users even more during development, as their current interface still leaves room for improvement.

5

Manual configuration changes during incidents create configuration drift that gets silently overwritten by the next deployment, causing services to revert to problematic states. Without auto-reconciliation, emergency scaling adjustments or hotfixes are lost, potentially re-triggering the original incident.

This was one of the key pain points that motivated SafetyCulture's migration to ArgoCD, which provides automatic reconciliation to prevent drift.

Related Concepts

GitOps

Continuous Delivery

Infrastructure as Code

Kubernetes Operators

Configuration Management

Canary Deployments

Service Mesh

Declarative Configuration

Immutable Infrastructure

Platform Engineering

Developer Experience

Blue-green Deployments

Horizontal Pod Autoscaling

CRDs (Custom Resource Definitions)

OpenAPI Schema Validation