This post builds on our earlier work modernising Slack’s Chef infrastructure. Instead of a disruptive migration to Policyfiles, we focused on practical improvements to our existing EC2 and Chef frameworks, delivering safer, more reliable deploys with minimal change for our service owners.
Overview
This article details Slack's approach to making Chef infrastructure deployments safer by splitting a single production Chef environment into six bucketed environments (prod-1 through prod-6) mapped to AWS Availability Zones, and building a new signal-based service called Chef Summoner to replace cron-based Chef runs. The changes introduce a release train model for rolling out cookbook changes progressively, significantly reducing blast radius during deployments while maintaining compliance through fallback mechanisms.
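As a rough illustration of the release train idea, the sketch below models one promotion step: the canary environment always takes the newest version, while the remaining environments advance one at a time and only start on a new version once the previous one has reached the last environment. The environment names come from the article; the `advance` function, the integer version numbers, and the exact promotion rules are assumptions made for illustration, not Slack's implementation.

```python
# Environment names are from the article; everything else is a hypothetical simplification.
CANARY = "prod-1"
TRAIN = ["prod-2", "prod-3", "prod-4", "prod-5", "prod-6"]


def advance(envs: dict[str, int], latest: int) -> dict[str, int]:
    """Return the environment versions after one promotion step."""
    nxt = dict(envs)
    # The canary takes every new version as soon as it exists.
    nxt[CANARY] = latest

    head = envs[TRAIN[0]]
    lagging = [env for env in TRAIN if envs[env] != head]
    if lagging:
        # A train is in flight: move it forward by exactly one environment.
        nxt[lagging[0]] = head
    elif envs[CANARY] > head:
        # The previous train has fully propagated: start a new one with
        # the version the canary was running before this step.
        nxt[TRAIN[0]] = envs[CANARY]
    return nxt
```

Calling `advance` once per promotion interval moves at most one non-canary environment at a time, so a bad version that slips past the canary reaches only one additional bucket per step.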
What You'll Learn
- How to split a single production Chef environment into multiple availability-zone-based buckets to reduce blast radius
- How to implement a release train model for progressive infrastructure deployments across multiple environments
- How to replace cron-based Chef runs with a signal-driven service that triggers runs only when new artifacts are available
- Why canary environments with frequent updates catch production issues earlier than batched rollouts
- How to build fallback mechanisms that prevent a broken deployment tool from blocking its own recovery
Prerequisites & Requirements
- Understanding of Chef configuration management concepts including cookbooks, environments, roles, and Chef runs
- Familiarity with AWS EC2 concepts including Availability Zones, AMIs, and cloud-init
- Understanding of progressive deployment strategies such as canary deployments and release trains (optional)
- Reading the previous blog post 'Advancing Our Chef Infrastructure' for full context on Chef Librarian and the multi-stack model
- Familiarity with S3, Kubernetes cron jobs, and infrastructure-as-code tooling (optional)
Key Questions Answered
- How does Slack split Chef environments to reduce deployment blast radius?
- What is the release train model for Chef cookbook deployments at Slack?
- How does Chef Summoner replace cron-based Chef runs?
- Why did Slack choose not to migrate to Chef Policyfiles?
- How does Slack prevent a broken Chef Summoner from blocking its own recovery?
- What is Chef Librarian and how does it work with Chef Summoner?
- Why does prod-1 update more frequently than other production environments?
- What is Shipyard and why is Slack building it?
Key Actionable Insights
1. Split production environments by availability zone to contain blast radius. Rather than having all production nodes share a single configuration source, map each node to a zone-specific environment during bootstrapping. This ensures that bad configurations only affect a subset of your fleet initially, giving you time to detect and fix issues before they propagate everywhere. This is especially critical during scale-out events where dozens or hundreds of new nodes could simultaneously pick up a broken configuration from a shared environment.
2. Use a canary environment that updates frequently with the latest changes, separate from the rest of production that follows a slower release train. Prod-1 receiving every new version hourly catches issues close to when they're introduced, while prod-2 through prod-6 only advance after the previous version has fully propagated. This dual-speed approach balances early detection with safety. Testing large cumulative changes makes it harder to identify which specific change caused a problem. Small, frequent canary updates make root cause analysis significantly easier.
3. Replace fixed-schedule cron-based configuration runs with signal-driven triggers. When environments update at variable times due to staggered rollouts, a fixed schedule can't predict when new changes will be available. A signal-based system that only triggers runs when actual updates exist improves both efficiency and safety. Slack built Chef Summoner to poll S3 for signals from Chef Librarian, using configurable splay values to stagger runs within each environment and avoid load spikes.
4. Always build a fallback recovery mechanism for critical deployment infrastructure. If your deployment trigger service itself is deployed by the same system it manages, a broken version could prevent its own fix from being rolled out. Bake in an independent safety net (like a cron job) that can trigger deployments when the primary system fails. Slack's fallback cron checks Chef Summoner's local state and triggers Chef directly if no run has occurred in 12 hours, ensuring the system can always self-heal.
5. Use S3 as a lightweight signaling mechanism between infrastructure services rather than building complex pub/sub systems. By having Chef Librarian write JSON signals to S3 on each promotion and having Chef Summoner poll those signals, Slack created a decoupled, durable communication channel that doesn't require managing additional messaging infrastructure. The S3 signal contains all necessary metadata including version, cookbook versions, splay configuration, and commit hash, making it a self-contained deployment manifest.
6. When considering major infrastructure migrations like Chef Policyfiles, evaluate the migration cost against incremental improvements to existing systems. Sometimes improving the current system with environment splitting and progressive rollouts achieves similar safety gains without requiring dozens of teams to rewrite their configurations. Slack chose incremental improvement over a disruptive migration, reserving the complete rebuild (Shipyard) for when they had the resources and design to do it properly.
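The signal-driven trigger and its fallback from the insights above can be sketched as a few pure functions. This is a minimal sketch, not Chef Summoner's actual code: the JSON field names (`version`, `splay`), the `Signal` class, and the function names are hypothetical, and fetching the signal from S3 and invoking Chef are left out. Only the 12-hour fallback threshold and the use of a splay to stagger runs come from the article.

```python
import json
import random
from dataclasses import dataclass
from typing import Optional

# From the article: the fallback triggers Chef if no run has occurred in 12 hours.
FALLBACK_AFTER_SECONDS = 12 * 60 * 60


@dataclass
class Signal:
    version: str        # hypothetical field name
    splay_seconds: int  # hypothetical field name


def parse_signal(raw: str) -> Signal:
    """Parse a JSON signal body (assumed layout) into a Signal."""
    data = json.loads(raw)
    return Signal(version=str(data["version"]), splay_seconds=int(data.get("splay", 0)))


def run_delay(signal: Signal, last_applied: Optional[str], rng: random.Random) -> Optional[float]:
    """Return how long to wait before running Chef, or None if nothing is new."""
    if signal.version == last_applied:
        return None
    # A random delay within the splay window spreads runs across the environment.
    return rng.uniform(0, signal.splay_seconds)


def fallback_due(last_run_epoch: float, now_epoch: float) -> bool:
    """True when the independent fallback should trigger Chef directly."""
    return now_epoch - last_run_epoch >= FALLBACK_AFTER_SECONDS
```

Keeping `fallback_due` independent of the signal-handling path mirrors the design point in insight 4: the safety net should keep working even when the primary trigger is broken.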