Advancing Our Chef Infrastructure: Safety Without Disruption

This post builds on our earlier work modernising Slack’s Chef infrastructure. Instead of a disruptive migration to Policyfiles, we focused on practical improvements to our existing EC2 and Chef frameworks - delivering safer, more reliable deploys with minimal change for our service owners.

Archie Gunasekara

Overview

This article details Slack's approach to making Chef infrastructure deployments safer by splitting a single production Chef environment into six bucketed environments (prod-1 through prod-6) mapped to AWS Availability Zones, and building a new signal-based service called Chef Summoner to replace cron-based Chef runs. The changes introduce a release train model for rolling out cookbook changes progressively, significantly reducing blast radius during deployments while maintaining compliance through fallback mechanisms.

What You'll Learn

1. How to split a single production Chef environment into multiple availability-zone-based buckets to reduce blast radius
2. How to implement a release train model for progressive infrastructure deployments across multiple environments
3. How to replace cron-based Chef runs with a signal-driven service that triggers runs only when new artifacts are available
4. Why canary environments with frequent updates catch production issues earlier than batched rollouts
5. How to build fallback mechanisms that prevent a broken deployment tool from blocking its own recovery

Prerequisites & Requirements

  • Understanding of Chef configuration management concepts including cookbooks, environments, roles, and Chef runs
  • Familiarity with AWS EC2 concepts including Availability Zones, AMIs, and cloud-init
  • Understanding of progressive deployment strategies such as canary deployments and release trains (optional)
  • Reading the previous blog post 'Advancing Our Chef Infrastructure' for full context on Chef Librarian and the multi-stack model
  • Familiarity with S3, Kubernetes cron jobs, and infrastructure-as-code tooling (optional)

Key Questions Answered

How does Slack split Chef environments to reduce deployment blast radius?
Slack splits the single production Chef environment into six buckets (prod-1 through prod-6), mapping each instance to a specific environment based on its AWS Availability Zone ID. This is done during instance bootstrapping by a tool called Poptart Bootstrap, which inspects the AZ ID and assigns the appropriate environment. This ensures new nodes don't all pull from one global configuration source, and changes roll out gradually across AZs rather than fleet-wide.
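To make the bucketing idea concrete, here is a minimal Python sketch of mapping an AZ ID to a bucketed environment. The article does not describe the real Poptart Bootstrap logic, so the function name `chef_environment_for_az` and the rule "trailing AZ number selects the bucket" are assumptions for illustration only.

```python
import re

NUM_BUCKETS = 6  # prod-1 through prod-6, as described in the article


def chef_environment_for_az(az_id: str, base: str = "prod") -> str:
    """Map an AWS AZ ID such as 'use1-az4' to a bucketed Chef environment.

    Assumption: the trailing AZ number picks the bucket. The actual
    mapping used by Poptart Bootstrap is not given in the article.
    """
    match = re.search(r"-az(\d+)$", az_id)
    if match is None:
        raise ValueError(f"unrecognised AZ ID: {az_id!r}")
    az_number = int(match.group(1))
    # Wrap around so that any AZ number lands in one of the six buckets.
    bucket = (az_number - 1) % NUM_BUCKETS + 1
    return f"{base}-{bucket}"


print(chef_environment_for_az("use1-az4"))  # prod-4
```

Keying on the AZ ID rather than the AZ name matters because AZ names (such as `us-east-1a`) can refer to different physical zones in different AWS accounts, while AZ IDs are consistent.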
What is the release train model for Chef cookbook deployments at Slack?
Sandbox and dev environments receive the latest cookbook version at the top of the hour. Prod-1 acts as a canary, receiving the dev version at 30 minutes past the hour. Prod-2 through prod-6 follow a staggered release train: the version advances one environment at a time, at 30 minutes past the hour, and a new version enters prod-2 only after the previous one has fully propagated through prod-6. As a result, a new version takes roughly five hours or more to reach all production environments.
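The release train rule can be sketched as a small state machine. This is an illustration under stated assumptions, not Slack's implementation: the function `tick`, the `candidate` parameter (taken here to be the version currently on prod-1), and the in-memory dictionary are all hypothetical. Each call to `tick` stands for one promotion slot.

```python
from typing import Dict

# Environments that follow the release train (prod-1 is the canary).
TRAIN = ["prod-2", "prod-3", "prod-4", "prod-5", "prod-6"]


def tick(state: Dict[str, str], candidate: str) -> Dict[str, str]:
    """Advance the release train by one promotion slot.

    state maps each train environment to its current version.
    candidate is the version waiting to board (assumed: prod-1's version).
    """
    new_state = dict(state)
    head = state[TRAIN[0]]
    # If a train is in flight, move its version one environment further.
    for env in TRAIN[1:]:
        if state[env] != head:
            new_state[env] = head
            return new_state
    # The previous version has reached prod-6, so a new one may board.
    if candidate != head:
        new_state[TRAIN[0]] = candidate
    return new_state


state = {env: "v1" for env in TRAIN}
for _ in range(5):
    state = tick(state, "v2")
print(state)  # every environment is now on v2
```

Five slots are needed for one version to travel from prod-2 to prod-6, and a newer candidate that appears mid-train has to wait until the train is idle again.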
How does Chef Summoner replace cron-based Chef runs?
Chef Summoner is a service running on every Slack node that monitors an S3 bucket for signals from Chef Librarian. When Chef Librarian promotes a new artifact version to an environment, it writes a JSON message to S3. Chef Summoner detects this signal, reads the configured splay value, and schedules a Chef run accordingly. This ensures Chef only runs when actual updates are available, rather than on a fixed schedule.
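As a rough sketch of the decision Chef Summoner makes on each poll, the function below takes the raw signal body and the last version applied on the node, and returns how long to wait before running Chef. The field names `version` and `splay`, the function name `plan_chef_run`, and the choice of a uniformly random delay within the splay window are assumptions; the article only says that a splay value is read and a run is scheduled accordingly. Fetching the object from S3 is deliberately left out.

```python
import json
import random
from typing import Optional


def plan_chef_run(
    signal_body: str,
    last_applied_version: Optional[str],
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Return seconds to wait before running Chef, or None if no run is needed.

    signal_body is the JSON text of the promotion signal (hypothetical schema).
    """
    signal = json.loads(signal_body)
    if signal["version"] == last_applied_version:
        return None  # nothing new has been promoted to this environment
    splay = float(signal.get("splay", 0))
    rng = rng or random.Random()
    # Spread runs across the splay window to avoid a thundering herd.
    return rng.uniform(0, splay)


delay = plan_chef_run('{"version": "abc123", "splay": 600}', "old456")
print(delay is not None)  # True: a newer version is available
```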
Why did Slack choose not to migrate to Chef Policyfiles?
Migrating to Chef Policyfiles would have required replacing existing roles and environments and asking dozens of teams to rewrite their cookbooks. While it might have improved safety long-term, the short-term effort was massive and would have introduced more risk than it solved. Instead, Slack chose to improve their existing EC2 framework in a way that doesn't disrupt existing cookbooks or roles while still achieving safer deployments.
How does Slack prevent a broken Chef Summoner from blocking its own recovery?
Every node has a fallback cron job that checks Chef Summoner's local state, including last run time and artifact version. If Chef hasn't run within 12 hours, the cron job triggers a Chef run directly, bypassing Summoner entirely. This recovery mechanism ensures that even if a broken Summoner version is deployed, a working version can still be pushed out through the fallback Chef run.
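The fallback check can be sketched in a few lines. The state format here (a JSON document with a `last_run_epoch` field) and the function name `needs_fallback_run` are hypothetical; the article only states that the cron job inspects Summoner's local state and runs Chef directly when no run has happened within 12 hours. Treating unreadable state as "run Chef" is an assumption, chosen so that a corrupted Summoner cannot suppress recovery.

```python
import json
import time

MAX_AGE_SECONDS = 12 * 60 * 60  # the 12-hour compliance window from the article


def needs_fallback_run(state_json: str, now: float) -> bool:
    """Decide whether the fallback cron should trigger Chef directly."""
    try:
        state = json.loads(state_json)
        last_run = float(state["last_run_epoch"])  # hypothetical field name
    except (ValueError, KeyError, TypeError):
        # Missing or corrupt state: assume Summoner is unhealthy and run Chef.
        return True
    return now - last_run >= MAX_AGE_SECONDS


recent = json.dumps({"last_run_epoch": time.time() - 3600})
print(needs_fallback_run(recent, time.time()))  # False: Chef ran an hour ago
```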
What is Chef Librarian and how does it work with Chef Summoner?
Chef Librarian is a service that watches for new Chef cookbook artifacts and uploads them to all Chef stacks. It exposes an API endpoint for promoting specific artifact versions to environments. When it promotes a version, it writes a JSON signal to an S3 bucket containing the version details, cookbook versions, splay value, and commit hash. Chef Summoner then reads these signals to determine when to trigger Chef runs.
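For illustration, a promotion signal of the kind described above might be assembled like this. The article lists what the signal contains (version details, cookbook versions, splay value, commit hash) but not its schema, so every field name, the S3 key layout, and the function `build_promotion_signal` are assumptions. The actual upload to S3 is omitted.

```python
import json
from typing import Dict, Tuple


def build_promotion_signal(
    environment: str,
    version: str,
    cookbook_versions: Dict[str, str],
    splay_seconds: int,
    commit: str,
) -> Tuple[str, str]:
    """Return a (hypothetical S3 key, JSON body) pair for one promotion."""
    key = f"signals/{environment}.json"  # assumed layout: one object per environment
    body = json.dumps(
        {
            "version": version,
            "cookbook_versions": cookbook_versions,
            "splay": splay_seconds,
            "commit": commit,
        },
        sort_keys=True,
    )
    return key, body


key, body = build_promotion_signal(
    "prod-1", "2024.01.15-1", {"base": "1.4.2"}, 900, "deadbeef"
)
print(key)  # signals/prod-1.json
```

Keeping one small, self-describing object per environment means a node only needs read access to a single key to learn what it should be running.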
Why does prod-1 update more frequently than other production environments?
Prod-1 serves as a canary environment that receives the latest dev version every hour (when new changes exist). If Slack waited for each version to pass through all environments before updating prod-1, they would end up testing artifacts with large, cumulative changes. Frequent prod-1 updates allow detecting issues closer to when they were introduced, keeping each test increment small and making root cause analysis easier.
What is Shipyard and why is Slack building it?
Shipyard is Slack's new EC2 ecosystem designed for teams that can't yet migrate to their container-based platform Bedrock. The current Chef-based system can't efficiently support service-level deployments because creating dedicated Chef environments for hundreds of services is unmanageable at scale. Shipyard introduces service-level deployments, metric-driven rollouts, and fully automated rollbacks, representing a complete reimagining of EC2-based service management.

Key Statistics & Figures

Number of production Chef environments
6
Split from a single prod environment into prod-1 through prod-6, mapped to AWS Availability Zones
Minimum Chef run compliance interval
12 hours
Chef Summoner ensures Chef runs at least once every 12 hours even without new signals, maintained by a fallback cron job
Canary update frequency
Every hour
Prod-1 receives the latest dev version every hour when new changes have been merged
Production rollout cadence
One environment per step, at 30 minutes past the hour
During the release train, the version advances to the next production environment at 30 minutes past the hour

Technologies & Tools

Configuration Management
Chef
Primary configuration management tool for maintaining fleet consistency across EC2 instances
Cloud Infrastructure
AWS EC2
Compute platform running Slack's services with availability zone-based environment splitting
Cloud Storage
AWS S3
Used as a signaling mechanism between Chef Librarian and Chef Summoner for deployment triggers
Container Orchestration
Kubernetes
Runs cron jobs that manage the rollout of cookbook changes across environments
Instance Bootstrapping
Cloud-init
Runs Poptart Bootstrap at instance boot time to configure Chef environment and DNS
Internal Tooling
Chef Librarian
Service that watches for new cookbook artifacts, uploads them to Chef stacks, and sends promotion signals to S3
Internal Tooling
Chef Summoner
Node-level service that monitors S3 signals and triggers Chef runs when new versions are available
Internal Tooling
Poptart Bootstrap
AMI-baked tool that creates Chef node objects, sets up DNS, and assigns nodes to availability zone-based Chef environments

Key Actionable Insights

1. Split production environments by availability zone to contain blast radius. Rather than having all production nodes share a single configuration source, map each node to a zone-specific environment during bootstrapping. This ensures that bad configurations only affect a subset of your fleet initially, giving you time to detect and fix issues before they propagate everywhere.
   This is especially critical during scale-out events where dozens or hundreds of new nodes could simultaneously pick up a broken configuration from a shared environment.
2. Use a canary environment that updates frequently with the latest changes, separate from the rest of production that follows a slower release train. Prod-1 receiving every new version hourly catches issues close to when they're introduced, while prod-2 through prod-6 only advance after the previous version has fully propagated. This dual-speed approach balances early detection with safety.
   Testing large cumulative changes makes it harder to identify which specific change caused a problem. Small, frequent canary updates make root cause analysis significantly easier.
3. Replace fixed-schedule cron-based configuration runs with signal-driven triggers. When environments update at variable times due to staggered rollouts, a fixed schedule can't predict when new changes will be available. A signal-based system that only triggers runs when actual updates exist improves both efficiency and safety.
   Slack built Chef Summoner to poll S3 for signals from Chef Librarian, using configurable splay values to stagger runs within each environment and avoid load spikes.
4. Always build a fallback recovery mechanism for critical deployment infrastructure. If your deployment trigger service itself is deployed by the same system it manages, a broken version could prevent its own fix from being rolled out. Bake in an independent safety net (like a cron job) that can trigger deployments when the primary system fails.
   Slack's fallback cron checks Chef Summoner's local state and triggers Chef directly if no run has occurred in 12 hours, ensuring the system can always self-heal.
5. Use S3 as a lightweight signaling mechanism between infrastructure services rather than building complex pub/sub systems. By having Chef Librarian write JSON signals to S3 on each promotion and having Chef Summoner poll those signals, Slack created a decoupled, durable communication channel that doesn't require managing additional messaging infrastructure.
   The S3 signal contains all necessary metadata including version, cookbook versions, splay configuration, and commit hash, making it a self-contained deployment manifest.
6. When considering major infrastructure migrations like Chef Policyfiles, evaluate the migration cost against incremental improvements to existing systems. Sometimes improving the current system with environment splitting and progressive rollouts achieves similar safety gains without requiring dozens of teams to rewrite their configurations.
   Slack chose incremental improvement over a disruptive migration, reserving the complete rebuild (Shipyard) for when they had the resources and design to do it properly.

Common Pitfalls

1. Using a single shared production environment for all nodes means that newly provisioned instances immediately pick up the latest (potentially broken) configuration. During large scale-out events, dozens or hundreds of new nodes could launch with a bad configuration simultaneously, turning a minor issue into a fleet-wide outage.
   Split production environments by availability zone so that new nodes are distributed across isolated environments, limiting the blast radius of any single bad change.
2. Relying on fixed cron schedules to trigger configuration management runs becomes impractical when environments update at variable times due to staggered rollouts. A node might run Chef right before a new version lands, then not run again for hours, or multiple nodes might cluster their runs and create load spikes.
   Use signal-driven triggers with configurable splay values instead, so Chef only runs when actual updates exist and runs are staggered to avoid resource contention.
3. Making your deployment trigger service (Chef Summoner) entirely dependent on the deployment system it manages creates a circular dependency. If a broken version of the trigger service is deployed, it can stop triggering further deployments, making it impossible to push out a fix through normal channels.
   Always bake in an independent fallback mechanism such as a cron job that can trigger deployments when the primary system fails, ensuring a recovery path exists.
4. Waiting for each artifact version to fully propagate through all environments before testing the next version means you end up testing large, cumulative changes. This makes it difficult to identify which specific change introduced a regression and slows down your feedback loop.
   Use a canary environment (prod-1) that always receives the latest version to keep test increments small, while the remaining environments follow a safer staggered rollout.
5. Attempting a large-scale migration to Chef Policyfiles or similar tools may seem like the right architectural move, but requiring dozens of teams to simultaneously rewrite their cookbooks and change their workflows introduces massive coordination overhead and short-term risk that may outweigh the long-term benefits.
   Consider whether incremental improvements to the existing system (like environment splitting and progressive rollouts) can achieve similar safety gains without disrupting team workflows.

Related Concepts

Canary Deployments
Release Train Model
Progressive Rollout Strategies
Blast Radius Reduction
Chef Policyfiles
Configuration Management At Scale
AWS Availability Zones
Infrastructure As Code
Signal-driven Architecture
Deployment Safety
AMI Baking
Fleet Management
Self-healing Infrastructure
Compliance Automation
Container-based Platforms