On September 30th 2021, Slack had an outage that impacted less than 1% of our online user base, and lasted for 24 hours. This outage was the result of our attempt to enable DNSSEC — an extension intended to secure the DNS protocol, required for FedRAMP Moderate — but which ultimately led to a series of…
Overview
The article discusses the challenges and outcomes of Slack's attempt to implement DNSSEC, a security extension for the Domain Name System. It details the incidents that led to outages and the lessons learned from the experience.
What You'll Learn
How to implement DNSSEC securely in a production environment
Why thorough testing and monitoring are crucial during DNS changes
When to roll back DNS changes to minimize customer impact
Prerequisites & Requirements
- Understanding of DNS and DNSSEC concepts
- Experience managing DNS configurations (optional)
Key Questions Answered
What caused the DNSSEC rollout failure at Slack?
How did Slack handle the DNSSEC rollout incident?
What steps did Slack take to validate DNSSEC implementation?
What lessons can be learned from Slack's DNSSEC rollout?
Key Actionable Insights
1. Implement comprehensive monitoring for DNS changes to catch issues early. By setting up global monitoring for DNS resolution, teams can quickly identify and address problems, minimizing customer impact during changes.
2. Conduct thorough testing on wildcard records before deploying DNSSEC. Testing with wildcard records is essential as they can introduce unique issues that may not appear in standard domain setups, as seen in Slack's experience.
3. Engage with DNS service providers proactively during major changes. Maintaining open communication with providers like Route 53 can facilitate quicker resolutions to unexpected issues, as demonstrated during the incident.
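The first two insights can be combined into a small resolution check. The following is a minimal sketch, not Slack's actual tooling: it resolves a list of hostnames and adds a randomly generated label under a zone so that the answer has to come from a wildcard record. The function names (`check_names`, `wildcard_probe`, `failures`) and the `example.com` zone are hypothetical, and the sketch assumes the zone really does publish a wildcard record. It uses only the operating system's resolver and does not validate DNSSEC signatures itself.

```python
import socket
import uuid
from typing import Callable, Dict, Iterable, List

# A resolver takes a hostname and returns its addresses, raising OSError on failure.
Resolver = Callable[[str], List[str]]


def system_resolver(hostname: str) -> List[str]:
    """Resolve a hostname through the OS resolver and return unique addresses."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def wildcard_probe(zone: str) -> str:
    """Build a random name under `zone` that only a wildcard record can answer."""
    return f"probe-{uuid.uuid4().hex[:12]}.{zone}"


def check_names(names: Iterable[str], resolve: Resolver = system_resolver) -> Dict[str, str]:
    """Resolve each name and record 'ok', 'empty', or a failure message."""
    results: Dict[str, str] = {}
    for name in names:
        try:
            addresses = resolve(name)
            results[name] = "ok" if addresses else "empty"
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[name] = f"fail: {exc}"
    return results


def failures(results: Dict[str, str]) -> List[str]:
    """Return the names whose status is anything other than 'ok'."""
    return [name for name, status in results.items() if status != "ok"]


if __name__ == "__main__":
    # Hypothetical zone; replace with a zone that publishes a wildcard record.
    names = ["example.com", wildcard_probe("example.com")]
    report = check_names(names)
    for name, status in report.items():
        print(f"{name}: {status}")
    if failures(report):
        print("ALERT: resolution failures detected")
```

A check run from a single host only reflects one resolver's view. To approximate the global monitoring described above, the same check would need to run on a schedule from several regions and against several public resolvers, since a DNSSEC problem may surface only on some of them.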