The Case of the Recursive Resolvers

On September 30th 2021, Slack had an outage that impacted less than 1% of our online user base, and lasted for 24 hours. This outage was the result of our attempt to enable DNSSEC — an extension intended to secure the DNS protocol, required for FedRAMP Moderate — but which ultimately led to a series of…

Rafael Elvira
19 min read · intermediate

Overview

The article discusses the challenges and outcomes of Slack's attempt to implement DNSSEC, a security extension for the Domain Name System. It details the incidents that led to outages and the lessons learned from the experience.

What You'll Learn

1. How to implement DNSSEC securely in a production environment
2. Why thorough testing and monitoring are crucial during DNS changes
3. When to roll back DNS changes to minimize customer impact

Prerequisites & Requirements

  • Understanding of DNS and DNSSEC concepts
  • Experience managing DNS configurations (optional)

Key Questions Answered

What caused the DNSSEC rollout failure at Slack?
The failure was primarily caused by the NSEC responses that Route 53 generated for wildcard records, which led some DNS resolvers to cache incorrect negative answers. As a result, users saw 'ERR_NAME_NOT_RESOLVED' errors when trying to reach certain subdomains.
How did Slack handle the DNSSEC rollout incident?
Slack's Traffic Engineering team carefully monitored DNS resolution and ultimately decided to roll back the DNSSEC changes after observing significant resolution issues. They also contacted major ISPs to flush cached records to restore normal service.
What steps did Slack take to validate DNSSEC implementation?
Slack implemented external monitoring and alerting for DNS resolution, conducted multiple tests using tools like dnsviz.net and Verisign DNSSEC Debugger, and performed hand-crafted dig tests to ensure proper DNSSEC setup before enabling it on their domains.
What lessons can be learned from Slack's DNSSEC rollout?
Key lessons include the importance of thorough testing, understanding the behavior of DNS resolvers with DNSSEC, and ensuring proper communication with DNS service providers during critical changes to avoid service disruptions.
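
To make the failure mode in the first answer more concrete, here is a minimal Python sketch of how a resolver that reuses cached NSEC records can end up denying a record that actually exists. It is a toy model, not real resolver code: the class name, the method names, and the example domain are all hypothetical, and the assumption that the faulty NSEC type bitmap omitted the A type is an illustration rather than something stated in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class NsecCachingResolver:
    """Toy model of a resolver that answers from cached NSEC records."""

    # Maps an owner name to the record types listed in its NSEC type bitmap.
    nsec_cache: dict = field(default_factory=dict)

    def cache_nsec(self, owner, type_bitmap):
        """Remember the type bitmap of an NSEC record received for `owner`."""
        self.nsec_cache[owner.lower()] = frozenset(t.upper() for t in type_bitmap)

    def answer_from_cache(self, qname, qtype):
        """Return 'NODATA' if a cached NSEC denies qtype, else 'ASK_AUTHORITATIVE'."""
        bitmap = self.nsec_cache.get(qname.lower())
        if bitmap is not None and qtype.upper() not in bitmap:
            # The cached bitmap claims this type does not exist at the name,
            # so the resolver never asks the authoritative server again.
            return "NODATA"
        return "ASK_AUTHORITATIVE"


name = "team.example.com"  # hypothetical name covered by a wildcard record

# Assumed faulty response: the bitmap leaves out A even though A exists.
faulty = NsecCachingResolver()
faulty.cache_nsec(name, {"RRSIG", "NSEC"})

# Assumed correct response: the bitmap lists A, so only AAAA is denied.
correct = NsecCachingResolver()
correct.cache_nsec(name, {"A", "RRSIG", "NSEC"})

print("faulty bitmap, A lookup:", faulty.answer_from_cache(name, "A"))
print("correct bitmap, A lookup:", correct.answer_from_cache(name, "A"))
```

In this model the client never sees the A record once the incomplete bitmap is cached, which is one plausible route to the name-resolution errors described above.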

Key Statistics & Figures

Percentage of affected users during the outage
less than 1%
This was the impact on Slack's online user base during the DNSSEC rollout failure.
Duration of the outage
24 hours
The length of time Slack users experienced issues due to the DNSSEC implementation.

Technologies & Tools

DNS Service
Amazon Route 53
Used as the authoritative DNS server for Slack's public domains.
Infrastructure as Code
Terraform
Used to manage Slack's DNS configuration.

Key Actionable Insights

1. Implement comprehensive monitoring for DNS changes to catch issues early.
By setting up global monitoring for DNS resolution, teams can quickly identify and address problems, minimizing customer impact during changes.
2. Conduct thorough testing on wildcard records before deploying DNSSEC.
Testing with wildcard records is essential as they can introduce unique issues that may not appear in standard domain setups, as seen in Slack's experience.
3. Engage with DNS service providers proactively during major changes.
Maintaining open communication with providers like Route 53 can facilitate quicker resolutions to unexpected issues, as demonstrated during the incident.
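
The monitoring and wildcard-testing insights can be combined into a small probe. The Python sketch below is one possible shape for such a check, not Slack's actual tooling: it builds a `dig` command that requests DNSSEC records from a specific resolver and parses the status and flags from dig's header lines. The function names, the choice of timeout values, and the example name are assumptions made for illustration; running `probe` requires `dig` to be installed and network access.

```python
import re
import subprocess


def build_dig_command(name, rtype, resolver_ip):
    """Build a dig invocation that asks one resolver for DNSSEC-aware answers."""
    return ["dig", "+dnssec", "+tries=1", "+time=3", f"@{resolver_ip}", name, rtype]


def parse_dig_header(output):
    """Extract the response status and header flags from dig's text output."""
    status = re.search(r"status: ([A-Z]+)", output)
    flags = re.search(r";; flags: ([a-z ]+);", output)
    return {
        "status": status.group(1) if status else None,
        "flags": flags.group(1).split() if flags else [],
    }


def probe(name, rtype, resolver_ip):
    """Run dig against one resolver and return the parsed header fields."""
    result = subprocess.run(
        build_dig_command(name, rtype, resolver_ip),
        capture_output=True,
        text=True,
        timeout=10,
    )
    return parse_dig_header(result.stdout)


def probe_plan(names, resolver_ips, rtypes=("A", "AAAA")):
    """List every (name, record type, resolver) combination to check."""
    return [(n, t, r) for r in resolver_ips for n in names for t in rtypes]
```

A monitor could walk `probe_plan(["random-label.wildcard.example.com"], ["8.8.8.8", "1.1.1.1"])` on a schedule and alert when `status` is not `NOERROR`, or when the `ad` flag is missing from a validating resolver after DNSSEC is enabled. Checking more than one record type per wildcard name is a design choice here, so that problems tied to a specific type are not missed.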

Common Pitfalls

1. Failing to account for the behavior of DNS resolvers with DNSSEC enabled can lead to service disruptions.
Many resolvers may enforce stricter rules when DNSSEC is enabled, which can cause unexpected resolution failures if not properly tested.
2. Assuming that removing DS records from the registrar will clear cached records immediately.
In reality, resolvers may cache DS records for up to 24 hours, leading to prolonged issues if not managed correctly.
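
The second pitfall comes down to simple TTL arithmetic. The Python sketch below, with hypothetical function names and a made-up removal time, computes the earliest moment at which every resolver that cached the DS record just before its removal will have expired it. It assumes the 24-hour caching period mentioned above and that resolvers honor the TTL.

```python
from datetime import datetime, timedelta, timezone


def ds_cache_clear_time(ds_removed_at, ds_ttl_seconds):
    """Return the worst-case time at which cached DS records have expired."""
    # A resolver may have cached the DS record an instant before removal,
    # so the full TTL must elapse after the removal time.
    return ds_removed_at + timedelta(seconds=ds_ttl_seconds)


def ds_fully_expired(now, ds_removed_at, ds_ttl_seconds):
    """Check whether it is safe to assume no resolver still holds the DS record."""
    return now >= ds_cache_clear_time(ds_removed_at, ds_ttl_seconds)


DS_TTL_SECONDS = 24 * 60 * 60  # the 24-hour caching period, in seconds

# Hypothetical removal time, used only for illustration.
removed_at = datetime(2030, 1, 1, 12, 0, tzinfo=timezone.utc)
print("DS caches clear by:", ds_cache_clear_time(removed_at, DS_TTL_SECONDS))
```

A check like `ds_fully_expired` could gate any follow-up change that relies on the DS record being gone, such as disabling zone signing, so that resolvers holding a cached DS record are not left expecting signatures that no longer exist.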

Related Concepts

DNSSEC Implementation Strategies
DNS Resolver Behavior
Impact of DNS Changes on User Experience