The Case of the Recursive Resolvers

Rafael Elvira

On September 30th, 2021, Slack had an outage that impacted less than 1% of our online user base and lasted for 24 hours. This outage was the result of our attempt to enable DNSSEC (an extension intended to secure the DNS protocol, required for FedRAMP Moderate), but which ultimately led to a series of…

Slack • 19 min read • Intermediate

Tags: AWS, Chef, Python, Terraform, TypeScript

Overview

The article discusses the challenges and outcomes of Slack's attempt to implement DNSSEC, a security extension for the Domain Name System. It details the incidents that led to outages and the lessons learned from the experience.

What You'll Learn

1. How to implement DNSSEC securely in a production environment
2. Why thorough testing and monitoring are crucial during DNS changes
3. When to roll back DNS changes to minimize customer impact

Prerequisites & Requirements

Understanding of DNS and DNSSEC concepts
Experience with managing DNS configurations (optional)

Key Questions Answered

What caused the DNSSEC rollout failure at Slack?

The failure was primarily due to issues with the NSEC responses that Route 53 generated for wildcard records, which caused some DNS resolvers to cache incorrect answers. As a result, users saw 'ERR_NAME_NOT_RESOLVED' errors when attempting to access certain subdomains.

How did Slack handle the DNSSEC rollout incident?

Slack's Traffic Engineering team carefully monitored DNS resolution and ultimately decided to roll back the DNSSEC changes after observing significant resolution issues. They also contacted major ISPs to flush cached records to restore normal service.

What steps did Slack take to validate DNSSEC implementation?

Slack implemented external monitoring and alerting for DNS resolution, conducted multiple tests using tools like dnsviz.net and Verisign DNSSEC Debugger, and performed hand-crafted dig tests to ensure proper DNSSEC setup before enabling it on their domains.

What lessons can be learned from Slack's DNSSEC rollout?

Key lessons include the importance of thorough testing, understanding the behavior of DNS resolvers with DNSSEC, and ensuring proper communication with DNS service providers during critical changes to avoid service disruptions.
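The hand-crafted dig tests mentioned above can be organized as a small test matrix. The following is a minimal sketch, not Slack's actual tooling: the function name `build_dig_commands`, the example hostnames, and the resolver addresses are all assumptions chosen for illustration. It only builds the command lines; running them and interpreting the answers is left to the operator.

```python
from itertools import product


def build_dig_commands(names, record_types, resolvers):
    """Return one dig argv list per (name, type, resolver) combination."""
    commands = []
    for name, rtype, resolver in product(names, record_types, resolvers):
        # +dnssec sets the DO bit, asking the server to include DNSSEC
        # records (such as RRSIG and NSEC) in its response.
        commands.append(["dig", "+dnssec", f"@{resolver}", name, rtype])
    return commands


if __name__ == "__main__":
    # Hypothetical inputs: a regular name, a name that would only match a
    # wildcard record, two record types, and two public resolvers.
    cmds = build_dig_commands(
        names=["example.com", "some-random-label.example.com"],
        record_types=["A", "AAAA"],
        resolvers=["8.8.8.8", "1.1.1.1"],
    )
    for argv in cmds:
        print(" ".join(argv))
```

Covering several record types for a wildcard-matched name is a deliberate choice here, since the incident involved NSEC responses for wildcard records.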

Key Statistics & Figures

Percentage of affected users during the outage

less than 1%

This was the impact on Slack's online user base during the DNSSEC rollout failure.

Duration of the outage

24 hours

The length of time Slack users experienced issues due to the DNSSEC implementation.

Technologies & Tools

DNS Service

Amazon Route 53

Used as the authoritative DNS server for Slack's public domains.

Infrastructure as Code

Terraform

Managed DNS configurations for Slack.

Key Actionable Insights

1. Implement comprehensive monitoring for DNS changes to catch issues early.
   By setting up global monitoring for DNS resolution, teams can quickly identify and address problems, minimizing customer impact during changes.

2. Conduct thorough testing on wildcard records before deploying DNSSEC.
   Testing with wildcard records is essential as they can introduce unique issues that may not appear in standard domain setups, as seen in Slack's experience.

3. Engage with DNS service providers proactively during major changes.
   Maintaining open communication with providers like Route 53 can facilitate quicker resolutions to unexpected issues, as demonstrated during the incident.
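The first two insights can be combined into a simple resolution check. The sketch below is an illustration under stated assumptions, not the monitoring Slack built: the helper names (`default_resolve`, `wildcard_probe`, `check_resolution`) and the `example.com` zone are hypothetical. It uses only the local system resolver, so it sees a single vantage point; global monitoring would need to run a check like this from many networks and resolvers.

```python
import secrets
import socket


def default_resolve(hostname):
    """Resolve via the system resolver; return a sorted list of addresses."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def wildcard_probe(zone):
    """Build a random name that should only resolve through a wildcard record."""
    return f"probe-{secrets.token_hex(4)}.{zone}"


def check_resolution(hostnames, resolve=default_resolve):
    """Return {hostname: reason} for every name that fails to resolve."""
    failures = {}
    for hostname in hostnames:
        try:
            if not resolve(hostname):
                failures[hostname] = "empty answer"
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[hostname] = str(exc)
    return failures


if __name__ == "__main__":
    # Hypothetical zone; substitute a zone that actually has a wildcard record.
    zone = "example.com"
    names = [zone, wildcard_probe(zone)]
    for name, reason in check_resolution(names).items():
        print(f"ALERT {name}: {reason}")
```

The `resolve` parameter is injectable so the same check can be pointed at a different lookup function or exercised in tests without network access.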

Common Pitfalls

1. Failing to account for the behavior of DNS resolvers with DNSSEC enabled can lead to service disruptions.
   Many resolvers may enforce stricter rules when DNSSEC is enabled, which can cause unexpected resolution failures if not properly tested.

2. Assuming that removing DS records from the registrar will clear cached records immediately.
   In reality, resolvers may cache DS records for up to 24 hours, leading to prolonged issues if not managed correctly.
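The second pitfall comes down to simple TTL arithmetic. As a worked example, assuming a DS record TTL of 86400 seconds (the 24 hours mentioned above) and resolvers that honor TTLs, the sketch below computes the time by which every cached copy of the DS record should have expired. The function name `ds_cache_expiry` and the removal time of day are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ds_cache_expiry(ds_removed_at, ds_ttl_seconds):
    """Latest time a resolver could still hold a DS record cached just before removal."""
    return ds_removed_at + timedelta(seconds=ds_ttl_seconds)


# Hypothetical removal time on the day of the incident; TTL assumed to be 24 hours.
removed = datetime(2021, 9, 30, 12, 0, tzinfo=timezone.utc)
print(ds_cache_expiry(removed, 86400).isoformat())
# 2021-10-01T12:00:00+00:00
```

Until that time, some validating resolvers may still expect the zone to be signed, so further changes to the zone's signing state within this window can prolong resolution failures.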

Related Concepts

DNSSEC Implementation Strategies

DNS Resolver Behavior

Impact of DNS Changes on User Experience
