When protections outlive their purpose: A lesson on managing defense systems at scale

User feedback led us to clean up outdated mitigations. See why observability and lifecycle management are critical for defense systems.

Thomas Kjær Aabo
6 min read · intermediate

Overview

GitHub discovered that emergency rate-limiting and protection rules added during past abuse incidents had been left in place, quietly blocking legitimate users with 'too many requests' errors during normal browsing. The article details how they traced the issue across multiple infrastructure layers, quantifies the false-positive impact, and outlines their new approach to lifecycle management of defensive controls.

What You'll Learn

1

Why emergency incident mitigations become technical debt when left in place without lifecycle management

2

How to trace rate-limiting issues across multi-layered infrastructure stacks

3

How composite fingerprinting signals can produce false positives against legitimate users

4

When to treat incident mitigations as temporary by default and require intentional decisions for permanence

5

How to build observability into defense mechanisms the same way you would for product features

Prerequisites & Requirements

  • Understanding of rate limiting and traffic control concepts
  • Familiarity with multi-layered infrastructure architectures (edge, application, service, backend tiers)
  • Experience with incident response and abuse mitigation at scale (optional)

Key Questions Answered

Why do emergency rate-limiting rules start blocking legitimate users over time?
Emergency protections are added during active incidents based on patterns strongly associated with abusive traffic at that moment. Over time, threat patterns evolve and legitimate tools and usage change, causing those same fingerprint patterns to match normal user behavior. Without expiration dates, post-incident reviews, or impact monitoring, temporary mitigations become permanent technical debt that quietly accumulates false positives.
How do you trace which infrastructure layer is blocking a user request?
GitHub traced blocked requests by working backward from user reports: first gathering timestamps and behavior patterns from reports, then checking edge tier logs to confirm requests reached infrastructure, examining application tier logs to find 429 responses, and finally analyzing protection rule configurations to identify which specific rules matched. This required correlating logs across multiple systems with different schemas.
What was the false-positive rate of GitHub's outdated protection rules?
Among requests matching suspicious fingerprints, only 0.5–0.9% were actually blocked — those that also triggered business-logic rules. Relative to total traffic, false positives represented approximately 0.003–0.004%, which translates to roughly 3–4 requests incorrectly blocked per 100,000. While the percentage was low, it still meant real users were blocked during normal browsing.
How should incident mitigations be managed to prevent them from becoming stale?
Incident mitigations should be treated as temporary by default, requiring an intentional and documented decision to make them permanent. Each mitigation should have an expiration date set at creation, undergo post-incident review to evaluate whether emergency controls should be evolved into sustainable targeted solutions, and include ongoing monitoring for false-positive impact on legitimate users.
What is the lifecycle of incident mitigations that become problematic?
The lifecycle follows a predictable pattern. A control is added during an active incident and works correctly at that time. It then stays active without review while the original threat evolves. Eventually it starts blocking legitimate traffic, because the patterns it matches no longer exclusively indicate abuse. Without lifecycle management, including expiration dates and periodic review, this progression is hard to avoid.
How does GitHub's multi-layered protection infrastructure work?
GitHub uses a custom multi-layered protection infrastructure built upon open-source projects like HAProxy. Requests flow through edge, application, service, and backend layers, each with DDoS protection, rate limits, authentication, and access controls. During incidents, protections can be added at any layer depending on where abuse is best mitigated and what controls are fastest to deploy.
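
The tracing sequence described in the answers above (confirm the request reached the edge, find the 429 in the application tier, then identify the matching rule) can be sketched as a small log-correlation routine. This is only an illustration under stated assumptions, not GitHub's tooling: the log records, the field names (`req_id`, `request_id`, `status`, `matched_rule`) and the rule name are all hypothetical, chosen to show two tiers that log the same request under different schemas.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical log records: each tier uses its own schema and field names.
EDGE_LOGS = [
    {"req_id": "a1", "ts": "2024-01-01T10:00:00Z", "path": "/example/repo"},
    {"req_id": "b2", "ts": "2024-01-01T10:00:05Z", "path": "/example/repo/issues"},
]
APP_LOGS = [
    {"request_id": "a1", "status": 200, "matched_rule": None},
    {"request_id": "b2", "status": 429, "matched_rule": "incident-rule-17"},
]


@dataclass
class TraceResult:
    request_id: str
    reached_edge: bool            # did the request arrive at the infrastructure at all?
    blocked_layer: Optional[str]  # tier that returned 429, if any
    matched_rule: Optional[str]   # rule recorded alongside the 429, if any


def trace_request(request_id, edge_logs, app_logs) -> TraceResult:
    """Work backward from a reported request ID through each tier's logs."""
    # Step 1: confirm the request reached the edge tier.
    reached_edge = any(rec["req_id"] == request_id for rec in edge_logs)
    # Step 2: look for a 429 response in the application tier.
    for rec in app_logs:
        if rec["request_id"] == request_id and rec["status"] == 429:
            # Step 3: report which rule the tier says it matched.
            return TraceResult(request_id, reached_edge, "application", rec["matched_rule"])
    return TraceResult(request_id, reached_edge, None, None)


result = trace_request("b2", EDGE_LOGS, APP_LOGS)
print(result.blocked_layer, result.matched_rule)  # application incident-rule-17
```

In practice the join key is the hard part: if tiers do not share a request identifier, the correlation has to fall back on timestamps and client attributes, which is part of why such investigations are slow.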

Key Statistics & Figures

Fingerprint matches actually blocked
0.5–0.9%
Among requests matching suspicious fingerprints, only those also triggering business-logic rules were blocked
False-positive rate relative to total traffic
0.003–0.004%
Approximately 3–4 requests per 100,000 were incorrectly blocked in the hour before cleanup
Block rate for dual-criteria matches
100%
Requests matching both suspicious fingerprints and business-logic rules were blocked 100% of the time
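
To make the percentages above concrete, the short sketch below converts a percentage of total traffic into a count per 100,000 requests and into an expected count for a given request volume. The 10,000,000-request volume is a made-up figure for illustration only; the source does not state GitHub's traffic volume. Exact fractions are used to avoid floating-point rounding noise.

```python
from fractions import Fraction


def percent_to_per_100k(percent: str) -> Fraction:
    """Convert a percentage (given as a decimal string) to requests per 100,000."""
    return Fraction(percent) / 100 * 100_000


def expected_count(percent: str, total_requests: int) -> Fraction:
    """Expected number of affected requests for a given total volume."""
    return Fraction(percent) / 100 * total_requests


# False-positive rate relative to total traffic, from the figures above.
print(percent_to_per_100k("0.003"), percent_to_per_100k("0.004"))  # 3 4

# Hypothetical volume, purely illustrative.
HYPOTHETICAL_REQUESTS = 10_000_000
print(expected_count("0.003", HYPOTHETICAL_REQUESTS),
      expected_count("0.004", HYPOTHETICAL_REQUESTS))  # 300 400
```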

Key Actionable Insights

1
Set expiration dates on all emergency mitigations at creation time. When adding protective rules during incidents, include a TTL or review date so they don't become permanent by default. Making permanence require an intentional, documented decision prevents stale rules from silently accumulating.
GitHub found that rules added during past incidents quietly persisted and started matching legitimate traffic patterns as usage evolved over time.
2
Build observability into your defense mechanisms with the same rigor you apply to product features. Defense systems need monitoring dashboards, alerting on false-positive rates, and visibility into what each rule is actually blocking today versus what it was originally designed to block.
GitHub's investigation required tracing requests across multiple infrastructure layers with different log schemas — better observability would have surfaced the issue before users reported it.
3
Conduct post-incident reviews that specifically evaluate emergency controls and evolve them into sustainable, targeted solutions. Quick-response mitigations use broader patterns that are necessary in the moment but may not be appropriate long-term. Review each rule to narrow its scope or replace it with more precise controls.
The composite fingerprinting signals GitHub used were effective during incidents but produced false positives because the broad patterns also matched some legitimate logged-out requests.
4
Implement cross-layer request tracing that can identify which specific defense layer and rule blocked a given request. When protections exist at edge, application, service, and backend layers, a single blocked request may require correlating logs from multiple systems to diagnose.
GitHub's investigation went from user reports to edge logs to application logs to rule configurations, highlighting that without unified tracing, finding the source of blocks is time-consuming.
5
Use composite signals rather than single fingerprints for traffic filtering, but actively monitor the false-positive rate of each composite rule. GitHub's approach of combining fingerprinting with business-logic rules filtered out most matches (only 0.5–0.9% were blocked), but even this layered approach produced false positives that impacted real users.
A low overall false-positive rate (0.003–0.004% of total traffic) still meant real users were incorrectly blocked; at platform scale, even a very small rate affects a meaningful number of people.
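
The first insight, temporary by default with permanence as an explicit decision, can be expressed directly in how a mitigation is modeled. The sketch below is one possible shape, not a description of GitHub's systems: the `Mitigation` class, the 14-day default TTL and the rule names are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

DEFAULT_TTL = timedelta(days=14)  # assumed default; choose what suits your incident process


@dataclass(frozen=True)
class Mitigation:
    name: str
    created_at: datetime
    ttl: Optional[timedelta] = DEFAULT_TTL  # None means "permanent"
    permanence_note: Optional[str] = None   # required justification when ttl is None

    def __post_init__(self):
        # Permanence must be an intentional, documented decision.
        if self.ttl is None and not self.permanence_note:
            raise ValueError(f"{self.name}: permanent mitigations need a documented reason")

    def expires_at(self) -> Optional[datetime]:
        return None if self.ttl is None else self.created_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        expiry = self.expires_at()
        return expiry is None or now < expiry


def expired(rules: List[Mitigation], now: datetime) -> List[str]:
    """Names of rules whose TTL has elapsed and that should be removed or reviewed."""
    return [r.name for r in rules if not r.is_active(now)]


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
rules = [
    Mitigation("incident-rule-17", created_at=created),
    Mitigation("baseline-limit", created_at=created, ttl=None,
               permanence_note="Reviewed after the incident; kept as a narrow, targeted limit"),
]
print(expired(rules, created + timedelta(days=30)))  # ['incident-rule-17']
```

The design choice here is that the default path expires: a rule only outlives its TTL if someone supplies a reason, which leaves a record for later review.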

Common Pitfalls

1
Treating emergency incident mitigations as permanent by default. When rate-limiting or blocking rules are added during active abuse incidents, they are often left in place indefinitely because there is no process to review or expire them. Over time, threat patterns change and legitimate usage evolves, causing these stale rules to match normal user behavior and produce false positives.
Set expiration dates at creation time and require an explicit, documented decision to make any emergency mitigation permanent.
2
Lacking observability into defense mechanisms across infrastructure layers. When protections are spread across edge, application, service, and backend tiers — each with different log schemas — it becomes extremely difficult to trace which layer and rule blocked a specific request. This delays diagnosis and allows stale rules to persist undetected.
Build unified tracing and monitoring for defense systems so that blocked requests can be quickly attributed to specific rules and layers.
3
Assuming a low overall false-positive rate means the impact is acceptable. Even at 0.003–0.004% of total traffic, GitHub recognized that real users were being incorrectly blocked during normal browsing, which is unacceptable. At platform scale, even tiny percentages translate to meaningful numbers of affected users.
Measure false-positive impact not just as a percentage but in terms of actual user experiences, and treat any incorrect blocking as a problem worth solving.
4
Using broad composite fingerprinting patterns without ongoing validation. While combining industry-standard fingerprinting with business-logic rules provides effective filtering, these composite signals can produce false positives as client behaviors evolve. Patterns that once exclusively indicated abuse may start matching legitimate logged-out users over time.
Regularly re-evaluate whether the patterns your rules match still accurately distinguish abusive traffic from legitimate usage.
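
Periodic re-evaluation is easier when each rule's blocks can be summarized separately. The sketch below assumes a sample of block decisions that have been labeled as legitimate or abusive after the fact; how such labels are obtained (user reports, manual review, other signals) is outside what the source describes, and the `BlockDecision` type, rule names and 1% threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class BlockDecision(NamedTuple):
    rule: str         # rule that blocked the request
    legitimate: bool  # True if the request was later judged to be a real user


def false_positive_rates(decisions: Iterable[BlockDecision]) -> Dict[str, float]:
    """Share of each rule's blocks that hit legitimate traffic."""
    blocked: Counter = Counter()
    false_positives: Counter = Counter()
    for d in decisions:
        blocked[d.rule] += 1
        if d.legitimate:
            false_positives[d.rule] += 1
    return {rule: false_positives[rule] / total for rule, total in blocked.items()}


def rules_needing_review(decisions: Iterable[BlockDecision],
                         max_fp_rate: float = 0.01) -> List[str]:
    """Rules whose false-positive share exceeds the (assumed) threshold."""
    rates = false_positive_rates(decisions)
    return sorted(rule for rule, rate in rates.items() if rate > max_fp_rate)


sample = [
    BlockDecision("incident-rule-17", legitimate=True),
    BlockDecision("incident-rule-17", legitimate=False),
    BlockDecision("incident-rule-17", legitimate=True),
    BlockDecision("scraper-limit", legitimate=False),
    BlockDecision("scraper-limit", legitimate=False),
]
print(rules_needing_review(sample))  # ['incident-rule-17']
```

A per-rule breakdown like this shows which specific rule has drifted, which an aggregate false-positive percentage would hide.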

Related Concepts

Rate Limiting
DDoS Protection
Incident Response
Observability
Technical Debt
Traffic Fingerprinting
False-positive Management
Multi-layered Infrastructure Defense
Site Reliability Engineering
Abuse Detection
HAProxy Configuration
Post-incident Review