User feedback led us to clean up outdated mitigations. See why observability and lifecycle management are critical for defense systems.
Overview
GitHub discovered that emergency rate-limiting and protection rules added during past abuse incidents had been left in place, quietly blocking legitimate users with 'too many requests' errors during normal browsing. The article details how they traced the issue across multiple infrastructure layers, quantifies the false-positive impact, and outlines their new approach to lifecycle management of defensive controls.
What You'll Learn
- Why emergency incident mitigations become technical debt when left in place without lifecycle management
- How to trace rate-limiting issues across multi-layered infrastructure stacks
- How composite fingerprinting signals can produce false positives against legitimate users
- When to treat incident mitigations as temporary by default and require intentional decisions for permanence
- How to build observability into defense mechanisms the same way you would for product features
Prerequisites & Requirements
- Understanding of rate limiting and traffic control concepts
- Familiarity with multi-layered infrastructure architectures (edge, application, service, backend tiers)
- Experience with incident response and abuse mitigation at scale (optional)
Key Questions Answered
- Why do emergency rate-limiting rules start blocking legitimate users over time?
- How do you trace which infrastructure layer is blocking a user request?
- What was the false-positive rate of GitHub's outdated protection rules?
- How should incident mitigations be managed to prevent them from becoming stale?
- What is the lifecycle of incident mitigations that become problematic?
- How does GitHub's multi-layered protection infrastructure work?
Key Actionable Insights
1. Set expiration dates on all emergency mitigations at creation time. When adding protective rules during incidents, include a TTL or review date so they don't become permanent by default. Making permanence require an intentional, documented decision prevents stale rules from silently accumulating. GitHub found that rules added during past incidents quietly persisted and started matching legitimate traffic patterns as usage evolved over time.
2. Build observability into your defense mechanisms with the same rigor you apply to product features. Defense systems need monitoring dashboards, alerting on false-positive rates, and visibility into what each rule is actually blocking today versus what it was originally designed to block. GitHub's investigation required tracing requests across multiple infrastructure layers with different log schemas; better observability would have surfaced the issue before users reported it.
3. Conduct post-incident reviews that specifically evaluate emergency controls and evolve them into sustainable, targeted solutions. Quick-response mitigations use broader patterns that are necessary in the moment but may not be appropriate long-term. Review each rule to narrow its scope or replace it with more precise controls. The composite fingerprinting signals GitHub used were effective during incidents but produced false positives because the broad patterns also matched some legitimate logged-out requests.
4. Implement cross-layer request tracing that can identify which specific defense layer and rule blocked a given request. When protections exist at edge, application, service, and backend layers, a single blocked request may require correlating logs from multiple systems to diagnose. GitHub's investigation went from user reports to edge logs to application logs to rule configurations, highlighting that without unified tracing, finding the source of blocks is time-consuming.
5. Use composite signals rather than single fingerprints for traffic filtering, but actively monitor the false-positive rate of each composite rule. GitHub's approach of combining fingerprinting with business-logic rules filtered out most matches (only 0.5–0.9% of fingerprint matches were blocked), but even this layered approach produced false positives that impacted real users. A low overall false-positive rate (0.003–0.004% of total traffic) still translated to real users being incorrectly blocked, demonstrating that any incorrect blocking is unacceptable at scale.
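The insights above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GitHub's implementation: the `MitigationRule` class, its field names, the dictionary-shaped request, and the example signals are all hypothetical. It shows a rule that expires by default unless a documented reason makes it permanent, blocks only when both a fingerprint and a business-logic condition match, and keeps per-rule counters so that the layer and rule behind a block can be identified and reviewed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional

Request = Dict[str, object]  # hypothetical request shape, for illustration only


@dataclass
class MitigationRule:
    name: str
    layer: str                                  # enforcing tier, e.g. "edge"
    fingerprint: Callable[[Request], bool]      # broad client signal
    business_logic: Callable[[Request], bool]   # narrower second condition
    created_at: datetime
    ttl: timedelta                              # temporary by default
    permanent_reason: Optional[str] = None      # set only by a documented decision
    matched: int = 0                            # requests matching the fingerprint
    blocked: int = 0                            # requests matching both signals

    def is_active(self, now: datetime) -> bool:
        # A rule outlives its TTL only if someone recorded why it should.
        if self.permanent_reason is not None:
            return True
        return now < self.created_at + self.ttl

    def blocks(self, request: Request, now: datetime) -> bool:
        """Return True if this rule blocks the request, updating counters."""
        if not self.is_active(now) or not self.fingerprint(request):
            return False
        self.matched += 1
        if self.business_logic(request):
            self.blocked += 1
            return True
        return False


def first_blocking_rule(
    rules: List[MitigationRule], request: Request, now: datetime
) -> Optional[MitigationRule]:
    """Return the rule (and therefore the layer) that blocked the request."""
    for rule in rules:
        if rule.blocks(request, now):
            return rule
    return None


def block_report(rules: List[MitigationRule], total_requests: int) -> Dict[str, dict]:
    """Summarize what each rule is blocking today.

    These shares are not false-positive rates on their own: deciding which
    blocks were wrong still needs separate labelling or user reports.
    """
    report = {}
    for rule in rules:
        report[rule.name] = {
            "layer": rule.layer,
            "matched": rule.matched,
            "blocked": rule.blocked,
            "blocked_share_of_matches": (
                rule.blocked / rule.matched if rule.matched else 0.0
            ),
            "blocked_share_of_traffic": (
                rule.blocked / total_requests if total_requests else 0.0
            ),
        }
    return report


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)  # hypothetical incident date
    rule = MitigationRule(
        name="incident-example",
        layer="edge",
        fingerprint=lambda r: r.get("client_signature") == "sig-a",
        business_logic=lambda r: not r.get("logged_in"),
        created_at=start,
        ttl=timedelta(days=14),
    )
    requests = [
        {"client_signature": "sig-a", "logged_in": False},
        {"client_signature": "sig-a", "logged_in": True},
        {"client_signature": "sig-b", "logged_in": False},
    ]
    for req in requests:
        hit = first_blocking_rule([rule], req, start)
        print(req, "blocked by", hit.name if hit else None)
    print(block_report([rule], total_requests=len(requests)))
```

Keeping `matched` and `blocked` separate is a deliberate choice in this sketch: the gap between the two shows how much the business-logic condition narrows the fingerprint, and a rising blocked share on an old rule is a prompt to review or retire it.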