How GitHub engineers tackle platform problems

Our best practices for quickly identifying, resolving, and preventing issues at scale.

Fabian Aguilar Gomez
7 min read, intermediate

Overview

This article explores how GitHub's platform engineering team approaches infrastructure problems differently from product engineering. Written by a GitHub engineer who transitioned from product to platform work, it covers domain understanding, platform-specific technical skills, knowledge sharing, understanding impact radius, and testing strategies for foundational services like DNS.

What You'll Learn

1
How to ramp up on a new platform engineering domain by leveraging handover meetings, backlog investigation, and documentation
2
Why platform engineering requires deeper technical skills in networking, operating systems, IaC, and distributed systems compared to product engineering
3
How to assess and manage the impact radius of changes to foundational platform services
4
How to test infrastructure changes safely in distributed environments using IaC validation, E2E traffic shifting, and self-healing verification
5
Why knowledge sharing prevents institutional knowledge loss and accelerates team problem-solving

Prerequisites & Requirements

  • Basic software engineering experience, either in product or platform roles
  • Familiarity with concepts like DNS, networking, and distributed systems (optional)
  • Understanding of Infrastructure as Code tools like Terraform or Ansible (optional)

Key Questions Answered

What is the difference between platform engineering and product engineering?
Product engineers build end-user-facing features and products directly, while platform engineers supply the foundational tools, infrastructure, and services that product engineers depend on. Platform engineers serve internal customers rather than external users, focusing on reliability, scalability, and providing the building blocks that enable product teams to ship effectively.
How do you ramp up on a new platform engineering domain quickly?
Three key strategies help you ramp up: arrange handover meetings with teams experienced in the domain to learn terminology and context, investigate old and stale backlog issues to understand current system limitations and improvement areas, and thoroughly read existing documentation to understand how the system works and its design decisions.
What technical skills are essential for platform engineers?
Platform engineers need proficiency in four core areas: networking fundamentals (TCP, UDP, L4 load balancing, and debugging tools like dig); operating systems and hardware selection, for scalability and cost management; Infrastructure as Code tools like Terraform, Ansible, and Consul, which reduce human error; and distributed systems concepts, including failover and recovery mechanisms.
How does GitHub test changes to foundational infrastructure services like DNS?
GitHub tests infrastructure changes in layers: first validating IaC operations such as provisioning and deprovisioning on test machines, then performing end-to-end testing by directing a small portion of network traffic to test servers and observing their behavior, and finally testing self-healing to verify that the platform can recover from unexpected load. Changes are then rolled out host by host, so that any individual machine can be rolled back.
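The host-by-host rollout described above can be sketched roughly as follows. This is a minimal illustration, not GitHub's actual tooling: the function `rollout_host_by_host`, the `RolloutResult` type, and the three callbacks (`apply_change`, `is_healthy`, `rollback`) are hypothetical names, and the "fleet" is an in-memory dictionary standing in for real machines.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)      # hosts left on the new config
    rolled_back: List[str] = field(default_factory=list)  # hosts reverted after a failed check
    untouched: List[str] = field(default_factory=list)    # hosts the change never reached


def rollout_host_by_host(
    hosts: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Apply a change one host at a time and stop at the first unhealthy host."""
    result = RolloutResult()
    for index, host in enumerate(hosts):
        apply_change(host)
        if is_healthy(host):
            result.updated.append(host)
            continue
        # Revert only the failing host; the hosts after it were never changed.
        rollback(host)
        result.rolled_back.append(host)
        result.untouched.extend(hosts[index + 1:])
        break
    return result


if __name__ == "__main__":
    # Hypothetical fleet: host name -> config version.
    fleet = {"dns-1": "v1", "dns-2": "v1", "dns-3": "v1"}
    failing = {"dns-2"}  # assume the new config fails its health check on this host

    outcome = rollout_host_by_host(
        hosts=list(fleet),
        apply_change=lambda h: fleet.__setitem__(h, "v2"),
        is_healthy=lambda h: h not in failing,
        rollback=lambda h: fleet.__setitem__(h, "v1"),
    )
    print(outcome)
    print(fleet)
```

Stopping at the first failed health check is one possible policy. It keeps the remaining hosts on the known-good configuration while the failure is investigated.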
Why is impact radius important in platform engineering?
Platform services are foundational building blocks, so even minor changes can have extensive repercussions across many products. For example, a small DNS change at GitHub could disrupt access to content across the entire site, affecting services from GitHub Pages to GitHub Copilot. Understanding downstream dependencies through direct team communication, postmortems, and monitoring is critical to preventing widespread outages.
How should platform teams approach monitoring and observability?
Platform teams should condense important monitoring and logging into a small, quickly digestible format that shows general system health. A Single Availability Metric (SAM) displayed on a single dashboard allows engineers to rapidly pinpoint issue sources and streamlines debugging and incident mitigation, rather than requiring engineers to search through and interpret detailed monitors or log messages.
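The article does not say how a Single Availability Metric is computed, so the following is only one plausible sketch. It assumes that SAM is the share of successful requests across all platform components in a time window. `ComponentCounts`, `single_availability_metric`, and `worst_component` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ComponentCounts:
    name: str
    successful: int  # requests served correctly in the window
    total: int       # all requests seen in the window


def single_availability_metric(components: Iterable[ComponentCounts]) -> float:
    """Return overall availability as a percentage between 0 and 100."""
    components = list(components)
    total = sum(c.total for c in components)
    if total == 0:
        # No traffic observed: report full availability instead of dividing by zero.
        return 100.0
    successful = sum(c.successful for c in components)
    return 100.0 * successful / total


def worst_component(components: Iterable[ComponentCounts]) -> Optional[str]:
    """Name the component with the lowest success ratio, or None without traffic."""
    with_traffic = [c for c in components if c.total > 0]
    if not with_traffic:
        return None
    return min(with_traffic, key=lambda c: c.successful / c.total).name


if __name__ == "__main__":
    # Hypothetical counts for two DNS components.
    window = [
        ComponentCounts("resolver", successful=990, total=1000),
        ComponentCounts("authoritative", successful=500, total=1000),
    ]
    print(single_availability_metric(window))
    print(worst_component(window))
```

Pairing the single number with the weakest component matches the stated goal: one value to glance at, plus a pointer to where to look first.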
Why is knowledge sharing critical for platform engineering teams?
Knowledge sharing is essential for three reasons: collaboration leads to quicker problem resolution and innovation as engineers learn from each other, it prevents institutional knowledge loss when engineers leave or are unavailable, and it helps teams build reliable, scalable, and secure platforms that better serve customers. Sharing lessons about what worked and what didn't provides valuable new perspectives.

Technologies & Tools


  • Terraform (Infrastructure as Code): Infrastructure provisioning and modification automation to reduce human error
  • Ansible (Infrastructure as Code): Configuration management and infrastructure automation
  • Consul (Infrastructure as Code): Service discovery and infrastructure management
  • DNS (Infrastructure Service): Foundational platform service that the author's team is responsible for at GitHub
  • GitHub Actions (CI/CD): Referenced as the team's previous product work on deployment views across environments
  • GitHub Copilot (Developer Tool): Referenced as a downstream product affected by DNS platform changes
  • GitHub Pages (Hosting Platform): Referenced as a downstream product affected by DNS platform changes

Key Actionable Insights

1
When transitioning to a new platform domain, arrange structured handover meetings with the previous owners and systematically review old backlog issues. This dual approach gives you both the conceptual understanding from experienced colleagues and the practical context of known system limitations and recurring problems.
This is especially important when inheriting new Areas of Responsibility (AoRs), as GitHub's team experienced when moving to the infrastructure organization.
2
Build a condensed monitoring dashboard with a Single Availability Metric (SAM) that gives engineers a quick health overview of the platform. This allows rapid issue identification without requiring engineers to sift through detailed logs, significantly reducing time to detect and mitigate incidents.
GitHub recommends this approach for foundational services like DNS where the impact radius is large and fast incident response is critical to preventing cascading failures across multiple products.
3
Always test infrastructure changes on isolated test machines before production rollout, and deploy changes on a host-by-host basis. This includes validating IaC provisioning and deprovisioning operations, directing small portions of traffic for E2E testing, and verifying self-healing capabilities under unexpected loads.
This incremental approach allows individual machines to be rolled back and keeps a faulty change from reaching hosts that have not yet been updated, which is critical for services like DNS where errors can propagate widely.
4
Study postmortems from past incidents related to your platform to build context around what changes or failures were introduced, how your platform played a role, and how issues were resolved. This builds practical understanding of your system's failure modes and informs safer change management.
Asking 'What is the impact of this incident?' helps platform engineers understand downstream dependencies and the true blast radius of their services.
5
Invest in learning platform-specific technical skills beyond typical product engineering, including network fundamentals (TCP, UDP, L4 load balancing), operating system and hardware selection, Infrastructure as Code tooling, and distributed systems resilience patterns.
Platform teams serve as the foundational layer, so they require deeper technical knowledge than product teams to make informed decisions about scalability, cost, security, and reliability.
6
Communicate directly with dependent teams before making changes to foundational services to understand how proposed modifications may affect downstream services. This proactive approach to understanding your impact radius helps prevent unexpected disruptions across the product ecosystem.
At GitHub, the DNS team's changes can affect everything from GitHub Pages to GitHub Copilot, making cross-team communication essential before any platform modification.
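One way to make the impact radius from the last insight concrete is to walk a dependency graph outward from the service being changed. The sketch below assumes such a graph is available as a plain dictionary. The function `impact_radius` and the example edges (including the `internal-api` service) are hypothetical and are not taken from GitHub's real architecture.

```python
from collections import deque
from typing import Dict, List, Set


def impact_radius(dependencies: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every service that depends on `changed`, directly or transitively."""
    # Invert "service -> what it depends on" into "service -> who depends on it".
    dependents: Dict[str, Set[str]] = {}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted graph, starting at the changed service.
    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != changed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


if __name__ == "__main__":
    # Hypothetical graph: each service lists what it depends on.
    graph = {
        "pages": ["dns"],
        "internal-api": ["dns"],
        "copilot": ["internal-api"],
        "billing": [],
    }
    print(sorted(impact_radius(graph, "dns")))
```

A list like this is a starting point for deciding which teams to talk to before a change. It does not replace those conversations, because real dependencies are often undocumented.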

Common Pitfalls

1
Applying product engineering testing approaches to platform engineering without adaptation. Platform changes have a much wider impact radius than product features, and a minor alteration to foundational services like DNS can disrupt access across an entire site and affect multiple products simultaneously.
Platform engineers need to adopt host-by-host rollout strategies with rollback capabilities rather than deploying changes broadly, and must explicitly test IaC provisioning, E2E traffic behavior, and self-healing capabilities.
2
Failing to understand downstream dependencies before making platform changes. Without mapping out which products and services depend on your platform, even small modifications can cause unexpected cascading failures across the organization.
Proactively communicate with dependent teams and review past postmortems to build a comprehensive understanding of your service's impact radius before making any changes.
3
Not investing in knowledge sharing, which leads to institutional knowledge being siloed in individual engineers. When those engineers leave or are unavailable, critical context about system design, past incidents, and operational procedures is permanently lost.
Document lessons learned, share findings across teams, and treat knowledge dissemination as an essential engineering practice rather than an optional activity.
4
Selecting inappropriate virtual machines or operating systems without considering scalability, cost, and security implications. Choosing systems with known vulnerabilities or those nearing end of life can introduce significant risk to the platform.
Platform engineers must develop a strong understanding of both hardware capabilities and operating system lifecycle management to make well-informed infrastructure decisions.
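The end-of-life concern in the last pitfall can be checked mechanically if the team keeps an inventory of operating system images. The following is a rough sketch under that assumption. `HostImage`, `nearing_end_of_life`, the 180-day warning window, and all host names, OS names, and dates are hypothetical placeholders rather than real vendor data.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class HostImage:
    host: str
    os_name: str
    end_of_life: date  # date after which the OS no longer receives updates


def nearing_end_of_life(
    images: Iterable[HostImage],
    today: date,
    warning_window: timedelta = timedelta(days=180),
) -> List[HostImage]:
    """Return images already past, or within `warning_window` of, end of life."""
    cutoff = today + warning_window
    flagged = [image for image in images if image.end_of_life <= cutoff]
    # Soonest (or longest overdue) end-of-life dates come first.
    return sorted(flagged, key=lambda image: image.end_of_life)


if __name__ == "__main__":
    # Hypothetical inventory with made-up dates.
    inventory = [
        HostImage("dns-1", "example-os 20", date(2025, 1, 1)),
        HostImage("dns-2", "example-os 24", date(2030, 1, 1)),
        HostImage("dns-3", "example-os 18", date(2024, 6, 1)),
    ]
    for image in nearing_end_of_life(inventory, today=date(2024, 9, 1)):
        print(image.host, image.os_name, image.end_of_life)
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.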

Related Concepts

Platform Engineering
Product Engineering
DNS Infrastructure
Infrastructure as Code
Distributed Systems
TCP/UDP Networking
L4 Load Balancing
Incident Management and Postmortems
Single Availability Metric (SAM)
Failover and Recovery Mechanisms
Host-by-host Deployment
Self-healing Systems
End-to-end Testing
Knowledge Management
Impact Radius Analysis