Our best practices for quickly identifying, resolving, and preventing issues at scale.
Overview
This article explores how GitHub's platform engineering team approaches infrastructure problems differently from product engineering. Written by a GitHub engineer who transitioned from product to platform work, it covers domain understanding, platform-specific technical skills, knowledge sharing, understanding impact radius, and testing strategies for foundational services like DNS.
What You'll Learn
- How to ramp up on a new platform engineering domain by leveraging handover meetings, backlog investigation, and documentation
- Why platform engineering requires deeper technical skills in networking, operating systems, IaC, and distributed systems than product engineering does
- How to assess and manage the impact radius of changes to foundational platform services
- How to test infrastructure changes safely in distributed environments using IaC validation, E2E traffic shifting, and self-healing verification
- Why knowledge sharing prevents institutional knowledge loss and accelerates team problem-solving
Prerequisites & Requirements
- Basic software engineering experience, either in product or platform roles
- Familiarity with concepts like DNS, networking, and distributed systems (optional)
- Understanding of Infrastructure as Code tools like Terraform or Ansible (optional)
Key Questions Answered
- What is the difference between platform engineering and product engineering?
- How do you ramp up on a new platform engineering domain quickly?
- What technical skills are essential for platform engineers?
- How does GitHub test changes to foundational infrastructure services like DNS?
- Why is impact radius important in platform engineering?
- How should platform teams approach monitoring and observability?
- Why is knowledge sharing critical for platform engineering teams?
Key Actionable Insights
1. When transitioning to a new platform domain, arrange structured handover meetings with the previous owners and systematically review old backlog issues. This dual approach gives you both the conceptual understanding from experienced colleagues and the practical context of known system limitations and recurring problems. This is especially important when inheriting new Areas of Responsibility (AoRs), as GitHub's team experienced when moving to the infrastructure organization.
2. Build a condensed monitoring dashboard with a Single Availability Metric (SAM) that gives engineers a quick health overview of the platform. This allows rapid issue identification without requiring engineers to sift through detailed logs, significantly reducing time to detect and mitigate incidents. GitHub recommends this approach for foundational services like DNS where the impact radius is large and fast incident response is critical to preventing cascading failures across multiple products.
3. Always test infrastructure changes on isolated test machines before production rollout, and deploy changes on a host-by-host basis. This includes validating IaC provisioning and deprovisioning operations, directing small portions of traffic for E2E testing, and verifying self-healing capabilities under unexpected loads. This incremental approach allows individual machine rollback and prevents changes from being applied to unaffected hosts, which is critical for services like DNS where errors can propagate widely.
4. Study postmortems from past incidents related to your platform to build context around what changes or failures were introduced, how your platform played a role, and how issues were resolved. This builds practical understanding of your system's failure modes and informs safer change management. Asking 'What is the impact of this incident?' helps platform engineers understand downstream dependencies and the true blast radius of their services.
5. Invest in learning platform-specific technical skills beyond typical product engineering, including network fundamentals (TCP, UDP, L4 load balancing), operating system and hardware selection, Infrastructure as Code tooling, and distributed systems resilience patterns. Platform teams serve as the foundational layer, so they require deeper technical knowledge than product teams to make informed decisions about scalability, cost, security, and reliability.
6. Communicate directly with dependent teams before making changes to foundational services to understand how proposed modifications may affect downstream services. This proactive approach to understanding your impact radius helps prevent unexpected disruptions across the product ecosystem. At GitHub, the DNS team's changes can affect everything from GitHub Pages to GitHub Copilot, making cross-team communication essential before any platform modification.
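As a rough illustration of insight 2, the sketch below collapses per-host probe results into one percentage. The source does not say how GitHub computes its Single Availability Metric, so the formula (successful probes divided by attempted probes), the function name, and the host names are all assumptions made for this example.

```python
from typing import Dict, Tuple


def single_availability_metric(probe_counts: Dict[str, Tuple[int, int]]) -> float:
    """Collapse per-host (successful, attempted) probe counts into one percentage.

    Assumption: availability is the share of successful probes across all hosts.
    """
    successful = sum(ok for ok, _ in probe_counts.values())
    attempted = sum(tried for _, tried in probe_counts.values())
    if attempted == 0:
        # No probe data at all: report 0.0 instead of dividing by zero.
        return 0.0
    return 100.0 * successful / attempted


if __name__ == "__main__":
    # Hypothetical probe results: host -> (successful, attempted).
    counts = {"dns-host-a": (99, 100), "dns-host-b": (100, 100)}
    print(f"SAM: {single_availability_metric(counts):.2f}%")
```

A single number like this is only useful for a quick health check; the detailed logs are still needed to find out which host or query type is failing.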
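The host-by-host rollout in insight 3 can be sketched as a loop that changes one machine, checks it, and stops at the first failure. This is a minimal sketch, not GitHub's tooling: `apply_change`, `is_healthy`, and `rollback` are hypothetical callbacks standing in for whatever IaC and health-check commands a team actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)      # hosts now running the change
    rolled_back: List[str] = field(default_factory=list)  # hosts reverted after a failed check
    untouched: List[str] = field(default_factory=list)    # hosts never modified


def rollout_host_by_host(
    hosts: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Apply a change one host at a time, halting at the first unhealthy host."""
    result = RolloutResult()
    for index, host in enumerate(hosts):
        apply_change(host)
        if is_healthy(host):
            result.updated.append(host)
            continue
        # Revert only the failing host and stop; later hosts are never touched.
        rollback(host)
        result.rolled_back.append(host)
        result.untouched.extend(hosts[index + 1:])
        break
    return result


if __name__ == "__main__":
    # Simulated fleet with hypothetical host names; the third host fails its check.
    changed = set()
    outcome = rollout_host_by_host(
        ["dns-test-01", "dns-test-02", "dns-test-03", "dns-test-04"],
        apply_change=changed.add,
        is_healthy=lambda host: host != "dns-test-03",
        rollback=changed.discard,
    )
    print(outcome)
```

Stopping at the first failure keeps the problem confined to one machine, which matches the goal of individual rollback without touching unaffected hosts.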
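One way to make the impact radius in insight 6 concrete is to walk a dependency map and list every service that relies on the one being changed, directly or transitively. The sketch below assumes such a map exists. The graph is invented for illustration: only the fact that DNS changes can reach GitHub Pages and GitHub Copilot comes from the article, and the intermediate `internal-gateway` node is hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def impact_radius(dependents: Dict[str, List[str]], service: str) -> Set[str]:
    """Return every service that depends on `service`, directly or transitively.

    `dependents` maps a service to the services that consume it.
    """
    seen: Set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            # Skip services already visited, and the starting service itself
            # in case the map contains a cycle.
            if downstream not in seen and downstream != service:
                seen.add(downstream)
                queue.append(downstream)
    return seen


if __name__ == "__main__":
    # Hypothetical dependency map: provider -> consumers.
    graph = {
        "dns": ["github-pages", "internal-gateway"],
        "internal-gateway": ["github-copilot"],
    }
    print(sorted(impact_radius(graph, "dns")))
```

The resulting set is a starting list of teams to contact before a change; it does not replace the direct conversations the article recommends.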