Systems @Scale 2018 recap

Meta

Recently, we hosted our first-ever Systems @Scale conference. Held at Facebook’s Menlo Park campus, the event brought engineers from various companies to discuss the challenge of managing lar…

Overview

The Systems @Scale 2018 conference hosted by Facebook brought together engineers from various companies to discuss managing large-scale information systems. Key topics included software deployment strategies, stateful application management, and innovative scaling techniques from industry leaders.

What You'll Learn

1. How to manage stateful applications effectively in large-scale environments
2. Why using a service mesh can improve microservice networking and observability
3. How to implement low-downtime migrations for stateful applications in Kubernetes
4. How to leverage geo-replication in NoSQL databases for global applications
5. Why dynamic query evaluation and tracing can enhance debugging strategies

Key Questions Answered

How does Facebook manage to update its core software more frequently?

Facebook updates its core software at least 10 times more often than it did 10 years ago, despite significant growth in the number of servers, engineers, and users. It does so with management practices that keep systems running while new features are added.

What challenges does Amazon DynamoDB face with global tables?

Amazon DynamoDB's global tables feature allows for replication across AWS regions, but it requires careful design to maintain key properties like elastic scale, high availability, and predictable performance. Doug Terry discusses these challenges in his presentation.

What are the benefits of using systemd for container management?

systemd provides a container runtime with low cognitive overhead and manages processes and resources efficiently. Madelaine and Lindsay explain how it simplifies deployment and makes composable services easier to manage than traditional models do.

How does Shopify handle application migrations in Kubernetes?

Shopify coordinates low-downtime migrations for stateful applications between clusters and regions, addressing the complexities of maintaining high availability and managing resources effectively during these transitions.

Key Statistics & Figures

Software update frequency

10 times more often

Facebook now updates its core software at least 10 times more frequently than a decade ago.

Technologies & Tools


Database

DynamoDB

Used for global tables that support replication across AWS regions.

Runtime

systemd

Manages processes and resources for containerized applications.

Container Orchestration

Kubernetes

Facilitates application migrations and management of stateful applications.

Service Mesh

Envoy

Proxy that Lyft adopted for its service mesh to improve microservice networking and observability.

Key Actionable Insights

1. Implementing a service mesh can significantly enhance your microservice architecture by improving observability and operational agility.
As seen in Lyft's transition to using Envoy, a service mesh can alleviate networking issues and provide better insights into service interactions, which is crucial for maintaining performance in complex systems.

2. Utilizing geo-replication in your database design can improve availability and performance for global applications.
Doug Terry's talk on DynamoDB highlights the importance of designing for global distribution, which can help ensure that your application remains responsive and reliable across different regions.

3. Adopting dynamic query evaluation and tracing can streamline your debugging process and reduce the time spent on incident resolution.
Liz Fong-Jones emphasizes that building more dashboards is not the solution; instead, integrating these techniques can provide clearer insights into system behavior during outages.

Common Pitfalls

1. Relying solely on dashboards for monitoring can lead to inefficiencies during incident response.

Many engineers fall into the trap of over-relying on visual monitoring tools, which can obscure the root causes of issues. Instead, integrating dynamic query evaluation and tracing can provide more actionable insights.
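
To make the contrast with fixed dashboards concrete, here is a minimal sketch of querying structured events on demand and then following a single request as a trace. It is an illustration only, not the approach or tooling presented in the talk: the event fields (`trace_id`, `service`, `build`, `start_ms`, `duration_ms`), the sample data, and the helper names `query` and `trace` are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical structured events, one per unit of work in a request.
# All field names and values are made up for illustration.
EVENTS = [
    {"trace_id": "t1", "service": "api", "build": "v41", "start_ms": 0, "duration_ms": 120},
    {"trace_id": "t1", "service": "db", "build": "v41", "start_ms": 10, "duration_ms": 90},
    {"trace_id": "t2", "service": "db", "build": "v42", "start_ms": 15, "duration_ms": 900},
    {"trace_id": "t2", "service": "api", "build": "v42", "start_ms": 0, "duration_ms": 950},
    {"trace_id": "t3", "service": "api", "build": "v42", "start_ms": 0, "duration_ms": 880},
]


def query(events, where, group_by, field):
    """Filter events with `where`, group them by `group_by`, summarize `field`."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            groups[event[group_by]].append(event[field])
    return {
        key: {"count": len(vals), "median": median(vals), "max": max(vals)}
        for key, vals in groups.items()
    }


def trace(events, trace_id):
    """Return the events of one request, ordered by start time."""
    matching = [e for e in events if e["trace_id"] == trace_id]
    return sorted(matching, key=lambda e: e["start_ms"])


if __name__ == "__main__":
    # A question nobody built a dashboard for: is API latency tied to a build?
    by_build = query(EVENTS, lambda e: e["service"] == "api", "build", "duration_ms")
    for build, stats in sorted(by_build.items()):
        print(build, stats)
    # Then drill into one slow request to see where the time went.
    for span in trace(EVENTS, "t2"):
        print(span["service"], span["start_ms"], span["duration_ms"])
```

The filter and grouping field are chosen at query time, so a new question during an incident does not require a new dashboard. The trace lookup then narrows the investigation from an aggregate to one concrete request.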