Scaling Kubernetes to 2,500 nodes

Christopher Berner

OpenAI

Azure, Datadog, Docker, Kubernetes, Neural Networks, Prometheus, Redis

Overview

The article discusses the challenges and solutions encountered while scaling Kubernetes to over 2,500 nodes, detailing specific issues with components like etcd, Kube masters, and Docker image pulls. It provides insights into performance optimizations and configurations necessary for managing large Kubernetes clusters effectively.

What You'll Learn

1. How to optimize etcd performance for large Kubernetes clusters

2. Why using local SSDs for etcd storage improves write latency

3. How to configure KubeDNS to avoid reliability issues in large clusters

4. When to adjust the ARP cache settings in Kubernetes environments

Prerequisites & Requirements

Understanding of Kubernetes architecture and components
Familiarity with monitoring tools like Datadog and Prometheus (optional)

Key Questions Answered

What issues arise when scaling Kubernetes beyond 500 nodes?

Once a cluster grows past 500 nodes, kubectl commands begin to time out and etcd write latency climbs to hundreds of milliseconds. The usual root causes are etcd's storage performance and the way the Kube masters are configured.

How can Docker image pull times be reduced in Kubernetes?

To reduce Docker image pull times, set the kubelet's --serialize-image-pulls flag to false and move the Docker root to an SSD. With serialized pulls disabled, one large image no longer blocks every other pull on the node, and the SSD speeds up extraction, which matters most for large images.
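The two settings above can be sketched as configuration data. This is a minimal illustration, not the article's own configuration: the helper name image_pull_settings and the path /mnt/ssd/docker are hypothetical. The field serializeImagePulls is the KubeletConfiguration counterpart of the --serialize-image-pulls flag, and data-root is the Docker daemon.json key for the Docker root directory.

```python
import json


def image_pull_settings(ssd_docker_root: str) -> dict:
    """Return kubelet and Docker daemon settings for parallel pulls on SSD.

    ssd_docker_root is a hypothetical mount point; use whatever path the
    node's local SSD is actually mounted at.
    """
    if not ssd_docker_root.startswith("/"):
        raise ValueError("ssd_docker_root must be an absolute path")
    return {
        # KubeletConfiguration field; same effect as --serialize-image-pulls=false
        "kubelet": {"serializeImagePulls": False},
        # Contents for /etc/docker/daemon.json
        "docker_daemon": {"data-root": ssd_docker_root},
    }


settings = image_pull_settings("/mnt/ssd/docker")
print(json.dumps(settings["docker_daemon"], indent=2))
```

The Docker daemon has to be restarted after daemon.json changes, and images already stored under the old root are not moved automatically.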

What configurations can improve KubeDNS reliability in large clusters?

Adding anti-affinity rules to the KubeDNS pods forces the scheduler to place them on different nodes. This prevents hotspots where several DNS pods on one node together exceed the allowed queries-per-second limit, and it makes DNS resolution more reliable in large Kubernetes environments.
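As a rough sketch of what such a rule looks like, the snippet below builds the affinity stanza of a pod spec and prints it as JSON, which Kubernetes accepts alongside YAML. The helper name is hypothetical, and the label k8s-app=kube-dns is an assumption about how the DNS pods are labeled in a given cluster; check the actual labels before using it.

```python
import json


def kube_dns_anti_affinity(label_key: str = "k8s-app",
                           label_value: str = "kube-dns") -> dict:
    """Build a pod-spec 'affinity' stanza that keeps matching pods apart.

    The default label is an assumption; verify it against the cluster.
    """
    term = {
        "labelSelector": {
            "matchExpressions": [
                {"key": label_key, "operator": "In", "values": [label_value]}
            ]
        },
        # At most one matching pod per node (nodes are keyed by hostname).
        "topologyKey": "kubernetes.io/hostname",
    }
    return {
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [term]
            }
        }
    }


print(json.dumps(kube_dns_anti_affinity(), indent=2))
```

A required rule leaves a replica unschedulable when no free node is available. The preferredDuringSchedulingIgnoredDuringExecution variant is the softer alternative if that is a concern.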

What are the recommended ARP cache settings for Kubernetes clusters?

For Kubernetes clusters, it's advisable to increase the ARP cache thresholds to prevent overflow and ensure smooth network communication. Adjusting settings in /etc/sysctl.conf can help manage the ARP cache effectively, especially as the number of pods increases.
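The thresholds in question are the Linux kernel's net.ipv4.neigh.default.gc_thresh1, gc_thresh2 and gc_thresh3 settings. The sketch below renders the corresponding /etc/sysctl.conf lines. The helper name and the numbers in the usage line are illustrative assumptions, not values taken from the text above; size them to the expected number of pod and node IPs.

```python
def arp_cache_sysctl_lines(gc_thresh1: int, gc_thresh2: int, gc_thresh3: int) -> list[str]:
    """Render sysctl.conf lines for the ARP (neighbor) cache thresholds.

    gc_thresh1: entry count below which the garbage collector does not run
    gc_thresh2: soft limit on cache entries
    gc_thresh3: hard limit on cache entries
    """
    if not 0 < gc_thresh1 <= gc_thresh2 <= gc_thresh3:
        raise ValueError("thresholds must be positive and non-decreasing")
    values = (gc_thresh1, gc_thresh2, gc_thresh3)
    return [
        f"net.ipv4.neigh.default.gc_thresh{i} = {value}"
        for i, value in enumerate(values, start=1)
    ]


# Illustrative values only; tune them to the cluster's IP count.
for line in arp_cache_sysctl_lines(80000, 90000, 100000):
    print(line)
```

After the lines are added to /etc/sysctl.conf, running sysctl -p applies them without a reboot.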

Key Statistics & Figures

Cluster size reached (nodes)

2,500

This is the scale to which the discussed Kubernetes cluster was successfully pushed.

Write latency for etcd on DS15v2 machines

hundreds of milliseconds

This latency was observed before optimizations were made to the etcd storage configuration.

Write latency after moving etcd to local SSDs

200 microseconds

This significant reduction in latency was achieved by switching to local SSD storage.

Default etcd storage limit

2GB

Reaching this limit caused cascading failures in the Kubernetes cluster, necessitating an increase in the maximum etcd size.
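The storage limit is controlled by etcd's --quota-backend-bytes flag. The sketch below, with hypothetical helper names and an arbitrary 80% warning ratio, shows how the flag value for a larger quota could be computed and how current database size could be compared against the quota before writes start failing.

```python
GIB = 1024 ** 3
DEFAULT_QUOTA_BYTES = 2 * GIB  # etcd's default backend quota


def quota_flag(quota_gib: int) -> str:
    """Return the etcd flag that sets the backend quota to quota_gib GiB."""
    if quota_gib <= 0:
        raise ValueError("quota_gib must be positive")
    return f"--quota-backend-bytes={quota_gib * GIB}"


def near_quota(db_size_bytes: int,
               quota_bytes: int = DEFAULT_QUOTA_BYTES,
               warn_ratio: float = 0.8) -> bool:
    """True when the database has used at least warn_ratio of its quota.

    warn_ratio is an arbitrary choice for this sketch.
    """
    return db_size_bytes >= warn_ratio * quota_bytes


print(quota_flag(8))                     # flag for an 8 GiB quota
print(near_quota(int(1.7 * GIB)))        # 1.7 GiB used against the 2 GiB default
```

The current database size can be read from the DB SIZE column of etcdctl endpoint status.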

Technologies & Tools

Orchestration

Kubernetes

Used for managing containerized applications across a cluster of machines.

Database

etcd

Serves as the central store of state for the Kubernetes cluster.

Containerization

Docker

Used for creating and managing containers in the Kubernetes environment.

Networking

Flannel

Provides a network fabric for Kubernetes clusters.

Monitoring

Prometheus

Used for monitoring Kubernetes components and performance metrics.

Monitoring

Datadog

Used for monitoring and logging within the Kubernetes environment.

Key Actionable Insights

1. Optimize etcd performance by using local SSDs for storage to significantly reduce write latency.
This adjustment is crucial for maintaining a healthy etcd cluster, especially when scaling beyond 1,000 nodes, where write latency can become a bottleneck.

2. Implement anti-affinity rules for KubeDNS pods to avoid reliability issues caused by hotspots.
This strategy helps distribute DNS queries more evenly across nodes, ensuring that no single node becomes overwhelmed, which is critical in large-scale deployments.

3. Adjust the kubelet's image pull settings to allow concurrent downloads, reducing the time pods spend in a Pending state.
This is particularly important for large images, as it allows for faster pod startup times and improves overall cluster efficiency.

4. Regularly monitor etcd performance metrics to identify and address latency issues proactively.
Using tools like Prometheus can help in tracking etcd's performance and ensuring that the cluster remains responsive as it scales.
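etcd exposes its disk latencies to Prometheus as histograms, for example etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds. To make concrete what a latency percentile from such a histogram means, here is a minimal sketch of the linear interpolation that a quantile estimate over cumulative buckets performs. The function is a simplified stand-in written for this example, not Prometheus's implementation, and the bucket counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have an upper bound of +Inf, as in the
    Prometheus exposition format. Interpolates linearly inside a bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Open-ended or empty bucket: no interpolation is possible.
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts for a WAL fsync histogram (bounds in seconds).
fsync_buckets = [(0.001, 50), (0.002, 80), (0.004, 95), (0.008, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, fsync_buckets)
print(f"estimated p99 fsync latency: {p99 * 1000:.1f} ms")
```

In practice the same estimate comes from Prometheus's histogram_quantile() query function applied to the rate of the _bucket series, and an alert on that value catches storage latency regressions before kubectl starts timing out.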

Common Pitfalls

1. Failing to monitor etcd performance can lead to high latency and cluster instability.

Without proper monitoring, issues may go unnoticed until they cause significant operational problems, making it essential to implement monitoring solutions like Prometheus.

2. Not adjusting the default etcd storage limit can lead to cascading failures.

Exceeding the default 2GB limit without adjustments can cause etcd to stop accepting writes, resulting in health check failures across the cluster.

3. Overloading KubeDNS with too many queries can lead to reliability issues.

If KubeDNS pods are not properly distributed, they can exceed their query limits, causing DNS resolution failures in the cluster.

Related Concepts

Kubernetes Scaling Strategies

Etcd Performance Optimization

Docker Image Management

Kubernetes Networking Solutions