The Evolution of Container Usage at Netflix

Netflix Technology Blog

Overview

This article discusses the evolution of container usage at Netflix, focusing on the development and implementation of Titus, Netflix's container management platform. It highlights significant milestones, the growth of container usage, and the integration of containers into Netflix's infrastructure to enhance developer productivity and service delivery.

What You'll Learn

1. How to leverage Titus for efficient batch job scheduling
2. Why integrating containers with existing infrastructure is crucial for reliability
3. How to improve deployment speed using Docker layered images

Prerequisites & Requirements

  • Understanding of containerization concepts
  • Familiarity with AWS services, particularly EC2 (optional)

Key Questions Answered

What is Titus and how does it enhance container management at Netflix?
Titus is Netflix's container management platform that provides scalable cluster and resource management, enabling efficient execution of both service and batch jobs. It integrates deeply with AWS EC2, allowing for optimized resource scheduling and improved developer productivity by simplifying infrastructure management.
How has container usage at Netflix evolved over time?
Netflix has seen a dramatic increase in container usage, launching over one million containers per week, up from a few thousand at the start of Titus in December 2015. This growth reflects the expanding role of containers in supporting various workloads, including customer-facing services.
What challenges does Netflix face in running a large-scale container platform?
Running a container platform at Netflix's scale presents reliability and scalability challenges, so the system has to be designed for failure. Problems can arise at every level of the system, which makes careful monitoring necessary, along with a deliberate balance between reliability and new functionality in the container management system.
How does Titus support both batch and service jobs?
Titus supports service jobs that run continuously and batch jobs that execute until completion. It manages resource scheduling efficiently, allowing for autoscaling and retries based on defined policies, which enhances operational reliability and performance.
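
The distinction between the two job types, and the idea of retrying a batch job under a defined policy, can be sketched in a few lines of Python. This is a minimal illustration only: `JobType`, `JobSpec`, `run_batch`, and the `max_retries` field are hypothetical names invented for this example and are not the Titus API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class JobType(Enum):
    SERVICE = "service"  # runs continuously until explicitly stopped
    BATCH = "batch"      # runs until its task completes


@dataclass
class JobSpec:
    # Hypothetical job description; field names are assumptions for illustration.
    name: str
    job_type: JobType
    cpus: int
    memory_mb: int
    max_retries: int = 0  # extra attempts allowed after a failed run


def run_batch(spec: JobSpec, task: Callable[[], bool]) -> int:
    """Run a batch task until it succeeds or the retry policy is exhausted.

    Returns the number of attempts made; raises RuntimeError if every attempt fails.
    """
    if spec.job_type is not JobType.BATCH:
        raise ValueError("run_batch only accepts batch jobs")
    for attempt in range(1, spec.max_retries + 2):
        if task():
            return attempt
    raise RuntimeError(f"{spec.name} failed after {spec.max_retries + 1} attempts")


# A flaky task that fails twice and then succeeds.
outcomes = iter([False, False, True])
spec = JobSpec("nightly-report", JobType.BATCH, cpus=2, memory_mb=4096, max_retries=3)
print(run_batch(spec, lambda: next(outcomes)))  # 3
```

A service job would not fit this loop, since it has no completion point; a platform would instead keep a desired number of instances running and replace any that exit.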

Key Statistics & Figures

  • Containers launched per week: over one million
    This milestone reflects Netflix's significant growth in container usage, marking a 1000X increase since the initial launch of Titus.
  • Compute resources for batch users: 500 r3.8xl instances with 16,000 cores and 120 TB of memory
    These resources support various batch workloads, showcasing the scale at which Netflix operates.
  • Service job containers: over 10,000
    These containers are used for long-running stream processing jobs, indicating the extensive use of Titus in production.
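
The batch compute figures above can be cross-checked with simple arithmetic. The per-instance numbers below (32 vCPUs and 244 GiB of memory for an r3.8xlarge) are taken from AWS's published instance specifications rather than from the article, and the sketch assumes "cores" means vCPUs.

```python
INSTANCES = 500
VCPUS_PER_INSTANCE = 32         # assumed r3.8xlarge spec, not stated in the article
MEMORY_GIB_PER_INSTANCE = 244   # assumed r3.8xlarge spec, not stated in the article

total_vcpus = INSTANCES * VCPUS_PER_INSTANCE
total_memory_gib = INSTANCES * MEMORY_GIB_PER_INSTANCE
total_memory_tib = total_memory_gib / 1024

print(total_vcpus)                 # 16000
print(total_memory_gib)            # 122000
print(round(total_memory_tib, 1))  # 119.1
```

Under these assumptions the totals line up with the quoted 16,000 cores and roughly 120 TB of memory.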

Technologies & Tools

  • Container Management: Titus
    Used for managing containerized applications and scheduling workloads at Netflix.
  • Cloud Infrastructure: AWS EC2
    Provides the underlying virtual machine infrastructure for running containers.
  • Containerization: Docker
    Used for creating and managing container images for applications.
  • Continuous Delivery: Spinnaker
    Facilitates deployment processes across both virtual machines and containers.
  • Scheduling: Fenzo
    An open-source library used for resource scheduling in Titus.
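
To give a feel for what resource scheduling involves, here is a minimal sketch of matching container resource requests to instances with free capacity. It uses a simple first-fit strategy, largest request first. It is not Fenzo's API or algorithm (Fenzo is a Java library); `Instance`, `TaskRequest`, `schedule`, and the instance names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    free_cpus: int
    free_memory_gb: int
    tasks: list = field(default_factory=list)


@dataclass
class TaskRequest:
    name: str
    cpus: int
    memory_gb: int


def schedule(requests, instances):
    """Place each request on the first instance with enough free capacity.

    Returns a dict mapping task name to instance name (None if unplaced).
    """
    placements = {}
    # Placing larger requests first tends to pack instances more tightly.
    for req in sorted(requests, key=lambda r: (r.cpus, r.memory_gb), reverse=True):
        placements[req.name] = None
        for inst in instances:
            if inst.free_cpus >= req.cpus and inst.free_memory_gb >= req.memory_gb:
                inst.free_cpus -= req.cpus
                inst.free_memory_gb -= req.memory_gb
                inst.tasks.append(req.name)
                placements[req.name] = inst.name
                break
    return placements


fleet = [Instance("i-a", 32, 244), Instance("i-b", 32, 244)]
jobs = [
    TaskRequest("encode", 24, 100),
    TaskRequest("report", 16, 64),
    TaskRequest("etl", 8, 32),
]
print(schedule(jobs, fleet))
# {'encode': 'i-a', 'report': 'i-b', 'etl': 'i-a'}
```

The point of the sketch is that the job author states only CPU and memory needs; choosing which instance hosts the container is left to the scheduler.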

Key Actionable Insights

1. Adopting a platform like Titus can make batch job scheduling more efficient.
   With Titus, developers specify the resources a job needs instead of choosing and managing AWS EC2 instance sizes, so batch applications get running faster.
2. Integrating container management with existing tools like Spinnaker is crucial for consistent deployment processes.
   This integration allows developers to maintain a uniform deployment experience across both virtual machines and containers, reducing complexity and improving reliability.
3. Using Docker layered images can drastically reduce deployment times.
   With Docker, Netflix has reduced deployment times from tens of minutes to just one or two minutes, which accelerates feature delivery to customers.
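
One reason layered images speed up deployment is that a host only needs to download the layers it does not already have. The sketch below models that idea with made-up layer names and sizes. Real Docker identifies layers by content digest, so the names here are stand-ins, and `layers_to_pull` is a hypothetical helper rather than a Docker API.

```python
def layers_to_pull(image_layers, cached_layers):
    """Return the layers that must be downloaded, preserving image order."""
    return [layer for layer in image_layers if layer not in cached_layers]


# Hypothetical layer sizes in MB.
sizes = {"base-os": 400, "jvm": 250, "app-deps": 120, "app-v1": 15, "app-v2": 15}

# Two releases that differ only in the top application layer.
v1 = ["base-os", "jvm", "app-deps", "app-v1"]
v2 = ["base-os", "jvm", "app-deps", "app-v2"]

cache = set()
first_pull = layers_to_pull(v1, cache)
cache.update(first_pull)
second_pull = layers_to_pull(v2, cache)

print(sum(sizes[layer] for layer in first_pull))   # 785
print(sum(sizes[layer] for layer in second_pull))  # 15
```

In this toy model the second release transfers only the changed application layer, which illustrates why small code changes can roll out much faster than rebuilding a full machine image.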

Common Pitfalls

1. Failing to integrate containers with existing infrastructure can lead to operational challenges.
   Without proper integration, teams may struggle with inconsistent deployment processes and increased complexity, which can hinder productivity and reliability.
2. Neglecting to monitor system performance can result in scalability issues.
   As container usage scales, it is essential to measure performance against service level objectives so that reliability and functionality can be balanced effectively.
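
Measuring against a service level objective can be as simple as comparing an observed success rate with a target. The following is a minimal sketch under assumed numbers: the 99.9% launch-success target and the weekly counts are illustrative, not figures from the article, and the function names are hypothetical.

```python
def success_rate(successes: int, total: int) -> float:
    """Fraction of operations that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def meets_slo(successes: int, total: int, target: float) -> bool:
    """True when the observed success rate is at or above the SLO target."""
    return success_rate(successes, total) >= target


# Illustrative week: one million container launches, 400 of which failed.
launches = 1_000_000
failed = 400
slo_target = 0.999  # assumed target, not taken from the article

print(meets_slo(launches - failed, launches, slo_target))  # True
print(meets_slo(launches - 5_000, launches, slo_target))   # False
```

Tracking a number like this over time gives a concrete signal for when to prioritize reliability work over new features.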