Replacement-level homeowners buy boxes of pens and stick them in “the pen drawer”. What the elites know: you have to think adversarially about pens. “The purpose of a system is what it does”; a household’s is to uniformly distribute pens. Months fro
Overview
This article explains the design and implementation of Sprites, Fly.io's new product offering instant-creation Linux VMs with 100GB durable storage backed by object storage. The post details three key architectural decisions that differentiate Sprites from Fly Machines: eliminating container images for instant creation, using S3-compatible object storage instead of attached NVMe for durable disk, and moving orchestration logic inside the VM itself.
What You'll Learn
- Why eliminating container images enables instant VM creation in cloud platforms
- How to use S3-compatible object storage as the root of durable VM disk storage instead of attached NVMe
- How inside-out orchestration (running management services inside the VM) simplifies platform operations and reduces blast radius
- Why splitting storage into data chunks on object storage and metadata in local SQLite enables fast checkpoint and restore
- When to choose disposable VMs over traditional container-based deployments for development workflows
Prerequisites & Requirements
- Understanding of containers, Docker, and OCI images
- Familiarity with cloud infrastructure concepts (VMs, NVMe storage, object storage like S3)
- Basic understanding of Linux namespaces and containerization (optional)
- Experience with cloud deployment platforms (e.g., Fly.io, AWS EC2) (optional)
Key Questions Answered
- What are Fly.io Sprites and how do they differ from Fly Machines?
- Why does removing container images make VM creation instant?
- How does Fly.io use object storage for Sprite disk persistence?
- What is inside-out orchestration in Sprites?
- How do Sprite checkpoints work and why are they fast?
- Why is attached NVMe storage problematic for cloud VM orchestration?
- How do Sprites handle networking and service discovery?
- When should you use Sprites vs Fly Machines for application deployment?
Key Actionable Insights
1. Eliminate container images from ephemeral/disposable compute workflows to achieve instant creation times. When all instances run from a standard base image, physical workers can pre-pool empty instances, making creation as fast as starting an already-created VM rather than pulling and unpacking container layers. This is especially valuable for development environments, AI coding agents, and interactive workflows where creation latency directly impacts developer experience.
2. Use S3-compatible object storage as the root of durable VM storage rather than attached NVMe to decouple workloads from physical servers. This makes migration trivial (the durable state is just a URL), eliminates data loss risk from hardware failure, and enables fast checkpoint/restore by only shuffling metadata. The JuiceFS model of splitting storage into immutable data chunks on object storage and metadata in local SQLite (backed by Litestream) provides a practical architecture for this approach.
3. Move orchestration and management services inside the VM to reduce blast radius of platform changes. By running storage, service management, logging, and networking services in the VM's root namespace (with user code in an inner container), changes only affect new VMs picking up updates rather than restarting host-level components. This 'inside-out' architecture also enables bouncing user environments without rebooting the entire VM, since the inner container can be restarted independently from the root namespace services.
4. Use NVMe as a read-through cache layer rather than as primary storage to get the performance benefits of local disk without the operational burden of data durability. Cached chunks are immutable and their canonical state lives on object storage, so nothing on the NVMe volume matters for correctness. This architecture dramatically simplifies operations since local storage failures are non-events: the cache simply rebuilds from object storage on the next read.
5. Design checkpoint/restore as a first-class feature rather than a disaster recovery escape hatch by making it a metadata-only operation. When data chunks are immutable on object storage, checkpoints become as lightweight as saving a metadata snapshot, enabling routine use similar to git commits. This shifts the mental model from 'system restore' to 'git restore', encouraging users to checkpoint frequently as part of normal workflow rather than only in emergencies.
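The storage ideas in insights 2, 4, and 5 can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not how Sprites or JuiceFS are actually implemented: the class names `ObjectStore` and `Disk` are hypothetical, and plain in-memory dictionaries stand in for the S3-compatible bucket, the SQLite metadata, and the NVMe cache. It only shows why immutable chunks make checkpoints a metadata copy and make the local cache safe to lose.

```python
import hashlib


class ObjectStore:
    """Hypothetical stand-in for an S3-compatible bucket of immutable chunks."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0  # number of fetches, to make cache behaviour visible

    def put(self, data: bytes) -> str:
        # Chunks are keyed by content hash and never overwritten.
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        self.reads += 1
        return self._blobs[key]


class Disk:
    """Toy disk: metadata maps block numbers to chunk keys; data lives in the store."""

    def __init__(self, store: ObjectStore):
        self.store = store
        self.metadata = {}     # block number -> chunk key (assumed role of SQLite)
        self.cache = {}        # chunk key -> bytes (assumed role of the NVMe cache)
        self.checkpoints = {}  # checkpoint name -> snapshot of metadata

    def write(self, block: int, data: bytes) -> None:
        # A write uploads a new chunk and repoints the metadata entry.
        self.metadata[block] = self.store.put(data)

    def read(self, block: int) -> bytes:
        # Read-through: fill the cache from the object store on a miss.
        key = self.metadata[block]
        if key not in self.cache:
            self.cache[key] = self.store.get(key)
        return self.cache[key]

    def checkpoint(self, name: str) -> None:
        # Metadata-only: no chunk data is copied.
        self.checkpoints[name] = dict(self.metadata)

    def restore(self, name: str) -> None:
        self.metadata = dict(self.checkpoints[name])

    def drop_cache(self) -> None:
        # Simulates losing the local volume; correctness is unaffected.
        self.cache.clear()


store = ObjectStore()
disk = Disk(store)
disk.write(0, b"config v1")
disk.checkpoint("before-upgrade")
disk.write(0, b"config v2")
disk.restore("before-upgrade")
disk.drop_cache()
print(disk.read(0))  # the old chunk is still in the store, so it is readable
```

Because a write never mutates an existing chunk, the chunk referenced by the checkpoint is still present after the second write, and restoring only has to swap the metadata mapping back. Dropping the cache costs one extra fetch from the store on the next read and nothing else.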