Voldemort on Solid State Drives

Vinoth Chandar

Overview

The article discusses the migration of Project Voldemort's storage from SAS disks to Solid State Drives (SSDs) at LinkedIn. It covers the performance gains and the problems that followed, in particular Java garbage collection pauses and heap fragmentation. It also describes the adjustments made to keep latency under control: locking the JVM heap in memory and keeping cleanup jobs from caching the objects they scan.

What You'll Learn

1. How to mitigate Java garbage collection issues when migrating to SSDs
2. Why a migration to SSDs can increase latency in Java applications
3. How to address heap fragmentation in high-speed SSD environments

Prerequisites & Requirements

  • Understanding of Java Garbage Collection mechanisms
  • Familiarity with SSD and SAS disk technologies

Key Questions Answered

What causes increased latency after migrating to SSDs?
The increase came primarily from Java garbage collection. The JVM heap was not locked in memory, so Linux paged out heap pages, and minor collections then stalled for long periods while those pages were retrieved.
How does fragmentation affect performance in SSD environments?
The fragmentation in question is in the Java heap, not on the SSD. High-speed SSDs combined with multi-tenancy generate garbage rapidly during data cleanup jobs, which overloads the Concurrent Mark Sweep (CMS) collector. The old generation fragments, and promotion failures follow.
What JVM flags can help improve performance on SSDs?
The AlwaysPreTouch flag (-XX:+AlwaysPreTouch) makes the JVM touch every heap page at startup, so the virtual address space of the heap is fully mapped to physical memory up front. According to the article, this reduces garbage collection pauses in shorter lab experiments.
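
As a small illustration of the flag discussed above, the following sketch checks whether the running JVM was started with -XX:+AlwaysPreTouch. The class name JvmFlagCheck and its isEnabled helper are hypothetical and not part of Voldemort. The sketch only inspects the startup arguments reported by the standard RuntimeMXBean, so it is a sanity check rather than proof that the heap is resident.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper (not from the article): inspects JVM startup arguments.
final class JvmFlagCheck {

    /**
     * Returns true if the last occurrence of the boolean -XX flag in the
     * argument list enables it ("-XX:+Name"), false if it is absent or
     * the last occurrence disables it ("-XX:-Name").
     */
    static boolean isEnabled(List<String> jvmArgs, String flagName) {
        boolean enabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+" + flagName)) {
                enabled = true;
            } else if (arg.equals("-XX:-" + flagName)) {
                enabled = false;
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, as reported by the runtime.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("AlwaysPreTouch enabled: " + isEnabled(jvmArgs, "AlwaysPreTouch"));
    }
}
```

A service could log this result at startup so that lab runs and production runs can be compared on equal footing.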

Key Statistics & Figures

95th percentile latency: 4 seconds
Observed after the migration to SSDs, during pauses in minor collections caused by page scans over the JVM heap.

Technologies & Tools

Backend
Voldemort
An open-source distributed key-value store modeled on Amazon's Dynamo, used to handle high traffic at LinkedIn.
Database
Berkeley DB JE
Berkeley DB Java Edition, used as the storage engine for Voldemort and involved in the data cleanup jobs discussed here.

Key Actionable Insights

1. Locking the JVM heap in memory with mlock() prevents Linux from paging out JVM pages, which significantly reduces latency during garbage collection. This matters most in environments with high memory usage, because it removes the overhead of page scans and improves overall responsiveness.
2. Building a distributed workload tool to simulate traffic helps identify performance bottlenecks related to fragmentation and garbage collection. Testing under production-like conditions shows how the system actually behaves and leads to more effective optimizations.
3. Forcing the storage engine not to cache objects during cleanup jobs drastically reduces promotion rates and improves performance. This is particularly important in high-speed SSD environments, where rapid promotions can overwhelm the garbage collector.
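
The third insight can be illustrated with a toy model. The sketch below is not Berkeley DB JE's API and not the article's code. ToyStore, its in-memory "disk" map, and the useCache parameter are all hypothetical names chosen for illustration. It shows a bulk cleanup scan that reads records without placing them in the cache, so the scanned objects stay short-lived and the hot entries stay cached.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical toy store: a "disk" map fronted by a small LRU cache. */
final class ToyStore {
    private final Map<Integer, String> disk = new HashMap<>();
    private final Map<Integer, String> cache;

    ToyStore(final int cacheCapacity) {
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    void put(int key, String value) {
        disk.put(key, value);
    }

    /** Reads a value; with useCache == false the cache is neither filled nor reordered. */
    String get(int key, boolean useCache) {
        if (!useCache) {
            return disk.get(key);
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        String value = disk.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    boolean isCached(int key) {
        return cache.containsKey(key);
    }

    /** Cleanup scan: visits every record while bypassing the cache; returns records visited. */
    int cleanupScan() {
        int visited = 0;
        for (Integer key : disk.keySet()) {
            get(key, false);
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore(3);
        for (int i = 0; i < 100; i++) {
            store.put(i, "v" + i);
        }
        store.get(1, true); // a "hot" key read by normal traffic
        store.cleanupScan();
        System.out.println("hot key still cached after scan: " + store.isCached(1));
    }
}
```

In a real storage engine the same idea would be expressed through that engine's own cache controls. The point of the toy is only that a full scan which caches what it reads turns every record into a longer-lived object and displaces the working set.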

Common Pitfalls

1. Failing to lock the JVM heap in memory can lead to significant performance penalties due to excessive paging. Linux swaps out JVM pages, and garbage collection then stalls while those pages are retrieved.
2. Not addressing heap fragmentation can result in promotion failures and increased garbage collection overhead. High-speed SSDs exacerbate fragmentation, and performance degrades if it is not managed.
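
Both pitfalls show up as rising garbage collection cost, so it helps to watch that cost while a workload runs. The following sketch is one lightweight way to do so with the standard java.lang.management API. It is not taken from the article, and the GcSnapshot class is a hypothetical name. The times reported by the JVM are approximate and accumulated per collector, so this complements GC logs rather than replacing them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: totals of GC activity across all collectors at one instant. */
final class GcSnapshot {
    final long collections; // total collections so far
    final long millis;      // approximate accumulated collection time in ms

    private GcSnapshot(long collections, long millis) {
        this.collections = collections;
        this.millis = millis;
    }

    static GcSnapshot take() {
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            time += Math.max(0, gc.getCollectionTime());
        }
        return new GcSnapshot(count, time);
    }

    /** Average milliseconds per collection between two snapshots, or 0 if none ran. */
    static double avgMillisPerCollection(GcSnapshot before, GcSnapshot after) {
        long n = after.collections - before.collections;
        return n == 0 ? 0.0 : (double) (after.millis - before.millis) / n;
    }

    public static void main(String[] args) {
        GcSnapshot before = take();
        byte[][] junk = new byte[64][];
        for (int i = 0; i < 20_000; i++) {
            junk[i % junk.length] = new byte[64 * 1024]; // churn short-lived garbage
        }
        GcSnapshot after = take();
        System.out.println("collections: " + (after.collections - before.collections));
        System.out.println("avg ms per collection: " + avgMillisPerCollection(before, after));
    }
}
```

Sampling snapshots before and after a cleanup job, or at fixed intervals under simulated traffic, gives a rough signal of when collection cost starts to climb.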