Benchmarking High Performance I/O with SSD for Cassandra on AWS

Netflix Technology Blog
10 min read | Intermediate

Overview

This article benchmarks high-performance SSD I/O for Apache Cassandra on AWS, focusing on the new hi1.4xlarge instance type. It compares performance metrics and cost against existing instance types and lays out the benefits of moving Cassandra workloads to SSD-based storage.

What You'll Learn

1. How to benchmark high-performance I/O for Cassandra on AWS
2. Why SSD-based instances improve performance for I/O-intensive applications
3. When to consider transitioning from traditional storage to SSD for Cassandra workloads

Prerequisites & Requirements

  • Understanding of Apache Cassandra and its architecture
  • Familiarity with AWS EC2 instance types and configurations (optional)

Key Questions Answered

What are the performance benefits of using the hi1.4xlarge SSD instance for Cassandra?
The hi1.4xlarge SSD instance provides around 100,000 very low latency IOPS and a gigabyte per second of throughput, significantly outperforming previous instance types. That is hundreds of times the throughput of other storage options, and because the SSD is local to the instance, latency and its variance stay extremely low.
How does the cost of running Cassandra on SSD compare to traditional instances?
For the same throughput, running Cassandra on the hi1.4xlarge SSD instance costs about half as much as the existing m2.4xlarge-based system. The SSD configuration also cuts mean read request latency from 10ms to 2.2ms and 99th percentile latency from 65ms to 10ms.
What benchmarks were performed to evaluate the SSD instance's performance?
Two benchmarks were run. A filesystem-level test with iozone achieved over 100,000 IOPS and 1 GByte/s of throughput, and a standard Cassandra stress test reached close to a gigabyte per second of throughput while loading data into memory.
What configurations were compared in the Netflix application benchmark?
The benchmark compared the existing system (48 Cassandra instances on m2.4xlarge plus 36 EVCache instances on m2.xlarge) against a new configuration of 12 Cassandra instances on hi1.4xlarge. The SSD-based system sustained similar throughput at lower latency.
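
The summary names iozone for the filesystem-level test but does not give its command line. To illustrate what an IOPS or throughput figure actually measures, here is a toy random-read timer in Python. The function name `measure_random_reads`, the 4 KiB block size, and the read count are assumptions made for illustration, not settings from the original benchmark. Reads served from the OS page cache will look far faster than the device itself, so treat this as a sketch of the idea rather than a substitute for iozone. It relies on `os.pread`, which is available on Unix-like systems.

```python
import os
import random
import tempfile
import time


def measure_random_reads(path, block_size=4096, reads=2000):
    """Time `reads` random block-aligned reads from `path`.

    Returns a dict with operations per second (iops), megabytes per
    second (mb_per_s), and the total number of bytes read.
    """
    blocks = os.path.getsize(path) // block_size
    if blocks == 0:
        raise ValueError("file is smaller than one block")

    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        start = time.perf_counter()
        for _ in range(reads):
            # Pick a random block so reads are not sequential.
            offset = random.randrange(blocks) * block_size
            total += len(os.pread(fd, block_size, offset))
        # Guard against a zero interval on very coarse clocks.
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)

    return {
        "iops": reads / elapsed,
        "mb_per_s": total / elapsed / 1e6,
        "bytes_read": total,
    }


if __name__ == "__main__":
    # Hypothetical demo: a 4 MiB scratch file, almost certainly cached in RAM.
    with tempfile.NamedTemporaryFile(delete=False) as scratch:
        scratch.write(os.urandom(4 * 1024 * 1024))
    try:
        print(measure_random_reads(scratch.name))
    finally:
        os.unlink(scratch.name)
```

Note that IOPS and throughput are linked by the block size: throughput equals IOPS multiplied by bytes per operation, so the two headline figures need not come from the same test run.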

Key Statistics & Figures

Throughput of hi1.4xlarge SSD instance
1 GByte/s
Achieved during filesystem level performance testing with iozone.
IOPS capability of hi1.4xlarge SSD instance
100,000
Measured at the filesystem level with iozone, which reported over 100,000 IOPS.
Mean read request latency reduction
from 10ms to 2.2ms
Measured in the Netflix application benchmark, comparing the existing m2.4xlarge-based system with the hi1.4xlarge configuration.
99th percentile request latency reduction
from 65ms to 10ms
The drop in tail latency indicates more consistent response times on the SSD instances.
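
To put the figures above in proportion, the following short Python sketch works out the speedup factors and the change in instance count. All inputs are numbers quoted in this summary; the helper name `improvement` is just an illustrative choice.

```python
from typing import Tuple

# Figures quoted in the summary (milliseconds).
MEAN_BEFORE_MS, MEAN_AFTER_MS = 10.0, 2.2
P99_BEFORE_MS, P99_AFTER_MS = 65.0, 10.0

# Instance counts quoted in the summary.
OLD_INSTANCES = 48 + 36  # Cassandra on m2.4xlarge + EVCache on m2.xlarge
NEW_INSTANCES = 12       # Cassandra on hi1.4xlarge


def improvement(before: float, after: float) -> Tuple[float, float]:
    """Return (speedup factor, percent reduction) for a latency change."""
    if before <= 0 or after <= 0:
        raise ValueError("latencies must be positive")
    return before / after, (before - after) / before * 100.0


mean_speedup, mean_cut = improvement(MEAN_BEFORE_MS, MEAN_AFTER_MS)
p99_speedup, p99_cut = improvement(P99_BEFORE_MS, P99_AFTER_MS)
instance_ratio = OLD_INSTANCES / NEW_INSTANCES

print(f"mean latency: {mean_speedup:.1f}x lower ({mean_cut:.0f}% reduction)")
print(f"p99 latency:  {p99_speedup:.1f}x lower ({p99_cut:.0f}% reduction)")
print(f"instances:    {OLD_INSTANCES} -> {NEW_INSTANCES} ({instance_ratio:.0f}x fewer)")
```

In other words, mean latency falls by roughly 4.5x, 99th percentile latency by 6.5x, and the instance count drops from 84 to 12.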

Technologies & Tools

Database
Apache Cassandra
Used for managing large-scale data storage and retrieval in the benchmarks.
Cloud Service
AWS EC2
Provides the infrastructure for running Cassandra instances and benchmarking performance.

Key Actionable Insights

1. Transitioning to SSD-based instances can drastically improve performance for I/O-intensive applications like Cassandra.
With the new hi1.4xlarge instance, organizations can achieve significantly higher throughput and lower latency, making it a compelling choice for high-performance workloads.
2. Carefully scheduling maintenance operations such as compactions can prevent I/O overload.
By running these operations sequentially across nodes, teams can keep their Cassandra clusters responsive and efficient, especially when using high-performance SSD instances.
3. Benchmarking is crucial before migrating workloads to new instance types.
Thorough performance tests let teams validate the expected benefits of a new hardware configuration and make informed decisions about resource allocation.
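
As a rough illustration of validating a migration with benchmark data, the sketch below summarizes two sets of request latencies by mean and 99th percentile and reports how much faster the candidate configuration is. The function names and the nearest-rank percentile method are assumptions for this example; the original article does not say how its percentiles were computed.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Reduce raw latency samples to the two figures the summary quotes."""
    return {"mean_ms": mean(latencies_ms), "p99_ms": percentile(latencies_ms, 99)}


def compare(baseline_ms, candidate_ms):
    """Return baseline/candidate ratios; values above 1 mean the
    candidate configuration is faster on that metric."""
    base, cand = summarize(baseline_ms), summarize(candidate_ms)
    return {key: base[key] / cand[key] for key in base}
```

Feeding `compare` the latency samples collected from the old and new clusters under the same load gives a like-for-like ratio for both the typical and the tail case, which is the comparison the summary's latency figures express.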

Common Pitfalls

1. Overloading I/O during maintenance operations can lead to performance degradation.
This often happens when compactions and repairs are scheduled simultaneously across nodes, which can overwhelm the I/O capacity of the instances.
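
One way to avoid simultaneous maintenance is to walk the cluster one node at a time. The Python sketch below is a minimal illustration under stated assumptions: it shells out to Cassandra's `nodetool` with the `compact` command, the host names are hypothetical, and nothing here reflects how Netflix actually schedules its maintenance.

```python
import subprocess


def nodetool_compact(host):
    """Run a compaction on one node and block until nodetool returns.

    Assumes `nodetool` is on PATH and can reach the node over JMX.
    """
    subprocess.run(["nodetool", "-h", host, "compact"], check=True)


def run_sequentially(hosts, task=nodetool_compact):
    """Apply `task` to each host in order, never two at once.

    Any exception from `task` stops the rollout, so the remaining
    nodes are left untouched. Returns the hosts that completed.
    """
    completed = []
    for host in hosts:
        task(host)
        completed.append(host)
    return completed


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    run_sequentially(["cass-node-1", "cass-node-2", "cass-node-3"])
```

Passing the per-node action in as `task` keeps the ordering logic separate from the command itself, so the same loop could drive a repair or be exercised with a stub in tests.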

Related Concepts

Benchmarking Performance of Cloud-Based Databases
Cost Efficiency in Cloud Resource Management
High-Performance I/O Configurations