Seeing through hardware counters: a journey to threefold performance increase

Netflix Technology Blog
11 min read | Advanced

Overview

This article describes how migrating a Java microservice at Netflix to a larger AWS instance unexpectedly produced suboptimal performance. It details the investigation of CPU microarchitecture behavior using Performance Monitoring Counters (PMCs) and the identification and resolution of the bottlenecks, which led to a threefold increase in throughput.

What You'll Learn

1. How to use Performance Monitoring Counters to diagnose performance issues
2. Why false sharing can significantly impact application performance
3. How to implement patches to optimize JVM performance

Prerequisites & Requirements

  • Understanding of Java microservices and JVM internals
  • Familiarity with Intel VTune and Performance Monitoring Counters (optional)

Key Questions Answered

What caused the performance degradation after migrating to a larger AWS instance?
The performance degradation was attributed to false sharing, where unrelated variables accessed by different cores shared the same cache line, leading to increased CPU stalls and latency. This was identified through the analysis of Performance Monitoring Counters and CPU profiling.
How did Netflix achieve a threefold increase in throughput?
Netflix achieved a threefold increase in throughput by identifying and resolving issues related to false sharing in the JVM. By patching the JDK to insert padding between variables, they eliminated the performance bottleneck, resulting in improved CPU utilization and reduced latency.
What is the difference between false sharing and true sharing?
False sharing occurs when independent variables share a cache line, causing unnecessary CPU stalls due to cache coherency protocols. True sharing, on the other hand, happens when multiple threads access the same variable, leading to contention and performance degradation due to CPU-enforced memory ordering.
What tools did Netflix use to analyze CPU performance?
Netflix used Intel VTune and Performance Monitoring Counters (PMCs) to analyze CPU performance. These tools provided insights into CPU utilization, cache activity, and instruction cycles, helping to identify the root causes of performance issues.
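
As a rough illustration of how raw counter readings become a comparable metric, the sketch below derives cycles per instruction (CPI) and its inverse (IPC) from two counter deltas. The class name, method names, and sample values are hypothetical and are not taken from the article or from any profiler's API; a real tool would supply the cycle and instruction counts.

```java
final class CounterMetrics {

    /**
     * Cycles per instruction over a sampling interval.
     * Both arguments are counter deltas (end reading minus start reading).
     */
    static double cpi(long cycles, long instructions) {
        if (cycles < 0 || instructions <= 0) {
            throw new IllegalArgumentException("counter deltas must be positive");
        }
        return (double) cycles / instructions;
    }

    /** Instructions per cycle, the inverse of CPI. */
    static double ipc(long cycles, long instructions) {
        return 1.0 / cpi(cycles, instructions);
    }

    public static void main(String[] args) {
        // Made-up deltas for illustration only, not measurements from the article.
        long cycles = 4_000_000L;
        long instructions = 2_000_000L;
        System.out.println("CPI = " + cpi(cycles, instructions));
        System.out.println("IPC = " + ipc(cycles, instructions));
    }
}
```

A CPI that rises after a migration while the workload stays the same suggests the cores are spending more cycles stalled per unit of work, which is the kind of signal that points toward cache or memory effects rather than application logic.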

Key Statistics & Figures

Throughput improvement
3.5x
This improvement was achieved after addressing false sharing and optimizing the JVM.
Average CPU utilization target
55%
The target was set during the autoscaling process to optimize resource usage.
Latency degradation
more than 50%
Latency increased significantly during the initial migration to the larger instance.

Technologies & Tools

Backend
Java
Used to develop the GS2 microservice.
Cloud Infrastructure
AWS EC2
Used to host the microservice and facilitate scaling.
Profiling Tool
Intel VTune
Used for microarchitecture profiling and performance analysis.

Key Actionable Insights

1. Use Performance Monitoring Counters to gain deep insight into CPU performance and identify bottlenecks.
   This approach lets engineers pinpoint issues at the microarchitecture level, which can lead to significant performance improvements.
2. Insert padding between variables in shared memory to avoid false sharing and improve throughput.
   This technique can drastically reduce CPU stalls, especially in multi-threaded environments.
3. Regularly profile JVM applications with tools such as Intel VTune to catch performance issues early.
   Proactive profiling helps maintain performance as workloads and infrastructure change.
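
The padding idea in the second insight can be sketched at the application level as follows. This is not the JDK patch described in the article, which changed the JVM itself; it is a minimal analog in plain Java that spaces per-thread counters one cache line apart inside an array. The 64-byte line size, the class name `PaddedCounters`, and the `run` helper are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class PaddedCounters {
    // Assumption: 64-byte cache lines, so 8 longs (8 bytes each) span one line.
    static final int STRIDE = 8;

    private final AtomicLongArray slots;

    PaddedCounters(int counters) {
        // One unused stride in front of the first counter, and each counter
        // followed by 7 unused longs, so no two counters share a line.
        slots = new AtomicLongArray((counters + 1) * STRIDE);
    }

    private static int index(int counter) {
        return (counter + 1) * STRIDE;
    }

    void increment(int counter) {
        slots.incrementAndGet(index(counter));
    }

    long get(int counter) {
        return slots.get(index(counter));
    }

    /** Starts one thread per counter; each thread touches only its own counter. */
    static PaddedCounters run(int threads, int iterations) {
        PaddedCounters counters = new PaddedCounters(threads);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    counters.increment(id);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counters;
    }

    public static void main(String[] args) {
        PaddedCounters result = run(2, 1_000_000);
        System.out.println(result.get(0) + " " + result.get(1));
    }
}
```

Setting `STRIDE` to 1 packs the counters next to each other, which gives a simple way to compare the two layouts under a profiler. The right stride depends on the hardware, since some CPUs use cache lines larger than 64 bytes, so the value here should be treated as a starting point rather than a constant.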

Common Pitfalls

1. Overlooking the impact of cache coherence on multi-threaded performance can lead to significant bottlenecks.
   Many developers may not consider how shared memory access patterns affect performance, leading to issues like false sharing that can degrade application responsiveness.

Related Concepts

Performance Monitoring Counters
Java Virtual Machine (JVM) Optimization
Microservices Architecture
Multi-threading Performance Issues