The Peak-Performance-Percentage Analysis Method for Optimizing Any GPU Workload

This article describes a performance triage method for finding the main performance limiters of a given GPU workload, using NVIDIA-specific hardware metrics.

Louis Bavoil
48 min read · advanced

Overview

The article presents the Peak-Performance-Percentage Analysis Method, an NVIDIA triage approach that uses hardware metrics to identify what limits the performance of a GPU workload. It gives a structured way to read those metrics and to decide where throughput and efficiency can be improved.

What You'll Learn

1. How to capture a GPU frame using Nsight Graphics
2. Why analyzing SOL% metrics (Speed of Light percentage, the achieved throughput of a unit relative to its peak) is crucial for optimizing GPU workloads
3. How to identify performance limiters in GPU workloads
4. When to apply asynchronous compute for performance gains

Prerequisites & Requirements

  • Understanding of GPU architecture and performance metrics
  • Nsight Visual Studio Edition or Nsight Graphics

Key Questions Answered

What is the Peak-Performance-Percentage Analysis Method?
The Peak-Performance-Percentage Analysis Method is a performance triage technique that identifies the main performance limiters of any GPU workload by analyzing hardware metrics. It shows how well the GPU is utilized and which units limit performance, so developers know where to focus when optimizing a rendering application.
How can I capture a frame for performance analysis?
Launch Nsight Graphics, create a project, and generate a C++ capture by specifying the application executable path. Once the frame is captured, analyze it with the Range Profiler to identify performance bottlenecks.
What should I do if the Top SOL% is low?
A low Top SOL% indicates that the GPU units are under-utilized or running inefficiently. Check the Graphics/Compute Idle% metric for signs of a CPU limitation and the SM Active% metric for idleness or inefficiency in the workload, then optimize accordingly.
What are the common performance limiters in GPU workloads?
Common limiters include under-utilized key GPU units (low SOL% values), inefficient memory access patterns, and high idle times in the GPU pipeline. Analyzing these metrics points to concrete optimizations, such as reducing the workload sent to the bottlenecked unit.
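The triage described in the answers above can be sketched as a small decision function. This is a minimal illustration, not the article's tooling: it assumes that Top SOL% means the highest per-unit SOL% value, and the function names, the 80/60 cut-offs, and the sample values are all hypothetical placeholders. In practice the SOL% values would be read from the Range Profiler.

```python
from typing import Dict, Tuple

# Hypothetical cut-offs, chosen only for illustration.
HIGH_SOL = 80.0
LOW_SOL = 60.0


def top_sol(unit_sol: Dict[str, float]) -> Tuple[str, float]:
    """Return the unit with the highest SOL% and that value."""
    if not unit_sol:
        raise ValueError("no SOL% values supplied")
    unit = max(unit_sol, key=unit_sol.get)
    return unit, unit_sol[unit]


def triage(unit_sol: Dict[str, float],
           high: float = HIGH_SOL,
           low: float = LOW_SOL) -> str:
    """Suggest a next step from per-unit SOL% values (assumed 0-100)."""
    unit, value = top_sol(unit_sol)
    if value >= high:
        # The top unit is close to its peak: it is the limiter.
        return f"{unit} limited ({value:.1f}%): reduce the work sent to {unit}"
    if value < low:
        # No unit is busy: look for idleness before tuning any unit.
        return (f"low Top SOL% ({value:.1f}% on {unit}): "
                "check Graphics/Compute Idle% and SM Active%")
    # In between: both kinds of investigation may pay off.
    return (f"{unit} at {value:.1f}%: reduce work on {unit} "
            "and look for inefficiencies")


if __name__ == "__main__":
    # Made-up sample values, not measurements from the article.
    print(triage({"TEX": 94.5, "L2": 37.3, "SM": 71.2}))
    print(triage({"SM": 40.0, "L2": 35.0}))
```

Keeping the cut-offs as parameters makes it easy to substitute whatever thresholds a team settles on after reading the full article.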

Key Statistics & Figures

SM Active%
99.3%
Indicates no SM idleness issue for the TEX-Interface Limited Workload.
TEX hit rate
88.9%
Shows that most texture fetches hit in the texture cache for the TEX-Interface Limited Workload.
L2 hit rate
87.3%
Shows that most requests reaching the Level-2 cache hit for the TEX-Interface Limited Workload.

Technologies & Tools

Tool
Nsight Graphics
Used for capturing and analyzing GPU frame data.
Library
PerfWorks
Provides hardware metrics for performance analysis.

Key Actionable Insights

1. Regularly capture and analyze GPU frame data using Nsight Graphics to identify performance bottlenecks early in the development process.
   This proactive approach allows developers to make informed decisions on optimizations before performance issues become critical, ensuring smoother gameplay experiences.
2. Focus on improving the achieved throughput of underperforming GPU units by optimizing shader code and reducing unnecessary computations.
   By targeting specific units with low SOL% values, developers can significantly enhance overall GPU performance and reduce frame times.
3. Utilize the PerfWorks library to gather detailed metrics on GPU workloads, enabling a deeper understanding of performance characteristics.
   This data-driven approach allows for precise adjustments and optimizations based on actual performance metrics rather than assumptions.

Common Pitfalls

1. Failing to capture frames before and after each optimization attempt leaves no reproducible data for performance analysis.
   Without this data, it is hard to assess whether an optimization worked and how the performance metrics changed.
2. Assuming that high utilization of one GPU unit guarantees good overall performance, without considering the interactions between units.
   Performance is often limited by the weakest link in the chain, so effective optimization requires a holistic view of all units.
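To make the first pitfall concrete, here is a minimal sketch of comparing metrics recorded from a capture taken before an optimization with one taken after it. It assumes the values were copied by hand from the profiler into plain dictionaries; the function name, metric names, and numbers are hypothetical and do not come from Nsight Graphics.

```python
from typing import Dict, Tuple


def compare_captures(before: Dict[str, float],
                     after: Dict[str, float]) -> Dict[str, Tuple[float, float, float]]:
    """Map each metric present in both captures to (before, after, delta)."""
    shared = sorted(set(before) & set(after))
    return {
        name: (before[name], after[name], round(after[name] - before[name], 3))
        for name in shared
    }


if __name__ == "__main__":
    # Made-up values for illustration only.
    before = {"gpu_time_ms": 10.0, "TEX SOL%": 94.5, "SM SOL%": 71.0}
    after = {"gpu_time_ms": 8.5, "TEX SOL%": 80.0, "SM SOL%": 78.0}
    for name, (b, a, delta) in compare_captures(before, after).items():
        print(f"{name}: {b} -> {a} ({delta:+})")
```

Metrics that appear in only one capture are skipped, so a renamed or missing counter does not produce a misleading delta.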

Related Concepts

GPU Architecture and Performance Metrics
Asynchronous Compute Techniques
Shader Optimization Strategies