Accelerating NVIDIA HPC Software with SVE on AWS Graviton3

The NVIDIA HPC SDK 22.7 now supports AWS Graviton3, with auto-vectorization for the Scalable Vector Extension (SVE) to the Arm architecture.

John Linford

Overview

The article covers the latest update to the NVIDIA HPC SDK, focusing on its support for the Arm-based AWS Graviton3 processor and on auto-vectorization for the Scalable Vector Extension (SVE). It highlights the performance gains for HPC applications compiled with the NVIDIA HPC compilers and the Graviton3 hardware features that make those gains possible.

What You'll Learn

1

How to enable SVE auto-vectorization with NVIDIA compilers for HPC applications

2

Why using NVIDIA HPC compilers can improve performance on AWS Graviton3

3

When to utilize the -tp architecture flag for optimizing performance

Prerequisites & Requirements

  • Understanding of HPC applications and vectorization concepts
  • Familiarity with NVIDIA HPC SDK and compilers (optional)

Key Questions Answered

What performance improvements can be expected with NVIDIA HPC compilers on AWS Graviton3?
Applications compiled with the NVIDIA HPC compilers on AWS Graviton3 can run up to 17% faster than the same applications compiled with GCC 12.1, based on SPEC CPU® 2017 benchmark scores. For instance, the 64-thread FPSpeed result shows a speedup of 1.17 for the NVIDIA HPC compilers relative to GCC 12.1.
How does SVE enhance performance for HPC applications?
SVE, the Scalable Vector Extension, does not fix the vector length in the instruction set. Each hardware implementation chooses a vector length between 128 bits and 2,048 bits, and the same code can run across those implementations. SVE also provides operations such as gather-load and scatter-store, which vectorize indirect memory accesses that are common in HPC and ML applications.
What are the key features of AWS Graviton3 processors?
AWS Graviton3 processors use DDR5 memory, which provides 50% higher memory bandwidth than DDR4, and support SVE for enhanced vectorization. They deliver up to 25% better performance than Graviton2 on compute-intensive workloads, and ANSYS benchmarked a 35% improvement over Graviton2.
How can developers get started with the NVIDIA HPC SDK?
Developers can get started with the NVIDIA HPC SDK by downloading the software from the NVIDIA website. The SDK provides a comprehensive software stack for creating and optimizing HPC applications on platforms like AWS Graviton3.

Key Statistics & Figures

Performance improvement on SPEC CPU® 2017
17%
This improvement is observed when using NVIDIA HPC compilers compared to GCC 12.1.
Speedup for 64 Thread FPSpeed
1.17
This indicates the performance advantage of using NVIDIA HPC compilers over GCC 12.1.
Performance increase compared to Graviton2
25%
AWS Graviton3 provides this improvement for compute-intensive workloads.
Performance improvement benchmarked by ANSYS
35%
This performance increase is noted when comparing AWS Graviton3 to its predecessor, Graviton2.

Technologies & Tools

Software
NVIDIA HPC SDK
Used for developing and optimizing HPC applications on various architectures.
Hardware
AWS Graviton3
Arm-based CPU designed for high-performance computing and optimized for cloud workloads.
Architecture
SVE
Scalable Vector Extension that enhances vectorization capabilities for HPC applications.

Key Actionable Insights

1
Utilizing the NVIDIA HPC compilers can significantly enhance the performance of HPC applications on AWS Graviton3.
By leveraging the auto-vectorization capabilities of the compilers, developers can achieve better optimization and take full advantage of the Graviton3's architecture.
2
Using SVE in applications can lead to substantial performance gains, especially for data-intensive tasks.
SVE's ability to handle flexible vector lengths allows for more efficient data processing, making it ideal for high-performance computing and machine learning applications.
3
Understanding the architecture flag -tp is crucial for optimizing application performance on different CPU architectures.
Specifying the correct architecture flag ensures that the compiler generates optimized code tailored to the specific capabilities of the target CPU, maximizing performance.
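
As a hedged illustration of the -tp insight above, the sketch below shows a small, vectorizable C kernel with possible compile lines in its header comment. The file name daxpy.c and the function name are hypothetical. The target names neoverse-v1 (the core in Graviton3) and neoverse-n1 (the core in Graviton2), along with the -Minfo=vect report option, are assumptions about the installed HPC SDK release and should be checked against the compiler's own help output or documentation.

```c
/* daxpy.c: hypothetical file name, used for illustration only.
 *
 * Possible compile lines (flag values are assumptions; verify locally):
 *   nvc -O3 -tp=neoverse-v1 -Minfo=vect -c daxpy.c    target Graviton3 (SVE)
 *   nvc -O3 -tp=neoverse-n1 -Minfo=vect -c daxpy.c    target Graviton2 (no SVE)
 */
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * 'restrict' tells the compiler that x and y do not overlap, which removes
 * one common obstacle to auto-vectorization. */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

If the vectorization report is available, comparing its output for the two targets is one way to confirm that the chosen -tp value changes the generated code.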

Common Pitfalls

1
Failing to specify the correct -tp architecture flag can lead to suboptimal performance.
Without the correct flag, the compiler may not generate the most efficient code for the specific CPU architecture, resulting in slower application performance.
2
Neglecting to utilize SVE auto-vectorization may limit the performance of HPC applications.
Not leveraging the advanced features of SVE can prevent applications from achieving their full potential in terms of speed and efficiency.
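
To make the auto-vectorization pitfall more concrete, here is a minimal sketch of an indirect-access loop, the kind of pattern that SVE gather-load instructions are designed for. The function and parameter names are hypothetical, and whether a given compiler actually vectorizes this loop depends on its version and flags, so treat it as a candidate loop rather than a guaranteed result.

```c
#include <stddef.h>

/* Writes out[i] = scale * table[idx[i]] for i in [0, n).
 * The indexed read table[idx[i]] is a gather pattern. The loop is written
 * without any fixed vector width, so the same source can be compiled for
 * any SVE vector length. The caller must ensure every idx[i] is a valid,
 * non-negative index into table. */
void gather_scale(size_t n, double scale,
                  const double *restrict table,
                  const int *restrict idx,
                  double *restrict out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = scale * table[idx[i]];
    }
}
```

Each iteration is independent and the pointers are marked restrict, so the compiler does not have to reason about aliasing or reduction ordering before deciding to vectorize.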

Related Concepts

High-Performance Computing (HPC)
Scalable Vector Extension (SVE)
AWS Graviton Processors
NVIDIA Compilers and Libraries