Using Fortran Standard Parallel Programming for GPU Acceleration

Miko Stulajter

We present lessons learned from refactoring a Fortran application to use modern do concurrent loops in place of OpenACC for GPU acceleration.

NVIDIA


Fortran

Overview

The article discusses the use of Fortran's standard parallel programming features, particularly the 'do concurrent' construct, for GPU acceleration. It highlights the benefits of using standard language features for future-proofing, portability, and ease of code maintenance while providing insights into performance comparisons with existing OpenACC directives.

What You'll Learn

1. How to use 'do concurrent' in Fortran for GPU programming
2. Why refactoring existing code to standard parallelism can improve maintainability
3. When to consider using managed memory in GPU applications

Prerequisites & Requirements

Familiarity with Fortran programming and parallel computing concepts
Access to the NVIDIA HPC SDK for compiling code (optional)

Key Questions Answered

How does 'do concurrent' improve code portability and maintainability?

'do concurrent' is a standard Fortran feature that enhances code portability by reducing reliance on vendor-specific directives. This makes the code easier to maintain and future-proof, as standard features are less likely to become obsolete compared to proprietary directives.
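To make the contrast concrete, here is a minimal sketch of the same loop written once with an OpenACC directive and once with 'do concurrent'. The program name, array names, and sizes are illustrative assumptions and are not taken from POT3D. With a compiler that ignores OpenACC, the directive line is treated as a comment and both loops still run.

```fortran
program dc_vs_openacc
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 1000            ! illustrative size (assumption)
  real(real64), parameter :: a = 2.0_real64
  real(real64) :: x(n), y_acc(n), y_dc(n)
  integer :: i

  x = 1.0_real64
  y_acc = 3.0_real64
  y_dc = 3.0_real64

  ! Directive-based version: parallelism is requested through an OpenACC
  ! directive, which is only honored when OpenACC is enabled.
  !$acc parallel loop
  do i = 1, n
    y_acc(i) = a*x(i) + y_acc(i)
  end do

  ! Standard version: 'do concurrent' states in the language itself that
  ! the iterations are independent and may run in any order.
  do concurrent (i = 1:n)
    y_dc(i) = a*x(i) + y_dc(i)
  end do

  ! Both forms must produce the same result.
  if (any(y_acc /= y_dc)) error stop "results differ"
  print *, "y_dc(1) =", y_dc(1)
end program dc_vs_openacc
```

With the NVIDIA HPC SDK, 'do concurrent' loops are typically offloaded with `nvfortran -stdpar=gpu`, and OpenACC directives are enabled with `-acc`; check the exact options against your SDK version.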

What performance differences exist between 'do concurrent' and OpenACC?

The article shows that while the 'do concurrent' version of the POT3D code has a small performance hit of approximately 10% compared to the original OpenACC code, it significantly reduces the number of directives and lines of code, making it easier to manage and maintain.

When is it necessary to use directives with 'do concurrent'?

Directives are still necessary for features such as device selection and certain atomic operations, because the nvfortran compiler currently lacks support for these in 'do concurrent'. Developers can also keep explicit data-movement directives where they want to optimize transfers, while still benefiting from the simplicity of 'do concurrent' for the loops themselves.
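As a rough illustration of the device-selection case, the sketch below keeps a single OpenACC directive next to a 'do concurrent' loop. The variable `local_rank` is hypothetical: in a multi-GPU MPI code it would normally hold the rank within the node, and here it is hard-coded to 0 so the example stays self-contained. The atomic case is not shown.

```fortran
program select_device_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 1000   ! illustrative size (assumption)
  real(real64) :: v(n)
  integer :: local_rank, i

  ! Hypothetical stand-in for the node-local MPI rank.
  local_rank = 0

  ! The one remaining directive: choose which GPU this process uses.
  ! Compilers without OpenACC enabled treat this line as a comment.
  !$acc set device_num(local_rank)

  ! The compute loop itself needs no directive.
  do concurrent (i = 1:n)
    v(i) = real(i, real64)**2
  end do

  if (v(n) /= real(n, real64)**2) error stop "unexpected result"
  print *, "v(n) =", v(n)
end program select_device_sketch
```

The point of the design is that the directive count stays small: only the feature the standard cannot express is left to OpenACC.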

Key Statistics & Figures

Performance comparison

10%

The 'do concurrent' version showed a performance hit of approximately 10% compared to the original OpenACC code.

Number of directives in POT3D (Original)

80

The original POT3D code contained 80 OpenACC directives, which were reduced to just 3 in the 'do concurrent' version.

Lines of code in POT3D (Original)

7019

The original code had a total of 7019 lines, which was reduced to 6872 lines in the 'do concurrent' version, a reduction of 147 lines (about 2%).

Technologies & Tools

Programming Language

Fortran

Used for developing the POT3D code and implementing parallel programming features.

Parallel Programming Model

OpenACC

Initially used for GPU acceleration in the POT3D code before refactoring to 'do concurrent'.

Software

NVIDIA HPC SDK

Compiler used for building and optimizing Fortran code with GPU support.

Key Actionable Insights

1. Refactor existing GPU-accelerated Fortran code to use 'do concurrent' to simplify maintenance and improve portability.
By reducing the number of directives in your code, you can make it easier for other developers to understand and modify, which is especially important in collaborative projects.

2. Consider using managed memory for data movement in GPU applications to reduce the need for explicit data movement directives.
This can streamline your code and reduce complexity, but be aware of the potential performance trade-offs, especially as you scale to multiple GPUs.
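The trade-off in the second insight can be sketched as follows. With managed memory, the two data directives below could simply be deleted and the runtime would migrate the arrays on demand. Keeping them gives explicit control over when data moves. Array names and sizes are illustrative assumptions and are not from POT3D.

```fortran
program explicit_data_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 100000   ! illustrative size (assumption)
  real(real64), allocatable :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))
  x = 1.0_real64
  y = 2.0_real64

  ! Explicit data movement: copy both arrays to the GPU once, up front.
  ! With managed memory this directive would not be needed.
  !$acc enter data copyin(x, y)

  do concurrent (i = 1:n)
    y(i) = y(i) + 2.0_real64*x(i)
  end do

  ! Bring the result back and release the device copies.
  !$acc exit data copyout(y) delete(x)

  if (any(y /= 4.0_real64)) error stop "unexpected result"
  print *, "y(1) =", y(1)
  deallocate(x, y)
end program explicit_data_sketch
```

As an assumption to verify against your SDK release, a build with manual data management might look like `nvfortran -stdpar=gpu -acc -gpu=nomanaged`, while `-stdpar=gpu` alone has typically relied on managed memory; the names of the memory-mode options have changed between SDK versions.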

Common Pitfalls

1. Over-reliance on managed memory can lead to performance degradation.
While managed memory simplifies data management, it may introduce overhead that affects performance, particularly as the number of GPUs increases. Developers should balance the convenience of managed memory with the performance characteristics of their applications.

Related Concepts

Parallel Programming in Fortran

GPU Acceleration Techniques

OpenACC vs. Standard Parallelism

Managed Memory in GPU Applications