We present lessons learned from refactoring a Fortran application to use modern do concurrent loops in place of OpenACC for GPU acceleration.
Overview
The article discusses the use of Fortran's standard parallel programming features, particularly the 'do concurrent' construct, for GPU acceleration. It highlights the benefits of using standard language features for future-proofing, portability, and ease of code maintenance while providing insights into performance comparisons with existing OpenACC directives.
What You'll Learn
1. How to use 'do concurrent' in Fortran for GPU programming
2. Why refactoring existing code to standard parallelism can improve maintainability
3. When to consider using managed memory in GPU applications
Prerequisites & Requirements
- Familiarity with Fortran programming and parallel computing concepts
- Access to the NVIDIA HPC SDK for compiling code (optional)
Key Questions Answered
How does 'do concurrent' improve code portability and maintainability?
'do concurrent' is a standard Fortran feature that enhances code portability by reducing reliance on vendor-specific directives. This makes the code easier to maintain and future-proof, as standard features are less likely to become obsolete compared to proprietary directives.
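As a minimal sketch of what this refactoring looks like, the module below expresses the same loop twice: once with an OpenACC directive and once with 'do concurrent'. The module name, routine names, and the AXPY-style loop body are hypothetical illustrations and are not taken from POT3D. With nvfortran, the 'do concurrent' version is typically offloaded by compiling with `-stdpar=gpu`; any other standard-conforming compiler still accepts the code and treats the `!$acc` line as a comment.

```fortran
module axpy_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! OpenACC version: parallelism is requested through a directive.
  subroutine axpy_acc(a, x, y)
    real(dp), intent(in)    :: a, x(:)
    real(dp), intent(inout) :: y(:)
    integer :: i
    !$acc parallel loop
    do i = 1, size(y)
      y(i) = a*x(i) + y(i)
    end do
  end subroutine axpy_acc

  ! Standard Fortran version: the loop itself states that its
  ! iterations are independent, so no directive is needed.
  subroutine axpy_dc(a, x, y)
    real(dp), intent(in)    :: a, x(:)
    real(dp), intent(inout) :: y(:)
    integer :: i
    do concurrent (i = 1:size(y))
      y(i) = a*x(i) + y(i)
    end do
  end subroutine axpy_dc

end module axpy_sketch
```

Both routines compute the same result; the second simply carries the parallelism in the language rather than in a directive, which is the property the article relies on for portability.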
What performance differences exist between 'do concurrent' and OpenACC?
The article shows that while the 'do concurrent' version of the POT3D code has a small performance hit of approximately 10% compared to the original OpenACC code, it significantly reduces the number of directives and lines of code, making it easier to manage and maintain.
When is it necessary to use directives with 'do concurrent'?
Directives are still needed for device selection and for certain atomic operations, because the nvfortran compiler currently lacks support for these within 'do concurrent'. Keeping a small number of directives also lets developers optimize data movement while still benefiting from the simplicity of 'do concurrent'.
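The atomic case can be sketched as follows. The routine below accumulates a weighted sum over one index of a 2D array, so several iterations update the same output element, and an OpenACC atomic directive is kept inside the 'do concurrent' loop to protect that update. The module, routine, and variable names are hypothetical. Note the assumptions: this pattern depends on the compiler honoring the directive (as the article describes for nvfortran), and without it the loop does not satisfy the independence that 'do concurrent' promises, so it should be read as a compiler-specific sketch rather than portable standard Fortran.

```fortran
module atomic_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Computes s(i) = sum over k of x(i,k)*w(k).
  subroutine weighted_row_sums(x, w, s)
    real(dp), intent(in)  :: x(:,:)   ! values, shape (n, m)
    real(dp), intent(in)  :: w(:)     ! weights, length m
    real(dp), intent(out) :: s(:)     ! result, length n
    integer :: i, k

    s = 0.0_dp
    do concurrent (k = 1:size(x,2), i = 1:size(x,1))
      ! Different k iterations touch the same s(i); the atomic
      ! directive serializes just this update (assumes OpenACC is enabled).
      !$acc atomic update
      s(i) = s(i) + x(i,k)*w(k)
    end do
  end subroutine weighted_row_sums

end module atomic_sketch
```

An alternative that avoids the directive is to restructure the loop so that each iteration owns one output element, for example by running 'do concurrent' over `i` only and summing over `k` inside it, at the cost of exposing less parallelism.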
Key Statistics & Figures
Performance comparison
10%
The 'do concurrent' version showed a performance hit of approximately 10% compared to the original OpenACC code.
Number of directives in POT3D (Original)
80
The original POT3D code contained 80 OpenACC directives, which were reduced to just 3 in the 'do concurrent' version.
Lines of code in POT3D (Original)
7019
The original code had a total of 7019 lines, which was reduced to 6872 lines in the 'do concurrent' version.
Technologies & Tools
Programming Language
Fortran
Used for developing the POT3D code and implementing parallel programming features.
Parallel Programming Model
OpenACC
Initially used for GPU acceleration in the POT3D code before refactoring to 'do concurrent'.
Software
NVIDIA HPC SDK
Compiler used for building and optimizing Fortran code with GPU support.
Key Actionable Insights
1. Refactor existing GPU-accelerated Fortran code to use 'do concurrent' to simplify maintenance and improve portability. By reducing the number of directives in your code, you can make it easier for other developers to understand and modify, which is especially important in collaborative projects.
2. Consider using managed memory for data movement in GPU applications to reduce the need for explicit data movement directives. This can streamline your code and reduce complexity, but be aware of the potential performance trade-offs, especially as you scale to multiple GPUs.
Common Pitfalls
1. Over-reliance on managed memory can lead to performance degradation.
While managed memory simplifies data management, it may introduce overhead that affects performance, particularly as the number of GPUs increases. Developers should balance the convenience of managed memory with the performance characteristics of their applications.
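One way to trade some of that convenience back for control is to keep the 'do concurrent' loops and add explicit OpenACC data directives around them, so arrays are copied to the GPU once rather than migrated on demand. The sketch below assumes a hypothetical smoothing routine; none of the names come from POT3D. It also assumes a compiler that combines OpenACC data regions with offloaded 'do concurrent' loops. With nvfortran this has meant building with OpenACC enabled alongside `-stdpar=gpu` and with managed memory turned off (for example `-acc=gpu -gpu=nomanaged`); newer HPC SDK releases may spell the memory option differently, so check the documentation for your version.

```fortran
module data_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Repeatedly replaces each interior point of u by the average of
  ! itself and its two neighbours; end points are left unchanged.
  subroutine smooth(u, nsteps)
    real(dp), intent(inout) :: u(:)
    integer,  intent(in)    :: nsteps
    real(dp), allocatable   :: tmp(:)
    integer :: i, n, step

    n = size(u)
    allocate(tmp(n))
    tmp = u

    ! Copy both arrays to the device once, before the time loop.
    !$acc enter data copyin(u, tmp)
    do step = 1, nsteps
      do concurrent (i = 2:n-1)
        tmp(i) = (u(i-1) + u(i) + u(i+1)) / 3.0_dp
      end do
      do concurrent (i = 2:n-1)
        u(i) = tmp(i)
      end do
    end do
    ! Bring the result back and release device copies.
    !$acc exit data copyout(u) delete(tmp)

    deallocate(tmp)
  end subroutine smooth

end module data_sketch
```

With managed memory left on, the two data directives can simply be removed and the loops stay the same, which makes it straightforward to compare both configurations on a given application.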
Related Concepts
Parallel Programming in Fortran
GPU Acceleration Techniques
OpenACC vs. Standard Parallelism
Managed Memory in GPU Applications