ACCTuner: OpenACC Auto-Tuner For Accelerated Scientific Applications
Advisors: Keyes, David E.
Abstract: We optimize parameters in OpenACC clauses for a stencil evaluation kernel executed on Graphics Processing Units (GPUs) using a variety of machine learning and optimization search algorithms, individually and in hybrid combinations, and compare the resulting execution times to the best obtainable by brute-force search. Several auto-tuning techniques (historic learning, random walk, simulated annealing, Nelder-Mead, and genetic algorithms) are evaluated over a large two-dimensional parameter space, consisting of gang size and vector length, that OpenACC compilers have not satisfactorily addressed to date. A hybrid of historic learning and Nelder-Mead delivers the best balance of high performance and low tuning effort. GPUs are employed in an increasing range of applications because of the performance available from their large number of cores, as well as their energy efficiency. However, writing code that takes advantage of their massive fine-grained parallelism requires deep knowledge of the hardware, and is generally a complex task involving program transformation and the selection of many parameters. To improve programmer productivity, the directive-based programming model OpenACC was announced as an industry standard in 2011. Various compilers have been developed to support this model, the most notable being those by Cray, CAPS, and PGI. While GPU architectures and core counts have evolved rapidly, the compilers have failed to keep up in configuring the parallel program to run most efficiently on the hardware. Following successful approaches that use auto-tuning to obtain high performance in kernels for cache-based processors, we address this compiler-hardware gap in GPUs by auto-tuning the key parameters “gang” and “vector” in OpenACC clauses.
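To make the comparison between brute-force search and a cheaper heuristic concrete, the following is a minimal Python sketch, not the tuner described in this work. Everything in it is an assumption made for illustration: the candidate lists `GANGS` and `VECTORS`, the function `synthetic_time` (a made-up smooth cost that stands in for a measured kernel run time), and the greedy variant of random walk that accepts only improving moves.

```python
import itertools
import math
import random

# Hypothetical candidate values; real limits depend on the GPU and the compiler.
GANGS = [2 ** k for k in range(4, 13)]      # 16, 32, ..., 4096
VECTORS = [32 * k for k in range(1, 33)]    # 32, 64, ..., 1024


def synthetic_time(gang, vector):
    """Stand-in for a measured kernel run time (NOT real data).

    A smooth bowl whose minimum sits at gang=512, vector=128.
    """
    return 1.0 + (math.log2(gang) - 9) ** 2 + ((vector - 128) / 128) ** 2


def brute_force(cost):
    """Evaluate every configuration; return (best_config, evaluations)."""
    configs = list(itertools.product(GANGS, VECTORS))
    best = min(configs, key=lambda c: cost(*c))
    return best, len(configs)


def greedy_random_walk(cost, start, steps, seed=0):
    """Step to a random grid neighbor, keeping the move only if it is faster.

    Returns (best_config, best_time, evaluations).
    """
    rng = random.Random(seed)
    gi, vi = GANGS.index(start[0]), VECTORS.index(start[1])
    best_t = cost(GANGS[gi], VECTORS[vi])
    evals = 1
    for _ in range(steps):
        dg, dv = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        # Clamp the proposed indices to the edges of the search grid.
        ng = min(max(gi + dg, 0), len(GANGS) - 1)
        nv = min(max(vi + dv, 0), len(VECTORS) - 1)
        t = cost(GANGS[ng], VECTORS[nv])
        evals += 1
        if t < best_t:
            gi, vi, best_t = ng, nv, t
    return (GANGS[gi], VECTORS[vi]), best_t, evals


if __name__ == "__main__":
    print(brute_force(synthetic_time))
    print(greedy_random_walk(synthetic_time, start=(16, 32), steps=200))
```

In a real tuner, `synthetic_time` would be replaced by compiling and timing the kernel with the chosen gang and vector values, so each evaluation is expensive. That cost is why the number of evaluations a heuristic needs matters as much as the quality of the configuration it finds.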
We demonstrate results for a stencil evaluation kernel typical of seismic imaging over a variety of realistically sized three-dimensional grid configurations, with different truncation error orders in the spatial dimensions. Apart from random walk and historic learning based on nearest neighbor in grid size, most of our heuristics, including the one that proves best, appear to be applied in this context for the first time. This work is a stepping-stone towards an OpenACC auto-tuning framework for more general high-performance numerical kernels optimized for GPU computations.
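Historic learning based on nearest neighbor in grid size can be sketched as a lookup: reuse the configuration that was best for the previously tuned grid closest in size to the new one. The Python sketch below is illustrative only. The `HISTORY` entries are invented placeholders rather than measured results, and the distance metric (Euclidean distance between log2 grid dimensions) is an assumption, not necessarily the one used in this work.

```python
import math

# Hypothetical history: grid size (nx, ny, nz) -> best (gang, vector) from an
# earlier tuning run. These values are placeholders, NOT measured results.
HISTORY = {
    (128, 128, 128): (256, 64),
    (256, 256, 256): (512, 128),
    (512, 512, 512): (1024, 128),
}


def nearest_config(grid, history):
    """Return (nearest_grid, config) for the stored grid closest to `grid`.

    Distance is Euclidean in log2 of each dimension (an assumed metric), so
    that doubling a dimension counts the same at any scale.
    """
    def dist(a, b):
        return math.dist([math.log2(x) for x in a],
                         [math.log2(x) for x in b])

    nearest = min(history, key=lambda g: dist(g, grid))
    return nearest, history[nearest]


if __name__ == "__main__":
    print(nearest_config((300, 300, 300), HISTORY))
```

The returned configuration can be used directly, or as the starting point for a local search such as Nelder-Mead, which is the idea behind the hybrid described in the abstract.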