High Productivity Programming of Dense Linear Algebra on Heterogeneous NUMA Architectures
Authors: Alomairy, Rabab M.
Advisors: Keyes, David E.
Permanent link to this record: http://hdl.handle.net/10754/297194
Abstract: High-end multicore systems with GPU-based accelerators are now ubiquitous in the hardware landscape. Besides dealing with the nontrivial heterogeneous environment, end users must often take the underlying memory architecture into account to decrease the overhead of data motion, especially when running on non-uniform memory access (NUMA) platforms. We propose an approach based on the OmpSs parallel programming model and its Nanos++ dynamic runtime system that addresses both of these challenges through 1) an innovative NUMA node-aware scheduling policy that reduces data movement between NUMA nodes and 2) a nested parallelism feature that concurrently exploits the resources of the GPU devices and the CPU host without compromising overall performance. Our approach features separation of concerns: it abstracts the complexity of the hardware away from end users so that high productivity can be achieved. The Cholesky factorization is used as a benchmark representative of dense numerical linear algebra algorithms. Superior performance is also demonstrated on symmetric matrix inversion based on Cholesky factorization, which is commonly used in covariance computations in statistics. Performance on a NUMA system with Kepler-based GPUs exceeds that of existing implementations, while the OmpSs-enabled code remains very similar to its original sequential version.
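To make the benchmark concrete, the following is a minimal sequential sketch of a tiled (right-looking) Cholesky factorization, the kind of loop nest that a task-based model such as OmpSs annotates. It is not the thesis code: the function names (`potrf`, `trsm`, `update`, `tiled_cholesky`), the tile-size parameter `ts`, and the list-of-lists matrix layout are assumptions made for illustration, and the `in`/`inout` labels appear only as comments marking which tiles each kernel reads and writes.

```python
import math


def potrf(A, k0, k1):
    """Factor the diagonal tile A[k0:k1, k0:k1] in place (lower triangle)."""
    for j in range(k0, k1):
        s = A[j][j] - sum(A[j][p] ** 2 for p in range(k0, j))
        A[j][j] = math.sqrt(s)
        for i in range(j + 1, k1):
            t = A[i][j] - sum(A[i][p] * A[j][p] for p in range(k0, j))
            A[i][j] = t / A[j][j]


def trsm(A, k0, k1, i0, i1):
    """Triangular solve: tile(i,k) <- tile(i,k) * inverse(transpose(L_kk))."""
    for r in range(i0, i1):
        for j in range(k0, k1):
            t = A[r][j] - sum(A[r][p] * A[j][p] for p in range(k0, j))
            A[r][j] = t / A[j][j]


def update(A, k0, k1, i0, i1, j0, j1):
    """Trailing update: tile(i,j) -= tile(i,k) * transpose(tile(j,k))."""
    for r in range(i0, i1):
        for c in range(j0, j1):
            if c > r:  # only the lower triangle is stored and used
                continue
            A[r][c] -= sum(A[r][p] * A[c][p] for p in range(k0, k1))


def tiled_cholesky(A, ts):
    """Return lower-triangular L with L * L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [list(map(float, row)) for row in A]
    tiles = [(s, min(s + ts, n)) for s in range(0, n, ts)]
    for k, (k0, k1) in enumerate(tiles):
        potrf(L, k0, k1)                           # inout: tile (k,k)
        for (i0, i1) in tiles[k + 1:]:
            trsm(L, k0, k1, i0, i1)                # in: (k,k)  inout: (i,k)
        for i in range(k + 1, len(tiles)):
            i0, i1 = tiles[i]
            for (j0, j1) in tiles[k + 1:i + 1]:
                update(L, k0, k1, i0, i1, j0, j1)  # in: (i,k), (j,k)  inout: (i,j)
    for r in range(n):                             # clear the unused upper triangle
        for c in range(r + 1, n):
            L[r][c] = 0.0
    return L
```

Each kernel call touches a small, well-defined set of tiles, so a dataflow runtime can, in principle, derive the dependency graph from those read/write sets and run independent calls concurrently while the loop structure stays the same as in the sequential version.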