AuthorsBoukaram, Wagih Halim
AdvisorsKeyes, David E.
Permanent link to this recordhttp://hdl.handle.net/10754/662664
MetadataShow full item record
AbstractLarge dense matrices are ubiquitous in scientific computing, arising from the discretization of integral operators associated with elliptic pdes, Schur complement methods, covariances in spatial statistics, kernel-based machine learning, and numerical optimization problems. Hierarchical matrices are an efficient way for storing the dense matrices of very large dimension that appear in these and related settings. They exploit the fact that the underlying matrices, while formally dense, are data sparse. They have a structure consisting of blocks many of which can be well-approximated by low rank factorizations. A hierarchical organization of the blocks avoids superlinear growth in memory requirements to store n × n dense matrices in a scalable manner, requiring O(n) units of storage with a constant depending on a representative rank k for the low rank blocks. The asymptotically optimal storage requirement of the resulting hierarchical matrices is a critical advantage, particularly in extreme computing environments, characterized by low memory per processing core. The challenge then becomes to develop the parallel linear algebra operations that can be performed directly on this compressed representation. In this dissertation, I implement a set of hierarchical basic linear algebra subroutines (HBLAS) optimized for GPUs, including hierarchical matrix vector multiplication, orthogonalization, compression, low rank updates, and matrix multiplication. I develop a library of open source batched kernel operations previously missing on GPUs for the high performance implementation of the H2 operations, while relying wherever possible on existing open source and vendor kernels to ride future improvements in the technology. Fast marshaling routines extract the batch operation data from an efficient representation of the trees that compose the hierarchical matrices. The methods developed for GPUs extend to CPUs using the same code base with simple abstractions around the batched routine execution. To demonstrate the scalability of the hierarchical operations I implement a distributed memory multi-GPU hierarchical matrix vector product that focuses on reducing communication volume and hiding communication overhead and areas of low GPU utilization using low priority streams. Two demonstrations involving Hessians of inverse problems governed by pdes and space-fractional diffusion equations show the effectiveness of the hierarchical operations in realistic applications.