Maximizing I/O Bandwidth for Out-of-Core HPC Applications on Homogeneous and Heterogeneous Large-Scale Systems
AdvisorsKeyes, David E.
Embargo End Date2021-10-01
Permanent link to this recordhttp://hdl.handle.net/10754/665396
MetadataShow full item record
Access RestrictionsAt the time of archiving, the student author of this dissertation opted to temporarily restrict access to it. The full text of this dissertation will become available to the public after the expiration of the embargo on 2021-10-01.
AbstractOut-of-Core simulation systems often produce a massive amount of data that cannot t on the aggregate fast memory of the compute nodes, and they also require to read back these data for computation. As a result, I/O data movement can be a bottleneck in large-scale simulations. Advances in memory architecture have made it feasible and a ordable to integrate hierarchical storage media on large-scale systems, starting from the traditional Parallel File Systems (PFSs) to intermediate fast disk technologies (e.g., node-local and remote-shared NVMe and SSD-based Burst Bu ers) and up to CPU main memory and GPU High Bandwidth Memory (HBM). However, while adding additional and faster storage media increases I/O bandwidth, it pressures the CPU, as it becomes responsible for managing and moving data between these layers of storage. Simulation systems are thus vulnerable to being blocked by I/O operations. The Multilayer Bu er System (MLBS) proposed in this research demonstrates a general and versatile method for overlapping I/O with computation that helps to ameliorate the strain on the processors through asynchronous access. The main idea consists in decoupling I/O operations from computational phases using dedicated hardware resources to perform expensive context switches. MLBS monitors I/O tra c in each storage layer allowing fair utilization of shared resources. By continually prefetching up and down across all hardware layers of the memory and storage subsystems, MLBS transforms the original I/O-bound behavior of evaluated applications and shifts it closer to a memory-bound or compute-bound regime. The evaluation on the Cray XC40 Shaheen-2 supercomputer for a representative I/Obound application, seismic inversion, shows that MLBS outperforms state-of-the-art PFSs, i.e., Lustre, Data Elevator and DataWarp by 6.06X, 2.23X, and 1.90X, respectively. On the IBM-built Summit supercomputer, using 2048 compute nodes equipped with a total of 12288 GPUs, MLBS achieves up to 1.4X performance speedup compared to the reference PFS-based implementation. MLBS is also demonstrated on applications from cosmology, combustion, and a classic out-of-core computational physics and linear algebra routines.