Extreme Computing Research Center
Recent Submissions

A nonlinear elimination preconditioned inexact Newton method for blood flow problems in human artery with stenosis (Journal of Computational Physics, Elsevier BV, 2019-09-09) [Article] Simulation of blood flows in the human artery is a promising tool for understanding hemodynamics. The blood flow is often smooth in a healthy artery, but may become locally chaotic in a diseased artery with stenosis; as a result, a traditional solver may take many iterations to converge, or may not converge at all. To overcome this problem, we develop a nonlinearly preconditioned Newton method in which the variables associated with the stenosis are iteratively eliminated, and a global Newton method is then applied to the smooth part of the system. More specifically, we model the blood flow in a patient-specific artery based on the unsteady incompressible Navier–Stokes equations with resistive boundary conditions, discretized by a fully implicit finite element method. The resulting nonlinear system at each time step is solved using an inexact Newton method with a domain-decomposition-based Jacobian solver. To improve the convergence and robustness of the Newton method for arteries with stenosis, we develop an adaptive restricted region-based nonlinear elimination preconditioner which performs a subspace correction to remove the local high nonlinearities. Numerical experiments for several cerebral arteries are presented to demonstrate the superiority of the proposed algorithm over the classical method with respect to several physical and numerical parameters. We also report the parallel scalability of the proposed algorithm on a supercomputer with thousands of processor cores.

A rezoning-free CESE Scheme for solving the Compressible Euler Equations on Moving Unstructured Meshes (Journal of Computational Physics, Elsevier BV, 2019-08-02) [Article] We construct a space-time conservation element and solution element (CESE) scheme for solving the compressible Euler equations on moving meshes (CESE-MM), which allows an arbitrary motion for each mesh point. The scheme is a direct extension of a purely Eulerian CESE scheme previously implemented on hybrid unstructured meshes (Shen et al., J. Comput. Phys., 2015). It adopts a mesh staggered in space and time such that the physical variables are continuous across the interfaces of adjacent space-time control volumes; therefore, a Riemann solver is not required to calculate the interface fluxes or the node velocities. Moreover, the staggered mesh significantly alleviates mesh tangling, so the time step can be kept at an acceptable level without any rezoning operation. The discretization of the integral space-time conservation law is based entirely on the physical space-time control volume, thereby satisfying both the physical and geometrical conservation laws. A wide range of numerical examples is presented to validate the accuracy and robustness of the CESE-MM scheme.

Conservative and entropy stable solid wall boundary conditions for the compressible Navier–Stokes equations: Adiabatic wall and heat entropy transfer (Journal of Computational Physics, Elsevier BV, 2019-08-01) [Article] We present a novel technique for the imposition of nonlinear entropy conservative and entropy stable solid wall boundary conditions for the compressible Navier–Stokes equations in the presence of an adiabatic wall, or a wall with a prescribed heat entropy flow. The procedure relies on the formalism and mimetic properties of diagonal-norm, summation-by-parts and simultaneous-approximation-term operators, and is a generalization of previous works on discontinuous interface coupling [1] and solid wall boundary conditions [2]. Using the method of lines, a semi-discrete entropy estimate for the entire domain is obtained when the proposed numerical imposition of boundary conditions is coupled with an entropy-conservative or entropy-stable discrete interior operator. The resulting estimate mimics the global entropy estimate obtained at the continuous level. The boundary data at the wall are weakly imposed using a penalty flux approach and a simultaneous-approximation-term technique for both the conservative variables and the gradient of the entropy variables. Discontinuous spectral collocation operators (mass-lumped nodal discontinuous Galerkin operators) on high-order unstructured grids are used to demonstrate the robustness and efficacy of the new procedure for weakly enforcing boundary conditions. Numerical simulations confirm the nonlinear stability of the proposed technique, with applications to three-dimensional subsonic and supersonic flows. The procedure described is compatible with any diagonal-norm summation-by-parts spatial operator, including finite element, finite difference, finite volume, discontinuous Galerkin, and flux reconstruction schemes.

HLIBCov: Parallel hierarchical matrix approximation of large covariance matrices and likelihoods with applications in parameter identification (MethodsX, Elsevier BV, 2019-07-12) [Article] We provide further technical details about the HLIBCov package, which uses parallel hierarchical (H) matrices to:
• approximate large, dense, inhomogeneous covariance matrices with a log-linear computational cost and storage requirement;
• compute matrix-vector products, the Cholesky factorization, and the inverse with log-linear complexity;
• identify unknown parameters of the covariance function (variance, smoothness, and covariance length).
These unknown parameters are estimated by maximizing the joint Gaussian log-likelihood function. To demonstrate the numerical performance, we identify three unknown parameters in an example with 2,000,000 locations on a desktop PC.
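The quantity being maximized can be made concrete with a minimal dense NumPy sketch, assuming a simple exponential covariance. The function and parameter names here are illustrative, not the HLIBCov API, and HLIBCov replaces the O(n³) dense Cholesky below with its log-linear H-matrix factorization:

```python
import numpy as np

def exp_cov(X, variance, length):
    # Exponential covariance (the smoothness nu = 1/2 member of the Matern family)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return variance * np.exp(-d / length)

def gaussian_loglik(theta, X, z):
    """Joint Gaussian log-likelihood of observations z at locations X."""
    variance, length = theta
    C = exp_cov(X, variance, length)
    L = np.linalg.cholesky(C)            # O(n^3) dense; HLIBCov: log-linear
    alpha = np.linalg.solve(L, z)        # alpha @ alpha == z^T C^{-1} z
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    n = len(z)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + alpha @ alpha)
```

Maximizing this function over `theta` (e.g., with a generic optimizer on the negative log-likelihood) recovers the variance and covariance length; adding smoothness as a third parameter requires a Matérn covariance with a Bessel-function term.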

Solution of the 3D density-driven groundwater flow problem with uncertain porosity and permeability (arXiv, 2019-06-01) [Preprint] As groundwater is an essential resource for drinking water and irrigation, its pollution may lead to catastrophic consequences. Accurate modeling of the pollution of the soil and the groundwater aquifer is therefore highly important. As a model, we consider a density-driven groundwater flow problem with uncertain porosity and permeability. Such problems arise in geothermal reservoir simulation, natural saline-disposal basins, modeling of contaminant plumes, and subsurface flow. This strongly nonlinear, time-dependent problem describes the convection of the two-phase flow, in which the denser liquid streams downward under gravity, building so-called "fingers". An accurate numerical solution requires fine spatial resolution with an unstructured mesh and, therefore, large computational resources. Consequently, we run the parallelized simulation toolbox \myug with a geometric multigrid solver on the Shaheen II supercomputer. The parallelization is done in both the physical and the stochastic space. Additionally, we demonstrate how the \myug toolbox can be run in a black-box fashion for testing different scenarios of the density-driven flow. As a benchmark, we solve the Elder-like problem in a 3D domain. For the approximations in the stochastic space, we use the generalized polynomial chaos expansion. We compute the mean, variance, and exceedance probabilities of the mass fraction. As a reference solution, we use the solution obtained with the quasi-Monte Carlo method.
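To make the stochastic-space approximation concrete, here is a hedged, self-contained sketch (not the toolbox's interface) of a one-dimensional generalized polynomial chaos expansion in Legendre polynomials for a single uniformly distributed parameter; the mean and variance fall out of the chaos coefficients, which is the same post-processing described for the mass fraction, reduced to the smallest possible setting:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

def gpc_mean_var(u, order=8, nquad=32):
    """Mean and variance of u(xi), xi ~ Uniform(-1, 1), via Legendre chaos.

    Coefficients c_k = E[u P_k] / E[P_k^2] are computed with Gauss-Legendre
    quadrature; for uniform density 1/2 on [-1, 1], E[P_k^2] = 1/(2k + 1).
    """
    x, w = leggauss(nquad)                 # quadrature nodes and weights
    ux = u(x)
    coeffs = []
    for k in range(order + 1):
        Pk = Legendre.basis(k)(x)
        ck = (2 * k + 1) / 2.0 * np.sum(w * ux * Pk)
        coeffs.append(ck)
    mean = coeffs[0]                        # mean is the zeroth coefficient
    var = sum(c ** 2 / (2 * k + 1) for k, c in enumerate(coeffs) if k > 0)
    return mean, var
```

In the preprint's setting the scalar `u` is replaced by a full PDE solve per quadrature node, which is exactly what makes the parallelization across the stochastic space worthwhile.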

Rotational test spaces for a fully-implicit FVM and FEM for the DNS of fluid-particle interaction (Journal of Computational Physics, Academic Press, 2019-05-23) [Article] The paper presents a fully implicit and stable finite element and finite volume scheme for the simulation of freely moving particles in a fluid. The developed method is based on the Petrov–Galerkin formulation of a vertex-centered finite volume method (PG-FVM) on unstructured grids. An appropriate extension of the ansatz and test spaces leads to a formulation comparable to a fictitious domain formulation. The purpose of this work is to introduce a new concept of numerical modeling that reduces the mathematical overhead which many other methods require. It exploits the identification of the PG-FVM with a corresponding finite element bilinear form. The surface integrals of the finite volume scheme enable a natural incorporation of the interface forces purely based on the original bilinear operator for the fluid. As a result, there is no need to expand the system of equations to a saddle-point problem. As in fictitious domain methods, the extended scheme treats the particles as rigid parts of the fluid. The distinguishing feature compared to most existing fictitious domain methods is that no additional Lagrange multiplier or other artificial external forces are needed for the fluid-solid coupling. Consequently, only a single solve of the derived linear system for the fluid together with the particles is necessary, and the proposed method does not require any fractional time-stepping scheme to balance the interaction forces between fluid and particles. For the linear Stokes problem we prove the stability of both schemes. Moreover, for the stationary case, the conservation of mass and momentum is not violated by the extended scheme, i.e., conservativity is accomplished within the range of the underlying, unconstrained discretization scheme. The scheme is applicable to problems in two and three dimensions.

Square-Root Variable Metric based elastic full-waveform inversion – Part 1: Theory and validation (Geophysical Journal International, Oxford University Press (OUP), 2019-05-16) [Article] Full-waveform inversion (FWI) has become a powerful tool for inverting subsurface geophysical properties. The estimation of uncertainty in the resulting Earth models and of parameter trade-offs, although equally important to the inversion result, has often been neglected or is prohibitive for large-scale inverse problems. Theoretically, uncertainty estimation is linked to the inverse Hessian (or posterior covariance matrix), which for massive inverse problems becomes impossible to store and compute. In this study, we investigate the application of the square-root variable metric (SRVM) method, a quasi-Newton optimisation algorithm, to FWI in a vector version. This approach allows us to reconstruct the final inverse Hessian at an affordable storage cost. We conduct SRVM-based elastic FWI on several elastic models in regular, free-surface and practical cases. Comparing the results with those obtained by the state-of-the-art L-BFGS algorithm, we find that the proposed SRVM method performs at a similarly high level of efficiency as L-BFGS, with the advantage of providing additional information, such as the inverse Hessian, needed for uncertainty quantification.

Square-root variable metric based elastic full-waveform inversion – Part 2: Uncertainty estimation (Geophysical Journal International, Oxford University Press (OUP), 2019-05-02) [Article] In our first paper (Part 1) on the square-root variable metric (SRVM) method, we presented the basic theory and validation of the inverse algorithm, applicable to large-scale seismic data inversions. In this second paper (Part 2), the objective is to estimate the resolution and uncertainty of the resulting inverted geophysical model. Bayesian inference allows the posterior model distribution to be estimated from its prior distribution and the likelihood function. When linear and Gaussian, these distributions can be characterized mathematically by their covariance matrices. However, it is prohibitive to construct and store the covariance explicitly in large-scale practical problems. In Part 1, we applied the SRVM method to elastic full-waveform inversion (FWI) in a matrix-free vector version. This new algorithm gives access to the posterior covariance by reconstructing the inverse Hessian with memory-affordable vector series. The focus of this paper is on extracting quantitative and statistical information from the inverse Hessian for quality assessment of the seismic model inverted by FWI. To operate on the inverse Hessian more efficiently, we compute its eigenvalues and eigenvectors with a randomized singular value decomposition. Furthermore, we collect point-spread functions from the Hessian in an efficient way. The posterior standard deviation quantitatively measures the uncertainties of the posterior model. 2D Gaussian random samplers help to visually compare the prior and posterior distributions. We illustrate our method on several numerical examples and demonstrate an uncertainty estimation analysis applicable to large-scale inversions.

Massively Parallel Polar Decomposition on Distributed-memory Systems (ACM Transactions on Parallel Computing, Association for Computing Machinery (ACM), 2019-05-01) [Article] We present a high-performance implementation of the polar decomposition (PD) on distributed-memory systems. Building upon the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate the QDWH convergence. Based on the Zolotarev rational functions, introduced by Zolotarev (ZOLO) in 1877, this new PD algorithm, ZOLO-PD, converges within two iterations even for ill-conditioned matrices, instead of the original six iterations needed for QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence order of up to 17, in contrast to the cubic convergence of QDWH. This comes at the price of higher arithmetic costs and a larger memory footprint. These extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers. We show that, in the presence of a large number of processing units, ZOLO-PD is able to outperform QDWH by up to a 2.3× speedup, especially in situations where QDWH runs out of work, for instance in the strong scaling mode of operation.
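The scalar mechanism behind both algorithms can be sketched in a few lines: QDWH drives the singular values to 1 with a dynamically weighted Halley iteration for the sign function, and ZOLO-PD replaces it with a composed degree-17 Zolotarev function. Below is a minimal, illustrative implementation of the unweighted, cubically convergent Halley iteration only; the production algorithms apply matrix analogues of this recurrence via QR-based updates:

```python
def halley_sign(x, tol=1e-15, maxit=100):
    """Unweighted Halley iteration x <- x (3 + x^2) / (1 + 3 x^2),
    which converges cubically to sign(x) for any real x != 0.
    Returns (limit, number_of_iterations)."""
    for k in range(1, maxit + 1):
        x_new = x * (3.0 + x * x) / (1.0 + 3.0 * x * x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, maxit
```

For a scalar near zero (the analogue of an ill-conditioned matrix) the unweighted iteration needs many steps, which is precisely the cost that the dynamic weighting of QDWH (at most six iterations) and the Zolotarev functions of ZOLO-PD (two iterations) eliminate.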

A k-mean characteristic function for optimizing STA/LTA based detection of microseismic events (GEOPHYSICS, Society of Exploration Geophysicists, 2019-04-13) [Article] Event detection is an essential component of microseismic data analysis. This process is typically carried out using a short- and long-term average ratio (STA/LTA) method, which is simple and computationally efficient but often yields inconsistent results for noisy datasets. Here, we aim to optimize the performance of the STA/LTA method by testing different input forms of the three-component waveform data and different characteristic functions (CFs), including a proposed k-mean CF. These tests are evaluated using receiver operating characteristic (ROC) analysis and compared on synthetic and field data examples. Our analysis shows that the STA/LTA method with a k-mean CF improves the detection sensitivity and yields more robust event detection on noisy datasets than some previous approaches. In addition, microseismic events are detected efficiently in the field data examples using the same detection threshold obtained from the ROC analysis of the synthetic data examples. We recommend using the Youden index, based on ROC analysis of a training subset extracted from the continuous data, to further improve the detection threshold for field microseismic data.
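The baseline detector being optimized can be sketched as follows. This is a generic STA/LTA on a simple energy characteristic function; the paper's proposed k-mean CF is a replacement for that input and is not reproduced here:

```python
import numpy as np

def sta_lta(cf, nsta, nlta):
    """STA/LTA ratio of a characteristic function cf (e.g. squared
    amplitude): the short-term average over the most recent nsta samples
    divided by the long-term average over the nlta samples preceding them."""
    csum = np.concatenate(([0.0], np.cumsum(cf)))   # prefix sums for O(1) windows
    ratio = np.zeros(len(cf))
    for i in range(nsta + nlta - 1, len(cf)):
        sta = (csum[i + 1] - csum[i + 1 - nsta]) / nsta
        lta = (csum[i + 1 - nsta] - csum[i + 1 - nsta - nlta]) / nlta
        ratio[i] = sta / lta if lta > 0 else 0.0
    return ratio
```

An event is declared wherever the ratio exceeds a chosen threshold; the ROC and Youden-index analysis in the abstract concerns precisely how to choose that threshold.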

Reproducibility in Benchmarking Parallel Fast Fourier Transform based Applications (Companion of the 2019 ACM/SPEC International Conference on Performance Engineering, ICPE '19, Association for Computing Machinery (ACM), 2019-04-05) [Conference Paper] We summarize concerns observed in enabling reproducibility for parallel applications that depend heavily on the three-dimensional distributed-memory fast Fourier transform, and we suggest reproducibility categories for benchmark results.

mpi4py-fft: Parallel Fast Fourier Transforms with MPI for Python (Journal of Open Source Software, The Open Journal, 2019-04-02) [Article]

A QDWH-based SVD software framework on distributed-memory manycore systems (ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2019-04-01) [Article] This article presents a high-performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al. (2010) and Nakatsukasa and Higham (2013), the SVD solver relies on the polar decomposition computed with the QR Dynamically Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with one-stage bidiagonal reduction, the inherently high level of concurrency associated with compute-bound Level 3 BLAS kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block-cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communication during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying on the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on performance by examining the trade-offs in the communication and computation profiles. We report performance results against state-of-the-art QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and the Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to 3-fold and 8-fold speedups on the Haswell- and KNL-based platforms, respectively, against ScaLAPACK PDGESVD, and turns out to be a competitive alternative for both well- and ill-conditioned matrices. Finally, we derive a performance model based on these empirical results.
Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1.

Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs (ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2019-04-01) [Article] Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that, across the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels for very small matrix sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
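As a hedged illustration of the recursive formulations mentioned above, here is a NumPy sketch of a triangular solve (TRSM) by quadrant recursion on a single small matrix; the actual GPU kernels apply the same splitting, in registers, across thousands of matrices at once:

```python
import numpy as np

def rec_trsm(L, B, cutoff=8):
    """Solve L X = B for lower-triangular L by quadrant recursion:
    [L11 0; L21 L22] [X1; X2] = [B1; B2]  =>
    X1 = L11^{-1} B1,  X2 = L22^{-1} (B2 - L21 X1).
    The off-diagonal update is a plain matrix multiply, which is why the
    recursion casts most of the work onto high-performance GEMM."""
    n = L.shape[0]
    if n <= cutoff:
        return np.linalg.solve(L, B)   # base case: tiny direct solve
    m = n // 2
    X1 = rec_trsm(L[:m, :m], B[:m], cutoff)
    X2 = rec_trsm(L[m:, m:], B[m:] - L[m:, :m] @ X1, cutoff)
    return np.vstack([X1, X2])
```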

Application of High Performance Asynchronous Acoustic Wave Equation Stencil Solver into a Land Survey (SPE Middle East Oil and Gas Show and Conference, Society of Petroleum Engineers (SPE), 2019-03-13) [Conference Paper] This paper describes the application of high-performance asynchronous stencil computations for 3D acoustic modeling to a synthetic land survey. Using the Finite-Difference Time-Domain (FDTD) method, a parallel Multicore Wavefront Diamond-tiling (MWD) stencil kernel (Malas et al. 2015, Malas et al. 2017) drives the high-performance execution, using temporal blocking to maximize data locality while reducing expensive horizontal data movement. As absorbing boundary conditions, we use the Convolutional Perfectly Matched Layer (CPML), which has to be redesigned so as not to interrupt the asynchronous execution flow engendered by the MWD stencil kernel for the inner-domain points. The main idea consists in weakening the data dependencies by moving the CPML computations into the inner computational loop of the MWD stencil kernel (Akbudak et al. 2019). In addition to handling the absorbing boundary conditions, applying the asynchronous MWD-with-CPML kernels to a realistic land survey requires the extraction of the wavefield value at each receiver position. We revisit the default extraction process and make it compliant with the overall asynchrony of the 3D acoustic modeling as well. We report performance improvements of up to 24% against the standard spatial blocking algorithm on Intel multicore chips using the synthetic land survey, which is representative of an area of interest in Saudi Arabia. While these results concur with previous performance assessments, we can now actually produce and assess the accuracy of the resulting 3D shot gather. To our knowledge, this is the first time the effectiveness of an asynchronous MWD stencil kernel with CPML absorbing boundary conditions has been demonstrated in an industrial seismic application.

Fast parallel multidimensional FFT using advanced MPI (Journal of Parallel and Distributed Computing, Elsevier BV, 2019-03-11) [Article] We present a new method for performing the global redistributions of multidimensional arrays that are essential to parallel fast Fourier (or similar) transforms. Traditional methods use standard all-to-all collective communication of contiguous memory buffers, thus necessarily requiring local data-realignment steps intermixed between the redistribution and transform steps. Instead, our method takes advantage of subarray datatypes and the generalized all-to-all scatter/gather of the MPI-2 standard to communicate discontiguous memory buffers, effectively eliminating the need for local data realignments. Although generalized all-to-all communication of discontiguous data is generally slower, our proposal economizes on local work. For a range of strong and weak scaling tests, we found the overall performance of our method to be on par with, and often better than, well-established libraries like MPI-FFTW, P3DFFT, and 2DECOMP&FFT. We provide compact routines implemented at the highest possible level using the MPI bindings for the C programming language. These routines apply to any global redistribution, over any two directions of a multidimensional array, decomposed on arbitrary Cartesian processor grids (1D slabs, 2D pencils, or even higher-dimensional decompositions). The high-level implementation makes the code easy to read, maintain, and eventually extend. Our approach enables future speedups from optimizations in the internal datatype-handling engines within MPI implementations.

Hierarchical Matrix Operations on GPUs (ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2019-02-25) [Article] Hierarchical matrices are space- and time-efficient representations of dense matrices that exploit the low-rank structure of matrix blocks at different levels of granularity. The hierarchically low-rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity of dense matrices. In this article, we present high-performance implementations of matrix-vector multiplication and compression operations for the H2 variant of hierarchical matrices on GPUs. The H2 variant exploits, in addition to the hierarchical block partitioning, hierarchical bases for the block representations, and results in a scheme that requires only O(n) storage and O(n) complexity for the matvec and compression kernels. These two operations are at the core of algebraic operations for hierarchical matrices: the matvec is a ubiquitous operation in numerical algorithms, while compression/recompression represents a key building block for other algebraic operations, which require periodic recompression during execution. The difficulties in developing efficient GPU algorithms come primarily from the irregular tree data structures that underlie the hierarchical representations, and the key to performance is to recast the computations on flattened trees in ways that allow batched linear algebra operations to be performed. This requires marshaling the irregularly laid out data in a way that allows it to be used by the batched routines. Marshaling operations involve only pointer arithmetic with no data movement, and as a result have minimal overhead.
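The batching idea can be sketched in NumPy (illustrative only, not the library's CUDA interface): once the irregular tree blocks have been marshaled into uniform (batch, rows, cols) arrays, one level's worth of low-rank products y_i = U_i (V_iᵀ x_i) collapses into two batched multiplies:

```python
import numpy as np

def batched_lowrank_matvec(U, V, x):
    """Apply y_i = U_i @ (V_i.T @ x_i) for a whole batch at once.
    U: (b, m, k), V: (b, m, k), x: (b, m) -> y: (b, m).
    Mirrors how flattened-tree marshaling enables batched BLAS calls."""
    z = np.einsum('bmk,bm->bk', V, x)     # batched V^T x: rank-k coefficients
    return np.einsum('bmk,bk->bm', U, z)  # batched U z: expand back to m
```

On a GPU the two `einsum` calls would be batched GEMM invocations; the marshaling step that produces the uniform `U`, `V`, `x` arrays from the tree is pure pointer arithmetic, as the abstract notes.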
Our numerical results on covariance matrices from 2D and 3D spatial statistics problems show the high efficiency our routines achieve: over 550 GB/s for the bandwidth-limited matrix-vector operation and over 850 GFLOP/s of sustained performance for the compression operation on the P100 Pascal GPU.

Scaling Distributed Machine Learning with In-Network Aggregation (arXiv, 2019-02-22) [Preprint] Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and by at least 20% for a number of real-world benchmark models.

Advanced Hepatitis C Virus Replication PDE Models within a Realistic Intracellular Geometric Environment (International Journal of Environmental Research and Public Health, MDPI AG, 2019-02-13) [Article] The hepatitis C virus (HCV) RNA replication cycle is a dynamic intracellular process occurring in three-dimensional (3D) space, which is difficult both to capture experimentally and to visualize conceptually. HCV-generated replication factories are housed within virus-induced intracellular structures termed membranous webs (MW), which are derived from the endoplasmic reticulum (ER). Recently, we published 3D spatio-temporally resolved diffusion–reaction models of the HCV RNA replication cycle by means of surface partial differential equation (sPDE) descriptions. We distinguished between the basic components of the HCV RNA replication cycle, namely HCV RNA, non-structural viral proteins (NSPs), and a host factor. In particular, we evaluated the sPDE models upon realistically reconstructed intracellular compartments (ER/MW). In this paper, we propose a significant extension of the model based upon two additional features: different aggregate states of HCV RNA and NSPs, and diffusion and reaction coefficients inspired by population dynamics instead of multilinear ones. The combination of both aspects enables realistic modeling of viral replication at all scales. Specifically, we describe a replication complex state consisting of HCV RNA together with a defined amount of NSPs. As a result of the combination of spatial resolution and different aggregate states, the new model mimics a cis requirement for HCV RNA replication. We used heuristic parameters for our simulations, which were run only on a subsection of the ER. Nevertheless, this was sufficient to allow the fitting of core aspects of virus reproduction, at least qualitatively. Our findings should help stimulate new modeling approaches and experimental directions in virology.

Likelihood approximation with hierarchical matrices for large spatial datasets (Computational Statistics & Data Analysis, Elsevier BV, 2019-02-12) [Article] The unknown parameters (variance, smoothness, and covariance length) of a spatial covariance function can be estimated by maximizing the joint Gaussian log-likelihood function. To overcome the cubic complexity of the linear algebra, the discretized covariance function is approximated in the hierarchical (H) matrix format. The H-matrix format has a log-linear computational cost and O(kn log n) storage, where the rank k is a small integer and n is the number of locations. The H-matrix technique can approximate general (also inhomogeneous) covariance matrices discretized on a fairly general mesh that is not necessarily axis-parallel, and neither the covariance matrix itself nor its inverse has to be sparse. We investigate how the H-matrix approximation error influences the estimated parameters. Numerical examples with Monte Carlo simulations, where the true values of the unknown parameters are given, and an application to soil moisture data with unknown parameters are presented. The C and C++ codes and the data are freely available.