Extreme Computing Research Center
Recent Submissions

A Multilayer Nonlinear Elimination Preconditioned Inexact Newton Method for Steady-State Incompressible Flow Problems in Three Dimensions (SIAM Journal on Scientific Computing, Society for Industrial & Applied Mathematics (SIAM), 2020-11-24) [Article] We develop a multilayer nonlinear elimination preconditioned inexact Newton method for a nonlinear algebraic system of equations, and a target application is the three-dimensional steady-state incompressible Navier-Stokes equations at high Reynolds numbers. Nonlinear steady-state problems are often more difficult to solve than time-dependent problems because the Jacobian matrix is less diagonally dominant and a good initial guess from the previous time step is not available. For such problems, Newton-like methods may suffer from slow convergence or stagnation even with globalization techniques such as line search. In this paper, we introduce a cascadic multilayer nonlinear elimination approach based on feedback from intermediate solutions to improve the convergence of the Newton iteration. Numerical experiments show that the proposed algorithm is superior to the classical inexact Newton method and other single-layer nonlinear elimination approaches in terms of robustness and efficiency. Using the proposed nonlinear preconditioner with a highly parallel domain decomposition framework, we demonstrate that steady solutions of the Navier-Stokes equations with Reynolds numbers as large as 7,500 can be obtained for the lid-driven cavity flow problem in three dimensions without the use of any continuation methods.
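For readers unfamiliar with the baseline this work improves upon, here is a minimal sketch of a globalized inexact Newton iteration (Newton direction plus backtracking line search) on a toy 2x2 system. This is not the authors' multilayer elimination algorithm; the function names and the test problem are illustrative only.

```python
import math

def newton_line_search(F, J, x, tol=1e-10, max_iter=50):
    """Newton's method with backtracking line search for a 2x2 system."""
    for _ in range(max_iter):
        r = F(x)
        norm_r = math.hypot(r[0], r[1])
        if norm_r < tol:
            break
        (a, b), (c, d) = J(x)
        det = a * d - b * c
        # Newton direction: solve J dx = -r by Cramer's rule
        dx = [(-r[0] * d + r[1] * b) / det, (-r[1] * a + r[0] * c) / det]
        # Backtracking (globalization): halve the step until the residual shrinks
        lam = 1.0
        x_new = x
        while lam > 1e-8:
            x_new = [x[0] + lam * dx[0], x[1] + lam * dx[1]]
            r_new = F(x_new)
            if math.hypot(r_new[0], r_new[1]) < norm_r:
                break
            lam *= 0.5
        x = x_new
    return x

# Toy problem: intersect the circle x^2 + y^2 = 4 with the line y = x
F = lambda v: [v[0] ** 2 + v[1] ** 2 - 4.0, v[1] - v[0]]
J = lambda v: [[2 * v[0], 2 * v[1]], [-1.0, 1.0]]
root = newton_line_search(F, J, [3.0, 1.0])  # converges to (sqrt(2), sqrt(2))
```

The abstract's point is that, for hard steady-state problems, even this line-search safeguard can stagnate, which motivates the nonlinear elimination preconditioning layered on top of the iteration above.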

NodePy: A package for the analysis of numerical ODE solvers (Journal of Open Source Software, The Open Journal, 2020-11-17) [Article] Ordinary differential equations (ODEs) are used to model a vast range of physical and other phenomena. They also arise in the discretization of partial differential equations. In most cases, solutions of differential equations must be approximated by numerical methods. The study of the properties of numerical methods for ODEs comprises an important and large body of knowledge. NodePy (available from https://github.com/ketch/nodepy, with documentation at https://nodepy.readthedocs.io/en/latest/) is a software package for designing and studying the properties of numerical ODE solvers. For the most important classes of methods, NodePy can automatically assess their stability, accuracy, and many other properties. NodePy has also been used as a catalog of coefficients for time integration methods in PDE solver codes.
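One example of the kind of analysis NodePy automates is computing a Runge-Kutta method's stability function from its Butcher tableau. The dependency-free sketch below does this by hand for the classical RK4 method, using the fact that for an explicit s-stage method R(z) = 1 + sum_k (b^T A^{k-1} e) z^k terminates after s terms; this is an illustration, not NodePy's API.

```python
from fractions import Fraction as Fr

# Butcher tableau of the classical four-stage RK4 method
A = [[0, 0, 0, 0],
     [Fr(1, 2), 0, 0, 0],
     [0, Fr(1, 2), 0, 0],
     [0, 0, 1, 0]]
b = [Fr(1, 6), Fr(1, 3), Fr(1, 3), Fr(1, 6)]

def stability_coeffs(A, b):
    """Coefficients of R(z) = 1 + sum_k (b^T A^{k-1} e) z^k for an explicit RK
    method; since A is strictly lower triangular, the series is finite."""
    s = len(b)
    v = [Fr(1)] * s       # v holds A^{k-1} e, starting from e = (1, ..., 1)
    coeffs = [Fr(1)]      # constant term of R(z)
    for _ in range(s):
        coeffs.append(sum(bi * vi for bi, vi in zip(b, v)))
        v = [sum(A[i][j] * v[j] for j in range(s)) for i in range(s)]
    return coeffs

coeffs = stability_coeffs(A, b)
# RK4's stability function matches the Taylor series of exp(z) through order 4:
# coeffs == [1, 1, 1/2, 1/6, 1/24]
```

Exact rational arithmetic (via `fractions`) is what makes such symbolic-style checks reliable; NodePy follows the same philosophy by supporting exact coefficients.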

Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems (IEEE, 2020-11-01) [Conference Paper] The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, where the main idea consists of exploiting data sparsity, typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be taken at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorizations by up to 7-fold on a large-scale distributed-memory system, while reducing the memory footprint by up to a 44-fold factor. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries.

High Performance Asynchronous Reverse Time Migration for Oil and Gas Exploration (IEEE, 2020-11-01) [Conference Paper] Reverse Time Migration (RTM) is a state-of-the-art algorithm used in seismic depth imaging of complex geological environments for the oil and gas exploration industry. It calculates high-resolution images by solving the three-dimensional acoustic wave equation using seismic datasets recorded at various receiver locations. Using a finite-difference time-domain (FDTD) scheme, the time integration follows an adjoint-state formulation with two successive phases, i.e., forward modeling and backward integration. Each subsurface image is then generated during the imaging condition that combines a forward-propagated source wavefield with a backward-propagated receiver wavefield. RTM's computational phases are predominantly composed of stencil computational kernels for the FDTD, application of the absorbing boundary conditions, and I/O operations needed for the imaging condition. In fact, RTM can be considered an out-of-core algorithm, which requires offloading to disk snapshots of the domain solution at specific time intervals during the forward modeling phase. During the backward time integration, these snapshots are read back and synchronized at the corresponding times. As far as optimizing the stencil computation is concerned, spatial blocking (SB) represents the widely adopted vendor-agnostic technique for increasing data reuse in the high levels of the memory subsystem. In this paper, we integrate into RTM the asynchronous Multicore Wavefront Diamond (MWD) tiling approach, which permits further increasing data reuse by combining spatial blocking with temporal blocking (TB) during the stencil computations. This integration engenders new challenges with a snowball effect on the legacy synchronous RTM workflow, as it requires deeply rethinking how the absorbing boundary conditions, the I/O operations, and the imaging condition operate. These disruptive changes in cascade are necessary to maintain the performance superiority of asynchronous stencil execution throughout the time integration, while ensuring the quality of the subsurface image does not deteriorate. We assess the overall performance of the new MWD-based RTM and compare it against traditional SB-based RTM on various shared-memory systems using the SEG Salt3D model. The MWD-based RTM is able to achieve up to 60% performance speedup compared to SB-based RTM. To our knowledge, this paper highlights for the first time the applicability of asynchronous RTM executions, which results in a higher simulation throughput and may eventually create new research opportunities in improving hydrocarbon extraction for the petroleum industry.
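The core idea behind temporal blocking is to fuse several time steps over a small spatial tile so the tile stays cache-resident, paying a little redundant halo computation in exchange for data reuse. The 1D sketch below uses the simple overlapped-tiling variant (not the diamond tiles of MWD; names and the toy stencil are illustrative) and produces results bit-identical to the naive double sweep.

```python
def step(u):
    """One explicit step of a 3-point averaging stencil; end values held fixed."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def two_steps_naive(u):
    """Two global sweeps: the whole grid is streamed through memory twice."""
    return step(step(u))

def two_steps_blocked(u, tile=8):
    """Fuse two time steps tile by tile. Each tile carries a halo of width 2
    (one ghost layer per fused step), so its interior points come out
    bit-identical to the naive double sweep."""
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, tile):
        lo, hi = max(0, start - 2), min(n, start + tile + 2)
        local = step(step(u[lo:hi]))          # redundant halo work on cached data
        for i in range(start, min(start + tile, n)):
            out[i] = local[i - lo]
    return out
```

The paper's harder problem is that boundary conditions, snapshot I/O, and the imaging condition all assume the globally synchronized sweeps of `two_steps_naive`, and must be reworked to tolerate the tile-local, asynchronous progress of the blocked version.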

RKOpt: A package for the design of numerical ODE solvers (Journal of Open Source Software, The Open Journal, 2020-10-30) [Article] Ordinary and partial differential equations (ODEs and PDEs) are used to model many important phenomena. In most cases, solutions of these models must be approximated by numerical methods. Most of the relevant algorithms fall within a few classes of methods, with the properties of individual methods determined by their coefficients. The choice of appropriate coefficients in the design of methods for specific applications is an important area of research. RKOpt is a software package for designing numerical ODE solvers with coefficients optimally chosen to provide desired properties. It is available from https://github.com/ketch/RKOpt, with documentation at http://numerics.kaust.edu.sa/RKOpt/. The primary focus of the package is on the design of Runge-Kutta methods, but some routines for designing other classes of methods, such as multistep Runge-Kutta and general linear methods, are also included.
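A typical property targeted when optimizing Runge-Kutta coefficients is the size of the stability region. The self-contained sketch below (not RKOpt's API; function names are illustrative) measures the real-axis stability interval of the classical RK4 stability polynomial by bisection, the kind of quantity a coefficient optimizer would maximize.

```python
def R(z):
    """Stability polynomial of the classical four-stage RK4 method."""
    return 1.0 + z + z ** 2 / 2 + z ** 3 / 6 + z ** 4 / 24

def real_stability_interval(poly, upper=10.0, tol=1e-8):
    """Largest r with |poly(-x)| <= 1 for all x in [0, r], by bisection.
    The predicate 'stable up to r' is monotone in r, so bisection applies."""
    def stable(r, samples=1000):
        return all(abs(poly(-r * k / samples)) <= 1.0 + 1e-13
                   for k in range(samples + 1))
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stable(mid):
            lo = mid
        else:
            hi = mid
    return lo

r = real_stability_interval(R)  # about 2.785 for RK4
```

RKOpt solves the inverse problem: rather than measuring the interval of a given method, it searches for coefficients whose stability polynomial makes such an interval (or another objective) as large as possible.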

On the robustness and performance of entropy stable collocated discontinuous Galerkin methods (Journal of Computational Physics, Elsevier BV, 2020-10-22) [Article] In computational fluid dynamics, the demand for increasingly multidisciplinary reliable simulations, for both analysis and design optimization purposes, requires transformational advances in individual components of future solvers. At the algorithmic level, hardware compatibility and efficiency are of paramount importance in determining viability at exascale and beyond. However, equally important (if not more so) is algorithmic robustness with minimal user intervention, which becomes progressively more challenging to achieve as problem size and physics complexity increase. We numerically show that low- and high-order entropy stable collocated discontinuous Galerkin discretizations based on summation-by-parts operators and the simultaneous-approximation-term technique provide an essential step toward a truly enabling technology in terms of reliability and robustness for both under-resolved turbulent flow simulations and flows with discontinuities.

Bayes Meets Tikhonov: Understanding Uncertainty Within Gaussian Framework for Seismic Inversion (Springer Singapore, 2020-10-22) [Book Chapter] In this chapter, we demonstrate the sound connection between the Bayesian approach and Tikhonov regularisation within a Gaussian framework. We provide a thorough uncertainty analysis to answer the following two fundamental questions: (1) How well is the estimate determined by the a posteriori PDF, i.e. by the combination of observed data and a priori information? (2) What are the respective contributions of observed data and a priori information? To support the proposed methodology, we demonstrate it through numerical applications in seismic inversion.

Hierarchical Matrix Approximations of Hessians Arising in Inverse Problems Governed by PDEs (SIAM Journal on Scientific Computing, Society for Industrial & Applied Mathematics (SIAM), 2020-10-22) [Article] Hessian operators arising in inverse problems governed by partial differential equations (PDEs) play a critical role in delivering efficient, dimension-independent convergence for Newton solution of deterministic inverse problems, as well as Markov chain Monte Carlo sampling of posteriors in the Bayesian setting. These methods require the ability to repeatedly perform operations on the Hessian such as multiplication with arbitrary vectors, solving linear systems, inversion, and (inverse) square root. Unfortunately, the Hessian is a (formally) dense, implicitly defined operator that is intractable to form explicitly for practical inverse problems, requiring as many PDE solves as inversion parameters. Low-rank approximations are effective when the data contain limited information about the parameters but become prohibitive as the data become more informative. However, the Hessians for many inverse problems arising in practical applications can be well approximated by matrices that have hierarchically low-rank structure. Hierarchical matrix representations promise to overcome the high complexity of dense representations and provide effective data structures and matrix operations that have only log-linear complexity. In this work, we describe algorithms for constructing and updating hierarchical matrix approximations of Hessians, and illustrate them on a number of representative inverse problems involving time-dependent diffusion, advection-dominated transport, frequency domain acoustic wave propagation, and low-frequency Maxwell equations, demonstrating up to an order of magnitude speedup compared to globally low-rank approximations.
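The savings behind hierarchical (and block low-rank) representations come from storing off-diagonal coupling blocks in factored low-rank form, so applying them costs linear rather than quadratic work. A deliberately tiny sketch with a single rank-1 off-diagonal coupling (names are illustrative; real hierarchical formats nest this recursively with adaptive ranks, which is well beyond this snippet):

```python
def lowrank_block_matvec(D1, D2, u, v, x):
    """Multiply the 2x2 block matrix [[D1, u v^T], [v u^T, D2]] by x, with the
    off-diagonal coupling stored as a rank-1 outer product (vectors u, v).
    Applying the coupling costs O(n) instead of the O(n^2) of a dense block."""
    n = len(u)
    x1, x2 = x[:n], x[n:]
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    top = [sum(D1[i][j] * x1[j] for j in range(n)) + u[i] * dot(v, x2)
           for i in range(n)]
    bot = [sum(D2[i][j] * x2[j] for j in range(n)) + v[i] * dot(u, x1)
           for i in range(n)]
    return top + bot

# Tiny worked example; easy to check against the dense equivalent by hand
y = lowrank_block_matvec([[2.0, 0.0], [0.0, 2.0]],   # D1
                         [[3.0, 1.0], [1.0, 3.0]],   # D2
                         [1.0, 2.0], [1.0, -1.0],    # u, v (rank-1 factors)
                         [1.0, 1.0, 1.0, 1.0])       # x
```

Because all the repeated Hessian operations listed in the abstract (matvec, solve, inversion, square root) reduce to block operations of this kind, storing the blocks in factored form is what yields the log-linear complexity the paper exploits.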

Stochastic Levenberg-Marquardt for Solving Optimization Problems on Hardware Accelerators (Submitted to IEEE, 2020-10-19) [Preprint] We present a new Stochastic Levenberg-Marquardt (SLM) algorithm for efficiently solving large-scale nonlinear least-squares optimization problems. The SLM method incorporates stochasticity into the traditional Levenberg-Marquardt (LM) method. While the traditional LM operates on the full objective function, SLM randomly evaluates part of the objective to compute the corresponding derivatives and function values. Hence, SLM reduces the algorithmic complexity per iteration and speeds up the overall time to solution, while maintaining the numerical robustness of second-order methods. We assess the SLM method on standard datasets from LIBSVM, as well as on a large-scale optimization problem found in ground-based astronomy applications, in particular the adaptive optics systems of the next generation of instruments for the European Very Large and Extremely Large Telescopes. We implement SLM and deploy it on a shared-memory system equipped with multiple GPU hardware accelerators. We demonstrate the performance superiority of the SLM method over not only the traditional LM algorithm but also state-of-the-art first-order methods. SLM finishes the optimization process in less than 1 second on large datasets from the adaptive optics application, where LM and other methods require more than a few minutes. This enables identification of the system parameters (e.g., atmospheric turbulence and wind speed) and capture of their evolution during a night of observations with close to real-time throughput.
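The per-iteration structure described above can be sketched in a few lines: draw a random subset of residuals, form the damped normal equations from that mini-batch only, and keep the classic LM accept/reject control of the damping parameter. A toy sketch for a two-parameter exponential fit (model, names, and batch policy are illustrative, not the paper's GPU implementation):

```python
import math
import random

def slm_fit(xs, ys, a, b, batch=10, iters=500, seed=1):
    """Stochastic Levenberg-Marquardt sketch for the model y = a * exp(b * x):
    each iteration samples a mini-batch of residuals, forms the damped 2x2
    normal equations (J^T J + lam*I) delta = -J^T r from that batch only,
    and keeps the classic LM accept/reject control of the damping lam."""
    rng = random.Random(seed)
    lam = 1.0
    for _ in range(iters):
        idx = rng.sample(range(len(xs)), batch)

        def cost(aa, bb):
            try:
                return sum((aa * math.exp(bb * xs[i]) - ys[i]) ** 2 for i in idx)
            except OverflowError:
                return float("inf")   # reject wildly overshooting trial steps

        h11 = h12 = h22 = g1 = g2 = 0.0
        for i in idx:
            e = math.exp(b * xs[i])
            r = a * e - ys[i]
            j1, j2 = e, a * xs[i] * e          # partials w.r.t. a and b
            h11 += j1 * j1; h12 += j1 * j2; h22 += j2 * j2
            g1 += j1 * r;   g2 += j2 * r
        det = (h11 + lam) * (h22 + lam) - h12 * h12
        da = (-g1 * (h22 + lam) + g2 * h12) / det
        db = (-g2 * (h11 + lam) + g1 * h12) / det
        if cost(a + da, b + db) < cost(a, b):  # trust control on the batch cost
            a, b, lam = a + da, b + db, max(lam * 0.5, 1e-9)
        else:
            lam *= 10.0
    return a, b

# Noiseless data from a = 2, b = 0.5; SLM should recover both parameters
xs = [i / 10.0 for i in range(50)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_fit, b_fit = slm_fit(xs, ys, 1.5, 0.3)
```

The subsampling is where the per-iteration savings come from: the batch Jacobian is a small slice of the full one, yet the damped second-order solve retains LM's robustness.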

Extrapolating low-frequency prestack land data with deep learning (Society of Exploration Geophysicists, 2020-10-01) [Conference Paper] Missing low-frequency content in seismic data is a common challenge for seismic inversion. Long wavelengths are necessary to reveal large structures in the subsurface and to build an acceptable starting point for later iterations of full-waveform inversion (FWI). High-frequency land seismic data are particularly challenging due to the elastic nature of the Earth contrasting with the acoustic air at the typically rugged free surface, which makes the use of low frequencies even more vital to the inversion. We propose a supervised deep learning framework for bandwidth extrapolation of prestack elastic data in the time domain. We utilize a Convolutional Neural Network (CNN) with a U-Net-inspired architecture to convert portions of band-limited shot gathers from the 5-15 Hz band to the 0-5 Hz band. In the synthetic experiment, we train the network on 192x192 patches of wavefields simulated for different cross-sections of the elastic SEAM Arid model with a free surface. Then, we test the network on unseen shot gathers from the same model to demonstrate the viability of the approach. The results show promise for future field data applications.

Elastic near-surface model estimation from full waveforms by deep learning (Society of Exploration Geophysicists, 2020-09-30) [Conference Paper] Strong near-surface heterogeneity poses a major challenge for seismic imaging of deep targets in arid environments. Inspired by the initial success of deep learning applications to inverse problems, we investigate the possibility of building near-surface models directly from raw elastic data including surface and body waves in arid conditions. Namely, we train a convolutional neural network to map the data into the model directly, in a supervised way, on part of the SEAM Arid synthetic dataset and evaluate its performance on a different part of the same dataset. The main feature of our approach is that we estimate the model as a set of 1D vertical velocity profiles, utilizing relevant subsets of input data from neighboring locations. This effectively reduces the data and label spaces for a more practical neural network application.

Explicit coupling of acoustic and elastic wave propagation in finite-difference simulations (Geophysics, Society of Exploration Geophysicists, 2020-09-30) [Article] We present a mechanism to explicitly couple the finite-difference discretizations of 2D acoustic and isotropic elastic-wave systems that are separated by straight interfaces. Such coupled simulations allow for the application of the elastic model to geological regions that are of special interest for seismic exploration studies (e.g., the areas surrounding salt bodies), with the computationally more tractable acoustic model still being applied in the background regions. Specifically, the acoustic wave system is expressed in terms of velocity and pressure, while the elastic wave system is expressed in terms of velocity and stress. Both systems are posed in first-order form and are discretized on staggered grids. Special variants of the standard finite-difference operators, namely, operators that possess the summation-by-parts property, are used for the approximation of spatial derivatives. Penalty terms, also referred to as simultaneous approximation terms, are designed to weakly impose the elastic-acoustic interface conditions in the finite-difference discretizations and couple the elastic and acoustic wave simulations together. With the presented mechanism, we are able to perform the coupled elastic-acoustic wave simulations stably and accurately. Moreover, it is shown that the energy-conserving property of the continuous systems can be preserved in the discretized systems with carefully designed penalty terms.

Preconditioned BFGS-based Uncertainty Quantification in elastic Full Waveform Inversion (arXiv, 2020-09-26) [Preprint] Full Waveform Inversion (FWI) has become an essential technique for mapping geophysical subsurface structures. However, proper uncertainty quantification is often lacking in current applications. In theory, uncertainty quantification is related to the inverse Hessian (or the posterior covariance matrix), which, even for common geophysical inverse problems, is beyond the computational and storage capacities of the largest high-performance computing systems. In this study, we amend the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm to perform uncertainty quantification for large-scale applications. For seismic inverse problems, the limited-memory BFGS (L-BFGS) method prevails as the most efficient quasi-Newton method. We aim to further augment it to obtain an approximate inverse Hessian for uncertainty quantification in FWI. To facilitate retrieval of the inverse Hessian, we combine BFGS (essentially a full-history L-BFGS) with randomized singular value decomposition to determine a low-rank approximation of the inverse Hessian. Setting the rank number equal to the number of iterations makes this solution efficient and memory-affordable even for large-scale problems. Furthermore, based on the Gauss-Newton method, we formulate different initial, diagonal Hessian matrices as preconditioners for the inverse scheme and compare their performance in elastic FWI applications. We highlight our approach with the elastic Marmousi benchmark, demonstrating the applicability of preconditioned BFGS for large-scale FWI and uncertainty quantification.
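The "full-history L-BFGS" ingredient above applies the inverse-Hessian approximation through the standard two-loop recursion over all stored curvature pairs. A minimal sketch of that application step (the paper's randomized-SVD compression and Gauss-Newton preconditioning are not shown; names are illustrative):

```python
def bfgs_apply_inverse_hessian(s_list, y_list, v, h0=1.0):
    """Apply the BFGS inverse-Hessian approximation H_k to a vector v via the
    standard two-loop recursion over ALL stored pairs (full-history L-BFGS),
    with s_i = x_{i+1} - x_i, y_i = grad_{i+1} - grad_i, and H_0 = h0 * I."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    q = list(v)
    alphas, rhos = [], []
    for s, y in reversed(list(zip(s_list, y_list))):   # newest pair first
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        alphas.append(alpha)
        rhos.append(rho)
    r = [h0 * qi for qi in q]                           # apply H_0
    for (s, y), alpha, rho in zip(zip(s_list, y_list),  # oldest pair first
                                  reversed(alphas), reversed(rhos)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r

# Sanity check: BFGS always satisfies the secant condition H_k y_{k-1} = s_{k-1}
s_list = [[1.0, 0.0], [0.0, 1.0]]
y_list = [[2.0, 1.0], [1.0, 3.0]]
Hy = bfgs_apply_inverse_hessian(s_list, y_list, y_list[-1])  # should equal s_list[-1]
```

Because the recursion touches each stored pair once, applying H_k to a set of random probe vectors is cheap, which is what makes the randomized-SVD extraction of a low-rank inverse-Hessian approximation practical at scale.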

High-order accurate entropy-stable discontinuous collocated Galerkin methods with the summation-by-parts property for compressible CFD frameworks: Scalable SSDC algorithms and flow solver (Journal of Computational Physics, Elsevier BV, 2020-09-22) [Article] This work reports on the performance of a fully-discrete hp-adaptive entropy stable discontinuous collocated Galerkin method for the compressible Navier–Stokes equations. The resulting code framework is denoted SSDC: the first S for entropy, the second for stable, and DC for discontinuous collocated. The method is endowed with the summation-by-parts property, allows for arbitrary spatial and temporal order, and is implemented in an unstructured high-performance solver. The considered class of fully-discrete algorithms is systematically designed with mimetic and structure-preserving properties that allow the transfer of continuous proofs to the fully discrete setting. Our goal is to provide numerical evidence of the adequacy and maturity of these high-order methods as potential base schemes for the next generation of unstructured computational fluid dynamics tools. We provide a series of test cases of increasing difficulty, ranging from non-smooth to turbulent flows, in order to evaluate the numerical performance of the algorithms. Results on weak and strong scaling of the distributed-memory implementation demonstrate that the parallel SSDC solver can scale efficiently over 100,000 processes.

Effects of Composition Heterogeneities on Flame Kernel Propagation: A DNS Study (Fluids, MDPI AG, 2020-09-04) [Article] In this study, a new set of direct numerical simulations is generated and used to examine the influence of mixture composition heterogeneities on the propagation of a premixed iso-octane/air spherical turbulent flame, with a representative chemical description. The dynamic effects of both turbulence and combustion heterogeneities are considered, and their competition is assessed. The results of the turbulent homogeneous case are compared with those of heterogeneous cases characterized by multiple stratification length scales and segregation rates in the wrinkled flame regime. The comparison reveals that stratification does not alter turbulent flame behaviors such as the preferential alignment of the convex flame front with the direction of compression. However, we find that the overall flame front propagation is slower in the presence of heterogeneities because of differences in propagation speed. Furthermore, an analysis of the different displacement speed components is performed, taking the multi-species formalism into account. This analysis shows that the global flame propagation front slows down due to the heterogeneities through the reaction mechanism and the differential diffusion, accompanied by flame surface density variations. Quantification of the effects of each of these mechanisms shows that their intensity increases with increasing stratification length scale and segregation rate.

Maximizing I/O bandwidth for reverse time migration on heterogeneous large-scale systems (Springer International Publishing, 2020-08-18) [Conference Paper] Reverse Time Migration (RTM) is an important scientific application for oil and gas exploration. The 3D RTM simulation generates terabytes of intermediate data that do not fit in main memory. In particular, RTM has two successive computational phases, i.e., the forward modeling and the backward propagation, that necessitate writing and then reading the state of the computed solution grid at specific time steps of the time integration. Advances in memory architecture have made it feasible and affordable to integrate hierarchical storage media on large-scale systems, starting from the traditional Parallel File Systems (PFS), through intermediate fast disk technologies (e.g., node-local and remote-shared Burst Buffer), and up to CPU main memory. To address the trend of heterogeneous HPC system deployment, we introduce an extension to our Multilayer Buffer System (MLBS) framework to further maximize RTM I/O bandwidth in the presence of GPU hardware accelerators. The main idea is to leverage the GPU's High Bandwidth Memory (HBM) as an additional storage media layer. The objective of MLBS is ultimately to hide the application's I/O overhead by enabling a buffering mechanism operating across all the hierarchical storage media layers. MLBS is therefore able to sustain the I/O bandwidth at each storage media layer. By asynchronously performing expensive I/O operations and creating opportunities for overlapping data motion with computations, MLBS may transform the original I/O-bound behavior of the RTM application into a compute-bound regime. In fact, the prefetching strategy of MLBS allows the RTM application to believe that it has access to a larger memory capacity on the GPU, while transparently performing the necessary housekeeping across the storage layers. We demonstrate the effectiveness of MLBS on the Summit supercomputer using 2048 compute nodes equipped with a total of 12,288 GPUs, achieving up to 1.4X performance speedup compared to the reference PFS-based RTM implementation for a large 3D solution grid.
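The buffering principle MLBS relies on, staging data ahead of the consumer so that I/O overlaps compute, can be illustrated in miniature with a background thread and a bounded queue standing in for a fast intermediate buffer layer. This is a generic prefetcher sketch with illustrative names, not MLBS itself:

```python
import queue
import threading

def prefetching_reader(load, n_items, depth=2):
    """Yield load(0) .. load(n_items - 1), fetching ahead in a background
    thread. The bounded queue plays the role of a fast buffer layer: the
    consumer's compute overlaps the producer's (slow) I/O."""
    buf = queue.Queue(maxsize=depth)

    def worker():
        for i in range(n_items):
            buf.put(load(i))   # blocks when the buffer layer is full
        buf.put(None)          # sentinel: no more snapshots

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is None:
            return
        yield item

# Consume "snapshots" while the next ones are loaded in the background
snapshots = list(prefetching_reader(lambda i: i * i, 5))
```

MLBS generalizes this idea across several real layers (PFS, burst buffer, host memory, GPU HBM), with the housekeeping of moving snapshots between layers hidden from the application.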

Characterization of pressure fluctuations within a controlled-diffusion blade boundary layer using the equilibrium wall-modelled LES (Scientific Reports, Springer Science and Business Media LLC, 2020-07-29) [Article] In this study, the generation of airfoil trailing edge broadband noise that arises from the interaction of a turbulent boundary layer with the airfoil trailing edge is investigated. The primary objectives of this work are: (i) to apply a wall-modelled large-eddy simulation (WMLES) approach to predict the flow of air passing a controlled-diffusion blade, and (ii) to study the blade broadband noise that is generated from the interaction of a turbulent boundary layer with a lifting surface trailing edge. This study is carried out for two values of the Mach number, [Formula: see text] and 0.5, two values of the chord Reynolds number, [Formula: see text] and [Formula: see text], and two angles of attack, AoA [Formula: see text] and [Formula: see text]. To examine the influence of the grid resolution on aerodynamic and aeroacoustic quantities, we compare our results with experimental data available in the literature. We also compare our results with two in-house numerical solutions generated from two wall-resolved LES (WRLES) calculations, one of which has a DNS-like resolution. We show that WMLES accurately predicts the mean pressure coefficient distribution, velocity statistics (including the mean velocity), and the traces of Reynolds tensor components. Furthermore, we observe that the instantaneous flow structures computed by the WMLES resemble those found in the reference WRLES database, except near the leading edge region. Some of the differences observed in these structures are associated with tripping and the transition-to-turbulence mechanism near the leading edge, which are significantly affected by the grid resolution. The aeroacoustic noise calculations indicate that the power spectral density profiles obtained using the WMLES compare well with the experimental data.

MFEM: A modular finite element methods library (Computers and Mathematics with Applications, Elsevier BV, 2020-07-11) [Article] MFEM is an open-source, lightweight, flexible and scalable C++ library for modular finite element methods that features arbitrary high-order finite element meshes and spaces, support for a wide variety of discretization approaches, and emphasis on usability, portability, and high-performance computing efficiency. MFEM's goal is to provide application scientists with access to cutting-edge algorithms for high-order finite element meshing, discretizations and linear solvers, while enabling researchers to quickly and easily develop and test new algorithms in very general, fully unstructured, high-order, parallel and GPU-accelerated settings. In this paper we describe the underlying algorithms and finite element abstractions provided by MFEM, discuss the software implementation, and illustrate various applications of the library.

Fully discrete explicit locally entropy-stable schemes for the compressible Euler and Navier–Stokes equations (Computers & Mathematics with Applications, Elsevier BV, 2020-07-09) [Article] Recently, relaxation methods have been developed to guarantee the preservation of a single global functional of the solution of an ordinary differential equation. Here, we generalize this approach to guarantee local entropy inequalities for finitely many convex functionals (entropies) and apply the resulting methods to the compressible Euler and Navier–Stokes equations. Based on the unstructured hp-adaptive SSDC framework of entropy-conservative or dissipative semidiscretizations using summation-by-parts and simultaneous-approximation-term operators, we develop the first discretizations for compressible computational fluid dynamics that are primary conservative, locally entropy stable in the fully discrete sense under a usual CFL condition, explicit except for the parallelizable solution of a single scalar equation per element, and arbitrarily high-order accurate in space and time. We demonstrate the accuracy and the robustness of the fully-discrete explicit locally entropy-stable solver for a set of test cases of increasing complexity.
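For a single quadratic entropy, the relaxation idea mentioned above is easy to state: after computing the update direction d of a standard Runge-Kutta step, rescale it by a factor gamma chosen so that the entropy of the new state matches the target; for the entropy-conservative case with eta(u) = |u|^2/2, the nonzero root is gamma = -2(u . d)/(d . d). A minimal sketch with RK4 on the harmonic oscillator (illustrative only; the paper applies the idea locally, per element, with many entropies and inequalities rather than one global equality):

```python
def rk4_step(f, u, h):
    """Update direction d of one classical RK4 step (u_next = u + d)."""
    k1 = f(u)
    k2 = f([ui + 0.5 * h * ki for ui, ki in zip(u, k1)])
    k3 = f([ui + 0.5 * h * ki for ui, ki in zip(u, k2)])
    k4 = f([ui + h * ki for ui, ki in zip(u, k3)])
    return [h * (a + 2 * b + 2 * c + d) / 6.0
            for a, b, c, d in zip(k1, k2, k3, k4)]

def relaxed_rk4(f, u, h, steps):
    """RK4 with relaxation: rescale each update d by gamma so that the
    quadratic entropy eta(u) = |u|^2 / 2 is conserved exactly. For this eta,
    eta(u + gamma d) = eta(u) has the nonzero root gamma = -2 (u.d)/(d.d)."""
    for _ in range(steps):
        d = rk4_step(f, u, h)
        gamma = -2.0 * sum(ui * di for ui, di in zip(u, d)) \
                / sum(di * di for di in d)
        u = [ui + gamma * di for ui, di in zip(u, d)]
    return u

# Harmonic oscillator q' = p, p' = -q; the entropy (p^2 + q^2)/2 is conserved
f = lambda w: [w[1], -w[0]]
u = relaxed_rk4(f, [1.0, 0.0], h=0.1, steps=1000)
```

Because gamma stays close to 1, the rescaled step retains the accuracy of the underlying method while enforcing the entropy constraint exactly, which is the "explicit except for a single scalar equation per element" property highlighted in the abstract.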

Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications (ACM, 2020-06-18) [Conference Paper] Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems by performing a Cholesky factorization on a symmetric positive-definite covariance matrix, a demanding dense factorization in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime, to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets of up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions for environmental applications.