Recent Submissions

  • Characterization of pressure fluctuations within a controlled-diffusion blade boundary layer using the equilibrium wall-modelled LES.

    Boukharfane, Radouan; Parsani, Matteo; Bodart, Julien (Scientific reports, Springer Science and Business Media LLC, 2020-07-31) [Article]
    In this study, the generation of airfoil trailing edge broadband noise that arises from the interaction of a turbulent boundary layer with the airfoil trailing edge is investigated. The primary objectives of this work are: (i) to apply a wall-modelled large-eddy simulation (WMLES) approach to predict the flow of air passing a controlled-diffusion blade, and (ii) to study the blade broadband noise that is generated from the interaction of a turbulent boundary layer with a lifting surface trailing edge. This study is carried out for two values of the Mach number (the higher being 0.5), two values of the chord Reynolds number, and two angles of attack. To examine the influence of the grid resolution on aerodynamic and aeroacoustic quantities, we compare our results with experimental data available in the literature. We also compare our results with two in-house numerical solutions generated from two wall-resolved LES (WRLES) calculations, one of which has a DNS-like resolution. We show that WMLES accurately predicts the mean pressure coefficient distribution, velocity statistics (including the mean velocity), and the traces of the Reynolds tensor components. Furthermore, we observe that the instantaneous flow structures computed by the WMLES resemble those found in the reference WRLES database, except near the leading edge region. Some of the differences observed in these structures are associated with tripping and the transition-to-turbulence mechanism near the leading edge, which are significantly affected by the grid resolution. The aeroacoustic noise calculations indicate that the power spectral density profiles obtained using the WMLES compare well with the experimental data.
  • MFEM: A modular finite element methods library

    Anderson, Robert; Andrej, Julian; Barker, Andrew; Bramwell, Jamie; Camier, Jean Sylvain; Cerveny, Jakub; Dobrev, Veselin; Dudouit, Yohann; Fisher, Aaron; Kolev, Tzanio; Pazner, Will; Stowell, Mark; Tomov, Vladimir; Akkerman, Ido; Dahm, Johann; Medina, David; Zampini, Stefano (Computers and Mathematics with Applications, Elsevier BV, 2020-07-11) [Article]
    MFEM is an open-source, lightweight, flexible and scalable C++ library for modular finite element methods that features arbitrary high-order finite element meshes and spaces, support for a wide variety of discretization approaches and emphasis on usability, portability, and high-performance computing efficiency. MFEM's goal is to provide application scientists with access to cutting-edge algorithms for high-order finite element meshing, discretizations and linear solvers, while enabling researchers to quickly and easily develop and test new algorithms in very general, fully unstructured, high-order, parallel and GPU-accelerated settings. In this paper we describe the underlying algorithms and finite element abstractions provided by MFEM, discuss the software implementation, and illustrate various applications of the library.
  • Solving Acoustic Boundary Integral Equations Using High Performance Tile Low-Rank LU Factorization

    Al-Harthi, Noha A.; Alomairy, Rabab M.; Akbudak, Kadir; Chen, Rui; Ltaief, Hatem; Bagci, Hakan; Keyes, David E. (High Performance Computing, Springer International Publishing, 2020-06-18) [Conference Paper]
    We design and develop a new high performance implementation of a fast direct LU-based solver using low-rank approximations on massively parallel systems. The LU factorization is the most time-consuming step in solving systems of linear equations in the context of analyzing acoustic scattering from large 3D objects. The matrix equation is obtained by discretizing the boundary integral of the exterior Helmholtz problem using a higher-order Nyström scheme. The main idea is to exploit the inherent data sparsity of the matrix operator by performing local tile-centric approximations while still capturing the most significant information. In particular, the proposed LU-based solver leverages the Tile Low-Rank (TLR) data compression format as implemented in the Hierarchical Computations on Manycore Architectures (HiCMA) library to decrease the complexity of “classical” dense direct solvers from cubic to quadratic order. We taskify the underlying boundary integral kernels to expose fine-grained computations. We then employ the dynamic runtime system StarPU to orchestrate the scheduling of computational tasks on shared and distributed-memory systems. The resulting asynchronous execution compensates for the load imbalance due to the heterogeneous tile ranks, while mitigating the overhead of data motion. We assess the robustness of our TLR LU-based solver and study the qualitative impact when using different numerical accuracies. The new TLR LU factorization outperforms the state-of-the-art dense factorizations by up to an order of magnitude on various parallel systems, for analysis of scattering from large-scale 3D synthetic and real geometries.
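As a purely illustrative aside (this is not HiCMA's actual kernel), the tile-centric compression that TLR formats rely on can be sketched with adaptive cross approximation (ACA): a data-sparse tile is replaced by a short sum of outer products, cutting storage from O(n²) to O(nk).

```python
# Hypothetical sketch of TLR-style tile compression via adaptive cross
# approximation (ACA) with full pivoting; HiCMA's production kernels differ.
import math

def aca(A, tol=1e-9):
    """Compress A (list of rows) so that A ~ sum_k outer(us[k], vs[k])."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]                      # working residual copy
    us, vs = [], []
    for _ in range(min(m, n)):
        # full pivoting: take the residual entry of largest magnitude
        i, j = max(((r, c) for r in range(m) for c in range(n)),
                   key=lambda rc: abs(R[rc[0]][rc[1]]))
        piv = R[i][j]
        if abs(piv) < tol:                         # residual negligible: stop
            break
        u = [R[r][j] for r in range(m)]
        v = [R[i][c] / piv for c in range(n)]
        for r in range(m):                         # rank-1 update of residual
            for c in range(n):
                R[r][c] -= u[r] * v[c]
        us.append(u)
        vs.append(v)
    return us, vs

# an exactly rank-2 "tile": sin(i)*cos(j) is rank 1, i*j is rank 1
A = [[math.sin(i) * math.cos(j) + 0.5 * i * j for j in range(6)]
     for i in range(6)]
us, vs = aca(A)
approx = [[sum(u[i] * v[j] for u, v in zip(us, vs)) for j in range(6)]
          for i in range(6)]
err = max(abs(A[i][j] - approx[i][j]) for i in range(6) for j in range(6))
print(len(us))   # detected rank: 2
print(err)       # reconstruction error at round-off level
```

Production codes replace the O(n²) full pivot search with partial pivoting or randomized sampling, and truncate to a requested accuracy rather than to round-off.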
  • Hierarchical matrix approximations for space-fractional diffusion equations

    Boukaram, Wagih Halim; Lucchesi, Marco; Turkiyyah, George; Le Maître, Olivier; Knio, Omar; Keyes, David E. (Computer Methods in Applied Mechanics and Engineering, Elsevier BV, 2020-06-11) [Article]
    Space fractional diffusion models generally lead to dense discrete matrix operators, which pose substantial computational challenges when the system size becomes large. For a state of size N, the full representation of a fractional diffusion matrix requires O(N²) memory, with a similar estimate for matrix–vector products. In this work, we present H² matrix representations and algorithms that are amenable to efficient implementation on GPUs, and that can reduce the cost of storing these operators to O(N) asymptotically. Matrix–vector multiplications can be performed in asymptotically linear time as well. Performance of the algorithms is assessed in light of 2D simulations of the space fractional diffusion equation with constant diffusivity. Attention is focused on smooth particle approximations of the governing equations, which lead to discrete operators involving explicit radial kernels. The algorithms are first tested using the fundamental solution of the unforced space fractional diffusion equation in an unbounded domain, and then for the steady, forced, fractional diffusion equation in a bounded domain. Both matrix-inverse and pseudo-transient solution approaches are considered in the latter case. Our experiments show that the construction of the fractional diffusion matrix, the matrix–vector multiplication, and the generation of an approximate inverse preconditioner all perform very well on a single GPU on 2D problems with N in the range 10⁵–10⁶. In addition, the tests showed that, for the entire range of parameters and fractional orders considered, results obtained using the H² approximations were in close agreement with results obtained using dense operators, and exhibited the same spatial order of convergence. Overall, these experiences show that the H² matrix framework promises to provide practical means of handling large-scale space fractional diffusion models in several space dimensions, at a computational cost asymptotically similar to the cost of handling classical diffusion equations.
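The O(N) storage and linear-time matvec claims above rest on evaluating low-rank blocks in factored form rather than densely. A minimal, library-independent illustration (the function names below are hypothetical, not the paper's code):

```python
# If an n x n block satisfies A = U V^T with rank k << n, then A @ x can be
# computed as U @ (V^T @ x): two passes costing O(nk) instead of O(n^2).

def matvec_dense(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matvec_lowrank(U, V, x):
    # t = V^T x is a k-vector; the result U t is expanded back to length n
    k = len(U[0])
    t = [sum(V[i][j] * x[i] for i in range(len(x))) for j in range(k)]
    return [sum(U[i][j] * t[j] for j in range(k)) for i in range(len(U))]

# rank-1 example: A[i][j] = u[i] * v[j]
n = 4
u = [1.0, 2.0, 3.0, 4.0]
v = [0.5, -1.0, 2.0, 0.0]
A = [[u[i] * v[j] for j in range(n)] for i in range(n)]
U = [[u[i]] for i in range(n)]   # n x 1 factor
V = [[v[i]] for i in range(n)]   # n x 1 factor
x = [1.0, 1.0, 1.0, 1.0]
print(matvec_dense(A, x))        # O(n^2) evaluation
print(matvec_lowrank(U, V, x))   # same result at O(nk) cost
```

H² formats apply this idea hierarchically, with nested bases shared across levels, which is what brings the overall storage down to O(N).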
  • Benchmarking solvers for the one dimensional cubic nonlinear Klein–Gordon equation on a single core

    Muite, B. K.; Aseeri, Samar (Springer International Publishing, 2020-06-08) [Conference Paper]
    To determine the best method for solving a numerical problem modeled by a partial differential equation, one should consider the discretization of the problem, the computational hardware used, and the implementation of the software solution. In solving a scientific computing problem, the level of accuracy can also be important, with some numerical methods being efficient for low accuracy simulations, but others more efficient for high accuracy simulations. Very few high performance benchmarking efforts allow the computational scientist to easily measure such tradeoffs in order to obtain an accurate enough numerical solution at a low computational cost. These tradeoffs are examined in the numerical solution of the one-dimensional Klein–Gordon equation on single cores of an ARM CPU, an AMD x86-64 CPU, two Intel x86-64 CPUs, and a NEC SX-ACE vector processor. The work focuses on comparing the speed and accuracy of several high order finite difference spatial discretizations using a conjugate gradient linear solver and a fast Fourier transform based spatial discretization. In addition, implementations using second- and fourth-order timestepping are also included in the comparison. The work uses accuracy-efficiency frontiers to compare the effectiveness of five hardware platforms.
  • Asynchronous computations for solving the acoustic wave propagation equation

    Akbudak, Kadir; Ltaief, Hatem; Etienne, Vincent; Abdelkhalak, Rached; Tonellot, Thierry; Keyes, David E. (The International Journal of High Performance Computing Applications, SAGE Publications, 2020-05-19) [Article]
    The aim of this study is to design and implement an asynchronous computational scheme for solving the acoustic wave propagation equation with absorbing boundary conditions (ABCs) in the context of seismic imaging applications. While the convolutional perfectly matched layer (CPML) is typically used for ABCs in the oil and gas industry, its formulation further stresses memory accesses and decreases the arithmetic intensity at the physical domain boundaries. The challenges with CPML are twofold: (1) the strong, inherent data dependencies imposed on the explicit time-stepping scheme render asynchronous time integration cumbersome and (2) the idle time is further exacerbated by the load imbalance introduced among processing units. In fact, the CPML formulation of the ABCs requires expensive synchronization points, which may hinder the parallel performance of the overall asynchronous time integration. In particular, when deployed in conjunction with the multicore-optimized wavefront diamond temporal blocking (MWD-TB) approach for the inner domain points, it results in a major performance slow down. To relax CPML’s synchrony and mitigate the resulting load imbalance, we embed CPML’s calculation into MWD-TB’s inner loop and carry on the time integration with fine-grained computations in an asynchronous, holistic way. This comes at the price of storing transient results to alleviate dependencies from critical data hazards while maintaining the numerical accuracy of the original scheme. Performance and scalability results on various x86 architectures demonstrate the superiority of MWD-TB with CPML support against the standard spatial blocking on various grid sizes. To our knowledge, this is the first practical study that highlights the consolidation of CPML ABCs with asynchronous temporal blocking stencil computations.
  • Automated methods for the comparison of natural languages

    Wittum, Gabriel; Hoffer, Michael; Lemke, Babett; Jabs, Robert; Nägel, Arne (Computing and Visualization in Science, Springer Science and Business Media LLC, 2020-05-18) [Article]
    Starting from the general question of whether there is a connection between the mathematical capabilities of a student and their native language, we aim at comparing natural languages with mathematical language quantitatively. In [20] we set up an approach to compare language structures using Natural Language Processors (NLPs). However, difficulties arose with the quality of the structural analysis produced by the NLP used, even when comparing simple sentences in different but closely related natural languages. We now present a comparison of different available NLPs and discuss the results. The comparison confirms the results from [20], showing that current NLPs are not capable of analysing even simple sentences reliably enough for the resulting structures of different natural languages to be compared.
  • 3d Modeling and simulation of a harpsichord

    Larisch, Lukas; Lemke, Babett; Wittum, Gabriel (Computing and Visualization in Science, Springer Science and Business Media LLC, 2020-04-29) [Article]
    The mathematical characterization of the sound of a musical instrument still follows Schumann’s laws (Schumann in Physik der klangfarben, Leipzig, 1929). According to this theory, the resonances of the instrument body, “the formants”, filter the oscillations of the sound generator (e.g., strings) and produce the characteristic “timbre” of an instrument. This is a strong simplification of the actual situation. It applies to a point source and can be easily performed by a loudspeaker, disregarding the three dimensional structure of musical instruments. To describe the effect of the geometry and material of the instruments, we set up a 3d model and simulate it using the simulation system UG4 (Vogel et al. in Comput Vis Sci 16(4):165–179, 2013; Reiter et al. in Comput Vis Sci 16(4):151–164, 2014). We aim to capture the oscillation behavior of eigenfrequencies of a harpsichord soundboard and investigate how well a model for the oscillation behavior of the soundboard approximates the oscillation behavior of the whole instrument. We resolve the complicated geometry by several unstructured 3d grids and take into account the anisotropy of wood. The oscillation behavior of the soundboard is modeled following the laws of linear orthotropic elasticity with homogeneous boundary conditions. The associated eigenproblem is discretized using FEM and solved with the iterative PINVIT method using an efficient GMG preconditioner (Neymeyr in A hierarchy of preconditioned eigensolvers for elliptic differential operators. Habilitation dissertation, University of Tübingen, 2001). The latter allows us to resolve the harpsichord with a high resolution hybrid grid, which is required to capture fine modes of the simulated eigenfrequencies. We computed the first 16 eigenmodes and eigenfrequencies, with a resolution of 1.8 billion unknowns each, on the Shaheen II supercomputer (https://www.hpc.kaust.edu.sa/content/shaheen-ii). To verify our results, we compare them with measurement data obtained from an experimental modal analysis of a real reference harpsichord.
  • Mesh generation for thin layered domains and its application to parallel multigrid simulation of groundwater flow

    Reiter, Sebastian; Logashenko, Dmitry; Vogel, Andreas; Wittum, Gabriel (Computing and Visualization in Science, Springer Science and Business Media LLC, 2020-04-20) [Article]
    The generation of detailed three dimensional meshes for the simulation of groundwater flow in thin layered domains is crucial to capture important properties of the underlying domains and to reach a satisfying accuracy. At the same time, this level of detail poses high demands both on suitable hardware and numerical solver efficiency. Parallel multigrid methods have been shown to exhibit near optimal weak scalability for massively parallel computations of density driven flow. A fully automated parameterized algorithm for prism based meshing of coarse grids from height data of individual layers is presented. Special structures like pinch outs of individual layers are preserved. The resulting grid is used as a starting point for parallel mesh and hierarchy creation through interweaved projected refinement and redistribution. Efficiency and applicability of the proposed approach are demonstrated for a parallel multigrid based simulation of a realistic sample problem.
  • Hierarchical Matrix Approximations of Hessians Arising in Inverse Problems Governed by PDEs

    Ambartsumyan, Ilona; Boukaram, Wagih Halim; Bui-Thanh, Tan; Ghattas, Omar; Keyes, David E.; Stadler, Georg; Turkiyyah, George; Zampini, Stefano (arXiv, 2020-03-23) [Preprint]
    Hessian operators arising in inverse problems governed by partial differential equations (PDEs) play a critical role in delivering efficient, dimension-independent convergence both for Newton solution of deterministic inverse problems and for Markov chain Monte Carlo sampling of posteriors in the Bayesian setting. These methods require the ability to repeatedly perform such operations on the Hessian as multiplication with arbitrary vectors, solving linear systems, inversion, and (inverse) square root. Unfortunately, the Hessian is a (formally) dense, implicitly-defined operator that is intractable to form explicitly for practical inverse problems, requiring as many PDE solves as inversion parameters. Low rank approximations are effective when the data contain limited information about the parameters, but become prohibitive as the data become more informative. However, the Hessians for many inverse problems arising in practical applications can be well approximated by matrices that have hierarchically low rank structure. Hierarchical matrix representations promise to overcome the high complexity of dense representations and provide effective data structures and matrix operations that have only log-linear complexity. In this work, we describe algorithms for constructing and updating hierarchical matrix approximations of Hessians, and illustrate them on a number of representative inverse problems involving time-dependent diffusion, advection-dominated transport, frequency domain acoustic wave propagation, and low frequency Maxwell equations, demonstrating up to an order of magnitude speedup compared to globally low rank approximations.
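The contrast above between globally low-rank and hierarchical approximations starts from the fact that such Hessians are only accessible through matrix-vector products, each costing a pair of PDE solves. A toy randomized range finder illustrates that access pattern, with a hidden rank-2 operator as a purely hypothetical stand-in for a Hessian (not the authors' algorithms):

```python
# Toy sketch: build a low-rank approximation of an operator available only
# through matvecs, by probing its range with random vectors.
import random
random.seed(0)

n = 50
u = [random.gauss(0, 1) for _ in range(n)]
w = [random.gauss(0, 1) for _ in range(n)]
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def matvec(x):
    """Hidden operator A = u a^T + w b^T, exposed only through matvecs."""
    s, t = dot(a, x), dot(b, x)
    return [ui * s + wi * t for ui, wi in zip(u, w)]

# probe the range with random vectors; orthonormalize the images with
# modified Gram-Schmidt, dropping directions that are numerically zero
Q = []
for _ in range(5):
    y = matvec([random.gauss(0, 1) for _ in range(n)])
    for q in Q:
        c = dot(q, y)
        y = [yi - c * qi for yi, qi in zip(y, q)]
    nrm = dot(y, y) ** 0.5
    if nrm > 1e-10:
        Q.append([yi / nrm for yi in y])

print(len(Q))  # numerical rank recovered from matvecs alone: 2

# A ~ Q (Q^T A); check that projecting A x onto span(Q) reproduces it
x = [random.gauss(0, 1) for _ in range(n)]
Ax = matvec(x)
coef = [dot(q, Ax) for q in Q]
proj = [sum(c * q[i] for c, q in zip(coef, Q)) for i in range(n)]
print(max(abs(p - t) for p, t in zip(proj, Ax)))  # round-off level
```

A global approximation like this becomes prohibitive when the data are informative (the rank grows); the hierarchical representations in the paper instead apply low-rank compression blockwise.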
  • Performance study of sustained petascale direct numerical simulation on Cray XC40 systems

    Hadri, Bilel; Parsani, Matteo; Hutchinson, Maxwell; Heinecke, Alexander; Dalcin, Lisandro; Keyes, David E. (Concurrency and Computation: Practice and Experience, Wiley, 2020-03-17) [Article]
    We present in this paper a comprehensive performance study of highly efficient extreme scale direct numerical simulations of secondary flows, using an optimized version of Nek5000. Our investigations are conducted on various Cray XC40 systems, using a very high-order spectral element method. Single-node efficiency is achieved by auto-generated assembly implementations of small matrix multiplies and key vector-vector operations, streaming lossless I/O compression, aggressive loop merging, and selective single precision evaluations. Comparative studies across different Cray XC40 systems at scale, Trinity (LANL), Cori (NERSC), and ShaheenII (KAUST), show that the programming environment, network configuration, parallel file system, and burst buffer all have a major impact on performance. All three systems possess similar hardware, with similar CPU nodes and parallel file systems, but they have different theoretical peak network bandwidths, different OSs, and different versions of the programming environment. Our study reveals how these slight configuration differences can be critical in terms of performance of the application. We also find that with 9216 nodes (294 912 cores) on Trinity XC40 the applications sustain petascale performance, as well as 50% of peak memory bandwidth over the entire solver (500 TB/s in aggregate). On 3072 Xeon Phi nodes of Cori, we reach 378 TFLOP/s with an aggregated bandwidth of 310 TB/s, corresponding to a time-to-solution 2.11× faster than obtained with the same number of (dual-socket) Xeon nodes.
  • Entropy stable h/p-nonconforming discretization with the summation-by-parts property for the compressible Euler and Navier–Stokes equations

    Fernández, David C. Del Rey; Carpenter, Mark H.; Dalcin, Lisandro; Zampini, Stefano; Parsani, Matteo (SN Partial Differential Equations and Applications, Springer Science and Business Media LLC, 2020-03-16) [Article]
    In this paper, the entropy conservative/stable algorithms presented by Del Rey Fernández and coauthors [18,16,17] for the compressible Euler and Navier–Stokes equations on nonconforming p-refined/coarsened curvilinear grids are extended to h/p refinement/coarsening. The main difficulty in developing nonconforming algorithms is the construction of appropriate coupling procedures across nonconforming interfaces. Here, a computationally simple and efficient approach based upon using decoupled interpolation operators is utilized. The resulting scheme is entropy conservative/stable and element-wise conservative. Numerical simulations of the isentropic vortex and viscous shock propagation confirm the entropy conservation/stability and accuracy properties of the method (achieving ~p + 1 convergence), which are comparable to those of the original conforming scheme [4,35]. Simulations of the Taylor–Green vortex at Re = 1,600 and turbulent flow past a sphere at Re = 2,000 show the robustness and stability properties of the overall spatial discretization for unstructured grids. Finally, to demonstrate the entropy conservation property of a fully-discrete explicit entropy stable algorithm with h/p refinement/coarsening, we present the time evolution of the entropy function obtained by simulating the propagation of the isentropic vortex using a relaxation Runge–Kutta scheme.
  • Relaxation Runge--Kutta Methods: Fully Discrete Explicit Entropy-Stable Schemes for the Compressible Euler and Navier--Stokes Equations

    Ranocha, Hendrik; Sayyari, Mohammed; Dalcin, Lisandro; Parsani, Matteo; Ketcheson, David I. (SIAM Journal on Scientific Computing, Society for Industrial & Applied Mathematics (SIAM), 2020-03-12) [Article]
    The framework of inner product norm preserving relaxation Runge--Kutta methods [D. I. Ketcheson, SIAM J. Numer. Anal., 57 (2019), pp. 2850--2870] is extended to general convex quantities. Conservation, dissipation, or other solution properties with respect to any convex functional are enforced by the addition of a relaxation parameter that multiplies the Runge--Kutta update at each step. Moreover, other desirable stability (such as strong stability preservation) and efficiency (such as low storage requirements) properties are preserved. The technique can be applied to both explicit and implicit Runge--Kutta methods and requires only a small modification to existing implementations. The computational cost at each step is the solution of one additional scalar algebraic equation for which a good initial guess is available. The effectiveness of this approach is proved analytically and demonstrated in several numerical examples, including applications to high order entropy-conservative and entropy-stable semidiscretizations on unstructured grids for the compressible Euler and Navier--Stokes equations.
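For a quadratic conserved functional, the "one additional scalar algebraic equation" mentioned above even has a closed-form root. A minimal sketch (not the authors' implementation) wrapping classical RK4 on a harmonic oscillator:

```python
# Hypothetical sketch of relaxation Runge-Kutta: after each RK4 step, the
# update direction d is rescaled by a relaxation parameter gamma chosen so
# that a convex functional (here the quadratic energy ||u||^2 of a harmonic
# oscillator) is exactly conserved.

def f(u):
    p, q = u
    return (q, -p)          # harmonic oscillator: p^2 + q^2 is invariant

def rk4_direction(u, h):
    """Return d such that the classical RK4 step is u + d."""
    k1 = f(u)
    k2 = f(tuple(ui + 0.5 * h * ki for ui, ki in zip(u, k1)))
    k3 = f(tuple(ui + 0.5 * h * ki for ui, ki in zip(u, k2)))
    k4 = f(tuple(ui + h * ki for ui, ki in zip(u, k3)))
    return tuple(h / 6.0 * (a + 2 * b + 2 * c + d)
                 for a, b, c, d in zip(k1, k2, k3, k4))

def relaxed_step(u, h):
    d = rk4_direction(u, h)
    # For E(u) = <u, u>, the relaxation equation E(u + gamma*d) = E(u)
    # has the closed-form nonzero root gamma = -2<u,d>/<d,d>.
    gamma = (-2.0 * sum(ui * di for ui, di in zip(u, d))
             / sum(di * di for di in d))
    return tuple(ui + gamma * di for ui, di in zip(u, d)), gamma

u = (1.0, 0.0)
E0 = sum(x * x for x in u)
for _ in range(1000):
    u, gamma = relaxed_step(u, 0.1)
E = sum(x * x for x in u)
print(abs(E - E0))   # energy drift: zero up to accumulated round-off
print(gamma)         # relaxation parameter stays very close to 1
```

For general convex functionals, as in the paper, the scalar equation is solved numerically (e.g., by Newton's method), with the previous step's gamma as a good initial guess.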
  • Solution of the 3D density-driven groundwater flow problem with uncertain porosity and permeability

    Litvinenko, Alexander; Logashenko, Dmitry; Tempone, Raul; Wittum, Gabriel; Keyes, David E. (GEM - International Journal on Geomathematics, Springer Science and Business Media LLC, 2020-03-02) [Article]
    The pollution of groundwater, essential for supporting populations and agriculture, can have catastrophic consequences. Thus, accurate modeling of water pollution at the surface and in groundwater aquifers is vital. Here, we consider a density-driven groundwater flow problem with uncertain porosity and permeability. Addressing this problem is relevant for geothermal reservoir simulations, natural saline-disposal basins, modeling of contaminant plumes and subsurface flow predictions. This strongly nonlinear time-dependent problem describes the convection of a two-phase flow, whereby a liquid flows and propagates into groundwater reservoirs under the force of gravity to form so-called “fingers”. To achieve an accurate numerical solution, fine spatial resolution with an unstructured mesh and, therefore, high computational resources are required. Here we run a parallelized simulation toolbox ug4 with a geometric multigrid solver on a parallel cluster, and the parallelization is carried out in physical and stochastic spaces. Additionally, we demonstrate how the ug4 toolbox can be run in a black-box fashion for testing different scenarios in the density-driven flow. As a benchmark, we solve the Elder-like problem in a 3D domain. For approximations in the stochastic space, we use the generalized polynomial chaos expansion. We compute the mean, variance, and exceedance probabilities for the mass fraction. We use the solution obtained from the quasi-Monte Carlo method as a reference solution.
  • Fully implicit hybrid two-level domain decomposition algorithms for two-phase flows in porous media on 3D unstructured grids

    Luo, Li; Liu, Lulu; Cai, Xiao Chuan; Keyes, David E. (Journal of Computational Physics, Elsevier BV, 2020-02-07) [Article]
    Simulation of subsurface flows in porous media is difficult due to the nonlinearity of the operators and the high heterogeneity of material coefficients. In this paper, we present a scalable fully implicit solver for incompressible two-phase flows based on overlapping domain decomposition methods. Specifically, an inexact Newton-Krylov algorithm with analytic Jacobian is used to solve the nonlinear systems arising from the discontinuous Galerkin discretization of the governing equations on 3D unstructured grids. The linear Jacobian system is preconditioned by additive Schwarz algorithms, which are naturally suitable for parallel computing. We propose a hybrid two-level version of the additive Schwarz preconditioner consisting of a nested coarse space to improve the robustness and scalability of the classical one-level version. On the coarse level, a smaller linear system arising from the same discretization of the problem on a coarse grid is solved by using GMRES with a one-level preconditioner until a relative tolerance is reached. Numerical experiments are presented to demonstrate the effectiveness and efficiency of the proposed solver for 3D heterogeneous medium problems. We also report the parallel scalability of the proposed algorithms on a supercomputer with up to 8,192 processor cores.
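A toy sketch of the one-level additive Schwarz building block described above, applied to the 1D Poisson matrix rather than two-phase flow (all sizes and names are illustrative, not the paper's solver):

```python
# One-level additive Schwarz for A = tridiag(-1, 2, -1):
# M^{-1} r = sum_i R_i^T A_i^{-1} R_i r over overlapping index blocks,
# used here inside a damped Richardson iteration.

def amul(x):  # y = A x for A = tridiag(-1, 2, -1)
    n = len(x)
    return [2 * x[i] - (x[i-1] if i > 0 else 0.0)
                     - (x[i+1] if i < n - 1 else 0.0) for i in range(n)]

def tri_solve(r):  # solve tridiag(-1, 2, -1) y = r (Thomas algorithm)
    n = len(r)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = -0.5, r[0] / 2.0
    for i in range(1, n):
        m = 2.0 + cp[i-1]
        cp[i] = -1.0 / m
        dp[i] = (r[i] + dp[i-1]) / m
    y = [0.0] * n
    y[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = dp[i] - cp[i] * y[i+1]
    return y

def schwarz(r, blocks):  # M^{-1} r; blocks are overlapping (lo, hi) ranges
    z = [0.0] * len(r)
    for lo, hi in blocks:
        # a contiguous principal block of tridiag(-1,2,-1) is again
        # tridiag(-1,2,-1), so the same direct solve applies
        local = tri_solve(r[lo:hi])
        for k, val in enumerate(local):
            z[lo + k] += val
    return z

n = 16
blocks = [(0, 10), (6, 16)]              # two overlapping subdomains
b = [1.0] * n
x = [0.0] * n
for _ in range(2000):                    # preconditioned Richardson
    r = [bi - yi for bi, yi in zip(b, amul(x))]
    z = schwarz(r, blocks)
    x = [xi + 0.5 * zi for xi, zi in zip(x, z)]
res = max(abs(v) for v in [bi - yi for bi, yi in zip(b, amul(x))])
print(res)  # residual driven down to round-off levels
```

The paper's hybrid two-level variant adds a coarse-grid solve on top of such subdomain solves, which is what restores scalability as the number of subdomains grows.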
  • Square-Root Variable Metric based nullspace shuttle: a characterization of the non-uniqueness in elastic full-waveform inversion

    Liu, Qiancheng; Peter, Daniel (Journal of Geophysical Research: Solid Earth, American Geophysical Union (AGU), 2020-01-28) [Article]
    Full-waveform inversion (FWI) is for most geophysical applications an ill-posed inverse problem, with non-unique solutions. We examine its non-uniqueness by exploring the nullspace shuttle, which can efficiently generate an ensemble of data-fitting solutions. We construct this shuttle based on a quasi-Newton method, the square-root variable-metric (SRVM) method. The latter provides access to the inverse data-misfit Hessian in FWI for large-scale applications. Combining the SRVM method with a randomised singular value decomposition, we obtain the eigenvector subspaces of the inverse data-misfit Hessian. Its primary eigenvalue and eigenvector are considered to determine the null space of the inversion result. Using the SRVM-based nullspace shuttle, we can modify the inverted result a posteriori in a highly efficient manner without corrupting the data misfit. Also, because the SRVM method is embedded through elastic FWI, our method can be extended to multi-parameter problems. We confirm and highlight our approach with the elastic Marmousi example.
  • Hierarchical algorithms on hierarchical architectures

    Keyes, David E.; Ltaief, Hatem; Turkiyyah, G. (Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, The Royal Society, 2020-01-20) [Article]
    A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high-computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architectures possess hierarchies of their own that do not generally align with those of the algorithm. We describe modules of a software toolkit, hierarchical computations on manycore architectures, that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
  • Performance analysis of tile low-rank cholesky factorization using PaRSEC instrumentation tools

    Cao, Qinglei; Pei, Yu; Herault, Thomas; Akbudak, Kadir; Mikhalev, Aleksandr; Bosilca, George; Ltaief, Hatem; Keyes, David E.; Dongarra, Jack (IEEE, 2020-01-15) [Conference Paper]
    This paper highlights the necessary development of new instrumentation tools within the PaRSEC task-based runtime system to leverage the performance of low-rank matrix computations. In particular, the tile low-rank (TLR) Cholesky factorization represents one of the most critical matrix operations toward solving challenging large-scale scientific applications. The challenge resides in the heterogeneous arithmetic intensity of the various computational kernels, which stresses PaRSEC's dynamic engine when orchestrating the task executions at runtime. Such an irregular workload imposes the deployment of new scheduling heuristics to privilege the critical path, while exposing task parallelism to maximize hardware occupancy. To measure the effectiveness of PaRSEC's engine and its various scheduling strategies for tackling such workloads, it becomes paramount to implement adequate performance analysis and profiling tools tailored to fine-grained and heterogeneous task execution. This permits us not only to provide insights from PaRSEC, but also to identify potential applications' performance bottlenecks. These instrumentation tools may actually foster synergism between application and PaRSEC developers for productivity as well as high-performance computing purposes. We demonstrate the benefits of these amenable tools, while assessing the performance of TLR Cholesky factorization from data distribution, communication-reducing and synchronization-reducing perspectives. This tool-assisted performance analysis results in three major contributions: a new hybrid data distribution, a new hierarchical TLR Cholesky algorithm, and a new performance model for tuning the tile size. The new TLR Cholesky factorization achieves an 8X performance speedup over existing implementations on massively parallel supercomputers, toward solving large-scale 3D climate and weather prediction applications.
  • Mixed-Precision Tomographic Reconstructor Computations on Hardware Accelerators

    Doucet, Nicolas; Ltaief, Hatem; Gratadour, Damien; Keyes, David E. (IEEE, 2019-12-31) [Conference Paper]
    The computation of tomographic reconstructors (ToR) is at the core of a simulation framework to design the next generation of adaptive optics (AO) systems to be installed on future Extremely Large Telescopes (ELT). In fact, it is also a critical component for their operation on sky. The goals of these instruments range from the detection of the light from the most distant galaxies to the analysis of the composition of the atmospheres of terrestrial exoplanets. Based on advanced AO techniques, the instrument MOSAIC relies on a computational framework to filter out the Earth's atmospheric turbulence and eventually enhance the image quality of ground-based telescopes. The ToR calculation is a compute-bound operation based on the Cholesky factorization. Due to its cubical algorithmic complexity, the ToR may represent a major bottleneck for the E-ELT when scaling up the large number of wavefront sensors used in the baseline MOSAIC design. To mitigate this increasing dimensionality overhead, this paper presents the implementation of a novel mixed-precision Cholesky-based dense matrix solver on hardware accelerators. The new algorithm takes into account the data-sparse structure of the covariance matrix operator and uses the tensor cores of NVIDIA V100 GPUs to leverage performance at an unprecedented scale. To our knowledge, this is the first computational astronomy application that exploits V100's tensor cores outside of the traditional arena of artificial intelligence. Experimental results demonstrate the accuracy, robustness, and high performance of the mixed-precision ToR on synthetic datasets, which paves the way for future instrument deployments on the E-ELT.
  • On the robustness and performance of entropy stable discontinuous collocation methods for the compressible Navier-Stokes equations

    Rojas, Diego; Boukharfane, Radouan; Dalcin, Lisandro; Fernandez, David C. Del Rey; Ranocha, Hendrik; Keyes, David E.; Parsani, Matteo (Submitted to Journal of Computational Physics, arXiv, 2019-11-21) [Preprint]
    In computational fluid dynamics, the demand for increasingly multidisciplinary reliable simulations, for both analysis and design optimization purposes, requires transformational advances in individual components of future solvers. At the algorithmic level, hardware compatibility and efficiency are of paramount importance in determining viability at exascale and beyond. However, equally important (if not more so) is algorithmic robustness with minimal user intervention, which becomes progressively more challenging to achieve as problem size and physics complexity increase. We numerically show that low- and high-order entropy stable discontinuous spatial discretizations based on summation-by-parts operators and the simultaneous-approximation-term technique provide an essential step toward a truly enabling technology in terms of reliability and robustness for both under-resolved turbulent flow simulations and flows with discontinuities.
