## Search

Now showing items 1-10 of 10



Entropy Stable p-Nonconforming Discretizations with the Summation-by-Parts Property for the Compressible Navier–Stokes Equations

Fernandez, David C. Del Rey; Carpenter, Mark H.; Dalcin, Lisandro; Friedrich, Lucas; Winters, Andrew R.; Gassner, Gregor J.; Parsani, Matteo (Submitted to Computers and Fluids, arXiv, 2019-09-27) [Preprint]

The entropy conservative, curvilinear, nonconforming, p-refinement algorithm for hyperbolic conservation laws of Del Rey Fernandez et al. (2019) is extended from the compressible Euler equations to the compressible Navier-Stokes equations. A simple and flexible coupling procedure with planar interpolation operators between adjoining nonconforming elements is used. Curvilinear volume metric terms are numerically approximated via a minimization procedure and satisfy the discrete geometric conservation law conditions. Distinct curvilinear surface metrics are used on the adjoining interfaces to construct the interface coupling terms, thereby localizing the discrete geometric conservation law constraints to each individual element. The resulting scheme is entropy conservative/stable, element-wise conservative, and freestream preserving. Viscous interface dissipation operators are developed that retain the entropy stability of the base scheme. The accuracy and stability properties of the resulting numerical scheme are shown to be comparable to those of the original conforming scheme (achieving ~p+1 convergence) in the context of the viscous shock problem, the Taylor-Green vortex problem at a Reynolds number of Re=1,600, and a subsonic turbulent flow past a sphere at Re = 2,000.

Entropy Stable p-Nonconforming Discretizations with the Summation-by-Parts Property for the Compressible Euler equations

Fernandez, D. C. Del Rey; Carpenter, M. H.; Dalcin, Lisandro; Friedrich, L.; Rojas, D.; Winters, A. R.; Gassner, G. J.; Zampini, Stefano; Parsani, Matteo (arXiv, 2019-09-27) [Preprint]

The entropy conservative/stable algorithm of Friedrich et al. (2018) for hyperbolic conservation laws on nonconforming p-refined/coarsened Cartesian grids is extended to curvilinear grids for the compressible Euler equations. The primary focus is on constructing appropriate coupling procedures across the curvilinear nonconforming interfaces. A simple and flexible approach is proposed that uses interpolation operators from one element to the other. On the element faces, the analytic metrics are used to construct coupling terms, while metric terms in the volume are approximated to satisfy a discretization of the geometric conservation laws. The resulting scheme is entropy conservative/stable, elementwise conservative, and freestream preserving. The accuracy and stability properties of the resulting numerical algorithm are shown to be comparable to those of the original conforming scheme (~p+1 convergence) in the context of the isentropic Euler vortex and the inviscid Taylor-Green vortex problems on manufactured high-order grids.
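
The summation-by-parts (SBP) property named in both titles can be illustrated with a minimal 1-D prototype. The sketch below builds a second-order SBP first-derivative operator in numpy; the papers use high-order, curvilinear, multi-dimensional operators, so this is only an illustration of the defining algebraic identity, not their scheme.

```python
import numpy as np

# Minimal 1-D, second-order SBP first-derivative operator D = inv(P) @ Q:
# P is a diagonal quadrature (norm) matrix, and Q satisfies
# Q + Q.T = diag(-1, 0, ..., 0, 1), a discrete integration by parts.
n = 6
h = 1.0 / (n - 1)
P = h * np.diag([0.5] + [1.0] * (n - 2) + [0.5])
Q = 0.5 * (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1))
Q[0, 0], Q[-1, -1] = -0.5, 0.5   # boundary closure
D = np.linalg.inv(P) @ Q         # differentiates linear functions exactly
```

The SBP identity `Q + Q.T = diag(-1, 0, ..., 0, 1)` is what allows discrete entropy estimates to mimic the continuous integration-by-parts argument.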

Randomized GPU algorithms for the construction of hierarchical matrices from matrix-vector operations

Boukaram, Wagih Halim; Turkiyyah, George; Keyes, David E. (SIAM Journal on Scientific Computing, Society for Industrial & Applied Mathematics (SIAM), 2019-07-09) [Article]

Randomized algorithms for the generation of low rank approximations of large dense matrices have become popular methods in scientific computing and machine learning. In this paper, we extend the scope of these methods and present batched GPU randomized algorithms for the efficient generation of low rank representations of large sets of small dense matrices, as well as their generalization to the construction of hierarchically low rank symmetric H2 matrices with general partitioning structures. In both cases, the algorithms need to access the matrices only through matrix-vector multiplication operations which can be done in blocks to increase the arithmetic intensity and substantially boost the resulting performance. The batched GPU kernels are adaptive, allow nonuniform sizes in the matrices of the batch, and are more effective than SVD factorizations on matrices with fast decaying spectra. The hierarchical matrix generation consists of two phases, interleaved at every level of the matrix hierarchy. A first phase adaptively generates low rank approximations of matrix blocks through randomized matrix-vector sampling. A second phase accumulates and compresses these blocks into a hierarchical matrix that is incrementally constructed. The accumulation expresses the low rank blocks of a given level as a set of local low rank updates that are performed simultaneously on the whole matrix allowing high-performance batched kernels to be used in the compression operations. When the ranks of the blocks generated in the first phase are too large to be processed in a single operation, the low rank updates can be split into smaller-sized updates and applied in sequence. Assuming representative rank k, the resulting matrix has optimal O(kN) asymptotic storage complexity because of the nested bases it uses. 
The ability to generate an H2 matrix from matrix-vector products allows us to support a general randomized matrix-matrix multiplication operation, an important kernel in hierarchical matrix computations. Numerical experiments demonstrate the high performance of the algorithms and their effectiveness in generating hierarchical matrices to a desired target accuracy.
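
The core primitive described above, building a low-rank factorization using only matrix-vector products, can be sketched in a few lines of numpy. This is the generic randomized range-sampling idea, not the paper's batched GPU or hierarchical H2 implementation; the function name and signature are illustrative.

```python
import numpy as np

def randomized_lowrank(matvec, rmatvec, n, k, p=10, seed=None):
    """Rank-k approximation A ~= Q @ B built only from blocked products
    with A (matvec) and A^T (rmatvec), i.e. randomized range sampling."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, k + p))  # Gaussian test block (oversampled)
    Y = matvec(Omega)                        # sample the range: Y = A @ Omega
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for the sample
    B = rmatvec(Q).T                         # B = Q^T A, formed via A^T @ Q
    return Q, B
```

Because both products are applied to blocks of vectors, the arithmetic intensity is high, which is what the batched GPU kernels in the paper exploit.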

Massively Parallel Polar Decomposition on Distributed-memory Systems

Ltaief, Hatem; Sukkari, Dalal E.; Esposito, Aniello; Nakatsukasa, Yuji; Keyes, David E. (ACM Transactions on Parallel Computing, Association for Computing Machinery (ACM), 2019-06-10) [Article]

We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate the QDWH convergence. Based on the Zolotarev rational functions, introduced by Zolotarev in 1877, this new PD algorithm (ZOLO-PD) converges within two iterations even for ill-conditioned matrices, instead of the original six iterations needed for QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence order of up to 17, in contrast to the cubic convergence of QDWH. This comes at the price of higher arithmetic cost and memory footprint. These extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers. We demonstrate that, in the presence of a large number of processing units, ZOLO-PD is able to outperform QDWH by up to a 2.3× speedup, especially in situations where QDWH runs out of work, for instance, in the strong-scaling mode of operation.
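
The iterative structure underlying QDWH and ZOLO-PD can be seen in the plain Newton sign-function iteration for the polar factor, sketched below in numpy. This is only the simple baseline that those algorithms accelerate with rational approximations, not the paper's method; the function name and tolerances are illustrative.

```python
import numpy as np

def polar_newton(A, tol=1e-12, maxit=100):
    """Polar decomposition A = U @ H of a nonsingular square matrix via
    the unscaled Newton iteration X <- (X + X^{-T}) / 2, the matrix
    analogue of the scalar sign iteration."""
    X = A.copy()
    for _ in range(maxit):
        Xn = 0.5 * (X + np.linalg.inv(X).T)
        done = np.linalg.norm(Xn - X) <= tol * np.linalg.norm(Xn)
        X = Xn
        if done:
            break
    H = X.T @ A                  # H = U^T A: symmetric positive (semi)definite
    return X, 0.5 * (H + H.T)    # symmetrize H against rounding
```

QDWH replaces each step with a QR-based, dynamically weighted Halley update, and ZOLO-PD compresses the whole iteration into about two higher-order steps.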

Relaxation Runge-Kutta Methods: Fully-Discrete Explicit Entropy-Stable Schemes for the Euler and Navier-Stokes Equations

Ranocha, Hendrik; Sayyari, Mohammed; Dalcin, Lisandro; Parsani, Matteo; Ketcheson, David I. (arXiv, 2019-05-22) [Preprint]

The framework of inner product norm preserving relaxation Runge-Kutta methods (David I. Ketcheson, Relaxation Runge-Kutta Methods: Conservation and Stability for Inner-Product Norms, 2019. arXiv: 1905.09847 [math.NA]) is extended to general convex quantities. Conservation, dissipation, or other solution properties with respect to any convex functional are enforced by the addition of a relaxation parameter that multiplies the Runge-Kutta update at each step. Moreover, other desirable stability (such as strong stability preservation) and efficiency (such as low storage requirements) properties are preserved. The technique can be applied to both explicit and implicit Runge-Kutta methods and requires only a small modification to existing implementations. The computational cost at each step is the solution of one additional scalar algebraic equation for which a good initial guess is available. The effectiveness of this approach is proved analytically and demonstrated in several numerical examples, including applications to high-order entropy-conservative and entropy-stable semi-discretizations on unstructured grids for the compressible Euler and Navier-Stokes equations.
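
The mechanism described above, a single relaxation parameter multiplying the Runge-Kutta update, chosen from one scalar equation, can be sketched for the simplest conserved convex quantity, the squared norm. The sketch below applies it to a skew-symmetric test system (a harmonic oscillator) with classical RK4; the choice of system, step size, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Skew-symmetric test system: <u, f(u)> = 0, so ||u||^2 is conserved.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
f = lambda u: J @ u

# Classical RK4 Butcher coefficients.
A = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
b = np.array([1.0, 2.0, 2.0, 1.0]) / 6.0

def rrk_step(u, dt):
    """One relaxation RK step: scale the update direction d by gamma so
    that ||u + gamma*d||^2 = ||u||^2 holds exactly."""
    F = np.zeros((4, u.size))
    for i in range(4):
        F[i] = f(u + dt * (A[i, :i] @ F[:i]))  # stage derivatives
    d = dt * (b @ F)        # unrelaxed update direction
    e = dt * (A @ F)        # stage offsets y_i - u_n
    # Conservation condition 2*gamma*<u,d> + gamma^2*<d,d> = 0, with
    # <u, F_i> = -<e_i, F_i> since <y_i, f(y_i)> = 0 for this system:
    gamma = 2.0 * dt * sum(b[i] * (e[i] @ F[i]) for i in range(4)) / (d @ d)
    return u + gamma * d
```

Iterating `u = rrk_step(u, dt)` keeps `u @ u` at its initial value to machine precision, whereas the unrelaxed RK4 update slowly drifts; the only extra cost per step is this scalar computation of gamma.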

Square-root variable metric based elastic full-waveform inversion-Part 2: Uncertainty estimation

Liu, Qiancheng; Peter, Daniel (Geophysical Journal International, Oxford University Press (OUP), 2019-05-02) [Article]

In our first paper (Part 1) on the square-root variable metric (SRVM) method we presented the basic theory and validation of the inverse algorithm applicable to large-scale seismic data inversions. In this second paper (Part 2), the objective is to estimate the resolution and uncertainty of the resulting inverted geophysical model. Bayesian inference allows estimating the posterior model distribution from its prior distribution and likelihood function. When linear and Gaussian, these distributions can be mathematically characterized by their covariance matrices. However, it is prohibitive to explicitly construct and store the covariance in large-scale practical problems. In Part 1, we applied the SRVM method to elastic full-waveform inversion in a matrix-free vector version. This new algorithm allows accessing the posterior covariance by reconstructing the inverse Hessian with memory-affordable vector series. The focus of this paper is on extracting quantitative and statistical information from the inverse Hessian for quality assessment of the seismic model inverted by FWI. To operate on the inverse Hessian more efficiently, we compute its eigenvalues and eigenvectors with randomized singular value decomposition. Furthermore, we collect point-spread functions from the Hessian in an efficient way. The posterior standard deviation quantitatively measures the uncertainties of the posterior model. 2-D Gaussian random samplers help to visually compare both the prior and posterior distributions. We highlight our method on several numerical examples and demonstrate an uncertainty estimation analysis applicable to large-scale inversions.

A QDWH-based SVD software framework on distributed-memory manycore systems

Sukkari, Dalal E.; Ltaief, Hatem; Esposito, Aniello; Keyes, David E. (ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2019-04-29) [Article]

This article presents a high-performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al. (2010) and Nakatsukasa and Higham (2013), the SVD solver relies on the polar decomposition using the QR Dynamically Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying on the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. We report performance results against state-of-the-art existing QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and the Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to a threefold and eightfold speedup on the Haswell- and KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well- and ill-conditioned matrices. Finally, we derive a performance model based on these empirical results. 
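
The polar-decomposition route to the SVD that QDWH-SVD follows can be sketched in numpy: factor A = Up H, eigendecompose the symmetric factor H, and reassemble. In this sketch the polar factors are taken from an eigendecomposition of A^T A purely for brevity, which squares the condition number; the paper's framework computes the polar factor with the backward-stable QDWH iteration instead. The function name is illustrative.

```python
import numpy as np

def svd_via_polar(A):
    """SVD of a nonsingular A through its polar decomposition:
    A = Up @ H, then H = V @ diag(s) @ V.T, so A = (Up @ V) diag(s) V.T."""
    lam, V = np.linalg.eigh(A.T @ A)       # A^T A = V diag(lam) V^T
    s = np.sqrt(np.maximum(lam, 0.0))      # singular values (ascending order)
    H = (V * s) @ V.T                      # symmetric polar factor
    Up = A @ np.linalg.inv(H)              # orthogonal polar factor
    return Up @ V, s, V                    # A = U diag(s) V^T
```

The appeal of this route on manycore hardware is that both stages (polar decomposition and a symmetric eigensolve) are rich in Level 3 BLAS operations.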
Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1.

Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs

Charara, Ali; Keyes, David E.; Ltaief, Hatem (ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2019-04-01) [Article]

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that, for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels for very small matrix sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
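
The recursive formulation mentioned above can be illustrated with a batched lower-triangular solve in numpy: splitting each triangular system in two turns most of the work into one batched matrix-matrix product per level. This is a structural sketch only, assuming well-conditioned systems; the paper's kernels are hand-tuned CUDA operating in registers, and the function name is illustrative.

```python
import numpy as np

def batched_trsm_lower(L, B):
    """Solve L[i] @ X[i] = B[i] for a batch of small lower-triangular
    systems by recursive 2x2 blocking; the off-diagonal update is a
    single batched GEMM over the whole batch."""
    n = L.shape[-1]
    if n == 1:
        return B / L                              # scalar diagonal solve
    m = n // 2
    X1 = batched_trsm_lower(L[:, :m, :m], B[:, :m])
    B2 = B[:, m:] - L[:, m:, :m] @ X1             # batched GEMM update
    X2 = batched_trsm_lower(L[:, m:, m:], B2)
    return np.concatenate([X1, X2], axis=1)
```

Casting the bulk of the flops as batched GEMMs is what lets such kernels approach compute-bound performance even at tiny per-matrix sizes.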

Maximum-principle-satisfying space-time conservation element and solution element scheme applied to compressible multifluids

Shen, Hua; Wen, Chih-Yung; Parsani, Matteo; Shu, Chi-Wang (Journal of Computational Physics, Elsevier BV, 2016-10-20) [Article]

A maximum-principle-satisfying space-time conservation element and solution element (CE/SE) scheme is constructed to solve a reduced five-equation model coupled with the stiffened equation of state for compressible multifluids. We first derive a sufficient condition for CE/SE schemes to satisfy the maximum principle when solving a general conservation law. Then we introduce a slope limiter that enforces this sufficient condition and is applicable to both central and upwind CE/SE schemes. Finally, we implement the upwind maximum-principle-satisfying CE/SE scheme to solve the volume-fraction-based five-equation model for compressible multifluids. Several numerical examples are carried out to carefully examine the accuracy, efficiency, conservation, and maximum-principle-satisfying properties of the proposed approach.
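
The role such a slope limiter plays can be illustrated with the classic 1-D minmod limiter: it zeroes the reconstruction slope at extrema so that reconstructed face values stay within the range of neighbouring cell averages, a discrete maximum principle. This generic sketch stands in for, and is not, the paper's CE/SE-specific limiter.

```python
import numpy as np

def limited_slopes(u):
    """Minmod-limited slopes for a 1-D piecewise-linear reconstruction
    from cell averages u (interior cells only). Zero slope at local
    extrema; otherwise the smaller-magnitude one-sided difference."""
    du_l = u[1:-1] - u[:-2]   # backward differences
    du_r = u[2:] - u[1:-1]    # forward differences
    return np.where(du_l * du_r > 0,
                    np.sign(du_l) * np.minimum(np.abs(du_l), np.abs(du_r)),
                    0.0)
```

With these slopes, every reconstructed face value `u[i] ± s[i]/2` is bounded by the minimum and maximum of the three adjacent cell averages, which is the property the paper's sufficient condition generalizes to space-time CE/SE discretizations.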

Unstructured Computational Aerodynamics on Many Integrated Core Architecture

Al Farhan, Mohammed; Kaushik, Dinesh K.; Keyes, David E. (Parallel Computing, Elsevier BV, 2016-06-11) [Article]

Shared-memory parallelization of the flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler flow code previously studied for distributed memory and multi-core shared memory, is evaluated on up to 61 cores per node and up to 4 threads per core. We explore several thread-level optimizations to improve flux kernel performance on the state-of-the-art many integrated core (MIC) Intel processor Xeon Phi “Knights Corner,” with a focus on strong thread scaling. While the linear algebraic kernel is bottlenecked by memory bandwidth for even modest numbers of cores sharing a common memory, the flux kernel, which arises in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, is compute-intensive and is known to effectively exploit contemporary multi-core hardware. We extend the study of the performance of the flux kernel to the Xeon Phi in three thread affinity modes, namely scatter, compact, and balanced, in both offload and native mode, with and without various code optimizations to improve alignment and reduce cache coherency penalties. Relative to baseline “out-of-the-box” optimized compilation, code restructuring optimizations provide about a 3.8x speedup using the offload mode and about a 5x speedup using the native mode. Even with these gains for the flux kernel, with respect to execution time the MIC merely achieves parity with optimized compilation on a contemporary multi-core Intel CPU, the 16-core Sandy Bridge E5-2670. Nevertheless, the optimizations employed to reduce the data motion and cache coherency protocol penalties of the MIC are expected to be of value for CFD and many other unstructured applications as many-core architectures evolve. 
We also explore large-scale distributed-shared memory performance on the Cray XC40 supercomputer, demonstrating that the optimizations employed on the Phi hybridize to this context, where each of thousands of nodes comprises two sockets of Intel Xeon Haswell CPUs with 32 cores per node.
