Recent Submissions

  • Positivity-preserving CE/SE schemes for solving the compressible Euler and Navier–Stokes equations on hybrid unstructured meshes

    Shen, Hua; Parsani, Matteo (Computer Physics Communications, Elsevier BV, 2018-05-28) [Article]
    We construct positivity-preserving space–time conservation element and solution element (CE/SE) schemes for solving the compressible Euler and Navier–Stokes equations on hybrid unstructured meshes consisting of triangular and rectangular elements. The schemes use an a posteriori limiter to prevent negative densities and pressures based on the premise of preserving optimal accuracy. The limiter enforces a constraint for spatial derivatives and does not change the conservative property of CE/SE schemes. Several numerical examples suggest that the proposed schemes preserve accuracy for smooth flows and strictly preserve positivity of densities and pressures for the problems involving near vacuum and very strong discontinuities.
  • Optimizations of Unstructured Aerodynamics Computations for Many-core Architectures

    Al Farhan, Mohammed Ahmed; Keyes, David E. (IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers (IEEE), 2018-04-13) [Article]
    We investigate several state-of-the-practice shared-memory optimization techniques applied to key routines of an unstructured computational aerodynamics application with irregular memory accesses. We illustrate for the Intel KNL processor, as a representative of the processors in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating point numerics of the original code. We employ low and high-level architecture-specific code optimizations involving thread and data-level parallelism. Our approach is based upon a multi-level hierarchical distribution of work and data across both the threads and the SIMD units within every hardware core. On a 64-core KNL chip, we achieve nearly 2.9x speedup of the dominant routines relative to the baseline. These exhibit almost linear strong scalability up to 64 threads, and thereafter some improvement with hyperthreading. At substantially fewer Watts, we achieve up to 1.7x speedup relative to the performance of 72 threads of a 36-core Haswell CPU and roughly equivalent performance to 112 threads of a 56-core Skylake scalable processor. These optimizations are expected to be of value for many other unstructured mesh PDE-based scientific applications as multi and many-core architecture evolves.
  • Extreme Scale FMM-Accelerated Boundary Integral Equation Solver for Wave Scattering

    AbdulJabbar, Mustafa Abdulmajeed; Al Farhan, Mohammed; Al-Harthi, Noha A.; Chen, Rui; Yokota, Rio; Bagci, Hakan; Keyes, David E. (arXiv, 2018-03-27) [Preprint]
    Algorithmic and architecture-oriented optimizations are essential for achieving performance worthy of anticipated energy-austere exascale systems. In this paper, we present an extreme scale FMM-accelerated boundary integral equation solver for wave scattering, which uses FMM as a matrix-vector multiplication inside the GMRES iterative method. Our FMM Helmholtz kernels treat nontrivial singular and near-field integration points. We implement highly optimized kernels for both shared and distributed memory, targeting emerging Intel extreme performance HPC architectures. We extract the potential thread- and data-level parallelism of the key Helmholtz kernels of FMM. Our application code is well optimized to exploit the AVX-512 SIMD units of Intel Skylake and Knights Landing architectures. We provide different performance models for tuning the task-based tree traversal implementation of FMM, and develop optimal architecture-specific and algorithm aware partitioning, load balancing, and communication reducing mechanisms to scale up to 6,144 compute nodes of a Cray XC40 with 196,608 hardware cores. With shared memory optimizations, we achieve roughly 77% of peak single precision floating point performance of a 56-core Skylake processor, and on average 60% of peak single precision floating point performance of a 72-core KNL. These numbers represent nearly 5.4x and 10x speedup on Skylake and KNL, respectively, compared to the baseline scalar code. With distributed memory optimizations, on the other hand, we report near-optimal efficiency in the weak scalability study with respect to both the logarithmic communication complexity as well as the theoretical scaling complexity of FMM. In addition, we exhibit up to 85% efficiency in strong scaling. We compute in excess of 2 billion DoF on the full-scale of the Cray XC40 supercomputer.
  • Isogeometric BDDC deluxe preconditioners for linear elasticity

    Pavarino, L. F.; Scacchi, S.; Widlund, O. B.; Zampini, Stefano (Mathematical Models and Methods in Applied Sciences, World Scientific Pub Co Pte Lt, 2018-03-14) [Article]
    Balancing Domain Decomposition by Constraints (BDDC) preconditioners have been shown to provide rapidly convergent preconditioned conjugate gradient methods for solving many of the very ill-conditioned systems of algebraic equations which often arise in finite element approximations of a large variety of problems in continuum mechanics. These algorithms have also been developed successfully for problems arising in isogeometric analysis. In particular, the BDDC deluxe version has proven very successful for problems approximated by Non-Uniform Rational B-Splines (NURBS), even those of high order and regularity. One main purpose of this paper is to extend the theory, previously fully developed only for scalar elliptic problems in the plane, to problems of linear elasticity in three dimensions. Numerical experiments supporting the theory are also reported. Some of these experiments highlight the fact that the development of the theory can help to decrease substantially the dimension of the primal space of the BDDC algorithm, which provides the necessary global component of these preconditioners, while maintaining scalability and good convergence rates.
  • High-flux water desalination with interfacial salt sieving effect in nanoporous carbon composite membranes

    Chen, Wei; Chen, Shuyu; Liang, Tengfei; Zhang, Qiang; Fan, Zhongli; Yin, Hang; Huang, Kuo-Wei; Zhang, Xixiang; Lai, Zhiping; Sheng, Ping (Nature Nanotechnology, Springer Nature, 2018-03-05) [Article]
    Freshwater flux and energy consumption are two important benchmarks for the membrane desalination process. Here, we show that nanoporous carbon composite membranes, which comprise a layer of porous carbon fibre structures grown on a porous ceramic substrate, can exhibit 100% desalination and a freshwater flux that is 3-20 times higher than existing polymeric membranes. Thermal accounting experiments demonstrated that the carbon composite membrane saved over 80% of the latent heat consumption. Theoretical calculations combined with molecular dynamics simulations revealed the unique microscopic process occurring in the membrane. When the salt solution is stopped at the openings to the nanoscale porous channels and forms a meniscus, the vapour can rapidly transport across the nanoscale gap to condense on the permeate side. This process is driven by the chemical potential gradient and aided by the unique smoothness of the carbon surface. The high thermal conductivity of the carbon composite membrane ensures that most of the latent heat is recovered.
  • Virtual reality in advanced medical immersive imaging: a workflow for introducing virtual reality as a supporting tool in medical imaging

    Knodel, Markus M.; Lemke, Babett; Lampe, Michael; Hoffer, Michael; Gillmann, Clarissa; Uder, Michael; Hillengaß, Jens; Wittum, Gabriel; Bäuerle, Tobias (Computing and Visualization in Science, Springer Nature, 2018-02-27) [Article]
    Radiologic evaluation of images from computed tomography (CT) or magnetic resonance imaging for diagnostic purposes is based on the analysis of single slices, occasionally supplementing this information with 3D reconstructions as well as surface or volume rendered images. However, due to the complexity of anatomical or pathological structures in biomedical imaging, innovative visualization techniques are required to display morphological characteristics three dimensionally. Virtual reality is a modern tool of representing visual data, The observer has the impression of being “inside” a virtual surrounding, which is referred to as immersive imaging. Such techniques are currently being used in technical applications, e.g. in the automobile industry. Our aim is to introduce a workflow realized within one simple program which processes common image stacks from CT, produces 3D volume and surface reconstruction and rendering, and finally includes the data into a virtual reality device equipped with a motion head tracking cave automatic virtual environment system. Such techniques have the potential to augment the possibilities in non-invasive medical imaging, e.g. for surgical planning or educational purposes to add another dimension for advanced understanding of complex anatomical and pathological structures. To this end, the reconstructions are based on advanced mathematical techniques and the corresponding grids which we can export are intended to form the basis for simulations of mathematical models of the pathogenesis of different diseases.
  • Theoretical investigation of shock stand-off distance for non-equilibrium flows over spheres

    Shen, Hua; WEN, Chih-Yung (Chinese Journal of Aeronautics, Elsevier BV, 2018-02-20) [Article]
    We derived a theoretical solution of the shock stand-off distance for a non-equilibrium flow over spheres based on Wen and Hornung’s solution and Olivier’s solution. Compared with previous approaches, the main advantage of the present approach is allowing an analytic solution without involving any semi-empirical parameter for the whole non-equilibrium flow regimes. The effects of some important physical quantities therefore can be fully revealed via the analytic solution. By combining the current solution with Ideal Dissociating Gas (IDG) model, we investigate the effects of free stream kinetic energy and free stream dissociation level (which can be very different between different facilities) on the shock stand-off distance.
  • Refined isogeometric analysis for a preconditioned conjugate gradient solver

    Garcia, Daniel; Pardo, D.; Dalcin, Lisandro; Calo, Victor M. (Computer Methods in Applied Mechanics and Engineering, Elsevier BV, 2018-02-12) [Article]
    Starting from a highly continuous Isogeometric Analysis (IGA) discretization, refined Isogeometric Analysis (rIGA) introduces C0 hyperplanes that act as separators for the direct LU factorization solver. As a result, the total computational cost required to solve the corresponding system of equations using a direct LU factorization solver dramatically reduces (up to a factor of 55) Garcia et al. (2017). At the same time, rIGA enriches the IGA spaces, thus improving the best approximation error. In this work, we extend the complexity analysis of rIGA to the case of iterative solvers. We build an iterative solver as follows: we first construct the Schur complements using a direct solver over small subdomains (macro-elements). We then assemble those Schur complements into a global skeleton system. Subsequently, we solve this system iteratively using Conjugate Gradients (CG) with an incomplete LU (ILU) preconditioner. For a 2D Poisson model problem with a structured mesh and a uniform polynomial degree of approximation, rIGA achieves moderate savings with respect to IGA in terms of the number of Floating Point Operations (FLOPs) and computational time (in seconds) required to solve the resulting system of linear equations. For instance, for a mesh with four million elements and polynomial degree p=3, the iterative solver is approximately 2.6 times faster (in time) when applied to the rIGA system than to the IGA one. These savings occur because the skeleton rIGA system contains fewer non-zero entries than the IGA one. The opposite situation occurs for 3D problems, and as a result, 3D rIGA discretizations provide no gains with respect to their IGA counterparts when considering iterative solvers.
  • Entropy Stability and the No-Slip Wall Boundary Condition

    Svärd, Magnus; Carpenter, Mark H.; Parsani, Matteo (SIAM Journal on Numerical Analysis, Society for Industrial & Applied Mathematics (SIAM), 2018-01-18) [Article]
    We present an entropy stable numerical scheme subject to no-slip wall boundary conditions. To enforce entropy stability only the no-penetration boundary condition and a temperature condition are needed at a wall, and this leads to an L bound on the conservative variables. In this article, we take the next step and design a finite difference scheme that also bounds the velocity gradients. This necessitates the use of the full no-slip conditions.
  • Quantitative Analysis of Hepatitis C NS5A Viral Protein Dynamics on the ER Surface

    Knodel, Markus M.; Nägel, Arne; Reiter, Sebastian; Vogel, Andreas; Targett-Adams, Paul; McLauchlan, John; Herrmann, Eva; Wittum, Gabriel (Viruses, MDPI AG, 2018-01-08) [Article]
    Exploring biophysical properties of virus-encoded components and their requirement for virus replication is an exciting new area of interdisciplinary virological research. To date, spatial resolution has only rarely been analyzed in computational/biophysical descriptions of virus replication dynamics. However, it is widely acknowledged that intracellular spatial dependence is a crucial component of virus life cycles. The hepatitis C virus-encoded NS5A protein is an endoplasmatic reticulum (ER)-anchored viral protein and an essential component of the virus replication machinery. Therefore, we simulate NS5A dynamics on realistic reconstructed, curved ER surfaces by means of surface partial differential equations (sPDE) upon unstructured grids. We match the in silico NS5A diffusion constant such that the NS5A sPDE simulation data reproduce experimental NS5A fluorescence recovery after photobleaching (FRAP) time series data. This parameter estimation yields the NS5A diffusion constant. Such parameters are needed for spatial models of HCV dynamics, which we are developing in parallel but remain qualitative at this stage. Thus, our present study likely provides the first quantitative biophysical description of the movement of a viral component. Our spatio-temporal resolved ansatz paves new ways for understanding intricate spatial-defined processes central to specfic aspects of virus life cycles.
  • Fast Multipole Method as a Matrix-Free Hierarchical Low-Rank Approximation

    Yokota, Rio; Ibeid, Huda; Keyes, David E. (Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, Springer International Publishing, 2018-01-03) [Conference Paper]
    There has been a large increase in the amount of work on hierarchical low-rank approximation methods, where the interest is shared by multiple communities that previously did not intersect. This objective of this article is two-fold; to provide a thorough review of the recent advancements in this field from both analytical and algebraic perspectives, and to present a comparative benchmark of two highly optimized implementations of contrasting methods for some simple yet representative test cases. The first half of this paper has the form of a survey paper, to achieve the former objective. We categorize the recent advances in this field from the perspective of compute-memory tradeoff, which has not been considered in much detail in this area. Benchmark tests reveal that there is a large difference in the memory consumption and performance between the different methods.
  • An Entropy Stable h/p Non-Conforming Discontinuous Galerkin Method with the Summation-by-Parts Property

    Friedrich, Lucas; Winters, Andrew R.; Fernández, David C. Del Rey; Gassner, Gregor J.; Parsani, Matteo; Carpenter, Mark H. (arXiv, 2017-12-29) [Preprint]
    This work presents an entropy stable discontinuous Galerkin (DG) spectral element approximation for systems of non-linear conservation laws with general geometric (h) and polynomial order (p) non-conforming rectangular meshes. The crux of the proofs presented is that the nodal DG method is constructed with the collocated Legendre-Gauss-Lobatto nodes. This choice ensures that the derivative/mass matrix pair is a summation-by-parts (SBP) operator such that entropy stability proofs from the continuous analysis are discretely mimicked. Special attention is given to the coupling between nonconforming elements as we demonstrate that the standard mortar approach for DG methods does not guarantee entropy stability for non-linear problems, which can lead to instabilities. As such, we describe a precise procedure and modify the mortar method to guarantee entropy stability for general non-linear hyperbolic systems on h/p non-conforming meshes. We verify the high-order accuracy and the entropy conservation/stability of fully non-conforming approximation with numerical examples.
  • Accelerated Cyclic Reduction: A Distributed-Memory Fast Solver for Structured Linear Systems

    Chávez, Gustavo; Turkiyyah, George; Zampini, Stefano; Ltaief, Hatem; Keyes, David E. (Parallel Computing, Elsevier BV, 2017-12-15) [Article]
    We present Accelerated Cyclic Reduction (ACR), a distributed-memory fast solver for rank-compressible block tridiagonal linear systems arising from the discretization of elliptic operators, developed here for three dimensions. Algorithmic synergies between Cyclic Reduction and hierarchical matrix arithmetic operations result in a solver that has O(kNlogN(logN+k2)) arithmetic complexity and O(k Nlog N) memory footprint, where N is the number of degrees of freedom and k is the rank of a block in the hierarchical approximation, and which exhibits substantial concurrency. We provide a baseline for performance and applicability by comparing with the multifrontal method with and without hierarchical semi-separable matrices, with algebraic multigrid and with the classic cyclic reduction method. Over a set of large-scale elliptic systems with features of nonsymmetry and indefiniteness, the robustness of the direct solvers extends beyond that of the multigrid solver, and relative to the multifrontal approach ACR has lower or comparable execution time and size of the factors, with substantially lower numerical ranks. ACR exhibits good strong and weak scaling in a distributed context and, as with any direct solver, is advantageous for problems that require the solution of multiple right-hand sides. Numerical experiments show that the rank k patterns are of O(1) for the Poisson equation and of O(n) for the indefinite Helmholtz equation. The solver is ideal in situations where low-accuracy solutions are sufficient, or otherwise as a preconditioner within an iterative method.
  • Extension of CE/SE method to non-equilibrium dissociating flows

    Wen, C.Y.; Saldivar Massimi, H.; Shen, Hua (Journal of Computational Physics, Elsevier BV, 2017-12-08) [Article]
    In this study, the hypersonic non-equilibrium flows over rounded nose geometries are numerically investigated by a robust conservation element and solution element (CE/SE) code, which is based on hybrid meshes consisting of triangular and quadrilateral elements. The dissociating and recombination chemical reactions as well as the vibrational energy relaxation are taken into account. The stiff source terms are solved by an implicit trapezoidal method of integration. Comparison with laboratory and numerical cases are provided to demonstrate the accuracy and reliability of the present CE/SE code in simulating hypersonic non-equilibrium flows.
  • Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients

    Chavez Chavez, Gustavo Ivan; Turkiyyah, George; Zampini, Stefano; Keyes, David E. (Journal of Computational and Applied Mathematics, Elsevier BV, 2017-12-07) [Article]
    We present a robust and scalable preconditioner for the solution of large-scale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical low-rank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve log-linear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and variable coefficients demonstrate the preconditioner applicability and robustness. Furthermore, it is possible to control the number of iterations via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and the tuning of the admissibility condition parameter. Together, these parameters allow for optimization of the memory requirements and performance of the preconditioner.
  • A scalable community detection algorithm for large graphs using stochastic block models

    Peng, Chengbin; Zhang, Zhihua; Wong, Ka-Chun; Zhang, Xiangliang; Keyes, David E. (Intelligent Data Analysis, IOS Press, 2017-11-24) [Article]
    Community detection in graphs is widely used in social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structures. However, in the era of
  • An energy-stable generalized- α method for the Swift–Hohenberg equation

    Sarmiento, Adel; Espath, L.F.R.; Vignal, P.; Dalcin, Lisandro; Parsani, Matteo; Calo, V.M. (Journal of Computational and Applied Mathematics, Elsevier BV, 2017-11-16) [Article]
    We propose a second-order accurate energy-stable time-integration method that controls the evolution of numerical instabilities introducing numerical dissipation in the highest-resolved frequencies. Our algorithm further extends the generalized-α method and provides control over dissipation via the spectral radius. We derive the first and second laws of thermodynamics for the Swift–Hohenberg equation and provide a detailed proof of the unconditional energy stability of our algorithm. Finally, we present numerical results to verify the energy stability and its second-order accuracy in time.
  • Fast multipole preconditioners for sparse matrices arising from elliptic equations

    Ibeid, Huda; Yokota, Rio; Pestana, Jennifer; Keyes, David E. (Computing and Visualization in Science, Springer Nature, 2017-11-09) [Article]
    Among optimal hierarchical algorithms for the computational solution of elliptic problems, the fast multipole method (FMM) stands out for its adaptability to emerging architectures, having high arithmetic intensity, tunable accuracy, and relaxable global synchronization requirements. We demonstrate that, beyond its traditional use as a solver in problems for which explicit free-space kernel representations are available, the FMM has applicability as a preconditioner in finite domain elliptic boundary value problems, by equipping it with boundary integral capability for satisfying conditions at finite boundaries and by wrapping it in a Krylov method for extensibility to more general operators. Here, we do not discuss the well developed applications of FMM to implement matrix-vector multiplications within Krylov solvers of boundary element methods. Instead, we propose using FMM for the volume-to-volume contribution of inhomogeneous Poisson-like problems, where the boundary integral is a small part of the overall computation. Our method may be used to precondition sparse matrices arising from finite difference/element discretizations, and can handle a broader range of scientific applications. It is capable of algebraic convergence rates down to the truncation error of the discretized PDE comparable to those of multigrid methods, and it offers potentially superior multicore and distributed memory scalability properties on commodity architecture supercomputers. Compared with other methods exploiting the low-rank character of off-diagonal blocks of the dense resolvent operator, FMM-preconditioned Krylov iteration may reduce the amount of communication because it is matrix-free and exploits the tree structure of FMM. We describe our tests in reproducible detail with freely available codes and outline directions for further extensibility.
  • Rectangular maximum-volume submatrices and their applications

    Mikhalev, Aleksandr; Oseledets, I.V. (Linear Algebra and its Applications, Elsevier BV, 2017-10-18) [Article]
    We introduce a definition of the volume of a general rectangular matrix, which is equivalent to an absolute value of the determinant for square matrices. We generalize results of square maximum-volume submatrices to the rectangular case, show a connection of the rectangular volume with an optimal experimental design and provide estimates for a growth of coefficients and an approximation error in spectral and Chebyshev norms. Three promising applications of such submatrices are presented: recommender systems, finding maximal elements in low-rank matrices and preconditioning of overdetermined linear systems. The code is available online.
  • Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

    Sukkari, Dalal E.; Ltaief, Hatem; Faverge, Mathieu; Keyes, David E. (IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers (IEEE), 2017-09-29) [Article]
    This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations for the polar decomposition on latest shared-memory vendors' systems, while maintaining numerical accuracy.

View more