Extreme Computing Research Center
Recent Submissions

Positivitypreserving CE/SE schemes for solving the compressible Euler and Navier–Stokes equations on hybrid unstructured meshes(Computer Physics Communications, Elsevier BV, 20180528) [Article]We construct positivitypreserving space–time conservation element and solution element (CE/SE) schemes for solving the compressible Euler and Navier–Stokes equations on hybrid unstructured meshes consisting of triangular and rectangular elements. The schemes use an a posteriori limiter to prevent negative densities and pressures based on the premise of preserving optimal accuracy. The limiter enforces a constraint for spatial derivatives and does not change the conservative property of CE/SE schemes. Several numerical examples suggest that the proposed schemes preserve accuracy for smooth flows and strictly preserve positivity of densities and pressures for the problems involving near vacuum and very strong discontinuities.

Optimizations of Unstructured Aerodynamics Computations for Manycore Architectures(IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers (IEEE), 20180413) [Article]We investigate several stateofthepractice sharedmemory optimization techniques applied to key routines of an unstructured computational aerodynamics application with irregular memory accesses. We illustrate for the Intel KNL processor, as a representative of the processors in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating point numerics of the original code. We employ low and highlevel architecturespecific code optimizations involving thread and datalevel parallelism. Our approach is based upon a multilevel hierarchical distribution of work and data across both the threads and the SIMD units within every hardware core. On a 64core KNL chip, we achieve nearly 2.9x speedup of the dominant routines relative to the baseline. These exhibit almost linear strong scalability up to 64 threads, and thereafter some improvement with hyperthreading. At substantially fewer Watts, we achieve up to 1.7x speedup relative to the performance of 72 threads of a 36core Haswell CPU and roughly equivalent performance to 112 threads of a 56core Skylake scalable processor. These optimizations are expected to be of value for many other unstructured mesh PDEbased scientific applications as multi and manycore architecture evolves.

Extreme Scale FMMAccelerated Boundary Integral Equation Solver for Wave Scattering(arXiv, 20180327) [Preprint]Algorithmic and architectureoriented optimizations are essential for achieving performance worthy of anticipated energyaustere exascale systems. In this paper, we present an extreme scale FMMaccelerated boundary integral equation solver for wave scattering, which uses FMM as a matrixvector multiplication inside the GMRES iterative method. Our FMM Helmholtz kernels treat nontrivial singular and nearfield integration points. We implement highly optimized kernels for both shared and distributed memory, targeting emerging Intel extreme performance HPC architectures. We extract the potential thread and datalevel parallelism of the key Helmholtz kernels of FMM. Our application code is well optimized to exploit the AVX512 SIMD units of Intel Skylake and Knights Landing architectures. We provide different performance models for tuning the taskbased tree traversal implementation of FMM, and develop optimal architecturespecific and algorithm aware partitioning, load balancing, and communication reducing mechanisms to scale up to 6,144 compute nodes of a Cray XC40 with 196,608 hardware cores. With shared memory optimizations, we achieve roughly 77% of peak single precision floating point performance of a 56core Skylake processor, and on average 60% of peak single precision floating point performance of a 72core KNL. These numbers represent nearly 5.4x and 10x speedup on Skylake and KNL, respectively, compared to the baseline scalar code. With distributed memory optimizations, on the other hand, we report nearoptimal efficiency in the weak scalability study with respect to both the logarithmic communication complexity as well as the theoretical scaling complexity of FMM. In addition, we exhibit up to 85% efficiency in strong scaling. We compute in excess of 2 billion DoF on the fullscale of the Cray XC40 supercomputer.

Isogeometric BDDC deluxe preconditioners for linear elasticity(Mathematical Models and Methods in Applied Sciences, World Scientific Pub Co Pte Lt, 20180314) [Article]Balancing Domain Decomposition by Constraints (BDDC) preconditioners have been shown to provide rapidly convergent preconditioned conjugate gradient methods for solving many of the very illconditioned systems of algebraic equations which often arise in finite element approximations of a large variety of problems in continuum mechanics. These algorithms have also been developed successfully for problems arising in isogeometric analysis. In particular, the BDDC deluxe version has proven very successful for problems approximated by NonUniform Rational BSplines (NURBS), even those of high order and regularity. One main purpose of this paper is to extend the theory, previously fully developed only for scalar elliptic problems in the plane, to problems of linear elasticity in three dimensions. Numerical experiments supporting the theory are also reported. Some of these experiments highlight the fact that the development of the theory can help to decrease substantially the dimension of the primal space of the BDDC algorithm, which provides the necessary global component of these preconditioners, while maintaining scalability and good convergence rates.

Highflux water desalination with interfacial salt sieving effect in nanoporous carbon composite membranes(Nature Nanotechnology, Springer Nature, 20180305) [Article]Freshwater flux and energy consumption are two important benchmarks for the membrane desalination process. Here, we show that nanoporous carbon composite membranes, which comprise a layer of porous carbon fibre structures grown on a porous ceramic substrate, can exhibit 100% desalination and a freshwater flux that is 320 times higher than existing polymeric membranes. Thermal accounting experiments demonstrated that the carbon composite membrane saved over 80% of the latent heat consumption. Theoretical calculations combined with molecular dynamics simulations revealed the unique microscopic process occurring in the membrane. When the salt solution is stopped at the openings to the nanoscale porous channels and forms a meniscus, the vapour can rapidly transport across the nanoscale gap to condense on the permeate side. This process is driven by the chemical potential gradient and aided by the unique smoothness of the carbon surface. The high thermal conductivity of the carbon composite membrane ensures that most of the latent heat is recovered.

Virtual reality in advanced medical immersive imaging: a workflow for introducing virtual reality as a supporting tool in medical imaging(Computing and Visualization in Science, Springer Nature, 20180227) [Article]Radiologic evaluation of images from computed tomography (CT) or magnetic resonance imaging for diagnostic purposes is based on the analysis of single slices, occasionally supplementing this information with 3D reconstructions as well as surface or volume rendered images. However, due to the complexity of anatomical or pathological structures in biomedical imaging, innovative visualization techniques are required to display morphological characteristics three dimensionally. Virtual reality is a modern tool of representing visual data, The observer has the impression of being “inside” a virtual surrounding, which is referred to as immersive imaging. Such techniques are currently being used in technical applications, e.g. in the automobile industry. Our aim is to introduce a workflow realized within one simple program which processes common image stacks from CT, produces 3D volume and surface reconstruction and rendering, and finally includes the data into a virtual reality device equipped with a motion head tracking cave automatic virtual environment system. Such techniques have the potential to augment the possibilities in noninvasive medical imaging, e.g. for surgical planning or educational purposes to add another dimension for advanced understanding of complex anatomical and pathological structures. To this end, the reconstructions are based on advanced mathematical techniques and the corresponding grids which we can export are intended to form the basis for simulations of mathematical models of the pathogenesis of different diseases.

Theoretical investigation of shock standoff distance for nonequilibrium flows over spheres(Chinese Journal of Aeronautics, Elsevier BV, 20180220) [Article]We derived a theoretical solution of the shock standoff distance for a nonequilibrium flow over spheres based on Wen and Hornung’s solution and Olivier’s solution. Compared with previous approaches, the main advantage of the present approach is allowing an analytic solution without involving any semiempirical parameter for the whole nonequilibrium flow regimes. The effects of some important physical quantities therefore can be fully revealed via the analytic solution. By combining the current solution with Ideal Dissociating Gas (IDG) model, we investigate the effects of free stream kinetic energy and free stream dissociation level (which can be very different between different facilities) on the shock standoff distance.

Refined isogeometric analysis for a preconditioned conjugate gradient solver(Computer Methods in Applied Mechanics and Engineering, Elsevier BV, 20180212) [Article]Starting from a highly continuous Isogeometric Analysis (IGA) discretization, refined Isogeometric Analysis (rIGA) introduces C0 hyperplanes that act as separators for the direct LU factorization solver. As a result, the total computational cost required to solve the corresponding system of equations using a direct LU factorization solver dramatically reduces (up to a factor of 55) Garcia et al. (2017). At the same time, rIGA enriches the IGA spaces, thus improving the best approximation error. In this work, we extend the complexity analysis of rIGA to the case of iterative solvers. We build an iterative solver as follows: we first construct the Schur complements using a direct solver over small subdomains (macroelements). We then assemble those Schur complements into a global skeleton system. Subsequently, we solve this system iteratively using Conjugate Gradients (CG) with an incomplete LU (ILU) preconditioner. For a 2D Poisson model problem with a structured mesh and a uniform polynomial degree of approximation, rIGA achieves moderate savings with respect to IGA in terms of the number of Floating Point Operations (FLOPs) and computational time (in seconds) required to solve the resulting system of linear equations. For instance, for a mesh with four million elements and polynomial degree p=3, the iterative solver is approximately 2.6 times faster (in time) when applied to the rIGA system than to the IGA one. These savings occur because the skeleton rIGA system contains fewer nonzero entries than the IGA one. The opposite situation occurs for 3D problems, and as a result, 3D rIGA discretizations provide no gains with respect to their IGA counterparts when considering iterative solvers.

Entropy Stability and the NoSlip Wall Boundary Condition(SIAM Journal on Numerical Analysis, Society for Industrial & Applied Mathematics (SIAM), 20180118) [Article]We present an entropy stable numerical scheme subject to noslip wall boundary conditions. To enforce entropy stability only the nopenetration boundary condition and a temperature condition are needed at a wall, and this leads to an L bound on the conservative variables. In this article, we take the next step and design a finite difference scheme that also bounds the velocity gradients. This necessitates the use of the full noslip conditions.

Quantitative Analysis of Hepatitis C NS5A Viral Protein Dynamics on the ER Surface(Viruses, MDPI AG, 20180108) [Article]Exploring biophysical properties of virusencoded components and their requirement for virus replication is an exciting new area of interdisciplinary virological research. To date, spatial resolution has only rarely been analyzed in computational/biophysical descriptions of virus replication dynamics. However, it is widely acknowledged that intracellular spatial dependence is a crucial component of virus life cycles. The hepatitis C virusencoded NS5A protein is an endoplasmatic reticulum (ER)anchored viral protein and an essential component of the virus replication machinery. Therefore, we simulate NS5A dynamics on realistic reconstructed, curved ER surfaces by means of surface partial differential equations (sPDE) upon unstructured grids. We match the in silico NS5A diffusion constant such that the NS5A sPDE simulation data reproduce experimental NS5A fluorescence recovery after photobleaching (FRAP) time series data. This parameter estimation yields the NS5A diffusion constant. Such parameters are needed for spatial models of HCV dynamics, which we are developing in parallel but remain qualitative at this stage. Thus, our present study likely provides the first quantitative biophysical description of the movement of a viral component. Our spatiotemporal resolved ansatz paves new ways for understanding intricate spatialdefined processes central to specfic aspects of virus life cycles.

Fast Multipole Method as a MatrixFree Hierarchical LowRank Approximation(Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, Springer International Publishing, 20180103) [Conference Paper]There has been a large increase in the amount of work on hierarchical lowrank approximation methods, where the interest is shared by multiple communities that previously did not intersect. This objective of this article is twofold; to provide a thorough review of the recent advancements in this field from both analytical and algebraic perspectives, and to present a comparative benchmark of two highly optimized implementations of contrasting methods for some simple yet representative test cases. The first half of this paper has the form of a survey paper, to achieve the former objective. We categorize the recent advances in this field from the perspective of computememory tradeoff, which has not been considered in much detail in this area. Benchmark tests reveal that there is a large difference in the memory consumption and performance between the different methods.

An Entropy Stable h/p NonConforming Discontinuous Galerkin Method with the SummationbyParts Property(arXiv, 20171229) [Preprint]This work presents an entropy stable discontinuous Galerkin (DG) spectral element approximation for systems of nonlinear conservation laws with general geometric (h) and polynomial order (p) nonconforming rectangular meshes. The crux of the proofs presented is that the nodal DG method is constructed with the collocated LegendreGaussLobatto nodes. This choice ensures that the derivative/mass matrix pair is a summationbyparts (SBP) operator such that entropy stability proofs from the continuous analysis are discretely mimicked. Special attention is given to the coupling between nonconforming elements as we demonstrate that the standard mortar approach for DG methods does not guarantee entropy stability for nonlinear problems, which can lead to instabilities. As such, we describe a precise procedure and modify the mortar method to guarantee entropy stability for general nonlinear hyperbolic systems on h/p nonconforming meshes. We verify the highorder accuracy and the entropy conservation/stability of fully nonconforming approximation with numerical examples.

Accelerated Cyclic Reduction: A DistributedMemory Fast Solver for Structured Linear Systems(Parallel Computing, Elsevier BV, 20171215) [Article]We present Accelerated Cyclic Reduction (ACR), a distributedmemory fast solver for rankcompressible block tridiagonal linear systems arising from the discretization of elliptic operators, developed here for three dimensions. Algorithmic synergies between Cyclic Reduction and hierarchical matrix arithmetic operations result in a solver that has O(kNlogN(logN+k2)) arithmetic complexity and O(k Nlog N) memory footprint, where N is the number of degrees of freedom and k is the rank of a block in the hierarchical approximation, and which exhibits substantial concurrency. We provide a baseline for performance and applicability by comparing with the multifrontal method with and without hierarchical semiseparable matrices, with algebraic multigrid and with the classic cyclic reduction method. Over a set of largescale elliptic systems with features of nonsymmetry and indefiniteness, the robustness of the direct solvers extends beyond that of the multigrid solver, and relative to the multifrontal approach ACR has lower or comparable execution time and size of the factors, with substantially lower numerical ranks. ACR exhibits good strong and weak scaling in a distributed context and, as with any direct solver, is advantageous for problems that require the solution of multiple righthand sides. Numerical experiments show that the rank k patterns are of O(1) for the Poisson equation and of O(n) for the indefinite Helmholtz equation. The solver is ideal in situations where lowaccuracy solutions are sufficient, or otherwise as a preconditioner within an iterative method.

Extension of CE/SE method to nonequilibrium dissociating flows(Journal of Computational Physics, Elsevier BV, 20171208) [Article]In this study, the hypersonic nonequilibrium flows over rounded nose geometries are numerically investigated by a robust conservation element and solution element (CE/SE) code, which is based on hybrid meshes consisting of triangular and quadrilateral elements. The dissociating and recombination chemical reactions as well as the vibrational energy relaxation are taken into account. The stiff source terms are solved by an implicit trapezoidal method of integration. Comparison with laboratory and numerical cases are provided to demonstrate the accuracy and reliability of the present CE/SE code in simulating hypersonic nonequilibrium flows.

Parallel accelerated cyclic reduction preconditioner for threedimensional elliptic PDEs with variable coefficients(Journal of Computational and Applied Mathematics, Elsevier BV, 20171207) [Article]We present a robust and scalable preconditioner for the solution of largescale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical lowrank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve loglinear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and variable coefficients demonstrate the preconditioner applicability and robustness. Furthermore, it is possible to control the number of iterations via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and the tuning of the admissibility condition parameter. Together, these parameters allow for optimization of the memory requirements and performance of the preconditioner.

A scalable community detection algorithm for large graphs using stochastic block models(Intelligent Data Analysis, IOS Press, 20171124) [Article]Community detection in graphs is widely used in social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structures. However, in the era of

An energystable generalized α method for the Swift–Hohenberg equation(Journal of Computational and Applied Mathematics, Elsevier BV, 20171116) [Article]We propose a secondorder accurate energystable timeintegration method that controls the evolution of numerical instabilities introducing numerical dissipation in the highestresolved frequencies. Our algorithm further extends the generalizedα method and provides control over dissipation via the spectral radius. We derive the first and second laws of thermodynamics for the Swift–Hohenberg equation and provide a detailed proof of the unconditional energy stability of our algorithm. Finally, we present numerical results to verify the energy stability and its secondorder accuracy in time.

Fast multipole preconditioners for sparse matrices arising from elliptic equations(Computing and Visualization in Science, Springer Nature, 20171109) [Article]Among optimal hierarchical algorithms for the computational solution of elliptic problems, the fast multipole method (FMM) stands out for its adaptability to emerging architectures, having high arithmetic intensity, tunable accuracy, and relaxable global synchronization requirements. We demonstrate that, beyond its traditional use as a solver in problems for which explicit freespace kernel representations are available, the FMM has applicability as a preconditioner in finite domain elliptic boundary value problems, by equipping it with boundary integral capability for satisfying conditions at finite boundaries and by wrapping it in a Krylov method for extensibility to more general operators. Here, we do not discuss the well developed applications of FMM to implement matrixvector multiplications within Krylov solvers of boundary element methods. Instead, we propose using FMM for the volumetovolume contribution of inhomogeneous Poissonlike problems, where the boundary integral is a small part of the overall computation. Our method may be used to precondition sparse matrices arising from finite difference/element discretizations, and can handle a broader range of scientific applications. It is capable of algebraic convergence rates down to the truncation error of the discretized PDE comparable to those of multigrid methods, and it offers potentially superior multicore and distributed memory scalability properties on commodity architecture supercomputers. Compared with other methods exploiting the lowrank character of offdiagonal blocks of the dense resolvent operator, FMMpreconditioned Krylov iteration may reduce the amount of communication because it is matrixfree and exploits the tree structure of FMM. We describe our tests in reproducible detail with freely available codes and outline directions for further extensibility.

Rectangular maximumvolume submatrices and their applications(Linear Algebra and its Applications, Elsevier BV, 20171018) [Article]We introduce a definition of the volume of a general rectangular matrix, which is equivalent to an absolute value of the determinant for square matrices. We generalize results of square maximumvolume submatrices to the rectangular case, show a connection of the rectangular volume with an optimal experimental design and provide estimates for a growth of coefficients and an approximation error in spectral and Chebyshev norms. Three promising applications of such submatrices are presented: recommender systems, finding maximal elements in lowrank matrices and preconditioning of overdetermined linear systems. The code is available online.

Asynchronous TaskBased Polar Decomposition on Single Node Manycore Architectures(IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers (IEEE), 20170929) [Article]This paper introduces the first asynchronous, taskbased formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a formulation of the iterative QR dynamicallyweighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on finegrained computations, the novel taskbased implementation is capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling lookahead opportunities for better hardware occupancy. The overall QDWHbased polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the intertask data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an outoforder task scheduling. Benchmarking experiments show significant improvements against existing stateoftheart high performance implementations for the polar decomposition on latest sharedmemory vendors' systems, while maintaining numerical accuracy.