Extreme Computing Research Center
Recent Submissions

Characterization of pressure fluctuations within a controlled-diffusion blade boundary layer using the equilibrium wall-modelled LES (Scientific Reports, Springer Science and Business Media LLC, 2020-07-31) [Article] In this study, the generation of airfoil trailing-edge broadband noise that arises from the interaction of a turbulent boundary layer with the airfoil trailing edge is investigated. The primary objectives of this work are: (i) to apply a wall-modelled large-eddy simulation (WMLES) approach to predict the flow of air passing a controlled-diffusion blade, and (ii) to study the blade broadband noise that is generated from the interaction of a turbulent boundary layer with a lifting-surface trailing edge. This study is carried out for two values of the Mach number, [Formula: see text] and 0.5, two values of the chord Reynolds number, [Formula: see text] and [Formula: see text], and two angles of attack, AoA [Formula: see text] and [Formula: see text]. To examine the influence of the grid resolution on aerodynamic and aeroacoustic quantities, we compare our results with experimental data available in the literature. We also compare our results with two in-house numerical solutions generated from two wall-resolved LES (WRLES) calculations, one of which has a DNS-like resolution. We show that WMLES accurately predicts the mean pressure coefficient distribution, velocity statistics (including the mean velocity), and the traces of Reynolds tensor components. Furthermore, we observe that the instantaneous flow structures computed by the WMLES resemble those found in the reference WRLES database, except near the leading-edge region. Some of the differences observed in these structures are associated with tripping and the transition-to-turbulence mechanism near the leading edge, which are significantly affected by the grid resolution. The aeroacoustic noise calculations indicate that the power spectral density profiles obtained using the WMLES compare well with the experimental data.

MFEM: A modular finite element methods library (Computers and Mathematics with Applications, Elsevier BV, 2020-07-11) [Article] MFEM is an open-source, lightweight, flexible and scalable C++ library for modular finite element methods that features arbitrary high-order finite element meshes and spaces, support for a wide variety of discretization approaches, and an emphasis on usability, portability, and high-performance computing efficiency. MFEM's goal is to provide application scientists with access to cutting-edge algorithms for high-order finite element meshing, discretizations and linear solvers, while enabling researchers to quickly and easily develop and test new algorithms in very general, fully unstructured, high-order, parallel and GPU-accelerated settings. In this paper we describe the underlying algorithms and finite element abstractions provided by MFEM, discuss the software implementation, and illustrate various applications of the library.

Solving Acoustic Boundary Integral Equations Using High Performance Tile Low-Rank LU Factorization (High Performance Computing, Springer International Publishing, 2020-06-18) [Conference Paper] We design and develop a new high-performance implementation of a fast direct LU-based solver using low-rank approximations on massively parallel systems. The LU factorization is the most time-consuming step in solving systems of linear equations in the context of analyzing acoustic scattering from large 3D objects. The matrix equation is obtained by discretizing the boundary integral of the exterior Helmholtz problem using a higher-order Nyström scheme. The main idea is to exploit the inherent data sparsity of the matrix operator by performing local tile-centric approximations while still capturing the most significant information. In particular, the proposed LU-based solver leverages the Tile Low-Rank (TLR) data compression format, as implemented in the Hierarchical Computations on Manycore Architectures (HiCMA) library, to decrease the complexity of “classical” dense direct solvers from cubic to quadratic order. We taskify the underlying boundary integral kernels to expose fine-grained computations. We then employ the dynamic runtime system StarPU to orchestrate the scheduling of computational tasks on shared- and distributed-memory systems. The resulting asynchronous execution compensates for the load imbalance due to the heterogeneous ranks, while mitigating the overhead of data motion. We assess the robustness of our TLR LU-based solver and study the qualitative impact of using different numerical accuracies. The new TLR LU factorization outperforms state-of-the-art dense factorizations by up to an order of magnitude on various parallel systems, for the analysis of scattering from large-scale 3D synthetic and real geometries.
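The tile-centric compression idea behind TLR can be illustrated with a short sketch: partition a data-sparse matrix into tiles and replace each tile by a truncated SVD factor pair whenever that saves memory. The kernel, tile size, and tolerance below are illustrative stand-ins, not the Helmholtz boundary-integral operator or the HiCMA implementation.

```python
import numpy as np

def compress_tile(tile, tol):
    """Truncated SVD of one tile; keep singular values above tol * sigma_max.
    If the low-rank form would not save memory, keep the tile dense."""
    U, s, Vt = np.linalg.svd(tile, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))
    if 2 * k >= tile.shape[0]:
        return tile, None                      # dense fallback
    return U[:, :k] * s[:k], Vt[:k, :]         # factors (nb, k) and (k, nb)

def tlr_compress(A, nb, tol):
    """Tile Low-Rank compression: independently compress each nb-by-nb tile."""
    n = A.shape[0]
    return {(i, j): compress_tile(A[i:i+nb, j:j+nb], tol)
            for i in range(0, n, nb) for j in range(0, n, nb)}

# Hypothetical smooth radial kernel on a 1D point set: off-diagonal tiles are
# numerically low rank, diagonal tiles stay (near) dense.
n, nb = 512, 64
x = np.linspace(0.0, 1.0, n)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
tiles = tlr_compress(A, nb, tol=1e-8)

# Reconstruct to measure the approximation error and the memory footprint.
B = np.zeros_like(A)
stored = 0
for (i, j), (U, Vt) in tiles.items():
    if Vt is None:
        B[i:i+nb, j:j+nb] = U
        stored += U.size
    else:
        B[i:i+nb, j:j+nb] = U @ Vt
        stored += U.size + Vt.size
err = np.linalg.norm(A - B) / np.linalg.norm(A)
```

Note how the per-tile tolerance controls the trade-off the abstract refers to: a looser tolerance shrinks the ranks (and the quadratic-order cost of the subsequent LU) at the price of accuracy.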

Hierarchical matrix approximations for space-fractional diffusion equations (Computer Methods in Applied Mechanics and Engineering, Elsevier BV, 2020-06-11) [Article] Space-fractional diffusion models generally lead to dense discrete matrix operators, which pose substantial computational challenges when the system size becomes large. For a state of size N, full representation of a fractional diffusion matrix would require O(N²) memory, with a similar estimate for the cost of matrix–vector products. In this work, we present H² matrix representations and algorithms that are amenable to efficient implementation on GPUs, and that can reduce the cost of storing these operators to O(N) asymptotically. Matrix–vector multiplications can be performed in asymptotically linear time as well. Performance of the algorithms is assessed in light of 2D simulations of the space-fractional diffusion equation with constant diffusivity. Attention is focused on smooth particle approximation of the governing equations, which leads to discrete operators involving explicit radial kernels. The algorithms are first tested using the fundamental solution of the unforced space-fractional diffusion equation in an unbounded domain, and then for the steady, forced, fractional diffusion equation in a bounded domain. Both matrix-inverse and pseudo-transient solution approaches are considered in the latter case. Our experiments show that the construction of the fractional diffusion matrix, the matrix–vector multiplication, and the generation of an approximate inverse preconditioner all perform very well on a single GPU on 2D problems with N in the range 10⁵–10⁶. In addition, the tests showed that, for the entire range of parameters and fractional orders considered, results obtained using the H² approximations were in close agreement with results obtained using dense operators, and exhibited the same spatial order of convergence. Overall, the present experience shows that the H² matrix framework promises to provide practical means of handling large-scale space-fractional diffusion models in several space dimensions, at a computational cost that is asymptotically similar to the cost of handling classical diffusion equations.

Benchmarking solvers for the one-dimensional cubic nonlinear Klein-Gordon equation on a single core (Springer International Publishing, 2020-06-08) [Conference Paper] To determine the best method for solving a numerical problem modeled by a partial differential equation, one should consider the discretization of the problem, the computational hardware used, and the implementation of the software solution. In solving a scientific computing problem, the level of accuracy can also be important: some numerical methods are efficient for low-accuracy simulations, but others are more efficient for high-accuracy simulations. Very few high-performance benchmarking efforts allow the computational scientist to easily measure such trade-offs in order to obtain an accurate enough numerical solution at a low computational cost. These trade-offs are examined in the numerical solution of the one-dimensional Klein-Gordon equation on single cores of an ARM CPU, an AMD x86-64 CPU, two Intel x86-64 CPUs and a NEC SX-ACE vector processor. The work focuses on comparing the speed and accuracy of several high-order finite difference spatial discretizations using a conjugate gradient linear solver and of a fast Fourier transform based spatial discretization. In addition, implementations using second- and fourth-order time-stepping are also included in the comparison. The work uses accuracy-efficiency frontiers to compare the effectiveness of the five hardware platforms.
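For orientation, the equation being benchmarked can be integrated in a few lines using one of the ingredients the paper compares, a Fourier spectral discretization with second-order time-stepping. The sign convention (u_tt = u_xx − u − u³), grid size, and step size below are illustrative choices, not those of the benchmark.

```python
import numpy as np

# Second-order leapfrog + Fourier spectral discretization of the 1D cubic
# nonlinear Klein-Gordon equation u_tt = u_xx - u - u^3 on a periodic domain.
n = 128
x = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
k = np.fft.fftfreq(n, d=2 * np.pi / n) * 2 * np.pi   # integer wavenumbers
dt, steps = 0.01, 1000

def rhs(u):
    uxx = np.fft.ifft(-(k ** 2) * np.fft.fft(u)).real  # spectral u_xx
    return uxx - u - u ** 3

u_prev = 0.1 * np.cos(x)                    # small initial displacement, zero velocity
u = u_prev + 0.5 * dt ** 2 * rhs(u_prev)    # Taylor start-up step for leapfrog
for _ in range(steps):
    u, u_prev = 2 * u - u_prev + dt ** 2 * rhs(u), u
```

A production benchmark would additionally compare this against implicit finite-difference steps solved by conjugate gradients, and track error against a reference solution, which is precisely the accuracy-efficiency frontier the paper constructs.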

Asynchronous computations for solving the acoustic wave propagation equation (The International Journal of High Performance Computing Applications, SAGE Publications, 2020-05-19) [Article] The aim of this study is to design and implement an asynchronous computational scheme for solving the acoustic wave propagation equation with absorbing boundary conditions (ABCs) in the context of seismic imaging applications. While the convolutional perfectly matched layer (CPML) is typically used for ABCs in the oil and gas industry, its formulation further stresses memory accesses and decreases the arithmetic intensity at the physical domain boundaries. The challenges with CPML are twofold: (1) the strong, inherent data dependencies imposed on the explicit time-stepping scheme render asynchronous time integration cumbersome, and (2) the idle time is further exacerbated by the load imbalance introduced among processing units. In fact, the CPML formulation of the ABCs requires expensive synchronization points, which may hinder the parallel performance of the overall asynchronous time integration. In particular, when deployed in conjunction with the multicore-optimized wavefront diamond temporal blocking (MWD-TB) approach for the inner domain points, it results in a major performance slowdown. To relax CPML's synchrony and mitigate the resulting load imbalance, we embed CPML's calculation into MWD-TB's inner loop and carry on the time integration with fine-grained computations in an asynchronous, holistic way. This comes at the price of storing transient results to alleviate dependencies from critical data hazards, while maintaining the numerical accuracy of the original scheme. Performance and scalability results on various x86 architectures demonstrate the superiority of MWD-TB with CPML support over standard spatial blocking on various grid sizes. To our knowledge, this is the first practical study that highlights the consolidation of CPML ABCs with asynchronous temporal blocking stencil computations.

Automated methods for the comparison of natural languages (Computing and Visualization in Science, Springer Science and Business Media LLC, 2020-05-18) [Article] Starting from the general question of whether there is a connection between the mathematical capabilities of a student and their native language, we aim at comparing natural languages with mathematical language quantitatively. In [20] we set up an approach to compare language structures using Natural Language Processors (NLPs). However, difficulties arose with the quality of the structural analysis of the NLPs used, even when just comparing simple sentences in different but closely related natural languages. We now present a comparison of different available NLPs and discuss the results. The comparison confirms the results from [20], showing that current NLPs are not capable of analysing even simple sentences such that the resulting structures of different natural languages can be compared.

3D modeling and simulation of a harpsichord (Computing and Visualization in Science, Springer Science and Business Media LLC, 2020-04-29) [Article] The mathematical characterization of the sound of a musical instrument still follows Schumann's laws (Schumann in Physik der Klangfarben, Leipzig, 1929). According to this theory, the resonances of the instrument body, “the formants”, filter the oscillations of the sound generator (e.g., strings) and produce the characteristic “timbre” of an instrument. This is a strong simplification of the actual situation: it applies to a point source, can easily be reproduced by a loudspeaker, and disregards the three-dimensional structure of musical instruments. To describe the effect of the geometry and material of the instruments, we set up a 3D model and simulate it using the simulation system UG4 (Vogel et al. in Comput Vis Sci 16(4):165–179, 2013; Reiter et al. in Comput Vis Sci 16(4):151–164, 2014). We aim to capture the oscillation behavior of eigenfrequencies of a harpsichord soundboard and investigate how well a model for the oscillation behavior of the soundboard approximates the oscillation behavior of the whole instrument. We resolve the complicated geometry by several unstructured 3D grids and take into account the anisotropy of wood. The oscillation behavior of the soundboard is modeled following the laws of linear orthotropic elasticity with homogeneous boundary conditions. The associated eigenproblem is discretized using FEM and solved with the iterative PINVIT method using an efficient GMG preconditioner (Neymeyr in A hierarchy of preconditioned eigensolvers for elliptic differential operators. Habilitation dissertation, University of Tübingen, 2001). The latter allows us to resolve the harpsichord with a high-resolution hybrid grid, which is required to capture fine modes of the simulated eigenfrequencies. We computed the first 16 eigenmodes and eigenfrequencies with a resolution of 1.8 billion unknowns each on the Shaheen II supercomputer (https://www.hpc.kaust.edu.sa/content/shaheenii). To verify our results, we compare them with measurement data obtained from an experimental modal analysis of a real reference harpsichord.

Mesh generation for thin layered domains and its application to parallel multigrid simulation of groundwater flow (Computing and Visualization in Science, Springer Science and Business Media LLC, 2020-04-20) [Article] The generation of detailed three-dimensional meshes for the simulation of groundwater flow in thin layered domains is crucial to capture important properties of the underlying domains and to reach a satisfying accuracy. At the same time, this level of detail poses high demands on both suitable hardware and numerical solver efficiency. Parallel multigrid methods have been shown to exhibit near-optimal weak scalability for massively parallel computations of density-driven flow. A fully automated, parameterized algorithm for prism-based meshing of coarse grids from height data of individual layers is presented. Special structures like pinch-outs of individual layers are preserved. The resulting grid is used as a starting point for parallel mesh and hierarchy creation through interweaved projected refinement and redistribution. The efficiency and applicability of the proposed approach are demonstrated for a parallel multigrid-based simulation of a realistic sample problem.

Hierarchical Matrix Approximations of Hessians Arising in Inverse Problems Governed by PDEs (arXiv, 2020-03-23) [Preprint] Hessian operators arising in inverse problems governed by partial differential equations (PDEs) play a critical role in delivering efficient, dimension-independent convergence both for the Newton solution of deterministic inverse problems and for Markov chain Monte Carlo sampling of posteriors in the Bayesian setting. These methods require the ability to repeatedly perform such operations on the Hessian as multiplication with arbitrary vectors, solving linear systems, inversion, and (inverse) square root. Unfortunately, the Hessian is a (formally) dense, implicitly defined operator that is intractable to form explicitly for practical inverse problems, requiring as many PDE solves as there are inversion parameters. Low-rank approximations are effective when the data contain limited information about the parameters, but become prohibitive as the data become more informative. However, the Hessians for many inverse problems arising in practical applications can be well approximated by matrices that have hierarchically low-rank structure. Hierarchical matrix representations promise to overcome the high complexity of dense representations and provide effective data structures and matrix operations that have only log-linear complexity. In this work, we describe algorithms for constructing and updating hierarchical matrix approximations of Hessians, and illustrate them on a number of representative inverse problems involving time-dependent diffusion, advection-dominated transport, frequency-domain acoustic wave propagation, and low-frequency Maxwell equations, demonstrating up to an order of magnitude speedup compared to globally low-rank approximations.

Performance study of sustained petascale direct numerical simulation on Cray XC40 systems (Concurrency and Computation: Practice and Experience, Wiley, 2020-03-17) [Article] We present in this paper a comprehensive performance study of highly efficient extreme-scale direct numerical simulations of secondary flows, using an optimized version of Nek5000. Our investigations are conducted on various Cray XC40 systems, using a very high-order spectral element method. Single-node efficiency is achieved by auto-generated assembly implementations of small matrix multiplies and key vector-vector operations, streaming lossless I/O compression, aggressive loop merging, and selective single-precision evaluations. Comparative studies at scale across different Cray XC40 systems, Trinity (LANL), Cori (NERSC), and Shaheen II (KAUST), show that the Cray programming environment, network configuration, parallel file system, and burst buffer all have a major impact on performance. All three systems possess similar hardware, with similar CPU nodes and parallel file systems, but they have different theoretical peak network bandwidths, different OSs, and different versions of the programming environment. Our study reveals how these slight configuration differences can be critical for application performance. We also find that with 9216 nodes (294,912 cores) on the Trinity XC40 the application sustains petascale performance, as well as 50% of peak memory bandwidth over the entire solver (500 TB/s in aggregate). On 3072 Xeon Phi nodes of Cori, we reach 378 TFLOP/s with an aggregated bandwidth of 310 TB/s, corresponding to a time-to-solution 2.11× faster than obtained with the same number of (dual-socket) Xeon nodes.

Entropy stable h/p-nonconforming discretization with the summation-by-parts property for the compressible Euler and Navier–Stokes equations (SN Partial Differential Equations and Applications, Springer Science and Business Media LLC, 2020-03-16) [Article] In this paper, the entropy conservative/stable algorithms presented by Del Rey Fernandez and coauthors [18,16,17] for the compressible Euler and Navier–Stokes equations on nonconforming p-refined/coarsened curvilinear grids are extended to h/p refinement/coarsening. The main difficulty in developing nonconforming algorithms is the construction of appropriate coupling procedures across nonconforming interfaces. Here, a computationally simple and efficient approach based upon decoupled interpolation operators is utilized. The resulting scheme is entropy conservative/stable and elementwise conservative. Numerical simulations of the isentropic vortex and of viscous shock propagation confirm the entropy conservation/stability and accuracy properties of the method (achieving ~p + 1 convergence), which are comparable to those of the original conforming scheme [4,35]. Simulations of the Taylor–Green vortex at Re = 1,600 and of turbulent flow past a sphere at Re = 2,000 show the robustness and stability properties of the overall spatial discretization for unstructured grids. Finally, to demonstrate the entropy conservation property of a fully discrete explicit entropy stable algorithm with h/p refinement/coarsening, we present the time evolution of the entropy function obtained by simulating the propagation of the isentropic vortex using a relaxation Runge–Kutta scheme.

Relaxation Runge–Kutta Methods: Fully Discrete Explicit Entropy-Stable Schemes for the Compressible Euler and Navier–Stokes Equations (SIAM Journal on Scientific Computing, Society for Industrial & Applied Mathematics (SIAM), 2020-03-12) [Article] The framework of inner-product-norm preserving relaxation Runge–Kutta methods [D. I. Ketcheson, SIAM J. Numer. Anal., 57 (2019), pp. 2850–2870] is extended to general convex quantities. Conservation, dissipation, or other solution properties with respect to any convex functional are enforced by the addition of a relaxation parameter that multiplies the Runge–Kutta update at each step. Moreover, other desirable stability (such as strong stability preservation) and efficiency (such as low storage requirements) properties are preserved. The technique can be applied to both explicit and implicit Runge–Kutta methods and requires only a small modification to existing implementations. The computational cost at each step is the solution of one additional scalar algebraic equation for which a good initial guess is available. The effectiveness of this approach is proved analytically and demonstrated in several numerical examples, including applications to high-order entropy-conservative and entropy-stable semidiscretizations on unstructured grids for the compressible Euler and Navier–Stokes equations.
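The relaxation mechanism is easy to demonstrate for the simplest convex functional, a squared norm. The sketch below is a hypothetical harmonic-oscillator test, not the paper's Euler/Navier–Stokes setting: each RK4 update is scaled by a relaxation parameter γ chosen so that the energy is conserved exactly. Because the functional here is quadratic, the scalar equation for γ has a closed-form root; in general it is a scalar root-finding problem with a good initial guess γ ≈ 1.

```python
import numpy as np

def f(y):
    """Harmonic oscillator y = [x, v]: x' = v, v' = -x (illustrative test problem)."""
    return np.array([y[1], -y[0]])

def rk4_increment(y, dt):
    """Classical RK4: return the increment d = y_new - y, not the new state."""
    k1 = f(y); k2 = f(y + dt / 2 * k1); k3 = f(y + dt / 2 * k2); k4 = f(y + dt * k3)
    return dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def energy(y):
    return 0.5 * (y[0] ** 2 + y[1] ** 2)   # the convex functional to conserve

def relaxed_step(y, dt):
    d = rk4_increment(y, dt)
    # E(y + gamma*d) = E(y) is quadratic in gamma; take the nonzero root.
    # (A full relaxation method also advances time by gamma*dt.)
    a = 0.5 * (d @ d)
    b = y @ d
    gamma = -b / a if a > 0 else 1.0
    return y + gamma * d

y = np.array([1.0, 0.0])
E0 = energy(y)
for _ in range(10000):
    y = relaxed_step(y, 0.1)
```

For this problem γ stays within O(dt⁴) of 1, so accuracy and stability of the underlying RK4 step are retained while the plain method's slow energy drift is removed.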

Solution of the 3D density-driven groundwater flow problem with uncertain porosity and permeability (GEM - International Journal on Geomathematics, Springer Science and Business Media LLC, 2020-03-02) [Article] The pollution of groundwater, essential for supporting populations and agriculture, can have catastrophic consequences. Thus, accurate modeling of water pollution at the surface and in groundwater aquifers is vital. Here, we consider a density-driven groundwater flow problem with uncertain porosity and permeability. Addressing this problem is relevant for geothermal reservoir simulations, natural saline-disposal basins, modeling of contaminant plumes, and subsurface flow predictions. This strongly nonlinear, time-dependent problem describes the convection of a two-phase flow, whereby a liquid flows and propagates into groundwater reservoirs under the force of gravity to form so-called “fingers”. To achieve an accurate numerical solution, fine spatial resolution with an unstructured mesh and, therefore, high computational resources are required. Here we run the parallelized simulation toolbox ug4 with a geometric multigrid solver on a parallel cluster, and the parallelization is carried out in the physical and stochastic spaces. Additionally, we demonstrate how the ug4 toolbox can be run in a black-box fashion for testing different scenarios in the density-driven flow. As a benchmark, we solve the Elder-like problem in a 3D domain. For approximations in the stochastic space, we use the generalized polynomial chaos expansion. We compute the mean, variance, and exceedance probabilities of the mass fraction. We use the solution obtained from the quasi-Monte Carlo method as a reference solution.

Fully implicit hybrid two-level domain decomposition algorithms for two-phase flows in porous media on 3D unstructured grids (Journal of Computational Physics, Elsevier BV, 2020-02-07) [Article] Simulation of subsurface flows in porous media is difficult due to the nonlinearity of the operators and the high heterogeneity of the material coefficients. In this paper, we present a scalable fully implicit solver for incompressible two-phase flows based on overlapping domain decomposition methods. Specifically, an inexact Newton–Krylov algorithm with an analytic Jacobian is used to solve the nonlinear systems arising from the discontinuous Galerkin discretization of the governing equations on 3D unstructured grids. The linear Jacobian system is preconditioned by additive Schwarz algorithms, which are naturally suitable for parallel computing. We propose a hybrid two-level version of the additive Schwarz preconditioner, consisting of a nested coarse space, to improve the robustness and scalability of the classical one-level version. On the coarse level, a smaller linear system, arising from the same discretization of the problem on a coarse grid, is solved using GMRES with a one-level preconditioner until a relative tolerance is reached. Numerical experiments are presented to demonstrate the effectiveness and efficiency of the proposed solver for 3D heterogeneous-medium problems. We also report the parallel scalability of the proposed algorithms on a supercomputer with up to 8,192 processor cores.
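The one-level additive Schwarz idea underlying the preconditioner can be sketched for a toy 1D Poisson system: apply the inverse of each overlapping subdomain block to the residual and sum the contributions. The subdomain count, overlap width, and CG driver below are illustrative stand-ins for the paper's 3D discontinuous Galerkin Newton–Krylov setting.

```python
import numpy as np

n, nsub, overlap = 200, 4, 5
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian (SPD)
b = np.ones(n)

# Overlapping index sets covering [0, n)
size = n // nsub
doms = [np.arange(max(0, i * size - overlap), min(n, (i + 1) * size + overlap))
        for i in range(nsub)]

def M_inv(r):
    """One-level additive Schwarz: sum of local solves (restrict, solve, prolong)."""
    z = np.zeros_like(r)
    for idx in doms:
        z[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
    return z

def pcg(A, b, M_inv, tol=1e-10, maxit=500):
    """Textbook preconditioned conjugate gradients."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r); p = z.copy(); rz = r @ z; it = 0
    while np.linalg.norm(r) > tol and it < maxit:
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p; r -= alpha * Ap
        z = M_inv(r); rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new; it += 1
    return x, it

x, iters = pcg(A, b, M_inv)
```

Without a coarse space, the iteration count of this one-level method grows with the number of subdomains; the paper's nested coarse level addresses exactly that scalability limitation.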

Square-Root Variable Metric based null-space shuttle: a characterization of the nonuniqueness in elastic full-waveform inversion (Journal of Geophysical Research: Solid Earth, American Geophysical Union (AGU), 2020-01-28) [Article] Full-waveform inversion (FWI) is, for most geophysical applications, an ill-posed inverse problem with nonunique solutions. We examine its nonuniqueness by exploring the null-space shuttle, which can efficiently generate an ensemble of data-fitting solutions. We construct this shuttle based on a quasi-Newton method, the square-root variable-metric (SRVM) method. The latter provides access to the inverse data-misfit Hessian in FWI for large-scale applications. Combining the SRVM method with a randomised singular value decomposition, we obtain the eigenvector subspaces of the inverse data-misfit Hessian. Its primary eigenvalue and eigenvector are considered to determine the null space of the inversion result. Using the SRVM-based null-space shuttle, we can modify the inverted result a posteriori in a highly efficient manner without corrupting the data misfit. Also, because the SRVM method is embedded through elastic FWI, our method can be extended to multiparameter problems. We confirm and highlight our approach with the elastic Marmousi example.

Hierarchical algorithms on hierarchical architectures (Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, The Royal Society, 2020-01-20) [Article] A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures, and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architectures possess hierarchies of their own that do not generally align with those of the algorithm. We describe modules of a software toolkit, Hierarchical Computations on Manycore Architectures (HiCMA), that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.

Performance analysis of tile low-rank Cholesky factorization using PaRSEC instrumentation tools (IEEE, 2020-01-15) [Conference Paper] This paper highlights the necessary development of new instrumentation tools within the PaRSEC task-based runtime system to leverage the performance of low-rank matrix computations. In particular, the tile low-rank (TLR) Cholesky factorization represents one of the most critical matrix operations toward solving challenging large-scale scientific applications. The challenge resides in the heterogeneous arithmetic intensity of the various computational kernels, which stresses PaRSEC's dynamic engine when orchestrating the task executions at runtime. Such an irregular workload imposes the deployment of new scheduling heuristics to privilege the critical path, while exposing task parallelism to maximize hardware occupancy. To measure the effectiveness of PaRSEC's engine and its various scheduling strategies in tackling such workloads, it becomes paramount to implement adequate performance analysis and profiling tools tailored to fine-grained and heterogeneous task execution. This permits us not only to provide insights from PaRSEC, but also to identify potential application performance bottlenecks. These instrumentation tools may actually foster synergism between application and PaRSEC developers for productivity as well as high-performance computing purposes. We demonstrate the benefits of these amenable tools while assessing the performance of the TLR Cholesky factorization from data-distribution, communication-reducing and synchronization-reducing perspectives. This tool-assisted performance analysis results in three major contributions: a new hybrid data distribution, a new hierarchical TLR Cholesky algorithm, and a new performance model for tuning the tile size. The new TLR Cholesky factorization achieves an 8× performance speedup over existing implementations on massively parallel supercomputers, toward solving large-scale 3D climate and weather prediction applications.

Mixed-Precision Tomographic Reconstructor Computations on Hardware Accelerators (IEEE, 2019-12-31) [Conference Paper] The computation of tomographic reconstructors (ToRs) is at the core of a simulation framework to design the next generation of adaptive optics (AO) systems to be installed on future Extremely Large Telescopes (ELTs). In fact, it is also a critical component for their operation on sky. The goals of these instruments range from the detection of the light from the most distant galaxies to the analysis of the composition of terrestrial exoplanet atmospheres. Based on advanced AO techniques, the instrument MOSAIC relies on a computational framework to filter out the Earth's atmospheric turbulence and eventually enhance the image quality of ground-based telescopes. The ToR calculation is a compute-bound operation based on the Cholesky factorization. Due to its cubic algorithmic complexity, the ToR may represent a major bottleneck for the E-ELT when scaling up the large number of wavefront sensors used in the baseline MOSAIC design. To mitigate this increasing dimensionality overhead, this paper presents the implementation of a novel mixed-precision Cholesky-based dense matrix solver on hardware accelerators. The new algorithm takes into account the data-sparse structure of the covariance matrix operator and uses the tensor cores of NVIDIA V100 GPUs to leverage performance at an unprecedented scale. To our knowledge, this is the first computational astronomy application that exploits the V100's tensor cores outside of the traditional arena of artificial intelligence. Experimental results demonstrate the accuracy, robustness, and high performance of the mixed-precision ToR on synthetic datasets, which paves the way for future instrument deployments on the E-ELT.
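The core numerical idea, factoring in low precision and recovering accuracy through cheap refinement steps in higher precision, can be sketched in a few lines. The matrix below is a hypothetical well-conditioned SPD stand-in for the covariance operator, and single precision stands in for the tensor-core half precision used in the paper.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Mixed-precision Cholesky sketch: factor once in float32, then use the cheap
# low-precision factorization repeatedly inside double-precision iterative
# refinement to recover a high-accuracy solution.
rng = np.random.default_rng(0)
n = 500
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # well-conditioned SPD test matrix
b = rng.standard_normal(n)

c_lo = cho_factor(A.astype(np.float32))            # low-precision factorization
x = cho_solve(c_lo, b.astype(np.float32)).astype(np.float64)
for _ in range(5):                                 # refinement in double precision
    r = b - A @ x                                  # high-precision residual
    x += cho_solve(c_lo, r.astype(np.float32)).astype(np.float64)
res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
```

Each refinement step reuses the O(n³) factorization at only O(n²) cost, so the overall solve runs at low-precision speed while converging to near double-precision accuracy, provided the matrix is not too ill-conditioned for the working precision.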

On the robustness and performance of entropy stable discontinuous collocation methods for the compressible Navier–Stokes equations (Submitted to Journal of Computational Physics, arXiv, 2019-11-21) [Preprint] In computational fluid dynamics, the demand for increasingly reliable multidisciplinary simulations, for both analysis and design optimization purposes, requires transformational advances in individual components of future solvers. At the algorithmic level, hardware compatibility and efficiency are of paramount importance in determining viability at exascale and beyond. However, equally important (if not more so) is algorithmic robustness with minimal user intervention, which becomes progressively more challenging to achieve as problem size and physics complexity increase. We numerically show that low- and high-order entropy stable discontinuous spatial discretizations based on summation-by-parts operators and the simultaneous-approximation-terms technique provide an essential step toward a truly enabling technology, in terms of reliability and robustness, for both under-resolved turbulent flow simulations and flows with discontinuities.