KAUST Supercomputing Laboratory (KSL)
Recent Submissions
-
Einkorn genomics sheds light on history of the oldest domesticated wheat (Springer Science and Business Media LLC, 2023-08-02) [Article]
Einkorn (Triticum monococcum) was the first domesticated wheat species, and was central to the birth of agriculture and the Neolithic Revolution in the Fertile Crescent around 10,000 years ago [1,2]. Here we generate and analyse 5.2-Gb genome assemblies for wild and domesticated einkorn, including completely assembled centromeres. Einkorn centromeres are highly dynamic, showing evidence of ancient and recent centromere shifts caused by structural rearrangements. Whole-genome sequencing analysis of a diversity panel uncovered the population structure and evolutionary history of einkorn, revealing complex patterns of hybridizations and introgressions after the dispersal of domesticated einkorn from the Fertile Crescent. We also show that around 1% of the modern bread wheat (Triticum aestivum) A subgenome originates from einkorn. These resources and findings highlight the history of einkorn evolution and provide a basis to accelerate the genomics-assisted improvement of einkorn and bread wheat.
-
HPC-based genome variant calling workflow (HPC-GVCW) (Cold Spring Harbor Laboratory, 2023-06-26) [Preprint]
A high-performance computing genome variant calling workflow was designed to run GATK on HPC platforms. This workflow efficiently called an average of 27.3 M, 32.6 M, 168.9 M, and 16.2 M SNPs for rice, sorghum, maize, and soybean, respectively, on the most recently released high-quality reference sequences. Analysis of a rice pan-genome reference panel revealed 2.1 M novel SNPs that have yet to be publicly released.
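To illustrate how such a workflow might drive GATK on an HPC scheduler, the sketch below submits one HaplotypeCaller job per sample through SLURM. The sample names, paths, and resource requests are hypothetical and do not reproduce the actual HPC-GVCW implementation.

```python
"""Minimal sketch of per-sample GATK HaplotypeCaller submission on a SLURM
cluster. Sample names, paths, and resource requests are hypothetical; the
actual HPC-GVCW workflow may organize its jobs differently."""

import subprocess
from pathlib import Path

REFERENCE = Path("refs/rice_reference.fa")   # assumed reference genome
SAMPLES = ["sample_001", "sample_002"]       # assumed sample identifiers

SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=hc_{sample}
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

gatk HaplotypeCaller \\
    -R {reference} \\
    -I bams/{sample}.bam \\
    -O gvcfs/{sample}.g.vcf.gz \\
    -ERC GVCF
"""

def submit(sample: str) -> None:
    """Write a per-sample batch script and hand it to the scheduler."""
    script = Path(f"jobs/hc_{sample}.sbatch")
    script.parent.mkdir(exist_ok=True)
    script.write_text(SBATCH_TEMPLATE.format(sample=sample, reference=REFERENCE))
    subprocess.run(["sbatch", str(script)], check=True)

if __name__ == "__main__":
    for s in SAMPLES:
        submit(s)
```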
-
Data sets for "HPC-based genome variant calling workflow (HPC-GVCW)" (2023-06-14) [Dataset]
-
Evaluation of next generation of high-order compressible fluid dynamic solvers on the cloud computing for complex industrial flows (Array, Elsevier BV, 2022-12-09) [Article]
Industrially relevant computational fluid dynamics simulations frequently require vast computational resources that are only available to governments, wealthy corporations, and wealthy institutions. Thus, in many contexts and realities, high-performance computing grids and cloud resources on demand should be evaluated as viable alternatives to conventional computing clusters. In this work, we present an analysis of the time-to-solution and cost of an entropy stable collocated discontinuous Galerkin (SSDC) compressible computational fluid dynamics framework on Ibex, the on-premises cluster at KAUST, and the Amazon Web Services Elastic Compute Cloud for complex compressible flows. SSDC is a prototype of the next generation of computational fluid dynamics frameworks developed following the road map established by the NASA CFD Vision 2030. We simulate complex flow problems using high-order accurate, fully discrete entropy stable algorithms. In terms of time-to-solution, the Amazon Elastic Compute Cloud delivers the best performance, with the Graviton2 processors based on the Arm architecture being the fastest. However, the results also indicate that the Ibex nodes based on the AMD Rome architecture deliver good performance, close to that observed for the Amazon Elastic Compute Cloud. Furthermore, we observed that computations performed on the Ibex on-premises cluster are currently less expensive than those performed in the cloud. Our findings could be used to develop guidelines for selecting high-performance computing cloud resources to simulate realistic fluid flow problems.
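A back-of-the-envelope sketch of the time-to-solution versus cost trade-off discussed above is given below; all runtimes and per-node-hour rates are placeholder values, not figures from the article.

```python
"""Toy comparison of time-to-solution versus cost for an on-premises cluster
and a cloud instance type. All runtimes and hourly rates are placeholders
chosen only to show the arithmetic of the comparison."""

def job_cost(runtime_hours: float, nodes: int, price_per_node_hour: float) -> float:
    """Total cost of a run that occupies `nodes` nodes for `runtime_hours`."""
    return runtime_hours * nodes * price_per_node_hour

scenarios = {
    "On-premises AMD Rome nodes": dict(runtime_hours=10.0, nodes=16, price_per_node_hour=1.0),
    "Cloud Graviton2 instances":  dict(runtime_hours=8.5,  nodes=16, price_per_node_hour=2.0),
}

for name, s in scenarios.items():
    print(f"{name}: {s['runtime_hours']:.1f} h, cost = {job_cost(**s):,.2f}")
```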
-
Uauy-Lab/monococcum_introgressions: Analysis of monococcum introgressions into hexaploid wheat (Github, 2022-08-10) [Software]
Analysis of monococcum introgressions into hexaploid wheat.
-
Reply to “Comment on ‘Origin of symmetry-forbidden high-order harmonic generation in the time-dependent Kohn-Sham formulation’” (Physical Review A, APS, 2022-04-06) [Article]
In reply to the Comment by O. Neufeld et al. [Phys. Rev. A 105, 047101 (2022)], we argue that the conclusions of Phys. Rev. A 103, 043106 (2021) remain valid. We disprove the claim that the unphysical even-order harmonics originate from convergence issues related to reflections at the boundary of the simulation box. By additional calculations, we show that such reflections perturb the high-order harmonic generation spectra by oscillations with periods much smaller than the distance between the harmonics. We also demonstrate that the convergence argument of the Comment, in contrast to our multielectron excitations argument, cannot explain why there are no unphysical even-order harmonics in one-electron systems. Moreover, we show that the argument put forward in the Comment to conclude that the time-dependent Kohn-Sham equations are superior to the time-dependent natural Kohn-Sham equations is not valid.
-
IBEXCluster/Wheat-SNPCaller: Wheat SNP Caller pipeline (Github, 2022-03-27) [Software]
Wheat SNP Caller pipeline.
-
CENH3 information from: Einkorn genomics sheds light on history of the oldest domesticated wheat (Dryad, 2022) [Dataset]
Einkorn (Triticum monococcum) is the first domesticated wheat species, being central to the birth of agriculture and the Neolithic Revolution in the Fertile Crescent ~10,000 years ago. Here, we generate and analyze 5.2-gigabase genome assemblies for wild and domesticated einkorn, including completely assembled centromeres. Einkorn centromeres are highly dynamic, showing evidence of ancient and recent centromere shifts caused by structural rearrangements. Whole-genome sequencing of a diversity panel uncovered the population structure and evolutionary history of einkorn, revealing complex patterns of hybridizations and introgressions following the dispersal of domesticated einkorn from the Fertile Crescent. We also discovered that around 1% of the modern bread wheat (Triticum aestivum) A subgenome originates from einkorn. These resources and findings highlight the history of einkorn evolution and provide a basis to accelerate the genomics-assisted improvement of einkorn and bread wheat.
-
Data for: Einkorn genomics sheds light on history of the oldest domesticated wheat (Dryad, 2021) [Dataset]
Einkorn (Triticum monococcum) is the first domesticated wheat species, being central to the birth of agriculture and the Neolithic Revolution in the Fertile Crescent ~10,000 years ago. Here, we generate and analyze 5.2-gigabase genome assemblies for wild and domesticated einkorn, including completely assembled centromeres. Einkorn centromeres are highly dynamic, showing evidence of ancient and recent centromere shifts caused by structural rearrangements. Whole-genome sequencing of a diversity panel uncovered the population structure and evolutionary history of einkorn, revealing complex patterns of hybridizations and introgressions following the dispersal of domesticated einkorn from the Fertile Crescent. We also discovered that around 1% of the modern bread wheat (Triticum aestivum) A subgenome originates from einkorn. These resources and findings highlight the history of einkorn evolution and provide a basis to accelerate the genomics-assisted improvement of einkorn and bread wheat.
-
CUDACLAW: A high-performance programmable GPU framework for the solution of hyperbolic PDEs (arXiv, 2018-05-21) [Preprint]
We present cudaclaw, a CUDA-based high-performance data-parallel framework for the solution of multidimensional hyperbolic partial differential equation (PDE) systems, equations describing wave motion. cudaclaw allows computational scientists to solve such systems on GPUs without being burdened by the need to write CUDA code or to worry about thread and block details, data layout, and data movement between the different levels of the memory hierarchy. The user defines the set of PDEs to be solved via a CUDA-independent serial Riemann solver, and the framework takes care of orchestrating the computations and data transfers to maximize arithmetic throughput. cudaclaw treats the different spatial dimensions separately to allow suitable block sizes and dimensions to be used in the different directions, and includes a number of optimizations to minimize access to global memory.
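For context, a Clawpack-style pointwise Riemann solver for the 1D advection equation is sketched below; it only illustrates the kind of serial, device-agnostic kernel a user supplies and is not cudaclaw's actual interface.

```python
"""Illustrative pointwise Riemann solver in the Clawpack style for the 1D
advection equation q_t + u q_x = 0. This is a sketch of the kind of serial,
CUDA-independent kernel a user might write; the actual cudaclaw interface
may differ."""

import numpy as np

ADVECTION_SPEED = 1.0  # constant transport velocity u

def riemann_advection(q_left: float, q_right: float):
    """Solve the Riemann problem at one cell interface.

    Returns the single wave, its speed, and the left/right-going fluctuations
    (A^- dq and A^+ dq) used by wave-propagation finite-volume updates.
    """
    wave = q_right - q_left
    speed = ADVECTION_SPEED
    amdq = min(speed, 0.0) * wave   # left-going fluctuation
    apdq = max(speed, 0.0) * wave   # right-going fluctuation
    return wave, speed, amdq, apdq

# Example: apply the solver across every interface of a small 1D grid.
q = np.sin(np.linspace(0.0, 2.0 * np.pi, 8))
for i in range(len(q) - 1):
    print(riemann_advection(q[i], q[i + 1]))
```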
-
A Fault-Tolerant HPC Scheduler Extension for Large and Operational Ensemble Data Assimilation: Application to the Red Sea (Journal of Computational Science, Elsevier BV, 2018-04-26) [Article]
A fully parallel ensemble data assimilation and forecasting system has been developed for the Red Sea, based on the MIT general circulation model (MITgcm) to simulate the Red Sea circulation and the Data Assimilation Research Testbed (DART) ensemble assimilation software. An important limitation of operational ensemble assimilation systems is the risk of ensemble members’ collapse. This can happen when the filter update step imposes large corrections on one or more of the forecasted ensemble members that are not fully consistent with the model physics. Increasing the ensemble size is expected to improve the performance of the assimilation system, but obviously increases the risk of members’ collapse. Hardware failures or slow numerical convergence for some members also become more frequent. In this context, manually steering the whole process is a real challenge and makes the implementation of the ensemble assimilation procedure difficult and extremely time consuming. This paper presents our efforts to build an efficient and fault-tolerant MITgcm-DART ensemble assimilation system capable of operationally running thousands of members. Built on top of Decimate, a scheduler extension developed to ease the submission, monitoring and dynamic steering of workflows of dependent jobs in a fault-tolerant environment, we describe the implementation of the assimilation system and discuss its coupling strategies in detail. Within Decimate, only a few additional lines of Python are needed to define flexible convergence criteria and to implement any necessary actions on the forecast ensemble members, for instance (i) restarting a faulty job in case of job failure, (ii) changing the random seed in case of poor convergence or numerical instability, (iii) adjusting (reducing or increasing) the number of parallel forecasts on the fly, or (iv) replacing members on the fly to enrich the ensemble with new members. We demonstrate the efficiency of the system with numerical experiments assimilating real satellite sea surface height and temperature observations in the Red Sea.
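The sketch below gives a sense of the kind of per-member check that could be expressed in a few lines of Python on top of a fault-tolerant workflow manager such as Decimate; the function names and file layout are hypothetical and do not reproduce Decimate's actual API.

```python
"""Hypothetical per-member convergence check for an ensemble assimilation
workflow. The function names, file layout, and criteria are illustrative only
and are not Decimate's actual API."""

import random
from pathlib import Path

def check_member(member_id: int, workdir: Path) -> str:
    """Decide what to do with one forecast member after it finishes.

    Returns one of 'ok', 'restart', or 'reseed' based on simple illustrative
    criteria: a missing restart file signals a failed job, and a NaN flag in
    the log signals poor numerical convergence.
    """
    member_dir = workdir / f"member_{member_id:04d}"
    restart_file = member_dir / "pickup.data"
    log_file = member_dir / "run.log"

    if not restart_file.exists():
        return "restart"                    # job failed: resubmit as-is
    if log_file.exists() and "NaN" in log_file.read_text():
        return "reseed"                     # diverged: perturb the seed and rerun
    return "ok"

def new_seed() -> int:
    """Draw a replacement random seed for a member flagged as diverging."""
    return random.randint(1, 2**31 - 1)
```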
-
Evidence for topological type-II Weyl semimetal WTe2 (Nature Communications, Springer Nature, 2017-12-15) [Article]
Recently, a type-II Weyl fermion was theoretically predicted to appear at the contact of electron and hole Fermi surface pockets. A distinguishing feature of the surfaces of type-II Weyl semimetals is the existence of topological surface states, so-called Fermi arcs. Although WTe2 was the first material suggested as a type-II Weyl semimetal, the direct observation of its tilted Weyl cone and Fermi arc has not yet been successful. Here, we show strong evidence that WTe2 is a type-II Weyl semimetal by observing two unique transport properties simultaneously in one WTe2 nanoribbon. The negative magnetoresistance induced by a chiral anomaly is quite anisotropic in WTe2 nanoribbons: it is present in b-axis ribbons but absent in a-axis ribbons. An extra quantum oscillation, arising from a Weyl orbit formed by the Fermi arc and bulk Landau levels, displays a two-dimensional character and decays as the WTe2 nanoribbon thickness increases.
-
Scaling of a Fast Fourier Transform and a pseudo-spectral fluid solver up to 196608 cores (Journal of Parallel and Distributed Computing, Elsevier BV, 2017-11-04) [Article]
In this paper we present scaling results of an FFT library, FFTK, and a pseudospectral code, Tarang, on grid resolutions up to 8192^3 using 65536 cores of Blue Gene/P and 196608 cores of Cray XC40 supercomputers. We observe that communication dominates computation, more so on the Cray XC40. The computation time scales as T_comp ∼ p^(-1), and the communication time as T_comm ∼ n^(-γ2), with γ2 ranging from 0.7 to 0.9 for Blue Gene/P, and from 0.43 to 0.73 for Cray XC40. FFTK, and the fluid and convection solvers of Tarang, exhibit weak as well as strong scaling nearly up to the 196608 cores of the Cray XC40. We perform a comparative study of the performance on the Blue Gene/P and Cray XC40 clusters.
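The reported power laws can be turned into a toy performance model; the prefactors and core counts in the sketch below are placeholders chosen only to illustrate how a sub-linear communication exponent erodes strong-scaling efficiency, and are not fits from the paper.

```python
"""Toy model of the reported scaling laws: computation time falling as p**-1
and communication time as p**(-gamma2). Prefactors and core counts are
placeholder values used only to illustrate the trend."""

def total_time(p: int, t_comp0: float = 1000.0, t_comm0: float = 100.0,
               gamma2: float = 0.73) -> float:
    """Predicted wall time on p cores under the two power laws."""
    return t_comp0 / p + t_comm0 / p**gamma2

base_p = 1536
base_t = total_time(base_p)
for p in (1536, 12288, 98304, 196608):
    t = total_time(p)
    efficiency = (base_t * base_p) / (t * p)   # strong-scaling efficiency vs. base run
    print(f"p={p:>6}  predicted time={t:8.3f}  efficiency={efficiency:5.2f}")
```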
-
Scientific Applications Performance Evaluation on Burst Buffer (Lecture Notes in Computer Science, Springer Nature, 2017-10-20) [Book Chapter]
Parallel I/O is an integral component of modern high-performance computing, especially for storing and processing very large datasets, as in seismic imaging, CFD, combustion, and weather modeling. The storage hierarchy nowadays includes additional layers, the latest being the use of SSD-based storage as a Burst Buffer for I/O acceleration. We present an in-depth analysis of how to use the Burst Buffer for specific cases and of how the internal MPI I/O aggregators operate according to the options that the user provides at job submission. We analyze the performance of a range of I/O-intensive scientific applications, at various scales, on a large installation of the Lustre parallel file system compared to an SSD-based Burst Buffer. Our results show a performance improvement over Lustre when using the Burst Buffer. Moreover, we show results from a data hierarchy library which indicate that the standard I/O approaches are not enough to get the expected performance from this technology. The performance gain in the total execution time of the studied applications is between 1.16 and 3 times compared to Lustre. One of the test cases achieved an impressive I/O throughput of 900 GB/s on the Burst Buffer.
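As one concrete example of steering MPI I/O aggregators through user-provided options, the mpi4py sketch below performs a collective write with ROMIO collective-buffering hints; the hint values and file path are illustrative, not the settings used in the study.

```python
"""Minimal mpi4py sketch of a collective write with ROMIO collective-buffering
hints, the mechanism through which MPI I/O aggregators are configured. Hint
values and the file path are illustrative; on Lustre or a Burst Buffer they
would be tuned per job."""

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hints steering collective buffering (aggregator count, buffer size).
info = MPI.Info.Create()
info.Set("romio_cb_write", "enable")
info.Set("cb_nodes", "8")              # illustrative aggregator count
info.Set("cb_buffer_size", "16777216")

local = np.full(1024, rank, dtype="d")              # each rank's contribution
amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
fh = MPI.File.Open(comm, "output.dat", amode, info)
offset = rank * local.nbytes
fh.Write_at_all(offset, local)                      # collective, aggregator-mediated write
fh.Close()
```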
-
Simulating MPI Applications: The SMPI Approach (IEEE Transactions on Parallel and Distributed Systems, IEEE, 2017-08-01) [Article]
This article summarizes our recent work and developments on SMPI, a flexible simulator of MPI applications. In this tool, we took particular care to ensure that our simulator can produce fast and accurate predictions in a wide variety of situations. Although we built SMPI on SimGrid, whose speed and accuracy had already been assessed in other contexts, moving such techniques to an HPC workload required significant additional effort. Obviously, an accurate modeling of communications and network topology was one of the keys to these achievements. Another, less obvious, key was the choice to combine in a single tool the possibility of doing both offline and online simulation.
-
Asynchronous Task-Based Parallelization of Algebraic Multigrid (Proceedings of the Platform for Advanced Scientific Computing Conference on - PASC '17, Association for Computing Machinery (ACM), 2017-06-23) [Conference Paper]
As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using hybrid MPI+OmpSs (from the Barcelona Supercomputing Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and electromagnetic diffusion, respectively. In time to solution for a full solve, the MPI+OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per Haswell node of the Cray XC40) and maintains this per-node advantage as both weak-scale to thousands of cores, with MPI between nodes.
-
CFD Modeling of a Multiphase Gravity Separator Vessel (2017-05-23) [Poster]
The poster highlights a CFD study that combines an Eulerian multi-fluid multiphase model with a Population Balance Model (PBM) to study the flow inside a typical multiphase gravity separator vessel (GSV) found in the oil and gas industry. The simulations were performed using the Ansys Fluent CFD package running on the KAUST supercomputer, Shaheen. A highlight of a scalability study is also presented, covering the effect of I/O bottlenecks and the use of the Hierarchical Data Format (HDF5) for collective and independent parallel reading of the case file. This work is an outcome of a research collaboration on an Aramco project using Shaheen.
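A minimal sketch of collective parallel reading of an HDF5 case file with h5py built against parallel HDF5 is given below, assuming a hypothetical file name and dataset layout; the actual Fluent case-file handling may differ.

```python
"""Sketch of parallel reading of an HDF5 file with h5py built against
parallel HDF5 (MPI-IO driver), one way to relieve a case-file I/O
bottleneck. The file name and dataset layout are hypothetical."""

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Every rank opens the same file through the MPI-IO driver.
with h5py.File("gsv_case.h5", "r", driver="mpio", comm=comm) as f:
    cells = f["mesh/cells"]                 # assumed dataset name
    n = cells.shape[0]
    chunk = (n + size - 1) // size          # contiguous slice per rank
    lo, hi = rank * chunk, min((rank + 1) * chunk, n)
    local_cells = cells[lo:hi]              # each rank reads only its slice

print(f"rank {rank}: read {local_cells.shape[0]} cells")
```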
-
Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions (Lecture Notes in Computer Science, Springer Nature, 2017-05-12) [Book Chapter]
Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical N-body algorithms such as the Fast Multipole Method (FMM). In the present work, we propose three independent strategies to improve partitioning and reduce communication. First, we show that the conventional wisdom of using space-filling-curve partitioning may not work well for boundary integral problems, which constitute a significant portion of FMM’s application user base. We propose an alternative method that modifies orthogonal recursive bisection to relieve the cell-partition misalignment that has previously kept it from scaling. Secondly, we optimize the granularity of communication to find the optimal balance between a bulk-synchronous collective communication of the local essential tree and an RDMA per task per cell. Finally, we take the dynamic sparse data exchange proposed by Hoefler et al. [1] and extend it to a hierarchical sparse data exchange, which is demonstrated at scale to be faster than the commonly used MPI_Alltoallv of the MPI library.
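For reference, a compact sketch of baseline orthogonal recursive bisection, the method the chapter modifies to fix cell-partition misalignment, is given below; it is illustrative only and not the authors' implementation.

```python
"""Compact sketch of orthogonal recursive bisection (ORB): points are split at
the median along the widest coordinate until the requested number of
partitions is reached. Illustrative only; the chapter's modified ORB differs."""

import numpy as np

def orb_partition(points: np.ndarray, n_parts: int) -> list:
    """Recursively bisect `points` (shape [N, d]) into `n_parts` groups."""
    if n_parts == 1:
        return [points]
    # Split along the axis with the largest spatial extent.
    axis = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
    order = np.argsort(points[:, axis])
    # Split proportionally so odd partition counts are handled too.
    left_parts = n_parts // 2
    cut = (len(points) * left_parts) // n_parts
    left, right = points[order[:cut]], points[order[cut:]]
    return orb_partition(left, left_parts) + orb_partition(right, n_parts - left_parts)

rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
parts = orb_partition(pts, 8)
print([len(p) for p in parts])   # roughly equal-sized partitions
```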
-
Spin Filtering in Epitaxial Spinel Films with Nanoscale Phase Separation (ACS Nano, American Chemical Society (ACS), 2017-05-10) [Article]
The coexistence of a ferromagnetic metallic phase and an antiferromagnetic insulating phase in nanoscale inhomogeneous perovskite oxides accounts for their colossal magnetoresistance. Although the model of spin-polarized electron transport across antiphase boundaries has been commonly employed to account for the large magnetoresistance (MR) in ferrites, the magnetic anomalies, namely the two magnetic phases and the enhanced molecular moment, remain unresolved. We observed a sizable MR in epitaxial spinel films (NiCo2O4-δ) that is much larger than that commonly observed in spinel ferrites. Detailed analysis reveals that this MR can be attributed to phase separation, in which a perfect ferrimagnetic metallic phase and a ferrimagnetic insulating phase coexist. The magnetic insulating phase plays an important role in spin filtering in these phase-separated spinel oxides, leading to a sizable MR effect. A spin-filtering model based on the Zeeman effect and direct tunneling is developed to account for the MR of the phase-separated films.