Recent Submissions

  • CUDACLAW: A high-performance programmable GPU framework for the solution of hyperbolic PDEs

    Ohannessian, H. Gorune; Turkiyyah, George; Ahmadia, Aron; Ketcheson, David I. (arXiv, 2018-05-21) [Preprint]
    We present cudaclaw, a CUDA-based high performance data-parallel framework for the solution of multidimensional hyperbolic partial differential equation (PDE) systems, equations describing wave motion. cudaclaw allows computational scientists to solve such systems on GPUs without being burdened by the need to write CUDA code, worry about thread and block details, data layout, and data movement between the different levels of the memory hierarchy. The user defines the set of PDEs to be solved via a CUDA-independent serial Riemann solver and the framework takes care of orchestrating the computations and data transfers to maximize arithmetic throughput. cudaclaw treats the different spatial dimensions separately to allow suitable block sizes and dimensions to be used in the different directions, and includes a number of optimizations to minimize access to global memory.
  • A Fault-Tolerant HPC Scheduler Extension for Large and Operational Ensemble Data Assimilation: Application to the Red Sea

    Toye, Habib; Kortas, Samuel; Zhan, Peng; Hoteit, Ibrahim (Journal of Computational Science, Elsevier BV, 2018-04-26) [Article]
    A fully parallel ensemble data assimilation and forecasting system has been developed for the Red Sea based on the MIT general circulation model (MITgcm) to simulate the Red Sea circulation and the Data Assimilation Research Testbed (DART) ensemble assimilation software. An important limitation of operational ensemble assimilation systems is the risk of ensemble members’ collapse. This can happen when the filter update step imposes large corrections on one or more of the forecasted ensemble members that are not fully consistent with the model physics. Increasing the ensemble size is expected to improve the assimilation system performance, but also increases the risk of members’ collapse. Hardware failure or slow numerical convergence encountered for some members should also occur more frequently. In this context, manual steering of the whole process becomes a real challenge and makes the implementation of the ensemble assimilation procedure difficult and extremely time consuming. This paper presents our efforts to build an efficient and fault-tolerant MITgcm-DART ensemble assimilation system capable of operationally running thousands of members. Built on top of Decimate, a scheduler extension developed to ease the submission, monitoring and dynamic steering of workflows of dependent jobs in a fault-tolerant environment, we describe the assimilation system implementation and discuss in detail its coupling strategies.
    Within Decimate, only a few additional lines of Python are needed to define flexible convergence criteria and to apply any necessary actions to the forecast ensemble members, such as (i) restarting a faulty job in case of job failure, (ii) changing the random seed in case of poor convergence or numerical instability, (iii) adjusting (reducing or increasing) the number of parallel forecasts on the fly, (iv) replacing members on the fly to enrich the ensemble with new members, etc. We demonstrate the efficiency of the system with numerical experiments assimilating real satellite sea surface height and temperature observations in the Red Sea.
  • Evidence for topological type-II Weyl semimetal WTe2

    Li, Peng; Wen, Yan; He, Xin; Zhang, Qiang; Xia, Chuan; Yu, Zhi-Ming; Yang, Shengyuan A.; Zhu, Zhiyong; Alshareef, Husam N.; Zhang, Xixiang (Nature Communications, Springer Nature, 2017-12-15) [Article]
    Recently, a type-II Weyl fermion was theoretically predicted to appear at the contact of electron and hole Fermi surface pockets. A distinguishing feature of the surfaces of type-II Weyl semimetals is the existence of topological surface states, so-called Fermi arcs. Although WTe2 was the first material suggested as a type-II Weyl semimetal, the direct observation of its tilting Weyl cone and Fermi arc has not yet been successful. Here, we show strong evidence that WTe2 is a type-II Weyl semimetal by observing two unique transport properties simultaneously in one WTe2 nanoribbon. The negative magnetoresistance induced by a chiral anomaly is quite anisotropic in WTe2 nanoribbons: it is present in b-axis ribbons but absent in a-axis ribbons. An extra quantum oscillation, arising from a Weyl orbit formed by the Fermi arc and bulk Landau levels, displays a two-dimensional feature and decays as the thickness of the WTe2 nanoribbon increases.
  • Scaling of a Fast Fourier Transform and a pseudo-spectral fluid solver up to 196608 cores

    Chatterjee, Anando G.; Verma, Mahendra K.; Kumar, Abhishek; Samtaney, Ravi; Hadri, Bilel; Khurram, Rooh Ul Amin (Journal of Parallel and Distributed Computing, Elsevier BV, 2017-11-04) [Article]
    In this paper we present scaling results of a FFT library, FFTK, and a pseudospectral code, Tarang, on grid resolutions up to $8192^3$ using 65536 cores of Blue Gene/P and 196608 cores of Cray XC40 supercomputers. We observe that communication dominates computation, more so on the Cray XC40. The computation time scales as $T_\mathrm{comp} \sim p^{-1}$, and the communication time as $T_\mathrm{comm} \sim n^{-\gamma_2}$ with $\gamma_2$ ranging from 0.7 to 0.9 for Blue Gene/P, and from 0.43 to 0.73 for Cray XC40. FFTK, and the fluid and convection solvers of Tarang exhibit weak as well as strong scaling nearly up to 196608 cores of Cray XC40. We perform a comparative study of the performance on the Blue Gene/P and Cray XC40 clusters.
  • Scientific Applications Performance Evaluation on Burst Buffer

    Markomanolis, Georgios; Hadri, Bilel; Khurram, Rooh Ul Amin; Feki, Saber (Lecture Notes in Computer Science, Springer Nature, 2017-10-20) [Book Chapter]
    Parallel I/O is an integral component of modern high performance computing, especially in storing and processing very large datasets, as in seismic imaging, CFD, combustion and weather modeling. The storage hierarchy nowadays includes additional layers, the latest being the use of SSD-based storage as a Burst Buffer for I/O acceleration. We present an in-depth analysis of how to use Burst Buffer for specific cases and how the internal MPI I/O aggregators operate according to the options that the user provides during job submission. We analyze the performance of a range of I/O intensive scientific applications, at various scales, on a large installation of the Lustre parallel file system compared to an SSD-based Burst Buffer. Our results show a performance improvement over Lustre when using Burst Buffer. Moreover, we show results from a data hierarchy library which indicate that the standard I/O approaches are not enough to get the expected performance from this technology. The performance gain on the total execution time of the studied applications is between 1.16 and 3 times compared to Lustre. One of the test cases achieved an impressive I/O throughput of 900 GB/s on Burst Buffer.
  • Simulating MPI Applications: The SMPI Approach

    Degomme, Augustin; Legrand, Arnaud; Markomanolis, Georgios; Quinson, Martin; Stillwell, Mark; Suter, Frédéric (IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers (IEEE), 2017-08-01) [Article]
    This article summarizes our recent work and developments on SMPI, a flexible simulator of MPI applications. In this tool, we took particular care to ensure our simulator could be used to produce fast and accurate predictions in a wide variety of situations. Although we did build SMPI on SimGrid, whose speed and accuracy had already been assessed in other contexts, moving such techniques to an HPC workload required significant additional effort. Obviously, an accurate modeling of communications and network topology was one of the keys to such achievements. Another less obvious key was the choice to combine in a single tool the possibility to do both offline and online simulation.
  • Asynchronous Task-Based Parallelization of Algebraic Multigrid

    AlOnazi, Amani; Markomanolis, Georgios; Keyes, David E. (Proceedings of the Platform for Advanced Scientific Computing Conference on - PASC '17, Association for Computing Machinery (ACM), 2017-06-23) [Conference Paper]
    As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using a hybrid MPI+OmpSs (from the Barcelona Supercomputing Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and the electromagnetic diffusion, respectively. In time to solution for a full solve an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per node advantage as both weak scale to thousands of cores, with MPI between nodes.
  • CFD Modeling of a Multiphase Gravity Separator Vessel

    Narayan, Gautham; Khurram, Rooh Ul Amin; Elsaadawy, Ehab (2017-05-23) [Poster]
    The poster highlights a CFD study that incorporates a combined Eulerian multi-fluid multiphase and a Population Balance Model (PBM) to study the flow inside a typical multiphase gravity separator vessel (GSV) found in the oil and gas industry. The simulations were performed using the Ansys Fluent CFD package running on the KAUST supercomputer, Shaheen. A scalability study is also highlighted. The effect of I/O bottlenecks and of using Hierarchical Data Format (HDF5) for collective and independent parallel reading of the case file is presented. This work is an outcome of a research collaboration on an Aramco project on Shaheen.
  • Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions

    AbdulJabbar, Mustafa Abdulmajeed; Markomanolis, Georgios; Ibeid, Huda; Yokota, Rio; Keyes, David E. (Lecture Notes in Computer Science, Springer Nature, 2017-05-12) [Book Chapter]
    Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical N-Body algorithms like Fast Multipole Method (FMM). In the present work, we propose three independent strategies to improve partitioning and reduce communication. First, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which constitute a significant portion of FMM’s application user base. We propose an alternative method that modifies orthogonal recursive bisection to relieve the cell-partition misalignment that has kept it from scaling previously. Secondly, we optimize the granularity of communication to find the optimal balance between a bulk-synchronous collective communication of the local essential tree and an RDMA per task per cell. Finally, we take the dynamic sparse data exchange proposed by Hoefler et al. [1] and extend it to a hierarchical sparse data exchange, which is demonstrated at scale to be faster than the MPI library’s MPI_Alltoallv that is commonly used.
  • Spin Filtering in Epitaxial Spinel Films with Nanoscale Phase Separation

    Li, Peng; Xia, Chuan; Li, Jun; Zhu, Zhiyong; Wen, Yan; Zhang, Qiang; Zhang, Junwei; Peng, Yong; Alshareef, Husam N.; Zhang, Xixiang (ACS Nano, American Chemical Society (ACS), 2017-05-10) [Article]
    The coexistence of ferromagnetic metallic phase and antiferromagnetic insulating phase in nanoscaled inhomogeneous perovskite oxides accounts for the colossal magnetoresistance. Although the model of spin-polarized electron transport across antiphase boundaries has been commonly employed to account for large magnetoresistance (MR) in ferrites, the magnetic anomalies, the two magnetic phases and enhanced molecular moment, are still unresolved. We observed a sizable MR in epitaxial spinel films (NiCo2O4-δ) that is much larger than that commonly observed in spinel ferrites. Detailed analysis reveals that this MR can be attributed to phase separation, in which the perfect ferrimagnetic metallic phase and ferrimagnetic insulating phase coexist. The magnetic insulating phase plays an important role in spin filtering in these phase separated spinel oxides, leading to a sizable MR effect. A spin filtering model based on Zeeman effect and direct tunneling is developed to account for MR of the phase separated films.
  • Toward a fault-tolerant operational ensemble data assimilation forecasting system for the Red Sea

    Toye, Habib; Kortas, Samuel; Zhan, Peng; Hoteit, Ibrahim (2017-03-13) [Poster]
  • Tuning OpenACC loop execution

    Feki, Saber; Smaoui, Malek (Parallel Programming with OpenACC, Elsevier BV, 2017-01-07) [Book Chapter]
    The purpose of this chapter is to help OpenACC developers who are already familiar with the basic and essential directives to further improve the performance of their code by adding more descriptive clauses to OpenACC loop constructs. At the end of this chapter the reader will (i) have a better understanding of the purpose of the OpenACC loop construct and its associated clauses, illustrated with use cases, and (ii) be able to use the acquired knowledge in practice to further improve the performance of OpenACC-accelerated codes.
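    To illustrate the kind of descriptive loop clauses the chapter abstract refers to, here is a minimal sketch in C. It is not taken from the chapter: the function name `scale_add_2d`, the data layout, and the `vector_length(128)` value are all assumptions chosen for illustration. The outer loop is mapped to gangs and the inner loop to vector lanes; a compiler without OpenACC support simply ignores the pragmas and runs the loops serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: out[i][j] += a * in[i][j] on a row-major grid.
 * The clauses below only describe how the loops may be mapped to the device;
 * vector_length(128) is an illustrative value, not a tuned recommendation. */
void scale_add_2d(size_t rows, size_t cols, float a,
                  const float *in, float *out)
{
    assert(in != NULL && out != NULL);

    /* Outer loop: iterations are distributed across gangs. */
    #pragma acc parallel loop gang vector_length(128) \
        copyin(in[0:rows*cols]) copy(out[0:rows*cols])
    for (size_t i = 0; i < rows; ++i) {
        /* Inner loop: iterations are distributed across vector lanes. */
        #pragma acc loop vector
        for (size_t j = 0; j < cols; ++j) {
            out[i * cols + j] += a * in[i * cols + j];
        }
    }
}
```

    Which clause values perform best depends on the compiler and the target hardware, so values such as the vector length are normally found by measurement rather than fixed in advance.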
  • Strain engineering in monolayer WS2, MoS2, and the WS2/MoS2 heterostructure

    He, Xin; Li, Hai; Zhu, Zhiyong; Dai, Zhenyu; Yang, Yang; Yang, Peng; Zhang, Qiang; Li, Peng; Schwingenschlögl, Udo; Zhang, Xixiang (Applied Physics Letters, AIP Publishing, 2016-10-27) [Article]
    Mechanically exfoliated monolayers of WS2, MoS2 and their van der Waals heterostructure were fabricated on flexible substrate so that uniaxial tensile strain can be applied to the two-dimensional samples. The modification of the band structure under strain was investigated by micro-photoluminescence spectroscopy at room temperature as well as by first-principles calculations. Exciton and trion emissions were observed in both WS2 and the heterostructure at room temperature, and were redshifted by strain, indicating potential for applications in flexible electronics and optoelectronics.
  • Ultrathin Epitaxial Ferromagnetic γ-Fe2O3 Layer as High Efficiency Spin Filtering Materials for Spintronics Device Based on Semiconductors

    Li, Peng; Xia, Chuan; Zhu, Zhiyong; Wen, Yan; Zhang, Qiang; Alshareef, Husam N.; Zhang, Xixiang (Advanced Functional Materials, Wiley, 2016-06-01) [Article]
    In spintronics, identifying an effective technique for generating spin-polarized current has fundamental importance. The spin-filtering effect across a ferromagnetic insulating layer originates from unequal tunneling barrier heights for spin-up and spin-down electrons, which has shown great promise for use in different ferromagnetic materials. However, the low spin-filtering efficiency in some materials can be ascribed partially to the difficulty in fabricating high-quality thin films with high Curie temperature and/or partially to the improper model used to extract the spin-filtering efficiency. In this work, a new technique is successfully developed to fabricate high quality, ferrimagnetic insulating γ-Fe2O3 films as spin filters. To extract the spin-filtering effect of γ-Fe2O3 films more accurately, a new model is proposed based on Fowler–Nordheim tunneling and the Zeeman effect to obtain the spin polarization of the tunneling currents. Spin polarization of the tunneled current can be as high as −94.3% at 2 K in a 6.5 nm thick γ-Fe2O3 layer, and the spin polarization decays monotonically with temperature. Although the spin-filter effect is not very high at room temperature, this work demonstrates that spinel ferrites are very promising materials for spin injection into semiconductors at low temperature, which is important for development of novel spintronics devices. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
  • Community use of XALT in its first year in production

    Budiardja, Reuben; Fahey, Mark; McLay, Robert; Don, Prasad Maddumage; Hadri, Bilel; James, Doug (Proceedings of the Second International Workshop on HPC User Support Tools - HUST '15, Association for Computing Machinery (ACM), 2015-11-09) [Conference Paper]
    XALT collects accurate, detailed, and continuous job-level and link-time data and stores that data in a database; all the data collection is transparent to the users. The data stored can be mined to generate a picture of the compilers, libraries, and other software that users need to run their jobs successfully, highlighting the products that researchers use. We showcase how data collected by XALT can be easily mined into a digestible format by presenting data from four separate HPC centers. XALT is already used by many HPC centers around the world due to its usefulness and complementariness to existing logs and databases. Centers with XALT have a much better understanding of library and executable usage and patterns. We also present new functionality in XALT - namely the ability to anonymize data and early work in providing seamless access to provenance data.
  • Efficient fdCas9 Synthetic Endonuclease with Improved Specificity for Precise Genome Engineering

    Aouida, Mustapha; Eid, Ayman; Ali, Zahir; Cradick, Thomas; Lee, Ciaran; Deshmukh, Harshavardhan; Ahmed, Atef; Abu Samra, Dina Bashir Kamil; Gadhoum, Samah; Merzaban, Jasmeen; Bao, Gang; Mahfouz, Magdy M. (PLoS ONE, Public Library of Science (PLoS), 2015-07-30) [Article]
    The Cas9 endonuclease is used for genome editing applications in diverse eukaryotic species. A high frequency of off-target activity has been reported in many cell types, limiting its applications to genome engineering, especially in genomic medicine. Here, we generated a synthetic chimeric protein between the catalytic domain of the FokI endonuclease and the catalytically inactive Cas9 protein (fdCas9). A pair of guide RNAs (gRNAs) that bind to sense and antisense strands with a defined spacer sequence range can be used to form a catalytically active dimeric fdCas9 protein and generate double-strand breaks (DSBs) within the spacer sequence. Our data demonstrate an improved catalytic activity of the fdCas9 endonuclease, with a spacer range of 15–39 nucleotides, on surrogate reporters and genomic targets. Furthermore, we observed no detectable fdCas9 activity at known Cas9 off-target sites. Taken together, our data suggest that the fdCas9 endonuclease variant is a superior platform for genome editing applications in eukaryotic systems including mammalian cells.
  • Zonal Detached-Eddy Simulation of Turbulent Unsteady Flow over Iced Airfoils

    Zhang, Yue; Habashi, Wagdi G.; Khurram, Rooh Ul Amin (Journal of Aircraft, American Institute of Aeronautics and Astronautics (AIAA), 2015-07-23) [Article]
    This paper presents a multiscale finite-element formulation for the second mode of zonal detached-eddy simulation. The multiscale formulation corrects the lack of stability of the standard Galerkin formulation by incorporating the effect of unresolved scales to the grid (resolved) scales. The stabilization terms arise naturally and are free of user-defined stability parameters. Validation of the method is accomplished via the turbulent flow over tandem cylinders. The boundary-layer separation, free shear-layer rollup, vortex shedding from the upstream cylinder, and interaction with the downstream cylinder are well reproduced. Good agreement with experimental measurements gives credence to the accuracy of zonal detached-eddy simulation in modeling turbulent separated flows. A comprehensive study is then conducted on the performance degradation of ice-contaminated airfoils. NACA 23012 airfoil with a spanwise ice ridge and Gates Learjet Corporation-305 airfoil with a leading-edge horn-shape glaze ice are selected for investigation. Appropriate spanwise domain size and sufficient grid density are determined to enhance the reliability of the simulations. A comparison of lift coefficient and flowfield variables demonstrates the added advantage that the zonal detached-eddy simulation model brings to the Spalart-Allmaras turbulence model. Spectral analysis and instantaneous visualization of turbulent structures are also highlighted via zonal detached-eddy simulation. Copyright © 2015 by the CFD Lab of McGill University. Published by the American Institute of Aeronautics and Astronautics, Inc.
  • Performance Analysis of FEM Algorithms on GPU and Many-Core Architectures

    Khurram, Rooh; Kortas, Samuel (2015-04-27) [Presentation]
    The roadmaps of the leading supercomputer manufacturers are based on hybrid systems, which consist of a mix of conventional processors and accelerators. This trend is mainly due to the fact that the power consumption cost of future CPU-only Exascale systems will be unsustainable; thus accelerators such as graphics processing units (GPUs) and many-integrated-core (MIC) will likely be an integral part of the TOP500 supercomputers beyond 2020. The emerging supercomputer architecture will bring new challenges for code developers. Continuum mechanics codes will particularly be affected, because the traditional synchronous implicit solvers will probably not scale on hybrid Exascale machines. In the previous study [1], we reported on the performance of a conjugate gradient based mesh motion algorithm [2] on Sandy Bridge, Xeon Phi, and K20c. In the present study we report on a comparative study of finite element codes, using PETSc and AmgX solvers on CPUs and GPUs, respectively [3,4]. We believe this study will be a good starting point for FEM code developers who are contemplating a CPU to accelerator transition.
  • Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications

    Siddiqui, Shahzeb; Alzayer, Fatemah; Feki, Saber (High Performance Computing for Computational Science -- VECPAR 2014, Springer Nature, 2015-04-18) [Conference Paper]
    The performance optimization of scientific applications usually requires an in-depth knowledge of the hardware and software. A performance tuning mechanism is suggested to automatically tune OpenACC parameters to adapt to the execution environment on a given system. A historic learning based methodology is suggested to prune the parameter search space for a more efficient auto-tuning process. This approach is applied to tune the OpenACC gang and vector clauses for a better mapping of the compute kernels onto the underlying architecture. Our experiments show a significant performance improvement against the default compiler parameters and drastic reduction in tuning time compared to a brute force search-based approach.
  • Controlled surface segregation leads to efficient coke-resistant nickel/platinum bimetallic catalysts for the dry reforming of methane

    Li, Lidong; Zhou, Lu; Ould-Chikh, Samy; Anjum, Dalaver H.; Kanoun, Mohammed; Scaranto, Jessica; Hedhili, Mohamed N.; Khalid, Syed; Laveille, Paco; D'Souza, Lawrence; Clo, Alain M.; Basset, Jean-Marie (ChemCatChem, Wiley, 2015-02-03) [Article]
    Surface composition and structure are of vital importance for heterogeneous catalysts, especially for bimetallic catalysts, which often vary as a function of reaction conditions (known as surface segregation). The preparation of bimetallic catalysts with controlled metal surface composition and structure is very challenging. In this study, we synthesize a series of Ni/Pt bimetallic catalysts with controlled metal surface composition and structure using a method derived from surface organometallic chemistry. The evolution of the surface composition and structure of the obtained bimetallic catalysts under simulated reaction conditions is investigated by various techniques, which include CO-probe IR spectroscopy, high-angle annular dark-field scanning transmission electron microscopy, energy-dispersive X-ray spectroscopy, extended X-ray absorption fine structure analysis, X-ray absorption near-edge structure analysis, XRD, and X-ray photoelectron spectroscopy. It is demonstrated that the structure of the bimetallic catalyst is evolved from Pt monolayer island-modified Ni nanoparticles to core-shell bimetallic nanoparticles composed of a Ni-rich core and a Ni/Pt alloy shell upon thermal treatment. These catalysts are active for the dry reforming of methane, and their catalytic activities, stabilities, and carbon formation vary with their surface composition and structure. The reform of reforming: A series of alumina-supported Ni/Pt bimetallic nanoparticles (NPs) with controlled surface composition and structure are prepared. Remarkable surface segregation for these bimetallic NPs is observed upon thermal treatment. These bimetallic NPs are active catalysts for CO2 reforming of CH4, and their catalytic activities, stabilities, and carbon formation vary with their surface composition and structure.
