Recent Submissions

  • A Fault-Tolerant HPC Scheduler Extension for Large and Operational Ensemble Data Assimilation: Application to the Red Sea

    Toye, Habib; Kortas, Samuel; Zhan, Peng; Hoteit, Ibrahim (Journal of Computational Science, Elsevier BV, 2018-04-26) [Article]
    A fully parallel ensemble data assimilation and forecasting system has been developed for the Red Sea based on the MIT general circulation model (MITgcm) to simulate the Red Sea circulation and the Data Assimilation Research Testbed (DART) ensemble assimilation software. An important limitation of operational ensemble assimilation systems is the risk of ensemble members’ collapse. This can happen when the filter update step imposes large corrections on one or more of the forecasted ensemble members that are not fully consistent with the model physics. Increasing the ensemble size is expected to improve the performance of the assimilation system, but it also increases the risk of members’ collapse. Hardware failure or slow numerical convergence encountered for some members should also occur more frequently. In this context, manually steering the whole process is a real challenge and makes the implementation of the ensemble assimilation procedure difficult and extremely time-consuming. This paper presents our efforts to build an efficient and fault-tolerant MITgcm-DART ensemble assimilation system capable of operationally running thousands of members. Built on top of Decimate, a scheduler extension developed to ease the submission, monitoring and dynamic steering of workflows of dependent jobs in a fault-tolerant environment, we describe the assimilation system implementation and discuss in detail its coupling strategies. Within Decimate, only a few additional lines of Python are needed to define flexible convergence criteria and to apply any necessary actions to the forecast ensemble members, for instance (i) restarting a faulty job in case of job failure, (ii) changing the random seed in case of poor convergence or numerical instability, (iii) adjusting (reducing or increasing) the number of parallel forecasts on the fly, (iv) replacing members on the fly to enrich the ensemble with new members, etc. We demonstrate the efficiency of the system with numerical experiments assimilating real satellite sea surface height and temperature observations in the Red Sea.
  • Evidence for topological type-II Weyl semimetal WTe2

    Li, Peng; Wen, Yan; He, Xin; Zhang, Qiang; Xia, Chuan; Yu, Zhi-Ming; Yang, Shengyuan A.; Zhu, Zhiyong; Alshareef, Husam N.; Zhang, Xixiang (Nature Communications, Springer Nature, 2017-12-11) [Article]
    Recently, a type-II Weyl fermion was theoretically predicted to appear at the contact of electron and hole Fermi surface pockets. A distinguishing feature of the surfaces of type-II Weyl semimetals is the existence of topological surface states, so-called Fermi arcs. Although WTe2 was the first material suggested as a type-II Weyl semimetal, the direct observation of its tilting Weyl cone and Fermi arc has not yet been successful. Here, we show strong evidence that WTe2 is a type-II Weyl semimetal by observing two unique transport properties simultaneously in one WTe2 nanoribbon. The negative magnetoresistance induced by a chiral anomaly is quite anisotropic in WTe2 nanoribbons, which is present in b-axis ribbon, but is absent in a-axis ribbon. An extra-quantum oscillation, arising from a Weyl orbit formed by the Fermi arc and bulk Landau levels, displays a two dimensional feature and decays as the thickness increases in WTe2 nanoribbon.
  • Scaling of a Fast Fourier Transform and a pseudo-spectral fluid solver up to 196608 cores

    Chatterjee, Anando G.; Verma, Mahendra K.; Kumar, Abhishek; Samtaney, Ravi; Hadri, Bilel; Khurram, Rooh Ul Amin (Journal of Parallel and Distributed Computing, Elsevier BV, 2017-11-04) [Article]
    In this paper we present scaling results of an FFT library, FFTK, and a pseudospectral code, Tarang, on grid resolutions up to 8192^3 using 65536 cores of Blue Gene/P and 196608 cores of Cray XC40 supercomputers. We observe that communication dominates computation, more so on the Cray XC40. The computation time scales as T_comp ∼ p^{-1}, and the communication time as T_comm ∼ n^{-γ_2}, with γ_2 ranging from 0.7 to 0.9 for Blue Gene/P, and from 0.43 to 0.73 for Cray XC40. FFTK, and the fluid and convection solvers of Tarang exhibit weak as well as strong scaling nearly up to 196608 cores of Cray XC40. We perform a comparative study of the performance on the Blue Gene/P and Cray XC40 clusters.
  • Asynchronous Task-Based Parallelization of Algebraic Multigrid

    AlOnazi, Amani A.; Markomanolis, George S.; Keyes, David E. (Proceedings of the Platform for Advanced Scientific Computing Conference on - PASC '17, ACM Press, 2017-06-23) [Conference Paper]
    As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using a hybrid MPI+OmpSs (from the Barcelona Supercomputing Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and the electromagnetic diffusion, respectively. In time to solution for a full solve an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per node advantage as both weak scale to thousands of cores, with MPI between nodes.
  • Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions

    AbdulJabbar, Mustafa Abdulmajeed; Markomanolis, George S.; Ibeid, Huda; Yokota, Rio; Keyes, David E. (Lecture Notes in Computer Science, Springer International Publishing, 2017-05-11) [Book Chapter]
    Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical N-Body algorithms like Fast Multipole Method (FMM). In the present work, we propose three independent strategies to improve partitioning and reduce communication. First, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which constitute a significant portion of FMM’s application user base. We propose an alternative method that modifies orthogonal recursive bisection to relieve the cell-partition misalignment that has kept it from scaling previously. Secondly, we optimize the granularity of communication to find the optimal balance between a bulk-synchronous collective communication of the local essential tree and an RDMA per task per cell. Finally, we take the dynamic sparse data exchange proposed by Hoefler et al. [1] and extend it to a hierarchical sparse data exchange, which is demonstrated at scale to be faster than the MPI library’s MPI_Alltoallv that is commonly used.
  • Spin Filtering in Epitaxial Spinel Films with Nanoscale Phase Separation

    Li, Peng; Xia, Chuan; Li, Jun; Zhu, Zhiyong; Wen, Yan; Zhang, Qiang; Zhang, Junwei; Peng, Yong; Alshareef, Husam N.; Zhang, Xixiang (ACS Nano, American Chemical Society (ACS), 2017-05-08) [Article]
    The coexistence of ferromagnetic metallic phase and antiferromagnetic insulating phase in nanoscaled inhomogeneous perovskite oxides accounts for the colossal magnetoresistance. Although the model of spin-polarized electron transport across antiphase boundaries has been commonly employed to account for large magnetoresistance (MR) in ferrites, the magnetic anomalies, the two magnetic phases and enhanced molecular moment, are still unresolved. We observed a sizable MR in epitaxial spinel films (NiCo2O4-δ) that is much larger than that commonly observed in spinel ferrites. Detailed analysis reveals that this MR can be attributed to phase separation, in which the perfect ferrimagnetic metallic phase and ferrimagnetic insulating phase coexist. The magnetic insulating phase plays an important role in spin filtering in these phase separated spinel oxides, leading to a sizable MR effect. A spin filtering model based on Zeeman effect and direct tunneling is developed to account for MR of the phase separated films.
  • Tuning OpenACC loop execution

    Feki, Saber; Smaoui, Malek (Parallel Programming with OpenACC, Elsevier BV, 2017-01-07) [Book Chapter]
    The purpose of this chapter is to help OpenACC developers who are already familiar with the basic and essential directives to further improve their code performance by adding more descriptive clauses to OpenACC loop constructs. At the end of this chapter the reader will (i) have a better understanding of the purpose of the OpenACC loop construct and its associated clauses, illustrated with use cases, and (ii) be able to use the acquired knowledge in practice to further improve the performance of OpenACC accelerated codes.
  • Strain engineering in monolayer WS2, MoS2, and the WS2/MoS2 heterostructure

    He, Xin; Li, Hai; Zhu, Zhiyong; Dai, Zhenyu; Yang, Yang; Yang, Peng; Zhang, Qiang; Li, Peng; Schwingenschlögl, Udo; Zhang, Xixiang (Applied Physics Letters, AIP Publishing, 2016-10-27) [Article]
    Mechanically exfoliated monolayers of WS2, MoS2 and their van der Waals heterostructure were fabricated on flexible substrate so that uniaxial tensile strain can be applied to the two-dimensional samples. The modification of the band structure under strain was investigated by micro-photoluminescence spectroscopy at room temperature as well as by first-principles calculations. Exciton and trion emissions were observed in both WS2 and the heterostructure at room temperature, and were redshifted by strain, indicating potential for applications in flexible electronics and optoelectronics.
  • Ultrathin Epitaxial Ferromagnetic γ-Fe2O3 Layer as High Efficiency Spin Filtering Materials for Spintronics Device Based on Semiconductors

    Li, Peng; Xia, Chuan; Zhu, Zhiyong; Wen, Yan; Zhang, Qiang; Alshareef, Husam N.; Zhang, Xixiang (Advanced Functional Materials, Wiley-Blackwell, 2016-06-01) [Article]
    In spintronics, identifying an effective technique for generating spin-polarized current has fundamental importance. The spin-filtering effect across a ferromagnetic insulating layer originates from unequal tunneling barrier heights for spin-up and spin-down electrons, which has shown great promise for use in different ferromagnetic materials. However, the low spin-filtering efficiency in some materials can be ascribed partially to the difficulty in fabricating high-quality thin film with high Curie temperature and/or partially to the improper model used to extract the spin-filtering efficiency. In this work, a new technique is successfully developed to fabricate high-quality, ferrimagnetic insulating γ-Fe2O3 films as spin filters. To extract the spin-filtering effect of γ-Fe2O3 films more accurately, a new model is proposed based on Fowler–Nordheim tunneling and the Zeeman effect to obtain the spin polarization of the tunneling currents. Spin polarization of the tunneled current can be as high as −94.3% at 2 K in a 6.5 nm thick γ-Fe2O3 layer, and the spin polarization decays monotonically with temperature. Although the spin-filter effect is not very high at room temperature, this work demonstrates that spinel ferrites are very promising materials for spin injection into semiconductors at low temperature, which is important for development of novel spintronics devices. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
  • Community use of XALT in its first year in production

    Budiardja, Reuben; Fahey, Mark; McLay, Robert; Don, Prasad Maddumage; Hadri, Bilel; James, Doug (Proceedings of the Second International Workshop on HPC User Support Tools - HUST '15, Association for Computing Machinery (ACM), 2015-11-15) [Conference Paper]
    XALT collects accurate, detailed, and continuous job-level and link-time data and stores that data in a database; all the data collection is transparent to the users. The data stored can be mined to generate a picture of the compilers, libraries, and other software that users need to run their jobs successfully, highlighting the products that researchers use. We showcase how data collected by XALT can be easily mined into a digestible format by presenting data from four separate HPC centers. XALT is already used by many HPC centers around the world due to its usefulness and complementariness to existing logs and databases. Centers with XALT have a much better understanding of library and executable usage and patterns. We also present new functionality in XALT - namely the ability to anonymize data and early work in providing seamless access to provenance data.
  • Zonal Detached-Eddy Simulation of Turbulent Unsteady Flow over Iced Airfoils

    Zhang, Yue; Habashi, Wagdi G.; Khurram, Rooh Ul Amin (Journal of Aircraft, American Institute of Aeronautics and Astronautics (AIAA), 2015-07-23) [Article]
    This paper presents a multiscale finite-element formulation for the second mode of zonal detached-eddy simulation. The multiscale formulation corrects the lack of stability of the standard Galerkin formulation by incorporating the effect of unresolved scales to the grid (resolved) scales. The stabilization terms arise naturally and are free of user-defined stability parameters. Validation of the method is accomplished via the turbulent flow over tandem cylinders. The boundary-layer separation, free shear-layer rollup, vortex shedding from the upstream cylinder, and interaction with the downstream cylinder are well reproduced. Good agreement with experimental measurements gives credence to the accuracy of zonal detached-eddy simulation in modeling turbulent separated flows. A comprehensive study is then conducted on the performance degradation of ice-contaminated airfoils. NACA 23012 airfoil with a spanwise ice ridge and Gates Learjet Corporation-305 airfoil with a leading-edge horn-shape glaze ice are selected for investigation. Appropriate spanwise domain size and sufficient grid density are determined to enhance the reliability of the simulations. A comparison of lift coefficient and flowfield variables demonstrates the added advantage that the zonal detached-eddy simulation model brings to the Spalart-Allmaras turbulence model. Spectral analysis and instantaneous visualization of turbulent structures are also highlighted via zonal detached-eddy simulation. Copyright © 2015 by the CFD Lab of McGill University. Published by the American Institute of Aeronautics and Astronautics, Inc.
  • Performance Analysis of FEM Algorithms on GPU and Many-Core Architectures

    Khurram, Rooh; Kortas, Samuel (2015-04-27) [Presentation]
    The roadmaps of the leading supercomputer manufacturers are based on hybrid systems, which consist of a mix of conventional processors and accelerators. This trend is mainly due to the fact that the power consumption cost of future CPU-only Exascale systems will be unsustainable; thus accelerators such as graphics processing units (GPUs) and many-integrated-core (MIC) processors will likely be an integral part of the TOP500 supercomputers beyond 2020. The emerging supercomputer architecture will bring new challenges for code developers. Continuum mechanics codes will particularly be affected, because the traditional synchronous implicit solvers will probably not scale on hybrid Exascale machines. In the previous study [1], we reported on the performance of a conjugate gradient based mesh motion algorithm [2] on Sandy Bridge, Xeon Phi, and K20c. In the present study we report on the comparative study of finite element codes, using PETSc and AmgX solvers on CPU and GPUs, respectively [3,4]. We believe this study will be a good starting point for FEM code developers who are contemplating a CPU to accelerator transition.
  • Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications

    Siddiqui, Shahzeb; Alzayer, Fatemah; Feki, Saber (High Performance Computing for Computational Science -- VECPAR 2014, Springer Science + Business Media, 2015-04-17) [Conference Paper]
    The performance optimization of scientific applications usually requires an in-depth knowledge of the hardware and software. A performance tuning mechanism is suggested to automatically tune OpenACC parameters to adapt to the execution environment on a given system. A historic learning based methodology is suggested to prune the parameter search space for a more efficient auto-tuning process. This approach is applied to tune the OpenACC gang and vector clauses for a better mapping of the compute kernels onto the underlying architecture. Our experiments show a significant performance improvement against the default compiler parameters and drastic reduction in tuning time compared to a brute force search-based approach.
  • Controlled surface segregation leads to efficient coke-resistant nickel/platinum bimetallic catalysts for the dry reforming of methane

    Li, Lidong; Zhou, Lu; Ould-Chikh, Samy; Anjum, Dalaver H.; Kanoun, Mohammed; Scaranto, Jessica; Hedhili, Mohamed Nejib; Khalid, Syed; Laveille, Paco; D'Souza, Lawrence; Clo, Alain M.; Basset, Jean-Marie (ChemCatChem, Wiley-Blackwell, 2015-02-03) [Article]
    Surface composition and structure are of vital importance for heterogeneous catalysts, especially for bimetallic catalysts, which often vary as a function of reaction conditions (known as surface segregation). The preparation of bimetallic catalysts with controlled metal surface composition and structure is very challenging. In this study, we synthesize a series of Ni/Pt bimetallic catalysts with controlled metal surface composition and structure using a method derived from surface organometallic chemistry. The evolution of the surface composition and structure of the obtained bimetallic catalysts under simulated reaction conditions is investigated by various techniques, which include CO-probe IR spectroscopy, high-angle annular dark-field scanning transmission electron microscopy, energy-dispersive X-ray spectroscopy, extended X-ray absorption fine structure analysis, X-ray absorption near-edge structure analysis, XRD, and X-ray photoelectron spectroscopy. It is demonstrated that the structure of the bimetallic catalyst is evolved from Pt monolayer island-modified Ni nanoparticles to core-shell bimetallic nanoparticles composed of a Ni-rich core and a Ni/Pt alloy shell upon thermal treatment. These catalysts are active for the dry reforming of methane, and their catalytic activities, stabilities, and carbon formation vary with their surface composition and structure. The reform of reforming: A series of alumina-supported Ni/Pt bimetallic nanoparticles (NPs) with controlled surface composition and structure are prepared. Remarkable surface segregation for these bimetallic NPs is observed upon thermal treatment. These bimetallic NPs are active catalysts for CO2 reforming of CH4, and their catalytic activities, stabilities, and carbon formation vary with their surface composition and structure.
  • Predicting wind-induced vibrations of high-rise buildings using unsteady CFD and modal analysis

    Zhang, Yue; Habashi, Wagdi G (Ed); Khurram, Rooh Ul Amin (Journal of Wind Engineering and Industrial Aerodynamics, Elsevier BV, 2015-01) [Article]
    This paper investigates the wind-induced vibration of the CAARC standard tall building model, via unsteady Computational Fluid Dynamics (CFD) and a structural modal analysis. In this numerical procedure, the natural unsteady wind in the atmospheric boundary layer is modeled with an artificial inflow turbulence generation method. Then, the turbulent flow is simulated by the second mode of a Zonal Detached-Eddy Simulation, and a conservative quadrature-projection scheme is adopted to transfer unsteady loads from fluid to structural nodes. The aerodynamic damping that represents the fluid-structure interaction mechanism is determined by empirical functions extracted from wind tunnel experiments. Eventually, the flow solutions and the structural responses in terms of mean and root mean square quantities are compared with experimental measurements, over a wide range of reduced velocities. The significance of turbulent inflow conditions and aeroelastic effects is highlighted. The current methodology provides predictions of good accuracy and can be considered as a preliminary design tool to evaluate the unsteady wind effects on tall buildings.
  • Predictive Performance Tuning of OpenACC Accelerated Applications

    Siddiqui, Shahzeb; Feki, Saber (2014-05-04) [Poster]
    Graphics Processing Units (GPUs) are gradually becoming mainstream in supercomputing as their capabilities to significantly accelerate a large spectrum of scientific applications have been clearly identified and proven. Moreover, with the introduction of high level programming models such as OpenACC [1] and OpenMP 4.0 [2], these devices are becoming more accessible and practical to use by a larger scientific community. However, performance optimization of OpenACC accelerated applications usually requires an in-depth knowledge of the hardware and software specifications. We suggest a prediction-based performance tuning mechanism [3] to quickly tune OpenACC parameters for a given application to dynamically adapt to the execution environment on a given system. This approach is applied to a finite difference kernel to tune the OpenACC gang and vector clauses for mapping the compute kernels into the underlying accelerator architecture. Our experiments show a significant performance improvement against the default compiler parameters and a faster tuning by an order of magnitude compared to the brute force search tuning.
  • Open problems in CEM: Porting an explicit time-domain volume-integral-equation solver on GPUs with OpenACC

    Ergül, Özgür; Feki, Saber; Al-Jarro, Ahmed; Clo, Alain M.; Bagci, Hakan (IEEE Antennas and Propagation Magazine, Institute of Electrical and Electronics Engineers (IEEE), 2014-04) [Article]
    Graphics processing units (GPUs) are gradually becoming mainstream in high-performance computing, as their capabilities for enhancing performance of a large spectrum of scientific applications to many fold when compared to multi-core CPUs have been clearly identified and proven. In this paper, implementation and performance-tuning details for porting an explicit marching-on-in-time (MOT)-based time-domain volume-integral-equation (TDVIE) solver onto GPUs are described in detail. To this end, a high-level approach, utilizing the OpenACC directive-based parallel programming model, is used to minimize two often-faced challenges in GPU programming: developer productivity and code portability. The MOT-TDVIE solver code, originally developed for CPUs, is annotated with compiler directives to port it to GPUs in a fashion similar to how OpenMP targets multi-core CPUs. In contrast to CUDA and OpenCL, where significant modifications to CPU-based codes are required, this high-level approach therefore requires minimal changes to the codes. In this work, we make use of two available OpenACC compilers, CAPS and PGI. Our experience reveals that different annotations of the code are required for each of the compilers, due to different interpretations of the fairly new standard by the compiler developers. Both versions of the OpenACC accelerated code achieved significant performance improvements, with up to 30× speedup against the sequential CPU code using recent hardware technology. Moreover, we demonstrated that the GPU-accelerated fully explicit MOT-TDVIE solver leveraged energy-consumption gains of the order of 3× against its CPU counterpart. © 2014 IEEE.
  • Automatic performance tuning of parallel and accelerated seismic imaging kernels

    Haberdar, Hakan; Siddiqui, Shahzeb; Feki, Saber (EAGE Workshop on High Performance Computing for Upstream, EAGE Publications, 2014) [Conference Paper]
    With the increased complexity and diversity of mainstream high performance computing systems, significant effort is required to tune parallel applications in order to achieve the best possible performance for each particular platform. This task is becoming more and more challenging and requires a larger set of skills. Automatic performance tuning is becoming a must for optimizing applications such as Reverse Time Migration (RTM), widely used in seismic imaging for oil and gas exploration. An empirical search based auto-tuning approach is applied to the MPI communication operations of the parallel isotropic and tilted transverse isotropic kernels. The application of auto-tuning using the Abstract Data and Communication Library improved the performance of the MPI communications as well as developer productivity by providing a higher level of abstraction. Keeping productivity in mind, we opted for pragma based programming for accelerated computation on the latest accelerated architectures such as GPUs using the fairly new OpenACC standard. The same auto-tuning approach is also applied to the OpenACC accelerated seismic code for optimizing the compute intensive kernel of the Reverse Time Migration application. The application of this technique resulted in improved performance of the original code and in its ability to adapt to different execution environments.
  • Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting

    Dongarra, Jack; Faverge, Mathieu; Ltaief, Hatem; Luszczek, Piotr R. (Concurrency and Computation: Practice and Experience, Wiley-Blackwell, 2013-09-18) [Article]
    The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. 
For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd.
  • Data-driven execution of fast multipole methods

    Ltaief, Hatem; Yokota, Rio (Concurrency and Computation: Practice and Experience, Wiley-Blackwell, 2013-09-17) [Article]
    Fast multipole methods (FMMs) have O(N) complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load balancing becomes a non-trivial question. A common strategy for load balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. The authors discuss in the paper another approach based on data-driven execution to efficiently tackle this challenging load balancing problem. The core idea consists of breaking the most time-consuming stages of the FMMs into smaller tasks. The algorithm can then be represented as a directed acyclic graph where nodes represent tasks and edges represent dependencies among them. The execution of the algorithm is performed by asynchronously scheduling the tasks using the queueing and runtime for kernels runtime environment, in a way such that data dependencies are not violated for numerical correctness purposes. This asynchronous scheduling results in an out-of-order execution. The performance results of the data-driven FMM execution outperform the previous strategy and show linear speedup on a quad-socket quad-core Intel Xeon system. Copyright © 2013 John Wiley & Sons, Ltd.
