    • Scalable Hierarchical Algorithms for eXtreme Computing (SHAXC-2) Workshop 2014

    Enabling High Performance Large Scale Dense Problems through KBLAS

    Abdelfattah, Ahmad; Keyes, David E.; Ltaief, Hatem (2014-05-04) [Poster]
    KBLAS (KAUST BLAS) is a small library that provides highly optimized BLAS routines on systems accelerated with GPUs. KBLAS is written entirely in CUDA C and targets NVIDIA GPUs with compute capability 2.0 (Fermi) or higher. The current focus is on level-2 BLAS routines, namely the general matrix-vector multiplication (GEMV) kernel and the symmetric/Hermitian matrix-vector multiplication (SYMV/HEMV) kernel. KBLAS provides these two kernels in all four precisions (s, d, c, and z), with support for multi-GPU systems. Through advanced optimization techniques that target latency hiding and push memory bandwidth to the limit, KBLAS outperforms state-of-the-art kernels by 20-90%. Competitors include CUBLAS-5.5, MAGMABLAS-1.4.0, and CULA R17. The SYMV/HEMV kernel from KBLAS has been adopted by NVIDIA and should appear in CUBLAS-6.0. KBLAS has been used in large-scale simulations of multi-object adaptive optics.
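For readers unfamiliar with the SYMV operation named in the abstract above, the following is a minimal reference sketch in plain Python of what the kernel computes: y = A x for a symmetric matrix A, reading only the upper triangle. It is not the KBLAS implementation (which is CUDA C and heavily optimized); the function name `symv_upper` and the list-of-lists storage are assumptions made purely for illustration.

```python
def symv_upper(a, x):
    """Return y = A @ x for a symmetric matrix A.

    Only the upper triangle of `a` (entries with j >= i) is read, so the
    lower triangle may hold arbitrary values. `a` is a list of rows and
    `x` a list of numbers. Illustrative reference only, not KBLAS code.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        # The diagonal entry contributes once.
        y[i] += a[i][i] * x[i]
        for j in range(i + 1, n):
            # Each stored off-diagonal entry stands for both A[i][j] and A[j][i].
            y[i] += a[i][j] * x[j]
            y[j] += a[i][j] * x[i]
    return y
```

Because each off-diagonal entry is loaded once and used twice, a SYMV kernel can in principle move roughly half the matrix data of a general GEMV, which is why the operation is sensitive to memory bandwidth.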

    Pipelining Computational Stages of the Tomographic Reconstructor for Multi-Object Adaptive Optics on a Multi-GPU System

    Charara, Ali; Ltaief, Hatem; Gratadour, Damien; Keyes, David E.; Sevin, Arnaud; Abdelfattah, Ahmad; Gendron, Eric; Morel, Carine; Vidal, Fabrice (2014-05-04) [Poster]
    The European Extremely Large Telescope (E-ELT) is a high-priority project in ground-based astronomy that aims to construct the largest telescope ever built. MOSAIC is an instrument proposed for the E-ELT that uses the Multi-Object Adaptive Optics (MOAO) technique, which compensates for the effects of atmospheric turbulence on image quality and operates on patches across a large field of view (FoV).

    Community Detection for Large Graphs

    Peng, Chengbin; Kolda, Tamara G.; Pinar, Ali; Zhang, Zhihua; Keyes, David E. (2014-05-04) [Poster]
    Many real-world networks have inherent community structures, including social, transportation, and biological networks. For large-scale networks with millions or billions of nodes, there is demand for accelerating current community detection algorithms. We present two approaches to this problem: a k-core based framework that can significantly accelerate existing community detection algorithms, and a parallel inference algorithm based on stochastic block models that can distribute the workload.
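The abstract above does not describe how its k-core framework is built, so the sketch below shows only the standard definition it relies on: the k-core of a graph is the largest subgraph in which every node has at least k neighbours, found by repeatedly peeling off nodes of lower degree. The function name `k_core` and the adjacency-dictionary input are assumptions for illustration, not the authors' code.

```python
def k_core(adj, k):
    """Return the set of nodes in the k-core of an undirected graph.

    `adj` maps each node to the set of its neighbours. Nodes whose
    remaining degree drops below k are peeled away until none are left.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    # Start with every node that already violates the degree bound.
    stack = [v for v, d in degree.items() if d < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                # Removing v may push a neighbour below the bound.
                if degree[u] < k:
                    stack.append(u)
    return set(adj) - removed
```

One plausible use, stated here as an assumption, is to run an expensive community detection algorithm on the much smaller k-core first and then assign the peeled nodes afterwards.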

    Hierarchical matrix techniques for the solution of elliptic equations

    Chavez Chavez, Gustavo Ivan; Turkiyyah, George; Yokota, Rio; Keyes, David E. (2014-05-04) [Poster]
    Hierarchical matrix approximations are a promising tool for approximating low-rank matrices, given the compactness of their representation and the economy of the operations between them. Integral and differential operators have been the major applications of this technology, but it can be applied to other areas where low-rank properties exist. Such is the case of the Block Cyclic Reduction algorithm, which is used as a direct solver for the constant-coefficient Poisson equation. We explore the variable-coefficient case, also using Block Cyclic Reduction, with the addition of hierarchical matrices to represent matrix blocks, thereby improving the otherwise O(N^2) algorithm into an efficient O(N) algorithm.

    Nyström-discretized Magnetic Field Integral Equation for 2D Electromagnetic Scattering

    Al-Harthi, Noha A.; Ulku, Huseyin Arda; Yokota, Rio; Keyes, David E.; Bagci, Hakan (2014-05-04) [Poster]

    Fast Multipole-Based Preconditioner for Sparse Iterative Solvers

    Ibeid, Huda; Yokota, Rio; Keyes, David E. (2014-05-04) [Poster]
    Among optimal hierarchical algorithms for the computational solution of elliptic problems, the Fast Multipole Method (FMM) stands out for its adaptability to emerging architectures, having high arithmetic intensity, tunable accuracy, and relaxed global synchronization requirements. We demonstrate that, beyond its traditional use as a solver in problems for which explicit free-space kernel representations are available, the FMM has applicability as a preconditioner in finite domain elliptic boundary value problems, by equipping it with boundary integral capability for finite boundaries and by wrapping it in a Krylov method for extensibility to more general operators. Compared with multilevel methods, it is capable of comparable algebraic convergence rates down to the truncation error of the discretized PDE, and it has superior multicore and distributed memory scalability properties on commodity architecture supercomputers.

    Predictive Performance Tuning of OpenACC Accelerated Applications

    Siddiqui, Shahzeb; Feki, Saber (2014-05-04) [Poster]
    Graphics Processing Units (GPUs) are gradually becoming mainstream in supercomputing as their capabilities to significantly accelerate a large spectrum of scientific applications have been clearly identified and proven. Moreover, with the introduction of high level programming models such as OpenACC [1] and OpenMP 4.0 [2], these devices are becoming more accessible and practical to use by a larger scientific community. However, performance optimization of OpenACC accelerated applications usually requires an in-depth knowledge of the hardware and software specifications. We suggest a prediction-based performance tuning mechanism [3] to quickly tune OpenACC parameters for a given application to dynamically adapt to the execution environment on a given system. This approach is applied to a finite difference kernel to tune the OpenACC gang and vector clauses for mapping the compute kernels into the underlying accelerator architecture. Our experiments show a significant performance improvement against the default compiler parameters and a faster tuning by an order of magnitude compared to the brute force search tuning.

    Implicit Unstructured Computational Aerodynamics on Many-Integrated Core Architecture

    Al Farhan, Mohammed; Keyes, David E. (2014-05-04) [Poster]
    This research aims to understand the performance of PETSc-FUN3D, a fully nonlinear implicit unstructured-grid incompressible or compressible Euler code with origins at NASA and the U.S. DOE, on many-integrated core architecture, and how a hybrid programming paradigm (MPI+OpenMP) can exploit Intel Xeon Phi hardware with upwards of 60 cores per node and 4 threads per core. For the current contribution, we focus on strong scaling with many-integrated core hardware. In most implicit PDE-based codes, the linear algebraic kernel is limited by the bottleneck of memory bandwidth, whereas the flux kernel arising in control volume discretization of the conservation law residuals and the preconditioner for the Jacobian exploit the Phi hardware well.

    Asynchronous Execution of the Fast Multipole Method Using Charm++

    AbdulJabbar, Mustafa Abdulmajeed; Yokota, Rio; Keyes, David E. (2014-05-04) [Poster]

    Optimizing Stencil Computations: Multicore-optimized wavefront diamond blocking on Shared and Distributed Memory Systems

    Malas, Tareq Majed Yasin; Ltaief, Hatem; Hager, Georg; Wellein, Gerhard; Keyes, David E. (2014-05-04) [Poster]