Recent Submissions

  • Optimization Specifications for CUDA Code Restructuring Tool

    Khan, Ayaz (2017-03-13)
    In this work we have developed a restructuring software tool (RT-CUDA) following the proposed optimization specifications to bridge the gap between high-level languages and the machine-dependent CUDA environment. RT-CUDA takes a C program and converts it into an optimized CUDA kernel, with user directives in a configuration file guiding the compiler. RT-CUDA also allows transparent invocation of the most optimized external math libraries, such as cuSPARSE and cuBLAS, enabling efficient design of linear algebra solvers. We expect RT-CUDA to be needed by many KSA industries dealing with science and engineering simulation on massively parallel computers like NVIDIA GPUs.
  • d3f: Parallel Simulation of Large-scale Groundwater Flow with ug4

    Wittum, Gabriel; Logashenko, Dmitry; Hoffer, Michael; Lampe, Michael; Nägel, Arne; Reiter, Sebastian; Vogel, Andreas (2017-03-13)
  • SPARTex: A Vertex-Centric Framework for RDF Data Analytics

    Abdelaziz, Ibrahim; Al-Harbi, Razen; Salihoglu, Semih; Kalnis, Panos; Mamoulis, Nikos (2017-03-13)
  • ScaleMine: Scalable Parallel Frequent Subgraph Mining in a Single Large Graph

    Abdelhamid, Ehab; Abdelaziz, Ibrahim; Kalnis, Panos; Khayyat, Zuhair; Jamour, Fuad Tarek (2017-03-13)
  • Likelihood Approximation With Parallel Hierarchical Matrices For Large Spatial Datasets

    Litvinenko, Alexander; Sun, Ying; Genton, Marc G.; Keyes, David E. (2017-03-13)
  • Earthquake Ground Motion Analysis and extreme computing on multi-Petaflops machine

    De Martin, Florent; Dupros, Fabrice; Thierry, Philippe; Paciucci, Gabriele; Sochala, Pierre; Boulahya, Faïza; Benaichouche, Abed; Chaljub, Emmanuel; Hadri, Bilel; Ltaief, Hatem; Keyes, David E. (2017-03-13)
  • Batched Triangular DLA for Very Small Matrices on GPUs

    Charara, Ali; Keyes, David E.; Ltaief, Hatem (2017-03-13)
    In several scientific applications, like tensor contractions in deep learning computation or data compression in hierarchical low rank matrix approximation, the bulk of computation typically resides in performing thousands of independent dense linear algebra operations on very small matrix sizes (usually less than 100). Batched dense linear algebra kernels are becoming ubiquitous for such scientific computations. Within a single API call, these kernels are capable of simultaneously launching a large number of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the utilization of the underlying hardware.
  • High-resolution seismic wave propagation using local time stepping

    Peter, Daniel; Rietmann, Max; Galvez, Percy; Ampuero, Jean Paul (2017-03-13)
    High-resolution seismic wave simulations often require local refinements in numerical meshes to accurately capture, e.g., steep topography or complex fault geometry. Together with explicit time schemes, this dramatically reduces the global time step size for ground-motion simulations due to numerical stability conditions. To alleviate this problem, local time stepping (LTS) algorithms allow an explicit time stepping scheme to adapt the time step to the element size, allowing near-optimal time steps everywhere in the mesh. This can potentially lead to significantly faster simulation runtimes.
  • Implicit Unstructured Aerodynamics on Emerging Multi- and Many-Core HPC Architectures

    Al Farhan, Mohammed A.; Kaushik, Dinesh K.; Keyes, David E. (2017-03-13)
    PETSc-FUN3D, an unstructured tetrahedral mesh Euler code previously characterized for distributed memory Single Program, Multiple Data (SPMD) execution on thousands of nodes, is hybridized with shared memory Single Instruction, Multiple Data (SIMD) parallelism for hundreds of threads per node. We explore thread-level performance optimizations on state-of-the-art multi- and many-core Intel processors, including the second generation of Xeon Phi, Knights Landing (KNL). We study the performance on the KNL with different configurations of memory and cluster modes, with code optimizations to minimize indirect addressing and enhance cache locality. The optimizations employed are expected to be of value to other unstructured applications as many-core architectures evolve.
  • Toward a fault-tolerant operational ensemble data assimilation forecasting system for the Red Sea

    Toye, Habib; Kortas, Samuel; Zhan, Peng; Hoteit, Ibrahim (2017-03-13)
  • Exploration Of Deep Learning Algorithms Using OpenACC Parallel Programming Model

    Hamam, Alwaleed A.; Khan, Ayaz H. (2017-03-13)
    Deep learning is based on a set of algorithms that attempt to model high-level abstractions in data. Specifically, the Restricted Boltzmann Machine (RBM) is the deep learning algorithm used in this project, which aims to improve its runtime performance through an efficient parallel implementation in OpenACC, with the best possible optimizations applied to RBM to harness the massively parallel power of NVIDIA GPUs. GPU development in the last few years has contributed to the growth of deep learning. OpenACC is a directive-based approach to computing in which directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Machine is a stochastic neural network that essentially performs a binary version of factor analysis. RBM is a useful neural network basis for larger modern deep learning models, such as the Deep Belief Network. RBM parameters are estimated using an efficient training method called Contrastive Divergence. Parallel implementations of RBM are available in different models such as OpenMP and CUDA, but this project is the first attempt to apply the OpenACC model to RBM.
  • Performance Results using ANSYS HPC

    Karim, Abbass; Ramon, Jose (2017-03-13)
  • Abnormal Behavior Detection in Aerial Video Surveillance

    Walha, Ahlem; Wali, Ali; Alimi, Adel (2017-03-13)
  • HPL and STREAM Benchmarks on SANAM Supercomputer

    Bin Sulaiman, Riman A. (2017-03-13)
    The SANAM supercomputer was jointly built by KACST and FIAS in 2012, ranking second that year in the Green500 list with a power efficiency of 2.3 GFLOPS/W (Rohr et al., 2014). It is a heterogeneous accelerator-based HPC system that has 300 compute nodes. Each node includes two Intel Xeon E5-2650 CPUs, two AMD FirePro S10000 dual GPUs and 128 GiB of main memory. In this work, the seven benchmarks of HPCC were installed and configured to reassess the performance of SANAM, as part of an unpublished master's thesis, after it was reassembled in the Kingdom of Saudi Arabia. We present here detailed results of the HPL and STREAM benchmarks.
  • Scalable Relevance re-ranking using nature-inspired meta-heuristic optimization algorithms

    Ksibi, Amel; Hadj Taieb, Mohamed Amin; Ben Ammar, Anis; Ben Amar, Chokri (2017-03-13)
  • Simulation of Cycle-to-Cycle Variation in Dual-Fuel Engines

    Jaasim, Mohammed; Pasunurthi, Shyamsundar; Jupudi, Ravichandra S.; Gubba, Sreenivasa Rao; Primus, Roy; Klingbeil, Adam; Wijeyakulasuriya, Sameera; Im, Hong G. (2017-03-13)
    The standard practice in internal combustion (IC) engine experiments is to measure quantities averaged over a large number of cycles. Depending on the operating conditions, the cycle-to-cycle variation (CCV) of quantities such as the indicated mean effective pressure (IMEP) is observed at different levels. Accurate prediction of CCV in IC engines is an important but challenging task. Computational fluid dynamics (CFD) simulations using high performance computing (HPC) can be used effectively to visualize such 3D spatial distributions. In the present study, a dual-fuel large engine is considered, with natural gas injected into the manifold accompanied by direct injection of diesel pilot fuel to trigger ignition. Multiple engine cycles in 3D are simulated in series, as in the experiments, to investigate the potential of HPC-based high-fidelity simulations to accurately capture the cycle-to-cycle variation in dual-fuel engines. Open cycle simulations are conducted to predict the combined effect of the stratification of the fuel-air mixture, temperature and turbulence on the CCV of pressure. The predicted coefficient of variation (COV) of pressure is compared to the results from closed cycle simulations and the experiments.
  • Secure Broadcasting with Uncertain Channel State Information

    Hyadi, Amal; Rezki, Zouheir; Khisti, Ashish; Alouini, Mohamed-Slim (2017-03-13)
    We investigate the problem of secure broadcasting over fast fading channels with imperfect main channel state information (CSI) at the transmitter. In particular, we analyze the effect of the noisy estimation of the main CSI on the throughput of a broadcast channel where the transmission is intended for multiple legitimate receivers in the presence of an eavesdropper. In addition, we consider the realistic case where the transmitter is only aware of the statistics of the eavesdropper's CSI and not of its channel's realizations. First, we discuss the common message transmission case where the source broadcasts the same information to all the receivers, and we provide upper and lower bounds on the ergodic secrecy capacity. For this case, we show that the secrecy rate is limited by the legitimate receiver having, on average, the worst main channel link, and we prove that a non-zero secrecy rate can still be achieved even when the CSI at the transmitter is noisy. Then, we look at the independent messages case where the transmitter broadcasts multiple messages to the receivers, and each intended user is interested in an independent message. For this case, we present an expression for the achievable secrecy sum-rate and an upper bound on the secrecy sum-capacity, and we show that, in the limit of a large number of legitimate receivers K, our achievable secrecy sum-rate follows the scaling law log((1-α) log(K)), where α is the estimation error variance of the main CSI. The special cases of high SNR, perfect main CSI and no main CSI are also analyzed. Analytical derivations and numerical results are presented to illustrate the obtained expressions for the case of independent and identically distributed Rayleigh fading channels.
