## Search

Now showing items 1-10 of 43





Abstraction Layer For Standardizing APIs of Task-Based Engines

Alomairy, Rabab; Ltaief, Hatem; Abduljabbar, Mustafa; Keyes, David (2019-09-02) [Technical Report]

We introduce AL4SAN, a lightweight library for abstracting the APIs of task-based runtime engines. AL4SAN unifies the expression of tasks and their data dependencies, and supports various dynamic runtime systems relying on compiler technology as well as user-defined APIs. It enables an application to employ different runtimes and their respective scheduling components, while keeping the user oblivious to the underlying hardware configuration. AL4SAN exposes common front-end APIs and connects to different back-end runtimes. Performance and overhead assessments are reported on various shared- and distributed-memory systems, possibly equipped with hardware accelerators. A range of workloads, from compute-bound to memory-bound regimes, serves as proxies for current scientific applications. The low overhead (less than 10%) achieved across these workloads allows AL4SAN to be deployed for fast development of task-based numerical algorithms. More interestingly, AL4SAN enables runtime interoperability by switching runtimes at runtime. Blending runtime systems achieves a twofold speedup on a task-based generalized symmetric eigenvalue solver, relative to state-of-the-art implementations. The ultimate goal of AL4SAN is not to create a new runtime, but to strengthen the co-design of existing runtimes and applications, while facilitating user productivity and code portability. The code of AL4SAN is freely available at https://github.com/ecrc/al4san, with extensions in progress.
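The abstraction idea described above can be pictured as a front end where applications submit tasks with data dependencies once, and interchangeable back ends schedule them. The sketch below is a minimal illustration of that pattern; all names (`Task`, `SerialBackend`, `Runtime.submit`) are hypothetical and are not the actual AL4SAN API.

```python
# Minimal sketch of a runtime-abstraction front end in the spirit of AL4SAN.
# A real library would wrap back ends such as StarPU or PaRSEC; here a serial
# back end simply executes tasks in dependency order.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    func: Callable
    args: tuple
    deps: list = field(default_factory=list)   # tasks that must complete first

class SerialBackend:
    """One interchangeable back end: runs tasks respecting dependencies."""
    def run(self, tasks):
        done, order = set(), []
        def visit(t):
            if id(t) in done:
                return
            for d in t.deps:                   # honor data dependencies first
                visit(d)
            t.func(*t.args)
            done.add(id(t)); order.append(t)
        for t in tasks:
            visit(t)
        return order

class Runtime:
    """Front end: the application expresses tasks once, back ends are swappable."""
    def __init__(self, backend):
        self.backend, self.tasks = backend, []
    def submit(self, func, *args, deps=()):
        t = Task(func, args, list(deps))
        self.tasks.append(t)
        return t
    def sync(self):
        return self.backend.run(self.tasks)
```

Swapping the back end (the "switching runtimes at runtime" idea) only requires constructing `Runtime` with a different scheduler object; the application's `submit` calls stay unchanged.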

Scaling Distributed Machine Learning with In-Network Aggregation

Sapio, Amedeo; Canini, Marco; Ho, Chen-Yu; Nelson, Jacob; Kalnis, Panos; Kim, Changhoon; Krishnamurthy, Arvind; Moshref, Masoud; Ports, Dan R. K.; Richtárik, Peter (2019-02) [Technical Report]

Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and by at least 20% for a number of real-world benchmark models.
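The aggregation step the abstract moves into the switch can be sketched numerically: workers send fixed-point model-update chunks (switch dataplanes do integer arithmetic), an aggregator sums them, and one combined stream returns to every worker. The scaling factor and data layout below are illustrative assumptions, not SwitchML's actual wire protocol.

```python
# Toy sketch of in-network gradient aggregation in the spirit of SwitchML.
import numpy as np

SCALE = 1 << 16  # fixed-point scaling, since switches lack floating point

def to_fixed(update):
    """Quantize a float update vector to integers before sending."""
    return np.round(update * SCALE).astype(np.int64)

def switch_aggregate(chunks):
    """Sum one chunk from each worker -- the key step executed in the network."""
    return np.sum(np.stack(chunks), axis=0)

def from_fixed(agg, n_workers):
    """Recover the mean floating-point update at the end hosts."""
    return agg.astype(np.float64) / (SCALE * n_workers)

updates = [np.array([0.5, -0.25]), np.array([0.1, 0.05])]
agg = switch_aggregate([to_fixed(u) for u in updates])
mean_update = from_fixed(agg, len(updates))
```

Because only the aggregated vector travels back, each worker exchanges one update's worth of data per round instead of one per peer, which is the volume reduction the abstract refers to.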

Ubiquitous Asynchronous Computations for Solving the Acoustic Wave Propagation Equation

Akbudak, Kadir; Ltaief, Hatem; Etienne, Vincent; Abdelkhalak, Rached; Tonellot, Thierry; Keyes, David E. (2018) [Technical Report]

This paper designs and implements a ubiquitous asynchronous computational scheme for solving the acoustic wave propagation equation with Absorbing Boundary Conditions (ABCs) in the context of seismic imaging applications. While the Convolutional Perfectly Matched Layer (CPML) is typically used as the ABC in the oil and gas industry, its formulation further stresses memory accesses and decreases the arithmetic intensity at the physical domain boundaries. The challenges with CPML are twofold: (1) the strong, inherent data dependencies imposed on the explicit time stepping scheme render asynchronous time integration cumbersome, and (2) the idle time is further exacerbated by the load imbalance introduced among processing units. In fact, the CPML formulation of the ABCs requires expensive synchronization points, which may hinder the parallel performance of the overall asynchronous time integration. In particular, when deployed in conjunction with the Multicore-optimized Wavefront Diamond (MWD) tiling approach for the inner domain points, it results in a major performance slowdown. To relax CPML's synchrony and mitigate the resulting load imbalance, we embed CPML's calculation into MWD's inner loop and carry on the time integration with fine-grained computations in an asynchronous, holistic way. This comes at the price of storing transient results to alleviate dependencies from critical data hazards, while maintaining the numerical accuracy of the original scheme. Performance results on various x86 architectures demonstrate the superiority of MWD with CPML over standard spatial blocking. To our knowledge, this is the first practical study that highlights the consolidation of CPML ABCs with asynchronous temporal-blocking stencil computations.
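The extra work that absorbing boundaries impose on an explicit time-stepping scheme can be illustrated with a much simpler stand-in: a 1D acoustic stencil with a damping ("sponge") layer. The paper's CPML boundaries involve convolutional memory variables and are considerably more involved; this sketch, with illustrative grid and damping parameters, only shows why boundary points do extra work per step than interior points.

```python
# Simplified 1D explicit acoustic time stepping with a sponge absorbing layer.
import numpy as np

n, nb = 200, 20                   # grid points, boundary-layer width
c, dx, dt = 1.0, 1.0, 0.5         # wave speed and steps (CFL-stable: c*dt/dx <= 1)

sigma = np.zeros(n)               # damping profile, nonzero only near boundaries
ramp = np.linspace(0.0, 0.3, nb)
sigma[:nb], sigma[-nb:] = ramp[::-1], ramp

u_prev = np.exp(-0.01 * (np.arange(n) - n // 2) ** 2)   # initial pulse
u = u_prev.copy()
for _ in range(300):
    lap = np.zeros(n)
    lap[1:-1] = u[2:] - 2 * u[1:-1] + u[:-2]            # 2nd-order Laplacian
    u_next = 2 * u - u_prev + (c * dt / dx) ** 2 * lap  # leapfrog update
    u_next *= 1.0 / (1.0 + sigma * dt)                  # extra work near edges
    u_prev, u = u, u_next
```

Even in this toy version, the boundary columns take a different (and costlier) code path than the interior, which is the load imbalance the asynchronous MWD+CPML scheme is designed to hide.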

Performance Impact of Rank-Reordering on Advanced Polar Decomposition Algorithms

Esposito, Aniello; Keyes, David E.; Ltaief, Hatem; Sukkari, Dalal (2018) [Technical Report]

We demonstrate the importance of both MPI rank reordering and the choice of processor grid topology in the context of advanced dense linear algebra (DLA) applications for distributed-memory systems. In particular, we focus on the advanced polar decomposition (PD) algorithm based on the QR-based Dynamically Weighted Halley method (QDWH). The QDWH algorithm may be used as the first computational step toward solving symmetric eigenvalue problems and the singular value decomposition. Sukkari et al. (ACM TOMS, 2017) have shown that QDWH may benefit from rectangular instead of square processor grid topologies, which directly impact the performance of the underlying ScaLAPACK algorithms. In this work, we experiment with an extensive combination of grid topologies and rank reorderings for various matrix sizes and node counts, and use QDWH as a proxy for advanced compute-bound linear algebra operations, since it is rich in dense linear solvers and factorizations. A performance improvement of up to 54% can be observed for QDWH on 800 nodes of a Cray XC system thanks to an optimal combination, especially in the strong scaling mode of operation, in which communication overheads may become dominant. We perform a thorough application profiling to analyze the impact of reordering and grid topologies on the various linear algebra components of the QDWH algorithm. It turns out that point-to-point communications may be considerably reduced thanks to a judicious choice of grid topology, combined with a proper rank-reordering setting using features of the cray-mpich library.
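The design space the abstract explores starts from a simple combinatorial fact: p MPI ranks can be arranged into any pr x pc grid with pr * pc = p. A small helper to enumerate those candidates might look as follows; which shape wins (square vs. rectangular) is precisely what the paper measures empirically, so this only lists the options.

```python
# Enumerate candidate processor grid topologies for p ranks, e.g. for a
# ScaLAPACK-style 2D block-cyclic distribution. Illustrative helper only.
def grid_topologies(p):
    """Return all (rows, cols) factorizations of p ranks."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]

candidates = grid_topologies(12)
```

A tuning script would time the target factorization over each candidate (and over the available rank-reordering settings) and keep the best combination.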

A High Performance QDWH-SVD Solver using Hardware Accelerators

Sukkari, Dalal E.; Ltaief, Hatem; Keyes, David E. (2015-04-08) [Technical Report]

This paper describes a new high-performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architectures enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) the polar decomposition of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors, and (3) a matrix-matrix multiplication to obtain the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although it performs up to two times more flops when computing all singular vectors than the standard SVD solver algorithm, our new high-performance implementation on a single GPU achieves up to 3.8x improvement for asymptotic matrix sizes, compared to equivalent routines from existing state-of-the-art open-source and commercial libraries. When only singular values are needed, QDWH-SVD is penalized by performing up to 14 times more flops; nevertheless, the singular-value-only implementation on a single GPU can still run up to 18% faster than the best existing equivalent routines. Integrating mixed-precision techniques into the solver provides up to an additional 40% improvement, at the price of losing a few digits of accuracy compared to full double-precision floating-point arithmetic. We further leverage the single-GPU QDWH-SVD implementation by introducing the first multi-GPU SVD solver, in order to study the scalability of the QDWH-SVD framework.

Exploiting Data Sparsity for Large-Scale Matrix Computations

Akbudak, Kadir; Ltaief, Hatem; Mikhalev, Aleksandr; Charara, Ali; Keyes, David E. (2018-02-24) [Technical Report]

Exploiting data sparsity in dense matrices is an algorithmic bridge between architectures that are increasingly memory-austere on a per-core basis and extreme-scale applications. The Hierarchical matrix Computations on Manycore Architectures (HiCMA) library tackles this challenging problem by achieving significant reductions in time to solution and memory footprint, while preserving a specified accuracy requirement of the application. HiCMA provides a high-performance implementation on distributed-memory systems of one of the most widely used matrix factorizations in large-scale scientific applications, i.e., the Cholesky factorization. It employs the tile low-rank data format to compress the dense, data-sparse off-diagonal tiles of the matrix. It then decomposes the matrix computations into interdependent tasks and relies on the dynamic runtime system StarPU for asynchronous out-of-order scheduling, while allowing high user productivity. Performance and memory-footprint comparisons on matrix dimensions up to eleven million show gains of more than an order of magnitude in both metrics on thousands of cores, against state-of-the-art open-source and vendor-optimized numerical libraries. This represents an important milestone in enabling large-scale matrix computations toward solving big data problems in geospatial statistics for climate/weather forecasting applications.

Batched Tile Low-Rank GEMM on GPUs

Charara, Ali; Keyes, David E.; Ltaief, Hatem (2018-02) [Technical Report]

Dense General Matrix-Matrix (GEMM) multiplication is a core operation of the Basic Linear Algebra Subroutines (BLAS) library and, therefore, often resides at the bottom of the traditional software stack for most scientific applications. In fact, chip manufacturers give special attention to the GEMM kernel implementation, since this is exactly where most high-performance software libraries extract the hardware performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures, which retain the most significant information of each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed for GPUs so that they can exploit the data-sparsity structure of the matrix while leveraging the underlying TLR compression format. The main idea consists of aggregating all operations into a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for advanced TLR algorithms.
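The algebraic identity that makes a low-rank GEMM cheap can be shown in a few lines: with tiles stored as A ~ Ua Vaᵀ and B ~ Ub Vbᵀ, the product A B = Ua (Vaᵀ Ub) Vbᵀ only ever touches small k-sized inner matrices, never the full nb x nb tiles. Names and shapes below are illustrative, not the kernel's actual API.

```python
# Sketch of the low-rank GEMM trick underlying the batched TLR kernel.
import numpy as np

def tlr_gemm(Ua, Va, Ub, Vb):
    """Multiply two low-rank tiles; the result stays in low-rank form."""
    core = Va.T @ Ub             # small (ka x kb) matrix -- the cheap inner step
    return Ua @ core, Vb         # product is (Ua @ core) @ Vb.T

rng = np.random.default_rng(1)
nb, ka, kb = 128, 4, 6
Ua, Va = rng.standard_normal((nb, ka)), rng.standard_normal((nb, ka))
Ub, Vb = rng.standard_normal((nb, kb)), rng.standard_normal((nb, kb))
Uc, Vc = tlr_gemm(Ua, Va, Ub, Vb)
```

Each such product costs O(nb·ka·kb) instead of O(nb³); because the individual operations are tiny, batching many of them into one GPU kernel launch (as the abstract describes) is what recovers high throughput.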

Borehole Tool for the Comprehensive Characterization of Hydrate-bearing Sediments

Dai, Sheng; Santamarina, Carlos (Office of Scientific and Technical Information (OSTI), 2018-02-01) [Technical Report]

Reservoir characterization and simulation require reliable parameters to anticipate hydrate deposit responses and production rates. The acquisition of the required fundamental properties currently relies on wireline logging, pressure core testing, and/or laboratory observations of synthesized specimens, which are challenged by limited testing capabilities and innate sampling disturbances, albeit lessened by developments in pressure core technology. The project reviews hydrate-bearing sediments, their properties, and inherent sampling effects in order to develop robust correlations with index parameters. The resulting information is incorporated into a tool for optimal field characterization and parameter selection with uncertainty analyses. Ultimately, the project develops a borehole tool for the comprehensive in situ characterization of hydrate-bearing sediments, with a design that recognizes past developments and characterization experience and benefits from the inspiration of nature and sensor miniaturization.

Low-SNR Capacity of MIMO Optical Intensity Channels

Chaaban, Anas; Rezki, Zouheir; Alouini, Mohamed-Slim (2017-09-18) [Technical Report]

The capacity of the multiple-input multiple-output (MIMO) optical intensity channel is studied, under both average and peak intensity constraints. We focus on low SNR, which can be modeled as the scenario where both constraints proportionally vanish, or where the peak constraint is held constant while the average constraint vanishes. A capacity upper bound is derived, and is shown to be tight at low SNR under both scenarios. The capacity-achieving input distribution at low SNR is shown to be a maximally-correlated vector-binary input distribution. Consequently, the low-SNR capacity of the channel is characterized. As a byproduct, it is shown that for a channel with peak intensity constraints only, or with peak intensity constraints and individual (per-aperture) average intensity constraints, a simple scheme composed of coded on-off keying, spatial repetition, and maximum-ratio combining is optimal at low SNR.
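The simple scheme from the last sentence can be simulated directly: every transmit aperture repeats the same on-off keying (OOK) symbol, and the receiver projects onto the effective gain vector (maximum-ratio combining) before thresholding. The channel gains, noise level, and uncoded detector below are toy assumptions for illustration; the paper's optimal scheme uses coded OOK.

```python
# Toy Monte-Carlo sketch of OOK + spatial repetition + MRC on a MIMO
# optical intensity channel.
import numpy as np

rng = np.random.default_rng(2)
n_tx, n_rx, n_sym = 4, 4, 2000
bits = rng.integers(0, 2, n_sym)
H = rng.uniform(0.5, 1.5, (n_rx, n_tx))    # nonnegative intensity channel gains
A = 1.0                                     # peak intensity constraint

# Spatial repetition: every aperture transmits the same OOK symbol.
x = A * np.outer(np.ones(n_tx), bits)       # (n_tx, n_sym), entries in {0, A}
y = H @ x + 0.1 * rng.standard_normal((n_rx, n_sym))

# MRC: project onto the effective gain vector h = H @ 1, then threshold.
h = H @ np.ones(n_tx)
z = h @ y / (h @ h)                         # per-symbol estimate of A * bit
detected = (z > A / 2).astype(int)
error_rate = np.mean(detected != bits)
```

The repetition collapses the MIMO channel into a scalar one with gain vector h, which is why the combination is both simple and, per the paper, optimal at low SNR.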

HLIBCov: Parallel Hierarchical Matrix Approximation of Large Covariance Matrices and Likelihoods with Applications in Parameter Identification

Litvinenko, Alexander (2017-09-26) [Technical Report]

The main goal of this article is to introduce the parallel hierarchical matrix library HLIBpro to the statistical community. We describe the HLIBCov package, which is an extension of the HLIBpro library for approximating large covariance matrices and maximizing likelihood functions. We show that an approximate Cholesky factorization of a dense matrix of size $2M\times 2M$ can be computed on a modern multi-core desktop in a few minutes. Further, HLIBCov is used for estimating unknown parameters such as the covariance length, variance, and smoothness parameter of a Mat\'ern covariance function by maximizing the joint Gaussian log-likelihood function. The computational bottleneck here is the expensive linear algebra arithmetic due to large and dense covariance matrices. Therefore, covariance matrices are approximated in the hierarchical ($\H$-) matrix format with computational cost $\mathcal{O}(k^2n \log^2 n/p)$ and storage $\mathcal{O}(kn \log n)$, where the rank $k$ is a small integer (typically $k<25$), $p$ is the number of cores, and $n$ is the number of locations on a fairly general mesh. We demonstrate the method on a synthetic example where the true parameter values are known. For reproducibility we provide the C++ code, the documentation, and the synthetic data.
