Browsing Technical Reports by Title
Now showing items 1-20 of 44

Appendices for: Improper Signaling in Two-Path Relay Channels (2016-12-01)
This document contains the appendices for the work in “Improper Signaling in Two-Path Relay Channels,” which is submitted to the 2017 IEEE International Conference on Communications (ICC) Workshop on Full-Duplex Communications for Future Wireless Networks, Paris, France.

Application of Bayesian Networks for Estimation of Individual Psychological Characteristics (2017-07-19)
In this paper we apply Bayesian networks to develop more accurate final overall estimations of the psychological characteristics of an individual, based on psychological test results. Psychological tests that identify how much of a certain factor an individual possesses are very popular and quite common in the modern world. We call this value for a given factor the final overall estimation. Examples of factors include stress resistance, readiness to take a risk, the ability to concentrate on certain complicated work, and many others. An accurate, qualitative, and comprehensive assessment of human potential is one of the most important challenges in any company or collective. The most common way of studying the psychological characteristics of each single person is testing. Psychologists and sociologists are constantly working to improve the quality of their tests. Despite serious work done by psychologists, the questions in tests often do not produce enough feedback due to the use of relatively poor estimation systems. The overall estimation is usually based on the personal experience and subjective perception of a psychologist or a group of psychologists regarding the investigated psychological personality factors.

Asynchronous Task-Based Polar Decomposition on Manycore Architectures (2016-10-25)
This paper introduces the first asynchronous, task-based implementation of the polar decomposition on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original and hostile LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is also capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been severely weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations (i.e., Intel MKL and Elemental) for the polar decomposition on the latest shared-memory vendors' systems (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high numerical accuracy.

Batched Tile Low-Rank GEMM on GPUs (2018-02)
Dense General Matrix-Matrix (GEMM) multiplication is a core operation of the Basic Linear Algebra Subprograms (BLAS) library, and therefore often resides at the bottom of the traditional software stack for most scientific applications. In fact, chip manufacturers give special attention to the GEMM kernel implementation, since this is exactly where most high-performance software libraries extract the hardware performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures, which retain the most significant information for each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed on GPUs so that it can exploit the data sparsity structure of the matrix while leveraging the underlying TLR compression format. The main idea consists in aggregating all operations onto a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for advanced TLR algorithms.
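A minimal sketch of why multiplying compressed tiles is cheaper than multiplying dense ones, assuming each tile is already stored in factored form A = Ua Va^T and B = Ub Vb^T with a small rank k. The product then only needs the k x k core Va^T Ub, and the result stays factored. This is an illustration of the low-rank algebra only, not the report's GPU kernel: batching, accumulation into an output tile, and recompression are left out, and the helper names `matmul`, `transpose`, and `lowrank_product` are hypothetical.

```python
def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def lowrank_product(Ua, Va, Ub, Vb):
    """Given A = Ua Va^T and B = Ub Vb^T, return (U, V) such that A B = U V^T.

    Only the small k x k core Va^T Ub and one tall-skinny product are formed,
    so the dense tiles A and B are never materialized.
    """
    core = matmul(transpose(Va), Ub)  # k x k
    return matmul(Ua, core), Vb


# Hypothetical 3 x 3 tiles of rank 1, chosen only for illustration.
Ua = [[1.0], [2.0], [3.0]]
Va = [[1.0], [0.0], [1.0]]
Ub = [[2.0], [1.0], [0.0]]
Vb = [[1.0], [1.0], [1.0]]

U, V = lowrank_product(Ua, Va, Ub, Vb)

# Reference: expand both tiles to dense form and multiply them directly.
dense = matmul(matmul(Ua, transpose(Va)), matmul(Ub, transpose(Vb)))
factored = matmul(U, transpose(V))
```

For n x n tiles of rank k, the factored product costs on the order of n k^2 operations instead of n^3, which is also why the individual operations have low arithmetic intensity and benefit from being aggregated.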

Borehole Tool for the Comprehensive Characterization of Hydrate-bearing Sediments (Office of Scientific and Technical Information (OSTI), 2018-02-01)
Reservoir characterization and simulation require reliable parameters to anticipate hydrate deposit responses and production rates. The acquisition of the required fundamental properties currently relies on wireline logging, pressure core testing, and/or laboratory observations of synthesized specimens, which are challenged by testing capabilities and innate sampling disturbances. The project reviews hydrate-bearing sediments, their properties, and inherent sampling effects, albeit lessened by developments in pressure core technology, in order to develop robust correlations with index parameters. The resulting information is incorporated into a tool for optimal field characterization and parameter selection with uncertainty analyses. Ultimately, the project develops a borehole tool for the comprehensive in situ characterization of hydrate-bearing sediments, with the design recognizing past developments and characterization experience and benefiting from the inspiration of nature and sensor miniaturization.

Capacity Bounds for Parallel Optical Wireless Channels (2016-01)
A system consisting of parallel optical wireless channels with a total average intensity constraint is studied. Capacity upper and lower bounds for this system are derived. Under perfect channel-state information at the transmitter (CSIT), the bounds have to be optimized with respect to the power allocation over the parallel channels. The optimization of the lower bound is non-convex; however, the KKT conditions can be used to find a list of possible solutions, one of which is optimal. The optimal solution can then be found by an exhaustive search algorithm, which is computationally expensive. To overcome this, we propose low-complexity power allocation algorithms which are nearly optimal. The optimized capacity lower bound nearly coincides with the capacity at high SNR. Without CSIT, our capacity bounds lead to upper and lower bounds on the outage probability. The outage probability bounds meet at high SNR. The system with average and peak intensity constraints is also discussed.

Comparison of Low-Complexity Diversity Schemes for Dual-Hop AF Relaying Systems (Institute of Electrical and Electronics Engineers (IEEE), 2012-02-13)
This paper investigates the performance of two low-complexity combining schemes, which are based on one- or two-phase observation, to mitigate multipath fading in dual-hop amplify-and-forward relaying systems. For the one-phase-based combining, a single-antenna station is assumed to relay information from a multiple-antenna transmitter to a multiple-antenna receiver, and the activation of the receive antennas is adaptively performed based on the second-hop statistics, regardless of the first-hop conditions. On the other hand, the two-phase-based combining suggests using multiple single-antenna stations between the multiple-antenna transmitter and the single-antenna receiver, where the suitable set of active relays is identified according to the pre-combining end-to-end fading conditions. To facilitate comparisons between the two schemes, formulations for the statistics of the combined signal-to-noise ratio and some performance measures are presented. Numerical and simulation results are shown to clarify the tradeoff between the achieved diversity-array gain, the processing complexity, and the power consumption.

Computation of the Response Surface in the Tensor Train data format (2014-06-11)
We apply the Tensor Train (TT) approximation to construct the Polynomial Chaos Expansion (PCE) of a random field, and solve the stochastic elliptic diffusion PDE with the stochastic Galerkin discretization. We compare two strategies for the polynomial chaos expansion: sparse and full polynomial (multi-index) sets. In the full set, the polynomial orders are chosen independently in each variable, which provides higher flexibility and accuracy. However, the total number of degrees of freedom grows exponentially with the number of stochastic coordinates. To cope with this curse of dimensionality, the data is kept compressed in the TT decomposition, a recurrent low-rank factorization. PCE computations on sparse grid sets are extensively studied, but the TT representation for PCE is a novel approach that is investigated in this paper. We outline how to deduce the PCE from the covariance matrix, assemble the Galerkin operator, and evaluate some post-processing quantities (mean, variance, Sobol indices), staying within the low-rank framework. Two stages are the most demanding. First, we interpolate PCE coefficients in the TT format using a small number of samples, which is performed via the block cross approximation method. Second, we solve the discretized equation (a large linear system) via the alternating minimal energy algorithm. In the numerical experiments we demonstrate that the full expansion set encapsulated in the TT format is indeed preferable in cases when high accuracy and high polynomial orders are required.

Design and Analysis of Delayed Chip Slope Modulation in Optical Wireless Communication (2015-08-23)
In this letter, we propose a novel slope-based binary modulation called delayed chip slope modulation (DCSM) and develop a chip-based hard-decision receiver to demodulate the resulting signal, detect the chip sequence, and decode the input bit sequence. Chips of shorter duration than the bit duration are used to represent the change of state in an amplitude level according to consecutive bit information and to exploit the tradeoff between bandwidth and power efficiency. We analyze the power spectral density and error rate performance of the proposed DCSM. We show from numerical results that the DCSM scheme can exploit spectrum density more efficiently than the reference schemes while providing an error rate performance comparable to conventional modulation schemes.

A Direct Radiative Transfer Equation Solver for Path Loss Calculation of Underwater Optical Wireless Channels (2014-11-10)
In this report, we propose a fast numerical solution for the steady-state radiative transfer equation in order to calculate the path loss due to light absorption and scattering in various types of underwater channels. In the proposed scheme, we apply a direct nonuniform method to discretize the angular space and an upwind-type finite difference method to discretize the spatial space. A Gauss-Seidel iterative method is then applied to solve the fully discretized system of linear equations. The accuracy and efficiency of the proposed scheme are validated by Monte Carlo simulations.
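The final solver step named in this abstract can be sketched generically as follows. This is a plain Gauss-Seidel iteration on a small, hypothetical diagonally dominant system, not the report's discretized radiative transfer system; the function name `gauss_seidel`, the tolerance, and the test matrix are assumptions made for illustration.

```python
def gauss_seidel(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b by Gauss-Seidel sweeps, starting from the zero vector.

    Each unknown is updated in place, so later rows in the same sweep already
    use the freshly computed values. Convergence is guaranteed here only
    because the example matrix is strictly diagonally dominant.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(n):
            off_diagonal = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - off_diagonal) / A[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:
            break
    return x


# Hypothetical 3 x 3 system whose exact solution is (1, 1, 1).
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x = gauss_seidel(A, b)
```

The in-place update is what distinguishes Gauss-Seidel from Jacobi iteration, and it needs only one copy of the solution vector in memory.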

Efficient Outage Probability Evaluation of Diversity Receivers Over Generalized Gamma Channels (2016-10)
In this paper, we are interested in determining the cumulative distribution function of the sum of generalized Gamma random variables in the setting of rare event simulations. To this end, we present an efficient importance sampling estimator. The main result of this work is the bounded relative error property of the proposed estimator. This result is used to accurately estimate the outage probability of multibranch maximum ratio combining and equal gain combining diversity receivers over generalized Gamma fading channels. Selected numerical simulations are discussed to show the robustness of our estimator compared to naive Monte Carlo.
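To illustrate the general idea of importance sampling for a left-tail probability, the sketch below estimates P(X1 + ... + Xn < gamma) for unit-rate exponential variates, a special case of the generalized Gamma family chosen here because a closed form exists for checking. The proposal (exponentials with rate n / gamma) is an assumption made for this example and is not the estimator proposed in the report; the function names are hypothetical.

```python
import math
import random


def outage_is(n, gamma, num_samples, seed=0):
    """Importance sampling estimate of P(X1 + ... + Xn < gamma), Xi ~ Exp(1).

    Samples are drawn from Exp(lam) with lam = n / gamma (an assumed proposal
    that centers the sum near gamma) and reweighted by the likelihood ratio
    exp((lam - 1) * s) / lam**n of target density to proposal density.
    """
    rng = random.Random(seed)
    lam = n / gamma
    total = 0.0
    for _ in range(num_samples):
        s = sum(rng.expovariate(lam) for _ in range(n))
        if s < gamma:
            total += math.exp((lam - 1.0) * s) / lam ** n
    return total / num_samples


def outage_exact(n, gamma):
    """Closed-form CDF of a sum of n unit-rate exponentials (Erlang)."""
    return 1.0 - math.exp(-gamma) * sum(
        gamma ** k / math.factorial(k) for k in range(n)
    )


n, gamma = 4, 0.1
estimate = outage_is(n, gamma, 20000)
exact = outage_exact(n, gamma)
```

With naive Monte Carlo, almost none of the 20000 samples would fall in an event this rare, so the estimate would be dominated by noise; under the shifted proposal a large fraction of samples land in the event and each carries a small, bounded weight.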

Energy-Efficient Power Allocation for Fixed-Gain Amplify-and-Forward Relay Networks with Partial Channel State Information (King Abdullah University of Science and Technology, 2012-06)
In this report, energy-efficient transmission and power allocation for fixed-gain amplify-and-forward relay networks with partial channel state information (CSI) are studied. In the energy-efficiency problem, the total power consumed is minimized while keeping the signal-to-noise ratio (SNR) above a certain threshold. In the dual problem of power allocation, the end-to-end SNR is maximized under individual and global power constraints. Closed-form expressions for the optimal source and relay powers and the Lagrangian multiplier are obtained. Numerical results show that the optimal power allocation with partial CSI provides performance comparable to that of optimal power allocation with full CSI at low SNR.

Error Rates of M-PAM and M-QAM in Generalized Fading and Generalized Gaussian Noise Environments (IEEE International Symposium on Information Theory, July 2013, Istanbul, Turkey, 2013-07)
This letter investigates the average symbol error probability (ASEP) of pulse amplitude modulation and quadrature amplitude modulation coherent signaling over flat fading channels subject to additive white generalized Gaussian noise. The new ASEP results are derived in a generic closed form in terms of the Fox H function and the bivariate Fox H function for the extended generalized-K fading case. The utility of this new general closed form is that it includes some special fading distributions, like the generalized-K, Nakagami-m, and Rayleigh fading, and special noise distributions such as Gaussian and Laplacian. Some of these special cases are also treated and are shown to yield simplified results.

Exploiting Data Sparsity for Large-Scale Matrix Computations (2018-02-24)
Exploiting data sparsity in dense matrices is an algorithmic bridge between architectures that are increasingly memory-austere on a per-core basis and extreme-scale applications. The Hierarchical matrix Computations on Manycore Architectures (HiCMA) library tackles this challenging problem by achieving significant reductions in time to solution and memory footprint, while preserving a specified accuracy requirement of the application. HiCMA provides a high-performance implementation on distributed-memory systems of one of the most widely used matrix factorizations in large-scale scientific applications, i.e., the Cholesky factorization. It employs the tile low-rank data format to compress the dense data-sparse off-diagonal tiles of the matrix. It then decomposes the matrix computations into interdependent tasks and relies on the dynamic runtime system StarPU for asynchronous out-of-order scheduling, while allowing high user-productivity. Performance comparisons and memory footprint on matrix dimensions up to eleven million show a performance gain and memory saving of more than an order of magnitude for both metrics on thousands of cores, against state-of-the-art open-source and vendor-optimized numerical libraries. This represents an important milestone in enabling large-scale matrix computations toward solving big data problems in geospatial statistics for climate/weather forecasting applications.

Extreme Computing for Extreme Adaptive Optics: the Key to Finding Life Outside our Solar System (2018)
The real-time correction of telescopic images in the search for exoplanets is highly sensitive to atmospheric aberrations. The pseudo-inverse algorithm is an efficient mathematical method to filter out this turbulence. We introduce a new partial singular value decomposition (SVD) algorithm based on the QR-based Dynamically Weighted Halley (QDWH) iteration for the pseudo-inverse method of adaptive optics. The QDWH partial SVD algorithm selectively calculates the most significant singular values and their corresponding singular vectors. We develop a high performance implementation and demonstrate the numerical robustness of the QDWH-based partial SVD method. We also perform a benchmarking campaign on various generations of GPU hardware accelerators and compare against the state-of-the-art SVD implementation SGESDD from the MAGMA library. Numerical accuracy and performance results are reported using synthetic and real observational datasets from the Subaru telescope. Our implementation outperforms SGESDD by up to fivefold and fourfold performance speedups on ill-conditioned synthetic matrices and real observational datasets, respectively. The pseudo-inverse simulation code will be deployed on-sky for the Subaru telescope during observation nights scheduled early 2018.

Free-Space Optical Communications: Capacity Bounds, Approximations, and a New Sphere-Packing Perspective (2015-04)
The capacity of the intensity-modulation direct-detection (IM-DD) free-space optical channel is studied. It is shown that for an IM-DD channel with generally input-dependent noise, the worst noise at high SNR is input-independent Gaussian with variance dependent on the input cost. Based on this result, a Gaussian IM-DD channel model is proposed where the noise variance depends on the optical intensity constraints only. A new recursive approach for bounding the capacity of the channel based on sphere-packing is proposed, which leads to a tighter bound than an existing sphere-packing bound for the channel with only an average intensity constraint. Under both average and peak constraints, it yields bounds that characterize the high SNR capacity within a negligible gap, where the achievability is proved by using a truncated Gaussian input distribution. This completes the high SNR capacity characterization of the channel, by closing the gap in the existing characterization for a small average-to-peak ratio. Simple fitting functions that capture the best known achievable rate for the channel are provided. These functions can be of significant practical importance, especially for the study of systems operating under atmospheric turbulence and misalignment conditions. Finally, the capacity/SNR loss between heterodyne detection (HD) systems and IM-DD systems is bounded at high SNR, where it is shown that the loss grows as SNR increases for a complex-valued HD system, while it is bounded by 1.245 bits or 3.76 dB at most for a real-valued one.

GraMi: Generalized Frequent Pattern Mining in a Single Large Graph (2011-11)
Mining frequent subgraphs is an important operation on graphs. Most existing work assumes a database of many small graphs, but modern applications, such as social networks, citation graphs or protein-protein interaction in bioinformatics, are modeled as a single large graph. Interesting interactions in such applications may be transitive (e.g., friend of a friend). Existing methods, however, search for frequent isomorphic (i.e., exact match) subgraphs and cannot discover many useful patterns. In this paper the authors propose GRAMI, a framework that generalizes frequent subgraph mining in a large single graph. GRAMI discovers frequent patterns, where a pattern is a graph whose edges are generalized to distance-constrained paths. Depending on the definition of the distance function, many instantiations of the framework are possible. Both directed and undirected graphs, as well as multiple labels per vertex, are supported. The authors developed an efficient implementation of the framework that models the frequency resolution phase as a constraint satisfaction problem, in order to avoid the costly enumeration of all instances of each pattern in the graph. The authors also implemented CGRAMI, a version that supports structural and semantic constraints, and AGRAMI, an approximate version that supports very large graphs. The experiments on real data demonstrate that the authors' framework is up to 3 orders of magnitude faster and discovers more interesting patterns than existing approaches.

A High Performance QDWH-SVD Solver using Hardware Accelerators (2015-04-08)
This paper describes a new high performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architecture enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) the polar decomposition calculation of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors and (3) the matrix-matrix multiplication to get the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although it performs up to two times more flops when computing all singular vectors compared to the standard SVD solver algorithm, our new high performance implementation on a single GPU results in up to 3.8x improvements for asymptotic matrix sizes, compared to the equivalent routines from existing state-of-the-art open-source and commercial libraries. However, when only singular values are needed, QDWH-SVD is penalized by performing up to 14 times more flops. The singular-value-only implementation of QDWH-SVD on a single GPU can still run up to 18% faster than the best existing equivalent routines. Integrating mixed precision techniques in the solver can additionally provide up to 40% improvement at the price of losing a few digits of accuracy, compared to the full double precision floating point arithmetic. We further leverage the single GPU QDWH-SVD implementation by introducing the first multi-GPU SVD solver to study the scalability of the QDWH-SVD framework.