### Recent Submissions

• #### Tucker Tensor analysis of Matern functions in spatial statistics

(2018-03-09)
In this work, we describe advanced numerical tools for working with multivariate functions and for the analysis of large data sets. These tools will drastically reduce the required computing time and the storage cost, and, therefore, will allow us to consider much larger data sets or finer meshes. Covariance matrices are crucial in spatio-temporal statistical tasks, but are often very expensive to compute and store, especially in 3D. Therefore, we approximate covariance functions by cheap surrogates in a low-rank tensor format. We apply the Tucker and canonical tensor decompositions to a family of Matern- and Slater-type functions with varying parameters and demonstrate numerically that their approximations exhibit exponentially fast convergence. We prove the exponential convergence of the Tucker and canonical approximations in tensor rank parameters. Several statistical operations are performed in this low-rank tensor format, including evaluating the conditional covariance matrix, spatially averaged estimation variance, computing a quadratic form, determinant, trace, loglikelihood, inverse, and Cholesky decomposition of a large covariance matrix. Low-rank tensor approximations reduce the computing and storage costs essentially. For example, the storage cost is reduced from an exponential O(n^d) to a linear scaling O(drn), where d is the spatial dimension, n is the number of mesh points in one direction, and r is the tensor rank. Prerequisites for applicability of the proposed techniques are the assumptions that the data, locations, and measurements lie on a tensor (axes-parallel) grid and that the covariance function depends on a distance, ||x-y||.
• #### Exploiting Data Sparsity for Large-Scale Matrix Computations

(2018-02-24)
Exploiting data sparsity in dense matrices is an algorithmic bridge between architectures that are increasingly memory-austere on a per-core basis and extreme-scale applications. The Hierarchical matrix Computations on Manycore Architectures (HiCMA) library tackles this challenging problem by achieving significant reductions in time to solution and memory footprint, while preserving a specified accuracy requirement of the application. HiCMA provides a high-performance implementation on distributed-memory systems of one of the most widely used matrix factorization in large-scale scientific applications, i.e., the Cholesky factorization. It employs the tile low-rank data format to compress the dense data-sparse off-diagonal tiles of the matrix. It then decomposes the matrix computations into interdependent tasks and relies on the dynamic runtime system StarPU for asynchronous out-of-order scheduling, while allowing high user-productivity. Performance comparisons and memory footprint on matrix dimensions up to eleven million show a performance gain and memory saving of more than an order of magnitude for both metrics on thousands of cores, against state-of-the-art open-source and vendor optimized numerical libraries. This represents an important milestone in enabling large-scale matrix computations toward solving big data problems in geospatial statistics for climate/weather forecasting applications.
• #### Borehole Tool for the Comprehensive Characterization of Hydrate-bearing Sediments

(Office of Scientific and Technical Information (OSTI), 2018-02-01)
Reservoir characterization and simulation require reliable parameters to anticipate hydrate deposits responses and production rates. The acquisition of the required fundamental properties currently relies on wireline logging, pressure core testing, and/or laboratory ob-servations of synthesized specimens, which are challenged by testing capabilities and in-nate sampling disturbances. The project reviews hydrate-bearing sediments, properties, and inherent sampling effects, albeit lessen with the developments in pressure core technology, in order to develop robust correlations with index parameters. The resulting information is incorporated into a tool for optimal field characterization and parameter selection with un-certainty analyses. Ultimately, the project develops a borehole tool for the comprehensive characterization of hydrate-bearing sediments at in situ, with the design recognizing past developments and characterization experience and benefited from the inspiration of nature and sensor miniaturization.
• #### Batched Tile Low-Rank GEMM on GPUs

(2018-02)
Dense General Matrix-Matrix (GEMM) multiplication is a core operation of the Basic Linear Algebra Subroutines (BLAS) library, and therefore, often resides at the bottom of the traditional software stack for most of the scientific applications. In fact, chip manufacturers give a special attention to the GEMM kernel implementation since this is exactly where most of the high-performance software libraries extract the hardware performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures, which retains the most significant information for each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed on GPUs so that it can exploit the data sparsity structure of the matrix while leveraging the underlying TLR compression format. The main idea consists in aggregating all operations onto a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for TLR advance algorithms.
• #### Extreme Computing for Extreme Adaptive Optics: the Key to Finding Life Outside our Solar System

(2018)
The real-time correction of telescopic images in the search for exoplanets is highly sensitive to atmospheric aberrations. The pseudo- inverse algorithm is an efficient mathematical method to filter out these turbulences. We introduce a new partial singular value decomposition (SVD) algorithm based on QR-based Diagonally Weighted Halley (QDWH) iteration for the pseudo-inverse method of adaptive optics. The QDWH partial SVD algorithm selectively calculates the most significant singular values and their corresponding singular vectors. We develop a high performance implementation and demonstrate the numerical robustness of the QDWH-based partial SVD method. We also perform a benchmarking campaign on various generations of GPU hardware accelerators and compare against the state-of-the-art SVD implementation SGESDD from the MAGMA library. Numerical accuracy and performance results are reported using synthetic and real observational datasets from the Subaru telescope. Our implementation outperforms SGESDD by up to fivefold and fourfold performance speedups on ill-conditioned synthetic matrices and real observational datasets, respectively. The pseudo-inverse simulation code will be deployed on-sky for the Subaru telescope during observation nights scheduled early 2018.
• #### HLIBCov: Parallel Hierarchical Matrix Approximation of Large Covariance Matrices and Likelihoods with Applications in Parameter Identification

(2017-09-26)
The main goal of this article is to introduce the parallel hierarchical matrix library HLIBpro to the statistical community. We describe the HLIBCov package, which is an extension of the HLIBpro library for approximating large covariance matrices and maximizing likelihood functions. We show that an approximate Cholesky factorization of a dense matrix of size $2M\times 2M$ can be computed on a modern multi-core desktop in few minutes. Further, HLIBCov is used for estimating the unknown parameters such as the covariance length, variance and smoothness parameter of a Mat\'ern covariance function by maximizing the joint Gaussian log-likelihood function. The computational bottleneck here is expensive linear algebra arithmetics due to large and dense covariance matrices. Therefore covariance matrices are approximated in the hierarchical ($\H$-) matrix format with computational cost $\mathcal{O}(k^2n \log^2 n/p)$ and storage $\mathcal{O}(kn \log n)$, where the rank $k$ is a small integer (typically $k<25$), $p$ the number of cores and $n$ the number of locations on a fairly general mesh. We demonstrate a synthetic example, where the true values of known parameters are known. For reproducibility we provide the C++ code, the documentation, and the synthetic data.
• #### Low-SNR Capacity of MIMO Optical Intensity Channels

(2017-09-18)
The capacity of the multiple-input multiple-output (MIMO) optical intensity channel is studied, under both average and peak intensity constraints. We focus on low SNR, which can be modeled as the scenario where both constraints proportionally vanish, or where the peak constraint is held constant while the average constraint vanishes. A capacity upper bound is derived, and is shown to be tight at low SNR under both scenarios. The capacity achieving input distribution at low SNR is shown to be a maximally-correlated vector-binary input distribution. Consequently, the low-SNR capacity of the channel is characterized. As a byproduct, it is shown that for a channel with peak intensity constraints only, or with peak intensity constraints and individual (per aperture) average intensity constraints, a simple scheme composed of coded on-off keying, spatial repetition, and maximum-ratio combining is optimal at low SNR.
• #### Partial inversion of elliptic operator to speed up computation of likelihood in Bayesian inference

(2017-08-09)
In this paper, we speed up the solution of inverse problems in Bayesian settings. By computing the likelihood, the most expensive part of the Bayesian formula, one compares the available measurement data with the simulated data. To get simulated data, repeated solution of the forward problem is required. This could be a great challenge. Often, the available measurement is a functional $F(u)$ of the solution $u$ or a small part of $u$. Typical examples of $F(u)$ are the solution in a point, solution on a coarser grid, in a small subdomain, the mean value in a subdomain. It is a waste of computational resources to evaluate, first, the whole solution and then compute a part of it. In this work, we compute the functional $F(u)$ direct, without computing the full inverse operator and without computing the whole solution $u$. The main ingredients of the developed approach are the hierarchical domain decomposition technique, the finite element method and the Schur complements. To speed up computations and to reduce the storage cost, we approximate the forward operator and the Schur complement in the hierarchical matrix format. Applying the hierarchical matrix technique, we reduced the computing cost to $\mathcal{O}(k^2n \log^2 n)$, where $k\ll n$ and $n$ is the number of degrees of freedom. Up to the $\H$-matrix accuracy, the computation of the functional $F(u)$ is exact. To reduce the computational resources further, we can approximate $F(u)$ on, for instance, multiple coarse meshes. The offered method is well suited for solving multiscale problems. A disadvantage of this method is the assumption that one has to have access to the discretisation and to the procedure of assembling the Galerkin matrix.
• #### Application of Bayesian Networks for Estimation of Individual Psychological Characteristics

(2017-07-19)
In this paper we apply Bayesian networks for developing more accurate final overall estimations of psychological characteristics of an individual, based on psychological test results. Psychological tests which identify how much an individual possesses a certain factor are very popular and quite common in the modern world. We call this value for a given factor -- the final overall estimation. Examples of factors could be stress resistance, the readiness to take a risk, the ability to concentrate on certain complicated work and many others. An accurate qualitative and comprehensive assessment of human potential is one of the most important challenges in any company or collective. The most common way of studying psychological characteristics of each single person is testing. Psychologists and sociologists are constantly working on improvement of the quality of their tests. Despite serious work, done by psychologists, the questions in tests often do not produce enough feedback due to the use of relatively poor estimation systems. The overall estimation is usually based on personal experiences and the subjective perception of a psychologist or a group of psychologists about the investigated psychological personality factors.
• #### On the Optimality of Repetition Coding among Rate-1 DC-offset STBCs for MIMO Optical Wireless Communications

(2017-07-06)
In this paper, an optical wireless multiple-input multiple-output communication system employing intensity-modulation direct-detection is considered. The performance of direct current offset space-time block codes (DC-STBC) is studied in terms of pairwise error probability (PEP). It is shown that among the class of DC-STBCs, the worst case PEP corresponding to the minimum distance between two codewords is minimized by repetition coding (RC), under both electrical and optical individual power constraints. It follows that among all DC-STBCs, RC is optimal in terms of worst-case PEP for static channels and also for varying channels under any turbulence statistics. This result agrees with previously published numerical results showing the superiority of RC in such systems. It also agrees with previously published analytic results on this topic under log-normal turbulence and further extends it to arbitrary turbulence statistics. This shows the redundancy of the time-dimension of the DC-STBC in this system. This result is further extended to sum power constraints with static and turbulent channels, where it is also shown that the time dimension is redundant, and the optimal DC-STBC has a spatial beamforming structure. Numerical results are provided to demonstrate the difference in performance for systems with different numbers of receiving apertures and different throughput.
• #### A QDWH-Based SVD Software Framework on Distributed-Memory Manycore Systems

(2017)
This paper presents a high performance software framework for computing a dense SVD on distributed- memory manycore systems. Originally introduced by Nakatsukasa et al. (Nakatsukasa et al. 2010; Nakatsukasa and Higham 2013), the SVD solver relies on the polar decomposition using the QR Dynamically-Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying to the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. We report performance results against state-of-the-art existing QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to 3/8-fold on the Haswell/KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well and ill-conditioned matrices. We finally come up herein with a performance model based on these empirical results. Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1.
• #### Appendices for: Improper Signaling in Two-Path Relay Channels

(2016-12-01)
This document contains the appendices for the work in “Improper Signaling in Two-Path Relay Channels,” which is submitted to 2017 IEEE International Conference on Communications (ICC) Workshop on Full-Duplex Communications for Future Wireless Networks, Paris, France.
• #### Asynchronous Task-Based Polar Decomposition on Manycore Architectures

(2016-10-25)
This paper introduces the first asynchronous, task-based implementation of the polar decomposition on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original and hostile LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is also capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been severely weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations (i.e., Intel MKL and Elemental) for the polar decomposition on latest shared-memory vendors' systems (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high numerical accuracy.
• #### Efficient Outage Probability Evaluation of Diversity Receivers Over Generalized Gamma Channels

(2016-10)
In this paper, we are interested in determining the cumulative distribution function of the sum of generalized Gamma in the setting of rare event simulations. To this end, we present an efficient importance sampling estimator. The main result of this work is the bounded relative property of the proposed estimator. This result is used to accurately estimate the outage probability of multibranch maximum ratio combining and equal gain combining diversity receivers over generalized Gamma fading channels. Selected numerical simulations are discussed to show the robustness of our estimator compared to naive Monte Carlo.
• #### Nonlinear Adaptive Descriptor Observer for the Joint States and Parameters Estimation

(2016-08-29)
In this note, the joint state and parameters estimation problem for nonlinear multi-input multi-output descriptor systems is considered. Asymptotic convergence of the adaptive descriptor observer is established by a sufficient set of linear matrix inequalities for the noise-free systems. The noise corrupted systems are also considered and it is shown that the state and parameters estimation errors are bounded for bounded noises. In addition, if the noises are bounded and have zero mean, then the estimation errors asymptotically converge to zero in the mean. The performance of the proposed adaptive observer is illustrated by a numerical example.
• #### On the Efficient Simulation of the Distribution of the Sum of Gamma-Gamma Variates with Application to the Outage Probability Evaluation Over Fading Channels

(2016-06)
The Gamma-Gamma distribution has recently emerged in a number of applications ranging from modeling scattering and reverbation in sonar and radar systems to modeling atmospheric turbulence in wireless optical channels. In this respect, assessing the outage probability achieved by some diversity techniques over this kind of channels is of major practical importance. In many circumstances, this is intimately related to the difficult question of analyzing the statistics of a sum of Gamma-Gamma random variables. Answering this question is not a simple matter. This is essentially because outage probabilities encountered in practice are often very small, and hence the use of classical Monte Carlo methods is not a reasonable choice. This lies behind the main motivation of the present work. In particular, this paper proposes a new approach to estimate the left tail of the sum of Gamma-Gamma variates. More specifically, we propose a mean-shift importance sampling scheme that efficiently evaluates the outage probability of L-branch maximum ratio combining diversity receivers over Gamma-Gamma fading channels. The proposed estimator satisfies the well-known bounded relative error criterion, a well-desired property characterizing the robustness of importance sampling schemes, for both identically and non-identically independent distributed cases. We show the accuracy and the efficiency of our approach compared to naive Monte Carlo via some selected numerical simulations.
• #### A novel mirror diversity receiver for indoor MIMO visible light

(2016-03)
In this paper, we propose and study a non-imaging receiver design reducing the correlation of channel matrix for indoor multiple-input multiple-output (MIMO) visible light communication (VLC) systems. Contrary to previous works, our proposed mirror diversity receiver (MDR) not only blocks the reception of light on one specific direction but also improves the channel gain on the other direction by receiving the light reflected by a mirror deployed between the photodetectors. We analyze the channel capacity and optimal height of mirror in terms of maximum channel capacity for a 2 -by-2 MIMO-VLC system in a 2-dimensional geometric model.We prove that this constructive and destructive effects in channel matrix resulting from our proposed MDR are more beneficial to obtain well-conditioned channel matrix which is suitable for implementing spatial-multiplexing MIMO-VLC systems in order to support high data rate.
• #### Capacity Bounds for Parallel Optical Wireless Channels

(2016-01)
A system consisting of parallel optical wireless channels with a total average intensity constraint is studied. Capacity upper and lower bounds for this system are derived. Under perfect channel-state information at the transmitter (CSIT), the bounds have to be optimized with respect to the power allocation over the parallel channels. The optimization of the lower bound is non-convex, however, the KKT conditions can be used to find a list of possible solutions one of which is optimal. The optimal solution can then be found by an exhaustive search algorithm, which is computationally expensive. To overcome this, we propose low-complexity power allocation algorithms which are nearly optimal. The optimized capacity lower bound nearly coincides with the capacity at high SNR. Without CSIT, our capacity bounds lead to upper and lower bounds on the outage probability. The outage probability bounds meet at high SNR. The system with average and peak intensity constraints is also discussed.
• #### Supplementary Appendix for: Constrained Perturbation Regularization Approach for Signal Estimation Using Random Matrix Theory

(2016)
In this supplementary appendix we provide proofs and additional simulation results that complement the paper (constrained perturbation regularization approach for signal estimation using random matrix theory).