Technical Reports
Recent Submissions

Accelerating Geostatistical Modeling and Prediction With MixedPrecision Computations: A HighProductivity Approach with PaRSEC(20210506) [Technical Report]Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Spatial data is assumed to possess properties of stationarity or nonstationarity via a kernel fitted to a covariance matrix. A primary workhorse of stationary spatial statistics is Gaussian maximum loglikelihood estimation (MLE), whose central data structure is a dense, symmetric positive definite covariance matrix of dimension of the number of correlated observations. Two essential operations in MLE are the application of the inverse and evaluation of the determinant of the covariance matrix. These can be rendered through the Cholesky decomposition and triangular solution. In this contribution, we reduce the precision of weakly correlated locations to single or half precision based on distance. We thus exploit mathematical structure to migrate MLE to a threeprecision approximation that takes advantage of contemporary architectures offering BLAS3like operations in a single instruction that are extremely fast for reduced precision. We illustrate applicationexpected accuracy worthy of doubleprecision from a majority halfprecision computation, in a context where uniform single precision is by itself insufficient. In tackling the complexity and imbalance caused by the mixing of three precisions, we deploy the PaRSEC runtime system. PaRSEC delivers ondemand casting of precisions while orchestrating tasks and data movement in a multiGPU distributedmemory environment within a tilebased Cholesky factorization. Applicationexpected accuracy is maintained while achieving up to 1.59 by mixing FP64/FP32 operations on 1536 nodes of HAWK or 4096 nodes of ShaheenII, and up to 2.64X by mixing FP64/FP32/FP16 operations on 128 nodes of Summit, relative to FP64only operations, This translates into up to 4.5, 4.7, and 9.1 (mixed) PFlop/s sustained performance, respectively, demonstrating a synergistic combination of exascale architecture, dynamic runtime software, and algorithmic adaptation applied to challenging environmental problems.

Ariadne: A commonsense thread for enabling provable safety in air mobility systems with unreliable components(20210104) [Technical Report]Commercial air travel is by far the safest transportation modality available to humanity today. It has achieved this enviable status by deploying thousands of professionals, including pilots, dispatchers, and air traffic controllers to operate very reliable air vehicles, bringing them and their passengers safely from origin to destination while managing dangerousweather, other traffic and system failures for decades. The air transportation has been undergoing undeniable and continuous progress and modernization since its inception. Thanks to advances in navigation capabilities, such as satellitebased navigation systems, aircraft can fly increasingly complex trajectories, including final approaches. The same aircraft are envisioned to fly in formation relatively soon. More daring moves include the recent introduction of "Free Flight" operations. Despite all these impressive improvements, they remain largely incremental in nature and they hit a "wall of complexity" that makes it somewhat difficult to incorporate more automation, such as the elusive, and perhaps infeasible, goal of achieving fully automated air traffic control, and to design and insert autonomous vehicles, small and large, in cities and at high altitudes. We introduce Ariadne, a thread to accelerate the productivity gains achieved by air traffic services providers around the globe. Ariadne is an engineered version of the common sense practice of always keeping a "Plan B", and possibly "plans C, D, E, and F" against unexpected events when any decision is made by pilots, air traffic controllers, dispatchers, and any other safetycritical actor of the air transportation system. The name "Ariadne" was chosen to honor the mythical character Ariadne, daughter of Minos the king of Crete, who conceived the "Plan B" mechanism that would allow her lover to exit Daedalus’ Labyrinth after killing the Minotaur. Ariadne and its informal definition as "Plan B engineering" offer surprising opportunities and properties, including not only provable operations safety with unproven components, but also a thread that can inherently be scaled up and quickly adapt to new air traffic scenarios, including the transition to free flight and accommodation of unmanned aviation. It also supports existing operations and therefore it does not conflict with current air traffic control practices. Modern computational capabilities and powerful AI algorithms make its implementation increasingly feasible to address more aspects of air traffic management.

Scaling Distributed Machine Learning with InNetwork Aggregation(20200930) [Technical Report]Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We codesign the switch processing with the endhost protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of realworld benchmark models.

Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning(20200930) [Technical Report]Efficient collective communication is crucial to parallelcomputing applications such as distributed training of largescale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models. We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only nonzero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2×. Even at 100 Gbps, we assess that OmniReduce delivers 1.22.6× better performance for networkbottlenecked DNNs.

Unified Finite Series Approximation of FSO Performance over Strong Turbulence Combined with Various Pointing Error Conditions(IEEE Transactions on Communications, Institute of Electrical and Electronics Engineers (IEEE), 20200710) [Article]In this paper, we investigate both the bit error rate (BER) and outage performance of freespace optical (FSO) links over strong turbulence combined with various pointing error conditions. Considering atmospheric turbulence and pointing errors as main factors that deteriorate the quality of an optical link, we obtain a unified finite series approximation of the composite probability density function, which embraces generalized pointing error models. This approximation leads to new unified formulas for the BER and outage capacity of an FSO link, which account for the two possible detection mechanisms of intensity modulation/direct detection and heterodyne detection. Selected simulation results confirm that the newly derived approximations can give precise predictions of both the average BER and the outage capacity of FSO communication that are generally applicable to all environments.

Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation(2020) [Technical Report]Powerful computer clusters are used nowadays to train complex deep neural networks (DNN) on large datasets. Distributed training workloads increasingly become communication bound. For this reason, many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to argue about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid and lowrank). We also propose a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate our API on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, depending on the underlying communication library and computational cost of compression/decompression, we demonstrate that some methods may be impractical.

PETSc Users Manual: Revision 3.10(Office of Scientific and Technical Information (OSTI), 20180901) [Technical Report]This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on highperformance computers. The Portable, Extensible Toolkit forScientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of largescale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all messagepassing communication.

Exploiting Data Sparsity for LargeScale Matrix Computations(20180224) [Technical Report]Exploiting data sparsity in dense matrices is an algorithmic bridge between architectures that are increasingly memoryaustere on a percore basis and extremescale applications. The Hierarchical matrix Computations on Manycore Architectures (HiCMA) library tackles this challenging problem by achieving significant reductions in time to solution and memory footprint, while preserving a specified accuracy requirement of the application. HiCMA provides a highperformance implementation on distributedmemory systems of one of the most widely used matrix factorization in largescale scientific applications, i.e., the Cholesky factorization. It employs the tile lowrank data format to compress the dense datasparse offdiagonal tiles of the matrix. It then decomposes the matrix computations into interdependent tasks and relies on the dynamic runtime system StarPU for asynchronous outoforder scheduling, while allowing high userproductivity. Performance comparisons and memory footprint on matrix dimensions up to eleven million show a performance gain and memory saving of more than an order of magnitude for both metrics on thousands of cores, against stateoftheart opensource and vendor optimized numerical libraries. This represents an important milestone in enabling largescale matrix computations toward solving big data problems in geospatial statistics for climate/weather forecasting applications.

Batched Tile LowRank GEMM on GPUs(201802) [Technical Report]Dense General MatrixMatrix (GEMM) multiplication is a core operation of the Basic Linear Algebra Subroutines (BLAS) library, and therefore, often resides at the bottom of the traditional software stack for most of the scientific applications. In fact, chip manufacturers give a special attention to the GEMM kernel implementation since this is exactly where most of the highperformance software libraries extract the hardware performance. With the emergence of big data applications involving large datasparse, hierarchically lowrank matrices, the offdiagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile lowrank (TLR) data format is composed of small data structures, which retains the most significant information for each tile. However, to operate on lowrank tiles, a new GEMM operation and its corresponding API have to be designed on GPUs so that it can exploit the data sparsity structure of the matrix while leveraging the underlying TLR compression format. The main idea consists in aggregating all operations onto a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for TLR advance algorithms.

Performance Impact of RankReordering on Advanced Polar Decomposition Algorithms(2018) [Technical Report]We demonstrate the importance of both MPI rank reordering and choice of processor grid topology in the context of advanced dense linear algebra (DLA) applications for distributedmemory systems. In particular, we focus on the advanced polar decomposition (PD) algorithm, based on the QRbased Dynamically Weighted Halley method (QDWH). The QDWH algorithm may be used as the first computational step toward solving symmetric eigenvalue problems and the singular value decomposition. Sukkari et al. (ACM TOMS, 2017) have shown that QDWH may benefit from rectangular instead of square processor grid topologies, which directly impact the performance of the underlying ScaLAPACK algorithms. In this work, we experiment an extensive combination of grid topologies and rank reorderings for different matrix sizes and number of nodes, and use QDWH as a proxy for advanced computebound linear algebra operations, since it is rich in dense linear solvers and factorizations. A performance improvement of up to 54% can be observed for QDWH on 800 nodes of a Cray XC system, thanks to an optimal combination, especially in strong scaling mode of operation, for which communication overheads may become dominant. We perform a thorough application profiling to analyze the impact of reordering and grid topologies on the various linear algebra components of the QDWH algorithm. It turns out that point topoint communications may be considerably reduced thanks to a judicious choice of grid topology, while properly setting the rank reordering using the features from the craympich library.

Ubiquitous Asynchronous Computations for Solving the Acoustic Wave Propagation Equation(2018) [Technical Report]This paper designs and implements an ubiquitous asynchronous computational scheme for solving the acoustic wave propagation equation with Absorbing Boundary Conditions (ABCs) in the context of seismic imaging applications. While the Convolutional Perfectly Matched Layer (CPML) is typically used as ABCs in the oil and gas industry, its formulation further stresses memory accesses and decreases the arithmetic intensity at the physical domain boundaries. The challenges with CPML are twofold: (1) the strong, inherent data dependencies imposed on the explicit time stepping scheme render asynchronous time integration cumbersome and (2) the idle time is further exacerbated by the load imbalance introduced among processing units. In fact, the CPML formulation of the ABCs requires expensive synchronization points, which may hinder parallel performance of the overall asynchronous time integration. In particular, when deployed in conjunction with the Multicoreoptimized Wavefront Diamond (MWD) tiling approach for the inner domain points, it results into a major performance slow down. To relax CPML’s synchrony and mitigate the resulting load imbalance, we embed CPML’s calculation into MWD’s inner loop and carry on the time integration with finegrained computations in an asynchronous, holistic way. This comes at the price of storing transient results to alleviate dependencies from critical data hazards, while maintaining the numerical accuracy of the original scheme. Performance results on various x86 architectures demonstrate the superiority of MWD with CPML against the standard spatial blocking. To our knowledge, this is the first practical study, which highlights the consolidation of CPML ABCs with asynchronous temporal blocking stencil computations.

Borehole Tool for the Comprehensive Characterization of Hydratebearing Sediments(Office of Scientific and Technical Information (OSTI), 20171230) [Technical Report]Reservoir characterization and simulation require reliable parameters to anticipate hydrate deposits responses and production rates. The acquisition of the required fundamental properties currently relies on wireline logging, pressure core testing, and/or laboratory observations of synthesized specimens, which are challenged by testing capabilities and innate sampling disturbances. The project reviews hydratebearing sediments, properties, and inherent sampling effects, albeit lessen with the developments in pressure core technology, in order to develop robust correlations with index parameters. The resulting information is incorporated into a tool for optimal field characterization and parameter selection with uncertainty analyses. Ultimately, the project develops a borehole tool for the comprehensive characterization of hydratebearing sediments at in situ, with the design recognizing past developments and characterization experience and benefited from the inspiration of nature and sensor miniaturization.

HLIBCov: Parallel Hierarchical Matrix Approximation of Large Covariance Matrices and Likelihoods with Applications in Parameter Identification(20170926) [Technical Report]The main goal of this article is to introduce the parallel hierarchical matrix library HLIBpro to the statistical community. We describe the HLIBCov package, which is an extension of the HLIBpro library for approximating large covariance matrices and maximizing likelihood functions. We show that an approximate Cholesky factorization of a dense matrix of size $2M\times 2M$ can be computed on a modern multicore desktop in few minutes. Further, HLIBCov is used for estimating the unknown parameters such as the covariance length, variance and smoothness parameter of a Mat\'ern covariance function by maximizing the joint Gaussian loglikelihood function. The computational bottleneck here is expensive linear algebra arithmetics due to large and dense covariance matrices. Therefore covariance matrices are approximated in the hierarchical ($\H$) matrix format with computational cost $\mathcal{O}(k^2n \log^2 n/p)$ and storage $\mathcal{O}(kn \log n)$, where the rank $k$ is a small integer (typically $k<25$), $p$ the number of cores and $n$ the number of locations on a fairly general mesh. We demonstrate a synthetic example, where the true values of known parameters are known. For reproducibility we provide the C++ code, the documentation, and the synthetic data.

LowSNR Capacity of MIMO Optical Intensity Channels(20170918) [Technical Report]The capacity of the multipleinput multipleoutput (MIMO) optical intensity channel is studied, under both average and peak intensity constraints. We focus on low SNR, which can be modeled as the scenario where both constraints proportionally vanish, or where the peak constraint is held constant while the average constraint vanishes. A capacity upper bound is derived, and is shown to be tight at low SNR under both scenarios. The capacity achieving input distribution at low SNR is shown to be a maximallycorrelated vectorbinary input distribution. Consequently, the lowSNR capacity of the channel is characterized. As a byproduct, it is shown that for a channel with peak intensity constraints only, or with peak intensity constraints and individual (per aperture) average intensity constraints, a simple scheme composed of coded onoff keying, spatial repetition, and maximumratio combining is optimal at low SNR.

PETSc Users Manual Revision 3.8(Office of Scientific and Technical Information (OSTI), 20170901) [Technical Report]This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on highperformance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of largescale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all messagepassing communication.

Partial inversion of elliptic operator to speed up computation of likelihood in Bayesian inference(20170809) [Technical Report]In this paper, we speed up the solution of inverse problems in Bayesian settings. By computing the likelihood, the most expensive part of the Bayesian formula, one compares the available measurement data with the simulated data. To get simulated data, repeated solution of the forward problem is required. This could be a great challenge. Often, the available measurement is a functional $F(u)$ of the solution $u$ or a small part of $u$. Typical examples of $F(u)$ are the solution in a point, solution on a coarser grid, in a small subdomain, the mean value in a subdomain. It is a waste of computational resources to evaluate, first, the whole solution and then compute a part of it. In this work, we compute the functional $F(u)$ direct, without computing the full inverse operator and without computing the whole solution $u$. The main ingredients of the developed approach are the hierarchical domain decomposition technique, the finite element method and the Schur complements. To speed up computations and to reduce the storage cost, we approximate the forward operator and the Schur complement in the hierarchical matrix format. Applying the hierarchical matrix technique, we reduced the computing cost to $\mathcal{O}(k^2n \log^2 n)$, where $k\ll n$ and $n$ is the number of degrees of freedom. Up to the $\H$matrix accuracy, the computation of the functional $F(u)$ is exact. To reduce the computational resources further, we can approximate $F(u)$ on, for instance, multiple coarse meshes. The offered method is well suited for solving multiscale problems. A disadvantage of this method is the assumption that one has to have access to the discretisation and to the procedure of assembling the Galerkin matrix.

Application of Bayesian Networks for Estimation of Individual Psychological Characteristics(20170719) [Technical Report]In this paper we apply Bayesian networks for developing more accurate final overall estimations of psychological characteristics of an individual, based on psychological test results. Psychological tests which identify how much an individual possesses a certain factor are very popular and quite common in the modern world. We call this value for a given factor  the final overall estimation. Examples of factors could be stress resistance, the readiness to take a risk, the ability to concentrate on certain complicated work and many others. An accurate qualitative and comprehensive assessment of human potential is one of the most important challenges in any company or collective. The most common way of studying psychological characteristics of each single person is testing. Psychologists and sociologists are constantly working on improvement of the quality of their tests. Despite serious work, done by psychologists, the questions in tests often do not produce enough feedback due to the use of relatively poor estimation systems. The overall estimation is usually based on personal experiences and the subjective perception of a psychologist or a group of psychologists about the investigated psychological personality factors.

On the Optimality of Repetition Coding among Rate1 DCoffset STBCs for MIMO Optical Wireless Communications(20170706) [Technical Report]In this paper, an optical wireless multipleinput multipleoutput communication system employing intensitymodulation directdetection is considered. The performance of direct current offset spacetime block codes (DCSTBC) is studied in terms of pairwise error probability (PEP). It is shown that among the class of DCSTBCs, the worst case PEP corresponding to the minimum distance between two codewords is minimized by repetition coding (RC), under both electrical and optical individual power constraints. It follows that among all DCSTBCs, RC is optimal in terms of worstcase PEP for static channels and also for varying channels under any turbulence statistics. This result agrees with previously published numerical results showing the superiority of RC in such systems. It also agrees with previously published analytic results on this topic under lognormal turbulence and further extends it to arbitrary turbulence statistics. This shows the redundancy of the timedimension of the DCSTBC in this system. This result is further extended to sum power constraints with static and turbulent channels, where it is also shown that the time dimension is redundant, and the optimal DCSTBC has a spatial beamforming structure. Numerical results are provided to demonstrate the difference in performance for systems with different numbers of receiving apertures and different throughput.

On the Fast and Precise Evaluation of the Outage Probability of Diversity Receivers Over Generalized Fading Channels(201701) [Technical Report]

Appendices for: Improper Signaling in TwoPath Relay Channels(20161201) [Technical Report]This document contains the appendices for the work in “Improper Signaling in TwoPath Relay Channels,” which is submitted to 2017 IEEE International Conference on Communications (ICC) Workshop on FullDuplex Communications for Future Wireless Networks, Paris, France.