Extreme Computing Research Center
Recent Submissions
-
Test and Visualization of Covariance Properties for Multivariate Spatio-Temporal Random Fields(Journal of Computational and Graphical Statistics, Informa UK Limited, 2023-03-16) [Article]The prevalence of multivariate space-time data collected from monitoring networks and satellites, or generated from numerical models, has brought much attention to multivariate spatio-temporal statistical models, where the covariance function plays a key role in modeling, inference, and prediction. For multivariate space-time data, understanding the spatio-temporal variability, within and across variables, is essential in employing a realistic covariance model. Meanwhile, the complexity of generic covariances often makes model fitting very challenging, and simplified covariance structures, including symmetry and separability, can reduce the model complexity and facilitate the inference procedure. However, a careful examination of these properties is needed in real applications. In the work presented here, we formally define these properties for multivariate spatio-temporal random fields and use functional data analysis techniques to visualize them, hence providing intuitive interpretations. We then propose a rigorous rank-based testing procedure to conclude whether the simplified properties of covariance are suitable for the underlying multivariate space-time data. The good performance of our method is illustrated through synthetic data, for which we know the true structure. We also investigate the covariance of bivariate wind speed, a key variable in renewable energy, over a coastal and an inland area in Saudi Arabia. The Supplementary Material is available online, including the R code for our developed methods.
-
Spatiotemporal data management and analytics for recommender systems(World Wide Web, Springer Science and Business Media LLC, 2023-03-13) [Article]
-
Semantic Segmentation of Mesoscale Eddies in the Arabian Sea: A Deep Learning Approach(Remote Sensing, MDPI AG, 2023-03-10) [Article]Detecting mesoscale ocean eddies provides a better understanding of the oceanic processes that govern the transport of salt, heat, and carbon. Established eddy detection techniques rely on physical or geometric criteria, and they notoriously fail to predict eddies that are neither circular nor elliptical in shape. Recently, deep learning techniques have been applied for semantic segmentation of mesoscale eddies, relying on the outputs of traditional eddy detection algorithms to supervise the training of the neural network. However, this approach limits the network’s predictions because the available annotations are either circular or elliptical. Moreover, current approaches depend on the sea-surface height, temperature, or currents as inputs to the network, and these data may not provide all the information necessary to accurately segment eddies. In the present work, we have trained a neural network for the semantic segmentation of eddies using human-based—and expert-validated—annotations of eddies in the Arabian Sea. Training with human-annotated datasets enables the network predictions to include more complex geometries, which occur commonly in the real ocean. We then examine the impact of different combinations of input surface variables on the segmentation performance of the network. The results indicate that providing additional surface variables as inputs to the network improves the accuracy of the predictions by approximately 5%. We have further fine-tuned another pre-trained neural network to segment eddies and achieved a reduced overall training time and higher accuracy compared to the results from a network trained from scratch.
-
Goodness-of-fit tests for multivariate skewed distributions based on the characteristic function(arXiv, 2023-03-08) [Preprint]We employ a general Monte Carlo method to test composite hypotheses of goodness-of-fit for several popular multivariate models that can accommodate both asymmetry and heavy tails. Specifically, we consider weighted L2-type tests based on a discrepancy measure involving the distance between empirical characteristic functions and thus avoid the need for employing corresponding population quantities which may be unknown or complicated to work with. The only requirements of our tests are that we should be able to draw samples from the distribution under test and possess a reasonable method of estimation of the unknown distributional parameters. Monte Carlo studies are conducted to investigate the performance of the test criteria in finite samples for several families of skewed distributions. Real-data examples are also included to illustrate our method.
-
A Comprehensive Empirical Study of Heterogeneity in Federated Learning(IEEE Internet of Things Journal, Institute of Electrical and Electronics Engineers (IEEE), 2023-03-07) [Article]Federated learning (FL) is becoming a popular paradigm for collaborative learning over distributed, private datasets owned by non-trusting entities. FL has seen successful deployment in production environments, and it has been adopted in services such as virtual keyboards, auto-completion, item recommendation, and several IoT applications. However, FL comes with the challenge of performing training over largely heterogeneous datasets, devices, and networks that are out of the control of the centralized FL server. Motivated by this inherent challenge, we aim to empirically characterize the impact of device and behavioral heterogeneity on the trained model. We conduct an extensive empirical study spanning nearly 1.5K unique configurations on five popular FL benchmarks. Our analysis shows that these sources of heterogeneity have a major impact on both model quality and fairness, causing up to 4.6× and 2.2× degradation in the quality and fairness, respectively, thus shedding light on the importance of considering heterogeneity in FL system design.
-
Winds and waves in the Arabian Gulf: physics, characteristics, and long-term hindcast(International Journal of Climatology, Wiley, 2023-03-01) [Article]A 40-year hindcast of the wave conditions in the Arabian Gulf, the elongated basin between the Arabian Peninsula and Iran, is described. The input wind fields were derived from the ERA5 reanalysis by the European Centre for Medium-Range Weather Forecasts for the period 1979–2019. Extensive comparisons with scatterometer data suggested a direction-dependent wind calibration with a strong influence on the underestimation of offshore blowing winds. The dominant wave patterns in the basin are described together with the characteristics of the related winds. After characterizing the possible meteorological conditions, the corresponding wave fields were hindcast using the WAVEWATCH III model. Extensive validation against altimeter data shows an almost unitary fit and a 0.02-m bias; a comparison with measured data strongly demonstrates that the general patterns and conditions were well reproduced. The wave physics was also analysed, revealing the crucial role of their dynamic generation in the investigated area domain. The very shallow nature of some limited coastal zones suggests that locally focused studies, with more detailed and specific physics, are required for the best description of the waves in these areas.
-
A Universal Question-Answering Platform for Knowledge Graphs(arXiv, 2023-03-01) [Preprint]Knowledge from diverse application domains is organized as knowledge graphs (KGs) that are stored in RDF engines accessible in the web via SPARQL endpoints. Expressing a well-formed SPARQL query requires information about the graph structure and the exact URIs of its components, which is impractical for the average user. Question answering (QA) systems assist by translating natural language questions to SPARQL. Existing QA systems are typically based on application-specific human-curated rules, or require prior information, expensive pre-processing and model adaptation for each targeted KG. Therefore, they are hard to generalize to a broad set of applications and KGs. In this paper, we propose KGQAn, a universal QA system that does not need to be tailored to each target KG. Instead of curated rules, KGQAn introduces a novel formalization of question understanding as a text generation problem to convert a question into an intermediate abstract representation via a neural sequence-to-sequence model. We also develop a just-in-time linker that maps at query time the abstract representation to a SPARQL query for a specific KG, using only the publicly accessible APIs and the existing indices of the RDF store, without requiring any pre-processing. Our experiments with several real KGs demonstrate that KGQAn is easily deployed and outperforms by a large margin the state-of-the-art in terms of quality of answers and processing time, especially for arbitrary KGs, unseen during the training.
-
Effects of Observational Uncertainty and Models Similarity on Climate Change Projections(Research Square Platform LLC, 2023-03-01) [Preprint]Climate change projections (CCPs) are based on the multimodel means of individual climate model simulations that are assumed to be independent. However, model similarity leads to projections biased toward the largest set of similar models and the underestimation of uncertainties. We assessed the influence of similarities in CMIP6 through CMIP3 CCPs. We ascertained model similarity due to shared physics/dynamics and initial conditions by comparing simulated spatial temperature and precipitation with the corresponding observed patterns and accounting for inter-model spread relative to the spread across observational datasets. After accounting for similarity, the information from 57 CMIP6, 47 CMIP5, and 24 CMIP3 models could be explained by just 11 effective models, without significant differences in globally averaged climate change statistics. The effective models showed a smaller globally averaged temperature rise of 0.25°C (~0.5°C–1°C in some regions) by the end of the 21st century relative to the multimodel mean of all models for socioeconomic pathways 5–8.5.
-
ZyPR: End-to-End Build Tool and Runtime Manager for Partial Reconfiguration of FPGA SoCs at the Edge(ACM Transactions on Reconfigurable Technology and Systems, Association for Computing Machinery (ACM), 2023-02-27) [Article]Partial reconfiguration (PR) is a key enabler to the design and development of adaptive systems on modern Field Programmable Gate Array (FPGA) Systems-on-Chip (SoCs), allowing hardware to be adapted dynamically at runtime. Vendor-supported PR infrastructure is performance limited and blocking, drivers entail complex memory management, and software/hardware design requires bespoke knowledge of the underlying hardware. This paper presents ZyPR: a complete end-to-end framework that provides high performance reconfiguration of hardware from within a software abstraction in the Linux userspace, automating the process of building PR applications, with support for the Xilinx Zynq and Zynq UltraScale+ architectures, aimed at enabling non-expert application designers to leverage PR for edge applications. We compare ZyPR against traditional vendor tooling for PR management as well as recent open source tools that support PR under Linux. The framework provides a high performance runtime along with low overhead for its provided abstractions. We introduce improvements to our previous work, increasing the provisioning throughput for PR bitstreams on the Zynq UltraScale+ by 2× and 5.4× compared to Xilinx’s FPGA Manager.
-
A Preliminary Green Function Database for Global 3-D Centroid Moment Tensor Inversions(Copernicus GmbH, 2023-02-26) [Presentation]Currently, the accuracy of synthetic seismograms used for Global CMT inversion, which are based on modern 3D Earth models, is limited by the validity of the path-average approximation for mode summation and surface-wave ray theory. Inaccurate computation of the ground motion’s amplitude and polarization as well as other effects that are not modeled may bias inverted earthquake parameters. Synthetic seismograms of higher accuracy will improve the determination of seismic sources in the CMT analysis, and remove concerns about this source of uncertainty. Strain tensors, and databases thereof, have recently been implemented for the spectral-element solver SPECFEM3D (Ding et al., 2020) based on the theory of previous work (Zhao et al., 2006) for regional inversion of seismograms for earthquake parameters. The main barriers to a global database of Green functions have been storage, I/O, and computation. But, compression tricks and smart selection of spectral elements, fast I/O data formats for high-performance computing, such as ADIOS, and wave-equation solution on GPUs, have dramatically decreased the cost of storage, I/O, and computation, respectively. Additionally, the global spectral-element grid matches the accuracy of a full forward calculation by virtue of Lagrange interpolation. Here, we present our first preliminary database of stored Green functions for 17 seismic stations of the global seismic networks to be used in future 3-D centroid moment tensor inversions. We demonstrate the fast retrieval and computation of seismograms from the database.
-
Reshaping Geostatistical Modeling and Prediction for Extreme-Scale Environmental Applications(IEEE, 2023-02-23) [Conference Paper]We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations; however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available. We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to 12X performance speedup against the highly optimized dense Cholesky implementation.
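The role of the Cholesky factor in the Gaussian log-likelihood mentioned above can be illustrated with a minimal dense, double-precision sketch. It is not the tile-based, mixed-precision, low-rank solver described in the abstract; the exponential covariance, the helper names `exp_cov` and `gaussian_loglik`, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def exp_cov(locs, variance, length_scale):
    """Exponential covariance between all pairs of 2-D locations (illustrative choice)."""
    d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    return variance * np.exp(-d / length_scale)

def gaussian_loglik(z, cov):
    """Zero-mean Gaussian log-likelihood evaluated through a dense Cholesky factor."""
    n = z.shape[0]
    L = np.linalg.cholesky(cov)                 # cov = L @ L.T; this is the O(n^3) step
    w = np.linalg.solve(L, z)                   # w = L^{-1} z, so w @ w = z^T cov^{-1} z
    logdet = 2.0 * np.sum(np.log(np.diag(L)))   # log det(cov) from the diagonal of L
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + w @ w)

rng = np.random.default_rng(0)
locs = rng.random((50, 2))                      # 50 hypothetical measurement locations
cov = exp_cov(locs, variance=1.0, length_scale=0.3)
z = np.linalg.cholesky(cov) @ rng.standard_normal(50)   # one synthetic realization
ll = gaussian_loglik(z, cov)
print(ll)
```

An optimizer would call `gaussian_loglik` once per candidate parameter set, which is why the cubic cost of the factorization dominates for large numbers of locations.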
-
Physics-Informed Deep Neural Network for Backward-in-Time Prediction: Application to Rayleigh–Bénard Convection(Artificial Intelligence for the Earth Systems, American Meteorological Society, 2023-02-14) [Article]Backward-in-time predictions are needed to better understand the underlying dynamics of physical fluid flows and improve future forecasts. However, integrating fluid flows backward in time is challenging because of numerical instabilities caused by the diffusive nature of the fluid systems and nonlinearities of the governing equations. Although this problem has been long addressed using a non-positive diffusion coefficient when integrating backward, it is notoriously inaccurate. In this study, a physics-informed deep neural network (PI-DNN) is presented to predict past states of a dissipative dynamical system from snapshots of the subsequent evolution of the system state. The performance of the PI-DNN is investigated using several systematic numerical experiments and the accuracy of the backward-in-time predictions is evaluated in terms of different error metrics. The proposed PI-DNN can predict the previous state of the Rayleigh–Bénard convection with an 8-time-step average normalized ℓ2-error of less than 2% for a turbulent flow at a Rayleigh number of 10^5.
-
A multivariate modified skew-normal distribution(Statistical Papers, Springer Science and Business Media LLC, 2023-02-13) [Article]We introduce a multivariate version of the modified skew-normal distribution, which contains the multivariate normal distribution as a special case. Unlike the Azzalini multivariate skew-normal distribution, this new distribution has a nonsingular Fisher information matrix when the skewness parameters are all zero, and its profile log-likelihood of the skewness parameters is always a non-monotonic function. We study some basic properties of the proposed family of distributions and present an expectation-maximization (EM) algorithm for parameter estimation that we validate through simulation studies. Finally, we apply the proposed model to the univariate frontier data and to trivariate wind speed data, and compare its performance with the Azzalini skew-normal model.
-
Uncertainty quantification in coastal aquifers using the multilevel Monte Carlo method(arXiv, 2023-02-13) [Preprint]We consider a class of density-driven flow problems. We are particularly interested in the problem of the salinization of coastal aquifers. We consider the Henry saltwater intrusion problem with uncertain porosity, permeability, and recharge parameters as a test case. The reason for the presence of uncertainties is the lack of knowledge, inaccurate measurements, and inability to measure parameters at each spatial or time location. This problem is nonlinear and time-dependent. The solution is the salt mass fraction, which is uncertain and changes in time. Uncertainties in porosity, permeability, recharge, and mass fraction are modeled using random fields. This work investigates the applicability of the well-known multilevel Monte Carlo (MLMC) method for such problems. The MLMC method can reduce the total computational and storage costs. Moreover, the MLMC method runs multiple scenarios on different spatial and time meshes and then estimates the mean value of the mass fraction. The parallelization is performed in both the physical space and stochastic space. To solve every deterministic scenario, we run the parallel multigrid solver ug4 in a black-box fashion. We use the solution obtained from the quasi-Monte Carlo method as a reference solution.
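The multilevel idea of combining many cheap coarse-mesh runs with a few expensive fine-mesh runs can be sketched on a toy problem. The example below is not the Henry problem and does not use ug4: it assumes a scalar decay equation u' = -a u with an uncertain rate a, solved by explicit Euler on successively refined time meshes, and the function names and sample counts are hypothetical.

```python
import numpy as np

def euler_solution(a, n_steps):
    """Explicit Euler value of u(1) for u' = -a*u, u(0) = 1, using n_steps time steps."""
    return (1.0 - a / n_steps) ** n_steps

def mlmc_mean(samples_per_level, rng):
    """Telescoping MLMC estimate of E[u(1)] with a ~ Uniform(0.5, 1.5)."""
    estimate = 0.0
    for level, n_samples in enumerate(samples_per_level):
        a = rng.uniform(0.5, 1.5, size=n_samples)      # fresh samples for each level
        fine = euler_solution(a, 2 ** (level + 1))
        if level == 0:
            correction = fine                          # coarsest level: plain Monte Carlo
        else:
            coarse = euler_solution(a, 2 ** level)     # same samples on the coarser mesh
            correction = fine - coarse
        estimate += correction.mean()
    return estimate

rng = np.random.default_rng(42)
# Many samples on cheap coarse meshes, few on expensive fine meshes.
estimate = mlmc_mean([4000, 2000, 1000, 500, 250, 125], rng)
exact = np.exp(-0.5) - np.exp(-1.5)                    # E[exp(-a)] for a ~ Uniform(0.5, 1.5)
print(estimate, exact)
```

Because the fine and coarse solutions share the same random input on each level, their difference has small variance, so the finer levels need far fewer samples than the coarsest one.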
-
FilFL: Accelerating Federated Learning via Client Filtering(arXiv, 2023-02-13) [Preprint]Federated learning is an emerging machine learning paradigm that enables devices to train collaboratively without exchanging their local data. The clients participating in the training process are a random subset selected from the pool of clients. This procedure is called client selection, an important area in federated learning because it strongly affects the convergence rate, learning efficiency, and generalization. In this work, we introduce client filtering in federated learning (FilFL), a new approach to optimize client selection and training. FilFL first filters the active clients by choosing a subset of them that maximizes a specific objective function; then, a client selection method is applied to that subset. We provide a thorough analysis of its convergence in a heterogeneous setting. Empirical results demonstrate several benefits to our approach, including improved learning efficiency, accelerated convergence (2-3× faster), and higher test accuracy (around 2-10 percentage points higher).
-
Multiple-Relaxation Runge Kutta Methods for Conservative Dynamical Systems(arXiv, 2023-02-10) [Preprint]We generalize the idea of relaxation time stepping methods in order to preserve multiple nonlinear conserved quantities of a dynamical system by projecting along directions defined by multiple time stepping algorithms. Similar to the directional projection method of Calvo et al., we use embedded Runge-Kutta methods to facilitate this in a computationally efficient manner. Proofs of the accuracy of the modified RK methods and the existence of valid relaxation parameters are given, under some restrictions. Among other examples, we apply this technique to Implicit-Explicit Runge-Kutta time integration for the Korteweg-de Vries equation and investigate the feasibility and effect of conserving multiple invariants for multi-soliton solutions.
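For orientation, the single-invariant relaxation idea that this work generalizes can be sketched as follows. This is not the multiple-relaxation method of the abstract: it assumes one quadratic invariant (the squared norm of a harmonic oscillator), classical RK4 as the base scheme, and a hypothetical helper `relaxation_rk4_step`. The step direction d is rescaled by a factor gamma chosen so that the invariant is preserved.

```python
import numpy as np

# Classical RK4 Butcher tableau.
A = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
B = np.array([1.0, 2.0, 2.0, 1.0]) / 6.0

def rhs(u):
    """Harmonic oscillator u' = (u2, -u1); the exact flow conserves |u|^2."""
    return np.array([u[1], -u[0]])

def relaxation_rk4_step(u, dt):
    """One RK4 step rescaled by gamma so that |u|^2 is conserved."""
    stages = []
    for i in range(4):
        y = u + dt * sum(A[i, j] * stages[j] for j in range(i))
        stages.append(rhs(y))
    d = dt * sum(B[i] * stages[i] for i in range(4))   # ordinary RK4 increment
    # Solve |u + gamma*d|^2 = |u|^2 for the nonzero root (requires d != 0).
    gamma = -2.0 * np.dot(u, d) / np.dot(d, d)
    return u + gamma * d, gamma * dt                   # new state and relaxed step size

u = np.array([1.0, 0.0])
t = 0.0
for _ in range(1000):
    u, dt_used = relaxation_rk4_step(u, 0.1)
    t += dt_used
print(u @ u, t)
```

For a quadratic invariant gamma has this closed form; for general nonlinear invariants it is found by solving a scalar equation, and the abstract's contribution is to handle several invariants at once.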
-
Transdermal and lateral effective diffusivities for drug transport in stratum corneum from a microscopic anisotropic diffusion model.(European journal of pharmaceutics and biopharmaceutics : official journal of Arbeitsgemeinschaft fur Pharmazeutische Verfahrenstechnik e.V, Elsevier BV, 2023-02-09) [Article]This paper presents a computational model of molecular diffusion through the interfollicular stratum corneum. Specifically, it extends an earlier two-dimensional microscopic model for the permeability in two ways: (1) a microporous leakage pathway through the intercellular lipid lamellae allows slow permeation of highly hydrophilic permeants through the tissue; and (2) the model yields explicit predictions of both lateral (Dsc‖) and transdermal (Dsc⊥) effective (average, homogenized) diffusivities of solutes within the tissue. We present here the mathematical framework for the analysis and a comparison of the predictions with experimental data on desorption of both hydrophilic and lipophilic solutes from human stratum corneum in vitro. Diffusion in the lipid lamellae is found to make the effective diffusivity highly anisotropic, with the predicted ratio Dsc‖/Dsc⊥ ranging from 34 to 39 for fully hydrated skin and from 150 to more than 1000 for partially hydrated skin. The diffusivities and their ratio are in accord with both experimental data and the results of mathematical analyses performed by others.
-
ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots(arXiv, 2023-02-08) [Preprint]Conversational AI and Question-Answering systems (QASs) for knowledge graphs (KGs) are both emerging research areas: they empower users with natural language interfaces for extracting information easily and effectively. Conversational AI simulates conversations with humans; however, it is limited by the data captured in the training datasets. In contrast, QASs retrieve the most recent information from a KG by understanding and translating the natural language question into a formal query supported by the database engine. In this paper, we present a comprehensive study of the characteristics of the existing alternatives towards combining both worlds into novel KG chatbots. Our framework compares two representative conversational models, ChatGPT and Galactica, against KGQAn, the current state-of-the-art QAS. We conduct a thorough evaluation using four real KGs across various application domains to identify the current limitations of each category of systems. Based on our findings, we propose open research opportunities to empower QASs with chatbot capabilities for KGs. All benchmarks and all raw results are available for further analysis.
-
Improving classification of correct and incorrect protein-protein docking models by augmenting the training set(Bioinformatics Advances, Oxford University Press (OUP), 2023-02-02) [Article]Motivation: Protein-protein interactions drive many relevant biological events, such as infection, replication, and recognition. To control or engineer such events, we need to access the molecular details of the interaction provided by experimental 3D structures. However, such experiments take time and are expensive; moreover, the current technology cannot keep up with the high discovery rate of new interactions. Computational modeling, like protein-protein docking, can help to fill this gap by generating docking poses. Protein-protein docking generally consists of two parts, sampling and scoring. The sampling is an exhaustive search of the tridimensional space. The caveat of the sampling is that it generates a large number of incorrect poses, producing a highly unbalanced dataset. This limits the utility of the data to train machine learning classifiers. Results: Using weak supervision, we developed a data augmentation method that we named hAIkal. Using hAIkal, we increased the labeled training data to train several algorithms. We trained and obtained different classifiers; the best classifier has 81% accuracy and 0.51 MCC on the test set, surpassing the state-of-the-art scoring functions.
-
A singular value decomposition approach for detecting and delineating harmful algal blooms in the Red Sea(Frontiers in Remote Sensing, Frontiers Media SA, 2023-01-19) [Article]Harmful algal blooms (HABs) have adverse effects on marine ecosystems. An effective approach for detecting, monitoring, and eventually predicting the occurrences of such events is required. By combining a singular value decomposition (SVD) approach and satellite remote sensing observations, we propose a remote sensing algorithm to detect and delineate species-specific HABs. We implemented and tested the proposed SVD algorithm to detect HABs associated with the mixed assemblages of different phytoplankton functional type (PFT) groupings in the Red Sea. The results were validated with concurrent in-situ data from surface samples, demonstrating that the SVD-model performs remarkably well at detecting and distinguishing HAB species in the Red Sea basin. The proposed SVD-model offers a cost-effective tool for implementing an automated remote-sensing monitoring system for detecting HAB species in the basin. Such a monitoring system could be used for predicting HAB outbreaks based on near real-time measurements, essential to support aquaculture industries, desalination plants, tourism, and public health.