For more information visit: https://stat.kaust.edu.sa/

Recent Submissions

  • Forecasting high-frequency spatio-temporal wind power with dimensionally reduced echo state networks

    Huang, Huang; Castruccio, Stefano; Genton, Marc G. (Journal of the Royal Statistical Society: Series C (Applied Statistics), Wiley, 2022-01-23) [Article]
    Fast and accurate hourly forecasts of wind speed and power are crucial in quantifying and planning the energy budget in the electric grid. Modelling wind at a high resolution brings forth considerable challenges given its turbulent and highly nonlinear dynamics. In developing countries, where wind farms over a large domain are currently under construction or consideration, this is even more challenging given the necessity of modelling wind over space as well. In this work, we propose a machine learning approach to model the nonlinear hourly wind dynamics in Saudi Arabia with a domain-specific choice of knots to reduce spatial dimensionality. Our results show that for locations highlighted as wind abundant by a previous work, our approach results in an 11% improvement in the 2-h-ahead forecasted power against operational standards in the wind energy sector, yielding a saving of nearly one million US dollars over a year under current market prices in Saudi Arabia.
  • Statistical analysis of multi-day solar irradiance using a threshold time series model

    de Jesus Euan Campos, Carolina; Sun, Ying; Reich, Brian J. (Environmetrics, Wiley, 2022-01-20) [Article]
    The analysis of solar irradiance has important applications in predicting solar energy production from solar power plants. Although the sun provides more energy every day than we need, the variability caused by environmental conditions affects electricity production. Recently, new statistical models have been proposed to provide stochastic simulations of high-resolution data to downscale and forecast solar irradiance measurements. Most of the existing models are linear and depend heavily on normality assumptions. However, solar irradiance shows strong nonlinearity and is only measured during the daytime. Thus, we propose a new multi-day threshold autoregressive model to quantify the variability of the daily irradiance time series. We establish sufficient conditions for our model to be stationary, and we develop an inferential procedure to estimate the model parameters. When we apply our model to study the statistical properties of observed irradiance data in the Guadeloupe island group, a French overseas region located in the Southern Caribbean Sea, we are able to characterize two states of the irradiance series. These states represent the clear-sky and non-clear-sky regimes. Using our model, we are able to simulate irradiance series that behave similarly to the real data in mean and variability, and to obtain more accurate forecasts than linear models.
  • Assessing the reliability of wind power operations under a changing climate with a non-Gaussian bias correction

    Zhang, Jiachen; Crippa, Paola; Genton, Marc G.; Castruccio, Stefano (The Annals of Applied Statistics, Institute of Mathematical Statistics, 2021-12-21) [Article]
    Facing increasing societal and economic pressure, many countries have established strategies to develop renewable energy portfolios whose penetration in the market can alleviate the dependence on fossil fuels. In the case of wind, there is a fundamental question related to the resilience and hence profitability of future wind farms to a changing climate, given that current wind turbines have lifespans of up to 30 years. In this work we develop a new non-Gaussian method to adjust assimilated observational data to simulations and to estimate future wind, predicated on a trans-Gaussian transformation and a clusterwise minimization of the Kullback–Leibler divergence. Future wind abundance is determined for Saudi Arabia, a country with a recently established plan to develop a portfolio of up to 16 GW of wind energy. Further, we estimate the change in profits over future decades using additional high-resolution simulations, an improved method for vertical wind extrapolation and power curves from a collection of popular wind turbines. We find an overall increase in daily profit of $272,000 for the wind energy market for the optimal locations for wind farming in the country.
  • Joint Posterior Inference for Latent Gaussian Models with R-INLA

    Chiuchiolo, Cristian; Niekerk, Janet van; Rue, Haavard (arXiv, 2021-12-06) [Preprint]
    Efficient Bayesian inference remains a computational challenge in hierarchical models. Simulation-based approaches such as Markov Chain Monte Carlo methods are still popular but have a large computational cost. When dealing with the large class of Latent Gaussian Models, the INLA methodology embedded in the R-INLA software provides accurate Bayesian inference by computing a deterministic mixture representation to approximate the joint posterior, from which marginals are computed. The INLA approach has from the beginning targeted the approximation of univariate posteriors. In this paper we lay out the development foundation of the tools for also providing joint approximations for subsets of the latent field. These approximations inherit a Gaussian copula structure and additionally provide corrections for skewness. The same idea is also carried forward to sampling from the mixture representation, which we can now adjust for skewness.
  • Finite-sample properties of estimators for first and second order autoregressive processes

    Sørbye, Sigrunn Holbek; Nicolau, Pedro G.; Rue, Haavard (Statistical Inference for Stochastic Processes, Springer Science and Business Media LLC, 2021-12-05) [Article]
    The class of autoregressive (AR) processes is extensively used to model temporal dependence in observed time series. Such models are easily available and routinely fitted using freely available statistical software like R. A potential problem is that commonly applied estimators for the coefficients of AR processes are severely biased when the time series are short. This paper studies the finite-sample properties of well-known estimators for the coefficients of stationary AR(1) and AR(2) processes and provides bias-corrected versions of these estimators which are quick and easy to apply. The new estimators are constructed by modeling the relationship between the true and originally estimated AR coefficients using weighted orthogonal polynomial regression, taking the sampling distribution of the original estimators into account. The finite-sample distributions of the new bias-corrected estimators are approximated using transformations of skew-normal densities, combined with a Gaussian copula approximation in the AR(2) case. The properties of the new estimators are demonstrated by simulations and in the analysis of a real ecological data set. The estimators are easily available in our accompanying R-package for AR(1) and AR(2) processes of length 10–50, both giving bias-corrected coefficient estimates and corresponding confidence intervals.
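    The small-sample bias described in this abstract can be illustrated with a short simulation. The sketch below is not the authors' bias-corrected estimator or their R package; it only assumes a zero-mean Gaussian AR(1) process and the ordinary sample lag-one autocorrelation as the "commonly applied" estimator. The function names and the choice of coefficient 0.8 are hypothetical.

```python
import random


def simulate_ar1(phi, n, rng, burn_in=200):
    """Simulate a Gaussian AR(1) series of length n with coefficient phi."""
    x = 0.0
    series = []
    for t in range(burn_in + n):
        x = phi * x + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the burn-in so the start value is forgotten
            series.append(x)
    return series


def lag1_estimate(series):
    """Sample lag-one autocorrelation, a standard estimate of phi."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    num = sum(centered[t] * centered[t - 1] for t in range(1, n))
    den = sum(c * c for c in centered)
    return num / den


def mean_estimate(phi, n, replications=1000, seed=1):
    """Monte Carlo average of the estimator over independent series."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        total += lag1_estimate(simulate_ar1(phi, n, rng))
    return total / replications


if __name__ == "__main__":
    # Illustrative true coefficient (an assumption, not taken from the paper).
    for n in (10, 50, 500):
        print(n, round(mean_estimate(0.8, n), 3))
```

    Under these assumptions, the averaged estimate for short series should fall clearly below the true coefficient and approach it as the series length grows.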
  • Brain waves analysis via a non-parametric Bayesian mixture of autoregressive kernels

    Granados-Garcia, Guillermo; Fiecas, Mark; Babak, Shahbaba; Fortin, Norbert J.; Ombao, Hernando (Computational Statistics and Data Analysis, Elsevier BV, 2021-12) [Article]
    The standard approach to analyzing brain electrical activity is to examine the spectral density function (SDF) and identify frequency bands, defined a priori, that have the most substantial relative contributions to the overall variance of the signal. However, a limitation of this approach is that the precise frequency and bandwidth of oscillations are not uniform across different cognitive demands. Thus, these bands should not be arbitrarily set in any analysis. To overcome this limitation, the Bayesian mixture auto-regressive decomposition (BMARD) method is proposed as a data-driven approach that identifies (i) the number of prominent spectral peaks, (ii) the frequency peak locations, and (iii) their corresponding bandwidths (or spread of power around the peaks). Using the BMARD method, the standardized SDF is represented as a Dirichlet process mixture based on a kernel derived from second-order auto-regressive processes, which completely characterize the location (peak) and scale (bandwidth) parameters. A Metropolis-Hastings-within-Gibbs algorithm is developed for sampling the posterior distribution of the mixture parameters. Simulations demonstrate the robust performance of the proposed method. Finally, the BMARD method is applied to analyze local field potential (LFP) activity from the hippocampus of laboratory rats across different conditions in a non-spatial sequence memory experiment, to identify the most prominent frequency bands and examine the link between specific patterns of brain oscillatory activity and trial-specific cognitive demands.
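    To make the idea of a second-order autoregressive kernel concrete, the following sketch evaluates the spectral density of a single AR(2) process. It assumes a common textbook parameterization (not necessarily the one used in the paper): the reciprocal characteristic roots have modulus exp(-log_modulus) and phase ±2π·peak_freq, with frequency measured in cycles per sample. All function and parameter names are hypothetical, and the Dirichlet process mixture and the sampler are not reproduced here.

```python
import cmath
import math


def ar2_coefficients(peak_freq, log_modulus):
    """AR(2) coefficients from a peak location and a bandwidth-like parameter.

    Assumed parameterization: reciprocal roots r * exp(+/- i * theta) with
    r = exp(-log_modulus) and theta = 2 * pi * peak_freq.
    """
    r = math.exp(-log_modulus)
    theta = 2.0 * math.pi * peak_freq
    phi1 = 2.0 * r * math.cos(theta)
    phi2 = -r * r
    return phi1, phi2


def ar2_spectrum(freq, phi1, phi2, sigma2=1.0):
    """Spectral density of an AR(2) process at one frequency."""
    z = cmath.exp(-2j * math.pi * freq)
    transfer = 1.0 - phi1 * z - phi2 * z * z
    return sigma2 / abs(transfer) ** 2


def standardized_kernel(freqs, peak_freq, log_modulus):
    """AR(2) spectrum on an evenly spaced grid, scaled to unit Riemann sum."""
    phi1, phi2 = ar2_coefficients(peak_freq, log_modulus)
    values = [ar2_spectrum(f, phi1, phi2) for f in freqs]
    step = freqs[1] - freqs[0]
    total = sum(values) * step
    return [v / total for v in values]


if __name__ == "__main__":
    grid = [i / 1000 for i in range(501)]  # frequencies in [0, 0.5]
    kernel = standardized_kernel(grid, peak_freq=0.1, log_modulus=0.05)
    best = max(range(len(grid)), key=kernel.__getitem__)
    print("kernel peaks near frequency", grid[best])
```

    In this parameterization, a smaller log_modulus moves the roots closer to the unit circle and concentrates power more tightly around the peak, which is the sense in which one parameter controls location and the other controls bandwidth.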
  • Sub-Dimensional Mardia Measures of Multivariate Skewness and Kurtosis

    Chowdhury, Joydeep; Dutta, Subhajit; Arellano-Valle, Reinaldo B.; Genton, Marc G. (arXiv, 2021-11-29) [Preprint]
    Mardia's measures of multivariate skewness and kurtosis summarize the respective characteristics of a multivariate distribution with two numbers. However, these measures do not reflect the sub-dimensional features of the distribution. Consequently, testing procedures based on these measures may fail to detect skewness or kurtosis present in a sub-dimension of the multivariate distribution. We introduce sub-dimensional Mardia measures of multivariate skewness and kurtosis, and investigate the information they convey about all sub-dimensional distributions of some symmetric and skewed families of multivariate distributions. The maxima of the sub-dimensional Mardia measures of multivariate skewness and kurtosis are considered, as these reflect the maximum skewness and kurtosis present in the distribution, and also allow us to identify the sub-dimension bearing the highest skewness and kurtosis. Asymptotic distributions of the vectors of sub-dimensional Mardia measures of multivariate skewness and kurtosis are derived, based on which testing procedures for the presence of skewness and of deviation from Gaussian kurtosis are developed. The performances of these tests are compared with some existing tests in the literature on simulated and real datasets.
  • Correcting the Laplace Method with Variational Bayes

    Niekerk, Janet van; Rue, Haavard (arXiv, 2021-11-25) [Preprint]
    Approximate inference methods like the Laplace method, Laplace approximations and variational methods, amongst others, are popular when exact inference is not feasible due to the complexity of the model or the abundance of data. In this paper we propose a hybrid approximate method, namely Low-Rank Variational Bayes correction (VBC), that uses the Laplace method and subsequently a Variational Bayes correction to the posterior mean. The cost is essentially that of the Laplace method, which ensures scalability of the method. We illustrate the method and its advantages with simulated and real data, on small and large scale.
  • FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning

    Gasanov, Elnur; Khaled, Ahmed; Horvath, Samuel; Richtarik, Peter (arXiv, 2021-11-22) [Preprint]
    Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
  • Joint Quantile Disease Mapping for Areal Data

    Alahmadi, Hanan H. (2021-11-16) [Thesis]
    Advisor: Rue, Haavard
    Committee members: Laleg-Kirati, Taous-Meriem; Moraga, Paula; Silva, Giovani
    Statistical analysis based on the quantile method is more comprehensive, more flexible, and less sensitive to outliers than mean-based methods. The study of joint disease mapping has usually focused on mean regression; that is, the correlation or dependence between the means of the diseases is studied using standard regression. However, sometimes one disease limits the occurrence of another disease. In this case, the dependence between the two diseases will not be in the means but in the different quantiles; thus, the analysis will consider a joint disease mapping of a high quantile for one disease with a low quantile of the other disease. In the proposed joint quantile model, the key idea is to link the diseases with different quantiles and estimate their dependence instead of connecting their means. The various components of this formulation are modeled by using the latent Gaussian model, and the parameters are estimated via R-INLA. Finally, we illustrate the model by analyzing the malaria and G6PD deficiency incidences in 21 African countries.
  • Flexible quantile contour estimation for multivariate functional data: Beyond convexity

    Agarwal, Gaurav; Tu, Wei; Sun, Ying; Kong, Linglong (Computational Statistics & Data Analysis, Elsevier BV, 2021-11-16) [Article]
    Nowadays, multivariate functional data are frequently observed in many scientific fields, and the estimation of quantiles of these data is essential in data analysis. Unlike in the univariate setting, quantiles are more challenging to estimate for multivariate data, let alone multivariate functional data. This article proposes a new method to estimate the quantiles for multivariate functional data with application to air pollution data. The proposed multivariate functional quantile model is a nonparametric, time-varying coefficient model, and basis functions are used for the estimation and prediction. The estimated quantile contours can account for non-Gaussian and even nonconvex features of the multivariate distributions marginally, and the estimated multivariate quantile function is a continuous function of time for a fixed quantile level. Computationally, the proposed method is shown to be efficient for both bivariate and trivariate functional data. The monotonicity, uniqueness, and consistency of the estimated multivariate quantile function have been established. The proposed method was demonstrated on bivariate and trivariate functional data in the simulation studies, and was applied to study the joint distribution of and geopotential height over time in the Northeastern United States; the estimated contours highlight the nonconvex features of the joint distribution, and the functional quantile curves capture the dynamic change across time.
  • Ensemble Kalman filtering with colored observation noise

    Raboudi, Naila Mohammed Fathi; Ait-El-Fquih, Boujemaa; Ombao, Hernando; Hoteit, Ibrahim (Quarterly Journal of the Royal Meteorological Society, Wiley, 2021-11-02) [Article]
    The Kalman filter (KF) is derived under the assumption of time-independent (white) observation noise. Although this assumption can be reasonable in many ocean and atmospheric applications, the recent increase in sensor coverage, such as the launching of new constellations of satellites with global spatio-temporal coverage, will provide a high density of oceanic and atmospheric observations that are expected to have time-dependent (colored) error statistics. In this situation, the KF update has been shown to generally provide overconfident probability estimates, which may degrade the filter performance. Different KF-based schemes accounting for time-correlated observation noise were proposed for small systems by modeling the colored noise as a first-order autoregressive model driven by white Gaussian noise. This work introduces new ensemble Kalman filters (EnKFs) that account for colored observational noise for efficient data assimilation into large-scale oceanic and atmospheric applications. More specifically, we follow the standard and the one-step-ahead smoothing formulations of the Bayesian filtering problem with colored observational noise, modeled as an autoregressive model, to derive two (deterministic) EnKFs. We demonstrate the relevance of the colored observational noise-aware EnKFs and analyze their performance through extensive numerical experiments conducted with the Lorenz-96 model.
  • A deep attention-driven model to forecast solar irradiance

    Dairi, Abdelkader; Harrou, Fouzi; Sun, Ying (IEEE, 2021-10-11) [Conference Paper]
    Accurately forecasting solar irradiance is indispensable in optimally managing and designing photovoltaic systems. It enables the efficient integration of photovoltaic systems in the smart grid. This paper introduces an innovative deep attention-driven model for solar irradiance forecasting. Notably, an extended version of the variational autoencoder (VAE) is introduced by amalgamating the desirable characteristics of the bidirectional LSTM (BiLSTM) and attention mechanism with the VAE model. Specifically, the introduced approach enhances the conventional VAE’s ability to model temporal dependencies by incorporating a BiLSTM at the VAE’s encoder side to better extract and learn the temporal dependencies embedded in the solar irradiance measurements. In addition, the self-attention mechanism is embedded in the VAE’s encoder side following the BiLSTM to highlight pertinent features. The performance of the proposed model is evaluated through comparisons with the recurrent neural network (RNN), gated recurrent unit (GRU), LSTM, and BiLSTM. Measurements of solar irradiance in the US and Turkey are used to evaluate the investigated models. Results confirm the superior performance of the proposed model for solar irradiance forecasting over the other models (i.e., RNN, GRU, LSTM, and BiLSTM).
  • A stacked deep learning approach to cyber-attacks detection in industrial systems: application to power system and gas pipeline systems

    Wang, Wu; Harrou, Fouzi; Bouyeddou, Benamar; Senouci, Sidi-Mohammed; Sun, Ying (Cluster Computing, Springer Science and Business Media LLC, 2021-10-05) [Article]
    Presently, Supervisory Control and Data Acquisition (SCADA) systems are broadly adopted for remotely monitoring large-scale production systems and modern power grids. However, SCADA systems are continuously exposed to various heterogeneous cyberattacks, making the detection task using conventional intrusion detection systems (IDSs) very challenging. Furthermore, conventional security solutions, such as firewalls and antivirus software, are not appropriate for fully protecting SCADA systems because they have distinct specifications. Thus, accurately detecting cyber-attacks in critical SCADA systems is indispensable to enhance their resilience, ensure safe operations, and avoid costly maintenance. The overarching goal of this paper is to detect malicious intrusions that have already bypassed traditional IDSs and firewalls. In this paper, a stacked deep learning method is introduced to identify malicious attacks targeting SCADA systems. Specifically, we investigate the feasibility of a deep learning approach for intrusion detection in SCADA systems. Real data sets from two laboratory-scale SCADA systems, a two-line three-bus power transmission system and a gas pipeline, are used to evaluate the proposed method’s performance. The results of this investigation show the satisfactory detection performance of the proposed stacked deep learning approach. This study also shows that the proposed approach outperforms the standalone deep learning models and state-of-the-art algorithms, including Nearest neighbor, Random forests, Naive Bayes, Adaboost, Support Vector Machine, and oneR. Besides detecting the malicious attacks, we also investigate feature importance in the cyber-attack detection process using the Random Forest procedure, which helps design more parsimonious models.
  • Automatic Human Fall Detection Using Multiple Tri-axial Accelerometers

    Harrou, Fouzi; Zerrouki, Nabil; Dairi, Abdelkader; Sun, Ying; Houacine, Amrane (IEEE, 2021-09-29) [Conference Paper]
    Accurately detecting falls of elderly people at an early stage is vital for providing early alerts and avoiding serious injury. To this end, data from multiple triaxial accelerometers are used to uncover falls based on an unsupervised monitoring procedure. Specifically, this paper introduces a one-class support vector machine (OCSVM) scheme into human fall detection. The main motivation behind the use of OCSVM is that it is a distribution-free learning model and can separate nonlinear features in an unsupervised way, without the need for labeled data. The proposed OCSVM scheme was evaluated on fall detection databases from the University of Rzeszow. Three other promising classification algorithms, Mean shift, Expectation-Maximization, and k-means, were also assessed on the same datasets. Their detection performances were compared with those obtained by the OCSVM algorithm. The results showed that the OCSVM scheme outperformed the other methods.
  • Variance partitioning in spatio-temporal disease mapping models

    Franco-Villoria, M.; Ventrucci, M.; Rue, Haavard (arXiv, 2021-09-27) [Preprint]
    Bayesian disease mapping, although undeniably useful for describing variation in risk over time and space, comes with the hurdle of prior elicitation on hard-to-interpret precision parameters. We introduce a reparametrized version of the popular spatio-temporal interaction models, based on Kronecker product intrinsic Gaussian Markov Random Fields, which we call the variance partitioning (VP) model. The VP model includes a mixing parameter that balances the contribution of the main and interaction effects to the total (generalized) variance and enhances interpretability. The use of a penalized complexity prior on the mixing parameter aids in coding any prior information in an intuitive way. We illustrate the advantages of the VP model on two case studies.
  • Integer-valued autoregressive processes with prespecified marginal and innovation distributions: a novel perspective

    Guerrero, Matheus B.; Barreto-Souza, Wagner; Ombao, Hernando (Stochastic Models, Informa UK Limited, 2021-09-26) [Article]
    Integer-valued autoregressive (INAR) processes are generally defined by specifying the thinning operator and either the innovations or the marginal distributions. The major limitations of such processes include difficulties in deriving the marginal properties and justifying the choice of the thinning operator. To overcome these drawbacks, we propose a novel approach for building an INAR model that offers the flexibility to prespecify both marginal and innovation distributions. Thus, the thinning operator is no longer subjectively selected but is rather a direct consequence of the marginal and innovation distributions specified by the modeler. Novel INAR processes are introduced following this perspective; these processes include a model with geometric marginal and innovation distributions (Geo-INAR) and models with bounded innovations. We explore the Geo-INAR model, which is a natural alternative to the classical Poisson INAR model. The Geo-INAR process has interesting stochastic properties, such as an MA(∞) representation, time reversibility, and closed forms for the hth-order transition probabilities, which provide a natural framework for coherent forecasting. To demonstrate the real-world application of the Geo-INAR model, we analyze a count time series of criminal records in sex offenses using the proposed methodology and compare it with existing INAR and integer-valued generalized autoregressive conditional heteroscedastic models.
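    For readers unfamiliar with thinning operators, the sketch below simulates the classical Poisson INAR(1) model that the abstract names as the baseline, not the Geo-INAR model proposed in the paper. It assumes binomial thinning (each of the current counts survives independently with probability alpha) plus Poisson innovations; the function names and parameter values are hypothetical.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw a Poisson variate by Knuth's multiplication method (small rates)."""
    limit = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def binomial_thinning(count, alpha, rng):
    """Thin `count` units, keeping each one independently with probability alpha."""
    return sum(1 for _ in range(count) if rng.random() < alpha)


def simulate_poisson_inar1(alpha, rate, n, seed=0, burn_in=200):
    """Simulate X_t = (alpha thinned X_{t-1}) + e_t with e_t ~ Poisson(rate)."""
    rng = random.Random(seed)
    x = 0
    path = []
    for t in range(burn_in + n):
        x = binomial_thinning(x, alpha, rng) + poisson_draw(rate, rng)
        if t >= burn_in:  # keep only values after the burn-in period
            path.append(x)
    return path


if __name__ == "__main__":
    path = simulate_poisson_inar1(alpha=0.5, rate=2.0, n=10000)
    print("sample mean:", sum(path) / len(path))
```

    For this classical model, the stationary mean is rate / (1 - alpha), so the sample mean of a long path gives a quick sanity check on the simulation.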
  • Quantification of empirical determinacy: the impact of likelihood weighting on posterior location and spread in Bayesian meta-analysis estimated with JAGS and INLA

    Hunanyan, Sona; Rue, Haavard; Plummer, Martyn; Roos, Małgorzata (arXiv, 2021-09-24) [Preprint]
    The popular Bayesian meta-analysis expressed by the Bayesian normal-normal hierarchical model (NNHM) synthesizes knowledge from several studies and is highly relevant in practice. Moreover, the NNHM is the simplest Bayesian hierarchical model (BHM), which illustrates problems typical in more complex BHMs. Until now, it has been unclear to what extent the data determine the marginal posterior distributions of the parameters in the NNHM. To address this issue, we computed the second derivative of the Bhattacharyya coefficient with respect to the weighted likelihood, and defined the total empirical determinacy (TED), the proportion of the empirical determinacy of location to TED (pEDL), and the proportion of the empirical determinacy of spread to TED (pEDS). We implemented this method in the R package ed4bhm and considered two case studies and one simulation study. We quantified TED, pEDL and pEDS under different modeling conditions such as model parametrization, the primary outcome, and the prior. This clarified to what extent the location and spread of the marginal posterior distributions of the parameters are determined by the data. Although these investigations focused on the Bayesian NNHM, the method proposed is applicable more generally to complex BHMs.
  • Lattice Paths for Persistent Diagrams

    Chung, Moo K.; Ombao, Hernando (Springer International Publishing, 2021-09-21) [Conference Paper]
    Persistent homology has undergone significant development in recent years. However, one outstanding challenge is to build a coherent statistical inference procedure on persistent diagrams. In this paper, we first present a new lattice path representation for persistent diagrams. We then develop a new exact statistical inference procedure for lattice paths via combinatorial enumerations. The lattice path method is applied to the topological characterization of the protein structures of the COVID-19 virus. We demonstrate that there are topological changes during the conformational change of spike proteins.
  • Latent group detection in functional partially linear regression models

    Wang, Huixia Judy; Sun, Ying; Wang, Huixia Judy (Biometrics, Wiley, 2021-09-14) [Article]
    In this paper, we propose a functional partially linear regression model with latent group structures to accommodate the heterogeneous relationship between a scalar response and functional covariates. The proposed model is motivated by a salinity tolerance study of barley families, whose main objective is to detect salinity tolerant barley plants. Our model is flexible, allowing for heterogeneous functional coefficients while being efficient by pooling information within a group for estimation. We develop an algorithm in the spirit of the K-means clustering to identify latent groups of the subjects under study. We establish the consistency of the proposed estimator, derive the convergence rate and the asymptotic distribution, and develop inference procedures. We show by simulation studies that the proposed method has higher accuracy for recovering latent groups and for estimating the functional coefficients than existing methods. The analysis of the barley data shows that the proposed method can help identify groups of barley families with different salinity tolerant abilities.
