Statistics Program
For more information visit: https://stat.kaust.edu.sa/
Recent Submissions
-
The Bayesian Learning Rule(Accepted by Journal of Machine Learning Research, 2023-09-21) [Article]We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
-
Artificial Intelligence Techniques for Solar Irradiance and PV Modeling and Forecasting(Energies, MDPI AG, 2023-09-21) [Article]
-
Past, Present, and Future of Software for Bayesian Inference(Accepted by Statistical Science, 2023-09-19) [Article]Software tools for Bayesian inference have undergone rapid evolution in the past three decades, following popularisation of the first generation MCMC-sampler implementations. More recently, exponential growth in the number of users has been stimulated both by the active development of new packages by the machine learning community and popularity of specialist software for particular applications. This review aims to summarize the most popular software and provide a useful map for a reader to navigate the world of Bayesian computation. We anticipate a vigorous continued development of algorithms and corresponding software in multiple research fields, such as probabilistic programming, likelihood-free inference, and Bayesian neural networks, which will further broaden the possibilities for employing the Bayesian paradigm in exciting applications.
-
Sex differences in mortality among children, adolescents, and young people aged 0-24 years: a systematic assessment of national, regional, and global trends from 1990 to 2021.(The Lancet. Global health, Elsevier BV, 2023-09-19) [Article]Background: Differences in mortality exist between sexes because of biological, genetic, and social factors. Sex differentials are well documented in children younger than 5 years but have not been systematically examined for ages 5–24 years. We aimed to estimate the sex ratio of mortality from birth to age 24 years and reconstruct trends in sex-specific mortality between 1990 and 2021 for 200 countries, major regions, and the world. Methods: We compiled comprehensive databases on the mortality sex ratio (ratio of male to female mortality rates) for individuals aged 0–4 years, 5–14 years, and 15–24 years. The databases contain mortality rates from death registration systems, full birth and sibling histories from surveys, and reports on household deaths in censuses. We modelled the sex ratio of age-specific mortality as a function of the mortality in both sexes using Bayesian hierarchical time-series models. We report the levels and trends of sex ratios and estimate the expected female mortality and excess female mortality rates (the difference between the estimated female mortality and the expected female mortality) to identify countries with outlying sex ratios. Findings: Globally, the mortality sex ratio was 1·13 (ie, boys were more likely to die than girls of the same age) for ages 0–4 years (90% uncertainty interval 1·11 to 1·15) in 2021. This ratio increased with age to 1·16 (1·12 to 1·20) for 5–14 years, reaching 1·65 for 15–24 years (1·52 to 1·75). In all age groups, the global sex ratio of mortality increased between 1990 and 2021, driven by faster declines in female mortality. In 2021, the probability of a newborn male reaching age 25 years was 94·1% (93·7 to 94·4), compared with 95·1% for a newborn female (94·7 to 95·3). 
We found a disadvantage of females versus males (compared with countries with similar total mortality) in 2021 in five countries for ages 0–4 years (Algeria, Bangladesh, Egypt, India, and Iran), one country (Suriname) for ages 5–14 years, and 13 countries for ages 15–24 years (including Bangladesh and India). We found the reverse pattern (disadvantage of males vs females compared with countries of similar total mortality) in one country in ages 0–4 years (Vietnam) and eight countries in ages 15–24 years (including Brazil and Mexico). Globally, the number of excess female deaths from birth to age 24 years was 86 563 (–6059 to 164 000) in 2021, down from 544 636 (453 982 to 633 265) in 1990. Interpretation: The global sex ratio of mortality for all age groups in the first 25 years of life increased between 1990 and 2021. Targeted interventions should focus on countries with outlying sex ratios of mortality to reduce disparities due to discrimination in health care, nutrition, and violence.
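The two quantities defined in the abstract above, the mortality sex ratio and the excess female mortality, reduce to simple arithmetic once the rates are known. The sketch below illustrates only that arithmetic. All rates are hypothetical, and deriving the expected female rate by dividing the male rate by an expected sex ratio is a simplifying assumption made here; the study itself obtains expected values from Bayesian hierarchical time-series models.

```python
def mortality_sex_ratio(male_rate, female_rate):
    """Ratio of male to female mortality rates (values above 1 mean higher male mortality)."""
    return male_rate / female_rate


def expected_female_rate(male_rate, expected_ratio):
    """Female rate implied by an expected sex ratio (simplifying assumption, not the study's model)."""
    return male_rate / expected_ratio


def excess_female_mortality(estimated_female_rate, expected_female):
    """Difference between the estimated and the expected female mortality rates."""
    return estimated_female_rate - expected_female


# Hypothetical deaths per 1000 in one country and age group.
male, female = 11.3, 10.6
ratio = mortality_sex_ratio(male, female)
expected = expected_female_rate(male, expected_ratio=1.13)
excess = excess_female_mortality(female, expected)
print(f"sex ratio {ratio:.3f}, expected female rate {expected:.2f}, excess {excess:.2f}")
```

A positive excess flags higher female mortality than expected for the given male mortality, which is the pattern the abstract describes as a female disadvantage.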
-
A marginalized two-part joint model for a longitudinal biomarker and a terminal event with application to advanced head and neck cancers(Pharmaceutical Statistics, Wiley, 2023-09-17) [Article]The sum of the longest diameter (SLD) of the target lesions is a longitudinal biomarker used to assess tumor response in cancer clinical trials, which can inform about early treatment effect. This biomarker is semicontinuous, often characterized by an excess of zeros and right skewness. Conditional two-part joint models were introduced to account for the excess of zeros in the longitudinal biomarker distribution and link it to a time-to-event outcome. A limitation of the conditional two-part model is that it only provides an effect of covariates, such as treatment, on the conditional mean of positive biomarker values, and not an overall effect on the biomarker, which is often of clinical relevance. As an alternative, we propose in this article a marginalized two-part joint model (M-TPJM) for the repeated measurements of the SLD and a terminal event, where the covariates affect the overall mean of the biomarker. Our simulation studies show good performance of the marginalized model in terms of estimation and coverage rates. Our application of the M-TPJM to a randomized clinical trial of advanced head and neck cancer shows that the combination of panitumumab with chemotherapy increases the odds of observing a disappearance of all target lesions compared to chemotherapy alone, leading to a possible indirect effect of the combined treatment on time to death.
-
On adaptive kernel intensity estimation on linear networks(arXiv, 2023-09-17) [Preprint]In the analysis of spatial point patterns on linear networks, a critical statistical objective is estimating the first-order intensity function, representing the expected number of points within specific subsets of the network. Typically, non-parametric approaches employing heat kernels are used for this estimation. However, a significant challenge arises in selecting appropriate bandwidths before conducting the estimation. We study an intensity estimation mechanism that overcomes this limitation using adaptive estimators, where bandwidths adapt to the data points in the pattern. While adaptive estimators have been explored in other contexts, their application in linear networks remains underexplored. We investigate the adaptive intensity estimator within the linear network context and extend a partitioning technique based on bandwidth quantiles to expedite the estimation process significantly. Through simulations, we demonstrate the efficacy of this technique, showing that the partition estimator closely approximates the direct estimator while drastically reducing computation time. As a practical application, we employ our method to estimate the intensity of traffic accidents in a neighbourhood in Medellin, Colombia, showcasing its real-world relevance and efficiency.
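As a rough illustration of data-adaptive bandwidths and of grouping them by quantiles, the sketch below works on the real line rather than on a linear network. Several choices are assumptions made here, not details taken from the preprint: a Gaussian kernel stands in for the network kernel, the bandwidths follow Abramson's square-root rule (a common choice for adaptive estimators), and the partition step is a simplified reading in which each bandwidth is replaced by the mean of its equal-count group.

```python
import math


def gaussian_kernel(u, h):
    """Gaussian density with standard deviation h, evaluated at u."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def fixed_intensity(x, points, h):
    """Fixed-bandwidth kernel intensity estimate at x."""
    return sum(gaussian_kernel(x - p, h) for p in points)


def adaptive_bandwidths(points, h0):
    """Abramson-type rule: h_i = h0 * sqrt(geometric mean of pilot / pilot(x_i))."""
    pilot = [fixed_intensity(p, points, h0) for p in points]
    geo_mean = math.exp(sum(math.log(v) for v in pilot) / len(pilot))
    return [h0 * math.sqrt(geo_mean / v) for v in pilot]


def partition_bandwidths(bandwidths, n_groups):
    """Replace each bandwidth by the mean of its equal-count (quantile) group."""
    order = sorted(range(len(bandwidths)), key=lambda i: bandwidths[i])
    binned = [0.0] * len(bandwidths)
    size = math.ceil(len(order) / n_groups)
    for start in range(0, len(order), size):
        group = order[start:start + size]
        mean_h = sum(bandwidths[i] for i in group) / len(group)
        for i in group:
            binned[i] = mean_h
    return binned


def adaptive_intensity(x, points, bandwidths):
    """Adaptive estimate: each data point contributes with its own bandwidth."""
    return sum(gaussian_kernel(x - p, h) for p, h in zip(points, bandwidths))


# Hypothetical pattern: three clustered points and one isolated point.
pts = [0.0, 0.1, 0.2, 5.0]
bw = adaptive_bandwidths(pts, h0=0.5)
bw_binned = partition_bandwidths(bw, n_groups=2)
print(bw, bw_binned, adaptive_intensity(0.1, pts, bw_binned))
```

Points in dense regions receive small bandwidths and isolated points receive large ones. After partitioning, only a few distinct bandwidths remain, so a fixed-bandwidth estimator could be evaluated once per group instead of once per point.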
-
Insights into the drivers and spatio-temporal trends of extreme Mediterranean wildfires with statistical deep-learning(Artificial Intelligence for the Earth Systems, American Meteorological Society, 2023-09-13) [Article]Extreme wildfires continue to be a significant cause of human death and biodiversity destruction within countries that encompass the Mediterranean Basin. Recent worrying trends in wildfire activity (i.e., occurrence and spread) suggest that wildfires are likely to be highly impacted by climate change. In order to facilitate appropriate risk mitigation, it is imperative to identify the main drivers of extreme wildfires and assess their spatio-temporal trends, with a view to understanding the impacts of the changing climate on fire activity. To this end, we analyse the monthly burnt area due to wildfires over a region encompassing most of Europe and the Mediterranean Basin from 2001 to 2020, and identify high fire activity during this period in eastern Europe, Algeria, Italy and Portugal. We build an extreme quantile regression model with a high-dimensional predictor set describing meteorological conditions, land cover usage, and orography, for the domain. To model the complex relationships between the predictor variables and wildfires, we make use of a hybrid statistical deep-learning framework that allows us to disentangle the effects of vapour-pressure deficit (VPD), air temperature, and drought on wildfire activity. Our results highlight that whilst VPD, air temperature, and drought significantly affect wildfire occurrence, only VPD affects wildfire spread. Furthermore, to gain insights into the effect of climate trends on wildfires in the near future, we focus on the extreme wildfires in August 2001 and perturb VPD and temperature according to their observed trends. 
We find that, on average over Europe, trends in temperature (median over Europe: +0.04K per year) lead to a relative increase of 17.1% and 1.6% in the expected frequency and severity, respectively, of wildfires in August 2001; similar analyses using VPD (median over Europe: +4.82Pa per year) give respective increases of 1.2% and 3.6%. Our analysis finds evidence suggesting that global warming can lead to spatially non-uniform changes in wildfire activity.
-
Joint modelling of landslide counts and sizes using spatial marked point processes with sub-asymptotic mark distributions(JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS, Oxford University Press (OUP), 2023-09-13) [Article]To accurately quantify landslide hazard in a region of Turkey, we develop new marked point-process models within a Bayesian hierarchical framework for the joint prediction of landslide counts and sizes. We leverage mark distributions justified by extreme-value theory, and specifically propose ‘sub-asymptotic’ distributions to flexibly model landslide sizes from low to high quantiles. The use of intrinsic conditional autoregressive priors, and a customised adaptive Markov chain Monte Carlo algorithm, allow for fast fully Bayesian inference. We show that sub-asymptotic mark distributions provide improved predictions of large landslide sizes, and use our model for risk assessment and hazard mapping.
-
Cybersecurity of photovoltaic systems: challenges, threats, and mitigation strategies: a short survey(Frontiers in Energy Research, Frontiers Media SA, 2023-09-12) [Article]Photovoltaic (PV) systems, as critical components of the power grid, have become increasingly reliant on standard Information Technology (IT) computing and network infrastructure for their operation and maintenance. However, this dependency exposes PV systems to heightened vulnerabilities and the risk of cyber-attacks. In recent times, the number of reported cyber-attacks targeting PV systems has increased significantly. This paper provides an overview of the cybersecurity challenges faced by PV systems, emphasizing their susceptibility to anomalies and cyber threats. It highlights the urgency of implementing robust cybersecurity measures to protect the integrity and reliability of PV installations. By understanding and addressing these challenges, stakeholders can ensure the resilience and secure integration of PV systems within the power grid infrastructure.
-
Parallel Selected Inversion for Space-Time Gaussian Markov Random Fields(arXiv, 2023-09-11) [Preprint]Performing Bayesian inference on large spatio-temporal models requires extracting inverse elements of large sparse precision matrices for marginal variances. Although direct matrix factorizations can be used for the inversion, such methods fail to scale well for distributed problems when run on large computing clusters. In contrast, Krylov subspace methods for the selected inversion have been gaining traction. We propose a parallel hybrid approach based on domain decomposition, which extends the Rao-Blackwellized Monte Carlo estimator for distributed precision matrices. Our approach exploits the strength of Krylov subspace methods as global solvers and the efficiency of direct factorizations as base case solvers to compute the marginal variances using a divide-and-conquer strategy. By introducing subdomain overlaps, one can achieve greater accuracy at an increased computational effort with little to no additional communication. We demonstrate the speed improvements on both simulated models and a massive US daily temperature dataset.
-
Stationary nonseparable space-time covariance functions on networks(JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, Oxford University Press (OUP), 2023-09-08) [Article]The advent of data science has provided an increasing number of challenges with high data complexity. This paper addresses the challenge of space-time data where the spatial domain is not a planar surface, a sphere, or a linear network, but a generalised network (termed a graph with Euclidean edges). Additionally, data are repeatedly measured over different temporal instants. We provide new classes of stationary nonseparable space-time covariance functions where space can be a generalised network, a Euclidean tree, or a linear network, and where time can be linear or circular (seasonal). Because the construction principles are technical, we focus on illustrations that guide the reader through the construction of statistically interpretable examples. A simulation study demonstrates that the correct model can be recovered when compared to misspecified models. In addition, our simulation studies show that we effectively recover simulation parameters. In our data analysis, we consider a traffic accident dataset that shows improved model performance based on covariance specifications and network-based metrics.
-
The Global, Regional, and National Burden of Adult Lip, Oral, and Pharyngeal Cancer in 204 Countries and Territories(JAMA Oncology, American Medical Association (AMA), 2023-09-07) [Article]Importance: Lip, oral, and pharyngeal cancers are important contributors to cancer burden worldwide, and a comprehensive evaluation of their burden globally, regionally, and nationally is crucial for effective policy planning. Objective: To analyze the total and risk-attributable burden of lip and oral cavity cancer (LOC) and other pharyngeal cancer (OPC) for 204 countries and territories and by Socio-demographic Index (SDI) using 2019 Global Burden of Diseases, Injuries, and Risk Factors (GBD) Study estimates. Evidence Review: The incidence, mortality, and disability-adjusted life years (DALYs) due to LOC and OPC from 1990 to 2019 were estimated using GBD 2019 methods. The GBD 2019 comparative risk assessment framework was used to estimate the proportion of deaths and DALYs for LOC and OPC attributable to smoking, tobacco, and alcohol consumption in 2019. Findings: In 2019, 370 000 (95% uncertainty interval [UI], 338 000-401 000) cases and 199 000 (95% UI, 181 000-217 000) deaths for LOC and 167 000 (95% UI, 153 000-180 000) cases and 114 000 (95% UI, 103 000-126 000) deaths for OPC were estimated to occur globally, contributing 5.5 million (95% UI, 5.0-6.0 million) and 3.2 million (95% UI, 2.9-3.6 million) DALYs, respectively. From 1990 to 2019, low-middle and low SDI regions consistently showed the highest age-standardized mortality rates due to LOC and OPC, while the high SDI strata exhibited age-standardized incidence rates decreasing for LOC and increasing for OPC. Globally in 2019, smoking had the greatest contribution to risk-attributable OPC deaths for both sexes (55.8% [95% UI, 49.2%-62.0%] of all OPC deaths in male individuals and 17.4% [95% UI, 13.8%-21.2%] of all OPC deaths in female individuals). 
Smoking and alcohol both contributed to substantial LOC deaths globally among male individuals (42.3% [95% UI, 35.2%-48.6%] and 40.2% [95% UI, 33.3%-46.8%] of all risk-attributable cancer deaths, respectively), while chewing tobacco contributed to the greatest attributable LOC deaths among female individuals (27.6% [95% UI, 21.5%-33.8%]), driven by high risk-attributable burden in South and Southeast Asia. Conclusions and Relevance: In this systematic analysis, disparities in LOC and OPC burden existed across the SDI spectrum, and a considerable percentage of burden was attributable to tobacco and alcohol use. These estimates can contribute to an understanding of the distribution and disparities in LOC and OPC burden globally and support cancer control planning efforts.
-
Bayesian Inference for Multivariate Spatial Models with R-INLA(Accepted by The R Journal, arXiv, 2023-09-06) [Article]Bayesian methods and software for spatial data analysis are generally well established now in the scientific community. Despite the wide application of spatial models, the analysis of multivariate spatial data using the integrated nested Laplace approximation through its R package (R-INLA) has not been widely described in the existing literature. Therefore, the main objective of this article is to demonstrate that R-INLA is a convenient toolbox to analyse different types of multivariate spatial datasets. This will be illustrated by analysing three datasets which are publicly available. Furthermore, the details and the R code of these analyses are provided to exemplify how to fit models to multivariate spatial datasets with R-INLA.
-
Joint spatial modeling of the risks of co-circulating mosquito-borne diseases in Ceará, Brazil(Spatial and Spatio-temporal Epidemiology, Elsevier BV, 2023-09-06) [Article]Mosquito-borne diseases such as dengue and chikungunya have been co-circulating in the Americas, causing great damage to the population. In 2021, for instance, almost 1.5 million cases were reported on the continent, with Brazil accounting for most of them. Even though they are transmitted by the same mosquito, it remains unclear whether there exists a relationship between both diseases. In this paper, we model the geographic distributions of dengue and chikungunya over the years 2016 to 2021 in the Brazilian state of Ceará. We use a Bayesian hierarchical spatial model for the joint analysis of two arboviruses that includes spatial covariates as well as specific and shared spatial effects that take into account the potential autocorrelation between the two diseases. Our findings allow us to identify areas with high risk of one or both diseases. Only 7% of the areas present high relative risk for both diseases, which suggests a competition between viruses. This study advances the understanding of the geographic patterns and the identification of risk factors of dengue and chikungunya, and can help inform health decision-making.
-
Spatial data fusion adjusting for preferential sampling using INLA and SPDE(arXiv, 2023-09-06) [Preprint]Spatially misaligned data can be fused by using a Bayesian melding model that assumes that underlying all observations there is a spatially continuous Gaussian random field process. This model can be used, for example, to predict air pollution levels by combining point data from monitoring stations and areal data from satellite imagery. However, if the data presents preferential sampling, that is, if the observed point locations are not independent of the underlying spatial process, the inference obtained from models that ignore such a dependence structure might not be valid. In this paper, we present a Bayesian spatial model for the fusion of point and areal data that takes into account preferential sampling. The model combines the Bayesian melding specification and a model for the stochastically dependent sampling and underlying spatial processes. Fast Bayesian inference is performed using the integrated nested Laplace approximation (INLA) and the stochastic partial differential equation (SPDE) approaches. The performance of the model is assessed using simulated data in a range of scenarios and sampling strategies that can appear in real settings. The model is also applied to predict air pollution in the USA.
-
Quantification of Empirical Determinacy: The Impact of Likelihood Weighting on Posterior Location and Spread in Bayesian Meta-Analysis Estimated with JAGS and INLA(BAYESIAN ANALYSIS, Institute of Mathematical Statistics, 2023-09) [Article]The popular Bayesian meta-analysis expressed by the normal-normal hierarchical model synthesizes knowledge from several studies and is highly relevant in practice. The normal-normal hierarchical model is the simplest Bayesian hierarchical model, but illustrates problems typical in more complex Bayesian hierarchical models. Until now, it has been unclear to what extent the data determines the marginal posterior distributions of the parameters in the normal-normal hierarchical model. To address this issue we computed the second derivative of the Bhattacharyya coefficient with respect to the weighted likelihood. This quantity, which we define as the total empirical determinacy (TED), can be written as the sum of two terms: the empirical determinacy of location (EDL), and the empirical determinacy of spread (EDS). We implemented this method in the R package ed4bhm and considered two case studies and one simulation study. We quantified TED, EDL and EDS under different modeling conditions such as model parametrization, the primary outcome, and the prior. This clarifies to what extent the location and spread of the marginal posterior distributions of the parameters are determined by the data. Although these investigations focused on the Bayesian normal-normal hierarchical model, the method proposed is applicable more generally to complex Bayesian hierarchical models.
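To make the idea of likelihood weighting concrete, the sketch below shows how the posterior location and spread of the overall mean respond to a weight w in a simplified normal-normal model. It is not the TED, EDL, or EDS computation and does not use the ed4bhm package. The assumptions made here are: the between-study standard deviation tau is held fixed, the overall mean has a normal prior, study effects are integrated out so that y_i ~ N(mu, s_i^2 + tau^2), and weighting means raising the likelihood to the power w. The data values are hypothetical.

```python
def weighted_posterior_mu(y, s, tau, w, prior_mean=0.0, prior_var=100.0):
    """Posterior mean and variance of mu when the likelihood is raised to the power w.

    Marginal model: y_i ~ N(mu, s_i^2 + tau^2); prior: mu ~ N(prior_mean, prior_var).
    Raising a normal likelihood to the power w multiplies its precision by w.
    """
    precisions = [1.0 / (si ** 2 + tau ** 2) for si in s]
    post_precision = 1.0 / prior_var + w * sum(precisions)
    weighted_sum = w * sum(p * yi for p, yi in zip(precisions, y))
    post_mean = (prior_mean / prior_var + weighted_sum) / post_precision
    return post_mean, 1.0 / post_precision


# Hypothetical study estimates and standard errors.
y_obs = [0.2, 0.5, 0.8]
s_obs = [0.1, 0.2, 0.1]
for w in (0.0, 0.5, 1.0, 2.0):
    mean, var = weighted_posterior_mu(y_obs, s_obs, tau=0.3, w=w)
    print(f"w={w}: posterior mean {mean:.3f}, posterior variance {var:.4f}")
```

With w = 0 the posterior equals the prior, and as w grows both the location and the spread are increasingly driven by the data, which is the kind of sensitivity the abstract sets out to quantify.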
-
Deep graphical regression for jointly moderate and extreme Australian wildfires(arXiv, 2023-08-28) [Preprint]Recent wildfires in Australia have led to considerable economic loss and property destruction, and there is increasing concern that climate change may exacerbate their intensity, duration, and frequency. Hazard quantification for extreme wildfires is an important component of wildfire management, as it facilitates efficient resource distribution, adverse effect mitigation, and recovery efforts. However, although extreme wildfires are typically the most impactful, both small and moderate fires can still be devastating to local communities and ecosystems. Therefore, it is imperative to develop robust statistical methods to reliably model the full distribution of wildfire spread. We do so for a novel dataset of Australian wildfires from 1999 to 2019, and analyse monthly spread over areas approximately corresponding to Statistical Areas Level 1 and 2 (SA1/SA2) regions. Given the complex nature of wildfire ignition and spread, we exploit recent advances in statistical deep learning and extreme value theory to construct a parametric regression model using graph convolutional neural networks and the extended generalized Pareto distribution, which allows us to model wildfire spread observed on an irregular spatial domain. We highlight the efficacy of our newly proposed model and perform a wildfire hazard assessment for Australia and population-dense communities, namely Tasmania, Sydney, Melbourne, and Perth.
-
Which Parameterization of the Matérn Covariance Function?(arXiv, 2023-08-28) [Preprint]The Matérn family of covariance functions is currently the most popularly used model in spatial statistics, geostatistics, and machine learning to specify the correlation between two geographical locations based on spatial distance. Compared to existing covariance functions, the Matérn family has more flexibility in data fitting because it allows the control of the field smoothness through a dedicated parameter. Moreover, it generalizes other popular covariance functions. However, fitting the smoothness parameter is computationally challenging since it complicates the optimization process. As a result, some practitioners set the smoothness parameter at an arbitrary value to reduce the optimization convergence time. In the literature, studies have used various parameterizations of the Matérn covariance function, assuming they are equivalent. This work aims at studying the effectiveness of different parameterizations under various settings. We demonstrate the feasibility of inferring all parameters simultaneously and quantifying their uncertainties on large-scale data using the ExaGeoStat parallel software. We also highlight the importance of the smoothness parameter by analyzing the Fisher information of the statistical parameters. We show that the various parameterizations have different properties and differ from several perspectives. In particular, we study the three most popular parameterizations in terms of parameter estimation accuracy, modeling accuracy and efficiency, prediction efficiency, uncertainty quantification, and asymptotic properties. We further demonstrate their differing performances under nugget effects and approximated covariance. Lastly, we give recommendations for parameterization selection based on our experimental results.
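As a small illustration of why different parameterizations describe the same family yet behave differently in estimation, the sketch below evaluates the Matérn covariance at smoothness 3/2, where it has a closed form, under two common conventions. Which parameterizations the preprint compares is not stated in the abstract, so the two forms below (a range parameter rho with the sqrt(2 nu) scaling, and a plain scale parameter beta) are assumptions chosen for illustration, as are all function names.

```python
import math


def matern32_range(d, sigma2, rho):
    """Matern covariance with nu = 3/2, range form: argument sqrt(2*nu) * d / rho."""
    x = math.sqrt(3.0) * d / rho
    return sigma2 * (1.0 + x) * math.exp(-x)


def matern32_scale(d, sigma2, beta):
    """Matern covariance with nu = 3/2, scale form: argument d / beta."""
    x = d / beta
    return sigma2 * (1.0 + x) * math.exp(-x)


# The two forms coincide when beta = rho / sqrt(2 * nu), here rho / sqrt(3).
rho = 1.5
beta = rho / math.sqrt(3.0)
for d in (0.0, 0.3, 0.7, 2.0):
    print(d, matern32_range(d, 2.0, rho), matern32_scale(d, 2.0, beta))
```

The covariances agree once the parameters are mapped onto each other, but the mapping depends on the smoothness, so when the smoothness is estimated jointly the same data produce different numerical values and different uncertainty for the length parameter under each convention.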
-
A flexible Bayesian tool for CoDa mixed models: logistic-normal distribution with Dirichlet covariance(arXiv, 2023-08-26) [Preprint]Compositional Data Analysis (CoDa) has gained popularity in recent years. This type of data consists of values from disjoint categories that sum up to a constant. Both Dirichlet regression and logistic-normal regression have become popular as CoDa analysis methods. However, fitting this kind of multivariate model presents challenges, especially when structured random effects are included in the model, such as temporal or spatial effects. To overcome these challenges, we propose the logistic-normal Dirichlet Model (LNDM). We seamlessly incorporate this approach into the R-INLA package, facilitating model fitting and model prediction within the framework of Latent Gaussian Models (LGMs). Moreover, we explore metrics like the deviance information criterion (DIC), the Watanabe-Akaike information criterion (WAIC), and the cross-validation measure conditional predictive ordinate (CPO) for model selection in R-INLA for CoDa. Illustrating LNDM through a simple simulated example and with an ecological case study on Arabidopsis thaliana in the Iberian Peninsula, we underscore its potential as an effective tool for managing CoDa and large CoDa databases.