For more information visit: https://stat.kaust.edu.sa/

Recent Submissions

  • Smart Gradient -- An Adaptive Technique for Improving Gradient Estimation

    Fattah, Esmail Abdul; Niekerk, Janet Van; Rue, Haavard (arXiv, 2021-06-14) [Preprint]
    Computing the gradient of a function provides fundamental information about its behavior. This information is essential for several applications and algorithms across various fields. One common application that requires gradients is optimization, with techniques such as stochastic gradient descent, Newton's method and trust region methods. However, these methods usually require a numerical computation of the gradient at every iteration, which is prone to numerical errors. We propose a simple limited-memory technique for improving the accuracy of a numerically computed gradient in this gradient-based optimization framework by exploiting (1) a coordinate transformation of the gradient and (2) the history of previously taken descent directions. The method is verified empirically by extensive experimentation on both test functions and on real data applications. The proposed method is implemented in the R package smartGrad and in C++.
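As a rough illustration of the two ingredients named in this abstract, the sketch below differences a function along an orthonormal basis built from past descent directions and then maps the result back to the original coordinates. It is not the smartGrad implementation: the function names `directional_gradient` and `basis_from_history`, the QR-based orthonormalization, and the step size `h` are all assumptions made for this example.

```python
import numpy as np

def directional_gradient(f, x, basis=None, h=1e-5):
    """Central-difference gradient of f at x, differenced along the columns of `basis`."""
    x = np.asarray(x, dtype=float)
    n = x.size
    G = np.eye(n) if basis is None else np.asarray(basis, dtype=float)
    # Directional derivative along each column g_i of G: d_i = g_i . grad f(x)
    d = np.array([(f(x + h * G[:, i]) - f(x - h * G[:, i])) / (2.0 * h)
                  for i in range(n)])
    # d = G^T grad, so solve G^T grad = d to return to the original coordinates
    return np.linalg.solve(G.T, d)

def basis_from_history(directions, n):
    """Orthonormal basis whose leading columns span the given past directions (hypothetical helper)."""
    cols = [np.asarray(v, dtype=float) for v in directions] + list(np.eye(n))
    # Padding with the identity guarantees full rank; QR orthonormalizes the columns
    Q, _ = np.linalg.qr(np.column_stack(cols))
    return Q

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1] + np.sin(x[1])

x0 = np.array([1.0, 2.0])
exact = np.array([2.0 * x0[0] + 3.0 * x0[1], 3.0 * x0[0] + np.cos(x0[1])])
G = basis_from_history([[1.0, 1.0]], n=2)   # one remembered descent direction
plain = directional_gradient(f, x0)          # standard coordinate axes
rotated = directional_gradient(f, x0, basis=G)
```

Both `plain` and `rotated` approximate `exact`; whether the transformed basis actually reduces the numerical error depends on the function and the direction history, which is the question the preprint studies.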
  • Lagrangian Spatio-Temporal Covariance Functions for Multivariate Nonstationary Random Fields

    Salvaña, Mary Lai O. (2021-06-14) [Thesis]
    Advisor: Genton, Marc G.
    Committee members: Ombao, Hernando; Sang, Huiyan; Stenchikov, Georgiy L.
    In geostatistical analysis, we are faced with the formidable challenge of specifying a valid spatio-temporal covariance function, either directly or through the construction of processes. This task is difficult as these functions should yield positive definite covariance matrices. In recent years, we have seen a flourishing of methods and theories on constructing spatio-temporal covariance functions satisfying the positive definiteness requirement. The current state-of-the-art models for environmental processes are those that embed the associated physical laws of the system. The class of Lagrangian spatio-temporal covariance functions fulfills this requirement. Moreover, this class has the appeal that it turns already established purely spatial covariance functions into spatio-temporal covariance functions by a direct application of the concept of a Lagrangian reference frame. In the three main chapters that comprise this dissertation, several developments are proposed and new features are provided to this special class. First, the application of the Lagrangian reference frame on transported purely spatial random fields with second-order nonstationarity is explored, an appropriate estimation methodology is proposed, and the consequences of model misspecification are tackled. Furthermore, the new Lagrangian models and the new estimation technique are used to analyze particulate matter concentrations over Saudi Arabia. Second, a multivariate version of the Lagrangian framework is established, catering to both second-order stationary and nonstationary spatio-temporal random fields. The capabilities of the Lagrangian spatio-temporal cross-covariance functions are demonstrated on a bivariate reanalysis climate model output dataset previously analyzed using purely spatial covariance functions.
Lastly, the class of Lagrangian spatio-temporal cross-covariance functions with multiple transport behaviors is presented, its properties are explored, and its use is demonstrated on a bivariate pollutant dataset of particulate matter in Saudi Arabia. Moreover, the importance of accounting for multiple transport behaviors is discussed and validated via numerical experiments. Together, these three extensions to the Lagrangian framework make it a more viable geostatistical approach in modeling realistic transport scenarios.
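For readers unfamiliar with the construction, the standard univariate, stationary form of a Lagrangian spatio-temporal covariance can be written as below. The symbols are assumptions chosen for this note rather than the thesis's own notation: $C_S$ is a valid purely spatial covariance, $\mathbf{h}$ a spatial lag, $u$ a temporal lag, and $\mathbf{V}$ a random transport (advection) velocity.

```latex
C(\mathbf{h}, u) \;=\; \mathrm{E}_{\mathbf{V}}\!\left[\, C_S\!\left(\mathbf{h} - \mathbf{V}u\right) \right]
```

When $\mathbf{V}$ is a fixed vector $\mathbf{v}$, this reduces to the frozen-field model $C(\mathbf{h}, u) = C_S(\mathbf{h} - \mathbf{v}u)$. The nonstationary, multivariate, and multiple-transport extensions described in the abstract generalize this basic form.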
  • Copula-based multiple indicator kriging for non-Gaussian random fields

    Agarwal, Gaurav; Sun, Ying; Wang, Huixia J. (Spatial Statistics, Elsevier BV, 2021-06-09) [Article]
    In spatial statistics, the kriging predictor is the best linear predictor at unsampled locations, but not the optimal predictor for non-Gaussian processes. In this paper, we introduce a copula-based multiple indicator kriging model for the analysis of non-Gaussian spatial data by thresholding the spatial observations at a given set of quantile values. The proposed copula model allows for flexible marginal distributions while modeling the spatial dependence via copulas. We show that the covariances required by kriging have a direct link to the chosen copula function. We then develop a semiparametric estimation procedure. The proposed method provides the entire predictive distribution function at a new location, and thus allows for both point and interval predictions. The proposed method demonstrates better predictive performance than the commonly used variogram approach and Gaussian kriging in the simulation studies. We illustrate our methods on precipitation data from Spain during November 2019 and on a heavy metal dataset for topsoil along the river Meuse, and obtain probability exceedance maps.
  • Markov-Switching State-Space Models with Applications to Neuroimaging

    Degras, David; Ting, Chee-Ming; Ombao, Hernando (arXiv, 2021-06-09) [Preprint]
    State-space models (SSMs) with Markov switching offer a powerful framework for detecting multiple regimes in time series, analyzing mutual dependence and dynamics within regimes, and asserting transitions between regimes. These models, however, present considerable computational challenges due to the exponential number of possible regime sequences to account for. In addition, high dimensionality of time series can hinder likelihood-based inference. This paper proposes novel statistical methods for Markov-switching SSMs using maximum likelihood estimation, Expectation-Maximization (EM), and parametric bootstrap. We develop solutions for initializing the EM algorithm, accelerating convergence, and conducting inference that are ideally suited to massive spatio-temporal data such as brain signals. We evaluate these methods in simulations and present applications to EEG studies of epilepsy and of motor imagery. All proposed methods are implemented in a MATLAB toolbox available at https://github.com/ddegras/switch-ssm.
  • Flexible Covariance Models for Spatio-Temporal and Multivariate Spatial Random Fields

    Qadir, Ghulam A. (2021-06-06) [Thesis]
    Advisor: Sun, Ying
    Committee members: Alouini, Mohamed-Slim; Ombao, Hernando; Kleiber, William
    The modeling of spatio-temporal and multivariate spatial random fields has been an important and growing area of research due to the increasing availability of space-time-referenced data in a large number of scientific applications. In geostatistics, the covariance function plays a crucial role in describing the spatio-temporal dependence in the data and is key to statistical modeling, inference, stochastic simulation and prediction. Therefore, the development of flexible covariance models, which can accommodate the inherent variability of the real data, is necessary for advantageous modeling of random fields. This thesis is composed of four significant contributions in the development and applications of new covariance models for stationary multivariate spatial processes, and nonstationary spatial and spatio-temporal processes. The first focus of the thesis is on modeling of stationary multivariate spatial random fields through flexible multivariate covariance functions. Chapter 2 proposes a semiparametric approach for multivariate covariance function estimation with flexible specification of the cross-covariance functions via their spectral representations. The proposed method is applied to model and predict the bivariate data of particulate matter concentration (PM2.5) and wind speed (WS) in the United States. Chapter 3 introduces a parametric class of multivariate covariance functions with asymmetric cross-covariance functions. The proposed covariance model is applied to analyze the asymmetry and perform prediction in a trivariate dataset of PM2.5, WS and relative humidity (RH) in the United States. The second focus of the thesis is on nonstationary spatial and spatio-temporal random fields. Chapter 4 presents a space deformation method which imparts nonstationarity to any stationary covariance function. The proposed method utilizes the functional data registration algorithm and classical multidimensional scaling to estimate the spatial deformation.
The application of the proposed method is demonstrated on precipitation data. Finally, Chapter 5 proposes a parametric class of time-varying spatio-temporal covariance functions, which are nonstationary in time. The proposed class is a time-varying generalization of an existing nonseparable stationary class of spatio-temporal covariance functions. The proposed time-varying model is then used to study the seasonality effect and perform space-time predictions in the daily PM2.5 data from Oregon, United States.
  • Monitoring patient flow in a hospital emergency department: ARMA-based nonparametric GLRT scheme

    Harrou, Fouzi; Kadri, Farid; Sun, Ying; Khadraoui, Sofiane (Health Informatics Journal, SAGE Publishing, 2021-06-06) [Article]
    Overcrowding in emergency departments (EDs) is a primary concern for hospital administration, which aims to manage patient demand efficiently and reduce stress in the ED. Detecting abnormal ED demand (patient flows) in hospital systems helps ED managers make appropriate decisions by optimally allocating the available resources according to patient attendance. This paper presents a monitoring strategy that provides an early alert in an ED when an abnormally high patient influx occurs. Anomaly detection using this strategy involves the amalgamation of autoregressive-moving-average (ARMA) time series models with the generalized likelihood ratio (GLR) chart. A nonparametric procedure based on kernel density estimation is employed to determine the detection threshold of the ARMA-GLR chart. The developed ARMA-based GLR has been validated using real data from the ED at Lille Hospital, France. The ARMA-based GLR method’s performance was then compared to that of other commonly used charts, including a Shewhart chart and an exponentially weighted moving average chart; it proved more accurate.
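The nonparametric thresholding step mentioned in this abstract can be sketched as follows: fit a Gaussian kernel density estimate to chart statistics collected under normal conditions and take its upper quantile as the alarm threshold. This is only an illustration under stated assumptions. The abstract does not give the kernel, bandwidth, or false-alarm rate, so the names `kde_cdf` and `kde_threshold`, the normal-reference bandwidth rule, and `alpha=0.05` are hypothetical choices, and the simulated data stand in for real in-control GLR statistics.

```python
import math
import numpy as np

def kde_cdf(t, sample, bw):
    """CDF at t of a Gaussian kernel density estimate with bandwidth bw."""
    z = (t - sample) / bw
    return float(np.mean([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z]))

def kde_threshold(sample, alpha=0.05):
    """Upper (1 - alpha) quantile of the KDE, found by bisection on its CDF."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Normal-reference rule of thumb for the bandwidth (an assumption of this sketch)
    bw = 1.06 * sample.std(ddof=1) * n ** (-1.0 / 5.0)
    lo, hi = sample.min() - 10.0 * bw, sample.max() + 10.0 * bw
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, sample, bw) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi), bw

# Simulated stand-in for chart statistics observed under normal operation
rng = np.random.default_rng(0)
in_control = rng.normal(size=2000)
threshold, bandwidth = kde_threshold(in_control, alpha=0.05)
```

A new statistic exceeding `threshold` would raise an alert; the advantage over a Gaussian-based limit is that no distributional form is assumed for the in-control statistics.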
  • Filtrated Common Functional Principal Components for Multivariate Functional data

    Jiao, Shuhao; Frostig, Ron D.; Ombao, Hernando (arXiv, 2021-06-02) [Preprint]
    Local field potentials (LFPs) are signals that measure electrical activity in localized cortical regions from multiple implanted tetrodes in the human or animal brain. They can be treated as multivariate functional data (i.e., curves observed at many tetrodes spread across a patch on the surface of the cortex). Most multivariate functional data contain both global features (which are common to all curves) as well as isolated features (common only to a small subset of curves). The goal of this paper is to develop a procedure for capturing these common features. We propose a novel tree-structured functional principal component (filt-fPC) model through low-dimensional functional representation, specifically via filtration. A popular approach to dimension reduction of functional data is functional principal components analysis (fPCA). Ordinary fPCA can only capture the major information of one population, but fails to reveal the similarity of variation patterns across different groups, which is potentially related to functional connectivity of the brain. One major advantage of the proposed filt-fPC method is the ability to extract components that are common to multiple groups while preserving the idiosyncratic individual features of different groups, leading to a parsimonious and interpretable low-dimensional representation of multivariate functional data. Another advantage is that the extracted functional principal components satisfy the orthonormal property for each set, making filt-fPC scores easy to obtain. The proposed filt-fPC method was employed to study the impact of a shock (induced stroke) on the functional organization structure of the rat brain. Finally, we point to further directions, as this filtration idea can also be generalized to other functional statistical models, such as functional regression, classification and functional time series models.
  • Modelling short-term precipitation extremes with the blended generalised extreme value distribution

    Vandeskog, Silius M.; Martino, Sara; Castro-Camilo, Daniela; Rue, Haavard (arXiv, 2021-05-19) [Preprint]
    The yearly maxima of short-term precipitation are modelled to produce improved spatial maps of return levels over the south of Norway. The newly proposed blended generalised extreme value (bGEV) distribution is used as a substitute for the more standard generalised extreme value (GEV) distribution in order to simplify inference. Yearly precipitation maxima are modelled using a Bayesian hierarchical model with a latent Gaussian field. Fast inference is performed using the framework of integrated nested Laplace approximations (INLA). Inference is made less wasteful with a two-step procedure that performs separate modelling of the scale parameter of the bGEV distribution using peaks over threshold data. Our model provides good estimates for large return levels of short-term precipitation, and it outperforms standard block maxima models.
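To make the term "return level" concrete, the sketch below computes the T-year return level of the standard GEV distribution, i.e. the quantile exceeded by the yearly maximum with probability 1/T. It uses the usual location-scale-shape parameterization (`mu`, `sigma`, `xi`) and is not the bGEV of the preprint, which modifies the GEV and is parameterized differently; the function name and the numerical values are assumptions for illustration.

```python
import math

def gev_return_level(mu, sigma, xi, period):
    """T-year return level of a GEV(mu, sigma, xi) distribution for yearly maxima."""
    if sigma <= 0.0 or period <= 1.0:
        raise ValueError("need sigma > 0 and period > 1")
    # Solve F(z) = 1 - 1/period for the GEV distribution function F
    y = -math.log(1.0 - 1.0 / period)
    if abs(xi) < 1e-12:
        # Gumbel limit as the shape parameter tends to zero
        return mu - sigma * math.log(y)
    return mu + sigma / xi * (y ** (-xi) - 1.0)

# Hypothetical parameter values, chosen only to exercise the function
level_10 = gev_return_level(mu=10.0, sigma=2.0, xi=0.1, period=10.0)
level_100 = gev_return_level(mu=10.0, sigma=2.0, xi=0.1, period=100.0)
```

With a positive shape parameter, the return level grows without bound as the period increases, which is why estimates for large return periods are sensitive to how the tail is modelled.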
  • Efficiency assessment of approximated spatial predictions for large datasets

    Hong, Yiping; Abdulah, Sameh; Genton, Marc G.; Sun, Ying (Spatial Statistics, Elsevier BV, 2021-05-14) [Article]
    Due to the well-known computational showstopper of the exact Maximum Likelihood Estimation (MLE) for large geospatial observations, a variety of approximation methods have been proposed in the literature, which usually require tuning certain inputs. For example, the recently developed Tile Low-Rank approximation (TLR) method involves many tuning parameters, including numerical accuracy. To properly choose the tuning parameters, it is crucial to adopt a meaningful criterion for the assessment of the prediction efficiency with different inputs. Unfortunately, the most commonly-used Mean Square Prediction Error (MSPE) criterion cannot directly assess the loss of efficiency when the spatial covariance model is approximated. Though the Kullback–Leibler Divergence criterion can provide the information loss of the approximated model, it cannot give more detailed information that one may be interested in, e.g., the accuracy of the computed MSE. In this paper, we present three other criteria, the Mean Loss of Efficiency (MLOE), Mean Misspecification of the Mean Square Error (MMOM), and Root mean square MOM (RMOM), and show numerically that, in comparison with the common MSPE criterion and the Kullback–Leibler Divergence criterion, our criteria are more informative, and thus more adequate to assess the loss of the prediction efficiency by using the approximated or misspecified covariance models. Hence, our suggested criteria are more useful for the determination of tuning parameters for sophisticated approximation methods of spatial model fitting. To illustrate this, we investigate the trade-off between the execution time, estimation accuracy, and prediction efficiency for the TLR method with extensive simulation studies and suggest proper settings of the TLR tuning parameters. 
We then apply the TLR method to a large spatial dataset of soil moisture in the area of the Mississippi River basin, and compare the TLR with the Gaussian predictive process and the composite likelihood method, showing that our suggested criteria can successfully be used to choose the tuning parameters that can keep the estimation or the prediction accuracy in applications.
  • SCAU: Modeling spectral causality for multivariate time series with applications to electroencephalograms

    Pinto-Orellana, Marco Antonio; Mirtaheri, Peyman; Hammer, Hugo L.; Ombao, Hernando (arXiv, 2021-05-13) [Preprint]
    Electroencephalograms (EEGs) are noninvasive measurement signals of electrical neuronal activity in the brain. One of the current major statistical challenges is formally measuring functional dependency between those complex signals. This paper proposes the spectral causality model (SCAU), a robust linear model, under a causality paradigm, to reflect inter- and intra-frequency modulation effects that cannot be identified using other methods. SCAU inference is conducted in three main steps: (a) signal decomposition into frequency bins, (b) intermediate spectral band mapping, and (c) dependency modeling through frequency-specific vector autoregressive (VAR) models. We apply SCAU to study complex dependencies during visual and lexical fluency tasks (word generation and visual fixation) in 26 participants' EEGs. We compared the connectivity networks estimated using SCAU with those from a VAR model. SCAU networks show a clear contrast for both stimuli, while the link magnitudes also showed a low variance in comparison with the VAR networks. Furthermore, SCAU dependency connections not only were consistent with findings in the neuroscience literature, but also provided further evidence on the directionality of the spatio-spectral dependencies such as the delta-originated and theta-induced links in the fronto-temporal brain network.
  • Modeling spatial extremes using normal mean-variance mixtures

    Zhang, Zhongwei; Huser, Raphaël; Opitz, Thomas; Wadsworth, Jennifer L. (Submitted to Extremes, arXiv, 2021-05-11) [Preprint]
    Classical models for multivariate or spatial extremes are mainly based upon the asymptotically justified max-stable or generalized Pareto processes. These models are suitable when asymptotic dependence is present, i.e., the joint tail decays at the same rate as the marginal tail. However, recent environmental data applications suggest that asymptotic independence is equally important and, unfortunately, existing spatial models in this setting that are both flexible and can be fitted efficiently are scarce. Here, we propose a new spatial copula model based on the generalized hyperbolic distribution, which is a specific normal mean-variance mixture and is very popular in financial modeling. The tail properties of this distribution have been studied in the literature, but with contradictory results. It turns out that the proofs from the literature contain mistakes. We here give a corrected theoretical description of its tail dependence structure and then exploit the model to analyze a simulated dataset from the inverted Brown-Resnick process, hindcast significant wave height data in the North Sea, and wind gust data in the state of Oklahoma, USA. We demonstrate that our proposed model is flexible enough to capture the dependence structure not only in the tail but also in the bulk.
  • Topological Data Analysis of COVID-19 Virus Spike Proteins

    Chung, Moo K.; Ombao, Hernando (arXiv, 2021-05-01) [Preprint]
    Topological data analysis, including persistent homology, has undergone significant development in recent years. However, due to the heterogeneous nature of persistent homology features, which do not have one-to-one correspondence across measurements, it is still difficult to build a coherent statistical inference procedure. The paired data structure in persistent homology, as birth and death events of topological features, adds further complexity to conducting inference. To address these current problems, we propose to analyze the birth and death events using lattice paths. The proposed lattice path method is implemented to characterize the topological features of the protein structures of coronaviruses. This offers new insights into building a coherent statistical inference procedure in persistent homology.
  • Combined effects of hydrometeorological hazards and urbanisation on dengue risk in Brazil: a spatiotemporal modelling study.

    Lowe, Rachel; Lee, Sophie A; O'Reilly, Kathleen M; Brady, Oliver J; Bastos, Leonardo; Carrasco-Escobar, Gabriel; de Castro Catão, Rafael; Colón-González, Felipe J; Barcellos, Christovam; Carvalho, Marilia Sá; Blangiardo, Marta; Rue, Haavard; Gasparrini, Antonio (The Lancet. Planetary health, Elsevier BV, 2021-04-11) [Article]
    Temperature and rainfall patterns are known to influence seasonal patterns of dengue transmission. However, the effect of severe drought and extremely wet conditions on the timing and intensity of dengue epidemics is poorly understood. In this study, we aimed to quantify the non-linear and delayed effects of extreme hydrometeorological hazards on dengue risk by level of urbanisation in Brazil using a spatiotemporal model. We combined distributed lag non-linear models with a spatiotemporal Bayesian hierarchical model framework to determine the exposure-lag-response association between the relative risk (RR) of dengue and a drought severity index. We fit the model to monthly dengue case data for the 558 microregions of Brazil between January, 2001, and January, 2019, accounting for unobserved confounding factors, spatial autocorrelation, seasonality, and interannual variability. We assessed the variation in RR by level of urbanisation through an interaction between the drought severity index and urbanisation. We also assessed the effect of hydrometeorological hazards on dengue risk in areas with a high frequency of water supply shortages. The dataset included 12 895 293 dengue cases reported between 2001 and 2019 in Brazil. Overall, the risk of dengue increased between 0-3 months after extremely wet conditions (maximum RR at 1 month lag 1·56 [95% CI 1·41-1·73]) and 3-5 months after drought conditions (maximum RR at 4 months lag 1·43 [1·22-1·67]). Including a linear interaction between the drought severity index and level of urbanisation improved the model fit and showed the risk of dengue was higher in more rural areas than highly urbanised areas during extremely wet conditions (maximum RR 1·77 [1·32-2·37] at 0 months lag vs maximum RR 1·58 [1·39-1·81] at 2 months lag), but higher in highly urbanised areas than rural areas after extreme drought (maximum RR 1·60 [1·33-1·92] vs 1·15 [1·08-1·22], both at 4 months lag). 
We also found the dengue risk following extreme drought was higher in areas that had a higher frequency of water supply shortages. Wet conditions and extreme drought can increase the risk of dengue with different delays. The risk associated with extremely wet conditions was higher in more rural areas and the risk associated with extreme drought was exacerbated in highly urbanised areas, which have water shortages and intermittent water supply during droughts. These findings have implications for targeting mosquito control activities in poorly serviced urban areas, not only during the wet and warm season, but also during drought periods. Funding: Royal Society, Medical Research Council, Wellcome Trust, National Institutes of Health, Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro, and Conselho Nacional de Desenvolvimento Científico e Tecnológico. For the Portuguese translation of the abstract see Supplementary Materials section.
  • High Performance Multivariate Geospatial Statistics on Manycore Systems

    Salvaña, Mary Lai O.; Abdulah, Sameh; Huang, Huang; Ltaief, Hatem; Sun, Ying; Genton, Marc G.; Keyes, David E. (IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers (IEEE), 2021-04-06) [Article]
    Modeling and inferring spatial relationships and predicting missing values of environmental data are some of the main tasks of geospatial statisticians. These routine tasks are accomplished using multivariate geospatial models and the cokriging technique, which requires the evaluation of the expensive Gaussian log-likelihood function. This large-scale cokriging challenge provides a fertile ground for supercomputing implementations for the geospatial statistics community as it is paramount to scale computational capability to match the growth in environmental data. In this paper, we develop large-scale multivariate spatial modeling and inference on parallel hardware architectures. To tackle the increasing complexity in matrix operations and the massive concurrency in parallel systems, we leverage low-rank matrix approximation techniques with task-based programming models and schedule the asynchronous computational tasks using a dynamic runtime system. The proposed framework provides both the dense and approximated computations of the Gaussian log-likelihood function. It demonstrates accuracy robustness and performance scalability on a variety of computer systems. Using both synthetic and real datasets, the low-rank matrix approximation shows better performance compared to exact computation, while preserving the application requirements in both parameter estimation and prediction accuracy. We also propose a novel algorithm to assess the prediction accuracy after the online parameter estimation.
  • Comparative study of machine learning methods for COVID-19 transmission forecasting

    Dairi, Abdelkader; Harrou, Fouzi; Zeroual, Abdelhafid; Hittawe, Mohamad; Sun, Ying (Journal of Biomedical Informatics, Elsevier BV, 2021-04) [Article]
    Within the recent pandemic, scientists and clinicians are engaged in seeking new technology to stop or slow down the COVID-19 pandemic. The benefit of machine learning, as an essential aspect of artificial intelligence, on past epidemics offers a new line to tackle the novel Coronavirus outbreak. Accurate short-term forecasting of COVID-19 spread plays an essential role in improving the management of the overcrowding problem in hospitals and enables appropriate optimization of the available resources (i.e., materials and staff). This paper presents a comparative study of machine learning methods for COVID-19 transmission forecasting. We investigated the performances of deep learning methods, including the hybrid convolutional neural networks-Long short-term memory (LSTM-CNN), the hybrid gated recurrent unit-convolutional neural networks (GAN-GRU), GAN, CNN, LSTM, and Restricted Boltzmann Machine (RBM), as well as baseline machine learning methods, namely logistic regression (LR) and support vector regression (SVR). The employment of hybrid models (i.e., LSTM-CNN and GAN-GRU) is expected to eventually improve the forecasting accuracy of COVID-19 future trends. The performance of the investigated deep learning and machine learning models was tested using confirmed and recovered COVID-19 cases time-series data from seven impacted countries: Brazil, France, India, Mexico, Russia, Saudi Arabia, and the US. The results reveal that hybrid deep learning models can efficiently forecast COVID-19 cases. Also, results confirmed the superior performance of deep learning models compared to the two considered baseline machine learning models. Furthermore, results showed that LSTM-CNN achieved improved performances with an averaged mean absolute percentage error of 3.718%, among others.
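The accuracy figure quoted in this abstract is a mean absolute percentage error (MAPE). A minimal sketch of that metric follows; the function name and the toy numbers are assumptions for illustration and are unrelated to the paper's data.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    if np.any(actual == 0.0):
        # MAPE is undefined when an observed value is zero
        raise ValueError("actual values must be nonzero")
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))

# Toy example: errors of 10% and 5% average to 7.5%
score = mape([100.0, 200.0], [110.0, 190.0])
```

Because each error is scaled by the observed value, MAPE allows comparison across countries with very different case counts, but it becomes unstable when the observed counts are close to zero.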
  • Spectral Dependence

    Ombao, Hernando; Pinto, Marco (arXiv, 2021-03-31) [Preprint]
    This paper presents a general framework for modeling dependence in multivariate time series. Its fundamental approach relies on decomposing each signal in a system into various frequency components and then studying the dependence properties through these oscillatory activities. The unifying theme across the paper is to explore the strength of dependence and possible lead-lag dynamics through filtering. The proposed framework is capable of representing both linear and non-linear dependencies that could occur instantaneously or after some delay (lagged dependence). Examples for studying dependence between oscillations are illustrated through multichannel electroencephalograms. These examples emphasized that some of the most prominent frequency domain measures such as coherence, partial coherence, and dual-frequency coherence can be derived as special cases under this general framework. This paper also introduces related approaches for modeling dependence through phase-amplitude coupling and causality of (one-sided) filtered signals.
  • The Negative Binomial Process: A Tractable Model with Composite Likelihood-Based Inference

    Barreto-Souza, Wagner; Ombao, Hernando (Scandinavian Journal of Statistics, Wiley, 2021-03-24) [Article]
    We propose a log-linear Poisson regression model driven by a stationary latent gamma autoregression. This process has negative binomial (NB) marginals and is suited to analyzing overdispersed count time series data. Estimation and statistical inference are performed using a composite likelihood (CL) function. We establish theoretical properties of the proposed count model, in particular, the strong consistency and asymptotic normality of the maximum CL estimator. A procedure for calculating the standard error of the parameter estimator and confidence intervals is derived based on the parametric bootstrap. Monte Carlo experiments were conducted to study and compare the finite-sample properties of the proposed estimators. The simulations demonstrate that, compared to the approach that combines generalized linear models with the ordinary least squares method, the proposed composite likelihood approach provides satisfactory results for estimating the parameters related to the correlation structure of the process, even under model misspecification. An empirical illustration of the NB process is presented for the monthly number of viral hepatitis cases in Goiânia (capital and largest city of the Brazilian state of Goiás) from January 2001 to December 2018.
  • Conditional normal extreme-value copulas

    Krupskii, Pavel; Genton, Marc G. (Extremes, Springer Nature, 2021-03-19) [Article]
    We propose a new class of extreme-value copulas which are extreme-value limits of conditional normal models. Conditional normal models are generalizations of conditional independence models, where the dependence among observed variables is modeled using one unobserved factor. Conditional on this factor, the distribution of these variables is given by the Gaussian copula. This structure allows one to build flexible and parsimonious models for data with complex dependence structures, such as data with spatial dependence or factor structure. We study the extreme-value limits of these models and show some interesting special cases of the proposed class of copulas. We develop estimation methods for the proposed models and conduct a simulation study to assess the performance of these algorithms. Finally, we apply these copula models to analyze data on monthly wind maxima and stock return minima.
  • Sparse Functional Boxplots for Multivariate Curves

    Qu, Zhuo; Genton, Marc G. (arXiv, 2021-03-14) [Preprint]
    This paper introduces the sparse functional boxplot and the intensity sparse functional boxplot as practical exploratory tools that make visualization possible for both complete and sparse functional data. These visualization tools can be used either in the univariate or multivariate functional setting. The sparse functional boxplot, which is based on the functional boxplot, depicts sparseness characteristics in the envelope of the 50% central region, the median curve, and the outliers. The proportion of missingness at each time index within the central region is colored in gray. The intensity sparse functional boxplot displays the relative intensity of sparse points in the central region, revealing where data are more or less sparse. The two-stage functional boxplot, a derivation from the functional boxplot to better detect outliers, is also extended to its sparse form. Several depth proposals for sparse multivariate functional data are evaluated and outlier detection is tested in simulations under various data settings and sparseness scenarios. The practical applications of the sparse functional boxplot and intensity sparse functional boxplot are illustrated with two public health datasets.
  • A Generalized Heckman Model With Varying Sample Selection Bias and Dispersion Parameters

    Bastos, Fernando de Souza; Barreto-Souza, Wagner; Genton, Marc G. (Statistica Sinica, Statistica Sinica (Institute of Statistical Science), 2021-03-09) [Article]
    Many proposals have emerged as alternatives to the Heckman selection model, mainly to address the non-robustness of its normal assumption. The 2001 Medical Expenditure Panel Survey data is often used to illustrate this non-robustness of the Heckman model. In this paper, we propose a generalization of the Heckman sample selection model by allowing the sample selection bias and dispersion parameters to depend on covariates. We show that the non-robustness of the Heckman model may be due to the assumption of the constant sample selection bias parameter rather than the normality assumption. Our proposed methodology allows us to understand which covariates are important to explain the sample selection bias phenomenon rather than to only form conclusions about its presence. Further, our approach may attenuate the non-identifiability and multicollinearity problems faced by the existing sample selection models. We explore the inferential aspects of the maximum likelihood estimators (MLEs) for our proposed generalized Heckman model. More specifically, we show that this model satisfies some regularity conditions such that it ensures consistency.
