### Recent Submissions

• #### Decision trees based on 1-consequences

(Discrete Applied Mathematics, Elsevier BV, 2021-07-20) [Article]
In this paper, we study arbitrary infinite binary information systems, each of which consists of an infinite set of elements and an infinite set of two-valued non-constant functions (attributes) defined on this set. We consider the notion of a problem over an information system, described by a finite number of attributes: for a given element, we should determine the values of these attributes. As algorithms for solving such problems, we study decision trees that use arbitrary attributes from the considered infinite set and solve the problem based on 1-consequences. In such a tree, we take into account consequences each of which follows from a single equation of the kind “attribute = value” obtained during the operation of the decision tree, and we ignore consequences that can be derived only from at least two equations. As a measure of time complexity, we study the depth of decision trees. We prove that, in the worst case, as the number of attributes in the problem description grows, the minimum depth of decision trees based on 1-consequences grows either logarithmically or linearly.
• #### Bayesian calibration of order and diffusivity parameters in a fractional diffusion equation

(Journal of Physics Communications, IOP Publishing, 2021-07-15) [Article]
This work focuses on parameter calibration of a variable-diffusivity fractional diffusion model. A random, spatially-varying diffusivity field with log-normal distribution is considered. The variance and correlation length of the diffusivity field are considered uncertain parameters, and the order of the fractional sub-diffusion operator is also taken to be uncertain, uniformly distributed in the range (0,1). A Karhunen-Loève (KL) decomposition of the random diffusivity field is used, leading to a stochastic problem defined in terms of a finite number of canonical random variables. Polynomial chaos (PC) techniques are used to express the dependence of the stochastic solution on these random variables. A non-intrusive methodology is used, and a deterministic finite-difference solver of the fractional diffusion model is utilized for this purpose. The PC surrogates are first used to assess the sensitivity of quantities of interest (QoIs) to uncertain inputs and to examine their statistics. In particular, the analysis indicates that the fractional order has a dominant effect on the variance of the QoIs considered, followed by the leading KL modes. The PC surrogates are further exploited to calibrate the uncertain parameters using a Bayesian methodology. Different setups are considered, including distributed and localized forcing functions and data consisting of either noisy observations of the solution or its first moments. In the broad range of parameters addressed, the analysis shows that the uncertain parameters having a significant impact on the variance of the solution can be reliably inferred, even from limited observations.
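As a hedged illustration of the first ingredient above (not the paper's actual solver or parameter values), the following sketch builds a truncated Karhunen-Loève expansion of a log-normal diffusivity field on a 1D grid; the exponential covariance, variance `sigma2`, correlation length `ell`, and mode count are placeholders.

```python
import numpy as np

# Hypothetical illustration: a truncated Karhunen-Loeve expansion of a
# log-normal diffusivity field on [0, 1]; the variance `sigma2` and
# correlation length `ell` play the role of the uncertain parameters.
rng = np.random.default_rng(0)
n, sigma2, ell = 200, 0.5, 0.2
x = np.linspace(0.0, 1.0, n)
C = sigma2 * np.exp(-np.abs(x[:, None] - x[None, :]) / ell)  # exponential covariance
w, V = np.linalg.eigh(C)                    # eigenpairs, ascending order
idx = np.argsort(w)[::-1][:10]              # keep the 10 leading KL modes
lam, phi = w[idx], V[:, idx]

xi = rng.standard_normal(10)                # canonical Gaussian variables
log_kappa = phi @ (np.sqrt(lam) * xi)       # truncated KL expansion of log-diffusivity
kappa = np.exp(log_kappa)                   # log-normal diffusivity field
```

Each draw of the canonical variables `xi` yields one realization of the field; in the paper's setting, a PC surrogate is then fitted over these variables.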
• #### DES-Tcell is a knowledgebase for exploring immunology-related literature

(Scientific Reports, Springer Science and Business Media LLC, 2021-07-12) [Article]
T-cells are a subtype of white blood cells circulating throughout the body, searching for infected and abnormal cells. They have multifaceted functions that include scanning for and directly killing cells infected with intracellular pathogens, eradicating abnormal cells, orchestrating immune response by activating and helping other immune cells, memorizing encountered pathogens, and providing long-lasting protection upon recurrent infections. However, T-cells are also involved in immune responses that result in organ transplant rejection, autoimmune diseases, and some allergic diseases. To support T-cell research, we developed the DES-Tcell knowledgebase (KB). This KB incorporates text- and data-mined information that can expedite retrieval and exploration of T-cell relevant information from the large volume of published T-cell-related research. This KB enables exploration of data through concepts from 15 topic-specific dictionaries, including immunology-related genes, mutations, pathogens, and pathways. We developed three case studies using DES-Tcell, one of which validates effective retrieval of known associations by DES-Tcell. The second and third case studies focus on concepts that are common to Graves' disease (GD) and Hashimoto's thyroiditis (HT). Several reports have shown that up to 20% of GD patients treated with antithyroid medication develop HT, thus suggesting a possible conversion or shift from GD to HT disease. DES-Tcell found that miR-4442 links to both GD and HT and possibly targets the autoimmune disease risk factor CD6, which provides potential new knowledge derived through the use of DES-Tcell. To the best of our knowledge, DES-Tcell is the first KB dedicated to exploring T-cell-relevant information via literature-mining, data-mining, and topic-specific dictionaries.
• #### Competition on Spatial Statistics for Large Datasets

(Journal of Agricultural, Biological and Environmental Statistics, Springer Science and Business Media LLC, 2021-07-08) [Article]
As spatial datasets are becoming increasingly large and unwieldy, exact inference on spatial models becomes computationally prohibitive. Various approximation methods have been proposed to reduce the computational burden. Although comprehensive reviews on these approximation methods exist, comparisons of their performances are limited to small and medium sizes of datasets for a few selected methods. To achieve a comprehensive comparison comprising as many methods as possible, we organized the Competition on Spatial Statistics for Large Datasets. This competition had the following novel features: (1) we generated synthetic datasets with the ExaGeoStat software so that the number of generated realizations ranged from 100 thousand to 1 million; (2) we systematically designed the data-generating models to represent spatial processes with a wide range of statistical properties for both Gaussian and non-Gaussian cases; (3) the competition tasks included both estimation and prediction, and the results were assessed by multiple criteria; and (4) we have made all the datasets and competition results publicly available to serve as a benchmark for other approximation methods. In this paper, we disclose all the competition details and results along with some analysis of the competition outcomes.
• #### Accelerating Seismic Redatuming Using Tile Low-Rank Approximations on NEC SX-Aurora TSUBASA

(Submitted to Supercomputing Frontiers and Innovations, South Ural State University (Chelyabinsk, Russia), 2021-07-07) [Preprint]
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of the Earth must be numerically re-positioned at locations in the subsurface where reflections have originated, a process generally referred to as redatuming by the geophysical community. Historically, this process has been carried out by numerically time-reversing the data recorded along an open boundary of surface receivers into the subsurface. Despite its simplicity, such an approach is only able to handle seismic energy from primary arrivals (i.e., waves that interact only once with the medium discontinuities), failing to explain multi-scattering in the subsurface. As a result, seismic images are contaminated by artificial reflectors if data are not pre-processed prior to imaging such that multiples are removed from the data. In the last decade, a novel family of methods has emerged under the name of Marchenko redatuming; such methods allow for accurate redatuming of the full-wavefield recorded seismic data including multiple arrivals. This is achieved by solving an inverse problem, whose adjoint modeling can be shown to be equivalent to the standard single-scattering redatuming method for primary-only data. A downside of this application is that the so-called multi-dimensional convolution operator must be repeatedly evaluated as part of the inversion. Such an operator requires the application of multiple dense matrix-vector multiplications (MVM), which represent the most time-consuming operations in the forward and adjoint processes. We identify and leverage the data sparsity structure for each of the frequency matrices during the MVM operation, and propose to accelerate the MVM step using tile low-rank (TLR) matrix approximations. 
We study the TLR impact on time-to-solution for the MVM using different accuracy thresholds whilst at the same time assessing the quality of the resulting subsurface seismic wavefields and show that TLR leads to a minimal degradation in terms of signal-to-noise ratio on a 3D synthetic dataset. We mitigate the load imbalance overhead and provide performance evaluation on two distributed-memory systems. Our MPI+OpenMP TLR-MVM implementation reaches up to 3X performance speedup against the dense MVM counterpart from NEC scientific library on 128 NEC SX-Aurora TSUBASA cards. Thanks to the second generation of high bandwidth memory technology, it further attains up to 67X performance speedup (i.e., 110 TB/s) compared to the dense MVM from Intel MKL when running on 128 dual-socket 20-core Intel Cascade Lake nodes with DDR4 memory, without deteriorating the quality of the reconstructed seismic wavefields.
• #### Reference Tracking and Observer Design for a Space-Fractional Partial Differential Equation Modeling Gas Pressures in Fractured Media

(arXiv, 2021-07-05) [Preprint]
This paper considers a class of space-fractional partial differential equations (FPDEs) that describe gas pressures in fractured media. First, the well-posedness, uniqueness, and stability in $L_\infty(\mathbb{R})$ of the considered FPDEs are investigated. Then, the reference tracking problem is studied to track the pressure gradient at a downstream location of a channel. This requires manipulation of the gas pressure at the downstream location and the use of pressure measurements at an upstream location. To achieve this, the backstepping approach is adapted to the space FPDEs. The key challenge in this adaptation is the non-applicability of the Lyapunov theory, which is typically used to prove the stability of the target system, as the obtained target system is fractional in space. In addition, a backstepping adaptive observer is designed to jointly estimate both the system's state and the disturbance. The stability of the closed loop (reference tracking controller/observer) is also investigated. Finally, numerical simulations are given to evaluate the efficiency of the proposed method.
• #### Existence of weak solutions to time-dependent mean-field games

(Nonlinear Analysis, Elsevier BV, 2021-06-30) [Article]
Here, we establish the existence of weak solutions to a wide class of time-dependent monotone mean-field games (MFGs). These MFGs are given as a system of degenerate parabolic equations with initial and terminal conditions. To construct these solutions, we consider a high-order elliptic regularization in space–time. Then, applying Schaefer’s fixed-point theorem, we obtain the existence and uniqueness for this regularized problem. Using Minty’s method, we prove the existence of a weak solution to the original MFG. Finally, the paper ends with a discussion on congestion problems and density constrained MFGs.
• #### Optimization of Decision Trees with Hypotheses for Knowledge Representation

(Electronics, MDPI AG, 2021-06-30) [Article]
In this paper, we consider decision trees that use two types of queries: queries based on one attribute each and queries based on hypotheses about the values of all attributes. Such decision trees are similar to the ones studied in exact learning, where membership and equivalence queries are allowed. We present dynamic programming algorithms for minimization of the depth and the number of nodes of the above decision trees and discuss results of computer experiments on various data sets and randomly generated Boolean functions. Decision trees with hypotheses generally have lower complexity, i.e., they are more understandable and more suitable as a means for knowledge representation.
• #### A second-order accurate numerical scheme for a time-fractional Fokker-Planck equation

(arXiv, 2021-06-27) [Preprint]
A time-stepping $L1$ scheme for solving a time-fractional Fokker-Planck equation of order $\alpha \in (0, 1)$, with a general driving force, is investigated. A stability bound for the semi-discrete solution is obtained for $\alpha\in(1/2,1)$ via a novel and concise approach. Our stability estimate is $\alpha$-robust in the sense that it remains valid in the limiting case where $\alpha$ approaches $1$ (when the model reduces to the classical Fokker-Planck equation), a limit of practical importance. Concerning the error analysis, we obtain an optimal second-order accurate estimate for $\alpha\in(1/2,1)$. A time-graded mesh is used to compensate for the singular behavior of the continuous solution near the origin. The $L1$ scheme is combined with a standard spatial Galerkin finite element discretization to numerically support our theoretical contributions. We employ the resulting fully discrete computable numerical scheme to perform some numerical tests. These tests suggest that the imposed time-graded mesh assumption could be further relaxed, and we observe second-order accuracy even for the case $\alpha\in(0,1/2]$, that is, outside the range covered by the theory.
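For background, the classical uniform-mesh $L1$ discretization of the Caputo derivative (a simpler relative of the graded-mesh scheme analyzed in the paper) can be sketched on the scalar relaxation problem $D^\alpha u = -u$, $u(0)=1$; the step count and $\alpha$ below are illustrative.

```python
import math
import numpy as np

# Illustrative sketch (not the paper's graded-mesh scheme): the classical
# uniform-mesh L1 discretization of the Caputo derivative, applied to the
# scalar relaxation problem D^alpha u = -u, u(0) = 1.
def l1_fractional_relaxation(alpha, T=1.0, N=200):
    tau = T / N
    c = tau**(-alpha) / math.gamma(2.0 - alpha)
    b = (np.arange(1, N + 1)**(1 - alpha)
         - np.arange(0, N)**(1 - alpha))        # L1 weights b_j
    u = np.empty(N + 1)
    u[0] = 1.0
    for n in range(1, N + 1):
        # history term: sum_{j=1}^{n-1} b_j (u_{n-j} - u_{n-j-1})
        hist = np.dot(b[1:n], u[n-1:0:-1] - u[n-2::-1]) if n > 1 else 0.0
        u[n] = c * (u[n-1] - hist) / (c + 1.0)  # implicit L1 step
    return u
```

The growing history sum is what makes time-fractional stepping expensive, and the weights' behavior near $t=0$ is why graded meshes matter for accuracy.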
• #### Optimal control of an SIR epidemic through finite-time non-pharmaceutical intervention

(Journal of Mathematical Biology, Springer Science and Business Media LLC, 2021-06-26) [Article]
We consider the problem of controlling an SIR-model epidemic by temporarily reducing the rate of contact within a population. The control takes the form of a multiplicative reduction in the contact rate of infectious individuals. The control is allowed to be applied only over a finite time interval, while the objective is to minimize the total number of individuals infected in the long-time limit, subject to some cost function for the control. We first consider the no-cost scenario and analytically determine the optimal control and solution. We then study solutions when a cost of intervention is included, as well as a cost associated with overwhelming the available medical resources. Examples are studied through the numerical solution of the associated Hamilton-Jacobi-Bellman equation. Finally, we provide some examples related directly to the current pandemic.
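A minimal sketch of the setup (illustrative parameter values and a simple forward Euler integrator, not the paper's Hamilton-Jacobi-Bellman solver): the contact rate is multiplied by $(1-u)$ on a finite window, and the objective is the long-time fraction ever infected.

```python
# Hypothetical sketch: an SIR model whose contact rate beta is reduced by
# a factor (1 - u) on a finite intervention window [t_on, t_off].
def sir_final_infected(u=0.5, t_on=10.0, t_off=40.0,
                       beta=0.3, gamma=0.1, T=400.0, dt=0.01):
    S, I = 0.999, 0.001                  # population fractions; R = 1 - S - I
    t = 0.0
    while t < T:
        b = beta * (1.0 - u) if t_on <= t < t_off else beta
        dS = -b * S * I
        dI = b * S * I - gamma * I
        S, I = S + dt * dS, I + dt * dI  # forward Euler step
        t += dt
    return 1.0 - S                       # total ever infected (long-time limit)

no_control = sir_final_infected(u=0.0)
with_control = sir_final_infected(u=0.5)
```

Even this crude simulation reproduces the qualitative point of the paper: a temporary contact reduction lowers the final epidemic size, and how much depends on when the window is placed relative to the peak.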
• #### Entropy-Based Greedy Algorithm for Decision Trees Using Hypotheses

(Entropy, MDPI AG, 2021-06-25) [Article]
In this paper, we consider decision trees that use both conventional queries based on one attribute each and queries based on hypotheses of values of all attributes. Such decision trees are similar to those studied in exact learning, where membership and equivalence queries are allowed. We present a greedy algorithm based on entropy for the construction of the above decision trees and discuss the results of computer experiments on various data sets and randomly generated Boolean functions.
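The entropy criterion for the conventional attribute queries can be sketched as a standard information-gain computation; the toy table below is illustrative, not one of the paper's data sets.

```python
import numpy as np

# Toy sketch of the entropy ingredient: choosing the attribute query that
# maximizes information gain, as in a conventional greedy decision-tree step.
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_attribute(X, y):
    base = entropy(y)
    gains = []
    for a in range(X.shape[1]):
        g = base
        for v in np.unique(X[:, a]):
            mask = X[:, a] == v
            g -= mask.mean() * entropy(y[mask])   # weighted child entropy
        gains.append(g)
    return int(np.argmax(gains)), gains

# Toy table: attribute 0 fully determines the label, attribute 1 is noise
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
attr, gains = best_attribute(X, y)
```

The paper's algorithm additionally scores hypothesis queries with the same entropy measure, which is what allows it to mimic equivalence queries from exact learning.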
• #### Uncertainty Quantification and Bayesian Inference of Cloud Parameterization in the NCAR Single Column Community Atmosphere Model (SCAM6)

(Frontiers in Climate, Frontiers, 2021-06-16) [Article]
Uncertainty quantification (UQ) in weather and climate models is required to assess the sensitivity of their outputs to various parameterization schemes and thereby improve their consistency with observations. Herein, we present an efficient UQ and Bayesian inference framework for the cloud parameters of the NCAR Single Column Atmosphere Model (SCAM6) using surrogate models based on a polynomial chaos expansion. The use of a surrogate model enables efficient propagation of uncertainties in the parameters into uncertainties in the model outputs. We investigated eight uncertain parameters: the auto-conversion size threshold for ice to snow (dcs), the fall speed parameter for stratiform cloud ice (ai), the fall speed parameter for stratiform snow (as), the fall speed parameter for cloud water (ac), the collection efficiency of aggregation ice (eii), the efficiency factor of the Bergeron effect (berg_eff), the threshold maximum relative humidity for ice clouds (rhmaxi), and the threshold minimum relative humidity for ice clouds (rhmini). We built two surrogate models using two non-intrusive methods: spectral projection (SP) and basis pursuit denoising (BPDN). Our results suggest that BPDN performs better than SP, as it filters out internal noise during the process of fitting the surrogate model. Five out of the eight parameters (namely dcs, ai, rhmaxi, rhmini, and eii) account for most of the variance in predicted climate variables (e.g., total precipitation, cloud distribution, shortwave and longwave cloud forcing, ice and liquid water path). A first-order sensitivity analysis reveals that dcs contributes approximately 40–80% of the total variance of the climate variables, ai around 15–30%, and rhmaxi, rhmini, and eii around 5–15%. The second- and higher-order effects contribute approximately 20% and 11%, respectively. The sensitivity of the model to these parameters was further explored using response curves. 
A Markov chain Monte Carlo (MCMC) sampling algorithm was also implemented for the Bayesian inference of dcs, ai, as, rhmini, and berg_eff using cloud distribution data collected at the Southern Great Plains site (USA). Our study has implications for enhancing the understanding of the physical mechanisms of cloud processes that lead to uncertainty in model simulations, and it further helps to improve the models used for their assessment.
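The MCMC ingredient can be sketched generically with a random-walk Metropolis sampler targeting a known Gaussian posterior, standing in for the PC-surrogate likelihood used in the paper (all values below are illustrative).

```python
import numpy as np

# Generic sketch of the MCMC ingredient: random-walk Metropolis targeting
# a known 1D Gaussian posterior (a stand-in for the surrogate likelihood).
def metropolis(log_post, x0, steps=20000, scale=1.0, seed=3):
    rng = np.random.default_rng(seed)
    x, lp = x0, log_post(x0)
    chain = np.empty(steps)
    for i in range(steps):
        prop = x + scale * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject step
            x, lp = prop, lp_prop
        chain[i] = x
    return chain

log_post = lambda t: -0.5 * (t - 2.0) ** 2 / 0.25   # N(2, 0.5^2) target
chain = metropolis(log_post, 0.0)
```

The surrogate makes each `log_post` evaluation cheap, which is precisely what makes long MCMC chains affordable when the forward model itself is a climate simulation.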
• #### Machine learning-based conditional mean filter: a generalization of the ensemble Kalman filter for nonlinear data assimilation

(arXiv, 2021-06-15) [Preprint]
Filtering is a data assimilation technique that performs the sequential inference of dynamical systems states from noisy observations. Herein, we propose a machine learning-based ensemble conditional mean filter (ML-EnCMF) for tracking possibly high-dimensional non-Gaussian state models with nonlinear dynamics based on sparse observations. The proposed filtering method is developed based on the conditional expectation and numerically implemented using machine learning (ML) techniques combined with the ensemble method. The contribution of this work is twofold. First, we demonstrate that the ensembles assimilated using the ensemble conditional mean filter (EnCMF) provide an unbiased estimator of the Bayesian posterior mean, and their variance matches the expected conditional variance. Second, we implement the EnCMF using artificial neural networks, which have a significant advantage in representing nonlinear functions over high-dimensional domains such as the conditional mean. Finally, we demonstrate the effectiveness of the ML-EnCMF for tracking the states of Lorenz-63 and Lorenz-96 systems under the chaotic regime. Numerical results show that the ML-EnCMF outperforms the ensemble Kalman filter.
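For context, the linear-Gaussian special case that the ML-EnCMF generalizes is the stochastic EnKF analysis step, sketched here with an identity observation operator and illustrative dimensions (simplifying assumptions, not the paper's experiments).

```python
import numpy as np

# Sketch of the linear special case the paper generalizes: a stochastic
# ensemble Kalman filter (EnKF) analysis step, where the conditional-mean
# update is approximated by a linear map estimated from the ensemble.
rng = np.random.default_rng(2)
Ne, dx = 500, 2
truth = np.array([1.0, -1.0])
R = 0.1 * np.eye(dx)                      # observation-noise covariance
y_obs = truth + rng.multivariate_normal(np.zeros(dx), R)

ens = rng.multivariate_normal(np.zeros(dx), np.eye(dx), size=Ne)  # forecast ensemble
X = ens - ens.mean(axis=0)
C = X.T @ X / (Ne - 1)                    # ensemble forecast covariance
K = C @ np.linalg.inv(C + R)              # Kalman gain (identity observation operator)
perturbed = y_obs + rng.multivariate_normal(np.zeros(dx), R, size=Ne)
analysis = ens + (perturbed - ens) @ K.T  # stochastic EnKF update
```

The ML-EnCMF replaces the linear map `K` with a neural-network approximation of the conditional mean, which is what lets it track non-Gaussian posteriors.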
• #### Fastest rates for stochastic mirror descent methods

(Computational Optimization and Applications, Springer Science and Business Media LLC, 2021-06-09) [Article]
Relative smoothness—a notion introduced in Birnbaum et al. (Proceedings of the 12th ACM conference on electronic commerce, ACM, pp 127–136, 2011) and recently rediscovered in Bauschke et al. (Math Oper Res 330–348, 2016) and Lu et al. (Relatively-smooth convex optimization by first-order methods, and applications, arXiv:1610.05708, 2016)—generalizes the standard notion of smoothness typically used in the analysis of gradient-type methods. In this work we take ideas from the well-studied field of stochastic convex optimization and use them to obtain faster algorithms for minimizing relatively smooth functions. We propose and analyze two new algorithms: Relative Randomized Coordinate Descent (relRCD) and Relative Stochastic Gradient Descent (relSGD), both of which generalize well-known algorithms in the standard smooth setting. The methods we propose can in fact be seen as particular instances of stochastic mirror descent algorithms, which have usually been analyzed under stronger assumptions: Lipschitzness of the objective and strong convexity of the reference function. As a consequence, one of the proposed methods, relRCD, corresponds to the first stochastic variant of the mirror descent algorithm with a linear convergence rate.
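A classical instance of mirror descent with a non-Euclidean reference function is the exponentiated gradient method on the probability simplex; the sketch below uses an illustrative quadratic objective and step size, not the paper's algorithms or assumptions.

```python
import numpy as np

# Minimal sketch of mirror descent with the (negative) entropy reference
# function, i.e., exponentiated gradient on the probability simplex.
def exp_grad(grad_f, x0, step=0.5, iters=2000):
    x = x0.copy()
    for _ in range(iters):
        x = x * np.exp(-step * grad_f(x))   # mirror step in the dual space
        x /= x.sum()                        # Bregman projection onto the simplex
    return x

A = np.diag([1.0, 2.0, 4.0])
grad_f = lambda x: A @ x                    # f(x) = 0.5 x^T A x
x0 = np.full(3, 1.0 / 3.0)
x_star = exp_grad(grad_f, x0)               # KKT point: x_i proportional to 1/a_i
```

This `f` is smooth relative to the entropy reference function even though its gradient is not Lipschitz in the geometry of the simplex boundary, which is the kind of situation relative smoothness was introduced to handle.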
• #### Nonlinear Preconditioning Strategies for Two-Phase Flows in Porous Media Discretized by a Fully Implicit Discontinuous Galerkin Method

(SIAM Journal on Scientific Computing, Society for Industrial & Applied Mathematics (SIAM), 2021-06-09) [Article]
We consider numerical simulation of two-phase flows in porous media using implicit methods. Because of the complex features involving heterogeneous permeability and nonlinear capillary effects, the nonlinear algebraic systems arising from the discretization are very difficult to solve. The traditional Newton method suffers from slow convergence in the form of a long stagnation or sometimes does not converge at all. In this paper, we develop nonlinear preconditioning strategies for the system of two-phase flows discretized by a fully implicit discontinuous Galerkin method. The preconditioners identify and approximately eliminate the local high nonlinearities that cause the Newton method to take small updates. Specifically, we propose two elimination strategies: one is based on exploring the unbalanced nonlinearities of the pressure and the saturation fields, and the other is based on identifying certain elements of the finite element space that have much higher nonlinearities than the rest of the elements. We compare the performance and robustness of the proposed algorithms with an existing single-field elimination approach and the classical inexact Newton method with respect to some physical and numerical parameters. Experiments on three-dimensional porous media applications show that the proposed algorithms are superior to other methods in terms of robustness and parallel efficiency.
• #### KAUST Metagenomic Analysis Platform (KMAP), enabling access to massive analytics of re-annotated metagenomic data

(Scientific Reports, Springer Science and Business Media LLC, 2021-06-01) [Article]
The exponential rise of metagenomics sequencing is delivering massive functional environmental genomics data. However, this also generates a procedural bottleneck for ongoing re-analysis as reference databases grow and methods improve: analyses need to be updated for consistency, which requires access to increasingly demanding bioinformatic and computational resources. Here, we present the KAUST Metagenomic Analysis Platform (KMAP), a new integrated open web-based tool for the comprehensive exploration of shotgun metagenomic data. We illustrate the capacities KMAP provides through the re-assembly of ~27,000 public metagenomic samples captured in ~450 studies sampled across ~77 diverse habitats. A small subset of these metagenomic assemblies is used in this pilot study, grouped into 36 new habitat-specific gene catalogs, all based on full-length (complete) genes. Extensive taxonomic and gene annotations are stored in Gene Information Tables (GITs), a simple tractable data integration format useful for analysis through the command line or for database management. The KMAP pilot study provides the exploration and comparison of microbial GITs across different habitats, with over 275 million genes. KMAP access to data and analyses is available at https://www.cbrc.kaust.edu.sa/aamg/kmap.start.
• #### A review of matrix SIR Arino epidemic models

(arXiv, 2021-06-01) [Preprint]
Many of the models used nowadays in mathematical epidemiology, in particular in COVID-19 research, belong to a certain sub-class of compartmental models whose classes may be divided into three "(x, y, z)" groups, which we will call respectively "susceptible/entrance, diseased, and output" (in the classic SIR case, there is only one class of each type). Roughly, the ODE dynamics of these models contain only linear terms, with the exception of products between x and y terms. It has long been noticed that the basic reproduction number R has a very simple formula (3.3) in terms of the matrices which define the model, and an explicit first integral formula (3.8) is also available. These results can be traced back at least to [ABvdD+07] and [Fen07], respectively, and may be viewed as the "basic laws of SIR-type epidemics"; however, many papers continue to reprove them in particular instances (by the next-generation matrix method or by direct computations, which are unnecessary). This motivated us to draw renewed attention to these basic laws and provide a self-contained reference of related formulas for (x, y, z) models. We propose to rename the class to which they apply as matrix SIR epidemic models, abbreviated as SYR, to emphasize the similarity to the classic SIR case. For the case of one susceptible class, we propose to use the name SIR-PH, due to a simple probabilistic interpretation as SIR models where the exponential infection time has been replaced by a PH-type distribution. We note that to each SIR-PH model, one may associate a scalar quantity Y(t) which satisfies "classic SIR relations", see (3.8). In the case of several susceptible classes, this generalizes to (5.10); in a future paper, we will show that (3.8), (5.10) may be used to obtain approximate control policies which compare well with the optimal control of the original model.
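The next-generation principle underlying such matrix formulas for R can be sketched numerically for a standard SEIR example: R0 is the spectral radius of F V^{-1}, where F collects new-infection terms and V the transition terms (parameter values below are illustrative).

```python
import numpy as np

# Sketch of the "basic law" the review emphasizes: for compartmental
# models of this type, R0 is the spectral radius of the next-generation
# matrix F V^{-1}, shown here for a standard SEIR example.
beta, sigma, gamma = 0.6, 0.25, 0.2
F = np.array([[0.0, beta],
              [0.0, 0.0]])        # new infections enter the E class
V = np.array([[sigma, 0.0],
              [-sigma, gamma]])   # transitions out of / between E and I
R0 = max(abs(np.linalg.eigvals(F @ np.linalg.inv(V))))
```

For SEIR this collapses to the familiar beta/gamma, since every exposed individual eventually becomes infectious; the review's point is that the matrix formula covers the whole (x, y, z) class at once.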
• #### Unbiased Estimation of the Gradient of the Log-Likelihood for a Class of Continuous-Time State-Space Models

(arXiv, 2021-05-24) [Preprint]
In this paper, we consider static parameter estimation for a class of continuous-time state-space models. Our goal is to obtain an unbiased estimate of the gradient of the log-likelihood (score function), that is, an estimate that remains unbiased even if the stochastic processes involved in the model must be discretized in time. To achieve this goal, we apply a doubly randomized scheme that involves a novel coupled conditional particle filter (CCPF) on the second level of randomization. Our novel estimate helps facilitate the application of gradient-based estimation algorithms, such as stochastic-gradient Langevin descent. We illustrate our methodology in the context of stochastic gradient descent (SGD) in several numerical examples and compare with the Rhee & Glynn estimator.
• #### A rotated characteristic decomposition technique for high-order reconstructions in multi-dimensions

(arXiv, 2021-05-21) [Preprint]
When constructing high-order schemes for solving hyperbolic conservation laws, the corresponding high-order reconstructions are commonly performed in characteristic spaces to eliminate spurious oscillations as much as possible. For multi-dimensional finite volume (FV) schemes, we need to perform the characteristic decomposition several times in different normal directions of the target cell, which is very time-consuming. In this paper, we propose a rotated characteristic decomposition technique which requires only one-time decomposition for multi-dimensional reconstructions. The rotated direction depends only on the gradient of a specific physical quantity which is cheap to calculate. This technique not only reduces the computational cost remarkably, but also controls spurious oscillations effectively. We take a third-order weighted essentially non-oscillatory finite volume (WENO-FV) scheme for solving the Euler equations as an example to demonstrate the efficiency of the proposed technique.