Applied Mathematics and Computational Science Program
For more information visit: https://cemse.kaust.edu.sa/amcs
Recent Submissions

Decision trees based on 1-consequences (Discrete Applied Mathematics, Elsevier BV, 2021-07-20) [Article] In this paper, we study arbitrary infinite binary information systems, each of which consists of an infinite set of elements and an infinite set of two-valued non-constant functions (attributes) defined on the set of elements. We consider the notion of a problem over an information system, which is described by a finite number of attributes: for a given element, we should determine the values of these attributes. As algorithms for problem solving, we study decision trees that use arbitrary attributes from the considered infinite set of attributes and solve the problem based on 1-consequences. In such a tree, we take into account consequences each of which follows from one equation of the kind "attribute = value" obtained during the decision tree work, and ignore consequences that can be derived only from at least two equations. As time complexity, we study the depth of decision trees. We prove that in the worst case, with the growth of the number of attributes in the problem description, the minimum depth of decision trees based on 1-consequences grows either logarithmically or linearly.

Bayesian calibration of order and diffusivity parameters in a fractional diffusion equation (Journal of Physics Communications, IOP Publishing, 2021-07-15) [Article] This work focuses on parameter calibration of a variable-diffusivity fractional diffusion model. A random, spatially-varying diffusivity field with lognormal distribution is considered. The variance and correlation length of the diffusivity field are considered uncertain parameters, and the order of the fractional sub-diffusion operator is also taken to be uncertain and uniformly distributed in the range (0,1). A Karhunen-Loève (KL) decomposition of the random diffusivity field is used, leading to a stochastic problem defined in terms of a finite number of canonical random variables. Polynomial chaos (PC) techniques are used to express the dependence of the stochastic solution on these random variables. A non-intrusive methodology is used, and a deterministic finite-difference solver of the fractional diffusion model is utilized for this purpose. The PC surrogates are first used to assess the sensitivity of quantities of interest (QoIs) to the uncertain inputs and to examine their statistics. In particular, the analysis indicates that the fractional order has a dominant effect on the variance of the QoIs considered, followed by the leading KL modes. The PC surrogates are further exploited to calibrate the uncertain parameters using a Bayesian methodology. Different setups are considered, including distributed and localized forcing functions and data consisting of either noisy observations of the solution or its first moments. In the broad range of parameters addressed, the analysis shows that the uncertain parameters having a significant impact on the variance of the solution can be reliably inferred, even from limited observations.
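The non-intrusive PC idea above can be sketched in miniature: evaluate a (stand-in) deterministic solver at random input samples and fit an orthogonal polynomial surrogate by least squares. This is an illustrative sketch only; the toy `model` and all parameter values are assumptions, not the paper's fractional-diffusion solver.

```python
import numpy as np

def pc_surrogate(model, degree=10, n_train=200, seed=0):
    """Fit a 1D Legendre polynomial-chaos surrogate, by least squares,
    for a model with a single uniform input on [-1, 1]."""
    rng = np.random.default_rng(seed)
    xi = rng.uniform(-1.0, 1.0, n_train)      # training samples of the input
    y = model(xi)                             # "expensive" model evaluations
    # Vandermonde matrix of Legendre polynomials (orthogonal for U[-1,1])
    V = np.polynomial.legendre.legvander(xi, degree)
    coef, *_ = np.linalg.lstsq(V, y, rcond=None)
    return lambda x: np.polynomial.legendre.legval(x, coef)

# toy model standing in for the deterministic fractional-diffusion solver
model = lambda xi: np.exp(0.5 * xi) * np.sin(2 * xi)
surrogate = pc_surrogate(model)

# cheap Monte Carlo on the surrogate to estimate output statistics
xi_mc = np.random.default_rng(1).uniform(-1, 1, 100_000)
mean_est = surrogate(xi_mc).mean()
```

Once fitted, the surrogate replaces the solver inside sensitivity analysis or Bayesian sampling, which is what makes the calibration in the paper tractable.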

DES-Tcell is a knowledgebase for exploring immunology-related literature (Scientific Reports, Springer Science and Business Media LLC, 2021-07-12) [Article] T-cells are a subtype of white blood cells circulating throughout the body, searching for infected and abnormal cells. They have multifaceted functions that include scanning for and directly killing cells infected with intracellular pathogens, eradicating abnormal cells, orchestrating immune response by activating and helping other immune cells, memorizing encountered pathogens, and providing long-lasting protection upon recurrent infections. However, T-cells are also involved in immune responses that result in organ transplant rejection, autoimmune diseases, and some allergic diseases. To support T-cell research, we developed the DES-Tcell knowledgebase (KB). This KB incorporates text- and data-mined information that can expedite retrieval and exploration of T-cell-relevant information from the large volume of published T-cell-related research. The KB enables exploration of data through concepts from 15 topic-specific dictionaries, including immunology-related genes, mutations, pathogens, and pathways. We developed three case studies using DES-Tcell, one of which validates effective retrieval of known associations by DES-Tcell. The second and third case studies focus on concepts that are common to Graves' disease (GD) and Hashimoto's thyroiditis (HT). Several reports have shown that up to 20% of GD patients treated with antithyroid medication develop HT, thus suggesting a possible conversion or shift from GD to HT disease. DES-Tcell found that miR-4442 links to both GD and HT, and that miR-4442 possibly targets the autoimmune disease risk factor CD6, which provides potential new knowledge derived through the use of DES-Tcell. To the best of our understanding, DES-Tcell is the first KB dedicated to exploring T-cell-relevant information via literature mining, data mining, and topic-specific dictionaries.

Competition on Spatial Statistics for Large Datasets (Journal of Agricultural, Biological and Environmental Statistics, Springer Science and Business Media LLC, 2021-07-08) [Article] As spatial datasets are becoming increasingly large and unwieldy, exact inference on spatial models becomes computationally prohibitive. Various approximation methods have been proposed to reduce the computational burden. Although comprehensive reviews of these approximation methods exist, comparisons of their performance are limited to small and medium sizes of datasets for a few selected methods. To achieve a comprehensive comparison comprising as many methods as possible, we organized the Competition on Spatial Statistics for Large Datasets. This competition had the following novel features: (1) we generated synthetic datasets with the ExaGeoStat software so that the number of generated realizations ranged from 100 thousand to 1 million; (2) we systematically designed the data-generating models to represent spatial processes with a wide range of statistical properties for both Gaussian and non-Gaussian cases; (3) the competition tasks included both estimation and prediction, and the results were assessed by multiple criteria; and (4) we have made all the datasets and competition results publicly available to serve as a benchmark for other approximation methods. In this paper, we disclose all the competition details and results along with some analysis of the competition outcomes.
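A minimal version of the kind of data generation described above is exact simulation of a stationary Gaussian random field via a dense covariance matrix and its Cholesky factor; this is the step whose cubic cost motivates the approximation methods the competition benchmarks. The exponential covariance and grid size here are illustrative assumptions, not the competition's actual ExaGeoStat setup.

```python
import numpy as np

def sample_gaussian_field(n=20, length_scale=0.3, seed=0):
    """Sample a zero-mean Gaussian random field on an n x n unit-square grid
    with exponential covariance C(h) = exp(-h / length_scale), by exact
    (dense Cholesky) simulation. Cost is O(N^3) in N = n*n points, which is
    why large-scale generators and approximate inference methods exist."""
    x = np.linspace(0, 1, n)
    X, Y = np.meshgrid(x, x)
    pts = np.column_stack([X.ravel(), Y.ravel()])
    # pairwise distances and the full covariance matrix
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    C = np.exp(-d / length_scale)
    L = np.linalg.cholesky(C + 1e-10 * np.eye(len(C)))  # jitter for stability
    z = np.random.default_rng(seed).standard_normal(len(C))
    return (L @ z).reshape(n, n)

field = sample_gaussian_field()
```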

Accelerating Seismic Redatuming Using Tile Low-Rank Approximations on NEC SX-Aurora TSUBASA (Submitted to SUPERCOMPUTING FRONTIERS AND INNOVATIONS, Submitted to South Ural State University (Chelyabinsk, Russia), 2021-07-07) [Preprint] With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of the Earth must be numerically repositioned at locations in the subsurface where reflections have originated, a process generally referred to as redatuming by the geophysical community. Historically, this process has been carried out by numerically time-reversing the data recorded along an open boundary of surface receivers into the subsurface. Despite its simplicity, such an approach is only able to handle seismic energy from primary arrivals (i.e., waves that interact only once with the medium discontinuities), failing to explain multi-scattering in the subsurface. As a result, seismic images are contaminated by artificial reflectors if data are not preprocessed prior to imaging such that multiples are removed from the data. In the last decade, a novel family of methods has emerged under the name of Marchenko redatuming; such methods allow for accurate redatuming of the full-wavefield recorded seismic data, including multiple arrivals. This is achieved by solving an inverse problem, whose adjoint modeling can be shown to be equivalent to the standard single-scattering redatuming method for primary-only data. A downside of this application is that the so-called multidimensional convolution operator must be repeatedly evaluated as part of the inversion. Such an operator requires the application of multiple dense matrix-vector multiplications (MVM), which represent the most time-consuming operations in the forward and adjoint processes. We identify and leverage the data sparsity structure of each of the frequency matrices during the MVM operation, and propose to accelerate the MVM step using tile low-rank (TLR) matrix approximations. We study the TLR impact on time-to-solution for the MVM using different accuracy thresholds, whilst at the same time assessing the quality of the resulting subsurface seismic wavefields, and show that TLR leads to a minimal degradation in terms of signal-to-noise ratio on a 3D synthetic dataset. We mitigate the load imbalance overhead and provide performance evaluation on two distributed-memory systems. Our MPI+OpenMP TLR-MVM implementation reaches up to 3X performance speedup against the dense MVM counterpart from the NEC scientific library on 128 NEC SX-Aurora TSUBASA cards. Thanks to the second generation of high bandwidth memory technology, it further attains up to 67X performance speedup (i.e., 110 TB/s) compared to the dense MVM from Intel MKL when running on 128 dual-socket 20-core Intel Cascade Lake nodes with DDR4 memory, without deteriorating the quality of the reconstructed seismic wavefields.
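The TLR idea can be sketched in a few lines: split the matrix into tiles, compress each tile with a truncated SVD at a chosen accuracy threshold, then perform the MVM tile by tile through the low-rank factors. This is a serial numpy sketch under assumed sizes and a toy kernel matrix, not the paper's MPI+OpenMP implementation.

```python
import numpy as np

def tlr_compress(A, tile=64, tol=1e-8):
    """Compress a matrix tile-by-tile with truncated SVDs (tile low-rank).
    Returns per-tile factors (U, V) such that each tile ~= U @ V."""
    tiles = {}
    for i in range(0, A.shape[0], tile):
        for j in range(0, A.shape[1], tile):
            blk = A[i:i+tile, j:j+tile]
            U, s, Vt = np.linalg.svd(blk, full_matrices=False)
            r = max(1, int(np.sum(s > tol * s[0])))  # numerical rank at tol
            tiles[(i, j)] = (U[:, :r] * s[:r], Vt[:r, :])
    return tiles

def tlr_mvm(tiles, x, n):
    """Matrix-vector product applied through the compressed tiles."""
    y = np.zeros(n)
    for (i, j), (U, V) in tiles.items():
        y[i:i+U.shape[0]] += U @ (V @ x[j:j+V.shape[1]])
    return y

# a smooth kernel matrix, whose off-diagonal tiles have low numerical rank
rng = np.random.default_rng(0)
p = np.sort(rng.uniform(0, 1, 256))
A = np.exp(-np.abs(p[:, None] - p[None, :]))
x = rng.standard_normal(256)
tiles = tlr_compress(A)
y = tlr_mvm(tiles, x, 256)
```

Loosening `tol` trades MVM accuracy for lower ranks and faster products, which is exactly the accuracy/time trade-off the paper measures.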

Reference Tracking and Observer Design for Space-Fractional Partial Differential Equation Modeling Gas Pressures in Fractured Media (arXiv, 2021-07-05) [Preprint] This paper considers a class of space-fractional partial differential equations (FPDEs) that describe gas pressures in fractured media. First, the well-posedness, uniqueness, and stability in $L_\infty(\mathbb{R})$ of the considered FPDEs are investigated. Then, the reference tracking problem is studied to track the pressure gradient at a downstream location of a channel. This requires manipulation of gas pressure at the downstream location and the use of pressure measurements at an upstream location. To achieve this, the backstepping approach is adapted to the space FPDEs. The key challenge in this adaptation is the non-applicability of the Lyapunov theory, which is typically used to prove the stability of the target system, as the obtained target system is fractional in space. In addition, a backstepping adaptive observer is designed to jointly estimate both the system's state and the disturbance. The stability of the closed loop (reference tracking controller/observer) is also investigated. Finally, numerical simulations are given to evaluate the efficiency of the proposed method.

Existence of weak solutions to time-dependent mean-field games (Nonlinear Analysis, Elsevier BV, 2021-06-30) [Article] Here, we establish the existence of weak solutions to a wide class of time-dependent monotone mean-field games (MFGs). These MFGs are given as a system of degenerate parabolic equations with initial and terminal conditions. To construct these solutions, we consider a high-order elliptic regularization in space-time. Then, applying Schaefer's fixed-point theorem, we obtain the existence and uniqueness for this regularized problem. Using Minty's method, we prove the existence of a weak solution to the original MFG. Finally, the paper ends with a discussion of congestion problems and density-constrained MFGs.

Optimization of Decision Trees with Hypotheses for Knowledge Representation (Electronics, MDPI AG, 2021-06-30) [Article] In this paper, we consider decision trees that use two types of queries: queries based on one attribute each, and queries based on hypotheses about the values of all attributes. Such decision trees are similar to those studied in exact learning, where membership and equivalence queries are allowed. We present dynamic programming algorithms for minimization of the depth and the number of nodes of the above decision trees, and discuss the results of computer experiments on various data sets and randomly generated Boolean functions. Decision trees with hypotheses generally have lower complexity, i.e., they are more understandable and more suitable as a means for knowledge representation.

A second-order accurate numerical scheme for a time-fractional Fokker-Planck equation (arXiv, 2021-06-27) [Preprint] A time-stepping $L1$ scheme for solving a time-fractional Fokker-Planck equation of order $\alpha \in (0, 1)$, with a general driving force, is investigated. A stability bound for the semidiscrete solution is obtained for $\alpha\in(1/2,1)$ via a novel and concise approach. Our stability estimate is $\alpha$-robust in the sense that it remains valid in the limiting case where $\alpha$ approaches $1$ (when the model reduces to the classical Fokker-Planck equation), a limit of practical importance. Concerning the error analysis, we obtain an optimal second-order accurate estimate for $\alpha\in(1/2,1)$. A time-graded mesh is used to compensate for the singular behavior of the continuous solution near the origin. The $L1$ scheme is combined with a standard spatial Galerkin finite element discretization to numerically support our theoretical contributions. We employ the resulting fully discrete computable numerical scheme to perform some numerical tests. These tests suggest that the imposed time-graded mesh assumption could be further relaxed, and we observe second-order accuracy even for the case $\alpha\in(0,1/2]$, that is, outside the range covered by the theory.
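The two ingredients named above, the L1 approximation of the Caputo derivative and a time-graded mesh, can be sketched directly. This is a generic L1 quadrature on a nonuniform mesh checked against a known Caputo derivative; the grading exponent and test function are illustrative assumptions, not the paper's scheme for the full Fokker-Planck equation.

```python
import math
import numpy as np

def l1_caputo(u, t, alpha):
    """L1 approximation of the Caputo derivative D^alpha u at the final
    time of a (possibly graded) mesh t[0..N], given samples u[0..N]:
    piecewise-linear interpolation of u inside the fractional integral."""
    N = len(t) - 1
    tn = t[-1]
    acc = 0.0
    for k in range(1, N + 1):
        # exact kernel integral over the k-th subinterval
        w = ((tn - t[k-1])**(1-alpha) - (tn - t[k])**(1-alpha)) / (t[k] - t[k-1])
        acc += w * (u[k] - u[k-1])
    return acc / math.gamma(2 - alpha)

alpha = 0.5
N, r = 400, 2.0 / alpha                 # grading exponent r = 2/alpha (typical choice)
t = (np.arange(N + 1) / N) ** r         # graded mesh on [0, 1], clustered near 0
u = t**2                                # test function u(t) = t^2
approx = l1_caputo(u, t, alpha)
exact = 2 * 1.0**(2 - alpha) / math.gamma(3 - alpha)  # Caputo D^alpha t^2 at t = 1
```

Grading clusters time steps near $t=0$, where solutions of time-fractional problems typically have weak singularities, which is what restores the optimal convergence order.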

Optimal control of an SIR epidemic through finite-time non-pharmaceutical intervention (Journal of Mathematical Biology, Springer Science and Business Media LLC, 2021-06-26) [Article] We consider the problem of controlling an SIR-model epidemic by temporarily reducing the rate of contact within a population. The control takes the form of a multiplicative reduction in the contact rate of infectious individuals. The control is allowed to be applied only over a finite time interval, while the objective is to minimize the total number of individuals infected in the long-time limit, subject to some cost function for the control. We first consider the no-cost scenario and analytically determine the optimal control and solution. We then study solutions when a cost of intervention is included, as well as a cost associated with overwhelming the available medical resources. Examples are studied through the numerical solution of the associated Hamilton-Jacobi-Bellman equation. Finally, we provide some examples related directly to the current pandemic.
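The controlled model above is easy to simulate directly: an SIR system in which the contact rate is multiplied by a reduction factor only inside a finite window. All parameter values below are illustrative assumptions, and the simple forward-Euler integration is a sketch, not the paper's optimal-control solver.

```python
def sir_final_susceptibles(beta=0.4, gamma=0.1, control=0.5,
                           t_on=5.0, t_off=40.0, T=400.0, dt=0.01):
    """Integrate an SIR model (forward Euler) in which the contact rate is
    multiplied by `control` on the finite window [t_on, t_off]; return the
    long-time susceptible fraction S(T). Larger S(T) means fewer total
    infections, the quantity the control aims to improve."""
    S, I = 0.99, 0.01
    for step in range(int(T / dt)):
        t = step * dt
        b = beta * control if t_on <= t <= t_off else beta
        dS = -b * S * I
        dI = b * S * I - gamma * I
        S += dt * dS
        I += dt * dI
    return S

S_no_control = sir_final_susceptibles(control=1.0)   # no intervention
S_control = sir_final_susceptibles(control=0.5)      # halved contact rate
```

Comparing the two runs shows the basic effect the paper optimizes over: a finite-time contact reduction leaves more of the population uninfected in the long-time limit.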

Entropy-Based Greedy Algorithm for Decision Trees Using Hypotheses (Entropy, MDPI AG, 2021-06-25) [Article] In this paper, we consider decision trees that use both conventional queries, based on one attribute each, and queries based on hypotheses about the values of all attributes. Such decision trees are similar to those studied in exact learning, where membership and equivalence queries are allowed. We present a greedy algorithm based on entropy for the construction of the above decision trees and discuss the results of computer experiments on various data sets and randomly generated Boolean functions.
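The entropy-driven greedy step at the core of such algorithms can be sketched for the conventional (single-attribute) query type: at each node, pick the attribute whose split gives the largest entropy reduction. This ID3-style sketch on a toy dataset is illustrative only and omits the paper's hypothesis queries.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels):
    """Greedy step: pick the attribute whose split yields the largest
    information gain (entropy reduction) over the current examples."""
    base = entropy(labels)
    best, best_gain = None, -1.0
    for a in range(len(rows[0])):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        if base - remainder > best_gain:
            best, best_gain = a, base - remainder
    return best, best_gain

# attribute 0 determines the label exactly; attribute 1 is pure noise
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
attr, gain = best_attribute(rows, labels)
```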

Uncertainty Quantification and Bayesian Inference of Cloud Parameterization in the NCAR Single Column Community Atmosphere Model (SCAM6) (Frontiers in Climate, Frontiers, 2021-06-16) [Article] Uncertainty quantification (UQ) in weather and climate models is required to assess the sensitivity of their outputs to various parameterization schemes and thereby improve their consistency with observations. Herein, we present an efficient UQ and Bayesian inference for the cloud parameters of the NCAR Single Column Atmosphere Model (SCAM6) using surrogate models based on a polynomial chaos expansion. The use of a surrogate model enables efficient propagation of parameter uncertainties into uncertainties in model outputs. We investigated eight uncertain parameters: the autoconversion size threshold for ice to snow (dcs), the fall speed parameter for stratiform cloud ice (ai), the fall speed parameter for stratiform snow (as), the fall speed parameter for cloud water (ac), the collection efficiency of aggregation ice (eii), the efficiency factor of the Bergeron effect (berg_eff), the threshold maximum relative humidity for ice clouds (rhmaxi), and the threshold minimum relative humidity for ice clouds (rhmini). We built two surrogate models using two non-intrusive methods: spectral projection (SP) and basis pursuit denoising (BPDN). Our results suggest that BPDN performs better than SP, as it enables filtering out internal noise during the process of fitting the surrogate model. Five out of the eight parameters (namely dcs, ai, rhmaxi, rhmini, and eii) account for most of the variance in predicted climate variables (e.g., total precipitation, cloud distribution, shortwave and longwave cloud forcing, ice and liquid water path). A first-order sensitivity analysis reveals that dcs contributes approximately 40–80% of the total variance of the climate variables, ai around 15–30%, and rhmaxi, rhmini, and eii around 5–15%. The second- and higher-order effects contribute approximately 20% and 11%, respectively. The sensitivity of the model to these parameters was further explored using response curves. A Markov chain Monte Carlo (MCMC) sampling algorithm was also implemented for the Bayesian inference of dcs, ai, as, rhmini, and berg_eff using cloud distribution data collected at the Southern Great Plains (USA). Our study has implications for enhancing our understanding of the physical mechanisms associated with cloud processes leading to uncertainty in model simulations and further helps to improve the models used for their assessment.
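The MCMC step mentioned above can be illustrated with the simplest such sampler, a random-walk Metropolis chain on a toy calibration problem. Everything here (the Gaussian toy likelihood, flat prior, step size) is an assumption for illustration; the paper's inference targets SCAM6 cloud parameters through PC surrogates.

```python
import numpy as np

def metropolis(logpost, x0, n_steps=20_000, prop_std=0.2, seed=0):
    """Random-walk Metropolis sampler: propose a Gaussian step, accept
    with probability min(1, posterior ratio), and record the chain."""
    rng = np.random.default_rng(seed)
    x, lp = x0, logpost(x0)
    chain = []
    for _ in range(n_steps):
        xp = x + prop_std * rng.standard_normal()
        lpp = logpost(xp)
        if np.log(rng.uniform()) < lpp - lp:   # accept/reject
            x, lp = xp, lpp
        chain.append(x)
    return np.array(chain)

# toy calibration: infer the mean of noisy data under a flat prior
data = np.random.default_rng(1).normal(2.0, 0.3, 50)
logpost = lambda m: -0.5 * np.sum((data - m) ** 2) / 0.3**2
chain = metropolis(logpost, x0=0.0)
posterior_mean = chain[5000:].mean()   # discard burn-in
```

With a surrogate in place of the climate model, each `logpost` evaluation becomes cheap, which is what makes chains of this length affordable.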

Machine learning-based conditional mean filter: a generalization of the ensemble Kalman filter for nonlinear data assimilation (arXiv, 2021-06-15) [Preprint] Filtering is a data assimilation technique that performs the sequential inference of dynamical systems states from noisy observations. Herein, we propose a machine learning-based ensemble conditional mean filter (ML-EnCMF) for tracking possibly high-dimensional non-Gaussian state models with nonlinear dynamics based on sparse observations. The proposed filtering method is developed based on the conditional expectation and numerically implemented using machine learning (ML) techniques combined with the ensemble method. The contribution of this work is twofold. First, we demonstrate that the ensembles assimilated using the ensemble conditional mean filter (EnCMF) provide an unbiased estimator of the Bayesian posterior mean, and their variance matches the expected conditional variance. Second, we implement the EnCMF using artificial neural networks, which have a significant advantage in representing nonlinear functions over high-dimensional domains, such as the conditional mean. Finally, we demonstrate the effectiveness of the ML-EnCMF for tracking the states of the Lorenz-63 and Lorenz-96 systems under the chaotic regime. Numerical results show that the ML-EnCMF outperforms the ensemble Kalman filter.
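For context, the baseline that the ML-EnCMF generalizes, the stochastic ensemble Kalman filter analysis step, can be sketched in a few lines: each member is updated with a perturbed observation using the sample Kalman gain. The 1D example and its numbers are illustrative assumptions, not the paper's Lorenz experiments.

```python
import numpy as np

def enkf_update(ens, y_obs, H, obs_std, seed=0):
    """Stochastic ensemble Kalman filter analysis step: update each member
    with a perturbed observation using the sample Kalman gain."""
    rng = np.random.default_rng(seed)
    X = np.atleast_2d(ens)                 # shape (n_members, n_state)
    Y = X @ H.T                            # predicted observations
    Xa = X - X.mean(axis=0)
    Ya = Y - Y.mean(axis=0)
    n = len(X)
    Pyy = Ya.T @ Ya / (n - 1) + obs_std**2 * np.eye(H.shape[0])
    Pxy = Xa.T @ Ya / (n - 1)
    K = Pxy @ np.linalg.inv(Pyy)           # sample Kalman gain
    perturbed = y_obs + obs_std * rng.standard_normal((n, H.shape[0]))
    return X + (perturbed - Y) @ K.T

# 1D state, direct observation: forecast ~ N(0,1), observe y = 1.0, std 0.5
H = np.eye(1)
forecast = np.random.default_rng(2).standard_normal((500, 1))
analysis = enkf_update(forecast, np.array([1.0]), H, obs_std=0.5)
```

The EnKF gain is a linear map of the innovation; the EnCMF replaces this linear update with a learned conditional-mean map, which is where the neural networks enter.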

Fastest rates for stochastic mirror descent methods (Computational Optimization and Applications, Springer Science and Business Media LLC, 2021-06-09) [Article] Relative smoothness, a notion introduced in Birnbaum et al. (Proceedings of the 12th ACM conference on electronic commerce, ACM, pp 127–136, 2011) and recently rediscovered in Bauschke et al. (Math Oper Res 330–348, 2016) and Lu et al. (Relatively-smooth convex optimization by first-order methods, and applications, arXiv:1610.05708, 2016), generalizes the standard notion of smoothness typically used in the analysis of gradient-type methods. In this work, we take ideas from the well-studied field of stochastic convex optimization and use them to obtain faster algorithms for minimizing relatively smooth functions. We propose and analyze two new algorithms: Relative Randomized Coordinate Descent (relRCD) and Relative Stochastic Gradient Descent (relSGD), both generalizing famous algorithms in the standard smooth setting. The methods we propose can in fact be seen as particular instances of stochastic mirror descent algorithms, which have usually been analyzed under stronger assumptions: Lipschitzness of the objective and strong convexity of the reference function. As a consequence, one of the proposed methods, relRCD, corresponds to the first stochastic variant of the mirror descent algorithm with a linear convergence rate.
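A concrete instance of the mirror descent family discussed above is the exponentiated gradient method, i.e., mirror descent with the entropy mirror map on the probability simplex. The quadratic objective and step size below are illustrative assumptions used only to show one mirror step.

```python
import numpy as np

def exp_gradient_step(x, grad, lr):
    """One mirror descent step with the entropy mirror map (exponentiated
    gradient): multiplicative update followed by renormalization, which
    keeps the iterate on the probability simplex."""
    y = x * np.exp(-lr * grad)
    return y / y.sum()

# minimize f(x) = 0.5 * ||x - c||^2 over the simplex (c lies in the simplex)
c = np.array([0.7, 0.2, 0.1])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
x = np.full(3, 1 / 3)
vals = [f(x)]
for _ in range(200):
    x = exp_gradient_step(x, x - c, lr=0.5)   # gradient of f is x - c
    vals.append(f(x))
```

Here f is 1-relatively smooth with respect to the entropy reference function (via Pinsker's inequality), so a constant step size works without any Lipschitz assumption on f, which is the regime the paper's analysis targets.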

Nonlinear Preconditioning Strategies for Two-Phase Flows in Porous Media Discretized by a Fully Implicit Discontinuous Galerkin Method (SIAM Journal on Scientific Computing, Society for Industrial & Applied Mathematics (SIAM), 2021-06-09) [Article] We consider numerical simulation of two-phase flows in porous media using implicit methods. Because of the complex features involving heterogeneous permeability and nonlinear capillary effects, the nonlinear algebraic systems arising from the discretization are very difficult to solve. The traditional Newton method suffers from slow convergence in the form of a long stagnation, or sometimes does not converge at all. In this paper, we develop nonlinear preconditioning strategies for the system of two-phase flows discretized by a fully implicit discontinuous Galerkin method. The preconditioners identify and approximately eliminate the local high nonlinearities that cause the Newton method to take small updates. Specifically, we propose two elimination strategies: one is based on exploring the unbalanced nonlinearities of the pressure and the saturation fields, and the other is based on identifying certain elements of the finite element space that have much higher nonlinearities than the rest of the elements. We compare the performance and robustness of the proposed algorithms with an existing single-field elimination approach and the classical inexact Newton method with respect to some physical and numerical parameters. Experiments on three-dimensional porous media applications show that the proposed algorithms are superior to other methods in terms of robustness and parallel efficiency.
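The elimination idea described above can be sketched on a tiny scale: remove a strongly nonlinear unknown by an inner solve, then run an outer Newton iteration on the reduced residual. This two-equation toy (with finite-difference derivatives) is only a sketch of the nonlinear-elimination principle; the toy residuals `f1`, `f2` are assumptions, not the paper's two-phase flow system.

```python
import numpy as np

def newton_elimination(f1, f2, x0, y0, tol=1e-10, h=1e-7):
    """Nonlinear elimination sketch: the strongly nonlinear unknown y is
    eliminated by an inner scalar solve f2(x, y) = 0, and an outer Newton
    iteration (finite-difference derivative) runs on the reduced residual
    g(x) = f1(x, y(x))."""
    def inner(x, y):
        for _ in range(100):                 # scalar Newton for y(x)
            fy = (f2(x, y + h) - f2(x, y)) / h
            y -= f2(x, y) / fy
            if abs(f2(x, y)) < tol:
                break
        return y

    x, y = x0, y0
    for _ in range(50):                      # outer Newton on g(x)
        y = inner(x, y)
        g = f1(x, y)
        if abs(g) < tol:
            break
        dg = (f1(x + h, inner(x + h, y)) - g) / h
        x -= g / dg
    return x, y

# mildly nonlinear f1, strongly nonlinear f2 (illustrative residuals)
f1 = lambda x, y: x + y - 2.0
f2 = lambda x, y: np.exp(5.0 * y) - x - 1.0
x, y = newton_elimination(f1, f2, x0=1.0, y0=0.0)
```

Solving the "bad" component first means the outer Newton iteration sees a much better-behaved residual, which is why such strategies avoid the long stagnation of plain Newton.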

KAUST Metagenomic Analysis Platform (KMAP), enabling access to massive analytics of re-annotated metagenomic data (Scientific Reports, Springer Science and Business Media LLC, 2021-06-01) [Article] The exponential rise of metagenomics sequencing is delivering massive functional environmental genomics data. However, this also generates a procedural bottleneck for ongoing re-analysis: as reference databases grow and methods improve, analyses need to be updated for consistency, which requires access to increasingly demanding bioinformatic and computational resources. Here, we present the KAUST Metagenomic Analysis Platform (KMAP), a new integrated open web-based tool for the comprehensive exploration of shotgun metagenomic data. We illustrate the capacities KMAP provides through the re-assembly of ~27,000 public metagenomic samples captured in ~450 studies sampled across ~77 diverse habitats. A small subset of these metagenomic assemblies is used in this pilot study, grouped into 36 new habitat-specific gene catalogs, all based on full-length (complete) genes. Extensive taxonomic and gene annotations are stored in Gene Information Tables (GITs), a simple tractable data integration format useful for analysis through the command line or for database management. The KMAP pilot study provides the exploration and comparison of microbial GITs across different habitats, with over 275 million genes. KMAP access to data and analyses is available at https://www.cbrc.kaust.edu.sa/aamg/kmap.start.

A review of matrix SIR Arino epidemic models (arXiv, 2021-06-01) [Preprint] Many of the models used nowadays in mathematical epidemiology, in particular in COVID-19 research, belong to a certain subclass of compartmental models whose classes may be divided into three "(x, y, z)" groups, which we will call respectively "susceptible/entrance, diseased, and output" (in the classic SIR case, there is only one class of each type). Roughly, the ODE dynamics of these models contain only linear terms, with the exception of products between x and y terms. It has long been noticed that the basic reproduction number R has a very simple formula (3.3) in terms of the matrices which define the model, and an explicit first integral formula (3.8) is also available. These results can be traced back at least to [ABvdD+07] and [Fen07], respectively, and may be viewed as the "basic laws of SIR-type epidemics"; however, many papers continue to reprove them in particular instances (by the next-generation matrix method or by direct computations, which are unnecessary). This motivated us to redraw attention to these basic laws and provide a self-contained reference of related formulas for (x, y, z) models. We propose to rebaptize the class to which they apply as matrix SIR epidemic models, abbreviated as SYR, to emphasize the similarity to the classic SIR case. For the case of one susceptible class, we propose to use the name SIR-PH, due to a simple probabilistic interpretation as SIR models where the exponential infection time has been replaced by a PH-type distribution. We note that to each SIR-PH model, one may associate a scalar quantity Y(t) which satisfies "classic SIR relations", see (3.8). In the case of several susceptible classes, this generalizes to (5.10); in a future paper, we will show that (3.8), (5.10) may be used to obtain approximate control policies which compare well with the optimal control of the original model.
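The matrix formula for the basic reproduction number alluded to above can be sketched via the standard next-generation construction: R is the spectral radius of F V^{-1}, where F collects new-infection rates and V the transitions of the diseased compartments. The SEIR instance and its parameter values below are illustrative assumptions.

```python
import numpy as np

def r0_next_generation(F, V):
    """Basic reproduction number as the spectral radius of F V^{-1}:
    F holds new-infection rates into the diseased classes, V their
    transition/removal rates (standard next-generation construction)."""
    return max(abs(np.linalg.eigvals(F @ np.linalg.inv(V))))

# SEIR example, diseased classes (E, I): sigma = progression, gamma = recovery
beta, sigma, gamma = 0.6, 0.2, 0.25
F = np.array([[0.0, beta],
              [0.0, 0.0]])        # new infections enter E at rate beta * I
V = np.array([[sigma, 0.0],
              [-sigma, gamma]])   # E -> I progression, I removal
R0 = r0_next_generation(F, V)     # analytically beta / gamma for SEIR
```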

Unbiased Estimation of the Gradient of the Log-Likelihood for a Class of Continuous-Time State-Space Models (arXiv, 2021-05-24) [Preprint] In this paper, we consider static parameter estimation for a class of continuous-time state-space models. Our goal is to obtain an unbiased estimate of the gradient of the log-likelihood (score function), that is, an estimate that is unbiased even if the stochastic processes involved in the model must be discretized in time. To achieve this goal, we apply a doubly randomized scheme (see, e.g., [ub_mcmc, ub_grad]) that involves a novel coupled conditional particle filter (CCPF) on the second level of randomization [jacob2]. Our novel estimate helps facilitate the application of gradient-based estimation algorithms, such as stochastic-gradient Langevin descent. We illustrate our methodology in the context of stochastic gradient descent (SGD) in several numerical examples and compare with the Rhee & Glynn estimator [rhee, vihola].
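The randomization idea behind such debiasing schemes can be illustrated with the simplest single-term estimator in the Rhee & Glynn spirit: draw a random truncation level and reweight the corresponding increment so the estimator is unbiased for the limit of the sequence. The geometric level distribution and the toy sequence below are assumptions chosen so the answer is known in closed form; the paper's scheme layers this idea over particle filters.

```python
import numpy as np

def single_term_estimator(increments, rng):
    """Single-term randomized-truncation estimator: draw a geometric level
    N and return Delta_N / P(N = n), an unbiased estimate of
    sum_n Delta_n, i.e., of the limit of the underlying sequence."""
    N = rng.geometric(0.5) - 1            # P(N = n) = 0.5**(n+1), n >= 0
    return increments(N) / 0.5**(N + 1)

# target: limit of S_n = 1 - 3**(-n); increments Delta_0 = 0,
# Delta_n = S_n - S_{n-1} = 2 * 3**(-n), which sum to 1
increments = lambda n: 0.0 if n == 0 else 2.0 * 3.0**(-n)
rng = np.random.default_rng(0)
samples = [single_term_estimator(increments, rng) for _ in range(100_000)]
estimate = np.mean(samples)
```

The key trade-off is that the increments must shrink fast enough relative to the level probabilities for the estimator to have finite variance, which holds in this toy since (2/3)^n decays faster than 2^{-n} grows under the reweighting squared.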

A rotated characteristic decomposition technique for high-order reconstructions in multi-dimensions (arXiv, 2021-05-21) [Preprint] When constructing high-order schemes for solving hyperbolic conservation laws, the corresponding high-order reconstructions are commonly performed in characteristic spaces to eliminate spurious oscillations as much as possible. For multi-dimensional finite volume (FV) schemes, we need to perform the characteristic decomposition several times in different normal directions of the target cell, which is very time-consuming. In this paper, we propose a rotated characteristic decomposition technique which requires only a one-time decomposition for multi-dimensional reconstructions. The rotated direction depends only on the gradient of a specific physical quantity, which is cheap to calculate. This technique not only reduces the computational cost remarkably, but also controls spurious oscillations effectively. We take a third-order weighted essentially non-oscillatory finite volume (WENO-FV) scheme for solving the Euler equations as an example to demonstrate the efficiency of the proposed technique.
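The reconstruction building block named above, a third-order WENO interface value from three cell averages, can be sketched directly. This standard scalar WENO3 formula is illustrative context only; the paper's contribution is the rotated characteristic decomposition applied before such reconstructions, which this sketch omits.

```python
def weno3_reconstruct(um1, u0, up1, eps=1e-6):
    """Third-order WENO reconstruction of the interface value u_{i+1/2}
    from cell averages u_{i-1}, u_i, u_{i+1}: blend two linear candidate
    stencils with nonlinear weights driven by smoothness indicators."""
    # two candidate linear reconstructions
    p0 = -0.5 * um1 + 1.5 * u0        # stencil {i-1, i}
    p1 = 0.5 * u0 + 0.5 * up1         # stencil {i, i+1}
    # smoothness indicators and nonlinear (WENO) weights
    b0 = (u0 - um1) ** 2
    b1 = (up1 - u0) ** 2
    a0 = (1.0 / 3.0) / (eps + b0) ** 2
    a1 = (2.0 / 3.0) / (eps + b1) ** 2
    return (a0 * p0 + a1 * p1) / (a0 + a1)

# linear data u(x) = x with unit spacing: the reconstruction is exact,
# giving the interface value at x = 0.5
val = weno3_reconstruct(-1.0, 0.0, 1.0)
```

Near a discontinuity, one smoothness indicator blows up and its stencil's weight collapses, which is how the scheme suppresses spurious oscillations; for systems this is applied to characteristic variables, hence the cost the paper's rotated decomposition reduces.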