Comparing theory based and higher-order reduced models for fusion simulation data

We consider using regression to fit a theory-based log-linear ansatz, as well as higher order approximations, for the thermal energy confinement of a Tokamak as a function of device features. We use general linear models based on total order polynomials, as well as deep neural networks. The results indicate that the theory-based model fits the data almost as well as the more sophisticated machines, within the support of the data set. The conclusion we arrive at is that only negligible improvements can be made to the theoretical model, for input data of this type.


Setup
Assume we have a set of labelled data points, or training data pairs D = {x (i) , y (i) } N i=1 , where x (i) ∈ R K and y (i) ∈ R + .
We postulate amongst a family of ansatz f (·; θ), parameterized by θ ∈ R P , that if we can find a θ * such that for all i = 1, . . . , N, then for all x within the convex hull of the set {x (i) } N i=1 , and ideally in a high-probability region over training data distribution, we will have where y is the true output corresponding to input x . In particular, the hope is that if we derive f based on physical theory, then in fact it will hold that even for x different from the set {x (i) } N i=1 . Here we consider parametric approximations, given by generalized linear models (GLM), in particular polynomial models, as well as nonlinear models (NM) given by neural networks (NN).
We will consider the problem of finding the θ * which minimizes data misfit for some distance function . Here we use the standard choice (·) := | · | 2 := · 2 2 . It makes sense to interpret this as the log-likelihood of the data given the parameter, as a function of the parameter.

Physical setup and theory-based model
In recent work [9], a theory-based scaling of thermal energy confinement time has been derived based on a comprehensive turbulent transport model trapped gyrokinetic Landau fluid (TGLF) [13] in core coupled to the edge pedestal (EPED) [12] model, especially in burning plasma conditions with dominant fusion alpha particle heating for future reactor design. The simulation dataset consists of a massive number of predictive Integrated Plasma Simulator (IPS) FASTRAN [8] simulations, selfconsistent with core transport, edge pedestal, fusion alpha particle heating, and MHD equilibrium, built upon a modern integrated modeling framework, IPS.
For input features x (i) ∈ R K , i = 1, . . . , N, the theoretical relationship with the output thermal energy confinement time y (i) ∈ R + , is given by the following ansatz or on a logarithmic scale (redefining log θ 0 → θ 0 )

Parametric regression
Parametric models afford the luxury of discarding the data set after training, and reduce subsequent evaluations of the model to a cost relating to the number of parameters only [1,7,10]. However, these models are much more rigid than the non-parametric ones, by virtue of restricting to a particular parametric family.

General linear models
The simplest type of parametrization is in terms of coefficients of expansion in some basis {ψ i } P−1 i=0 . Here we choose appropriately ordered monomials with total degree ≤ p, then the number of parameters is P = (K + p)!/K!p!. The polynomials {ψ i } are products of monomials in the individual variables, such as ψ i (x) = K k=1 ψ α k (i) (x k ), where ψ 0 (x) = 1, each index i ∈ Z + is associated to a unique multiindex α(i) = (α 1 (i), . . . , α K (i)) ∈ Z K + , and the single variable monomials are defined as ψ α k (i) (z) = z α k (i) for z ∈ R. In particular, for i = 1, . . . , K, α(i) i = i and α(i) j = 0 for j i, i.e. ψ i (x) = x i . The polynomials are constructed with sglib (https://github.com/ezander/sglib) and the computations are done with Matlab.
Assume that f (·; θ) = P−1 i=0 θ i ψ i (·). Under the assumption that the data is corrupted by additive Gaussian white noise, the negative log-likelihood (i.e. data misfit) takes the form (1) , . . . , x (N) ] T and Y = [y (1) , . . . , y (N) ] T , and let Ψ(X) i j = ψ j (x (i) ). Then Φ(θ) = |Ψ(X)θ − Y| 2 and the maximum likelihood solution θ * ∈ R P is given by Predictions of outputs Y for some X are later given by We will also consider log-polynomial regression, where the monomials are composed with log, i.e. the basis functions become ψ i • log, where log acts component-wise. We then define log f (·; θ) = P−1 i=0 θ i ψ i (log(·)) and the misfit is This is the negative log-likelihood under a multiplicative lognormal noise assumption, as opposed to the additive Gaussian noise assumption of (3.1). To understand the motivation for this, observe that (2.2) is a log-linear model corresponding to the logarithm of (2.1). Therefore, this setup includes the theoretical model prediction, when p = 1, and provides a way to systematically improve upon the theoretical model, by increasing p.
As a final note, as p increases, there is a chance of including more irrelevant parameters, in addition to ones which may be relevant. Therefore, we must regularize somehow. For the GLM, it is natural to adopt a standard L 2 Tikhonov regularization term λ|θ| 2 , resulting in the following modified objective function, for λ > 0,Φ which again can be exactly solved where I is the P × P identity matrix. One can readily observe that this provides an upper bound to the pseudo inverse operator mapping the observation set to the optimal parameter (in other words Ψ(X) T Ψ(X) + λI has a spectrum lower bounded by λ).

Nonlinear models
In some very particular cases, it may make sense to invoke nonlinear models, at the significant additional expense of solving a nonlinear least squares problem (in the easiest instance) as opposed to the linear least squares solution presented above. The most notable example is deep neural networks. It has been shown that these models perform very well for tasks such as handwritten digit classification, voice and object recognition [3], and there has recently even been mathematical justification that substantiates this [5]. However, [6] show that one can sometimes find very small perturbations on input values which lead to misclassification. The methods can also be used for regression problems, and will be considered below.
As an example, where σ i is a simple nonlinear "activation function" which is applied element-wise. In this case n i are the number of auxiliary variables, or "neurons", on level i, n 0 = K is the input dimension, n L = 1 is the output dimension, and there are L levels. For regression one typically takes σ L =Id the identity map, as will be done here.
For the neural networks used here, σ i = σ : R → R + , for i < L, is the the rectified linear unit (ReLU) defined by This activation function contains no trainable parameters, and we iterate that it is applied elementwise to yield a function R n i → R n i + .
The misfit again takes the form except now we have a nonlinear and typically non-convex optimization problem.
There are many methods that can be employed to try to find the minimum of this objective function, and they will generally be successful in finding some minimum (though perhaps not global). Due to the mathematically simple functions that make up the neural network, derivatives with respect to the unknown parameters can be determined analytically. A popular method to solve non-convex optimization problems is stochastic gradient descent, or Robbins-Monro stochastic approximation algorithm [4,11], which is defined as follows. Let ∇Φ be an unbiased estimate of ∇Φ, i.e. a random variable such that E ∇Φ (θ) = ∇Φ (θ). Then let If {γ i } are chosen such that i γ i = ∞ and i γ 2 i < ∞, then the algorithm is guaranteed to converge to a minimizer [4,11], which is global if the function is convex [2]. Noting that we can compute easily in this case, one could use for example ∇Φ(θ) = ∇Φ(θ) + ξ, where ξ ∈ R P is some random variable, for example a Gaussian, with Eξ = 0. Something more clever and natural can be done in the particular case of (3.5). Recall that N can be very large. Observe that (3.5) is, up to a multiple of N, an equally weighted average of the summands ( f (x (i) ; θ), y (i) ). Let {n i } N sub i=1 be a random (or deterministic) subset of N sub < N indices, where N sub could be 1. Then the following provides an unbiased estimator which is also much cheaper than the original gradient (since it uses a subset of the data) Naturally the variance of this estimator will affect the convergence time of the algorithm and we refer the interested reader to the monograph [4] for discussion of the finer points of this family of algorithms. Neural Network models were implemented and trained using the Mathematica Neural Network functions built on top of the MXNet framework.
Increasing the number of free parameters, by increasing the depth or width of the network, allows more flexibility and can provide a better approximation to complex input output relationships.
However, as with GLM, increasing the number of degrees of freedom in the optimization problem also increases the susceptibility of the algorithm to instability from fitting noise in the data too accurately. A Tikhonov regularization could be employed again, but here we appeal to another strategy which is applicable only to DNN. In particular, the training data is randomly divided into N training samples and N val validation samples. The algorithm 3.6 is run with the training data, however the loss of the validation data is computed along the way and the network with the smallest loss on the validation data is retained at the end. This serves as a regularization, i.e. avoids fitting noise, since the training and validation sets are corrupted by different realizations of noise.

Numerical results
We consider a data set of 2822 data points with K = 9 input features and a single output. A subset of N = 2100 data points will be used for training, with the rest of the N out data points aside for testing. General linear and log linear models with order up to 6 are considered, as well as deep neural networks.

GLM
An empirically chosen regularization of λ = 1 for p = 2, λ = 50 for p = 3, and λ = 100 for p > 3 is used in (3.3). Figure 1 presents the results for the theoretical log linear model, as well as the best fit based on a expansion in log polynomials of total order 6, for the out-of-sample set of testing data. It is clear already that the enriched model set brings negligible improvement in the reconstruction fidelity. The left subpanels of Figure 4 show some slices over the various covariates, where the other covariates are fixed at a value from the center of the data set. This gives some idea of how the reconstruction behaves, although one should bear in mind that the data set forms a manifold and not a hypercube, in R 9 , and so these slices are bound to include unobserved subsets of the parameter space. Furthermore, there may be sparse coverage even where there is data. This can be observed also in Figure 4 which illustrates the single component marginal histograms over the data. Figure 2 show the relative L 2 fidelity and the coefficient of determination R 2 , defined below, for log polynomial and polynomial models of increasing order.
The relative error L 2 rel ( f ) of the surrogate f is given by The coefficient of determination R 2 ( f ) is given by Both models begin to saturate at the same level of fidelity, indicating that this is a fundamental limitation of the information in the data. Furthermore, one observes from Figure 2 that the saturation level of the fidelity occurs with only a few percentage points of improvement over the theoretically derived log-linear model.   Figure 2. Relative L 2 error and coefficient of determination R 2 for out-of-sample test data as a function of model complexity, for log polynomial models (left) and direct polynomial models (right). The neural network models were setup and trained using the built in neural network functions of Mathematica, which is built around the MXNet machine learning framework. The N val = N out samples are used in this case for validation, as described in the end of subsection 3.2.1.

Neural networks
By connecting various combinations of layers, the neural network model gains the flexibility to model the behavior of the training data. As the size of the network increases so do the number of unknown parameters in the model. If there is too much flexibility then the network model begins to conform to the noise in the data. In addition, due to the nature of gradient decent methods, the minima reached is a local minima dependent on the starting position. As the number of unknown parameters increases, in particular if it exceeds the amount of training data, one expects the inversion to become unstable, with an increased risk of converging to an undesirable local minimum.
To reduce the degeneracy of the objective function, the total number tunable weights and biases in the network will be kept below N, although this can be relaxed if a Tikhonov or other type of regularization is employed. Table 1, shows the make up of the neural network model used here. Initial weights and biases and chosen at random. The networks will be trained multiple times with random initial parameters to explore various local modes. Three examples are shown in Figures 3, 4. Figure  3 shows four pairs of images, with the left being the output in comparison to the true value, and the right being a histogram of the difference between the 2. The left two pairs and the top right are all corresponding to different instances of the same network optimization, but with different random selections of training data and different initial guesses. The bottom right is the best fit log linear model 2.2.
The L 2 rel and R 2 fidelity measures are shown in Table 2. What one observes here is that one finds almost no improvement at all over the theoretical log linear model with this P = 641 parameter neural network, while in fact we do manage to improve the accuracy by a few percent with higher order GLM using P = 5005 parameters (see Figure 2). Figure 4 shows 9 pairs of panels, one for each parameter. This indicates that the models perform better where the data is more dense. Also, the higher order polynomial models can behave wildly outside the support of the data. See for example the left bottom 3 panels.
Finally, Figure 5 shows the pairwise marginal histograms, which gives a slightly better indication of the shape of the data distribution.    . Scaling and input data distribution log linear, sixth order, and NN models. The left subplots show a slice through the parameter space of the various models, giving an indication of their behaviour relative to one another. The right subplots show the marginal histograms for the given parameter. One can observe a large deviation between the curves where the data is quite sparse, particularly for the higher order polynomial models. . Pairwise marginal histograms of the input data X. This gives some idea of the pairwise correlations between the different features over the input data.

Summary
In this work we considered fitting a theory-based log-linear ansatz, as well as higher order approximations for the thermal energy confinement of a Tokamak. It is a supervised machine learning regression problem to identify the map R K → R + with K = 9 inputs. We parametrized the map using general linear models based on total order polynomials up to degree 6 (P = 5005 parameters), as well as deep neural networks with up to P = 641 parameters. The results indicate that the theory-based model fits the data almost as well as the more sophisticated machines, within the support of the data set. The conclusion we arrive at is that only negligible improvements can be made to the theoretical model, for input data of this type. Further work includes (i) Bayesian formulation of the inversion for uncertainty quantification (UQ), or perhaps derivative-based UQ (sensitivity), (ii) inclusion of further hyper-parameters in the GLM for improved regularization within an empirical Bayesian framework, (iii) using other types of regularization for the deep neural network, in order to allow for more parameters, and (iv) exploring alternative machine learning methods, such as Gaussian processes, which naturally include UQ.