
• #### Adenita: interactive 3D modelling and visualization of DNA nanostructures.

(Nucleic acids research, Oxford University Press (OUP), 2020-07-22) [Article]
DNA nanotechnology is a rapidly advancing field, which increasingly attracts interest in many different disciplines, such as medicine, biotechnology, physics and biocomputing. The increasing complexity of novel applications requires significant computational support for the design, modelling and analysis of DNA nanostructures. However, current in silico design tools have not been developed in view of these new applications and their requirements. Here, we present Adenita, a novel software tool for the modelling of DNA nanostructures in a user-friendly environment. A data model supporting different DNA nanostructure concepts (multilayer DNA origami, wireframe DNA origami, DNA tiles etc.) has been developed allowing the creation of new and the import of existing DNA nanostructures. In addition, the nanostructures can be modified and analysed on-the-fly using an intuitive toolset. The possibility to combine and re-use existing nanostructures as building blocks for the creation of new superstructures, the integration of alternative molecules (e.g. proteins, aptamers) during the design process, and the export option for oxDNA simulations are outstanding features of Adenita, which spearheads a new generation of DNA nanostructure modelling software. We showcase Adenita by re-using a large nanorod to create a new nanostructure through user interactions that employ different editors to modify the original nanorod.
• #### What is the right sequencing approach? Solo VS extended family analysis in consanguineous populations.

(BMC medical genomics, Springer Science and Business Media LLC, 2020-07-19) [Article]
BACKGROUND: Choosing a testing strategy is crucial for genetics clinics and testing laboratories. In this study, we compared the hit rate of solo, trio, and trio plus testing, and of trio versus sibship testing. Finally, we studied the impact of extended family analysis, mainly in complex and unsolved cases. METHODS: Three cohorts were used for this analysis: one cohort to assess the hit rate of solo, trio and trio plus testing, another cohort to examine the impact of a sibship genome testing strategy vs trio-based analysis, and a third cohort to test the impact of an extended family analysis of up to eight family members on lowering the number of candidate variants. RESULTS: The hit rates in solo, trio and trio plus testing were 39, 40, and 41%, respectively. The total number of candidate variants in the sibship testing strategy was 117, compared to 59 in the trio-based analysis. The average number of candidate variants in trio-based analysis was 1192 coding and 26,454 noncoding variants; these numbers were lowered by 50-75% after adding additional family members, down to as few as two coding and 66 noncoding homozygous variants in families with eight family members. CONCLUSION: There was no difference in the hit rate between solo testing and extended family analysis. Trio-based analysis was a better approach than sibship testing, even in a consanguineous population. Finally, each additional family member helped narrow down the number of variants by 50-75%. Our findings could help clinicians, researchers and testing laboratories select the most cost-effective and appropriate sequencing approach for their patients. Furthermore, extended family analysis is a very useful tool for complex cases with novel genes.
• #### Improved characterisation of clinical text through ontology-based vocabulary expansion

(Cold Spring Harbor Laboratory, 2020-07-11) [Preprint]
BACKGROUND: Biomedical ontologies contain a wealth of metadata that constitutes a fundamental infrastructural resource for text mining. For several reasons, redundancies exist in the ontology ecosystem, which lead to the same concepts being described by several terms in the same or similar contexts across several ontologies. While these terms describe the same concepts, they contain different sets of complementary metadata. Linking these definitions to make use of their combined metadata could lead to improved performance in ontology-based information retrieval, extraction, and analysis tasks. RESULTS: We develop and present an algorithm that expands the set of labels associated with an ontology class using a combination of strict lexical matching and cross-ontology reasoner-enabled equivalency queries. Across all disease terms in the Disease Ontology, the approach found 51,362 additional labels, more than tripling the number defined by the ontology itself. Manual validation by a clinical expert on a random sampling of expanded synonyms over the Human Phenotype Ontology yielded a precision of 0.912. Furthermore, we found that annotating patient visits in MIMIC-III with an extended set of Disease Ontology labels led to the semantic similarity score derived from those labels being a significantly better predictor of matching first diagnosis, with a mean average precision of 0.88 for the unexpanded set of annotations and 0.913 for the expanded set. CONCLUSIONS: Inter-ontology synonym expansion can lead to a vast increase in the scale of vocabulary available for text mining applications. While the accuracy of the extended vocabulary is not perfect, it nevertheless led to a significantly improved ontology-based characterisation of patients from text in one setting. Furthermore, where run-on error is not acceptable, the technique can be used to provide candidate synonyms which can be checked by a domain expert.
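The strict-lexical-matching half of the expansion step can be sketched in a few lines. The identifiers and label sets below are toy placeholders, and the reasoner-enabled equivalency queries the abstract also describes are omitted:

```python
def expand_labels(target, other):
    """Absorb the label sets of classes in `other` that share at least one
    exact label with a class in `target` (strict lexical matching)."""
    expanded = {cls: set(labels) for cls, labels in target.items()}
    for cls, labels in target.items():
        for o_labels in other.values():
            if labels & o_labels:             # exact shared label
                expanded[cls] |= o_labels     # gain complementary synonyms
    return expanded

# Toy label sets; real identifiers and labels come from the ontologies.
disease_ontology = {"DOID:x": {"myocardial infarction"}}
phenotype_ontology = {"HP:y": {"myocardial infarction", "heart attack"}}

expanded = expand_labels(disease_ontology, phenotype_ontology)
print(expanded["DOID:x"])
```

In the real pipeline the absorbed labels would then serve as additional synonyms for annotation, as in the MIMIC-III experiment described above.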
• #### A Data Science Approach to Estimate Enthalpy of Formation of Cyclic Hydrocarbons

(The Journal of Physical Chemistry A, American Chemical Society (ACS), 2020-07-10) [Article]
In spite of the increasing importance of cyclic hydrocarbons in various chemical systems, data on fundamental properties of these compounds, such as enthalpy of formation, are still scarce. One reason for this is that the estimation of thermodynamic properties of cyclic hydrocarbon species via cost-effective computational approaches, such as group additivity (GA), has several limitations and challenges. In this study, a machine learning (ML) approach using the support vector regression (SVR) algorithm is proposed to predict the standard enthalpy of formation of cyclic hydrocarbon species. The model is developed based on a thoroughly curated dataset of accurate experimental values for 192 species collected from the literature. The molecular descriptors used as input to the SVR are calculated via the alvaDesc software, which computes in total 5255 features classified into 30 categories. The developed SVR model has an average error of approximately 10 kJ/mol. The SVR model outperforms the GA approach for complex molecules and can therefore be proposed as a novel data-driven approach to estimate enthalpy values for complex cyclic species. A sensitivity analysis is also conducted to examine the relevant features affecting the standard enthalpy of formation of cyclic species. Our species dataset is expected to be updated and expanded as new data become available, in order to develop a more accurate SVR model with broader applicability.
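As an illustration of this kind of modelling setup, a minimal SVR regression on synthetic descriptor vectors might look as follows. The features, target function, and kernel choice here are invented stand-ins, not alvaDesc descriptors or the 192-species experimental dataset:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 5 toy "descriptors" per species
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize descriptors, then fit an SVR on a training split.
model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0))
model.fit(X[:150], y[:150])

mae = np.mean(np.abs(model.predict(X[150:]) - y[150:]))
print(f"held-out MAE: {mae:.3f}")
```

A real study would add descriptor selection and hyperparameter tuning on top of this skeleton.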
• #### A fast, accurate, and generalisable heuristic-based negation detection algorithm for clinical text

(Cold Spring Harbor Laboratory, 2020-07-04) [Preprint]
BACKGROUND: Negation detection is an important task in biomedical text mining. Particularly in clinical settings, it is of critical importance to determine whether findings mentioned in text are present or absent. Rule-based negation detection algorithms are a common approach to the task, and more recent investigations have resulted in the development of rule-based systems utilising the rich grammatical information afforded by typed dependency graphs. However, interacting with these complex representations inevitably necessitates complex rules, which are time-consuming to develop and do not generalise well. We hypothesise that a heuristic approach to determining negation via dependency graphs could offer a powerful alternative. RESULTS: We describe and implement an algorithm for negation detection based on grammatical distance from a negatory construct in a typed dependency graph. To evaluate the algorithm, we develop two testing corpora comprising sentences of clinical text extracted from the MIMIC-III database and documents related to hypertrophic cardiomyopathy (HCM) patients routinely collected at University Hospitals Birmingham NHS Trust. Gold-standard validation datasets were built by a combination of human annotation and examination of algorithm error. Finally, we compare the performance of our approach with four other rule-based algorithms on both gold-standard corpora. CONCLUSIONS: The presented algorithm exhibits the best performance by f-measure over the MIMIC-III dataset, and a similar performance to the syntactic negation detection systems over the HCM dataset. It is also the fastest of the dependency-based negation systems. Our results show that dependency-based algorithms utilising a single heuristic can be powerful and stable methods for negation detection in clinical text, requiring minimal training and adaptation between datasets. While NegEx retains an extremely high performance in some cases, the presented approach may be more robust to more complex text descriptions. As such, it could serve as a drop-in replacement or augmentation for syntactic negation components in clinical text-mining pipelines, particularly for cases where adaptation and rule development is not required or possible.
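The core heuristic, distance from a negation cue in a dependency graph, can be sketched with a plain breadth-first search. The toy parse, cue list, and distance threshold below are illustrative assumptions, not the paper's tuned configuration:

```python
from collections import deque

NEGATION_CUES = {"no", "not", "without", "denies"}   # illustrative cue list

def graph_distance(edges, start, goal):
    """Undirected BFS distance between two tokens in a dependency graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def is_negated(tokens, edges, target, max_dist=2):
    """Flag a finding as negated if a cue lies within max_dist edges of it."""
    return any(
        graph_distance(edges, cue, target) <= max_dist
        for cue in tokens if cue in NEGATION_CUES
    )

# "patient denies chest pain", with a hand-made toy dependency parse.
tokens = ["patient", "denies", "chest", "pain"]
edges = [("denies", "patient"), ("denies", "pain"), ("pain", "chest")]
print(is_negated(tokens, edges, "pain"))    # cue "denies" is one edge away
```

In a real system the graph would come from a dependency parser and the threshold would be validated against annotated corpora such as those described above.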
• #### Early-Stage Growth Mechanism and Synthesis Conditions-Dependent Morphology of Nanocrystalline Bi Films Electrodeposited from Perchlorate Electrolyte.

(Nanomaterials (Basel, Switzerland), MDPI AG, 2020-07-02) [Article]
Bi nanocrystalline films were formed from a perchlorate electrolyte (PE) on Cu substrates via electrochemical deposition with different durations and current densities. The microstructural and morphological properties and the elemental composition were studied using scanning electron microscopy (SEM), atomic force microscopy (AFM), and energy-dispersive X-ray microanalysis (EDX). The optimal range of current densities for Bi electrodeposition in PE was determined using polarization measurements. For the first time, it was shown and explained why co-deposition of Pb and Bi occurs at a deposition duration of 1 s. The correlation between synthesis conditions and the chemical composition and microstructure of the Bi films is discussed. The analysis of the microstructure evolution revealed a change in the films' growth mechanism from pillar-like (for the Pb-rich phase) to layered granular form (for Bi) as the deposition duration increased. This abnormal behavior is explained by the appearance of a strong Bi growth texture and coalescence effects. Porosity investigations showed that the Bi films have a closely-packed microstructure. The main stages and the growth mechanism of Bi films in the galvanostatic regime in PE with deposition durations of 1-30 s are proposed.
• #### DTiGEMS+: drug–target interaction prediction using graph embedding, graph mining, and similarity-based techniques.

(Journal of Cheminformatics, Springer Science and Business Media LLC, 2020-07-02) [Article]
In silico prediction of drug–target interactions is a critical phase in the sustainable drug development process, especially when the research focus is to capitalize on the repositioning of existing drugs. Developing such computational methods is not an easy task, but it is much needed, as current methods that predict potential drug–target interactions suffer from high false-positive rates. Here we introduce DTiGEMS+, a computational method that predicts Drug–Target interactions using Graph Embedding, graph Mining, and Similarity-based techniques. DTiGEMS+ combines similarity-based as well as feature-based approaches, and models the identification of novel drug–target interactions as a link prediction problem in a heterogeneous network. DTiGEMS+ constructs the heterogeneous network by augmenting the known drug–target interaction graph with two complementary graphs: a drug–drug similarity graph and a target–target similarity graph. To provide the final drug–target predictions, DTiGEMS+ combines different computational techniques, including graph embeddings, graph mining, and machine learning. It integrates multiple drug–drug and target–target similarities into the final heterogeneous graph construction after applying a similarity selection procedure and a similarity fusion algorithm. Using four benchmark datasets, we show that DTiGEMS+ substantially improves prediction performance compared to other state-of-the-art in silico methods developed to predict drug–target interactions, achieving the highest average AUPR across all datasets (0.92), which reduces the error rate by 33.3% relative to the second-best performing model in the comparison.
• #### Attributed heterogeneous network fusion via collaborative matrix tri-factorization

(Information Fusion, Elsevier BV, 2020-06-26) [Article]
Heterogeneous network based data fusion can encode diverse inter- and intra-relations between objects, and has been attracting increasing attention in recent years. Matrix factorization based data fusion models have been invented to fuse multiple data sources. However, these models generally suffer from the widely-witnessed insufficient relations between nodes and from information loss when heterogeneous attributes of diverse network nodes are transformed into ad-hoc homologous networks for fusion. In this paper, we introduce a general data fusion model called Attributed Heterogeneous Network Fusion (AHNF). AHNF first constructs an attributed heterogeneous network composed of different types of nodes and the diverse attribute vectors of these nodes. It uses indicator matrices to differentiate the observed inter-relations from the latent ones, and thus reduces the impact of insufficient relations between nodes. Next, it collaboratively factorizes the multiple adjacency matrices and attribute data matrices of the heterogeneous network into low-rank matrices to explore the latent relations between these nodes. In this way, both the network topology and the diverse attributes of nodes are fused in a coordinated fashion. Finally, it uses the optimized low-rank matrices to approximate the target relational data matrix of objects and to effectively accomplish the relation prediction. We apply AHNF to predict lncRNA-disease associations using diverse relational and attribute data sources. AHNF achieves a larger area under the receiver operating characteristic curve (0.9367, larger by at least 2.14%) and a larger area under the precision-recall curve (0.5937, larger by at least 28.53%) than competitive data fusion approaches. AHNF also outperforms competing methods on predicting de novo lncRNA-disease associations, and precisely identifies lncRNAs associated with breast, stomach, prostate, and pancreatic cancers.
AHNF is a comprehensive data fusion framework for universal attributed multi-type relational data. The code and datasets are available at http://mlda.swu.edu.cn/codes.php?name=AHNF.
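The basic tri-factorization underlying such models, approximating a relational matrix R by U S Vᵀ with low-rank factors, can be sketched with alternating least squares on toy data. AHNF's collaborative multi-matrix objective and indicator matrices are not reproduced here; this only shows the single-matrix building block:

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.random((20, 15))          # toy relational data matrix
k1, k2 = 4, 3                     # low ranks of the two factor spaces

U = rng.normal(size=(20, k1))
S = rng.normal(size=(k1, k2))
V = rng.normal(size=(15, k2))

def loss():
    return np.linalg.norm(R - U @ S @ V.T) ** 2

initial = loss()
for _ in range(20):
    # Each update is the exact least-squares solution for that factor
    # with the other two held fixed, so the loss is non-increasing.
    U = R @ np.linalg.pinv(S @ V.T)
    S = np.linalg.pinv(U) @ R @ np.linalg.pinv(V.T)
    V = R.T @ np.linalg.pinv(U @ S).T
print(initial, loss())
```

The fitted low-rank product U @ S @ V.T then serves as the smoothed relation matrix from which missing links (here, lncRNA-disease associations) would be scored.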
• #### Modern Deep Learning in Bioinformatics.

(Journal of molecular cell biology, Oxford University Press (OUP), 2020-06-24) [Article]
Deep learning (DL) has shown explosive growth in its application to bioinformatics and has demonstrated remarkably promising power to mine the complex relationships hidden in large-scale biological and biomedical data. A number of comprehensive reviews have been published on such applications, ranging from high-level reviews with future perspectives to those mainly serving as tutorials. These reviews have provided an excellent introduction to and guideline for applications of DL in bioinformatics, covering multiple types of machine learning (ML) problems, different DL architectures, and ranges of biological/biomedical problems. However, most of these reviews have focused on previous research, whereas current trends in the principled DL field and perspectives on their future developments and potential new applications to biology and biomedicine are still scarce. Here, we focus on modern DL, the ongoing trends and future directions of the principled DL field, and postulate new and major applications in bioinformatics.
• #### Virtual reality framework for editing and exploring medial axis representations of nanometric scale neural structures

(Computers and Graphics (Pergamon), Elsevier BV, 2020-06-24) [Article]
We present a novel virtual reality (VR) based framework for the exploratory analysis of nanoscale 3D reconstructions of cellular structures acquired from rodent brain samples through serial electron microscopy. The system specifically targets medial axis representations (skeletons) of branched and tubular structures of cellular shapes, and it is designed to provide domain scientists with: i) effective and fast semi-automatic interfaces for tracing skeletons directly on surface-based representations of cells and structures; ii) fast tools for proofreading, i.e., correcting and editing semi-automatically constructed skeleton representations; and iii) natural methods for interactive exploration, i.e., measuring, comparing, and analyzing geometric features of cellular structures based on medial axis representations. Neuroscientists currently use the system for performing morphology studies on sparse reconstructions of glial cells and neurons extracted from a sample of the somatosensory cortex of a juvenile rat. The framework runs on a standard PC and has been tested on two different display and interaction setups: a PC-tethered stereoscopic head-mounted display (HMD) with 3D controllers and tracking sensors, and a large display wall with a standard gamepad controller. We report on a user study that we carried out to analyze user performance on different tasks using these two setups.
• #### Network Moments: Extensions and Sparse-Smooth Attacks

(arXiv, 2020-06-21) [Preprint]
The impressive performance of deep neural networks (DNNs) has immensely strengthened the line of research that aims at theoretically analyzing their effectiveness. This has incited research on the reaction of DNNs to noisy input, namely developing adversarial input attacks and strategies that lead to DNNs that are robust to these attacks. To that end, in this paper, we derive exact analytic expressions for the first and second moments (mean and variance) of a small piecewise linear (PL) network (Affine, ReLU, Affine) subject to Gaussian input. In particular, we generalize the second-moment expression of Bibi et al. to arbitrary input Gaussian distributions, dropping the zero-mean assumption. We show that the new variance expression can be efficiently approximated, leading to much tighter variance estimates as compared to the preliminary results of Bibi et al. Moreover, we experimentally show that these expressions are tight under simple linearizations of deeper PL-DNNs, where we investigate the effect of the linearization sensitivity on the accuracy of the moment estimates. Lastly, we show that the derived expressions can be used to construct sparse and smooth Gaussian adversarial attacks (targeted and non-targeted) that tend to lead to perceptually feasible input attacks.
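For a single scalar Gaussian passed through a ReLU, the first moment has a well-known closed form, E[max(X, 0)] = μΦ(μ/σ) + σφ(μ/σ), which conveys the flavour of such moment expressions; the paper's network-level Affine-ReLU-Affine derivations are more involved. A sketch with a Monte Carlo check:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def relu_mean(mu, sigma):
    """Closed form for E[max(X, 0)] with X ~ N(mu, sigma^2)."""
    z = mu / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)    # standard normal PDF
    return mu * Phi + sigma * phi

# Monte Carlo check of the analytic first moment.
rng = np.random.default_rng(0)
samples = np.maximum(rng.normal(1.0, 2.0, size=1_000_000), 0.0)
print(relu_mean(1.0, 2.0), samples.mean())
```

The closed form agrees with sampling to within Monte Carlo error, which is the kind of agreement the paper's tighter second-moment expressions are evaluated on.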
• #### Modeling quantitative traits for COVID-19 case reports

(Cold Spring Harbor Laboratory, 2020-06-21) [Preprint]
Medical practitioners record the condition status of a patient through qualitative and quantitative observations. The measurement of vital signs and molecular parameters in the clinic gives a complementary description of abnormal phenotypes associated with the progression of a disease. The Clinical Measurement Ontology (CMO) is used to standardize annotations of these measurable traits. However, researchers have had no way to describe how these quantitative traits relate to phenotype concepts in a machine-readable manner. Using the WHO clinical case report form standard for the COVID-19 pandemic, we modeled quantitative traits and developed OWL axioms to formally relate clinical measurement terms with anatomical entities, biomolecular entities, and phenotypes annotated with the Uber-anatomy ontology (Uberon), Chemical Entities of Biological Interest (ChEBI), and the Phenotype and Trait Ontology (PATO) biomedical ontologies. The formal description of these relations allows interoperability between clinical and biological descriptions, and facilitates automated reasoning for the analysis of patterns over quantitative and qualitative biomedical observations.
• #### Analysis of transcript-deleterious variants in Mendelian disorders: implications for RNA-based diagnostics.

(Genome biology, Springer Science and Business Media LLC, 2020-06-20) [Article]
BACKGROUND: At least 50% of patients with suspected Mendelian disorders remain undiagnosed after whole-exome sequencing (WES), and the extent to which non-coding variants that are not captured by WES contribute to this fraction is unclear. Whole transcriptome sequencing is a promising supplement to WES, although empirical data on the contribution of RNA analysis to the diagnosis of Mendelian diseases on a large scale are scarce. RESULTS: Here, we describe our experience with transcript-deleterious variants (TDVs) based on a cohort of 5647 families with suspected Mendelian diseases. We first interrogate all families for which the respective Mendelian phenotype could be mapped to a single locus to obtain an unbiased estimate of the contribution of TDVs at 18.9%. We examine the entire cohort and find that TDVs account for 15% of all "solved" cases. We compare the results of RT-PCR to in silico prediction. Definitive results from RT-PCR are obtained from blood-derived RNA for the overwhelming majority of variants (84.1%), and only a small minority (2.6%) fail analysis on all available RNA sources (blood-, skin fibroblast-, and urine renal epithelial cells-derived), which has important implications for the clinical application of RNA-seq. We also show that RNA analysis can establish the diagnosis in 13.5% of 155 patients who had received "negative" clinical WES reports. Finally, our data suggest a role for TDVs in modulating penetrance even in otherwise highly penetrant Mendelian disorders. CONCLUSIONS: Our results provide much needed empirical data for the impending implementation of diagnostic RNA-seq in conjunction with genome sequencing.
• #### Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

(arXiv, 2020-06-20) [Preprint]
We present a unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer. We do this by extending the unified analysis of Gorbunov, Hanzely & Richtárik (2020) and dropping the requirement that the loss function be strongly convex. Instead, we rely only on convexity of the loss function. Our unified analysis applies to a host of existing algorithms such as proximal SGD, variance reduced methods, quantization and some coordinate descent type methods. For the variance reduced methods, we recover the best known convergence rates as special cases. For proximal SGD, the quantization and coordinate type methods, we uncover new state-of-the-art convergence rates. Our analysis also includes any form of sampling and minibatching. As such, we are able to determine the minibatch size that optimizes the total complexity of variance reduced methods. We showcase this by obtaining a simple formula for the optimal minibatch size of two variance reduced methods (L-SVRG and SAGA). This optimal minibatch size not only improves the theoretical total complexity of the methods but also improves their convergence in practice, as we show in several experiments.
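For concreteness, proximal SGD on a least-squares loss with an L1 regularizer, one of the composite problems covered by such analyses, can be sketched as follows. The problem data, stepsize, and regularization weight are arbitrary illustrative choices; the prox of the L1 norm is soft-thresholding:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, lr = 200, 10, 0.1, 0.01
A = rng.normal(size=(n, d))
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.0, 0.5]                       # sparse ground truth
b = A @ x_true + rng.normal(scale=0.1, size=n)

def prox_l1(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(d)
for step in range(5000):
    i = rng.integers(n)                             # sample one data point
    grad = (A[i] @ x - b[i]) * A[i]                 # stochastic gradient
    x = prox_l1(x - lr * grad, lr * lam)            # gradient step, then prox
print(x.round(2))
```

The iterates settle near the sparse ground truth, with the usual L1 shrinkage bias and constant-stepsize noise floor that the convergence theory quantifies.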
• #### Introduction to spatio-temporal data driven urban computing

(Distributed and Parallel Databases, Springer Science and Business Media LLC, 2020-06-19) [Article]
This special issue of the Distributed and Parallel Databases journal covers recent advances in spatio-temporal data analytics in the context of urban computing. It contains 9 articles that present solid research studies and innovative ideas in the area of spatio-temporal data analytics for urban computing applications. All of the 9 papers went through at least two rounds of rigorous review by the guest editors and invited reviewers. Location-based recommender systems are becoming increasingly important in the urban computing community. The paper by Hao Zhou et al., “Hybrid route recommendation with taxi and shared bicycles,” develops a two-phase data-driven recommendation framework that integrates prediction and recommendation phases for providing reliable route recommendation results. Another paper, by Hao Zhang et al., “On accurate POI recommendation via transfer learning,” proposes a transfer learning based deep neural model that fuses cross-domain knowledge to achieve more accurate POI recommendation. Spatial keyword search has been receiving much attention in the area of spatio-temporal data analytics. Xiangguo Zhao et al. develop an index structure that comprehensively considers the social, spatial, and textual information of massive-scale spatio-temporal data to support social-aware spatial keyword group queries in their paper “Social-aware spatial keyword top-k group query.” Jiajie Xu et al. propose a hybrid indexing structure that integrates the spatial and semantic information of spatio-temporal data in their paper “Multi-objective spatial keyword query with semantics: a distance-owner based approach.” Matching of spatio-temporal data is a fundamental research problem in spatio-temporal data analytics. The paper by Ning Wang et al., “An efficient algorithm for spatio-textual location matching,” targets the problem of finding all location pairs whose spatio-textual similarity exceeds a given threshold. This matching query is useful in urban computing applications including hot region detection and traffic congestion alleviation. Additionally, their paper “Privacy-preserving spatial keyword location-to-trajectory matching” presents a network expansion algorithm and pruning strategies for finding location-trajectory pairs from spatio-temporal data while preserving users’ privacy. Further, the paper by Lei Xiao et al., “LSTM-based deep learning for spatial–temporal software testing,” develops a test case prioritization approach using LSTM-based deep learning, which exhibits potential application value in self-driving cars. Another paper, by Zhenchang Xia et al., “ForeXGBoost: passenger car sales prediction based on XGBoost,” presents a prediction model that utilizes data filling algorithms and achieves high prediction accuracy with short running times for vehicle sales prediction. Finally, the paper by Zhiqiang Liu et al., “A parameter-level parallel optimization algorithm for large-scale spatio-temporal data mining,” proposes an efficient parameter-level parallel optimization algorithm for large-scale spatio-temporal data mining. These nine articles represent diverse directions in the fast-growing area of spatio-temporal data analytics in the urban computing community. We hope that these papers will foster the development of urban computing techniques and inspire more research in this promising area.
• #### A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

(arXiv, 2020-06-19) [Preprint]
Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors which are not unbiased, such as Top-K. In this paper, we propose a new and theoretically and practically better alternative to EF for dealing with contractive compressors. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings.
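A minimal sketch of the Top-K contractive compressor and of classical error feedback, the baseline being improved upon, is below; the paper's induced unbiased-compressor construction itself is not reproduced here. Dimensions and gradients are toy values:

```python
import numpy as np

def top_k(x, k):
    """Contractive Top-K compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
d, k = 100, 10
x = rng.normal(size=d)

# Contraction property: ||C(x) - x||^2 <= (1 - k/d) * ||x||^2.
lhs = np.linalg.norm(top_k(x, k) - x) ** 2
rhs = (1.0 - k / d) * np.linalg.norm(x) ** 2
print(lhs <= rhs)

# Classical error feedback: compress the gradient plus the carried-over
# error, and remember whatever was dropped for the next round.
error = np.zeros(d)
sent, grads = [], [rng.normal(size=d) for _ in range(5)]
for g in grads:
    msg = top_k(g + error, k)
    error = g + error - msg
    sent.append(msg)
# Invariant: nothing is lost, only delayed.
print(np.allclose(sum(sent) + error, sum(grads)))
```

The `error` buffer is exactly the per-worker memory requirement that the paper's induced-compressor approach is claimed to reduce.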
• #### Solving Acoustic Boundary Integral Equations Using High Performance Tile Low-Rank LU Factorization

(High Performance Computing, Springer International Publishing, 2020-06-18) [Conference Paper]
We design and develop a new high performance implementation of a fast direct LU-based solver using low-rank approximations on massively parallel systems. The LU factorization is the most time-consuming step in solving systems of linear equations in the context of analyzing acoustic scattering from large 3D objects. The matrix equation is obtained by discretizing the boundary integral of the exterior Helmholtz problem using a higher-order Nyström scheme. The main idea is to exploit the inherent data sparsity of the matrix operator by performing local tile-centric approximations while still capturing the most significant information. In particular, the proposed LU-based solver leverages the Tile Low-Rank (TLR) data compression format as implemented in the Hierarchical Computations on Manycore Architectures (HiCMA) library to decrease the complexity of “classical” dense direct solvers from cubic to quadratic order. We taskify the underlying boundary integral kernels to expose fine-grained computations. We then employ the dynamic runtime system StarPU to orchestrate the scheduling of computational tasks on shared and distributed-memory systems. The resulting asynchronous execution compensates for the load imbalance caused by the heterogeneous ranks, while mitigating the overhead of data motion. We assess the robustness of our TLR LU-based solver and study the qualitative impact of using different numerical accuracies. The new TLR LU factorization outperforms state-of-the-art dense factorizations by up to an order of magnitude on various parallel systems, for the analysis of scattering from large-scale 3D synthetic and real geometries.
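The tile low-rank idea, compressing each tile independently with a truncated SVD at a chosen accuracy, can be illustrated on a small synthetic kernel matrix. The smooth Gaussian kernel below is a stand-in for the Nyström-discretized Helmholtz operator that HiCMA actually targets:

```python
import numpy as np

def compress_tile(T, tol):
    """Truncated SVD of one tile: keep singular values above tol * s_max."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))
    return U[:, :r] * s[:r], Vt[:r]               # rank-r factors of the tile

n, nb, tol = 256, 64, 1e-6                        # matrix size, tile size, accuracy
pts = np.linspace(0.0, 1.0, n)
A = np.exp(-((pts[:, None] - pts[None, :]) ** 2) / 0.1)   # smooth toy kernel

approx = np.zeros_like(A)
ranks = []
for i in range(0, n, nb):
    for j in range(0, n, nb):
        Uf, Vf = compress_tile(A[i:i + nb, j:j + nb], tol)
        ranks.append(Uf.shape[1])
        approx[i:i + nb, j:j + nb] = Uf @ Vf
rel_err = np.linalg.norm(A - approx) / np.linalg.norm(A)
print(max(ranks), f"{rel_err:.2e}")
```

Because every tile's rank stays far below the tile size while the reconstruction error tracks the requested tolerance, operating on the compressed factors is what lets a TLR factorization beat dense LU in storage and flops.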
• #### A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization

(arXiv, 2020-06-12) [Preprint]
In this paper, we study the performance of a large family of SGD variants in the smooth nonconvex regime. To this end, we propose a generic and flexible assumption capable of accurate modeling of the second moment of the stochastic gradient. Our assumption is satisfied by a large number of specific variants of SGD in the literature, including SGD with arbitrary sampling, SGD with compressed gradients, and a wide variety of variance-reduced SGD methods such as SVRG and SAGA. We provide a single convergence analysis for all methods that satisfy the proposed unified assumption, thereby offering a unified understanding of SGD variants in the nonconvex regime instead of relying on dedicated analyses of each variant. Moreover, our unified analysis is accurate enough to recover or improve upon the best-known convergence results of several classical methods, and also gives new convergence results for many new methods which arise as special cases. In the more general distributed/federated nonconvex optimization setup, we propose two new general algorithmic frameworks differing in whether direct gradient compression (DC) or compression of gradient differences (DIANA) is used. We show that all methods captured by these two frameworks also satisfy our unified assumption. Thus, our unified convergence analysis also captures a large variety of distributed methods utilizing compressed communication. Finally, we also provide a unified analysis for obtaining faster linear convergence rates in this nonconvex regime under the PL condition.
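The compression-of-gradient-differences (DIANA-style) framework mentioned above can be sketched as follows; the Rand-$K$ compressor, the shift step size, and the class layout are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased Rand-K compressor: keep k random coordinates, rescale by d/k."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx] * (x.size / k)
    return out

class DianaWorker:
    """One node that compresses gradient *differences* against a learned shift h."""

    def __init__(self, dim, k, alpha, rng):
        self.h = np.zeros(dim)   # local shift, mirrored on the server
        self.k, self.alpha, self.rng = k, alpha, rng

    def message(self, grad):
        """Compress grad - h; both worker and server then update h identically,
        so the server can reconstruct the gradient estimate as h_old + m."""
        m = rand_k(grad - self.h, self.k, self.rng)
        self.h = self.h + self.alpha * m
        return m
```

As the shift `h` learns the local gradient, the transmitted differences shrink, which is the mechanism behind the improved rates of difference-compression methods over direct compression.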
• #### A self-adaptive deep learning algorithm for accelerating multi-component flash calculation

(Computer Methods in Applied Mechanics and Engineering, Elsevier BV, 2020-06-11) [Article]
In this paper, the first self-adaptive deep learning algorithm is presented in detail to accelerate flash calculations; it can quantitatively predict the total number of phases in a mixture and the related thermodynamic properties at equilibrium for realistic reservoir fluids with a large number of components under various environmental conditions. A thermodynamically consistent scheme for phase equilibrium calculation is adopted and implemented at specified moles, volume and temperature, and the flash results are used as the ground truth for training and testing the deep neural network. The critical properties of each component are taken as the input features of the neural network, and the final output is the total number of phases at equilibrium together with the molar composition of each phase. Two network structures are carefully designed, one of which transforms inputs with varying numbers of components, in both the training data and the objective fluid mixture, into a unified space before they enter the productive neural network. “Ghost components” are defined and introduced to perform the data padding needed to match the dimension of the input flash calculation data to the training and testing requirements of the target fluid mixture. The hyperparameters of both neural networks are carefully tuned to ensure that the physical correlations underlying the input parameters are preserved through the learning process. This combined structure makes the deep learning algorithm self-adaptive to changes in the number and dimension of input components. Furthermore, two Softmax functions are used in the last layer to enforce the constraint that the mole fractions in each phase sum to 1. An example is presented in which the flash calculation results of an 8-component Eagle Ford oil are used as input to estimate the phase equilibrium state of a 14-component Eagle Ford oil; the results are satisfactory, with very small estimation errors. The capability of the proposed deep learning algorithm to simultaneously complete the phase stability test and the phase splitting calculation is also verified. Concluding remarks provide guidance for further research in this direction, especially the potential application of newly developed neural network models.
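The ghost-component padding and the Softmax constraint on mole fractions can be illustrated with a short NumPy sketch; the feature layout, zero-valued ghosts, and two-phase output heads are illustrative assumptions rather than the paper's exact architecture.

```python
import numpy as np

def pad_with_ghosts(features, n_target):
    """Pad an (n_components, n_features) array with zero-valued 'ghost
    components' so mixtures of different sizes share one input dimension."""
    n, f = features.shape
    ghost = np.zeros((n_target - n, f))
    return np.vstack([features, ghost])

def softmax(z):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Critical properties (e.g. Tc, Pc, acentric factor) of an 8-component
# mixture, padded up to a 14-component input dimension.
feats_8 = np.random.default_rng(0).random((8, 3))
padded = pad_with_ghosts(feats_8, 14)

# One Softmax head per phase enforces sum(mole fractions) == 1.
logits_vapor = np.random.default_rng(1).normal(size=14)
logits_liquid = np.random.default_rng(2).normal(size=14)
y_vapor, y_liquid = softmax(logits_vapor), softmax(logits_liquid)
```

Because the Softmax outputs are positive and sum to one by construction, the mole-fraction constraint holds exactly for any network weights rather than being learned approximately.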