For more information visit: https://sfb.kaust.edu.sa/Pages/Home.aspx

Recent Submissions

  • PATHcre8: A Tool That Facilitates the Searching for Heterologous Biosynthetic Routes

    Motwalli, Olaa Amin; Uludag, Mahmut; Mijakovic, Ivan; Alazmi, Meshari; Bajic, Vladimir B.; Gojobori, Takashi; Gao, Xin; Essack, Magbubah (ACS Synthetic Biology, American Chemical Society (ACS), 2020-11-16) [Article]
    Developing computational tools that can facilitate the rational design of cell factories producing desired products at increased yields is challenging, as the tool needs to take into account that the preferred host organism usually has compounds that are consumed by competing reactions that reduce the yield of the desired product. On the other hand, the preferred host organisms may not have the native metabolic reactions needed to produce the compound of interest; thus, the computational tool needs to identify the metabolic reactions that will most efficiently produce the desired product. In this regard, we developed the generic tool PATHcre8 to facilitate an optimized search for heterologous biosynthetic pathway routes. PATHcre8 finds and ranks biosynthesis routes in a large number of organisms, including Cyanobacteria. The tool ranks the pathways based on feature scores that reflect reaction thermodynamics, the potentially toxic products in the pathway (compound toxicity), intermediate products in the pathway consumed by competing reactions (product consumption), and host-specific information such as enzyme copy number. A comparison with several other similar tools shows that PATHcre8 is more efficient in ranking functional pathways. To illustrate the effectiveness of PATHcre8, we further provide case studies focused on isoprene production and the biodegradation of cocaine. PATHcre8 is free for academic and nonprofit users and can be accessed at https://www.cbrc.kaust.edu.sa/pathcre8/.
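The ranking step described above — scoring each candidate route on several features and ordering routes by a combined score — can be sketched as follows. The feature names, weights, and score values are illustrative stand-ins, not PATHcre8's actual scoring scheme:

```python
def rank_pathways(pathways, weights):
    """Order candidate routes by a weighted sum of per-feature scores."""
    def total(p):
        return sum(weights[f] * p["scores"][f] for f in weights)
    return sorted(pathways, key=total, reverse=True)

# Illustrative candidate routes with made-up feature scores
# (higher = better for every feature, in this toy convention).
routes = [
    {"name": "route_a", "scores": {"thermo": 0.9, "toxicity": 0.2, "consumption": 0.8}},
    {"name": "route_b", "scores": {"thermo": 0.6, "toxicity": 0.9, "consumption": 0.9}},
]
weights = {"thermo": 1.0, "toxicity": 1.0, "consumption": 1.0}
ranking = [p["name"] for p in rank_pathways(routes, weights)]
print(ranking)  # → ['route_b', 'route_a']
```

In the real tool the per-feature scores come from reaction thermodynamics, compound toxicity, product consumption, and host-specific data such as enzyme copy number.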
  • Few-shot learning for classification of novel macromolecular structures in cryo-electron tomograms

    Li, Ran; Yu, Liangyong; Zhou, Bo; Zeng, Xiangrui; Wang, Zhenyu; Yang, Xiaoyan; Zhang, Jing; Gao, Xin; Jiang, Rui; Xu, Min (PLOS Computational Biology, Public Library of Science (PLoS), 2020-11-11) [Article]
    Cryo-electron tomography (cryo-ET) provides 3D visualization of subcellular components in the near-native state and at sub-molecular resolutions in single cells, demonstrating an increasingly important role in structural biology in situ. However, systematic recognition and recovery of macromolecular structures in cryo-ET data remain challenging as a result of low signal-to-noise ratio (SNR), small sizes of macromolecules, and high complexity of the cellular environment. Subtomogram structural classification is an essential step for such tasks. Although acquisition of large amounts of subtomograms is no longer an obstacle due to advances in automation of data collection, obtaining the same number of structural labels is both computation and labor intensive. On the other hand, existing deep learning based supervised classification approaches are highly demanding on labeled data and have limited ability to learn about new structures rapidly from data containing very few labels of such new structures. In this work, we propose a novel approach for subtomogram classification based on few-shot learning. With our approach, structures unseen in the training data can be classified given only a few labeled samples in the test data through instance embedding. Experiments were performed on both simulated and real datasets. Our experimental results show that we can make inferences on new structures given only five labeled samples for each class with a competitive accuracy (> 0.86 on the simulated dataset with SNR = 0.1), or even one sample with an accuracy of 0.7644. The results on real datasets are also promising, with accuracy > 0.9 in both conditions and even up to 1 on one of the real datasets. Our approach achieves significant improvement compared with the baseline method and has strong capabilities of generalizing to other cellular components.
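The nearest-prototype flavor of few-shot classification via instance embedding can be sketched in a few lines; the 2D "embeddings" below are toy stand-ins for learned subtomogram features, and this is a generic sketch rather than the paper's exact model:

```python
import numpy as np

def prototype_classify(support_emb, support_labels, query_emb):
    """Nearest-prototype few-shot classification: average the support
    embeddings per class, then assign each query to the class whose
    prototype is closest in Euclidean distance."""
    classes = sorted(set(support_labels))
    protos = np.stack([
        support_emb[np.array(support_labels) == c].mean(axis=0)
        for c in classes
    ])
    # distances: (n_query, n_class)
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return [classes[i] for i in d.argmin(axis=1)]

# toy 2-shot example with 2D "embeddings"
support = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
labels = [0, 0, 1, 1]
queries = np.array([[0.05, 0.05], [4.8, 5.2]])
pred = prototype_classify(support, labels, queries)
print(pred)  # → [0, 1]
```

The embedding network itself (which maps a subtomogram volume to such a feature vector) is what the few-shot training procedure learns.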
  • Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species

    Zheng, Yumin; Wang, Haohan; Zhang, Yang; Gao, Xin; Xing, Eric P.; Xu, Min (PLOS Computational Biology, Public Library of Science (PLoS), 2020-11-05) [Article]
    In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of the poly(A) signal (PAS) on the DNA sequence is the key to understanding the mechanism of translation regulation and mRNA metabolism. Although machine learning methods have been widely used to computationally identify PAS, the need for tremendous amounts of annotation data hinders applications of existing methods in species without experimental data on PAS. Therefore, cross-species PAS identification, which enables the prediction of PAS in untrained species, naturally becomes a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolution Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use three species, building cross-species training sets from two of them and evaluating performance on the remaining one. Moreover, we test our method against insufficient-data and imbalanced-data issues and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when it comes to a smaller or imbalanced training set.
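On the input side, CNN-based PAS classifiers typically one-hot encode a DNA window around the candidate signal before convolution; the sketch below illustrates that generic encoding step, not Poly(A)-DG's actual preprocessing:

```python
import numpy as np

def one_hot(seq):
    """One-hot encode a DNA sequence: each base becomes a length-4
    indicator vector, giving a (length, 4) matrix that a 1D CNN can
    convolve over."""
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, table[base]] = 1.0
    return x

x = one_hot("AATAAA")  # the canonical poly(A) signal hexamer
print(x.shape)  # → (6, 4)
```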
  • RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads.

    Liao, Xingyu; Gao, Xin; Zhang, Xiankai; Wu, Fang-Xiang; Wang, Jianxin (BMC bioinformatics, Springer Science and Business Media LLC, 2020-10-19) [Article]
    BACKGROUND: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make use of assembly of the high-frequency k-mers to obtain repeats. However, a certain degree of sequence coverage is required for assemblers to obtain the desired assemblies. On the other hand, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For the above reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. RESULTS: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from the whole set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered an outstanding genome assembler for NGS sequences. CONCLUSIONS: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase, and some other metrics.
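The high-frequency-read filtering idea can be sketched as follows; the simple "keep reads containing at least one frequent k-mer" rule below is an illustrative simplification of RepAHR's actual filtering rules:

```python
from collections import Counter

def high_freq_reads(reads, k, min_count):
    """Count all k-mers across the reads, then keep reads that contain
    at least one k-mer occurring >= min_count times (a simplified
    version of the high-frequency-read filtering step)."""
    counts = Counter(kmer for r in reads
                     for kmer in (r[i:i + k] for i in range(len(r) - k + 1)))
    frequent = {km for km, c in counts.items() if c >= min_count}
    return [r for r in reads
            if any(r[i:i + k] in frequent for i in range(len(r) - k + 1))]

# toy reads: "ACGT" occurs 4 times in total, every other 4-mer fewer
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGTTTTT"]
result = high_freq_reads(reads, k=4, min_count=4)
print(result)  # → ['ACGTACGT', 'ACGTACGA', 'ACGTTTTT']
```

The retained reads would then be handed to an assembler such as SPAdes, as the abstract describes.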
  • Semantic similarity and machine learning with ontologies.

    Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Briefings in bioinformatics, Oxford University Press (OUP), 2020-10-13) [Article]
    Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies have increasingly been used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
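One classic measure of the kind surveyed here is Resnik similarity: the information content (IC) of the most informative common ancestor of two terms. A sketch over a hypothetical mini-ontology and annotation corpus (both invented for illustration):

```python
import math

# Toy ontology as child -> parents edges (a tiny GO-like DAG).
parents = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein_binding": ["binding"],
    "dna_binding": ["binding"],
}

def ancestors(term):
    """All ancestors of a term, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

# Hypothetical annotation corpus used to estimate term probabilities.
annotations = [["protein_binding"], ["dna_binding"],
               ["protein_binding"], ["molecular_function"]]

def ic(term):
    # IC(t) = -log p(t); p counts entities annotated to t or a descendant
    n = sum(1 for ann in annotations
            if any(term in ancestors(a) for a in ann))
    return -math.log(n / len(annotations))

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(ic(t) for t in ancestors(t1) & ancestors(t2))

print(round(resnik("protein_binding", "dna_binding"), 3))  # → 0.288
```

Ontology embeddings, the other family discussed in the review, instead map terms into vector spaces where such similarities become geometric.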
  • A unified linear convergence analysis of k-SVD

    Xu, Zhiqiang; Ke, Yiping; Cao, Xin; Zhou, Chunlai; Wei, Pengfei; Gao, Xin (Memetic Computing, Springer Science and Business Media LLC, 2020-10-12) [Article]
    Eigenvector computation, e.g., k-SVD for finding top-k singular subspaces, is often of central importance to many scientific and engineering tasks. There has recently been resurgent interest in analyzing relevant methods in terms of singular value gap dependence. Particularly, when the gap vanishes, the convergence of k-SVD is considered to be capped by a gap-free sub-linear rate. We argue in this work, both theoretically and empirically, that this is not necessarily the case, refreshing our understanding of this significant problem. Specifically, we leverage the recently proposed structured gap in a careful analysis to establish a unified linear convergence of k-SVD to one of the ground-truth solutions, regardless of the given target matrix and target rank k. The theoretical results are evaluated and verified by experiments on synthetic and real data.
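In its simplest form, the k-SVD iteration analyzed here is orthogonal (block power) iteration; a minimal sketch on a matrix with known singular values (the paper's contribution is the convergence analysis, not the iteration itself):

```python
import numpy as np

def topk_svd(A, k, iters=200, seed=0):
    """Orthogonal (block power) iteration: repeatedly multiply by
    A^T A and re-orthonormalize with QR, converging to the top-k
    right singular subspace of A."""
    rng = np.random.default_rng(seed)
    V = np.linalg.qr(rng.standard_normal((A.shape[1], k)))[0]
    for _ in range(iters):
        V = np.linalg.qr(A.T @ (A @ V))[0]
    sigma = np.linalg.norm(A @ V, axis=0)  # singular value estimates
    return V, sigma

A = np.diag([5.0, 3.0, 1.0])  # singular values 5, 3, 1 by construction
V, sigma = topk_svd(A, k=2)
print(np.round(sorted(sigma, reverse=True), 6))  # ≈ [5. 3.]
```

The singular value gap (here 3 vs. 1) governs how fast this loop converges, which is exactly the dependence the paper revisits.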
  • Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data.

    Wang, Chunxiang; Gao, Xin; Liu, Juntao (BMC bioinformatics, Springer Science and Business Media LLC, 2020-10-07) [Article]
    BACKGROUND: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. RESULTS: We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used algorithm for single-cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. CONCLUSION: The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering with SC3. It is expected to play a crucial role in related studies of single-cell clustering, such as studies of human complex diseases and discoveries of new cell types.
  • Automatic and Interpretable Model for Periodontitis Diagnosis in Panoramic Radiographs

    Li, Haoyang; Zhou, Juexiao; Zhou, Yi; Chen, Jieyu; Gao, Feng; Xu, Ying; Gao, Xin (Springer International Publishing, 2020-09-29) [Conference Paper]
    Periodontitis is a prevalent and irreversible chronic inflammatory disease in both developed and developing countries, affecting about 20%-50% of the global population. A tool for automatically diagnosing periodontitis is in high demand for screening at-risk people, and its early detection could prevent the onset of tooth loss, especially in local communities and health care settings with limited dental professionals. In the medical field, doctors need to understand and trust the decisions made by computational models, so developing interpretable machine learning models is crucial for disease diagnosis. Based on these considerations, we propose an interpretable machine learning method called Deetal-Perio to predict the severity degree of periodontitis in dental panoramic radiographs. In our method, alveolar bone loss (ABL), the clinical hallmark for periodontitis diagnosis, can be interpreted as the key feature. To calculate ABL, we also propose a method for teeth numbering and segmentation. First, Deetal-Perio segments and indexes each individual tooth via Mask R-CNN combined with a novel calibration method. Next, Deetal-Perio segments the contour of the alveolar bone and calculates a ratio for each individual tooth to represent ABL. Finally, Deetal-Perio predicts the severity degree of periodontitis given the ratios of all the teeth. The entire architecture not only outperforms state-of-the-art methods and shows robustness on two data sets in both the periodontitis prediction and the teeth numbering and segmentation tasks, but is also interpretable, allowing doctors to understand why Deetal-Perio works so well.
  • AttPNet: Attention-Based Deep Neural Network for 3D Point Set Analysis

    Yang, Yufeng; Ma, Yixiao; Zhang, Jing; Gao, Xin; Xu, Min (Sensors, MDPI AG, 2020-09-23) [Article]
    The point set is a major type of 3D structure representation format characterized by its data availability and compactness. Most previous deep learning-based point set models pay equal attention to different point set regions and channels, and thus have limited ability to focus on small regions and specific channels that are important for characterizing the object of interest. In this paper, we introduce a novel model named Attention-based Point Network (AttPNet). It uses an attention mechanism for both global feature masking and channel weighting to focus on characteristic regions and channels. There are two branches in our model. The first branch calculates an attention mask for every point. The second branch uses convolution layers to abstract global features from point sets, where a channel attention block is adopted to focus on important channels. Evaluations on the ModelNet40 benchmark dataset show that our model outperforms the existing best model in classification tasks by 0.7% without voting. In addition, experiments on augmented data demonstrate that our model is robust to rotational perturbations and missing points. We also design an Electron Cryo-Tomography (ECT) point cloud dataset and further demonstrate our model's ability to deal with fine-grained structures on the ECT dataset.
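The attention-mask idea — reweighting per-point features with a softmax over learned scores before global aggregation — can be sketched as follows (a generic sketch, not AttPNet's exact architecture; in the real model both the features and the scores come from learned layers):

```python
import numpy as np

def attention_pool(features, scores):
    """Attention-weighted global pooling over a point set: a softmax
    over per-point scores reweights each point's feature vector before
    summation, instead of plain max/mean pooling."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return (w[:, None] * features).sum(axis=0)

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([10.0, -10.0, -10.0])  # first point dominates
pooled = attention_pool(feats, scores)
print(np.round(pooled, 3))  # ≈ [1. 0.]
```

With uniform scores this reduces to mean pooling; sharply peaked scores let the model focus on the small, characteristic regions the abstract mentions.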
  • Spark-based parallel calculation of 3D Fourier shell correlation for macromolecule structure local resolution estimation

    Lü, Yongchun; Zeng, Xiangrui; Tian, Xinhui; Shi, Xiao; Wang, Hui; Zheng, Xiaohui; Liu, Xiaodong; Zhao, Xiaofang; Gao, Xin; Xu, Min (BMC Bioinformatics, Springer Science and Business Media LLC, 2020-09-17) [Article]
    Background Resolution estimation is the main evaluation criterion for the reconstruction of macromolecular 3D structure in the field of cryo-electron microscopy (cryo-EM). At present, there are many methods to evaluate the 3D resolution of reconstructed macromolecular structures from Single Particle Analysis (SPA) in cryo-EM and subtomogram averaging (SA) in electron cryotomography (cryo-ET). As global methods, they measure the resolution of the structure as a whole, but they are inaccurate in detecting subtle local changes of the reconstruction. In order to detect such subtle changes in the reconstructions of SPA and SA, a few local resolution methods have been proposed. The mainstream local resolution evaluation methods are based on local Fourier shell correlation (FSC), which is computationally intensive. Moreover, the existing resolution evaluation methods are based on multi-threaded implementations on a single computer, with very poor scalability. Results This paper proposes a new fine-grained 3D array partition method using the key-value format in Spark. Our method first converts 3D images to key-value (K-V) data. The K-V data are then used for 3D array partitioning and data exchange in parallel, so the Spark-based distributed parallel computing framework can solve the above scalability problem. In this distributed computing framework, all 3D local FSC tasks are calculated simultaneously across multiple nodes in a computer cluster. Calculations on experimental data show that the 3D local resolution evaluation algorithm based on the Spark fine-grained 3D array partition achieves an order-of-magnitude speedup over the mainstream FSC algorithm while the accuracy remains unchanged, and has better fault tolerance and scalability. Conclusions In this paper, we proposed a K-V-format-based fine-grained 3D array partition method in Spark to calculate 3D local FSC in parallel and obtain a 3D local resolution density map. The 3D local resolution density map evaluates the three-dimensional density maps reconstructed from single particle analysis and subtomogram averaging. Our proposed method can significantly increase the speed of the 3D local resolution evaluation, which is important for the efficient detection of subtle variations among reconstructed macromolecular structures.
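A serial NumPy sketch of the Fourier shell correlation at the core of the method — the Spark version distributes this same per-shell work across cluster nodes; the shell-binning details here are illustrative:

```python
import numpy as np

def fsc(v1, v2, n_shells=8):
    """Fourier shell correlation between two cubic 3D density maps:
    correlate their Fourier coefficients within concentric frequency
    shells, returning one correlation value per shell."""
    f1, f2 = np.fft.fftn(v1), np.fft.fftn(v2)
    freq = np.fft.fftfreq(v1.shape[0])
    gx, gy, gz = np.meshgrid(freq, freq, freq, indexing="ij")
    r = np.sqrt(gx**2 + gy**2 + gz**2)
    shell = np.minimum((r / r.max() * n_shells).astype(int), n_shells - 1)
    curve = []
    for s in range(n_shells):
        m = shell == s
        num = np.abs((f1[m] * np.conj(f2[m])).sum())
        den = np.sqrt((np.abs(f1[m]) ** 2).sum() * (np.abs(f2[m]) ** 2).sum())
        curve.append(num / den)
    return np.array(curve)

rng = np.random.default_rng(0)
v = rng.standard_normal((16, 16, 16))
identical = fsc(v, v)  # identical maps correlate perfectly in every shell
```

Local FSC repeats this computation inside a window around every voxel, which is what makes the problem expensive enough to justify a distributed implementation.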
  • Efficient locality-sensitive hashing over high-dimensional streaming data

    Wang, Hao; Yang, Chengcheng; Zhang, Xiangliang; Gao, Xin (Neural Computing and Applications, Springer Science and Business Media LLC, 2020-09-17) [Article]
    Approximate nearest neighbor (ANN) search in high-dimensional spaces is fundamental in many applications. Locality-sensitive hashing (LSH) is a well-known methodology to solve the ANN problem. Existing LSH-based ANN solutions typically employ a large number of individual indexes optimized for searching efficiency. Updating such indexes might be impractical when processing high-dimensional streaming data. In this paper, we present a novel disk-based LSH index that offers efficient support for both searches and updates. The contributions of our work are threefold. First, we use the write-friendly LSM-trees to store the LSH projections to facilitate efficient updates. Second, we develop a novel estimation scheme to estimate the number of required LSH functions, with which the disk storage and access costs are effectively reduced. Third, we exploit both the collision number and the projection distance to improve the efficiency of candidate selection, improving the search performance with theoretical guarantees on the result quality. Experiments on four real-world datasets show that our proposal outperforms the state-of-the-art schemes.
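A minimal in-memory sketch of random-hyperplane LSH with collision counting for candidate selection (the paper's index is disk-based and LSM-tree-backed, and its estimation scheme sizes the hash family; all parameters here are illustrative):

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Random-hyperplane LSH over multiple hash tables; candidates are
    ranked by collision count (more tables agreeing -> better candidate),
    with the true distance as a tie-breaker."""
    def __init__(self, dim, n_tables=8, n_bits=6, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = []

    def _keys(self, x):
        # one sign-pattern signature per table
        return [tuple((p @ x > 0).astype(int)) for p in self.planes]

    def insert(self, x):
        self.points.append(x)
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(len(self.points) - 1)

    def query(self, q):
        hits = defaultdict(int)
        for table, key in zip(self.tables, self._keys(q)):
            for idx in table[key]:
                hits[idx] += 1
        return sorted(hits, key=lambda i: (-hits[i],
                      float(np.linalg.norm(self.points[i] - q))))

rng = np.random.default_rng(1)
index = LSHIndex(dim=32)
data = rng.standard_normal((100, 32))
for x in data:
    index.insert(x)
q = data[7] + 0.001 * rng.standard_normal(32)  # a near-duplicate query
nearest = index.query(q)[0]
```

A streaming setting adds the hard part the paper addresses: keeping such tables cheap to update on disk while preserving search efficiency.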
  • Integrated Metabolic Modeling, Culturing, and Transcriptomics Explain Enhanced Virulence of Vibrio cholerae during Coinfection with Enterotoxigenic Escherichia coli.

    Abdel-Haleem, Alyaa M.; Ravikumar, Vaishnavi; Ji, Boyang; Mineta, Katsuhiko; Gao, Xin; Nielsen, J.; Gojobori, Takashi; Mijakovic, Ivan (mSystems, American Society for Microbiology, 2020-09-08) [Article]
    Gene essentiality is altered during polymicrobial infections. Nevertheless, most studies rely on single-species infections to assess pathogen gene essentiality. Here, we use genome-scale metabolic models (GEMs) to explore the effect of coinfection of the diarrheagenic pathogen Vibrio cholerae with another enteric pathogen, enterotoxigenic Escherichia coli (ETEC). Model predictions showed that V. cholerae metabolic capabilities were increased due to ample cross-feeding opportunities enabled by ETEC. This is in line with increased severity of cholera symptoms known to occur in patients with dual infections by the two pathogens. In vitro coculture systems confirmed that V. cholerae growth is enhanced in cocultures relative to single cultures. Further, expression levels of several V. cholerae metabolic genes were significantly perturbed as shown by dual RNA sequencing (RNAseq) analysis of its cocultures with different ETEC strains. A decrease in ETEC growth was also observed, probably mediated by nonmetabolic factors. Single gene essentiality analysis predicted conditionally independent genes that are essential for the pathogen's growth in both single-infection and coinfection scenarios. Our results reveal growth differences that are of relevance to drug targeting and efficiency in polymicrobial infections. IMPORTANCE Most studies proposing new strategies to manage and treat infections have been largely focused on identifying druggable targets that can inhibit a pathogen's growth when it is the single cause of infection. In vivo, however, infections can be caused by multiple species. This is important to take into account when attempting to develop or use current antibacterials since their efficacy can change significantly between single infections and coinfections. In this study, we used genome-scale metabolic models (GEMs) to interrogate the growth capabilities of Vibrio cholerae in single infections and coinfections with enterotoxigenic E. coli (ETEC), which cooccur in a large fraction of diarrheagenic patients. Coinfection model predictions showed that V. cholerae growth capabilities are enhanced in the presence of ETEC relative to V. cholerae single infection, through cross-fed metabolites made available to V. cholerae by ETEC. In vitro, cocultures of the two enteric pathogens further confirmed model predictions showing an increased growth of V. cholerae in coculture relative to V. cholerae single cultures while ETEC growth was suppressed. Dual RNAseq analysis of the cocultures also confirmed that the transcriptome of V. cholerae was distinct during coinfection compared to single-infection scenarios where processes related to metabolism were significantly perturbed. Further, in silico gene-knockout simulations uncovered discrepancies in gene essentiality for V. cholerae growth between single infections and coinfections. Integrative model-guided analysis thus identified druggable targets that would be critical for V. cholerae growth in both single infections and coinfections; thus, designing inhibitors against those targets would provide a broader spectrum of coverage against cholera infections.
  • Long-read individual-molecule sequencing reveals CRISPR-induced genetic heterogeneity in human ESCs

    Bi, Chongwei; Wang, Lin; Yuan, Baolei; Zhou, Xuan; Li, Yu; Wang, Sheng; Pang, Yuhong; Gao, Xin; Huang, Yanyi; Li, Mo (Genome Biology, Springer Science and Business Media LLC, 2020-08-24) [Article]
    Quantifying the genetic heterogeneity of a cell population is essential to the understanding of biological systems. We develop a universal method to label individual DNA molecules for single-base-resolution, haplotype-resolved, quantitative characterization of diverse types of rare variants, with frequency as low as 4 × 10⁻⁵, using both short-read and long-read sequencing platforms. It provides the first quantitative evidence of persistent nonrandom large structural variants and an increase in single-nucleotide variants at the on-target locus following repair of double-strand breaks induced by CRISPR-Cas9 in human embryonic stem cells.
  • TransBorrow: genome-guided transcriptome assembly by borrowing assemblies from different assemblers.

    Yu, Ting; Mu, Zengchao; Fang, Zhaoyuan; Liu, Xiaoping; Gao, Xin; Liu, Juntao (Genome research, Cold Spring Harbor Laboratory, 2020-08-17) [Article]
    RNA-seq technology is widely used in various transcriptomic studies and provides great opportunities to reveal the complex structures of transcriptomes. To effectively analyze RNA-seq data, we introduce a novel transcriptome assembler, TransBorrow, which borrows the assemblies from different assemblers to search for reliable subsequences by building a colored graph from those borrowed assemblies. Then, by seeding reliable subsequences, a newly designed path extension strategy accurately searches for a transcript-representing path cover over each splicing graph. TransBorrow was tested on both simulated and real data sets and showed great superiority over all the compared leading assemblers.
  • A Data Science Approach to Estimate Enthalpy of Formation of Cyclic Hydrocarbons

    Yalamanchi, Kiran K.; Monge Palacios, Manuel; van Oudenhoven, Vincent C.O.; Gao, Xin; Sarathy, Mani (The Journal of Physical Chemistry A, American Chemical Society (ACS), 2020-07-10) [Article]
    In spite of the increasing importance of cyclic hydrocarbons in various chemical systems, fundamental properties of these compounds, such as enthalpy of formation, are still scarce. One of the reasons for this is that the estimation of thermodynamic properties of cyclic hydrocarbon species via cost-effective computational approaches, such as group additivity (GA), has several limitations and challenges. In this study, a machine learning (ML) approach using the support vector regression (SVR) algorithm is proposed to predict the standard enthalpy of formation of cyclic hydrocarbon species. The model is developed based on a thoroughly selected dataset of accurate experimental values for 192 species collected from the literature. The molecular descriptors used as input to the SVR are calculated via the alvaDesc software, which computes in total 5255 features classified into 30 categories. The developed SVR model has an average error of approximately 10 kJ/mol. In comparison, the SVR model outperforms the GA approach for complex molecules and can therefore be proposed as a novel data-driven approach to estimate enthalpy values for complex cyclic species. A sensitivity analysis is also conducted to examine the relevant features that affect the standard enthalpy of formation of cyclic species. Our species dataset is expected to be updated and expanded as new data become available, in order to develop a more accurate SVR model with broader applicability.
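A full SVR solver is beyond a short sketch, so the snippet below substitutes its close cousin, RBF kernel ridge regression, to show the descriptor-to-enthalpy regression setup; the 2D descriptors and target values are made up for illustration, not alvaDesc features or the paper's dataset:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """RBF (Gaussian) kernel matrix between two sets of descriptors."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class KernelRidgeSketch:
    """Like SVR, this learns f(x) = sum_i a_i k(x_i, x); it uses a
    squared loss instead of SVR's epsilon-insensitive loss, which
    keeps the solver to one linear system."""
    def __init__(self, alpha=1e-4, gamma=0.5):
        self.alpha, self.gamma = alpha, gamma

    def fit(self, X, y):
        self.X_train = X
        K = rbf_kernel(X, X, self.gamma)
        self.coef = np.linalg.solve(K + self.alpha * np.eye(len(X)), y)
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X_train, self.gamma) @ self.coef

# Made-up 2D "descriptors" and enthalpy targets (kJ/mol), for shape only.
X = np.array([[3.0, 0.0], [4.0, 0.0], [5.0, 0.0], [6.0, 1.0]])
y = np.array([53.3, 28.4, -76.4, -123.4])
model = KernelRidgeSketch().fit(X, y)
pred = model.predict(X)
```

The real pipeline differs mainly in scale: thousands of alvaDesc descriptors, a curated 192-species dataset, and a tuned SVR in place of this ridge stand-in.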
  • DTiGEMS+: drug–target interaction prediction using graph embedding, graph mining, and similarity-based techniques.

    Thafar, Maha A.; Olayan, Rawan S.; Ashoor, Haitham; Albaradei, Somayah; Bajic, Vladimir B.; Gao, Xin; Gojobori, Takashi; Essack, Magbubah (Journal of Cheminformatics, Springer Science and Business Media LLC, 2020-06-29) [Article]
    In silico prediction of drug–target interactions is a critical phase in the sustainable drug development process, especially when the research focus is to capitalize on the repositioning of existing drugs. Developing such computational methods is not an easy task, but it is much needed, as current methods that predict potential drug–target interactions suffer from high false-positive rates. Here we introduce DTiGEMS+, a computational method that predicts Drug–Target interactions using Graph Embedding, graph Mining, and Similarity-based techniques. DTiGEMS+ combines similarity-based as well as feature-based approaches, and models the identification of novel drug–target interactions as a link prediction problem in a heterogeneous network. DTiGEMS+ constructs the heterogeneous network by augmenting the known drug–target interactions graph with two other complementary graphs, namely a drug–drug similarity graph and a target–target similarity graph. DTiGEMS+ combines different computational techniques to provide the final drug–target prediction; these techniques include graph embeddings, graph mining, and machine learning. DTiGEMS+ integrates multiple drug–drug similarities and target–target similarities into the final heterogeneous graph construction after applying a similarity selection procedure as well as a similarity fusion algorithm. Using four benchmark datasets, we show that DTiGEMS+ substantially improves prediction performance compared to other state-of-the-art in silico methods developed to predict drug–target interactions, achieving the highest average AUPR across all datasets (0.92), which reduces the error rate by 33.3% relative to the second-best performing model among the compared state-of-the-art methods.
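The link-prediction view can be illustrated with one simple graph-mining feature: scoring an unknown drug–target pair by similarity-weighted meta-paths through known interactions. DTiGEMS+ combines many such features with learned embeddings; the matrices below are toy stand-ins:

```python
import numpy as np

# Toy heterogeneous network: known drug-target interactions plus
# drug-drug and target-target similarity graphs.
dti = np.array([[1, 0],    # drug 0 binds target 0
                [0, 1],    # drug 1 binds target 1
                [0, 0]])   # drug 2: no known interactions
drug_sim = np.array([[1.0, 0.1, 0.9],
                     [0.1, 1.0, 0.1],
                     [0.9, 0.1, 1.0]])
target_sim = np.eye(2)

def meta_path_scores(dti, drug_sim, target_sim):
    """Score every drug-target pair by the meta-path
    drug -sim- drug -interacts- target -sim- target,
    i.e. a similarity-weighted count of supporting paths."""
    return drug_sim @ dti @ target_sim

s = meta_path_scores(dti, drug_sim, target_sim)
# drug 2 is very similar to drug 0, so target 0 outranks target 1 for it
print(s[2])  # → [0.9 0.1]
```

Ranking these scores for unobserved pairs is the link-prediction step; the full method additionally fuses multiple similarity matrices before building the graph.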
  • Modern Deep Learning in Bioinformatics.

    Li, Haoyang; Tian, Shuye; Li, Yu; Fang, Qiming; Tan, Renbo; Pan, Yijie; Huang, Chao; Xu, Ying; Gao, Xin (Journal of molecular cell biology, Oxford University Press (OUP), 2020-06-23) [Article]
    Deep learning (DL) has shown explosive growth in its application to bioinformatics and has demonstrated highly promising power to mine the complex relationships hidden in large-scale biological and biomedical data. A number of comprehensive reviews have been published on such applications, ranging from high-level reviews with future perspectives to those mainly serving as tutorials. These reviews have provided an excellent introduction to and guideline for applications of DL in bioinformatics, covering multiple types of machine learning (ML) problems, different DL architectures, and a range of biological/biomedical problems. However, most of these reviews have focused on previous research, whereas discussions of current trends in the principled DL field and perspectives on future developments and potential new applications to biology and biomedicine are still scarce. We will focus on modern DL, the ongoing trends and future directions of the principled DL field, and postulate new and major applications in bioinformatics.
  • Analysis of transcript-deleterious variants in Mendelian disorders: implications for RNA-based diagnostics.

    Maddirevula, Sateesh; Kuwahara, Hiroyuki; Ewida, Nour; Shamseldin, Hanan E; Patel, Nisha; AlZahrani, Fatema; AlSheddi, Tarfa; AlObeid, Eman; Alenazi, Mona; Alsaif, Hessa S; Alqahtani, Maha; AlAli, Maha; Al Ali, Hatoon; Helaby, Rana; Ibrahim, Niema; Abdulwahab, Firdous; Hashem, Mais; Hanna, Nadine; Monies, Dorota; Derar, Nada; Alsagheir, Afaf; Alhashem, Amal; Alsaleem, Badr; Alhebbi, Hamoud; Wali, Sami; Umarov, Ramzan; Gao, Xin; Alkuraya, Fowzan S. (Genome biology, Springer Science and Business Media LLC, 2020-06-17) [Article]
    BACKGROUND: At least 50% of patients with suspected Mendelian disorders remain undiagnosed after whole-exome sequencing (WES), and the extent to which non-coding variants that are not captured by WES contribute to this fraction is unclear. Whole transcriptome sequencing is a promising supplement to WES, although empirical data on the contribution of RNA analysis to the diagnosis of Mendelian diseases on a large scale are scarce. RESULTS: Here, we describe our experience with transcript-deleterious variants (TDVs) based on a cohort of 5647 families with suspected Mendelian diseases. We first interrogate all families for which the respective Mendelian phenotype could be mapped to a single locus to obtain an unbiased estimate of the contribution of TDVs at 18.9%. We examine the entire cohort and find that TDVs account for 15% of all "solved" cases. We compare the results of RT-PCR to in silico prediction. Definitive results from RT-PCR are obtained from blood-derived RNA for the overwhelming majority of variants (84.1%), and only a small minority (2.6%) fail analysis on all available RNA sources (blood-, skin fibroblast-, and urine renal epithelial cell-derived), which has important implications for the clinical application of RNA-seq. We also show that RNA analysis can establish the diagnosis in 13.5% of 155 patients who had received "negative" clinical WES reports. Finally, our data suggest a role for TDVs in modulating penetrance even in otherwise highly penetrant Mendelian disorders. CONCLUSIONS: Our results provide much needed empirical data for the impending implementation of diagnostic RNA-seq in conjunction with genome sequencing.
  • A Rapid, Accurate and Machine-agnostic Segmentation and Quantification Method for CT-based COVID-19 Diagnosis

    Zhou, Longxi; Li, Zhongxiao; Zhou, Juexiao; Li, Haoyang; Chen, Yupeng; Huang, Yuxin; Xie, Dexuan; Zhao, Lintao; Fan, Ming; Hashmi, Shahrukh; AbdelKareem, Faisal; Eiada, Riham; Xiao, Xigang; Li, Lihua; Qiu, Zhaowen; Gao, Xin (IEEE Transactions on Medical Imaging, IEEE, 2020-06-11) [Article]
    COVID-19 has caused a global pandemic and become the most urgent threat to the entire world. Tremendous efforts and resources have been invested in developing diagnosis, prognosis and treatment strategies to combat the disease. Although nucleic acid detection has mainly been used as the gold standard to confirm this RNA virus-based disease, it has been shown that such a strategy has a high false negative rate, especially for patients in the early stage, and thus CT imaging has been applied as a major diagnostic modality in confirming positive COVID-19. Despite the various, urgent advances in developing artificial intelligence (AI)-based computer-aided systems for CT-based COVID-19 diagnosis, most of the existing methods can only perform classification, whereas the state-of-the-art segmentation method requires a high level of human intervention. In this paper, we propose a fully automatic, rapid, accurate, and machine-agnostic method that can segment and quantify the infection regions on CT scans from different sources. Our method is founded upon two innovations: 1) the first CT scan simulator for COVID-19, which fits the dynamic change of real patients' data measured at different time points and thereby greatly alleviates the data scarcity issue; and 2) a novel deep learning algorithm to solve the large-scene-small-object problem, which decomposes the 3D segmentation problem into three 2D ones, and thus reduces the model complexity.
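    The decomposition in innovation 2) can be sketched in a few lines: slice the 3D volume into 2D problems along each of the three axes, run a 2D predictor per slice, and fuse the three per-axis predictions back into a 3D mask by averaging. This is only a minimal illustration of the general technique, not the authors' actual network; `predict_slice` below is a stand-in toy model (a simple intensity threshold), assumed for demonstration.

    ```python
    import numpy as np

    def predict_slice(slice_2d):
        # Stand-in for a trained 2D segmentation network: here we
        # simply threshold intensities to produce a toy binary mask.
        return (slice_2d > 0.5).astype(np.float32)

    def segment_3d_via_2d(volume):
        """Decompose a 3D segmentation into three stacks of 2D problems
        (one per axis), then fuse the predictions by averaging."""
        fused = np.zeros(volume.shape, dtype=np.float32)
        for axis in range(3):
            # Bring the slicing axis to the front, predict slice by slice.
            moved = np.moveaxis(volume, axis, 0)
            pred = np.stack([predict_slice(s) for s in moved])
            fused += np.moveaxis(pred, 0, axis)
        return fused / 3.0  # average of the three per-axis predictions

    volume = np.random.rand(8, 8, 8)
    mask = segment_3d_via_2d(volume)  # same shape as the input volume
    ```

    In practice the three 2D models would be trained networks and the fusion step can be weighted or learned; the averaging here just shows why the model complexity drops from one 3D problem to three 2D ones.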
  • A self-adaptive deep learning algorithm for accelerating multi-component flash calculation

    Zhang, Tao; Li, Yu; Li, Yiteng; Sun, Shuyu; Gao, Xin (Computer Methods in Applied Mechanics and Engineering, Elsevier BV, 2020-06-11) [Article]
    In this paper, the first self-adaptive deep learning algorithm is proposed in detail to accelerate flash calculations; it quantitatively predicts the total number of phases in the mixture and the related thermodynamic properties at equilibrium for realistic reservoir fluids with a large number of components under various environmental conditions. A thermodynamically consistent scheme for phase equilibrium calculation is adopted and implemented at specified moles, volume, and temperature, and the flash results are used as the ground truth for training and testing the deep neural network. The critical properties of each component are taken as the input features of the neural network, and the final output is the total number of phases at equilibrium together with the molar composition of each phase. Two network structures are designed, one of which transforms the inputs of varying numbers of components in the training data and the target fluid mixture into a unified space before entering the main neural network. "Ghost components" are introduced to pad the input flash calculation data so that its dimension meets the training and testing requirements of the target fluid mixture. Hyperparameters of both neural networks are carefully tuned to ensure that the physical correlations underlying the input parameters are preserved through the learning process. This combined structure makes the deep learning algorithm self-adaptive to changes in the input components and dimensions. Furthermore, two Softmax functions are used in the last layer to enforce the constraint that the mole fractions in each phase sum to 1. In a presented example, the flash calculation results of an 8-component Eagle Ford oil are used as input to estimate the phase equilibrium state of a 14-component Eagle Ford oil, with very small estimation errors. It is also verified that the proposed deep learning algorithm simultaneously completes the phase stability test and the phase splitting calculation. Remarks at the end provide guidance for further research in this direction, especially the potential application of newly developed neural network models.
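    Two of the ideas above — zero-padding with "ghost components" to unify the input dimension, and a per-phase Softmax to enforce that mole fractions sum to 1 — can be sketched as follows. This is a minimal illustration under assumed details (a fixed 14-component unified dimension, random stand-in features and logits), not the authors' trained network.

    ```python
    import numpy as np

    MAX_COMPONENTS = 14  # unified input dimension, assumed for illustration

    def pad_with_ghosts(features):
        """Pad per-component feature rows with zero-valued 'ghost
        components' so mixtures of any size share one input shape."""
        n, d = features.shape
        padded = np.zeros((MAX_COMPONENTS, d))
        padded[:n] = features
        return padded

    def softmax(x):
        # Numerically stable softmax: outputs are positive and sum to 1,
        # which is exactly the mole-fraction constraint on each phase.
        e = np.exp(x - x.max())
        return e / e.sum()

    # Toy 8-component mixture: one feature row per component
    # (e.g. critical temperature, critical pressure, acentric factor).
    features = np.random.rand(8, 3)
    unified = pad_with_ghosts(features)        # shape (14, 3)

    # Stand-in network logits for two phases; one Softmax per phase
    # turns them into valid molar compositions.
    vapor_logits = np.random.randn(MAX_COMPONENTS)
    liquid_logits = np.random.randn(MAX_COMPONENTS)
    vapor_x = softmax(vapor_logits)            # sums to 1
    liquid_x = softmax(liquid_logits)          # sums to 1
    ```

    The design point is that padding happens at the data level while the constraint is enforced at the output layer, so the same network can be reused across mixtures with different numbers of components.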

View more