For more information visit: https://sfb.kaust.edu.sa/Pages/Home.aspx

Recent Submissions

  • msRepDB: a comprehensive repetitive sequence database of over 80 000 species.

    Liao, Xingyu; Hu, Kang; Salhi, Adil; Zou, You; Wang, Jianxin; Gao, Xin (Nucleic acids research, Oxford University Press (OUP), 2021-12-01) [Article]
    Repeats are prevalent in the genomes of all bacteria, plants and animals, and they cover nearly half of the human genome. They play indispensable roles in evolution, inheritance, variation and genomic instability, and serve as substrates for chromosomal rearrangements that include disease-causing deletions, inversions, and translocations. Comprehensive identification, classification and annotation of repeats in genomes can provide accurate and targeted solutions towards understanding and diagnosis of complex diseases, optimization of plant properties and development of new drugs. RepBase and Dfam are the two most frequently used repeat databases, but they are not sufficiently complete. Due to the lack of a comprehensive repeat database of multiple species, the current research in this field is far from being satisfactory. LongRepMarker is a new framework developed recently by our group for comprehensive identification of genomic repeats. We here propose msRepDB based on LongRepMarker, which is currently the most comprehensive multi-species repeat database, covering >80 000 species. Comprehensive evaluations show that msRepDB contains more species, and more complete repeats and families than the RepBase and Dfam databases (https://msrepdb.cbrc.kaust.edu.sa/pages/msRepDB/index.html).
  • Radiogenomic Signatures of Oncotype DX Recurrence Score Enable Prediction of Survival in Estrogen Receptor–Positive Breast Cancer: A Multicohort Study

    Fan, Ming; Cui, Yajing; You, Chao; Liu, Li; Gu, Yajia; Peng, Weijun; Bai, Qianming; Gao, Xin; Li, Lihua (Radiology, Radiological Society of North America (RSNA), 2021-11-30) [Article]
    Radiogenomic signatures associated with genomic assays (Oncotype DX) were identified as independent predictors after adjusting for clinical factors for survival and neoadjuvant chemotherapy response in estrogen receptor–positive breast cancer.
  • Critical role of backbone coordination in the mRNA recognition by RNA induced silencing complex

    Zhu, Lizhe; Jiang, Hanlun; Cao, Siqin; Unarta, Ilona Christy; Gao, Xin; Huang, Xuhui (Communications Biology, Springer Science and Business Media LLC, 2021-11-30) [Article]
    Despite its functional importance, the molecular mechanism underlying target mRNA recognition by Argonaute (Ago) remains largely elusive. Based on extensive all-atom molecular dynamics simulations, we constructed a quasi-Markov State Model (qMSM) to reveal the dynamics during recognition at position 6-7 in the seed region of human Argonaute 2 (hAgo2). Interestingly, we found that the slowest mode of motion therein is not the gRNA-target base-pairing, but the coordination of the target phosphate groups with a set of positively charged residues of hAgo2. Moreover, the ability of Helix-7 to approach the PIWI and MID domains was found to reduce the effective volume accessible to the target mRNA and therefore facilitate both the backbone coordination and base-pair formation. Further mutant simulations revealed that alanine mutation of the D358 residue on Helix-7 enhanced a trap state to slow down the loading of target mRNA. A similar trap state was also observed when wobble pairs were introduced in g6 and g7, indicating the role of Helix-7 in suppressing non-canonical base-pairing. Our study pointed to a general mechanism for mRNA recognition by eukaryotic Agos and demonstrated the promise of qMSM in investigating complex conformational changes of biomolecular systems.
  • A deep matrix factorization framework for identifying underlying tissue-specific patterns of DCE-MRI: applications for molecular subtype classification in breast cancer

    Fan, Ming; Yuan, Wei; Liu, Weifen; Gao, Xin; Xu, Maosheng; Wang, Shiwei; Li, Lihua (Physics in Medicine & Biology, IOP Publishing, 2021-11-17) [Article]
    Objective: Breast cancer is heterogeneous in that different angiogenesis and blood flow characteristics could be present within a tumor. The pixel kinetics of DCE-MRI can assume several distinct signal patterns related to specific tissue characteristics. Identification of the latent, tissue-specific dynamic patterns of intratumor heterogeneity can shed light on the biological mechanisms underlying the heterogeneity of tumors. Approach: To mine this information, we propose a deep matrix factorization-based dynamic decomposition (DMFDE) model specifically designed according to DCE-MRI characteristics. The time-series imaging data were decomposed into tissue-specific dynamic patterns and their corresponding proportion maps. The image pixel matrix and the reference matrix of population-level kinetics obtained by clustering the dynamic signals were used as the inputs. Two multilayer neural network branches were designed to collaboratively project the input matrix into a latent dynamic pattern and a dynamic proportion matrix, which was justified using simulated data. Clinical implications of DMFDE were assessed by radiomics analysis of proportion maps obtained from the tumor/parenchyma region for classifying the luminal A subtype. Main results: The decomposition performance of DMFDE was evaluated by the root mean square error (RMSE) and was shown to be better than that of the conventional convex analysis of mixtures (CAM) method. The predictive model with K = 3, 4, and 5 dynamic proportion maps generated AUC values of 0.780, 0.786 and 0.790, respectively, in distinguishing between luminal A and nonluminal A tumors, which are better than the CAM method (AUC = 0.726). The combination of statistical features from images with different proportion maps has the highest prediction value (AUC = 0.813), which is significantly higher than that based on CAM.
    Conclusion: This proposed method identified the latent dynamic patterns associated with different molecular subtypes, and radiomics analysis based on the pixel compositions of the uncovered dynamic patterns was able to determine molecular subtypes of breast cancer.
  • A deep matrix completion method for imputing missing histological data in breast cancer by integrating DCE-MRI radiomics

    Fan, Ming; Zhang, You; Fu, Zhenyu; Xu, Maosheng; Wang, Shiwei; Xie, Sangma; Gao, Xin; Wang, Yue; Li, Lihua (Medical Physics, Wiley, 2021-11-13) [Article]
    Purpose: Clinical indicators of histological information are important for breast cancer treatment and operational decision making, but these histological data suffer from frequent missing values due to various experimental/clinical reasons. The limited amount of histological information from breast cancer samples impedes the accuracy of data imputation. The purpose of this study was to impute missing histological data, including Ki-67 expression level, luminal A subtype, and histological grade, by integrating tumor radiomics. Methods: To this end, a deep matrix completion (DMC) method was proposed for imputing missing histological data using nonmissing features composed of histological and tumor radiomics (termed radiohistological features). DMC finds a latent nonlinear association between radiohistological features across all samples and samples for all the features. Radiomic features of morphologic, statistical and texture features were extracted from dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) inside the tumor. Experiments on missing histological data imputation were performed with a variable number of features and missing data rates. The performance of the DMC method was compared with those of the nonnegative matrix factorization (NMF) and collaborative filtering (MCF)-based data imputation methods. The area under the curve (AUC) was used to assess the performance of missing histological data imputation. Results: By integrating radiomics from DCE-MRI, the DMC method showed significantly better performance in terms of AUC than that using only histological data. Additionally, DMC using 120 radiomic features showed an optimal prediction performance (AUC = 0.793), which was better than the NMF (AUC = 0.756) and MCF methods (AUC = 0.706; corrected p = 0.001). The DMC method consistently performed better than the NMF and MCF methods with a variable number of radiomic features and missing data rates.
    Conclusions: DMC improves imputation performance by integrating tumor histological and radiomics data. This study transforms latent imaging-scale patterns for interactions with molecular-scale histological information and is promising in the tumor characterization and management of patients.
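    The abstract above reports imputation quality as an area under the ROC curve (AUC). As a reading aid only, here is a minimal sketch of how an AUC can be computed from binary labels and predicted scores using the pairwise-ranking (Mann-Whitney) identity. The function name `auc_from_scores` and the toy labels and scores are assumptions made for illustration; they are not taken from the paper and do not reproduce its evaluation code.

```python
from itertools import product


def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = feature present, 0 = absent (made-up values).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")
```

    In this toy case 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. The quadratic pairwise loop is chosen for clarity; a rank-based implementation would be preferable for large samples.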
  • Impact of computational approaches in the fight against COVID-19: an AI guided review of 17 000 studies

    Napolitano, Francesco; Xu, Xiaopeng; Gao, Xin (Briefings in Bioinformatics, Oxford University Press (OUP), 2021-11-11) [Article]
    SARS-CoV-2 caused the first severe pandemic of the digital era. Computational approaches have been ubiquitously used in an attempt to timely and effectively cope with the resulting global health crisis. In order to extensively assess such contribution, we collected, categorized and prioritized over 17 000 COVID-19-related research articles including both peer-reviewed and preprint publications that make a relevant use of computational approaches. Using machine learning methods, we identified six broad application areas i.e. Molecular Pharmacology and Biomarkers, Molecular Virology, Epidemiology, Healthcare, Clinical Medicine and Clinical Imaging. We then used our prioritization model as a guidance through an extensive, systematic review of the most relevant studies. We believe that the remarkable contribution provided by computational applications during the ongoing pandemic motivates additional efforts toward their further development and adoption, with the aim of enhancing preparedness and critical response for current and future emergencies.
  • Predicting Bone Metastasis Using Gene Expression-Based Machine Learning Models

    Albaradei, Somayah; Uludag, Mahmut; Thafar, Maha A.; Gojobori, Takashi; Essack, Magbubah; Gao, Xin (Frontiers in Genetics, Frontiers Media SA, 2021-11-10) [Article]
    Bone is the most common site of distant metastasis from malignant tumors, with the highest prevalence observed in breast and prostate cancers. Such bone metastases (BM) cause many painful skeletal-related events, such as severe bone pain, pathological fractures, spinal cord compression, and hypercalcemia, with adverse effects on life quality. Many bone-targeting agents developed based on the current understanding of BM onset’s molecular mechanisms dull these adverse effects. However, only a few studies investigated potential predictors of high risk for developing BM, despite such knowledge being critical for early interventions to prevent or delay BM. This work proposes a computational network-based pipeline that incorporates a ML/DL component to predict BM development. Based on the proposed pipeline we constructed several machine learning models. The deep neural network (DNN) model exhibited the highest prediction accuracy (AUC of 92.11%) using the top 34 featured genes ranked by betweenness centrality scores. We further used an entirely separate, “external” TCGA dataset to evaluate the robustness of this DNN model and achieved sensitivity of 85%, specificity of 80%, positive predictive value of 78.10%, negative predictive value of 80%, and AUC of 85.78%. The result shows the models’ way of learning allowed it to zoom in on the featured genes that provide the added benefit of the model displaying generic capabilities, that is, to predict BM for samples from different primary sites. Furthermore, existing experimental evidence provides confidence that about 50% of the 34 hub genes have BM-related functionality, which suggests that these common genetic markers provide vital insight about BM drivers. These findings may prompt the transformation of such a method into an artificial intelligence (AI) diagnostic tool and direct us towards mechanisms that underlie metastasis to bone events.
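    The external-validation figures quoted above (sensitivity, specificity, positive and negative predictive value) all derive from a 2x2 confusion matrix. The following sketch shows the standard definitions; the helper name `diagnostic_metrics` and the counts are hypothetical and were chosen only to make the arithmetic easy to follow. They are not the counts behind the reported results.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts (not from the paper).
toy_metrics = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.2%}")
```

    Note that the predictive values depend on the class balance of the evaluation set, whereas sensitivity and specificity do not, which is one reason all four are usually reported together.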
  • WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model

    Fei, Nanyi; Lu, Zhiwu; Gao, Yizhao; Yang, Guoxing; Huo, Yuqi; Wen, Jingyuan; Lu, Haoyu; Song, Ruihua; Gao, Xin; Xiang, Tao; Sun, Hao; Wen, Ji-Rong (arXiv, 2021-10-27) [Preprint]
    The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of human including perception, memory, and reasoning. Although tremendous success has been achieved in various AI research fields (e.g., computer vision and natural language processing), the majority of existing works only focus on acquiring single cognitive ability (e.g., image classification, reading comprehension, or visual commonsense reasoning). To overcome this limitation and take a solid step to artificial general intelligence (AGI), we develop a novel foundation model pre-trained with huge multimodal (visual and textual) data, which is able to be quickly adapted for a broad class of downstream cognitive tasks. Such a model is fundamentally different from the multimodal foundation models recently proposed in the literature that typically make strong semantic correlation assumption and expect exact alignment between image and text modalities in their pre-training data, which is often hard to satisfy in practice thus limiting their generalization abilities. To resolve this issue, we propose to pre-train our foundation model by self-supervised learning with weak semantic correlation data crawled from the Internet and show that state-of-the-art results can be obtained on a wide range of downstream tasks (both single-modal and cross-modal). Particularly, with novel model-interpretability tools developed in this work, we demonstrate that strong imagination ability (even with hints of commonsense) is now possessed by our foundation model. We believe our work makes a transformative stride towards AGI and will have broad impact on various AI+ fields (e.g., neuroscience and healthcare).
  • NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

    Wang, Kanix; Stevens, Robert; Alachram, Halima; Li, Yu; Soldatova, Larisa; King, Ross; Ananiadou, Sophia; Schoene, Annika M.; Li, Maolin; Christopoulou, Fenia; Ambite, José Luis; Matthew, Joel; Garg, Sahil; Hermjakob, Ulf; Marcu, Daniel; Sheng, Emily; Beißbarth, Tim; Wingender, Edgar; Galstyan, Aram; Gao, Xin; Chambers, Brendan; Pan, Weidi; Khomtchouk, Bohdan B.; Evans, James A.; Rzhetsky, Andrey (npj Systems Biology and Applications, Springer Science and Business Media LLC, 2021-10-20) [Article]
    Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.
  • Lethal variants in humans: lessons learned from a large molecular autopsy cohort.

    Shamseldin, Hanan E; AlAbdi, Lama; Maddirevula, Sateesh; Alsaif, Hessa S; AlZahrani, Fatema; Ewida, Nour; Hashem, Mais; Abdulwahab, Firdous; Abuyousef, Omar; Kuwahara, Hiroyuki; Gao, Xin; Molecular Autopsy Consortium; Alkuraya, Fowzan S (Genome medicine, Springer Science and Business Media LLC, 2021-10-13) [Article]
    Background: Molecular autopsy refers to DNA-based identification of the cause of death. Despite recent attempts to broaden its scope, the term remains typically reserved to sudden unexplained death in young adults. In this study, we aim to showcase the utility of molecular autopsy in defining lethal variants in humans. Methods: We describe our experience with a cohort of 481 cases in whom the cause of premature death was investigated using DNA from the index or relatives (molecular autopsy by proxy). Molecular autopsy tool was typically exome sequencing although some were investigated using targeted approaches in the earlier stages of the study; these include positional mapping, targeted gene sequencing, chromosomal microarray, and gene panels. Results: The study includes 449 cases from consanguineous families and 141 lacked family history (simplex). The age range was embryos to 18 years. A likely causal variant (pathogenic/likely pathogenic) was identified in 63.8% (307/481), a much higher yield compared to the general diagnostic yield (43%) from the same population. The predominance of recessive lethal alleles allowed us to implement molecular autopsy by proxy in 55 couples, and the yield was similarly high (63.6%). We also note the occurrence of biallelic lethal forms of typically non-lethal dominant disorders, sometimes representing a novel bona fide biallelic recessive disease trait. Forty-six disease genes with no OMIM phenotype were identified in the course of this study. The presented data support the candidacy of two other previously reported novel disease genes (FAAH2 and MSN). The focus on lethal phenotypes revealed many examples of interesting phenotypic expansion as well as remarkable variability in clinical presentation.
    Furthermore, important insights into population genetics and variant interpretation are highlighted based on the results. Conclusions: Molecular autopsy, broadly defined, proved to be a helpful clinical approach that provides unique insights into lethal variants and the clinical annotation of the human genome.
  • BlockPolish: accurate polishing of long-read assembly via block divide-and-conquer.

    Huang, Neng; Nie, Fan; Ni, Peng; Gao, Xin; Luo, Feng; Wang, Jianxin (Briefings in bioinformatics, Oxford University Press (OUP), 2021-10-08) [Article]
    Long-read sequencing technology enables significant progress in de novo genome assembly. However, the high error rate and the wide error distribution of raw reads result in a large number of errors in the assembly. Polishing is a procedure to fix errors in the draft assembly and improve the reliability of genomic analysis. However, existing methods treat all the regions of the assembly equally while there are fundamental differences between the error distributions of these regions. How to achieve very high accuracy in genome assembly is still a challenging problem. Motivated by the uneven errors in different regions of the assembly, we propose a novel polishing workflow named BlockPolish. In this method, we divide contigs into blocks with low complexity and high complexity according to statistics of aligned nucleotide bases. Multiple sequence alignment is applied to realign raw reads in complex blocks and optimize the alignment result. Due to the different distributions of error rates in trivial and complex blocks, two multitask bidirectional Long short-term memory (LSTM) networks are proposed to predict the consensus sequences. In the whole-genome assemblies of NA12878 assembled by Wtdbg2 and Flye using Nanopore data, BlockPolish has a higher polishing accuracy than other state-of-the-arts including Racon, Medaka and MarginPolish & HELEN. In all assemblies, errors are predominantly indels and BlockPolish has a good performance in correcting them. In addition to the Nanopore assemblies, we further demonstrate that BlockPolish can also reduce the errors in the PacBio assemblies. The source code of BlockPolish is freely available on Github (https://github.com/huangnengCSU/BlockPolish).
  • DTi2Vec: Drug–target interaction prediction using network embedding and ensemble learning

    Thafar, Maha A.; Olayan, Rawan S.; Albaradei, Somayah; Bajic, Vladimir B.; Gojobori, Takashi; Essack, Magbubah; Gao, Xin (Journal of Cheminformatics, Springer Science and Business Media LLC, 2021-09-22) [Article]
    Drug–target interaction (DTI) prediction is a crucial step in drug discovery and repositioning as it reduces experimental validation costs if done right. Thus, developing in-silico methods to predict potential DTI has become a competitive research niche, with one of its main focuses being improving the prediction accuracy. Using machine learning (ML) models for this task, specifically network-based approaches, is effective and has shown great advantages over the other computational methods. However, ML model development involves upstream hand-crafted feature extraction and other processes that impact prediction accuracy. Thus, network-based representation learning techniques that provide automated feature extraction combined with traditional ML classifiers dealing with downstream link prediction tasks may be better-suited paradigms. Here, we present such a method, DTi2Vec, which identifies DTIs using network representation learning and ensemble learning techniques. DTi2Vec constructs the heterogeneous network, and then it automatically generates features for each drug and target using the nodes embedding technique. DTi2Vec demonstrated its ability in drug–target link prediction compared to several state-of-the-art network-based methods, using four benchmark datasets and large-scale data compiled from DrugBank. DTi2Vec showed a statistically significant increase in the prediction performances in terms of AUPR. We verified the "novel" predicted DTIs using several databases and scientific literature. DTi2Vec is a simple yet effective method that provides high DTI prediction performance while being scalable and efficient in computation, translating into a powerful drug repositioning tool.
  • ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation

    Umarov, Ramzan; Li, Yu; Arakawa, Takahiro; Takizawa, Satoshi; Gao, Xin; Arner, Erik (PLOS Computational Biology, Public Library of Science (PLoS), 2021-09-07) [Article]
    Regulatory elements control gene expression through transcription initiation (promoters) and by enhancing transcription at distant regions (enhancers). Accurate identification of regulatory elements is fundamental for annotating genomes and understanding gene expression patterns. While there are many attempts to develop computational promoter and enhancer identification methods, reliable tools to analyze long genomic sequences are still lacking. Prediction methods often perform poorly on the genome-wide scale because the number of negatives is much higher than that in the training sets. To address this issue, we propose a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions. The developed method achieves good genome-level performance and maintains robust performance when applied to other vertebrate species, without re-training. Moreover, the unannotated predicted regulatory regions made on the human genome are enriched for disease-associated variants, suggesting them to be potentially true regulatory elements rather than false positives. We validated high scoring “false positive” predictions using reporter assay and all tested candidates were successfully validated, demonstrating the ability of our method to discover novel human regulatory regions.
  • Machine Learning and Deep Learning Methods that use Omics Data for Metastasis Prediction

    Albaradei, Somayah; Thafar, Maha A.; Alsaedi, Asim; Van Neste, Christophe Marc; Gojobori, Takashi; Essack, Magbubah; Gao, Xin (Computational and Structural Biotechnology Journal, Elsevier BV, 2021-09-04) [Article]
    The knowledge that metastasis is the primary cause of cancer-related deaths has incentivized research directed towards unraveling the complex cellular processes that drive metastasis. Advancement in technology, and specifically the advent of high-throughput sequencing, provides knowledge of such processes. This knowledge led to the development of therapeutic and clinical applications, and is now being used to predict the onset of metastasis to improve diagnostics and disease therapies. In this regard, predicting metastasis onset has also been explored using artificial intelligence approaches that are machine learning, and more recently, deep learning-based. This review summarizes the different machine learning and deep learning-based metastasis prediction methods developed to date. We also detail the different types of molecular data used to build the models and the critical signatures derived from the different methods. We further highlight the challenges associated with using machine learning and deep learning methods, and provide suggestions to improve the predictive performance of such methods.
  • MetaCancer: A Deep Learning-Based Pan-cancer Metastasis Prediction Model Developed using Multi-omics Data

    Albaradei, Somayah; Napolitano, Francesco; Thafar, Maha A.; Gojobori, Takashi; Essack, Magbubah; Gao, Xin (Computational and Structural Biotechnology Journal, Elsevier BV, 2021-08-09) [Article]
    Predicting metastasis in the early stages means that clinicians have more time to adjust a treatment regimen to target the primary and metastasized cancer. In this regard, several computational approaches are being developed to identify metastasis early. However, most of the approaches focus on changes on one genomic level only, and they are not being developed from a pan-cancer perspective. Thus, we here present a deep learning (DL)–based model, MetaCancer, that differentiates pan-cancer metastasis status based on three heterogeneous data layers. In particular, we built the DL-based model using 400 patients' data that includes RNA sequencing (RNA-Seq), microRNA sequencing (microRNA-Seq), and DNA methylation data from The Cancer Genome Atlas (TCGA). We quantitatively assess the proposed convolutional variational autoencoder (CVAE) and alternative feature extraction methods. We further show that integrating mRNA, microRNA, and DNA methylation data as features improves our model's performance compared to when we used mRNA data only. In addition, we show that the mRNA-related features make a more significant contribution when attempting to distinguish the primary tumors from metastatic ones computationally. Lastly, we show that our DL model significantly outperformed a machine learning (ML) ensemble method based on various metrics.
  • Protein-RNA interaction prediction with deep learning: Structure matters

    Wei, Junkang; Chen, Siyuan; Zong, Licheng; Gao, Xin; Li, Yu (arXiv, 2021-07-26) [Preprint]
    Protein-RNA interactions are of vital importance to a variety of cellular activities. Both experimental and computational techniques have been developed to study the interactions. Due to the limitation of the previous database, especially the lack of protein structure data, most of the existing computational methods rely heavily on the sequence data, with only a small portion of the methods utilizing the structural information. Recently, AlphaFold has revolutionized the entire protein and biology field. Foreseeably, the protein-RNA interaction prediction will also be promoted significantly in the upcoming years. In this work, we give a thorough review of this field, surveying both the binding site and binding preference prediction problems and covering the commonly used datasets, features, and models. We also point out the potential challenges and opportunities in this field. This survey summarizes the development of the RBP-RNA interaction field in the past and foresees its future development in the post-AlphaFold era.
  • A sensitive repeat identification framework based on short and long reads

    Liao, Xingyu; Li, M; Hu, K; Wu, FX; Gao, Xin (Nucleic Acids Research, Oxford University Press, 2021-07-02) [Article]
    Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance on identifying repeats in terms of both accuracy and size, since NGS reads are too short to identify long repeats whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on the global de novo assembly and k-mer based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify the repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates the repeats faster and more accurately; (iii) by using the multi-alignment unique k-mers rather than the high frequency k-mers to identify repeats in overlap sequences, it can obtain the repeats more comprehensively and stably; (iv) by applying the parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly optimized and (v) by taking the corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker can achieve more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
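    For readers unfamiliar with k-mer based repeat detection, the sketch below illustrates the simpler high-frequency k-mer idea that the abstract contrasts LongRepMarker against: count every length-k substring and flag those that occur many times. This is explicitly not LongRepMarker's multi-alignment unique k-mer strategy, and the function names, the toy sequence and the threshold are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping length-k substring of seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def high_frequency_kmers(seq, k, min_count):
    """Return k-mers seen at least min_count times (candidate repeat seeds)."""
    return {kmer: n for kmer, n in kmer_counts(seq, k).items() if n >= min_count}


# Toy sequence containing a tandem repeat of the unit "ACGT".
toy_seq = "ACGTACGTACGTTTGACA"
toy_repeats = high_frequency_kmers(toy_seq, k=4, min_count=3)
print(toy_repeats)
```

    On this toy input only "ACGT" reaches the threshold of three occurrences. A frequency cutoff like this is sensitive to sequencing errors and to the choice of k, which gives some intuition for why the authors pursue a different k-mer selection strategy.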
  • An Interpretable Computer-Aided Diagnosis Method for Periodontitis From Panoramic Radiographs.

    Li, Haoyang; Zhou, Juexiao; Zhou, Yi; Chen, Qiang; She, Yangyang; Gao, Feng; Xu, Ying; Chen, Jieyu; Gao, Xin (Frontiers in physiology, Frontiers Media SA, 2021-06-22) [Article]
    Periodontitis is a prevalent and irreversible chronic inflammatory disease both in developed and developing countries, and affects about 20–50% of the global population. A tool for automatically diagnosing periodontitis is in high demand for screening at-risk people, and early detection could prevent the onset of tooth loss, especially in local communities and health care settings with limited dental professionals. In the medical field, doctors need to understand and trust the decisions made by computational models, and developing interpretable models is crucial for disease diagnosis. Based on these considerations, we propose an interpretable method called Deetal-Perio to predict the severity degree of periodontitis in dental panoramic radiographs. In our method, alveolar bone loss (ABL), the clinical hallmark for periodontitis diagnosis, could be interpreted as the key feature. To calculate ABL, we also propose a method for teeth numbering and segmentation. First, Deetal-Perio segments and indexes each individual tooth via Mask R-CNN combined with a novel calibration method. Next, Deetal-Perio segments the contour of the alveolar bone and calculates a ratio for each individual tooth to represent ABL. Finally, Deetal-Perio predicts the severity degree of periodontitis given the ratios of all the teeth. The Macro F1-score and accuracy of the periodontitis prediction task in our method reach 0.894 and 0.896, respectively, on the Suzhou data set, and 0.820 and 0.824, respectively, on the Zhongshan data set. The entire architecture not only outperforms state-of-the-art methods and shows robustness on two data sets in both the periodontitis prediction and the teeth numbering and segmentation tasks, but is also interpretable, so that doctors can understand why Deetal-Perio works so well.
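    The abstract above reports a Macro F1-score alongside accuracy for a multi-class severity prediction. As a minimal sketch of the standard definition (the unweighted mean of per-class F1 scores), the code below computes both metrics; the function names, the three integer severity classes and the toy labels are hypothetical and are not drawn from the Suzhou or Zhongshan data sets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical severity classes 0, 1, 2 (made-up labels).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
print(f"macro F1 = {macro_f1(toy_true, toy_pred):.3f}")
print(f"accuracy = {accuracy(toy_true, toy_pred):.3f}")
```

    Because every class contributes equally to the macro average, this metric penalizes poor performance on rare severity grades more than plain accuracy does.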
  • DeeReCT-TSS: A novel meta-learning-based method annotates TSS in multiple cell types based on DNA sequences and RNA-seq data

    Zhou, Juexiao; Zhang, Bin; Li, Haoyang; Zhou, Longxi; Li, Zhongxiao; Long, Yongkang; Han, Wenkai; Wang, Mengran; Cui, Huanhuan; Chen, Wei; Gao, Xin (Research Square Platform LLC, 2021-06-21) [Preprint]
    The accurate annotation of transcription start sites (TSSs) and their usage is critical for the mechanistic understanding of gene regulation under different biological contexts. To fulfil this, on one hand, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner. On the other hand, various computational tools have also been developed for in silico prediction of TSSs solely based on genomic sequences. Most of these computational tools cast the problem as a binary classification task on a balanced dataset and thus produce large numbers of false positive predictions when applied at the genome scale. To address these issues, we present DeeReCT-TSS, a deep-learning-based method that is capable of TSS identification across the whole genome based on both DNA sequences and conventional RNA-seq data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we develop a meta-learning-based extension for simultaneous TSS annotation on 10 cell types, which enables the identification of cell-type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets from the ENCODE project by correlating our predicted TSSs with experimentally defined TSS chromatin states. Our application, pre-trained models and data are available at https://github.com/JoshuaChou2018/DeeReCT-TSS_release.
  • Lunar features detection for energy discovery via deep learning

    Chen, Siyuan; Li, Yu; Zhang, Tao; Zhu, Xingyu; Sun, Shuyu; Gao, Xin (Applied Energy, Elsevier BV, 2021-05-19) [Article]
    Because of the impending energy crisis and the environmental impact of fossil fuels, researchers are actively looking for alternatives, such as Helium-3 on the Moon. Although it remains challenging to explore energy sources on the Moon due to the long physical distance, lunar features such as craters and rilles can be hotspots for such energy sources, according to recent studies. Thus, identifying these lunar features can facilitate the discovery of Helium-3 on the Moon, which is enriched in such hotspots. However, no computational method had previously been developed to recognize lunar features automatically for facilitating space energy discovery. In our research, we aim at developing the first deep learning method to identify multiple lunar features simultaneously for potential energy source discovery. Based on the state-of-the-art deep learning model, High Resolution Net, our model can efficiently extract semantic information and high-resolution spatial information from the input images, which ensures the performance for recognizing the lunar features. With a novel framework, our method can recognize multiple lunar features, such as craters and rilles, at the same time. We also used transfer learning to handle the data deficiency issue. With comprehensive experiments on three datasets, we show the effectiveness of the proposed method. All the datasets and codes are available online.
