For more information visit: https://sfb.kaust.edu.sa/Pages/Home.aspx

Recent Submissions

  • miR-596-3p suppresses brain metastasis of non-small cell lung cancer by modulating YAP1 and IL-8

    Li, Chenlong; Zheng, Hongshan; Xiong, Jinsheng; Huang, Yuxin; Li, Haoyang; Jin, Hua; Ai, Siqi; Wang, Yingjie; Su, Tianqi; Sun, Guiyin; Xiao, Xu; Fu, Tianjiao; Wang, Yujie; Gao, Xin; Liang, Peng (Cell Death & Disease, Springer Science and Business Media LLC, 2022-08-12) [Article]
    Brain metastasis (BM) frequently occurs in advanced non-small cell lung cancer (NSCLC) and is associated with poor clinical prognosis. Due to the location of metastatic lesions, surgical resection is limited, and chemotherapy is ineffective because of the blood-brain barrier (BBB). Therefore, it is essential to enhance our understanding of the underlying mechanisms associated with brain metastasis in NSCLC. In the present study, we explored the RNA-Seq data of brain metastasis cells from the GEO database, and extracted RNA collected from primary NSCLC tumors as well as paired brain metastatic lesions followed by microRNA PCR array. Meanwhile, we improved the in vivo model and constructed a cancer stem cell-derived transplantation model of brain metastasis in mice. Our data indicated that the level of miR-596-3p is high in primary NSCLC tumors, but significantly downregulated in the brain metastatic lesion. The prediction target of microRNA suggested that miR-596-3p was considered to modulate two genes essential in the brain invasion process, YAP1 and IL-8 that restrain the invasion of cancer cells and permeability of BBB, respectively. Moreover, in vivo experiments suggested that our model mimics the clinical aspect of NSCLC and improves the success ratio of brain metastasis model. The results demonstrated that miR-596-3p significantly inhibited the capacity of NSCLC cells to metastasize to the brain. Furthermore, these findings elucidated that miR-596-3p exerts a critical role in brain metastasis of NSCLC by modulating the YAP1-IL8 network, and this miRNA axis may provide a potential therapeutic strategy for brain metastasis.
  • The SWI/SNF chromatin remodeling factor DPF3 regulates metastasis of ccRCC by modulating TGF-β signaling

    Cui, Huanhuan; Yi, Hongyang; Bao, Hongyu; Tan, Ying; Tian, Chi; Shi, Xinyao; Gan, Diwen; Zhang, Bin; Liang, Weizheng; Chen, Rui; Zhu, Qionghua; Fang, Liang; Gao, Xin; Huang, Hongda; Tian, Ruijun; Sperling, Silke R.; Hu, Yuhui; Chen, Wei (Nature Communications, Springer Science and Business Media LLC, 2022-08-09) [Article]
    DPF3, a component of the SWI/SNF chromatin remodeling complex, has been associated with clear cell renal cell carcinoma (ccRCC) in a genome-wide association study. However, the functional role of DPF3 in ccRCC development and progression remains unknown. In this study, we demonstrate that DPF3a, the short isoform of DPF3, promotes kidney cancer cell migration both in vitro and in vivo, consistent with the clinical observation that DPF3a is significantly upregulated in ccRCC patients with metastases. Mechanistically, DPF3a specifically interacts with SNIP1, via which it forms a complex with SMAD4 and p300 histone acetyltransferase (HAT), the major transcriptional regulators of TGF-β signaling pathway. Moreover, the binding of DPF3a releases the repressive effect of SNIP1 on p300 HAT activity, leading to the increase in local histone acetylation and the activation of cell movement related genes. Overall, our findings reveal a metastasis-promoting function of DPF3, and further establish the link between SWI/SNF components and ccRCC.
  • Unveiling the “Template-Dependent” Inhibition on the Viral Transcription of SARS-CoV-2

    Luo, Xueying; Wang, Xiaowei; Yao, Yuan; Gao, Xin; Zhang, Lu (The Journal of Physical Chemistry Letters, American Chemical Society (ACS), 2022-07-30) [Article]
    Remdesivir is a nucleotide analogue prodrug capable of terminating RNA synthesis in SARS-CoV-2 RNA-dependent RNA polymerase (RdRp) by two distinct mechanisms. Although the “delayed chain termination” mechanism has been extensively investigated, the “template-dependent” inhibitory mechanism remains elusive. In this study, we have demonstrated that remdesivir embedded in the template strand seldom directly disrupted the complementary NTP incorporation at the active site. Instead, the translocation of remdesivir from the +2 to the +1 site was hindered due to the steric clash with V557. Moreover, we have elucidated the molecular mechanism characterizing the drug resistance upon V557L mutation. Overall, our studies have provided valuable insight into the “template-dependent” inhibitory mechanism exerted by remdesivir on SARS-CoV-2 RdRp and paved the way for an alternative antiviral strategy for the COVID-19 pandemic. As the “template-dependent” inhibition occurs across diverse viral RdRps, our findings may also shed light on a common acting mechanism of inhibitors.
  • MetastaSite: Predicting metastasis to different sites using deep learning with gene expression data

    Albaradei, Somayah; Albaradei, Abdurhman; Alsaedi, Asim; Uludag, Mahmut; Thafar, Maha A.; Gojobori, Takashi; Essack, Magbubah; Gao, Xin (Frontiers in Molecular Biosciences, Frontiers Media SA, 2022-07-22) [Article]
    Deep learning has massive potential in predicting phenotype from different omics profiles. However, deep neural networks are viewed as black boxes, providing predictions without explanation. Therefore, the requirements for these models to become interpretable are increasing, especially in the medical field. Here we propose a computational framework that takes the gene expression profile of any primary cancer sample and predicts whether patients’ samples are primary (localized) or metastasized to the brain, bone, lung, or liver based on deep learning architecture. Specifically, we first constructed an AutoEncoder framework to learn the non-linear relationship between genes, and then DeepLIFT was applied to calculate genes’ importance scores. Next, to mine the top essential genes that can distinguish the primary and metastasized tumors, we iteratively added ten top-ranked genes based upon their importance score to train a DNN model. Then we trained a final multi-class DNN that uses the output from the previous part as an input and predicts whether samples are primary or metastasized to the brain, bone, lung, or liver. The prediction performances ranged from AUC of 0.93–0.82. We further designed the model’s workflow to provide a second functionality beyond metastasis site prediction, i.e., to identify the biological functions that the DL model uses to perform the prediction. To our knowledge, this is the first multi-class DNN model developed for the generic prediction of metastasis to various sites.
  • Alternative role of motif B in template dependent polymerase inhibition

    Luo, Xueying; Xu, Tiantian; Gao, Xin; Zhang, Lu (Chinese Journal of Chemical Physics, AIP Publishing, 2022-07-19) [Article]
    Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) relies on the central molecular machine RNA-dependent RNA polymerase (RdRp) for the viral replication and transcription. Remdesivir at the template strand has been shown to effectively inhibit the RNA synthesis in SARS-CoV-2 RdRp by deactivating not only the complementary UTP incorporation but also the next nucleotide addition. However, the underlying molecular mechanism of the second inhibitory point remains unclear. In this work, we have performed molecular dynamics simulations and demonstrated that such inhibition has not directly acted on the nucleotide addition at the active site. Instead, the translocation of Remdesivir from +1 to −1 site is hindered thermodynamically as the post-translocation state is less stable than the pre-translocation state due to the motif B residue G683. Moreover, another conserved residue S682 on motif B further hinders the dynamic translocation of Remdesivir due to the steric clash with the 1′-cyano substitution. Overall, our study has unveiled an alternative role of motif B in mediating the translocation when Remdesivir is present in the template strand and complemented our understanding about the inhibitory mechanisms exerted by Remdesivir on the RNA synthesis in SARS-CoV-2 RdRp.
  • Target-aware Abstractive Related Work Generation with Contrastive Learning

    Chen, Xiuying; Alamro, Hind; Li, Mingzhe; Gao, Shen; Yan, Rui; Gao, Xin; Zhang, Xiangliang (ACM, 2022-07-07) [Conference Paper]
    The related work section is an important component of a scientific paper, which highlights the contribution of the target paper in the context of the reference papers. Authors can save their time and effort by using the automatically generated related work section as a draft to complete the final related work. Most of the existing related work section generation methods rely on extracting off-the-shelf sentences to make a comparative discussion about the target work and the reference papers. However, such sentences need to be written in advance and are hard to obtain in practice. Hence, in this paper, we propose an abstractive target-aware related work generator (TAG), which can generate related work sections consisting of new sentences. Concretely, we first propose a target-aware graph encoder, which models the relationships between reference papers and the target paper with target-centered attention mechanisms. In the decoding process, we propose a hierarchical decoder that attends to the nodes of different levels in the graph with keyphrases as semantic indicators. Finally, to generate a more informative related work, we propose multi-level contrastive optimization objectives, which aim to maximize the mutual information between the generated related work with the references and minimize that with non-references. Extensive experiments on two public scholar datasets show that the proposed model brings substantial improvements over several strong baselines in terms of automatic and tailored human evaluations.
  • Predicting the antigenic evolution of SARS-COV-2 with deep learning

    Han, Wenkai; Chen, NingNing; Sun, Shiwei; Gao, Xin (Cold Spring Harbor Laboratory, 2022-06-29) [Preprint]
    The severe acute respiratory syndrome coronavirus 2 (SARS-COV-2) antigenic profile evolves in response to the vaccine and natural infection-derived immune pressure, resulting in immune escape and threatening public health. Exploring the possible antigenic evolutionary potentials improves public health preparedness, but it is limited by the lack of experimental assays as the sequence space is exponentially large. Here we introduce the Machine Learning-guided Antigenic Evolution Prediction (MLAEP), which combines structure modeling, multi-task learning, and a genetic algorithm to model the viral fitness landscape and explore the antigenic evolution via in silico directed evolution. As demonstrated by existing SARS-COV-2 variants, MLAEP can infer the order of variants along antigenic evolutionary trajectories, which is also strongly correlated with their sampling time. The novel mutations predicted by MLAEP are also found in immunocompromised COVID-19 patients and newly emerging variants, like BA.4/5. In sum, our approach enables profiling existing variants and forecasting prospective antigenic variants, and thus may help guide the development of vaccines and increase preparedness against future variants.
  • A Peculiar Binding Characterization of DNA (RNA) Nucleobases at MoOS-Based Janus Biosensor: Dissimilar Facets Role on Selectivity and Sensitivity

    Laref, Slimane; Wang, Bin; Inal, Sahika; Al-Ghamdi, Salah; Gao, Xin; Gojobori, Takashi (Biosensors, MDPI AG, 2022-06-23) [Article]
    Distinctive properties of Janus monolayers have drawn much interest in biotechnology applications. For this purpose, all sensing possibilities of nucleobase molecules (DNA/RNA) by a Janus MoOS monolayer on both oxygen and sulfur terminations have been explored theoretically by means of rigorous first-principles calculations. Indeed, differences in interaction energy between nucleobases indicate that a monolayer can be used for DNA sequencing. Exothermic interaction energies for DNA/RNA molecules with the oxygen and sulfur sides of the Janus MoOS surfaces have been found to range between 0.61–0.91 eV and 0.63–0.88 eV, respectively, and the binding distances indicate that these molecules bind to both facets by physisorption. The exchange of weak electronic charges between the MoOS monolayer and the nucleobase molecules has been studied by means of Hirshfeld-I charge analysis. It has been observed that the introduction of DNA/RNA nucleobase molecules alters the electronic properties of both oxygen and sulfur atomic layers of the Janus MoOS complex systems, as determined by plotting the 3D Kohn–Sham frontier orbitals. A good correlation has been found between the interaction energy, van der Waals energy, Hirshfeld-I, and d-band center as a function of the nucleobase’s affinity, and the interaction energy, suggesting adsorption dominated by van der Waals interactions driven by the molybdenum d-orbital. Moreover, the lowering of the adsorption energy leads to an active interaction of the DNA/RNA with the surfaces and, accordingly, to a shorter recovery time. The selectivity of the biosensor modulation device has illustrated a significant sensitivity for the nucleobases on both the oxygen and sulfur layer sides of the MoOS monolayer. This finding reveals that, apart from graphene, Janus transition metal dichalcogenides may also be adequate for identifying DNA/RNA bases in applied biotechnology.
  • SARS-CoV-2 RdRp Follows Asynchronous Translocation Pathway for Viral Transcription and Replication

    Wang, Xiaowei; Yao, Yuan; Gao, Xin; Zhang, Lu (American Chemical Society (ACS), 2022-06-15) [Preprint]
    RNA-dependent RNA polymerase (RdRp) is the replicase machinery for SARS-CoV-2 and thereby it has become one of the most promising drug targets to combat the pandemic as well as the health threat posed by the novel coronavirus. Translocation is one essential step for RdRp to exert the viral replication and transcription, and it describes the dynamic process in which the double-stranded RNA moves upstream by one base pair position to empty the active site for the continuous substrate incorporation. However, the molecular mechanisms underlying the dynamic translocation of SARS-CoV-2 RdRp remain elusive. In the current study, we have elucidated the molecular insights into the translocation dynamics of SARS-CoV-2 RdRp by constructing a Markov State Model based on extensive molecular dynamics simulations. We have identified two previously uncharacterized intermediates which pinpoint an asynchronous and rate-limiting translocation of the nascent-template duplex. The movement of the 3’-terminal nucleotide in the nascent strand lags behind its upstream nucleotides due to the uneven protein environment while the translocation of template strand is delayed by the hurdle residue K500. Although the motions of the two strands are not synchronous, they share the same “ratchet” to stabilize the system in the post-translocation state, suggesting a coupled Brownian-ratchet model. Overall, our study has provided intriguing insights into the translocation dynamics with unprecedented molecular details, which would significantly deepen our understanding of the transcriptional mechanisms of SARS-CoV-2.
  • Towards artificial general intelligence via a multimodal foundation model

    Fei, Nanyi; Lu, Zhiwu; Gao, Yizhao; Yang, Guoxing; Huo, Yuqi; Wen, Jingyuan; Lu, Haoyu; Song, Ruihua; Gao, Xin; Xiang, Tao; Sun, Hao; Wen, Ji-Rong (Nature Communications, Springer Science and Business Media LLC, 2022-06-02) [Article]
    The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of humans. Despite tremendous success in AI research, most existing methods have only a single cognitive ability. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly adapted for various downstream cognitive tasks. To achieve this goal, we propose to pre-train our foundation model by self-supervised learning with weak semantic correlation data crawled from the Internet and show that promising results can be obtained on a wide range of downstream tasks. Particularly, with the developed model-interpretability tools, we demonstrate that strong imagination ability is now possessed by our foundation model. We believe that our work makes a transformative stride towards AGI, from our common practice of “weak or narrow AI” to that of “strong or generalized AI”.
  • RBP-TSTL is a two-stage transfer learning framework for genome-scale prediction of RNA-binding proteins

    Peng, Xinxin; Wang, Xiaoyu; Guo, Yuming; Ge, Zongyuan; Li, Fuyi; Gao, Xin; Song, Jiangning (Briefings in bioinformatics, Oxford University Press (OUP), 2022-06-02) [Article]
    RNA binding proteins (RBPs) are critical for the post-transcriptional control of RNAs and play vital roles in a myriad of biological processes, such as RNA localization and gene regulation. Therefore, computational methods that are capable of accurately identifying RBPs are highly desirable and have important implications for biomedical and biotechnological applications. Here, we propose a two-stage deep transfer learning-based framework, termed RBP-TSTL, for accurate prediction of RBPs. In the first stage, the knowledge from the self-supervised pre-trained model was extracted as feature embeddings and used to represent the protein sequences, while in the second stage, a customized deep learning model was initialized based on an annotated pre-training RBPs dataset before being fine-tuned on each corresponding target species dataset. This two-stage transfer learning framework can enable the RBP-TSTL model to be effectively trained to learn and improve the prediction performance. Extensive performance benchmarking of the RBP-TSTL models trained using the features generated by the self-supervised pre-trained model and other models trained using hand-crafting encoding features demonstrated the effectiveness of the proposed two-stage knowledge transfer strategy based on the self-supervised pre-trained models. Using the best-performing RBP-TSTL models, we further conducted genome-scale RBP predictions for Homo sapiens, Arabidopsis thaliana, Escherichia coli, and Salmonella and established a computational compendium containing all the predicted putative RBPs candidates. We anticipate that the proposed RBP-TSTL approach will be explored as a useful tool for the characterization of RNA-binding proteins and exploration of their sequence-structure-function relationships.
  • Special issue on computational biology and bioinformatic applications to the COVID-19 pandemic

    Napolitano, Francesco; Gao, Xin (Quantitative Biology, Engineering Sciences Press, 2022-06) [Article]
  • A framework for deep multitask learning with multiparametric magnetic resonance imaging for the joint prediction of histological characteristics in breast cancer

    Fan, Ming; Yuan, Chengcheng; Huang, Guangyao; Xu, Maosheng; Wang, Shiwei; Gao, Xin; Li, Lihua (IEEE Journal of Biomedical and Health Informatics, Institute of Electrical and Electronics Engineers (IEEE), 2022-05-30) [Article]
    The clinical management and decision-making process related to breast cancer are based on multiple histological indicators. This study aims to jointly predict the Ki-67 expression level, luminal A subtype and histological grade molecular biomarkers using a new deep multitask learning method with multiparametric magnetic resonance imaging. A multitask learning network structure was proposed by introducing a common-task layer and task-specific layers to learn the high-level features that are common to all tasks and related to a specific task, respectively. A network pretrained with knowledge from the ImageNet dataset was used and fine-tuned with MRI data. Information from multiparametric MR images was fused using the strategy at the feature and decision levels. The area under the receiver operating characteristic curve (AUC) was used to measure model performance. For single-task learning using a single image series, the deep learning model generated AUCs of 0.752, 0.722, and 0.596 for the Ki-67, luminal A and histological grade prediction tasks, respectively. The performance was improved by freezing the first 5 convolutional layers, using 20% shared layers and fusing multiparametric series at the feature level, which achieved AUCs of 0.819, 0.799 and 0.747 for Ki-67, luminal A and histological grade prediction tasks, respectively. Our study showed advantages in jointly predicting correlated clinical biomarkers using a deep multitask learning framework with an appropriate number of fine-tuned convolutional layers by taking full advantage of common and complementary imaging features. Multiparametric image series-based multitask learning could be a promising approach for the multiple clinical indicator-based management of breast cancer.
  • PointSite: A Point Cloud Segmentation Tool for Identification of Protein Ligand Binding Atoms

    Yan, Xu; Lu, Yingfeng; Li, Zhen; Wei, Qing; Gao, Xin; Wang, Sheng; Wu, Song; Cui, Shuguang (Journal of Chemical Information and Modeling, American Chemical Society (ACS), 2022-05-27) [Article]
    Accurate identification of ligand binding sites (LBS) on a protein structure is critical for understanding protein function and designing structure-based drugs. As the previous pocket-centric methods are usually based on the investigation of pseudo-surface-points outside the protein structure, they cannot fully take advantage of the local connectivity of atoms within the protein, as well as the global 3D geometrical information from all the protein atoms. In this paper, we propose a novel point cloud segmentation method, PointSite, for accurate identification of protein ligand binding atoms, which performs protein LBS identification at the atom level in a protein-centric manner. Specifically, we first transfer the original 3D protein structure to point clouds and then conduct segmentation through a Submanifold Sparse Convolution based U-Net. With the fine-grained atom-level binding atom representation and enhanced feature learning, PointSite can outperform previous methods in atom Intersection over Union (atom-IoU) by a large margin. Furthermore, our segmented binding atoms, that is, atoms with high probability predicted by our model, can work as a filter on predictions achieved by previous pocket-centric approaches, which significantly decreases the false positives among LBS candidates. Besides, we further directly extend PointSite trained on bound proteins for LBS identification on unbound proteins, which demonstrates the superior generalization capacity of PointSite. Through cascaded filter and reranking aided by the segmented atoms, state-of-the-art performance can be achieved over various canonical benchmarks, CAMEO hard targets, and unbound proteins in terms of the commonly used DCA criteria.
  • Pan-cancer pervasive upregulation of 3′ UTR splicing drives tumourigenesis

    Chan, Jia Jia; Zhang, Bin; Chew, Xiao Hong; Salhi, Adil; Kwok, Zhi Hao; Lim, Chun You; Desi, Ng; Subramaniam, Nagavidya; Siemens, Angela; Kinanti, Tyas; Ong, Shane; Sanchez-Mejias, Avencia; Ly, Phuong Thao; An, Omer; Sundar, Raghav; Fan, Xiaonan; Wang, Shi; Siew, Bei En; Lee, Kuok Chung; Chong, Choon Seng; Lieske, Bettina; Cheong, Wai-Kit; Goh, Yufen; Fam, Wee Nih; Ooi, Melissa G.; Koh, Bryan T. H.; Iyer, Shridhar Ganpathi; Ling, Wen Huan; Chen, Jianbin; Yoong, Boon-Koon; Chanwat, Rawisak; Bonney, Glenn Kunnath; Goh, Brian K. P.; Zhai, Weiwei; Fullwood, Melissa J.; Wang, Wilson; Tan, Ker-Kan; Chng, Wee Joo; Dan, Yock Young; Pitt, Jason J.; Roca, Xavier; Guccione, Ernesto; Vardy, Leah A.; Chen, Leilei; Gao, Xin; Chow, Pierce K. H.; Yang, Henry; Tay, Yvonne (Nature Cell Biology, Springer Science and Business Media LLC, 2022-05-26) [Article]
    Most mammalian genes generate messenger RNAs with variable untranslated regions (UTRs) that are important post-transcriptional regulators. In cancer, shortening at 3′ UTR ends via alternative polyadenylation can activate oncogenes. However, internal 3′ UTR splicing remains poorly understood as splicing studies have traditionally focused on protein-coding alterations. Here we systematically map the pan-cancer landscape of 3′ UTR splicing and present this in SpUR (http://www.cbrc.kaust.edu.sa/spur/home/). 3′ UTR splicing is widespread, upregulated in cancers, correlated with poor prognosis and more prevalent in oncogenes. We show that antisense oligonucleotide-mediated inhibition of 3′ UTR splicing efficiently reduces oncogene expression and impedes tumour progression. Notably, CTNNB1 3′ UTR splicing is the most consistently dysregulated event across cancers. We validate its upregulation in hepatocellular carcinoma and colon adenocarcinoma, and show that the spliced 3′ UTR variant is the predominant contributor to its oncogenic functions. Overall, our study highlights the importance of 3′ UTR splicing in cancer and may launch new avenues for RNA-based anti-cancer therapeutics.
  • An interpretable deep learning workflow for discovering subvisual abnormalities in CT scans of COVID-19 inpatients and survivors

    Zhou, Longxi; Meng, Xianglin; Huang, Yuxin; Kang, Kai; Zhou, Juexiao; Chu, Yuetan; Li, Haoyang; Xie, Dexuan; Zhang, Jiannan; Yang, Weizhen; Bai, Na; Zhao, Yi; Zhao, Mingyan; Wang, Guohua; Carin, Lawrence; Xiao, Xigang; Yu, Kaijiang; Qiu, Zhaowen; Gao, Xin (Nature Machine Intelligence, Springer Science and Business Media LLC, 2022-05-23) [Article]
    Tremendous efforts have been made to improve diagnosis and treatment of COVID-19, but knowledge on long-term complications is limited. In particular, a large portion of survivors has respiratory complications, but currently, experienced radiologists and state-of-the-art artificial intelligence systems are not able to detect many abnormalities from follow-up computerized tomography (CT) scans of COVID-19 survivors. Here we propose Deep-LungParenchyma-Enhancing (DLPE), a computer-aided detection (CAD) method for detecting and quantifying pulmonary parenchyma lesions on chest CT. Through proposing a number of deep-learning-based segmentation models and assembling them in an interpretable manner, DLPE removes irrelevant tissues from the perspective of pulmonary parenchyma, and calculates the scan-level optimal window, which considerably enhances parenchyma lesions relative to the lung window. Aided by DLPE, radiologists discovered novel and interpretable lesions from COVID-19 inpatients and survivors, which were previously invisible under the lung window. Based on DLPE, we removed the scan-level bias of CT scans, and then extracted precise radiomics from such novel lesions. We further demonstrated that these radiomics have strong predictive power for key COVID-19 clinical metrics on an inpatient cohort of 1,193 CT scans and for sequelae on a survivor cohort of 219 CT scans. Our work sheds light on the development of interpretable medical artificial intelligence and showcases how artificial intelligence can discover medical findings that are beyond sight.
  • Discovering trends and hotspots of biosafety and biosecurity research via machine learning

    Guan, Renchu; Pang, Haoyu; Liang, Yanchun; Shao, Zhongjun; Gao, Xin; Xu, Dong; Feng, Xiaoyue (Briefings in bioinformatics, Oxford University Press (OUP), 2022-05-22) [Article]
    Coronavirus disease 2019 (COVID-19) has infected hundreds of millions of people and killed millions of them. As an RNA virus, COVID-19 is more susceptible to variation than other viruses. Many problems involved in this epidemic have made biosafety and biosecurity (hereafter collectively referred to as 'biosafety') a popular and timely topic globally. Biosafety research covers a broad and diverse range of topics, and it is important to quickly identify hotspots and trends in biosafety research through big data analysis. However, the data-driven literature on biosafety research discovery is quite scant. We developed a novel topic model based on latent Dirichlet allocation, affinity propagation clustering and the PageRank algorithm (LDAPR) to extract knowledge from biosafety research publications from 2011 to 2020. Then, we conducted hotspot and trend analysis with LDAPR and carried out further studies, including annual hot topic extraction, a 10-year keyword evolution trend analysis, topic map construction, hot region discovery and fine-grained correlation analysis of interdisciplinary research topic trends. These analyses revealed valuable information that can guide epidemic prevention work: (1) the research enthusiasm over a certain infectious disease not only is related to its epidemic characteristics but also is affected by the progress of research on other diseases, and (2) infectious diseases are not only strongly related to their corresponding microorganisms but also potentially related to other specific microorganisms.
  • ProNet DB: A proteome-wise database for protein surface property representations and RNA-binding profiles

    Wei, Junkang; Xiao, Jin; Chen, Siyuan; Zong, Licheng; Gao, Xin; Li, Yu (arXiv, 2022-05-16) [Preprint]
    The rapid growth in the number of experimental and predicted protein structures, together with increasingly complicated protein structures, challenges users in computational biology who want to utilize structural information and protein surface property representations. Recently, AlphaFold2 released the comprehensive proteome of various species, and protein surface property representation plays a crucial role in protein-molecule interaction prediction such as protein-protein interaction, protein-nucleic acid interaction, and protein-compound interaction. Here, we propose the first comprehensive database, namely ProNet DB, which incorporates multiple protein surface representations and RNA-binding landscape for more than 33,000 protein structures covering the proteome from AlphaFold Protein Structure Database (AlphaFold DB) and experimentally validated protein structures deposited in Protein Data Bank (PDB). For each protein, we provide the original protein structure, surface property representation including hydrophobicity, charge distribution, hydrogen bond, interacting face, and RNA-binding landscape such as RNA binding sites and RNA binding preference. To interpret protein surface property representation and RNA binding landscape intuitively, we also integrate Mol* and Online 3D Viewer to visualize the representation on the protein surface. The pre-computed features are available to users instantaneously, and their potential applications include molecular mechanism exploration, drug discovery, and novel therapeutics development. The server is now available at https://proj.cse.cuhk.edu.hk/pronet/ and future releases will expand the species and property coverage.
  • iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets

    Chen, Zhen; Liu, Xuhan; Zhao, Pei; Li, Chen; Wang, Yanan; Li, Fuyi; Akutsu, Tatsuya; Bain, Chris; Gasser, Robin B; Li, Junzhou; Yang, Zuoren; Gao, Xin; Kurgan, Lukasz; Song, Jiangning (Nucleic acids research, Oxford University Press (OUP), 2022-05-07) [Article]
    The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas.
  • 2′- and 3′-Ribose Modifications of Nucleotide Analogues Establish the Structural Basis to Inhibit the Viral Replication of SARS-CoV-2

    Li, Yongfang; Zhang, Dong; Gao, Xin; Wang, Xiaowei; Zhang, Lu (The Journal of Physical Chemistry Letters, American Chemical Society (ACS), 2022-05-03) [Article]
    Inhibition of RNA-dependent RNA polymerase (RdRp) by nucleotide analogues with ribose modification provides a promising antiviral strategy for the treatment of SARS-CoV-2. Previous works have shown that remdesivir carrying 1'-substitution can act as a "delayed chain terminator", while nucleotide analogues with 2'-methyl group substitution could immediately terminate the chain extension. However, how the inhibition can be established by the 3'-ribose modification as well as other 2'-ribose modifications is not fully understood. Herein, we have evaluated the potential of several adenosine analogues with 2'- and/or 3'-modifications as obligate chain terminators by comprehensive structural analysis based on extensive molecular dynamics simulations. Our results suggest that 2'-modification couples with the protein environment to affect the structural stability, while 3'-hydrogen substitution inherently exerts "immediate termination" without compromising the structural stability in the active site. Our study provides an alternative promising modification scheme to orientate the further optimization of obligate terminators for SARS-CoV-2 RdRp.
