Structural and Functional Bioinformatics Group
For more information visit: https://sfb.kaust.edu.sa/Pages/Home.aspx
Recent Submissions
-
AB-Gen: Antibody Library Design with Generative Pre-trained Transformer and Deep Reinforcement Learning (Cold Spring Harbor Laboratory, 2023-03-21) [Preprint] Antibody leads must fulfill multiple desirable properties to be clinical candidates. Primarily due to the low throughput of the experimental procedure, the need for such multi-property optimization creates a bottleneck in preclinical antibody discovery and development, because addressing one issue usually causes another. We developed a reinforcement learning (RL) method, named AB-Gen, for antibody library design using a generative pre-trained Transformer (GPT) as the policy network of the RL agent. We showed that this model can learn the antibody space of heavy chain complementarity determining region 3 (CDRH3) and generate sequences with similar property distributions. In addition, when using HER2 as the target, the agent model of AB-Gen was able to generate novel CDRH3 sequences that fulfill multi-property constraints. In total, 509 generated sequences passed all property filters, and three highly conserved residues were identified. The importance of these residues was further demonstrated by molecular dynamics simulations, which confirmed that the agent model was capable of grasping important information in this complex optimization task. Overall, the AB-Gen method is able to design novel antibody sequences with a higher success rate than the traditional propose-then-filter approach. It has the potential to be used in practical antibody design, thus empowering the antibody discovery and development process.
-
A comprehensive benchmarking with practical guidelines for cellular deconvolution of spatial transcriptomics (Nature Communications, Springer Science and Business Media LLC, 2023-03-21) [Article] Spatial transcriptomics technologies are used to profile transcriptomes while preserving spatial information, which enables high-resolution characterization of transcriptional patterns and reconstruction of tissue architecture. Due to the existence of low-resolution spots in recent spatial transcriptomics technologies, uncovering cellular heterogeneity is crucial for disentangling the spatial patterns of cell types, and many related methods have been proposed. Here, we benchmark 18 existing methods for the cellular deconvolution task on 50 real-world and simulated datasets by evaluating the accuracy, robustness, and usability of the methods. We compare these methods comprehensively using different metrics, resolutions, spatial transcriptomics technologies, spot numbers, and gene numbers. In terms of performance, CARD, Cell2location, and Tangram are the best methods for conducting the cellular deconvolution task. To refine our comparative results, we provide decision-tree-style guidelines and recommendations for method selection and their additional features, which will help users choose the method best suited to their needs.
-
A universal framework for single-cell multi-omics data integration with graph convolutional networks (Briefings in bioinformatics, Oxford University Press (OUP), 2023-03-17) [Article] Single-cell omics data are growing at an unprecedented rate, but integrating them effectively remains challenging because of differences in sequencing methods, quality, and expression patterns across omics datasets. In this study, we propose a universal framework for the integration of single-cell multi-omics data based on a graph convolutional network (GCN-SC). Among multiple single-cell datasets, GCN-SC usually selects the one with the largest number of cells as the reference and treats the rest as query datasets. It uses the mutual nearest neighbor algorithm to identify cell pairs, which provide connections between cells both within and across the reference and query datasets. A GCN algorithm then takes the mixed graph constructed from these cell pairs to adjust the count matrices of the query datasets. Finally, dimension reduction is performed using non-negative matrix factorization before visualization. By applying GCN-SC to six datasets, we show that GCN-SC can effectively integrate sequencing data from multiple single-cell sequencing technologies, species, or omics types, and that it outperforms state-of-the-art methods, including Seurat, LIGER, GLUER and Pamona.
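The mutual nearest neighbor step mentioned in this abstract can be illustrated with a minimal sketch. This is not the GCN-SC implementation: the function names, the Euclidean distance, the choice of k, and the toy 2-D coordinates standing in for cell profiles are all assumptions made for illustration.

```python
from math import dist

def k_nearest(points_a, points_b, k):
    """For each point in points_a, return the index set of its k nearest points in points_b."""
    neighbors = []
    for p in points_a:
        order = sorted(range(len(points_b)), key=lambda j: dist(p, points_b[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def mutual_nearest_pairs(reference, query, k=1):
    """Return (reference_index, query_index) pairs that are in each other's k-nearest sets."""
    ref_to_query = k_nearest(reference, query, k)
    query_to_ref = k_nearest(query, reference, k)
    return sorted(
        (i, j)
        for i, nbrs in enumerate(ref_to_query)
        for j in nbrs
        if i in query_to_ref[j]
    )

# Toy coordinates (hypothetical); real inputs would be high-dimensional cell profiles.
reference = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
query = [(0.2, -0.1), (5.1, 4.8), (20.0, 20.0)]
pairs = mutual_nearest_pairs(reference, query, k=1)
print(pairs)
```

In this toy case the third reference cell and the third query cell have no mutual partner, so they contribute no cross-dataset edge to the graph.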
-
miProBERT: identification of microRNA promoters based on the pre-trained model BERT (Briefings in bioinformatics, Oxford University Press (OUP), 2023-03-17) [Article] Accurate prediction of promoter regions driving miRNA gene expression has become a major challenge due to the lack of annotation information for pri-miRNA transcripts. This gap hinders our understanding of miRNA-mediated regulatory networks. Some algorithms have been designed during the past decade to detect miRNA promoters. However, these methods rely on biosignal data such as CpG islands and still need to be improved. Here, we propose miProBERT, a BERT-based model for predicting promoters directly from gene sequences without using any structural or biological signals. To our knowledge, this is the first time a BERT-based model has been employed to identify miRNA promoters. We use the pre-trained model DNABERT, fine-tune it on the gene promoter dataset so that the model includes information about the richer biological properties of promoter sequences in its representation, and then systematically scan the upstream regions of each intergenic miRNA using the fine-tuned model. About 665 miRNA promoters are found. The innovative use of a random substitution strategy to construct a negative dataset improves the discriminative ability of the model and further reduces the false positive rate (FPR) to as low as 0.0421. On independent datasets, miProBERT outperformed other gene promoter prediction methods. In a comparison on 33 experimentally validated miRNA promoter datasets, miProBERT significantly outperformed previously developed miRNA promoter prediction programs, with 78.13% precision and 75.76% recall. We further verify the predicted promoter regions by analyzing conservation, CpG content and histone marks. The effectiveness and robustness of miProBERT are highlighted.
-
Diabetic cardiomyopathy: The role of microRNAs and long non-coding RNAs (Frontiers in Endocrinology, Frontiers Media SA, 2023-03-07) [Article] Diabetes mellitus (DM) is on the rise, necessitating the development of novel therapeutic and preventive strategies to mitigate the disease’s debilitating effects. Diabetic cardiomyopathy (DCMP) is among the leading causes of morbidity and mortality in diabetic patients globally. DCMP manifests as cardiomyocyte hypertrophy, apoptosis, and myocardial interstitial fibrosis before progressing to heart failure. Evidence suggests that non-coding RNAs, such as long non-coding RNAs (lncRNAs) and microRNAs (miRNAs), regulate diabetic cardiomyopathy-related processes such as insulin resistance, cardiomyocyte apoptosis and inflammation, emphasizing their heart-protective effects. This paper reviewed the literature data from animal and human studies on the non-trivial roles of miRNAs and lncRNAs in the context of DCMP in diabetes and demonstrated their future potential in DCMP treatment in diabetic patients.
-
Computational network analysis of host genetic risk variants of severe COVID-19 (Human genomics, Springer Science and Business Media LLC, 2023-03-02) [Article] Background: Genome-wide association studies have identified numerous human host genetic risk variants that play a substantial role in the host immune response to SARS-CoV-2. Although these genetic risk variants significantly increase the severity of COVID-19, their influence on body systems is poorly understood. Therefore, we aim to interpret the biological mechanisms and pathways associated with the genetic risk factors and immune responses in severe COVID-19. We perform a deep analysis of previously identified risk variants and infer the hidden interactions between their molecular networks through disease mapping and the similarity of the molecular functions between constructed networks. Results: We designed a four-stage computational workflow for systematic genetic analysis of the risk variants. We integrated the molecular profiles of the risk factors with associated diseases, then constructed protein–protein interaction networks. We identified 24 protein–protein interaction networks with 939 interactions derived from 109 filtered risk variants in 60 risk genes and 56 proteins. The majority of molecular functions, interactions and pathways are involved in immune responses; several interactions and pathways are related to the metabolic and cardiovascular systems, which could lead to multi-organ complications and dysfunction. Conclusions: This study highlights the importance of analyzing molecular interactions and pathways to understand the heterogeneous susceptibility of the host immune response to SARS-CoV-2. We propose new insights into pathogenicity analysis of infections by including genetic risk information as essential factors to predict future complications during and after infection. This approach may assist more precise clinical decisions and accurate treatment plans to reduce COVID-19 complications.
-
Interpretable Research Interest Shift Detection with Temporal Heterogeneous Graphs (Association for Computing Machinery, Inc, 2023-02-27) [Conference Paper] Researchers dedicate themselves to research problems they are interested in and often have evolving research interests in their academic careers. The study of research interest shift detection can help to find facts relevant to scientific training paths, scientific funding trends, and knowledge discovery. Existing methods define specific graph structures, such as author-conference-topic networks and co-citing networks, to detect research interest shifts. They either ignore the temporal factor or miss heterogeneous information characterizing academic activities. More importantly, the detection results lack interpretations of how research interests change over time, thus reducing the model's credibility. To address these issues, we propose a novel interpretable research interest shift detection model with temporal heterogeneous graphs. We first construct temporal heterogeneous graphs to represent the research interests of the target authors. To make the detection interpretable, we design a deep neural network to parameterize the generation process of interpretation on the predicted results in the form of a weighted sub-graph. Additionally, to improve the training process, we propose a semantic-aware negative data sampling strategy to generate non-interesting auxiliary shift graphs as contrastive samples. Extensive experiments demonstrate that our model outperforms the state-of-the-art baselines on two public academic graph datasets and is capable of producing interpretable results.
-
Personalized and privacy-preserving federated heterogeneous medical image analysis with PPPML-HMI (Cold Spring Harbor Laboratory, 2023-02-24) [Preprint] Heterogeneous data is endemic in medical imaging because hospitals use diverse device models and settings. However, there are few open-source frameworks for federated heterogeneous medical image analysis that provide personalization and privacy protection simultaneously without the need to modify the existing model structures or to share any private data. In this paper, we proposed PPPML-HMI, an open-source learning paradigm for personalized and privacy-preserving federated heterogeneous medical image analysis. To the best of our knowledge, personalization and privacy protection were achieved simultaneously for the first time under the federated scenario by integrating the PerFedAvg algorithm and designing our novel cyclic secure aggregation with the homomorphic encryption algorithm. To show the utility of PPPML-HMI, we applied it to a simulated classification task, namely the classification of healthy people and patients from the RAD-ChestCT Dataset, and one real-world segmentation task, namely the segmentation of lung infections from COVID-19 CT scans. For the real-world task, PPPML-HMI achieved ∼5% higher Dice score on average compared to conventional FL under the heterogeneous scenario. Meanwhile, we applied the improved deep leakage from gradients to simulate adversarial attacks and showed the solid privacy-preserving capability of PPPML-HMI. By applying PPPML-HMI to both tasks with different neural networks, a varied number of users, and sample sizes, we further demonstrated the strong robustness of PPPML-HMI.
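For readers unfamiliar with the Dice score reported for the segmentation task, the following is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name and the toy masks are illustrative assumptions and are not taken from PPPML-HMI.

```python
def dice_score(pred, truth):
    """Dice coefficient between two equally long binary masks (flattened to 1-D)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total

# Toy masks (hypothetical): 1 marks a pixel labeled as infected.
predicted_mask = [1, 1, 0, 0, 1]
reference_mask = [1, 0, 0, 1, 1]
score = dice_score(predicted_mask, reference_mask)
print(round(score, 4))
```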
-
Applications of deep learning in understanding gene regulation (Cell reports methods, Elsevier BV, 2023-02-22) [Article] Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason, deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
-
Audit to Forget: A Unified Method to Revoke Patients’ Private Data in Intelligent Healthcare (Cold Spring Harbor Laboratory, 2023-02-21) [Preprint] Revoking personal private data is one of the basic human rights, and it is already protected by privacy-preserving laws in many countries. However, with the development of data science, machine learning and deep learning techniques, this right is usually neglected or violated as more and more patients’ data are being collected and used for model training, especially in intelligent healthcare, thus making intelligent healthcare a sector where technology must meet the law, regulations, and privacy principles to ensure that the innovation is for the common good. In order to secure patients’ right to be forgotten, we proposed a novel solution that uses auditing to guide the forgetting process, where auditing means determining whether a dataset has been used to train the model and forgetting requires the information of a query dataset to be forgotten from the target model. We unified these two tasks by introducing a new approach called knowledge purification. To implement our solution, we developed AFS, a unified open-source software package, which is able to evaluate and revoke patients’ private data from pre-trained deep learning models. We demonstrated the generality of AFS by applying it to four tasks on different datasets with various data sizes and architectures of deep learning networks.
-
New insights on the cardiovascular effects of IGF-1 (Frontiers in endocrinology, Frontiers Media SA, 2023-02-09) [Article] Introduction: Cardiovascular (CV) disorders are steadily increasing, making them the world’s most prevalent health issue. New research highlights the importance of insulin-like growth factor 1 (IGF-1) for maintaining CV health. Methods: We searched PubMed and MEDLINE for English and non-English articles with English abstracts published between 1957 (when the first report on IGF-1 identification was published) and 2022. The top search terms were: IGF-1, cardiovascular disease, IGF-1 receptors, IGF-1 and microRNAs, therapeutic interventions with IGF-1, IGF-1 and diabetes, IGF-1 and cardiovascular disease. The search retrieved original peer-reviewed articles, which were further analyzed, focusing on the role of IGF-1 in pathophysiological conditions. We specifically focused on including the most recent findings published in the past five years. Results: IGF-1, an anabolic growth factor, regulates cell division, proliferation, and survival. In addition to its well-known growth-promoting and metabolic effects, there is mounting evidence that IGF-1 plays a specialized role in the complex activities that underpin CV function. IGF-1 promotes cardiac development and improves cardiac output, stroke volume, contractility, and ejection fraction. Furthermore, IGF-1 mediates many growth hormone (GH) actions. IGF-1 stimulates contractility and tissue remodeling in humans to improve heart function after myocardial infarction. IGF-1 also improves the lipid profile, lowers insulin levels, increases insulin sensitivity, and promotes glucose metabolism. These findings point to the intriguing medicinal potential of IGF-1. Human studies associate low serum levels of free or total IGF-1 with an increased risk of CV and cerebrovascular illness. Extensive human trials are being conducted to investigate the therapeutic efficacy and outcomes of IGF-1-related therapy. Discussion: We anticipate the development of novel IGF-1-related therapy with minimal side effects. This review discusses recent findings on the role of IGF-1 in the cardiovascular (CV) system, including both normal and pathological conditions. We also discuss progress in therapeutic interventions aimed at targeting the IGF axis and provide insights into the epigenetic regulation of IGF-1 mediated by microRNAs.
-
TriNet: A tri-fusion neural network for the prediction of anticancer and antimicrobial peptides (Patterns, Elsevier BV, 2023-02-03) [Article] The accurate identification of anticancer peptides (ACPs) and antimicrobial peptides (AMPs) remains a computational challenge. We propose a tri-fusion neural network termed TriNet for the accurate prediction of both ACPs and AMPs. The framework first defines three kinds of features to capture the peptide information contained in serial fingerprints, sequence evolutions, and physicochemical properties, which are then fed into three parallel modules: a convolutional neural network module enhanced by channel attention, a bidirectional long short-term memory module, and an encoder module for training and final classification. To achieve a better training effect, TriNet is trained using an approach based on iterative interactions between the samples in the training and validation datasets. TriNet is tested on multiple challenging ACP and AMP datasets and exhibits significant improvements over various state-of-the-art methods.
-
AdvCat: Domain-Agnostic Robustness Assessment for Cybersecurity-Critical Applications with Categorical Inputs (IEEE, 2023-01-26) [Conference Paper] Machine Learning-as-a-Service systems (MLaaS) have been largely developed for cybersecurity-critical applications, such as detecting network intrusions and fake news campaigns. Despite their effectiveness, their robustness against adversarial attacks is one of the key trust concerns for MLaaS deployment. We are thus motivated to assess the adversarial robustness of the Machine Learning models residing at the core of these security-critical applications with categorical inputs. Previous research efforts on assessing model robustness against manipulation of categorical inputs are specific to use cases and heavily depend on domain knowledge, or require white-box access to the target ML model. Such limitations prevent the robustness assessment from being offered as a domain-agnostic service to various real-world applications. We propose a provably optimal yet computationally highly efficient adversarial robustness assessment protocol for a wide range of ML-driven cybersecurity-critical applications. We demonstrate the use of the domain-agnostic robustness assessment method with a substantial experimental study on fake news detection and intrusion detection problems.
-
Type 2 Diabetes Mellitus and its comorbidity, Alzheimer’s disease: Identifying critical microRNA using machine learning (Frontiers in Endocrinology, Frontiers Media SA, 2023-01-19) [Article] MicroRNAs (miRNAs) are critical regulators of gene expression in healthy and diseased states, and numerous studies have established their tremendous potential as a tool for improving the diagnosis of Type 2 Diabetes Mellitus (T2D) and its comorbidities. In this regard, we computationally identify novel top-ranked hub miRNAs that might be involved in T2D. We accomplish this via two strategies: 1) by ranking miRNAs based on the number of T2D differentially expressed genes (DEGs) they target, and 2) using only the common DEGs between T2D and its comorbidity, Alzheimer’s disease (AD), to predict and rank miRNAs. Then classifier models are built using the DEGs targeted by each miRNA as features. Here, we show the T2D DEGs targeted by hsa-mir-1-3p, hsa-mir-16-5p, hsa-mir-124-3p, hsa-mir-34a-5p, hsa-let-7b-5p, hsa-mir-155-5p, hsa-mir-107, hsa-mir-27a-3p, hsa-mir-129-2-3p, and hsa-mir-146a-5p are capable of distinguishing T2D samples from the controls, which serves as a measure of confidence in the miRNAs’ potential role in T2D progression. Moreover, for the second strategy, we show other critical miRNAs can be made apparent through the disease’s comorbidities, and in this case, overall, the hsa-mir-103a-3p models work well for all the datasets, especially in T2D, while the hsa-mir-124-3p models achieved the best scores for the AD datasets. To the best of our knowledge, this is the first study that used predicted miRNAs to determine the features that can separate the diseased samples (T2D or AD) from the normal ones, instead of using conventional non-biology-based feature selection methods.
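The first ranking strategy described in this abstract, ordering miRNAs by how many differentially expressed genes they target, can be sketched as follows. The function name, the placeholder identifiers (miR-A, GENE1, and so on), and the toy target lists are hypothetical and carry no biological meaning; the study's actual target prediction and scoring may differ.

```python
def rank_mirnas_by_deg_targets(mirna_targets, degs):
    """Rank miRNAs by the number of DEGs among their targets (ties broken by name)."""
    deg_set = set(degs)
    counts = {
        mirna: len(set(targets) & deg_set)
        for mirna, targets in mirna_targets.items()
    }
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Placeholder data (hypothetical identifiers, not real miRNA-gene relationships).
mirna_targets = {
    "miR-A": ["GENE1", "GENE2", "GENE5"],
    "miR-B": ["GENE2"],
    "miR-C": ["GENE1", "GENE2", "GENE3", "GENE9"],
}
degs = ["GENE1", "GENE2", "GENE3", "GENE4"]
ranking = rank_mirnas_by_deg_targets(mirna_targets, degs)
print(ranking)
```

Under the second strategy, the same function could be called with only the DEGs shared between the two diseases as the `degs` argument.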
-
Computational Studies of Auto-Active van der Waals Interaction Molecules on Ultra-Thin Black-Phosphorus Film (Molecules, MDPI AG, 2023-01-09) [Article] Using the van der Waals density functional theory, we studied the binding peculiarities of favipiravir (FP) and ebselen (EB) molecules on a monolayer of black phosphorene (BP). We systematically examined the interaction characteristics and thermodynamic properties in a vacuum and at a continuum solvent interface for active drug therapy. These results illustrate that the hybrid molecules enable functionalized two-dimensional (2D) complex systems with robust thermostability. We demonstrate in this study that these molecules remain flat on the monolayer BP system and that the phosphorus atoms remain intact. It is inferred that the hybrid FP+EB molecules show larger adsorption energy due to the van der Waals forces and planar electrostatic interactions. The changes in Gibbs free energy at different surface charge fluctuations and temperatures imply that FP and EB are allowed to adsorb from the gas phase onto the 2D film at high temperatures. Thereby, the results unveiled beneficial inhibitor molecules on two-dimensional BP nanocarriers, potentially introducing a modern strategy to enhance the development of advanced materials, biotechnology, and nanomedicine.
-
The protective role of nutritional antioxidants against oxidative stress in thyroid disorders (Frontiers in Endocrinology, Frontiers Media SA, 2023-01-04) [Article] Oxidative stress (OxS) is an imbalance between pro-oxidative and antioxidative cellular mechanisms, which may be systemic or organ-specific. Although OxS is a consequence of normal body and organ physiology, severely impaired oxidative homeostasis results in DNA hydroxylation, protein denaturation, lipid peroxidation, and apoptosis, ultimately compromising cells’ function and viability. The thyroid gland is an organ that exhibits both oxidative and antioxidative processes. Depending on the severity of OxS, the thyroid gland’s response could be physiological (i.e. hormone production and secretion) or pathological (i.e. development of diseases, such as goitre, thyroid cancer, or thyroiditis). Protective nutritional antioxidants may benefit defensive antioxidative systems in resolving pro-oxidative dominance and redox imbalance, preventing or delaying chronic thyroid diseases. This review provides information on nutritional antioxidants and their protective roles against impaired redox homeostasis in various thyroid pathologies. We also review novel findings related to the connection between the thyroid gland and gut microbiome and analyze the effects of probiotics with antioxidant properties on thyroid diseases.
-
Vision Transformer-based Weakly-Supervised Histopathological Image Analysis of Primary Brain Tumors (iScience, Elsevier BV, 2022-12-24) [Article] Diagnosis of primary brain tumors relies heavily on histopathology. Although various computational pathology methods have been developed for automated diagnosis of primary brain tumors, they usually require neuropathologists’ annotation of regions of interest or selection of image patches on whole-slide images (WSI). We developed an end-to-end Vision Transformer (ViT)-based deep learning architecture for brain tumor WSI analysis, yielding a highly interpretable deep-learning model, ViT-WSI. Based on the principle of weakly-supervised machine learning, ViT-WSI accomplishes the task of major primary brain tumor type and subtype classification. Using a systematic gradient-based attribution analysis procedure, ViT-WSI can discover diagnostic histopathological features for primary brain tumors. Furthermore, we demonstrated that ViT-WSI has high predictive power for inferring the status of three diagnostic glioma molecular markers, IDH1 mutation, p53 mutation, and MGMT methylation, directly from H&E-stained histopathological images, with patient-level AUC scores of 0.960, 0.874, and 0.845, respectively.
-
scBKAP: a clustering model for single-cell RNA-seq data based on bisecting K-means (IEEE/ACM Transactions on Computational Biology and Bioinformatics, IEEE, 2022-12-19) [Article] Advances in single-cell RNA sequencing (scRNA-seq) technologies allow researchers to analyze the genome-wide transcription profile and to solve biological problems at the individual-cell resolution. However, existing clustering methods for scRNA-seq suffer from the high dropout rate and the curse of dimensionality in the data. Here, we propose a novel pipeline, scBKAP, the cornerstone of which is a single-cell bisecting K-means clustering method based on an autoencoder network and a dimensionality reduction model, MPDR. Specifically, scBKAP utilizes an autoencoder network to reconstruct gene expression values from scRNA-seq data to alleviate the dropout issue, and the MPDR model, composed of the M3Drop feature selection algorithm and the PHATE dimensionality reduction algorithm, to reduce the dimensions of the reconstructed data. The dimensionality-reduced data are then fed into the bisecting K-means clustering algorithm to identify the clusters of cells. Comprehensive experiments demonstrate scBKAP's superior performance over nine state-of-the-art single-cell clustering methods on 21 public scRNA-seq datasets and simulated datasets.
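The final clustering step named in this abstract, bisecting K-means, can be sketched in plain Python. This is a generic illustration and not the scBKAP code: the deterministic seeding (first point and the point farthest from it), the rule of always splitting the cluster with the largest within-cluster squared error, and the toy 2-D points are assumptions made here.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    center = centroid(points)
    return sum(dist(p, center) ** 2 for p in points)

def two_means(points, iters=20):
    """Split points into two groups with 2-means, using a deterministic seeding."""
    c0 = points[0]
    c1 = max(points, key=lambda p: dist(p, c0))
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if dist(p, c0) <= dist(p, c1)]
        right = [p for p in points if dist(p, c0) > dist(p, c1)]
        if not left or not right:
            break
        new0, new1 = centroid(left), centroid(right)
        if (new0, new1) == (c0, c1):
            break  # assignments are stable
        c0, c1 = new0, new1
    return left, right

def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        worst = max(splittable, key=sse)
        left, right = two_means(worst)
        if not left or not right:
            break  # cluster cannot be split further (e.g. identical points)
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters

# Toy 2-D embedding (hypothetical); real input would be dimensionality-reduced expression data.
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0),
         (10.0, 11.0), (11.0, 10.0), (20.0, 0.0), (21.0, 0.0)]
clusters = bisecting_kmeans(cells, k=3)
print(clusters)
```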
-
Annotating TSSs in Multiple Cell Types Based on DNA Sequence and RNA-seq Data via DeeReCT-TSS (Genomics, proteomics & bioinformatics, Elsevier BV, 2022-12-15) [Article] The accurate annotation of transcription start sites (TSSs) and their usage is critical for the mechanistic understanding of gene regulation in different biological contexts. To this end, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner, and various computational tools have also been developed for in silico prediction of TSSs solely based on genomic sequences. Most of these computational tools cast the problem as a binary classification task on a balanced dataset, which results in large numbers of false positive predictions when applied at the genome scale. Here, we present DeeReCT-TSS, a deep learning-based method that is capable of identifying TSSs across the whole genome based on both DNA sequence and conventional RNA sequencing data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we have developed a meta-learning-based extension for simultaneous TSS annotation on 10 cell types, which enables the identification of cell type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets by correlating our predicted TSSs with experimentally defined TSS chromatin states.
-
Towards Efficient and Domain-Agnostic Evasion Attack with High-dimensional Categorical Inputs (arXiv, 2022-12-13) [Preprint] Our work aims to search for feasible adversarial perturbations to attack a classifier with high-dimensional categorical inputs in a domain-agnostic setting. This is intrinsically an NP-hard knapsack problem where the exploration space grows explosively as the feature dimension increases. Without the help of domain knowledge, solving this problem via heuristic methods, such as Branch-and-Bound, suffers from exponential complexity and can still yield arbitrarily bad attack results. We address the challenge via the lens of multi-armed bandit based combinatorial search. Our proposed method, namely FEAT, treats modifying each categorical feature as pulling an arm in multi-armed bandit programming. Our objective is to achieve a highly efficient and effective attack using an Orthogonal Matching Pursuit (OMP)-enhanced Upper Confidence Bound (UCB) exploration strategy. Our theoretical analysis bounding the regret gap of FEAT guarantees its practical attack performance. In empirical analysis, we compare FEAT with other state-of-the-art domain-agnostic attack methods over various real-world categorical data sets of different applications. Substantial experimental observations confirm the expected efficiency and attack effectiveness of FEAT applied in different application scenarios. Our work further hints at the applicability of FEAT for assessing the adversarial vulnerability of classification systems with high-dimensional categorical inputs.
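The idea of treating each categorical feature as a bandit arm can be illustrated with the textbook UCB1 rule. This sketch is plain UCB1, not FEAT's OMP-enhanced variant, and the fixed per-feature gains are made-up stand-ins for the change in classifier output that a real attack would observe after modifying a feature.

```python
import math

def select_arm(counts, totals, c=2.0):
    """Return the index of the arm with the highest UCB1 score."""
    t = sum(counts)
    best_arm, best_score = 0, -math.inf
    for arm, (n, total) in enumerate(zip(counts, totals)):
        # An arm that has never been pulled gets an infinite score and is tried first.
        score = math.inf if n == 0 else total / n + math.sqrt(c * math.log(t) / n)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm

def run_ucb(gains, rounds):
    """Pull arms for a number of rounds and return how often each arm was chosen."""
    counts = [0] * len(gains)
    totals = [0.0] * len(gains)
    for _ in range(rounds):
        arm = select_arm(counts, totals)
        counts[arm] += 1
        totals[arm] += gains[arm]  # deterministic toy reward
    return counts

# Hypothetical gain from modifying each of three categorical features.
TOY_GAINS = [0.1, 0.8, 0.3]
counts = run_ucb(TOY_GAINS, rounds=200)
best_feature = counts.index(max(counts))
print(counts, best_feature)
```

The exploration bonus shrinks as a feature is tried more often, so most of the query budget goes to the feature whose modification has the largest observed effect.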