Bio-Ontology Research Group (BORG)


For more information visit: https://cemse.kaust.edu.sa/borg


Recent Submissions

Now showing 1 - 5 of 170
  • Preprint

    DeepGO-SE: Protein function prediction as Approximate Semantic Entailment

    (Research Square Platform LLC, 2023-09-26) Kulmanov, Maxat; Guzmán-Vega, Francisco J.; Duek, Paula; Lane, Lydie; Arold, Stefan T.; Hoehndorf, Robert; Computer Science Program; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division; Bioscience Program; Biological and Environmental Science and Engineering (BESE) Division; CALIPHO group, SIB Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, Geneva 4, 1211, Switzerland.; Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, CMU, 1 rue Michel Servet, Geneva 4, 1211, Switzerland

    The Gene Ontology (GO) is one of the most successful ontologies in the biological domain. GO is a formal theory with over 100,000 axioms that describe the molecular functions, biological processes, and cellular locations of proteins in three sub-ontologies. Many methods have been developed to automatically predict protein functions. However, only a few of them use the background knowledge provided in the axioms of GO for knowledge-enhanced machine learning, or adjust and evaluate the model for the differences between the sub-ontologies. We have developed DeepGO-SE, a novel method that predicts GO functions from protein sequences using a pretrained large language model combined with a neuro-symbolic model that exploits GO axioms and performs protein function prediction as a form of approximate semantic entailment. We specifically evaluate DeepGO-SE on proteins that have no significant similarity with training proteins and demonstrate that DeepGO-SE can improve function prediction for those proteins.

  • Article

    Improving the classification of cardinality phenotypes using collections.

    (Springer Science and Business Media LLC, 2023-08-07) Alghamdi, Sarah M.; Hoehndorf, Robert; Computational Bioscience Research Center (CBRC), Computer, Electrical, and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia.; Computer Science Program; Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division; Computational Bioscience Research Center (CBRC); King Abdul-Aziz University, Faculty of Computing and Information Technology, 25732, Rabigh, Saudi Arabia.

    Motivation: Phenotypes are observable characteristics of an organism and they can be highly variable. Information about phenotypes is collected in a clinical context to characterize disease, and is also collected in model organisms and stored in model organism databases where it is used to understand gene functions. Phenotype data is also used in computational data analysis and machine learning methods to provide novel insights into disease mechanisms and support personalized diagnosis of disease. For mammalian organisms and in a clinical context, ontologies such as the Human Phenotype Ontology and the Mammalian Phenotype Ontology are widely used to formally and precisely describe phenotypes. We specifically analyze axioms pertaining to phenotypes of collections of entities within a body, and we find that some of the axioms in phenotype ontologies lead to inferences that may not accurately reflect the underlying biological phenomena. Results: We reformulate the phenotypes of collections of entities using an ontological theory of collections. By reformulating phenotypes of collections in phenotype ontologies, we avoid potentially incorrect inferences pertaining to the cardinality of these collections. We apply our method to two phenotype ontologies and show that the reformulation not only removes some problematic inferences but also quantitatively improves biological data analysis.

  • Conference Paper

    From Axioms over Graphs to Vectors, and Back Again: Evaluating the Properties of Graph-based Ontology Embeddings

    (CEUR-WS, 2023-05) Zhapa-Camacho, Fernando; Hoehndorf, Robert; Computational Bioscience Research Center, Computer, Electrical & Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, 4700 KAUST, 23955 Thuwal, Saudi Arabia; Computer Science Program; Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division; Computational Bioscience Research Center (CBRC)

    Several approaches have been developed that generate embeddings for Description Logic ontologies and use these embeddings in machine learning. One approach to generating ontology embeddings is to first embed the ontologies into a graph structure, i.e., to introduce a set of nodes and edges for named entities and logical axioms, and then to apply a graph embedding method to embed the graph in R^N. Methods that embed ontologies in graphs (graph projections) have different formal properties related to the type of axioms they can utilize, whether the projections are invertible or not, and whether they can be applied to asserted axioms or their deductive closure. We analyze, qualitatively and quantitatively, several graph projection methods that have been used to embed ontologies, and we demonstrate the effect of the properties of graph projections on the performance of predicting axioms from ontology embeddings. We find that there are substantial differences between different projection methods, and that both the projection of axioms into nodes and edges and the ontological choices made in representing knowledge will impact the success of using ontology embeddings to predict axioms.
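
    As an illustration of what a graph projection does, the following is a minimal sketch, not one of the methods evaluated in the paper. It assumes a hypothetical `Axiom` tuple and a hypothetical `project` function that turn two simple axiom patterns (A ⊑ B and A ⊑ ∃r.B) into labeled edges and set aside every other axiom type, which is one way a projection can fail to be invertible. The class and relation names in the example are invented.

```python
from typing import List, NamedTuple, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, edge label, target node)


class Axiom(NamedTuple):
    """Hypothetical, simplified axiom record (an assumption for this sketch)."""
    kind: str                   # "subclass" for A ⊑ B, "some" for A ⊑ ∃r.B
    sub: str                    # left-hand class name
    sup: str                    # right-hand class (or filler) name
    role: Optional[str] = None  # relation name, used only by "some" axioms


def project(axioms: List[Axiom]) -> Tuple[Set[Edge], List[Axiom]]:
    """Map supported axioms to labeled edges; collect unsupported ones."""
    edges: Set[Edge] = set()
    dropped: List[Axiom] = []
    for ax in axioms:
        if ax.kind == "subclass":
            edges.add((ax.sub, "subClassOf", ax.sup))
        elif ax.kind == "some" and ax.role is not None:
            edges.add((ax.sub, ax.role, ax.sup))
        else:
            # This axiom leaves no trace in the graph, so the original
            # ontology cannot be fully recovered from the edges alone.
            dropped.append(ax)
    return edges, dropped


axioms = [
    Axiom("subclass", "Nucleus", "Organelle"),
    Axiom("some", "Nucleus", "Cell", role="part_of"),
    Axiom("disjoint", "Nucleus", "Mitochondrion"),
]
edges, dropped = project(axioms)
```

    The resulting edge set could then be passed to any graph embedding method; the axioms in `dropped` are the information this particular projection cannot represent.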

  • Software

    bio-ontology-research-group/STARVar: STARVar: Symptom based Tool for Automatic Ranking of Variants using evidence from literature and genomes

    (GitHub, 2022-01-03) Kafkas, Senay; Abdelhakim, Marwa; Uludag, Mahmut; Althagafi, Azza Th.; Alghamdi, Malak; Hoehndorf, Robert; Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, 23955, Thuwal, Saudi Arabia; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division; Computer Science Program; Computer Science Department, College of Computers and Information Technology, Taif University, 21655, Taif, Saudi Arabia; Medical Genetic Division, Department of Pediatrics, College of Medicine, King Saud University, 2925, Riyadh, Saudi Arabia

    STARVar: Symptom based Tool for Automatic Ranking of Variants using evidence from literature and genomes

  • Preprint

    Critical assessment of variant prioritization methods for rare disease diagnosis within the Rare Genomes Project

    (Cold Spring Harbor Laboratory, 2023-08-05) Stenton, Sarah L.; O'Leary, Melanie; Lemire, Gabrielle; VanNoy, Grace E.; DiTroia, Stephanie; Ganesh, Vijay S.; Groopman, Emily; O'Heir, Emily; Mangilog, Brian; Osei-Owusu, Ikeoluwa; Pais, Lynn S.; Serrano, Jillian; Singer-Berk, Moriel; Weisburd, Ben; Wilson, Michael; Austin-Tse, Christina; Abdelhakim, Marwa; Althagafi, Azza Th.; Babbi, Giulia; Bellazzi, Riccardo; Bovo, Samuele; Carta, Maria Giulia; Casadio, Rita; Coenen, Pieter-Jan; De Paoli, Federica; Floris, Matteo; Gajapathy, Manavalan; Hoehndorf, Robert; Jacobsen, Julius O.B.; Joseph, Thomas; Kamandula, Akash; Katsonis, Panagiotis; Kint, Cyrielle; Lichtarge, Olivier; Limongelli, Ivan; Lu, Yulan; Magni, Paolo; Mamidi, Tarun Karthik Kumar; Martelli, Pier Luigi; Mulargia, Marta; Nicora, Giovanna; Nykamp, Keith; Pejaver, Vikas; Peng, Yisu; Pham, Thi Hong Cam; Podda, Maurizio S.; Rao, Aditya; Rizzo, Ettore; Saipradeep, Vangala G.; Savojardo, Castrense; Schols, Peter; Shen, Yang; Sivadasan, Naveen; Smedley, Damian; Soru, Dorian; Srinivasan, Rajgopal; Sun, Yuanfei; Sunderam, Uma; Tan, Wuwei; Tiwari, Naina; Wang, Xiao; Wang, Yaqiong; Williams, Amanda; Worthey, Elizabeth A.; Yin, Rujie; You, Yuning; Zeiberg, Daniel; Zucca, Susanna; Bakolitsa, Constantina; Brenner, Steven E.; Fullerton, Stephanie M.; Radivojac, Predrag; Rehm, Heidi L.; O'Donnell-Luria, Anne; Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division; Computer Science Program; Computational Bioscience Research Center (CBRC); Computer Science Department, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia

    Background: A major obstacle faced by rare disease families is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years, and causal variants are identified in under 50%. The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing (GS) for diagnosis and gene discovery. Families are consented for sharing of sequence and phenotype data with researchers, allowing development of a Critical Assessment of Genome Interpretation (CAGI) community challenge, placing variant prioritization models head-to-head in a real-life clinical diagnostic setting. Methods: Predictors were provided a dataset of phenotype terms and variant calls from GS of 175 RGP individuals (65 families), including 35 solved training set families, with causal variants specified, and 30 test set families (14 solved, 16 unsolved). The challenge tasked teams with identifying the causal variants in as many test set families as possible. Ranked variant predictions were submitted with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics, a weighted score based on rank position of true positive causal variants and maximum F-measure, based on precision and recall of causal variants across EPCR thresholds. Results: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performing teams recalled the causal variants in up to 13 of 14 solved families by prioritizing high quality variant calls that were rare, predicted deleterious, segregating correctly, and consistent with reported phenotype. In unsolved families, newly discovered diagnostic variants were returned to two families following confirmatory RNA sequencing, and two prioritized novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant, in an unsolved proband with phenotype overlap with asparagine synthetase deficiency. Conclusions: By objective assessment of variant predictions, we provide insights into current state-of-the-art algorithms and platforms for genome sequencing analysis for rare disease diagnosis and explore areas for future optimization. Identification of diagnostic variants in unsolved families promotes synergy between researchers with clinical and computational expertise as a means of advancing the field of clinical genome interpretation.
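
    The maximum F-measure mentioned in the Methods can be illustrated with a minimal sketch. It assumes a generic pooled definition: for each candidate EPCR threshold, every variant scored at or above the threshold counts as predicted causal, and the best F-measure over all thresholds is reported. The challenge's exact scoring rules (for example, any per-family handling) are not given in the abstract, so the function name `max_f_measure` and the toy data are illustrative assumptions only.

```python
from typing import List, Tuple


def max_f_measure(preds: List[Tuple[float, bool]], n_causal: int) -> Tuple[float, float]:
    """Return (best F-measure, threshold achieving it).

    preds    -- (EPCR score, is truly causal) for each submitted variant
    n_causal -- total number of true causal variants, including any
                that were never submitted
    """
    best_f, best_t = 0.0, 1.0
    # Try each distinct score as a threshold, from strictest to loosest.
    for t in sorted({score for score, _ in preds}, reverse=True):
        selected = [causal for score, causal in preds if score >= t]
        tp = sum(selected)
        if tp == 0:
            continue  # precision and recall are both zero here
        precision = tp / len(selected)
        recall = tp / n_causal
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t


# Invented toy predictions: two of four submitted variants are truly causal.
toy = [(0.9, True), (0.8, False), (0.6, True), (0.3, False)]
fmax, threshold = max_f_measure(toy, n_causal=2)
```

    On this toy input the best trade-off is reached at a threshold of 0.6, where both causal variants are recovered at the cost of one false positive.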