Bio-Ontology Research Group (BORG)
For more information visit: https://cemse.kaust.edu.sa/borg
Recent Submissions
-
LEP-AD: Language Embedding of Proteins and Attention to Drugs predicts drug target interactions(Cold Spring Harbor Laboratory, 2023-03-15) [Preprint]Predicting drug-target interactions is a tremendous challenge for drug development and lead optimization. Recent advances include training algorithms to learn drug-target interactions from data and molecular simulations. Here we utilize Evolutionary Scale Modeling (ESM-2) models to establish a Transformer protein language model for drug-target interaction predictions. Our architecture, LEP-AD, combines pre-trained ESM-2 and Transformer-GCN models predicting binding affinity values. We report new state-of-the-art results compared to competing methods such as SimBoost, DeepCPI, Attention-DTA, GraphDTA, and more using multiple datasets, including Davis, KIBA, DTC, Metz, ToxCast, and STITCH. Finally, we find that a pre-trained model with embedding of proteins (LEP-AD) outperforms a model using an explicit AlphaFold 3D representation of proteins (e.g., LEP-AD supervised by AlphaFold). The LEP-AD model scales favorably in performance with the size of training data.
-
BORD: A Biomedical Ontology based method for concept Recognition using Distant supervision: Application to Phenotypes and Diseases(Cold Spring Harbor Laboratory, 2023-02-16) [Preprint]Motivation: Concept recognition in biomedical text is an important yet challenging task. The two main approaches to recognize concepts in text are dictionary-based approaches and supervised machine learning approaches. While dictionary-based approaches fail in recognising new concepts and variations of existing concepts, supervised methods require sufficiently large annotated datasets which are expensive to obtain. Methods based on distant supervision have been developed to use machine learning without large annotated corpora. However, for biomedical concept recognition, these approaches do not yet exploit the context in which a concept occurs in literature, and they do not make use of prior knowledge about dependencies between concepts. Results: We developed BORD, a Biomedical Ontology-based method for concept Recognition using Distant supervision. BORD utilises context from corpora which are lexically annotated using labels and synonyms from the classes of a biomedical ontology for model training. Furthermore, BORD utilises the ontology hierarchy for normalising the recognised mentions to their concept identifiers. We show how our method improves the performance of state of the art methods for recognising disease and phenotype concepts in biomedical literature. Our method is generic, does not require manually annotated corpora, and is robust to identify mentions of ontology classes in text. Moreover, to the best of our knowledge, this is the first approach utilising the ontology hierarchy for concept recognition.
-
Semantic Units: Organizing knowledge graphs into semantically meaningful units of representation(arXiv, 2023-01-03) [Preprint]Background: Knowledge graphs and ontologies are becoming increasingly important as technical solutions for Findable, Accessible, Interoperable, and Reusable data and metadata (FAIR Guiding Principles). We discuss four challenges that impede the use of FAIR knowledge graphs and propose semantic units as their potential solution. Results: Semantic units structure a knowledge graph into identifiable and semantically meaningful subgraphs. Each unit is represented by its own resource, instantiates a corresponding semantic unit class, and can be implemented as a FAIR Digital Object and a nanopublication in RDF/OWL and property graphs. We distinguish statement and compound units as basic categories of semantic units. Statement units represent the smallest independent propositions that are semantically meaningful for a human reader. They consist of one or more triples and mathematically partition a knowledge graph. We distinguish assertional, contingent (prototypical), and universal statement units as basic types of statement units and propose representational schemes and formal semantics for them (including for absence statements, negations, and cardinality restrictions) that do not involve blank nodes and that translate back to OWL. Compound units, on the other hand, represent semantically meaningful collections of semantic units, and we distinguish various types of compound units, representing different levels of representational granularity, different types of granularity trees, and different frames of reference. Conclusions: Semantic units support making statements about statements and can be used for graph alignment, subgraph matching, knowledge graph profiling, and managing access restrictions to sensitive data. 
Organizing the graph into semantic units supports the separation of ontological, diagnostic (i.e., referential), and discursive information, and it also supports the differentiation of multiple frames of reference.
-
Klarigi: Characteristic explanations for semantic biomedical data(Computers in Biology and Medicine, Elsevier BV, 2022-12-22) [Article]Annotation of biomedical entities with ontology classes provides for formal semantic analysis and mobilisation of background knowledge in determining their relationships. To date, enrichment analysis has been routinely employed to identify classes that are over-represented in annotations across sets of groups, such as biosample gene expression profiles or patient phenotypes, and is useful for a range of tasks including differential diagnosis and causative variant prioritisation. These approaches, however, usually consider only univariate relationships, make limited use of the semantic features of ontologies, and provide limited information and evaluation of the explanatory power of both singular and grouped candidate classes. Moreover, they are not designed to solve the problem of deriving cohesive, characteristic, and discriminatory sets of classes for entity groups. We have developed a new tool, called Klarigi, which introduces multiple scoring heuristics for identification of classes that are both compositional and discriminatory for groups of entities annotated with ontology classes. The tool includes a novel algorithm for derivation of multivariable semantic explanations for entity groups, makes use of semantic inference through live use of an ontology reasoner, and includes a classification method for identifying the discriminatory power of candidate sets, in addition to significance testing apposite to traditional enrichment approaches. We describe the design and implementation of Klarigi, including its scoring and explanation determination methods, and evaluate its use in application to two test cases with clinical significance, comparing and contrasting methods and results with literature-based and enrichment analysis methods. 
We demonstrate that Klarigi produces characteristic and discriminatory explanations for groups of biomedical entities in two settings. We also show that these explanations recapitulate and extend the knowledge held in existing biomedical databases and literature for several diseases. We conclude that Klarigi provides a distinct and valuable perspective on biomedical datasets when compared with traditional enrichment methods, and therefore constitutes a new method by which biomedical datasets can be explored, contributing to improved insight into semantic data.
-
mOWL: Python library for machine learning with biomedical ontologies(Bioinformatics, Oxford University Press (OUP), 2022-12-19) [Article]Motivation: Ontologies contain formal and structured information about a domain and are widely used in bioinformatics for annotation and integration of data. Several methods use ontologies to provide background knowledge in machine learning tasks, which is of particular importance in bioinformatics. These methods rely on a set of common primitives that are not readily available in a software library; a library providing these primitives would facilitate the use of current machine learning methods with ontologies and the development of novel methods for other ontology-based biomedical applications. Results: We developed mOWL, a Python library for machine learning with ontologies formalized in the Web Ontology Language (OWL). mOWL implements ontology embedding methods that map information contained in formal knowledge bases and ontologies into vector spaces while preserving some of the properties and relations in ontologies, as well as methods to use these embeddings for similarity computation, deductive inference, and zero-shot learning. We demonstrate mOWL on the knowledge-based prediction of protein–protein interactions (PPIs) using the Gene Ontology and gene–disease associations (GDAs) using phenotype ontologies.
-
A personal, reference quality, fully annotated genome from a Saudi individual(Cold Spring Harbor Laboratory, 2022-11-08) [Preprint]We have used multiple sequencing approaches to sequence the genome of a volunteer from Saudi Arabia. We use the resulting data to generate a de novo assembly of the genome, and use different computational approaches to refine the assembly. As a consequence, we provide a contiguous assembly of the complete genome of an individual from Saudi Arabia for all chromosomes except chromosome Y, and label this assembly KSA001. We transferred genome annotations from reference genomes and predicted genome features using methods from Artificial Intelligence to fully annotate KSA001, and we make all primary sequencing data, the assembly, and the genome annotations freely available in public databases using the FAIR data principles.
-
A comprehensive update on CIDO: the community-based coronavirus infectious disease ontology.(Journal of biomedical semantics, Springer Science and Business Media LLC, 2022-10-21) [Article]Background: The current COVID-19 pandemic and the previous SARS/MERS outbreaks of 2003 and 2012 have resulted in a series of major global public health crises. We argue that in the interest of developing effective and safe vaccines and drugs and to better understand coronaviruses and associated disease mechanisms it is necessary to integrate the large and exponentially growing body of heterogeneous coronavirus data. Ontologies play an important role in standard-based knowledge and data representation, integration, sharing, and analysis. Accordingly, we initiated the development of the community-based Coronavirus Infectious Disease Ontology (CIDO) in early 2020. Results: As an Open Biomedical Ontology (OBO) library ontology, CIDO is open source and interoperable with other existing OBO ontologies. CIDO is aligned with the Basic Formal Ontology and Viral Infectious Disease Ontology. CIDO has imported terms from over 30 OBO ontologies. For example, CIDO imports all SARS-CoV-2 protein terms from the Protein Ontology, COVID-19-related phenotype terms from the Human Phenotype Ontology, and over 100 COVID-19 terms for vaccines (both authorized and in clinical trial) from the Vaccine Ontology. CIDO systematically represents variants of SARS-CoV-2 viruses and over 300 amino acid substitutions therein, along with over 300 diagnostic kits and methods. CIDO also describes hundreds of host-coronavirus protein-protein interactions (PPIs) and the drugs that target proteins in these PPIs. CIDO has been used to model COVID-19 related phenomena in areas such as epidemiology. The scope of CIDO was evaluated by visual analysis supported by a summarization network method. 
CIDO has been used in various applications such as term standardization, inference, natural language processing (NLP) and clinical data integration. We have applied the amino acid variant knowledge present in CIDO to analyze differences between SARS-CoV-2 Delta and Omicron variants. CIDO's integrative host-coronavirus PPIs and drug-target knowledge has also been used to support drug repurposing for COVID-19 treatment. Conclusion: CIDO represents entities and relations in the domain of coronavirus diseases with a special focus on COVID-19. It supports shared knowledge representation, data and metadata standardization and integration, and has been used in a range of applications.
-
bio-ontology-research-group/KSA001: Telomere-2-Telomere Genome from Saudi Arabia(Github, 2022-09-27) [Software]Telomere-2-Telomere Genome from Saudi Arabia
-
FrameRate: learning the coding potential of unassembled metagenomic reads(Cold Spring Harbor Laboratory, 2022-09-19) [Preprint]Motivation: Metagenomic assembly is a slow and computationally intensive process and, despite needing iterative rounds for improvement and completeness, the resulting assembly often fails to incorporate many of the input sequencing reads. This is further complicated when there is reduced read-depth and/or artefacts which result in chimeric assemblies, both of which are especially prominent in the assembly of metagenomic datasets. Many of these limitations could potentially be overcome by exploiting the information content stored in the reads directly and thus eliminating the need for assembly in a number of situations. Results: We explored the prediction of coding potential of DNA reads by training a machine learning model on existing protein sequences. Named 'FrameRate', this model can predict the coding frame(s) from unassembled DNA sequencing reads directly, thus greatly reducing the computational resources required for genome assembly and similarity-based inference to pre-computed databases. Using the eggNOG-mapper function annotation tool, the predicted coding frames from FrameRate were functionally verified by comparing to the results from full-length protein sequences reconstructed with an established metagenome assembly and gene prediction pipeline from the same metagenomic sample. FrameRate captured equivalent functional profiles from the coding frames while reducing the required storage and time resources significantly. FrameRate was also able to annotate reads that were not represented in the assembly, capturing this 'missing' information. As an ultra-fast read-level assembly-free coding profiler, FrameRate enables rapid characterisation of almost every sequencing read directly, whether it can be assembled or not, and thus circumvents many of the problems caused by contemporary assembly workflows.
-
FALCON: Sound and Complete Neural Semantic Entailment over ALC Ontologies(arXiv, 2022-08-16) [Preprint]Many ontologies, i.e., Description Logic (DL) knowledge bases, have been developed to provide rich knowledge about various domains, and a lot of them are based on ALC, i.e., a prototypical and expressive DL, or its extensions. The main task that explores ALC ontologies is to compute semantic entailment. Symbolic approaches can guarantee sound and complete semantic entailment but are sensitive to inconsistency and missing information. To this end, we propose FALCON, a Fuzzy ALC Ontology Neural reasoner. FALCON uses fuzzy logic operators to generate single model structures for arbitrary ALC ontologies, and uses multiple model structures to compute semantic entailments. Theoretical results demonstrate that FALCON is guaranteed to be a sound and complete algorithm for computing semantic entailments over ALC ontologies. Experimental results show that FALCON enables not only approximate reasoning (reasoning over incomplete ontologies) and paraconsistent reasoning (reasoning over inconsistent ontologies), but also improves machine learning in the biomedical domain by incorporating background knowledge from ALC ontologies.
-
Positive-Unlabeled Learning with Adversarial Data Augmentation for Knowledge Graph Completion(International Joint Conferences on Artificial Intelligence Organization, 2022-07) [Conference Paper]Most real-world knowledge graphs (KG) are far from complete and comprehensive. This problem has motivated efforts in predicting the most plausible missing facts to complete a given KG, i.e., knowledge graph completion (KGC). However, existing KGC methods suffer from two main issues: 1) the false negative issue, i.e., the sampled negative training instances may include potential true facts; and 2) the data sparsity issue, i.e., true facts account for only a tiny part of all possible facts. To this end, we propose positive-unlabeled learning with adversarial data augmentation (PUDA) for KGC. In particular, PUDA tailors a positive-unlabeled risk estimator for the KGC task to deal with the false negative issue. Furthermore, to address the data sparsity issue, PUDA achieves a data augmentation strategy by unifying adversarial training and positive-unlabeled learning under the positive-unlabeled minimax game. Extensive experimental results on real-world benchmark datasets demonstrate the effectiveness and compatibility of our proposed method.
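To make the idea of a positive-unlabeled risk estimator concrete, the following is a minimal sketch of the generic non-negative PU risk from the PU-learning literature. It is not the PUDA implementation: the function names, the sigmoid loss, and the class prior `prior` are assumptions chosen for illustration, and the abstract does not state which estimator variant PUDA tailors.

```python
import math
from statistics import mean
from typing import Sequence


def sigmoid_loss(score: float, label: int) -> float:
    """Sigmoid surrogate loss; label is +1 (true fact) or -1 (false fact)."""
    return 1.0 / (1.0 + math.exp(label * score))


def non_negative_pu_risk(
    pos_scores: Sequence[float],
    unl_scores: Sequence[float],
    prior: float,
) -> float:
    """Generic non-negative PU risk (illustrative, not PUDA's exact form).

    pos_scores: model scores for known true triples.
    unl_scores: model scores for sampled (unlabeled) triples, which may
                contain true facts, i.e. potential false negatives.
    prior:      assumed fraction of true facts among unlabeled triples.
    """
    risk_pos_as_pos = mean(sigmoid_loss(s, +1) for s in pos_scores)
    risk_pos_as_neg = mean(sigmoid_loss(s, -1) for s in pos_scores)
    risk_unl_as_neg = mean(sigmoid_loss(s, -1) for s in unl_scores)
    # Subtract the positive share hidden in the unlabeled set and clamp at
    # zero so the estimated negative-class risk cannot become negative.
    negative_part = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + negative_part
```

The design point is that unlabeled triples are never treated as certain negatives: their loss is corrected by the expected contribution of hidden positives.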
-
How much do model organism phenotypes contribute to the computational identification of human disease genes?(Disease models & mechanisms, Cold Spring Harbor Laboratory, 2022-06-27) [Article]Computing phenotypic similarity has been shown to be useful in identification of new disease genes and for rare disease diagnostic support. Genotype-phenotype data from orthologous genes in model organisms can compensate for lack of human data to greatly increase genome coverage. Work over the past decade has demonstrated the power of cross-species phenotype comparisons, and several cross-species phenotype ontologies have been developed for this purpose. The relative contribution of different model organisms to computational identification of disease-associated genes is not yet fully explored. We use methods based on phenotype ontologies to semantically relate phenotypes resulting from loss-of-function mutations in different model organisms to disease-associated phenotypes in humans. Semantic machine learning methods are used to measure how much different model organisms contribute to the identification of known human gene-disease associations. We find that mouse genotype-phenotype data is the most important dataset in the identification of human disease genes by semantic similarity and machine learning over phenotype ontologies. Data from other model organisms does not improve identification over that obtained by using the mouse alone, and therefore does not contribute significantly to this task. Our work has implications for the future development of integrated phenotype ontologies, as well as for the use of model organism phenotypes in human genetic variant interpretation.
-
DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms(Bioinformatics (Oxford, England), Oxford University Press (OUP), 2022-06-27) [Article]Motivation: Protein functions are often described using the Gene Ontology (GO) which is an ontology consisting of over 50 000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted.
-
Joint Abductive and Inductive Neural Logical Reasoning(arXiv, 2022-05-29) [Preprint]Neural logical reasoning (NLR) is a fundamental task in knowledge discovery and artificial intelligence. NLR aims at answering multi-hop queries with logical operations on structured knowledge bases based on distributed representations of queries and answers. While previous neural logical reasoners can give specific entity-level answers, i.e., perform inductive reasoning from the perspective of logic theory, they are not able to provide descriptive concept-level answers, i.e., perform abductive reasoning, where each concept is a summary of a set of entities. In particular, the abductive reasoning task attempts to infer the explanations of each query with descriptive concepts, which makes answers comprehensible to users and is highly useful in the field of applied ontology. In this work, we formulate the problem of joint abductive and inductive neural logical reasoning (AI-NLR), which requires addressing challenges in incorporating, representing, and operating on concepts. We propose an original solution named ABIN for AI-NLR. Firstly, we incorporate description logic-based ontological axioms to provide the source of concepts. Then, we represent concepts and queries as fuzzy sets, i.e., sets whose elements have degrees of membership, to bridge concepts and queries with entities. Moreover, we design operators involving concepts on top of the fuzzy set representation of concepts and queries for optimization and inference. Extensive experimental results on two real-world datasets demonstrate the effectiveness of ABIN for AI-NLR.
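The fuzzy set representation mentioned in this abstract can be illustrated with a small sketch. It assumes the standard min/max/complement operators; the abstract does not say which operators ABIN uses, and the entity names and helper functions below are hypothetical.

```python
from typing import Dict, Iterable

FuzzySet = Dict[str, float]  # entity -> degree of membership in [0, 1]


def fuzzy_and(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Intersection: an entity belongs only as much as its weaker membership."""
    return {e: min(a.get(e, 0.0), b.get(e, 0.0)) for e in set(a) | set(b)}


def fuzzy_or(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Union: an entity belongs as much as its stronger membership."""
    return {e: max(a.get(e, 0.0), b.get(e, 0.0)) for e in set(a) | set(b)}


def fuzzy_not(a: FuzzySet, universe: Iterable[str]) -> FuzzySet:
    """Complement relative to an explicit universe of entities."""
    return {e: 1.0 - a.get(e, 0.0) for e in universe}


# Hypothetical memberships of three proteins in two concepts.
kinase = {"p1": 0.9, "p2": 0.2}
membrane = {"p1": 0.5, "p3": 0.7}
both = fuzzy_and(kinase, membrane)
```

Because a concept and a query are both maps from entities to membership degrees, the same operators can combine either, which is what lets concept-level and entity-level answers share one representation.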
-
Combining biomedical knowledge graphs and text to improve predictions for drug-target interactions and drug-indications(PeerJ, PeerJ, 2022-04-04) [Article]Biomedical knowledge is represented in structured databases and published in biomedical literature, and different computational approaches have been developed to exploit each type of information in predictive models. However, the information in structured databases and literature is often complementary. We developed a machine learning method that combines information from literature and databases to predict drug targets and indications. To effectively utilize information in published literature, we integrate knowledge graphs and published literature using named entity recognition and normalization before applying a machine learning model that utilizes the combination of graph and literature. We then use supervised machine learning to show the effects of combining features from biomedical knowledge and published literature on the prediction of drug targets and drug indications. We demonstrate that our approach using datasets for drug-target interactions and drug indications is scalable to large graphs and can be used to improve the ranking of targets and indications compared with exploiting features from either structured or unstructured information alone.
-
Description Logic EL++ Embeddings with Intersectional Closure(arXiv, 2022-02-28) [Preprint]Many ontologies, in particular in the biomedical domain, are based on the Description Logic EL++. Several efforts have been made to interpret and exploit EL++ ontologies by distributed representation learning. Specifically, concepts within EL++ theories have been represented as n-balls within an n-dimensional embedding space. However, the intersectional closure is not satisfied when using n-balls to represent concepts because the intersection of two n-balls is not an n-ball. This leads to challenges when measuring the distance between concepts and inferring equivalence between concepts. To this end, we developed EL Box Embedding (ELBE) to learn Description Logic EL++ embeddings using axis-parallel boxes. We generate specially designed box-based geometric constraints from EL++ axioms for model training. Since the intersection of boxes remains as a box, the intersectional closure is satisfied. We report extensive experimental results on three datasets and present a case study to demonstrate the effectiveness of the proposed method.
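The closure property that motivates boxes can be sketched in a few lines. This is only an illustration of axis-parallel box intersection, not the ELBE training code; the corner-pair representation and the function name are assumptions.

```python
from typing import Optional, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner), axis-parallel


def intersect_boxes(a: Box, b: Box) -> Optional[Box]:
    """Return the intersection of two axis-parallel boxes, or None if empty.

    The result is again described by a lower and an upper corner, so it is
    itself a box. No analogous construction exists for n-balls, whose
    intersection is in general lens-shaped rather than a ball.
    """
    lower = tuple(max(x, y) for x, y in zip(a[0], b[0]))
    upper = tuple(min(x, y) for x, y in zip(a[1], b[1]))
    if any(lo > hi for lo, hi in zip(lower, upper)):
        return None  # the boxes do not overlap in at least one dimension
    return (lower, upper)


overlap = intersect_boxes(((0.0, 0.0), (2.0, 2.0)), ((1.0, 1.0), (3.0, 3.0)))
```

In an embedding setting, this means the conjunction of two concepts can be represented with the same kind of object as the concepts themselves.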
-
Combining biomedical knowledge graphs and text to improve predictions for drug-target interactions and drug-indications.(Zenodo, 2022-02-18) [Dataset]The datasets used in the publication titled "Combining biomedical knowledge graphs and text to improve predictions for drug-target interactions and drug-indications".
-
Evaluating semantic similarity methods for comparison of text-derived phenotype profiles.(BMC medical informatics and decision making, Springer Science and Business Media LLC, 2022-02-06) [Article]Background: Semantic similarity is a valuable tool for analysis in biomedicine. When applied to phenotype profiles derived from clinical text, it has the capacity to enable and enhance 'patient-like me' analyses, automated coding, differential diagnosis, and outcome prediction. While a large body of work exists exploring the use of semantic similarity for multiple tasks, including protein interaction prediction and rare disease differential diagnosis, there is less work exploring comparison of patient phenotype profiles for clinical tasks. Moreover, there are no experimental explorations of optimal parameters or better methods in the area. Methods: We develop a platform for reproducible benchmarking and comparison of experimental conditions for patient phenotype similarity. Using the platform, we evaluate the task of ranking shared primary diagnosis from uncurated phenotype profiles derived from all text narrative associated with admissions in the Medical Information Mart for Intensive Care (MIMIC-III). Results: 300 semantic similarity configurations were evaluated, as well as one embedding-based approach. On average, measures that did not make use of an external information content measure performed slightly better; however, the best-performing configurations when measured by area under receiver operating characteristic curve and Top Ten Accuracy used term-specificity and annotation-frequency measures. Conclusion: We identified and interpreted the performance of a large number of semantic similarity configurations for the task of classifying diagnosis from text-derived phenotype profiles in one setting. We also provided a basis for further research on other settings and related tasks in the area.
-
Using SPARQL to unify queries over data, ontologies, and machine learning models in the PhenomeBrowser knowledgebase(CEUR-WS, 2022-01-01) [Conference Paper]We have developed the PhenomeBrowser knowledge base to integrate phenotype associations from a variety of sources into a single knowledge base. We use SPARQL as a unifying query language to access RDF data, perform Description Logic queries over ontologies, and compute the semantic similarity between entities in the knowledge base.
-
A-LIOn - Alignment Learning through Inconsistency negatives of the aligned Ontologies(CEUR-WS, 2022-01-01) [Conference Paper]Ontologies play an important role in sharing and reusing knowledge. Several ontologies have been developed to describe a particular domain but from the different perspectives of communities of developers and users. This has led to the existence of multiple ontologies covering the same or a different domain with varying degrees of variability. Ontology Alignment is typically used to identify correspondences between semantically related elements of two or more ontologies in order to address this problem. We propose A-LIOn, a system that learns alignments by combining lexical and semantic approaches as well as machine learning. The system utilizes OWL EL reasoning for negative sampling, which is iteratively used to inform the correction of the learning of the alignments. We demonstrate that A-LIOn produces alignments that are coherent with respect to OWL EL.