Recent Submissions

  • Positive-Unlabeled Learning with Adversarial Data Augmentation for Knowledge Graph Completion

    Tang, Zhenwei; Pei, Shichao; Zhang, Zhao; Zhu, Yongchun; Zhuang, Fuzhen; Hoehndorf, Robert; Zhang, Xiangliang (International Joint Conferences on Artificial Intelligence Organization, 2022-07) [Conference Paper]
    Most real-world knowledge graphs (KG) are far from complete and comprehensive. This problem has motivated efforts in predicting the most plausible missing facts to complete a given KG, i.e., knowledge graph completion (KGC). However, existing KGC methods suffer from two main issues: 1) the false negative issue, i.e., the sampled negative training instances may include potential true facts; and 2) the data sparsity issue, i.e., true facts account for only a tiny part of all possible facts. To this end, we propose positive-unlabeled learning with adversarial data augmentation (PUDA) for KGC. In particular, PUDA tailors a positive-unlabeled risk estimator for the KGC task to deal with the false negative issue. Furthermore, to address the data sparsity issue, PUDA achieves a data augmentation strategy by unifying adversarial training and positive-unlabeled learning under the positive-unlabeled minimax game. Extensive experimental results on real-world benchmark datasets demonstrate the effectiveness and compatibility of our proposed method.
  • How much do model organism phenotypes contribute to the computational identification of human disease genes?

    Alghamdi, Sarah M.; Schofield, Paul N.; Hoehndorf, Robert (Disease models & mechanisms, Cold Spring Harbor Laboratory, 2022-06-27) [Article]
    Computing phenotypic similarity has been shown to be useful in identification of new disease genes and for rare disease diagnostic support. Genotype-phenotype data from orthologous genes in model organisms can compensate for lack of human data to greatly increase genome coverage. Work over the past decade has demonstrated the power of cross-species phenotype comparisons, and several cross-species phenotype ontologies have been developed for this purpose. The relative contribution of different model organisms to computational identification of disease-associated genes is not yet fully explored. We use methods based on phenotype ontologies to semantically relate phenotypes resulting from loss-of-function mutations in different model organisms to disease-associated phenotypes in humans. Semantic machine learning methods are used to measure how much different model organisms contribute to the identification of known human gene-disease associations. We find that mouse genotype-phenotype data is the most important dataset in the identification of human disease genes by semantic similarity and machine learning over phenotype ontologies. Data from other model organisms does not improve identification over that obtained by using the mouse alone, and therefore does not contribute significantly to this task. Our work has implications for the future development of integrated phenotype ontologies, as well as for the use of model organism phenotypes in human genetic variant interpretation.
  • DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms

    Kulmanov, Maxat; Hoehndorf, Robert (Bioinformatics (Oxford, England), Oxford University Press (OUP), 2022-06-27) [Article]
    Motivation: Protein functions are often described using the Gene Ontology (GO) which is an ontology consisting of over 50 000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted.
  • Joint Abductive and Inductive Neural Logical Reasoning

    Tang, Zhenwei; Pei, Shichao; Peng, Xi; Zhuang, Fuzhen; Zhang, Xiangliang; Hoehndorf, Robert (arXiv, 2022-05-29) [Preprint]
    Neural logical reasoning (NLR) is a fundamental task in knowledge discovery and artificial intelligence. NLR aims at answering multi-hop queries with logical operations on structured knowledge bases based on distributed representations of queries and answers. While previous neural logical reasoners can give specific entity-level answers, i.e., perform inductive reasoning from the perspective of logic theory, they are not able to provide descriptive concept-level answers, i.e., perform abductive reasoning, where each concept is a summary of a set of entities. In particular, the abductive reasoning task attempts to infer the explanations of each query with descriptive concepts, which makes answers comprehensible to users and is highly useful in the field of applied ontology. In this work, we formulate the problem of joint abductive and inductive neural logical reasoning (AI-NLR), whose solution requires addressing challenges in incorporating, representing, and operating on concepts. We propose an original solution named ABIN for AI-NLR. First, we incorporate description logic-based ontological axioms to provide the source of concepts. Then, we represent concepts and queries as fuzzy sets, i.e., sets whose elements have degrees of membership, to bridge concepts and queries with entities. Moreover, we design operators involving concepts on top of the fuzzy set representation of concepts and queries for optimization and inference. Extensive experimental results on two real-world datasets demonstrate the effectiveness of ABIN for AI-NLR.
  • Combining biomedical knowledge graphs and text to improve predictions for drug-target interactions and drug-indications

    Alshahrani, Mona; Almansour, Abdullah; Alkhaldi, Asma; Thafar, Maha A.; Uludag, Mahmut; Essack, Magbubah; Hoehndorf, Robert (PeerJ, PeerJ, 2022-04-04) [Article]
    Biomedical knowledge is represented in structured databases and published in biomedical literature, and different computational approaches have been developed to exploit each type of information in predictive models. However, the information in structured databases and literature is often complementary. We developed a machine learning method that combines information from literature and databases to predict drug targets and indications. To effectively utilize information in published literature, we integrate knowledge graphs and published literature using named entity recognition and normalization before applying a machine learning model that utilizes the combination of graph and literature. We then use supervised machine learning to show the effects of combining features from biomedical knowledge and published literature on the prediction of drug targets and drug indications. We demonstrate that our approach using datasets for drug-target interactions and drug indications is scalable to large graphs and can be used to improve the ranking of targets and indications by exploiting features from either structured or unstructured information alone.
  • Description Logic EL++ Embeddings with Intersectional Closure

    Peng, Xi; Tang, Zhenwei; Kulmanov, Maxat; Niu, Kexin; Hoehndorf, Robert (arXiv, 2022-02-28) [Preprint]
    Many ontologies, in particular in the biomedical domain, are based on the Description Logic EL++. Several efforts have been made to interpret and exploit EL++ ontologies by distributed representation learning. Specifically, concepts within EL++ theories have been represented as n-balls within an n-dimensional embedding space. However, the intersectional closure is not satisfied when using n-balls to represent concepts because the intersection of two n-balls is not an n-ball. This leads to challenges when measuring the distance between concepts and inferring equivalence between concepts. To this end, we developed EL Box Embedding (ELBE) to learn Description Logic EL++ embeddings using axis-parallel boxes. We generate specially designed box-based geometric constraints from EL++ axioms for model training. Since the intersection of boxes remains as a box, the intersectional closure is satisfied. We report extensive experimental results on three datasets and present a case study to demonstrate the effectiveness of the proposed method.
  • Using SPARQL to unify queries over data, ontologies, and machine learning models in the PhenomeBrowser knowledgebase

    Syed, Ali Raza; Kafkas, Senay; Kulmanov, Maxat; Hoehndorf, Robert (CEUR-WS, 2022-01-01) [Conference Paper]
    We have developed the PhenomeBrowser knowledge base to integrate phenotype associations from a variety of sources into a single knowledge base. We use SPARQL as a unifying query language to access RDF data, perform Description Logic queries over ontologies, and compute the semantic similarity between entities in the knowledge base.
  • DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning

    Althagafi, Azza Th.; Alsubaie, Lamia; Kathiresan, Nagarajan; Mineta, Katsuhiko; Aloraini, Taghrid; Almutairi, Fuad; Alfadhel, Majid; Gojobori, Takashi; Alfares, Ahmad; Hoehndorf, Robert (Bioinformatics, Oxford University Press (OUP), 2021-12-24) [Article]
    Motivation: Structural genomic variants account for much of human variability and are involved in several diseases. Structural variants are complex and may affect coding regions of multiple genes, or affect the functions of genomic regions in different ways from single nucleotide variants. Interpreting the phenotypic consequences of structural variants relies on information about gene functions, haploinsufficiency or triplosensitivity, and other genomic features. Phenotype-based methods for identifying variants that are involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been applied successfully to single nucleotide variants as well as short insertions and deletions, the complexity of structural variants makes it more challenging to link them to phenotypes. Furthermore, structural variants can affect a large number of coding regions, and phenotype information may not be available for all of them. Results: We developed DeepSVP, a computational method to prioritize structural variants involved in genetic diseases by combining genomic and gene function information. We incorporate phenotypes linked to genes, functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and systematically relate them to their phenotypic consequences through ontologies and machine learning. DeepSVP significantly improves the success rate of finding causative variants in several benchmarks and can identify novel pathogenic structural variants in consanguineous families.
  • bio-ontology-research-group/mo-phenotype-analysis: Model organism phenotypes contribution in predicting gene disease associations

    Alghamdi, Sarah M.; Schofield, Paul N.; Hoehndorf, Robert (Github, 2021-11-08) [Software]
    Model organism phenotypes contribution in predicting gene disease associations
  • Multi-faceted semantic clustering with text-derived phenotypes.

    Slater, Luke T; Williams, John A; Karwath, Andreas; Fanning, Hilary; Ball, Simon; Schofield, Paul N; Hoehndorf, Robert; Gkoutos, Georgios V (Computers in biology and medicine, Elsevier BV, 2021-09-27) [Article]
    Identification of ontology concepts in clinical narrative text enables the creation of phenotype profiles that can be associated with clinical entities, such as patients or drugs. Constructing patient phenotype profiles using formal ontologies enables their analysis via semantic similarity, in turn enabling the use of background knowledge in clustering or classification analyses. However, traditional semantic similarity approaches collapse complex relationships between patient phenotypes into a unitary similarity score for each pair of patients. Moreover, single scores may be based only on matching terms with the greatest information content (IC), ignoring other dimensions of patient similarity. This process necessarily leads to a loss of information in the resulting representation of patient similarity, and is especially apparent when using very large text-derived and highly multi-morbid phenotype profiles. Moreover, it renders finding a biological explanation for similarity very difficult; the black box problem. In this article, we explore the generation of multiple semantic similarity scores for patients based on different facets of their phenotypic manifestation, which we define through different sub-graphs in the Human Phenotype Ontology. We further present a new methodology for deriving sets of qualitative class descriptions for groups of entities described by ontology terms. Leveraging this strategy to obtain meaningful explanations for our semantic clusters alongside other evaluation techniques, we show that semantic clustering with ontology-derived facets enables the representation, and thus identification of, clinically relevant phenotype relationships not easily recoverable using overall clustering alone. In this way, we demonstrate the potential of faceted semantic clustering for gaining a deeper and more nuanced understanding of text-derived patient phenotypes.
  • Linking common human diseases to their phenotypes; development of a resource for human phenomics

    Kafkas, Senay; Althubaiti, Sara; Gkoutos, Georgios V.; Hoehndorf, Robert; Schofield, Paul N. (Journal of Biomedical Semantics, Springer Science and Business Media LLC, 2021-08-23) [Article]
    Background: In recent years a large volume of clinical genomics data has become available due to rapid advances in sequencing technologies. Efficient exploitation of this genomics data requires linkage to patient phenotype profiles. Current resources providing disease-phenotype associations are not comprehensive, and they often do not have broad coverage of the disease terminologies, particularly ICD-10, which is still the primary terminology used in clinical settings. Methods: We developed two approaches to gather disease-phenotype associations. First, we used a text mining method that utilizes semantic relations in phenotype ontologies, and applies statistical methods to extract associations between diseases in ICD-10 and phenotype ontology classes from the literature. Second, we developed a semi-automatic way to collect ICD-10–phenotype associations from existing resources containing known relationships. Results: We generated four datasets. Two of them are independent datasets linking diseases to their phenotypes based on text mining and semi-automatic strategies. The remaining two datasets are generated from these datasets and cover a subset of ICD-10 classes of common diseases contained in UK Biobank. We extensively validated our text mined and semi-automatically curated datasets by: comparing them against an expert-curated validation dataset containing disease–phenotype associations, measuring their similarity to disease–phenotype associations found in public databases, and assessing how well they could be used to recover gene–disease associations using phenotype similarity. Conclusion: We find that our text mining method can produce phenotype annotations of diseases that are correct but often too general to have significant information content, or too specific to accurately reflect the typical manifestations of the sporadic disease. On the other hand, the datasets generated from integrating multiple knowledgebases are more complete (i.e., cover more of the required phenotype annotations for a given disease). We make all data freely available at 10.5281/zenodo.4726713.
  • The COVID-19 epidemiology and monitoring ontology

    Queralt-Rosinach, Núria; Schofield, Paul N.; Hoehndorf, Robert; Weiland, Claus; Schultes, Erik Anthony; Bernabé, César Henrique; Roos, Marco (Center for Open Science, 2021-08-11) [Preprint]
    The novel COVID-19 infectious disease emerged and spread, causing high mortality and morbidity rates worldwide. In the OBO Foundry, there are more than one hundred ontologies to share and analyse large-scale datasets for biological and biomedical sciences. However, this pandemic revealed that we lack tools for an efficient and timely exchange of this epidemiological data which is necessary to assess the impact of disease outbreaks, the efficacy of mitigating interventions and to provide a rapid response. In this study we present our findings and contributions for the bio-ontologies community.
  • DTI-Voodoo: machine learning over interaction networks and ontology-based background knowledge predicts drug–target interactions

    Hinnerichs, Tilman; Hoehndorf, Robert (Bioinformatics, Oxford University Press (OUP), 2021-07-28) [Article]
    Motivation: In silico drug–target interaction (DTI) prediction is important for drug discovery and drug repurposing. Approaches to predict DTIs can proceed indirectly, top-down, using phenotypic effects of drugs to identify potential drug targets, or they can be direct, bottom-up and use molecular information to directly predict binding affinities. Both approaches can be combined with information about interaction networks. Results: We developed DTI-Voodoo as a computational method that combines molecular features and ontology-encoded phenotypic effects of drugs with protein–protein interaction networks, and uses a graph convolutional neural network to predict DTIs. We demonstrate that drug effect features can exploit information in the interaction network whereas molecular features do not. DTI-Voodoo is designed to predict candidate drugs for a given protein; we use this formulation to show that common DTI datasets contain intrinsic biases with major effects on performance evaluation and comparison of DTI prediction methods. Using a modified evaluation scheme, we demonstrate that DTI-Voodoo improves significantly over state-of-the-art DTI prediction methods.
  • bio-ontology-research-group/deepgozero: DeepGO with Fuzzy DL

    Kulmanov, Maxat; Hoehndorf, Robert (Github, 2021-07-05) [Software]
    DeepGO with Fuzzy DL
  • DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web

    Kulmanov, Maxat; Zhapa-Camacho, Fernando; Hoehndorf, Robert (Nucleic Acids Research, Oxford University Press (OUP), 2021-05-21) [Article]
    Understanding the functions of proteins is crucial to understanding biological processes on a molecular level. Many more protein sequences are available than can be investigated experimentally. DeepGOPlus is a protein function prediction method based on deep learning and sequence similarity. DeepGOWeb makes the prediction model available through a website, an API, and through the SPARQL query language for interoperability with databases that rely on Semantic Web technologies. DeepGOWeb provides accurate and fast predictions and ensures that predicted functions are consistent with the Gene Ontology; it can provide predictions for any protein and any function in Gene Ontology. DeepGOWeb is freely available at
  • Towards Similarity-based Differential Diagnostics For Common Diseases

    Slater, Luke T; Karwath, Andreas; Williams, John A.; Russell, Sophie; Makepeace, Silver; Carberry, Alexander; Hoehndorf, Robert; Gkoutos, Georgios (Computers in Biology and Medicine, Elsevier BV, 2021-04-01) [Article]
    Ontology-based phenotype profiles have been utilised for the purpose of differential diagnosis of rare genetic diseases, and for decision support in specific disease domains. Particularly, semantic similarity facilitates diagnostic hypothesis generation through comparison with disease phenotype profiles. However, the approach has not been applied for differential diagnosis of common diseases, or generalised clinical diagnostics from uncurated text-derived phenotypes. In this work, we describe the development of an approach for deriving patient phenotype profiles from clinical narrative text, and apply this to text associated with MIMIC-III patient visits. We then explore the use of semantic similarity with those text-derived phenotypes to classify primary patient diagnosis, comparing the use of patient-patient similarity and patient-disease similarity using phenotype-disease profiles previously mined from literature. We also consider a combined approach, in which literature-derived phenotypes are extended with the content of text-derived phenotypes we mined from 500 patients. The results reveal a powerful approach, showing that in one setting, uncurated text phenotypes can be used for differential diagnosis of common diseases, making use of information both inside and outside the setting. While the methods themselves should be explored for further optimisation, they could be applied to a variety of clinical tasks, such as differential diagnosis, cohort discovery, document and text classification, and outcome prediction.
  • DeepMOCCA: A pan-cancer prognostic model identifies personalized prognostic markers through graph attention and multi-omics data integration

    Althubaiti, Sara; Kulmanov, Maxat; Liu, Yang; Gkoutos, Georgios; Schofield, Paul N.; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2021-03-03) [Preprint]
    Combining multiple types of genomic, transcriptional, proteomic, and epigenetic datasets has the potential to reveal biological mechanisms across multiple scales, and may lead to more accurate models for clinical decision support. Developing efficient models that can derive clinical outcomes from high-dimensional data remains problematical; challenges include the integration of multiple types of omics data, inclusion of biological background knowledge, and developing machine learning models that are able to deal with this high dimensionality while having only few samples from which to derive a model. We developed DeepMOCCA, a framework for multi-omics cancer analysis. We combine different types of omics data using biological relations between genes, transcripts, and proteins, combine the multi-omics data with background knowledge in the form of protein-protein interaction networks, and use graph convolution neural networks to exploit this combination of multi-omics data and background knowledge. DeepMOCCA predicts survival time for individual patient samples for 33 cancer types and outperforms most existing survival prediction methods. Moreover, DeepMOCCA includes a graph attention mechanism which prioritizes driver genes and prognostic markers in a patient-specific manner; the attention mechanism can be used to identify drivers and prognostic markers within cohorts and individual patients.
  • DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes.

    Liu-Wei, Wang; Kafkas, Senay; Chen, Jun; Dimonaco, Nicholas J; Tegner, Jesper; Hoehndorf, Robert (Bioinformatics, Oxford University Press (OUP), 2021-03-03) [Article]
    Motivation: Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., signs and symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. Results: We developed DeepViral, a deep learning based method that predicts protein-protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. Availability: Code and datasets for reproduction and customization are available at Prediction results for 14 virus families are available at
  • reality/mimpred:

    Slater, Luke T; Russell, Sophie; Makepeace, Silver; Carberry, Alexander; Karwath, Andreas; Williams, John A; Fanning, Hilary; Ball, Simon; Hoehndorf, Robert; Gkoutos, Georgios (Github, 2020-12-01) [Software]
  • NuriaQueralt/covid19-epidemiology-ontology: Epidemiology and monitoring ontology for COVID-19

    Queralt-Rosinach, Núria; Schofield, Paul N.; Hoehndorf, Robert; Weiland, Claus; Schultes, Erik Anthony; Bernabé, César Henrique; Roos, Marco (Github, 2020-11-09) [Software]
    Epidemiology and monitoring ontology for COVID-19
