Ontology-based Representations of Biological Entities
Abstract
Biomedical ontologies are widely used as a way to formally structure and represent knowledge in the biomedical field. Ontologies describe biological concepts and their relations through logical axioms and annotation properties (meta-data).
The structure and information contained in biomedical ontologies and their annotations make them valuable for data analysis and knowledge extraction tasks. Despite being a rich source of biomedical information, ontologies are underexploited by ontology-based analysis methods such as semantic similarity measures, which use only a limited part of the information in the ontologies.
We propose two methods, Onto2Vec and OPA2Vec, that generate vector representations of biological entities by encoding most of the information in ontologies and their annotations.
The first method, Onto2Vec, learns dense vector representations of biological entities from logical axioms and ontology-based annotations of those entities:
Fig 1. Onto2Vec workflow
Onto2Vec learns the vector representations in three steps:
• Inferring new axioms using a semantic reasoner.
• Representing entity-concept associations as axioms and merging them with the ontology axioms in the corpus.
• Training Word2Vec on the ontology corpus.
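The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers (GO_A, PROT_1, ...) are made up, the relation name hasFunction is a placeholder for whatever property links an entity to its ontology class, and a naive transitive closure over subclass axioms stands in for a real semantic reasoner. The output is a list of token "sentences" of the kind a Word2Vec trainer would consume in the third step, which is not shown here.

```python
from itertools import product

# Toy subclass axioms as (child, parent) pairs; identifiers are made up.
subclass_axioms = {
    ("GO_C", "GO_B"),
    ("GO_B", "GO_A"),
}

# Toy entity-concept associations as (protein, class) pairs; also made up.
annotations = {("PROT_1", "GO_C"), ("PROT_2", "GO_B")}


def transitive_closure(pairs):
    """Naive stand-in for a reasoner: add (a, d) whenever (a, b) and (b, d) hold."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c}
        if new <= closure:
            return closure
        closure |= new


def build_corpus(subclass_pairs, entity_annotations):
    """Return one token list per axiom: asserted, inferred, and annotation axioms."""
    inferred = transitive_closure(subclass_pairs)
    sentences = [[child, "SubClassOf", parent] for child, parent in sorted(inferred)]
    # "hasFunction" is a hypothetical relation name used for illustration.
    sentences += [[entity, "hasFunction", cls] for entity, cls in sorted(entity_annotations)]
    return sentences


corpus = build_corpus(subclass_axioms, annotations)
for sentence in corpus:
    print(" ".join(sentence))
```

Because entity annotations and ontology axioms share class identifiers as tokens, an embedding model trained on this corpus sees proteins and classes in overlapping contexts.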
In addition to formal axioms, ontologies encode rich natural-language meta-data describing different aspects of the biological concepts (e.g. labels, descriptions, …).
This meta-data is completely unexploited by data analysis methods that use ontologies.
OPA2Vec generates vector representations of biological entities by:
• Combining formal ontology axioms with the ontology meta-data.
• Pre-training Word2Vec on PubMed to provide background knowledge about the words and concepts used in the ontology annotation properties.
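As a rough sketch of the first point, the snippet below merges axiom sentences with sentences built from annotation properties. Everything in it is an assumption made for illustration: the identifiers, the property names (rdfs:label, definition), the example texts, and the simple tokenizer. Pre-training on the literature is not shown; the point is only that the words in labels and definitions become ordinary tokens that a pre-trained model may already know.

```python
import re

# Hypothetical axioms already rendered as token lists.
axiom_sentences = [
    ["GO_B", "SubClassOf", "GO_A"],
    ["PROT_1", "hasFunction", "GO_B"],
]

# Hypothetical annotation properties: (class, property, natural-language value).
metadata = [
    ("GO_A", "rdfs:label", "Cell death"),
    ("GO_B", "rdfs:label", "Apoptotic process"),
    ("GO_B", "definition", "A programmed cell death process."),
]


def tokenize(text):
    """Lowercase free text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def metadata_sentences(records):
    """One sentence per annotation: class ID, property name, then the words of the value."""
    return [[cls, prop] + tokenize(value) for cls, prop, value in records]


# The combined corpus mixes formal axioms with natural-language meta-data.
corpus = axiom_sentences + metadata_sentences(metadata)
vocabulary = sorted({token for sentence in corpus for token in sentence})

for sentence in corpus:
    print(" ".join(sentence))
```

Keeping the class identifier as the first token of each meta-data sentence ties the natural-language words to the same symbol that appears in the axioms.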
1. Protein interaction prediction using Onto2Vec:
- We apply Onto2Vec to the Gene Ontology (GO) and produce protein vector representations.
- The obtained vectors are then used to predict protein interactions in human and yeast, both directly through cosine similarity and as input to a neural network, and the results are compared to Resnik semantic similarity:
Fig 3. ROC curves for PPI prediction using Onto2Vec (panels: human, yeast)
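The cosine-similarity variant of this prediction can be illustrated with a minimal sketch. The three-dimensional vectors below are invented; real vectors would come from the trained embedding model, and the neural-network variant is not shown. The helper names (cosine_similarity, rank_partners) are hypothetical.

```python
import math

# Made-up 3-dimensional protein vectors, for illustration only.
protein_vectors = {
    "PROT_1": [0.9, 0.1, 0.0],
    "PROT_2": [0.8, 0.2, 0.1],
    "PROT_3": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_partners(query, vectors):
    """Rank all other proteins by similarity to the query, most similar first."""
    scores = [
        (other, cosine_similarity(vectors[query], vec))
        for other, vec in vectors.items()
        if other != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


for partner, score in rank_partners("PROT_1", protein_vectors):
    print(f"{partner}\t{score:.3f}")
```

A higher score is read as stronger evidence for an interaction, so the ranked list can be evaluated against known interactions.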
2. Enzyme visualization using Onto2Vec:
- The vectors obtained through Onto2Vec can also be used for clustering and for identifying entities within the same functional group.
- As an example, we visualize the vector representations of 10,000 enzymes labelled by their first-level EC category:
Fig 4. t-SNE visualization of 10,000 enzymes using Onto2Vec
Protein interaction prediction using OPA2Vec:
- To evaluate OPA2Vec, we also apply it to the Gene Ontology and protein-GO annotations to produce vector representations of proteins.
- To make better use of the rich meta-data available in GO in the form of labels, descriptions, synonyms, etc., we pre-train Word2Vec on Medline and PMC and use the pre-trained models to produce the protein vectors. The resulting vectors are then used to predict protein interactions, and the results are compared to Onto2Vec and Resnik:
Fig 5. AUC values for PPI prediction using OPA2Vec.
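For readers unfamiliar with the metric, the AUC reported in such comparisons can be computed from labels and prediction scores as the fraction of positive-negative pairs in which the positive example receives the higher score. The sketch below uses invented labels and scores, not results from this work, and the function name roc_auc is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = known interaction, 0 = no known interaction.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]

print(f"AUC = {roc_auc(labels, scores):.3f}")
```

The pairwise form is quadratic in the number of examples, which is fine for a small illustration; a rank-based computation would be preferable for large datasets.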
Gene-disease association prediction using OPA2Vec:
- As an additional evaluation, we applied OPA2Vec to PhenomeNet, jointly with the known gene-phenotype and disease-phenotype associations, to obtain vector representations of genes and diseases.
- The obtained vectors are then used to predict gene-disease associations on human and mouse datasets:
Fig 6. ROC curves for gene-disease association prediction using OPA2Vec (panels: human, mouse)
We have developed two methods, Onto2Vec and OPA2Vec, that produce vector representations of biological entities from ontologies and their annotations, making use of most of the information encoded in ontology axioms and meta-data.
Our workflow is quite generic and can be applied to a wide range of ontology-based analysis tasks.