For more information visit: https://cemse.kaust.edu.sa/borg

Recent Submissions

  • Towards Similarity-based Differential Diagnostics For Common Diseases

    Slater, Luke T; Karwath, Andreas; Williams, John A.; Russell, Sophie; Makepeace, Silver; Carberry, Alexander; Hoehndorf, Robert; Gkoutos, Georgios (Computers in Biology and Medicine, Elsevier BV, 2021-04-01) [Article]
    Ontology-based phenotype profiles have been utilised for the purpose of differential diagnosis of rare genetic diseases, and for decision support in specific disease domains. Particularly, semantic similarity facilitates diagnostic hypothesis generation through comparison with disease phenotype profiles. However, the approach has not been applied for differential diagnosis of common diseases, or generalised clinical diagnostics from uncurated text-derived phenotypes. In this work, we describe the development of an approach for deriving patient phenotype profiles from clinical narrative text, and apply this to text associated with MIMIC-III patient visits. We then explore the use of semantic similarity with those text-derived phenotypes to classify primary patient diagnosis, comparing the use of patient-patient similarity and patient-disease similarity using phenotype-disease profiles previously mined from literature. We also consider a combined approach, in which literature-derived phenotypes are extended with the content of text-derived phenotypes we mined from 500 patients. The results reveal a powerful approach, showing that in one setting, uncurated text phenotypes can be used for differential diagnosis of common diseases, making use of information both inside and outside the setting. While the methods themselves should be explored for further optimisation, they could be applied to a variety of clinical tasks, such as differential diagnosis, cohort discovery, document and text classification, and outcome prediction.
  • DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes.

    Liu-Wei, Wang; Kafkas, Senay; Chen, Jun; Dimonaco, Nicholas J; Tegner, Jesper; Hoehndorf, Robert (Bioinformatics (Oxford, England), Oxford University Press (OUP), 2021-03-08) [Article]
    Motivation: Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., signs and symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. Results: We developed DeepViral, a deep learning based method that predicts protein-protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. Availability: Code and datasets for reproduction and customization are available at https://github.com/bio-ontology-research-group/DeepViral. Prediction results for 14 virus families are available at https://doi.org/10.5281/zenodo.4429824.
  • DeepMOCCA: A pan-cancer prognostic model identifies personalized prognostic markers through graph attention and multi-omics data integration

    Althubaiti, Sara; Kulmanov, Maxat; Liu, Yang; Gkoutos, Georgios; Schofield, Paul N.; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2021-03-03) [Preprint]
    Combining multiple types of genomic, transcriptional, proteomic, and epigenetic datasets has the potential to reveal biological mechanisms across multiple scales, and may lead to more accurate models for clinical decision support. Developing efficient models that can derive clinical outcomes from high-dimensional data remains problematical; challenges include the integration of multiple types of omics data, inclusion of biological background knowledge, and developing machine learning models that are able to deal with this high dimensionality while having only few samples from which to derive a model. We developed DeepMOCCA, a framework for multi-omics cancer analysis. We combine different types of omics data using biological relations between genes, transcripts, and proteins, combine the multi-omics data with background knowledge in the form of protein-protein interaction networks, and use graph convolution neural networks to exploit this combination of multi-omics data and background knowledge. DeepMOCCA predicts survival time for individual patient samples for 33 cancer types and outperforms most existing survival prediction methods. Moreover, DeepMOCCA includes a graph attention mechanism which prioritizes driver genes and prognostic markers in a patient-specific manner; the attention mechanism can be used to identify drivers and prognostic markers within cohorts and individual patients.
  • DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning

    Althagafi, Azza Th.; Alsubaie, Lamia; Kathiresan, Nagarajan; Mineta, Katsuhiko; Aloraini, Taghrid; Almutairi, Fuad; Alfadhel, Majid; Gojobori, Takashi; Alfares, Ahmed; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2021-01-28) [Preprint]
    Motivation: Structural genomic variants account for much of human variability and are involved in several diseases. Structural variants are complex and may affect coding regions of multiple genes, or affect the functions of genomic regions in different ways from single nucleotide variants. Interpreting the phenotypic consequences of structural variants relies on information about gene functions, haploinsufficiency or triplosensitivity, and other genomic features. Phenotype-based methods for identifying variants that are involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been applied successfully to single nucleotide variants, as well as short insertions and deletions, the complexity of structural variants makes it more challenging to link them to phenotypes. Furthermore, structural variants can affect a large number of coding regions, and phenotype information may not be available for all of them. Results: We developed DeepSVP, a computational method to prioritize structural variants involved in genetic diseases by combining genomic information with information about gene functions. We incorporate phenotypes linked to genes, functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and systematically relate them to their phenotypic consequences through ontologies and machine learning. DeepSVP significantly improves the success rate of finding causative variants in several benchmarks and can identify novel pathogenic structural variants in consanguineous families. Availability: https://github.com/bio-ontology-research-group/DeepSVP Contact: robert.hoehndorf@kaust.edu.sa
  • Predicting Candidate Genes From Phenotypes, Functions, And Anatomical Site Of Expression.

    Chen, Jun; Althagafi, Azza Th.; Hoehndorf, Robert (Bioinformatics (Oxford, England), Oxford University Press (OUP), 2020-10-14) [Article]
    MOTIVATION: Over the past years, many computational methods have been developed to incorporate information about phenotypes for the disease gene prioritization task. These methods generally compute the similarity between a patient's phenotypes and a database of gene-phenotype associations to find the most phenotypically similar match. The main limitation of these methods is their reliance on knowledge about phenotypes associated with particular genes, which is not complete in humans as well as in many model organisms such as the mouse and fish. Information about functions of gene products and anatomical site of gene expression is available for more genes and can also be related to phenotypes through ontologies and machine learning models. RESULTS: We developed a novel graph-based machine learning method for biomedical ontologies which is able to exploit axioms in ontologies and other graph-structured data. Using our machine learning method, we embed genes based on their associated phenotypes, functions of the gene products, and anatomical location of gene expression. We then develop a machine learning model to predict gene-disease associations based on the associations between genes and multiple biomedical ontologies, and this model significantly improves over state-of-the-art methods. Furthermore, we extend phenotype-based gene prioritization methods significantly to all genes which are associated with phenotypes, functions, or site of expression. AVAILABILITY: Software and data are available at https://github.com/bio-ontology-research-group/DL2Vec.
  • Semantic similarity and machine learning with ontologies.

    Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Briefings in bioinformatics, Oxford University Press (OUP), 2020-10-13) [Article]
    Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
  • EMC10 Homozygous Variant Identified in a Family with Global Developmental Delay, Mild Intellectual Disability, and Speech Delay.

    Umair, Muhammad; Ballow, Mariam; Asiri, Abdulaziz; Alyafee, Yusra; Al Tuwaijri, Abeer; Alhamoudi, Kheloud M; Aloraini, Taghrid; Abdelhakim, Marwa; Althagafi, Azza Th.; Kafkas, Senay; Alsubaie, Lamia; Alrifai, Muhammad Talal; Hoehndorf, Robert; Alfares, Ahmed; Alfadhel, Majid (Clinical genetics, Wiley, 2020-09-15) [Article]
    In recent years, several genes have been implicated in the variable disease presentation of global developmental delay (GDD) and intellectual disability (ID). The endoplasmic reticulum membrane protein complex (EMC) family is known to be involved in GDD and ID. Homozygous variants of EMC1 are associated with GDD, scoliosis, and cerebellar atrophy, indicating the relevance of this pathway for neurogenetic disorders. EMC10 is a bone marrow-derived angiogenic growth factor that plays an important role in infarct vascularization and promoting tissue repair. However, this gene has not been previously associated with human disease. Herein, we describe a Saudi family with two individuals segregating a recessive neurodevelopmental disorder. Both of the affected individuals showed mild ID, speech delay, and GDD. Whole-exome sequencing (WES) and Sanger sequencing were performed to identify candidate genes. Further, to elucidate the functional effects of the variant, quantitative real-time PCR (RT-qPCR)-based expression analysis was performed. WES revealed a homozygous splice acceptor site variant (c.679-1G > A) in EMC10 (chromosome 19q13.33) that segregated perfectly within the family. RT-qPCR showed a substantial decrease in the relative EMC10 gene expression in the patients, indicating the pathogenicity of the identified variant. For the first time in the literature, the EMC10 gene variant was associated with mild ID, speech delay, and GDD. Thus, this gene plays a key role in developmental milestones, with the potential to cause neurodevelopmental disorders in humans.
  • Komenti: A semantic text mining framework

    Slater, Luke T; Bradlow, William; Hoehndorf, Robert; Motti, Dino FA; Ball, Simon; Gkoutos, Georgios (Cold Spring Harbor Laboratory, 2020-08-05) [Preprint]
    Komenti is a reasoner-enabled semantic query and information extraction tool. It is the only text mining tool that enables querying inferred knowledge from biomedical ontologies. It also contains multiple novel components for vocabulary construction and context disambiguation, which can improve the power of text mining and ontology-based analysis tasks, with a view towards making full use of the semantic provision of biomedical ontologies in the text extraction and characterisation space. Here, we describe Komenti, its features, and a use case wherein we automate a clinical audit process, classifying the medications of patients with hypertrophic cardiomyopathy from text records, revealing a high precision, and a subcohort of candidate patients who have atrial fibrillation but were not anti-coagulated, and are therefore at a higher risk of stroke.
  • What is the right sequencing approach? Solo VS extended family analysis in consanguineous populations.

    Alfares, Ahmed; Alsubaie, Lamia; Aloraini, Taghrid; Alaskar, Aljoharah; Althagafi, Azza Th.; Alahmad, Ahmed; Rashid, Mamoon; Alswaid, Abdulrahman; Alothaim, Ali; Eyaid, Wafaa; Ababneh, Faroug; Albalwi, Mohammed; Alotaibi, Raniah; Almutairi, Mashael; Altharawi, Nouf; Alsamer, Alhanouf; Abdelhakim, Marwa; Kafkas, Senay; Mineta, Katsuhiko; Cheung, Nicole; Abdallah, Abdallah; Büchmann-Møller, Stine; Fukasawa, Yoshinori; Zhao, Xiang; Rajan, Issaac; Hoehndorf, Robert; Al Mutairi, Fuad; Gojobori, Takashi; Alfadhel, Majid (BMC medical genomics, Springer Nature, 2020-07-17) [Article]
    BACKGROUND: Choosing a testing strategy is crucial for genetics clinics and testing laboratories. In this study, we compared the hit rate between solo, trio, and trio plus testing, and between trio and sibship testing. Finally, we studied the impact of extended family analysis, mainly in complex and unsolved cases. METHODS: Three cohorts were used for this analysis: one cohort to assess the hit rate between solo, trio and trio plus testing, another cohort to examine the impact of the testing strategy of sibship genome vs trio-based analysis, and a third cohort to test the impact of an extended family analysis of up to eight family members to lower the number of candidate variants. RESULTS: The hit rates in solo, trio and trio plus testing were 39, 40, and 41%, respectively. The total number of candidate variants in the sibship testing strategy was 117 variants compared to 59 variants in the trio-based analysis. In trio-based analysis, the average number of candidate variants was 1192 coding and 26,454 noncoding variants, and this number was lowered by 50-75% after adding additional family members, down to as few as two coding and 66 noncoding homozygous variants in families with eight family members. CONCLUSION: There was no difference in the hit rate between solo and extended family members. Trio-based analysis was a better approach than sibship testing, even in a consanguineous population. Finally, each additional family member helped to narrow down the number of variants by 50-75%. Our findings could help clinicians, researchers and testing laboratories select the most cost-effective and appropriate sequencing approach for their patients. Furthermore, extended family analysis is a very useful tool for complex cases with novel genes.
  • Modeling quantitative traits for COVID-19 case reports

    Queralt-Rosinach, Núria; Bello, Susan; Hoehndorf, Robert; Weiland, Claus; Rocca-Serra, Philippe; Schofield, Paul N. (Cold Spring Harbor Laboratory, 2020-06-21) [Preprint]
    Medical practitioners record the condition status of a patient through qualitative and quantitative observations. The measurement of vital signs and molecular parameters in the clinics gives a complementary description of abnormal phenotypes associated with the progression of a disease. The Clinical Measurement Ontology (CMO) is used to standardize annotations of these measurable traits. However, researchers have no way to describe how these quantitative traits relate to phenotype concepts in a machine-readable manner. Using the WHO clinical case report form standard for the COVID-19 pandemic, we modeled quantitative traits and developed OWL axioms to formally relate clinical measurement terms with anatomical, biomolecular entities and phenotypes annotated with the Uber-anatomy ontology (Uberon), Chemical Entities of Biological Interest (ChEBI) and the Phenotype and Trait Ontology (PATO) biomedical ontologies. The formal description of these relations allows interoperability between clinical and biological descriptions, and facilitates automated reasoning for analysis of patterns over quantitative and qualitative biomedical observations.
  • bio-ontology-research-group/DeepSVP: Prioritizing Copy Number Variants (CNV) using Phenotype and Gene Functional Similarity

    Althagafi, Azza Th.; Alsubaie, Lamia; Kathiresan, Nagarajan; Mineta, Katsuhiko; Aloraini, Taghrid; Almutairi, Fuad; Alfadhel, Majid; Gojobori, Takashi; Alfares, Ahmed; Hoehndorf, Robert (Github, 2020-06-08) [Software]
    Prioritizing Copy Number Variants (CNV) using Phenotype and Gene Functional Similarity
  • Machine learning with biomedical ontologies

    Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-05-08) [Preprint]
    Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in biomedical ontologies, and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies. Key points: Ontologies provide background knowledge that can be exploited in machine learning models. Ontology embeddings are structure-preserving maps from ontologies into vector spaces and provide an important method for utilizing ontologies in machine learning. Embeddings can preserve different structures in ontologies, including their graph structures, syntactic regularities, or their model-theoretic semantics. Axioms in ontologies, in particular those involving negation, can be used as constraints in optimization and machine learning to reduce the search space.
  • bio-ontology-research-group/machine-learning-with-ontologies

    Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Github, 2020-04-29) [Software]
  • Self-normalizing learning on biomedical ontologies using a deep Siamese neural network

    Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-04-25) [Preprint]
    Motivation: Ontologies are widely used in biomedicine for the annotation and standardization of data. One of the main roles of ontologies is to provide structured background knowledge within a domain as well as a set of labels, synonyms, and definitions for the classes within a domain. The two types of information provided by ontologies have been extensively exploited in natural language processing and machine learning applications. However, they are commonly used separately, and thus it is unknown if joining the two sources of information can further benefit data analysis tasks. Results: We developed a novel method that applies named entity recognition and normalization methods on texts to connect the structured information in biomedical ontologies with the information contained in natural language. We apply this normalization both to literature and to the natural language information contained within ontologies themselves. The normalized ontologies and text are then used to generate embeddings, and relations between entities are predicted using a deep Siamese neural network model that takes these embeddings as input. We demonstrate that our novel embedding and prediction method using self-normalized biomedical ontologies significantly outperforms the state-of-the-art methods in embedding ontologies on two benchmark tasks: prediction of interactions between proteins and prediction of gene-disease associations. Our method also allows us to apply ontology-based annotations and axioms to the prediction of toxicological effects of chemicals, where our method shows superior performance. Our method is generic and can be applied in scenarios where ontologies consisting of both structured information and natural language labels or synonyms are used.
  • Prediction of novel virus-host interactions by integrating clinical symptoms and protein sequences

    Wang, Liu-Wei; Kafkas, Senay; Chen, Jun; Tegner, Jesper; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-04-25) [Preprint]
    Motivation: Infectious diseases from novel viruses are becoming a major public health concern. Fast identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments and drug discoveries. Current computational prediction methods for novel viruses are based only on protein sequences. Yet, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. Results: We developed DeepViral, a deep learning method that predicts potential protein-protein interactions between humans and viruses. First, human proteins and viruses were embedded in a shared space using their associated phenotypes, functions, taxonomic classification, as well as formalized background knowledge from biomedical ontologies. By extending a sequence learning model with phenotype features, our model can not only significantly improve over previous sequence-based approaches for inter-species interaction prediction, but also identify pathways of viral targets under a realistic experimental setup for novel viruses. Availability: https://github.com/bio-ontology-research-group/DeepViral
  • bio-ontology-research-group/DeepViral: Source code for the DeepViral paper

    Kafkas, Senay; Chen, Jun; Tegner, Jesper; Hoehndorf, Robert; Wang, Liu-Wei (Github, 2020-04-22) [Software]
    Source code for the DeepViral paper
  • D4: Deep Drug-drug interaction Discovery and Demystification

    Noor, Adeeb; Liu-Wei, Wang; Barnawi, Ahmed; Nour, Redhwan; Assiri, Abdullah A; Chan Bukhari, Syed Ahmad; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-04-09) [Preprint]
    Motivation: Drug-drug interactions (DDIs) are complex processes which may depend on many clinical and non-clinical factors. Identifying and distinguishing ways in which drugs interact remains a challenge. To minimize DDIs and to personalize treatment based on accurate stratification of patients, it is crucial that mechanisms of interaction can be identified. Most DDIs are a consequence of metabolic mechanisms of interaction, but DDIs with different mechanisms occur less frequently and are therefore more difficult to identify. Results: We developed a method (D4) for computationally identifying potential DDIs and determining whether they interact based on one of eleven mechanisms of interaction. D4 predicts DDIs and their mechanisms through features that are generated through a deep learning approach from phenotypic and functional knowledge about drugs, their side-effects and targets. Our findings indicate that our method is able to identify known DDIs with high accuracy and that D4 can determine mechanisms of interaction. We also identify numerous novel and potential DDIs for each mechanism of interaction and evaluate our predictions using DDIs from adverse event reporting systems. Availability: https://github.com/bio-ontology-research-group/D4 Contact: arnoor@kau.edu.sa and robert.hoehndorf@kaust.edu.sa
  • Efficient long-distance relation extraction with DG-SpanBERT

    Chen, Jun; Hoehndorf, Robert; Elhoseiny, Mohamed; Zhang, Xiangliang (arXiv, 2020-04-07) [Preprint]
    In natural language processing, relation extraction seeks to rationally understand unstructured text. Here, we propose a novel SpanBERT-based graph convolutional network (DG-SpanBERT) that extracts semantic features from a raw sentence using the pre-trained language model SpanBERT and a graph convolutional network (GCN) to pool latent features. Our DG-SpanBERT model inherits the advantage of SpanBERT in learning rich lexical features from a large-scale corpus. It can also capture long-range relations between entities because it applies the GCN to the dependency tree. The experimental results show that our model outperforms other existing dependency-based and sequence-based models and achieves state-of-the-art performance on the TACRED dataset.
  • BioHackathon 2015: Semantics of data for life sciences and reproducible research

    Katayama, Toshiaki; Vos, Rutger A.; Mishima, Hiroyuki; Kawano, Shin; Kawashima, Shuichi; Kim, Jin Dong; Moriya, Yuki; Tokimatsu, Toshiaki; Yamaguchi, Atsuko; Yamamoto, Yasunori; Wu, Hongyan; Amstutz, Peter; Antezana, Erick; Aoki, Nobuyuki P.; Arakawa, Kazuharu; Bolleman, Jerven T.; Bolton, Evan; Bonnal, Raoul J.P.; Bono, Hidemasa; Burger, Kees; Chiba, Hirokazu; Cohen, Kevin B.; Deutsch, Eric W.; Fernández-Breis, Jesualdo T.; Fu, Gang; Fujisawa, Takatomo; Fukushima, Atsushi; García, Alexander; Goto, Naohisa; Groza, Tudor; Hercus, Colin; Hoehndorf, Robert; Itaya, Kotone; Juty, Nick; Kawashima, Takeshi; Kim, Jee Hyub; Kinjo, Akira R.; Kotera, Masaaki; Kozaki, Kouji; Kumagai, Sadahiro; Kushida, Tatsuya; Lütteke, Thomas; Matsubara, Masaaki; Miyamoto, Joe; Mohsen, Attayeb; Mori, Hiroshi; Naito, Yuki; Nakazato, Takeru; Nguyen-Xuan, Jeremy; Nishida, Kozo; Nishida, Naoki; Nishide, Hiroyo; Ogishima, Soichi; Ohta, Tazro; Okuda, Shujiro; Paten, Benedict; Perret, Jean Luc; Prathipati, Philip; Prins, Pjotr; Queralt-Rosinach, Núria; Shinmachi, Daisuke; Suzuki, Shinya; Tabata, Tsuyosi; Takatsuki, Terue; Taylor, Kieron; Thompson, Mark; Uchiyama, Ikuo; Vieira, Bruno; Wei, Chih Hsuan; Wilkinson, Mark; Yamada, Issaku; Yamanaka, Ryota; Yoshitake, Kazutoshi; Yoshizawa, Akiyasu C.; Dumontier, Michel; Kosaki, Kenjiro; Takagi, Toshihisa (F1000Research, F1000 Research Ltd, 2020-02-24) [Article]
    We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data and metadata across a wide range of biomedical data types of relevance for the life sciences, including chemistry, genotypes and phenotypes, orthology and phylogeny, proteomics, genomics, glycomics, and metabolomics. We describe our progress to address ongoing challenges to the reusability and reproducibility of research results, and identify outstanding issues that continue to impede the progress of bioinformatics research. We share our perspective on the state of the art, continued challenges, and goals for future research and development for the life sciences Semantic Web.
  • CAN-VP: CANcer Variant Prioritization

    Althubaiti, Sara; Gkoutos, Georgios; Hoehndorf, Robert (2020-01-20) [Poster]
    Introduction: Identifying and prioritizing the driver mutations that play a main role in cancer development is still a major challenge. Several computational approaches involving machine learning and statistical methods exist for finding these driver mutations; they depend on pre-computed pathogenicity scores derived from different tools. We have developed the CANcer Variant Prioritization (CAN-VP) system to identify and prioritize driver mutations. Our tool exploits background knowledge from different ontologies that cover cellular phenotypes, functions, and whole-body physiological phenotypes, and combines it with region-based information as features. We demonstrate the performance of CAN-VP in prioritizing causative driver mutations on a number of synthetic whole exomes from The Cancer Genome Atlas (TCGA), targeting 4 different primary sites. We find that CAN-VP identified most of the causative driver mutations compared to the existing tools, which shows its capability as a tool for discovering driver mutations.
    Methods and Materials: Data sources: We relied on two main types of datasets. The first comes from well-known cancer-related databases such as COSMIC [1], CanProVar [2], and IntOGen [3]. The second consists of the real samples included in The Cancer Genome Atlas (TCGA) [4], which involves more than 60 different projects covering 67 primary sites; so far we have focused on 4 projects (Sarcoma, Kidney, Lung, and Bladder). Moreover, we used the 579 validated driver mutations in Bailey, Matthew H., et al. [5].
    Results and Discussion: 1. Prediction model. 1.1 Model details: We implemented CAN-VP as a fully connected neural network model in Python 3.6, as shown in Figure 4. We used Keras with a TensorFlow backend. We ignored the missing values for all the features being used. We added additional flags for missing values as features. We retrieved gene embeddings and used them as features in the prediction model. 1.2 Training and testing data: We downloaded the COSMIC mutations VCF file on 26th Jul, 2019. It includes 4,788,121 cancer mutations. We also downloaded the DoCM dataset as a VCF file on 18th Nov, 2019. It includes 1364 curated driver mutations. Moreover, we downloaded CanProVar as a fastq file on 18th Nov, 2019. It includes 156,671 driver mutations. Based on that, we determined how many mutations from DoCM + CanProVar exist within COSMIC and considered them positives; all other mutations were considered negatives. As Table 1 shows, the negatives (unknown driver somatic mutations) far outnumber the positives (validated driver mutations). 1.3 Prediction performance: We trained our model in Figure 2 using the dataset in Table 1 and tested it on the synthetic datasets. The updated results of CAN-VP compared to the other tools are shown in Table 2. To evaluate the importance of different features in our prediction model, we first tested different combinations of features from CanDrA, which includes 86 from CHASMplus and 3 from Mutation Assessor, plus 3 from UCSC. Adding the gene embeddings improved the results by 3%. Table 3 summarizes the performance for each experiment.
    Future Work: - Test CAN-VP on more comprehensive cancer-related datasets. - Integrate graph-based features into the CAN-VP model.
    References: [1] Sally Bamford et al. “The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website.” In: British Journal of Cancer 91.2 (2004), p. 355. [2] Jing Li, Dexter T Duncan, and Bing Zhang. “CanProVar: a human cancer proteome variation database.” In: Human Mutation 31.3 (2010), pp. 219–228. [3] Gunes Gundem et al. “IntOGen: integration and data mining of multidimensional oncogenomic data.” In: Nature Methods 7.2 (2010), p. 92. [4] Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. “The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge.” In: Contemporary Oncology 19.1A (2015), A68. [5] Matthew H Bailey et al. “Comprehensive characterization of cancer driver genes and mutations.” In: Cell 173.2 (2018), pp. 371–385.
