    • Predicting Candidate Genes From Phenotypes, Functions, And Anatomical Site Of Expression.

      Chen, Jun; Althagafi, Azza Th.; Hoehndorf, Robert (Bioinformatics (Oxford, England), Oxford University Press (OUP), 2020-10-14) [Article]
      MOTIVATION: Over the past years, many computational methods have been developed to incorporate information about phenotypes for the disease gene prioritization task. These methods generally compute the similarity between a patient's phenotypes and a database of gene-phenotype associations to find the most phenotypically similar match. The main limitation in these methods is their reliance on knowledge about phenotypes associated with particular genes, which is not complete in humans as well as in many model organisms such as the mouse and fish. Information about functions of gene products and anatomical site of gene expression is available for more genes and can also be related to phenotypes through ontologies and machine learning models. RESULTS: We developed a novel graph-based machine learning method for biomedical ontologies which is able to exploit axioms in ontologies and other graph-structured data. Using our machine learning method, we embed genes based on their associated phenotypes, functions of the gene products, and anatomical location of gene expression. We then develop a machine learning model to predict gene-disease associations based on the associations between genes and multiple biomedical ontologies, and this model significantly improves over state-of-the-art methods. Furthermore, we extend phenotype-based gene prioritization methods significantly to all genes which are associated with phenotypes, functions, or site of expression. AVAILABILITY: Software and data are available at https://github.com/bio-ontology-research-group/DL2Vec.
    • Semantic similarity and machine learning with ontologies.

      Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Briefings in bioinformatics, Oxford University Press (OUP), 2020-10-13) [Article]
      Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
    • EMC10 Homozygous Variant Identified in a Family with Global Developmental Delay, Mild Intellectual Disability, and Speech Delay.

      Umair, Muhammad; Ballow, Mariam; Asiri, Abdulaziz; Alyafee, Yusra; Al Tuwaijri, Abeer; Alhamoudi, Kheloud M; Aloraini, Taghrid; Abdelhakim, Marwa; Althagafi, Azza Th.; Kafkas, Senay; Alsubaie, Lamia; Alrifai, Muhammad Talal; Hoehndorf, Robert; Alfares, Ahmed; Alfadhel, Majid (Clinical genetics, Wiley, 2020-09-15) [Article]
      In recent years, several genes have been implicated in the variable disease presentation of global developmental delay (GDD) and intellectual disability (ID). The endoplasmic reticulum membrane protein complex (EMC) family is known to be involved in GDD and ID. Homozygous variants of EMC1 are associated with GDD, scoliosis, and cerebellar atrophy, indicating the relevance of this pathway for neurogenetic disorders. EMC10 is a bone marrow-derived angiogenic growth factor that plays an important role in infarct vascularization and promoting tissue repair. However, this gene has not been previously associated with human disease. Herein, we describe a Saudi family with two individuals segregating a recessive neurodevelopmental disorder. Both of the affected individuals showed mild ID, speech delay, and GDD. Whole-exome sequencing (WES) and Sanger sequencing were performed to identify candidate genes. Further, to elucidate the functional effects of the variant, quantitative real-time PCR (RT-qPCR)-based expression analysis was performed. WES revealed a homozygous splice acceptor site variant (c.679-1G > A) in EMC10 (chromosome 19q13.33) that segregated perfectly within the family. RT-qPCR showed a substantial decrease in the relative EMC10 gene expression in the patients, indicating the pathogenicity of the identified variant. For the first time in the literature, the EMC10 gene variant was associated with mild ID, speech delay, and GDD. Thus, this gene plays a key role in developmental milestones, with the potential to cause neurodevelopmental disorders in humans.
    • Komenti: A semantic text mining framework

      Slater, Luke T; Bradlow, William; Hoehndorf, Robert; Motti, Dino FA; Ball, Simon; Gkoutos, Georgios (Cold Spring Harbor Laboratory, 2020-08-05) [Preprint]
      Komenti is a reasoner-enabled semantic query and information extraction tool. It is the only text mining tool that enables querying inferred knowledge from biomedical ontologies. It also contains multiple novel components for vocabulary construction and context disambiguation, which can improve the power of text mining and ontology-based analysis tasks, with a view towards making full use of the semantic provision of biomedical ontologies in the text extraction and characterisation space. Here, we describe Komenti, its features, and a use case wherein we automate a clinical audit process, classifying the medications of patients with hypertrophic cardiomyopathy from text records, revealing a high precision, and a subcohort of candidate patients who have atrial fibrillation but were not anti-coagulated, and are therefore at a higher risk of stroke.
    • What is the right sequencing approach? Solo VS extended family analysis in consanguineous populations.

      Alfares, Ahmed; Alsubaie, Lamia; Aloraini, Taghrid; Alaskar, Aljoharah; Althagafi, Azza Th.; Alahmad, Ahmed; Rashid, Mamoon; Alswaid, Abdulrahman; Alothaim, Ali; Eyaid, Wafaa; Ababneh, Faroug; Albalwi, Mohammed; Alotaibi, Raniah; Almutairi, Mashael; Altharawi, Nouf; Alsamer, Alhanouf; Abdelhakim, Marwa; Kafkas, Senay; Mineta, Katsuhiko; Cheung, Nicole; Abdallah, Abdallah; Büchmann-Møller, Stine; Fukasawa, Yoshinori; Zhao, Xiang; Rajan, Issaac; Hoehndorf, Robert; Al Mutairi, Fuad; Gojobori, Takashi; Alfadhel, Majid (BMC medical genomics, Springer Science and Business Media LLC, 2020-07-17) [Article]
      BACKGROUND: Testing strategies are crucial for genetics clinics and testing laboratories. In this study, we compared the hit rate among solo, trio, and trio plus testing, and between trio and sibship testing. Finally, we studied the impact of extended family analysis, mainly in complex and unsolved cases. METHODS: Three cohorts were used for this analysis: one cohort to assess the hit rate between solo, trio and trio plus testing, another cohort to examine the impact of the testing strategy of sibship genome vs trio-based analysis, and a third cohort to test the impact of an extended family analysis of up to eight family members to lower the number of candidate variants. RESULTS: The hit rates in solo, trio and trio plus testing were 39, 40, and 41%, respectively. The total number of candidate variants in the sibship testing strategy was 117 variants compared to 59 variants in the trio-based analysis. We noticed that the average number of coding candidate variants in trio-based analysis was 1192 variants and 26,454 noncoding variants, and this number was lowered by 50-75% after adding additional family members, with up to two coding and 66 noncoding homozygous variants only, in families with eight family members. CONCLUSION: There was no difference in the hit rate between solo and extended family members. Trio-based analysis was a better approach than sibship testing, even in a consanguineous population. Finally, each additional family member helped to narrow down the number of variants by 50-75%. Our findings could help clinicians, researchers and testing laboratories select the most cost-effective and appropriate sequencing approach for their patients. Furthermore, using extended family analysis is a very useful tool for complex cases with novel genes.
    • Improved characterisation of clinical text through ontology-based vocabulary expansion

      Slater, Luke T; Bradlow, William; Ball, Simon; Hoehndorf, Robert; Gkoutos, Georgios (Cold Spring Harbor Laboratory, 2020-07-11) [Preprint]
      Background: Biomedical ontologies contain a wealth of metadata that constitutes a fundamental infrastructural resource for text mining. For several reasons, redundancies exist in the ontology ecosystem, which lead to the same concepts being described by several terms in the same or similar contexts across several ontologies. While these terms describe the same concepts, they contain different sets of complementary metadata. Linking these definitions to make use of their combined metadata could lead to improved performance in ontology-based information retrieval, extraction, and analysis tasks. Results: We develop and present an algorithm that expands the set of labels associated with an ontology class using a combination of strict lexical matching and cross-ontology reasoner-enabled equivalency queries. Across all disease terms in the Disease Ontology, the approach found 51,362 additional labels, more than tripling the number defined by the ontology itself. Manual validation by a clinical expert on a random sampling of expanded synonyms over the Human Phenotype Ontology yielded a precision of 0.912. Furthermore, we found that annotating patient visits in MIMIC-III with an extended set of Disease Ontology labels led to the semantic similarity score derived from those labels being a significantly better predictor of matching first diagnosis, with a mean average precision of 0.88 for the unexpanded set of annotations, and 0.913 for the expanded set. Conclusions: Inter-ontology synonym expansion can lead to a vast increase in the scale of vocabulary available for text mining applications. While the accuracy of the extended vocabulary is not perfect, it nevertheless led to a significantly improved ontology-based characterisation of patients from text in one setting. Furthermore, where run-on error is not acceptable, the technique can be used to provide candidate synonyms which can be checked by a domain expert.
    • Modeling quantitative traits for COVID-19 case reports

      Queralt-Rosinach, Núria; Bello, Susan; Hoehndorf, Robert; Weiland, Claus; Rocca-Serra, Philippe; Schofield, Paul N. (Cold Spring Harbor Laboratory, 2020-06-21) [Preprint]
      Medical practitioners record the condition status of a patient through qualitative and quantitative observations. The measurement of vital signs and molecular parameters in the clinics gives a complementary description of abnormal phenotypes associated with the progression of a disease. The Clinical Measurement Ontology (CMO) is used to standardize annotations of these measurable traits. However, researchers have no way to describe how these quantitative traits relate to phenotype concepts in a machine-readable manner. Using the WHO clinical case report form standard for the COVID-19 pandemic, we modeled quantitative traits and developed OWL axioms to formally relate clinical measurement terms with anatomical, biomolecular entities and phenotypes annotated with the Uber-anatomy ontology (Uberon), Chemical Entities of Biological Interest (ChEBI) and the Phenotype and Trait Ontology (PATO) biomedical ontologies. The formal description of these relations allows interoperability between clinical and biological descriptions, and facilitates automated reasoning for analysis of patterns over quantitative and qualitative biomedical observations.
    • Machine learning with biomedical ontologies

      Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-05-08) [Preprint]
      Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in biomedical ontologies, and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies. Key points: Ontologies provide background knowledge that can be exploited in machine learning models. Ontology embeddings are structure-preserving maps from ontologies into vector spaces and provide an important method for utilizing ontologies in machine learning. Embeddings can preserve different structures in ontologies, including their graph structures, syntactic regularities, or their model-theoretic semantics. Axioms in ontologies, in particular those involving negation, can be used as constraints in optimization and machine learning to reduce the search space.
    • Self-normalizing learning on biomedical ontologies using a deep Siamese neural network

      Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-04-25) [Preprint]
      Motivation: Ontologies are widely used in biomedicine for the annotation and standardization of data. One of the main roles of ontologies is to provide structured background knowledge within a domain as well as a set of labels, synonyms, and definitions for the classes within a domain. The two types of information provided by ontologies have been extensively exploited in natural language processing and machine learning applications. However, they are commonly used separately, and thus it is unknown if joining the two sources of information can further benefit data analysis tasks. Results: We developed a novel method that applies named entity recognition and normalization methods on texts to connect the structured information in biomedical ontologies with the information contained in natural language. We apply this normalization both to literature and to the natural language information contained within ontologies themselves. The normalized ontologies and text are then used to generate embeddings, and relations between entities are predicted using a deep Siamese neural network model that takes these embeddings as input. We demonstrate that our novel embedding and prediction method using self-normalized biomedical ontologies significantly outperforms the state-of-the-art methods in embedding ontologies on two benchmark tasks: prediction of interactions between proteins and prediction of gene-disease associations. Our method also allows us to apply ontology-based annotations and axioms to the prediction of toxicological effects of chemicals, where our method shows superior performance. Our method is generic and can be applied in scenarios where ontologies consisting of both structured information and natural language labels or synonyms are used.
    • Prediction of novel virus-host interactions by integrating clinical symptoms and protein sequences

      Wang, Liu-Wei; Kafkas, Senay; Chen, Jun; Tegner, Jesper; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-04-25) [Preprint]
      Motivation: Infectious diseases from novel viruses are becoming a major public health concern. Fast identification of virus-host interactions can reveal mechanistic insights of infectious diseases and shed light on potential treatments and drug discoveries. Current computational prediction methods for novel viruses are based only on protein sequences. Yet, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. Results: We developed DeepViral, a deep learning method that predicts potential protein-protein interactions between human and viruses. First, human proteins and viruses were embedded in a shared space using their associated phenotypes, functions, taxonomic classification, as well as formalized background knowledge from biomedical ontologies. By extending a sequence learning model with phenotype features, our model can not only significantly improve over previous sequence-based approaches for inter-species interaction prediction, but also identify pathways of viral targets under a realistic experimental setup for novel viruses. Availability: https://github.com/bio-ontology-research-group/DeepViral
    • D4: Deep Drug-drug interaction Discovery and Demystification

      Noor, Adeeb; Liu-Wei, Wang; Barnawi, Ahmed; Nour, Redhwan; Assiri, Abdullah A; Chan Bukhari, Syed Ahmad; Hoehndorf, Robert (Cold Spring Harbor Laboratory, 2020-04-09) [Preprint]
      Motivation: Drug-drug interactions (DDIs) are complex processes which may depend on many clinical and non-clinical factors. Identifying and distinguishing ways in which drugs interact remains a challenge. To minimize DDIs and to personalize treatment based on accurate stratification of patients, it is crucial that mechanisms of interaction can be identified. Most DDIs are a consequence of metabolic mechanisms of interaction, but DDIs with different mechanisms occur less frequently and are therefore more difficult to identify. Results: We developed a method (D4) for computationally identifying potential DDIs and determining whether they interact based on one of eleven mechanisms of interaction. D4 predicts DDIs and their mechanisms through features that are generated through a deep learning approach from phenotypic and functional knowledge about drugs, their side-effects and targets. Our findings indicate that our method is able to identify known DDIs with high accuracy and that D4 can determine mechanisms of interaction. We also identify numerous novel and potential DDIs for each mechanism of interaction and evaluate our predictions using DDIs from adverse event reporting systems. Availability: https://github.com/bio-ontology-research-group/D4 Contact: arnoor@kau.edu.sa and robert.hoehndorf@kaust.edu.sa
    • Efficient long-distance relation extraction with DG-SpanBERT

      Chen, Jun; Hoehndorf, Robert; Elhoseiny, Mohamed; Zhang, Xiangliang (arXiv, 2020-04-07) [Preprint]
      In natural language processing, relation extraction seeks to rationally understand unstructured text. Here, we propose a novel SpanBERT-based graph convolutional network (DG-SpanBERT) that extracts semantic features from a raw sentence using the pre-trained language model SpanBERT and a graph convolutional network to pool latent features. Our DG-SpanBERT model inherits the advantage of SpanBERT on learning rich lexical features from large-scale corpus. It also has the ability to capture long-range relations between entities due to the usage of GCN on dependency tree. The experimental results show that our model outperforms other existing dependency-based and sequence-based models and achieves a state-of-the-art performance on the TACRED dataset.
    • BioHackathon 2015: Semantics of data for life sciences and reproducible research

      Katayama, Toshiaki; Vos, Rutger A.; Mishima, Hiroyuki; Kawano, Shin; Kawashima, Shuichi; Kim, Jin Dong; Moriya, Yuki; Tokimatsu, Toshiaki; Yamaguchi, Atsuko; Yamamoto, Yasunori; Wu, Hongyan; Amstutz, Peter; Antezana, Erick; Aoki, Nobuyuki P.; Arakawa, Kazuharu; Bolleman, Jerven T.; Bolton, Evan; Bonnal, Raoul J.P.; Bono, Hidemasa; Burger, Kees; Chiba, Hirokazu; Cohen, Kevin B.; Deutsch, Eric W.; Fernández-Breis, Jesualdo T.; Fu, Gang; Fujisawa, Takatomo; Fukushima, Atsushi; García, Alexander; Goto, Naohisa; Groza, Tudor; Hercus, Colin; Hoehndorf, Robert; Itaya, Kotone; Juty, Nick; Kawashima, Takeshi; Kim, Jee Hyub; Kinjo, Akira R.; Kotera, Masaaki; Kozaki, Kouji; Kumagai, Sadahiro; Kushida, Tatsuya; Lütteke, Thomas; Matsubara, Masaaki; Miyamoto, Joe; Mohsen, Attayeb; Mori, Hiroshi; Naito, Yuki; Nakazato, Takeru; Nguyen-Xuan, Jeremy; Nishida, Kozo; Nishida, Naoki; Nishide, Hiroyo; Ogishima, Soichi; Ohta, Tazro; Okuda, Shujiro; Paten, Benedict; Perret, Jean Luc; Prathipati, Philip; Prins, Pjotr; Queralt-Rosinach, Núria; Shinmachi, Daisuke; Suzuki, Shinya; Tabata, Tsuyosi; Takatsuki, Terue; Taylor, Kieron; Thompson, Mark; Uchiyama, Ikuo; Vieira, Bruno; Wei, Chih Hsuan; Wilkinson, Mark; Yamada, Issaku; Yamanaka, Ryota; Yoshitake, Kazutoshi; Yoshizawa, Akiyasu C.; Dumontier, Michel; Kosaki, Kenjiro; Takagi, Toshihisa (F1000Research, F1000 Research Ltd, 2020-02-24) [Article]
      We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data and metadata across a wide range of biomedical data types of relevance for the life sciences, including chemistry, genotypes and phenotypes, orthology and phylogeny, proteomics, genomics, glycomics, and metabolomics. We describe our progress to address ongoing challenges to the reusability and reproducibility of research results, and identify outstanding issues that continue to impede the progress of bioinformatics research. We share our perspective on the state of the art, continued challenges, and goals for future research and development for the life sciences Semantic Web.
    • CAN-VP: CANcer Variant Prioritization

      Althubaiti, Sara; Gkoutos, Georgios; Hoehndorf, Robert (2020-1-20) [Poster]
      Introduction: Identifying and prioritizing driver mutations that play a main role in the development of cancer is still a major challenge. Several computational approaches involving machine learning and statistical methods exist for finding these driver mutations, depending on pre-computed pathogenicity scores derived from different tools. We have developed the CANcer Variant Prioritization (CAN-VP) system to identify and prioritize driver mutations. Our tool exploits background knowledge from different ontologies that cover cellular phenotypes, functions, and whole-body physiological phenotypes, and combines it with region-based information as features. We demonstrate the performance of CAN-VP in prioritizing causative driver mutations on a number of synthetic whole exomes from The Cancer Genome Atlas (TCGA), targeting 4 different primary sites. We find that CAN-VP could identify most of the causative driver mutations compared to the existing tools, which shows its capability as a tool for discovering driver mutations. Methods and Materials: Data sources: We relied on two main types of datasets. The first is from well-known cancer-related databases such as COSMIC [1], CanProVar [2], and IntOGen [3]. The second is the real samples included in The Cancer Genome Atlas (TCGA) [4], which involve more than 60 different projects covering 67 primary sites; so far we focus on 4 projects (Sarcoma, Kidney, Lung, and Bladder). Moreover, we used the 579 validated driver mutations in Bailey, Matthew H., et al. [5]. Results and Discussion: 1. Prediction model. 1.1 Model details: We implemented CAN-VP using a fully connected neural network model in Python 3.6 as shown in Figure 4. We used Keras with a TensorFlow backend. We ignored the missing values for all the features being used. We added additional flags for missing values as features. We retrieved gene embeddings from and used them as features in the prediction model. 1.2 Training and testing data: We downloaded the COSMIC mutations VCF file on 26 July 2019. It includes 4,788,121 cancer mutations. We also downloaded the DoCM dataset as a VCF file on 18 November 2019. It includes 1364 curated driver mutations. Moreover, we downloaded CanProVar as a fastq file on 18 November 2019. It includes 156,671 driver mutations. Based on that, we determined how many mutations of DoCM + CanProVar exist within COSMIC and considered them as positives; otherwise, they were treated as negatives. As Table 1 shows, the number of negatives (unknown driver somatic mutations) is much larger than the number of positives (validated as driver mutations). 1.3 Prediction performance: We trained our model in Figure 2 using the dataset in Table 1 and tested it on the synthetic datasets. The updated results of CAN-VP compared to the other tools are shown in Table 2. To evaluate the importance of different features in our prediction model, we first tested different combinations of features from CanDrA (which includes 86 from CHASMplus and 3 from Mutation Assessor) plus 3 from UCSC. Moreover, we added the gene embeddings and the results improved by 3%. Table 3 summarizes the performance for each experiment. Future Work: - Test CAN-VP on more comprehensive cancer-related datasets. - Integrate graph-based features into the CAN-VP model. References: [1] Sally Bamford et al. “The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website.” In: British journal of cancer 91.2 (2004), p. 355. [2] Jing Li, Dexter T Duncan, and Bing Zhang. “CanProVar: a human cancer proteome variation database.” In: Human mutation 31.3 (2010), pp. 219–228. [3] Gunes Gundem et al. “IntOGen: integration and data mining of multidimensional oncogenomic data.” In: Nature methods 7.2 (2010), p. 92. [4] Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. “The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge.” In: Contemporary oncology 19.1A (2015), A68. [5] Matthew H Bailey et al. “Comprehensive characterization of cancer driver genes and mutations.” In: Cell 173.2 (2018), pp. 371–385.
    • Combining lexical and context features for automatic ontology extension.

      Althubaiti, Sara; Kafkas, Senay; Abdelhakim, Marwa; Hoehndorf, Robert (Journal of biomedical semantics, Springer Science and Business Media LLC, 2020-01-13) [Article]
      BACKGROUND: Ontologies are widely used across biology and biomedicine for the annotation of databases. Ontology development is often a manual, time-consuming, and expensive process. Automatic or semi-automatic identification of classes that can be added to an ontology can make ontology development more efficient. RESULTS: We developed a method that uses machine learning and word embeddings to identify words and phrases that are used to refer to an ontology class in biomedical Europe PMC full-text articles. Once labels and synonyms of a class are known, we use machine learning to identify the super-classes of a class. For this purpose, we identify lexical term variants, use word embeddings to capture context information, and rely on automated reasoning over ontologies to generate features, and we use an artificial neural network as classifier. We demonstrate the utility of our approach in identifying terms that refer to diseases in the Human Disease Ontology and to distinguish between different types of diseases. CONCLUSIONS: Our method is capable of discovering labels that refer to a class in an ontology but are not present in an ontology, and it can identify whether a class should be a subclass of some high-level ontology classes. Our approach can therefore be used for the semi-automatic extension and quality control of ontologies. The algorithm, corpora and evaluation datasets are available at https://github.com/bio-ontology-research-group/ontology-extension.
    • What is the right sequencing approach? Solo VS extended family analysis in consanguineous populations

      Alfares, Ahmed; Alsubaie, Lamia; Aloraini, Taghrid; Alaskar, Aljoharah; Althagafi, Azza Th.; Alahmad, Ahmed; Rashid, Mamoon; Alswaid, Abdulrahman; Alothaim, Ali; Eyaid, Wafaa; Ababneh, Faroug; Albalwi, Mohammed; Alotaibi, Raniah; Almutairi, Mashael; Altharawi, Nouf; Alsamer, Alhanouf; Abdelhakim, Marwa; Kafkas, Senay; Mineta, Katsuhiko; Cheung, Nicole; Abdallah, Abdallah; Büchmann-Møller, Stine; Fukasawa, Yoshinori; Zhao, Xiang; Rajan, Issaac; Hoehndorf, Robert; Al Mutairi, Fuad; Gojobori, Takashi; Alfadhel, Majid (figshare, 2020) [Dataset]
      Background: Testing strategies are crucial for genetics clinics and testing laboratories. In this study, we compared the hit rate among solo, trio, and trio plus testing, and between trio and sibship testing. Finally, we studied the impact of extended family analysis, mainly in complex and unsolved cases. Methods: Three cohorts were used for this analysis: one cohort to assess the hit rate between solo, trio and trio plus testing, another cohort to examine the impact of the testing strategy of sibship genome vs trio-based analysis, and a third cohort to test the impact of an extended family analysis of up to eight family members to lower the number of candidate variants. Results: The hit rates in solo, trio and trio plus testing were 39, 40, and 41%, respectively. The total number of candidate variants in the sibship testing strategy was 117 variants compared to 59 variants in the trio-based analysis. We noticed that the average number of coding candidate variants in trio-based analysis was 1192 variants and 26,454 noncoding variants, and this number was lowered by 50–75% after adding additional family members, with up to two coding and 66 noncoding homozygous variants only, in families with eight family members. Conclusion: There was no difference in the hit rate between solo and extended family members. Trio-based analysis was a better approach than sibship testing, even in a consanguineous population. Finally, each additional family member helped to narrow down the number of variants by 50–75%. Our findings could help clinicians, researchers and testing laboratories select the most cost-effective and appropriate sequencing approach for their patients. Furthermore, using extended family analysis is a very useful tool for complex cases with novel genes.
    • Combining lexical and context features for automatic ontology extension

      Althubaiti, Sara; Kafkas, Senay; Abdelhakim, Marwa; Hoehndorf, Robert (figshare, 2020) [Dataset]
      Background: Ontologies are widely used across biology and biomedicine for the annotation of databases. Ontology development is often a manual, time-consuming, and expensive process. Automatic or semi-automatic identification of classes that can be added to an ontology can make ontology development more efficient. Results: We developed a method that uses machine learning and word embeddings to identify words and phrases that are used to refer to an ontology class in biomedical Europe PMC full-text articles. Once labels and synonyms of a class are known, we use machine learning to identify the super-classes of a class. For this purpose, we identify lexical term variants, use word embeddings to capture context information, and rely on automated reasoning over ontologies to generate features, and we use an artificial neural network as classifier. We demonstrate the utility of our approach in identifying terms that refer to diseases in the Human Disease Ontology and to distinguish between different types of diseases. Conclusions: Our method is capable of discovering labels that refer to a class in an ontology but are not present in an ontology, and it can identify whether a class should be a subclass of some high-level ontology classes. Our approach can therefore be used for the semi-automatic extension and quality control of ontologies. The algorithm, corpora and evaluation datasets are available at https://github.com/bio-ontology-research-group/ontology-extension.
    • Comparative genomics study reveals Red Sea Bacillus with characteristics associated with potential microbial cell factories (MCFs)

      Othoum, Ghofran K.; Prigent, S.; Derouiche, A.; Shi, L.; Bokhari, Ameerah; Alamoudi, S.; Bougouffa, Salim; Gao, Xin; Hoehndorf, Robert; Arold, Stefan T.; Gojobori, Takashi; Hirt, Heribert; Lafi, Feras Fawzi; Nielsen, J.; Bajic, Vladimir B.; Mijakovic, I.; Essack, Magbubah (Scientific Reports, Springer Science and Business Media LLC, 2019-12-17) [Article]
      Recent advancements in the use of microbial cells for scalable production of industrial enzymes encourage exploring new environments for efficient microbial cell factories (MCFs). Here, through a comparison study, ten newly sequenced Bacillus species, isolated from the Rabigh Harbor Lagoon on the Red Sea shoreline, were evaluated for their potential use as MCFs. Phylogenetic analysis of 40 representative genomes with phylogenetic relevance, including the ten Red Sea species, showed that the Red Sea species come from several colonization events and are not the result of a single colonization followed by speciation. Moreover, clustering reactions in reconstructed metabolic networks of these Bacillus species revealed that three metabolic clades do not fit the phylogenetic tree, a sign of convergent evolution of the metabolism of these species in response to special environmental adaptation. We further showed that Red Sea strains Bacillus paralicheniformis (Bac48) and B. halosaccharovorans (Bac94) had twice as many secreted proteins as the model strain B. subtilis 168. Also, Bac94 was enriched with genes associated with the Tat and Sec protein secretion systems, and Bac48 has a hybrid PKS/NRPS cluster that is part of a horizontally transferred genomic region. These properties collectively hint at the potential use of Red Sea Bacillus as efficient protein-secreting microbial hosts and suggest that this characteristic of these strains may be a consequence of the unique ecological features of the isolation environment.
    • Formal axioms in biomedical ontologies improve analysis and interpretation of associated data.

      Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Bioinformatics (Oxford, England), Oxford University Press (OUP), 2019-12-10) [Article]
      Over the past years, significant resources have been invested into formalizing biomedical ontologies. Formal axioms in ontologies have been developed and used to detect and ensure ontology consistency, find unsatisfiable classes, improve interoperability, guide ontology extension through the application of axiom-based design patterns, and encode domain background knowledge. The domain knowledge in biomedical ontologies may also have the potential to provide background knowledge for machine learning and predictive modelling. We use ontology-based machine learning methods to evaluate the contribution of formal axioms and ontology meta-data to the prediction of protein-protein interactions and gene-disease associations. We find that the background knowledge provided by the Gene Ontology and other ontologies significantly improves the performance of ontology-based prediction models through provision of domain-specific background knowledge. Furthermore, we find that the labels, synonyms and definitions in ontologies can also provide background knowledge that may be exploited for prediction. The axioms and meta-data of different ontologies contribute to improving data analysis in a context-specific manner. Our results have implications for the further development of formal knowledge bases and ontologies in the life sciences, in particular as machine learning methods are more frequently being applied. Our findings motivate the need for further development, and the systematic, application-driven evaluation and improvement, of formal axioms in ontologies. AVAILABILITY: https://github.com/bio-ontology-research-group/tsoe.