### Recent Submissions

• #### Prediction of novel virus-host interactions by integrating clinical symptoms and protein sequences

(Cold Spring Harbor Laboratory, 2020-04-25) [Preprint]
Motivation: Infectious diseases from novel viruses are becoming a major public health concern. Fast identification of virus-host interactions can reveal mechanistic insights of infectious diseases and shed light on potential treatments and drug discoveries. Current computational prediction methods for novel viruses are based only on protein sequences. Yet, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. Results: We developed DeepViral, a deep learning method that predicts potential protein-protein interactions between humans and viruses. First, human proteins and viruses were embedded in a shared space using their associated phenotypes, functions, taxonomic classification, as well as formalized background knowledge from biomedical ontologies. By extending a sequence learning model with phenotype features, our model can not only significantly improve over previous sequence-based approaches for inter-species interaction prediction, but also identify pathways of viral targets under a realistic experimental setup for novel viruses. Availability: https://github.com/bio-ontology-research-group/DeepViral
• #### Self-normalizing learning on biomedical ontologies using a deep Siamese neural network

(Cold Spring Harbor Laboratory, 2020-04-25) [Preprint]
Motivation: Ontologies are widely used in biomedicine for the annotation and standardization of data. One of the main roles of ontologies is to provide structured background knowledge within a domain as well as a set of labels, synonyms, and definitions for the classes within a domain. The two types of information provided by ontologies have been extensively exploited in natural language processing and machine learning applications. However, they are commonly used separately, and thus it is unknown if joining the two sources of information can further benefit data analysis tasks. Results: We developed a novel method that applies named entity recognition and normalization methods on texts to connect the structured information in biomedical ontologies with the information contained in natural language. We apply this normalization both to literature and to the natural language information contained within ontologies themselves. The normalized ontologies and text are then used to generate embeddings, and relations between entities are predicted using a deep Siamese neural network model that takes these embeddings as input. We demonstrate that our novel embedding and prediction method using self-normalized biomedical ontologies significantly outperforms the state-of-the-art methods in embedding ontologies on two benchmark tasks: prediction of interactions between proteins and prediction of gene-disease associations. Our method also allows us to apply ontology-based annotations and axioms to the prediction of toxicological effects of chemicals, where our method shows superior performance. Our method is generic and can be applied in scenarios where ontologies consisting of both structured information and natural language labels or synonyms are used.
• #### D4: Deep Drug-drug interaction Discovery and Demystification

(Cold Spring Harbor Laboratory, 2020-04-09) [Preprint]
Motivation: Drug-drug interactions (DDIs) are complex processes which may depend on many clinical and non-clinical factors. Identifying and distinguishing ways in which drugs interact remains a challenge. To minimize DDIs and to personalize treatment based on accurate stratification of patients, it is crucial that mechanisms of interaction can be identified. Most DDIs are a consequence of metabolic mechanisms of interaction, but DDIs with different mechanisms occur less frequently and are therefore more difficult to identify. Results: We developed a method (D4) for computationally identifying potential DDIs and determining whether they interact based on one of eleven mechanisms of interaction. D4 predicts DDIs and their mechanisms through features that are generated through a deep learning approach from phenotypic and functional knowledge about drugs, their side-effects and targets. Our findings indicate that our method is able to identify known DDIs with high accuracy and that D4 can determine mechanisms of interaction. We also identify numerous novel and potential DDIs for each mechanism of interaction and evaluate our predictions using DDIs from adverse event reporting systems. Availability: https://github.com/bio-ontology-research-group/D4 Contact: arnoor@kau.edu.sa and robert.hoehndorf@kaust.edu.sa
• #### Efficient long-distance relation extraction with DG-SpanBERT

(arXiv, 2020-04-07) [Preprint]
In natural language processing, relation extraction seeks to rationally understand unstructured text. Here, we propose a novel SpanBERT-based graph convolutional network (DG-SpanBERT) that extracts semantic features from a raw sentence using the pre-trained language model SpanBERT and a graph convolutional network (GCN) to pool latent features. Our DG-SpanBERT model inherits the advantage of SpanBERT in learning rich lexical features from large-scale corpora. It can also capture long-range relations between entities because it applies the GCN to the dependency tree. The experimental results show that our model outperforms other existing dependency-based and sequence-based models and achieves state-of-the-art performance on the TACRED dataset.
• #### CAN-VP: CANcer Variant Prioritization

(2020-1-20) [Poster]
Introduction: Identifying and prioritizing driver mutations that play a main role in cancer development is still a major challenge. Several computational approaches involving machine learning and statistical methods exist for finding these driver mutations, relying on pre-computed pathogenicity scores derived from different tools. We have developed the CANcer Variant Prioritization (CAN-VP) system to identify and prioritize driver mutations. Our tool exploits background knowledge from different ontologies that describe cellular phenotypes, functions, and whole-body physiological phenotypes, and combines it with region-based information as features. We demonstrate the performance of CAN-VP in prioritizing causative driver mutations on a number of synthetic whole exomes from The Cancer Genome Atlas (TCGA), targeting 4 different primary sites. We find that CAN-VP could identify most of the causative driver mutations compared to the existing tools, which shows its capability as a tool for discovering driver mutations. Methods and Materials: Data sources: We relied on two main types of datasets. The first comes from well-known cancer-related databases such as COSMIC [1], CanProVar [2], and IntOGen [3]. The second consists of the real samples included in The Cancer Genome Atlas (TCGA) [4], which involves more than 60 different projects covering 67 primary sites; so far we focus on 4 projects (Sarcoma, Kidney, Lung, and Bladder). Moreover, we used the 579 validated driver mutations in Bailey, Matthew H., et al. [5]. Results and Discussion: 1. Prediction model. 1.1 Model details: We implemented CAN-VP using a fully connected neural network model in Python 3.6, as shown in Figure 4. We used Keras with a TensorFlow backend. We ignored the missing values for all the features being used. We added additional flags for missing values as features. We retrieved gene embeddings and used them as features in the prediction model.
1.2 Training and testing data: We downloaded the COSMIC mutations VCF file on 26th Jul, 2019. It includes 4,788,121 cancer mutations. We also downloaded the DoCM dataset as a VCF file on 18th Nov, 2019. It includes 1364 curated driver mutations. Moreover, we downloaded CanProVar as a fastq file on 18th Nov, 2019. It includes 156,671 driver mutations. Based on that, we determined how many DoCM + CanProVar mutations exist within COSMIC and considered them positives; the remaining mutations were treated as negatives. As Table 1 shows, the negatives (unknown driver somatic mutations) far outnumber the positives (validated driver mutations). 1.3 Prediction performance: We trained our model in Figure 2 using the dataset in Table 1 and tested it on the synthetic datasets. The updated results of CAN-VP compared to the other tools are shown in Table 2. To evaluate the importance of different features in our prediction model, we first tested different combinations of features from CanDrA, which includes 86 from CHASMplus and 3 from Mutation Assessor, plus 3 from UCSC. Moreover, we added the gene embeddings, and the results improved by 3%. Table 3 summarizes the performance for each experiment. Future Work: test CAN-VP on more comprehensive cancer-related datasets; integrate graph-based features into the CAN-VP model.
References: [1] Sally Bamford et al. “The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website”. In: British journal of cancer 91.2 (2004), p. 355. [2] Jing Li, Dexter T Duncan, and Bing Zhang. “CanProVar: a human cancer proteome variation database”. In: Human mutation 31.3 (2010), pp. 219–228. [3] Gunes Gundem et al. “IntOGen: integration and data mining of multidimensional oncogenomic data”. In: Nature methods 7.2 (2010), p. 92. [4] Katarzyna Tomczak, Patrycja Czerwínska, and Maciej Wiznerowicz. “The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge”. In: Contemporary oncology 19.1A (2015), A68. [5] Matthew H Bailey et al. “Comprehensive characterization of cancer driver genes and mutations”. In: Cell 173.2 (2018), pp. 371–385.
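The positive/negative labeling described in section 1.2 of the poster can be sketched as a set-membership test. This is a minimal illustration and not the poster's actual pipeline: the variant key (chromosome, position, reference allele, alternate allele), the function name `label_mutations`, and the toy records are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def label_mutations(cosmic: Iterable[Variant],
                    curated_drivers: Set[Variant]) -> Dict[Variant, int]:
    """Label each COSMIC mutation 1 if it is a curated driver, else 0."""
    return {variant: int(variant in curated_drivers) for variant in cosmic}


# Toy records standing in for the COSMIC, DoCM and CanProVar downloads.
cosmic = [("1", 100, "A", "T"), ("2", 200, "G", "C"), ("3", 300, "C", "G")]
docm = {("1", 100, "A", "T")}
canprovar = {("3", 300, "C", "G"), ("9", 900, "T", "A")}

# Positives are curated drivers (DoCM or CanProVar) that also occur in COSMIC.
labels = label_mutations(cosmic, docm | canprovar)
n_positive = sum(labels.values())
n_negative = len(labels) - n_positive
```

Counting `n_positive` and `n_negative` on the real data would expose the class imbalance that the poster reports in Table 1.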
• #### Combining lexical and context features for automatic ontology extension.

(Journal of biomedical semantics, Springer Science and Business Media LLC, 2020-01-13) [Article]
BACKGROUND: Ontologies are widely used across biology and biomedicine for the annotation of databases. Ontology development is often a manual, time-consuming, and expensive process. Automatic or semi-automatic identification of classes that can be added to an ontology can make ontology development more efficient. RESULTS: We developed a method that uses machine learning and word embeddings to identify words and phrases that are used to refer to an ontology class in biomedical Europe PMC full-text articles. Once labels and synonyms of a class are known, we use machine learning to identify the super-classes of a class. For this purpose, we identify lexical term variants, use word embeddings to capture context information, and rely on automated reasoning over ontologies to generate features, and we use an artificial neural network as classifier. We demonstrate the utility of our approach in identifying terms that refer to diseases in the Human Disease Ontology and to distinguish between different types of diseases. CONCLUSIONS: Our method is capable of discovering labels that refer to a class in an ontology but are not present in an ontology, and it can identify whether a class should be a subclass of some high-level ontology classes. Our approach can therefore be used for the semi-automatic extension and quality control of ontologies. The algorithm, corpora and evaluation datasets are available at https://github.com/bio-ontology-research-group/ontology-extension.
• #### Comparative genomics study reveals Red Sea Bacillus with characteristics associated with potential microbial cell factories (MCFs)

(Scientific Reports, Springer Science and Business Media LLC, 2019-12-17) [Article]
Recent advancements in the use of microbial cells for scalable production of industrial enzymes encourage exploring new environments for efficient microbial cell factories (MCFs). Here, through a comparison study, ten newly sequenced Bacillus species, isolated from the Rabigh Harbor Lagoon on the Red Sea shoreline, were evaluated for their potential use as MCFs. Phylogenetic analysis of 40 representative genomes with phylogenetic relevance, including the ten Red Sea species, showed that the Red Sea species come from several colonization events and are not the result of a single colonization followed by speciation. Moreover, clustering reactions in reconstructed metabolic networks of these Bacillus species revealed that three metabolic clades do not fit the phylogenetic tree, a sign of convergent evolution of the metabolism of these species in response to special environmental adaptation. We further showed that the Red Sea strains Bacillus paralicheniformis (Bac48) and B. halosaccharovorans (Bac94) had twice as many secreted proteins as the model strain B. subtilis 168. Also, Bac94 was enriched with genes associated with the Tat and Sec protein secretion system, and Bac48 has a hybrid PKS/NRPS cluster that is part of a horizontally transferred genomic region. These properties collectively hint towards the potential use of Red Sea Bacillus as efficient protein-secreting microbial hosts, and suggest that this characteristic of these strains may be a consequence of the unique ecological features of the isolation environment.
• #### Formal axioms in biomedical ontologies improve analysis and interpretation of associated data.

(Bioinformatics (Oxford, England), Oxford University Press (OUP), 2019-12-10) [Article]
Over the past years, significant resources have been invested into formalizing biomedical ontologies. Formal axioms in ontologies have been developed and used to detect and ensure ontology consistency, find unsatisfiable classes, improve interoperability, guide ontology extension through the application of axiom-based design patterns, and encode domain background knowledge. The domain knowledge in biomedical ontologies may also have the potential to provide background knowledge for machine learning and predictive modelling. We use ontology-based machine learning methods to evaluate the contribution of formal axioms and ontology meta-data to the prediction of protein-protein interactions and gene-disease associations. We find that the background knowledge provided by the Gene Ontology and other ontologies significantly improves the performance of ontology-based prediction models through provision of domain-specific background knowledge. Furthermore, we find that the labels, synonyms and definitions in ontologies can also provide background knowledge that may be exploited for prediction. The axioms and meta-data of different ontologies contribute to improving data analysis in a context-specific manner. Our results have implications for the further development of formal knowledge bases and ontologies in the life sciences, in particular as machine learning methods are more frequently being applied. Our findings motivate the need for further development, and the systematic, application-driven evaluation and improvement, of formal axioms in ontologies. Availability: https://github.com/bio-ontology-research-group/tsoe.
• #### Ontology-based prediction of cancer driver genes

(Scientific Reports, Springer Science and Business Media LLC, 2019-11-22) [Article]
Identifying and distinguishing cancer driver genes among thousands of candidate mutations remains a major challenge. Accurate identification of driver genes and driver mutations is critical for advancing cancer research and personalizing treatment based on accurate stratification of patients. Due to inter-tumor genetic heterogeneity, many driver mutations within a gene occur at low frequencies, which makes it challenging to distinguish them from non-driver mutations. We have developed a novel method for identifying cancer driver genes. Our approach utilizes multiple complementary types of information, specifically cellular phenotypes, cellular locations, functions, and whole-body physiological phenotypes as features. We demonstrate that our method can accurately identify known cancer driver genes and distinguish between their roles in different types of cancer. In addition to confirming known driver genes, we identify several novel candidate driver genes. We demonstrate the utility of our method by validating its predictions in nasopharyngeal cancer and colorectal cancer using whole exome and whole genome sequencing.
• #### The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens.

(Genome biology, Springer Science and Business Media LLC, 2019-11-21) [Article]
BACKGROUND: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
• #### DeepPheno: Predicting single gene knockout phenotypes

(Cold Spring Harbor Laboratory, 2019-11-13) [Preprint]
Predicting the phenotypes resulting from molecular perturbations is one of the key challenges in genetics. Both forward and reverse genetic screens are employed to identify the molecular mechanisms underlying phenotypes and disease, and these have resulted in a large number of genotype–phenotype associations being available for humans and model organisms. Combined with recent advances in machine learning, it may now be possible to predict human phenotypes resulting from particular molecular aberrations.
• #### Ontology based mining of pathogen–disease associations from literature

(Journal of Biomedical Semantics, Springer Science and Business Media LLC, 2019-09-18) [Article]
Background: Infectious diseases claim millions of lives each year, especially in developing countries. Accurate and rapid identification of causative pathogens plays a key role in the success of treatment. To support research on infectious diseases and mechanisms of infection, there is a need for an open resource on pathogen–disease associations that can be utilized in computational studies. A large number of pathogen–disease associations are available from the literature in unstructured form, and we need automated methods to extract the data. Results: We developed a text mining system designed for extracting pathogen–disease relations from literature. Our approach utilizes background knowledge from an ontology and statistical methods for extracting associations between pathogens and diseases. In total, we extracted 3420 pathogen–disease associations from literature. We integrated our literature-derived associations into a database which links pathogens to their phenotypes for supporting infectious disease research. Conclusions: To the best of our knowledge, we present the first study focusing on extracting pathogen–disease associations from publications. We believe the text-mined data can be utilized as a valuable resource for infectious disease research. All the data is publicly available from https://github.com/bio-ontology-research-group/padimi and through a public SPARQL endpoint from http://patho.phenomebrowser.net/.
• #### EL Embeddings: Geometric construction of models for the description logic EL++

(International Joint Conferences on Artificial Intelligence Organization, 2019-07-28) [Conference Paper]
An embedding is a function that maps entities from one algebraic structure into another while preserving certain characteristics. Embeddings are being used successfully for mapping relational data or text into vector spaces where they can be used for machine learning, similarity search, or similar tasks. We address the problem of finding vector space embeddings for theories in the Description Logic $\mathcal{EL}^{++}$ that are also models of the TBox. To find such embeddings, we define an optimization problem that characterizes the model-theoretic semantics of the operators in $\mathcal{EL}^{++}$ within $\Re^n$, thereby solving the problem of finding an interpretation function for an $\mathcal{EL}^{++}$ theory given a particular domain $\Delta$. Our approach is mainly relevant to large $\mathcal{EL}^{++}$ theories and knowledge bases such as the ontologies and knowledge graphs used in the life sciences. We demonstrate that our method can be used for improved prediction of protein–protein interactions when compared to semantic similarity measures or knowledge graph embeddings.
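To make the idea of turning model-theoretic semantics into an optimization problem more concrete, here is a minimal sketch of one possible loss term for a subclass axiom C ⊑ D. It assumes that each class is interpreted as an n-ball (a center and a radius) in real vector space; that representation, the function name `subsumption_loss`, and the omission of any margin or regularization terms are assumptions of this example and are not stated in the abstract.

```python
import math
from typing import Sequence


def subsumption_loss(center_c: Sequence[float], radius_c: float,
                     center_d: Sequence[float], radius_d: float) -> float:
    """Penalty for the axiom C ⊑ D; zero when ball C lies inside ball D.

    Ball C is contained in ball D exactly when
    distance(center_c, center_d) + radius_c <= radius_d.
    """
    distance = math.dist(center_c, center_d)
    return max(0.0, distance + radius_c - radius_d)


# Ball C sits inside ball D, so the axiom is satisfied and the loss vanishes.
inside = subsumption_loss([0.0, 0.0], 1.0, [0.5, 0.0], 2.0)

# Ball C is far from ball D, so the loss is positive and could be minimized
# by gradient descent over centers and radii.
outside = subsumption_loss([3.0, 4.0], 1.0, [0.0, 0.0], 2.0)
```

An embedding whose total loss over all axioms reaches zero would, under these assumptions, satisfy every subclass axiom geometrically.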
• #### DeepGOPlus: Improved protein function prediction from sequence.

(Bioinformatics (Oxford, England), Oxford University Press (OUP), 2019-07-27) [Article]
MOTIVATION: Protein function prediction is one of the major tasks of bioinformatics that can help in a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence-based features, protein-protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins, thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than that of methods that incorporate multiple features, and predicting protein functions may require a lot of time. RESULTS: We developed a novel method for predicting protein functions from sequence alone which combines a deep convolutional neural network (CNN) model with sequence similarity-based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. AVAILABILITY: http://deepgoplus.bio2vec.net/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
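The Fmax values quoted above are maxima of the F-measure over prediction-score thresholds. The sketch below shows a protein-centric Fmax as it is commonly defined for CAFA-style evaluation: precision is averaged over proteins with at least one prediction at the threshold, recall over all proteins. The function name `fmax`, the data layout, and the toy annotations are assumptions for illustration; the propagation of annotations through the ontology, which real evaluations perform, is omitted.

```python
from typing import Dict, Set


def fmax(true_terms: Dict[str, Set[str]],
         scores: Dict[str, Dict[str, float]],
         steps: int = 100) -> float:
    """Protein-centric Fmax over thresholds 0.00, 0.01, ..., 1.00.

    Assumes every protein in true_terms has at least one true term.
    """
    best = 0.0
    for i in range(steps + 1):
        threshold = i / steps
        precisions = []
        recall_sum = 0.0
        for protein, truth in true_terms.items():
            predicted = {term for term, score in scores.get(protein, {}).items()
                         if score >= threshold}
            hits = len(predicted & truth)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(truth)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(true_terms)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy annotations and prediction scores (hypothetical identifiers).
true_terms = {"P1": {"GO:1", "GO:2"}, "P2": {"GO:3"}}
scores = {
    "P1": {"GO:1": 0.9, "GO:2": 0.4, "GO:4": 0.2},
    "P2": {"GO:3": 0.3, "GO:5": 0.8},
}
result = fmax(true_terms, scores)
```

Because the measure takes the best threshold, it rewards predictors whose scores rank correct terms above incorrect ones, independent of how the scores are calibrated.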
• #### A Machine Learning Based Approach for Similarity Search on Biodiversity Knowledge Graphs

(Biodiversity Information Science and Standards, Pensoft Publishers, 2019-06-13) [Presentation]
Mass biodiversity data from scientific collections will be provided by world-wide digitization efforts like iDigBio in the U.S. and DiSSCo in Europe. This opens up an increasing amount of data on wild type organisms, which enables the building of large biodiversity knowledge graphs comprising, inter alia, sequence, trait and occurrence data. Knowledge graphs model information in the form of entities and their relationships, expressed in good practice as ontology-based annotations. Based on ontological descriptions, semantic similarity analysis makes linking of wild type data to genomic and proteomic data of model organisms possible and thus supports knowledge discovery of crop wild relatives and underutilized species of interest for medicine, breeding and agriculture. Since classical similarity measurements focus on recording differences between character states (aiming to describe disease phenotypes), but not the character states in the sense of trait variations themselves, new methods for similarity search are required. Machine learning algorithms operate on feature vectors, which are numeric representations of data (images, class labels etc.) in n-dimensional vector space. We established a machine learning based workflow for similarity search on biodiversity entities using feature learning on ontologies and an associated RDF knowledge graph to project structured trait data into vector space. Vectors are then compared by applying a similarity function (e.g. cosine similarity) to determine similarity between taxa based on trait semantics. We will present an application example of machine learning on biodiversity knowledge graphs using a pipeline built upon OPA2Vec, a method to generate feature vectors from the logical content of ontologies (Smaili et al. 2018), to successfully cluster plant species for life form and ecotype (e.g. tree vs. perennial plant) on the basis of their annotations with the Flora Phenotype Ontology (Hoehndorf et al. 2016).
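The vector comparison step mentioned above can be illustrated with a short cosine-similarity sketch. The three-dimensional vectors and the taxon names are made up for the example; in the described workflow the vectors would come from feature learning on ontologies, which is not reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up trait embeddings for three hypothetical taxa.
embeddings = {
    "taxon_a": [1.0, 0.0, 1.0],
    "taxon_b": [2.0, 0.0, 2.0],
    "taxon_c": [0.0, 1.0, 0.0],
}

# Rank the other taxa by similarity to the query taxon, most similar first.
query = "taxon_a"
ranked = sorted(
    (name for name in embeddings if name != query),
    key=lambda name: cosine_similarity(embeddings[query], embeddings[name]),
    reverse=True,
)
```

Cosine similarity ignores vector length, so `taxon_b`, which points in the same direction as the query, ranks ahead of the orthogonal `taxon_c`.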
• #### PathoPhenoDB, linking human pathogens to their phenotypes in support of infectious disease research

(Scientific Data, Springer Nature, 2019-06-03) [Article]
Understanding the relationship between the pathophysiology of infectious disease, the biology of the causative agent and the development of therapeutic and diagnostic approaches is dependent on the synthesis of a wide range of types of information. Provision of a comprehensive and integrated disease phenotype knowledgebase has the potential to provide novel and orthogonal sources of information for the understanding of infectious agent pathogenesis, and support for research on disease mechanisms. We have developed PathoPhenoDB, a database containing pathogen-to-phenotype associations. PathoPhenoDB relies on manual curation of pathogen-disease relations, on ontology-based text mining as well as manual curation to associate host disease phenotypes with infectious agents. Using Semantic Web technologies, PathoPhenoDB also links to knowledge about drug resistance mechanisms and drugs used in the treatment of infectious diseases. PathoPhenoDB is accessible at http://patho.phenomebrowser.net/, and the data are freely available through a public SPARQL endpoint.
• #### Semi-supervised entity alignment via knowledge graph embedding with awareness of degree difference

(Association for Computing Machinery, Inc, 2019-05-13) [Conference Paper]
Entity alignment associates entities in different knowledge graphs if they are semantically the same, and has been successfully used in knowledge graph construction and connection. Most of the recent solutions for entity alignment are based on knowledge graph embedding, which maps knowledge entities into a low-dimensional space where entities are connected with the guidance of prior aligned entity pairs. The study in this paper focuses on two important issues that limit the accuracy of current entity alignment solutions: 1) labeled data of previously aligned entity pairs are difficult and expensive to acquire, whereas abundant unlabeled data are not used; and 2) knowledge graph embedding is affected by the difference in entity degree, which makes it challenging to align high-frequency and low-frequency entities. We propose a semi-supervised entity alignment method (SEA) to leverage both labeled entities and the abundant unlabeled entity information for the alignment. Furthermore, we improve the knowledge graph embedding with awareness of the degree difference by performing adversarial training. To evaluate our proposed model, we conduct extensive experiments on real-world datasets. The experimental results show that our model consistently outperforms the state-of-the-art methods with significant improvement in alignment accuracy.
• #### Hyaline Arteriolosclerosis in 30 Strains of Aged Inbred Mice

(Veterinary Pathology, SAGE Publications, 2019-05-06) [Article]
During a screen for vascular phenotypes in aged laboratory mice, a unique discrete phenotype of hyaline arteriolosclerosis of the intertubular arteries and arterioles of the testes was identified in several inbred strains. Lesions were limited to the testes and did not occur as part of any renal, systemic, or pulmonary arteriopathy or vasculitis phenotype. There was no evidence of systemic or pulmonary hypertension, and lesions did not occur in ovaries of females. Frequency was highest in males of the SM/J (27/30, 90%) and WSB/EiJ (19/26, 73%) strains, aged 383 to 847 days. Lesions were sporadically present in males from several other inbred strains at a much lower (<20%) frequency. The risk of testicular hyaline arteriolosclerosis is at least partially underpinned by a genetic predisposition that is not associated with other vascular lesions (including vasculitis), separating out the etiology of this form and site of arteriolosclerosis from other related conditions that often co-occur in other strains of mice and in humans. Because of their genetic uniformity and controlled dietary and environmental conditions, mice are an excellent model to dissect the pathogenesis of human disease conditions. In this study, a discrete genetically driven phenotype of testicular hyaline arteriolosclerosis in aging mice was identified. These observations open the possibility of identifying the underlying genetic variant(s) associated with the predisposition and therefore allowing future interrogation of the pathogenesis of this condition.
• #### Uncovering the dark matter of the metagenome one read at a time

(Access Microbiology, Microbiology Society, 2019-04-24) [Poster]
Contemporary metagenomic annotation methods have proven insufficient in our attempts to better understand the complex environments around us. We call the yet-to-be-annotated part of a metagenome its ‘dark matter’. The Gene Ontology (GO) is a hierarchical vocabulary used to describe gene product function, and a large collection of curated genes with GO annotations already exists. DeepGO utilises deep learning to build models from these curated genes and gene products to predict GO categories for novel proteins. One of the major problems with metagenomic studies today is the process of assembling the environmental DNA sequences into their original genomes. This is difficult, with chimeric metagenomically assembled genomes being common. To avoid this and the computational and time expense, we have modified DeepGO to perform protein function prediction directly from sequence reads with limited protein coding sequence prediction. Three independent models were trained as follows: (1) the first 50 amino acids of a protein were used for training; (2) the last 50 amino acids were used for training; (3) a phasing window of 50 amino acids was used to train across the entirety of a protein sequence. These models were chosen to learn from the different parts of a protein sequence we are likely to capture from only the short unassembled sequence reads. We compared the three models by producing a mock metagenomic community consisting of 6 model bacterial genomes. We evaluated the functions predicted from the unassembled sequence reads and the protein coding sequences predicted from the assembled metagenome.
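The three training schemes described above amount to different ways of cutting 50-residue fragments out of a protein sequence. The sketch below illustrates them with plain string slicing. The function names are hypothetical, and treating the "phasing window" as a sliding window with a configurable step is an assumption, since the abstract does not specify how the window moves.

```python
from typing import List


def first_fragment(sequence: str, size: int = 50) -> str:
    """First `size` amino acids (the whole sequence if it is shorter)."""
    return sequence[:size]


def last_fragment(sequence: str, size: int = 50) -> str:
    """Last `size` amino acids (the whole sequence if it is shorter)."""
    return sequence[-size:]


def window_fragments(sequence: str, size: int = 50, step: int = 1) -> List[str]:
    """All windows of `size` amino acids, moving `step` residues at a time.

    The step size is an assumption; the abstract does not state it.
    """
    if len(sequence) <= size:
        return [sequence]
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, step)]


# A made-up 60-residue sequence built from the 20 standard amino acids.
protein = "ACDEFGHIKLMNPQRSTVWY" * 3

head = first_fragment(protein)
tail = last_fragment(protein)
windows = window_fragments(protein, step=5)
```

Fragments of this length roughly match what a short sequencing read can encode, which is presumably why each model sees only part of a protein during training.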