Analysis of gene and protein name synonyms in Entrez Gene and UniProtKB resources
Advisors: Bajic, Vladimir B.
Embargo End Date: 2014-05-11
Permanent link to this record: http://hdl.handle.net/10754/293325
Access Restrictions: At the time of archiving, the student author of this thesis opted to temporarily restrict access to it. The full text of this thesis became available to the public after the expiration of the embargo on 2014-05-11.
Abstract: Ambiguity in texts is a well-known problem: words can carry several meanings and hence can be read and interpreted differently. This is also true in the biological literature; names of biological concepts, such as genes and proteins, may be ambiguous, referring in some cases to more than one gene or protein, and in others to both a gene and a protein at the same time. Public biological databases provide useful information about genes and proteins, including their names. In this study, we made a thorough analysis of the nomenclatures of genes and proteins in two data sources and for six different species. We developed an automated process that parses, extracts, processes and stores information available in two major biological databases: Entrez Gene and UniProtKB. We analysed gene and protein synonyms, their types, their frequencies, and the ambiguities within a species, between data sources, and across species. We found that at least 40% of the cross-species ambiguities are caused by names that are already ambiguous within a species. Our study shows that, of the six species we analysed (Homo sapiens, Mus musculus, Arabidopsis thaliana, Oryza sativa, Bacillus subtilis and Pseudomonas fluorescens), rice (Oryza sativa) has the best naming model in the Entrez Gene database, with low ambiguity between data sources and across species.
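The two kinds of ambiguity mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's pipeline: the records, identifiers, and function names below are hypothetical, and the sketch assumes that a name counts as ambiguous within a species when it maps to more than one identifier there, and as ambiguous across species when it appears in more than one species. Names are compared as exact strings, which is also an assumption.

```python
from collections import defaultdict

# Hypothetical toy records: (species, identifier, name or synonym).
# Real input would come from parsed Entrez Gene / UniProtKB data.
RECORDS = [
    ("human", "G1", "ABC1"),
    ("human", "G2", "ABC1"),
    ("human", "G3", "XYZ"),
    ("mouse", "G4", "ABC1"),
    ("mouse", "G5", "XYZ"),
    ("rice", "G6", "QRS"),
]


def build_index(records):
    """Map each name to {species: set of identifiers using that name}."""
    index = defaultdict(lambda: defaultdict(set))
    for species, identifier, name in records:
        index[name][species].add(identifier)
    return index


def within_species_ambiguous(index):
    """Names that refer to more than one identifier in at least one species."""
    return {
        name
        for name, by_species in index.items()
        if any(len(ids) > 1 for ids in by_species.values())
    }


def cross_species_ambiguous(index):
    """Names that occur in more than one species."""
    return {name for name, by_species in index.items() if len(by_species) > 1}


index = build_index(RECORDS)
within = within_species_ambiguous(index)
cross = cross_species_ambiguous(index)

# Share of cross-species ambiguous names that are already ambiguous
# within some species (the kind of ratio the abstract reports).
share = len(within & cross) / len(cross)
print(sorted(within), sorted(cross), share)
```

In this toy data, "ABC1" is ambiguous both within human and across species, while "XYZ" is ambiguous only across species, so half of the cross-species ambiguous names are also ambiguous within a species.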