Analysis of gene and protein name synonyms in Entrez Gene and UniProtKB resources
Type
Thesis
Authors
Arkasosy, Basil
Advisors
Bajic, Vladimir B.
Committee Members
Moshkov, Mikhail
Zhang, Xiangliang

Program
Computer Science
Date
2013-05-11
Permanent link to this record
http://hdl.handle.net/10754/293325
Abstract
Ambiguity in text is a well-known problem: words can carry several meanings and can therefore be read and interpreted differently. The same holds in the biological literature, where names of biological concepts such as genes and proteins can be ambiguous, referring in some cases to more than one gene or more than one protein, and in others to both a gene and a protein. Public biological databases provide useful information about genes and proteins, including their names. In this study, we made a thorough analysis of gene and protein nomenclature in two data sources and for six species. We developed an automated process that parses, extracts, processes and stores information available in two major biological databases: Entrez Gene and UniProtKB. We analysed gene and protein synonyms, their types and frequencies, and their ambiguities within a species, between data sources and across species. We found that at least 40% of the cross-species ambiguities are caused by names that are already ambiguous within a species. Of the six species we analysed (Homo sapiens, Mus musculus, Arabidopsis thaliana, Oryza sativa, Bacillus subtilis and Pseudomonas fluorescens), rice (Oryza sativa) has the best naming model in the Entrez Gene database, with low ambiguity between data sources and across species.
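The two kinds of ambiguity named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual pipeline. The record layout (species, entry identifier, name), the function names, the toy identifiers and the case-insensitive comparison are all assumptions made for illustration.

```python
from collections import defaultdict


def within_species_ambiguous(records):
    """Map each species to the names shared by two or more of its entries."""
    index = defaultdict(lambda: defaultdict(set))
    for species, entry_id, name in records:
        # Assumption: names are compared case-insensitively.
        index[species][name.lower()].add(entry_id)
    return {
        species: {n: ids for n, ids in names.items() if len(ids) > 1}
        for species, names in index.items()
    }


def cross_species_ambiguous(records):
    """Map each name used in two or more species to those species."""
    seen = defaultdict(set)
    for species, _entry_id, name in records:
        seen[name.lower()].add(species)
    return {n: sp for n, sp in seen.items() if len(sp) > 1}


# Hypothetical toy records: (species, entry identifier, name or synonym).
RECORDS = [
    ("human", "G1", "ABC1"),
    ("human", "G2", "ABC1"),
    ("mouse", "G3", "Abc1"),
    ("human", "G4", "XYZ"),
    ("rice", "G5", "XYZ"),
    ("rice", "G6", "OsQ"),
]

within = within_species_ambiguous(RECORDS)
across = cross_species_ambiguous(RECORDS)
```

In this toy data the name "ABC1" is ambiguous both within one species and across species, which is the kind of overlap the 40% figure in the abstract refers to.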
DOI
10.25781/KAUST-N0Q62