Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics

Handle URI:
http://hdl.handle.net/10754/623317
Title:
Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics
Authors:
Magana-Mora, Arturo ( 0000-0001-8696-7068 )
Abstract:
Machine-learning (ML) techniques have been widely applied to solve different problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. Frequently, these models perform poorly or are computationally infeasible. This study presents a set of novel computational methods and focuses on the application of genetic algorithms (GAs) for the simplification and optimization of ML models and their applications to biological problems. The dissertation addresses the following three challenges. The first is to develop a generalizable classification methodology able to systematically derive competitive models despite the complexity and nature of the data. Although several algorithms for the induction of classification models have been proposed, the algorithms are data-dependent. Consequently, we developed OmniGA, a novel and generalizable framework that uses different classification models in a tree-like decision structure, along with a parallel GA for the optimization of the OmniGA structure. Results show that OmniGA consistently outperformed existing commonly used classification models. The second challenge is the prediction of translation initiation sites (TIS) in plant genomic DNA. We performed a statistical analysis of the genomic DNA and proposed a new set of discriminant features for this problem. We developed a wrapper method based on GAs for selecting an optimal feature subset, which, in conjunction with a classification model, produced the most accurate framework for the recognition of TIS in plants. Results demonstrate that despite the evolutionary distance between different plants, our approach successfully identified conserved genomic elements that may serve as the starting point for the development of a generic model for prediction of TIS in eukaryotic organisms. Finally, the third challenge is the accurate prediction of polyadenylation signals in human genomic DNA. To achieve this, we analyzed genomic DNA sequences for the 12 most frequent polyadenylation signal variants and proposed a new set of features that may contribute to the understanding of the polyadenylation process. We derived Omni-PolyA, a model and tool based on OmniGA for the prediction of the polyadenylation signals. Results show that Omni-PolyA significantly reduced the average classification error rate compared to the state-of-the-art results.
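Illustrative note (not part of the repository record): the GA-based wrapper for feature selection mentioned in the abstract can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the dissertation's implementation. The function name `ga_feature_selection`, its parameters, and the operators chosen here (tournament selection, one-point crossover, bit-flip mutation, elitism) are hypothetical. In a real wrapper the fitness would be the validated score of a classifier trained on the selected features; `toy_fitness` is only a stand-in so the example runs on its own.

```python
import random


def ga_feature_selection(fitness, n_features, pop_size=20, generations=30,
                         crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness` (hypothetical API)."""
    rng = random.Random(seed)
    # Each individual is a bit mask: 1 keeps the feature, 0 drops it.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament(scored):
        # Binary tournament: draw two individuals, keep the fitter one.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]
        next_pop = [best[:]]  # elitism: the best mask so far always survives
        while len(next_pop) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to every gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=fitness)
    return best


# Assumption: a toy stand-in for "classifier score on the selected features".
INFORMATIVE = {0, 3, 7}  # hypothetical indices of the useful features


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)  # reward useful features, penalize subset size


best_mask = ga_feature_selection(toy_fitness, n_features=10)
print([i for i, bit in enumerate(best_mask) if bit])
```

Replacing `toy_fitness` with a function that trains and validates a classifier on the masked feature columns would turn this sketch into a wrapper in the usual sense, at the cost of one model fit per evaluated mask.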
Advisors:
Bajic, Vladimir B. ( 0000-0001-5435-4750 )
Committee Member:
Gojobori, Takashi ( 0000-0001-7850-1743 ) ; Moshkov, Mikhail ( 0000-0003-0085-9483 ) ; Wong, Limsoon
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Program:
Computer Science
Issue Date:
29-Apr-2017
Type:
Dissertation
Appears in Collections:
Dissertations

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | Bajic, Vladimir B. | en
dc.contributor.author | Magana-Mora, Arturo | en
dc.date.accessioned | 2017-05-04T06:23:53Z | -
dc.date.available | 2017-05-04T06:23:53Z | -
dc.date.issued | 2017-04-29 | -
dc.identifier.uri | http://hdl.handle.net/10754/623317 | -
dc.description.abstract | Machine-learning (ML) techniques have been widely applied to solve different problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. Frequently, these models perform poorly or are computationally infeasible. This study presents a set of novel computational methods and focuses on the application of genetic algorithms (GAs) for the simplification and optimization of ML models and their applications to biological problems. The dissertation addresses the following three challenges. The first is to develop a generalizable classification methodology able to systematically derive competitive models despite the complexity and nature of the data. Although several algorithms for the induction of classification models have been proposed, the algorithms are data-dependent. Consequently, we developed OmniGA, a novel and generalizable framework that uses different classification models in a tree-like decision structure, along with a parallel GA for the optimization of the OmniGA structure. Results show that OmniGA consistently outperformed existing commonly used classification models. The second challenge is the prediction of translation initiation sites (TIS) in plant genomic DNA. We performed a statistical analysis of the genomic DNA and proposed a new set of discriminant features for this problem. We developed a wrapper method based on GAs for selecting an optimal feature subset, which, in conjunction with a classification model, produced the most accurate framework for the recognition of TIS in plants. Results demonstrate that despite the evolutionary distance between different plants, our approach successfully identified conserved genomic elements that may serve as the starting point for the development of a generic model for prediction of TIS in eukaryotic organisms. Finally, the third challenge is the accurate prediction of polyadenylation signals in human genomic DNA. To achieve this, we analyzed genomic DNA sequences for the 12 most frequent polyadenylation signal variants and proposed a new set of features that may contribute to the understanding of the polyadenylation process. We derived Omni-PolyA, a model and tool based on OmniGA for the prediction of the polyadenylation signals. Results show that Omni-PolyA significantly reduced the average classification error rate compared to the state-of-the-art results. | en
dc.language.iso | en | en
dc.subject | omnivariate decision trees | en
dc.subject | Machine Learning | en
dc.subject | polyadenylation signals | en
dc.subject | Bioinformatics | en
dc.subject | translation initiation sites | en
dc.subject | Data Mining | en
dc.title | Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics | en
dc.type | Dissertation | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
thesis.degree.grantor | King Abdullah University of Science and Technology | en_GB
dc.contributor.committeemember | Gojobori, Takashi | en
dc.contributor.committeemember | Moshkov, Mikhail | en
dc.contributor.committeemember | Wong, Limsoon | en
thesis.degree.discipline | Computer Science | en
thesis.degree.name | Doctor of Philosophy | en
dc.person.id | 101711 | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.