Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals

Handle URI:
http://hdl.handle.net/10754/336791
Title:
Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals
Authors:
Mulamba, Pierre Abraham ( 0000-0002-1133-3973 )
Abstract:
Finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as identification of transcription start sites, translation initiation sites, splice sites, poly(A) signals, etc. Each sub-problem deals with a particular type of genomic signal, and various computational methods are used to solve each sub-problem. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle: building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on statistical analysis of the basic nucleotide sequence string composition. Our study is based on a novel approach to predicting genomic signals, in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The physicochemical properties are compressed using a principal component analysis transformation. Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems.
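Illustrative sketch (not taken from the dissertation): the abstract describes compressing physicochemical properties with principal component analysis and combining them with compositional properties of the sequence. The Python below shows one way such a feature vector might be assembled, under loud assumptions: the property table is random placeholder data rather than measured values, properties are assigned per dinucleotide, the names pca_compress and encode are hypothetical, and the topological properties and the support vector machine step are omitted.

```python
import itertools
import numpy as np

# All 16 dinucleotides in a fixed order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in itertools.product("ACGT", repeat=2)]


def pca_compress(table, n_components):
    """Project the rows of `table` onto its top principal components."""
    centered = table - table.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def encode(seq, compressed, index):
    """Concatenate a per-position compressed profile with base composition."""
    seq = seq.upper()
    # One compressed property vector per overlapping dinucleotide.
    profile = np.concatenate(
        [compressed[index[seq[i:i + 2]]] for i in range(len(seq) - 1)]
    )
    # Compositional part: relative frequency of each base.
    composition = np.array([seq.count(b) / len(seq) for b in "ACGT"])
    return np.concatenate([profile, composition])


# ASSUMPTION: placeholder table of 16 dinucleotides x 6 made-up properties.
rng = np.random.default_rng(0)
property_table = rng.normal(size=(16, 6))

compressed = pca_compress(property_table, n_components=2)
index = {d: i for i, d in enumerate(DINUCLEOTIDES)}

# A short window around a hypothetical donor site (GT at the centre).
features = encode("CAGGTAAGT", compressed, index)
```

For a window of 9 bases and 2 retained components, the vector has 8 x 2 profile values plus 4 composition values. Vectors of this kind, built for known true and false sites, are what a support vector machine classifier would then be trained on.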
Advisors:
Bajic, Vladimir B. ( 0000-0001-5435-4750 )
Committee Member:
Moshkov, Mikhail ( 0000-0003-0085-9483 ) ; Arold, Stefan ( 0000-0001-5278-0668 ) ; Christoffels, Alan
KAUST Department:
Biological and Environmental Sciences and Engineering (BESE) Division
Program:
Bioscience
Issue Date:
Dec-2014
Type:
Dissertation
Appears in Collections:
Bioscience Program; Dissertations; Biological and Environmental Sciences and Engineering (BESE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | Bajic, Vladimir B. | en
dc.contributor.author | Mulamba, Pierre Abraham | en
dc.date.accessioned | 2014-12-07T13:52:27Z | -
dc.date.available | 2014-12-07T13:52:27Z | -
dc.date.issued | 2014-12 | en
dc.identifier.uri | http://hdl.handle.net/10754/336791 | en
dc.description.abstract | Finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as identification of transcription start sites, translation initiation sites, splice sites, poly(A) signals, etc. Each sub-problem deals with a particular type of genomic signal, and various computational methods are used to solve each sub-problem. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle: building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on statistical analysis of the basic nucleotide sequence string composition. Our study is based on a novel approach to predicting genomic signals, in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The physicochemical properties are compressed using a principal component analysis transformation. Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems. | en
dc.language.iso | en | en
dc.subject | Physicochemical | en
dc.subject | Compositional | en
dc.subject | Characteristics | en
dc.subject | Prediction | en
dc.subject | Genomic | en
dc.subject | Signals | en
dc.title | Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals | en
dc.type | Dissertation | en
dc.contributor.department | Biological and Environmental Sciences and Engineering (BESE) Division | en
thesis.degree.grantor | King Abdullah University of Science and Technology | en_GB
dc.contributor.committeemember | Moshkov, Mikhail | en
dc.contributor.committeemember | Arold, Stefan | en
dc.contributor.committeemember | Christoffels, Alan | en
thesis.degree.discipline | Bioscience | en
thesis.degree.name | Doctor of Philosophy | en
dc.person.id | 102013 | en