EnsembleGASVR: A novel ensemble method for classifying missense single nucleotide polymorphisms

Handle URI:
http://hdl.handle.net/10754/563511
Title:
EnsembleGASVR: A novel ensemble method for classifying missense single nucleotide polymorphisms
Authors:
Rapakoulia, Trisevgeni; Theofilatos, Konstantinos A.; Kleftogiannis, Dimitrios A. (0000-0003-1086-821X); Likothanasis, Spiridon D.; Tsakalidis, Athanasios K.; Mavroudi, Seferina P.
Abstract:
Motivation: Single nucleotide polymorphisms (SNPs) are considered the most frequently occurring DNA sequence variations. Several computational methods have been proposed for the classification of missense SNPs to neutral and disease associated. However, existing computational approaches fail to select relevant features by choosing them arbitrarily without sufficient documentation. Moreover, they are limited to the problem of missing values, imbalance between the learning datasets and most of them do not support their predictions with confidence scores. Results: To overcome these limitations, a novel ensemble computational methodology is proposed. EnsembleGASVR facilitates a two-step algorithm, which in its first step applies a novel evolutionary embedded algorithm to locate close to optimal Support Vector Regression models. In its second step, these models are combined to extract a universal predictor, which is less prone to overfitting issues, systematizes the rebalancing of the learning sets and uses an internal approach for solving the missing values problem without loss of information. Confidence scores support all the predictions and the model becomes tunable by modifying the classification thresholds. An extensive study was performed for collecting the most relevant features for the problem of classifying SNPs, and a superset of 88 features was constructed. Experimental results show that the proposed framework outperforms well-known algorithms in terms of classification performance in the examined datasets. Finally, the proposed algorithmic framework was able to uncover the significant role of certain features such as the solvent accessibility feature, and the top-scored predictions were further validated by linking them with disease phenotypes. © The Author 2014.
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program
Publisher:
Oxford University Press (OUP)
Journal:
Bioinformatics
Issue Date:
26-Apr-2014
DOI:
10.1093/bioinformatics/btu297
Type:
Article
ISSN:
1367-4803
Sponsors:
Funding: Trisevgeni Rapakoulia and Dimitrios Kleftogiannis were supported by the King Abdullah University of Science and Technology (KAUST).
Appears in Collections:
Articles; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
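
The second step described in the abstract (combining several regression models into one predictor, reporting a confidence score, and tuning the classification threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the three stand-in models, the simple averaging rule, the sign convention for the labels, and the distance-to-threshold confidence are all assumptions made for illustration, and the record itself does not specify any of them.

```python
from statistics import mean

# Hypothetical stand-ins for trained regression models. In the paper these
# would be Support Vector Regression models found by the evolutionary search;
# here they are plain functions so the sketch stays self-contained.
models = [
    lambda x: 0.8 * x[0] - 0.2,
    lambda x: 0.5 * x[0] + 0.1 * x[1],
    lambda x: x[1] - 0.3,
]


def ensemble_score(models, features):
    """Average the regression outputs of all ensemble members (assumed rule)."""
    return mean(model(features) for model in models)


def classify(score, threshold=0.0):
    """Turn a regression score into a label plus a confidence value.

    Assumption: scores at or above the threshold mean 'disease', scores
    below it mean 'neutral', and confidence is the distance to the threshold.
    """
    label = "disease" if score >= threshold else "neutral"
    confidence = abs(score - threshold)
    return label, confidence


features = (1.0, 0.5)  # made-up feature vector
score = ensemble_score(models, features)
print(classify(score))                 # default threshold
print(classify(score, threshold=0.5))  # stricter threshold
```

Raising the threshold makes the sketch more conservative about calling a variant disease associated, which is the kind of tuning the abstract refers to when it says the model becomes tunable by modifying the classification thresholds.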

Full metadata record

DC Field | Value | Language
dc.contributor.author | Rapakoulia, Trisevgeni | en
dc.contributor.author | Theofilatos, Konstantinos A. | en
dc.contributor.author | Kleftogiannis, Dimitrios A. | en
dc.contributor.author | Likothanasis, Spiridon D. | en
dc.contributor.author | Tsakalidis, Athanasios K. | en
dc.contributor.author | Mavroudi, Seferina P. | en
dc.date.accessioned | 2015-08-03T11:53:19Z | en
dc.date.available | 2015-08-03T11:53:19Z | en
dc.date.issued | 2014-04-26 | en
dc.identifier.issn | 1367-4803 | en
dc.identifier.doi | 10.1093/bioinformatics/btu297 | en
dc.identifier.uri | http://hdl.handle.net/10754/563511 | en
dc.description.abstract | Motivation: Single nucleotide polymorphisms (SNPs) are considered the most frequently occurring DNA sequence variations. Several computational methods have been proposed for the classification of missense SNPs to neutral and disease associated. However, existing computational approaches fail to select relevant features by choosing them arbitrarily without sufficient documentation. Moreover, they are limited to the problem of missing values, imbalance between the learning datasets and most of them do not support their predictions with confidence scores. Results: To overcome these limitations, a novel ensemble computational methodology is proposed. EnsembleGASVR facilitates a two-step algorithm, which in its first step applies a novel evolutionary embedded algorithm to locate close to optimal Support Vector Regression models. In its second step, these models are combined to extract a universal predictor, which is less prone to overfitting issues, systematizes the rebalancing of the learning sets and uses an internal approach for solving the missing values problem without loss of information. Confidence scores support all the predictions and the model becomes tunable by modifying the classification thresholds. An extensive study was performed for collecting the most relevant features for the problem of classifying SNPs, and a superset of 88 features was constructed. Experimental results show that the proposed framework outperforms well-known algorithms in terms of classification performance in the examined datasets. Finally, the proposed algorithmic framework was able to uncover the significant role of certain features such as the solvent accessibility feature, and the top-scored predictions were further validated by linking them with disease phenotypes. © The Author 2014. | en
dc.description.sponsorship | Funding: Trisevgeni Rapakoulia and Dimitrios Kleftogiannis were supported by the King Abdullah University of Science and Technology (KAUST). | en
dc.publisher | Oxford University Press (OUP) | en
dc.title | EnsembleGASVR: A novel ensemble method for classifying missense single nucleotide polymorphisms | en
dc.type | Article | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science Program | en
dc.identifier.journal | Bioinformatics | en
dc.contributor.institution | Computer Engineering and Informatics Department, University of Patras, Building B, Patras, 26504, Greece | en
dc.contributor.institution | Department of Social Work, School of Health Sciences, Technological Institute of Western Greece, Patras, Greece | en
kaust.author | Kleftogiannis, Dimitrios A. | en
kaust.author | Rapakoulia, Trisevgeni | en