Poly(A) motif prediction using spectral latent features from human DNA sequences

Handle URI:
http://hdl.handle.net/10754/325438
Title:
Poly(A) motif prediction using spectral latent features from human DNA sequences
Authors:
Xie, Bo; Jankovic, Boris R.; Bajic, Vladimir B. (0000-0001-5435-4750); Song, Le; Gao, Xin (0000-0002-7108-3574)
Abstract:
Motivation: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.
Results: We propose a novel machine-learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we used hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance. We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14 740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26, 15 and 35%, respectively. Meanwhile, our method makes ~30% fewer error predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before. © The Author 2013.
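Illustrative note: the abstract refers to candidate poly(A) motifs and the nucleotide sequences surrounding them. The sketch below shows, under stated assumptions, how such candidates and their flanking regions might be pulled from a DNA string. It is not the authors' spectral hidden Markov model method. The two hexamers, the function names, the flank length and the k-mer counting are all hypothetical choices made for illustration; the record itself does not list the 12 motif variants or the window size used in the paper.

```python
from collections import Counter

# Assumption: two well-known human poly(A) signal hexamers.
# The paper covers 12 variants, which this record does not enumerate.
MOTIFS = ("AATAAA", "ATTAAA")
MOTIF_LEN = 6


def candidate_windows(seq, flank=10):
    """Yield (position, motif, upstream, downstream) for every candidate
    motif that has a full flank of `flank` nucleotides on both sides."""
    seq = seq.upper()
    last_start = len(seq) - MOTIF_LEN - flank
    for i in range(flank, last_start + 1):
        hexamer = seq[i:i + MOTIF_LEN]
        if hexamer in MOTIFS:
            upstream = seq[i - flank:i]
            downstream = seq[i + MOTIF_LEN:i + MOTIF_LEN + flank]
            yield i, hexamer, upstream, downstream


def kmer_counts(region, k=2):
    """Count overlapping k-mers in a flanking region (a simple stand-in
    for the richer features discussed in the abstract)."""
    return Counter(region[j:j + k] for j in range(len(region) - k + 1))


# Toy sequence: 10 nt upstream, one candidate motif, 10 nt downstream.
seq = "GCGCGCGCGC" + "AATAAA" + "TTTTGTGTGT"
hits = list(candidate_windows(seq, flank=10))
for pos, motif, up, down in hits:
    print(pos, motif, dict(kmer_counts(down)))
```

A true/false classifier would then be trained on features derived from these windows; in the paper those features come from a spectral algorithm over hidden Markov models rather than from raw k-mer counts.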
KAUST Department:
Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Citation:
Xie B, Jankovic BR, Bajic VB, Song L, Gao X (2013) Poly(A) motif prediction using spectral latent features from human DNA sequences. Bioinformatics 29: i316-i325. doi:10.1093/bioinformatics/btt218.
Publisher:
Oxford University Press (OUP)
Journal:
Bioinformatics
Issue Date:
21-Jun-2013
DOI:
10.1093/bioinformatics/btt218
PubMed ID:
23813000
PubMed Central ID:
PMC3694652
Type:
Article
ISSN:
13674803
Appears in Collections:
Articles; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Xie, Bo | en
dc.contributor.author | Jankovic, Boris R. | en
dc.contributor.author | Bajic, Vladimir B. | en
dc.contributor.author | Song, Le | en
dc.contributor.author | Gao, Xin | en
dc.date.accessioned | 2014-08-27T09:51:21Z | -
dc.date.available | 2014-08-27T09:51:21Z | -
dc.date.issued | 2013-6-21 | en
dc.identifier.citation | Xie B, Jankovic BR, Bajic VB, Song L, Gao X (2013) Poly(A) motif prediction using spectral latent features from human DNA sequences. Bioinformatics 29: i316-i325. doi:10.1093/bioinformatics/btt218. | en
dc.identifier.issn | 13674803 | en
dc.identifier.pmid | 23813000 | en
dc.identifier.doi | 10.1093/bioinformatics/btt218 | en
dc.identifier.uri | http://hdl.handle.net/10754/325438 | en
dc.description.abstract | Motivation: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge. Results: We propose a novel machine-learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we used hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance. We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14 740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26, 15 and 35%, respectively. Meanwhile, our method makes ~30% fewer error predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before. © The Author 2013. | en
dc.language.iso | en | en
dc.publisher | Oxford University Press (OUP) | en
dc.rights | This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com | en
dc.rights.uri | http://creativecommons.org/licenses/by-nc/3.0 | en
dc.subject | DNA | en
dc.subject | polyadenylic acid | en
dc.subject | 3' untranslated region | en
dc.subject | algorithm | en
dc.subject | article | en
dc.subject | artificial intelligence | en
dc.subject | chemistry | en
dc.subject | computer program | en
dc.subject | DNA sequence | en
dc.subject | genetics | en
dc.subject | human | en
dc.subject | methodology | en
dc.subject | nucleotide motif | en
dc.subject | polyadenylation | en
dc.subject | probability | en
dc.subject | support vector machine | en
dc.subject | 3' Untranslated Regions | en
dc.subject | Algorithms | en
dc.subject | Artificial Intelligence | en
dc.subject | DNA | en
dc.subject | Humans | en
dc.subject | Markov Chains | en
dc.subject | Nucleotide Motifs | en
dc.subject | Poly A | en
dc.subject | Polyadenylation | en
dc.subject | Sequence Analysis, DNA | en
dc.subject | Software | en
dc.subject | Support Vector Machines | en
dc.title | Poly(A) motif prediction using spectral latent features from human DNA sequences | en
dc.type | Article | en
dc.contributor.department | Computational Bioscience Research Center (CBRC) | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.identifier.journal | Bioinformatics | en
dc.identifier.pmcid | PMC3694652 | en
dc.eprint.version | Publisher's Version/PDF | en
dc.contributor.institution | College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, United States | en
dc.contributor.affiliation | King Abdullah University of Science and Technology (KAUST) | en
kaust.author | Bajic, Vladimir B. | en
kaust.author | Gao, Xin | en
kaust.author | Jankovic, Boris R. | en

This item is licensed under a Creative Commons License