Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape

Handle URI:
http://hdl.handle.net/10754/625301
Title:
Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape
Authors:
Dai, Hanjun; Umarov, Ramzan; Kuwahara, Hiroyuki; Li, Yu; Song, Le; Gao, Xin (0000-0002-7108-3574)
Abstract:
Motivation: An accurate characterization of the transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, accurate characterization of the TF-DNA binding affinity landscape remains a challenging problem. Results: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model (HMM) that captures both position-specific information and long-range dependencies in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these HMMs into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strengths of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA data sets that were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods.
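The abstract describes the method only at a high level, so the following is a minimal illustrative sketch of the general idea of a message passing-like embedding over a sequence, not the authors' Sequence2Vec implementation. It assumes a simple chain-structured model (the abstract's HMM also captures long-range dependencies, which this sketch omits), uses random untrained weights, and every function and parameter name (`embed`, `W_x`, `W_m`, `w_out`, `n_rounds`) is hypothetical.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    idx = [BASES.index(b) for b in seq.upper()]
    return np.eye(4)[idx]

def init_params(dim=8, seed=0):
    """Random, untrained weights; names are hypothetical, not from the paper."""
    rng = np.random.default_rng(seed)
    return {
        "W_x": rng.normal(scale=0.5, size=(4, dim)),    # observed base -> feature
        "W_m": rng.normal(scale=0.5, size=(dim, dim)),  # neighbour message -> feature
        "w_out": rng.normal(scale=0.5, size=dim),       # embedding -> scalar score
    }

def embed(seq, params, n_rounds=3):
    """Map a sequence of any length to a fixed-size vector.

    Assumption: a plain chain in which each position exchanges messages
    only with its immediate left and right neighbours.
    """
    x = one_hot(seq) @ params["W_x"]   # (L, dim) per-position input features
    fwd = np.zeros_like(x)             # fwd[i]: message arriving at i from i-1
    bwd = np.zeros_like(x)             # bwd[i]: message arriving at i from i+1
    for _ in range(n_rounds):
        new_fwd = np.zeros_like(x)
        new_bwd = np.zeros_like(x)
        # Position i-1 sends its input plus what it received from i-2.
        new_fwd[1:] = np.tanh(x[:-1] + fwd[:-1] @ params["W_m"])
        # Position i+1 sends its input plus what it received from i+2.
        new_bwd[:-1] = np.tanh(x[1:] + bwd[1:] @ params["W_m"])
        fwd, bwd = new_fwd, new_bwd
    # Combine each position's input with both incoming messages, then pool.
    node = np.tanh(x + (fwd + bwd) @ params["W_m"])
    return node.sum(axis=0)

def predict_affinity(seq, params):
    """Linear readout on the embedding (stand-in for a trained predictor)."""
    return float(embed(seq, params) @ params["w_out"])

params = init_params()
print(embed("ACGTACGT", params).shape)        # embedding size is independent of length
print(predict_affinity("ACGTACGTTT", params)) # arbitrary value: weights are untrained
```

In a real setting the weights would be learned end to end from measured binding affinities; the sketch only shows how local messages can turn variable-length sequences into vectors in one common feature space.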
KAUST Department:
Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Citation:
Dai H, Umarov R, Kuwahara H, Li Y, Song L, et al. (2017) Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape. Bioinformatics. Available: http://dx.doi.org/10.1093/bioinformatics/btx480.
Publisher:
Oxford University Press (OUP)
Journal:
Bioinformatics
Issue Date:
26-Jul-2017
DOI:
10.1093/bioinformatics/btx480
Type:
Article
ISSN:
1367-4803; 1460-2059
Sponsors:
The research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/1976-04 and URF/1/3007-01. It was also supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, ONR N00014-15-1-2340, NVIDIA, Intel and Amazon AWS. This research made use of the resources of the computer clusters at KAUST.
Additional Links:
https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx480#supplementary-data
Appears in Collections:
Articles; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Dai, Hanjun | en
dc.contributor.author | Umarov, Ramzan | en
dc.contributor.author | Kuwahara, Hiroyuki | en
dc.contributor.author | Li, Yu | en
dc.contributor.author | Song, Le | en
dc.contributor.author | Gao, Xin | en
dc.date.accessioned | 2017-08-07T10:52:01Z | -
dc.date.available | 2017-08-07T10:52:01Z | -
dc.date.issued | 2017-07-26 | en
dc.identifier.citation | Dai H, Umarov R, Kuwahara H, Li Y, Song L, et al. (2017) Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape. Bioinformatics. Available: http://dx.doi.org/10.1093/bioinformatics/btx480. | en
dc.identifier.issn | 1367-4803 | en
dc.identifier.issn | 1460-2059 | en
dc.identifier.doi | 10.1093/bioinformatics/btx480 | en
dc.identifier.uri | http://hdl.handle.net/10754/625301 | -
dc.description.abstract | Motivation: An accurate characterization of the transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, accurate characterization of the TF-DNA binding affinity landscape remains a challenging problem. Results: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model (HMM) that captures both position-specific information and long-range dependencies in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these HMMs into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strengths of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA data sets that were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods. | en
dc.description.sponsorship | The research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/1976-04 and URF/1/3007-01. It was also supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, ONR N00014-15-1-2340, NVIDIA, Intel and Amazon AWS. This research made use of the resources of the computer clusters at KAUST. | en
dc.publisher | Oxford University Press (OUP) | en
dc.relation.url | https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx480#supplementary-data | en
dc.rights | This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com | en
dc.rights.uri | http://creativecommons.org/licenses/by-nc/4.0/ | en
dc.title | Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape | en
dc.type | Article | en
dc.contributor.department | Computational Bioscience Research Center (CBRC) | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.identifier.journal | Bioinformatics | en
dc.eprint.version | Publisher's Version/PDF | en
dc.contributor.institution | College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA. | en
kaust.author | Umarov, Ramzan | en
kaust.author | Kuwahara, Hiroyuki | en
kaust.author | Li, Yu | en
kaust.author | Gao, Xin | en