Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape
Type
Article
KAUST Department
Computational Bioscience Research Center (CBRC)
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Date
2017-07-27
Online Publication Date
2017-07-27
Print Publication Date
2017-11-15
Permanent link to this record
http://hdl.handle.net/10754/625301
Abstract
Motivation: An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem.
Results: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model (HMM) which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these HMMs into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA data sets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods.
Citation
Dai H, Umarov R, Kuwahara H, Li Y, Song L, et al. (2017) Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape. Bioinformatics. Available: http://dx.doi.org/10.1093/bioinformatics/btx480.
Sponsors
The research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/1976-04 and URF/1/3007-01. It was also supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, ONR N00014-15-1-2340, NVIDIA, Intel and Amazon AWS. This research made use of the resources of the computer clusters at KAUST.
Publisher
Oxford University Press (OUP)
Journal
Bioinformatics
Additional Links
https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx480#supplementary-data
Relations
Is Supplemented By:
- [Software] Title: ramzan1990/sequence2vec:. Publication Date: 2017-03-20. github: ramzan1990/sequence2vec. Handle: 10754/666987
DOI
10.1093/bioinformatics/btx480
Except where otherwise noted, this item's license is described as: This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com