Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape
KAUST Department: Computational Bioscience Research Center (CBRC)
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Online Publication Date: 2017-07-27
Print Publication Date: 2017-11-15
Permanent link to this record: http://hdl.handle.net/10754/625301
Abstract:
Motivation: An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem.
Results: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model (HMM) which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these HMMs into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA data sets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods.
Citation: Dai H, Umarov R, Kuwahara H, Li Y, Song L, et al. (2017) Sequence2Vec: A novel embedding approach for modeling transcription factor binding affinity landscape. Bioinformatics. Available: http://dx.doi.org/10.1093/bioinformatics/btx480.
Sponsors: The research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/1976-04 and URF/1/3007-01. It was also supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, ONR N00014-15-1-2340, NVIDIA, Intel and Amazon AWS. This research made use of the resources of the computer clusters at KAUST.
Publisher: Oxford University Press (OUP)
Except where otherwise noted, this item's license is described as follows: This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact firstname.lastname@example.org