Pipeline for Efficient Mapping of Transcription Factor Binding Sites and Comparison of Their Models

Handle URI:
http://hdl.handle.net/10754/136709
Title:
Pipeline for Efficient Mapping of Transcription Factor Binding Sites and Comparison of Their Models
Authors:
Ba alawi, Wail
Abstract:
The control of genes in every living organism is based on the activity of transcription factor (TF) proteins. TFs interact with DNA by binding to TF binding sites (TFBSs), thereby creating the conditions for genes to be activated. Of the approximately 1500 human TFs, TFBSs have been experimentally derived for fewer than 300, and only in generally limited portions of the genome. To associate TFs with the genes they control, we need to know whether a TF has the potential to interact with the control region of a gene, and for this we need models of TFBS families. The existing models are either not sufficiently accurate or too complex for use by ordinary biologists. To remove some of these deficiencies, in this study we developed a pipeline through which we achieved the following: 1. Through a comparative analysis of performance, we identified the best models, with optimized thresholds, among four different types of models of TFBS families. 2. Using the best models, we mapped TFBSs to the human genome in an efficient way. The study shows that a new scoring function used with TFBS models based on the position weight matrix of dinucleotides with remote dependency results in better accuracy than the other three types of TFBS models. The speed of mapping was improved by developing parallelized code, which shows a significant speedup of 4x when going from 1 CPU to 8 CPUs. To verify whether the predicted TFBSs are more accurate than what can be expected with conventional models, we identified the most frequent pairs of TFBSs (for the TFs E4F1 and ATF6) that appeared close to each other (within a distance of 200 nucleotides) across the human genome. Unexpectedly, we show that the genes closest to the multiple pairs of E4F1/ATF6 binding sites have a co-expression of over 90%.
This indirectly supports our hypothesis that the TFBS models we use are more accurate, and it also suggests that the E4F1/ATF6 pair exerts control over these genes.
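The proximity criterion in the abstract (two binding sites within 200 nucleotides of each other) can be illustrated with a minimal sketch. This is not the thesis code: the function name `close_pairs`, the use of start positions as the distance reference, and the example coordinates are all assumptions made for illustration, and the abstract does not say how distance was measured.

```python
from bisect import bisect_left, bisect_right


def close_pairs(sites_a, sites_b, max_dist=200):
    """Return (a, b) position pairs with |a - b| <= max_dist.

    Positions are assumed to be site start coordinates on the same
    chromosome (an assumption; the source does not specify this).
    """
    sorted_b = sorted(sites_b)
    pairs = []
    for a in sorted(sites_a):
        # Binary search for the window [a - max_dist, a + max_dist] in sorted_b.
        lo = bisect_left(sorted_b, a - max_dist)
        hi = bisect_right(sorted_b, a + max_dist)
        pairs.extend((a, b) for b in sorted_b[lo:hi])
    return pairs


# Hypothetical coordinates, not data from the thesis.
e4f1_sites = [1000, 5000, 9000]
atf6_sites = [1150, 5300, 8900, 9200]
print(close_pairs(e4f1_sites, atf6_sites))
```

Sorting one list and using binary search keeps the lookup per site logarithmic, which matters when the predicted sites span a whole genome.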
Advisors:
Bajic, Vladimir B. ( 0000-0001-5435-4750 )
Committee Member:
Moshkov, Mikhail ( 0000-0003-0085-9483 ) ; Zhang, Xiangliang ( 0000-0002-3574-5665 )
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Program:
Computer Science
Issue Date:
Jun-2011
Type:
Thesis
Appears in Collections:
Theses; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | Bajic, Vladimir B. | en
dc.contributor.author | Ba alawi, Wail | en
dc.date.accessioned | 2011-07-24T07:53:39Z | -
dc.date.available | 2011-07-24T07:53:39Z | -
dc.date.issued | 2011-06 | en
dc.identifier.uri | http://hdl.handle.net/10754/136709 | en
dc.description.abstract | The control of genes in every living organism is based on the activity of transcription factor (TF) proteins. TFs interact with DNA by binding to TF binding sites (TFBSs), thereby creating the conditions for genes to be activated. Of the approximately 1500 human TFs, TFBSs have been experimentally derived for fewer than 300, and only in generally limited portions of the genome. To associate TFs with the genes they control, we need to know whether a TF has the potential to interact with the control region of a gene, and for this we need models of TFBS families. The existing models are either not sufficiently accurate or too complex for use by ordinary biologists. To remove some of these deficiencies, in this study we developed a pipeline through which we achieved the following: 1. Through a comparative analysis of performance, we identified the best models, with optimized thresholds, among four different types of models of TFBS families. 2. Using the best models, we mapped TFBSs to the human genome in an efficient way. The study shows that a new scoring function used with TFBS models based on the position weight matrix of dinucleotides with remote dependency results in better accuracy than the other three types of TFBS models. The speed of mapping was improved by developing parallelized code, which shows a significant speedup of 4x when going from 1 CPU to 8 CPUs. To verify whether the predicted TFBSs are more accurate than what can be expected with conventional models, we identified the most frequent pairs of TFBSs (for the TFs E4F1 and ATF6) that appeared close to each other (within a distance of 200 nucleotides) across the human genome. Unexpectedly, we show that the genes closest to the multiple pairs of E4F1/ATF6 binding sites have a co-expression of over 90%. This indirectly supports our hypothesis that the TFBS models we use are more accurate, and it also suggests that the E4F1/ATF6 pair exerts control over these genes. | en
dc.language.iso | en | en
dc.title | Pipeline for Efficient Mapping of Transcription Factor Binding Sites and Comparison of Their Models | en
dc.type | Thesis | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
thesis.degree.grantor | King Abdullah University of Science and Technology | en_GB
dc.contributor.committeemember | Moshkov, Mikhail | en
dc.contributor.committeemember | Zhang, Xiangliang | en
thesis.degree.discipline | Computer Science | en
thesis.degree.name | Master of Science | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.