This thesis presents a computational methodology for ab initio identification of
transcription factor binding sites (TFBSs) from ChIP-seq data. The method consists
of three main steps: ChIP-seq data processing, motif discovery, and model
selection. A novel method for ranking the models of motifs identified in this process
is also proposed.
This method combines several factors to rank the candidate motifs: the model's
coverage of the ChIP-seq fragments that contain the motifs from which the model is
built; a suitable background set made up of shuffled ChIP-seq fragments; and the
p-value obtained by evaluating the model on both the actual and the background
data.
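
The ranking idea described above can be illustrated with a minimal sketch. The
abstract does not specify the scoring model, the match threshold, the statistical
test, or how the factors are weighted, so everything below is an assumption made
for illustration only: a toy match/mismatch scoring matrix built by a hypothetical
helper (consensus_pwm), a fixed score threshold, a one-sided Fisher exact test as
a stand-in for the p-value, and ranking by p-value with coverage as a tie-breaker.
All function names are hypothetical.

```python
import math
import random


def consensus_pwm(motif):
    """Hypothetical toy model: +1 for the consensus base, -1 otherwise."""
    return [{b: (1.0 if b == c else -1.0) for b in "ACGT"} for c in motif]


def shuffle_fragment(seq, rng):
    """Background sequence: same base composition, random order."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def best_score(pwm, seq):
    """Highest model score over all windows of seq (-inf if seq is too short)."""
    width = len(pwm)
    best = float("-inf")
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(width))
        best = max(best, score)
    return best


def count_hits(pwm, fragments, threshold):
    """Number of fragments with at least one window scoring >= threshold."""
    return sum(1 for f in fragments if best_score(pwm, f) >= threshold)


def enrichment_p_value(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided Fisher exact test (assumed test, not taken from the thesis):
    probability of seeing >= hits_fg foreground hits under the hypergeometric null."""
    total_hits = hits_fg + hits_bg
    total = n_fg + n_bg
    denom = math.comb(total, n_fg)
    p = 0.0
    for k in range(hits_fg, min(total_hits, n_fg) + 1):
        p += math.comb(total_hits, k) * math.comb(total - total_hits, n_fg - k) / denom
    return p


def rank_models(models, fragments, seed=0):
    """Rank models by p-value (ascending), breaking ties by coverage (descending).

    models maps a name to a (pwm, threshold) pair; both are illustrative inputs.
    Returns a list of (name, coverage, p_value) tuples.
    """
    rng = random.Random(seed)
    background = [shuffle_fragment(f, rng) for f in fragments]
    ranked = []
    for name, (pwm, threshold) in models.items():
        hits_fg = count_hits(pwm, fragments, threshold)
        hits_bg = count_hits(pwm, background, threshold)
        p = enrichment_p_value(hits_fg, len(fragments), hits_bg, len(background))
        ranked.append((name, hits_fg / len(fragments), p))
    ranked.sort(key=lambda r: (r[2], -r[1]))
    return ranked
```

In this sketch a model that matches many real fragments but few shuffled ones
receives a small p-value and is ranked first; the actual ranking criterion used in
the thesis may combine these quantities differently.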
Two ChIP-seq datasets retrieved from the ENCODE project are used to evaluate the
method and to demonstrate its ability to predict correct TFBSs with high precision.
The first dataset relates to the neuron-restrictive silencer factor (NRSF), while the
second corresponds to the growth-associated binding protein (GABP). The pipeline
achieves high-precision prediction for both datasets: in both cases the top-ranked
motif closely resembles the known motif for the respective transcription factor.