Machine Learning Identification of Protein Properties Useful for Specific Applications
AuthorsKhamis, Abdullah M.
AdvisorsBajic, Vladimir B.
KAUST DepartmentComputer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Permanent link to this recordhttp://hdl.handle.net/10754/606030
MetadataShow full item record
AbstractProteins play critical roles in cellular processes of living organisms. It is therefore important to identify and characterize their key properties associated with their functions. Correlating protein’s structural, sequence and physicochemical properties of its amino acids (aa) with protein functions could identify some of the critical factors governing the specific functionality. We point out that not all functions of even well studied proteins are known. This, complemented by the huge increase in the number of newly discovered and predicted proteins, makes challenging the experimental characterization of the whole spectrum of possible protein functions for all proteins of interest. Consequently, the use of computational methods has become more attractive. Here we address two questions. The first one is how to use protein aa sequence and physicochemical properties to characterize a family of proteins. The second one focuses on how to use transcription factor (TF) protein’s domains to enhance accuracy of predicting TF DNA binding sites (TFBSs). To address the first question, we developed a novel method using computational representation of proteins based on characteristics of different protein regions (N-terminal, M-region and C-terminal) and combined these with the properties of protein aa sequences. We show that this description provides important biological insight about characterization of the protein functional groups. Using feature selection techniques, we identified key properties of proteins that allow for very accurate characterization of different protein families. We demonstrated efficiency of our method in application to a number of antimicrobial peptide families. To address the second question we developed another novel method that uses a combination of aa properties of DNA binding domains of TFs and their TFBS properties to develop machine learning models for predicting TFBSs. Feature selection is used to identify the most relevant characteristics of the aa for such modeling. In addition to reducing the number of required models to only 14 for several hundred TFs, the final prediction accuracy of our models appears dramatically better than with other methods. Overall, we show how to efficiently utilize properties of proteins in deriving more accurate solutions for two important problems of computational biology and bioinformatics.