SECOM: A novel hash seed and community detection based-approach for genome-scale protein domain identification
Name:
Article-PLoS_ONE-SECOM_A_no-2012.pdf
Size:
2.628Mb
Format:
PDF
Description:
Article - Full Text
Name:
Supplement_1_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s001.pdf
Size:
8.723Kb
Format:
PDF
Description:
Supplemental File 1
Name:
Supplement_2_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s002.pdf
Size:
8.708Kb
Format:
PDF
Description:
Supplemental File 2
Name:
Supplement_3_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s003.pdf
Size:
8.667Kb
Format:
PDF
Description:
Supplemental File 3
Name:
Supplement_4_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s004.pdf
Size:
8.610Kb
Format:
PDF
Description:
Supplemental File 4
Name:
Supplement_5_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s005.pdf
Size:
8.724Kb
Format:
PDF
Description:
Supplemental File 5
Name:
Supplement_6_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s006.pdf
Size:
8.704Kb
Format:
PDF
Description:
Supplemental File 6
Name:
Supplement_7_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s007.pdf
Size:
8.692Kb
Format:
PDF
Description:
Supplemental File 7
Name:
Supplement_8_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s008.pdf
Size:
8.657Kb
Format:
PDF
Description:
Supplemental File 8
Name:
Supplement_9_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s009.pdf
Size:
8.649Kb
Format:
PDF
Description:
Supplemental File 9
Name:
Supplement_10_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s010.pdf
Size:
8.667Kb
Format:
PDF
Description:
Supplemental File 10
Name:
Supplement_11_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s011.png
Size:
29.15Kb
Format:
PNG image
Description:
Supplemental File 11
Name:
Supplement_12_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s012.png
Size:
9.864Kb
Format:
PNG image
Description:
Supplemental File 12
Name:
Supplement_13_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s013.tex
Size:
1.725Kb
Format:
TeX
Description:
Supplemental File 13
Name:
Supplement_14_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s014.tex
Size:
4.661Kb
Format:
TeX
Description:
Supplemental File 14
Name:
Supplement_15_-_PLoS_ONE-SECOM_A_no-2012.pone.0039475.s015.pdf
Size:
153.0Kb
Format:
PDF
Description:
Supplemental File 15
Type
Article
KAUST Department
Biological and Environmental Sciences and Engineering (BESE) Division
Bioscience Program
Computational Bioscience Research Center (CBRC)
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Integrative Systems Biology Lab
Structural and Functional Bioinformatics Group
Date
2012-06-28
Permanent link to this record
http://hdl.handle.net/10754/325305
Abstract
With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatases (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and is available at http://sfb.kaust.edu.sa/Pages/Software.aspx. © 2012 Fan et al.
Citation
Fan M, Wong K-C, Ryu T, Ravasi T, Gao X (2012) SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification. PLoS ONE 7: e39475. doi:10.1371/journal.pone.0039475.
Publisher
Public Library of Science (PLoS)
Journal
PLoS ONE
PubMed ID
22761802
PubMed Central ID
PMC3386278
DOI
10.1371/journal.pone.0039475
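Illustrative sketch
The abstract describes indexing every protein with a hash seed function and then building a graph whose edge weights count the seeds two proteins share. The Python below is a minimal sketch of that idea only, not the SECOM implementation: it assumes plain contiguous k-mers as a stand-in for the paper's seed function, the names `seed_index` and `shared_seed_graph` are hypothetical, the toy sequences are invented, and the community-finding step is omitted because the abstract does not specify it.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=4):
    """Map each length-k substring (an assumed stand-in 'seed') to the proteins containing it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def shared_seed_graph(proteins, k=4):
    """Return {(a, b): weight}, where weight counts the distinct seeds shared by a and b."""
    weights = defaultdict(int)
    for names in seed_index(proteins, k).values():
        # Every pair of proteins listed under the same seed shares that seed.
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy sequences, not data from the paper.
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "GGTAYIAKLL",
    "p3": "WWWWWWWW",
}
graph = shared_seed_graph(toy, k=4)
```

Because pairs are generated only from the index buckets, proteins that share no seed are never compared, which is the intuition behind avoiding all-against-all alignment.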
Collections
Articles; Biological and Environmental Science and Engineering (BESE) Division; Bioscience Program; Structural and Functional Bioinformatics Group; Integrative Systems Biology Lab; Computer Science Program; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division
Related items
Showing items related by title, author, creator and subject.
-
Analysis and Ranking of Protein-Protein Docking Models Using Inter-Residue Contacts and Inter-Molecular Contact Maps
Oliva, Romina; Chermak, Edrisse; Cavallo, Luigi (Molecules, MDPI AG, 2015-07-01) [Article]
In view of the increasing interest both in inhibitors of protein-protein interactions and in protein drugs themselves, analysis of the three-dimensional structure of protein-protein complexes is assuming greater relevance in drug design. In the many cases where an experimental structure is not available, protein-protein docking becomes the method of choice for predicting the arrangement of the complex. However, reliably scoring protein-protein docking poses is still an unsolved problem. As a consequence, the screening of many docking models is usually required in the analysis step, to possibly single out the correct ones. Here, making use of exemplary cases, we review our recently introduced methods for the analysis of protein complex structures and for the scoring of protein docking poses, based on the use of inter-residue contacts and their visualization in inter-molecular contact maps. We also show that the ensemble of tools we developed can be used in the context of rational drug design targeting protein-protein interactions.
-
Prediction of protein-protein interaction sites through eXtreme gradient boosting with kernel principal component analysis.
Wang, Xue; Zhang, Yaqun; Yu, Bin; Salhi, Adil; Chen, Ruixin; Wang, Lin; Liu, Zengfeng (Computers in biology and medicine, Elsevier BV, 2021-06-01) [Article]
Predicting protein-protein interaction sites (PPI sites) can provide important clues for understanding biological activity. Using machine learning to predict PPI sites can mitigate the cost of running expensive and time-consuming biological experiments. Here we propose PPISP-XGBoost, a novel PPI sites prediction method based on eXtreme gradient boosting (XGBoost). First, the characteristic information of protein is extracted through the pseudo-position specific scoring matrix (PsePSSM), pseudo-amino acid composition (PseAAC), hydropathy index and solvent accessible surface area (ASA) under the sliding window. Next, these raw features are preprocessed to obtain more optimal representations in order to achieve better prediction. In particular, the synthetic minority oversampling technique (SMOTE) is used to circumvent class imbalance, and the kernel principal component analysis (KPCA) is applied to remove redundant characteristics. Finally, these optimal features are fed to the XGBoost classifier to identify PPI sites. Using PPISP-XGBoost, the prediction accuracy on the training dataset Dset186 reaches 85.4%, and the accuracy on the independent validation datasets Dtestset72, PDBtestset164, Dset_448 and Dset_355 reaches 85.3%, 83.9%, 85.8% and 85.4%, respectively, which all show an increase in accuracy against existing PPI sites prediction methods. These results demonstrate that the PPISP-XGBoost method can further enhance the prediction of PPI sites.
-
Statistical analysis of predicted vs. experimental interresidue contacts in protein-protein complexes from results of docking simulations
Cavallo, Luigi; Oliva, Romina; Chermak, Edrisse (ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, American Chemical Society (ACS), 2015-08-16) [Presentation]