SECOM: A novel hash seed and community detection based-approach for genome-scale protein domain identification
KAUST Department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Biological and Environmental Sciences and Engineering (BESE) Division
Computational Bioscience Research Center (CBRC)
Abstract: With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx. © 2012 Fan et al.
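The indexing and graph-construction step described in the abstract can be illustrated with a minimal sketch. This is not the SECOM implementation: it assumes plain contiguous k-mers as a stand-in for the hash seed function, and the names `seed_index`, `seed_graph`, and the toy sequences are hypothetical. It shows only how shared seeds can become edge weights without all-against-all alignment; the overlapping community-finding step is not covered.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=3):
    """Map each length-k substring (an assumed stand-in for a hash seed)
    to the set of protein names that contain it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def seed_graph(proteins, k=3):
    """Return {(a, b): weight}, where weight is the number of distinct
    seeds shared by proteins a and b."""
    weights = defaultdict(int)
    for names in seed_index(proteins, k).values():
        # Every pair of proteins sharing this seed gains one unit of weight.
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy sequences: p1 and p2 share the segment "TAYIAK".
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "GGTAYIAKLL",
    "p3": "WWWWWWWWWW",
}
graph = seed_graph(toy, k=4)
```

In this toy input only p1 and p2 end up connected, because they are the only pair with 4-mers in common. The work per seed depends on how many proteins contain that seed, not on the total number of protein pairs, which is the general idea behind avoiding pairwise alignment.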
Citation: Fan M, Wong K-C, Ryu T, Ravasi T, Gao X (2012) SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification. PLoS ONE 7: e39475. doi:10.1371/journal.pone.0039475.
Publisher: Public Library of Science (PLoS)
PubMed Central ID: PMC3386278
- HMMerThread: detecting remote, functional conserved domains in entire genomes by combining relaxed sequence-database searches with fold recognition.
- Authors: Bradshaw CR, Surendranath V, Henschel R, Mueller MS, Habermann BH
- Issue date: 2011 Mar 10
- Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence.
- Authors: Bernardes J, Zaverucha G, Vaquero C, Carbone A
- Issue date: 2016 Jul
- CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction.
- Authors: Cui X, Lu Z, Wang S, Jing-Yan Wang J, Gao X
- Issue date: 2016 Jun 15
- Computational identification of novel chitinase-like proteins in the Drosophila melanogaster genome.
- Authors: Zhu Q, Deng Y, Vanka P, Brown SJ, Muthukrishnan S, Kramer KJ
- Issue date: 2004 Jan 22
- ProClust: improved clustering of protein sequences with an extended graph-based approach.
- Authors: Pipenbacher P, Schliep A, Schneckener S, Schönhuth A, Schomburg D, Schrader R
- Issue date: 2002
Showing items related by title, author, creator and subject.
Structural analysis and dimerization profile of the SCAN domain of the pluripotency factor Zfp206
Liang, Yu; Huimei Hong, Felicia; Ganesan, Pugalenthi; Jiang, Sizun; Jauch, Ralf; Stanton, Lawrence W.; Kolatkar, Prasanna R. (Oxford University Press (OUP), 2012-06-26)
Zfp206 (also named Zscan10) belongs to the subfamily of C2H2 zinc finger transcription factors, which is characterized by the N-terminal SCAN domain. The SCAN domain mediates self-association and association between the members of SCAN family transcription factors, but the structural basis and selectivity determinants for complex formation are unknown. Zfp206 is important for maintaining the pluripotency of embryonic stem cells, presumably by combinatorial assembly of itself or other SCAN family members on enhancer regions. To gain insights into the folding topology and selectivity determinants for SCAN dimerization, we solved the 1.85 Å crystal structure of the SCAN domain of Zfp206. In vitro binding studies using a panel of 20 SCAN proteins indicate that the SCAN domain of Zfp206 can selectively associate with other members of SCAN family transcription factors. Deletion mutations showed that the N-terminal helix 1 is critical for heterodimerization. Double mutations and multiple mutations based on the Zfp206SCAN-Zfp110SCAN model suggested that domain-swapped topology is a possible preference for the Zfp206SCAN-Zfp110SCAN heterodimer. Together, we demonstrate that the Zfp206SCAN constitutes a protein module that enables C2H2 transcription factor dimerization in a highly selective manner using a domain-swapped interface architecture and identify novel partners for Zfp206 during embryonal development. © 2012 The Author(s).
Characterization and gene expression analysis of the cir multi-gene family of Plasmodium chabaudi chabaudi (AS)
Lawton, Jennifer; Brugat, Thibaut; Yan, Yam Xue; Reid, Adam James; Böhme, Ulrike; Otto, Thomas Dan; Pain, Arnab; Jackson, Andrew; Berriman, Matthew; Cunningham, Deirdre; Preiser, Peter; Langhorne, Jean (Springer Nature, 2012-03-29)
Background: The pir genes comprise the largest multi-gene family in Plasmodium, with members found in P. vivax, P. knowlesi and the rodent malaria species. Despite comprising up to 5% of the genome, little is known about the functions of the proteins encoded by pir genes. P. chabaudi causes chronic infection in mice, which may be due to antigenic variation. In this model, pir genes are called cirs and may be involved in this mechanism, allowing evasion of host immune responses. In order to fully understand the role(s) of CIR proteins during P. chabaudi infection, a detailed characterization of the cir gene family was required.
Results: The cir repertoire was annotated and a detailed bioinformatic characterization of the encoded CIR proteins was performed. Two major sub-families were identified, which have been named A and B. Members of each sub-family displayed different amino acid motifs, and were thus predicted to have undergone functional divergence. In addition, the expression of the entire cir repertoire was analyzed via RNA sequencing and microarray. Up to 40% of the cir gene repertoire was expressed in the parasite population during infection, and dominant cir transcripts could be identified. In addition, some differences were observed in the pattern of expression between the cir subgroups at the peak of P. chabaudi infection. Finally, specific cir genes were expressed at different time points during asexual blood stages.
Conclusions: In conclusion, the large number of cir genes and their expression throughout the intraerythrocytic cycle of development indicates that CIR proteins are likely to be important for parasite survival. In particular, the detection of dominant cir transcripts at the peak of P. chabaudi infection supports the idea that CIR proteins are expressed, and could perform important functions in the biology of this parasite. Further application of the methodologies described here may allow the elucidation of CIR sub-family A and B protein functions, including their contribution to antigenic variation and immune evasion. © 2012 Lawton et al; licensee BioMed Central Ltd.
The systematic functional analysis of Plasmodium protein kinases identifies essential regulators of mosquito transmission
Tewari, Rita; Straschil, Ursula; Bateman, Alex; Böhme, Ulrike; Cherevach, Inna; Gong, Peng; Pain, Arnab; Billker, Oliver (Elsevier BV, 2010-10-21)
Although eukaryotic protein kinases (ePKs) contribute to many cellular processes, only three Plasmodium falciparum ePKs have thus far been identified as essential for parasite asexual blood stage development. To identify pathways essential for parasite transmission between their mammalian host and mosquito vector, we undertook a systematic functional analysis of ePKs in the genetically tractable rodent parasite Plasmodium berghei. Modeling domain signatures of conventional ePKs identified 66 putative Plasmodium ePKs. Kinomes are highly conserved between Plasmodium species. Using reverse genetics, we show that 23 ePKs are redundant for asexual erythrocytic parasite development in mice. Phenotyping mutants at four life cycle stages in Anopheles stephensi mosquitoes revealed functional clusters of kinases required for sexual development and sporogony. Roles for a putative SR protein kinase (SRPK) in microgamete formation, a conserved regulator of clathrin uncoating (GAK) in ookinete formation, and a likely regulator of energy metabolism (SNF1/KIN) in sporozoite development were identified. © 2010 Elsevier Inc.