Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks

Handle URI:
http://hdl.handle.net/10754/622926
Title:
Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks
Authors:
Umarov, Ramzan; Solovyev, Victor V. (0000-0001-8885-493X)
Abstract:
Accurate computational identification of promoters remains a challenge, as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we utilize Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build their predictive models. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). We found that a CNN trained on the sigma70 subclass of Escherichia coli promoters gives an excellent classification of promoter and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The Bacillus subtilis promoter identification CNN model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse and Arabidopsis promoters we employed CNNs for identification of two well-known promoter classes (TATA and non-TATA promoters). CNN models nicely recognize these complex functional regions. For human promoters, the Sn/Sp/CC accuracy of prediction reached 0.95/0.98/0.90 on TATA and 0.90/0.98/0.89 on non-TATA promoter sequences. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA promoters). Thus, the developed CNN models, implemented in the CNNProm program, demonstrated the ability of the deep learning approach to grasp complex promoter sequence characteristics and achieve significantly higher accuracy compared to previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. As the suggested approach does not require knowledge of any specific promoter features, it can be easily extended to identify promoters and other complex functional regions in sequences of many other, and especially newly sequenced, genomes.
The CNNProm program is available to run at the web server http://www.softberry.com.
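The abstract reports Sn, Sp, and CC without defining them. The sketch below assumes the conventional readings: Sn as sensitivity TP/(TP+FN), Sp as specificity TN/(TN+FP), and CC as the Matthews correlation coefficient. This record does not confirm those definitions, and the function name `promoter_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
import math


def promoter_metrics(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from confusion-matrix counts.

    Assumed definitions (not stated in this record):
      Sn = TP / (TP + FN)   sensitivity on promoter sequences
      Sp = TN / (TN + FP)   specificity on non-promoter sequences
      CC = Matthews correlation coefficient
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # CC is undefined when any marginal total is zero; report 0.0 in that case.
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Hypothetical counts: 100 promoters and 100 non-promoters (not data from the paper).
sn, sp, cc = promoter_metrics(tp=90, fp=4, tn=96, fn=10)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

Under these definitions CC depends on the ratio of promoter to non-promoter sequences in the test set, so the same Sn and Sp can yield different CC values on differently balanced data.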
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program
Citation:
Umarov RK, Solovyev VV (2017) Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLOS ONE 12: e0171410. Available: http://dx.doi.org/10.1371/journal.pone.0171410.
Publisher:
Public Library of Science (PLoS)
Journal:
PLOS ONE
Issue Date:
3-Feb-2017
DOI:
10.1371/journal.pone.0171410
Type:
Article
ISSN:
1932-6203
Sponsors:
This study was supported by the King Abdullah University of Science and Technology and Softberry Inc. The King Abdullah University of Science and Technology provided support in the form of salaries for authors RU and VS. VS is employed by Softberry Inc., which provided support for VS in the form of salaries. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Additional Links:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171410
Appears in Collections:
Articles; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Umarov, Ramzan | en
dc.contributor.author | Solovyev, Victor V. | en
dc.date.accessioned | 2017-02-26T06:34:21Z | -
dc.date.available | 2017-02-26T06:34:21Z | -
dc.date.issued | 2017-02-03 | en
dc.identifier.citation | Umarov RK, Solovyev VV (2017) Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLOS ONE 12: e0171410. Available: http://dx.doi.org/10.1371/journal.pone.0171410. | en
dc.identifier.issn | 1932-6203 | en
dc.identifier.doi | 10.1371/journal.pone.0171410 | en
dc.identifier.uri | http://hdl.handle.net/10754/622926 | -
dc.description.abstract | Accurate computational identification of promoters remains a challenge, as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we utilize Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build their predictive models. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). We found that a CNN trained on the sigma70 subclass of Escherichia coli promoters gives an excellent classification of promoter and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The Bacillus subtilis promoter identification CNN model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse and Arabidopsis promoters we employed CNNs for identification of two well-known promoter classes (TATA and non-TATA promoters). CNN models nicely recognize these complex functional regions. For human promoters, the Sn/Sp/CC accuracy of prediction reached 0.95/0.98/0.90 on TATA and 0.90/0.98/0.89 on non-TATA promoter sequences. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA promoters). Thus, the developed CNN models, implemented in the CNNProm program, demonstrated the ability of the deep learning approach to grasp complex promoter sequence characteristics and achieve significantly higher accuracy compared to previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. As the suggested approach does not require knowledge of any specific promoter features, it can be easily extended to identify promoters and other complex functional regions in sequences of many other, and especially newly sequenced, genomes. The CNNProm program is available to run at the web server http://www.softberry.com. | en
dc.description.sponsorship | This study was supported by the King Abdullah University of Science and Technology and Softberry Inc. The King Abdullah University of Science and Technology provided support in the form of salaries for authors RU and VS. VS is employed by Softberry Inc., which provided support for VS in the form of salaries. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. | en
dc.publisher | Public Library of Science (PLoS) | en
dc.relation.url | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171410 | en
dc.rights | This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. | en
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | en
dc.title | Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks | en
dc.type | Article | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science Program | en
dc.identifier.journal | PLOS ONE | en
dc.eprint.version | Publisher's Version/PDF | en
dc.contributor.institution | Softberry Inc., Mount Kisco, United States | en
kaust.author | Umarov, Ramzan | en