Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text

Handle URI:
http://hdl.handle.net/10754/325327
Title:
Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text
Authors:
Bin Raies, Arwa; Mansour, Hicham; Incitti, Roberto; Bajic, Vladimir B. (0000-0001-5435-4750)
Abstract:
Background: In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually.
Methodology: We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text.
Conclusion: The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at http://www.cbrc.kaust.edu.sa/demgd/. The data is available for online browsing and download. © 2013 Bin Raies et al.
KAUST Department:
Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Biosciences Core Lab
Citation:
Bin Raies A, Mansour H, Incitti R, Bajic VB (2013) Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text. PLoS ONE 8: e77848. doi:10.1371/journal.pone.0077848.
Publisher:
Public Library of Science (PLoS)
Journal:
PLoS ONE
Issue Date:
16-Oct-2013
DOI:
10.1371/journal.pone.0077848
PubMed ID:
24147091
PubMed Central ID:
PMC3797705
Type:
Article
ISSN:
1932-6203
Appears in Collections:
Articles; Biosciences Core Lab; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
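Illustrative sketch:
The abstract names two building blocks, a document-term matrix and position weight matrices (PWMs) applied to text, but does not describe how either is constructed. The Python sketch below is only a generic illustration of both ideas and is not the authors' method. It assumes a whitespace tokenizer, a hypothetical four-symbol token alphabet (GENE, METH, DIS, OTHER), and the log-odds PWM commonly used for biological sequences with a uniform background and a pseudocount. All function names are made up for this example.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] counts term vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix


def position_weight_matrix(windows, alphabet, pseudocount=1.0):
    """Log-odds PWM over equal-length symbol windows (uniform background).

    Assumption: this is the generic sequence-style PWM, not necessarily
    the construction used in the paper.
    """
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    background = 1.0 / len(alphabet)
    total = len(windows) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        counts = Counter(w[pos] for w in windows)
        pwm.append({
            sym: math.log2(((counts[sym] + pseudocount) / total) / background)
            for sym in alphabet
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds weights for one window."""
    return sum(column[sym] for column, sym in zip(pwm, window))


# Hypothetical toy data, invented for illustration only.
docs = ["BRCA1 is methylated in breast cancer", "methylated BRCA1 promoter"]
vocab, dtm = document_term_matrix(docs)

alphabet = ["GENE", "METH", "DIS", "OTHER"]
positive_windows = [
    ("GENE", "METH", "DIS"),
    ("GENE", "METH", "DIS"),
    ("METH", "GENE", "DIS"),
]
pwm = position_weight_matrix(positive_windows, alphabet)
print(vocab)
print(dtm)
print(score(pwm, ("GENE", "METH", "DIS")))
```

In a setup like this, the PWM score of a sentence window and the corresponding document-term row could both be fed to a classifier as features. Whether the paper combines them this way is not stated in the abstract.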

Full metadata record

DC Field | Value | Language
dc.contributor.author | Bin Raies, Arwa | en
dc.contributor.author | Mansour, Hicham | en
dc.contributor.author | Incitti, Roberto | en
dc.contributor.author | Bajic, Vladimir B. | en
dc.date.accessioned | 2014-08-27T09:47:08Z | en
dc.date.available | 2014-08-27T09:47:08Z | en
dc.date.issued | 2013-10-16 | en
dc.identifier.citation | Bin Raies A, Mansour H, Incitti R, Bajic VB (2013) Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text. PLoS ONE 8: e77848. doi:10.1371/journal.pone.0077848. | en
dc.identifier.issn | 1932-6203 | en
dc.identifier.pmid | 24147091 | en
dc.identifier.doi | 10.1371/journal.pone.0077848 | en
dc.identifier.uri | http://hdl.handle.net/10754/325327 | en
dc.description.abstract | Background: In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually. Methodology: We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text. Conclusion: The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at http://www.cbrc.kaust.edu.sa/demgd/. The data is available for online browsing and download. © 2013 Bin Raies et al. | en
dc.language.iso | en | en
dc.publisher | Public Library of Science (PLoS) | en
dc.rights | This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. | en
dc.rights | Archived with thanks to PLoS ONE | en
dc.subject | accuracy | en
dc.subject | automation | en
dc.subject | classification | en
dc.subject | data mining | en
dc.subject | document term matrix | en
dc.subject | genetic association | en
dc.subject | genetic procedures | en
dc.subject | insulin dependent diabetes mellitus | en
dc.subject | Internet | en
dc.subject | learning algorithm | en
dc.subject | machine learning | en
dc.subject | mathematical computing | en
dc.subject | methodology | en
dc.subject | methylation | en
dc.subject | Parkinson disease | en
dc.subject | position weight matrix | en
dc.subject | scoring system | en
dc.subject | Algorithms | en
dc.subject | Computational Biology | en
dc.subject | Data Mining | en
dc.subject | Databases, Genetic | en
dc.subject | DNA Methylation | en
dc.subject | Position-Specific Scoring Matrices | en
dc.title | Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text | en
dc.type | Article | en
dc.contributor.department | Computational Bioscience Research Center (CBRC) | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Biosciences Core Lab | en
dc.identifier.journal | PLoS ONE | en
dc.identifier.pmcid | PMC3797705 | en
dc.eprint.version | Publisher's Version/PDF | en
dc.contributor.institution | Unidad Académica de Sistemas Arrecifales (Puerto Morelos), Instituto de Ciencias Del Mar y Limnología, Universidad Nacional Autónoma de México, Puerto Morelos, QR 77580, Mexico | en
dc.contributor.institution | School of Natural Sciences, University of California Merced, 5200 North Lake Road, Merced, CA 95343, United States | en
dc.contributor.affiliation | King Abdullah University of Science and Technology (KAUST) | en
kaust.author | Mansour, Hicham | en
kaust.author | Incitti, Roberto | en
kaust.author | Bajic, Vladimir B. | en
kaust.author | Bin Raies, Arwa | en
