Linear discriminant analysis of character sequences using occurrences of words

Handle URI:
http://hdl.handle.net/10754/552136
Title:
Linear discriminant analysis of character sequences using occurrences of words
Authors:
Dutta, Subhajit; Chaudhuri, Probal; Ghosh, Anil
Abstract:
Classification of character sequences, where the characters come from a finite set, arises in disciplines such as molecular biology and computer science. For discriminant analysis of such character sequences, the Bayes classifier based on Markov models turns out to have class boundaries defined by linear functions of occurrences of words in the sequences. It is shown that for such classifiers based on Markov models with unknown orders, if the orders are estimated from the data using cross-validation, the resulting classifier has Bayes risk consistency under suitable conditions. Even when Markov models are not valid for the data, we develop methods for constructing classifiers based on linear functions of occurrences of words, where the word length is chosen by cross-validation. Such linear classifiers are constructed using ideas of support vector machines, regression depth, and distance weighted discrimination. We show that classifiers with linear class boundaries have certain optimal properties in terms of their asymptotic misclassification probabilities. The performance of these classifiers is demonstrated in various simulated and benchmark data sets.
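The linearity claimed in the abstract can be illustrated with a small sketch. For a Markov chain of order m, the log-likelihood of a sequence is, apart from a term for its first m characters, a sum over words w of length m + 1 of (number of occurrences of w) × log P(last character of w | first m characters of w). The log-likelihood ratio between two classes is therefore a linear function of the word counts. The Python code below is only an illustration under stated assumptions and is not the authors' implementation: the function names, the add-alpha smoothing, the equal class priors (zero bias), and the omission of the initial-state term are all choices made here for brevity.

```python
import math
from collections import Counter
from itertools import product


def word_counts(seq, k):
    """Count overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fit_log_transitions(seqs, alphabet, order, alpha=1.0):
    """Estimate log P(next char | previous `order` chars) for one class.

    Returns a dict mapping each word of length order + 1 to the log-probability
    of its last character given its first `order` characters.
    Add-alpha smoothing is an assumption of this sketch.
    """
    k = order + 1
    counts = Counter()
    for s in seqs:
        counts.update(word_counts(s, k))
    log_p = {}
    for prefix in map("".join, product(alphabet, repeat=order)):
        total = sum(counts[prefix + c] for c in alphabet) + alpha * len(alphabet)
        for c in alphabet:
            log_p[prefix + c] = math.log((counts[prefix + c] + alpha) / total)
    return log_p


def linear_weights(log_p0, log_p1):
    """Coefficient of each word count in the class-1 vs class-0 log-likelihood ratio."""
    return {w: log_p1[w] - log_p0[w] for w in log_p1}


def score(seq, weights, k, bias=0.0):
    """Linear function of word occurrences; bias = 0 assumes equal priors."""
    # Words containing characters outside the alphabet get weight 0 here.
    return bias + sum(weights.get(w, 0.0) * n for w, n in word_counts(seq, k).items())


def classify(seq, weights, k, bias=0.0):
    """Assign class 1 if the linear score is positive, else class 0."""
    return 1 if score(seq, weights, k, bias) > 0 else 0


# Toy data (hypothetical): class 0 alternates characters, class 1 has long runs.
class0 = ["ABABABABAB", "BABABABABA"]
class1 = ["AAAABBBBAA", "BBBBAAAABB"]
order = 1
weights = linear_weights(
    fit_log_transitions(class0, "AB", order),
    fit_log_transitions(class1, "AB", order),
)
print(classify("ABABAB", weights, order + 1))
print(classify("AAABBB", weights, order + 1))
```

In this sketch the order is fixed by hand. The abstract describes choosing the order (equivalently, the word length) by cross-validation, which could be layered on top by repeating the fit for several values of `order` and keeping the one with the lowest held-out misclassification rate.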
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Citation:
Linear discriminant analysis of character sequences using occurrences of words. Statistica Sinica, 2014.
Publisher:
Institute of Statistical Science
Journal:
Statistica Sinica
Issue Date:
Feb-2014
DOI:
10.5705/ss.2012.220
Type:
Article
ISSN:
10170405
Additional Links:
http://www3.stat.sinica.edu.tw/statistica/J24N1/J24N125/J24N125.html
Appears in Collections:
Articles; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Dutta, Subhajit | en
dc.contributor.author | Chaudhuri, Probal | en
dc.contributor.author | Ghosh, Anil | en
dc.date.accessioned | 2015-05-03T14:26:27Z | en
dc.date.available | 2015-05-03T14:26:27Z | en
dc.date.issued | 2014-02 | en
dc.identifier.citation | Linear discriminant analysis of character sequences using occurrences of words 2014 Statistica Sinica | en
dc.identifier.issn | 10170405 | en
dc.identifier.doi | 10.5705/ss.2012.220 | en
dc.identifier.uri | http://hdl.handle.net/10754/552136 | en
dc.description.abstract | Classification of character sequences, where the characters come from a finite set, arises in disciplines such as molecular biology and computer science. For discriminant analysis of such character sequences, the Bayes classifier based on Markov models turns out to have class boundaries defined by linear functions of occurrences of words in the sequences. It is shown that for such classifiers based on Markov models with unknown orders, if the orders are estimated from the data using cross-validation, the resulting classifier has Bayes risk consistency under suitable conditions. Even when Markov models are not valid for the data, we develop methods for constructing classifiers based on linear functions of occurrences of words, where the word length is chosen by cross-validation. Such linear classifiers are constructed using ideas of support vector machines, regression depth, and distance weighted discrimination. We show that classifiers with linear class boundaries have certain optimal properties in terms of their asymptotic misclassification probabilities. The performance of these classifiers is demonstrated in various simulated and benchmark data sets. | en
dc.publisher | Institute of Statistical Science | en
dc.relation.url | http://www3.stat.sinica.edu.tw/statistica/J24N1/J24N125/J24N125.html | en
dc.rights | Archived with thanks to Statistica Sinica | en
dc.subject | Bayes classifier | en
dc.subject | Markov and hidden Markov models | en
dc.subject | misclassification probability | en
dc.subject | order of a Markov model | en
dc.subject | V-fold cross-validation | en
dc.subject | word frequency | en
dc.title | Linear discriminant analysis of character sequences using occurrences of words | en
dc.type | Article | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.identifier.journal | Statistica Sinica | en
dc.eprint.version | Publisher's Version/PDF | en
dc.contributor.institution | Indian Statistical Institute | en
kaust.author | Dutta, Subhajit | en