Handle URI:
http://hdl.handle.net/10754/598323
Title:
Feature selection for high-dimensional integrated data
Authors:
Zheng, Charles; Schwartz, Scott; Chapkin, Robert S.; Carroll, Raymond J.; Ivanov, Ivan
Abstract:
Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors Xt are dependent on the multidimensional variate Y, and the remainder of the predictors constitute a “noise set” Xu independent of Y. Using Monte Carlo simulations, we investigated the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine “empirical bounds” on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods with respect to a recent infant intestinal gene expression and metagenomics dataset.
Citation:
Zheng C, Schwartz S, Chapkin RS, Carroll RJ, Ivanov I (2012) Feature selection for high-dimensional integrated data. Proceedings of the 2012 SIAM International Conference on Data Mining: 1141–1150. Available: http://dx.doi.org/10.1137/1.9781611972825.98.
Publisher:
Society for Industrial & Applied Mathematics (SIAM)
Journal:
Proceedings of the 2012 SIAM International Conference on Data Mining
KAUST Grant Number:
KUS-C1-016-04
Issue Date:
26-Apr-2012
DOI:
10.1137/1.9781611972825.98
Type:
Book Chapter
Sponsors:
We are indebted to the Texas A&M Brazos Computing Cluster and Institute of Developmental and Molecular Biology for access to computing resources, and to professors David B. Dahl, Mohsen Pourahmadi, and Joel Zinn for helpful discussions. The infant microarray-metagenomics data was provided courtesy of Sharon M. Donovan, of the Division of Nutritional Sciences, U. of Illinois, Urbana, IL. This publication is based in part on work supported by Award No. KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST).
Appears in Collections:
Publications Acknowledging KAUST Support

Full metadata record

DC Field | Value | Language
dc.contributor.author | Zheng, Charles | en
dc.contributor.author | Schwartz, Scott | en
dc.contributor.author | Chapkin, Robert S. | en
dc.contributor.author | Carroll, Raymond J. | en
dc.contributor.author | Ivanov, Ivan | en
dc.date.accessioned | 2016-02-25T13:18:42Z | en
dc.date.available | 2016-02-25T13:18:42Z | en
dc.date.issued | 2012-04-26 | en
dc.identifier.citation | Zheng C, Schwartz S, Chapkin RS, Carroll RJ, Ivanov I (2012) Feature selection for high-dimensional integrated data. Proceedings of the 2012 SIAM International Conference on Data Mining: 1141–1150. Available: http://dx.doi.org/10.1137/1.9781611972825.98. | en
dc.identifier.doi | 10.1137/1.9781611972825.98 | en
dc.identifier.uri | http://hdl.handle.net/10754/598323 | en
dc.description.abstract | Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors Xt are dependent on the multidimensional variate Y, and the remainder of the predictors constitute a “noise set” Xu independent of Y. Using Monte Carlo simulations, we investigated the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine “empirical bounds” on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods with respect to a recent infant intestinal gene expression and metagenomics dataset. | en
dc.description.sponsorship | We are indebted to the Texas A&M Brazos Computing Cluster and Institute of Developmental and Molecular Biology for access to computing resources, and to professors David B. Dahl, Mohsen Pourahmadi, and Joel Zinn for helpful discussions. The infant microarray-metagenomics data was provided courtesy of Sharon M. Donovan, of the Division of Nutritional Sciences, U. of Illinois, Urbana, IL. This publication is based in part on work supported by Award No. KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST). | en
dc.publisher | Society for Industrial & Applied Mathematics (SIAM) | en
dc.title | Feature selection for high-dimensional integrated data | en
dc.type | Book Chapter | en
dc.identifier.journal | Proceedings of the 2012 SIAM International Conference on Data Mining | en
dc.contributor.institution | Texas A&M Dept. Statistics | en
dc.contributor.institution | Texas A&M Program in Integrative Nutrition & Complex Diseases, Center for Environmental & Rural Health | en
dc.contributor.institution | Texas A&M Dept. Veterinary Physiology and Pharmacology | en
kaust.grant.number | KUS-C1-016-04 | en