RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

Handle URI:
http://hdl.handle.net/10754/627696
Title:
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Authors:
Kim, Ji-Sung ( 0000-0002-8966-529X ) ; Gao, Xin ( 0000-0002-7108-3574 ) ; Rzhetsky, Andrey ( 0000-0001-6959-7405 )
Abstract:
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10⁻⁹). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program; Computational Bioscience Research Center (CBRC)
Citation:
Kim J-S, Gao X, Rzhetsky A (2018) RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning. PLOS Computational Biology 14: e1006106. Available: http://dx.doi.org/10.1371/journal.pcbi.1006106.
Publisher:
Public Library of Science (PLoS)
Journal:
PLOS Computational Biology
KAUST Grant Number:
FCC/1/1976-04; URF/1/3007-01; URF/1/3450-01; URF/1/3454-01
Issue Date:
26-Apr-2018
DOI:
10.1371/journal.pcbi.1006106
Type:
Article
ISSN:
1553-7358
Sponsors:
The study was supported by funds from the Defense Advanced Research Projects Agency, contract W911NF1410333 to AR, http://www.darpa.mil/program/big-mechanism; the National Heart, Lung, and Blood Institute, award R01HL122712 to AR, https://www.nhlbi.nih.gov; the National Institute of Mental Health, award P50 MH094267 to AR, https://grants.nih.gov/grants/guide/pa-files/PAR-14-120.html; the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR), awards FCC/1/1976-04, URF/1/3007-01, URF/1/3450-01 and URF/1/3454-01 to XG; and a gift from Liz and Kent Dauten to AR. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Additional Links:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006106
Appears in Collections:
Articles; Computer Science Program; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Kim, Ji-Sung | en
dc.contributor.author | Gao, Xin | en
dc.contributor.author | Rzhetsky, Andrey | en
dc.date.accessioned | 2018-04-30T06:58:23Z | -
dc.date.available | 2018-04-30T06:58:23Z | -
dc.date.issued | 2018-04-26 | en
dc.identifier.citation | Kim J-S, Gao X, Rzhetsky A (2018) RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning. PLOS Computational Biology 14: e1006106. Available: http://dx.doi.org/10.1371/journal.pcbi.1006106. | en
dc.identifier.issn | 1553-7358 | en
dc.identifier.doi | 10.1371/journal.pcbi.1006106 | en
dc.identifier.uri | http://hdl.handle.net/10754/627696 | -
dc.publisher | Public Library of Science (PLoS) | en
dc.relation.url | http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006106 | en
dc.rights | This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. | en
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | en
dc.title | RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning | en
dc.type | Article | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science Program | en
dc.contributor.department | Computational Bioscience Research Center (CBRC) | en
dc.identifier.journal | PLOS Computational Biology | en
dc.eprint.version | Post-print | en
dc.contributor.institution | Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America. | en
dc.contributor.institution | Institute for Genomics and Systems Biology, Computation Institute, Departments of Medicine and Human Genetics, University of Chicago, Chicago, Illinois, United States of America. | en
kaust.author | Gao, Xin | en
kaust.grant.number | FCC/1/1976-04 | en
kaust.grant.number | URF/1/3007-01 | en
kaust.grant.number | URF/1/3450-01 | en
kaust.grant.number | URF/1/3454-01 | en