RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Type
Article
Authors
Kim, Ji-Sung
Gao, Xin
Rzhetsky, Andrey
KAUST Department
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Computer Science Program
Computational Bioscience Research Center (CBRC)
KAUST Grant Number
FCC/1/1976-04
URF/1/3007-01
URF/1/3450-01
URF/1/3454-01
Date
2018-04-26
Permanent link to this record
http://hdl.handle.net/10754/627696
Abstract
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10^-9). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
Citation
Kim J-S, Gao X, Rzhetsky A (2018) RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning. PLOS Computational Biology 14: e1006106. Available: http://dx.doi.org/10.1371/journal.pcbi.1006106.
Sponsors
The study was supported by funds from the Defense Advanced Research Projects Agency, contract W911NF1410333 to AR, http://www.darpa.mil/program/big-mechanism, the National Heart, Lung, and Blood Institute, award R01HL122712 to AR, https://www.nhlbi.nih.gov, the National Institute of Mental Health, award P50 MH094267 to AR, https://grants.nih.gov/grants/guide/pa-files/PAR-14-120.html, the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR), awards FCC/1/1976-04, URF/1/3007-01, URF/1/3450-01 and URF/1/3454-01 to XG, and a gift from Liz and Kent Dauten to AR. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publisher
Public Library of Science (PLoS)
Journal
PLOS Computational Biology
PubMed ID
29698408
arXiv
1707.01623
DOI
10.1371/journal.pcbi.1006106
Except where otherwise noted, this item's license is described as follows: "This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited."
Related articles
- Concordance between self-reported race/ethnicity and that recorded in a Veteran Affairs electronic medical record.
  - Authors: Hamilton NS, Edelman D, Weinberger M, Jackson GL
  - Issue date: 2009 Jul-Aug
- Missing race/ethnicity data in Veterans Health Administration based disparities research: a systematic review.
  - Authors: Long JA, Bamba MI, Ling B, Shea JA
  - Issue date: 2006 Feb
- Imputing race and ethnic information in administrative health data.
  - Authors: Xue Y, Harel O, Aseltine RH Jr
  - Issue date: 2019 Aug
- The justification of race in biological explanation.
  - Authors: Lorusso L
  - Issue date: 2011 Sep
- Harmonizing Genetic Ancestry and Self-identified Race/Ethnicity in Genome-wide Association Studies.
  - Authors: Fang H, Hui Q, Lynch J, Honerlaw J, Assimes TL, Huang J, Vujkovic M, Damrauer SM, Pyarajan S, Gaziano JM, DuVall SL, O'Donnell CJ, Cho K, Chang KM, Wilson PWF, Tsao PS, VA Million Veteran Program, Sun YV, Tang H
  - Issue date: 2019 Oct 3