Missing Data Imputation for Healthcare Data
Clinical diagnosis often relies at least in part on the results of multiple laboratory tests. While many individual tests have well-characterized diagnostic properties (e.g., sensitivity), the relationships among tests within a set, and the diagnostic properties of the set as a whole, often remain unknown.
My colleagues and I are exploring the relationship between several target tests and other tests commonly ordered alongside them (predictor tests). By imputing results for missing predictor tests, we showed that patient demographics and the results of other laboratory tests measured alongside the target tests can discriminate normal from abnormal target test results (ESR and ferritin in our work) with AUCs above 0.95 on held-out test data.
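The impute-then-discriminate workflow described above can be illustrated with a minimal sketch. It assumes simple column-mean imputation as a stand-in (not the imputation method used in our work), a summed score in place of a trained model, and hypothetical toy data; the function names `mean_impute` and `auc` are illustrative only.

```python
from statistics import mean


def mean_impute(rows):
    """Fill None entries with the mean of the observed values in that column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [
        [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


def auc(labels, scores):
    """AUC as the probability that a random abnormal case (label 1)
    outscores a random normal case (label 0); ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two predictor tests per patient; None = not ordered.
predictors = [[1.0, None], [2.0, 4.0], [None, 8.0], [5.0, 10.0]]
labels = [0, 0, 1, 1]  # 1 = abnormal target test result

filled = mean_impute(predictors)
scores = [sum(row) for row in filled]  # stand-in score, not a trained model
print(auc(labels, scores))
```

In practice the score would come from a classifier fit on training data and evaluated on held-out patients; the pairwise definition of AUC used here is equivalent to the area under the ROC curve.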
In parallel with investigating the information shared among common laboratory tests, I have been exploring the utility of that shared information in diagnosis-related tasks such as mortality risk prediction in ICUs. The full panel of hundreds of measurements and tests can be overwhelming for clinicians to grasp, let alone their temporal progressions. At the same time, we lack an automated approach for grouping temporal progression patterns and exploring their information redundancies in a way that has diagnostic utility. I observed that, with proper discretization, physiologic time series can be converted to graphs, and temporal progression patterns can be captured as subgraphs. Moreover, simultaneous abnormality among subgraphs can serve as an atomic feature for correlating temporal progression patterns, which fits the SANTF model. This approach clustered patients more accurately with respect to mortality risk and identified groups of temporal progression patterns as mortality risk factors. This success suggests that information shared among common laboratory tests has potential utility in diagnostic tasks.
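As a rough illustration of the discretize-then-graph idea, the sketch below bins each measurement and records a directed edge between consecutive discretized states, so that each edge is a small two-node temporal pattern. This is a simplified assumption about how such a conversion could look, not the actual SANTF pipeline; the cutoffs, variable names, and values are hypothetical and are not clinical thresholds.

```python
from bisect import bisect_right
from collections import Counter


def discretize(value, cutoffs):
    """Map a numeric value to a bin index given sorted cutoffs."""
    return bisect_right(cutoffs, value)


def series_to_edges(name, values, cutoffs):
    """Turn one time series into directed edges between consecutive
    discretized states, each state being a (variable, bin) node."""
    states = [(name, discretize(v, cutoffs)) for v in values]
    return list(zip(states, states[1:]))


def edge_counts(patient, cutoffs):
    """Count how often each state-to-state edge occurs for one patient."""
    graph = Counter()
    for name, values in patient.items():
        graph.update(series_to_edges(name, values, cutoffs[name]))
    return graph


# Hypothetical patient record and bin cutoffs (illustrative only).
patient = {"heart_rate": [72, 95, 130, 128], "lactate": [1.0, 2.5, 4.5]}
cutoffs = {"heart_rate": [60, 100], "lactate": [2.0, 4.0]}

graph = edge_counts(patient, cutoffs)
for (src, dst), count in sorted(graph.items()):
    print(src, "->", dst, count)
```

Edge counts like these could then be compared across patients to find recurring progression patterns; richer subgraphs (longer paths, or patterns spanning several variables) would need a dedicated subgraph mining step.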
Select Publications
- Using Machine Learning to Predict Laboratory Test Results. American Journal of Clinical Pathology, 2016
- 3D-MICE: Integration of Cross-Sectional and Longitudinal Imputation for Multi-Analyte Longitudinal Clinical Data. Journal of the American Medical Informatics Association, 2018
- Evaluating the state-of-the-art in missing data imputation for clinical data. Briefings in Bioinformatics, 2022
- IRTCI: Item Response Theory for Categorical Imputation