Our earlier research focused on augmenting medical natural language processing with statistical machine learning models, targeting applications such as de-identifying private health information in electronic medical records (EMRs), detecting patient smoking status from clinical narrative text, and automating semantic relation discovery from medical literature.
Lab methodologies: Our recent work focuses on subgraph mining and factorization models applied to clinical narrative text, ICU physiologic time series, and computational genomics. The common aim of this work is to build clinical models that improve both prediction accuracy and interpretability by exploring the relational information in each data modality.
We have built and extended these models to implicate neurodevelopmentally coregulated exon clusters in phenotypes of Autism Spectrum Disorder (ASD), predict mortality risk of ICU patients from their physiologic measurement time series, and identify subtypes of lymphoma patients from pathology report text. We demonstrated how to automatically extract relational information into a graph representation and how to collect the important subgraphs of interest. Depending on the degree of structure in the data format, heavier machinery of factorization models becomes necessary to reliably group important subgraphs. These methods lead not only to improved performance but also to better interpretability in each application.
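As a rough illustration of the graph-extraction and subgraph-collection steps, the following minimal sketch turns short token sequences into graphs and keeps the small connected subgraphs that recur across them. It is a toy under stated assumptions, not the lab's actual pipeline: the function names `sentence_graph` and `frequent_subgraphs`, the adjacency-window graph construction, the support threshold, and the example phrases are all hypothetical, and the factorization step that would group the resulting subgraphs is omitted.

```python
from collections import Counter
from itertools import combinations


def sentence_graph(tokens, window=1):
    """Return undirected edges linking tokens at most `window` positions apart.

    Assumption: token adjacency stands in for whatever relational
    information a real extraction step would produce.
    """
    edges = set()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            if a != b:  # skip self-loops from repeated tokens
                edges.add(tuple(sorted((a, b))))
    return edges


def frequent_subgraphs(graphs, min_support=2):
    """Keep one-edge and connected two-edge subgraphs found in >= min_support graphs."""
    support = Counter()
    for edges in graphs:
        found = {frozenset([e]) for e in edges}
        for e1, e2 in combinations(sorted(edges), 2):
            if set(e1) & set(e2):  # the two edges share a node, so they are connected
                found.add(frozenset([e1, e2]))
        support.update(found)  # each graph contributes at most once per subgraph
    return {sg: n for sg, n in support.items() if n >= min_support}


# Hypothetical toy phrases, for illustration only.
docs = [
    "large b cell lymphoma".split(),
    "diffuse large b cell lymphoma".split(),
    "follicular lymphoma grade two".split(),
]
graphs = [sentence_graph(d) for d in docs]
for subgraph, count in frequent_subgraphs(graphs).items():
    print(sorted(subgraph), count)
```

In a fuller treatment, the counts of such subgraphs per document (or per patient) would form a matrix or tensor that a factorization model could decompose into groups of co-occurring subgraphs.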