Automated Computational Phenotyping

Currently, cancer classification guidelines (e.g., for breast cancer, lung cancer, lymphoma, or leukemia) are constructed largely through an expert-driven, manual process. As a result, the number of cases reviewed during guideline construction or revision is limited, and many ethnic groups (e.g., Asian or Latin American populations) are often underrepresented. More broadly, as pathology advances, what previously constituted a single cancer category is now often regarded as multiple diseases or even a spectrum of diseases. This shift could have a substantial impact on society if sub-cohorts of cancer patients who share omic and phenotypic signatures, and who can benefit from targeted medications, can be identified automatically.

Our research has shown that machine learning can enable automated analysis and knowledge extraction from the large volumes of cancer patient EHRs archived at institutions like Northwestern Medicine. We tackled the problem by modeling pathology report text using both atomic features that are statistically robust (e.g., the words in a pathology report) and higher-order features that are naturally interpretable and discriminative (e.g., the relations among medical concepts, captured as subgraphs of the graphs that represent entire sentences). For example, a single relation such as "[large atypical cells] express [CD30]" supports some degree of belief that the patient might have Hodgkin lymphoma. Adding "[large atypical cells] have [Reed-Sternberg appearance]" increases that belief. Intuitively, a group of relations (subgraphs) corresponds to a panel of morphologic and immunophenotypic features that are used as diagnostic criteria in the WHO guideline. To this end, we use atomic features to help group subgraph features (e.g., the relations above all share the words "large", "atypical", and "cells"). The data-driven grouping is achieved via non-negative tensor factorization, hence the name subgraph augmented non-negative tensor factorization. In this method, correlations among patient subsets, word and phrase usage, and linguistic relations mutually constrain the discovery of structure in the data. Being able to correlate patient phenotype with a group of relations renders the model both more accurate and more interpretable.
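The grouping idea can be illustrated with a minimal sketch. This is not the published implementation. It assumes a three-way count tensor indexed by patient, relation (subgraph), and word, and it fits a non-negative CP factorization with standard multiplicative updates under a squared-error loss. The toy `records`, the placeholder relations, the function name `nonneg_cp`, and the choice of rank 2 are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: (patient id, relation text found in that patient's report).
# The two "large atypical cells ..." relations come from the text above;
# the "placeholder ..." relations are made up so that a second group exists.
records = [
    ("p0", "large atypical cells express CD30"),
    ("p0", "large atypical cells have Reed-Sternberg appearance"),
    ("p1", "large atypical cells express CD30"),
    ("p1", "large atypical cells have Reed-Sternberg appearance"),
    ("p2", "placeholder relation alpha"),
    ("p3", "placeholder relation alpha"),
    ("p3", "placeholder relation beta"),
]

patients = sorted({p for p, _ in records})
relations = sorted({r for _, r in records})
words = sorted({w for _, r in records for w in r.split()})

# X[p, s, w] counts how often word w occurs in relation s for patient p.
X = np.zeros((len(patients), len(relations), len(words)))
for p, r in records:
    for w in r.split():
        X[patients.index(p), relations.index(r), words.index(w)] += 1.0


def nonneg_cp(X, rank, n_iter=300, eps=1e-9, seed=0):
    """Non-negative CP factorization of a 3-way tensor (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_i, n_j, n_k = X.shape
    # Strictly positive initialization keeps every factor non-negative.
    A = rng.random((n_i, rank)) + 0.1
    B = rng.random((n_j, rank)) + 0.1
    C = rng.random((n_k, rank)) + 0.1
    for _ in range(n_iter):
        # Each update multiplies a factor by (data term) / (model term).
        A *= np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C


A, B, C = nonneg_cp(X, rank=2)

# Reconstruct the tensor from the factors and measure the relative fit.
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")

# For each latent group, show its highest-loading patient, relation, and word.
for r in range(A.shape[1]):
    print(
        f"group {r}:",
        patients[int(np.argmax(A[:, r]))],
        "|", relations[int(np.argmax(B[:, r]))],
        "|", words[int(np.argmax(C[:, r]))],
    )
```

In this sketch each column of `A`, `B`, and `C` describes one latent group along the patient, relation, and word modes respectively, so relations that share words and co-occur in the same patients tend to load on the same group. Because the factorization starts from a random initialization, the groups it finds can vary with the seed and are not guaranteed to separate the toy cohorts cleanly.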
