Northwestern University Feinberg School of Medicine


Automatically constructing the lymphoma classification guideline

The current revision of the World Health Organization (WHO) lymphoma classification guideline took more than one year and involved an eight-member steering committee and over 130 pathologists and hematologists worldwide. Moreover, only around 1,400 cases from Europe and North America were reviewed, which leaves the guideline prone to manual overfitting and selection bias. We see no reason not to tap into the large collection of archived pathology reports to automate the construction and revision of lymphoma classification guidelines, provided that the computational model is accurate and interpretable.

We tackled the problem by modeling pathology report text using both atomic features that are statistically robust (e.g., the words in a pathology report) and higher-order features that are naturally interpretable and discriminative (e.g., the relations among medical concepts, captured as subgraphs of the graphs that represent entire sentences). For example, a single relation such as "[large atypical cells] express [CD30]" supports some belief that the patient might have Hodgkin lymphoma. Adding "[large atypical cells] have [Reed-Sternberg appearance]" increases that belief. Intuitively, a group of relations (subgraphs) corresponds to a panel of morphologic and immunophenotypic features that are used as diagnostic criteria in the WHO guideline. To this end, we use atomic features to help group subgraph features (e.g., the above relations both share the words "large", "atypical" and "cells"). The data-driven grouping is achieved via non-negative tensor factorization, hence the name subgraph augmented non-negative tensor factorization. In this method, correlations among patient subsets, word and phrase usage, and linguistic relations mutually constrain the discovery of structure in the data. Being able to correlate patient phenotype with a group of relations renders the model both more accurate and more interpretable.
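The grouping idea can be illustrated with a small sketch. This is not the project's implementation: it assumes a toy count tensor indexed by patient, subgraph, and word, and fits a generic non-negative CP factorization with multiplicative updates in NumPy. The tensor values, the rank, and the function names (khatri_rao, unfold, nonneg_cp, reconstruct) are all hypothetical choices made for illustration.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product; row (j, k) holds B[j] * C[k]."""
    rank = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, rank)

def unfold(X, mode):
    """Matricize X so that `mode` indexes the rows (C-order columns)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def reconstruct(factors):
    """Rebuild the 3-way tensor from its three factor matrices."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def nonneg_cp(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Non-negative CP factorization of a 3-way tensor.

    Uses multiplicative updates, which keep every factor entry >= 0
    when the data and the initial factors are non-negative.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(X, mode) @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps
            factors[mode] *= numer / denom
    return factors

# Hypothetical toy data: 4 patients x 3 subgraphs x 3 words.
# Patients 0-1 share one group of subgraphs and words, patients 2-3 another.
patients = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
subgraphs = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
words = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
X = reconstruct([patients, subgraphs, words])

factors = nonneg_cp(X, rank=2)
patient_f, subgraph_f, word_f = factors

# Each column is one latent group; large entries in subgraph_f and word_f
# indicate which relations and words the group ties together.
print(np.round(subgraph_f, 2))
print(np.round(word_f, 2))
```

In this sketch each latent component links a subset of patients to a set of subgraphs and a set of words at the same time, which is the sense in which the three modes constrain one another. A real pipeline would also need relation extraction from report text and a principled choice of rank, neither of which is shown here.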

Data: Massachusetts General Hospital (MGH)

Main Collaborators:

Media Coverage: