Multimodal Machine Learning Analysis of over 220,000 Tumor Profiles to Accurately Diagnose Molecular and Morphological Subtypes of Cancer


Jim Abraham, Sergey Klimov, Eghbal Amidi, Amy Heimberger, John Marshall, Elisabeth Heath, Joseph Drabick, Brian Rubin, Rouba Ali-Fehmi, David Braxton, George Sledge, David Bryant, Curtis Johnston, Hassan Ghani, Matthew Oberley, David Spetzler

Background: The diagnosis of a malignancy is typically informed by clinical presentation and tumor tissue features including cell morphology, immunohistochemistry, and molecular markers. Additionally, multi-omic approaches [1] and deep learning models using digital pathology [2] have augmented expert pathologists and led to improved diagnoses, but are often not employed together on the same patient. The opportunity exists for a truly multimodal, multi-omic machine learning classifier that comprehensively assesses all aspects of a tumor, from the molecular underpinnings to the morphological and histological phenotypic presentation, to provide the most accurate diagnosis while at the same time providing predictive biomarker data from the same specimen.

Methods: Whole transcriptome data from 220,246 tumor profiles, large panel and whole exome data from over 170,000 tumor profiles, and digital pathology features from over 50,000 tumors were used to construct a multi-lineage classifier. The classifier was trained on 256 OncoTree [3] classifications corresponding to established WHO diagnoses, each of which was observed at least 30 times in our dataset. The dataset was split evenly, with 50% used for training and 50% for testing. UMAP was employed for dimensionality reduction, and ensemble models were used for making the final calls. Truth was established by traditional pathologist-directed diagnostic workup.
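
The class-count threshold, the 50/50 split, and the ensemble call described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say whether the split was stratified by class or how the ensemble combined its members, so per-class stratification and majority voting are assumed here, and the function names and toy labels are hypothetical. The UMAP step is omitted because it needs a third-party library.

```python
import random
from collections import Counter, defaultdict

MIN_CASES_PER_CLASS = 30  # threshold stated in the Methods


def filter_rare_classes(labels, min_cases=MIN_CASES_PER_CLASS):
    """Return indices of samples whose class has at least min_cases samples."""
    counts = Counter(labels)
    return [i for i, lab in enumerate(labels) if counts[lab] >= min_cases]


def stratified_half_split(indices, labels, seed=0):
    """Split indices 50/50 within each class (stratification is an assumption)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        half = len(members) // 2
        train.extend(members[:half])
        test.extend(members[half:])
    return sorted(train), sorted(test)


def majority_vote(predictions):
    """Combine one prediction per model into a single call (assumed rule)."""
    return Counter(predictions).most_common(1)[0][0]


# Toy labels: classes "A" and "B" pass the threshold, class "C" does not.
labels = ["A"] * 40 + ["B"] * 30 + ["C"] * 5
kept = filter_rare_classes(labels)
train_idx, test_idx = stratified_half_split(kept, labels)
final_call = majority_vote(["A", "A", "B"])
```

With these toy labels, the five "C" samples are dropped before splitting, and each remaining class contributes half of its samples to each partition.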

Results: For cases where the primary site was known, tumor lineage classifiers predicted the correct classification with accuracies ranging from 97% to 100% when using the 32 highest-level OncoTree categories corresponding to human tissues. Accuracy on the most granular OncoTree categories varied, with many between 90% and 95%. When applied to cancer of unknown primary (CUP) cases (n = 3589), an unequivocal OncoTree classification could be obtained over 90% of the time.
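
One way accuracy at two levels of granularity could be computed is to score the granular labels directly and then map each label to its parent tissue category and score again. The sketch below assumes this roll-up approach, which the abstract does not confirm; the label names and the `parent_of` mapping are hypothetical placeholders, not real OncoTree codes.

```python
def accuracy(truth, predicted):
    """Fraction of samples whose predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)


def roll_up(labels, parent_of):
    """Map each granular label to its top-level tissue category."""
    return [parent_of[lab] for lab in labels]


# Hypothetical two-level hierarchy: granular subtype -> tissue category.
parent_of = {
    "subtype_a1": "tissue_a",
    "subtype_a2": "tissue_a",
    "subtype_b1": "tissue_b",
}

truth = ["subtype_a1", "subtype_a2", "subtype_b1", "subtype_a1"]
predicted = ["subtype_a1", "subtype_a1", "subtype_b1", "subtype_a1"]

granular_acc = accuracy(truth, predicted)
tissue_acc = accuracy(roll_up(truth, parent_of), roll_up(predicted, parent_of))
```

In this toy case the one granular error confuses two subtypes of the same tissue, so it counts against granular accuracy but not against tissue-level accuracy.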


Conclusions: Combining multi-omic and digital pathology information into a multimodal artificial intelligence platform can provide comprehensive information to pathologists to aid in diagnosis. This tool can address the unmet clinical need to define the lineage of CUP cases, which, when coupled with biomarker data, will provide an opportunity to examine whether this information can be used to improve the outcomes of patients with CUP.
