Machine Learning Algorithm Analysis using 55,780 Cases from a Commercial 592-gene NGS Panel to Accurately Predict Tumor Type for Carcinoma of Unknown Primary (CUP)


Jim Abraham, Amy B. Heimberger, John Marshall, Joanne Xiu, Anthony Helmstetter, Daniel Magee, Adam Morgan, Curtis Johnston, Zoran Gatalica, Wolfgang Michael Korn, David Spetzler


The diagnosis of a malignancy is typically informed by clinical presentation and tumor tissue features including cell morphology, immunohistochemistry, cytogenetics, and molecular markers. However, in approximately 5-10% of cancers [1,2], ambiguity is high enough that no tissue of origin can be determined, and the specimen is labeled as a Cancer of Occult/Unknown Primary (CUP). The lack of a reliable tumor classification poses a significant treatment dilemma for the oncologist and can lead to inappropriate and/or delayed treatment. Gene expression profiling has been used to try to identify the tumor type for CUP patients, but it suffers from a number of inherent limitations. Specifically, tumor percentage, variation in expression, and the dynamic nature of RNA all contribute to suboptimal performance. For example, one commercial RNA-based assay had a sensitivity of 83% in a test set of 187 tumors and confirmed results on only 78% of a separate 300-sample validation set [3].
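As a point of reference for the sensitivity figures quoted above, the following is a minimal sketch of how per-tumor-type sensitivity and overall accuracy can be computed from true and predicted labels. The function name `per_class_sensitivity` and the labels are hypothetical and synthetic; they are not taken from the study or from any commercial assay, and the cited assay may define its reported sensitivity differently.

```python
from collections import Counter


def per_class_sensitivity(true_labels, predicted_labels):
    """Return {tumor_type: fraction of cases of that type predicted correctly}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    # Number of cases of each true tumor type.
    totals = Counter(true_labels)
    # Number of cases of each type whose prediction matches the truth.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Synthetic, illustrative labels only (assumption; not data from the study).
true_labels = ["colon", "colon", "colon", "colon", "breast", "breast", "lung", "lung"]
predicted_labels = ["colon", "colon", "colon", "lung", "breast", "breast", "lung", "breast"]

sensitivity = per_class_sensitivity(true_labels, predicted_labels)
overall_accuracy = sum(
    t == p for t, p in zip(true_labels, predicted_labels)
) / len(true_labels)

print(sensitivity)
print(overall_accuracy)
```

Reporting sensitivity per tumor type alongside overall accuracy makes it visible when a classifier performs well on common tumor types but poorly on rare ones.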


NGS data from 55,780 tumor patients were used to construct a multi-parameter, tumor-type-specific classification system using an advanced machine learning approach.


  • Final performance of DNA-based tumor type identification on an independent test set of 15,000+ patient samples is superior to current standards using gene-expression-based methods
  • Unbiased machine learning techniques trained on more than 45,000 cases enabled detection of tumor types independent of sampling location or tumor percentage
  • Tumor type predictors can render a histologic diagnosis for CUP cases that can inform treatment and potentially improve outcomes
  • Cancer of unknown primary remains a substantial problem for both clinicians and patients; diagnosis can be aided by the algorithms presented here
  • Returning both diagnostic and therapeutic information that optimizes a patient's treatment strategy from a single test is a substantial improvement over the current standard of multiple tests that require more tissue
