An Exploration of Machine Learning Methods & Potential Biases vis-à-vis Pancreatic Cancer

Over the past three months, I have had the opportunity not only to further my computer science skills in machine learning, but also to explore a new realm of possibility in research. My project sat at the intersection of computer science and biomedical research, a combination I was eager to explore as I developed my skills as a researcher.

Our research used three datasets obtained from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial of the late 1990s and early 2000s. Each dataset consisted of cases and controls (patients with and without pancreatic cancer, the disease we studied), and each carried a different degree of bias. For instance, Datasets 1 and 2 had a clear selection bias, which occurs when the sample is not representative of the population: both were 85% male. Dataset 2 also had a high percentage of missing data. Dataset 3 was the least biased, but it was also the smallest in number of both cases and controls.
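As a rough illustration of how the two problems described above (an unrepresentative sample and missing values) can be checked before any modeling, here is a minimal sketch in plain Python. The function name `summarize_bias`, the column names, and the toy records are all hypothetical and are not taken from the PLCO data or from our actual pipeline.

```python
from collections import Counter


def summarize_bias(records, group_key="sex"):
    """Return (group proportions, per-column missing fractions) for a list of dict rows."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)

    # Share of each group (e.g. sex) in the sample; a strong skew hints at selection bias.
    group_counts = Counter(row.get(group_key) for row in records)
    proportions = {group: count / n for group, count in group_counts.items()}

    # Fraction of rows in which each column is missing (absent or None).
    columns = sorted({key for row in records for key in row})
    missing = {col: sum(row.get(col) is None for row in records) / n for col in columns}
    return proportions, missing


# Hypothetical toy rows for illustration only (not real patient data).
records = [
    {"sex": "M", "age": 64, "bmi": 27.1},
    {"sex": "M", "age": 70, "bmi": None},
    {"sex": "M", "age": 59, "bmi": 31.0},
    {"sex": "F", "age": 66, "bmi": None},
]

proportions, missing = summarize_bias(records)
print("Group proportions:", proportions)
print("Missing fractions:", missing)
```

A summary like this only flags the imbalance; deciding how to correct for it (resampling, stratifying, or imputing missing values) is a separate step.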

The primary objectives of my project were to (1) establish a reproducible analysis pipeline that allows for statistical and visual comparisons across the eight machine learning methods we tested, as well as across datasets, (2) compare model accuracies and feature importance scores to assess method performance and the influence of the aforementioned biases on that performance, and (3) evaluate whether adding dietary variables improves predictive accuracy.
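To give a sense of what comparing methods on equal footing can look like, the sketch below runs every method through the same seeded k-fold cross-validation and reports mean accuracy. It is an assumption-laden toy, not our lab's pipeline: the two stand-in "methods" (a majority-class baseline and a one-feature threshold rule), the helper names, and the synthetic data are all hypothetical, and the eight methods we actually tested are not reproduced here.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices with a fixed seed and split them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy of a method over k folds.

    `fit(X_train, y_train)` must return a function mapping one row to a label.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    label = max(set(y), key=y.count)
    return lambda row: label


def fit_threshold(X, y):
    """Toy rule: predict 1 when feature 0 exceeds the midpoint of the class means.

    Assumes both classes appear in the training rows.
    """
    pos = [row[0] for row, label in zip(X, y) if label == 1]
    neg = [row[0] for row, label in zip(X, y) if label == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda row: 1 if row[0] > cut else 0


# Synthetic, well-separated data for illustration only: 10 controls, 10 cases.
X = [[float(i)] for i in range(10)] + [[100.0 + i] for i in range(10)]
y = [0] * 10 + [1] * 10

methods = {"majority_baseline": fit_majority, "threshold_rule": fit_threshold}
# Same folds and seed for every method, so the scores are directly comparable.
results = {name: cross_val_accuracy(fit, X, y, k=5, seed=0) for name, fit in methods.items()}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Fixing the seed and reusing the same folds for every method is what makes a run repeatable and the accuracies comparable; the same harness can then be pointed at each dataset in turn.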

In the end, we found that the machine learning algorithm developed by our lab achieved the highest model accuracy when compared to traditional models, which was very exciting to see. We also concluded that the aforementioned biases yield falsely inflated model accuracies, and that adding dietary variables to our analysis did seem to improve the predictive accuracy of most models. Lastly, we were able to verify that the variables our machine learners identified as most important in determining whether a patient has pancreatic cancer were consistent with the historically known causal agents of the disease.

All in all, this summer offered me a glimpse into the world of research and academia, which I had not previously been able to explore as much as I wanted. I learned everything from how to set up a rigorous machine learning pipeline to how to make and present an academic research poster, and I thank PURM and my PI, Dr. Ryan Urbanowicz, for all of it. In the lab, I connected with someone whom I consider a great mentor and friend, and I learned more than I thought possible in one summer.