Distance Metrics in Clustering Single-Cell RNA Sequencing Data





Project Summary

This summer I had the opportunity to work in the Department of Biology under the direction of Dr. Junhyong Kim, studying how to cluster single-cell RNA sequencing data. The data form a matrix of cells from the roundworm C. elegans, with each cell's mRNA expression level recorded for every gene in the organism. The goal of the project is to assign each cell to a cluster so that cells within the same cluster are as similar as possible. Because the organism has thousands of genes, the data set is extremely high-dimensional. In high dimensions, the standard Euclidean distance becomes less effective at distinguishing near points from far ones, so I investigated whether other definitions of distance could improve the clustering results. In addition, because the data set contains cells at different stages of development and cells with differing fates, the data points exhibit a branching pattern. I also looked into ways of adapting the clustering to account for these branching shapes.
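The point about Euclidean distance in high dimensions can be illustrated with a small sketch. It does not use the C. elegans data: it assumes synthetic points drawn uniformly at random, and the function names (relative_contrast, correlation_distance) are hypothetical names chosen for illustration. Correlation distance appears only as one example of an alternative definition of distance, not necessarily one used in the project.

```python
import math
import random


def euclidean(u, v):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_distance(u, v):
    """1 minus the Pearson correlation of u and v (an example alternative)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from one reference point.

    Assumption: points are synthetic, drawn uniformly from the unit cube.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    ref, others = points[0], points[1:]
    dists = [euclidean(ref, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast between nearest and farthest neighbors shrinks as the
    # number of dimensions grows.
    for n_dims in (2, 20, 2000):
        print(n_dims, round(relative_contrast(200, n_dims), 3))
```

Under these assumptions, the relative contrast is large in 2 dimensions and small in 2000, meaning the nearest and farthest points sit at nearly the same Euclidean distance from the reference point. That is the sense in which Euclidean distance loses its ability to separate near from far in high-dimensional data.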

I learned a great deal from my summer research experience, as it reinforced concepts in probability and also led me to learn some concepts in machine learning. I also learned that, unlike in class, where problems have a clear answer and a succinct solution, research often involves an ambiguous goal, and the path to that goal is non-linear, requiring patience and running into many dead ends. Research has taught me to think more about my approach to a problem rather than simply getting to a solution.