Reviewing and developing distributed algorithms for multi-center learning

Geshi Yeung




Associate Professor of Biostatistics

Project Summary

As a research assistant under the mentorship of Dr. Yong Chen, I’ve focused on the privacy-preserving distributed algorithm (PDA) framework. My work has three major components: 1) reviewing and simulating algorithms previously developed and published by Dr. Chen, including the one-shot algorithms ODAL and ODAC, which perform distributed logistic and Cox regression respectively; 2) searching the literature for other existing distributed algorithms that could be improved by making them one-shot; 3) developing from the ground up a website that real-world research sites can use to implement the lab’s algorithms.

With the growing adoption of electronic health records, a large amount of patient-level data is available for statistical and machine learning tasks. The PDA framework addresses the setting in which each research site (often a university or hospital) holds a large amount of patient-level data and the sites hope to collaborate to solve a regression problem. The algorithm requires sites to share with one another only aggregate data, such as the first- and second-order gradients of a loss function. This allows statistical inference to be performed without sharing patient-level data.
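To make the idea of sharing only aggregate data concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not the ODAL or ODAC procedure: it uses simulated data, a plain logistic loss, and an ordinary multi-round Newton update rather than a one-shot estimator. The function names (local_summaries, aggregate_newton_step, simulate_sites, fit_distributed) are hypothetical and chosen only for this example.

```python
import numpy as np


def local_summaries(X, y, beta):
    """Run at a single site: return aggregate quantities only, never rows of X or y."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities at the current beta
    grad = X.T @ (p - y)                       # gradient of the summed logistic loss
    hess = (X * (p * (1.0 - p))[:, None]).T @ X  # Hessian of the summed logistic loss
    return grad, hess, len(y)


def aggregate_newton_step(beta, summaries):
    """Run at the coordinating site: combine the site summaries into one Newton update."""
    grad = sum(g for g, _, _ in summaries)
    hess = sum(h for _, h, _ in summaries)
    return beta - np.linalg.solve(hess, grad)


def simulate_sites(n_sites=3, n_per_site=500, seed=0):
    """Simulated data for illustration only; the coefficients are arbitrary."""
    rng = np.random.default_rng(seed)
    true_beta = np.array([0.5, -1.0, 0.25])
    sites = []
    for _ in range(n_sites):
        X = np.column_stack([np.ones(n_per_site), rng.normal(size=(n_per_site, 2))])
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))
        sites.append((X, y))
    return sites, true_beta


def fit_distributed(sites, n_rounds=10):
    """Multi-round fit: each round, every site sends only (gradient, Hessian, n)."""
    beta = np.zeros(sites[0][0].shape[1])
    for _ in range(n_rounds):
        summaries = [local_summaries(X, y, beta) for X, y in sites]
        beta = aggregate_newton_step(beta, summaries)
    return beta


if __name__ == "__main__":
    sites, true_beta = simulate_sites()
    print("estimate:", fit_distributed(sites))
    print("truth:   ", true_beta)
```

Because the logistic loss is a sum over patients, adding the per-site gradients and Hessians gives the same quantities as pooling all records, so this sketch reproduces the pooled fit while exchanging only aggregates. It needs one communication round per Newton step, which is exactly the cost that one-shot methods aim to avoid.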

My literature review has found that one-shot distributed algorithms are rare, especially in the machine learning field. Many existing methods try to lower communication cost by adding regularizers that reduce noise across the different sites’ data. Other methods skip certain iterations to reduce the total number of communication rounds needed for the results to converge. However, few can guarantee one-shot or few-shot convergence, and there is still much to be developed under the PDA framework.

This semester I’ll continue to work with the PennCIL Lab and hope to begin generalizing the PDA framework to other regression and machine learning problems, with methods that relax more assumptions and require fewer communication rounds.