Protecting patient privacy

Image credit: Bajibabu Bollepalli

Data access is an important component of advancing any scientific field. The Artificial Intelligence community has often enjoyed significant performance improvements in Machine Learning tasks due to the availability of datasets such as MNIST, ImageNet, and CIFAR-10.

These improvements have two sources: a common reference for comparing algorithms, and reuse of the data on tasks different from those originally proposed by the author and data provider. Healthcare is no different, but data sharing there is held back by institutions' need to protect the confidentiality and privacy of patient information.

Recently, Generative Models have seen significant advances in the task of synthesizing information with privacy guarantees. From a dataset with sensitive information (e.g. patient test results), it is possible to generate a second dataset with the same statistical characteristics as the original, but in which no record represents a real patient. In addition, experiments show that an algorithm trained on the synthetic dataset performs comparably to one trained on the original, within a precisely controlled margin of error.

This anonymization can be viewed as the insertion of noise into the original data. The more noise inserted, the greater the degradation of the statistical signal (i.e. utility) available in the dataset. With Differential Privacy theory, it is possible to precisely calculate the amount of noise needed to ensure the privacy of the people whose information is present in the dataset while controlling the degradation of the statistical signal so that a learning algorithm can still be trained.
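To make the noise/utility trade-off concrete, here is a minimal sketch of the Laplace mechanism, one of the standard building blocks of Differential Privacy. It is an illustration only, not the method used in our work: the function names, the heart-rate values, and the clipping bounds are all hypothetical, and we assume the number of records is public and that two neighboring datasets differ in the value of a single record. Under those assumptions, the sensitivity of a clipped mean is (upper - lower) / n, and the noise scale is sensitivity / epsilon, so a smaller epsilon (stronger privacy) means more noise.

```python
import random


def noise_scale(sensitivity: float, epsilon: float) -> float:
    """Scale of the Laplace noise required for epsilon-differential privacy."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Release a noisy mean of `values` (hypothetical helper).

    Assumes the record count is public and neighboring datasets differ
    in the value of one record.
    """
    rng = rng or random.Random()
    # Clip each value so that one record can move the mean by a bounded amount.
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    true_mean = sum(clipped) / n
    # Changing one clipped record shifts the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return true_mean + laplace_noise(noise_scale(sensitivity, epsilon), rng)


# Made-up heart-rate readings (beats per minute), for illustration only.
heart_rates = [72.0, 88.0, 95.0, 110.0, 64.0]
for eps in (0.1, 1.0, 10.0):
    print(eps, private_mean(heart_rates, 40.0, 180.0, eps))
```

Running the loop a few times shows the trade-off described above: with a small epsilon the released mean can land far from the true value, while with a large epsilon it stays close to it.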

This enables us to generate synthetic hospital data and share it without worrying about violating patient privacy. Entities such as the U.S. Census Bureau are already adopting Differential Privacy technologies (https://digitalcommons.ilr.cornell.edu/ldi/49/).

We recently released a pre-print of our work, "Ward2ICU: A Vital Signs Dataset of Inpatients from the General Ward", in which we train a GAN to synthesize a time-series dataset.

3778 Care
Research Group