My PCCI Internship – Synthetic Data Project

As my internship at Parkland Center for Clinical Innovation (PCCI) comes to an end, it feels great to look back and ponder over what I had the opportunity to work on, achieve and experience over the past three months. I arrived at PCCI with high expectations and am happy to say that I wasn’t disappointed. The project I worked on is called “Synthetic Data.” As the name suggests, the goal of the project is to create synthetic data from real medical datasets.

Why do we need synthetic medical datasets?

Real medical data is expensive and seldom released for research because of the privacy issues connected to it. Regulations exist because an attacker who examines medical data could identify a patient by name and thereby gain access to sensitive information. The synthetic data project at PCCI aims to alleviate these problems by creating synthetic datasets that are as close to the real medical datasets as possible without compromising patients’ privacy.

Generative Machine Learning Algorithms and Challenges

We used generative machine learning algorithms in this project, specifically Generative Adversarial Networks (GANs), which were proposed in 2014 [1]. GANs have gained huge popularity within the machine learning community, and a wide variety of GAN models have since been proposed.
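For reference, the original GAN formulation [1] trains a generator G and a discriminator D against each other in a two-player minimax game. The objective below is the standard one from that paper, not a PCCI-specific variant:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Here x is a real sample, z is a noise vector drawn from a prior p_z, G(z) is a synthetic sample, and D(·) is the discriminator’s estimated probability that its input is real.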

There were quite a few challenges along the way in realizing the goal of synthetic data generation. First, the originally proposed GAN model had not been applied to real medical datasets before our work, as it was mainly designed for image generation tasks. It also tends to perform poorly when the data mixes different modalities, which is naturally the case in a real medical dataset. Either modifying the network to work with real medical data or modifying the data (mostly getting it into a single modality) so that it works with GANs was a major challenge.
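One common way to get mixed data into a single modality is to scale numeric columns to a fixed range and one-hot encode categorical columns, so that every feature becomes a number between 0 and 1. The sketch below illustrates that general idea only; it is not necessarily the preprocessing we used, and the column names and toy values are hypothetical.

```python
import numpy as np

def min_max_scale(values):
    """Scale a numeric column to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

def one_hot(labels):
    """One-hot encode a categorical column; returns (matrix, category order)."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)))
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1.0
    return encoded, categories

# Hypothetical toy records (not taken from any real dataset).
ages = [34, 71, 58, 90]
admission_types = ["EMERGENCY", "ELECTIVE", "EMERGENCY", "URGENT"]

age_col = min_max_scale(ages).reshape(-1, 1)
adm_cols, adm_categories = one_hot(admission_types)

# A single numeric matrix that a GAN can consume.
features = np.hstack([age_col, adm_cols])
print(adm_categories)
print(features)
```

The trade-off is that the generator then emits continuous values for the one-hot columns, which have to be mapped back to categories (for example by taking the largest entry) before the synthetic records are usable.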

Second, there has been an explosion in the number of GAN-based architectures being proposed, so coming up with a novel architecture is a huge challenge in itself. After a lot of deliberation, we came up with an approach that allows us to incorporate domain knowledge into the GAN architecture. We considered three possible ways of doing so:

  1. Advice on just the discriminator.
  2. Advice on just the generator.
  3. Advice on improving the zero-sum game between the discriminator and the generator.

Improving the Zero-sum Game

Taking the third approach, we decided to incorporate reconstruction error [2] into the discriminator and the generator loss functions. The intuition is simple: train the network on a mini-batch of data and generate the synthetic counterpart of that mini-batch. Since reconstruction error, as the name suggests, measures the error between the real data and the “reconstructed” data, adding it to the loss functions can penalize either the discriminator or the generator, depending on which performs worse (i.e. has a higher loss difference) on the mini-batch in question.
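A minimal sketch of this intuition follows. It is illustrative only and rests on several assumptions that the description above does not pin down: the reconstruction error is taken to be the mean squared error between a real mini-batch and its synthetic counterpart, “performs worse” is read as “has the higher standard GAN loss,” and the function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))

def reconstruction_error(real_batch, synthetic_batch):
    """Assumed form: mean squared error between real and synthetic mini-batches."""
    real_batch = np.asarray(real_batch, dtype=float)
    synthetic_batch = np.asarray(synthetic_batch, dtype=float)
    return float(np.mean((real_batch - synthetic_batch) ** 2))

def losses_with_reconstruction(d_on_real, d_on_fake,
                               real_batch, synthetic_batch, weight=1.0):
    """Standard GAN losses, with the reconstruction penalty added to the worse player."""
    d_on_real = np.asarray(d_on_real, dtype=float)
    d_on_fake = np.asarray(d_on_fake, dtype=float)

    # Discriminator wants real -> 1 and synthetic -> 0.
    d_loss = (binary_cross_entropy(d_on_real, np.ones_like(d_on_real))
              + binary_cross_entropy(d_on_fake, np.zeros_like(d_on_fake)))
    # Generator wants the discriminator to label synthetic samples as real.
    g_loss = binary_cross_entropy(d_on_fake, np.ones_like(d_on_fake))

    penalty = weight * reconstruction_error(real_batch, synthetic_batch)
    # Assumption: penalize whichever player currently has the higher loss.
    if d_loss > g_loss:
        d_loss += penalty
    else:
        g_loss += penalty
    return d_loss, g_loss

# Toy mini-batch: the discriminator is confident, so the generator is penalized.
d_loss, g_loss = losses_with_reconstruction(
    d_on_real=[0.9, 0.9],
    d_on_fake=[0.1, 0.1],
    real_batch=np.ones((2, 3)),
    synthetic_batch=np.zeros((2, 3)),
)
print(d_loss, g_loss)
```

In a real training loop the same idea would be expressed with a deep learning framework’s differentiable losses; plain NumPy is used here only to keep the arithmetic visible.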

Initial Experiments Using the MIMIC III Dataset

For our initial experiments, we made use of the MIMIC III dataset [3], which contains clinically relevant data for all admissions to an ICU at the Beth Israel Deaconess Medical Center between 2001 and 2012. Figure 2 shows the selected features.

Figure 2: Selected MIMIC III features

Experiments Using GAN-based Methods

We then ran a couple of experiments. First, we used three different GAN-based methods, without the reconstruction error, to see how the networks performed with their originally defined loss functions. Figure 3 shows the original data and the data generated by these networks.

Figure 3: Snippets of the original and generated data

An Imbalanced Dataset

The original and synthetic datasets were then used to train a machine learning model. For the synthetic data to be useful, its distribution needs to be as close to the real data distribution as possible, and a classification model trained on each should reflect that by performing similarly. Our dataset was highly imbalanced. We were predicting mortality, and since a majority of patients come out of the ICU alive, we had nearly a 90%-10% split between negative examples (patients who are alive after the ICU treatment) and positive examples (patients who die in the ICU). We used a cost-sensitive support vector machine classifier, designed for such imbalanced data, and report the F1 score and the area under the ROC curve (AUC-ROC) in Figure 4.
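As an illustration of this kind of evaluation, here is a minimal sketch using scikit-learn. It assumes that `class_weight="balanced"` is an acceptable stand-in for the cost-sensitive SVM (the exact classifier and settings we used are not specified here), and it uses a randomly generated toy dataset with a roughly 90%-10% class split rather than MIMIC III.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: about 90% negatives and 10% positives (not real patient data).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" raises the misclassification cost of the rare class
# in inverse proportion to its frequency, one simple form of cost sensitivity.
clf = SVC(kernel="rbf", class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

# F1 uses hard predictions; AUC-ROC uses the continuous decision scores.
f1 = f1_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, clf.decision_function(X_test))
print(f"F1: {f1:.3f}  AUC-ROC: {auc:.3f}")
```

To compare real and synthetic data, the same procedure would be repeated with the classifier trained on each dataset in turn. Plain accuracy is avoided because, with a 90%-10% split, a model that always predicts “alive” already scores about 90% while identifying no positive cases.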

Figure 4: Comparison between the machine learning model performance for real and synthetic data. (W-GAN, GAN and MA-GAN are GAN models used to generate synthetic data)

As can be seen, the results supported our hypothesis that real healthcare data was going to be a challenge for techniques like GANs, which rely on many samples of very predictable data types, since healthcare data tends to be more diverse and is difficult to compose into higher-order features. (Credit: David Watkins, my supervisor). We then used the reconstruction error to “indirectly” capture the relationships between the data points and used “real” hospital data to test on and create a synthetic dataset from it. This work was still in progress at the time this blog was written.

Working at PCCI

The work culture at PCCI, in my opinion, towers above other places. The work hours are flexible, the team’s ethics and bonding are strong and people are always willing to help you regardless of their schedule. The company truly values its employees and creates a work environment where every employee gives his/her best. You will never feel out of place (not even on the first day) as all your tasks are defined and everyone is so welcoming. A great thing about PCCI is the absence of an implicit hierarchy. Everyone from the CEO to your respective manager(s) (thank you Albert Karam) is always accessible. I never felt any different than a PCCI employee and this says a lot about the values of the company and how these values are being nurtured by PCCI’s CEO Steve Miff and all employees of this amazing organization.

Another important quality of PCCI is that it values and encourages feedback from every employee, and any grievances are actually addressed. I am proud to say that Steve himself makes sure that any such issues are addressed.

Nothing is perfect, and PCCI has a few areas where it can improve. One area of improvement is getting access to the real data. This is currently a very slow process, which makes sense as it is sensitive medical information of real patients, but it can be sped up. Another area of improvement, which I actually raised with Steve during a meeting, is that PCCI should focus on publishing research papers. It is a company that is capable of doing amazing research and has access to real medical datasets that are difficult to find. I hope PCCI becomes more active in this regard.

What PCCI does is super important to the community and it makes sure that all the employees realize this fact. Creating an impact in the real world, on real lives is a great morale booster for anyone and since PCCI values are centered around this motto, working here has been a great experience. It’s been a pleasure interning at PCCI and I am happy to be taking fond memories with me back to school.

Learn more about PCCI’s careers, or stay up to date with our recent news by following us on Facebook, Twitter and LinkedIn!

References:

[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[2] Borji, Ali. “Pros and Cons of GAN Evaluation Measures.” arXiv preprint. 2018.

[3] Johnson, Alistair EW, et al. “MIMIC-III, a freely accessible critical care database.” Scientific Data. 2016.
