A video from Parkland Center for Clinical Innovation (PCCI) highlights its Women in Data Science and Technology Summer Internship program, with members of the program sharing their valuable experiences.
PCCI’s 2019 summer intern program is made up of area students from Dallas Independent School District high schools, SMU’s Statistics Department, the University of Texas at Dallas and Creighton University.
This internship program has become one of the most prestigious internship programs in North Texas with a mission to expand opportunities for women in an industry that significantly lacks gender diversity.
DALLAS – Parkland Center for Clinical Innovation (PCCI), improving healthcare in our communities with advanced analytics and artificial intelligence, recognizes the importance of a STEM education. Offering opportunities to women interested in data science is particularly crucial, which is the mission of PCCI’s summer internship program.
PCCI’s Women in Data Science and Technology Summer Internship, in collaboration with Southern Methodist University’s (SMU) Statistics Department, is one of the most prestigious internship programs in North Texas with a mission to expand opportunities for women in an industry that significantly lacks gender diversity.
The seven women participating in PCCI’s Women in Data Science and Technology Summer Internship program are high school, college and graduate students from Dallas Independent School District high schools, SMU’s Statistics Department, the University of Texas at Dallas and Creighton University.
The program’s interns will be immersed in PCCI’s daily work where they will directly experience the organization’s innovative healthcare and social determinants of health programs. The students will also have hands-on exposure to the practical applications of analytics, computing and data science.
“The Women in Data Science and Technology Summer Internship program is a rigorous and meaningful path that demonstrates to women what to expect and how to enter the technology market,” said Steve Miff, PhD, President and CEO of PCCI. “Because of the important and valuable contributions from organizations such as SMU’s Statistics Department, we are able to place women side-by-side with clinical and data science experts where they can hone their programming and analytics skills within an atmosphere of mentorship and advancement.”
PCCI celebrates diversity and inclusion with a workforce that includes 54 percent women with 30 percent of its employees representing various ethnicities and communities from around the world. As an example of PCCI’s successful commitment to diversity, the Dallas Business Journal recently named Priyanka Kharat, PCCI’s Vice President, Data Engineering and Machine Learning, as a 2019 Women in Technology honoree.
PCCI’s Women in Data Science and Technology Summer Internship program is currently underway and will conclude in mid-August with a presentation program for their PCCI mentors showcasing the impact their projects are having on the Dallas community and Parkland Health & Hospital System.
About Parkland Center for Clinical Innovation
Parkland Center for Clinical Innovation (PCCI) is an independent, not-for-profit, healthcare intelligence organization affiliated with Parkland Health & Hospital System. PCCI focuses on creating connected communities through data science and cutting-edge technologies like machine learning. PCCI combines extensive clinical expertise with advanced analytics and artificial intelligence to enable the delivery of patient-centric precision medicine at the point of care.
###
Parkland Health & Hospital System, Department of Corporate Communications
5200 Harry Hines Blvd., Dallas TX 75235, 469-419-4400
As my internship at Parkland Center for Clinical Innovation (PCCI) comes to an end, it feels great to look back and ponder over what I had the opportunity to work on, achieve and experience over the past three months. I arrived at PCCI with high expectations and am happy to say that I wasn’t disappointed. The project I worked on is called “Synthetic Data.” As the name suggests, the goal of the project is to create synthetic data from real medical datasets.
Why do we need synthetic medical data sets?
Real medical data is expensive and seldom released for research due to various privacy issues connected to it. Regulations exist because by looking at the medical data, a hacker could identify the name of a patient, thereby gaining access to sensitive information. The synthetic data project at PCCI aims to alleviate these problems by creating synthetic datasets, which are as close to the real medical datasets as possible without compromising a patient’s privacy.
Generative Machine Learning Algorithms and Challenges
Generative machine learning algorithms, specifically Generative Adversarial Networks (GANs), proposed in 2014 [1], were used in this project. GANs have gained huge popularity within the machine learning community, with a wide variety of GAN models being proposed.
There were quite a few challenges along the way in realizing the goal of synthetic data generation. First, the proposed GAN model had not previously been applied to real medical datasets, as it was mainly designed for image generation tasks. It also tends not to perform well with different modalities of data, which are naturally present in a real medical dataset. Modifying the network to work with real medical data, or modifying the data (mostly getting it into a single modality) so that it works with GANs, was a major challenge.
Second, there is an explosion in the number of GAN-based architectures being proposed, so coming up with a novel architecture is a huge challenge in itself. After a lot of deliberation, we came up with an approach that allows us to incorporate domain knowledge into the GAN architecture. We considered three possible approaches:
Advice on just the discriminator.
Advice on just the generator.
Advice on improving the zero-sum game between the discriminator and the generator.
Improving the Zero-sum Game
Taking the third approach, we decided to incorporate reconstruction error [2] into the discriminator and the generator loss functions. The simple intuition is this: train the network with a mini-batch of data and generate the synthetic counterpart for each mini-batch. Since reconstruction errors, as the name suggests, measure the errors between the real data and the “reconstructed” data, adding this term to the loss functions can penalize either the discriminator or the generator, depending on which performs worse (i.e., has a higher loss difference) on the mini-batch in question.
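The intuition above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the project's actual code: the function names, the `weight` parameter, the choice of mean squared error as the reconstruction error, and the rule "add the penalty to whichever player currently has the higher loss" are all assumptions made for the example. The discriminator and generator losses are the standard GAN losses computed from the discriminator's output probabilities.

```python
import math


def mean(values):
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss.

    d_real / d_fake are the discriminator's probabilities (strictly between
    0 and 1) that real rows and synthetic rows are real.
    """
    return -mean([math.log(p) for p in d_real]) - mean(
        [math.log(1.0 - p) for p in d_fake]
    )


def generator_loss(d_fake):
    """Non-saturating generator loss: low when synthetic rows fool D."""
    return -mean([math.log(p) for p in d_fake])


def reconstruction_error(real_batch, synthetic_batch):
    """Mean squared error between each real row and its synthetic counterpart.

    ASSUMPTION: rows are equal-length lists of numbers, paired by position.
    """
    per_row = [
        mean([(r - s) ** 2 for r, s in zip(real_row, synth_row)])
        for real_row, synth_row in zip(real_batch, synthetic_batch)
    ]
    return mean(per_row)


def penalized_losses(d_real, d_fake, real_batch, synthetic_batch, weight=1.0):
    """Return (discriminator loss, generator loss) with the penalty applied.

    ASSUMPTION: the weighted reconstruction error is added to whichever
    player has the higher loss on this mini-batch. This is one possible
    reading of the idea, chosen for illustration only.
    """
    d_loss = discriminator_loss(d_real, d_fake)
    g_loss = generator_loss(d_fake)
    penalty = weight * reconstruction_error(real_batch, synthetic_batch)
    if d_loss > g_loss:
        d_loss += penalty
    else:
        g_loss += penalty
    return d_loss, g_loss
```

In a real training loop these quantities would be computed on tensors by a deep learning framework so that gradients can flow through them; the plain-Python version only shows how the terms fit together.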
Initial Experiments Using the MIMIC III Dataset
For our initial experiments, we made use of the MIMIC III dataset [3], which is a dataset incorporating clinically relevant data for all admissions to an ICU at the Beth Israel Deaconess Medical Center between 2001 and 2012. Figure 2 shows the selected features.
Experiments Using GAN-based Methods
We then ran a couple of experiments. First, we used three different GAN-based methods, without the reconstruction error, to see how the network performed with the defined loss functions. Figure 3 shows the original and the generated data as obtained from these networks.
An Imbalanced Dataset
The original and synthetic datasets were then used to train a machine learning model. For the synthetic data to be useful, it needs to be as close to the real data distribution as possible, which should be captured by a machine learning classification model. Our dataset was highly imbalanced. We were predicting mortality, and since a majority of patients come out of the ICU alive, we had nearly a 90%-10% split between negative examples (patients who are alive after ICU treatment) and positive examples (patients who die in the ICU). We used a cost-sensitive support vector machine classifier, constructed for such imbalanced data, to report the F1 score and the area under the curve (AUC-ROC) in Figure 4.
As can be seen, the results supported our hypothesis that real healthcare data was going to be a challenge for techniques like GANs, which rely on many samples of very predictable data types, since healthcare data tends to be more diverse and is difficult to compose into higher-order features. (Credit: David Watkins, my supervisor). We then used the reconstruction error to “indirectly” capture the relationships between the data points and used “real” hospital data to test on and create a synthetic dataset from it. The work was still in progress at the time this blog was written.
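To make the evaluation setup more concrete, here is a small plain-Python sketch of the three ingredients mentioned above: class weights for an imbalanced label set, the F1 score, and AUC-ROC. It is an illustration under assumptions, not the code used in the project. The weighting rule shown (number of samples divided by number of classes times class count) is a common "balanced" heuristic for cost-sensitive classifiers, and the function names are made up for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# A 90%-10% split like the one described above.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
always_alive = [0] * len(labels)
baseline_f1 = f1_score(labels, always_alive)
```

With 90 negatives and 10 positives, the positive class receives a weight of 5.0 and the negative class about 0.56. A classifier that always predicts survival is 90% accurate yet scores an F1 of 0.0, which is why accuracy alone is not reported for data like this.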
Working at PCCI
The work culture at PCCI, in my opinion, towers above other places. The work hours are flexible, the team’s ethics and bonding are strong and people are always willing to help you regardless of their schedule. The company truly values its employees and creates a work environment where every employee gives his/her best. You will never feel out of place (not even on the first day) as all your tasks are defined and everyone is so welcoming. A great thing about PCCI is the absence of an implicit hierarchy. Everyone from the CEO to your respective manager(s) (thank you Albert Karam) is always accessible. I never felt any different than a PCCI employee and this says a lot about the values of the company and how these values are being nurtured by PCCI’s CEO Steve Miff and all employees of this amazing organization.
Another important quality about PCCI is that it values and encourages all feedback that any employee may have and any grievances are then actually addressed. I am proud to say that Steve himself makes sure that any such issues are addressed.
Nothing is perfect and PCCI has a few areas where it can improve. One area of improvement is getting access to the real data. This is currently a very slow process, which makes sense as it is sensitive medical information of real patients, but it can be sped up. Another area of improvement, which I actually raised with Steve during a meeting, is that PCCI should focus on publishing research papers. It is a company that is capable of doing amazing research and has access to real medical datasets that are difficult to find. I hope PCCI becomes more active in this regard.
What PCCI does is super important to the community and it makes sure that all the employees realize this fact. Creating an impact in the real world, on real lives is a great morale booster for anyone and since PCCI values are centered around this motto, working here has been a great experience. It’s been a pleasure interning at PCCI and I am happy to be taking fond memories with me back to school.
While back in my hometown of Dallas for the summer, I’ve been interning at Parkland Center for Clinical Innovation (PCCI) as a Data Science Intern. During my interview with Albert and Vikas, we discussed some issues with the representation of data in the current healthcare system. Hospitals use different coding systems in their electronic medical records (EMRs), making communication between hospitals and care providers difficult. A while ago, a new health data standard called FHIR (Fast Healthcare Interoperability Resources, pronounced “fire”) was proposed. My project this summer aimed at identifying whether data could be easily transformed into the new FHIR format, carrying out the transformation, and creating predictive models using the new FHIR data.
Situated on the 11th floor of the building, PCCI is a very chill place to work. Quiet spaces are easily found at desks and conference rooms scattered around the office. As an intern, I sit on the “Intern Island” with (usually) 6 other interns. I like this space because we get two monitors and a Lenovo Thinkpad.
As for work, each PCCI project usually consists of one project manager, a clinical expert, and a data scientist. The intern projects are no different; Aaron was the FHIR Project Manager Intern, and Mila was the FHIR Clinical Intern. Both had important but separate duties that helped our project succeed.
As the Data Science Intern on the FHIR project, I was responsible for first converting the data into FHIR resources. This involved bringing back Java knowledge from several years ago! There were definitely some issues figuring out how to add the right dependencies because Java can get complicated very quickly. A few days were spent just trying to get oriented with Java and Eclipse, and making sure all the necessary packages for FHIR were installed.
We were working with two years of data. This roughly translates into 27 million (!) vitals and 17 million labs, and each vital and lab was converted into its own separate file. I quickly realized that there would be no space on my laptop to hold all of these files, so we decided to enlist the help of Microsoft Azure. With Azure, the task became less difficult, but still, the hardest part of my summer was working with such huge numbers of files.
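For readers who have not seen a FHIR resource, here is a rough sketch of what converting one vitals reading might look like. My conversion code was written in Java; this sketch uses Python only to keep the example short. The function name, the argument names and the sample patient ID are made up for illustration, and the field layout follows the general shape of a FHIR R4 Observation (the category system URL in particular is the R4 one), so treat it as an approximation rather than a validated resource.

```python
import json


def vital_to_observation(patient_id, loinc_code, display, value, unit,
                         ucum_code, taken_at):
    """Build a FHIR-style Observation dict for a single vital sign.

    ASSUMPTION: the source row already carries a LOINC code and a UCUM unit.
    Real EMR tables usually need a mapping step to get there.
    """
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                "code": "vital-signs",
            }]
        }],
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": loinc_code,
                "display": display,
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": taken_at,
        "valueQuantity": {
            "value": value,
            "unit": unit,
            "system": "http://unitsofmeasure.org",
            "code": ucum_code,
        },
    }


# Hypothetical heart-rate reading (LOINC 8867-4) for a made-up patient.
observation = vital_to_observation(
    "example-123", "8867-4", "Heart rate", 72, "beats/minute", "/min",
    "2019-07-01T08:30:00Z",
)
observation_json = json.dumps(observation, indent=2)
```

Writing one such JSON document per reading is what turns tens of millions of table rows into tens of millions of files, which is where the storage problem described above comes from.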
Caught up in the huge task of transforming vast amounts of data to FHIR resources, I left very little time in my internship to work on actual data science. Out of the approximately 13 weeks total, about six weeks were spent converting the table format EMR data into FHIR resources, five weeks were spent on parsing the FHIR resources into a format for machine learning, and the remaining two weeks were dedicated to model building. Reflecting back, I would definitely work harder to cut short the resource conversion in favor of more time for data science.
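The parsing step, turning FHIR resources back into something a model can consume, can be sketched roughly as follows. This is a simplified Python illustration, not my actual pipeline: it assumes each resource has been loaded as a dict shaped like a FHIR Observation with a `valueQuantity`, keeps only the most recent value per patient and code, and compares timestamps as strings, which only works if they all share the same ISO 8601 format and time zone. The function name is made up for the example.

```python
from collections import defaultdict


def observations_to_rows(resources):
    """Pivot Observation dicts into one feature row per patient.

    Returns {patient reference: {code: latest numeric value}}.
    """
    latest = defaultdict(dict)  # patient -> code -> (timestamp, value)
    for resource in resources:
        # Skip anything that is not a numeric Observation.
        if resource.get("resourceType") != "Observation":
            continue
        if "valueQuantity" not in resource:
            continue
        patient = resource["subject"]["reference"]
        code = resource["code"]["coding"][0]["code"]
        stamp = resource["effectiveDateTime"]
        value = resource["valueQuantity"]["value"]
        seen = latest[patient]
        # ASSUMPTION: same-format ISO 8601 strings sort chronologically.
        if code not in seen or stamp > seen[code][0]:
            seen[code] = (stamp, value)
    return {
        patient: {code: value for code, (_, value) in codes.items()}
        for patient, codes in latest.items()
    }
```

Each resulting row maps codes to values for one patient, which can then be aligned into a fixed set of columns for model training. Choosing "latest value" as the aggregation is just one option; means, minimums or time-windowed features are equally plausible.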
As a Data Science Intern at PCCI, you have the freedom to work in any language you want; the full-time Data Science team is very evenly divided between R and Python. There’s also a lot of freedom in deciding which path your project will take. Your supervisor will point you in a very general direction and state goals and expectations, but is otherwise very lenient!
Don’t be shy about asking people around you for advice and help, even if they’re not on your project team! Even though most people are busy with various meetings, they will gladly schedule a 30-minute or even hour-long block to discuss your project privately with you.
When presenting your project, whether it’s a progress update or final presentation, expect multiple questions from the audience. It’s not that they want to quiz you on your knowledge and preparation; they’re genuinely curious and care about understanding what you’re doing over the summer.
A 30-minute lunch break is mandatory every day. I recommend bringing lunches that can stay in the fridge for several days (like salad) or not bringing anything, because there are often team lunches and random outings during the day. Occasionally there’s leftover pizza or sandwiches from lunch meetings in the big conference room or leftover burritos from breakfast.
I enjoy the diverse atmosphere at PCCI the most. The three teams (Data Science, Project Management, and Clinical) collaborate and work together so well. It’s a very fluid system. A data scientist with a question about the best intervention methods for patients with diabetes can easily walk over to a clinical team member and get an answer within minutes. Despite being employed as a data scientist, you have access to an entire host of medical knowledge from the clinical team and connections from the project management team.
My biggest takeaway from this internship is learning about long-term time management and collaboration. Manage your time well and you’ll be able to at least touch on everything you wanted to learn during your internship. Collaborate with as many people as you can, so that you not only learn much more but also gain friends and connections while doing so.