Hacking Global Health London 2016

Giovanni M. Dall’Olio

Hacking Global Health


lessons learned from an Open Data Science Hackathon

https://github.com/dalloliogm/HBGDki-London/tree/master/Ultrasound/notebooks


Background – the HBGDki initiative

Bill and Melinda Gates Foundation

Slides credit: http://www.slideshare.net/JessicaWillis13/odsc-hackathon-for-health-october-2016




The HBGDki data

Objective of HBGDki:

• Understand which factors affect child development

Variables in full dataset (curated from 122 studies):

• Motor, Cognitive, Language Development
• Environment, Socioeconomic status
• Parents’ Reasoning skills and Depressive Symptoms
• Infant temperament, Breastfeeding, Micronutrients, Growth velocity, HAZ, enteric infections

Observations on HBGDki data

Bias towards US studies

• 90% of the data comes from US studies
• US data may be collected in a more systematic way or with better tools

Data collected from several sources

• Inconsistent data (different procedures used), despite manual curation
• Incomplete data

Future plans

• HBGDki plans to use insights from the current dataset to launch a global data collection study
• The scope of the hackathon is to see which types of analysis can be done and where efforts should be concentrated

The Hackathon Challenge

Predicting weight at birth, given ultrasound measurements

• Predicting birth weight during pregnancy makes it possible to detect underweight babies and act in advance
• Birth weight can be predicted from ultrasound measurements
• The current methods are relatively good, but the objective of the hackathon is to improve them





The Hackathon data

Data size

• 17,370 ultrasound scans from 2,525 samples collected from two studies

Variables

• GAGEDAYS: age of the foetus in days at the time of the ultrasound
• SUBJID, STUDYID, SEX: subject and study id, sex of the baby
• WTKG: predicted weight at birth, using best method in x
• BWT_40: predicted weight at birth, using best method in literature
• PARITY, GRAVIDA: number of times the mother has been pregnant before
• ABCIRCM, BPDCM, FEMURCM, HCIRCM: ultrasound measurements


• Biparietal Diameter: BPDCM
• Head Circumference: HCIRCM
• Abdominal Circumference: ABCIRCM
• Femur Length: FEMURCM





Exploratory 1: how much data, and how it is distributed

[Figures: number of ultrasounds per subject; distribution of ultrasound measurements]

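Centering, scaling, and imputing data with caret

library(caret)

# ultrasound.data is the data frame of hackathon variables; ultrasound.clean is just an illustrative name
pp <- preProcess(ultrasound.data, method = c("center", "scale", "knnImpute", "YeoJohnson"))
ultrasound.clean <- predict(pp, ultrasound.data)

[Figures: distributions before transform and after transform]

– The caret library in R can be used to center and scale the data, apply a Yeo-Johnson transform to bring it closer to a normal distribution, and impute missing values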

Exploratory 2: Correlation between variables

• The ggpairs function from GGally makes it quick to create pair plots

[Figure: correlation between variables, grouped by study]
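A pair plot of this kind could be sketched as follows. The data frame `demo` and its values are synthetic stand-ins rather than the hackathon data, and colouring by STUDYID is an assumption about how the grouping was done.

```r
library(ggplot2)
library(GGally)

# Synthetic stand-in data (NOT the HBGDki data): two measurements that
# grow with gestational age, plus a study label used for grouping.
set.seed(7)
n <- 200
gagedays <- runif(n, 100, 280)
demo <- data.frame(
  STUDYID  = factor(sample(c("Study 1", "Study 2"), n, replace = TRUE)),
  GAGEDAYS = gagedays,
  ABCIRCM  = 0.12  * gagedays + rnorm(n, sd = 0.8),
  FEMURCM  = 0.028 * gagedays + rnorm(n, sd = 0.2)
)

# Pair plot of the numeric columns (2 to 4), coloured by study
p <- ggpairs(demo, columns = 2:4, mapping = aes(colour = STUDYID))
print(p)
```

Each off-diagonal panel shows one pair of variables, so strongly correlated measurements show up as narrow diagonal bands.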

Exploratory 3: Differences between Studies

• One group plotted PARITY (number of pregnancies) by study
• From the different distributions they hypothesized that Study 1 came from a high-income country and Study 2 from a medium-low income country

[Figure panels: Study 1, Study 2]

A PCA of the four ultrasound measurements confirms they are highly correlated

• We can merge these 4 variables into one single Principal Component, losing <1% of the variance
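As a rough illustration of this step, the sketch below runs a PCA with base R's `prcomp` on synthetic measurements. The numbers are made up so that the four columns are strongly correlated; they are not the hackathon data, and the exact proportion of variance on the real data will differ.

```r
# Synthetic stand-in for the four ultrasound measurements (NOT the HBGDki data):
# each one grows roughly linearly with gestational age, plus some noise.
set.seed(42)
n <- 500
gagedays <- runif(n, 100, 280)
us <- data.frame(
  ABCIRCM = 0.12  * gagedays + rnorm(n, sd = 0.8),
  BPDCM   = 0.035 * gagedays + rnorm(n, sd = 0.25),
  FEMURCM = 0.028 * gagedays + rnorm(n, sd = 0.2),
  HCIRCM  = 0.12  * gagedays + rnorm(n, sd = 0.8)
)

# PCA on centered and scaled measurements
pca <- prcomp(us, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# PC1 scores: a single combined "size" value per scan
pc1 <- pca$x[, "PC1"]
```

When the first entry of `var_explained` is close to 1, replacing the four measurements with `pc1` discards very little information.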

My plan: trajectory clustering

Use trajectory clustering to classify growth trajectories into different groups.

For example, one group of individuals may grow slower or faster than the others, or follow a different trajectory.

Use non-ultrasound variables to characterize the different trajectory groups, e.g. does male sex increase the odds of being in a fast-growing group?

https://github.com/dalloliogm/HBGDki-London/blob/master/Ultrasound/notebooks/prehackaton_mousephenotype_trajectoryclustering.ipynb



Trajectory Clustering on PC1 of Ultrasound measurements

cluster   n
1         1
2         12
3         5
4         578

– Unfortunately trajectory clustering of the data doesn’t show much

– Almost all samples (578) follow the same trajectory

– A cluster of 12 samples (cluster 2) follows a slightly faster growth trajectory than the others
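The general idea can be sketched with base R. This is not the method used in the notebook linked above: as a simple stand-in, each subject's trajectory is summarised by the slope of a per-subject linear fit, and the slopes are then grouped with k-means. The data, the `size` column (playing the role of PC1) and the two growth rates are all invented for illustration.

```r
# Synthetic longitudinal data (NOT the HBGDki data): 120 subjects,
# each with 3 to 6 scans taken at irregular gestational ages.
set.seed(3)
n_subj <- 120
# Hypothetical growth rates: 100 typical subjects and 20 faster-growing ones
true_slope <- c(rep(0.030, 100), rep(0.040, 20))

scans <- do.call(rbind, lapply(seq_len(n_subj), function(i) {
  k <- sample(3:6, 1)
  days <- sort(sample(seq(100, 280, by = 30), k))
  data.frame(SUBJID = i, GAGEDAYS = days,
             size = true_slope[i] * days + rnorm(k, sd = 0.1))
}))

# Step 1: summarise each subject's trajectory by the slope of a linear fit.
# Subjects with different numbers of scans are handled the same way.
slopes <- sapply(split(scans, scans$SUBJID), function(s) {
  coef(lm(size ~ GAGEDAYS, data = s))[["GAGEDAYS"]]
})

# Step 2: group the subjects by growth rate with k-means
km <- kmeans(slopes, centers = 2, nstart = 20)
print(table(km$cluster))
```

Because each subject is reduced to a fitted slope, incomplete trajectories can still be compared, at the cost of ignoring any non-linear shape.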

Characterizing Cluster 2

• Cluster 2 contains 12 babies that grow slightly faster than the other groups

• We can use a binomial regression on other variables (Sex, Study ID, Parity) to determine if they increase the odds of belonging to cluster 2

• Results are not exciting but at least indicate a new possible direction of analysis when new data is available

Logistic Regression – odds of belonging to cluster 2 given Sex, Study ID and Parity

Coefficients   Estimate   Std. Error   z-value   Pr(>|z|)
(Intercept)      9.2496     729.0359     0.013   0.989877
SEXMale          0.6564       0.2685     2.444   0.014517 *
STUDYID        -14.2373     729.0359    -0.02    0.984419
PARITY           0.508        0.133      3.82    0.000134 ***
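A regression of this kind can be fitted with base R's `glm` and `family = binomial`. The sketch below uses simulated data: the `in_cluster2` column and the effect sizes are assumptions chosen for illustration, and STUDYID is left out to keep the example small.

```r
# Simulated data (NOT the HBGDki data): sex, parity and a hypothetical
# 0/1 indicator of membership in the faster-growing cluster.
set.seed(1)
n <- 600
d <- data.frame(
  SEX    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  PARITY = rpois(n, lambda = 1.5)
)
# Assumed effects on the log-odds scale, used only to generate the outcome
logit <- -3 + 0.6 * (d$SEX == "Male") + 0.5 * d$PARITY
d$in_cluster2 <- rbinom(n, size = 1, prob = plogis(logit))

# Logistic (binomial) regression of cluster membership on sex and parity
fit <- glm(in_cluster2 ~ SEX + PARITY, data = d, family = binomial)
print(summary(fit)$coefficients)

# Exponentiated coefficients are odds ratios
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

The coefficients are on the log-odds scale, so an estimate of 0.508 for PARITY, as in the table above, corresponds to an odds ratio of about exp(0.508) ≈ 1.66 per additional pregnancy.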

Modeling with caret

• The caret library is an interface to several R packages for modelling / clustering / regressions

• The train function can be used to:
  • Preprocess the data (center, scale, normalization)
  • Fit a model/regression/etc
  • Do resampling and cross-validation
  • Select the best fit based on a metric

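# Bootstrap resampling; "repeats" only applies to method="repeatedcv", so it is dropped here
ctrl <- trainControl(method = "boot", number = 10)

# train() takes the resampling settings through its trControl argument
gbm.fit <- train(BWT_40 ~ ., data = ultrasound.data, method = "gbm", trControl = ctrl, preProcess = c("center", "scale"), verbose = FALSE)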

Generalized boosting regression on ultrasound data

var        rel.inf
ABCIRCM    42.3102187
GAGEDAYS   34.7568922
FEMURCM     7.0196893
SEXMale     6.5910654
BPDCM       4.7765837
HCIRCM      2.6879421
PARITY      1.5100042
STUDYID     0.3476046

• 25 resamplings
• Data centered, scaled, knnImputed with caret
• RMSE 0.294

Focusing the model on weeks 15-25 slightly improves performance

• 25 resamplings
• Data centered, scaled, knnImputed with caret
• RMSE 0.327

gbm variable importance

           Overall
GAGEDAYS   100.00
ABCIRCM     93.12
HCIRCM      68.62
FEMURCM     46.02
BPDCM       29.54
SEXMale     21.96
PARITY      11.61
STUDYID      0.00

Caret is an interface to several R modelling packages

Models

Models tried:

• Linear regression
• Regularised regression (LASSO/Ridge)
• Decision trees + AdaBoost
• Random forests

Using:

• Last scan only
• Last two scans
• Last three scans
• All 6 scans (if available)

‘Best’ model

• Last three scans
• Elastic Net
• MAPE ≈ 7.4% (MAE ≈ 0.24 kg)

This can be improved by:

• Adding scans closer to delivery back in (MAPE ≈ 6.4%)
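For reference, the error metrics quoted in these slides (MAPE, MAE, RMSE) can be computed as in the short sketch below. The helper names and the example birth weights are made up for illustration; they are not taken from any team's code or results.

```r
# Mean absolute error, in the units of the outcome (kg here)
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Mean absolute percentage error, in percent
mape <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))

# Root mean squared error, in the units of the outcome
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up birth weights in kg, for illustration only
actual    <- c(3.2, 2.9, 3.6, 3.0)
predicted <- c(3.0, 3.1, 3.3, 3.0)

print(c(MAE = mae(actual, predicted),
        MAPE = mape(actual, predicted),
        RMSE = rmse(actual, predicted)))
```

MAPE scales each error by the true weight, so the same absolute error counts for more on a small baby than on a large one.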

What did teams do

What did the winning team do better?

• Feature engineering
• Smart transform of features to predict brain volume, density, etc
• Unfortunately their slides are not available anymore

Lessons learned

Cleaning data takes time

• About 50% of the time was spent on cleaning and understanding the data
• HBGDki’s investment in data curation is well justified

Trajectory clustering

• An approach to classify longitudinal data, even if incomplete
• More samples and more variables would make it possible to characterize different classes of growth speed

Caret

• Common interface for several R modelling packages
• Also useful for data cleaning and exploring

Feature Engineering

• Models can be improved by understanding the variables and transforming them in a proper way

Date post:	16-Apr-2017