Dario Marrocchelli Insight Demo

Post on 24-Jan-2017

RiskPredictor: Identify health risk before it is too late

Dario Marrocchelli

Insight Health Data Fellow

The problem

Wellness companies do not know which programs will be most effective for a specific population

My solution

RiskPredictor: a tool that identifies people at risk [1] and their underlying conditions

[1] Risk is defined as an individual’s predicted healthcare cost

The data: unique, rich and messy

Unique de-identified data set from Zakipoint that combines three types of information:

Claims: ICD-9 codes, medical costs, gender, age

Biometric: BMI, BP, cholesterol, A1C, etc.

Behavioral: HRA and wellness program participation

There are about 250,000 rows and 2,000 people in these datasets

Cannot discuss feature engineering

Model performs very well

Model performance is in line with proprietary programs, which cost $100,000+

There is room for improvement (more data, more features, etc.)

Model: RiskPredictor - Random Forest | ACG [1] (Commercial) | RiskPredictor - Linear (Ridge) | MARA [1] (Commercial)

R2: 20.5% | 29.7% | 34.4% | 57.9%

[1] http://us.milliman.com/mara/
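The R2 figures above are coefficients of determination. As a minimal sketch of how such a score is computed, assuming the usual definition (1 minus the ratio of residual to total sum of squares), the snippet below uses made-up cost values; the slides do not say how the data were split or which costs were scored, so none of this reproduces the reported numbers.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical annual costs in dollars (illustrative only, not from the data set).
actual_costs = [500.0, 700.0, 3000.0, 4000.0, 9000.0, 12000.0]

# A model that predicts every cost exactly scores 1.
perfect_score = r_squared(actual_costs, actual_costs)

# A model that always predicts the mean cost scores 0.
mean_cost = sum(actual_costs) / len(actual_costs)
baseline_score = r_squared(actual_costs, [mean_cost] * len(actual_costs))
```

Read this way, an R2 of 34.4% means the model explains roughly a third of the variance in cost that a predict-the-mean baseline leaves unexplained.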

Short Bio

PhD in Chemistry, 2006-2010

Postdoctoral Associate, 2010-2011

Postdoctoral Fellow, 2011-2013

Research Scientist & Instructor, 2013-Present

Computational Materials Scientist working on renewable energy

30+ papers, 900+ citations

Fun Fact: I designed and taught a course at MIT on the Science of Cooking (rated 9/10)

Extra slides

Diabetes (59)

No diabetes (972)

Hypertension (210)

No hypertension (821)

Data

Claims Biometric Behavioral

Communication with Zakipoint

Ramesh Kumar, CEO

Heather Richie, VP Product Management

Several emails, Google Hangout

Data

Unique data set from zph that combines:

1) Claim information (ICD-9 codes, medical costs, gender, age)

2) Biometric information (BMI, BP, cholesterol, A1C, etc.)

3) Behavioral (HRA and wellness program participation)

The dataset contains 2k lives and is in CSV format (masked)

BMI categories: Obese, Overweight, Normal, Underweight (5)

Remaining category counts: (245), (430), (411)
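The four BMI groups can be derived from raw biometric values. The sketch below assumes the standard WHO adult cutoffs (18.5, 25 and 30 kg/m2); the slides do not state which thresholds were used, and the sample values are made up.

```python
from collections import Counter


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to a category using the standard WHO adult cutoffs (assumed)."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal"
    if value < 30.0:
        return "Overweight"
    return "Obese"


# Hypothetical BMI values (illustrative only, not from the data set).
sample_bmis = [17.2, 22.4, 24.9, 27.5, 31.0, 35.8]
category_counts = Counter(bmi_category(v) for v in sample_bmis)
```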

Algorithm anatomy

Raw data (ICD-9 codes, BMI)

1000s of diagnostic features (e.g. diabetes, obesity, etc.)

Regression (linear & random forest)

Predicted cost (risk scores)
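The four stages above can be sketched end to end. This is only an illustration under stated assumptions: the actual feature engineering is not disclosed, so the three-condition ICD-9 prefix map, the toy claims, the costs and the helper names are all hypothetical. Only the ridge (linear) branch is shown, written in plain Python via the normal equations; the random forest branch is omitted.

```python
from collections import defaultdict

# Hypothetical map from ICD-9 category prefixes to diagnostic flags:
# 250 = diabetes mellitus, 278 = overweight/obesity, 401 = essential hypertension.
CONDITION_PREFIXES = {"250": "diabetes", "278": "obesity", "401": "hypertension"}
CONDITIONS = sorted(set(CONDITION_PREFIXES.values()))


def build_features(claims):
    """Collapse claim rows (person_id, icd9_code) into one 0/1 vector per person."""
    found = defaultdict(set)
    for person_id, code in claims:
        found[person_id]  # ensure a row exists even if no condition is flagged
        condition = CONDITION_PREFIXES.get(code[:3])
        if condition:
            found[person_id].add(condition)
    return {pid: [1.0 if c in flags else 0.0 for c in CONDITIONS]
            for pid, flags in found.items()}


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ridge(X, y, alpha=1.0):
    """Ridge regression: solve (X'X + alpha*I) w = X'y, intercept left unpenalised."""
    rows = [x + [1.0] for x in X]  # trailing constant column for the intercept
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(d - 1):
        xtx[i][i] += alpha
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    return solve(xtx, xty)


def predict(weights, x):
    """Predicted cost, used here as the risk score."""
    return sum(w * v for w, v in zip(weights, x + [1.0]))


# Toy claims and annual costs (illustrative only, not from the data set).
claims = [("p1", "250.00"), ("p1", "401.9"), ("p2", "401.9"), ("p3", "V70.0"),
          ("p4", "278.00"), ("p5", "250.00"), ("p6", "V70.0")]
costs = {"p1": 12000.0, "p2": 4000.0, "p3": 500.0,
         "p4": 3000.0, "p5": 9000.0, "p6": 700.0}

features = build_features(claims)
people = list(features)
weights = fit_ridge([features[p] for p in people], [costs[p] for p in people])
risk = {p: predict(weights, features[p]) for p in people}
```

On this toy data the person flagged with both diabetes and hypertension gets the highest predicted cost, and people with identical feature vectors get identical scores.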