Post on 27-Mar-2015

Anthony Goldbloom
CEO, Kaggle
e-mail anthony.goldbloom@kaggle.com
twitter @antgoldbloom

Predictive modeling competitions

Photo by mikebaird, www.flickr.com/photos/mikebaird

making data science a sport

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

Global competitions

[Chart: Predicting HIV viral load. State of the art: 70%; after 1½ weeks: 70.8%; competition closes: 77%.]

Mismatch between those with data and those with the skills to analyse it

Crowdsourcing

Countless approaches. Hard to know which will work

Additional slides
Not MIT, not SAS … UoL?

[Chart: Tourism Forecasting Competition. Forecast Error (MASE) over time (Aug 9, 2 weeks later, 1 month later, Competition End), compared with the existing model.]
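The chart's metric, MASE (mean absolute scaled error), can be sketched as follows. This is a minimal illustration of the standard definition, assuming a naive "repeat the value from m periods ago" benchmark on the training series; the function name, the parameter `m`, and the sample numbers are illustrative and do not come from the competition itself.

```python
def mase(train, actual, forecast, m=1):
    """Mean absolute scaled error.

    The forecast's mean absolute error is divided by the in-sample mean
    absolute error of a naive forecast that repeats the value observed
    m periods earlier (m=1 is the plain naive forecast).
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) <= m:
        raise ValueError("training series must be longer than m")

    # In-sample errors of the naive benchmark.
    naive_errors = [abs(train[i] - train[i - m]) for i in range(m, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive benchmark has zero error; MASE is undefined")

    # Out-of-sample mean absolute error of the forecast being scored.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Illustrative numbers only (not competition data).
example_score = mase(train=[10, 12, 14, 16], actual=[18, 20], forecast=[17, 22])
```

A value below 1 means the forecast beats the naive benchmark on average; lower is better.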

[Chart: Chess Ratings Competition. Error Rate (RMSE) over time (Aug 4, 1 month later, 2 months later, Today), compared with the existing model (ELO).]
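For readers unfamiliar with the two terms on this chart, here is a minimal sketch of the standard Elo expected-score formula and of RMSE. The slide only names the metric; the exact scoring rule used in the competition is not given here, so the pairing of the two functions and the sample values are assumptions for illustration.

```python
import math


def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative games: (rating of A, rating of B, actual score of A).
# Scores use 1 for a win, 0.5 for a draw, 0 for a loss.
games = [(1600, 1600, 0.5), (1800, 1600, 1.0), (1500, 1700, 0.0)]
predictions = [elo_expected_score(a, b) for a, b, _ in games]
outcomes = [score for _, _, score in games]
baseline_error = rmse(predictions, outcomes)
```

A competing rating system would be compared by feeding its own predictions into `rmse` and checking whether the error falls below the Elo baseline.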

Our User Base

• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• AdaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modeling

Users apply different techniques

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

• Clean, real-world data
• Professional reputation & experience
• Interactions with experts in related fields
• Prizes

Why Participants Compete

More fun than Sudoku

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

Competitions are judged based on predictive accuracy

Competition Mechanics

Competitions are judged on objective criteria
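Judging on an objective criterion can be sketched as scoring each submission against answers the participants never see, then ranking by score. This is a simplified illustration, not Kaggle's actual scoring code: the metric (fraction of correct predictions), the function names, and the team names are all hypothetical.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the held-out answers."""
    if len(predictions) != len(answers):
        raise ValueError("submission has the wrong number of predictions")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def leaderboard(submissions, answers):
    """Rank teams from best to worst score on the held-out answers."""
    scores = {team: accuracy(preds, answers) for team, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical held-out answers and two hypothetical submissions.
held_out_answers = [1, 0, 1, 1]
submissions = {
    "team_a": [1, 0, 0, 1],
    "team_b": [1, 0, 1, 1],
}
ranking = leaderboard(submissions, held_out_answers)
```

Because the score is computed mechanically from the held-out answers, the ranking does not depend on who submitted or which technique they used.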

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

R on Kaggle

R on Kaggle among academics

R on Kaggle among Americans

Number  Name                        Winner           Packages
4       HIV Progression Prediction  Chris Raimondi   Caret (RFE and RandomForest)
5       Informs 2010                Cole Harris      GLM, NNET
6       Chess Rating                Yannis Sismanis
7       Tourism Forecasting Part 2  Phil Brierley    Forecast
10      R Package Recommendation    Max Lin          Stats, ROCR, GGPlot, GGPlot2
13      Ford Stay Alert             Edward           Stats

Who Uses R and How

1. Motivation
2. Why compete?
3. How it works
4. R on Kaggle
5. The Heritage Health Prize

MembId AgeAtFirstClaim Sex

25872 19- Oct F

MembId DaysInHospital

25872 0

MembId  ProviderId  Vendor   PCP     YearSvc  Specialty   Place            PayDelay  LengthOfStay  DSFS         PrimaryConditionGroup  CharIndex  ClaimID
25872   171278567   7891165  294037  Y1       Internal    Office           22                      0- 1 month   RESPR4                 1- 2       1
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  23                      0- 1 month   MSC2a3                 0          2
25872   171278567   7891165  294037  Y1       Internal    Office           16                      1- 2 months  RESPR4                 1- 2       3
25872   171278567   7891165  294037  Y1       Internal    Office           19                      2- 3 months  RESPR4                 1- 2       4
25872   171278567   7891165  294037  Y1       Internal    Office           21                      3- 4 months  RESPR4                 1- 2       5
25872   171278567   7891165  294037  Y1       Internal    Office           21                      4- 5 months  RESPR4                 1- 2       6
25872   376108719   5024957  294037  Y1       Laboratory  Independent Lab  11                      7- 8 months  METAB3                 1- 2       7

Mmm… how do I put this into R?

Some SQL Magic

Gives us a flat record

MembId DaysInHospital AgeAtFirstClaim Sex maxlos numclaims inhosp urgent
25872 0 19- Oct F 7 0 0
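The SQL behind this step is not shown on the slides. The sketch below is one plausible way to collapse the three tables into one row per member, using an in-memory SQLite database. The table names, the definitions of `maxlos`, `numclaims`, `inhosp` and `urgent`, and the `Place` values 'Inpatient Hospital' and 'Urgent Care' are assumptions made for illustration; only the sample rows come from the slides, and only a subset of the claim columns is loaded.

```python
import sqlite3


def build_flat_records():
    """Join member, outcome and claim tables into one flat row per member."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE members (MembId INTEGER, AgeAtFirstClaim TEXT, Sex TEXT);
        CREATE TABLE days_in_hospital (MembId INTEGER, DaysInHospital INTEGER);
        CREATE TABLE claims (MembId INTEGER, Specialty TEXT, Place TEXT,
                             LengthOfStay TEXT, ClaimID INTEGER);
    """)

    # Sample rows copied as shown on the slides (LengthOfStay is blank there).
    conn.execute("INSERT INTO members VALUES (25872, '19- Oct', 'F')")
    conn.execute("INSERT INTO days_in_hospital VALUES (25872, 0)")
    claims = [
        (25872, "Internal", "Office", None, 1),
        (25872, "Laboratory", "Independent Lab", None, 2),
        (25872, "Internal", "Office", None, 3),
        (25872, "Internal", "Office", None, 4),
        (25872, "Internal", "Office", None, 5),
        (25872, "Internal", "Office", None, 6),
        (25872, "Laboratory", "Independent Lab", None, 7),
    ]
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?, ?)", claims)

    # Assumed feature definitions: one aggregate per derived column.
    rows = conn.execute("""
        SELECT m.MembId,
               d.DaysInHospital,
               m.AgeAtFirstClaim,
               m.Sex,
               MAX(c.LengthOfStay)                AS maxlos,
               COUNT(c.ClaimID)                   AS numclaims,
               SUM(c.Place = 'Inpatient Hospital') AS inhosp,
               SUM(c.Place = 'Urgent Care')        AS urgent
        FROM members m
        JOIN days_in_hospital d ON d.MembId = m.MembId
        JOIN claims c ON c.MembId = m.MembId
        GROUP BY m.MembId, d.DaysInHospital, m.AgeAtFirstClaim, m.Sex
    """).fetchall()
    conn.close()
    return rows


flat_records = build_flat_records()
```

Each member ends up as a single row of fixed width, which is the shape a modeling function in R (or any other tool) expects as input.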

Voila, an entry!

Photo by gidzy, www.flickr.com/photos/gidzy

What could the world’s best analysts find in your data?
e-mail anthony.goldbloom@kaggle.com
phone +61438400053