Musings of a Kaggler, by Kai Xin
I am not a good student. Skipped school, played games all day, almost got kicked out of school.
I play a different game now. But at the core it is the same: understand the game, devise a strategy, keep playing.
My Overall Strategy
Every piece of data is unique, but some data is more important than others
It is not about the tools or the model or the stats. It is about the steps to put everything together.
The Kaggle Competition
https://github.com/thiakx/RUGS-Meetup
Remember to download the data from the Kaggle competition and put it here
First look at the data
223,129 rows
[Screenshot of the raw data, with callouts:]
❖ Latitude/longitude: plot on a map?
❖ Summary/description: not really free text? Some repeats
❖ Views, votes, comments: need to predict these. Related to summary/description?
Graph by Ryn Locar
Understand the data via visualization
Four cities: Oakland, Chicago, New Haven, Richmond
http://www.thiakx.com/misc/playground/scfMap/scfMap.html
LeafletR Demo
Visualize the data: interactive maps
Step 1: Draw Boundary Polygon
Step 2: Create Hex Base (each hex 1 km wide)
Step 3: Point in Polygon Analysis
Step 4: Local Moran’s I
Obtain Boundary Polygon Lat/Long
App can be found at: leafletMaps/latlong.html
Boundary points: leafletMaps/regionPoints.csv
Generating Hex
Code can be found at: baseFunctions_map.R
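For orientation, a minimal sketch of hex generation with the sp package; boundaryPoly and the 0.01-degree cell size are assumptions, and the deck's actual code is in baseFunctions_map.R:

    library(sp)
    # boundaryPoly: a SpatialPolygons city boundary (assumed already built from
    # leafletMaps/regionPoints.csv). Sample hexagon centres inside it, then
    # turn the centres into hexagonal polygons roughly 1 km wide.
    hexCentres <- spsample(boundaryPoly, n = 1000, type = "hexagonal",
                           cellsize = 0.01)           # ~1 km in degrees
    hexGrid <- HexPoints2SpatialPolygons(hexCentres)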
Point in Polygon Analysis
Code can be found at: 1. dataExplore_map.R
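A minimal sketch of the point-in-polygon step using sp::over, assuming the hexGrid from the previous step and illustrative column names:

    library(sp)
    # Build SpatialPoints from each report's longitude/latitude, then find
    # which hex polygon each report falls inside.
    reportPts <- SpatialPoints(trainData[, c("longitude", "latitude")],
                               proj4string = CRS(proj4string(hexGrid)))
    trainData$hexId <- over(reportPts, hexGrid)   # polygon index per point
    hexCounts <- table(trainData$hexId)           # number of reports per hex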
Local Moran’s I
Code can be found at: 1. dataExplore_map.R
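A minimal sketch of Local Moran's I with the spdep package, assuming the hexGrid and hexId from the previous steps:

    library(spdep)
    # Zero-filled count of reports per hex cell
    counts <- rep(0, length(hexGrid))
    tab <- table(trainData$hexId)
    counts[as.integer(names(tab))] <- as.integer(tab)
    nb <- poly2nb(hexGrid)                   # neighbours: hexes sharing a border
    lw <- nb2listw(nb, zero.policy = TRUE)   # spatial weights
    lmi <- localmoran(counts, lw, zero.policy = TRUE)
    # Large Ii with a small p-value flags local clusters (hot spots) of reports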
LeafletR
Code can be found at: 1. dataExplore_map.R
Kx’s layered demo map:
leafletMaps/scfMap_kxDemoVer
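The deck's leafletR code is in 1. dataExplore_map.R; as a rough substitute, here is a minimal sketch using the (different) leaflet package to plot raw report locations:

    library(leaflet)
    leaflet(trainData) %>%
      addTiles() %>%                          # OpenStreetMap base layer
      addCircleMarkers(lng = ~longitude, lat = ~latitude,
                       radius = 2, stroke = FALSE, fillOpacity = 0.4)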
In Search of the 20% Data
[Diagram: training data split into segments labeled Ignore, Model, and MAD]
Detection of “Anomalies”
Can we justify this using statistics?
ksTest <- ks.test(trainData$num_views[trainData$month == 4 & trainData$year == 2013],
                  trainData$num_views[trainData$month == 9 & trainData$year == 2012])
# D is the distance between the two empirical distributions; a smaller D means
# the two data sets are probably from the same distribution
ksTest$statistic  # D
Jan ’12 to Oct ’12 and Mar ’13 training data were ignored
Two-sample Kolmogorov–Smirnov test
What happened here?
No need to model? Just assume all Chicago data to be 0?
Chicago data generated by remote_API is mostly 0s; no need to model.
Separate Outliers Using Median Absolute Deviation (MAD)
MAD is robust and can handle skewed data, which helps to identify outliers. We separated out data points that are more than 3 median absolute deviations from the median.
Code can be found at: baseFunctions_cleanData.R
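A minimal sketch of the 3-MAD split (the column name is an assumption; the deck's version is in baseFunctions_cleanData.R):

    x <- trainData$num_views
    madVal <- mad(x)                  # stats::mad scales by 1.4826 by default
    isOutlier <- abs(x - median(x)) > 3 * madVal
    trainDataOutliers <- trainData[isOutlier, ]   # handled separately
    trainDataClean    <- trainData[!isOutlier, ]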
[Diagram: training data segmented into Ignore, Model, and MAD, with annotations:]
❖ 10% of training data is used for modeling
❖ 59% of data is Chicago data generated by remote_API, mostly 0s; no need to model, just estimate using the median
❖ 4% of data is identified as outliers by MAD
❖ KS test: 27% of training data is from a different distribution
Key Advantage: Rapid prototyping!
When you can focus on a small but representative subset of the data, you can run many, many experiments very quickly (I ran several hundred).
Now that we have the raw ingredients prepared, it is time to make the dishes.
Experiment with Different Models
❖ Random Forest
❖ Generalized Boosted Regression Models (GBM)
❖ Support Vector Machines (SVM)
❖ Bootstrap aggregated (bagged) linear models
How to use them? Ask Google & RTFM.
Or just download my code.
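For orientation, a minimal sketch of fitting the four base learners; the packages, formula, and data frame name are assumptions, not the deck's actual setup:

    library(randomForest); library(gbm); library(e1071)
    f <- num_views ~ city + tag_type + source        # illustrative formula
    rfMod  <- randomForest(f, data = modelData)
    gbmMod <- gbm(f, data = modelData, distribution = "gaussian", n.trees = 500)
    svmMod <- svm(f, data = modelData)
    # Bagged linear models: fit glm on bootstrap resamples, average predictions
    glmBag <- lapply(1:25, function(i) {
      glm(f, data = modelData[sample(nrow(modelData), replace = TRUE), ])
    })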
I don’t spend time on optimizing/tuning model settings (learning rate, etc.) with cross-validation. I find it really boring and really slow.
Obsessing with tuning model variables is like being obsessed with tuning the oven
Instead, the magic happens when we combine data and when we create new data, aka feature creation.
Creating Simple Features: City
trainData$city[trainData$longitude == "-77"] <- "richmond"
trainData$city[trainData$longitude == "-72"] <- "new_haven"
trainData$city[trainData$longitude == "-87"] <- "chicargo"
trainData$city[trainData$longitude == "-122"] <- "oakland"
Code can be found at: 1. dataExplore_map.R
Creating Complex Features: Local Moran’s I
Code can be found at: 1. dataExplore_map.R
Creating Complex Features: Predicted View
The task is to predict views, votes, and comments, but logically, won’t the number of votes and comments be correlated with the number of views?
Code can be found at: baseFunctions_model.R
Creating Complex Features: Predicted View
Storing the predicted value of views as a new column and using it as a new feature to predict votes & comments… very risky business, but powerful if you know what you are doing.
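A minimal sketch of the two-stage idea (model choice and names are illustrative); one reason it is risky is leakage, so the predicted-view column should ideally come from out-of-fold predictions:

    # Stage 1: predict views, store the prediction as a new feature column
    viewMod <- randomForest(num_views ~ city + tag_type + source, data = modelData)
    modelData$predictedView <- predict(viewMod, modelData)
    # Stage 2: use the predicted views when predicting votes (and comments)
    votesMod <- randomForest(num_votes ~ predictedView + city + tag_type,
                             data = modelData)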
Creating Complex Features: SplitTag, wordMine
Code can be found at: baseFunctions_cleanData.R
Adjusting Features: Simplify Tags
Code can be found at: baseFunctions_cleanData.R
Adjusting Features: Recode Unknown Tags
Code can be found at: baseFunctions_cleanData.R
Adjusting Features: Combine Low Count Tags
Code can be found at: baseFunctions_cleanData.R
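A minimal sketch of folding rare tags into a catch-all bucket; the cutoff of 50 is an assumption (the deck's actual version is in baseFunctions_cleanData.R):

    trainData$tag_type <- as.character(trainData$tag_type)
    tagCounts <- table(trainData$tag_type)
    rareTags <- names(tagCounts)[tagCounts < 50]          # assumed cutoff
    trainData$tag_type[trainData$tag_type %in% rareTags] <- "other"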
Full List of Features Used
Code can be found at: baseFunctions_model.R
Original feature (1), created features (13): only 1 original feature was used; I created the other 13.
With num_views, num_comments, and num_votes as the Y variables, the features are fed into models to predict views, comments, and votes respectively.
An ensemble of good enough models can be surprisingly strong
An ensemble of the 4 base models has less error. Each model is good for a different scenario:
❖ GBM is rock solid, good for all scenarios
❖ SVM is a counterweight; don’t trust anything it says
❖ GLM is amazing for predicting comments, not so much for others
❖ Random Forest is moderate, provides a balanced view
Ensemble (Stacking using regression)
testDataAns  rfAns  gbmAns  svmAns  glmBagAns
2.3          2      2.5     2.4     1.8
2            1.8    2.2     1.7     1.6
1.3          1.3    1.7     1.2     1.0
1.5          1.4    1.9     1.6     1.2
…            …      …       …       …
glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns)
We are interested in the coefficients.
Ensemble (Stacking using regression)
Sort and column bind the predictions from the 4 models
Run regression (logistic or linear) and obtain coefficients
Scale ensemble ratio back to 1 (100%)
Obtaining the ensemble ratio for each model
Inside the 3. testMod_generateEnsembleRatio folder: getEnsembleRatio.r
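A minimal sketch of the ratio calculation, following the steps above (column names taken from the table earlier; the deck's actual code is in getEnsembleRatio.r):

    # Column-bind the sorted predictions from the 4 models with the answers
    preds <- data.frame(testDataAns, rfAns, gbmAns, svmAns, glmBagAns)
    stackFit <- glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns, data = preds)
    w <- coef(stackFit)[-1]             # drop the intercept
    ensembleRatio <- w / sum(w)         # scale back to 1 (100%)
    finalAns <- as.matrix(preds[, -1]) %*% ensembleRatio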
Ensemble is not perfect…
❖ Simple to implement? Kind of, but very tedious to update: you will need to rerun every single model every time you make any changes to the data (as the ensemble ratio may change).
❖ Easy to overfit the test data (will require another set of validation data or cross-validation).
❖ Very hard to explain to business users what is going on.
All this should get you to rank 49/532.
Recap
[Diagram: training data segmented into Ignore, Model, and MAD, with annotations:]
❖ 10% of training data is used for modeling
❖ 4% of data is identified as outliers by MAD (KS test: too different from the rest of the data)
❖ 59% of data is Chicago data generated by remote_API, mostly 0s; no need to model, just estimate using the median
Key Advantage: Rapid prototyping!