Date posted: 14-Jul-2015 | Uploaded by: krishna-sankar
Who will win XLIX? R, Data Wrangling & Data Science
January 18, 2015
@ksankar // doubleclix.wordpress.com
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” — Jerome Seymour Bruner
“There are no facts, only interpretations.” — Friedrich Nietzsche
Étude
http://en.wikipedia.org/wiki/%C3%89tude, http://www.etudesdemarche.net/articles/etudes-sectorielles.htm, http://upload.wikimedia.org/wikipedia/commons/2/26/La_Cour_du_Palais_des_%C3%A9tudes_de_l%E2%80%99%C3%89cole_des_beaux-arts.jpg
We will focus on “short”, “acquiring skill” & “having fun” !
Goals & non-goals
Goals
¤ Get familiar with the R language & dplyr
¤ Work on a couple of interesting data science problems
¤ Give you focused time to work
  § Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
  § The programs can be optimized
Non-goals
¡ Go deep into the algorithms
  • We don't have sufficient time. The topic could easily be a 5-day tutorial!
¡ Dive into R internals
  • That is for another day
¡ A passive talk
  • Nope. Interactive & hands-on
Activities & Results
o Activities:
  • Get familiar with R, R Studio
  • Work on a couple of data sets
  • Get familiar with the mechanics of Data Science Competitions
  • Explore the intersection of Algorithms, Data, Intelligence, Inference & Results
  • Discuss Data Science Horse Sense ;o)
o Results:
  • Hands-on R
  • Familiarity with some interesting algorithms
  • Submitted entries for 1 competition
  • Knowledge of Model Evaluation
  • Cross Validation, ROC Curves
About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, PyData et al
o Reviewing the Packt book “Machine Learning with Spark”
o Picked up co-authorship of the second edition of “Fast Data Processing with Spark”
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech)
  • Written books (Web 2.0, Wireless, Java, …)
  • Standards, some work in AI
  • Guest Lecturer at Naval PG School, …
  • Planning an MS in Computational Finance or Statistics
  • Volunteer as Robotics Judge at FIRST Lego League World Competitions
o @ksankar, doubleclix.wordpress.com
The Nuthead band !
Setup & Data
R & IDE
o Install R
o Install RStudio
Tutorial Materials
o Github : https://github.com/xsankar/hairy-octo-hipster
o Clone or download zip
Setup an account on Kaggle (www.kaggle.com). We will be using the data from 2 Kaggle competitions:
① Titanic: Machine Learning from Disaster
  Download data from http://www.kaggle.com/c/titanic-gettingStarted
  Directory: ~/hairy-octo-hipster/titanic-r
② Predicting Bike Sharing @ Washington DC
  Download data from http://www.kaggle.com/c/bike-sharing-demand/data
  Directory: ~/hairy-octo-hipster/bike
③ 2014 NFL Boxscores
  http://www.pro-football-reference.com/years/2014/games.htm
  Directory: ~/hairy-octo-hipster/nfl
Data
Agenda
o Jan 18 : 9:00-12:30 (3 hrs)
o Intro, Goals, Logistics, Setup [10] [9:00-9:10)
o Introduction to R & dplyr [30] [9:10-9:40)
o Who will win Super Bowl XLIX? The Art of ELO Ranking [30] [9:40-10:10)
  • The Algorithm • The Data • The Results (compare with FiveThirtyEight)
o Anatomy of a Kaggle Competition [40] [10:10-10:50)
  • Competition mechanics • Register, download data, create subdirectories • Trial run: submit Titanic
o Break [20] [10:50-11:10)
o Algorithms for the Amateur Data Scientist [20] [11:10-11:30)
  • Algorithms, tools & frameworks in perspective • “Folk Wisdom”
o Model Evaluation & Interpretation [30] [11:30-12:00)
  • Confusion Matrix, ROC Graph
o Homework : The Art of a Competition – Bike Sharing
o Homework : The Art of a Competition – Walmart
Overload Warning … there is enough material here for a week's training, which is good & bad! Read through at your own pace; refer, ponder & internalize.
Close Encounters
o 1st kind: This tutorial
o 2nd kind: Do more hands-on walkthroughs
o 3rd kind: Listen to lectures; enter more competitions …
Introduction to R
9:10
R Syntax – A quick overview
o aString <- "A String"
o aNumber <- 12
o class(aString)
o class(aNumber)
o aVector <- c(1,2,3,4)
o class(aVector)
o aVector * 2
o sqrt(aVector)
o Packages : dplyr & tidyr
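These snippets can be typed straight into the R console; a minimal annotated session (output comments added):

```r
# Basic R: assignment, types, and vectorized arithmetic
aString <- "A String"
aNumber <- 12
class(aString)       # "character"
class(aNumber)       # "numeric" (R numbers are doubles by default)

aVector <- c(1, 2, 3, 4)
class(aVector)       # "numeric"
aVector * 2          # 2 4 6 8 -- arithmetic is element-wise
sqrt(aVector)        # 1.000000 1.414214 1.732051 2.000000
```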
Data wrangling with dplyR
o dplyr – a versatile package for various data operations
o We will see dplyr in use
o Resources:
  • “Data Manipulation with dplyr” - Hadley Wickham’s UseR! 2014 tutorial slides
    • http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/
    • Slides: https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a
  • Slides of the tutorial by RStudio’s Garrett Grolemund
    • https://github.com/rstudio/webinars
  • And the cheatsheet is available at http://www.rstudio.com/resources/cheatsheets/
dplyr verbs
o select
o filter
o summarise
o group_by
o mutate
o arrange
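A minimal sketch of the verbs on a small made-up data frame (the `scores` frame and its columns are invented for illustration):

```r
library(dplyr)

# Invented example data: points scored by two teams over two weeks
scores <- data.frame(team = c("NE", "SEA", "NE", "SEA"),
                     week = c(1, 1, 2, 2),
                     pts  = c(33, 36, 30, 21))

# select + filter + arrange: pick columns, pick rows, sort
scores %>%
  select(team, pts) %>%
  filter(pts > 25) %>%
  arrange(desc(pts))

# group_by + summarise + mutate: per-group aggregation, derived column
scores %>%
  group_by(team) %>%
  summarise(avg = mean(pts)) %>%
  mutate(above_28 = avg > 28)
```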
dplyr joins
Hiroaki Yutani @yutannihilatio, inspired by http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
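The join types in those diagrams can be sketched with two tiny invented tables (`teams` and `elo` are illustrative, not from the tutorial data):

```r
library(dplyr)

# Two invented tables sharing a "team" key
teams <- data.frame(team = c("NE", "SEA", "GB"),
                    conference = c("AFC", "NFC", "NFC"))
elo   <- data.frame(team = c("NE", "SEA", "DEN"),
                    rating = c(1690, 1720, 1610))

inner_join(teams, elo, by = "team")  # rows whose key appears in both tables
left_join(teams, elo, by = "team")   # all of teams; GB gets an NA rating
anti_join(teams, elo, by = "team")   # rows of teams with no match in elo
```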
Who will win Super Bowl XLIX
9:40
The Art of ELO Ranking & Super Bowl XLIX
o Let us look at this from 3 angles:
  • The Algorithm
  • The R program
  • The Data
  • The Results
  • Comparing with the FiveThirtyEight results
http://www.imdb.com/title/tt1285016/trivia?item=qt1318850
I need the Algorithm, I need the Algorithm – Mark Z to Eduardo S
The ELO Algorithm (1 of 3)
1. Basic chess algorithm proposed by Elo
  • Arpad Emrick Elo proposed the system for chess ranking
  • R_new = R_old + K(S - μ), where the expected score μ_ij = 1 / (1 + 10^((R_j,old - R_i,old)/400))
  • K varies depending on the match
  • S_ij = 1, ½ or 0 (win, draw, loss)
2. Soccer Ranking • http://www.eloratings.net/system.html
3. NFL Ranking with adjusted factor for scores, 538 Ranking
Ref : Who is #1, Princeton University Press
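A base-R sketch of the update rule above (kFactor = 20, as the slides use later; the function names are mine, not those in the tutorial's R code):

```r
# Expected score of team i against team j under the Elo model
elo_expected <- function(r_i, r_j) {
  1 / (1 + 10 ^ ((r_j - r_i) / 400))
}

# Update both ratings after one game
# s_i = 1 if i wins, 0.5 for a tie, 0 if i loses
elo_update <- function(r_i, r_j, s_i, k = 20) {
  mu <- elo_expected(r_i, r_j)
  c(i = r_i + k * (s_i - mu),
    j = r_j + k * ((1 - s_i) - (1 - mu)))
}

elo_update(1500, 1500, 1)   # winner gains 10 points, loser drops 10
```

Note that the update is zero-sum: whatever one team gains, the other loses.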
The ELO Algorithm (2 of 3) NFL Ranking
http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
The ELO Algorithm (3 of 3) NFL Ranking
The Data
http://www.pro-football-reference.com/years/2014/games.htm
The R Code https://github.com/xsankar/hairy-octo-hipster
The Analysis - Ranks
The Analysis – Week 1, Week 18
Analysis – Week 20 Results
Wisdom from Nate Silver & the 538 gang …
o [Homework #1] Improve our core algorithm to add the margin of victory from the 538 gang!
  • Remember, kFactor = 20
o [Homework #2] Weigh recent games more heavily with exponential decay
The Art of ELO Ranking & Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
I need the Algorithm, I need the Algorithm – Mark Z to Eduardo S
Ref : Who is #1, Princeton University Press
References:
o ELO ranking - NFL, Soccer
  • http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
  • http://fivethirtyeight.com/datalab/nfl-week-20-elo-ratings-and-playoff-odds-conference-championships/
  • http://www.eloratings.net/system.html
o dplyr
  • http://www.rstudio.com/resources/webinars/ <- github for the slides
  • http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part1/
  • http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part2/
  • http://www.rstudio.com/resources/cheatsheets/
  • http://www.r-bloggers.com/data-analysis-example-with-ggplot-and-dplyr-analyzing-supercar-data-part-2/
Anatomy of a Kaggle Competition
10:10
Kaggle Data Science Competitions
o Hosts Data Science Competitions
o Competition attributes:
  • Dataset
    • Train
    • Test (submission)
    • Final evaluation data set (we don't see this)
  • Rules
  • Time boxed
  • Leaderboard
  • Evaluation function
  • Discussion forum
  • Private or public
Titanic Passenger Metadata
• Small
• 3 predictors: Class, Sex, Age
• Outcome: Survived?
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
City Bike Sharing Prediction (Washington DC)
Walmart Store Forecasting
Train.csv - taken from the Titanic passenger manifest

Variable  | Description
--------- | -----------
Survived  | 0 = No, 1 = Yes
Pclass    | Passenger class (1st, 2nd, 3rd)
Sibsp     | Number of siblings/spouses aboard
Parch     | Number of parents/children aboard
Embarked  | Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
Test.csv
Submission
o 418 lines; the 1st column should have 0 or 1 in each line
o Evaluation: % correctly predicted
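The submission mechanics can be sketched in R; this is an illustrative stand-in, not the tutorial code, and the 10-row `test` frame stands in for Kaggle's real 418-row test.csv:

```r
# Illustrative stand-in for Kaggle's test.csv
# (the real file has 418 rows; the PassengerIds here are invented)
test <- data.frame(PassengerId = 892:901)

# A degenerate baseline "model": predict that nobody survived
pred <- rep(0, nrow(test))

# Kaggle expects one 0/1 prediction per test row
submission <- data.frame(PassengerId = test$PassengerId, Survived = pred)
write.csv(submission, "submission.csv", row.names = FALSE)
```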
Approach
o This is a classification problem - 0 or 1
o Comb the forums!
o Opportunity for us to try different algorithms & compare them:
  • Simple model
  • CART [Classification & Regression Tree]
    • Greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions
  • RandomForest
    • Different parameters
  • SVM
    • Multiple kernels
  • Table the results
o Use cross validation to predict our model performance & correlate with what Kaggle says
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Simple Model – Our First Submission
o #1 : Simple Model (M=survived)
o #2 : Simple Model (F=survived)
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
#3 : Simple CART Model
o CART (Classification & Regression Tree)
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
Maybe better, because we have improved on the survival prediction for men!
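A minimal rpart sketch of such a CART model, on a made-up stand-in for train.csv (the column names follow the Kaggle file; the rows are invented):

```r
library(rpart)

# Invented stand-in for train.csv (column names follow the Kaggle file)
train <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1),
  Pclass   = c(3, 1, 3, 1, 3, 2, 1, 3, 2, 2, 3, 1),
  Sex      = c("male", "female", "female", "female", "male", "male",
               "female", "male", "male", "female", "male", "female"),
  Age      = c(22, 38, 26, 35, 35, 27, 54, 2, 34, 14, 20, 58)
)

# method = "class" asks for a classification tree rather than regression
fit <- rpart(Survived ~ Pclass + Sex + Age, data = train,
             method = "class", control = rpart.control(minsplit = 2))
pred <- predict(fit, train, type = "class")
table(pred, train$Survived)    # resubstitution confusion matrix
```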
#4 : Random Forest Model
o https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience • Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
o https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
o https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
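A minimal randomForest sketch along the same lines (again on an invented stand-in for train.csv, not the code in the repo):

```r
library(randomForest)

# Invented stand-in for train.csv; randomForest needs a factor outcome for
# classification and cannot handle NA predictors directly
train <- data.frame(
  Survived = factor(c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1)),
  Pclass   = c(3, 1, 3, 1, 3, 2, 1, 3, 2, 2, 3, 1),
  Sex      = factor(c("male", "female", "female", "female", "male", "male",
                      "female", "male", "male", "female", "male", "female"))
)

set.seed(1)
fit <- randomForest(Survived ~ Pclass + Sex, data = train, ntree = 200)
print(fit)         # includes the OOB estimate of the error rate
importance(fit)    # mean decrease in Gini per predictor
```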
#5 : SVM
o Multiple kernels
  • kernel = "radial"    # radial basis function
  • kernel = "sigmoid"
o agconti's blog - Ultimate Titanic !
o http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
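A minimal e1071 sketch of the two kernels mentioned above (invented stand-in data again, not the repo code):

```r
library(e1071)

# Invented stand-in for train.csv
train <- data.frame(
  Survived = factor(c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1)),
  Pclass   = c(3, 1, 3, 1, 3, 2, 1, 3, 2, 2, 3, 1),
  Sex      = factor(c("male", "female", "female", "female", "male", "male",
                      "female", "male", "male", "female", "male", "female"))
)

# svm() defaults to kernel = "radial"; "sigmoid" is one alternative
fit_rbf <- svm(Survived ~ Pclass + Sex, data = train, kernel = "radial")
fit_sig <- svm(Survived ~ Pclass + Sex, data = train, kernel = "sigmoid")
table(predict(fit_rbf, train), train$Survived)
```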
Feature Engineering - Homework
o Add attribute: Age
  • In train 714/891 have age; in test 332/418 have age
  • Missing values can be just the mean age of all passengers
  • We could be more precise and calculate the mean age based on title (Ms, Mrs, Master et al)
  • Box plot age
o Add attributes: Mother, family size et al
o Feature engineering ideas
  • http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python
o More ideas at http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/
o And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
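The title-based imputation idea can be sketched in base R (the `Title`/`AgeFilled` columns and the regex are my additions; the ages here are made up, though the Name format matches the Kaggle file):

```r
# Hypothetical slice of train.csv's Name/Age columns; Kaggle names look
# like "Braund, Mr. Owen Harris"
train <- data.frame(
  Name = c("Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
           "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"),
  Age  = c(22, 38, 26, NA),
  stringsAsFactors = FALSE
)

# Simple version: fill missing ages with the overall mean
train$AgeFilled <- ifelse(is.na(train$Age),
                          mean(train$Age, na.rm = TRUE), train$Age)

# More precise: extract the title from Name and impute the per-title mean
train$Title <- sub(".*, *([A-Za-z]+)\\..*", "\\1", train$Name)
title_mean <- tapply(train$Age, train$Title, mean, na.rm = TRUE)
miss <- is.na(train$Age)
train$AgeFilled[miss] <- title_mean[train$Title[miss]]
train$AgeFilled[4]   # Allen gets the mean "Mr" age, 22
```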
What does it mean ? Let us ponder ….
o We have a training data set representing a domain • We reason over the dataset & develop a model to predict outcomes
o How good is our prediction when it comes to real-life scenarios?
o The assumption is that the dataset is taken at random
  • Or is it? Is there a sampling bias?
  • i.i.d.? Independent? Identically distributed?
  • What about homoscedasticity? Do they have the same finite variance?
o Can we assure that another dataset (from the same domain) will give us the same result?
o Will our model & its parameters remain the same if we get another data set?
o How can we evaluate our model?
o How can we select the right parameters for a selected model?
Break
10:50 - 11:10
Algorithms for the Amateur Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have …
11:10
Ref: Anthony’s Kaggle Presentation
Data Scientists apply different techniques
• Support Vector Machine • adaBoost • Bayesian Networks • Decision Trees • Ensemble Methods • Random Forest • Logistic Regression
• Genetic Algorithms • Monte Carlo Methods • Principal Component Analysis • Kalman Filter • Evolutionary Fuzzy Modelling • Neural Networks
Quora • http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
Algorithm spectrum
o Regression
o Logit
o CART
o Ensemble: Random Forest
o Clustering
o KNN
o Genetic algorithms
o Simulated annealing
o Collaborative filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann machine
o Feature learning
(Spectrum axis: Machine Learning · Cute Math · Artificial Intelligence)
Classifying Classifiers
Statistical: Regression, Naïve Bayes, Bayesian Networks, Logistic Regression¹, SVM
Structural:
  • Rule-based: Production Rules, Decision Trees
  • Distance-based:
    • Functional: Linear, Spectral, Wavelet
    • Nearest Neighbor: kNN, Learning Vector Quantization
  • Neural Networks: Multi-layer Perceptron
Ensemble: Random Forests, Boosting
¹Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
Classifiers - a quick map: Regression (continuous variables) vs. classification (categorical variables); Decision Trees, CART, k-NN (Nearest Neighbors); recurring themes: Bias vs. Variance, Model Complexity, Over-fitting, Boosting, Bagging
Data Science “folk knowledge”
Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." - Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It's generalization that counts
  • The fundamental goal of machine learning is to generalize beyond the examples in the training set
o Data alone is not enough
  • Induction, not deduction - every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine learning is not magic - one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich, expressive dataset
A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (2 of A)
o Overfitting has many faces
  • Bias - the model is not strong enough, so the learner has the tendency to learn the same wrong things
  • Variance - learning too much from one dataset; the model will fall apart (i.e. be much less accurate) on a different dataset
  • Sampling bias
o Intuition fails in high dimensions - Bellman
  • Blessing of non-uniformity & lower effective dimension: many applications have examples not uniformly spread but concentrated near a lower-dimensional manifold, e.g. the space of digits is much smaller than the space of images
o Theoretical guarantees are not what they seem
  • One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees
o Feature engineering is the key
A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (3 of A)
o More data beats a cleverer algorithm
  • Or conversely, select algorithms that improve with data
  • Don't optimize prematurely without getting more data
o Learn many models, not just one
  • Ensembles! - change the hypothesis space
  • Netflix prize
  • E.g. Bagging, Boosting, Stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755
Data Science “folk knowledge” (4 of A)
o The simplest hypothesis that fits the data is also the most plausible
  • Occam's Razor
  • Don't go for a 4-layer neural network unless you have data that complex
  • But that doesn't mean one should always choose the simplest hypothesis
  • Match the impedance of the domain, the data & the algorithms
o Think of overfitting as memorizing, as opposed to learning
o Data leakage has many forms
o Sometimes the absence of something is everything
o [Corollary] Absence of evidence is not evidence of absence
New to Machine Learning? Avoid these three mistakes, James Faghmous https://medium.com/about-data/73258b3848a4
(Learning-curve sketch)
§ Simple model: a high error line that cannot be compensated with more data, but it gets to its lower error rate with fewer data points
§ Complex model: a lower error line, but it needs more data points to reach a decent error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Importance of feature selection & weak models
o “Good features allow a simple model to beat a complex model”-Ben Lorica1
o “… using many weak predictors will always be more accurate than using a few strong ones …” –Vladimir Vapnik2
o “A good decision rule is not a simple one, it cannot be described by a very few parameters” 2
o “Machine learning science is not only about computers, but about humans, and the unity of logic, emotion, and culture.” 2
o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you” – Hadley Wickham3
1. http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html
2. http://nautil.us/issue/6/secret-codes/teaching-me-softly
3. http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/
Check your assumptions
o The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data
o For example, for regression one should check that:
  ① Variables are normally distributed
    • Test for normality via visual inspection, skew & kurtosis, outlier inspection via plots, z-scores et al
  ② There is a linear relationship between the dependent & independent variables
    • Inspect residual plots, try quadratic relationships, try log plots et al
  ③ Variables are measured without error
  ④ Assumption of homoscedasticity
    § Homoscedasticity assumes constant or near-constant error variance
    § Check the standard residual plots and look for heteroscedasticity
    § For example, in the figure the left box has the errors scattered randomly around zero, while the right two diagrams have the errors unevenly distributed
Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, http://pareonline.net/getvn.asp?v=8&n=2
Data Science “folk knowledge” (5 of A)
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
(2×2 map: what you know vs. what the world knows)
o Known Knowns: there are things we know that we know; what we do; facts
o Known Unknowns: that is to say, there are things that we now know we don't know; potential facts or outcomes we are aware of, but not with certainty; stochastic processes, probabilities
o Unknown Knowns: others know, you don't
o Unknown Unknowns: there are things we do not know we don't know; facts, outcomes or scenarios we have not encountered, nor considered; "black swans", outliers, long tails of probability distributions; lack of experience, imagination
Data Science “folk knowledge” (6 of A) - Pipeline
Collect / Store / Transform:
  o Volume, velocity, streaming data
  o Canonical form, data catalog, a data fabric across the organization
  o Access to multiple sources of data
  o Think hybrid - Big Data apps, appliances & infrastructure
  o Metadata; monitor counters & metrics; structured vs. multi-structured
Reason / Model / Deploy:
  o Flexible & selectable data subsets & attribute sets
  o Refine the model with extended data subsets & engineered attribute sets
  o Validation run across a larger data set
  o Scalable model deployment; Big Data automation & purpose-built appliances (soft/hard); manage SLAs & response times
Data Management / Data Science:
  o Dynamic data sets; 2-way key-value tagging of datasets; extended attribute sets; advanced analytics
Explore / Visualize / Recommend / Predict:
  o Performance, scalability, refresh latency, in-memory analytics
  o Advanced visualization, interactive dashboards, map overlays, infographics
¤ Bytes to Business, a.k.a. build the full stack
¤ Find relevant data for business
¤ Connect the dots
(Volume - Velocity - Variety)
Data Science “folk knowledge” (7 of A)
Context, Connectedness, Intelligence, Interface, Inference
o "Data of unusual size" that can't be brute-forced
o The Three Amigos:
  • Interface = Cognition
  • Intelligence = Compute (CPU) & Computational (GPU)
  • Infer significance & causality
Data Science “folk knowledge” (8 of A) Jeremy’s Axioms
o Iteratively explore data
o Tools: Excel format, Perl, Perl Book
o Get your head around data
  • Pivot table
o Don't over-complicate
o If people give you data, don't assume that you need to use all of it
o Look at pictures!
o Keep a tab on the history of your submissions
o Don't be afraid to submit simple solutions
  • We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Data Science “folk knowledge” (9 of A)
① Common sense (some features make more sense than others)
② Carefully read the forums to get a peek at other people's mindsets
③ Visualizations
④ Train a classifier (e.g. logistic regression) and look at the feature weights
⑤ Train a decision tree and visualize it
⑥ Cluster the data and look at what clusters you get out
⑦ Just look at the raw data
⑧ Train a simple classifier, see what mistakes it makes
⑨ Write a classifier using handwritten rules
⑩ Pick a fancy method that you want to apply (Deep Learning/NNet)
-- Maarten Bosma -- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
Data Science “folk knowledge” (A of A) Lessons from Kaggle Winners
① Don’t over-fit
② All predictors are not needed
  • All data rows are not needed, either
③ Tuning the algorithms will give different results
④ Reduce the dataset (average, select transition data, …)
⑤ Test set & training set can differ
⑥ Iteratively explore & get your head around data
⑦ Don’t be afraid to submit simple solutions
⑧ Keep a tab & history of your submissions
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story
Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician - Josh Wills (Cloudera)

Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer - Will Cukierski (Kaggle)

http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier ! – Titus Brown
Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
  • http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
  • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
  • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
  • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
  • https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
  • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
  • http://www-bcf.usc.edu/~gareth/ISL/
② ISL class, Stanford/Hastie/Tibshirani at their best - Statistical Learning
  • http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
  • https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
  • https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
  • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
  • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements of Statistical Learning
  • http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
Of Models, Performance, Evaluation & Interpretation
11:30
Bias/Variance (1 of 2)
o Model complexity
  • A complex model increases the training-data fit
  • But then it overfits & doesn't perform as well with real data
o Bias vs. Variance
o Classical diagram from ESLII, by Hastie, Tibshirani & Friedman
o Bias - the model learns the wrong things; not complex enough; small error gap; more data by itself won't help
o Variance - a different dataset will give a different error rate; overfitted model; larger error gap; more data could help
(Learning-curve sketch: Prediction Error vs. Training Error)
Ref: Andrew Ng/Stanford, Yaser S./CalTech
Bias/Variance (2 of 2)
o High Bias
  • Due to underfitting
  • Add more features
  • More sophisticated model (quadratic terms, complex equations, …)
  • Decrease regularization
o High Variance
  • Due to overfitting
  • Use fewer features
  • Use more training samples
  • Increase regularization
(Learning-curve sketch: Prediction Error vs. Training Error. High bias: need more features or a more complex model to improve; high variance: need more data to improve.)
Ref: Strata 2013 Tutorial by Olivier Grisel
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
Data Partition & Cross-Validation
o Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
Partition data!
  • Training (60%)
  • Validation (20%)
  • "Vault" test (20%) data sets
k-fold Cross-Validation
  • Split data into k equal parts
  • Fit the model to k-1 parts & calculate the prediction error on the kth part
  • Non-overlapping datasets
(Figure: 5-fold CV - each of the 5 folds takes a turn as the validation set while the other 4 are used for training)
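The k-fold scheme can be sketched in base R (`make_folds` and `cv_error` are my names; plug in any fit and error function):

```r
# Split n row indices into k non-overlapping folds
make_folds <- function(n, k = 5) {
  split(sample(n), rep(1:k, length.out = n))
}

# Generic k-fold CV: fit on k-1 folds, score on the held-out fold
cv_error <- function(data, k, fit_fn, err_fn) {
  folds <- make_folds(nrow(data), k)
  errs <- sapply(folds, function(idx) {
    model <- fit_fn(data[-idx, ])   # train on the other k-1 parts
    err_fn(model, data[idx, ])      # error on the held-out part
  })
  mean(errs)
}

# Example: 5-fold CV RMSE of a linear model on synthetic data
set.seed(42)
d <- data.frame(x = 1:100)
d$y <- 2 * d$x + rnorm(100)
cv_error(d, k = 5,
         fit_fn = function(dd) lm(y ~ x, data = dd),
         err_fn = function(m, dd) sqrt(mean((predict(m, dd) - dd$y)^2)))
```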
Bootstrap & Bagging
o Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
Bootstrap
  • Draw datasets (with replacement) and fit the model for each dataset
  • Remember: data partitioning (#1) & cross validation (#2) are without replacement
Bagging (Bootstrap aggregation)
  ◦ Average the prediction over a collection of bootstrapped samples, thus reducing variance
Boosting
o Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
◦ "Output of weak classifiers into a powerful committee"
◦ Final prediction = weighted majority vote
◦ Later classifiers give misclassified points higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
  • Bagging - independent trees
  • Boosting - successively weighted
Random Forests+
o Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves on bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ "Do remarkably well, with very little tuning required" - ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
  • Original version - Fortran-77! By Breiman/Cutler
  • Python, R, Mahout, Weka, Milk (ML toolkit for py), Matlab
* i.i.d. - independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Ensemble Methods
o Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
◦ Two steps:
  • Develop a set of learners
  • Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
  • Using different algorithms
  • Using the same algorithm with different settings
  • Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Random Forests
o While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables - the no. of predictors (typically √k) & the no. of trees (500 for large datasets, 150 for smaller)
o Error prediction
  • For each iteration, predict for the data that is not in the sample (OOB data)
  • Aggregate the OOB predictions
  • Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  • Can use this to search for the optimal # of predictors
  • We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction; can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective, Berk; A Brief Overview of RF by Dan Steinberg
Model Evaluation & Interpretation Relevant Digression
Cross Validation
o References:
  • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
  • Chris Clark's blog: http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
  • Predictive Modelling in Python with scikit-learn, Olivier Grisel, Strata 2013
  • titanic from pycon2014/parallelmaster/An Introduction to Predictive Modeling in Python
Model Evaluation - Accuracy
o Accuracy = (tp + tn) / (tp + fp + fn + tn)
o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate!
o Hence the F-measure is a better reflection of the model strength

           Predicted=1            Predicted=0
Actual=1   True+ (tp)             False- (fn) - Type II
Actual=0   False+ (fp) - Type I   True- (tn)
Model Evaluation – Precision & Recall
o Precision = how many of the items we identified are relevant = tp / (tp + fp)
  • a.k.a. Accuracy, Relevancy
o Recall = how many of the relevant items we identified = tp / (tp + fn)
  • a.k.a. True +ve Rate, Coverage, Sensitivity, Hit Rate
o False +ve Rate = fp / (fp + tn)
  • a.k.a. Type 1 Error Rate, False Alarm Rate
  • Specificity = 1 - fp rate
o Type 1 Error = fp; Type 2 Error = fn
o Inverse relationship - the tradeoff depends on the situation
  • Legal - coverage is more important than correctness
  • Search - accuracy is more important
  • Fraud - support cost (high fp) vs. wrath of the credit card co. (high fn)
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

           Predicted=1            Predicted=0
Actual=1   True+ (tp)             False- (fn) - Type II
Actual=0   False+ (fp) - Type I   True- (tn)
Confusion Matrix
Actual (rows) vs. Predicted (columns):

      C1   C2   C3   C4
C1    10    5    9    3
C2     4   20    3    7
C3     6    4   13    3
C4     2    1    4   15

Correct predictions are the diagonal entries (c_ii)
Precision(i) = c_ii / Σ_j c_ji   (column sum for class i)
Recall(i)    = c_ii / Σ_j c_ij   (row sum for class i)
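Those two formulas fall out of the matrix directly in base R:

```r
# Confusion matrix from the slide: rows = actual, columns = predicted
cm <- matrix(c(10,  5,  9,  3,
                4, 20,  3,  7,
                6,  4, 13,  3,
                2,  1,  4, 15),
             nrow = 4, byrow = TRUE,
             dimnames = list(actual    = paste0("C", 1:4),
                             predicted = paste0("C", 1:4)))

precision <- diag(cm) / colSums(cm)  # correct / all predicted as that class
recall    <- diag(cm) / rowSums(cm)  # correct / all actually in that class
round(precision, 2)
round(recall, 2)                     # recall for C3 is 13/26 = 0.5
```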
Model Evaluation : F-Measure
Precision = tp / (tp + fp) ; Recall = tp / (tp + fn)
F-Measure: a balanced, combined, weighted harmonic mean that measures effectiveness

1/F = α (1/P) + (1 - α) (1/R), or equivalently F = (β² + 1) P R / (β² P + R)

Common form (balanced F1): β = 1 (α = ½); F1 = 2PR / (P + R)

           Predicted=1            Predicted=0
Actual=1   True+ (tp)             False- (fn) - Type II
Actual=0   False+ (fp) - Type I   True- (tn)
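As a small base-R helper (`f_measure` is my name):

```r
# F-beta from 2x2 confusion counts; beta = 1 gives the balanced F1
f_measure <- function(tp, fp, fn, beta = 1) {
  p <- tp / (tp + fp)
  r <- tp / (tp + fn)
  (beta^2 + 1) * p * r / (beta^2 * p + r)
}

f_measure(tp = 8, fp = 2, fn = 2)   # P = R = 0.8, so F1 = 0.8
```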
Hands-on Walkthru - Model Evaluation
Train: 712 (80%), Test: 179, Total: 891
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model evaluation; the Kappa measure is interesting
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
ROC Analysis

o “How good is my model?”
o Good reference: http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
o “A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance”
o Much better than evaluating a model based on simple classification accuracy
o Plots tp rate vs. fp rate
ROC Graph - Discussion

o E = Conservative, Everything NO
o H = Liberal, Everything YES
o Am not making any political statement !
o F = Ideal
o G = Worst
o The diagonal is chance
o North-West corner is good
o South-East is bad
 • For example E
 • Believe it or not - I have actually seen a graph with the curve in this region !
[ROC graph with classifiers E, F, G, H plotted in tp-rate vs. fp-rate space]
ROC Graph – Clinical Example
IFCC: Measures of diagnostic accuracy - basic definitions
ROC Graph Walk thru
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
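An ROC curve is just tp rate vs. fp rate swept over a score threshold; a hand-rolled base-R sketch (the scores and labels below are made up for illustration, and 2-Model_Evaluation.R does its own version):

```r
# Hand-rolled ROC points: sweep a threshold over predicted scores
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   1,    0,   0,   1,   0,   0)

thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds,
              function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
fpr <- sapply(thresholds,
              function(t) sum(scores >= t & labels == 0) / sum(labels == 0))

# plot(fpr, tpr, type = "b", xlab = "False +ve rate", ylab = "True +ve rate")
cbind(threshold = thresholds, fpr = fpr, tpr = tpr)
```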
The Beginning As The End: Who will win Super Bowl XLIX ?
12:15
References:
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
 • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013 / Strata 2014, Olivier Grisel
 • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
 • http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
 • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
Homework: Bike Sharing at Washington DC
12:30
Few interesting Links - Comb the forums

o Quick first prediction: http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing
 • Solution by Brandon Harris
o Random forest: http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language
o Algorithms applied: http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction
o GBM: http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm
o Research paper: http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare
o ggplot: http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r
o Feature importances: http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances
o Converting datetime to hour: http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour
o Casual & registered users: http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count
o RMSLE: https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please
o Predicting new counts in R: http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r
o Weather data: http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data
o Date error: https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402
o Using dates in R: http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
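The competition is scored by RMSLE (root mean squared logarithmic error), mentioned in the links above; a small sketch of the metric (function name is my own):

```r
# RMSLE - the bike-sharing competition's evaluation metric:
# sqrt(mean((log(pred + 1) - log(actual + 1))^2))
rmsle <- function(predicted, actual) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

rmsle(c(10, 20, 30), c(10, 20, 30))   # 0 for a perfect prediction
```

The log makes the metric penalize relative error rather than absolute error, which suits counts that range from a handful to hundreds of rentals per hour.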
Data Organization – train, test & submission
• datetime - hourly date + timestamp
• season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
• holiday - whether the day is considered a holiday
• workingday - whether the day is neither a weekend nor a holiday
• weather
 • 1: Clear, Few clouds, Partly cloudy
 • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
 • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
 • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
• temp - temperature in Celsius
• atemp - "feels like" temperature in Celsius
• humidity - relative humidity
• windspeed - wind speed
• casual - number of non-registered user rentals initiated
• registered - number of registered user rentals initiated
• count - number of total rentals
Approach
o Convert to factors
o Engineer new features from date
o Explore other synthetic features
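The first two steps can be sketched in a few lines of base R; the column names follow the Kaggle data dictionary, and the two-row `train` data frame below is a stand-in for the loaded train.csv:

```r
# Feature engineering sketch for the bike-sharing data
train <- data.frame(
  datetime = c("2011-01-01 00:00:00", "2011-07-15 13:00:00"),
  season = c(1, 3), weather = c(1, 2))           # tiny stand-in sample

dt <- as.POSIXlt(train$datetime, format = "%Y-%m-%d %H:%M:%S")
train$hour  <- as.factor(dt$hour)                # engineered from datetime
train$month <- as.factor(dt$mon + 1)             # POSIXlt months are 0-based
train$year  <- as.factor(dt$year + 1900)         # POSIXlt years are since 1900
train$season  <- as.factor(train$season)         # convert codes to factors
train$weather <- as.factor(train$weather)
str(train)
```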
#1 : ctree
Refer to 3-Session-I-Bikes.R at https://github.com/xsankar/hairy-octo-hipster/
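The shape of a `ctree` fit (conditional inference tree from the `party` package) looks like this; since the bike data isn't loaded here, the sketch uses the built-in iris data as a stand-in, while the deck's actual code is in 3-Session-I-Bikes.R:

```r
# Conditional inference tree: formula interface, then predict on new data
library(party)

fit  <- ctree(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)
mean(pred == iris$Species)     # training accuracy
```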
#2 : Add Month + year
#3 : RandomForest
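The `randomForest` call has the same formula shape; again sketched on iris as a stand-in for the bike data (see 3-Session-I-Bikes.R for the deck's version). The `importance` matrix is what the forum's "feature importances" thread is about:

```r
# Random forest fit with per-feature importances
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
importance(rf)[, "MeanDecreaseGini"]   # feature importances
```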