trees, forests, and tensorflow
fort lauderdale machine learning meetup, 18-May-2016
Andy Catlin
• An enthusiastic student, enjoying Thomas Quintana’s ongoing lecture series on Google’s TensorFlow.
• A data science teacher, mentor, and coach. My focus areas are recommender systems and collective intelligence.
• An entrepreneur – my teams helped build out the analytics infrastructure for the National Football League and several NFL teams.
Above: Curriculum for City University of New York’s Online Masters Degree in Data Analytics program, where I am the lead faculty member.
classification and regression trees
Source: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013.
sources / resources / learning roadmap
Resource | Math? | Videos? | Code?
Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science (Institute of Mathematical Statistics), 2001. See also “What’s the difference between machine learning, statistics, and data mining?,” http://www.sharpsightlabs.com/difference-machine-learning-statistics-data-mining/, Sharp Sight Labs, 2016. | Little | No | No
Scott Fortmann-Roe, “Understanding the Bias/Variance Tradeoff,” http://scott.fortmann-roe.com/docs/BiasVariance.html, June 2012. | Little | No | No
Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman, The Elements of Statistical Learning, Springer. Freely downloadable here: https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf | Lots | No | No
Josh Gordon (Google), Machine Learning for Developers, YouTube video series: https://www.youtube.com/watch?v=cKxRvEZd3Mw&list=PLOU2XLYxmsIIuiBfYad6rFYQU_jL2ryal&index=4 | Little | Yes | Python
Joel Grus, Data Science from Scratch, O’Reilly, 2015. | Some | No | Python
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013. Freely downloadable at http://www-bcf.usc.edu/~gareth/ISL/. Excellent videos in the edX course, archived here: http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/. | Some | Yes | R
Victor Lavrenko, Decision Trees, YouTube video series, http://bit.ly/D-Tree. Part of the Introductory Applied Machine Learning course at the University of Edinburgh. | Some | Yes | No
Kevin Markham, “ROC curves and Area Under the Curve explained,” http://www.dataschool.io/roc-curves-and-auc-explained/. See also “Understanding ROC Curves,” http://www.navan.name/roc/, and “Comparing supervised learning algorithms,” http://www.dataschool.io/comparing-supervised-learning-algorithms/. | Little | Yes | No
Foster Provost and Tom Fawcett, Data Science for Business, O’Reilly, 2013. | Little | No | No
Sebastian Raschka, Python Machine Learning, Packt, 2015. See also his blog post “When Does Deep Learning Work Better Than SVMs or Random Forests?,” http://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html. | Lots | No | Python
Wesleyan University, Machine Learning for Data Analysis, https://www.coursera.org/learn/machine-learning-data-analysis. Freely available Python-based Coursera course; part of a five-course specialization. | Little | Yes | Python
goal of this talk
• To help you understand when to use, how to build, and how to tune decision trees and random forests
outline of this talk
• problem statement: “hiring analytics”
• decision trees and random forests
• trees into tensorflow?
hiring analytics
features and labels
• What does it mean for an NFL draft pick (“hire”) to have been successful?
• What are the features that matter most in selecting a player?
y = f(X)

candidate features:
• playerid: 39408, 39412, …
• position: QB, WR, DL, OL, …
• history of knee injuries: Yes, No
• 40 yard dash time: 4.24, 4.31, 4.67, …
• wonderlic score: 0..50
• “good citizen”: Yes, No

candidate labels:
• probowl first five years?: Yes, No
• years in league?: 3.1, 5.2, …
Sources: http://wonderlictestsample.com/nfl-wonderlic-scores/; https://en.wikipedia.org/wiki/Wonderlic_test. Green Bay’s Mike Eayrs was probably the NFL’s first data scientist: http://www.baselinemag.com/c/a/Projects-Management/Green-Bay-Packers-Reel-Time
Will this player be selected for the Pro Bowl?

features:
• playerid: 39408, 39412, …
• position: QB, WR, DL, OL, …
• history of knee injuries: No = 0, Yes = 1
• 40 yard dash time: Slow = 0, Medium = 1, Fast = 2
• wonderlic score: Normal = 0, Smart = 1
• “good citizen”: Yes, No

labels:
• probowl first five years?: No = 0, Yes = 1
• years in league?: 3.1, 5.2, …
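As a minimal sketch (the dict and function names here are illustrative, not from the talk), the categorical-to-integer encoding above can be written directly in Python:

FORTY_YARD = {'Slow': 0, 'Medium': 1, 'Fast': 2}
WONDERLIC = {'Normal': 0, 'Smart': 1}
KNEE_INJURY = {'No': 0, 'Yes': 1}

def encode_player(forty, wonderlic, knee):
    """map one player's categorical attributes to the integer features X[0..2]"""
    return [FORTY_YARD[forty], WONDERLIC[wonderlic], KNEE_INJURY[knee]]

print(encode_player('Fast', 'Smart', 'Yes'))   # the "New" prospect below -> [2, 1, 1]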
Which attribute matters most (1/2)?

Player | 40yard | Wonderlic | KneeInjury | ProBowl
1 | Medium | Smart | No | No
2 | Medium | Smart | Yes | No
3 | Fast | Smart | No | Yes
4 | Fast | Smart | No | Yes
5 | Fast | Normal | No | Yes
6 | Slow | Normal | Yes | No
7 | Fast | Normal | Yes | Yes
8 | Medium | Smart | No | No
9 | Fast | Normal | No | Yes
10 | Fast | Normal | No | Yes
11 | Fast | Normal | Yes | Yes
12 | Fast | Smart | Yes | Yes
13 | Fast | Normal | No | Yes
14 | Slow | Smart | Yes | No
New | Fast | Smart | Yes | ?
Which attribute matters most (2/2)?

Player | 40yard | Wonderlic | KneeInjury | ProBowl
1 | Fast | Smart | No | No
2 | Fast | Smart | Yes | No
3 | Medium | Smart | No | Yes
4 | Slow | Smart | No | Yes
5 | Slow | Normal | No | Yes
6 | Slow | Normal | Yes | No
7 | Medium | Normal | Yes | Yes
8 | Fast | Smart | No | No
9 | Fast | Normal | No | Yes
10 | Slow | Normal | No | Yes
11 | Fast | Normal | Yes | Yes
12 | Medium | Smart | Yes | Yes
13 | Medium | Normal | No | Yes
14 | Slow | Smart | Yes | No
New | Fast | Smart | Yes | ?
40yard X[0] | Wonderlic X[1] | KneeInjury X[2] | ProBowl
2 | 1 | 0 | 0
2 | 1 | 1 | 0
1 | 1 | 0 | 1
0 | 1 | 0 | 1
0 | 0 | 0 | 1
0 | 0 | 1 | 0
1 | 0 | 1 | 1
2 | 1 | 0 | 0
2 | 0 | 0 | 1
0 | 0 | 0 | 1
2 | 0 | 1 | 1
1 | 1 | 1 | 1
1 | 0 | 0 | 1
0 | 1 | 1 | 0

Player | 40yard X[0] | Wonderlic X[1] | KneeInjury X[2] | ProBowl
New | Medium | High | No | ?
(same encoded table as above)

Player | 40yard X[0] | Wonderlic X[1] | KneeInjury X[2] | ProBowl
New | 2 | 1 | 0 | ?
id3 algorithm
Best name ever? Iterative Dichotomiser 3. Source: Joel Grus, Data Science from Scratch, O’Reilly, 2015. See also https://en.wikipedia.org/wiki/ID3_algorithm.
entropy
Source: Foster Provost and Tom Fawcett, Data Science for Business, O’Reilly, 2013. Note that scikit-learn uses the CART algorithm instead of ID3; CART uses Gini impurity instead of entropy. See also https://en.wikipedia.org/wiki/Decision_tree_learning
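The underlying formula is the standard Shannon entropy, which the Python entropy function below implements: for a set whose class proportions are $p_1, \dots, p_k$,

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i,$$

with the convention that $0 \log_2 0 = 0$ (the “if p” guard in the code).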
information gain
Source: Foster Provost and Tom Fawcett, Data Science for Business, O’Reilly, 2013. See also https://en.wikipedia.org/wiki/Claude_Shannon
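Information gain is the drop in entropy achieved by a split. As a sketch of the standard definition (consistent with the partition_entropy code below): if a split partitions a parent set $S$ into subsets $S_1, \dots, S_m$, then

$$IG(S) = H(S) - \sum_{j=1}^{m} \frac{|S_j|}{|S|}\, H(S_j).$$

ID3 greedily chooses the attribute whose split has the largest gain, i.e., the lowest weighted child entropy.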
from collections import Counter, defaultdict
from functools import partial
import math, random

def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2)
               for p in class_probabilities
               if p)                          # skip zero probabilities

def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count
            for count in Counter(labels).values()]
Source: Joel Grus, Data Science from Scratch, O’Reilly, 2015.
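As a quick sanity check (my own usage example, not from the deck): the 14-player table above has 9 Pro Bowlers and 5 non-Pro Bowlers, so its entropy should be about 0.94 bits.

labels = [0,0,1,1,1,0,1,0,1,1,1,1,1,0]         # the 14 ProBowl labels above
print(class_probabilities(labels))              # [5/14, 9/14] ≈ [0.357, 0.643]
print(entropy(class_probabilities(labels)))     # ≈ 0.940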
def data_entropy(labeled_data):
    """labeled_data is a list of (input, label) pairs"""
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)

def partition_entropy(subsets):
    """find the entropy from this partition of data into subsets"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)
Source: Joel Grus, Data Science from Scratch, O’Reilly, 2015.
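To connect these functions to the player table: a small helper (my own sketch; Grus’s book has a similar partition_entropy_by) groups the encoded players by one feature and measures how much entropy each candidate split leaves behind; ID3 splits on the feature that leaves the least.

from collections import defaultdict

# encoded players from the tables above (X[0]=40yard, X[1]=wonderlic, X[2]=knee injury)
X = [[2,1,0],[2,1,1],[1,1,0],[0,1,0],[0,0,0],[0,0,1],[1,0,1],
     [2,1,0],[2,0,0],[0,0,0],[2,0,1],[1,1,1],[1,0,0],[0,1,1]]
y = [0,0,1,1,1,0,1,0,1,1,1,1,1,0]

def partition_by(labeled_data, feature_idx):
    """group (features, label) pairs by the value of one feature"""
    groups = defaultdict(list)
    for features, label in labeled_data:
        groups[features[feature_idx]].append((features, label))
    return groups

def partition_entropy_by(labeled_data, feature_idx):
    """weighted entropy remaining after splitting on one feature"""
    return partition_entropy(partition_by(labeled_data, feature_idx).values())

labeled = list(zip(X, y))
for idx, name in enumerate(['40 yard', 'wonderlic', 'knee injury']):
    print(name, partition_entropy_by(labeled, idx))
# the feature with the lowest remaining entropy (highest information gain)
# becomes the root of the ID3 tree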
accuracy
Suppose you took the decision tree model that we built from 14 NFL players, then used the model to predict whether members of the next year’s college draft group would go on to play in the Pro Bowl.
• How accurate would your model be?
• How should you best measure your model’s accuracy?
Source: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013.
As the flexibility of f-hat [the estimate for the labelled response variable] increases, its variance increases and its bias decreases.
• Variance refers to the amount by which f-hat (our estimate for y) would change if we estimated it using a different training data set.
• Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
bias variance tradeoff
Source: Scott Fortmann-Roe, “Understanding the Bias/Variance Tradeoff,” http://scott.fortmann-roe.com/docs/BiasVariance.html, June 2012.
bias variance tradeoff
Overfitting
Source: Wesleyan University, Machine Learning for Data Analysis, https://www.coursera.org/learn/machine-learning-data-analysis
Accuracy: confusion matrix
import numpy as np
from sklearn import tree

# Load the dataset (the 14 encoded players from the tables above)
X = [[2,1,0],[2,1,1],[1,1,0],[0,1,0],[0,0,0],[0,0,1],[1,0,1],
     [2,1,0],[2,0,0],[0,0,0],[2,0,1],[1,1,1],[1,0,0],[0,1,1]]
y = [0,0,1,1,1,0,1,0,1,1,1,1,1,0]

nfl_feature_names = ['40 yard', 'wonderlic', 'knee injury']
nfl_target_names = ['No Pro Bowl', 'Pro Bowl']

# hold out 30% of the players as a test set
# (in scikit-learn >= 0.18, import train_test_split from sklearn.model_selection)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import sklearn.metrics

print(sklearn.metrics.confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

# visualization code: export the fitted tree to a PDF via graphviz
from sklearn.externals.six import StringIO
import pydotplus

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=nfl_feature_names,
                     class_names=nfl_target_names,
                     filled=True, rounded=True, impurity=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("c:\\Data\\ProBowl.pdf")
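As a usage note (my own addition, reusing the names defined above): the fitted tree can now score the encoded “New” prospect from the earlier slide.

# the "New" prospect, encoded earlier as [2, 1, 0] (Fast, Smart, no knee injury)
print(clf.predict([[2, 1, 0]]))          # predicted class: 0 or 1
print(clf.predict_proba([[2, 1, 0]]))    # probabilities for [No Pro Bowl, Pro Bowl]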
Random forests
Source: Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013.
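Note: the snippet below comes from a larger example; judging by the confusion-matrix totals, it was run on the Wesleyan Coursera course dataset, not the 14-player NFL toy set. It assumes train/test splits named pred_train, pred_test, tar_train, tar_test already exist. A minimal sketch of that assumed setup, with predictors and targets as placeholder names for the course dataset’s feature matrix and label vector:

from sklearn.cross_validation import train_test_split
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=0.4)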
# Build model on training data
from sklearn.ensemble import RandomForestClassifier

# build 25 trees
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
[[1424   80]
 [ 217  109]]
0.837704918033
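Reading the output above: scikit-learn’s convention is rows = true classes, columns = predicted classes, so this run produced 1424 true negatives, 80 false positives, 217 false negatives, and 109 true positives. Accuracy is (1424 + 109) / 1830 ≈ 0.838, matching the second line; note how much of it comes from the majority (negative) class.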
# fit an Extra Trees model to the data
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)

# display the relative importance of each attribute
print(model.feature_importances_)
[ 0.02572953  0.01454145  0.02808065  0.01565101  0.00723755  0.00482434
  0.06410482  0.03400461  0.0571412   0.12897684  0.01891439  0.01500713
  0.02514497  0.06112466  0.05639455  0.05085095  0.01686558  0.06461658
  0.06336964  0.07272654  0.01245386  0.05971838  0.05615322  0.04636755]
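Raw importances are hard to read on their own. A small follow-up (my own sketch; predictor_names is a placeholder for the course dataset’s column names) pairs each score with its column and sorts, highest first:

# predictor_names is a placeholder list of the dataset's column names
for name, score in sorted(zip(predictor_names, model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(name, round(score, 4))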
import matplotlib.pyplot as plt

trees = range(25)
accuracy = np.zeros(25)

# grow forests of 1..25 trees and record test-set accuracy for each size
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
Decision trees: pros and cons
• Decision trees are less accurate than more modern methods.
• Great for “explainability,” which is important for change management and easy operationalization.
• Handle interactions between variables better than regression methods.
• Random forests, by controlling for variance, approach “state of the art” accuracy, but they also suffer from explainability issues. Especially strong for ranking variables.
Decision trees in TensorFlow
• The hard way: implement the algorithms from the ground up (e.g., ID3, CART)
• Higher-level approaches:
  • skflow: a scikit-learn-style interface to TensorFlow (sketch below)
  • keras
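As of mid-2016, skflow (later absorbed into tf.contrib.learn) exposed scikit-learn-style estimators, but no drop-in decision-tree estimator. A minimal, era-specific sketch of its API, reusing the tree example’s splits and treating the exact class name as an assumption of that release:

import numpy as np
import skflow                      # 2016-era package; later tf.contrib.learn
from sklearn import metrics

# a small DNN classifier as the closest high-level stand-in for a tree model
classifier = skflow.TensorFlowDNNClassifier(hidden_units=[10, 10], n_classes=2)
classifier.fit(np.array(X_train, dtype=np.float32), np.array(y_train))
predictions = classifier.predict(np.array(X_test, dtype=np.float32))
print(metrics.accuracy_score(y_test, predictions))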