trees forests and tensorflow fort lauderdale machine learning meetup 18-May-2016
Page 1

trees forests and tensorflow

fort lauderdale machine learning meetup, 18-May-2016

Page 2

Andy Catlin

• An enthusiastic student, enjoying Thomas Quintana's ongoing lecture series on Google's TensorFlow.

• A data science teacher, mentor, and coach. My focus areas are recommender systems and collective intelligence.

• An entrepreneur – my teams helped build out the analytics infrastructure for the National Football League and several NFL teams.

Above: Curriculum for City University of New York's Online Master's Degree in Data Analytics program, where I am the lead faculty member.

[email protected]

Page 3

classification and regression trees

Source: Gareth James, Robert Tibshirani, and Trevor Hastie, Introduction to Statistical Learning, Springer, 2013.

Page 4

sources / resources / learning roadmap

Resource (Math? / Videos? / Code?)

• Leo Breiman, "Statistical Modeling: The Two Cultures," Institute of Mathematical Statistics, 2001. See also "What's the difference between machine learning, statistics, and data mining?," http://www.sharpsightlabs.com/difference-machine-learning-statistics-data-mining/, Sharp Sight Labs, 2016. (Math: Little / Videos: No / Code: No)

• Scott Fortmann-Roe, "Understanding the Bias/Variance Tradeoff," http://scott.fortmann-roe.com/docs/BiasVariance.html, June 2012. (Math: Little / Videos: No / Code: No)

• Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie, Elements of Statistical Learning, freely downloadable at https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf. (Math: Lots / Videos: No / Code: No)

• Josh Gordon (Google), Machine Learning for Developers, YouTube video series: https://www.youtube.com/watch?v=cKxRvEZd3Mw&list=PLOU2XLYxmsIIuiBfYad6rFYQU_jL2ryal&index=4. (Math: Little / Videos: Yes / Code: Python)

• Joel Grus, Data Science from Scratch, O'Reilly, 2015. (Math: Some / Videos: No / Code: Python)

• Gareth James, Robert Tibshirani, and Trevor Hastie, Introduction to Statistical Learning, Springer, 2013. Freely downloadable at http://www-bcf.usc.edu/~gareth/ISL/. Excellent videos in the edX course, archived here: http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/. (Math: Some / Videos: Yes / Code: R)

• Victor Lavrenko, Decision Trees, YouTube video series, http://bit.ly/D-Tree. Part of the Introductory Applied Machine Learning course at the University of Edinburgh. (Math: Some / Videos: Yes / Code: No)

• Kevin Markham, "ROC curves and Area Under the Curve explained," http://www.dataschool.io/roc-curves-and-auc-explained/. See also "Understanding ROC Curves," http://www.navan.name/roc/, and "Comparing supervised learning algorithms," http://www.dataschool.io/comparing-supervised-learning-algorithms/. (Math: Little / Videos: Yes / Code: No)

• Foster Provost and Tom Fawcett, Data Science for Business, O'Reilly, 2013. (Math: Little / Videos: No / Code: No)

• Sebastian Raschka, Python Machine Learning, Packt, 2015. See also his blog post "When Does Deep Learning Work Better Than SVMs or Random Forests?," http://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html. (Math: Lots / Videos: No / Code: Yes)

• Wesleyan University, Machine Learning for Data Analysis, https://www.coursera.org/learn/machine-learning-data-analysis. Freely available Python-based Coursera course; part of a five-course specialization. (Math: Little / Videos: Yes / Code: Python)

Page 5

goal of this talk

• To help you understand when to use, how to build, and how to tune decision trees and random forests

Page 6

outline of this talk

• problem statement: "hiring analytics"
• decision trees and random forests
• trees into tensorflow?

Page 7

hiring analytics

Page 8

features and labels

• What does it mean for an NFL draft pick (“hire”) to have been successful?

• What are the features that matter most in selecting a player?

Page 9

http://www.nfl.com/combine/top-performers

Page 10

y = f(X)

candidate features:
• playerid: 39408, 39412, …
• position: QB, WR, DL, OL, …
• history of knee injuries: Yes, No
• 40 yard dash time: 4.24, 4.31, 4.67, …
• wonderlic score: 0..50
• "good citizen": Yes, No

candidate labels:
• probowl first five years?: Yes, No
• years in league?: 3.1, 5.2, …

http://wonderlictestsample.com/nfl-wonderlic-scores/
https://en.wikipedia.org/wiki/Wonderlic_test

Green Bay's Mike Eayrs was probably the NFL's first data scientist: http://www.baselinemag.com/c/a/Projects-Management/Green-Bay-Packers-Reel-Time

Page 11

Will this player be selected for the Pro Bowl?

features:
• playerid: 39408, 39412, …
• position: QB, WR, DL, OL, …
• history of knee injuries: No = 0, Yes = 1
• 40 yard dash time: Slow = 0, Medium = 1, Fast = 2
• wonderlic score: Normal = 0, Smart = 1
• "good citizen": Yes, No

labels:
• probowl first five years?: No = 0, Yes = 1
• years in league?: 3.1, 5.2, …
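A minimal sketch (not from the slides) of one way to turn these categorical values into the integer codes used on the next few pages; the lookup dictionaries and the example player are illustrative:

# illustrative lookup tables matching the codes above
speed_code = {'Slow': 0, 'Medium': 1, 'Fast': 2}
wonderlic_code = {'Normal': 0, 'Smart': 1}
yes_no_code = {'No': 0, 'Yes': 1}

# encode one hypothetical player: Fast, Smart, no knee injury
player = {'40 yard': 'Fast', 'wonderlic': 'Smart', 'knee injury': 'No'}
x = [speed_code[player['40 yard']],
     wonderlic_code[player['wonderlic']],
     yes_no_code[player['knee injury']]]
print(x)   # [2, 1, 0]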

Page 12

Which attribute matters most (1/2)?

Player  40yard  Wonderlic  KneeInjury  ProBowl
1       Medium  Smart      No          No
2       Medium  Smart      Yes         No
3       Fast    Smart      No          Yes
4       Fast    Smart      No          Yes
5       Fast    Normal     No          Yes
6       Slow    Normal     Yes         No
7       Fast    Normal     Yes         Yes
8       Medium  Smart      No          No
9       Fast    Normal     No          Yes
10      Fast    Normal     No          Yes
11      Fast    Normal     Yes         Yes
12      Fast    Smart      Yes         Yes
13      Fast    Normal     No          Yes
14      Slow    Smart      Yes         No
New     Fast    Smart      Yes         ?

Page 13

Which attribute matters most (2/2)?

Player  40yard  Wonderlic  KneeInjury  ProBowl
1       Fast    Smart      No          No
2       Fast    Smart      Yes         No
3       Medium  Smart      No          Yes
4       Slow    Smart      No          Yes
5       Slow    Normal     No          Yes
6       Slow    Normal     Yes         No
7       Medium  Normal     Yes         Yes
8       Fast    Smart      No          No
9       Fast    Normal     No          Yes
10      Slow    Normal     No          Yes
11      Fast    Normal     Yes         Yes
12      Medium  Smart      Yes         Yes
13      Medium  Normal     No          Yes
14      Slow    Smart      Yes         No
New     Fast    Smart      Yes         ?

Page 14

40yard X[0]  Wonderlic X[1]  KneeInjury X[2]  ProBowl
2  1  0  0
2  1  1  0
1  1  0  1
0  1  0  1
0  0  0  1
0  0  1  0
1  0  1  1
2  1  0  0
2  0  0  1
0  0  0  1
2  0  1  1
1  1  1  1
1  0  0  1
0  1  1  0

Page 15

Player  40yard X[0]  Wonderlic X[1]  KneeInjury X[2]  ProBowl
New     Medium       High            No               ?

Page 16

40yard X[0]  Wonderlic X[1]  KneeInjury X[2]  ProBowl
2  1  0  0
2  1  1  0
1  1  0  1
0  1  0  1
0  0  0  1
0  0  1  0
1  0  1  1
2  1  0  0
2  0  0  1
0  0  0  1
2  0  1  1
1  1  1  1
1  0  0  1
0  1  1  0

Player  40yard  Wonderlic  KneeInjury  ProBowl
New     2       1          0           ?

Page 17

id3 algorithm

Best name ever? ID3 stands for Iterative Dichotomiser 3. Source: Joel Grus, Data Science from Scratch, O'Reilly, 2015. See also https://en.wikipedia.org/wiki/ID3_algorithm.

Page 18

entropy

Source: Foster Provost and Tom Fawcett, Data Science for Business, O'Reilly, 2013. Note that scikit-learn uses the CART algorithm instead of ID3; CART uses Gini impurity instead of entropy. See also https://en.wikipedia.org/wiki/Decision_tree_learning
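For reference, the entropy behind the figure on this slide: for a set whose classes occur with proportions p_1, ..., p_k,

H = -\sum_{i=1}^{k} p_i \log_2 p_i

which is 0 for a pure set and 1 bit for an even two-class split.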

Page 19

information gain

Source: Foster Provost and Tom Fawcett, Data Science for Business, O’Reilly. 2013. See also https://en.wikipedia.org/wiki/Claude_Shannon
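For reference, information gain for a split of a parent set S into children c_1, ..., c_m is the drop in entropy:

IG(S) = H(S) - \sum_{j=1}^{m} \frac{|c_j|}{|S|} H(c_j)

The weighted second term is exactly what the partition_entropy code on the following pages computes.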

Page 20

Page 21

from collections import Counter, defaultdict
from functools import partial
import math, random

def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2)
               for p in class_probabilities
               if p)                          # skip zero probabilities

def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count               # assumes Python 3 (true division)
            for count in Counter(labels).values()]

Source: Joel Grus, Data Science from Scratch, O'Reilly, 2015.
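A quick check of these two functions (not in the book excerpt), using the 14 ProBowl labels from the earlier table (9 Yes, 5 No):

labels = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
print(entropy(class_probabilities(labels)))   # ~0.940 bits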

Page 22

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)

def partition_entropy(subsets):
    """find the entropy from this partition of data into subsets"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)

Source: Joel Grus, Data Science from Scratch, O'Reilly, 2015.
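A minimal sketch (not from the slides) of how these helpers can drive the ID3 split choice on the NFL rows. It assumes each example is an (attribute_dict, label) pair in the style of Grus's book, reuses partition_entropy from above, and the partition_by / partition_entropy_by glue functions are illustrative:

from collections import defaultdict

def partition_by(inputs, attribute):
    """group (attribute_dict, label) pairs by the value of one attribute"""
    groups = defaultdict(list)
    for example in inputs:
        groups[example[0][attribute]].append(example)
    return groups

def partition_entropy_by(inputs, attribute):
    """weighted entropy of the partition induced by one attribute"""
    return partition_entropy(partition_by(inputs, attribute).values())

# a few rows from the earlier table; the remaining players follow the same pattern
inputs = [({'40yard': 'Fast',   'Wonderlic': 'Smart', 'KneeInjury': 'No'},  'No'),
          ({'40yard': 'Fast',   'Wonderlic': 'Smart', 'KneeInjury': 'Yes'}, 'No'),
          ({'40yard': 'Medium', 'Wonderlic': 'Smart', 'KneeInjury': 'No'},  'Yes')]

# ID3's greedy step: split on the attribute with the lowest weighted entropy
# (equivalently, the highest information gain)
best = min(['40yard', 'Wonderlic', 'KneeInjury'],
           key=lambda a: partition_entropy_by(inputs, a))
print(best)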

Page 23

accuracy

Suppose you took the decision tree model that we built from 14 NFL players, then used the model to predict whether members of the next year’s college draft group would go on to play in the Pro Bowl.

• How accurate would your model be? • How should you best measure your model’s accuracy?

Page 24

Source: Gareth James, Robert Tibshirani, and Trevor Hastie, Introduction to Statistical Learning, Springer, 2013.

Page 25

Source: Gareth James, Robert Tibshirani, and Trevor Hastie, Introduction to Statistical Learning, Springer, 2013.

As the flexibility of f-hat [the estimate for the labelled response variable] increases, its variance increases and its bias decreases.

• Variance refers to the amount by which f-hat (our estimate for y) would change if we estimated it using a different training data set.
• Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

bias variance tradeoff
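For a test point x_0 and squared-error loss, these two quantities fit together in the standard decomposition of expected test error (stated here for reference):

E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)

so more flexible trees trade lower bias for higher variance, and the irreducible noise term sets a floor on accuracy.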

Page 26

Source: Scott Fortmann-Roe, "Understanding the Bias/Variance Tradeoff," http://scott.fortmann-roe.com/docs/BiasVariance.html, June 2012.

bias variance tradeoff

Page 27

Overfitting

Wesleyan University, Machine Learning for Data Analysis, https://www.coursera.org/learn/machine-learning-data-analysis

Page 28

Accuracy: confusion matrix
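The slide shows a confusion-matrix figure; as a reminder of the layout scikit-learn uses, here is a minimal sketch with made-up y_true / y_pred vectors (rows are actual classes, columns are predicted classes):

from sklearn.metrics import confusion_matrix

# for binary 0/1 labels the result is:
# [[true negatives, false positives],
#  [false negatives, true positives]]
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]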

Page 29

import numpy as np
from sklearn import tree

# Load the dataset
X = [[2,1,0],[2,1,1],[1,1,0],[0,1,0],[0,0,0],[0,0,1],[1,0,1],
     [2,1,0],[2,0,0],[0,0,0],[2,0,1],[1,1,1],[1,0,0],[0,1,1]]
y = [0,0,1,1,1,0,1,0,1,1,1,1,1,0]

nfl_feature_names = ['40 yard','wonderlic','knee injury']
nfl_target_names = ['No Pro Bowl', 'Pro Bowl']

# note: in newer scikit-learn versions train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
# with only 14 rows and test_size=0.3, just 4 or 5 rows land in the test set,
# so the accuracy estimate will be noisy

Page 30

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

Page 31

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import sklearn.metrics

print(sklearn.metrics.confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

Page 32

# visualization code
from sklearn.externals.six import StringIO
import pydotplus

dot_data = StringIO()
tree.export_graphviz(clf, out_file = dot_data,
                     feature_names = nfl_feature_names,
                     class_names = nfl_target_names,
                     filled = True, rounded = True,
                     impurity = False)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("c:\\Data\\ProBowl.pdf")   # both backslashes need escaping

Page 33

Random forests

Source: Gareth James, Robert Tibshirani, and Trevor Hastie, Introduction to Statistical Learning, Springer, 2013.

Page 34

# Build model on training data
# note: pred_train / tar_train here come from a much larger dataset than the
# 14-row NFL example (the confusion matrix below covers 1,830 test rows)
from sklearn.ensemble import RandomForestClassifier

# build 25 trees
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

[[1424   80]
 [ 217  109]]

0.837704918033
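As a sanity check on these numbers: the confusion matrix covers 1424 + 80 + 217 + 109 = 1830 test rows, and

\text{accuracy} = \frac{1424 + 109}{1830} \approx 0.838

while always predicting the majority class would already score (1424 + 80) / 1830 ≈ 0.822, which is why the confusion matrix is more informative than accuracy alone here.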

Page 35

# fit an Extra Trees model to the data
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

[ 0.02572953  0.01454145  0.02808065  0.01565101  0.00723755  0.00482434
  0.06410482  0.03400461  0.0571412   0.12897684  0.01891439  0.01500713
  0.02514497  0.06112466  0.05639455  0.05085095  0.01686558  0.06461658
  0.06336964  0.07272654  0.01245386  0.05971838  0.05615322  0.04636755]
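A small follow-on sketch (not on the slide) showing one way to rank features by importance with numpy's argsort; the importance values below are illustrative stand-ins for model.feature_importances_:

import numpy as np

importances = np.array([0.129, 0.073, 0.065, 0.026])   # illustrative values
ranking = np.argsort(importances)[::-1]                # indices, most important first
for idx in ranking:
    print(idx, importances[idx])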

Page 36

import matplotlib.pyplot as plt

trees = range(25)
accuracy = np.zeros(25)

# refit the forest with 1..25 trees and record test-set accuracy for each size
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)

Page 37

Decision trees: pros and cons

• Decision trees are less accurate than more modern methods.
• Great for "explainability", which is important for change management and easy operationalization.
• Handle interactions between variables better than regression methods.
• Random forests, by controlling for variance, approach "state of the art" accuracy, but they also suffer from explainability issues. Especially strong for ranking variables.

Page 38

Decision trees in TensorFlow

• The hard way: implement the algorithms from the ground up (e.g., ID3, CART)
• Higher-level approaches:
  • skflow: a scikit-learn-style interface to TensorFlow
  • keras

