Machine Learning and Data Science › events › 2017 › redmon… · Machine learning is the...

Date post: 07-Jun-2020
Machine Learning and Data Science Seth Mottaghinejad Data Scientist [email protected] www.linkedin.com/in/sethmott/ 1
Transcript
Page 1: Machine Learning and Data Science › events › 2017 › redmon… · Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed,

Machine Learning and Data Science

Seth Mottaghinejad

Data Scientist

[email protected] www.linkedin.com/in/sethmott/

I believe over the next decade computing will become even more ubiquitous and intelligence will become ambient... This will be made possible by an ever-growing network of connected devices, incredible computing capacity from the cloud, insights from big data, and intelligence from machine learning.

What do data scientists do?

the short version

translate a business problem to a data problem
use machine learning to find a close-enough solution
make the solution useful by finding the right way to surface it

Three essential skills of data scientists

Drew Conway
www.dataists.com/2010/09/the-data-science-venn-diagram/

and now the long version

business understanding

ask: identify key business variables
measure: define success metrics
"If you can't measure it, you can't improve it." Peter Drucker

data acquisition and understanding

data ingestion: ingest the data into the analytic environment
data exploration: explore the data to check quality and adequacy

modeling

pre-processing: set up data pipeline to prepare data
feature engineering: create features from raw data
model evaluation: how well does the model fit?
model selection: explore and find the "right" model

deployment

operationalization: deploy model to production
model consumption: use model to make predictions (scoring)
business validation: are business requirements met?

last but not least

iterate, iterate, iterate...

Recap: What do data scientists do?

What is machine learning?

An algorithm is a self-contained set of rules used to solve problems through data processing, math, or automated reasoning.

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed, using data (experience).

The problems ML algorithms try to solve are usually (1) prediction and (2) finding structure in data, so the algorithms that do them are called supervised learning and unsupervised learning algorithms respectively.

Two main types of ML problems

supervised learning (prediction problem)

regression algorithms predict a number (numeric target)
classification algorithms predict a category (categorical target)

unsupervised learning (data-mining to find structure)

supervised learning

look at some examples (labeled data) and find a way to predict future examples (a number in regression, a category in classification)
the target variable (the "labels") is whatever we want to predict

by comparing predictions with the actual labels, we know how well we're doing

unsupervised learning

look at unlabeled data and find general patterns
more subjective and difficult to interpret

Azure Machine Learning

Our goal is to make machine learning accessible to every enterprise, data scientist, developer, information worker, consumer, and device anywhere in the world.

An Azure ML experiment is a blank canvas connecting datasets to modules, each module performing some action on the data.

You build a model in a training experiment and convert it to a predictive experiment to publish it as a web service so that your model can be consumed. An ML project is a collection of experiments, datasets, notebooks, and other resources.

Resources

A listing of all the modules: https://msdn.microsoft.com/library/azure/dn906033.aspx
Go to Cortana Intelligence Gallery and examine some pre-existing experiments: http://gallery.cortanaintelligence.com/
Learn this basic infographic
Azure Machine Learning Documentation Center: https://azure.microsoft.com/services/machine-learning/

Supervised learning: Regression demo

Azure Machine Learning Studio

predicted price = - 364.378
                  + 156.012 * horsepower
                  - 52.7053 * city-mpg
                  - 3510.22 (if body-type is "hatchback")
                  + 2793.26 (if body-type is "convertible")
                  + 1519.70 (if body-type is "hardtop")
                  - 1116.29 (if body-type is "station wagon")
                  - 50.8291 (if body-type is "sedan")
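
The equation above can be read as a small scoring function. The following is a minimal sketch in Python, not the code Azure ML Studio generates: the coefficients are copied from the slide, while the function name, argument names and the example car are assumptions made for illustration.

```python
# Coefficients copied from the fitted equation on the slide.
INTERCEPT = -364.378
HORSEPOWER_COEF = 156.012
CITY_MPG_COEF = -52.7053
BODY_TYPE_OFFSETS = {
    "hatchback": -3510.22,
    "convertible": 2793.26,
    "hardtop": 1519.70,
    "station wagon": -1116.29,
    "sedan": -50.8291,
}


def predict_price(horsepower, city_mpg, body_type):
    """Score one car: intercept + numeric terms + the offset for its body type."""
    return (INTERCEPT
            + HORSEPOWER_COEF * horsepower
            + CITY_MPG_COEF * city_mpg
            + BODY_TYPE_OFFSETS[body_type])


# Hypothetical car, chosen only to illustrate the arithmetic.
example_price = predict_price(horsepower=100, city_mpg=25, body_type="sedan")
print(round(example_price, 4))
```

Only one body-type offset applies to any given car, which is why the categorical feature is handled with a lookup rather than a multiplication.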

Supervised learning: Classification demo

Azure Machine Learning Studio

Supervised learning

We are trying to predict a variable (called labels, target variable or response variable) using other variables (called features, explanatory variables, covariates, attributes or independent variables).

The word regression can be confusing because it sometimes refers to a family of ML algorithms rather than to the kind of problem. For example, linear regression is a regression algorithm but logistic regression is a classification algorithm!
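
To make the point about logistic regression concrete, here is a minimal sketch of a one-feature logistic regression trained by plain gradient descent. The data (hours studied versus pass/fail), the learning rate, the number of epochs and all function names are assumptions for illustration; real libraries use more robust optimizers. The model outputs a probability, which is then thresholded into a category.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=500):
    """Fit weight w and bias b by gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict_category(w, b, x, threshold=0.5):
    """Turn the predicted probability into a category (1 or 0)."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Hypothetical labeled data: hours studied and whether the student passed.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

# Center the feature so gradient descent converges quickly.
mean_hours = sum(hours) / len(hours)
centered = [h - mean_hours for h in hours]

w, b = train_logistic(centered, passed)
predictions = [predict_category(w, b, x) for x in centered]
print(predictions)
```

Despite its name, the output here is a category, which is what makes logistic regression a classification algorithm.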

Supervised learning algorithms

Common examples of algorithms used include

tree-based algorithms such as decision trees, random forests, boosted trees
linear models such as linear regression, logistic regression, lasso regression and elastic net
support vector machines
neural networks including deep learning

Most of these algorithms can be used for both classification and regression, although the implementation is different in each case, and some algorithms are more appropriate than others.

Supervised learning: training

An ML algorithm is sometimes called a model. When we build a model on data we say we train or fit a model.

For example, we say we trained a decision tree on the data, or fitted a decision tree model to the data. The result is called a trained model. A trained model is also often referred to as a model, which can be confusing.

Some people reserve the word model for a trained model, to distinguish it from the algorithm itself (which does not depend on data).

Supervised learning: scoring and evaluation

Once you have a trained model, you can use it to get predictions on any data (as long as it has the features needed by the model to run the predictions). This is called scoring.

If the data that you scored also has the target (or labels), you can compare scores (the predictions) to the target (the true values) to see how well the model predicts. This is called evaluating a model.
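
The three verbs (training, scoring, evaluating) can be sketched as three small functions. This is a minimal illustration assuming a one-feature linear regression fitted by ordinary least squares and mean absolute error as the evaluation metric; the function names and the data are made up for the example.

```python
def train(features, targets):
    """Training: fit slope and intercept by ordinary least squares."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(targets) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    sxx = sum((x - mean_x) ** 2 for x in features)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept  # this pair is the "trained model"


def score(model, features):
    """Scoring: the trained model only needs the features."""
    slope, intercept = model
    return [intercept + slope * x for x in features]


def evaluate(scores, targets):
    """Evaluating: compare scores to true values (mean absolute error)."""
    return sum(abs(s - t) for s, t in zip(scores, targets)) / len(targets)


# Hypothetical labeled training data following y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
model = train(train_x, train_y)

# Scoring works on any data that has the feature.
new_x = [5.0, 6.0]
new_scores = score(model, new_x)

# Evaluation is only possible when the scored data also has labels.
new_y = [11.0, 13.5]
error = evaluate(new_scores, new_y)
print(model, new_scores, error)
```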

Regression evaluation metrics

| evaluation metric | definition | interpretation |
|---|---|---|
| Root Mean Squared Error | $\sqrt{\frac{1}{n}\sum (\text{observed} - \text{predicted})^2}$ | average prediction error |
| R-squared | $R^2$, where $R$ is the correlation between observed and predicted | percentage of variation explained by the model |
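
Both regression metrics are short to compute by hand. The sketch below follows the definitions given on the slide, with R-squared taken as the squared correlation between observed and predicted values (for a least-squares linear fit with an intercept, evaluated on its own training data, this agrees with the "variation explained" form). The function names and the sample numbers are assumptions for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: sqrt of the mean of squared differences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)
    return r ** 2


# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 6.0, 6.0, 10.0]

print(rmse(observed, predicted))
print(r_squared(observed, predicted))
```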

Binary classification

Binary (2 categories) classification is the most common kind of classification, because it can be used to answer (predict) yes/no or true/false questions.

A model will make predictions (positive or negative) which we compare to the actual answers (true or false). When a prediction and the actual answer disagree we get a misclassification. This can happen in two ways: a false positive (FP) or a false negative (FN).

Confusion matrix

| | what we observe: true | what we observe: false |
|---|---|---|
| what we predict: positive | TP | FP |
| what we predict: negative | FN | TN |
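
The four cells of the confusion matrix can be counted directly from paired actual answers and predictions. This is a minimal sketch; the function name and the sample data are assumptions for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells from paired booleans."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1      # predicted positive, actually true
        elif guess and not truth:
            counts["FP"] += 1      # predicted positive, actually false
        elif not guess and truth:
            counts["FN"] += 1      # predicted negative, actually true
        else:
            counts["TN"] += 1      # predicted negative, actually false
    return counts


# Hypothetical actual answers and model predictions for eight cases.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, False, True, False, False, True]

counts = confusion_counts(actual, predicted)
print(counts)
```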

Binary classification evaluation metrics

| evaluation metric | definition | interpretation |
|---|---|---|
| accuracy | $\frac{TP + TN}{TP + FP + FN + TN}$ | percentage of correctly classified cases |
| ROC curve | a visual representation of the model's performance | refer to this article |
| AUC | area under the ROC curve | close to 0.5 is bad, close to 1 is good |
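
Accuracy follows directly from the confusion-matrix counts. AUC can be computed without drawing the ROC curve, using its rank interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score. The sketch below assumes that interpretation; the function names, counts and scores are made up for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Share of correctly classified cases."""
    return (tp + tn) / (tp + fp + fn + tn)


def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison; ties count half."""
    positives = [s for a, s in zip(actual, scores) if a]
    negatives = [s for a, s in zip(actual, scores) if not a]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical confusion-matrix counts.
example_accuracy = accuracy(tp=3, fp=1, fn=1, tn=3)

# Hypothetical labels (1 = true) and predicted probabilities.
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
example_auc = auc(actual, scores)

print(example_accuracy, example_auc)
```

The pairwise loop is quadratic in the number of rows, which is fine for a small illustration but not how a library would compute it on large data.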

Supervised learning: recap

| term | what is needed | results in |
|---|---|---|
| training (a model) | appropriate ML algorithm + labeled data | trained model |
| scoring (data) | trained model + data (labeled or unlabeled) | scores (predictions) |
| evaluating a model | scoring labeled data | evaluation metrics |

More about scoring and model evaluation in the next chapter.

| what the machine learning community calls it | what the community of statisticians calls it |
|---|---|
| learning algorithm (or model) | model |
| trained model (or just model) | fitted model |
| supervised learning | prediction problem |
| unsupervised learning | data-mining or pattern recognition |
| features or attributes | explanatory or independent variables |
| target or labels | response or dependent variables |
| training | fitting |
| scoring | predicting |

Unsupervised learning

We are trying to find structure (natural groupings) in the data. There is no target, only features.

The k-means algorithm attempts to find clusters in the observations (rows): two observations in the same cluster will have similar features.

Principal component analysis attempts to find clusters in the variables (columns): two variables within the same cluster will contain similar (redundant) information.

Unsupervised learning: Clustering demo

k-means clustering

k-means clustering is an algorithm that attempts to find groupings in the rows of the data.
It finds similar data points (observations) when we compare their features.

So k-means clustering reduces redundancy in the data across rows.
We choose k (the number of clusters) and k-means gives us a new column showing the cluster assignment for each row.
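
The idea of "a new column showing the cluster assignment" can be sketched with a small, dependency-free k-means. This is an illustration under simplifying assumptions, not the Azure ML module: the starting centroids are simply the first k rows (real implementations usually pick random or k-means++ starts), the number of iterations is fixed, and the data is made up.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, iterations=10):
    """Return (cluster assignment per row, final centroids)."""
    # Assumption: use the first k rows as starting centroids.
    centroids = [list(row) for row in rows[:k]]
    assignments = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical data: six rows, two features, two visible groups.
rows = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]

cluster_column, centroids = k_means(rows, k=2)
print(cluster_column)
print(centroids)
```

The returned `cluster_column` plays the role of the new column: one cluster label per row, in row order.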

Unsupervised learning: PCA demo

Principal component analysis

Principal component analysis attempts to find groupings of the features.
It finds features that are similar (relay similar information because they are highly correlated) and combines them into one feature called a principal component.

PCA reduces redundancy in the data across columns.
PCA is an example of dimensionality reduction: usually, the first few principal components capture most of the variation in the data. So we replace our original p features with the first m principal components, where m should be much smaller than p.
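
A small sketch can show how two highly correlated features collapse into one principal component. To stay dependency-free it finds only the first component, using power iteration on the covariance matrix (libraries typically use a singular value decomposition instead). The function name, the iteration count and the data are assumptions for illustration, and the features are centered but not rescaled.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (weights, scores, share of variance) for the first component."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix of the features (p x p).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the direction of largest variance.
    weights = [1.0] * p
    for _ in range(iterations):
        product = [sum(cov[i][j] * weights[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        weights = [v / norm for v in product]
    component_variance = sum(weights[i] * cov[i][j] * weights[j]
                             for i in range(p) for j in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    # Each score is a weighted combination of the centered original features.
    scores = [sum(w * v for w, v in zip(weights, r)) for r in centered]
    return weights, scores, component_variance / total_variance


# Hypothetical data: the second feature is roughly twice the first.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]

weights, scores, explained = first_principal_component(rows)
print(weights, explained)
```

Here p = 2 and m = 1: the single column of `scores` replaces both original features while keeping almost all of the variation.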

Some final notes

There is a lot more to ML than choosing an algorithm. For example, in supervised learning, when we evaluate a model, we must score and evaluate predictions on out-of-sample data (also called test data: data not used to train the model). Otherwise our evaluation is not fair, because it may be biased by small and insignificant "trends" in the data used to build the model (called training data) that do not generalize to the larger data. This is referred to as overfitting.
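
The effect is easy to reproduce with a model that memorizes its training data. The sketch below assumes a 1-nearest-neighbor predictor, a made-up data set where the target is roughly twice the feature, and a simple every-other-row split; all names are hypothetical. Evaluated on the training data the model looks perfect, while the held-out test data reveals its real error.

```python
def train_nearest_neighbor(features, targets):
    """'Training' just stores the labeled rows, so the model memorizes them."""
    return list(zip(features, targets))


def score(model, features):
    """Predict the target of the closest stored training row."""
    return [min(model, key=lambda pair: abs(pair[0] - x))[1] for x in features]


def mean_absolute_error(predicted, observed):
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical labeled data: target is roughly 2 * feature, plus noise.
data = [(1, 2.3), (2, 3.6), (3, 6.4), (4, 7.7),
        (5, 10.2), (6, 11.9), (7, 14.3), (8, 15.8)]

train_rows = data[0::2]  # rows used to train the model
test_rows = data[1::2]   # out-of-sample rows, never seen in training

train_x, train_y = zip(*train_rows)
test_x, test_y = zip(*test_rows)

model = train_nearest_neighbor(train_x, train_y)
train_error = mean_absolute_error(score(model, train_x), train_y)
test_error = mean_absolute_error(score(model, test_x), test_y)

print(train_error, test_error)
```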

One big distinction between PCA and k-means clustering is that k-means gives us distinct clusters and every row of the data falls into one and only one cluster. PCA on the other hand gives us principal components, where each principal component is a linear combination (like a weighted average) of the original features in the data. Any one principal component may have higher weights associated with certain features, but the distinction is not clear-cut. This makes principal components abstract and hard to interpret.

Although unsupervised learning deals with unlabeled data, we can still "train" a "model" on data and "score" data using a trained model (quotes are used to emphasize that the terms are used more loosely), but evaluating a model is very difficult and subjective because no true labels are present.

For example, we can train a k-means algorithm on unlabeled data, then score it so that we have a new column indicating which cluster each row belongs to. But we must decide ahead of time how many clusters we should have, and it's very hard for us to know if our clusters are "correct".

So it is best to avoid the verbs training and scoring, and the term model when referring to unsupervised learning algorithms to lessen the confusion.

Although clustering and PCA are unsupervised learning algorithms, machine learning workflows often involve combining supervised and unsupervised learning algorithms. Examples:

run k-means to build clusters and use clusters as one of the features in a regression model

run PCA and use the top few principal components as features in a regression instead of the original features

the end

Thank you

Questions?
