
Post on 07-Jun-2020


Machine Learning and Data Science

Seth Mottaghinejad

Data Scientist

sethmott@microsoft.com
www.linkedin.com/in/sethmott/

1

I believe over the next decade computing will become even more ubiquitous and intelligence will become ambient... This will be made possible by an ever-growing network of connected devices, incredible computing capacity from the cloud, insights from big data, and intelligence from machine learning.

2

What do data scientists do?

3

the short version

translate a business problem to a data problem
use machine learning to find a close-enough solution
make the solution useful by finding the right way to surface it

4

Three essential skills of data scientists

Drew Conway
www.dataists.com/2010/09/the-data-science-venn-diagram/

5

and now the long version

6

business understanding

ask: identify key business variables
measure: define success metrics
"If you can't measure it, you can't improve it." Peter Drucker

data acquisition and understanding

data ingestion: ingest the data into the analytic environment
data exploration: explore the data to check quality and adequacy

7

modeling

pre-processing: set up data pipeline to prepare data
feature engineering: create features from raw data
model evaluation: how well does the model fit?
model selection: explore and find the "right" model

8

deployment

operationalization: deploy model to production
model consumption: use model to make predictions (scoring)
business validation: are business requirements met?

9

last but not least

iterate, iterate, iterate...

10

Recap: What do data scientists do?

11

What is machine learning?

An algorithm is a self-contained set of rules used to solve problems through data processing, math, or automated reasoning.

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed, using data (experience).

The problems ML algorithms try to solve are usually (1) prediction and (2) finding structure in data, so the algorithms that do them are called supervised learning and unsupervised learning algorithms respectively.

12

Two main types of ML problems

supervised learning (prediction problem)

regression algorithms predict a number (numeric target)
classification algorithms predict a category (categorical target)

unsupervised learning (data mining to find structure)

13

supervised learning

look at some examples (labeled data) and find a way to predict (a number in regression, a category in classification) future examples
the target variable (the "labels") is whatever we want to predict

by comparing predictions with the actual labels, we know how well we're doing

unsupervised learning

look at unlabeled data and find general patterns
more subjective and difficult to interpret

14

Azure Machine Learning

Our goal is to make machine learning accessible to every enterprise, data scientist, developer, information worker, consumer, and device anywhere in the world.

15

An Azure ML experiment is a blank canvas connecting datasets to modules, each module performing some action on the data.

You build a model in a training experiment and convert it to a predictive experiment to publish it as a web service so that your model can be consumed. An ML project is a collection of experiments, datasets, notebooks, and other resources.

16

17

18

Resources

A listing of all the modules: https://msdn.microsoft.com/library/azure/dn906033.aspx
Go to Cortana Intelligence Gallery and examine some pre-existing experiments: http://gallery.cortanaintelligence.com/
Learn this basic infographic
Azure Machine Learning Documentation Center: https://azure.microsoft.com/services/machine-learning/

19

Supervised learning: Regression demo

Azure Machine Learning Studio

20

21

predicted price = - 364.378
                  + 156.012 * horsepower
                  - 52.7053 * city-mpg
                  - 3510.22 (if body-type is "hatchback")
                  + 2793.26 (if body-type is "convertible")
                  + 1519.70 (if body-type is "hardtop")
                  - 1116.29 (if body-type is "station wagon")
                  - 50.8291 (if body-type is "sedan")
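The fitted equation above can be read as a small scoring function. This is a minimal sketch, assuming the body-type terms act as mutually exclusive indicator (dummy) variables so that exactly one adjustment applies per car. The name `predict_price` and the dictionary layout are illustrative and are not part of the Azure ML demo.

```python
# Coefficients copied from the fitted linear regression shown above.
INTERCEPT = -364.378
HORSEPOWER_COEF = 156.012
CITY_MPG_COEF = -52.7053

# One adjustment per body type (assumed here to be mutually exclusive).
BODY_TYPE_COEF = {
    "hatchback": -3510.22,
    "convertible": 2793.26,
    "hardtop": 1519.70,
    "station wagon": -1116.29,
    "sedan": -50.8291,
}


def predict_price(horsepower, city_mpg, body_type):
    """Score one car by plugging its features into the fitted equation."""
    return (INTERCEPT
            + HORSEPOWER_COEF * horsepower
            + CITY_MPG_COEF * city_mpg
            + BODY_TYPE_COEF[body_type])


# A hypothetical car: 100 horsepower, 25 city mpg, sedan body.
example_price = predict_price(100, 25, "sedan")
print(round(example_price, 2))
```

Holding horsepower and city-mpg fixed, the gap between two body types is just the difference of their coefficients, which is one way to interpret a linear model.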

22

Supervised learning: Classification demo

Azure Machine Learning Studio

23

24

Supervised learning

We are trying to predict a variable (called labels, target variable or response variable) using other variables (called features, explanatory variables, covariates, attributes or independent variables).

Sometimes regression refers to a family of ML algorithms. For example, linear regression is a regression algorithm but logistic regression is a classification algorithm!

25

Supervised learning algorithms

Common examples of algorithms used include

tree-based algorithms such as decision trees, random forests, and boosted trees
linear models such as linear regression, logistic regression, lasso regression, and elastic net
support vector machines
neural networks, including deep learning

Most of these algorithms can be used for both classification and regression, although the implementation is different in each case, and some algorithms are more appropriate than others.

26

Supervised learning: training

An ML algorithm is sometimes called a model. When we build a model on data we say we train or fit a model.

For example, we say we trained a decision tree on the data, or fitted a decision tree model to the data. The result is called a trained model. A trained model is also often referred to as a model, which can be confusing.

Sometimes, people use the word model for a trained model to distinguish it from the algorithm itself (which does not depend on data).

27

Supervised learning: scoring and evaluation

Once you have a trained model, you can use it to get predictions on any data (as long as it has the features needed by the model to run the predictions). This is called scoring.

If the data that you scored also has the target (or labels), you can compare the scores (the predictions) to the target (the true values) to see how well the model predicts. This is called evaluating a model.
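The three steps (training, scoring, evaluating) can be sketched in plain Python. This is an illustration only: it assumes a one-feature least-squares line as the algorithm, uses made-up data, and the names `train_line`, `score` and `mean_absolute_error` are hypothetical.

```python
def train_line(xs, ys):
    """Training: fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned dictionary is the trained model: it depends on the data.
    return {"intercept": intercept, "slope": slope}


def score(model, xs):
    """Scoring: use the trained model to get predictions for feature values."""
    return [model["intercept"] + model["slope"] * x for x in xs]


def mean_absolute_error(actual, predicted):
    """Evaluating: compare the scores with the true labels."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up labeled data used for training (feature x, target y).
train_x, train_y = [1, 2, 3, 4], [3, 5, 7, 9]
model = train_line(train_x, train_y)

# Made-up labeled data that was not used for training.
new_x, new_y = [5, 6], [11, 14]
scores = score(model, new_x)
error = mean_absolute_error(new_y, scores)
print(model, scores, error)
```

Note that `train_line` (the algorithm) exists before any data is seen, while `model` (the trained model) only exists after fitting.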

28

Regression evaluation metrics

| evaluation metric | definition | interpretation |
|---|---|---|
| Root Mean Squared Error | $\sqrt{\frac{1}{n}\sum (\text{observed} - \text{predicted})^2}$ | average prediction error |
| R-squared | $R^2$, where $R$ is the correlation between observed and predicted | percentage of variation explained by the model |
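Both regression metrics can be computed directly from their definitions. The sketch below uses made-up observed and predicted values, and the function names `rmse` and `r_squared` are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: square root of the average squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Square of the Pearson correlation between observed and predicted."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)
    return r ** 2


# Made-up labels and scores for four rows.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 6.0, 6.0, 10.0]

print(rmse(observed, predicted))       # every error is +/-1, so RMSE is 1.0
print(r_squared(observed, predicted))  # approximately 0.9
```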

29

Binary classification

Binary (2 categories) classification is the most common kind of classification, because it can be used to answer (predict) yes/no or true/false questions.

A model will make predictions (positive or negative) which we compare to the actual answers (true or false). When a prediction disagrees with the actual answer we get a misclassification. This can happen in two ways: a false positive or a false negative.

30

Confusion matrix

| | what we observe: true | what we observe: false |
|---|---|---|
| what we predict: positive | TP | FP |
| what we predict: negative | FN | TN |
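The four cells of the confusion matrix are simple counts. A minimal sketch, assuming actual answers are stored as `True`/`False` and predictions as the strings `"positive"`/`"negative"`; the data and the name `confusion_counts` are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN from paired actual answers and predictions."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == "positive":
            # Predicted positive: correct if the actual answer is true.
            counts["TP" if truth else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual answer is true.
            counts["FN" if truth else "TN"] += 1
    return counts


actual = [True, True, False, False, True, False]
predicted = ["positive", "negative", "positive",
             "negative", "positive", "negative"]

counts = confusion_counts(actual, predicted)
print(counts)
```

The two kinds of misclassification are the FP and FN cells; the correct predictions sit on the TP and TN diagonal.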

31

Binary classification evaluation metrics

| evaluation metric | definition | interpretation |
|---|---|---|
| accuracy | $\frac{TP+TN}{TP+FP+FN+TN}$ | percentage of correctly classified cases |
| ROC curve | a visual representation of the model's performance | refer to this article |
| AUC | area under the ROC curve | close to 0.5 is bad, close to 1 is good |
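Accuracy follows directly from the confusion-matrix counts. For AUC, the sketch below relies on a standard equivalence: the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count half). It assumes the model outputs a numeric score per row, such as a predicted probability; the data and function names are made up.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def auc(labels, scores):
    """AUC as the chance that a positive case outscores a negative case."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up actual answers and model scores for six rows.
labels = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(accuracy(tp=2, fp=1, fn=1, tn=2))
print(auc(labels, scores))
```

A model that ranks every positive above every negative gets an AUC of 1, while random scores land near 0.5, which matches the interpretation in the table.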

32

Supervised learning: recap

| term | what is needed | results in |
|---|---|---|
| training (a model) | appropriate ML algorithm + labeled data | trained model |
| scoring (data) | trained model + data (labeled or unlabeled) | scores (predictions) |
| evaluating a model | scoring labeled data | evaluation metrics |

More about scoring and model evaluation in the next chapter.

33

| what the machine learning community calls it | what the community of statisticians calls it |
|---|---|
| learning algorithm (or model) | model |
| trained model (or just model) | fitted model |
| supervised learning | prediction problem |
| unsupervised learning | data mining or pattern recognition |
| features or attributes | explanatory or independent variables |
| target or labels | response or dependent variables |
| training | fitting |
| scoring | predicting |

34

Unsupervised learning

We are trying to find structure (natural groupings) in the data. There is no target, only features.

The k-means algorithm attempts to find clusters in the observations (rows): two observations in the same cluster will have similar features.

Principal component analysis attempts to find clusters in the variables (columns): two variables within the same cluster will contain similar (redundant) information.

35

Unsupervised learning: Clustering demo

36

37

k-means clustering

k-means clustering is an algorithm that attempts to find groupings in the rows of the data.
It finds similar data points (observations) when we compare their features.

So k-means clustering reduces redundancy in the data across rows.
We choose k (the number of clusters) and k-means gives us a new column showing the cluster assignment for each row.
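The idea can be sketched with the standard k-means iteration (often called Lloyd's algorithm): assign each row to its nearest center, then move each center to the mean of its rows, and repeat. This is a simplified illustration on made-up two-feature data. It assumes fixed starting centers for reproducibility (real implementations usually pick them at random), and the name `k_means` is hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def k_means(points, centers, iterations=10):
    """Return cluster assignments and final centers; k is len(centers)."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest center.
        assignments = [
            min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its rows.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster alone
        centers = new_centers
    return assignments, centers


# Six made-up rows with two features each, forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)  # the "new column" of cluster assignments, one per row
print(centers)
```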

38

Unsupervised learning: PCA demo

39

40

Principal component analysis

Principal component analysis attempts to find groupings of the features.
It finds features that are similar (they relay similar information because they are highly correlated) and combines them into one feature called a principal component.

PCA reduces redundancy in the data across columns.
PCA is an example of dimensionality reduction: usually, the first few principal components capture most of the variation in the data. So we replace our original p features with the first m principal components, where m should be much smaller than p.
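For just two features (p = 2), the first principal component can be worked out in closed form from the 2x2 covariance matrix. This is a sketch under stated assumptions: the data is made up, the features are centered but not rescaled (many workflows standardize them first), and the name `first_principal_component` is hypothetical.

```python
import math


def first_principal_component(xs, ys):
    """Return (weights, explained share, component values) for two features."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

    # Largest eigenvalue of the covariance matrix [[var_x, cov], [cov, var_y]].
    top = (var_x + var_y) / 2 + math.sqrt(((var_x - var_y) / 2) ** 2 + cov ** 2)

    # Matching eigenvector: the weights of the linear combination.
    if cov != 0:
        vx, vy = cov, top - var_x
    else:
        vx, vy = (1.0, 0.0) if var_x >= var_y else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    weights = (vx / norm, vy / norm)

    # Share of the total variation captured by this one component.
    explained = top / (var_x + var_y)

    # The new feature: a weighted combination of the centered originals.
    component = [weights[0] * (x - mean_x) + weights[1] * (y - mean_y)
                 for x, y in zip(xs, ys)]
    return weights, explained, component


# Two made-up, highly correlated features (ys is roughly twice xs).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
weights, explained, component = first_principal_component(xs, ys)
print(weights, explained)
```

Because the two columns are nearly redundant, a single component captures almost all of the variation, so here m = 1 can stand in for p = 2.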

41

Some final notes

42

There is a lot more to ML than choosing an algorithm. For example, in supervised learning, when we evaluate a model, we must score and evaluate predictions on out-of-sample data (also called test data: data not used to train the model). Otherwise our evaluation is not fair, because it may be biased by small and insignificant "trends" in the data used to build the model (called training data) that do not generalize to the larger data. This is referred to as overfitting.
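The effect is easy to reproduce with a deliberately overfit "model" that simply memorizes its training rows. This is a toy sketch on made-up data; the memorizer (a one-nearest-neighbor lookup) and all names are illustrative assumptions, not a recommended algorithm.

```python
def train_memorizer(xs, ys):
    """An overfit "model": it just stores every training row."""
    return list(zip(xs, ys))


def score_memorizer(model, xs):
    """Predict the label of the closest memorized training feature value."""
    return [min(model, key=lambda row: abs(row[0] - x))[1] for x in xs]


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up data: the target is roughly twice the feature, plus small noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.3, 3.8, 6.4, 7.7, 10.2, 11.9, 14.3, 15.8]

# Hold out every other row as test data; train on the rest.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

model = train_memorizer(train_x, train_y)
train_error = mean_absolute_error(train_y, score_memorizer(model, train_x))
test_error = mean_absolute_error(test_y, score_memorizer(model, test_x))

print(train_error)  # 0.0: evaluated on its own training data, it looks perfect
print(test_error)   # clearly worse on rows the model has never seen
```

Evaluating on the training data alone would suggest a flawless model; only the held-out rows reveal how it behaves on new data.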

43

One big distinction between PCA and k-means clustering is that k-means gives us distinct clusters and every row of the data falls into one and only one cluster. PCA on the other hand gives us principal components, where each principal component is a linear combination (like a weighted average) of the original features in the data. Any one principal component may have higher weights associated with certain features, but the distinction is not clear-cut. This makes principal components abstract and hard to interpret.

44

Even though unsupervised learning deals with unlabeled data, we can still "train" a "model" on data and "score" data using a trained model (quotes are used to emphasize that the terms are used more loosely), but evaluating a model is very difficult and subjective because no true labels are present.

45

For example, we can train a k-means algorithm on unlabeled data, then score the data so that we have a new column indicating which cluster each row belongs to. But we must decide ahead of time how many clusters we should have, and it's very hard for us to know if our clusters are "correct".

So it is best to avoid the verbs training and scoring, and the term model, when referring to unsupervised learning algorithms, to lessen the confusion.

46

Although clustering and PCA are unsupervised learning algorithms, machine learning workflows often involve combining supervised and unsupervised learning algorithms. Examples:

run k-means to build clusters and use the cluster assignment as one of the features in a regression model

run PCA and use the top few principal components as features in a regression instead of the original features

47

the end

48

Thank you

Questions?

49

50

51