ECLT5810/SEEM5750 E-Commerce Data Mining Technique

Date post: 25-Oct-2021
Tutorial 2: Decision tree; Regression; Assignment 1

Wenxuan ZHANG [email protected]

What is Decision Tree?

A decision tree is a decision-making tool that uses a tree-like graph or model of decisions and their possible consequences, such as event outcomes, resource costs, and utility.

All the conditional control statements used in the decision tree can be displayed, which makes the logic behind it easy to understand.

A decision tree is a flowchart-like structure with three components:

◦ each internal node represents a “test” on an attribute (e.g., whether a coin flip comes up heads or tails)

◦ each branch represents the outcome of the test

◦ each leaf node represents a class label (decision taken after computing all attributes).

Each path from the root to a leaf represents a classification rule.
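The idea that root-to-leaf paths are classification rules can be sketched in a few lines of Python. This is only an illustration: the nested-dictionary layout, the attribute names (income, age), the thresholds and the labels are all made up, and this is not how Weka stores its trees.

```python
# A tiny decision tree as nested dicts. Internal nodes hold a "test" and its
# "branches"; leaves hold a class "label". The attribute names, thresholds and
# labels below are hypothetical, chosen for illustration only.
tree = {
    "test": "income",
    "branches": {
        "< 25k": {
            "test": "age",
            "branches": {
                "< 30": {"label": 0},
                ">= 30": {"label": 1},
            },
        },
        ">= 25k": {"label": 1},
    },
}


def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:  # leaf: the path so far is a complete rule
        return [(list(conditions), node["label"])]
    rules = []
    for outcome, child in node["branches"].items():
        step = f"{node['test']} {outcome}"
        rules.extend(extract_rules(child, conditions + (step,)))
    return rules


rules = extract_rules(tree)
for conds, label in rules:
    print("IF " + " AND ".join(conds) + f" THEN class = {label}")
```

Each printed line corresponds to one leaf, so a tree with three leaves yields three rules.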

This is an example of a decision tree for the target variable response. This variable has two labels: 1 for response and 0 for no response.

At each internal node, the attribute used to split the dataset is chosen based on information gain. In this example, Node 1 uses Income as the splitting attribute: instances with income < $25k go to Node 2, and instances with income >= $25k go to Node 3.

The four leaf nodes (Nodes 4-7) determine the predicted label.
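To make "based on the information gain" concrete, here is a minimal sketch that computes the information gain of a candidate splitting attribute as the entropy of the target before the split minus the weighted entropy after it. The toy rows and the attribute names (income, gender) are assumptions for illustration and are not taken from the tree in the slide.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    before = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Hypothetical toy data: income band, gender, and the 0/1 response label.
rows = [
    {"income": "<25k", "gender": "F", "response": 0},
    {"income": "<25k", "gender": "M", "response": 0},
    {"income": ">=25k", "gender": "F", "response": 1},
    {"income": ">=25k", "gender": "M", "response": 1},
]

gain_income = information_gain(rows, "income", "response")
gain_gender = information_gain(rows, "gender", "response")
print(gain_income, gain_gender)
```

In this toy data, income separates the two labels perfectly while gender tells us nothing, so income has the higher gain and would be chosen for the split.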

Decision Tree in Weka

Weka provides several classification algorithms, including decision trees, so that users can easily build a predictive model from their training data.

Weka provides many tree-based algorithms. In this tutorial, we use J48, which is an implementation of the C4.5 algorithm.

Preparation for building Decision Tree

Before constructing our decision tree, we first need to prepare our training data.

Open Weka and choose Explorer in the Weka GUI Chooser.

Click Open file, then open the bank.csv file used in the last tutorial.

Again, remember to change the file type to CSV data files (*.csv).

Now the data is loaded into Explorer.

We could perform feature engineering before building the decision tree, but this time we simply use the original dataset.

Building Decision Tree

Click Classify.

Click Choose.

Under classifiers->trees, select J48.

Click on the text next to Choose to open the configuration.

Here is the configuration of J48.

Change minNumObj from 2 to 30 so that each leaf must cover at least 30 instances, which reduces the size of the tree.

Then click OK.

The Test options are as follows:

Use training set evaluates the model on the same data that was used to train it.

Supplied test set evaluates the model on a separate test set that we provide.

Percentage split holds out part of the data as the test set and trains on the rest.

Cross-Validation

Cross-validation splits the training set into n folds. In each step, n-1 folds are used to train the model and the remaining fold is used for testing. The step is repeated n times, so that each fold is used for testing exactly once.

Here is a simple illustration. Say we have 5 folds.

Step 1: Fold 5 as testing, Folds 1-4 as training.

Step 2: Fold 4 as testing, Folds 1-3 and 5 as training.

And finally, Step 5: Fold 1 as testing, Folds 2-5 as training.
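The fold rotation in this illustration can be sketched as follows. The function name k_fold_splits and the placeholder data are assumptions, and this is not Weka's implementation: for instance, the sketch does no shuffling or stratification.

```python
def k_fold_splits(instances, n_folds):
    """Yield (train, test) lists; each fold is the test set exactly once."""
    # Deal instances round-robin into n_folds folds (no shuffling here).
    folds = [instances[i::n_folds] for i in range(n_folds)]
    # Follow the illustration: step 1 tests on the last fold, and so on.
    for test_index in reversed(range(n_folds)):
        test = folds[test_index]
        train = [x for i, fold in enumerate(folds) if i != test_index
                 for x in fold]
        yield train, test


data = list(range(10))  # ten placeholder instances
splits = list(k_fold_splits(data, 5))
for step, (train, test) in enumerate(splits, start=1):
    print(f"Step {step}: test={test} train={train}")
```

Over the five steps every instance appears in a test set once and in a training set four times.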

Cross-validation is widely used for evaluating predictive models because it gives a more reliable picture of model performance and helps us check for overfitting.

Building Decision Tree

This time, we simply use Percentage split with 66% as our test option.
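With a 66% split, 66% of the instances are used for training and the remaining 34% for testing. A minimal sketch of the idea is shown below; the function name percentage_split, the seed and the placeholder data are assumptions, and the exact shuffling and rounding Weka applies may differ.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the data, then cut it into train and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))  # 100 placeholder instances
train, test = percentage_split(data, 66)
print(len(train), len(test))
```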

Click Start to start building the decision tree.

Visualizing the Decision Tree

We can visualize the trained decision tree.

In the Result list, right-click the model that was just trained.

Click Visualize tree.

The trained decision tree will be shown in a new window.

Viewing the Classifier output

The result is shown in the right panel.

The accuracy of our model is 89.525%.

Accuracy alone does not always reflect the full performance of a model. Weka therefore provides several other statistics for investigating model performance, including the Confusion Matrix and the TP rate, FP rate, Precision, Recall and F-measure for each class.

Here is the Confusion Matrix.

It can be read as follows:

                   Predicted Label a       Predicted Label b
Actual Label a     True Positive (TP)      False Negative (FN)
Actual Label b     False Positive (FP)     True Negative (TN)

True Positive is an outcome where the model correctly predicts the positive class.

True Negative is an outcome where the model correctly predicts the negative class.

False Positive is an outcome where the model incorrectly predicts the positive class.

False Negative is an outcome where the model incorrectly predicts the negative class.

Here is the Detailed Accuracy By Class.

TP rate is calculated by TP / (TP + FN)

FP rate is calculated by FP / (FP + TN)

Precision is calculated by TP / (TP + FP)

Recall is calculated by TP / (TP + FN)

F-measure is calculated by 2 * Precision * Recall / (Precision + Recall)
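These formulas can be checked with a short sketch that takes the four cells of the confusion matrix as input. The function name class_metrics and the counts are made up for illustration; they are not the results of the model trained in this tutorial, and the sketch does not guard against division by zero.

```python
def class_metrics(tp, fn, fp, tn):
    """Compute per-class statistics from the four confusion-matrix cells."""
    tp_rate = tp / (tp + fn)  # same quantity as recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Made-up counts, not the results of the tutorial's model.
metrics = class_metrics(tp=80, fn=20, fp=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that TP rate and Recall are the same quantity, which is why they share a formula.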

Linear Regression

Logistic Regression

Logistic regression is similar to linear regression in that both aim to find a straight line. However, linear regression uses the line to fit the data, while logistic regression uses the line to separate the data.

[Figure panels: Linear Regression; Logistic Regression]

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.

What is Logistic Regression?

Ideally, we could use a unit-step function to determine the class label after obtaining the straight line f(x).

However, if the value of f(x) is very close to 0, we might misclassify the data using the unit-step function. To have more flexibility, logistic regression uses the sigmoid function instead of the unit-step function.

The sigmoid function is an S-shaped curve whose values lie between 0 and 1: it approaches 1 for large positive inputs and 0 for large negative inputs. It is defined by

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The formula of logistic regression is therefore

$$P(y = 1 \mid x) = \sigma(f(x)) = \frac{1}{1 + e^{-(w^{T}x + b)}}$$

where the weights $w$ (together with the bias term $b$) are the target parameters to be estimated.
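The difference between the unit-step function and the sigmoid function can be seen in a small sketch. Here f(x) is computed as a weighted sum of the input features plus a bias term; the weights, bias and inputs are hypothetical values chosen only for illustration, not parameters learned by Weka.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(weights, bias, features):
    """f(x): weighted sum of the input features plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def unit_step(z):
    """Hard decision: class 1 if the score is non-negative, else class 0."""
    return 1 if z >= 0 else 0


# Hypothetical parameters and inputs, chosen only for illustration.
weights, bias = [0.8, -0.4], 0.1
for features in ([2.0, 1.0], [0.0, 0.3]):
    z = linear_score(weights, bias, features)
    print(f"f(x)={z:+.2f} step={unit_step(z)} sigmoid={sigmoid(z):.3f}")
```

For the second input, f(x) is just below 0: the unit-step function commits to class 0, while the sigmoid output stays close to 0.5 and so signals that the decision is uncertain.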

Logistic Regression in Weka

We can easily perform logistic regression in Weka.

Weka does the calculation for us; we only need to prepare our dataset.

In the Classify tab, click Choose.

Under classifiers->functions, select Logistic.

Use Percentage split with 66% as our test option.

Click Start to run the logistic regression.

The result is shown in the right panel.

The accuracy of our model is 89.7202%, which is slightly better than our previous decision tree model (89.525%).

Save Model and Make Predictions on New Data

After we have found a well-performing machine learning model, we can finalize the model and save it.

If we get new data later, we can load the previously trained model and make predictions on the new data.

Save Machine Learning Model

Suppose we want to save the logistic regression model trained in the last section.

In the Result list, right-click the model.

Click Save model.

Select a location and enter a filename such as logistic, then click Save.

Our model is now saved to the file "logistic.model".

Load Our Machine Learning Model

Suppose we want to use our trained model to make predictions.

Right-click on the Result list and click Load model, then select the model saved in the previous step, "logistic.model".

Now the model is loaded, and we can see some information about it in the right panel.

Make Predictions on New Data

Suppose we have some new data and want to use our trained model to make predictions on it.

We will use the file bank-new.csv as our new data. It contains the first 100 instances of bank.csv, but the class label (i.e., the attribute y) is changed from yes/no to "?".
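A file like bank-new.csv could be prepared with a short script such as the sketch below. The function name mask_class_label is hypothetical, the comma delimiter is an assumption about bank.csv, and the sample columns other than y are made up; the example runs on in-memory text so that it does not depend on the real file.

```python
import csv
import io


def mask_class_label(src, dst, class_attr="y", limit=100, delimiter=","):
    """Copy the first `limit` rows from src to dst with the class set to '?'."""
    reader = csv.DictReader(src, delimiter=delimiter)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    for count, row in enumerate(reader):
        if count >= limit:
            break
        row[class_attr] = "?"  # hide the true label from the model
        writer.writerow(row)


# In-memory stand-in for bank.csv; the columns other than y are made up.
sample = io.StringIO("age,balance,y\n30,1787,no\n33,4789,yes\n35,1350,no\n")
out = io.StringIO()
mask_class_label(sample, out, limit=2)
print(out.getvalue())
```

For real files, the same function can be called with two file objects opened with newline="", one for reading bank.csv and one for writing bank-new.csv.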

Go to the Classify tab.

Select the Supplied test set option in the Test options pane.

Click Set, then click Open file in the options window and select the new dataset, "bank-new.csv".

For the Class, select y.

Then click Close.

Click “More options…” to bring up the options for evaluating the classifier.

Uncheck the following options:

◦ Output model

◦ Output per-class stats

◦ Output confusion matrix

◦ Store predictions for visualization

◦ Collect predictions for evaluation based on AUROC, etc.

For Output predictions, choose PlainText.

Click OK.

Right-click the list item for our loaded model in the Result list.

Choose Re-evaluate model on current test set.

The predictions for each test instance are then listed in the Classifier output.

Specifically, the middle column of the results is the predicted label, which is "yes" or "no".

