ECLT5810/SEEM5750 E-Commerce Data Mining Technique

Tutorial 2: Decision tree; Regression; Assignment 1

Wenxuan [email protected]

What is a Decision Tree?

A decision tree is a decision-making tool that uses a tree-like graph or model of decisions and their possible consequences, such as event outcomes, resource costs, and utility.

All the conditional control statements used in a decision tree can be displayed, which makes the logic behind it easy to understand.

A decision tree is a flowchart-like structure that contains three components:

◦ each internal node represents a “test” on an attribute (e.g., whether a coin flip comes up heads or tails)

◦ each branch represents the outcome of the test

◦ each leaf node represents a class label (decision taken after computing all attributes).

The paths from the root to a leaf represent classification rules.
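The three components can be sketched in a few lines of Python. This is only an illustration: the attribute names (income, age), the thresholds and the labels are hypothetical and are not taken from the tutorial's dataset.

```python
# A hypothetical two-level tree. Attribute names, thresholds and labels are
# made up for illustration only.
tree = {
    "test": lambda row: row["income"] < 25_000,  # internal node: a test on an attribute
    "yes": {"label": 0},                         # leaf node: a class label
    "no": {                                      # branch: the outcome of the test
        "test": lambda row: row["age"] < 40,
        "yes": {"label": 1},
        "no": {"label": 0},
    },
}


def classify(node, row):
    """Follow the branches from the root until a leaf is reached."""
    while "label" not in node:
        node = node["yes"] if node["test"](row) else node["no"]
    return node["label"]


print(classify(tree, {"income": 30_000, "age": 35}))
```

Each root-to-leaf path corresponds to one rule, for example "income >= 25000 and age < 40 gives label 1" in this made-up tree.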

This is an example of a decision tree for the target variable response. This variable has two labels: 1 for response and 0 for no response.

At each internal node, the attribute used to split the dataset is chosen based on information gain. In this example, Node 1 uses Income as the splitting attribute: instances with Income < $25k go to Node 2, and instances with Income >= $25k go to Node 3.

There are 4 leaf nodes (Nodes 4-7), which determine the predicted label.
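To make the idea of information gain concrete, here is a minimal sketch that computes it for one candidate split. The label lists are hypothetical, and the function names are chosen for this example only. Note that C4.5 refines plain information gain into a gain ratio, which is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of the parent minus the weighted entropy of the child groups."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


# Hypothetical response labels before the split, and after splitting on Income.
labels = [1, 1, 0, 0, 0, 0, 1, 0]
low_income = [0, 0, 0, 0]    # e.g. Income < $25k
high_income = [1, 1, 1, 0]   # e.g. Income >= $25k

print(round(information_gain(labels, [low_income, high_income]), 4))
```

The attribute whose split yields the largest gain would be chosen for the node.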

Decision Tree in Weka

Weka provides several classification algorithms, including decision trees, so that users can easily construct a predictive model from their training data.

Weka provides many tree-based algorithms. In this tutorial, we will use the J48 algorithm, which is an implementation of the C4.5 algorithm.

Preparation for building Decision Tree

Before constructing our decision tree, we first need to prepare our training data.

Open Weka and choose Explorer in the Weka GUI Chooser.

Click Open file, then open the bank.csv file used in the last tutorial.

Again, please remember to change the file type to CSV data files (*.csv).

Now the data is loaded into Explorer.

We could perform feature engineering before building the decision tree, but this time we simply use the original dataset.

Building Decision Tree

Click Classify.

Click Choose.

Under classifiers->trees, select J48.

Click on the text next to Choose to open the configuration.

Here is the configuration of J48.

Change minNumObj from 2 to 30 so that each leaf must cover at least 30 instances, which reduces the size of the tree.

Then click OK.

In the Test options:

Use training set means the whole training set is also used for testing.

Supplied test set means we provide an external test set for testing.

Percentage split means part of the training set is held out as the test set.
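The percentage split option can be sketched as follows. This is an illustration of the idea only, not Weka's implementation: the function name, the seed and the rounding rule are assumptions, and Weka's own shuffling and rounding may differ.

```python
import random


def percentage_split(instances, train_percent=66, seed=1):
    """Shuffle the instances, then keep train_percent % for training
    and hold out the rest for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# 100 hypothetical instances, identified by their index.
train, test = percentage_split(range(100))
print(len(train), len(test))
```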

Cross-Validation

Cross-validation splits the training set into n folds. In each step, n-1 folds are used to train the model and the remaining fold is used for testing. The step is repeated n times, so that each fold is used for testing exactly once.

Here is a simple illustration. Say we have 5 folds.

Step 1: 1 2 3 4 5
Fold 5 as testing, Folds 1-4 as training

Step 2: 1 2 3 4 5
Fold 4 as testing, Folds 1-3 and 5 as training

And finally, Step 5: 1 2 3 4 5
Fold 1 as testing, Folds 2-5 as training

Cross-validation is widely used to evaluate the performance of a predictive model, because it gives a better understanding of how the model behaves on unseen data and helps us investigate overfitting.
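The fold rotation can be sketched as a small generator. This is a simplified illustration under stated assumptions: the function name is hypothetical, instances are assigned to folds by index without shuffling, and folds are held out in ascending order (the order does not affect the result). Weka's own procedure may differ in these details.

```python
def k_fold_indices(n_instances, n_folds):
    """Yield (train, test) index lists, holding out one fold per step."""
    indices = list(range(n_instances))
    # Assign instance i to fold i % n_folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# 10 hypothetical instances split into 5 folds.
for step, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Step {step}: test={test} train={train}")
```

Every instance appears in the test fold exactly once, so the averaged result uses all of the data for testing without ever testing on an instance that the model was trained on in that step.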

Building Decision Tree

This time, we simply use a percentage split of 66% as our test option (66% of the data for training, the rest for testing).

Click Start to start our decision tree construction.

Visualizing the Decision Tree

We can visualize the trained decision tree.

In the result list, right click the model that was just trained.

Click Visualize tree.

The trained decision tree will be shown in a new window.

Viewing the Classifier output

The result is shown on the right panel.

The accuracy of our model is 89.525%.

Accuracy alone sometimes does not reflect every aspect of model performance. Weka therefore provides several statistics to help us investigate the model further, including the Confusion Matrix and, for each class, the TP rate, FP rate, Precision, Recall and F-measure.

Here is the Confusion Matrix.

It can be read as follows:

                Predicted Label
                a                     b
Actual Label a  True Positive (TP)    False Negative (FN)
Actual Label b  False Positive (FP)   True Negative (TN)

True Positive is an outcome where the model correctly predicts the positive class.

True Negative is an outcome where the model correctly predicts the negative class.

False Positive is an outcome where the model incorrectly predicts the positive class.

False Negative is an outcome where the model incorrectly predicts the negative class.

Here is the Detailed Accuracy By Class.

TP rate is calculated by TP / (TP + FN)

FP rate is calculated by FP / (FP + TN)

Precision is calculated by TP / (TP + FP)

Recall is calculated by TP / (TP + FN)

F-measure is calculated by 2 * Precision * Recall / (Precision + Recall)
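The formulas above can be checked with a short sketch. The counts below are hypothetical and are not the tutorial's results. The function name is chosen for this example, and the accuracy line uses the standard definition (correct predictions over all predictions). The sketch does not guard against division by zero.

```python
def class_metrics(tp, fn, fp, tn):
    """Per-class statistics computed from the four confusion-matrix cells."""
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same quantity as the TP rate
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical confusion matrix: 80 TP, 20 FN, 10 FP, 90 TN.
metrics = class_metrics(tp=80, fn=20, fp=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

To get the statistics for the other class, swap the roles of the two classes: its TP is the TN above, and its FP is the FN above.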

Linear Regression

Logistic Regression

Logistic regression is similar to linear regression in that both aim to find a straight line. However, linear regression uses that line to fit the data, while logistic regression uses the line to separate the data.

[Figure: Linear Regression vs. Logistic Regression]

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.

What is Logistic Regression?

Ideally, we can use a unit-step function to determine the class label after obtaining the straight line f(x).

However, if the value of f(x) is very close to 0, we might misclassify the data using the unit-step function. To have more flexibility, logistic regression uses the sigmoid function instead of the unit-step function.

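The sigmoid function is an "S"-shaped curve with an upper bound of 1 and a lower bound of 0. It is defined by

sigmoid(z) = 1 / (1 + e^(-z))

The formula of logistic regression is therefore the sigmoid applied to the linear function f(x):

P(y = 1 | x) = sigmoid(f(x)) = 1 / (1 + e^(-f(x)))

The coefficients of f(x) (the feature weights and the bias term) are the target parameters to be estimated.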
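The difference between the unit-step function and the sigmoid function can be seen in a small sketch. The weights, bias and feature values below are hypothetical, and the function names are chosen for this example. The naive sigmoid shown here can overflow for very large negative inputs.

```python
import math


def unit_step(z):
    """Hard decision: class 1 if z >= 0, otherwise class 0."""
    return 1 if z >= 0 else 0


def sigmoid(z):
    """Smooth "S"-shaped curve with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Weighted sum of the input features plus a bias term, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical parameters and inputs.
weights, bias = [0.8, -0.5], 0.1
for features in ([0.0, 0.2], [3.0, 1.0]):
    p = predict_proba(weights, bias, features)
    print(features, round(p, 4), "-> class", 1 if p >= 0.5 else 0)
```

Near z = 0 the unit-step function jumps from 0 to 1, while the sigmoid returns a value close to 0.5, which signals that the prediction is uncertain.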

Logistic Regression in Weka

We can easily perform a logistic regression in Weka.

Weka will do the calculation for us. We only need to prepare our dataset.

In the Classify tab, click Choose.

Under classifiers->functions, select Logistic.

Use a percentage split of 66% as our test option.

Click Start to start our logistic regression.

The result is shown on the right panel.

The accuracy of our model is 89.7202%. It is slightly better than our previous decision tree model.

Save Model and Make Predictions on New Data

After we have found a well-performing machine learning model, we can finalize our model and save it.

If we have some new data later, we can load our previously trained model and make predictions on the new data.

Save Machine Learning Model

Suppose we want to save the logistic regression model trained in the last section.

In the result list, right click the model.

Click Save model.

Select a location, enter a filename such as logistic, and click Save.

Our model is now saved to the file "logistic.model".

Load Our Machine Learning Model

Suppose we want to use our trained model to make predictions.

Right click on the Result list and click Load model, then select the model saved in the previous step, "logistic.model".

Now the model is loaded, and we can see some information about it on the right panel.

Make Predictions on New Data

Suppose we have some new data and we want to use our trained model to make predictions on it.

We will use the file bank-new.csv as our new data. It contains the first 100 instances of bank.csv, but the class label (i.e., the attribute y) is changed from yes/no to “?”.

Go to the Classify tab.

Select the Supplied test set option in the Test options pane.

Click Set, click Open file in the options window, and select the new dataset named "bank-new.csv".

For the Class, select y.

Then click Close.

Click “More options…” to bring up options for evaluating the classifier.

Uncheck the following options:

◦ Output model

◦ Output per-class stats

◦ Output confusion matrix

◦ Store predictions for visualization

◦ Collect predictions for evaluation based on AUROC, etc.

For Output predictions, choose PlainText.

Click OK.

Right click on the list item for our loaded model in the Result list.

Choose Re-evaluate model on current test set.

The predictions for each test instance are then listed in the Classifier output.

Specifically, the middle column of the results is the predicted label, which is "yes" or "no".
