Page 1

SEG4630 E-Commerce Data Mining (2007-08)
Tutorial 4 – Clementine Tutorial (2)

Clementine Model Generation

1. Build C5.0 – uses the C5.0 algorithm to build a decision tree or a ruleset. It builds the model by splitting the sample on the field that provides the maximum information gain; splits that do not contribute significantly to the model are pruned.

In Clementine, C5.0 uses the Gain Ratio criterion for selecting the fields for splitting.

Notes: For training a C5.0 model, there should be at least one IN field and exactly one symbolic OUT field. BOTH or NONE fields will be ignored.
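As a minimal sketch of the gain ratio criterion (information gain divided by the split information), here is a toy Python implementation; the helper names and data are illustrative only and are not part of Clementine.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a list of symbolic values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(field, target):
    """Information gain of `field` with respect to `target`, divided by
    the split information (the entropy of the field's own values)."""
    n = len(target)
    groups = {}
    for value, label in zip(field, target):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(target) - remainder
    split_info = entropy(field)
    return gain / split_info if split_info > 0 else 0.0

# Toy data: how well does "outlook" predict the "play" OUT field?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(round(gain_ratio(outlook, play), 3))   # 0.421
```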


Page 2

Setting Options:

1. Output name: the name of the model.
2. Cross-validate: if selected, C5.0 uses a set of models built on subsets of the training data to estimate the accuracy of the model built on the full data set. Modify the Number of folds to control the number of models. (Note: model building and cross-validation are done at the same time. A rough sketch of the idea is given after this list.)
3. Output type: generate either a Decision Tree model or a Ruleset model.
4. Group symbolic: combine symbolic values that have similar patterns.
5. Use Boosting: improve accuracy with the boosting method.
6. Method:
   a. Simple – parameters are set automatically.
   b. Expert – more control.
7. Favor:
   a. Accuracy – try to produce the most accurate tree.
   b. Generality – use settings that are less susceptible to overfitting.
8. Expected noise: the expected proportion of noisy data in the training set.
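A rough sketch of the cross-validation idea, using scikit-learn's CART tree as a stand-in for C5.0 (which has no implementation in the standard Python libraries); cv=10 mirrors the Number of folds option.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Ten subset models are built; their mean accuracy estimates how well a
# tree built on the full data set would perform.
scores = cross_val_score(tree, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```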

Expert Options:

1. Pruning severity (a rough analogue is sketched after this list):
   a. Increase the value to obtain a smaller, more concise tree.
   b. Decrease the value to obtain a more accurate tree.
2. Minimum records per child branch – limits the number of splits in any branch of the tree. Increase the value to help prevent overtraining with noisy data.
3. Use global pruning – prunes by considering the tree as a whole, so that weak subtrees may be collapsed. Global pruning is performed by default as the second of two pruning stages.
4. Winnow attributes – the usefulness of the predictors is examined before model building starts.
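For illustration only, the first two expert options map loosely onto scikit-learn's CART parameters: pruning severity onto cost-complexity pruning (ccp_alpha, where higher values give smaller trees) and minimum records per child branch onto min_samples_leaf. These are CART parameters, not C5.0's own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Raising ccp_alpha plays the role of raising pruning severity: the tree
# gets smaller and more concise at some cost in (training) accuracy.
for alpha in (0.0, 0.01, 0.03):
    tree = DecisionTreeClassifier(ccp_alpha=alpha,
                                  min_samples_leaf=5,   # min records per child
                                  random_state=0)
    tree.fit(X, y)
    print(f"alpha={alpha}: {tree.get_n_leaves()} leaves, "
          f"train accuracy={tree.score(X, y):.3f}")
```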


Page 3

Misclassification Costs
Options for specifying the relative importance of prediction errors. Misclassification costs are essentially weights applied to specific outcomes; a small worked example follows.
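A minimal sketch of how a cost matrix changes predictions: choose the class with the lowest expected cost rather than the highest probability. The cost values and probabilities below are made up.

```python
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i
cost = np.array([[0.0, 1.0],    # true "good": calling it "bad" costs 1
                 [5.0, 0.0]])   # true "bad":  calling it "good" costs 5

proba = np.array([0.8, 0.2])    # model output: P(good)=0.8, P(bad)=0.2

expected = proba @ cost         # expected cost of each possible prediction
print(expected)                 # [1.  0.8] -> "bad" is the cheaper call
print("prediction:", expected.argmin())   # 1, despite "good" being likelier
```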

To view the models generated, right-click on the model node and choose “Browse”.


Page 4

Example of a decision tree model generated.


Page 5

Example of a ruleset model generated.

2. C&R Tree – a tree-based classification and prediction modelIt also uses recursive partitioning to split training records. C&R Tree finds the best split by measuring the reduction in an impurity index.

Notes: For training a C&R Tree model, there should be at least one IN field and exactly one numeric or symbolic OUT field. BOTH or NONE fields will be ignored.
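A minimal sketch of impurity reduction using the Gini index (one of the impurity measures described in the expert options below); the toy data is illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two random records differ in class."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Drop in Gini impurity achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy example: a perfect binary split removes all of the impurity.
parent = ["yes", "yes", "no", "no"]
print(impurity_reduction(parent, ["yes", "yes"], ["no", "no"]))   # 0.5
```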


Page 6

Setting Options:

1. Model name: the name of the model.
2. Use partitioned data – splits the data into separate subsets or samples for training, testing, and validation. If no partition field is specified in the stream, this option is ignored.
3. Build method – specifies the method used to build the model.
   a. Direct: generates a model automatically.
   b. Interactive: launches the Tree Builder, which allows you to build your tree one level at a time, edit splits, and prune as desired before saving the generated model.
      i. Use tree directives: select this option to specify directives to apply when generating an interactive tree from the node.
4. Maximum tree depth: the maximum number of times the sample will be split recursively. (A rough analogue of options 2 and 4 is sketched after this list.)
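A rough scikit-learn analogue, not Clementine's API: train_test_split stands in for a partition field, and max_depth for the Maximum tree depth option; the values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # held-out "testing" partition

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth limit
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
```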

Expert Options:

1. Mode:
   a. Simple – parameters are set automatically.
   b. Expert – more control.
2. Maximum number of surrogates – used for handling missing values. Increasing this value allows more flexibility in dealing with missing values.
3. Minimum change in impurity – if the impurity reduction achieved by splitting a branch is less than the specified amount, the split will not be made.
4. Impurity measure for categorical targets (the twoing criterion is sketched after this list):
   a. Gini – based on the probabilities of the categories, measuring the probability of misclassification.
   b. Twoing – emphasizes finding the best binary split.
   c. Ordered twoing – adds the constraint that only contiguous target classes can be grouped together; it is applicable only with ordinal targets.
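A minimal sketch of the twoing criterion under its textbook CART definition, twoing = (pL · pR / 4) · (Σj |p(j|left) − p(j|right)|)²; larger values indicate better binary splits. The toy data is illustrative.

```python
from collections import Counter

def twoing(left, right):
    """Twoing criterion for a binary split; larger values = better splits.

    twoing = (pL * pR / 4) * (sum over classes j of
                              |p(j | left) - p(j | right)|) ** 2
    """
    n = len(left) + len(right)
    p_left, p_right = len(left) / n, len(right) / n
    count_l, count_r = Counter(left), Counter(right)
    diff = sum(abs(count_l[j] / len(left) - count_r[j] / len(right))
               for j in set(left) | set(right))
    return (p_left * p_right / 4.0) * diff ** 2

# A split that separates the classes perfectly scores the maximum, 0.25.
print(twoing(["a", "a"], ["b", "b"]))   # 0.25
```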


Page 7

5. Stopping – controls how the tree is constructed by determining when to stop splitting specific branches.
   a. Use percentage: specify sizes as a percentage of the overall training data.
   b. Use absolute value: specify sizes as absolute numbers of records.
   c. The minimum branch sizes prevent splits that would create very small subgroups (a rough analogue is sketched after this list):
      Minimum records in parent branch prevents a split if the number of records in the node to be split (the parent) is less than the specified value.
      Minimum records in child branch prevents a split if the number of records in any branch created by the split (the child) would be less than the specified value.
6. Prune tree – removes bottom-level splits that do not contribute significantly to the accuracy of the tree after it is fully grown.
   a. Use standard error rule – allows you to specify a more liberal pruning rule. It selects the simplest tree whose risk estimate is close to (but possibly greater than) that of the subtree with the smallest risk.
   b. The multiplier – indicates the allowable difference between the risk estimate of the pruned tree and that of the smallest-risk tree, in units of the standard error. For example, if you specify 2, a tree whose risk estimate is up to (2 × standard error) larger than that of the smallest-risk tree may be selected. (A small worked example follows this list.)

7. Priors – specify prior probabilities for symbolic output fields.
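As a rough illustration of the percentage/absolute distinction in item 5, scikit-learn's CART implementation follows the same convention: a float for min_samples_split or min_samples_leaf is read as a fraction of the training data, while an int is an absolute record count. These are CART parameters, not Clementine's, and the values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pct = DecisionTreeClassifier(min_samples_split=0.05,   # parent >= 5% of data
                             min_samples_leaf=0.02,    # child  >= 2% of data
                             random_state=0)
absolute = DecisionTreeClassifier(min_samples_split=8,  # parent >= 8 records
                                  min_samples_leaf=3,   # child  >= 3 records
                                  random_state=0)
for name, t in (("percentage", pct), ("absolute", absolute)):
    t.fit(X, y)
    print(name, t.get_n_leaves(), "leaves")
```

And a minimal worked example of the standard error rule in item 6, with made-up risk estimates: among the candidate pruned trees, choose the simplest one whose risk estimate stays within (multiplier × standard error) of the minimum risk.

```python
candidates = [
    # (number of leaves, risk estimate, standard error of the risk)
    (20, 0.150, 0.012),
    (12, 0.152, 0.012),
    (7,  0.158, 0.013),
    (3,  0.210, 0.015),
]

multiplier = 1.0
best_risk, best_se = min((r, se) for _, r, se in candidates)
threshold = best_risk + multiplier * best_se   # 0.150 + 0.012 = 0.162

# The simplest tree (fewest leaves) whose risk stays under the threshold.
chosen = min((c for c in candidates if c[1] <= threshold), key=lambda c: c[0])
print("chosen tree:", chosen)   # (7, 0.158, 0.013)
```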


Page 8

Costs – options for controlling misclassification costs.


Page 9

Appendix A: Some Other Field Operations

Filler node – replaces values in a field, e.g. to perform a transformation on a field (a pandas analogue is sketched below).
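A pandas analogue of what a Filler node does (not Clementine's API); the field name and fill value are illustrative.

```python
import pandas as pd

# Toy field with blank entries standing in for missing values.
df = pd.DataFrame({"BP": ["HIGH", "", "NORMAL", ""]})

# Fill (replace) the blank entries in place, as a Filler node would.
df.loc[df["BP"] == "", "BP"] = "UNKNOWN"
print(df)
```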


(Screenshot: the Filter, Derive, and Filler nodes.)

Page 10

Derive node – derives a new field (a pandas analogue is sketched below).
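A pandas analogue of a Derive node, computing the Na_Over_K ratio used on the next page; the numeric values are made up.

```python
import pandas as pd

# Toy records with the Na and K fields used in this tutorial.
df = pd.DataFrame({"Na": [0.72, 0.54], "K": [0.06, 0.04]})

# Derive a new field from existing ones, as a Derive node would.
df["Na_Over_K"] = df["Na"] / df["K"]
print(df)
```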


Page 11

Filter node – removes fields. For example, having derived the new field Na_Over_K in the previous step, we may then want to remove the fields Na and K (a pandas analogue is sketched below):
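A pandas analogue of the same Filter step; the values are made up.

```python
import pandas as pd

# Toy records after the Na_Over_K field has been derived.
df = pd.DataFrame({"Na": [0.72, 0.54], "K": [0.06, 0.04]})
df["Na_Over_K"] = df["Na"] / df["K"]

# Remove the now-redundant source fields, as a Filter node would.
df = df.drop(columns=["Na", "K"])
print(df.columns.tolist())   # ['Na_Over_K']
```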


