
NOTES ON DATA MINING

I Dictionary

The thing to be learned is called the concept, and the output produced by a learning scheme is the concept description.

The input to a machine learning scheme is a set of instances. In the standard scenario, each instance is an individual, independent example of the concept to be learned; there are no relationships between objects. If there are relationships, you have to denormalize the data. A finite set of finite relations can always be recast into a single table (denormalized), although often at enormous cost in space. Moreover, denormalization can generate spurious regularities in the data (example: the "supplier" attribute predicts the "supplier address").

Instances are characterized by the values of a set of predetermined attributes (features), either numeric (continuous) or nominal (categorical, enumerated, discrete). You can also split non-numeric attributes further into nominal (no order), ordinal (hot, mild, cold), and interval.

In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples (for instance, the problem of learning how to classify new days as "play" or "don't play"). The outcome is called the class of the example.

In association learning, any association among features is sought, not just ones that predict a particular class value. There are far more association rules than classification rules, so constraints such as minimum coverage (say, 80% of the dataset contains those attributes) and minimum accuracy (say, 95% accurate predictions) are necessary.

In clustering (cluster analysis), groups of examples that belong together are sought (when there is no specified class). Unsupervised.

In numeric prediction, the outcome is a numeric value rather than a class. Supervised.

II ARFF file

@relation weather
@attribute outlook { sunny, overcast, rainy }  - attribute description (nominal, takes 3 values)
@attribute temperature numeric                 - attribute description (numeric)
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }                   - the last attribute serves as the class

@data                          - beginning of the instances
%
% 2 instances                  - a comment
%
sunny, 85, 85, false, no       - an instance with its attribute values
sunny, 80, ?, true, no         - a question mark denotes missing data

When we deal with a relational (multi-instance) attribute:


@relation weather
@attribute bag_ID { 1, 2, 3, 4, 5, 6, 7 }
@attribute bag relational              - we add "relational"

@attribute outlook { sunny, overcast, rainy }

@attribute temperature numeric

@attribute humidity numeric

@attribute windy { true, false }

@end bag                               - we end the attribute description
@attribute play? { yes, no }

@data
%
% seven "multiple instance" instances
%
1, "sunny, 85, 85, false\nsunny, 80, 90, true", no
2, "overcast, 83, 86, false\nrainy, 70, 96, false", yes
3, "rainy, 68, 80, false\nrainy, 65, 70, true", yes
4, "overcast, 64, 65, true\nsunny, 72, 95, false", yes
5, "sunny, 69, 70, false\nrainy, 75, 80, false", yes
6, "sunny, 75, 70, true\novercast, 72, 90, true", yes
7, "overcast, 81, 75, false\nrainy, 71, 91, true", yes

When most values are 0

0, X, 0, 0, 0, 0, Y, 0, 0, 0, "class A"
0, 0, 0, W, 0, 0, 0, 0, 0, 0, "class B"

Those instances can be rewritten as:

{1 X, 6 Y, 10 "class A"} – where the numbers indicate the attribute number (counting from 0)

{3 W, 10 "class B"} – omitted values are 0, not "?" (so they are not regarded as missing)

III Problems with input data

1. Missing values (can be ignored, but maybe the fact that a value is missing tells us something)
2. Inaccurate values:

- Typos – need to be identified and corrected
- Measurement errors – identify outliers and get rid of them
- Deliberate errors

3. Merging different databases with different data types, relations, and procedures
4. Data preparation, cleaning, etc. takes 60–85% of the time


IV Knowledge representation

1. Tables

Simplest form. Exactly the same process can be used for numeric prediction too—in this case, the structure is sometimes referred to as a regression table.

Outlook    Temperature  Humidity  Windy  Play
Sunny      hot          high      false  no
Sunny      hot          high      true   no
Overcast   hot          high      false  yes
Rainy      mild         high      false  yes
…          …            …         …      …

2. Linear model

PRP = 37.06 + 2.47 · CACH                         (numeric prediction: regression equation)
2.0 − 0.5 · PLength − 0.8 · PWidth = 0            (binary classification: decision boundary)

Linear models can also be applied to binary classification problems. In this case, the line produced by the model separates the two classes: it defines where the decision changes from one class value to the other. Such a line is often referred to as the decision boundary. The weights of the first equation above were found by a method called least squares linear regression; those of the second were found by the perceptron training rule.

3. Decision trees

If you predict a numeric quantity, decision trees with averaged numeric values at the leaves are called regression trees. The predicted value is the average value of the training instances that reach the leaf. It is possible to combine regression equations with regression trees: you get a tree whose leaves contain linear expressions (that is, regression equations) rather than single predicted values. This is called a model tree.

(Figure: decision boundary)


4. Classification rules

The antecedent, or precondition, is a series of tests like those in decision trees, while the consequent, or conclusion, gives the class or classes that apply to the instances covered by the rule.

Trees → Rules

Easy to transform. One rule is generated for each leaf.

+ Produces rules that are unambiguous
- Rules are unnecessarily complex (so some tests need to be removed)

Rules → Trees

More difficult. Replicated subtree problem – the tree has to repeat the second test (so it replicates the c/d subtree):

If a and b then x
If c and d then x

Advantages of rules
Each rule seems to represent an independent "nugget" of knowledge. New rules can be added to an existing set without disturbing it; in a tree you often have to reshape the whole structure.

Disadvantages of rules
The remark above ignores the order in which rules are executed – either as a decision list or in arbitrary order. A rule set may give multiple classifications for an example:

- give no conclusion at all, OR
- count how often each rule applies and choose the most popular class
- these strategies can lead to radically different results!

A rule set may also fail to classify an example at all (this cannot occur with decision trees):
- decide not to classify, OR
- choose the most frequently occurring class as a default
- these strategies can lead to radically different results!

5. Association Rules

Association rules are like classification rules but can predict any attribute, not just the class, as well as combinations of attributes. However, association rules are not intended to be used together as a set, as classification rules are.

The coverage (support) of a rule is the number of instances it predicts correctly (say 5, or 80% of the dataset). Its accuracy (confidence) is the same number expressed as a proportion of the number of instances to which the rule applies (say 90% of cool days). Example:

If temperature = cool then humidity = normal
The database contains 8 observations: 4 days are both cool and of normal humidity (coverage 4, or 50%), 1 day is cool but not of normal humidity (so accuracy is 4/5 = 80%), and the 3 other days are hot.

It is usual to specify minimum coverage and accuracy values, and to seek only those rules for which coverage and accuracy are both at least these specified minima.


6. Rules with exceptions

If you have a rule and a new instance arrives that defies it, you can change the bounds (risking misclassification of other instances) or add an exception:

If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor
  EXCEPT if petal-width < 1.0 then Iris-setosa

More complicated: default with exceptions of exceptions of…

Default: Iris-setosa
  except if petal-length >= 2.45 and petal-length < 5.355 and petal-width < 1.75
    then Iris-versicolor
      except if petal-length >= 4.95 and petal-width < 1.55
        then Iris-virginica
      else if sepal-length < 4.95 and sepal-width >= 2.45
        then Iris-virginica
  else if petal-length >= 3.35
    then Iris-virginica
      except if petal-length < 4.85 and sepal-length < 5.95
        then Iris-versicolor

Advantages of exceptions
- Easy to incorporate new data
- Easy to incorporate domain knowledge
- People often think in terms of exceptions (easier to understand large rule sets)

7. Relational rules

So far each test compared an attribute to a constant, but sometimes it is better to compare two attributes. Shapes problem – instead of comparing height and width to numbers, compare them to each other! BUT learning relational rules is costly, so most algorithms don't do it. Solution: create another attribute, height minus width, and check whether it is positive or negative.

8. Instance-based representation

Simplest form of learning: rote learning. The training instances are searched for the one that most closely resembles the new instance. Instead of building rules, just store the instances and classify new ones by relating them to the stored instances (find the nearest neighbor, or the k nearest neighbors).

In other methods, learning is done up front by processing the training database and the result is then applied to new instances; here the real work starts only when a new instance arrives – lazy learning.

To find similar instances use some distance function. For numeric attributes, use Euclidean distance (but it treats all attributes as equally important; we may want to weight them – a key problem in this method). For nominal attributes, the distance is 1 if the values are different and 0 if they are equal.

We don’t want to store all instances (slow algorithm, storage capacity). Save just a few critical examples of each class (which ones – second key problem).

You can visualize this using rectangles as rules. Nested rectangles are rules with exceptions.


9. Clusters

The output takes the form of a diagram that shows how the instances fall into clusters (which can overlap). Some algorithms associate instances with clusters probabilistically rather than categorically. Other algorithms produce a hierarchical structure of clusters and subclusters. Clustering is often followed by a stage in which a decision tree or rule set is inferred that allocates each instance to the cluster in which it belongs.

V Basic algorithms

1. OneR

A one-level decision tree. A simple algorithm that performs surprisingly well. We may encounter overfitting, especially if an attribute has a lot of distinct values. It assumes nominal attributes; when dealing with numeric attributes, discretize them first.

Pseudocode

For each attribute in our data set
    For each value of the attribute
        Find out which class appears most frequently for this value
        Make a rule: if this value then predict the most frequent class
        Calculate how many times you would be wrong if you used this rule
    Sum up the number of errors over all the values of this attribute
Choose the attribute whose rules make the fewest errors
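Below is a rough Python sketch of this procedure (not from the original notes; the tiny dataset and the attribute indices are invented for illustration):

from collections import Counter, defaultdict

def one_r(instances, attributes, class_index=-1):
    # Pick the single attribute whose one-level rules make the fewest errors.
    best = None
    for a in attributes:
        by_value = defaultdict(list)            # class labels grouped by attribute value
        for row in instances:
            by_value[row[a]].append(row[class_index])
        rules, errors = {}, 0
        for value, classes in by_value.items():
            most_common, count = Counter(classes).most_common(1)[0]
            rules[value] = most_common          # rule: if this value then most frequent class
            errors += len(classes) - count      # instances this rule gets wrong
        if best is None or errors < best[1]:
            best = (a, errors, rules)
    return best

# Hypothetical rows: [outlook, windy, play]
data = [["sunny", "false", "no"], ["sunny", "true", "no"],
        ["overcast", "false", "yes"], ["rainy", "false", "yes"],
        ["rainy", "true", "no"]]
print(one_r(data, attributes=[0, 1]))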

2. Naïve Bayes

Based on Bayes’ rule and “naïvely” assumes independence (simplistic assumption). But works very effectively when tested on actual datasets, particularly when combined with attribute selection procedures to eliminate redundant, non-independent, attributes. Why? Because the exact probabilities don’t matter as long as you find the class with maximum probability. Use of Bayes theorem:

P[C = c | H_i = h] = P[H_i = h | C = c] · P[C = c] / P[H_i = h]

Problem if a particular attribute value does not occur in the training set in conjunction with every class value (probability equals zero). Solution: add some constant to make it nonzero. Example:

(3 + μ·p1)/(9 + μ),   (6 + μ·p2)/(9 + μ),   (0 + μ·p3)/(9 + μ),   where Σ_{i=1..3} p_i = 1

If μ = n and p_i = 1/n (here μ = 3 and p_i = 1/3), this is called the Laplace estimator (we add 1 to each numerator and 3 to each denominator).

Numeric values are usually handled by assuming that they have a normal distribution.
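A minimal Python sketch of categorical Naïve Bayes with the Laplace estimator described above (the normal-distribution handling of numeric attributes is omitted; the toy dataset is invented):

from collections import Counter, defaultdict

def train_nb(rows):
    # Count classes and (attribute, value, class) combinations; the class is the last column.
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    values_per_attr = defaultdict(set)
    n_attrs = len(rows[0]) - 1
    for row in rows:
        for a in range(n_attrs):
            value_counts[(a, row[a], row[-1])] += 1
            values_per_attr[a].add(row[a])
    return class_counts, value_counts, values_per_attr, n_attrs, len(rows)

def predict(model, instance):
    class_counts, value_counts, values_per_attr, n_attrs, n = model
    scores = {}
    for cls, c_count in class_counts.items():
        score = c_count / n                             # prior P(class)
        for a in range(n_attrs):
            k = len(values_per_attr[a])                 # number of distinct values
            # Laplace estimator: add 1 to the numerator and k to the denominator.
            score *= (value_counts[(a, instance[a], cls)] + 1) / (c_count + k)
        scores[cls] = score
    return max(scores, key=scores.get)                  # class with maximum probability

data = [["sunny", "high", "no"], ["rainy", "high", "no"],
        ["overcast", "normal", "yes"], ["rainy", "normal", "yes"],
        ["sunny", "normal", "yes"]]
print(predict(train_nb(data), ["sunny", "high"]))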

Example (OneR applied to the weather data)

Attribute     Rules               Errors   Total errors   Decision
outlook       sunny -> no         2/5
              overcast -> yes     0/4
              rainy -> yes        2/5      4/14           USE
temperature   hot -> no           2/4
              mild -> yes         2/6
              cool -> yes         1/4      5/14
humidity      high -> no          3/7
              normal -> yes       3/7      6/14
windy         false -> yes        2/8
              true -> no          3/6      5/14


We calculate the mean and the standard deviation for each class and each numeric attribute. To find out probabilities of numeric value just plug in this value, mean and deviation to probability density function.

Advantages
- missing values are no problem at all
- often outperforms more sophisticated techniques
- very fast (efficient on big datasets)

Disadvantages
- problem when the probability of some attribute value is 0
- assumes independence; problem if some values are highly correlated (example: if 3 other attributes have the same values as temperature, the algorithm effectively decides based on temperature alone)
- for numeric values there is a need to assume some distribution (or use kernel density estimation)

Naïve Bayes is often used for document classification. You determine class based on the occurrence of certain words. But you have to take into account the number of occurrences of each word. To determine class not just by the words that occur in it but also by the number of times they occur you use multinomial Naïve Bayes. It performs better than the ordinary Naïve Bayes model for document classification, particularly for large dictionary sizes. See example below.

P[{yellow} | H] = 0.75 and P[{blue} | H] = 0.25, then

P[{yellow, yellow, blue} | H] = N! · (0.75² / 2!) · (0.25¹ / 1!)   with N = 3

P[{yellow, yellow, blue}] is computed similarly, and P[H] you get as in ordinary Naïve Bayes, so you have everything needed to compute P[H | {yellow, yellow, blue}].
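A quick numeric check of the multinomial expression above, using the probabilities from the example:

from math import factorial

p_yellow, p_blue = 0.75, 0.25
N = 3   # number of words in the "document" {yellow, yellow, blue}
p_doc = factorial(N) * (p_yellow ** 2 / factorial(2)) * (p_blue ** 1 / factorial(1))
print(p_doc)   # 3 * 0.5625 * 0.25 = 0.421875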

3. Decision trees (C4.5 algorithm)

The strategy for decision trees is to split the examples into subsets based on some condition on an attribute, and to stop when every instance in a subset has the same class. But which attribute do you choose to split on?

You want to get the smallest tree, so define some purity function and maximize it (or minimize an impurity function). The functions used are entropy, information, and information gain.

Entropy represents the expected amount of information that would still be needed in the next step if you choose a particular split. You want as few bits as possible!

entropy(p1, p2, …, pn) = −Σ_{i=1..n} p_i · log2(p_i),   where 0 ≤ p_i ≤ 1 and Σ p_i = 1

information(a1, a2, …, an) = entropy(a1/a, a2/a, …, an/a),   where a = Σ a_i

Let's calculate the information of the leaves in the first tree. The arguments are the numbers of "yes" and "no" instances in each leaf.

information([2,3], [4,0], [3,2]) = (5/14)·entropy(2/5, 3/5) + (4/14)·entropy(4/4, 0/4) + (5/14)·entropy(3/5, 2/5)
information([2,3], [4,0], [3,2]) = 0.693 bits

Now let's look at information gain. It is the difference between the information before the split and after the split; it measures how much of the required information we got rid of by splitting on a certain attribute. Maximize it!

inf. gain(outlook) = information([9,5]) − information([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247
inf. gain(temperature) = 0.029

So we split the data on outlook. For every subtree, repeat the algorithm until the information gain is 0 or you can't split any further.

Problem with information gain – it is biased toward choosing attributes with a large number of values. To compensate for this, use the gain ratio. It takes the number and size of branches into account when choosing an attribute (the split information represents the information needed to tell which branch a particular instance goes to when the tree is split into subtrees).

gain ratio(outlook) = information gain(outlook) / split(outlook)

Example:
information([9,5])                     0.940
information([2,3],[4,0],[3,2])         0.693
information gain(outlook)              0.247
split(outlook) = information([5,4,5])  1.577
gain ratio(outlook)                    0.247/1.577 = 0.157
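A small Python sketch that reproduces the numbers above (entropy, information, information gain, and gain ratio for the outlook split of the weather example):

from math import log2

def entropy(counts):
    # Entropy of a class distribution given as raw counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information(subsets):
    # Weighted average entropy of the subsets produced by a split.
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

before = entropy([9, 5])                        # 0.940
after = information([[2, 3], [4, 0], [3, 2]])   # 0.693
gain = before - after                           # 0.247
split_info = entropy([5, 4, 5])                 # 1.577 (sizes of the three branches)
print(round(gain, 3), round(gain / split_info, 3))   # 0.247 0.157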

Problem with gain ratio – it may overcompensate and lead to preferring an attribute just because its split ratio is small. Solution: consider attributes with information gain greater than average and maximize gain ratio for them.

4. Classification rules (PRISM algorithm)

In decision trees you consider all classes when choosing the splitting attribute. Here you take one class at a time and find rules that cover all instances of that class and exclude those not in the class. The rule-generating method focuses on one class at a time, disregarding what happens to the other classes.


Example – consider class b and find a set of rules for it (the resulting rules are shown below, after the pseudocode).

Whereas trees must use one attribute to split on, rules can use more (no replicated-subtree problem). A tree seeks to maximize class separation; covering algorithms try to find a rule with maximum accuracy (find the attribute-value pair that maximizes the probability of the desired classification). Example: the PRISM algorithm.

Pseudocode

For each class C
    Let E be the set of all instances               # reinitialized after each class
    While E contains instances in class C           # if not, move on to the next class
        Create a rule R: "if ? then recommendation C"
        Until R is perfect (or there are no more attributes to use) do
            For each attribute A not mentioned in R, and each value v,
                consider adding the condition A = v
            Select A and v to maximize the accuracy p/t    # if tied, use the biggest t
            Add A = v to R
        Remove the instances covered by R from E
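A rough Python sketch of this covering loop (not an original PRISM implementation; the dataset is invented, and ties are broken by the larger t as stated in the pseudocode):

def prism(rows, target_class, n_attrs):
    # Generate rules of the form [(attribute index, value), ...] -> target_class.
    E = list(rows)
    rules = []
    while any(r[-1] == target_class for r in E):
        conditions, covered = [], list(E)
        # Grow the rule until it is perfect or no attributes are left.
        while any(r[-1] != target_class for r in covered) and len(conditions) < n_attrs:
            used = {a for a, _ in conditions}
            best = None
            for a in range(n_attrs):
                if a in used:
                    continue
                for v in {r[a] for r in covered}:
                    matched = [r for r in covered if r[a] == v]
                    t = len(matched)
                    p = sum(r[-1] == target_class for r in matched)
                    # Maximize accuracy p/t; break ties by the bigger t.
                    if best is None or (p / t, t) > (best[0] / best[1], best[1]):
                        best = (p, t, a, v)
            p, t, a, v = best
            conditions.append((a, v))
            covered = [r for r in covered if r[a] == v]
        rules.append(conditions)
        E = [r for r in E if not all(r[a] == v for a, v in conditions)]
    return rules

data = [["sunny", "high", "no"], ["sunny", "normal", "yes"],
        ["rainy", "high", "no"], ["overcast", "high", "yes"]]
print(prism(data, "yes", n_attrs=2))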

The resulting rules for class b in the example:

If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b

Properties
Rules generated by PRISM within one class should be interpreted in order (each rule covers instances not covered by the previous rules). Between classes the order does not matter, because E is reinitialized for each class.
Other algorithms may produce order-independent rules, but then two problems arise: conflicting rules and instances not covered by any rule.

5. Association rules (Apriori algorithm)

We could use the same approach as for classification rules, but there are far too many possible attribute-value combinations – infeasible. Instead, restrict the item sets (an item is an attribute-value combination) to those with at least the minimum coverage.

From those item sets you generate rules and keep those with the required accuracy. Example:

Three-item set: humidity = normal, windy = false, play = yes

This generates 2^N − 1 = 7 candidate rules (N = 3):
If humidity = normal and windy = false then play = yes              4/4
If humidity = normal and play = yes then windy = false              4/6
If windy = false and play = yes then humidity = normal              4/6
If humidity = normal then windy = false and play = yes              4/7
If windy = false then humidity = normal and play = yes              4/8
If play = yes then humidity = normal and windy = false              4/9
If – then humidity = normal and windy = false and play = yes        4/14


(Here, 58 rules with 100% accuracy.) But how exactly do you generate the rules for a particular item set? Of course you can generate all possible rules from the set, but that is very costly (a brute-force approach). Instead, generate single-consequent rules (generally (k−1)-consequent ones) and from them build double-consequent rules (generally k-consequent ones). If either of the rules A or B with (k−1) consequents does not hold (accuracy below the threshold), then there is no point in checking the rule whose consequent combines A and B.
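Below is a rough Python sketch of that idea: rules with larger consequents are generated only from (k−1)-consequent rules that met the accuracy threshold. Items are represented as (attribute index, value) pairs, and the tiny dataset is invented for illustration.

from itertools import combinations

def count(dataset, items):
    # Number of rows that contain all of the given (attribute, value) items.
    return sum(all(row[a] == v for a, v in items) for row in dataset)

def rules_from_itemset(dataset, itemset, min_accuracy=0.9):
    # Build k-consequent rules only from surviving (k-1)-consequent rules.
    support = count(dataset, itemset)           # every candidate rule predicts these rows
    good = {}                                   # consequent (frozenset) -> accuracy
    candidates = [frozenset([item]) for item in itemset]
    while candidates:
        survivors = []
        for consequent in candidates:
            antecedent = [i for i in itemset if i not in consequent]
            applies = count(dataset, antecedent) if antecedent else len(dataset)
            accuracy = support / applies
            if accuracy >= min_accuracy:
                good[consequent] = accuracy
                survivors.append(consequent)
        # Merge surviving consequents pairwise into (k+1)-item consequents; keep a
        # candidate only if every one of its k-item subsets also survived.
        merged = {a | b for a, b in combinations(survivors, 2) if len(a | b) == len(a) + 1}
        candidates = [c for c in merged
                      if all(frozenset(s) in good for s in combinations(c, len(c) - 1))]
    return good

# Hypothetical rows: (humidity, windy, play)
data = [("normal", "false", "yes"), ("normal", "false", "yes"),
        ("normal", "true", "yes"), ("high", "false", "no")]
itemset = [(0, "normal"), (1, "false"), (2, "yes")]
for consequent, acc in rules_from_itemset(data, itemset, min_accuracy=0.6).items():
    print(sorted(consequent), round(acc, 2))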

6. Linear models

http://www.speech.sri.com/people/anand/771/html/node32.html

7. Instance based learning (nearest-neighbors algorithm)

Compute the distance using some metric (e.g. Euclidean). But different attributes have different orders of magnitude, so normalize the values:

a_i = (v_i − min v_i) / (max v_i − min v_i),   so that instead of a value v_i ∈ (−∞, ∞) we have a_i ∈ [0, 1]

If there are missing values:
- For nominal attributes, assume that a missing feature is maximally different: if at least one value is missing or the values differ, the distance is 1; if neither is missing and they are the same, it is 0.
- For numeric attributes, the difference between two missing values is also 1. If just one is missing, the difference is often taken as the (normalized) size of the other value or 1 minus that size, whichever is larger.
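A small sketch of such a distance function, with min-max normalization and the missing-value conventions above (None marks a missing value; the attribute ranges are assumed to come from the training data):

def normalize(v, lo, hi):
    # Min-max normalization of a numeric value into [0, 1].
    return (v - lo) / (hi - lo)

def attribute_distance(x, y, kind, lo=None, hi=None):
    # Distance contribution of one attribute; None marks a missing value.
    if kind == "nominal":
        # Missing or different -> maximally different (1); present and equal -> 0.
        return 0.0 if (x is not None and x == y) else 1.0
    if x is None and y is None:
        return 1.0
    if x is None or y is None:
        known = normalize(x if x is not None else y, lo, hi)
        return max(known, 1.0 - known)        # whichever is larger
    return abs(normalize(x, lo, hi) - normalize(y, lo, hi))

def distance(a, b, kinds, ranges):
    # Euclidean distance over mixed nominal/numeric attributes.
    total = 0.0
    for i, kind in enumerate(kinds):
        lo, hi = ranges.get(i, (None, None))
        total += attribute_distance(a[i], b[i], kind, lo, hi) ** 2
    return total ** 0.5

kinds = ["nominal", "numeric"]
ranges = {1: (60, 100)}                        # min/max of the numeric attribute
print(distance(["sunny", 85], ["rainy", None], kinds, ranges))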

To find nearest neighbors you can search through the entire database, but that is very slow. Instead, divide the database. How? With a kD-tree.

kD-tree algorithm
Divides the input k-dimensional space (k is the # of attributes) with a hyperplane and then splits each partition again, recursively, thus creating the tree. Splits are parallel to one of the axes (chosen based on the computed volatility along each axis). To locate the nearest neighbor of a given point, follow the tree down from its root to locate the region containing the target and compute distances in this region. It is also useful to examine sibling nodes.

Pseudocode (for creating a tree)

Compute the volatility of every attribute
Select the splitting axis based on the highest volatility (split perpendicular to it)
Sort the instances and split on the median
The median becomes the node
Repeat for the subtree (points before the median)
Repeat for the subtree (points after the median)

Problems with kD-trees:
- Skewed data results in unbalanced trees
- Need to rebuild (or somehow change) the tree if new instances arrive
- Rectangles are not efficient structures


Ball trees
Use hyperspheres, not hyperrectangles. Choose the point in the ball that is farthest from its center, and then a second point that is farthest from the first one. Assign each data point to whichever of the two it is closer to, then compute the center and radius of each of the two newly created balls. Neighboring spheres may overlap, but that is not a problem: points that fall into the overlap area are assigned to only one of the overlapping balls.

Problems with nearest neighbors:
- noisy exemplars (use a k-nearest-neighbor strategy)
- kD-trees become inefficient as the dimension increases (they work well up to about 10 attributes)

Voting Feature Intervals
Construct intervals for each attribute:
● Discretize numeric attributes
● Treat each value of a nominal attribute as an "interval"
Count the number of times each class occurs in each interval. A prediction is generated by letting the intervals that contain the test instance vote.

8. Clustering (k-means algorithm)

Clustering techniques apply when there is no class to be predicted. Aim: divide instances into “natural” groups.

Clusters can be: disjoint vs. overlapping, deterministic vs. probabilistic, flat vs. hierarchical.

K-means
Flat, deterministic, disjoint. Alternate between assigning instances to clusters and recomputing the cluster centers until the result stabilizes.

Pseudocode

Specify how many clusters are being sought: k
Choose k points at random as cluster centers (the 1st centroids)
Until the n-th centroids are the same as the (n−1)-th centroids:
    Assign each instance to the closest cluster center (e.g. by Euclidean distance)
    Recalculate the centroids (the i-th centroids)
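A compact Python sketch of this loop (random initial centroids, plain Euclidean distance; the data points are made up):

import random

def closest(point, centroids):
    # Index of the centroid nearest to the point (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def mean(points):
    # Component-wise mean of a non-empty list of points.
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def k_means(data, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)                      # 1st centroids, chosen at random
    while True:
        clusters = [[] for _ in range(k)]
        for point in data:                               # assignment step
            clusters[closest(point, centroids)].append(point)
        new_centroids = [mean(c) if c else centroids[i]  # keep the old center if a cluster empties
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                   # centroids stopped moving -> done
            return centroids, clusters
        centroids = new_centroids

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = k_means(data, k=2)
print(centroids)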

Problems
- Can get trapped in a local minimum; the result can vary significantly depending on the initial centroids
  → repeat several times, OR make the initial choice of centroids not entirely random: k-means++
- Each iteration computes the distances from all instances to the k centroids (costly!)
  → use kD-trees or ball trees

How to use kD-trees for clustering? (element of X-means clustering) First, build a tree, which remains static throughout the procedure


At each node, store number of instances and sum of all instances

In each iteration, descend tree and find out which cluster each node belongs to

o Can stop descending as soon as we find out that a node belongs entirely to a particular cluster

o Use statistics stored at the nodes to compute new cluster centers

VI Evaluating the algorithms

1. Introduction

How predictive is the model we learned? The error on the training data is not a good indicator of performance on future data. To evaluate, split the data into a training set and a test set. To be sure that apparent differences are not caused by chance effects, statistical tests are needed.
Natural performance measure for classification problems: the error rate – the proportion of errors made over the whole set of instances.
Resubstitution error: the error rate obtained from the training data (overly optimistic).
Test set: independent instances that have played no part in the formation of the classifier; the test data must not be used in any way to create the classifier! The proper procedure uses three sets: training data, validation data (used to optimize parameters), and test data. Once evaluation is complete, all the data can be used to build the final classifier (the more data the better). The larger the test set, the more accurate the error estimate, but we sacrifice the accuracy of the classifier.
Holdout procedure: the method of splitting the original data into a training and a test set.

2. Cross-Validation

Usually hold out one-third of the data for testing and use the remaining two-thirds for training. Problem: The sample used for training (or testing) might not be representative

Solution is stratification. It ensures that each class is represented with approximately equal proportions in both subsets. But stratified holdout is a primitive safeguard.

You can also use repeated holdout, repeating the whole process several times and averaging the error estimates. But then there is a problem with overlapping test sets.

More sophisticated technique is k-fold cross-validation. First split data into k subsets of equal size. Then use k-1 subsets for training, and the other one for testing. Repeat k-times and average the results. Often the subsets are stratified. Usually stratified tenfold cross-validation is used. To get a small variance you can repeat it 10 times, repeated stratified tenfold cross-validation.
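A minimal sketch of stratified k-fold cross-validation (the classifier is left abstract as a train_and_test callback; the majority-class example at the end is only a placeholder):

import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    # Split instance indices into k folds with roughly equal class proportions.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):       # deal each class out round-robin
            folds[j % k].append(idx)
    return folds

def cross_validate(labels, k, train_and_test):
    # Average the error over k folds; train_and_test(train_idx, test_idx) -> error rate.
    folds = stratified_folds(labels, k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        errors.append(train_and_test(train_idx, test_idx))
    return sum(errors) / k

# Placeholder "classifier": always predict the majority class of the training fold.
labels = ["yes"] * 9 + ["no"] * 5
def majority_error(train_idx, test_idx):
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] != majority for i in test_idx) / len(test_idx)

print(cross_validate(labels, k=5, train_and_test=majority_error))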


A special case is leave-one-out cross-validation: simply n-fold cross-validation, where n is the # of instances. You learn on all but one instance, test on the remaining one, and average the error rates. Great because there is no randomness in choosing the sets and the training set is as large as possible. But it is computationally expensive, so use it only for small datasets. Also, by its nature it can't be stratified.

3. Bootstrap

Cross-validation uses sampling without replacement. The bootstrap uses sampling with replacement to form the training set: draw n times to create a training set of size n. A particular instance has a probability of 1 − 1/n of not being picked in one draw, so the probability of not being picked in n draws (i.e., of not being in the training set) is

(1 − 1/n)^n ≈ e^(−1) ≈ 0.368

This means the training data will contain approximately 63.2% of the distinct instances. That is pretty low compared to the 90% used in tenfold cross-validation, so the error on the test instances will be very pessimistic. The solution is to combine the errors from the test and training instances:

err = 0.632 × err_test instances + 0.368 × err_training instances
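A sketch of one 0.632-bootstrap round over instance indices, plus a quick simulation of the 0.368 figure (the train_and_test callback is left abstract):

import random

def bootstrap_632_error(n, train_and_test, seed=0):
    # One 0.632-bootstrap round over instance indices 0..n-1.
    # train_and_test(train_idx, test_idx) must return an error rate.
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]     # sample n times with replacement
    chosen = set(train_idx)
    test_idx = [i for i in range(n) if i not in chosen]  # the left-out ~36.8% of instances
    err_test = train_and_test(train_idx, test_idx)
    err_train = train_and_test(train_idx, train_idx)     # resubstitution error
    return 0.632 * err_test + 0.368 * err_train

# Sanity check of the 0.368 figure: the fraction of instances missed by one bootstrap sample.
n = 100000
sample = {random.randrange(n) for _ in range(n)}
print(1 - len(sample) / n)     # close to e**-1 ~ 0.368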

4. Comparing models

Simple: estimate the error using cross-validation (or any other suitable estimation procedure) and choose the model with the smaller estimate. But what if the difference is only due to chance? Use a t-test.

But with cross-validation you reuse the data and the samples become dependent; insignificant differences can then appear significant. Solution: use the corrected resampled t-test.

t = d̄ / sqrt((1/k + n2/n1) · σ_d²)

where d̄ is the mean difference between the models, σ_d² is the variance of that difference, n1 is the # of instances in the training set (n2 in the test set), and k is the # of cross-validation runs.
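A direct transcription of this statistic in Python (the per-fold differences are invented numbers used only to exercise the formula; using the sample variance with k − 1 is an assumption):

from math import sqrt

def corrected_resampled_t(differences, n_train, n_test):
    # Corrected resampled t-statistic for per-fold differences between two models.
    k = len(differences)
    d_bar = sum(differences) / k                                  # mean difference
    var_d = sum((d - d_bar) ** 2 for d in differences) / (k - 1)  # sample variance
    return d_bar / sqrt((1 / k + n_test / n_train) * var_d)

# Invented per-fold error differences from k = 10 cross-validation runs,
# each with 90 training and 10 test instances.
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.02, 0.03, 0.01, 0.02]
print(corrected_resampled_t(diffs, n_train=90, n_test=10))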

5. Evaluating Costs

Different types of classification errors often incur different costs. When screening for terrorists, a false positive (we think a person is a terrorist when they are not) is not nearly as costly as a false negative (an actual terrorist goes undetected).

Overall success is measured by the sum of the elements on the diagonal of the confusion matrix. Is this a fair measure of success? To assess how many correct predictions you would make by chance, divide each column total of the confusion matrix across the rows in proportion to the actual class totals (here 120:60:20, i.e. 6:3:1).

CONFUSION MATRIX

                            Actual class
                            p                 n
Predicted     p'            True Positive     False Positive
class         n'            False Negative    True Negative

              Actual predictor                Random predictor
                a    b    c                     a    b    c
Actual    a    88   14   18      Actual    a   60   36   24
class     b    10   40   10      class     b   30   18   12
          c     2    6   12                c   10    6    4
Diagonal:      88+40+12 = 140    Diagonal:     60+18+4 = 82


kappa = (diag_actual − diag_random) / (total − diag_random)

The kappa statistic measures the relative improvement over a random predictor. It is used to measure the agreement between the predicted and observed categorizations of a dataset, while correcting for agreement that occurs by chance. For the example above, kappa = (140 − 82) / (200 − 82) = 58/118 ≈ 0.49.
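A short Python check of the kappa computation, using the confusion matrix from the example:

def kappa(confusion):
    # Kappa statistic from a confusion matrix given as rows of actual classes.
    total = sum(sum(row) for row in confusion)
    diag_actual = sum(confusion[i][i] for i in range(len(confusion)))
    col_totals = [sum(row[j] for row in confusion) for j in range(len(confusion))]
    row_totals = [sum(row) for row in confusion]
    # Expected diagonal of a random predictor that follows the same row/column totals.
    diag_random = sum(r * c / total for r, c in zip(row_totals, col_totals))
    return (diag_actual - diag_random) / (total - diag_random)

matrix = [[88, 14, 18],
          [10, 40, 10],
          [ 2,  6, 12]]
print(kappa(matrix))   # ~0.49, matching the worked example (140 vs. 82 on the diagonal)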

