Data Mining. Classification

AACIMP 2011 Summer School. Operational Research Stream. Lecture by Erik Kropat.
Page 1: Data Mining. Classification

Classification

Summer School

“Achievements and Applications of Contemporary Informatics,

Mathematics and Physics” (AACIMP 2011)

August 8-20, 2011, Kiev, Ukraine

Erik Kropat

University of the Bundeswehr Munich Institute for Theoretical Computer Science,

Mathematics and Operations Research

Neubiberg, Germany

Page 2: Data Mining. Classification

Examples

Clinical trials

In a clinical trial, 20 laboratory values are collected for each of 10,000 patients, together with the diagnosis (ill / not ill).

We measure the values of a new patient.

Is he / she ill or not?

Credit ratings

An online shop collects data from its customers, together with some information about their credit rating (good customer / bad customer).

We get the data of a new customer.

Is he / she a good customer or not?

Page 3: Data Mining. Classification

Machine Learning / Classification

Labeled training examples → Machine learning algorithm → Classification rule

New example + Classification rule → Predicted classification
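As a minimal sketch of this workflow (not part of the original slides), the following Python example uses scikit-learn's estimator interface with a k-nearest-neighbor classifier, which is introduced on the next slides; the toy data and variable names are purely illustrative.

```python
# Minimal sketch of the workflow above; toy data and names are illustrative only.
from sklearn.neighbors import KNeighborsClassifier

# Labeled training examples: feature vectors X_train with class labels y_train.
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = ["ill", "ill", "not ill", "not ill"]

# Machine learning algorithm -> classification rule.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# New example -> predicted classification.
x_new = [[1.2, 1.9]]
print(model.predict(x_new))   # e.g. ['ill']
```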

Page 4: Data Mining. Classification

k Nearest Neighbor Classification

kNN

Page 5: Data Mining. Classification

k Nearest Neighbor Classification

Idea: Classify a new object with regard to a set of training examples by comparing it with its k "nearest" objects ("nearest neighbors").

[Figure: objects in class 1 and objects in class 2, a new object, and its 4 nearest neighbors (4-nearest neighbor classification)]

Page 6: Data Mining. Classification

k Nearest Neighbor Classification

• Required

− Training set, i.e. objects and their class labels

− Distance measure

− The number k of nearest neighbors

[Figure: a new object and its 5 nearest neighbors among the training objects (5-nearest neighbor classification)]

• Classification of a new object (see the sketch below)

− Calculate the distances between the new object and the objects of the training set.

− Identify the k nearest neighbors.

− Use the class labels of the k nearest neighbors to determine the class of the new object (e.g. by majority vote).

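A minimal sketch of these three steps in plain NumPy (not from the slides); the function name, the toy data, and the choice of the Euclidean metric are illustrative assumptions.

```python
# Minimal kNN sketch (NumPy); names and toy data are illustrative only.
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=5):
    # 1. Calculate the distances between the new object and the training objects.
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean metric
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote over the class labels of the k nearest neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([1, 1, 2, 2])                          # class labels 1 and 2
print(knn_classify(np.array([0.1, 0.2]), X_train, y_train, k=3))   # -> 1
```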

Page 7: Data Mining. Classification

k Nearest Neighbor Classification

[Figure: the same new object classified with its 1, 2, and 3 nearest neighbors. With k = 1 and k = 3 the class label follows from the neighbors; with k = 2 the vote can be tied, and the decision can then be made by the distances of the neighbors.]

Page 8: Data Mining. Classification

1-nearest neighbor ⇒ Voronoi diagram: the decision regions of the 1-nearest neighbor classifier are the Voronoi cells of the training objects.

Page 9: Data Mining. Classification

k Nearest Neighbor Classification (kNN): Distance

• The distance between the new object and the objects in the set of training samples is usually measured by the Euclidean metric or the squared Euclidean metric.

• In text mining, the Hamming distance is often used.
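For illustration, these distance measures can be written as small NumPy functions (the function names are my own, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def squared_euclidean(x, y):
    return np.sum((x - y) ** 2)

def hamming(x, y):
    # Number of positions in which two equal-length vectors differ,
    # often used for binary / text features.
    return int(np.sum(x != y))
```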

Page 10: Data Mining. Classification

k Nearest Neighbor Classification (kNN): Class label of the new object

• The class label of the new object is determined from the list of its k nearest neighbors. This can be achieved by

− a majority vote with regard to the class labels of the k nearest neighbors, or

− a vote weighted by the distances of the k nearest neighbors.

Page 11: Data Mining. Classification

k Nearest Neighbor Classification (kNN): Choice of k

• The value of k has a strong influence on the classification result (see the sketch below for one way to choose k).

− k too small: noise can have a strong influence.

− k too large: the neighborhood can contain objects from different classes (ambiguity / false classification).

[Figure: a new object whose larger neighborhood contains objects from both classes]
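One common way to choose k, not covered on the slides, is cross-validation over a grid of candidate values; a hedged sketch with scikit-learn, assuming labeled training data X_train, y_train of reasonable size:

```python
# Choosing k by 5-fold cross-validation (illustrative sketch, scikit-learn).
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
search.fit(X_train, y_train)      # X_train, y_train: labeled training data
print(search.best_params_)        # e.g. {'n_neighbors': 5}
```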

Page 12: Data Mining. Classification

Support Vector Machines

Page 13: Data Mining. Classification

Support Vector Machines

A set of training samples with objects in Rn is divided into two categories:

positive objects and negative objects

Page 14: Data Mining. Classification

Goal: "Learn" a decision rule from the training samples and assign a new example to the "positive" or the "negative" category.

Idea: Determine a separating hyperplane.

New objects are classified as

positive, if they are in the half space of positive examples

negative, if they are in the half space of negative examples.

Support Vector Machines

Page 15: Data Mining. Classification

INPUT: Sample of training data T = { (x1, y1),...,(xk, yk) | xi ∈ Rn , yi ∈ { -1, +1 } }, with xi ∈ Rn data

and yi ∈ {-1, +1} class label

Data from patients with confirmed diagnosis

Laboratory values

Disease: Yes / No

Decision rule: f : Rn → {-1, +1}

INPUT: Laboratory values of a new patient Decision: Disease: Yes / No

Support Vector Machines

Page 16: Data Mining. Classification

Separating Hyperplane

A separating hyperplane is determined by

− a normal vector w and

− a parameter b

Idea: Choose w and b, such that the hyperplane separates the set of training samples in an optimal way.

H = { x ∈ Rn | ⟨ w, x ⟩ + b = 0 }        ( ⟨ · , · ⟩ : scalar product )

Offset of the hyperplane from the origin along w:   b / ‖ w ‖

[Figure: hyperplane H with normal vector w]

Page 17: Data Mining. Classification

What is a good separating hyperplane?

There exist many separating hyperplanes.

[Figure: several hyperplanes that all separate the two classes, and a new object whose predicted class depends on which hyperplane is chosen]

Will this new object be in the "red" class?

Page 18: Data Mining. Classification

Question: What is the best separating hyperplane?

Answer: Choose the separating hyperplane so that the distance from it

to the nearest data point on each side is maximized.

[Figure: maximum-margin hyperplane H; the nearest data points on each side are the support vectors, and the distance between the two sides is the margin]

Page 19: Data Mining. Classification

Scaling of Hyperplanes

• A hyperplane can be defined in many ways. For c ≠ 0:

{ x ∈ Rn | ⟨ w, x ⟩ + b = 0 } = { x ∈ Rn | ⟨ cw, x ⟩ + cb = 0 }

• Use the training samples to choose (w, b) such that

min_{xi} | ⟨ w, xi ⟩ + b | = 1        (canonical hyperplane)

Page 20: Data Mining. Classification

Definition

A training sample T = { (x1, y1),...,(xk, yk) | xi ∈ Rn , yi ∈ {-1, +1} } is separable by the hyperplane H = { x ∈ Rn | ⟨ w, x ⟩ + b = 0 }, if there exist a vector w ∈ Rn and a parameter b ∈ R such that

⟨ w, xi ⟩ + b ≥ +1 ,  if yi = +1

⟨ w, xi ⟩ + b ≤ -1 ,  if yi = -1

for all i ∈ {1,...,k}.

[Figure: hyperplane H with normal vector w and the two margin hyperplanes ⟨ w, x ⟩ + b = +1 and ⟨ w, x ⟩ + b = -1]

Page 21: Data Mining. Classification

Maximal Margin

• The above conditions can be rewritten as

yi · ( ⟨ w, xi ⟩ + b ) ≥ 1   for all i ∈ {1,...,k}

• Distance between the two margin hyperplanes:   2 / ‖ w ‖

⇒ In order to maximize the margin we must minimize ‖ w ‖.

[Figure: hyperplane H with normal vector w and margin hyperplanes ⟨ w, x ⟩ + b = +1 and ⟨ w, x ⟩ + b = -1]

Page 22: Data Mining. Classification

Optimization problem

Find a normal vector w and a parameter b such that the distance between the training samples and the hyperplane defined by w and b is maximized:

Minimize   ½ ‖ w ‖²

s.t.   yi · ( ⟨ w, xi ⟩ + b ) ≥ 1   for all i ∈ {1,...,k}

⇒ quadratic programming problem

[Figure: maximum-margin hyperplane H with normal vector w]
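As an illustration of this quadratic program (not from the slides), the primal problem can be handed to a generic convex solver; the use of cvxpy and all variable names and data are my own assumptions.

```python
# Hard-margin SVM primal problem as a quadratic program (illustrative sketch).
# Assumes cvxpy is installed and that X (k x n) and y (entries in {-1, +1})
# hold a linearly separable training sample.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n = X.shape[1]
w = cp.Variable(n)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (<w, x_i> + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```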

Page 23: Data Mining. Classification

Dual Form

Find parameters α1,...,αk, such that

Max   Σ_{i=1,...,k} αi  -  ½ Σ_{i,j=1,...,k} αi αj yi yj ⟨ xi, xj ⟩

with   αi ≥ 0 for all i = 1,...,k   and   Σ_{i=1,...,k} αi yi = 0.

⇒ The maximal margin hyperplane (= the classification problem) is only a function of the support vectors and depends on the data only through the scalar products ⟨ xi, xj ⟩:

Kernel function:   k ( xi, xj ) := ⟨ xi, xj ⟩

Page 24: Data Mining. Classification

Dual Form

• When the optimal parameters α1*,...,αk* are known, the normal vector w* of the separating hyperplane is given by

w* = Σ_{i=1,...,k} αi* yi xi        (sum over the training data)

• The parameter b* is given by

b* = - ½ ( max { ⟨ w*, xi ⟩ | yi = -1 } + min { ⟨ w*, xi ⟩ | yi = +1 } )
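A hedged NumPy sketch of these two formulas (the function name is my own; `alphas` is assumed to contain the optimal dual variables for training data X, y with labels in {-1, +1}):

```python
# Recovering w* and b* from the dual solution (illustrative sketch).
import numpy as np

def primal_from_dual(alphas, X, y):
    # w* = sum_i alpha_i* y_i x_i
    w = (alphas * y) @ X
    # b* = -1/2 ( max{ <w*, xi> | yi = -1 } + min{ <w*, xi> | yi = +1 } )
    scores = X @ w
    b = -0.5 * (scores[y == -1].max() + scores[y == +1].min())
    return w, b
```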

Page 25: Data Mining. Classification

Classifier

• A decision function f maps a new object x ∈ Rn to a category f(x) ∈ {-1, +1}:

f (x) = +1 ,  if ⟨ w*, x ⟩ + b* ≥ +1

f (x) = -1 ,  if ⟨ w*, x ⟩ + b* ≤ -1

[Figure: hyperplane H with normal vector w, separating the half space of the +1 class from the half space of the -1 class]

Page 26: Data Mining. Classification

Support Vector Machines

Soft Margins

Page 27: Data Mining. Classification

Soft Margin Support Vector Machines

• Until now: Hard margin SVMs

The set of training samples can be separated by a hyperplane.

• Problem: Some elements of the training samples can have a false label.

Then the set of training samples cannot be separated by a hyperplane and the hard margin SVM is not applicable.

Page 28: Data Mining. Classification

• Idea: Soft margin SVMs

Modified maximum margin method for mislabeled examples.

• Choose a hyperplane that splits the training set as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.

• Introduce slack variables ξ1,…, ξ n which measure the degree of misclassification.

Soft Margin Support Vector Machines

Page 29: Data Mining. Classification

• Interpretation

The slack variables measure the degree of misclassification of the training examples with regard to a given hyperplane H.

[Figure: hyperplane H with two training points that violate the margin and their slack variables ξi and ξj]

Soft Margin Support Vector Machines

Page 30: Data Mining. Classification

• Replace the constraints

yi · ( ⟨ w, xi ⟩ + b ) ≥ 1   for all i ∈ {1,...,n}

by

yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 - ξi   for all i ∈ {1,...,n}

Soft Margin Support Vector Machines


Page 31: Data Mining. Classification

• Idea

If the slack variables ξi are chosen as small as possible, then:

ξi = 0 ⇔ xi is correctly classified.

0 < ξi < 1 ⇔ xi is between the margins, but on the correct side of H.

ξi ≥ 1 ⇔ xi is misclassified [ yi · ( ⟨ w, xi ⟩ + b ) < 0 ]

Constraint: yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 - ξi   for all i ∈ {1,...,n}


Soft Margin Support Vector Machines

Page 32: Data Mining. Classification

• The sum of all slack variables is an upper bound for the total training error:

number of misclassified training examples  ≤  Σ_{i=1,...,n} ξi

[Figure: hyperplane H with slack variables ξi, ξj]

Soft Margin Support Vector Machines

Page 33: Data Mining. Classification

Soft Margin Support Vector Machines

Find a hyperplane with maximal margin and minimal training error:

Minimize   ½ ‖ w ‖²  +  C Σ_{i=1,...,n} ξi        (C: regularisation parameter)

s.t.   yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 - ξi   for all i ∈ {1,...,n}

        ξi ≥ 0   for all i ∈ {1,...,n}
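In practice this soft margin problem is usually solved with an existing library; a hedged sketch with scikit-learn's SVC, where the linear kernel and the value of C are illustrative choices and X_train, y_train, X_new are assumed to be given:

```python
# Soft margin SVM sketch (scikit-learn); kernel and C are illustrative choices.
from sklearn.svm import SVC

clf = SVC(kernel="linear", C=1.0)   # larger C penalizes slack more strongly
clf.fit(X_train, y_train)           # labeled training data (assumed given)
print(clf.support_vectors_)         # support vectors found by the solver
print(clf.predict(X_new))           # predicted labels for new objects
```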

Page 34: Data Mining. Classification

Support Vector Machines

Nonlinear Classifiers

Page 35: Data Mining. Classification

Support Vector Machines Nonlinear Separation

Question: Is it possible to create nonlinear classifiers?

Page 36: Data Mining. Classification

Idea: Map data points into a higher dimensional feature space where a linear separation is possible.

Ф : Rn → Rm

Support Vector Machines Nonlinear Separation

Page 37: Data Mining. Classification

Nonlinear Transformation

Ф : Rn (original feature space)  →  Rm (high-dimensional feature space)

Page 38: Data Mining. Classification

Kernel Functions

Assume: For a given set X of training examples we know a function Ф,

such that a linear separation in the high-dimensional space is possible.

Decision: When we have solved the corresponding optimization problem,

we only need to evaluate a scalar product

to decide about the class label of a new data object.

f(xnew)  =  sign ( Σ_{i=1,...,n} αi* yi ⟨ Ф(xi), Ф(xnew) ⟩ + b* )  ∈  {-1, +1}

Page 39: Data Mining. Classification

Kernel functions

Introduce a kernel function

K(xi, xj) = ⟨ Ф(xi), Ф(xj) ⟩

The kernel function defines a similarity measure between the objects xi and xj. It is not necessary to know the function Ф or the dimension of the high-dimensional feature space explicitly.

Page 40: Data Mining. Classification

Kernel Trick

Example: Transformation into a higher dimensional feature space

Ф : R² → R³ ,   Ф(x1, x2) = ( x1² , √2 x1 x2 , x2² )

Input: an element x of the training sample and a new object x̂.

The scalar product in the higher dimensional space (here: R³) can be evaluated in the low dimensional original space (here: R²):

⟨ Ф(x), Ф(x̂) ⟩ = ⟨ ( x1² , √2 x1 x2 , x2² ) , ( x̂1² , √2 x̂1 x̂2 , x̂2² ) ⟩

= x1² x̂1² + 2 x1 x̂1 x2 x̂2 + x2² x̂2²

= ( x1 x̂1 + x2 x̂2 )²

= ⟨ x , x̂ ⟩²  =  K( x , x̂ )
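A small numerical check of this identity (illustrative only; function names and test points are my own):

```python
# Numerical check of the kernel trick: <Phi(x), Phi(x_hat)> equals <x, x_hat>^2.
import numpy as np

def phi(x):
    # Explicit map R^2 -> R^3 from the example above.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, x_hat):
    # The same quantity, evaluated entirely in the original space R^2.
    return np.dot(x, x_hat) ** 2

x, x_hat = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(x_hat)))   # ~16.0 (up to floating point)
print(poly_kernel(x, x_hat))        # 16.0
```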

Page 41: Data Mining. Classification

It is not necessary to apply the nonlinear function Ф to transform

the set of training examples into a higher dimensional feature space. Use a kernel function instead of the scalar product in the original optimization problem and the decision problem.

K(xi, xj) = ⟨ Ф (xi), Ф(xj) ⟩

Kernel Trick

Page 42: Data Mining. Classification

Kernel Functions

Linear kernel:   K(xi, xj) = ⟨ xi, xj ⟩

Radial basis function kernel:   K(xi, xj) = exp( - ‖ xi - xj ‖² / ( 2 σ0² ) ) ,   with σ0² = mean ‖ xi - xj ‖²

Polynomial kernel:   K(xi, xj) = ( s ⟨ xi, xj ⟩ + c )^d

Sigmoid kernel:   K(xi, xj) = tanh( s ⟨ xi, xj ⟩ + c )

Convex combinations of kernels:   K(xi, xj) = c1 K1(xi, xj) + c2 K2(xi, xj)

Normalization kernel:   K(xi, xj) = K'(xi, xj) / √( K'(xi, xi) K'(xj, xj) ) ,   for a given kernel K'
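For reference, a hedged sketch of how such kernels are typically selected with scikit-learn's SVC; the parameter values are illustrative, and the custom kernel shows one way to realize the normalization kernel for the linear case:

```python
# Choosing kernels with scikit-learn's SVC (illustrative parameter values).
import numpy as np
from sklearn.svm import SVC

svm_linear  = SVC(kernel="linear")
svm_rbf     = SVC(kernel="rbf", gamma=0.5)             # gamma = 1 / (2 * sigma_0^2)
svm_poly    = SVC(kernel="poly", degree=3, coef0=1.0)  # (s * <xi, xj> + c)^d
svm_sigmoid = SVC(kernel="sigmoid", coef0=1.0)         # tanh(s * <xi, xj> + c)

# A custom kernel (here: the normalization kernel applied to the linear kernel)
# can be passed as a callable returning the Gram matrix.
def normalized_linear(X, Y):
    K = X @ Y.T
    kx = np.sqrt(np.einsum("ij,ij->i", X, X))
    ky = np.sqrt(np.einsum("ij,ij->i", Y, Y))
    return K / np.outer(kx, ky)

svm_custom = SVC(kernel=normalized_linear)
```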

Page 43: Data Mining. Classification

Summary

• Support vector machines can be used for binary classification.

• We can handle misclassified data if we introduce slack variables.

• If the sets to discriminate are not linearly separable we can use kernel functions.

• Applications → binary decisions

− Spam filter (spam / no spam)

− Face recognition (access / no access)

− Credit rating (good customer / bad customer)

Page 44: Data Mining. Classification

Literature

• N. Cristianini, J. Shawe-Taylor

An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, 2004.

• T. Hastie, R. Tibshirani, J. Friedman

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2011.

Page 45: Data Mining. Classification

Thank you very much!

