Classification
Summer School
“Achievements and Applications of Contemporary Informatics,
Mathematics and Physics” (AACIMP 2011)
August 8-20, 2011, Kiev, Ukraine
Erik Kropat
University of the Bundeswehr Munich, Institute for Theoretical Computer Science,
Mathematics and Operations Research
Neubiberg, Germany
Examples
Clinical trials
In a clinical trial, 20 laboratory values of 10,000 patients are collected together with the diagnosis (ill / not ill).
We measure the values of a new patient.
Is he / she ill or not?
Credit ratings
An online shop collects data from its customers together with some information about the credit rating ( good customer / bad customer ).
We get the data of a new customer.
Is he / she a good customer or not?
Machine-Learning / Classification
Labeled training examples → Machine learning algorithm → Classification rule
New example → Classification rule → Predicted classification
k Nearest Neighbor Classification
kNN
k Nearest Neighbor Classification
Idea: Classify a new object with regard to a set of training examples.
Compare the new object with its k “nearest” objects (its “nearest neighbors”).
[Figure: training objects of class 1 and class 2, a new object, and its 4 nearest neighbors]
k Nearest Neighbor Classification
• Required
− Training set, i.e. objects and their class labels
− Distance measure
− The number k of nearest neighbors
[Figure: a new object and its 5 nearest neighbors among the training objects]
• Classification of a new object
− Calculate the distances between the new object and the objects of the training set.
− Identify the k nearest neighbors.
− Use the class label of the k nearest neighbors to determine the class of the new object (e.g. by majority vote).
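The three steps above can be sketched in plain Python. This is a minimal illustration; the function name `knn_classify` and the toy data points are invented for this example:

```python
import math
from collections import Counter

def knn_classify(train, new_obj, k):
    """Classify new_obj by majority vote among its k nearest training objects.

    train: list of (point, label) pairs; new_obj: a point (tuple of floats).
    """
    # 1. Calculate the distances between the new object and the training set.
    dists = [(math.dist(x, new_obj), label) for x, label in train]
    # 2. Identify the k nearest neighbors.
    dists.sort(key=lambda t: t[0])
    neighbors = dists[:k]
    # 3. Majority vote over the class labels of the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set with two classes
train = [((0.0, 0.0), "class 1"), ((0.2, 0.1), "class 1"),
         ((1.0, 1.0), "class 2"), ((1.1, 0.9), "class 2"), ((0.9, 1.2), "class 2")]
print(knn_classify(train, (0.1, 0.1), k=3))  # -> class 1
```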
k Nearest Neighbor Classification
[Figure: the same new object classified with its 1, 2, and 3 nearest neighbors; the predicted class label depends on k]
For k = 2 the vote can be tied; such a tie can be decided by distance.
1-nearest neighbor ⇒ the decision regions form a Voronoi diagram.
k Nearest Neighbor Classification
• The distance between the new object and the objects in the set of training samples is usually measured by the Euclidean metric or the squared Euclidean metric.
• In text mining, the Hamming distance is often used.
Distance
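The distance measures mentioned above can be written out directly. A short sketch; the function names are ours:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def hamming(x, y):
    # Number of positions in which two equal-length sequences differ
    return sum(a != b for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))          # -> 5.0
print(squared_euclidean((0, 0), (3, 4)))  # -> 25
print(hamming("karolin", "kathrin"))      # -> 3
```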
k Nearest Neighbor Classification
• The class label of the new object is determined from the list of its k nearest neighbors. This can be done by
− a majority vote with regard to the class labels of the k nearest neighbors, or
− a vote weighted by the distance of the k nearest neighbors.
Class label of the new object
k Nearest Neighbor Classification
• The value of k has a strong influence on the classification result.
− k too small: noise can have a strong influence.
− k too large: the neighborhood can contain objects from different classes (ambiguity / false classification).
[Figure: a large neighborhood around the new object contains objects from both classes]
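The sensitivity to k can be seen on a tiny 1-D example; the data values below are chosen purely for illustration:

```python
def knn_label(train, x_new, k):
    # train: list of (value, label); majority vote of the k nearest by distance
    nearest = sorted(train, key=lambda t: abs(t[0] - x_new))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

# A single "+" object near the new point, a cluster of "-" objects further away
train = [(1.0, "+"), (2.0, "-"), (2.1, "-"), (2.2, "-"), (2.3, "-")]
print(knn_label(train, 1.4, k=1))  # -> +  (nearest point dominates)
print(knn_label(train, 1.4, k=3))  # -> -  (two of three neighbors are "-")
```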
Support Vector Machines
Support Vector Machines
A set of training samples with objects in Rn is divided into two categories:
positive objects and negative objects.
Goal: “Learn” a decision rule from the training samples and assign a new example to the “positive” or the “negative” category.
Idea: Determine a separating hyperplane.
New objects are classified as
positive, if they are in the half space of positive examples
negative, if they are in the half space of negative examples.
Support Vector Machines
INPUT: Sample of training data T = { (x1, y1), ..., (xk, yk) }
with data points xi ∈ Rn and class labels yi ∈ {-1, +1}.
Data from patients with confirmed diagnosis
Laboratory values
Disease: Yes / No
Decision rule: f : Rn → {-1, +1}
INPUT: Laboratory values of a new patient
Decision: Disease: Yes / No
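Such a linear decision rule can be sketched in a few lines. The values of w, b and the sample points are made-up illustrations, not a trained model:

```python
def decide(w, b, x):
    # f(x) = +1 if x lies in the positive half space of the hyperplane, else -1
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

w, b = (1.0, 0.0), -1.0           # hyperplane: x1 = 1
print(decide(w, b, (2.0, 3.0)))   # -> 1
print(decide(w, b, (0.5, -2.0)))  # -> -1
```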
Support Vector Machines
Separating Hyperplane
A separating hyperplane H = { x ∈ Rn | ⟨ w, x ⟩ + b = 0 } is determined by
− a normal vector w and
− a parameter b
(⟨ ·, · ⟩ denotes the scalar product).
The offset of the hyperplane from the origin along w is b / ‖w‖.
Idea: Choose w and b such that the hyperplane separates the set of training samples in an optimal way.
What is a good separating hyperplane?
There exist many separating hyperplanes.
Depending on which one we choose, a new object near the boundary may be assigned to either class.
Question: What is the best separating hyperplane?
Answer: Choose the separating hyperplane so that the distance from it
to the nearest data point on each side is maximized.
[Figure: the maximum-margin hyperplane H, its margin, and the support vectors lying on the margin boundaries]
Scaling of Hyperplanes
• A hyperplane can be defined in many ways. For c ≠ 0:
{ x ∈ Rn | ⟨ w, x ⟩ + b = 0 } = { x ∈ Rn | ⟨ cw, x ⟩ + cb = 0 }
• Use the training samples to choose (w, b) such that
min over i of | ⟨ w, xi ⟩ + b | = 1 (canonical hyperplane)
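Canonical scaling can be checked numerically: rescale (w, b) by c = 1 / min_i |⟨ w, xi ⟩ + b|. A sketch with made-up sample points:

```python
def canonical(w, b, xs):
    # Rescale (w, b) so that min_i |<w, x_i> + b| = 1
    vals = [abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in xs]
    c = 1.0 / min(vals)
    return tuple(c * wi for wi in w), c * b

xs = [(3.0, 0.0), (0.0, 4.0), (-1.0, -1.0)]
w_c, b_c = canonical((2.0, 2.0), -2.0, xs)
vals = [abs(sum(wi * xi for wi, xi in zip(w_c, x)) + b_c) for x in xs]
print(w_c, b_c)   # -> (0.5, 0.5) -0.5
print(min(vals))  # -> 1.0  (the canonical normalization holds)
```

The rescaled pair describes the same hyperplane, only the representation changes.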
Definition
A training sample T = { (x1, y1), ..., (xk, yk) } with xi ∈ Rn, yi ∈ {-1, +1} is separable by the hyperplane H = { x ∈ Rn | ⟨ w, x ⟩ + b = 0 }, if there exist a vector w ∈ Rn and a parameter b ∈ R such that
⟨ w, xi ⟩ + b ≥ +1 , if yi = +1
⟨ w, xi ⟩ + b ≤ −1 , if yi = −1
for all i ∈ {1,...,k}.
[Figure: hyperplane H with normal vector w between the margin hyperplanes ⟨ w, x ⟩ + b = −1 and ⟨ w, x ⟩ + b = +1]
Maximal Margin
• The above conditions can be rewritten:
yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 for all i ∈ {1,...,k}
• Distance between the two margin hyperplanes: 2 / ‖w‖
⇒ In order to maximize the margin we must minimize ‖w‖.
[Figure: hyperplane H with normal vector w and the margin hyperplanes ⟨ w, x ⟩ + b = ±1]
Optimization problem
Find a normal vector w and a parameter b, such that the distance between
the training samples and the hyperplane defined by w and b is maximized:
Minimize  ½ ‖w‖²
s.t.  yi · ( ⟨ w, xi ⟩ + b ) ≥ 1  for all i ∈ {1,...,k}
⇒ quadratic programming problem
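For a tiny separable sample, a candidate (w, b) can be checked against the constraints by hand. The data and the candidate hyperplane below are made-up illustrations, not the output of a QP solver:

```python
import math

# Toy training set: two classes separated by the line x1 = 1
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((2.0, 0.0), +1), ((2.0, 1.0), +1)]

w, b = (1.0, 0.0), -1.0  # candidate separating hyperplane x1 = 1

# All constraints y_i * (<w, x_i> + b) >= 1 must hold for feasibility
margins = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in data]
print(all(m >= 1 for m in margins))  # -> True: (w, b) is feasible
print(2 / math.hypot(*w))            # -> 2.0: the margin width 2 / ||w||
```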
Dual Form
Find parameters α1, ..., αk such that
Max  Σ_{i=1..k} αi − ½ Σ_{i,j=1..k} αi αj yi yj ⟨ xi, xj ⟩
with αi ≥ 0 for all i = 1,...,k
and Σ_{i=1..k} αi yi = 0.
⇒ The maximal margin hyperplane (= the classification problem) is only a function of the support vectors.
The training data enter only through scalar products, which motivates the kernel function k(xi, xj) := ⟨ xi, xj ⟩.
Dual Form
• When the optimal parameters α1*, ..., αk* are known, the normal vector w* of the separating hyperplane is given by
w* = Σ_{i=1..k} αi* yi xi    (xi: training data)
• The parameter b* is given by
b* = − ½ ( max { ⟨ w*, xi ⟩ | yi = −1 } + min { ⟨ w*, xi ⟩ | yi = +1 } )
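On a two-point toy problem the dual solution can be written down by hand (here the dual objective reduces to 2α − 2α², maximized at α = 1/2), and w*, b* recovered with the formulas above. The data are invented for illustration:

```python
# Toy sample: x1 = (0, 0) with y1 = -1, x2 = (2, 0) with y2 = +1
xs = [(0.0, 0.0), (2.0, 0.0)]
ys = [-1, +1]
alphas = [0.5, 0.5]  # optimal dual variables, found by hand for this toy problem

# w* = sum_i alpha_i * y_i * x_i
w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(2)]

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
# b* = -1/2 * ( max{<w*, xi> : yi = -1} + min{<w*, xi> : yi = +1} )
b = -0.5 * (max(dot(w, x) for x, y in zip(xs, ys) if y == -1)
            + min(dot(w, x) for x, y in zip(xs, ys) if y == +1))

print(w, b)  # -> [1.0, 0.0] -1.0, i.e. the separating hyperplane x1 = 1
```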
Classifier
• A decision function f maps a new object x ∈ Rn to a category f(x) ∈ {-1, +1}:
f(x) = +1 , if ⟨ w*, x ⟩ + b* ≥ +1
f(x) = −1 , if ⟨ w*, x ⟩ + b* ≤ −1
[Figure: hyperplane H with normal vector w separating the −1 and the +1 half space]
Support Vector Machines
Soft Margins
Soft Margin Support Vector Machines
• Until now: Hard margin SVMs
The set of training samples can be separated by a hyperplane.
• Problem: Some elements of the training samples can have a false label.
Then the set of training samples cannot be separated by a hyperplane and the hard margin SVM is not applicable.
• Idea: Soft margin SVMs
Modified maximum margin method for mislabeled examples.
• Choose a hyperplane that splits the training set as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.
• Introduce slack variables ξ1, ..., ξn which measure the degree of misclassification.
Soft Margin Support Vector Machines
• Interpretation
The slack variables measure the degree of misclassification of the training examples with regard to a given hyperplane H.
[Figure: hyperplane H with slack variables ξi and ξj for two margin-violating points]
Soft Margin Support Vector Machines
• Replace the constraints
yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 for all i ∈ {1,...,n}
by
yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 − ξi for all i ∈ {1,...,n}
Soft Margin Support Vector Machines
• Idea
The value of ξi indicates the status of the training example xi:
ξi = 0 ⇔ xi is correctly classified.
0 < ξi < 1 ⇔ xi is correctly classified, but lies between the margins.
ξi ≥ 1 ⇔ xi is misclassified [ yi · ( ⟨ w, xi ⟩ + b ) ≤ 0 ]
Constraint: yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 − ξi for all i ∈ {1,...,n}
[Figure: hyperplane H with a slack variable ξi]
Soft Margin Support Vector Machines
• The sum of all slack variables, Σ_{i=1..n} ξi, is an upper bound for the total training error.
[Figure: hyperplane H with slack variables ξi and ξj]
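For a fixed (w, b), the smallest feasible slacks are ξi = max(0, 1 − yi(⟨ w, xi ⟩ + b)). The hyperplane and data below are invented to show the three cases:

```python
def slack(w, b, x, y):
    # Smallest xi satisfying y * (<w, x> + b) >= 1 - xi and xi >= 0
    return max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))

w, b = (1.0, 0.0), -1.0    # hyperplane x1 = 1
data = [((3.0, 0.0), +1),  # well inside the positive side: slack 0
        ((1.5, 0.0), +1),  # between the margins: 0 < slack < 1
        ((0.5, 0.0), +1)]  # misclassified: slack >= 1

slacks = [slack(w, b, x, y) for x, y in data]
errors = sum(1 for x, y in data
             if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 0)
print(slacks)                 # -> [0.0, 0.5, 1.5]
print(errors <= sum(slacks))  # -> True: the slack sum bounds the training error
```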
Soft Margin Support Vector Machines
Find a hyperplane with maximal margin and minimal training error:

Minimize  ½ ‖w‖² + C · Σ_{i=1..n} ξi
s.t.  yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 − ξi  for all i ∈ {1,...,n}
      ξi ≥ 0  for all i ∈ {1,...,n}

The constant C > 0 weights the error term (regularisation).
Support Vector Machines
Nonlinear Classifiers
Support Vector Machines: Nonlinear Separation
Question: Is it possible to create nonlinear classifiers?
Idea: Map data points into a higher dimensional feature space where a linear separation is possible.
[Figure: Ф maps data points from Rn into a higher dimensional feature space Rm where a linear separation is possible]
Support Vector Machines: Nonlinear Separation
Nonlinear Transformation Ф : Rn → Rm, from the original feature space into a high dimensional feature space.
Kernel Functions
Assume: For a given set X of training examples we know a function Ф,
such that a linear separation in the high-dimensional space is possible.
Observation: Once we have solved the corresponding optimization problem, we only need to evaluate scalar products to decide about the class label of a new data object:
f(xnew) = sign( Σ_{i=1..n} αi* yi ⟨ Ф(xi), Ф(xnew) ⟩ + b* ) ∈ {-1, +1}
Kernel functions
Introduce a kernel function K(xi, xj) = ⟨ Ф(xi), Ф(xj) ⟩.
The kernel function defines a similarity measure between the objects xi and xj.
It is not necessary to know the function Ф or the dimension of the feature space!
Kernel Trick
Example: Transformation into a higher dimensional feature space
Ф : R² → R³ ,  Ф(x1, x2) = ( x1² , √2 x1 x2 , x2² )
Input: an element x of the training sample and a new object x̂.
⟨ Ф(x), Ф(x̂) ⟩ = ⟨ ( x1² , √2 x1 x2 , x2² ) , ( x̂1² , √2 x̂1 x̂2 , x̂2² ) ⟩
= x1² x̂1² + 2 x1 x̂1 x2 x̂2 + x2² x̂2²
= ( x1 x̂1 + x2 x̂2 )²
= ⟨ x, x̂ ⟩²
= K( x, x̂ )
The scalar product in the higher dimensional space (here: R³) can be evaluated in the low dimensional original space (here: R²).
It is not necessary to apply the nonlinear function Ф to transform
the set of training examples into a higher dimensional feature space.
Use a kernel function K(xi, xj) = ⟨ Ф(xi), Ф(xj) ⟩ instead of the scalar product in the original optimization problem and the decision problem.
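The identity from the example can be verified numerically; the two points are chosen arbitrarily:

```python
import math

def phi(x):
    # Feature map Phi : R^2 -> R^3 from the example above
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, x_hat = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(x_hat))  # scalar product in R^3
rhs = dot(x, x_hat) ** 2       # kernel K(x, x_hat) = <x, x_hat>^2 in R^2
print(abs(lhs - rhs) < 1e-9)   # -> True (equal up to floating-point rounding)
```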
Kernel Functions
Linear kernel:  K(xi, xj) = ⟨ xi, xj ⟩
Polynomial kernel:  K(xi, xj) = ( s ⟨ xi, xj ⟩ + c )^d
Radial basis function kernel:  K(xi, xj) = exp( − ‖ xi − xj ‖² / (2 σ0²) ) ,  with σ0² = mean of ‖ xi − xj ‖²
Sigmoid kernel:  K(xi, xj) = tanh( s ⟨ xi, xj ⟩ + c )
Convex combinations of kernels:  K(xi, xj) = c1 K1(xi, xj) + c2 K2(xi, xj)
Normalization kernel:  K(xi, xj) = K̃(xi, xj) / √( K̃(xi, xi) K̃(xj, xj) ) , for an arbitrary kernel K̃
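The kernels listed above translate directly into code. A sketch in plain Python; the parameter defaults and test points are arbitrary choices:

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear(x, y):
    return dot(x, y)

def polynomial(x, y, s=1.0, c=1.0, d=2):
    return (s * dot(x, y) + c) ** d

def rbf(x, y, sigma2=1.0):
    # exp(-||x - y||^2 / (2 * sigma0^2))
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma2))

def sigmoid(x, y, s=1.0, c=0.0):
    return math.tanh(s * dot(x, y) + c)

def normalized(kernel, x, y):
    # K(x, y) / sqrt(K(x, x) * K(y, y)); equals 1 for x == y
    return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))

x, y = (1.0, 0.0), (0.0, 1.0)
print(linear(x, y))              # -> 0.0
print(polynomial(x, y))          # -> 1.0
print(rbf(x, x))                 # -> 1.0 (RBF kernel of a point with itself)
print(normalized(linear, x, x))  # -> 1.0
```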
Summary
• Support vector machines can be used for binary classification.
• We can handle misclassified data if we introduce slack variables.
• If the sets to discriminate are not linearly separable we can use kernel functions.
• Applications → binary decisions
− Spam filter (spam / no spam)
− Face recognition (access / no access)
− Credit rating (good customer / bad customer)
Literature
• N. Cristianini, J. Shawe-Taylor
An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.
Cambridge University Press, Cambridge, 2004.
• T. Hastie, R. Tibshirani, J. Friedman
The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Springer, New York, 2011.
Thank you very much!