A Cluster Validity Measure With Outlier Detection for Support Vector Clustering

Post on 23-Feb-2016


Intelligent Database Systems Lab

N.Y.U.S.T. I.M.


Presenter: Lin, Shu-Han

Authors: Jeen-Shing Wang, Jen-Chieh Chiang

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS (2008)


Outline

  Introduction of SVC
  Motivation
  Objective
  Methodology
  Experiments
  Conclusion
  Comments

SVC

SVC is derived from SVMs. SVMs are a supervised classification technique with:
  Fast convergence
  Good generalization performance
  Robustness to noise

SVC is an unsupervised approach:

1. Data points are mapped to a high-dimensional feature space using a Gaussian kernel.

2. Look for the smallest sphere that encloses the data in that feature space.

3. Map the sphere back to data space, where it forms a set of contours.

4. The contours are treated as the cluster boundaries.

SVC - Sphere Analysis

To find the minimal enclosing sphere with soft margin:

To solve this problem, the Lagrangian function:
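For reference, the standard SVC formulation (Ben-Hur et al.) is usually written as below. The symbols are assumed from that formulation rather than taken from the slide: Φ is the feature map, R the sphere radius, a the sphere center, ξ_j the slack variables, C the soft-margin constant, and β_j, μ_j the Lagrange multipliers.

```latex
\min_{R,\,a,\,\xi}\; R^2 + C \sum_j \xi_j
\quad \text{s.t.} \quad
\|\Phi(x_j) - a\|^2 \le R^2 + \xi_j, \qquad \xi_j \ge 0

L = R^2 - \sum_j \bigl(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\bigr)\,\beta_j
    - \sum_j \xi_j \mu_j + C \sum_j \xi_j,
\qquad \beta_j \ge 0,\; \mu_j \ge 0
```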

SVC - Sphere Analysis

Karush-Kuhn-Tucker complementarity:

Bounded SV (BSV): outlier
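In the standard SVC derivation (assumed here, with the same symbols as the usual Lagrangian: slack ξ_j and multipliers β_j, μ_j), setting the derivatives of the Lagrangian to zero and applying the complementarity conditions gives the relations below. They explain the labels: a point with 0 < β_j < C lies on the sphere surface (a support vector), and a point with β_j = C lies outside it (a bounded support vector, treated as an outlier).

```latex
\sum_j \beta_j = 1, \qquad a = \sum_j \beta_j \Phi(x_j), \qquad \beta_j = C - \mu_j

\xi_j \mu_j = 0, \qquad
\bigl(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\bigr)\,\beta_j = 0
```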

SVC - Sphere Analysis

To find the minimal enclosing sphere with soft margin:

C: the soft-margin constant; it controls whether outliers are allowed to exist.

Wolfe dual optimization problem
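As a reference point, the Wolfe dual in the standard SVC formulation is commonly written as follows, with the feature-space inner products replaced by a kernel K. The notation β_j for the multipliers is assumed from that formulation.

```latex
\max_{\beta}\; W = \sum_j \beta_j K(x_j, x_j) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
\quad \text{s.t.} \quad
0 \le \beta_j \le C, \qquad \sum_j \beta_j = 1
```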

SVC - Sphere Analysis

The distance between x and a:
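In the standard SVC formulation (assumed here), this distance is expressed purely through kernel evaluations, using the expansion of the center a = Σ_j β_j Φ(x_j):

```latex
R^2(x) = \|\Phi(x) - a\|^2
       = K(x, x) - 2 \sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
```

The sphere radius is R = R(x_i) for any support vector x_i, and the cluster contours in data space are the points x with R(x) = R.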

Kernel: Gaussian (a Mercer kernel)

q: controls the number of clusters and the smoothness/tightness of the cluster boundaries.

Gaussian function:
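A minimal Python sketch of the Gaussian kernel, K(x, y) = exp(-q ||x - y||^2), and of the kernel-based distance to the sphere center, assuming the standard SVC expressions. The function names are illustrative, and the `beta` vector below is a made-up feasible example (non-negative, summing to 1), not the solution of the dual problem.

```python
import math

def gaussian_kernel(x, y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-q * sq_dist)

def squared_distance_to_center(x, points, beta, q):
    """R^2(x) = K(x,x) - 2 sum_j beta_j K(x_j,x) + sum_ij beta_i beta_j K(x_i,x_j)."""
    cross = sum(b * gaussian_kernel(p, x, q) for p, b in zip(points, beta))
    center_norm = sum(
        bi * bj * gaussian_kernel(pi, pj, q)
        for pi, bi in zip(points, beta)
        for pj, bj in zip(points, beta)
    )
    return gaussian_kernel(x, x, q) - 2.0 * cross + center_norm

# Toy data; beta is only a feasible placeholder, not a solved dual.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
beta = [0.5, 0.25, 0.25]
near = squared_distance_to_center((0.2, 0.2), points, beta, q=1.0)
far = squared_distance_to_center((5.0, 5.0), points, beta, q=1.0)
```

A point close to the data (`near`) ends up closer to the center in feature space than a distant one (`far`). A larger q makes the kernel narrower, which is why q drives the number and tightness of the contours.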

Motivation

Traditional cluster validity measures, such as the partition coefficient (PC) and separation measures, are based on fuzzy membership grades and the centroids of clusters.

The SVC algorithm generates cluster boundaries of arbitrary shape and provides no fuzzy membership grades, so these measures cannot be applied to it directly.

Which clustering is better?

Objectives

  Optimal cluster number
  Cluster validity measure
  Outlier-detection algorithm
  Cluster-merging mechanism

Methodology - Overview

Cluster Validity Measure for the SVC Algorithm

Outlier detection

Cluster-Merging Mechanism

C=1, no outliers are allowed

Methodology – Cluster Validity Measure for the SVC Algorithm

Compactness (intra-cluster)

Separation (inter-cluster)

Cluster validity measure for SVC: a ratio of the two quantities, to be minimized.
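To illustrate the idea of a compactness-to-separation ratio that is minimized, here is a small sketch. It does not reproduce the paper's definitions. As a stand-in it uses a Dunn-style ratio (largest cluster diameter divided by smallest gap between clusters), which needs no centroids or membership grades. All function names are hypothetical.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest pairwise distance inside one cluster (0 for a singleton)."""
    return max((math.dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)

def min_gap(c1, c2):
    """Smallest distance between a point of c1 and a point of c2."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def validity_ratio(clusters):
    """Compactness / separation; needs at least two clusters. Smaller is better."""
    compactness = max(diameter(c) for c in clusters)
    separation = min(min_gap(c1, c2) for c1, c2 in combinations(clusters, 2))
    return compactness / separation

# Two ways of splitting the same four points into two clusters.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
```

Evaluating `validity_ratio` for each candidate clustering and keeping the one with the smallest value mirrors the "min" selection on the slide, under the stated assumption about the measure.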

Methodology – Outlier Detection

In SVC, outliers (BSV) are the data in boundary regions.

[Figure: SVC results for q = 1, q = 2, q = 4, and for q = 1.8 with C = 0.02; a singleton cluster is marked.]

Methodology – Outlier Detection

C
  If C = 1, the resulting clusters are smooth, but not desirable.

BSVs (outliers)
  All outliers are SVs.
  Some outliers are far away from the other data in the clusters.

SVs
  Too many SVs make the boundaries fit the data too tightly.

q
  Increasing q makes the clusters more compact.

Singletons
  An important criterion.


Methodology – Outlier Detection

Outlier Existence Criterion

Desirable Cluster Criterion
  The number of singleton clusters cannot exceed a threshold.
  The percentage of data points that are SVs cannot exceed a threshold; 50% is suggested.
  C is adjusted recursively until these two criteria are satisfied.

Suggested γ = 2
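The desirable cluster criterion can be sketched as a simple check. The function name and the `max_singletons` parameter are hypothetical, since the singleton threshold itself is not given in the text. Only the 50% bound on the SV fraction comes from the slide.

```python
def is_desirable(cluster_sizes, n_points, n_support_vectors,
                 max_singletons, max_sv_fraction=0.5):
    """Return True when both parts of the desirable cluster criterion hold."""
    # Part 1: not too many clusters that contain a single data point.
    n_singletons = sum(1 for size in cluster_sizes if size == 1)
    # Part 2: support vectors make up at most the allowed share of the data.
    sv_fraction = n_support_vectors / n_points
    return n_singletons <= max_singletons and sv_fraction <= max_sv_fraction

# Hypothetical SVC outcomes on 100 points.
ok = is_desirable([60, 39, 1], n_points=100, n_support_vectors=20, max_singletons=1)
too_many_svs = is_desirable([60, 40], n_points=100, n_support_vectors=70, max_singletons=1)
too_many_singletons = is_desirable([97, 1, 1, 1], n_points=100, n_support_vectors=20,
                                   max_singletons=1)
```

In a tuning loop, a `False` result would trigger a smaller C followed by another SVC run.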

Methodology – Cluster-Merging Mechanism

Similarity: overlapping degree

Gaussian function:

[Figure: overlap examples labeled PC = 0 and PA > 0.]

Methodology – Cluster-Merging Mechanism

1) Agglomerative outliers/noises: identification

For all ci < ε, i = 1, . . . , K, where ε is density, chosen as 3%~5% {Set x ← mi. For each j, j ≠ i, perform pj(x), where pj ∈ [0, 1] is the normalized overlapping index of the j-th cluster. If pj(x) > 0, merge cluster i and cluster j. Otherwise, discard cluster i. Set K ← K − 1.}

2) Compatible clusters: combination (similarity)

Sort the sizes of the remaining K clusters in ascending order such that cK = max(ci), ∀i ∈ K. For each i, i = 1, . . . , K, perform {Set x ← mi. For each j, j = i + 1, . . . , K, perform pj(x). Find l = arg max_{i+1≤j≤K} pj(x), where arg max_a denotes the value of a at which the expression that follows is maximized. If pl > 0, merge cluster i with cluster l. Set K ← K − 1 and repeat 2) until no further combination.}
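The two merging steps can be sketched in Python as follows. Several parts are assumptions made for illustration: mi is taken to be the cluster mean, a sparse cluster is compared only against the non-sparse clusters and merged into the one with the largest overlap, and `linear_overlap` with its `reach` parameter is a hypothetical stand-in for the normalized overlapping index pj(x), whose exact definition is not given here.

```python
import math

def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def linear_overlap(cluster, x, reach=3.0):
    """Hypothetical p_j(x): 1 at the cluster mean, falling to 0 at distance `reach`."""
    return max(0.0, 1.0 - math.dist(mean(cluster), x) / reach)

def merge_clusters(clusters, overlap, n_points, eps=0.05):
    """clusters: list of lists of points; overlap(cluster, x) returns a value in [0, 1]."""
    clusters = [list(c) for c in clusters]
    # Step 1: sparse clusters (size < eps * N) are merged into the cluster that
    # overlaps their mean the most, or discarded as noise if nothing overlaps.
    small = [c for c in clusters if len(c) < eps * n_points]
    kept = [c for c in clusters if len(c) >= eps * n_points]
    for c in small:
        x = mean(c)
        scores = [overlap(other, x) for other in kept]
        if scores and max(scores) > 0:
            kept[scores.index(max(scores))].extend(c)
    # Step 2: repeatedly merge compatible clusters, smallest first.
    merged = True
    while merged and len(kept) > 1:
        merged = False
        kept.sort(key=len)
        for i in range(len(kept) - 1):
            x = mean(kept[i])
            scores = [overlap(kept[j], x) for j in range(i + 1, len(kept))]
            best = max(range(len(scores)), key=scores.__getitem__)
            if scores[best] > 0:
                kept[i + 1 + best].extend(kept[i])
                del kept[i]
                merged = True
                break  # sizes changed, so re-sort and start over
    return kept

# Toy data: A and B overlap, C is far away, D is isolated noise, E sits next to C.
a = [(0.1 * k, 0.0) for k in range(10)]
b = [(2.0 + 0.1 * k, 0.0) for k in range(10)]
c = [(20.0 + 0.1 * k, 0.0) for k in range(10)]
d = [(50.0, 50.0)]
e = [(20.5, 0.0)]
result = merge_clusters([a, b, c, d, e], linear_overlap, n_points=32)
sizes = sorted(len(cluster) for cluster in result)
```

With this toy data, D is discarded in step 1, E joins C, and step 2 combines A with B, leaving two clusters.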

Methodology – Summary

1) Initialize q to a small value, and set C = 1 and γ = 2.

2) Perform the SVC algorithm and get |clusters|.

3) If |clusters| < 2, increase q and go to 2).

4) If the outlier-detection criterion holds, decrease C, fix q, and go to 2). Otherwise, go to 5).

5) If |SVs| < 50% of the data points, go to 6). Otherwise, decrease C and go to 2).

6) Compute the validity measure index V(m).

7) If |clusters| < √N, increase q and go to 2). Otherwise, stop the SVC.

8) Use the cluster-merging mechanism to identify an ideal |clusters|. Output |clusters|.
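The control flow of steps 1) to 7) can be sketched as a loop around a user-supplied SVC routine. Everything named here is hypothetical: `run_svc`, `validity`, the step sizes, and the iteration guard. The sketch also assumes that q is swept upward while the cluster count is below √N, and it leaves out the merging of step 8).

```python
import math

def tune_svc(run_svc, validity, n_points, q=0.5, c=1.0,
             q_step=0.5, c_factor=0.5, max_iter=200):
    """Sweep q (shrinking C when needed); return {n_clusters: validity index}.

    run_svc(q, c) must return (n_clusters, n_svs, outliers_exist, clustering).
    """
    scores = {}
    for _ in range(max_iter):                          # guard against endless loops
        n_clusters, n_svs, outliers_exist, clustering = run_svc(q, c)  # step 2
        if n_clusters < 2:                             # step 3: too few clusters
            q += q_step
            continue
        if outliers_exist or n_svs >= 0.5 * n_points:  # steps 4 and 5
            c *= c_factor                              # q stays fixed
            continue
        scores[n_clusters] = validity(clustering)      # step 6
        if n_clusters < math.sqrt(n_points):           # step 7 (assumed direction)
            q += q_step
        else:
            break
    return scores

def fake_svc(q, c):
    """Stub standing in for a real SVC run, for demonstration only."""
    n_clusters = int(q)              # more clusters as q grows
    n_svs = 60 if c > 0.5 else 20    # too many SVs until C is reduced
    return n_clusters, n_svs, False, n_clusters

scores = tune_svc(fake_svc, validity=lambda k: abs(k - 3) + 1.0, n_points=100)
best = min(scores, key=scores.get)   # cluster number with the smallest index
```

The cluster number with the smallest validity index is the candidate that step 8) would then refine through merging.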

Experiments - Benchmark and Artificial Examples

Bensaid Data Set

Experiments - Benchmark and Artificial Examples

Five-Cluster Data Set & Five-Cluster Data Set With Noise

Experiments - Benchmark and Artificial Examples

Five-Cluster Data Set With Noise, after cluster merging

Experiments - Benchmark and Artificial Examples

Crescent Data Set

Experiments - IRIS Data Set

Misclassification


Conclusions

This paper integrated the following for SVC:
  Cluster validity measure
  Outlier detection
  Merging mechanism

It automatically determines suitable values for:
  Kernel parameter
  Soft-margin constant

Clustering with:
  Compact and smooth arbitrary-shaped cluster contours
  Increased robustness to outliers and noise


Comments

Advantage: provides a cluster validity index for a clustering method

Drawback: …

Application: SVC