Wine Clustering

Ling Lin

Contents

❏ Motivation
❏ Data
❏ Dimensionality Reduction - MDS, Isomap
❏ Clustering - Kmeans, Ncut, Ratio Cut, SCC
❏ Conclusion
❏ Reference

Motivation

• Clustering is a main task of exploratory data mining:

Market segmentation and marketing strategies
Document clustering
Targeting appropriate treatment to patients with similar response patterns
Image segmentation

• Apply clustering methods to a real data set

Data

➢ Wine data

Source of the data set: "Machine Learning Repository", University of California, Irvine.

Sample size: 178 observations of 13 measured variables plus a class label, in 3 classes: different cultivars.

Variables:

1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
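As a side note (not part of the original slides), the same UCI Wine data can be loaded through scikit-learn, which bundles a copy of the dataset; a minimal sketch:

```python
# Minimal sketch: load the UCI Wine data via scikit-learn's bundled copy.
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data      # 178 observations x 13 chemical measurements
y = wine.target    # cultivar labels: 0, 1, 2

print(X.shape)               # (178, 13)
print(wine.feature_names)    # alcohol, malic_acid, ..., proline
```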

MDS

Can I separate the objects better? ---> change the way the distances are computed

• City-block (L1) Distance
• Chebychev Distance
• Cosine Distance
• Mahalanobis Distance

Distances

• Euclidean Distance - straight-line distance between two points.

• City-block Distance (L1 Distance) - sum of the absolute differences of the two points' coordinates over all dimensions.

• Chebychev Distance (Chessboard Distance) - the greatest absolute difference of the two points' coordinates over all dimensions.

• Cosine Distance - one minus the cosine of the angle between the two vectors.

• Mahalanobis Distance - the dissimilarity of two vectors, scaled by the covariance matrix S: d(x, y) = sqrt( (x − y)' S⁻¹ (x − y) ).

Example: two points forming a right triangle with legs a and b (a ≥ b) and hypotenuse c:

Euclidean Distance = c

City-block Distance = a + b

Chebychev Distance = max(a, b) = a

Cosine Distance = 1 − cos(θ), where θ is the angle between the two position vectors
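As a rough illustration of these definitions (not the slides' own computation), the sketch below evaluates each distance with SciPy on two made-up points; the sample used to estimate the covariance for the Mahalanobis distance is arbitrary.

```python
# Illustrative only: the five distances above, computed with SciPy.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(x, y))   # straight-line distance
print(distance.cityblock(x, y))   # L1: sum of absolute coordinate differences
print(distance.chebyshev(x, y))   # largest absolute coordinate difference
print(distance.cosine(x, y))      # 1 - cos(angle between x and y)

# Mahalanobis needs the inverse covariance matrix S^{-1}; here it is
# estimated from an arbitrary random sample, purely for illustration.
sample = np.random.default_rng(0).normal(size=(50, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```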

MDS in 3D

MDS in 2D
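One way to reproduce embeddings of this kind (an assumed workflow, since the slides show no code) is metric MDS on a precomputed distance matrix, which makes it easy to swap in any of the distances above; the slides may instead have used classical MDS, e.g. MATLAB's cmdscale, so the plots will differ in detail.

```python
# Sketch: metric MDS (scikit-learn's SMACOF) on a precomputed distance matrix.
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_wine
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Any metric from the list above can be used here, e.g. "cosine",
# "cityblock", "chebyshev" or "mahalanobis".
D = squareform(pdist(X_std, metric="cosine"))

mds_2d = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
mds_3d = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
emb_2d = mds_2d.fit_transform(D)
emb_3d = mds_3d.fit_transform(D)
```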

Isomap

[Isomap embeddings with the Cosine and Mahalanobis distances]
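A corresponding Isomap sketch, assuming scikit-learn's implementation and an arbitrary neighbourhood size of 10 (the slides do not state which implementation or parameters were used):

```python
# Sketch: Isomap embeddings with cosine and Mahalanobis neighbourhood graphs.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.manifold import Isomap
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
VI = np.linalg.inv(np.cov(X_std, rowvar=False))  # inverse covariance for Mahalanobis

emb_cos = Isomap(n_neighbors=10, n_components=2,
                 metric="cosine").fit_transform(X_std)
emb_mah = Isomap(n_neighbors=10, n_components=2, metric="mahalanobis",
                 metric_params={"VI": VI}).fit_transform(X_std)
```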

Kmeans Clustering

Error rate = 0.03

[Plots: true labels vs. Kmeans clustering]
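The slides do not show how the 0.03 error rate was obtained; one plausible reconstruction (an assumption, not the author's code) is to run k-means with k = 3 and match cluster ids to the cultivar labels with an optimal assignment:

```python
# Sketch: k-means on the standardized Wine data plus a simple error-rate check.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

# Match cluster ids to true labels by maximizing the confusion-matrix diagonal.
C = np.zeros((3, 3), dtype=int)
for p, t in zip(pred, wine.target):
    C[p, t] += 1
rows, cols = linear_sum_assignment(-C)
error_rate = 1 - C[rows, cols].sum() / len(wine.target)
print(round(error_rate, 3))
```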

Clustering Comparison

[Plots: Normalized Cut, Ratio Cut, SCC]
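The normalized cut can be approximated with scikit-learn's SpectralClustering, which optimizes an Ncut-style objective on a nearest-neighbour graph; the slides' exact graph construction, the ratio-cut variant, and SCC are not shown, so the sketch below is only a stand-in for the Ncut column.

```python
# Sketch: an Ncut-style spectral clustering of the standardized Wine data.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
ncut_labels = SpectralClustering(n_clusters=3,
                                 affinity="nearest_neighbors", n_neighbors=10,
                                 assign_labels="kmeans",
                                 random_state=0).fit_predict(X_std)
```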

Conclusion

• Dimensionality Reduction:

Different methods for calculating distances and reducing dimension ---> Wine data

3D MDS: Cosine Distance ✓, Mahalanobis ✗
2D MDS: Cosine Distance ✓, Mahalanobis ✗

Isomap makes the Mahalanobis distance a better display.

Conclusion

• Clustering:

Kmeans = Rcut → SCC → Ncut

Ncut and Rcut consider both inter- and intra-cluster connections; however, in this dataset the intra-cluster connections are weak.