Page 1:

CLUSTERING PROXIMITY MEASURES

By Çağrı Sarıgöz
Submitted to Assoc. Prof. Turgay İbrikçi

EE 639

Page 2:

Classification

Classifying has been one of the crucial thought activities of humankind. It makes it easier to perceive the outside world and act accordingly.

Aristotle’s Classification of Living Things is one of the most famous classification works, dating back to ancient times.

Page 3:

Cluster Analysis

Cluster analysis brings mathematical methodology to the solution of classification problems.

It deals with the classification or grouping of data into a set of categories or clusters. Data objects that are in the same cluster should be similar, and the ones that are in different clusters should be dissimilar in some context. It is generally a subjective matter to determine this context.

Page 4:

Approaching the Data Objects

Feature Types: Continuous, Discrete, Binary

Measurement Levels:
Qualitative: Nominal, Ordinal
Quantitative: Interval, Ratio

Page 5:

Feature Types

A continuous feature can take a value from an uncountably infinite range, e.g. the exact weight of a person.

A discrete feature, on the other hand, has a range of values that is finite or countably infinite, e.g. a person’s heart rate in bpm.

A binary feature is a special case of a discrete feature where there are only 2 values the feature can take, e.g. the presence or absence of tattoos on a person’s skin.

Page 6:

Measurement Levels: Qualitative

Features at the nominal level have no mathematical meaning; they are generally labels, states or names, e.g. the color of a car or the condition of the weather.

Features at the ordinal level are still just names, but with a certain order; the differences between the values are still meaningless in a mathematical sense, e.g. degrees of headache: none, slight, moderate, severe, unbearable.

Page 7:

Measurement Levels: Quantitative

At the interval level, the difference between feature values has a meaning, but there is no true zero in the range, i.e. the ratio between two values has no meaning, e.g. IQ score: a person with an IQ of 140 isn’t necessarily twice as intelligent as a person with an IQ of 70.

Features at the ratio level have all the properties of the other levels, plus a true zero, so that the ratio between two values has a mathematical meaning, e.g. the number of cars in a parking lot.

Page 8:

Definition of Proximity Measures: Dissimilarity (Distance)

A dissimilarity or distance function D on a data set X is defined to satisfy these conditions:
Symmetry: D(xi, xj) = D(xj, xi);
Positivity: D(xi, xj) ≥ 0 for all xi and xj.

It is called a dissimilarity metric if these conditions also hold:
Triangle inequality: D(xi, xj) ≤ D(xi, xk) + D(xk, xj) for all xi, xj and xk;
Reflexivity: D(xi, xj) = 0 iff xi = xj.

It is called a semimetric if reflexivity holds but the triangle inequality does not.

If the following condition also holds, it is called an ultrametric: D(xi, xj) ≤ max(D(xi, xk), D(xj, xk)) for all xi, xj and xk.
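As an illustration (not part of the original slides), the following Python sketch checks the metric conditions empirically for the Euclidean distance on random data; the data and tolerance are arbitrary.

```python
# Minimal sketch: empirically checking the metric conditions
# for the Euclidean distance on random data with NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 objects, d = 3 features

def euclidean(a, b):
    return np.linalg.norm(a - b)

for _ in range(1000):
    i, j, k = rng.integers(0, len(X), size=3)
    dij = euclidean(X[i], X[j])
    assert np.isclose(dij, euclidean(X[j], X[i]))                        # symmetry
    assert dij >= 0                                                      # positivity
    assert dij <= euclidean(X[i], X[k]) + euclidean(X[k], X[j]) + 1e-12  # triangle inequality
assert euclidean(X[0], X[0]) == 0                                        # reflexivity: D(x, x) = 0
print("Euclidean distance passed all sampled metric checks")
```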

Page 9:

Definition of Proximity Measures: Similarity

A similarity function S is defined to satisfy the following conditions:
Symmetry: S(xi, xj) = S(xj, xi);
Positivity: 0 ≤ S(xi, xj) ≤ 1, for all xi and xj.

It is called a similarity metric if the following additional conditions also hold:
S(xi, xj) S(xj, xk) ≤ [S(xi, xj) + S(xj, xk)] S(xi, xk) for all xi, xj and xk;
S(xi, xj) = 1 iff xi = xj.
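The slides do not give a concrete example here. As a hedged illustration, one common way to obtain a similarity in (0, 1] from a distance metric is S = 1 / (1 + D); the sketch below checks the similarity-metric conditions empirically for that choice on random data.

```python
# Minimal sketch: S = 1 / (1 + Euclidean distance) as a similarity,
# with an empirical check of the similarity-metric conditions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

def similarity(a, b):
    return 1.0 / (1.0 + np.linalg.norm(a - b))

for _ in range(1000):
    i, j, k = rng.integers(0, len(X), size=3)
    sij, sjk, sik = similarity(X[i], X[j]), similarity(X[j], X[k]), similarity(X[i], X[k])
    assert np.isclose(sij, similarity(X[j], X[i]))        # symmetry
    assert 0.0 <= sij <= 1.0                              # positivity
    assert sij * sjk <= (sij + sjk) * sik + 1e-12         # similarity-metric condition
assert similarity(X[0], X[0]) == 1.0                      # S(x, x) = 1
print("1 / (1 + Euclidean distance) passed all sampled similarity-metric checks")
```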

Page 10:

Proximity Measures for Continuous Variables

Euclidean distance (also known as the L2 norm):

D(xi, xj) = ( Σ_{l=1..d} (xil − xjl)² )^(1/2)

where xi and xj are d-dimensional data objects.

Euclidean distance is a metric, tending to form hyperspherical clusters. Clusters formed with Euclidean distance are also invariant to translations and rotations in the feature space.

Without normalizing the data, features with large values and variances will tend to dominate the other features. A commonly used remedy is data standardization, in which each feature is transformed to have zero mean and unit variance:

xil = (xil* − ml) / sl

where xil* represents the raw data, and the sample mean ml and sample standard deviation sl are defined as

ml = (1/N) Σ_{i=1..N} xil*   and   sl = ( (1/N) Σ_{i=1..N} (xil* − ml)² )^(1/2)

respectively.
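A minimal Python sketch (not from the slides) of standardization followed by the Euclidean distance; the toy weight/height data are made up for illustration.

```python
# Minimal sketch: z-score standardization followed by the Euclidean (L2) distance.
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    m = X.mean(axis=0)                 # sample mean m_l of each feature
    s = X.std(axis=0)                  # sample standard deviation s_l
    return (X - m) / s

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))

# Toy data: weight in kg and height in mm -- very different scales.
X = np.array([[70.0, 1750.0],
              [80.0, 1820.0],
              [60.0, 1650.0]])
Z = standardize(X)
print(euclidean(X[0], X[1]))   # dominated by the large-valued feature
print(euclidean(Z[0], Z[1]))   # both features contribute comparably
```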

Page 11:

Proximity Measures for Continuous Variables

Another normalization approach can also be used. The Euclidean distance can be generalized as a special case of a family of metrics, called the Minkowski distance or Lp norm, defined as:

D(xi, xj) = ( Σ_{l=1..d} |xil − xjl|^p )^(1/p)

When p = 2, the distance becomes the Euclidean distance.

p = 1: the city-block (Manhattan) distance or L1 norm, D(xi, xj) = Σ_{l=1..d} |xil − xjl|

p → ∞: the sup distance or L∞ norm, D(xi, xj) = max_{1≤l≤d} |xil − xjl|
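A short sketch (illustration only) of the Minkowski distance and its special cases:

```python
# Minimal sketch: the Minkowski (Lp) distance for p = 1, p = 2 and p -> infinity.
import numpy as np

def minkowski(xi, xj, p):
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])

print(minkowski(xi, xj, 1))              # city-block / Manhattan (L1): 5.0
print(minkowski(xi, xj, 2))              # Euclidean (L2): ~3.606
print(np.max(np.abs(xi - xj)))           # sup distance (L-infinity): 3.0
```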

Page 12:

Proximity Measures for Continuous Variables

The Mahalanobis distance is another metric; in practice its squared form is usually computed:

D²(xi, xj) = (xi − xj)ᵀ S⁻¹ (xi − xj)

where S is the within-class covariance matrix, defined as S = E[(x − μ)(x − μ)ᵀ], μ is the mean vector and E[·] denotes the expected value of a random variable.

The Mahalanobis distance tends to form hyperellipsoidal clusters, which are invariant to any nonsingular linear transformation of the data.

The calculation of the inverse of S may cause a considerable computational burden for large-scale data.

When the features are uncorrelated and standardized to unit variance, S reduces to the identity matrix and the Mahalanobis distance becomes the Euclidean distance.
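A hedged Python sketch (not from the slides) of the squared Mahalanobis distance, estimating S from sample data with np.cov:

```python
# Minimal sketch: squared Mahalanobis distance, with S estimated
# from the data by the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.0], [1.0, 2.0]], size=200)

S = np.cov(X, rowvar=False)            # estimate of the covariance matrix S
S_inv = np.linalg.inv(S)               # may be costly / ill-conditioned for large d

def mahalanobis_sq(xi, xj, S_inv):
    d = xi - xj
    return float(d @ S_inv @ d)

print(mahalanobis_sq(X[0], X[1], S_inv))
# With uncorrelated, unit-variance features S is the identity matrix and the
# squared Mahalanobis distance equals the squared Euclidean distance:
print(mahalanobis_sq(X[0], X[1], np.eye(2)), np.sum((X[0] - X[1]) ** 2))
```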

Page 13:

Proximity Measures for Continuous Variables

The point symmetry distance is based on the assumption that the cluster’s structure is symmetric:

D(xi, xr) = min_{j=1..N, j≠i} ||(xi − xr) + (xj − xr)|| / (||xi − xr|| + ||xj − xr||)

where xr is a reference point (e.g. the centroid of the cluster) and ||·|| represents the Euclidean norm.

It measures the distance between an object xi and the reference point xr, given the other N − 1 objects, and is minimized (approaching zero) when a symmetric counterpart of xi with respect to xr exists among them.
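A minimal sketch (illustration only, assuming the formula above) of the point symmetry distance:

```python
# Minimal sketch: point symmetry distance of an object with respect to a
# reference point (e.g. the cluster centroid).
import numpy as np

def point_symmetry_distance(X, i, xr):
    """D(x_i, x_r) = min_{j != i} ||(x_i - x_r) + (x_j - x_r)|| /
                                   (||x_i - x_r|| + ||x_j - x_r||)"""
    di = X[i] - xr
    best = np.inf
    for j in range(len(X)):
        if j == i:
            continue
        dj = X[j] - xr
        best = min(best, np.linalg.norm(di + dj) / (np.linalg.norm(di) + np.linalg.norm(dj)))
    return best

# Points lying symmetrically around the origin give a distance of (near) zero.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 0.5]])
print(point_symmetry_distance(X, 0, xr=np.zeros(2)))   # ~0: x_1 mirrors x_0
print(point_symmetry_distance(X, 2, xr=np.zeros(2)))   # > 0: no mirror for x_2
```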

Page 14:

Proximity Measures for Continuous Variables

The distance measure can also be derived from a correlation coefficient, such as the Pearson correlation coefficient, defined as

r(xi, xj) = Σ_{l=1..d} (xil − x̄i)(xjl − x̄j) / ( (Σ_{l=1..d} (xil − x̄i)²)^(1/2) (Σ_{l=1..d} (xjl − x̄j)²)^(1/2) )

where x̄i and x̄j are the means of the features of xi and xj.

The correlation coefficient is in the range [−1, 1], with −1 and 1 indicating the strongest negative and positive correlation, respectively. So we can define the distance measure as D(xi, xj) = (1 − r(xi, xj)) / 2, which is in the range [0, 1].

Features should be measured on the same scale; otherwise the mean and variance used in computing the Pearson correlation coefficient have no meaning.
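A small sketch (not from the slides) of the Pearson-based distance between two objects:

```python
# Minimal sketch: Pearson correlation between two data objects
# (across their features) and the derived distance (1 - r) / 2.
import numpy as np

def pearson_distance(xi, xj):
    a, b = xi - xi.mean(), xj - xj.mean()
    r = np.sum(a * b) / (np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2)))
    return (1.0 - r) / 2.0             # in [0, 1]

xi = np.array([1.0, 2.0, 3.0, 4.0])
xj = np.array([2.0, 4.0, 6.0, 8.0])    # same "shape", different magnitude
xk = np.array([4.0, 3.0, 2.0, 1.0])    # reversed shape

print(pearson_distance(xi, xj))        # 0.0: perfectly positively correlated
print(pearson_distance(xi, xk))        # 1.0: perfectly negatively correlated
```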

Page 15:

Proximity Measures for Continuous Variables

Cosine similarity is an example of a similarity measure that can be used to compare a pair of data objects with continuous variables:

S(xi, xj) = xiᵀxj / (||xi|| ||xj||)

which can be turned into a distance measure simply by using D(xi, xj) = 1 − S(xi, xj).

Like the Pearson correlation coefficient, cosine similarity provides no information on the magnitude of the differences between objects.
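A short sketch (illustration only) showing that cosine similarity ignores magnitude:

```python
# Minimal sketch: cosine similarity and the derived distance 1 - S.
import numpy as np

def cosine_similarity(xi, xj):
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([10.0, 20.0, 30.0])      # same direction, 10x the magnitude
xk = np.array([3.0, 0.0, 1.0])

print(cosine_similarity(xi, xj))       # 1.0 -> distance 0, despite the scale gap
print(1.0 - cosine_similarity(xi, xk)) # distance for a genuinely different direction
```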

Page 16:

Examples and Applications of the Proximity Measures for Continuous Variables

Page 17:

Proximity Measures for Discrete Variables: Binary Variables

Invariant similarity measures for symmetric binary variables:

A 1-1 match and a 0-0 match of the variables are regarded as equally important. The unmatched pairs are weighted according to their contribution to the similarity. The simplest example is the simple matching coefficient,

S(xi, xj) = (n11 + n00) / (n11 + n10 + n01 + n00)

where nab is the number of features for which xi takes value a and xj takes value b.

For the simple matching coefficient, the corresponding dissimilarity measure obtained from D(xi, xj) = 1 − S(xi, xj) is known as the (normalized) Hamming distance.
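A minimal sketch (not from the slides) of the simple matching coefficient and the resulting normalized Hamming distance:

```python
# Minimal sketch: simple matching coefficient for symmetric binary features
# and its complement, the normalized Hamming distance.
import numpy as np

def simple_matching(xi, xj):
    xi, xj = np.asarray(xi, bool), np.asarray(xj, bool)
    n11 = np.sum(xi & xj)
    n00 = np.sum(~xi & ~xj)
    return (n11 + n00) / xi.size

xi = [1, 0, 1, 1, 0, 0]
xj = [1, 0, 0, 1, 0, 1]

s = simple_matching(xi, xj)
print(s)          # 4 matches out of 6 features -> 0.666...
print(1 - s)      # normalized Hamming distance -> 0.333...
```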

Page 18:

Proximity Measures for Discrete Variables: Binary Variables

Non-invariant similarity measures for asymmetric binary variables:

These measures focus on 1-1 matches while ignoring 0-0 matches, which are considered uninformative. A common example is the Jaccard coefficient, S(xi, xj) = n11 / (n11 + n10 + n01).

Again, the unmatched pairs are weighted depending on their importance.
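As a hedged illustration, the Jaccard coefficient in the sketch below is one common instance of such a measure (the original slide may list others):

```python
# Minimal sketch: Jaccard coefficient, a non-invariant similarity for
# asymmetric binary features (0-0 matches are ignored).
import numpy as np

def jaccard(xi, xj):
    xi, xj = np.asarray(xi, bool), np.asarray(xj, bool)
    n11 = np.sum(xi & xj)
    n10 = np.sum(xi & ~xj)
    n01 = np.sum(~xi & xj)
    return n11 / (n11 + n10 + n01)

# e.g. "symptom present / absent": shared absences say little about similarity
xi = [1, 0, 1, 0, 0, 0]
xj = [1, 0, 0, 0, 0, 1]
print(jaccard(xi, xj))   # 1 / (1 + 1 + 1) = 0.333...
```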

Page 19:

Proximity Measures for Discrete Variables with More than Two Values

One simple and direct approach is to map each such variable into a set of new binary features. This is simple, but it may introduce too many binary variables.

A more effective and commonly used method is based on a matching criterion. For a pair of d-dimensional objects xi and xj, the similarity using the simple matching criterion is given as

S(xi, xj) = (1/d) Σ_{l=1..d} Sijl

where Sijl = 1 if xil = xjl, and Sijl = 0 otherwise.
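A minimal sketch (not from the slides) of the simple matching criterion for nominal features:

```python
# Minimal sketch: simple matching similarity for categorical (nominal)
# features with more than two values.
import numpy as np

def simple_matching_categorical(xi, xj):
    xi, xj = np.asarray(xi), np.asarray(xj)
    return np.mean(xi == xj)           # (1/d) * number of matching features

xi = ["red",  "sedan", "diesel", "manual"]
xj = ["blue", "sedan", "diesel", "auto"]
print(simple_matching_categorical(xi, xj))   # 2 matches out of 4 -> 0.5
```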

Page 20:

Proximity Measures for Discrete Variables with More than Two Values

Categorical features may display a certain order; these are known as ordinal features. In this case the codes from 1 to Ml, where Ml is the highest level, are no longer meaningless in similarity measures: the closer two levels are, the more similar the two objects are in that feature.

Objects with this type of feature can therefore be compared using the continuous dissimilarity measures. Since the number of possible levels varies across features, the original ranks ril* of the ith object in the lth feature are usually converted into new ranks ril in the range [0, 1], using

ril = (ril* − 1) / (Ml − 1)

Then the city-block or Euclidean distance can be used.
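A short sketch (illustration only) of the rank conversion followed by the city-block distance; the feature names and level counts are made up:

```python
# Minimal sketch: mapping ordinal ranks 1..M_l to [0, 1] and then using
# the city-block (L1) distance.
import numpy as np

def rescale_ranks(R, M):
    """R: (n_objects, d) integer ranks in 1..M_l; M: (d,) number of levels."""
    R, M = np.asarray(R, float), np.asarray(M, float)
    return (R - 1.0) / (M - 1.0)

# Headache severity (5 levels) and education level (4 levels), as ranks.
R = np.array([[1, 4],
              [5, 2],
              [2, 4]])
Z = rescale_ranks(R, M=[5, 4])
print(Z)
print(np.sum(np.abs(Z[0] - Z[2])))   # city-block distance between objects 0 and 2
```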

Page 21:

Proximity Measures for Mixed Variables

The similarity measure for a pair of d-dimensional mixed data objects xi and xj can be defined as

S(xi, xj) = Σ_{l=1..d} δijl Sijl / Σ_{l=1..d} δijl

where Sijl indicates the similarity for the lth feature between the two objects, and δijl is a 0-1 coefficient that is 0 when the measurement of the lth feature is missing for either object, and 1 otherwise.

Correspondingly, the dissimilarity measure can be obtained simply as D(xi, xj) = 1 − S(xi, xj).

The component similarity for discrete variables is Sijl = 1 if xil = xjl and Sijl = 0 otherwise. For continuous variables,

Sijl = 1 − |xil − xjl| / Rl

where Rl is the range of the lth variable over all objects: Rl = max_i xil − min_i xil.
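A hedged Python sketch of a Gower-style mixed similarity along these lines; the missing-value handling and the toy data are assumptions for illustration:

```python
# Minimal sketch: mixed (Gower-style) similarity, with NaN marking missing
# continuous values and None marking missing categories.
import numpy as np

def mixed_similarity(xi, xj, kinds, ranges):
    """kinds[l] is 'num' or 'cat'; ranges[l] is R_l for numeric features."""
    num, den = 0.0, 0.0
    for l, kind in enumerate(kinds):
        a, b = xi[l], xj[l]
        if a is None or b is None or (kind == "num" and (np.isnan(a) or np.isnan(b))):
            continue                               # delta_ijl = 0: skip missing
        if kind == "num":
            s = 1.0 - abs(a - b) / ranges[l]       # continuous component
        else:
            s = 1.0 if a == b else 0.0             # discrete component
        num += s
        den += 1.0                                 # delta_ijl = 1
    return num / den

kinds  = ["num", "cat", "num"]
ranges = [50.0, None, 10.0]                        # R_l over all objects (assumed)
xi = [70.0, "red", np.nan]                         # third measurement missing
xj = [80.0, "red", 4.0]
print(mixed_similarity(xi, xj, kinds, ranges))     # (0.8 + 1.0) / 2 = 0.9
```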

Page 22:

Questions?

