UNIT-V

CLUSTER ANALYSIS

1. Cluster: a collection of data objects
   a. Similar to one another within the same cluster
   b. Dissimilar to the objects in other clusters
2. Cluster analysis
   a. Grouping a set of data objects into clusters
3. Clustering is unsupervised classification: no predefined classes
4. Typical applications
   a. As a stand-alone tool to get insight into data distribution
   b. As a preprocessing step for other algorithms

APPLICATIONS OF CLUSTERING

1. Pattern Recognition
2. Spatial Data Analysis
   a. Create thematic maps in GIS by clustering feature spaces
   b. Detect spatial clusters and explain them in spatial data mining
3. Image Processing
4. Economic Science (especially market research)
5. WWW
   a. Document classification
   b. Cluster Weblog data to discover groups of similar access patterns
6. Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
7. Land use: Identification of areas of similar land use in an earth observation database
8. Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
9. City-planning: Identifying groups of houses according to their house type, value, and geographical location
10. Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

REQUIREMENTS OF CLUSTERING

1. Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results.
2. Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
3. Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density.
4. Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters.
5. Ability to deal with noise and outliers: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
6. Insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data.
7. High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling only low-dimensional data, involving only two to three dimensions.
8. Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city; such a task calls for clustering under constraints (e.g., the city's rivers and highway networks).
9. Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications.

TYPES OF DATA IN CLUSTER ANALYSIS

1. Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)
2. There is a separate quality function that measures the goodness of a cluster.
3. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
4. Weights should be associated with different variables based on applications and data semantics.
5. It is hard to define "similar enough" or "good enough"
   a. The answer is typically highly subjective.

TYPES OF DATA IN CLUSTERING ANALYSIS

1. Interval-scaled variables
2. Binary variables
3. Nominal, ordinal, and ratio variables
4. Variables of mixed types

INTERVAL-SCALED VARIABLES

1) Standardize data
   a) Calculate the mean absolute deviation:

      $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$

      where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$ is the mean of variable f.

2) Calculate the standardized measurement (z-score):

      $z_{if} = \frac{x_{if} - m_f}{s_f}$

Using the mean absolute deviation is more robust than using the standard deviation.
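As a minimal sketch (assuming NumPy; the function name and sample data are illustrative, not from the source), the standardization looks like this in Python:

import numpy as np

def standardize(column):
    """Standardize one interval-scaled variable using the
    mean absolute deviation rather than the standard deviation."""
    x = np.asarray(column, dtype=float)
    m_f = x.mean()                    # mean of variable f
    s_f = np.abs(x - m_f).mean()      # mean absolute deviation
    return (x - m_f) / s_f            # z-scores

# Example: an outlier (250) inflates s_f less than a standard deviation would
print(standardize([30, 36, 40, 44, 250]))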

Similarity and Dissimilarity between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.

1) Some popular ones include the Minkowski distance:

   $d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$

   where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.

2) If q = 1, d is the Manhattan distance:

   $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

3) If q = 2, d is the Euclidean distance:

   $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
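All three distances follow from one Minkowski routine; a minimal Python sketch (assuming NumPy; the helper name and sample vectors are illustrative):

import numpy as np

def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional data objects i and j."""
    i, j = np.asarray(i, dtype=float), np.asarray(j, dtype=float)
    return np.sum(np.abs(i - j) ** q) ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, q=1))  # Manhattan: 7.0
print(minkowski(x, y, q=2))  # Euclidean: 5.0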

Properties

i) d(i,j) ≥ 0: Distance is a nonnegative number.
ii) d(i,i) = 0: The distance of an object to itself is 0.
iii) d(i,j) = d(j,i): Distance is a symmetric function.
iv) d(i,j) ≤ d(i,k) + d(k,j): Distance satisfies the triangle inequality.

BINARY VARIABLES

A contingency table for binary data:

                  Object j
                 1      0      sum
          1      a      b      a+b
Object i  0      c      d      c+d
          sum    a+c    b+d     p

Simple matching coefficient (invariant, if the binary variable is symmetric):

   $d(i,j) = \frac{b + c}{a + b + c + d}$

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

   $d(i,j) = \frac{b + c}{a + b + c}$

DISSIMILARITY BETWEEN BINARY VARIABLES

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

1. Gender is a symmetric attribute.
2. The remaining attributes are asymmetric binary.
3. Let the values Y and P be set to 1, and the value N be set to 0. Using the Jaccard coefficient:

   $d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

   $d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

   $d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
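A small Python sketch reproducing these values (the records are transcribed from the table above; the helper name is illustrative):

def jaccard_dissimilarity(u, v):
    """Jaccard dissimilarity (b + c) / (a + b + c) for asymmetric binary vectors."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Asymmetric attributes only (Fever, Cough, Test-1..Test-4); Y/P -> 1, N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75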

CATEGORICAL VARIABLES

1. A generalization of the binary variable in that it can take more than 2 states,
   a. e.g., red, yellow, blue, green
2. Method 1: Simple matching (see the sketch after this list)
   a. m: # of matches, p: total # of variables

      $d(i,j) = \frac{p - m}{p}$

3. Method 2: Use a large number of binary variables
   a. Creating a new binary variable for each of the M nominal states
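A minimal Python sketch of Method 1 (the function name and the sample objects are illustrative):

def simple_matching(u, v):
    """Dissimilarity (p - m) / p between two objects with categorical attributes."""
    p = len(u)                                  # total number of variables
    m = sum(1 for x, y in zip(u, v) if x == y)  # number of matches
    return (p - m) / p

print(simple_matching(["red", "small", "round"],
                      ["red", "large", "round"]))  # (3 - 2) / 3 = 0.33...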

ORDINAL VARIABLES

1) An ordinal variable can be discrete or continuous.
2) Order is important, e.g., rank.
3) It can be treated like an interval-scaled variable:
   a) Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
   b) Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

      $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

   c) Compute the dissimilarity using methods for interval-scaled variables
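A short Python sketch of this rank mapping (assuming NumPy; the function name and the ordered states are illustrative):

import numpy as np

def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1] via z = (r - 1) / (M - 1)."""
    rank = {state: r for r, state in enumerate(order, start=1)}  # r_if in 1..M_f
    M = len(order)
    return np.array([(rank[v] - 1) / (M - 1) for v in values])

print(ordinal_to_interval(["fair", "good", "excellent"],
                          order=["fair", "good", "excellent"]))  # [0.  0.5 1. ]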

RATIO-SCALED VARIABLES

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$.

Methods:
1) Treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
2) Apply a logarithmic transformation: $y_{if} = \log(x_{if})$
3) Treat them as continuous ordinal data and treat their rank as interval-scaled.

VARIABLES OF MIXED TYPES

A database may contain all six types of variables:

   symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.

One may use a weighted formula to combine their effects:

   $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

1) If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise.
2) If f is interval-based: use the normalized distance.
3) If f is ordinal or ratio-scaled: compute ranks $r_{if}$ and

      $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

   and treat $z_{if}$ as interval-scaled.
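A Python sketch of the weighted combination for one pair of objects (assuming no missing values, so every indicator $\delta_{ij}^{(f)}$ is 1; the record layout and helper names are illustrative):

def mixed_dissimilarity(i, j, types):
    """Combine per-variable dissimilarities d_ij(f), weighted by the
    indicator delta_ij(f) (always 1 here: no missing values assumed)."""
    num = den = 0.0
    for x, y, t in zip(i, j, types):
        if t == "nominal":               # also covers symmetric binary
            d = 0.0 if x == y else 1.0
        elif t == "interval":            # x, y pre-normalized to [0, 1]
            d = abs(x - y)
        else:
            raise ValueError("unsupported type: " + t)
        num += d                         # delta_ij(f) = 1 for every f
        den += 1.0
    return num / den

# One nominal attribute and two normalized interval attributes
print(mixed_dissimilarity(["red", 0.2, 0.9], ["blue", 0.5, 0.9],
                          ["nominal", "interval", "interval"]))  # 0.433...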

A CATEGORIZATION OF MAJOR CLUSTERING METHODS

1. Partitioning algorithms: Construct various partitions and then evaluate them by some criterion. Most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases.

2. Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group and successively merges the groups that are close to one another, until all of the groups are merged into one or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object linkages at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH.
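To make the agglomerative idea concrete, here is a minimal single-linkage sketch in Python (a generic illustration, not BIRCH or Chameleon; the function name and data are illustrative):

import numpy as np

def single_linkage(points, k):
    """Bottom-up clustering: start with singletons, repeatedly merge the
    two clusters whose closest members are nearest, stop at k clusters."""
    pts = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(pts[i] - pts[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the closest pair of clusters
    return clusters

print(single_linkage([[0, 0], [0, 1], [5, 5], [5, 6]], k=2))
# [[0, 1], [2, 3]]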

3. Density-based methods: based on connectivity and density functions. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

4. Grid-based methods: based on a multiple-level granularity structure. Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space. STING is a typical example of a grid-based method.

5. Model-based methods: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking noise or outliers into account and thus yielding robust clustering methods.

PARTITIONING METHODS

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.

1. Global optimal: exhaustively enumerate all partitions
2. Heuristic methods: the k-means and k-medoids algorithms
   a. k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
   b. k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps (see the sketch below):
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when there are no more new assignments

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
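A compact Python sketch of these four steps (assuming NumPy; the random initialization and sample data are illustrative):

import numpy as np

def k_means(points, k, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid
    update until the assignment no longer changes."""
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = pts[rng.choice(len(pts), size=k, replace=False)]  # initial seeds
    labels = None
    while True:
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centroids         # Step 4: no new assignments, stop
        labels = new_labels
        # Step 2: recompute each centroid as the mean point of its cluster
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = pts[labels == c].mean(axis=0)

data = [[1, 1], [1.5, 2], [8, 8], [9, 8.5], [1, 0.5]]
labels, centers = k_means(data, k=2)
print(labels)   # e.g. [0 0 1 1 0] -- the two well-separated groups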