Mr. Idrissa Y. H., Assistant Lecturer, Geography & Environment, Department of Social Sciences, School of Natural & Social Sciences, State University of Zanzibar

LECTURE 8a: SPATIAL STATISTICAL ANALYSIS

Introduction to spatial analysis
Judging spatial association visually
The concept of Clustering and Cluster analysis
Spatial Cross-Correlation: Pearson, Spearman
Multivariate spatial association measures

Spatial statistics extends traditional statistics on two fronts. First, it seeks to map the variation in a data set. Second, it can uncover “numerical spatial relationships” within and among mapped data layers.

Tobler’s Law: “Everything is related to everything else, but near things are more related than distant things.”

3 major benefits of spatial analysis Pattern Analysis Feature count Analysis

May 6, 2023 Data Mining: Concepts and Techniques

1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary


Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Cluster analysis: Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Typical applications:
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms


Pattern Recognition

Spatial Data Analysis
  Create thematic maps in GIS by clustering feature spaces
  Detect spatial clusters, or support other spatial mining tasks

Image Processing

Economic Science (especially market research)

WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns


Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns


Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric: d(i, j)

There is a separate “quality” function that measures the “goodness” of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
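A minimal sketch of what a distance function d(i, j) can look like for different variable types. The function names, the optional `weights` parameter, and the sample values are illustrative assumptions, not part of the lecture; the simple matching coefficient is only one of several possible choices for boolean or categorical data.

```python
import math

def euclidean(x, y, weights=None):
    """Weighted Euclidean distance for interval-scaled variables."""
    if weights is None:
        weights = [1.0] * len(x)  # assumption: equal weights by default
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def manhattan(x, y, weights=None):
    """Weighted Manhattan (city-block) distance for interval-scaled variables."""
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))

def simple_matching(x, y):
    """Dissimilarity for boolean or categorical variables:
    the share of attributes on which the two objects differ."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

i = (1.0, 2.0)
j = (4.0, 6.0)
print(euclidean(i, j))                      # 5.0
print(manhattan(i, j))                      # 7.0
print(euclidean(i, j, weights=(1.0, 0.0)))  # 3.0 (second variable ignored)
print(simple_matching((True, False, True, True),
                      (True, True, True, False)))  # 0.5
```

Changing the weights changes which objects count as "similar", which is one reason the choice of measure depends on the application.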


Partitioning approach: Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  Typical methods: k-means, k-medoids, CLARANS

Hierarchical approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion
  Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON

Density-based approach: Based on connectivity and density functions
  Typical methods: DBSCAN, OPTICS, DenClue
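To make the hierarchical approach concrete, here is a minimal bottom-up (agglomerative) sketch. It is not the exact algorithm of any method named above: the use of single linkage as the merge criterion, the function name `single_linkage`, and the sample points are all assumptions chosen for illustration.

```python
import math

def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start: every object is its own cluster

    def cluster_distance(a, b):
        # Single linkage (an assumption): distance of the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > num_clusters:
        best = None  # (distance, index i, index j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
for group in single_linkage(points, num_clusters=2):
    print(group)
```

Stopping at a fixed number of clusters is only one way to cut the hierarchy; recording every merge instead would give the full decomposition.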


Grid-based approach: Based on a multiple-level granularity structure
  Typical methods: STING, WaveCluster, CLIQUE

Model-based: A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
  Typical methods: EM, SOM, COBWEB

Frequent pattern-based: Based on the analysis of frequent patterns
  Typical methods: pCluster

User-guided or constraint-based: Clustering by considering user-specified or application-specific constraints
  Typical methods: COD (obstacles), constrained clustering


Given k, the k-means algorithm is implemented in four steps:

1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no more new assignments occur
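The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not a production implementation: the first k objects are taken as the initial seed points instead of building an explicit initial partition, Euclidean distance is used, and the names `k_means` and `max_iterations` are hypothetical.

```python
import math

def k_means(points, k, max_iterations=100):
    # Assumption: the first k objects serve as the arbitrary initial seed points.
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no more new assignments: stop
        assignment = new_assignment
        # Recompute each seed point as the centroid (mean point) of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

With these sample points the loop stops on the third pass, once the assignment no longer changes. A different choice of initial seed points can lead to a different final partition.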


Example (K = 2)

[Figure: scatter plots of the objects at successive iterations; both axes run from 0 to 10]

Arbitrarily choose K objects as the initial cluster centers

Assign each object to the most similar center

Update the cluster means

Reassign

Update the cluster means

Reassign


Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
▪ Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))

Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weakness:
  Applicable only when the mean is defined; what about categorical data?
  Need to specify k, the number of clusters, in advance
  Unable to handle noisy data and outliers
  Not suitable for discovering clusters with non-convex shapes
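The sensitivity to outliers can be seen in a small sketch that compares a cluster's mean point with its medoid (the actual object with the smallest total distance to the others, as used by k-medoids methods). The helper names and sample points are illustrative assumptions.

```python
import math

def centroid(points):
    """Mean point of a cluster, as used by k-means."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def medoid(points):
    """The object with the smallest total distance to all objects in the cluster."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

cluster = [(1, 1), (2, 1), (1, 2), (2, 2)]
with_outlier = cluster + [(50, 50)]  # one noisy object

print(centroid(cluster))       # (1.5, 1.5)
print(centroid(with_outlier))  # mean is pulled far toward the outlier
print(medoid(with_outlier))    # (2, 2): still one of the original objects
```

A single distant object moves the mean well outside the original group, while the medoid stays inside it.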


Cluster analysis groups objects based on their similarity and has wide applications

Measures of similarity can be computed for various types of data

Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods

Outlier detection and analysis are very useful for fraud detection, etc. and can be performed by statistical, distance-based or deviation-based approaches

Many research issues in cluster analysis remain open
