
Akash Dwivedi (09IT6001)

Under the guidance of

Prof. S.K. Ghosh

School of Information Technology, Indian Institute of Technology, Kharagpur

Data Mining Engine for Enterprise GIS

OUTLINE

Motivating examples

Objective of work

Spatial data and its properties

Architecture of Enterprise GIS

Proposed model for data mining engine for Enterprise GIS.

Spatial Outliers Detection

Spatial Clustering Analysis

Semantic Enrichment using Spatial Clustering

Future work

4/2/2011

OBJECTIVES

To develop a data mining engine, which can be integrated with Enterprise GIS as a WPS service.

The engine must be able to pre-process the given input and save the data in the desired form.

Knowledge discovery using standard spatial data mining techniques, e.g. spatial outlier detection and spatial cluster analysis, with the results output in service mode.

To develop a framework for automatically updating the data ontology in the service-oriented architecture of GIS.

What is Spatial Data?

• The data related to objects that occupy space: traffic, bird habitats, global climate, logistics, ...

• Object types: points, lines, polygons, etc.

• Used in/for: GIS (Geographic Information Systems), meteorology, astronomy, environmental studies, etc.

What is Special about Spatial Data

Highly correlated.

Data is stored in heterogeneous databases.

Huge amount of data in terms of per-record size.

Why Data Mining in Spatial Data

Spatial data mining follows the same functions as general data mining, with the end objective of finding patterns in geography, meteorology, etc.

The main difference (spatial autocorrelation)

• The neighbors of a spatial object may have an influence on it and therefore have to be considered as well.

Spatial attributes

• Topological: adjacency or inclusion information

• Geometric: position (longitude/latitude), area, perimeter, boundary polygon

Spatial Data + Web Services = OGC (Open Geospatial Consortium)

• The Web Processing Service (WPS) specification defines a mechanism by which a client may submit a processing task to a server to be completed.

• The Web Map Service (WMS) dynamically produces maps of spatially referenced data from geographic information.

• The Web Processing Service (WPS) provides client access to pre-programmed calculations and/or computation models that operate on spatially referenced data.
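To make the WPS request cycle concrete, here is a minimal sketch of how a client might build WPS 1.0.0 key-value-pair (KVP) request URLs. The endpoint, the process identifier `SpatialOutlierDetection`, and the input names are hypothetical and are not taken from the slides; only the `service`, `version`, `request`, `identifier`, and `datainputs` parameter names come from the WPS 1.0.0 KVP encoding.

```python
from urllib.parse import urlencode


def wps_url(endpoint, request, **params):
    """Build a WPS 1.0.0 key-value-pair (KVP) GET URL."""
    query = {"service": "WPS", "request": request}
    # GetCapabilities negotiates the version; the other requests state it explicitly.
    if request != "GetCapabilities":
        query["version"] = "1.0.0"
    query.update(params)
    return endpoint + "?" + urlencode(query)


# Hypothetical endpoint and process identifier, for illustration only.
ENDPOINT = "http://example.org/wps"

capabilities_url = wps_url(ENDPOINT, "GetCapabilities")
describe_url = wps_url(ENDPOINT, "DescribeProcess",
                       identifier="SpatialOutlierDetection")
# Inputs are passed as "name=value" pairs separated by semicolons.
execute_url = wps_url(ENDPOINT, "Execute",
                      identifier="SpatialOutlierDetection",
                      datainputs="attribute=HR7984;hinge=3")
```

A client would typically call the three URLs in this order: discover the available processes, inspect the inputs of one process, then execute it.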


Proposed Architecture of Enterprise GIS

Fig. 1: Architecture of Enterprise GIS (components shown: Semantic Resolution of query, Broker (service composition), Spatial Data Mining Engine, Client Map Overlay)

Data Mining Engine Framework

Fig.2: Data mining engine framework


Spatial Outlier Detection

Spatial Outlier

A data point that is extreme relative to its neighbors.

An example

• Item: Palm Beach County.

• Neighbors(item) = counties in Florida.

Fig. 3: Palm Beach County as a spatial outlier (source: http://madison.hss.cmu.edu/buchanan-bush.gif)

Spatial Outlier Detection Problem

Given,

• A spatial framework S consisting of locations $s_1, s_2, \ldots, s_n$

• A neighborhood relationship

• An attribute function $f : s_i \to \mathbb{R}$

• A neighborhood aggregate function $f^{N}_{\mathrm{aggr}} : \mathbb{R}^{N} \to \mathbb{R}$, where N is the maximum neighbor number for a location

• A comparison function $F_{\mathrm{diff}}$ between $f$ and $f^{N}_{\mathrm{aggr}}$

• A statistical test function $ST : \mathbb{R} \to \{\mathrm{True}, \mathrm{False}\}$

Design: An efficient algorithm to detect spatial outliers
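The ingredients listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the engine's implementation: the aggregate function is taken to be the neighborhood mean, the comparison function is the plain difference, and the statistical test is a threshold `theta` on the standardized difference. The function name and `theta` are hypothetical.

```python
from statistics import mean, pstdev


def spatial_outliers(values, neighbors, theta=2.0):
    """Flag locations whose value differs unusually from their neighborhood mean.

    values    : dict mapping location -> attribute value f(s_i)
    neighbors : dict mapping location -> list of neighboring locations
    theta     : threshold on the standardized difference (an assumed default)
    """
    # Comparison function: attribute value minus the neighborhood aggregate (mean).
    diff = {s: values[s] - mean([values[n] for n in neighbors[s]])
            for s in values}
    all_diffs = list(diff.values())
    mu, sigma = mean(all_diffs), pstdev(all_diffs)
    if sigma == 0:
        return []  # every location agrees with its neighborhood
    # Statistical test: the standardized difference exceeds theta.
    return [s for s in values if abs((diff[s] - mu) / sigma) > theta]


# Nine locations on a line; location 4 is far above its two neighbors.
example_values = {i: 10.0 for i in range(9)}
example_values[4] = 100.0
example_neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9]
                     for i in range(9)}
flagged = spatial_outliers(example_values, example_neighbors)
```

Note that the comparison is against the neighbors only, so a value that is globally ordinary can still be flagged if it is unusual for its neighborhood.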

Back to Our Motivating Example

Finding crime hot spots to plan police patrols, using the St. Louis region homicides sample data.

• The given problem can be solved with spatial clustering and outlier detection methods.

Motivation

• Spatial outlier detection methods are better suited to this problem than classical data mining algorithms.

Preprocessing

• Get the GML data from the data repository in service mode.

• Parse the obtained GML data. This can be done using SAX or DOM.

• Save the intermediate result in a database.
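The parsing and saving steps above might look like the following sketch, which uses Python's DOM parser (`xml.dom.minidom`) and an SQLite database. The GML fragment, the `ex` namespace, the element names, and the table layout are all hypothetical; a real feed would follow the schema published by the data repository.

```python
import sqlite3
from xml.dom import minidom

GML_NS = "http://www.opengis.net/gml"
EX_NS = "http://example.org/ex"  # hypothetical application namespace

# Hypothetical GML fragment standing in for the service response.
SAMPLE_GML = """<ex:FeatureCollection xmlns:ex="http://example.org/ex"
    xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember>
    <ex:County>
      <ex:name>A</ex:name>
      <ex:HR7984>12.5</ex:HR7984>
      <gml:Point><gml:pos>38.6 -90.2</gml:pos></gml:Point>
    </ex:County>
  </gml:featureMember>
  <gml:featureMember>
    <ex:County>
      <ex:name>B</ex:name>
      <ex:HR7984>3.1</ex:HR7984>
      <gml:Point><gml:pos>38.8 -90.4</gml:pos></gml:Point>
    </ex:County>
  </gml:featureMember>
</ex:FeatureCollection>"""


def text_of(parent, namespace, local_name):
    """Return the text content of the first matching descendant element."""
    return parent.getElementsByTagNameNS(namespace, local_name)[0].firstChild.data


def parse_features(gml_text):
    """Parse GML with the DOM API into (name, rate, x, y) tuples."""
    doc = minidom.parseString(gml_text)
    rows = []
    for member in doc.getElementsByTagNameNS(GML_NS, "featureMember"):
        x, y = map(float, text_of(member, GML_NS, "pos").split())
        rows.append((text_of(member, EX_NS, "name"),
                     float(text_of(member, EX_NS, "HR7984")), x, y))
    return rows


def save_rows(rows, conn):
    """Store the intermediate result in a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS features "
                 "(name TEXT, hr7984 REAL, x REAL, y REAL)")
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
    conn.commit()


parsed_rows = parse_features(SAMPLE_GML)
connection = sqlite3.connect(":memory:")
save_rows(parsed_rows, connection)
```

DOM loads the whole document into memory, which is convenient for small responses; for very large GML files a SAX handler would keep memory use flat at the cost of more code.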


Results (Classical Data Mining Algorithms)

Input

• St. Louis region homicides sample data

Attributes considered

• HR7984 (homicide rate per 100,000, 1979-84)

• PE82 (police expenditures per capita, 1982)

• RDAC80 (resource deprivation/affluence composite variable, 1980)

Methods used

• Normal quantile map (without considering the contiguity matrix).

• Box map (hinge = 1.5, also without considering the contiguity matrix).

Results for the above methods

Fig. 4: Box map, outliers in red color

Fig. 5: Quantile map, outliers in brown color

Results (Spatial Data Mining Algorithms)

Spatial dependence (distance based) is introduced into our model using a contiguity matrix.
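A distance-based contiguity matrix can be sketched as follows. The slides do not say how the matrix was built, so the Euclidean distance, the threshold, and the function name are assumptions made purely for illustration.

```python
import math


def distance_contiguity(points, threshold):
    """Return a binary matrix W with W[i][j] = 1 when i and j are distinct
    locations no farther apart than `threshold` (Euclidean distance)."""
    n = len(points)
    return [[1 if i != j and math.dist(points[i], points[j]) <= threshold else 0
             for j in range(n)]
            for i in range(n)]


# Three locations on a line: the first two are close, the third is isolated.
example_points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
example_w = distance_contiguity(example_points, threshold=1.5)
```

The diagonal is zero because a location is not its own neighbor, and the matrix is symmetric because distance is symmetric.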


Methods Used

• LAG-based approach

• Moran scatter plot

• LISA cluster map

LAG-based approach

• Suppose the contiguity matrix is W.

• Add a new feature named LAG to the original database, computed as LAG = W * HR7984.

• Create the box map (hinge = 3) of LAG; the marked data is the spatial cluster of outlier data where more police force is needed for crime patrolling.
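The LAG feature can be sketched as a matrix-vector product. The version below assumes row-standardized weights, so each LAG value is the average of the neighbors' values; that standardization is a common convention, not something the slides state, and the function name is hypothetical.

```python
def spatial_lag(W, x):
    """Return the row-standardized spatial lag of x under contiguity matrix W."""
    lag = []
    for row in W:
        degree = sum(row)
        weighted = sum(w * v for w, v in zip(row, x))
        # A location without neighbors gets a lag of 0 (an arbitrary
        # convention chosen for this sketch).
        lag.append(weighted / degree if degree else 0.0)
    return lag


# Four locations on a line: 0-1, 1-2 and 2-3 are neighbors.
chain_w = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
example_lag = spatial_lag(chain_w, [1.0, 2.0, 3.0, 4.0])
```

With an unstandardized W the same product would give neighbor sums instead of neighbor averages, which makes locations with many neighbors look larger.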


LAG-based approach contd…

Fig. 6: LAG-based box map

The marked data is the spatial cluster of outlier data where more police force is needed for crime patrolling.

Using Moran Scatter Plot

Fig. 7: Moran scatter plot, yellow points are spatial outliers
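A Moran scatter plot places each location's deviation from the mean on the horizontal axis and the spatial lag of those deviations on the vertical axis; the slope of the fitted line is the global Moran's I statistic. A minimal sketch of that statistic, assuming a binary contiguity matrix given as nested lists (the function name is hypothetical):

```python
def morans_i(W, x):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2,
    where z holds deviations from the mean and S0 is the sum of all weights.
    Assumes x is not constant and W has at least one nonzero entry."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    s0 = sum(sum(row) for row in W)
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(v * v for v in z)


# Four locations on a line with steadily increasing values.
line_w = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
example_i = morans_i(line_w, [1.0, 2.0, 3.0, 4.0])
```

A positive value indicates that similar values tend to sit next to each other; points in the off-diagonal quadrants of the scatter plot (high value with low neighbors, or the reverse) are the candidate spatial outliers.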

Verification

Verification using the LISA cluster map.

The same contiguity matrix is used.

The result is the same as for the LAG-based approach.

Fig. 8: LISA cluster map, outliers in red color
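A LISA cluster map is built from local Moran statistics, one per location. The sketch below computes the local statistic and the Moran scatter plot quadrant for each location, assuming row-standardized weights. It deliberately omits the permutation-based significance test that tools such as GeoDa apply before coloring the map, and the function name is hypothetical.

```python
def local_moran(W, x):
    """Return (I_i, quadrant) for each location.

    I_i = z_i * lag_i / m2, where z_i is the deviation from the mean, lag_i is
    the row-standardized average of the neighbors' deviations, and m2 is the
    mean squared deviation. Assumes x is not constant.
    """
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    m2 = sum(v * v for v in z) / n
    results = []
    for i in range(n):
        degree = sum(W[i])
        lag = sum(W[i][j] * z[j] for j in range(n)) / degree if degree else 0.0
        # First letter: the location's own value; second letter: its neighbors.
        quadrant = ("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((z[i] * lag / m2, quadrant))
    return results


# Five locations on a line; the middle one is much higher than the rest.
five_w = [[0, 1, 0, 0, 0],
          [1, 0, 1, 0, 0],
          [0, 1, 0, 1, 0],
          [0, 0, 1, 0, 1],
          [0, 0, 0, 1, 0]]
example_lisa = local_moran(five_w, [1.0, 1.0, 10.0, 1.0, 1.0])
```

Locations in the "HL" and "LH" quadrants have a negative local statistic and are the spatial outlier candidates; "HH" and "LL" locations form clusters of similar values.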


Verification Contd…

The scatter plot between HR7984 and PE82 (police expenditure) gives a positive slope (0.4602), and for our predicted values the police expenditures are quite high (outliers).

Fig. 9: Relation between HR7984 and PE82

Any Reasons?

There may be several contributing factors; one of them is:

• Resource deprivation/affluence shows a positive correlation with the homicide rate (slope = 0.5250, quite high), and the outliers in the scatter plot of HR7984 and RDAC80 are our predicted points.

Fig. 12: Scatter plot between RDAC80 and HR7984, outliers in yellow color.

Spatial Cluster Analysis

Cluster analysis is an unsupervised learning technique that groups similar spatial objects into classes. The problem of clustering is:

• Given n data points in a D-dimensional metric space,

• partition the data points into K clusters,

• such that data points within a cluster are more similar to each other than to data points in different clusters.

When choosing a clustering algorithm, many factors have to be considered, such as:

• Application goal

• Quality and speed

• Characteristics of the data

• Dimensionality of the data

• Amount of noise in the data

Spatial Clustering Problem Definition

Given,

1. A spatial framework $S = \{s_i\}_{i=1}^{n}$ of n sites with a neighbor relation $N \subseteq S \times S$. Sites $s_i$ and $s_j$ are neighbors iff $(s_i, s_j) \in N$, $i \neq j$. Let $N(s_i) \equiv \{s_j : (s_i, s_j) \in N\}$ denote the neighborhood of $s_i$. We assume N is given by a contiguity matrix W with $W(i, j) = 1$ iff $(s_i, s_j) \in N$ and $W(i, j) = 0$ otherwise.

2. Associated with each $s_i$, there is a d-dimensional feature vector of normal attributes $x_i \equiv x(s_i) \in \mathbb{R}^{d}$.

Problem Definition Contd…

Find,

• A many-to-one mapping $f : \{x_i\}_{i=1}^{n} \to \{1, \ldots, K\}$.

Objective,

• The ultimate goal is to maximize the similarity between the clustering and the classification based on the true class labels $y_i$.

Constraint,

• Spatial autocorrelation exists, i.e., $(x_i, y_i)$ of site $s_i$ may not be independent of the corresponding values of nearby spatial sites.

Back to Our Motivating Example

House price prediction, using the Baltimore housing price sample data.

• The given problem can be solved with spatial clustering.

Motivation

• Spatial clustering methods are better suited to this problem than classical data mining algorithms.

Preprocessing

• Get the GML data from the data repository in service mode.

• Parse the obtained GML data. This can be done using SAX or DOM.

• Save the intermediate result in a database.

Experimental Setup

Table 1. Experimental setup details

Data used                    Baltimore house price data

Number of rows               211

Number of attributes         17

Number of attributes used    13

Methods used                 NEM spatial clustering algorithm analysis

Tools used                   GeoDa 0.9.5-i5, MATLAB R2010a
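For readers unfamiliar with NEM (Neighborhood EM), the sketch below shows the core idea on one-dimensional data: it is EM for a Gaussian mixture in which each site's membership is additionally pulled toward the memberships of its neighbors through a term `beta * sum_j W[i][j] * c[j][k]`. This is a simplified illustration, not the MATLAB code used for the experiments: it handles a single attribute, runs one fixed-point update per E-step, and the names `nem_1d` and `beta`, the initialization, and the iteration count are all assumptions.

```python
import math


def nem_1d(x, W, k=2, beta=1.0, iters=50):
    """Cluster 1-D values x into k groups with a neighborhood-regularized EM."""
    n = len(x)
    # Initialize memberships by cutting the sorted values into k equal groups.
    order = sorted(range(n), key=lambda i: x[i])
    c = [[0.0] * k for _ in range(n)]
    for rank, i in enumerate(order):
        c[i][min(rank * k // n, k - 1)] = 1.0
    for _ in range(iters):
        # M-step: mixture proportions, means and variances from memberships.
        nk = [max(sum(c[i][j] for i in range(n)), 1e-9) for j in range(k)]
        pi = [nk[j] / n for j in range(k)]
        mu = [sum(c[i][j] * x[i] for i in range(n)) / nk[j] for j in range(k)]
        var = [max(sum(c[i][j] * (x[i] - mu[j]) ** 2 for i in range(n)) / nk[j],
                   1e-6) for j in range(k)]
        # E-step: Gaussian likelihood times exp(beta * neighbors' memberships).
        new_c = []
        for i in range(n):
            logp = []
            for j in range(k):
                log_density = (-0.5 * math.log(2 * math.pi * var[j])
                               - (x[i] - mu[j]) ** 2 / (2 * var[j]))
                spatial = beta * sum(W[i][m] * c[m][j] for m in range(n))
                logp.append(math.log(pi[j]) + log_density + spatial)
            top = max(logp)  # subtract the maximum for numerical stability
            weights = [math.exp(v - top) for v in logp]
            total = sum(weights)
            new_c.append([w / total for w in weights])
        c = new_c
    # Hard assignment: the cluster with the largest membership.
    return [max(range(k), key=lambda j: c[i][j]) for i in range(n)]


# Eight sites on a line: four low values followed by four high values.
toy_x = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
toy_w = [[1 if abs(i - j) == 1 else 0 for j in range(8)] for i in range(8)]
toy_labels = nem_1d(toy_x, toy_w, k=2)
```

Setting `beta` to 0 reduces the procedure to ordinary EM for a Gaussian mixture; larger values favor spatially smooth cluster assignments.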

Analysis

• Histogram

Figure 13: Histogram of house price data

The histogram suggests that the data can be roughly modeled with a mixture of components.

Results for K=2, Using NEM

Figure 14: Clustering results for K=2, high-priced houses in brown color

Results for K=3, Using NEM

Figure 15: Clustering results for K=3, high-priced buildings shown in red color

Semantic Enrichment using spatial clustering

Problem Definition

A very common problem in enterprise GIS is updating the data ontology when a new data source is added. The naïve way of doing it requires a lot of manual work, hence it is not suitable for enterprise GIS.

Proposed Solution

Knowledge discovery from clustered data will be positively aided by providing semantic meaning to the clusters.

This can be done by formally specifying the data using an ontology.

Using description logic for the ontology will provide implicit knowledge about the semantic relationships between clusters and the relationships of individuals to the clusters.

Framework

Figure 16: Semantic enrichment of clusters


Framework Contd…

• The retrieval of data from distributed heterogeneous databases for clustering.

• The identification of clusters using a clustering algorithm.

• Formal specification of clusters using the data ontology.

Reasoning over the ontology for implicit knowledge

Both TBox and ABox reasoning can be used to extract information from the ontology. SPARQL can be used to query the ontology; such querying is a type of ABox reasoning.
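The difference between the two kinds of reasoning can be shown with a toy example. The sketch below uses plain Python dictionaries rather than a description logic reasoner or a SPARQL engine, and every class and individual name in it is hypothetical. TBox reasoning works on class definitions (which classes subsume which), while ABox reasoning answers questions about individuals.

```python
# TBox: terminological axioms, stored as class -> direct superclasses.
# All class names are hypothetical.
TBOX = {
    "HighPriceCluster": {"Cluster"},
    "LowPriceCluster": {"Cluster"},
    "Cluster": set(),
}

# ABox: assertions about individuals, stored as individual -> asserted class.
ABOX = {"house_1": "HighPriceCluster", "house_2": "LowPriceCluster"}


def superclasses(cls):
    """TBox reasoning: every class that subsumes cls, including cls itself."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TBOX.get(current, ()))
    return seen


def instances_of(cls):
    """ABox reasoning: individuals belonging to cls, directly or by subsumption."""
    return sorted(ind for ind, asserted in ABOX.items()
                  if cls in superclasses(asserted))


all_clustered = instances_of("Cluster")
high_priced = instances_of("HighPriceCluster")
```

The membership of `house_1` in `Cluster` is never asserted directly; it follows from the class hierarchy, which is the kind of implicit knowledge the framework aims to expose.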


Results: Ontology

Figure 17:Data ontology for Baltimore House price data


Results Contd…

Reasoning: ABox reasoning is done on this ontology using SPARQL. Sample query:

Figure 18: SPARQL query page

Results Contd…

• Result for the given query

Figure 19: Result for the given query

Future Work

Integration of spatial co-location mining, spatial regression, and spatial classification into the data mining engine.

Spatial data generally contains a large number of attributes, which calls for a novel dimensionality reduction algorithm for spatial data.

An SQL-like query language can be developed to mine data using the spatial data mining engine.

Implementation of fully automatic updating of the data ontology to semantically enrich the clustered data.


Questions??


Thank You

Box Map

• Since box maps are based on the same methodology as box plots, they can be used to detect outliers in a stricter sense than is possible with percentile maps. Box maps group values such as counts or rates into six fixed categories: Four quartiles (1-25%, 25-50%, 50-75%, and 75-100%) plus two outlier categories at the low and high end of the distribution.

• Values are classified as outliers if they lie more than 1.5 times the interquartile range (IQR) above the 75th percentile (Q3) or below the 25th percentile (Q1). The IQR is the difference between Q3 and Q1, i.e. Q3-Q1. It describes the range of the middle of the distribution, since 25% of values are above the interquartile range and 25% below it.

Box Plot

• Box plots are particularly useful to identify outliers and gain an overview of the spread of a distribution.

• The box plot (sometimes referred to as a box-and-whisker plot) is a non-parametric method. For normally distributed data, the median coincides with the mean and the interquartile range is proportional to the standard deviation (about 1.35 standard deviations). The box plot shows the median and the first and third quartiles of a distribution (the 50%, 25% and 75% points in the cumulative distribution) as well as outliers. An observation is classified as an outlier when it lies more than a given multiple of the interquartile range (the difference in value between the 75% and 25% observations) above the 75th percentile or below the 25th percentile. The standard multiples used are 1.5 and 3 times the interquartile range.

• The red bar in the middle corresponds to the median, the dark part shows the interquartile range. The individual observations in the first and fourth quartile are shown as blue dots. The thin line is the hinge, corresponding to the default criterion of 1.5.
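The outlier rule described above can be sketched directly. The function name is hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile definition used by GeoDa on small samples.

```python
from statistics import quantiles


def box_outliers(values, hinge=1.5):
    """Return the indices of values lying more than hinge * IQR above the
    third quartile or below the first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - hinge * iqr, q3 + hinge * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 20]
mild = box_outliers(sample, hinge=1.5)    # 20 is beyond the 1.5 * IQR fence
extreme = box_outliers(sample, hinge=3)   # but within the 3 * IQR fence
```

The hinge of 3 is the stricter criterion: it keeps only the most extreme observations, which is why the LAG-based box map uses it to isolate the strongest hot spots.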
