Post on 30-Oct-2014
Akash Dwivedi (09IT6001)
Under the guidance of
Prof. S.K. Ghosh
School of Information Technology Indian Institute of Technology, Kharagpur
Data Mining Engine for Enterprise GIS
OUTLINE
Motivating examples
Objective of work
Spatial data and its properties
Architecture of Enterprise GIS
Proposed model for data mining engine for Enterprise GIS.
Spatial Outliers Detection
Spatial Clustering Analysis
Semantic Enrichment using Spatial Clustering
Future work
4/2/2011
OBJECTIVES
To develop a data mining engine that can be integrated with Enterprise GIS as a WPS service.
The engine must be able to pre-process the given input and save the data in the desired form.
Knowledge discovery using standard spatial data mining techniques, e.g. spatial outlier detection and spatial cluster analysis, with the results output in service mode.
To develop a framework for automatically updating the data ontology in the service-oriented architecture of GIS.
What is Spatial Data?
o The data related to objects that occupy space
• traffic, bird habitats, global climate, logistics, ...
o Object types:
• Points, Lines, Polygons, etc.
o Used in/for:
• GIS - Geographic Information Systems
• Meteorology
• Astronomy
• Environmental studies, etc.
What is Special about Spatial Data
Highly correlated data.
Data stored in heterogeneous databases.
Huge amount of data in terms of per-record size.
Why Data Mining in Spatial Data
Spatial data mining performs the same functions as general data mining, with the end objective of finding patterns in geography, meteorology, etc.
The main difference (spatial autocorrelation)
• The neighbors of a spatial object may have an influence on it and therefore have to be considered as well.
Spatial attributes
• Topological: adjacency or inclusion information
• Geometric: position (longitude/latitude), area, perimeter, boundary polygon
Spatial Data + Web Services = OGC (Open Geospatial Consortium)
WFS
• The Web Feature Service (WFS) lets a client retrieve geographic features, typically encoded in GML, from a server.
WMS
• The Web Map Service (WMS) dynamically produces maps of spatially referenced data from geographic information.
WPS
• The Web Processing Service (WPS) specification defines a mechanism by which a client may submit a processing task to a server to be completed. It gives clients access to pre-programmed calculations and/or computation models that operate on spatially referenced data.
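To illustrate how a client might submit a processing task to a WPS server, the sketch below builds a WPS 1.0.0 key-value-pair Execute request URL. The endpoint `http://example.org/wps`, the process identifier `SpatialOutlierDetection` and the input names are hypothetical placeholders, not part of the engine described in these slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_wps_execute_url(endpoint, process_id, inputs):
    """Build a WPS 1.0.0 Execute request using key-value-pair (GET) encoding."""
    # In KVP encoding the inputs are joined as "name=value;name=value".
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "datainputs": data_inputs,
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint, process name and inputs, for illustration only.
url = build_wps_execute_url(
    "http://example.org/wps",
    "SpatialOutlierDetection",
    {"attribute": "HR7984", "hinge": 3},
)
query = parse_qs(urlsplit(url).query)
print(url)
print(query["datainputs"])
```

A real deployment would first call GetCapabilities and DescribeProcess to learn which processes and inputs the server actually offers.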
Proposed Architecture of Enterprise GIS
Fig.1: Architecture of Enterprise GIS (components shown: Query, Semantic Resolution of query, Broker (service composition), WFS, WPS, Spatial Data Mining Engine, WMS, Client Map Overlay, DB1, DB2, ..., DBn)
Data Mining Engine Framework
Fig.2: Data mining engine framework
Spatial Outlier Detection
Spatial Outlier
A data point that is extreme relative to its neighbors.
An example
• Item: Palm Beach County.
• Neighbors(item) = counties in Florida.
• Fig.3: Palm Beach County as spatial outlier (source: http://madison.hss.cmu.edu/buchanan-bush.gif)
Spatial Outlier Detection Problem
Given,
• A spatial framework $S$ consisting of locations $s_1, s_2, \ldots, s_n$
• A neighborhood relationship $N \subseteq S \times S$
• An attribute function $f : s_i \rightarrow \mathbb{R}$
• A neighborhood aggregate function $f^{N}_{aggr} : \mathbb{R}^{N} \rightarrow \mathbb{R}$, where $N$ is the maximum neighbor number for a location
• A comparison function $F_{diff}(f, f^{N}_{aggr})$
• A statistical test function $ST : \mathbb{R} \rightarrow \{\text{True}, \text{False}\}$
Design: An efficient algorithm to detect spatial outliers
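The ingredients of the problem statement can be sketched in a few lines. The choices below are assumptions made for illustration, not necessarily the ones used in this work: the neighborhood aggregate is the mean over neighbors, the comparison function is a plain difference, and the statistical test flags a location when its standardized difference exceeds a threshold `theta`.

```python
from statistics import mean, pstdev


def spatial_outliers(values, neighbors, theta=2.0):
    """Return one True/False flag per location (True = spatial outlier)."""
    n = len(values)
    # Neighborhood aggregate: mean attribute value over each location's neighbors.
    aggr = [mean(values[j] for j in neighbors[i]) for i in range(n)]
    # Comparison function: difference between a location and its neighborhood.
    diff = [v - a for v, a in zip(values, aggr)]
    # Statistical test: standardized difference larger than theta.
    # (Assumes the differences are not all identical, so sigma > 0.)
    mu, sigma = mean(diff), pstdev(diff)
    return [abs((d - mu) / sigma) > theta for d in diff]


# Toy data: seven locations on a line, each neighboring the adjacent ones.
values = [10, 11, 10, 50, 11, 10, 11]
neighbors = {
    i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
    for i in range(len(values))
}
flags = spatial_outliers(values, neighbors)
print([i for i, flagged in enumerate(flags) if flagged])
```

On this toy input only the location with value 50 is flagged. Its neighbors also differ from their surroundings, but not by enough to pass the test.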
Back To Our Motivating Example
Crime hot spots to plan police patrols, using the St Louis Region Homicides sample data
• The given problem can be solved with spatial clustering and outlier detection methods.
Motivation
• Spatial outlier detection methods are better suited to this problem than classical data mining algorithms.
Preprocessing
• Getting the GML data from the data repository in service mode.
• Parsing the obtained GML data. This can be done using SAX or DOM.
• Saving the intermediate result in a database.
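The preprocessing steps can be sketched with the Python standard library: a DOM parser reads the GML, and the extracted rows are saved in a database. The sample document, the `ex` namespace, the `County` element and the point geometry are hypothetical stand-ins for what a WFS might return; the real homicide data set has its own schema.

```python
import sqlite3
from xml.dom import minidom

GML_NS = "http://www.opengis.net/gml"
EX_NS = "http://example.org/homicides"  # hypothetical application namespace

SAMPLE_GML = f"""<gml:FeatureCollection xmlns:gml="{GML_NS}" xmlns:ex="{EX_NS}">
  <gml:featureMember>
    <ex:County>
      <ex:NAME>County A</ex:NAME>
      <ex:HR7984>12.5</ex:HR7984>
      <gml:Point><gml:pos>-90.2 38.6</gml:pos></gml:Point>
    </ex:County>
  </gml:featureMember>
  <gml:featureMember>
    <ex:County>
      <ex:NAME>County B</ex:NAME>
      <ex:HR7984>3.1</ex:HR7984>
      <gml:Point><gml:pos>-90.5 38.9</gml:pos></gml:Point>
    </ex:County>
  </gml:featureMember>
</gml:FeatureCollection>"""


def first_text(parent, namespace, local_name):
    """Text content of the first matching descendant element."""
    node = parent.getElementsByTagNameNS(namespace, local_name)[0]
    return node.firstChild.data.strip()


def parse_counties(gml_text):
    """Parse the GML with DOM and return (name, rate, x, y) tuples."""
    doc = minidom.parseString(gml_text)
    rows = []
    for member in doc.getElementsByTagNameNS(GML_NS, "featureMember"):
        x, y = (float(v) for v in first_text(member, GML_NS, "pos").split())
        name = first_text(member, EX_NS, "NAME")
        rate = float(first_text(member, EX_NS, "HR7984"))
        rows.append((name, rate, x, y))
    return rows


def save_rows(conn, rows):
    """Save the intermediate result in a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS county (name TEXT, hr7984 REAL, x REAL, y REAL)"
    )
    conn.executemany("INSERT INTO county VALUES (?, ?, ?, ?)", rows)
    conn.commit()


rows = parse_counties(SAMPLE_GML)
conn = sqlite3.connect(":memory:")
save_rows(conn, rows)
stored = conn.execute("SELECT name, hr7984 FROM county ORDER BY name").fetchall()
print(stored)
```

DOM loads the whole document into memory, which is convenient for small responses. For very large GML files a streaming SAX handler would be the more economical choice.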
Results (Classical Data Mining Algorithms)
Input
• St Louis Region Homicides sample data
Attributes considered
• HR7984 (homicide rate per 100,000, 1979-84)
• PE82 (police expenditures per capita, 1982)
• RDAC80 (resource deprivation/affluence composite variable, 1980)
Methods used
• Normal quantile map (without considering the contiguity matrix).
• Box map (hinge = 1.5, also without considering the contiguity matrix).
Results for the above methods
Fig.4: Box map, outliers in red color
Fig.5: Quantile map, outliers in brown color
Results (Spatial Data Mining Algorithms)
The spatial dependence (distance based) is introduced in our model using a contiguity matrix.
Methods used
• LAG based approach
• Moran scatter plot
• LISA cluster map
LAG based approach
Suppose the contiguity matrix is W.
Add a new feature named LAG to the original database, computed as LAG = W*HR7984.
Create the LAG-based box map (hinge = 3); the marked data is the spatial cluster of outlier data where more police force is needed for crime patrolling.
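A minimal sketch of the LAG feature follows. It assumes W is row-standardized, so each LAG value is the average HR7984 of a location's neighbors; the slides do not say whether W was standardized. The four-location contiguity matrix and the rates are made up for illustration.

```python
def spatial_lag(W, x):
    """Spatial lag with a row-standardized contiguity matrix.

    Dividing each row's weighted sum by the row total is equivalent to
    row-standardizing W first and then computing W * x.
    """
    lag = []
    for row in W:
        total = sum(row)
        weighted = sum(w * xj for w, xj in zip(row, x))
        # Locations without neighbors get a lag of 0.0.
        lag.append(weighted / total if total else 0.0)
    return lag


# Hypothetical contiguity matrix for four locations (1 = neighbors).
W = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
hr7984 = [4.0, 6.0, 8.0, 20.0]  # made-up homicide rates
lag = spatial_lag(W, hr7984)
print(lag)
```

The LAG column can then be classified with a box map in the same way as the raw attribute, which is how neighborhood information enters the outlier analysis.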
LAG based approach contd..
Fig. 6: LAG Based Box Map
The marked data is the spatial cluster of outlier data where more police force is needed for crime patrolling.
Using Moran Scatter Plot
Fig.7: Moran scatter plot, yellow points are spatial outliers
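A Moran scatter plot places each location by its own (mean-centered) value and the average of its neighbors' values. Points in the off-diagonal quadrants, high among low (HL) or low among high (LH), are candidate spatial outliers. The sketch below computes global Moran's I and the quadrant labels on a toy chain of six locations; the data and helper names are illustrative assumptions.

```python
from statistics import mean


def morans_i(values, W):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mu = mean(values)
    z = [v - mu for v in values]
    s0 = sum(sum(row) for row in W)
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)


def moran_quadrants(values, W):
    """Label each location: first letter = own value, second = neighbor average."""
    mu = mean(values)
    z = [v - mu for v in values]
    labels = []
    for i, row in enumerate(W):
        lag = sum(w * zj for w, zj in zip(row, z)) / sum(row)
        labels.append(("H" if z[i] >= 0 else "L") + ("H" if lag >= 0 else "L"))
    return labels


def chain_contiguity(n):
    """Contiguity matrix for n locations on a line."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


values = [1, 1, 9, 1, 1, 1]
W = chain_contiguity(len(values))
moran = morans_i(values, W)
quadrants = moran_quadrants(values, W)
print(round(moran, 4), quadrants)
```

Here the single high value sits among low neighbors (HL) and its two neighbors are low values next to a high one (LH), which is the pattern highlighted as outliers in a Moran scatter plot.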
Verification
Verification using LISA cluster map.
Used the same contiguity matrix.
Same result as the LAG based approach.
Fig. 8: LISA cluster map, outliers in red color
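A LISA cluster map is built from local Moran statistics, one per location. The sketch below computes the local Moran value with a row-standardized contiguity matrix and leaves out the permutation-based significance test that a tool such as GeoDa applies before coloring the map. The toy data are an assumption for illustration.

```python
def local_moran(values, W):
    """Local Moran I_i = (z_i / m2) * sum_j w_ij z_j with row-standardized W."""
    n = len(values)
    mu = sum(values) / n
    z = [v - mu for v in values]
    m2 = sum(zi * zi for zi in z) / n  # second moment of the centered values
    result = []
    for i, row in enumerate(W):
        # Row-standardize on the fly: average of the neighbors' z values.
        lag = sum(w * zj for w, zj in zip(row, z)) / sum(row)
        result.append(z[i] / m2 * lag)
    return result


# Six locations on a line; each location neighbors the adjacent ones.
values = [1, 1, 9, 1, 1, 1]
W = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
local_i = local_moran(values, W)
print([round(v, 3) for v in local_i])
```

A negative local value marks a location that differs from its neighbors (a potential spatial outlier), while a positive value marks a location that resembles them (part of a cluster).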
Verification Contd…
The scatter plot between HR7984 and PE82 (police expenditure) gives a positive slope (0.4602), and for our predicted values the police expenditures are quite high (outliers).
Fig. 9: Relation between HR7984 and PE82
Any Reasons?
There may be several contributing factors; one of them is:
• Resource deprivation/affluence shows a positive correlation with the homicide rate (slope = 0.5250, quite high), and the outliers in the scatter plot of HR7984 and RDAC80 are our predicted points.
Fig.12: Scatter plot between RDAC80 and HR7984, outliers in yellow color.
Spatial Cluster Analysis
Cluster analysis is an unsupervised learning technique that groups similar spatial objects into classes. The clustering problem is:
Given n data points in a D-dimensional metric space.
Partition the data points into K clusters.
Data points within a cluster are more similar to each other than to data points in different clusters.
• While choosing a clustering algorithm, many factors have to be considered, such as:
Application goal
Quality and speed
Characteristics of the data
Dimensionality of the data
Amount of noise in the data
Spatial Clustering Problem Definition
Given,
1. A spatial framework of n sites, $S = \{s_i\}_{i=1}^{n}$, with a neighbor relation $N \subseteq S \times S$. Sites $s_i$ and $s_j$ are neighbors iff $(s_i, s_j) \in N$, $i \neq j$. Let $N(s_i) \equiv \{s_j : (s_i, s_j) \in N\}$ denote the neighborhood of $s_i$. We assume $N$ is given by a contiguity matrix $W$ with $W(i, j) = 1$ iff $(s_i, s_j) \in N$ and $W(i, j) = 0$ otherwise.
2. Associated with each $s_i$, there is a d-dimensional feature vector of normal attributes $x_i \equiv x(s_i) \in \mathbb{R}^{d}$.
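The relationship between the contiguity matrix W and the neighborhoods N(s_i) can be made concrete with a short sketch. The three-site matrix below is a made-up example.

```python
def neighborhoods(W):
    """Map each site index i to the list of its neighbors, i.e. N(s_i)."""
    n = len(W)
    # (s_i, s_j) is in N iff W(i, j) = 1 and i != j.
    return {i: [j for j in range(n) if j != i and W[i][j] == 1] for i in range(n)}


# Hypothetical contiguity matrix: site 1 touches sites 0 and 2.
W = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
hoods = neighborhoods(W)
print(hoods)
```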
Problem Definition Contd…
Find,
• A many-to-one mapping $f : \{x_i\}_{i=1}^{n} \rightarrow \{1, \ldots, K\}$
Objective,
• The ultimate goal is to maximize similarity between the clustering and the classification based on true class labels.
Constraint,
• Spatial autocorrelation exists, i.e., $(x_i, y_i)$ of site $s_i$ may not be independent of the corresponding values of nearby spatial sites.
Back To Our Motivating Example
House price prediction using the Baltimore housing price sample data
• The given problem can be solved with spatial clustering.
Motivation
• Spatial clustering methods are better suited to this problem than classical data mining algorithms.
Preprocessing
• Getting the GML data from the data repository in service mode.
• Parsing the obtained GML data. This can be done using SAX or DOM.
• Saving the intermediate result in a database.
Experimental Setup
Table 1. Experimental Setup details
Data used: Baltimore House Price data
Number of rows: 211
Number of attributes: 17
Number of attributes used: 13
Methods used: NEM spatial clustering algorithm analysis
Tools used: GeoDa 0.9.5-i5, MATLAB R2010a
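The idea behind NEM (Neighborhood EM) can be sketched as follows. This is a simplified illustration, not the MATLAB implementation used in the experiments: it assumes one-dimensional Gaussian components, a fixed spatial coefficient `beta`, and a few fixed-point sweeps in the E-step where each site's membership is reinforced by the memberships of its neighbors. The six prices and the chain contiguity are made up.

```python
import math


def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def nem_1d(x, W, K=2, beta=1.0, n_iter=30, e_sweeps=5):
    """Simplified NEM for 1-D data; returns (hard labels, soft memberships)."""
    n = len(x)
    xs = sorted(x)
    # Initial means spread across the sorted data, common initial variance.
    mu = [xs[(2 * k + 1) * n // (2 * K)] for k in range(K)]
    center = sum(x) / n
    var = [sum((v - center) ** 2 for v in x) / n] * K
    pi = [1.0 / K] * K
    c = [[1.0 / K] * K for _ in range(n)]
    for _ in range(n_iter):
        # E-step: c_ik is proportional to
        # pi_k * f_k(x_i) * exp(beta * sum_j w_ij * c_jk).
        for _ in range(e_sweeps):
            new_c = []
            for i in range(n):
                scores = []
                for k in range(K):
                    spatial = sum(W[i][j] * c[j][k] for j in range(n))
                    density = gaussian_pdf(x[i], mu[k], var[k])
                    scores.append(pi[k] * density * math.exp(beta * spatial))
                total = sum(scores)
                new_c.append([s / total for s in scores])
            c = new_c
        # M-step: the usual weighted EM updates for a Gaussian mixture.
        for k in range(K):
            nk = sum(c[i][k] for i in range(n))
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            pi[k] = nk / n
            mu[k] = sum(c[i][k] * x[i] for i in range(n)) / nk
            spread = sum(c[i][k] * (x[i] - mu[k]) ** 2 for i in range(n)) / nk
            var[k] = max(spread, 1e-6)  # floor to avoid a zero variance
    labels = [max(range(K), key=lambda k: c[i][k]) for i in range(n)]
    return labels, c


# Made-up prices for six houses on a line, neighbors being adjacent houses.
prices = [10.0, 12.0, 11.0, 40.0, 42.0, 41.0]
W = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
labels, memberships = nem_1d(prices, W, K=2)
print(labels)
```

Setting `beta` to 0 removes the spatial term and reduces the procedure to ordinary EM for a Gaussian mixture, which is one way to see what the neighborhood information contributes.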
Analysis
• Histogram
Figure 13: Histogram of house price data
The histogram can be roughly modeled as a mixture of components.
Results for K=2, Using NEM
Figure 14: Clustering results for K=2, high-priced houses in brown color
Results for K=3, Using NEM
Figure 15: Clustering results for K=3, high-priced buildings shown in red color
Semantic Enrichment using spatial clustering
Problem Definition
A very common problem in enterprise GIS is updating the data ontology when a new data source is added. The naïve way of doing it requires a lot of manual work; hence it is not suitable for enterprise GIS.
Proposed Solution
Knowledge discovery from clustered data will be positively aided by providing semantic meaning to the clusters.
This can be done through formally specifying the data using ontology.
Using description logic for ontology will provide implicit knowledge about the semantic relationship between clusters and the relationships of individuals to the clusters.
Framework
Figure 16: Semantic enrichment of clusters
Framework Contd…
The retrieval of data from distributed heterogeneous databases for clustering.
The identification of clusters using a clustering algorithm.
Formal specification of clusters using data ontology.
Reasoning of ontology for implicit knowledge
Both TBox and ABox reasoning can be used to extract information from the ontology. SPARQL can be used for querying the ontology, which is a type of ABox reasoning.
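The difference between the two kinds of reasoning can be illustrated with a toy ontology held in plain Python dictionaries. All class and individual names below are hypothetical; the actual ontology for the Baltimore data is the one shown in the results figures. TBox reasoning works on the class hierarchy, while ABox reasoning answers questions about individuals, including facts that are only implied by the hierarchy.

```python
# TBox: terminological knowledge (class hierarchy). Hypothetical names.
SUBCLASS_OF = {
    "HighPriceCluster": "Cluster",
    "LowPriceCluster": "Cluster",
    "Cluster": "SpatialObjectGroup",
}

# ABox: assertions about individuals. Hypothetical names.
INSTANCE_OF = {
    "house_17": "HighPriceCluster",
    "house_42": "LowPriceCluster",
}


def superclasses(cls):
    """TBox reasoning: all classes that subsume `cls`."""
    result = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result


def types_of(individual):
    """ABox reasoning: asserted type plus every type implied by the TBox."""
    direct = INSTANCE_OF[individual]
    return [direct] + superclasses(direct)


def instances_of(cls):
    """ABox reasoning: retrieve all individuals that belong to `cls`."""
    return sorted(ind for ind in INSTANCE_OF if cls in types_of(ind))


print(superclasses("HighPriceCluster"))
print(instances_of("Cluster"))
```

Neither house is asserted to be a member of `Cluster` directly; the membership is implicit knowledge that follows from combining the ABox assertions with the TBox hierarchy.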
Results: Ontology
Figure 17:Data ontology for Baltimore House price data
Results Contd…
ABox reasoning is performed on this ontology using SPARQL. Sample query:
Figure 18: SPARQL query page
Results Contd…
• Result for the given query
Figure 19: Result for the given query
Future Work
Integration of spatial co-location mining, spatial regression and spatial classification into the data mining engine.
Spatial data generally contains a large number of attributes, which calls for a novel dimensionality reduction algorithm for spatial data.
An SQL-like query language can be developed to mine data using the spatial data mining engine.
Implementation of fully automatic updating of the data ontology to semantically enrich the clustered data.
Questions??
Thank You
Box Map
• Since box maps are based on the same methodology as box plots, they can be used to detect outliers in a stricter sense than is possible with percentile maps. Box maps group values such as counts or rates into six fixed categories: four quartiles (1-25%, 25-50%, 50-75%, and 75-100%) plus two outlier categories at the low and high end of the distribution.
• Values are classified as outliers if they lie more than 1.5 times the interquartile range (IQR) above the 75th percentile or below the 25th percentile. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1), i.e. Q3-Q1. It describes the range of the middle of the distribution, since 25% of values are above the interquartile range and 25% below it.
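The six box map categories can be computed directly from the quartiles and the hinge. The sketch below is an illustration with made-up rates; the quartile convention (`method="inclusive"`) is an assumption and may differ slightly from the one GeoDa uses.

```python
from statistics import quantiles


def box_map_classes(values, hinge=1.5):
    """Assign each value to one of the six box map categories."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    labels = []
    for v in values:
        if v < lower_fence:
            labels.append("lower outlier")
        elif v < q1:
            labels.append("<25%")
        elif v < q2:
            labels.append("25-50%")
        elif v < q3:
            labels.append("50-75%")
        elif v <= upper_fence:
            labels.append(">75%")
        else:
            labels.append("upper outlier")
    return labels


rates = [2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up rates
classes = box_map_classes(rates, hinge=1.5)
strict = box_map_classes(rates, hinge=3)
print(classes)
```

With these rates the quartiles are 4, 6 and 8, so the upper fence is 14 for a hinge of 1.5 and 20 for a hinge of 3. The value 30 is an upper outlier under both settings.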
Box Plot
• Box plots are particularly useful to identify outliers and gain an overview of the spread of a distribution.
• The box plot (sometimes referred to as a box-and-whisker plot) is a non-parametric method. For normally distributed data, the median corresponds to the mean and the interquartile range to the standard deviation. The box plot shows the median and the first and third quartiles of a distribution (the 50%, 25% and 75% points in the cumulative distribution) as well as outliers. An observation is classified as an outlier when it lies more than a given multiple of the interquartile range (the difference in value between the 75% and 25% observations) above the 75th percentile or below the 25th percentile. The standard multiples used are 1.5 and 3 times the interquartile range.
• The red bar in the middle corresponds to the median, the dark part shows the interquartile range. The individual observations in the first and fourth quartile are shown as blue dots. The thin line is the hinge, corresponding to the default criterion of 1.5.