Page 1: Geographic Dimension in Data Mining

Konrad Dramowicz
Centre of Geographic Sciences
Lawrencetown, Nova Scotia, Canada

ESRI Business GeoInfo Summit
Chicago, April 18-19, 2005

Page 2: What is data mining?

• Data mining (also known as knowledge discovery) is a technology, a science, and an art. It helps extract significant, previously unknown information from databases.
• Data mining automatically detects relevant patterns in a database. For many years, however, statisticians have mined databases manually, looking for statistically significant patterns.
• Data mining is also a tool for predicting future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Experts usually miss this predictive information because it lies outside their expectations.

Page 3: GIS as a synergetic technology

• GIS alone is a very powerful technology for dealing with spatial aspects of the real world.
• However, GIS combined with other technologies such as data mining, CRM, or ERP can be seen as a synergetic technology.

Page 4: What is special about spatial data?

• Heavy use of computational geometry algorithms such as
  – Polygon intersection
  – Topological operations
• Large and complex objects such as
  – High fractal dimension polygons
  – Polygons with attached topological information
  – Networks and their attributes (for example, addresses)
• Large index tables
  – Each spatial feature can be indexed by many z-values
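The remark about z-values can be made concrete with a small sketch. It assumes that "z-value" means a Morton (z-order) code obtained by interleaving the bits of integer grid coordinates, which is a common convention but is not stated on the slide. The function names and the 16-bit grid are hypothetical choices for illustration.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of grid coordinates x and y into one z-value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits occupy even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits occupy odd positions
    return z


def z_values_for_box(xmin: int, ymin: int, xmax: int, ymax: int) -> list[int]:
    """Z-values of every grid cell covered by an inclusive bounding box."""
    return sorted(
        interleave_bits(x, y)
        for x in range(xmin, xmax + 1)
        for y in range(ymin, ymax + 1)
    )


# A feature covering a 2 x 2 block of cells needs four index entries,
# and they are not contiguous in z-order.
print(z_values_for_box(1, 1, 2, 2))
```

Even a small feature maps to several scattered z-values, which is one reason spatial index tables grow large.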

Page 5: Spatial data structures

• Spatial data structures are designed for indexing or storing spatial data.
• Usually they are raster-based (grid-cell) structures indexed with a B-tree, a balanced multiway search tree.

Page 6: Reasons for the emergence of GIS and data mining

• Abundance of data
• Inefficiency of traditional information-processing technology
• Progress in computer technology, including data structures, database management, computer graphics, artificial intelligence, etc.
• Growing user awareness and demand
• Interdisciplinary approach, such as:
  – GIS: geography, computer science, forestry, land surveying, military applications
  – Data mining: statistics, computer science, marketing, quality control, medicine

Page 7: Major steps in developing new technology (1970s)

• GIS
  – Data collection
  – Question example: “What is the forest stand type in a given polygon?”
  – Data delivery:
    • Retrospective and static
  – Enabling technologies:
    • Mainframe computers
    • Digitizing tables
  – Major users:
    • Forestry
    • Military
    • Land registry
• Data mining
  – Data collection
  – Question example: “What was the total revenue in the last three years?”
  – Data delivery:
    • Retrospective and static
  – Enabling technologies:
    • Mainframe computers
    • Tapes, disks

Page 8: Major steps in developing new technology (1980s)

• GIS
  – Data access
  – Question example: “Where is the most suitable animal habitat?”
  – Data delivery:
    • Retrospective and dynamic at feature level
  – Enabling issues:
    • Vector topology
    • Raster data structure
    • DBMS
  – Major users:
    • Geology
    • Environment
    • Government
• Data mining
  – Data access
  – Question example: “What were unit sales in the Maritimes last April?”
  – Data delivery:
    • Retrospective and dynamic at record level
  – Enabling technologies:
    • RDBMS
    • SQL
    • ODBC

Page 9: Major steps in developing new technology (1990s)

• GIS
  – Data modeling and analysis
  – Question example: “What are the changes in forest cover in a given area?”
  – Data delivery:
    • Retrospective and dynamic at multiple levels
  – Enabling issues:
    • Vector/raster integration
    • GPS
    • SQL
    • Interoperability
    • Portable computers
  – Major users:
    • Corporations
    • Municipalities
    • Education
• Data mining
  – Data warehousing and decision support
  – Question example: “What were unit sales in the Maritimes last April? Drill down to Halifax”
  – Data delivery:
    • Retrospective and dynamic at multiple levels
  – Enabling technologies:
    • OLAP
    • Data warehouses
    • Portable computers

Page 10: Major steps in developing new technology (emerging today)

• GIS
  – Deployment of geographical information
  – Question example: “How do I get to the closest restaurant?”
  – Data delivery:
    • Prospective and proactive
  – Enabling issues:
    • LBS
    • Internet mapping
    • Geodatabases
  – Major users:
    • Communication
    • Business
    • General public
• Data mining
  – Data mining
  – Question example: “What is likely to happen to Halifax unit sales next month, and why?”
  – Data delivery:
    • Prospective and proactive
  – Enabling technologies:
    • Distributed algorithms and databases
    • Multiprocessor computers
    • Massive databases

Page 11: Ten hottest jobs and jobs that will disappear (Time magazine, May 22, 2000)

• Hottest jobs:
  1. Tissue engineers
  2. Gene programmers
  3. Frankenfood monitors
  4. Pharmers
  5. Data miners: “Research gurus will be on hand to extract useful tidbits from mountains of data, pinpointing behavior patterns for marketers and epidemiologists alike”
• Jobs that will disappear:
  1. Stockbrokers, auto dealers, mail carriers, insurance and real estate agents
  2. Teachers
  3. Printers
  4. Stenographers
  5. CEOs
  6. Orthodontists
  7. Prison guards
  8. Truckers
  9. Housekeepers
  10. Fathers (?)

Page 12: Broad and narrow definition of data mining

• The broad definition of data mining refers to traditional statistical methods (“we are all data miners”).
• The narrow definition of data mining refers to automated methods, artificial intelligence, and machine learning techniques.

Page 13: Data mining and GIS as a technology, science, and art

• GIS and data mining as technologies:
  – Originated in and stimulated by computer technology
  – Dealing with massive databases
  – Employing graphics
• GIS and data mining as sciences:
  – Having specific methods
  – Using specialized tools
  – Trying to develop their own methodology
  – Having an interdisciplinary character
• GIS and data mining as arts:
  – Requiring technical experience
  – Requiring experience in the content domain

Page 14: Domains and scales of GIS and data mining applications

• The list of GIS and data mining applications is very extensive, and both technologies can be applied to practically any domain. Because neither can be defined by listing its most typical applications, both can be considered domain-free.
• GIS and data mining are also scale-free, since they can be applied at many different scales. GIS has been used for mapping a human eye and for analyzing changes at a global or even cosmic scale. Data mining is used for diagnosing a single patient and for international analyses.

Page 15: Too much information?

• Can GIS and data mining help to handle the problem of information overload?
  – 61% of managers believe that information overload is present in their own workplace
  – 80% believe the situation will get worse
  – Over 50% of managers ignore data in their current decision-making process because of information overload
  – 84% of managers store this information for the future; it is not used for current analysis
  – 60% believe that the cost of gathering information outweighs its value

(Kantardzic, 2003)

Page 16: CRISP

• CRISP (Cross-Industry Standard Process for Data Mining, also written CRISP-DM) is a general data mining protocol developed in the late 1990s.
• CRISP is similar to the product life cycle methodology developed in software engineering and used in managing GIS projects.
• CRISP consists of six phases:
  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Page 17: Mining geographical information

• Components of Geographic Information Systems:
  – Data input
  – Data manipulation
  – Analysis and modeling
  – Data output
• Steps in data mining:
  – Problem understanding
  – Data pre-processing
  – Modeling
  – Evaluation
  – Deployment

Page 18: Comparing data mining and GIS

• Data mining
  – Operates in a multidimensional abstract space
  – Hypotheses are generated by machine learning
  – Results go beyond the content of the database
• GIS
  – Operates in geographical space
  – Hypotheses are generated by users
  – Difficulties in mapping multivariate dependencies

Page 19: Business understanding

• Determine business objectives
  – Background
  – Business objectives
  – Business success criteria
• Assess situation
  – Inventory of resources
  – Requirements, assumptions and constraints
  – Risks and contingencies
  – Terminology
  – Costs and benefits
• Determine data mining goals
  – Data mining goals
  – Data mining success criteria
• Produce project plan
  – Project plan
  – Initial assessment of tools and techniques

Page 20: Data understanding

• Collect initial data
  – Initial data collection report
• Describe data
  – Data description report
• Explore data
  – Data exploration report
• Verify data quality
  – Data quality report

Page 21: Data preparation

• Overall outputs
  – Data set
  – Data set description
• Select data
  – Rationale for inclusion / exclusion
• Clean data
  – Data cleaning report
• Construct data
  – Derived attributes
  – Generated records
• Integrate data
  – Merged data
• Format data
  – Reformatted data

Page 22: Modeling

• Select modeling techniques
  – Modeling technique
  – Modeling assumptions
• Generate test design
  – Test design
• Build model
  – Parameter settings
  – Models
  – Model description
• Assess model
  – Model assessment
  – Revised parameter settings

Page 23: Evaluation

• Evaluate results
  – Assessment of data mining results with respect to business success criteria
  – Approved models
• Review process
  – Review of process
• Determine next steps
  – List of possible actions
  – Decision

Page 24: Deployment

• Plan deployment
  – Deployment plan
• Plan monitoring and maintenance
  – Monitoring and maintenance plan
• Produce final report
  – Final report
  – Final presentation
• Review project
  – Experience documentation

Page 25: Integrating GIS and data mining

• There are numerous areas where GIS and data mining already overlap.
• Both GIS and data mining are synergetic, powerful, dynamic, and rapidly developing technologies.
• The integration of GIS and data mining has already begun.
• Further integration can significantly benefit both technologies.

Page 26: Integration benefits to GIS

• GIS can benefit from being integrated with data mining by using:
  – More efficient data manipulation tools
  – Specialized Exploratory Data Analysis tools
  – Powerful new modeling tools
  – Better visualization tools

Page 27: More efficient data manipulation tools

• Data manipulation tools are a core area of data mining. They are important but not critical in GIS, since non-spatial attributes can always be manipulated outside GIS.
• Data cleansing is one of the most time-consuming and costly operations within GIS projects.
• Below are some examples of typical data mining operations that can be done with specialized data manipulation tools:
  – Detecting and replacing missing data
  – Improving attribute accuracy
  – Handling inconsistency in databases
  – Data reclassification
  – Merging attributes and appending records
  – Filtering data
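Two of the operations listed above, replacing missing data and reclassification, can be sketched in a few lines. This is a minimal illustration, not the behavior of any particular data mining package. Mean imputation and the break-value scheme are assumptions chosen for simplicity, and the function names are hypothetical.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def reclassify(value, breaks, labels):
    """Map a number to a class label.

    breaks are ascending upper bounds; labels has one more entry than
    breaks, the last one covering everything above the final break.
    """
    for upper, label in zip(breaks, labels):
        if value <= upper:
            return label
    return labels[-1]


incomes = fill_missing([10, None, 20])
classes = [reclassify(v, [12, 18], ["low", "medium", "high"]) for v in incomes]
print(incomes, classes)
```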

Page 28: Exploratory Data Analysis tools

• Exploratory Data Analysis (EDA) tools have been very commonly used in GIS analysis in the last ten years.
• Exploratory Data Analysis is usually the very first step in any spatial analysis.
• Data mining provides very specialized EDA tools for such operations as:
  – Outlier analysis
  – Testing normality
  – Analyzing distributions with boxplots and Q-Q plots
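As an illustration of outlier analysis, the sketch below applies the usual boxplot rule: values more than 1.5 interquartile ranges beyond the quartiles are flagged. The 1.5 multiplier and the quartile method (Python's default "exclusive" method) are conventions assumed here, not something the slides specify.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the values lying beyond the boxplot whiskers."""
    q1, _median, q3 = quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


print(boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```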

Page 29: Exploratory Data Analysis in data mining

Page 30: Powerful new modeling tools

• Data mining offers numerous powerful modeling tools that are not yet implemented in GIS, such as:
  – Decision trees and decision rules
  – Association rules
  – Artificial neural networks
  – Genetic algorithms
• Some data mining tools are partially implemented in some GIS, including:
  – Fuzzy logic
  – Cluster analysis

Page 31: New visualization tools

• Visualization tools play a critical role in GIS, and they are also very important in data mining.
• GIS focuses primarily on mapping tools that use spatial attributes and employ the art of cartography.
• Data mining focuses on charting and graphing non-spatial attributes using statistical methods.

Page 32: Visualization tool in data mining: cluster viewer

Page 33: Integration benefits to data mining

• Data mining can especially benefit from being integrated with GIS at the following phases of the CRISP methodology:
  – Data preparation
  – Analysis
  – Evaluation
  – Deployment

Page 34: Data preparation

• Data preparation is a critical component in both technologies.
• Spatial (geographically referenced) attributes are very common in databases analyzed with data mining.
• However, with data mining alone, many typical operations on spatial attributes cannot be performed at all.
• GIS can provide tools for such operations as, for example:
  – Spatial referencing
  – Geocoding
  – Building topological relationships among objects

Page 35: Deriving new attributes

• GIS can be very useful in expanding the number of attributes for further analysis by deriving new attributes.
• These attributes can be derived from:
  – Geographical (metric) information
  – Topological information

Page 36: Deriving geographical (metric) attributes

• Length of lines
• Areas of polygons
• Distance to the closest object
• Directions
• Density of features
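Three of these metric attributes can be derived with elementary geometry. The sketch below assumes planar (projected) coordinates, so distances are Euclidean. Geographic latitude/longitude data would need a projection or geodesic formulas first. The function names are hypothetical.

```python
from math import hypot


def line_length(points):
    """Length of a polyline given as a list of (x, y) vertices."""
    return sum(
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


def polygon_area(ring):
    """Area of a simple polygon via the shoelace formula.

    ring lists the vertices once; the closing edge is implied.
    """
    n = len(ring)
    twice_area = sum(
        ring[i][0] * ring[(i + 1) % n][1] - ring[(i + 1) % n][0] * ring[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


def distance_to_closest(point, others):
    """Distance from point to the nearest of the other point features."""
    px, py = point
    return min(hypot(px - x, py - y) for x, y in others)


print(line_length([(0, 0), (3, 4), (3, 10)]))
print(polygon_area([(0, 0), (4, 0), (4, 3), (0, 3)]))
print(distance_to_closest((0, 0), [(3, 4), (6, 8)]))
```

Each result can be written back to the attribute table as a new column and then used as a predictor in a data mining model.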

Page 37: Deriving topological attributes

• Connectivity of nodes
• Adjacency of polygons
• Information resulting from such topological operations as, for example:
  – Inside
  – Within
  – Intersects
  – Contains
  – Covers
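A topological attribute such as "inside" can be derived as a simple true/false column. The sketch below uses the standard ray-casting test for a point and a simple polygon in planar coordinates. It is an illustration only: points lying exactly on the boundary are not handled consistently, and real GIS software applies more careful rules.

```python
def point_in_polygon(point, ring):
    """Ray-casting test: True if point lies inside the polygon ring.

    ring lists the vertices once; the closing edge is implied.
    """
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # each crossing to the right toggles
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square), point_in_polygon((5, 2), square))
```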

Page 38: Analysis

• Modeling and analysis are the most powerful components of both data mining and GIS.
• However, the two technologies are complementary in their approach to modeling. GIS provides more specialized spatial analysis tools, whereas data mining provides mainly statistical analysis tools.
• Data mining lacks numerous tools from the domain of GIS.

Page 39: Missing geographical analytical tools in data mining

• Spatial statistics
  – For example: spatial multiple linear regression
• Spatial analysis
  – For example: spatial autocorrelation
• Geostatistics
  – For example: kriging or trend surface analysis
• Network analysis
  – For example: optimal path or minimal tour
• Surface analysis
  – For example: visibility analysis
• Location-allocation modeling
  – For example: allocating demand to a given center
• Regionalization (spatial clustering)
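Spatial autocorrelation, one of the tools listed above, is commonly measured with Moran's I. The slides do not name a specific statistic, so treating Moran's I as the measure is an assumption here. The weights matrix is a hypothetical binary adjacency matrix (1 where two regions are neighbors).

```python
def morans_i(values, weights):
    """Moran's I for values observed in n regions.

    weights[i][j] is the spatial weight between regions i and j,
    with zeros on the diagonal.
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    weight_sum = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_term = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / variance_term)


# Four regions in a row: each is adjacent to its immediate neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i([1, 2, 3, 4], chain))     # smooth gradient: positive
print(morans_i([1, -1, 1, -1], chain))   # alternating pattern: negative
```

Applying the same function to model residuals instead of raw values gives a quick check on whether a non-spatial model has left spatial structure unexplained.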

Page 40: Evaluation

• Evaluation is a required step in data mining, whereas in GIS it is recommended rather than strictly required.
• GIS offers invaluable tools for evaluating residuals (the differences between actual and predicted values), especially for:
  – Mapping residuals
  – Analyzing the spatial autocorrelation of residuals

Page 41: Deployment

• GIS provides mapping tools that are absent from traditional data mining. Used for mapping results, these tools can enhance the deployment phase of data mining.

Page 42: Spatial data mining resources

• The leading academic centers developing spatial data mining are:
  – University of Utah (USA)
  – Southern Illinois University (USA)
  – Boston University (USA)
  – Simon Fraser University (Canada)
  – University of Leeds (England)
  – University of Munich (Germany)
  – University of Bari (Italy)
  – Russian Academy of Sciences

Page 43: Spatial data mining software

• GeoMiner is a prototype of a spatial data mining system, including a spatial database server.
• Spin! (Spatial Mining for Data of Public Interest) is a Web-based integration of data mining and GIS for such applications as public health, environmental protection, seismology, or marketing. This European product includes live Oracle-based queries and data visualization.

Page 44: Visual programming and streams

• Visual streams, constructed from single operations and linked with visual programming, are a common contemporary user interface.
• Some data mining software packages implemented visual programming at the end of the 1990s.
• Geoprocessing model-building streams were implemented in ArcGIS version 8.0.
• Will these streams ever be linked together?

Page 45: Basic data mining operations (1)

• Source operations:
• Record operations:
• Field operations:

Page 46: Basic data mining operations (2)

• Graph operations:
• Modeling operations:
• Output operations:

Page 47: Clustering stream: Kohonen, K-Means, and TwoStep algorithms

Page 48: Data mining modeling tools

• Predictive tools
  – Neural networks
  – Multiple linear regression
  – Logistic regression
  – Prediction using the C5.0 rule-based algorithm
• Rule-based tools
  – C5.0
  – CR&T (Classification and Regression Trees)
  – Association rules
    • Apriori
    • GRI (Generalized Rule Induction)
• Classification tools
  – K-Means clustering
  – Kohonen network
  – TwoStep clustering

Page 49: Neural networks modeling

• Purpose: to predict a numeric or categorical target variable
• Output:
  – Predicted value
  – Residuals (actual minus predicted values)
  – Rules
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals
  – Rules

Page 50: Rule induction modeling

• Purpose: to predict a categorical target variable
• Algorithm: C5.0
• Output:
  – Importance of predictors
  – Predicted value
  – Residuals
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals

Page 51: Multiple linear regression

• Purpose: to predict a numerical target variable using numerical predictors
• Output:
  – Set of predictors
  – Predicted target variable
  – Residuals
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals
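The outputs listed above (coefficients for the predictors, predicted values, and residuals) can be reproduced with an ordinary least squares fit. The sketch below solves the normal equations with plain Gaussian elimination. It is a teaching illustration under the assumption of a small, well-conditioned data set, not the procedure used by any specific data mining package, and all function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear(rows, target):
    """Least squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, target)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def residuals(coefs, rows, target):
    """Actual minus predicted values, the quantity that can be mapped."""
    return [t - predict(coefs, row) for row, t in zip(rows, target)]


rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
target = [1 + 2 * a + 3 * b for a, b in rows]   # noise-free toy data
coefs = fit_linear(rows, target)
print(coefs, residuals(coefs, rows, target))
```

Joining the residuals back to the spatial features by record identifier is what makes them mappable in a GIS.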

Page 52: Prediction stream: numeric target variable

Page 53: Predicting with neural networks: numeric target variable

Page 54: Predicting with neural networks: residuals

Page 55: Predicting with multiple linear regression

Page 56: Predicting with multiple linear regression: residuals

Page 57: Generating rules

• Purpose: to perform rule induction or to discover association rules
• Algorithms:
  – C5.0 (categorical target variables, categorical or numerical predictors)
  – Apriori (categorical target variables and predictors)
  – GRI (Generalized Rule Induction; categorical target variables, categorical or numerical predictors)
• Output:
  – Rules for groups of records, including their frequency and accuracy
• Can be mapped:
  – Geographical distribution of rules
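The frequency and accuracy reported for a rule can be computed directly from the records. The sketch below assumes these correspond to the standard support and confidence measures of association rules, which is a reading of the slide rather than something it states. The records and attribute values are made up for illustration.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Each record is a set of categorical items.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= set(r))
    n_both = sum(1 for r in records if (antecedent | consequent) <= set(r))
    support = n_both / len(records)                       # rule frequency
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


records = [
    {"urban", "high_income"},
    {"urban", "high_income"},
    {"urban", "low_income"},
    {"rural", "low_income"},
]
print(rule_metrics(records, {"urban"}, {"high_income"}))
```

Tagging each record with the rules it satisfies gives a column that can be joined to spatial features and mapped.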

Page 58: Rule induction modeling: numeric target variable

Page 59: Examples of rules for the target variable: high GDP

Page 60: Logistic regression

• Purpose: to predict a categorical target variable using categorical and numerical predictors
• Output:
  – Set of predictors
  – Predicted target variable
  – Residuals
• Can be mapped:
  – Actual target variable
  – Predicted target variable
  – Residuals
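For a two-class target, logistic regression can be sketched with batch gradient descent on the log-loss. This is an illustration only: the learning rate, the epoch count, and the definition of a residual as the actual class (0 or 1) minus the predicted probability are assumptions, and production tools use more robust fitting methods.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict_proba(weights, row):
    """Predicted probability of class 1; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], row)))


def fit_logistic(rows, labels, rate=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the mean log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            error = predict_proba(weights, row) - y
            gradient[0] += error
            for i, x in enumerate(row):
                gradient[i + 1] += error * x
        weights = [w - rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


rows = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [0, 0, 0, 1, 1, 1]
weights = fit_logistic(rows, labels)
residuals = [y - predict_proba(weights, row) for row, y in zip(rows, labels)]
print(weights, residuals)
```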

Page 61: Geographic Dimension in Data Mining - Amazon S3 Dimension in Data Mining ... • Data mining – Data warehousing and decision ... • Regionalization (spatial clustering)

Stream with rules:Stream with rules:C5, C5, AprioriApriori, and GRI algorithms, and GRI algorithms

Supernodes

Page 62: Geographic Dimension in Data Mining - Amazon S3 Dimension in Data Mining ... • Data mining – Data warehousing and decision ... • Regionalization (spatial clustering)

Stream with rules Stream with rules inside inside supernodesupernode

Page 63

Prediction stream: categorical target variable

Page 64

Predicting with neural networks: categorical target variable

Page 65

Predicting with logistic regression

Page 66

Clustering

• Purpose: to group records into clusters
• Algorithms:
– Kohonen network
– K-Means
– TwoStep
• Output:
– Cluster memberships
– Cluster description
– Distance to cluster centroids
• Can be mapped:
– Cluster memberships
– Most typical features for each cluster
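As an illustration of the clustering outputs, here is a minimal K-Means sketch (plain Lloyd's algorithm). The points and the choice of starting centroids are assumptions made for the example; Kohonen networks and TwoStep work differently and are not shown.

```python
import math

# Hypothetical records: two numerical attributes for each of eight regions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (2.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0), (9.5, 8.0)]

def k_means(points, centroids, max_iterations=100):
    """Return final centroids, cluster memberships, and distances to centroids."""
    centroids = list(centroids)
    memberships = []
    for _ in range(max_iterations):
        # Assignment step: each record joins its nearest centroid.
        new_memberships = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_memberships == memberships:
            break  # memberships stopped changing, so the algorithm has converged
        memberships = new_memberships
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, m in zip(points, memberships) if m == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    distances = [math.dist(p, centroids[m]) for p, m in zip(points, memberships)]
    return centroids, memberships, distances

centroids, memberships, distances = k_means(points, centroids=[points[0], points[4]])
```

`memberships` is the value that would be mapped per region; `distances` indicates how typical each region is of its cluster.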

Page 67

Clustering: K-Means algorithm

Page 68

Clustering: similarities to cluster centroids

Page 69

Factor analysis / PCA (Principal Components Analysis)

• Purpose: to reduce the number of variables by replacing individual variables by factors / components
• Output:
– Extracted factors
– Loadings of variables on factors / components
– Factor / component scores
• Can be mapped:
– Geographical distribution of scores
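The idea of component scores can be sketched without any statistics library. The example below is only an illustration under stated assumptions: the data are invented, the analysis uses the covariance matrix (packages often use the correlation matrix instead), only the first component is extracted, and the extraction uses power iteration. The returned `weights` are the unit-length component coefficients, not loadings rescaled by the eigenvalue.

```python
import math

# Hypothetical table: rows are regions, columns are three numerical variables.
data = [
    [2.0, 4.1, 1.0],
    [3.0, 6.0, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.2, 0.8],
    [6.0, 11.8, 1.2],
]

def first_principal_component(rows, iterations=200):
    """Return (weights, scores) of the first component via power iteration."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    vector = [1.0] * m
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize to unit length.
        product = [sum(cov[i][j] * vector[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    # Component score of each region: projection of its centered row on the weights.
    scores = [sum(r[j] * vector[j] for j in range(m)) for r in centered]
    return vector, scores

weights, scores = first_principal_component(data)
```

`scores` holds one value per region and is the quantity that would be shown on a map of component scores.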

Page 70

Principal Components Analysis: component scores

Page 71

Principal Components Analysis: scores of all five components

Page 72

Classification tree modeling

• Purpose: to pick individual predictors one at a time and classify them to optimize (minimize or maximize) a target variable
• Algorithms:
– Classification and Regression Tree
– CHAID
– Exhaustive CHAID
– QUEST
• Output:
– Top predictors
– Groups of records (nodes)
– Rules
• Can be mapped:
– Geographical distribution of rules
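Picking predictors "one at a time" can be illustrated by the first split of a tree. The sketch below assumes made-up records and uses Gini impurity, the criterion associated with Classification and Regression Trees; CHAID and QUEST rely on statistical tests instead and are not shown.

```python
from collections import Counter

# Hypothetical records: categorical predictors and a categorical target.
records = [
    {"coastal": "yes", "urban": "yes", "target": "high"},
    {"coastal": "yes", "urban": "no",  "target": "high"},
    {"coastal": "yes", "urban": "yes", "target": "high"},
    {"coastal": "no",  "urban": "yes", "target": "low"},
    {"coastal": "no",  "urban": "no",  "target": "low"},
    {"coastal": "no",  "urban": "yes", "target": "high"},
]

def gini(rows):
    """Gini impurity of the target values in a group of records."""
    counts = Counter(r["target"] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def split_impurity(rows, predictor):
    """Weighted Gini impurity of the groups (nodes) produced by one predictor."""
    groups = {}
    for r in rows:
        groups.setdefault(r[predictor], []).append(r)
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

predictors = ["coastal", "urban"]
impurities = {p: split_impurity(records, p) for p in predictors}
best = min(predictors, key=impurities.get)  # top predictor for the first split
```

Repeating the same selection inside each resulting node grows the tree; each path from the root to a leaf reads as a rule whose matching records can be mapped.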

Page 73

Classification tree and corresponding map

Page 74

OLAP cubes

• OLAP stands for On-Line Analytical Processing
• OLAP tools allow the user to query, browse, and summarize information in a very efficient, interactive, and dynamic way
• OLAP tools represent a vital component of both Business Intelligence and data mining technology. They provide an aggregated approach to analyzing large amounts of detailed data.

Page 75

Why cubes?

• OLAP databases are often referred to as “cubes” because of their multidimensional nature. A cube is a visual representation of a multidimensional table and has just three dimensions: rows, columns, and layers.

• OLAP cubes are very flexible because they allow the user to move information between these three dimensions.

• Users can have multiple cubes for their business data: one cube for customers, one for sales, one for production, one for geography, etc.

Page 76

Basic operations with OLAP cubes

• The following are basic operations that can be performed with OLAP cubes:
• Slice
• Dice
• Roll-up
• Drill-down
• Pivoting

Page 77

Slicing OLAP cubes

• The slice operation is based on selecting one dimension and focusing on a portion of a cube. For example, the following table presents seven statistics for one variable.

Page 78

Dicing OLAP cubes

• The dice operation creates a sub-cube by focusing on two or more dimensions.
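Slice and dice can be sketched on a tiny in-memory cube. The cube layout (a dictionary keyed by county, year, and product), the values, and the function names are assumptions made for illustration and do not correspond to any particular OLAP product.

```python
# Hypothetical cube: (county, year, product) -> sales.
cube = {
    ("Annapolis", 2003, "maps"): 10, ("Annapolis", 2003, "data"): 4,
    ("Annapolis", 2004, "maps"): 12, ("Annapolis", 2004, "data"): 6,
    ("Kings", 2003, "maps"): 7,      ("Kings", 2003, "data"): 3,
    ("Kings", 2004, "maps"): 9,      ("Kings", 2004, "data"): 5,
}

def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value and drop it from the keys."""
    return {key[:axis] + key[axis + 1:]: v
            for key, v in cube.items() if key[axis] == value}

def dice_cube(cube, allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets, per axis."""
    return {key: v for key, v in cube.items()
            if all(key[axis] in values for axis, values in allowed.items())}

sales_2004 = slice_cube(cube, axis=1, value=2004)       # a two-dimensional table
sub_cube = dice_cube(cube, {0: {"Kings"}, 2: {"maps"}})  # still a three-dimensional cube
```

The slice returns a table with one fewer dimension, while the dice keeps all three dimensions but restricts two of them.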

Page 79

Rolling-up OLAP cubes

• Roll-up, also called aggregation or dimension reduction, allows the user to move to a higher aggregation level. For example, instead of aggregating data by county, the user can aggregate at the province level.

Page 80

Drilling-down OLAP cubes

• The drill-down operation is the reverse of a roll-up and represents the situation when the user moves down the hierarchy of aggregation, applying a more detailed grouping.
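Roll-up and drill-down along a geographic hierarchy can be sketched as follows. The county-to-province hierarchy is encoded directly in the keys, and the numbers are placeholders invented for the example, not real statistics.

```python
from collections import defaultdict

# Hypothetical detailed data: (province, county) -> value (placeholder numbers).
county_values = {
    ("Nova Scotia", "Annapolis"): 21000,
    ("Nova Scotia", "Kings"): 59000,
    ("New Brunswick", "Albert"): 27000,
    ("New Brunswick", "Kent"): 31000,
}

def roll_up(values):
    """Roll-up: aggregate county-level values to the province level."""
    totals = defaultdict(int)
    for (province, _county), value in values.items():
        totals[province] += value
    return dict(totals)

def drill_down(values, province):
    """Drill-down: return the county-level detail behind one province total."""
    return {county: v for (prov, county), v in values.items() if prov == province}

by_province = roll_up(county_values)
nova_scotia_detail = drill_down(county_values, "Nova Scotia")
```

On a map, the roll-up corresponds to switching from a county layer to a province layer, and the drill-down to zooming from one province into its counties.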

Page 81

Pivoting OLAP cubes

• Pivoting, or rotation, changes the perspective in presenting the data to the user.
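A pivot only rearranges which dimension is shown as rows and which as columns; the values themselves do not change. A minimal sketch, assuming a hypothetical two-dimensional view stored as nested dictionaries:

```python
# Hypothetical view: rows are counties, columns are years.
table = {
    "Annapolis": {2003: 14, 2004: 18},
    "Kings": {2003: 10, 2004: 14},
}

def pivot(table):
    """Swap the row and column dimensions of a nested-dict table."""
    rotated = {}
    for row_key, columns in table.items():
        for column_key, value in columns.items():
            rotated.setdefault(column_key, {})[row_key] = value
    return rotated

by_year = pivot(table)  # rows are now years, columns are counties
```

Pivoting twice returns the original view, which is a quick check that no information is lost by the rotation.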

Page 82

OLAP cubes and spatial warehouses

• So far, the integration of OLAP technology and geographical analysis has been very limited. Most of the integration has taken place in building and using so-called spatial warehouses for visualization of results. The methodology based on similarities between OLAP cubes and map cubes is very promising.

• Whereas non-spatial warehouses utilize OLAP cubes, mostly as summary tables or spreadsheets, the spatial warehouses provide map cubes (collections of maps).

Page 83

OLAP cubes vs. map cubes

• The different nature of OLAP cubes and map cubes also results from different types of aggregation operations.

• Such statistics as arithmetic mean, median, mode, standard deviation, minimum, maximum, count, or sum have their equivalents in so-called map algebra for raster data types.

• In addition, spatial queries utilizing geometric operators (area, perimeter, centroid) and topological operators (inside, within, intersect, contains, connects, borders) significantly supplement non-spatial SQL queries.
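The raster equivalent of an aggregate statistic can be sketched as a cell-by-cell ("local") operation across a stack of aligned layers. The grids below are invented, and the function name is a placeholder rather than the API of any GIS package.

```python
# Hypothetical raster layers on the same 2 x 3 grid (for example, one per year).
layers = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[3.0, 2.0, 1.0], [6.0, 5.0, 4.0]],
    [[2.0, 5.0, 2.0], [2.0, 5.0, 8.0]],
]

def local_statistic(layers, statistic):
    """Apply a statistic cell by cell across a stack of aligned raster layers."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [[statistic([layer[r][c] for layer in layers]) for c in range(cols)]
            for r in range(rows)]

mean_map = local_statistic(layers, lambda cells: sum(cells) / len(cells))
max_map = local_statistic(layers, max)
```

Where an OLAP cube would return one mean or maximum per group of records, the map cube returns a whole output layer with one value per cell.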

Page 84

Integrating OLAP with GIS

• A bridge between OLAP and GIS should allow OLAP users, who deal with geographical data, to display OLAP cubes as maps using GIS.

• The foundation for going in the opposite direction (from GIS to OLAP) should allow GIS users to create SQL queries with geometric and topological operators and then pass this information to OLAP cubes.

• The final step of the integration would be allowing the user to interactively browse OLAP cubes and to simultaneously view the results on maps. Also, the user should be able to query a map and view corresponding data within OLAP cubes.

Page 85

Two disjoint technologies

• Today, GIS and data mining are still used as separate technologies.

• If GIS and data mining software packages operate under the same operating system, data can be passed easily but still indirectly between such packages.

• The recent emphasis on interoperability in GIS should be extended beyond GIS technology.

Page 86

The most typical sequence of operations

• The most typical sequence of operations using GIS and data mining is as follows:
– Data preparation including data cleansing (DM)
– Data preparation including deriving new geographical attributes (GIS)
– Spatial analysis (GIS)
– Modeling (DM)
– Validation (DM)
– Mapping initial results and spatial validation (GIS)
– Charting and interpreting results (DM)
– Mapping final results (GIS)

Page 87

Future directions

• The future directions in spatial data mining can be summarized as follows (after Koperski, 1997):
– Integrating artificial intelligence and GIS
– Data mining using spatial object-oriented databases
– Query language for spatial data mining
– Creating multidimensional spatial rules
– Mining under uncertainty
– Spatial clustering
– Visualization using multivariate thematic maps
– Parallel data mining
– Mining the presence of topological and geometric errors
– Mining the remote sensing data
– Mining spatiotemporal databases

Page 88

Selected bibliography

• CRISP-DM 1.0, 1999. SPSS.
• Dramowicz K., 2002. Adding Geography to Data Mining. Data Mining Summit, Reston, VA.
• Dunham M.H., 2003. Data Mining: Introductory and Advanced Topics. Prentice Hall.
• Eklund P.W. et al., 1998. Data Mining and Soil Salinity Analysis. International Journal of Geographical Information Science, 12, pp. 247-268.
• Ester M. et al., 1998. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. Data Mining and Knowledge Discovery, 4, 2/3, pp. 193-216.
• Gahegan M., 2000. On the Application of Inductive Machine Learning Tools to Geographical Analysis. Geographical Analysis, 2, pp. 113-139.
• Kantardzic M., 2003. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley.
• Koperski K. et al., 1997. Spatial Data Mining: Progress and Challenges. http://db.cs.sfu.ca/GeoMiner/survey/html/survy.html
• Koperski K., J. Han, 1995. Discovery of Spatial Association Rules in Geographic Information Databases. [In:] Egenhofer M., J. Herring (eds.) Advances in Spatial Databases. Springer-Verlag, pp. 47-66.
• Miller H.J. and J. Han (eds.), 2001. Geographic Data Mining and Knowledge Discovery. Taylor and Francis.
• Openshaw S., 1999. Geographic Data Mining: Key Design Issues. 4th International Conference on GeoComputation.

