Spatial Congeries Pattern Mining Presented by: Iris Zhang Supervisor: Dr. David Cheung 24 October...

Spatial Congeries Pattern Mining

Presented by: Iris Zhang

Supervisor: Dr. David Cheung

24 October 2003

Outline: Introduction, Motivation, Related work, Formal definition, Algorithms, Experiments, Conclusion

Introduction

KDD: Discovery of interesting, implicit, and previously unknown knowledge from large databases [FPM91]

Spatial data mining: Extraction of implicit knowledge, spatial relations, or other patterns not explicitly stored in spatial databases [KH95]

Features of Spatial Data Mining

Spatial autocorrelation: Everything is related to everything else, but nearby things are more related than distant things (Tobler, 1979)

Spatial heterogeneity: The variation in spatial data is a function of location

Motivation: A famous historical example

In 1909, the residents of Colorado Springs were found to have healthy teeth, and the local drinking water was found to have a high level of fluoride. Researchers later confirmed the positive role of fluoride in controlling tooth decay.

{healthy teeth, high level of fluoride}

Motivation (Cont'd): Another case

[HSX02]

Related work: Neighboring Class Sets Mining; Co-location Pattern Mining

Neighboring Class Sets: Access records of mobile services

ID    Position        Services    …
xxx   (14975,27020)   Weather     …
xxx   (16723,24301)   Timetable   …
xxx   (15521,26441)   Ticket      …
xxx   …               …           …
xxx   (14737,26752)   Timetable   …

Neighboring Class Sets: Neighboring class sets

((timetable,ticket),4),

((timetable,weather),3),

((ticket,weather),2),

((timetable,ticket,weather),2)

[Mor01]

Neighboring Class Sets: Grouping of points [Mor01]

Neighboring Class Sets: Apriori generation of valid instances [Mor01]

Problems

Undercounts the number of instances

Depends on the order of classes when generating instances for a k-neighboring class set (k>2)

Requires an absolute number as the support threshold

Co-location Patterns Mining

Co-location: a subset of Boolean features

E.g.: (drought, El Niño, substantial increase in vegetation, extremely high precipitation)

Co-location Patterns Mining

Row instance I = {i1,i2,…,ik} of a co-location C = {f1,f2,…,fk}: ij is an instance of fj (j = 1,2,…,k), and ip and iq are neighbors of each other. E.g.: (A.1,B.1) is a row instance of co-location {A,B}

Table instance T of C: the set of all row instances of C. E.g.: {(A.1,B.1), (A.2,B.4), (A.3,B.4)} is the table instance of {A,B}

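Co-location Patterns Mining

Participation ratio Pr(C,fi) for feature fi: the fraction of instances of fi that appear in some row instance of C

Pr({A,B},A)=3/4=75%, Pr({A,B},B)=2/5=40%

Participation index Pi(C) of a co-location C: the minimum participation ratio over the features of C

Pi({A,B})=min(0.75,0.4)=0.4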
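The two measures above can be reproduced from the table instance on the earlier slide. This is a minimal sketch, not the authors' code: the function names are illustrative, and the instance totals (4 for A, 5 for B) are taken from the denominators of the slide's ratios.

```python
from fractions import Fraction

# Table instance of co-location {A, B} from the slide: each row pairs one
# instance of A with a neighboring instance of B.
table_instance = [("A.1", "B.1"), ("A.2", "B.4"), ("A.3", "B.4")]

# Total number of instances per feature, read off the slide's ratios
# (3/4 for A and 2/5 for B).
total_instances = {"A": 4, "B": 5}
features = ("A", "B")


def participation_ratio(table, features, totals, feature):
    """Fraction of the feature's instances that occur in some row instance."""
    column = features.index(feature)
    distinct = {row[column] for row in table}
    return Fraction(len(distinct), totals[feature])


def participation_index(table, features, totals):
    """Minimum of the ratios over all features of the co-location."""
    return min(participation_ratio(table, features, totals, f) for f in features)


pr_a = participation_ratio(table_instance, features, total_instances, "A")
pr_b = participation_ratio(table_instance, features, total_instances, "B")
pi_ab = participation_index(table_instance, features, total_instances)
print(float(pr_a), float(pr_b), float(pi_ab))  # 0.75 0.4 0.4
```

B.4 appears in two row instances but is counted once, which is why the ratio for B is 2/5 rather than 3/5.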

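Co-location Pattern Mining

Co-location rule: C1 → C2 (p, cp)

C1 and C2 are co-locations, C1 ∩ C2 = ∅

p: participation index, cp: conditional probability

E.g.: {A} → {B} (40%, 75%)

Conditional probability of a co-location rule: the fraction of row instances of C1 that appear in some row instance of C1 ∪ C2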
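The slide's formula for the conditional probability did not survive extraction. The sketch below assumes the usual definition from the co-location literature (the share of antecedent instances that take part in the joint table instance), which agrees with the 75% quoted for {A} → {B}. The function name and the instance label A.4 are hypothetical.

```python
from fractions import Fraction

# Table instance of {A, B} from the earlier slide.
table_ab = [("A.1", "B.1"), ("A.2", "B.4"), ("A.3", "B.4")]

# All instances of A. The slide only implies that there are four of them;
# the label "A.4" is an assumption made for illustration.
instances_a = ["A.1", "A.2", "A.3", "A.4"]


def conditional_probability(antecedent_instances, joint_table, column=0):
    """Share of antecedent instances that appear in the joint table instance.

    `column` is the position of the antecedent feature in each row.
    """
    participating = {row[column] for row in joint_table}
    return Fraction(len(participating), len(antecedent_instances))


cp = conditional_probability(instances_a, table_ab)
print(float(cp))  # 0.75
```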

Co-location Patterns Mining

Apriori property: the participation index is monotonically non-increasing as the size of the co-location increases

Apriori-like mining algorithm: candidate generation, then instance generation

Co-location Patterns Mining Candidate generation

Join

Prune

Co-location Patterns Mining: Instance generation

Geometric approach: R-tree join

Combinatorial approach: Sort-merge join

Hybrid approach: R-tree join to get instances of size 2 co-locations; sort-merge join to get instances of size k (k>2) co-locations
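The hybrid approach can be illustrated on a toy example. The sketch below is not the method from the slides: a brute-force distance scan stands in for the R-tree join, and a nested-loop join stands in for the sort-merge join. The coordinates, the distance threshold and all function names are hypothetical.

```python
import math
from itertools import product


def neighbor_pairs(points_a, points_b, max_dist):
    """Size 2 instances: all id pairs within max_dist (brute-force scan)."""
    pairs = []
    for (id_a, xa, ya), (id_b, xb, yb) in product(points_a, points_b):
        if math.hypot(xa - xb, ya - yb) <= max_dist:
            pairs.append((id_a, id_b))
    return pairs


def join_instances(inst_ab, inst_ac, inst_bc):
    """Size 3 instances: join (a, b) and (a, c) on a, keep if (b, c) are neighbors."""
    bc = set(inst_bc)
    return [(a, b, c)
            for (a, b) in inst_ab
            for (a2, c) in inst_ac
            if a == a2 and (b, c) in bc]


# Hypothetical points: (id, x, y).
feature_a = [("A.1", 0.0, 0.0), ("A.2", 10.0, 0.0)]
feature_b = [("B.1", 1.0, 1.0), ("B.2", 50.0, 50.0)]
feature_c = [("C.1", 2.0, 0.0)]

size2_ab = neighbor_pairs(feature_a, feature_b, max_dist=5.0)
size2_ac = neighbor_pairs(feature_a, feature_c, max_dist=5.0)
size2_bc = neighbor_pairs(feature_b, feature_c, max_dist=5.0)
size3 = join_instances(size2_ab, size2_ac, size2_bc)
print(size2_ab)  # [('A.1', 'B.1')]
print(size3)     # [('A.1', 'B.1', 'C.1')]
```

The size 3 step never touches coordinates: it reuses the size 2 instances only, which is the point of switching from the geometric to the combinatorial join after the first level.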

Co-location Patterns Mining Example

Problems

The participation index measure may overrate some co-locations

The features are binary

Pr({A,B},A)=2/8=25%

Pr({A,B},B)=6/6=100%

Pi({A,B})=min(25%,100%)=25%

{B} → {A} (25%, 100%)

{A} → {B} (25%, 25%)

Probability({A,B})=7/(8*6)≈15%

Spatial Congeries Patterns Mining

Input:

A set of datasets D = {D1,D2,…,Dn}

A spatial relation that regulates the relation of objects in patterns

A min_fre threshold that determines whether an itemset is frequent

Output: the complete set of Spatial Congeries patterns

Spatial Congeries Patterns Mining: Example of datasets

ID   Attribute                Type      Description
D1   Vegetation durability    Ordinal   Ordinal scale from 10 to 100
D2   Water depth              Numeric   In centimeters
D3   Distance to open water   Numeric   In meters
D4   Nest location            Binary    Existence or absence of bird nest

* Attribute values can be translated to categorical values

** {VD:10 WD:shallow DOP:near NL:existent} can be a pattern
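As a rough illustration of translating numeric attribute values into categorical items, the sketch below maps water depths to labels. The 30 cm cutoff, the label "deep" and the function name are assumptions; the slides do not state how the translation is done.

```python
def water_depth_category(depth_cm, shallow_max_cm=30.0):
    """Map a numeric depth to a categorical value (the cutoff is hypothetical)."""
    return "shallow" if depth_cm <= shallow_max_cm else "deep"


# Hypothetical water depth readings, in centimeters, for three points of D2.
water_depth_cm = [12.0, 85.0, 30.0]
items = [f"WD:{water_depth_category(d)}" for d in water_depth_cm]
print(items)  # ['WD:shallow', 'WD:deep', 'WD:shallow']
```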

Formal Definition

Item: an attribute value in a dataset. I is the set of all items. E.g.: water depth: shallow

Itemset: a subset of I

E.g.: VD:10 WD:shallow DOP:near NL:existent

E.g.: VD:100 WD:deep DOP:far NL:absent

Formal Definition

Spatial relation: a rule that regulates the spatial relation of objects in patterns

Instances of an item i: the points that have attribute value i

Instance of an itemset: a combination of one instance of each item in the itemset, such that these instances satisfy the spatial relation

Observation

The number of instances of an itemset is not monotonically non-increasing as the itemset grows. E.g.: one instance of {triangle, circle} can give rise to two instances of {triangle, circle, rectangle}

Conclusion: the number of instances of an itemset cannot be used as the measure to determine whether the itemset is a pattern
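The observation about instance counts can be checked on a toy layout. The sketch assumes that the spatial relation is "all points pairwise within a distance threshold", which is one possible choice; the coordinates and the function name are hypothetical.

```python
import math
from itertools import combinations, product


def count_instances(point_sets, max_dist):
    """Count combinations (one point per item) that are pairwise within max_dist."""
    count = 0
    for combo in product(*point_sets):
        if all(math.dist(p, q) <= max_dist for p, q in combinations(combo, 2)):
            count += 1
    return count


# One triangle, one circle and two rectangles, all close together.
triangles = [(0.0, 0.0)]
circles = [(1.0, 0.0)]
rectangles = [(0.0, 1.0), (1.0, 1.0)]

n_tc = count_instances([triangles, circles], max_dist=2.0)
n_tcr = count_instances([triangles, circles, rectangles], max_dist=2.0)
print(n_tc, n_tcr)  # 1 2
```

Adding an item raised the count from 1 to 2, so a raw count threshold would not support Apriori-style pruning.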

Formal Definition

Frequency of an itemset: the number of instances of the itemset divided by the number of all possible combinations of instances of its items

E.g.: Frequency({A,B})=7/(8*6)≈15%
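The frequency measure is a plain ratio, so the example value is easy to verify. This is a small sketch with an illustrative function name; the inputs 7, 8 and 6 come from the slide.

```python
from fractions import Fraction
from math import prod


def frequency(num_itemset_instances, item_instance_counts):
    """Instances of the itemset over all possible combinations of item instances."""
    return Fraction(num_itemset_instances, prod(item_instance_counts))


f_ab = frequency(7, [8, 6])
print(f_ab, round(float(f_ab) * 100, 1))  # 7/48 14.6
```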

Formal Definition

Spatial Congeries pattern: an itemset whose frequency is no less than the frequency threshold min_fre

Property of Frequency

Lemma: the frequency of an itemset is monotonically non-increasing as the size of the itemset increases.

Proof (simplified): For a size k-1 itemset I_{k-1} = {v_1, v_2, …, v_{k-1}} and a size k itemset I_k = {v_1, v_2, …, v_{k-1}, v_k}:

\[
f_k = \frac{m_k}{n_1 n_2 \cdots n_k} \le \frac{m_{k-1}\, n_k}{n_1 n_2 \cdots n_{k-1}\, n_k} = \frac{m_{k-1}}{n_1 n_2 \cdots n_{k-1}}
\]

\[
f_{k-1} = \frac{m_{k-1}}{n_1 n_2 \cdots n_{k-1}}
\]

\[
f_k \le f_{k-1}
\]

* m_q is the number of instances of I_q

** n_q is the number of instances of item v_q

Algorithm-1

Step 1: generate the complete set of size 2 patterns by R-tree join on complete R-trees







Algorithm-1

Step 2: generate size k (k>2) patterns level by level

Generate size k (k>2) candidates

Join two size k-1 patterns

Prune those candidates that have subsets that are not frequent

Generate size k (k>2) instances

Sample

Legend: Square: a1, Triangle: a2, Circle: b1, Diamond: c1

Dataset A
X1   Y1   a1
X2   Y2   a2
X3   Y3   a1
X4   Y4   a1
X5   Y5   a2

Dataset B
X6   Y6   b1
X7   Y7   b1
X8   Y8   b1

Dataset C
X9   Y9   c1

Process of Algorithm-1

R-tree join (RJ) to find the instances of size 2 candidates

Build an R-tree for each of the datasets A, B and C

Do RJ to find the instances of size 2 candidates

m_a1b1 = 5, m_a2b1 = 3, m_a1c1 = 2, m_a2c1 = 0, m_b1c1 = 0

Get size 2 patterns a1b1, a2b1, a1c1 according to the frequency threshold 50%

f_a1b1 = 5/(3*3) ≈ 56%, f_a2b1 = 3/(2*3) = 50%, f_a1c1 = 2/(3*1) ≈ 67%, f_a2c1 = 0, f_b1c1 = 0
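The size 2 step of the worked example can be replayed from the counts alone. The sketch below takes the instance counts from the sample (three a1, two a2, three b1, one c1) and the pair counts from the slide; the variable names are illustrative.

```python
from fractions import Fraction

# Number of points per item in the sample datasets A, B and C.
item_counts = {"a1": 3, "a2": 2, "b1": 3, "c1": 1}

# Number of instances found for each size 2 candidate (from the slide).
pair_counts = {
    ("a1", "b1"): 5,
    ("a2", "b1"): 3,
    ("a1", "c1"): 2,
    ("a2", "c1"): 0,
    ("b1", "c1"): 0,
}

min_fre = Fraction(1, 2)  # frequency threshold of 50%

# Frequency = instances of the pair / (instances of first item * instances of second).
frequencies = {
    pair: Fraction(m, item_counts[pair[0]] * item_counts[pair[1]])
    for pair, m in pair_counts.items()
}
patterns = [pair for pair, f in frequencies.items() if f >= min_fre]
print(patterns)  # [('a1', 'b1'), ('a2', 'b1'), ('a1', 'c1')]
```

Exact fractions are used so that a2b1, which sits exactly on the 50% threshold, is not lost to floating-point rounding.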

Process of Algorithm-1

Sort-merge join to find the instances of size k (k>2) candidates

Generate size 3 candidates

Join size 2 patterns a1b1 and a1c1 to form a1b1c1

Prune a1b1c1 because b1c1 is not a pattern

Get size 3 patterns (there are no size 3 patterns)
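The join and prune steps of this example follow the usual Apriori scheme. The sketch below makes two assumptions that are not in the slides: patterns are stored as tuples sorted by item label, and the first character of a label names the dataset it comes from (so two items from the same dataset are never joined). The function names are hypothetical.

```python
from itertools import combinations


def dataset_of(item):
    """Assumed naming convention: 'a1' and 'a2' both belong to dataset 'a'."""
    return item[0]


def join_and_prune(patterns):
    """Return (joined candidates, candidates that survive subset pruning)."""
    pattern_set = set(patterns)
    joined, kept = [], []
    for p, q in combinations(sorted(pattern_set), 2):
        # Join only patterns that share all but their last item, and whose
        # last items come from different datasets.
        if p[:-1] != q[:-1] or dataset_of(p[-1]) == dataset_of(q[-1]):
            continue
        candidate = p + (q[-1],)
        joined.append(candidate)
        # Prune if any subset one item smaller is not a known pattern.
        subsets = combinations(candidate, len(candidate) - 1)
        if all(sub in pattern_set for sub in subsets):
            kept.append(candidate)
    return joined, kept


size2_patterns = [("a1", "b1"), ("a2", "b1"), ("a1", "c1")]
joined, kept = join_and_prune(size2_patterns)
print(joined, kept)  # [('a1', 'b1', 'c1')] []
```

The candidate a1b1c1 is formed by the join and then dropped because its subset b1c1 is not a size 2 pattern, matching the slide.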

Algorithm-2

Step 1: generate all patterns for one combination of subsets. Each subset corresponds to an item: all points in the subset have that item as their attribute value. E.g.: the first combination is a1b1c1. R-trees are built for the subsets of a1, b1 and c1 in order to generate size 2 patterns; a sort-merge join then generates the size k (k>2) patterns.

Step 2: generate all patterns for the next combination, and repeat until no combination is left. E.g.: the second combination is a2b1c1.

Process of Algorithm-2

Generate patterns for combination a1b1c1

RJ on the R-trees for a1, b1 and c1 to get the instances of candidates a1b1, a1c1, b1c1

Suppose a1b1 and a1c1 are patterns; the size 3 candidate is a1b1c1

Sort-merge join to get the instances of a1b1c1

Generate patterns for combination a2b1c1

RJ on the R-trees for a2, b1 and c1 to get the instances of candidates a2b1 and a2c1. Because the instances of b1c1 have already been generated, there is no need to generate them again

Suppose a2b1 is a pattern; there is no size 3 candidate
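The order in which Algorithm-2 visits combinations, and the reuse of size 2 joins across combinations, can be sketched as a schedule. Only the bookkeeping is shown, with no joins performed; the function and variable names are hypothetical.

```python
from itertools import combinations, product

# Items available in each dataset of the sample.
items_per_dataset = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1"]}


def pair_join_schedule(items_per_dataset):
    """For each combination (one item per dataset), list the size 2
    candidates whose instances have not been generated yet."""
    done = set()
    schedule = []
    for combo in product(*items_per_dataset.values()):
        new_pairs = [pair for pair in combinations(combo, 2) if pair not in done]
        done.update(new_pairs)
        schedule.append((combo, new_pairs))
    return schedule


for combo, new_pairs in pair_join_schedule(items_per_dataset):
    print(combo, new_pairs)
# ('a1', 'b1', 'c1') [('a1', 'b1'), ('a1', 'c1'), ('b1', 'c1')]
# ('a2', 'b1', 'c1') [('a2', 'b1'), ('a2', 'c1')]
```

The second combination skips b1c1 because it was already joined while processing the first one.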

Experiment

Environment

CPU type: Pentium III Xeon 700MHz

RAM: 4096 MB

Dataset

Synthetic dataset with Gaussian distribution

No. of clusters: 5

Map size: 800

E.g.: (622, 478, 5) is a point in a dataset
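A dataset of this shape could be generated along the following lines. This is only a sketch: the slides give the number of clusters, the map size and the (x, y, value) point format, while the cluster spread, the uniform placement of cluster centers, the uniform choice of attribute values and the fixed seed are all assumptions.

```python
import random


def make_dataset(num_points, num_clusters=5, map_size=800, num_values=5,
                 spread=40.0, seed=0):
    """Return (x, y, value) points scattered around Gaussian clusters.

    `spread` (standard deviation) and `seed` are hypothetical parameters.
    """
    rng = random.Random(seed)
    centers = [(rng.uniform(0, map_size), rng.uniform(0, map_size))
               for _ in range(num_clusters)]
    points = []
    for _ in range(num_points):
        cx, cy = rng.choice(centers)
        # Draw around the chosen center and clamp to the map boundaries.
        x = min(max(round(rng.gauss(cx, spread)), 0), map_size)
        y = min(max(round(rng.gauss(cy, spread)), 0), map_size)
        value = rng.randint(1, num_values)  # attribute value in 1..num_values
        points.append((x, y, value))
    return points


dataset = make_dataset(1000)
print(len(dataset), dataset[0])
```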

Experiment-1: Effects of No. of Points

Settings: No. of Datasets: 3; No. of Attribute Values: 5; Distance threshold: 100; Frequency threshold: 0.01

[Figure: CPU Time (s) vs. No. of Points (1000 to 9000); series: cpu time-C, cpu time-P]

Experiment-1

[Figure: IO Time (s) vs. No. of Points (1000 to 9000); series: IO time-C, IO time-P]

Experiment-1

[Figure: Time (s) vs. No. of Points (1000 to 9000); series: total-C, total-P]

Experiment-2: Effects of No. of Datasets

Settings: No. of Points in each dataset: 1000; No. of Attribute Values: 5; Distance threshold: 100; Frequency threshold: 0.01

[Figure: Total Time (Clocks) vs. No. of Datasets (3 to 9); series: total-C, total-P]

Experiment-3: Effects of Frequency Threshold

Settings: No. of Datasets: 5; No. of Points in each dataset: 1000; No. of Attribute Values: 5; Distance threshold: 100

[Figure: Time (s) vs. Frequency Threshold (0.05, 0.03, 0.005, 0.004, 0.003, 0.002, 0.001, 0.0005); series: total-C, total-P]

Experiment-4: Effects of Distance Threshold

Settings: No. of Datasets: 3; No. of Points in each dataset: 1000; No. of Attribute Values: 5; Frequency threshold: 0.01

[Figure: Time (s) vs. Distance Threshold (50 to 300); series: total-C, total-P]

Conclusions

The neighboring class set mining and co-location pattern mining problems were introduced

Spatial Congeries pattern mining was formulated, and two Apriori-like mining algorithms were provided

Future work:

More pruning methods should be used to reduce the time and space requirements

The experiments should be run on real datasets

References

[HSX02] Huang Y., Shekhar S., Xiong H. Discovering Co-location Patterns from Spatial Datasets: A General Approach. Submitted to IEEE TKDE (under second round review).

[HXSP03] Huang Y., Xiong H., Shekhar S., Pei J. Mining Confident Co-location Rules without a Support Threshold. Proc. of 18th ACM Symposium on Applied Computing (ACM SAC), 2003.

[Mor01] Morimoto Y. Mining Frequent Neighboring Class Sets in Spatial Databases. Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001.

Q&A

