
amt@cs.rit.edu Proprietary and Confidential

NOTICE: Proprietary and Confidential

This material is proprietary to A. Teredesai and GCCIS, RIT.

Slide 1

A Comprehensive Look at Mining Time-Series and

Sequential Patterns

Ankur Teredesai, Department of Computer Science, Rochester Institute of Technology

Definition of Time-Series

[Figure: a sample time series — roughly 500 observations plotted on a 23–29 value axis, with raw readings such as 25.1750, 25.2250, …, 24.6250, 24.7500.]

A time series is a collection of observations made sequentially in time.

amt@cs.rit.edu Dr. Ankur M. Teredesai P2

Sample Example for Time-Series (cont.)

People measure things...
• The president's approval rating
• Their blood pressure
• The annual rainfall in Los Angeles
• The value of their Yahoo stock
• The number of web hits per second

… and things change over time.


Thus time series occur in virtually every medical, scientific, and business domain.

What can We Do with Time-Series?

Clustering

Classification

Query by Content

Rule Discovery (e.g., a rule holding with support s = 0.5 and confidence c = 0.3)

Motif Discovery


Novelty Detection

Sample Model for Information Streams Mining


• Information streams vs. time-series:

• In many emerging science and business applications, data takes the form of streams rather than static datasets.

• We can define an information stream as continuously arriving dynamic data, in contrast with static time-series data.

*MIESIS (MIning from Earth Science Information Streams)

Information Streams Segmentation

We need to do this for:

• Symbolization

• Dimensionality reduction

Using a fixed-length sliding window

[Figure: a stream over a 0–120 window, segmented by a fixed-length sliding window and labeled with symbols a, b, c.]

Information Streams Segmentation (cont.)

Using turning points

[Figure: information stream data from a sensor — value (−0.2 to 0.15) vs. time (0 to 70), segmented at turning points.]

Clustering

Feature extraction

• For dimensionality reduction, we need to extract features from the raw information streams

• DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), PAA (Piecewise Aggregate Approximation), etc.

Similarity Measure

• Defining the similarity between two raw information streams or two feature vectors

• Euclidean distance metric, Pearson's correlation coefficient, etc.
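As an illustrative sketch, both similarity measures named above can be computed directly on two equal-length sequences (plain Python; the function names are our own):

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length sequences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    # Pearson's correlation coefficient between two equal-length sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

Euclidean distance is sensitive to offset and amplitude, while Pearson's coefficient is invariant to both — one reason the choice of measure matters when clustering raw streams.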


Clustering (cont.)

Hierarchical Clustering

Partitional Clustering (e.g. K-means)


Symbolic Representation

Example 1

Example 2

[Figures: Example 1 — the value range partitioned into regions R1–R9; Example 2 — a stream over a 0–120 window labeled with symbols a, b, c.]

Symbolic Representation (cont.)

Express the information stream as a sequence of symbols

Now we can work in a lower-dimensional space than the raw information stream data. We can also use well-known string-processing data structures such as the inverted index or suffix tree, or sequence models such as the HMM.

aaabaabcbabccb
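A minimal sketch of one such symbolization — fixed-length windows labeled by the slope of each window, with thresholds chosen purely for illustration (not the exact scheme behind the figure):

```python
def symbolize(series, w):
    # Discretize a stream into symbols using fixed-length windows:
    # 'a' = falling, 'b' = roughly flat, 'c' = rising.
    symbols = []
    for i in range(0, len(series) - w + 1, w):
        window = series[i:i + w]
        slope = (window[-1] - window[0]) / (w - 1)  # mean slope of the window
        symbols.append('a' if slope < -0.1 else 'c' if slope > 0.1 else 'b')
    return ''.join(symbols)
```

The resulting string can then be fed to inverted indexes, suffix trees, or HMMs as described above.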


Possible Mining Operations

Novelty detection

• Can be used to identify potential anomaly events

• It is also referred to as the detection of "Aberrant Behavior", "Anomalies", "Faults", "Surprises", "Deviants", "Temporal Change", and "Outliers".

• As these terms suggest, we can detect previously unseen patterns or sequences in the incoming information stream, based on a training dataset.

Prediction

• The utility of a prediction model lies in detecting events rather than predicting numerical values. An event is a meaningful object to which we can assign semantics, e.g., an earthquake or a flood.

Finding correlation between clusters

• We can detect Spatial/Temporal correlation between clusters or information streams


Mining Time-Series and Sequence Data

Time-series database

• Consists of sequences of values or events changing with time

Applications

• Financial: stock price, inflation

• Biomedical: blood pressure

• Meteorological: precipitation


Mining Time-Series and Sequence Data: Trend analysis

Categories of Time-Series Movements
• Long-term or trend movements (trend curve)
• Cyclic movements or cyclic variations, e.g., business cycles
• Seasonal movements or seasonal variations
– i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years

• Irregular or random movements


Estimation of the Trend Curve

The freehand method

• Fit the curve by looking at the graph

• Costly and barely reliable for large-scale data mining

The least-squares method

• Find the curve minimizing the sum of the squares of the deviation of points on the curve from the corresponding data points

The moving-average method

• Eliminate cyclic, seasonal and irregular patterns

• Loss of end data

• Sensitive to outliers


Discovery of Trend in Time-Series (1)

Estimation of seasonal variations

• Seasonal index
– A set of numbers showing the relative values of a variable during the months of the year
– E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months

• Deseasonalized data
– Data adjusted for seasonal variations
– E.g., divide the original monthly data by the seasonal index numbers for the corresponding months
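The deseasonalizing step is simple arithmetic; a sketch using the slide's own numbers:

```python
def deseasonalize(monthly_value, seasonal_index):
    # Divide the raw monthly figure by the seasonal index (in percent).
    return monthly_value / (seasonal_index / 100.0)
```

For example, October sales of 80 with an October index of 80 deseasonalize to 100, i.e., an exactly average month.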


Similarity Search in Time-Series Analysis

A normal database query finds exact matches; a similarity search finds data sequences that differ only slightly from the given query sequence.

Two categories of similarity queries:
• Whole matching: find a sequence that is similar to the query sequence
• Subsequence matching: find all subsequences of the data that are similar to the query sequence

Typical applications:

• Financial market

• Market basket data analysis

• Scientific databases

• Medical diagnosis


Data Transformation

Many techniques for signal analysis require the data to be in the frequency domain. Usually data-independent transformations are used:

• The transformation matrix is determined a priori
– E.g., discrete Fourier transform (DFT), discrete wavelet transform (DWT)

• The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

• DFT does a good job of concentrating energy in the first few coefficients

• If we keep only the first few coefficients of the DFT, we can compute a lower bound on the actual distance
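A small sketch of this lower-bounding property using a naive DFT (stdlib only; by Parseval's theorem, a distance computed from only the first k coefficients can never exceed the true Euclidean distance):

```python
import cmath
import math

def dft(x):
    # Naive O(n^2) discrete Fourier transform
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def lower_bound(a, b, k):
    # Distance from the first k DFT coefficients only; with the 1/n
    # normalization this is a lower bound on euclidean(a, b).
    fa, fb = dft(a), dft(b)
    return math.sqrt(sum(abs(fa[i] - fb[i]) ** 2 for i in range(k)) / len(a))
```

Because the truncated distance never overestimates, an index built on the first few coefficients can prune candidates without false dismissals.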


Finding Surprising Patterns in a Time Series Database in Linear Time and Space


Paper by Eamonn Keogh, Stefano Lonardi, and Bill Chiu; presented at ACM SIGKDD 2002

Main Purpose

Novelty Detection

• The authors note that this problem should not be confused with the relatively simple problem of outlier detection.

• They focused on finding surprising patterns, not on finding individually surprising datapoints.

• The blue time series at the top is a normal, healthy human heartbeat with an artificial "flatline" added. The sequence in red at the bottom indicates how surprising local subsections of the time series are.


Basic Ideas

A pattern is surprising if its frequency of occurrence differs greatly from what we expected. Their notion of the surprisingness of a pattern is not tied exclusively to its shape; instead it depends on the difference between the shape's expected frequency and its observed frequency.

Example: consider the head-and-shoulders pattern shown below.

• The existence of this pattern in a stock market time series occurs an average of three times a year.

• If it occurred ten times this year : surprising.

• If its frequency of occurrence is less than expected: also a surprising pattern.


Approach

Formal definition of a surprising pattern

• A time series pattern P, extracted from database X, is surprising relative to a database R if the probability of its occurrence is greatly different from that expected by chance, assuming that R and X are created by the same underlying process.

Example: let x = principalskinner

• Σ is {a, c, e, i, k, l, n, p, r, s}
• |x| is 16
• skin is a substring of x
• prin is a prefix of x
• ner is a suffix of x
• If y = in, then fx(y) = 2
• If y = pal, then fx(y) = 1

How about y = eik?
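The substring-frequency function fx(y) used in the example can be sketched directly (overlapping occurrences are counted):

```python
def fx(x, y):
    # Count (possibly overlapping) occurrences of substring y in string x
    return sum(1 for i in range(len(x) - len(y) + 1) if x[i:i + len(y)] == y)
```

On x = principalskinner this reproduces the counts above, and fx(x, "eik") comes out 0 — eik never occurs.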


Approach (cont.)

Steps (TARZAN algorithm)

• Discretizing the time-series into symbolic strings
– Fixed-size sliding window
– Slope of the best-fitting line

• Calculate probability of any pattern, including ones we have never seen before using Markov models

• For maintaining linear time and space property, they use suffix tree data structure

• Computing scores by comparing trees between reference data and incoming information stream

aaabaabcbabccb
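An illustrative sketch of the scoring idea only — not the authors' suffix-tree implementation: estimate an order-1 Markov model from the reference string and score a pattern by its observed minus expected frequency in the incoming string:

```python
from collections import Counter

def surprise(reference, test, w):
    # Expected count of w in `test` under an order-1 Markov model
    # estimated from `reference`, subtracted from its observed count.
    pairs = Counter(reference[i:i + 2] for i in range(len(reference) - 1))
    firsts = Counter(reference[:-1])
    prob = reference.count(w[0]) / len(reference)
    for a, b in zip(w, w[1:]):
        prob *= pairs[a + b] / firsts[a] if firsts[a] else 0.0
    expected = (len(test) - len(w) + 1) * prob
    observed = sum(1 for i in range(len(test) - len(w) + 1)
                   if test[i:i + len(w)] == w)
    return observed - expected
```

A pattern the reference process never generates scores higher than one the model fully expects — exactly the behavior the tree comparison is after.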


Experimental Evaluation

Two features:

• Sensitivity (high true positive rate)
– The algorithm can find truly surprising patterns in a time series
– This corresponds to recall

• Selectivity (low false positive rate)
– The algorithm will not find spurious "surprising" patterns in a time series
– This corresponds to precision

Goal is maintaining High Precision and Recall

• They achieved high Sensitivity

• But, Selectivity??


Experimental Evaluation (cont.): Shock ECG

[Figure: Shock ECG training data and a test subset over samples 0–1600, with Tarzan's level of surprise plotted below.]

Experimental Evaluation (cont.): Power Demand

• They consider a dataset that contains the power demand for a Dutch research facility for the entire year of 1997. The data is sampled over 15 minute averages, and thus contains 35,040 points.

• Above is the first 3 weeks of the power demand dataset. Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quiet weekends.

[Figure: the first 3 weeks of the power demand dataset — demand roughly 500–2500 over samples 0–2000.]

Experimental Evaluation (Power Demand cont.)

They used the period from Monday, January 6th to Sunday, March 23rd as reference data. This time period contains no national holidays. They tested on the remainder of the year.

They showed the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches they showed the entire week (beginning Monday) in which the 3 largest values of surprise fell.

Both TSA-tree and IMM returned sequences that appear to be normal workweeks.

Tarzan returned 3 sequences that correspond to the weeks that contain national holidays.

[Figure: the most surprising weeks found by Tarzan, TSA-tree, and IMM.]


Experimental Evaluation (cont.)


• The previous experiments demonstrate the ability of Tarzan to find surprising patterns (sensitivity)

• However, they also need to consider Tarzan's selectivity
– To reduce false alarms, they attempted to scale to massive datasets

• If Tarzan is trained on a short random-walk dataset:
– The chance that similar patterns of the test data exist in the short training database is very small → many false alarms
– Solution: increase the size of the training data, and the surprisingness of the test data should decrease
– The more training on huge random-walk data, the fewer spurious surprising patterns are detected

Possible Future Research Opportunity

Mentioned in the paper

• Incorporating user feedback and domain based constraints

• Applying different feature extraction techniques

Information streams + Ontology

• Finding methods to combine information stream mining with ontologies

• Intuitively, if we can extract general/abnormal patterns in information streams and generate clusters, we can assign semantics to the patterns or clusters.

• For example, we could relate a news stream about "War in Iraq" to the stock price changes of an oil company using a "News-Stock Ontology Model": we detect a rapid increase in the volume of news regarding "War" and "Iraq" at time t, and a rapid increase/decrease in the oil company's stock price at time t+α.


Suffix Tree Data Structure


Multidimensional Indexing

Multidimensional index
• Constructed for efficient access using the first few Fourier coefficients
• Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence
• Perform postprocessing by computing the actual distance between sequences in the time domain and discard any false matches



B-Trees

Generalizes the multilevel index
• Number of levels varies with the size of the data file, but is quite often 3
• Height balanced
– Equal-length access paths to different records
• Adapts well to insertions and deletions

DBMSs typically use a variant called a B+tree
• All nodes have the same format: n keys, n+1 pointers, at least half of them in use
• Useful for primary and secondary indexes, on primary keys and non-keys


B+Tree Example

[Figure: a B+tree with n = 3. Root key: 100. Inner nodes: (30) on the left, (120, 150, 180) on the right. Leaves: (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200).]

Sample Non-Leaf Node

[Figure: a non-leaf node with keys 120, 150, 180; its four pointers lead to keys k < 120, 120 ≤ k < 150, 150 ≤ k < 180, and k ≥ 180.]

Sample Leaf Node

[Figure: a leaf node, reached from a non-leaf node, holding keys 120 and 130 with one slot unused; its pointers lead to the records with keys 120 and 130, plus a sequence pointer to the next leaf.]

Nodes Must Not Be Too Empty

Number of pointers in use:
• At internal nodes, at least ⌈(n+1)/2⌉
– To child nodes
• At leaves, at least ⌊(n+1)/2⌋
– To data records/blocks

Node Bounds (n = 3)

[Figure: a full non-leaf node holds keys 120, 150, 180; a minimum non-leaf node holds only 30. A full leaf holds 3, 5, 11; a minimum leaf holds 30, 35.]

B+tree Rules

All leaves at the same lowest level
• Balanced tree

Pointers in leaves point to records
• Except for the "sequence pointer"

Number of pointers/keys for a B+tree:

Node type             Max ptrs   Max keys   Min ptrs (→ data)   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉           ⌈(n+1)/2⌉ − 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋           ⌊(n+1)/2⌋
Root                  n+1        n          2♠                  1

♠ Can be 1 if there is only one record in the file

B+tree Insertions

Search for the key being inserted. Four cases:
• Leaf has space
– Just insert (key, pointer-to-record)
• Leaf overflow
• Non-leaf overflow
• New root
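A toy sketch of the first two cases at the leaf level (keys only; `n` is the maximum number of keys per node — not a full B+tree implementation):

```python
def insert_into_leaf(leaf, key, n=3):
    # Returns (keys, None, None) when the leaf has room, or
    # (left, right, pushed_key) when the leaf overflows and splits.
    keys = sorted(leaf + [key])
    if len(keys) <= n:
        return keys, None, None       # case "leaf has space"
    mid = len(keys) // 2              # case "leaf overflow": split in half
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]      # right[0] is inserted into the parent
```

With n = 3, inserting 32 into the leaf (30, 31) just extends it, while inserting 7 into the full leaf (3, 5, 11) splits it into (3, 5) and (7, 11) and pushes 7 up — matching the two examples that follow.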


Leaf Has Space: Insert Key 32 (n = 3)

[Figure: under root key 100 and inner key 30, the leaf (30, 31) has room, so it simply becomes (30, 31, 32).]

Leaf Overflow: Insert Key 7 (n = 3)

[Figure: inserting 7 into the full leaf (3, 5, 11) splits it into (3, 5) and (7, 11); the key 7 is inserted into the parent node alongside 30.]

Non-Leaf Overflow: Insert Key 160 (n = 3)

[Figure: inserting 160 splits the leaf (150, 156, 179) into (150, 156) and (160, 179); pushing 160 into the full non-leaf node (120, 150, 180) overflows it, so that node is split and a key moves up a level.]

New Root: Insert Key 45 (n = 3)

[Figure: inserting 45 splits the leaf (30, 32, 40) into (30, 32) and (40, 45); pushing 40 up overflows the old root (10, 20, 30), which splits, and a new root holding 30 is created.]

Tree grows at the root and maintains balance.

B+tree Deletions

Search for the key being deleted
• If found, delete

Three broad cases:
• Leaf does not underflow
• Borrow keys from an adjacent sibling, if that does not also cause an underflow
• Coalesce with a sibling node
– Repeat if needed

It is sometimes acceptable to allow a leaf to become sub-minimum (no mergers), violating the strict B-tree definition.


Leaf Does Not Underflow: Delete Key 35 (n = 4; min keys in a leaf = ⌊5/2⌋ = 2)

[Figure: with parent keys (10, 40, 100), deleting 35 from the leaf (10, 20, 30, 35) leaves (10, 20, 30), still above the minimum.]

Borrow Keys: Delete Key 50 (n = 4; min keys in a leaf = ⌊5/2⌋ = 2)

[Figure: deleting 50 leaves the leaf (40) under-full; it borrows 35 from its sibling (10, 20, 30, 35), and the parent's separating key changes from 40 to 35.]

Coalesce with Sibling: Delete Key 50 (n = 4)

[Figure: deleting 50 leaves the leaf (40) under-full and its sibling (20, 30) cannot lend a key, so the two coalesce into (20, 30, 40) and the separating key 40 is removed from the parent.]

Coalesce Non-Leaf: Delete Key 37 (n = 4; min keys in a non-leaf = ⌈(n+1)/2⌉ − 1 = 3 − 1 = 2)

[Figure: deleting 37 causes coalescing to cascade up through the non-leaf levels; the old root is eliminated and a node holding keys 25, 30, 40 becomes the new root.]

Tree shrinks at the root.

B+tree Deletions in Practice

Coalescing is often not implemented
• Too hard and usually not worth it!
• Subsequent insertions may return the node to the required minimum size
• Compromise
– Try redistributing keys with a sibling
– If not possible, leave it as is
• If all accesses to records go through the B-tree
– Place a "tombstone" for the deleted record at the leaf

Traditional B-Trees

A B-tree is similar to a B+tree
• Each search key appears only once
– No redundant storage of search keys
• Additional pointer field for each search key in non-leaf nodes
– Points to the record directly

Non-leaf node layout, B+tree versus B-tree:

P1 K1 P2 … Pn-1 Kn-1 Pn
P1 R1 K1 P2 R2 K2 … Pn-1 Rn-1 Kn-1 Pn


B-Tree Advantages and Disadvantages

Advantages
• Fewer nodes than the corresponding B+tree
• Possible to find a key before hitting a leaf node

Disadvantages
• Only a small fraction of all keys are found early
• Non-leaf nodes are larger, so reduced fan-out
– A B-tree is often deeper than the corresponding B+tree
• More complex than B+trees
– Insertion/deletion and overall implementation
• B+trees are usually better than B-trees

B+Trees in Practice

Typical order: 100
• Typical fill-factor around 67%
• Average fanout around 133

Typical capacities:
• Height 4: 133^4 ≈ 312,900,721 records
• Height 3: 133^3 = 2,352,637 records

Can often hold the top levels in the buffer pool:
• Level 1 = 1 page = 8 KBytes
• Level 2 = 133 pages = 1 MByte
• Level 3 = 17,689 pages = 133 MBytes
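The capacity figures are just powers of the average fanout; a quick check:

```python
def capacity(fanout, height):
    # Records reachable from a B+tree of the given height, assuming the
    # stated average fanout at every level.
    return fanout ** height
```

capacity(133, 3) gives 2,352,637 and capacity(133, 4) gives 312,900,721, while 133² = 17,689 pages make up level 3.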


Tree-Structured Indexes

Ideal for range searches and equality searches

ISAM: static structure
• Only leaf pages are modified
• Overflow pages degrade performance

B+tree: dynamic structure
• Inserts/deletes leave the tree height-balanced, and it offers graceful growth and shrinking
– High fanout (F) ⇒ shallow depth, rarely > 3 or 4
– 67% occupancy on average
• Preferable to ISAM, modulo locking considerations
• Widely used DBMS index structure and one of the most optimized DBMS components


Multidimensional Data

Geographic & multidimensional data applications
• Sale (store, day, item, color, size, etc.)
– Each sale is a point in 5-dimensional space
• Customer: (age, salary, zip, married, …)

Typical queries
• Range queries
– Find employees in the Toy department who make at least 25K dollars
• Nearest neighbor
– I am here: where's the nearest MacGregors?
• Is this expressible in SQL?


Big Impediment

For these queries, there is no clean way to eliminate the many records that don't meet the WHERE condition.

Approaches
• Index on one attribute
– Get data satisfying one attribute's condition and filter on the others
• Index on attributes independently
– Intersect pointers in main memory to save disk I/O
– Does this help with nearest neighbor?
• Multiple-key index
– An index on one attribute provides pointers to an index on the other

2-Level Indexing

[Figure: an index on the first attribute whose entries point to separate indexes I1, I2, I3 on the second attribute.]

Example

[Figure: a Dept index with entries Art, Sales, Toy; each entry points to a Salary index (values such as 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k), which in turn points to records such as the sample employee Name = Joe, Dept = Sales, Salary = 15k.]

Some Queries

Question
• For what kinds of conditions on dept and salary will a multiple-key index (dept first) significantly reduce the number of disk I/Os?

How about finding records where …
• Dept = "Sales" and Salary = 20k
• Dept = "Sales" and Salary > 20k
• Dept = "Sales"
• Salary = 20k

Interesting Application: Geographic Data

Data
• <X1, Y1, Attributes>
• <X2, Y2, Attributes>
• ...

Queries
• What city is at <Xi, Yi>?
• What is within 5 miles of <Xi, Yi>?
• Which is the closest point to <Xi, Yi>?

[Figure: a set of labeled points a–o in the x–y plane (coordinates up to about 40), used to illustrate searching for points near f and near b.]

Example

Queries
• Find points with Yi > 20
• Find points with Xi < 5
• Find points "close" to i = <12, 38>
• Find points "close" to b = <7, 24>


Other Structures

Other geographic index structures
• Quad trees
• R-trees

More multikey indexes
• Grid
• Partitioned hash

Grid Index

[Figure: a grid with Key 1 values V1, V2, …, Vn as rows and Key 2 values X1, X2, …, Xn as columns; the cell <V3, X2> points to the records with key1 = V3 and key2 = X2.]

Claim

Can quickly find records with
• key 1 = Vi and key 2 = Xj
• key 1 = Vi
• key 2 = Xj

And also ranges…
• E.g., key 1 ≥ Vi and key 2 < Xj
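A toy grid index supporting exactly these lookups, with each key's domain partitioned by a sorted list of boundary values (an illustrative in-memory sketch, not a disk layout):

```python
import bisect

class GridIndex:
    # Buckets at the intersection of two linear scales.
    def __init__(self, scale1, scale2):
        self.scale1, self.scale2 = scale1, scale2  # sorted boundary values
        self.grid = [[[] for _ in range(len(scale2) + 1)]
                     for _ in range(len(scale1) + 1)]

    def _cell(self, k1, k2):
        return (bisect.bisect_right(self.scale1, k1),
                bisect.bisect_right(self.scale2, k2))

    def insert(self, k1, k2, rec):
        i, j = self._cell(k1, k2)
        self.grid[i][j].append(rec)

    def lookup(self, k1, k2):
        i, j = self._cell(k1, k2)
        return self.grid[i][j]
```

A single-key or range query scans an entire row, column, or rectangular block of cells instead of a single cell.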


Storing Grid Indexes

Catch with grid indexes!
• How is the grid index stored on disk?

Problem
• We need regularity to compute the position of the <Vi, Xj> entry

Like an array: row V1's cells X1…X4, then row V2's, then row V3's, …

Solution: Use Indirection

[Figure: the grid cells for rows V1–V4 and columns X1–X3 contain only pointers to buckets, which hold the actual records.]

The grid only contains pointers to buckets.

Indexing Grid on Value Ranges

[Figure: a grid addressed through linear scales — Dept scale: 1 = Toy, 2 = Sales, 3 = Personnel; Salary scale: 1 = 0–20K, 2 = 20K–50K, 3 = 50K+.]

The grid can be regular without wasting space, but we do pay the price of indirection.

Partitioned Hashing

Hash function
• Combines several attributes
• Great when all attribute values are specified

A partitioned hash function devotes some bits of the bucket number to each attribute independently:

[Figure: the bucket number 010110 1110010 is formed by concatenating h1(Key1) and h2(Key2).]

Example (1)

Suppose h1(toy) = 0, h1(sales) = 1, h1(art) = 1, …, and h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, …

Insert <Fred, toy, 10K>, <Joe, sales, 10K>, <Sally, art, 30K>:

[Figure: buckets numbered 000–111; <Fred> lands in bucket 001, while <Joe> and <Sally> land in bucket 101.]
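A sketch of this bucket computation (the h1/h2 values are the illustrative ones from the slide, not real hash functions):

```python
h1 = {"toy": "0", "sales": "1", "art": "1"}                # 1 bit for dept
h2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}  # 2 bits for salary

def bucket(dept, salary):
    # Partitioned hash: concatenate each attribute's bits.
    return h1[dept] + h2[salary]

def buckets_for_salary(salary):
    # Partial-match probe: when only the salary is known, try every
    # bucket whose low bits equal h2(salary).
    return sorted({bits + h2[salary] for bits in set(h1.values())})
```

bucket("toy", "10k") is 001 and bucket("sales", "10k") is 101, matching where <Fred> and <Joe> land; the partial-match helper anticipates Examples (3) and (4).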

Example (2)

Find employees with Dept = Sales and Sal = 40k: h1(sales) = 1 and h2(40k) = 00, so only bucket 100 must be examined.

[Figure: buckets 000–111 now holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>.]

Example (3)

Find employees with Sal = 30k: h2(30k) = 01, so look in every bucket of the form ?01, i.e., buckets 001 and 101.

[Figure: the same buckets, with 001 and 101 marked "look here".]

Example (4)

Find employees with Dept = Sales: h1(sales) = 1, so look in every bucket of the form 1??, i.e., buckets 100 through 111.

[Figure: the same buckets, with the 1?? buckets marked "look here".]


R Trees

A Dynamic Index Structure for Spatial Representation

Why R-Trees?

• Multi-dimensional objects are not well represented by point locations
• We need to be able to perform range searches
• One-dimensional indexes are not suitable for multi-dimensional spaces
• Ex: find all the counties within a 20-mile radius of Georgia Tech

Main Concepts

• Height-balanced tree similar to a B-tree
• Index records in leaf nodes point to data objects
• The index is dynamic; no periodic reorganization is required
• Index records (at leaf nodes) are of the form (I, tuple-identifier), where I is an n-dimensional bounding rectangle: I = (I0, I1, …, In-1), with n the number of dimensions and each Ii = [a, b] a closed bounded interval

More Concepts…

• Non-leaf nodes are of the form (I, child-pointer), where child-pointer is the address of a lower node and I is the smallest rectangle covering all the rectangles in the lower node's entries
• M = maximum number of entries in one node

More Concepts…

• m = parameter specifying the minimum number of entries in a node; m can be tuned and is ≤ M/2
• An R-tree containing N index records has height at most ⌈log_m N⌉ − 1
• Worst-case space utilization: m/M
• Maximum number of nodes: ⌈N/m⌉ + ⌈N/m²⌉ + … + 1

Searching

Denote the rectangle part of a node entry E by E.I and the child-pointer part by E.p.

Algorithm Search: given an R-tree with root node T, find all index records whose rectangles overlap a search rectangle S.
Step 1) [Search subtrees.] If T is not a leaf, check each entry E to determine whether E.I overlaps S. For all overlapping entries, invoke Search on the tree whose root is E.p.
Step 2) [Search leaf node.] If T is a leaf, check all entries E to determine whether E.I overlaps S. If so, E is a qualifying record; return E.
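The two steps translate almost directly into code; a sketch with nodes as ("leaf"/"inner", entries) tuples and rectangles as lists of per-dimension (low, high) intervals (our own representation, chosen for brevity):

```python
def overlaps(r1, r2):
    # Rectangles overlap iff their intervals overlap in every dimension.
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(r1, r2))

def search(node, s):
    # node = ("leaf", [(rect, tuple_id), ...])
    #      | ("inner", [(rect, child_node), ...])
    kind, entries = node
    if kind == "leaf":                    # Step 2: return qualifying records
        return [e for e in entries if overlaps(e[0], s)]
    results = []
    for rect, child in entries:           # Step 1: descend only into subtrees
        if overlaps(rect, s):             # whose bounding rectangle overlaps S
            results += search(child, s)
    return results
```

Note that, unlike a B-tree lookup, more than one subtree may overlap S, so the search can visit several paths.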


Insertion into an R-Tree

• Similar to B-tree insertion
• New index records are added to leaves
• Overflowing nodes are split
• Splits propagate up the tree

Algorithm Insert — Details

1. Invoke ChooseLeaf to select a leaf node L in which to place E
2. If L has room for another entry, install E; else invoke SplitNode on L, obtaining L and LL
3. Invoke AdjustTree on L (and LL if a split was performed)
4. If the root was split, create a new root with two children (those obtained by splitting the old root)

Algorithm ChooseLeaf

1. Set N to be the root
2. If N is a leaf, return N
3. If N is not a leaf, let F be the entry in N whose rectangle needs the least enlargement to include E.I
4. Set N to be the child node pointed to by F.p and repeat from Step 2

Algorithm AdjustTree

1. Set N = L; if L was split, set NN = LL
2. If N is the root, stop
3. Let P be N's parent and let EN be N's entry in P; adjust EN.I so that it "tightly" encloses all entries in N
4. If NN exists, create a new entry ENN with ENN.p pointing to NN and ENN.I enclosing all rectangles in NN; add ENN to P if there is room, otherwise invoke SplitNode to produce PP
5. Move up to the next level and repeat the process

Node Splitting

• A "full" node must be split when a new entry needs to be added
• Must ensure that on any subsequent search, with high probability only one of the two nodes needs to be explored
• The total area of the two covering rectangles should be minimized
• Exhaustive search has exponential complexity

Quadratic-Cost Algorithm

1. Use PickSeeds to choose two entries to be the first elements of the two groups
2. Repeat Step 3 until all entries have been assigned to one of the groups
3. Invoke PickNext to choose the next entry to assign; add it to the group whose covering rectangle needs to be expanded the least

Algorithm PickSeeds

1. For each pair of entries E1 and E2, let J be the rectangle including E1.I and E2.I; calculate d = area(J) − area(E1.I) − area(E2.I)
2. Choose the pair with the largest d value
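A direct sketch of PickSeeds over axis-aligned rectangles (each a list of per-dimension (low, high) intervals; the representation is our own):

```python
from itertools import combinations

def area(r):
    w = 1.0
    for lo, hi in r:
        w *= hi - lo
    return w

def cover(r1, r2):
    # Smallest rectangle J enclosing both r1 and r2
    return [(min(a, c), max(b, d)) for (a, b), (c, d) in zip(r1, r2)]

def pick_seeds(rects):
    # Choose the pair that wastes the most area when covered together:
    # maximize d = area(J) - area(E1) - area(E2).
    return max(combinations(range(len(rects)), 2),
               key=lambda p: area(cover(rects[p[0]], rects[p[1]]))
                             - area(rects[p[0]]) - area(rects[p[1]]))
```

The two seeds are the rectangles that would be most wasteful to keep in the same group, so they anchor the two new nodes.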


Algorithm PickNext

1. For each entry E not yet in any group, calculate d1 = the area increase required in the covering rectangle of Group 1 to include E.I; calculate d2 similarly for Group 2
2. Choose the entry with the maximum difference between d1 and d2

Algorithm LinearPickSeeds

1. Along each dimension, find the entry whose rectangle has the highest low side and the one with the lowest high side; record the separation between them
2. Normalize the separations by dividing by the width of the entire set along the corresponding dimension
3. Choose the pair with the greatest normalized separation along any dimension

Algorithm Delete

1. Invoke FindLeaf to locate the leaf node L containing E; remove E from L
2. Invoke CondenseTree on L
3. If the root node has only one child, make the child the new root

Algorithm FindLeaf

1. Set T to be the root of the tree
2. If T is not a leaf, check each entry F in T to determine whether F.I overlaps E.I; for each such entry, invoke FindLeaf on the tree pointed to by F.p
3. If T is a leaf, check each entry to see whether it matches E; if E is found, return T

Algorithm CondenseTree

1. Set N = L; set Q, the set of eliminated nodes, to the empty set
2. If N is the root, go to Step 6; else let P be the parent of N and let EN be N's entry in P
3. If N has fewer than m entries, delete EN from P and add N to Q

Algorithm CondenseTree (contd.)

4. If N has not been eliminated, adjust EN.I to tightly contain all entries in N
5. Set N = P and repeat from Step 2
6. Reinsert all entries of nodes in Q. Entries from eliminated leaf nodes are reinserted as in algorithm Insert; entries from higher-level nodes must be placed higher in the tree


Multi-dimensional Sequential Pattern Mining

Outline
• Why multidimensional sequential pattern mining?
• Problem definition
• Algorithms
• Experimental results
• Conclusions

Why Sequential Pattern Mining?

Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)

Many data and applications are time-related:
• Customer shopping patterns, telephone calling patterns
– E.g., first buy a computer, then CD-ROMs, then software, within 3 months

• Natural disasters (e.g., earthquake, hurricane)

• Disease and treatment

• Stock market fluctuation

• Weblog click stream analysis

• DNA sequence analysis


Sequential Pattern Mining

Mining of frequently occurring patterns related to time or other sequences

Examples
• Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order
• A collection of ordered events within an interval

Applications

• Targeted marketing

• Customer retention

• Weather prediction


Motivating Example

Sequential patterns are useful
• "free internet access → buy package 1 → upgrade to package 2"
• Marketing, product design & development

Problem: lack of focus
• Various groups of customers may have different patterns

MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining

Sequences and PatternsGiven a set of sequences, find the complete set of frequent subsequences

A sequence : < (ef) (ab) (df) c b >A sequence database

amt@cs.rit.edu Dr. Ankur M. Teredesai P98

Elements items within an element are listed alphabetically

SID sequence

10 <a(ababc)(acc)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(abab)(df)ccb>

40 <eg(af)cbc>

<a(bc)dc> is a subsequence of <<aa(a(abcbc)(ac))(ac)dd((ccff)>)>

Given support thresholdmin_sup =2, <(ab)c> is a sequential pattern

Sequential Pattern: Basics

A sequence: <(bd)cb(ac)>

A sequence database:

Seq. ID sequence
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>

<ad(ae)> is a subsequence of <a(bd)bcb(ade)>

Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern


Enhanced similarity search methods

• Allow for gaps within a sequence or differences in offsets or amplitudes
• Normalize sequences with amplitude scaling and offset translation
• Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers
• Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences
• Parameters specified by a user or expert: sliding window size, width of an envelope for similarity, maximum gap, and matching fraction
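A toy sketch of the normalization and envelope steps described above. The exact parameterization (e.g. `max_outlier_frac` for the ignored-outlier allowance) is our own illustrative assumption, not a specific published method.

```python
def normalize(seq):
    """Offset translation + amplitude scaling: subtract the mean, then
    divide by the maximum absolute deviation (a simple stand-in for the
    normalization step)."""
    m = sum(seq) / len(seq)
    centered = [v - m for v in seq]
    scale = max(abs(v) for v in centered) or 1.0
    return [v / scale for v in centered]

def within_envelope(x, y, eps, max_outlier_frac=0.1):
    """Treat two equal-length subsequences as similar when the pointwise
    gap |x[i] - y[i]| stays within eps, ignoring up to a fraction of
    outlier points (hypothetical parameterization)."""
    if len(x) != len(y):
        return False
    outliers = sum(abs(a - b) > eps for a, b in zip(x, y))
    return outliers <= max_outlier_frac * len(x)

a = [1.0, 2.0, 3.0, 2.0, 1.0]
b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape, shifted and rescaled
print(within_envelope(normalize(a), normalize(b), eps=0.05))  # → True
```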


Subsequence Matching

• Break each sequence into a set of window pieces of length w
• Extract the features of the subsequence inside each window
• Map each sequence to a “trail” in the feature space
• Divide the trail of each sequence into “subtrails” and represent each of them with a minimum bounding rectangle
• Use a multipiece assembly algorithm to search for longer sequence matches


Sequential pattern mining: Cases and Parameters

Duration of a time sequence T

• Sequential pattern mining can then be confined to the data within a specified duration

• Ex. Subsequence corresponding to the year of 1999

• Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption

Event folding window w

• If w = T, time-insensitive frequent patterns are found

• If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant

• If 0 < w < T, sequences occurring within the same period w are folded in the analysis
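The effect of the event folding window can be illustrated with a small sketch; the timestamps and the `fold_events` helper are hypothetical.

```python
from itertools import groupby

def fold_events(events, w):
    """Fold a list of (timestamp, item) events into a sequence of
    itemsets: events whose timestamps fall in the same window of width w
    end up in the same element. With w = 0 every distinct time instant
    is its own element; with w at least the total duration T, the whole
    sequence folds into a single time-insensitive itemset."""
    if w == 0:
        key = lambda e: e[0]        # one element per time instant
    else:
        key = lambda e: e[0] // w   # one element per window of width w
    ordered = sorted(events, key=lambda e: e[0])
    return [set(item for _, item in grp)
            for _, grp in groupby(ordered, key=key)]

events = [(1, 'a'), (2, 'b'), (3, 'c'), (7, 'd')]
print(fold_events(events, 0))    # → [{'a'}, {'b'}, {'c'}, {'d'}]
print(fold_events(events, 5))    # one window {a, b, c}, then {d}
print(fold_events(events, 100))  # everything folded into one itemset
```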


Sequential pattern mining: Cases and Parameters (2)

Time interval, int, between events in the discovered pattern

• int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
  – Ex. “Find frequent patterns occurring in consecutive weeks”

• min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
  – Ex. “If a person rents movie A, it is likely she will rent movie B within 30 days” (int ≤ 30)

• int = c ≠ 0: find patterns carrying an exact interval
  – Ex. “Every time when Dow Jones drops more than 5%, what will happen exactly two days later?” (int = 2)
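All three interval cases reduce to one gap check over the timestamps of the matched events. A minimal sketch; the `gaps_ok` helper is illustrative, not from any particular system.

```python
def gaps_ok(times, min_int=0, max_int=float('inf')):
    """Check the interval constraint between consecutive matched events:
    every gap must satisfy min_int <= gap <= max_int. Setting
    min_int = max_int = c demands an exact interval of c; the defaults
    allow any non-negative gap."""
    return all(min_int <= b - a <= max_int
               for a, b in zip(times, times[1:]))

# "rents movie A ... rents movie B within 30 days"  (int <= 30)
print(gaps_ok([3, 21], max_int=30))           # → True: gap of 18 days
print(gaps_ok([3, 40], max_int=30))           # → False: gap of 37 days
print(gaps_ok([5, 7], min_int=2, max_int=2))  # → True: exact interval 2
```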


Episodes and Sequential Pattern Mining Methods

Other methods for specifying the kinds of patterns

• Serial episodes: A → B

• Parallel episodes: A & B

• Regular expressions: (A | B)C*(D → E)

Methods for sequential pattern mining

• Variations of Apriori-like algorithms


Click Streams

Client click-stream analysis is a click-by-click view of a visitor's journey (or journeys) through a web site. By viewing a click-stream report, you can follow the exact pathway a visitor took through a web site, even down to the length of time they spent looking at each particular page.


Click Streams…Continued

The people most interested in this report would typically be involved in marketing, web design or web development. The information presented provides a click-by-click view of how visitors are interacting with and navigating through their web site.


Periodicity Analysis

Periodicity is everywhere: tides, seasons, daily power consumption, etc.

Full periodicity

• Every point in time contributes (precisely or approximately) to the periodicity

Partial periodicity: A more general notion

• Only some segments contribute to the periodicity
  – Jim reads the NY Times 7:00–7:30 am every weekday

Cyclic association rules

• Associations which form cycles

Methods

• Full periodicity: FFT, other statistical analysis methods

• Partial and cyclic periodicity: Variations of Apriori-like mining methods
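Partial periodicity can be illustrated with a crude confidence measure: the fraction of periods in which a given symbol occurs at a fixed offset. This is a toy sketch with made-up data, not one of the Apriori-style methods mentioned above.

```python
def partial_periodicity(seq, period, offset, symbol):
    """Fraction of complete periods in which `symbol` occurs at
    position `offset` within the period -- a crude confidence measure
    for a partial periodic pattern."""
    hits = total = 0
    for start in range(0, len(seq) - period + 1, period):
        total += 1
        hits += seq[start + offset] == symbol
    return hits / total if total else 0.0

# 'n' = reads the NY Times, '*' = anything else; period of 7 "days".
week = list("n**n***" "n******" "n**n*n*")
print(partial_periodicity(week, 7, 0, 'n'))  # → 1.0: 'n' at offset 0
                                             #   in every period
```

Only the offset-0 position contributes fully to the periodicity here; the rest of each period is irrelevant, which is exactly what distinguishes partial from full periodicity.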


MD Sequence Database

P = (*, Chicago, *, <bf>) matches tuples 20 and 30. Given support threshold min_sup = 2, P is an MD sequential pattern.


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>
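Matching an MD pattern against this table can be sketched directly: every non-'*' dimension must agree, and the pattern's sequence must be contained in the row's sequence. Itemsets are modeled as sets and the helper names (`contains`, `matches`) are illustrative.

```python
def contains(pattern, sequence):
    """pattern is a subsequence of sequence (itemsets as sets, in order)."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def matches(md_pattern, row):
    """A row matches an MD pattern when every non-'*' dimension value
    agrees and the row's sequence contains the pattern's sequence."""
    *dims, pat_seq = md_pattern
    *row_dims, row_seq = row
    return (all(p == '*' or p == v for p, v in zip(dims, row_dims))
            and contains(pat_seq, row_seq))

db = [
    ('Business', 'Boston', 'Middle', [{'b', 'd'}, {'c'}, {'b'}, {'a'}]),        # 10
    ('Professional', 'Chicago', 'Young', [{'b', 'f'}, {'c', 'e'}, {'f', 'g'}]), # 20
    ('Business', 'Chicago', 'Middle', [{'a', 'h'}, {'a'}, {'b'}, {'f'}]),       # 30
    ('Education', 'New York', 'Retired', [{'b', 'e'}, {'c', 'e'}]),             # 40
]

P = ('*', 'Chicago', '*', [{'b'}, {'f'}])   # the pattern (*,Chicago,*,<bf>)
print(sum(matches(P, row) for row in db))   # → 2: tuples 20 and 30
```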

Mining of MD Seq. Pat.

Embedding MD information into sequences
• Using a uniform seq. pat. mining method

Integration of seq. pat. mining and MD analysis method


UniSeq

Embed MD information into sequences


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Mine the extended sequence database using sequential pattern mining methods

cid MD-extension of sequences

10 <(Business,Boston,Middle)(bd)cba>

20 <(Professional,Chicago,Young)(bf)(ce)(fg)>

30 <(Business,Chicago,Middle)(ah)abf>

40 <(Education,New York,Retired)(be)(ce)>
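The UniSeq transformation above amounts to prepending the dimension values as an artificial first element of each sequence. A minimal sketch; `uniseq_extend` is an illustrative name.

```python
def uniseq_extend(row):
    """UniSeq: embed the dimension values as an extra first 'itemset'
    so that a plain sequential pattern miner can treat them as
    ordinary items."""
    *dims, seq = row
    return [set(dims)] + seq

db = [
    ('Business', 'Boston', 'Middle', [{'b', 'd'}, {'c'}, {'b'}, {'a'}]),
    ('Professional', 'Chicago', 'Young', [{'b', 'f'}, {'c', 'e'}, {'f', 'g'}]),
]
for row in db:
    # First element carries the dimension values, e.g.
    # {'Business', 'Boston', 'Middle'}, followed by the original sequence.
    print(uniseq_extend(row))
```

In practice the dimension values would be tagged (e.g. 'city=Chicago') so they cannot collide with ordinary items; plain sets are used here for brevity.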

Mine Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns

• <a>, <b>, <c>, <d>, <e>, <f>

Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets:

• The ones having prefix <a>;

• The ones having prefix <b>;

• …

• The ones having prefix <f>

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


Find Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>

• <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

• Further partition into 6 subsets
  – Having prefix <aa>
  – …
  – Having prefix <af>

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


Completeness of PrefixSpan

SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>

From the sequence database (SDB), find the length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>.

Having prefix <a> gives the <a>-projected database
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
and from it the length-2 sequential patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, which are mined in turn via the <aa>-projected database, …, the <af>-projected database.

Having prefix <b>, <c>, …, <f> likewise gives the <b>-projected database, …, the <f>-projected database, so every sequential pattern falls into exactly one branch of the recursion.

Efficiency of PrefixSpan

No candidate sequence needs to be generated

Projected databases keep shrinking

Major cost of PrefixSpan: constructing projected databases

• Can be improved by bi-level projections
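The projection idea can be condensed into a short recursive sketch. For brevity this version handles sequences of single items only, ignoring itemset elements such as <(ab)>, so it illustrates prefix projection rather than the full PrefixSpan algorithm.

```python
from collections import Counter

def prefixspan(db, min_sup):
    """Stripped-down PrefixSpan over sequences of single items.
    Returns (pattern, support) pairs; no candidate sequences are ever
    generated, and each projected database only shrinks."""
    results = []

    def mine(prefix, projected):
        # Count each item once per sequence in the projected database.
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            pattern = prefix + (item,)
            results.append((pattern, sup))
            # Project: keep the suffix after the first occurrence of item.
            newdb = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, newdb)

    mine((), db)
    return results

db = [list("abcacd"), list("acbc"), list("abcd")]
for pat, sup in prefixspan(db, min_sup=3):
    print(pat, sup)
```

Running this on the three toy sequences finds, among others, the pattern ('a', 'b', 'c') with support 3.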


Mining MD-Patterns

The dimension combinations form a lattice, processed BUC-style (bottom-up cube):

(*,*,*)
(cust-grp,*,*)   (*,city,*)   (*,*,age-grp)
(cust-grp,city,*)   (cust-grp,*,age-grp)   (*,city,age-grp)
(cust-grp,city,age-grp)

cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Dim-Seq

First find MD-patterns (by BUC processing of the dimension lattice)
• E.g. (*,Chicago,*)

Form the projected sequence database
• <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)

Find seq. pat. in the projected database
• E.g. (*,Chicago,*,<bf>)


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Seq-Dim

Find sequential patterns
• E.g. <bf>

Form the projected MD-database
• E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>

Mine MD-patterns
• E.g. (*,Chicago,*,<bf>)


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>
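The middle step of Seq-Dim, forming the projected MD-database for a given sequential pattern, can be sketched as follows (itemsets as sets; `contains` and `project_md` are illustrative names):

```python
def contains(pattern, sequence):
    """pattern is a subsequence of sequence (itemsets as sets, in order)."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def project_md(db, seq_pattern):
    """Seq-Dim, step 2: collect the dimension tuples of every row whose
    sequence contains seq_pattern -- the projected MD-database."""
    return [row[:-1] for row in db if contains(seq_pattern, row[-1])]

db = [
    ('Business', 'Boston', 'Middle', [{'b', 'd'}, {'c'}, {'b'}, {'a'}]),        # 10
    ('Professional', 'Chicago', 'Young', [{'b', 'f'}, {'c', 'e'}, {'f', 'g'}]), # 20
    ('Business', 'Chicago', 'Middle', [{'a', 'h'}, {'a'}, {'b'}, {'f'}]),       # 30
    ('Education', 'New York', 'Retired', [{'b', 'e'}, {'c', 'e'}]),             # 40
]

md_db = project_md(db, [{'b'}, {'f'}])   # the sequential pattern <bf>
print(md_db)
# → [('Professional', 'Chicago', 'Young'), ('Business', 'Chicago', 'Middle')]
```

Mining this projected MD-database with min_sup = 2 would then yield (*,Chicago,*), i.e. the MD sequential pattern (*,Chicago,*,<bf>).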

Scalability Over Dimensionality


Scalability Over Cardinality


Scalability Over Support Threshold


Scalability Over Database Size


Pros & Cons of Algorithms

Seq-Dim is efficient and scalable
• Fastest in most cases

UniSeq is also efficient and scalable
• Fastest with low dimensionality

Dim-Seq has poor scalability


Conclusions

MD seq. pat. mining is interesting and useful

Mining MD seq. pat. efficiently
• UniSeq, Dim-Seq, and Seq-Dim

Future work
• Applications of sequential pattern mining
