
amt@cs.rit.edu Proprietary and Confidential

NOTICE: Proprietary and Confidential

This material is proprietary to A. Teredesai and GCCIS, RIT.

Slide 1

A Comprehensive Look at Mining Time-Series and

Sequential Patterns

Ankur Teredesai, Department of Computer Science, Rochester Institute of Technology

Definition of Time-Series

[Figure: a sample time series — roughly 500 observations plotted on a 23–29 value axis, with raw readings such as 25.1750, 25.2250, …, 24.6250, 24.7500.]

A time series is a collection of observations made sequentially in time.

amt@cs.rit.edu Dr. Ankur M. Teredesai P2

Sample Example for Time-Series (cont.)

People measure things...
• The president's approval rating
• Their blood pressure
• The annual rainfall in Los Angeles
• The value of their Yahoo stock
• The number of web hits per second

… and things change over time.


Thus time series occur in virtually every medical, scientific, and business domain.

What can We Do with Time-Series?

Clustering

Classification

Query by Content

Rule Discovery (e.g., a rule holding with support s = 0.5 and confidence c = 0.3)

Motif Discovery


Novelty Detection

Sample Model for Information Streams Mining


• Information streams vs. time-series:

• In many emerging science and business applications, data takes the form of streams rather than static datasets.

• We can define an information stream as continuously arriving dynamic data, in contrast with static time-series data.

*MIESIS (MIning from Earth Science Information Streams)

Information Streams Segmentation

We need to do this for:

• Symbolization

• Dimensionality reduction

Using a fixed-length sliding window

[Figure: a stream over a 0–120 window, segmented by a fixed-length sliding window and labeled with symbols a, b, c.]

Information Streams Segmentation (cont.)

Using turning points

[Figure: information stream data from a sensor — value (−0.2 to 0.15) vs. time (0 to 70), segmented at turning points.]

Clustering

Feature extraction

• For dimensionality reduction, we need to extract features from the raw information streams

• DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), PAA (Piecewise Aggregate Approximation), etc.

Similarity Measure

• Defining the similarity between two raw information streams or two feature vectors

• Euclidean distance metric, Pearson's correlation coefficient, etc.
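As an illustrative sketch, both similarity measures named above can be computed directly on two equal-length sequences (plain Python; the function names are our own):

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length sequences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    # Pearson's correlation coefficient between two equal-length sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

Euclidean distance is sensitive to offset and amplitude, while Pearson's coefficient is invariant to both — one reason the choice of measure matters when clustering raw streams.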


Clustering (cont.)

Hierarchical Clustering

Partitional Clustering (e.g. K-means)


Symbolic Representation

Example 1

Example 2

[Figures: Example 1 — the value range partitioned into regions R1–R9; Example 2 — a stream over a 0–120 window labeled with symbols a, b, c.]

Symbolic Representation (cont.)

Express the information stream as a sequence of symbols

Now we can work in a lower-dimensional space than the raw information stream data. We can also use well-known string-processing data structures such as the inverted index or suffix tree, or sequence models such as the HMM.

aaabaabcbabccb
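A minimal sketch of one such symbolization — fixed-length windows labeled by the slope of each window, with thresholds chosen purely for illustration (not the exact scheme behind the figure):

```python
def symbolize(series, w):
    # Discretize a stream into symbols using fixed-length windows:
    # 'a' = falling, 'b' = roughly flat, 'c' = rising.
    symbols = []
    for i in range(0, len(series) - w + 1, w):
        window = series[i:i + w]
        slope = (window[-1] - window[0]) / (w - 1)  # mean slope of the window
        symbols.append('a' if slope < -0.1 else 'c' if slope > 0.1 else 'b')
    return ''.join(symbols)
```

The resulting string can then be fed to inverted indexes, suffix trees, or HMMs as described above.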


Possible Mining Operations

Novelty detection

• Can be used to identify potential anomaly events

• It is also referred to as the detection of "Aberrant Behavior", "Anomalies", "Faults", "Surprises", "Deviants", "Temporal Change", and "Outliers".

• As these terms suggest, we can detect previously unseen patterns or sequences in the incoming information stream, based on a training dataset.

Prediction

• The utility of a prediction model lies in detecting events rather than predicting numerical values. An event is a meaningful object to which we can assign semantics, e.g., an earthquake or a flood.

Finding correlation between clusters

• We can detect Spatial/Temporal correlation between clusters or information streams


Mining Time-Series and Sequence Data

Time-series database

• Consists of sequences of values or events changing with time

Applications

• Financial: stock price, inflation

• Biomedical: blood pressure

• Meteorological: precipitation


Mining Time-Series and Sequence Data: Trend analysis

Categories of Time-Series Movements
• Long-term or trend movements (trend curve)
• Cyclic movements or cyclic variations, e.g., business cycles
• Seasonal movements or seasonal variations
– i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years

• Irregular or random movements


Estimation of the Trend Curve

The freehand method

• Fit the curve by looking at the graph

• Costly and barely reliable for large-scale data mining

The least-squares method

• Find the curve minimizing the sum of the squares of the deviation of points on the curve from the corresponding data points

The moving-average method

• Eliminate cyclic, seasonal and irregular patterns

• Loss of end data

• Sensitive to outliers


Discovery of Trend in Time-Series (1)

Estimation of seasonal variations

• Seasonal index
– A set of numbers showing the relative values of a variable during the months of the year
– E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months

• Deseasonalized data
– Data adjusted for seasonal variations
– E.g., divide the original monthly data by the seasonal index numbers for the corresponding months
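The deseasonalizing step is simple arithmetic; a sketch using the slide's own numbers:

```python
def deseasonalize(monthly_value, seasonal_index):
    # Divide the raw monthly figure by the seasonal index (in percent).
    return monthly_value / (seasonal_index / 100.0)
```

For example, October sales of 80 with an October index of 80 deseasonalize to 100, i.e., an exactly average month.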


Similarity Search in Time-Series Analysis

A normal database query finds exact matches; a similarity search finds data sequences that differ only slightly from the given query sequence.

Two categories of similarity queries:
• Whole matching: find a sequence that is similar to the query sequence
• Subsequence matching: find all subsequences of the data that are similar to the query sequence

Typical applications:

• Financial market

• Market basket data analysis

• Scientific databases

• Medical diagnosis


Data Transformation

Many techniques for signal analysis require the data to be in the frequency domain. Usually data-independent transformations are used:

• The transformation matrix is determined a priori
– E.g., discrete Fourier transform (DFT), discrete wavelet transform (DWT)

• The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

• DFT does a good job of concentrating energy in the first few coefficients

• If we keep only the first few coefficients of the DFT, we can compute a lower bound on the actual distance
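A small sketch of this lower-bounding property using a naive DFT (stdlib only; by Parseval's theorem, a distance computed from only the first k coefficients can never exceed the true Euclidean distance):

```python
import cmath
import math

def dft(x):
    # Naive O(n^2) discrete Fourier transform
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def lower_bound(a, b, k):
    # Distance from the first k DFT coefficients only; with the 1/n
    # normalization this is a lower bound on euclidean(a, b).
    fa, fb = dft(a), dft(b)
    return math.sqrt(sum(abs(fa[i] - fb[i]) ** 2 for i in range(k)) / len(a))
```

Because the truncated distance never overestimates, an index built on the first few coefficients can prune candidates without false dismissals.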


Finding Surprising Patterns in a Time Series Database in Linear Time and Space


Paper by Eamonn Keogh, Stefano Lonardi, and Bill Chiu; presented at ACM SIGKDD 2002

Main Purpose

Novelty Detection

• The authors note that this problem should not be confused with the relatively simple problem of outlier detection.

• They focused on finding surprising patterns, not on finding individually surprising datapoints.

• The blue time series at the top is a normal, healthy human heartbeat with an artificial "flatline" added. The sequence in red at the bottom indicates how surprising local subsections of the time series are.


Basic Ideas

A pattern is surprising if its frequency of occurrence differs greatly from what we expected. Their notion of the surprisingness of a pattern is not tied exclusively to its shape; instead it depends on the difference between the shape's expected frequency and its observed frequency.

Example: consider the head-and-shoulders pattern shown below.

• The existence of this pattern in a stock market time series occurs an average of three times a year.

• If it occurred ten times this year : surprising.

• If its frequency of occurrence is less than expected: also a surprising pattern.


Approach

Formal definition of a surprising pattern

• A time series pattern P, extracted from database X, is surprising relative to a database R if the probability of its occurrence is greatly different from that expected by chance, assuming that R and X are created by the same underlying process.

Example: let x = principalskinner

• Σ is {a, c, e, i, k, l, n, p, r, s}
• |x| is 16
• skin is a substring of x
• prin is a prefix of x
• ner is a suffix of x
• If y = in, then fx(y) = 2
• If y = pal, then fx(y) = 1

How about y = eik?
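The substring-frequency function fx(y) used in the example can be sketched directly (overlapping occurrences are counted):

```python
def fx(x, y):
    # Count (possibly overlapping) occurrences of substring y in string x
    return sum(1 for i in range(len(x) - len(y) + 1) if x[i:i + len(y)] == y)
```

On x = principalskinner this reproduces the counts above, and fx(x, "eik") comes out 0 — eik never occurs.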


Approach (cont.)

Steps (TARZAN algorithm)

• Discretizing the time-series into symbolic strings
– Fixed-size sliding window
– Slope of the best-fitting line

• Calculate probability of any pattern, including ones we have never seen before using Markov models

• For maintaining linear time and space property, they use suffix tree data structure

• Computing scores by comparing trees between reference data and incoming information stream

aaabaabcbabccb
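An illustrative sketch of the scoring idea only — not the authors' suffix-tree implementation: estimate an order-1 Markov model from the reference string and score a pattern by its observed minus expected frequency in the incoming string:

```python
from collections import Counter

def surprise(reference, test, w):
    # Expected count of w in `test` under an order-1 Markov model
    # estimated from `reference`, subtracted from its observed count.
    pairs = Counter(reference[i:i + 2] for i in range(len(reference) - 1))
    firsts = Counter(reference[:-1])
    prob = reference.count(w[0]) / len(reference)
    for a, b in zip(w, w[1:]):
        prob *= pairs[a + b] / firsts[a] if firsts[a] else 0.0
    expected = (len(test) - len(w) + 1) * prob
    observed = sum(1 for i in range(len(test) - len(w) + 1)
                   if test[i:i + len(w)] == w)
    return observed - expected
```

A pattern the reference process never generates scores higher than one the model fully expects — exactly the behavior the tree comparison is after.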


Experimental Evaluation

Two features:

• Sensitivity (high true positive rate)
– The algorithm can find truly surprising patterns in a time series
– This corresponds to recall

• Selectivity (low false positive rate)
– The algorithm will not find spurious "surprising" patterns in a time series
– This corresponds to precision

Goal is maintaining High Precision and Recall

• They achieved high Sensitivity

• But, Selectivity??


Experimental Evaluation (cont.): Shock ECG

[Figure: Shock ECG training data and a test subset over samples 0–1600, with Tarzan's level of surprise plotted below.]

Experimental Evaluation (cont.): Power Demand

• They consider a dataset that contains the power demand for a Dutch research facility for the entire year of 1997. The data is sampled over 15 minute averages, and thus contains 35,040 points.

• Above is the first 3 weeks of the power demand dataset. Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quiet weekends.

[Figure: the first 3 weeks of the power demand dataset — demand roughly 500–2500 over samples 0–2000.]

Experimental Evaluation (Power Demand cont.)

They used the period from Monday, January 6th to Sunday, March 23rd as reference data. This time period contains no national holidays. They tested on the remainder of the year.

They showed the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches they showed the entire week (beginning Monday) in which the 3 largest values of surprise fell.

Both TSA-tree and IMM returned sequences that appear to be normal workweeks.

Tarzan returned 3 sequences that correspond to the weeks that contain national holidays.

[Figure: the most surprising weeks found by Tarzan, TSA-tree, and IMM.]


Experimental Evaluation (cont.)


• The previous experiments demonstrate the ability of Tarzan to find surprising patterns (sensitivity)

• However, they also need to consider Tarzan's selectivity
– To reduce false alarms, they attempted to scale to massive datasets

• If Tarzan is trained on a short random-walk dataset:
– The chance that similar patterns of the test data exist in the short training database is very small → many false alarms
– Solution: increase the size of the training data, and the surprisingness of the test data should decrease
– The more training on huge random-walk data, the fewer spurious surprising patterns are detected

Possible Future Research Opportunity

Mentioned in the paper

• Incorporating user feedback and domain based constraints

• Applying different feature extraction techniques

Information streams + Ontology

• Finding methods to combine information stream mining with ontologies

• Intuitively, if we can extract general/abnormal patterns in information streams and generate clusters, we can assign semantics to the patterns or clusters.

• For example, we could relate a news stream about "War in Iraq" to the stock price changes of an oil company using a "News-Stock Ontology Model": we detect a rapid increase in the volume of news regarding "War" and "Iraq" at time t, and a rapid increase/decrease in the oil company's stock price at time t+α.


Suffix Tree Data Structure


Multidimensional Indexing

Multidimensional index
• Constructed for efficient access using the first few Fourier coefficients
• Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence
• Perform postprocessing by computing the actual distance between sequences in the time domain and discard any false matches



B-Trees

Generalizes the multilevel index
• Number of levels varies with the size of the data file, but is quite often 3
• Height balanced
– Equal-length access paths to different records
• Adapts well to insertions and deletions

DBMSs typically use a variant called a B+tree
• All nodes have the same format: n keys, n+1 pointers, at least half of them in use
• Useful for primary and secondary indexes, on primary keys and non-keys


B+Tree Example

[Figure: a B+tree with n = 3. Root key: 100. Inner nodes: (30) on the left, (120, 150, 180) on the right. Leaves: (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200).]

Sample Non-Leaf Node

[Figure: a non-leaf node with keys 120, 150, 180; its four pointers lead to keys k < 120, 120 ≤ k < 150, 150 ≤ k < 180, and k ≥ 180.]

Sample Leaf Node

[Figure: a leaf node, reached from a non-leaf node, holding keys 120 and 130 with one slot unused; its pointers lead to the records with keys 120 and 130, plus a sequence pointer to the next leaf.]

Nodes Must Not Be Too Empty

Number of pointers in use:
• At internal nodes, at least ⌈(n+1)/2⌉
– To child nodes
• At leaves, at least ⌊(n+1)/2⌋
– To data records/blocks

Node Bounds (n = 3)

[Figure: a full non-leaf node holds keys 120, 150, 180; a minimum non-leaf node holds only 30. A full leaf holds 3, 5, 11; a minimum leaf holds 30, 35.]

B+tree Rules

All leaves at the same lowest level
• Balanced tree

Pointers in leaves point to records
• Except for the "sequence pointer"

Number of pointers/keys for a B+tree:

Node type             Max ptrs   Max keys   Min ptrs (→ data)   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉           ⌈(n+1)/2⌉ − 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋           ⌊(n+1)/2⌋
Root                  n+1        n          2♠                  1

♠ Can be 1 if there is only one record in the file

B+tree Insertions

Search for the key being inserted. Four cases:
• Leaf has space
– Just insert (key, pointer-to-record)
• Leaf overflow
• Non-leaf overflow
• New root
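A toy sketch of the first two cases at the leaf level (keys only; `n` is the maximum number of keys per node — not a full B+tree implementation):

```python
def insert_into_leaf(leaf, key, n=3):
    # Returns (keys, None, None) when the leaf has room, or
    # (left, right, pushed_key) when the leaf overflows and splits.
    keys = sorted(leaf + [key])
    if len(keys) <= n:
        return keys, None, None       # case "leaf has space"
    mid = len(keys) // 2              # case "leaf overflow": split in half
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]      # right[0] is inserted into the parent
```

With n = 3, inserting 32 into the leaf (30, 31) just extends it, while inserting 7 into the full leaf (3, 5, 11) splits it into (3, 5) and (7, 11) and pushes 7 up — matching the two examples that follow.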


Leaf Has Space: Insert Key 32 (n = 3)

[Figure: under root key 100 and inner key 30, the leaf (30, 31) has room, so it simply becomes (30, 31, 32).]

Leaf Overflow: Insert Key 7 (n = 3)

[Figure: inserting 7 into the full leaf (3, 5, 11) splits it into (3, 5) and (7, 11); the key 7 is inserted into the parent node alongside 30.]

Non-Leaf Overflow: Insert Key 160 (n = 3)

[Figure: inserting 160 splits the leaf (150, 156, 179) into (150, 156) and (160, 179); pushing 160 into the full non-leaf node (120, 150, 180) overflows it, so that node is split and a key moves up a level.]

New Root: Insert Key 45 (n = 3)

[Figure: inserting 45 splits the leaf (30, 32, 40) into (30, 32) and (40, 45); pushing 40 up overflows the old root (10, 20, 30), which splits, and a new root holding 30 is created.]

Tree grows at the root and maintains balance.

B+tree Deletions

Search for the key being deleted
• If found, delete

Three broad cases:
• Leaf does not underflow
• Borrow keys from an adjacent sibling, if that does not also cause an underflow
• Coalesce with a sibling node
– Repeat if needed

It is sometimes acceptable to allow a leaf to become sub-minimum (no mergers), violating the strict B-tree definition.


Leaf Does Not Underflow: Delete Key 35 (n = 4; min keys in a leaf = ⌊5/2⌋ = 2)

[Figure: with parent keys (10, 40, 100), deleting 35 from the leaf (10, 20, 30, 35) leaves (10, 20, 30), still above the minimum.]

Borrow Keys: Delete Key 50 (n = 4; min keys in a leaf = ⌊5/2⌋ = 2)

[Figure: deleting 50 leaves the leaf (40) under-full; it borrows 35 from its sibling (10, 20, 30, 35), and the parent's separating key changes from 40 to 35.]

Coalesce with Sibling: Delete Key 50 (n = 4)

[Figure: deleting 50 leaves the leaf (40) under-full and its sibling (20, 30) cannot lend a key, so the two coalesce into (20, 30, 40) and the separating key 40 is removed from the parent.]

Coalesce Non-Leaf: Delete Key 37 (n = 4; min keys in a non-leaf = ⌈(n+1)/2⌉ − 1 = 3 − 1 = 2)

[Figure: deleting 37 causes coalescing to cascade up through the non-leaf levels; the old root is eliminated and a node holding keys 25, 30, 40 becomes the new root.]

Tree shrinks at the root.

B+tree Deletions in Practice

Coalescing is often not implemented
• Too hard and usually not worth it!
• Subsequent insertions may return the node to the required minimum size
• Compromise
– Try redistributing keys with a sibling
– If not possible, leave it as is
• If all accesses to records go through the B-tree
– Place a "tombstone" for the deleted record at the leaf

Traditional B-Trees

A B-tree is similar to a B+tree
• Each search key appears only once
– No redundant storage of search keys
• Additional pointer field for each search key in non-leaf nodes
– Points to the record directly

Non-leaf node layout, B+tree versus B-tree:

P1 K1 P2 … Pn-1 Kn-1 Pn
P1 R1 K1 P2 R2 K2 … Pn-1 Rn-1 Kn-1 Pn


B-Tree Advantages and Disadvantages

Advantages
• Fewer nodes than the corresponding B+tree
• Possible to find a key before hitting a leaf node

Disadvantages
• Only a small fraction of all keys are found early
• Non-leaf nodes are larger, so reduced fan-out
– A B-tree is often deeper than the corresponding B+tree
• More complex than B+trees
– Insertion/deletion and overall implementation
• B+trees are usually better than B-trees

B+Trees in Practice

Typical order: 100
• Typical fill-factor around 67%
• Average fanout around 133

Typical capacities:
• Height 4: 133^4 ≈ 312,900,721 records
• Height 3: 133^3 = 2,352,637 records

Can often hold the top levels in the buffer pool:
• Level 1 = 1 page = 8 KBytes
• Level 2 = 133 pages = 1 MByte
• Level 3 = 17,689 pages = 133 MBytes
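The capacity figures are just powers of the average fanout; a quick check:

```python
def capacity(fanout, height):
    # Records reachable from a B+tree of the given height, assuming the
    # stated average fanout at every level.
    return fanout ** height
```

capacity(133, 3) gives 2,352,637 and capacity(133, 4) gives 312,900,721, while 133² = 17,689 pages make up level 3.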


Tree-Structured Indexes

Ideal for range searches and equality searches

ISAM: static structure
• Only leaf pages are modified
• Overflow pages degrade performance

B+tree: dynamic structure
• Inserts/deletes leave the tree height-balanced, and it offers graceful growth and shrinking
– High fanout (F) ⇒ shallow depth, rarely > 3 or 4
– 67% occupancy on average
• Preferable to ISAM, modulo locking considerations
• Widely used DBMS index structure and one of the most optimized DBMS components


Multidimensional Data

Geographic & multidimensional data applications
• Sale (store, day, item, color, size, etc.)
– Each sale is a point in 5-dimensional space
• Customer: (age, salary, zip, married, …)

Typical queries
• Range queries
– Find employees in the Toy department who make at least 25K dollars
• Nearest neighbor
– I am here: where's the nearest MacGregors?
• Is this expressible in SQL?


Big Impediment

For these queries, there is no clean way to eliminate the many records that don't meet the WHERE condition.

Approaches
• Index on one attribute
– Get data satisfying one attribute's condition and filter on the others
• Index on attributes independently
– Intersect pointers in main memory to save disk I/O
– Does this help with nearest neighbor?
• Multiple-key index
– An index on one attribute provides pointers to an index on the other

2-Level Indexing

[Figure: an index on the first attribute whose entries point to separate indexes I1, I2, I3 on the second attribute.]

Example

[Figure: a Dept index with entries Art, Sales, Toy; each entry points to a Salary index (values such as 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k), which in turn points to records such as the sample employee Name = Joe, Dept = Sales, Salary = 15k.]

Some Queries

Question
• For what kinds of conditions on dept and salary will a multiple-key index (dept first) significantly reduce the number of disk I/Os?

How about finding records where …
• Dept = "Sales" and Salary = 20k
• Dept = "Sales" and Salary > 20k
• Dept = "Sales"
• Salary = 20k

Interesting Application: Geographic Data

Data
• <X1, Y1, Attributes>
• <X2, Y2, Attributes>
• ...

Queries
• What city is at <Xi, Yi>?
• What is within 5 miles of <Xi, Yi>?
• Which is the closest point to <Xi, Yi>?

[Figure: a set of labeled points a–o in the x–y plane (coordinates up to about 40), used to illustrate searching for points near f and near b.]

Example

Queries
• Find points with Yi > 20
• Find points with Xi < 5
• Find points "close" to i = <12, 38>
• Find points "close" to b = <7, 24>


Other Structures

Other geographic index structures
• Quad trees
• R-trees

More multikey indexes
• Grid
• Partitioned hash

Grid Index

[Figure: a grid with Key 1 values V1, V2, …, Vn as rows and Key 2 values X1, X2, …, Xn as columns; the cell <V3, X2> points to the records with key1 = V3 and key2 = X2.]

Claim

Can quickly find records with
• key 1 = Vi and key 2 = Xj
• key 1 = Vi
• key 2 = Xj

And also ranges…
• E.g., key 1 ≥ Vi and key 2 < Xj
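A toy grid index supporting exactly these lookups, with each key's domain partitioned by a sorted list of boundary values (an illustrative in-memory sketch, not a disk layout):

```python
import bisect

class GridIndex:
    # Buckets at the intersection of two linear scales.
    def __init__(self, scale1, scale2):
        self.scale1, self.scale2 = scale1, scale2  # sorted boundary values
        self.grid = [[[] for _ in range(len(scale2) + 1)]
                     for _ in range(len(scale1) + 1)]

    def _cell(self, k1, k2):
        return (bisect.bisect_right(self.scale1, k1),
                bisect.bisect_right(self.scale2, k2))

    def insert(self, k1, k2, rec):
        i, j = self._cell(k1, k2)
        self.grid[i][j].append(rec)

    def lookup(self, k1, k2):
        i, j = self._cell(k1, k2)
        return self.grid[i][j]
```

A single-key or range query scans an entire row, column, or rectangular block of cells instead of a single cell.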


Storing Grid Indexes

Catch with grid indexes!
• How is the grid index stored on disk?

Problem
• We need regularity to compute the position of the <Vi, Xj> entry

Like an array: row V1's cells X1…X4, then row V2's, then row V3's, …

Solution: Use Indirection

[Figure: the grid cells for rows V1–V4 and columns X1–X3 contain only pointers to buckets, which hold the actual records.]

The grid only contains pointers to buckets.

Indexing Grid on Value Ranges

[Figure: a grid addressed through linear scales — Dept scale: 1 = Toy, 2 = Sales, 3 = Personnel; Salary scale: 1 = 0–20K, 2 = 20K–50K, 3 = 50K+.]

The grid can be regular without wasting space, but we do pay the price of indirection.

Partitioned Hashing

Hash function
• Combines several attributes
• Great when all attribute values are specified

A partitioned hash function devotes some bits of the bucket number to each attribute independently:

[Figure: the bucket number 010110 1110010 is formed by concatenating h1(Key1) and h2(Key2).]

Example (1)

Suppose h1(toy) = 0, h1(sales) = 1, h1(art) = 1, …, and h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, …

Insert <Fred, toy, 10K>, <Joe, sales, 10K>, <Sally, art, 30K>:

[Figure: buckets numbered 000–111; <Fred> lands in bucket 001, while <Joe> and <Sally> land in bucket 101.]
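A sketch of this bucket computation (the h1/h2 values are the illustrative ones from the slide, not real hash functions):

```python
h1 = {"toy": "0", "sales": "1", "art": "1"}                # 1 bit for dept
h2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}  # 2 bits for salary

def bucket(dept, salary):
    # Partitioned hash: concatenate each attribute's bits.
    return h1[dept] + h2[salary]

def buckets_for_salary(salary):
    # Partial-match probe: when only the salary is known, try every
    # bucket whose low bits equal h2(salary).
    return sorted({bits + h2[salary] for bits in set(h1.values())})
```

bucket("toy", "10k") is 001 and bucket("sales", "10k") is 101, matching where <Fred> and <Joe> land; the partial-match helper anticipates Examples (3) and (4).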

Example (2)

Find employees with Dept = Sales and Sal = 40k: h1(sales) = 1 and h2(40k) = 00, so only bucket 100 must be examined.

[Figure: buckets 000–111 now holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>.]

Example (3)

Find employees with Sal = 30k: h2(30k) = 01, so look in every bucket of the form ?01, i.e., buckets 001 and 101.

[Figure: the same buckets, with 001 and 101 marked "look here".]

Example (4)

Find employees with Dept = Sales: h1(sales) = 1, so look in every bucket of the form 1??, i.e., buckets 100 through 111.

[Figure: the same buckets, with the 1?? buckets marked "look here".]


R Trees

A Dynamic Index Structure for Spatial Representation

Why R-Trees?

• Multi-dimensional objects are not well represented by point locations
• We need to be able to perform range searches
• One-dimensional indexes are not suitable for multi-dimensional spaces
• Ex: find all the counties within a 20-mile radius of Georgia Tech

Main Concepts

• Height-balanced tree similar to a B-tree
• Index records in leaf nodes point to data objects
• The index is dynamic; no periodic reorganization is required
• Index records (at leaf nodes) are of the form (I, tuple-identifier), where I is an n-dimensional bounding rectangle: I = (I0, I1, …, In-1), with n the number of dimensions and each Ii = [a, b] a closed bounded interval

More Concepts…

• Non-leaf nodes are of the form (I, child-pointer), where child-pointer is the address of a lower node and I is the smallest rectangle covering all the rectangles in the lower node's entries
• M = maximum number of entries in one node

More Concepts…

• m = parameter specifying the minimum number of entries in a node; m can be tuned and is ≤ M/2
• An R-tree containing N index records has height at most ⌈log_m N⌉ − 1
• Worst-case space utilization: m/M
• Maximum number of nodes: ⌈N/m⌉ + ⌈N/m²⌉ + … + 1

Searching

Denote the rectangle part of a node entry E by E.I and the child-pointer part by E.p.

Algorithm Search: given an R-tree with root node T, find all index records whose rectangles overlap a search rectangle S.
Step 1) [Search subtrees.] If T is not a leaf, check each entry E to determine whether E.I overlaps S. For all overlapping entries, invoke Search on the tree whose root is E.p.
Step 2) [Search leaf node.] If T is a leaf, check all entries E to determine whether E.I overlaps S. If so, E is a qualifying record; return E.
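The two steps translate almost directly into code; a sketch with nodes as ("leaf"/"inner", entries) tuples and rectangles as lists of per-dimension (low, high) intervals (our own representation, chosen for brevity):

```python
def overlaps(r1, r2):
    # Rectangles overlap iff their intervals overlap in every dimension.
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(r1, r2))

def search(node, s):
    # node = ("leaf", [(rect, tuple_id), ...])
    #      | ("inner", [(rect, child_node), ...])
    kind, entries = node
    if kind == "leaf":                    # Step 2: return qualifying records
        return [e for e in entries if overlaps(e[0], s)]
    results = []
    for rect, child in entries:           # Step 1: descend only into subtrees
        if overlaps(rect, s):             # whose bounding rectangle overlaps S
            results += search(child, s)
    return results
```

Note that, unlike a B-tree lookup, more than one subtree may overlap S, so the search can visit several paths.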


Insertion into an R-Tree

• Similar to B-tree insertion
• New index records are added to leaves
• Overflowing nodes are split
• Splits propagate up the tree

Algorithm Insert — Details

1. Invoke ChooseLeaf to select a leaf node L in which to place E
2. If L has room for another entry, install E; else invoke SplitNode on L, obtaining L and LL
3. Invoke AdjustTree on L (and LL if a split was performed)
4. If the root was split, create a new root with two children (those obtained by splitting the old root)

Algorithm ChooseLeaf

1. Set N to be the root
2. If N is a leaf, return N
3. If N is not a leaf, let F be the entry in N whose rectangle needs the least enlargement to include E.I
4. Set N to be the child node pointed to by F.p and repeat from Step 2

Algorithm AdjustTree

1. Set N = L; if L was split, set NN = LL
2. If N is the root, stop
3. Let P be N's parent and let EN be N's entry in P; adjust EN.I so that it "tightly" encloses all entries in N
4. If NN exists, create a new entry ENN with ENN.p pointing to NN and ENN.I enclosing all rectangles in NN; add ENN to P if there is room, otherwise invoke SplitNode to produce PP
5. Move up to the next level and repeat the process

Node Splitting

• A "full" node must be split when a new entry needs to be added
• Must ensure that on any subsequent search, with high probability only one of the two nodes needs to be explored
• The total area of the two covering rectangles should be minimized
• Exhaustive search has exponential complexity

Quadratic-Cost Algorithm

1. Use PickSeeds to choose two entries to be the first elements of the two groups
2. Repeat Step 3 until all entries have been assigned to one of the groups
3. Invoke PickNext to choose the next entry to assign; add it to the group whose covering rectangle needs to be expanded the least

Algorithm PickSeeds

1. For each pair of entries E1 and E2, let J be the rectangle including E1.I and E2.I; calculate d = area(J) − area(E1.I) − area(E2.I)
2. Choose the pair with the largest d value
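A direct sketch of PickSeeds over axis-aligned rectangles (each a list of per-dimension (low, high) intervals; the representation is our own):

```python
from itertools import combinations

def area(r):
    w = 1.0
    for lo, hi in r:
        w *= hi - lo
    return w

def cover(r1, r2):
    # Smallest rectangle J enclosing both r1 and r2
    return [(min(a, c), max(b, d)) for (a, b), (c, d) in zip(r1, r2)]

def pick_seeds(rects):
    # Choose the pair that wastes the most area when covered together:
    # maximize d = area(J) - area(E1) - area(E2).
    return max(combinations(range(len(rects)), 2),
               key=lambda p: area(cover(rects[p[0]], rects[p[1]]))
                             - area(rects[p[0]]) - area(rects[p[1]]))
```

The two seeds are the rectangles that would be most wasteful to keep in the same group, so they anchor the two new nodes.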


Algorithm PickNext

1. For each entry E not yet in any group, calculate d1 = the area increase required in the covering rectangle of Group 1 to include E.I; calculate d2 similarly for Group 2
2. Choose the entry with the maximum difference between d1 and d2

Algorithm LinearPickSeeds

1. Along each dimension, find the entry whose rectangle has the highest low side and the one with the lowest high side; record the separation between them
2. Normalize the separations by dividing by the width of the entire set along the corresponding dimension
3. Choose the pair with the greatest normalized separation along any dimension

Algorithm Delete

1. Invoke FindLeaf to locate the leaf node L containing E; remove E from L
2. Invoke CondenseTree on L
3. If the root node has only one child, make the child the new root

Algorithm FindLeaf

1. Set T to be the root of the tree
2. If T is not a leaf, check each entry F in T to determine whether F.I overlaps E.I; for each such entry, invoke FindLeaf on the tree pointed to by F.p
3. If T is a leaf, check each entry to see whether it matches E; if E is found, return T

Algorithm CondenseTree

1. Set N = L; set Q, the set of eliminated nodes, to the empty set
2. If N is the root, go to Step 6; else let P be the parent of N and let EN be N's entry in P
3. If N has fewer than m entries, delete EN from P and add N to Q

Algorithm CondenseTree (contd.)

4. If N has not been eliminated, adjust EN.I to tightly contain all entries in N
5. Set N = P and repeat from Step 2
6. Reinsert all entries of nodes in Q. Entries from eliminated leaf nodes are reinserted as in algorithm Insert; entries from higher-level nodes must be placed higher in the tree


Multi-dimensional Sequential Pattern Mining

Outline
• Why multidimensional sequential pattern mining?
• Problem definition
• Algorithms
• Experimental results
• Conclusions

Why Sequential Pattern Mining?

Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)

Many data and applications are time-related:
• Customer shopping patterns, telephone calling patterns
– E.g., first buy a computer, then CD-ROMs, then software, within 3 months

• Natural disasters (e.g., earthquake, hurricane)

• Disease and treatment

• Stock market fluctuation

• Weblog click stream analysis

• DNA sequence analysis


Sequential Pattern Mining

Mining of frequently occurring patterns related to time or other sequences

Examples
• Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order
• A collection of ordered events within an interval

Applications

• Targeted marketing

• Customer retention

• Weather prediction


Motivating Example

Sequential patterns are useful
• "free internet access → buy package 1 → upgrade to package 2"
• Marketing, product design & development

Problem: lack of focus
• Various groups of customers may have different patterns

MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining

Sequences and PatternsGiven a set of sequences, find the complete set of frequent subsequences

A sequence : < (ef) (ab) (df) c b >A sequence database

amt@cs.rit.edu Dr. Ankur M. Teredesai P98

Elements items within an element are listed alphabetically

SID sequence

10 <a(ababc)(acc)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(abab)(df)ccb>

40 <eg(af)cbc>

<a(bc)dc> is a subsequence of <<aa(a(abcbc)(ac))(ac)dd((ccff)>)>

Given support thresholdmin_sup =2, <(ab)c> is a sequential pattern

Sequential Pattern: Basics

A sequence: <(bd)cb(ac)>

A sequence database:

Seq. ID sequence
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>

<ad(ae)> is a subsequence of <a(bd)bcb(ade)>

Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern


Enhanced similarity search methods

• Allow for gaps within a sequence or differences in offsets or amplitudes
• Normalize sequences with amplitude scaling and offset translation
• Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers
• Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences
• Parameters specified by a user or expert: sliding window size, width of an envelope for similarity, maximum gap, and matching fraction
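A toy sketch of the normalization and envelope steps described above. The exact parameterization (e.g. `max_outlier_frac` for the ignored-outlier allowance) is our own illustrative assumption, not a specific published method.

```python
def normalize(seq):
    """Offset translation + amplitude scaling: subtract the mean, then
    divide by the maximum absolute deviation (a simple stand-in for the
    normalization step)."""
    m = sum(seq) / len(seq)
    centered = [v - m for v in seq]
    scale = max(abs(v) for v in centered) or 1.0
    return [v / scale for v in centered]

def within_envelope(x, y, eps, max_outlier_frac=0.1):
    """Treat two equal-length subsequences as similar when the pointwise
    gap |x[i] - y[i]| stays within eps, ignoring up to a fraction of
    outlier points (hypothetical parameterization)."""
    if len(x) != len(y):
        return False
    outliers = sum(abs(a - b) > eps for a, b in zip(x, y))
    return outliers <= max_outlier_frac * len(x)

a = [1.0, 2.0, 3.0, 2.0, 1.0]
b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape, shifted and rescaled
print(within_envelope(normalize(a), normalize(b), eps=0.05))  # → True
```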


Subsequence Matching

• Break each sequence into a set of window pieces of length w
• Extract the features of the subsequence inside each window
• Map each sequence to a “trail” in the feature space
• Divide the trail of each sequence into “subtrails” and represent each of them with a minimum bounding rectangle
• Use a multipiece assembly algorithm to search for longer sequence matches


Sequential pattern mining: Cases and Parameters

Duration of a time sequence T

• Sequential pattern mining can then be confined to the data within a specified duration

• Ex. Subsequence corresponding to the year of 1999

• Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption

Event folding window w

• If w = T, time-insensitive frequent patterns are found

• If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant

• If 0 < w < T, sequences occurring within the same period w are folded in the analysis
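The effect of the event folding window can be illustrated with a small sketch; the timestamps and the `fold_events` helper are hypothetical.

```python
from itertools import groupby

def fold_events(events, w):
    """Fold a list of (timestamp, item) events into a sequence of
    itemsets: events whose timestamps fall in the same window of width w
    end up in the same element. With w = 0 every distinct time instant
    is its own element; with w at least the total duration T, the whole
    sequence folds into a single time-insensitive itemset."""
    if w == 0:
        key = lambda e: e[0]        # one element per time instant
    else:
        key = lambda e: e[0] // w   # one element per window of width w
    ordered = sorted(events, key=lambda e: e[0])
    return [set(item for _, item in grp)
            for _, grp in groupby(ordered, key=key)]

events = [(1, 'a'), (2, 'b'), (3, 'c'), (7, 'd')]
print(fold_events(events, 0))    # → [{'a'}, {'b'}, {'c'}, {'d'}]
print(fold_events(events, 5))    # one window {a, b, c}, then {d}
print(fold_events(events, 100))  # everything folded into one itemset
```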


Sequential pattern mining: Cases and Parameters (2)

Time interval, int, between events in the discovered pattern

• int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
  – Ex. “Find frequent patterns occurring in consecutive weeks”

• min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
  – Ex. “If a person rents movie A, it is likely she will rent movie B within 30 days” (int ≤ 30)

• int = c ≠ 0: find patterns carrying an exact interval
  – Ex. “Every time when Dow Jones drops more than 5%, what will happen exactly two days later?” (int = 2)
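All three interval cases reduce to one gap check over the timestamps of the matched events. A minimal sketch; the `gaps_ok` helper is illustrative, not from any particular system.

```python
def gaps_ok(times, min_int=0, max_int=float('inf')):
    """Check the interval constraint between consecutive matched events:
    every gap must satisfy min_int <= gap <= max_int. Setting
    min_int = max_int = c demands an exact interval of c; the defaults
    allow any non-negative gap."""
    return all(min_int <= b - a <= max_int
               for a, b in zip(times, times[1:]))

# "rents movie A ... rents movie B within 30 days"  (int <= 30)
print(gaps_ok([3, 21], max_int=30))           # → True: gap of 18 days
print(gaps_ok([3, 40], max_int=30))           # → False: gap of 37 days
print(gaps_ok([5, 7], min_int=2, max_int=2))  # → True: exact interval 2
```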


Episodes and Sequential Pattern Mining Methods

Other methods for specifying the kinds of patterns

• Serial episodes: A → B

• Parallel episodes: A & B

• Regular expressions: (A | B)C*(D → E)

Methods for sequential pattern mining

• Variations of Apriori-like algorithms


Click Streams

Client click-stream analysis is a click-by-click view of a visitor's journey (or journeys) through a web site. By viewing a click-stream report, you can follow the exact pathway a visitor took through a web site, even down to the length of time they spent looking at each particular page.


Click Streams…Continued

The people most interested in this report would typically be involved in marketing, web design or web development. The information presented provides a click-by-click view of how visitors are interacting with and navigating through their web site.


Periodicity Analysis

Periodicity is everywhere: tides, seasons, daily power consumption, etc.

Full periodicity

• Every point in time contributes (precisely or approximately) to the periodicity

Partial periodicity: A more general notion

• Only some segments contribute to the periodicity
  – Jim reads the NY Times 7:00–7:30 am every weekday

Cyclic association rules

• Associations which form cycles

Methods

• Full periodicity: FFT, other statistical analysis methods

• Partial and cyclic periodicity: Variations of Apriori-like mining methods
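Partial periodicity can be illustrated with a crude confidence measure: the fraction of periods in which a given symbol occurs at a fixed offset. This is a toy sketch with made-up data, not one of the Apriori-style methods mentioned above.

```python
def partial_periodicity(seq, period, offset, symbol):
    """Fraction of complete periods in which `symbol` occurs at
    position `offset` within the period -- a crude confidence measure
    for a partial periodic pattern."""
    hits = total = 0
    for start in range(0, len(seq) - period + 1, period):
        total += 1
        hits += seq[start + offset] == symbol
    return hits / total if total else 0.0

# 'n' = reads the NY Times, '*' = anything else; period of 7 "days".
week = list("n**n***" "n******" "n**n*n*")
print(partial_periodicity(week, 7, 0, 'n'))  # → 1.0: 'n' at offset 0
                                             #   in every period
```

Only the offset-0 position contributes fully to the periodicity here; the rest of each period is irrelevant, which is exactly what distinguishes partial from full periodicity.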


MD Sequence Database

P = (*, Chicago, *, <bf>) matches tuples 20 and 30. Given support threshold min_sup = 2, P is an MD sequential pattern.


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>
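Matching an MD pattern against this table can be sketched directly: every non-'*' dimension must agree, and the pattern's sequence must be contained in the row's sequence. Itemsets are modeled as sets and the helper names (`contains`, `matches`) are illustrative.

```python
def contains(pattern, sequence):
    """pattern is a subsequence of sequence (itemsets as sets, in order)."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def matches(md_pattern, row):
    """A row matches an MD pattern when every non-'*' dimension value
    agrees and the row's sequence contains the pattern's sequence."""
    *dims, pat_seq = md_pattern
    *row_dims, row_seq = row
    return (all(p == '*' or p == v for p, v in zip(dims, row_dims))
            and contains(pat_seq, row_seq))

db = [
    ('Business', 'Boston', 'Middle', [{'b', 'd'}, {'c'}, {'b'}, {'a'}]),        # 10
    ('Professional', 'Chicago', 'Young', [{'b', 'f'}, {'c', 'e'}, {'f', 'g'}]), # 20
    ('Business', 'Chicago', 'Middle', [{'a', 'h'}, {'a'}, {'b'}, {'f'}]),       # 30
    ('Education', 'New York', 'Retired', [{'b', 'e'}, {'c', 'e'}]),             # 40
]

P = ('*', 'Chicago', '*', [{'b'}, {'f'}])   # the pattern (*,Chicago,*,<bf>)
print(sum(matches(P, row) for row in db))   # → 2: tuples 20 and 30
```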

Mining of MD Seq. Pat.

Embedding MD information into sequences
• Using a uniform seq. pat. mining method

Integration of seq. pat. mining and MD analysis method


UniSeq

Embed MD information into sequences


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Mine the extended sequence database using sequential pattern mining methods

cid MD-extension of sequences

10 <(Business,Boston,Middle)(bd)cba>

20 <(Professional,Chicago,Young)(bf)(ce)(fg)>

30 <(Business,Chicago,Middle)(ah)abf>

40 <(Education,New York,Retired)(be)(ce)>
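The UniSeq transformation above amounts to prepending the dimension values as an artificial first element of each sequence. A minimal sketch; `uniseq_extend` is an illustrative name.

```python
def uniseq_extend(row):
    """UniSeq: embed the dimension values as an extra first 'itemset'
    so that a plain sequential pattern miner can treat them as
    ordinary items."""
    *dims, seq = row
    return [set(dims)] + seq

db = [
    ('Business', 'Boston', 'Middle', [{'b', 'd'}, {'c'}, {'b'}, {'a'}]),
    ('Professional', 'Chicago', 'Young', [{'b', 'f'}, {'c', 'e'}, {'f', 'g'}]),
]
for row in db:
    # First element carries the dimension values, e.g.
    # {'Business', 'Boston', 'Middle'}, followed by the original sequence.
    print(uniseq_extend(row))
```

In practice the dimension values would be tagged (e.g. 'city=Chicago') so they cannot collide with ordinary items; plain sets are used here for brevity.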

Mine Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns

• <a>, <b>, <c>, <d>, <e>, <f>

Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets:

• The ones having prefix <a>;

• The ones having prefix <b>;

• …

• The ones having prefix <f>

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


Find Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>

• <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

• Further partition into 6 subsets
  – Having prefix <aa>
  – …
  – Having prefix <af>

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


Completeness of PrefixSpan

SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>

From the sequence database (SDB), find the length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>.

Having prefix <a> gives the <a>-projected database
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
and from it the length-2 sequential patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, which are mined in turn via the <aa>-projected database, …, the <af>-projected database.

Having prefix <b>, <c>, …, <f> likewise gives the <b>-projected database, …, the <f>-projected database, so every sequential pattern falls into exactly one branch of the recursion.

Efficiency of PrefixSpan

No candidate sequence needs to be generated

Projected databases keep shrinking

Major cost of PrefixSpan: constructing projected databases

• Can be improved by bi-level projections
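The projection idea can be condensed into a short recursive sketch. For brevity this version handles sequences of single items only, ignoring itemset elements such as <(ab)>, so it illustrates prefix projection rather than the full PrefixSpan algorithm.

```python
from collections import Counter

def prefixspan(db, min_sup):
    """Stripped-down PrefixSpan over sequences of single items.
    Returns (pattern, support) pairs; no candidate sequences are ever
    generated, and each projected database only shrinks."""
    results = []

    def mine(prefix, projected):
        # Count each item once per sequence in the projected database.
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            pattern = prefix + (item,)
            results.append((pattern, sup))
            # Project: keep the suffix after the first occurrence of item.
            newdb = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, newdb)

    mine((), db)
    return results

db = [list("abcacd"), list("acbc"), list("abcd")]
for pat, sup in prefixspan(db, min_sup=3):
    print(pat, sup)
```

Running this on the three toy sequences finds, among others, the pattern ('a', 'b', 'c') with support 3.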


Mining MD-Patterns

The dimension combinations form a lattice, processed BUC-style (bottom-up cube):

(*,*,*)
(cust-grp,*,*)   (*,city,*)   (*,*,age-grp)
(cust-grp,city,*)   (cust-grp,*,age-grp)   (*,city,age-grp)
(cust-grp,city,age-grp)

cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Dim-Seq

First find MD-patterns (by BUC processing of the dimension lattice)
• E.g. (*,Chicago,*)

Form the projected sequence database
• <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)

Find seq. pat. in the projected database
• E.g. (*,Chicago,*,<bf>)


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Seq-Dim

Find sequential patterns
• E.g. <bf>

Form the projected MD-database
• E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>

Mine MD-patterns
• E.g. (*,Chicago,*,<bf>)


cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>
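The middle step of Seq-Dim, forming the projected MD-database for a given sequential pattern, can be sketched as follows (itemsets as sets; `contains` and `project_md` are illustrative names):

```python
def contains(pattern, sequence):
    """pattern is a subsequence of sequence (itemsets as sets, in order)."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def project_md(db, seq_pattern):
    """Seq-Dim, step 2: collect the dimension tuples of every row whose
    sequence contains seq_pattern -- the projected MD-database."""
    return [row[:-1] for row in db if contains(seq_pattern, row[-1])]

db = [
    ('Business', 'Boston', 'Middle', [{'b', 'd'}, {'c'}, {'b'}, {'a'}]),        # 10
    ('Professional', 'Chicago', 'Young', [{'b', 'f'}, {'c', 'e'}, {'f', 'g'}]), # 20
    ('Business', 'Chicago', 'Middle', [{'a', 'h'}, {'a'}, {'b'}, {'f'}]),       # 30
    ('Education', 'New York', 'Retired', [{'b', 'e'}, {'c', 'e'}]),             # 40
]

md_db = project_md(db, [{'b'}, {'f'}])   # the sequential pattern <bf>
print(md_db)
# → [('Professional', 'Chicago', 'Young'), ('Business', 'Chicago', 'Middle')]
```

Mining this projected MD-database with min_sup = 2 would then yield (*,Chicago,*), i.e. the MD sequential pattern (*,Chicago,*,<bf>).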

Scalability Over Dimensionality


Scalability Over Cardinality


Scalability Over Support Threshold


Scalability Over Database Size


Pros & Cons of Algorithms

Seq-Dim is efficient and scalable
• Fastest in most cases

UniSeq is also efficient and scalable
• Fastest with low dimensionality

Dim-Seq has poor scalability


Conclusions

MD seq. pat. mining is interesting and useful

Mining MD seq. pat. efficiently
• UniSeq, Dim-Seq, and Seq-Dim

Future work
• Applications of sequential pattern mining
