
Uncertain Sequence Data:

Algorithms and Applications

James Bailey

The University of Melbourne

ALSIP 2014

Relationship to ALSIP

• We’ll be looking at data mining for sequential data, where the elements of the sequence are uncertain

• Fits the theme of the workshop

– Challenges in designing efficient algorithms for this scenario

– Uncertainty can be viewed as a form of succinctness or compression

– Applications of uncertain sequence models for text, data streams and spatio-temporal scenarios

Talk Outline

• Background

• Uncertain Data Models

• Challenges

• Related Work

• Mining Probabilistic Spatio-Temporal Sequential Patterns

• Matching Substrings Over Uncertain Sequences

• Future Directions

Background – data uncertainty

• Sources of data uncertainty

– Incompleteness of data sources

– Artificial noise in privacy-sensitive applications

– Uncertainty arising from imprecision in measurements and observations

Model         Year Made   Kilometers   Transmission   Body Type
Honda Civic   2004        -------      Auto           Sedan
Mazda 3       2002        63,357       Manual         -------

Name       Marital Status   Occupation
Jim Ross   *****            Engineer

[Figure: mobile and satellite measurements as sources of uncertainty]

Data uncertainty

• Uncertainty due to compression

• Given a collection of certain sequences, summarise/collapse them into an uncertain sequence

– Sequence 1:  A  A  A    A  A    B

actually:

– Sequence 1:  A  A  A    A  B    C

– Sequence 2:  A  A  B    A  B    A

– Sequence 3:  A  A  A    A  A    A

– Summary:     A  A  A|B  A  A|B  A|C

• E.g.

– Compress a set of trajectories

– Consensus description for a group of proteins

– Summarise a group of time series

Applications

• Applications with data uncertainty

– Trajectory data analysis

– Bioinformatics (DNA and protein comparison)

– Web querying

– Text recognition

[Figure: DNA read uncertainty, A, C, G or T?]

Applications

• Text mining

– Given a noisy stream of speech being parsed by a machine

• word1 word2 word3 word4 word5 word6 ………

• There may be uncertainty about each word, e.g. was word3 “likes”, or was it “strikes”, or was it “spikes”?

• We wish to compute the probability that the stream contains the query phrase “more strikes”

• Example query: what is the probability that the (certain) sequence query = CO is a substring?

     C    O    C    A    C    O    L    A
C   0.4  0.1  0.4  0    0.4  0.1  0    0
G   0.3  0.1  0.3  0    0.3  0.1  0    0
O   0.3  0.7  0.3  0    0.3  0.7  0    0
Q   0    0.1  0    0    0    0.1  0    0
L   0    0    0    0    0    0    1.0  0
A   0    0    0    1.0  0    0    0    1.0
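Under possible-world semantics, this query can be answered by enumerating every world, which is only feasible for tiny sequences but makes the semantics concrete. A minimal sketch, using the per-position distributions from the table above:

```python
from itertools import product

# Per-position character distributions from the table above
# (the uncertain sequence whose most likely string is "COCACOLA").
dist = [
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"C": 0.1, "G": 0.1, "O": 0.7, "Q": 0.1},
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"A": 1.0},
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"C": 0.1, "G": 0.1, "O": 0.7, "Q": 0.1},
    {"L": 1.0},
    {"A": 1.0},
]

def substring_probability_bruteforce(dist, query):
    """Sum the probability of every possible world that contains `query`."""
    total = 0.0
    for world in product(*(d.items() for d in dist)):
        if query in "".join(ch for ch, _ in world):
            p = 1.0
            for _, pr in world:
                p *= pr
            total += p
    return total

print(substring_probability_bruteforce(dist, "CO"))  # → 0.5032 (up to rounding)
```

The same value follows from inclusion-exclusion over the positions where "CO" can start: 0.28 + 0.03 + 0.28 − 0.28·0.28 − 0.03·0.28 = 0.5032.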

Applications

• Linear spatial anomalies

– Each person has a probability of being a “bad guy”.

– What is the probability that 3 bad guys are together in a row?
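The “in a row” query can be answered with a small dynamic program over the length of the current run of bad guys, assuming independence between people. A sketch (the probabilities passed in are made up for illustration):

```python
def prob_k_in_a_row(p, k=3):
    """P(at least k consecutive 'bad' outcomes), where p[i] is the
    probability that person i is bad, independently of the others.

    state[r] = P(trailing run of bad people has length exactly r,
                 and no run of length k has occurred yet).
    """
    state = [1.0] + [0.0] * (k - 1)
    hit = 0.0  # accumulated probability that a run of k has occurred
    for pi in p:
        new = [0.0] * k
        new[0] = sum(state) * (1 - pi)   # person is good: run resets
        for r in range(k - 1):
            new[r + 1] = state[r] * pi   # run grows by one
        hit += state[k - 1] * pi         # run reaches length k
        state = new
    return hit

print(prob_k_in_a_row([0.5, 0.5, 0.5]))  # → 0.125 (all three must be bad)
```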

Applications: Bioinformatics

• A motif (protein) can be represented as an uncertain sequence

• Given an uncertain sequence s1 and an uncertain sequence s2, how similar are s1 and s2?

• Compute all possible k-mers, and use these k-mers as a feature space to compare the similarity of s1 and s2.

     1    2    3    4
A   0.2  0.5  0.1  0.7
C   0.3  0.1  0.2  0.3
T   0.4  0.1  0.2  0
G   0.1  0.3  0.5  0

Vector representation

AA    AC    AT    AG    CA    ..
0.4   0.2   0.9   0.01  0.33  ..
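Under the usual assumption that positions are independent, the expected count of each k-mer follows by linearity of expectation over the window starts. A sketch using the position matrix above (the slide's vector values are illustrative, so the numbers below are not meant to match them):

```python
from itertools import product

# Position probability matrix for an uncertain sequence (motif) of length 4.
pwm = [
    {"A": 0.2, "C": 0.3, "T": 0.4, "G": 0.1},
    {"A": 0.5, "C": 0.1, "T": 0.1, "G": 0.3},
    {"A": 0.1, "C": 0.2, "T": 0.2, "G": 0.5},
    {"A": 0.7, "C": 0.3, "T": 0.0, "G": 0.0},
]

def expected_kmer_counts(pwm, k=2, alphabet="ACGT"):
    """Expected number of occurrences of each k-mer, assuming positions
    are instantiated independently (linearity of expectation)."""
    counts = {}
    for kmer in map("".join, product(alphabet, repeat=k)):
        total = 0.0
        for start in range(len(pwm) - k + 1):
            p = 1.0
            for offset, ch in enumerate(kmer):
                p *= pwm[start + offset].get(ch, 0.0)
            total += p
        counts[kmer] = total
    return counts

vec = expected_kmer_counts(pwm)
print(vec["AA"])  # 0.2*0.5 + 0.5*0.1 + 0.1*0.7 = 0.22 (up to rounding)
```

Two uncertain sequences can then be compared by, for example, cosine similarity of their k-mer vectors.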

Examples: uncertain data

• Uncertain Transactions (frequency checking problem)

• Uncertain Trajectory Data (frequency checking with gap constraints)

• Uncertain Sequence (substring/subsequence matching problem)

time   Clusters
1      {o1:0.6, o2:1.0, o3:0.8}; {o4:0.7, o5:1.0, o6:0.8}
2      {o1:0.8, o2:1.0}; {o3:1.0, o4:0.5}; {o5:0.7, o6:0.5}
6      {o1:0.4, o2:1.0}; {o3:1.0, o4:0.9, o5:0.9, o6:0.8}

TransactionID   Itemset   Probability
1               {a,b}     0.8
2               {b,c,d}   0.7

     1    2    3    4
A   0.1  0    0.2  1.0
C   0.2  1.0  0.4  0
G   0.3  0    0.2  0
T   0.4  0    0.2  0

Frequency Checking vs.

Substring/Subsequence Matching

• Checking the frequency of an itemset I in a set of transactions

– I occurs at least three times in a transaction database

• To match a subsequence in a (longer) reference sequence

– “III” is contained in the reference sequence (at least once), where the gap is set to infinity.

Tran 1   Tran 2   Tran 3   Tran 4   Tran 5   Tran 6
  I        ¬I       I        I        ¬I       I

Treat the transaction history as an ordered list; itemset I as a character in the sequence; all other characters as ¬I.

s[1]   s[2]   s[3]   s[4]   s[5]   s[6]
 I      A      I      I      B      I
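The reduction above can be sketched in a few lines. The concrete transactions here are invented for illustration (chosen so that the I/¬I row comes out as in the example, with "-" standing in for ¬I):

```python
# Hypothetical transaction database, chosen to reproduce an I/¬I row.
transactions = [{"a", "b"}, {"c"}, {"a", "b", "c"}, {"a", "b"}, {"d"}, {"a", "b"}]
I = {"a", "b"}

# Each transaction containing itemset I becomes "I", everything else "-" (¬I).
seq = "".join("I" if I <= tx else "-" for tx in transactions)
print(seq)  # → I-II-I

# Frequency checking: I occurs at least three times in the database.
print(seq.count("I") >= 3)  # → True

# Subsequence matching with the gap set to infinity: "III" is contained in seq.
def is_subsequence(pattern, s):
    it = iter(s)
    return all(ch in it for ch in pattern)

print(is_subsequence("III", seq))  # → True
```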

Uncertain Data Models

• Expectation models

– Treat existence probabilities as weights

– Lack an indication of confidence

– Item a has an expected frequency of 1.0, but it is not possible that a has a frequency of 2.

– Item b has a lower expected frequency of 0.5, but b has a probability of 0.06 of having a frequency of 2.

TransactionID   Items
1               a:1.0, b:0.2
2               b:0.3
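Both bullet points can be checked directly; a minimal sketch mirroring the table above:

```python
# Uncertain transaction database from the table: item -> existence probability.
db = [{"a": 1.0, "b": 0.2},   # Transaction 1
      {"b": 0.3}]             # Transaction 2

def expected_frequency(db, item):
    # Expectation model: each possible occurrence is weighted by its probability.
    return sum(tx.get(item, 0.0) for tx in db)

# a can never have frequency 2: only one transaction may contain it.
# b has frequency 2 exactly when both transactions contain it.
p_b_freq_2 = db[0]["b"] * db[1]["b"]

print(expected_frequency(db, "a"))  # → 1.0
print(expected_frequency(db, "b"))  # → 0.5
print(p_b_freq_2)                   # → 0.06 (up to rounding)
```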

Uncertain Data Models cont.

• Probabilistic (confidence) models

– Possible world semantics

– More popular

• Probability that b occurs in at least one transaction: P(W1) + P(W2) + P(W3) = 0.06 + 0.24 + 0.14 = 0.44

TransactionID   Items
1               a:1.0, b:0.2
2               b:0.3

Instantiated possible worlds:

Possible World Wi   Items             Probability
W1                  T1: a,b   T2: b   0.06
W2                  T1: a     T2: b   0.24
W3                  T1: a,b   T2: -   0.14
W4                  T1: a     T2: -   0.56
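The world table can be reproduced by enumerating instantiations, a sketch under the usual assumption that items are instantiated independently across and within transactions:

```python
from itertools import product

# Uncertain transaction database: item -> existence probability.
db = [{"a": 1.0, "b": 0.2},   # T1
      {"b": 0.3}]             # T2

def possible_worlds(db):
    """Yield (world, probability): world[i] is the instantiated item set
    of transaction i. Items are instantiated independently."""
    per_tx = [list(tx.items()) for tx in db]
    flat = [pair for tx in per_tx for pair in tx]
    for mask in product([True, False], repeat=len(flat)):
        prob, world, i = 1.0, [set() for _ in db], 0
        for t, tx in enumerate(per_tx):
            for item, p in tx:
                present = mask[i]
                i += 1
                prob *= p if present else 1 - p
                if present:
                    world[t].add(item)
        if prob > 0:  # skip impossible worlds (e.g. a missing from T1)
            yield tuple(frozenset(s) for s in world), prob

p_b = sum(p for w, p in possible_worlds(db) if any("b" in tx for tx in w))
print(p_b)  # → 0.44 (= 0.06 + 0.24 + 0.14, up to rounding)
```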

Challenges

• As the size of the uncertain sequence / time span / transaction DB increases, the number of possible worlds grows exponentially.

• In the problem of substring/subsequence matching, checking pattern characteristics for uncertain data in the presence of gaps comes with extra challenges.

Possible worlds challenge

Related Work

• Uncertain frequent itemset mining (Bernecker et al. 09)

• Top-k queries in uncertain data (Hua et al. 08, Yi et al. 10)

• Notions of source-level and event-level uncertain models for sequential pattern mining (Muzammal and Raman 10)

• Mining Probabilistic Spatio-Temporal Sequential Patterns (Li et al. 13)

• Matching Substrings over Uncertain Sequences (Li et al. 14)

Related Work

• Top-k queries in uncertain data (Hua et al. 08)

– Based on a probabilistic model

– Ranking query with a probabilistic threshold p

– Returns tuples whose top-k probability values are at least p

– Three algorithms proposed:

• An exact algorithm with pruning rules: faster for a small k

• A sampling (approximation) method:

– Trade-off between accuracy and efficiency

– Generally more stable in runtime

• A Poisson approximation based method

– Better approximation as k increases

Related Work

• Uncertain frequent itemset mining (Bernecker et al. 09)

– Based on a probabilistic model (possible world semantics)

– Given a frequency threshold and a probability threshold for an itemset, the main task is to compute the frequentness probability.

– A dynamic programming approach is introduced

• Using the Poisson binomial recurrence technique

• Linear time and space complexity (assuming the frequency threshold is a constant)

Related Work

• Notions of source-level and event-level uncertain models for sequential pattern mining

– Discussed both the expectation model and the probabilistic model

– Two uncertain models for probabilistic frequentness

• Event-level uncertainty

• Source-level uncertainty

p-sequence

DXp   {a,b: 0.6}   {c,d: 0.3}
DYp   {a,b: 0.4}   {c,d: 0.7}

e-id   event   W
e1     (a,b)   X: 0.6, Y: 0.4
e2     (c,d)   X: 0.3, Y: 0.7

Matching Substrings over Uncertain Sequences

Y. Li, J. Bailey, L. Kulik and J. Pei. Efficient Matching of Substrings in Uncertain Sequences. In Proceedings of the 2014 SIAM International Conference on Data Mining (SDM), 2014.

Problem Definition

• Given a query substring q and an uncertain sequence s, our main task is to calculate the substring matching probability P(q ⊑ s).

• An example query: what is the probability that the (certain) sequence AGCTCT is a substring of s?

• No gaps are permitted in matching

• Challenge: the number of possible worlds increases exponentially with the size of the uncertain sequence s.

Possible worlds challenge

A Dynamic Programming Approach

(overview)

• Split the problem of computing the substring matching probability for a sequence of size j into sub-problems of computing the substring matching probabilities for sequences of size j − 1.

• Our approach consists of two parts:

– Backward index computation: perform a top-down scan on q to compute the backward indices (for performing backward matching).

– Dynamic programming scheme: compute the substring matching probability using a bottom-up dynamic programming scheme.

• Forward matching

• Backward matching and Tail matching

• Reset
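The forward/backward/tail-matching scheme above is specific to the paper's index. As a reference point only, the same probability can be obtained with a generic DP over the states of the query's KMP automaton; this sketch is not the paper's algorithm, and the distributions repeat the COCACOLA example from the earlier slide:

```python
def failure_function(q):
    """fail[i] = length of the longest proper border of q[:i]."""
    fail, k = [0] * (len(q) + 1), 0
    for i in range(1, len(q)):
        while k > 0 and q[i] != q[k]:
            k = fail[k]
        if q[i] == q[k]:
            k += 1
        fail[i + 1] = k
    return fail

def next_state(q, fail, state, ch):
    """KMP automaton: length of the longest matched prefix after reading ch."""
    while state > 0 and q[state] != ch:
        state = fail[state]
    return state + 1 if q[state] == ch else 0

def substring_probability(dist, q):
    """P(q ⊑ s) for an uncertain sequence given as per-position distributions.
    prob[s] = P(longest current match is s chars, no full match yet)."""
    m, fail = len(q), failure_function(q)
    prob = [0.0] * m
    prob[0] = 1.0
    matched = 0.0
    for d in dist:                       # one column of the uncertain sequence
        new = [0.0] * m
        for s, ps in enumerate(prob):
            if ps == 0.0:
                continue
            for ch, p in d.items():
                t = next_state(q, fail, s, ch)
                if t == m:
                    matched += ps * p    # q completed; this mass is absorbed
                else:
                    new[t] += ps * p
        prob = new
    return matched

dist = [
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"C": 0.1, "G": 0.1, "O": 0.7, "Q": 0.1},
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"A": 1.0},
    {"C": 0.4, "G": 0.3, "O": 0.3},
    {"C": 0.1, "G": 0.1, "O": 0.7, "Q": 0.1},
    {"L": 1.0},
    {"A": 1.0},
]
print(substring_probability(dist, "CO"))  # → 0.5032 (up to rounding)
```

The DP table has m states per column over n columns, i.e. O(m · n) entries, matching the node count quoted later for the paper's scheme.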

Forward Matching

• In the step of matching q(i) over s(j): if s[j] = q[i], then we continue to match q(i − 1) over s(j − 1).

• Example: q = AGCT; if s[6] = q[4], then we match q(3) = AGC over s(5).

s[1]   s[2]   s[3]   s[4]   s[5]   s[6]
                                    T

Backward Matching

• Move backward and match q(k) over s(j − 1).

• Example: q = ACTC (indices 1–4)

s[1]   s[2]   s[3]   s[4]   s[5]   s[6]
               T      C      T      C

s[1]   s[2]   s[3]   s[4]   s[5]   s[6]
 A      C      T      C      T      C

k = 2

Tail Matching and Reset

• Tail matching: match q(m − 1) over s(j − 1).

• Unlike backward matching, tail matching is not constrained by the gap.

• Reset: match q(m) over s(j − 1), if the conditions for the three other scenarios are false.

Example: matching AGCTCT over s(10)

• Backward matching index computation

• One pass over the query, starting from the right.

Example: matching AGCTCT over s(10)

• Computing the probability

• The algorithm computes and stores the internal results column by column in a bottom-up manner.

• m: the size of the query

• n: the size of the uncertain reference sequence

• Total #nodes computed and stored: (n − m + 1) · m = O(m · n)

Time Complexity and Space Complexity

Experiments

• Scalability

Mining Probabilistic Spatio-Temporal Sequential Patterns

Y. Li, J. Bailey, L. Kulik, and J. Pei. Mining Probabilistic Frequent Spatio-Temporal Sequential Patterns with Gap Constraints from Uncertain Databases. In Proceedings of the 2013 IEEE International Conference on Data Mining (ICDM), 2013.

Background

• Spatio-temporal (sequential) patterns in certain data

– Flocks (Vieira et al. 09)

– Convoys (Jeung et al. 10)

– Swarms (Li et al. 11)

• A minimum number of moving objects O stay together for a minimum number of (consecutive) timestamps T.

– Minimum number of objects: |O| ≥ mino

– Minimum number of timestamps: |T| ≥ mint

– Maximum gap constraint: ⊔T ≤ g

t1   t2   t3   t4   t5   t6
✓         ✓              ✓

T = {t1, t3, t6},   ⊔T = 2

Examples: uncertain data

• Uncertain Trajectory Data (frequency checking with gap constraints)

time   Clusters
1      {o1:0.6, o2:1.0, o3:0.8}; {o4:0.7, o5:1.0, o6:0.8}
2      {o1:0.8, o2:1.0}; {o3:1.0, o4:0.5}; {o5:0.7, o6:0.5}
6      {o1:0.4, o2:1.0}; {o3:1.0, o4:0.9, o5:0.9, o6:0.8}

Example

• Parameters: mino = 2, mint = 3, g = 1

t1        t2   t3        t4   t5        t6
{o1,o2}        {o1,o2}        {o1,o2}

T = {t1, t3, t5}:  ⊔T = 1 ≤ g

t1        t2   t3        t4   t5   t6
{o1,o2}        {o1,o2}             {o1,o2}

T = {t1, t3, t6}:  ⊔T = 2 > g

Data Uncertainty in Location Data

• Location is represented by a probability-density function

• Whether objects O stay together at t is probabilistic.

– The co-occurrence of objects at T is described by a discrete probability-distribution function.

[Figure: location given as a probability density over lon/lat, f(x) = ax + b]

P(T = {t1, t2}) = 0.2,   P(T = {t1}) = 0.5,   P(T = ∅) = 0.2,   P(T = {t2}) = 0.1

• The main computational challenge is to calculate the frequentness probability:

o the probability that a pattern O satisfies the minimum #timestamps threshold mint and the maximum gap constraint g.

• The problem is to find all patterns that satisfy the probability threshold.

• Challenge: the number of possible worlds increases exponentially with both the object space and the time space.

[Figure: a timeline t1 … t8 with occurrence marks, annotated with g, |TS| and mint]

Problem Definition
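To make the problem statement concrete, here is a brute-force possible-worlds check. It is exponential in |TS|, so it only serves as a tiny reference implementation; p[t] is the per-timestamp co-occurrence probability of a fixed object set O, assumed independent across timestamps:

```python
from itertools import product

def frequentness_probability_st(p, mint, g):
    """P(there is a set T of occurrence timestamps with |T| >= mint whose
    consecutive gaps are all <= g), by enumerating possible worlds.

    p[t]: probability that the object set co-occurs at timestamp t.
    """
    total = 0.0
    for world in product([False, True], repeat=len(p)):
        prob = 1.0
        for occurs, pt in zip(world, p):
            prob *= pt if occurs else 1 - pt
        occ = [t for t, o in enumerate(world) if o]
        # A valid T exists iff some run of mint consecutive occurrences
        # has every gap (number of skipped timestamps) at most g.
        ok = any(
            all(occ[j + 1] - occ[j] - 1 <= g for j in range(i, i + mint - 1))
            for i in range(len(occ) - mint + 1)
        )
        if ok:
            total += prob
    return total

# Three timestamps, each with co-occurrence probability 0.5; we need two
# adjacent occurrences (mint = 2, g = 0): worlds TTT, TTF, FTT qualify.
print(frequentness_probability_st([0.5, 0.5, 0.5], mint=2, g=0))  # → 0.375
```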

A Dynamic Programming Approach

(overview)

• Splitting the problem of computing the frequentness probability at the first j timestamps into subproblems of computing frequentness probabilities at the first j − 1 timestamps.

• Question: what are the constraints on the patterns of the subproblems?

o It depends on whether O occurs at tj

• Two constraints need to be considered:

o Minimum number of timestamps threshold mint

o Maximum gap constraint g

A Dynamic Programming Approach

• The frequentness probability P≥i,j^g(O) at Tj = {t1 … tj} is computed from the frequentness probabilities at Tj−1

• T = the timestamps at which O occurs within Tj

• T′ = the timestamps at which O occurs within Tj−1

• Question: what constraints on T′ ensure |T| ≥ i and ⊔T ≤ g?

• For the minimum #timestamps threshold i:

o if O @ tj, O must occur at i − 1 or more timestamps of Tj−1, i.e. |T′| ≥ i − 1

o otherwise, O must occur at i or more timestamps of Tj−1, i.e. |T′| ≥ i

A Dynamic Programming Approach

• Tail gap: ∨T,j = j − m, where tm is the last timestamp in T

• For the gap constraint g:

o if O @ tj, T′ must fulfill the gap constraint ⊔T′ ≤ g and the tail gap constraint ∨T′,j−1 ≤ g

o otherwise, T′ must fulfill the gap constraint ⊔T′ ≤ g

• For the tail gap constraint y:

o if O @ tj, T′ must fulfill the gap constraint ⊔T′ ≤ g and the tail gap constraint ∨T′,j−1 ≤ y

o otherwise, T′ must fulfill the gap constraint ⊔T′ ≤ g and the tail gap constraint ∨T′,j−1 ≤ y − 1

t1   t2   t3   t4   t5
✓         ✓

T = {t1, t3},   ∨T,5 = 5 − 3 = 2

Example: Tail Gap

• Computing P≥3,6^2(O)

• If O @ t6, require ⊔T′ ≤ 2 and ∨T′,5 ≤ 2

T′ = {t2, t3}:   ⊔T′ = 0,   ∨T′,5 = 2   (satisfies both constraints)

t1   t2   t3   t4   t5   t6
     ✓    ✓              ✓

T′ = {t1, t2}:   ⊔T′ = 0,   ∨T′,5 = 3   (violates the tail gap constraint; T = {t1, t2, t6})

t1   t2   t3   t4   t5   t6
✓    ✓                   ✓

• Bottom-up approach: the internal results are stored for further calculations.

• Trade-off between time complexity and space complexity.

• The internal results are calculated layer by layer.

Implementation

Example: Computing P≥3,5^1(O)

The DP table is built bottom-up, layer by layer; each node C is computed from two nodes A and B of the layer below, C = A + B.

Top layer (i = 3), the output:   P≥3,5^1   P≥3,4^1   P≥3,3^0

Internal layer 2 (i = 2):   P≥2,4^(∨1,1)   P≥2,3^(∨1,1)   P≥2,3^(∨0,1)   P≥2,2^(∨0,1)

Internal layer 1 (i = 1):   P≥1,3^(∨1,1)   P≥1,2^(∨1,1)   P≥1,2^(∨0,1)   P≥1,1^(∨0,1)

Bottom layer:   P0,2   P0,1   P0,0

height = mint + 1

width = |TS| − mint + 1

#Nodes per Internal Layer

Zooming in on internal layer 1 (i = 1):

P≥1,3^(∨1,1)   P≥1,2^(∨1,1)   P≥1,2^(∨0,1)   P≥1,1^(∨0,1)

height = g + 1

width = |TS| − mint + 1 − g

• Parallelogram in shape

• #nodes = (|TS| − mint + 1 − g) × (g + 1)

• A quadratic function of g that peaks at g = (|TS| − mint)/2

[Plot: #nodes per internal layer vs. maximum gap g]

• Total #nodes computed and stored = O(mint · g · |TS|)

• Linear time, O(|TS|), if we assume the input parameters mint and g are constants.

• If g = ∞ (i.e. g = |TS| − mint), the problem is equivalent to uncertain frequent itemset mining.

Time Complexity and Space Complexity

Experiments

Future Directions

Subsequence Matching with Arbitrary Gaps

• Substring: no gap is allowed.

• Subsequence: the gap is set to infinity.

• Subsequence matching with arbitrary gaps

– Gap constraints are imposed to relax and/or restrict the distance between two adjacent characters.

Connections and Comparison

(i) Subsequence matching with arbitrary gaps

(ii) Substring matching in uncertain data

(iii) Uncertain spatio-temporal sequential patterns

(iv) Uncertain frequent itemset mining

Future Directions

• Establishing hardness results for matching problems in uncertain sequences

• Considering richer types of queries

• Considering uncertainty in the query (in addition to uncertainty in the reference)

• Use of succinct data structures to speed up matching

• Investigation of real-world applications

Thank you!

&

Questions?