Page 1

Mining Order Preserving Submatrices (OPSMs) from data with replicates

Presenter: Chun-Kit Chui

Supervisor: Dr. Ben Kao

Page 2

Presentation Outline

Conventional Order Preserving Submatrices

Multiple-value matrix data model

Mining OPSMs from the new data model

Efficient methods – bounding techniques

Experimental evaluation

Page 3

Order Preserving Submatrix

The Order Preserving Submatrix (OPSM) problem is a pattern-based subspace clustering problem that is usually applied to mining gene expression datasets.

Gene expression dataset:

One of the goals of microarray data analysis is to group coexpressing genes into a cluster, i.e. genes that show similar changes of expression levels (up or down of expression values) under some environmental stimuli (experimental conditions).

      C1  C2  C3  C4  C5  C6  C7  C8
G1    36  32  12  19  18  42  33   8
G2    11  22  33  24  30   3   9  23
G3    14  18  48  28  38  11  33  21
G4    20  14   5  10   7  24  44  13
G5    38  25  10  24  19  39   8  22

Rows are genes, columns are experimental conditions (experimental settings), and each entry is the expression value of a gene under an experimental setting.

Page 4

Order Preserving Submatrix

A gene’s expression level may vary substantially due to its sensitivity to experimental settings, i.e. the change of expression level in response to the change of experimental condition is often considered more meaningful than its actual value.

In microarray experiments, the design of experimental conditions is often based on little knowledge of gene functions, i.e. clustering algorithms should consider a subset of conditions which maximizes the similarities among a subset of genes.

[Figure: the raw gene expression dataset (G1 to G5 under conditions C1 to C8) shown as a table and plotted as a data matrix. No obvious patterns observed.]

Page 5

Order Preserving Submatrix

Raw gene expression dataset, with the data matrix plotted: no obvious patterns observed.

Order preserving submatrix (subset of genes; reordered subset of columns, i.e. experimental conditions):

      C3  C5  C4  C2  C1  C6
G1    12  18  19  32  36  42
G4     5   7  10  14  20  24
G5    10  19  24  25  38  39

Consider a subset of experimental conditions and a subset of genes, and reorder the columns.

The expression values of these genes change in the same way in response to the change of experimental condition: they are all increasing.

Identifying this submatrix (the subset of genes, the subset of conditions and the ordering of the conditions) is particularly useful for biologists, e.g. to infer gene regulatory networks.

Page 6

Order Preserving Submatrix

Given a data matrix M with n rows and m columns, an order preserving submatrix S consists of:

A subset of rows R, e.g. R = {G1,G4,G5}.

A subset of columns C, e.g. C = {C1,C2,C3,C4,C5,C6}.

A column order constraint, s.t. the entries of all rows in R are increasingly ordered, e.g. <C3,C5,C4,C2,C1,C6>.

Mining OPSMs: find all OPSMs with |R| greater than or equal to a user-specified threshold (frequent) and |C| greater than or equal to a threshold c.

Order preserving submatrix (subset of genes; reordered subset of columns, i.e. experimental conditions):

      C3  C5  C4  C2  C1  C6
G1    12  18  19  32  36  42
G4     5   7  10  14  20  24
G5    10  19  24  25  38  39

Raw gene expression dataset:

      C1  C2  C3  C4  C5  C6  …  Cm
G1    36  32  12  19  18  42  …   8
G2    11  22  33  24  30   3  …  23
G3     3  25  31  22  11   4  …  26
G4    20  14   5  10   7  24  …  13
G5    38  25  10  24  19  39  …  22
…      …   …   …   …   …   …  …   …
Gn     …

Page 7

Mining OPSMs

Mining OPSMs can be reduced to a special case of sequential pattern mining.

Transform the data matrix into a sequence dataset by sorting each row in ascending order and replacing the entries with the corresponding column labels.

Each sequential pattern uniquely specifies an OPSM, with all the supporting sequences as the supporting rows.

A row supports an OPSM if the order constraint of the OPSM is a subsequence of the transformed column sequence of the row.

Raw data matrix:

      C1  C2  C3  C4  C5  C6  C7  C8
G1    36  32  12  19  18  42  33   8
G2    11  22  33  24  30   3   9  23
G3    14  18  48  28  38  11  33  21
G4    20  14   5  10   7  24  44  13
G5    38  25  10  24  19  39   8  22

Transformed sequence dataset:

Gene  Column sequence
G1    <8,3,5,4,2,7,1,6>
G2    <6,7,1,2,8,4,5,3>
G3    <6,1,2,8,4,7,5,3>
G4    <3,5,4,8,2,1,6,7>
G5    <7,3,5,8,4,2,1,6>

Sequential pattern (an OPSM): <3,5,4,2,1,6>
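The transformation and the support test described on this slide can be sketched in a few lines of Python. This is only an illustration using the slide's 5-by-8 example matrix; the function names (to_column_sequence, is_subsequence, supporting_rows) are made up for the sketch and are not taken from the presentation.

```python
def to_column_sequence(row):
    """Return the 1-based column labels sorted by ascending expression value."""
    return tuple(sorted(range(1, len(row) + 1), key=lambda c: row[c - 1]))


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(col in it for col in pattern)


# Example matrix from the slide (rows G1..G5, columns C1..C8).
matrix = {
    "G1": [36, 32, 12, 19, 18, 42, 33, 8],
    "G2": [11, 22, 33, 24, 30, 3, 9, 23],
    "G3": [14, 18, 48, 28, 38, 11, 33, 21],
    "G4": [20, 14, 5, 10, 7, 24, 44, 13],
    "G5": [38, 25, 10, 24, 19, 39, 8, 22],
}
sequences = {gene: to_column_sequence(row) for gene, row in matrix.items()}


def supporting_rows(pattern):
    """Rows whose column sequence contains pattern as a subsequence."""
    return [g for g, seq in sequences.items() if is_subsequence(pattern, seq)]


print(sequences["G1"])                      # (8, 3, 5, 4, 2, 7, 1, 6)
print(supporting_rows((3, 5, 4, 2, 1, 6)))  # ['G1', 'G4', 'G5']
```

The sketch assumes there are no tied values within a row, which holds for this example.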

Page 8

Mining OPSMs – Two properties

Apriori property: If the OPSM with column order constraint <a,b> is infrequent (#rows smaller than a user-specified support threshold), the OPSMs whose column order constraints contain it as a subsequence (e.g. <a,b,c>, <a,b,c,d>, <c,a,b,d>) are all infrequent.

E.g. the OPSM with column order constraint <C8,C3,C5,C4,C2> is supported by G1 only, so it is infrequent (for a support threshold of at least 2).

Adding more constraints to the OPSM will only reduce the number of supporting sequences.

OPSM <C8,C3,C5,C4,C2,C1> must be infrequent.

According to the Apriori property, we can use an iterative method to prune the search space.

Transformed sequence dataset:

Gene  Column sequence
G1    <8,3,5,4,2,7,1,6>
G2    <6,7,1,2,8,4,5,3>
G3    <6,1,2,8,4,7,5,3>
G4    <3,5,4,8,2,1,6,7>
G5    <7,3,5,8,4,2,1,6>

i.e. mine the frequent size-k OPSMs, identify the infrequent size-(k+1) OPSMs and prune them, then continue to the next (k+1) iteration.
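The Apriori property can be checked on the slide's example with a short Python sketch. The helper name support is hypothetical; the sequences are copied from the transformed sequence dataset.

```python
# Transformed sequence dataset from the slide.
sequences = {
    "G1": (8, 3, 5, 4, 2, 7, 1, 6),
    "G2": (6, 7, 1, 2, 8, 4, 5, 3),
    "G3": (6, 1, 2, 8, 4, 7, 5, 3),
    "G4": (3, 5, 4, 8, 2, 1, 6, 7),
    "G5": (7, 3, 5, 8, 4, 2, 1, 6),
}


def support(pattern):
    """Set of rows whose sequence contains pattern as a subsequence."""
    def contains(seq):
        it = iter(seq)
        return all(col in it for col in pattern)
    return {gene for gene, seq in sequences.items() if contains(seq)}


base = support((8, 3, 5, 4, 2))
extended = support((8, 3, 5, 4, 2, 1))

print(sorted(base))      # ['G1']
print(extended <= base)  # True: extending a pattern never adds supporting rows
```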

Page 9

Mining OPSMs – Two properties

Transitivity: If the column order constraint of an OPSM O1 (with supporting rows R1) is <x1,x2,…,xi,y1,y2,…,yj> and that of another OPSM O2 (with supporting rows R2) is <y1,y2,…,yj,z1,z2,…,zk>, then the intersection of R1 and R2 yields the set of supporting rows for the OPSM O3 with column order constraint <x1,x2,…,xi,y1,y2,…,yj,z1,z2,…,zk>.

E.g. OPSM <C3,C5,C7> is supported by G1 and G4, and OPSM <C5,C7,C1> is supported by G1.

Therefore OPSM <C3,C5,C7,C1> is supported by G1.

According to the Transitivity property, we can obtain the supports of size-(k+1) OPSMs from size-k frequent OPSMs without rescanning the sequence dataset.

Transformed sequence dataset:

Gene  Column sequence
G1    <8,3,5,4,2,7,1,6>
G2    <6,7,1,2,8,4,5,3>
G3    <6,1,2,8,4,7,5,3>
G4    <3,5,4,8,2,1,6,7>
G5    <7,3,5,8,4,2,1,6>

Supporting rows (as stated in the example above):

Gene  <3,5,7>  <5,7,1>  <3,5,7,1>
G1    yes      yes      yes
G2
G3
G4    yes
G5
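As a small check of the Transitivity property on this example, the following Python sketch compares the intersection of the two supporting row sets with a direct scan of the sequence dataset. The helper name support is an assumption of the sketch.

```python
# Transformed sequence dataset from the slide.
sequences = {
    "G1": (8, 3, 5, 4, 2, 7, 1, 6),
    "G2": (6, 7, 1, 2, 8, 4, 5, 3),
    "G3": (6, 1, 2, 8, 4, 7, 5, 3),
    "G4": (3, 5, 4, 8, 2, 1, 6, 7),
    "G5": (7, 3, 5, 8, 4, 2, 1, 6),
}


def support(pattern):
    """Set of rows whose sequence contains pattern as a subsequence."""
    def contains(seq):
        it = iter(seq)
        return all(col in it for col in pattern)
    return {gene for gene, seq in sequences.items() if contains(seq)}


r1 = support((3, 5, 7))       # O1 = <C3,C5,C7>
r2 = support((5, 7, 1))       # O2 = <C5,C7,C1>
merged = r1 & r2              # supporting rows of O3 by transitivity
direct = support((3, 5, 7, 1))  # supporting rows of O3 by scanning the data

print(sorted(r1), sorted(r2), sorted(merged))  # ['G1', 'G4'] ['G1'] ['G1']
print(merged == direct)                        # True
```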

Page 10

Mining OPSMs

[Diagram labels: Size-2 OPSM candidates; Subsequence Function; Frequent size-k OPSMs; OPSM-Gen; Size-(k+1) candidate OPSMs]

Start from mining size-2 OPSMs because size-1 OPSMs do not have any “orderings”.

The Subsequence function verifies the supporting rows of the candidate OPSMs.

Those OPSMs with #supporting rows greater than or equal to the support threshold are frequent; they are passed into the OPSM-Gen procedure.

According to the Apriori property, we only generate those size-(k+1) candidates with all proper subsets being frequent.

According to the Transitivity property, we can obtain the supporting rows of the size-(k+1) candidates from the size-k frequent OPSMs. No need to scan the dataset.

The algorithm terminates when no more candidates are generated.
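The loop described on this slide can be sketched as follows. This is a simplified illustration, not the presenter's implementation: it assumes every row is a permutation of the same column labels, it uses a plain dictionary instead of a tree to find joinable patterns, and it relies on the head/tail join plus the support threshold rather than explicitly checking every proper subset. The name mine_opsms and the threshold value are hypothetical.

```python
from itertools import permutations


def mine_opsms(sequences, min_support):
    """Level-wise OPSM mining sketch; returns {pattern: supporting rows}."""
    # Position of each column label within each row's sequence.
    pos = {g: {c: i for i, c in enumerate(seq)} for g, seq in sequences.items()}
    columns = sorted(next(iter(sequences.values())))

    # Size-2 candidates are verified against the data.
    frequent = {}
    for a, b in permutations(columns, 2):
        rows = frozenset(g for g, p in pos.items() if p[a] < p[b])
        if len(rows) >= min_support:
            frequent[(a, b)] = rows

    result = dict(frequent)
    while frequent:
        # Index the frequent size-k patterns by head (all but the last column).
        by_head = {}
        for pat in frequent:
            by_head.setdefault(pat[:-1], []).append(pat)
        next_level = {}
        for pat, rows in frequent.items():
            # Join pat with every pattern whose head equals pat's tail.
            for other in by_head.get(pat[1:], []):
                if other[-1] == pat[0]:
                    continue  # a column may appear only once in a pattern
                merged = rows & frequent[other]  # transitivity: no rescan
                if len(merged) >= min_support:
                    next_level[pat + other[-1:]] = merged
        result.update(next_level)
        frequent = next_level
    return result


sequences = {
    "G1": (8, 3, 5, 4, 2, 7, 1, 6),
    "G2": (6, 7, 1, 2, 8, 4, 5, 3),
    "G3": (6, 1, 2, 8, 4, 7, 5, 3),
    "G4": (3, 5, 4, 8, 2, 1, 6, 7),
    "G5": (7, 3, 5, 8, 4, 2, 1, 6),
}
result = mine_opsms(sequences, min_support=3)
print(sorted(result[(3, 5, 4, 2, 1, 6)]))  # ['G1', 'G4', 'G5']
```

With a support threshold of 3 on the five example rows, the longest pattern found is the size-6 OPSM from the earlier slides.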

Page 11

Mining OPSMs – Data Structure

In the OPSM-Gen procedure, we have the size-k frequent OPSMs table storing the column order constraints (OPSMs) and the supporting rows (bit vectors / tid-lists) of the OPSMs.

Size-3 frequent OPSMs table:

Column order constraint   Supporting rows
<a,b,c>                   1,3,5,6
…                         …

A Head-tail tree data structure was proposed to facilitate candidate generation in the OPSM-Gen procedure.

To add an OPSM <a,b,c> into the Head-tail tree, we first follow the “head” of the OPSM (i.e. <a,b>) to traverse the tree and store the OPSM in the leaf node reached, marked “H” to indicate that the leaf node was reached by following the head of the OPSM.

Then, we follow the “tail” of the OPSM (i.e. <b,c>) to traverse the tree and store the OPSM in the leaf node reached, marked “T” to indicate that the leaf node was reached by following the tail of the OPSM.

[Figure: Head-tail tree with leaf entries H <a,b,c> and T <a,b,c>, each with an OPSM ptr into the table.]

Page 12

Mining OPSMs – Data Structure

Size-3 frequent OPSMs table:

Column order constraint   Supporting rows
<a,b,c>                   1,3,5,6
<a,b,d>                   4,5,6
<c,a,b>                   1,2,3,4,5
<e,a,b>                   2,5,6,7

[Figure: Head-tail tree. One leaf node holds the entries H <a,b,c>, H <a,b,d>, T <c,a,b> and T <e,a,b>; another leaf node holds T <a,b,c>. Each entry has an OPSM ptr into the table.]

According to the Transitivity property, to generate the size-4 candidates, we can simply merge the tail OPSMs and head OPSMs within each leaf node.

Page 13

Mining OPSMs – Data Structure

A Head-tail tree data structure was proposed to facilitate candidate generation in the OPSM-Gen procedure. According to the Transitivity property, to generate the size-4 candidates, we can simply merge the tail OPSMs and head OPSMs within each leaf node.

For example, Tail <c,a,b> and Head <a,b,d> can be merged to form a new size-4 OPSM <c,a,b,d>.

Following their ptrs into the size-3 frequent OPSMs table, we can find their supporting rows (1,2,3,4,5 for <c,a,b> and 4,5,6 for <a,b,d>). By the Transitivity property, OPSM <c,a,b,d> is supported by rows 4 and 5 (the intersection of the two supporting row vectors).
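The merge within a leaf node can be illustrated with a small Python sketch over the size-3 table from these slides. As a simplification, the tree is replaced by a dictionary keyed by the path that leads to each leaf; the names leaves and candidates are hypothetical.

```python
from collections import defaultdict

# Size-3 frequent OPSMs table from the slide: pattern -> supporting rows.
table = {
    ("a", "b", "c"): {1, 3, 5, 6},
    ("a", "b", "d"): {4, 5, 6},
    ("c", "a", "b"): {1, 2, 3, 4, 5},
    ("e", "a", "b"): {2, 5, 6, 7},
}

# Each leaf is keyed by a size-2 path and holds the OPSMs that reach it
# by following their head (H) or their tail (T).
leaves = defaultdict(lambda: {"H": [], "T": []})
for opsm in table:
    leaves[opsm[:-1]]["H"].append(opsm)
    leaves[opsm[1:]]["T"].append(opsm)

# Merge every tail OPSM with every head OPSM stored in the same leaf.
candidates = {}
for leaf in leaves.values():
    for tail_opsm in leaf["T"]:
        for head_opsm in leaf["H"]:
            if head_opsm[-1] in tail_opsm:
                continue  # would repeat a column, e.g. <c,a,b> + <a,b,c>
            merged = tail_opsm + head_opsm[-1:]
            candidates[merged] = table[tail_opsm] & table[head_opsm]

print(sorted(candidates[("c", "a", "b", "d")]))  # [4, 5]
```

Besides <c,a,b,d>, the same leaf also yields <e,a,b,c> and <e,a,b,d> in this sketch; whether they are frequent depends on the support threshold, which the slide does not state.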

Page 14

The replicated data model

Multiple-value matrix

Page 15

Multiple-value matrix data model

Recently, research in microarray data analysis has shown that any single microarray output is subject to substantial variability.

[Stefan Bleuler et al, Evo Workshops 2005; R. Coombes et al, Journal of Computational Biology 2002; G. C. Tseng et al, Nucleic Acids Research. 2001; J. P. Brody et al., National Academy of Sciences.,2002]

The error of the expression values of the genes under an experimental condition can be large.

Replication is strongly supported by biologists as a straightforward approach for improving the quality of inferences made from experimental studies.

[T.-K. Jenssen et al, Nucleic Acids Research 2002 ; M.-L. T. Lee et al, PNAS 2000 ; J. Novak et al, Genomics,2002; R. Ramakrishnan et al, Nucleic Acids Research 2002]

Technical variability: measurement error of the experimental system (comparatively small).

Biological variability: natural heterogeneity among individuals (can be large).

Technical replication: repeated measurement of the same sample.

Biological replication: the practice of measuring multiple samples under the same experimental condition.

Page 16

Multiple-value matrix data model

The practice of conducting repeated experiments is stressed in much of the literature on microarray studies.

A study on the effect of repeated measures on the detection of differentially expressed genes has reported that stable results are typically not obtained until at least five biological replicates have been used. [P. Pavlidis et al, Bioinformatics 2003]

Another study on variability analysis of gene expression data suggests that at least three repeated experiments should be conducted instead of one. [M.-L. T. Lee et al, PNAS 2000]

Therefore, it is necessary to consider the data produced by the repeated experiments when analyzing gene expression data.

Page 17

Multiple-value matrix data model

With repeated experiments, the data produced by microarray experiments can be organized as a matrix in which each entry is a set of expression levels of a gene under an experimental condition.

[Figure: example multiple-value matrix with 3 genes and 8 conditions.]

There are 3 repeated experiments conducted under experimental condition (column) C1; the expression values of gene (row) G1 in the first, second and third repeated experiments (replicates) are 23, 24 and 22 respectively.

Page 18

Multiple-value matrix data model

Which OPSM does G2 support?

Expression values: <25, 26, 27, 31, 36, 37, 40, 45>

There are two enumerated column orderings deduced from G2 which conform to the column order constraint of this OPSM.

Let’s consider this set of replicates.

Expression values: <25, 26, 27, 31, 36, 37, 41, 45>

Since the expression values of G2 in columns <C6,C4,C1,C3,C5,C7,C8,C2> are increasingly ordered, we say that the OPSM with column order constraint <C6,C4,C1,C3,C5,C7,C8,C2> is one of the OPSMs possibly supported by G2.

Page 19

Multiple-value matrix data model

Which OPSM does G2 support?

Expression values: <22, 27, 30, 31, 33, 36, 43, 45>

There are six enumerated column orderings deduced from G2 which conform to the column order constraint of this OPSM.

Which OPSM does G2 conform to more?

Page 20

Scoring Model

Which OPSM, <C1,C2> or <C2,C1>, does G1 support?

From the raw dataset, we can enumerate all the possible column orderings and store them in the Enumerated column orderings table. In this case, there are 16 enumerated column orderings in total.

We define the score given by a row (gene) to an OPSM as the fraction of all the enumerated column orderings which conform to the column order constraint of the OPSM.

<C1,C2> accounts for 11 out of the 16 enumerated column orderings, so the OPSM with column order constraint <C1,C2> scores 11/16 from G1.

Enumerated column orderings counts:

         <C1,C2>  <C2,C1>
Counts   11       5
Score    11/16    5/16

Similar to conventional OPSM mining, we can transform the raw dataset to a sequence dataset.

Transformed sequence dataset:

Row    Column sequence
Row 1  <1,1,2,2,1,2,1,2>

The problem of obtaining the counts of the enumerated column ordering <C1,C2> is equivalent to obtaining the number of subsequence matches of <C1,C2> in the transformed sequence dataset.

Subsequence matches:

                      <C1,C2>  <C2,C1>
#Subsequence matches  11       5
Score                 11/16    5/16

The denominator of the score function can be calculated by multiplying the numbers of replicates of the columns involved, i.e. 4*4=16.
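The counts and scores on this slide can be reproduced with a brute-force Python sketch that enumerates position tuples in the transformed sequence. The function names count_matches and score are hypothetical, and the enumeration is deliberately naive (it mirrors the enumeration described on the slide rather than an optimized method).

```python
from fractions import Fraction
from itertools import combinations


def count_matches(sequence, pattern):
    """Count the position tuples in sequence that spell out pattern in order."""
    return sum(
        1
        for positions in combinations(range(len(sequence)), len(pattern))
        if all(sequence[p] == c for p, c in zip(positions, pattern))
    )


def score(sequence, pattern):
    """Fraction of the enumerated column orderings that conform to pattern."""
    total = 1
    for c in pattern:
        total *= sequence.count(c)  # number of replicates of column c
    return Fraction(count_matches(sequence, pattern), total)


row1 = (1, 1, 2, 2, 1, 2, 1, 2)  # transformed sequence of row 1 on the slide

print(count_matches(row1, (1, 2)), count_matches(row1, (2, 1)))  # 11 5
print(score(row1, (1, 2)), score(row1, (2, 1)))                  # 11/16 5/16
```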

Page 21

Scoring Model

To mine the OPSMs under the scoring model, each supporting row (gene) is associated with the #subsequence matches of the column order constraint (OPSM).

Size-3 frequent OPSMs table:

Column order constraint   Supporting rows (#subsequence matches)   Total score
<a,b,c>                   1(14), 3(11), 5(22), 6(63)               (14+11+22+63) / 64
<a,b,d>                   4(44), 5(36), 6(25)                      (44+36+25) / 64
<c,a,b>                   1(52), 2(42), 3(14), 4(20)               (52+42+14+20) / 64
<e,a,b>                   2(42), 5(31), 6(36), 7(13)               (42+31+36+13) / 64
…                         …                                        …

[Figure: Head-tail tree whose leaf node holds H <a,b,c>, H <a,b,d>, T <c,a,b> and T <e,a,b>, each with an OPSM ptr into the table.]

From the #subsequence matches, we can calculate the score contributed by each row (gene) and the total sum of the scores obtained for the OPSM.

Here, we use the total score as the support measure of the OPSMs. Those OPSMs with total scores over a user-specified threshold are regarded as frequent.
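The total scores in the table can be worked out with a few lines of Python. The denominator 64 is taken from the slide (it is consistent with 3 columns of 4 replicates each, which is an assumption here), and the threshold MIN_SCORE is a hypothetical value chosen only for illustration.

```python
from fractions import Fraction

# #subsequence matches per supporting row, copied from the slide's table.
table = {
    "<a,b,c>": {1: 14, 3: 11, 5: 22, 6: 63},
    "<a,b,d>": {4: 44, 5: 36, 6: 25},
    "<c,a,b>": {1: 52, 2: 42, 3: 14, 4: 20},
    "<e,a,b>": {2: 42, 5: 31, 6: 36, 7: 13},
}
ORDERINGS_PER_ROW = 64  # denominator used on the slide


def total_score(matches_by_row):
    """Sum of the per-row scores of one OPSM."""
    return Fraction(sum(matches_by_row.values()), ORDERINGS_PER_ROW)


MIN_SCORE = Fraction(3, 2)  # hypothetical user-specified threshold
for opsm, matches in table.items():
    s = total_score(matches)
    print(opsm, s, float(s), s > MIN_SCORE)

frequent = [opsm for opsm, matches in table.items() if total_score(matches) > MIN_SCORE]
```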

Page 22

Mining OPSMs from multiple-value matrix

An example raw dataset with 3 conditions (columns), where each condition has 4 repeated experiments (replicates).

We transform the raw dataset into a sequence dataset by sorting the entries in ascending order and replacing the entries with their condition (column) IDs.

Transformed sequence dataset:

Row    Column sequence
Row 1  <1,1,2,3,2,3,1,2,1,3,2,3>

From the row 1 sequence, we find that <C1,C2> has 11 subsequence matches in row 1. Similarly, <C2,C3> has 10 subsequence matches in row 1.

Subsequence matches:

                      <C1,C2>  <C2,C3>
#Subsequence matches  11       10
Score                 11/16    10/16

Page 23

Mining OPSMs from multiple-value matrix

After obtaining the total scores of the two OPSMs, we find that they are frequent. Therefore, they are stored in the size-2 frequent OPSMs table.

Size-2 frequent OPSMs table:

Column order constraint   Supporting rows (#subsequence matches)
<C1,C2>                   1(11), …
<C2,C3>                   1(10), …
…                         …

Page 24

Mining OPSMs from multiple-value matrix

In the OPSM-Gen procedure, OPSMs are organized in a Head-Tail tree data structure.

[Figure: Head-Tail tree with a root and child nodes 1, 2 and 3. The leaf node 2 holds the entries T <C1,C2> and H <C2,C3>, each with an OPSM ptr into the size-2 frequent OPSMs table.]

Page 25

Mining OPSMs from multiple-value matrix

According to the Transitivity property, Tail <C1,C2> and Head <C2,C3> can be merged to form a size-3 OPSM.

Recall that in conventional OPSM mining, we can deduce the support of <C1,C2,C3> from the size-2 frequent OPSMs table by intersecting the supporting rows, s.t. we do not need to rescan the dataset.

Question: Can we deduce the #subsequence matches (score) of <C1,C2,C3> in Row 1 from the size-2 frequent OPSMs table?

Page 26

Mining OPSMs from multiple-value matrix

[Figure: Enumerated ordering tables (size-2) for row 1.]

Essentially, obtaining the #subsequence matches of <C1,C2,C3> is equivalent to performing a join on the column C2 of the two tables.

Since the joining information cannot be deduced from the counts (#subsequence matches), we cannot obtain the #subsequence matches of <C1,C2,C3> without revisiting the sequence dataset.

Question: Can we materialize these tables to facilitate the joining?

Page 27

Mining OPSMs from multiple-value matrix

[Diagram labels: Size-2 OPSM candidates; Subsequence Function; Frequent size-k OPSMs; OPSM-Gen; Size-(k+1) candidate OPSMs]

Combinatorial explosion of the number of candidates.

Unlike conventional OPSM mining, we have to revisit the dataset to obtain the #subsequence matches for the candidates.

Obtaining the #subsequence matches of a candidate requires enumerating the column orderings, whose number is exponential in the size of the candidate OPSM. The same process has to be repeated for all rows.

Reduce the #candidates through some bounding techniques.

Organize the candidates in a prefix tree and verify the #subsequence matches in a single scan over the dataset.

Compress the sequence dataset to reduce the effort for obtaining the #subsequence matches (tree traversal).

Page 28

Min upper bound

Motivating questions: assume the #replicates of all columns is 4.

We have 11 subsequences <C1,C2> in row 1, and there are 4 “C3”s, so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 44. Here we assume all the 4 “C3”s are on the right of the 11 subsequences <C1,C2>; therefore we guess the maximum possible #subsequence matches of <C1,C2,C3> is 11*4 = 44.

We have 10 subsequences <C2,C3> in row 1, and there are 4 “C1”s, so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 40.

Therefore, the upper bound of the possible #subsequence matches of <C1,C2,C3> in row 1 is 40.

This is an upper bound of the #subsequence matches of <C1,C2,C3> in row 1. If we apply this bound on all the rows, we can obtain an upper bound of the score of an OPSM.
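The arithmetic of the min upper bound is small enough to write down directly. In this Python sketch the function name min_upper_bound and its parameter names are hypothetical; the numbers are the ones used on the slide.

```python
def min_upper_bound(tail_matches, head_matches, replicates_last, replicates_first):
    """Upper bound on the #subsequence matches of the merged OPSM in one row.

    tail_matches:     matches of the tail OPSM, e.g. <C1,C2>
    head_matches:     matches of the head OPSM, e.g. <C2,C3>
    replicates_last:  #replicates of the column appended on the right (C3)
    replicates_first: #replicates of the column prepended on the left (C1)
    """
    bound_from_tail = tail_matches * replicates_last
    bound_from_head = head_matches * replicates_first
    return min(bound_from_tail, bound_from_head)


print(min_upper_bound(11, 10, 4, 4))  # 40
```

For row 1 of the running example the bound is loose: the exact #subsequence matches of <C1,C2,C3> worked out later in the slides is 24.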

Page 29

HT arrays

Transformed sequence dataset:

Row    Column sequence
Row 1  <1, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3>

Construct a T-array for the tail OPSM <C1,C2>.

How many “C2”s after the 1st “C1”? 4

How many “C2”s after the 2nd “C1”? 4

T-array for <C1,C2>:

Slot   1  2  3  4
Value  4  4  2  1

Page 30

HT arrays

H-array for the head OPSM <C2,C3>:

How many “C3”s after the 1st “C2”? 4

How many “C3”s after the 2nd “C2”? 3

Slot   1  2  3  4
Value  4  3  2  1

Page 31

HT arrays

Transformed sequence dataset, row 1: <1, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3>

Slot                 1  2  3  4
T-array for <C1,C2>  4  4  2  1
H-array for <C2,C3>  4  3  2  1

With these two arrays, we can deduce the #subsequence matches of <C1,C2,C3>.

Page 32

HT arrays

There are 4 “C2”s after the 1st “C1”.

There are 4 “C3”s after the 1st “C2”.

So we can conclude that there are 4 <C1,C2,C3> orderings formed by the 1st C1 and the 1st C2.

# Subsequence matches of <C1,C2,C3> = 4

Page 33

HT arrays

There are 4 “C2”s after the 1st “C1”.

There are 3 “C3”s after the 2nd “C2”.

Therefore we can conclude that there are 3 <C1,C2,C3> orderings formed by the 1st C1 and the 2nd C2.

# Subsequence matches of <C1,C2,C3> = 4 + 3

Page 34

HT arrays

There are 4 “C2”s after the 1st “C1”.

There are 2 “C3”s after the 3rd “C2”.

# Subsequence matches of <C1,C2,C3> = 4 + 3 + 2

Page 35

HT arrays

There are 4 “C2”s after the 1st “C1”.

There is 1 “C3” after the 4th “C2”.

# Subsequence matches of <C1,C2,C3> = 4 + 3 + 2 + 1

Page 36

HT arrays

Similarly for the 2nd “C1”: there are 4 “C2”s after the 2nd “C1”.

# Subsequence matches of <C1,C2,C3> = 4 + 3 + 2 + 1

Page 37

HT arrays

Similarly for the 2nd “C1”: there are 4 “C2”s after the 2nd “C1”.

So we can sum all the 4 entries of the H-array.

# Subsequence matches of <C1,C2,C3> = 4 + 3 + 2 + 1 + 10

Page 38

HT arrays

There are 2 “C2”s after the 3rd “C1”. Which slots of the H-array should we sum up?

# Subsequence matches of <C1,C2,C3> = 4 + 3 + 2 + 1 + 10

Page 39

HT arrays

There are 2 “C2”s after the 3rd “C1”. Which slots of the H-array should we sum up?

Since there are only 2 “C2”s after the 3rd “C1”, the 2 “C2”s must be the 3rd and 4th “C2”s. Otherwise, T-array[3] would not be 2.

# Subsequence matches of <C1,C2,C3> = 4 + 3 + 2 + 1 + 10 + 2 + 1

Page 40

HT arrays

Finally, there is 1 “C2” after the 4th “C1”.

Since there is only 1 “C2” after the 4th “C1”, the “C2” must be the 4th “C2”. Otherwise, T-array[4] would not be 1.

# Subsequence matches of <C1,C2,C3> = 4 + 3 + 2 + 1 + 10 + 2 + 1 + 1 = 24

Finally, we can deduce that the #subsequence matches of <C1,C2,C3> from row 1 is 24.
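The walkthrough on the preceding slides can be condensed into a short Python sketch: build the two arrays from row 1, combine them slot by slot, and compare with a brute-force count. The function names (ht_array, matches_from_ht, brute_force_matches) are hypothetical, and the sketch only covers the size-3 case shown in the example.

```python
from itertools import combinations


def ht_array(sequence, first, second):
    """Slot i holds the number of `second` labels after the i-th `first` label."""
    return [
        sum(1 for later in sequence[i + 1:] if later == second)
        for i, label in enumerate(sequence)
        if label == first
    ]


def matches_from_ht(t_array, h_array):
    """Combine the T-array of <C1,C2> with the H-array of <C2,C3>."""
    total = 0
    for c2_after in t_array:
        # The C2s after this C1 are necessarily the last `c2_after` C2s,
        # so sum the corresponding trailing slots of the H-array.
        total += sum(h_array[len(h_array) - c2_after:])
    return total


def brute_force_matches(sequence, pattern):
    """Reference count by enumerating all increasing position tuples."""
    return sum(
        1
        for positions in combinations(range(len(sequence)), len(pattern))
        if all(sequence[p] == c for p, c in zip(positions, pattern))
    )


row1 = (1, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3)
t_array = ht_array(row1, 1, 2)  # tail OPSM <C1,C2>
h_array = ht_array(row1, 2, 3)  # head OPSM <C2,C3>

print(t_array, h_array)                    # [4, 4, 2, 1] [4, 3, 2, 1]
print(matches_from_ht(t_array, h_array))   # 24
print(brute_force_matches(row1, (1, 2, 3)))  # 24
```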

Page 41

HT arrays

Can we store the HT-arrays instead of the #subsequence matches, s.t. we don’t need to rescan the dataset in the subsequence function procedure?

Generalized HT-arrays: a T-array for the tail OPSM <C1,C2, … ,Cx-1> (slots 1 2 3 4) and an H-array for the head OPSM <C2, … ,Cx-1, Cx> (slots 1 2 3 …).

However, the number of slots of the H-array is exponential in the number of columns in the OPSMs.

Page 42

HT upper bound

Motivation: We can obtain the bound of the #subsequence matches of <C1,C2,C3> by guessing the HT-arrays from the #subsequence matches of the tail and head OPSMs.

To obtain an upper bound of the #subsequence matches of <C1,C2,C3>, we try to guess the T-array s.t. the #subsequence matches of <C1,C2,C3> is maximum. This can be done by assigning the “C1”s to the left in the column sequence as much as possible (push left).

The first slot indicates how many “C2”s are after the 1st “C1”; therefore its value cannot be larger than the #replicates of C2.

Guessed T-array for <C1,C2> (11 subsequence matches), pushed left:

Slot   1  2  3  4
Value  4  4  3  0

Page 43: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

This slot indicate how many “C2”s after the 1st “C1”,

therefore it’s value cannot be larger than #replicates of

C2.

To obtain an upper bound of the #subsequence matches of <C1,C2,C3>, we try to guess the T-array s.t. the #subsequence matches of <C1,C2,C3> is maximum.This can be done by assigning the “C1”s to the left in the column sequence as much as possible.

Motivation: We can obtain the bound of the #subsequence matches

of <C1,C2,C3> by guessing the HT-arrays from the #subsequence matches of the tail and

head OPSMs.

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

For the H-array, we assign the “C3”s to the right in the column sequence as much as possible.

4 4 3 0

Push right

Page 44: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Motivation: We can obtain the bound of the #subsequence matches

of <C1,C2,C3> by guessing the HT-arrays from the #subsequence matches of the tail and

head OPSMs.

The H-array cannot be .

If there are no C3 after the 1st C2, then there will not be any C3 after the 2nd, 3rd and 4th C2.Therefore, there is a constraint when assigning the value to the HT arrays:Array [x] >= Array [x+1].

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

For the H-array, we assign the “C3”s to the right in the column sequence as much as possible.

4 4 3 0

Push right

3 3 2 2

0 2 4 4

This slot indicate how many “C2”s after the 1st “C1”,

therefore it’s value cannot be larger than #replicates of

C2.

Page 45: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

4 4 3 0

Push right

3 3 2 2

Upper bound of the # Subsequence matches of <C1,C2,C3> =

Follow the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.

Page 46: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

4 4 3 0

Push right

3 3 2 2

Upper bound of the # Subsequence matches of <C1,C2,C3> = 3 + 3 + 2 + 2

Follow the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.

Page 47: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

4 4 3 0

Push right

3 3 2 2

Upper bound of the # Subsequence matches of <C1,C2,C3> = 3 + 3 + 2 + 2 + 3 + 3 + 2 + 2

Follow the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.

Page 48: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

4 4 3 0

Push right

3 3 2 2

Upper bound of the # Subsequence matches of <C1,C2,C3> = 3 + 3 + 2 + 2 + 3 + 3 + 2 + 2 + 3 + 2 + 2

Follow the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.

Page 49: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

4 4 3 0

Push right

3 3 2 2

Upper bound of the # Subsequence matches of <C1,C2,C3> = 3 + 3 + 2 + 2 + 3 + 3 + 2 + 2 + 3 + 2 + 2

Follow the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.

Page 50: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

HT upper bound

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

2

root

1 3

Head/Tail OPSM ptr

T <C1,C2>

H <C2,C3>

… … …

Column order

constraint

Supporting rows (#subsequence

matches )

<C1,C2> 1(11), …

<C2,C3> 1(10), …

… …

Size-2 frequent OPSMs table

Head-Tail Tree

Push left

4 4 3 0

Push right

3 3 2 2

Upper bound of the # Subsequence matches of <C1,C2,C3> = 3 + 3 + 2 + 2 + 3 + 3 + 2 + 2 + 3 + 2 + 2

= 27 The upper bound of #subsequence matches of <C1,C2,C3> from row 1 is

27.

Upper bound: 27

Follow the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.
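A minimal sketch of the two checks used on this slide: the constraint Array[x] >= Array[x+1] together with the per-slot cap, and the per-slot contributions that add up to the upper bound. The helper names are hypothetical and the arrays are the guessed ones from the example.

```python
def is_valid_ht_array(array, max_slot_value):
    """Check the cap on each slot and the constraint Array[x] >= Array[x+1]."""
    within_cap = all(0 <= v <= max_slot_value for v in array)
    non_increasing = all(a >= b for a, b in zip(array, array[1:]))
    return within_cap and non_increasing


def slot_contributions(t_array, h_array):
    """For each T-array slot with value t, sum the last t slots of the H-array."""
    n = len(h_array)
    return [sum(h_array[n - t:]) for t in t_array]


t_guess = [4, 4, 3, 0]  # "C1"s pushed left
h_guess = [3, 3, 2, 2]  # "C3"s pushed right while staying non-increasing

print(is_valid_ht_array([0, 2, 4, 4], 4))  # False
print(is_valid_ht_array(h_guess, 4))       # True
parts = slot_contributions(t_guess, h_guess)
print(parts, sum(parts))                   # [10, 10, 7, 0] 27
```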

Page 51: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

HT bounds

Upper bound: 27

Slot                           1 2 3 4
T-array <C1,C2> (push left)    4 4 3 0
H-array <C2,C3> (push right)   3 3 2 2

Lower bound: Similarly, we can obtain a lower bound of the #subsequence matches of <C1,C2,C3> by assigning C1 on the right of C2 as much as possible, and C3 on the left of C2 as much as possible.

Slot              1 2 3 4
T-array <C1,C2>   3 3 3 2
H-array <C2,C3>   4 4 2 0

Lower bound of the # Subsequence matches of <C1,C2,C3> = 6 + 6 + 6 + 2 = 20

Each of the first three T-array slots has value 3 and contributes the last three H-array slots (4 + 2 + 0 = 6); the last slot has value 2 and contributes 2 + 0 = 2.

The lower bound of the #subsequence matches of <C1,C2,C3> is 20.
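As a sanity check on both bounds, the sketch below enumerates every pair of HT-arrays that satisfies the constraints of the example (4 slots, slot values from 0 to 4, non-increasing, T-array summing to 11 and H-array summing to 10) and reports the smallest and largest combined value. It is a brute-force illustration under those assumptions, not part of the presented method, and the function names are hypothetical.

```python
from itertools import product


def combine(t_array, h_array):
    """#subsequence matches implied by a pair of HT-arrays."""
    n = len(h_array)
    return sum(sum(h_array[n - t:]) for t in t_array)


def valid_arrays(total, slots, cap):
    """All non-increasing arrays with the given slot count, cap and sum."""
    return [a for a in product(range(cap + 1), repeat=slots)
            if sum(a) == total and all(x >= y for x, y in zip(a, a[1:]))]


t_candidates = valid_arrays(11, 4, 4)
h_candidates = valid_arrays(10, 4, 4)
values = [combine(t, h) for t in t_candidates for h in h_candidates]
print(min(values), max(values))  # 20 27
```

The exact value 24 obtained from the true HT-arrays lies between the two extremes.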

Page 57: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Comparisons

Upper bound: 27

Slot                           1 2 3 4
T-array <C1,C2> (push left)    4 4 3 0
H-array <C2,C3> (push right)   3 3 2 2

Lower bound: 20

Slot              1 2 3 4
T-array <C1,C2>   3 3 3 2
H-array <C2,C3>   4 4 2 0

HT-array: 24

Slot              1 2 3 4
T-array <C1,C2>   4 4 2 1
H-array <C2,C3>   4 3 2 1

Recall that the HT-array method returns the exact #subsequence matches of <C1,C2,C3>, which is 24. However, it is not feasible to keep the HT-arrays for each candidate.

Min upper bound: 40

The Min upper bound approach returns 40 as the upper bound of the #subsequence matches of <C1,C2,C3>. Compared with the Min upper bound, the HT upper bound (27) is much tighter.
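The figure of 40 can be reproduced with the sketch below. It rests on an assumption about how the Min upper bound is defined, since the definition is not restated on this slide: the bound is taken to be the smaller of (tail matches × #replicates of the last column) and (#replicates of the first column × head matches). The function name is hypothetical.

```python
def min_upper_bound(tail_matches, head_matches, r_first, r_last):
    """Assumed Min upper bound: the smaller of the two product bounds."""
    return min(tail_matches * r_last, r_first * head_matches)


# Row 1 of the example: 11 matches of <C1,C2>, 10 matches of <C2,C3>,
# 4 replicates per column.
print(min_upper_bound(11, 10, 4, 4))  # 40
```

With these inputs the four quantities on the slide order as 20 <= 24 <= 27 <= 40.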

Page 58: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Generalized HT upper bound

Tail sequence: Tail = <C1,C2, … ,Cx-1>

Head sequence: Head = <C2, … ,Cx-1, Cx>

Generated sequence: New = <C1,C2, … ,Cx-1, Cx>

Middle sequence: Middle = <C2, … ,Cx-1>

Assume the number of replicates for column Cy is r(Cy).

Page 59: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Generalized HT upper bound

T-array of the tail sequence <C1,C2, … ,Cx-1>, slots 1 2 3 …

The qth slot of the T-array represents the number of “Middle sequence”s after the qth C1.

Therefore, the #slots of the T-array is equal to the #replicates of C1.

The maximum possible value for each slot is equal to the maximum possible #subsequence matches of the middle sequence.

Page 60: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Generalized HT upper bound

H-array of the head sequence <C2, … ,Cx-1, Cx>, slots 1 2 3 …

The qth slot of the H-array represents the number of “Cx”s after the qth “Middle sequence”.

Therefore, the #slots of the H-array is equal to the maximum possible #subsequence matches of the middle sequence.

The maximum possible value for each slot in the H-array is equal to the #replicates of Cx.
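The array sizes described on the last two slides can be computed as follows. This sketch assumes that the maximum possible #subsequence matches of the middle sequence is the product of the replicate counts of its columns (the case where every “C2” precedes every “C3”, and so on); the function name and the dictionary keys are hypothetical.

```python
from math import prod


def ht_array_shapes(replicates):
    """Sizes of the generalized HT-arrays.

    replicates: [r(C1), r(C2), ..., r(Cx)] for the generated sequence.
    """
    middle_max = prod(replicates[1:-1])  # assumed maximum for the middle sequence
    return {
        "t_slots": replicates[0],
        "t_slot_max": middle_max,
        "h_slots": middle_max,
        "h_slot_max": replicates[-1],
    }


print(ht_array_shapes([4, 4, 4]))         # every entry is 4 for <C1,C2,C3>
print(ht_array_shapes([4] * 6)["h_slots"])  # 256
```

With 4 replicates per column, the H-array of a size-6 candidate already needs 4^4 = 256 slots, which is why materializing the arrays does not scale.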

Page 61: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Generalized HT upper bound

We notice that the “push left” assignment always yields a T-array in which the first k slots are fully filled, and all the slots after the (k+1)th slot are zeros.

Let T be the #subsequence matches for the Tail sequence (i.e. 11 in the example), and let m be the maximum possible value of a slot (4 in the example).

Rule 1: The first ⌊T/m⌋ slots have value m.

e.g. The first ⌊11/4⌋ = 2 slots have value 4.

Rule 2: The (⌊T/m⌋ + 1)th slot has value T mod m.

e.g. The 3rd slot has value 11 mod 4 = 3.

Rule 3: The other slots have value zero.
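The three T-array rules can be sketched as below. The formulas on the original slides did not survive extraction, so the quotient and remainder used here are inferred from the worked example (11 matches, slot maximum 4); the function and parameter names are hypothetical.

```python
def push_left_t_array(tail_matches, slots, slot_max):
    """Guess the T-array that pushes the "C1"s as far left as possible."""
    full, rest = divmod(tail_matches, slot_max)
    array = [slot_max] * full            # Rule 1: fully filled slots
    if full < slots:
        array.append(rest)               # Rule 2: the partially filled slot
    array += [0] * (slots - len(array))  # Rule 3: remaining slots are zero
    return array


print(push_left_t_array(11, 4, 4))  # [4, 4, 3, 0]
```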

Page 65: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Generalized HT upper bound

H-array of the head sequence <C2, … ,Cx-1, Cx>, slots 1 2 3 …

Similar to the T-array, the H-array can be divided into two partitions; the values in the first partition are larger than the values in the second partition by 1.

Let H be the #subsequence matches for the Head sequence (i.e. 10 in the example), and let n be the number of slots of the H-array (4 in the example).

Rule 1: The first (H mod n) slots have value ⌊H/n⌋ + 1.

Rule 2: The other slots have value ⌊H/n⌋.

e.g. The first 10 mod 4 = 2 slots have value 3 and the other slots have value 2, which gives 3 3 2 2.
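A sketch of the two H-array partitions, again inferred from the example (10 matches spread over 4 slots gives 3 3 2 2) because the slide formulas are missing. The function name is hypothetical.

```python
def push_right_h_array(head_matches, slots):
    """Guess the H-array that pushes the last column as far right as possible
    while keeping the array non-increasing: spread the matches evenly."""
    low, extra = divmod(head_matches, slots)
    # First partition: `extra` slots holding one more than the second partition.
    return [low + 1] * extra + [low] * (slots - extra)


print(push_right_h_array(10, 4))  # [3, 3, 2, 2]
```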

Page 66: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Generalized HT upper bound

The T-array of <C1,C2, … ,Cx-1> follows Rules 1 to 3 for the Tail sequence, where T is the #subsequence matches for the Tail sequence.

The H-array of <C2, … ,Cx-1, Cx> follows Rules 1 and 2 for the Head sequence, where H is the #subsequence matches for the Head sequence.

With these rules, we can deduce a formula to calculate the upper bound without constructing these arrays.

A similar method can be applied for the HT lower bound; therefore, we do not need to materialize any of the HT-arrays.
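One possible closed form is sketched below. It is derived here from the slot rules and is not taken from the slides: each fully filled T-array slot contributes all H matches, and the partially filled slot contributes the last (T mod m) slots of the evenly spread H-array. The function name is hypothetical, and `middle_max` stands for both the T-array slot maximum and the number of H-array slots.

```python
def ht_upper_bound(tail_matches, head_matches, middle_max):
    """HT upper bound computed without materializing the HT-arrays."""
    full, rest = divmod(tail_matches, middle_max)  # T-array: full slots, partial slot
    low, extra = divmod(head_matches, middle_max)  # H-array: low value, #slots with low + 1
    low_slots = middle_max - extra                 # slots at the right end holding `low`
    # The partial slot takes the last `rest` H-array slots: `rest` times `low`,
    # plus 1 for every slot taken beyond the low-valued partition.
    partial = rest * low + max(0, rest - low_slots)
    return full * head_matches + partial


print(ht_upper_bound(11, 10, 4))  # 27
```

For the running example this gives 2 × 10 + (3 × 2 + 1) = 27, the same value as the explicit arrays 4 4 3 0 and 3 3 2 2.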

Page 67: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Compression method

Transformed Sequence Dataset

Row     Column sequence
Row 1   <1, 1, 1, 2, 1, 2, 2, 2>

Given a column sequence, we would like to find the #subsequence matches of <C1,C2> in the column sequence of row 1.

The naive method is to enumerate all the size-2 subsequences and count the occurrences of <C1,C2>, which requires enumerating 16 column orderings.

Compressed Sequence Dataset

Row     Column sequence
Row 1   <1(3), 2, 1, 2(3)>

#subsequence matches of <C1,C2> in row 1: 3*1 + 3*3 + 1*3 = 15

There are 3 “C1”s on the left of 1 “C2”, therefore there are 3*1 = 3 <C1,C2>s.

There are 3 “C1”s on the left of 3 “C2”s, therefore there are 3*3 = 9 <C1,C2>s.

There is 1 “C1” on the left of 3 “C2”s, therefore there are 1*3 = 3 <C1,C2>s.

There are 15 <C1,C2>s in total. Obtaining the #subsequence matches this way only requires enumerating 3 column orderings.
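The compression idea can be sketched with run-length encoding. This is an illustrative variant rather than the presented procedure: instead of pairing runs as on the slide (3*1 + 3*3 + 1*3), it keeps a running total of the “C1”s seen so far (3*1 + 4*3), which yields the same count. The function names are hypothetical, and the two columns are assumed to be different.

```python
from itertools import groupby


def compress(sequence):
    """Run-length encode a column sequence into (column, run length) pairs."""
    return [(col, len(list(run))) for col, run in groupby(sequence)]


def count_pair_matches(runs, first, second):
    """Count <first, second> matches, visiting each run once."""
    seen_first = 0  # occurrences of `first` to the left of the current run
    matches = 0
    for col, length in runs:
        if col == first:
            seen_first += length
        elif col == second:
            matches += seen_first * length
    return matches


row1 = [1, 1, 1, 2, 1, 2, 2, 2]
runs = compress(row1)
print(runs)                            # [(1, 3), (2, 1), (1, 1), (2, 3)]
print(count_pair_matches(runs, 1, 2))  # 15
```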

Page 70: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Experimental Evaluation

Page 71: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Experimental settings

C programming language

Machine
CPU: 2.6 GHz
Memory: 1 GB
Fedora

Dataset

Real dataset: Yeast galactose dataset
Subset of 205 genes (rows) of the yeast galactose data
20 experimental conditions (columns)
4 biological replicates per condition
Publicly available: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html

Synthetic dataset
Replicate simulation - Generate normal distributions according to the means and variances of the replicates in the real dataset, and randomly generate a new replicate value according to the distribution.
Column simulation - Generate a new column by randomly selecting an experimental condition in the real dataset and perturbing its mean and variance.
Row simulation - Generate normal distributions according to the means and variances of the replicates in the real dataset, and generate a new row according to the distributions.
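The replicate simulation step could look like the sketch below. The experiments themselves were written in C; this Python fragment only illustrates the idea, the replicate values are made up, and the use of the sample standard deviation is an assumption because the slide does not say which variance estimate was used.

```python
import random
import statistics


def simulate_replicate(replicates, rng):
    """Draw one new replicate value from a normal distribution fitted to the
    observed replicates of a single gene under a single condition."""
    mean = statistics.mean(replicates)
    std_dev = statistics.stdev(replicates)  # sample standard deviation (assumption)
    return rng.gauss(mean, std_dev)


rng = random.Random(0)                  # fixed seed so that runs are repeatable
observed = [36.0, 33.5, 38.2, 35.1]     # made-up replicate values
print(simulate_replicate(observed, rng))
```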

Page 72: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Execution time per iteration

The Brute-force approach mines the OPSMs without using any bounding techniques. All the algorithms start from mining size-2 OPSMs.

For the HT bounds, we use the HT upper bound to identify infrequent candidates, which can be pruned, and the HT lower bound to identify large OPSMs. We do not verify the #subsequence matches for those large OPSMs.

The number of candidates generated in each iteration using different bounding techniques

The HT upper bound technique reduces the #candidates by more than half in all iterations.

Page 73: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Execution time per iteration

The HT bounds + compression approach uses the HT upper and lower bounds to reduce the candidate set, and uses the compression method to reduce the cost of obtaining the #subsequence matches of the candidates.

The number of candidates generated in each iteration using different bounding techniques

Execution time in each iteration using different bounding techniques

Page 74: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Vary the support threshold

The saving from the HT upper bound decreases as the support threshold decreases, because it is harder for an upper bound to fall below the support requirement (the pruning condition) as the support requirement decreases.

Scalability test on support threshold

The saving from the lower bound increases as the support threshold decreases. As the support requirement decreases, the differences between the supports of large candidates and the support requirement increase, so those large OPSMs become more obvious and easier to identify.

Execution time saving (%) compared with the Brute force approach

The HT bounds + compression method achieves the best execution time saving.

Page 75: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Vary the #columns

Scalability test on #columns

Essentially, an increase in columns increases the number of candidates generated but NOT the cost of obtaining the #subsequence matches for the candidates.

The pruning power of the bounding techniques is largely independent of the number of columns in the dataset.

Execution time saving (%) compared with the Brute force approach

Page 76: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Vary the #replicates

Scalability test on #replicates

Execution time saving (%) compared with the Brute force approach

The saving from both the Min upper bound and the HT upper bound decreases as #replicates increases. Why?

HT upper bound: The numbers of slots of the T-array and H-array are determined by the #replicates. Essentially, the larger the arrays, the looser the bounds.

Min upper bound: 40. We have 11 subsequences <C1,C2> in row 1, and there are 4 “C3”s (#replicates), so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 11*4 = 44. In the Min upper bound, we multiply the #replicates of C3 by the #subsequences of <C1,C2>. The tightness of the Min bound is therefore also determined by the #replicates.

The saving from the HT bounds + compression method increases as #replicates increases. This is mainly due to the saving from compressing the sequence, so that the #enumerated sequences is reduced.

Page 80: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Vary the #rows

Scalability test on #rows

Page 81: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Conclusion

Single microarray output is subject to substantial variability; replication is the common practice to address this issue.

We have proposed a scoring model to mine the Order Preserving Submatrices from gene expression datasets with repeated measurements.

Mining OPSMs under the scoring model incurs a heavy computational cost (obtaining the #subsequence matches).

An HT bounding technique and a compression method are proposed to mine the OPSMs efficiently.

Experimental results show that the HT bounding technique + compression method achieves the best CPU cost saving.

Page 82: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

Things not covered in this talk

Biological evaluation of cluster quality: oPOSSIUM, Gene Ontology, ARI

Efficient method of the subsequence function
Prefix tree to organize the candidates, verify the supports through a single dataset scan.
Compression on the sequence dataset, reduce the #prefix tree traversal.
Bounding techniques

Application in other areas: Collaborative Filtering

Visualization of OPSMs

Page 83: Mining Order Preserving Submatrices (OPSMs) from data with replicates Presenter: Chun-Kit Chui Supervisor: Dr. Ben Kao.

End

Thank you!

