
Pattern Matching Algorithms: An Overview

Shoshana Neuburger
The Graduate Center, CUNY

9/15/2009


Overview

• Pattern Matching in 1D
• Dictionary Matching
• Pattern Matching in 2D
• Indexing
  – Suffix Tree
  – Suffix Array
• Research Directions


What is Pattern Matching?

Given a pattern and text, find the pattern in the text.


What is Pattern Matching?

• Σ is an alphabet.
• Input:
  Text T = t1 t2 … tn
  Pattern P = p1 p2 … pm
  where $t_i, p_i \in \Sigma$.
• Output: All i such that
  $T[i+k] = P[k+1], \quad 0 \le k < m$.


Pattern Matching - Example

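Input: P = cagc, Σ = {a,g,c,t}, T = acagcatcagcagctagcat

Output: {2,8,11}

[Figure: T = acagcatcagcagctagcat with positions numbered 1, 2, 3, …; the output positions 2, 8 and 11 are the starts of the occurrences of P]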


Pattern Matching Algorithms

• Naïve Approach
  – Compare pattern to text at each location.
  – O(mn) time.

• More efficient algorithms utilize information from previous comparisons.
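
The naïve approach can be sketched in a few lines. This is an illustrative sketch only: the function name `naive_match` is an assumption, and positions are reported 1-indexed to match the example with P = cagc.

```python
def naive_match(pattern, text):
    """Compare the pattern to the text at every location: O(mn) time."""
    m, n = len(pattern), len(text)
    hits = []
    for i in range(n - m + 1):
        k = 0
        # Extend the match character by character until a mismatch.
        while k < m and text[i + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(i + 1)  # report 1-indexed positions
    return hits


print(naive_match("cagc", "acagcatcagcagctagcat"))
```

Every alignment restarts from scratch, which is the work that the linear-time methods avoid by reusing information from previous comparisons.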


Pattern Matching Algorithms

• Linear time methods have two stages:
  1. preprocess pattern in O(m) time and space.
  2. scan text in O(n) time and space.
• Knuth, Morris, Pratt (1977): automata method
• Boyer, Moore (1977): can be sublinear


KMP Automaton

P = ababcb
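
A minimal sketch of the two KMP stages, using the common failure-table formulation rather than an explicit automaton drawing. The names `failure_table` and `kmp_search` are assumptions for illustration, and a non-empty pattern is assumed.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper border of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern, text):
    """Scan the text once; return 1-indexed occurrence positions."""
    fail = failure_table(pattern)
    hits, k = [], 0                  # k = number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)   # 0-indexed start is i-k+1; add 1
            k = fail[k - 1]          # keep going to find overlapping matches
    return hits


print(failure_table("ababcb"))
print(kmp_search("cagc", "acagcatcagcagctagcat"))
```

The text pointer never moves backwards, so the scan takes O(n) time after O(m) preprocessing.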


Dictionary Matching

• Σ is an alphabet.

• Input:
  Text T = t1 t2 … tn
  Dictionary of patterns D = {P1, P2, …, Pk}
  All characters in patterns and text belong to Σ.

• Output: All i, j such that
  $T[i+l] = P_j[l+1], \quad 0 \le l < m_j, \quad 1 \le j \le k$,
  where $m_j = |P_j|$.


Dictionary Matching Algorithms

• Naïve Approach:
  – Use an efficient pattern matching algorithm for each pattern in the dictionary.
  – O(kn) time.

More efficient algorithms process text once.


AC Automaton

• Aho and Corasick extended the KMP automaton to dictionary matching

• Preprocessing time: O(d), where $d = \sum_{j=1}^{k} |P_j|$.
• Matching time: O(n log |Σ| + k).

Independent of dictionary size!


AC Automaton

D = {ab, ba, bab, babb, bb}
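
A compact sketch of an Aho-Corasick automaton for this dictionary. It is illustrative only: `build_ac` and `ac_search` are assumed names, and transitions are stored in Python dicts (hashing) rather than the ordered branching that gives the log |Σ| factor.

```python
from collections import deque


def build_ac(patterns):
    goto = [{}]   # goto[state][char] -> child state (trie edges)
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns recognised at each state
    for j, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(j)
    # Breadth-first pass: failure links of depth-1 states stay at the root.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit shorter matches
            queue.append(t)
    return goto, fail, out


def ac_search(patterns, text):
    """Return (1-indexed start, pattern) pairs, scanning the text once."""
    goto, fail, out = build_ac(patterns)
    hits, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for j in out[s]:
            hits.append((i - len(patterns[j]) + 2, patterns[j]))
    return hits


D = ["ab", "ba", "bab", "babb", "bb"]
print(sorted(ac_search(D, "babb")))
```

On the text babb every pattern of D occurs, and all five occurrences are reported in a single left-to-right scan.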


Dictionary Matching

• KMP automaton does not depend on alphabet size while AC automaton does – branching.

• Dori, Landau (2006): AC automaton is built in linear time for integer alphabets.

• Breslauer (1995) eliminates log factor in text scanning stage.


Periodicity

A crucial task in the preprocessing stage of most pattern matching algorithms: computing periodicity.

Many forms
– failure table
– witnesses


Periodicity

• A periodic pattern can be superimposed on itself without mismatch before its midpoint.

• Why is periodicity useful?
  Can quickly eliminate many candidates for pattern occurrence.


Periodicity

Definition:
• S is periodic if $S = \rho' \rho^k$ with $k \ge 2$, and $\rho'$ is a proper suffix of $\rho$.
• S is periodic if its longest prefix that is also a suffix is at least half |S|.
• The shortest period corresponds to the longest border.


Periodicity - Example

S = abcabcabcab, |S| = 11
• Longest border of S: b = abcabcab; |b| = 8, so S is periodic.
• Shortest period of S: ρ = abc; |ρ| = 3, so S is periodic.
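
The border and period in this example can be checked with a short sketch. It reuses the KMP failure-table idea; the function names are assumptions, and a non-empty string is assumed.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail[-1]


def shortest_period(s):
    # The shortest period corresponds to the longest border.
    return len(s) - longest_border(s)


def is_periodic(s):
    # Periodic when the longest border is at least half of |S|.
    return 2 * longest_border(s) >= len(s)


S = "abcabcabcab"
print(longest_border(S), shortest_period(S), is_periodic(S))
```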


Witnesses

Popular paradigm in pattern matching:
1. find consistent candidates
2. verify candidates

consistent candidates → verification is linear


Witnesses

• Vishkin introduced the duel to choose between two candidates by checking the value of a witness.

• Alphabet-independent method.


Witnesses

Preprocess pattern:
• Compute witness for each location of self-overlap.
• Size of witness table:
  |ρ|, if P is periodic,
  m/2, otherwise.


Witnesses

• WIT[i] = any k such that P[k] ≠ P[k-i+1].
• WIT[i] = 0, if there is no such k.

k is a witness against i being a period of P.

Example:

Position        1 2 3 4
Pattern         a a a b
Witness Table   0 4 4 4
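
The definition of WIT can be turned into a brute-force sketch that reproduces the table for aaab. This quadratic loop only illustrates the definition, not the linear-time preprocessing; `witness_table` is an assumed name, and it picks the smallest valid k where the definition allows any.

```python
def witness_table(p):
    """Return [WIT[1], ..., WIT[m]] for pattern p (0 means no witness)."""
    m = len(p)
    wit = [0] * (m + 1)              # wit[0] is unused so indices match the slide
    for i in range(1, m + 1):
        for k in range(i, m + 1):
            # 1-indexed P[k] vs P[k-i+1] is p[k-1] vs p[k-i] in Python.
            if p[k - 1] != p[k - i]:
                wit[i] = k
                break
    return wit[1:]


print(witness_table("aaab"))
```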


Witnesses

Let j>i. Candidates i and j are consistent if they are sufficiently far from each other OR WIT[j-i+1]=0 (with WIT indexed as in the table for aaab, where WIT[1] corresponds to a shift of zero).


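Duel

Scan text:
• If pair of candidates is close and inconsistent, perform duel to eliminate one (or both).
• Sufficient to identify pairwise consistent candidates: transitivity of consistent positions.

[Figure: pattern P = a a a b aligned against text T at candidates i and j; the text character at the witness position (a or b?) decides the duel]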
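
A duel can be sketched as a single text comparison at the witness position. This is a sketch under stated assumptions: the witness table is indexed so that WIT[d+1] covers a shift of d = j − i (matching the table 0 4 4 4 for aaab), candidates are 1-indexed, the witness position lies inside the text, and the names `witness_table` and `duel` are illustrative.

```python
def witness_table(p):
    """Brute-force WIT; entry d (0-indexed) is the witness for shift d."""
    m = len(p)
    wit = []
    for i in range(1, m + 1):
        k_found = 0
        for k in range(i, m + 1):
            if p[k - 1] != p[k - i]:
                k_found = k
                break
        wit.append(k_found)
    return wit


def duel(p, t, i, j, wit):
    """Return the candidates among text positions i < j that survive."""
    d = j - i
    k = wit[d] if d < len(p) else 0   # far apart: no overlap to check
    if k == 0:
        return {i, j}                 # consistent, both survive
    c = t[i + k - 2]                  # text character at 1-indexed i+k-1
    survivors = set()
    if c == p[k - 1]:                 # agrees with P[k] under candidate i
        survivors.add(i)
    if c == p[k - d - 1]:             # agrees with P[k-d] under candidate j
        survivors.add(j)
    return survivors


wit = witness_table("aaab")
print(duel("aaab", "aaaab", 1, 2, wit))
```

Because P[k] and P[k-d] differ, the text character can agree with at most one of them, so each duel removes at least one candidate.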


2D Pattern Matching

• Σ is an alphabet.

• Input:
  Text T [1… n, 1… n]
  Pattern P [1… m, 1… m]
  where $t_{ij}, p_{ij} \in \Sigma$.

• Output: All (i, j) such that
  $T[i+k, j+l] = P[k+1, l+1], \quad 0 \le k, l < m$.

[Figure: MRI image]


2D Pattern Matching - Example

Input: Σ = {A,B}

Pattern:

A B A
A B A
A A B

Text:

A A B A B A A
B A B A B A B
A A B A A B B
B A A B A A A
A B A B A A A
B B A A B A B
B B B A B A B

Output: { (1,4), (2,2), (4,3) }


Bird / Baker

• First linear-time 2D pattern matching algorithm.

• View each pattern row as a metacharacter to linearize problem.

• Convert 2D pattern matching to 1D.


Bird / Baker

Preprocess pattern:
• Name rows of pattern using AC automaton.
• Using names, pattern has 1D representation.
• Construct KMP automaton of pattern.

Identical rows receive identical names.


Bird / Baker

Scan text:
• Name positions of text that match a row of pattern, using AC automaton within each row.
• Run KMP on named columns of text.

Since the 1D names are unique, only one name can be given to a text location.


Bird / Baker - Example

Preprocess pattern:
• Name rows of pattern using AC automaton.
• Using names, pattern has 1D representation.
• Construct KMP automaton of pattern.

Pattern rows and their names:

A B A → 1
A B A → 1
A A B → 2


Bird / Baker - Example

Scan text:
• Name positions of text that match a row of pattern, using AC automaton within each row.
• Run KMP on named columns of text.

Text:

A A B A B A A
B A B A B A B
A A B A A B B
B A A B A A A
A B A B A A A
B B A A B A B
B B B A B A B

Names assigned to text positions (a name marks the position where a pattern row ends; 0 means no pattern row ends there):

0 0 2 1 0 1 0
0 0 0 1 0 1 0
0 0 2 1 0 2 0
0 0 0 2 1 0 0
0 0 1 0 1 0 0
0 0 0 0 2 1 0
0 0 0 0 0 1 0
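
The naming idea can be reproduced on this example with a simplified sketch. To stay short, plain slicing stands in for the AC automaton and a direct list comparison stands in for KMP, so the sketch shows the reduction to 1D but not the linear running time. The function name and the representation of rows as strings are assumptions.

```python
def bird_baker_sketch(pattern, text):
    """pattern, text: lists of equal-length strings (rows).
    Returns (named text matrix, 1-indexed top-left corners of occurrences)."""
    m, n = len(pattern), len(text)
    width = len(pattern[0])

    # Step 1: name the pattern rows; identical rows receive identical names.
    names = {}
    for row in pattern:
        names.setdefault(row, len(names) + 1)
    p1d = [names[row] for row in pattern]      # 1D representation of the pattern

    # Step 2: name each text position where a pattern row ends (0 = none).
    named = [[0] * len(text[0]) for _ in range(n)]
    for r, row in enumerate(text):
        for c in range(width - 1, len(row)):
            named[r][c] = names.get(row[c - width + 1:c + 1], 0)

    # Step 3: search for the 1D pattern down each column of names.
    hits = []
    for c in range(width - 1, len(text[0])):
        column = [named[r][c] for r in range(n)]
        for r in range(n - m + 1):
            if column[r:r + m] == p1d:
                hits.append((r + 1, c - width + 2))
    return named, sorted(hits)


P = ["ABA", "ABA", "AAB"]
T = ["AABABAA", "BABABAB", "AABAABB", "BAABAAA",
     "ABABAAA", "BBAABAB", "BBBABAB"]
named, hits = bird_baker_sketch(P, T)
print(hits)
```

Each hit is reported as (row, column) of the top-left corner, converted back from the column where the pattern rows end.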


Bird / Baker

• Complexity of Bird / Baker algorithm: $O(n^2 \log |\Sigma|)$ time and space.

• Alphabet-dependent.

• Real-time since scans text characters once.

• Can be used for dictionary matching: replace KMP with AC automaton.


2D Witnesses

• Amir et al. – 2D witness table can be used for linear time and space alphabet-independent 2D matching.

• The order of duels is significant.
• Duels are performed in 2 waves over text.


Indexing

• Index text
  – Suffix Tree
  – Suffix Array

• Find pattern in O(m) time

• Useful paradigm when text will be searched for several patterns.


Suffix Trie

T = banana$

Suffixes:

suf1  banana$
suf2  anana$
suf3  nana$
suf4  ana$
suf5  na$
suf6  a$
suf7  $

• One leaf per suffix.
• An edge represents one character.
• Concatenation of edge-labels on the path from the root to leaf i spells the suffix that starts at position i.

[Figure: suffix trie of banana$, one character per edge, with leaves suf1 through suf7]
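
A suffix trie is small enough to sketch with nested dictionaries. This is an illustrative sketch, not a space-efficient index: the names `build_suffix_trie` and `occurrences` and the `"leaf"` marker key are assumptions, and construction takes quadratic time and space.

```python
def build_suffix_trie(text):
    """Insert every suffix character by character; one leaf per suffix."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1          # 1-indexed start of the suffix
    return root


def occurrences(root, pattern):
    """Walk down the pattern in O(m) steps, then collect the leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("banana$")
print(occurrences(trie, "ana"))
```

Locating the pattern's node costs time proportional to the pattern length; the leaves below that node are the starting positions of its occurrences.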


Suffix Tree

T = banana$

Suffixes suf1 through suf7: banana$, anana$, nana$, ana$, na$, a$, $

• Compact representation of trie.
• A node with one child is merged with its parent.
• Up to n internal nodes.
• O(n) space by using indices to label edges.

[Figure: suffix tree of banana$ with leaves suf1 through suf7; edges are labeled by substrings (banana$, a, na, na$, $) or, equivalently, by index pairs into T ([1,7], [2,2], [3,4], [5,7], [7,7])]


Suffix Tree Construction

• Naïve Approach: O(n²) time

• Linear-time algorithms:

Author        | Date | Innovation                                                   | Scan Direction
Weiner        | 1973 | First linear-time algorithm, alphabet-dependent suffix links | Right to left
McCreight     | 1976 | Alphabet-independent suffix links, more efficient            | Left to right
Ukkonen       | 1995 | Online linear-time construction, represents current end      | Left to right
Amir and Nor  | 2008 | Real-time construction                                       | Left to right


Suffix Tree Construction

• Linear-time suffix tree construction algorithms rely on suffix links to facilitate traversal of tree.

• A suffix link is a pointer from a node labeled xS to a node labeled S; x is a character and S a possibly empty substring.

• Alphabet-dependent suffix links point from a node labeled S to a node labeled xS, for each character x.


Index of Patterns

• Can answer Lowest Common Ancestor (LCA) queries in constant time if the tree is preprocessed accordingly.

• In suffix tree, LCA corresponds to Longest Common Prefix (LCP) of strings represented by leaves.


Index of Patterns

To index several patterns:

Concatenate patterns with unique characters separating them and build suffix tree.
Problem: inserts meaningless suffixes that span several patterns.

OR

Build generalized suffix tree – single structure for suffixes of individual patterns.
Can be constructed with Ukkonen’s algorithm.


Suffix Array

• The Suffix Array stores lexicographic order of suffixes.

• More space efficient than suffix tree.
• Can locate all occurrences of a substring by binary search.
• With Longest Common Prefix (LCP) array can perform even more efficient searches.
• LCP array stores longest common prefix between two adjacent suffixes in suffix array.


Suffix Array

Sort suffixes alphabetically:

Text order            | Sorted order
Index | Suffix        | Index | Suffix       | LCP
1     | mississippi   | 11    | i            | 0
2     | ississippi    | 8     | ippi         | 1
3     | ssissippi     | 5     | issippi      | 1
4     | sissippi      | 2     | ississippi   | 4
5     | issippi       | 1     | mississippi  | 0
6     | ssippi        | 10    | pi           | 0
7     | sippi         | 9     | ppi          | 1
8     | ippi          | 7     | sippi        | 0
9     | ppi           | 4     | sissippi     | 2
10    | pi            | 6     | ssippi       | 1
11    | i             | 3     | ssissippi    | 3


Suffix array

T = mississippi

Index    1   2   3   4   5   6   7   8   9  10  11
Suffix  11   8   5   2   1  10   9   7   4   6   3
LCP      0   1   1   4   0   0   1   0   2   1   3
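
The two arrays for mississippi can be reproduced with a naïve sketch that sorts the suffixes directly and compares neighbours character by character. It is meant only to make the definitions concrete, not as one of the efficient constructions; the function names are assumptions.

```python
def suffix_array(text):
    """0-indexed start positions of the suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[r] = longest common prefix of the suffixes at ranks r-1 and r."""
    def lcp(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [0] + [lcp(sa[r - 1], sa[r]) for r in range(1, len(sa))]


T = "mississippi"
sa = suffix_array(T)
print([i + 1 for i in sa])   # 1-indexed, as in the table
print(lcp_array(T, sa))
```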


Search in Suffix Array

O(m log n):
Idea: two binary searches
- search for leftmost position of X
- search for rightmost position of X

In between are all suffixes that begin with X.

With LCP array: O(m + log n) search.
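
The two binary searches can be sketched as follows. This is the plain O(m log n) variant without the LCP array; the suffix array is built naïvely inside the example, and the names are assumptions.

```python
def sa_search(text, sa, x):
    """Return the 1-indexed positions of all occurrences of x in text."""
    m = len(x)

    # Leftmost rank whose suffix, truncated to m characters, is >= x.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < x:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # First rank after `left` whose truncated suffix is > x.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= x:
            lo = mid + 1
        else:
            hi = mid

    # Everything in between begins with x.
    return sorted(p + 1 for p in sa[left:lo])


T = "mississippi"
sa = sorted(range(len(T)), key=lambda i: T[i:])   # naive suffix array
print(sa_search(T, sa, "issi"))
```

Each comparison inspects at most m characters and each search takes O(log n) steps, which gives the O(m log n) bound.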


Suffix Array Construction

• Naïve Approach: O(n²) time

• Indirect Construction:
  – preorder traversal of suffix tree
  – LCA queries for LCP.
  Problem: does not achieve better space efficiency.


Suffix Array Construction

• Direct construction algorithms:

Author                  | Date | Complexity | Innovation
Manber, Myers           | 1993 | O(n log n) | Sort and search, KMR renaming
Kärkkäinen and Sanders  | 2003 | O(n)       | Linear-time
Ko and Aluru            | 2003 | O(n)       | Linear-time
Kim et al.              | 2003 | O(n)       | Linear-time

• LCP array construction: range-minima queries.


Compressed Indices

Suffix Tree: O(n) words = O(n log n) bits

Compressed suffix tree
• Grossi and Vitter (2000)
  – O(n) space.
• Sadakane (2007)
  – O(n log |Σ|) space.
  – Supports all suffix tree operations efficiently.
  – Slowdown of only polylog(n).


Compressed Indices

Suffix array is an array of n indices, which is stored in:
O(n) words = O(n log n) bits

Compressed Suffix Array (CSA)

Grossi and Vitter (2000)
• O(n log |Σ|) bits
• access time increased from O(1) to O(log^ε n)

Sadakane (2003)
• Pattern matching as efficient as in uncompressed SA.
• O(n log H0) bits
• Compressed self-index


Compressed Indices

FM-index
• Ferragina and Manzini (2005)
• Self-indexing data structure
• First compressed suffix array that respects the high-order empirical entropy
• Size relative to compressed text length.
• Improved by Navarro and Mäkinen (2007)


Dynamic Suffix Tree

• Choi and Lam (1997)
• Strings can be inserted or deleted efficiently.
• Update time proportional to string inserted/deleted.
• No edges labeled by a deleted string.
• Two-way pointer for each edge, which can be done in space linear in the size of the tree.


Dynamic Suffix Array

• Recent work by Salson et al.
• Can update suffix array after construction if text changes.
• More efficient than rebuilding suffix array.
• Open problems:
  – Worst case O(n log n).
  – No online algorithm yet.


Word-Based Index

• Text of size n contains k distinct words.
• Index a subset of positions that correspond to word beginnings.
• With O(n) working space can index entire text and discard unnecessary positions.
• Desired complexity
  – O(k) space.
  – will always need O(n) time.
  Problem: missing suffix links.


Word-Based Suffix Tree

Construction Algorithms:

Author                  | Date | Results
Kärkkäinen and Ukkonen  | 1996 | O(n) time and O(n/j) space construction of sparse suffix tree (every jth suffix)
Andersson et al.        | 1999 | Expected linear-time and k-space construction of word-based suffix tree for k words.
Inenaga and Takeda      | 2006 | Online, O(n) time and k-space construction of word-based suffix tree for k words.


Word-Based Suffix Array

Ferragina and Fischer (2007) – word-based suffix array construction algorithm

• Time and space optimal construction.
• Computation of word-based LCP array in O(n) time and O(k) space.
• Alternative algorithm for construction of word-based suffix tree.
• Searching as efficient as ordinary suffix array.


Research Directions

Problems we are considering:
• Small space dictionary matching.
• Time-space optimal 2D compressed dictionary matching algorithm.
• Compressed parameterized matching.
• Self-indexing word-based data structure.
• Dynamic suffix array in O(n) construction time.


Small-Space

• Applications arise in which storage space is limited.

• Many innovative algorithms exist for single pattern matching using small additional space:
  – Galil and Seiferas (1981) developed first time-space optimal algorithm for pattern matching.
  – Rytter (2003) adapted the KMP algorithm to work in O(1) additional space, O(n) time.


Research Directions

• Fast dictionary matching algorithms exist for 1D and 2D. They achieve expected sublinear time.

• No deterministic dictionary matching method that works in linear time and small space.

• We believe that recent results in compressed self-indexing will facilitate the development of a solution to the small space dictionary matching problem.


Compressed Matching

• Data is compressed to save space.
• Lossless compression schemes can be reversed without loss of data.
• Pattern matching cannot be done in compressed text – pattern can span a compressed character.
• LZ78: data can be uncompressed in time and space proportional to the uncompressed data.
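
A small sketch of LZ78 shows why a pattern can span compressed units: each output pair stands for a whole phrase of the text. This is a textbook-style illustration with assumed function names, not the compressed-matching algorithms discussed in this talk.

```python
def lz78_compress(text):
    """Encode text as (phrase index, next character) pairs."""
    dictionary = {"": 0}              # phrase -> index; 0 is the empty phrase
    pairs, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch              # keep extending a known phrase
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                        # leftover phrase already in the dictionary
        pairs.append((dictionary[phrase], ""))
    return pairs


def lz78_decompress(pairs):
    """Rebuild the text in time proportional to its uncompressed length."""
    phrases, out = [""], []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_compress("abababab")
print(pairs)
print(lz78_decompress(pairs))
```

An occurrence of a pattern such as ba in abababab can start in one phrase and end in the next, so it is not visible by inspecting the pairs one at a time.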


Research Directions

• Amir et al. (2003) devised an algorithm for 2D LZ78 compressed matching.

• They define strongly inplace as a criterion for the algorithm: the extra space is proportional to the optimal compression of all strings of the given length.

• We are seeking a time-space optimal solution to 2D compressed dictionary matching.


Thank you!

