Semi-Numerical String Matching


Semi-Numerical String Matching

All the methods we’ve seen so far have been based on comparisons.

We propose alternative methods of computation such as:

Arithmetic, bit operations, and the fast Fourier transform.

Semi-numerical String Matching

We will survey three examples of such methods:

The Random Fingerprint method due to Karp and Rabin.

Shift–And method due to Baeza-Yates and Gonnet, and its extension to agrep due to Wu and Manber.

A solution to the match count problem using the fast Fourier transform due to Fischer and Paterson and an improvement due to Abrahamson.

Semi-numerical String Matching

Exact match problem: we want to find all the occurrences of the pattern P in the text T.

The pattern P is of length n. The text T is of length m.

Karp-Rabin fingerprint - exact match

Arithmetic replaces comparisons.

An efficient randomized algorithm that makes an error with small probability.

A randomized algorithm that never errs, whose expected running time is efficient.

We will consider a binary alphabet: {0,1}.

Karp-Rabin fingerprint - exact match

Strings are also numbers, H: strings → numbers. Let s be a string of length n, then

$H(s) = \sum_{i=1}^{n} 2^{\,n-i}\, s(i)$

Definition: let Tr denote the n-length substring of T starting at position r.

Arithmetic replaces comparisons.

Strings are also numbers, H: strings → numbers.

T = 1 0 1 1 0 1 0 1
P = 0 1 0 1

At position 5: T5 = 0 1 0 1, so H(T5) = 5 = H(P), a match.

At position 2: T2 = 0 1 1 0, so H(T2) = 6 ≠ H(P) = 5, no match.

Arithmetic replaces comparisons.

Theorem:

There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)

Proof:

Follows immediately from the unique representation of a number in base 2.

Arithmetic replaces comparisons.

We can compute H(Tr) from H(Tr-1)

T = 1 0 1 1 0 1 0 1 T1 = 1 0 1 1

T2 = 0 1 1 0

Arithmetic replaces comparisons.

$H(T_r) = 2\left(H(T_{r-1}) - 2^{\,n-1}\, T(r-1)\right) + T(r+n-1)$

For example:
H(T1) = H(1011) = 11
H(T2) = 2·(11 − 2³·1) + 0 = 6 = H(0110)

A simple efficient algorithm:

Compute H(T1), then run over T: compute H(Tr) from H(Tr-1) in constant time, and make the comparisons.

Total running time O(m)? Only if arithmetic on these numbers takes constant time; H(Tr) can be as large as 2^n, so for long patterns the numbers do not fit in a machine word.
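To make the rolling computation concrete, here is a minimal Python sketch for the binary alphabet (the function name and the 1-based position convention are mine; Python's arbitrary-precision integers hide the cost of the large H values):

```python
def exact_match_rolling(P, T):
    """Find all occurrences of binary string P in T by comparing H values.
    P and T are strings over {'0','1'}; returned positions are 1-based."""
    n, m = len(P), len(T)
    if n > m:
        return []
    H_P = int(P, 2)                       # H(P) = sum_i 2^(n-i) P(i)
    H_r = int(T[:n], 2)                   # H(T_1)
    matches = [1] if H_r == H_P else []
    for r in range(2, m - n + 2):         # r = starting position of T_r
        # H(T_r) = 2*(H(T_{r-1}) - 2^(n-1)*T(r-1)) + T(r+n-1)
        H_r = 2 * (H_r - (1 << (n - 1)) * int(T[r - 2])) + int(T[r + n - 2])
        if H_r == H_P:
            matches.append(r)
    return matches

# Example from the slides: T = 10110101, P = 0101 -> match at position 5.
print(exact_match_rolling("0101", "10110101"))   # [5]
```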

Arithmetic replaces comparisons.

Let's use modular arithmetic; this will help us keep the numbers small.

For some integer p, the fingerprint of P is defined by Hp(P) = H(P) (mod p).

Karp-Rabin

Lemma:

$H_p(P) = \Big(\cdots\big(\big((2\,P(1) + P(2)) \bmod p\big)\cdot 2 + P(3)\big) \bmod p \cdots \Big)\cdot 2 + P(n)\ \bmod p \;=\; H(P) \bmod p$

And during this computation no number ever exceeds 2p.

Karp-Rabin
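A small Python sketch of this Horner-style computation (the function name is mine); each intermediate value is reduced mod p, so it stays below 2p:

```python
def fingerprint(s, p):
    """Compute H_p(s) = H(s) mod p by Horner's rule over the binary string s."""
    h = 0
    for bit in s:
        h = (2 * h + int(bit)) % p        # intermediate values never exceed 2p
    return h

# Example from the slides: P = 101111, p = 7 -> H(P) = 47, H_p(P) = 47 mod 7 = 5.
print(fingerprint("101111", 7))   # 5
```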

P = 1 0 1 1 1 1 H(P) = 47

p = 7 Hp(P) = 47 (mod 7) = 5

An example

2·1 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
2·5 + 1 (mod 7) = 4
2·4 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hp(P)

Intermediate numbers are also kept small. We can still compute H(Tr) from H(Tr-1).

Karp-Rabin

Arithmetic:

$H(T_r) = 2\left(H(T_{r-1}) - 2^{\,n-1}\, T(r-1)\right) + T(r+n-1)$

Modular arithmetic:

$H_p(T_r) = \left[\, 2\,(H(T_{r-1}) \bmod p) - (2^{n} \bmod p)\, T(r-1) + T(r+n-1) \,\right] \bmod p$

Intermediate numbers are also kept small. We can still compute H(Tr) from H(Tr-1).

Karp-Rabin

Arithmetic:

$2^{n} = 2 \cdot 2^{\,n-1}$

Modular arithmetic:

$2^{n} \bmod p = \left(2\,(2^{\,n-1} \bmod p)\right) \bmod p$

How about the comparisons?

Arithmetic: There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr).

Modular arithmetic:

If there is an occurrence of P starting at position r of T

then Hp(P) = Hp(Tr)

There are values of p for which the converse is not true!

Karp-Rabin

Definition:

If Hp(P) = Hp(Tr) but P doesn’t occur in T starting at position r, we say there is a false match between P and T at position r.

If there is some position r such that there is a false match between P and T at position r, we say there is a false match between P and T.

Karp-Rabin

Our goal will be to choose a modulus p such that:

p is small enough to keep computations efficient.

p is large enough so that the probability of a false match is kept small.

Karp-Rabin

Definition: For a positive integer u, π(u) is the number of primes that are less than or equal to u.

Prime number theorem (without proof):

Prime moduli limit false matches

$\frac{u}{\ln(u)} \le \pi(u) \le 1.26\,\frac{u}{\ln(u)}$

Lemma (without proof): if u ≥ 29, then the product of all the primes that are less than or equal to u is greater than $2^u$.

Example: u = 29, the prime numbers less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and their product is

6,469,693,230 ≥ 536,870,912 = $2^{29}$

Prime moduli limit false matches

Corollary: If u ≥ 29 and x is any number less than or equal to $2^u$, then x has fewer than π(u) distinct prime divisors.

Proof: Assume x has k ≥ π(u) distinct prime divisors q1, …, qk. Then $2^u \ge x \ge q_1 \cdots q_k$, but $q_1 \cdots q_k$ is at least as large as the product of the first π(u) prime numbers, which by the lemma is greater than $2^u$, a contradiction.

Prime moduli limit false matches

Theorem: Let I be a positive integer, and p a randomly chosen prime less than or equal to I. If nm ≥ 29, then the probability of a false match between P and T is at most π(nm) / π(I).

Prime moduli limit false matches

Proof: Let R be the set of positions in T where P doesn't begin. We have

$\prod_{s \in R} |H(P) - H(T_s)| \le 2^{nm}$

By the corollary this product has at most π(nm) distinct prime divisors. If there is a false match at position r, then p divides $|H(P) - H(T_r)|$ and thus also divides $\prod_{s \in R} |H(P) - H(T_s)|$. So p must lie in a set of size π(nm), but p was chosen randomly out of a set of size π(I).

Prime moduli limit false matches

Choose a positive integer I. Pick a random prime p less than or equal to I, and compute P's fingerprint Hp(P).

For each position r in T, compute Hp(Tr) and test whether it equals Hp(P). If the numbers are equal, either declare a probable match, or check explicitly and declare a definite match.

Running time: excluding verification, O(m).

Random fingerprint algorithm
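A hedged Python sketch of the whole fingerprint algorithm for the binary alphabet. The prime is drawn by rejection sampling with a naive trial-division primality test, which is my own simplification and not part of the original method; with verify=True every equal fingerprint is checked explicitly, so reported matches are definite:

```python
import random

def is_prime(q):
    """Naive trial division; adequate for the modest moduli used here."""
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def random_prime(I):
    """Pick a random prime <= I by rejection sampling (assumes I >= 2)."""
    while True:
        p = random.randint(2, I)
        if is_prime(p):
            return p

def karp_rabin(P, T, I, verify=True):
    """Karp-Rabin over {'0','1'}; returns 1-based starting positions."""
    n, m = len(P), len(T)
    p = random_prime(I)
    hp_P = 0
    for bit in P:                          # Horner's rule, numbers stay small
        hp_P = (2 * hp_P + int(bit)) % p
    hp = 0
    for bit in T[:n]:                      # H_p(T_1)
        hp = (2 * hp + int(bit)) % p
    pow_n = pow(2, n, p)                   # 2^n (mod p)
    out = []
    for r in range(1, m - n + 2):
        if hp == hp_P and (not verify or T[r - 1:r - 1 + n] == P):
            out.append(r)
        if r + n - 1 < m:                  # roll the fingerprint to T_{r+1}
            hp = (2 * hp - pow_n * int(T[r - 1]) + int(T[r + n - 1])) % p
    return out

# Example, with I = n*m^2 as the slides suggest later:
print(karp_rabin("0101", "10110101", I=4 * 8 ** 2))   # [5]
```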

The smaller I is, the more efficient the computations. The larger I is, the smaller the probability of a false match.

Proposition: When I = nm²:

1. The largest number used in the algorithm requires at most 4(log(n) + log(m)) bits.
2. The probability of a false match is at most 2.53/m.

How to choose I

Proof:

The probability of a false match is at most

$\frac{\pi(nm)}{\pi(nm^2)} \;\le\; \frac{1.26\, nm/\ln(nm)}{nm^2/\ln(nm^2)} \;=\; 1.26\,\frac{\ln(nm^2)}{m\,\ln(nm)} \;=\; 1.26\,\frac{\ln(n) + 2\ln(m)}{m\,\ln(nm)} \;\le\; \frac{2.53}{m}$

How to choose I

An idea: why not choose k primes?

Proposition: when k primes are chosen randomly and independently between 1 and I, the probability of a false match is at most

$\left(\frac{\pi(nm)}{\pi(I)}\right)^{k}$

Proof: We saw that if p allows an error, it lies in a set of at most π(nm) integers. A false match can occur only if each of the k independently chosen primes lies in such a set of at most π(nm) integers.

Extensions

k = 4, n = 250, m = 4000, I = 250·4000² < 2³²

An illustration

$\left(\frac{\pi(nm)}{\pi(I)}\right)^{k} \le \left(\frac{2.53}{4000}\right)^{4} < 10^{-12}$

When k primes are used, the probability of a false match is at most

$\left(\frac{\pi(n)}{\pi(I)}\right)^{k}$

Proof: Suppose a false match occurs at position r. That means that each of the primes must divide $|H(P) - H(T_r)| \le 2^n$. There are at most π(n) primes that divide it. Each prime is chosen from a set of size π(I) and only by chance lies in a set of size π(n).

Even lower limits on the error

Consider the list L of locations in T where the Karp-Rabin algorithm declares P to be found.

A run is a maximal interval of starting locations l1, l2, …, lr in L such that every two consecutive locations differ by at most n/2.

Let’s verify a run.

Checking for error in linear time

Check the first two declared occurrences explicitly.

P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…

P = abbabbabbabbab
T = abbabbabbabbabbabbabbabbabbax…

If there is a false match, stop. Otherwise P is semi-periodic with period d = l2 − l1.

Checking for error in linear time

d is the minimal period.

P = abbabbabbabbabT = abbabbabbabbabbabbabbabbabbax…

P = abbabbabbabbabT = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time

P = abbabbabbabbabT = abbabbabbabbabbabbabbabbabbax…

For each i check that li+1 – li = d.

Check the last d characters of li for each i.

Checking for error in linear time

P = abbabbabbabbabT = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time

Check l1

P = abbabbabbabbabT = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time

Check l2

P is semi periodic with period 3.

T = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time

Check li+1 – li = 3

For each i check the last 3 characters of li.

P = babT = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time

For each i check the last 3 characters of li.

P = babT = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time

For each i check the last 3 characters of li.

Report a false match or approve the run.

P = babT = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time
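A simplified Python sketch of verifying one run (my own interpretation of the scheme above; positions are 1-based, and the function returns True when the whole run is approved and False when a false match is detected, which would trigger a restart with a new prime):

```python
def verify_run(P, T, run):
    """Verify one run of declared starting positions (increasing, consecutive
    gaps <= n/2).  Each text character inside the run is inspected O(1) times."""
    n = len(P)
    is_occ = lambda l: T[l - 1:l - 1 + n] == P     # explicit O(n) check

    if len(run) == 1:
        return is_occ(run[0])
    if not (is_occ(run[0]) and is_occ(run[1])):
        return False                                # false match among the first two
    d = run[1] - run[0]                             # P is semi-periodic with period d
    for prev, cur in zip(run[1:], run[2:]):
        if cur - prev != d:
            return False                            # spacing breaks the periodicity
        # only the last d characters of the new occurrence still need checking
        if T[cur + n - d - 1:cur + n - 1] != P[n - d:]:
            return False
    return True

# Hypothetical example: every position in the run is a genuine occurrence.
print(verify_run("abab", "ababababab", [1, 3, 5, 7]))   # True
```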

No character of T is examined more than twice during a single run.

Two runs are separated by at least n/2 positions and each run is at least n positions long. Thus no character of T is examined in more than two consecutive runs.

Total verification time O(m).

Time analysis

When we find a false match, we start again with a different prime.

The probability of a false match is O(1/m), so the expected number of restarts is constant.

We have converted the algorithm into one that never errs, with expected linear running time.

Time analysis

It is efficient and simple. It is space efficient. It can be generalized to solve harder problems such as 2-dimensional string matching. Its performance is backed up by a concrete theoretical analysis.

Why use Karp-Rabin?

The Shift-And Method

We start with the exact match problem.

Define M to be a binary n by m matrix such that:

M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j.

M(i,j) = 1 iff P[1 .. i] ≡ T[j-i+1 .. j]

The Shift-And Method

Let T = california Let P = for

M =

M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j.

How does M solve the exact match problem?

The Shift-And Method

1 2 3 4 5 6 7 8 9 m = 10

1 0 0 0 0 1 0 0 0 0 0

2 0 0 0 0 0 1 0 0 0 0

n=3 0 0 0 0 0 0 1 0 0 0

How to construct M

We will construct M column by column. Two definitions are in order:

Bit-Shift(j-1) is the vector derived by shifting the vector for column j-1 down by one and setting the first bit to 1.

Example: BitShift((0, 1, 1, 0, 1)ᵀ) = (1, 0, 1, 1, 0)ᵀ

We define the n-length binary vector U(x) for each character x in the alphabet. U(x) is set to 1 for the positions in P where character x appears.

Example:

P = abaac

How to construct M

U(a) = (1, 0, 1, 1, 0)ᵀ
U(b) = (0, 1, 0, 0, 0)ᵀ
U(c) = (0, 0, 0, 0, 1)ᵀ

Initialize column 0 of M to all zeros. For j ≥ 1, column j is obtained by

$M(j) = \text{BitShift}(M(j-1))\ \&\ U(T(j))$

How to construct M
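In code, a machine word (here a Python integer) holds one column of M, with bit i−1 playing the role of row i. A minimal sketch of the resulting Shift-And matcher (function name mine; it reports the 1-based positions of T at which an occurrence ends):

```python
def shift_and(P, T):
    """Shift-And exact matching; returns end positions of occurrences of P in T."""
    n = len(P)
    U = {}                                   # U[x] has bit i-1 set iff P(i) == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    col = 0                                  # column 0 of M: all zeros
    ends = []
    for j, x in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T(j))
        col = ((col << 1) | 1) & U.get(x, 0)
        if col & (1 << (n - 1)):             # M(n, j) = 1: an occurrence ends at j
            ends.append(j)
    return ends

# Example from the slides: P = "for" occurs in T = "california" ending at position 7.
print(shift_and("for", "california"))   # [7]
```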

1 2 3 4 5 6 7 8 9 10

T = x a b x a b a a x a

1 2 3 4 5

P = a b a a c

An example j = 1

1 2 3 4 5 6 7 8 9 10

1 0

2 0

3 0

4 0

5 0

M(1) = BitShift(M(0)) & U(T(1)) = (1, 0, 0, 0, 0)ᵀ & (0, 0, 0, 0, 0)ᵀ = (0, 0, 0, 0, 0)ᵀ, where U(x) = (0, 0, 0, 0, 0)ᵀ since x does not occur in P.

1 2 3 4 5 6 7 8 9 10

T = x a b x a b a a x a

1 2 3 4 5

P = a b a a c

An example j = 2

U(a) = (1, 0, 1, 1, 0)ᵀ

1 2 3 4 5 6 7 8 9 10

1 0 1

2 0 0

3 0 0

4 0 0

5 0 0

M(2) = BitShift(M(1)) & U(T(2)) = (1, 0, 0, 0, 0)ᵀ & (1, 0, 1, 1, 0)ᵀ = (1, 0, 0, 0, 0)ᵀ

1 2 3 4 5 6 7 8 9 10

T = x a b x a b a a x a

1 2 3 4 5

P = a b a a c

An example j = 3

U(b) = (0, 1, 0, 0, 0)ᵀ

1 2 3 4 5 6 7 8 9 10

1 0 1 0

2 0 0 1

3 0 0 0

4 0 0 0

5 0 0 0

M(3) = BitShift(M(2)) & U(T(3)) = (1, 1, 0, 0, 0)ᵀ & (0, 1, 0, 0, 0)ᵀ = (0, 1, 0, 0, 0)ᵀ

1 2 3 4 5 6 7 8 9 10

T = x a b x a b a a x a

1 2 3 4 5

P = a b a a c

An example j = 8

U(a) = (1, 0, 1, 1, 0)ᵀ

1 2 3 4 5 6 7 8 9 10

1 0 1 0 0 1 0 1 1

2 0 0 1 0 0 1 0 0

3 0 0 0 0 0 0 1 0

4 0 0 0 0 0 0 0 1

5 0 0 0 0 0 0 0 0

M(8) = BitShift(M(7)) & U(T(8)) = (1, 1, 0, 1, 0)ᵀ & (1, 0, 1, 1, 0)ᵀ = (1, 0, 0, 1, 0)ᵀ

For i > 1, entry M(i,j) = 1 iff

1) The first i-1 characters of P match the i-1 characters of T ending at character j-1.
2) Character P(i) ≡ T(j).

1) is true when M(i-1, j-1) = 1. 2) is true when the i'th bit of U(T(j)) = 1.

The algorithm computes the AND of these two bits.

Correctness

1 2 3 4 5 6 7 8 9 10

T = x a b x a b a a x a

a b a a c

Correctness

1 2 3 4 5 6 7 8 9 10

1 0 1 0 0 1 0 1 1 0 1

2 0 0 1 0 0 1 0 0 0 0

3 0 0 0 0 0 0 1 0 0 0

4 0 0 0 0 0 0 0 1 0 0

5 0 0 0 0 0 0 0 0 0 0

M(4,8) = 1: this is because a b a a is a prefix of P of length 4 that ends at position 8 in T.

Condition 1) – We had a b a as a prefix of length 3 that ended at position 7 in T ↔ M(3,7) = 1.

Condition 2) – The fourth character of P equals the eighth character of T ↔ the fourth bit of U(T(8)) = 1.

Formally the running time is Θ(mn). However, the method is very efficient if n is at most the size of a single or a few computer words.

Furthermore, only two columns of M are needed at any given time. Hence, the space used by the algorithm is O(n).

How much did we pay?

We extend the shift-and method for finding inexact occurrences of a pattern in a text.

Reminder example: T = aatatccacaa, P = atcgaa

P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

T = a a t a t c c a c a a        T = a a t a t c c a c a a
P =       a t c g a a            P =   a t c g a a

agrep: The Shift-And Method with errors

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.

We define the matrix Mk to be an n by m binary matrix, such that:

Mk(i,j) = 1 iff at least i−k of the first i characters of P match the i characters of T ending at character j.

What is M0? How does Mk solve the k-mismatch problem?

agrep

We compute Ml for all l = 0, …, k. For each j we compute M0(j), M1(j), …, Mk(j).

For all l, initialize Ml(0) to the zero vector. The j'th column of Ml is given by the recurrence derived next.

Computing Mk

Case 1: The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal. This case is captured by

$\text{BitShift}(M^{l}(j-1))\ \&\ U(T(j))$

Computing Mk

Case 2: The first i-1 characters of P match a substring of T ending at j-1 with at most l−1 mismatches (the next pair of characters may then disagree). This case is captured by

$\text{BitShift}(M^{l-1}(j-1))$

Computing Mk

We compute Ml for all l = 1, …, k. For each j we compute M0(j), M1(j), …, Mk(j).

For all l, initialize Ml(0) to the zero vector. The j'th column of Ml is given by:

$M^{l}(j) = \left[\text{BitShift}(M^{l}(j-1))\ \&\ U(T(j))\right]\ \text{OR}\ \text{BitShift}(M^{l-1}(j-1))$

Computing Mk
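A Python sketch of the k-mismatch recurrence, again with integers as bit-vectors (it handles substitutions only, as in the slides; the function name and the end-position output convention are mine):

```python
def shift_and_mismatches(P, T, k):
    """Report the end positions j at which P matches T with at most k mismatches."""
    n = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    cols = [0] * (k + 1)                     # cols[l] holds column j-1 of M^l
    hits = []
    for j, x in enumerate(T, start=1):
        new = [0] * (k + 1)
        new[0] = ((cols[0] << 1) | 1) & U.get(x, 0)          # exact-match level
        for l in range(1, k + 1):
            # M^l(j) = [BitShift(M^l(j-1)) & U(T(j))] OR BitShift(M^{l-1}(j-1))
            new[l] = (((cols[l] << 1) | 1) & U.get(x, 0)) | ((cols[l - 1] << 1) | 1)
        cols = new
        if cols[k] & (1 << (n - 1)):         # M^k(n, j) = 1
            hits.append(j)
    return hits

# Example from the slides: P = atcgaa in T = aatatccacaa with k = 2 matches
# the window starting at position 4, i.e. ending at position 9.
print(shift_and_mismatches("atcgaa", "aatatccacaa", 2))   # [9]
```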

1 2 3 4 5 6 7 8 9 10

T = x a b x a b a a x a

P = a b a a c

Example: M1

M0 =

1 2 3 4 5 6 7 8 9 10
1 0 1 0 0 1 0 1 1 0 1
2 0 0 1 0 0 1 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 0 0 1 0 0
5 0 0 0 0 0 0 0 0 0 0

M1 =

1 2 3 4 5 6 7 8 9 10
1 1 1 1 1 1 1 1 1 1 1
2 0 0 1 0 0 1 0 1 1 0
3 0 0 0 1 0 0 1 0 0 1
4 0 0 0 0 1 0 0 1 0 0
5 0 0 0 0 0 0 0 0 1 0

1 2 3 4 5 6 7 8 9 10

T = x a b x a b a a x a

P = a b a a c

Example: M1

1 2 3 4 5 6 7 8 9 10

1 1 1 1 1 1 1 1 1 1 1

2 0 0 1 0 0 1 0 1 1 0

3 0 0 0 1 0 0 1 0 0 1

4 0 0 0 0 1 0 0 1 0 0

5 0 0 0 0 0 0 0 0 1 0

U(a) = (1, 0, 1, 1, 0)ᵀ

Formally the running time is Θ(kmn). Again, the method is practically efficient for small n.

Still, only the last column of each Ml is needed at any given time. Hence, for fixed k, the space used by the algorithm is O(n).

How much did we pay?

The match count problem

We want to count the exact number of characters that match each of the different alignments of P with T.

T = a a t a t c c a c a a        T = a a t a t c c a c a a
P =       a t c g a a            P =   a t c g a a
            4 matches                  2 matches

The match-count problem

We will first look at a simple algorithm which extends the techniques we’ve seen so far.

Next, we introduce a more efficient algorithm that exploits existing efficient methods to calculate the Fourier transform.

We conclude with a variation that gives good performance for unbounded alphabets.

The match-count problem

We define the matrix MC to be an n by m integer valued matrix, such that:

MC(i,j) = the number of characters of P[1..i] that match T[j-i+1 .. j].

How does MC solve the match-count problem?

Match-count Algorithm 1

Initialize column 0 of MC to all zeros. For j ≥ 1, column j is obtained by

$MC(i,j) = \begin{cases} MC(i-1,\,j-1) + 1 & \text{if } P(i) = T(j) \\ MC(i-1,\,j-1) & \text{otherwise} \end{cases}$

Total of Θ(nm) comparisons and (simple) additions.

Computing MC
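For reference, a tiny Python sketch of algorithm 1; it computes the bottom row of MC directly, alignment by alignment, in Θ(nm) time (names are mine):

```python
def match_count_naive(P, T):
    """W[r] = number of positions at which P matches T when P starts at position r."""
    n, m = len(P), len(T)
    return {r: sum(P[i] == T[r - 1 + i] for i in range(n))
            for r in range(1, m - n + 2)}

# Example from the slides: P = atcgaa, T = aatatccacaa -> W[4] = 4, W[2] = 2.
print(match_count_naive("atcgaa", "aatatccacaa"))
```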

Define a vector W that counts the matching symbols; its indices are the possible alignments.

T = a b a b c a a a
P = a b c a              W(1) = 2
P =   a b c a            W(2) = 0
P =     a b c a          W(3) = 4
P =       a b c a        W(4) = 1
P =         a b c a      W(5) = 1

Match-count algorithm 2

Let’s handle one symbol at a time:

T = a b a b c a a a

P = a b c a

Ta = 1 0 1 0 0 1 1 1 Wa(1) = 1

Pa = 1 0 0 1

1 0 1 0 0 1 1 1 Wa(3) = 2

1 0 0 1

Match-count algorithm 2

We have W = Wa + Wb + Wc.

Or in the general case

$W = \sum_{\alpha \in \Sigma} W_{\alpha}$

Match-count algorithm 2

We can calculate Wα using a convolution.

Let’s rephrase the problem. X = Tα padded with n zeros on the right.

Y = Pα padded with m zeros on the right.

We have two vectors X,Y of length m+n.

Match-count algorithm 2

Ta = 1 0 1 0 0 1 1 1

Pa = 1 0 0 1

X = 1 0 1 0 0 1 1 1 0 0 0 0

Y = 1 0 0 1 0 0 0 0 0 0 0 0

Match-count algorithm 2

In our modified representation:

Where the indices are taken modulo n+m.

W(1) = < 1 0 1 0 0 1 1 1 0 0 0 0,

1 0 0 1 0 0 0 0 0 0 0 0 >W(2) = < 0 1 0 1 0 0 1 1 1 0 0 0 ,

0 1 0 0 1 0 0 0 0 0 0 0 >

Match-count algorithm 2

$W(i) = \sum_{j=1}^{n+m} X(j)\, Y(j-i+1)$

In our modified representation:

Where the indices are taken modulo n+m.

This is the convolution of X and the reverse of Y.

Using FFT calculating convolution takes timeO(m log(m)).

Match-count algorithm 2

$W(i) = \sum_{j=1}^{n+m} X(j)\, Y(j-i+1)$
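A sketch of algorithm 2 in Python, using numpy's FFT for the cyclic correlation; one correlation per alphabet symbol, summed into W (the function name and the use of numpy are my choices, not prescribed by the slides):

```python
import numpy as np

def match_count_fft(P, T):
    """Match counts W[r] for r = 1..m-n+1 via one FFT correlation per symbol."""
    n, m = len(P), len(T)
    W = np.zeros(m - n + 1)
    for a in set(P):                             # symbols absent from P contribute 0
        X = np.array([1.0 if c == a else 0.0 for c in T] + [0.0] * n)
        Y = np.array([1.0 if c == a else 0.0 for c in P] + [0.0] * m)
        # cyclic correlation: corr[d] = sum_j X(j + d) * Y(j)
        corr = np.fft.ifft(np.fft.fft(X) * np.conj(np.fft.fft(Y))).real
        W += corr[:m - n + 1]                    # corr[r-1] is W_alpha at alignment r
    return np.rint(W).astype(int)

# Example from the slides: P = abca, T = ababcaaa -> W = [2, 0, 4, 1, 1].
print(match_count_fft("abca", "ababcaaa"))
```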

The total running time is O(|Σ| m log(m)).

What happens if |Σ| is large?

For example when |Σ| = n, we get O(n m log(m)), which is actually worse than the naïve algorithm.

Match-count algorithm 2

An idea: some symbols might appear more often than others.

Use convolutions for the frequent symbols. Use a simpler counting method for the rest.

Match-count algorithm 3

Say α appears less than c times in P. Record the locations of α in P: l1, …, lr, with r ≤ c.

Go over the text; when we see α at location j, we increment W(j−l1+1), …, W(j−lr+1).

Rare symbols

T = a b a b c a a a…
P = a b c a c

The locations of c in P are l1 = 3, l2 = 5.

j = 5 → W(5−3+1)++ and W(5−5+1)++, i.e. W(3)++ and W(1)++.

T = a b a b c a a a…        T = a b a b c a a a…
P =     a b c a c            P = a b c a c

Rare symbols
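A small Python sketch of this rare-symbol bookkeeping (my own helper; W is a dict of running counts keyed by alignment, and `rare` is the set of symbols handled without a convolution):

```python
def add_rare_symbol_counts(P, T, rare, W):
    """One sweep of T: bump W[r] for every rare-symbol match, <= c updates per position."""
    n, m = len(P), len(T)
    locs = {a: [i + 1 for i, x in enumerate(P) if x == a] for a in rare}   # l1, ..., lr
    for j, x in enumerate(T, start=1):
        for l in locs.get(x, ()):
            r = j - l + 1                       # alignment that puts P(l) on T(j)
            if 1 <= r <= m - n + 1:
                W[r] = W.get(r, 0) + 1
    return W

# Hypothetical usage with the slides' example, counting only the rare symbol c:
print(add_rare_symbol_counts("abcac", "ababcaaa", {"c"}, {}))   # {3: 1, 1: 1}
```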

We can do this for all the rare symbols in one sweep of T; for each position in T we make up to c updates in W. Thus handling the rare symbols will cost us O(cm).

For the frequent symbols we pay one convolution per symbol. Since P has n characters, there are at most n/c frequent symbols, so we pay at most O((n/c)·m·log(m)).

How much did we pay?

We choose the c that gives us the best balance:

$c\,m = \frac{n}{c}\, m \log(m) \;\Rightarrow\; c^{2} = n \log(m) \;\Rightarrow\; c = \sqrt{n \log(m)}$

The total running time is

$O\!\left(m \sqrt{n \log(m)}\right)$

Determining c

Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, 1997.

References

The end