1 Construction of Index: (Page 197) Objective: Given a document, find the number of occurrences of...

1

Construction of Index: (Page 197)

• Objective: Given a document, find the number of occurrences of each word in the document.

• Example: Computer Science students know computers and computer languages.

• Keywords: computer, computers, science, students, know, and, languages.

2

Linear time algorithm:

• Let T be the text, |T| the length of T. We can find the occurrences of each word in T in O(|T|) time.

3

Constructing an automaton:

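[Figure: a tree-shaped automaton built from the keywords computer, computers, science, students, know, and, languages. Each edge is labeled with one letter, and each keyword ends in a final state.]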

4

Remarks:

• There is a final state for each word.

• Each final state has a counter storing the number of times that final state has been reached.

• While reading, the algorithm creates new states for each new word.

• For a word that has been seen before, we just go through the corresponding existing states.

• When a final state is reached, add 1 to its counter.
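The remarks above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the assignment solution: the name `count_words`, the lower-casing of letters, and the rule that any non-letter ends a word are choices made here, not taken from the slides.

```python
def count_words(text):
    """Count word occurrences with an automaton that grows while reading.

    Assumptions (not from the slides): letters are lower-cased and any
    non-letter character ends the current word.
    """
    children = [{}]    # children[s] maps a letter to the next state
    counter = [0]      # counter[s] is non-zero only for final states
    word_of = [""]     # the letters on the path from state 0 to s
    state = 0
    # The trailing space makes sure the last word is counted.
    for ch in text.lower() + " ":
        if ch.isalpha():
            nxt = children[state].get(ch)
            if nxt is None:            # new word: create a new state
                nxt = len(children)
                children.append({})
                counter.append(0)
                word_of.append(word_of[state] + ch)
                children[state][ch] = nxt
            state = nxt
        else:
            if state != 0:             # a final state is reached
                counter[state] += 1
            state = 0
    return {word_of[s]: c for s, c in enumerate(counter) if c > 0}


counts = count_words(
    "Computer Science students know computers and computer languages.")
```

Each character of the text triggers one dictionary lookup, so the number of automaton steps is proportional to |T|. Storing the spelled-out word in `word_of` is a convenience for printing the result and is not needed for the counting itself.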

5

Assignment one (due in week 6 on Monday, 9:20 pm)

• Write a program to convert a text into a vector such that each element of the vector is the number of occurrences of the corresponding keyword.

• Marking Scheme:

– 100% if using the linear time algorithm.
– 20% if using O(nm) time, where n is the length of the text and m is the number of words in the document.

• A report describing the program is required.

– A flow chart of the program is required.
– Specification of each function.
– Comments for the code.

6

Remarks:

• The following part might be hard for you. However, it is useful and no other part in the course is harder than this part.

7

String Matching

The problem:

• Input: a text T (very long string) and a pattern P (short string).

• Output: the index in T where a copy of P begins.

8

Some Notations and Terminologies

• |P| and |T|: the lengths of P and T.
• P[i]: the i-th letter of P.
• Prefix of P: a substring of P starting with P[1].
• P[1..i]: the prefix containing the first i letters of P.
• Suffix of P[1..i]: a substring of P[1..i] ending at P[i], e.g. P[3..i], P[5..i] (i>4).

9

Straightforward method

• Basic idea:

1. i=1;
2. Start with T[i] and match P against T[i], T[i+1], ..., T[i+|P|-1].
3. Whenever a mismatch is found, set i=i+1 and go to 2, as long as i+|P|-1 ≤ |T|.

• Example 1: T=ABABABCCA and P=ABABC

T: ABABABCCA
P: ABABC          i=1: mismatch at P[5] (C against A)
P:  A             i=2: mismatch at P[1] (A against B)
P:   ABABC        i=3: all of P matches, so P begins at T[3]

10

Analysis

• Step 2 takes O(|P|) comparisons in the worst case.

• Step 2 could be repeated O(|T|) times.

• Total running time is O(|T||P|).
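A minimal Python sketch of the straightforward method, using the 1-based positions of the slides in its result. The function name `naive_match` is chosen here for illustration, and unlike the slide's step list it reports every occurrence rather than stopping at the first one.

```python
def naive_match(text, pattern):
    """Return the 1-based positions in text where pattern begins."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):          # candidate start (0-based)
        j = 0
        # Step 2: compare P with T[i], T[i+1], ... until a mismatch.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # no mismatch: P occurs here
            hits.append(i + 1)          # convert to 1-based
    return hits


positions = naive_match("ABABABCCA", "ABABC")
```

The inner loop runs at most |P| times for each of at most |T| start positions, which is where the O(|T||P|) bound comes from.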

11

Knuth-Morris-Pratt Method (linear time algorithm)

A better idea

• In step 3, when there is a mismatch we move forward one position (i=i+1).
• We may be able to move more than one position at a time when a mismatch occurs (carefully study the pattern P).

For example:

T: ABABABCCA
P: ABABC          mismatch at P[5]
P:   ABA          P is moved two positions; ABA is already known to match

12

Questions:

• How do we decide how many positions to jump when a mismatch occurs?
• How much can we benefit? The running time becomes O(|T|+|P|).

Example 2:

P: abcabcabcaa
             |
T: abcabcabcabcaa
             |
      abcabcab

back here

13

• We can move forward more than one position. Reason?

• Study of pattern P:

P[1..7]   abcabca
P[1..10]  abcabcabca
P[1..7]      abcabca
P[1..4]         abca

• P[1..7] is the longest prefix that is also a suffix of P[1..10].

• P[1..4] is a prefix that is a suffix of P[1..10], but not the longest.

• Hint: When mismatch occurs at P[i+1], we want to find the longest prefix of P[1..i] which is also a suffix of P[1..i].

• Suffix of P is a substring of P ending at the last position of P.

14

Failure function

• f(i) is the largest r (with r<i) such that

P[1]P[2]...P[r] = P[i-r+1]P[i-r+2]...P[i].

The left side is the prefix of length r; the right side is the suffix of P[1]P[2]...P[i] of length r.

• That is, P[1..f(i)] is the longest proper prefix of P[1..i] that is also a suffix of P[1..i].

• Example 3: P=ababaccc and i=5.

P[1..5]           a b a b a
prefix P[1..3]    a b a
suffix P[3..5]        a b a

So r=3 and f(5)=3.

15

• Example 4:

P=abcabbabcabbaa

It is easy to verify that

f(1)=0, f(2)=0, f(3)=0, f(4)=1, f(5)=2,

f(6)=0, f(7)=1, f(8)=2, f(9)=3, f(10)=4,

f(11)=5, f(12)=6, f(13)=7, f(14)=1.
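The values can be checked mechanically by applying the definition of f directly. The sketch below is a deliberately slow brute-force check, not the linear-time construction; the name `failure_by_definition` is made up for this illustration, and the returned list stores f(i) at index i-1 because Python lists are 0-based.

```python
def failure_by_definition(pattern):
    """Compute f(1..|P|) straight from the definition (slow, for checking)."""
    f = []
    for i in range(1, len(pattern) + 1):
        prefix = pattern[:i]            # P[1..i]
        r = i - 1                       # try the largest r < i first
        # Decrease r until the first r letters equal the last r letters.
        while r > 0 and prefix[:r] != prefix[i - r:]:
            r -= 1
        f.append(r)
    return f


table = failure_by_definition("abcabbabcabbaa")
```

For the pattern of Example 4, `table` holds the fourteen values listed above, including f(13)=7 and f(14)=1.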

16

The Scan Algorithm(draw a figure to show)

• i: indicates that T[i] is the next character in T to be compared.

• q: indicates that P[q+1] is the next character in P to be compared with T[i], i.e. the first q letters of P are already matched.

1. i=1 and q=0;
2. Compare T[i] with P[q+1]:

case 1: T[i]==P[q+1]
    i=i+1; q=q+1;
    if q==|P| then print "P occurs at i-|P|" and set q=f(q);

case 2: T[i]≠P[q+1] and q≠0
    q=f(q);

case 3: T[i]≠P[q+1] and q==0
    i=i+1;

3. Repeat step 2 until i>|T|.

17

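• Example 5: P=abcabbabcabbaa

T: abcabcabbabbabcabbabcabbaa
   abcabb                         mismatch at P[6]; q=f(5)=2
      abcabbabc                   mismatch at P[9]; q=f(8)=2
            abc                   mismatch at P[3]; q=f(2)=0
              a                   mismatch at P[1] with q=0, so i=i+1
               abcabbabcabbaa     all of P matched (q=|P|); P occurs at T[13]

i    1 2 3 4 5 6 7 8 9 10 11 12 13 14

f(i) 0 0 0 1 2 0 1 2 3 4  5  6  7  1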

18

Running time complexity (hard)

• The running time of the scan algorithm is O(|T|).
• Proof:

– There are two pointers i and p.
– i: the next character in T to be compared.
– p: the position in T that is aligned with P[1]. (See figure below)

   p         i
P: abcabcabcaa
             |
T: abcabcabcabcaa
             |
P:    abcabcaa
      p

19

Facts:

1. When a match is found, move i forward.
2. When a mismatch is found, move p forward until p and i are the same. (When p=i and a mismatch occurs, move both i and p forward.)

Every comparison moves i or p forward, neither pointer ever moves backward, and each can move at most |T| times. From facts 1 and 2, it follows that the total number of comparisons is at most 2|T|.

Thus, the time complexity is O(|T|).

20

Another version of scan algorithm (code)

n=|T|
m=|P|
q=0
for i=1 to n
{
    while q>0 and P[q+1]≠T[i] do
    {
        q=f(q)
    }
    if P[q+1]==T[i] then q=q+1
    if q==m then
    {
        print "pattern occurs at i-m+1"
        q=f(q)
    }
}
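A Python rendering of this version of the scan, as a sketch. It assumes the failure table is already available and is passed in as a list `f` with f(q) stored at index q-1; the function name `kmp_scan` and that storage convention are choices made for this example. The pattern is assumed to be non-empty.

```python
def kmp_scan(text, pattern, f):
    """Return the 1-based start positions of pattern in text.

    f is the failure table of pattern, with f(q) stored at f[q - 1].
    """
    m = len(pattern)
    hits = []
    q = 0                                   # letters of P matched so far
    for i, ch in enumerate(text, start=1):  # i is the 1-based index in T
        # pattern[q] is P[q+1] in the 1-based notation of the slides.
        while q > 0 and pattern[q] != ch:
            q = f[q - 1]                    # q = f(q)
        if pattern[q] == ch:
            q += 1
        if q == m:                          # all of P is matched
            hits.append(i - m + 1)
            q = f[q - 1]                    # keep going for later matches
    return hits


f_example = [0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 1]
found = kmp_scan("abcabcabbabbabcabbabcabbaa", "abcabbabcabbaa", f_example)
```

The index i never decreases and q drops only as often as it has risen, which mirrors the two-pointer argument for the O(|T|) bound.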

21

Failure Function Construction

Basic idea:

Case 1: f(1) is always 0.

Case 2: if P[q]==P[f(q-1)+1] then f(q)=f(q-1)+1.

Example: P=abcabcc

f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2; f(6)=3; f(7)=0;

P[4]=P[f(4-1)+1], so f(4)=f(4-1)+1=0+1=1.

P[5]=P[f(5-1)+1], so f(5)=f(5-1)+1=1+1=2.

P[6]=P[f(6-1)+1], so f(6)=f(6-1)+1=2+1=3.

22

Case 3: if P[q]≠P[f(q-1)+1] and f(q-1)≠0 then consider P[q] ?= P[f(f(q-1))+1]. (Do it recursively.)

Case 4: if P[q]≠P[f(q-1)+1] and f(q-1)==0 then f(q)=0.

Consider the computation of f(7) for P=abcabcc.

P[7] ≠ P[f(7-1)+1] = P[4]       (c against a)

P[7] ≠ P[f(f(7-1))+1] = P[1]    (c against a)

Since f(f(7-1))=f(3)=0, Case 4 gives f(7)=0.

23

The algorithm (code) to compute failure function

1. m=|P|;
2. f(1)=0;
3. k=0;
4. for q=2 to |P| do {
5.     k=f(q-1);
6.     if (k>0 and P[k+1]!=P[q]) { k=f(k); goto 6; }
7.     if (k>0 and P[k+1]==P[q]) { f(q)=k+1; }
8.     if (k==0) { if (P[k+1]==P[q]) f(q)=1; else f(q)=0; }
   }

24

Another version

m=|P|;
f(1)=0;
k=0;
for q=2 to |P| do {
    k=f(q-1);
    while (k>0 and P[k+1]!=P[q]) do {
        k=f(k);
    }
    if (P[k+1]==P[q]) then k=k+1;
    f(q)=k;
}
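The same construction as a Python sketch. As an assumption of this example, f(q) is stored at index q-1 of a list, and the function name `failure_function` is chosen here rather than taken from the slides.

```python
def failure_function(pattern):
    """Return the failure table of pattern, with f(q) stored at f[q - 1]."""
    m = len(pattern)
    f = [0] * m                      # f(1) = 0
    for q in range(2, m + 1):
        k = f[q - 2]                 # k = f(q-1)
        # pattern[k] is P[k+1] and pattern[q - 1] is P[q] in 1-based terms.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = f[k - 1]             # k = f(k)
        if pattern[k] == pattern[q - 1]:
            k += 1
        f[q - 1] = k
    return f


small = failure_function("abcabcc")
```

For P=abcabcc this gives 0 0 0 1 2 3 0, matching the values worked out case by case earlier.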

25

• Example 3:

q     1 2 3 4 5 6 7 8 9 10 11 12
P[q]  a b c a b c a b c a  a  c
f(q)  0 0 0 1 2 3 4 5 6 7  1

(The computation of f(11) is very interesting: k falls back from f(10)=7 to f(7)=4, then to f(4)=1, then to f(1)=0, and only then does P[1]=P[11] give f(11)=1.)

Question: Do we need to compute f(12)?

Yes, if you want to find ALL occurrences of P.

No, if you just want to find the first occurrence of P.

26

Example:

P=abcabc

T=abcabcabc

T: abcabcabc
   abcabc
      abcabc

i    1 2 3 4 5 6

f(i) 0 0 0 1 2 3

When a match is found at the end of P, set q=f(|P|) and continue scanning.

Running time complexity (Fun Part, not required)

The running time of the failure function construction algorithm is O(|P|). (The proof is similar to that for the scan algorithm.)

Total running time complexity

The total complexity for failure function construction and the scan algorithm is O(|P|+|T|).

27

Linear Time Algorithm for Multiple patterns

• Input: a string T (very long) and a set of patterns P1,P2,...,Pk.

• Output: all the occurrences of Pi's in T.

Let us consider the set of patterns { he, she, his, hers }. We can construct an automaton as follows:

28

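[Figure: the automaton for { he, she, his, hers } with states 0 to 9. Transitions: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5; state 0 has a self-loop labeled e, i, r.]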

29

• g(s,a)=s' means that at state s if the next input letter is a then the next state is s'.

• The states of the automaton are organized column by column.

• Each state corresponds to a prefix of some pattern Pi.

• F: the set of final states (dark circled) corresponding to the ends of patterns.

• For the starting state 0, add g(0,a)=0, if g(0,a) is originally fail.

30

• Exercise: write down the g() function for the above automaton.

• Failure function

f(s) = the state for the longest prefix of some pattern Pi that is a suffix of the string in the path from 0 (starting state) to s.

• Example:

he is the longest prefix of hers (and of he) that is a suffix of the string she, so f(5)=2.

31

The scan algorithm

Text: T[1]T[2]...T[n]

s=0;
for i:=1 to n do
{
    while g(s,T[i])=fail do s=f(s);
    s:=g(s,T[i]);
    if s is in F then return "yes";
}
return "no"
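A Python sketch of this scan for the pattern set { he, she, his, hers }. The tables `GOTO`, `FAIL` and `FINAL` are written out by hand from the automaton and failure table of the slides; the helper names `g`, `trace` and `contains_pattern` are chosen for this illustration. `trace` is an extra that records the state after every letter instead of returning at the first final state.

```python
# Hand-written tables for { he, she, his, hers } (states 0..9).
GOTO = {
    (0, "h"): 1, (1, "e"): 2, (2, "r"): 8, (8, "s"): 9,
    (1, "i"): 6, (6, "s"): 7,
    (0, "s"): 3, (3, "h"): 4, (4, "e"): 5,
}
FAIL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
FINAL = {2, 5, 7, 9}


def g(s, a):
    """Goto function; None plays the role of fail. g(0, a) never fails."""
    if (s, a) in GOTO:
        return GOTO[(s, a)]
    return 0 if s == 0 else None


def trace(text):
    """Return the state reached after each letter of text."""
    s, states = 0, []
    for a in text:
        while g(s, a) is None:
            s = FAIL[s]              # s = f(s)
        s = g(s, a)
        states.append(s)
    return states


def contains_pattern(text):
    """The yes/no scan: True as soon as a final state is reached."""
    return any(s in FINAL for s in trace(text))


states = trace("shershii")
```

On the text shershii the recorded states are 3, 4, 5, 8, 9, 4, 6, 0; the fall-backs through f happen inside the while loop and are not recorded.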

32

Theorem: The scan algorithm takes O(|T|) time.

Proof: Again, the two pointer argument.

• When a match is found, move the first pointer forward. (s:=g(s,T[i]);)
• When a mismatch is found (g(s,T[i])==fail), move the second pointer forward. (s=f(s);)
• When a final state is reached, declare that a pattern has been found. (if s is in F then return "yes";)

33

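• Example: T=shershii. An entry x→y means that the scan first falls back to state x by the failure function and then moves to state y. (The scan is continued past the final states to show every transition.)

i      1    2    3    4    5    6    7    8
T[i]   s    h    e    r    s    h    i    i
state  3    4    5    2→8  9    3→4  1→6  0→0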

s 1 2 3 4 5 6 7 8 9

f(s) 0 0 0 1 2 0 3 0 3

34

Failure function construction

• Basic idea: similar to that for one pattern.

for each state s of depth 1 do f(s)=0
for each depth d>=1 do
    for each state sd of depth d and character a such that g(sd,a)=s' do
    {
        s=f(sd)
        while g(s,a)=fail do
        {
            s=f(s)
        }
        f(s')=g(s,a)
    }
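A Python sketch that builds the goto function and then the failure function level by level, using a queue to visit states in order of depth. The name `build_automaton` and the list-of-dictionaries layout are assumptions of this example. One step goes beyond the slides and is marked in the code: a state is also treated as final when its failure state is final, so that a pattern ending inside a longer match is not missed.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, final) for a list of patterns."""
    goto = [{}]                       # goto[s] maps a letter to a state
    final = set()
    for p in patterns:                # build the goto function
        s = 0
        for a in p:
            if a not in goto[s]:
                goto.append({})
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        final.add(s)

    fail = [0] * len(goto)            # states of depth 1 keep f(s) = 0
    queue = deque(goto[0].values())
    while queue:
        sd = queue.popleft()
        for a, s_next in goto[sd].items():
            s = fail[sd]
            # g(0, a) never fails, so the fall-back stops at state 0.
            while s != 0 and a not in goto[s]:
                s = fail[s]
            fail[s_next] = goto[s].get(a, 0)
            # Addition not in the slides: inherit finality via f.
            if fail[s_next] in final:
                final.add(s_next)
            queue.append(s_next)
    return goto, fail, final


goto, fail, final = build_automaton(["he", "she", "his", "hers"])
```

Inserting the patterns in the order he, she, his, hers numbers the states exactly as in the slides, so `fail[1:]` can be compared with the failure table for this pattern set.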

35

• g(0,c)≠fail for any possible character c.

The failure function for {he, she, his, hers} is

s 1 2 3 4 5 6 7 8 9

f(s) 0 0 0 1 2 0 3 0 3

Time complexity: O(|P1|+|P2|+...+|Pk|).

Proof: Two pointer argument.

Leave it for assignment (optional)

Date post:	21-Dec-2015