Post on 22-Jan-2016
Martin Kay String Matching 1 1
Martin Kay
Stanford University
Naive Search (1)
naive_search(Pattern, Text, 1) :-
    append(Pattern, _, Text).
naive_search(Pattern, [_ | Text], N) :-
    naive_search(Pattern, Text, N0),
    N is N0+1.
| ?- naive_search("is", "mississippi", N).
N = 2 ? ;
N = 5 ? ;
no
| ?-
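The query on the slide relies on double-quoted text being read as a list. In systems where that is not the default (SWI-Prolog 7 and later, for instance), one way to run the same search is to convert atoms with atom_chars/2 first. The following is a minimal sketch under that assumption; all_matches/3 is a helper name introduced here, not part of the slides.

```prolog
:- use_module(library(lists)).

% naive_search/3 exactly as on the slide: positions are counted from 1.
naive_search(Pattern, Text, 1) :-
    append(Pattern, _, Text).
naive_search(Pattern, [_ | Text], N) :-
    naive_search(Pattern, Text, N0),
    N is N0 + 1.

% all_matches(+PatternAtom, +TextAtom, -Ns): hypothetical wrapper that
% turns both atoms into character lists and collects every position.
all_matches(PatternAtom, TextAtom, Ns) :-
    atom_chars(PatternAtom, Pattern),
    atom_chars(TextAtom, Text),
    findall(N, naive_search(Pattern, Text, N), Ns).

% ?- all_matches('is', mississippi, Ns).
% Ns = [2, 5].
```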
pref: A Prefix Predicate
pref(P, T) :-
    assert(stat(T, P)),
    fail.
pref([], _).
pref([H | P], [H | T]) :-
    pref(P, T).
Make an entry in the database every time the predicate is called.
Search using pref
naive_search1(Pattern, Text, 1) :-
    pref(Pattern, Text).
naive_search1(Pattern, [_ | Text], N) :-
    naive_search1(Pattern, Text, N0),
    N is N0+1.
| ?- naive_search1([i,s], [m,i,s,s,i,s,s,i,p,p,i], N).
N = 2 ? ;
N = 5 ? ;
no
| ?-
The Statistics
| ?- listing(stat).
stat([m,i,s,s,i,s,s,i,p,p,i], [i,s]).
stat([i,s,s,i,s,s,i,p,p,i], [i,s]).
stat([s,s,i,s,s,i,p,p,i], [s]).
stat([s,i,s,s,i,p,p,i], []).
stat([s,s,i,s,s,i,p,p,i], [i,s]).
stat([s,i,s,s,i,p,p,i], [i,s]).
stat([i,s,s,i,p,p,i], [i,s]).
stat([s,s,i,p,p,i], [s]).
stat([s,i,p,p,i], []).
stat([s,s,i,p,p,i], [i,s]).
stat([s,i,p,p,i], [i,s]).
stat([i,p,p,i], [i,s]).
stat([p,p,i], [s]).
stat([p,p,i], [i,s]).
stat([p,i], [i,s]).
stat([i], [i,s]).
stat([], [s]).
stat([], [i,s]).
18 Entries
11 Alignments
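The two totals can be recomputed rather than counted by hand. The sketch below reruns the instrumented search and counts the stat/2 facts. It assumes that an "alignment" means a call of pref/2 with the whole pattern against a non-empty remainder of the text; that reading is an assumption made here because it reproduces the figure of 11. count_calls/4 is a name introduced for illustration.

```prolog
:- use_module(library(lists)).
:- dynamic stat/2.

% pref/2 and naive_search1/3 as on the slides.
pref(P, T) :-
    assertz(stat(T, P)),
    fail.
pref([], _).
pref([H | P], [H | T]) :-
    pref(P, T).

naive_search1(Pattern, Text, 1) :-
    pref(Pattern, Text).
naive_search1(Pattern, [_ | Text], N) :-
    naive_search1(Pattern, Text, N0),
    N is N0 + 1.

% count_calls(+Pattern, +Text, -Entries, -Alignments): hypothetical helper.
% Clears old statistics, runs the search to exhaustion, then counts all
% stat/2 facts and those made with the full pattern on a non-empty text.
count_calls(Pattern, Text, Entries, Alignments) :-
    retractall(stat(_, _)),
    findall(N, naive_search1(Pattern, Text, N), _),
    findall(x, stat(_, _), Es),
    length(Es, Entries),
    findall(x, stat([_ | _], Pattern), As),
    length(As, Alignments).

% ?- count_calls([i,s], [m,i,s,s,i,s,s,i,p,p,i], E, A).
% E = 18, A = 11.
```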
Observe--
If the pattern “mississippi” matched part of the way, we can move over all the characters matched, because none of them can be an “m”, which is what we need to start a new match.
Text:    m i s s i o n a r y . . . .
Pattern: m i s s i s s i p p i
Mismatch: text “o” against pattern “s”
No “m” here (among the matched characters), or maybe even here (at the mismatch)
So move to here! (past the matched characters)
Observe further --
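Text:    p e r p e n d i c u l a r . . .
Pattern: p e r p e t r a t e
Mismatch: text “n” against pattern “t”
This is a prefix of the pattern: the “p e” just before the mismatch
So try this:
Text:    p e r p e n d i c u l a r . . .
Pattern:       p e r p e t r a t e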
Observe yet further --
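Text:    p e r p e t u a l . . . . .
Pattern: p e r p e t r a t e
Mismatch: text “u” against pattern “r”
No (shorter) prefix of the pattern ends here (at the last matched character, “t”)
So move to here:
Text:    p e r p e t u a l . . . . .
Pattern:             p e r p e t r a t e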
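The three observations fit a single rule: after a partial match, shift the pattern by the number of characters matched minus the length of the longest proper suffix of the matched part that is also a prefix of it. The sketch below computes that shift directly from this description. overlap/2 and shift/3 are names introduced here for illustration, and the matched length is assumed to be at least 1.

```prolog
:- use_module(library(lists)).

% overlap(+Matched, -Overlap): the longest proper suffix of Matched that is
% also a prefix of it. append/3 yields the longest suffix first; the cut
% keeps only that one.
overlap(Matched, Overlap) :-
    append([_ | _], Overlap, Matched),
    append(Overlap, _, Matched),
    !.

% shift(+PatternAtom, +MatchedLen, -Shift): how far the pattern can move
% after its first MatchedLen characters matched and the next one did not.
shift(PatternAtom, MatchedLen, Shift) :-
    atom_chars(PatternAtom, Pattern),
    length(Matched, MatchedLen),
    append(Matched, _, Pattern),
    overlap(Matched, Overlap),
    length(Overlap, OverlapLen),
    Shift is MatchedLen - OverlapLen.

% ?- shift(mississippi, 5, S).   % "missi" matched against "missionary"
% S = 5.
% ?- shift(perpetrate, 5, S).    % "perpe" matched against "perpendicular"
% S = 3.
% ?- shift(perpetrate, 6, S).    % "perpet" matched against "perpetual"
% S = 6.
```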
Overlaps
Search for
a b a c a b a d a b a c a b a
in the text
a b a b a c a b a d a b a c a b a d a b a c a b a b a
[Figure: copies of the pattern aligned against the text.]
Déjà vu
[Figure: the same pattern, a b a c a b a d a b a c a b a, again aligned against the same text, a b a b a c a b a d a b a c a b a d a b a c a b a b a.]
On-line search
We are looking for the pattern cacao. We have some number (0 or more) of searches in progress and are waiting for the next character to see which ones continue and maybe to start a new one.
We have seen this much of the text so far:
c a c a
c a c a c a
c
cc
Text position, text character, pointers into the pattern:
0 a [0]
1 b [0, 1]
2 a [0, 2]
3 b [0, 1, 3]
4 a [0, 2]
5 c [0, 1, 3]
6 a [0, 4]
7 b [0, 1, 5]
8 a [0, 2, 6]
9 d [0, 1, 3, 7]
10 a [0, 8]
11 b [0, 1, 9]
12 a [0, 2, 10]
13 c [0, 1, 3, 11]
14 a [0, 4, 12]
15 b [0, 1, 5, 13]
16 a [0, 2, 6, 14]
result 2
17 d [0, 1, 3, 7]
18 a [0, 8]
19 b [0, 1, 9]
20 a [0, 2, 10]
21 c [0, 1, 3, 11]
22 a [0, 4, 12]
23 b [0, 1, 5, 13]
24 a [0, 2, 6, 14]
result 10
25 b [0, 1, 3, 7]
26 a [0, 2]
Search for
a b a c a b a d a b a c a b a
in the text
a b a b a c a b a d a b a c a b a d a b a c a b a b a
1. The rightmost pointer always moves.
2. Other pointers move if they can do so over the same character.
3. A new ‘0’ is introduced on the left.
A pointer in a given position always has pointers in the same set of positions to its left.
These are properties of the pattern only.
Therefore they can be cached or precompiled.
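The pointer table can be reproduced with a short simulation. The sketch below keeps a list of pointers (how many pattern characters each search in progress has matched), advances those that agree with the next text character, drops the rest, and adds a new 0 on the left at every step. online_search/3, scan/6 and advance_all/4 are names introduced here for illustration. Match positions are counted from 0, as in the "result" lines of the table.

```prolog
:- use_module(library(lists)).

% online_search(+Pattern, +Text, -Starts): Starts lists the 0-based text
% positions at which Pattern occurs. Pattern and Text are character lists.
online_search(Pattern, Text, Starts) :-
    length(Pattern, PL),
    scan(Text, 0, [0], Pattern, PL, Starts).

% scan(+Text, +I, +Pointers, +Pattern, +PL, -Starts): read the character at
% text position I, move the pointers, and report a match when a pointer
% reaches the pattern length PL.
scan([], _, _, _, _, []).
scan([C | Text], I, Ptrs0, Pattern, PL, Starts) :-
    advance_all(Ptrs0, C, Pattern, Moved),
    (   last(Moved, PL)                 % rightmost pointer ran off the end
    ->  S is I + 1 - PL,
        Starts = [S | Starts1],
        append(Kept, [PL], Moved)       % retire the finished search
    ;   Starts = Starts1,
        Kept = Moved
    ),
    I1 is I + 1,
    scan(Text, I1, [0 | Kept], Pattern, PL, Starts1).

% advance_all(+Pointers, +C, +Pattern, -Moved): keep and increment exactly
% those pointers whose pattern character equals C.
advance_all([], _, _, []).
advance_all([P | Ps], C, Pattern, Moved) :-
    (   nth0(P, Pattern, C)
    ->  P1 is P + 1,
        Moved = [P1 | Moved1]
    ;   Moved = Moved1
    ),
    advance_all(Ps, C, Pattern, Moved1).

% ?- atom_chars(abacabadabacaba, P),
%    atom_chars(ababacabadabacabadabacababa, T),
%    online_search(P, T, Starts).
% Starts = [2, 10].
```

This version still touches every live pointer at every character; it is meant only to make the table concrete, not to be the precompiled form that the last two lines of the slide point towards.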
If this matches ...
then so will these
So try these
only if this fails!
The failure function
a [0]
b [0, 1]
a [0, 2]
b [0, 1, 3]
a [0, 2]
c [0, 1, 3]
a [0, 4]
b [0, 1, 5]
a [0, 2, 6]
d [0, 1, 3, 7]
a [0, 8]
b [0, 1, 9]
a [0, 2, 10]
c [0, 1, 3, 11]
a [0, 4, 12]
 a  b  a  c  a  b  a  d  a  b  a  c  a ...
 0  1  2  3  4  5  6  7  8  9 10 11 12 ...
    0  0  1  0  1  2  3  0  1  2  3  4 ...
The Failure Function
 a  b  c  a  b  c  a  b  c
-1  0  0  0  1  2  3  4  5
[Figure: shifted copies of the pattern aligned against itself.]
The Failure Function
 a  b  a  c  a  b  a  d  a  b  a  c  a  b  a
-1  0  0  1  0  1  2  3  0  1  2  3  4  5  6
[Figure: shifted copies of the pattern aligned against itself.]
Substring, Prefix, Suffix
• Part of a string S (even if it covers the whole of S) is a substring of S.
• If it includes the first (last) character of S, it is a prefix (suffix) of S.
• If it does not cover the whole of S, it is a proper substring (prefix, suffix) of S.
Example: S = ababac
Some substrings: ababac, ab, b, bab, ac, ε; only ababac is not proper
Some prefixes: ababac, a, aba, ε; only ababac is not proper
Some suffixes: ababac, abac, c, ε; only ababac is not proper
ε is the empty string
Borders
• If B is a proper prefix and a proper suffix of a string S, it is a border of S.
• Note: ε is a border of every string.
Examples:
abcabcabc has borders abc, abcabc, ε
abacabadabacaba has borders abacaba, aba, a, ε
Borders
 a  b  c  a  b  c  a  b  c
-1  0  0  0  1  2  3  4  5
[Figure: shifted copies of the pattern aligned against itself.]
border in Prolog
border(Pattern, Border) :-
    append([_ | _], Border, Pattern),
    append(Border, _, Pattern).
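As a check on the definition, border/2 can be used to list every border of a string and to rebuild the failure-function rows shown earlier. The sketch below does both by brute force. borders/2 and failure_function/2 are names introduced here, between/3 is assumed to be available as a built-in (as it is in SWI-Prolog), and the -1 for the empty prefix follows the convention of the tables on the slides.

```prolog
:- use_module(library(lists)).

% border/2 as on the slide: a proper suffix that is also a prefix.
% Solutions come out longest first.
border(Pattern, Border) :-
    append([_ | _], Border, Pattern),
    append(Border, _, Pattern).

% borders(+Atom, -Borders): all borders of Atom as atoms, longest first.
% The empty border appears as ''.
borders(Atom, Borders) :-
    atom_chars(Atom, Chars),
    findall(B, (border(Chars, Cs), atom_chars(B, Cs)), Borders).

% failure_function(+Atom, -Fs): for each prefix length I = 0 .. N-1, the
% length of the longest border of that prefix, with -1 for I = 0.
failure_function(Atom, [-1 | Fs]) :-
    atom_chars(Atom, Chars),
    length(Chars, N),
    Max is N - 1,
    findall(F,
            ( between(1, Max, I),
              length(Prefix, I),
              append(Prefix, _, Chars),
              once(border(Prefix, B)),   % first solution is the longest
              length(B, F)
            ),
            Fs).

% ?- borders(abacabadabacaba, Bs).
% Bs = [abacaba, aba, a, ''].
% ?- failure_function(abcabcabc, Fs).
% Fs = [-1, 0, 0, 0, 1, 2, 3, 4, 5].
```

This takes far more than linear time, so it is useful only for checking small tables by hand.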
Borders in Linear-time
border(0, _, -1) :- !.    % base case, not shown on the slide: the empty prefix has no border
border(I, Pattern, Q) :-
    J is I-1,
    border(J, Pattern, P),
    nth0(J, Pattern, C),
    extend(C, P, Pattern, Q).

extend(_, -1, _, 0).
extend(C, P, Pattern, Q) :-
    nth0(P, Pattern, C), !,
    Q is P+1.
extend(C, P0, Pattern, R) :-
    border(P0, Pattern, Q),
    extend(C, Q, Pattern, R).

 a  b  a  c  a  b  a  d  a  b
-1  0  0  1  0  1  2  3  0  1
Borders at position i+1 extend borders at position i.
Building A Table
border(0, _, -1) :- !.    % base case, not shown on the slide: the empty prefix has no border
border(I, Pattern, Q) :-
    J is I-1,
    border(J, Pattern, P),
    nth0(J, Pattern, C),
    extend(C, P, Pattern, Q).

extend(_, -1, _, 0).
extend(C, P, Pattern, Q) :-
    nth0(P, Pattern, C), !,
    Q is P+1.
extend(C, P0, Pattern, R) :-
    border(P0, Pattern, Q),
    extend(C, Q, Pattern, R).

make_table(Pattern) :-
    retractall(border_table(_, _)),
    assert(border_table(0, -1)),    % -1 as in border(0, _, -1); the slide shows 0 here
    assert(border_table(1, 0)),
    length(Pattern, PL),
    make_table(Pattern, 2, PL).

make_table(_, I, N) :- I>N, !.
make_table(Pattern, I, N) :-
    border(I, Pattern, K),
    assert(border_table(I, K)),
    J is I+1,
    make_table(Pattern, J, N).
Building A Table
border(I, Pattern, Q) :-
    J is I-1,
    border_table(J, P),
    nth0(J, Pattern, C),
    extend(C, P, Pattern, Q).

extend(_, -1, _, 0).
extend(C, P, Pattern, Q) :-
    nth0(P, Pattern, C), !,
    Q is P+1.
extend(C, P0, Pattern, R) :-
    border_table(P0, Q),
    extend(C, Q, Pattern, R).

make_table(Pattern) :-
    retractall(border_table(_, _)),
    % The slide shows border_table(0, 0), but extend/4 then looks up
    % position 0 forever; -1 lets its first clause end the chain.
    assert(border_table(0, -1)),
    assert(border_table(1, 0)),
    length(Pattern, PL),
    make_table(Pattern, 2, PL).

make_table(_, I, N) :- I>N, !.
make_table(Pattern, I, N) :-
    border(I, Pattern, K),
    assert(border_table(I, K)),
    J is I+1,
    make_table(Pattern, J, N).
Searching
search(Pattern, Text, N) :-
    make_table(Pattern),                  % Build the table
    retract(border_table(0, _)),
    assert(border_table(0, 0)),
    length(Pattern, PL),
    search(Pattern, PL, Text, N).         % Do the search

search(Pattern, PL, Text, N) :-
    common_prefix(Pattern, Text, CPL),
    search(CPL, Pattern, PL, Text, N).

search(CPL, _, CPL, _, 0).
search(CPL, Pattern, PL, Text0, N) :-
    border_table(CPL, BL),
    M is CPL-BL,
    advance(Text0, M, Text),
    search(Pattern, PL, Text, N0),
    N is N0+M.
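The slides do not show common_prefix/3 or advance/3, so the search cannot be run as it stands. The following is a self-contained sketch with both helpers filled in as assumptions: common_prefix/3 returns the length of the longest common prefix, and advance/3 drops exactly M elements and fails if the text is too short. Under those assumptions a shift of 0 would never terminate, so this sketch keeps border_table(0, -1) and leaves out the step that resets that entry to 0; the original helpers may well handle that case differently. matches/3 is a wrapper name introduced here. Positions are counted from 0.

```prolog
:- use_module(library(lists)).
:- dynamic border_table/2.

% Table of longest-border lengths, indexed by prefix length 0 .. PL.
make_table(Pattern) :-
    retractall(border_table(_, _)),
    assertz(border_table(0, -1)),
    assertz(border_table(1, 0)),
    length(Pattern, PL),
    make_table(Pattern, 2, PL).

make_table(_, I, N) :- I > N, !.
make_table(Pattern, I, N) :-
    border(I, Pattern, K),
    assertz(border_table(I, K)),
    J is I + 1,
    make_table(Pattern, J, N).

border(I, Pattern, Q) :-
    J is I - 1,
    border_table(J, P),
    nth0(J, Pattern, C),
    extend(C, P, Pattern, Q).

extend(_, -1, _, 0) :- !.          % cut added so nth0/3 never sees -1
extend(C, P, Pattern, Q) :- nth0(P, Pattern, C), !, Q is P + 1.
extend(C, P0, Pattern, R) :-
    border_table(P0, Q),
    extend(C, Q, Pattern, R).

% Assumed helper: length of the longest common prefix of two lists.
common_prefix([X | Xs], [Y | Ys], N) :-
    X == Y, !,
    common_prefix(Xs, Ys, N0),
    N is N0 + 1.
common_prefix(_, _, 0).

% Assumed helper: drop the first M elements; fails if there are fewer.
advance(List, M, Rest) :-
    length(Skipped, M),
    append(Skipped, Rest, List).

search(Pattern, Text, N) :-
    make_table(Pattern),
    length(Pattern, PL),
    search(Pattern, PL, Text, N).

search(Pattern, PL, Text, N) :-
    common_prefix(Pattern, Text, CPL),
    search(CPL, Pattern, PL, Text, N).

search(CPL, _, CPL, _, 0).          % whole pattern matched here
search(CPL, Pattern, PL, Text0, N) :-
    border_table(CPL, BL),
    M is CPL - BL,                  % always at least 1
    advance(Text0, M, Text),
    search(Pattern, PL, Text, N0),
    N is N0 + M.

% matches(+PatternAtom, +TextAtom, -Ns): hypothetical wrapper over atoms.
matches(PatternAtom, TextAtom, Ns) :-
    atom_chars(PatternAtom, Pattern),
    atom_chars(TextAtom, Text),
    findall(N, search(Pattern, Text, N), Ns).

% ?- matches(abacabadabacaba, ababacabadabacabadabacababa, Ns).
% Ns = [2, 10].
```

As on the slide, the common prefix is recomputed from the start of the pattern after every shift, so the characters already known to match are compared again.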
Reference
Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, June 1977.