Text redundancies (2)
Maxime Crochemore
King’s College London & Université Paris-Est
M.C. CANT 2012 1/27
Algorithms and combinatorics on words
⋆ Links between combinatorial properties of words and algorithms on
strings
⋆ Examples:
– Text searching
– Text indexing and suffix arrays
– Text compression and permutations
– Locating repeats in strings
⋆ Combinatorial aspects in Applied Combinatorics on Words
[Lothaire, 2005], [Lothaire, 2002]
http://igm.univ-mlv.fr/~berstel/Lothaire/index.html
⋆ Other examples in Algorithms on Strings
[C., Hancart, Lecroq, 2007], [C., Rytter, 1994]
http://www.dcs.kcl.ac.uk/staff/mac/
Periods and borders of words
⋆ Non-empty string u, integer p, 0 < p ≤ |u|
⋆ p is a period of u if any of these conditions is satisfied:
– u[i] = u[i + p], for 1 ≤ i ≤ |u| − p
– u is a prefix/factor of some y^k, k > 0, |y| = p
– u = yw = wz, for some strings y, z, w with |y| = |z| = p
[Figure: the word borderlineborder aligned with a copy of itself shifted by p positions; the overlapping parts match.]
⋆ period(u) = smallest period of u (can be |u|)
border(u) = longest proper border of u (can be empty)
⋆ Periods and borders of abaabaa
3 abaa
6 a
7 empty string
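The definitions above can be sketched in Python as follows. This is a minimal illustration, assuming 0-based strings; the names border_table, borders and periods are chosen here and do not come from the slides.

```python
def border_table(u):
    """table[i] = length of the longest proper border of u[:i] (table[0] = -1)."""
    table = [0] * (len(u) + 1)
    table[0] = -1
    k = -1
    for i, c in enumerate(u):
        # fall back along shorter borders until one can be extended by c
        while k >= 0 and u[k] != c:
            k = table[k]
        k += 1
        table[i + 1] = k
    return table


def borders(u):
    """All proper borders of u, longest first (the last one is the empty string)."""
    table = border_table(u)
    result, k = [], table[len(u)]
    while k >= 0:
        result.append(u[:k])
        k = table[k]
    return result


def periods(u):
    """All periods of u in increasing order: each is |u| minus a border length."""
    return [len(u) - len(b) for b in borders(u)]
```

For u = abaabaa this gives the periods 3, 6, 7 and the borders abaa, a and the empty string, as in the table above.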
Periodicity Lemma
Lemma 1 (Periodicity Lemma [Fine, Wilf, 1965])
If p and q are periods of a word x and satisfy
p + q − GCD(p, q) ≤ |x| then GCD(p, q) is a period of x.
a b a a b a b a a b a b a a b · · ·
a b a a b a b a a b a
a b a a b a b a a b a a b a b · · ·
Lemma 2 (Weak Periodicity Lemma)
If p and q are periods of a word x and satisfy p + q ≤ |x| then
GCD(p, q) is a period of x.
Used in the analysis of KMP algorithm and of many other
pattern matching algorithms.
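A small brute-force check of the Periodicity Lemma, as a sketch only. The word abaababaaba (length 11) is used here as an assumed illustration that the bound is tight: it has periods 5 and 8, but 5 + 8 − GCD(5, 8) = 12 > 11, and GCD(5, 8) = 1 is not a period. The helper names are hypothetical.

```python
from math import gcd


def is_period(x, p):
    """True if p is a period of x, with 0 < p <= |x|."""
    return 0 < p <= len(x) and all(x[i] == x[i + p] for i in range(len(x) - p))


def lemma_holds(x):
    """Check the Periodicity Lemma on every pair of periods of x."""
    ps = [p for p in range(1, len(x) + 1) if is_period(x, p)]
    return all(is_period(x, gcd(p, q))
               for p in ps for q in ps
               if p + q - gcd(p, q) <= len(x))


x = "abaababaaba"
# the hypothesis fails by one letter, and so does the conclusion
tight = is_period(x, 5) and is_period(x, 8) and not is_period(x, gcd(5, 8))
```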
Proof of the weaker statement
⋆ p and q periods of x with p + q ≤ |x| and p > q
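⋆ then p − q is a period of x: take i with 1 ≤ i ≤ |x| − (p − q)
– if i − q ≥ 1, then x[i] = x[i − q] = x[i − q + p]
– otherwise i ≤ q, so i + p ≤ |x| and x[i] = x[i + p] = x[i + p − q]
⋆ the rest follows by induction, as in Euclid’s algorithm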
On-line String Matching (3)
[Figure: after matching u, the text letter c faces the pattern letter a; the pattern is then moved by a shift compatible with u.]
⋆ compatible with match
shift = period(u)
[Morris, Pratt, 1969]
⋆ idem, and not incompatible with c
[Knuth, M., P., 1977]
⋆ best shift = period(uc)
[Figure: the same example drawn three times, once per strategy: text y = · · a b a a b a c · · ·, pattern x = a b a a b a a, and the shifted pattern below.]
[Simon, 1989], [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1993]
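The three shift rules can be compared on the example x = abaabaa, where u = abaaba is matched and the text letter c meets the pattern letter a. The sketch below is a quadratic illustration, not the linear-time preprocessing of the cited algorithms. It assumes at least one letter has matched, and it reads the KMP rule in the usual way: the pattern letter brought in front of c must differ from the letter that just mismatched. The function names are hypothetical.

```python
def border_lengths(u):
    """Lengths of all proper borders of u, longest first (0 is always last)."""
    return [k for k in range(len(u) - 1, -1, -1) if u[:k] == u[len(u) - k:]]


def shifts(x, matched, c):
    """Shifts after matching u = x[:matched] and reading text letter c != x[matched].

    Assumes 1 <= matched < len(x). Returns (MP shift, KMP shift, best shift).
    """
    bs = border_lengths(x[:matched])
    mp = matched - bs[0]                                    # period(u)
    # KMP: skip borders followed by the letter that already failed
    kmp = next((matched - b for b in bs if x[b] != x[matched]), matched + 1)
    # best: keep only borders followed by c, which gives period(uc)
    best = next((matched - b for b in bs if x[b] == c), matched + 1)
    return mp, kmp, best
```

On the example, shifts("abaabaa", 6, "c") returns (3, 5, 7).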
Delay
⋆ Delay: maximal number of comparisons on a text letter
⋆ MP algorithm: delay ≤ |x|
⋆ KMP algorithm: delay ≤ log_Φ(|x| + 1), with Φ the golden ratio
proof by the Periodicity Lemma
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a b
a b a a b a a
a b a a b a a
a b a a b a a
⋆ Simon-Hancart algorithm: delay ≤ min(1 + log_2 |x|, card A)
use of string-matching automaton
Searching with an automaton
⋆ Uses the string-matching automaton SMA(x):
smallest deterministic automaton accepting A∗x
⋆ Example x = abaa, A = {a, b}
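[Figure: SMA(abaa), states 0 to 4, terminal state 4]
forward arcs: 0 -a-> 1 -b-> 2 -a-> 3 -a-> 4
other arcs: 0 -b-> 0, 1 -a-> 1, 2 -b-> 0, 3 -b-> 2, 4 -a-> 1, 4 -b-> 2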
⋆ Search for abaa in:
b a b b a a b a a b a a b b a · · ·
state 0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 · · ·
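The search above can be reproduced with a small sketch that builds the automaton directly from its definition: the state reached after reading a text is the length of the longest prefix of x that is a suffix of that text. This brute-force construction is for illustration only, and the names sma and run are assumptions.

```python
def sma(x, alphabet):
    """delta[q][c] = length of the longest suffix of x[:q] + c that is a prefix of x."""
    delta = []
    for q in range(len(x) + 1):
        row = {}
        for c in alphabet:
            s = x[:q] + c
            k = min(len(s), len(x))
            while not s.endswith(x[:k]):   # always stops: k = 0 gives the empty prefix
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def run(delta, text, state=0):
    """States visited while reading text, including the starting state."""
    states = [state]
    for c in text:
        state = delta[state][c]
        states.append(state)
    return states
```

An occurrence of x ends wherever the state equals |x|, here state 4.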
Construction of SMA(x)
⋆ Unwinding arcs
⋆ From SMA(abaa) . . .
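[Figure: SMA(abaa), states 0 to 4]
forward arcs: 0 -a-> 1 -b-> 2 -a-> 3 -a-> 4
other arcs: 0 -b-> 0, 1 -a-> 1, 2 -b-> 0, 3 -b-> 2, 4 -a-> 1, 4 -b-> 2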
⋆ . . . to SMA(abaab)
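[Figure: SMA(abaab), states 0 to 5; the arc 4 -b-> 2 is unwound into the forward arc 4 -b-> 5]
forward arcs: 0 -a-> 1 -b-> 2 -a-> 3 -a-> 4 -b-> 5
other arcs: 0 -b-> 0, 1 -a-> 1, 2 -b-> 0, 3 -b-> 2, 4 -a-> 1, 5 -a-> 3, 5 -b-> 0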
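The unwinding step can be sketched as an on-line construction: to append a letter c, the arc labelled c that leaves the last state is redirected to a new state, and the new state receives a copy of the arcs of the old target. The sketch stores the complete automaton, so it takes O(|x| · card A) time and space; the function name is an assumption.

```python
def build_sma(x, alphabet):
    """String-matching automaton of x, built letter by letter by unwinding arcs."""
    delta = [{c: 0 for c in alphabet}]     # automaton of the empty word
    last = 0
    for c in x:
        target = delta[last][c]            # where the arc on c led before unwinding
        new = len(delta)
        delta[last][c] = new               # the arc becomes a forward arc
        delta.append(dict(delta[target]))  # the new state behaves like the old target
        last = new
    return delta
```

Going from abaa to abaab only touches state 4 and adds state 5, whose arcs are copied from state 2.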
Significant arcs
⋆ Complete SMA(ananas)
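[Figure: complete SMA(ananas) on the alphabet {a, n, s}, states 0 to 6]
forward arcs: 0 -a-> 1 -n-> 2 -a-> 3 -n-> 4 -a-> 5 -s-> 6
arcs to state 0: on n, s from states 0, 2, 4, 6; on s from states 1, 3
other arcs: 1 -a-> 1, 3 -a-> 1, 5 -a-> 1, 6 -a-> 1, 5 -n-> 4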
⋆ Forward arcs: spell the pattern
⋆ Backward arcs: arcs going backwards without
reaching the initial state
[Figure: SMA(ananas) reduced to its significant arcs]
forward arcs: 0 -a-> 1 -n-> 2 -a-> 3 -n-> 4 -a-> 5 -s-> 6
backward arcs: 1 -a-> 1, 3 -a-> 1, 5 -a-> 1, 6 -a-> 1, 5 -n-> 4
Complexity
⋆ Time and space optimisation: implementation of
significant arcs only
– Forward arcs: spell the pattern
– Backward arcs: arcs going backwards not to
initial state
Lemma 3 SMA(x) has at most |x| backward arcs.
⋆ implementation of SMA(x) in O(|x|) space
⋆ construction in O(|x|) time, independent of alphabet
size
⋆ Optimal searching strategy:
delay ≤ min(1 + log_2 |x|, card A)
[Hancart, 1993]
[Breslauer, Colussi, Toniolo, 1993]
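Lemma 3 can be observed on small examples with the sketch below. It builds the complete automaton first, so it does not achieve the O(|x|) space stated above; it only counts the arcs that a compact implementation would keep. A backward arc is taken here to be any non-forward arc whose target is not the initial state, which includes the loop on state 1. The function names are assumptions.

```python
def build_sma(x, alphabet):
    """Complete string-matching automaton of x as a list of transition dicts."""
    delta = [{c: 0 for c in alphabet}]
    last = 0
    for c in x:
        target = delta[last][c]
        delta[last][c] = len(delta)
        delta.append(dict(delta[target]))
        last = len(delta) - 1
    return delta


def backward_arcs(delta):
    """Arcs (source, letter, target) that go backwards without reaching state 0."""
    return [(q, c, t)
            for q, row in enumerate(delta)
            for c, t in sorted(row.items())
            if 0 < t <= q]
```

For ananas the five backward arcs are found, one fewer than the bound |x| = 6.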
Local periods
⋆ Overlap
w overlap of (u, v) if w ≠ ε and:
A∗u ∩ A∗w ≠ ∅ and vA∗ ∩ wA∗ ≠ ∅
|w| is a local period of uv at position |u|
⋆ Local Period
localperiod(u, v) = smallest local period of (u, v)
[Figure: local repetitions in a b a b b a at four positions: b b, a a, b a b a, and b b a b a b b a b a.]
Maximal Local Period
⋆ Word of period 5: a b a b b a
local periods at positions 0 to 6 (from before the first letter to after the last): 1 2 2 5 1 3 1
⋆ Note: localperiod(u, v) ≤ period(uv)
⋆ (u, v) is a critical factorization of uv if
localperiod(u, v) = period(uv)
localperiod(u, v) is maximal among all local periods
⋆ Computation of all local periods in linear time
[Duval, Kolpakov, Kucherov, Lecroq, Lefebvre, 2003]
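The local periods listed above can be recomputed with a brute-force sketch. It is quadratic or worse, unlike the linear-time computation cited, and serves only to make the definition concrete: k is a local period at a position when the letters at distance k agree wherever both fall inside the word and straddle that position. The function names are assumptions.

```python
def local_period(x, pos):
    """Smallest local period of the non-empty word x at position pos (0 <= pos <= |x|)."""
    for k in range(1, len(x) + 1):
        # compare x[j] and x[j + k] for the j in the window of length k left of pos
        if all(x[j] == x[j + k]
               for j in range(max(0, pos - k), min(pos, len(x) - k))):
            return k


def period(x):
    """Smallest period of the non-empty word x."""
    return next(p for p in range(1, len(x) + 1)
                if all(x[i] == x[i + p] for i in range(len(x) - p)))


def critical_positions(x):
    """Positions whose local period equals period(x)."""
    return [i for i in range(len(x) + 1) if local_period(x, i) == period(x)]
```

For ababba the only critical position is 3, which gives the factorization aba · bba.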
Critical Factorization
Theorem 1 (Critical Factorization Theorem)
Any non-empty word x can be factorized into u · v with both:
• |u| < period(x) and
• localperiod(u, v) = period(x).
[Cesari, Vincent, Duval, 1983]
Example: a b a b b a = a b a · b b a, with the local repetition b b a b a · b b a b a around the cut (local period 5 = period)
Leads to time-space optimal string-matching algorithm:
two-way algorithm
Two-Way String Matching
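[Figure: the pattern x = u · v aligned with the text y. A mismatch on text letter c after matching a prefix w of v gives a shift of length |wc|; a mismatch while scanning u, after v has matched, gives a shift of length period(x).]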
⋆ Time-space optimality [C., Perrin, 1992]
– Search Time: linear time (≤ 2n comparisons)
with constant extra space
– Preprocessing Time: same bounds, based on the next theorem
⋆ Other solutions: [Galil, Seiferas, 1983]
[C., 1992], [C., Rytter, 1994], [Rytter, 2002]
⋆ Real-time version:
[Breslauer, Grossi, Mignosi, 2011]
Example (4)
⋆ Critical factorization of x = a b a b b a b a into u · v; period = 5
⋆ Searching
text: a a a b b b b b a a b b a b a b b a b a b a . .
[Figure: four successive positions of the window holding the pattern a b a b b a b a, with left-to-right and right-to-left scans.]
occurrence found
next shift length = period = 5
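The search illustrated above can be sketched in Python as follows. This follows the commonly published form of the two-way algorithm, with a linear-time maximal-suffix routine and a "memory" of the prefix already matched after a shift by the period. It is a sketch under those assumptions, not the authors' code; the function names are chosen here, and the pattern is assumed non-empty.

```python
def max_suffix(x, reverse=False):
    """Return (ms, p): the maximal suffix of x is x[ms + 1:], with period p.

    reverse=True uses the reversed alphabet ordering.
    """
    ms, j, k, p = -1, 0, 1, 1
    while j + k < len(x):
        a, b = x[j + k], x[ms + k]
        if a == b:
            if k != p:
                k += 1
            else:
                j += p
                k = 1
        elif (a < b) != reverse:           # a is the smaller letter in the ordering
            j += k
            k = 1
            p = j - ms
        else:                              # a is the greater letter: restart from j
            ms = j
            j = ms + 1
            k = p = 1
    return ms, p


def two_way(x, y):
    """Start positions of all occurrences of the non-empty pattern x in y."""
    m, n = len(x), len(y)
    (i, p), (j, q) = max_suffix(x), max_suffix(x, reverse=True)
    ell, per = (i, p) if i > j else (j, q)   # u = x[:ell + 1], v = x[ell + 1:]
    found, pos = [], 0
    if x[:ell + 1] == x[per:per + ell + 1]:  # per is then the period of x
        memory = -1                          # x[:memory + 1] is known to match
        while pos <= n - m:
            i = max(ell, memory) + 1
            while i < m and x[i] == y[pos + i]:      # scan v left to right
                i += 1
            if i < m:
                pos += i - ell
                memory = -1
            else:
                i = ell
                while i > memory and x[i] == y[pos + i]:   # scan u right to left
                    i -= 1
                if i <= memory:
                    found.append(pos)
                pos += per
                memory = m - per - 1
    else:                                    # the period exceeds max(|u|, |v|)
        per = max(ell + 1, m - ell - 1) + 1
        while pos <= n - m:
            i = ell + 1
            while i < m and x[i] == y[pos + i]:
                i += 1
            if i < m:
                pos += i - ell
            else:
                i = ell
                while i >= 0 and x[i] == y[pos + i]:
                    i -= 1
                if i < 0:
                    found.append(pos)
                pos += per
    return found
```

Only a constant number of integer variables is used besides the pattern and the text, which is the point of the method.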
Maximal suffixes
⋆ Orderings
≤ lexicographic ordering based on ≤ of alphabet
⪯ lexicographic ordering based on ≤⁻¹ (the reverse order) of alphabet
Theorem 2 (C., Perrin, 1992) x ≠ ε
Let x = uv with v = suffix of x that is maximal for ≤
Let x = u′v′ with v′ = suffix of x that is maximal for ⪯
If |v| ≤ |v′| then (u, v) is a critical factorization of x
otherwise (u′, v′) is.
Moreover, |u| < period(x) and |u′| < period(x).
[Figure: examples on the words a b a a b a a and a b a b a a b b a b a b a, each drawn twice.]
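Theorem 2 can be tried out with a brute-force sketch that sorts all suffixes instead of using a linear-time routine. The reversed ordering is simulated by negating letter codes, which keeps a prefix smaller than its extensions. All function names are assumptions made for this illustration.

```python
def critical_factorization(x):
    """(u, v) as in Theorem 2, computed by comparing all suffixes of non-empty x."""
    suffixes = [x[i:] for i in range(len(x))]
    v = max(suffixes)                                          # maximal for the usual order
    v2 = max(suffixes, key=lambda s: [-ord(c) for c in s])     # maximal for the reversed order
    best = v if len(v) <= len(v2) else v2
    return x[:len(x) - len(best)], best


def period(x):
    """Smallest period of the non-empty word x."""
    return next(p for p in range(1, len(x) + 1)
                if all(x[i] == x[i + p] for i in range(len(x) - p)))


def local_period(x, pos):
    """Smallest local period of x at position pos."""
    return next(k for k in range(1, len(x) + 1)
                if all(x[j] == x[j + k]
                       for j in range(max(0, pos - k), min(pos, len(x) - k))))


def is_critical(x, u):
    """True if cutting x after the prefix u is critical and |u| < period(x)."""
    return local_period(x, len(u)) == period(x) and len(u) < period(x)
```

For abaabaa the two maximal suffixes are baabaa and aabaa, so the shorter one is kept and the factorization is ab · aabaa.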
Proof (4)
Four cases: x = uv, w shortest overlap of (u, v)
⋆ Case 1: w suffix of u, v prefix of w
v ≤ w < wv, impossible since wv is a suffix of x and v is maximal
⋆ Case 2: w suffix of u, w prefix of v, with v = wz
z < v implies v = wz < wv
impossible for the same reason
⋆ Case 3: u suffix of w and v prefix of w
(u, v) is a critical factorization
⋆ Case 4: u suffix of w and w prefix of v
z < v; yz ≺ yv implies z ≺ v;
z prefix of v then border of v
|w| period of v and of x.
[Figure for case 4: x, v′, w, y and yz aligned.]
Computing maximal suffixes
⋆ Algorithm adapted from Lyndon factorisation
[Duval, 1983]
⋆ Runs in time linear in the string length,
with constant extra space
Maximal-suffix computation
⋆ v maximal suffix; |w| its period; w′ proper prefix of w
Example: x = a b a · c b c b a · c b c b a · c b c, with u = a b a, w = c b c b a, w′ = c b c
⋆ Match (next letter b): the periodicity continues, w′ becomes c b c b
⋆ Smaller letter (next letter a): new w, border-free
⋆ Greater letter (next letter c): new u and recomputation on the rest
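The three cases above map onto the three branches of the following sketch, written in the commonly published form of the maximal-suffix computation. It keeps only four integers, hence the constant extra space. The function name and the return convention (start index and period) are assumptions.

```python
def maximal_suffix(x):
    """Return (start, p): x[start:] is the maximal suffix of x and p is its period."""
    ms, j, k, p = -1, 0, 1, 1      # candidate is x[ms + 1:], compared with x[j + 1:]
    while j + k < len(x):
        a, b = x[j + k], x[ms + k]
        if a == b:                 # match: the periodicity continues
            if k != p:
                k += 1
            else:
                j += p
                k = 1
        elif a < b:                # smaller letter: the period becomes the whole length
            j += k
            k = 1
            p = j - ms
        else:                      # greater letter: new u, recompute on the rest
            ms = j
            j = ms + 1
            k = p = 1
    return ms + 1, p
```

On the example word abacbcbacbcbacbc it returns (3, 5): the maximal suffix starts after u = aba and has period |w| = 5.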
Perfect factorisation
Theorem 3 Any non-empty word x can be factorised
into u · v with both:
— |u| < 2 period(v) and
— v starts with at most one cube of a primitive word.
[Galil, Seiferas, 1983], [C., Rytter, 1994]
[Mignosi, Restivo, Salemi, 1995] see [Lothaire, 2002]
⋆ Example word with period = 10
a a a b a a b a a b a a a b a a b a a b a a a b a a · · ·
⋆ Leads to a time-space optimal string-matching algorithm
Square prefixes
[Figure: three squares w w, v v and u u aligned at the same starting position.]
Lemma 4 (Three-square Lemma)
If u² is a proper prefix of v², v² a proper prefix of w², and u is
primitive then |u| + |v| ≤ |w|.
[C., Rytter, 1995], [Lothaire, 2001]
Example (length of the root, then the square):
|w| = 10: a a b a a b a a a b · a a b a a b a a a b
|v| = 7: a a b a a b a · a a b a a b a
|u| = 3: a a b · a a b
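The example can be checked mechanically with the sketch below, which tests the hypotheses of the lemma and its conclusion separately. The primitivity test relies on the classical fact that a non-empty word u is primitive exactly when u occurs in uu only as a prefix and as a suffix. The function names are assumptions.

```python
def is_primitive(u):
    """True if the non-empty word u is not a proper power of a shorter word."""
    return len(u) > 0 and (u + u).find(u, 1) == len(u)


def three_squares(u, v, w):
    """Return (hypotheses hold, conclusion |u| + |v| <= |w| holds)."""
    uu, vv, ww = u * 2, v * 2, w * 2
    hypotheses = (is_primitive(u)
                  and len(uu) < len(vv) < len(ww)
                  and vv.startswith(uu)
                  and ww.startswith(vv))
    return hypotheses, len(u) + len(v) <= len(w)
```

With u = aab, v = aabaaba and w = aabaabaaab the hypotheses hold and the bound is reached: 3 + 7 = 10.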
Primitively-rooted squares in a word
Lemma 5 ([Fraenkel, Simpson, 1998])
No more than 2n distinct primitively-rooted (p.-r.) squares in a word of length n.
[Figure: three squares u u, v v and w w starting at the same position of y.]
Can all three be rightmost occurrences in y? Impossible!
Direct proofs [Hickerson, 2004], [Ilie, 2005]
Best bound: 2n − Θ(log n) [Ilie, 2005]
Computation in linear time [Gusfield, Stoye, 1998]
Lemma 6 ([C., 1981], [Gusfield, Stoye, 1999])
Maximal number of occurrences of p.-r. squares: c n log n.
Maximum reached with Fibonacci words.
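The bound of Lemma 5 can be observed on small words with a brute-force enumeration. This sketch is cubic at best and is unrelated to the linear-time computation cited above; the function names are assumptions.

```python
def is_primitive(u):
    """True if the non-empty word u is not a proper power of a shorter word."""
    return len(u) > 0 and (u + u).find(u, 1) == len(u)


def primitively_rooted_squares(y):
    """Set of distinct factors uu of y with u primitive."""
    squares = set()
    for i in range(len(y)):
        for p in range(1, (len(y) - i) // 2 + 1):
            u = y[i:i + p]
            if y[i + p:i + 2 * p] == u and is_primitive(u):
                squares.add(u + u)
    return squares
```

For instance, aabaab contains exactly two such squares, aa and aabaab.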
Runs
Run of x = maximal periodicity in x = non-extensible
occurrence of a power in x
Theorem 4 ([Kolpakov, Kucherov, 1998])
Maximal number of runs in a word of length n: O(n).
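A brute-force enumeration of runs, as a sketch of the definition only. It assumes the usual convention that a run is a factor x[s:e] whose smallest period p satisfies e − s ≥ 2p and that cannot be extended on either side while keeping period p. The function names are assumptions.

```python
def smallest_period(s):
    """Smallest period of the non-empty word s."""
    return next(p for p in range(1, len(s) + 1)
                if all(s[i] == s[i + p] for i in range(len(s) - p)))


def runs(x):
    """All runs of x as (start, end, period) with end exclusive, ordered by period."""
    n, found = len(x), []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if x[k] != x[k + p]:
                k += 1
                continue
            s = k
            while k + p < n and x[k] == x[k + p]:
                k += 1
            # x[s:k + p] has period p and is maximal for that period;
            # keep it if it is long enough and p is its smallest period
            if k - s >= p and smallest_period(x[s:k + p]) == p:
                found.append((s, k + p, p))
    return found
```

For aabaab this finds three runs: aa at positions 0 and 3, and the whole word with period 3.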