Combinatorics of Finite Words and Suffix Automata

Gabriele Fici

Dipartimento di Informatica e Applicazioni, Università di Salerno (Italy)

CAI 2009 - Thessaloniki

20 May 2009

Gabriele Fici Combinatorics of Finite Words and Suffix Automata

Combinatorics of Finite Words

A is a finite set of letters (the alphabet).

A finite word w is an element of A∗.

Its length |w | is the number of its letters.

The empty word ε has length 0.

Let w = a_1 a_2 . . . a_n be a word.

a_1 . . . a_i, with 1 ≤ i ≤ n, and ε are the prefixes of w.

a_j . . . a_n, with 1 ≤ j ≤ n, and ε are the suffixes of w.

a_j . . . a_i, with 1 ≤ j ≤ i ≤ n, and ε are the factors of w.

Combinatorics of Finite Words

Example

A = {a,n,b, c}, w = banana

|banana| = 6

ba is a prefix of banana

nana is a suffix of banana

a, ba, ε, banana are factors of banana
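The definitions of prefix, suffix and factor translate almost literally into code. The following Python sketch enumerates the three sets for w = banana; the helper names prefixes, suffixes and factors are illustrative and do not come from the talk.

```python
def prefixes(w):
    """All prefixes of w, including the empty word and w itself."""
    return {w[:i] for i in range(len(w) + 1)}


def suffixes(w):
    """All suffixes of w, including the empty word and w itself."""
    return {w[j:] for j in range(len(w) + 1)}


def factors(w):
    """All factors (contiguous blocks of letters) of w, including the empty word."""
    n = len(w)
    return {w[j:i] for j in range(n + 1) for i in range(j, n + 1)}


w = "banana"
print(len(w))                             # the length |w|
print("ba" in prefixes(w))                # is ba a prefix?
print("nana" in suffixes(w))              # is nana a suffix?
print({"a", "ba", "", w} <= factors(w))   # are a, ba, ε, banana all factors?
```

Representing ε by the empty string keeps the three functions symmetric: every prefix and every suffix is also returned by factors.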

Combinatorics of Finite Words

Some famous classes of finite words:

palindromes: w^R = w. Ex. level.

balanced words over two letters (say a and b): all the factors of the same length have the same number of a's (and of b's) up to 1. Ex. abaababaabaab.

differentiable words: words over {1,2} such that their Run Length Encoding is still a word over {1,2}. Ex. 2211212212211

finite prefixes of (right) infinite words: Thue-Morse, Fibonacci, Kolakoski,...

many many others.

Intersections: 12112121121 is a balanced differentiable palindromic prefix of the Fibonacci word over {1,2}...
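The claim about 12112121121 can be checked mechanically. The sketch below is a brute-force test of each property exactly as stated on the slide; all function names are illustrative, and the Fibonacci word is assumed to be generated by the usual recurrence s_{k+1} = s_k s_{k-1} starting from the letters 2 and 1.

```python
from itertools import groupby


def is_palindrome(w):
    return w == w[::-1]


def is_balanced(w):
    """For each length, letter counts of any two factors differ by at most 1."""
    for n in range(1, len(w) + 1):
        for c in set(w):
            counts = [w[i:i + n].count(c) for i in range(len(w) - n + 1)]
            if max(counts) - min(counts) > 1:
                return False
    return True


def run_lengths(w):
    """Run Length Encoding of w as a list of block lengths."""
    return [len(list(block)) for _, block in groupby(w)]


def is_differentiable(w):
    """The run-length encoding of w is again a word over {1, 2}."""
    return all(r in (1, 2) for r in run_lengths(w))


def fibonacci_prefix(n, a="1", b="2"):
    """Prefix of length n of the Fibonacci word over {a, b}."""
    prev, cur = b, a
    while len(cur) < n:
        prev, cur = cur, cur + prev
    return cur[:n]


w = "12112121121"
print(is_palindrome(w), is_balanced(w), is_differentiable(w),
      fibonacci_prefix(len(w)) == w)
```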

What’s the target?

Target

Classify the words through their combinatorial properties.

The suffix automaton

Definition (Blumer et al. 1985 - Crochemore 1986)

The suffix automaton of the word w is the minimal deterministic automaton recognizing the suffixes of w.

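Example

The suffix automaton of aabbabb:

[Figure: suffix automaton of aabbabb. States 0, 1, ..., 7 lie on the main path labelled a a b b a b b; the additional states are 3′, 3′′ and 4′′, connected by six further edges labelled b, b, a, b, b, a.]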

Algorithmically

Theorem (Blumer et al. 1985 - Crochemore 1986)

The suffix automaton of a word w over a fixed alphabet A can be built in time and space O(|w|).
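To illustrate the linear bound, here is a minimal sketch of the usual online construction with suffix links, the form in which this construction is commonly presented in textbooks. It is not taken from the talk; the names build_suffix_automaton and accepts are illustrative. Copying the transitions of a cloned state costs O(|A|), which is constant for a fixed alphabet.

```python
def build_suffix_automaton(w):
    """Return (transitions, finals); state 0 is the initial state."""
    nxt = [{}]      # nxt[s][c] = target of the edge labelled c from s
    link = [-1]     # suffix links
    length = [0]    # length of the longest word reaching each state
    last = 0
    for c in w:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps only the shorter words of q.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # Final states are those on the suffix-link path from the last state.
    finals = set()
    p = last
    while p != -1:
        finals.add(p)
        p = link[p]
    return nxt, finals


def accepts(nxt, finals, v):
    """True if v is recognized, i.e. v is a suffix of the input word."""
    s = 0
    for c in v:
        if c not in nxt[s]:
            return False
        s = nxt[s][c]
    return s in finals


nxt, finals = build_suffix_automaton("aabbabb")
print(len(nxt), sum(len(t) for t in nxt))   # number of states, number of edges
```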

One way to build the SA

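Build a non-deterministic automaton:

w = aabbabb

[Figure: non-deterministic automaton with states 0, 1, ..., 7 on a single path labelled a a b b a b b.]

Determinize by subset construction:

[Figure: the resulting deterministic automaton. The states {0, 1, 2, . . . , 7}, {1, 2, 5}, {2}, {3}, {4}, {5}, {6}, {7} form the main path labelled a a b b a b b; the remaining states are {3, 6}, {4, 7} and {3, 4, 6, 7}, connected by six further edges labelled b, b, a, b, b, a.]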
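The subset construction can be sketched in a few lines. The slides do not spell out how the non-deterministic automaton accepts suffixes; the sketch assumes that every state of the path is initial and that only state |w| is final, which is consistent with the start state {0, 1, ..., 7} of the determinized automaton. The function name determinize_suffix_nfa is illustrative.

```python
def determinize_suffix_nfa(w):
    """Subset construction on the path automaton of w.

    Assumption: all states 0..|w| are initial and |w| is the only final state.
    Returns (states, edges, finals), where states are frozensets of positions.
    """
    n = len(w)
    start = frozenset(range(n + 1))
    states = {start}
    edges = {}
    todo = [start]
    while todo:
        subset = todo.pop()
        for c in sorted(set(w)):
            # Follow the edge i --w[i]--> i+1 from every position in the subset.
            target = frozenset(i + 1 for i in subset if i < n and w[i] == c)
            if target:
                edges[(subset, c)] = target
                if target not in states:
                    states.add(target)
                    todo.append(target)
    finals = {s for s in states if n in s}
    return states, edges, finals


states, edges, finals = determinize_suffix_nfa("aabbabb")
for s in sorted(states, key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```

Only non-empty subsets are kept, so the result has no sink state.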

Ending Positions

We associate to each factor v of w the set of ending positions of v in w.

Example

w = a a b b a b b
    1 2 3 4 5 6 7

Endset(b) = {3,4,6,7}, Endset(abb) = Endset(bb) = {4,7}.

We define on Fact(w) the equivalence:

u ∼ v ⇔ Endset(u) = Endset(v)

Then Fact(w)/ ∼ is the set of states of the SA of w .
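The equivalence classes can be computed by brute force, which gives a direct (if slow) way to list the states. In this sketch the names endset and endset_classes are illustrative, and the convention Endset(ε) = {0, 1, ..., |w|} is an assumption chosen to match the state {0, 1, ..., 7} of the determinized automaton.

```python
from collections import defaultdict


def endset(v, w):
    """Ending positions (1-based) of the occurrences of v in w."""
    k = len(v)
    return frozenset(i + k for i in range(len(w) - k + 1) if w[i:i + k] == v)


def endset_classes(w):
    """Group the factors of w by their set of ending positions."""
    classes = defaultdict(set)
    n = len(w)
    for j in range(n + 1):
        for i in range(j, n + 1):
            v = w[j:i]
            classes[endset(v, w)].add(v)
    return classes


w = "aabbabb"
for positions, words in sorted(endset_classes(w).items(),
                               key=lambda kv: sorted(kv[0])):
    print(sorted(positions), sorted(words, key=len))
```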

The number of states

The number of states (classes) of the SA is denoted by |Q_w|.

The bounds on |Q_w| are well known:

|w| + 1 ≤ |Q_w| ≤ 2|w| − 1

The upper bound is reached for w = ab^{|w|−1}, with a ≠ b.

And for the lower bound?

Problem

Characterize the class of words for which |Q_w| = |w| + 1.
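Both bounds can be tested exhaustively on short words. The sketch below counts states as distinct sets of ending positions (a brute-force stand-in for building the automaton) and is only meant as a sanity check; num_states and check_bounds are illustrative names.

```python
from itertools import product


def num_states(w):
    """Number of distinct sets of ending positions over all factors of w."""
    n = len(w)
    endsets = set()
    for j in range(n + 1):
        for i in range(j, n + 1):
            v, k = w[j:i], i - j
            endsets.add(frozenset(p + k for p in range(n - k + 1)
                                  if w[p:p + k] == v))
    return len(endsets)


def check_bounds(max_len, alphabet="ab"):
    """Check |w| + 1 <= |Q_w| <= 2|w| - 1 for all words of length 2..max_len.

    Length 1 is skipped: a one-letter word already has 2 states.
    """
    for n in range(2, max_len + 1):
        for letters in product(alphabet, repeat=n):
            q = num_states("".join(letters))
            if not (n + 1 <= q <= 2 * n - 1):
                return False
    return True


print(check_bounds(6))
print([num_states("a" + "b" * (n - 1)) for n in range(2, 8)])
```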

Special Factors

Definition

v is a left special factor of w if there exist a ≠ b such that av and bv are factors of w.

v is a right special factor of w if there exist a ≠ b such that va and vb are factors of w.

v is a bispecial factor of w if it is both left and right special.

Example (w = aabbabb)

LS = {ε,a,b,ab,abb}, RS = {ε,a,b}, BIS = {ε,a,b}
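The three sets of the example can be recomputed directly from the definition. This is a brute-force sketch over all factors, with illustrative function names; ε is represented by the empty string.

```python
def factors(w):
    n = len(w)
    return {w[j:i] for j in range(n + 1) for i in range(j, n + 1)}


def left_special(w):
    """Factors v such that av and bv are factors for two distinct letters a, b."""
    fact, letters = factors(w), set(w)
    return {v for v in fact if sum((c + v) in fact for c in letters) >= 2}


def right_special(w):
    """Factors v such that va and vb are factors for two distinct letters a, b."""
    fact, letters = factors(w), set(w)
    return {v for v in fact if sum((v + c) in fact for c in letters) >= 2}


def bispecial(w):
    return left_special(w) & right_special(w)


w = "aabbabb"
print(sorted(left_special(w), key=len))
print(sorted(right_special(w), key=len))
print(sorted(bispecial(w), key=len))
```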

The number of states

Theorem (Sciortino, Zamboni 2007)

If |A| = 2 then the following conditions are equivalent for a word w over A:

|Q_w| = |w| + 1

Every left special factor of w is a prefix of w

w is a prefix of a standard Sturmian word.

Without restriction on the cardinality of A we have the formula:

Lemma

|Q_w| = |w| + 1 + |D(w)|

where D(w) is the set of left special factors which are not prefixes.
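The lemma is easy to test numerically. The sketch below computes D(w) from the definition and compares |w| + 1 + |D(w)| with a brute-force state count (distinct sets of ending positions); the helper names are illustrative. For w = aabbabb it gives D(w) = {b, ab, abb}, hence 7 + 1 + 3 = 11 states.

```python
def factors(w):
    n = len(w)
    return {w[j:i] for j in range(n + 1) for i in range(j, n + 1)}


def num_states(w):
    """Number of distinct sets of ending positions over all factors of w."""
    n = len(w)
    return len({frozenset(p + len(v) for p in range(n - len(v) + 1)
                          if w[p:p + len(v)] == v)
                for v in factors(w)})


def D(w):
    """Left special factors of w that are not prefixes of w."""
    fact, letters = factors(w), set(w)
    left_special = {v for v in fact
                    if sum((c + v) in fact for c in letters) >= 2}
    prefixes = {w[:i] for i in range(len(w) + 1)}
    return left_special - prefixes


def lemma_holds(w):
    return num_states(w) == len(w) + 1 + len(D(w))


for w in ["aabbabb", "abaababaabaab", "abcabc", "abracadabra"]:
    print(w, num_states(w), sorted(D(w), key=len), lemma_holds(w))
```

The second word, a prefix of the Fibonacci word, has an empty D(w), in line with the theorem above.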

Property LSP

Definition

A word has property LSP if every left special factor is a prefix.

Corollary

|Q_w| = |w| + 1 ⇐⇒ w has property LSP

Problem

Characterize the class of words having the property LSP, over an arbitrary fixed alphabet A.

The binary case

For binary words we have the formula:

|Q_w| = 2|w| − H_w − P_w

H_w is the minimal length of a prefix of w occurring only once,
P_w is the maximal length of a left special prefix of w.

As a corollary we obtain a new characterization of standard Sturmian words:

Corollary

w is a prefix of a standard Sturmian word ⇔ |w| = H_w + P_w + 1.
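The two parameters are straightforward to compute, so the formula can be checked against a brute-force state count. In this sketch the function names H, P and num_states are illustrative, and P assumes that w contains both letters, so that ε is left special and the maximum is well defined.

```python
def factors(w):
    n = len(w)
    return {w[j:i] for j in range(n + 1) for i in range(j, n + 1)}


def num_states(w):
    """Number of distinct sets of ending positions over all factors of w."""
    n = len(w)
    return len({frozenset(p + len(v) for p in range(n - len(v) + 1)
                          if w[p:p + len(v)] == v)
                for v in factors(w)})


def H(w):
    """Minimal length of a prefix of w that occurs only once in w."""
    for k in range(len(w) + 1):
        occurrences = sum(1 for i in range(len(w) - k + 1)
                          if w[i:i + k] == w[:k])
        if occurrences == 1:
            return k


def P(w):
    """Maximal length of a left special prefix (w must use both letters)."""
    fact, letters = factors(w), sorted(set(w))
    return max(k for k in range(len(w) + 1)
               if all((c + w[:k]) in fact for c in letters))


for w in ["aabbabb", "abaababaabaab"]:
    print(w, H(w), P(w), 2 * len(w) - H(w) - P(w), num_states(w))
```

For the Fibonacci prefix abaababaabaab the printed values satisfy |w| = H_w + P_w + 1, as the corollary predicts.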

Example

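Example (w = aabbabb)

[Figure: suffix automaton of aabbabb, with states 0, 1, ..., 7 on the main path labelled a a b b a b b and additional states 3′, 3′′, 4′′.]

H_w = 2 since aa occurs only once (while a occurs three times).

P_w = 1 since a is left special (while aa is not).

|Q_w| = 2 · 7 − 2 − 1 = 11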

The number of edges

What about the number of edges E_w?

The bounds on E_w are well known:

|w| ≤ E_w ≤ 3|w| − 4

For binary words we give the formula:

Lemma

E_w = |Q_w| + |G(w)| − 1

G(w) is the union of the sets of bispecial factors and right special prefixes of w.

Example

Example (w = aabbabb)

[Figure: suffix automaton of aabbabb, with states 0, 1, ..., 7 on the main path labelled a a b b a b b and additional states 3′, 3′′, 4′′.]

G(w) = BIS(w) ∪ (Pref(w) ∩ RS(w)) = {ε,a,b} ∪ {ε,a} = {ε,a,b}

|G(w)| = 3 ⇒ E_w = 11 + 3 − 1 = 13.
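The edge count of the example can be cross-checked by building the automaton and computing G(w) independently. The sketch below obtains states and edges by a subset construction on the path automaton (assuming every position is initial) and G(w) from the definition; the names automaton_size and G are illustrative.

```python
def automaton_size(w):
    """Return (number of states, number of edges) of the suffix automaton of w."""
    n = len(w)
    start = frozenset(range(n + 1))   # assumption: every position is initial
    seen, todo, edges = {start}, [start], 0
    while todo:
        subset = todo.pop()
        for c in set(w):
            target = frozenset(i + 1 for i in subset if i < n and w[i] == c)
            if target:
                edges += 1
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    return len(seen), edges


def G(w):
    """Bispecial factors together with right special prefixes of w."""
    n = len(w)
    fact = {w[j:i] for j in range(n + 1) for i in range(j, n + 1)}
    letters = set(w)
    ls = {v for v in fact if sum((c + v) in fact for c in letters) >= 2}
    rs = {v for v in fact if sum((v + c) in fact for c in letters) >= 2}
    prefixes = {w[:i] for i in range(n + 1)}
    return (ls & rs) | (prefixes & rs)


w = "aabbabb"
q, e = automaton_size(w)
print(q, e, sorted(G(w), key=len), e == q + len(G(w)) - 1)
```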

Further Research

Problem

Can this approach be applied to other data structures (factor oracle, suffix tree/trie, suffix array, etc.)?

Problem

Characterize the words having property LSP (i.e. every left special factor is a prefix).

Problem

Compute the average size of the SA for particular classes of words.
