Smallest grammar by recompression

Artur Jez

Max Planck Institute for Informatics

17.06.2013

Grammar-based compression

Represent w as a CFG generating it.

Advantages
- it is usually small (at most quadratic vs. LZ)
- compression is fast
- it is exponential on good data
- extracts hierarchical structure
- it is easy to work on
- related to LZW and LZ

Smallest grammar

Problem
Given w, return the smallest CFG $G_w$ such that $L(G_w) = \{w\}$.

At the cost of a constant-factor increase in size, $G_w$ can be assumed to be an SLP.

Definition (SLP: Straight-Line Programme)
A CFG with
- ordered nonterminals $X_1, X_2, \ldots$
- rules in Chomsky normal form
- for $X_i \to X_j X_k$ we have $j, k < i$
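
To make the definition concrete, here is a minimal sketch of an SLP as a Python dictionary. The representation and the names `check_slp`, `lengths` and `expand` are illustrative assumptions, not part of the talk. The example shows how 5 rules derive a word of length 8; each further doubling rule doubles the length.

```python
from typing import Dict, Tuple, Union

# A rule is either a single letter or a pair (j, k) meaning X_i -> X_j X_k.
Rule = Union[str, Tuple[int, int]]


def check_slp(rules: Dict[int, Rule]) -> bool:
    """Check the ordering condition: X_i -> X_j X_k requires j, k < i."""
    return all(
        isinstance(r, str) or (r[0] < i and r[1] < i)
        for i, r in rules.items()
    )


def lengths(rules: Dict[int, Rule]) -> Dict[int, int]:
    """Length of the word derived by each X_i, computed bottom-up."""
    out: Dict[int, int] = {}
    for i in sorted(rules):
        r = rules[i]
        out[i] = 1 if isinstance(r, str) else out[r[0]] + out[r[1]]
    return out


def expand(rules: Dict[int, Rule], i: int) -> str:
    """Word derived by X_i (may be exponentially long in len(rules))."""
    r = rules[i]
    return r if isinstance(r, str) else expand(rules, r[0]) + expand(rules, r[1])


# X_1 -> a, X_2 -> b, X_3 -> X_1 X_2, X_4 -> X_3 X_3, X_5 -> X_4 X_4
example = {1: "a", 2: "b", 3: (1, 2), 4: (3, 3), 5: (4, 4)}
```

Because of the ordering condition, `lengths` can process the rules in increasing index order without ever expanding the word.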

What is known

Best approximation ratio: O(log(n/g)), where g is the size of the optimal grammar.

- Rytter
  – represent w as LZ, size $\ell \le g$
  – translation of LZ into an SLP, size $O(\ell \log(n/\ell)) \le O(g \log(n/g))$
  – the intermediate grammar is balanced (AVL-type condition)
- Charikar et al.
  – similar to Rytter
  – different balance criterion (length of word)
- Sakamoto
  – local replacement rules (plus a global partition): pairs and blocks
  – analysis vs. LZ

Linear time.

This talk

Very simple linear-time algorithm, O(log(n/g)) approximation.

- analysis in the recompression framework, vs. SLP
  – very robust
  – good: easier to show a better approximation?
  – bad: might in fact be larger
- not balanced
  – good: easier to show approximation?
  – bad: worse for further processing
- height O(log n), when $a^\ell$ has height 1

Algorithm similar to Sakamoto, different analysis.

Example

a aa a bb a bc a bb a b c ab

Intuition
- Phases: compress only the pairs and blocks present at the beginning of a phase.
- Treat nonterminals as letters.
- To speed up, we perform some pair compressions simultaneously (partition Σ into $\Sigma_\ell$, $\Sigma_r$; compress pairs from $\Sigma_\ell\Sigma_r$).

[Animated example: the word above is rewritten step by step. Rules introduced, in order: $a_3 \to a^3$, $b_2 \to b^2$, $d \to ab$, $e \to ba$.]

Algorithm

while |T| > 1 do
    L ← list of letters in T
    for each a ∈ L do                                  ▷ Blocks compression
        compress maximal blocks of a                   ▷ O(|T|)
    P ← list of pairs
    find partition of Σ into $\Sigma_\ell$ and $\Sigma_r$
        ▷ Try to maximize the occurrences from $\Sigma_\ell\Sigma_r$ in T.
    for ab ∈ P ∩ $\Sigma_\ell\Sigma_r$ do              ▷ These pairs do not overlap
        compress pair ab                               ▷ Pair compression
return the constructed grammar
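
The loop above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the talk's implementation. All function names are hypothetical. It uses hashing instead of RadixSort, so it is not linear-time in the model of the talk. Block rules are stored unbinarised as $a_\ell \to a^\ell$. The partition is chosen by a greedy max-cut-style heuristic, which is one way to cover at least 1/4 of the pair appearances.

```python
from collections import Counter, defaultdict
from itertools import count, groupby


def compress_blocks(T, rules, fresh):
    """Replace every maximal block a^l with l >= 2 by a letter a_l."""
    names, out = {}, []
    for a, run in groupby(T):
        l = len(list(run))
        if l == 1:
            out.append(a)
            continue
        if (a, l) not in names:
            names[(a, l)] = fresh()
            rules[names[(a, l)]] = [a] * l  # a_l -> a^l, not binarised in this sketch
        out.append(names[(a, l)])
    return out


def choose_partition(T):
    """Greedy stand-in for the derandomised partition (assumes no block aa in T)."""
    pairs = Counter(zip(T, T[1:]))
    adj = defaultdict(Counter)
    for (a, b), c in pairs.items():
        adj[a][b] += c
        adj[b][a] += c
    side = {}
    for x in list(adj):
        w = [0, 0]  # appearances shared with letters already on side 0 / side 1
        for y, c in adj[x].items():
            if y in side:
                w[side[y]] += c
        side[x] = 0 if w[1] >= w[0] else 1  # go opposite the heavier side
    lr = sum(c for (a, b), c in pairs.items() if (side[a], side[b]) == (0, 1))
    rl = sum(c for (a, b), c in pairs.items() if (side[a], side[b]) == (1, 0))
    left_side = 0 if lr >= rl else 1  # pick the better orientation
    left = {x for x, s in side.items() if s == left_side}
    right = set(side) - left
    return left, right


def compress_pairs(T, left, right, rules, fresh):
    """Replace each appearance of ab with a in left, b in right by a fresh letter."""
    names, out, i = {}, [], 0
    while i < len(T):
        if i + 1 < len(T) and T[i] in left and T[i + 1] in right:
            p = (T[i], T[i + 1])
            if p not in names:
                names[p] = fresh()
                rules[names[p]] = list(p)  # c -> ab
            out.append(names[p])
            i += 2
        else:
            out.append(T[i])
            i += 1
    return out


def recompress(w):
    """Return (start symbol, rules) of a grammar generating the non-empty word w."""
    rules, ids = {}, count()
    fresh = lambda: next(ids)  # nonterminals are ints, input letters are strs
    T = list(w)
    while len(T) > 1:
        T = compress_blocks(T, rules, fresh)
        if len(T) > 1:
            left, right = choose_partition(T)
            T = compress_pairs(T, left, right, rules, fresh)
    return T[0], rules


def expand(sym, rules):
    """Word derived from a symbol; used only to check the result."""
    if sym not in rules:
        return sym
    return "".join(expand(s, rules) for s in rules[sym])
```

Since the two parts of the partition are disjoint, two chosen pairs can never overlap, which is why `compress_pairs` can replace them in a single left-to-right scan.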

Partition

1/4 of appearances covered
There is a partition $\Sigma_\ell$, $\Sigma_r$ such that 1/4 of the pair appearances are covered by $\Sigma_\ell\Sigma_r$.

- After block compression aa does not appear.
- Random partition: 1/4 of the pairs can be covered.
- Derandomise (expected value).
- We need the number of appearances of each pair ab: RadixSort, O(|T|).
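
The expected-value argument behind the random partition can be written out as follows. This is a sketch, assuming each letter is assigned to $\Sigma_\ell$ or $\Sigma_r$ independently with probability 1/2, and using the fact that after block compression all $|T| - 1$ pairs of consecutive letters consist of two different letters.

```latex
% Each letter is put into \Sigma_\ell or \Sigma_r independently with probability 1/2.
\Pr[\, a \in \Sigma_\ell \wedge b \in \Sigma_r \,] = \tfrac12 \cdot \tfrac12 = \tfrac14
\qquad (a \neq b),
\\[4pt]
\mathbb{E}[\#\text{covered appearances}]
  = \sum_{\text{appearances of } ab,\ a \neq b} \tfrac14
  = \tfrac14 \, (|T| - 1).
```

Some partition must reach at least the expected value, so a partition covering at least $(|T|-1)/4$ appearances exists; derandomisation finds one deterministically.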

Size reduction

Size drop
Consider the set of pairs of consecutive letters ab in T. For 1/4 of them one letter is compressed in a phase.
  – if a = b: it is compressed
  – if a ≠ b: 1/4 of those pairs are in $\Sigma_\ell\Sigma_r$

When we consider ab we replace it, unless one letter was already replaced.

Length drops by a constant factor.

Towards running time
It is enough to show that one round runs in O(|T|).
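
Why a linear-time round suffices can be spelled out with a geometric sum. In this sketch, $T_i$ denotes T at the start of phase i and $c < 1$ is the constant shrink factor; the slides give no value for c, so it is left symbolic.

```latex
% Assumption: |T_{i+1}| \le c\,|T_i| for some constant c < 1, with |T_0| = n.
|T_i| \le c^{\,i}\, n
\quad\Longrightarrow\quad
\#\text{phases} \le \log_{1/c} n = O(\log n),
\\[4pt]
\sum_{i \ge 0} O(|T_i|)
  \le O\Big( n \sum_{i \ge 0} c^{\,i} \Big)
  = O\Big( \frac{n}{1-c} \Big)
  = O(n).
```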

Running time

Partition: O(|T|) time.

Block compression: by RadixSort, O(|T|) time.

Pair compression: by RadixSort, O(|T|) time.
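
As an illustration of the RadixSort step, the sketch below counts the appearances of every pair in O(|T| + σ) time. It assumes the letters have been renamed to integers in the range 0..σ-1 with σ ≤ |T|; the function names are hypothetical.

```python
def radix_sort_pairs(pairs, sigma):
    """Stable two-pass bucket sort of pairs (a, b) with 0 <= a, b < sigma."""
    for key in (1, 0):  # least significant coordinate first
        buckets = [[] for _ in range(sigma)]
        for p in pairs:
            buckets[p[key]].append(p)
        pairs = [p for bucket in buckets for p in bucket]
    return pairs


def pair_counts(T, sigma):
    """List of (pair, number of appearances), in lexicographic order of pairs."""
    counts = []
    for p in radix_sort_pairs(list(zip(T, T[1:])), sigma):
        if counts and counts[-1][0] == p:
            counts[-1][1] += 1  # equal pairs are adjacent after sorting
        else:
            counts.append([p, 1])
    return [(p, c) for p, c in counts]
```

After sorting, all appearances of the same pair are adjacent, so a single scan yields the counts needed for the partition as well as the grouping needed to replace each pair by one fresh letter.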

Number of nonterminals

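Representation cost

- when c replaces ab we add the rule $c \to ab$, representation cost 1
- when blocks $a^{\ell_1}, a^{\ell_2}, \ldots, a^{\ell_k}$ are replaced with letters $a_{\ell_1}, a_{\ell_2}, \ldots, a_{\ell_k}$ ($\ell_1 < \ell_2 < \cdots < \ell_k$):
  – first represent $a^{\ell_2-\ell_1}, a^{\ell_3-\ell_2}, \ldots, a^{\ell_k-\ell_{k-1}}$ as $a_{\ell_2-\ell_1}, a_{\ell_3-\ell_2}, \ldots, a_{\ell_k-\ell_{k-1}}$
  – do this by binary expansion (make new rules $a_2 \to aa$, $a_4 \to a_2 a_2$, $a_8 \to a_4 a_4$, ...)
  – $a_{\ell_{i+1}} \to a_{\ell_{i+1}-\ell_i}\, a_{\ell_i}$
  – representation cost $O\big(\sum_{i=1}^{k-1} \log(\ell_{i+1}-\ell_i)\big)$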
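
The construction of the block rules can be sketched as follows. This is an illustration under assumptions: the name `block_rules` and the encoding of $a_m$ by the integer m are hypothetical, and the shortest block $a^{\ell_1}$ is handled by plain binary expansion, a detail the slide does not spell out.

```python
def block_rules(lengths):
    """Binary rules for the letters a_l, l in lengths (all l >= 1).

    Key m stands for the nonterminal a_m deriving a^m; a_1 is the letter a itself.
    rules[m] == (p, q) encodes a_m -> a_p a_q with p + q == m.
    """
    rules = {}

    def make(m):
        # Binary expansion: a_m -> a_p a_{m-p}, p the largest power of two below m.
        if m == 1 or m in rules:
            return
        p = 1 << (m.bit_length() - 1)  # largest power of two <= m
        if p == m:
            p //= 2  # a_{2^j} -> a_{2^(j-1)} a_{2^(j-1)}
        make(p)
        make(m - p)
        rules[m] = (p, m - p)

    prev = None
    for l in sorted(set(lengths)):
        if prev is None:
            make(l)  # shortest block: plain binary expansion (assumption)
        else:
            d = l - prev
            make(d)  # represent the difference first
            rules.setdefault(l, (d, prev))  # a_l -> a_{l - prev} a_prev
        prev = l
    return rules


def derived_length(rules, m):
    """Length of the word derived by a_m."""
    return 1 if m == 1 else sum(derived_length(rules, x) for x in rules[m])
```

For the block lengths 3, 5 and 12 this produces six rules, including $a_{12} \to a_7 a_5$ and $a_7 \to a_4 a_3$; each difference contributes only logarithmically many new rules because the powers of two are shared.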

Analysis outline

- We begin with a G generating T (mental experiment).
- At each moment we keep G generating the current T:
  – we apply the compression to G
  – it is changed so that this can be done
- Representation cost is calculated using G.
- G is of a more general form: $X_i \to u X_j v X_k w$.
- Explicit letters have credit.
- Representation cost is paid by released credit:
  – ab is replaced by c
  – we need 1 representation cost
  – each ab in G is replaced with c, 1 credit is released
  – (a bit more tricky for blocks)
- We only need to count the amount of created credit.

Pair compression

$X_1 \to ababcab$, $X_2 \to abcbX_1abX_1a$

- compression of ab: easy
- compression of ba: problem

Definition (Non-crossing pairs)
ab is a non-crossing pair iff none of the following happens:
- aX appears in a rule and X begins with b
- Xb appears in a rule and X ends with a

When each pair from $\Sigma_\ell\Sigma_r$ is non-crossing, replace all those pairs in G (no new credit).
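
The definition can be checked mechanically. The sketch below tests exactly the two conditions listed above on the example grammar; the list-based grammar representation and the function names are assumptions made for illustration. A rule with two adjacent nonterminals would presumably need an analogous third check, which the slide does not list and which is not covered here.

```python
def first_last(rules):
    """First and last letter of the word derived by each nonterminal.

    rules[i] is the right-hand side of X_i as a non-empty list; strs are
    letters, ints are indices of earlier nonterminals.
    """
    first, last = {}, {}
    for i, rhs in enumerate(rules):
        f, l = rhs[0], rhs[-1]
        first[i] = f if isinstance(f, str) else first[f]
        last[i] = l if isinstance(l, str) else last[l]
    return first, last


def is_crossing(rules, a, b):
    """True iff the pair ab has a crossing appearance (two cases from the slide)."""
    first, last = first_last(rules)
    for rhs in rules:
        for x, y in zip(rhs, rhs[1:]):
            if x == a and isinstance(y, int) and first[y] == b:
                return True  # aX with X beginning with b
            if isinstance(x, int) and y == b and last[x] == a:
                return True  # Xb with X ending with a
    return False


# X_1 -> ababcab and X_2 -> abcb X_1 ab X_1 a, with indices shifted to start at 0.
example = [list("ababcab"), list("abcb") + [0] + list("ab") + [0] + ["a"]]
```

On this grammar `is_crossing(example, "a", "b")` is false while `is_crossing(example, "b", "a")` is true, matching the remark that compressing ab is easy and compressing ba is the problematic case.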

Making pairs non-crossing

When ab has a crossing appearance, $aX_i$ or $X_ib$:
- $X_i$ defines bw: change it to w, replace $X_i$ by $bX_i$
- symmetrically for ending a

LeftPop(b)
for i ← 1 .. g − 1 do
    if the first symbol in $X_i \to \alpha$ is b then
        remove this b
        replace $X_i$ in productions by $bX_i$

Lemma
After LeftPop(b) and RightPop(a) the pair ab is non-crossing.

Can be done in parallel for all ab ∈ $\Sigma_\ell\Sigma_r$.
Credit increases by O(g).

The same for all pairs from $\Sigma_\ell\Sigma_r$ at once. When ab ∈ $\Sigma_\ell\Sigma_r$ has a crossing appearance, $aX_i$ or $X_ib$:
- $X_i$ defines bw: change it to w, replace $X_i$ by $bX_i$

LeftPop
for i ← 1 .. g − 1 do
    if the first symbol in $X_i \to \alpha$ is b ∈ $\Sigma_r$ then
        remove this b
        replace $X_i$ in productions by $bX_i$

Lemma
After LeftPop and RightPop the pairs from $\Sigma_\ell\Sigma_r$ are non-crossing.
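
A minimal Python sketch of LeftPop follows, assuming a list-based grammar in which `rules[i]` is the right-hand side of $X_i$ (indices shifted to start at 0), strs are letters and ints are earlier nonterminals. The names are hypothetical. One detail is an assumption not shown on the slide: a nonterminal whose rule becomes empty is dropped from the other rules. RightPop would be symmetric, acting on the last symbol and on letters from $\Sigma_\ell$.

```python
def left_pop(rules, right):
    """Pop leading letters from `right` out of every rule except the last (start) one.

    Returns a new grammar generating the same word.
    """
    rules = [list(r) for r in rules]
    for i in range(len(rules) - 1):  # increasing order, start symbol excluded
        rhs = rules[i]
        if rhs and isinstance(rhs[0], str) and rhs[0] in right:
            b = rhs.pop(0)  # remove this b
            # Replace X_i by b X_i; if X_i now derives the empty word, by b alone.
            repl = [b, i] if rhs else [b]
            for j in range(i + 1, len(rules)):
                new = []
                for s in rules[j]:
                    new.extend(repl if s == i else [s])
                rules[j] = new
    return rules


def expand(rules, i):
    """Word derived by X_i; used only to check that the grammar is unchanged."""
    return "".join(s if isinstance(s, str) else expand(rules, s) for s in rules[i])


# X_1 -> ababcab and X_2 -> abcb X_1 ab X_1 a, with indices shifted to start at 0.
example = [list("ababcab"), list("abcb") + [0] + list("ab") + [0] + ["a"]]
popped = left_pop(example, {"a"})  # prepares the pair ba, with a in the right part
```

Processing the nonterminals in increasing order matters: a letter popped out of $X_i$ may become the first symbol of a later rule, where it is popped again when that rule is reached.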

Blocks & Wrap up

Idea
Similarly as for pairs:
- $X_i$ defines $a^{\ell_i} w b^{r_i}$: change it to w
- replace $X_i$ in rules by $a^{\ell_i} X_i b^{r_i}$
- analysis: more tricky but works, O(g)

In total
- O(g) per phase
- O(log n) phases
- O(g log n) credit in total (= size of created grammar)
- can be improved to O(g log(n/g))

Acknowledgments

M. Lohrey: suggesting the analysis.

P. Gawrychowski: introducing to the topic.

Literature
– K. Mehlhorn, R. Sundar and Ch. Uhrig, Maintaining Dynamic Sequences under Equality Tests in Polylogarithmic Time, '97
– H. Sakamoto, A fully linear-time approximation algorithm for grammar-based compression, '05
– M. Lohrey and Ch. Mathissen, Compressed Membership in Automata with Compressed Labels, '11

Open problems, related research

Open problems
- better approximation
- simpler computational model (no RadixSort)
- addition chains ($O(\log n / \log\log n)$ approximation known)

Other applications of recompression
- compressed membership
- fully compressed pattern matching
- word equations
