Elegant XML Compression Presented by Minko Dudev 02.02.2006 IR Seminar WS06/07 Final Presentation.



XML

1|Emma|J. Austin|1816|English|A. Bertrand
2|Jane Eyre|C. Bronte|1847|English|Smith Elder and Co

<biblio>
  <book id=1>
    <title>Emma</title>
    <author>J. Austin</author>
    <year>1816</year>
    <language>English</language>
    <publisher>A. Bertrand</publisher>
  </book>
  <book id=2>
    <title>Jane Eyre</title>
    <author>C. Bronte</author>
    <year>1847</year>
    <language>English</language>
    <publisher>Smith Elder and Co</publisher>
  </book>
</biblio>

Readable
Hierarchical
Simple to parse
Platform independent

BUT VERY VERBOSE

Non-Queriable Compression

<book id=1>
  <title>Emma</title>
  <author>J. Austin</author>
  <year>1816</year>
  <language>English</language>
</book>
<book id=2>
  <title>Jane Eyre</title>
  <author>C. Bronte</author>
  <year>1847</year>
  <language>English</language>
</book>

Tag codes: T1 T2 T3 T4 T5 T6

Containers (values grouped by tag, in document order):

C1  1, 2
C2  Emma, Jane Eyre
C3  J. Austin, C. Bronte
C4  1816, 1847
C5  English, English

Structure stream:

T1 T2 C1 T3 C2 / T4 C3 / T5 C4 / T6 C5 / / T1 …

Very good compression BUT
WHOLE DOCUMENT MUST BE DECOMPRESSED
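The separation of structure and containers shown above can be sketched in Python. This is only an illustration under assumptions: the function name `split_document`, the token format of the structure stream, and the use of `zlib` as the back-end compressor are not from the slides, and the attribute values are quoted so that the standard parser accepts the input.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict

# Same example as on the slide, with quoted attributes so it is well-formed.
XML = """<biblio>
<book id="1"><title>Emma</title><author>J. Austin</author><year>1816</year><language>English</language></book>
<book id="2"><title>Jane Eyre</title><author>C. Bronte</author><year>1847</year><language>English</language></book>
</biblio>"""


def split_document(xml_text):
    """Return (structure, containers): a token stream and one value list per name."""
    structure = []                   # element/attribute names and "/" end markers
    containers = defaultdict(list)   # values grouped by element or attribute name

    def visit(elem):
        structure.append(elem.tag)
        for name, value in elem.attrib.items():
            structure.append("@" + name)
            containers["@" + name].append(value)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
        for child in elem:
            visit(child)
        structure.append("/")

    visit(ET.fromstring(xml_text))
    return structure, dict(containers)


structure, containers = split_document(XML)

# Each container holds similar values, so it is compressed as one unit.
packed = {name: zlib.compress("\n".join(values).encode())
          for name, values in containers.items()}
```

Reading a single title back requires decompressing the whole `title` container and replaying the structure stream, which is the drawback the slide points out.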

Queriable Compression

<book id=1>
  <title>Emma</title>
  <author>J. Austin</author>
  <year>1816</year>
  <language>English</language>
</book>
<book id=2>
  <title>Jane Eyre</title>
  <author>C. Bronte</author>
  <year>1847</year>
  <language>English</language>
</book>

T1 T2 enc(1)
  T3 enc(Emma) /
  T4 enc(J. Austin) /
  T5 enc(1816) /
  T6 enc(English) /
/
T1 T2 enc(2)
  T3 enc(Jane Eyre) /
  T4 enc(C. Bronte) /
  T5 enc(1847) /
  T6 enc(English) /
/

Can be queried BUT
HAS BAD COMPRESSION RATIO
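A minimal sketch of why in-place encoding stays queriable but compresses poorly. Everything here is an assumption made for illustration: the tag-to-code table `TAGS`, the use of `zlib` as a stand-in for enc(...), and answering an equality query by comparing encoded values.

```python
import zlib

# Hypothetical tag codes; the slide only shows T1 ... T6 without a legend.
TAGS = {"book": "T1", "@id": "T2", "title": "T3",
        "author": "T4", "year": "T5", "language": "T6"}


def enc(value):
    """Compress one value on its own (stand-in for the slide's enc(...))."""
    return zlib.compress(value.encode())


records = [
    {"@id": "1", "title": "Emma", "author": "J. Austin",
     "year": "1816", "language": "English"},
    {"@id": "2", "title": "Jane Eyre", "author": "C. Bronte",
     "year": "1847", "language": "English"},
]

# Each value is encoded where it occurs, next to the code of its tag.
encoded = [[(TAGS[name], enc(value)) for name, value in rec.items()]
           for rec in records]


def find(encoded_docs, name, value):
    """Indices of records whose field `name` equals `value`, without decoding."""
    needle = (TAGS[name], enc(value))
    return [i for i, rec in enumerate(encoded_docs) if needle in rec]
```

Because every short value is compressed separately, the compressor has almost no context to exploit: `enc("Emma")` is longer than the four bytes it encodes.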

Goals

A new scheme that
Has very good compression properties
Can be queried
Has good performance

<biblio>
  <book id=1>
    <title>Emma</title>
    <author>J. Austin</author>
    <year>1816</year>
    <language>English</language>
  </book>
  <book id=2>
    <title>Jane Eyre</title>
    <author>C. Bronte</author>
    <year>1847</year>
    <language>English</language>
  </book>
</biblio>

XML as a Tree

biblio
  book
    id
      1
    author
      J. Austin
    title
      Emma
  book
    id
      2
    author
      C. Bronte
    title
      Jane Eyre

XML document = labeled tree

Search operations
What are the children of some node?
What is the parent of some node?
What are the nodes that have a certain path prefix?
How many paths with a certain prefix exist?
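The four search operations can be stated on an ordinary pointer-based tree first, as a baseline for what the compressed representation has to support. The class `Node` and the helper names below are hypothetical and serve only as illustration.

```python
class Node:
    """A labeled tree node that registers itself with its parent."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def path_from_root(node):
    """Labels from the root down to `node`."""
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.parent
    return labels[::-1]


def walk(node):
    """Pre-order traversal."""
    yield node
    for child in node.children:
        yield from walk(child)


def nodes_with_path_prefix(root, prefix):
    """All nodes whose root-to-node label path starts with `prefix`."""
    return [n for n in walk(root)
            if path_from_root(n)[:len(prefix)] == prefix]


# The tree from the slide: two books with id, author and title.
biblio = Node("biblio")
for book_id, author, title in [("1", "J. Austin", "Emma"),
                               ("2", "C. Bronte", "Jane Eyre")]:
    book = Node("book", biblio)
    for name, value in [("id", book_id), ("author", author), ("title", title)]:
        Node(value, Node(name, book))
```

Children and parent are single attribute lookups here, and counting paths with a prefix is `len(nodes_with_path_prefix(...))`, but the pointers cost far more space than the document itself.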

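The XBW Transform

Example tree, with internal-node labels ΣN = {A, B, C, …} and leaf labels ΣL = {a, b, c, …}:

A
  B
    D
      a
    a
    E
      b
  C
    D
      c
    b
    D
      c
  B
    D
      b

Nodes in pre-order. Slast = 1 if the node is the last child of its parent, Slabel = the node's label, Spath = the labels on the path from the node's parent up to the root:

 i  Slast  Slabel  Spath
 1  0      A       empty
 2  0      B       A
 3  0      D       BA
 4  1      a       DBA
 5  0      a       BA
 6  1      E       BA
 7  1      b       EBA
 8  0      C       A
 9  0      D       CA
10  1      c       DCA
11  0      b       CA
12  1      D       CA
13  1      c       DCA
14  1      B       A
15  1      D       BA
16  1      b       DBA

After a stable sort by Spath:

 i  Slast  Slabel  Spath
 1  0      A       empty
 2  0      B       A
 3  0      C       A
 4  1      B       A
 5  0      D       BA
 6  0      a       BA
 7  1      E       BA
 8  1      D       BA
 9  0      D       CA
10  0      b       CA
11  1      D       CA
12  1      a       DBA
13  1      b       DBA
14  1      c       DCA
15  1      c       DCA
16  1      b       EBA

Construction with the Skew Algorithm: O(N)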
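The construction (pre-order listing followed by a stable sort on Spath) can be sketched as follows. This is a simplified illustration, not the linear-time construction named on the slide: it relies on Python's built-in stable sort and on string comparison of the paths. The nested-tuple tree encoding and the function name `xbw` are assumptions.

```python
# The example tree as nested (label, children) tuples.
TREE = ("A", [
    ("B", [("D", [("a", [])]), ("a", []), ("E", [("b", [])])]),
    ("C", [("D", [("c", [])]), ("b", []), ("D", [("c", [])])]),
    ("B", [("D", [("b", [])])]),
])


def xbw(tree):
    """Return the rows (last, label, path) of the tree, sorted by path."""
    rows = []

    def visit(node, path, last):
        label, children = node
        rows.append((last, label, path))
        for k, child in enumerate(children):
            # A child's path is its parent's label followed by the parent's path.
            visit(child, label + path, int(k == len(children) - 1))

    visit(tree, "", 0)              # the root is flagged 0, as on the slide
    rows.sort(key=lambda r: r[2])   # list.sort is stable: ties keep pre-order
    return rows


rows = xbw(TREE)
s_last = "".join(str(last) for last, _, _ in rows)
s_label = "".join(label for _, label, _ in rows)
s_path = [path for _, _, path in rows]
```

Only `s_last` and `s_label` need to be stored; `s_path` is shown to make the sort order visible.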

Compressibility

<biblio
  <book
    @id
      =
        §1
    <author
      =
        §J. Austin
    <title
      =
        §Emma

 i  Slast  Slabel      Spath
 1  1      <biblio     empty
 2  1      =           <author<book<biblio
 3  1      =           <author<book<biblio
 4  0      <book       <biblio
 5  1      <book       <biblio
 6  0      @id         <book<biblio
 7  0      <author     <book<biblio
 8  1      <title      <book<biblio
 9  0      @id         <book<biblio
10  0      <author     <book<biblio
11  1      <title      <book<biblio
12  1      =           <title<book<biblio
13  1      =           <title<book<biblio
14  1      =           @id<book<biblio
15  1      =           @id<book<biblio
16  1      §J. Austin  =<author<book<biblio
17  1      §C. Bronte  =<author<book<biblio
18  1      §Emma       =<title<book<biblio
19  1      §Jane Eyre  =<title<book<biblio
20  1      §1          =@id<book<biblio
21  1      §2          =@id<book<biblio

Rows 16 to 21 hold the PCDATA; each text value hangs under a "=" node.

Some properties

Sorted arrays of the example tree A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b))):

Slast   0001001100111111
Slabel  ABCBDaEDDbDabccb
Spath   empty A A A BA BA BA BA CA CA CA DBA DBA DCA DCA EBA

Children lie contiguously
Relative order of parents and children is preserved
Only scans are needed = O(N)
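The two properties are what make navigation possible with counting alone. The sketch below hardcodes the sorted arrays of the example and uses the F values shown on the Inverse XBW Transform slide; `rank` and `select` are naive linear scans here, an assumption made for brevity. Rows are numbered from 1.

```python
S_LAST = "0001001100111111"
S_LABEL = "ABCBDaEDDbDabccb"
F = {"A": 2, "B": 5, "C": 9, "D": 12, "E": 16}   # first row of each path block


def rank(s, symbol, i):
    """Occurrences of `symbol` among the first i entries of s."""
    return s[:i].count(symbol)


def select(s, symbol, k):
    """1-based position of the k-th occurrence of `symbol` (0 if k == 0)."""
    pos = 0
    for _ in range(k):
        pos = s.index(symbol, pos) + 1
    return pos


def children(i):
    """Row range (first, last) of the children of row i, or None for a leaf."""
    c = S_LABEL[i - 1]
    if c not in F:
        return None
    y = rank(S_LABEL, c, i)             # row i is the y-th node labeled c
    z = rank(S_LAST, "1", F[c] - 1)     # sibling groups that end before c's block
    first = F[c] if y == 1 else select(S_LAST, "1", z + y - 1) + 1
    return first, select(S_LAST, "1", z + y)


def parent(i):
    """Row of the parent of row i, or None for the root."""
    if i == 1:
        return None
    c = max(x for x in F if F[x] <= i)  # label whose block contains row i
    y = rank(S_LAST, "1", i - 1) - rank(S_LAST, "1", F[c] - 1) + 1
    return select(S_LABEL, c, y)
```

For example, `children(4)` is the single row 8 (the D under the second B), and `parent(8)` leads back to row 4.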

Inverse XBW Transform

 i  Slast  Slabel  Spath  J[i]
 1  0      A       empty   2
 2  0      B       A       5
 3  0      C       A       9
 4  1      B       A       8
 5  0      D       BA     12
 6  0      a       BA     -1
 7  1      E       BA     16
 8  1      D       BA     13
 9  0      D       CA     14
10  0      b       CA     -1
11  1      D       CA     15
12  1      a       DBA    -1
13  1      b       DBA    -1
14  1      c       DCA    -1
15  1      c       DCA    -1
16  1      b       EBA    -1

 x     A  B  C  D  E
 code  1  2  3  4  5
 F[x]  2  5  9 12 16
 C[x]  1  2  1  4  1

J[i] = jump to the first child of node i (-1 for a leaf); J[5] = 12
F[x] = first row whose Spath is prefixed by x; F[B=2] = 5
C[x] = number of occurrences of x in Slabel; C[B=2] = 2
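A sketch of how F, C and J can be derived from Slast and Slabel alone, and how J then drives the reconstruction of the tree. It assumes, as in the slide's example, that internal labels are upper-case and leaf labels lower-case; the function names are hypothetical.

```python
S_LAST = "0001001100111111"
S_LABEL = "ABCBDaEDDbDabccb"


def build_f_c_j(s_last, s_label):
    """Compute F and C per internal label and J per row (rows are 1-based)."""
    symbols = sorted({c for c in s_label if c.isupper()})
    C = {x: s_label.count(x) for x in symbols}

    F, pos = {}, 2                      # row 1 is the root; blocks start at row 2
    for x in symbols:
        F[x] = pos
        for _ in range(C[x]):           # skip one sibling group per node labeled x
            pos = s_last.index("1", pos - 1) + 2

    J, cursor = [], dict(F)             # cursor[x] = next unused group in x's block
    for c in s_label:
        if c in cursor:
            J.append(cursor[c])
            cursor[c] = s_last.index("1", cursor[c] - 1) + 2
        else:
            J.append(-1)                # leaf
    return F, C, J


def rebuild(s_last, s_label, J, i=1):
    """Rebuild the subtree rooted at row i as nested (label, children) tuples."""
    kids = []
    j = J[i - 1]
    while j != -1:
        kids.append(rebuild(s_last, s_label, J, j))
        if s_last[j - 1] == "1":        # last sibling reached
            break
        j += 1
    return (s_label[i - 1], kids)


F, C, J = build_f_c_j(S_LAST, S_LABEL)
tree = rebuild(S_LAST, S_LABEL, J)
```

For row 4, the second B, this gives J[4] = 8, the row of its only child D. Each step is a single left-to-right pass, in line with the O(N) bound stated for the inverse transform.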

Subpath search

Find nodes with path P = ABD

 i  Slast  Slabel  Spath
 1  0      A       empty
 2  0      B       A
 3  0      C       A
 4  1      B       A
 5  0      D       BA
 6  0      a       BA
 7  1      E       BA
 8  1      D       BA
 9  0      D       CA
10  0      b       CA
11  1      D       CA
12  1      a       DBA
13  1      b       DBA
14  1      c       DCA
15  1      c       DCA
16  1      b       EBA

 x     A  B  C  D  E
 code  1  2  3  4  5
 F[x]  2  5  9 12 16

rank and select take O(1) time,
thus subpath search takes O(|P|) time
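The search can be sketched as one rank/select step per symbol of P. The version below is an illustration under assumptions: `rank` and `select` are naive scans rather than the constant-time structures the slide refers to, and the function name `subpath_search` is hypothetical. It returns the range of rows whose Spath starts with P reversed, that is, the children of the nodes reached by P.

```python
S_LAST = "0001001100111111"
S_LABEL = "ABCBDaEDDbDabccb"
F = {"A": 2, "B": 5, "C": 9, "D": 12, "E": 16}


def rank(s, symbol, i):
    """Occurrences of `symbol` among the first i entries of s."""
    return s[:i].count(symbol)


def select(s, symbol, k):
    """1-based position of the k-th occurrence of `symbol` (0 if k == 0)."""
    pos = 0
    for _ in range(k):
        pos = s.index(symbol, pos) + 1
    return pos


def subpath_search(path):
    """Rows (first, last) whose Spath is prefixed by reversed(path), or None."""
    first, last = 1, len(S_LABEL)
    for c in path:
        if c not in F:
            return None
        k1 = rank(S_LABEL, c, first - 1)   # nodes labeled c before the range
        k2 = rank(S_LABEL, c, last)        # nodes labeled c up to its end
        if k1 == k2:
            return None                    # no node labeled c in the range
        z = rank(S_LAST, "1", F[c] - 1)    # sibling groups before c's block
        first = F[c] if k1 == 0 else select(S_LAST, "1", z + k1) + 1
        last = select(S_LAST, "1", z + k2)
    return first, last
```

For P = ABD the range narrows from rows 2..4 (after A) to 5..8 (after B) to 12..13 (after D). The two nodes that end the path are the rows labeled D inside 5..8, namely rows 5 and 8.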

Compression & Search

All search operations boil down to counting
How many 1s are there up to …
How many labels = X up to …

Slast = 111010010011111, split into blocks; each block is stored with the number of 1s that precede it:

gzip(11101001)  0
gzip(0011111)   5

Slabel = <biblio = = <book <book @id <author <title @id <author <title = = = =, split into blocks; each block is stored with the label counts that precede it:

gzip(<biblio = = <book <book @id <author)
<author:0  <biblio:0  <book:0  <title:0  @id:0  =:0

gzip(<title @id <author <title = = = =)
<author:1  <biblio:1  <book:2  <title:0  @id:1  =:2

C=5

C~30B
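The block scheme can be sketched as follows: every block is compressed on its own and stored together with the symbol counts that precede it, so a counting query decompresses a single block. The block sizes (8 bits for Slast, 7 labels for Slabel) follow the example; the function names and the space-separated serialization of labels are assumptions.

```python
import gzip
from collections import Counter


def build_blocks(symbols, block_size):
    """Split `symbols` into gzip-compressed blocks with running counts."""
    blocks, seen = [], Counter()
    for start in range(0, len(symbols), block_size):
        chunk = symbols[start:start + block_size]
        payload = gzip.compress(" ".join(chunk).encode())
        blocks.append((payload, dict(seen)))   # counts before this block
        seen.update(chunk)
    return blocks


def rank(blocks, block_size, symbol, i):
    """Occurrences of `symbol` among the first i symbols (i is 1-based)."""
    if i == 0:
        return 0
    b, offset = divmod(i - 1, block_size)
    payload, before = blocks[b]
    chunk = gzip.decompress(payload).decode().split(" ")   # one block only
    return before.get(symbol, 0) + chunk[:offset + 1].count(symbol)


S_LAST = list("111010010011111")
S_LABEL = ["<biblio", "=", "=", "<book", "<book", "@id", "<author",
           "<title", "@id", "<author", "<title", "=", "=", "=", "="]

last_blocks = build_blocks(S_LAST, 8)
label_blocks = build_blocks(S_LABEL, 7)
```

`rank(last_blocks, 8, "1", 10)` answers "how many 1s are there up to row 10" by adding the stored count 5 to the 1s found in the first two bits of the second block.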

Compression & Search

Slast

gzip(11101001)  0
gzip(0011111)   5

Slabel

gzip(<biblio = = <book <book @id <author)
<author:0  <biblio:0  <book:0  <title:0  @id:0  =:0

gzip(<title @id <author <title = = = =)
<author:1  <biblio:1  <book:2  <title:0  @id:1  =:2

Spcdata = §J. Austin §C. Bronte §Emma §Jane Eyre §1 §2, grouped by Spath:

=<author<book<biblio  J. Austin, C. Bronte
=<title<book<biblio   Emma, Jane Eyre
=@id<book<biblio      12420

FM-Index

Summary

You have seen

Challenges of XML
Verbose; non-queriable compression; queriable but bad compression

The XBW transform
How to construct and invert it in O(N) time
How to navigate and search in it in O(1) and O(|P|) time

Partitioning of the XBW transform for compression

FIN