Elegant XML Compression
Presented by Minko Dudev
02.02.2006
IR Seminar WS06/07
Final Presentation
XML
1|Emma|J. Austin|1816|English|A. Bertrand
2|Jane Eyre|C. Bronte|1847|English|Smith Elder and Co
<biblio>
<book id=1>
<title>Emma</title><author>J. Austin</author><year>1816</year><language>English</language><publisher>A. Bertrand</publisher>
</book>
<book id=2>
<title>Jane Eyre</title><author>C. Bronte</author><year>1847</year><language>English</language><publisher>Smith Elder and Co</publisher>
</book>
</biblio>
Readable
Hierarchical
Simple to parse
Platform independent
BUT VERY VERBOSE
Non-Queriable Compression
<book id=1><title>Emma</title><author>J. Austin</author><year>1816</year><language>English</language>
</book><book id=2>
<title>Jane Eyre</title><author>C. Bronte</author><year>1847</year><language>English</language>
</book>
Tags: T1 T2 T3 T4 T5 T6
Containers:
C1: 1, 2
C2: Emma, Jane Eyre
C3: J. Austin, C. Bronte
C4: 1816, 1847
C5: English, English
Structure: T1 T2 C1 T3 C2 / T4 C3 / T5 C4 / T6 C5 / / T1 …
Very good compression BUT
WHOLE DOCUMENT MUST BE DECOMPRESSED
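The separation of structure and content shown on this slide can be sketched in a few lines. This is only an illustration under assumptions: the function name `split_containers` is hypothetical, attribute values are quoted so that the standard parser accepts the document, and `zlib` stands in for whatever back-end compressor a real tool would use.

```python
import zlib
import xml.etree.ElementTree as ET

def split_containers(xml_text):
    """Separate structure from content, one container per tag or attribute name."""
    root = ET.fromstring(xml_text)
    structure = []    # stream of names and "/" end markers
    containers = {}   # name -> list of text values

    def visit(elem):
        structure.append(elem.tag)
        for name, value in elem.attrib.items():
            structure.append("@" + name)
            containers.setdefault("@" + name, []).append(value)
        if elem.text and elem.text.strip():
            containers.setdefault(elem.tag, []).append(elem.text.strip())
        for child in elem:
            visit(child)
        structure.append("/")

    visit(root)
    return structure, containers

DOC = ('<biblio><book id="1"><title>Emma</title><author>J. Austin</author>'
       '<year>1816</year><language>English</language></book>'
       '<book id="2"><title>Jane Eyre</title><author>C. Bronte</author>'
       '<year>1847</year><language>English</language></book></biblio>')

structure, containers = split_containers(DOC)

# Similar values end up next to each other, which suits a general-purpose compressor.
compressed = {name: zlib.compress("\n".join(values).encode())
              for name, values in containers.items()}
```

Reading a single title back requires decompressing its whole container, which is the drawback the slide points out.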
Queriable Compression
<book id=1>
<title>Emma</title><author>J. Austin</author><year>1816</year><language>English</language>
</book><book id=2>
<title>Jane Eyre</title><author>C. Bronte</author><year>1847</year><language>English</language>
</book>
T1 T2 enc(1)
T3 enc(Emma) / T4 enc(J. Austin) / T5 enc(1816) / T6 enc(English) /
/
T1 T2 enc(2)
T3 enc(Jane Eyre) / T4 enc(C. Bronte) / T5 enc(1847) / T6 enc(English) /
/
Can be queried
BUT HAS BAD COMPRESSION RATIO
Goals
A new scheme that
Has very good compression properties
Can be queried
Has good performance
<biblio>
<book id=1>
<title>Emma</title><author>J. Austin</author><year>1816</year><language>English</language>
</book>
<book id=2>
<title>Jane Eyre</title><author>C. Bronte</author><year>1847</year><language>English</language>
</book>
</biblio>
XML as a Tree
biblio
  book
    id
      1
    author
      J.Austin
    title
      Emma
  book
    id
      2
    author
      C.Bronte
    title
      Jane Eyre
XML document = labeled tree
Search operations
What are the children of some node
What are the parents of some node
What are the nodes that have a certain path prefix
How many paths with a certain prefix exist
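On an uncompressed tree these search operations are straightforward. The sketch below uses Python's standard `xml.etree.ElementTree`. The helper names (`children`, `parent`, `nodes_with_path`) are made up for this illustration, and attribute values are quoted so that the parser accepts the input.

```python
import xml.etree.ElementTree as ET

DOC = ('<biblio><book id="1"><author>J.Austin</author><title>Emma</title></book>'
       '<book id="2"><author>C.Bronte</author><title>Jane Eyre</title></book></biblio>')

root = ET.fromstring(DOC)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: node for node in root.iter() for child in node}

def children(node):
    """Children of a node, in document order."""
    return list(node)

def parent(node):
    """Parent of a node, or None for the root."""
    return parent_of.get(node)

def nodes_with_path(root, path):
    """Nodes reached by following the tag sequence `path` down from the root."""
    if not path or root.tag != path[0]:
        return []
    frontier = [root]
    for tag in path[1:]:
        frontier = [child for node in frontier for child in node if child.tag == tag]
    return frontier

titles = nodes_with_path(root, ["biblio", "book", "title"])
number_of_title_paths = len(titles)
```

The rest of the talk is about answering the same questions without materialising the tree like this.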
The XBW Transform
Tree: A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b)))
ΣN = {A, B, C, …}
ΣL = {a, b, c, …}
pre-order
Slast  = 0001011001011111
Slabel = ABDaaEbCDcbDcBDb
Spath  = empty, A, BA, DBA, BA, BA, EBA, A, CA, DCA, CA, CA, DCA, A, BA, DBA
stable sort (by Spath)
Slast  = 0001001100111111
Slabel = ABCBDaEDDbDabccb
Spath  = empty, A, A, A, BA, BA, BA, BA, CA, CA, CA, DBA, DBA, DCA, DCA, EBA
Skew Algorithm: O(N)
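The construction on this slide (pre-order traversal, then a stable sort by upward path) can be reproduced directly. The sketch below is a naive version that materialises every path as a string and relies on Python's stable `sort`, so it does not reach the O(N) bound of the skew algorithm. The nested-tuple tree encoding, the name `xbw`, and the restriction to single-character labels are assumptions made for the illustration.

```python
# The example tree as nested (label, children) tuples.
TREE = ("A", [
    ("B", [("D", [("a", [])]), ("a", []), ("E", [("b", [])])]),
    ("C", [("D", [("c", [])]), ("b", []), ("D", [("c", [])])]),
    ("B", [("D", [("b", [])])]),
])

def xbw(tree):
    """Return (Slast, Slabel, Spath) for a tree with single-character labels."""
    rows = []  # (last, label, upward path), collected in pre-order

    def visit(node, path, last):
        label, kids = node
        rows.append((last, label, path))
        for i, kid in enumerate(kids):
            # A child's upward path is its parent's label followed by the parent's path.
            visit(kid, label + path, int(i == len(kids) - 1))

    visit(tree, "", 0)              # the root gets last = 0, as on the slide
    rows.sort(key=lambda r: r[2])   # list.sort is stable, so pre-order ties are kept
    s_last = "".join(str(r[0]) for r in rows)
    s_label = "".join(r[1] for r in rows)
    s_path = [r[2] for r in rows]
    return s_last, s_label, s_path

s_last, s_label, s_path = xbw(TREE)
print(s_last)   # 0001001100111111
print(s_label)  # ABCBDaEDDbDabccb
```

Only Slast and Slabel need to be stored. Spath is shown for explanation and can be rebuilt from the other two.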
Compressibility
<biblio
  <book
    @id
      =
        §1
    <author
      =
        §J.Austin
    <title
      =
        §Emma
  …
#  Slast  Slabel  Spath
1 1 <biblio empty
2 1 = <author<book<biblio
3 1 = <author<book<biblio
4 0 <book <biblio
5 1 <book <biblio
6 0 @id <book<biblio
7 0 <author <book<biblio
8 1 <title <book<biblio
9 0 @id <book<biblio
10 0 <author <book<biblio
11 1 <title <book<biblio
12 1 = <title<book<biblio
13 1 = <title<book<biblio
14 1 = @id<book<biblio
15 1 = @id<book<biblio
16 1 §J. Austin =<author<book<biblio
17 1 §C. Bronte =<author<book<biblio
18 1 §Emma =<title<book<biblio
19 1 §Jane Eyre =<title<book<biblio
20 1 §1 =@id<book<biblio
21 1 §2 =@id<book<biblio
PCDATA =
Some properties
Tree: A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b)))
Slast  = 0001001100111111
Slabel = ABCBDaEDDbDabccb
Spath  = empty, A, A, A, BA, BA, BA, BA, CA, CA, CA, DBA, DBA, DCA, DCA, EBA
Children lie contiguously
Relative order of parents and children is preserved
Only scans = O(N)
Inverse XBW Transform
Tree: A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b)))
i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Slabel  A  B  C  B  D  a  E  D  D  b  D  a  b  c  c  b
Slast   0  0  0  1  0  0  1  1  0  0  1  1  1  1  1  1
J       2  5  9  8 12 -1 16 13 14 -1 15 -1 -1 -1 -1 -1
Spath = empty, A, A, A, BA, BA, BA, BA, CA, CA, CA, DBA, DBA, DCA, DCA, EBA
x       1=A  2=B  3=C  4=D  5=E
F       2    5    9    12   16
C       1    2    1    4    1
J[i] = jump to the first child of node i; J[5] = 12
F[x] = first Spath component prefixed by x; F[B=2] = 5
C[x] = count of occurrences of x in Slabel; C[B=2] = 2
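The three arrays can be derived from Slast and Slabel with plain left-to-right scans. The sketch below is one way to do that and is not taken from the talk: the function names are hypothetical, positions are 1-based to match the slide, and internal nodes are assumed to be exactly the uppercase labels A to E.

```python
from collections import Counter

S_LABEL = "ABCBDaEDDbDabccb"
S_LAST = "0001001100111111"
INTERNAL = "ABCDE"  # assumption: uppercase labels are the internal nodes

def build_f_and_c(s_last, s_label, internal):
    """F[x]: first position whose path starts with x. C[x]: occurrences of x in Slabel."""
    counts = Counter(ch for ch in s_label if ch in internal)
    first = {}
    pos = 2  # position 1 is the root, whose path is empty
    for x in sorted(internal):
        first[x] = pos
        ones = 0
        # The block of x holds one group of children per node labelled x;
        # each group ends at a 1 in Slast.
        while ones < counts[x]:
            if s_last[pos - 1] == "1":
                ones += 1
            pos += 1
    return first, dict(counts)

def build_j(s_last, s_label, first):
    """J[i]: position of the first child of node i, or -1 for a leaf."""
    next_group = dict(first)
    jump = []
    for label in s_label:
        if label not in next_group:
            jump.append(-1)
            continue
        start = next_group[label]
        jump.append(start)
        end = start
        while s_last[end - 1] != "1":
            end += 1
        next_group[label] = end + 1  # the next node with this label uses the next group
    return jump

def children(jump, s_last, i):
    """Positions of the children of node i (1-based); empty for a leaf."""
    start = jump[i - 1]
    if start == -1:
        return []
    end = start
    while s_last[end - 1] != "1":
        end += 1
    return list(range(start, end + 1))

F, C = build_f_and_c(S_LAST, S_LABEL, INTERNAL)
J = build_j(S_LAST, S_LABEL, F)
```

With these definitions `F["B"]` is 5, `C["B"]` is 2 and `J[4]` (node 5 in the slide's 1-based numbering) is 12, matching the examples given on the slide.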
Subpath search
Find nodes with path P = ABD
Tree: A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b)))
i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Slabel  A  B  C  B  D  a  E  D  D  b  D  a  b  c  c  b
Slast   0  0  0  1  0  0  1  1  0  0  1  1  1  1  1  1
Spath = empty, A, A, A, BA, BA, BA, BA, CA, CA, CA, DBA, DBA, DCA, DCA, EBA
x       1=A  2=B  3=C  4=D  5=E
F       2    5    9    12   16
rank and select in O(1)
thus O(|P|) time for subpath search
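The search narrows a range of positions one pattern symbol at a time, using only counts over Slabel and Slast. The sketch below follows that idea but replaces the constant-time rank and select structures with plain scans, so it illustrates the logic rather than the O(|P|) bound. The names `kth_one` and `subpath_search` are hypothetical, and F is copied from the slide.

```python
S_LABEL = "ABCBDaEDDbDabccb"
S_LAST = "0001001100111111"
F = {"A": 2, "B": 5, "C": 9, "D": 12, "E": 16}  # from the slide

def kth_one(s_last, start, k):
    """1-based position of the k-th '1' at or after `start`; start - 1 if k == 0."""
    pos = start - 1
    while k > 0:
        pos += 1
        if s_last[pos - 1] == "1":
            k -= 1
    return pos

def subpath_search(s_last, s_label, first_of, pattern):
    """1-based positions of nodes labelled pattern[-1] that are reached
    through ancestors labelled pattern[0], pattern[1], ... in that order."""
    first, last = 1, len(s_label)
    for c in pattern[:-1]:
        if c not in first_of:
            return []
        # rank: how many nodes labelled c lie before, and up to the end of, the range
        k1 = s_label[:first - 1].count(c)
        k2 = s_label[:last].count(c)
        if k1 == k2:
            return []
        # select: jump to the child groups of exactly those nodes
        first = kth_one(s_last, first_of[c], k1) + 1
        last = kth_one(s_last, first_of[c], k2)
    return [i for i in range(first, last + 1) if s_label[i - 1] == pattern[-1]]

print(subpath_search(S_LAST, S_LABEL, F, "ABD"))  # [5, 8]
```

For P = ABD the range shrinks from all 16 positions to 2..4 (children of A), then to 5..8 (children of the two B nodes), where positions 5 and 8 carry the label D.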
Compression & Search
All search operations boil down to counting
How many 1s are there up to …
How many labels = X are there up to …
Slast = 111010010011111
gzip(11101001) 0    gzip(0011111) 5
(the number after each block is the count of 1s before that block)
Slabel = <biblio = = <book <book @id <author <title @id <author <title = = = =
gzip(<biblio = = <book <book @id <author)
<author:0 <biblio:0 <book:0 <title:0 @id:0 =:0
gzip(<title @id <author <title = = = =)
<author:1 <biblio:1 <book:2 <title:0 @id:1 =:2
(the counts stored with each block are the occurrences of each label before that block)
C=5
C~30B
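The block layout on this slide, where every compressed block carries the count accumulated before it, can be sketched as follows. The code is illustrative only: `build_blocks` and `rank1` are hypothetical names, and Python's `zlib` (the DEFLATE algorithm that gzip also uses) stands in for gzip.

```python
import zlib

def build_blocks(bits, block_size):
    """Split `bits` into blocks and store each one compressed,
    together with the number of 1s that precede it."""
    blocks, ones_before = [], 0
    for start in range(0, len(bits), block_size):
        chunk = bits[start:start + block_size]
        blocks.append((ones_before, zlib.compress(chunk.encode())))
        ones_before += chunk.count("1")
    return blocks

def rank1(blocks, block_size, i):
    """Number of 1s among the first i bits; decompresses a single block."""
    if i == 0:
        return 0
    block_index, offset = divmod(i - 1, block_size)
    ones_before, packed = blocks[block_index]
    chunk = zlib.decompress(packed).decode()
    return ones_before + chunk[:offset + 1].count("1")

S_LAST = "111010010011111"
blocks = build_blocks(S_LAST, 8)
print([count for count, _ in blocks])  # [0, 5]
print(rank1(blocks, 8, 15))            # 10
```

The same scheme applies to Slabel, with one stored count per label instead of a single count of 1s. On blocks this small the compressed form is larger than the input; the block size is a trade-off between compression ratio and the work done per query.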
Compression & Search
Slast
Slabel
gzip(<biblio = = <book <book @id <author)
<author:0 <biblio:0 <book:0 <title:0 @id:0 =:0
gzip(<title @id <author <title = = = =)
<author:1 <biblio:1 <book:2 <title:0 @id:1 =:2
Spcdata = §J. Austin §C. Bronte §Emma §Jane Eyre §1 §2
Partitioned by path: =<author<book<biblio, =<title<book<biblio, =@id<book<biblio
J. Austin, C. Bronte
Emma, Jane Eyre
12420
gzip(11101001)0 gzip(0011111)5
FM-Index
Summary
You have seen
Challenges of XML
Verbose, non-queriable compression, queriable but bad compression
The XBW transform
How to construct and invert it in O(N) time
How to navigate and search in it in O(1) and O(|P|) time
Partitioning of the XBW transform for compression
FIN