On the Memory Requirements of XPath Evaluation over XML Streams

Post on 21-Jan-2016


On the Memory Requirements of XPath Evaluation over XML Streams

Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski

IBM Almaden Research Center

Preliminaries: XML

<conference>
  <name> PODS </name>
  <speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
  <speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>

[Figure: the document tree, with node identifiers]

x0 root
  x1 conference
    x2 name: PODS
    x3 speaker
      x4 name: Josifovski
      x5 paper_cnt: 1
    x6 speaker
      x7 name: Fagin
      x8 paper_cnt: 3

Preliminaries: XPath 1.0

/conference[name = PODS]/speaker[paper_cnt > 1]/name

[Figure: the query tree (root, conference with children name = PODS and speaker; speaker with children paper_cnt > 1 and the output node name) next to the document tree from the previous slide, nodes x0 to x8.]

Result: { x7 }
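The evaluation above can be reproduced with a few lines of Python. This is only a sketch on a fully loaded (non-streaming) document: the predicates are checked by hand with the standard library's ElementTree, and the names `DOC`, `evaluate`, and `RESULT` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The example document from the "Preliminaries: XML" slide.
DOC = """<conference><name> PODS </name>
<speaker><name> Josifovski </name><paper_cnt> 1 </paper_cnt></speaker>
<speaker><name> Fagin </name><paper_cnt> 3 </paper_cnt></speaker></conference>"""


def text_of(elem):
    """Element text with surrounding whitespace removed."""
    return (elem.text or "").strip()


def evaluate(doc_text):
    """Hand-coded /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    conference = ET.fromstring(doc_text)
    if conference.tag != "conference":
        return []
    # Predicate [name = PODS] on the conference node.
    if not any(text_of(n) == "PODS" for n in conference.findall("name")):
        return []
    selected = []
    for speaker in conference.findall("speaker"):
        # Predicate [paper_cnt > 1] on each speaker node.
        if any(int(text_of(p)) > 1 for p in speaker.findall("paper_cnt")):
            selected.extend(text_of(n) for n in speaker.findall("name"))
    return selected


RESULT = evaluate(DOC)
print(RESULT)
```

Only the name of the second speaker (node x7 in the figure) is selected, because the first speaker fails the paper_cnt > 1 predicate.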

XML Streams

XML stream: XML document arriving as a one-way stream

Critical resources:

• Memory

• Processing time

Why XML streams?

• For transferring XML between systems

• For efficient access to large XML documents
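To make "one-way stream" concrete, the sketch below feeds a document to a parser chunk by chunk and reports start and end events in arrival order. It uses `xml.etree.ElementTree.XMLPullParser` from the Python standard library; the helper name `stream_events` is an assumption of this example. Note that ElementTree still builds a tree internally, so this illustrates the event order a streaming algorithm sees, not a memory-efficient implementation.

```python
import xml.etree.ElementTree as ET


def stream_events(chunks):
    """Return (event, tag) pairs in the order the parser reports them."""
    parser = ET.XMLPullParser(events=("start", "end"))
    events = []
    for chunk in chunks:
        parser.feed(chunk)
        # Collect whatever events are available so far.
        events.extend((kind, elem.tag) for kind, elem in parser.read_events())
    parser.close()
    # Events not yet retrieved at close time can still be read afterwards.
    events.extend((kind, elem.tag) for kind, elem in parser.read_events())
    return events


EVENTS = stream_events(["<a><c>c1</c>", "<b>b1</b></a>"])
print(EVENTS)
```

A streaming algorithm sees each event once, in this order, and can only rely on whatever state it chose to keep.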

Streaming XML Algorithms

• XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]

• X-scan [Ives, Levy, and Weld 00]

• XMLTK [Avila-Campillo et al 02]

• XTrie [Chan et al 02]

• SPEX [Olteanu, Kiesling, and Bry 03]

• Lazy DFAs [Green et al 03]

• The XPush Machine [Gupta and Suciu 03]

• XSQ [Peng and Chawathe 03]

• TurboXPath [Josifovski, Fontoura, and Barta 04]

• …

Our Results

• Space lower bounds for evaluating XPath on XML streams

• A streaming XML algorithm

  • Matches the lower bounds on a large fragment of the language

  • Uses space sub-linear in the query size rather than exponential in the query size

Related Work

• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]

• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]

• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

Data Complexity [Vardi 82]

(Q,D) Evaluation function of a query Q on document D.

Q(D) Evaluation function of a fixed query Q on document D.

Data complexity of Q: Complexity of the best algorithm for Q on the worst D.

Worst-case data complexity: max_Q (data complexity of Q).

We characterize the data complexity of Q separately for each Q (not just the worst-case one).

XPath Fragment

1. Queries are subsumption-free

[Figure: two query trees.]

Not subsumption-free: root, conference with children name = PODS and name != SIGMOD.

Subsumption-free: root, conference with child name != SIGMOD.

XPath Fragment (cont.)

2. Queries are univariate

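[Figure: two query trees.]

Not univariate: root, conference with children paper_cnt and author_cnt, compared with each other (paper_cnt < author_cnt).

Univariate: root, conference with children paper_cnt < 30 and author_cnt > 30.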

XPath Fragment (cont.)

3. Queries consist of conjunctions only

4. Queries are “star-restricted”

Query Frontier Size

Definitions:

1. Frontier at u: u, its siblings, and the siblings of its ancestors.

2. FrontierSize(Q): size of the largest frontier in Q.

Theorem 1: For all queries Q in the fragment,

stream-space(Q) = Ω(FrontierSize(Q)).

[Figure: query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name: root, conference with children name (= PODS) and speaker; speaker with children name and paper_cnt (> 1).]
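The frontier definition can be turned into a small computation. The following is a minimal sketch that applies the definition literally to a query tree; the `Node` class and the function names are assumptions of this example, not part of the talk.

```python
class Node:
    """A query-tree node (hypothetical helper type for this sketch)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def frontier(u):
    """u, its siblings, and the siblings of its ancestors."""
    nodes = [u]
    node = u
    while node.parent is not None:
        # First pass adds the siblings of u, later passes the siblings
        # of each ancestor of u.
        nodes.extend(s for s in node.parent.children if s is not node)
        node = node.parent
    return nodes


def all_nodes(root):
    yield root
    for child in root.children:
        yield from all_nodes(child)


def frontier_size(root):
    """Size of the largest frontier over all nodes of the query tree."""
    return max(len(frontier(u)) for u in all_nodes(root))


# Query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name.
QUERY = Node("root", [
    Node("conference", [
        Node("name"),
        Node("speaker", [Node("name"), Node("paper_cnt")]),
    ]),
])

print(frontier_size(QUERY))
```

For this query the largest frontier is reached at either child of speaker: the node itself, its sibling, and the name child of conference, for a frontier size of 3.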

Document Recursion Depth

Definition:

recDepth_Q(D): Max number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.

Theorem 2: For all queries Q in the fragment that have at least one “//” node,

stream-space(Q) = Ω(recDepth_Q(D)).

[Figure: Query Q: root, //part with children name and number. Document D (nodes x0 to x7): nested part elements with name and number children; text values Refrigerator, Compressor, 12, 456.]
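As a rough illustration of recursion depth, the sketch below handles only the simplest case: a single query node of the form //tag, for which "path match" is assumed to reduce to tag equality. The document used here is a hypothetical reconstruction in the spirit of the figure (a part nested inside a part), and `rec_depth` is a name chosen for this example.

```python
import xml.etree.ElementTree as ET


def rec_depth(elem, tag):
    """Max number of elements named `tag` on any root-to-leaf path below elem.

    Simplifying assumption: for a query node //tag, a document node
    path-matches it exactly when its tag equals `tag`.
    """
    here = 1 if elem.tag == tag else 0
    below = max((rec_depth(child, tag) for child in elem), default=0)
    return here + below


# Hypothetical document: a refrigerator part that contains a compressor part.
DOC = ET.fromstring(
    "<part><name>Refrigerator</name><number>456</number>"
    "<part><name>Compressor</name><number>12</number></part></part>"
)

print(rec_depth(DOC, "part"))
```

Here two part elements lie on one root-to-leaf path, so the recursion depth with respect to //part is 2, while name never nests inside name and has recursion depth 1.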

Document Depth

Definition:

depth(D): Length of the longest root-to-leaf path.

Theorem 3: For all queries Q in the fragment that have at least one “/” node,

stream-space(Q) = Ω(log depth(D)).

[Figure: Document D, the same document as on the Document Recursion Depth slide.]
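One informal way to read the log depth(D) term: an algorithm that keeps track of its current level in the document needs a counter, and a counter that can reach depth(D) takes about log2 depth(D) bits. The sketch below measures this on a stream of (event, tag) pairs; it counts elements on the deepest path, and both function names are assumptions of this example.

```python
def max_depth(events):
    """Deepest nesting level reached in a stream of (event, tag) pairs."""
    level = 0
    deepest = 0
    for kind, _tag in events:
        if kind == "start":
            level += 1
            deepest = max(deepest, level)
        else:  # "end"
            level -= 1
    return deepest


def counter_bits(depth):
    """Bits needed for a counter that can hold any level from 0 to depth."""
    return max(1, depth.bit_length())


# Events for <a><c>c1</c><b>b1</b></a>.
EVENTS = [("start", "a"), ("start", "c"), ("end", "c"),
          ("start", "b"), ("end", "b"), ("end", "a")]

print(max_depth(EVENTS), counter_bits(max_depth(EVENTS)))
```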

New algorithm

Theorem 4(a):

For all queries Q in “Univariate XPath”:

Space: O(|Q| · recDepth(D) · log depth(D)).

Time: O(|D| · |Q| · recDepth(D)).

Theorem 4(b):

For all queries Q in a subset of our fragment and for non-recursive documents D:

Space: O(FrontierSize(Q) · log depth(D)).

Time: O(|D| · FrontierSize(Q)).

Proof of Theorem 1

Fragment:

• “subsumption-free”

• “univariate”

• Conjunctions only

• “star-restricted”

Theorem 1: For all queries Q in the fragment,

stream-space(Q) = Ω(FrontierSize(Q)).

[Figure: query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name.]

Critical Document

Definition: Document D is critical for query Q if:

(1) D matches Q.

(2) If we remove any node from D, it no longer matches Q.

[Figure: Query Q (/conference[name = PODS]/speaker[paper_cnt > 1]/name) and Document D (the conference tree with nodes x0 to x8).]
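The definition can be checked mechanically on small examples. The sketch below uses a deliberately simple matcher for child-axis, conjunction-only queries, and it interprets "remove a node" as removing the node together with its subtree; both choices, and all names in the code, are assumptions of this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class QNode:
    tag: str
    pred: Optional[Callable[[str], bool]] = None
    children: List["QNode"] = field(default_factory=list)


@dataclass
class DNode:
    tag: str
    text: str = ""
    children: List["DNode"] = field(default_factory=list)


def matches(q, d):
    """d matches q if tags agree, the predicate holds, and every query
    child is matched by some document child (child axis only)."""
    if q.tag != d.tag:
        return False
    if q.pred is not None and not q.pred(d.text):
        return False
    return all(any(matches(qc, dc) for dc in d.children) for qc in q.children)


def without(d, target):
    """Copy of d with the subtree rooted at `target` removed."""
    kept = [without(c, target) for c in d.children if c is not target]
    return DNode(d.tag, d.text, kept)


def descendants(d):
    for child in d.children:
        yield child
        yield from descendants(child)


def is_critical(q, d):
    """D matches Q, and removing any single non-root node breaks the match."""
    return matches(q, d) and all(
        not matches(q, without(d, node)) for node in descendants(d)
    )


# /conference[name = PODS]/speaker[paper_cnt > 1]/name as a query tree.
QUERY = QNode("conference", children=[
    QNode("name", pred=lambda t: t == "PODS"),
    QNode("speaker", children=[
        QNode("name"),
        QNode("paper_cnt", pred=lambda t: int(t) > 1),
    ]),
])


def speaker(name, count):
    return DNode("speaker", children=[DNode("name", name), DNode("paper_cnt", count)])


FULL_DOC = DNode("conference", children=[
    DNode("name", "PODS"), speaker("Josifovski", "1"), speaker("Fagin", "3"),
])
CRITICAL_DOC = DNode("conference", children=[
    DNode("name", "PODS"), speaker("Fagin", "3"),
])

print(is_critical(QUERY, FULL_DOC), is_critical(QUERY, CRITICAL_DOC))
```

Under these assumptions the two-speaker document matches the query but is not critical, since the Josifovski speaker can be removed without breaking the match; the one-speaker document is critical.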

Main Lemmas

Lemma 1: For all queries Q in the fragment and any critical document D for Q,

stream-space(Q) = Ω(FrontierSize(D)).

Lemma 2: For all queries Q in the fragment, there is a critical document D such that

FrontierSize(D) = FrontierSize(Q).

Theorem 1: For all queries Q in the fragment,

stream-space(Q) = Ω(FrontierSize(Q)).

One-way Communication Complexity

[Figure: Alice holds x, Bob holds y. Alice sends a single message m to Bob, and Bob outputs f(x,y).]

f: X × Y → Z

CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.

Reduction

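A: streaming algorithm for Q using space S.

[Figure: the document D is split into a prefix and a suffix. Alice runs A on the prefix and sends the resulting state of A to Bob; Bob resumes A from that state on the suffix and outputs Q(D).]

Theorem: stream-space(Q) ≥ CC(Q)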
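The reduction can be simulated directly: any streaming algorithm yields a one-way protocol whose message is the algorithm's state at the cut point. The sketch below uses a toy evaluator for the Boolean query /a[b and c] over (event, tag) pairs; `ToyFilter`, `alice`, and `bob` are hypothetical names, and the evaluator is not the algorithm from the talk.

```python
class ToyFilter:
    """Toy streaming evaluator for the Boolean query /a[b and c]."""

    def __init__(self, state=(0, False, False, False)):
        self.level, self.root_is_a, self.seen_b, self.seen_c = state

    def feed(self, events):
        for kind, tag in events:
            if kind == "start":
                self.level += 1
                if self.level == 1:
                    self.root_is_a = (tag == "a")
                elif self.level == 2 and self.root_is_a:
                    self.seen_b = self.seen_b or tag == "b"
                    self.seen_c = self.seen_c or tag == "c"
            else:  # "end"
                self.level -= 1
        return self

    def state(self):
        return (self.level, self.root_is_a, self.seen_b, self.seen_c)

    def result(self):
        return self.root_is_a and self.seen_b and self.seen_c


def alice(prefix):
    """Alice runs the algorithm on the prefix and sends only its state."""
    return ToyFilter().feed(prefix).state()


def bob(message, suffix):
    """Bob resumes from Alice's state and finishes on the suffix."""
    return ToyFilter(message).feed(suffix).result()


# Events for <a><c>c1</c><b>b1</b></a>, cut at every possible position.
DOC = [("start", "a"), ("start", "c"), ("end", "c"),
       ("start", "b"), ("end", "b"), ("end", "a")]
ANSWERS = [bob(alice(DOC[:i]), DOC[i:]) for i in range(len(DOC) + 1)]
print(ANSWERS)
```

Wherever the document is cut, Bob recovers the correct answer from the state alone, so the size of the state is an upper bound on the communication and, conversely, a communication lower bound is a space lower bound.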

Fooling Set Technique

Partitioned document: a document split into a document prefix and a document suffix.

Definition

A set T of partitioned documents is a fooling set for Q if:

1. All documents in T match Q.

2. For any two distinct documents D1, D2 in T, either the prefix of D1 followed by the suffix of D2 does not match Q, or the prefix of D2 followed by the suffix of D1 does not match Q.

Theorem: For any fooling set T, CC(Q) = Ω(log |T|).

Proof of Lemma 1

Lemma 1: For all queries Q in the fragment and any critical document D for Q,

stream-space(Q) = Ω(FS(D)).

[Figure: Query Q (root, conference with children name = PODS and speaker; speaker with children name and paper_cnt > 1) and a critical Document D: x0 root, x1 conference, x2 name (PODS), x3 speaker, x4 name (Fagin), x5 paper_cnt (3).]

Proof of Lemma 1

For each subset S of Frontier(D), define a partitioned document D_S.

Example: S = { x2, x5 }

[Figure: Query Q and the partitioned Document D_S for S = { x2, x5 }, built from the critical document D (nodes x0 to x5).]

Proof of Lemma 1 (cont)

Claim: { D_S : S is a subset of Frontier(D) } is a fooling set.

Consequence: stream-space(Q) ≥ log(2^FS(D)) = FS(D).

Proof of Claim:

1. For all S, D_S matches Q.

2. If S ≠ T, need: either D_ST or D_TS does not match Q.
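The counting argument can be checked on an abstract model of the construction. In the sketch below a partitioned document is reduced to the sets of frontier nodes present in its prefix and in its suffix, the prefix of the document for S contains exactly S, its suffix contains the remaining frontier nodes, and a document matches only if no frontier node is missing (which is what criticality gives). This modelling, and every name in the code, is an assumption made for illustration rather than the construction from the proof.

```python
import math
from itertools import chain, combinations

# Frontier of the critical document D: conference name, speaker name, paper_cnt.
FRONTIER = frozenset({"x2", "x4", "x5"})


def matches(prefix, suffix):
    """Abstract matcher: every frontier node must appear in prefix or suffix."""
    return (prefix | suffix) == FRONTIER


def partitioned(subset):
    """Model of the document for S: prefix holds S, suffix holds the rest."""
    s = frozenset(subset)
    return (s, FRONTIER - s)


def subsets(items):
    items = sorted(items)
    return chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )


def is_fooling_set(docs):
    """Both fooling-set conditions: all match, and every pair has a
    cross document (prefix of one, suffix of the other) that fails."""
    if not all(matches(p, s) for p, s in docs):
        return False
    return all(
        not matches(p1, s2) or not matches(p2, s1)
        for (p1, s1), (p2, s2) in combinations(docs, 2)
    )


DOCS = [partitioned(s) for s in subsets(FRONTIER)]
print(len(DOCS), is_fooling_set(DOCS), math.log2(len(DOCS)))

# The pair from the example: S = {x2, x5}, T = {x4, x5}.
S_DOC = partitioned({"x2", "x5"})
T_DOC = partitioned({"x4", "x5"})
print(matches(T_DOC[0], S_DOC[1]))  # prefix of T with suffix of S lacks x2
```

In this model there are 2^3 = 8 partitioned documents, they form a fooling set, and log2 of the set size equals the frontier size, 3.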

Proof of Claim (example)

[Figure: Document D_S for S = { x2, x5 }, Document D_T for T = { x4, x5 }, and the cross document D_TS. In D_TS the conference name (x2) is missing, so D_TS does not match Q.]

Algorithm

• Uses the query as an NFA

• Based on three global data structures

  • Pointer array

  • Validation array

  • Level array

• Matches the lower bounds for a fragment of XPath.

Algorithm Example Run

Query: /a[b and c]

Input XML: <a> <c>c1</c> <b>b1</b> </a> ...

Query tree: $ (u0), /a (u1), with children /b (u2) and /c (u3)

[Figure sequence: the pointer array, validation array, and level array as the input is read. Each entry shows a query node, its validation flag (F or T), and its level.]

• Initially: the pointer array has one entry; a is F at level 1.

• After <a>: Index 0 holds a (F, level 1); Index 1 holds b (F, level 2) and c (F, level 2).

• After </c>: c becomes T at level 2; b is still F.

• After </b>: b becomes T at level 2.

• After </a>: a becomes T at level 1. Return TRUE.
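The flavor of the example run (flags that start as F and flip to T when an element closes) can be imitated with a much simpler stack-based sketch. This is not the algorithm from the talk: it keeps one frame per open element, so its space grows with document depth, and it assumes child axes only, no value predicates, and distinct tags among sibling query nodes. The nested-dict query encoding and the name `stream_match` are assumptions of this example.

```python
def stream_match(query, events):
    """Boolean match of a child-axis, conjunction-only query on a stream.

    query:  nested dicts, e.g. {"a": {"b": {}, "c": {}}} for /a[b and c].
    events: iterable of ("start" | "end", tag) pairs.
    """
    # Each frame: (query node for this element or None, validated child tags).
    stack = [(query, set())]
    for kind, tag in events:
        if kind == "start":
            parent_q, _ = stack[-1]
            child_q = parent_q.get(tag) if parent_q is not None else None
            stack.append((child_q, set()))
        else:  # "end": the element is validated if all its query children were
            q, validated = stack.pop()
            if q is not None and validated == set(q):
                stack[-1][1].add(tag)
    root_q, validated = stack[0]
    return validated == set(root_q)


QUERY = {"a": {"b": {}, "c": {}}}

# Events for <a><c>c1</c><b>b1</b></a>.
EVENTS = [("start", "a"), ("start", "c"), ("end", "c"),
          ("start", "b"), ("end", "b"), ("end", "a")]

print(stream_match(QUERY, EVENTS))
```

On the example input, c is validated when it closes, then b, and finally a, so the function returns True, mirroring the last step of the run.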

Conclusion: our Contributions

• Space lower bounds on the instance data complexity of XPath on XML streams:

  1. In terms of Query Frontier Size

  2. In terms of Document Recursion Depth

  3. In terms of Document Depth

• A streaming XML algorithm

  • Matches the lower bounds on a fragment of the language

  • Does not use finite-state automata

XPath 1.0

/conference/name

[Figure: document D, the conference tree with nodes x0 to x8 (labels abbreviated: C = conference, N = name, S = speaker, P = paper_cnt), and query Q: $ (u0), /C (u1), /N (u2).]

Result: { x2 }

XPath 1.0

/conference//name

[Figure: document D, the conference tree with nodes x0 to x8 (labels abbreviated: C = conference, N = name, S = speaker, P = paper_cnt), and query Q: $ (u0), /C (u1), //N (u2).]

Result: { x2, x4, x7 }

Reduction

A: S-space streaming algorithm for Q.

r ≥ 1: integer.

[Figure (r = 6): the document D is split into segments held alternately by Alice and Bob; the states s0, s1, ..., s6 of A are passed between them, and Q(D) is output at the end.]

Theorem: S ≥ CC(Q_r) / r