On the Memory Requirements of XPath Evaluation over XML Streams

Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski

IBM Almaden Research Center

Preliminaries: XML

<conference>
<name> PODS </name>
<speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
<speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>

[Figure: document tree. conference has two speaker children, each with a name and a paper_cnt child; values Josifovski / 1 and Fagin / 3; node labels x4, x5, x7.]

Preliminaries: XPath 1.0

/conference[name = PODS]/speaker[paper_cnt > 1]/name

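[Figure: query tree next to the document tree. Query: conference with children name (= PODS) and speaker; speaker with children name and paper_cnt. Document: conference with two speaker children, values Josifovski / 1 and Fagin / 3; node labels x4, x5, x7.]

Result: { x7 }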
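A minimal sketch of what the query on this slide selects, written in Python with the standard library's ElementTree. The query is evaluated by hand rather than by an XPath engine, and the conference-level name element with value PODS is an assumption inferred from the query's predicate. The helper names `text` and `evaluate` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed document: both speakers come from the slide; the conference-level
# <name> PODS </name> element is an assumption so that the query can match.
DOC = """
<conference>
  <name> PODS </name>
  <speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
  <speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>
"""

def text(elem, tag):
    """Stripped text of the first child named `tag`, or None.
    Simplification: real XPath predicates range over all such children."""
    child = elem.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip()

def evaluate(doc):
    """Hand-coded /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    root = ET.fromstring(doc)
    if root.tag != "conference" or text(root, "name") != "PODS":
        return []
    selected = []
    for speaker in root.findall("speaker"):
        count = text(speaker, "paper_cnt")
        if count is not None and float(count) > 1:
            selected.extend((n.text or "").strip() for n in speaker.findall("name"))
    return selected

selected = evaluate(DOC)
print(selected)
```

Only the second speaker passes the paper_cnt > 1 predicate, so the single selected name node is the one the slide labels x7.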

XML Streams

XML stream: XML document arriving as a one-way stream

Critical resources:

• Memory

• Processing time

Why XML streams?

• For transferring XML between systems

• For efficient access to large XML documents

Streaming XML Algorithms

• XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]

• X-scan [Ives, Levy, and Weld 00]

• XMLTK [Avila-Campillo et al 02]

• XTrie [Chan et al 02]

• SPEX [Olteanu, Kiesling, and Bry 03]

• Lazy DFAs [Green et al 03]

• The XPush Machine [Gupta and Suciu 03]

• XSQ [Peng and Chawathe 03]

• TurboXPath [Josifovski, Fontoura, and Barta 04]

• …

Our Results

• Space lower bounds for evaluating XPath on XML streams

• A streaming XML algorithm

  • Matches the lower bounds on a large fragment of the language

  • Uses space sub-linear in the query size rather than exponential in the query size

Related Work

• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]

• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]

• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

Data Complexity [Vardi 82]

(Q,D) Evaluation function of a query Q on document D.

Q(D) Evaluation function of a fixed query Q on document D.

Data complexity of Q: Complexity of the best algorithm for Q on the worst D.

Worst-case data complexity: max_Q (data complexity of Q).

We characterize the data complexity of Q separately for each Q (not just the worst-case one).

XPath Fragment

1. Queries are subsumption-free

[Figure: two query trees rooted at conference. Not subsumption-free: predicates name = PODS and name != SIGMOD. Subsumption-free: predicate name != SIGMOD only.]

XPath Fragment (cont.)

2. Queries are univariate

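[Figure: two query trees rooted at conference, each with children paper_cnt and author_cnt. Not univariate: the predicate on the two children is not shown in the extracted text. Univariate: paper_cnt < 30 and author_cnt > 30.]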

XPath Fragment (cont.)

3. Queries consist of conjunctions only

4. Queries are “star-restricted”

Query Frontier Size

Definitions:

1. Frontier at u: u, its siblings, and the siblings of its ancestors.

2. FrontierSize(Q): size of largest frontier.

Theorem 1: For all queries Q in the fragment,

stream-space(Q) = Ω(FrontierSize(Q)).

[Figure: query tree. conference has children name (= PODS) and speaker; speaker has children name and paper_cnt.]
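The two definitions on this slide can be turned into a short computation. The sketch below is in Python and assumes a query tree encoded as nested `(label, children)` tuples, with each predicate folded into its node label. Both the encoding and the name `frontier_size` are hypothetical.

```python
# Query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name.
# Encoding assumption: (label, [children]); predicates live in the label.
QUERY = ("conference", [
    ("name = PODS", []),
    ("speaker", [
        ("name", []),
        ("paper_cnt > 1", []),
    ]),
])

def frontier_size(node, n_siblings=0, ancestor_siblings=0):
    """Size of the largest frontier over all nodes in the subtree of `node`.

    The frontier at u is u, its siblings, and the siblings of its ancestors,
    so its size is 1 + n_siblings + ancestor_siblings.
    """
    _label, children = node
    best = 1 + n_siblings + ancestor_siblings
    for child in children:
        best = max(best, frontier_size(child,
                                       len(children) - 1,
                                       ancestor_siblings + n_siblings))
    return best

size = frontier_size(QUERY)
print(size)
```

For this query the largest frontier sits at either child of speaker: the node itself, its one sibling, and the conference-level name node.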

Document Recursion Depth

Definition:

recDepth_Q(D): Max number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.

Theorem 2: For all queries Q in the fragment that have at least one “//” node,

stream-space(Q) = Ω(recDepth_Q(D)).

[Figure: Query Q rooted at //part, with name and number nodes. Document D with name and number nodes; values Refrigerator, Compressor, and 12; node label x5.]
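As an illustration of the recursion depth definition, the Python sketch below counts nested elements that path match the single query node //part while reading the document as a stream. The sample document is an assumption (its values are borrowed from the figure), and the function covers only this one query node, not the general definition.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: a part that contains another part.
DOC = ("<part><name>Refrigerator</name>"
       "<part><name>Compressor</name><number>12</number></part>"
       "</part>")

def rec_depth(xml_text, tag="part"):
    """Max number of `tag` elements open at once, i.e. lying on one
    root-to-leaf path. Every such element path matches the node //tag."""
    open_now = deepest = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == tag:
            if event == "start":
                open_now += 1
                deepest = max(deepest, open_now)
            else:
                open_now -= 1
        if event == "end":
            elem.clear()  # drop text and children that are no longer needed
    return deepest

depth_for_part = rec_depth(DOC)
print(depth_for_part)
```

Two sibling part elements would not increase the count, because they never lie on the same root-to-leaf path.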

Document Depth

Definition:

depth(D): Length of longest root-to-leaf path.

Theorem 3: For all queries Q in the fragment that have at least one “/” node,

stream-space(Q) = Ω(log depth(D)).

[Figure: Document D with name and number nodes; values Refrigerator, Compressor, and 12; node label x5.]
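One way to read the log depth(D) term: a streaming algorithm that tracks the current nesting level needs a counter, and a counter up to depth(D) takes about log depth(D) bits. The Python sketch below computes depth(D) with such a counter. It assumes depth is counted in elements (text nodes ignored), and the sample document is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; values borrowed from the figure.
DOC = ("<part><name>Refrigerator</name>"
       "<part><name>Compressor</name><number>12</number></part>"
       "</part>")

def stream_depth(xml_text):
    """Longest root-to-leaf path, counted in elements, using one counter."""
    level = deepest = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            level += 1                      # one element deeper
            deepest = max(deepest, level)
        else:
            level -= 1                      # element closed
            elem.clear()
    return deepest

d = stream_depth(DOC)
# bit_length() is the number of bits needed to store the level counter
print(d, d.bit_length())
```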

New algorithm

Theorem 4(a):

For all queries Q in “Univariate XPath”:

Space: O(|Q| recDepth(D) log depth(D)).

Time: O(|D| |Q| recDepth(D)).

Theorem 4(b):

For all queries Q in a subset of our fragment and for non-recursive documents D,

Space: O(FrontierSize(Q) log depth(D)).

Time: O(|D| FrontierSize(Q)).

Proof of Theorem 1

Fragment:

• “subsumption-free”

• “univariate”

• Conjunctions only

• “star-restricted”

[Figure: query tree. conference has children name (= PODS) and speaker; speaker has children name and paper_cnt.]

Critical Document

Definition: Document D is critical for query Q, if:

(1) D matches Q.

(2) If we remove from D any node, it no longer matches Q.

[Figure: Query Q: conference with children name (= PODS) and speaker; speaker with children name and paper_cnt. Document D: conference with two speaker children, values Josifovski / 1 and Fagin / 3; node labels x4, x5, x7.]

Main Lemmas

Lemma 1: For all queries Q in the fragment and any critical document D for Q,

stream-space(Q) = Ω(FrontierSize(D)).

Lemma 2: For all queries Q in the fragment, there is a critical document D so that

FrontierSize(D) = FrontierSize(Q).

One-way Communication Complexity

f: (X, Y) → Z

[Figure: Alice and Bob computing f(x,y) with one-way communication.]

CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.

Reduction

A : streaming algorithm for Q using space S

[Figure: Alice sends Bob state_A, the state of algorithm A.]

Theorem: stream-space(Q) >= CC(Q)
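The reduction can be simulated in a few lines: Alice runs the streaming algorithm on the first part of the input, sends its state, and Bob resumes from that state on the rest. The Python sketch below uses a toy algorithm over a token stream instead of an XPath evaluator. The class `ContainsBoth` and the function `one_way_protocol` are hypothetical names, and the one-byte state encoding is an assumption made for illustration.

```python
class ContainsBoth:
    """Toy streaming algorithm A: accepts iff tokens 'b' and 'c' both occur."""

    def __init__(self, seen_b=False, seen_c=False):
        self.seen_b = seen_b
        self.seen_c = seen_c

    def step(self, token):
        if token == "b":
            self.seen_b = True
        elif token == "c":
            self.seen_c = True

    def accepts(self):
        return self.seen_b and self.seen_c

    def to_bytes(self):
        # The whole state of A fits in two bits, packed into one byte.
        return bytes([int(self.seen_b) | (int(self.seen_c) << 1)])

    @classmethod
    def from_bytes(cls, message):
        bits = message[0]
        return cls(bool(bits & 1), bool(bits & 2))

def one_way_protocol(prefix, suffix):
    """Alice processes `prefix`, sends A's state; Bob finishes on `suffix`."""
    alice = ContainsBoth()
    for token in prefix:
        alice.step(token)
    message = alice.to_bytes()              # the only communication
    bob = ContainsBoth.from_bytes(message)
    for token in suffix:
        bob.step(token)
    return bob.accepts(), len(message)

answer, cost = one_way_protocol(["a", "c"], ["b"])
print(answer, cost)
```

The message is exactly the memory of A, so any lower bound on the communication is also a lower bound on the space A uses.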

Fooling Set Technique

Partitioned document: a document split into a document prefix and a document suffix.

Definition

A set T of partitioned documents is a fooling set for Q if:

1. All documents in T match Q.

2. For any two distinct documents D1 = (α1, β1) and D2 = (α2, β2) in T, either the prefix α1 followed by the suffix β2 does not match Q, or the prefix α2 followed by the suffix β1 does not match Q.

Theorem: For any fooling set T, CC(Q) = Ω(log |T|).
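The fooling set conditions are easy to check mechanically on a small example. The Python sketch below uses a toy stand-in for "matches Q": a partitioned document is a (prefix, suffix) pair of item sets, and it matches when prefix and suffix together contain all three items. The item labels, the toy match rule, and the names `is_fooling_set` and `subsets` are assumptions for illustration, not the construction used in the talk.

```python
from itertools import combinations
from math import log2

ITEMS = frozenset({"x2", "x4", "x5"})       # hypothetical labels

def matches(prefix, suffix):
    """Toy 'query': prefix and suffix together must contain every item."""
    return set(prefix) | set(suffix) == ITEMS

def is_fooling_set(docs, match):
    """Check both conditions of the fooling set definition."""
    if not all(match(p, s) for p, s in docs):
        return False                        # condition 1 fails
    for (p1, s1), (p2, s2) in combinations(docs, 2):
        if match(p1, s2) and match(p2, s1):
            return False                    # both crossings match: not fooling
    return True

def subsets(items):
    ordered = sorted(items)
    for r in range(len(ordered) + 1):
        for combo in combinations(ordered, r):
            yield frozenset(combo)

# One partitioned document per subset S: S in the prefix, the rest in the suffix.
family = [(S, ITEMS - S) for S in subsets(ITEMS)]
ok = is_fooling_set(family, matches)
bound = log2(len(family))
print(ok, len(family), bound)
```

With three items the family has 2^3 documents, so the bound log |T| works out to 3 bits.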

Proof of Lemma 1

Lemma 1: For all queries Q in the fragment and any critical document D for Q,

stream-space(Q) = Ω(FS(D)).

[Figure: Query Q: conference with name (= PODS) and speaker nodes, and a paper_cnt node. Document D: conference, speaker, name Fagin, paper_cnt 3.]

Proof of Lemma 1

For each subset S of Frontier(D), define a partitioned document D_S:

[Figure: Query Q and the partitioned document D_S for S = { x2, x5 }. D_S contains conference, speaker, name Fagin, paper_cnt 3.]

Proof of Lemma 1 (cont)

Claim: { D_S : S ⊆ Frontier(D) } is a fooling set.

Hence stream-space(Q) >= log(2^FS(D)) = FS(D).

Proof of Claim:

1. For all S, D_S matches Q.

2. If S ≠ T, need: either D_ST or D_TS does not match Q.

Proof of Claim (example)

[Figure: Document D_S with S = { x2, x5 }, document D_T with T = { x4, x5 }, and the combined document D_TS (root x0), annotated “Conference name missing!”. Speaker values: name Fagin, paper_cnt 3.]

Algorithm

• Uses the query as an NFA

• Based on three global data structures

  • Pointer array

  • Validation array

  • Level array

• Matches the lower bounds for a fragment of XPath.

Algorithm Example Run

Query: /a[b and c]

Input XML: <a> <c>c1</c> <b>b1</b> </a>...

[Figure: animation stepping through the input XML. Each frame shows the level array, the validation array, and the pointer array (one entry at first, then index 0 and index 1), with query nodes u2 and u3 joined by the /c step. In the last frame, /a is returned.]
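For intuition about the example run, here is a simplified streaming check for the single query /a[b and c], sketched in Python. It is not the pointer, validation, and level array algorithm from the talk. It keeps only a level counter and one flag per predicate child, which is enough for this one hard-coded query. The function name and the test inputs are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def match_a_with_b_and_c(xml_text):
    """True iff the root element is <a> and it has a <b> child and a <c> child."""
    level = 0                                # current nesting level
    root_is_a = False
    seen = {"b": False, "c": False}          # one flag per predicate child
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            level += 1
            if level == 1:
                root_is_a = elem.tag == "a"
            elif level == 2 and root_is_a and elem.tag in seen:
                seen[elem.tag] = True        # child of the root satisfies a predicate
        else:
            level -= 1
            elem.clear()                     # drop content that is no longer needed
    return root_is_a and all(seen.values())

matched = match_a_with_b_and_c("<a> <c>c1</c> <b>b1</b> </a>")
print(matched)
```

The flags can be set in either order, which is why the c child arriving before the b child does not matter.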

Conclusion: Our Contributions

• Space lower bounds on the instance data complexity of XPath on XML streams:

1. In terms of Query Frontier Size

2. In terms of Document Recursion Depth

3. In terms of Document Depth

• A streaming XML algorithm

  • Matches the lower bounds on a fragment of the language

  • Does not use finite-state automata

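XPath 1.0

/conference/name

[Figure: document tree with values Josifovski / 1 and Fagin / 3; node labels x4, x5, x7, x8.]

Result: { x2 }

XPath 1.0

/conference//name

[Figure: document tree with values Josifovski / 1 and Fagin / 3; node labels x4, x5, x7, x8.]

Result: { x2, x4, x7 }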


Reduction

A : S-space streaming algorithm for Q.

r >= 1: integer.

[Figure: Alice and Bob; states s0, s1, s2, s3, s4, s5, s6 of A, shown for r = 6; output Q(D).]

Theorem: S >= CC(Q_r) / r
