Automaton Meets Algebra: A Hybrid Paradigm for XML Stream Processing⋆

Hong Su, Elke A. Rundensteiner, and Murali Mani

CS Department, Worcester Polytechnic Institute, Worcester, MA, 01609–2280, USA

Abstract

XML stream applications bring the challenge of efficiently processing queries on sequentially accessible token-based data streams. The automata paradigm is naturally suited for pattern recognition on tokenized XML streams, but requires patches for fulfilling the filtering or restructuring functionalities in the XML query language. In contrast, the algebraic paradigm is a well-established technique for processing self-contained tuples. It does not, however, traditionally support token inputs. The Raindrop framework is the first to accommodate these two paradigms within one algebraic framework, taking advantage of both. This paper describes the overall framework, highlighting in particular three aspects. First, we describe how the tokens and tuples are modeled in one uniform query processing model. Second, we present the query rewriting that switches computations between these two data models. Third, we discuss strategies for the implementation and synchronization of the operators within the framework. We report experimental results that illustrate the unique optimization opportunities offered by this novel framework.

Key words: XML Stream, XQuery Processing, Automata, Algebra

1 Motivation of XML Stream Processing

There is a growing interest in data stream applications such as monitoring systems for stock, traffic and network activities [6]. Various research projects have recently targeted stream applications, such as Aurora [3], Borealis [11], STREAM [7], Niagara [10], TelegraphCQ [9], Cougar [14] and CAPE [32]. Many of them focus on relational or object applications which assume a tuple-like data model (a tuple can contain flat values and objects as in a relational or object database).

⋆ This research is partly supported by NSF under Grant No. 0414567. Hong Su would like to thank IBM for the IBM Cooperative Fellowship from 2001 - 2004.

Preprint submitted to Elsevier Science 4 October 2005

Due to the proliferation of XML data in web services [28], there is also a surge in XML stream applications [8,13,17,18,20,27,31,38] such as XML packet routing [2], selective dissemination of information [4], and notification systems [29]. A challenge that XML stream applications pose is that the “tuple” processing unit no longer completely fits. In the XML context, we specifically refer to a “tuple” as a set of cells, each cell containing a set of XML nodes (i.e., XML trees). This is because the XML query semantics [35] is defined as XML node outputs computed on the given XML node inputs.

However, XML streams are usually modeled as a sequence of primitive tokens, such as a start tag, an end tag or a PCDATA item. That is to say, a processing unit of XML streams has to be a token, which is at a lower granularity than an XML node. Such a processing style, i.e., a processing unit being at a lower granularity than the data model, has been little studied in the database community before. This granularity difference is a challenge specific to XML streams that has to be addressed.

2 State-of-the-Art

In this section, we describe two camps of solutions that have been proposed for XML stream processing. The first camp uses tokens as the processing unit throughout the whole evaluation process. In contrast, the second camp uses different processing units in different stages of evaluation. In the first stage, it consumes token inputs but generates tuple outputs. Tuple processing units are then used throughout the second stage.

2.1 Pure Automata Paradigm

Automata were originally designed to match patterns over strings, which is very similar to matching path expressions over tokens in XML. Several projects [20,27,31,38] are inspired to exclusively use automata for the complete query processing. Such a pure automata paradigm has to strike a balance between the expressive power of the query it can handle and the manageability of its constructs.

For example, XPush [20], using a push-down automaton, supports rather limited query capabilities. Since the push-down automaton has no output buffers, it cannot return the destination elements reachable via an XPath, let alone restructure them. It can only give a boolean result indicating whether or not an XPath is contained in the input stream.

Some projects adopt more powerful automata in order to provide more query capabilities. Typical examples are XSM [27] and XSQ [31], supporting the XQuery and XPath languages respectively. However, the increased expressive power of the queries is not gained without sacrifice. The Turing-machine-like model they adopt describes the computations at a rather low level. Such a query model is somewhat similar to a procedural language that presents all internal details of the computations. Figure 1 gives an example of how a path expression “/a” is modeled in XSM. The automaton reads a token from the input buffer each time. The state transition indicates that if a certain token has been read (expressed as the part before “|”), then the corresponding actions (expressed as the part after “|”) will be taken. For instance, the transition from state 1 to state 2 indicates that if a token <a> is read, it should be copied to an output buffer.

Fig. 1. Automata in XSM Encoding an XPath expression “/a”

Such a pure automata paradigm has not been thoroughly studied as a query processing paradigm before. Many problems that have been well studied in tuple-based algebraic frameworks remain unexplored in this new paradigm, such as how to optimize the queries in a modular fashion, how to rewrite the queries, and how to cost alternative processing plans.

2.2 Loosely-Coupled Automata and Algebra Paradigm

The tuple-based algebraic paradigm has been widely adopted by the database community for query processing. It is thus not surprising that numerous tuple-based algebras for processing static XML have been proposed [25,30,36]. Naturally it is expected that such an algebraic paradigm could also be utilized for XML stream processing so that existing techniques can be borrowed. However, as mentioned before, such an algebraic paradigm does not handle the token input data model.

Recent work, such as Tukwila [23] and YFilter [38], aims to bridge the actual token inputs and the tuple inputs typically assumed by the algebra paradigm. They process an XQuery in two stages. In the first stage, they use automata to handle all structural pattern retrieval. XML nodes are built from tokens and organized into tuples. These output tuples are then filtered or restructured in the second stage by a more conventional tuple-based algebraic engine.

We now give an example. Figure 2 (a) shows an XML stream (based on the XML benchmark XMark [1]). Each token in the XML stream is annotated with an identifier for ease of reference. Figure 2 (b) shows an XQuery on this stream. This query pairs sellers with bidders of a certain open auction. Figure 3 shows the corresponding Tukwila [23] query plan. The portions underneath and above the line describe the computations in the first (i.e., automata) and second stage (i.e., algebra) respectively. While the algebra processing is expressed as a query tree of operators (skipped in the figure), the automata processing is modeled as a single operator called X-Scan (YFilter also has a similar “path matching engine”). Tukwila assumes that retrieving a pattern in automata is rather cheap; therefore all patterns should be retrieved in the automata. As a result, the X-Scan operator exposes a fixed interface to its downstream operators, namely, the bindings to all the XPath expressions in the query as annotated beside the X-Scan in Figure 2.

Fig. 2. Example XML Document and XQuery

Fig. 3. Tukwila Query Plan for Query in Figure 2 (b)

However, in our work we will illustrate that the assumption made by Tukwila may not necessarily hold. For example, consider an alternative plan which only pushes the pattern retrieval open_auctions/open_auction and $a/initial into the X-Scan operator. Only those open_auction elements that have initial child elements are extracted into XML nodes. These nodes are then navigated to locate the remaining patterns, as we do when processing static XML data. Intuitively, patterns $a/initial, $a/seller and $a/bid/bidder are retrieved in parallel in Tukwila, while they are retrieved in a serialized manner in our alternative plan. Such an alternative will perform better than the Tukwila plan if an open_auction rarely has an initial element, so that the subsequent pattern retrieval can be avoided¹.

¹ Although Tukwila provides a follow operator which retrieves patterns in XML nodes, it is explicitly mentioned in [23] that follow will only be used for retrieving XLinks instead of XPaths. No pattern retrieval is considered for being moved out of X-Scan.

In summary, automata processing, though accommodated in an algebraic framework as an operator, is not considered for rewriting together with any other operators. Such a paradigm does not benefit from the opportunities that an algebraic framework is supposed to provide. We thus call this approach a loosely-coupled automata-algebra paradigm due to the strict separation between the token-based automata processing and the tuple-based algebraic processing.

2.3 Our Approach: Uniform Automata-Algebra Paradigm

We instead propose a paradigm that overcomes the limitations of both the pure automata and the loosely-coupled automata-algebra paradigms. We also model the pattern matching type of computation (the one captured most naturally by automata processing) as a query plan composed of operators at a finer granularity than X-Scan [23]. Such a model offers several benefits. First, the portion of the plan modeling automata processing can be reasoned over in a modular fashion. That is, optimization techniques can be studied for each operator separately rather than only for the automata as a whole. Second, since the automata processing is expressed as an algebraic plan just like the other computations, rewriting rules can be applied to, for example, switch computations into or out of the automata. We have implemented a prototype system based on this tightly-coupled automata-algebra paradigm [21].

The contributions of our system, called Raindrop, include:

• We accommodate both token-based processing and tuple-based processing within one uniform algebraic model. To model the token-based processing also as algebraic plans, we propose a data model for tokens as well as a set of algebra operators and plan structures that manipulate tokens.

• We present a three-level algebraic framework, i.e., semantics-focused plan, stream logical plan and stream physical plan. Each level adds more details to the plan at the adjacent higher level. Such a layered framework enables us to reason at different abstraction levels, thus rendering optimizations tractable and practical.

• We offer a set of rewriting rules that push or pull computations into or out of the automata. This unique optimization opportunity is not found in either pure-automata or loosely-coupled automata-algebra paradigms.

While the basic ideas of Raindrop are presented in conference papers [22,26], we now offer the following additional contributions.

• We develop efficient implementations for operators modeling automata processing. These implementations take full advantage of automata behavior and thus are in many cases more efficient than the other implementations in the literature.


• The implementations of operators modeling automata processing impose certain synchronization modes, i.e., certain operators must be invoked at a certain time to ensure both the correctness and efficiency of the execution of the plan. We propose a programming model to accommodate such modes.

• We perform extensive experiments illustrating that under different characteristics of the input sources, no single strategy that pushes computations into the automata can ensure plan optimality. This confirms the necessity of reasoning about computation push-in or pull-out of the automata.

3 Three-level Algebraic Framework Overview

The Raindrop algebraic framework is composed of plans at three levels. An XQuery will first be compiled into the plan at the highest level. Step by step, it will finally be refined into the plan at the lowest level.

1. Semantics-focused plan: The plan at this level focuses on expressing the semantics of an XQuery. The nature of the input source, i.e., whether it is stored data or tokenized stream data, is not exposed yet. General XQuery optimization techniques that are not specific to either stored or streaming data, such as XQuery decorrelation that removes nested subqueries [15,33], and query tree minimization that removes redundant pattern retrieval [5], can be applied on this query plan. (Section 4)

2. Stream logical plan: The plan at this level is specialized to account for the input being XML token streams, instead of assuming random access to the complete XML data. For this, the data model accommodates tokenized inputs. Correspondingly, new operators and new plan structures are also introduced to model the automata processing. Moreover, rewriting rules are defined to rewrite the plans involving these new constructs. (Sections 5 and 6)

3. Stream physical plan: This level provides implementation details for operators in the stream logical plan. In particular, the implementations of the operators that model automata processing have an important feature, i.e., they require certain synchronization to ensure their correctness. (Sections 7 and 8)

4 Semantics-Focused Plan

Our semantics-focused plan is based on an XML algebra called the XML Algebra Tree (XAT) [36,37]. The algebra defines a set of operators including (1) XML-specific operators, e.g., operators for navigating into the nested XML structures, and operators for XML result construction, and (2) SQL-like operators such as Select, Join, Groupby, Orderby, Union, Difference and Intersect.

The input and output of the operators are a collection of XAT tuples. A cell in an XAT tuple can be (1) an atomic value, (2) an XML element node or (3) a collection of XML element nodes. Each cell is bound to a variable that is explicitly or implicitly specified in the query. In the example XAT tuples in Figure 4, $s, $a and $b are explicitly defined in Figure 2 (b). Results of $a/initial and $b/phone are assigned names $d and $e. The cells bound to $d and $e contain a collection of XML nodes: the collection for $d ($d = $a/initial) contains one element node while the collection for $e ($e = $b/phone/text()) contains two text nodes. We use the notation “||” to separate items in a collection.

$s    <open_auctions> … </open_auctions>
$a    <open_auction> … </open_auction>
$d    <initial>15.00</initial>
$b    <seller> … </seller>
$e    508-1234567 || 508-0004567

Fig. 4. Example XAT Tuples

Table 1 gives the semantics of the XAT operators that will be used in this paper. The full set of XAT operators can be found in [37]. Some operators generate new columns. For example, a NavUnnest or NavNest operator navigates into a context node and finds the destination element nodes. Such a navigate operator generates new columns containing the destination element nodes. For example, in Table 1, NavUnnest or NavNest has one output variable $col2.

Operator    Description

Source_{sourceName}^{$col}    Bind the data source specified by sourceName to column $col.

Tagger_{p}^{$col}    Tag an input tuple u according to pattern p. Output a new tuple which is the concatenation of u and the tagged data. The tagged data is bound to the new column $col.

NavUnnest_{$col1,path}^{$col2}    Navigate into the element node in column $col1 of input tuple u. For each destination node n reachable via path, output a new tuple which is the concatenation of the input tuple and n. n is bound to the new column $col2.

NavNest_{$col1,path}^{$col2}    Navigate into the element node in column $col1 of input tuple u. All destination nodes reachable via path are aggregated into a collection N. Output a new tuple which is the concatenation of u and the collection. N is bound to the new column $col2.

Select_{c}    If input tuple u satisfies condition c, output it.

Join_{c}    If two input tuples u and v, each from a different input source, satisfy condition c, output a tuple which is the concatenation of u and v.

Table 1. Semantics of XAT Operators

Figure 5 shows the semantics-focused plan for the query in Figure 2 (b). It also shows the output XAT tuples of some operators. We now highlight the difference between the two types of navigate operators, namely, NavUnnest and NavNest. A variable binding in a “for” clause is modeled as a NavUnnest operator. For example, “for $a in Stream(“open_auctions”)/open_auctions/open_auction” is expressed as NavUnnest_{$s,/open_auctions/open_auction}^{$a}, where $s represents the input stream. The “for” clause iterates over the items in the expression results and binds the variable to each item in turn. Therefore $a of an output tuple contains only one element node (refer to the output of NavUnnest_{$s,/open_auctions/open_auction}^{$a} in Figure 5).

Fig. 5. Semantics-Focused Plan (annotated with intermediate results) for querying data in Figure 2 (a)

In contrast, a binding in a “let”, “where” or “return” clause is expressed as a NavNest. Each such clause binds a variable to the expression results without iteration. The name NavNest indicates that the output variable of an output tuple contains a collection of element nodes (refer to the outputs of NavNest_{$b,/phone/text()}^{$e} in Figure 5).
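To make the contrast concrete, the following Python sketch mimics the two navigate operators over already-built XML nodes. It is our illustration rather than part of the XAT implementation: tuples are modeled as dictionaries from variable names to cells, the function names nav_unnest and nav_nest are hypothetical, and paths follow the relative path syntax of the standard xml.etree.ElementTree module (hence "phone" rather than "/phone").

```python
import xml.etree.ElementTree as ET

def nav_unnest(tuples, col1, path, col2):
    """One output tuple per destination node reachable via path from u[col1]."""
    return [{**u, col2: node} for u in tuples for node in u[col1].findall(path)]

def nav_nest(tuples, col1, path, col2):
    """One output tuple per input tuple; all destinations form one collection."""
    return [{**u, col2: u[col1].findall(path)} for u in tuples]

seller = ET.fromstring(
    "<seller><phone>508-1234567</phone><phone>508-0004567</phone></seller>"
)
inputs = [{"$b": seller}]

unnested = nav_unnest(inputs, "$b", "phone", "$e")  # two tuples, one phone each
nested = nav_nest(inputs, "$b", "phone", "$e")      # one tuple holding both phones
```

With the same input tuple, nav_unnest multiplies the tuple by the number of destinations, while nav_nest keeps the tuple count unchanged and widens the cell instead.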

At this level, we apply general optimization heuristics for query rewriting [36,37]. Since these optimization techniques are not stream specific, they are omitted here.

5 Algebraically Modeling Token-based Processing

The second level in the framework, i.e., the stream logical level, is targeted at processing the query on a tokenized input stream. In order to maintain the “closure” property of the algebra, i.e., to use one data model throughout the algebraic framework, the XAT data model is extended at this level to accommodate tokenized data inputs. That is to say, besides the three data formats allowed in an XAT tuple cell as described in Section 4, a new data format called contextualized token is additionally supported. New operators and query plan structures are also introduced to manipulate this new data format.

5.1 Token-Based Data Format

The new data format, called contextualized token, consists of two parts: token value describes the local characteristics of the token; and token context describes the relationship between this token and the other tokens in the stream.

Token Value. A token value essentially is the information represented by a SAX event, namely, (1) the token’s type (i.e., a start tag, end tag or PCDATA item), (2) the token’s name (for a start or end tag) or the token’s content (for a PCDATA item) and (3) the token’s attributes if any (for a start tag).
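As an illustration of the token value, the Python sketch below turns SAX events into token records carrying exactly the three pieces of information listed above. The class and function names (Token, TokenCollector, tokenize) and the sample attribute are assumptions of this sketch; only the xml.sax calls belong to the standard library, and the sketch is not the Raindrop tokenizer.

```python
import xml.sax
from dataclasses import dataclass, field

@dataclass
class Token:
    kind: str                                  # "start", "end" or "text" (PCDATA)
    value: str                                 # tag name, or content for PCDATA
    attrs: dict = field(default_factory=dict)  # attributes, start tags only

class TokenCollector(xml.sax.ContentHandler):
    """Record one Token per SAX event."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(Token("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.tokens.append(Token("end", name))

    def characters(self, content):
        # A parser may deliver long text in several chunks; the short
        # content used here arrives as a single event.
        if content.strip():
            self.tokens.append(Token("text", content.strip()))

def tokenize(xml_text):
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens

tokens = tokenize('<initial currency="USD">15.00</initial>')
```

The token context discussed next is deliberately absent here; it has to be derived from the position of each token relative to the others.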

Token Context. We support context regarding the forward ancestor-descendant relationships between tokens. These relationships are most commonly queried in XPath expressions using child and descendant axis specifications.

Definition 1 A token t is associated with an element e if t is e’s start tag, end tag or direct PCDATA content. Each token is associated with exactly one element.

Definition 2 A token t is a component token of an element e if the element associated with t is e’s descendant element or e itself.

Example 1 In Figure 2 (a), token 2 is associated with an open_auction element. Tokens 2 to 29 are all component tokens of this open_auction element.

Three boolean functions are supported on the contextualized token types:

1. Reachable(t1, t2, p) compares the accessibility relationship between tokens t1 and t2: if t1 and t2 are both start tags, the function returns whether the element associated with t2 is reachable via p from the element associated with t1.

2. Within(t1, t2) compares the component relationship between tokens t1 and t2: if t1 is a start tag, this function returns whether t2 is a component token of the element associated with t1.

3. t1 = t2 compares whether t1 and t2 are associated with the same element in terms of element identity (not only the same element content).
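The definitions and the three functions above can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the context of a token is represented as the list of elements open at that token, only child steps are supported in paths, and the names CToken, contextualize, within, reachable and same_element are hypothetical. The sample stream reuses the first token identifiers of Figure 2 (a) but is otherwise shortened.

```python
from collections import namedtuple

# path holds (tid, name) for every element open at this token, root first;
# its last entry is the element the token is associated with (Definition 1).
CToken = namedtuple("CToken", "tid kind value path")

def contextualize(raw):
    """Attach context to a list of (kind, value) tokens; tids start at 1."""
    out, stack = [], []
    for tid, (kind, value) in enumerate(raw, start=1):
        if kind == "start":
            stack.append((tid, value))
        out.append(CToken(tid, kind, value, tuple(stack)))
        if kind == "end":
            stack.pop()
    return out

def within(t1, t2):
    """t2 is a component token of the element whose start tag is t1."""
    return t1.kind == "start" and t1.tid in [tid for tid, _ in t2.path]

def reachable(t1, t2, path):
    """The element of t2 is reachable from the element of t1 via child steps."""
    if t1.kind != "start" or t2.kind != "start":
        return False
    steps = [s for s in path.split("/") if s]
    n = len(t1.path)
    return t2.path[:n] == t1.path and [name for _, name in t2.path[n:]] == steps

def same_element(t1, t2):
    """t1 = t2 in the sense of element identity."""
    return t1.path[-1][0] == t2.path[-1][0]

raw = [("start", "open_auctions"), ("start", "open_auction"),
       ("start", "seller"), ("start", "sellerid"), ("text", "001"),
       ("end", "sellerid"), ("end", "seller"),
       ("end", "open_auction"), ("end", "open_auctions")]
toks = contextualize(raw)
auction, seller, sellerid = toks[1], toks[2], toks[3]
```

Here within(auction, sellerid) holds because token 4 lies inside the element opened by token 2, while reachable(auction, sellerid, "/seller") does not, since sellerid is two child steps away.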

5.2 Token-Related Operators

Operator    Description

StreamSource_{streamName}^{$col}    Bind the stream source specified by streamName to column $col.

TokenNav_{$col1,path}^{$col2}    Locate tokens that are components of the element which is accessible via path from $col1.

ExtractUnnest_{$col1}^{$col2}    Compose tokens located by TokenNav_{$col1,path}^{$col2} into XML nodes. For each destination node n reachable via path, output a new tuple which is the concatenation of the input tuple and n. n is bound to the new column $col2.

ExtractNest_{$col1}^{$col2}    Compose tokens located by TokenNav_{$col1,path}^{$col2} into XML nodes. All destination nodes reachable via path are aggregated into a collection N. Output a new tuple which is the concatenation of the input tuple and N. N is bound to the new column $col2.

StructuralJoin_{$e}    Given two input tuples u and v, if u.$e = v.$e, output a tuple which is the concatenation of u and v.

Table 2. Semantics of Token-Related Operators

Notation    Explanation

u.$c    get binding of cell $c from tuple u

< c1 = v1, c2 = v2, ... >    construct a tuple with cell c1 assigned the value v1, cell c2 assigned the value v2, ...

u1 ◦ u2    construct a tuple by concatenating tuples u1 and u2. If u1 and u2 contain cells that are bound to the same variable, remove one of the redundant cells.

++    merge operator for lists (a list is represented as [ ])

⊕    compose tokens into XML nodes

Table 3. Notations Used for Defining Token-Related Operators

We now introduce new operators that either generate or consume tuples containing contextualized tokens, as listed in Table 2. We denote the semantics of an operator by Op_{params}^{outvar}(U_n), where Op is the operator’s name, params is a list of input parameters, outvar is the output variable and U_n is a collection of the first n input tuples. We use the monoid comprehension calculus [16] to express Op_{params}^{outvar}(U_n), i.e., the output of Op on U_n. Informally, a monoid comprehension is of the form mergeFunc{ f(a, b, ...) | a ← A, b ← B, ..., pred1, pred2, ... }. In the part after “|”, A (resp. B) is a collection over which variable a (resp. b) iterates, and pred1 (or pred2) is a predicate defined over variables such as a and b. The function f(a, b, ...) constructs a collection that contains only one tuple. This single tuple in the collection is composed of a, b and so on. In the part before “|”, the function mergeFunc merges multiple collections into one collection. In summary, a monoid comprehension returns a collection, which is generated as follows:

result := an empty collection;
for each a in A, b in B, ...,
    if pred1 ∧ pred2 ∧ ...
        result := result mergeFunc f(a, b, ...)
return result.

For example, the monoid comprehension ∪{ (a, b) | a ← {1, 2}, b ← {4} } first creates two collections {(1, 4)} and {(2, 4)}, then merges them using the function ∪ and returns the collection {(1, 4), (2, 4)}.
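The evaluation procedure of a monoid comprehension can be transcribed almost literally into Python. The sketch below is only an illustration: the function name comprehension and its parameters (merge, unit, f, sources, preds) are our own, with unit standing for the empty collection of the chosen merge function.

```python
from itertools import product

def comprehension(merge, unit, f, sources, preds=()):
    """Evaluate mergeFunc{ f(a, b, ...) | a <- A, b <- B, ..., preds }."""
    result = unit
    for binding in product(*sources):           # for each a in A, b in B, ...
        if all(p(*binding) for p in preds):     # if pred1 and pred2 and ...
            result = merge(result, f(*binding))
    return result

# The example from the text, with set union as the merge function.
pairs = comprehension(
    merge=lambda x, y: x | y,
    unit=frozenset(),
    f=lambda a, b: frozenset({(a, b)}),
    sources=[[1, 2], [4]],
)

# A list comprehension with a predicate, using ++ (list concatenation).
odd_squares = comprehension(
    merge=lambda x, y: x + y,
    unit=[],
    f=lambda a: [a * a],
    sources=[[1, 2, 3]],
    preds=[lambda a: a % 2 == 1],
)
```

Swapping the merge function between set union and list concatenation is what distinguishes ∪ from the ++ used in the operator definitions below, where the order of output tuples matters.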

The notations used for defining the semantics of operators are listed in Table 3. We illustrate each operator using the example in Figure 2. Each token in the input or output is annotated with its identifier.

5.2.1 StreamSource

This operator binds the sequence of the tokens from the stream specified by strName to the output variable.

Example 2 For StreamSource_{“open_auctions”}^{$s}, its first 4 output tuples are:

$s                   $~s
<open_auctions>₁     <open_auctions>₁
<open_auctions>₁     <open_auction>₂
<open_auctions>₁     <seller>₃
<open_auctions>₁     <sellerid>₄

Suppose the operator has consumed the first n tokens of the stream, denoted as T_n. Then n output tuples are constructed, each containing two variables. The output variable explicitly specified in the query, $s, is bound to the start tag of the root element in the stream, denoted as t0. The token t0 identifies the root element and thus also identifies the stream (in the rest of this section, we always use a start tag to identify its associated element). Simply identifying an element is not sufficient, since we are also interested in the element content. Therefore each output tuple also contains an implicit variable $~s (written with a tilde) for $s. The variable $~s is bound to a component token of the element associated with $s. In short, an output tuple of StreamSource contains an identifier of the stream and a component token of the stream.

\[
\mathit{StreamSource}_{strName}^{\$s}(T_n) = \mathop{+\!\!+}\,\{\, \langle \$s = t_0,\ \$\tilde{s} = t \rangle \mid t \leftarrow T_n \,\}
\]
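Read operationally, the definition emits one tuple per incoming token, always pairing it with the root start tag. A minimal Python sketch of this reading is given below; the generator name stream_source, the dictionary keys and the (identifier, text) representation of tokens are assumptions made for illustration.

```python
def stream_source(tokens):
    """Yield <$s = t0, $~s = t> for every token t, where t0 is the first token."""
    t0 = None
    for t in tokens:
        if t0 is None:
            t0 = t              # start tag of the root element identifies the stream
        yield {"$s": t0, "$~s": t}

# First four tokens of the stream in Figure 2 (a), as (identifier, text) pairs.
tokens = [(1, "<open_auctions>"), (2, "<open_auction>"),
          (3, "<seller>"), (4, "<sellerid>")]
first_four = list(stream_source(tokens))
```

Because the function is a generator, each output tuple is available as soon as its token has been read, which matches the prefix-based formulation over T_n.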

5.2.2 Token Navigate Operator TokenNav

The TokenNav_{$col1,path}^{$col2} operator recognizes patterns over the token stream. It returns the component tokens of the destination element $col2 accessible via path from the context element $col1. Each output tuple contains such a component token and the token identifying the destination element.

Example 3 If TokenNav_{$s,/open_auctions/open_auction}^{$a} takes the first 4 output tuples from StreamSource_{“open_auctions”}^{$s} in Example 2 as input, its output is:

$s                   $a                  $~a
<open_auctions>₁     <open_auction>₂     <open_auction>₂
<open_auctions>₁     <open_auction>₂     <seller>₃
<open_auctions>₁     <open_auction>₂     <sellerid>₄

For example, the second output tuple represents that token 2, i.e., <open_auction>, is reachable via /open_auctions/open_auction from token 1. It also represents that token 3, i.e., <seller>, is a component of the element associated with token 2.


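\[
\begin{aligned}
\mathit{TokenNav}_{\$col1,\,path}^{\$col2}(U_n) = \mathop{+\!\!+}\,\{\, & u_1 \circ \langle \$col2 = u_1.\$\widetilde{col1},\ \$\widetilde{col2} = u_2.\$\widetilde{col1} \rangle \mid \\
& u_1 \leftarrow U_n,\ u_2 \leftarrow U_n,\ \mathit{Reachable}(u_1.\$col1,\ u_1.\$\widetilde{col1},\ path),\ \mathit{Within}(u_1.\$\widetilde{col1},\ u_2.\$\widetilde{col1}) \,\}
\end{aligned}
\]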

An input tuple u1 ∈ U_n to TokenNav_{$col1,path}^{$col2} contains bindings of the variable $col1 and of the implicit variable $~col1. If Reachable(u1.$col1, u1.$~col1, path) is true, then u1.$~col1 is the start tag of a destination element. For each component token of this destination element, i.e., for each u2.$~col1 with u2 ∈ U_n for which Within(u1.$~col1, u2.$~col1) is true, an output tuple is constructed. The output tuple is the concatenation of u1, the start tag of the destination element ($col2 = u1.$~col1), and the component token ($~col2 = u2.$~col1).
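The definition can be read as a nested loop over a finite prefix of the input. The Python sketch below follows that reading literally and is meant only to illustrate the semantics, not an efficient implementation. The token representation, the helper names and the restriction of paths to child steps are our assumptions; the example navigates from $a via /seller, as one of the operators in Figure 6 does.

```python
from collections import namedtuple

# path: (tid, name) of every element open at this token, root first.
CToken = namedtuple("CToken", "tid kind value path")

def contextualize(raw):
    out, stack = [], []
    for tid, (kind, value) in enumerate(raw, start=1):
        if kind == "start":
            stack.append((tid, value))
        out.append(CToken(tid, kind, value, tuple(stack)))
        if kind == "end":
            stack.pop()
    return out

def within(t1, t2):
    return t1.kind == "start" and t1.tid in [tid for tid, _ in t2.path]

def reachable(t1, t2, path):
    if t1.kind != "start" or t2.kind != "start":
        return False
    steps = [s for s in path.split("/") if s]
    n = len(t1.path)
    return t2.path[:n] == t1.path and [name for _, name in t2.path[n:]] == steps

def token_nav(tuples, col1, path, col2):
    """Literal reading of the TokenNav definition over a prefix U_n."""
    imp1, imp2 = "$~" + col1[1:], "$~" + col2[1:]   # implicit (tilde) columns
    out = []
    for u1 in tuples:
        if not reachable(u1[col1], u1[imp1], path):
            continue                                # u1.$~col1 is no destination
        for u2 in tuples:
            if within(u1[imp1], u2[imp1]):          # component of the destination
                out.append({**u1, col2: u1[imp1], imp2: u2[imp1]})
    return out

raw = [("start", "open_auctions"), ("start", "open_auction"),
       ("start", "seller"), ("start", "sellerid"), ("text", "001"),
       ("end", "sellerid"), ("end", "seller"),
       ("end", "open_auction"), ("end", "open_auctions")]
toks = contextualize(raw)
auction = toks[1]
inputs = [{"$a": auction, "$~a": t} for t in toks if within(auction, t)]
outputs = token_nav(inputs, "$a", "/seller", "$b")
```

Every output tuple binds $b to the seller start tag (token 3) and $~b to one of its component tokens (tokens 3 to 7 in this shortened stream).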

5.2.3 Composition Operator ExtractUnnest

Sections 5.2.3 and 5.2.4 present two extract operators, ExtractUnnest_{$col1}^{$col2} and ExtractNest_{$col1}^{$col2} (collectively referred to as the Extract operators). Both of them must have an input operator of the form TokenNav_{$col1,path}^{$col2}. The input TokenNav_{$col1,path}^{$col2} locates the component tokens of $col2, while the extract operators compose these component tokens into XML nodes.

Example 4 Suppose ExtractUnnest_{$s}^{$a} consumes the first 3 output tuples of TokenNav_{$s,/open_auctions/open_auction}^{$a} in Example 3. It generates the tuple below.

$s                   $a                  $a
<open_auctions>₁     <open_auction>₂     <open_auction>₂ <seller>₃ <sellerid>₄

The cell $a contains a yet-to-be-completed element node. It is composed of tokens 2, 3 and 4. This tuple is a partial output for the input seen so far. Eventually, $a would contain a complete element node composed from token 2 to token 29.

\[
\mathit{ExtractUnnest}_{\$col1}^{\$col2}(U_n) = \mathit{Group}_{\{\$col2\},\ \oplus(\$col2)}\, U_n
\]

Group_{{$col2}, ⊕($col2)} U_n is a function that groups input tuples on the destination node $col2 so that the component tokens (i.e., $~col2) of the same destination node are all collapsed into one group. The component tokens within one group are then composed (represented as ⊕) into one element node.
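The grouping can be sketched in Python as follows. This is an illustration under stated assumptions: tokens are (identifier, text) pairs, the composition ⊕ is approximated by concatenating token texts in stream order, and the output tuple keeps the context cells and binds $col2 to the composed node. The names compose and extract_unnest are hypothetical.

```python
def compose(tokens):
    """Stand-in for the composition operator: join token texts in stream order."""
    return "".join(text for _, text in sorted(tokens))

def extract_unnest(tuples, col2):
    """Group on the destination start tag and compose its component tokens."""
    comp = "$~" + col2[1:]          # implicit column holding component tokens
    groups = {}                     # destination start tag -> (context, tokens)
    for u in tuples:
        context = {k: v for k, v in u.items() if k not in (col2, comp)}
        groups.setdefault(u[col2], (context, []))[1].append(u[comp])
    return [{**context, col2: compose(toks)} for context, toks in groups.values()]

s, a = (1, "<open_auctions>"), (2, "<open_auction>")
inputs = [{"$s": s, "$a": a, "$~a": t}
          for t in [a, (3, "<seller>"), (4, "<sellerid>")]]
partial = extract_unnest(inputs, "$a")
```

Applied to the three tuples of Example 3, the sketch yields a single tuple whose node is still incomplete, mirroring the partial output of Example 4.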

5.2.4 Composition Operator ExtractNest

The difference between ExtractNest and ExtractUnnest is analogous to the difference between NavNest and NavUnnest mentioned in Section 4. The destinations found within the same context are aggregated into one single collection.

Example 5 Suppose ExtractNest_{$b}^{$e} consumes the first 2 tuples generated by TokenNav_{$b,/phone/text()}^{$e}, which are

$s                   $a                  $b           $e           $~e
<open_auctions>₁     <open_auction>₂     <seller>₃    <phone>₇     508-1234567₈
<open_auctions>₁     <open_auction>₂     <seller>₃    <phone>₁₀    508-0004567₁₁

Then the output tuple below is generated:

$s                   $a                  $b           $e
<open_auctions>₁     <open_auction>₂     <seller>₃    508-1234567 || 508-0004567

The output tuple represents that within an open_auction element with a start tag 2 (bound to $a), there is a seller child element with a start tag 3 (bound to $b). So far, two phone subelements of this seller have been formed and aggregated into one collection (bound to $e).

\[
\mathit{ExtractNest}_{\$col1,\,path}^{\$col2}(U_n) = \mathit{Group}_{\{\$col1\},\ +\!\!+(\$col2)}\bigl(\mathit{Group}_{\{\$col2\},\ \oplus(\$col2)}\, U_n\bigr)
\]

From the definitions of ExtractNest and ExtractUnnest, we can see that ExtractNest has a further grouping on the output of ExtractUnnest by the context node $col1. In this way, all the destinations found within the same context are grouped together and aggregated (represented as ++) into one collection.
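The two nested groupings can be sketched in Python as below, using the input of Example 5. As before, this is an illustrative reading rather than the Raindrop implementation: tokens are (identifier, text) pairs, composition is approximated by concatenating token texts, and the names compose and extract_nest are hypothetical.

```python
def compose(tokens):
    """Stand-in for the composition operator: join token texts in stream order."""
    return "".join(text for _, text in sorted(tokens))

def extract_nest(tuples, col1, col2):
    comp = "$~" + col2[1:]
    # Inner grouping: one group of component tokens per destination node.
    per_dest = {}
    for u in tuples:
        context = {k: v for k, v in u.items() if k not in (col2, comp)}
        per_dest.setdefault(u[col2], (context, []))[1].append(u[comp])
    # Outer grouping: destinations of the same context node form one list (++).
    per_context = {}
    for context, toks in per_dest.values():
        per_context.setdefault(context[col1], (context, []))[1].append(compose(toks))
    return [{**context, col2: nodes} for context, nodes in per_context.values()]

s, a, b = (1, "<open_auctions>"), (2, "<open_auction>"), (3, "<seller>")
inputs = [
    {"$s": s, "$a": a, "$b": b, "$e": (7, "<phone>"), "$~e": (8, "508-1234567")},
    {"$s": s, "$a": a, "$b": b, "$e": (10, "<phone>"), "$~e": (11, "508-0004567")},
]
nested = extract_nest(inputs, "$b", "$e")
```

The result is a single tuple whose $e cell holds both phone numbers, corresponding to the collection written with “||” in Example 5.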

5.2.5 Structural Join

In Figure 2 (b), path expressions $a/seller and $a/bid/bidder share the same context variable $a. To capture this relationship, StructuralJoin takes outputs of two Extract operators as inputs and “glues” bindings of individual path expressions.

Example 6 Suppose output tuples of ExtractNest_{$a}^{$b} and ExtractNest_{$a}^{$c} are joined on $a. Assume the left input is one XAT tuple:

$s                   $a                  $b
<open_auctions>₁     <open_auction>₂     <seller><sellerid>001 ...</seller>

and the right input corresponds to two XAT tuples:

$s                   $a                  $c
<open_auctions>₁     <open_auction>₂     <bidder><bidderid>032 ...</bidder>
<open_auctions>₁     <open_auction>₂     <bidder><bidderid>145 ...</bidder>

Then two output tuples are constructed as below:

$s                   $a                  $b                                    $c
<open_auctions>₁     <open_auction>₂     <seller><sellerid>001 ...</seller>    <bidder><bidderid>032 ...</bidder>
<open_auctions>₁     <open_auction>₂     <seller><sellerid>001 ...</seller>    <bidder><bidderid>145 ...</bidder>

The output tuples represent that within an open_auction element with start tag 2 (bound to $a), there is a seller element (bound to $b) and two bidder elements (bound to $c).

Below, we use UL_{n1} and UR_{n2} to denote the first n1 and n2 input tuples from the left and right upstream operators respectively.

\[
\mathit{StructuralJoin}_{\$e}(UL_{n_1}, UR_{n_2}) = \mathop{+\!\!+}\,\{\, u_l \circ u_r \mid u_l \leftarrow UL_{n_1},\ u_r \leftarrow UR_{n_2},\ u_l.\$e = u_r.\$e \,\}
\]
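A direct transcription of this definition into Python, using the tuples of Example 6, is sketched below. The function name structural_join and the dictionary representation are assumptions for illustration; element identity is modeled by comparing the start-tag token bound to the join variable, and merging dictionaries keeps a single copy of cells bound to the same variable, as the ◦ notation prescribes.

```python
def structural_join(left, right, col):
    """Concatenate every pair of tuples that agree on the element bound to col."""
    out = []
    for ul in left:
        for ur in right:
            if ul[col] == ur[col]:          # same element identity
                out.append({**ul, **ur})    # shared cells appear only once
    return out

s, a = (1, "<open_auctions>"), (2, "<open_auction>")
left = [{"$s": s, "$a": a, "$b": "<seller><sellerid>001 ...</seller>"}]
right = [
    {"$s": s, "$a": a, "$c": "<bidder><bidderid>032 ...</bidder>"},
    {"$s": s, "$a": a, "$c": "<bidder><bidderid>145 ...</bidder>"},
]
joined = structural_join(left, right, "$a")
```

The nested loop is the plain reading of the comprehension; it says nothing about how the operator is implemented or synchronized in the physical plan.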

Fig. 6. Stream Logical Plan for the Semantics-focused Plan in Figure 5

5.3 Stream-Specific Plan Structures

XML streams arrive on the fly so that unless a token is explicitly stored, it can be accessed only once. The token-related operators must be connected in a way which ensures that no repetitive token access occurs. An automaton can read data once and concurrently recognize multiple patterns. We therefore propose a special plan structure that models the automata behavior.

Each pattern is defined as a sequence of states in the automaton. The input drives the transition between these states. Figure 6 depicts a stream logical plan adopting this processing style. TokenNav_{$a,/seller}^{$b} and TokenNav_{$a,/bid/bidder}^{$c} share the same upstream operator TokenNav_{$s,/open_auctions/open_auction}^{$a}. This sharing indicates that, for every token read from the common upstream operator, we try to match either $a/seller or $a/bid/bidder. The two downstream extract operators ExtractUnnest_{$a}^{$b} and ExtractUnnest_{$a}^{$c} compose the seller and bidder element nodes respectively. Later on, StructuralJoin_{$a} glues the seller and bidder elements which are subelements of the same open_auction element into one output tuple.
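The idea of reading each token once while matching several patterns can be illustrated with the following Python sketch. It is a simplified stand-in for the automaton, not the Raindrop plan structure: the stack of open element names plays the role of the automaton state, only absolute child-step paths are supported, and the name match_paths and the shortened token stream are our assumptions.

```python
def match_paths(tokens, patterns):
    """Single pass over (kind, name) tokens; report start-tag positions per pattern."""
    targets = {label: tuple(p.strip("/").split("/")) for label, p in patterns.items()}
    hits = {label: [] for label in patterns}
    stack = []                                   # names of currently open elements
    for pos, (kind, name) in enumerate(tokens, start=1):
        if kind == "start":
            stack.append(name)
            for label, target in targets.items():
                if tuple(stack) == target:       # all patterns checked per token
                    hits[label].append(pos)
        elif kind == "end":
            stack.pop()
    return hits

stream = [("start", "open_auctions"), ("start", "open_auction"),
          ("start", "seller"), ("end", "seller"),
          ("start", "bid"), ("start", "bidder"), ("end", "bidder"), ("end", "bid"),
          ("start", "bid"), ("start", "bidder"), ("end", "bidder"), ("end", "bid"),
          ("end", "open_auction"), ("end", "open_auctions")]
hits = match_paths(stream, {
    "$b": "/open_auctions/open_auction/seller",
    "$c": "/open_auctions/open_auction/bid/bidder",
})
```

Both patterns are recognized in the same pass, so no token has to be stored or revisited in order to serve the second pattern.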

5.4 Regular Tuple-based Operators

Apart from the token-based operators, the rest of the operators in a Raindrop plan consume or generate the “regular” cells of tuples, i.e., they do not consume or generate tokens. NavUnnest, NavNest, Select and Tagger defined in Table 1 are examples of such operators.

6 Rewrite Rules Involving Token-Related Operators

We now present two rewrite rules that involve token-related operators. The first rule maps the semantics-focused plan to a default stream logical plan while the second rule provides alternative stream logical plans other than the default one.

6.1 Default Mapping Rewrite Rule

The default mapping rewrite rule, shown in Figure 7, provides a default mapping from a semantics-focused plan to a stream logical plan. First, the general Source operator is replaced by a more specific StreamSource operator. Second, the bottommost NavUnnest operator (resp. NavNest) is mapped to a TokenNav and an ExtractUnnest (resp. ExtractNest) pair. The purpose of this rewriting is to avoid the extraction of the complete incoming stream.

[Figure 7: plan diagram. Before the rewrite: Source streamName $s and NavUnnest (NavNest) $s,path1 $col1. After the rewrite: StreamSource “streamName” $s, TokenNav $s,path1 $col1 and ExtractUnnest (ExtractNest) $s $col1.]

Fig. 7. Default Mapping Rewrite Rule

[Figure 8: plan diagram. Operators shown: TokenNav $col0,path0 $col1; Extract $col0 $col1; NavUnnest (NavNest) $col1,path1 $col2; ExtractUnnest (ExtractNest) $col1 $col2; StructuralJoin $col1.]

Fig. 8. Pattern Retrieval on Token-or-Node Switching Rule

6.2 Token-or-Node Switching Rule

The token-or-node switching rule, shown in Figure 8, rewrites operators that retrieve patterns on XML nodes to operators that retrieve patterns on tokens. We now explain why this rewriting results in an equivalent plan. The internal logic of NavUnnest(NavNest)$col1,path1$col2 consists of two parts. First, it locates the destination element nodes $col2. This is achieved by TokenNav and Extract in the rewritten plan. Second, it generates an output tuple for each destination element node. Each output tuple is a concatenation of the input tuple and the destination element node. This is equivalent to the Cartesian product of the input tuples and the set of destination element nodes. StructuralJoin in the rewritten plan captures this part.

By applying different rewrite rules, we can end up with plans with different amounts of pattern retrieval performed on the tokens. We show later that such differences can have a major impact on the performance.

7 Implementation Strategies for Token-Related Operators

In this section, we present the stream physical level, i.e., the implementation for the stream logical operators. Since the implementation for the regular tuple-based operators can reuse the one developed for the static context in a pipelining style (i.e., operate on each input tuple rather than the whole input), we omit their discussion. Instead, we focus on the implementation for the token-related operators which have no counterpart in the static context. A logical operator may have several implementations. Our purpose here is not to enumerate every possible alternative, but instead to show one solid base implementation.

7.1 Implementation of TokenNav

7.1.1 Using Automata for Path Recognition

Automata are used to recognize the path expressions on token streams. Figure 9 (a) shows such an automaton for the plan in Figure 6. The automaton is composed of several smaller automata, each corresponding to a different TokenNav operator in the plan. Each final state (shown as a state with double circles) corresponds to the end of a path in a TokenNav operator.

[Figure 9: (a) Finite Automaton with states q0 to q7 and transitions labeled open_auctions, open_auction, initial, seller, phone, bid and bidder; (b) Stack Content after each of the tokens <open_auctions>, <open_auction>, <seller>, <sellerid>, 001 and </sellerid> is processed.]

Fig. 9. Implementation of StreamSource/TokenNav

A stack [38,23] stores the history of the state transitions. Figure 9 (b) shows the snapshot of the stack after each token has been processed. The stack contains instances of the states. Initially, the stack contains only the instance of the start state q0. Each incoming start tag is looked up in the transition entries of each state instance at the stack top. For any state that is transitioned to, we push its instance onto the stack. If no transition is found, we push an empty set. In our example, this would be the case when <sellerid> is processed. When an end tag is encountered, the state instances at the stack top are popped off; thus the stack is restored to the status before its matching start tag had been processed. For a PCDATA item, no change is made to the stack.
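The stack discipline described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Raindrop implementation: the class name StackAutomaton and its methods are hypothetical, and the transition table is an assumed fragment of the automaton in Figure 9 (q0 to q1 on open_auctions, q1 to q2 on open_auction, q2 to q4 on seller, q4 to q5 on phone).

```python
class StackAutomaton:
    """Stack-based path recognizer over XML tokens (illustrative sketch)."""

    def __init__(self, transitions, start="q0"):
        # transitions maps (state, tag name) -> next state
        self.transitions = transitions
        # Each stack entry is the set of state instances active at that depth.
        self.stack = [{start}]

    def start_tag(self, name):
        top = self.stack[-1]
        reached = {self.transitions[(s, name)]
                   for s in top if (s, name) in self.transitions}
        self.stack.append(reached)  # an empty set when no transition is found
        return reached

    def end_tag(self):
        # Popping restores the stack to its status before the matching start tag.
        return self.stack.pop()

    def pcdata(self, text):
        # A PCDATA item leaves the stack unchanged.
        return None


# Assumed fragment of the Figure 9 transition table (hypothetical encoding).
TRANSITIONS = {
    ("q0", "open_auctions"): "q1",
    ("q1", "open_auction"): "q2",
    ("q2", "seller"): "q4",
    ("q4", "phone"): "q5",
}

automaton = StackAutomaton(TRANSITIONS)
for tag in ["open_auctions", "open_auction", "seller", "sellerid"]:
    automaton.start_tag(tag)
depth_after_sellerid = len(automaton.stack)
top_after_sellerid = automaton.stack[-1]   # empty set: no transition on sellerid
automaton.pcdata("001")
automaton.end_tag()                        # </sellerid>
top_after_end = automaton.stack[-1]        # back to the seller state
```

Keeping a set per stack entry lets one pass over the tokens track several paths at once, which is the property the shared TokenNav operators rely on.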

7.1.2 Synchronization of Automaton with Token-Related Operators

The output tuples of TokenNav described in Section 5 are only logical concepts. At the physical level, no XAT tuples are actually output by TokenNav. The output of TokenNav$col1,path$col2 includes (1) the token value, (2) the information needed for grouping the tokens that are the components of the same XML node (i.e., identifiers of $col2), and (3) the information needed for grouping XML nodes that are subelements of the same node (i.e., identifiers of $col1). The semantics of TokenNav’s output expected by its downstream token-related operators are captured by triggering the corresponding downstream operators when certain automaton events happen.

Algorithm 1 Pseudocode of Automaton
public class Automaton {
 1: int storingCounter;
 2: StorageManager storeMgr;
 3: void handleStartTag(Token startTag) {
 4:   for each state on top of stack do
 5:     state.transit(startTag);
 6:   end for
 7:   for each state pushed onto stack do
 8:     if state is associated with extract operator then
 9:       storingCounter++;
10:     end if
11:   end for
12:   if (storingCounter > 0) then
13:     storeMgr.store(startTag);
14:   end if
15:   for each state on top of stack do
16:     trigger corresponding operators;
17:   end for
18: }
19: void handleEndTag(Token endTag) {
20:   pop out all states at stack top;
21:   if (storingCounter > 0) then
22:     storeMgr.store(endTag);
23:   end if
24:   for each state popped off do
25:     if state is associated with extract operator then
26:       storingCounter--;
27:     end if
28:   end for
29:   for each state popped off do
30:     trigger corresponding operators;
31:   end for
32: }
33: void handlePCData(Token pcdata) {
34:   if (storingCounter > 0) then
35:     storeMgr.store(pcdata);
36:   end if
37: }
    ...
}

Algorithm 1 illustrates the automaton behavior. The storeMgr in the automaton stores the data extracted from the stream. The storingCounter maintains the number of extract operators that request to store the token currently being processed. The three methods, namely handleStartTag, handleEndTag and handlePCData, describe the process of handling a start tag, an end tag and a PCDATA item respectively. For example, in handleStartTag, the processing takes three steps. First, the automaton performs the state transitions and pushes state instances onto the stack (lines 4 - 6). Second, the automaton computes whether the current token needs to be stored (lines 7 - 11); if yes, the token is put into the storage manager (lines 12 - 14). Third, the operators associated with the state instances on the stack top are invoked (lines 15 - 17).
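The interplay between the stack and the storing counter can be traced with a small Python sketch. It follows the order of steps in Algorithm 1 but omits the triggering of operators; the class BufferingAutomaton, the list standing in for the storage manager, and the transition table are assumptions made for illustration only.

```python
class BufferingAutomaton:
    """Stack automaton with a storing counter (sketch of Algorithm 1, no triggers)."""

    def __init__(self, transitions, extract_states, start="q0"):
        self.transitions = transitions
        self.extract_states = extract_states  # final states tied to an extract operator
        self.stack = [{start}]
        self.storing_counter = 0
        self.stored = []                      # stand-in for the storage manager

    def handle_start_tag(self, name):
        reached = {self.transitions[(s, name)]
                   for s in self.stack[-1] if (s, name) in self.transitions}
        self.stack.append(reached)
        # One increment per newly pushed state that has an extract operator.
        self.storing_counter += len(reached & self.extract_states)
        if self.storing_counter > 0:
            self.stored.append(f"<{name}>")

    def handle_end_tag(self, name):
        popped = self.stack.pop()
        if self.storing_counter > 0:          # store the end tag before decrementing
            self.stored.append(f"</{name}>")
        self.storing_counter -= len(popped & self.extract_states)

    def handle_pcdata(self, text):
        if self.storing_counter > 0:
            self.stored.append(text)


# Assumed transitions; q5 (phone) is the only state with an extract operator here.
TRANSITIONS = {
    ("q0", "open_auctions"): "q1",
    ("q1", "open_auction"): "q2",
    ("q2", "seller"): "q4",
    ("q4", "phone"): "q5",
}
buffering = BufferingAutomaton(TRANSITIONS, extract_states={"q5"})
for tag in ["open_auctions", "open_auction", "seller", "sellerid"]:
    buffering.handle_start_tag(tag)
buffering.handle_pcdata("001")
buffering.handle_end_tag("sellerid")
buffering.handle_start_tag("phone")
buffering.handle_pcdata("508-1234567")
buffering.handle_end_tag("phone")
buffering.handle_end_tag("seller")
```

Only the tokens of the phone element end up in the buffer; the sellerid tokens are read once and discarded, which is the saving the default mapping rule aims for.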

7.1.3 Property of Automaton Implementation

Our automata are designed to satisfy the “exclusive reach” property whenever possible. This property is important for two reasons. First, it ensures the correctness of synchronizing the automata events and the token-related operators (i.e., line 16 in handleStartTag and line 30 in handleEndTag in Algorithm 1). Second, it enables us to implement the structural join operator more efficiently than in the previous literature.

Property 1 Final State Reached by Destination Node Only (Exclusive-Reach). Given a TokenNav$col1,path$col2 operator, the instance of a final state of path can only be pushed onto the stack (resp. popped off the stack) when a start tag (resp. end tag) of the destination node $col2 is encountered.

An automaton must be carefully constructed in order to satisfy the “exclusive reach” property. For example, for an XQuery “for $v in /a return $v//b”, we will construct the automaton in Figure 10 (a) instead of the one in Figure 10 (b). In both figures, q1 is the final state of path /a. The bottom parts of Figures 10 (a) and (b) show the stack contents as tokens <a><c><b>... are processed. In Figure 10 (a), q1 is pushed onto the stack only by the token <a>. In Figure 10 (b), besides <a>, q1 can also be pushed onto the stack by <c> and <b>, which are not bindings of $v.

[Figure 10: (a) Correct Automaton Encoding; (b) Incorrect Automaton Encoding. Each panel shows the automaton and, below it, the stack contents as the tokens <a>, <c> and <b> are processed.]

Fig. 10. Automaton Encoding for Paths Involving “//”

An XPath can be seen as a sequence of items where an item can be “/”, “//” or a navigation step. If we divide the sequence into two parts, we call the second part a postfix of the path. We now give a theorem whose proof can be found in [34].

Theorem 1 If the “exclusive-reach” property holds, a final state can have at most one instance in the stack (we say the automaton is “final state duplicate free”) except in two circumstances: (1) if there is a TokenNav$col1,path$col2 where path contains a “//” and the data is recursive; and (2) if there is a TokenNav$col1,path$col2 where a postfix of path is a “//” followed by zero or more “*”.

Figures 11 (a) and (b) illustrate circumstances (1) and (2) in Theorem 1 respectively. In Figure 11 (a), the automaton encodes $v//a. Given a recursive XML token stream, e.g., <a><a></a></a>..., two instances of final state q2 appear in the stack when the second <a> is processed. In Figure 11 (b), the automaton encodes $v/a//. Even if the XML stream is not recursive, there can still be two instances of final state q1 in the stack since the start tags of any descendant of $v/a push q1 onto the stack.
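Circumstance (1) can be reproduced with a few lines of Python. The sketch below assumes one plausible encoding of $v//a, in which state q1 loops on any tag and moves to the final state q2 on an <a>; the helper names run and count_instances are hypothetical.

```python
def run(transitions, initial, tags):
    """Push one set of states per start tag and return the resulting stack."""
    stack = [set(initial)]
    for tag in tags:
        reached = set()
        for state in stack[-1]:
            for label in (tag, "*"):  # "*" matches any tag name
                reached |= transitions.get((state, label), set())
        stack.append(reached)
    return stack


def count_instances(stack, state):
    """Number of stack entries that contain an instance of the given state."""
    return sum(state in entry for entry in stack)


# Assumed encoding of $v//a: q1 loops on any tag, and <a> moves q1 to final state q2.
DESCENDANT_A = {("q1", "*"): {"q1"}, ("q1", "a"): {"q2"}}

recursive_stack = run(DESCENDANT_A, {"q0", "q1"}, ["a", "a"])  # <a><a>
flat_stack = run(DESCENDANT_A, {"q0", "q1"}, ["a"])            # <a>
duplicates_on_recursive_data = count_instances(recursive_stack, "q2")
duplicates_on_flat_data = count_instances(flat_stack, "q2")
```

On the nested input the final state q2 occurs twice in the stack, so the automaton is not final state duplicate free; on the non-recursive input it occurs once.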

[Figure 11: (a) automaton for $v//a with the stack contents as the tokens <a>, <a> are processed; (b) automaton for $v/a// with the stack contents as the tokens <a>, <c> are processed.]

Fig. 11. Final State Duplicates

In the next sections, we focus on implementations when the automaton is final state duplicate free because they are distinguished from (and more efficient than) those in the other systems [23,38]. The implementations when the automaton is not final state duplicate free can be found in [34].

7.2 Implementation of ExtractUnnest

At the logical level, an ExtractUnnest$col1$col2 operator consumes outputs from a TokenNav$col1,path$col2 operator. This producer-consumer relationship is captured by the association of the final state qn of path with ExtractUnnest:

(1) When an instance of qn is pushed onto the stack, ExtractUnnest$col1$col2 is invoked (line 16 in Algorithm 1). From the “exclusive-reach” property we know a start tag of $col2 has been encountered. This ExtractUnnest prepares a new XAT tuple. This tuple contains only one cell, which is a placeholder for bindings of $col2.

(2) When an instance of qn is popped off the stack, an end tag of $col2 has been encountered. ExtractUnnest$col1$col2 is invoked again (line 30 in Algorithm 1). A complete element node of $col2 is added into the corresponding placeholder. The XAT tuple is then complete and can be output.

7.3 Implementation of ExtractNest

ExtractNest$col1$col2 is associated with qn and q0, where qn and q0 correspond to the end and the beginning of path in TokenNav$col1,path$col2:

(1) When an instance of qn is pushed onto the stack: if this is the first time an instance of qn is pushed within a binding of $col1, ExtractNest creates a tuple with a placeholder. All the destination nodes located within the same $col1 would be put into this placeholder.

(2) When an instance of qn is popped off, ExtractNest adds the newly completed destination node to the placeholder.

(3) When an instance of q0 is popped off the stack, by Theorem 1 we know there cannot be another instance of q0 in the stack. Therefore the placeholder contains only those destination nodes located within this binding of $col1. Since $col1 has been completely processed, ExtractNest outputs the tuple.

Example 7 Figure 12 (a) depicts the stream physical plan in Figure 6 with TokenNav operators replaced by an automaton. Figure 12 (b) shows the processing of token 7 in Figure 2: q5 is pushed onto the stack; ExtractNest$b$e creates a tuple; the storing counter is increased and <phone> is buffered. Figure 12 (c) shows the processing of token 9. First, q5 is popped off. Second, the storing counter is decreased to 0. Third, ExtractNest$b$e adds the phone element to the placeholder. The dashed line in the placeholder indicates that the placeholder is “open”, i.e., there may be more phone elements that could still be located within the same seller. ExtractNest$b$e “closes” the cell in Figure 12 (d) when token 13 is processed.
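The three ExtractNest events can be sketched in Python as below. The class and method names are hypothetical, and the behavior for a binding of $col1 without any destination node (no tuple is output) is an assumption of this sketch rather than something stated in the text.

```python
class ExtractNest:
    """Collects all destination nodes of one $col1 binding into one cell (sketch)."""

    def __init__(self):
        self.placeholder = None  # open cell for the current binding of $col1
        self.output = []

    def on_final_pushed(self):
        # Start tag of a destination node: open a placeholder if none exists yet.
        if self.placeholder is None:
            self.placeholder = []

    def on_final_popped(self, node):
        # End tag of a destination node: add the completed node to the open cell.
        self.placeholder.append(node)

    def on_context_popped(self):
        # End tag of the $col1 binding: close the cell and output the tuple.
        if self.placeholder is not None:
            self.output.append(list(self.placeholder))
            self.placeholder = None


nest = ExtractNest()
nest.on_final_pushed()                                # <phone>, token 7 in Example 7
nest.on_final_popped("<phone>508-1234567</phone>")    # </phone>, token 9
nest.on_final_pushed()
nest.on_final_popped("<phone>508-0004567</phone>")
nest.on_context_popped()                              # </seller>, token 13
```

After the last call the output holds one tuple whose single cell contains both phone elements, mirroring the closed cell of Figure 12 (d).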

7.4 Implementation of StructuralJoin

A StructuralJoin$col1 operator must have an upstream operator in the form of TokenNav$col0,path$col1. This StructuralJoin is invoked when an instance of qn, the state that corresponds to the end of path, is popped off the stack. The input tuples to StructuralJoin contain only elements located within this binding of $col1. Therefore StructuralJoin can simply perform a Cartesian product on its input tuples. The input tuples are purged after the Cartesian product so that they would not participate in the next Cartesian product for a different binding of $col1. Since our structural join must be invoked when a certain automaton event happens, we call it an in-time structural join.
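A minimal Python sketch of the in-time structural join follows. The class InTimeStructuralJoin and its methods are hypothetical names, and plain strings stand in for XAT tuples; the point is only that a Cartesian product followed by a purge needs no identifier comparison.

```python
from itertools import product


class InTimeStructuralJoin:
    """Joins whatever arrived within the current $col1 binding (sketch)."""

    def __init__(self, n_inputs):
        self.inputs = [[] for _ in range(n_inputs)]

    def add(self, position, value):
        self.inputs[position].append(value)

    def on_binding_closed(self):
        """Invoked when the end tag of the current $col1 binding pops q_n."""
        joined = list(product(*self.inputs))
        for buffer in self.inputs:
            buffer.clear()  # purge so tuples never leak into the next binding
        return joined


join = InTimeStructuralJoin(n_inputs=2)
join.add(0, "seller1")
join.add(1, "bidderA")
join.add(1, "bidderB")
first_auction = join.on_binding_closed()   # end of the first open_auction
join.add(0, "seller2")
join.add(1, "bidderC")
second_auction = join.on_binding_closed()  # end of the second open_auction
```

Because the buffers are emptied at every invocation, elements of the first open_auction can never pair with elements of the second one.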

In Tukwila [23] and YFilter [38], input tuples to StructuralJoin$col1 carry the identifiers of $col1. StructuralJoin$col1 performs value comparison on these identifiers. We call it an identifier-based structural join. Raindrop only chooses an identifier-based implementation when the automaton is not guaranteed to be final state duplicate free. When the automaton is final state duplicate free, the in-time structural join is more efficient than the identifier-based structural join since it does not perform any value comparison.

[Figure 12: (a) Query Plan with Automata: the automaton with states q0 to q7 feeding ExtractUnnest $a $b, ExtractNest $b $e, ExtractUnnest $a $c, ExtractNest $a $d, Sel $e = “508-1234567”, StructuralJoin $b, StructuralJoin $a and Tagger <auction>$b,$c</auction> $f. (b) Processing Token 7 (<phone>): q5 is on the stack top, storingCounter = 1. (c) Processing Token 9 (</phone>): storingCounter = 0. (d) Processing Token 13 (</seller>): storingCounter = 0; the $e cell holds <phone>508-1234567</phone> and <phone>508-0004567</phone>.]

Fig. 12. Invoking ExtractNest Operator

8 Programming Model for Synchronizing the Execution of Operators

In a traditional query plan, synchronization among operators is usually achieved by the iterator mode [20], namely, an operator is always invoked by its immediate downstream operator. However, using only this model does not meet the needs of a Raindrop plan. First, execution of TokenNav operators, i.e., the automaton, must be data-driven. In other words, given two TokenNav operators such as TokenNav$a,/seller$b and TokenNav$a,/bid/bidder$c, which operator will locate the desired patterns first is completely decided by the data. Their common downstream operator such as StructuralJoin$a cannot demand to execute one TokenNav first and the other second. Second, the implementations of Extract and StructuralJoin rely on being invoked by a certain TokenNav at a certain time to be correct and efficient. For example, the StructuralJoin$a operator must be invoked by its ancestor upstream operator TokenNav$s,/open_auctions/open_auction$a when q2 is popped out of the stack. We propose to support three invocation modes in Raindrop. We now enumerate each mode and the operators that it applies to.

8.1 AncestorUpstreamDriven Mode

If an operator is invoked by its ancestor upstream operator, we say it is invoked in the AncestorUpstreamDriven mode. Only StructuralJoin and ExtractNest operators are invoked in this mode since they both rely on the grouping information sent by an ancestor TokenNav operator. For example, in Figure 12, both StructuralJoin$b and ExtractNest$b$e are invoked by TokenNav$a,/seller$b when a </seller> is encountered. All the intermediate data between TokenNav$a,/seller$b and StructuralJoin$b must be consumed during this invocation. StructuralJoin$b then invokes its two immediate upstream operators, i.e., Sel$e=“508-1234567” and ExtractUnnest$a$b. The invocation of these two operators will be further illustrated in the next invoking mode.

Example 8 Algorithm 2 shows the pseudocode of StructuralJoin. StructuralJoin implements an ancestorUpstreamDriven method. Each time a </seller> is encountered, an instance of q4 is popped off the stack. Line 30 in handleEndTag in Algorithm 1 will call the ancestorUpstreamDriven method of StructuralJoin$b. In this method, each upstream operator of this StructuralJoin is invoked. From the perspective of these upstream operators, they are invoked by their downstream operators. Therefore their downstreamDriven() methods are called (line 8 in Algorithm 2). The output tuples of these upstream operators are then consumed by StructuralJoin (line 10 in Algorithm 2). The result of StructuralJoin can be either consumed by the downstream operator (line 14 in Algorithm 2) or enqueued into the output queue of StructuralJoin$b (line 16 in Algorithm 2).

8.2 DownstreamDriven Mode

This DownstreamDriven mode is similar to the traditional iterator mode, namely, an operator is invoked by its immediate downstream operator. This invoking mode is applicable to all regular tuple-based operators, ExtractUnnest, ExtractNest and StructuralJoin. Note that ExtractNest and StructuralJoin can be invoked in both the AncestorUpstreamDriven and the DownstreamDriven mode. ExtractNest (resp. StructuralJoin) is invoked in the AncestorUpstreamDriven mode for forming a complete tuple (resp. for performing a Cartesian product on its input), but it keeps the tuple (resp. the Cartesian product result) until it is invoked by its immediate downstream operator.

Algorithm 2 Programming Model for In-Time Structural Join
public class StructuralJoin {
 1: Boolean isImmediateUpstreamDriven;
 2: List[] inputQueues;
 3: List outputQueue;
    ...
 4: public void ancestorUpstreamDriven() {
 5:   List outputTuples;
 6:   List[] inputTuples;
 7:   for each upstream operator upstreamOp do
 8:     inputTuples.add(upstreamOp.downstreamDriven());
 9:   end for
10:   outputTuples = join(inputTuples);
11:   for each downstream operator downOp do
12:     if downOp.isImmediateUpstreamDriven then
13:       int inputPos = downOp.getUpstreamOpPos(this);
14:       downOp.immediateUpstreamDrive(outputTuples, inputPos);
15:     else
16:       outputQueue.enqueue(outputTuples);
17:     end if
18:   end for
19: }
20: public List downstreamDriven() {
21:   return outputQueue.dequeueAll();
22: }
}

Example 9 For the plan in Figure 12, when a </open_auction> is encountered, the ancestorUpstreamDriven method of StructuralJoin$a is called. This method then calls the downstreamDriven method of StructuralJoin$b (line 8 in Algorithm 2). StructuralJoin$b then sends all the tuples in its output queue to StructuralJoin$a (line 21 in Algorithm 2).

8.3 ImmediateUpStreamDriven

The ImmediateUpStreamDriven mode is also called data driven. An operator is invoked by its immediate upstream operator. For example, in Algorithm 2, when StructuralJoin generates output, if its downstream operator is data driven, the downstream operator is invoked to consume this output (line 14). Operators are invoked in this mode in two cases.

In the first case, if an operator does not have a StructuralJoin in its downstream, e.g., Tagger<auction>$b,$c</auction>$f in Figure 12 (a), it will not be invoked in a downstream driven method. Such an operator has to be invoked by its immediate upstream operator.

In the second case, suppose an operator has a downstream StructuralJoin$col1, however there is a data delay in the arrival of component tokens of bindings of $col1. Instead of waiting to be invoked by StructuralJoin$col1 until an end tag of $col1 arrives, the operator can consume whatever input it has right now to reduce CPU idle time. For example, suppose operator Select$e=“508-1234567” in Figure 12 is instead placed between StructuralJoin$b and StructuralJoin$a. If there is data delay in the arrival of a </open_auction>, Select$e=“508-1234567” can still operate on its available input, i.e., seller elements that have phone subelements. This can be a future optimization opportunity, namely, adaptive invocation mode switching.

8.4 Summary

Our programming model allows operators to be invoked in different modes. All modes are defined in a modular manner. This can be a significant advantage when a flexible configuration of the synchronization among operators is needed. For example, if we want to switch Select$e=“508-1234567” from a downstreamDriven mode to an immediateUpstreamDriven mode, we can simply set the Boolean flag “isImmediateUpstreamDriven” to true for this Select. Any output from its upstream StructuralJoin will then be immediately sent to Select$e=“508-1234567” for consumption (see lines 12 - 14 in Algorithm 2).
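The effect of the flag can be illustrated with a short Python sketch modeled on lines 11 - 18 and 20 - 22 of Algorithm 2. The classes Operator and Producer are hypothetical simplifications: a producer has a single downstream operator and tuples are plain strings.

```python
class Operator:
    """Downstream consumer whose invocation mode is set by one Boolean flag."""

    def __init__(self, name, immediate_upstream_driven=False):
        self.name = name
        self.is_immediate_upstream_driven = immediate_upstream_driven
        self.received = []

    def immediate_upstream_drive(self, tuples):
        self.received.extend(tuples)


class Producer:
    """Upstream operator that either pushes its output or queues it."""

    def __init__(self, downstream):
        self.downstream = downstream
        self.output_queue = []

    def emit(self, tuples):
        if self.downstream.is_immediate_upstream_driven:
            self.downstream.immediate_upstream_drive(tuples)  # push: data driven
        else:
            self.output_queue.extend(tuples)                  # hold until pulled

    def downstream_driven(self):
        # Pull: hand over everything queued so far and empty the queue.
        out, self.output_queue = self.output_queue, []
        return out


select = Operator("Select")
structural_join = Producer(select)
structural_join.emit(["t1"])                   # queued: Select is downstream driven
queued_before_switch = list(structural_join.output_queue)
select.is_immediate_upstream_driven = True     # switch the invocation mode
structural_join.emit(["t2"])                   # pushed straight to Select
pulled = structural_join.downstream_driven()
```

Switching modes touches only the flag on the consumer; the producer's logic is unchanged, which is the modularity the summary points to.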

9 Experiments

We have implemented a prototype of Raindrop [21] with Java 1.4.1. We use ToXGene [12], an XML data generator, to generate the XML documents. We ran experiments on two Pentium III 800 MHz machines with 512MB memory each. One machine sends XML token streams via sockets to another machine which would then process the received data. Our experiments show the performance differences among the plans with different amounts of computation pushed into the automata.

The cost of query execution consists of two parts: one for buffering the data, and the other for manipulating (e.g., navigating) the buffered data. The queries we test fall into two categories. In the first category, all alternatives of the query buffer the same amount of data. The performance differences among alternatives only result from the differences in the data manipulation costs. In the second category, some alternatives of the query trade buffering costs for manipulation costs, i.e., buffering more data in the hope that later manipulation may be accelerated. The performance differences among the alternatives show the tradeoff between the two costs.

for $a in stream(“open_auctions”)/open_auctions/open_auction/bidder[filter1][filter2]...[filtern]
return <auction>{$a}</auction>

Fig. 13. Query with Filters

9.1 Analysis of Performance Difference among Alternatives

We analyze which factors may lead to performance differences among alternatives. Suppose filter1 has a low selectivity; then we should evaluate this filter before the other filters. This follows the classical “push down operators of low selectivity” optimization technique. However, in the automata, all filter patterns are retrieved in parallel. In order to ensure that the other filters are evaluated after filter1 is evaluated, we must leave the other filters out of the automata.

In the opposite case where all filters are frequently satisfied, a plan that pushes all filters into the automata (called the maximal-navigation-pushdown plan) can evaluate all filters in one single access of the tokens. The other plans, however, have to access the bidder elements n times if there are n filters to be evaluated out of the automata. A maximal-navigation-pushdown plan might outperform the other plans.

9.2 Testing Queries Having Alternative Plans with Same Buffering Cost

Figure 13 shows an example query template from an XML benchmark, XMark [1]. This query asks to return any bidder element that satisfies a set of filters where each filter is a linear XPath, i.e., an XPath with no filters. In any alternative, a bidder element always has to be buffered because it may appear in the final answer. Therefore all alternatives have the same buffering costs.

9.2.1 Testing on Cheap Filters

In the first experiment, we vary the selectivity of filter1 in the data set. The selectivity of all the other filters is 100%. The length of each filter is 1, i.e., each filter has only one deterministic navigation step and does not have any descendant axis. The cost of evaluating such a filter is rather cheap. The average width of bidder, i.e., the number of its children elements, is set to 30.

We test three plans. In the zero-filter-pushdown plan, all filters are evaluated out of the automata but filter1 is evaluated before any other filters. In the one-filter-pushdown plan, filter1 is evaluated in the automata. The evaluation order of the other filters does not matter here since they have the same selectivity. The third plan pushes all filters into the automata and is called the maximal-navigation-pushdown plan. Figure 14 shows that when the selectivity is between 0% and 25%, the zero-filter-pushdown plan performs better than the maximal-navigation-pushdown plan; otherwise, the maximal-navigation-pushdown plan performs better than the zero-filter-pushdown plan. At all times, the zero-filter-pushdown plan behaves similarly to the one-filter-pushdown plan. That is to say, evaluating a single pattern on tokens has a similar performance to evaluating the pattern on element nodes. Therefore in the following experiments, we only illustrate one of these two plans.

[Figure 14: plot of Execution Time (ms) versus Pattern Selectivity (0%, 13%, 25%, 50%, 75%, 100%) for the Zero Filter Pushdown, One Filter Pushdown and Maximal Navigation Pushdown plans. Data Size = 84M, Filter Number = 20, Average Filter Pattern Length = 1.]

Fig. 14. Performance of Alternative Plans for Queries with 20 Filters of Average Length 1

Pattern Selectivity     0%     12%    25%    50%    100%
Query with 5 filters    1.06   1.02   1.03   0.95   0.90

Fig. 15. Ratio of Execution Time of Maximal Pushdown with Execution Time of Zero Filter Pushdown for Queries with Different Numbers of Filters

Figure 15 further compares the performance of different queries. All queries conform to the query template in Figure 13 but differ in the number of filters. We report the ratio of the execution time of the maximal-navigation-pushdown plan to that of the zero-filter-pushdown plan. Figure 15 illustrates that measures are needed to judge whether it is worthwhile to consider alternative plans. For a simple query, such as the one with 5 filters, both the zero-filter-pushdown and maximal-navigation-pushdown plans always perform similarly (the ratio is close to 1) when the selectivity of filter1 varies. As the query gets more complicated, i.e., the number of filters increases, the differences among alternative plans get more significant.

9.2.2 Testing on Expensive Filters

We now test two queries with more expensive filters. In the first query, filter1 still has a length of 1 but all other filters are longer. Correspondingly, savings from the evaluation on a filteri (i ≠ 1) are larger than those in Figure 14. Figure 16 gives the experimental result. When the selectivity of filter1 is 0%, the ratio of the execution time of a maximal-navigation-pushdown plan to that of the zero-filter-pushdown plan can reach 1.46. This is the same as that in a query with 20 shorter filter patterns (refer to the first cell in the third row in Figure 15). Also, the crossover between the two plans shifts from 25% in Figure 14 to 50% in Figure 16. In other words, the zero-filter-pushdown plan is more likely to win over the maximal-navigation-pushdown plan compared to the scenarios in Section 9.2.1.

[Figure 16: plot of Execution Time (ms) versus Pattern Selectivity (0%, 25%, 50%, 75%, 100%) for the Zero Filter Pushdown and Maximal Navigation Pushdown plans. Data Size = 56M, Filter Number = 10, Average Filter Pattern Depth = 5.]

Fig. 16. Performance of Alternative Plans for Queries with 10 Filters of Average Length 5

[Figure 17: plot of Execution Time (ms) versus Pattern Selectivity (0%, 25%, 50%, 75%, 100%) for the Zero Filter Pushdown and Maximal Navigation Pushdown plans. Data Size = 56M, Filter Number = 2 (with // in one Filter).]

Fig. 17. Performance of Alternative Plans for Queries with 2 Filters (One Filter has “//”)

The second query we test has only two filters. filter1 still has a length of 1 but filter2 starts with a “//”. Any component token of bidder will lead to an automaton state transition. Computing such a filter is more expensive than a filter with only deterministic navigation steps, because to evaluate a filter with n deterministic navigation steps, component tokens of bidder that are more than n levels deep within bidder would not induce any transitions. Figure 17 confirms that the performance difference among alternative plans of this query can be significant.

9.3 Testing Queries Having Alternatives with Different Buffering Costs

We now study queries which conform to the template in Figure 18. Figures 19, 20 and 21 show alternatives that push one, three, or all navigation operators down to the automata respectively. In Figure 19, each open_auction element has to be buffered since it will be navigated into later to find the initial, seller and bidder subelements. In contrast, both plans in Figures 20 and 21 buffer only a minimal amount of data, i.e., the bidder and seller, for later navigation or result construction.

for $a in stream(“open_auctions”)/open_auctions/open_auction/auction[initial], $b in $a/seller, $c in $a/bid/bidder[filter1][filter2]...[filtern]
return <auction>{$b, $c}</auction>

Fig. 18. Query with Multiple Bindings in For Clause

[Figure 19: plan diagram. Operators shown: TokenNav $s,/open_auctions/open_auction $a; ExtractUnnest $s $a; NavNest $a,initial $d; NavUnnest $a,/seller $b; NavUnnest $a,/bid/bidder $c; NavNest $c,filter1 $e; Join $a.]

Fig. 19. One Navigation Pushdown

[Figure 20: plan diagram.]

Fig. 20. Three Navigation Pushdown

We vary three factors in the data set. First, we vary the selectivity of the filter /initial but keep the selectivity of all the other filters at 100%. Second, we vary the size of the data that are subelements of open_auction other than seller or bidder. We call the ratio K = (the size of the above data) / (the overall size of seller and bidder) an extra buffering ratio. Third, we vary the number of seller elements in each open_auction. We fix the average number of bidder elements in an open_auction to 20. We generated three data sets whose data characteristics are shown in Table 4.

[Figure 21: plan diagram. Operators shown: TokenNav $a,/initial $d; TokenNav $a,/seller $b; TokenNav $a,/bid/bidder $c; TokenNav $c,filter1 $e; ExtractNest $a $d; ExtractUnnest $a $b; ExtractUnnest $a $c; ExtractNest $c $e; StructuralJoin $c; StructuralJoin $a.]

Fig. 21. Maximal Navigation Pushdown

Data Set      Extra buffering ratio   Average number of sellers within an open_auction
Data Set 1    0%                      1
Data Set 2    50%                     1
Data Set 3    0%                      10

Table 4. Data Characteristics of Three Data Sets

Figures 22 and 23 show the results on the first two data sets. We observed:

(1) The three-navigation-pushdown plan is always better than the one-navigation-pushdown plan for two reasons. First, the three-navigation-pushdown plan never buffers more data than the one-navigation-pushdown plan. In data set 2, where the extra buffering ratio is 50%, it buffers much less data. Second, the Join$a operator in Figure 19 is an identifier-based join. It is more costly than the StructuralJoin$a operator in Figure 20.

(2) The crossover point of the one-navigation-pushdown and maximal-navigation-pushdown plans occurs at a lower selectivity in Figure 23 than that in Figure 22. This is because in Figure 23, the cost that the one-navigation-pushdown plan saves in pattern retrieval is offset by the cost that the one-navigation-pushdown plan spends in buffering extra data.

Figure 24 reports the results on the third data set. The trend of the performance differences between the one-navigation-pushdown and maximal-navigation-pushdown plans remains similar to that in Figure 22. However, the three-navigation-pushdown plan performs extremely badly (its performance when the selectivity is larger than 25% is not shown due to its extremely high cost). This is because, in Figure 20, a bidder is paired with each seller by StructuralJoin$a. Each bidder is therefore duplicated 10 times, since there are 10 seller elements within the same openauction. Correspondingly, any downstream computation on a bidder element is duplicated. For example, a single bidder element will be navigated into 10 times by NavNest$c,filter1 $e in Figure 20 to evaluate filter1. In the other two plans, either Join$o in Figure 19 or StructuralJoin$a in Figure 21 is performed after locating all the patterns within bidder, so that no navigation computation is duplicated.

[Plot: Execution Time (ms) against Pattern Selectivity (0% to 100%) for the 1 Nav Pushdown, 3 Nav Pushdown and Maximal Nav Pushdown plans. Data Size = 48M, Seller Number = 1, Extra Buffering Ratio = 0%.]

Fig. 22. Performance on Data Set 1

[Plot, caption not recovered: same axes and plans as Fig. 22. Data Size = 92M, Seller Number = 1, ...]

[Plot, caption not recovered: same axes and plans as Fig. 22. Data Size = 56M, Seller Number = 10, ...]
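The duplication effect described above can be made concrete with a small counting sketch. This is not the Raindrop implementation: the function names, the `increase` field and the dictionary layout of a bidder are assumptions chosen only for illustration. The sketch compares how many times bidders are navigated into when the pairing with sellers happens before the navigation versus after it.

```python
# Illustrative sketch only; not the Raindrop operators themselves.
# The "increase" field and the dict layout of a bidder are assumptions.

def navigate(bidder, counter):
    """Navigate into one bidder element and count the visit."""
    counter["navigations"] += 1
    return bidder["increase"]

def join_then_navigate(sellers, bidders):
    """Pair each bidder with each seller first, then navigate into the
    bidder of every joined tuple."""
    counter = {"navigations": 0}
    out = [(s, navigate(b, counter)) for s in sellers for b in bidders]
    return out, counter["navigations"]

def navigate_then_join(sellers, bidders):
    """Navigate into each bidder once, then pair the results with sellers."""
    counter = {"navigations": 0}
    navigated = [navigate(b, counter) for b in bidders]
    out = [(s, inc) for s in sellers for inc in navigated]
    return out, counter["navigations"]

sellers = [f"seller{i}" for i in range(10)]
bidders = [{"increase": 1.5 * i} for i in range(3)]

late_out, late_navs = join_then_navigate(sellers, bidders)
early_out, early_navs = navigate_then_join(sellers, bidders)
print(late_navs, early_navs)
```

With 10 sellers and 3 bidders, the join-first order performs 30 navigations and the navigate-first order performs 3, while both orders produce the same output tuples.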


10 Related Work

XML stream processing has been an active research area recently. XSM [27] and XSQ [31] use transducer models for pattern retrieval. Basically, they define a template for each component in XQuery or XPath, and then compile the query into a network of such instantiated templates. Though XSM supports queries with more expressive power than XSQ does, XSQ provides more efficient memory management than XSM by promptly cleaning up intermediate buffers when they are no longer needed.

Lazy PDA [19] and XPush [20] are two deterministic-automata-based approaches for handling a limited subset of XML query language features. Both of them only return a Boolean value indicating whether an XPath expression evaluates to a non-empty result. Lazy PDA stands for lazy deterministic pushdown automata. It is called “lazy” because it computes the automaton states at run-time, so that only the states that are actually transitioned to are computed. This can effectively reduce the exponential blow-up in the number of states that arises when an “eager” PDA is computed at compile time. Lazy PDA supports only XPath expressions without filters (i.e., linear patterns), while XPush allows XPath expressions to have filters (i.e., tree patterns). XPush extends Lazy PDA with additional constructs for supporting tree patterns and predicate evaluation.
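The idea of computing deterministic states only on demand can be sketched as follows. This is a generic lazy subset construction combined with a stack for element nesting, not the algorithm of [19] or [20]. The class name `LazyDFA`, the hand-written NFA for the linear pattern //a/b and the event encoding are all assumptions made for illustration.

```python
# Minimal sketch of lazy determinization over a stream of start/end tags.
# Deterministic states are sets of NFA states, created only when reached.

def nfa_step(q, label):
    """Hand-written NFA for the linear pattern //a/b (an assumption)."""
    out = set()
    if q == 0:
        out.add(0)              # descendant axis: stay in state 0 on any tag
        if label == "a":
            out.add(1)
    elif q == 1 and label == "b":
        out.add(2)              # state 2 is the accepting state
    return out

class LazyDFA:
    def __init__(self, step_fn, start, accepting):
        self.step_fn = step_fn
        self.start = frozenset([start])
        self.accepting = frozenset(accepting)
        self.table = {}         # (dfa_state, label) -> dfa_state, filled lazily

    def step(self, dfa_state, label):
        key = (dfa_state, label)
        if key not in self.table:           # compute the transition on first use
            targets = set()
            for q in dfa_state:
                targets |= self.step_fn(q, label)
            self.table[key] = frozenset(targets)
        return self.table[key]

    def count_matches(self, events):
        stack = [self.start]                # one automaton state per open element
        hits = 0
        for kind, label in events:
            if kind == "start":
                state = self.step(stack[-1], label)
                stack.append(state)
                if state & self.accepting:
                    hits += 1
            else:                           # "end": return to the parent's state
                stack.pop()
        return hits

# Token stream for <r><a><b/><c/></a><a><b/></a></r>
events = [("start", "r"), ("start", "a"), ("start", "b"), ("end", "b"),
          ("start", "c"), ("end", "c"), ("end", "a"), ("start", "a"),
          ("start", "b"), ("end", "b"), ("end", "a"), ("end", "r")]

dfa = LazyDFA(nfa_step, start=0, accepting=[2])
hits = dfa.count_matches(events)
print(hits, len(dfa.table))
```

A Boolean answer of the kind described above corresponds to `hits > 0`. The second a element reuses the transitions already cached for the first, so the transition table grows only with the distinct (state, tag) pairs actually encountered in the stream.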

YFilter [38] and Tukwila [23] are closest to our work. Their approaches model token-based automata computations in a single coarse-grained operator. Our work instead uniformly integrates the token-based and tuple-based computations, and thus naturally offers query rewrite optimization opportunities.

Another camp [17,18] builds systems using SAX handlers. They define a set of handlers, each handling a certain computation such as evaluating a navigation step, performing a selection or constructing an element. These handlers are nested so that one handler can pass an event it receives to another handler. Again, this is a new methodology that is not in sync with well-known algebraic optimization techniques; existing algebraic optimization techniques cannot be directly adopted.
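The nesting of handlers can be sketched with Python's standard `xml.sax` module. This is only an illustration of the general idea and not the design of [17] or [18]. The classes `StepHandler` and `CollectText`, the sample document and the choice to forward a matching element together with everything inside it are assumptions.

```python
import xml.sax

class CollectText(xml.sax.ContentHandler):
    """Innermost handler: collects the text of each element forwarded to it."""
    def __init__(self):
        super().__init__()
        self.results = []
        self._buf = None

    def startElement(self, name, attrs):
        self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        self.results.append("".join(self._buf))
        self._buf = None

class StepHandler(xml.sax.ContentHandler):
    """Navigation step: forwards the events of elements named `tag`,
    and of everything nested inside them, to the `inner` handler."""
    def __init__(self, tag, inner):
        super().__init__()
        self.tag = tag
        self.inner = inner
        self.depth = 0          # > 0 while inside a matching element

    def startElement(self, name, attrs):
        if self.depth > 0 or name == self.tag:
            self.depth += 1
            self.inner.startElement(name, attrs)

    def characters(self, content):
        if self.depth > 0:
            self.inner.characters(content)

    def endElement(self, name):
        if self.depth > 0:
            self.inner.endElement(name)
            self.depth -= 1

doc = (b"<open_auction><bidder><increase>3.00</increase></bidder>"
       b"<seller/><bidder><increase>4.50</increase></bidder></open_auction>")

collector = CollectText()
pipeline = StepHandler("bidder", StepHandler("increase", collector))
xml.sax.parseString(doc, pipeline)
print(collector.results)
```

Each handler sees only the events its parent chooses to pass on, which is what makes the chain behave like a sequence of navigation steps. The order of the steps is fixed by how the handlers are nested.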

BEA/XQRL [13] processes a stored sequence of tokens. XQuery is compiled into a network of expressions, where an expression is equivalent in functionality to an algebraic operator. There are two major differences between BEA/XQRL and Raindrop. First, in BEA/XQRL, all the internal data passed among expressions are always token streams, in contrast to both tokens and tuples in Raindrop. Second, the tokens in BEA/XQRL and the tokens in Raindrop are not equivalent concepts in terms of their accessibility. In BEA/XQRL, the token stream is stored so that the same data can be accessed by expressions multiple times. In Raindrop, tokens arrive on-the-fly and cannot be accessed more than once unless they are buffered, as explicitly specified by the Extract operators. The pull-based model in BEA/XQRL, which assumes that looking back at previous tokens is possible, does not work in Raindrop. A data-driven model is a must for buffering data before a pull-based model can operate on the buffered data.

11 Conclusion

Raindrop accommodates a token-based automata paradigm and a tuple-based algebraic paradigm within one framework. This is a novel approach compared to the other approaches in the literature, which typically model the two processing paradigms separately and thus optimize them separately as well. Our approach instead allows query optimization over all computations to be performed in a uniform manner. In particular, it provides a unique optimization opportunity. Previous literature considers only the plans in which computations are maximally pushed down into the automata. Our experiments, however, clearly demonstrate that such plans do not ensure optimality. With different query types and data characteristics, different automata pushdown strategies are needed for generating optimal plans. We are currently studying this unique optimization opportunity.

Acknowledgement. We would like to thank Jinhui Jian, who implemented part of the Raindrop system. Our thanks also go to the Rainbow team and Cape team, especially Xin Zhang, Song Wang, Ling Wang, Luping Ding and Bradford Pielech, in the Database System Research Group at Worcester Polytechnic Institute, who provided related code support.

References

[1] A. Schmidt, F. Waas, M. L. Kersten, M. J. Carey, I. Manolescu and R. Busse. XMark: A Benchmark for XML Data Management. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), pages 974–985, 2002.

[2] A. Snoeren, K. Conkey and D. Gifford. Mesh-based Content Routing using XML. In 18th ACM Symposium on Operating System Principles (SOSP), 2001.

[3] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. VLDB Journal, 12(2):120–139, August 2003.

[4] M. Altinel and M. Franklin. Efficient Filtering of XML Documents for Selective Dissemination. In Proceedings of VLDB, pages 53–64, 2000.

[5] S. Amer-Yahia, S. Cho, L. V. Lakshmanan, and D. Srivastava. Minimization of Tree Pattern Queries. In SIGMOD, pages 497–508, June 2001.

[6] B. Babcock, S. Babu, R. Motwani, and J. Widom. Models and issues in data streams. In PODS, pages 1–16, June 2002.


[7] S. Babu and J. Widom. Continuous queries over data streams. In ACM SIGMOD, Sep 2001.

[8] C. Koch, S. Scherzinger, N. Schweikardt and B. Stegmaier. FluxQuery: An Optimizing XQuery Processor for Streaming XML Data. In VLDB, pages 228–239, 2004.

[9] S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, pages 269–280, 2003.

[10] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In ACM SIGMOD, pages 379–390, June 2002.

[11] D. Abadi, Y. Ahmad, M. Balazinska et al. The design of the Borealis stream processing engine. In Proceedings of CIDR, to appear, 2005.

[12] D. Barbosa, A. Mendelzon, J. Keenleyside et al. ToXgene: a Template-Based Data Generator for XML. In Proceedings of WEBDB, pages 49–54, 2002.

[13] D. Florescu, C. Hillery, D. Kossmann et al. The BEA streaming XQuery processor. In VLDB Journal 13(3), pages 294–315, 2004.

[14] A. J. Demers, J. Gehrke, R. Rajaraman, A. Trigoni, and Yong Yao. The Cougar Project: A Work-In-Progress Report. In Sigmod Record 32 (4), pages 53–59, 2003.

[15] Alin Deutsch, Yannis Papakonstantinou, and Yu Xu. The NEXT Logical Framework for XQuery. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), pages 29–41, 2004.

[16] L. Fegaras and D. Maier. Towards an Effective Calculus for Object Query Languages. In Proceedings of SIGMOD, pages 47–58, 1995.

[17] Leonidas Fegaras. The Joy of SAX. In First International Workshop on XQuery Implementation, Experience and Perspectives (XIME-P), 2004.

[18] George Russell, Mathias Neumuller and Richard Connor. Stream-based XML Processing with Tupe Filtering. 2003.

[19] T. J. Green, G. Miklau, M. Onizuka, and D. Suciu. Processing XML Streams with Deterministic Automata. In ICDT, pages 173–189, 2003.

[20] A. Gupta and D. Suciu. Stream Processing of XPath Queries with Predicates. In Proceedings of SIGMOD, pages 419–430, 2003.

[21] H. Su, E. A. Rundensteiner and M. Mani. Semantic Query Optimization in an Automata-Algebra Combined XQuery Engine over XML Streams. In VLDB Demo, 2004.

[22] H. Su, J. Jian and E. A. Rundensteiner. Raindrop: A Uniform and Layered Algebraic Framework for XQueries on XML Streams. In CIKM, pages 279–286, 2003.


[23] Z. Ives, A. Halevy, and D. Weld. An XML Query Engine for Network-Bound Data. VLDB Journal, 11 (4): 380–402, 2002.

[24] J. Chen, D. DeWitt, F. Tian et al. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In SIGMOD, 2000.

[25] H. V. Jagadish, Shurug Al-Khalifa, Adriane Chapman, Laks V. S. Lakshmanan, Andrew Nierman, Stelios Paparizos, Jignesh M. Patel, Divesh Srivastava, Nuwee Wiwatwattana, Y. Wu, and C. Yu. Timber: A native XML database. In VLDB Journal, Volume 11, Issue 4, pages 274–291, 2002.

[26] J. Jian, H. Su, and E. Rundensteiner. Automaton Meets Query Algebra: Towards A Unified Model for XQuery Evaluation over XML Data Streams. In Proceedings of ER, 2003.

[27] B. Ludascher, P. Mukhopadhyay, and Y. Papakonstantinou. A Transducer-Based XML Query Processor. In Proceedings of VLDB, pages 227–238, 2002.

[28] M. J. Carey, M. Blevins and P. Takacsi-Nagy. Integration, Web Services Style. In IEEE Data Eng. Bull. 25 (4): 17–21, 2002.

[29] B. Nguyen, S. Abiteboul, G. Cobena, and M. Preda. Monitoring XML data on the Web. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, pages 437–448, May 2001.

[30] P. Mukhopadhyay and Y. Papakonstantinou. Mixing querying and navigation in MIX. In Proceedings of ICDE 2002, 2002.

[31] F. Peng and S. Chawathe. XPath Queries on Streaming Data. In Proceedings of SIGMOD, pages 431–442, 2003.

[32] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech, and N. Mehta. Cape: Continuous query engine with heterogeneous-grained adaptivity. In VLDB Demo, pages 1353–1356, 2004.

[33] S. Wang, E. A. Rundensteiner and M. Mani. Optimization of nested XQuery expressions with orderby clauses. In XML Schema and Data Management (XSDM), Tokyo, Japan, April 2005.

[34] H. Su. Automaton Meets Algebra: A Hybrid Paradigm for XML Stream Processing. Ph.D. Dissertation, Computer Science Department, Worcester Polytechnic Institute, 2005.

[35] W3C. XML Query Data Model. http://www.w3.org/TR/query-datamodel, 2000.

[36] X. Zhang, B. Pielech and E. A. Rundensteiner. Honey, I Shrunk the XQuery!: An XML Algebra Optimization Approach. In WIDM, pages 15–22, Nov. 2002.

[37] X. Zhang, B. Pielech and E. A. Rundensteiner. XAT Optimization. Technical Report WPI-CS-TR-02-25, Worcester Polytechnic Institute, 2002.

[38] Y. Diao, M. Altinel, M. J. Franklin, H. Zhang and P. Fischer. Path sharing and predicate evaluation for high-performance XML filtering. In TODS, pages 467–516, 2003.
