
1

XML Publishing:

Bridging Theory and Practice

Wenfei Fan

University of Edinburgh

and

Bell Laboratories

2

XML documents

Rooted, node-labeled, ordered, unranked tree
– element, e.g., course, prereq: tagged, with a subtree
– subelement, e.g., the prereq child of course
– text node, e.g., “CS650”: carries text, not tagged, a leaf

[Figure: an XML tree rooted at db, with course elements containing cno (“CS650”), title (“Web DB”), type (regular) and prereq subtrees.]

3

XML publishing: data exchange on the Web

Most legacy data is stored in relational databases

XML has become the prime standard for data exchange

[Figure: XML publishing. An RDB source is mapped by a view mapping Q to an XML view, which is exchanged over the Web between DB1 and DB2.]

4

XML publishing: an XML interface of databases

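[Figure: middleware between XML users and the relational DBMS. Publishing (guided by a DTD) goes from the RDB to XML; query translation and query answers flow between the two.]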

Querying and updating traditional databases via XML views

5

Example: XML publishing

Relational schema R0:

course (cno, title, type)

prereq (cno1, cno2) -- prerequisite hierarchy

XML DTD D0:
db → course*
course → cno, title, type, prereq
prereq → cno*, prereq
type → regular | project

[Figure: the registrar database (relational schema R0) is published as an XML view rooted at db, with course elements containing cno (“CS650”), title (“Web DB”), type (regular) and prereq subtrees.]

6

XML publishing languages in practice

XML view definition languages: XML views published from RDB

Commercial products:
– Microsoft SQL Server 2005 (FOR-XML, XSD)
– IBM DB2 XML Extender (SQL/XML, DAD: SQL, RDB)
– Oracle 10g XML DB (SQL/XML, DBMS_XMLGEN)
– …

Research prototypes:
– XPERANTO
– TreeQL (SilkRoute)
– ATG (PRATA)

[Figure: an RDB source is mapped by a view mapping to an XML view.]

7

XML publishing in practice

[Figure: the XML view tree (db, course, cno, title, type, prereq) is generated from the RDB source by relational queries Q, Q1, Q2 embedded in the view mapping.]

Top-down from the root, via embedded relational queries

8

XML publishing: question of the users

What language should a user choose to express the view?
– unbounded depth and nondeterministic “shape”, which cannot be decided statically at compile time:
  prereq → cno*, prereq
  type → regular | project

[Figure: the XML view tree, with a prereq chain of unbounded depth and a type that may be regular or project.]

Few publishing languages can define this view

9

XML publishing: question of database vendors

XML view: under each course, list all its prerequisites, direct or not
– collapsing the prerequisite hierarchy
– a tree of depth three

Question: is it necessary to upgrade DBMS and support SQL’99?

[Figure: a tree rooted at db in which each course lists its type and the cno of all its prerequisites directly, generated by queries Q and Q1.]

The expressive power and complexity of XML publishing languages

10

Outline

XML publishing transducers

Characterization of XML publishing languages in practice

Complexity: evaluation cost, static analyses

Expressive power: tree generation, relational characterization

Dynamic aspect: incremental XML publishing, view updates

Open research issues

Joint work with
– Theory: Floris Geerts, Frank Neven [PODS’07]
– System: Michael Benedikt, Phil Bohannon, Cheeyong Chan, Rajeev Rastogi, … [SIGMOD’03,04; VLDB’02,04,05; ICDE’07]








12


A publishing transducer $\tau = (Q, \Sigma, q_0, \delta)$ for a relational schema R
– Q: a finite set of states
– $\Sigma$: a finite alphabet of XML tags, with a root r and text
– $q_0$: the start state
– $\delta$: for each pair (q, a) in $Q \times \Sigma$, a rule
  $(q, a) \to (q_1, a_1, \varphi_1(\bar{x}_1; \bar{y}_1)), \ldots, (q_k, a_k, \varphi_k(\bar{x}_k; \bar{y}_k))$
  – to generate the children of a-nodes: $a_1^*, \ldots, a_k^*$
  – register $Reg_a$: set-valued, fixed arity, associated with each a-node
  – $\varphi_i$: a query from $R \cup \{Reg_a\}$ to $Reg_{a_i}$, in a relational query language L
  – $\bar{x}_i$: a list of free variables in $\varphi_i$, the grouping attributes
  – deterministic
  – $(q, \text{text}) \to \cdot$ -- empty RHS: text nodes have no children

13

Top-down transduction

Start rule: the rule for $(q_0, r)$ -- $q_0$ and r do not appear on the RHS of any rule

$(q_0, db) \to (q, course, \varphi_1(c, t; \emptyset))$

$\varphi_1(c, t; \emptyset) = \exists t'\, course(c, t', t)$

recall course (cno, title, type)

Tuple register $Reg_c$: group the result by all attributes;
for each distinct tuple tp in the result of $\varphi_1(\bar{x}; \emptyset)$
– create a course element
– carry the tuple tp in $Reg_c$

Expand at leaf nodes.

[Figure: the root (q0, db) with children (q, course), each carrying a register Reg_c; here x = (c, t) and y is empty. Each node is labeled (q, a) and carries a register Reg.]

14

Registers: tuple vs. relation

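$(q, course) \to (q, cno, \varphi_2(c; \emptyset)),\ (q, type, \varphi_3(t; \emptyset)),\ (q, prereq, \varphi_4(\emptyset; c))$

$\varphi_2(c; \emptyset) = \exists t\, Reg_c(c, t)$

$\varphi_4(\emptyset; c) = \exists t, c'\, (Reg_c(c', t) \wedge prereq(c', c))$

recall prereq(cno1, cno2)

– tuple registers $Reg_{cno}$, $Reg_t$
– relation register $Reg_p$: $\bar{x} = \emptyset$, so the result of $\varphi_4(\emptyset; c)$ is a set
– top-down information passing: the parent register $Reg_c$ is used in $\varphi_4(\emptyset; c)$

[Figure: under (q0, db), each (q, course) node carries Reg_c and has children (q, cno), (q, type) and (q, prereq), carrying Reg_cno, Reg_t and Reg_p; for the prereq child, x is empty and y = (c).]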

15

Recursive transducer and stop condition

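$(q, prereq) \to (q, cno, \varphi_5(c; \emptyset)),\ (q, prereq, \varphi_5(\emptyset; c))$

$\varphi_5(\emptyset; c) = \exists t, c'\, (Reg_p(c', t) \wedge prereq(c', c))$

Stop conditions:
– $\varphi_5(\emptyset; c)$ returns an empty set
– the RHS of the rule for (q, a) is empty (e.g., for text nodes)
– there is an ancestor node with the same label (q, a) and the same register: no new information can be added to the tree

[Figure: under (q0, db), a (q, course) node has children (q, cno) with Reg_cno and (q, prereq) with Reg_p; the prereq node again has children (q, cno) and (q, prereq). Expansion STOPs at a node (q, a) whose register Reg_a repeats that of an ancestor (q, a). Reg_cno is a tuple register, Reg_p a relation register.]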

16

Transformation of a publishing transducer

$\tau$ terminates on a DB of R if all leaf nodes satisfy a stop condition

$\tau(DB)$: the XML tree obtained by striking out states and registers

$\tau(R)$: the set of XML trees generated by $\tau$ for all DB of R
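The top-down transduction for the course/prereq rules can be sketched in a few lines of Python. This is only an illustration under assumptions, not the authors' implementation: the toy instance `DB`, the function names `children`, `expand` and `to_xml`, and the encoding of registers as frozensets of tuples are all hypothetical. For brevity, cno and type nodes print their register value as text, and the title is projected out, as in the rule for (q0, db).

```python
# Toy instance of R0: course(cno, title, type), prereq(cno1, cno2). Hypothetical data.
DB = {
    "course": {("CS650", "Web DB", "regular"),
               ("CS450", "Databases", "regular"),
               ("CS320", "Algorithms", "regular")},
    "prereq": {("CS650", "CS450"), ("CS450", "CS320")},
}

def children(tag, reg, db):
    """Return (child_tag, child_register) pairs, mimicking the rule for (q, tag)."""
    if tag == "db":
        # One course child per distinct (cno, type) tuple: a tuple register.
        rows = sorted({(c, ty) for (c, _title, ty) in db["course"]})
        return [("course", frozenset([row])) for row in rows]
    if tag == "course":
        out = [("cno", frozenset([(c,)])) for (c, _t) in sorted(reg)]
        out += [("type", frozenset([(t,)])) for (_c, t) in sorted(reg)]
        # Relation register: all direct prerequisites, kept together as one set.
        pre = frozenset((c2,) for (c, _t) in reg
                        for (c1, c2) in db["prereq"] if c1 == c)
        return out + ([("prereq", pre)] if pre else [])
    if tag == "prereq":
        out = [("cno", frozenset([row])) for row in sorted(reg)]
        pre = frozenset((c2,) for (c,) in reg
                        for (c1, c2) in db["prereq"] if c1 == c)
        return out + ([("prereq", pre)] if pre else [])
    return []  # cno and type have no element children in this sketch

def expand(tag, reg, db, ancestors=frozenset()):
    """Expand a node top-down; stop when an ancestor has the same tag and register."""
    if (tag, reg) in ancestors:
        return (tag, reg, [])
    seen = ancestors | {(tag, reg)}
    return (tag, reg, [expand(t, r, db, seen) for t, r in children(tag, reg, db)])

def to_xml(node):
    """Strike out registers and serialize the remaining tree."""
    tag, reg, kids = node
    if tag in ("cno", "type"):
        value = next(iter(reg))[0]
        return f"<{tag}>{value}</{tag}>"
    return f"<{tag}>" + "".join(to_xml(k) for k in kids) + f"</{tag}>"

xml_view = to_xml(expand("db", frozenset(), DB))
```

With a cyclic prereq relation the ancestor check is what makes `expand` terminate: a prereq node whose register repeats an ancestor's is emitted but not expanded further.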

[Figure: the tree generated by the transducer, with nodes labeled (q, a) and carrying registers Reg, shown next to the XML tree (db, course, cno, type, prereq) obtained from it by striking out states and registers.]

17

publishing transducers with virtual nodes

$\tau = (Q, \Sigma, \Sigma_a, q_0, \delta)$

$\Sigma_a$: a subset of $\Sigma$, the virtual tags

Recall the view: under each course, list all its prerequisites

$\tau' = (Q, \Sigma, \Sigma_a = \{prereq\}, q_0, \delta)$

[Figure: the tree with nested prereq elements, and the same tree after the virtual prereq nodes are removed, so that each course directly lists the cno of all its prerequisites.]

Virtual nodes are removed from the output
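Removing virtual nodes amounts to splicing the children of each virtual node into its parent. The sketch below assumes a hypothetical tree encoding, (tag, children) pairs with plain strings for text nodes; the function name `strip_virtual` is illustrative, not from the slides.

```python
def strip_virtual(node, virtual):
    """Return the list of nodes that replace `node` once virtual tags are removed."""
    if isinstance(node, str):          # text node: kept as is
        return [node]
    tag, kids = node
    new_kids = [n for k in kids for n in strip_virtual(k, virtual)]
    if tag in virtual:
        return new_kids                # splice the children into the parent
    return [(tag, new_kids)]

# A course with a nested prereq hierarchy (hypothetical data).
course = ("course", [
    ("cno", ["CS650"]),
    ("prereq", [("cno", ["CS450"]),
                ("prereq", [("cno", ["CS320"])])]),
])

flat = strip_virtual(course, {"prereq"})
```

With prereq declared virtual, the nested hierarchy collapses and the course lists all its prerequisites as direct cno children.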

18

Various classes of publishing transducers

PT(L, S, O)
– L: the relational query language (CQ, FO, FP, with = and ≠)
– S: register, relation vs. tuple (a special case of relation Reg)
– O: output nodes, normal vs. virtual

PTnr(L, S, O): non-recursive subset of PT(L, S, O)

Example:
– View 1: PT(CQ, relation, normal)
– View 2: PT(CQ, relation, virtual) and PTnr(FP, tuple, normal)

As opposed to query automata, publishing transducers
– take a relational database as input, rather than an existing tree
– output a new tree, rather than accepting a tree or selecting nodes

In contrast to recent work on schema mapping, they
– map relations to XML, not relation-to-relation or XML-to-XML
– work via embedded relational queries, not source-to-target constraints








20

Existing XML publishing languages

Extensions of SQL by incorporating XML publishing functions
– Microsoft SQL Server 2005 (FOR-XML)
– IBM DB2 XML Extender (SQL/XML)
– Oracle 10g XML DB (SQL/XML, DBMS_XMLGEN)
– XPERANTO
– …

Annotating schema or fixed tree template with relational queries
– Microsoft SQL Server 2005 (XSD)
– IBM DB2 XML Extender (DAD: SQL, RDB)
– TreeQL (SilkRoute)
– ATG (PRATA)
– . . .

21

Extensions of SQL for XML publishing

SQL/XML: XMLElement, XMLForest, XMLAgg, XMLConcat, …

SELECT XMLELEMENT(NAME "course",
         XMLFOREST(c.cno AS "cno", c.title AS "title"))
FROM course c
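The query builds one course element per row, with cno and title subelements. For readers without a SQL/XML engine at hand, the same non-recursive view can be sketched with Python's standard `sqlite3` and `xml.etree.ElementTree` modules. The sample rows, the function name `publish_courses` and the enclosing db root element are assumptions made for the illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

def publish_courses(conn):
    """Build <db><course><cno/><title/></course>...</db> from the course table."""
    db = ET.Element("db")
    for cno, title in conn.execute("SELECT cno, title FROM course ORDER BY cno"):
        course = ET.SubElement(db, "course")      # plays the role of XMLELEMENT
        ET.SubElement(course, "cno").text = cno   # plays the role of XMLFOREST
        ET.SubElement(course, "title").text = title
    return db

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (cno TEXT, title TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [("CS650", "Web DB", "regular"), ("CS450", "Databases", "regular")],
)
xml_text = ET.tostring(publish_courses(conn), encoding="unicode")
```

The tree has a fixed depth and each element is built from a single tuple, which is the shape the class PTnr(L, tuple, normal) is meant to capture.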

[Figure: db with course children, each containing cno and title.]

PTnr(FO, tuple, normal): no recursion, no virtual nodes
– XPERANTO: PTnr(FO, tuple, normal)
– Microsoft SQL Server 2005 (FOR-XML): PTnr(FO, tuple, normal)
– Oracle 10g XML DB, DBMS_XMLGEN: PT(FP, tuple, normal) (connect-by of SQL’99)

22

Annotating schema or tree template

ATG of PRATA: DTD-directed view definition, inherited attributes

prereq → cno*, prereq

$cno ← Q($prereq_p), $prereq_c = Q($prereq_p) /* semantic rules */

Q: SELECT cno2 FROM prereq p, $prereq_p p’
   WHERE p.cno1 = $prereq_p.cno

– $prereq_p: parent attribute (relation register)

[Figure: nested prereq elements, each with cno children.]

PT(FO, relation, virtual): recursive views, virtual nodes, DTD-conformance

– Microsoft SQL Server 2005 (XSD): PTnr(CQ, tuple, normal)
– IBM DB2 XML Extender, DAD-SQL: PTnr(FO, tuple, normal)
– IBM DB2 XML Extender, DAD-RDB: PTnr(CQ, tuple, normal)
– TreeQL (SilkRoute): PTnr(CQ, tuple, virtual)

23

Putting these together

System                       Language        Class
Microsoft SQL Server 2005    FOR XML         PTnr(FO, tuple, normal)
                             annotated XSD   PTnr(CQ, tuple, normal)
IBM DB2 XML Extender         SQL/XML         PTnr(FO, tuple, normal)
                             DAD-SQL         PTnr(FO, tuple, normal)
                             DAD-RDB         PTnr(CQ, tuple, normal)
Oracle 10g XML DB            SQL/XML         PTnr(FO, tuple, normal)
                             DBMS_XMLGEN     PT(FP, tuple, normal)
XPERANTO                                     PTnr(FO, tuple, normal)
SilkRoute                    TreeQL          PTnr(CQ, tuple, virtual)
PRATA                        ATG             PT(FO, relation, virtual)








25

Termination and evaluation cost

Given a publishing transducer $\tau$ defined for a relational schema R,
– does the transformation of $\tau$ terminate on all DB of R?
– how expensive is it to compute $\tau(DB)$?

$\tau(DB)$ is always defined on any instance DB of R.

Worst-case data complexity:
– EXPTIME if $\tau$ is in PT(L, tuple, O)
– 2EXPTIME if $\tau$ is in PT(L, relation, O)
– PTIME if $\tau$ is in PTnr(L, S, O)

Tight bounds: DAG → tree, n-digit binary counter

L and O have no impact on the worst-case data complexity
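One way to see why the output can be exponentially larger than the database is that a compact DAG may unfold into a very large tree. The sketch below is not the lower-bound construction itself; it counts the nodes of the tree obtained by unfolding a hypothetical DAG in which node i has two edges to node i+1. The helper names `double_chain` and `unfolded_size` are illustrative.

```python
from functools import lru_cache

def double_chain(n):
    """DAG with n+1 nodes: node i has two edges to node i+1, node n is a leaf."""
    dag = {i: (i + 1, i + 1) for i in range(n)}
    dag[n] = ()
    return dag

def unfolded_size(dag, root):
    """Number of nodes in the tree obtained by unfolding `dag` from `root`."""
    @lru_cache(maxsize=None)
    def size(v):
        return 1 + sum(size(c) for c in dag[v])
    return size(root)

dag = double_chain(10)
tree_nodes = unfolded_size(dag, 0)   # 2**11 - 1 nodes from an 11-node DAG
```

The DAG grows linearly in n while its unfolding doubles at every level.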

26

Static analyses

For a class PT(L, tuple, O) of publishing transducers:

The emptiness problem: given $\tau$ in PT(L, tuple, O), can $\tau$ generate a nontrivial XML tree?
– Does the publishing transducer make sense?

The membership problem: given an XML tree T and a transducer $\tau$ in PT(L, tuple, O), can $\tau$ generate T with some DB?
– Can $\tau$ generate XML views that the user wants?

The equivalence problem: given $\tau_1$, $\tau_2$ in PT(L, tuple, O) on the same relational schema R, do $\tau_1$ and $\tau_2$ generate the same XML views over all instances of R?
– Optimization: can $\tau_1$ be replaced by a more efficient $\tau_2$?

27

Matching complexity bounds for static analyses

PT(L, S, O) when L is either FO or FP: beyond reach
– emptiness, membership and equivalence: undecidable

PT(CQ, S, O): slightly better
– Emptiness
  • PTIME if O is normal
  • NP-complete if O is virtual
– Membership
  • $\Sigma_2^p$-complete for PT(CQ, tuple, normal)
  • undecidable if S is relation or O is virtual
  Reduction from (a) the satisfiability problem for FO queries, and (b) the emptiness problem for 2-head DFA
– Equivalence: undecidable
  Reduction from the halting problem for 2RMs

28

Complexity bounds for non-recursive transducers

PTnr(FO, S, O): all three problems remain undecidable

PTnr(CQ, S, O): make our lives easier
– Emptiness: the same as PT(CQ, S, O)
– Membership (S is tuple)
  • PTnr(CQ, tuple, normal): $\Sigma_2^p$-complete – no better
  • PTnr(CQ, tuple, virtual): undecidable → $\Sigma_2^p$-complete
  Establish the small model property
– Equivalence
  • PTnr(CQ, tuple, O): undecidable → $\Pi_3^p$-complete

Lower bound: reduction from ***3SAT

Upper bound: a constructive proof

29

Summary: complexity bounds

Fragments                   Equivalence            Emptiness      Membership
PT(FP, S, O)                undecidable            undecidable    undecidable
PT(FO, S, O)                undecidable            undecidable    undecidable
PT(CQ, tuple, normal)       undecidable            PTIME          $\Sigma_2^p$-complete
PT(CQ, relation, normal)    undecidable            PTIME          undecidable
PT(CQ, S, virtual)          undecidable            NP-complete    undecidable
PTnr(FO, S, O)              undecidable            undecidable    undecidable
PTnr(CQ, tuple, normal)     $\Pi_3^p$-complete     PTIME          $\Sigma_2^p$-complete
PTnr(CQ, tuple, virtual)    $\Pi_3^p$-complete     NP-complete    $\Sigma_2^p$-complete








31

Containment relation

PT(FP, relation, virtual) = PT(FO, relation, virtual)

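[Figure: containment diagram over the classes PT(CQ, rel, virt), PT(FP, rel, nm), PT(FP, tup, virt), PT(FO, rel, nm), PT(CQ, rel, nm), PTnr(FO, tup, nm), PT(FO, tup, virt), PT(CQ, tup, virt), PT(FP, tup, nm), PT(FO, tup, nm), PTnr(CQ, tup, nm), PT(CQ, tup, nm) and PTnr(CQ, tup, virt), where rel, tup, virt and nm abbreviate relation, tuple, virtual and normal.]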

XML view: under each course, list all its prerequisites, direct or not

No need to upgrade DBMS and support SQL’99

32

Compared to logical transduction

(dom(x), root(x), edge(x;y), <(x;y), fc(x;y), ns(x;y), a(x))
– domain, root, edge, order, first-child, next-sibling, label
– define DAGs, unfold into a tree
– FO-transductions, SO-transductions (fixed k-arity), PTIME-FO-transductions, PSPACE-SO-transductions

Publishing transducers vs. logical transductions L-transductions PT(L, tuple, virtual)

strict for FO PSPACE-SO-transductions PT(FP, relation, virtual) (ordered) PTIME-FO-transductions PT(FO, relation, virtual) (ordered) fixed-depth L-transductions = PTnr(L, tuple, O) (unordered tree) PTnr(L, tuple, O) fixed-depth L-transductions (L: FP, FO)

No need to code stop conditions

33

DTD and specialized DTD

DTD $D = (\Sigma, r, P)$, with a production $a \to \alpha$ in P for each $a \in \Sigma$
– normalized: $\alpha ::= a_1, \ldots, a_k \mid a_1 + \cdots + a_k \mid a^*$

Specialized DTD $D' = (\Sigma', D, g)$, where D is a DTD and $g: \Sigma' \to \Sigma$
– T’ conforms to D’: there is T s.t. T = g(T’) and T conforms to D
– captures MSO definable trees and regular trees

Capturing (specialized) DTD:
– specialized DTDs are definable in PT(FO, tuple, virtual)
– normalized DTDs are definable in PT(FO, tuple, normal)
– there are normalized DTDs not definable in PT(CQ, S, O)

Check each $a \to \alpha$ in FO, return a default in the presence of violation
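Conformance to a normalized DTD can be checked node by node, since every production is a concatenation, a disjunction or a star. The sketch below assumes a hypothetical encoding: a tree is a (tag, children) pair, and each production is a ("seq" | "alt" | "star", argument) pair. The sample DTD is a simplified variant of D0 made up for the example, with leaves modeled as empty concatenations.

```python
def conforms(node, dtd):
    """Check that every node's child labels match its normalized production."""
    tag, kids = node
    kind, arg = dtd[tag]
    labels = [k[0] for k in kids]
    if kind == "seq":        # a -> a1, ..., ak
        ok = labels == list(arg)
    elif kind == "alt":      # a -> a1 + ... + ak: exactly one of the alternatives
        ok = len(labels) == 1 and labels[0] in arg
    elif kind == "star":     # a -> b*
        ok = all(label == arg for label in labels)
    else:
        ok = False
    return ok and all(conforms(k, dtd) for k in kids)

# Hypothetical normalized DTD, loosely modeled on D0.
DTD = {
    "db": ("star", "course"),
    "course": ("seq", ("cno", "type")),
    "type": ("alt", ("regular", "project")),
    "cno": ("seq", ()),
    "regular": ("seq", ()),
    "project": ("seq", ()),
}

good = ("db", [("course", [("cno", []), ("type", [("regular", [])])])])
bad = ("db", [("course", [("cno", []), ("type", [("seminar", [])])])])
```

Here `conforms(good, DTD)` holds, while `bad` is rejected at the type node because seminar is not one of its alternatives.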

DTD-directed publishing: All members of a community (or industry)

agree on a DTD and then exchange data w.r.t. the predefined DTD

34

publishing transducer as a relational query

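Input: $\tau = (Q, \Sigma, q_0, \delta)$ for R, an output tag $o \in \Sigma$, a DB of R

Output: the union of $Reg_o(v)$ for all nodes v in the tree generated

[Figure: the tree (db, course, cno, title, type, prereq) is generated from the RDB by relational queries Q1, Q2; the registers Reg of the nodes with the output tag, here prereq, are collected as the output relation.]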

35

Containment hierarchy: as relational queries

Flattened: PT(L, S, virtual) = PT(L, S,normal)

PT(FP, relation, O) = PT(FO, relation, O)

[Figure: containment diagram over PT(FO, rel, O), PT(CQ, rel, O), PT(FP, tup, O), PT(FO, tup, O), PT(CQ, tup, O), PTnr(FO, tuple, O) and PTnr(CQ, tuple, O); one containment is annotated “not strict if NLOGSPACE = PTIME”.]

36

complexity classes and relational query languages

PT(FO, relation, O) captures PSPACE (ordered or unordered)
(a) Recognition problem can be determined using a PSPACE TM
(b) Simulate partial fixpoint query and define a total order

PT(FP, tuple, O) captures FP and thus PTIME (ordered)

PT(FO, tuple, O)
– captures TC0[FO] and thus NLOGSPACE (ordered)
– TC0[FO] (unordered)
Simulate transitive closure logic and vice versa

PT(CQ, relation, O) contains deterministic datalog

PT(CQ, tuple, O) captures linear datalog

datalog: $p(\bar{x}) \leftarrow p_1(\bar{x}_1), \ldots, p_k(\bar{x}_k)$
– deterministic: each $p(\bar{x})$ has only one rule
– linear: at most one $p_j$ is an IDB
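As a concrete instance of a linear datalog program, the set of all prerequisites of a course, direct or not, is the transitive closure of prereq. The sketch below evaluates the two rules closure(c, p) ← prereq(c, p) and closure(c, p) ← closure(c, m), prereq(m, p) by semi-naive iteration; each body has at most one IDB atom. The function name `all_prereqs` and the sample data are assumptions for the illustration.

```python
def all_prereqs(prereq):
    """Transitive closure of prereq, computed by semi-naive evaluation."""
    closure = set(prereq)          # closure(c, p) <- prereq(c, p)
    delta = set(prereq)            # tuples derived in the previous round
    while delta:
        # closure(c, p) <- closure(c, m), prereq(m, p), joined on newly derived tuples only
        new = {(c, p) for (c, m) in delta
               for (m2, p) in prereq if m == m2} - closure
        closure |= new
        delta = new
    return closure

prereq = {("CS650", "CS450"), ("CS450", "CS320")}
closure = all_prereqs(prereq)
```

Because each round only joins the newly derived tuples with the base relation, the loop also terminates on cyclic prerequisite data.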

37

non-recursive classes as relational query languages

PTnr(FO, tuple, O) captures FO (ordered or unordered)

PTnr(CQ, tuple, O) captures UCQ (ordered or unordered)
Simulate union of conjunctive queries and vice versa

Those corresponding to existing XML publishing languages:
– PTnr(FO, tuple, O): SQL/XML, FOR-XML (Microsoft), IBM DAD (SQL), …
– PTnr(CQ, tuple, O): XSD (Microsoft), TreeQL

38

Expressiveness as relational queries

Fragments              Complexity/language
PT(FP, relation, O)    PSPACE
PT(FO, relation, O)    PSPACE
PT(FP, tuple, O)       FP, PTIME (ordered databases)
PT(FO, tuple, O)       TC0[FO], NLOGSPACE (ordered databases)
PT(CQ, relation, O)    deterministic datalog
PT(CQ, tuple, O)       TC0[CQ], linear datalog
PTnr(FO, tuple, O)     FO
PTnr(CQ, tuple, O)     UCQ

PT(L, S, virtual) = PT(L, S, normal)








40

Incremental publishing

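Input:
– a publishing transducer $\tau$ for relational schema R
– an instance DB of R
– XML view $T = \tau(DB)$
– relational updates $\Delta DB$

Output: XML updates $\Delta T$ such that $T + \Delta T = \tau(DB + \Delta DB)$

Commercial products: limited support

[Figure: middleware between the DBMS (RDB) and the published XML view propagates incremental updates: $\Delta DB$ on the database becomes $\Delta T$ on the view.]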

41

Why incremental update?

[Figure: a source database DB is published as a cached XML view T; updates $\Delta DB$ are propagated as incremental updates $\Delta T$.]

Batch computation: recompute the entire XML tree from scratch; large XML views may take several hours to produce!

Incremental computation: compute the XML change $\Delta T$
– Idea: the new view T’ = the old view T + $\Delta T$
– Typically more efficient to compute $\Delta T$ (small) and update the old view T with $\Delta T$
– Why? The new view T’ often differs only slightly from the old view T
– Reuse partial results computed earlier

42

Reduction Approach

Most XML middleware takes a “reduction approach”:

– treat Relational Database Systems (DBMS) as a black box,

– re-use as much functionality of DBMS as possible

Why not the reduction approach for incremental updates?

– XML views are recursive

– Few systems support WITH…RECURSIVE (linear recursion)

– Fewer support its use in views

– None supports incremental update of recursive views (many algorithms are known for incremental updates of recursive views, but unfortunately not in practice)

The lowest common denominator of functionality of DBMS -- no need for (recursive) view-update support

43

Sub-Tree Property

[Figure: an XML view rooted at report, with patient elements containing SSN, name, policy# and treatment subtrees; each treatment has a tname and an inTreatment element with further treatment subtrees. The same treatment sub-tree occurs under several patients.]

Sub-tree Property: given a transducer $\tau$ and DB, each sub-tree is uniquely determined by $(q, a, Reg_a)$, e.g., $(q, treatment, Reg_{tr})$

44

Storing and updating XML – a DAG representation

Storing each XML sub-tree only once, at any level of granularity
– Associate an ID with each node in the tree (Skolem function): a small, unique value derived from the node’s register
– A hash table H maps (q, type, ID) to a node in the graph
– Sub-tree pool: each node has a reference count and a children list [(q1, type1, ID1), (q2, type2, ID2), …]

XML update $\Delta T = (E^+, E^-)$, sets of edges ((q1, type1, ID1), (q2, type2, ID2))
– $E^-$: remove (q2, type2, ID2) from the child list of (q1, type1, ID1) and decrement the reference count on (q2, type2, ID2)
– $E^+$: insert (q2, type2, ID2) in the child list of (q1, type1, ID1) and increment the reference count on (q2, type2, ID2)
– Nodes with 0 reference counts move to the sub-tree pool, to be reused

H[(tname, “chemo”), (inTreatment, “iT23”)](treatment, “t123”)

(inTreatment, “iT234”) [(treatment, “t345”), (treatment, “t567”), … ]
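The DAG representation can be sketched as a hash table keyed by (q, type, ID), with a reference count and a children list per node. This is a minimal illustration under assumptions: the class name `SubtreeStore`, the `skolem_id` helper (a truncated SHA-1 of the register) and the sample keys are hypothetical, and cuts are applied before buds in each update.

```python
import hashlib

def skolem_id(register):
    """Derive a short ID from a node's register (a set of tuples)."""
    return hashlib.sha1(repr(sorted(register)).encode("utf-8")).hexdigest()[:8]

class SubtreeStore:
    def __init__(self):
        self.nodes = {}    # H: (q, type, ID) -> {"children": [...], "refs": int}
        self.pool = set()  # keys whose reference count dropped to 0, kept for reuse

    def node(self, key):
        return self.nodes.setdefault(key, {"children": [], "refs": 0})

    def apply(self, e_plus, e_minus):
        """Apply an XML update given as edge sets (E+, E-)."""
        for parent, child in e_minus:           # cuts
            self.node(parent)["children"].remove(child)
            entry = self.node(child)
            entry["refs"] -= 1
            if entry["refs"] == 0:
                self.pool.add(child)            # not discarded: may be reused
        for parent, child in e_plus:            # buds or cross edges
            self.node(parent)["children"].append(child)
            self.node(child)["refs"] += 1
            self.pool.discard(child)

store = SubtreeStore()
parent = ("q", "inTreatment", "iT234")
child = ("q", "treatment", "t345")
store.apply([(parent, child)], [])
```

A sub-tree shared by several parents is stored once; cutting one edge only decrements its count, and a later bud with the same key picks it up from the pool instead of recomputing it.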

45

Computing XML changes

Computing XML changes $\Delta T$ from database changes $\Delta DB$ by incrementalizing SQL queries in a transducer:

select IP, P.tname2

from Procedure P, inTreatment IP

where P.tname1 = IP

Cuts (deletions): given $\Delta DB$, deletions of the existing edges of T are determined by executing a fixed number of non-recursive SQL queries; no recursion is involved (sub-tree property)

Buds (new sub-tree generation): top-down iteration, evaluating non-recursive SQL queries at each step
– Each new sub-tree is computed at most once, by reusing sub-trees (sub-tree pool), minimizing recomputations
– Partial results are “complete” up to a certain level at each step, allowing lazy evaluation and parallel processing
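Incrementalizing a non-recursive join query means computing only the result tuples caused by the inserted rows, instead of re-running the query. The sketch below does this for insertions into a join shaped like the Procedure/inTreatment query, with relations as Python sets. The relation contents, the reading of inTreatment as a set of treatment names, and the function names `join` and `delta_join` are assumptions for the illustration.

```python
def join(procedure, in_treatment):
    """Pairs (tname1, tname2) of Procedure whose tname1 is in inTreatment."""
    return {(t1, t2) for (t1, t2) in procedure if t1 in in_treatment}

def delta_join(procedure, in_treatment, d_procedure, d_in_treatment):
    """New result tuples caused by insertions d_procedure and d_in_treatment."""
    new = (join(d_procedure, in_treatment | d_in_treatment)
           | join(procedure, d_in_treatment))
    return new - join(procedure, in_treatment)

procedure = {("chemo", "scan"), ("scan", "blood")}
in_treatment = {"chemo"}
d_procedure = {("blood", "x-ray")}
d_in_treatment = {"scan", "blood"}

old = join(procedure, in_treatment)
delta = delta_join(procedure, in_treatment, d_procedure, d_in_treatment)
full = join(procedure | d_procedure, in_treatment | d_in_treatment)
```

The old result together with `delta` equals the result recomputed from scratch, but `delta_join` only touches joins in which at least one side is a (typically small) set of inserted rows.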

46

Steps to Bud-Cut

1. For a set of database changes, $\Delta DB$, execute a fixed number of non-recursive queries which determine direct edge changes, $E^-$ and $E^+$
   – $E^-$ are cuts
   – $E^+$ are buds (or cross edges)

2. Generate the sub-trees under the buds, re-using as many existing and deleted sub-trees as possible

3. Collect garbage.

[Figure: the report tree with patient, SSN, name, policy#, treatment, tname and inTreatment nodes; cut edges are marked with X and a new treatment sub-tree is attached as a bud.]

47

The XML view update problem

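Input:
– a publishing transducer $\tau$ for relational schema R
– an instance DB of R
– XML view $T = \tau(DB)$
– XML updates $\Delta T$

Output: relational updates $\Delta DB$ such that $T + \Delta T = \tau(DB + \Delta DB)$

Commercial systems: limited support, already hard for relational views

[Figure: middleware between the published XML view and the DBMS (RDB) translates view updates: $\Delta T$ on the view becomes $\Delta DB$ on the database.]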

48

New challenges introduced by XML view updates

Revising the semantics of side effects

$\Delta T$: delete course[cno='CS650']//course[cno='CS450']/prereq/*

Subtree property: remove the prerequisites of all CS450 occurrences?

DTD validation (if any)

Recursively defined
– XML views
– XML updates

[Figure: a db tree in which the course CS450 occurs both below CS650 and elsewhere; the prereq sub-tree of the occurrence below CS650 is marked X for deletion, and the other occurrence is marked with a question mark.]

49

Processing XML view updates

[Figure: the XML view is stored as relational views V over DB; XML updates $\Delta T$ are translated to view updates $\Delta V$ and then to database updates $\Delta DB$.]

Deriving relational views V from XML views (edge relations of DAG – external storage)

1. DTD validation – reject $\Delta T$ if violation
2. Computing view updates $\Delta V$ from $\Delta T$
3. Computing updates $\Delta DB$ from $\Delta V$
   May not exist – reject $\Delta T$ if not
4. Update the underlying DB and view V with $\Delta DB$ from $\Delta V$

Main challenges: relational view updates
– Hard: deciding view updatability is intractable/undecidable
– Open: complexity, algorithm, commercial system support








51

XML integration: complexity and expressiveness

[Figure: multiple, distributed source databases (DB) are integrated into a single XML view that conforms to a DTD and constraints.]

XML integration transducers

Two-way vs. top-down: context-dependent generation

Integrity constraints: conformance to XML schema

Information preservation: data migration

XML integration language: Attribute Integration Grammar (AIG)

52

XML shredding

Storing XML data in relations: storage, query processing, RDBMS transaction control, …

Primary goal:
– store part or entire XML documents – content based
– increment existing relations, rather than build a new one
– directed by recursive XML schema

[Figure: middleware between XML and the relational DBMS. Shredding goes from XML to the RDB; query translation and query answers flow between the two.]

53

XML shredding automata

Shredding automata vs. publishing transducers
– take an existing tree as input, rather than relations
– embedded XML queries, not relational, to compute Reg
– output: union of relation registers, the tuples to insert
– combining XML SAX parsing and shredding, e.g., XML2DB

Primary goal: expressive power and complexity

[Figure: an XML tree (db, course, cno, title, type, prereq) is traversed with embedded XML queries Q, Q1, Q2 that fill registers Reg; the registers of prereq nodes are collected into the RDB.]

54

Summary

XML publishing: a synergy between theory and practice
– characterization of XML publishing languages in practice;
– expressive power and matching complexity bounds.
Helpful guidance for both the users and database vendors

Dynamic aspects: incremental publishing and view updates.
Important yet overlooked by and large

Open research issues:
– XML integration transducers
– XML shredding automata
– . . .

An attempt to bridge theory and practice
