1
XML Publishing:
Bridging Theory and Practice
Wenfei Fan
University of Edinburgh
and
Bell Laboratories
2
XML documents
Rooted, node-labeled, ordered, unranked tree
element: e.g., course, prereq
– tagged, with a subtree
– subelement, e.g., the prereq child of course
text node: e.g., “CS650”
– carries text, not tagged, a leaf
[Figure: XML tree rooted at db with course children; a course has cno (“CS650”), title (“Web DB”), type (regular) and prereq subelements, and prereq leads to further cno and course nodes]
3
XML publishing: data exchange on the Web
Most legacy data is stored in relational databases
XML has become the prime standard for data exchange
[Figure: DB1 and DB2 exchange data over the Web as XML; the XML view is published from the source RDB by a view mapping Q]
4
XML publishing: an XML interface of databases
[Figure: middleware between XML and the DBMS: publishing from the RDB to XML (with a DTD), and query translation with query answers returned as XML]
Querying and updating traditional databases via XML views
5
Example: XML publishing
Relational schema R0:
course (cno, title, type)
prereq (cno1, cno2) -- prerequisite hierarchy
XML DTD D0:
db → course*
course → cno, title, type, prereq
prereq → cno*, prereq
type → regular | project
[Figure: the Registrar DB (schema R0) is mapped to an XML view, a db tree of course elements with cno (“CS650”), title (“Web DB”), type (regular) and prereq subelements]
6
XML publishing languages in practice
XML view definition languages: XML views published from RDB
Commercial products:
– Microsoft SQL Server 2005 (FOR-XML, XSD)
– IBM DB2 XML Extender (SQL/XML, DAD: SQL, RDB)
– Oracle 10g XML DB (SQL/XML, DBMS_XMLGEN)
– …
Research prototypes:
– XPERANTO
– TreeQL (SilkRoute)
– ATG (PRATA)
[Figure: view mapping from the source RDB to the XML view]
7
XML publishing in practice
[Figure: the view mapping builds the db tree of course elements from the source RDB; the queries Q, Q1, Q2 attached to tree nodes are relational queries evaluated on the RDB]
Top-down from the root, via embedded relational queries
8
XML publishing: question of the users
What language should a user choose to express the view?
– unbounded depth, nondeterministic “shape”: cannot be decided statically at compile time
prereq → cno*, prereq
type → regular | project
[Figure: db tree in which prereq nests to unbounded depth and holds an unbounded collection of cno nodes, and the type of a course is either regular or project]
Few publishing languages can define this view
9
XML publishing: question of database vendors
XML view: under each course, list all its prerequisites, direct or not
– collapsing the prerequisite hierarchy
– a tree of depth three
Question: is it necessary to upgrade DBMS and support SQL’99?
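[Figure: db tree of depth three; each course has cno (“CS650”), title (“Web DB”) and type (project) children, followed by the cno of every prerequisite; the tree is generated by queries Q and Q1]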
The expressive power and complexity of XML publishing languages
10
Outline
XML publishing transducers
Characterization of XML publishing languages in practice
Complexity: evaluation cost, static analyses
Expressive power: tree generation, relational characterization
Dynamic aspect: incremental XML publishing, view updates
Open research issues
Joint work with
– Theory: Floris Geerts, Frank Neven [PODS’07]
– System: Michael Benedikt, Phil Bohannon, Cheeyong Chan, Rajeev Rastogi, … [SIGMOD’03,04; VLDB’02,04,05; ICDE’07]
12
XML publishing transducers
τ = (Q, Σ, q0, δ) for a relational schema R
– Q: a finite set of states
– Σ: a finite alphabet of XML tags, with a root r and text
– q0: the start state
– δ: for each pair (q, a) in Q × Σ, a rule
(q, a) → (q1, a1, φ1(x1; y1)), . . ., (qk, ak, φk(xk; yk))
– generates the children of a-nodes: a1*, . . ., ak*
– register Rega: set-valued, of fixed arity, associated with each a-node
– φi: a query over R and Rega that computes Regai, in a relational query language L
– xi: a list of free variables in φi, the grouping attributes
– deterministic
• (q, text) → . -- empty RHS: text nodes have no children
13
Top-down transduction
Start rule: (q0, r) -- q0 and r do not appear on the RHS of any rule
(q0, db) → (q, course, φ1(c, t; ∅))
φ1(c, t; ∅) = ∃t’ course(c, t’, t)
recall course (cno, title, type)
tuple register Regc: group the result by all attributes, x = (c, t), y = ∅
for each distinct tuple tp in the result of φ1(x; ∅)
– create a course element
– carry the tuple tp in Regc
expand at leaf nodes
[Figure: the root (q0, db) with one (q, course) child per tuple; each node is labeled (q, a) and carries its register Reg]
14
Registers: tuple vs. relation
(q, course) → (q, cno, φ2(c; ∅)), (q, type, φ3(t; ∅)), (q, prereq, φ4(∅; c))
φ2(c; ∅) = ∃t Regc(c, t)
φ4(∅; c) = ∃t, c’ (Regc(c’, t) ∧ prereq(c’, c))
recall prereq(cno1, cno2)
tuple registers Regcno, Regt
relation register Regp: x = ∅, y = (c), the result of φ4(∅; c) is a set
top-down information passing: the parent register Regc is used in φ4(∅; c)
[Figure: (q0, db) with (q, course) children carrying Regc; under a course, (q, cno), (q, type) and (q, prereq) carry Regcno, Regt and Regp]
15
Recursive transducer and stop condition
(q, prereq) → (q, cno, φ5(c; ∅)), (q, prereq, φ5(∅; c))
φ5(∅; c) = ∃t, c’ (Regp(c’, t) ∧ prereq(c’, c))
Stop conditions:
– φ5(∅; c) returns an empty set
– the RHS of (q, a) is empty (e.g., for text nodes)
– there is an ancestor node with the same state, tag and register: no new information can be added to the tree
[Figure: a path (q0, db), (q, course), (q, prereq), (q, prereq), … in which cno nodes carry tuple registers Regcno and prereq nodes carry relation registers Regp; expansion STOPs at a node (q, a) whose register Rega equals that of an ancestor (q, a)]
16
Transformation of a publishing transducer
τ terminates on a DB of R if all leaf nodes satisfy a stop condition
τ(DB): the XML tree obtained by striking out states and registers
τ(R): the set of XML trees generated by τ for all DB of R
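The top-down transduction, the two register kinds and the stop conditions can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `run`, `to_xml`, `rules` and the encoding of queries as Python functions over a dictionary of relations are assumptions, the sample tuples are made up, and text nodes are simplified by printing the register of cno and type nodes as text.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Tuple[str, ...]
Reg = FrozenSet[Row]
DB = Dict[str, set]
# One right-hand-side item: (child state, child tag, query, tuple_register?)
Item = Tuple[str, str, Callable[[DB, Reg], set], bool]
Rules = Dict[Tuple[str, str], List[Item]]

def run(rules: Rules, db: DB, state: str, tag: str,
        reg: Reg = frozenset(), ancestors: frozenset = frozenset()):
    """Expand node (state, tag, reg) top-down; return (tag, register rows, children)."""
    key = (state, tag, reg)
    children = []
    # Stop condition: an ancestor has the same state, tag and register.
    if key not in ancestors:
        for c_state, c_tag, query, per_tuple in rules.get((state, tag), []):
            result = query(db, reg)
            if not result:
                continue  # stop condition: the query returns an empty set
            # Tuple register: one child per distinct tuple.
            # Relation register: a single child carrying the whole set.
            regs = ([frozenset([row]) for row in sorted(result)]
                    if per_tuple else [frozenset(result)])
            children += [run(rules, db, c_state, c_tag, r, ancestors | {key})
                         for r in regs]
    return (tag, sorted(reg), children)

def to_xml(tree, text_tags=("cno", "type")) -> str:
    """Strike out states and registers; cno/type nodes print their value as text."""
    tag, reg, children = tree
    if tag in text_tags:
        return f"<{tag}>{reg[0][0]}</{tag}>"
    return f"<{tag}>" + "".join(to_xml(c, text_tags) for c in children) + f"</{tag}>"

rules: Rules = {
    ("q0", "db"): [("q", "course",
                    lambda db, reg: {(c, t) for (c, _title, t) in db["course"]}, True)],
    ("q", "course"): [
        ("q", "cno", lambda db, reg: {(c,) for (c, t) in reg}, True),
        ("q", "type", lambda db, reg: {(t,) for (c, t) in reg}, True),
        ("q", "prereq", lambda db, reg: {(c2,) for (c, t) in reg
                                         for (c1, c2) in db["prereq"] if c1 == c}, False)],
    ("q", "prereq"): [
        ("q", "cno", lambda db, reg: set(reg), True),
        ("q", "prereq", lambda db, reg: {(c2,) for (c,) in reg
                                         for (c1, c2) in db["prereq"] if c1 == c}, False)],
}

# Made-up sample instance of course(cno, title, type) and prereq(cno1, cno2).
db = {"course": {("CS650", "Web DB", "regular"), ("CS450", "Databases", "project")},
      "prereq": {("CS650", "CS450"), ("CS450", "CS320")}}
xml = to_xml(run(rules, db, "q0", "db"))
```

On a cyclic prereq relation the third stop condition is what ends the expansion: the repeated (state, tag, register) triple becomes a leaf.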
[Figure: the tree of (q, a) nodes with their registers Reg, rooted at (q0, db), and the XML tree τ(DB) obtained from it by striking out states and registers]
17
publishing transducers with virtual nodes
τ = (Q, Σ, Σa, q0, δ)
Σa: a subset of Σ, the virtual tags
Recall the view: under each course, list all its prerequisites
τ’ = (Q, Σ, Σa = {prereq}, q0, δ)
[Figure: the generated tree with nested prereq nodes, next to the output tree in which each course lists its cno, type and all prerequisite cno nodes directly]
Virtual nodes are removed from the output
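A minimal sketch of this removal step, assuming that the children of a virtual node are spliced into the child list of its nearest non-virtual ancestor, which is what the depth-three view suggests. The function name `strip_virtual` and the (tag, children) tree encoding are hypothetical.

```python
def strip_virtual(tree, virtual):
    """Return the list of trees that replaces `tree` once virtual tags are removed."""
    tag, children = tree
    new_children = []
    for child in children:
        new_children.extend(strip_virtual(child, virtual))
    if tag in virtual:
        return new_children  # splice the children into the parent's child list
    return [(tag, new_children)]

# A course whose prerequisites are nested two levels deep.
course = ("course", [("cno", []), ("type", []),
                     ("prereq", [("cno", []),
                                 ("prereq", [("cno", [])])])])
flat = strip_virtual(course, {"prereq"})[0]
```

With prereq declared virtual, the nested hierarchy collapses into a flat list of cno children under course.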
18
Various classes of publishing transducers
PT(L, S, O)
– L: the relational query language (CQ, FO, FP, with = and ≠)
– S: register, relation vs. tuple (a special case of relation Reg)
– O: output nodes, normal vs. virtual
PTnr(L, S, O): non-recursive subset of PT(L, S, O)
Example:
– View 1: PT(CQ, relation, normal)
– View 2: PT(CQ, relation, virtual) and PTnr(FP, tuple, normal)
As opposed to query automata:
– take a relational database as input, rather than an existing tree
– output a new tree, rather than accepting a tree or selecting nodes
In contrast to recent work on schema mapping:
– relations to XML, not relation-to-relation or XML-to-XML
– via embedded relational queries, not source-to-target constraints
20
Existing XML publishing languages
Extensions of SQL by incorporating XML publishing functions
– Microsoft SQL Server 2005 (FOR-XML)
– IBM DB2 XML Extender (SQL/XML)
– Oracle 10g XML DB (SQL/XML, DBMS_XMLGEN)
– XPERANTO
– …
Annotating schema or fixed tree template with relational queries
– Microsoft SQL Server 2005 (XSD)
– IBM DB2 XML Extender (DAD: SQL, RDB)
– TreeQL (SilkRoute)
– ATG (PRATA)
– . . .
21
Extensions of SQL for XML publishing
SQL/XML: XMLElement, XMLForest, XMLAgg, XMLConcat, …
SELECT XMLELEMENT(NAME "course",
XMLFOREST(c.cno AS "cno", c.title AS "title"))
FROM course c
[Figure: db tree with one course element per tuple, each with cno and title children]
PTnr(FO, tuple, normal): no recursion, no virtual nodes
– XPERANTO: PTnr(FO, tuple, normal)
– Microsoft SQL Server 2005 (FOR-XML): PTnr(FO, tuple, normal)
– Oracle 10g XML DB, DBMS_XMLGEN: PT(FP, tuple, normal) (via connect-by, Oracle's form of SQL’99-style recursion)
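The effect of the non-recursive SQL/XML query above can be imitated with standard-library Python, which may help readers without a SQL/XML-capable DBMS at hand. This is a sketch under assumptions: SQLite has no XMLELEMENT or XMLFOREST, so the elements are built with `xml.etree.ElementTree`, a db root is added around the course elements, and the sample rows are made up.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (cno TEXT, title TEXT, type TEXT)")
conn.executemany("INSERT INTO course VALUES (?, ?, ?)",
                 [("CS650", "Web DB", "regular"),
                  ("CS450", "Databases", "project")])  # made-up sample rows

db = ET.Element("db")
# One course element per tuple, with cno and title children (no recursion).
for cno, title in conn.execute("SELECT cno, title FROM course ORDER BY cno"):
    course = ET.SubElement(db, "course")
    ET.SubElement(course, "cno").text = cno
    ET.SubElement(course, "title").text = title

xml = ET.tostring(db, encoding="unicode")
```

The depth of the output is fixed by the query text, independent of the data, which is the defining trait of the PTnr classes.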
22
Annotating schema or tree template
ATG of PRATA: DTD-directed view definition, inherited attributes
prereq → cno*, prereq
$cno ← Q($prereq_p), $prereq_c = Q($prereq_p) /* semantic rules */
Q: SELECT p.cno2 FROM prereq p, $prereq_p p’
WHERE p.cno1 = p’.cno
– $prereq_p: parent attribute (relation register)
[Figure: a prereq node with cno children and a nested prereq child]
PT(FO, relation, virtual): recursive views, virtual nodes, DTD-conformance
– Microsoft SQL Server 2005 (XSD): PTnr(CQ, tuple, normal)
– IBM DB2 XML Extender, DAD-SQL: PTnr(FO, tuple, normal)
– IBM DB2 XML Extender, DAD-RDB: PTnr(CQ, tuple, normal)
– TreeQL (SilkRoute): PTnr(CQ, tuple, virtual)
23
Putting these together
System                      Language        Class
Microsoft SQL Server 2005   FOR XML         PTnr(FO, tuple, normal)
                            annotated XSD   PTnr(CQ, tuple, normal)
IBM DB2 XML Extender        SQL/XML         PTnr(FO, tuple, normal)
                            DAD-SQL         PTnr(FO, tuple, normal)
                            DAD-RDB         PTnr(CQ, tuple, normal)
Oracle 10g XML DB           SQL/XML         PTnr(FO, tuple, normal)
                            DBMS_XMLGEN     PT(FP, tuple, normal)
XPERANTO                                    PTnr(FO, tuple, normal)
SilkRoute                   TreeQL          PTnr(CQ, tuple, virtual)
PRATA                       ATG             PT(FO, relation, virtual)
25
Termination and evaluation cost
Given a publishing transducer τ defined for a relational schema R,
– does the transformation of τ terminate on all DB of R?
– how expensive is it to compute τ(DB)?
τ(DB) is always defined on any instance DB of R.
Worst-case data complexity:
– EXPTIME if τ is in PT(L, tuple, O)
– 2EXPTIME if τ is in PT(L, relation, O)
– PTIME if τ is in PTnr(L, S, O)
Tight bounds: DAG → tree, n-digit binary counter
L and O have no impact on the worst-case data complexity
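The “DAG → tree” remark can be illustrated with a small calculation: a DAG with n + 1 nodes can unfold into a tree with exponentially many nodes. The sketch below only illustrates that blow-up and is not the proof of the bounds; the names `tree_size` and `diamond_chain` are hypothetical.

```python
def tree_size(dag, node, memo=None):
    """Number of nodes in the tree obtained by unfolding `dag` from `node`."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = 1 + sum(tree_size(dag, child, memo) for child in dag[node])
    return memo[node]

def diamond_chain(n):
    """DAG with n + 1 nodes in which node i has two edges to node i + 1."""
    dag = {i: [i + 1, i + 1] for i in range(n)}
    dag[n] = []
    return dag

dag = diamond_chain(10)
unfolded = tree_size(dag, 0)   # 2**11 - 1 tree nodes from an 11-node DAG
```

Each level doubles the number of copies of the sub-tree below it, so the unfolded size is 2^(n+1) − 1 for a chain of length n.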
26
Static analyses
For a class PT(L, tuple, O) of publishing transducers:
The emptiness problem: given τ in PT(L, tuple, O), can τ generate a nontrivial XML tree?
– Does the publishing transducer make sense?
The membership problem: given an XML tree T and a transducer τ in PT(L, tuple, O), can τ generate T with some DB?
– Can τ generate XML views that the user wants?
The equivalence problem: given τ1, τ2 in PT(L, tuple, O) on the same relational schema R, do τ1 and τ2 generate the same XML views over all instances of R?
– Optimization: can τ1 be replaced by a more efficient τ2?
27
Matching complexity bounds for static analyses
PT(L, S, O) when L is either FO or FP: beyond reach
– emptiness, membership and equivalence: undecidable
PT(CQ, S, O): slightly better
– Emptiness:
• PTIME if O is normal
• NP-complete if O is virtual
– Membership:
• Σ2p-complete for PT(CQ, tuple, normal)
• undecidable if S is relation or O is virtual
Reduction from (a) the satisfiability problem for FO queries, and (b) the emptiness problem for 2-head DFA
– Equivalence: undecidable
Reduction from the halting problem for 2RMs
28
Complexity bounds for non-recursive transducers
PTnr(FO, S, O): all three problems remain undecidable
PTnr(CQ, S, O): makes our lives easier
– Emptiness: the same as PT(CQ, S, O)
– Membership (S is tuple):
• PTnr(CQ, tuple, normal): Σ2p-complete (no better)
• PTnr(CQ, tuple, virtual): from undecidable to Σ2p-complete
Establish the small model property
– Equivalence:
• PTnr(CQ, tuple, O): from undecidable to Π3p-complete
Lower bound: reduction from ∀*∃*∀*3SAT
Upper bound: a constructive proof
29
Summary: complexity bounds
Fragment                   Equivalence    Emptiness     Membership
PT(FP, S, O)               undecidable    undecidable   undecidable
PT(FO, S, O)               undecidable    undecidable   undecidable
PT(CQ, tuple, normal)      undecidable    PTIME         Σ2p-complete
PT(CQ, relation, normal)   undecidable    PTIME         undecidable
PT(CQ, S, virtual)         undecidable    NP-complete   undecidable
PTnr(FO, S, O)             undecidable    undecidable   undecidable
PTnr(CQ, tuple, normal)    Π3p-complete   PTIME         Σ2p-complete
PTnr(CQ, tuple, virtual)   Π3p-complete   NP-complete   Σ2p-complete
31
Containment relation
PT(FP, relation, virtual) = PT(FO, relation, virtual)
[Figure: containment diagram relating PT(CQ, rel, virt), PT(FP, rel, nm), PT(FP, tup, virt), PT(FO, rel, nm), PT(CQ, rel, nm), PTnr(FO, tup, nm), PT(FO, tup, virt), PT(CQ, tup, virt), PT(FP, tup, nm), PT(FO, tup, nm), PTnr(CQ, tup, nm), PT(CQ, tup, nm) and PTnr(CQ, tup, virt)]
XML view: under each course, list all its prerequisites, direct or not
No need to upgrade DBMS and support SQL’99
32
Compared to logical transduction
(dom(x), root(x), edge(x;y), <(x;y), fc(x;y), ns(x;y), a(x))
– domain, root, edge, order, first-child, next-sibling, label
– define DAGs, unfold into a tree
– FO-transductions, SO-transductions (fixed k-arity), PTIME FO-transductions, PSPACE SO-transductions
Publishing transducers vs. logical transductions L-transductions PT(L, tuple, virtual)
strict for FO PSPACE-SO-transductions PT(FP, relation, virtual) (ordered) PTIME-FO-transductions PT(FO, relation, virtual) (ordered) fixed-depth L-transductions = PTnr(L, tuple, O) (unordered tree) PTnr(L, tuple, O) fixed-depth L-transductions (L: FP, FO)
No need to code stop conditions
33
DTD and specialized DTD
DTD D = (Σ, r, P), P: a → α for each a ∈ Σ
– normalized: α ::= a1, …, ak | a1 + … + ak | a*
Specialized DTD D’ = (Σ’, D, g), D: a DTD, g: Σ’ → Σ
– T’ conforms to D’: there is T s.t. T = g(T’) and T conforms to D
– captures MSO-definable trees and regular trees
Capturing (specialized) DTDs:
– specialized DTDs are definable in PT(FO, tuple, virtual)
– normalized DTDs are definable in PT(FO, tuple, normal)
– there are normalized DTDs not definable in PT(CQ, S, O)
Check each a → α in FO, return a default in the presence of violation
DTD-directed publishing: all members of a community (or industry) agree on a DTD and then exchange data w.r.t. the predefined DTD
34
publishing transducer as a relational query
Input: τ = (Q, Σ, q0, δ) for R, an output tag o ∈ Σ, a DB of R
Output: the union of Rego(v) for all v in the tree generated
[Figure: the transducer, with embedded queries Q1 and Q2, generates the db tree from the RDB; the registers Reg of the prereq nodes are collected as the output of the relational query]
35
Containment hierarchy: as relational queries
Flattened: PT(L, S, virtual) = PT(L, S, normal)
PT(FP, relation, O) = PT(FO, relation, O)
[Figure: containment hierarchy over PT(FO, rel, O), PT(CQ, rel, O), PT(FP, tup, O), PT(FO, tup, O), PT(CQ, tup, O), PTnr(FO, tuple, O) and PTnr(CQ, tuple, O); one containment is marked “not strict if NLOGSPACE = PTIME”]
36
complexity classes and relational query languages
PT(FO, relation, O) captures PSPACE (ordered or unordered)
(a) the recognition problem can be decided by a PSPACE TM
(b) simulate partial fixpoint queries and define a total order
PT(FP, tuple, O) captures FP and thus PTIME (ordered)
PT(FO, tuple, O)
– captures TC0[FO] and thus NLOGSPACE (ordered)
– TC0[FO] (unordered)
Simulate transitive closure logic and vice versa
PT(CQ, relation, O) contains deterministic datalog
PT(CQ, tuple, O) captures linear datalog
datalog: p(x) ← p1(x1), …, pk(xk)
– deterministic: each p(x) has only one rule
– linear: at most one pj is an IDB
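As a concrete instance of a linear datalog program, the prerequisite closure used throughout the talk can be written as reach(x, y) ← prereq(x, y) and reach(x, z) ← reach(x, y), prereq(y, z); each body has at most one IDB atom. The Python sketch below evaluates it by semi-naive iteration. The function name `all_prereqs` and the sample tuples are assumptions, and the program is linear but not deterministic in the sense above, since reach has two rules.

```python
def all_prereqs(prereq):
    """Least fixpoint of the linear program
         reach(x, y) <- prereq(x, y)
         reach(x, z) <- reach(x, y), prereq(y, z)
    computed by semi-naive iteration over the newly derived facts."""
    reach = set(prereq)
    delta = set(prereq)
    while delta:
        new = {(x, z) for (x, y) in delta
               for (y2, z) in prereq if y == y2} - reach
        reach |= new
        delta = new
    return reach

prereq = {("CS650", "CS450"), ("CS450", "CS320")}  # made-up sample tuples
closure = all_prereqs(prereq)
```

Each round joins only the new facts with the EDB relation prereq, which is the kind of step a non-recursive query embedded in a recursive transducer performs at each level of the tree.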
37
non-recursive classes as relational query languages
PTnr(FO, tuple, O) captures FO (ordered or unordered)
PTnr(CQ, tuple, O) captures UCQ (ordered or unordered)
Simulate unions of conjunctive queries and vice versa
Those corresponding to existing XML publishing languages:
– PTnr(FO, tuple, O): SQL/XML, FOR-XML (Microsoft), IBM DAD (SQL), …
– PTnr(CQ, tuple, O): XSD (Microsoft), TreeQL
38
Expressiveness as relational queries
Fragment               Complexity / language
PT(FP, relation, O)    PSPACE
PT(FO, relation, O)    PSPACE
PT(FP, tuple, O)       FP, PTIME (ordered databases)
PT(FO, tuple, O)       TC0[FO], NLOGSPACE (ordered databases)
PT(CQ, relation, O)    deterministic datalog
PT(CQ, tuple, O)       TC0[CQ], linear datalog
PTnr(FO, tuple, O)     FO
PTnr(CQ, tuple, O)     UCQ
PT(L, S, virtual) = PT(L, S, normal)
40
Incremental publishing
Input:
– a publishing transducer τ for relational schema R
– an instance DB of R
– XML view T = τ(DB)
– relational updates ΔDB
Output: XML updates ΔT such that T + ΔT = τ(DB + ΔDB)
Commercial products: limited support
[Figure: middleware between the published XML and the DBMS; updates ΔDB on the RDB are propagated as incremental updates ΔT to the XML view]
41
Why incremental update?
[Figure: the source database DB is published as a cached XML view T; updates ΔDB are propagated to T as an incremental update ΔT]
Batch computation: recompute the entire XML tree from scratch; large XML views may take several hours to produce!
Incremental computation: compute the XML change ΔT
– Idea: the new view T’ = the old view T + ΔT
– Typically more efficient to compute ΔT (small) and update the old view T with ΔT
– Why? The new view T’ often differs only slightly from the old view T
– Reuse partial results computed earlier
42
Reduction Approach
Most XML middleware takes a “reduction approach”:
– treat the relational database management system (DBMS) as a black box,
– re-use as much functionality of the DBMS as possible
Why not the reduction approach for incremental updates?
– XML views are recursive
– Few systems support WITH…RECURSIVE (linear recursion)
– Fewer support its use in views
– None supports incremental update of recursive views (many algorithms are known for incremental updates of recursive views, but unfortunately they are not used in practice)
The lowest common denominator of DBMS functionality: no need for (recursive) view-update support
43
Sub-Tree Property
[Figure: a report tree of patient elements; each patient has SSN, name (“Bush”, “Cheney”), policy# (“123”, “234”) and treatment subelements, and a treatment has tname (“insane”) and inTreatment children that nest further treatments]
Sub-tree Property: given a transducer τ and a DB, each sub-tree is uniquely determined by (q, a, Rega), e.g., (q, treatment, Regtr)
44
Storing and updating XML – a DAG representation
Storing each XML sub-tree only once, at any level of granularity
– Associate an ID with each node in the tree (Skolem function): a small, unique value derived from the node’s register
– A hash table H maps (q, type, ID) to a node in the graph
– Sub-tree pool: each node has a reference count and a children list [(q1, type1, ID1), (q2, type2, ID2), …]
XML update ΔT = (E+, E-), sets of edges ((q1, type1, ID1), (q2, type2, ID2))
– E-: remove (q2, type2, ID2) from the child list of (q1, type1, ID1) and decrement the reference count on (q2, type2, ID2)
– E+: insert (q2, type2, ID2) in the child list of (q1, type1, ID1) and increment the reference count on (q2, type2, ID2)
– Nodes with 0 reference counts move to the sub-tree pool, to be reused
Example entries of H:
(treatment, “t123”) → [(tname, “chemo”), (inTreatment, “iT23”)]
(inTreatment, “iT234”) → [(treatment, “t345”), (treatment, “t567”), … ]
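The DAG store and the edge updates (E+, E-) described above could look roughly as follows. This is a hypothetical sketch rather than the system's actual data structure: the class name `SubtreePool`, its methods and the sample keys are assumptions, and node contents other than the child list are omitted.

```python
class SubtreePool:
    """DAG store: each sub-tree is kept once, keyed by (q, type, ID)."""

    def __init__(self):
        self.children = {}   # hash table H: node key -> list of child keys
        self.refcount = {}   # node key -> number of parent edges pointing at it

    def add_node(self, key):
        self.children.setdefault(key, [])
        self.refcount.setdefault(key, 0)

    def apply(self, e_plus, e_minus):
        """Apply an XML update (E+, E-), given as lists of (parent, child) edges."""
        for parent, child in e_minus:        # cuts
            self.children[parent].remove(child)
            self.refcount[child] -= 1
        for parent, child in e_plus:         # buds or cross edges
            self.add_node(parent)
            self.add_node(child)
            self.children[parent].append(child)
            self.refcount[child] += 1

    def reusable(self, root):
        """Unreferenced nodes other than the root: kept in the pool for reuse."""
        return {k for k, n in self.refcount.items() if n == 0 and k != root}

pool = SubtreePool()
root = ("q", "report", "r")
p1, p2 = ("q", "patient", "p1"), ("q", "patient", "p2")
t = ("q", "treatment", "t123")
pool.apply([(root, p1), (root, p2), (p1, t), (p2, t)], [])  # t is shared by p1 and p2
shared = pool.refcount[t]
pool.apply([], [(p1, t), (p2, t)])                          # cut both edges to t
```

After both cuts the shared treatment sub-tree has reference count 0 and stays in the pool, so a later bud with the same key can re-attach it without recomputation.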
45
Computing XML changes
Computing XML changes ΔT from database changes ΔDB by incrementalizing the SQL queries in a transducer:
select IP, P.tname2
from Procedure P, inTreatment IP
where P.tname1 = IP
Cuts (deletions): given ΔDB, deletions of the existing edges of T are determined by executing a fixed number of non-recursive SQL queries; no recursion is involved (sub-tree property)
Buds (new sub-tree generation): top-down iteration, evaluating non-recursive SQL queries at each step
– Each new sub-tree is computed at most once, by sub-tree reuse (sub-tree pool), minimizing recomputation
– Partial results are “complete” up to a certain level at each step, allowing lazy evaluation and parallel processing
46
Steps to Bud-Cut
1. For a set of database changes ΔDB, execute a fixed number of non-recursive queries that determine the direct edge changes E-, E+
– E- are cuts
– E+ are buds (or cross edges)
2. Generate the sub-trees under the buds, re-using as many existing and deleted sub-trees as possible
3. Collect garbage.
[Figure: the report tree of patient elements (“Bush”, policy# “123”; “Cheney”, policy# “234”) with treatment and inTreatment sub-trees; cut edges are crossed out and a new treatment sub-tree is attached as a bud]
47
The XML view update problem
Input:
– a publishing transducer τ for relational schema R
– an instance DB of R
– XML view T = τ(DB)
– XML updates ΔT
Output: relational updates ΔDB such that T + ΔT = τ(DB + ΔDB)
Commercial systems: limited support, already hard for relational views
[Figure: middleware between the published XML and the DBMS; view updates ΔT on the XML view are translated into updates ΔDB on the RDB]
48
New challenges introduced by XML view updates
Revising the semantics of side effects
ΔT: delete course[cno='CS650']//course[cno='CS450']/prereq/*
Subtree property: remove the prerequisites of all CS450 occurrences?
DTD validation (if any)
Recursively defined
– XML views
– XML updates
[Figure: db tree in which course CS450 occurs both below CS650 and elsewhere; the prereq sub-tree of the occurrence below CS650 is crossed out, and the other occurrence is marked with a question mark]
49
Processing XML view updates
Deriving relational views V from XML views (edge relations of the DAG, in external storage)
1. DTD validation: reject ΔT if it violates the DTD
2. Computing view updates ΔV from ΔT
3. Computing updates ΔDB from ΔV
– May not exist: reject ΔT if not
4. Update the underlying DB with ΔDB and the view V with ΔV
Main challenges: relational view updates
– Hard: deciding view updatability is intractable/undecidable
– Open: complexity, algorithm, commercial system support
[Figure: XML updates ΔT are mapped to updates ΔV on the relational views V, and then to updates ΔDB on the DB]
51
XML integration: complexity and expressiveness
[Figure: multiple, distributed source DBs are integrated into one XML view that conforms to a DTD and constraints]
XML integration transducers
Two-way vs. top-down: context-dependent generation
Integrity constraints: conformance to XML schema
Information preservation: data migration
XML integration language: Attribute Integration Grammar (AIG)
52
XML shredding
Storing XML data in relations: storage, query processing, RDBMS transaction control, …
Primary goal:
– store part of or entire XML documents, content based
– increment existing relations, rather than build a new one
– directed by recursive XML schema
[Figure: middleware between XML and the DBMS: shredding from XML into the RDB, and query translation with query answers returned as XML]
53
XML shredding automata
Shredding automata vs. publishing transducers
– take an existing tree as input, rather than relations
– embedded XML queries, not relational, to compute Reg
– output: union of relation registers, the tuples to insert
– combining XML SAX parsing and shredding, e.g., XML2DB
Primary goal: expressive power and complexity
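To make the direction of shredding concrete, the sketch below walks an existing XML tree and collects the tuples to insert into course(cno, title, type) and prereq(cno1, cno2). It is a plain traversal with `xml.etree.ElementTree`, not a shredding automaton; the function name `shred`, the sample document and the assumption that prereq lists direct prerequisites as cno children are all hypothetical.

```python
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Collect tuples for course(cno, title, type) and prereq(cno1, cno2)."""
    courses, prereqs = set(), set()
    for course in ET.fromstring(xml_text).iter("course"):
        cno = course.findtext("cno")
        courses.add((cno, course.findtext("title"), course.findtext("type")))
        # Assumed shape: <prereq> lists the direct prerequisites as <cno> children.
        for p in course.findall("prereq/cno"):
            prereqs.add((cno, p.text))
    return courses, prereqs

doc = ("<db><course><cno>CS650</cno><title>Web DB</title><type>regular</type>"
       "<prereq><cno>CS450</cno></prereq></course></db>")
courses, prereqs = shred(doc)
```

The two sets play the role of the relation registers whose union forms the output: tuples to be added to existing relations rather than a freshly built schema.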
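[Figure: a shredding automaton runs top-down over the XML tree (db, course, cno, title, type, prereq) with embedded XML queries Q, Q1, Q2; the registers Reg of selected nodes are collected and inserted into the RDB]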
54
Summary
XML publishing: a synergy between theory and practice
– characterization of XML publishing languages in practice;
– expressive power and matching complexity bounds.
Helpful guidance for both users and database vendors
Dynamic aspects: incremental publishing and view updates.
Important, yet by and large overlooked
Open research issues:
– XML integration transducers
– XML shredding automata
– . . .
An attempt to bridge theory and practice