1
From XML Schema to Relations: A Cost-Based Approach to XML Storage
Presented by Xinwan Bian and Danyu Wu
02-21-02
2
Introduction
Where can an XML document be stored?
• XML database
• Object-oriented database
• Object-relational database
• Relational database
3
Difficulties of Saving XML Documents into a Relational Database
XML has a more complex tree structure than flat relational tables
XML contains richer data types
Integration with legacy tables
4
Different Approaches to schema mappings
Fixed XML-to-relational mappings
Commercial RDBMS utility tools
Bell Laboratories cost-based approach
5
LegoDB, an XML storage mapping system
Three design principles
• Cost-based search
• Logical/physical independence
• Reuse of existing technology
6
The Basic Approach of LegoDB
Create a p-schema for the input XML Schema
Obtain cost estimates from the data statistics and the XQuery workload
Explore alternative storage configurations and arrive at an optimal mapping
7
Architecture of the Mapping Engine
[Figure: mapping engine architecture. Components: Generate Physical Schema, Physical Schema Transformation, Query/Schema Translation, Query Optimizer. Inputs: XML Schema, XML data statistics, XQuery workload. Intermediate results: PS0, PSi, RSi, cost(SQi). Output: optimal configuration.]
PSi: physical schema
RSi: relational schema/queries/statistics
8
Questions
Its Advantages?
Its Disadvantages?
9
Example of P-Schema Creation
(a) Initial XML schema
    type Show =
      show [ @type[ String ],
             title[ String ],
             year[ Integer ],
             reviews[ String ]*,
             … ]

(b) P-schema
    type Show =
      show [ @type[ String ],
             title[ String ],
             year[ Integer ],
             Reviews*,
             … ]
    type Reviews =
      reviews[ String ]

(c) Relational tables
    TABLE Show
      ( show_id INT,
        type STRING,
        year INT )
    TABLE Review
      ( Review_id,
        review String,
        parent_show INT )
10
What’s P-Schema?
Physical schemas (p-schemas) extend XML Schemas in two significant ways:
• They contain data statistics
• They can be easily mapped into relational tables
11
Example of P-Schema with statistics
type Show =
  show [ @type[ String<#8,#2> ],
         year[ Integer<#4,#1800,#2100,#300> ],
         title[ String<#50,#34798> ],
         Review*<#10> ]
type Review = review[ String<#800> ]

Notation:
  Scalar<#size, #min, #max, #distinct>
  String<#size, #distinct>
  *<#count>
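The annotation notation above can be read mechanically. The following is a minimal sketch, not part of LegoDB: `ScalarStats` and `parse_scalar` are hypothetical names, and the sketch assumes the three forms seen on this slide (four values for size/min/max/distinct, two for size/distinct, one for size only).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalarStats:
    """Statistics attached to one scalar in a p-schema (hypothetical container)."""
    base: str
    size: int
    minimum: Optional[int] = None
    maximum: Optional[int] = None
    distinct: Optional[int] = None


# Matches e.g. "Integer<#4,#1800,#2100,#300>": a type name, then a <...> list.
_ANNOTATION = re.compile(r"^\s*(\w+)\s*<\s*([^>]*)>\s*$")


def parse_scalar(text: str) -> ScalarStats:
    """Parse a scalar annotation such as 'String<#50,#34798>'."""
    match = _ANNOTATION.match(text)
    if match is None:
        raise ValueError(f"not a statistics annotation: {text!r}")
    base = match.group(1)
    values = [int(tok.strip().lstrip("#")) for tok in match.group(2).split(",")]
    if len(values) == 4:      # <#size, #min, #max, #distinct>
        return ScalarStats(base, values[0], values[1], values[2], values[3])
    if len(values) == 2:      # <#size, #distinct>
        return ScalarStats(base, values[0], distinct=values[1])
    if len(values) == 1:      # <#size>
        return ScalarStats(base, values[0])
    raise ValueError(f"unexpected number of statistics in {text!r}")


year = parse_scalar("Integer<#4,#1800,#2100,#300>")
title = parse_scalar("String<#50,#34798>")
```

Here `year.minimum` and `year.maximum` carry the value range a cost estimator could use for selectivity, while `title.distinct` carries the distinct-value count.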
12
Stratified Physical Types
Scalar type      s  ::= Integer | String | Boolean
Physical scalar  ps ::= s<#size,#min,#max,#distincts>
Named type       nt ::= X                  (type name)
                      | nt | nt            (choice)
                      | ()                 (empty)
                      | nt{n,m}<#count>    (repetition)
Optional type    ot ::= nt                 (named type)
                      | s                  (optional scalar)
                      | L[ot]              (optional element)
                      | ot, ot             (optional sequence)
                      | ()                 (empty)
Physical type    pt ::= nt                 (named type)
                      | ot{0,1}            (optional type)
                      | s                  (scalar)
                      | L[pt]              (element)
                      | pt, pt             (sequence)
                      | ()                 (empty)
Schema item      si ::= type X = pt        (type declaration)
Schema              ::= schema Sn = si, si, … end    (schema)
13
Mapping of p-schema to relations
• Create one relation RT for each type name T
• For each RT, create a key that stores the node id
• For each RT, create a foreign key to all relations RPT such that PT is a parent type of T
• A column is created in RT for each sub-element of T that is a physical type
• If the data type is contained within an optional type, then the corresponding column can be null
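The mapping rules on this slide can be sketched in a few lines. This is an illustration under assumptions, not LegoDB's implementation: `Column`, `TypeDef`, and `to_ddl` are hypothetical names, the `VARCHAR` widths are simply taken from the size statistics shown earlier, and the generated SQL is generic rather than targeted at a specific RDBMS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    sql_type: str
    nullable: bool = False   # True when the element sits inside an optional type


@dataclass
class TypeDef:
    name: str                                            # type name T
    columns: List[Column] = field(default_factory=list)  # scalar sub-elements of T
    parents: List[str] = field(default_factory=list)     # parent type names PT


def to_ddl(t: TypeDef) -> str:
    """Emit one CREATE TABLE statement for type T, following the slide's rules."""
    lines = [f"{t.name}_id INT PRIMARY KEY"]             # key that stores the node id
    for c in t.columns:                                  # one column per sub-element
        suffix = "" if c.nullable else " NOT NULL"
        lines.append(f"{c.name} {c.sql_type}{suffix}")
    for p in t.parents:                                  # foreign key to each parent
        lines.append(f"parent_{p} INT REFERENCES {p}({p}_id)")
    return f"CREATE TABLE {t.name} (\n  " + ",\n  ".join(lines) + "\n);"


show = TypeDef("Show", [Column("type", "VARCHAR(8)"),
                        Column("title", "VARCHAR(50)"),
                        Column("year", "INT")])
review = TypeDef("Review", [Column("review", "VARCHAR(800)")], parents=["Show"])

if __name__ == "__main__":
    print(to_ddl(show))
    print(to_ddl(review))
```

Parent tables must be created before their children for the foreign keys to resolve; the sketch leaves that ordering to the caller.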
14
More details of P-Schema to relational mappings
15
Schema Transformations
Advantages of transformations at the XML Schema level
• Much of the XML Schema semantics is not present in a given relational schema
• Rewriting is more natural at the XML level
• The framework is more easily extensible to other non-relational stores
16
Inlining/Outlining Transformation
One can either associate a type name with a given nested element (outlining) or nest its definition directly within its parent element (inlining).

Outlined:
    type TV = seasons[ Integer ], Description, Episode*
    type Description = description[ String ]
=>
Inlined:
    type TV = seasons[ Integer ], description[ String ], Episode*
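The TV/Description example can be replayed on a toy representation. This is a minimal sketch under assumptions: a schema is a dictionary from type name to a list of item strings, and `inline` and `outline` are hypothetical helper names. It only handles plain references (no repetition or optionality on the inlined type).

```python
from typing import Dict, List

Schema = Dict[str, List[str]]   # type name -> sequence of items (toy representation)


def inline(schema: Schema, parent: str, child: str) -> Schema:
    """Replace the reference to `child` inside `parent` with child's definition."""
    if child not in schema[parent]:
        raise ValueError(f"{parent} does not reference {child} directly")
    result = {name: list(items) for name, items in schema.items()}
    expanded: List[str] = []
    for item in schema[parent]:
        expanded.extend(schema[child] if item == child else [item])
    result[parent] = expanded
    # Drop the child's definition only if no type refers to it any more.
    still_used = any(item.rstrip("*") == child
                     for items in result.values() for item in items)
    if not still_used:
        del result[child]
    return result


def outline(schema: Schema, parent: str, item: str, new_name: str) -> Schema:
    """Give `item` inside `parent` its own type name (the inverse of inline)."""
    if item not in schema[parent] or new_name in schema:
        raise ValueError("item not found in parent, or type name already taken")
    result = {name: list(items) for name, items in schema.items()}
    result[parent] = [new_name if it == item else it for it in schema[parent]]
    result[new_name] = [item]
    return result


outlined = {
    "TV": ["seasons[Integer]", "Description", "Episode*"],
    "Description": ["description[String]"],
}
inlined = inline(outlined, "TV", "Description")
```

Applying `outline(inlined, "TV", "description[String]", "Description")` yields the outlined schema again, which is why a search can move in either direction between the two configurations.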
17
Union Factorization/Distribution Transformation
The first law: (a,(b|c)) == (a,b | a,c)

    type Show =
      show [ @type[String], title[String], year[Integer],
             Aka{1,10}, Review*, (Movie | TV) ]
    type Movie =
      box_office[Integer], video_sales[Integer]
    type TV =
      seasons[Integer], description[String], Episode*
=>
    type Show =
      show [ (@type[String], title[String], year[Integer],
              Aka{1,10}, Review*,
              box_office[Integer], video_sales[Integer])
           | (@type[String], title[String], year[Integer],
              Aka{1,10}, Review*,
              seasons[Integer], description[String], Episode*) ]
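The first law can be checked on a small model. The sketch below is an illustration only: a sequence is a tuple of items, `Choice` is a hypothetical wrapper for a union of alternative sequences, and `distribute` expands every union in a sequence into the list of union-free alternatives.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple, Union

Seq = Tuple[str, ...]


@dataclass(frozen=True)
class Choice:
    """A union (b | c | ...) whose alternatives are themselves sequences."""
    alternatives: Tuple[Seq, ...]


def distribute(seq: Tuple[Union[str, Choice], ...]) -> List[Seq]:
    """Apply (a,(b|c)) == (a,b | a,c): push the sequence through every union."""
    # Each position offers one or more sub-sequences to pick from.
    options = [item.alternatives if isinstance(item, Choice) else ((item,),)
               for item in seq]
    # Pick one sub-sequence per position and concatenate them.
    return [sum(combo, ()) for combo in product(*options)]


movie: Seq = ("box_office[Integer]", "video_sales[Integer]")
tv: Seq = ("seasons[Integer]", "description[String]", "Episode*")
show = ("@type[String]", "title[String]", "year[Integer]",
        "Aka{1,10}", "Review*", Choice((movie, tv)))

alternatives = distribute(show)
```

For the Show type this produces two alternatives, one ending in the Movie fields and one ending in the TV fields, matching the right-hand side of the slide.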
18
Corresponding relational configuration
    TABLE TV
      ( TV_id INT,
        seasons String,
        parent_Show )
    TABLE Description
      ( Description_id INT,
        description String,
        parent_TV )
=>
    TABLE TV
      ( TV_id INT,
        seasons String,
        description String,
        parent_Show )
19
Union Factorization/Distribution continues
The second law: a[t1|t2] == a[t1] | a[t2]

    type Show =
      show [ (@type[String], title[String], year[Integer],
              Aka{1,10}, Review*,
              box_office[Integer], video_sales[Integer])
           | (@type[String], title[String], year[Integer],
              Aka{1,10}, Review*,
              seasons[Integer], description[String], Episode*) ]
=>
    type Show = (Show_Part1 | Show_Part2)
    type Show_Part1 =
      show [ @type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             box_office[Integer], video_sales[Integer] ]
    type Show_Part2 =
      show [ @type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             seasons[Integer], description[String], Episode* ]
20
Corresponding relational configurations
    TABLE Show
      ( Show_id INT, type String, title String, year INT )
    TABLE Movie
      ( Movie_id INT, box_office INT, video_sales INT, parent_show INT )
    TABLE TV
      ( TV_id INT, seasons INT, description String, parent_show INT )
=>
    TABLE Show_Part1
      ( Show_Part1_id INT, type String, title String, year INT,
        box_office INT, video_sales INT )
    TABLE Show_Part2
      ( Show_Part2_id INT, type String, title String, year INT,
        seasons INT, description String )
21
Wildcard rewritings
'~': any element name can be used
'~!a': any name but "a" can be used

    type Review = review[ ~[ String ]* ]
=>
    type Review = review[ (NYTReview | OtherReview)* ]
    type NYTReview = nyt[ String ]
    type OtherReview = (~!nyt)[ String ]
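A short sketch of how the two wildcard forms partition element names. `matches` and `review_type` are hypothetical helpers written for this illustration; they only model name matching, not the schema rewriting itself.

```python
def matches(pattern: str, name: str) -> bool:
    """Match an element name against '~', '~!a', or a literal name."""
    if pattern == "~":                 # any element name
        return True
    if pattern.startswith("~!"):       # any name except the one after '~!'
        return name != pattern[2:]
    return name == pattern             # literal element name


def review_type(element_name: str) -> str:
    """Decide which of the two rewritten types a review child falls under."""
    return "NYTReview" if matches("nyt", element_name) else "OtherReview"


if __name__ == "__main__":
    for name in ("nyt", "suntimes"):   # "suntimes" is a made-up element name
        print(name, "->", review_type(name))
```

Every name matched by `~` is matched by exactly one of `nyt` and `~!nyt`, so the rewriting preserves the set of valid documents while letting NYT reviews be stored separately.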
22
XQuery Queries Examples
Q1: FOR $v IN imdb/show
    WHERE $v/year = 1999
    RETURN ($v/title, $v/year, $v/nyt_reviews)
Q2: FOR $v IN imdb/show
    RETURN $v
Q3: FOR $v IN imdb/show
    WHERE $v/title = c3
    RETURN $v/description
Q4: FOR $v IN imdb/show
    RETURN <result> { $v/title, $v/year,
      (FOR $e IN $v/episode
       WHERE $e/guest_director = c4
       RETURN $e) } </result>
23
XQuery Workload Examples
Publish = { Q1 : 0.4, Q2: 0.4, Q3: 0.1, Q4: 0.1}
Lookup = {Q1: 0.1, Q2: 0.1, Q3:0.4, Q4: 0.4}
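One plausible reading of these weights is that the cost of a configuration under a workload is the weighted sum of its per-query costs. The sketch below assumes exactly that; `workload_cost` is a hypothetical helper, and the per-query costs are made-up placeholder numbers, not measurements from the paper.

```python
from typing import Dict

PUBLISH = {"Q1": 0.4, "Q2": 0.4, "Q3": 0.1, "Q4": 0.1}
LOOKUP = {"Q1": 0.1, "Q2": 0.1, "Q3": 0.4, "Q4": 0.4}


def workload_cost(workload: Dict[str, float], query_cost: Dict[str, float]) -> float:
    """Weighted sum of per-query costs for one storage configuration."""
    return sum(weight * query_cost[q] for q, weight in workload.items())


# Placeholder optimizer estimates for a single configuration (illustrative only).
example_costs = {"Q1": 10.0, "Q2": 100.0, "Q3": 5.0, "Q4": 20.0}

publish_cost = workload_cost(PUBLISH, example_costs)
lookup_cost = workload_cost(LOOKUP, example_costs)
```

The same configuration scores differently under the two workloads because the expensive query Q2 carries weight 0.4 in Publish but only 0.1 in Lookup, which is why the best configuration depends on the workload.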
24
Search Algorithm
Procedure GreedySearch
Input:  xSchema: XML Schema, xWkld: query workload, xStats: data statistics
Output: pSchema: an efficient physical schema
begin
  minCost = infinity
  pSchema = GetInitialPhysicalSchema(xSchema)
  cost = GetPSchemaCost(pSchema, xWkld, xStats)
  while (cost < minCost) do
    minCost = cost
    pSchemaList = ApplyTransformations(pSchema)
    for each pSchema' ∈ pSchemaList do
      cost' = GetPSchemaCost(pSchema', xWkld, xStats)
      if cost' < cost then
        cost = cost'; pSchema = pSchema'
      endif
    endfor
  endwhile
  return pSchema
end.
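The procedure translates almost line for line into runnable code. In this sketch `greedy_search` takes the transformation step and the cost function as parameters (standing in for ApplyTransformations and GetPSchemaCost); the toy schema, the element names, and the cost numbers are assumptions made up for the demonstration.

```python
import math
from typing import Callable, FrozenSet, Iterable, List, TypeVar

S = TypeVar("S")


def greedy_search(initial: S,
                  neighbors: Callable[[S], Iterable[S]],
                  cost_of: Callable[[S], float]) -> S:
    """Repeatedly move to the cheapest one-step transformation until none improves."""
    best = initial
    cost = cost_of(best)
    min_cost = math.inf
    while cost < min_cost:
        min_cost = cost
        candidates = list(neighbors(best))   # computed before `best` is updated
        for candidate in candidates:
            candidate_cost = cost_of(candidate)
            if candidate_cost < cost:
                cost, best = candidate_cost, candidate
    return best


# Toy setting: a schema is the set of elements currently inlined (greedy-so style).
ELEMENTS = ("description", "aka", "review")
DELTA = {"description": -3.0, "aka": -1.0, "review": +4.0}   # made-up cost effects


def toy_neighbors(inlined: FrozenSet[str]) -> List[FrozenSet[str]]:
    """All schemas reachable by inlining one more element."""
    return [inlined | {e} for e in ELEMENTS if e not in inlined]


def toy_cost(inlined: FrozenSet[str]) -> float:
    """Made-up workload cost: a base of 10 plus the effect of each inlined element."""
    return 10.0 + sum(DELTA[e] for e in inlined)


result = greedy_search(frozenset(), toy_neighbors, toy_cost)
```

In the toy run the search inlines `description`, then `aka`, and stops because inlining `review` would raise the cost. As with any greedy search, the result is a local minimum and is not guaranteed to be globally optimal.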
25
Experimental Settings
Two variations of the greedy search: greedy-so and greedy-si.
Greedy-so:
• Initial physical schema: all elements outlined (except base types)
• During search: inlining transformations applied
Greedy-si:
• Initial physical schema: all elements inlined (except elements with multiple occurrences)
• During search: outlining transformations applied
26
Efficiency of Greedy Search
5 lookup queries and 3 publish queries
27
Results
For lookup: Greedy-so converges to the final configuration a lot faster.
For publish: the opposite holds; greedy-si converges faster.
28
Reasons:
The traversals made by lookup queries are localized. The final configuration has only a few inlined elements. Greedy-so can reach this configuration earlier than greedy-si.
The publish queries traverse a larger number of elements. The final configuration has several inlined elements. Greedy-si can reach this configuration earlier than greedy-so.
29
Sensitivity of configurations to varied workloads
Create a spectrum of workloads that combine the lookup queries and the publish queries in the ratio k : (1-k), where k ∈ [0,1] is the fraction of lookup queries in the particular workload.
Three workloads corresponding to k = 0.25, 0.50, and 0.75, resulting in three configurations.
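A small sketch of how such a spectrum could be built. The slides do not say how weight is divided among the individual queries, so this sketch assumes each group's share is split evenly across its queries; `mixed_workload` and the query labels L1..L5 and P1..P3 are hypothetical.

```python
from typing import Dict, Sequence


def mixed_workload(lookup: Sequence[str], publish: Sequence[str], k: float) -> Dict[str, float]:
    """Weights for a workload mixing lookup and publish queries in ratio k : (1-k).

    Assumes both groups are non-empty and each group's share is split evenly.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    weights = {q: k / len(lookup) for q in lookup}
    weights.update({q: (1.0 - k) / len(publish) for q in publish})
    return weights


lookup_queries = ["L1", "L2", "L3", "L4", "L5"]   # 5 lookup queries
publish_queries = ["P1", "P2", "P3"]              # 3 publish queries

spectrum = {k: mixed_workload(lookup_queries, publish_queries, k)
            for k in (0.25, 0.50, 0.75)}
```

Each entry of `spectrum` is a workload whose weights sum to 1, so it can be passed to the same cost function used for the pure lookup and publish workloads.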
30
Figure 11: Sensitivity to variations in the workload
31
Inlining is a bad idea for some queries
(a) The query does limited, localized traversals and/or does not access all the attributes involved.
(b) The query has highly selective selection predicates.
(c) The query involves a join of attributes that are not structurally adjacent in the XML Schema (e.g. actor and director).
32
Effectiveness of XML transformations: Union Distribution
33
Results of the union-transformed configuration
Overlap between the curves for C[0.25] and C[0.75] with OPT.
C[0.25] and C[0.75] cross at a small angle.
C[All-inlined] performed 2 to 5 times worse than optimal.
34
Wildcards
Find the NYTimes reviews for shows produced in 1999:
35
Questions
The optimal mapping in this paper is cost-based. What else needs to be considered?
36
References
P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML schema to relations: A cost-based approach to XML storage. Technical report, Bell Laboratories, 2001. Full version.
A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with STORED. In Proc. of SIGMOD, pp. 431-442, 1999.
D. Florescu and D. Kossmann. A performance evaluation of alternative mapping schemas for storing XML in a relational database. Technical Report 3680, INRIA, 1999.
M. Klettke and H. Meyer. XML and object-relational database systems: enhancing structural mappings based on statistics. In Proc. of WebDB, pp. 63-68, 2000.
A. Schmidt, M. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In Proc. of WebDB, pp. 47-52, 2000.
J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proc. of VLDB, pp. 302-314, 1999.