Download - 1 From XML Schema to Relations: A Cost- Based Approach to XML Storage Presented by Xinwan Bian and Danyu Wu 02-21-02.

1

From XML Schema to Relations: A Cost-Based Approach to XML Storage

Presented by Xinwan Bian and Danyu Wu

02-21-02

2

Introductions

Where to save XML document? . XML database . Object-Oriented database . Object-Relational database . Relational database

3

Difficulties of Saving XML document into Relational Database

XML has more complex tree structure than flat relational tables

XML contains richer data types The integration with legacy tables

4

Different Approaches to schema mappings

Fixed XML-to-relational mappings

Commercial RDBMS utility tools

Bell Laboratories cost-based approach

5

LegoDB, an XML storage mapping system

Three design principles . Cost-based search . Logical/physical independence . Reuse of existing technology

6

The Basic Approach of LegoDB

Create a p-schema for input XML schema

Obtain cost estimates with input of data statistics and XQuery workload

Exploit alternative storage configurations and achieve an optimal mapping

7

Architecture of the Mapping Engine

GeneratePhysical Schema

Physical Schema Transformation

Query/Schema Translation

Query Optimizer

XML data statisticsXML Schema

PS0 PSi RSi

Optimal Configuration XQuery workload

cost(SQi)

Rsi: Relational Schema/Queries/StatsPsi: Physical Schema

8

Questions

Its Advantages?

Its Disadvantages?

9

Example of P-Schema Creation

type Show= type Show= TABLE Show show [@type [String], show [@type[String], ( show_id

INT, title [String], title [String] type STRING, year [Integer], year [Integer], year INT ) reviews [String]*, Reviews*, TABLE Review …] type Reviews = ( Review_id, reviews[String] review

String, parent_show

INT)(a) Initial XML schema (b) P-Schema © Relational

table

10

What’s P-Schema?

Physical schemas (p-schemas) is an extension of XML schemas in two significant ways:

. They contain data statistics . They can be easily mapped into

relational tables

11

Example of P-Schema with statistics

type Show = show [ @type[ String <#8,#2>],

year[ Integer<#4,#1800,#2100,#300>], title[ String<#50,#34798>], Review*<#10> ] type Review = review[ String<#800> ] Scalar<#size, #min, #max, #distinct> String<#size,#distinct> *<#count>

12

Stratified Physical Types scalar type s ::= Integer | String | Boolean Physical type ps ::= ps<#size,#min,#max,#distincts> Named type nt::= X (type name) | nt | nt (choice) | (empty) | nt{n,m,#<}#count> repetition Optional type ot ::= nt (named type) | s (optional scalar) |

L[ot] (optional element) | ot, ot (optional sequence) | () (empty) Physical type pt ::= nt (named type) | ot{0,1} (optional type) |

s (scalar) | L[pt] (element) | pt, pt (sequence) | () empty Schema item si ::= type X = pt (type declaration) Schema ::= schema Sn = si, si, … end (schema)

13

Mapping of p-schema to relations Create one relation RT for each type name T For each RT, create a key that stores node id For each RT, create a foreign key to all

relations RPT such that PT is a parent type of T A column is created in RT for each sub-element

of T that is a physical type If the data type is contained within an optional

type then the corresponding column can be null

14

More details of P_Schema to relational mappings

15

Schema Transformations

Advantages of transformations at XML Schema level

. Much of the XML schema semantics not present in

a given relational schema. . More natural rewriting at the XML level . The framework is more easily extensible to

other non-relational stores

16

Inlining/Outlining Transformation

One can either associate a type name to a given nested element (outlining) or next its definition directly within its parent element (inlining).

type TV= seasons [Integer] type TV = Description, seasons[Integer], Episode* => description[String], Episode* type Description = description [String]

17

Union Factorization/Distribution Transformation The first law ((a,(b|c)) == (a,b|a,c)

type Show = type Show = show [@type[String], title[String] show [(@type[String], title[String], year [Integer], title[String], year[Integer], Aka{1,10}, Review*, {Movie|TV}] Aka{1,10}, Review*, box_office[Integer],type Movie = => video_sales[Integer])

box_office[Integer] | (@type[String], title[String],

video_sales[Integer] year[Integer], Aka{1,10} Review*, seasons[Integer],Type TV = seasons[Integer],

description[String],Episode*)] description[String], Episode*

18

Corresponding relational configuration

TABLE TV ( TV_id INT, seasons String, TABLE TV ( parent_show ) TV_id INT, => seasons String,TABLE Description description String, ( Description_id INT parent_Show ) description String, parent_TV )

19

Union Factorization/Distribution continues

The Second law (a[t1|t2] == a[t1]|a[t2])

Type Show = show[(@type[String], type Show = (Show Part1 | Show Part2 ) title[String],year[Integer], type Show Part 1 = show [@type [String], Aka{1,10}, Review*, title [String], year[Integer], Aka{1,10}, box_office[Integer], Review*, box_office[Integer], video_sales[Integer]) video_sales[Integer] ]| (@type [String], => title [String], year [Integer], type Show Part2 = Aka{1,10}, Review*, show [@type [String], title[String], seasons [Integer], year [Integer], Aka{1,10}, description [String], Review*, seasons [Integer], Episode*) ] description [String], Episode* ]

20

Corresponding relational configurations

TABLE Show ( Show_id INT, TABLE Show_Part1 ( type String, title String, Show_Part1_id INT,

year INT) type String, title String, year INT, box_office INT,TABLE Movie ( video_sales INT) Movie_id INT, Box_Office INT, => video_sales INT, parent_show INT) TABLE Show_Part2 ( Show_Part2_id INT,TABLE TV ( type String, title String, TV_id INT, seasons INT, year INT, seasons INT, description string, parent_show INT) description String )

21

Wildcard rewritings

‘~’: any element names can be used ‘~!a’: any name but “a” can be used.

Type Review = type Reviews = review [~[ String ]*] review[ (NYTReview |

OtherReview)*]

=> type NYTReview = nyt[ String]

type OtherReview = (~!nyt) [String]

22

XQuery Queries Examples

Q1: FOR $v in imdb/show WHERE $v/year = 1999 RETURN ($v/title, $v/year, $v/nyt_reviews)

Q2: FOR $v in imdb/show RETURN $v Q3: FOR $v in imdb/show WHERE $v/title = c3 RETURN

$v/description Q4: FOR $v in imdb/show RETURN <result> { $v/title, $v/year, (FOR $e IN $v/episode WHERE

$e/guest_director = c4 RETURN $e) } </result>

23

XQuery Workload Examples

Publish = { Q1 : 0.4, Q2: 0.4, Q3: 0.1, Q4: 0.1}

Lookup = {Q1: 0.1, Q2: 0.1, Q3:0.4, Q4: 0.4}

24

Search Algorithm Procedure GreedySearchInput: xSchema: schema, xWkld: query workload, xStats:data statistics Output: pSchema: an efficient physical schema1 begin minCost = infinite large ; pSchema = GetInitialPhysicalSchema(xSchema) cost = GetPSchemaCost(pSchema, xWkld, xStats) while (cost < minCost) do 5 minCost = cost pSchemaList = ApplyTransformations(pSchema) for each pSchema’ € pSchemaList do cost’=GetPSchemaCost(pSchema’,xWkld,xStats) if cost’s < cost then cost = cost’; pSchema = pSchema’ endif10 endfor endwhile return pSchema end.

25

Experimental Settings

Two variations of the greedy search: greedy-so and greedy-si.

Greedy-so: Initial physical schema: all element outlined (except base

type). During search: Inlining transformations applied.

Greedy-si: Initial physical schema: all elements inlined (except

elements with multiple occurences) During search: Outlining transformations applied.

26

Efficiency of Greedy Search

5 lookup queries and 3 publish queries

27

Results

For lookup: Greedy-so converges to the final configuration a lot faster.

For publish: opposite.

28

Reasons:

The traversals made by lookup queries are localized. The final configuration has only a few inlined elements. Greedy-so can reach this configuration earlier than greedy-si.

The publish queries traverse larger number of elements. The final configuration has several inlined elements. Greedy-si can reach this configuration earlier than greedy- so.

29

Sensitivity of configurations to varied workloads

Create a spectrum of workloads that combined the lookup queries and publish queries in the ratio k : (1-k), where k€[0,1] is the fraction of lookup queries in the particular workload.

Three workloads corresponding to k = 0.25, 0.50, and 0.75, resulting three configurations.

30

Figure 11: Sensitivity to variations in the workload

31

Inlining as a bad idea to some queries

(a)The query does limited, localized traversals and/or does not access all the attributes involved.

(b)The query has highly selective selection predicates.

(c)The query involves join of attributes not structurally adjacent in the XML Schema (e.g. actor and director).

32

Effectiveness of XML transformations:Union Distribution

33

Results of the union-transformed configuration

Overlap between the curves for C[0.25] and C[0.75] with OPT.

C[0.25] and C[0.75] cross at a small angle.

C[All-inlined] performed 2~5 times worse than optimal.

34

Wildcards

Find the NYTimes reviews for shows produced in 1999:

35

Questions

The optimal mapping in this paper is cost-based. What else needs to be considered?

36

References P.Bohannon, J.Freire, P. Roy, and J. Sim’eon. From XML schema to

relations: A cost –based approach to XML storage. Technical report, Bell Laboratories, 2001. Full version.

A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with STORED. In Proc. Of SIGMOND, pp 431-442, 1999.

D. Florescu and D. Kossman. A performance evaluation of alternative mapping schemas for storing XML in a relational database. Technical Report 3680, INRIA, 1999

M. Klettke and H. Meyer. XML and object-relational database system – enhancing structural mappings based on statistics. In Proc. Of WebDB, pp63-68, 2000.

A. Schmidt, M. Kersten, M. Windhouwer, and F.Waas. Efficient relational storage and retrieval of XML documents. In Proc. Of WebDB, pp47-52, 2000.

J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and Opportunities. In Proc. Of VLDB, pp302-314, 1999.