
Management of XML and Semistructured Data Lecture 10: Schemas Monday, April 30, 2001.


Management of XML and Semistructured Data

Lecture 10: Schemas

Monday, April 30, 2001


Overview

• Schema Extraction for SS data

• Schemas for XML
– DTDs
– XML Schema


Review of Schemas so far

Upper bound schema S
• Tells us what labels are allowed
• Conformance test: D ≼ S (D is simulated by S)
• In practice: need deterministic schemas

Lower bound schema S
• Tells us what labels are required
• Conformance test: S ≼ D (S is simulated by D)
• Alternative formulation: datalog programs, maximal fixpoint


Schema Extraction (From Data)

Problem statement

• given data instance D

• find the “most specific” schema S for D

In practice: S too large, need to relax

[Nestorov, Abiteboul, Motwani 1998]


Schema Extraction: Sample Data

Example database D = [Figure: graph database with root &r, company node &c, and employee nodes &p1 ... &p8; edges are labeled company, employee, worksfor, manages, and managedby.]


Lower Bound Schema Extraction

[NAM’98] approach:
• Start with the schema given by the data (S = D):
– Each node = a predicate = a class
• Compute maximal fixpoint (PTIME)
• Declare two classes equal iff they are equal sets
– E.g. p4 = {&p1,&p4,&p6}, p6 = {&p1,&p4,&p6}, hence p4 = p6
• Equivalently, p = p’ iff p(&p’) and p’(&p)

. . .
p4(x) :- link(x, manages, y), p5(y), link(x, worksfor, z), c(z)
p5(x) :- link(x, managedby, y), p4(y), link(x, worksfor, z), c(z)
. . .


Lower Bound Schema Extraction

Result S = [Figure: schema graph with classes Root = {&r}, Bosses = {&p1,&p4,&p6}, Regulars = {&p2,&p3,&p5,&p7,&p8}, Company = {&c}; edges are labeled company, employee, manages, managedby, and worksfor.]


Lower Bound Schema Extraction

Equivalently:
• Compute the maximal simulation D ≼ D
– Can do in time O(m²)
• Two nodes p, p’ are equivalent iff p ≼ p’ and p’ ≼ p
• Schema consists of equivalence classes

Remark: could use the bisimulation relation instead (perhaps even better)
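The simulation-based formulation can be sketched in a few lines of Python. This is a naive fixpoint loop written for readability, not the O(m²) algorithm mentioned above, and the five-node graph is a hypothetical miniature of the sample database (one boss, two regular employees), not the instance from the slides.

```python
from itertools import product


def maximal_simulation(nodes, edges):
    """Return all pairs (x, x2) such that node x is simulated by node x2.

    edges is a set of (source, label, target) triples.
    """
    out = {n: [] for n in nodes}
    for source, label, target in edges:
        out[source].append((label, target))

    # Start from the full relation and delete violating pairs until stable.
    sim = set(product(nodes, nodes))
    changed = True
    while changed:
        changed = False
        for x, x2 in list(sim):
            # Every edge x -label-> y needs a match x2 -label-> y2
            # whose targets are still related.
            ok = all(
                any(l2 == label and (y, y2) in sim for l2, y2 in out[x2])
                for label, y in out[x]
            )
            if not ok:
                sim.discard((x, x2))
                changed = True
    return sim


def schema_classes(nodes, edges):
    """Group nodes that simulate each other; each group is one schema class."""
    sim = maximal_simulation(nodes, edges)
    return {
        frozenset(y for y in nodes if (x, y) in sim and (y, x) in sim)
        for x in nodes
    }


# Hypothetical miniature database: p1 manages p2 and p3.
NODES = ["r", "c", "p1", "p2", "p3"]
EDGES = {
    ("r", "company", "c"),
    ("r", "employee", "p1"), ("r", "employee", "p2"), ("r", "employee", "p3"),
    ("p1", "manages", "p2"), ("p1", "manages", "p3"),
    ("p2", "managedby", "p1"), ("p3", "managedby", "p1"),
    ("p1", "worksfor", "c"), ("p2", "worksfor", "c"), ("p3", "worksfor", "c"),
}
```

On this toy graph, `schema_classes(NODES, EDGES)` groups p2 and p3 into one class and leaves r, c, and p1 in singleton classes, which mirrors the Root / Bosses / Regulars / Company split of the extracted schema.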


Upper Bound Schema Extraction

• The extracted lower bound schema S is also an upper bound schema !

• But: nondeterministic

• Convert S → Sd (make S deterministic)

• Alternatively, convert directly D → Dd = Sd

– These are data guides [McHugh and Widom]
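One common way to obtain a deterministic graph from D is the subset construction used to turn a nondeterministic automaton into a deterministic one. The sketch below assumes that reading of the conversion; the function name `data_guide` and the small edge set are hypothetical and only illustrate the idea.

```python
def data_guide(root, edges):
    """Determinize a labeled graph by subset construction.

    Each state of the result is a frozenset of data nodes, and each state
    has at most one outgoing edge per label.
    """
    out = {}
    for source, label, target in edges:
        out.setdefault(source, {}).setdefault(label, set()).add(target)

    start = frozenset([root])
    guide = {}
    todo = [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        # Merge the targets of all nodes in this state, label by label.
        by_label = {}
        for node in state:
            for label, targets in out.get(node, {}).items():
                by_label.setdefault(label, set()).update(targets)
        guide[state] = {label: frozenset(ts) for label, ts in by_label.items()}
        todo.extend(guide[state].values())
    return guide


# Hypothetical miniature database: p1 manages p2 and p3.
EDGES = {
    ("r", "company", "c"),
    ("r", "employee", "p1"), ("r", "employee", "p2"), ("r", "employee", "p3"),
    ("p1", "manages", "p2"), ("p1", "manages", "p3"),
    ("p2", "managedby", "p1"), ("p3", "managedby", "p1"),
    ("p1", "worksfor", "c"), ("p2", "worksfor", "c"), ("p3", "worksfor", "c"),
}

guide = data_guide("r", EDGES)
```

For this toy input the result has five states: the root, the company, all employees, the managed employees, and the managing employee. In the worst case the subset construction can produce exponentially many states.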


Upper Bound Schema Extraction

Result Sd = [Figure: deterministic schema graph with classes Root = {&r}, Employees = {&p1,&p2,&p3,&p4,&p5,&p6,&p7,&p8}, Bosses = {&p1,&p4,&p6}, Regulars = {&p2,&p3,&p5,&p7,&p8}, Company = {&c}; edges are labeled company, employee, manages, managedby, and worksfor.]


XML Document Type Definitions

• part of the original XML specification

• an XML document may have a DTD

• terminology for XML:
– well-formed: if tags are correctly closed
– valid: if it has a DTD and conforms to it

• validation is useful in data exchange


Very Simple DTD

<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>


Very Simple DTD

Example of valid XML document:

<company>
  <person>
    <ssn> 123456789 </ssn>
    <name> John </name>
    <office> B432 </office>
    <phone> 1234 </phone>
  </person>
  <person>
    <ssn> 987654321 </ssn>
    <name> Jim </name>
    <office> B123 </office>
  </person>
  <product> ... </product>
  ...
</company>


Content Model

• Element content: what we can put in an element (aka content model)

• Content model:
– Complex = a regular expression over other elements
– Text-only = #PCDATA
– Empty = EMPTY
– Any = ANY
– Mixed content = (#PCDATA | A | B | C)*
   • (i.e. very restricted)
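Because a complex content model is a regular expression over child element names, conformance of one element can be checked by matching the sequence of its children against that expression. The sketch below hand-translates the company DTD into Python regular expressions; the table `CONTENT_MODELS`, the `PCDATA` marker, and the function `conforms` are illustrative names, and the check ignores attributes and stray text between child elements.

```python
import re
import xml.etree.ElementTree as ET

PCDATA = None  # marker for text-only elements (no child elements allowed)

# Hand translation of the company DTD. Every child tag is followed by one
# space, so "ssn name office " is the child sequence of a person without phone.
CONTENT_MODELS = {
    "company": r"(person |product )*",
    "person": r"ssn name office (phone )?",
    "product": r"pid name (description )?",
    "ssn": PCDATA, "name": PCDATA, "office": PCDATA, "phone": PCDATA,
    "pid": PCDATA, "description": PCDATA,
}


def conforms(elem):
    """Check an element and its descendants against CONTENT_MODELS."""
    if elem.tag not in CONTENT_MODELS:
        return False  # undeclared element
    model = CONTENT_MODELS[elem.tag]
    children = "".join(child.tag + " " for child in elem)
    if model is PCDATA:
        return children == ""
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in elem)


doc = ET.fromstring(
    "<company>"
    "<person><ssn>123456789</ssn><name>John</name>"
    "<office>B432</office><phone>1234</phone></person>"
    "<person><ssn>987654321</ssn><name>Jim</name><office>B123</office></person>"
    "</company>"
)
print(conforms(doc))
```

A real validating parser does considerably more (attributes, entities, ID uniqueness); this only illustrates the regular-expression view of content models.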


Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED>

<person age="25">
  <name> .... </name>
  ...
</person>


Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED>

<person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
  <name> .... </name>
  ...
</person>


Attributes in DTDs

Types:

• CDATA = string

• ID = key

• IDREF = foreign key

• IDREFS = foreign keys separated by space

• (Monday | Wednesday | Friday) = enumeration

• NMTOKEN = a name token (name characters only, no whitespace)

• NMTOKENS = multiple name tokens separated by space

• ENTITY = you don’t want to know this
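The key and foreign-key reading of ID, IDREF, and IDREFS can be made concrete with a small checker: every ID value must be unique in the document, and every IDREF or IDREFS token must match some ID. In this sketch the table `ATTR_TYPES` stands in for the ATTLIST declarations and is keyed by attribute name only, which is a simplification; the function name `check_references` is also just illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed attribute types, mirroring the person ATTLIST from the slides.
ATTR_TYPES = {"id": "ID", "manager": "IDREF", "manages": "IDREFS"}


def check_references(root):
    """Return a list of error messages; an empty list means the checks pass."""
    errors = []
    ids = set()
    # First pass: collect ID values and report duplicates.
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if ATTR_TYPES.get(name) == "ID":
                if value in ids:
                    errors.append(f"duplicate ID {value}")
                ids.add(value)
    # Second pass: every reference must point at a collected ID.
    for elem in root.iter():
        for name, value in elem.attrib.items():
            kind = ATTR_TYPES.get(name)
            if kind == "IDREF":
                refs = [value]
            elif kind == "IDREFS":
                refs = value.split()  # space-separated list of references
            else:
                continue
            for ref in refs:
                if ref not in ids:
                    errors.append(f"dangling reference {ref}")
    return errors


doc = ET.fromstring(
    '<company>'
    '<person id="p1" manages="p2 p3"/>'
    '<person id="p2" manager="p1"/>'
    '<person id="p3" manager="p9"/>'
    '</company>'
)
print(check_references(doc))
```

Note that the check cannot say which kind of element a reference points to, only that the target ID exists somewhere in the document.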


Attributes in DTDs

Kind:
• #REQUIRED
• #IMPLIED = optional
• value = default value
• #FIXED value = the only value allowed


Using DTDs

• Must include in the XML document
• Either include the entire DTD:
– <!DOCTYPE rootElement [ ....... ]>
• Or include a reference to it:
– <!DOCTYPE rootElement SYSTEM "http://www.mydtd.org">
• Or mix the two... (e.g. to override the external definition)


DTDs as Grammars

<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>

<paper>
  <section> <text> </text> </section>
  <section>
    <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>


DTDs as Grammars

• A DTD = a grammar

• A valid XML document = a parse tree for that grammar


DTDs as Schemas

Not so well suited:
• impose unwanted constraints on order: <!ELEMENT person (name,phone)>
• references cannot be constrained
• can be too vague: <!ELEMENT person ((name|phone|email)*)>


XML Schemas

• http://www.w3.org/TR/xmlschema-1/ (10/2000)
• generalizes DTDs
• uses XML syntax
• two documents: structure and datatypes
– http://www.w3.org/TR/xmlschema-1
– http://www.w3.org/TR/xmlschema-2
• XML-Schema is very complex
– often criticized
– some alternative proposals


XML Schemas

<xsd:element name="paper" type="papertype"/>

<xsd:complexType name="papertype">
  <xsd:sequence>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="author" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="year"/>
    <xsd:choice>
      <xsd:element name="journal"/>
      <xsd:element name="conference"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:complexType>

DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>


Elements vs. Types in XML Schema

<xsd:element name="person">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="address" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<xsd:element name="person" type="ttt"/>
<xsd:complexType name="ttt">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="address" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>

DTD: <!ELEMENT person (name,address)>


• Types:
– Simple types (integers, strings, ...)
– Complex types (regular expressions, like in DTDs)
• Element-type-element alternation:
– Root element has a complex type
– That type is a regular expression of elements
– Those elements have their complex types...
– ...
– On the leaves we have simple types


Local and Global Types in XML Schema

• Local type:
<xsd:element name="person">
  [define locally the person’s type]
</xsd:element>

• Global type:
<xsd:element name="person" type="ttt"/>
<xsd:complexType name="ttt">
  [define here the type ttt]
</xsd:complexType>

Global types: can be reused in other elements


Local vs. Global Elements in XML Schema

• Local element:
<xsd:complexType name="ttt">
  <xsd:sequence>
    <xsd:element name="address" type="..."/>
    ...
  </xsd:sequence>
</xsd:complexType>

• Global element:
<xsd:element name="address" type="..."/>
<xsd:complexType name="ttt">
  <xsd:sequence>
    <xsd:element ref="address"/>
    ...
  </xsd:sequence>
</xsd:complexType>

Global elements: like in DTDs


Regular Expressions in XML Schema

Recall the element-type-element alternation:
<xsd:complexType name="....">
  [regular expression on elements]
</xsd:complexType>

Regular expressions:
• <xsd:sequence> A B C </...> = A B C
• <xsd:choice> A B C </...> = A | B | C
• <xsd:group> A B C </...> = (A B C)
• <xsd:... minOccurs="0" maxOccurs="unbounded"> ..</...> = (...)*
• <xsd:... minOccurs="0" maxOccurs="1"> ..</...> = (...)?
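The correspondence listed above can be sketched as a small translator from an XML Schema content model to a DTD-style regular expression. The function names `occurs` and `to_regex` are illustrative, only `sequence`, `choice`, and `element` are handled, and the sample is the papertype definition with `author` given `maxOccurs="unbounded"` so that it corresponds to `author*`. The namespace URI is that of the final 2001 recommendation, which is an assumption about the schema version in use.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"


def occurs(node):
    """Map minOccurs/maxOccurs (both default to 1) to a regex suffix."""
    lo = node.get("minOccurs", "1")
    hi = node.get("maxOccurs", "1")
    suffixes = {
        ("1", "1"): "",
        ("0", "1"): "?",
        ("0", "unbounded"): "*",
        ("1", "unbounded"): "+",
    }
    return suffixes.get((lo, hi), "{%s,%s}" % (lo, hi))


def to_regex(node):
    """Translate an xsd:sequence / xsd:choice / xsd:element subtree."""
    kind = node.tag.replace(XSD, "")
    if kind == "element":
        body = node.get("name") or node.get("ref")
    elif kind in ("sequence", "choice"):
        separator = ", " if kind == "sequence" else " | "
        body = "(" + separator.join(to_regex(child) for child in node) + ")"
    else:
        raise ValueError("unsupported construct: " + kind)
    return body + occurs(node)


SAMPLE = """
<xsd:complexType name="papertype" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:sequence>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="author" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="year"/>
    <xsd:choice>
      <xsd:element name="journal"/>
      <xsd:element name="conference"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:complexType>
"""

papertype = ET.fromstring(SAMPLE.strip())
print(to_regex(papertype[0]))  # (title, author*, year, (journal | conference))
```

Leaving out `maxOccurs` on `author` would yield `author?` instead of `author*`, since both occurrence bounds default to 1.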


Local Names in XML-Schema

<xsd:element name="person">
  <xsd:complexType>
    . . . . .
    <xsd:element name="name">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="firstname" type="xsd:string"/>
          <xsd:element name="lastname" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
    . . . .
  </xsd:complexType>
</xsd:element>

<xsd:element name="product">
  <xsd:complexType>
    . . . . .
    <xsd:element name="name" type="xsd:string"/>
  </xsd:complexType>
</xsd:element>

name has different meanings in person and in product


Subtle Use of Local Names

<xsd:element name="A" type="oneB"/>

<xsd:complexType name="onlyAs">
  <xsd:choice>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
    <xsd:element name="A" type="xsd:string"/>
  </xsd:choice>
</xsd:complexType>

<xsd:complexType name="oneB">
  <xsd:choice>
    <xsd:element name="B" type="xsd:string"/>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="oneB"/>
    </xsd:sequence>
    <xsd:sequence>
      <xsd:element name="A" type="oneB"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
  </xsd:choice>
</xsd:complexType>

Arbitrarily deep binary tree with A elements, and a single B element


Summary of XML Schema

• Formal Expressive Power:
– Can express precisely the regular tree languages (over unranked trees)
• Lots of other stuff
– Some form of inheritance
– A “null” value
– Large collection of data types


Summary of Schemas

• in SS data:
– graph theoretic
– data and schema are decoupled
– used in data processing
• in XML
– from grammar to object-oriented
– schema wired with the data
– emphasis on semantics for exchange

