Semistructured Data
Extensible Markup Language
Document Type Definitions

Zaki Malik

November 04, 2008

Framework

1. Information Integration : Making databases from various places work as one.

2. Semistructured Data : A new data model designed to cope with problems of information integration.

3. XML : A standard language for describing semistructured data schemas and representing data.

The Information‐Integration Problem

• Related data exists in many places and could, in principle, work together.

• But different databases differ in:

1. Model (relational, object‐oriented?).

2. Schema (normalized/unnormalized?).

3. Terminology: are consultants employees? Retirees? Subcontractors?

4. Conventions (meters versus feet?).

Example

• Every bar has a database.

– One may use a relational DBMS; another keeps the menu in an MS‐Word document.

– One stores the phones of distributors, another does not.

– One distinguishes ales from other beers, another doesn’t.

– One counts beer inventory by bottles, another by cases.

Two Approaches to Integration

1. Warehousing : Make copies of the data sources at a central site and transform them to a common schema.

– Reconstruct data daily/weekly, but do not try to keep it more up‐to‐date than that.

2. Mediation : Create a view of all sources, as if they were integrated.

– Answer a view query by translating it to terminology of the sources and querying them.

Warehouse Diagram

[Figure: Source 1 and Source 2 at the bottom each feed, through a Wrapper or Adapter, into the Warehouse at the top.]

A Mediator

[Figure: a user query enters the Mediator and a result comes back. The Mediator sends a query through a Wrapper or Adapter to each of Source 1 and Source 2, and each source's result returns along the same path.]
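The mediation approach can be sketched in a few lines of Python. This is only an illustration: the two sources, their field names, and the 24-bottles-per-case conversion are hypothetical, chosen to echo the bottles-versus-cases mismatch from the bar example. Each wrapper translates the mediator's question into its source's own terms and returns the answer in one common unit.

```python
# Two hypothetical sources that record beer inventory under different conventions.
SOURCE_1 = [{"beer": "Bud", "bottles": 48}]   # assumed: counts individual bottles
SOURCE_2 = [{"name": "Bud", "cases": 3}]      # assumed: counts cases

BOTTLES_PER_CASE = 24  # assumed conversion factor, for illustration only


def wrapper_1(beer):
    """Query source 1 in its own terms; it already reports bottles."""
    return sum(row["bottles"] for row in SOURCE_1 if row["beer"] == beer)


def wrapper_2(beer):
    """Query source 2 in its own terms and convert cases to bottles."""
    return sum(row["cases"] * BOTTLES_PER_CASE
               for row in SOURCE_2 if row["name"] == beer)


def mediator_bottles(beer):
    """Answer a query on the integrated view by asking every source."""
    return wrapper_1(beer) + wrapper_2(beer)


print(mediator_bottles("Bud"))  # 48 + 3 * 24 = 120
```

Nothing is copied to a central site here; a warehouse would instead run the same conversions once, ahead of time, and store the results.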

Semistructured Data

• Purpose: represent data from independent sources more flexibly than either relational or object‐oriented models.

• Think of objects, but with the type of each object its own business, not that of its “class.”

• Labels to indicate meaning of substructures.

Graphs of Semistructured Data

• Nodes = objects.

• Labels on arcs (attributes, relationships).

• Atomic values at leaf nodes (nodes with no arcs out).

• Flexibility: no restriction on:

– Labels out of a node.

– Number of successors with a given label.

Example: Data Graph

[Figure: data graph. The root has two beer arcs and one bar arc. The beer object for Bud has name Bud, manf A.B., a servedAt arc, and a prize arc leading to a subobject with year 1995 and award Gold. The other beer object has name M’lob and a manf arc. The bar object for Joe’s Bar has name Joe’s and addr Maple.]

Notice a new kind of data.
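One way to hold such a graph in a program is as nested dictionaries, where each object maps arc labels to lists of successors. The sketch below is an assumption-laden reading of the example graph (in particular, it assumes the servedAt arc leads from the Bud object to the bar object, and that both beers share the manufacturer value A.B.); the helper name follow is made up for illustration.

```python
# Each object is a dict from arc label to a list of successors;
# atomic values (strings, numbers) sit at the leaves.
joes_bar = {"name": ["Joe's"], "addr": ["Maple"]}
bud = {
    "name": ["Bud"],
    "manf": ["A.B."],
    "servedAt": [joes_bar],                       # arc to another object
    "prize": [{"year": [1995], "award": ["Gold"]}],
}
mlob = {"name": ["M'lob"], "manf": ["A.B."]}      # no prize, no servedAt
root = {"beer": [bud, mlob], "bar": [joes_bar]}   # two arcs share one label


def follow(node, *labels):
    """Return every value reachable from node along the given arc labels."""
    frontier = [node]
    for label in labels:
        frontier = [succ
                    for obj in frontier if isinstance(obj, dict)
                    for succ in obj.get(label, [])]
    return frontier


print(follow(root, "beer", "name"))              # ['Bud', "M'lob"]
print(follow(root, "beer", "prize", "award"))    # ['Gold']
print(follow(root, "beer", "servedAt", "name"))  # ["Joe's"]
```

The two beer objects carry different label sets, and nothing declares a "beer class" in advance; that is the flexibility the semistructured model is after.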

XML

• XML = eXtensible Markup Language.

• While HTML uses tags for formatting (e.g., “italic”), XML uses tags for semantics (e.g., “this is an address”).

• Key idea: create tag sets for a domain (e.g., genomics), and translate all data into properly tagged XML documents.

Well‐Formed and Valid XML

• Well‐Formed XML allows you to invent your own tags.

– Similar to labels in semistructured data.

• Valid XML involves a DTD (Document Type Definition), which limits the labels and gives a grammar for their use.

Is a Well‐formed Document Valid?

• An XML document is said to be well‐formed if it follows all of the "rules" of XML, such as proper nesting and attribute use, so by definition all XML documents are well‐formed.

• A valid document, on the other hand, is one that is not only well‐formed, but also follows the restrictions set out in a specific grammar, typically specified in a Document Type Definition (DTD) or some form of XML Schema.

Is a Well‐formed Document Valid?

• An example of a document that is well‐formed but not valid based upon the XHTML grammar. Why?

<body>
  <p>Example of Well-formed HTML</p>
  <head>
    <title>Example</title>
  </head>
  <zorko>What is this?</zorko>
</body>

HTML vs. XML

• In the case of HTML, browsers have been taught how to ignore invalid HTML such as the <zorko> element and generally do their best when dealing with badly placed HTML elements.

• The XML processor, on the other hand, cannot tell us which elements and attributes are valid. As a result we need to define the XML markup we are using. To do this, we need to define the markup language’s grammar.
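The difference between the two checks can be tried with Python's standard xml.etree.ElementTree parser, which tests well‐formedness only. In this sketch the KNOWN_TAGS set is a hypothetical, deliberately tiny stand-in for a real grammar such as XHTML; it ignores nesting rules, so it would catch <zorko> but not the misplaced <head>.

```python
import xml.etree.ElementTree as ET

DOC = """<body>
<p>Example of Well-formed HTML</p>
<head><title>Example</title></head>
<zorko>What is this?</zorko>
</body>"""

# Hypothetical, deliberately tiny stand-in for the XHTML vocabulary.
KNOWN_TAGS = {"body", "p", "head", "title"}


def is_well_formed(text):
    """True if the text parses as XML at all (tags balanced and nested)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def unknown_tags(text):
    """Tags that a grammar restricted to KNOWN_TAGS would reject."""
    return sorted({elem.tag for elem in ET.fromstring(text).iter()} - KNOWN_TAGS)


print(is_well_formed(DOC))                     # True
print(unknown_tags(DOC))                       # ['zorko']
print(is_well_formed("<body><p>text</body>"))  # False: <p> is never closed
```

The parser accepts <zorko> without complaint because it has no grammar to consult; only a separately supplied definition of the markup can say that the tag does not belong.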

Well‐Formed XML

• Start the document with a declaration, surrounded by <? … ?> .

• Normal declaration is:
<?xml version = "1.0" standalone = "yes" ?>

– “Standalone” = “no DTD provided.”

– The declaration keywords are lowercase and case‐sensitive, and the quotes are plain straight quotes.

• Balance of document is a root tag surrounding nested tags.

Tags

• Tags, as in HTML, are normally matched pairs, as <FOO> … </FOO> .

• Tags may be nested arbitrarily.

• Tags requiring no matching ender, like <P> in HTML, are also permitted, but in XML they must be written in the self‐closing form, as <FOO/> .

Example: Well‐Formed XML

<?xml version = "1.0" standalone = "yes" ?>
<BARS>
  <BAR><NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME>
      <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME>
      <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> …
</BARS>

XML and Semistructured Data

• Well‐Formed XML with nested tags is exactly the same idea as trees of semistructured data.

• We shall see that XML also enables nontree structures, as does the semistructured data model.

Example

• The <BARS> XML document is:

[Figure: tree with root BARS and several BAR children. The first BAR has a NAME leaf “Joe’s Bar” and two BEER children: one with NAME “Bud” and PRICE 2.50, the other with NAME “Miller” and PRICE 3.00. The remaining BAR subtrees are elided (. . .).]
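The tree view is also what an XML parser hands back. As a small sketch using Python's standard xml.etree.ElementTree, the code below parses the Joe's Bar portion of the document (the elided bars are simply left out) and walks the tree to collect prices; the function name price_list is made up for illustration.

```python
import xml.etree.ElementTree as ET

BARS_XML = """<?xml version="1.0" standalone="yes"?>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""


def price_list(xml_text):
    """Walk BARS -> BAR -> BEER and map (bar name, beer name) to price."""
    root = ET.fromstring(xml_text)
    prices = {}
    for bar in root.findall("BAR"):
        bar_name = bar.findtext("NAME")
        for beer in bar.findall("BEER"):
            key = (bar_name, beer.findtext("NAME"))
            prices[key] = float(beer.findtext("PRICE"))
    return prices


print(price_list(BARS_XML))
```

Each findall call follows arcs with one label from a node, which is the same navigation one would do on the semistructured data graph.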

Document Type Definitions

• Essentially a context‐free grammar for describing XML tags and their nesting.

• Each domain of interest (e.g., electronic components, bars‐beers‐drinkers) creates one DTD that describes all the documents this group will share.

DTD Structure

<!DOCTYPE <root tag> [
  <!ELEMENT <name> ( <components> )>
  <more elements>
]>

Element Basics

• Defining elements within a DTD is done using an <!ELEMENT> declaration.

• <!ELEMENT> declarations, along with all other declarations within a DTD, have no content.

• <!ELEMENT> declarations are composed of several parts, including the element name and the type of information it will contain.

• The resulting element names will be case sensitive.

<!ELEMENT element_name element_contents>

DTD Elements

• The description of an element consists of its name (tag), and a parenthesized description of any nested tags.

– Includes order of subtags and their multiplicity.

• Leaves (text elements) have #PCDATA in place of nested tags.

What an <!ELEMENT> Can Contain

• An <!ELEMENT> declaration can contain several different types of content, which include the following:

– EMPTY

– PCDATA

– ANY

– Children elements

EMPTY

• <!ELEMENT> declarations that include the EMPTY value allow us to create empty elements within our XML.

• The word EMPTY must be entered in uppercase as it is case-sensitive.

<!ELEMENT element_name EMPTY>

PCDATA

• <!ELEMENT> declarations that include the value PCDATA allow us to include text and other parsable content in our elements within our XML instance file.

• The word PCDATA must be enclosed in parentheses with a preceding ’#’ and entered in uppercase as it is case-sensitive.

• PCDATA is text that will be parsed by a parser. Tags inside the text will be treated as markup and entities will be expanded.

<!ELEMENT element_name (#PCDATA)>

ANY

• <!ELEMENT> declarations that include the value ANY allow us to include any type of parsable content, including text and other elements, in our elements within our XML instance file.

• The word ANY must be entered in uppercase as it is case-sensitive.

<!ELEMENT element_name ANY>

Element Descriptions

• Subtags must appear in order shown.

• A tag may be followed by a symbol to indicate its multiplicity.

– * = zero or more.

– + = one or more.

– ? = zero or one.

• Symbol | can connect alternative sequences of tags.

Example: DTD

<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>

• A BARS object has zero or more BAR’s nested within.

• A BAR has one NAME and one or more BEER subobjects.

• A BEER has a NAME and a PRICE.

• NAME and PRICE are text.

• The name after DOCTYPE is written BARS so that it matches the root tag; XML names are case‐sensitive.
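Because each content model is a regular expression over child tags, the grammar can be checked by hand with ordinary regexes. The sketch below is not a real DTD validator: the five declarations are translated manually into Python patterns (an assumption of this example), text mixed between child elements is not checked, and the function name conforms is made up.

```python
import re
import xml.etree.ElementTree as ET

# Content models translated by hand into regexes over child tag names.
# Every child tag is written with a trailing space so names cannot run together.
CONTENT_MODELS = {
    "BARS": r"(BAR )*",        # (BAR*)
    "BAR": r"NAME (BEER )+",   # (NAME, BEER+)
    "BEER": r"NAME PRICE ",    # (NAME, PRICE)
    "NAME": None,              # (#PCDATA): text only, no child elements
    "PRICE": None,             # (#PCDATA)
}


def conforms(elem):
    """Check elem and its descendants against CONTENT_MODELS."""
    if elem.tag not in CONTENT_MODELS:
        return False                      # undeclared tag
    model = CONTENT_MODELS[elem.tag]
    if model is None:
        return len(elem) == 0             # a text leaf may not have children
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        return False                      # wrong order or multiplicity
    return all(conforms(child) for child in elem)


good = ET.fromstring(
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER></BAR></BARS>")
bad = ET.fromstring("<BARS><BAR><NAME>Joe's Bar</NAME></BAR></BARS>")

print(conforms(good))  # True
print(conforms(bad))   # False: BEER+ requires at least one BEER
```

An empty <BARS></BARS> also conforms, since BAR* allows zero bars.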

Example: Element Description

• A name is an optional title (e.g., “Prof.”), a first name, and a last name, in that order, or it is an IP address:

<!ELEMENT NAME (
  (TITLE?, FIRST, LAST) | IPADDR
)>
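The ? and | symbols behave like their regular-expression counterparts. As a rough illustration, the NAME content model can be written by hand as a Python regex over the sequence of child tags; the pattern and the helper name_ok are assumptions of this sketch, not part of any DTD tool.

```python
import re

# (TITLE?, FIRST, LAST) | IPADDR, with each tag followed by a space.
NAME_MODEL = re.compile(r"(TITLE )?FIRST LAST |IPADDR ")


def name_ok(child_tags):
    """True if the list of child tag names fits the NAME content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return NAME_MODEL.fullmatch(sequence) is not None


print(name_ok(["TITLE", "FIRST", "LAST"]))   # True
print(name_ok(["FIRST", "LAST"]))            # True: the title is optional
print(name_ok(["IPADDR"]))                   # True: the alternative branch
print(name_ok(["LAST", "FIRST"]))            # False: wrong order
print(name_ok(["FIRST", "LAST", "IPADDR"]))  # False: the branches cannot mix
```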

Use of DTD’s

1. Set standalone = “no”.

2. Either:

a) Include the DTD as a preamble of the XML document, or

b) Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD can be found.

Example (a)

<?xml version = "1.0" standalone = "no" ?>
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<BARS>
  <BAR><NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> …
</BARS>

• The DOCTYPE block is the DTD; the <BARS> element after it is the document.

Example (b)

• Assume the BARS DTD is in file bar.dtd.

<?xml version = "1.0" standalone = "no" ?>
<!DOCTYPE BARS SYSTEM "bar.dtd">
<BARS>
  <BAR><NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME>
      <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME>
      <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> …
</BARS>

• The DOCTYPE line says to get the DTD from the file bar.dtd.
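Supplying a DTD does not by itself mean the document gets checked; that depends on whether the parser validates. As a small experiment, Python's standard xml.etree.ElementTree (which, as far as its documentation indicates, does not validate) reads a document carrying the DTD as a preamble, as in form (a), and accepts it even when the content breaks the grammar. The shortened documents below are assumptions made for the test.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE BARS [
<!ELEMENT BARS (BAR*)>
<!ELEMENT BAR (NAME, BEER+)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT BEER (NAME, PRICE)>
<!ELEMENT PRICE (#PCDATA)>
]>
"""

VALID = DTD + ("<BARS><BAR><NAME>Joe's Bar</NAME>"
               "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER></BAR></BARS>")
# Well-formed, but the BAR has no BEER, so it violates (NAME, BEER+).
INVALID = DTD + "<BARS><BAR><NAME>Joe's Bar</NAME></BAR></BARS>"

for label, text in (("valid", VALID), ("invalid", INVALID)):
    root = ET.fromstring(text)   # no error is raised in either case
    print(label, "document parsed; root tag is", root.tag)
```

To have the grammar enforced, a validating parser is needed; the standard-library one only confirms that the text is well‐formed.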

Attributes

• Opening tags in XML can have attributes, like <A HREF = “…”> in HTML.

• In a DTD,
<!ATTLIST <element name> … >
gives a list of attributes and their datatypes for this element.

Example: Attributes

• Bars can have an attribute kind, which is either sushi, sports, or “other.”

<!ELEMENT BAR (NAME, BEER*)>
<!ATTLIST BAR kind (sushi | sports | other) #IMPLIED>

• #IMPLIED marks the attribute as optional; an enumerated type lists its allowed values in parentheses, without quotes.

Example: Attribute Use

• In a document that allows BAR tags, we might see:

<BAR kind = "sushi">
  <NAME>Akasaka</NAME>
  <BEER><NAME>Sapporo</NAME>
    <PRICE>5.00</PRICE></BEER>
  ...
</BAR>
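From a program's point of view an attribute is a name-value pair on the element. The sketch below reads kind with Python's standard xml.etree.ElementTree and compares it with the three allowed values; the ALLOWED_KINDS set and kind_is_allowed helper are illustrative assumptions, and the elided part of the BAR element is left out.

```python
import xml.etree.ElementTree as ET

ALLOWED_KINDS = {"sushi", "sports", "other"}   # the values listed for kind

bar = ET.fromstring(
    '<BAR kind = "sushi"><NAME>Akasaka</NAME>'
    "<BEER><NAME>Sapporo</NAME><PRICE>5.00</PRICE></BEER></BAR>")


def kind_is_allowed(bar_elem):
    """A missing kind is accepted here (treated as optional)."""
    kind = bar_elem.get("kind")
    return kind is None or kind in ALLOWED_KINDS


print(bar.get("kind"))        # sushi
print(kind_is_allowed(bar))   # True
print(kind_is_allowed(ET.fromstring('<BAR kind="karaoke"/>')))  # False
```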

ID’s and IDREF’s

• These are pointers from one object to another, in analogy to HTML’s NAME = “foo” and HREF = “#foo”.

• Allows the structure of an XML document to be a general graph, rather than just a tree.

Creating ID’s

• Give an element E an attribute A of type ID.

• When using tag <E> in an XML document, give its attribute A a unique value.

• Example:
<E A = "xyz">

Creating IDREF’s

• To allow objects of type F to refer to another object with an ID attribute, give F an attribute of type IDREF.

• Or, let the attribute have type IDREFS, so the F‐object can refer to any number of other objects.

Example: ID’s and IDREF’s

• Let’s redesign our BARS DTD to include both BAR and BEER subelements.

• Both bars and beers will have ID attributes called name.

• Bars have PRICE subobjects, consisting of a number (the price of one beer) and an IDREF theBeer leading to that beer.

• Beers have attribute soldBy, which is an IDREFS leading to all the bars that sell it.

The DTD

<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*, BEER*)>
  <!ELEMENT BAR (PRICE+)>
  <!ATTLIST BAR name ID #REQUIRED>
  <!ELEMENT PRICE (#PCDATA)>
  <!ATTLIST PRICE theBeer IDREF #REQUIRED>
  <!ELEMENT BEER EMPTY>
  <!ATTLIST BEER name ID #REQUIRED soldBy IDREFS #IMPLIED>
]>

• Bar objects have name as an ID attribute and have one or more PRICE subobjects.

• PRICE objects have a number (the price) and one reference to a beer.

• Beer objects have an ID attribute called name, and a soldBy attribute that is a set of Bar names.

Example Document

<BARS>
  <BAR name = "JoesBar">
    <PRICE theBeer = "Bud">2.50</PRICE>
    <PRICE theBeer = "Miller">3.00</PRICE>
  </BAR> …
  <BEER name = "Bud" soldBy = "JoesBar SuesBar …"></BEER> …
</BARS>

• Attributes in a tag are separated by spaces, not commas, and the values in an IDREFS attribute are separated by spaces as well.


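Following an IDREF amounts to a lookup in an index of ID values. The sketch below builds that index by hand with Python's standard xml.etree.ElementTree, since that parser does not interpret ID and IDREF types itself. The sample is a trimmed, assumed version of the example document: only JoesBar is listed in soldBy, because the other bars are elided in the slide, and the Miller BEER element is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed sample: only JoesBar is listed, and the Miller BEER element is assumed.
DOC = """<BARS>
  <BAR name="JoesBar">
    <PRICE theBeer="Bud">2.50</PRICE>
    <PRICE theBeer="Miller">3.00</PRICE>
  </BAR>
  <BEER name="Bud" soldBy="JoesBar"></BEER>
  <BEER name="Miller" soldBy="JoesBar"></BEER>
</BARS>"""


def build_id_index(root, id_attr="name"):
    """Map each ID value to its element, rejecting duplicate IDs."""
    index = {}
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is None:
            continue
        if value in index:
            raise ValueError(f"duplicate ID: {value}")
        index[value] = elem
    return index


def dangling_refs(root, index):
    """List IDREF/IDREFS values that point at no ID in the document."""
    refs = [price.get("theBeer") for price in root.iter("PRICE")]
    for beer in root.iter("BEER"):
        refs.extend(beer.get("soldBy", "").split())  # IDREFS: space-separated
    return [ref for ref in refs if ref not in index]


root = ET.fromstring(DOC)
index = build_id_index(root)
for price in index["JoesBar"].iter("PRICE"):
    beer = index[price.get("theBeer")]   # follow the IDREF to a BEER element
    print(beer.get("name"), price.text)
print(dangling_refs(root, index))        # [] when every reference resolves
```

The PRICE elements point from bars to beers and soldBy points back from beers to bars, so the document describes a graph with cycles even though its markup is a tree.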