12 February 2008 Kaiser: COMS E6125

COMS E6125 Web-enHanced Information Management (WHIM)

Prof. Gail Kaiser

Spring 2008

Today’s Topics:

• Document Structure Definition
  – Document Type Definition (DTD)
  – XML Schema (XSD)

• Querying XML Documents
  – NOT the same as Web search engines!
  – XPath
  – XQuery

Pure XML - Instance Model

<A>
  <B>foo</B>
  <C>bar</C>
  <C>psl</C>
</A>

[Figure: the instance drawn two ways, as a labeled ordered tree (A with children B: "foo", C: "bar", C: "psl") and as nested boxes; children are ordered]

• XML 1.0 implicit data model:
  – nested containers ("boxes within boxes")
  – labeled ordered trees (= semistructured data model)
  – Relational or object-oriented data is easy to encode
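A minimal sketch of the labeled ordered tree, assuming Python's standard `xml.etree.ElementTree` parser (the helper name `labeled_children` is made up for illustration). It reads the instance above and lists the children of A in the order they appear:

```python
import xml.etree.ElementTree as ET

# The instance from the slide.
doc = "<A> <B>foo</B> <C>bar</C> <C>psl</C></A>"
root = ET.fromstring(doc)

def labeled_children(element):
    """Return (label, text) pairs for the children, in document order."""
    return [(child.tag, child.text) for child in element]

children = labeled_children(root)
print(root.tag, children)
```

The two C children keep their relative order, which is what distinguishes this model from an unordered set of fields.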

XML Namespaces

Allows mixing of different tag vocabularies

• Only identifies the vocabulary (lexicon)

• Additional mechanisms required for structure and meaning (or at least metadata) of tags
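As a rough illustration of vocabulary mixing, the sketch below parses a document that uses two `subject` tags from different namespaces. Both namespace URIs and the element names are hypothetical; only the `xml.etree.ElementTree` calls are real.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that both define a "subject" tag.
doc = """
<inv:invoice xmlns:inv="http://example.com/invoice"
             xmlns:memo="http://example.com/memo">
  <inv:subject>Order 42</inv:subject>
  <memo:subject>whim tomorrow</memo:subject>
</inv:invoice>
"""
root = ET.fromstring(doc)

# ElementTree expands each prefixed name to "{namespace-URI}local-name",
# so the two subject elements remain distinguishable.
tags = [child.tag for child in root]
print(tags)
```

The namespace only tells the two `subject` tags apart; nothing here says what either one may contain, which is the job of the structure definitions discussed next.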

From Documents to Data

• We want to be able to
  – Extract the element structure of a document
  – Re-use this structure for other similar documents
  – Share structure and metadata with others
  – Automate processing of this structure and metadata

<invoice>
  <orderDate>2007-12-01</orderDate>
  <shipDate>2007-12-26</shipDate>
  <billingAddress>
    <name>Gail Kaiser</name>
    <street>500 West 120th Street</street>
    <city>New York</city>
    <state>NY</state>
    <zip>10027</zip>
  </billingAddress>
  <voice>212-555-1234</voice>
  <fax>212-555-4321</fax>
</invoice>

<memo importance='high' date='2008-02-11'>
  <from>Gail Kaiser</from>
  <to>Swapneel Sheth</to>
  <subject>whim tomorrow</subject>
  <body>Remember to pick up the sign-in sheet after class tomorrow</body>
</memo>


Adding Structure and Semantics

• A Document Structure Description (DSD) defines the syntax of XML documents for a particular application domain

• Defines the grammar for an XML-based markup language

Processing XML

• Non-validating parser:
  – checks that the XML document is syntactically well-formed, e.g., all open-tags have matching close-tags and are properly nested, each attribute appears at most once in an element, etc.

• Validating parser:
  – checks that the XML document is also valid with respect to a given DSD (now usually XML Schema)
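The well-formedness checks listed above can be tried with a non-validating parser. This sketch assumes Python's `xml.etree.ElementTree`, which reports any well-formedness violation as a `ParseError`; the sample documents and the helper `is_well_formed` are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "properly nested": "<A><B>foo</B></A>",
    "improperly nested": "<A><B>foo</A></B>",
    "missing close-tag": "<A><B>foo</B>",
    "duplicate attribute": '<A x="1" x="2"/>',
}
results = {name: is_well_formed(text) for name, text in samples.items()}
print(results)
```

Note that this says nothing about validity: the parser never consults a DTD or schema, so any well-formed document passes regardless of its element names.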

Using DSD Validators

• A DSD processor can be useful both on the server side (when writing XML documents) and on the client side (when processing XML documents):
  – Checking validity (conformance) of XML documents
  – Performing default insertion (inserts missing fragments)


DSD Processing

Several Proposed DSDs

• XML Document Type Definitions (DTDs):
  – Define the structure of “allowed” documents (analogous to a database schema)
  – Non-XML syntax

• XML Schemas (XSDs):
  – Define structure and data types
  – Allow developers to build their own libraries of interchangeable data types
  – Written in an XML vocabulary

• Others (e.g., RELAX NG, Schematron)

Document Type Definitions

• A DTD is a grammar defining XML structure
  – An XML document specifies an associated DTD, plus the root element
  – The DTD specifies the children of the root element, their children, and so on

Example DTD

<!ELEMENT bib (book*)>
<!ELEMENT book (thesis | article)>
<!ELEMENT thesis (title, author, year, school, committeemember*)>
<!ATTLIST thesis
  date    CDATA #REQUIRED
  key     ID    #REQUIRED
  advisor CDATA #IMPLIED
  idref   IDREF #IMPLIED>
<!ELEMENT article (title, (author+ | editor+), publisher)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name)>
<!ATTLIST author id ID #REQUIRED>
. . .

DTD Interpretation

• CDATA: “Character Data”, a sequence of characters
• #PCDATA: “Parsed Character Data”, text and character entities (e.g., &amp; -> &, &eacute; -> é)
• ID: a value unique within the document
• IDREF: reference to an ID
• #IMPLIED: the attribute is optional and has no default value
• ( ... ): Specifies a group.
• A | B: Either A or B occurs, but not both.
• A , B: A must occur before B.
• A & B: A and B must both occur once, but may do so in any order (an SGML connector; XML 1.0 DTDs do not support it).
• A?: A can occur zero or one times
• A*: A can occur zero or more times
• A+: A can occur one or more times
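One way to get a feel for the content-model operators is to translate a model by hand into a regular expression over child element names. The sketch below does this for `(title, (author+ | editor+), publisher)`; the encoding (each name followed by a space) and the helper `matches_model` are assumptions made for illustration, not how a real DTD validator is implemented.

```python
import re

# Hand translation of the content model
#   (title, (author+ | editor+), publisher)
# "," becomes concatenation and "|" becomes alternation;
# the occurrence indicators "+", "*" and "?" carry over unchanged.
ARTICLE_MODEL = re.compile(r"(title )((author )+|(editor )+)(publisher )")

def matches_model(child_tags, model=ARTICLE_MODEL):
    """Check a list of child element names against a content-model regex."""
    encoded = "".join(tag + " " for tag in child_tags)
    return model.fullmatch(encoded) is not None

print(matches_model(["title", "author", "author", "publisher"]))
print(matches_model(["title", "author", "editor", "publisher"]))
```

The second call mixes authors and editors, which the choice `author+ | editor+` does not allow.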

DTD Defines Special Significance for Attributes

• IDs – special attributes that are analogous to relational database keys (IDs for elements, unique within a document)
• IDREF – reference to an ID
• IDREFS – a list of IDREFs
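A rough sketch of how an application might follow IDREFs back to the elements that carry the matching IDs. `xml.etree.ElementTree` does not read DTD attribute types, so this code simply assumes that attributes named `id` and `idref` play those roles; the sample document and the helper `authors_of` are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """
<bib>
  <author id="author1"><name>John Smith</name></author>
  <article><author idref="author1"/><title>Paper1</title></article>
  <article><author idref="author1"/><title>Paper2</title></article>
</bib>
"""
root = ET.fromstring(doc)

# Index ID value -> element; IDs must not repeat within a document.
by_id = {}
for element in root.iter():
    key = element.get("id")
    if key is not None:
        if key in by_id:
            raise ValueError(f"duplicate ID: {key}")
        by_id[key] = element

def authors_of(article):
    """Resolve each author idref of an article to the referenced name."""
    names = []
    for ref in article.findall("author"):
        target = by_id[ref.get("idref")]
        names.append(target.findtext("name"))
    return names

papers = {a.findtext("title"): authors_of(a) for a in root.findall("article")}
print(papers)
```

Because both articles point at the same author element, the data forms a graph rather than a pure tree.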


Instance Visualization as a Graph

<?xml version="1.0"?>

<!DOCTYPE bib SYSTEM "http://webserver/bib.dtd">

<bib>

<author id="author1"> <name>John Smith</name>

</author>

<article>

<author idref="author1" />

<title>Paper1</title>

</article>

<article>

<author idref="author1" />

<title>Paper2</title>

</article>

. . .

Graph Data Model

[Figure: the instance above drawn as a graph. Root has children ?xml, !DOCTYPE and bib; bib has one author node (id "author1", name "John Smith") and two article nodes (titles "Paper1" and "Paper2"), each with an author child whose idref is "author1".]


Drawbacks of DTDs

• Not themselves XML - additional effort to build tools

• No support for data types - cannot do data validation

• No support for OO-like structures (e.g., inheritance)

• Horrible syntax

Several Proposed DSDs

• XML Document Type Definitions (DTDs):
  – Define the structure of “allowed” documents (analogous to a database schema)
  – Non-XML syntax

• XML Schemas (XSDs):
  – Define structure and data types
  – Allow developers to build their own libraries of interchangeable data types
  – Written in an XML vocabulary

• Others (e.g., RELAX NG, Schematron)

XML Schema Design Principles

1. More expressive than DTDs (which came from SGML, although modified slightly in XML 1.0)
2. Notation is itself an XML vocabulary
3. Self-describing
4. Usable by a wide variety of applications that employ XML
5. Straightforwardly usable on the Internet
6. Optimized for interoperability
7. Simple enough to be implemented with modest design and runtime resources
8. Coordinated with relevant W3C specs

Purpose of an XML Schema

• Defines a class of XML instances
• Neither instances nor schemas need exist as documents per se; they may exist as:
  – Byte stream sent between applications
  – Fields in a database record
  – Collection of XML “infoset” information items


What is an XML “infoset”?

• XML Information Set, 2nd edition, W3C Recommendation February 2004

• For use by other specs that need to refer to the information in a well-formed XML document [or PSVI = post-schema-validation infoset]

• Defines abstract data set generated by parser or by other means, conceptually tree of items each with several properties

(Some) Information Items

• Document (root of infoset) – properties include base URI, XML version, character encoding, etc.
• One root element - and its children
• Attributes of elements
• Namespace scoping for elements
• Processing instructions
• Unexpanded entities (processor may or may not expand all entities)

Example Instance Document (file po.xml)

<?xml version="1.0"?>
<purchaseOrder orderDate="2007-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      . . .
    </item>
  </items>
</purchaseOrder>

Where is the Schema?

• The instance document may reference a schema explicitly, or a processor may obtain a schema separately without reference from the instance
• Schema defines elements and attributes, and their complex and simple types
• Determines the appearance of elements and their content in instance documents

Example Schema (file po.xsd)

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation> . . . </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType"> . . . </xsd:complexType>
</xsd:schema>

• The schema consists of a schema element and various subelements, e.g., element, complexType
• The prefix xsd: associates names with the XML Schema namespace specified in the xmlns:xsd declaration
• Same prefix, and hence same association, also appears on names of built-in types, e.g., xsd:string
• Identifies elements and simple types as belonging to XML Schema language vocabulary rather than vocabulary of schema author

• An annotation element may appear at the beginning of most schema constructions
• It may contain two kinds of subelements
  – documentation: human-readable material
  – appinfo: for tools and applications

Complex Type Definitions

<xsd:complexType name="USAddress">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>

• New complex types are defined using the complexType element; it contains element declarations, attribute declarations and element references
• This example says elements of type USAddress must have
  – 5 subelements that must be called name, street, city, state and zip (in this order), each having the corresponding type declared above
  – 1 attribute called country may appear with the element; NMTOKEN represents an atomic indivisible value
• All element declarations within USAddress involve simple types

• An attribute may be specified as fixed or default.
• Default attribute values apply when attributes are missing.
• For fixed attributes, if a value appears, it must equal the declared fixed value.
• The schema processor will provide the value for missing attributes.
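The fixed/default rules can be sketched as a small function over a parsed element. This is only an illustration of the behavior described above, not a schema processor; the helper name `apply_attribute_rule` and the default value "any" for a hypothetical `shipBy` attribute are assumptions.

```python
import xml.etree.ElementTree as ET

def apply_attribute_rule(element, name, default=None, fixed=None):
    """Mimic default insertion for one attribute (illustrative helper)."""
    present = element.get(name)
    if fixed is not None:
        # A fixed attribute may be omitted, but if present it must match.
        if present is not None and present != fixed:
            raise ValueError(f"{name} must be {fixed!r}, got {present!r}")
        element.set(name, fixed)
    elif default is not None and present is None:
        # A default only applies when the attribute is missing.
        element.set(name, default)
    return element

ship_to = apply_attribute_rule(ET.fromstring("<shipTo/>"), "country", fixed="US")
item = apply_attribute_rule(ET.fromstring("<item/>"), "shipBy", default="any")
kept = apply_attribute_rule(ET.fromstring('<item shipBy="air"/>'), "shipBy", default="any")
print(ship_to.get("country"), item.get("shipBy"), kept.get("shipBy"))
```

Passing an element such as `<shipTo country="CA"/>` with `fixed="US"` raises an error, which corresponds to the instance being invalid.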

Complex Type Definitions

<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>

• A declaration may reference an existing element, e.g., comment; the value of the ref attribute must reference a global element (i.e., declared under schema)
• Every element of type PurchaseOrderType must consist of subelements shipTo and billTo (each containing the five subelements declared as part of USAddress), an optional comment, and items; it may have one attribute called orderDate

• Occurrence constraints may specify minOccurs and/or maxOccurs (each defaults to 1)

• Attributes may appear once or not at all (the default), but no more than once
• use may be specified as optional, required, or prohibited

Simple Built-in Types

• string, normalizedString, token
• byte, unsignedByte
• integer, positiveInteger, etc.
• long, short, etc.
• decimal, float, double
• boolean
• time, dateTime, duration, date, etc.
• anyURI
• etc.

The following types should only be used in attributes (to retain compatibility with XML 1.0 DTDs):

• ID
• IDREF, IDREFS
• ENTITY, ENTITIES
• NMTOKEN, NMTOKENS

Simple Derived Types

• The simpleType element is used to define and name a new simple type
• The restriction element indicates the base type and identifies the “facets” that constrain the range of values (here minInclusive and maxInclusive)

<xsd:simpleType name="myInteger">
  <xsd:restriction base="xsd:integer">
    <xsd:minInclusive value="10000"/>
    <xsd:maxInclusive value="99999"/>
  </xsd:restriction>
</xsd:simpleType>

Simple Derived Types (pattern facet)

<!-- Stock Keeping Unit, a code for identifying products -->
<xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
</xsd:simpleType>

• Constrain the values of SKU using the pattern facet in conjunction with the regular expression "\d{3}-[A-Z]{2}" (3 digits followed by a hyphen followed by 2 upper-case ASCII letters)
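The same pattern can be tried outside a schema processor. This sketch uses Python's `re` module; since an XML Schema pattern must match the whole value, `fullmatch` is used here to mimic that. The helper `is_sku` is a made-up name, and Python's regex dialect is close to, but not identical with, the XML Schema one.

```python
import re

# Same expression as the pattern facet of the SKU type.
SKU = re.compile(r"\d{3}-[A-Z]{2}")

def is_sku(value):
    """True if the whole value matches the SKU pattern."""
    return SKU.fullmatch(value) is not None

for candidate in ["872-AA", "926-AA", "872-aa", "87-AA", "872-AAA"]:
    print(candidate, is_sku(candidate))
```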

Simple Derived Types (enumeration facet)

• The enumeration facet limits a simple type to a set of distinct values
• Enables a better definition of USAddress type

<xsd:simpleType name="USState">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="AK"/>
    <xsd:enumeration value="AL"/>
    <xsd:enumeration value="AR"/>
    <!-- and so on ... -->
  </xsd:restriction>
</xsd:simpleType>

<xsd:complexType name="USAddress">
  . . .
  <xsd:element name="state" type="USState"/>
  . . .
</xsd:complexType>

Anonymous Type Definitions

<xsd:complexType name="Items">
  <xsd:sequence>
    <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="productName" type="xsd:string"/>
          <xsd:element name="quantity">
            <xsd:simpleType>
              <xsd:restriction base="xsd:positiveInteger">
                <xsd:maxExclusive value="100"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:element>
          <xsd:element name="USPrice" type="xsd:decimal"/>
          <xsd:element ref="comment" minOccurs="0"/>
          <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
        </xsd:sequence>
        <xsd:attribute name="partNum" type="SKU" use="required"/>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
</xsd:complexType>

Recap Example Schema (file po.xsd)

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation> . . . </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>

  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>

  <xsd:complexType name="USAddress"> . . . </xsd:complexType>

  <xsd:complexType name="Items"> . . . </xsd:complexType>
</xsd:schema>

XML Schema Data Types

• Complex types
• Built-in simple types
• Derived simple types
• Also derived complex types, lists and unions of simple types

Define structure – what about the content?

Element Content: Simple content

• Declare an element that has an attribute and contains a simple value

<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>

<internationalPrice currency="EUR">423.46</internationalPrice>

Element Content: Empty content

• Declare an element with attributes only - no content at all

<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:attribute name="currency" type="xsd:string"/>
    <xsd:attribute name="value" type="xsd:decimal"/>
  </xsd:complexType>
</xsd:element>

<internationalPrice currency="EUR" value="423.46"/>

Element Content: Entire element omitted

• The absence of an element does not carry any particular meaning; it could be
  – Information unknown
  – Information not applicable
  – I just forgot to enter the information
• Absence does/should not imply some value like zero, empty string, empty list, etc.
• Database systems faced with similar problems have introduced “null” values
• XML does not provide a null value representation that actually appears in element content; instead, there is an attribute to indicate content is nil

<xsd:element name="shipDate" type="shipDateType" nillable="true"/>

<shipDate xsi:nil="true"></shipDate>
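A small sketch of how an application could tell apart an element marked nil, an element that is merely empty, and an element that is absent altogether. It assumes `xml.etree.ElementTree`; the wrapper element `item` and the helper `is_nil` are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace of the xsi:nil attribute.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

def is_nil(element):
    """True if the element carries xsi:nil with a true boolean value."""
    return element.get(f"{{{XSI}}}nil") in ("true", "1")

doc = f"""
<item xmlns:xsi="{XSI}">
  <shipDate xsi:nil="true"></shipDate>
  <comment></comment>
</item>
"""
item = ET.fromstring(doc)

print(is_nil(item.find("shipDate")))   # explicitly nil
print(is_nil(item.find("comment")))    # present but empty, not nil
print(item.find("productName"))        # absent: no element at all
```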

Element Content: Mixed content

• Text appears between the elements salutation, quantity, productName, and shipDate (all children of letterBody)
• To allow this, the mixed attribute of the parent’s complexType must be set to true

<letterBody>
  <salutation>Dear Mr.<name>Robert Smith</name>.</salutation>
  Your order of <quantity>1</quantity> <productName>BabyMonitor</productName>
  shipped from our warehouse on <shipDate>1999-05-21</shipDate>. ....
</letterBody>
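To see where the interleaved text ends up after parsing, the sketch below loads a condensed copy of the letter with `xml.etree.ElementTree`, which stores text before the first child in `.text` and text following an element in that element's `.tail`. The whitespace in the sample string is chosen for the example rather than copied from the slide.

```python
import xml.etree.ElementTree as ET

doc = (
    "<letterBody><salutation>Dear Mr.<name>Robert Smith</name>.</salutation>"
    "Your order of <quantity>1</quantity> <productName>BabyMonitor</productName> "
    "shipped from our warehouse on <shipDate>1999-05-21</shipDate>.</letterBody>"
)
body = ET.fromstring(doc)

salutation = body.find("salutation")
quantity = body.find("quantity")

print(repr(salutation.text))   # text inside salutation, before <name>
print(repr(salutation.tail))   # text between </salutation> and <quantity>
print(repr(quantity.tail))     # text between </quantity> and <productName>

# itertext() walks the mixed content in document order.
letter = "".join(body.itertext())
print(letter)
```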

<xsd:element name="letterBody">
  <xsd:complexType mixed="true">
    <xsd:sequence>
      <xsd:element name="salutation">
        <xsd:complexType mixed="true">
          <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="quantity" type="xsd:positiveInteger"/>
      <xsd:element name="productName" type="xsd:string"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
      <!-- etc. -->
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

• The order and number of child elements appearing in an instance must agree with order/number of child elements specified in the content model

Element Content: anyType

• The anyType type does not constrain its content in any way

<xsd:element name="anything" type="xsd:anyType"/>

• When no type is defined, anyType is the default, so the declaration above could be written as

<xsd:element name="anything"/>

Grouping Content Elements – group & sequence

• group – groups elements so that they can be used as a unit to build up types
• sequence grouping (default) – elements in instance doc must appear in the listed order

<xsd:group name="shipAndBill">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
  </xsd:sequence>
</xsd:group>

Content Groups - choice

• choice grouping – only one of its children appears in an instance (here, either the shipAndBill group or a singleUSAddress element)

<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:choice>
      <xsd:group ref="shipAndBill"/>
      <xsd:element name="singleUSAddress" type="USAddress"/>
    </xsd:choice>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>

Content Groups - all

• all grouping – elements may appear in any order, each element appears zero or one times
• An all group must appear as the sole child at the top of a content model

<xsd:complexType name="PurchaseOrderType">
  <xsd:all>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:all>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>

Attribute Grouping

• We can create a named attribute group containing all the desired attributes and reference this group by name in an element

<xsd:element name="Item">
  <xsd:complexType>
    . . .
    <xsd:attribute name="partNum" type="SKU" use="required"/>
    <xsd:attribute name="weightKg" type="xsd:decimal"/>
    <xsd:attribute name="shipBy">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="air"/>
          <xsd:enumeration value="land"/>
          <xsd:enumeration value="any"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>

Attribute Groups

<xsd:element name="Item">
  <xsd:complexType>
    . . .
    <xsd:attributeGroup ref="ItemDelivery"/>
  </xsd:complexType>
</xsd:element>

<xsd:attributeGroup name="ItemDelivery">
  <xsd:attribute name="partNum" type="SKU" use="required"/>
  <xsd:attribute name="weightKg" type="xsd:decimal"/>
  <xsd:attribute name="shipBy">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="air"/>
        <xsd:enumeration value="land"/>
        <xsd:enumeration value="any"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:attributeGroup>

Target Namespaces

• Tired of repeating the prefix xsd: ?
• We could make the XMLSchema namespace the default namespace (so no more xsd prefixes) but then we would have to prefix the locally defined types and locally declared elements and attributes
• The solution is Target Namespaces
• Target namespaces enable distinguishing between definitions and declarations from different vocabularies

Target Namespace Example

<schema targetNamespace="http://www.example.com/PO"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO">

  <element name="purchaseOrder" type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>

  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
      <!-- etc. -->
    </sequence>
  </complexType>

  <complexType name="USAddress">
    <sequence>
      <element name="name" type="string"/>
      <!-- etc. -->
    </sequence>
  </complexType>
</schema>

Undeclared Target Namespaces

• What is the target namespace when a schema does not declare one?
  – All its definitions and declarations are referenced without qualification
  – They can only validate unqualified names in instance documents

• What is the target namespace when an instance document does not declare one?
  – All pre-XMLSchema XML 1.0 documents are like this
  – To validate such instance documents, the validation processor must be provided with a schema with no target namespace

Other XML Schema Issues

• A schema can be distributed across multiple documents, one of which is topmost and the rest “included”
• Types can be “imported” from other schemas
• Abstract types allow a form of inheritance [beyond derived types] with substitution groups
• Keys (as in relational databases)
• …

Page 54: 12 February 2008Kaiser: COMS E61251 COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2008.

Drawbacks of XML Schemas

• Another vocabulary to learn
• Verbose (like XML itself)
• Many constraints cannot be expressed (without adding a separate stylesheet or code)

<Demo xmlns="http://www.demo.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.demo.org demo.xsd">
  <A>10</A>
  <B>20</B>
</Demo>

• Can constrain: the Demo element contains a sequence of elements A followed by B; the A element contains an integer; the B element contains an integer
• Can’t constrain: A > B
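
A cross-element constraint such as A > B can be checked in separate code once the document has passed schema validation. The following is a minimal Python sketch of that idea, assuming the standard-library xml.etree.ElementTree parser; the function name a_greater_than_b and the constant names are hypothetical and are not part of any schema tooling.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the default namespace used by the Demo document.
DEMO_NS = {"d": "http://www.demo.org"}

DEMO = """<Demo xmlns="http://www.demo.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.demo.org demo.xsd">
  <A>10</A>
  <B>20</B>
</Demo>"""


def a_greater_than_b(xml_text: str) -> bool:
    """Return True when the integer in A is greater than the integer in B."""
    root = ET.fromstring(xml_text)
    a = int(root.findtext("d:A", namespaces=DEMO_NS))
    b = int(root.findtext("d:B", namespaces=DEMO_NS))
    return a > b


# The Demo instance is structurally valid but violates A > B.
violates = not a_greater_than_b(DEMO)
```

Here the check runs as a second pass: the schema handles structure and types, and the code handles the relationship between the two values.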


Today’s Topics:

• Document Structure Definition
  – Document Type Definitions (DTDs)
  – XML Schemas (XSD)

• Querying XML Documents
  – NOT the same as Web search engines!
  – XPath
  – XQuery


Why not use SQL?

• Table rows vs. XML elements
• Homogeneous vs. heterogeneous - two elements of the same type may have different structure (due to minOccurs, maxOccurs, choice, etc.)
• Flat vs. multi-nested
• Unordered sets/tuples vs. ordered elements
• “Dense” vs. “Sparse” - not all potential subelements are present or have values


XML Data Model

• Basically a sequence, an ordered list of zero or more items
• No sequences of sequences
• An item is either a node or an atomic value
• An atomic value – a built-in data type or a simple type derived by restriction
• A node is one of seven kinds: element, attribute, text, document, comment, processing instruction, and namespace


Document Order

• Among all nodes in a hierarchy there is a total order, called document order, in which each node appears before its children
• Preorder traversal
• Informally, the document order corresponds to the order in which the first character of the XML representation of each node occurs in the XML representation of the document
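
Document order can be observed with a preorder walk of the tree. The sketch below assumes Python's standard-library xml.etree.ElementTree, whose Element.iter() visits elements depth first; the small A/B/C/D document is made up for illustration and covers element nodes only.

```python
import xml.etree.ElementTree as ET

# Illustrative document: B has a child D, and C follows B.
doc = ET.fromstring("<A><B><D/></B><C/></A>")

# iter() yields the element itself, then its descendants in preorder,
# so every node appears before its children.
order = [node.tag for node in doc.iter()]
print(order)
```

The order matches the position of each start tag in the serialized text: A, then B, then D (inside B), then C.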


XPath Overview

• Language that expresses simple queries on individual XML documents (or streams) for retrieving parts of the XML document

• Operates on the abstract, logical structure of an XML document, rather than its surface syntax

• Basic facilities for manipulating strings, arithmetic and boolean expressions

• Compact, non-XML syntax to facilitate use within URIs and XML attribute values


XPath Expressions

• Similar to filesystem addressing
• Consists of a series of steps, separated by “/” or “//”
• Each step is evaluated in the context of a particular node, called the context node
• The result of each step is a sequence of nodes, which serve in turn as context nodes for the following step

• The value of a path expression is the node sequence that results from the last step


XPath Example

• If the path starts with the slash “/”, then it represents an absolute path to the required element

/AAA/DDD/BBB
Select all elements BBB that are children of DDD that are children of the root element AAA

<AAA>
  <BBB/>
  <CCC/>
  <BBB/>
  <BBB/>
  <DDD>
    <BBB/>
  </DDD>
  <CCC/>
</AAA>


XPath Example

• If the path starts with double-slash “//”, then all elements in the document which fulfill the criteria are selected

//BBB
Select all elements BBB

<AAA>
  <BBB/>
  <CCC/>
  <BBB/>
  <DDD>
    <BBB/>
  </DDD>
  <CCC>
    <DDD>
      <BBB/>
      <BBB/>
    </DDD>
  </CCC>
</AAA>


XPath Example

//DDD/BBB
Select all elements BBB that are children of DDD

<AAA>
  <BBB/>
  <CCC/>
  <BBB/>
  <DDD>
    <BBB/>
  </DDD>
  <CCC>
    <DDD>
      <BBB/>
      <BBB/>
    </DDD>
  </CCC>
</AAA>
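
These two path expressions can be tried out in Python. This is a sketch only: the standard-library xml.etree.ElementTree supports a limited XPath subset and evaluates paths relative to an element, so //BBB is written here as .//BBB starting from the root element.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<AAA>
  <BBB/>
  <CCC/>
  <BBB/>
  <DDD>
    <BBB/>
  </DDD>
  <CCC>
    <DDD>
      <BBB/>
      <BBB/>
    </DDD>
  </CCC>
</AAA>
""")

# //BBB : every BBB element anywhere below the root
all_bbb = root.findall(".//BBB")

# //DDD/BBB : BBB elements whose parent is a DDD element
ddd_bbb = root.findall(".//DDD/BBB")

print(len(all_bbb), len(ddd_bbb))
```

The first query finds all five BBB elements; the second skips the two BBB children of AAA and keeps the three under DDD.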


XPath Example

• The star “*” selects all elements located by preceding path

/AAA/CCC/DDD/*
Select all elements enclosed by elements /AAA/CCC/DDD

<AAA>
  <XXX>
    <DDD>
      <BBB/>
      <BBB/>
      <EEE/>
      <FFF/>
    </DDD>
  </XXX>
  <CCC>
    <DDD>
      <BBB/>
      <BBB/>
      <EEE/>
      <FFF/>
    </DDD>
  </CCC>
  <CCC>
    <BBB>
      <BBB>
        <BBB/>
      </BBB>
    </BBB>
  </CCC>
</AAA>


XPath Example

/*/*/*/BBB
Select all elements BBB that have 3 ancestors

<AAA>
  <XXX>
    <DDD>
      <BBB/>
      <BBB/>
      <EEE/>
      <FFF/>
    </DDD>
  </XXX>
  <CCC>
    <DDD>
      <BBB/>
      <BBB/>
      <EEE/>
      <FFF/>
    </DDD>
  </CCC>
  <CCC>
    <BBB>
      <BBB>
        <BBB/>
      </BBB>
    </BBB>
  </CCC>
</AAA>


Moving Through the Node Hierarchy

• A kind of step in XPath, called an axis step, helps move through the node hierarchy in a particular direction, called an axis

• Forward axis – only contains the context node or nodes that are after the context node in document order
  – child, descendant, self, descendant-or-self, following, following-sibling, attribute, namespace

• Reverse axis – only contains the context node or nodes that are before the context node in document order
  – parent, ancestor, preceding, preceding-sibling, ancestor-or-self


XPath Examples

//CCC/descendant::*
Select all elements that have CCC among their ancestors

//DDD/parent::*
Select all parents of DDD elements

Both examples use the same document:

<AAA>
  <BBB>
    <DDD>
      <CCC>
        <DDD/>
        <EEE/>
      </CCC>
    </DDD>
  </BBB>
  <CCC>
    <DDD>
      <EEE>
        <DDD>
          <FFF/>
        </DDD>
      </EEE>
    </DDD>
  </CCC>
</AAA>
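
The descendant and parent axes can be approximated in Python. This is a sketch under assumptions: xml.etree.ElementTree has no axis syntax, so the descendant axis is emulated with Element.iter() and the parent axis with the ".." step that ElementTree's path subset provides.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<AAA>"
    "<BBB><DDD><CCC><DDD/><EEE/></CCC></DDD></BBB>"
    "<CCC><DDD><EEE><DDD><FFF/></DDD></EEE></DDD></CCC>"
    "</AAA>"
)

# //CCC/descendant::* : everything below each CCC, excluding the CCC itself
ccc_descendants = [d for c in root.iter("CCC") for d in c.iter() if d is not c]

# //DDD/parent::* : the distinct parents of all DDD elements
ddd_parents = root.findall(".//DDD/..")

print([d.tag for d in ccc_descendants])
print([p.tag for p in ddd_parents])
```

The list comprehension would yield duplicates if one CCC were nested inside another; that does not happen in this document, so no de-duplication is done here.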

    


XPath Predicates

• Expressions in square brackets “[ ]” can further specify an element
• Used to filter a sequence of values in a step
• A number in the brackets gives the ordinal position of the element in the selected set
• The function “last()” selects the last element in the selection
• Function “count()” counts the number of selected elements
• Many other functions and operators


XPath Examples

//BBB[position() mod 2 = 0]
Select even BBB elements

<AAA>
  <BBB/>
  <BBB/>
  <BBB/>
  <BBB/>
  <BBB/>
  <BBB/>
  <BBB/>
  <BBB/>
  <CCC/>
  <CCC/>
  <CCC/>
</AAA>

//*[count(BBB)=2]
Select elements that have two children BBB

<AAA>
  <CCC>
    <BBB/>
    <BBB/>
    <BBB/>
  </CCC>
  <DDD>
    <BBB/>
    <BBB/>
  </DDD>
  <EEE>
    <CCC/>
    <DDD/>
  </EEE>
</AAA>
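
The two predicates above can be emulated in plain Python. This is a sketch, assuming xml.etree.ElementTree, which does not implement position(), mod, or count(); the filtering is therefore done with list comprehensions rather than by the path engine.

```python
import xml.etree.ElementTree as ET

# Document for //BBB[position() mod 2 = 0]: eight BBB and three CCC children.
doc1 = ET.fromstring("<AAA>" + "<BBB/>" * 8 + "<CCC/>" * 3 + "</AAA>")

# position() counts BBB children within each parent, starting at 1.
even_bbb = [
    b
    for parent in doc1.iter()
    for pos, b in enumerate(parent.findall("BBB"), start=1)
    if pos % 2 == 0
]

# Document for //*[count(BBB)=2]
doc2 = ET.fromstring(
    "<AAA>"
    "<CCC><BBB/><BBB/><BBB/></CCC>"
    "<DDD><BBB/><BBB/></DDD>"
    "<EEE><CCC/><DDD/></EEE>"
    "</AAA>"
)

# Keep every element that has exactly two BBB children.
two_bbb = [e.tag for e in doc2.iter() if len(e.findall("BBB")) == 2]

print(len(even_bbb), two_bbb)
```

Only the first DDD qualifies in the second document: CCC has three BBB children and the remaining elements have none.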


XPath Predicates

• Attribute names are specified by the at-sign “@” prefix
• Non-prefixed names are the names of element nodes


XPath Examples

//BBB[@id]
Select BBB elements that have attribute id

//BBB[not(@*)]
Select BBB elements without any attribute

Both examples use the same document:

<AAA>
  <BBB id="b1"/>
  <CCC id="b2"/>
  <BBB name="bbb"/>
  <BBB/>
</AAA>
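
Attribute predicates can be sketched the same way in Python. xml.etree.ElementTree understands the [@id] form directly; it has no not() function, so the second query is emulated here by testing for an empty attribute dictionary.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<AAA><BBB id="b1"/><CCC id="b2"/><BBB name="bbb"/><BBB/></AAA>'
)

# //BBB[@id] : BBB elements that carry an id attribute
with_id = root.findall(".//BBB[@id]")

# //BBB[not(@*)] : BBB elements with no attributes at all
without_attrs = [b for b in root.iter("BBB") if not b.attrib]

print([b.get("id") for b in with_id], len(without_attrs))
```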


XPath Predicates

• Function “name()” returns name of the element
• The “starts-with” function returns true if the first argument string starts with (prefixed by) the second argument string

• The “contains” function returns true if the first string contains the second string

• The “string-length” function returns the number of characters in the string


XPath Examples

//*[name()='BBB']
Select all elements with name BBB, equivalent to //BBB

//*[contains(name(),'C')]
Select all elements whose name contains the letter C

Both examples use the same document:

<AAA>
  <BCC>
    <BBB/>
    <BBB/>
    <BBB/>
  </BCC>
  <DDB>
    <BBB/>
    <BBB/>
  </DDB>
  <BEC>
    <CCC/>
    <DBD/>
  </BEC>
</AAA>


More XPath

• Several paths can be combined with the union (“|”), intersect, except separators

• Several values can be concatenated to form a sequence with the comma “,” and “to” operators

• A variable is a name that begins with a dollar sign “$”, may be bound to a value and used in an expression

• Originated ~1998 as part of XSLT, XPath 1.0 W3C Recommendation November 1999, XPath 2.0 W3C Recommendation January 2007


XPath Limitations

• Path expressions are very powerful but there are some drawbacks

• XPath can only select existing nodes
• Cannot construct new elements and attributes and specify contents and relationships

• XPath operates on a single document


XQuery Design Goals

• Express at least the queries possible in known query languages like SQL and various OO query languages
• Query the many kinds of data XML contains
• Implementable in many environments
  – Databases, XML programming environments


XQuery Overview

• Operates on the XPath data model
• Can query over multiple documents (e.g., a database of XML documents)
• Sequence of (list of ordered) trees
• A document is a list of size 1
• Functional query language – made up of expressions that return values and do not have side effects


Using XPath

doc("mp3.xml")//chapter//figure[caption = "ipod nano"]

In any chapter of the document mp3.xml find figures with caption “ipod nano”

[Figure: tree of mp3.xml with book, part, chapter, appendix, section, paragraph, figure, and caption nodes; two figures carry the captions “ipod nano” and “ipod classic”.]


Using XQuery

<result> { doc("mp3.xml")//chapter//figure[caption = "ipod nano"] } </result>

In a chapter of the document mp3.xml find the figures with caption “ipod nano” and place them into an element called result

[Figure: result tree. The new result element, which has its own node identity, contains the matching figure with caption “ipod nano”.]


Element Construction

• XQuery provides for the construction of new elements
• An element constructor looks exactly like an XML element
• Using XQuery expressions, we may have computed values

<result>
  <figure>
    <caption>ipod nano</caption>
  </figure>
</result>


Bibliography Example Data Set

<bib>
  <book>
    <author> Aho </author>
    <author> Lam </author>
    <author> Sethi </author>
    <author> Ullman </author>
    <title> Compilers </title>
    <publisher> Addison Wesley </publisher>
    <year> 2006 </year>
  </book>
  <book>
    <author> Rowling </author>
    <title> Harry Potter 6 </title>
    <publisher> Scholastic </publisher>
    <year> 2005 </year>
  </book>
  <book>
    <author> Patton </author>
    <title> Software Testing </title>
    <publisher> SAMS </publisher>
    <year> 2005 </year>
  </book>
</bib>


Reviews Example Data Set

<reviews>
  <review>
    <title> Compilers </title>
    <comment> It’s the best </comment>
    <comment> A definitive textbook </comment>
  </review>
  <review>
    <title> Harry Potter 6 </title>
    <comment> Spoiler: Dumbledore dies </comment>
    <comment> When will the next book come out? </comment>
  </review>
  …
</reviews>


FOR-WHERE-RETURN

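for $b in doc("bib.xml")//book
where $b/year/text() = "2005"
return $b/title

List the titles of books published in 2005

(XQuery keywords are case-sensitive and written in lower case; the slide titles show them in capitals for emphasis.)

[Figure: tree of bib.xml. The bib root has three book children, each with title, publisher, and year nodes: Addison Wesley 2006, Scholastic 2005, SAMS 2005.]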
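
For comparison, the same for/where/return query can be sketched in Python. This is an illustration only, assuming the standard-library xml.etree.ElementTree and an inline copy of the bibliography data; the text is stripped before comparison because the sample data pads element content with spaces.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <book><title> Compilers </title><publisher> Addison Wesley </publisher><year> 2006 </year></book>
  <book><title> Harry Potter 6 </title><publisher> Scholastic </publisher><year> 2005 </year></book>
  <book><title> Software Testing </title><publisher> SAMS </publisher><year> 2005 </year></book>
</bib>"""

bib = ET.fromstring(BIB)

titles = [
    b.findtext("title").strip()            # return $b/title
    for b in bib.iter("book")              # for $b in ...//book
    if b.findtext("year").strip() == "2005"  # where $b/year = "2005"
]

print(titles)
```

Each clause of the query maps onto one part of the comprehension: the iteration binds the variable, the condition filters, and the leading expression builds the result.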


Tuples of variable bindings

[Figure: evaluation pipeline over the bib.xml tree. FOR/LET produces ordered lists of tuples of variable bindings ($b bound to each book); WHERE keeps the tuples that satisfy the conditions; RETURN produces a list of trees (here, title elements).]


RETURN

for $b in doc("bib.xml")//book
where $b/year/text() = "2005"
return $b/author

Return the list of authors who published in 2005


WHERE

for $b in doc("bib.xml")//book
where $b/publisher/text() = "Addison Wesley" and $b/year/text() = "2006"
return $b/title

List the titles of books published by “Addison Wesley” in 2006


Joins

for $b in doc("bib.xml")//book,
    $r in doc("review.xml")//review
where $b/title/text() = $r/title/text()
return
  <book_with_review>
    {$b/@*}
    {$b/*}
    {$r/comment}
  </book_with_review>

For every book with a matching review output a book_with_review that contains all the attributes and subelements of book and the comment subelements of review


Join Example Result

<book_with_review>
  <author> Aho </author>
  <author> Lam </author>
  <author> Sethi </author>
  <author> Ullman </author>
  <title> Compilers </title>
  <publisher> Addison Wesley </publisher>
  <year> 2006 </year>
  <comment> It’s the best </comment>
  <comment> A definitive textbook </comment>
</book_with_review>
<book_with_review>
  <author> Rowling </author>
  …
</book_with_review>
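
The join can be sketched as a nested loop in Python. This is not how an XQuery engine would necessarily evaluate it; it is a minimal illustration assuming xml.etree.ElementTree, a trimmed copy of the two data sets, and a hypothetical helper named join_books_with_reviews.

```python
import copy
import xml.etree.ElementTree as ET

BIB = """<bib>
  <book><author>Aho</author><author>Ullman</author><title>Compilers</title><year>2006</year></book>
  <book><author>Rowling</author><title>Harry Potter 6</title><year>2005</year></book>
  <book><author>Patton</author><title>Software Testing</title><year>2005</year></book>
</bib>"""

REVIEWS = """<reviews>
  <review><title>Compilers</title><comment>It's the best</comment><comment>A definitive textbook</comment></review>
  <review><title>Harry Potter 6</title><comment>When will the next book come out?</comment></review>
</reviews>"""


def join_books_with_reviews(bib, reviews):
    """Build one book_with_review element per (book, review) pair with equal titles."""
    results = []
    for b in bib.iter("book"):
        for r in reviews.iter("review"):
            if b.findtext("title").strip() == r.findtext("title").strip():
                out = ET.Element("book_with_review", dict(b.attrib))  # {$b/@*}
                out.extend([copy.deepcopy(child) for child in b])     # {$b/*}
                out.extend([copy.deepcopy(c) for c in r.findall("comment")])  # {$r/comment}
                results.append(out)
    return results


results = join_books_with_reviews(ET.fromstring(BIB), ET.fromstring(REVIEWS))
print(len(results))
```

Copies are used because the constructed elements are new nodes, separate from the input trees. "Software Testing" has no review in this sample, so it does not appear in the output.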


Nested queries

for $a in distinct-values(doc("bib.xml")//author/text())
return
  <author>
    <name> {$a} </name>
    { for $b in doc("bib.xml")//book
      where $a = $b/author/text()
      return $b/title }
  </author>

Invert the structure of the input document so that there is a list of author elements containing the name of the author and the list of books he/she wrote


Conditionals

for $b in doc("bib.xml")//book
return
  <short>
    {$b/title}
    <author>
      { if (count($b/author) < 3)
        then $b/author
        else ($b/author[1], <author>and others</author>) }
    </author>
  </short>

Leave alone books with fewer than 3 authors

Otherwise shorten the author list


Existential Quantification

for $b in doc("bib.xml")//book
where some $author in $b/author satisfies $author/text() = "Aho"
return $b

Return books where at least one of the authors is “Aho”


Universal Quantification

for $b in doc("bib.xml")//book
where every $author in $b/author satisfies $author/text() = "Aho"
return $b

Return books where all authors are “Aho”
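
The two quantifiers correspond closely to Python's any() and all(). The sketch below assumes xml.etree.ElementTree and a trimmed copy of the bibliography; the single-author book is a hypothetical entry added only so that the universal query returns something.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <book><author>Aho</author><author>Lam</author><author>Sethi</author><author>Ullman</author><title>Compilers</title></book>
  <book><author>Rowling</author><title>Harry Potter 6</title></book>
  <book><author>Aho</author><title>Hypothetical Solo Book</title></book>
</bib>"""

bib = ET.fromstring(BIB)


def authors(book):
    """Return the author names of one book element."""
    return [a.text.strip() for a in book.findall("author")]


# some $author in $b/author satisfies $author = "Aho"
some_aho = [
    b.findtext("title")
    for b in bib.iter("book")
    if any(a == "Aho" for a in authors(b))
]

# every $author in $b/author satisfies $author = "Aho"
every_aho = [
    b.findtext("title")
    for b in bib.iter("book")
    if all(a == "Aho" for a in authors(b))
]

print(some_aho, every_aho)
```

As with every in XQuery, all() is true for an empty sequence, so a book with no author elements would also pass the universal test.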


More XQuery Notation

• Also Order by
• Can define and use functions
• Supported by all major database engines
• XQuery has been around in draft form since ~2001, but only became a W3C Recommendation in January 2007


Summary

• XPath expresses simple queries on individual XML documents (or streams)
  – Optimization analogous to string matching
• XQuery expresses sophisticated SQL-like queries, joins and views on databases, messages, etc. whether stored as XML or not
  – Optimization analogous to database query/join
• Both orthogonal to XML Schema (or DTD)
  – But it’s hard to write queries without knowing what syntax will occur in the documents…


Second Assignment: Revised Paper Proposal

• Due Monday February 18th at 5pm
• Maximum three pages (not including figures, if any), plus references (required)
• Plan and outline your paper (which will be ~15 pages)
• See http://york.cs.columbia.edu/classes/cs6125/revised_paper_proposal.htm


Revised Paper Proposal

• Each full paper should have title, author, abstract (~200 words), introduction, body sections, conclusions, bibliography (cited references)

• The point of this assignment is to determine what will be in those sections

• Assume a reader who is taking the class but may not know anything at all about your specific topic


Revised Paper Proposal: Introduction and Conclusion

• What is your topic?
• What is the problem being addressed?
• What is the solution, or design space of solutions, proposed or actualized?
• What is your argument?
• What is your point of view?
• What is the opposing point of view?


Revised Paper Proposal: Body Sections

• What sections? (usually 3-5)
• What subsections? (perhaps down to subsubsections)
• Motivate your literature reading to fill those sections
• Full paper will be due Friday March 14th


A Note about Citations and Bibliographic References

• References should be cited in the text like this “Pogue said blah blah [1]” or this “[Pog07] describes mumble”
• Bibliography entries should appear something like this:
  [1] David Pogue, Behind the Scenes of “iPhone: The Musical”, The New York Times, online edition, July 12, 2007. <http://pogue.blogs.nytimes.com/2007/07/12/> accessed February 8, 2008.


Another Note about Bibliographic References

• Bibliographic references should be as complete as possible (but official MLA, APA, etc. format is not required)

• There is a variety of free software available to help manage reference lists
  – http://www.easybib.com/, http://www.bibme.org/ and others used online (web search for “free bibliography”)
  – Downloads from http://www.columbia.edu/acis/software/libraries.html (requires your uni/password)


Second Assignment: Logistics

• Submit by posting in Revised Paper Proposal folder on CourseWorks

• Must be in a format I can read, which means pdf 8 or earlier, word 2003 or earlier, powerpoint 2003 or earlier, html, plain ascii text

• With all figures embedded in the file or separately viewable in Firefox 2 or IE 7 on Windows XP

• Bundled archives must be openable with WinZip 11


Heads Up on Project

• Preliminary Proposal due Monday March 10th (note this is before the full paper)
• Optionally work in teams (see http://york.cs.columbia.edu/classes/cs6125/team_advice)

• Build a new system or extend an existing system – submit code, demo system

• OR evaluate/compare one or more existing system(s) – submit procedures and findings, show system(s)

• You may "continue" your paper topic towards the project, or do something entirely different


Heads Up on Presentation

• Individual ~10-minute talk in class during one of the last few class sessions
• No proposal, just do it
• May be based on paper, project, or some other topic (in the case of team members all presenting on the same project, please coordinate to avoid redundancy and discuss your plans with the instructor in advance)


Reminders

• Class participation is important! (10% corresponds to a whole letter grade)

• Revised paper proposal due Monday February 18th by 5pm

• Preliminary project proposal due March 10th

• Paper must be individual, projects may optionally be done in teams
