12 February 2008 Kaiser: COMS E6125 1
COMS E6125 Web-enHanced Information Management (WHIM)
Prof. Gail Kaiser
Spring 2008
Today’s Topics:
• Document Structure Definition
  – Document Type Definition (DTD)
  – XML Schema (XSD)
• Querying XML Documents
  – NOT the same as Web search engines!
  – XPath
  – XQuery
Pure XML - Instance Model
<A>
  <B>foo</B>
  <C>bar</C>
  <C>psl</C>
</A>
[Figure: the instance drawn as a tree (root A with children B "foo", C "bar", C "psl") and as nested boxes; children are ordered]
• XML 1.0 implicit data model:
  – nested containers ("boxes within boxes")
  – labeled ordered trees (= semistructured data model)
  – Relational or object-oriented data is easy to encode
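The labeled ordered tree can be inspected with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module (the choice of Python is an assumption made for illustration, not part of the lecture); it lists the children of A in the order they appear.

```python
import xml.etree.ElementTree as ET

doc = "<A><B>foo</B><C>bar</C><C>psl</C></A>"
root = ET.fromstring(doc)

# Iterating over an element yields its children in document order,
# so the two C elements remain distinct and ordered.
children = [(child.tag, child.text) for child in root]
print(root.tag, children)
```

Because the tree is ordered, swapping the two C elements in the source would produce a different instance.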
XML Namespaces
• Allow mixing of different tag vocabularies
• Only identify the vocabulary (lexicon)
• Additional mechanisms are required for the structure and meaning (or at least metadata) of tags
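As a small illustration of mixing vocabularies, the sketch below parses a document that uses two `comment` elements from different namespaces. The namespace URIs and element names are hypothetical, chosen only for this example; `xml.etree.ElementTree` reports each tag as `{namespace-URI}local-name`.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used only to tell the two vocabularies apart.
doc = """
<order xmlns:po="http://www.example.com/PO"
       xmlns:memo="http://www.example.com/memo">
  <po:comment>Hurry</po:comment>
  <memo:comment>Call first</memo:comment>
</order>
"""
root = ET.fromstring(doc)

# The parser expands each prefix to its URI, so the two names differ
# even though the local name "comment" is the same.
tags = [child.tag for child in root]
print(tags)
```

The namespace only disambiguates the names; nothing here says what either `comment` may contain.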
From Documents to Data
• We want to be able to
  – Extract the element structure of a document
  – Re-use this structure for other similar documents
  – Share structure and metadata with others
  – Automate processing of this structure and metadata
<invoice>
  <orderDate>2007-12-01</orderDate>
  <shipDate>2007-12-26</shipDate>
  <billingAddress>
    <name>Gail Kaiser</name>
    <street>500 West 120th Street</street>
    <city>New York</city>
    <state>NY</state>
    <zip>10027</zip>
  </billingAddress>
  <voice>212-555-1234</voice>
  <fax>212-555-4321</fax>
</invoice>
<memo importance='high' date='2008-02-11'>
  <from>Gail Kaiser</from>
  <to>Swapneel Sheth</to>
  <subject>whim tomorrow</subject>
  <body>Remember to pick up the sign-in sheet after class tomorrow</body>
</memo>
Adding Structure and Semantics
• A Document Structure Description (DSD) defines the syntax of XML documents for a particular application domain
• Defines the grammar for an XML-based markup language
Processing XML
• Non-validating parser:
  – checks that the XML document is syntactically well-formed, e.g., all open-tags have matching close-tags and are properly nested, an attribute appears at most once in an element, etc.
• Validating parser:
  – checks that the XML document is also valid with respect to a given DSD (now usually XML Schema)
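The well-formedness checks can be tried with a non-validating parser. This is a minimal sketch using Python's standard `xml.etree.ElementTree` (the helper name `is_well_formed` and the sample strings are made up for illustration); it does not perform validation against a DSD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "nested": "<a><b>ok</b></a>",
    "mismatched": "<a><b>oops</a></b>",          # close-tags in the wrong order
    "duplicate attribute": '<a x="1" x="2"/>',   # attribute appears twice
}
results = {name: is_well_formed(text) for name, text in samples.items()}
print(results)
```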
Using DSD Validators
• A DSD processor can be useful both on the server side (when writing XML documents) and on the client side (when processing XML documents):
  – Checking validity (conformance) of XML documents
  – Performing default insertion (inserting missing fragments)
DSD Processing
Several Proposed DSDs
• XML Document Type Definitions (DTDs):
  – Define the structure of “allowed” documents (comparable to a database schema)
  – Non-XML syntax
• XML Schemas (XSDs)
  – Define structure and data types
  – Allow developers to build their own libraries of interchangeable data types
  – Written in an XML vocabulary
• Others (e.g., RELAX NG, Schematron)
Document Type Definitions
• A DTD is a grammar defining XML structure
  – An XML document specifies an associated DTD, plus the root element
  – The DTD specifies the children of the root element, their children, and so on
Example DTD
<!ELEMENT bib (book*)>
<!ELEMENT book (thesis | article)>
<!ELEMENT thesis (title, author, year, school, committeemember*)>
<!ATTLIST thesis
    date    CDATA #REQUIRED
    key     ID    #REQUIRED
    advisor CDATA #IMPLIED
    idref   IDREF #IMPLIED>
<!ELEMENT article (title, (author+ | editor+), publisher)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name)>
<!ATTLIST author id ID #REQUIRED>
. . .
DTD Interpretation
• CDATA: “Character Data”, a sequence of characters
• #PCDATA: “Parsed Character Data”, text and character entities (e.g., &amp; -> &, &eacute; -> é)
• ID: a value that is unique within the document
• IDREF: a reference to an ID
• #IMPLIED: the attribute is optional and no default value is provided
• ( ... ): specifies a group
• A | B: either A or B may occur, but not both
• A , B: A must occur before B
• A & B: A and B must both occur once, but may do so in any order (an SGML connector; not available in XML 1.0 DTDs)
• A?: A can occur zero or one times
• A*: A can occur zero or more times
• A+: A can occur one or more times
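A content model behaves much like a regular expression over the sequence of child element names. The sketch below is one way to illustrate that, assuming Python's `re` module and a made-up encoding in which each child name is followed by `;`; it checks child sequences against the `article` model `(title, (author+ | editor+), publisher)` from the example DTD. It is an analogy, not how a validating parser is implemented.

```python
import re

# Content model of <article>: (title, (author+ | editor+), publisher)
# Each child tag is written as "name;" so that tag names cannot run together.
ARTICLE_MODEL = re.compile(r"title;(?:(?:author;)+|(?:editor;)+)publisher;")

def matches_model(child_tags):
    """Return True if the ordered list of child tag names fits the model."""
    encoded = "".join(tag + ";" for tag in child_tags)
    return ARTICLE_MODEL.fullmatch(encoded) is not None

ok = matches_model(["title", "author", "author", "publisher"])
mixed = matches_model(["title", "author", "editor", "publisher"])  # violates "|"
wrong_order = matches_model(["author", "title", "publisher"])      # violates ","
print(ok, mixed, wrong_order)
```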
DTD Defines Special Significance for Attributes
• IDs – special attributes that are analogous to relational database keys (globally unique IDs for elements)
• IDREF – reference to an ID
• IDREFS – a list of IDREFs
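The key-like role of ID and IDREF can be sketched by resolving references by hand. The example below assumes Python's `xml.etree.ElementTree`, which does not read the DTD, so the attributes named `id` and `idref` are treated as ID and IDREF purely by convention; the small `bib` document is a cut-down illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<bib>
  <author id="author1"><name>John Smith</name></author>
  <article><author idref="author1"/><title>Paper1</title></article>
  <article><author idref="author1"/><title>Paper2</title></article>
</bib>
"""
root = ET.fromstring(doc)

# Index every element that carries an id attribute, like a primary-key index.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# Follow each idref from an article back to the author element it names.
papers = []
for article in root.findall("article"):
    ref = article.find("author").get("idref")
    papers.append((article.findtext("title"), by_id[ref].findtext("name")))
print(papers)
```

The lookup through `by_id` is what turns the tree into a graph: both articles point at the same author element.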
Instance Visualization as a Graph
<?xml version="1.0"?>
<!DOCTYPE bib SYSTEM "http://webserver/bib.dtd">
<bib>
<author id="author1"> <name>John Smith</name>
</author>
<article>
<author idref="author1" />
<title>Paper1</title>
</article>
<article>
<author idref="author1" />
<title>Paper2</title>
</article>
. . .
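Graph Data Model
[Figure: the bib instance drawn as a graph. Below the root are the ?xml declaration, the !DOCTYPE declaration, and the bib element; bib contains an author (id "author1", name "John Smith") and two articles (titles "Paper1" and "Paper2"), each with an author whose idref "author1" points back to the author element]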
Drawbacks of DTDs
• Not themselves XML - additional effort to build tools
• No support for data types - cannot do data validation
• No support for OO-like structures (e.g., inheritance)
• Horrible syntax
Several Proposed DSDs
• XML Document Type Definitions (DTDs):
  – Define the structure of “allowed” documents (comparable to a database schema)
  – Non-XML syntax
• XML Schemas (XSDs)
  – Define structure and data types
  – Allow developers to build their own libraries of interchangeable data types
  – Written in an XML vocabulary
• Others (e.g., RELAX NG, Schematron)
XML Schema Design Principles
1. More expressive than DTDs (which came from SGML, although modified slightly in XML 1.0)
2. Notation is itself an XML vocabulary
3. Self-describing
4. Usable by a wide variety of applications that employ XML
5. Straightforwardly usable on the Internet
6. Optimized for interoperability
7. Simple enough to be implemented with modest design and runtime resources
8. Coordinated with relevant W3C specs
Purpose of an XML Schema
• Defines a class of XML instances
• Neither instances nor schemas need exist as documents per se; they may exist as:
  – Byte stream sent between applications
  – Fields in a database record
  – Collection of XML “infoset” information items
What is an XML “infoset”?
• XML Information Set, 2nd edition, W3C Recommendation February 2004
• For use by other specs that need to refer to the information in a well-formed XML document [or PSVI = post schema validated infoset]
• Defines abstract data set generated by parser or by other means, conceptually tree of items each with several properties
(Some) Information Items
• Document (root of infoset) – properties include base URI, XML version, character encoding, etc.
• One root element - and its children
• Attributes of elements
• Namespace scoping for elements
• Processing instructions
• Unexpanded entities (processor may or may not expand all entities)
Example Instance Document (file po.xml)
<?xml version="1.0"?>
<purchaseOrder orderDate="2007-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA"> . . . </item>
  </items>
</purchaseOrder>
Where is the Schema?
• The instance document may reference a schema explicitly, or a processor may obtain a schema separately without reference from the instance
• Schema defines elements and attributes, and their complex and simple types
• Determines the appearance of elements and their content in instance documents
Example Schema (file po.xsd)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation> . . . </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType"> . . . </xsd:complexType>
</xsd:schema>
• The schema consists of a schema element and various subelements, e.g., element, complexType
• The prefix xsd: associates names with the XML Schema namespace specified in the xmlns:xsd declaration
• Same prefix, and hence same association, also appears on names of built-in types, e.g., xsd:string
• Identifies elements and simple types as belonging to XML Schema language vocabulary rather than vocabulary of schema author
Example Schema (file po.xsd)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation> . . . </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType"> . . . </xsd:complexType>
</xsd:schema>
• An annotation element may appear at the beginning of most schema constructions
• Contains two kinds of subelements
  – documentation: human-readable material
  – appinfo: for tools and applications
Complex Type Definitions
<xsd:complexType name="USAddress">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>
• New complex types are defined using the complexType element; it contains element declarations, attribute declarations and element references
• This example says elements of type USAddress must have
  – 5 subelements that must be called name, street, city, state and zip (in this order), each having the corresponding type declared above
  – 1 attribute called country, which may appear with the element; NMTOKEN represents an atomic indivisible value
• All element declarations within USAddress involve simple types
Complex Type Definitions
<xsd:complexType name="USAddress">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>
• An attribute may be specified as fixed or default.
• Default attribute values apply when attributes are missing.
• For fixed attributes, if a value appears, it must be the value declared as fixed.
• The schema processor will provide the value for missing attributes.
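The fixed/default rules can be mimicked in a few lines. This is only a sketch of the behaviour described above, written with Python's `xml.etree.ElementTree`; the helper `apply_attribute` is hypothetical and is not part of any schema processor's API.

```python
import xml.etree.ElementTree as ET

def apply_attribute(element, name, default=None, fixed=None):
    """Mimic how a schema processor treats a default or fixed attribute."""
    present = element.get(name)
    if fixed is not None:
        # A fixed attribute may be omitted, but if present it must match.
        if present is not None and present != fixed:
            raise ValueError(f"{name} must be {fixed!r}, found {present!r}")
        element.set(name, fixed)
    elif default is not None and present is None:
        # A default is supplied only when the attribute is missing.
        element.set(name, default)

ship_to = ET.fromstring("<shipTo><name>Alice Smith</name></shipTo>")
apply_attribute(ship_to, "country", fixed="US")

bad = ET.fromstring('<shipTo country="CA"/>')
try:
    apply_attribute(bad, "country", fixed="US")
    rejected = False
except ValueError:
    rejected = True
print(ship_to.get("country"), rejected)
```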
Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• A declaration may reference an existing element, e.g., comment; the value of the ref attribute must reference a global element (i.e., declared under schema)
• Every element of type PurchaseOrderType must consist of subelements shipTo and billTo, each containing the five subelements declared as part of USAddress, items and (optionally) comment; it may have one attribute called orderDate
Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• Occurrence constraints may specify minOccurs and/or maxOccurs
Complex Type Definitions
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
• Attributes may appear once or not at all (the default), but no more than once
• use may be specified as optional, required, or prohibited
Simple Built-in Types
• string, normalizedString, token
• byte, unsignedByte
• integer, positiveInteger, etc.
• long, short, etc.
• decimal, float, double
• boolean
• time, dateTime, duration, date, etc.
• anyURI
• etc.
The following types should only be used in attributes (to retain compatibility with XML 1.0 DTDs):
• ID
• IDREF, IDREFS
• ENTITY, ENTITIES
• NMTOKEN, NMTOKENS
Simple Derived Types
• The simpleType element is used to define and name a new simple type
• The restriction element indicates the base type and identifies the “facets” that constrain the range of values (here minInclusive and maxInclusive)
<xsd:simpleType name="myInteger">
  <xsd:restriction base="xsd:integer">
    <xsd:minInclusive value="10000"/>
    <xsd:maxInclusive value="99999"/>
  </xsd:restriction>
</xsd:simpleType>
Simple Derived Types (pattern facet)
<!-- Stock Keeping Unit, a code for identifying products -->
<xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
</xsd:simpleType>
• Constrain the values of SKU using the pattern facet in conjunction with the regular expression "\d{3}-[A-Z]{2}" (3 digits followed by a hyphen followed by 2 upper-case ASCII letters)
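The effect of the pattern facet can be approximated with an ordinary regular-expression check. The sketch below assumes Python's `re` module; the helper `is_valid_sku` and the test values are invented for illustration. XML Schema patterns must match the entire value, which corresponds to `fullmatch` here.

```python
import re

SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")

def is_valid_sku(value):
    # XML Schema patterns are implicitly anchored at both ends,
    # so the whole string has to match.
    return SKU_PATTERN.fullmatch(value) is not None

checks = {v: is_valid_sku(v) for v in ["872-AA", "926-aa", "87-AA", "x872-AA"]}
print(checks)
```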
Simple Derived Types (enumeration facet)
• The enumeration facet limits a simple type to a set of distinct values
• Enables a better definition of USAddress type
<xsd:simpleType name="USState">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="AK"/>
    <xsd:enumeration value="AL"/>
    <xsd:enumeration value="AR"/>
    <!-- and so on ... -->
  </xsd:restriction>
</xsd:simpleType>
<xsd:complexType name="USAddress">
  . . .
  <xsd:element name="state" type="USState"/>
  . . .
</xsd:complexType>
Anonymous Type Definitions
<xsd:complexType name="Items">
  <xsd:sequence>
    <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="productName" type="xsd:string"/>
          <xsd:element name="quantity">
            <xsd:simpleType>
              <xsd:restriction base="xsd:positiveInteger">
                <xsd:maxExclusive value="100"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:element>
          <xsd:element name="USPrice" type="xsd:decimal"/>
          <xsd:element ref="comment" minOccurs="0"/>
          <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
        </xsd:sequence>
        <xsd:attribute name="partNum" type="SKU" use="required"/>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
</xsd:complexType>
Recap Example Schema (file po.xsd)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation> . . . </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:complexType name="USAddress"> . . . </xsd:complexType>
  <xsd:complexType name="Items"> . . . </xsd:complexType>
</xsd:schema>
XML Schema Data Types
• Complex types
• Built-in simple types
• Derived simple types
• Also derived complex types, lists and unions of simple types
Define structure – what about the content?
Element Content: Simple content
• Declare an element that has an attribute and contains a simple value
<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>
<internationalPrice currency="EUR">423.46</internationalPrice>
Element Content: Empty content
• Declare an element with attributes only - no content at all
<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:attribute name="currency" type="xsd:string"/>
    <xsd:attribute name="value" type="xsd:decimal"/>
  </xsd:complexType>
</xsd:element>
<internationalPrice currency="EUR" value="423.46"/>
Element Content: Entire element omitted
• The absence of an element does not carry any particular meaning; it could be
  – Information unknown
  – Information not applicable
  – I just forgot to enter the information
• Absence does not, and should not, imply some value like zero, empty string, empty list, etc.
• Database systems faced with similar problems have introduced “null” values
• XML does not provide a null value representation that actually appears in element content; instead, there is an attribute to indicate that the content is nil
<xsd:element name="shipDate" type="shipDateType" nillable="true"/>
<shipDate xsi:nil="true"></shipDate>
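The difference between an absent element, a nil element, and an ordinary empty element can be made concrete in code. The following sketch uses Python's `xml.etree.ElementTree`; the `item` wrapper, the `describe` helper, and the three state labels are assumptions introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

doc = f"""
<item xmlns:xsi="{XSI}">
  <shipDate xsi:nil="true"></shipDate>
  <comment></comment>
</item>
"""
item = ET.fromstring(doc)

def describe(parent, tag):
    """Classify a child element as absent, explicitly nil, or present."""
    el = parent.find(tag)
    if el is None:
        return "absent"
    if el.get(f"{{{XSI}}}nil") == "true":
        return "nil"
    return "present"

states = {tag: describe(item, tag) for tag in ["shipDate", "comment", "quantity"]}
print(states)
```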
Element Content: Mixed content
• Text appears between the elements salutation, quantity, productName, and shipDate (all children of letterBody)
• To allow this, the mixed attribute of the parent’s complexType must be set to true
<letterBody>
  <salutation>Dear Mr.<name>Robert Smith</name>.</salutation>
  Your order of <quantity>1</quantity> <productName>BabyMonitor</productName>
  shipped from our warehouse on <shipDate>1999-05-21</shipDate>. ....
</letterBody>
Element Content: Mixed content
<xsd:element name="letterBody">
  <xsd:complexType mixed="true">
    <xsd:sequence>
      <xsd:element name="salutation">
        <xsd:complexType mixed="true">
          <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="quantity" type="xsd:positiveInteger"/>
      <xsd:element name="productName" type="xsd:string"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
      <!-- etc. -->
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
• The order and number of child elements appearing in an instance must agree with the order and number of child elements specified in the content model
Element Content: anyType
• The anyType type does not constrain its content in any way
<xsd:element name="anything" type="xsd:anyType"/>
• When no type is specified, anyType is the default, so the same declaration could be written as
<xsd:element name="anything"/>
Grouping Content Elements – group & sequence
• group – groups elements so that they can be used as a unit to build up types
• sequence grouping (default) – elements in the instance document must appear in the listed order
<xsd:group name="shipAndBill">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
  </xsd:sequence>
</xsd:group>
Content Groups - choice
• choice grouping – only one of the alternatives appears in an instance
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:choice>
      <xsd:group ref="shipAndBill"/>
      <xsd:element name="singleUSAddress" type="USAddress"/>
    </xsd:choice>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
Content Groups - all
• all grouping – elements may appear in any order, each element appears zero or one times
• An all group must appear as the sole child at the top of a content model
<xsd:complexType name="PurchaseOrderType">
  <xsd:all>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:all>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
Attribute Grouping
• We can create a named attribute group containing all the desired attributes and reference this group by name in an element
<xsd:element name="Item">
  <xsd:complexType>
    . . .
    <xsd:attribute name="partNum" type="SKU" use="required"/>
    <xsd:attribute name="weightKg" type="xsd:decimal"/>
    <xsd:attribute name="shipBy">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="air"/>
          <xsd:enumeration value="land"/>
          <xsd:enumeration value="any"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
Attribute Groups
<xsd:element name="Item">
  <xsd:complexType>
    . . .
    <xsd:attributeGroup ref="ItemDelivery"/>
  </xsd:complexType>
</xsd:element>
<xsd:attributeGroup name="ItemDelivery">
  <xsd:attribute name="partNum" type="SKU" use="required"/>
  <xsd:attribute name="weightKg" type="xsd:decimal"/>
  <xsd:attribute name="shipBy">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="air"/>
        <xsd:enumeration value="land"/>
        <xsd:enumeration value="any"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:attributeGroup>
Target Namespaces
• Tired of repeating the prefix xsd: ?
• We could make the XMLSchema namespace the default namespace (so no more xsd prefixes) but then we would have to prefix the locally defined types and locally declared elements and attributes
• The solution is Target Namespaces
• Target namespaces enable distinguishing between definitions and declarations from different vocabularies
Target Namespace Example
<schema targetNamespace="http://www.example.com/PO"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO">
  <element name="purchaseOrder" type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>
  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
      <!-- etc. -->
    </sequence>
  </complexType>
  <complexType name="USAddress">
    <sequence>
      <element name="name" type="string"/>
      <!-- etc. -->
    </sequence>
  </complexType>
</schema>
Undeclared Target Namespaces
• What is the target namespace when a schema does not declare one?
  – All its definitions and declarations are referenced without qualification
  – They can only validate unqualified names in instance documents
• What is the target namespace when an instance document does not declare one?
  – All pre-XMLSchema XML 1.0 documents are like this
  – To validate such instance documents, the validation processor must be provided with a schema with no target namespace
Other XML Schema Issues
• A schema can be distributed across multiple documents, one of which is topmost and the rest “included”
• Types can be “imported” from other schemas
• Abstract types allow a form of inheritance [beyond derived types] with substitution groups
• Keys (as in relational databases)
• …
Drawbacks of XML Schemas
• Another vocabulary to learn
• Verbose (like XML itself)
• Many constraints cannot be expressed (without adding a separate stylesheet or code)
<Demo xmlns="http://www.demo.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.demo.org demo.xsd">
  <A>10</A>
  <B>20</B>
</Demo>
• Can constrain: the Demo element contains a sequence of elements A followed by B; the A element contains an integer; the B element contains an integer
• Cannot constrain: A > B
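A constraint such as A > B therefore ends up in separate code. The sketch below shows one hedged way to do that check with Python's `xml.etree.ElementTree`, after schema validation has (by assumption) already confirmed that A and B hold integers; the prefix `d` is an arbitrary local name for the Demo namespace.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://www.demo.org"}  # "d" is an arbitrary local prefix

doc = """
<Demo xmlns="http://www.demo.org">
  <A>10</A>
  <B>20</B>
</Demo>
"""
demo = ET.fromstring(doc)

# The schema can guarantee that both values are integers;
# comparing them is left to application code like this.
a = int(demo.findtext("d:A", namespaces=NS))
b = int(demo.findtext("d:B", namespaces=NS))
a_greater_than_b = a > b
print(a, b, a_greater_than_b)
```

For this instance the cross-element rule fails even though the document would satisfy the structural and type constraints.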
Today’s Topics:
• Document Structure Definition
  – Document Type Definitions (DTDs)
  – XML Schemas (XSD)
• Querying XML Documents
  – NOT the same as Web search engines!
  – XPath
  – XQuery
Why not use SQL?
• Table rows vs. XML elements
• Homogeneous vs. heterogeneous - two elements of the same type may have different structure (due to minOccurs, maxOccurs, choice, etc.)
• Flat vs. multi-nested
• Unordered sets/tuples vs. ordered elements
• “Dense” vs. “Sparse” - not all potential subelements are present or have values
XML Data Model
• Basically a sequence, an ordered list of zero or more items
• No sequences of sequences
• An item is either a node or an atomic value
• An atomic value – a built-in data type or a simple type derived by restriction
• A node is one of seven kinds: element, attribute, text, document, comment, processing instruction, and namespace
Document Order
• Among all nodes in a hierarchy there is a total order, called document order, in which each node appears before its children
• Preorder traversal
• Informally, the document order corresponds to the order in which the first character of the XML representation of each node occurs in the XML representation of the document
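Document order can be observed directly with a preorder walk. This is a minimal sketch assuming Python's `xml.etree.ElementTree`, whose `iter()` method visits elements depth-first with each parent before its children; the small document is made up for the example.

```python
import xml.etree.ElementTree as ET

doc = "<AAA><BBB/><CCC><DDD/></CCC><BBB/></AAA>"
root = ET.fromstring(doc)

# iter() performs a preorder traversal: every element is visited
# before its children, which matches document order for elements.
order = [el.tag for el in root.iter()]
print(order)
```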
XPath Overview
• Language that expresses simple queries on individual XML documents (or streams) for retrieving parts of the XML document
• Operates on the abstract, logical structure of an XML document, rather than its surface syntax
• Basic facilities for manipulating strings, arithmetic and boolean expressions
• Compact, non-XML syntax to facilitate use within URIs and XML attribute values
XPath Expressions
• Similar to filesystem addressing
• Consists of a series of steps, separated by “/” or “//”
• Each step is evaluated in the context of a particular node, called the context node
• The result of each step is a sequence of nodes, which serve in turn as context nodes for the following step
• The value of a path expression is the node sequence that results from the last step
XPath Example
• If the path starts with the slash “/”, then it represents an absolute path to the required element
/AAA/DDD/BBB
Select all elements BBB that are children of DDD that are children of the root element AAA
<AAA>
  <BBB/> <CCC/> <BBB/> <BBB/>
  <DDD> <BBB/> </DDD>
  <CCC/>
</AAA>
XPath Example
• If the path starts with double-slash “//”, then all elements in the document which fulfill the criteria are selected
//BBB
Select all elements BBB
<AAA>
  <BBB/> <CCC/> <BBB/>
  <DDD> <BBB/> </DDD>
  <CCC> <DDD> <BBB/> <BBB/> </DDD> </CCC>
</AAA>
XPath Example
//DDD/BBB
Select all elements BBB that are children of DDD
<AAA>
  <BBB/> <CCC/> <BBB/>
  <DDD> <BBB/> </DDD>
  <CCC> <DDD> <BBB/> <BBB/> </DDD> </CCC>
</AAA>
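These path examples can be tried with the limited XPath subset in Python's `xml.etree.ElementTree`. This is a sketch under one assumption worth stating: ElementTree evaluates paths relative to an element rather than from the document node, so `//BBB` is written `.//BBB` and `/AAA/DDD/BBB` is written `./DDD/BBB` starting from the AAA element.

```python
import xml.etree.ElementTree as ET

doc = """
<AAA>
  <BBB/> <CCC/> <BBB/>
  <DDD> <BBB/> </DDD>
  <CCC> <DDD> <BBB/> <BBB/> </DDD> </CCC>
</AAA>
"""
root = ET.fromstring(doc)  # root is the AAA element

all_bbb = root.findall(".//BBB")        # corresponds to //BBB
under_ddd = root.findall(".//DDD/BBB")  # corresponds to //DDD/BBB
direct = root.findall("./DDD/BBB")      # corresponds to /AAA/DDD/BBB
print(len(all_bbb), len(under_ddd), len(direct))
```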
XPath Example
• The star “*” selects all elements located by the preceding path
/AAA/CCC/DDD/*
Select all elements enclosed by elements /AAA/CCC/DDD
<AAA>
  <XXX> <DDD> <BBB/> <BBB/> <EEE/> <FFF/> </DDD> </XXX>
  <CCC> <DDD> <BBB/> <BBB/> <EEE/> <FFF/> </DDD> </CCC>
  <CCC> <BBB> <BBB> <BBB/> </BBB> </BBB> </CCC>
</AAA>
XPath Example
/*/*/*/BBB
Select all elements BBB that have 3 ancestors
<AAA>
  <XXX> <DDD> <BBB/> <BBB/> <EEE/> <FFF/> </DDD> </XXX>
  <CCC> <DDD> <BBB/> <BBB/> <EEE/> <FFF/> </DDD> </CCC>
  <CCC> <BBB> <BBB> <BBB/> </BBB> </BBB> </CCC>
</AAA>
Moving Through the Node Hierarchy
• A kind of step in XPath, called an axis step, helps move through the node hierarchy in a particular direction, called an axis
• Forward axis – only contains the context node or nodes that are after the context node in document order
  – child, descendant, self, descendant-or-self, following, following-sibling, attribute, namespace
• Reverse axis – only contains the context node or nodes that are before the context node in document order
  – parent, ancestor, preceding, preceding-sibling, ancestor-or-self
XPath Examples
//CCC/descendant::*
Select all elements that have CCC among their ancestors
<AAA>
  <BBB> <DDD> <CCC> <DDD/> <EEE/> </CCC> </DDD> </BBB>
  <CCC> <DDD> <EEE> <DDD> <FFF/> </DDD> </EEE> </DDD> </CCC>
</AAA>
//DDD/parent::*
Select all parents of DDD elements
<AAA>
  <BBB> <DDD> <CCC> <DDD/> <EEE/> </CCC> </DDD> </BBB>
  <CCC> <DDD> <EEE> <DDD> <FFF/> </DDD> </EEE> </DDD> </CCC>
</AAA>
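The two axis examples can be emulated by hand. The sketch below assumes Python's `xml.etree.ElementTree`, which has no parent pointers and no explicit axis syntax, so a child-to-parent map is built first; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

doc = """
<AAA>
  <BBB> <DDD> <CCC> <DDD/> <EEE/> </CCC> </DDD> </BBB>
  <CCC> <DDD> <EEE> <DDD> <FFF/> </DDD> </EEE> </DDD> </CCC>
</AAA>
"""
root = ET.fromstring(doc)

# ElementTree elements do not know their parent, so record it explicitly.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Emulates //DDD/parent::* (each parent element reported once).
parents = []
for ddd in root.iter("DDD"):
    p = parent_of[ddd]
    if p not in parents:
        parents.append(p)
parent_tags = [p.tag for p in parents]

# Emulates //CCC/descendant::* (every element below some CCC, once).
descendants = []
for ccc in root.iter("CCC"):
    for el in ccc.iter():
        if el is not ccc and el not in descendants:
            descendants.append(el)
descendant_tags = [el.tag for el in descendants]
print(parent_tags, descendant_tags)
```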
XPath Predicates
• Expressions in square brackets “[ ]” can further specify an element
• Used to filter a sequence of values in a step
• A number in the brackets gives the ordinal position of the element in the selected set
• The function “last()” selects the last element in the selection
• Function “count()” counts the number of selected elements
• Many other functions and operators
XPath Examples
//BBB[position() mod 2 = 0]
Select even BBB elements
<AAA>
  <BBB/> <BBB/> <BBB/> <BBB/> <BBB/> <BBB/> <BBB/> <BBB/>
  <CCC/> <CCC/> <CCC/>
</AAA>
//*[count(BBB)=2]
Select elements that have two children BBB
<AAA>
  <CCC> <BBB/> <BBB/> <BBB/> </CCC>
  <DDD> <BBB/> <BBB/> </DDD>
  <EEE> <CCC/> <DDD/> </EEE>
</AAA>
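Both predicates can be spelled out as explicit filters. This sketch assumes Python's `xml.etree.ElementTree`, which does not implement `position()` or `count()`, so the filtering is done in ordinary Python; positions are counted from 1 among the BBB children of each parent, as in XPath.

```python
import xml.etree.ElementTree as ET

# Emulates //BBB[position() mod 2 = 0]
evens_root = ET.fromstring("<AAA>" + "<BBB/>" * 8 + "<CCC/>" * 3 + "</AAA>")
even_positions = [
    pos
    for parent in evens_root.iter()
    for pos, _ in enumerate(parent.findall("BBB"), start=1)
    if pos % 2 == 0
]

# Emulates //*[count(BBB)=2]
count_root = ET.fromstring("""
<AAA>
  <CCC> <BBB/> <BBB/> <BBB/> </CCC>
  <DDD> <BBB/> <BBB/> </DDD>
  <EEE> <CCC/> <DDD/> </EEE>
</AAA>
""")
two_bbb = [el.tag for el in count_root.iter() if len(el.findall("BBB")) == 2]
print(even_positions, two_bbb)
```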
XPath Predicates
• Attribute names are specified by the at-sign “@” prefix
• Non-prefixed names are the names of element nodes
XPath Examples
//BBB[@id]
Select BBB elements that have attribute id
<AAA>
  <BBB id="b1"/>
  <CCC id="b2"/>
  <BBB name="bbb"/>
  <BBB/>
</AAA>
//BBB[not(@*)]
Select BBB elements without any attribute
<AAA>
  <BBB id="b1"/>
  <CCC id="b2"/>
  <BBB name="bbb"/>
  <BBB/>
</AAA>
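The attribute predicates can be checked as follows. This is a sketch with Python's `xml.etree.ElementTree`: its XPath subset supports `[@id]` directly, while `not(@*)` is not supported and is replaced here by a plain Python test on the element's attribute dictionary.

```python
import xml.etree.ElementTree as ET

doc = '<AAA><BBB id="b1"/><CCC id="b2"/><BBB name="bbb"/><BBB/></AAA>'
root = ET.fromstring(doc)

# //BBB[@id] is supported by ElementTree's XPath subset.
with_id = [el.get("id") for el in root.findall(".//BBB[@id]")]

# //BBB[not(@*)] is emulated: keep BBB elements whose attribute dict is empty.
without_attrs = [el for el in root.iter("BBB") if not el.attrib]
print(with_id, len(without_attrs))
```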
XPath Predicates
• Function “name()” returns the name of the element
• The “starts-with” function returns true if the first argument string starts with (is prefixed by) the second argument string
• The “contains” function returns true if the first string contains the second string
• The “string-length” function returns the number of characters in the string
XPath Examples
//*[name()='BBB']
Select all elements with name BBB, equivalent to //BBB
<AAA>
  <BCC> <BBB/> <BBB/> <BBB/> </BCC>
  <DDB> <BBB/> <BBB/> </DDB>
  <BEC> <CCC/> <DBD/> </BEC>
</AAA>
//*[contains(name(),'C')]
Select all elements whose name contains the letter C
<AAA>
  <BCC> <BBB/> <BBB/> <BBB/> </BCC>
  <DDB> <BBB/> <BBB/> </DDB>
  <BEC> <CCC/> <DBD/> </BEC>
</AAA>
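The name-based predicates translate to simple string tests on each element's tag. This sketch uses Python's `xml.etree.ElementTree` and plain string operations in place of the XPath functions `name()` and `contains()`, which ElementTree does not provide.

```python
import xml.etree.ElementTree as ET

doc = """
<AAA>
  <BCC> <BBB/> <BBB/> <BBB/> </BCC>
  <DDB> <BBB/> <BBB/> </DDB>
  <BEC> <CCC/> <DBD/> </BEC>
</AAA>
"""
root = ET.fromstring(doc)

# Emulates //*[name()='BBB']
named_bbb = [el for el in root.iter() if el.tag == "BBB"]

# Emulates //*[contains(name(),'C')]
contains_c = [el.tag for el in root.iter() if "C" in el.tag]
print(len(named_bbb), contains_c)
```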
More XPath
• Several paths can be combined with the union (“|”), intersect, and except operators (the latter two are XPath 2.0)
• Several values can be concatenated to form a sequence with the comma “,” and “to” operators
• A variable is a name that begins with a dollar sign “$”; it may be bound to a value and used in an expression
• Originated ~1998 as part of XSLT; XPath 1.0 W3C Recommendation November 1999; XPath 2.0 W3C Recommendation January 2007
XPath Limitations
• Path expressions are very powerful but there are some drawbacks
• XPath can only select existing nodes
• Cannot construct new elements and attributes and specify contents and relationships
• XPath operates on a single document
XQuery Design Goals
• Express at least the queries possible in known query languages like SQL and various OO query languages
• Query the many kinds of data XML contains
• Implementable in many environments
  – Databases, XML programming environments
XQuery Overview
• Operates on the XPath data model
• Can query over multiple documents (e.g., a database of XML documents)
• Sequence of (list of ordered) trees
• A document is a list of size 1
• Functional query language – made up of expressions that return values and do not have side effects
Using XPath
doc("mp3.xml")//chapter//figure[caption = "ipod nano"]
In any chapter of the document mp3.xml find figures with caption “ipod nano”
[Figure: document tree of mp3.xml rooted at book, with part, chapter, appendix, section, paragraph, figure and caption nodes; the figures shown are captioned “ipod nano” and “ipod classic”]
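The same selection can be tried in Python. This is a sketch only, assuming xml.etree.ElementTree, whose XPath subset happens to cover descendant steps and [child='text'] predicates. The content of mp3.xml is not given on the slide, so MP3_XML below is a hypothetical stand-in that reuses the element names from the figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for mp3.xml; only the element names come from the slide.
MP3_XML = """
<book>
  <part>
    <chapter>
      <section>
        <paragraph/>
        <figure><caption>ipod nano</caption></figure>
      </section>
    </chapter>
    <chapter>
      <paragraph/>
      <figure><caption>ipod classic</caption></figure>
    </chapter>
  </part>
  <appendix>
    <figure><caption>ipod nano</caption></figure>
  </appendix>
</book>
"""

root = ET.fromstring(MP3_XML)

# Counterpart of //chapter//figure[caption = "ipod nano"]:
# figures anywhere below a chapter whose caption child has that text.
figures = root.findall(".//chapter//figure[caption='ipod nano']")

for figure in figures:
    print(ET.tostring(figure, encoding="unicode").strip())
```

The figure in the appendix has the right caption but is not inside a chapter, so it is not selected.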
Using XQuery
<result> { doc("mp3.xml")//chapter//figure[caption = "ipod nano"] } </result>
In a chapter of the document mp3.xml find the figures with caption “ipod nano” and place them into an element called result
The new result element has its own node identity
[Figure: result element containing the matching figure and its caption “ipod nano”]
Element Construction
• XQuery provides for the construction of new elements
• An element constructor looks exactly like an XML element
• Using XQuery expressions, the element content may hold computed values
<result>
  <figure>
    <caption>ipod nano</caption>
  </figure>
</result>
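Element construction can be imitated in Python to show what "a new element with computed content" means. This is a sketch under assumptions: it uses xml.etree.ElementTree, the source fragment is hypothetical, and copying the matches with copy.deepcopy is this sketch's way of giving the constructed tree nodes that are distinct from the source nodes.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source fragment standing in for mp3.xml.
source = ET.fromstring(
    "<chapter>"
    "<figure><caption>ipod nano</caption></figure>"
    "<figure><caption>ipod classic</caption></figure>"
    "</chapter>"
)

# The computed value: figures whose caption is "ipod nano".
matches = source.findall(".//figure[caption='ipod nano']")

# <result> { ... } </result>: build a new element and copy the matches
# into it, so the constructed tree is independent of the source tree.
result = ET.Element("result")
for figure in matches:
    result.append(copy.deepcopy(figure))

print(ET.tostring(result, encoding="unicode"))
```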
Bibliography Example Data Set
<bib>
  <book>
    <author>Aho</author>
    <author>Lam</author>
    <author>Sethi</author>
    <author>Ullman</author>
    <title>Compilers</title>
    <publisher>Addison Wesley</publisher>
    <year>2006</year>
  </book>
  <book>
    <author>Rowling</author>
    <title>Harry Potter 6</title>
    <publisher>Scholastic</publisher>
    <year>2005</year>
  </book>
  <book>
    <author>Patton</author>
    <title>Software Testing</title>
    <publisher>SAMS</publisher>
    <year>2005</year>
  </book>
</bib>
Reviews Example Data Set
<reviews>
  <review>
    <title>Compilers</title>
    <comment>It’s the best</comment>
    <comment>A definitive textbook</comment>
  </review>
  <review>
    <title>Harry Potter 6</title>
    <comment>Spoiler: Dumbledore dies</comment>
    <comment>When will the next book come out?</comment>
  </review>
  …
</reviews>
FOR-WHERE-RETURN
for $b in doc("bib.xml")//book
where $b/year/text() = "2005"
return $b/title
List the titles of books published in 2005
[Figure: tree of bib.xml with three book nodes, each with title, publisher and year children (Addison Wesley 2006, Scholastic 2005, SAMS 2005)]
Tuples of variable bindings
• FOR/LET: ordered lists of tuples of variable bindings
• WHERE: tuples that satisfy the conditions
• RETURN: list of trees
[Figure: $b is bound to each of the three book nodes in turn; the books that satisfy the condition are kept and their title nodes are returned]
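The FOR, WHERE and RETURN stages can be traced one at a time in Python. This is a minimal sketch, assuming xml.etree.ElementTree and an abbreviated copy of the bibliography data set; text_of is a hypothetical helper, and the sketch returns title strings where the query returns title elements.

```python
import xml.etree.ElementTree as ET

# The bibliography data set from the slides, abbreviated to the fields used.
BIB_XML = """
<bib>
  <book><title>Compilers</title><publisher>Addison Wesley</publisher><year>2006</year></book>
  <book><title>Harry Potter 6</title><publisher>Scholastic</publisher><year>2005</year></book>
  <book><title>Software Testing</title><publisher>SAMS</publisher><year>2005</year></book>
</bib>
"""


def text_of(elem, child):
    """Text of the named child, with surrounding whitespace removed."""
    return (elem.findtext(child) or "").strip()


bib = ET.fromstring(BIB_XML)

# FOR: bind $b to each book in turn (an ordered list of bindings).
bindings = bib.findall(".//book")

# WHERE: keep only the bindings that satisfy the condition.
kept = [b for b in bindings if text_of(b, "year") == "2005"]

# RETURN: build the output for each surviving binding.
titles = [text_of(b, "title") for b in kept]

print(titles)
```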
RETURN
for $b in doc("bib.xml")//book
where $b/year/text() = "2005"
return $b/author
Return the list of authors who published in 2005
WHERE
for $b in doc("bib.xml")//book
where $b/publisher/text() = "Addison Wesley" and $b/year/text() = "2006"
return $b/title
List the titles of books published by “Addison Wesley” in 2006
Joins
for $b in doc("bib.xml")//book,
    $r in doc("review.xml")//review
where $b/title/text() = $r/title/text()
return
  <book_with_review>
    {$b/@*}
    {$b/*}
    {$r/comment}
  </book_with_review>
For every book with a matching review, output a book_with_review element that contains all the attributes and subelements of book and the comment subelements of review
Join Example Result
<book_with_review>
  <author>Aho</author>
  <author>Lam</author>
  <author>Sethi</author>
  <author>Ullman</author>
  <title>Compilers</title>
  <publisher>Addison Wesley</publisher>
  <year>2006</year>
  <comment>It’s the best</comment>
  <comment>A definitive textbook</comment>
</book_with_review>
<book_with_review>
  <author>Rowling</author>
  …
</book_with_review>
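The join above amounts to a nested loop over books and reviews with an equality test on title. A sketch in Python, assuming xml.etree.ElementTree and abbreviated versions of the two example data sets (fewer authors and comments than on the slides); join_books_with_reviews is a hypothetical name.

```python
import copy
import xml.etree.ElementTree as ET

# Abbreviated versions of the two example data sets.
BIB_XML = """<bib>
  <book><author>Aho</author><title>Compilers</title><year>2006</year></book>
  <book><author>Rowling</author><title>Harry Potter 6</title><year>2005</year></book>
  <book><author>Patton</author><title>Software Testing</title><year>2005</year></book>
</bib>"""

REVIEWS_XML = """<reviews>
  <review><title>Compilers</title><comment>A definitive textbook</comment></review>
  <review><title>Harry Potter 6</title><comment>When will the next book come out?</comment></review>
</reviews>"""


def join_books_with_reviews(bib, reviews):
    """Nested-loop join on title, mirroring the two FOR variables $b and $r."""
    joined = []
    for book in bib.findall(".//book"):
        for review in reviews.findall(".//review"):
            book_title = book.findtext("title", "").strip()
            review_title = review.findtext("title", "").strip()
            if book_title != review_title:
                continue
            # {$b/@*}: copy the attributes of the book
            out = ET.Element("book_with_review", dict(book.attrib))
            # {$b/*}: copy all subelements of the book
            out.extend([copy.deepcopy(child) for child in book])
            # {$r/comment}: copy the comment subelements of the review
            out.extend([copy.deepcopy(c) for c in review.findall("comment")])
            joined.append(out)
    return joined


joined = join_books_with_reviews(ET.fromstring(BIB_XML), ET.fromstring(REVIEWS_XML))
for element in joined:
    print(ET.tostring(element, encoding="unicode"))
```

“Software Testing” has no matching review in this data, so it produces no output element. A query engine is free to pick a better join strategy than the nested loop used here.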
Nested queries
for $a in distinct-values(doc("bib.xml")//author/text())
return
  <author>
    <name> {$a} </name>
    { for $b in doc("bib.xml")//book
      where $a = $b/author/text()
      return $b/title }
  </author>
Invert the structure of the input document so that there is a list of author elements, each containing the name of the author and the list of books he/she wrote
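The inversion can be followed step by step in Python. This is a sketch assuming xml.etree.ElementTree and the authors and titles from the bibliography data set; invert is a hypothetical name, and keeping authors in first-seen order is a choice made here for readability.

```python
import copy
import xml.etree.ElementTree as ET

# Authors and titles from the bibliography example data set.
BIB_XML = (
    "<bib>"
    "<book><author>Aho</author><author>Lam</author><author>Sethi</author>"
    "<author>Ullman</author><title>Compilers</title></book>"
    "<book><author>Rowling</author><title>Harry Potter 6</title></book>"
    "<book><author>Patton</author><title>Software Testing</title></book>"
    "</bib>"
)


def invert(bib):
    """Build one <author> element per distinct author name, holding their titles."""
    # Outer FOR: distinct author names (dict.fromkeys keeps first-seen order).
    names = list(dict.fromkeys((a.text or "").strip() for a in bib.iter("author")))
    authors = []
    for name in names:
        author = ET.Element("author")
        ET.SubElement(author, "name").text = name
        # Inner FOR/WHERE: books that list this name among their authors.
        for book in bib.iter("book"):
            book_authors = [(a.text or "").strip() for a in book.findall("author")]
            if name in book_authors:
                author.append(copy.deepcopy(book.find("title")))
        authors.append(author)
    return authors


authors = invert(ET.fromstring(BIB_XML))
for author in authors:
    print(ET.tostring(author, encoding="unicode"))
```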
Conditionals
for $b in doc("bib.xml")//book
return
  <short>
    {$b/title}
    <author>
      { if (count($b/author) < 3)
        then $b/author
        else ($b/author[1], <author>and others</author>) }
    </author>
  </short>
Leave alone books with fewer than 3 authors
Otherwise shorten the author list
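The conditional can be mirrored by an ordinary if/else in Python. A sketch assuming xml.etree.ElementTree; shorten is a hypothetical name, and for brevity it emits the author elements directly under short instead of nesting them in an outer author element as the query does.

```python
import xml.etree.ElementTree as ET

# Authors and titles from the bibliography example data set.
BIB_XML = (
    "<bib>"
    "<book><author>Aho</author><author>Lam</author><author>Sethi</author>"
    "<author>Ullman</author><title>Compilers</title></book>"
    "<book><author>Rowling</author><title>Harry Potter 6</title></book>"
    "</bib>"
)


def shorten(book, limit=3):
    """Return a <short> element; author lists of `limit` or more are cut down."""
    short = ET.Element("short")
    ET.SubElement(short, "title").text = (book.findtext("title") or "").strip()
    names = [(a.text or "").strip() for a in book.findall("author")]
    if len(names) < limit:
        kept = names                       # then-branch: leave the list alone
    else:
        kept = [names[0], "and others"]    # else-branch: first author plus a marker
    for name in kept:
        ET.SubElement(short, "author").text = name
    return short


books = ET.fromstring(BIB_XML).findall(".//book")
shortened = [ET.tostring(shorten(b), encoding="unicode") for b in books]
print("\n".join(shortened))
```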
Existential Quantification
for $b in doc("bib.xml")//book
where some $author in $b/author satisfies $author/text() = "Aho"
return $b
Return books where at least one of the authors is “Aho”
Universal Quantification
for $b in doc("bib.xml")//book
where every $author in $b/author satisfies $author/text() = "Aho"
return $b
Return books where all authors are “Aho”
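The two quantifiers correspond closely to Python's any() and all(). A sketch assuming xml.etree.ElementTree; author_names is a hypothetical helper, and the last book (with no author elements) is a hypothetical addition to the data set, included to show that a universal condition over an empty list of authors is satisfied.

```python
import xml.etree.ElementTree as ET

# Authors and titles from the bibliography data set, plus one
# hypothetical book without any author element.
BIB_XML = (
    "<bib>"
    "<book><author>Aho</author><author>Lam</author><author>Sethi</author>"
    "<author>Ullman</author><title>Compilers</title></book>"
    "<book><author>Rowling</author><title>Harry Potter 6</title></book>"
    "<book><author>Patton</author><title>Software Testing</title></book>"
    "<book><title>Anonymous Pamphlet</title></book>"
    "</bib>"
)


def author_names(book):
    """Author names of a book, with surrounding whitespace removed."""
    return [(a.text or "").strip() for a in book.findall("author")]


books = ET.fromstring(BIB_XML).findall(".//book")

# some $author in $b/author satisfies $author/text() = "Aho"
some_aho = [b for b in books if any(n == "Aho" for n in author_names(b))]

# every $author in $b/author satisfies $author/text() = "Aho"
every_aho = [b for b in books if all(n == "Aho" for n in author_names(b))]

print([b.findtext("title") for b in some_aho])
print([b.findtext("title") for b in every_aho])
```

With this data the universal query returns only the book without authors, since “Compilers” also lists authors other than “Aho”.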
More XQuery Notation
• Also “order by”
• Can define and use functions
• Supported by all major database engines
• XQuery has been around in draft form since ~2001, but only became a W3C Recommendation in January 2007
Summary
• XPath expresses simple queries on individual XML documents (or streams)
– Optimization analogous to string matching
• XQuery expresses sophisticated SQL-like queries, joins and views on databases, messages, etc. whether stored as XML or not
– Optimization analogous to database query/join
• Both orthogonal to XML Schema (or DTD)
– But it’s hard to write queries without knowing what syntax will occur in the documents…
Second Assignment: Revised Paper Proposal
• Due Monday February 18th at 5pm
• Maximum three pages (not including figures, if any), plus references (required)
• Plan and outline your paper (which will be ~15 pages)
• See http://york.cs.columbia.edu/classes/cs6125/revised_paper_proposal.htm
Revised Paper Proposal
• Each full paper should have title, author, abstract (~200 words), introduction, body sections, conclusions, bibliography (cited references)
• The point of this assignment is to determine what will be in those sections
• Assume a reader who is taking the class but may not know anything at all about your specific topic
Revised Paper Proposal: Introduction and Conclusion
• What is your topic?
• What is the problem being addressed?
• What is the solution, or design space of solutions, proposed or actualized?
• What is your argument?
• What is your point of view?
• What is the opposing point of view?
Revised Paper Proposal: Body Sections
• What sections? (usually 3-5)
• What subsections? (perhaps down to subsubsections)
• Motivate your literature reading to fill those sections
• Full paper will be due Friday March 14th
A Note about Citations and Bibliographic References
• References should be cited in the text like this “Pogue said blah blah [1]” or this “[Pog07] describes mumble”
• Bibliography entries should appear something like this
[1] David Pogue, Behind the Scenes of “iPhone: The Musical”, The New York Times, online edition, July 12, 2007. <http://pogue.blogs.nytimes.com/2007/07/12/> accessed February 8, 2008.
Another Note about Bibliographic References
• Bibliographic references should be as complete as possible (but official MLA, APA, etc. format is not required)
• There is a variety of free software available to help manage reference lists
– http://www.easybib.com/, http://www.bibme.org/ and others used online (web search for “free bibliography”)
– Downloads from http://www.columbia.edu/acis/software/libraries.html (requires your uni/password)
Second Assignment: Logistics
• Submit by posting in Revised Paper Proposal folder on CourseWorks
• Must be in a format I can read, which means pdf 8 or earlier, word 2003 or earlier, powerpoint 2003 or earlier, html, plain ascii text
• With all figures embedded in the file or separately viewable in Firefox 2 or IE 7 on Windows XP
• Bundled archives must be openable with WinZip 11
Heads Up on Project
• Preliminary Proposal due Monday March 10th (note this is before the full paper)
• Optionally work in teams (see http://york.cs.columbia.edu/classes/cs6125/team_advice)
• Build a new system or extend an existing system – submit code, demo system
• OR evaluate/compare one or more existing system(s) – submit procedures and findings, show system(s)
• You may "continue" your paper topic towards the project, or do something entirely different
Heads Up on Presentation
• Individual ~10-minute talk in class during one of the last few class sessions
• No proposal, just do it
• May be based on paper, project, or some other topic (in the case of team members all presenting on the same project, please coordinate to avoid redundancy and discuss your plans with the instructor in advance)
Reminders
• Class participation is important! (10% corresponds to a whole letter grade)
• Revised paper proposal due Monday February 18th by 5pm
• Preliminary project proposal due March 10th
• Paper must be individual, projects may optionally be done in teams