Semi-Structured Data Models
By Chris Bennett
Semi-Structured Data
What is it?
- Data whose structure is not necessarily determined in advance (often implicit in the data)
- Descriptive, not prescriptive
- Self-describing and flexible in structure
Where does it come from?
- Data that cannot be (or simply is not) modeled naturally or usefully using a standard data model
- Merging multiple data sources, sparse user annotations, rapidly evolving schemas specific to given communities
- Raw data is often semi-structured
- Frequently a product of rapidly evolving schemas
Examples
- HTML, XML, BibTeX, integrated data sources, etc.
Semi-Structured Data
This is great: infinite flexibility! Is there a catch?
- Always a tradeoff…
- In this case, retrieval and query performance can suffer greatly compared to more structured data models
Semi-Structured Data
So we know what it is. How do we…
- Model it? Directed labeled graphs
- Query it? Many proposals, all of which include regular path expressions (Lorel, XML Query, …)
- Store it? Big challenge (Haystack model)
Semi-Structured Data Models
What do they do?
- Provide a common framework
- In effect, they add some structure
Why?
- Semi-structured data is often irregular or missing, similar concepts are represented using different types, heterogeneous sets are present, or object structure is not fully known
- Standardize information exchange
- Data verification (both internal and external)
Examples
- OEM, XML DTD, XML Schema…
OEM – Object Exchange Model
- Developed at Stanford (mid-1990s)
- Precursor to today's accepted semi-structured data standards (XML)
- Each object is described by (label, type, value, object-ID)
- Main feature: self-describing
- Requires a good bit of human intervention, though
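The (label, type, value, object-ID) description can be sketched as a small Python data structure. This is only an illustration: the class name `OEMObject`, its field names, the `describe` helper, and the sample bibliography data are all assumptions made for this sketch, not part of the OEM definition.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OEMObject:
    """One self-describing object: (label, type, value, object-ID)."""
    label: str
    type: str  # "set" for nested objects, otherwise an atomic type name
    value: Union[str, int, List["OEMObject"]]
    oid: str


# Hypothetical sample data: a bibliography holding one book.
book = OEMObject("book", "set", [
    OEMObject("author", "string", "Aho", "&3"),
    OEMObject("topic", "string", "compilers", "&4"),
], "&2")
biblio = OEMObject("biblio", "set", [book], "&1")


def describe(obj: OEMObject, depth: int = 0) -> List[str]:
    """Render an object and its sub-objects as indented lines."""
    pad = "  " * depth
    if obj.type == "set":
        lines = [f"{pad}<{obj.oid}, {obj.label}, set>"]
        for child in obj.value:
            lines.extend(describe(child, depth + 1))
        return lines
    return [f"{pad}<{obj.oid}, {obj.label}, {obj.type}, {obj.value!r}>"]


print("\n".join(describe(biblio)))
```

Every object carries its own label, so no separate schema is needed to interpret the data.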
Object-Oriented Model versus OEM
OEM is an information exchange model (does not specify object storage issues)
OEM is much simpler (supports object nesting…omits classes, methods, inheritance)
Uses labels in place of schema
Advantages of OEM
Simple model makes transforming and merging data simpler
Advanced features can be “emulated” (implies human intervention)
More suitable for heterogeneity
Hindsight: Extreme heterogeneity mandates more than a little human intervention without some structure
Components of OEM
Query Language
- OEM-QL: typical SELECT-FROM-WHERE
Translator
- Translates OEM-QL to a specific data source and back
Mediator
- Collects the work of the translators, then merges and/or combines the results into OEM structures
OEM-QL
SELECT – FROM – WHERE
Adaptation of an SQL-like language for OO models
SELECT fetch-expression
FROM object
WHERE condition
Expressions in the SELECT and WHERE clauses use the notion of a path that describes a traversal through an object using sub-object structure and labels
OEM-QL
SELECT biblio.?.topic
FROM root
WHERE biblio.?.internal-call-no
? - denotes match to any label
Return the topic of books where there exists an internal call number
The question mark allows the user to say that the intermediate “node” in the path through the object can be named anything
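A path with a ? wildcard can be emulated over nested Python dictionaries. This is a rough sketch, not OEM-QL itself: the `follow` helper and the sample `root` data are hypothetical, dictionaries cannot hold two sub-objects with the same label, and the sketch assumes that both paths in the query refer to the same intermediate object.

```python
from typing import Any, Dict, Iterator, List


def follow(obj: Any, path: List[str]) -> Iterator[Any]:
    """Yield every value reached from obj by following the labels in path.

    The label "?" matches any single label at that step.
    """
    if not path:
        yield obj
        return
    if not isinstance(obj, dict):
        return
    head, rest = path[0], path[1:]
    for label, child in obj.items():
        if head == "?" or head == label:
            yield from follow(child, rest)


# Hypothetical data: three entries under biblio, two with a call number.
root: Dict[str, Any] = {
    "biblio": {
        "book": {"topic": "databases", "internal-call-no": 17},
        "article": {"topic": "graphs"},
        "report": {"topic": "mediators", "internal-call-no": 42},
    }
}

# SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no
topics = [
    topic
    for entry in follow(root, ["biblio", "?"])
    if any(True for _ in follow(entry, ["internal-call-no"]))
    for topic in follow(entry, ["topic"])
]
print(topics)
```

The intermediate labels (book, article, report) never appear in the query, which is the point of the wildcard.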
XML DTD – Document Type Definition
Let there be (a little) more structure…
DTDs define the legal building blocks of an XML document.
A DTD defines the document structure with a list of legal elements and/or attributes, and it can be declared inline or external to the XML document.
XML DTD Example
<!DOCTYPE note [
  <!ELEMENT note (to, from, heading, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
XML DTD Advantages
An application can use a standard DTD to verify that data received from the outside world is valid.
It is flexible enough to express nesting and occurrence counts:
- + : at least one occurrence
- * : zero or more occurrences
- ? : zero or one occurrence
Example:
<!ELEMENT note (to+, from, header, message*)>
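The occurrence indicators behave much like regular-expression quantifiers, and that analogy can be sketched in Python. This is not a DTD validator: `model_to_regex`, `children_match`, and `MODEL` are hypothetical names, and only a plain comma-separated sequence of child elements with +, *, or ? suffixes is handled (no mixed content, choices, or nested groups).

```python
import re
from typing import List

# Simplified content model, assumed for this sketch.
MODEL = "to+, from, header, message*"


def model_to_regex(model: str) -> "re.Pattern[str]":
    """Translate a sequence content model into a regex over child names.

    Each child name is followed by one space in the text being matched.
    """
    parts = []
    for item in model.split(","):
        item = item.strip()
        if item[-1] in "+*?":
            name, quant = item[:-1].strip(), item[-1]
        else:
            name, quant = item, ""
        parts.append(f"(?:{re.escape(name)} ){quant}")
    return re.compile("".join(parts))


def children_match(model: str, children: List[str]) -> bool:
    """Check whether a list of child element names fits the model."""
    text = "".join(child + " " for child in children)
    return model_to_regex(model).fullmatch(text) is not None


print(children_match(MODEL, ["to", "to", "from", "header"]))
print(children_match(MODEL, ["from", "header"]))
```

The first call succeeds because `to` may repeat and `message` may be absent; the second fails because at least one `to` is required.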
DTD Drawbacks
What about constraints?
- DTDs do not offer much help in constraining the value of a particular attribute or element (only the use of markup)
- Automated processing of XML documents requires more rigorous and comprehensive facilities in this area.
- Constraints are needed on how the component parts of an application fit together, the document structure, attributes, data typing, and so on.
XML Schema
Well-formed is not enough! Let there be more structure!
- XML Schema is an XML-based alternative (and ultimate successor) to DTDs
- Schemas express shared vocabularies and allow machines to carry out rules made by people.
- They provide a means for defining the structure, content and semantics of XML documents
Successor to DTDs
XML Schema:
- Extensible to future additions
- Richer and more useful than DTDs
- Written in XML
- Supports data types
- Supports namespaces
XML Schema Advantages
Better validation, restriction, and type conversion
Extensible – reuse and modify existing data types, reference multiple schemas
XML Schema Details
Defines…
- Elements that can appear in a document
- Attributes that can appear in a document
- Which elements are child elements
- Order of child elements
- Number of child elements
- Whether an element is empty or can include text
- Data types for elements and attributes
- Default and fixed values for elements and attributes
XML Schema Components
The primary components are as follows:
- Simple type definitions, complex type definitions, attribute declarations, and element declarations
The secondary components, which must have names, are as follows:
- Attribute group definitions, identity-constraint definitions, model group definitions, and notation declarations
Finally, the "helper" components provide small parts of other components; they are not independent of their context:
- Annotations, model groups, particles, wildcards, and attribute uses
XML Namespaces (W3C Documentation)
Collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names
XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set
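How a URI identifies a namespace can be seen with Python's standard `xml.etree.ElementTree`, which expands each qualified name to `{uri}localname`. The sample document below is hypothetical, the URI is a placeholder, and the prefix `order` used in the lookup is deliberately different from the prefix in the document to show that only the URI matters.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; the URI is only an identifier.
doc = """
<po:purchaseOrder xmlns:po="http://www.example.com/PO1">
  <shipTo>Alice</shipTo>
  <po:comment>Hurry</po:comment>
</po:purchaseOrder>
"""

root = ET.fromstring(doc)

# ElementTree stores qualified names in the form {namespace-uri}localname.
print(root.tag)
print([child.tag for child in root])

# Any prefix can be used for lookup as long as it maps to the same URI.
ns = {"order": "http://www.example.com/PO1"}
print(root.find("order:comment", ns).text)
```

The unprefixed `shipTo` element belongs to no namespace here, because the document declares no default namespace.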
XML Schema Example
W3C XML Schema Primer (examples)
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO1"
        targetNamespace="http://www.example.com/PO1"
        elementFormDefault="unqualified"
        attributeFormDefault="unqualified">
  <element name="purchaseOrder" type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>
  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
      <!-- etc. -->
    </sequence>
    <!-- etc. -->
  </complexType>
  <complexType name="USAddress">
    <sequence>
      <element name="name" type="string"/>
      <element name="street" type="string"/>
      <!-- etc. -->
    </sequence>
  </complexType>
  <!-- etc. -->
</schema>
Querying Semi-Structured Data
Keys:
- Semi-structured data is modeled as directed graphs
- The user cannot have full knowledge of the data structure, but we should exploit what structure we do know exists
Examples
- Lorel: developed at Stanford (1997) as part of the Lore (Lightweight Object Repository) project
- XPath: W3C standard language for addressing parts of an XML document
Lore System
- Successor to OEM
- Fully functional DBMS for XML with a declarative query language, multiple indexing techniques, a cost-based query optimizer, multi-user support, logging, and recovery
- Novel features include DataGuides, management of external data, and proximity search
Lore – Novel Features
DataGuides
- Structural summary of all paths in the database
- Used by the query optimizer to exploit known structure
Management of external data
Proximity search
- Ranks database objects based on their proximity to other objects
- Measures proximity based on distances in the graph linking the objects together
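The idea of a structural summary can be sketched by collecting every distinct label path in a small database. This is a simplification and not Lore's DataGuide algorithm: it assumes tree-shaped data held in nested Python dictionaries and lists, whereas real DataGuides summarize graph-structured databases. The function name `label_paths` and the sample `guide` data are hypothetical.

```python
from typing import Any, Set, Tuple


def label_paths(obj: Any, prefix: Tuple[str, ...] = ()) -> Set[Tuple[str, ...]]:
    """Collect every distinct label path that occurs in a nested structure."""
    paths: Set[Tuple[str, ...]] = set()
    if isinstance(obj, dict):
        for label, child in obj.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(obj, list):
        # A list holds several sub-objects that share the same label.
        for item in obj:
            paths |= label_paths(item, prefix)
    return paths


# Hypothetical data: two restaurants with different shapes.
guide = {
    "Guide": {
        "restaurant": [
            {"name": "Saigon", "address": {"zipcode": "94301"}},
            {"name": "Chef Chu", "zipcode": "94022"},
        ]
    }
}

summary = label_paths(guide)
for path in sorted(summary):
    print(".".join(path))
```

Each path appears once in the summary no matter how many objects share it, which is what lets an optimizer check quickly whether a path in a query can match anything.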
Lorel – Lore Query Language
- Based on OQL
- Provides powerful path traversal operators
- Makes extensive use of type coercion to help yield "intuitive" results for all queries over XML data
- Permits a flexible form of declarative navigational access
- Particularly suited to cases where details of the structure are not known
Lorel – Coercion Rules
                 Value    Atomic Object   Set of Objects                      Complex Object
Value            coerce   Dereference     Existential with ==                 False
Atomic Object                             Existential with ==                 False
Set of Objects                            Existential with == on both sides   False
Complex Object                                                                Value =
Lorel Example
Find the names and zip codes of all “cheap” restaurants
select Guide.restaurant.name, Guide.restaurant(.address)?.zipcode
where Guide.restaurant.% grep "cheap"
- The ? after (.address) means the address is optional in the path expression
- The % will match any subobject of restaurant
- The comparison operator grep returns true if the string "cheap" appears anywhere in the subobject value
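The effect of the optional path step and the wildcard match can be imitated in Python over hypothetical data. This is only an emulation, not how Lore evaluates the query: the `restaurants` list, `zipcodes`, and `mentions_cheap` are names invented for this sketch, only direct subobjects with string values are inspected, and results are paired as (name, zipcode) for readability.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical data: the zipcode sits under address for some restaurants only.
restaurants: List[Dict[str, Any]] = [
    {"name": "Saigon", "price": "cheap", "address": {"zipcode": "94301"}},
    {"name": "Chef Chu", "comment": "not cheap at all", "zipcode": "94022"},
    {"name": "Bistro", "price": "expensive", "zipcode": "94305"},
]


def zipcodes(restaurant: Dict[str, Any]) -> List[str]:
    """Emulate (.address)?.zipcode: the address step may be present or absent."""
    found = []
    if "zipcode" in restaurant:
        found.append(restaurant["zipcode"])
    address = restaurant.get("address")
    if isinstance(address, dict) and "zipcode" in address:
        found.append(address["zipcode"])
    return found


def mentions_cheap(restaurant: Dict[str, Any]) -> bool:
    """Emulate % grep "cheap": any direct subobject whose string contains it."""
    return any(isinstance(v, str) and "cheap" in v for v in restaurant.values())


result: List[Tuple[str, str]] = [
    (r["name"], z) for r in restaurants if mentions_cheap(r) for z in zipcodes(r)
]
print(result)
```

Note that a plain substring match also selects the restaurant described as "not cheap at all", which shows how loosely grep matches.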
Lorel – Another example
select X.name
from John.name JN, John.child X, X.name XN
where JN == XN
"Retrieve the children of John bearing his name"
== expects atomic values, so the operands are coerced
Rewritten:
select X.name
from John.child X
where John.name == X.name
Lorel – Constructing Results
Select-from-where in Lorel has the same semantics as in SQL: the result is a bag (multiset), or a set if 'distinct' is used
The result is always a collection of OEM objects (duplicates are eliminated by OID)
For each assignment of the variables in the from clause that passes the condition of the where clause, a value is generated according to the expressions in the select clause
Results could refer to database objects or could refer to new objects created by coercion
Lorel – Data Updates
- Create and delete database names
- Delete is implicit when an object becomes unreachable
- Create a new atomic or complex object
- Modify the value of an existing atomic or complex object
- Bulk load an OEM database
Lorel – Updates cont’d…
Assigning names to objects
Name myFavorite := element (select Guide.Restaurant where Guide.Restaurant.name = "Saigon")
Creating objects
new_oem (int, 5)
new_oem (complex, struct(a:{new_oem(int,5)}, b:{X,Y}))
XPath Features
XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax
Provides basic facilities for manipulation of strings, numbers and booleans
XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values
XPath – How It Works
- XPath models an XML document as a tree of nodes: root nodes, element nodes, text nodes, attribute nodes, namespace nodes, processing instruction nodes, and comment nodes
- Evaluation occurs with respect to a "context", which consists of:
  - a node (the context node)
  - a pair of non-zero positive integers (the context position and the context size)
  - a set of variable bindings
  - a function library
  - the set of namespace declarations in scope for the expression
XPath – How It Works cont’d…
- Location path: selects a set of nodes relative to the context node
- An expression that is a location path results in a node set
- Includes functions for node sets, strings, numbers, etc.
XPath – Generic Example
Simple:
employee[@secretary and @assistant]
Selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
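The same selection can be tried with Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath. That subset has no `and` operator, so this sketch chains two attribute predicates instead, which selects the same elements here. The sample `staff` document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the root element plays the role of the context node.
doc = """
<staff>
  <employee secretary="Ann" assistant="Bob">Carol</employee>
  <employee secretary="Dan">Erin</employee>
  <employee assistant="Fay">Gus</employee>
  <manager secretary="Hal" assistant="Ivy">Jo</manager>
</staff>
"""

context = ET.fromstring(doc)

# Chained predicates: employee children having both attributes.
matches = context.findall("employee[@secretary][@assistant]")
print([e.text for e in matches])
```

The `manager` element has both attributes but is not selected, because the location path names `employee` children only.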