
Semi-Structured Data Models

By Chris Bennett

Semi-Structured Data

What is it?
- Data whose structure is not necessarily determined in advance (often implicit in the data)
- Descriptive, not prescriptive
- Self-describing and flexible in structure

Where does it come from?
- Data that cannot be (or simply is not) modeled naturally or usefully using a standard data model
- Merging multiple data sources, sparse user annotations, rapidly evolving schemas specific to given communities
- Raw data is often semi-structured
- Frequently a product of a rapidly evolving schema

Examples
- HTML, XML, BibTeX, integrated data sources, etc.


This is great: infinite flexibility! Is there a catch?

Always a tradeoff: in this case, retrieval and query performance can suffer greatly compared to more structured data models.


So we know what it is. How do we...

Model it?
- Directed labeled graphs

Query it?
- Many proposals, all of which include regular path expressions (Lorel, XML Query, ...)

Store it?
- Big challenge
- Haystack Model

Semi-Structured Data Models

What do they do?
- Provide a common framework
- In effect, they add some structure

Why?
- Semi-structured data is often irregular or missing, similar concepts are represented using different types, heterogeneous sets are present, or object structure is not fully ...
- Standardize information exchange
- Data verification (both internal and external)

Examples
- OEM, XML DTD, XML Schema, ...

OEM – Object Exchange Model

- Developed at Stanford (mid 90s)
- Precursor to today's accepted semi-structured data acronyms (XML)
- Objects are described by (label, type, value, object-ID)
- Main feature: self-describing
- Requires a good bit of human intervention, though
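The (label, type, value, object-ID) description can be made concrete with a small sketch. This is only an illustration in Python, not the Stanford implementation: the class name `OEMObject`, the helper functions, the `"set"` type tag for nested objects, and the sample book data are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import count
from typing import Union

_next_oid = count(1)  # hypothetical object-ID generator


@dataclass
class OEMObject:
    label: str                     # self-describing name carried by the object
    type: str                      # "string", "integer", or "set" for nesting
    value: Union[str, int, list]   # atomic value, or a list of child objects
    oid: int                       # object-ID


def atom(label, type_, value):
    """Create an atomic object with a fresh object-ID."""
    return OEMObject(label, type_, value, next(_next_oid))


def complex_obj(label, children):
    """Create a complex object whose value is a set of sub-objects."""
    return OEMObject(label, "set", list(children), next(_next_oid))


def describe(obj, depth=0):
    """Render an object and its sub-objects as indented text lines."""
    pad = "  " * depth
    if obj.type == "set":
        lines = [f"{pad}<&{obj.oid}, {obj.label}, set>"]
        for child in obj.value:
            lines.extend(describe(child, depth + 1))
        return lines
    return [f"{pad}<&{obj.oid}, {obj.label}, {obj.type}, {obj.value!r}>"]


# Made-up sample data: no schema is declared anywhere, the labels carry it.
book = complex_obj("book", [
    atom("title", "string", "Example Title"),
    atom("year", "integer", 1998),
])
print("\n".join(describe(book)))
```

Because every object carries its own label, two sources with different shapes can be merged by simply placing their objects under a common parent, which is the property the slide calls self-describing.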

Object-Oriented Model versus OEM

OEM is an information exchange model (does not specify object storage issues)

OEM is much simpler (supports object nesting…omits classes, methods, inheritance)

Uses labels in place of schema

Advantages of OEM

Simple model makes transforming and merging data simpler

Advanced features can be “emulated” (implies human intervention)

More suitable for heterogeneity

Hindsight: Extreme heterogeneity mandates more than a little human intervention without some structure

Components of OEM

Query Language
- OEM-QL: typical SELECT-FROM-WHERE

Translator
- Translates OEM-QL to a specific data source and back

Mediator
- Collects the work of translators, then merges and/or combines it to make OEM structures

OEM-QL

SELECT-FROM-WHERE: adaptation of an SQL-like language for OO models

SELECT fetch-expression
FROM object
WHERE condition

Expressions in the SELECT and WHERE clauses use the notion of a path that describes a traversal through an object using sub-object structure and labels

OEM-QL

SELECT biblio.?.topic
FROM root
WHERE biblio.?.internal-call-no

? - denotes match to any label

Return the topic of books where there exists an internal call number

The question mark allows the user to say that the intermediate “node” in the path through the object can be named anything
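The effect of the ? wildcard can be sketched in Python. This is not an OEM-QL engine: objects are modeled here as hypothetical (label, value) pairs, the helper names are invented for the example, and the sample bibliography is made up. The sketch assumes the query means "for each object one step below biblio, whatever its label, return its topic if it has an internal-call-no sub-object".

```python
def children(obj, label=None):
    """Yield sub-objects of a complex object; label=None plays the role of '?'."""
    _, value = obj
    if isinstance(value, list):
        for child in value:
            if label is None or child[0] == label:
                yield child


def follow(obj, path):
    """Yield every object reached from obj along a list of labels (None = '?')."""
    if not path:
        yield obj
        return
    for child in children(obj, path[0]):
        yield from follow(child, path[1:])


def topics_with_call_no(root):
    """SELECT biblio.?.topic FROM root WHERE biblio.?.internal-call-no"""
    result = []
    for entry in follow(root, ["biblio", None]):
        has_call_no = any(True for _ in follow(entry, ["internal-call-no"]))
        if has_call_no:
            result.extend(value for _, value in follow(entry, ["topic"]))
    return result


# Made-up data: the intermediate label differs per entry (book, article, report).
root = ("root", [
    ("biblio", [
        ("book", [("topic", "databases"), ("internal-call-no", 7)]),
        ("article", [("topic", "networks")]),
        ("report", [("topic", "graphs"), ("internal-call-no", 12)]),
    ]),
])
print(topics_with_call_no(root))
```

The article entry is skipped because it has no internal-call-no, while the book and report entries match even though their labels differ.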

XML DTD – Document Type Definition

Let there be (a little) more structure…

DTDs define the legal building blocks of an XML document.

A DTD defines the document structure with a list of legal elements and/or attributes, and it can be declared inline or external to the XML document.

XML DTD Example

<!DOCTYPE note [
  <!ELEMENT note (to, from, heading, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
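To see what this DTD accepts and rejects, the sketch below hand-codes the same rules in Python. It is not a DTD validator (the standard library's xml.etree.ElementTree parses XML but does not validate against a DTD); the function name and the two sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Child order fixed by the content model (to, from, heading, body).
NOTE_CHILDREN = ["to", "from", "heading", "body"]


def conforms_to_note_dtd(xml_text):
    """Hand-rolled check mirroring the note DTD; not a general validator."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    if [child.tag for child in root] != NOTE_CHILDREN:
        return False
    # (#PCDATA) children hold character data only, so no nested elements.
    return all(len(child) == 0 for child in root)


good = ("<note><to>Ann</to><from>Bob</from>"
        "<heading>Hi</heading><body>Lunch?</body></note>")
# Same elements, but 'from' comes before 'to', which the sequence forbids.
bad = ("<note><from>Bob</from><to>Ann</to>"
       "<heading>Hi</heading><body>Lunch?</body></note>")

print(conforms_to_note_dtd(good), conforms_to_note_dtd(bad))
```

Note that the check says nothing about what the text inside each element looks like, which is exactly the limitation of DTDs discussed later in the deck.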

XML DTD Advantages

An application can use a standard DTD to verify that data you receive from the outside world is valid.

It is flexible enough to let you nest elements and state how often each may occur:
- + : at least one occurrence
- * : zero or more occurrences
- ? : zero or one occurrence

Example:

<!ELEMENT note (to+, from, header, message*)>
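The occurrence indicators behave like the same operators in regular expressions. As a rough sketch (an analogy, not how a DTD processor is implemented), a content model in the style of the example, such as (to+, from, header, message*), can be checked in Python by matching a regular expression against the sequence of child element names. The names `NOTE_MODEL` and `matches_model` are invented for this example.

```python
import re

# Content model (to+, from, header, message*) rewritten as a regex over a
# sequence of child element names, each name followed by one space.
NOTE_MODEL = re.compile(r"(to )+(from )(header )(message )*")


def matches_model(child_names):
    """True if the list of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in child_names)
    return NOTE_MODEL.fullmatch(sequence) is not None


print(matches_model(["to", "to", "from", "header"]))            # two recipients
print(matches_model(["from", "header"]))                        # no 'to' at all
print(matches_model(["to", "from", "header", "message", "message"]))
```

The first and third calls satisfy the model; the second does not, because + requires at least one `to`.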

DTD Drawbacks

What about constraints?
- DTDs do not offer much help in constraining the value of a particular attribute or element (only the use of markup)
- Automated processing of XML documents requires more rigorous and comprehensive facilities in this area.
- Requirements are for constraints on how the component parts of an application fit together, the document structure, attributes, data typing, and so on.

XML Schema

Well-formed is not enough! Let there be more structure!

XML Schema is an XML-based alternative (and ultimate successor) to DTDs.

Schemas express shared vocabularies and allow machines to carry out rules made by people.

They provide a means for defining the structure, content and semantics of XML documents.

Successor to DTDs

XML Schema:
- Is extensible to future additions
- Is richer and more useful than DTDs
- Is written in XML
- Supports data types
- Supports namespaces

XML Schema Advantages

Better validation, restriction, and type conversion

Extensible: reuse and modify existing data types, reference multiple schemas

XML Schema Details

Defines...
- Elements that can appear in a document
- Attributes that can appear in a document
- Which elements are child elements
- Order of child elements
- Number of child elements
- Whether an element is empty or can include text
- Data types for elements and attributes
- Default and fixed values for elements and attributes

XML Schema Components

Primary components:
- Simple type definitions, complex type definitions, attribute declarations, and element declarations

The secondary components, which must have names, are as follows:
- Attribute group definitions, identity-constraint definitions, model group definitions, and notation declarations

Finally, the "helper" components provide small parts of other components; they are not independent of their context:
- Annotations, model groups, particles, wildcards, and attribute uses

XML Namespaces (W3C Documentation)

Collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names

XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set
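A short Python sketch shows how a namespace URI, rather than the prefix, identifies a name. It relies on xml.etree.ElementTree, which expands each prefixed name to the form "{namespace-URI}local-name". The URI and the sample document are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

PO_NS = "http://www.example.com/PO1"  # placeholder namespace URI

doc = """
<po:purchaseOrder xmlns:po="http://www.example.com/PO1">
  <po:comment>Rush order</po:comment>
  <comment>not in the namespace</comment>
</po:purchaseOrder>
"""
root = ET.fromstring(doc)

# Same local name, different identity: one is qualified by the URI, one is not.
qualified = root.findall(f"{{{PO_NS}}}comment")
unqualified = root.findall("comment")

# The prefix is only shorthand; a different prefix bound to the same URI
# produces the same expanded element name.
same = ET.fromstring('<x:purchaseOrder xmlns:x="http://www.example.com/PO1"/>')

print(root.tag)
print([e.text for e in qualified], [e.text for e in unqualified])
print(same.tag == root.tag)
```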

XML Schema Example

W3C XML Schema Primer (examples)

<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:po="http://www.example.com/PO1"
        targetNamespace="http://www.example.com/PO1"
        elementFormDefault="unqualified"
        attributeFormDefault="unqualified">
  <element name="purchaseOrder" type="po:PurchaseOrderType"/>
  <element name="comment" type="string"/>
  <complexType name="PurchaseOrderType">
    <sequence>
      <element name="shipTo" type="po:USAddress"/>
      <element name="billTo" type="po:USAddress"/>
      <element ref="po:comment" minOccurs="0"/>
    </sequence>
  </complexType>
  <complexType name="USAddress">
    <sequence>
      <element name="name" type="string"/>
      <element name="street" type="string"/>
    </sequence>
  </complexType>
</schema>

Querying Semi-Structured Data

Keys:
- Semi-structured data is modeled on directed graphs
- The user cannot have full knowledge of the data structure, but we should exploit what structure we do know exists

Examples
- Lorel: developed at Stanford (1997) as part of the Lore (Lightweight Object Repository) project
- XPath: W3C standard language for addressing parts of an XML document

Lore System Stanford Link

- Successor to OEM
- Fully functional DBMS for XML with a declarative query language, multiple indexing techniques, a cost-based query optimizer, multi-user support, logging, and recovery
- Novel features include DataGuides, management of external data, and proximity search

Lore – Novel Features

DataGuides
- Structural summary of all paths in the database
- Used by the query optimizer to exploit known structure

Manage External Data

Proximity Search
- Ranks database objects based on their proximity to other objects
- Measures proximity based on distances in the graph linking the objects together
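Two of these ideas can be sketched on a toy graph in Python. Neither function is Lore's algorithm: `label_paths` only collects the distinct label paths that a DataGuide would summarize (assuming acyclic data), and `distance` measures proximity as the number of edges on a shortest path while ignoring edge direction, which is an assumption made for this example. The graph itself is made up.

```python
from collections import deque

# Hypothetical OEM-style graph: object id -> list of (label, target id).
GRAPH = {
    "guide": [("restaurant", "r1"), ("restaurant", "r2")],
    "r1": [("name", "n1"), ("address", "a1")],
    "r2": [("name", "n2")],
    "a1": [("zipcode", "z1")],
    "n1": [], "n2": [], "z1": [],
}


def label_paths(graph, root):
    """Return the set of distinct label paths from root (acyclic data assumed)."""
    paths = set()
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        for label, target in graph[node]:
            new_path = path + (label,)
            paths.add(new_path)
            stack.append((target, new_path))
    return paths


def distance(graph, start, goal):
    """Fewest edges between two objects, treating edges as undirected (BFS)."""
    neighbours = {node: set() for node in graph}
    for node, edges in graph.items():
        for _, target in edges:
            neighbours[node].add(target)
            neighbours[target].add(node)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # the two objects are not connected


print(sorted(label_paths(GRAPH, "guide")))
print(distance(GRAPH, "n1", "z1"), distance(GRAPH, "n1", "n2"))
```

Both restaurants contribute the path restaurant.name, but the summary records it only once; a query optimizer can consult such a summary to learn that restaurant.address.zipcode exists without scanning the data.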

Lorel – Lore Query Language

- Based on OQL
- Provides powerful path traversal operators
- Makes extensive use of type coercion to help yield "intuitive" results for all queries over XML data
- Permits a flexible form of declarative navigational access
- Particularly suited to cases where details of the structure are not known

Lorel – Coercion Rules

                 Value    Atomic Object   Set of Objects                      Complex Object
Value            coerce   dereference     existential with ==                 false
Atomic Object                             existential with ==                 false
Set of Objects                            existential with == on both sides   false
Complex Object                                                                Value =

Lorel Example

Find the names and zip codes of all “cheap” restaurants

select Guide.restaurant.name, Guide.restaurant.(.address)?.zipcode
where Guide.restaurant.% grep "cheap"

- The ? after (.address) means the address is optional in the path expression
- The % will match any subobject of restaurant
- The comparison operator grep returns true if the string "cheap" appears anywhere in the subobject value
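The behaviour of the optional step and the % wildcard can be imitated in Python over nested dictionaries. This is a simplification rather than Lorel semantics: the data is made up, the function names are invented, and each name is paired with its zip codes in a tuple instead of being returned as OEM objects.

```python
# Made-up guide: zipcode sits under address for one restaurant and directly
# under restaurant for the others, which is what (.address)? tolerates.
GUIDE = {
    "restaurant": [
        {"name": "Saigon", "price": "cheap", "address": {"zipcode": "94301"}},
        {"name": "Chez Nous", "price": "expensive", "zipcode": "94025"},
        {"name": "Taqueria", "category": "cheap eats", "zipcode": "94303"},
    ]
}


def zipcodes(restaurant):
    """Follow (.address)?.zipcode: the address step may be present or absent."""
    found = []
    if "zipcode" in restaurant:
        found.append(restaurant["zipcode"])
    address = restaurant.get("address")
    if isinstance(address, dict) and "zipcode" in address:
        found.append(address["zipcode"])
    return found


def mentions(restaurant, word):
    """restaurant.% grep word: any direct subobject whose string value has word."""
    return any(isinstance(v, str) and word in v for v in restaurant.values())


def cheap_restaurants(guide):
    return [(r["name"], zipcodes(r))
            for r in guide["restaurant"] if mentions(r, "cheap")]


print(cheap_restaurants(GUIDE))
```

The query never says which label holds the word "cheap", so both the `price` and the `category` subobjects qualify, and the zip code is found whether or not an address level exists.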

Lorel – Another example

select X.name
from John.name JN, John.child X, X.name XN
where JN == XN

"Retrieve the children of John bearing his name"

== expects atomic values, so the operands are coerced

Rewritten:

select X.name
from John.child X
where John.name == X.name
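The existential reading of == over sets of objects can be sketched as follows. This covers only the set-versus-set and value-versus-set cases in a simplified form; it does not model Lorel's type coercion between, say, strings and numbers. The function names and the sample data are assumptions made for the example.

```python
def as_values(operand):
    """Treat a single atomic value as a one-element collection."""
    if isinstance(operand, (list, set, tuple)):
        return set(operand)
    return {operand}


def lorel_eq(left, right):
    """Existential ==: true if some value on the left equals some on the right."""
    return bool(as_values(left) & as_values(right))


# Made-up data: John has two names; one child shares a name, one has no name.
john = {
    "name": ["John", "Johnny"],
    "child": [{"name": ["John"]}, {"name": ["Mary"]}, {"name": []}],
}

# where John.name == X.name, evaluated for each child X
namesakes = [child["name"] for child in john["child"]
             if lorel_eq(john["name"], child["name"])]
print(namesakes)
```

A child with no name simply fails the comparison instead of raising an error, which is the kind of "intuitive" result coercion is meant to give on irregular data.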

Lorel – Constructing Results

S-F-W in Lorel has the same semantics as in SQL: the result is a bag (multiset), or a set if 'distinct' is used

The result is always a collection of OEM objects (duplicate elimination by OID)

For each assignment of the variables in the from clause that passes the condition of the where clause, a value is generated according to the expressions in the select clause

Results could refer to database objects or could refer to new objects created by coercion

Lorel – Data Updates

- Create and delete database names
- Delete is implicit when an object becomes unreachable
- Create a new atomic or complex object
- Modify the value of an existing atomic or complex object
- Bulk load an OEM database

Lorel – Updates cont’d…

Assigning names to objects

Name myFavorite := element (select Guide.Restaurant where Guide.Restaurant.name = "Saigon")

Creating objects

new_oem (int, 5)
new_oem (complex, struct(a:{new_oem(int,5)}, b:{X,Y}))

XPath Features

XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax

Provides basic facilities for manipulation of strings, numbers and booleans

XPath uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values

XPath – How It Works W3C XPath Information

XPath models an XML document as a tree of nodes:
- Root nodes, element nodes, text nodes, attribute nodes, namespace nodes, processing instruction nodes, comment nodes

Evaluation occurs with respect to a "context", which consists of:
- a node (the context node)
- a pair of non-zero positive integers (the context position and the context size)
- a set of variable bindings
- a function library
- the set of namespace declarations in scope for the expression

XPath – How It Works (cont'd)

- Location path: selects a set of nodes relative to the context node
- An expression that is a location path results in a node set
- Examples of location paths
- Includes functions for node sets, strings, numbers, etc.

XPath – Generic Example

Simple:

employee[@secretary and @assistant]

Selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
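The same selection can be tried in Python with xml.etree.ElementTree, with one caveat: ElementTree implements only a subset of XPath and does not accept `and` inside a predicate, so this sketch chains two predicates instead, which selects the same elements here. The sample document is made up, and its root element serves as the context node.

```python
import xml.etree.ElementTree as ET

staff = ET.fromstring("""
<office>
  <employee name="Ann" secretary="Bo" assistant="Cy"/>
  <employee name="Dee" secretary="Eve"/>
  <employee name="Flo"/>
</office>
""")

# employee[@secretary][@assistant]: employee children of the context node
# that carry a secretary attribute and also an assistant attribute.
both = staff.findall("employee[@secretary][@assistant]")
names = [e.get("name") for e in both]

# For comparison: employees that have at least a secretary attribute.
with_secretary = [e.get("name") for e in staff.findall("employee[@secretary]")]

print(names, with_secretary)
```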

W3C School Examples
