XML Schema Integration Ray Dos Santos July 19, 2009.

XML integration

Fundamental problem: schema matching, which takes two (or more) schemas to produce a mapping between elements (or attributes) of the two (or more) schemas that correspond semantically to each other.

Objective: find corresponding entities.

Some application domains for XML

  Health Level Seven (HL7): http://www.hl7.org/
  Geography Markup Language (GML)
  Systems Biology Markup Language (SBML): http://sbml.org/
  XBRL, the XML-based business reporting standard: http://www.xbrl.org/
  Global Justice XML Data Model (GJXDM): http://it.ojp.gov/jxdm
  ebXML: http://www.ebxml.org/
  Encoded Archival Description (EAD): http://lcweb.loc.gov/ead/
  XMP, for digital photography metadata
  SensorML, an XML grammar for sensor data
  Really Simple Syndication (RSS 2.0)

Integrating two schemas

Consider two schemas, S1 and S2, representing two customer relations, Cust and Customer.

  S1: Cust(CNo, CompName, FirstName, LastName)
  S2: Customer(CustID, Company, Contact, Phone)

Integrating two schemas (contd)

Represent the mapping with a similarity relation, ≅, over the power sets of S1 and S2, where each pair in ≅ represents one element of the mapping. E.g.,

  Cust.CNo ≅ Customer.CustID
  Cust.CompName ≅ Customer.Company
  {Cust.FirstName, Cust.LastName} ≅ Customer.Contact
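One way such a mapping could be held in memory is sketched below: each element of the similarity relation is a pair of subsets, one drawn from S1 and one from S2. The use of frozensets and the helper name matches_for are assumptions made for illustration only; the slides do not prescribe a data structure.

```python
# Each mapping element pairs a subset of S1's elements with a subset of S2's.
# frozenset keeps the subsets hashable, so the pairs can be stored in a set.
MAPPING = {
    (frozenset({"Cust.CNo"}), frozenset({"Customer.CustID"})),
    (frozenset({"Cust.CompName"}), frozenset({"Customer.Company"})),
    (frozenset({"Cust.FirstName", "Cust.LastName"}),
     frozenset({"Customer.Contact"})),
}


def matches_for(element, mapping=MAPPING):
    """Return every pair of the mapping in which `element` takes part."""
    return [(left, right) for left, right in mapping
            if element in left or element in right]


# Cust.LastName only appears together with Cust.FirstName.
for left, right in matches_for("Cust.LastName"):
    print(sorted(left), "corresponds to", sorted(right))
```

Customer.Phone has no counterpart in S1, so matches_for("Customer.Phone") returns an empty list.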

Different types of matching

Schema-level only matching: only schema information is considered.

Domain and instance-level only matching: some instance data (data records) and possibly the domain of each attribute are used. This case is quite common on the Web.

Integrated matching of schema, domain and instance data: Both schema and instance data (possibly domain information) are available.

Pre-processing for integration

Tokenization: break an item into atomic words using a dictionary, e.g.,
  Break “fromCity” into “from” and “city”
  Break “first-name” into “first” and “name”

Expansion: expand abbreviations and acronyms to their full words, e.g.,
  From “dept” to “departure”

Stopword removal and stemming

Standardization of words: irregular words are standardized to a single form, e.g.,
  From “colour” to “color”
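The pre-processing steps can be chained as in the rough sketch below. Two simplifications are assumed here and are not from the slides: tokenization splits on punctuation and on lowercase-to-uppercase boundaries instead of consulting a dictionary, and stemming is left out because it would need an external stemmer. The lookup tables are tiny, hand-made stand-ins.

```python
import re

# Tiny illustrative lookup tables; a real system would use far larger ones.
ABBREVIATIONS = {"dept": "departure"}
STOPWORDS = {"of", "the", "a", "an"}
STANDARD_FORMS = {"colour": "color"}


def tokenize(item):
    """Split on non-letters and on lowercase-to-uppercase boundaries."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", item)
    return [p.lower() for p in parts]


def preprocess(item):
    """Tokenize, expand abbreviations, drop stopwords, standardize spelling."""
    tokens = tokenize(item)
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [STANDARD_FORMS.get(t, t) for t in tokens]


for name in ["fromCity", "first-name", "dept", "colourOfTheCar"]:
    print(name, "->", preprocess(name))
```

Expansion runs before stopword removal so that an abbreviation is never mistaken for a stopword; the order of the remaining steps is a design choice of this sketch.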

Schema-level matching

Schema level matching relies on information such as name, description, data type, relationship type (e.g., part-of, is-a, etc), constraints, etc.

Match cardinality:
  1:1 match: one element in one schema matches one element of another schema.
  1:m match: one element in one schema matches m elements of another schema.
  m:n match: m elements in one schema match n elements of another schema.
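As a small illustration, the cardinality of a single match can be read off the sizes of the two groups of elements it relates. The function name match_cardinality is hypothetical.

```python
def match_cardinality(left, right):
    """Classify a match between two groups of elements as 1:1, 1:m, m:1 or m:n."""
    if not left or not right:
        raise ValueError("both sides of a match need at least one element")
    many_left = len(left) > 1
    many_right = len(right) > 1
    if many_left and many_right:
        return "m:n"
    if many_left:
        return "m:1"
    if many_right:
        return "1:m"
    return "1:1"


print(match_cardinality(["Cust.CNo"], ["Customer.CustID"]))
print(match_cardinality(["Cust.FirstName", "Cust.LastName"], ["Customer.Contact"]))
```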

An example

m:1 match is similar to 1:m match. m:n match is complex, and there is little work on it.

Linguistic approaches

They are used to derive match candidates based on names, comments or descriptions of schema elements:

Name match:
  Equality of names
  Synonyms
  Equality of hypernyms: A is a hypernym of B if B is a kind-of A.
  Common sub-strings
  Cosine similarity
  User-provided name match: usually a domain-dependent match dictionary
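A few of these name-matching criteria can be combined into a single score, as in the sketch below. The synonym table, the 0.9 score for synonyms and the length-ratio score for common substrings are arbitrary assumptions for illustration; hypernym and cosine matching are not covered here.

```python
from difflib import SequenceMatcher

# Hypothetical user-provided match dictionary (domain dependent).
SYNONYMS = {frozenset({"company", "firm"}), frozenset({"phone", "telephone"})}


def longest_common_substring(a, b):
    """Return the longest contiguous block shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def name_similarity(a, b):
    """Score two element names in [0, 1] using equality, synonyms, substrings."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.9  # arbitrary illustrative score for dictionary synonyms
    common = longest_common_substring(a, b)
    return len(common) / max(len(a), len(b))


print(name_similarity("CompName", "Company"))
print(name_similarity("Phone", "Telephone"))
```

Candidate matches would then be the name pairs whose score exceeds a threshold chosen for the domain.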

Linguistic approaches (contd)

Description match: in many files, schema elements carry comments. Cosine similarity can be used to compare these comments after stemming and stopword removal.
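A minimal sketch of comparing two comments by cosine similarity over word counts follows. The two comments and the stopword list are made up for illustration, and stemming is omitted to keep the example free of external libraries.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "for", "to"}  # illustrative only


def term_vector(comment):
    """Count the non-stopword terms of a comment."""
    words = re.findall(r"[a-z]+", comment.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical comments attached to two schema elements.
c1 = term_vector("unique number of the customer")
c2 = term_vector("id number of a customer")
print(round(cosine_similarity(c1, c2), 3))
```

The two comments share two of their three remaining terms, so the similarity is 2/3.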

Constraint based approaches

Constraints such as data types, value ranges, uniqueness, relationship types, etc.

An equivalence or compatibility table for data types and keys can be provided. E.g., string ≅ varchar, and (primary key) ≅ unique

Note: On the Web, the constraint information is often not available, but some can be inferred based on the domain and instance data.
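Such a compatibility table could be as simple as groups of names that are treated as equivalent. The groups below are hypothetical entries chosen for illustration, apart from the string/varchar and primary key/unique pairs mentioned above.

```python
# Hypothetical compatibility tables: each set groups names treated as equivalent.
COMPATIBLE_TYPES = [
    {"string", "varchar", "char", "text"},
    {"int", "integer", "smallint"},
    {"float", "real", "double"},
]
COMPATIBLE_KEYS = [{"primary key", "unique"}]


def compatible(a, b, table):
    """True if a and b are equal or fall in the same group of the table."""
    a, b = a.strip().lower(), b.strip().lower()
    return a == b or any(a in group and b in group for group in table)


print(compatible("string", "VARCHAR", COMPATIBLE_TYPES))
print(compatible("primary key", "unique", COMPATIBLE_KEYS))
print(compatible("string", "int", COMPATIBLE_TYPES))
```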

Domain and instance-level matching

In many applications, some data instances or attribute domains may be available. Value characteristics are used in matching.

Two different types of domains:
  Simple domain: each value in the domain has only a single component (the value cannot be decomposed).
  Composite domain: each value in the domain contains more than one component.

Match of simple domains

A simple domain can be of any type.

If the data type information is not available (this is often the case on the Web), the instance values can often be used to infer types, e.g.,
  Words may be considered as strings.
  Phone numbers can have a regular expression pattern.

Data type patterns (in regular expressions) can be learned automatically or defined manually. E.g., they can be used to identify such types as integer, real, string, month, weekday, date, time, zip code, phone numbers, etc.
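The sketch below shows manually defined patterns being used to guess the type of an attribute from its instance values. The specific regular expressions (for example, ISO-style dates and North American phone formats), the pattern order and the majority vote are all assumptions of this example, not part of the slides.

```python
import re
from collections import Counter

# Manually defined patterns, tried in order; the first full match wins.
TYPE_PATTERNS = [
    ("weekday", re.compile(r"(mon|tues|wednes|thurs|fri|satur|sun)day", re.I)),
    ("month", re.compile(r"january|february|march|april|may|june|july|august|"
                         r"september|october|november|december", re.I)),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("time", re.compile(r"\d{1,2}:\d{2}(:\d{2})?")),
    ("phone", re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")),
    ("integer", re.compile(r"[+-]?\d+")),
    ("real", re.compile(r"[+-]?\d*\.\d+")),
]


def infer_value_type(value):
    """Return the first type whose pattern matches the whole value."""
    value = value.strip()
    for type_name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(value):
            return type_name
    return "string"  # fallback: anything else is treated as a string


def infer_column_type(values):
    """Guess an attribute's type by majority vote over its instance values."""
    counts = Counter(infer_value_type(v) for v in values)
    return counts.most_common(1)[0][0]


print(infer_column_type(["555-123-4567", "(555) 123-4567", "n/a"]))
```

The majority vote tolerates a few dirty values, such as the "n/a" entry above, which is common in Web data.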

XML is different from databases

Limited use of acronyms and abbreviations in XML: natural language words and phrases are used instead, for the general public to understand. Databases use acronyms and abbreviations extensively.

Limited vocabulary: for easy understanding.

A large number of similar databases: a large number of sites offer the same services or sell the same products.

Additional structures: the information is usually organized in some meaningful way, but the organization needs to be understood first.
  Related attributes are together.
  Hierarchical organization.

Instance-based matching via footprints

Assume a global schema and a set of instances are given.

The method uses each instance value of every attribute to probe the underlying ontology to obtain the footprints.

These footprints are used to help with a best-faith matching estimate.

It performs matches based on:
  Events
  Relationships
  Spatial characteristics
  Attributes

Entity
  Object
    -- name
    -- description
    -- shape
    -- size
    -- atts
  Event
    -- begin at
    -- stop at
    -- elapse
    -- start
    -- end
  Spatial
    -- touches
    -- overlaps
    -- contains
    -- spans
    -- within
  Rel
    -- parent
    -- child
    -- type-of
    -- is-a
    -- part-of

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Washington DC Monuments</name>
    <description>Attached to the ground. Intelligently places itself at the height of the underlying terrain.</description>
    <Point>
      <coordinates>-77.0822035425683,-111.42228990140251,0</coordinates>
    </Point>
  </Placemark>
</kml>

<?xml version="1.0" encoding="UTF-8"?>
<poi xmlns="http://nis.atlaspoi.org">
  <poi>
    <name>Washington Monument</name>
    <description>Major monuments of the federal capital</description>
    <Location>
      <lat>-77.0822035425683</lat>
      <lon>11.42228990140251,0</lon>
    </Location>
  </poi>
</poi>

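[Figure: example footprints attached to the ontology above. Branch labels: Events (temporal, moving objects); Rel (relationships among objects); Spatial (location). Footprint codes shown in the figure, in extraction order: 01ZZ 99 18 02 5310; 99 20; 01XX 77 18 10; 01ZZ 99 20 02 5310 99 20; 05XX 44 12 10; 01ZZ 99 20 02 5310; 99 20; 01XX 77 20 10.]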

Next Steps

  How to define footprints
  Define rules to minimize footprint size and count
  Order footprints such that the most appropriate ones are “looked at” first
  Footprints for subtrees?
