XML Schema Integration
Ray Dos Santos, July 19, 2009
XML integration
Fundamental problem: schema matching, which takes two (or more) schemas to produce a mapping between elements (or attributes) of the two (or more) schemas that correspond semantically to each other.
Objective: find corresponding entities.
Some application domains for XML
- Health Level Seven (HL7): http://www.hl7.org/
- Geography Markup Language (GML)
- Systems Biology Markup Language (SBML): http://sbml.org/
- XBRL, the XML-based business reporting standard: http://www.xbrl.org/
- Global Justice XML Data Model (GJXDM): http://it.ojp.gov/jxdm
- ebXML: http://www.ebxml.org/
- Encoded Archival Description (EAD): http://lcweb.loc.gov/ead/
- XMP, digital photography metadata
- SensorML, an XML grammar for sensor data
- Really Simple Syndication (RSS 2.0)
Integrating two schemas
Consider two schemas, S1 and S2, representing two customer relations, Cust and Customer.
S1              S2
Cust            Customer
  CNo             CustID
  CompName        Company
  FirstName       Contact
  LastName        Phone
Integrating two schemas (contd)
Represent the mapping with a similarity relation, ≅, over the power sets of S1 and S2, where each pair in ≅ represents one element of the mapping. E.g.,
Cust.CNo ≅ Customer.CustID
Cust.CompName ≅ Customer.Company
{Cust.FirstName, Cust.LastName} ≅ Customer.Contact
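One possible way to hold such a mapping in code is as a set of pairs of frozensets, one side drawn from each schema. This is only a sketch: the helper names `is_valid_mapping` and `unmatched` are made up for illustration and do not come from the slides.

```python
# Schema elements from the Cust / Customer example.
S1 = {"Cust.CNo", "Cust.CompName", "Cust.FirstName", "Cust.LastName"}
S2 = {"Customer.CustID", "Customer.Company", "Customer.Contact", "Customer.Phone"}

# Each mapping element pairs a subset of S1 with a subset of S2,
# i.e. the relation lives over the power sets of the two schemas.
mapping = {
    (frozenset({"Cust.CNo"}), frozenset({"Customer.CustID"})),
    (frozenset({"Cust.CompName"}), frozenset({"Customer.Company"})),
    (frozenset({"Cust.FirstName", "Cust.LastName"}), frozenset({"Customer.Contact"})),
}

def is_valid_mapping(mapping, s1, s2):
    """True if every pair relates a non-empty subset of s1 to a non-empty subset of s2."""
    return all(left and right and left <= s1 and right <= s2
               for left, right in mapping)

def unmatched(mapping, schema, side):
    """Elements of `schema` not covered by any pair (side 0 = S1, side 1 = S2)."""
    covered = set().union(*(pair[side] for pair in mapping))
    return schema - covered
```

With the example mapping, `unmatched(mapping, S2, 1)` leaves only `Customer.Phone`, which has no counterpart in S1.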
Different types of matching
Schema-level only matching: only schema information is considered.
Domain and instance-level only matching: some instance data (data records) and possibly the domain of each attribute are used. This case is quite common on the Web.
Integrated matching of schema, domain and instance data: Both schema and instance data (possibly domain information) are available.
Pre-processing for integration
- Tokenization: break an item into atomic words using a dictionary, e.g.,
  - break “fromCity” into “from” and “city”
  - break “first-name” into “first” and “name”
- Expansion: expand abbreviations and acronyms to their full words, e.g., from “dept” to “departure”
- Stopword removal and stemming
- Standardization of words: irregular words are standardized to a single form, e.g., from “colour” to “color”
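The pre-processing steps can be sketched roughly as below. Two simplifications are assumed here: tokenization splits on case changes and punctuation instead of consulting a dictionary, and stemming is left out. The lookup tables are toy data invented for this example.

```python
import re

# Toy lookup tables (assumptions); a real system would use much larger resources.
ABBREVIATIONS = {"dept": "departure"}
STANDARD_FORMS = {"colour": "color"}
STOPWORDS = {"of", "the", "a", "an", "to"}

def tokenize(name):
    """Split on lower-to-upper case changes, hyphens, underscores and spaces."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]

def preprocess(name):
    """Tokenize, expand abbreviations, standardize spellings, drop stopwords."""
    tokens = tokenize(name)
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [STANDARD_FORMS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]
```

For instance, `tokenize("fromCity")` gives `["from", "city"]` and `preprocess("colour_of_car")` gives `["color", "car"]`.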
Schema-level matching
Schema-level matching relies on information such as name, description, data type, relationship type (e.g., part-of, is-a), and constraints.
Match cardinality:
- 1:1 match: one element in one schema matches one element of another schema.
- 1:m match: one element in one schema matches m elements of another schema.
- m:n match: m elements in one schema match n elements of another schema.
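As a small illustration, the cardinality of a single match can be read off the sizes of its two sides. The function name `match_cardinality` is hypothetical.

```python
def match_cardinality(left, right):
    """Return '1:1', '1:m', 'm:1' or 'm:n' for one match between two element sets."""
    if not left or not right:
        raise ValueError("both sides of a match must be non-empty")
    left_label = "1" if len(left) == 1 else "m"
    if len(right) == 1:
        right_label = "1"
    else:
        # Use 'n' only when the left side is already labelled 'm'.
        right_label = "n" if left_label == "m" else "m"
    return f"{left_label}:{right_label}"
```

Under this sketch, matching `{"Cust.FirstName", "Cust.LastName"}` to `{"Customer.Contact"}` is reported as `m:1`.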
An example
m:1 match is similar to 1:m match. m:n match is complex, and there is little work on it.
Linguistic approaches
They are used to derive match candidates based on names, comments or descriptions of schema elements:
- Name match:
  - Equality of names
  - Synonyms
  - Equality of hypernyms: A is a hypernym of B if B is a kind of A.
  - Common sub-strings
  - Cosine similarity
  - User-provided name match: usually a domain-dependent match dictionary
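A minimal sketch of three of these name-match signals (equality, synonyms, common sub-strings) follows. The synonym table and the 0.9 synonym score are arbitrary illustrative choices, not values from the slides.

```python
# Hypothetical synonym dictionary; each entry is an unordered pair of names.
SYNONYMS = {frozenset({"customer", "client"}), frozenset({"phone", "telephone"})}

def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def name_similarity(a, b):
    """Score in [0, 1]: equal names, then synonyms, then shared-substring ratio."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.9  # arbitrary illustrative score for a synonym hit
    return longest_common_substring(a, b) / max(len(a), len(b))
```

Here “CompName” and “Company” share the substring “comp”, so `name_similarity("CompName", "Company")` is 4/8 = 0.5.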
Linguistic approaches (contd)
Description match: in many files, schema elements are annotated with comments.
Cosine similarity can be used to compare comments after stemming and stopword removal.
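Comment comparison by cosine similarity might look like the following sketch. It uses a bag-of-words vector with a tiny stopword list and no stemming; the two comments in the usage note are made-up examples.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "for", "to"}  # toy list (assumption)

def bag_of_words(comment):
    """Lower-case, split on whitespace, drop stopwords (stemming omitted)."""
    return Counter(w for w in comment.lower().split() if w not in STOPWORDS)

def cosine_similarity(comment1, comment2):
    """Cosine of the angle between the two word-count vectors."""
    v1, v2 = bag_of_words(comment1), bag_of_words(comment2)
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(x * x for x in v1.values()))
            * math.sqrt(sum(x * x for x in v2.values())))
    return dot / norm if norm else 0.0
```

For the hypothetical comments “customer unique number” and “id number of a customer”, two of three words overlap on each side, giving a similarity of about 0.67.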
Constraint based approaches
Constraints such as data types, value ranges, uniqueness, relationship types, etc.
An equivalence or compatibility table for data types and keys can be provided. E.g., string ≅ varchar, and primary key ≅ unique.
Note: On the Web, the constraint information is often not available, but some can be inferred based on the domain and instance data.
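A compatibility table of this kind could be encoded as groups of interchangeable constraint names. The two groups below mirror the examples on this slide; the structure and the function name are assumptions.

```python
# Each set lists constraint or type names treated as compatible with each other.
COMPATIBLE = [
    {"string", "varchar"},
    {"primary key", "unique"},
]

def compatible(c1, c2):
    """True if the two constraint names are equal or share a compatibility group."""
    c1, c2 = c1.lower(), c2.lower()
    return c1 == c2 or any(c1 in group and c2 in group for group in COMPATIBLE)
```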
Domain and instance-level matching
In many applications, some data instances or attribute domains may be available. Value characteristics are used in matching.
Two different types of domains:
- Simple domain: each value in the domain has only a single component (the value cannot be decomposed).
- Composite domain: each value in the domain contains more than one component.
Match of simple domains
A simple domain can be of any type.
If the data type information is not available (this is often the case on the Web), the instance values can often be used to infer types, e.g.,
- Words may be considered as strings.
- Phone numbers can have a regular expression pattern.
Data type patterns (in regular expressions) can be learned automatically or defined manually. E.g., they can be used to identify such types as integer, real, string, month, weekday, date, time, zip code, phone numbers, etc.
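Manually defined type patterns could be applied as in the sketch below. The patterns are hand-written assumptions (ISO-style dates, US-style phone numbers), and the type of a whole domain is taken by simple majority vote over its instance values.

```python
import re
from collections import Counter

# Illustrative patterns, tried in order; anything unmatched falls back to "string".
TYPE_PATTERNS = [
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("phone", re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")),
    ("integer", re.compile(r"[+-]?\d+")),
    ("real", re.compile(r"[+-]?\d*\.\d+")),
]

def infer_value_type(value):
    """Return the name of the first pattern that matches the whole value."""
    text = value.strip()
    for name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "string"

def infer_domain_type(values):
    """Majority type over all instance values of one attribute."""
    if not values:
        return "string"
    counts = Counter(infer_value_type(v) for v in values)
    return counts.most_common(1)[0][0]
```

A column holding `"312-555-0142"`, `"(312) 555-0199"` and `"n/a"` would be labelled `phone`, since two of the three values fit the phone pattern.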
XML is different from databases
- Limited use of acronyms and abbreviations in XML: names tend to be natural-language words and phrases that the general public can understand, whereas databases use acronyms and abbreviations extensively.
- Limited vocabulary: for easy understanding.
- A large number of similar databases: a large number of sites offer the same services or sell the same products.
- Additional structures: the information is usually organized in some meaningful way, but the organization needs to be understood first.
  - Related attributes are together.
  - Hierarchical organization.
Instance-based matching via footprints
Assume a global schema is given and a set of instances are also given.
The method uses each instance value of every attribute to probe the underlying ontology to obtain the footprints.
These footprints are used to help with a best-faith matching estimate.
It performs matches based on:
- Events
- Relationships
- Spatial characteristics
- Attributes
Entity
- Object: name, description, shape, size, atts
- Event: begin at, stop at, elapse, start, end
- Spatial: touches, overlaps, contains, spans, within
- Rel: parent, child, type-of, is-a, part-of
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Washington DC Monuments</name>
    <description>
      Attached to the ground. Intelligently places itself at the height of the underlying terrain.
    </description>
    <Point>
      <coordinates>-77.0822035425683,-111.42228990140251,0</coordinates>
    </Point>
  </Placemark>
</kml>
<?xml version="1.0" encoding="UTF-8"?>
<poi xmlns="http://nis.atlaspoi.org">
  <poi>
    <name>Washington Monument</name>
    <description>
      Major monuments of the federal capital
    </description>
    <Location>
      <lat>-77.0822035425683</lat>
      <lon>11.42228990140251,0</lon>
    </Location>
  </poi>
</poi>
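To see how instance values can link differently named elements, the sketch below parses shortened copies of the two documents (descriptions and XML declarations omitted) and reports leaf elements whose values share a comma-separated component. This is plain instance-value comparison, not the footprint method itself; `leaf_values` and `value_overlaps` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Washington DC Monuments</name>
    <Point><coordinates>-77.0822035425683,-111.42228990140251,0</coordinates></Point>
  </Placemark>
</kml>"""

POI = """<poi xmlns="http://nis.atlaspoi.org">
  <poi>
    <name>Washington Monument</name>
    <Location><lat>-77.0822035425683</lat><lon>11.42228990140251,0</lon></Location>
  </poi>
</poi>"""

def leaf_values(xml_text):
    """Map each leaf element's local name to its trimmed text."""
    leaves = {}
    for elem in ET.fromstring(xml_text).iter():
        if len(elem) == 0 and elem.text and elem.text.strip():
            local = elem.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            leaves[local] = elem.text.strip()
    return leaves

def value_overlaps(a, b):
    """Pairs of leaf names whose values share a comma-separated component."""
    pairs = []
    for name_a, val_a in a.items():
        for name_b, val_b in b.items():
            if set(val_a.split(",")) & set(val_b.split(",")):
                pairs.append((name_a, name_b))
    return pairs
```

On these two documents, `coordinates` overlaps with both `lat` and `lon`, even though the element names have nothing in common; the composite `coordinates` value is what makes a 1:m candidate visible.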
FootPrint:
  01ZZ 99 18 02 5310
Events (temporal, moving objects):
  99 20
  01XX 77 18 10
  01ZZ 99 20 02 5310 99 20
  05XX 44 12 10
Rel (relationships among objects):
  01ZZ 99 20 02 5310
Spatial (location):
  99 20
  01XX 77 20 10
Next Steps
- How to define footprints
- Define rules to minimize footprint size and count
- Order footprints such that the most appropriate ones are “looked at” first
- Footprints for subtrees?