Post on 20-Jul-2015
PalGov © 2011
أكاديمية الحكومة الإلكترونية الفلسطينية
The Palestinian eGovernment Academy
www.egovacademy.ps
Tutorial II: Data Integration and Open Information Systems
Session 12.1
The problem of Data Integration
Dr. Mustafa Jarrar
University of Birzeit
mjarrar@birzeit.edu
www.jarrar.info
About
This tutorial is part of the PalGov project, funded by the TEMPUS IV program of the
Commission of the European Communities, grant agreement
511159-TEMPUS-1-2010-1-PS-TEMPUS-JPHES. The project website: www.egovacademy.ps
Project Consortium:
University of Trento, Italy
University of Namur, Belgium
Vrije Universiteit Brussel, Belgium
TrueTrust, UK
Birzeit University, Palestine (Coordinator)
Palestine Polytechnic University, Palestine
Palestine Technical University, Palestine
Université de Savoie, France
Ministry of Local Government, Palestine
Ministry of Telecom and IT, Palestine
Ministry of Interior, Palestine
Coordinator:
Dr. Mustafa Jarrar
Birzeit University, P.O. Box 14, Birzeit, Palestine
Telfax: +972 2 2982935 mjarrar@birzeit.edu
© Copyright Notes
Everyone is encouraged to use this material, or part of it, but should
properly cite the project (logo and website), and the author of that part.
No part of this tutorial may be reproduced or modified in any form or by
any means without prior written permission from the project, which holds
the full copyright on the material.
Attribution-NonCommercial-ShareAlike
CC-BY-NC-SA
This license lets others remix, tweak, and build upon your work
non-commercially, as long as they credit you and license their new creations
under the identical terms.
Tutorial Map
Topic Hours
Session 1: XML Basics and Namespaces 3
Session 2: XML DTDs 3
Session 3: XML Schemas 3
Session 4: Lab-XML Schemas 3
Session 5: RDF and RDFs 3
Session 6: Lab-RDF and RDFs 3
Session 7: OWL (Ontology Web Language) 3
Session 8: Lab-OWL 3
Session 9: Lab-RDF Stores - Challenges and Solutions 3
Session 10: Lab-SPARQL 3
Session 11: Lab-Oracle Semantic Technology 3
Session 12_1: The problem of Data Integration 1.5
Session 12_2: Architectural Solutions for the Integration Issues 1.5
Session 13_1: Data Schema Integration 1
Session 13_2: GAV and LAV Integration 1
Session 13_3: Data Integration and Fusion using RDF 1
Session 14: Lab-Data Integration and Fusion using RDF 3
Session 15_1: Data Web and Linked Data 1.5
Session 15_2: RDFa 1.5
Session 16: Lab-RDFa 3
Intended Learning Objectives
A: Knowledge and Understanding
2a1: Describe tree and graph data models.
2a2: Understand the notation of XML, RDF, RDFS, and OWL.
2a3: Demonstrate knowledge about querying techniques for data
models, such as SPARQL and XPath.
2a4: Explain the concepts of identity management and Linked Data.
2a5: Demonstrate knowledge about integration and fusion of
heterogeneous data.
B: Intellectual Skills
2b1: Represent data using tree and graph data models (XML &
RDF).
2b2: Describe data semantics using RDFS and OWL.
2b3: Manage and query data represented in RDF, XML, OWL.
2b4: Integrate and fuse heterogeneous data.
C: Professional and Practical Skills
2c1: Using Oracle Semantic Technology and/or Virtuoso to store
and query RDF stores.
D: General and Transferable Skills
2d1: Working in a team.
2d2: Presenting and defending ideas.
2d3: Use of creativity and innovation in problem solving.
2d4: Develop communication skills and logical reasoning abilities.
Module ILOs
After completing this module students will be able to:
- Understand the importance of Data Integration.
- Understand the problems and challenges of Data Integration.
Example from the government Domain
• Consider all interactions with government agencies in order to register
a new business in Palestine.
• Example: Establishing a new Radio Station.
[Figure: the agencies involved are the Ministry of Telecom, the Ministry of
Information, the Ministry of National Economy, the Chamber of Commerce, and
the Ministry of Finance.]
Example from the government Domain
• Consider when the business evolves or changes.
• Example: Changing the address of the radio station.
– Address must be changed in 5 different databases.
[Figure: the five databases belong to the Ministry of Telecom, the Ministry of
Information, the Ministry of National Economy, the Chamber of Commerce, and
the Ministry of Finance.]
Example from the government Domain
• Consider the data registered about the same radio station in the
databases of different ministries and governmental agencies:
Agency 1:
ID      Name           Type           Location
R2563I  Radio Al-Amal  Radio Station  Ramallah

Agency 2:
B_ID    Business Name      Activity Type       City
LM1847  Al-Amal Broadcast  Radio Broadcasting  Ramallah and Bireh

Agency 3:
ID      Company Name       Company Type          Location
182NS3  Broadcast Al-Amal  Broadcasting Station  Al-Balu’

. . .
Example from the government Domain
• From our simple example one can point out some challenges in
Data Integration:
– No agreed-upon naming (Name, Business Name, Company Name).
– No agreed-upon meaning (does 'Activity Type' mean exactly the same as
'Company Type'?).
– Different registered data: Radio Al-Amal, Al-Amal Broadcast, ….

Agency 1:
ID      Name           Type           City
R2563I  Radio Al-Amal  Radio Station  Ramallah

Agency 2:
B_ID    Business Name      Activity Type       Province
LM1847  Al-Amal Broadcast  Radio Broadcasting  Ramallah and Bireh

Agency 3:
ID      Company Name       Company Type          Location
182NS3  Broadcast Al-Amal  Broadcasting Station  Al-Balu’

. . .
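The naming problem above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target vocabulary (id, name, type, location) and the names COLUMN_MAP, RECORDS, and to_common are hypothetical, and mapping 'Activity Type' and 'Company Type' onto one column assumes they mean the same thing, which is exactly what the slide questions.

```python
# Column names and values are taken from the three agency tables above.
# The shared vocabulary (id, name, type, location) is a hypothetical choice.
COLUMN_MAP = {
    "agency1": {"ID": "id", "Name": "name", "Type": "type", "City": "location"},
    "agency2": {"B_ID": "id", "Business Name": "name",
                "Activity Type": "type", "Province": "location"},
    "agency3": {"ID": "id", "Company Name": "name",
                "Company Type": "type", "Location": "location"},
}

RECORDS = {
    "agency1": {"ID": "R2563I", "Name": "Radio Al-Amal",
                "Type": "Radio Station", "City": "Ramallah"},
    "agency2": {"B_ID": "LM1847", "Business Name": "Al-Amal Broadcast",
                "Activity Type": "Radio Broadcasting",
                "Province": "Ramallah and Bireh"},
    "agency3": {"ID": "182NS3", "Company Name": "Broadcast Al-Amal",
                "Company Type": "Broadcasting Station", "Location": "Al-Balu'"},
}


def to_common(agency, record):
    """Rename one agency record's columns to the shared vocabulary."""
    mapping = COLUMN_MAP[agency]
    unified = {mapping[column]: value for column, value in record.items()}
    unified["source"] = agency  # keep provenance so conflicts can be traced
    return unified


unified_rows = [to_common(agency, record) for agency, record in RECORDS.items()]

for row in unified_rows:
    print(row)

# Renaming alone does not reconcile the data: the three rows still carry
# three different ids, names, and locations for the same radio station.
distinct_names = {row["name"] for row in unified_rows}
print(len(distinct_names), "different names for one station")
```

Renaming columns addresses only the name heterogeneity. Deciding that the three rows describe one station, and which value to keep, is a separate step that this sketch does not attempt.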
Problem is in all domains
• The problem is now even more challenging with the Web.
• The Data Web envisions the Web as a global, world-wide database.
• This means that one can query multiple distributed databases on the
Web as if querying a single local database.
Challenges of Data Integration:
Heterogeneities in Database Schemas
• One can distinguish between several heterogeneities
between different schemas:
– Name Heterogeneities (difference in used vocabulary).
– Meaning Heterogeneities (different meaning for the same attribute
in two schemas).
– Heterogeneities in the structure and type.
– Heterogeneities in the rules and constraints.
– Data Model Heterogeneities.
Name and Meaning Heterogeneities
• Synonyms – Different names for the same concepts
– employee, clerk
– exam, course
– code, num
• Homonyms – Same name for different concepts (different meanings)
– City as City of Birth in one schema,
– City as City of Residence in another schema.
Homonyms (example):
– Salary: Net Salary
– Salary: Gross Salary
Synonyms (example):
– Section, Division: a specialized division of a large organization
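One possible way to handle both cases in code is sketched below. Synonyms can be resolved with a plain lookup, while homonyms need the schema as part of the key, because the attribute name alone is ambiguous. The schema labels schema_a and schema_b, the canonical names, and the function canonical are assumptions made for this illustration.

```python
# Synonyms: different names for the same concept map to one canonical term.
SYNONYMS = {
    "employee": "employee",
    "clerk": "employee",
    "code": "code",
    "num": "code",
    "section": "division",
    "division": "division",
}

# Homonyms: the same name stands for different concepts, so the lookup key
# must include the schema. The schema labels here are hypothetical.
HOMONYMS = {
    ("schema_a", "city"): "city_of_birth",
    ("schema_b", "city"): "city_of_residence",
    ("schema_a", "salary"): "net_salary",
    ("schema_b", "salary"): "gross_salary",
}


def canonical(schema, attribute):
    """Return the canonical attribute name for an attribute of a schema."""
    key = attribute.strip().lower()
    if (schema, key) in HOMONYMS:
        return HOMONYMS[(schema, key)]
    # Unknown names are passed through unchanged (lower-cased).
    return SYNONYMS.get(key, key)


print(canonical("schema_a", "City"))
print(canonical("schema_b", "City"))
print(canonical("schema_a", "clerk"))
```

The tables themselves cannot be derived automatically from the names. Someone who knows what each schema means has to write them down.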
Heterogeneities in Structure and Type
• The same concepts are represented with different conceptual
structures in two schemas:
– Attribute in one schema and derived value in another schema.
– Attribute in one schema and entity in another schema.
– Entity in one schema and relationship in another schema.
– Different abstraction levels for the same concept in two schemas:
e.g. two entities with homonym names related by an IS-A hierarchy
in two schemas.
Source: Carlo Batini
Heterogeneities in Structure
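• EXAMPLES:
[Figure: three pairs of schemas that model the same concepts with different
structures: BOOK and PUBLISHER; EMPLOYEE, DEPARTMENT, PROJECT versus EMPLOYEE,
PROJECT; Person with MAN and WOMAN versus Person with GENDER.]
Source: Carlo Batini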
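The Person example (MAN and WOMAN as separate entities in one schema, GENDER as an attribute of Person in the other) can be sketched as follows. The sample rows, the "M"/"F" codes, and all names in the code are invented for illustration, and the attribute form is picked as the common representation only as one possible design choice.

```python
from dataclasses import dataclass

# Schema A (assumed shape): gender is implied by the entity a row belongs to.
schema_a = {
    "MAN": [{"name": "Omar"}],
    "WOMAN": [{"name": "Lina"}],
}

# Schema B (assumed shape): gender is an explicit attribute of Person.
schema_b = [{"name": "Sami", "gender": "M"}]


@dataclass(frozen=True)
class Person:
    """Common representation: gender stored as an attribute."""
    name: str
    gender: str


SUBTYPE_TO_GENDER = {"MAN": "M", "WOMAN": "F"}


def from_schema_a(tables):
    """Turn the entity a row belongs to into an attribute value."""
    return [
        Person(row["name"], SUBTYPE_TO_GENDER[subtype])
        for subtype, rows in tables.items()
        for row in rows
    ]


def from_schema_b(rows):
    """Schema B already matches the common representation."""
    return [Person(row["name"], row["gender"]) for row in rows]


people = from_schema_a(schema_a) + from_schema_b(schema_b)
for person in people:
    print(person)
```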
Heterogeneities in Type
• EXAMPLES:
– Different types for a single attribute (e.g., numeric vs. alphanumeric).
E.g., the attribute “gender”: Male/Female, M/F, or 0/1.
– Year has a four-digit domain in one schema and a two-digit domain in
another schema.
– Different currencies (Euros, US Dollars, etc.).
– Different measurement systems (kilos vs. pounds, Centigrade vs.
Fahrenheit).
– Different granularities (grams, kilos, etc.).
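A minimal sketch of how such type differences might be normalized is given below. Every function name is hypothetical. Two assumptions deserve attention: the 0/1 gender encoding is treated as ambiguous unless the source supplies its own mapping, and two-digit years are expanded with an arbitrary pivot of 30. Currencies are left out because they would need exchange rates.

```python
GENDER_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}


def normalize_gender(value, numeric_codes=None):
    """Map Male/Female, M/F, or a numeric code to 'M' or 'F'."""
    if isinstance(value, int):
        # 0/1 says nothing by itself about which value means which gender,
        # so the caller has to pass the source schema's own mapping.
        if numeric_codes is None:
            raise ValueError("numeric gender code needs an explicit mapping")
        return numeric_codes[value]
    return GENDER_CODES[str(value).strip().lower()]


def expand_year(year, pivot=30):
    """Expand a two-digit year. The pivot rule is an assumption:
    values below the pivot become 20xx, all others become 19xx."""
    if year >= 100:
        return year  # already four digits
    return 2000 + year if year < pivot else 1900 + year


KILOGRAMS_PER_UNIT = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def to_kilograms(value, unit):
    """Convert a weight in grams, kilos, or pounds to kilograms."""
    return value * KILOGRAMS_PER_UNIT[unit]


def fahrenheit_to_celsius(degrees_f):
    """Convert a Fahrenheit temperature to Centigrade."""
    return (degrees_f - 32) * 5 / 9


print(normalize_gender("Female"))
print(normalize_gender(1, numeric_codes={0: "M", 1: "F"}))
print(expand_year(11), expand_year(85))
print(to_kilograms(2, "lb"))
print(fahrenheit_to_celsius(212))
```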
Heterogeneities in the rules and constraints
• EXAMPLES:
– Different cardinalities in the same relationships
– Key conflicts
Source: Carlo Batini
Model Heterogeneities
• Model heterogeneity occurs when different databases adhere to
different data models:
– Relational Data Model, XML, RDF, Object-Oriented, OWL, ...
• Solution: reduce model heterogeneity by using one data model.
• Example: convert the relational model to the RDF graph model.
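The conversion mentioned in the last bullet can be illustrated with a small sketch that turns one relational row into RDF triples and prints them in an N-Triples-like form. The namespace http://example.org/, the URI layout, and the function row_to_triples are assumptions chosen for this example and do not describe a particular tool or standard mapping. The sample row is the Agency 1 record from the earlier slide.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace for the example
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def row_to_triples(table, key_column, row):
    """Convert one relational row into (subject, predicate, object) triples.

    Assumed layout: the row becomes a resource named after the table and its
    key value, the table becomes its class, and every other column becomes a
    property with a plain string literal as its value.
    """
    subject = f"<{BASE}{table}/{quote(str(row[key_column]))}>"
    triples = [(subject, RDF_TYPE, f"<{BASE}{table}>")]
    for column, value in row.items():
        if column == key_column:
            continue  # the key is already encoded in the subject URI
        predicate = f"<{BASE}{table}#{quote(column)}>"
        # Only backslashes and double quotes are escaped in this sketch.
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, predicate, f'"{text}"'))
    return triples


row = {"ID": "R2563I", "Name": "Radio Al-Amal",
       "Type": "Radio Station", "City": "Ramallah"}
triples = row_to_triples("Agency1", "ID", row)

for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Once every source is expressed as triples, the remaining heterogeneities (names, meanings, types) still have to be resolved, but they can be handled within a single data model.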
References
• Carlo Batini: Course on Data Integration. BZU IT Summer School, 2011.
• Stefano Spaccapietra: Information Integration. Presentation at the IFIP Academy, Porto Alegre, 2005.
• Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI International, Artificial Intelligence Center, Menlo Park, USA, 2009.