I. Khalil Ibrahim 1
Data Integration in Digital Libraries: Approaches and Challenges
Bringing Digital Libraries together
Dr. Ismail Khalil Ibrahim
+43 7236 3343 852
www.scch.at
Biography
Dr. Ismail Khalil Ibrahim is a senior software developer and AgenCom project manager at the Software Competence Center Hagenberg, Austria. He worked at the University of Technology, Baghdad, Iraq, from 1985 to 1990 as a lecturer; at the Human Resources Training and Development Institute, Iraq, from 1990 to 1996 as head of the academic studies department; and at Gadjah Mada University from 1996 to 2000 as a teaching and research assistant.
His main research interests lie in the fields of E-commerce & I-Commerce, Database Applications and Techniques for the Web, Practical Experience and Applications in Information Integration Systems, Logic Programming for Information Integration, Agents for Information Retrieval and Knowledge Discovery, XML and Semistructured Data Management, Information Systems Management and Development, and Information Technology: Impact, Economic Analysis. Ismail is a member of ACM, SIGMOD, SIGKDD, and SIGecom; General Secretary of the Indonesian Information Society Initiative (IISI); member of the Iraqi Engineers Association (IEA); overseas collaborator in the E-commerce Lab at the National University of Singapore; member of the editorial board of the Colombian Journal of Computing, “Revista Colombiana de Computación”; chairman of the organizing committee of the 1st and 2nd International Workshop on Information Integration and Web-based Applications & Services (IIWAS'99, IIWAS'00), Yogyakarta, Indonesia; and chairman of the organizing committee of the 3rd International Conference on Information Integration and Web-based Applications & Services (IIWAS'2001), Linz, Austria.
Ismail holds a B.Sc. in Electrical Engineering from the University of Technology, Iraq (1985), and an M.Sc. and a Ph.D. in Computer Engineering and Information Systems from Gadjah Mada University (1998, 2001).
Outline
Data Integration
What is it?
What does a data integration system look like?
What are some data integration challenges?
What Is Data Integration?
Providing:
uniform: sources transparent to user
access: query, and eventually updates
multiple: even two is a problem
autonomous: must not affect the behavior of sources
heterogeneous: different data models, schemas
unstructured: at least semi-structured
information sources: not only databases
Example Scenario
http://www.amazon.com
s1 (Title,Author,Subject)
http://www.book-a-million.com
s2 (ISBN,Title,Publisher)
http://……...
Example Scenario (cont.)
Retrieve the titles and subjects of all the technical reports written by Stephane Bressan and published by MIT Press
q1 amazon (Title,"Stephane Bressan",subject)
q2 book-a-million (ISBN,Title,"MIT Press")
Join the results
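The scenario can be sketched in a few lines of Python. This is only an illustration: the rows, the book titles, and the helper names (`q1`, `q2`, `join_on_title`) are hypothetical, and the join key is assumed to be Title because it is the only attribute that s1 and s2 share.

```python
# Hypothetical rows; in a real system they would come from the two web sources.
amazon_rows = [  # shape of s1(Title, Author, Subject)
    {"Title": "Report One", "Author": "Stephane Bressan", "Subject": "Databases"},
    {"Title": "Report Two", "Author": "Stephane Bressan", "Subject": "Information Systems"},
    {"Title": "Report Three", "Author": "Another Author", "Subject": "Databases"},
]
book_a_million_rows = [  # shape of s2(ISBN, Title, Publisher)
    {"ISBN": "isbn-1", "Title": "Report One", "Publisher": "MIT Press"},
    {"ISBN": "isbn-2", "Title": "Report Two", "Publisher": "Another Press"},
]


def q1(rows, author):
    """Sub-query sent to the first source: select by author."""
    return [r for r in rows if r["Author"] == author]


def q2(rows, publisher):
    """Sub-query sent to the second source: select by publisher."""
    return [r for r in rows if r["Publisher"] == publisher]


def join_on_title(left, right):
    """Join the two partial results on Title and keep (Title, Subject)."""
    titles = {r["Title"] for r in right}
    return [(l["Title"], l["Subject"]) for l in left if l["Title"] in titles]


result = join_on_title(
    q1(amazon_rows, "Stephane Bressan"),
    q2(book_a_million_rows, "MIT Press"),
)
print(result)
```

Neither source can answer the request alone: the author is known only to the first and the publisher only to the second, which is why the join is needed.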
So What is the Problem?
Virtual vs. Materialized Architectures
Access: query or query & update?
problem similar to updating through views
updates need distributed transactional services
Mediated schema: yes or no?
without a mediated schema we lose its advantages
a mediated schema requires schema integration
schema integration needs query transformation
query transformation needs query optimization
Additional Dimensions
How many sources are we accessing?
How autonomous are the sources?
how much knowledge do we have about sources?
how structured are the data in the sources?
Requirements from responses:
accuracy
completeness
machine readable vs. human readable
handling inconsistencies
speed
Closed World Assumption vs. Open World Assumption
Related Technologies / Issues
Distributed databases
sources are homogeneous
data is distributed a priori
sources are not autonomous
Similarities at the optimization and execution level
Information retrieval
keyword search
no semantics
Data mining: discovering properties and patterns in data
Current Applications
Intranets
enterprise data integration
web-site construction
World Wide Web
digital libraries
comparison shopping (Netbot, Junglee)
portals integrating data from multiple resources
XML integration
Science & Culture
medical genetics: integrating genomic data
astrophysics: monitoring events in the sky
environment: Puget Sound regional synthesis model
culture: uniform access to all the cultural databases
Paradigms of Data Integration
[Figure: taxonomy of integration. Labels: Integration; global defined from local; global "independent" of local; CWA; OWA; global-schema-as-view; global-as-view-of-local; local-as-view-of-global; Database Schema Integration; Data Warehousing; Mediation]
Paradigms of Data Integration II
Data Warehousing (materialization architecture)
data of interest is collected in a central place and a web site is built on top of it
queries are applied to the data warehouse
easy to support queries, transactions
hard to modify, the warehouse is not connected to the providers of information, ... etc.
Data Warehousing Architecture
[Figure: Application; Data Warehouse; Data Extraction; three Wrappers, each over a Data Source]
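A minimal sketch of the materialized approach, assuming two sources shaped like DB1 and DB2 from the running example and using an in-memory SQLite database as a stand-in for the warehouse. The table name `book`, the `origin` column, and the functions `wrapper` and `load_warehouse` are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical source contents, shaped like DB1/DB2(title, author, year).
DB1 = [("Title A", "Author X", 1999)]
DB2 = [("Title B", "Author Y", 2000)]


def wrapper(rows, origin):
    """Translate rows of one source into the warehouse row format."""
    for title, author, year in rows:
        yield (title, author, year, origin)


def load_warehouse(conn):
    """Data extraction step: copy every source into the central store."""
    conn.execute(
        "CREATE TABLE book (title TEXT, author TEXT, year INTEGER, origin TEXT)"
    )
    for rows, origin in ((DB1, "DB1"), (DB2, "DB2")):
        conn.executemany("INSERT INTO book VALUES (?, ?, ?, ?)", wrapper(rows, origin))
    conn.commit()


conn = sqlite3.connect(":memory:")
load_warehouse(conn)

# A source changes after the load; the warehouse copy is not updated.
DB1.append(("Title C", "Author Z", 2001))

titles = [t for (t,) in conn.execute("SELECT title FROM book ORDER BY title")]
print(titles)
```

Queries run against a single local database, which is what makes them easy to support, but the copy only reflects the sources as they were at load time.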
Paradigms of Data Integration III
Information Mediation (virtual architecture)
data remains in web sources
rules that relate external data to internal application
data is not replicated and is guaranteed to be up-to-date
query optimization and execution are more complex
Mediation Architecture
[Figure: Application; Global Data Model; Query Execution Engine; Catalog; Local Data Model; two Wrappers, each over a Data Source]
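A minimal sketch of the virtual approach, under the assumption that the catalog simply maps a global relation name to the wrappers that can supply its tuples. The class `Wrapper`, the dictionary `CATALOG`, and the function `execute` are hypothetical names; a real query execution engine would also reformulate and optimize the query.

```python
# Hypothetical source contents, shaped like DB1/DB2(title, author, year).
DB1 = [("Title A", "Author X", 1999)]
DB2 = [("Title B", "Author Y", 2000)]


class Wrapper:
    """Presents one source in the mediator's row format."""

    def __init__(self, rows):
        self._rows = rows  # a reference to the live source, not a copy

    def scan(self):
        # Data is read from the source at query time.
        return list(self._rows)


# Catalog: which wrappers can contribute to which global relation.
CATALOG = {"Book": [Wrapper(DB1), Wrapper(DB2)]}


def execute(relation, predicate):
    """Tiny stand-in for the query execution engine."""
    answers = []
    for source in CATALOG[relation]:
        answers.extend(row for row in source.scan() if predicate(row))
    return answers


before = execute("Book", lambda row: True)
DB1.append(("Title C", "Author Z", 2001))  # a source changes
after = execute("Book", lambda row: True)
print(len(before), len(after))
```

Because nothing is copied, the second query sees the new tuple immediately; the price is that every query has to reach the sources.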
Running Example
World Relations:
Book(title,year,author,subject)
BookYear(title,year)
BookRev(title,author,review)
Source Relations:
DB1(title,author,year)
DB2(title,author,year)
DB3(title,review)
Mappings between world and source relations: GAV, LAV
Global As View (GAV)
Define a global schema of objects and write down rules to collect these objects
for each relation R in the mediated schema, we write a query over the sources' relations specifying how to obtain R's tuples from the sources (query unfolding)
traditional query processing applies
requires the right sources to be available and compliant
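Query unfolding can be illustrated with the relations of the running example. The two mapping rules below are assumptions made for this sketch (the slides name the relations but do not give the rules), and the sample tuples are hypothetical.

```python
# Hypothetical source contents for the running example.
DB1 = [("Title A", "Author X", 1999)]            # DB1(title, author, year)
DB2 = [("Title B", "Author Y", 2000)]            # DB2(title, author, year)
DB3 = [("Title A", "good"), ("Title B", "dense")]  # DB3(title, review)


def book_year():
    """Assumed GAV rule: BookYear(title, year) is DB1 union DB2, projected."""
    return {(t, y) for (t, _a, y) in DB1 + DB2}


def book_rev():
    """Assumed GAV rule: BookRev(title, author, review) joins DB1 or DB2 with DB3 on title."""
    return {(t, a, r) for (t, a, _y) in DB1 + DB2 for (t3, r) in DB3 if t == t3}


def reviews_of_books_from(year):
    """A query written against the global relations only.

    Calling book_year() and book_rev() replaces each global relation by its
    definition over the sources, which is the unfolding step.
    """
    years = book_year()
    return sorted((t, r) for (t, _a, r) in book_rev() if (t, year) in years)


print(reviews_of_books_from(1999))
```

The query never mentions DB1, DB2, or DB3; if one of them is unavailable, the rule that depends on it cannot be evaluated.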
Local As View (LAV)
For every information source S, we write a query over the relations in the mediated schema that describes which tuples are found in S (query folding, or answering queries using views)
may be able to answer a query based on the available partial information
generally, may not be able to answer the query
needs non standard query processing techniques
potentially high complexity
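A deliberately simplified sketch of the LAV idea: each source is described as a view over the global Book relation, and the mediator uses those descriptions to decide which sources can contribute to a query. The year ranges, the class `SourceDescription`, and both functions are assumptions made for illustration; this is only a source-selection step, not a full algorithm for answering queries using views.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, str, int]  # (title, author, year)


@dataclass
class SourceDescription:
    """Hypothetical description: the source holds Book tuples whose year
    lies between min_year and max_year (it may not hold all of them)."""
    name: str
    rows: List[Row]
    min_year: int
    max_year: int


SOURCES = [
    SourceDescription("DB1", [("Title A", "Author X", 1985)], 0, 1989),
    SourceDescription("DB2", [("Title B", "Author Y", 1995)], 1990, 9999),
]


def relevant_sources(year):
    """Keep only the sources whose description is compatible with the query."""
    return [s for s in SOURCES if s.min_year <= year <= s.max_year]


def books_in_year(year):
    """Answer the query from the relevant sources only."""
    answers = []
    for source in relevant_sources(year):
        answers.extend(row for row in source.rows if row[2] == year)
    return answers


print([s.name for s in relevant_sources(1995)])
print(books_in_year(1995))
```

Adding a source only requires adding one description, but the answers are limited to whatever the described sources happen to contain.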
Challenges
Complexity over traditional DBs: heterogeneous, autonomous, network-bounded sources
Query reformulation now understood
map queries over mediated schemas to "wrapped" sources (heterogeneity)
Issues remain in query processing
few statistics (autonomous sources)
unanticipated delays and failures (network-bounded sources)
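One common way to cope with slow or failing sources is to query them concurrently and give up on any source that misses a deadline. The sketch below uses Python's standard `concurrent.futures` module; the three source functions, the `query_all` helper, and the 0.1 second deadline are hypothetical choices for illustration, not something the slides prescribe.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def fast_source():
    return [("Title A", 1999)]


def slow_source():
    time.sleep(0.5)  # simulates an unanticipated network delay
    return [("Title B", 2000)]


def failing_source():
    raise ConnectionError("source unreachable")


def query_all(sources, timeout_s):
    """Query every source in parallel; skip those that are late or fail."""
    answers, skipped = [], []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                answers.extend(future.result(timeout=timeout_s))
            except FutureTimeout:
                skipped.append(name)  # too slow: answer without it
            except ConnectionError:
                skipped.append(name)  # failed: answer without it
    # Note: leaving the "with" block still waits for the slow thread to end.
    return answers, skipped


answers, skipped = query_all(
    {"fast": fast_source, "slow": slow_source, "failing": failing_source},
    timeout_s=0.1,
)
print(answers, skipped)
```

The result is a partial answer plus the list of sources that were left out, so the caller can decide whether the answer is complete enough.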
Conclusions
Data integration addresses many of the problems that embedded systems applications need solved
Many data sources
Easy addition and deletion of sources
Different source capabilities
Dealing with network delays
Easy for user
Publications
• Semantic Query Transformation for the Integration of Autonomous Information Sources (INAP’99 – Tokyo)
• IKA: Unity in Heterogeneity (IIWAS’99 – Yogyakarta)
• Information Retrieval Agents for the Intelligent Integration of Information Sources (MulNet 2000 – Bandung)
• A Multilingual Natural Language Interface for Mediating E-Commerce Product Catalogs (INAP2000 – Tokyo)
• Semantic Query Transformation for the Intelligent Integration of Information Sources over the Web (WIIW2001 – Rio de Janeiro)
• Rewriting Rules for Semantic Query Transformation in E-Commerce Applications (DS9 – Hong Kong)
• Data Integration in Digital Libraries: Challenges and Approaches (IndonesiaDL – Bandung)