From Data Integration To Semantic Mediation: Addressing Heterogeneities in Data Bertram Ludäscher...

Date post: 17-Jan-2016
Page 1: From Data Integration To Semantic Mediation: Addressing Heterogeneities in Data Bertram Ludäscher Bertram Ludäscher LUDAESCH@SDSC.EDU Knowledge-Based Information.

From Data Integration To Semantic Mediation:

Addressing Heterogeneities in Data

Bertram Ludäscher

LUDAESCH@SDSC.EDU

Knowledge-Based Information Systems Lab

San Diego Supercomputer Center

and

Department of Computer Science & Engineering

University of California, San Diego

Page 2

Outline

1. Information Integration from a Database Perspective

2. XML-Based Data Integration

3. Model-Based / Semantic Mediation

4. Discussion

Page 3

An Online Shopper’s Information Integration Problem

El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”

? Information Integration

addall.com

amazon.com, A1books.com, half.com, barnes&noble.com

“One-World” Scenario: XML-based mediator

Mediator (virtual DB) (vs. data warehouse)

Page 4

A Home Buyer’s Information Integration Problem

Which houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?

? Information Integration

Sources: Realtor, Demographics, School Rankings, Crime Stats

“Multiple-Worlds” Scenario: XML-based mediator

Page 5

A Neuroscientist’s Information Integration Problem

What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?

? Information Integration

Sources: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)

“Complex Multiple-Worlds” Scenario: Model-based mediator

Page 6

A Geoscientist’s Information Integration Problem

What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?

? Information Integration

Sources: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)

“Complex Multiple-Worlds” Scenario: Model-based mediator

Page 7

Information Integration Challenges: Heterogeneities = S^4 ...

• System Aspects
– platforms, devices, distribution, APIs, protocols, …

• Syntaxes
– heterogeneous data formats (one for each tool ...)

• Structures
– heterogeneous schemas (one for each DB ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)

• Semantics
– unclear & “hidden” semantics: e.g., incoherent terminology, multiple / informal taxonomies, implicit assumptions, ...

Page 8

Information Integration Challenges

• System aspects: “Grid” middleware
– distributed data & computing
– Web services, WSDL/SOAP, …
– sources = functions, files, databases, …

• Syntax & Structure: (XML-Based) Mediators
– wrapping, restructuring
– (XML) queries and views
– sources = (XML) databases

• Semantics: Model-Based/Semantic Mediators
– conceptual models and declarative views
– Semantic Web: ontologies, description logics, RDF(S), DAML+OIL, OWL, ...
– sources = knowledge bases (DB+CMs+ICs)

Reconciling S^4 heterogeneities (system aspects, syntax, structure, semantics):
– “gluing” together multiple data sources
– bridging information and knowledge gaps computationally

Page 9

Information Integration from a DB Perspective

• Information Integration Problem
– Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
– Find: the answers to Q1, ..., Qn

• The Database Perspective: source = “database”
– Si has a schema (relational, XML, OO, ...)
– Si can be queried
– define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
– questions become queries Qi against V(S1, ..., Sk)
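The database perspective above can be sketched in a few lines of Python. This is only a toy illustration, not code from the talk: the two sources, their field names, and the helpers `integrated_view` and `cheapest` are assumptions invented for the example.

```python
# Toy sources S1 and S2, each with its own schema (hypothetical data).
S1 = [{"title": "Tractatus", "price": 12.0, "shipping": 3.0}]
S2 = [{"name": "Tractatus", "total_cost": 14.5},
      {"name": "Philosophical Investigations", "total_cost": 20.0}]


def integrated_view(s1, s2):
    """Virtual integrated view V(S1, S2): yields rows in one common schema."""
    for row in s1:
        yield {"title": row["title"],
               "cost": row["price"] + row["shipping"],
               "source": "S1"}
    for row in s2:
        yield {"title": row["name"], "cost": row["total_cost"], "source": "S2"}


def cheapest(view, title):
    """A user question expressed as a query Q against the view V."""
    offers = [r for r in view if r["title"] == title]
    return min(offers, key=lambda r: r["cost"]) if offers else None


best = cheapest(integrated_view(S1, S2), "Tractatus")
print(best)
```

The view is a generator, so nothing is materialized until a query iterates over it; a materialized variant would simply store the rows first.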

Page 10

Outline

1. Information Integration from a Database Perspective

2. XML-Based Data Integration

3. Model-Based / Semantic Mediation

4. Discussion

Page 11

Extensible Markup Language (XML)

• (meta)language for marking up text & data with user-definable tags
– (X)HTML, XSLT, XML Schema, ...
– MathML, BioML, GeoML, NeuroML, ...
– XML-RPC, SOAP, WSDL, OWL, ...

• semistructured tree data model
– flexible: marked-up text, web pages, databases, ...

• container model:
– “boxes within boxes”

Plain text, then progressively marked up:

... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...

... in their wonderful book called <title>SemWeb Tractat</title> by B. Schatz and T.B. Lee, the authors show how ...

... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T.B. Lee</author>, the authors show how ...

As an XML element:

<book> <title>SemWeb Tractat</title> <author>B. Schatz</author> <author>T.B. Lee</author> </book>

As a tree: book with children title (“SemWeb Tractat”), author (“B. Schatz”), author (“T.B. Lee”)

As nested containers: book: [ title: “SemWeb Tractat”, author: “B. Schatz”, author: “T.B. Lee” ]
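The tree reading of the `<book>` element can be checked with Python's standard `xml.etree.ElementTree` module. This is a small sketch added for illustration; the helper `show` is a hypothetical name, not something from the slides.

```python
import xml.etree.ElementTree as ET

doc = """<book>
  <title>SemWeb Tractat</title>
  <author>B. Schatz</author>
  <author>T.B. Lee</author>
</book>"""

root = ET.fromstring(doc)


def show(elem, depth=0):
    """Return the element tree as indented 'tag: "text"' lines."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + elem.tag + (f': "{text}"' if text else "")]
    for child in elem:
        lines.extend(show(child, depth + 1))
    return lines


tree_lines = show(root)
print("\n".join(tree_lines))

# Repeated children are allowed, which is part of what makes the model semistructured.
authors = [a.text for a in root.findall("author")]
```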

Page 12

XML-Based Mediator Architecture

[Figure: a USER/Client sends a query Q(G(S1, ..., Sk)) to the MEDIATOR and exchanges XML queries & results with it. The MEDIATOR holds the Integrated Global XML View G, given by the Integrated View Definition G(..) ← S1(..) … Sk(..). Each source S1, S2, ..., Sk sits behind a Wrapper that exports an XML View.]

Page 13

Some Challenges in XML-Based Integration ...

• XML Query/Transformation Languages
– DB community: QLs for semistructured data, e.g., TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
– CSE/SDSC: XMAS [SSD99, SIGMOD99, WebDB99, EDBT00]
– W3C: XPath, XSLT, XQuery (Working Draft, June 2001)

• XML Schema Languages
– DTDs, RELAX NG, XML Schema, ... [XMLDM02]

• DB Theoreticians:
– Expressiveness/Complexity Trade-Off
  • querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ..., all
  • reasoning: query satisfiability, containment, equivalence
  • ...

Page 14

XMAS: XML Matching And Structuring language

Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”

CONSTRUCT <books>
            <book>
              $a1
              $t
              <pubs>
                $p { $p }
              </pubs>
            </book> { $a1, $t }
          </books>
WHERE <books.book>
        $a1 : <author />
        $t : <title />
      </> IN "amazon.com"
  AND <authors.author>
        $a2 : <author />
        <pubs> $p : <pub/> </>
      </> IN "www...DBLP…"
  AND value( $a1 ) = value( $a2 )

XMAS Algebra

[QL98,SIGMOD99] [EDBT00]
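To make the intent of the view definition concrete, here is a hedged Python sketch of the same join-and-group logic over in-memory data. It is not XMAS and not the mediator's actual evaluation strategy; the sample records and the function name `integrated_books` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two wrapped sources.
amazon_books = [
    {"author": "B. Schatz", "title": "SemWeb Tractat"},
    {"author": "T.B. Lee", "title": "SemWeb Tractat"},
]
dblp_authors = [
    {"author": "B. Schatz", "pubs": ["pub1", "pub2"]},
    {"author": "A. Nobody", "pubs": ["pub3"]},
]


def integrated_books(books, authors):
    """Join on author value, then group publications by (author, title)."""
    groups = defaultdict(list)
    for b in books:
        for a in authors:
            if b["author"] == a["author"]:  # value($a1) = value($a2)
                # collect every $p for this (author, title) group
                groups[(b["author"], b["title"])].extend(a["pubs"])
    return [{"author": au, "title": t, "pubs": pubs}
            for (au, t), pubs in groups.items()]


result = integrated_books(amazon_books, dblp_authors)
print(result)
```

Only authors present in both sources appear in the result, mirroring the equality condition in the WHERE clause.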

Page 15

XML (XMAS) Query Processing

[Figure: query processing pipeline.
Compile-time: XML Query Q and XML Global View Definition G(S) → Translator → algebraic plans → Composition Q(G) → composed plan → Rewriter/Optimizer: Q’(S) → optimized plan.
Run-time (query evaluation): Plan Execution.]

Page 16

… New Challenges in (XML-Based) Mediation

• Global-As-View (GAV)
– user query Q over global relations G: Q(G)
– global relations G defined over source relations S: G(S)
– challenge: compute answers Q(G(V(S))) without computing all of V and G
– query rewriting (with limited source capabilities): Q’(S) = Q(G)

• Local-As-View (LAV)
– user query Q over global relations G: Q(G)
– source relations S defined over global relations G: S(G)
– challenge: “reverse/rewrite rules” from S(G) to some G’(S)
– answering queries using views: equivalent rewritings may not exist; find maximally contained ones: Q’(G’(S)) ⊆ Q(G)

• Inter(CS)disciplinary research needed: DB, FP, LP
– GAV/LAV, view (un)folding, Clark’s completion, resolution, factoring
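The GAV challenge (answering Q without computing all of G) can be illustrated with a minimal Python sketch. It assumes the simplest possible view definition, a union of two sources, and a selection query; the relation contents and function names are hypothetical.

```python
# Hypothetical source relations as lists of (title, price) tuples.
S1 = [("Tractatus", 15.0), ("Investigations", 22.0)]
S2 = [("Tractatus", 14.5), ("On Certainty", 9.0)]


def G(sources):
    """GAV definition: global relation offer(title, price) = union of the sources."""
    for s in sources:
        yield from s


def q_naive(title):
    """Q(G(S)): materialize all of G, then apply the selection."""
    everything = list(G([S1, S2]))
    return sorted(t for t in everything if t[0] == title)


def q_unfolded(title):
    """Q'(S): unfold G into Q and push the selection down to each source."""
    answers = []
    for s in (S1, S2):
        # in a real mediator this filter would run at the source
        answers.extend(t for t in s if t[0] == title)
    return sorted(answers)


print(q_naive("Tractatus") == q_unfolded("Tractatus"))
```

Both functions return the same answers; the unfolded form never builds the full global relation, which is the point of rewriting Q(G) into Q’(S).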

Page 17

Querying XML Streams: A New Frontier

• New applications for stream-based XML processing:
– Continuous, real-time data streams (wireless sensor networks, …)
– Data / message transformation in Web services (SOAP, RMI, processing …)
– Extract-transform-load applications (Tera/Peta-byte archival migration, …)

• … leading to a new XML querying & transformation paradigm:
– how to execute (some) XML queries & transformations on very large (infinite) data streams using only limited memory
– XML stream machine (XSM): extended XML transducers with buffers
– XQuery → XSM network

XSMs clearly outperform tree-based approaches on streamable queries (100x over Xalan) [A Transducer-Based XML Query Processor, Ludäscher, Mukhopadhyay, Papakonstantinou, VLDB’02]
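The limited-memory idea can be tried with Python's standard incremental parser. This sketch is not an XSM and makes no claim about the VLDB'02 system; it only shows a streamable aggregate evaluated element by element. The element names `item` and `price` are assumptions.

```python
import io
import xml.etree.ElementTree as ET


def total_price(stream):
    """Sum <price> values from a stream of <item> elements, one item at a time."""
    total = 0.0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            total += float(elem.findtext("price", default="0"))
            # Free the item's contents. The emptied elements still hang off the
            # root, so a production version would also detach them.
            elem.clear()
    return total


xml_bytes = b"<items>" + b"".join(
    b"<item><price>%d</price></item>" % i for i in range(1, 5)
) + b"</items>"

result = total_price(io.BytesIO(xml_bytes))
print(result)
```

A query that needs to see the whole document before emitting anything (for example, sorting) cannot be handled this way, which is why the slide speaks of "(some)" queries.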

Page 18

Outline

1. Information Integration from a Database Perspective

2. XML-Based Data Integration

3. Model-Based / Semantic Mediation

4. Discussion

Page 19

A Neuroscientist’s Information Integration Problem

What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?

? Information Integration

Sources: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)

“Complex Multiple-Worlds” Mediation

Page 20

A Geoscientist’s Information Integration Problem

What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?

? Information Integration

Sources: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)

“Complex Multiple-Worlds” Mediation

Page 21

What’s the Problem with XML & Complex Multiple-Worlds?

• XML is Syntax
– ... for labeled ordered trees
– ... all semantics lies outside of XML
  • XML DTDs => tags + nesting
  • XML Schema => DTDs + data modeling
  • need anything else? => write comments!

• Domain Semantics is Complex:
– implicit assumptions, hidden semantics
– sources seem unrelated to the non-expert

• Need Structure and Semantics beyond trees!
– employ richer OO models
– make domain semantics and “glue knowledge” explicit
– use ontologies to fix terminology and conceptualization
– avoid ambiguities by using KR and formal semantics

Page 22

Information Integration Landscape

[Figure: two-axis landscape. Horizontal axis: conceptual distance (one-world to multiple-worlds). Vertical axis: conceptual complexity/depth (low to high). Regions: DB mediation techniques; Ontologies, KR formalisms; Model-Based Mediation. Plotted examples: addall, book-buyer, BLAST, EcoCyc, Cyc, WordNet, GO, home-buyer, 24x7 consumer, UMLS, MIA, Entrez, RiboWeb, Tambis, Bioinformatics, Geo-, Ecoinformatics.]

Page 23

XML-Based vs. Model-Based Mediation

XML-Based Mediation:
– Raw Data → XML Elements → XML Models
– Structural Constraints (DTDs), Parent, Child, Sibling, ... (e.g., A = (B*|C), D; B = ...)
– No Domain Constraints
– Integrated-DTD defined by XML-QL(Src1-DTD, ...)

Model-Based Mediation:
– Raw Data → (XML) Objects → Conceptual Models
– Classes, Relations, is-a, has-a, ... (e.g., classes C1, C2, C3 and relation R)
– Logical Domain Constraints (IF ... THEN ...)
– “Glue Maps” = Domain & Process Maps (ontologies)
– Integrated-CM defined by CM-QL(Src1-CM, ...)

CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, DAML+OIL, …}

Page 24

What’s the Glue? What’s in a Link?

• Syntactic Joins (equality)
  (X,Y) := X.SSN = Y.SSN
  (X,Y) := X.UMLS-ID = Y.UID

• “Speciality” Joins (similarity)
  (X,Y,Score) := BLAST(X,Y,Score)

• Semantic/Rule-Based Joins
  (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8   (homology, lub)
  (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.   (rule-based)
  e.g., X = secretase, B = beta amyloid, Y = Alzheimer’s disease

• CS Challenge:
– compile semantic joins into efficient syntactic ones
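The difference between a syntactic join and a semantic, rule-based join can be sketched in Python. Everything below is hypothetical: the is-a facts are flat (no hierarchy, so no real least upper bound is computed), and the similarity table is a made-up stand-in for BLAST scores.

```python
# Hypothetical is-a facts: object -> concept (flat, no class hierarchy).
ISA = {"p1": "calcium_binding_protein",
       "p2": "calcium_binding_protein",
       "p3": "kinase"}

# Hypothetical pairwise similarity scores standing in for BLAST output.
SIMILARITY = {("p1", "p2"): 0.92, ("p1", "p3"): 0.85, ("p2", "p3"): 0.40}


def syntactic_join(xs, ys, key_x, key_y):
    """(X,Y) := X.key_x = Y.key_y  (plain equality join)."""
    return [(x, y) for x in xs for y in ys if x[key_x] == y[key_y]]


def semantic_join(threshold=0.8):
    """(X,Y,C) := X isa C, Y isa C, sim(X,Y,S), S > threshold."""
    return [(x, y, ISA[x]) for (x, y), s in SIMILARITY.items()
            if ISA[x] == ISA[y] and s > threshold]


people = [{"ssn": "111", "name": "Ann"}]
records = [{"ssn": "111", "visit": "2002-01-05"},
           {"ssn": "222", "visit": "2002-02-01"}]

eq_pairs = syntactic_join(people, records, "ssn", "ssn")
sem_triples = semantic_join()
print(eq_pairs, sem_triples)
```

Note that the pair (p1, p3) scores above the threshold but is rejected because the two objects do not share a concept; that extra condition is what the equality join cannot express.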

Page 25

Semantic Mediation Methodology @ SOURCES

• Lift Sources to export CMs:
  CM(S) = OM(S) + KB(S) + CON(S)

• Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints

• Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)

• Contextualization CON(S):
– situate OM(S) data using “glue maps” (ontologies):
  • domain maps (DMs) = terminological knowledge: concepts + roles
  • process maps (PMs) = “procedural knowledge”: states + transitions

Page 26

Semantic Mediation Methodology @ MEDIATOR

• Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts

• Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
  ... rewrites (Q o IVD), sends subqueries to sources
  ... post-processes returned results (e.g., situate in context)

Page 27

Model-Based Mediator Architecture

[Figure: a USER/Client exchanges CM queries & results (in XML) with the Mediator Engine (FL rule proc., LP rule proc., graph proc., XSB engine). The mediator exposes a CM (Integrated View), given by the Integrated View Definition (IVD) over “Glue” Maps (GMs): Domain Maps (DMs) and Process Maps (PMs). Each source S1, S2, S3 sits behind an (XML-Wrapper) and a CM-Wrapper and exports its CM in GCM form, where CM(S) = OM(S) + KB(S) + CON(S) and CON(S) supplies the semantic context.]

First results & demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] [BNCOD02] [ER02] [EDBT02] [BioInf02]

Page 28

Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR

Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)

Domain Expert Knowledge:

Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).

[Figure: the Domain Map (DM) drawn as a graph, alongside the DM in Description Logic]

Page 29

Source Contextualization & DM Refinement

In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...

sources can register new concepts at the mediator ...

Page 30

Example: ANATOM Domain Map

Page 31

Browsing Registered Data with Domain Maps

Page 32

Query Processing Demo

[Figure: query results in context. Contextualization CON(Result) wrt. ANATOM.]

Mediator View Definition

DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
WHERE
  I:protein_label_image[proteins ->> {Protein}; organism -> Organism;
      anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}],   % from PROLAB
  NAE:neuro_anatomic_entity[name->Anatom;                                   % from ANATOM
      located_in->>{Brain_region}],
  AS..segments..features[name->Feature_name; value->Value].

• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)

Page 33

Example: Inside Query Evaluation

• push selection @SENSELAB: X1 := select targets of “output from parallel fiber”;

• determine source context @MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;

• compute region of interest (here: downward closure) @MEDIATOR: X3 := subregion-closure(X2);

• push selection @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);

• compute protein distribution @MEDIATOR: X5 := compute aggregate(X4);

• display in context @MEDIATOR/GUI: display X5 in context (ANATOM)

“How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?”

=> DEMONSTRATION
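Steps X3 to X5 above (downward closure, selection, aggregation) can be sketched as follows. This is an assumption-laden toy, not the KIND mediator: the domain-map fragment, the protein rows, and the function names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical fragment of an anatomical domain map: region -> direct subregions.
HAS_SUBREGION = {
    "cerebellum": ["cerebellar_cortex"],
    "cerebellar_cortex": ["molecular_layer", "purkinje_layer"],
}

# Hypothetical source data: (region, protein, amount).
PROTEIN_DATA = [
    ("molecular_layer", "RyR", 3.0),
    ("purkinje_layer", "RyR", 5.0),
    ("hippocampus", "RyR", 7.0),
    ("purkinje_layer", "Calbindin", 2.0),
]


def subregion_closure(regions):
    """Downward closure: the given regions plus everything reachable below them."""
    seen, stack = set(), list(regions)
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(HAS_SUBREGION.get(r, []))
    return seen


def protein_distribution(start_regions, protein):
    """Select rows for the protein inside the closure, then aggregate per region."""
    roi = subregion_closure(start_regions)            # X3: region of interest
    rows = [r for r in PROTEIN_DATA
            if r[0] in roi and r[1] == protein]       # X4: pushed-down selection
    totals = defaultdict(float)
    for region, _, amount in rows:                    # X5: aggregate
        totals[region] += amount
    return dict(totals)


dist = protein_distribution(["cerebellar_cortex"], "RyR")
print(dist)
```

The hippocampus row is excluded because it lies outside the closure of the starting region, which is how the domain map narrows the query sent to the second source.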

Page 34: From Data Integration To Semantic Mediation: Addressing Heterogeneities in Data Bertram Ludäscher Bertram Ludäscher LUDAESCH@SDSC.EDU Knowledge-Based Information.

34

Open Database & Knowledge Representation IssuesOpen Database & Knowledge Representation Issues

• Mix of Query Processing and Reasoning
– GAV & LAV with semantic query optimization (NIH BIRN, NSF GEON)
– description logic reasoner for DMs (FaCT)?
– reconciliation of conflicting DMs via argumentation-frameworks (“games”) using well-founded and stable models of logic programs [ICDT97, PODS97, TCS00, TODS02]

• Modeling “Process Knowledge” => Process Maps
– formal semantics? (dynamic/temporal/Kripke models/Petri nets?)
– executable semantics? (Statelog?)

• Graph Queries over DMs and PMs
– expressible in F-logic [InfSystem98]
– scalability? (UMLS Domain Map has millions of entries)

• How to incorporate “procedural features”?
– Bioinformatics, Ecoinformatics, … => sources = DBs + analytical tools + …
– scientific workflow planning and management (“promoter identification workflow” for DOE SciDAC, NSF/ITR SEEK)
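To make "graph queries over DMs" concrete, here is a small sketch of a reachability query over a domain map with labeled edges. It is written in Python rather than F-logic, and the triples, edge labels, and the function name `reachable` are hypothetical. The function computes the same answer as the usual recursive rule pair (an edge is a path; an edge followed by a path is a path), restricted to a chosen set of edge labels.

```python
from collections import deque

# Hypothetical domain-map fragment as (source, edge label, target) triples.
EDGES = [
    ("purkinje_cell", "isa", "neuron"),
    ("neuron", "isa", "cell"),
    ("purkinje_cell", "has_a", "dendrite"),
    ("dendrite", "has_a", "spine"),
    ("spine", "isa", "compartment"),
]


def reachable(start, labels, edges=EDGES):
    """Nodes reachable from `start` via one or more edges labeled in `labels`."""
    # Index only the edges whose label is allowed by the query.
    adjacency = {}
    for src, label, dst in edges:
        if label in labels:
            adjacency.setdefault(src, []).append(dst)

    # Breadth-first search; each node and edge is visited at most once.
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, `reachable("purkinje_cell", {"isa"})` follows only `isa` edges, while passing `{"isa", "has_a"}` mixes both kinds. A single traversal like this is linear in the number of edges, so the scalability question raised above mainly concerns many or more complex queries over very large maps.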

Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue

• nodes ~ states
• edges ~ processes, transitions
• blue/red edges: processes in Src1/Src2
• general form of edges:

related formalisms

A Scientific Workflow: Promoter Identification

[Figure: promoter identification workflow diagram. Credit: Matthew Coleman, LLNL, 2002. Labels recovered from the diagram:]

Questions:
• Are chr#’s in common?
• Are chr#’s locations in common?
• Are there conserved upstream sequences?
• Are gene locations conserved across species?
• RNA POLII promoter?
• GpC Island present?
• Are there common TAF’s across genomic gi#?
• Are there other common genes?

Tools and steps: blast (human, other species), TRANSFAC, GRAIL (validates polII promoter location), CLUSTAL, Data Consolidation

Data items:
• gi#’s from clusfavor; cDNA gi#, gene name
• genomic gi#, chr #, gene location
• TAF’s, location on genomic gi#’s, probabilities of match, probabilities of random match
• GC Island location, exon/intron location, repeats location, promoter location
• consensus sequences
• promoter location, shared TAF’s across cluster, common consensus sequence

SDM Demo & Architecture

Translation Approach: Abstract Workflow (AWF) => Executable Workflow (EWF)
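One way to picture the AWF => EWF translation is as binding each abstract step to a concrete callable and ordering the steps by their dependencies. The sketch below assumes exactly that and is not the SDM architecture itself: the step names, the stand-in functions, and the `translate`/`run` helpers are hypothetical, and the "binding site" search is a toy substring match.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Abstract workflow (hypothetical): step name -> steps it depends on.
ABSTRACT_WORKFLOW = {
    "find_genomic_sequence": set(),
    "find_binding_sites": {"find_genomic_sequence"},
    "consolidate": {"find_binding_sites"},
}


# Stand-in implementations; real bindings would call external tools.
def find_genomic_sequence(seed, inputs):
    return seed.upper()


def find_binding_sites(seed, inputs):
    seq = inputs["find_genomic_sequence"]
    return [i for i in range(len(seq) - 3) if seq[i:i + 4] == "TATA"]


def consolidate(seed, inputs):
    sites = inputs["find_binding_sites"]
    return {"sites": sites, "count": len(sites)}


BINDINGS = {
    "find_genomic_sequence": find_genomic_sequence,
    "find_binding_sites": find_binding_sites,
    "consolidate": consolidate,
}


def translate(abstract_workflow, bindings):
    """Turn an abstract workflow into an ordered, executable plan."""
    plan = []
    for step in TopologicalSorter(abstract_workflow).static_order():
        if step not in bindings:
            raise KeyError(f"no executable binding for abstract step {step!r}")
        plan.append((step, bindings[step], sorted(abstract_workflow[step])))
    return plan


def run(plan, seed):
    """Execute the plan, feeding each step the results of its dependencies."""
    results = {}
    for step, func, deps in plan:
        results[step] = func(seed, {dep: results[dep] for dep in deps})
    return results
```

A call such as `run(translate(ABSTRACT_WORKFLOW, BINDINGS), "ggtataaccTATAg")` executes the three steps in dependency order. Keeping the bindings separate from the abstract graph means the same abstract workflow can be retargeted to different tools by swapping the `BINDINGS` table.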

Analytical Pipelines: An Open Source Tool

A Commercial Tool for Analytical Pipelines

Summary: Mediation Scenarios & Techniques

Federated Databases       XML-Based Mediation       Model-Based Mediation
One-World                 One-/Multiple-Worlds      Complex Multiple-Worlds
Common Schema             Mediated Schema           Common Glue Maps
SQL, rules                XML query languages       DOOD query languages
Schema Transformations    Syntax-Aware Mappings     Semantics-Aware Mappings
Syntactic Joins           Syntactic Joins           “Semantic” Joins via Glue Maps
DB expert                 DB expert                 KRDB + domain experts

Glue?

GEON vs. SEEK

Outline

1. Information Integration from a Database Perspective
2. XML-Based Data Integration
3. Model-Based / Semantic Mediation
4. Discussion

Thank you!

Questions? Queries?

Some References

• Model-Based Mediation:
– A Model-Based Mediator System for Scientific Data Management, B. Ludäscher, A. Gupta, M. Martone, Bioinformatics: Managing Scientific Data, Lacroix, Critchlow (eds), Morgan Kaufmann, to appear, 2003.
– Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE’01), Heidelberg, Germany, IEEE Computer Society, 2001.
– Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.

• XML-Based Mediation:
– VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT’00), Konstanz, Germany, LNCS 1777, Springer, 2000.
– XML Streams: A Transducer-Based XML Query Processor, B. Ludäscher, P. Mukhopadhyay, Y. Papakonstantinou, Intl. Conference on Very Large Databases (VLDB’02), Hong Kong, 2002.

Knowledge Representation: Relating Theory to the World via Formal Models

John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations

“All models are wrong, but some are useful!”

