Post on 22-Apr-2015
A Graph-based Approach to Learn Semantic Descriptions of Data Sources
Mohsen Taheriyan
Craig Knoblock
Pedro Szekely
Jose Luis Ambite
Problem: How to learn semantic descriptions?
First, what is a semantic description?
Semantic Description
Describing the source in terms of the concepts and relationships defined by the domain ontology

[Figure: domain ontology. Classes: Person, Organization, Place, City, State, Event. Data properties: name, birthdate, phone, postalCode, title, startDate, endDate. Object properties: bornIn, worksFor, livesIn, state, ceo, location, organizer, nearby, isPartOf. The legend distinguishes object property, data property, and subClassOf links.]

Source
Column 1 | Column 2 | Column 3 | Column 4 | Column 5
Bill Gates | Oct 1955 | Microsoft | Seattle | WA
Mark Zuckerberg | May 1984 | Facebook | White Plains | NY
Larry Page | Mar 1973 | Google | East Lansing | MI
Semantic Types
Column | Sample values | Semantic type
Column 1 | Bill Gates, Mark Zuckerberg, Larry Page | Person.name
Column 2 | Oct 1955, May 1984, Mar 1973 | Person.birthdate
Column 3 | Microsoft, Facebook, Google | Organization.name
Column 4 | Seattle, White Plains, East Lansing | City.name
Column 5 | WA, NY, MI | State.name
Relationships
[Figure: the source table with its semantic model. Person (name, birthdate), Organization (name), City (name), and State (name) are connected by the links Person bornIn City, Person worksFor Organization, and City state State.]
This semantic model is converted to a semantic description in R2RML
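To make the idea of a semantic model concrete, it can be written down as plain data. The sketch below is only an illustration: the names `SEMANTIC_TYPES`, `RELATIONSHIPS`, and `describe_row` are hypothetical and belong to neither Karma nor R2RML. It pairs each column with a class and data property and lists the links between classes.

```python
# Hypothetical, minimal representation of the semantic model on this slide.
# Column -> (class, data property), as assigned on the Semantic Types slide.
SEMANTIC_TYPES = {
    "Column 1": ("Person", "name"),
    "Column 2": ("Person", "birthdate"),
    "Column 3": ("Organization", "name"),
    "Column 4": ("City", "name"),
    "Column 5": ("State", "name"),
}

# Links between classes: (source class, object property, target class).
RELATIONSHIPS = [
    ("Person", "bornIn", "City"),
    ("Person", "worksFor", "Organization"),
    ("City", "state", "State"),
]


def describe_row(row):
    """Turn one table row into (class, property, value) statements."""
    statements = []
    for column, value in row.items():
        cls, prop = SEMANTIC_TYPES[column]
        statements.append((cls, prop, value))
    return statements


row = {
    "Column 1": "Bill Gates",
    "Column 2": "Oct 1955",
    "Column 3": "Microsoft",
    "Column 4": "Seattle",
    "Column 5": "WA",
}
for statement in describe_row(row):
    print(statement)
```

A real R2RML description would express the same column-to-property and class-to-class mappings as RDF triples maps; that step is not shown here.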
Previous approach to learn semantic descriptions
Karma
[Figure: pipeline. Inputs: Domain Ontology, Sample Data. Step 1: Learn Semantic Types (CRF). Step 2: Extract Relationships (Steiner Tree). Output: Semantic Model.]
http://www.isi.edu/integration/karma @KarmaSemWeb
Refining The Model
[Figure: initial model]
[Figure: refined model]
• Previous work does not learn from the changes the user makes to relationships
• User has to go through the refinement process each time
Our new approach to learn semantic descriptions
Key Idea
• Sources in the same domain often have similar data
• Exploit knowledge of existing source models
• Leverage relationships in known source models to hypothesize relationships for new sources
Approach
[Figure: pipeline. Inputs: Domain Ontology, Known Source Models (S1, S2, …, Sn), New Source. Steps: Construct Graph G, Learn Semantic Types (CRF), Generate Candidate Models, Rank Results.]
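Example
Inputs: the domain ontology, the known source models, and the new source.

Known Source Models
S1 = personalInfo(name, city, birthdate, state, workplace)
  – Semantic types: Person.name, Person.birthdate, Organization.name, City.name, State.name
  – Links: Person bornIn City, Person worksFor Organization, City state State
S2 = getCities(state, city)
  – Semantic types: City.name, State.name
  – Links: City state State
S3 = businessInfo(company, city, ceo, state)
  – Semantic types: Organization.name, Person.name, City.name, State.name
  – Links: Organization ceo Person, Organization location City, City isPartOf State

New Source
S4 = postalCodeLookup(zipcode, city, state)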
Build a Graph from Known Models
• Create a component in G for each known source model
  – Only add if the model is not a subgraph of an existing component
• Annotate links with list of supporting models
[Figure: S1 = personalInfo becomes Component 1 of G, with the class nodes Person, Organization, City, and State and the data nodes Person.name, Person.birthdate, Org.name, City.name, and State.name. Every link is annotated {s1}.]
Build a Graph from Known Models
[Figure: S2 = getCities is a subgraph of Component 1, so no new component is created. The three links it shares with S1 (City state State, City name, State name) are now annotated {s1,s2}; the remaining links stay {s1}.]
Build a Graph from Known Models
[Figure: S3 = businessInfo is added as Component 2, with the class nodes Person, Organization, City, and State and the data nodes Org.name, Person.name, City.name, and State.name. Its links (ceo, location, isPartOf, and four name links) are annotated {s3}. Component 1 is unchanged.]
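The component-building steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `build_components` and `MODELS` are hypothetical names, each model is reduced to a set of (source, label, target) links, and the subgraph test is a plain set-inclusion check on those links.

```python
def build_components(models):
    """Group known source models into graph components.

    models maps a model name to its set of (source, label, target) links.
    Returns a list of components; each maps a link to its supporting models.
    """
    components = []
    for name, links in models.items():
        for component in components:
            # Simplified subgraph test: every link already exists in the component.
            if links.issubset(component):
                for link in links:
                    component[link].add(name)
                break
        else:
            # Not contained in any existing component: start a new one.
            components.append({link: {name} for link in links})
    return components


# The three known models from the example, written as link sets.
MODELS = {
    "s1": {
        ("Person", "name", "Person.name"),
        ("Person", "birthdate", "Person.birthdate"),
        ("Person", "bornIn", "City"),
        ("Person", "worksFor", "Organization"),
        ("Organization", "name", "Org.name"),
        ("City", "name", "City.name"),
        ("City", "state", "State"),
        ("State", "name", "State.name"),
    },
    "s2": {
        ("City", "name", "City.name"),
        ("City", "state", "State"),
        ("State", "name", "State.name"),
    },
    "s3": {
        ("Organization", "name", "Org.name"),
        ("Organization", "ceo", "Person"),
        ("Person", "name", "Person.name"),
        ("Organization", "location", "City"),
        ("City", "name", "City.name"),
        ("City", "isPartOf", "State"),
        ("State", "name", "State.name"),
    },
}

for index, component in enumerate(build_components(MODELS), start=1):
    print(f"Component {index}")
    for link, supporters in sorted(component.items()):
        print("  ", link, sorted(supporters))
```

The sketch processes models in insertion order and does not handle a later model that is a supergraph of an earlier component; a fuller version would need a real subgraph-matching step.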
Build a Graph from Known Models
• Connect graph components using all paths inferred from the ontology
[Figure: Components 1 and 2 of G together with the ontology classes Event and Place, connected by additional links labeled location, organizer, ceo, worksFor, and isPartOf.]
Build a Graph from Known Models
• Assign low weight = ε to links within a component (black links)
• Weight the other links (green links) according to how often they match links in the known models:

$M$ = known source models
$W_{max}$ = number of links in $M$ ($\geq |E_G|$) = 18
$c_1(e)$ = number of links in $M$ whose <label, source, target> match $e$
$c_2(e)$ = number of links in $M$ whose <label> match $e$
$w_e = \min\left(W_{max} - c_1(e),\; W_{max} - c_2(e)/W_{max}\right)$

[Figure: graph G with the weights of the green links; the values shown are 18, 17, and 17.94.]
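The weights shown in the figure can be reproduced from the formula on this slide. The snippet below is a small sketch; the function name `link_weight` is hypothetical, and the three calls cover the cases that give the values 18, 17, and 17.94 when W_max = 18.

```python
def link_weight(c1, c2, w_max):
    """Weight of a link added from the ontology (a green link).

    c1: links in the known models matching label, source, and target.
    c2: links in the known models matching the label only.
    """
    return min(w_max - c1, w_max - c2 / w_max)


W_MAX = 18  # total number of links in the known models S1, S2, S3

print(f"{link_weight(0, 0, W_MAX):.2f}")  # no match at all: 18.00
print(f"{link_weight(1, 1, W_MAX):.2f}")  # one full match: 17.00
print(f"{link_weight(0, 1, W_MAX):.2f}")  # label-only match: 17.94
```

A full match lowers the weight by a whole unit, while a label-only match lowers it by just 1/W_max, so links already seen between the same classes are preferred.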
Learn Semantic Types (Previous Work)
• A CRF-based model to assign a Semantic Type to each column from its data
• Semantic Type
  – Ontology Class
  – Data Property + Domain
S4 = postalCodeLookup(zipcode, city, state)
  – zipcode: Place.postalCode
  – city: City.name
  – state: State.name
Generate Candidate Models
• Map learned semantic types to nodes in graph G
  – There might be multiple mappings
• Compute Steiner tree (minimal tree) for each mapping
[Figure: weighted graph G together with the learned semantic types of S4 = postalCodeLookup(zipcode, city, state): Place.postalCode, City.name, State.name.]
Generate Candidate Models
Mapping 1
[Figure: mapping 1 of the semantic types of S4 = postalCodeLookup(zipcode, city, state) (Place.postalCode, City.name, State.name) onto nodes of graph G, which now includes a Place node with a postalCode link to Place.postalCode, and the Steiner tree computed for this mapping.]
Generate Candidate Models
Mapping 2
[Figure: mapping 2 of the semantic types of S4 = postalCodeLookup(zipcode, city, state) (Place.postalCode, City.name, State.name) onto nodes of graph G, and the Steiner tree computed for this mapping.]
Generate Candidate Models
Mapping 3
[Figure: mapping 3 of the semantic types of S4 = postalCodeLookup(zipcode, city, state) (Place.postalCode, City.name, State.name) onto nodes of graph G, and the Steiner tree computed for this mapping.]
Generate Candidate Models
Mapping 4
[Figure: mapping 4 of the semantic types of S4 = postalCodeLookup(zipcode, city, state) (Place.postalCode, City.name, State.name) onto nodes of graph G, and the Steiner tree computed for this mapping.]
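One plausible way to see where multiple mappings come from is to enumerate every combination of matching nodes. The sketch below rests on assumptions: the node names (`c1:City.name` and so on) and the function `enumerate_mappings` are hypothetical, and it assumes City.name and State.name each occur once in Component 1 and once in Component 2 while Place.postalCode occurs once. That yields four combinations, consistent with the four mappings shown, although the slides do not spell out which nodes each mapping uses.

```python
from itertools import product

# Hypothetical node identifiers: "c1:" and "c2:" mark the component a node is in.
# Place.postalCode appears in no known model, so it has a single, newly added node.
CANDIDATE_NODES = {
    "Place.postalCode": ["new:Place.postalCode"],
    "City.name": ["c1:City.name", "c2:City.name"],
    "State.name": ["c1:State.name", "c2:State.name"],
}


def enumerate_mappings(semantic_types, candidate_nodes):
    """Return every assignment of one graph node to each semantic type."""
    choices = [candidate_nodes[t] for t in semantic_types]
    return [dict(zip(semantic_types, combo)) for combo in product(*choices)]


S4_TYPES = ["Place.postalCode", "City.name", "State.name"]
mappings = enumerate_mappings(S4_TYPES, CANDIDATE_NODES)
for number, mapping in enumerate(mappings, start=1):
    print(f"Mapping {number}: {mapping}")
```

Each mapping supplies the terminal nodes for one Steiner tree computation over the weighted graph; that step is not shown here.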
Rank Source Models
• Rank the candidates based on:
  – Cost: sum of the weights
  – Coherence: prefer the models with a higher number of supporting models
[Figure: four candidate models for S4. Each contains the nodes Place, City, and State with the data nodes Place.postalCode, City.name, and State.name. The candidates differ in the links they use (isPartOf or state) and in whether their links are supported by {s1,s2}, by {s3}, or by a mix of both.]
Rank 1: Candidate 1
Rank 2: Candidate 4
Rank 3: Candidate 2
Rank 3: Candidate 3
Evaluation
• Dataset 1
  – 17 data sources containing overlapping data
  – Semantic descriptions created manually using DBpedia, FOAF, GeoNames, and WGS84 ontologies
• Dataset 2
  – 6 museum sources
  – Semantic descriptions created by domain experts using EDM, SKOS, and FOAF ontologies
• For each source, learned its model using the models of the other sources as known input
• Computed the Graph Edit Distance (GED) between the learned model and the correct one
  – Operations: node insertion, node deletion, edge insertion, edge deletion, edge relabeling
• Compared the results with our previous work in Karma
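The edit operations listed above can be counted with a short function when the node correspondence is already known. This is a simplified sketch, not the evaluation code used for the results: it assumes nodes are matched by label, that there is at most one edge per ordered node pair, and that every operation costs 1. General graph edit distance also has to search over node correspondences. The function name and the two example models are hypothetical.

```python
def graph_edit_distance(learned, correct):
    """Count unit-cost edit operations between two labeled graphs.

    Each graph is (set of node labels, dict mapping (source, target) -> edge label).
    Assumes nodes with the same label correspond to each other.
    """
    learned_nodes, learned_edges = learned
    correct_nodes, correct_edges = correct

    # Nodes present in only one graph need an insertion or a deletion.
    node_ops = len(learned_nodes ^ correct_nodes)

    edge_ops = 0
    for pair in learned_edges.keys() | correct_edges.keys():
        if pair not in learned_edges or pair not in correct_edges:
            edge_ops += 1  # edge insertion or deletion
        elif learned_edges[pair] != correct_edges[pair]:
            edge_ops += 1  # edge relabeling
    return node_ops + edge_ops


# Hypothetical example: the learned model uses isPartOf where the
# correct model uses state, so one relabeling is needed.
learned = ({"City", "State"}, {("City", "State"): "isPartOf"})
correct = ({"City", "State"}, {("City", "State"): "state"})
print(graph_edit_distance(learned, correct))
```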
Results - Dataset 1
Source Signature | #Attributes | GED: Previous work | GED: New Approach (Rank 1)
nearestCity(lat, lng, city, state, country) | 5 | 6 | 1
findRestaurant(zipcode, restaurantName, phone, address) | 4 | 1 | 0
zipcodesInCity(city, state, postalCode) | 3 | 3 | 1
parseAddress(address, city, state, zipcode, country) | 5 | 6 | 1
citiesOfState(state, city) | 2 | 1 | 0
ocean(lat, lng, name) | 3 | 2 | 1
postalCodeLookup(zipCode, city, state, country) | 4 | 6 | 1
country(lat, lng, code, name) | 4 | 2 | 0
companyCEO(company, name) | 2 | 1 | 0
personalInfo(firstname, lastname, birthdate, brithCity, birthCountry) | 5 | 4 | 1
businessInfo(company, phone, homepage, city, country, name) | 6 | 10 | 8
restaurantChef(restaurant, firstname, lastname) | 3 | 2 | 1
findSchool(city, state, name, code, homepage, ranking, dean) | 7 | 8 | 6
employees(organization, firstname, lastname, birthdate) | 4 | 1 | 2
education(person, hometown, homecountry, school, city, country) | 6 | 9 | 4
administrativeDistrict(city, province, country) | 3 | 4 | 1
capital(country, city) | 2 | 2 | 1
TOTAL | 68 | 68 | 29
57% improvement
Results - Dataset 2
Source Signature | #Attributes | GED: Previous work | GED: New Approach (Rank 1)
S1(Attribution, BeginDate, EndDate, Title, Dated, Medium, Dimensions) | 7 | 1 | 0
S2(ObjectID, ObjectTitle, ObjectWorkType, ArtistName, ArtistBirthDate, ArtistDeathDate, ObjectEarliestDate, ObjectRights, ObjectFacetValue1) | 8 | 2 | 3
S3(death, birth, name) | 3 | 0 | 0
S4(accessionNumber, artist, creditLine, dimensions, imageURL, materials, relatedArtworksURL, creationDate, provenance, keywordValues) | 10 | 9 | 6
S5(AccessionNumber, Classification, CreditLine, Date, Description, DimensionsOrphan, WhatValues, Who, image, relatedArtworksValues) | 10 | 9 | 5
S6(Artist, ArtistBornDate, ArtistDiedDate, Classification, Copyright, CreditLine, Image, KeywordValues, Ref, SitterValues) | 10 | 8 | 6
TOTAL | 48 | 29 | 20
31% improvement
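The improvement figures on the two results slides are consistent with the relative reduction in total GED. The check below assumes that definition (the slides do not state it explicitly) and uses a hypothetical helper name, `improvement`.

```python
def improvement(previous_ged, new_ged):
    """Relative reduction in total graph edit distance."""
    return (previous_ged - new_ged) / previous_ged


print(f"Dataset 1: {improvement(68, 29):.0%}")  # 57%
print(f"Dataset 2: {improvement(29, 20):.0%}")  # 31%
```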
Related Work
• Writing semantic descriptions by hand
  – R2RML, SWRL
  – Tedious and time-consuming task
  – Requires expertise in SW technologies
• Semantic annotation of Web services and Web tables
  – Very limited in learning the relationships
• Learning Semantic Definitions of Online Information Sources [Carman, Knoblock, 2007]
  – Learns LAV rules from known sources
  – Can only learn descriptions that overlap known sources
Discussion
• Automatically build rich semantic descriptions of data sources
• Exploit the background knowledge from (i) the domain ontology, and (ii) the known source models
• Semantic descriptions are the key ingredients to automate many tasks, e.g.,
  – Source Discovery
  – Data Integration
  – Service Composition
Future Work
• Investigate how to create a more compact graph
  – Consolidate the overlapping segments of the known semantic models
• Relax the problem by removing the constraint that the correct semantic type of each attribute is known
  – CRF part returns a set of candidate semantic types along with their confidence values
• Use the data available in Linked Open Data (LOD) cloud to learn more accurate models
• Put the user in the loop
  – Integrate the new approach into Karma
  – The user refines one of the suggested models
  – The new model will be added to the graph as a new pattern