Alon Halevy
University of Washington
Joint work with Anhai Doan, Jayant Madhavan,
Phil Bernstein, and Pedro Domingos
Peer Data-Management Systems: Plumbing for the Semantic Web
2
Agenda
Elements of the Semantic Web
Piazza: a peer data-management system
– A database guy’s contribution to the semantic web
The key issue: mapping between different models:
– Some recent progress and current directions.
The critical issue: crossing the structure chasm.
The talk I’m not giving today:
– A critique of the Semantic Web.
Work and thoughts are in progress
3
The Semantic Web (my view)
Web sites include structural annotations
– You can pose meaningful queries on them.
– Ontologies provide the semantic glue.
– Internal implementation of web sites left open.
Agents perform tasks:
– Query one or more web sites
– Perform updates (e.g., set schedules)
– Coordinate actions
– Trust each other (or not).
I.e., agents operating on a gigantic heterogeneous distributed database.
4
Getting there
Robust infrastructure for querying
– Peer data management systems.
Facilitate mapping between different structures.
Need tools for:
– Locating relevant structures
– Easily joining the semantic web.
Get data into structured form
– Should we worry about the legacy web?
5
Agenda
Elements of the Semantic Web (personal view)
Piazza: a peer data-management system
– A database guy’s contribution to the semantic web
The key issue: mapping between different models:
– Some recent progress and current directions.
The critical issue: crossing the structure chasm.
6
Piazza: Peer Data-Management
Goal: To enable users to share data across local or wide area networks in an ad-hoc, highly dynamic distributed architecture.
Peers can:
– Export base data
– Provide views on base data
– Serve as logical mediators for other peers
Every peer can be both a server and a client.
Peers join and leave the PDMS at will.
7
Extending the Vision to Data Sharing
[Figure: example PDMS of emergency-services peers. Peers shown: 911 Dispatch Center (9DC); Fire Services (FS); Portland Fire District (PFD); Vancouver Fire District (VFD); Station 3, Station 12, Station 19, Station 32; Hospitals (H); First Hospital (FH); Lakeview Hospital (LH); Medical Aid (MA); Earthquake Command Center (ECC); Search & Rescue (SR); Emergency Workers (EW); Washington State; National Guard.]
8
Relationship of PDMS to…
P2P overlay networks (the “S” word)
Data integration systems (no central logical mediated schema)
Federated databases (scale, ad-hoc nature)
Distributed databases (no central administration)
9
Representing Data
A spectrum of possibilities:
– Relational tables, some integrity constraints
– XML: can encode relational, hierarchical, OO
– XQuery – emerging standard query language (SQL for XML)
– RDF: “XML on drugs”.
– Sees only the logic; ignores other aspects.
– DAML+OIL
– Full-blown knowledge representation language.
They all have semantics; just different expressive powers.
We keep the data simple. Mappings between data at different peers are more complex.
10
Piazza Querying
Semantic mappings between peers provide glue:
LH:CritBed(bed, hosp, room, PID, status) H:CritBed(bed, hosp, room) & H:Patient(PID, bed, status)
9DC:SkilledPerson(PID, "Doctor") :- H:Doctor(SID, h, l, s, e)
9DC:SkilledPerson(PID, "EMT") :- H:EMT(SID, h, vid, s, e)
Query processing phases:
– Reformulate a query into queries over stored data.
– MiniCon algorithm (++) for answering queries using views.
– Extensions in Piazza enable chaining multiple peer mappings.
– Find best plan for the query and execute it:
– Tukwila data integration engine – an efficient processor for network-bound XML/relational data.
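A minimal sketch of how the two 9DC:SkilledPerson mappings above could be unfolded over data stored at the H peer. The sample rows, the function name skilled_person, and the choice to let the first column (SID) stand in for the PID of the rule head are all assumptions for illustration. This is not Piazza's reformulation algorithm, which rewrites queries rather than evaluating rules directly over data.

```python
# Hypothetical stored relations at the Hospitals (H) peer.
# Column order follows the mappings on the slide:
#   H:Doctor(SID, h, l, s, e) and H:EMT(SID, h, vid, s, e)
H_DOCTOR = [("s1", "FH", "loc-a", "day", "e1")]
H_EMT = [("s2", "LH", "v9", "night", "e2")]


def skilled_person(doctors, emts):
    """Evaluate 9DC:SkilledPerson as the union of the two rule bodies.

    Assumption: the first column (SID) plays the role of PID in the head.
    """
    result = [(row[0], "Doctor") for row in doctors]
    result += [(row[0], "EMT") for row in emts]
    return result


if __name__ == "__main__":
    print(skilled_person(H_DOCTOR, H_EMT))
```

Each rule contributes one branch of a union, which is why a query over 9DC:SkilledPerson can be answered from either H relation.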
11
Efficiency Issues in Piazza
Intelligent data placement:
– We may want to place views over data at key points in the PDMS:
– Save work for frequently asked queries.
– Increase availability in cases of failures.
– Akamai for structured data
– A form of automated reformulation.
– Large search space of possibilities
– Surprising lower bounds on very simple cases [Chirkova et al., VLDB 2001].
Efficient propagation of updates:
– Approach: publish updategrams as first-class citizens.
12
Additional Piazza Issues
The catalog of data sources
– What does a catalog of structured data sources look like?
– How can it be browsed by humans?
– How do we facilitate joining a PDMS?
– How can the catalog be distributed physically?
Systems issues:
– Architecture of a Piazza node: what are the components?
– Naming issues
– Security
Piazza collaborators: Etzioni, Gribble, Ives, Levy, Suciu, Mork, Rodrig, Tatarinov.
13
Agenda
Elements of the Semantic Web
Piazza: a peer data-management system
– A database guy’s contribution to the semantic web
The key issue: mapping between different models:
– Some recent progress and current directions.
The critical issue: crossing the structure chasm.
14
It’s All About the Mappings
It’s not about understanding the data: It’s about understanding each other.
Whenever you see a model for some domain, there is another one hiding around the corner.
Mappings provide semantic relationships between different peers.
Specifying mappings: inherently a human-assisted task.
Goal: make it easy, fast, incremental.
Not a new problem!
15
Example Semantic Mapping
Mapping between XML DTDs
[Figure: two XML DTD trees, each rooted at house. Elements shown: location, contact, name, phone, address, num-baths, full-baths, half-baths, contact-info, agent-name, agent-phone. Arrows between the trees mark a 1-1 mapping and a non 1-1 mapping.]
16
Desiderata from Proposed Solutions
Accuracy, efficiency, ease of use.
Extensible: accommodate in a principled fashion:
– User feedback
– Domain constraints
– General heuristics
“Memory”, knowledge reuse:
– System should exploit knowledge from previous matching tasks [LSD].
Some underlying semantics.
17
Why Matching is Difficult
Structures represent same entity differently
– different names => same entity:
– area & address => location
– same names => different entities:
– area => location or square-feet
Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
Schema, data and rules never fully capture semantics!
– not adequately documented, certainly not for machine consumption.
Often hard for humans (committees are formed!)
18
Learning for Mapping
We started simple: generating semantic mappings between a mediated schema and a large set of data source schemas.
Key idea: generate the first mappings manually, and learn from them to generate the rest.
Technique: multi-strategy learning (extensible!)
L(earning) S(ource) D(escriptions) [SIGMOD 2001].
Recent and current work:
– (simple) Ontology mapping [WWW-02]
– Complex mappings [COMAP]
– Semantics [Madhavan et al., AAAI-02]
19
Data Integration (a simple PDMS)
[Figure: the query “Find houses with four bathrooms priced under $500,000” is posed over a mediated schema; query reformulation and optimization relate it to source schemas 1, 2, and 3. Sources shown: homes.com, realestate.com, homeseekers.com.]
Applications: WWW, enterprises, science projects
Techniques: virtual data integration, warehousing, custom code.
20
Learning from the Manual Mappings
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
realestate.com
listed-price | contact-name | contact-phone | office | comments
$250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
$320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
If “office” occurs in the name => office-phone
If “fantastic” & “great” occur frequently in data instances => description
homes.com
sold-at | contact-agent | extra-info
$350K | (206) 634 9435 | Beautiful yard
$230K | (617) 335 4243 | Close to Seattle
$190K | (512) 342 1263 | Great lot
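The two learned rules on this slide can be sketched as simple predicates. This is only an illustration: the function names, the cue-word tuple, and the 0.3 threshold standing in for "occur frequently" are assumptions, not values taken from LSD.

```python
def name_rule(element_name):
    """Rule over schema names: 'office' in the name suggests office-phone."""
    return "office-phone" if "office" in element_name.lower() else None


def data_rule(instances, cue_words=("fantastic", "great"), min_fraction=0.3):
    """Rule over data instances: frequent cue words suggest description.

    min_fraction is a made-up threshold for "occur frequently".
    """
    if not instances:
        return None
    hits = sum(any(w in inst.lower() for w in cue_words) for inst in instances)
    return "description" if hits / len(instances) >= min_fraction else None


if __name__ == "__main__":
    # Columns taken from the homes.com listing on the slide.
    print(data_rule(["Beautiful yard", "Close to Seattle", "Great lot"]))
    print(data_rule(["$350K", "$230K", "$190K"]))
    print(name_rule("office"))
```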
21
Multi-Strategy Learning
Use a set of base learners:
– Name learner, Naïve Bayes, Whirl, XML learner
And a set of recognizers:
– County name, zip code, phone numbers.
Each base learner produces a prediction weighted by confidence score.
Combine base learners with a meta-learner, using stacking.
22
Base Learners
[Figure: in training, a base learner takes training examples (X1,C1), (X2,C2), ..., (Xm,Cm), each an object with its observed label, and produces a classification model (hypothesis); in matching, the model maps an object X to labels weighted by confidence score.]
Name Learner
– training: (“location”, address) (“contact name”, name)
– matching: agent-name => (name,0.7),(phone,0.3)
Naive Bayes Learner
– training: (“Seattle, WA”,address) (“250K”,price)
– matching: “Kent, WA” => (address,0.8),(name,0.2)
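A rough sketch of what a Naive Bayes base learner of this kind might look like, assuming whitespace tokenization and add-one smoothing. The class and method names are hypothetical, two of the training rows are made up, and the sketch does not reproduce the slide's confidence values.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the instance and split on whitespace and commas."""
    return text.lower().replace(",", " ").split()


class NaiveBayesLearner:
    """Hypothetical base learner: maps a data instance to label confidences."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (instance text, mediated-schema label)."""
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                # Add-one smoothing so unseen tokens do not zero out a label.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


if __name__ == "__main__":
    learner = NaiveBayesLearner()
    learner.train([("Seattle, WA", "address"), ("Bend, OR", "address"),
                   ("250K", "price"), ("320K", "price")])
    print(learner.predict("Kent, WA"))
```

Returning a normalized score per label keeps the output in the same form as the slide's "(label, confidence)" pairs, so a meta-learner can combine it with other learners.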
23
Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]
Training
– uses training data to learn weights
– one for each (base-learner, mediated-schema element) pair
– weight (Name-Learner,address) = 0.2
– weight (Naive-Bayes,address) = 0.8
Matching: combine predictions of base learners
– computes weighted average of base-learner confidence scores
[Figure: for the source element area, with instances “Seattle, WA”, “Kent, WA”, “Bend, OR”, the Name Learner predicts (address,0.4) and Naive Bayes predicts (address,0.9); the Meta-Learner outputs (address, 0.4*0.2 + 0.9*0.8 = 0.8).]
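The combination step above can be sketched in a few lines. The weights and predictions are the ones on the slide; the dictionary layout and the function name combine are assumptions, and the sketch covers only the matching-time combination, not how stacking learns the weights.

```python
from collections import defaultdict

# One weight per (base learner, mediated-schema element) pair, as on the slide.
WEIGHTS = {
    ("name_learner", "address"): 0.2,
    ("naive_bayes", "address"): 0.8,
}

# Base-learner predictions for the source element "area".
PREDICTIONS = {
    "name_learner": {"address": 0.4},
    "naive_bayes": {"address": 0.9},
}


def combine(predictions, weights):
    """Weighted sum of base-learner confidence scores, per label."""
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            # Pairs without a learned weight contribute nothing.
            combined[label] += weights.get((learner, label), 0.0) * confidence
    return dict(combined)


if __name__ == "__main__":
    print(combine(PREDICTIONS, WEIGHTS))
```

For address this computes 0.4*0.2 + 0.9*0.8, the 0.8 shown on the slide (up to floating-point rounding).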
24
The LSD Architecture
[Figure: two phases. Training phase: the mediated schema and source schemas yield training data for base learners; Base-Learner1 ... Base-Learnerk produce Hypothesis1 ... Hypothesisk, and the Meta-Learner produces weights for the base learners. Matching phase: Base-Learner1 ... Base-Learnerk and the Meta-Learner feed a Prediction Combiner, which turns predictions for instances into predictions for elements; a Constraint Handler applies domain constraints to produce the mappings.]
25
Domain Constraints
Encode user knowledge about the domain
Specified by examining mediated schema
Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id then it is a key
– avg-value(price) > avg-value(num-baths)
Given a mapping combination
– can verify if it satisfies a given constraint
area: address; sold-at: price; contact-agent: agent-phone; extra-info: address
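A small sketch of verifying one of the constraints above against the example mapping combination. The function name and the dictionary representation of a mapping combination are assumptions; LSD's constraint handler is not described in enough detail here to reproduce.

```python
from collections import Counter

# The mapping combination shown on the slide: source element -> mediated element.
MAPPING = {
    "area": "address",
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "address",
}


def at_most_one_match(mapping, mediated_element):
    """Check: at most one source-schema element matches mediated_element."""
    return Counter(mapping.values())[mediated_element] <= 1


if __name__ == "__main__":
    # Both area and extra-info map to address, so this candidate is rejected.
    print(at_most_one_match(MAPPING, "address"))
    print(at_most_one_match(MAPPING, "price"))
```

Because a check like this only needs the candidate mapping, it can be run on each combination the learners propose.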
26
Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– create mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML (faithful to schema!)
– mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
Ten runs for each experiment - in each run:
– manually provide 1-1 mappings for 3 sources
– ask LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
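The accuracy measure above can be written out as follows. This is a sketch with a hypothetical function name and made-up proposed mappings; it assumes accuracy is taken over the set of correct 1-1 mappings for a source.

```python
def matching_accuracy(proposed, correct):
    """Percentage of the correct 1-1 mappings that the proposal identifies."""
    if not correct:
        raise ValueError("need at least one correct mapping")
    hits = sum(1 for element, label in correct.items()
               if proposed.get(element) == label)
    return 100.0 * hits / len(correct)


if __name__ == "__main__":
    correct = {
        "listed-price": "price",
        "contact-name": "agent-name",
        "contact-phone": "agent-phone",
        "comments": "description",
    }
    # Hypothetical system output with one wrong match.
    proposed = dict(correct, **{"contact-phone": "office-phone"})
    print(matching_accuracy(proposed, correct))
```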
27
Matching Accuracy
[Figure: bar chart of average matching accuracy (%), on a 0 to 100 scale, for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings.]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
28
Sensitivity to Amount of Available Data
[Figure: average matching accuracy (%), axis from 40 to 100, versus number of data listings per source, axis from 0 to 500, for Real Estate I.]
29
Contribution of Schema vs. Data
[Figure: bar chart of average matching accuracy (%), on a 0 to 100 scale, for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings, comparing LSD with only schema info., LSD with only data info., and complete LSD.]
More experiments in the paper [Doan et al. 01]
30
Contribution of Each Component
[Figure: bar chart of average matching accuracy (%), on a 0 to 100 scale, for Real Estate I, Course Offerings, Faculty Listings, and Real Estate II, comparing LSD without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, and the complete LSD system.]
31
The Next Steps
Learning is a useful component. But it needs to be combined with:
– User feedback
– Domain constraints
– General heuristics
Need a representation of mappings:
– First step – see [Madhavan et al., AAAI-02]
– Also defines key inference problems for such a representation,
– Provides answers for the mapping language used in Piazza.
– Ultimately, some first-order probabilistic representation.
Need benchmarks to measure progress.
32
Agenda
Elements of the Semantic Web
Piazza: a peer data-management system
– A database guy’s contribution to the semantic web
The key issue: mapping between different models:
– Some recent progress and current directions.
The critical issue: crossing the structure chasm.
33
Can We Cross the Structure Chasm?
There are two worlds:
– U-world: the current web, keyword search, Google
– S-world: databases, knowledge bases, structured queries
The web succeeded because it’s in the U-world.
For the semantic web to succeed, we need to make it dead simple for people to:
– Structure data, locate relevant data and data sets, query.
However:
– People have a hard time structuring their data
– It’s harder to query structured data: need to know a terminology.
– It’s harder to understand each other in the S-world.
DB and KR people have no clue how to deal with this.
More expressive power in the languages won’t help.