
Rapidly Constructing Integrated Applications from Online Sources

Craig A. Knoblock

Information Sciences Institute

University of Southern California

Motivating Example

BiddingForTravel.com

Priceline

Map

Orbitz

?

Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans


Ungrammatical & Unstructured Text


For simplicity, we call these “posts”

Goal:

<price>$25</price><hotelName>holiday inn sel.</hotelName>

<hotelArea>univ. ctr.</hotelArea>

Wrapper-based IE does not apply (e.g., Stalker, RoadRunner)

NLP-based IE does not apply (e.g., Rapier)

Reference Sets
IE infused with outside knowledge. “Reference sets” are collections of known entities and their associated attributes:
- An online (or offline) set of documents, e.g., the CIA World Fact Book
- An online (or offline) database, e.g., the Comics Price Guide, Edmunds, etc.

Algorithm Overview – Use of Ref Sets

$25 winning bid at holiday inn sel. univ. ctr.

Post:

Holiday Inn Select University Center

Hyatt Regency Downtown

Reference Set:

Record Linkage

$25 winning bid at holiday inn sel. univ. ctr.

Holiday Inn Select University Center

“$25”, “winning”, “bid”, …

Extraction

$25 winning bid …
<price>$25</price>
<hotelName>holiday inn sel.</hotelName>
<hotelArea>univ. ctr.</hotelArea>
<Ref_hotelName>Holiday Inn Select</Ref_hotelName>
<Ref_hotelArea>University Center</Ref_hotelArea>

Post: “$25 winning bid at holiday inn sel. univ. ctr.”

Reference Set (Ref_hotelName, Ref_hotelArea):
- Holiday Inn Greentree
- Holiday Inn Select, University Center
- Hyatt Regency, Downtown

Our Record Linkage Problem: posts are not yet decomposed into attributes, and they contain extra tokens that match nothing in the reference set.

Our Record Linkage Solution

Record Level Similarity + Field Level Similarities

VRL = < RL_scores(P, “Hyatt Regency Downtown”), RL_scores(P, “Hyatt Regency”), RL_scores(P, “Downtown”)>

Best matching member of the reference set for the post

Binary Rescoring

P = “$25 winning bid at holiday inn sel. univ. ctr.”
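The VRL construction above can be sketched in a few lines. This is an illustrative reduction: a single Jaccard token overlap stands in for the full battery of string metrics behind RL_scores, and the names are invented.

```python
# Illustrative sketch of the VRL vector above: one similarity score for the
# whole reference record plus one per field. Jaccard token overlap stands in
# for the full set of string metrics (RL_scores) the system actually uses.

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def vrl(post, record_fields):
    """record_fields: (whole record string, field 1, field 2, ...)."""
    return [jaccard(post, part) for part in record_fields]

post = "$25 winning bid at holiday inn sel. univ. ctr."
candidate = ("Holiday Inn Select University Center",
             "Holiday Inn Select", "University Center")
scores = vrl(post, candidate)   # record-level score first, then field scores
```

The post shares the tokens "holiday" and "inn" with the candidate record, so the record-level and hotel-name scores are nonzero while the area score is zero (abbreviations like "univ." do not match at the token level, which is why the real system combines several metrics).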

$25 winning bid at holiday inn sel. univ. ctr.

Post:

Generate VIE

Multiclass SVM

$25 winning bid at holiday inn sel. univ. ctr.

$25 holiday inn sel. univ. ctr.

price hotel name hotel area

Clean Whole Attribute

Extraction Algorithm

VIE = <common_scores(token), IE_scores(token, attr1), IE_scores(token, attr2), …>
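As a rough sketch of the extraction step, assuming the post is already matched to its reference record: the following replaces the multiclass SVM over VIE vectors with a simple max-similarity rule, and the overlap measure and threshold are invented for illustration.

```python
# Sketch of the extraction step, assuming the post has already been matched
# to a reference-set record. A real system trains a multiclass SVM over the
# VIE feature vectors; here a max-similarity rule stands in for the SVM.

import re

def overlap(a, b):
    """Fraction of characters matched as a common prefix (illustrative)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b))

def token_features(token, ref_attrs):
    """One IE score per attribute: best prefix overlap with any token of
    that attribute's reference value (a stand-in for IE_scores)."""
    return {attr: max(overlap(token, t) for t in value.lower().split())
            for attr, value in ref_attrs.items()}

def extract(post, ref_attrs, threshold=0.4):
    labels = {}
    for token in post.split():
        if re.fullmatch(r"\$\d+(\.\d+)?", token):
            labels[token] = "price"          # prices match a simple pattern
            continue
        feats = token_features(token.lower().rstrip("."), ref_attrs)
        attr, score = max(feats.items(), key=lambda kv: kv[1])
        if score >= threshold:
            labels[token] = attr
    return labels

ref = {"hotelName": "Holiday Inn Select", "hotelArea": "University Center"}
post = "$25 winning bid at holiday inn sel. univ. ctr."
labels = extract(post, ref)
```

Tokens like "winning" and "bid" score near zero against every attribute and are left unlabeled, mirroring how the real classifier discards tokens that belong to no attribute.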

Experimental Data Sets
Hotel posts:
- 1125 posts from www.biddingfortravel.com (Pittsburgh, Sacramento, San Diego)
- Attributes: star rating, hotel area, hotel name, price, date booked

Reference set:
- 132 records, from special posts on the BFT site that list, per area, any hotels ever bid on in that area
- Attributes: star rating, hotel area, hotel name

Comparison to Existing Systems
Record linkage: WHIRL
- Our RL allows non-decomposed attributes

Information extraction:
- Simple Tagger (CRF): state-of-the-art IE
- Amilcare: NLP-based IE

Record linkage results (10 trials, 30% train / 70% test):

Domain   System    Prec.   Recall   F-Measure
Hotel    Phoebus   93.60   91.79    92.68
         WHIRL     83.52   83.61    83.13

Token-level extraction results, hotel domain (the slide marks one comparison as not significant):

Attribute  System         Prec.   Recall   F-Measure   Freq
Area       Phoebus        89.25   87.50    88.28       809.7
           Simple Tagger  92.28   81.24    86.39
           Amilcare       74.2    78.16    76.04
Date       Phoebus        87.45   90.62    88.99       751.9
           Simple Tagger  70.23   81.58    75.47
           Amilcare       93.27   81.74    86.94
Name       Phoebus        94.23   91.85    93.02       1873.9
           Simple Tagger  93.28   93.82    93.54
           Amilcare       83.61   90.49    86.90
Price      Phoebus        98.68   92.58    95.53       850.1
           Simple Tagger  75.93   85.93    80.61
           Amilcare       89.66   82.68    85.86
Star       Phoebus        97.94   96.61    97.84       766.4
           Simple Tagger  97.16   97.52    97.34
           Amilcare       96.50   92.26    94.27

Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans

Discovering Models of Sources Required for Integration
- Provide uniform access to heterogeneous sources
- Source definitions are used to reformulate queries
- New service, no source model, no integration!
- Can we discover models automatically?

Source Definitions:
- United
- Lufthansa
- Qantas

Mediator

?

Web Services

United

Lufthansa

Qantas

new service

Alitalia

Query

SELECT MIN(price) FROM flight
WHERE depart = “MXP” AND arrive = “PIT”

Reformulated Queries

lowestFare(“MXP”,“PIT”)

calcPrice(“MXP”,“PIT”,”economy”)
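The reformulation above can be mimicked with ordinary functions. The fare tables, signatures, and prices below are invented stand-ins for the lowestFare and calcPrice source operations.

```python
# Hypothetical sketch of query reformulation in a mediator: the MIN(price)
# query over the domain relation flight(depart, arrive, price) is rewritten
# into calls against the available source operations.

def lowest_fare(depart, arrive):          # stands in for lowestFare(...)
    fares = {("MXP", "PIT"): 612.0}       # invented data
    return fares.get((depart, arrive))

def calc_price(depart, arrive, cabin):    # stands in for calcPrice(...)
    fares = {("MXP", "PIT", "economy"): 587.5}   # invented data
    return fares.get((depart, arrive, cabin))

def min_price(depart, arrive):
    """Reformulate SELECT MIN(price) over flight into source calls."""
    candidates = [
        lowest_fare(depart, arrive),
        calc_price(depart, arrive, "economy"),
    ]
    candidates = [p for p in candidates if p is not None]
    return min(candidates) if candidates else None
```

The mediator never stores fares itself; it only knows, from the source definitions, which operations can answer the query and combines their results.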

Inducing Source Definitions: A Simple Example

Step 1: use metadata to classify input types Step 2: invoke service and classify output types

Mediator

new source

RateFinder($fromCountry, $toCountry, val) :- ?

known source

LatestRates($country1, $country2, rate) :- exchange(country1, country2, rate)

Semantic Types:

- currency: {USD, EUR, AUD}
- rate: {1936.2, 1.3058, 0.53177}

Predicates: exchange(currency, currency, rate)

Sample currency/rate tuples: {<EUR,USD,1.30799>, <USD,EUR,0.764526>, …}

def_1($from, $to, val) :- LatestRates(from, to, val)
def_2($from, $to, val) :- LatestRates(to, from, val)

The mediator reformulates these in terms of the domain predicate:

def_1($from, $to, val) :- exchange(from, to, val)
def_2($from, $to, val) :- exchange(to, from, val)

Predicates: exchange(currency, currency, rate)

Inducing Source Definitions: A Simple Example

Step 3: generate plausible source definitions Step 4: reformulate in terms of other sources Step 5: invoke service and compare output

new source

RateFinder($fromCountry,$toCountry,val):- ?

Input       RateFinder   Def_1      Def_2
<EUR,USD>   1.30799      1.30772    0.764692
<USD,EUR>   0.764526     0.764692   1.30772
<EUR,AUD>   1.68665      1.68979    0.591789

Def_1 matches the new source's outputs.
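Steps 3-5 amount to invoking the new service and each candidate definition on the same sample inputs and checking agreement. A minimal sketch, with an invented exchange-rate table and tolerance:

```python
# Sketch of candidate testing: a candidate definition "matches" if its
# output stays within a relative tolerance of the new source's output on
# every sample input. The rate table and tolerance are illustrative.

EXCHANGE = {("EUR", "USD"): 1.30799, ("USD", "EUR"): 0.764526,
            ("EUR", "AUD"): 1.68665}

def rate_finder(frm, to):              # the unmodelled new source
    return EXCHANGE[(frm, to)]

def def_1(frm, to):                    # candidate: exchange(from, to, val)
    return EXCHANGE[(frm, to)]

def def_2(frm, to):                    # candidate: exchange(to, from, val)
    return EXCHANGE[(to, frm)]

def matches(candidate, inputs, tol=0.01):
    return all(
        abs(candidate(f, t) - rate_finder(f, t)) / rate_finder(f, t) < tol
        for f, t in inputs
    )

samples = [("EUR", "USD"), ("USD", "EUR")]
# def_1 reproduces RateFinder's outputs on the samples; def_2 does not.
```

A small tolerance is needed in practice because independent rate services quote slightly different values (compare 1.30799 vs. 1.30772 in the table above).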

The Framework
Intuition: services often have similar semantics, so we should be able to use what we know to induce that which we don't.

Two-phase algorithm. For each operation provided by the new service:
1. Classify its input/output data types
   - Classify inputs based on metadata similarity
   - Invoke the operation & classify outputs based on data
2. Induce a source definition
   - Generate candidates via Inductive Logic Programming
   - Test individual candidates by reformulating them

Use Case: Zip Code Data
- A single real zip-code service with multiple operations
- Same service, so there is no need to classify inputs/outputs or match constants!

The first operation is defined as:

getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
    centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance).

The goal is to induce a definition for a second operation:

getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
    centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance2),
    (distance2 ≤ distance1), (distance1 ≤ 300).
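One runnable reading of the getZipCodesWithin definition above. The centroid table and radius are invented for illustration, and a standard haversine computation stands in for the centroid and distanceInMiles predicates.

```python
# Runnable reading of the datalog body: join both centroids, compute the
# distance, keep pairs whose distance is within the requested radius.
# Centroid values are illustrative, not real service data.

from math import radians, sin, cos, asin, sqrt

CENTROID = {                      # zip -> (lat, long); invented values
    "90292": (33.98, -118.45),    # Marina del Rey
    "90089": (34.02, -118.29),    # USC
    "15213": (40.44, -79.95),     # Pittsburgh
}

def distance_in_miles(lat1, lon1, lat2, lon2):
    """Haversine distance, standing in for the distanceInMiles predicate."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))   # Earth radius in miles

def zip_codes_within(zip1, distance1):
    assert distance1 <= 300               # mirrors (distance1 <= 300)
    lat1, lon1 = CENTROID[zip1]
    return [(z, distance_in_miles(lat1, lon1, lat, lon))
            for z, (lat, lon) in CENTROID.items()
            if distance_in_miles(lat1, lon1, lat, lon) <= distance1]

near = zip_codes_within("90292", 50)   # 90089 is nearby; 15213 is not
```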

Generating Definitions: ILP
Want to induce a source definition for:

getZipCodesWithin($zip1, $distance1, zip2, distance2)

Predicates available for generating definitions: {centroid, distanceInMiles, ≤, =}

The new type signature contains that of the known source, so use the known definition as a starting point for local search:

getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
    centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance).

Plausible Source Definition

1 cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 = d1)

2 cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 ≤ d1)

3 cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1)

4 cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d1 ≤ d2)

5 cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d1 ≤ #d)

6 cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (lt1 ≤ d1)

n cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1), (d1 ≤ #d)

INVALID: d2 unbound!

#d is a constant

UNCHECKABLE: lt1 inaccessible!

contained in defs 2 & 4

Preliminary Results
Settings:
- Number of zip code constants initially available: 6
- Number of samples performed per trial: 20
- Number of candidate definitions in search space: 5

Results:
- Converged on an “almost correct” definition!
- Number of iterations to convergence: 12

getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
    centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance2),
    (distance2 ≤ distance1), (distance1 ≤ 243).

Related Work
- Classifying Web Services (Hess & Kushmerick 2003; Johnston & Kushmerick 2004): classify inputs/outputs/services using metadata and data. We learn semantic relationships between inputs & outputs.
- Category Translation (Perkowitz & Etzioni 1995): learn functions describing operations available on the internet. We concentrate on a relational modeling of services.
- CLIO (Yan et al. 2001): helps users define complex mappings between schemas, but does not automate the process of discovering mappings.
- iMAP (Dhamanka et al. 2004): automates discovery of certain complex mappings. Our approach is more general (ILP) & tailored to web sources; we must deal with the problem of generating valid input tuples.

Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans

Dynamically Building Integration Plans

Mediator

Traditional Data Integration Techniques

Find information about all proteins that participate in the transcription process.

(1) SwissProtein: P36246
(2) GeneBank: AAS60665.1

………

Dynamically Building Integration Plans (Cont’d)

Mediator

Problem Solved Here

Create a web service that accepts the name of a biological process, <bname>, and returns information about the proteins that participate in it.

New web service

Problem Statement (Cont’d)

Assumption: information-producing web service operations

Applicability:
- Biological data web services
- Geospatial services (WMS, WFS)
- Other applications that do not focus on transactions

Query-based Web Service Composition
- View web service operations as source relations with binding restrictions (these can be inferred from WSDL)
- Create a domain ontology
- Describe source relations in terms of domain relations (a combined Global-as-View / Local-as-View approach)
- Use a data integration system to answer user queries

Template-based Web Service Composition
- Our goal is to compose new web services, so we need to answer template queries, not specific queries
- Template-based query approach: generate plans that take general parameter values into account, i.e., a Universal Plan [Schoppers et al.]
- A universal plan is easy to generate and answers the template query as opposed to a specific query, but such plans can be very inefficient
- Need to generate optimized “universal integration plans”

Example Scenario: Sources

HSProtein($id, name, location, function, seq, pubmedid)

MMProteinInteractions($fromid, toid, source, verified)

Protein

Protein-ProteinInteractions

MMProtein($id, name, location, function, seq, pubmedid)

TranducerProtein($id, name, location, taxonid, seq, pubmedid)

MembraneProtein($id, name, location, taxonid, seq, pubmedid)

DipProtein($id, name, location, taxonid, function)

HSProteinInteractions($fromid, toid, source, verified)

Example Rules and Query

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    HSProteinInteractions(fromid, toid, source, verified), (taxonid = 9606)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    MMProteinInteractions(fromid, toid, source, verified), (taxonid = 10090)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
    ProteinProteinInteractions(itoid, toid, taxonid, source, verified)

Q(fromid, toid, taxonid, source, verified) :-
    fromid = !fromid, taxonid = !taxonid,
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified)
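The recursive rule computes a transitive closure over interactions within one taxon. A naive fixpoint sketch, over invented protein identifiers, shows what the rule derives:

```python
# What the recursive ProteinProteinInteractions rule computes: the set of
# base interactions closed under composition. Naive fixpoint iteration;
# the protein ids below are invented for illustration.

def transitive_interactions(edges):
    """edges: set of (fromid, toid) base interactions; returns the closure."""
    closure = set(edges)
    changed = True
    while changed:                       # iterate until no new facts appear
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))  # one application of the rule
                    changed = True
    return closure

base = {("P36246", "Q1"), ("Q1", "Q2"), ("Q2", "Q3")}
derived = transitive_interactions(base)
# ("P36246", "Q3") is derivable via two applications of the recursive rule
```

This is exactly why the execution system must support recursion: the number of derived tuples is not known until the fixpoint is reached.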

Unoptimized Plan

Optimized Plan: exploit constraints in the source descriptions to filter queries to sources.

Example Scenario

Q1(fromid, fromname, fromseq, frompubid, toid, toname, toseq, topubid) :-
    fromid = !fromproteinid,
    Protein(fromid, fromname, loc1, f1, fromseq, frompubid, taxonid1),
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
    Protein(toid, toname, loc2, f2, toseq, topubid, taxonid2)

(Figure: composed plan dataflow. Starting from the input fromproteinid, the plan joins ProteinProteinInteractions with two Protein operations, accumulating fromseq and toseq for each fromproteinid/toproteinid pair in the output.)

Example Integration Plan

Adding Sensing Operations for Tuple-level Filtering
- Compute the original plan for a template query
- For each constraint on the sources:
  - Introduce the constraint into the query
  - Rerun the inverse rules algorithm
  - Compare the cost of the new plan to the original plan
  - Save the plan with the lowest cost

Optimized Universal Integration Plan
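The filtering loop above can be sketched abstractly. The plan representation (the set of sources queried) and the cost model (count the sources) are toy stand-ins; a real system reruns the inverse-rules algorithm and uses a proper cost estimate.

```python
# Sketch of the optimization loop: try pushing each source constraint into
# the query, re-plan, and keep the cheapest plan found.

def optimize(original_plan, constraints, replan, cost):
    best_plan, best_cost = original_plan, cost(original_plan)
    for c in constraints:
        candidate = replan(original_plan, c)   # stands in for inverse rules
        if cost(candidate) < best_cost:
            best_plan, best_cost = candidate, cost(candidate)
    return best_plan

# Toy instantiation: a taxon constraint rules out the source whose
# description contradicts it (cf. taxonid = 9606 vs. taxonid = 10090).
plan = {"HSProteinInteractions", "MMProteinInteractions"}
constraints = ["taxonid=9606", "taxonid=10090"]

def replan(p, c):
    drop = {"taxonid=9606": "MMProteinInteractions",
            "taxonid=10090": "HSProteinInteractions"}
    return p - {drop[c]}

best = optimize(plan, constraints, replan, cost=len)
```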

Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans

Dataflow-style, Streaming Execution
- Map datalog plans into a streaming, dataflow execution system (e.g., a network query engine)
- We use the Theseus execution system since it supports recursion
- Key challenges:
  - Mapping non-recursive plans
  - Mapping recursive plans: data processing, loop detection, query-result updates, termination checks, recursive callbacks

Example Translation

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    HSProteinInteractions(fromid, toid, source, verified), (taxonid = 9606)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    MMProteinInteractions(fromid, toid, source, verified), (taxonid = 10090)

ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
    ProteinProteinInteractions(itoid, toid, taxonid, source, verified)

Q(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
    (fromid = !fromproteinid), (taxonid = !taxonid)

Example Theseus Plan

Bio-informatics Domain Results
- Experiments in a bio-informatics domain with 60 real web services provided by NCI
- We varied the number of domain relations in a query from 1 to 30 and report composition time alongside execution time

(Chart: composition time and execution time, in milliseconds (0 to 16000), vs. the number of relations in the query, 1 through 8.)

Tuple-level Filtering
Tuple-level filtering can improve the execution time of the generated integration plan by up to 53.8%.

Improvement due to Theseus
Theseus can improve the execution time of the generated web service with complex plans by up to 33.6%.

Discussion
- A huge number of sources is available; we need tools and systems that support the dynamic integration of these sources
- In this talk, I described techniques for:
  - Extracting data from unstructured and ungrammatical sources
  - Discovering models of online sources required for integration
  - Dynamic and efficient integration of web sources
  - Efficient execution of integration plans
- Much work still left to be done…

More information: http://www.isi.edu/~knoblock

Matthew Michelson and Craig A. Knoblock. Semantic Annotation of Unstructured and Ungrammatical Text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.

Mark James Carman and Craig A. Knoblock. Inducing Source Descriptions for Automated Web Service Composition. In Proceedings of the AAAI 2005 Workshop on Exploring Planning and Scheduling for Web Services, Grid, and Autonomic Computing, 2005.

Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, Optimizing, and Executing Plans for Bioinformatics Web Services. VLDB Journal, Special Issue on Data Management, Analysis and Mining for Life Sciences, 14(3):330-353, September 2005.