Post on 21-Dec-2015

Constraint-based Information Integration

Steven Minton, Fetch Technologies

Joint work with Craig Knoblock and Jose Luis Ambite (USC/ISI)

Example Application

Tiger MapServer

Geocoder

Zagat Restaurants Guide

Integration System

LA County Restaurant Health Ratings

Outline
Agents that access information sources on the web
AgentBuilder – learning from examples
ActiveAtlas – standardizing data from multiple sources
Constraint-based Integration
Heracles – putting it all together

Information Agents

Decision Support, Application Programs

Information Agent

Knowledge Bases, Databases, Computer Programs, The Web

Web Agents

Web agents provide uniform query language for data access: “Wrapping a web site”

Restaurants in Santa Monica?

Name             Address
Chinois on Main  2709 Main St.
Chao Dara        13 Union Sq.
…                ...

AgentBuilder
Supervised learning: extraction rules created from examples
High precision
High reliability

Extraction technology
Expressive extraction rule language:
  Extraction rule = sequence of landmarks
  Describes how to find the beginning and end of each field

Start: SkipTo(Cuisine :) SkipTo(<b>)
End: SkipTo(</b>)

PAGE: <html> Name:<b> KFC </b> Cuisine :<p> <b> Fast Food </b> <br>...
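The landmark rule above can be illustrated with a minimal Python sketch. It assumes plain substring matching on the raw page text, which is a simplification: the helper names `skip_to` and `extract` are hypothetical, and the actual extraction rule language is more expressive than this.

```python
def skip_to(page, pos, landmark):
    """Return the index just past the first occurrence of `landmark` at or after `pos`."""
    i = page.find(landmark, pos)
    if i < 0:
        raise ValueError(f"landmark {landmark!r} not found")
    return i + len(landmark)


def extract(page, start_landmarks, end_landmark):
    """Apply the start rule (a sequence of landmarks), then cut at the end landmark."""
    pos = 0
    for landmark in start_landmarks:
        pos = skip_to(page, pos, landmark)
    end = page.find(end_landmark, pos)
    if end < 0:
        raise ValueError(f"end landmark {end_landmark!r} not found")
    return page[pos:end].strip()


# The page fragment from the slide (the trailing "..." is omitted).
PAGE = "<html> Name:<b> KFC </b> Cuisine :<p> <b> Fast Food </b> <br>"

# Start: SkipTo(Cuisine :) SkipTo(<b>)   End: SkipTo(</b>)
cuisine = extract(PAGE, ["Cuisine :", "<b>"], "</b>")
print(cuisine)
```

Chaining two landmarks matters here: a single SkipTo(<b>) would stop at the bold tag around the restaurant name rather than the cuisine.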

A Sequential Covering Algorithm for “Wrapper Induction”

Training Examples:
Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...
Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …

Initial candidate: SkipTo( ( )

Refined candidates: SkipTo( <b> ( ) ... SkipTo(Phone) SkipTo( ( ) ... SkipTo(:) SkipTo( ( )

Further refined: … SkipTo(Phone) SkipTo(:) SkipTo( ( ) ...
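To see why the candidates on these slides need refining, the sketch below scores each candidate rule against the two training examples. It is only the candidate-evaluation step, not the full sequential covering algorithm, and it assumes substring landmarks and that the field to extract is the phone number starting at the area code; `apply_rule` and `covered` are hypothetical helper names.

```python
def apply_rule(text, landmarks):
    """Return the position reached after skipping to each landmark in turn, or None."""
    pos = 0
    for landmark in landmarks:
        i = text.find(landmark, pos)
        if i < 0:
            return None
        pos = i + len(landmark)
    return pos


# Training examples from the slides, paired with the area code that should
# come right after the position found by a correct start rule (an assumption).
EXAMPLES = [
    ("Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine", "800"),
    ("Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine:", "310"),
]

CANDIDATES = {
    "SkipTo(()": ["("],
    "SkipTo(<b> ()": ["<b> ("],
    "SkipTo(Phone) SkipTo(()": ["Phone", "("],
    "SkipTo(:) SkipTo(()": [":", "("],
    "SkipTo(Phone) SkipTo(:) SkipTo(()": ["Phone", ":", "("],
}


def covered(landmarks):
    """Indices of the training examples on which the rule lands at the right spot."""
    hits = []
    for idx, (text, expected) in enumerate(EXAMPLES):
        pos = apply_rule(text, landmarks)
        if pos is not None and text[pos:].lstrip().startswith(expected):
            hits.append(idx)
    return hits


coverage = {name: covered(rule) for name, rule in CANDIDATES.items()}
best = max(coverage, key=lambda name: len(coverage[name]))
for name, hits in coverage.items():
    print(f"{name:40s} covers examples {hits}")
```

The initial candidate fails on the first example because "(toll free)" contains an earlier parenthesis; only the three-landmark rule lands correctly on both examples.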

The Problem: Multi-Source Inconsistency

How can the same objects be identified when they are stored in inconsistent text formats?

Art’s Delicatessen; Ca’ Brea; CPK; The Grill; Patina; Philippe’s The Original; The Tillerman

Art’s Deli; California Pizza Kitchen; Campanile; Citrus; Grill, The; Philippe The Original; Spago

Zagat’s Restaurant Guide; Health Dept Restaurant Listings

The Solution: Record Linkage

Zagat’s Restaurants
Name            Street                   Phone
Art’s Deli      12224 Ventura Boulevard  818-756-4124
Teresa's        80 Montague St.          718-520-2910
Steakhouse The  128 Fremont St.          702-382-1600
Les Celebrites  155 W. 58th St.          212-484-5113

Dept. of Health
Name                  Street                                 Phone
Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
Teresa's              103 1st Ave. between 6th and 7th Sts.  212/228-0604
Binion's Coffee Shop  128 Fremont St.                        702/382-1600
Les Celebrites        160 Central Park S                     212/484-5113

Zagat’s Agent, Dept. of Health Agent

Query

Record Linkage

Zagat’s
Name            Street                   Phone
Art’s Deli      12224 Ventura Boulevard  818-756-4124
Teresa’s        80 Montague St.          718-520-2910
Steakhouse The  128 Fremont St.          702-382-1600
Les Celebrites  155 W. 58th St.          212-484-5113

Dept of Health
Name                  Street                                 Phone
Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
Teresa’s              103 1st Ave. between 6th and 7th Sts.  212/228-0604
Binion’s Coffee Shop  128 Fremont St.                        702/382-1600
Les Celebrites        5432 Sunset Blvd                       212/484-5113

Approach to Record Linkage
Learning attribute weighting rules
Learning general transformation rules

                Name                Street                   Phone
Zagat’s         Art’s Deli          12224 Ventura Boulevard  818-756-4124
Dept of Health  Art’s Delicatessen  12224 Ventura Blvd.      818/756-4124

Transformation Rules
Zagat’s                   Dept of Health           Transformation
Art’s Deli                Art’s Delicatessen       Abbreviation
California Pizza Kitchen  CPK                      Acronym
Philippe The Original     Philippe’s The Original  Stemming
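The three transformation types named on this slide can be sketched as simple string tests. This is an illustrative approximation under loud assumptions: abbreviation is treated as a prefix test, stemming as crude suffix stripping, and the function names are hypothetical rather than taken from the actual system.

```python
def is_acronym(short, long):
    """True if `short` is formed from the initial letters of the words in `long`."""
    words = long.split()
    return len(words) > 1 and short.lower() == "".join(w[0] for w in words).lower()


def is_abbreviation(short, long):
    """True if `short` is a proper prefix of `long` (a deliberately crude test)."""
    return len(short) < len(long) and long.lower().startswith(short.lower())


def stem(token):
    """Strip a possessive or plural suffix; a stand-in for a real stemmer."""
    t = token.lower()
    for suffix in ("’s", "'s", "s"):
        if t.endswith(suffix) and len(t) > len(suffix):
            return t[: -len(suffix)]
    return t


def classify(a, b):
    """Name the transformation relating two strings, or None if none applies."""
    if a.lower() == b.lower():
        return "equal"
    if is_acronym(a, b) or is_acronym(b, a):
        return "acronym"
    if stem(a) == stem(b):
        return "stemming"
    if is_abbreviation(a, b) or is_abbreviation(b, a):
        return "abbreviation"
    return None


print(classify("Deli", "Delicatessen"))
print(classify("CPK", "California Pizza Kitchen"))
print(classify("Philippe", "Philippe’s"))
```

Stemming is checked before abbreviation because "Philippe" is also a prefix of "Philippe’s"; the order decides which label such pairs receive.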

Active Learning to Determine Matched Records [Tejada, Knoblock, Minton ’01,’02]

Learn importance of attributes for matching records

                Name                Street                   Phone
Zagat’s         Art’s Deli          12224 Ventura Boulevard  818-756-4124
Dept of Health  Art’s Delicatessen  12224 Ventura Blvd.      818/755-4100

Mapping rules:
Name > .9 & Street > .87 => mapped
Name > .95 & Phone > .96 => mapped
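The two mapping rules can be read as a disjunction of threshold tests over per-attribute similarity scores. The sketch below encodes exactly those thresholds; how the similarity scores are computed is not stated on the slide, so the scores passed in are made-up inputs, and `is_mapped` is a hypothetical name.

```python
# Each rule maps attribute name -> minimum similarity (strictly greater than).
MAPPING_RULES = [
    {"Name": 0.9, "Street": 0.87},
    {"Name": 0.95, "Phone": 0.96},
]


def is_mapped(scores, rules=MAPPING_RULES):
    """A record pair is mapped if every threshold of at least one rule is exceeded."""
    return any(
        all(scores.get(attr, 0.0) > threshold for attr, threshold in rule.items())
        for rule in rules
    )


# Hypothetical similarity scores for two candidate record pairs.
pair_a = {"Name": 0.92, "Street": 0.90, "Phone": 0.50}  # satisfies the first rule
pair_b = {"Name": 0.92, "Street": 0.50, "Phone": 0.99}  # satisfies neither rule

print(is_mapped(pair_a), is_mapped(pair_b))
```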

Active Atlas Mapping Rule Learner

[Flowchart: Choose initial examples; Generate committee of learners; each committee member: Learn Rules, Classify Examples, Votes; Choose Example; USER: Label; output: Set of Mapped Objects]

Committee Disagreement

Chooses an example based on the disagreement of the query committee

CPK, California Pizza Kitchen is the most informative example

                               Committee
Examples                       M1   M2   M3
Art’s Deli, Art’s Delicatessen  Yes  Yes  Yes
CPK, California Pizza Kitchen   Yes  No   Yes
Ca’Brea, La Brea Bakery         No   No   No
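Choosing the example with the most committee disagreement can be sketched as follows. The votes are the ones in the slide's table; the disagreement measure (one minus the majority fraction) is an assumption made for illustration, since the slide does not say which measure is used.

```python
from collections import Counter

# Votes of committee members M1, M2, M3 on whether each pair is a match.
VOTES = {
    ("Art’s Deli", "Art’s Delicatessen"): ["Yes", "Yes", "Yes"],
    ("CPK", "California Pizza Kitchen"): ["Yes", "No", "Yes"],
    ("Ca’Brea", "La Brea Bakery"): ["No", "No", "No"],
}


def disagreement(votes):
    """0.0 when the committee is unanimous; grows as the majority shrinks."""
    majority = max(Counter(votes).values())
    return 1.0 - majority / len(votes)


# The most informative example is the one the committee disagrees on most.
most_informative = max(VOTES, key=lambda pair: disagreement(VOTES[pair]))
print(most_informative)
```

The selected pair is the one that would be shown to the user for labeling.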

Constraint-based Integration

Integrating data from multiple sources often involves reasoning about the information

Constraints provide an approach to expressing relationships and filtering data

Heracles
Framework for building integrated applications
Interleaves planning and information gathering
Uses a constraint reasoner to decide what sources to query and to integrate the results

The Travel Assistant

Dynamically Updates Slots as Information Becomes Available

Supports Informed Choices

Changes Propagate Throughout

User Can Specify High-Level Preferences

Constraint Networks for Managing Information
Constraint reasoning system
  Propagates information
  Decides when to launch information requests
  Evaluates constraints
  Computes preferences
  All run as asynchronous processes to support the user
Components:
  Representation of the variables
  Representation of constraints
  Hierarchical templates
  Constraint propagation

Constraint Networks for Integrating Information
Components:
  Representation of the variables
  Representation of constraints
  Hierarchical template representation
  Constraint propagation and cycle detection

Constraint Variables

A constraint network consists of a set of variables, such as:
  MeetingStartTime
  MeetingLocation

Variables are related by constraints that determine the possible values of a solution

Constraint Representation

Constraints are computable components:
  Local calculations (e.g., XQuery)
    MeetingStartTime + MeetingDuration --> MeetingEndTime
  Web and Database Wrappers
    ITN: DepartureAirport, ArrivalAirport, Date --> Flights
    Yahoo Weather: City, Date --> Weather prediction
  External Programs (Outlook, Planners, etc.)
    Outlook Calendar: Date --> Meetings

Results cached in tables
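A constraint as a computable component with cached results might look like the sketch below. Everything here is a hypothetical illustration: the class name `CachedConstraint`, the use of minutes for times, and the stand-in weather function, which replaces a real web wrapper so the example stays self-contained.

```python
class CachedConstraint:
    """A computable constraint: inputs --> output, with results cached in a table."""

    def __init__(self, inputs, output, compute):
        self.inputs = tuple(inputs)
        self.output = output
        self.compute = compute
        self.table = {}   # maps input values to the computed output
        self.calls = 0    # how many times the underlying computation ran

    def evaluate(self, *args):
        if args not in self.table:
            self.calls += 1
            self.table[args] = self.compute(*args)
        return self.table[args]


# Local calculation: MeetingStartTime + MeetingDuration --> MeetingEndTime
# (times in minutes since midnight, an assumption for brevity).
end_time = CachedConstraint(
    ["MeetingStartTime", "MeetingDuration"], "MeetingEndTime", lambda start, dur: start + dur
)


def fake_weather_source(city, date):
    """Placeholder for a web wrapper; a real one would query a remote source."""
    return f"forecast for {city} on {date}"


weather = CachedConstraint(["City", "Date"], "Weather", fake_weather_source)

first = weather.evaluate("Los Angeles", "2000-09-30")
second = weather.evaluate("Los Angeles", "2000-09-30")  # served from the table
print(end_time.evaluate(540, 60), weather.calls)
```

Caching matters most for the wrapper-backed constraints, where each evaluation would otherwise mean another request to an external source.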

Drive or Take a Taxi?

[Figure: constraint network]
Variables and values shown: DepartureDate = Sep 30, 2000; ReturnDate = Oct 2, 2000; Duration = 3 days; OriginAddress; DestinationAddress; DepartureAirport = LAX; Distance = 15.1 miles; ParkingRate = $7.00/day; ParkingTotal = $21.00; TaxiFare = $23.00; ModeToAirport = Drive
Constraints shown: computeDuration, FindClosestAirport, GetDistance, getParkingRate, multiply, GetTaxiFare, SelectModeToAirport
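The arithmetic behind this network can be traced with a short sketch. The constraint names follow the figure, but their bodies are assumptions: the duration is counted inclusively so that Sep 30 to Oct 2 gives 3 days, the parking rate and taxi fare are taken as given values rather than looked up, and the mode is chosen by comparing the two costs.

```python
from datetime import date


def compute_duration(departure, ret):
    """Days away, counting both the departure and return day (assumed)."""
    return (ret - departure).days + 1


def multiply(rate, days):
    return rate * days


def select_mode_to_airport(parking_total, taxi_fare):
    """Pick the cheaper option (assumed selection criterion)."""
    return "Drive" if parking_total <= taxi_fare else "Taxi"


departure_date = date(2000, 9, 30)
return_date = date(2000, 10, 2)
parking_rate = 7.00   # $/day at LAX, value from the figure
taxi_fare = 23.00     # value from the figure

duration = compute_duration(departure_date, return_date)
parking_total = multiply(parking_rate, duration)
mode = select_mode_to_airport(parking_total, taxi_fare)
print(duration, parking_total, mode)
```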

Hierarchically-Partitioned Constraint Networks

Template:
  Groups related variables and constraints
  Organizes information for computation and presentation to user
Templates organized hierarchically
  Template decomposed into subtemplates
  Choose among alternative subtemplates

Template Structure

Template
  Arguments: input and output variables
  Variables: name, type, default values
  Constraints
  Expansions: alternative subtemplate calls
  GUI specification
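One way to picture the template structure is as a small record type. The sketch below is a hypothetical data model, not the Heracles representation: the field names mirror the slide's list, and the example instance borrows template names from the travel assistant.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Variable:
    name: str
    type: type
    default: Any = None


@dataclass
class Template:
    name: str
    inputs: List[str]                      # arguments: input variables
    outputs: List[str]                     # arguments: output variables
    variables: List[Variable] = field(default_factory=list)
    # Each constraint: (input variable names, output variable name, function).
    constraints: List[Tuple[Tuple[str, ...], str, Callable[..., Any]]] = field(default_factory=list)
    # Alternative subtemplates, keyed by the selecting value (assumed encoding).
    expansions: Dict[str, "Template"] = field(default_factory=dict)
    gui: Dict[str, Any] = field(default_factory=dict)


drive = Template("Drive", inputs=["DepartureAirport"], outputs=["ParkingTotal"])
taxi = Template("Taxi", inputs=["OriginAddress", "DepartureAirport"], outputs=["TaxiFare"])

mode_to_airport = Template(
    "ModeToAirport",
    inputs=["OriginAddress", "DepartureAirport"],
    outputs=["ModeToAirport"],
    variables=[Variable("ModeToAirport", str, default="Drive")],
    expansions={"Drive": drive, "Taxi": taxi},
)
print(sorted(mode_to_airport.expansions))
```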

Partitioned Constraint Network

[Figure: variables grouped into partitions] Who, Company, Subject, Starting Time, Ending Time, Origin Addr., Dest. Addr., Origin Weather, Dest Weather, Distance, Travel Mode, Depart Time, Depart Airport, Arrival Airport, Flight Num, Arrival Time, Parking Lot, Parking Rate, Mode to Airport, Dist. to Airport, Taxi Fare

Template Hierarchy for the Travel Assistant

[Figure: AND/OR tree of templates] Templates shown: Trip, ModeToDestination, Drive, Fly, Taxi, ModeToAirport, FlightDetail, ModeFromAirport, ModeHotel, Hotel, NoOvernight, ModeNext, Trip (Return Home), Trip (Return Office), Trip (New Leg), End

Dynamic Networks
Generalization of Constraint Networks
Variables can be active or inactive
Normal constraints: x1 = k1 ∧ … ∧ xm = km → xn = kn
Activity constraints: x1 = k1 ∧ … ∧ xm = km → active(xn)
Inactive variables do not participate in the network, i.e., do not propagate constraints
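Activity constraints can be sketched as rules that switch variables on once their conditions hold among the variables that are already active. The function below is a minimal illustration with hypothetical names; the variables in the example are borrowed from the travel assistant, and the specific rules are invented for the demonstration.

```python
def active_variables(assignment, always_active, activity_constraints):
    """Compute the set of active variables.

    Each activity constraint is (condition, target): when every variable in
    `condition` is active and has the required value, `target` becomes active.
    """
    active = set(always_active)
    changed = True
    while changed:  # repeat until no further variable is activated
        changed = False
        for condition, target in activity_constraints:
            satisfied = all(
                var in active and assignment.get(var) == value
                for var, value in condition.items()
            )
            if satisfied and target not in active:
                active.add(target)
                changed = True
    return active


assignment = {"ModeToDestination": "Fly", "ModeToAirport": "Drive"}
constraints = [
    ({"ModeToDestination": "Fly"}, "ModeToAirport"),
    ({"ModeToAirport": "Drive"}, "ParkingRate"),
    ({"ModeToAirport": "Taxi"}, "TaxiFare"),
]

active = active_variables(assignment, {"ModeToDestination"}, constraints)
print(sorted(active))
```

TaxiFare stays inactive here, so no constraint involving it would be evaluated and no taxi-fare source would be queried.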

Heracles: Template Selection
Core network
  Computes values of template selection vars
  Always active
Template selection variables
  Inputs to activity constraints: determine the choice of subtemplates, i.e., which additional variables are active

Constraint Propagation

Approach
  When a variable is assigned a value, re-compute the value sets and assigned values of all dependent variables
  Proceeds recursively until no values are changed or a cycle is detected
Core network
  Propagates all variables through the core network
  Remaining variables are computed when a template is opened
Does not perform full CSP
  Less costly
  Does not require all information in advance
  Makes choices locally, so may fail to find optimal assignment
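The propagation approach described above can be sketched as a recursive assignment that stops when a value does not change and raises when a variable is revisited with a new value. This is a simplified illustration: it tracks single assigned values rather than value sets, uses minutes for times, and the `Network` and `CycleDetected` names are hypothetical.

```python
class CycleDetected(Exception):
    pass


class Network:
    def __init__(self):
        self.values = {}
        self.constraints = []  # (input names, output name, function)

    def add_constraint(self, inputs, output, compute):
        self.constraints.append((tuple(inputs), output, compute))

    def assign(self, var, value, _path=()):
        # Stop when the value is unchanged: nothing new to propagate.
        if var in self.values and self.values[var] == value:
            return
        # A changed value for a variable already on the current path is a cycle.
        if var in _path:
            raise CycleDetected(f"cycle through {var}")
        self.values[var] = value
        # Re-compute every dependent variable whose inputs are all known.
        for inputs, output, compute in self.constraints:
            if var in inputs and all(name in self.values for name in inputs):
                args = [self.values[name] for name in inputs]
                self.assign(output, compute(*args), _path + (var,))


net = Network()
net.add_constraint(["MeetingStartTime", "MeetingDuration"], "MeetingEndTime", lambda s, d: s + d)
net.assign("MeetingDuration", 60)
net.assign("MeetingStartTime", 540)
end_after_first = net.values["MeetingEndTime"]
net.assign("MeetingStartTime", 600)  # the change propagates to MeetingEndTime
end_after_second = net.values["MeetingEndTime"]

# Two constraints that keep changing each other's input form a cycle.
loop = Network()
loop.add_constraint(["x"], "y", lambda x: x + 1)
loop.add_constraint(["y"], "x", lambda y: y + 1)
try:
    loop.assign("x", 0)
    cycle_found = False
except CycleDetected:
    cycle_found = True
print(end_after_first, end_after_second, cycle_found)
```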

Discussion
General framework for interleaving planning and information gathering
Retrieves information as needed
Gathers and integrates data in a uniform framework
Evaluates tradeoffs and selects among alternatives
Allows the users to explore alternatives
Supports a wide variety of information types: databases, web pages, images, video, etc.

SmartClients [Torrens et al, 2002]
Cast an integration problem as a Constraint Satisfaction Problem (CSP)
Given a request, the server retrieves the required data and sends the data and the CSP to the client
Client solves the CSP locally
Large complex problem transmitted in small amount of space
Provides fine-grained user interaction with the data
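The idea of shipping a compact CSP to the client and solving it there can be illustrated with a tiny brute-force solver. This is not the SmartClients implementation: the variable names, the domains, and the one-hour buffer constraint are all invented for the example, and a real client would use a proper CSP solver rather than enumeration.

```python
from itertools import product


def solve(domains, constraints):
    """Enumerate all assignments over the domains and keep the consistent ones."""
    names = list(domains)
    solutions = []
    for combo in product(*(domains[name] for name in names)):
        candidate = dict(zip(names, combo))
        if all(check(candidate) for check in constraints):
            solutions.append(candidate)
    return solutions


# What a server might send: candidate values (hours of the day) plus constraints.
domains = {
    "outbound_arrival": [9, 13],
    "meeting_start": [10, 14],
}
constraints = [
    # Arrive at least one hour before the meeting starts (hypothetical rule).
    lambda a: a["outbound_arrival"] + 1 <= a["meeting_start"],
]

solutions = solve(domains, constraints)
print(solutions)
```

Because the client holds both the data and the constraints, it can re-solve locally as the user changes choices, without another round trip to the server.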

Architecture for SmartClients

SmartClients: Pros and Cons
Pros
  Elegant approach that exploits past work on CSPs
  Minimizes the data retrieval and supports complex reasoning and integration of the data
Cons
  Assumes that all data can be retrieved before any reasoning about the data
  In travel planning, assumes that prices are the same on any date and that there are no issues with flight availability

Summary
Our approach for creating “web assistants”:
  Agents for accessing web data
  Record linkage for mapping between sources
  Constraint-based integration provides the glue