
Best-Effort Data Integration

Date post: 12-Sep-2021
Page 1: Best-Effort Data Integration

AnHai Doan
University of Wisconsin-Madison

Joint work with Fei Chen, Pedro DeRose, Robert McCann, Yoonkyong Lee, Mayssam Sayyadian, Warren Shen, Luis Gravano, Raghu Ramakrishnan

Best-Effort Data Integration

Page 2: Best-Effort Data Integration

Data Integration: Current Status

We have made tremendous progress in the last 30 years
– developed foundations: mediator model, GAV, LAV
– built on the foundations: query reformulation, provenance, uncertainty, schema matching and mapping, entity resolution, adaptive query processing, managing inconsistent data, P2P, etc.
– branched into applications: bioinformatics, geospatial, Web, ...
– joined forces: databases, AI, Web

But data integration remains hard
– intractable, AI-complete, etc.

Partly because we often want exact, precise integration

Page 3: Best-Effort Data Integration

Precise Data Integration

[Figure: the query "Find houses with 4 bedrooms priced under 300K" is posed against a global schema, which sits over source schemas 1, 2, and 3, each reached through a wrapper.]

Original motivation: business applications
– e.g., payroll, human resources, banking
– here, anything less is NOT usable
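
The architecture on this slide can be sketched in a few lines. What follows is a minimal illustration, not the speaker's system: the `House` global schema, the three source layouts, the sample rows, and the function names are all assumptions made up for this example. Each wrapper translates one source's rows into the global schema, and the global relation is defined as the union of the wrapped sources (a GAV-style view).

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class House:
    """A tuple in the (hypothetical) global schema."""
    address: str
    bedrooms: int
    price: int  # US dollars


# Each source has its own schema; the rows are invented for illustration.
SOURCE_1 = [{"addr": "12 Oak St", "beds": 4, "price_usd": 285_000},
            {"addr": "9 Elm Ave", "beds": 3, "price_usd": 210_000}]
SOURCE_2 = [{"location": "7 Pine Rd", "num_bedrooms": "4", "cost_k": 320}]
SOURCE_3 = [("44 Lake Dr", 4, 299_000)]


# Wrappers translate each source's rows into global-schema tuples.
def wrap_source_1() -> Iterable[House]:
    return (House(r["addr"], r["beds"], r["price_usd"]) for r in SOURCE_1)


def wrap_source_2() -> Iterable[House]:
    # This source stores bedrooms as text and prices in thousands.
    return (House(r["location"], int(r["num_bedrooms"]), r["cost_k"] * 1000)
            for r in SOURCE_2)


def wrap_source_3() -> Iterable[House]:
    return (House(addr, beds, price) for addr, beds, price in SOURCE_3)


WRAPPERS = [wrap_source_1, wrap_source_2, wrap_source_3]


def query_global(predicate: Callable[[House], bool]) -> list[House]:
    """GAV-style mediation: the global relation is the union of the wrapped sources."""
    return [h for wrap in WRAPPERS for h in wrap() if predicate(h)]


# "Find houses with 4 bedrooms priced under 300K"
results = query_global(lambda h: h.bedrooms == 4 and h.price < 300_000)
for house in results:
    print(house)
```

Even in this toy version, every source needs a hand-written, exactly correct wrapper and mapping, which is where much of the cost of precise integration comes from.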

Page 4: Best-Effort Data Integration

However, the Application Landscape Has Changed in the Past Decade

Today, precise integration continues to be critical
– e.g., expedia.com

But for many emerging application domains, best-effort data integration
– often incurs far less cost
– may already prove very useful

Examples
– citation tracking (e.g., Citeseer, Google Scholar)
– personal information management
– scientific, exploratory data analysis
– intelligence analysis for homeland security
– business intelligence
– Web integration scenarios (e.g., Froogle)

Page 5: Best-Effort Data Integration

Best-Effort Data Integration

Remove, simplify, or make “less precise” certain components

Employ automatic techniques
To go “the last mile”: learn from human interaction

[Figure: global schema over source schemas 1, 2, and 3, each reached through a wrapper.]

Page 6: Best-Effort Data Integration

Example 1: Simplify Global Schema
Keyword Search over Multiple Databases

Novel problem
Very useful for urgent / one-time DI needs
– also when users are SQL-illiterate

Proposed solution in ICDE-07a
– combines IR, schema matching, entity resolution, and AI planning

Can do joins across data sources
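
To make the idea concrete, here is a toy sketch of keyword search that joins rows across two databases. It is not the ICDE-07a algorithm: the two databases, their rows, the hard-coded join attributes, and the `difflib`-based name matching are all assumptions chosen for illustration. A real system would have to discover the join attributes (schema matching) and match entities far more carefully.

```python
from difflib import SequenceMatcher

# Two independent "databases" with different schemas (rows are invented).
HR_DB = [{"emp_name": "Alice Smith", "dept": "Sales"},
         {"emp_name": "Bob Jones", "dept": "Engineering"}]
TICKET_DB = [{"customer_contact": "alice smith", "issue": "late delivery"},
             {"customer_contact": "Carol White", "issue": "billing error"}]


def matches(row: dict, keyword: str) -> bool:
    """True if the keyword occurs (case-insensitively) in any attribute value."""
    return any(keyword.lower() in str(v).lower() for v in row.values())


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Crude stand-in for entity resolution: approximate string equality."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def keyword_search(keywords: list[str]) -> list[tuple[dict, dict]]:
    """Return cross-database row pairs that join on a name and cover all keywords."""
    answers = []
    for hr in HR_DB:
        for ticket in TICKET_DB:
            # Assumed join condition: the two name attributes refer to the same person.
            if not similar(hr["emp_name"], ticket["customer_contact"]):
                continue
            if all(matches(hr, k) or matches(ticket, k) for k in keywords):
                answers.append((hr, ticket))
    return answers


answers = keyword_search(["sales", "delivery"])
for hr_row, ticket_row in answers:
    print(hr_row, ticket_row)
```

Neither keyword alone identifies the answer; it only appears once rows from the two sources are joined, which is the point of the last bullet above.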

Page 7: Best-Effort Data Integration

Example 2: Simplify Wrappers
Structured Queries over Text/Web Data

Novel problem
Proposed solution in ICDE-07b

[Figure: a SQL query (SELECT ... FROM ... WHERE ...) posed over e-mails, text, Web data, news, etc.]
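
One simple way to picture SQL over text is to extract tuples with a rule and load them into a relational engine. The sketch below does that with a regular expression and an in-memory SQLite table. It is an assumption-laden illustration, not the ICDE-07b approach: the documents, the `talks` table, and the extraction pattern are invented here.

```python
import re
import sqlite3

# Invented snippets standing in for e-mails or Web text.
DOCS = [
    "Talk announcement: Jane Roe will speak at DB Seminar on 2007-03-01.",
    "Reminder: John Poe will speak at Systems Lunch on 2007-03-08.",
    "No structured facts in this sentence.",
]

# A hand-written extraction rule playing the role of a (simplified) wrapper.
TALK = re.compile(r"(?P<speaker>[A-Z]\w+ [A-Z]\w+) will speak at "
                  r"(?P<venue>.+?) on (?P<date>\d{4}-\d{2}-\d{2})")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE talks (speaker TEXT, venue TEXT, date TEXT)")
for doc in DOCS:
    m = TALK.search(doc)
    if m:  # best effort: text the rule cannot parse is simply skipped
        conn.execute("INSERT INTO talks VALUES (?, ?, ?)",
                     (m["speaker"], m["venue"], m["date"]))

# A structured query over what was extracted from the text.
rows = conn.execute(
    "SELECT speaker, venue FROM talks WHERE date >= '2007-03-05' ORDER BY date"
).fetchall()
print(rows)
```

The answer is only as complete as the extraction rule, which is the best-effort trade-off: a cheap, imperfect wrapper in exchange for being able to ask structured questions at all.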

Page 8: Best-Effort Data Integration

Example 3: Best-Effort Data Integration for Web Communities

Numerous data-rich communities
– database researchers, movie fans, legal professionals, bioinformatics, enterprise intranets, etc.

Each community = many disparate data sources + people
Members often want to discover, query, and monitor information in the community
– any interesting connection between researchers X and Y?
– find all citations of this paper in the past week on the Web
– what is new in the past 24 hours in the database community?
– what are current hot topics? who has moved where?

Page 9: Best-Effort Data Integration

The Cimple Project @ Wisconsin/Yahoo!

[Figure: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP; Web pages and text documents) feed an entity-relationship graph (example: Jim Gray, give-talk, SIGMOD-04). Services over the graph: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Users personalize the system and provide feedback.]

Builds structured data portals using extraction + integration + mass collaboration

Page 10: Best-Effort Data Integration

Prototype System: DBLife

Integrate data of the DB research community
1164 data sources

Crawled daily, 11000+ pages = 160+ MB / day

Page 11: Best-Effort Data Integration

Data Extraction

Page 12: Best-Effort Data Integration

Data Integration

Raghu Ramakrishnan

co-authors = A. Doan, Divesh Srivastava, ...

Page 13: Best-Effort Data Integration

Resulting ER Graph

[Figure: entity-relationship graph with nodes for the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, SIGMOD 2005, David DeWitt, and Pedro Bizarro, connected by edges labeled coauthor, advise, write, PC-Chair, and PC-member.]

Page 14: Best-Effort Data Integration

Querying The ER Graph

Query: “David DeWitt Jennifer Widom”

[Figure: three ranked result subgraphs (1., 2., 3.) connecting Jennifer Widom and David DeWitt, involving a coauthor edge, SIGMOD 2005 (PC-Chair and PC-member edges), and Shivnath Babu (coauthor and advise edges).]
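
A keyword query naming two entities can be answered by enumerating the paths that connect them in the ER graph. The sketch below shows one simple way to do this with breadth-first search. It makes several assumptions: the node names are placeholders rather than the people on the slide, the edges are invented, and ranking shorter connections first is a choice made for this example, not a description of how DBLife ranks results.

```python
from collections import deque

# Undirected labeled edges (entity, relation, entity); all names are placeholders.
EDGES = [
    ("Researcher A", "coauthor", "Researcher B"),
    ("Researcher A", "PC-Chair", "Conference X"),
    ("Researcher B", "PC-member", "Conference X"),
    ("Researcher A", "advise", "Student S"),
    ("Student S", "coauthor", "Researcher B"),
]


def build_adjacency(edges):
    """Map each entity to its list of (relation, neighbor) pairs."""
    adj = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))
        adj.setdefault(v, []).append((label, u))
    return adj


def connections(adj, source, target, max_edges=2):
    """Enumerate simple paths from source to target, shortest first.

    A path is a list alternating entities and relations:
    [entity, relation, entity, ...].
    """
    found = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        if (len(path) - 1) // 2 >= max_edges:
            continue  # do not extend paths beyond the edge budget
        for label, nxt in adj.get(node, []):
            if nxt not in path[::2]:  # entities sit at even positions
                queue.append(path + [label, nxt])
    return found


adj = build_adjacency(EDGES)
results = connections(adj, "Researcher A", "Researcher B")
for rank, path in enumerate(results, start=1):
    print(rank, " - ".join(path))
```

With these placeholder edges the search returns three connections: the direct coauthor edge, a path through the conference, and a path through the student.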

Page 15: Best-Effort Data Integration

Provide Services

DBLife system

Page 16: Best-Effort Data Integration

Mass Collaboration: A Simplified Example

Picture is removed if enough users vote “no”.
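
The voting rule on this slide can be sketched as a small data structure. The slide does not say what "enough" means, so the threshold of three "no" votes and the requirement that "no" votes outnumber "yes" votes are assumptions, as are the class and method names.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """An extracted item (e.g., a picture attached to a person) awaiting feedback."""
    label: str
    votes: dict = field(default_factory=dict)  # user id -> True ("yes") / False ("no")

    def vote(self, user: str, is_correct: bool) -> None:
        # One vote per user; a later vote overwrites the earlier one.
        self.votes[user] = is_correct

    def is_removed(self, min_no_votes: int = 3) -> bool:
        """Removed once 'no' votes reach the (assumed) threshold and outnumber 'yes' votes."""
        no = sum(1 for v in self.votes.values() if not v)
        yes = len(self.votes) - no
        return no >= min_no_votes and no > yes


picture = Candidate("picture extracted for some researcher")
picture.vote("user1", False)
picture.vote("user2", False)
print(picture.is_removed())  # two "no" votes: below the assumed threshold
picture.vote("user3", False)
print(picture.is_removed())  # three "no" votes and no "yes" votes
```

Keying votes by user keeps one user from removing an item alone by voting repeatedly.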

Page 17: Best-Effort Data Integration

More on Mass Collaboration

Page 18: Best-Effort Data Integration

DBLife: Key Lessons Learned

Built relatively simple best-effort integration tools
Combined them in a flexible, bottom-up fashion
System already appears interesting/useful
– see dblife.cs.wisc.edu (still very preliminary & slow)

Hence a possible strategy for best-effort integration:
– build relatively simple integration tools
– learn how to combine them effectively

Relatively simple integration tools = Lego blocks
– easier to build, debug, work with, enable quick tech transfer?

Building systems brings many benefits
– suggests many interesting / unexpected research challenges
– helps bridge the research/tech transfer gap

Page 19: Best-Effort Data Integration

Research Challenges (1)

Information extraction
Data integration
Mass collaboration
– how to collectively edit extracted and integrated data?

Page 20: Best-Effort Data Integration

Research Challenges (2)

Exploiting extracted data
– keyword search, structured querying, mining, monitoring
– how to seamlessly transition among these?

Handling uncertainty / provenance / explanation
Dealing with evolving data

Page 21: Best-Effort Data Integration

Research Challenges (3)

New data model?
Should we use / extend relational databases?
How to build continuously running systems?

Page 22: Best-Effort Data Integration

Summary

Precise vs. best-effort data integration
Sample research
– keyword search over multiple databases
– SQL queries over text
– Cimple project @ Wisconsin/Yahoo! Research

The topic is wide open
Our community can contribute much
Prototype system: DBLife
– can serve as a data integration challenge / testbed / benchmark
– potentially provides useful service to our community (as DBWorld+)
– provides data for researchers (on a variety of topics)

More details: search “anhai cimple”

Page 23: Best-Effort Data Integration

Mass Collaboration Meets Jeff Naughton

Jeffrey F. Naughton swears that this is David J. DeWitt

