
Data Integration for the Relational Web

Date post: 24-Feb-2016
Data Integration for the Relational Web. Katsarakis Michalis. Presentation of the paper: Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. Proc. VLDB Endow. 2, 1 (August 2009), 1090-1101.
Transcript
Page 1: Data Integration for the Relational Web

Data Integration for the Relational Web

Katsarakis Michalis

Page 2: Data Integration for the Relational Web

Data Integration for the Relational Web

Katsarakis Michalis

Presentation of the paper:

Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. Proc. VLDB Endow. 2, 1 (August 2009), 1090-1101

for the needs of the course HY-562

Page 3: Data Integration for the Relational Web

Octopus system in one slide

[Screenshot: the "Presentations" page of the course HY-562 (Spring 2011-12), showing the presentation program and the list of available papers, followed by the relation extracted from the presentation program.]

Date | Area | Paper | Download | Name | Presentation | Report
21-May | Data Uncertainty | X. L. Dong, A. Halevy, C. Yu. "Data integration with uncertainty". The VLDB Journal (2009) 18:469-500 | download | Mixalis Hortis | - | -
21-May | Data Uncertainty | P. Sen, A. Deshpande. "Representing and Querying Correlated Tuples in Probabilistic Databases". ICDE 2007 | download | Grammatikou Magdalini | - | -
22-May | Data Uncertainty | N. Dalvi and D. Suciu. "Efficient query evaluation on probabilistic databases". The VLDB Journal, 16(4), 2007 | download | Athanasia Katsouraki | - | -
22-May | Data Uncertainty | Bhargav Kanagal, Amol Deshpande: Efficient Query Evaluation over Temporally Correlated Probabilistic Streams. ICDE 2009: 1315-1318 | download | Aleka Seliniotaki | - | -
29-May | Keyword-based search | Alexander Markowetz, Yin Yang, and Dimitris Papadias. 2009. Keyword search over relational tables and streams. ACM Trans. Database Syst. 34, 3, Article 17 (September 2009) | download | Doklea Metsi | - | -
29-May | Keyword-based search | Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum. STAR: Steiner-Tree Approximation in Relationship Graphs | download | Grammatikakis Constantinos | - | -
30-May | Structured web data | Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. Proc. VLDB Endow. 2, 1 (August 2009), 1090-1101 | download | Katsarakis Michalis | - | -
30-May | Structured web data | Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1, 1 (August 2008), 538-549 | download | Karanasiou Katerina | - | -
30-May | Structured web data | Girija Limaye, Sunita Sarawagi, Soumen Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010 | download | Lambraki Iwanna | - | -

Page 4: Data Integration for the Relational Web

Octopus system in one slide

[Screenshot: the "Program Committees" page of a conference web site (Home » Organization), listing the Core Database Technology program chair (Jignesh M. Patel, University of Wisconsin, USA) and the PC members, followed by the relation extracted from the PC member list.]

Name | Institute | Country
Daniel Abadi | University of Yale | USA
Anastasia Ailamaki | EPFL | Switzerland
Walid Aref | University of Purdue | USA
Phil Bohannon | Yahoo! Research | USA
Peter Boncz | CWI | The Netherlands
Angela Bonifati | CNR | Italy
Nick Bruno | Microsoft Research | USA
Ugur Cetintemel | University of Brown | USA
Sang Cha | Seoul National University | Korea
Chee Yong Chan | NUS | Singapore
Mitch Cherniack | University of Brandeis | USA
Junghoo Cho | UCLA | USA
Panos Chrysanthis | University of Pittsburgh | USA
Mariano Consens | University of Toronto | Canada
Amol Deshpande | University of Maryland | USA
David DeWitt | Microsoft Research | USA
Yanlei Diao | University of Massachusetts | USA
AnHai Doan | University of Wisconsin | USA
Christos Faloutsos | CMU | USA
Wenfei Fan | University of Edinburgh | UK
Alan Fekete | University of Sydney | Australia
Naga Govindaraju | Microsoft Research | USA

Page 5: Data Integration for the Relational Web

Octopus system in one slide

1. Search
   1. Find relations relevant to the user's query string
   2. Cluster similar tables together
2. Context
   – Enrich relations with data from the surrounding text
3. Extend
   – Adorn an existing relation with additional data columns derived from other relations

Page 6: Data Integration for the Relational Web

Index

1. Integration Operators
2. Algorithms
3. Implementation at Scale
4. Experiments
5. Related Work
6. Conclusions

Page 7: Data Integration for the Relational Web

INTEGRATION OPERATORS


Page 8: Data Integration for the Relational Web

Search Operator

[Diagram: a keyword query string and the extracted set of relations are fed to Relevance Ranking, which produces an ordered list of relevant relations; Clustering then produces an ordered list of clusters of relations.]

Page 9: Data Integration for the Relational Web

Search Operator (2)

• The Search operator finds relevant data on the Web and then clusters the results.
  – Each member table of a cluster is a concrete table that contributes to the cluster's schema relation.

Page 10: Data Integration for the Relational Web

Context Operator

[Diagram: Context takes an extracted relation T and T's source web page as input, and outputs T enriched with new columns.]

Page 11: Data Integration for the Relational Web

Context Operator (2)

[Screenshot: the "Presentations" page of the course HY-562 (Spring 2011-12), followed by the relation extracted from its presentation program, enriched by Context with the columns "Course id" and "Semester" taken from the surrounding page.]

Date | Area | Paper | Download | Name | Presentation | Report | Course id | Semester
21-May | Data Uncertainty | X. L. Dong, A. Halevy, C. Yu. "Data integration with uncertainty". The VLDB Journal (2009) 18:469-500 | download | Mixalis Hortis | - | - | HY-562 | Summer 2012
21-May | Data Uncertainty | P. Sen, A. Deshpande. "Representing and Querying Correlated Tuples in Probabilistic Databases". ICDE 2007 | download | Grammatikou Magdalini | - | - | HY-562 | Summer 2012
22-May | Data Uncertainty | N. Dalvi and D. Suciu. "Efficient query evaluation on probabilistic databases". The VLDB Journal, 16(4), 2007 | download | Athanasia Katsouraki | - | - | HY-562 | Summer 2012
22-May | Data Uncertainty | Bhargav Kanagal, Amol Deshpande: Efficient Query Evaluation over Temporally Correlated Probabilistic Streams. ICDE 2009: 1315-1318 | download | Aleka Seliniotaki | - | - | HY-562 | Summer 2012
29-May | Keyword-based search | Alexander Markowetz, Yin Yang, and Dimitris Papadias. 2009. Keyword search over relational tables and streams. ACM Trans. Database Syst. 34, 3, Article 17 (September 2009) | download | Doklea Metsi | - | - | HY-562 | Summer 2012
29-May | Keyword-based search | Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum. STAR: Steiner-Tree Approximation in Relationship Graphs | download | Grammatikakis Constantinos | - | - | HY-562 | Summer 2012
30-May | Structured web data | Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. Proc. VLDB Endow. 2, 1 (August 2009), 1090-1101 | download | Katsarakis Michalis | - | - | HY-562 | Summer 2012
30-May | Structured web data | Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1, 1 (August 2008), 538-549 | download | Karanasiou Katerina | - | - | HY-562 | Summer 2012
30-May | Structured web data | Girija Limaye, Sunita Sarawagi, Soumen Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010 | download | Lambraki Iwanna | - | - | HY-562 | Summer 2012

Page 12: Data Integration for the Relational Web

Context Operator (3)

• Data values that hold for every tuple are generally “projected out” and added to the Web page’s surrounding text.

• Context takes as input a single extracted Table T and modifies it to contain additional columns, using data retrieved from T’s source Web Page

Page 13: Data Integration for the Relational Web

Extend Operator

[Diagram: Extend takes a topic keyword k and a column c of relation T as input, and outputs the extended relation T'.]

Page 14: Data Integration for the Relational Web

Extend Operator (2)

• Enables the user to add more columns to a table by performing a join.
• Takes a column "c" of table T and a topic keyword "k" as input.
• Returns one or more columns whose values are described by k.
• The new column added to T does not necessarily come from a single data source.
• It gathers data from a large number of sources.
• It can also gather data from tables whose label differs from k, or that have no label at all.

Page 15: Data Integration for the Relational Web

ALGORITHMS


Page 16: Data Integration for the Relational Web

Algorithms

• Search
  – Ranking
  – Clustering
• Context
• Extend

• Search:
  – Rank the tables by relevance to the user's query.
  – Cluster other related tables around the top-ranking search result.

Page 17: Data Integration for the Relational Web

Ranking Algorithms

• SimpleRank
  – Sends the user's search query to a Web search engine, obtains the URL ordering, and presents the data in that order.
  – Drawbacks:
    • It ranks whole pages, not the data on each page.
      – E.g., a person's home page may contain an HTML list that serves only as a navigation list to other pages.
    • When multiple data sets are present on a Web page, SimpleRank relies on in-page ordering (i.e., the order of appearance).
    • Any metadata about an HTML list exists only in the surrounding text, not in the table itself.
      – SimpleRank cannot count hits between the query and a specific table's metadata.

Page 18: Data Integration for the Relational Web

Ranking Algorithms (2)

• SCPRank

\[ \mathrm{Score}(\mathrm{Col}_j) = \sum_{c \in \mathrm{Col}_j} \mathrm{scp}(q, c), \qquad \mathrm{Score}(\mathrm{Table}) = \max_j \mathrm{Score}(\mathrm{Col}_j) \]

Page 19: Data Integration for the Relational Web

Ranking Algorithms (3)

• SCPRank
  – Uses symmetric conditional probability (SCP) to measure the correlation between a cell in the extracted database and a query term. For a query q and cell text c, it is defined as:

    \[ \mathrm{scp}(q, c) = \frac{p(q, c)^2}{p(q)\, p(c)} \]

    • It indicates how likely the terms q and c are to appear together in a document.
  – SCPRank scores the table, not the cell.
  – It sends the query to the search engine and extracts a candidate set of tables.
  – It then computes per-column scores, each of which is the sum of the per-cell SCP scores in the column.
  – The table's overall score is the maximum of its per-column scores.
  – Finally, it sorts the tables by score and returns a ranked list.
  – Computing SCP for every cell is time-consuming, so two approximations are used:
    • Compute scores for only the first r rows of every candidate table.
    • Approximate the SCP score on a small subset of the Web corpus.
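The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `scp_rank_score` and `scp_rank`, the `max_rows` parameter (standing in for r), and the idea of passing the SCP function in as a callable are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Table = Sequence[Sequence[str]]  # a table is a sequence of rows of cell strings


def scp_rank_score(query: str, table: Table,
                   scp: Callable[[str, str], float],
                   max_rows: int = 5) -> float:
    """Table score = max over columns of the sum of per-cell SCP scores."""
    rows = list(table)[:max_rows]  # only the first r rows are scored
    if not rows:
        return 0.0
    n_cols = max(len(row) for row in rows)
    column_scores = []
    for j in range(n_cols):
        # Ragged rows are tolerated: missing cells simply contribute nothing.
        cells = [row[j] for row in rows if j < len(row)]
        column_scores.append(sum(scp(query, cell) for cell in cells))
    return max(column_scores)


def scp_rank(query: str, tables: List[Table],
             scp: Callable[[str, str], float],
             max_rows: int = 5) -> List[Table]:
    """Return the candidate tables sorted by descending table score."""
    return sorted(tables,
                  key=lambda t: scp_rank_score(query, t, scp, max_rows),
                  reverse=True)
```

Passing `scp` as a parameter keeps the sketch independent of how the probabilities are estimated, which in the slides is done against a sample of the Web corpus.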

Page 20: Data Integration for the Relational Web

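Embedded Appendix:

Symmetric conditional probability

• Let s be a term. Then p(s) is the fraction of Web documents that contain s:

  \[ p(s) = \frac{\#\,\text{documents containing } s}{\#\,\text{documents}} \]

• Similarly, p(s1, s2) is the fraction of documents containing both s1 and s2:

  \[ p(s_1, s_2) = \frac{\#\,\text{documents containing both } s_1 \text{ and } s_2}{\#\,\text{documents}} \]

• The SCP between a query q and the text in a data cell c is defined as follows:

  \[ \mathrm{scp}(q, c) = \frac{p(q, c)^2}{p(q)\, p(c)} \]

• It indicates how likely the terms q and c are to appear together in a document.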
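As a worked example, SCP can be computed directly from document counts. The sketch below assumes the standard SCP definition, scp(q, c) = p(q, c)^2 / (p(q) p(c)); the function name and its count-based parameters are hypothetical and chosen only for illustration.

```python
def scp(n_q: int, n_c: int, n_both: int, n_docs: int) -> float:
    """Symmetric conditional probability from document counts.

    n_q    : documents containing the query term q
    n_c    : documents containing the cell text c
    n_both : documents containing both q and c
    n_docs : total number of documents in the corpus sample
    """
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    if n_q == 0 or n_c == 0:
        return 0.0  # a term that never occurs cannot co-occur with anything
    p_q = n_q / n_docs
    p_c = n_c / n_docs
    p_qc = n_both / n_docs
    return p_qc ** 2 / (p_q * p_c)
```

For instance, if q appears in 20 of 100 documents, c in 50, and both together in 10, then scp = 0.1^2 / (0.2 * 0.5) = 0.1. Two terms that always occur together score 1.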

Page 21: Data Integration for the Relational Web

Ranking Algorithms (4)

Page 22: Data Integration for the Relational Web

Clustering Algorithms

• TextCluster
  – Computes the tf-idf cosine distance between the text of table a and the text of table b.
• SizeCluster
  – Computes a column-to-column similarity score that measures the difference in mean string length between the two columns.
  – The overall table-to-table similarity score for a pair of tables is the sum of the per-column scores for the best column-to-column matching.
• ColumnCluster
  – Similar to SizeCluster, but it computes a tf-idf cosine distance using only the text found in the two columns.

Page 23: Data Integration for the Relational Web

Embedded Appendix:

tf-idf

• Term frequency–inverse document frequency
• Reflects how important a word is to a document in a collection or corpus
  – Highest when the term occurs many times within a small number of documents
  – Lower when the term occurs fewer times in a document, or occurs in many documents
  – Lowest when the term occurs in virtually all documents
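The following sketch shows one common tf-idf weighting (raw term count times log(N / document frequency)) together with cosine similarity, the measure behind TextCluster and ColumnCluster. The exact weighting used by Octopus is not given in the slides, so this variant and all function names are assumptions; a cosine distance can be taken as 1 minus the similarity.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequencies(corpus: List[List[str]]) -> Dict[str, int]:
    """Count, for each term, the number of documents that contain it."""
    df: Counter = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return dict(df)


def tfidf_vector(tokens: List[str], doc_freq: Dict[str, int],
                 n_docs: int) -> Dict[str, float]:
    """Weight each term by count * log(n_docs / document frequency)."""
    tf = Counter(tokens)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items() if doc_freq.get(term, 0) > 0}


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Note that a term occurring in every document gets weight log(1) = 0, which matches the "lowest" case in the list above.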

Page 24: Data Integration for the Relational Web

Context Algorithms

• SignificantTerms (ST)
  – Examines the source page of the extracted table and returns the k terms that have the highest tf-idf values and do not appear in the extracted data.
• RVP (Related View Partners)
  – Looks beyond the source page.
  – Operating on the table T, it obtains a large number of candidate related-view tables by using each value in T as a parameter for a new Web search.
  – It then filters out tables that are unrelated to T's source page by removing all tables that do not contain at least one value from ST(T), the SignificantTerms output for T.
  – It collects all the data values in the remaining tables, ranks them by frequency of occurrence, and returns the k highest-ranked values.
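A minimal sketch of the SignificantTerms idea follows, assuming whitespace tokenization, lowercase matching, and a precomputed document-frequency table. These choices, the fallback document frequency of 1 for unseen terms, and the function name are illustrative assumptions rather than details from the paper.

```python
import math
from collections import Counter
from typing import Dict, List


def significant_terms(page_tokens: List[str], table_cells: List[str],
                      doc_freq: Dict[str, int], n_docs: int,
                      k: int) -> List[str]:
    """Return the k page terms with the highest tf-idf that are not in the table."""
    # Terms already present in the extracted data are excluded.
    table_terms = {tok.lower() for cell in table_cells for tok in cell.split()}
    tf = Counter(tok.lower() for tok in page_tokens)
    scores = {}
    for term, count in tf.items():
        if term in table_terms:
            continue
        df = doc_freq.get(term, 1)  # assumption: unseen terms occur in one document
        scores[term] = count * math.log(n_docs / df)
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

On a course page like the one in the Context example, a rare token such as the course id would outrank common words, while names that already appear in the table are skipped.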

Page 25: Data Integration for the Relational Web

Context Algorithms (2)

• Hybrid
  – Exploits the fact that the two algorithms above are complementary.
  – ST finds context terms that RVP misses, and RVP discovers context terms that ST misses.
  – Hybrid returns the context terms that appear in the result of either algorithm.

Page 26: Data Integration for the Relational Web

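Extend Algorithms

• JoinTest

[Diagram: JoinTest computes the Jaccardian distance between T's column c and each candidate table (Candidate 1: α, Candidate 2: β), keeps the candidates whose distance is within the threshold, and returns an ordered list of joinable tables.]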

Page 27: Data Integration for the Relational Web

Extend Algorithms (2)

• JoinTest
  – Combines Web search and key matching to perform schema matching.
  – Uses Jaccardian distance to measure the compatibility between the values of T's column c and each column of each candidate table.
  – If the distance is within a constant threshold t, the tables are considered joinable.
  – All tables that pass this threshold are sorted by relevance to the keyword k.

Page 28: Data Integration for the Relational Web

Embedded Appendix:

Jaccardian Distance

• Jaccard similarity coefficient
  – Measures similarity between sample sets
• Jaccardian distance
  – Measures dissimilarity between sample sets
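For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 minus that value. A small Python sketch (the function names, and the convention that two empty sets are identical, are choices made for this example):

```python
from typing import Iterable


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def jaccard_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """Dissimilarity between the two sets: 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)
```

For example, the column values {France, Italy, Spain} and {France, Italy, Greece} share 2 of 4 distinct values, so both the similarity and the distance are 0.5.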

Page 29: Data Integration for the Relational Web

Extend Algorithms (3)

• MultiJoin

[Diagram: given the topic keyword k, MultiJoin issues a Web search for every pair (c.cell, k), producing an ordered list of relevant relations; Clustering then produces clusters of relations, ordered by relevance and JoinScore.]

Page 30: Data Integration for the Relational Web

Extend Algorithms (4)

• MultiJoin
  – Attempts to join each tuple in the source table T with a potentially different table.
    • Can handle the case when there is no single joinable table.
  – Issues a distinct Web search query for every (c.cell, k) pair.
  – Clusters the results.
  – Ranks the clusters using a combination of the relevance score for the ranked table and a join score for the cluster.
    • JoinScore counts how many unique values from T's column c elicited tables in the cluster via the Web search step.
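The JoinScore described above can be illustrated as follows. The data layout is an assumption made for the sketch: each table is identified by a hypothetical id, and `elicited_by` records which values of column c produced that table in the per-cell Web search step.

```python
from typing import Dict, Iterable, Set


def join_score(cluster: Iterable[str],
               elicited_by: Dict[str, Set[str]]) -> int:
    """Count the distinct column-c values whose search returned a table in the cluster."""
    values: Set[str] = set()
    for table_id in cluster:
        # Tables with no recorded provenance contribute nothing.
        values |= elicited_by.get(table_id, set())
    return len(values)
```

A cluster whose tables were reached from many different source cells gets a high score, which suggests it can extend many tuples of T.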

Page 31: Data Integration for the Relational Web

Extend Algorithms (5)

Page 32: Data Integration for the Relational Web

IMPLEMENTATION AT SCALE


Page 33: Data Integration for the Relational Web

Implementation at Scale

• Question: Can Octopus ever provide low latencies for a mass audience?
• Challenges
  – Traditional relevance-based Web search challenges
  – Non-adjacent SCP computations for
    • Search: SCPRank algorithm
  – Multi-query Web searches for
    • Context: RVP algorithm
    • Extend: MultiJoin algorithm
• Search engines can afford to spend a huge amount of resources to process a single query quickly, but the same is not true for a single Octopus user who issues tens of thousands of queries.
• Case 1: two small prototype back-end systems
• Case 2: approximation techniques to make it computationally feasible

Page 34: Data Integration for the Relational Web

Non-adjacent SCP computations

• It is not feasible to precompute word-pair statistics: even for pairs of tokens alone, each sampled document would yield O(w^2) unique token combinations.
• Instead: a miniature search engine that fits entirely in memory
  – 100 GiB RAM over 100 machines
  – A few billion Web pages
  – Hit counts are not exactly precise (to save memory, document sets are represented using Bloom filters)

Page 35: Data Integration for the Relational Web

Embedded Appendix:

Bloom Filter

• A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
• A query can return
  – "inside set (may be wrong)"
  – "definitely not in set"
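A toy Bloom filter makes the two possible answers concrete. This sketch is illustrative only: the bit-array size, the number of hash functions, and the use of salted SHA-256 digests as hash functions are assumptions, not details of the Octopus prototype.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, n_bits: int = 1024, n_hashes: int = 3) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive n_hashes bit positions by salting the item with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "inside set (may be wrong)"; False means "definitely not in set".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage: after `bf.add("doc-1")`, the test `"doc-1" in bf` is always true, whereas an item that was never added is usually, but not always, reported as absent.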

Page 36: Data Integration for the Relational Web

Multi-Query web searches

• The naïve implementation of the Context RVP algorithm requires r*d Web searches
  – r: number of tables processed by Context
  – d: average number of sampled non-numeric data cells in each table
• d is kept at fairly low values (e.g., 30)
• RVP offers a real gain in quality
• MultiJoin has a smaller problem, as it needs one query per row
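The query budget can be made concrete with a little arithmetic. In the sketch below only d = 30 comes from the slide; the function names and the example values for r and the row count are hypothetical.

```python
def rvp_search_count(r: int, d: int) -> int:
    """Web searches needed by naive RVP: one per sampled cell, r tables with d cells each."""
    return r * d


def multijoin_search_count(n_rows: int) -> int:
    """Web searches needed by MultiJoin: one (c.cell, k) query per row of T."""
    return n_rows


# Hypothetical workload: 10 tables sampled at d = 30 cells each, versus
# extending a single 50-row table with MultiJoin.
rvp_total = rvp_search_count(r=10, d=30)
multijoin_total = multijoin_search_count(n_rows=50)
```

Under these assumed numbers RVP needs 300 searches where MultiJoin needs 50, which is why capping d matters.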

Page 37: Data Integration for the Relational Web

EXPERIMENTS


Page 38: Data Integration for the Relational Web

Experiments

• The goal is to evaluate the quality of the results generated by each Octopus operator.
• Collecting queries
  – A diverse query load was collected from Web users via Amazon Mechanical Turk. Each user suggested:
    • The topic of a data table
    • Two distinct URLs that provide example tables

Page 39: Data Integration for the Relational Web

Experiments (2)

Page 40: Data Integration for the Relational Web
Page 41: Data Integration for the Relational Web

Ranking Experiments

• Ran the ranking phase of Search on each of the above 52 queries, first using SimpleRank, then SCPRank.

• Two judges, drawn from Amazon Mechanical Turk, labeled each table's relevance to the query on a scale of 1-5.

• A table was marked as relevant only when both judges gave a score of 4 or higher.

Page 42: Data Integration for the Relational Web

Ranking Experiments (2)

• Results
  – SCPRank performs substantially better than SimpleRank, especially in the Top-2 case.
  – The extra computational overhead clearly offers real gains in result quality.

Page 43: Data Integration for the Relational Web

Clustering Experiments

• Issued queries and obtained a sorted list of tables using SCPRank.
  – The best table for each result was chosen manually and used as the center input to the clustering system.
• Cluster quality was assessed by computing the percentage of queries in which a k-sized cluster contains a table that is "highly similar" to the center.
• Whether a table is "highly similar" was determined by asking two users from Amazon Mechanical Turk to rate the similarity of the pair on a scale of 1-5.
• A table was marked as "highly similar" only when both judges gave a score of 4 or higher.

Page 44: Data Integration for the Relational Web

Clustering Experiments (2)

• Results
  – k is the cluster size: the system has only k "guesses" to find a table that is similar to the center.
  – There is little variance in quality across all algorithms.

Page 45: Data Integration for the Relational Web

Context Experiments

• The top-1 relevant table per query was used.
• Two of the authors manually reviewed each table's source page, noting terms that appeared to be useful context values.
• The values that both reviewers noted were added to the test set of true context values.
• Within the test set, there is a median of 3 true context values per table.
• Measured the percentage of tables for which a true context value is included in the top-k context terms generated by each algorithm.
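The evaluation metric described above can be sketched as a top-k hit rate. The function name and the input layout (one ranked term list and one set of true values per table, aligned by position) are assumptions made for this illustration.

```python
from typing import List, Set


def top_k_hit_rate(ranked_terms_per_table: List[List[str]],
                   true_values_per_table: List[Set[str]],
                   k: int) -> float:
    """Fraction of tables whose top-k context terms include a true context value."""
    if not ranked_terms_per_table:
        return 0.0
    hits = 0
    for ranked, truth in zip(ranked_terms_per_table, true_values_per_table):
        # A table counts as a hit if any of its first k terms is a true value.
        if set(ranked[:k]) & set(truth):
            hits += 1
    return hits / len(ranked_terms_per_table)
```

With two tables whose true values sit at ranks 2 and 3 respectively, the hit rate is 0.0 at k = 1, 0.5 at k = 2, and 1.0 at k = 3.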

Page 46: Data Integration for the Relational Web

Context Experiments (2)

• Results
  – Context can adorn a table with useful data from the surrounding text over 80% of the time.
  – Although the outputs of RVP and SignificantTerms are not disjoint, RVP is able to discover new context terms that were missed by SignificantTerms.
  – SignificantTerms does not yield the best output quality, but it is efficient and very easy to implement.

Page 47: Data Integration for the Relational Web

Extend Experiments

• A small number of queries that appeared to be Extend-able were chosen.
• The top-1 ranked "relevant" table returned by Search was used.
• The join column c and the topic keyword k were chosen by hand, opting for values that appeared to be amenable to Extend processing.

Page 48: Data Integration for the Relational Web

Extend Experiments (2)

• Results
  – JoinTest (which tries to find a single satisfactory table) found extended tuples in only 3 cases:
    • Countries
    • US Cities
    • UK Political Parties
  – In these 3 cases, 60% of tuples were extended.
  – MultiJoin found extended data for all cases.
  – On average, 33% of the source tuples were extended.
  – MultiJoin has a lower rate of tuple extension than JoinTest.
  – MultiJoin finds an average of 45.5 correct extension values for every successfully extended source tuple.
  – MultiJoin's per-tuple approach gives it flexibility.
  – With MultiJoin, fewer rows may be extended, but at least some data can be found.

Page 49: Data Integration for the Relational Web

Experiments Summary

• It is possible to obtain high-quality results for all three Octopus operators

• Even with imperfect outputs, Octopus improves the productivity of the user

• Promising areas of future research
  – Output quality
  – Algorithmic runtime performance

Page 50: Data Integration for the Relational Web

RELATED WORK


Page 51: Data Integration for the Relational Web

Related Work

• Data integration on the Web, often called a "mashup", is an increasingly popular area of work.

• Yahoo Pipes allows the user to graphically describe a flow of data (structured data only).

• CIMPLE is a data integration system for the Web, designed to construct community websites.

Page 52: Data Integration for the Relational Web

CONCLUSIONS


Page 53: Data Integration for the Relational Web

Conclusions

• OCTOPUS allows the user to integrate data from many unstructured data sources.

• It offers access to orders of magnitude more data sources and frees the user from having to design or even know about a mediated schema.

Page 54: Data Integration for the Relational Web

Questions

