Date post: 26-Mar-2015
Querying for relations from the semi-structured Web

Sunita Sarawagi

IIT Bombay

http://www.cse.iitb.ac.in/~sunita

Contributors: Rahul Gupta, Girija Limaye, Prashant Borole, Rakesh Pimplikar, Aditya Somani

Web Search

• Mainstream web search
  – User → keyword queries
  – Search engine → ranked list of documents
  – 15 glorious years of squeezing all of the user's search needs into this least common denominator
• Structured web search
  – User → natural language queries / structured queries
  – Search engine → point answers, record sets
  – Many challenges in understanding both query and content
  – 15 years of slow but steady progress

The Quest for Structure

• Vertical structured search engines
  – Structure: domain-specific schema
• Shopping: Shopbot (Etzioni+ 1997)
  – Product name, manufacturer, price
• Publications: Citeseer (Lawrence, Giles+ 1998)
  – Paper title, author name, email, conference, year
• Jobs: Flipdog, Whizbang Labs (Mitchell+ 2000)
  – Company name, job title, location, requirement
• People: DBLife (Doan 07)
  – Name, affiliations, committees served, talks delivered
• Triggered much research on extraction and IR-style search of structured data (BANKS '02).

Horizontal Structured Search

• Domain-independent structure
• Small, generic set of structured primitives over entities, types, relationships, and properties
  – <Entity> IsA <Type>: Mysore is a city
  – <Entity> Has <Property>: <City> average rainfall <Value>
  – <Entity1> <related-to> <Entity2>: <Person> born-in <City>, <Person> CEO-of <Company>
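One way to picture these primitives is as subject-predicate-object triples with wildcard lookup. The sketch below is only an illustration: the `Triple` type, the `FACTS` list, and the `match` helper are hypothetical names, and the facts are placeholders rather than data from the talk.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Illustrative facts only; "<Value>" and "<Person>" are left as placeholders.
FACTS = [
    Triple("Mysore", "IsA", "City"),
    Triple("Mysore", "average-rainfall", "<Value>"),
    Triple("<Person>", "born-in", "Mysore"),
]


def match(facts: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the facts that agree with every field that is not None.

    None acts as a wildcard, so match(FACTS, predicate="IsA") lists all
    IsA statements.
    """
    return [
        f for f in facts
        if (subject is None or f.subject == subject)
        and (predicate is None or f.predicate == predicate)
        and (obj is None or f.obj == obj)
    ]
```

A call such as `match(FACTS, subject="Mysore")` returns both the type statement and the property statement for that entity.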

Types of Structured Search

• Web + People → structured databases (ontologies)
  – Created manually (Psyche) or semi-automatically (Yago)
  – True Knowledge (2009), Wolfram Alpha (2009)
• Web annotated with structured elements
  – Queries: keywords + structured annotations
  – Example: <Physicist> +cosmos
  – Open-domain structure extraction and annotation of web docs (2005 onward)

Users, Ontologies, and the Web

• Users are from Venus
  – Bi-syllabic, impatient, believe in mind-reading
• Ontologies are from Mars
  – One structure to fit all
• Web content creators are from some other galaxy
  – Ontologies=
  – Let search engines bring the users

What is missed in Ontologies

• The trivial, the transient, and the textual
• Procedural knowledge
  – What do I do on an error?
• Huge body of invaluable text of various types
  – Reviews, literature, commentaries, videos
• Context
  – By stripping knowledge to its skeletal form, context that is so valuable for search is lost.
• As long as queries are unstructured, the redundancy and variety in unstructured sources are invaluable.

Structured annotations in HTML

• Is-A annotations
  – KnowItAll (2004)
• Open-domain relationships
  – TextRunner (Banko 2007)
• Ontological annotations
  – SemTag and Seeker (2003)
  – Wikipedia annotations (Wikify! 2007, CSAW 2009)
• All view documents as a sequence of tokens
• Challenging to ensure high accuracy


WWT: Table queries over the semi-structured web

Queries in WWT

• Query by content (a few example rows)

  Alan Turing | Turing Machine
  E. F. Codd | Relational Databases

  Desh | Late night
  Bhairavi | Morning
  Patdeep | Afternoon

• Query by description (column descriptions)

  Inventor | Computer science concept | Year

  Indian states Airport City

Answer: Table with ranked rows

Person | Concept/Invention
Alan Turing | Turing Machine
Seymour Cray | Supercomputer
E. F. Codd | Relational Databases
Tim Berners-Lee | WWW
Charles Babbage | Babbage Engine

Keyword search to find structured records

• Query: Computer science concept inventor year
• Verbose articles, not structured tables
• The only document with an unstructured list of some desired records
• Desired records spread across many documents
• Correct answer is not one click away.


The only list in one of the retrieved pages


Highly relevant Wikipedia table not retrieved in the top-k

Ideal answer should be integrated from these incomplete sources

Attempt 2: Include samples in query

• Known examples added to the query: alan turing machine codd relational database
• Documents relevant only to the keywords
• Ideal answer still spread across many documents

WWT Architecture

[Figure: WWT architecture diagram. Component labels: User, Query Table, Query Builder, Keyword Query, Index, Content+context index, Sources L1,…,Lk, Type Inference, Typesystem Hierarchy, Resolver builder, Resolver, Cell resolver, Row resolver, Statistics, Extractor, Record labeler, CRF models, Tables T1,…,Tk, Consolidator, Consolidated Table, Ranker, Row and cell scores, Final consolidated table. Labels in the part marked Offline: Web, Extract record sources, Annotate, Ontology, Store.]

Offline: Annotating to an Ontology

• Annotate table cells with entity nodes and table columns with type nodes

[Figure: a table annotated against an ontology fragment. Node labels: All, People, Entertainers, Indian_directors, movies, Indian_films, English_films, 2008_films, Terrorism_films, A_Wednesday, Black&White, Coffee_house (film), Coffee_house (Loc), Wednesday.]

Challenges

• Ambiguity of entity names
  – "Coffee house" is both a movie name and a place name
• Noisy mentions of entity names
  – Black&White versus Black and White
• Multiple labels
  – The Yago ontology has an average of 2.2 types per entity
• Missing type links in the ontology, so the least common ancestor cannot be used
  – Missing link: Black&White to 2008_films
  – Not a missing link: 1920 to Terrorism_films
• Scale: Yago has 1.9 million entities and 200,000 types

A unified approach

• Graphical model to jointly label cells and columns so as to maximize the sum of scores on
  – y_cj = entity label of cell c of column j
  – y_j = type label of column j
• Score(y_cj): string similarity between c and y_cj
• Score(y_j): string similarity between the header of column j and y_j
• Score(y_j, y_cj)
  – Subsumed entity: inversely proportional to the distance between them
  – Outside entity: fraction of overlapping entities between y_j and the immediate parent of y_cj
  – Handles missing links: the overlap of 2008_movies with 2007_movies is zero, but with Indian movies it is non-zero.

[Figure: ontology fragment with type nodes movies, Indian_films, English_films, Terrorism_films. The column label y_j has subsumed entities y_1j and y_2j and an outside entity y_3j.]
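The three score terms can be made concrete on a toy ontology. The sketch below is not the authors' model: it replaces graphical-model inference with brute-force enumeration, uses `difflib` ratios as the string similarity, takes "distance" to be the number of edges from the entity up to the type, and reads "fraction of overlapping entities" as a Jaccard ratio. The ontology contents and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy ontology: type -> parent type, type -> directly attached entities.
PARENT = {"Indian_films": "movies", "English_films": "movies", "movies": None}
MEMBERS = {
    "Indian_films": {"A_Wednesday", "Black&White"},
    "English_films": {"Coffee_house_(film)"},
    "movies": set(),
}


def sim(a, b):
    """String similarity in [0, 1]; underscores are treated as spaces."""
    norm = lambda s: s.lower().replace("_", " ")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def entities_under(t):
    """All entities attached to type t or to any of its descendants."""
    found = set(MEMBERS[t])
    for child, parent in PARENT.items():
        if parent == t:
            found |= entities_under(child)
    return found


def home_type(entity):
    """The type an entity is directly attached to (its immediate parent)."""
    return next(t for t, members in MEMBERS.items() if entity in members)


def compat(col_type, entity):
    """Score(y_j, y_cj): 1/distance if subsumed, else an overlap fraction."""
    t, dist = home_type(entity), 1
    while t is not None:
        if t == col_type:
            return 1.0 / dist          # subsumed entity
        t, dist = PARENT[t], dist + 1
    a, b = entities_under(col_type), entities_under(home_type(entity))
    return len(a & b) / len(a | b) if a | b else 0.0   # outside entity


def best_labeling(header, cells):
    """Enumerate every (column type, cell entities) labeling and keep the best."""
    entities = sorted(e for members in MEMBERS.values() for e in members)
    best = None
    for col_type in PARENT:
        for labels in product(entities, repeat=len(cells)):
            score = sim(header, col_type) + sum(
                sim(c, e) + compat(col_type, e) for c, e in zip(cells, labels))
            if best is None or score > best[0]:
                best = (score, col_type, labels)
    return best
```

Calling `best_labeling("Indian films", ["A Wednesday", "Black and White"])` picks the column type and the two cell entities together, so a noisy mention such as "Black and White" is pulled toward the entity that also agrees with the column type. Enumeration is only feasible at toy scale; at the scale quoted for Yago an approximate inference method would be needed.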

WWT Architecture

[Figure: WWT architecture diagram, repeated.]

Extraction: Content queries

• Extracting query columns from list records
• Lists are often human generated.

Query Q:
Cornell University | Ithaca
State University of New York | Stony Brook
New York University | New York

A source L_i:
New York University (NYU), New York City, founded in 1831.
Columbia University, founded in 1754 as King's College.
Binghamton University, Binghamton, established in 1946.
State University of New York, Stony Brook, New York, founded in 1957
Syracuse University, Syracuse, New York, established in 1870
State University of New York, Buffalo, established in 1846
Rensselaer Polytechnic Institute (RPI) at Troy.

Extraction

• Rule-based extractor insufficient.
• Statistical extractor needs training data.
  – Generating that is also not easy!

[Figure: the same source list and query Q, with the extracted table columns highlighted in the list.]

Extraction: Labeled data generation

• Lists are unlabeled. Labeled records are needed to train a CRF.
• A fast but naïve approach for generating labeled records

Query about colleges in NY:
New York University | New York
Monroe College | Brighton
State University of New York | Stony Brook

Fragment of a relevant list source:
New York Univ. in NYC
Columbia University in NYC
Monroe Community College in Brighton
State University of New York in Stony Brook, New York.

Extraction: Labeled data generation

• A fast but naïve approach
  – In the list, look for matches of every query cell.

[Figure: the same query table and list fragment, with callouts "Another match for New York University" and "Another match for New York".]

Extraction: Labeled data generation

• A fast but naïve approach
  – In the list, look for matches of every query cell.
  – Greedily map each query row to the best match in the list.

[Figure: the same query table and list fragment, with two mappings numbered 1 and 2.]
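The naïve procedure can be sketched as exact substring matching followed by a greedy, row-by-row assignment. This is an illustrative reading of the slide, not the system's code; the function name and the choice to count matched cells as the "best match" criterion are assumptions.

```python
def hard_greedy_labels(query_rows, list_lines):
    """Map each query row to one list line using exact substring hits.

    Returns {query row index: (list line index, matched column indices)}.
    Rows with no exact hit on any unused line stay unmapped.
    """
    mapping, used = {}, set()
    for q_idx, row in enumerate(query_rows):
        best = None  # (hit count, line index, matched column indices)
        for l_idx, line in enumerate(list_lines):
            if l_idx in used:
                continue
            cols = [c for c, cell in enumerate(row) if cell in line]
            if cols and (best is None or len(cols) > best[0]):
                best = (len(cols), l_idx, cols)
        if best is not None:
            used.add(best[1])          # greedy: a line is never reconsidered
            mapping[q_idx] = (best[1], best[2])
    return mapping


QUERY = [
    ("New York University", "New York"),
    ("Monroe College", "Brighton"),
    ("State University of New York", "Stony Brook"),
]
LINES = [
    "New York Univ. in NYC",
    "Columbia University in NYC",
    "Monroe Community College in Brighton",
    "State University of New York in Stony Brook, New York.",
]
```

On this data `hard_greedy_labels(QUERY, LINES)` finds both columns only for the third query row. For the first row the only exact hit is the city string "New York", which actually sits inside the school name "New York Univ.", and the school column gets no label at all because "Univ." is not recognized as "University".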

Extraction: Labeled data generation

• A fast but naïve approach: problems
  – Hard matching criteria have significantly low recall: segments are missed, and natural clues like Univ = University are not used.
  – Greedy matching can lead to really bad mappings.

[Figure: the same query table and list fragment with mappings 1 and 2, and callouts "Unmapped (hurts recall)", "Wrongly Mapped", and "Assumed as 'Other'".]

Generating labeled data: Soft approach

[Figure: the same query table and list fragment as on the preceding labeled-data slides.]

Generating labeled data: Soft approach

• Match score for each query row and source row
  – Score of the best segmentation of the source row into query columns
  – Score of a segment s for column c: probability that cell c of the query row is the same as segment s
  – Computed by the Resolver module based on the type of the column

[Figure: the same query table and list fragment, with match scores 0.9, 0.3, and 1.8 on the edges.]
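A match score of this kind can be computed with a small dynamic program over the tokens of the source row. The sketch below rests on several assumptions that the slides do not state: query columns appear in the source row in left-to-right order, each column gets at most one contiguous segment, unassigned tokens count as 'Other' with score zero, and a `difflib` ratio stands in for the Resolver's type-specific probability. All names are hypothetical.

```python
from difflib import SequenceMatcher


def cell_prob(query_cell, segment):
    """Stand-in for the Resolver: similarity of a query cell and a segment."""
    return SequenceMatcher(None, query_cell.lower(), segment.lower()).ratio()


def best_segmentation_score(query_row, source_row, max_len=6):
    """Best total score of assigning token segments to query columns.

    dp[i][k] = best score after consuming i tokens and deciding columns 0..k-1.
    """
    tokens = source_row.split()
    n, m = len(tokens), len(query_row)
    neg = float("-inf")
    dp = [[neg] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for k in range(m + 1):
            if dp[i][k] == neg:
                continue
            if i < n:  # token i is labeled 'Other'
                dp[i + 1][k] = max(dp[i + 1][k], dp[i][k])
            if k < m:
                # column k is missing from this source row
                dp[i][k + 1] = max(dp[i][k + 1], dp[i][k])
                # column k takes tokens[i:j]
                for j in range(i + 1, min(n, i + max_len) + 1):
                    seg = " ".join(tokens[i:j])
                    cand = dp[i][k] + cell_prob(query_row[k], seg)
                    dp[j][k + 1] = max(dp[j][k + 1], cand)
    return dp[n][m]
```

With two query columns the score lies between 0 and 2, which is consistent with the range of the edge scores shown on the slides. An exact two-column match such as `best_segmentation_score(("State University of New York", "Stony Brook"), "State University of New York in Stony Brook")` reaches the maximum.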

Generating labeled data: Soft approach

[Figure: the same slide as before, now with match scores 2.0, 0.7, and 0.3 on the edges.]

Generating labeled data: Soft approach

• Compute the maximum weight matching
  – Better than greedily choosing the best match for each row
• Soft string matching increases the labeled candidates significantly
  – Vastly improves recall and leads to better extraction models.

[Figure: bipartite graph between query rows and list rows with edge weights 0.9, 0.3, 1.8, 1.8, 0.7, 0.3, and 2. Greedy matching in red.]
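The difference between greedy and maximum weight matching shows up even on a three-by-four weight matrix. The weights below are illustrative: they loosely echo the numbers on the slide, but which edge carries which weight is an assumption. The exhaustive search over permutations is a stand-in chosen for brevity; a real system would use a polynomial-time assignment algorithm.

```python
from itertools import permutations


def greedy_matching(w):
    """Each query row, in order, grabs its best still-unused source row."""
    used, total, pairs = set(), 0.0, []
    for q, row in enumerate(w):
        free = [s for s in range(len(row)) if s not in used and row[s] > 0]
        if not free:
            continue  # query row stays unmapped
        s = max(free, key=lambda c: row[c])
        used.add(s)
        pairs.append((q, s))
        total += row[s]
    return total, pairs


def max_weight_matching(w):
    """Exhaustive maximum weight matching; assumes len(w) <= len(w[0])."""
    best_total, best_pairs = 0.0, []
    for perm in permutations(range(len(w[0])), len(w)):
        total = sum(w[q][s] for q, s in enumerate(perm))
        if total > best_total:
            best_total = total
            best_pairs = [(q, s) for q, s in enumerate(perm) if w[q][s] > 0]
    return best_total, best_pairs


# Rows: query rows (NYU, Monroe College, SUNY Stony Brook).
# Columns: the four list lines. Hypothetical soft match scores.
WEIGHTS = [
    [0.9, 0.3, 0.0, 1.8],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 2.0],
]
```

Here the greedy pass lets the first query row take the fourth list line, which leaves the third query row unmapped, while the exhaustive search assigns every query row and achieves a higher total weight.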

Extractor

• Use a CRF on the generated labeled data
• Feature set
  – Delimiters, HTML tokens in a window around labeled segments
  – Alignment features
• Collective training of multiple sources
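Window features of this kind are usually expressed as one feature dictionary per token. The sketch below shows only the delimiter and HTML-token features; alignment features and collective training are not covered. The feature names, the delimiter set, and the tag pattern are assumptions made for illustration, and no particular CRF library is implied.

```python
import re

DELIMS = set(",.;:()-|")                    # assumed delimiter inventory
HTML_TAG = re.compile(r"</?\w+[^>]*>")      # crude pattern for a single tag


def token_features(tokens, i, window=2):
    """Features for tokens[i] plus delimiter/HTML flags of its neighbours."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_delim": tok in DELIMS,
        "is_html": bool(HTML_TAG.fullmatch(tok)),
        "is_digit": tok.isdigit(),
    }
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or not 0 <= j < len(tokens):
            continue
        neighbour = tokens[j]
        # Keys look like "-1:is_delim" or "+2:is_html".
        feats[f"{off:+d}:is_delim"] = neighbour in DELIMS
        feats[f"{off:+d}:is_html"] = bool(HTML_TAG.fullmatch(neighbour))
    return feats
```

For the token sequence `["<li>", "Columbia", "University", ",", "founded", "in", "1754", "</li>"]`, the features of "University" record that a delimiter follows it and that a list-item tag sits two tokens to its left, which is the kind of boundary evidence a sequence labeler can learn from.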

Experiments

• Aim: reconstruct Wikipedia tables from only a few sample rows.
• Sample queries
  – TV series: character name, actor name, season
  – Oil spills: tanker, region, time
  – Golden Globe Awards: actor, movie, year
  – Dadasaheb Phalke Awards: person, year
  – Parrots: common name, scientific name, family

Experiments: Dataset

• Corpus: 16M lists from 500M pages from a web crawl.
  – 45% of the lists retrieved by an index probe are irrelevant.
• Query workload
  – 65 queries. Ground truth hand-labeled by 10 users over 1300 lists.
  – 27% of queries are not answerable with one list (difficult).
  – True consolidated table = 75% of the Wikipedia table, plus 25% new rows not present in Wikipedia.


Extraction performance

Benefits of soft training data generation, alignment features, staged-extraction on F1 score.

More than 80% F1 accuracy with just three query records


Extraction: Description queries

Example table:
Lithium | 3
Sodium | 11
Beryllium | 4

• Non-informative headers
• No headers

Context to get at relevant tables

• Ontological annotations
• Context is the union of
  – Text around tables
  – Headers
  – Ontology labels when present

[Figure: the table (Lithium 3, Sodium 11, Beryllium 4) annotated against an ontology fragment. Node labels: All, People, Chemical_elements, Metals, Non_Metals, Alkali, Non alkali, Gas, Non-gas, Aluminium, Lithium, Sodium, Hydrogen, Carbon. Annotation labels shown on the table: Alkali, Chemical element.]

Joint labeling of table columns

• Given
  – Candidate tables: T_1, T_2, ..., T_n
  – Query columns: q_1, q_2, ..., q_m
• Task: label the columns of each T_i with {q_1, q_2, …, q_m, } to maximize the sum of these scores
  – Score(T, j, q_k) = ontology type match + header string match with q_k
  – Score(T, *, q_k) = match of the description of T with q_k
  – Score(T, j, T', j', q_k) = content overlap of column j of table T with column j' of table T' when both are labeled q_k
• Inference in the graphical model is solved via belief propagation.
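Written as a single objective, the three score types might combine as follows. This is a reconstruction under assumptions, not a formula from the slides: $\ell_{T,j}$ denotes the label chosen for column $j$ of table $T$, scores for a column left unlabeled are taken to be zero, the table-level term is counted once per labeled column, and all terms are weighted equally.

```latex
\[
\max_{\ell}\;
\sum_{T}\sum_{j}
  \Big[\mathrm{Score}(T, j, \ell_{T,j}) + \mathrm{Score}(T, *, \ell_{T,j})\Big]
\;+\;
\sum_{(T,j) \neq (T',j')}\;\sum_{k=1}^{m}
  \mathbf{1}\big[\ell_{T,j} = \ell_{T',j'} = q_k\big]\,
  \mathrm{Score}(T, j, T', j', q_k)
\]
```

The first sum is local to each column, while the indicator in the second sum couples columns of different tables that receive the same query label. That coupling is what makes the labeling joint and motivates message-passing inference rather than labeling each table independently.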

WWT Architecture

[Figure: WWT architecture diagram, repeated.]

Step 3: Consolidation

• Merging the extracted tables into one
• Merging duplicates

Extracted table 1:
Cornell University | Ithaca
State University of New York | Stony Brook
New York University | New York City
Binghamton University | Binghamton

+ Extracted table 2:
SUNY | Stony Brook
New York University (NYU) | New York
RPI | Troy
Columbia University | New York
Syracuse University | Syracuse

= Consolidated table:
Cornell University | Ithaca
State University of New York OR SUNY | Stony Brook
New York University OR New York University (NYU) | New York City OR New York
Binghamton University | Binghamton
RPI | Troy
Columbia University | New York
Syracuse University | Syracuse
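The merge itself can be sketched as follows, with the hard part (deciding whether two cells refer to the same thing) hidden behind a `same_cell` function. Here that function is a hand-written alias lookup, which is only a crude stand-in for a learned resolver; the alias table and all names are hypothetical.

```python
# Hypothetical alias table standing in for a real cell resolver.
ALIASES = {
    "suny": "state university of new york",
    "new york university (nyu)": "new york university",
    "new york city": "new york",
}


def same_cell(a, b):
    """Two cells match if they share a canonical form."""
    canon = lambda s: ALIASES.get(s.lower(), s.lower())
    return canon(a) == canon(b)


def consolidate(tables, same_cell):
    """Merge rows from several tables; each merged cell keeps all variants."""
    merged = []  # every merged row is a list of sets of cell strings
    for table in tables:
        for row in table:
            for m in merged:
                if all(any(same_cell(c, v) for v in variants)
                       for c, variants in zip(row, m)):
                    for c, variants in zip(row, m):
                        variants.add(c)      # record the new surface form
                    break
            else:
                merged.append([{c} for c in row])
    return merged


TABLE_1 = [
    ("Cornell University", "Ithaca"),
    ("State University of New York", "Stony Brook"),
    ("New York University", "New York City"),
    ("Binghamton University", "Binghamton"),
]
TABLE_2 = [
    ("SUNY", "Stony Brook"),
    ("New York University (NYU)", "New York"),
    ("RPI", "Troy"),
    ("Columbia University", "New York"),
    ("Syracuse University", "Syracuse"),
]
```

`consolidate([TABLE_1, TABLE_2], same_cell)` yields seven rows, with the SUNY and NYU rows carrying both surface forms in the style of the "OR" cells on the slide. This version requires every cell to match, so it does not handle missing columns or extraction errors.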

Consolidation

• Challenge: deciding when two rows are the same in the face of
  – Extraction errors
  – Missing columns
  – Open domain
  – No training data
• Our approach: a specially designed Bayesian network with interpretable and generalizable parameters

Resolver

• Bayesian network
  – P(RowMatch | rows q, r)
  – P(1st cell match | q_1, r_1), …, P(ith cell match | q_i, r_i), …, P(nth cell match | q_n, r_n)
• Cell-level probabilities
  – Parameters automatically set using list statistics
  – Derived from user-supplied type-specific similarity functions

Ranking

• Factors for ranking
  – Relevance: membership in overlapping sources
  – Support from multiple sources
  – Completeness: importance of columns present. Penalize records with only common 'spam' columns like City and State.
  – Correctness: extraction confidence

School | Location | State | Merged Row Confidence | Support
- | - | NY | 0.99 | 9
- | NYC | New York | 0.95 | 7
New York Univ. OR New York University | New York City OR New York | New York | 0.85 | 4
University of Rochester OR Univ. of Rochester, | Rochester | New York | 0.50 | 2
University of Buffalo | Buffalo | New York | 0.70 | 2
Cornell University | Ithaca | New York | 0.76 | 1

Relevance ranking on set membership

• Weighted sum approach
  – Score of a set t: s(t) = fraction of query rows in t
  – Relevance of consolidated row r: Σ_{t : r ∈ t} s(t)
• Graph walk based approach
  – Random walk from rows to table nodes, starting from query rows, with random restarts to query rows

[Figure: graph with three layers of nodes: Tables, Consolidated rows, Query rows.]
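Both relevance measures can be sketched on a small row-table membership structure. The code below is a minimal illustration under assumptions: tables are given as sets of row identifiers, query rows are identified with the consolidated rows they match, every query row occurs in at least one table, and the restart probability and iteration count are arbitrary choices. All names and data are hypothetical.

```python
def weighted_sum_relevance(tables, query_rows):
    """Relevance of row r = sum over tables t containing r of s(t)."""
    scores = {}
    for rows in tables.values():
        s_t = len(rows & query_rows) / len(query_rows)  # fraction of query rows in t
        for r in rows:
            scores[r] = scores.get(r, 0.0) + s_t
    return scores


def random_walk_relevance(tables, query_rows, restart=0.15, iters=100):
    """Random walk with restart on the bipartite row-table graph."""
    adj = {}
    for t, rows in tables.items():
        for r in rows:
            adj.setdefault(("table", t), set()).add(("row", r))
            adj.setdefault(("row", r), set()).add(("table", t))
    # Restart distribution: uniform over the query rows.
    start = {("row", q): 1.0 / len(query_rows) for q in query_rows}
    p = dict(start)
    for _ in range(iters):
        nxt = {node: restart * start.get(node, 0.0) for node in adj}
        for u, mass in p.items():
            for v in adj[u]:
                nxt[v] += (1.0 - restart) * mass / len(adj[u])
        p = nxt
    return {node[1]: prob for node, prob in p.items() if node[0] == "row"}


TABLES = {
    "t1": {"turing", "codd", "cray"},
    "t2": {"turing", "babbage"},
    "t3": {"spam"},
}
QUERY_ROWS = {"turing", "codd"}
```

In both variants a row that only occurs in a table sharing nothing with the query ("spam" here) gets zero relevance, while rows that co-occur with query rows inherit some of their weight.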

Ranking Criteria

• Score(Row r):
  × Graph-relevance of r
  × Importance of the columns C present in r (high if C functionally determines the others)
  × Sum of cell extraction confidences: noisy-OR of the cell extraction confidences from the individual CRFs

School | Location | State | Merged Row Confidence | Support
New York Univ. OR New York University (0.90) | New York City OR New York (0.95) | New York (0.98) | 0.85 | 4
University of Buffalo (0.88) | Buffalo (0.99) | New York (0.99) | 0.70 | 2
Cornell University (0.92) | Ithaca (0.95) | New York (0.99) | 0.76 | 1
University of Rochester OR Univ. of Rochester, (0.80) | Rochester (0.95) | New York (0.99) | 0.50 | 2
- | - | NY (0.99) | 0.99 | 9
- | NYC (0.98) | New York (0.98) | 0.95 | 7
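The noisy-OR combination treats each source's extraction as an independent chance of being right, so a cell is wrong only if every source is wrong. A minimal sketch of the product form follows; how the system actually normalizes or weights the factors is not stated, so the function names and the plain product are assumptions.

```python
from math import prod


def noisy_or(confidences):
    """Probability that at least one independent extraction is correct."""
    return 1.0 - prod(1.0 - c for c in confidences)


def row_score(graph_relevance, column_importance, cell_confidences):
    """Product of the three factors listed for Score(Row r).

    cell_confidences holds one list per cell, with one CRF confidence
    per source that contributed to that merged cell.
    """
    extraction = sum(noisy_or(c) for c in cell_confidences)
    return graph_relevance * column_importance * extraction
```

For example, a cell extracted from two sources with confidence 0.5 each gets a combined confidence of 0.75, so support from several sources raises the score even when no single extraction is very certain.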

Overall performance

• To justify sophisticated consolidation and resolution, compare with:
  – Processing only the magically known single best list, so that no consolidation or resolution is required
  – Simple consolidation, with no merging of approximate duplicates
• WWT has > 55% recall and beats the others. The gain is bigger for difficult queries.

[Figure: results for All Queries and Difficult Queries.]

Running time

• < 30 seconds with 3 query records.
• Can be improved by processing sources in parallel.
• Variance is high because time depends on the number of columns, record length, etc.

Related Work

• Google Squared
  – Developed independently. Launched in May 2009.
  – User provides a keyword query, e.g. "list of Italian joints in Manhattan". Schema inferred.
  – Technical details not public.
• Prior methods for extraction and resolution
  – Assume labeled data or pre-trained parameters
  – We generate labeled data and automatically train resolver parameters from the list source.

Summary

• Structured web search and the role of non-text, partially structured web sources
• WWT system
  – Domain-independent
  – Online: structure interpretation at query time
  – Relies heavily on unsupervised statistical learning:
    – Graphical model for table annotation
    – Soft approach for generating labeled data
    – Collective column labeling for descriptive queries
    – Bayesian network for resolution and consolidation
    – Page rank + confidence from a probabilistic extractor for ranking

What next?

• Designing plans for non-trivial ways of combining sources
• Better ranking and user-interaction models
• Expanding the query set
  – Aggregate queries: tables are rich in quantities
  – Point queries: attribute value and relationship queries
• Interplay between the semi-structured web and ontologies
  – Augmenting one with the other
  – Quantify information in structured sources vis-à-vis text sources on typical query workloads