© 2010 IBM Corporation WikiAnalytics Andrey Balmin (IBM Almaden), Emiran Curtmola (UC San Diego)


Data: English Wikipedia Infoboxes

{{Infobox officeholder
|name = Arnold Schwarzenegger
|nick = Governator
|order = [[List of Governors of California|38th]]
|office = Governor of California
|term_start = November 17, 2003
|birth_place = [[Thal, Austria]]
|religion = Roman Catholic
|net worth = $100–$200 [[million]] USD
}}

{{Infobox Governor
|name=Edmund Gerald Brown, Jr.
|office=California Attorney General
|order3=[[List of Governors of California|34th]]
|office3=Governor of California
|term_start3=January 6, 1975
|religion=[[Catholic Church|Roman Catholic]]
}}

Main cluster

Structural outliers

Wikipedia is Sparse

WikiInfoboxes: ~50k distinct fields

Universal Table logical abstraction (axes: Documents, i.e. infoboxes, and Fields)

[Figure: number of fields (distinct type, attribute) that occur at least X times in Wikipedia; X axis on a log scale: 1, 10, 100, 1000, ...]

Almost 20,000 fields occur only once

Only 300 fields occur over 4,000 times in Wikipedia
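The sparsity statistics above can be reproduced in miniature. The following is a minimal sketch, assuming infoboxes arrive as flat `{{Infobox type|field = value|...}}` strings; `split_top_level`, `parse_infobox` and `fields_at_least` are hypothetical helper names, and the two inputs are shortened versions of the infoboxes on the data slide. It treats a field as a (type, attribute) pair, as the figure does, and counts how many fields occur at least X times.

```python
from collections import Counter

def split_top_level(body):
    """Split on '|' but not inside [[...]] wiki links."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two == "[[":
            depth += 1
            current.append(two)
            i += 2
        elif two == "]]":
            depth -= 1
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts

def parse_infobox(text):
    """Return (type, {attribute: value}) for one infobox string."""
    body = text.strip()[2:-2]                      # drop the outer {{ and }}
    head, *pairs = split_top_level(body)
    doc_type = head.strip()[len("Infobox"):].strip().lower()
    fields = {}
    for pair in pairs:
        name, _, value = pair.partition("=")
        fields[name.strip()] = value.strip()
    return doc_type, fields

def fields_at_least(counts, x):
    """Number of (type, attribute) fields that occur at least x times."""
    return sum(1 for c in counts.values() if c >= x)

INFOBOX_A = ("{{Infobox officeholder|name = Arnold Schwarzenegger"
             "|order = [[List of Governors of California|38th]]"
             "|office = Governor of California|religion = Roman Catholic}}")
INFOBOX_B = ("{{Infobox Governor|name=Edmund Gerald Brown, Jr."
             "|office=California Attorney General"
             "|order3=[[List of Governors of California|34th]]"
             "|office3=Governor of California"
             "|religion=[[Catholic Church|Roman Catholic]]}}")

counts = Counter()
for text in (INFOBOX_A, INFOBOX_B):
    doc_type, fields = parse_infobox(text)
    for attribute in fields:
        counts[(doc_type, attribute)] += 1

print(fields_at_least(counts, 1), fields_at_least(counts, 2))
```

Even with two infoboxes about governors of the same state, no (type, attribute) field is shared, which is the effect the figure shows at Wikipedia scale.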

Sparse Data

Data is produced by humans and for humans

– No pressing need for schema consistency

– Domain examples:
• Healthcare patient records
• Electronic forms
• Product catalogs

How does one query such data?

WikiAnalytics Approach

Start with a simple keyword search interface

– E.g., “California Governor Religion!”

– Returns a superset of the result (123 infoboxes)

Cluster the results based on where the keywords are

– “California” in office field vs. “California” in birth_place

Present the user with hierarchies of clusters

– Let them accept/reject clusters and features

– Not unlike faceted search, but facets are not just a property of the data; they also depend on the query

Clustering Features

A feature corresponds to

– The mapping of keyword k on field f in the corpus

– It defines a dynamic dimension on the corpus

3 kinds of features

– type: the keyword occurs inside documents of type “type”
• E.g., F1: type = Governor

– field-keyword: the keyword occurs in a field
• E.g., F2: “California” in office vs. “California” in birth_place

– field value: the field has a particular value that contains the keyword
• E.g., F3: office = “Governor of California” vs. office = “Governor of Baja California”
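The three kinds of features can be sketched as follows. This is one reading of the definitions above, not the system's actual code: `compute_features` is a hypothetical name, documents are modeled as plain dictionaries, matching is a case-insensitive substring test, and the three rows are toy data. Each feature maps to the set of document ids it covers.

```python
from collections import defaultdict

def compute_features(docs, keywords):
    """Map each feature to the set of ids of the documents it covers."""
    features = defaultdict(set)
    for doc in docs:
        for field, value in doc["fields"].items():
            for kw in keywords:
                if kw.lower() in value.lower():
                    # type feature: the keyword occurs inside a document of this type
                    features[("type", doc["type"])].add(doc["id"])
                    # field-keyword feature: the keyword occurs in this field
                    features[("field-keyword", kw, field)].add(doc["id"])
                    # field-value feature: this exact value contains the keyword
                    features[("field-value", field, value)].add(doc["id"])
    return dict(features)

# Toy rows for illustration only.
docs = [
    {"id": 1, "type": "governor",
     "fields": {"office": "Governor of California", "birth_place": "Thal, Austria"}},
    {"id": 2, "type": "governor",
     "fields": {"office": "Governor of Baja California"}},
    {"id": 3, "type": "president",
     "fields": {"office": "President of the United States",
                "birth_place": "Yorba Linda, California"}},
]

features = compute_features(docs, ["California"])
for feature, ids in sorted(features.items()):
    print(feature, sorted(ids))
```

Because every feature is derived from a keyword hit, the set of features changes with the query, which is what makes the dimensions dynamic.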

Clustering on Hundreds of Features

For a typical result, there are more features than rows

– No way to tell which ones are relevant

– Many overlapping hierarchies

Our solution: produce all possible clusterings

– Pack all hierarchies into a lattice structure

– Heuristically filter clusters to display

• Don’t show clusters that are smaller than a user-controlled “minimum support” parameter

Universal Navigational Lattice (UNL)

[Figure: example lattice over Wiki Infoboxes. Top-level nodes: type=governor (F1), type=president, type=judge. Below type=governor: California office (F1, F2) and governor office (F1, F3). Below both: F1, F2, F3.]

User Interface

System Architecture

[Figure: architecture diagram with the components Storage & Indexing, Query Processing, BackEnd (Java) and FrontEnd (Flex).]

Storage & Indexing

– Heterogeneous, sparse data: Infoboxes

– Infoboxes form a Universal Table (logically), with Documents and Fields as its axes

– DB2 pureXML® + DB2 Text Search®

Query Processing

– Input: keyword search query, e.g., California governor religion!

– Compute navigational dimensions on the fly: map query keywords to fields

– Universal Navigational Lattice (UNL) in-memory computation

– Enable faceted-like search UI to explore over the lattice

– Interactive selection of final answers

– Generate and publish the output feed

The final feed is useful for

• further processing in community-based mashups

• data cube analytics

• many-eyes analytics and visualization

• help formulating a structural query

Community support by leveraging previously cleaned feeds

[Figure: example lattice over Wiki Infoboxes with nodes type=governor, type=president, type=judge, California office, governor office, and “California, governor office”; e.g., a governor of California is also a former US president (see Ronald Reagan).]

Summary: WikiAnalytics

Heterogeneous, sparse data → WikiAnalytics → Structured data feeds

WikiAnalytics: dynamic interface that lets users navigate, identify and extract all the documents of interest

Structured data feeds are useful for

– Mashups (e.g., Mashup Hub)

– Data analytics

– Formulating a structural query over the heterogeneous data

Future work:

– More heuristics for pruning the lattice

– Collaborative features

Thank you!

Questions?

Wikipedia: The Human Factor

Same schema used differently

– E.g., populate “order” attribute vs. “office” attribute

order=“16th Governor of California” for Washington Bartlett

vs. order=“38th” office=“Governor of California” for Arnold Schwarzenegger

Same data in different schemas

– E.g., the governor’s data in the input of the president’s schema

order2=“33rd Governor of California” for Ronald Reagan

Schema conflicts

• E.g., {{Infobox Officeholder/Personal data ...
• | birth_date = ...
• | date of birth = ...

No universal format for representing field values

Etc.

How to efficiently query such data and derive a complete set of answers?

Existing approaches

Use keyword ranked search + heuristics

– Queries are underspecified: hard to capture user’s intentions

Use query languages (SQL, XQuery/XPath) after data integration

– Strict, complex, expensive and hard to express

Faceted search

– Static dimensions, too restrictive

Data summarization with SEDA [CIDR’09]

– Nice system but too generic

Universal Framework (search space)

Key idea

– Cluster documents D based on the features F corresponding to query Q=(k1,…,kn)

Universal Navigational Lattice (UNL)

– Given D, Q and F, produce all possible groupings of documents by all sets of features

– Connect the groups by the subset relationship of documents

Universal computation framework to express

– Traditional faceted search, OLAP navigation

Computing Features F

Universal Table + Keywords → F = {F1, F2, …, Fn}

Input: keyword search query, e.g., California governor religion!

Focus on the set of documents containing all keywords

F = { type = Governor, “California” in office, office = “Governor of California”, “California” in order, “religion” in religion, “California” in born }

[Figure: Universal Table (logically), with Documents and Fields as its axes.]

UNL Lattice Construction

Bottom-up construction

– Start with groups of documents that match single features Fi in F

– Consider groups of documents by all pairs of features

– Etc.

[Figure: documents 1, 2, 3, 10, 15, 20, 30, 40 stored in DB2; candidate nodes for the feature sets {F1, F2}, {F1, F3}, {F2, F3} and {F1, F2, F3}, with document groups such as {1, 2, 10}, {1, 2, 20} and {1, 2}.]

It is redundant to construct different navigational nodes (with different sets of features) for the same group of documents!

Solution: consolidate all the buckets with the same set of documents

• add edges

• merge features
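The bottom-up construction with consolidation can be sketched as below. This is a minimal illustration under stated assumptions, not the system's algorithm: `build_nodes` is a hypothetical name, the input is a mapping from each feature to its set of documents, and the toy input is only loosely modeled on the figure. The sketch enumerates feature sets level by level, which is exponential in the number of features, and keys nodes by their document set so that buckets with the same documents merge their features.

```python
def build_nodes(feature_docs):
    """Return {frozenset(docs): set(features)} for every non-empty grouping."""
    nodes = {}
    # Level 1: one candidate group per single feature.
    frontier = {frozenset([f]): frozenset(d) for f, d in feature_docs.items()}
    while frontier:
        next_frontier = {}
        for feats, docs in frontier.items():
            if not docs:
                continue  # empty group: skip it, supersets of feats are empty too
            # Consolidate: same document set -> same node, features are merged.
            nodes.setdefault(docs, set()).update(feats)
            # Next level: extend the feature set by one more feature.
            for f, fdocs in feature_docs.items():
                if f not in feats:
                    next_frontier[feats | {f}] = docs & frozenset(fdocs)
        frontier = next_frontier
    return nodes

# Toy input: three features over documents 1, 2, 10, 20.
feature_docs = {"F1": {1, 2, 10}, "F2": {1, 2, 20}, "F3": {1, 2}}
nodes = build_nodes(feature_docs)
for docs, feats in nodes.items():
    print(sorted(docs), sorted(feats))
```

In this toy input the groups for {F3}, {F1, F2}, {F1, F3}, {F2, F3} and {F1, F2, F3} all contain exactly documents 1 and 2, so they collapse into a single node carrying all three features.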

Construction Example

[Figure: step-by-step lattice construction over the document groups {1, 2, 10}, {40, 15, 10}, {20, 2, 3} and {30, 20, 3}; the merged feature sets {F1, F2, F4} and {F1, F4, F5} appear on nodes in later steps.]

Lattice construction rules for a new node n

• n.D = ∅ → do not add n

• n.D already exists → consolidate with existing node

• otherwise, add n and its edges

• triangle rule

Invariant: for an edge n1 → n2, n1.D ⊃ n2.D and n1.F ⊂ n2.F

Conceptually, UNL captures all possible ways to group documents based on where the query keywords hit in the corpus
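To make the edges between lattice nodes concrete, here is a small sketch that connects nodes by the subset relationship of their documents and checks that feature sets grow as document sets shrink. The names `direct_edges` and `check_invariant` and the node table are hypothetical. Dropping edges that are implied by transitivity is a design choice of this sketch; the slides do not spell out the triangle rule, so no claim is made that this matches it.

```python
def direct_edges(nodes):
    """Edges (parent, child) where child.D is a proper subset of parent.D
    and no other node's document set lies strictly between the two."""
    doc_sets = list(nodes)
    edges = []
    for parent in doc_sets:
        for child in doc_sets:
            if child < parent and not any(child < mid < parent for mid in doc_sets):
                edges.append((parent, child))
    return edges

def check_invariant(nodes, edges):
    """Along every edge, the parent's features are a proper subset of the child's."""
    return all(nodes[parent] < nodes[child] for parent, child in edges)

# Toy node table: document set -> merged feature set.
nodes = {
    frozenset({1, 2, 10}): {"F1"},
    frozenset({1, 2, 20}): {"F2"},
    frozenset({1, 2}): {"F1", "F2", "F3"},
    frozenset({1}): {"F1", "F2", "F3", "F4"},
}

edges = direct_edges(nodes)
for parent, child in edges:
    print(sorted(parent), "->", sorted(child))
print(check_invariant(nodes, edges))
```

Here {1, 2, 10} and {1, 2, 20} both point to {1, 2}, which points to {1}; the shortcut from {1, 2, 10} straight to {1} is omitted because {1, 2} sits in between.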

Big Question: GUI

How to display the UNL to facilitate discovery of complete answers?

UNL lends itself to a bottom-up visual representation:

• start from all root nodes traversing towards leaves

Effective Visualization

Problem

– There are still lots of nodes to explore

Challenge

– Need to find both the large cluster(s) of documents and the outliers to find the complete set of answers

Solution: Y-feature filtering

– Impose threshold Y on the features entering the lattice computation

• Keep features representing more than Y documents

– To find the big chunks: choose a bigger Y

• Filter out the less representative features

– To find the smaller chunks: choose a smaller Y
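Y-feature filtering amounts to a threshold on feature support before the lattice is computed. A minimal sketch, assuming features are given as a mapping from a label to its document set; `filter_features` and the toy document sets are illustrative, not taken from the system.

```python
def filter_features(feature_docs, y):
    """Keep only the features that represent more than y documents."""
    return {f: docs for f, docs in feature_docs.items() if len(docs) > y}

# Toy supports: 100, 40, 3 and 1 documents.
feature_docs = {
    "type=governor": set(range(100)),
    "California in office": set(range(40)),
    "California in birth_place": {7, 8, 9},
    "office=Governor of Baja California": {42},
}

big_chunks = filter_features(feature_docs, 10)   # larger Y: only representative features
everything = filter_features(feature_docs, 0)    # smaller Y: outliers are kept too
print(sorted(big_chunks))
print(len(everything))
```

A larger Y yields a smaller lattice dominated by the main clusters; lowering Y brings the rare features, and with them the structural outliers, back into view.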

Effective Visualization (2)

Complementary solution: partition based

– Introduce some order among the computed groups by prioritizing certain sets of features to appear at the top levels of the lattice

[Figure: documents stored in DB2 are split into blocks, each with its own lattice: Block1 features (e.g., type = Governor) → UNL1; Block2 features (e.g., type = President) → UNL2; Block3 features (e.g., type = Judge) → UNL3.]
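One way to read the partition-based approach is to pick a few prioritized block features, restrict every other feature to the documents of each block, and then compute one lattice per block. The sketch below shows only the partitioning step; `partition_features`, the feature labels and the document ids are assumptions made for illustration.

```python
def partition_features(feature_docs, block_features):
    """For each block feature, restrict the remaining features to that block's documents."""
    partitions = {}
    for block in block_features:
        block_docs = feature_docs[block]
        partitions[block] = {
            f: docs & block_docs
            for f, docs in feature_docs.items()
            if f != block and docs & block_docs   # drop features with no hit in this block
        }
    return partitions

# Toy features over documents 1..5.
feature_docs = {
    "type=governor": {1, 2, 3},
    "type=president": {4, 5},
    "California in office": {1, 2, 4},
    "California in birth_place": {3, 5},
}

partitions = partition_features(feature_docs, ["type=governor", "type=president"])
for block, feats in partitions.items():
    print(block, {f: sorted(d) for f, d in feats.items()})
```

Each entry of `partitions` would then feed its own lattice computation, so the block feature sits at the top level and the remaining features refine it.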

Demo Scenario

Example 1:

Feed0 = California governor religion!

Example 2: Find the number of released jazz albums in the world per country

Generate the following feeds with WikiAnalytics

– Feed1 = jazz album artist! released!

– Feed2 = jazz artist origin!

Feed them to Mashup Hub

[Figure: feed schemas. Feed1 columns: album, artist, release date. Feed2 columns: artist, place of origin.]
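The second example combines two feeds: albums with their artists, and artists with their place of origin. The following plain-Python stand-in for the mashup step joins the feeds on artist and counts albums per country. It does not use any Mashup Hub API; the function name, the column keys and all rows are placeholders invented for illustration.

```python
from collections import Counter

def albums_per_country(albums, artists):
    """Join the album feed with the artist feed on 'artist', count albums per origin."""
    origin_of = {row["artist"]: row["origin"] for row in artists}
    return Counter(
        origin_of[row["artist"]]
        for row in albums
        if row["artist"] in origin_of   # albums whose artist has no known origin are skipped
    )

# Placeholder rows standing in for Feed1 (album, artist, release date).
feed1 = [
    {"album": "Album A", "artist": "Artist X", "releasedate": "1959"},
    {"album": "Album B", "artist": "Artist X", "releasedate": "1961"},
    {"album": "Album C", "artist": "Artist Y", "releasedate": "1970"},
    {"album": "Album D", "artist": "Artist Z", "releasedate": "1982"},
]
# Placeholder rows standing in for Feed2 (artist, place of origin).
feed2 = [
    {"artist": "Artist X", "origin": "Country P"},
    {"artist": "Artist Y", "origin": "Country Q"},
]

per_country = albums_per_country(feed1, feed2)
print(per_country)
```

Albums whose artist is missing from the second feed are silently dropped here; a real mashup would need to decide whether to report them as unknown instead.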

User Interface