
© 2010 IBM Corporation WIKIANALYTICS Andrey Balmin (IBM Almaden) Emiran Curtmola (UC San Diego)

Date post: 18-Dec-2015
Transcript
Page 1:

© 2010 IBM Corporation

WIKIANALYTICS

Andrey Balmin (IBM Almaden)

Emiran Curtmola (UC San Diego)

Page 2:

Data: English Wikipedia Infoboxes

{{Infobox officeholder
|name = Arnold Schwarzenegger
|nick = Governator
|order = [[List of Governors of California|38th]]
|office = Governor of California
|term_start = November 17, 2003
|birth_place = [[Thal, Austria]]
|religion = Roman Catholic
|net worth = $100–$200 [[million]] USD
}}

{{Infobox Governor
|name=Edmund Gerald Brown, Jr.
|office=California Attorney General
|order3=[[List of Governors of California|34th]]
|office3=Governor of California
|term_start3=January 6, 1975
|religion=[[Catholic Church|Roman Catholic]]
}}

{{Infobox Governor
|order=16th Governor of California
}}

Figure labels: Main cluster, Structural outliers
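
The infobox markup above can be turned into field/value pairs with a small parser. The following is only a minimal sketch under stated assumptions: it handles one flat template, splits on `|` outside of `[[...]]` and `{{...}}`, and is not a full wikitext parser. The function name `parse_infobox` and the shortened `SAMPLE` string are illustrative and do not come from the slides.

```python
def parse_infobox(text):
    """Parse one flat {{Infobox ...}} template into (infobox_type, fields)."""
    body = text.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("expected a {{...}} template")
    body = body[2:-2]

    # Split on '|' only at nesting depth 0, so that the pipe inside
    # [[target|label]] links does not start a new field.
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth = max(depth - 1, 0)
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    head = parts[0].strip()
    infobox_type = head[len("Infobox"):].strip() if head.startswith("Infobox") else head
    fields = {}
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep:  # ignore positional parameters that have no name
            fields[name.strip()] = value.strip()
    return infobox_type, fields


# Shortened version of the first infobox on this slide (illustrative input).
SAMPLE = (
    "{{Infobox officeholder|name = Arnold Schwarzenegger|nick = Governator"
    "|order = [[List of Governors of California|38th]]"
    "|office = Governor of California|birth_place = [[Thal, Austria]]"
    "|religion = Roman Catholic}}"
)
infobox_type, fields = parse_infobox(SAMPLE)
print(infobox_type, fields["order"])
```

Keeping the link markup in the values is deliberate in this sketch: the same field can hold "38th" in one infobox and "16th Governor of California" in another, so any normalization is left to later steps.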

Page 3:

Wikipedia is Sparse

WikiInfoboxes: ~1.7M infoboxes (documents), ~50k distinct fields

Universal Table logical abstraction (Documents × Fields)

[Chart: number of fields (0 to 60,000) that occur at least X times in Wikipedia, with X on a logarithmic axis; a field is a distinct (type, attribute) pair]

Almost 20,000 fields occur only once

Only 300 fields occur over 4,000 times in Wikipedia
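
The statistic plotted on this slide can be sketched as follows, assuming each infobox is available as a pair of its type and its attribute names. The helper names and the tiny `INFOBOXES` sample (the attribute names of the three infoboxes shown on Page 2) are illustrative; the real corpus has about 1.7M infoboxes.

```python
from collections import Counter


def field_occurrences(infoboxes):
    """Count how many infoboxes use each distinct (type, attribute) field."""
    counts = Counter()
    for infobox_type, attributes in infoboxes:
        for attribute in set(attributes):
            counts[(infobox_type, attribute)] += 1
    return counts


def fields_with_at_least(counts, x):
    """Number of distinct fields that occur at least x times."""
    return sum(1 for c in counts.values() if c >= x)


# Attribute names of the three infoboxes from Page 2 (toy sample).
INFOBOXES = [
    ("officeholder", ["name", "nick", "order", "office", "term_start",
                      "birth_place", "religion", "net worth"]),
    ("Governor", ["name", "office", "order3", "office3", "term_start3",
                  "religion"]),
    ("Governor", ["order"]),
]

counts = field_occurrences(INFOBOXES)
for x in (1, 2):
    print(x, fields_with_at_least(counts, x))
```

Even in this three-document sample every (type, attribute) field occurs exactly once, which is the sparsity the chart shows at corpus scale.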

Page 4:

Sparse Data

Data is produced by humans and for humans

– No pressing need for schema consistency

– Domain examples:
  • Healthcare patient records
  • Electronic forms
  • Product catalogs

How does one query such data?

Page 5:

WikiAnalytics Approach

Start with simple keyword search interface

– E.g. “California Governor Religion!”

– Returns a superset of the result (123 infoboxes)

Cluster the results based on where the keywords are

– “California” in office field vs. “California” in birth_place

Present the user with hierarchies of clusters

– Let them accept/reject clusters and features

– Not unlike faceted search, but facets are not a property of the data – they also depend on the query

Page 6:

Clustering Features

A feature corresponds to

– The mapping of keyword k on field f in the corpus

– It defines a dynamic dimension on the corpus

3 kinds of features

– type: the keyword occurs inside documents of type “type”
  • E.g., F1: type = Governor

– field-keyword: the keyword occurs in a field
  • E.g., F2: “California” in office vs. “California” in birth_place

– field value: the field has a particular value that contains the keyword
  • E.g., F3: office = “Governor of California” vs. office = “Governor of Baja California”
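
The three kinds of features can be sketched for a single document as below. This is one reading of the slide, not the authors' implementation: matching is assumed to be a case-insensitive substring test, the tuple encoding of features is invented for illustration, and the trailing `!` used in the example queries is not modelled.

```python
def extract_features(doc_type, fields, keywords):
    """Return the set of features that the keywords induce on one document.

    doc_type : infobox type, e.g. "Governor"
    fields   : dict mapping field name to field value
    keywords : list of query keywords
    """
    features = set()
    for keyword in keywords:
        needle = keyword.lower()
        for name, value in fields.items():
            if needle in value.lower():
                # type feature: the keyword occurs inside a document of this type
                features.add(("type", doc_type))
                # field-keyword feature: the keyword occurs in this field
                features.add(("field-keyword", keyword, name))
                # field-value feature: the field has this value containing the keyword
                features.add(("field-value", name, value))
    return features


doc_fields = {"office": "Governor of California", "birth_place": "Thal, Austria"}
features = extract_features("Governor", doc_fields, ["California"])
for feature in sorted(features):
    print(feature)
```

Because features are derived from where the query keywords hit, a different query over the same document yields a different feature set, which is what makes the dimensions dynamic.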

Page 7:

Clustering on Hundreds of Features

For a typical result, there are more features than rows

– No way to tell which ones are relevant

– Many overlapping hierarchies

Our solution: produce all possible clusterings

– Pack all hierarchies into a lattice structure

– Heuristically filter clusters to display
  • Don’t show clusters that are smaller than a user-controlled “minimum support” parameter
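
The “minimum support” filter can be sketched as a simple size test on each cluster. The representation (a dict from feature set to document set) and the name `filter_by_min_support` are assumptions made for illustration; the cluster contents are the combined nodes from the construction example later in the deck.

```python
def filter_by_min_support(clusters, min_support):
    """Keep only clusters that contain at least min_support documents."""
    return {
        features: docs
        for features, docs in clusters.items()
        if len(docs) >= min_support
    }


CLUSTERS = {
    frozenset({"F1", "F2"}): {1, 2},
    frozenset({"F2", "F3"}): {10},
    frozenset({"F1", "F4"}): {2, 3},
    frozenset({"F1", "F2", "F4"}): {2},
    frozenset({"F1", "F4", "F5"}): {3},
    frozenset({"F4", "F5"}): {3, 20},
}

visible = filter_by_min_support(CLUSTERS, min_support=2)
print(sorted(sorted(features) for features in visible))
```

Note that this only hides clusters from display; the small clusters remain in the lattice, which matters because outliers may still belong to the complete answer.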

Page 8:

Universal Navigational Lattice (UNL)

[Lattice diagram with nodes: Wiki Infoboxes; F1: type=governor; type=president; type=judge; F1, F2: California office; F1, F3: governor office; F1, F2, F3]

Page 9:

User Interface

Page 10:

System Architecture

Components: Storage & Indexing (DB2 pureXML® + DB2 Text Search®), Query Processing, BackEnd (Java), FrontEnd (Flex)

Data: Wiki Infoboxes (heterogeneous, sparse data), stored as a Universal Table (logically) of Documents × Fields

Processing steps

– Input: keyword search query, e.g., California governor religion!

– Compute navigational dimensions on the fly: map query keywords to fields

– Universal Navigational Lattice (UNL) in-memory computation

– Enable faceted-like search UI to explore over the lattice

– Interactive selection of final answers

– Generate and publish the output feed

[Lattice diagram with nodes: Wiki Infoboxes; type=governor; type=president; type=judge; California office; governor office; California, governor office. Annotation: e.g., governor of California is also a former US president (see Ronald Reagan)]

The final feed is useful for

• further processing in community-based mashups

• data cube analytics

• many-eyes analytics and visualization

• help formulating a structural query

Community support by leveraging previously cleaned feeds

Page 11:

Summary: WikiAnalytics

Input: heterogeneous, sparse data

WikiAnalytics: dynamic interface that lets users navigate, identify and extract all the documents of interest

Output: structured data feeds, used for

– Mashups (e.g., Mashup Hub)

– Data analytics

– Formulate a structural query over the heterogeneous data

Future work:

– More heuristics for pruning the lattice

– Collaborative features

Page 12:

Thank you!

Questions ???

Page 13:

Wikipedia: The Human Factor

Same schema used differently

– E.g., populate “order” attribute vs. “office” attribute
  order=“16th Governor of California” for Washington Bartlett
  vs. order=“38th” office=“Governor of California” for Arnold Schwarzenegger

Same data in different schemas

– E.g., the governor’s data in the input of the president’s schema
  order2=“33rd Governor of California” for Ronald Reagan

Schema conflicts

– E.g., {{Infobox Officeholder/Personal data ...
  | birth_date = ...
  | date of birth = ...

No universal format for representing field values

Etc.

How to efficiently query such data and derive a complete set of answers?

Page 14:

Existing approaches

Use keyword ranked search + heuristics

– Queries are underspecified: hard to capture user’s intentions

Use query languages (SQL, XQuery/XPath) after data integration

– Strict, complex, expensive and hard to express

Faceted search

– Static dimensions, too restrictive

Data summarization with SEDA [CIDR’09]

– Nice system but too generic

Hard to identify the complete set of answers

Page 15:

Universal Framework (search space)

Key idea

– Cluster documents D based on the features F corresponding to query Q=(k1,…,kn)

Universal Navigational Lattice (UNL)

– Given D, Q and F, produce all possible groupings of documents by all sets of features

– Connect the groups by the subset relationship of documents

Universal computation framework to express

– Traditional faceted search, OLAP navigation

Page 16:

Computing Features F

Universal Table (logically, Documents × Fields) + Keywords → F = {F1, F2, …, Fn}

Input: keyword search query, e.g., California governor religion!

Focus on the set of documents containing all keywords

F = { type = Governor, “California” in office, office = “Governor of California”, “California” in order, “religion” in religion, “California” in born }

Page 17:

UNL Lattice Construction

Bottom-up construction

– Start with groups of documents that match single features Fi in F

– Consider groups of documents by all pairs of features

– Etc.

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {1,2,20}

Pairs: F1,F2: {1,2}; F1,F3: {1,2}; F2,F3: {1,2}

Triple: F1,F2,F3: {1,2}

Redundant to construct different navigational nodes (with different sets of features) for the same group of documents!

Solution: consolidate all the buckets with the same set of documents

Page 18:

UNL Lattice Construction

Bottom-up construction

– Start with groups of documents that match single features Fi in F

– Consider groups of documents by all pairs of features

– Etc.

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {1,2,20}

Consolidated node: F1,F2,F3: {1,2}

Redundant to construct different navigational nodes (different sets of features) for the same group of documents!

Solution: consolidate all the buckets with the same set of documents
• add edges
• merge features
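
The redundancy and its fix can be reproduced on the three-feature example from this slide. The sketch below assumes features are given as a dict from feature name to document set; `enumerate_groups` and `consolidate` are illustrative names, and the brute-force enumeration is exponential in the number of features, so it is only meant for small examples.

```python
from itertools import combinations


def enumerate_groups(feature_docs):
    """Naive bottom-up step: every feature subset with a non-empty document group."""
    groups = {}
    names = sorted(feature_docs)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            docs = frozenset.intersection(
                *(frozenset(feature_docs[name]) for name in subset)
            )
            if docs:
                groups[frozenset(subset)] = docs
    return groups


def consolidate(groups):
    """Merge all buckets that hold the same set of documents into one node."""
    merged = {}
    for features, docs in groups.items():
        merged[docs] = merged.get(docs, frozenset()) | features
    return merged


FEATURE_DOCS = {"F1": {1, 2, 3}, "F2": {1, 2, 10}, "F3": {1, 2, 20}}

groups = enumerate_groups(FEATURE_DOCS)
nodes = consolidate(groups)
print(len(groups), "groups before consolidation")
print(len(nodes), "nodes after consolidation")
```

Keying nodes on their document set is one simple way to implement "merge features": the four buckets that all hold documents {1,2} collapse into a single node labelled F1,F2,F3.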

Page 19:

Construction Example

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Lattice construction rules for a new node n

• n.D = ∅ ⇒ do not add n

• n.D already exists ⇒ consolidate with existing node

• otherwise, add n and its edges

Nodes so far: F1,F2: {1,2}; F1,F4: {2,3}; F1,F5: {3}

Page 20:

Construction Example

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Lattice construction rules for a new node n

• n.D = ∅ ⇒ do not add n

• n.D already exists ⇒ consolidate with existing node

• otherwise, add n and its edges

Nodes so far: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F2,F4: {2}; F1,F5: {3}

Page 21:

Construction Example

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Lattice construction rules for a new node n

• n.D = ∅ ⇒ do not add n

• n.D already exists ⇒ consolidate with existing node

• otherwise, add n and its edges

Nodes so far: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F2,F4: {2}; F1,F5: {3}; F4,F5: {3,20}

Page 22:

Construction Example

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Lattice construction rules for a new node n

• n.D = ∅ ⇒ do not add n

• n.D already exists ⇒ consolidate with existing node

• otherwise, add n and its edges

Nodes so far: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F1,F5: {3}; F4,F5: {3,20}; F1,F2,F4: {2}

Page 23:

Construction Example

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Lattice construction rules for a new node n

• n.D = ∅ ⇒ do not add n

• n.D already exists ⇒ consolidate with existing node

• otherwise, add n and its edges

Nodes so far: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F1,F2,F4: {2}; F1,F4,F5: {3}; F4,F5: {3,20}

Page 24:

Construction Example

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Lattice construction rules for a new node n

• n.D = ∅ ⇒ do not add n

• n.D already exists ⇒ consolidate with existing node

• otherwise, add n and its edges

• triangle rule

Nodes so far: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F1,F2,F4: {2}; F1,F4,F5: {3}; F4,F5: {3,20}

Page 25:

Construction Example

Documents stored in DB2: 1, 2, 3, 10, 15, 20, 30, 40

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Lattice construction rules for a new node n

• n.D = ∅ ⇒ do not add n

• n.D already exists ⇒ consolidate with existing node

• otherwise, add n and its edges

• triangle rule

Invariant: n1 → n2 ⇒ n1.D ⊇ n2.D and n1.F ⊆ n2.F

Final nodes: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F1,F2,F4: {2}; F1,F4,F5: {3}; F4,F5: {3,20}

Conceptually, UNL captures all possible ways to group documents based on where the query keywords hit in the corpus
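
The construction example can be replayed end to end with a short sketch. It is written under stated assumptions rather than taken from the slides: nodes are keyed by their document set (n.D) so that consolidation is a dictionary merge, feature subsets are enumerated by brute force, and edges are drawn only between a node and its nearest proper-subset nodes. The slides do not define the triangle rule, so that edge choice is an assumption of this sketch. The names `build_unl`, `build_edges` and `invariant_holds` are illustrative.

```python
from itertools import combinations


def build_unl(feature_docs):
    """Return {n.D: n.F} with one node per distinct non-empty document set."""
    nodes = {}
    names = sorted(feature_docs)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            docs = frozenset.intersection(
                *(frozenset(feature_docs[name]) for name in subset)
            )
            if not docs:
                continue  # rule: n.D is empty, do not add n
            # rule: n.D already exists, consolidate; otherwise this adds n
            nodes[docs] = nodes.get(docs, frozenset()) | frozenset(subset)
    return nodes


def build_edges(nodes):
    """Connect each node to its largest proper-subset nodes (assumed edge rule)."""
    edges = set()
    for parent in nodes:
        children = [docs for docs in nodes if docs < parent]
        for child in children:
            if not any(child < middle for middle in children):
                edges.add((parent, child))
    return edges


def invariant_holds(nodes, edges):
    """Check n1 -> n2 implies n1.D contains n2.D and n1.F is contained in n2.F."""
    return all(
        parent >= child and nodes[parent] <= nodes[child]
        for parent, child in edges
    )


FEATURE_DOCS = {
    "F1": {1, 2, 3},
    "F2": {1, 2, 10},
    "F3": {40, 15, 10},
    "F4": {20, 2, 3},
    "F5": {30, 20, 3},
}

nodes = build_unl(FEATURE_DOCS)
edges = build_edges(nodes)
for docs, features in sorted(nodes.items(), key=lambda item: (len(item[1]), sorted(item[1]))):
    print(",".join(sorted(features)), sorted(docs))
```

On this input the sketch ends with the five single-feature nodes plus the six combined nodes listed on the slide; for example F1,F5 and F1,F4,F5 both cover only document 3 and are consolidated into F1,F4,F5.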

Page 26:

Big Question: GUI

How to display the UNL to facilitate discovery of complete answers?

[Lattice diagram]

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Combined nodes: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F1,F2,F4: {2}; F1,F4,F5: {3}; F4,F5: {3,20}

UNL lends itself to a bottom-up visual representation:
• start from all root nodes traversing towards leaves

Page 27:

Big Question: GUI

How to display the UNL to facilitate discovery of complete answers?

[Lattice diagram]

Single features: F1: {1,2,3}; F2: {1,2,10}; F3: {40,15,10}; F4: {20,2,3}; F5: {30,20,3}

Combined nodes: F1,F2: {1,2}; F2,F3: {10}; F1,F4: {2,3}; F1,F2,F4: {2}; F1,F4,F5: {3}; F4,F5: {3,20}

Page 28:

Effective Visualization

Problem

– There are still lots of nodes to explore

Challenge

– Need to find both large cluster(s) of documents and the outliers to find the complete set of answers

Solution: Y-feature filtering

– Impose threshold Y on the features entering the lattice computation
  • Features representing more than Y documents

– To find the big chunks: choose Y bigger
  • Filter out the less representative features

– To find the smaller chunks: choose Y smaller
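
Y-feature filtering can be sketched as a pre-processing step that runs before any lattice computation. The feature names and document sets below are hypothetical and chosen only to show the effect of the threshold; `filter_features` is an illustrative name.

```python
def filter_features(feature_docs, y):
    """Keep only features that represent more than y documents."""
    return {
        feature: docs
        for feature, docs in feature_docs.items()
        if len(docs) > y
    }


# Hypothetical features with made-up document sets.
FEATURE_DOCS = {
    "type=Governor": {1, 2, 3, 4, 5},
    "California in office": {1, 2, 3, 4},
    "California in birth_place": {6},
}

big_chunks = filter_features(FEATURE_DOCS, y=3)
everything = filter_features(FEATURE_DOCS, y=0)
print(sorted(big_chunks))
print(sorted(everything))
```

Because the lattice is built from feature subsets, dropping a feature up front removes every node that would have involved it, so a larger Y shrinks the lattice much faster than hiding individual small clusters afterwards.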

Page 29:

Effective Visualization (2)

Complementary solution: Partition based

– Introduce some order among the computed groups by prioritizing a certain set of features to appear at the top levels of the lattice

Block1 Features (e.g., type = Governor): UNL1

Block2 Features (e.g., type = President): UNL2

Block3 Features (e.g., type = Judge): UNL3

Documents stored in DB2

Page 30:

Demo Scenario

Example 1:

Feed0 = California governor religion!

Example 2: Find the number of released jazz albums in the world per country

Generate the following feeds with WikiAnalytics

– Feed1 = jazz album artist! released! (columns: album, artist, release date)

– Feed2 = jazz artist origin! (columns: artist, place of origin)

Feed them to Mashup Hub
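
The mashup in Example 2 amounts to joining the two feeds on artist and counting albums per place of origin. The sketch below assumes each feed is a list of dicts with the column names shown; the rows are placeholders invented for illustration, not Wikipedia data, and the join is done in plain Python rather than in Mashup Hub.

```python
from collections import Counter


def albums_per_country(feed1, feed2):
    """Join albums to artist origins and count released albums per country."""
    origin_by_artist = {row["artist"]: row["origin"] for row in feed2}
    counts = Counter()
    for row in feed1:
        country = origin_by_artist.get(row["artist"])
        if country is not None:  # skip albums whose artist has no known origin
            counts[country] += 1
    return counts


# Placeholder rows (hypothetical values).
FEED1 = [
    {"album": "Album A", "artist": "Artist X", "released": "1959"},
    {"album": "Album B", "artist": "Artist X", "released": "1961"},
    {"album": "Album C", "artist": "Artist Y", "released": "1970"},
    {"album": "Album D", "artist": "Artist Z", "released": "1980"},
]
FEED2 = [
    {"artist": "Artist X", "origin": "Country 1"},
    {"artist": "Artist Y", "origin": "Country 2"},
]

result = albums_per_country(FEED1, FEED2)
print(dict(result))
```

The album by Artist Z is dropped because Feed2 has no origin for that artist, which shows why the completeness of each feed matters for the final counts.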

Page 31:

User Interface

