One Table Stores All: Enabling Painless Free-and-Easy Data Publishing and Sharing

Bei Yu (1), Guoliang Li (2), Beng Chin Ooi (1), Li-zhu Zhou (2)

(1) National University of Singapore
(2) Tsinghua University

Folksonomy (folk + taxonomy)

Examples
  Delicious    http://del.icio.us/
  Flickr       http://www.flickr.com/
  Google Base  http://base.google.com/
  YouTube      http://www.youtube.com/

An Internet-based information sharing methodology

Users collaboratively publish information resources (e.g., web pages, photos) using self-defined metadata

Users' collaborative behavior determines the data semantics

The system categorizes information resources based on the user-defined metadata to facilitate searching, browsing, etc.

Our Attempt

Devise a general system framework for supporting folksonomy-based data sharing
  Allows rich and flexible structure of the metadata (called data units) for describing published resources
  Categorize data units
  Efficiently store all data units
  Provide browsing and querying services

Data Units

Example 1
  Title: Uzzer's blog
  Fields:
    Homepage: http://uzzer.livejournal.com
    Author: uzzer
    Blog type: art-blog
    language: english accepted, russian
  Tags: art, blog, comments, design, fun, livejournal, photos, pictures, uzzer, web

Example 2
  Title: China's Internet services market reached 18 billion in 2005
  Fields:
    Author: Analysis International
    News Source: Analysis International
    Publish Date: 02/22/2006
  Tags: China, Internet services, News and Articles

The metadata for a resource, called a data unit, consists of a user-created title, a set of fields (attribute-value pairs), and a set of tags
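The structure of a data unit can be sketched as a small record type. This is only an illustration: the class name `DataUnit`, the method `attributes`, and the choice to lowercase attribute names are assumptions made here, not part of the presented system.

```python
from dataclasses import dataclass, field


@dataclass
class DataUnit:
    """A user-created title, free-form fields, and tags (hypothetical type)."""
    title: str
    fields: dict[str, str] = field(default_factory=dict)  # attribute -> value
    tags: set[str] = field(default_factory=set)

    def attributes(self) -> set[str]:
        """Attribute names, lowercased so that units can be compared."""
        return {name.lower() for name in self.fields}


# The two example data units from the slide.
blog = DataUnit(
    title="Uzzer's blog",
    fields={
        "Homepage": "http://uzzer.livejournal.com",
        "Author": "uzzer",
        "Blog type": "art-blog",
        "language": "english accepted, russian",
    },
    tags={"art", "blog", "comments", "design", "fun", "livejournal",
          "photos", "pictures", "uzzer", "web"},
)
news = DataUnit(
    title="China's Internet services market reached 18 billion in 2005",
    fields={
        "Author": "Analysis International",
        "News Source": "Analysis International",
        "Publish Date": "02/22/2006",
    },
    tags={"China", "Internet services", "News and Articles"},
)

# The two units have different, user-chosen schemas; only "author" is shared.
shared = blog.attributes() & news.attributes()
print(sorted(shared))
```

Because users choose their own attribute names, two data units generally overlap in only a few attributes, which is what makes a single fixed schema impractical.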

Data Model

A generic relational table stores all data units, e.g.

id | title | author | homepage | blog type | tags | language | publish date | news source
0 | Uzzer's blog | uzzer | http://uzzer.livejournal.com | art-blog | Art, blog, comments, design, fun, livejournal, photos, pictures, uzzer, web | english accepted, russian | null | null
1 | China's Internet services market reached 18 billion in 2005 | Analysis International | null | null | China, internet Services, News and Articles | null | 02/22/2006 | analysis international

A set of virtual relations (VRs), defined as views over the generic table, serves as the querying interface, e.g.

VR1 (id, title, author, homepage, blog type, tags, language)
VR2 (id, title, author, news source, publish date, tags)
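The generic table and its virtual relations can be tried out with SQLite views. This is a minimal sketch, not the system's actual storage: the table names `generic` and `vr_member`, the underscore column names, and the use of a membership table to decide which data units belong to which VR are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE generic (
    id INTEGER PRIMARY KEY,
    title TEXT, author TEXT, homepage TEXT, blog_type TEXT,
    tags TEXT, language TEXT, publish_date TEXT, news_source TEXT
);
-- Hypothetical mapping from each virtual relation to its data units.
CREATE TABLE vr_member (vr TEXT, id INTEGER REFERENCES generic(id));

-- Each VR is a view: a projection of the generic table onto the VR's
-- attributes, restricted to the data units assigned to that VR.
CREATE VIEW vr1 AS
    SELECT g.id, g.title, g.author, g.homepage, g.blog_type, g.tags, g.language
    FROM generic g JOIN vr_member m ON m.id = g.id
    WHERE m.vr = 'VR1';
CREATE VIEW vr2 AS
    SELECT g.id, g.title, g.author, g.news_source, g.publish_date, g.tags
    FROM generic g JOIN vr_member m ON m.id = g.id
    WHERE m.vr = 'VR2';
""")

# The two rows of the example table; None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO generic VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "Uzzer's blog", "uzzer", "http://uzzer.livejournal.com", "art-blog",
         "Art, blog, comments, design, fun, livejournal, photos, pictures, uzzer, web",
         "english accepted, russian", None, None),
        (1, "China's Internet services market reached 18 billion in 2005",
         "Analysis International", None, None,
         "China, internet Services, News and Articles",
         None, "02/22/2006", "analysis international"),
    ],
)
conn.executemany("INSERT INTO vr_member VALUES (?, ?)", [("VR1", 0), ("VR2", 1)])

# Queries are written against the views, never against the sparse table.
blog_rows = conn.execute("SELECT title, blog_type FROM vr1").fetchall()
news_rows = conn.execute("SELECT title, publish_date FROM vr2").fetchall()
print(blog_rows)
print(news_rows)
```

Each view exposes a dense, narrow schema, so a user querying VR1 never sees the null-valued news columns.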

System Framework

[Figure: system architecture. Components shown: Publish Interface, Browsing and Search Interface (receives queries), Data Units Categorizer, Multi-function Query Processor, Storage Manager, and Generic Table. Example virtual relations: VR1: products, VR2: recipes, VR3: blogs, VR4: patents, VR5: restaurants, VR6: travel.]

Data Units Categorizer

Constructs and maintains VRs dynamically, as data units are published constantly
  Clustering is based on attributes and tags
  VR ≡ cluster of data units with similar topics

Needs an on-line, one-pass clustering model
  Accept a data unit u and extract its attributes and tags
  Compare u with the existing VRs and assign it to those that match
  If no VR is suitable for u, create a new VR with u as its only tuple
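The one-pass procedure above can be sketched as follows. The slides do not say how a data unit is matched against a VR, so the Jaccard similarity, the threshold of 0.3, and the names `VirtualRelation`, `features_of`, and `assign` are assumptions chosen only to make the loop concrete.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class VirtualRelation:
    """A cluster of data units, described by the features seen so far."""

    def __init__(self):
        self.features = set()
        self.members = []


def features_of(attributes, tags) -> set:
    """Combine attributes and tags into one feature set, kept distinguishable."""
    return ({("attr", a.lower()) for a in attributes}
            | {("tag", t.lower()) for t in tags})


def assign(unit_id, features, vrs, threshold=0.3):
    """Add the unit to every matching VR, or create a new VR if none matches."""
    matches = [vr for vr in vrs if jaccard(features, vr.features) >= threshold]
    if not matches:
        new_vr = VirtualRelation()
        vrs.append(new_vr)
        matches = [new_vr]
    for vr in matches:
        vr.members.append(unit_id)
        vr.features |= features
    return matches


# Each data unit is seen exactly once, in publication order.
stream = [
    (0, ["Homepage", "Author", "Blog type", "language"], ["art", "blog", "photos"]),
    (1, ["Author", "News Source", "Publish Date"], ["China", "news"]),
    (2, ["Homepage", "Author", "Blog type"], ["blog", "travel"]),
]
vrs = []
for unit_id, attributes, tags in stream:
    assign(unit_id, features_of(attributes, tags), vrs)

print([vr.members for vr in vrs])  # [[0, 2], [1]]
```

In this sketch a VR's feature set grows with every member, so rare attributes and tags accumulate in it over time.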

Challenges for Categorizing

Uncontrolled vocabulary for both attributes and tags

A large portion of the attributes and tags is "noise": they occur very infrequently

The number of unique attributes and tags keeps growing

Problems with synonyms, polysemy, etc.

[Figure: Distribution of attribute frequencies. x-axis: attributes (tick labels 1 to 3654); y-axis: frequency (0 to 6000).]

[Figure: Distribution of tag frequencies. x-axis: tags (tick labels 1 to 27881); y-axis: frequency (0 to 1800).]

Our Current Approach

Characterize each VR with a set of popular attributes (PAS) and a set of popular tags (PTS), which represent its dominating features

Compare new data units with the PAS and PTS only, to limit the effect of "noise"

Maintain the PAS and PTS whenever a new data unit is assigned
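One way to picture PAS and PTS maintenance is with per-VR frequency counters. The slides do not define "popular" or the comparison function, so the support threshold `min_support`, the scoring formula in `similarity`, and all names below are assumptions made for this sketch.

```python
from collections import Counter


class VirtualRelation:
    """A VR that tracks how often each attribute and tag occurs in it."""

    def __init__(self, min_support=0.5):
        self.size = 0
        self.attr_counts = Counter()
        self.tag_counts = Counter()
        self.min_support = min_support  # assumed definition of "popular"

    def _popular(self, counts) -> set:
        """Items that occur in at least min_support of the VR's data units."""
        if self.size == 0:
            return set()
        return {x for x, c in counts.items() if c / self.size >= self.min_support}

    @property
    def pas(self) -> set:
        return self._popular(self.attr_counts)

    @property
    def pts(self) -> set:
        return self._popular(self.tag_counts)

    def add(self, attributes: set, tags: set) -> None:
        """Assign a data unit; updating the counts keeps PAS and PTS current."""
        self.size += 1
        self.attr_counts.update(attributes)
        self.tag_counts.update(tags)

    def similarity(self, attributes: set, tags: set) -> float:
        """Share of the VR's popular features that the data unit also has."""
        popular = len(self.pas) + len(self.pts)
        if popular == 0:
            return 0.0
        return (len(attributes & self.pas) + len(tags & self.pts)) / popular


vr = VirtualRelation()
vr.add({"homepage", "author", "blog type"}, {"blog", "art"})
vr.add({"homepage", "author", "language"}, {"blog", "photos"})
vr.add({"homepage", "author", "blog type", "mood"}, {"blog", "fun"})

# Rare items ("language", "mood", "art", ...) stay out of PAS and PTS.
print(sorted(vr.pas), sorted(vr.pts))

news_score = vr.similarity({"author", "news source", "publish date"}, {"china", "news"})
blog_score = vr.similarity({"homepage", "author"}, {"blog", "web"})
print(news_score, blog_score)  # 0.25 0.75
```

Since infrequent attributes and tags never enter the PAS or PTS, they cannot influence which VR a new data unit is assigned to.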

Storage Manager

Function
  Store and index the generic table (very sparse)
  Maintain mappings with VRs

Challenge
  Space efficiency
  Scalable over the number of attributes and data volume
  Efficient for both retrieval and update

Storage with Sparse Table

Store only the non-null values of each tuple

Build an inverted index over attributes for processing attribute-based queries

Build an inverted index over keywords for processing keyword queries

Other approaches? Bitmap index?

[Figure: sparse table storage layout. Data: each tuple (t1 to t4) is stored as a sequence of attribute-value pairs holding only its non-null attributes. Index: entries for attr1, attr2, attr3, attr4 point to the tuples that contain each attribute.]
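The storage idea on this slide can be sketched in a few lines: keep only non-null values per tuple, and maintain one inverted index over attributes and another over keywords. The class `SparseStore`, its method names, and the whitespace-based keyword tokenization are assumptions for illustration; the slides do not describe the actual implementation.

```python
from collections import defaultdict


class SparseStore:
    """Stores only non-null values and indexes them by attribute and keyword."""

    def __init__(self):
        self.rows = {}                       # tuple id -> {attribute: value}
        self.by_attr = defaultdict(set)      # attribute -> ids having a value
        self.by_keyword = defaultdict(set)   # keyword -> ids containing it

    def insert(self, tid, values: dict) -> None:
        row = {a: v for a, v in values.items() if v is not None}  # drop nulls
        self.rows[tid] = row
        for attr, value in row.items():
            self.by_attr[attr].add(tid)
            for word in str(value).lower().split():
                word = word.strip(",.")      # naive tokenization (assumption)
                if word:
                    self.by_keyword[word].add(tid)

    def with_attribute(self, attr) -> set:
        """Attribute-based query: ids of tuples with a non-null value for attr."""
        return set(self.by_attr.get(attr, ()))

    def with_keywords(self, *words) -> set:
        """Keyword query: ids of tuples whose values contain every word."""
        postings = [self.by_keyword.get(w.lower(), set()) for w in words]
        return set.intersection(*postings) if postings else set()


store = SparseStore()
store.insert(0, {
    "title": "Uzzer's blog", "author": "uzzer",
    "homepage": "http://uzzer.livejournal.com", "blog type": "art-blog",
    "tags": "art, blog, comments, design, fun", "publish date": None,
})
store.insert(1, {
    "title": "China's Internet services market reached 18 billion in 2005",
    "author": "Analysis International", "homepage": None,
    "tags": "China, Internet services, News and Articles",
    "publish date": "02/22/2006",
})

print(store.with_attribute("homepage"))           # {0}
print(store.with_attribute("author"))             # {0, 1}
print(store.with_keywords("china", "internet"))   # {1}
```

Storage cost here grows with the number of non-null values rather than with the number of distinct attributes, which is the point of the sparse layout.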

Browsing and Query Processing

The VRs are ordered by popularity for browsing
  May be presented in different views, e.g., based on attributes or based on tags

Supports both keyword queries and structured queries
  Inverted index
  Effective ranking

Conclusion

We have presented the design for a folksonomy-based data sharing system

We devise a generic table data model for representing and storing the data units

Future work: port the system to P2P networks

